Creating Binary Variables

Regression models require numeric inputs, so categorical variables are usually converted into binary indicators, commonly referred to as dummy variables, before they enter the model.

Today, we explore how to create dummy variables using the GSOEP9402 dataset from the AER package in R. Specifically, we will focus on differentiating academic and vocational school types.

Why Dummy Variables?

Dummy variables act as numerical proxies for categorical data, allowing them to be used in regression models. Each category is represented by a binary variable (0 or 1), indicating the absence or presence of that category.
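As a small illustration of this encoding, the sketch below uses base R's model.matrix() to show how a three-level factor expands into 0/1 columns. The data frame toy and its values are made up for illustration and are not taken from GSOEP9402.

```r
# Toy data: made-up values, not taken from GSOEP9402
toy <- data.frame(
  school = factor(c("Hauptschule", "Realschule", "Gymnasium", "Gymnasium"),
                  levels = c("Hauptschule", "Realschule", "Gymnasium"))
)

# model.matrix() builds the design matrix a regression would use:
# an intercept plus one 0/1 column per non-reference level
dummies <- model.matrix(~ school, data = toy)
print(dummies)
```

With R's default treatment contrasts, the first level (here Hauptschule) serves as the reference category and gets no column of its own.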

Working with GSOEP9402 Data

The GSOEP9402 dataset contains several variables, including school, a factor that records the type of secondary school (Hauptschule, Realschule, or Gymnasium). Let’s create dummy variables for academic and vocational schools using the ifelse function.

Step 1: Load the Data

First, make sure the AER package is installed (run install.packages("AER") once if it is not), then load it to access the dataset:

library(AER)
data("GSOEP9402")

Step 2: Create Dummy Variables

We will create two dummy variables, school_academic and school_vocational, to indicate whether an individual attended an academic school (Gymnasium) or a vocational school (Hauptschule or Realschule), respectively:

GSOEP9402$school_academic <- ifelse(GSOEP9402$school %in% "Gymnasium", 1, 0)
GSOEP9402$school_vocational <- ifelse(GSOEP9402$school %in% c("Hauptschule", "Realschule"), 1, 0)

In this code:

  • %in% tests whether each value in school matches one of the specified categories.
  • ifelse then assigns a 1 where the test is TRUE and a 0 otherwise.
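One detail of this approach is how missing values are treated. The sketch below uses a made-up vector x (not part of GSOEP9402) to compare %in% with ==. Because %in% never returns NA, a missing school type would be coded 0, whereas == propagates the NA.

```r
# Made-up vector with one missing value, for illustration only
x <- c("Gymnasium", "Realschule", NA)

# %in% returns FALSE for NA, so the missing value becomes 0
dummy_in <- ifelse(x %in% "Gymnasium", 1, 0)

# == returns NA for NA, so the missing value stays missing
dummy_eq <- ifelse(x == "Gymnasium", 1, 0)

print(dummy_in)
print(dummy_eq)
```

Which behaviour is preferable depends on whether observations with an unknown category should count as "not in the group" or remain missing.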

Step 3: Verifying Your Dummy Variables

To ensure the dummy variables were created correctly, use the table() function to cross-tabulate the newly created variables with the original school variable:

table(GSOEP9402$school_academic, GSOEP9402$school)
table(GSOEP9402$school_vocational, GSOEP9402$school)

These tables help verify the assignment. Because table() reports counts, check where the nonzero counts fall:

  • In the school_academic table, all Gymnasium observations should appear in row 1, and all Hauptschule and Realschule observations in row 0.
  • In the school_vocational table, all Hauptschule and Realschule observations should appear in row 1, and all Gymnasium observations in row 0.
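The same check can be automated. The following sketch, run on a made-up data frame toy rather than the real data, asserts that each row is flagged by exactly one of the two dummies. This should hold whenever Hauptschule, Realschule and Gymnasium are the only school types present.

```r
# Toy stand-in for GSOEP9402 (made-up values)
toy <- data.frame(
  school = c("Hauptschule", "Realschule", "Gymnasium", "Realschule")
)
toy$school_academic   <- ifelse(toy$school %in% "Gymnasium", 1, 0)
toy$school_vocational <- ifelse(toy$school %in% c("Hauptschule", "Realschule"), 1, 0)

# Every row should belong to exactly one group
row_totals <- toy$school_academic + toy$school_vocational
stopifnot(all(row_totals == 1))

# The dummy should agree with a direct comparison on the original column
stopifnot(all((toy$school == "Gymnasium") == (toy$school_academic == 1)))
```

stopifnot() stays silent when the conditions hold and stops with an error otherwise, so it can sit directly in a data-preparation script.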

Additional Example: Income Categories

As a second example, we classify observations as “rich” or “poor” based on the income variable in GSOEP9402. This shows how the ifelse function can recode a continuous numeric variable into a categorical one.

Step 1: Load the Dataset

If you are starting a new R session, load the AER package again to access the GSOEP9402 dataset:

library(AER)
data("GSOEP9402")

Step 2: Define Income Categories

We’ll categorize individuals as “poor” if their income is below the median and “rich” otherwise. First, examine the income distribution:

summary(GSOEP9402$income)

For this demonstration, we use 66555 as the median income; compare this value with the Median reported by summary() before relying on it. Now, let’s use ifelse to recode income into income_rec.

GSOEP9402$income_rec <- ifelse(GSOEP9402$income < 66555, "poor", "rich")

Understanding the code:

  • The ifelse function evaluates whether each income value is less than 66555.
  • It assigns the string “poor” to incomes below this threshold and “rich” otherwise.
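Instead of typing the threshold by hand, it can be computed from the data. This is a minimal sketch on a made-up data frame toy with invented income values. The variable name cutoff is an assumption and not part of the original code.

```r
# Toy data with invented income values (not GSOEP9402)
toy <- data.frame(income = c(20000, 45000, 66000, 80000, 120000))

# Compute the threshold from the data rather than hard-coding it
cutoff <- median(toy$income, na.rm = TRUE)

# Recode, then store as a factor with an explicit level order
toy$income_rec <- factor(
  ifelse(toy$income < cutoff, "poor", "rich"),
  levels = c("poor", "rich")
)

table(toy$income_rec)
```

Because the comparison is strict, an observation exactly at the median is labelled “rich”. Storing the result as a factor with explicit levels fixes the category order for later tables and models.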

Step 3: Verification

To ensure the recoding reflects the data accurately, you can look at the distribution of the newly created income categories:

table(GSOEP9402$income_rec)

Use in Regression

We can now use income_rec as a predictor in an OLS regression, here with size as the dependent variable:

lm(size ~ income_rec, data = GSOEP9402)
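To see how R handles the recoded variable inside lm(), the sketch below fits the same kind of model on a made-up data frame toy (invented values, not GSOEP9402). R converts the character column to a factor internally, and the alphabetically first level, here poor, becomes the reference category.

```r
# Toy data with invented values, for illustration only
toy <- data.frame(
  size       = c(3, 4, 5, 4, 6, 3),
  income_rec = c("poor", "poor", "poor", "rich", "rich", "rich")
)

fit <- lm(size ~ income_rec, data = toy)

# The intercept is the mean of size in the reference group ("poor");
# income_recrich is the difference in means between "rich" and "poor"
coef(fit)
```

With a single binary predictor, the regression reproduces the two group means, which gives a quick way to sanity-check the fitted coefficients.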