Creating Binary Variables
When dealing with categorical data in regression analysis, it is often useful to convert the categories into numerical indicators, commonly referred to as dummy variables.
Today, we explore how to create dummy variables using the GSOEP9402 dataset from the AER package in R. Specifically, we will focus on distinguishing academic from vocational school types.
Why Dummy Variables?
Dummy variables act as numerical proxies for categorical data, allowing them to be used in regression models. Each category is represented by a binary variable (0 or 1), indicating the absence or presence of that category.
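To make the idea concrete, here is a minimal sketch using a hypothetical four-element vector (not taken from GSOEP9402). It uses base R's model.matrix() to expand a factor into one 0/1 column per category; the name school is chosen only for illustration.

```r
# Hypothetical toy data for illustration; not drawn from GSOEP9402
school <- factor(c("Gymnasium", "Hauptschule", "Realschule", "Gymnasium"))

# Expand the factor into one 0/1 indicator column per category
# ("- 1" drops the intercept so that every level gets its own column)
dummies <- model.matrix(~ school - 1)

print(colnames(dummies))
print(rowSums(dummies))  # each row has exactly one indicator switched on
```

In a regression that includes an intercept, one category is normally left out as the reference group, since keeping all of them would make the columns perfectly collinear.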
Working with GSOEP9402 Data
The GSOEP9402 dataset contains various attributes, including school, which classifies individuals by the type of school they attended. Let’s create dummy variables for academic and vocational schools using the ifelse function.
Step 1: Load the Data
First, make sure the AER package is installed, then load it to access the dataset:
library(AER)
data("GSOEP9402")
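Before recoding, it can help to look at how school is stored, because the labels passed to %in% later must match the category names exactly. The following sketch assumes the AER package is already installed; it only inspects the data and changes nothing.

```r
# Assumes the AER package is installed, e.g. via install.packages("AER")
library(AER)
data("GSOEP9402")

str(GSOEP9402$school)                     # how the variable is stored
print(levels(GSOEP9402$school))           # category labels to match against
print(table(GSOEP9402$school, useNA = "ifany"))  # counts per category, plus NAs if any
```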
Step 2: Create Dummy Variables
We will create two dummy variables, school_academic and school_vocational, to indicate whether an individual attended an academic school (Gymnasium) or a vocational school (Hauptschule or Realschule), respectively:
GSOEP9402$school_academic = ifelse(GSOEP9402$school %in% c("Gymnasium"), 1, 0)
GSOEP9402$school_vocational = ifelse(GSOEP9402$school %in% c("Hauptschule", "Realschule"), 1, 0)
In this code:
- ifelse checks whether each value in school matches one of the specified categories.
- If it matches, it assigns a 1; otherwise, a 0.
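One detail worth knowing is how missing values are treated. The sketch below uses a hypothetical toy vector containing one NA to compare %in% with ==. The NA is an assumption made for illustration; it says nothing about whether GSOEP9402$school actually contains missing values.

```r
# Hypothetical toy vector standing in for GSOEP9402$school
school <- factor(c("Gymnasium", "Hauptschule", "Realschule", "Gymnasium", NA))

# %in% returns FALSE for NA, so a missing value is coded as 0
dummy_in <- ifelse(school %in% c("Gymnasium"), 1, 0)

# == returns NA for NA, so a missing value stays missing
dummy_eq <- ifelse(school == "Gymnasium", 1, 0)

print(dummy_in)
print(dummy_eq)
```

Which behaviour is preferable depends on the analysis. Keeping the NA is usually the safer choice when an unknown school type should not be counted as "not Gymnasium".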
Step 3: Verifying Your Dummy Variables
To check that the dummy variables were created correctly, use the table() function to cross-tabulate each new variable against the original school variable:
table(GSOEP9402$school_academic, GSOEP9402$school)
table(GSOEP9402$school_vocational, GSOEP9402$school)
These tables help verify that the assignment accurately reflects the data:
- The school_academic table should show 1 for the Gymnasium category and 0 elsewhere.
- The school_vocational table should show 1 for Hauptschule and Realschule, and 0 for Gymnasium.
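The same checks can also be expressed as logical tests instead of being read off a table. This is a minimal sketch on a hypothetical toy data frame named toy; the same expressions should carry over to GSOEP9402, assuming school has only the three categories named above and no missing values.

```r
# Hypothetical toy data frame standing in for GSOEP9402
toy <- data.frame(
  school = factor(c("Gymnasium", "Hauptschule", "Realschule", "Gymnasium"))
)
toy$school_academic   <- ifelse(toy$school %in% c("Gymnasium"), 1, 0)
toy$school_vocational <- ifelse(toy$school %in% c("Hauptschule", "Realschule"), 1, 0)

# Check 1: the academic dummy is 1 exactly when school is "Gymnasium"
check_academic <- all((toy$school == "Gymnasium") == (toy$school_academic == 1))

# Check 2: every row falls into exactly one of the two groups
check_partition <- all(toy$school_academic + toy$school_vocational == 1)

print(c(academic = check_academic, partition = check_partition))
```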
Additional Example: Income Categories
Let’s extend the example by categorizing individuals as “rich” or “poor” based on the income variable in GSOEP9402. This example demonstrates how to use the ifelse function to recode a continuous numerical variable into a categorical one.
Step 1: Load the Dataset
Make sure the AER package is loaded so that the GSOEP9402 dataset is available:
library(AER)
data("GSOEP9402")
Step 2: Define Income Categories
We’ll categorize individuals as “poor” if their income is below the median and “rich” otherwise. First, examine the income distribution:
summary(GSOEP9402$income)
Based on our data, we use 66555 as the median income for demonstration purposes. Now, let’s use ifelse to recode income into income_rec:
GSOEP9402$income_rec = ifelse(GSOEP9402$income < 66555, "poor", "rich")
Understanding the Code:
- The ifelse function evaluates whether each income value is less than 66555.
- It assigns the string “poor” to incomes below this threshold and “rich” otherwise.
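Instead of typing the threshold by hand, it can be computed from the data with median(). The sketch below uses a hypothetical income vector (the values are invented so that the median happens to be 66555) and the illustrative names income and threshold.

```r
# Hypothetical incomes for illustration; not taken from GSOEP9402
income <- c(20000, 45000, 66555, 80000, 120000)

# Compute the threshold from the data rather than typing it in
threshold <- median(income, na.rm = TRUE)

# Recode: strictly below the median is "poor", everything else is "rich"
income_rec <- ifelse(income < threshold, "poor", "rich")

print(threshold)
print(table(income_rec))
```

Because the comparison is strict (<), an income exactly equal to the threshold is labelled "rich".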
Step 3: Verification
To ensure the recoding reflects the data accurately, you can look at the distribution of the newly created income categories:
table(GSOEP9402$income_rec)
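The frequency table shows how many observations fall into each group, but not whether the cut was made in the right place. One possible additional check, sketched here on the same kind of hypothetical income vector, compares the largest "poor" income and the smallest "rich" income with the threshold.

```r
# Hypothetical incomes for illustration; not taken from GSOEP9402
income <- c(20000, 45000, 66555, 80000, 120000)
threshold <- 66555
income_rec <- ifelse(income < threshold, "poor", "rich")

# Largest income labelled "poor" and smallest income labelled "rich"
max_poor <- max(income[income_rec == "poor"])
min_rich <- min(income[income_rec == "rich"])

print(c(max_poor = max_poor, min_rich = min_rich))

# The recoding is consistent if the two groups meet at the threshold without overlap
print(max_poor < threshold && min_rich >= threshold)
```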
Use in Regression
We can now use income_rec as a predictor in an OLS regression:
lm(size ~ income_rec, GSOEP9402)
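When a character variable such as income_rec appears in a formula, lm() treats it as a factor and builds the dummy variable itself, using the alphabetically first category ("poor") as the reference group. The sketch below illustrates this on a hypothetical six-row data frame named toy; the numbers are invented and are not results from GSOEP9402.

```r
# Hypothetical toy data for illustration; values are invented
toy <- data.frame(
  size       = c(2, 3, 4, 5, 3, 6),
  income_rec = c("poor", "poor", "poor", "rich", "rich", "rich")
)

# lm() converts the character predictor to a factor and creates one dummy
# for "rich"; "poor" is absorbed into the intercept
fit <- lm(size ~ income_rec, data = toy)
print(coef(fit))

# The intercept equals the mean of the reference group, and the slope
# equals the difference between the two group means
group_means <- tapply(toy$size, toy$income_rec, mean)
print(group_means)
```

With a single binary predictor, the coefficient labelled income_recrich is the difference in mean size between the "rich" and "poor" groups.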