Recoding Variable Categories in R
library(AER)
data("GSOEP9402")
Recoding categorical variables is one of the most common data-cleaning tasks in quantitative research.
This tutorial shows how to recode categories using base R, ifelse(), case_when(), and case_match().
We use the GSOEP9402 dataset from the AER package for illustration.
Why Recode Categories?
Real-world categorical variables are not always in the form you need:
- You may want to rename categories
- Combine several categories into one
- Create binary variables for regression
- Transform long labels into simpler ones
R provides several easy ways to do this, from base R to tidyverse tools.
Example Dataset
We will work with the variable school, which contains educational track categories such as Gymnasium, Realschule, and Hauptschule.
1. Recoding with Base R
Base R allows direct replacement using logical indexing. This is the most transparent and beginner-friendly approach.
Replace one category with another
table(GSOEP9402$school)
Hauptschule Realschule Gymnasium
199 199 277
is.factor(GSOEP9402$school)
[1] TRUE
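Because GSOEP9402$school is a factor, assigning a label that is not one of its existing levels would produce NA (with a warning). To recode with logical indexing, we first convert the variable to character.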
GSOEP9402$school_char <- as.character(GSOEP9402$school)
GSOEP9402$school_char[GSOEP9402$school_char == "Gymnasium"] <- "Academic"
# check
table(GSOEP9402$school_char)
Academic Hauptschule Realschule
277 199 199
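To see why the conversion to character matters, the following sketch uses a small made-up factor in place of GSOEP9402$school (the four values are illustrative only). It shows what happens when a new label is assigned directly to a factor, and one possible alternative that renames the level instead.

```r
# Toy factor standing in for GSOEP9402$school (values are illustrative only)
school <- factor(c("Gymnasium", "Realschule", "Hauptschule", "Gymnasium"))

# Assigning a label that is not an existing level gives NA plus a warning
broken <- school
broken[broken == "Gymnasium"] <- "Academic"   # warning: invalid factor level
sum(is.na(broken))                            # the two Gymnasium entries are now NA

# Alternative that keeps the factor: rename the level itself
renamed <- school
levels(renamed)[levels(renamed) == "Gymnasium"] <- "Academic"
table(renamed)
```

Renaming the level keeps the variable a factor, while the as.character() route used in this tutorial produces a plain character vector. Either can work; which one is preferable depends on what the later analysis expects.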
Replace multiple categories
GSOEP9402$school_char[GSOEP9402$school_char %in% c("Hauptschule", "Realschule")] <- "Vocational"
Explanation
- GSOEP9402$school_char == "Gymnasium" creates a TRUE/FALSE vector
- Selecting the TRUE positions and assigning "Academic" replaces only those values
- %in% checks membership in multiple categories
Verify the recode
table(GSOEP9402$school_char)
Academic Vocational
277 398
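The logical-indexing steps explained above can be inspected one at a time. This is a minimal sketch on a made-up character vector (the names x, is_gym, is_voc, and na_demo are hypothetical), and it also shows how == and %in% differ when a value is missing.

```r
# Toy character vector; the values are made up for illustration
x <- c("Gymnasium", "Realschule", "Hauptschule", "Gymnasium")

is_gym <- x == "Gymnasium"                       # one TRUE/FALSE per element
is_voc <- x %in% c("Hauptschule", "Realschule")  # TRUE if the value is in the set

x[is_gym] <- "Academic"     # only the TRUE positions are overwritten
x[is_voc] <- "Vocational"
x

# With missing values, == returns NA while %in% returns FALSE
na_demo <- c(NA, "Gymnasium")
eq_result <- na_demo == "Gymnasium"
in_result <- na_demo %in% "Gymnasium"
```

Storing the logical vector in its own variable first makes it easy to check with table() or sum() how many observations a rule will touch before anything is overwritten.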
2. Recoding with ifelse()
ifelse() is useful for creating binary (dummy) variables.
Academic vs. all other school types
GSOEP9402$school_academic <-
  ifelse(GSOEP9402$school == "Gymnasium", 1, 0)
Vocational schools (Realschule + Hauptschule)
GSOEP9402$school_vocational <-
  ifelse(GSOEP9402$school %in% c("Hauptschule", "Realschule"), 1, 0)
Check results
table(GSOEP9402$school_academic, GSOEP9402$school)
    Hauptschule Realschule Gymnasium
  0         199        199         0
  1           0          0       277
table(GSOEP9402$school_vocational, GSOEP9402$school)
    Hauptschule Realschule Gymnasium
  0           0          0       277
  1         199        199         0
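These tables confirm the recode: every Gymnasium observation has school_academic equal to 1 and school_vocational equal to 0, and the reverse holds for Hauptschule and Realschule.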
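One detail worth checking when building dummies with ifelse() is how missing values behave. The sketch below uses a made-up factor containing an NA (the school variable in GSOEP9402 may not contain any) and compares three ways of building the same dummy.

```r
# Toy factor with one missing value (illustrative, not taken from GSOEP9402)
school <- factor(c("Gymnasium", "Realschule", NA, "Hauptschule"))

# ifelse() with == keeps the NA: the dummy is missing where school is missing
dummy_ifelse <- ifelse(school == "Gymnasium", 1, 0)

# A comparison converted to integer gives the same 0/1 coding, also keeping NA
dummy_int <- as.integer(school == "Gymnasium")

# %in% never returns NA, so a missing school is coded 0 here
dummy_in <- ifelse(school %in% "Gymnasium", 1, 0)
```

Whether a missing category should become NA or 0 is an analytical decision, so it helps to pick the comparison operator deliberately rather than by habit.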
3. Recoding with dplyr::case_when()
case_when() is part of the tidyverse and provides a clean, readable syntax when you need multiple recoding rules.
library(dplyr)
GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_recode = case_when(
      school == "Gymnasium" ~ "Academic",
      school %in% c("Hauptschule", "Realschule") ~ "Vocational",
      TRUE ~ "Other"
    )
  )
Benefits
- Great readability
- Multiple conditions handled elegantly
- No need for nested ifelse()
Check it
table(GSOEP9402$school_recode)
Academic Vocational
277 398
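To illustrate the point about nested ifelse(), here is a small comparison on a made-up vector. The value "Berufsschule" is hypothetical and does not occur in GSOEP9402; it is included only so that the "Other" branch is actually used.

```r
library(dplyr)

# Toy vector; "Berufsschule" is a hypothetical extra category
school <- c("Gymnasium", "Realschule", "Hauptschule", "Berufsschule")

# Nested ifelse(): each additional rule adds another level of nesting
nested <- ifelse(school == "Gymnasium", "Academic",
                 ifelse(school %in% c("Hauptschule", "Realschule"),
                        "Vocational", "Other"))

# case_when(): one rule per line, evaluated top to bottom
flat <- case_when(
  school == "Gymnasium" ~ "Academic",
  school %in% c("Hauptschule", "Realschule") ~ "Vocational",
  TRUE ~ "Other"
)

identical(nested, flat)
```

Both versions give the same result here. The case_when() form tends to stay readable as more rules are added.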
4. Recoding with case_match()
case_match() is a dplyr function (introduced in dplyr 1.1.0) for mapping values directly to new values. It is simpler than case_when() when you only need to match values rather than evaluate logical conditions.
GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_clean = case_match(
      school,
      "Gymnasium" ~ "Academic",
      "Realschule" ~ "Vocational",
      "Hauptschule" ~ "Vocational",
      .default = "Other"
    )
  )
Verification
table(GSOEP9402$school_clean, GSOEP9402$school)
             Hauptschule Realschule Gymnasium
  Academic             0          0       277
  Vocational         199        199         0
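The same kind of direct mapping can also be written in base R with a named lookup vector. This is a sketch of one possible approach, not a replacement for case_match(); the names lookup and school_clean are used for illustration, and "Gesamtschule" is a hypothetical unmatched category.

```r
# Toy factor; "Gesamtschule" is a hypothetical value with no mapping
school <- factor(c("Gymnasium", "Realschule", "Hauptschule", "Gesamtschule"))

# Names are the old categories, values are the new ones
lookup <- c(Gymnasium = "Academic",
            Realschule = "Vocational",
            Hauptschule = "Vocational")

# Index by name; as.character() matters because indexing with a factor
# would use its integer codes instead of its labels
school_clean <- unname(lookup[as.character(school)])

# Unmatched categories come back as NA; give them a default label
school_clean[is.na(school_clean)] <- "Other"
school_clean
```

A lookup vector keeps the whole mapping in one object, which can be convenient when the same recode has to be applied to several variables.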
5. Another Example: Income Groups
Here we recode a continuous numeric variable into two categories: “poor” and “rich”.
Check the distribution
summary(GSOEP9402$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1248   49229   66555   71311   86646  258341
The median is 66555, so we use it as the cutoff: incomes below it are coded "poor" and all others "rich".
Recode using ifelse()
GSOEP9402$income_group <-
  ifelse(GSOEP9402$income < 66555, "poor", "rich")
Verify
table(GSOEP9402$income_group)
poor rich
337 338
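Instead of typing the cutoff by hand, it can be computed from the data. The sketch below uses a made-up income vector (not values from GSOEP9402) and also shows cut(), a base R function for splitting a numeric variable into more than two groups. The band limits 50000 and 100000 are arbitrary choices for illustration.

```r
# Made-up incomes for illustration only
income <- c(12000, 48000, 66000, 67000, 90000, 250000)

# Compute the cutoff instead of hard-coding it
cutoff <- median(income)   # 66500 for this toy vector
income_group <- ifelse(income < cutoff, "poor", "rich")

# More than two groups: cut() with explicit breaks and labels.
# right = FALSE makes each interval include its lower bound.
income_band <- cut(income,
                   breaks = c(-Inf, 50000, 100000, Inf),
                   labels = c("low", "middle", "high"),
                   right = FALSE)
table(income_band)
```

Computing the cutoff with median() keeps the recode correct if the data are later filtered or updated.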
Use in regression
lm(size ~ income_group, data = GSOEP9402)
Call:
lm(formula = size ~ income_group, data = GSOEP9402)
Coefficients:
(Intercept) income_grouprich
3.9881 0.5415
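In the output above, the coefficient is labelled income_grouprich because lm() converts a character predictor to a factor and uses the alphabetically first level, here "poor", as the reference category. The following sketch, on a small made-up data frame (the numbers are not from GSOEP9402), shows how the reference category could be changed with relevel().

```r
# Made-up data for illustration only
d <- data.frame(
  size = c(3, 4, 4, 5, 4, 5, 6, 3),
  income_group = c("poor", "poor", "poor", "poor",
                   "rich", "rich", "rich", "rich")
)

# Default: "poor" is the reference, so the slope compares rich with poor
fit_default <- lm(size ~ income_group, data = d)
names(coef(fit_default))

# Make "rich" the reference category instead
d$income_group_f <- relevel(factor(d$income_group), ref = "rich")
fit_releveled <- lm(size ~ income_group_f, data = d)
names(coef(fit_releveled))
```

With a single binary predictor, the intercept is the mean of the reference group and the slope is the difference between the two group means, so changing the reference flips the sign of the slope.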
Summary
Base R
- Best for simple renaming and replacements
df$x[df$x == "hello"] <- "hola"
ifelse()
- Ideal for binary/dummy variables
case_when()
- Best for multi-rule recoding
case_match()
- Perfect for simple category-to-category mapping
Recoding is a fundamental step in cleaning and preparing data for analysis. Once mastered, it makes your workflow clearer, faster, and easier to extend.
Exercises: Recoding Categories in R
Below are short, practical exercises for students learning how to recode categorical variables using base R, ifelse(), case_when(), and case_match().
Each exercise includes:
- A clear task
- Hints where needed
- A complete solution
All examples use the dataset:
library(AER)
data("GSOEP9402")
Exercise 1 — Recode School Categories (Base R)
Task
Using base R only, recode the variable school:
- "Gymnasium" → "Academic"
- "Hauptschule" and "Realschule" → "Vocational"
- Leave all other values unchanged
Call the new variable: school_rec_base.
Solution
# If the original variable is a factor, then you must transform to character before using this recoding method
GSOEP9402$school_rec_base <- as.character(GSOEP9402$school)
GSOEP9402$school_rec_base[GSOEP9402$school == "Gymnasium"] <- "Academic"
GSOEP9402$school_rec_base[
  GSOEP9402$school %in% c("Hauptschule", "Realschule")
] <- "Vocational"
Check
table(GSOEP9402$school_rec_base)
Academic Vocational
277 398
Exercise 2 — Create a Binary Variable with ifelse()
Task
Create a dummy variable:
- school_acad_dummy = 1 if school is "Gymnasium"
- 0 otherwise
Solution
GSOEP9402$school_acad_dummy <- ifelse(GSOEP9402$school == "Gymnasium", 1, 0)
Check
table(GSOEP9402$school_acad_dummy, GSOEP9402$school)
    Hauptschule Realschule Gymnasium
  0         199        199         0
  1           0          0       277
Exercise 3 — Recode Income Groups Using ifelse()
Task
Using the numeric variable income:
- Create a new variable income_group
- "low" if income is below 50,000
- "high" otherwise
Solution
GSOEP9402$income_group <-
  ifelse(GSOEP9402$income < 50000, "low", "high")
Check
table(GSOEP9402$income_group)
high low
500 175
Exercise 4 — Use case_when() to Recode School Tracks
Task
Using dplyr::case_when() create school_track:
- "Academic" if school == "Gymnasium"
- "Vocational" if "Hauptschule" or "Realschule"
- "Other" for all remaining cases
Solution
library(dplyr)
GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_track = case_when(
      school == "Gymnasium" ~ "Academic",
      school %in% c("Hauptschule", "Realschule") ~ "Vocational",
      TRUE ~ "Other"
    )
  )
Check
table(GSOEP9402$school_track)
Academic Vocational
277 398
Exercise 5 — Use case_match() for Direct Mapping
Task
Recode school categories again using case_match():
- "Gymnasium" to "A"
- "Realschule" to "V"
- "Hauptschule" to "V"
- All others to "O"
Name the variable school_abbrev.
Solution
GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_abbrev = case_match(
      school,
      "Gymnasium" ~ "A",
      "Realschule" ~ "V",
      "Hauptschule" ~ "V",
      .default = "O"
    )
  )
Check
table(GSOEP9402$school_abbrev, GSOEP9402$school)
    Hauptschule Realschule Gymnasium
  A           0          0       277
  V         199        199         0
Exercise 6 — Replace Categories Using Base R Only
Task
Use base R to replace:
- "married" to "M"
- "single" to "S"
- "divorced" to "D"
in the variable marital, creating marital_short.
(Use logical indexing only.)
Solution
GSOEP9402$marital_short <- as.character(GSOEP9402$marital)
GSOEP9402$marital_short[GSOEP9402$marital == "married"] <- "M"
GSOEP9402$marital_short[GSOEP9402$marital == "single"] <- "S"
GSOEP9402$marital_short[GSOEP9402$marital == "divorced"] <- "D"
Check
table(GSOEP9402$marital_short)
D M S separated widowed
59 566 11 29 10