Recoding Variable Categories in R

Recoding categorical variables is one of the most common data-cleaning tasks in quantitative research.
This tutorial shows how to recode categories using base R, ifelse(), case_when(), and case_match().
We use the GSOEP9402 dataset from the AER package for illustration.

Why Recode Categories?

Real-world categorical variables are not always in the form you need:

  • You may want to rename categories
  • You may need to combine several categories into one
  • You may need binary variables for regression
  • You may want to shorten long labels

R provides several easy ways to do this, from base R to tidyverse tools.


Example Dataset

library(AER)
data("GSOEP9402")

We will work with the variable school, which contains educational track categories such as Gymnasium, Realschule, and Hauptschule.


1. Recoding with Base R

Base R allows direct replacement using logical indexing. This is the most transparent and beginner-friendly approach.

Replace one category with another

table(GSOEP9402$school)

Hauptschule  Realschule   Gymnasium 
        199         199         277 
is.factor(GSOEP9402$school)
[1] TRUE

Because GSOEP9402$school is a factor, we first convert it to character. Assigning a value that is not one of a factor's existing levels would produce NA (with a warning) instead of the new label.

GSOEP9402$school_char <- as.character(GSOEP9402$school)
GSOEP9402$school_char[GSOEP9402$school_char == "Gymnasium"] <- "Academic"

# check
table(GSOEP9402$school_char)

   Academic Hauptschule  Realschule 
        277         199         199 
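
To see why the conversion to character matters, the sketch below uses a small made-up factor (not the GSOEP9402 data) and compares direct assignment with an alternative that renames the factor level instead. The object names are illustrative assumptions.

```r
# Toy factor standing in for GSOEP9402$school (values are illustrative)
school <- factor(c("Gymnasium", "Realschule", "Hauptschule", "Gymnasium"))

# Assigning a label that is not an existing level produces NA plus a warning
broken <- school
suppressWarnings(broken[broken == "Gymnasium"] <- "Academic")
print(broken)

# Alternative: rename the level itself and keep the variable a factor
relabelled <- school
levels(relabelled)[levels(relabelled) == "Gymnasium"] <- "Academic"
print(relabelled)
```

Renaming the level keeps the variable a factor, which can be convenient if later steps expect one. Converting to character, as in the tutorial, is the simpler route when you plan to replace several values.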

Replace multiple categories

GSOEP9402$school_char[GSOEP9402$school_char %in% c("Hauptschule", "Realschule")] <- "Vocational"

Explanation

  • GSOEP9402$school_char == "Gymnasium" creates a TRUE/FALSE vector
  • Selecting the TRUE positions and assigning "Academic" replaces only those values
  • %in% checks membership in multiple categories
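
The three points above can be traced step by step on a short character vector. This is a minimal sketch with made-up values; the names x, is_gym, is_lower, and recoded are assumptions chosen for illustration.

```r
# Small character vector with made-up values
x <- c("Gymnasium", "Realschule", "Hauptschule", "Gymnasium")

# Step 1: the comparison returns one TRUE/FALSE per element
is_gym <- x == "Gymnasium"
print(is_gym)

# Step 2: %in% tests membership in a set of categories
is_lower <- x %in% c("Hauptschule", "Realschule")
print(is_lower)

# Step 3: assignment touches only the positions where the index is TRUE
recoded <- x
recoded[is_gym] <- "Academic"
recoded[is_lower] <- "Vocational"
print(recoded)
```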

Verify the recode

table(GSOEP9402$school_char)

  Academic Vocational 
       277        398 

2. Recoding with ifelse()

ifelse() is useful for creating binary (dummy) variables.

Academic vs. all other school types

GSOEP9402$school_academic <-
  ifelse(GSOEP9402$school == "Gymnasium", 1, 0)

Vocational schools (Realschule + Hauptschule)

GSOEP9402$school_vocational <-
  ifelse(GSOEP9402$school %in% c("Hauptschule", "Realschule"), 1, 0)

Check results

table(GSOEP9402$school_academic, GSOEP9402$school)
   
    Hauptschule Realschule Gymnasium
  0         199        199         0
  1           0          0       277
table(GSOEP9402$school_vocational, GSOEP9402$school)
   
    Hauptschule Realschule Gymnasium
  0           0          0       277
  1         199        199         0

These tables confirm whether categories were correctly assigned.
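
One detail worth checking in your own data is how missing values are handled. The two dummies above are built with == and %in%, which treat NA differently. The sketch below uses a made-up vector containing one NA; the school variable in the tables above shows no missing values, so this only matters for other data.

```r
# Made-up vector with one missing value
school <- c("Gymnasium", "Realschule", NA, "Hauptschule")

# == returns NA for a missing input, so ifelse() returns NA as well
academic <- ifelse(school == "Gymnasium", 1, 0)
print(academic)

# %in% returns FALSE for a missing input, so the dummy becomes 0
vocational <- ifelse(school %in% c("Hauptschule", "Realschule"), 1, 0)
print(vocational)
```

If a missing category should stay missing in the dummy, the == form (or an explicit is.na() check) is the safer choice.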


3. Recoding with dplyr::case_when()

case_when() is part of the tidyverse and provides a clean, readable syntax when you need multiple recoding rules.

library(dplyr)

GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_recode = case_when(
      school == "Gymnasium" ~ "Academic",
      school %in% c("Hauptschule", "Realschule") ~ "Vocational",
      TRUE ~ "Other"
    )
  )

Benefits

  • Great readability
  • Multiple conditions handled elegantly
  • No need for nested ifelse()
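
For comparison, here is roughly what the same three-rule recode might look like with nested ifelse() calls in base R. It is a sketch on a made-up vector; "Berufsschule" is a hypothetical extra category added only so that the fallback rule is triggered.

```r
# Made-up values; "Berufsschule" is hypothetical and not in GSOEP9402$school
school <- c("Gymnasium", "Realschule", "Hauptschule", "Berufsschule")

# Each additional rule adds another level of nesting
school_recode <- ifelse(
  school == "Gymnasium", "Academic",
  ifelse(school %in% c("Hauptschule", "Realschule"), "Vocational", "Other")
)
print(school_recode)
```

With only three rules this is still manageable, but every extra rule adds another layer of parentheses, which is where case_when() tends to read better.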

Check it

table(GSOEP9402$school_recode)

  Academic Vocational 
       277        398 

No "Other" category appears because every observation matched one of the first two rules.

4. Recoding with case_match()

case_match() (available in dplyr 1.1.0 and later) is designed for direct value-to-value mapping. It is simpler than case_when() when you do not need logical conditions.

GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_clean = case_match(
      school,
      "Gymnasium" ~ "Academic",
      "Realschule" ~ "Vocational",
      "Hauptschule" ~ "Vocational",
      .default = "Other"
    )
  )

Verification

table(GSOEP9402$school_clean, GSOEP9402$school)
            
             Hauptschule Realschule Gymnasium
  Academic             0          0       277
  Vocational         199        199         0
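
If dplyr is not available, a similar direct mapping can be sketched in base R with a named lookup vector. This is an alternative technique, not what case_match() does internally. The names lookup and school_clean and the extra category "Berufsschule" are assumptions for illustration.

```r
# Made-up factor; "Berufsschule" is hypothetical and has no entry in the lookup
school <- factor(c("Gymnasium", "Realschule", "Hauptschule", "Berufsschule"))

# Names are the old categories, values are the new ones
lookup <- c(
  Gymnasium   = "Academic",
  Realschule  = "Vocational",
  Hauptschule = "Vocational"
)

# Index by character: indexing with a factor would use its integer codes
school_clean <- unname(lookup[as.character(school)])

# Unmatched categories come back as NA; give them a default label
school_clean[is.na(school_clean)] <- "Other"
print(school_clean)
```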

5. Another Example: Income Groups

Here we recode a continuous numeric variable into two categories: “poor” and “rich”.

Check the distribution

summary(GSOEP9402$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1248   49229   66555   71311   86646  258341 

The median is about 66555, so we use it as the cutoff.

Recode using ifelse()

GSOEP9402$income_group <-
  ifelse(GSOEP9402$income < 66555, "poor", "rich")

Verify

table(GSOEP9402$income_group)

poor rich 
 337  338 
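
Instead of typing the cutoff by hand, you could compute it from the data. The sketch below does this on a made-up income vector and also shows cut(), a base R function for binning numeric values. The numbers and object names are illustrative assumptions.

```r
# Made-up incomes, not taken from GSOEP9402
income <- c(12000, 48000, 66000, 90000, 150000)

# Compute the cutoff rather than hard-coding it
cutoff <- median(income)

# Same rule as in the tutorial: below the median is "poor", otherwise "rich"
income_group <- ifelse(income < cutoff, "poor", "rich")
print(income_group)

# cut() gives the same split; right = FALSE makes intervals closed on the left,
# so a value equal to the cutoff falls into "rich"
income_cut <- cut(
  income,
  breaks = c(-Inf, cutoff, Inf),
  labels = c("poor", "rich"),
  right = FALSE
)
print(income_cut)
```

cut() returns a factor and extends naturally to more than two groups by adding further break points.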

Use in regression

lm(size ~ income_group, data = GSOEP9402)

Call:
lm(formula = size ~ income_group, data = GSOEP9402)

Coefficients:
     (Intercept)  income_grouprich  
          3.9881            0.5415  
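
In the output above, "poor" is the reference category because lm() converts a character variable to a factor whose levels are sorted alphabetically. If you would rather compare against "rich", one option is relevel(). The sketch below uses made-up data; the values and object names are assumptions.

```r
# Made-up household sizes and income groups
size <- c(3, 4, 5, 4, 6, 5)
income_group <- c("poor", "poor", "poor", "rich", "rich", "rich")

# Default: the alphabetically first category ("poor") is the reference
fit_default <- lm(size ~ income_group)
print(coef(fit_default))

# Make "rich" the reference category instead
income_group_f <- relevel(factor(income_group), ref = "rich")
fit_releveled <- lm(size ~ income_group_f)
print(coef(fit_releveled))
```

The two models fit the data equally well. Only the interpretation of the intercept and the sign of the group coefficient change.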

Summary

  • Base R
    • Best for simple renaming and replacements
    • df$x[df$x == "hello"] <- "hola"
  • ifelse()
    • Ideal for binary/dummy variables
  • case_when()
    • Best for multi-rule recoding
  • case_match()
    • Perfect for simple category-to-category mapping

Recoding is a fundamental step in cleaning and preparing data for analysis. Once mastered, it makes your workflow clearer, faster, and easier to extend.

Exercises: Recoding Categories in R

Below are short, practical exercises for students learning how to recode categorical variables using base R, ifelse(), case_when(), and case_match().

Each exercise includes:

  • A clear task
  • Hints where needed
  • A complete solution

All examples use the dataset:

library(AER)
data("GSOEP9402")

Exercise 1 — Recode School Categories (Base R)

Task

Using base R only, recode the variable school:

  • "Gymnasium" to "Academic"
  • "Hauptschule" and "Realschule" to "Vocational"
  • Leave all other values unchanged

Call the new variable: school_rec_base.


Solution

# school is a factor, so convert it to character before replacing values by logical indexing
GSOEP9402$school_rec_base <- as.character(GSOEP9402$school)

GSOEP9402$school_rec_base[GSOEP9402$school == "Gymnasium"] <- "Academic"

GSOEP9402$school_rec_base[
  GSOEP9402$school %in% c("Hauptschule", "Realschule")
] <- "Vocational"

Check

table(GSOEP9402$school_rec_base)

  Academic Vocational 
       277        398 

Exercise 2 — Create a Binary Variable with ifelse()

Task

Create a dummy variable:

  • school_acad_dummy = 1 if school is "Gymnasium"
  • 0 otherwise

Solution

GSOEP9402$school_acad_dummy <- ifelse(GSOEP9402$school == "Gymnasium", 1, 0)

Check

table(GSOEP9402$school_acad_dummy, GSOEP9402$school)
   
    Hauptschule Realschule Gymnasium
  0         199        199         0
  1           0          0       277

Exercise 3 — Recode Income Groups Using ifelse()

Task

Using the numeric variable income:

  • Create a new variable income_group
  • "low" if income is below 50,000
  • "high" otherwise

Solution

GSOEP9402$income_group <-
  ifelse(GSOEP9402$income < 50000, "low", "high")

Check

table(GSOEP9402$income_group)

high  low 
 500  175 
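
The table lists "high" before "low" because character values are tabulated in alphabetical order. If you prefer the natural order, one option is to store the result as a factor with explicit levels. This is a sketch on made-up incomes; the object names are assumptions.

```r
# Made-up incomes, not taken from GSOEP9402
income <- c(30000, 52000, 75000, 41000)

# Same rule as in the exercise
income_group <- ifelse(income < 50000, "low", "high")

# Explicit levels control the order used by table() and by plots
income_group_f <- factor(income_group, levels = c("low", "high"))
print(table(income_group_f))
```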

Exercise 4 — Use case_when() to Recode School Tracks

Task

Using dplyr::case_when() create school_track:

  • "Academic" if school == "Gymnasium"
  • "Vocational" if "Hauptschule" or "Realschule"
  • "Other" for all remaining cases

Solution

library(dplyr)

GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_track = case_when(
      school == "Gymnasium" ~ "Academic",
      school %in% c("Hauptschule", "Realschule") ~ "Vocational",
      TRUE ~ "Other"
    )
  )

Check

table(GSOEP9402$school_track)

  Academic Vocational 
       277        398 

Exercise 5 — Use case_match() for Direct Mapping

Task

Recode school categories again using case_match():

  • "Gymnasium" to "A"
  • "Realschule" to "V"
  • "Hauptschule" to "V"
  • All others to "O"

Name the variable school_abbrev.


Solution

GSOEP9402 <- GSOEP9402 %>%
  mutate(
    school_abbrev = case_match(
      school,
      "Gymnasium" ~ "A",
      "Realschule" ~ "V",
      "Hauptschule" ~ "V",
      .default = "O"
    )
  )

Check

table(GSOEP9402$school_abbrev, GSOEP9402$school)
   
    Hauptschule Realschule Gymnasium
  A           0          0       277
  V         199        199         0

Exercise 6 — Replace Categories Using Base R Only

Task

Use base R to replace:

  • "married" to "M"
  • "single" to "S"
  • "divorced" to "D"

in the variable marital, creating marital_short.

(Use logical indexing only.)


Solution

GSOEP9402$marital_short <- as.character(GSOEP9402$marital)

GSOEP9402$marital_short[GSOEP9402$marital == "married"] <- "M"
GSOEP9402$marital_short[GSOEP9402$marital == "single"]  <- "S"
GSOEP9402$marital_short[GSOEP9402$marital == "divorced"] <- "D"

Check

table(GSOEP9402$marital_short)

        D         M         S separated   widowed 
       59       566        11        29        10