Handling Missing Values in R
In most real datasets (especially surveys), missing values are not always stored as NA. Instead, you often see special numeric codes such as:
-9 = Inapplicable
-1 = Refused to answer
999 = Don’t know
Before analysing anything, you must:
- Select the variables you will actually analyse
- Inspect them carefully
- Recode invalid placeholders to NA
- Drop rows with missing values only in the variables relevant to your analysis
This chapter shows how to do all of this first in base R, then introduces tidyverse and naniar as modern tools.
Step 1 — Select variables relevant for analysis (Base R)
Most datasets contain many variables you do not need.
Start by selecting only the variables required for your analysis.
Example: suppose the full dataset is named df, and you only need:
- Sex
- Age
- Height
- Wage
df_sub <- df[, c("Sex", "Age", "Height", "Wage")]
This is now your working dataset.
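To make the selection step reproducible, here is a minimal sketch on a toy data frame. The values and the extra columns (ID, Region) are invented for illustration, and the early check for missing column names is an optional safeguard rather than part of the original workflow.

```r
# Hypothetical toy data; all values and the ID/Region columns are invented
df <- data.frame(
  ID     = 1:6,
  Sex    = c("F", "M", "F", "M", "F", "M"),
  Age    = c(34, 51, -9, 28, 45, 999),
  Height = c(165, 180, 172, NA, 158, 177),
  Wage   = c(20, -1, 15, -9, 35, NA),
  Region = c("N", "S", "S", "E", "W", "N")
)

vars <- c("Sex", "Age", "Height", "Wage")

# Stop early if a requested column is absent (for example, misspelled)
missing_vars <- setdiff(vars, names(df))
stopifnot(length(missing_vars) == 0)

# Keep only the analysis variables
df_sub <- df[, vars]
str(df_sub)
```

Storing the variable names in a vector such as vars means the same list can be reused in later steps.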
Step 2 — Summarize each variable (Base R)
summary() helps detect invalid codes:
summary(df_sub$Sex)
summary(df_sub$Age)
summary(df_sub$Height)
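summary(df_sub$Wage)
If a variable is stored as text, summary() reports only its length and class. Convert it to numeric first:
summary(as.numeric(df_sub$Wage))
# and so on for the other variables
# (for a factor, use as.numeric(as.character(x)); as.numeric() alone returns the level codes)
For example, you might see: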
Wage:
Min. :-9
1st Qu.:15
Median :20
Mean :18
Max. :35
A minimum of -9 is clearly not a real wage: it is a placeholder code, and other negative codes such as -1 may be present as well.
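Because summary() reports only quantiles, a code that sits between the minimum and maximum can go unnoticed. A frequency table of the suspicious values may help. The sketch below uses an invented wage vector, and the plausible range of 0 to 500 is an assumption you would adapt to your own variable.

```r
# Hypothetical wage vector; values are invented for illustration
wage <- c(20, -1, 15, -9, 35, NA, 999, 18, -9)

# Tabulate every distinct value, including NA
table(wage, useNA = "ifany")

# Narrow the view to values outside a plausible range
# (the 0-500 range is an assumption, not a rule)
suspicious <- wage[!is.na(wage) & (wage < 0 | wage > 500)]
suspect_counts <- table(suspicious)
suspect_counts
```

For a variable with many distinct values, the filtered table is usually easier to read than the full one.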
You can also count how many of these appear:
sum(df_sub$Wage == -1, na.rm = TRUE) # refused
sum(df_sub$Wage == -9, na.rm = TRUE) # inapplicable
sum(is.na(df_sub$Wage)) # true missing
Step 3 — Recode invalid values to NA (Base R)
Option A — Replace each code separately
df_sub$Wage[df_sub$Wage == -1] <- NA
df_sub$Wage[df_sub$Wage == -9] <- NA
Option B — More general rule: “any negative wage is invalid”
df_sub$Wage[df_sub$Wage < 0] <- NA
This catches all negative placeholder codes at once. Positive codes such as 999 are not covered by this rule and must be recoded separately.
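When the same codes appear in several variables, the recoding can be applied to all of them in one pass. The following is a sketch under assumptions: the toy data frame, the na_codes vector, and the choice of columns are all hypothetical, and you should first confirm that a code such as 999 cannot be a legitimate value in any of the listed variables.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  Age    = c(34, 51, -9, 28, 999),
  Height = c(165, -1, 172, NA, 158),
  Wage   = c(20, -1, 15, -9, 999)
)

# Placeholder codes taken from the codebook (assumed here)
na_codes <- c(-9, -1, 999)
num_vars <- c("Age", "Height", "Wage")

# %in% returns FALSE for NA, so existing NAs are left untouched
toy[num_vars] <- lapply(toy[num_vars], function(x) {
  x[x %in% na_codes] <- NA
  x
})

colSums(is.na(toy))
```

Listing the codes explicitly is safer than a blanket rule when a variable can legitimately take negative or large values.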
Step 4 — Drop incomplete cases (Base R)
Only drop rows missing the variables you need, not every missing value in the entire dataset.
df_clean <- na.omit(df_sub)
This keeps only rows where all selected variables (Sex, Age, Height, Wage) are present.
Important: na.omit() drops a row if any of its columns is NA. If the full dataset had 200 variables and you used na.omit(df) instead of na.omit(df_sub), you could lose most of your rows because of variables you never analyse.
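If you would rather keep the other columns in the data while still filtering on the analysis variables only, complete.cases() on a column subset is one option. The sketch below uses an invented data frame; the column names ID and Notes and the choice of needed variables are hypothetical.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  ID    = 1:5,
  Sex   = c("F", "M", NA, "M", "F"),
  Age   = c(34, NA, 40, 28, 45),
  Wage  = c(20, 18, 15, NA, 35),
  Notes = c(NA, NA, "x", NA, NA)
)

# Variables the analysis actually needs (assumed here)
needed <- c("Age", "Wage")

# TRUE for rows where every needed variable is present
keep <- complete.cases(toy[, needed])
toy_clean <- toy[keep, ]

nrow(toy_clean)      # rows kept when filtering on Age and Wage only
nrow(na.omit(toy))   # rows kept when filtering on every column
```

Comparing the two row counts shows how much data an unrestricted na.omit() would discard.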
Optional — Preserve the reason for missingness (Base R)
If you want to keep the original meaning (refused, inapplicable, etc.), create a label before recoding the codes to NA:
# Test is.na() first: comparisons with NA return NA, so true missing
# values would otherwise never reach the "Missing" label
df_sub$Wage_reason <- ifelse(is.na(df_sub$Wage), "Missing",
  ifelse(df_sub$Wage == -1, "Refused",
    ifelse(df_sub$Wage == -9, "Inapplicable", "Reported")))
Then clean the numeric wage:
df_sub$Wage[df_sub$Wage < 0] <- NA
Up to this point: everything was Base R.
Now let’s introduce the modern tools.
Step 5 — A Modern Alternative: tidyverse + naniar
Required packages
library(dplyr)
library(naniar)
library(ggplot2)
- dplyr → data manipulation
- naniar → missing data summaries + visualisations
- ggplot2 → plotting package that naniar builds its graphs on
(pacman: an optional helper package that makes loading/installing packages easier; you do not need it here.)
Step 6 — Recode missing values with tidyverse
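# Select the analysis variables first, then recode each code
df_clean <- df %>%
  select(Sex, Age, Height, Wage) %>%
  mutate(
    Wage = na_if(Wage, -1),
    Wage = na_if(Wage, -9)
  )
Or the general rule:
df_clean <- df %>%
  select(Sex, Age, Height, Wage) %>%
  mutate(Wage = if_else(Wage < 0, NA_real_, Wage))
Step 7 — Explore Missingness with naniar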
1. Variable-level missingness
naniar::miss_var_summary(df_clean)
This shows:
- n_miss: number of missing values
- pct_miss: percentage missing
2. Graph of missing data
naniar::vis_miss(df_clean)
3. Number of missing values per row
naniar::miss_case_table(df_clean)
Interpretation of typical output:
| n_miss_in_case | n_cases | pct_cases |
|---|---|---|
| 0 | 4 | 33% |
| 1 | 7 | 58% |
| 2 | 1 | 8% |
Meaning:
- 4 rows have no missing data
- 7 rows have exactly one missing value
- 1 row has two missing values
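As a cross-check, the same per-row counts can be reproduced in base R. This is a sketch on invented data, not naniar's implementation; the column names simply mirror the table above.

```r
# Hypothetical toy data; values are invented for illustration
toy <- data.frame(
  Age    = c(34, NA, 40, 28),
  Height = c(165, NA, 172, 180),
  Wage   = c(20, 18, NA, 35)
)

# Number of missing values in each row
n_miss_in_case <- rowSums(is.na(toy))

# Count how many rows share each missing-value count
case_table <- as.data.frame(table(n_miss_in_case), responseName = "n_cases")
case_table$pct_cases <- 100 * case_table$n_cases / nrow(toy)
case_table
```

Here two rows are complete, one row has one missing value, and one row has two.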
4. Number of complete cases
# n_case_complete() counts complete rows; n_complete() counts non-missing values
naniar::n_case_complete(df_clean)
Or proportion:
naniar::prop_complete_case(df_clean)
Final Summary
- Always start by selecting the variables relevant for analysis.
- Use base R (summary()) to detect invalid placeholder values.
- Replace placeholders (-1, -9, etc.) with NA.
- Drop rows missing essential variables using na.omit() on your subset.
- Optionally keep a label explaining the reason for missingness.
- Finally, learn tidyverse and naniar for easier missing-data workflows and visualisation.