Handling Missing Values in R

In most real datasets (especially surveys), missing values are not always stored as NA. Instead, you often see special numeric codes such as:

-9 = Inapplicable
-1 = Refused to answer
999 = Don’t know

Before analysing anything, you must:

Select the variables you will actually analyse
Inspect them carefully
Recode invalid placeholders to NA
Drop rows with missing values only in the variables relevant to your analysis

This chapter shows how to do all of this first in base R, then introduces tidyverse and naniar as modern tools.

Step 1 — Select variables relevant for analysis (Base R)

Most datasets contain many variables you do not need.

Start by selecting only the variables required for your analysis.

Example: suppose the full dataset is named df, and you only need:

Sex
Age
Height
Wage

df_sub <- df[, c("Sex", "Age", "Height", "Wage")]

This is now your working dataset.
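
A misspelled column name makes this subsetting fail with an "undefined columns selected" error, so it can help to check the names first. The sketch below is only an illustration: the toy data frame, its values, and the extra ID column are made up.

```r
# Toy data, invented for illustration only
df <- data.frame(
  ID     = 1:5,
  Sex    = c("F", "M", "F", "M", "F"),
  Age    = c(34, 51, 999, 28, 45),
  Height = c(165, 180, 172, NA, 158),
  Wage   = c(20, -1, 15, -9, 35)
)

vars <- c("Sex", "Age", "Height", "Wage")

# Names that were requested but do not exist in df (empty if all are found)
not_found <- setdiff(vars, names(df))
stopifnot(length(not_found) == 0)

# Keep only the analysis variables
df_sub <- df[, vars]
```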

Step 2 — Summarize each variable (Base R)

summary() helps detect invalid codes:

summary(df_sub$Sex)
summary(df_sub$Age)
summary(df_sub$Height)
summary(df_sub$Wage)

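If summary() only reports Length, Class and Mode, the variable is stored as text rather than as a number. Convert it to numeric first:

summary(as.numeric(df_sub$Wage))
# repeat for the other numeric variables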
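
One caveat with as.numeric(): if the column is stored as a factor, it returns the internal level codes instead of the original numbers. A minimal sketch with a made-up factor (the values are hypothetical) shows the difference and the usual workaround of converting through character first:

```r
# Hypothetical wage column that was read in as a factor
wage_factor <- factor(c("20", "35", "15"))

# as.numeric() on a factor returns the position of each value
# among the sorted levels ("15", "20", "35"), not the value itself
codes <- as.numeric(wage_factor)                  # 2 3 1

# Converting to character first recovers the real numbers
values <- as.numeric(as.character(wage_factor))   # 20 35 15
```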

For example, you might see:

Wage:
 Min. :-9  
 1st Qu.:15
 Median :20
 Mean  :18
 Max. :35

A minimum of -9 is obviously not a real wage. It shows that placeholder codes such as -9 and -1 are mixed in with the genuine values.

You can also count how many of these appear:

sum(df_sub$Wage == -1, na.rm = TRUE)  # refused
sum(df_sub$Wage == -9, na.rm = TRUE)  # inapplicable
sum(is.na(df_sub$Wage))               # true missing
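
Counting one code in one variable at a time gets tedious. As a rough sketch, assuming the same three codes from the start of this chapter could appear in any numeric column, the counts can be collected into one table. The toy data below is made up.

```r
# Toy data, invented for illustration only
df_sub <- data.frame(
  Age    = c(34, 51, 999, 28, 45),
  Height = c(165, 180, 172, NA, 158),
  Wage   = c(20, -1, 15, -9, 35)
)

codes <- c(-9, -1, 999)

# For every column, count how often each placeholder code occurs
code_counts <- sapply(df_sub, function(x) {
  sapply(codes, function(code) sum(x == code, na.rm = TRUE))
})
rownames(code_counts) <- codes

# True missing values (NA) per column
na_counts <- colSums(is.na(df_sub))

print(code_counts)
print(na_counts)
```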

Step 3 — Recode invalid values to NA (Base R)

Option A — Replace each code separately

df_sub$Wage[df_sub$Wage == -1] <- NA
df_sub$Wage[df_sub$Wage == -9] <- NA

Option B — More general rule: “any negative wage is invalid”

df_sub$Wage[df_sub$Wage < 0] <- NA

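This catches every negative placeholder code at once. It does not catch positive codes such as 999, which still have to be replaced explicitly, as in Option A.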
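
When several variables share the same codes, the replacement can be written once with %in%. The sketch below assumes that -9, -1 and 999 are placeholders in every listed column and that 999 is never a genuine value there; check this for your own data. The toy data is made up.

```r
# Toy data, invented for illustration only
df_sub <- data.frame(
  Age    = c(34, 51, 999, 28, 45),
  Height = c(165, 180, 172, NA, 158),
  Wage   = c(20, -1, 15, -9, 35)
)

codes    <- c(-9, -1, 999)
num_vars <- c("Age", "Height", "Wage")

# %in% returns FALSE (not NA) for existing NAs, so the index is always clean
for (v in num_vars) {
  df_sub[[v]][df_sub[[v]] %in% codes] <- NA
}

print(df_sub)
```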

Step 4 — Drop incomplete cases (Base R)

Only drop rows missing the variables you need, not every missing value in the entire dataset.

df_clean <- na.omit(df_sub)

This keeps only rows where all selected variables (Sex, Age, Height, Wage) are present.

Important: If the full dataset had 200 variables and you used na.omit(df) instead of na.omit(df_sub), a row would be dropped as soon as any one of the 200 variables is missing, and you might lose 80% of your data.
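
The effect is easy to see on a small example. In this sketch the data and the extra Region column are made up; Region stands in for the many variables you do not need.

```r
# Toy data, invented for illustration only
df <- data.frame(
  Sex    = c("F", "M", "F", "M", "F"),
  Age    = c(34, 51, NA, 28, 45),
  Wage   = c(20, NA, 15, NA, 35),
  Region = c(NA, "North", NA, "South", "East")  # not needed for the analysis
)

vars <- c("Sex", "Age", "Wage")

# Rows that survive when every column must be present
n_all <- nrow(na.omit(df))            # 1

# Rows that survive when only the analysis variables must be present
n_sub <- nrow(na.omit(df[, vars]))    # 2

# Alternative: filter on the analysis variables but keep all columns
keep     <- complete.cases(df[, vars])
df_clean <- df[keep, ]
```

The complete.cases() variant is useful when you want to carry extra columns along without letting their missing values decide which rows are kept.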

Optional — Preserve the reason for missingness (Base R)

If you want to keep the original meaning (refused, inapplicable, etc.), create a label before recoding the codes to NA in Step 3. Test for NA first, because ifelse() returns NA wherever its condition is NA:

df_sub$Wage_reason <- ifelse(is.na(df_sub$Wage), "Missing",
                       ifelse(df_sub$Wage == -1, "Refused",
                       ifelse(df_sub$Wage == -9, "Inapplicable", "Reported")))

Then clean the numeric wage:

df_sub$Wage[df_sub$Wage < 0] <- NA
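
The order of the tests in the label matters. A small check on a made-up wage vector shows what happens when the NA test does not come first; the variable names here are illustrative only.

```r
# Made-up wages: one refusal, one true NA, one inapplicable
wage <- c(20, -1, NA, -9, 35)

# NA test first: every element receives a label
reason <- ifelse(is.na(wage), "Missing",
          ifelse(wage == -1, "Refused",
          ifelse(wage == -9, "Inapplicable", "Reported")))

# NA test later: wage == -1 is NA for the missing value,
# so ifelse() returns NA there and "Missing" is never assigned
naive <- ifelse(wage == -1, "Refused",
         ifelse(is.na(wage), "Missing", "Reported"))

# Only now replace the codes in the numeric variable
wage[!is.na(wage) & wage < 0] <- NA

print(table(reason))
```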

Everything up to this point used base R.

The next steps introduce the modern tools.

Step 5 — A Modern Alternative: tidyverse + naniar

Required packages

library(dplyr)
library(naniar)
library(ggplot2)

dplyr → data manipulation
naniar → missing data summaries + visualisations
ggplot2 → plotting package that naniar's visualisations are built on; load it if you want to customise the plots

(pacman: an optional helper package that makes loading/installing packages easier; you do not need it here.)

Step 6 — Recode missing values with tidyverse

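df_clean <- df %>%
  select(Sex, Age, Height, Wage) %>%   # keep only the analysis variables, as in Step 1
  mutate(
    Wage = na_if(Wage, -1),            # refused
    Wage = na_if(Wage, -9)             # inapplicable
  )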

Or the general rule:

df_clean <- df %>%
  select(Sex, Age, Height, Wage) %>%
  mutate(Wage = if_else(Wage < 0, NA_real_, Wage))  # NA_real_ assumes Wage is stored as a double
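
To recode several columns in one go, mutate() can be combined with across(). This is a sketch rather than the only way to do it: it assumes dplyr is installed, that the same codes apply to Age, Height and Wage, and that 999 is never a genuine value in those columns. The toy data is made up.

```r
library(dplyr)

# Toy data, invented for illustration only
df <- data.frame(
  ID     = 1:5,
  Sex    = c("F", "M", "F", "M", "F"),
  Age    = c(34, 51, 999, 28, 45),
  Height = c(165, 180, 172, NA, 158),
  Wage   = c(20, -1, 15, -9, 35)
)

codes <- c(-9, -1, 999)

df_clean <- df %>%
  select(Sex, Age, Height, Wage) %>%
  # replace() swaps every value found in codes for NA, column by column
  mutate(across(c(Age, Height, Wage), ~ replace(.x, .x %in% codes, NA)))

print(df_clean)
```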

Step 7 — Explore Missingness with naniar

1. Variable-level missingness

naniar::miss_var_summary(df_clean)

This shows:

n_miss: number of missing values
pct_miss: percentage missing

2. Graph of missing data

naniar::vis_miss(df_clean)

3. Number of missing values per row

naniar::miss_case_table(df_clean)

Interpretation of typical output:

n_miss_in_case	n_cases	pct_cases
0	4	33%
1	7	58%
2	1	8%

Meaning:

4 rows have no missing data
7 rows have exactly one missing value
1 row has two missing values

4. Number of complete cases

naniar::n_case_complete(df_clean)

Or proportion:

naniar::prop_complete_case(df_clean)
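
If naniar is not available, or you want to cross-check its numbers, the same summaries can be approximated in base R. A minimal sketch on made-up data; the object names are illustrative.

```r
# Toy data after recoding, invented for illustration only
df_clean <- data.frame(
  Sex    = c("F", "M", "F", "M", "F"),
  Age    = c(34, 51, NA, 28, 45),
  Height = c(165, 180, 172, NA, 158),
  Wage   = c(20, NA, 15, NA, 35)
)

# Per variable: number and percentage of missing values
miss_per_var <- colSums(is.na(df_clean))
pct_per_var  <- 100 * colMeans(is.na(df_clean))

# Per row: number of missing values, then how many rows share each count
miss_per_row <- rowSums(is.na(df_clean))
case_table   <- table(miss_per_row)

# Complete rows: count and proportion
n_complete_rows    <- sum(complete.cases(df_clean))
prop_complete_rows <- mean(complete.cases(df_clean))

print(miss_per_var)
print(case_table)
```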

Final Summary

Always start by selecting the variables relevant for analysis.
Use base R (summary()) to detect invalid placeholder values.
Replace placeholders (-1, -9, 999, etc.) with NA.
Drop rows missing essential variables using na.omit() on your subset.
Optionally keep a label explaining the reason for missingness.
Finally, learn tidyverse and naniar for easier missing-data workflows and visualisation.