Import SPSS/Stata Data

Importing Datasets in RStudio

When working with survey data or datasets from statistical software like SPSS or Stata, you’ll need to import these files into R. The haven package provides the most reliable and modern tools for reading these file formats while preserving important metadata like variable labels and value labels.

# Install if necessary
# install.packages("haven")

# Load the library
library(haven)

Why Use haven?

The haven package is specifically designed to read data from SPSS, Stata, and SAS. It has several advantages:

  • Preserves labels: Variable labels and value labels are maintained during import
  • Handles missing values: Properly interprets user-defined missing values from SPSS and Stata
  • Modern and maintained: Actively developed as part of the tidyverse ecosystem
  • Encoding support: Handles different character encodings correctly
  • Metadata retention: Keeps important dataset attributes that other packages might lose
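To make "labels" concrete, here is a minimal sketch that builds a labelled vector by hand with haven's labelled() constructor, the same kind of column that the import functions produce. The variable name region and its codes are made up for illustration; they do not come from any real dataset.

```r
library(haven)

# Hypothetical column, similar to what read_sav() or read_dta() returns
region <- labelled(
  c(1, 2, 2, 3),
  labels = c(North = 1, South = 2, West = 3),  # value labels
  label = "Region of residence"                # variable label
)

# The metadata lives in attributes on the vector
attr(region, "label")
attr(region, "labels")

# as_factor() turns the value labels into factor levels
region_factor <- as_factor(region)
levels(region_factor)
```

The underlying numeric codes stay available until you convert, so you can choose per variable whether to work with codes or with labels.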

Importing Stata Files

Stata files typically have the extension .dta. To import them, use the read_dta() function:

# Import a Stata dataset
df <- read_dta("~/Documents/my_data.dta")

Common read_dta() Arguments

The read_dta() function has several useful arguments:

df <- read_dta(
  file = "~/Documents/my_data.dta",
  encoding = NULL,  # Specify encoding if needed (e.g., "latin1")
  col_select = NULL # Select specific columns to import
)

Key arguments:

  • file: The path to your .dta file
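  • encoding: Character encoding. The default (NULL) uses the encoding recorded in the file; files from Stata 13 and earlier do not record one, so specify it yourself if text looks garbled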
  • col_select: Import only specific columns to save memory with large datasets
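The effect of col_select is easy to try on a small file. The sketch below writes a toy dataset to a temporary .dta file with write_dta() and reads back two of its three columns; the column names (id, age, income) are invented for the example.

```r
library(haven)

# Write a small, made-up dataset to a temporary Stata file
path <- tempfile(fileext = ".dta")
toy <- data.frame(
  id = 1:3,
  age = c(34, 51, 27),
  income = c(100, 200, 300)
)
write_dta(toy, path)

# Read back only the columns we need; names are given unquoted
small <- read_dta(path, col_select = c(id, age))
names(small)
```

Columns left out of col_select are not loaded into the resulting data frame, which is what keeps memory use down with wide survey files.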

Importing SPSS Files

SPSS files come in two formats: .sav (standard SPSS format) and .por (portable format). Use read_sav() for .sav files and read_por() for .por files:

# Import an SPSS .sav file
df <- read_sav("~/Documents/my_data.sav")

# Import an SPSS .por file
df <- read_por("~/Documents/my_data.por")

Common read_sav() Arguments

df <- read_sav(
  file = "~/Documents/my_data.sav",
  encoding = NULL,      # Specify encoding if needed
  user_na = FALSE,      # FALSE (default): user-defined missing values become NA
  col_select = NULL     # Select specific columns
)

Key arguments:

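  • file: The path to your .sav file (read_por() takes the path to a .por file)
  • user_na: If FALSE (the default), user-defined missing values are converted to NA. If TRUE, they are kept as values and the variable is read as a labelled_spss vector that records which codes count as missing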
  • encoding: Character encoding specification
  • col_select: Import only specific columns
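A short round trip shows what user_na changes. This sketch assumes a made-up variable satisfaction in which the code 99 ("Refused") is declared as user-defined missing; it writes the variable to a temporary .sav file and reads it back both ways.

```r
library(haven)

# Hypothetical survey item where 99 is a user-defined missing code
satisfaction <- labelled_spss(
  c(1, 2, 99, 3),
  labels = c(Low = 1, Medium = 2, High = 3, Refused = 99),
  na_values = 99
)

path <- tempfile(fileext = ".sav")
write_sav(tibble::tibble(satisfaction = satisfaction), path)

# Default: the user-defined missing code is turned into NA
default_import <- read_sav(path)
is.na(default_import$satisfaction)

# user_na = TRUE: the code 99 is kept, and the vector remembers
# that it is a missing-value code
kept_import <- read_sav(path, user_na = TRUE)
class(kept_import$satisfaction)
attr(kept_import$satisfaction, "na_values")
```

Keeping the codes is useful when the reason for missingness (refused, not applicable, and so on) matters for your analysis.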

Understanding File Paths

Understanding Path Components

  • ~ represents your home directory
  • / separates folders (use forward slashes even on Windows)
  • Absolute paths start from the root directory
  • Relative paths start from your current working directory
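The path components above can be explored with base R alone. In this sketch the folder and file names (data, my_data.dta) are placeholders, not files that are expected to exist on your machine.

```r
# Expand "~" to the full home directory path
home_path <- path.expand("~/Documents/my_data.dta")

# Build a relative path from its parts; file.path() uses forward slashes
rel_path <- file.path("data", "my_data.dta")

# Relative paths are resolved against the current working directory
getwd()

# See which absolute path a relative path points to
normalizePath(rel_path, winslash = "/", mustWork = FALSE)

# Check whether the file is actually there before importing
file.exists(rel_path)
```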

Using RStudio’s Import Dialog (When Paths Are Tricky)

Sometimes constructing the correct file path can be tricky, especially on Windows or when dealing with network drives. RStudio provides a helpful graphical interface for importing data.

Step-by-Step Process

  1. Navigate to the Import dialog:
    • Click File > Import Dataset > From SPSS (or From Stata)
  2. Browse for your file:
    • Click “Browse” and navigate to your data file
    • Select the file and click “Open”
  3. Preview and configure:
    • Review the data preview
    • Adjust import options if needed
    • Note the dataset name
  4. Copy the generated code:
    • Look at the “Code Preview” pane in the import dialog
    • Copy the entire import command before clicking “Import”
  5. Paste into your script:
    • Paste the copied code into your R script
    • Save this in your script for future use

Example of Code from Import Dialog

When you use the import dialog, RStudio generates code like this:

# Code generated by RStudio's import dialog
library(haven)
df <- read_sav("C:/Users/YourName/Documents/Projects/survey_2024/data.sav")

Important: Always copy this generated code into your script. This ensures:

  • You have a record of where the data came from
  • You can re-run the import without using the dialog again
  • Your analysis is reproducible

Working with Imported Data

Once imported, your data will retain its labels. You can inspect them using functions from the labelled package:

library(labelled)
library(dplyr)   # provides %>%, mutate(), and across() used below

# View variable labels
look_for(df)

# Check a specific variable's labels
val_labels(df$region)

# Convert labelled data to factors (if needed)
df <- df %>% 
  mutate(across(where(is.labelled), as_factor))

Alternative: The foreign Package

If haven doesn’t work for some reason (e.g., with very old file formats or in legacy R installations), you can use the foreign package as a fallback:

# Install if necessary
# install.packages("foreign")

library(foreign)

# Import Stata files (versions up to 12)
df <- read.dta("~/Documents/my_data.dta")

# Import SPSS files
df <- read.spss("~/Documents/my_data.sav", 
                to.data.frame = TRUE,
                use.value.labels = TRUE)

Limitations of foreign

  • Older and less maintained than haven
  • Limited Stata support: Only works with Stata versions up to 12
  • Label handling: Less sophisticated preservation of metadata
  • Encoding issues: More prone to character encoding problems

Recommendation: Only use foreign if haven fails. In most modern workflows, haven should be your first choice.
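One way to encode the "haven first, foreign as fallback" recommendation is a small wrapper around tryCatch(). The function name read_stata_with_fallback() is hypothetical and is only a sketch of the idea, not part of either package.

```r
# Hypothetical helper: try haven first, fall back to foreign on error
read_stata_with_fallback <- function(path) {
  tryCatch(
    haven::read_dta(path),
    error = function(e) {
      message("haven failed (", conditionMessage(e), "); trying foreign::read.dta()")
      foreign::read.dta(path)
    }
  )
}

# Usage with a small temporary file written by haven
path <- tempfile(fileext = ".dta")
haven::write_dta(data.frame(id = 1:3, score = c(2.5, 3.1, 4.0)), path)
result <- read_stata_with_fallback(path)
```

Note that the two readers return differently structured columns (haven keeps labelled vectors, foreign converts to factors by default), so downstream code may need to handle both cases.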


Common Issues and Solutions

Issue: File Not Found

# Error: 'my_data.dta' does not exist in current working directory

Solution: Check your working directory and use a full path:

# Check current working directory
getwd()

# Use full path instead
df <- read_dta("~/Documents/my_data.dta")
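When the error persists, it helps to check the path before calling the import function. The helper below, check_data_path(), is a hypothetical name for a minimal diagnostic sketch built from base R functions only.

```r
# Hypothetical helper: report whether a data file exists and, if not,
# print the working directory and any data files in the target folder
check_data_path <- function(path) {
  full <- path.expand(path)
  if (file.exists(full)) {
    return(TRUE)
  }
  message("Not found: ", full)
  message("Working directory: ", getwd())
  candidates <- list.files(
    dirname(full),
    pattern = "\\.(dta|sav|por)$",
    ignore.case = TRUE
  )
  if (length(candidates) > 0) {
    message("Data files in that folder: ", paste(candidates, collapse = ", "))
  }
  FALSE
}

# Usage: returns FALSE (with hints) for a missing file, TRUE once it exists
tmp <- tempfile(fileext = ".dta")
missing_result <- check_data_path(tmp)
file.create(tmp)
present_result <- check_data_path(tmp)
```

A typo in the file name is the most common cause, and listing the neighbouring files usually makes it obvious.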

Issue: Encoding Problems

If you see strange characters in your imported data:

# Specify encoding explicitly
df <- read_dta("~/Documents/my_data.dta", encoding = "latin1")
# or
df <- read_sav("~/Documents/my_data.sav", encoding = "UTF-8")
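To see what an encoding mismatch looks like, the following base R sketch takes the word "café" stored as latin1 bytes and converts it to UTF-8 with iconv(). It only illustrates the underlying problem; it is not something haven requires you to do, since the encoding argument handles the conversion at import time.

```r
# "café" stored as latin1: the last byte (0xE9) is not valid UTF-8 on its own
bytes_latin1 <- "caf\xe9"
validUTF8(bytes_latin1)

# Declaring the correct source encoding repairs the text
fixed <- iconv(bytes_latin1, from = "latin1", to = "UTF-8")
validUTF8(fixed)
nchar(fixed)
```

If accented characters in your imported data look wrong, the file was most likely written in an encoding other than the one assumed during import.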

Issue: Large File Takes Too Long

For very large datasets, import only the columns you need:

# Import only specific columns
df <- read_dta("~/Documents/large_data.dta", 
               col_select = c(id, age, gender, income))
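If you do not yet know which columns you need, reading only the first few rows is a cheap way to find out. This sketch uses the n_max argument of read_dta() on a made-up temporary file; the column names are placeholders.

```r
library(haven)

# Create a made-up file with 1000 rows to stand in for a large dataset
path <- tempfile(fileext = ".dta")
big <- data.frame(
  id = 1:1000,
  age = rep(c(20, 30, 40, 50), times = 250),
  income = seq_len(1000) * 10
)
write_dta(big, path)

# Read just the first five rows to inspect column names and types
preview <- read_dta(path, n_max = 5)
names(preview)
nrow(preview)
```

After checking the preview, pass the columns you want to col_select for the full import.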

Best Practices

  1. Always use haven as your first choice for SPSS and Stata files
  2. Use full file paths in your scripts for reproducibility
  3. Keep a copy of the import code from RStudio’s dialog in your script
  4. Document your data source with comments in your script:

     # Data source: National Survey 2024
     # Original file: responses_final_v3.sav
     # Date imported: 2024-11-10
     df <- read_sav("~/Documents/projects/survey2024/responses_final_v3.sav")

  5. Inspect labels immediately after import using look_for() or str()
  6. Version control your scripts, not your data files

Summary of Key Functions

Package   Function      Purpose                          File Types
haven     read_dta()    Import Stata files               .dta (all versions)
haven     read_sav()    Import SPSS files                .sav
haven     read_por()    Import SPSS portable files       .por
foreign   read.dta()    Import Stata files (legacy)      .dta (up to v12)
foreign   read.spss()   Import SPSS files (legacy)       .sav
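The haven rows of the table can be folded into a small dispatcher that picks the reader from the file extension. The function name import_stat_file() is hypothetical; this is a sketch of the mapping, not a function provided by haven.

```r
# Hypothetical helper: choose the haven reader based on the file extension
import_stat_file <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(
    ext,
    dta = haven::read_dta(path),
    sav = haven::read_sav(path),
    por = haven::read_por(path),
    stop("Unsupported file extension: ", ext)
  )
}

# Usage with a small temporary SPSS file
path <- tempfile(fileext = ".sav")
haven::write_sav(data.frame(id = 1:2, score = c(10, 20)), path)
imported <- import_stat_file(path)
```

Stopping on unknown extensions, instead of guessing a format, keeps import errors visible early in a script.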