Import SPSS/Stata Data

Importing Datasets in RStudio

When working with survey data or datasets from statistical software like SPSS or Stata, you’ll need to import these files into R. The haven package reads these file formats and preserves important metadata such as variable labels and value labels.

# Install if necessary
# install.packages("haven")

# Load the library
library(haven)

Why Use `haven`?

The haven package is specifically designed to read data from SPSS, Stata, and SAS. It has several advantages:

Preserves labels: Variable labels and value labels are maintained during import
Handles missing values: Converts SPSS user-defined missing values to NA (or keeps them on request) and reads Stata’s extended missing values (.a to .z) as tagged NAs
Modern and maintained: Actively developed as part of the tidyverse ecosystem
Encoding support: Handles different character encodings correctly
Metadata retention: Keeps important dataset attributes that other packages might lose
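
To make label preservation concrete, here is a minimal sketch that writes a tiny labelled dataset to a temporary .dta file and reads it back. The toy data, the column names, and the temporary file are invented for illustration; the haven functions used are labelled(), write_dta(), and read_dta().

```r
library(haven)

# Toy data (hypothetical): an integer code with a variable label and value labels
toy <- data.frame(id = 1:3)
toy$region <- labelled(
  c(1L, 2L, 1L),
  labels = c(North = 1L, South = 2L),
  label  = "Region of residence"
)

# Round-trip through a temporary Stata file
tmp <- tempfile(fileext = ".dta")
write_dta(toy, tmp)
df_demo <- read_dta(tmp)

# Both kinds of label are stored as attributes on the imported column
attr(df_demo$region, "label")    # variable label
attr(df_demo$region, "labels")   # value labels (named vector of codes)
```

The same attributes are what tools such as RStudio's data viewer and the labelled package read when they display labels.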

Importing Stata Files

Stata files typically have the extension .dta. To import them, use the read_dta() function:

# Import a Stata dataset
df <- read_dta("~/Documents/my_data.dta")

Common `read_dta()` Arguments

The read_dta() function has several useful arguments:

df <- read_dta(
  file = "~/Documents/my_data.dta",
  encoding = NULL,  # Specify encoding if needed (e.g., "latin1")
  col_select = NULL # Select specific columns to import
)

Key arguments:

file: The path to your .dta file
encoding: Character encoding. Usually you can leave this as NULL; it mainly matters for files written by Stata 13 or earlier, which do not record their encoding
col_select: Import only specific columns to save memory with large datasets
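
As a rough illustration of col_select, the sketch below builds a small stand-in dataset (the column names are hypothetical), saves it to a temporary .dta file, and reads back only some columns. The helper is written as tidyselect::starts_with() so the example does not depend on any other package being attached.

```r
library(haven)

# Hypothetical toy dataset standing in for a wide survey file
toy <- data.frame(
  id       = 1:4,
  age      = c(34, 51, 27, 45),
  income   = c(42000, 58000, 31000, 67000),
  q1_agree = c(1, 2, 2, 1),
  q2_agree = c(2, 2, 1, 1)
)
tmp <- tempfile(fileext = ".dta")
write_dta(toy, tmp)

# Select columns by name
by_name <- read_dta(tmp, col_select = c(id, age))

# Select columns with a tidyselect helper
by_prefix <- read_dta(tmp, col_select = tidyselect::starts_with("q"))

names(by_name)
names(by_prefix)
```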

Importing SPSS Files

SPSS files come in two main formats: .sav (standard SPSS format) and .por (portable format). Use read_sav() for .sav files (it also reads compressed .zsav files) and read_por() for .por files:

# Import an SPSS .sav file
df <- read_sav("~/Documents/my_data.sav")

# Import an SPSS .por file
df <- read_por("~/Documents/my_data.por")

Common `read_sav()` Arguments

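df <- read_sav(
  file = "~/Documents/my_data.sav",
  encoding = NULL,      # Override the encoding recorded in the file if needed
  user_na = FALSE,      # FALSE (default) converts user-defined missing values to NA
  col_select = NULL     # Select specific columns
)

Key arguments:

file: The path to your .sav file (read_por() takes the path to a .por file)
user_na: If FALSE (the default), user-defined missing values are converted to NA. If TRUE, they are kept as their original codes in labelled_spss() columns that record which values count as missing
encoding: Character encoding. The default, NULL, uses the encoding recorded in the file; set it only if that value is wrong
col_select: Import only specific columns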
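
The effect of user_na is easiest to see side by side. The following sketch uses an invented survey item in which the code 99 ("Refused") is declared as user-missing, writes it to a temporary .sav file, and reads it back both ways. The data and names are assumptions made for the example.

```r
library(haven)

# Hypothetical survey item where 99 ("Refused") is declared user-missing
toy <- data.frame(id = 1:3)
toy$answer <- labelled_spss(
  c(1, 2, 99),
  labels    = c(Yes = 1, No = 2, Refused = 99),
  na_values = 99
)
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)

# Default (user_na = FALSE): the user-missing code is converted to NA
as_na <- read_sav(tmp, user_na = FALSE)
sum(is.na(as_na$answer))

# user_na = TRUE: the code is kept and recorded in the "na_values" attribute
kept <- read_sav(tmp, user_na = TRUE)
class(kept$answer)
attr(kept$answer, "na_values")
```

Keeping the codes is useful when the reason for missingness (for example "Refused" versus "Don't know") matters for the analysis.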

Understanding File Paths

Using Full Paths (Recommended)

It’s recommended to use full file paths in your scripts for reproducibility:

# Full path examples
df <- read_dta("~/Documents/projects/survey_2024/data/responses.dta")
df <- read_sav("C:/Users/YourName/Documents/survey_data.sav")  # Windows
df <- read_sav("/Users/YourName/Documents/survey_data.sav")    # macOS (on Linux: /home/YourName/...)

Benefits of full paths:

Your script will work regardless of your current working directory
Easy to identify exactly which file is being imported
Reduces confusion when sharing code with collaborators

Understanding Path Components

~ represents your home directory
/ separates folders (use forward slashes even on Windows; a backslash must be doubled as \\ inside an R string)
Absolute paths start from the root directory
Relative paths start from your current working directory
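
Base R has helpers for each of these path components. The sketch below uses a throwaway folder under tempdir() with a hypothetical project layout, so it runs anywhere without touching your real files.

```r
# Build a path from components; file.path() joins them with forward slashes
data_dir <- file.path(tempdir(), "survey_2024", "data")   # hypothetical folder
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
data_path <- file.path(data_dir, "responses.dta")
file.create(data_path)   # empty placeholder so the path exists

# Expand "~" to the home directory it stands for
path.expand("~/Documents")

# Resolve to an absolute path; this errors if the file is missing
full_path <- normalizePath(data_path, winslash = "/", mustWork = TRUE)

file.exists(full_path)   # is the file really there?
basename(full_path)      # file name only
dirname(full_path)       # containing folder
```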

Using RStudio’s Import Dialog (When Paths Are Tricky)

Sometimes constructing the correct file path can be tricky, especially on Windows or when dealing with network drives. RStudio provides a helpful graphical interface for importing data.

Step-by-Step Process

1. Navigate to the Import dialog:
   - Click File → Import Dataset → From SPSS (or From Stata)
2. Browse for your file:
   - Click “Browse” and navigate to your data file
   - Select the file and click “Open”
3. Preview and configure:
   - Review the data preview
   - Adjust import options if needed
   - Note the dataset name
4. Copy the generated code:
   - Look at the “Code Preview” pane in the import dialog
   - Copy the entire import command before clicking “Import”
5. Paste into your script:
   - Paste the copied code into your R script
   - Save this in your script for future use

Example of Code from Import Dialog

When you use the import dialog, RStudio generates code like this:

# Code generated by RStudio's import dialog
library(haven)
df <- read_sav("C:/Users/YourName/Documents/Projects/survey_2024/data.sav")

Important: Always copy this generated code into your script. This ensures:

You have a record of where the data came from
You can re-run the import without using the dialog again
Your analysis is reproducible

Working with Imported Data

Once imported, your data will retain its labels. You can inspect them with functions from the labelled package. The conversion step at the end also needs dplyr for %>%, mutate(), and across(), while is.labelled() and as_factor() come from haven:

library(labelled)
library(dplyr)

# View variable labels
look_for(df)

# Check a specific variable's value labels
val_labels(df$region)

# Convert labelled columns to factors (if needed)
df <- df %>%
  mutate(across(where(is.labelled), as_factor))
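
If you would rather not load extra packages, the same ideas can be sketched with haven alone. The vector below is invented for illustration; as_factor() and zap_labels() are haven functions, and the labels are stored as ordinary attributes.

```r
library(haven)

# Hypothetical labelled variable, similar to what read_dta()/read_sav() return
region <- labelled(
  c(1, 2, 1),
  labels = c(North = 1, South = 2),
  label  = "Region of residence"
)

attr(region, "label")    # variable label
attr(region, "labels")   # value labels

# Replace codes with their labels
region_factor <- as_factor(region)
levels(region_factor)

# Or drop the value labels and keep the plain numeric codes
region_codes <- zap_labels(region)
is.labelled(region_codes)
```

Converting to factors suits tabulation and plotting, while zap_labels() is handy when the codes themselves are the quantity of interest.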

Alternative: The `foreign` Package

If haven doesn’t work for some reason (e.g., with very old file formats or in legacy R installations), you can use the foreign package as a fallback:

# Install if necessary
# install.packages("foreign")

library(foreign)

# Import Stata files (versions up to 12)
df <- read.dta("~/Documents/my_data.dta")

# Import SPSS files
df <- read.spss("~/Documents/my_data.sav", 
                to.data.frame = TRUE,
                use.value.labels = TRUE)

Limitations of `foreign`

Older and less maintained than haven
Limited Stata support: Only works with Stata versions up to 12
Label handling: Less sophisticated preservation of metadata
Encoding issues: More prone to character encoding problems

Recommendation: Only use foreign if haven fails. In most modern workflows, haven should be your first choice.
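
One way to encode this recommendation is a small wrapper that tries haven first and falls back to foreign only on error. The function name read_stata_with_fallback() is hypothetical, and the sketch assumes both packages are installed.

```r
# Hypothetical helper: try haven first, fall back to foreign if it errors
read_stata_with_fallback <- function(path) {
  tryCatch(
    haven::read_dta(path),
    error = function(e) {
      message("haven failed (", conditionMessage(e), "); trying foreign::read.dta()")
      foreign::read.dta(path)
    }
  )
}

# Demonstration with a small temporary file
tmp <- tempfile(fileext = ".dta")
haven::write_dta(data.frame(id = 1:3, age = c(34, 51, 27)), tmp)
df_demo <- read_stata_with_fallback(tmp)
dim(df_demo)
```

Note that the two readers return differently structured results (haven returns a tibble with labelled columns, foreign a plain data frame), so downstream code may need to handle both.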

Common Issues and Solutions

Issue: File Not Found

# Error: 'my_data.dta' does not exist in current working directory

Solution: Check your working directory and use a full path:

# Check current working directory
getwd()

# Use full path instead
df <- read_dta("~/Documents/my_data.dta")
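
A quick check before importing can turn this error into a more informative message. The helper below, check_data_path(), is a hypothetical convenience function written for this example; it uses only base R.

```r
# Hypothetical helper: stop early with a clear message if the file is missing
check_data_path <- function(path) {
  full <- path.expand(path)
  if (!file.exists(full)) {
    stop(
      "File not found: ", full,
      "\nWorking directory: ", getwd(),
      call. = FALSE
    )
  }
  full
}

# An existing (empty placeholder) file passes the check
existing <- tempfile(fileext = ".dta")
file.create(existing)
checked <- check_data_path(existing)

# A missing file produces the custom error message
result <- tryCatch(
  check_data_path(file.path(tempdir(), "no_such_file.dta")),
  error = function(e) conditionMessage(e)
)
cat(result, "\n")
```

In a script you might then call read_dta(check_data_path("~/Documents/my_data.dta")).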

Issue: Encoding Problems

If you see strange characters in your imported data:

# Specify encoding explicitly
df <- read_dta("~/Documents/my_data.dta", encoding = "latin1")
# or
df <- read_sav("~/Documents/my_data.sav", encoding = "UTF-8")
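
To see why a wrong encoding produces strange characters, the base R sketch below builds the word "café" as Latin-1 bytes and converts it to UTF-8. It does not involve haven at all; it only illustrates the kind of mismatch that the encoding argument corrects.

```r
# "café" as Latin-1 bytes: the final byte 0xE9 is "é" in Latin-1
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Read as UTF-8, these bytes are invalid, which is what shows up as garbled text
validUTF8(latin1_bytes)   # FALSE

# Declaring the correct source encoding fixes the text
fixed <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
validUTF8(fixed)          # TRUE
charToRaw(fixed)          # "é" is now the two bytes c3 a9
```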

Issue: Large File Takes Too Long

For very large datasets, import only the columns you need:

# Import only specific columns
df <- read_dta("~/Documents/large_data.dta", 
               col_select = c(id, age, gender, income))
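
If you do not yet know which columns you need, one option is to read just a few rows first with the n_max argument, look at the column names, and then import the selected columns in full. The sketch below uses an invented 100-row dataset in a temporary file.

```r
library(haven)

# Hypothetical dataset standing in for a large file
toy <- data.frame(
  id     = 1:100,
  age    = rep(c(34, 51, 27, 45), 25),
  income = rep(c(42000, 58000, 31000, 67000), 25),
  notes  = rep("free text", 100)
)
tmp <- tempfile(fileext = ".dta")
write_dta(toy, tmp)

# Step 1: read only the first rows to inspect the structure
preview <- read_dta(tmp, n_max = 5)
names(preview)

# Step 2: import all rows, but only the columns you need
slim <- read_dta(tmp, col_select = c(id, income))
dim(slim)
```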

Best Practices

Always use haven as your first choice for SPSS and Stata files
Use full file paths in your scripts for reproducibility
Keep a copy of the import code from RStudio’s dialog in your script
Document your data source with comments in your script:

# Data source: National Survey 2024
# Original file: responses_final_v3.sav
# Date imported: 2024-11-10
df <- read_sav("~/Documents/projects/survey2024/responses_final_v3.sav")

Inspect labels immediately after import using look_for() or str()
Version control your scripts, not your data files

Summary of Key Functions

Package	Function	Purpose	File Types
`haven`	`read_dta()`	Import Stata files	`.dta` (Stata 8 and later)
`haven`	`read_sav()`	Import SPSS files	`.sav`, `.zsav`
`haven`	`read_por()`	Import SPSS portable files	`.por`
`foreign`	`read.dta()`	Import Stata files (legacy)	`.dta` (up to v12)
`foreign`	`read.spss()`	Import SPSS files (legacy)	`.sav`, `.por`