# Install if necessary
# install.packages("labelled")
# Load the library
library(labelled)
library(dplyr)
Searching for Variables in a Dataset
When working with large datasets, searching for and identifying variables can be cumbersome.
This is especially true when working with survey data imported from software such as Stata or SPSS. A common challenge is locating the variable labels, which describe what each variable represents.
The R package labelled provides a suite of useful functions to manage and explore these labels. One particularly powerful tool is look_for(), which simplifies the process of finding and understanding variable labels within your dataset.
Understanding Stata and SPSS Labels
Before using look_for(), it helps to understand how labels work in Stata and SPSS datasets. Labels annotate both variables and values with descriptive text, making datasets more human-readable.
For example, a numeric variable might store regional codes where 1 = North, 2 = South, and 3 = Center. These raw numbers are not very informative on their own — labels provide the necessary context that allows for easier interpretation and analysis.
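The region example above can be sketched in a few lines. This is a minimal illustration, not part of the original dataset: the object name region and its four values are made up for the demonstration, while labelled(), var_label(), val_labels(), and to_factor() are functions from the labelled package.

```r
library(labelled)

# Hypothetical vector of region codes with value labels and a variable label
region <- labelled(
  c(1, 2, 3, 1),
  labels = c(North = 1, South = 2, Center = 3),
  label = "Region of the respondent"
)

var_label(region)   # the variable label (what the variable represents)
val_labels(region)  # named vector mapping each label to its code
to_factor(region)   # factor in which codes are replaced by their labels
```

The underlying data stay numeric; the labels travel with the vector as metadata and are only substituted for the codes when you ask for it, for instance with to_factor().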
The Power of the look_for() Function
The look_for() function is a convenient way to search, explore, and document labelled data. It can generate a dictionary of all variable labels in your dataset or help you search for specific variables using keywords.
Example: Creating a Labelled Dataset
Let’s create a small labelled dataset using the labelled_spss() and labelled() functions. This mimics how survey data might appear when imported from SPSS or Stata.
df <- tibble(
  region = labelled_spss(
    c(1, 2, 1, 9, 2, 3),
    c(north = 1, south = 2, center = 3, missing = 9),
    na_values = 9,
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "f", "m", "m", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  ),
  age_g = labelled(
    c(1, 2, 3, 1, 1, 2),
    c(`18-45` = 1, `46-65` = 2, `>65` = 3),
    label = "Age group"
  )
)
We now have a simple dataset with three labelled variables:
df
# A tibble: 6 × 3
  region           sex        age_g
  <dbl+lbl>        <chr+lbl>  <dbl+lbl>
1 1 [north]        f [female] 1 [18-45]
2 2 [south]        f [female] 2 [46-65]
3 1 [north]        m [male]   3 [>65]
4 9 (NA) [missing] m [male]   1 [18-45]
5 2 [south]        m [male]   1 [18-45]
6 3 [center]       f [female] 2 [46-65]
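Row 4 prints as 9 (NA) because 9 was declared as a user-defined missing value through na_values. The sketch below rebuilds only the region column so that it runs on its own, and shows how such values can be inspected. It relies on na_values() and user_na_to_na() from labelled, and on is.na() treating user-defined missing values as missing for labelled_spss vectors.

```r
library(labelled)

# Same region vector as in the example dataset
region <- labelled_spss(
  c(1, 2, 1, 9, 2, 3),
  c(north = 1, south = 2, center = 3, missing = 9),
  na_values = 9,
  label = "Region of the respondent"
)

na_values(region)     # codes declared as user-defined missing values
is.na(region)         # flags the observation coded 9
sum(is.na(region))    # number of missing observations

# Convert user-defined missing values into regular NA
region_na <- user_na_to_na(region)
```

Keeping the code 9 with an na_values declaration preserves the reason for missingness; user_na_to_na() is the step to take once that distinction is no longer needed.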
Generating a Dictionary of Variables
Using look_for()
The look_for() function can produce a complete dictionary of all variables and their associated labels, which is extremely useful for documentation and data exploration.
# Generate a dictionary for all variables
look_for(df)
 pos variable label                    col_type missing values
 1   region   Region of the respondent dbl+lbl  1       [1] north
                                                        [2] south
                                                        [3] center
                                                        [9] missing
 2   sex      Sex of the respondent    chr+lbl  0       [f] female
                                                        [m] male
 3   age_g    Age group                dbl+lbl  0       [1] 18-45
                                                        [2] 46-65
                                                        [3] >65
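By default, look_for() prints several pieces of information:
- pos: Position of the variable in the dataset
- variable: Variable name
- label: Descriptive label for the variable
- col_type: Column type, abbreviated as in tibble output (e.g., dbl+lbl, chr+lbl)
- missing: Number of missing observations, including user-defined missing values such as the code 9 in region
- values: Value labels (or factor levels), showing the relationship between codes and labels
In the underlying object, the values are stored in two separate columns, levels and value_labels.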
Alternative: generate_dictionary()
The labelled package also provides generate_dictionary(), which serves a similar purpose to look_for():
# Generate a dictionary with full details
generate_dictionary(df)
 pos variable label                    col_type missing values
 1   region   Region of the respondent dbl+lbl  1       [1] north
                                                        [2] south
                                                        [3] center
                                                        [9] missing
 2   sex      Sex of the respondent    chr+lbl  0       [f] female
                                                        [m] male
 3   age_g    Age group                dbl+lbl  0       [1] 18-45
                                                        [2] 46-65
                                                        [3] >65
You can control the level of detail displayed using the details argument:
# Generate a simplified dictionary without value labels
generate_dictionary(df, details = FALSE)
 pos variable label
 1   region   Region of the respondent
 2   sex      Sex of the respondent
 3   age_g    Age group
When details = FALSE, the output omits the column type, the missing count, and the value labels, leaving only each variable's position, name, and label. This is particularly useful when you only need variable names and descriptions without the full value mappings.
Searching for Specific Variables
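If you're interested in specific variables, look_for() can perform keyword-based searches. Keywords are matched, ignoring case, against variable names and variable labels and, by default, against value labels as well. Each keyword is treated as a regular expression, so incomplete terms such as "reg" or "regio" also match.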
# Search for a specific variable (partial match allowed)
look_for(df, "regio")
 pos variable label                    col_type missing values
 1   region   Region of the respondent dbl+lbl  1       [1] north
                                                        [2] south
                                                        [3] center
                                                        [9] missing
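A few more search patterns, as a hedged sketch. The data frame is a shortened rebuild of the example dataset so that the snippet runs on its own, and the object names by_label, by_regex, and by_keywords are illustrative. Passing several keywords as separate arguments follows the usage shown in the package documentation.

```r
library(labelled)
library(dplyr)

# Shortened rebuild of the example dataset (three rows are enough here)
df <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  ),
  age_g = labelled(
    c(1, 2, 3),
    c(`18-45` = 1, `46-65` = 2, `>65` = 3),
    label = "Age group"
  )
)

# A keyword that appears only in the variable labels
by_label <- look_for(df, "respondent")
by_label$variable

# A regular expression: anything matching "sex" or "age"
by_regex <- look_for(df, "sex|age")
by_regex$variable

# Several keywords passed as separate arguments
by_keywords <- look_for(df, "sex", "age")
by_keywords$variable
```

Because the result is a tibble, the variable column can be reused directly, for example to select the matching columns from df.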
Controlling Output Detail
If you only want to display the variable names and labels (without additional details), you can use the details argument:
# Show only variable names that match the search
look_for(df, "regio", details = FALSE)
 pos variable label
 1   region   Region of the respondent
The details argument works similarly across both look_for() and generate_dictionary():
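- details = "basic" (default): shows the column type, missing count, and value labels
- details = FALSE (or "none"): shows a simplified view with just variable names and labels
- details = TRUE (or "full"): adds further per-variable details, which can be time consuming on large data frames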
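One way to see what each level of detail returns is to compare the column names of the resulting objects. This is a small sketch on a made-up two-variable data frame. It assumes the three documented values of details ("none", "basic", "full"), and the exact set of extra columns under "full" may differ between package versions.

```r
library(labelled)
library(dplyr)

# Small illustrative data frame with two labelled numeric variables
df <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  age_g = labelled(
    c(1, 2, 3),
    c(`18-45` = 1, `46-65` = 2, `>65` = 3),
    label = "Age group"
  )
)

dict_none  <- look_for(df, details = "none")   # same as details = FALSE
dict_basic <- look_for(df)                     # default level of detail
dict_full  <- look_for(df, details = "full")   # same as details = TRUE

# Compare the columns stored at each level
names(dict_none)
names(dict_basic)
names(dict_full)
```

The printed dictionary looks similar at the "basic" and "full" levels; the difference is mostly in the extra columns stored in the object.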
Advanced Usage and Transformations
For a more structured view, the output of look_for() can be transformed into a tidy long format. This makes it easier to work with programmatically or export to documentation.
# Transform the dictionary into a long, tidy format
df %>%
  look_for() %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()
# A tibble: 9 × 7
    pos variable label                    col_type missing levels value_labels
  <int> <chr>    <chr>                    <chr>      <int> <chr>  <chr>
1     1 region   Region of the respondent dbl+lbl        1 <NA>   [1] north
2     1 region   Region of the respondent dbl+lbl        1 <NA>   [2] south
3     1 region   Region of the respondent dbl+lbl        1 <NA>   [3] center
4     1 region   Region of the respondent dbl+lbl        1 <NA>   [9] missing
5     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [f] female
6     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [m] male
7     3 age_g    Age group                dbl+lbl        0 <NA>   [1] 18-45
8     3 age_g    Age group                dbl+lbl        0 <NA>   [2] 46-65
9     3 age_g    Age group                dbl+lbl        0 <NA>   [3] >65
This transformation is particularly useful when:
- You need to export the dictionary to a different format (e.g., Excel, CSV)
- You want to programmatically filter or manipulate the metadata
- You’re generating automated documentation or reports
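The first two use cases can be sketched as follows. The data frame is a reduced rebuild of the example dataset, and the file name dictionary.csv and the temporary directory are placeholders chosen for the illustration. Base R's write.csv() is used here, though any export function that accepts a data frame would do.

```r
library(labelled)
library(dplyr)

# Reduced rebuild of the example dataset so the snippet runs on its own
df <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  )
)

# Long-format dictionary: one row per variable and value label
dict <- df %>%
  look_for() %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()

# Filter the metadata like any other data frame
region_dict <- dict %>%
  filter(variable == "region")

# Export to CSV (placeholder path in the session's temporary directory)
out_file <- file.path(tempdir(), "dictionary.csv")
write.csv(dict, out_file, row.names = FALSE)
```

Converting the list columns to character first matters for the export step, since list columns cannot be written to a flat CSV file.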
Why Use look_for() with Survey Data?
When working with labelled survey datasets, look_for() is invaluable. It allows you to:
- Quickly understand the structure and labels within your dataset
- Search and filter variables efficiently using keywords
- Generate clean data dictionaries for reporting and documentation
- Minimize errors when selecting or merging variables across large datasets
- Share comprehensive metadata with collaborators who may not have access to the original codebook
Overall, look_for() is an essential tool for anyone managing labelled survey data in R.
Summary of Key Functions
| Function | Purpose | Key Arguments |
|---|---|---|
| look_for() | Search and explore variable labels | details: "basic" (default), "none" or FALSE (names and labels only), "full" or TRUE (all details) |
| generate_dictionary() | Generate a complete data dictionary | details: same values as look_for() |
| lookfor_to_long_format() | Convert dictionary to long format | None |
| convert_list_columns_to_character() | Convert list columns to character strings | None |