Searching for Variables in a Dataset

When working with large datasets, searching for and identifying variables can be cumbersome.

This is especially true when working with survey data imported from software such as Stata or SPSS. A common challenge is locating the variable labels, which describe what each variable represents.

The R package labelled provides a suite of useful functions to manage and explore these labels. One particularly powerful tool is look_for(), which simplifies the process of finding and understanding variable labels within your dataset.

# Install if necessary
# install.packages("labelled")

# Load the library
library(labelled)
library(dplyr)

Understanding Stata and SPSS Labels

Before using look_for(), it helps to understand how labels work in Stata and SPSS datasets. Labels annotate both variables and values with descriptive text, making datasets more human-readable.

For example, a numeric variable might store regional codes where 1 = North, 2 = South, and 3 = Center. On their own, these raw numbers are not very informative; labels supply the context needed to interpret and analyze them.
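
To make the distinction between variable labels and value labels concrete, here is a minimal sketch that builds a single labelled vector and reads both kinds of label back. The vector name region and its values are made up for illustration; the functions used (labelled(), var_label(), val_labels(), to_factor()) all come from the labelled package.

```r
library(labelled)

# Illustrative vector of region codes with value labels and a variable label
region <- labelled(
  c(1, 2, 3, 1),
  c(North = 1, South = 2, Center = 3),
  label = "Region of the respondent"
)

var_label(region)   # variable label: what the variable represents
val_labels(region)  # value labels: named vector mapping text to codes
to_factor(region)   # factor that uses the labels in place of the codes
```

The codes stay untouched in the data; the labels travel with the vector as attributes, which is what look_for() reads later on.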


The Power of the look_for() Function

The look_for() function is a convenient way to search, explore, and document labelled data. It can generate a dictionary of all variable labels in your dataset or help you search for specific variables using keywords.


Example: Creating a Labelled Dataset

Let’s create a small labelled dataset using the labelled_spss() and labelled() functions. This mimics how survey data might appear when imported from SPSS or Stata.

df <- tibble(
  region = labelled_spss(
    c(1, 2, 1, 9, 2, 3),
    c(north = 1, south = 2, center = 3, missing = 9),
    na_values = 9,
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "f", "m", "m", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  ), 
  age_g = labelled(
    c(1, 2, 3, 1, 1, 2),
    c(`18-45` = 1, `46-65` = 2, `>65` = 3),
    label = "Age group"
  )
)

We now have a simple dataset with three labelled variables:

df
# A tibble: 6 × 3
  region           sex        age_g    
  <dbl+lbl>        <chr+lbl>  <dbl+lbl>
1 1 [north]        f [female] 1 [18-45]
2 2 [south]        f [female] 2 [46-65]
3 1 [north]        m [male]   3 [>65]  
4 9 (NA) [missing] m [male]   1 [18-45]
5 2 [south]        m [male]   1 [18-45]
6 3 [center]       f [female] 2 [46-65]

Generating a Dictionary of Variables

Using look_for()

The look_for() function can produce a complete dictionary of all variables and their associated labels, which is extremely useful for documentation and data exploration.

# Generate a dictionary for all variables
look_for(df)
 pos variable label                    col_type missing values     
 1   region   Region of the respondent dbl+lbl  1       [1] north  
                                                        [2] south  
                                                        [3] center 
                                                        [9] missing
 2   sex      Sex of the respondent    chr+lbl  0       [f] female 
                                                        [m] male   
 3   age_g    Age group                dbl+lbl  0       [1] 18-45  
                                                        [2] 46-65  
                                                        [3] >65    

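By default, the printed output of look_for() contains the following columns:

  • pos: Position of the variable in the dataset
  • variable: Variable name
  • label: Descriptive label for the variable
  • col_type: Abbreviated column type (e.g., dbl+lbl for a labelled numeric, chr+lbl for a labelled character)
  • missing: Number of missing values
  • values: Factor levels or value labels, shown as [code] label

In the underlying tibble, the values column is stored as two list columns, levels (for factors) and value_labels (for labelled vectors), as the long-format output later in this post shows.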
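
Because the result of look_for() is itself a data frame, its columns can be inspected like any other. The sketch below uses a small made-up dataset (toy is a hypothetical name, not part of the original example) and should be read as an illustration rather than a full reference of the returned structure.

```r
library(labelled)
library(dplyr)

# Small illustrative dataset (values are made up for this sketch)
toy <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  age = c(34, 51, 70)
)

dict <- look_for(toy)

names(dict)    # columns stored in the underlying tibble
dict$variable  # variable names
dict$label     # variable labels
```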

Alternative: generate_dictionary()

The labelled package also provides generate_dictionary(), which produces the same output as look_for() called without keywords:

# Generate a dictionary with full details
generate_dictionary(df)
 pos variable label                    col_type missing values     
 1   region   Region of the respondent dbl+lbl  1       [1] north  
                                                        [2] south  
                                                        [3] center 
                                                        [9] missing
 2   sex      Sex of the respondent    chr+lbl  0       [f] female 
                                                        [m] male   
 3   age_g    Age group                dbl+lbl  0       [1] 18-45  
                                                        [2] 46-65  
                                                        [3] >65    

You can control the level of detail displayed using the details argument:

# Generate a simplified dictionary without value labels
generate_dictionary(df, details = FALSE)
 pos variable label                   
 1   region   Region of the respondent
 2   sex      Sex of the respondent   
 3   age_g    Age group               

When details = FALSE, the output keeps only the position, name, and label of each variable, dropping the column type, missing-value count, and value labels. This is useful when you only need variable names and descriptions without the full value mappings.


Searching for Specific Variables

If you’re interested in specific variables, look_for() can perform keyword-based searches. Keywords are matched against variable names and, by default, variable labels. A partial match is enough, so you can search for incomplete terms such as "reg" or "regio".

# Search for a specific variable (partial match allowed)
look_for(df, "regio")
 pos variable label                    col_type missing values     
 1   region   Region of the respondent dbl+lbl  1       [1] north  
                                                        [2] south  
                                                        [3] center 
                                                        [9] missing
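
A few more search patterns can be sketched in the same way. The example below rebuilds a reduced version of the dataset under the hypothetical name survey so that it runs on its own; it relies on the documented behavior that look_for() accepts several keywords and has an ignore.case argument, so treat it as a sketch to adapt to your installed version of labelled.

```r
library(labelled)
library(dplyr)

# Reduced copy of the example data, renamed so the sketch is self-contained
survey <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  ),
  age_g = labelled(
    c(1, 2, 3),
    c(`18-45` = 1, `46-65` = 2, `>65` = 3),
    label = "Age group"
  )
)

# A keyword that appears in variable labels rather than variable names
look_for(survey, "respondent")

# Several keywords at once: a variable matching any of them is returned
look_for(survey, "sex", "age")

# Case-sensitive search: "Age" matches the label "Age group", not the name age_g
look_for(survey, "Age", ignore.case = FALSE)
```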

Controlling Output Detail

If you only want the position, name, and label of each matching variable (without additional details), you can set the details argument to FALSE:

# Show only position, name, and label for variables that match the search
look_for(df, "regio", details = FALSE)
 pos variable label                   
 1   region   Region of the respondent

The details argument works similarly across both look_for() and generate_dictionary():

  • details = TRUE (the default behavior): Shows the column type, missing-value count, and value labels in addition to names and labels
  • details = FALSE: Shows a simplified view with just positions, variable names, and labels
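
Recent versions of labelled also accept character values for details: "none", "basic", and "full", with FALSE documented as equivalent to "none" and TRUE to "basic". The sketch below compares the columns returned at each level. The dataset name toy is hypothetical, and the exact set of extra columns under "full" may vary between package versions, so check names() on your own installation.

```r
library(labelled)
library(dplyr)

# Small illustrative dataset (values are made up for this sketch)
toy <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  )
)

dict_none  <- look_for(toy, details = "none")   # position, name, label only
dict_basic <- look_for(toy, details = "basic")  # adds type, missing count, values
dict_full  <- look_for(toy, details = "full")   # adds further per-variable metadata

# Compare which columns each level of detail returns
names(dict_none)
names(dict_basic)
names(dict_full)
```

The package documentation notes that full details can be slow on big data frames, so "full" is best reserved for cases where the extra metadata is actually needed.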

Advanced Usage and Transformations

For a more structured view, the output of look_for() can be transformed into a tidy long format. This makes it easier to work with programmatically or export to documentation.

# Transform the dictionary into a long, tidy format
df %>%
  look_for() %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()
# A tibble: 9 × 7
    pos variable label                    col_type missing levels value_labels
  <int> <chr>    <chr>                    <chr>      <int> <chr>  <chr>       
1     1 region   Region of the respondent dbl+lbl        1 <NA>   [1] north   
2     1 region   Region of the respondent dbl+lbl        1 <NA>   [2] south   
3     1 region   Region of the respondent dbl+lbl        1 <NA>   [3] center  
4     1 region   Region of the respondent dbl+lbl        1 <NA>   [9] missing 
5     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [f] female  
6     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [m] male    
7     3 age_g    Age group                dbl+lbl        0 <NA>   [1] 18-45   
8     3 age_g    Age group                dbl+lbl        0 <NA>   [2] 46-65   
9     3 age_g    Age group                dbl+lbl        0 <NA>   [3] >65     

This transformation is particularly useful when:

  • You need to export the dictionary to a different format (e.g., Excel, CSV)
  • You want to programmatically filter or manipulate the metadata
  • You’re generating automated documentation or reports
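
As a rough sketch of the first two use cases, the long-format dictionary can be filtered with dplyr and written to a CSV file with base R. The dataset name toy and the temporary output path are assumptions made for this example; in practice you would point write.csv() at a real file.

```r
library(labelled)
library(dplyr)

# Small illustrative dataset (values are made up for this sketch)
toy <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  )
)

# One row per variable and value label, with list columns flattened to text
dict_long <- toy %>%
  look_for() %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()

# Keep only the rows that describe a single variable
dict_long %>%
  filter(variable == "region")

# Write the dictionary to a temporary CSV file (replace with a real path)
out_file <- tempfile(fileext = ".csv")
write.csv(as.data.frame(dict_long), out_file, row.names = FALSE)
```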

Why Use look_for() with Survey Data?

When working with labelled survey datasets, look_for() is invaluable. It allows you to:

  • Quickly understand the structure and labels within your dataset
  • Search and filter variables efficiently using keywords
  • Generate clean data dictionaries for reporting and documentation
  • Minimize errors when selecting or merging variables across large datasets
  • Share comprehensive metadata with collaborators who may not have access to the original codebook
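
One way to reduce selection errors is to feed the search result straight into dplyr::select() instead of typing variable names by hand. This is a minimal sketch of that idea, not a feature described above; the names survey, matches, and selected are hypothetical.

```r
library(labelled)
library(dplyr)

# Small illustrative dataset (values are made up for this sketch)
survey <- tibble(
  region = labelled(
    c(1, 2, 3),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  ),
  sex = labelled(
    c("f", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  ),
  age_g = labelled(
    c(1, 2, 3),
    c(`18-45` = 1, `46-65` = 2, `>65` = 3),
    label = "Age group"
  )
)

# Find every variable whose name or label mentions "respondent"
matches <- look_for(survey, "respondent")

# Select exactly those columns; all_of() errors if a name is missing
selected <- survey %>%
  select(all_of(unname(as.character(matches$variable))))

names(selected)
```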

Overall, look_for() is an essential tool for anyone managing labelled survey data in R.


Summary of Key Functions

Function                             Purpose                                    Key Arguments
look_for()                           Search and explore variable labels         details: Show full info (TRUE) or simplified (FALSE)
generate_dictionary()                Generate a complete data dictionary        details: Include value labels (TRUE) or exclude (FALSE)
lookfor_to_long_format()             Convert dictionary to long format          None
convert_list_columns_to_character()  Convert list columns to character strings  None