Searching for variables in a dataset

When working with a large dataset, finding the variables you need can be cumbersome.

Moreover, when working with survey data, especially data imported from a Stata or SPSS file, a common challenge is finding the labels attached to each variable.

The labelled package in R offers a suite of helpful functions to manage and explore the labels easily. One such function, look_for, simplifies the process of searching and understanding these labels within your dataset.

# install if necessary
# install.packages("labelled")
# load the library
library(labelled)

Understanding Stata and SPSS Labels

Before looking into the look_for function, it’s helpful to understand the concept of labels in Stata or SPSS datasets. These labels annotate variables and their values with descriptive information, making datasets more user-friendly and easier to interpret. For instance, a numeric variable might encode the regions in a survey as 1, 2, and 3, with value labels north, south, and center. The raw numbers alone don’t convey that meaning, which is where labels play a crucial role.
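As a minimal sketch of what these labels look like in R, the snippet below builds a single labelled vector and reads its labels back with var_label() and val_labels() from the labelled package. The vector name region and its codes are illustrative assumptions, not taken from any real survey.

```r
library(labelled)

# Illustrative region codes with value labels and a variable label
region <- labelled(
  c(1, 2, 3, 1),
  labels = c(north = 1, south = 2, center = 3),
  label = "Region of the respondent"
)

var_label(region)   # the variable label (a single string)
val_labels(region)  # a named vector mapping each label to its code
```

The underlying data stay numeric; the labels travel with the vector as attributes.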

The Power of the look_for Function

The look_for function is extremely helpful when searching for variables and their labels. It also allows you to quickly generate a dictionary of labels for all variables in your dataset or search for specific variable labels using keywords.

Example: Creating Labelled Data

Let’s start by creating a small labelled dataset using the labelled_spss() and labelled() functions, which mimic how survey data might appear when imported.

Simply run the following chunk of code:

library(labelled)

d <- dplyr::tibble(
      region = labelled_spss(
        c(1, 2, 1, 9, 2, 3),
        c(north = 1, south = 2, center = 3, missing = 9),
        na_values = 9,
        label = "Region of the respondent"
      ),
      sex = labelled(
        c("f", "f", "m", "m", "m", "f"),
        c(female = "f", male = "m"),
        label = "Sex of the respondent"
      ), 
      age_g = labelled(
        c(1, 2, 3, 1, 1, 2), # age-group codes, one per respondent
        c(`18-45` = 1, `46-65` = 2, `>65` = 3),
        label = "Age groups"
      )
    )

We now have a very basic dataset with three labelled variables:

d
# A tibble: 6 × 3
  region           sex        age_g    
  <dbl+lbl>        <chr+lbl>  <dbl+lbl>
1 1 [north]        f [female] 1 [18-45]
2 2 [south]        f [female] 2 [46-65]
3 1 [north]        m [male]   3 [>65]  
4 9 (NA) [missing] m [male]   1 [18-45]
5 2 [south]        m [male]   1 [18-45]
6 3 [center]       f [female] 2 [46-65]

Generating a Dictionary of All Variables

The look_for function generates a comprehensive dictionary of all variable labels in the dataset, allowing for a quick overview.

# Generate dictionary for all variables
look_for(d)
 pos variable label                    col_type missing values     
 1   region   Region of the respondent dbl+lbl  1       [1] north  
                                                        [2] south  
                                                        [3] center 
                                                        [9] missing
 2   sex      Sex of the respondent    chr+lbl  0       [f] female 
                                                        [m] male   
 3   age_g    Age groups               dbl+lbl  0       [1] 18-45  
                                                        [2] 46-65  
                                                        [3] >65    

Searching for Specific Variables

To find specific variables, pass a keyword to look_for.

look_for supports partial matches, so you can enter reg, regio, and so on.

# Search for specific variable, even with incomplete words
look_for(d, "regio")
 pos variable label                    col_type missing values     
 1   region   Region of the respondent dbl+lbl  1       [1] north  
                                                        [2] south  
                                                        [3] center 
                                                        [9] missing

If you only need the variable names and labels, set details = FALSE:

look_for(d, "regio", details = FALSE)
 pos variable label                   
 1   region   Region of the respondent
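Keywords are matched against variable labels as well as variable names, and they are treated as regular expressions. The sketch below rebuilds a smaller version of the example data so it can run on its own (the object names by_label and by_any are illustrative) and shows a search on a word that appears only in the labels, followed by a pattern that matches either of two terms.

```r
library(labelled)

# A reduced copy of the example data, so this chunk is self-contained
d <- tibble::tibble(
  region = labelled(c(1, 2, 3), c(north = 1, south = 2, center = 3),
                    label = "Region of the respondent"),
  sex = labelled(c("f", "m", "f"), c(female = "f", male = "m"),
                 label = "Sex of the respondent"),
  age_g = labelled(c(1, 2, 3), c(`18-45` = 1, `46-65` = 2, `>65` = 3),
                   label = "Age groups")
)

# "respondent" appears in two variable labels but in no variable name
by_label <- look_for(d, "respondent", details = FALSE)

# A regular expression matching either "sex" or "age"
by_any <- look_for(d, "sex|age", details = FALSE)

by_label$variable
by_any$variable
```

Matching is case-insensitive by default, which is why "age" also finds the label "Age groups".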

Advanced Usage with Transformations

For a more detailed examination of your dictionary output, you can transform it into a long-format tibble with one row per value label.

# Convenient transformation of dictionary
# (the %>% pipe is made available by loading dplyr)
library(dplyr)
d %>%
  look_for() %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()
# A tibble: 9 × 7
    pos variable label                    col_type missing levels value_labels
  <int> <chr>    <chr>                    <chr>      <int> <chr>  <chr>       
1     1 region   Region of the respondent dbl+lbl        1 <NA>   [1] north   
2     1 region   Region of the respondent dbl+lbl        1 <NA>   [2] south   
3     1 region   Region of the respondent dbl+lbl        1 <NA>   [3] center  
4     1 region   Region of the respondent dbl+lbl        1 <NA>   [9] missing 
5     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [f] female  
6     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [m] male    
7     3 age_g    Age groups               dbl+lbl        0 <NA>   [1] 18-45   
8     3 age_g    Age groups               dbl+lbl        0 <NA>   [2] 46-65   
9     3 age_g    Age groups               dbl+lbl        0 <NA>   [3] >65     
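Because the long-format dictionary is an ordinary tibble, it can be filtered or saved like any other table. The following sketch, which assumes a reduced two-variable version of the example data and a hypothetical file name codebook.csv in a temporary directory, keeps the rows for one variable and writes the whole dictionary to disk as a simple codebook.

```r
library(labelled)

# A reduced copy of the example data, so this chunk is self-contained
d <- tibble::tibble(
  region = labelled(c(1, 2, 3), c(north = 1, south = 2, center = 3),
                    label = "Region of the respondent"),
  sex = labelled(c("f", "m", "f"), c(female = "f", male = "m"),
                 label = "Sex of the respondent")
)

# Same transformation as above, written without a pipe
dict <- convert_list_columns_to_character(
  lookfor_to_long_format(look_for(d))
)

# Value labels of a single variable
region_rows <- dict[dict$variable == "region", c("variable", "value_labels")]
region_rows

# Save the full dictionary as a CSV codebook (hypothetical file name)
codebook_path <- file.path(tempdir(), "codebook.csv")
write.csv(dict, codebook_path, row.names = FALSE)
```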

Why Use look_for with Survey Data?

When working with survey datasets imported from Stata or SPSS, look_for becomes invaluable. It allows you to:

  • Quickly ascertain the structure and labeling within your datasets.
  • Search and filter relevant variables based on labels, saving time and minimising errors.
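
To illustrate the import workflow without a real survey file, the sketch below writes a tiny labelled dataset to a temporary Stata file with haven::write_dta(), reads it back with haven::read_dta(), and searches the result. The object names survey, imported, and found are illustrative assumptions; with real data you would only need the read_dta() and look_for() steps.

```r
library(labelled)

# Build a tiny labelled dataset and write it to a temporary .dta file
survey <- tibble::tibble(
  region = labelled(
    c(1, 2, 3, 1),
    c(north = 1, south = 2, center = 3),
    label = "Region of the respondent"
  )
)
path <- tempfile(fileext = ".dta")
haven::write_dta(survey, path)

# Import it as you would a real survey file, then search the labels
imported <- haven::read_dta(path)
found <- look_for(imported, "respondent")
found$variable
```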