# install if necessary
# install.packages("labelled")
# load the library
library(labelled)
Searching for variables in a dataset
When working with a large dataset, searching for and finding variables can be cumbersome.
Moreover, when working with survey data, especially data downloaded or imported from a Stata or SPSS file, one common challenge is finding the labels attached to the variables.
The labelled package in R offers a suite of helpful functions to manage and explore these labels easily. One such function, look_for, simplifies the process of searching and understanding these labels within your dataset.
Understanding Stata and SPSS Labels
Before looking into the look_for function, it’s helpful to understand the concept of labels in Stata or SPSS datasets. These labels annotate variables and their values with descriptive information, making datasets more user-friendly and easier to interpret. For instance, a numerical variable might represent regions in a survey, with the values 1, 2, and 3 labelled as north, south, and center. The raw numbers alone don’t convey this information, which is where labels play a crucial role.
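To make the idea concrete, here is a minimal sketch of how such labels are stored and read back in R with the labelled package. The vector region and its codes are made up for illustration.

```r
library(labelled)

# Hypothetical region codes, as they might arrive from a Stata or SPSS file
region <- labelled(
  c(1, 2, 3, 1),
  labels = c(north = 1, south = 2, center = 3),
  label = "Region of the respondent"
)

var_label(region)   # the variable label: a description of the variable itself
val_labels(region)  # the value labels: a named vector mapping labels to codes
to_factor(region)   # the same data with codes replaced by their labels
```

The variable label describes the column as a whole, while the value labels describe the individual codes stored in it.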
The Power of the look_for Function
The look_for function is extremely helpful when searching for variables and their labels. It lets you quickly generate a dictionary of labels for all variables in your dataset, or search for specific variables using keywords.
Example: Creating Labelled Data
Let’s start by creating a small labelled dataset using the labelled_spss() and labelled() functions, which mimic how survey data might appear when imported.
Simply run the following chunk of code:
library(labelled)
d <- dplyr::tibble(
  # Region codes; 9 is declared as a user-defined missing value
  region = labelled_spss(
    c(1, 2, 1, 9, 2, 3),
    c(north = 1, south = 2, center = 3, missing = 9),
    na_values = 9,
    label = "Region of the respondent"
  ),
  # Character codes can carry value labels too
  sex = labelled(
    c("f", "f", "m", "m", "m", "f"),
    c(female = "f", male = "m"),
    label = "Sex of the respondent"
  ),
  # Age groups coded 1 to 3
  age_g = labelled(
    c(1, 2, 3, 1, 1, 2),
    c(`18-45` = 1, `46-65` = 2, `>65` = 3),
    label = "Age groups"
  )
)
We now have a very basic dataset with three labelled variables:
d
# A tibble: 6 × 3
  region           sex        age_g    
  <dbl+lbl>        <chr+lbl>  <dbl+lbl>
1 1 [north]        f [female] 1 [18-45]
2 2 [south]        f [female] 2 [46-65]
3 1 [north]        m [male]   3 [>65]  
4 9 (NA) [missing] m [male]   1 [18-45]
5 2 [south]        m [male]   1 [18-45]
6 3 [center]       f [female] 2 [46-65]
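The (NA) marker in the fourth row comes from the na_values argument of labelled_spss(), which declares code 9 as a user-defined missing value. As a small sketch, the region column can be rebuilt on its own to show how these values behave:

```r
library(labelled)

# The region variable from the example, rebuilt as a standalone vector
region <- labelled_spss(
  c(1, 2, 1, 9, 2, 3),
  labels = c(north = 1, south = 2, center = 3, missing = 9),
  na_values = 9,
  label = "Region of the respondent"
)

na_values(region)      # the codes declared as user-defined missing
is.na(region)          # TRUE for the observation holding code 9
user_na_to_na(region)  # the same vector with code 9 turned into a regular NA
```

The raw code 9 stays in the data until it is explicitly converted, which keeps the information about why a value is missing.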
Generating a Dictionary of All Variables
The look_for function generates a comprehensive dictionary of all variable labels in the dataset, allowing for a quick overview.
# Generate dictionary for all variables
look_for(d)
 pos variable label                    col_type missing values     
 1   region   Region of the respondent dbl+lbl  1       [1] north  
                                                        [2] south  
                                                        [3] center 
                                                        [9] missing
 2   sex      Sex of the respondent    chr+lbl  0       [f] female 
                                                        [m] male   
 3   age_g    Age groups               dbl+lbl  0       [1] 18-45  
                                                        [2] 46-65  
                                                        [3] >65    
Searching for Specific Variables
If you’re interested in specific variables, pass a keyword to look_for. It supports partial matches, so you can enter reg, regio, etc.
# Search for specific variable, even with incomplete words
look_for(d, "regio")
 pos variable label                    col_type missing values     
 1   region   Region of the respondent dbl+lbl  1       [1] north  
                                                        [2] south  
                                                        [3] center 
                                                        [9] missing
If you only need the variable names and labels, set details = FALSE:
look_for(d, "regio", details = FALSE)
 pos variable label                   
 1   region   Region of the respondent
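Keywords are matched against variable labels as well as variable names. The sketch below illustrates this on a shortened copy of the example dataset (three rows, no missing codes), which is rebuilt here only so the snippet runs on its own.

```r
library(labelled)

# Shortened copy of the example dataset, for illustration only
d <- dplyr::tibble(
  region = labelled(c(1, 2, 3), c(north = 1, south = 2, center = 3),
                    label = "Region of the respondent"),
  sex = labelled(c("f", "m", "f"), c(female = "f", male = "m"),
                 label = "Sex of the respondent"),
  age_g = labelled(c(1, 2, 3), c(`18-45` = 1, `46-65` = 2, `>65` = 3),
                   label = "Age groups")
)

# "respondent" appears in no variable name, only in two variable labels
look_for(d, "respondent")

# Several keywords can be supplied at once
look_for(d, "sex", "age")

# The result is a data frame, so the matching names can be pulled out directly
matches <- look_for(d, "respondent")$variable
matches
```

Extracting the variable column this way is handy when the matching names are needed for a later step, such as selecting those columns from the dataset.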
Advanced Usage with Transformations
For a more detailed examination of your dictionary output, you can transform it into a more accessible format.
# Convenient transformation of dictionary
library(dplyr) # provides the %>% pipe
d %>%
  look_for() %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()
# A tibble: 9 × 7
    pos variable label                    col_type missing levels value_labels
  <int> <chr>    <chr>                    <chr>      <int> <chr>  <chr>       
1     1 region   Region of the respondent dbl+lbl        1 <NA>   [1] north   
2     1 region   Region of the respondent dbl+lbl        1 <NA>   [2] south   
3     1 region   Region of the respondent dbl+lbl        1 <NA>   [3] center  
4     1 region   Region of the respondent dbl+lbl        1 <NA>   [9] missing 
5     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [f] female  
6     2 sex      Sex of the respondent    chr+lbl        0 <NA>   [m] male    
7     3 age_g    Age groups               dbl+lbl        0 <NA>   [1] 18-45   
8     3 age_g    Age groups               dbl+lbl        0 <NA>   [2] 46-65   
9     3 age_g    Age groups               dbl+lbl        0 <NA>   [3] >65     
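Once the dictionary is in long format with plain character columns, ordinary data frame tools apply to it. The following sketch filters the dictionary and saves it as a codebook. It uses a shortened, made-up copy of the example data, and codebook_path is a temporary placeholder path, not part of the original workflow.

```r
library(labelled)

# Shortened copy of the example dataset, for illustration only
d <- dplyr::tibble(
  region = labelled(c(1, 2, 3), c(north = 1, south = 2, center = 3),
                    label = "Region of the respondent"),
  sex = labelled(c("f", "m", "f"), c(female = "f", male = "m"),
                 label = "Sex of the respondent")
)

# Build the long-format dictionary step by step (no pipe required)
dict <- look_for(d)
dict <- lookfor_to_long_format(dict)
dict <- convert_list_columns_to_character(dict)

# One row per value label: keep only the rows describing region
region_rows <- dict[dict$variable == "region", ]
region_rows

# Save the full dictionary as a CSV codebook (temporary placeholder path)
codebook_path <- tempfile(fileext = ".csv")
write.csv(dict, codebook_path, row.names = FALSE)
```

Converting the list columns to character first is what makes the dictionary writable to a flat file.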
Why Use look_for with Survey Data?
When working with survey datasets imported from Stata or SPSS, look_for becomes invaluable. It allows users to:
- Quickly ascertain the structure and labeling within their datasets.
- Search and filter relevant variables based on labels, saving time and minimising errors.
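As a rough end-to-end sketch of that workflow, the snippet below writes a small made-up dataset to a temporary SPSS file with haven::write_sav(), reads it back with haven::read_sav(), and searches the result. The survey data and the file path are hypothetical stand-ins for a real downloaded file.

```r
library(labelled)

# Hypothetical survey data standing in for a downloaded file
survey <- dplyr::tibble(
  region = labelled(c(1, 2, 3), c(north = 1, south = 2, center = 3),
                    label = "Region of the respondent"),
  age_g = labelled(c(1, 2, 3), c(`18-45` = 1, `46-65` = 2, `>65` = 3),
                   label = "Age groups")
)

# Write to a temporary .sav file, then import it as you would a real download
path <- tempfile(fileext = ".sav")
haven::write_sav(survey, path)
imported <- haven::read_sav(path)

# The labels come along with the import, so they can be searched right away
found <- look_for(imported, "region")
found
```

With a real file, only the path passed to haven::read_sav() (or haven::read_dta() for Stata) would change.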