Selecting Columns and Variables

In this section, we will focus on selecting specific columns using R.

We will work with the dataset CASchools from the AER library.

# Install the library if not already installed
# install.packages("AER")

# Load the library and data
library(AER)
data("CASchools")

We also load the tidyverse library.

# Install the tidyverse package if not already installed
# install.packages("tidyverse")

# Load the tidyverse library
library(tidyverse)

We can use ? to open the help page for this dataset.

?CASchools

Description

The dataset contains data on test performance, school characteristics and student demographic backgrounds for school districts in California.

Let’s use the head() function to examine the first six rows of the dataset (six is the default).

head(CASchools)
  district                          school  county grades students teachers
1    75119              Sunol Glen Unified Alameda  KK-08      195    10.90
2    61499            Manzanita Elementary   Butte  KK-08      240    11.15
3    61549     Thermalito Union Elementary   Butte  KK-08     1550    82.90
4    61457 Golden Feather Union Elementary   Butte  KK-08      243    14.00
5    61523        Palermo Union Elementary   Butte  KK-08     1335    71.50
6    62042         Burrel Union Elementary  Fresno  KK-08      137     6.40
  calworks   lunch computer expenditure    income   english  read  math
1   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6 690.0
2  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5 661.9
3  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3 650.9
4  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9 643.5
5  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8 639.9
6  12.3188 86.9565       25    5580.147 10.415000 12.408759 605.7 605.4
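Before choosing columns, it can help to see how many rows head() returns and which column names are available. The following is a small sketch along those lines; the object name first_five is only illustrative and does not appear elsewhere in this section.

```r
library(AER)
data("CASchools")

# head() returns six rows by default; the n argument changes that
first_five = head(CASchools, n = 5)
nrow(first_five)

# List every column name, i.e. the candidates for selection
names(CASchools)
```

The names printed here are the ones that can be passed to the selection methods that follow.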

Next, we’ll create a new dataset containing only the variables: district, students, teachers, and calworks. R provides various methods to accomplish this task.

Selecting Variables - Base R Method

We’ll use the c() function to specify the names of the desired variables.

We assign the result to a new dataset named CASchools_select:

CASchools_select = CASchools[ , c("district", "students", "teachers", "calworks")]

This code will extract the specified columns from the CASchools dataset.
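One detail of bracket selection is worth checking: when only a single column is requested, base R returns a plain vector rather than a data frame unless drop = FALSE is supplied. The sketch below illustrates this; the names single_vec and single_df are illustrative only.

```r
library(AER)
data("CASchools")

CASchools_select = CASchools[ , c("district", "students", "teachers", "calworks")]
names(CASchools_select)   # confirm which columns were kept

# Selecting a single column this way returns a plain vector...
single_vec = CASchools[ , "students"]
is.data.frame(single_vec)

# ...unless drop = FALSE is added, which keeps the data frame structure
single_df = CASchools[ , "students", drop = FALSE]
is.data.frame(single_df)
```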

Selecting Variables - Tidyverse Method

An alternative method to select specific columns is the select function from the tidyverse (it is part of the dplyr package, which library(tidyverse) loads). It can simplify the process with a more intuitive syntax.

The select function is designed to make column selection straightforward. You simply pass the dataframe and then list the column names you wish to retain. Here’s how you can create a new dataset using select:

CASchools_select = CASchools |> select(district, students, teachers, calworks)

In this code:

- The |> operator (the “forward pipe operator”) is used to pass the CASchools dataset into the select function.
- The select function takes the dataset and returns a new one containing only the specified columns: district, students, teachers, and calworks.
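Beyond listing names one by one, select also accepts helper functions. The sketch below shows a few that dplyr provides (all_of, a minus sign for dropping, and starts_with); the object names wanted, by_name, no_school, and c_columns are illustrative and not part of the original example.

```r
library(AER)
library(tidyverse)
data("CASchools")

# Select using a character vector of names, mirroring the base R call
wanted = c("district", "students", "teachers", "calworks")
by_name = CASchools |> select(all_of(wanted))

# Drop one column instead of listing the ones to keep
no_school = CASchools |> select(-school)

# Keep columns whose names start with a given prefix
c_columns = CASchools |> select(starts_with("c"))
names(c_columns)
```

Which form is most convenient depends on the task: all_of is handy when the column names are already stored in a variable, while the minus sign is shorter when only a few columns need to be removed.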