Subsetting and Filtering Data

In data analysis, “subsetting” and “filtering” are common tasks that allow you to extract a portion of your dataset based on specific conditions. R provides several methods to easily filter data rows based on one or more criteria.

We will demonstrate how to select certain rows or population using the swiss dataset, which contains socio-economic data on 47 French-speaking provinces of Switzerland in the late 19th century.

First, load the built-in swiss dataset:

# Load the swiss dataset
data("swiss")

Let’s focus on filtering provinces where the percentage of land dedicated to agriculture (Agriculture) is greater than 50%.

Subsetting and Filtering Using Base R

One method to filter data is by using base R indexing, which involves logical conditions:

# Using base R indexing
agriculture_above_50_base <- swiss[swiss$Agriculture > 50, ]

# check if it worked
# agriculture_above_50_base

In this example, we apply the logical condition swiss$Agriculture > 50 to identify rows where the Agriculture column values are greater than 50. The resulting dataset stores only these rows.

Using the subset() Function

The subset() function is another convenient base R method to filter data. It allows specifying a logical expression directly:

# Using the subset function
agriculture_above_50_subset <- subset(swiss, Agriculture > 50)

This produces the same result as the previous approach, using a cleaner and more readable syntax where the filtering condition is clearly outlined within the function.

Filtering Using the Tidyverse

For those familiar with tidyverse, the filter() function from the dplyr package offers a clear and efficient method for subsetting datasets:

# Load the tidyverse if not already loaded
library(tidyverse)

# Using dplyr's filter function
agriculture_above_50_tidy <- filter(swiss, Agriculture > 50)

Using the Native Pipe Operator |>

Combining the |> operator with filter() can improve readability.

# Using the native pipe operator |>
agriculture_above_50_pipe <- swiss |> filter(Agriculture > 50)

Here, the pipe operator |> passes the swiss dataset into the filter() function, which applies the filtering condition, producing the same filtered dataset.


Complex Subsetting and Filtering with the iris Dataset

Let’s see a more complex example with the iris dataset, consisting of 150 observations of iris flowers with five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

Base R Method

To subset the iris dataset for only those observations where the species is setosa and the sepal length is greater than 5, you can use logical operators within base R indexing.

# Using base R with logical operators
subset_iris_base <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5, ]

Explanation

  • The expression iris$Species == "setosa" checks each entry in the Species column to see if it equals “setosa”.
  • The expression iris$Sepal.Length > 5 checks if the Sepal.Length for each row exceeds 5.
  • The & operator combines these two logical conditions, ensuring both must be true for any row to be included in subset_iris_base

Tidyverse Method

Using filter

# Using dplyr's filter function with pipes
subset_iris_tidyverse <- iris |> 
  filter(Species == "setosa" & Sepal.Length > 5)

Extending the Example

We can add complexity with more conditions or operations. For instance, if you want to add more columns to your conditions, such as filtering rows where Petal.Width is also less than 1.2, you can do:

  • Base R:

    more_complex_iris_base <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5 & iris$Petal.Width < 1.2, ]
  • Tidyverse:

    more_complex_iris_tidyverse <- iris |> 
      filter(Species == "setosa" & Sepal.Length > 5 & Petal.Width < 1.2)

Exercises Part 1

Exercise 1: Basic Filtering

Filter the iris dataset to include only the rows where the Species is versicolor.

Task:

  • Subset the iris dataset.
  • Include only rows with Species equal to versicolor.

Exercise 2: Single Condition on Numerical Data

Filter the iris dataset to include only rows where the Sepal.Width is greater than 3.5.

Task:

  • Subset the iris dataset.
  • Include only rows where Sepal.Width is greater than 3.5.

Exercise 3: Multiple Conditions with AND Operator

Filter the iris dataset for observations that are of the species virginica and have a Petal.Length greater than 5.

Task:

  • Subset the iris dataset.
  • Include rows where Species is virginica and Petal.Length is greater than 5.

Exercise 4: Multiple Conditions with OR Operator

Filter the iris dataset to include observations where either the Sepal.Length is less than 5 or the Petal.Width is greater than 1.5.

Task:

  • Subset the iris dataset.
  • Include rows where Sepal.Length is less than 5 or Petal.Width is greater than 1.5.

Exercise 5: Complex Filtering

Create a subset of the iris dataset that includes rows where:

  • The Species is setosa.
  • The Sepal.Length is between 4.5 and 5.5 (inclusive).
  • The Sepal.Width is greater than or equal to Petal.Width.

Task:

  • Subset the iris dataset according to the specified conditions.

Solutions

Solution to Exercise 1

iris_versicolor <- iris[iris$Species == "versicolor", ]

Solution to Exercise 2

iris_sepal_width <- iris[iris$Sepal.Width > 3.5, ]

Solution to Exercise 3

iris_virginica_petal_length <- iris[iris$Species == "virginica" & iris$Petal.Length > 5, ]

Solution to Exercise 4

iris_sepal_or_petal <- iris[iris$Sepal.Length < 5 | iris$Petal.Width > 1.5, ]

Solution to Exercise 5

iris_complex <- iris[iris$Species == "setosa" & 
                     iris$Sepal.Length >= 4.5 & 
                     iris$Sepal.Length <= 5.5 & 
                     iris$Sepal.Width >= iris$Petal.Width, ]

These exercises, varying from basic to more complex filtering operations, should enhance your ability to subset datasets using multiple conditions in R.


Exercises Part 2

Exercise 1: Basic Filtering

Filter the CASchools dataset to include only the rows where the number of students is greater than 2,000.

Task:

  • Subset the CASchools dataset.
  • Include only rows with students greater than 2,000.

Exercise 2: Single Condition on Numerical Data

Filter the CASchools dataset to include only rows where the teachers count is less than 100.

Task:

  • Subset the CASchools dataset.
  • Include only rows where teachers is less than 100.

Exercise 3: Multiple Conditions with AND Operator

Filter the CASchools dataset for observations where the calworks percentage is below 20 and the school district has more than 4,000 students.

Task:

  • Subset the CASchools dataset.
  • Include rows where calworks is less than 20 and students is greater than 4,000.

Exercise 4: Multiple Conditions with OR Operator

Filter the CASchools dataset to include observations where the district is greater than or equal to 900 or teachers is more than 150.

Task:

  • Subset the CASchools dataset.
  • Include rows where district is at least 900 or teachers is more than 150.

Exercise 5: Complex Filtering

Create a subset of the CASchools dataset that includes rows where:

  • The district number is less than 500.
  • The students are between 3,000 and 5,000 (inclusive).
  • The calworks percentage is greater than twice the teachers count.

Task:

  • Subset the CASchools dataset according to the specified conditions.

Solutions

Solution to Exercise 1

CASchools_students <- CASchools[CASchools$students > 2000, ]

Solution to Exercise 2

CASchools_teachers <- CASchools[CASchools$teachers < 100, ]

Solution to Exercise 3

CASchools_calworks_students <- CASchools[CASchools$calworks < 20 & CASchools$students > 4000, ]

Solution to Exercise 4

CASchools_district_or_teachers <- CASchools[CASchools$district >= 900 | CASchools$teachers > 150, ]

Solution to Exercise 5

CASchools_complex <- CASchools[CASchools$district < 500 & 
                                CASchools$students >= 3000 & CASchools$students <= 5000 & 
                                CASchools$calworks > 2 * CASchools$teachers, ]