# Load the swiss dataset
data("swiss")
Subsetting and Filtering Data
In data analysis, “subsetting” and “filtering” are common tasks that allow you to extract a portion of your dataset based on specific conditions. R provides several methods to easily filter data rows based on one or more criteria.
We will demonstrate how to select certain rows or population using the swiss
dataset, which contains socio-economic data on 47 French-speaking provinces of Switzerland in the late 19th century.
First, load the built-in swiss
dataset:
Let’s focus on filtering provinces where the percentage of land dedicated to agriculture (Agriculture
) is greater than 50%.
Subsetting and Filtering Using Base R
One method to filter data is by using base R indexing, which involves logical conditions:
# Using base R indexing
<- swiss[swiss$Agriculture > 50, ]
agriculture_above_50_base
# check if it worked
# agriculture_above_50_base
In this example, we apply the logical condition swiss$Agriculture > 50
to identify rows where the Agriculture
column values are greater than 50. The resulting dataset stores only these rows.
Using the subset()
Function
The subset()
function is another convenient base R method to filter data. It allows specifying a logical expression directly:
# Using the subset function
<- subset(swiss, Agriculture > 50) agriculture_above_50_subset
This produces the same result as the previous approach, using a cleaner and more readable syntax where the filtering condition is clearly outlined within the function.
Filtering Using the Tidyverse
For those familiar with tidyverse
, the filter()
function from the dplyr
package offers a clear and efficient method for subsetting datasets:
# Load the tidyverse if not already loaded
library(tidyverse)
# Using dplyr's filter function
<- filter(swiss, Agriculture > 50) agriculture_above_50_tidy
Using the Native Pipe Operator |>
Combining the |>
operator with filter()
can improve readability.
# Using the native pipe operator |>
<- swiss |> filter(Agriculture > 50) agriculture_above_50_pipe
Here, the pipe operator |>
passes the swiss
dataset into the filter()
function, which applies the filtering condition, producing the same filtered dataset.
Complex Subsetting and Filtering with the iris
Dataset
Let’s see a more complex example with the iris
dataset, consisting of 150 observations of iris flowers with five variables: Sepal.Length
, Sepal.Width
, Petal.Length
, Petal.Width
, and Species
.
Base R Method
To subset the iris
dataset for only those observations where the species is setosa
and the sepal length is greater than 5, you can use logical operators within base R indexing.
# Using base R with logical operators
<- iris[iris$Species == "setosa" & iris$Sepal.Length > 5, ] subset_iris_base
Explanation
- The expression
iris$Species == "setosa"
checks each entry in theSpecies
column to see if it equals “setosa”. - The expression
iris$Sepal.Length > 5
checks if theSepal.Length
for each row exceeds 5. - The
&
operator combines these two logical conditions, ensuring both must be true for any row to be included insubset_iris_base
Tidyverse Method
Using filter
# Using dplyr's filter function with pipes
<- iris |>
subset_iris_tidyverse filter(Species == "setosa" & Sepal.Length > 5)
Extending the Example
We can add complexity with more conditions or operations. For instance, if you want to add more columns to your conditions, such as filtering rows where Petal.Width
is also less than 1.2, you can do:
Base R:
<- iris[iris$Species == "setosa" & iris$Sepal.Length > 5 & iris$Petal.Width < 1.2, ] more_complex_iris_base
Tidyverse:
<- iris |> more_complex_iris_tidyverse filter(Species == "setosa" & Sepal.Length > 5 & Petal.Width < 1.2)
Exercises Part 1
Exercise 1: Basic Filtering
Filter the iris
dataset to include only the rows where the Species
is versicolor
.
Task:
- Subset the
iris
dataset. - Include only rows with
Species
equal toversicolor
.
Exercise 2: Single Condition on Numerical Data
Filter the iris
dataset to include only rows where the Sepal.Width
is greater than 3.5.
Task:
- Subset the
iris
dataset. - Include only rows where
Sepal.Width
is greater than 3.5.
Exercise 3: Multiple Conditions with AND Operator
Filter the iris
dataset for observations that are of the species virginica
and have a Petal.Length
greater than 5.
Task:
- Subset the
iris
dataset. - Include rows where
Species
isvirginica
andPetal.Length
is greater than 5.
Exercise 4: Multiple Conditions with OR Operator
Filter the iris
dataset to include observations where either the Sepal.Length
is less than 5 or the Petal.Width
is greater than 1.5.
Task:
- Subset the
iris
dataset. - Include rows where
Sepal.Length
is less than 5 orPetal.Width
is greater than 1.5.
Exercise 5: Complex Filtering
Create a subset of the iris
dataset that includes rows where:
- The
Species
issetosa
. - The
Sepal.Length
is between 4.5 and 5.5 (inclusive). - The
Sepal.Width
is greater than or equal toPetal.Width
.
Task:
- Subset the
iris
dataset according to the specified conditions.
Solutions
Solution to Exercise 1
<- iris[iris$Species == "versicolor", ] iris_versicolor
Solution to Exercise 2
<- iris[iris$Sepal.Width > 3.5, ] iris_sepal_width
Solution to Exercise 3
<- iris[iris$Species == "virginica" & iris$Petal.Length > 5, ] iris_virginica_petal_length
Solution to Exercise 4
<- iris[iris$Sepal.Length < 5 | iris$Petal.Width > 1.5, ] iris_sepal_or_petal
Solution to Exercise 5
<- iris[iris$Species == "setosa" &
iris_complex $Sepal.Length >= 4.5 &
iris$Sepal.Length <= 5.5 &
iris$Sepal.Width >= iris$Petal.Width, ] iris
These exercises, varying from basic to more complex filtering operations, should enhance your ability to subset datasets using multiple conditions in R.
Exercises Part 2
Exercise 1: Basic Filtering
Filter the CASchools
dataset to include only the rows where the number of students
is greater than 2,000.
Task:
- Subset the
CASchools
dataset. - Include only rows with
students
greater than 2,000.
Exercise 2: Single Condition on Numerical Data
Filter the CASchools
dataset to include only rows where the teachers
count is less than 100.
Task:
- Subset the
CASchools
dataset. - Include only rows where
teachers
is less than 100.
Exercise 3: Multiple Conditions with AND Operator
Filter the CASchools
dataset for observations where the calworks
percentage is below 20 and the school district has more than 4,000 students.
Task:
- Subset the
CASchools
dataset. - Include rows where
calworks
is less than 20 andstudents
is greater than 4,000.
Exercise 4: Multiple Conditions with OR Operator
Filter the CASchools
dataset to include observations where the district
is greater than or equal to 900 or teachers
is more than 150.
Task:
- Subset the
CASchools
dataset. - Include rows where
district
is at least 900 orteachers
is more than 150.
Exercise 5: Complex Filtering
Create a subset of the CASchools
dataset that includes rows where:
- The district number is less than 500.
- The
students
are between 3,000 and 5,000 (inclusive). - The
calworks
percentage is greater than twice theteachers
count.
Task:
- Subset the
CASchools
dataset according to the specified conditions.
Solutions
Solution to Exercise 1
<- CASchools[CASchools$students > 2000, ] CASchools_students
Solution to Exercise 2
<- CASchools[CASchools$teachers < 100, ] CASchools_teachers
Solution to Exercise 3
<- CASchools[CASchools$calworks < 20 & CASchools$students > 4000, ] CASchools_calworks_students
Solution to Exercise 4
<- CASchools[CASchools$district >= 900 | CASchools$teachers > 150, ] CASchools_district_or_teachers
Solution to Exercise 5
<- CASchools[CASchools$district < 500 &
CASchools_complex $students >= 3000 & CASchools$students <= 5000 &
CASchools$calworks > 2 * CASchools$teachers, ] CASchools