# Load the swiss dataset
data("swiss")Subsetting and Filtering Data
In data analysis, “subsetting” and “filtering” are common tasks that allow you to extract a portion of your dataset based on specific conditions. R provides several methods to easily filter data rows based on one or more criteria.
We will demonstrate how to select certain rows or population using the swiss dataset, which contains socio-economic data on 47 French-speaking provinces of Switzerland in the late 19th century.
First, load the built-in swiss dataset:
Let’s focus on filtering provinces where the percentage of land dedicated to agriculture (Agriculture) is greater than 50%.
Subsetting and Filtering Using Base R
One method to filter data is by using base R indexing, which involves logical conditions:
# Using base R indexing
agriculture_above_50_base <- swiss[swiss$Agriculture > 50, ]
# check if it worked
# agriculture_above_50_baseIn this example, we apply the logical condition swiss$Agriculture > 50 to identify rows where the Agriculture column values are greater than 50. The resulting dataset stores only these rows.
Using the subset() Function
The subset() function is another convenient base R method to filter data. It allows specifying a logical expression directly:
# Using the subset function
agriculture_above_50_subset <- subset(swiss, Agriculture > 50)This produces the same result as the previous approach, using a cleaner and more readable syntax where the filtering condition is clearly outlined within the function.
Filtering Using the Tidyverse
For those familiar with tidyverse, the filter() function from the dplyr package offers a clear and efficient method for subsetting datasets:
# Load the tidyverse if not already loaded
library(tidyverse)
# Using dplyr's filter function
agriculture_above_50_tidy <- filter(swiss, Agriculture > 50)Using the Native Pipe Operator |>
Combining the |> operator with filter() can improve readability.
# Using the native pipe operator |>
agriculture_above_50_pipe <- swiss |> filter(Agriculture > 50)Here, the pipe operator |> passes the swiss dataset into the filter() function, which applies the filtering condition, producing the same filtered dataset.
Complex Subsetting and Filtering with the iris Dataset
Let’s see a more complex example with the iris dataset, consisting of 150 observations of iris flowers with five variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
Base R Method
To subset the iris dataset for only those observations where the species is setosa and the sepal length is greater than 5, you can use logical operators within base R indexing.
# Using base R with logical operators
subset_iris_base <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5, ]Explanation
- The expression
iris$Species == "setosa"checks each entry in theSpeciescolumn to see if it equals “setosa”. - The expression
iris$Sepal.Length > 5checks if theSepal.Lengthfor each row exceeds 5. - The
&operator combines these two logical conditions, ensuring both must be true for any row to be included insubset_iris_base
Tidyverse Method
Using filter
# Using dplyr's filter function with pipes
subset_iris_tidyverse <- iris |>
filter(Species == "setosa" & Sepal.Length > 5)Extending the Example
We can add complexity with more conditions or operations. For instance, if you want to add more columns to your conditions, such as filtering rows where Petal.Width is also less than 1.2, you can do:
Base R:
more_complex_iris_base <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5 & iris$Petal.Width < 1.2, ]Tidyverse:
more_complex_iris_tidyverse <- iris |> filter(Species == "setosa" & Sepal.Length > 5 & Petal.Width < 1.2)
Exercises Part 1
Exercise 1: Basic Filtering
Filter the iris dataset to include only the rows where the Species is versicolor.
Task:
- Subset the
irisdataset. - Include only rows with
Speciesequal toversicolor.
Exercise 2: Single Condition on Numerical Data
Filter the iris dataset to include only rows where the Sepal.Width is greater than 3.5.
Task:
- Subset the
irisdataset. - Include only rows where
Sepal.Widthis greater than 3.5.
Exercise 3: Multiple Conditions with AND Operator
Filter the iris dataset for observations that are of the species virginica and have a Petal.Length greater than 5.
Task:
- Subset the
irisdataset. - Include rows where
SpeciesisvirginicaandPetal.Lengthis greater than 5.
Exercise 4: Multiple Conditions with OR Operator
Filter the iris dataset to include observations where either the Sepal.Length is less than 5 or the Petal.Width is greater than 1.5.
Task:
- Subset the
irisdataset. - Include rows where
Sepal.Lengthis less than 5 orPetal.Widthis greater than 1.5.
Exercise 5: Complex Filtering
Create a subset of the iris dataset that includes rows where:
- The
Speciesissetosa. - The
Sepal.Lengthis between 4.5 and 5.5 (inclusive). - The
Sepal.Widthis greater than or equal toPetal.Width.
Task:
- Subset the
irisdataset according to the specified conditions.
Solutions
Solution to Exercise 1
iris_versicolor <- iris[iris$Species == "versicolor", ]Solution to Exercise 2
iris_sepal_width <- iris[iris$Sepal.Width > 3.5, ]Solution to Exercise 3
iris_virginica_petal_length <- iris[iris$Species == "virginica" & iris$Petal.Length > 5, ]Solution to Exercise 4
iris_sepal_or_petal <- iris[iris$Sepal.Length < 5 | iris$Petal.Width > 1.5, ]Solution to Exercise 5
iris_complex <- iris[iris$Species == "setosa" &
iris$Sepal.Length >= 4.5 &
iris$Sepal.Length <= 5.5 &
iris$Sepal.Width >= iris$Petal.Width, ]These exercises, varying from basic to more complex filtering operations, should enhance your ability to subset datasets using multiple conditions in R.
Exercises Part 2
Exercise 1: Basic Filtering
Filter the CASchools dataset to include only the rows where the number of students is greater than 2,000.
Task:
- Subset the
CASchoolsdataset. - Include only rows with
studentsgreater than 2,000.
Exercise 2: Single Condition on Numerical Data
Filter the CASchools dataset to include only rows where the teachers count is less than 100.
Task:
- Subset the
CASchoolsdataset. - Include only rows where
teachersis less than 100.
Exercise 3: Multiple Conditions with AND Operator
Filter the CASchools dataset for observations where the calworks percentage is below 20 and the school district has more than 4,000 students.
Task:
- Subset the
CASchoolsdataset. - Include rows where
calworksis less than 20 andstudentsis greater than 4,000.
Exercise 4: Multiple Conditions with OR Operator
Filter the CASchools dataset to include observations where the district is greater than or equal to 900 or teachers is more than 150.
Task:
- Subset the
CASchoolsdataset. - Include rows where
districtis at least 900 orteachersis more than 150.
Exercise 5: Complex Filtering
Create a subset of the CASchools dataset that includes rows where:
- The district number is less than 500.
- The
studentsare between 3,000 and 5,000 (inclusive). - The
calworkspercentage is greater than twice theteacherscount.
Task:
- Subset the
CASchoolsdataset according to the specified conditions.
Solutions
Solution to Exercise 1
CASchools_students <- CASchools[CASchools$students > 2000, ]Solution to Exercise 2
CASchools_teachers <- CASchools[CASchools$teachers < 100, ]Solution to Exercise 3
CASchools_calworks_students <- CASchools[CASchools$calworks < 20 & CASchools$students > 4000, ]Solution to Exercise 4
CASchools_district_or_teachers <- CASchools[CASchools$district >= 900 | CASchools$teachers > 150, ]Solution to Exercise 5
CASchools_complex <- CASchools[CASchools$district < 500 &
CASchools$students >= 3000 & CASchools$students <= 5000 &
CASchools$calworks > 2 * CASchools$teachers, ]