data(iris)
data(cars)
data(swiss)
7 R basics Part II
Data Frames
One of the most common types of data structure in R is the data frame. Most datasets come in the form of a data frame.
R comes with several pre-installed datasets [1].
To import a pre-installed R dataset into your environment, simply run data() with the name of the dataset you wish to import.
For instance, to import the datasets iris, cars and swiss, enter the commands data(iris), data(cars) and data(swiss).
You should now be able to see them in your Environment panel (top right panel).
If you click on the little table symbol (highlighted in red in the picture above), you can open a spreadsheet view of the data.
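The import can also be checked from the console instead of the Environment panel. The following is a minimal sketch that loads the three datasets named above and prints their dimensions; it assumes a standard R installation where the built-in datasets package is available.

```r
# Load three built-in datasets into the current environment
data(iris)
data(cars)
data(swiss)

# List the objects that now exist in the environment
ls()

# Number of rows and columns of each dataset
dim(iris)
dim(cars)
dim(swiss)
```

Each call to dim() returns two numbers: the number of rows first, then the number of columns.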
The swiss dataset
Let us explore the swiss dataset.
We can get information about a function or a dataset by putting a question mark (?) in front of its name:
?swiss
We get a description of the swiss dataset (displayed in the bottom right window).
Swiss Fertility and Socioeconomic Indicators (1888) Data. Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
The swiss dataset is a data.frame where each row corresponds to a French-speaking Swiss province.
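As a quick check of this description, the sketch below inspects the object from the console. It only uses base R functions (class(), dim(), str() and head()); the choice of showing three rows is arbitrary.

```r
data(swiss)

# What kind of object is it?
class(swiss)

# How many rows (provinces) and columns (variables)?
dim(swiss)

# Compact overview: one line per variable with its type and first values
str(swiss)

# First three rows of the data frame
head(swiss, 3)
```

The row count reported by dim() should match the 47 provinces mentioned in the help page.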
Structure of a Dataset/DataFrame
Datasets are generally organised in the following way.
Each row represents an observation (e.g. countries, individuals) and each column represents a variable (e.g. age, sex) or feature of the observation.
data[rows, columns]
data[observation, variable]
This is similar to how data would be organised in a spreadsheet.
The comma (,) in R is used to separate rows from columns when accessing the data within a data frame.
The information before the comma refers to the row(s), and the information after the comma refers to the column(s).
For example:
- To select all rows and the 2nd column of a data frame, you would write data[, 2].
- To select the 3rd row and all columns of a data frame, you would write data[3, ].
- To select the 4th row and 5th column of a data frame, you would write data[4, 5].
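The three patterns above can be tried directly on the swiss dataset. This is a small sketch; the variable names second_column, third_row and one_value are chosen here for illustration only.

```r
data(swiss)

# All rows, 2nd column: a single column comes back as a plain vector
second_column <- swiss[, 2]
length(second_column)

# 3rd row, all columns: the result is still a data frame with one row
third_row <- swiss[3, ]
third_row

# 4th row, 5th column: a single value
one_value <- swiss[4, 5]
one_value
```

Leaving the position before or after the comma empty means "take everything" along that dimension.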
Let’s practice with the swiss dataset.
- Let’s select rows 1 to 3 and columns 4 and 5:
swiss[1:3, 4:5]
Education Catholic
Courtelary 12 9.96
Delemont 9 84.84
Franches-Mnt 5 93.40
We selected three provinces: Courtelary, Delemont and Franches-Mnt.
You can select multiple rows and columns in R by specifying their positions inside the c() function.
For example, to select rows 1, 3, 5 and columns 1, 4, you can use:
swiss[c(1, 3, 5), c(1, 4)]
Fertility Education
Courtelary 80.2 12
Franches-Mnt 92.5 5
Neuveville 76.9 15
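A range such as 1:3 and a vector built with c() are two ways of writing a set of positions. The sketch below compares the two forms and looks up the row names behind the positions used above; by_range and by_vector are illustrative names.

```r
data(swiss)

# Consecutive positions written as a range
by_range <- swiss[1:3, 4:5]

# The same positions listed one by one with c()
by_vector <- swiss[c(1, 2, 3), c(4, 5)]

# Both selections contain the same rows and columns
all.equal(by_range, by_vector)

# Row names found at positions 1, 3 and 5
rownames(swiss)[c(1, 3, 5)]
```

The c() form is the one to use when the positions are not consecutive, as with rows 1, 3 and 5.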
Selecting Characteristics (Variables)
In R, the columns of a data frame can also be referred to by name. So, if a data frame has a column named Education, you could access this column with data[, "Education"].
Let’s select the columns Education and Catholic for the first two rows of the swiss dataset with
swiss[1:2, c("Education", "Catholic")]
Education Catholic
Courtelary 12 9.96
Delemont 9 84.84
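Selecting by name and selecting by position give the same result when they point at the same columns. The following sketch lists the column names and compares the two approaches; by_position and by_name are illustrative names.

```r
data(swiss)

# Column names, in the order they appear in the data frame
names(swiss)

# Columns 4 and 5 selected by position
by_position <- swiss[1:2, 4:5]

# The same columns selected by name
by_name <- swiss[1:2, c("Education", "Catholic")]

# The two selections hold the same data
all.equal(by_position, by_name)
```

Names are usually the safer choice, since they keep working if the column order changes.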
For individual columns, you can specify the column name after the $ sign. For example:
swiss$Education
swiss$Catholic
This is particularly useful when using conditions as we will learn below.
Type and run
swiss$Education
You will see that all the rows are printed for the variable Education.
By using square brackets [] after the $ selection, such as swiss$Education[1:3], you can select specific rows.
For instance, let’s select row number 10 for swiss$Education
swiss$Education[10]
which corresponds to the following province (Sarine)
swiss[10, ]
Fertility Agriculture Examination Education Catholic Infant.Mortality
Sarine 82.9 45.2 16 13 91.38 24.4
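The $ form and the [rows, columns] form reach the same values. The sketch below puts them side by side for the Education column; first_three and tenth_value are illustrative names.

```r
data(swiss)

# First three values of the Education column
first_three <- swiss$Education[1:3]
first_three

# Value in row 10 of the Education column
tenth_value <- swiss$Education[10]
tenth_value

# The same value, reached with the [row, column] form
swiss[10, "Education"]

# Name of the row at position 10
rownames(swiss)[10]
```

Because swiss$Education is a plain vector, it takes a single position inside the brackets and needs no comma.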
Practice
Look at the two provinces displayed below. What can you learn about them by looking at their characteristics? For instance, which one is more Catholic? Which one has higher levels of Education? Which one has higher levels of Infant Mortality?
swiss[c(42, 45), ]
             Fertility Agriculture Examination Education Catholic Infant.Mortality
Neuchatel         64.4        17.6          35        32    16.92               23
V. De Geneve      35.0         1.2          37        53    42.34               18
Practice
For the swiss dataset
- Select row number 2
- Select row number 2 and row number 5
- Select column number 3
- Select rows 2 and 3 of column number 3
- Select column numbers 3 to 6
- Select the first row (value) for the variable Catholic
See footnote [2] for the solution.
[1] List of all pre-installed datasets: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
[2] Solution
swiss[2, ]
swiss[c(2, 5), ]
swiss[, 3]
swiss[c(2, 3), 3]
swiss[, 3:6]
swiss$Catholic[1]