7  R basics Part II

Data Frames

One of the most common types of data structure in R is the data frame. Most datasets come in the form of a data frame.

R comes with several pre-installed datasets (see footnote 1).
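The same list can also be inspected from within R. The following is a minimal sketch, assuming the built-in datasets package is available (it is part of a standard R installation); the variable name available is only illustrative.

```r
# List the datasets shipped with the built-in "datasets" package.
# data() returns an object whose "results" matrix has an "Item" column
# holding the dataset names.
available <- data(package = "datasets")$results[, "Item"]

length(available)         # how many entries are listed
head(available)           # the first few dataset names
"swiss" %in% available    # check whether swiss is among them
```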

To import a pre-installed R dataset into your environment, simply run data() with the name of the dataset you wish to import.

For instance, to load the datasets iris, cars and swiss, enter the following commands:

data(iris)
data(cars)
data(swiss)

You should be able to see them in your Environment panel (top right panel).
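If you prefer to check from the console rather than the Environment panel, a short sketch along these lines should work. It uses the base R functions class(), dim() and head(), which are not introduced in the text above.

```r
# Load one of the built-in datasets and confirm what it is
data(cars)

class(cars)     # "data.frame"
dim(cars)       # 50 rows and 2 columns
head(cars, 3)   # print the first three rows
```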


Now, if you click on the little table symbol (highlighted in red in the picture above), you can open a spreadsheet view of the data.


The swiss dataset

Let us explore the swiss dataset.

We can get information about a function or dataset by adding a question mark (?) in front of its name.

?swiss

We get a description of the swiss dataset (displayed in the bottom right window).

Swiss Fertility and Socioeconomic Indicators (1888) Data. Standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.

The swiss dataset is a data.frame where each row corresponds to a French-speaking Swiss province.
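To confirm this layout from the console, one possible sketch uses str(), names() and rownames(). These base R functions are not used elsewhere in this section and are shown here only as a suggestion.

```r
data(swiss)

str(swiss)              # compact overview: 47 observations of 6 variables
names(swiss)            # the variables (columns)
rownames(swiss)[1:3]    # province names are stored as row names
```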

Structure of a Dataset/DataFrame

Datasets are generally organised in the following way.

Each row represents an observation (e.g. countries, individuals) and each column represents a variable (e.g. age, sex) or feature of the observation.

  • data[rows, columns]
  • data[observation, variable]

This is similar to how data would be organised in a spreadsheet.

The comma (,) in R is used to separate rows from columns when accessing the data within the data frame.

The information before the comma refers to the row(s), and the information after the comma refers to the column(s).

For example:

  • To select all rows and the 2nd column of a data frame, you would write data[,2].
  • To select the 3rd row and all columns of a data frame, you would write data[3,].
  • To select the 4th row and 5th column of a data frame, you would write data[4,5].
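The three patterns can be tried on a small data frame. The sketch below uses a made-up data frame called people with hypothetical values; because it has only three columns, the last line picks the 4th row and 3rd column rather than the 5th.

```r
# A small made-up data frame (hypothetical values, for illustration only)
people <- data.frame(
  name   = c("Ana", "Ben", "Cleo", "Dev"),
  age    = c(31, 25, 47, 52),
  height = c(165, 180, 172, 169)
)

people[, 2]    # all rows, 2nd column: the ages, returned as a vector
people[3, ]    # 3rd row, all columns: a one-row data frame
people[4, 3]   # 4th row, 3rd column: a single value, 169
```

Note that leaving one side of the comma empty means "everything" on that side.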

Let’s practice with the swiss dataset.

  • Let’s select rows 1 to 3 and columns 4 and 5:
swiss[1:3, 4:5]
             Education Catholic
Courtelary          12     9.96
Delemont             9    84.84
Franches-Mnt         5    93.40

We selected three provinces: Courtelary, Delemont and Franches-Mnt.

You can select multiple rows and columns in R by specifying their positions inside the c() function.

For example, to select rows 1, 3, 5 and columns 1, 4, you can use:

swiss[c(1,3,5), c(1,4)]
             Fertility Education
Courtelary        80.2        12
Franches-Mnt      92.5         5
Neuveville        76.9        15
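Because c() simply builds a vector of positions, the positions can also be stored in variables first and reused. A minimal sketch, where rows_wanted, cols_wanted and the subset_ names are illustrative only:

```r
data(swiss)

# Store the positions once (hypothetical variable names)
rows_wanted <- c(1, 3, 5)
cols_wanted <- c(1, 4)

subset_a <- swiss[rows_wanted, cols_wanted]
subset_b <- swiss[c(1, 3, 5), c(1, 4)]

identical(subset_a, subset_b)   # TRUE: both select the same cells
dim(subset_a)                   # 3 rows and 2 columns
```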

Selecting Characteristics (Variables)

In R, the columns of a data frame can also be referred to by name. So, if a data frame has a column named Education, you can access this column with data[,"Education"].

Let’s select the columns Education and Catholic for the first two rows of the swiss dataset with

swiss[1:2, c("Education", "Catholic")]
           Education Catholic
Courtelary        12     9.96
Delemont           9    84.84
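Selecting by name and selecting by position return the same result when they point at the same columns. The sketch below checks this for swiss; the variable names by_position and by_name are illustrative.

```r
data(swiss)

names(swiss)   # Education is the 4th column, Catholic the 5th

by_position <- swiss[1:2, 4:5]
by_name     <- swiss[1:2, c("Education", "Catholic")]

identical(by_position, by_name)   # TRUE
```

Selecting by name has the advantage that the code keeps working if the columns are later reordered.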

For individual columns, you can specify the column after the $ sign. For example:

swiss$Education
swiss$Catholic

This is particularly useful when using conditions as we will learn below.

Type and run

swiss$Education

You will see that all the rows are printed for the variable Education.

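By using square brackets [] after the $ selection, such as swiss$Education[1:3], you can select specific rows. Because swiss$Education is a single column (a vector) rather than a table, only one index is needed and no comma is used.

For instance, let’s select row number 10 for swiss$Education: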

swiss$Education[10]

which corresponds to the following province (Sarine)

swiss[10, ]
       Fertility Agriculture Examination Education Catholic Infant.Mortality
Sarine      82.9        45.2          16        13    91.38             24.4
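The two ways of reaching the same cell can be compared directly. This is a small sketch using only the swiss dataset and the base R function rownames():

```r
data(swiss)

swiss$Education[10]       # 13: 10th value of the Education column
swiss[10, "Education"]    # 13: same cell, using [row, column]
rownames(swiss)[10]       # "Sarine": the name of row 10
```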

Practice

Look at the two provinces displayed below. What can you learn about them from their characteristics? For instance, which province is more Catholic? Which has higher levels of Education? Which has higher levels of Infant Mortality?

swiss[c(42, 45), ]
             Fertility Agriculture Examination Education Catholic
Neuchatel         64.4        17.6          35        32    16.92
V. De Geneve      35.0         1.2          37        53    42.34
             Infant.Mortality
Neuchatel                  23
V. De Geneve               18

Practice

For the swiss dataset

  • Select row number 2
  • Select row number 2 and row number 5
  • Select column number 3
  • Select column number 3 and rows number 2 and 3
  • Select column numbers 3 to 6
  • Select the first row (value) for the variable Catholic

See footnote 2 for the solution.


  1. List of all pre-installed datasets: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html

  2. Solution

    swiss[2,]
    swiss[c(2,5),]
    swiss[, 3]
    swiss[c(2,3), 3]
    swiss[, 3:6]
    swiss$Catholic[1]