Merging Datasets

Merging datasets involves combining two (or more) datasets based on a unique identifier.

For example, suppose you have a dataset containing population information for Italy and France and another dataset with GDP data for the same countries. You can merge them using a common column, such as country, as follows:

pop <- data.frame(country = c("France", "Italy"), population = c(68.17, 58.76))
gdp <- data.frame(country = c("France", "Italy"), gdp = c(3.031, 2.255))


head(pop)
  country population
1  France      68.17
2   Italy      58.76
head(gdp)
  country   gdp
1  France 3.031
2   Italy 2.255
# Merge by country
pop_gdp <- merge(pop, gdp, by = "country")
pop_gdp
  country population   gdp
1  France      68.17 3.031
2   Italy      58.76 2.255

Now you have a new dataset called pop_gdp that combines all the variables from both datasets.

Unique Identifier

Now, imagine you have not only information about countries but also about years.

pop <- data.frame(country = c("France", "France", "Italy", "Italy"), 
                  year = c(1990, 2023, 1990, 2023), population = c(57.99, 68.17, 56.75, 58.76))

gdp <- data.frame(country = c("France", "France", "Italy", "Italy"), 
                  year = c(1990, 2023, 1990, 2023), gdp = c(1.269, 3.031, 1.181, 2.255))

head(pop)
  country year population
1  France 1990      57.99
2  France 2023      68.17
3   Italy 1990      56.75
4   Italy 2023      58.76
head(gdp)
  country year   gdp
1  France 1990 1.269
2  France 2023 3.031
3   Italy 1990 1.181
4   Italy 2023 2.255

In this case, you cannot simply merge by country alone; you need to merge by both country and year.

Incorrect merge:

pop_gdp <- merge(pop, gdp, by = "country")

Correct merge:

pop_gdp <- merge(pop, gdp, by = c("country", "year"))
pop_gdp
  country year population   gdp
1  France 1990      57.99 1.269
2  France 2023      68.17 3.031
3   Italy 1990      56.75 1.181
4   Italy 2023      58.76 2.255

This correct merge results in a combined dataset pop_gdp that aligns data by both country and year, ensuring all relevant information is matched correctly.

Survey Example

Often when working with surveys, you will have to merge several different files, such as household data and individual-level data.

Let’s imagine two datasets: one representing household-level information, and another containing individual-level data. We’ll merge them using both household_id and pers_id as identifiers.

Note that pers_id is unique to each individual but we will use household_id to match the larger household context.

Household Dataset

This dataset contains unique information for each household, such as region and household income.

household <- data.frame(
  household_id = c(1, 2, 3),
  region = c("North", "East", "West"),
  household_income = c(50000, 55000, 60000)
)

head(household)
  household_id region household_income
1            1  North            50000
2            2   East            55000
3            3   West            60000

This dataset includes a single row for each household, capturing unique household-level information. All household members share the same values for household-specific attributes, such as the region and household income. This means that every individual within the household is associated with these common household characteristics, providing a consistent context for all members.

Individual Dataset

This dataset contains individual-level data, such as age and sex, for members of various households.

In this dataset, individuals are nested within households. For example, the first two individuals (rows) belong to the same household, indicating they share a common household (household_id = 1) context while maintaining their personal characteristics.

individuals <- data.frame(
  household_id = c(1, 1, 2, 2, 3),
 pers_id = c(1, 2, 1, 2, 1),
  age = c(34, 28, 45, 40, 25),
  sex = c("M", "F", "M", "F", "M")
)

head(individuals)
  household_id pers_id age sex
1            1       1  34   M
2            1       2  28   F
3            2       1  45   M
4            2       2  40   F
5            3       1  25   M

Merging the Datasets

We can merge the individual-level dataset to the household dataset by simply using household_id.

# Merge using household_id
merged_data <- merge(household, individuals, by = "household_id")
merged_data
  household_id region household_income pers_id age sex
1            1  North            50000       1  34   M
2            1  North            50000       2  28   F
3            2   East            55000       1  45   M
4            2   East            55000       2  40   F
5            3   West            60000       1  25   M

The new dataset merged_data will contain information for each individual, enriched with the household-specific details such as region and household income.

Merging Additional Individual-Level Datasets

In some surveys, individual-level data may be distributed across several files.

For example, there might be a file named income that contains various income sources for each person, while another dataset named individuals contains socio-demographic information.

# income file
income <- data.frame(
  household_id = c(1, 1, 2, 2, 3),
  pers_id = c(1, 2, 1, 2, 1),
  labour_income = c(600, 450, 800, 300, 430),
  investments = c(0, 10, 1000, 55, 0)
)

head(income)
  household_id pers_id labour_income investments
1            1       1           600           0
2            1       2           450          10
3            2       1           800        1000
4            2       2           300          55
5            3       1           430           0
# socio-demo file
individuals <- data.frame(
  household_id = c(1, 1, 2, 2, 3),
  pers_id = c(1, 2, 1, 2, 1),
  age = c(34, 28, 45, 40, 25),
  sex = c("M", "F", "M", "F", "M")
)

head(individuals)
  household_id pers_id age sex
1            1       1  34   M
2            1       2  28   F
3            2       1  45   M
4            2       2  40   F
5            3       1  25   M

To combine the income and individuals datasets, it is essential to use both household_id and pers_id as merging keys. This ensures that each individual’s income data is accurately aligned with their socio-demographic information.

df_ind = merge(individuals, income, by = c("household_id", "pers_id"))
df_ind
  household_id pers_id age sex labour_income investments
1            1       1  34   M           600           0
2            1       2  28   F           450          10
3            2       1  45   M           800        1000
4            2       2  40   F           300          55
5            3       1  25   M           430           0