Merging datasets involves combining two (or more) datasets based on a unique identifier.
For example, suppose you have a dataset containing population information for Italy and France and another dataset with GDP data for the same countries. You can merge them using a common column, such as country, as follows:
pop <-data.frame(country =c("France", "Italy"), population =c(68.17, 58.76))gdp <-data.frame(country =c("France", "Italy"), gdp =c(3.031, 2.255))head(pop)
country population
1 France 68.17
2 Italy 58.76
head(gdp)
country gdp
1 France 3.031
2 Italy 2.255
# Merge by countrypop_gdp <-merge(pop, gdp, by ="country")pop_gdp
country population gdp
1 France 68.17 3.031
2 Italy 58.76 2.255
Now you have a new dataset called pop_gdp that combines all the variables from both datasets.
Unique Identifier
Now, imagine you have not only information about countries but also about years.
pop <-data.frame(country =c("France", "France", "Italy", "Italy"), year =c(1990, 2023, 1990, 2023), population =c(57.99, 68.17, 56.75, 58.76))gdp <-data.frame(country =c("France", "France", "Italy", "Italy"), year =c(1990, 2023, 1990, 2023), gdp =c(1.269, 3.031, 1.181, 2.255))head(pop)
country year population
1 France 1990 57.99
2 France 2023 68.17
3 Italy 1990 56.75
4 Italy 2023 58.76
head(gdp)
country year gdp
1 France 1990 1.269
2 France 2023 3.031
3 Italy 1990 1.181
4 Italy 2023 2.255
In this case, you cannot simply merge by country alone; you need to merge by both country and year.
Incorrect merge:
pop_gdp <- merge(pop, gdp, by = "country")
Correct merge:
pop_gdp <-merge(pop, gdp, by =c("country", "year"))pop_gdp
country year population gdp
1 France 1990 57.99 1.269
2 France 2023 68.17 3.031
3 Italy 1990 56.75 1.181
4 Italy 2023 58.76 2.255
This correct merge results in a combined dataset pop_gdp that aligns data by both country and year, ensuring all relevant information is matched correctly.
Survey Example
Often when working with surveys, you will have to merge several different files, such as household data and individual-level data.
Let’s imagine two datasets: one representing household-level information, and another containing individual-level data. We’ll merge them using both household_id and pers_id as identifiers.
Note that pers_id is unique to each individual but we will use household_id to match the larger household context.
Household Dataset
This dataset contains unique information for each household, such as region and household income.
household_id region household_income
1 1 North 50000
2 2 East 55000
3 3 West 60000
This dataset includes a single row for each household, capturing unique household-level information. All household members share the same values for household-specific attributes, such as the region and household income. This means that every individual within the household is associated with these common household characteristics, providing a consistent context for all members.
Individual Dataset
This dataset contains individual-level data, such as age and sex, for members of various households.
In this dataset, individuals are nested within households. For example, the first two individuals (rows) belong to the same household, indicating they share a common household (household_id = 1) context while maintaining their personal characteristics.
household_id pers_id age sex
1 1 1 34 M
2 1 2 28 F
3 2 1 45 M
4 2 2 40 F
5 3 1 25 M
Merging the Datasets
We can merge the individual-level dataset to the household dataset by simply using household_id.
# Merge using household_idmerged_data <-merge(household, individuals, by ="household_id")merged_data
household_id region household_income pers_id age sex
1 1 North 50000 1 34 M
2 1 North 50000 2 28 F
3 2 East 55000 1 45 M
4 2 East 55000 2 40 F
5 3 West 60000 1 25 M
The new dataset merged_data will contain information for each individual, enriched with the household-specific details such as region and household income.
Merging Additional Individual-Level Datasets
In some surveys, individual-level data may be distributed across several files.
For example, there might be a file named income that contains various income sources for each person, while another dataset named individuals contains socio-demographic information.
household_id pers_id age sex
1 1 1 34 M
2 1 2 28 F
3 2 1 45 M
4 2 2 40 F
5 3 1 25 M
To combine the income and individuals datasets, it is essential to use both household_id and pers_id as merging keys. This ensures that each individual’s income data is accurately aligned with their socio-demographic information.
df_ind =merge(individuals, income, by =c("household_id", "pers_id"))df_ind
household_id pers_id age sex labour_income investments
1 1 1 34 M 600 0
2 1 2 28 F 450 10
3 2 1 45 M 800 1000
4 2 2 40 F 300 55
5 3 1 25 M 430 0