Merge data frames in R

Data Manipulation in R Data transformation
Join data frames in R with the merge function

The R merge function allows merging two data frames by common columns or by row names. This function allows you to perform different database (SQL) joins, like left join, inner join, right join or full join, among others. In this tutorial you will learn how to merge datasets in base R in the possible available ways with several examples.

Merge function in R

The syntax of the R merge function with a brief description of its arguments is shown in the following block of code:

merge(x, y, ...)

# For data frames:
merge(x, y, # Data frames or objects to be coerced
      by = intersect(names(x), names(y)), # Columns used for merging
      by.x = by, by.y = by, # Columns used for merging
      all = FALSE, # If TRUE, all.x = TRUE and all.y = TRUE
      all.x = all, all.y = all, # If TRUE, adds rows for each row in x (y) that not match a row in y (x).
      sort = TRUE, # Whether to sort the output by the 'by' columns
      suffixes = c(".x",".y"), # Suffixes for creating unique column names
      no.dups = TRUE, # Whether to avoid duplicated column names appending more suffixes or not
      incomparables = NULL, # How to deal with values that can not be matched
      ...) # Additional arguments

Note that the main method of the R merge function is for data frames. However, merge is a generic function that can be also used with other objects (like vectors or matrices), but they will be coerced to data.frame class.

R merge data frames

In order to create a reproducible example to show how to merge two data frames in R we are going to use the following sample datasets named df_1, that represents the id, name and monthly salary of some employees of a company and df_2, that shows the id, name, age and position of some employees.

set.seed(61)

employee_id <- 1:10
employee_name <- c("Andrew", "Susan", "John", "Joe", "Jack",
                   "Jacob", "Mary", "Kate", "Jacqueline", "Ivy")
employee_salary <- round(rnorm(10, mean = 1500, sd = 200))
employee_age <- round(rnorm(10, mean = 50, sd = 8))
employee_position <- c("CTO", "CFO", "Administrative", rep("Technician", 7))


df_1 <- data.frame(id = employee_id[1:8], name = employee_name[1:8],
                   month_salary = employee_salary[1:8])
df_2 <- data.frame(id = employee_id[-5], name = employee_name[-5],
                   age = employee_age[-5], position = employee_position[-5])

df_1
df_2
 # df_1                                  # df_2
  id   name   month_salary                id       name     age         position
  1  Andrew      1424                      1       Andrew    40              CTO
  2   Susan      1425                      2        Susan    38              CFO
  3    John      1156                      3         John    54   Administrative
  4     Joe      1570                      4          Joe    66       Technician
  5    Jack      1223                      6        Jacob    38       Technician
  6   Jacob      1462                      7         Mary    53       Technician
  7    Mary      1641                      8         Kate    56       Technician
  8    Kate      1603                      9   Jacqueline    55       Technician
                                          10          Ivy    43       Technician

Note that on a real life example, all ids will be unique but the names can be repeated. Also note that ‘Jack’ is missing in the second table (neither his age nor his position are available) and ‘Jacqueline’ and ‘Ivy’ are missing in the first (their monthly salaries are not available with the current data).

Inner join

An inner join (actually a natural join), is the most usual join of data sets that you can perform. It consists on merging two dataframes in one that contains the common elements of both, as described in the following illustration:

XY INNER JOIN

In order to merge the two sample data sets, you just have to pass them to the merge function without the need of changing other arguments, due to by default, the function merges the data sets by the common column names. In consequence, in this case, the function merges the data by two columns (id and name).

merge(x = df_1, y = df_2)
merge(x = df_1, y = df_2, by = c("id", "name")) # Equivalent
 id    name  month_salary  age        position
  1  Andrew      1424      40              CTO
  2   Susan      1425      38              CFO
  3    John      1156      54   Administrative
  4     Joe      1570      66       Technician
  6   Jacob      1462      38       Technician
  7    Mary      1641      53       Technician
  8    Kate      1603      56       Technician

As we pointed out before, ‘Jack’, ‘Ivy’ and ‘Jacqueline’ were not in both tables. In consequence, in the resulting output of this join they are missing.

Full (outer) join

The outer join, also known as full outer join or full join, merges all the columns of both data sets into one for all elements:

XY OUTER JOIN

In order to create a full outer join of the two data frames in R you have to set the argument all to TRUE:

merge(x = df_1, y = df_2, all = TRUE)
  id       name    month_salary   age         position
   1      Andrew      1424         40              CTO
   2       Susan      1425         38              CFO
   3        John      1156         54   Administrative
   4         Joe      1570         66       Technician
   5        Jack      1223         NA             <NA>    # <-- NA values
   6       Jacob      1462         38       Technician
   7        Mary      1641         53       Technician
   8        Kate      1603         56       Technician
   9  Jacqueline        NA         55       Technician    # <-- NA value
  10         Ivy        NA         43       Technician    # <-- NA value

As not all rows in the first data frame match all the rows in the second, the output is filled with NA values in those cases.

Left (outer) join in R

The left join in R consist on matching all the rows in the first data frame with the corresponding values on the second. Recall that ‘Jack’ was on the first table but not on the second.

XY LEFT JOIN

In order to create the join, you just have to set all.x = TRUE as follows:

merge(x = df_1, y = df_2, all.x = TRUE)
  id   name    month_salary  age         position
   1  Andrew      1424        40              CTO
   2   Susan      1425        38              CFO
   3    John      1156        54   Administrative
   4     Joe      1570        66       Technician
   5    Jack      1223        NA             <NA>      # <-- NA values
   6   Jacob      1462        38       Technician
   7    Mary      1641        53       Technician
   8    Kate       603        56       Technician

As the employee corresponding to id = 5 is on the first data set but not on the second, the corresponding missing values are displayed as NA.

Right (outer) join in R

The right join in R is the opposite of the left outer join. In this case, the merge consists on joining all the rows in the second data frame with the corresponding on the first.

XY RIGHT JOIN

In consequence, you will need to set the argument all.y to TRUE to join the datasets this way.

merge(x = df_1, y = df_2, all.y = TRUE)
  id       name  month_salary   age         position
   1      Andrew     1424        40              CTO
   2       Susan     1425        38              CFO
   3        John     1156        54   Administrative
   4         Joe     1570        66       Technician
   6       Jacob     1462        38       Technician
   7        Mary     1641        53       Technician
   8        Kate     1603        56       Technician
   9  Jacqueline       NA        55       Technician     # <-- Note the difference
  10         Ivy       NA        43       Technician     # <-- with the left join

As ‘Jacqueline’ and ‘ Ivy’ are on the second data set but not on the first, the corresponding values of monthly salary are not available.

Cross join

The R cross join performs the Cartesian product of the datasets df_1 and df_2:

XY CROSS JOIN112323

You can create a cross join in R setting as NULL the argument by of the R merge function. Note that we only show the first rows of the output, as it is very large.

Merged <- merge(x = df_1, y = df_2, by = NULL)
head(Merged)
  id.x   name.x   month_salary   id.y   name.y    age   position
    1    Andrew         1424      1     Andrew     40      CTO
    2     Susan         1425      1     Andrew     40      CTO
    3      John         1156      1     Andrew     40      CTO
    4       Joe         1570      1     Andrew     40      CTO
    5      Jack         1223      1     Andrew     40      CTO
    6     Jacob         1462      1     Andrew     40      CTO

Merge rows in R

You can also merge data frames by row names. For illustration purposes, consider the following datasets:

df1 <- data.frame(var = c("one", "two", "three", "four", "five"),
                  data = c(1, 5, 1, 6, 8))
rownames(df1) <- c("A", "B", "C", "D", "E")
df1

df2 <- data.frame(var = c("three", "one", "eight", "two", "nine"),
                  data = c(1, 5, 1, 6, 8))
rownames(df2) <- c("E", "A", "B", "D", "C")
df2
# df1                         # df2
     var  data                      var  data
A    one    1                   E  three    1
B    two    5                   A    one    5
C  three    1                   B  eight    1
D   four    6                   D    two    6
E   five    8                   C   nine    8

In this case, in order to join the data frames by the row names you have to set the argument by to 0 or to "row.names".

merge(df1, df2, by = 0, all = TRUE) 
merge(df1, df2, by = "row.names", all = TRUE) # Equivalent
  Row.names  var.x  data.x  var.y  data.y
1      A      one      1      one     5
2      B      two      5    eight     1
3      C    three      1     nine     8
4      D     four      6      two     6
5      E     five      8    three     1

As you can observe, the output contains as many rows as different row names. Note that we applied a full outer join (as in this case it is equivalent to a left and right join), but you could join the data as you want.

Merge more than two dataframes in R

Finally, it is worth to mention that you can iteratively merge data frames in R, concatenating the merge function. Consider, for instance the following data frames:

x <- data.frame(id = 1:4, year = 1995:1998)
x

y <- data.frame(id = c(4, 1, 3, 2),
                year = c(1998, 1995, 1997, 1996), age = c(22, 25, 23, 24))
y

z <- data.frame(id = c(1, 2, 3), year = 1995:1997, wage = c(1000, 1200, 1599))
z
# x          # y              # z
 id  year     id  year age     id  year  wage
  1  1995      4  1998  22      1  1995  1000
  2  1996      1  1995  25      2  1996  1200
  3  1997      3  1997  23      3  1997  1599
  4  1998      2  1996  24

You can merge the three data frames merging two and then merging the output with the remaining data set.

merge(x, merge(y, z))
id  year  age  wage
1   1995   25  1000
2   1996   24  1200
3   1997   23  1599

Note that you can specify the arguments you prefer for each join and that you can concatenate as many merges as you need.

merge(x, merge(y, z, all = TRUE), all = TRUE)
id  year  age  wage
1  1995   25   1000
2  1996   24   1200
3  1997   23   1599
4  1998   22     NA

A cleaner alternative is to use the Reduce function as follows, so instead of concatenating the merge functions, you can specify all the data frames inside a list. Nonetheless, you will have to specify the same arguments for all joins.

Reduce(function(x, y) merge(x, y), list(x, y, z))
id  year  age  wage
1   1995   25  1000
2   1996   24  1200
3   1997   23  1599