Create and modify columns in R with the mutate() function from dplyr

Create and modify columns in R with dplyr

The mutate function from dplyr package is used to create new columns or modify existing columns in a data frame, while retaining the original structure. It enables users to apply functions or operations to data within a data frame and store the results as new variables.

Syntax

The syntax of the mutate function is the following:

# Basic usage
mutate(.data, new_column_name = expression)


mutate(
  .data,  # Data set
  ...,    # New columns (new_column_name = expression)
  .by = NULL, # Grouping variables
  .keep = c("all", "used", "unused", "none"), # Which columns to keep
  .before = NULL, # New columns will appear before this
  .after = NULL   # New columns will appear after this
)

Create columns

To create a new column you can specify the new column name (Var3) and a expression to calculate the values of the new column. In the example below the expression is the sum of two other columns (Var1 + Var2).

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column with the sum of 'Var1' and 'Var2'
df_2 <- df %>%
  mutate(Var3 = Var1 + Var2)

df_2
  Var1 Var2 Var3
1   32   39   71
2   34    1   35
3   15   29   44
4   12    3   15
5   42   35   77

mutate() function from dplyr

You can also apply a function to a column to create a new one. The following example illustrates how to create a new column with the square root of another.

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column with the square root of 'Var1'
df_2 <- df %>%
  mutate(Sqrt_Var1 = sqrt(Var1))

df_2
  Var1 Var2 Sqrt_Var1
1   32   39  5.656854
2   34    1  5.830952
3   15   29  3.872983
4   12    3  3.464102
5   42   35  6.480741

Create new columns in R using dplyr

Add multiple columns at once

Notice that you can add several columns simultaneously by passing more expressions separated by commas to mutate.

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# Create three new columns named 'Var3', 'Var4' and 'Var5'
df_2 <- df %>%
  mutate(Var3 = Var1 + Var2,
         Var4 = cumsum(Var1),
         Var5 = ifelse(Var1 > Var2, TRUE, FALSE))

df_2
  Var1 Var2 Var3 Var4  Var5
1   32   39   71   32 FALSE
2   34    1   35   66  TRUE
3   15   29   44   81 FALSE
4   12    3   15   93  TRUE
5   42   35   77  135  TRUE

Add multiple columns at once in R with the mutate function from dplyr

Using across

The across function allows to select specific columns with helper functions while using mutate.

In the following example, the sqrt function is applied to all columns containing "1" in their names. Subsequently, the new columns are named using the previous column name ("{.col}") followed by a suffix ("_sqrt").

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# Create new columns with all columns that contain "1" in their names
df_2 <- df %>%
  mutate(across(contains("1"), sqrt, .names = "{.col}_sqrt"))

df_2
  Var1 Var2 Var1_sqrt
1   32   39  5.656854
2   34    1  5.830952
3   15   29  3.872983
4   12    3  3.464102
5   42   35  6.480741

Using the mutate() and across() functions from dplyr

Modify existing columns

The mutate function is also capable of modifying existing columns. To achieve this, simply use the syntax: old_column_name = expression

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# Modify 'Var1' as its value divided by the mean of the column
df_2 <- df %>%
  mutate(Var1 = Var1 / mean(Var1, na.rm = TRUE)) 

df_2
       Var1 Var2
1 1.1851852   39
2 1.2592593    1
3 0.5555556   29
4 0.4444444    3
5 1.5555556   35

Modify multiple columns

With across, you can select specific columns and apply a custom function to them without creating new columns. The example below demonstrates how to apply the log function to all columns except for Var1.

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# Apply log to all columns but 'Var1'
df_2 <- df %>%
  mutate(across(!Var1, log)) 

df_2
  Var1     Var2
1   32 3.663562
2   34 0.000000
3   15 3.367296
4   12 1.098612
5   42 3.555348

Modify columns with the mutate function from dplyr

In such cases where the desired function takes additional arguments, you will need to use ~ before the function and reference . to represent the column values.

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# Modify all columns but 'Var1' with paste0
df_2 <- df %>%
  mutate(across(!Var1, ~paste0("Value: ", .))) 

df_2
  Var1      Var2
1   32 Value: 39
2   34  Value: 1
3   15 Value: 29
4   12  Value: 3
5   42 Value: 35

Modify the selected columns applying a function in dplyr

Position of the new columns

By default, when you use mutate to create a new column, it appends that column to the end of the data frame. Nevertheless, the mutate function allows for the use of .before or .after arguments to control the position of the new columns relative to other columns.

In the example below, a new column is added before Var2:

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column before 'Var2'
df_2 <- df %>%
  mutate(Var3 = Var1 / Var2, .before = Var2)

df_2
  Var1       Var3 Var2
1   32  0.8205128   39
2   34 34.0000000    1
3   15  0.5172414   29
4   12  4.0000000    3
5   42  1.2000000   35

Add a new column before other in dplyr

Additionally, you can specify positions based on the index of the column. For instance, setting .before = 1 will add the new column as the first column.

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column on the first position (before first column)
df_2 <- df %>%
  mutate(Var3 = Var1 / Var2, .before = 1)

df_2
        Var3 Var1 Var2
1  0.8205128   32   39
2 34.0000000   34    1
3  0.5172414   15   29
4  4.0000000   12    3
5  1.2000000   42   35

Add a new column on a specific position in dplyr

In the example below, setting .after = "Var2" will place the new column after Var2.

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column after 'Var2'
df_2 <- df %>%
  mutate(Var3 = Var1 * Var2, .after = "Var2")

df_2
  Var1 Var2 Var3
1   32   39 1248
2   34    1   34
3   15   29  435
4   12    3   36
5   42   35 1470

Add a new column after other column in dplyr

Keep or drop columns

When you add new columns to a data frame all other columns are preserved by default. However, the .keep argument, which defaults to "all", can also be set to "used" to only keep columns used inside mutate, to "unused" to keep unused columns and to "none" to delete all the old columns.

The following example demonstrates how to use .keep by retaining only used columns.

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column and keep the new column ('Var3') and the column used ('Var1')
df_2 <- df %>%
  mutate(Var3 = Var1 * 2, .keep = "used")

df_2
  Var1 Var3
1   32   64
2   34   68
3   15   30
4   12   24
5   42   84

Add new columns in R with the mutate function and keep used columns

The opposite is to keep only the new column and the columns that haven’t been used setting .keep = "unused".

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column and keep the new column ('Var3') and the column unused ('Var2')
df_2 <- df %>%
  mutate(Var3 = Var1 * 2, .keep = "unused")

df_2
  Var2 Var3
1   39   64
2    1   68
3   29   30
4    3   24
5   35   84

Add new columns with the mutate function from dplyr and keep unused columns

Finally, to remove all columns from the original data frame, use .keep = "none".

library(dplyr)

set.seed(8)
df <- data.frame(Var1 = sample(1:50, 5), Var2 = sample(1:50, 5))

# New column and only keep the new column ('Var3')
df_2 <- df %>%
  mutate(Var3 = Var1 * 2, .keep = "none")

df_2
  Var3
1   64
2   68
3   30
4   24
5   84

Create a new column with dplyr and remove the others