Create statistical summaries in R with the summarise() function from dplyr


The summarise (or summarize) function aggregates data. It condenses a data frame into a single row per group (or a single row overall when the data is ungrouped) and returns a new data frame containing the summary statistics or other computations you specify.

Syntax

The summarise or summarize function takes a dataset as input and creates a new one with columns calculated by applying a function to one or multiple columns from the original data. The syntax is as follows:

summarise(data, new_column = function(column))

summarise and summarize are synonyms: both names refer to the same function.
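As a quick check of the syntax, the following minimal sketch applies both spellings to a small hand-made data frame and confirms that they return the same result. The `scores` data frame and its `points` column are made up for illustration.

```r
library(dplyr)

# Hypothetical data, made up for illustration
scores <- data.frame(points = c(10, 20, 30, 40))

# Both spellings call the same function
res_s <- summarise(scores, avg_points = mean(points))
res_z <- summarize(scores, avg_points = mean(points))

res_s                    # one row with avg_points = 25
identical(res_s, res_z)  # TRUE
```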

Statistical summaries of the data

Given a dataset, you can generate a new data frame containing statistical summaries of specific variables from the original data frame. The table below describes some of the most useful functions for use with summarise, such as mean or sum.

Function        Description
mean()          Mean of the values
median()        Median of the values
sd(), var()     Standard deviation and variance of the values
quantile()      Quantiles of the values
IQR()           Interquartile range
min(), max()    Minimum and maximum value
first()         First value
last()          Last value
nth()           Nth value
n()             Number of elements per group
n_distinct()    Number of unique values
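The counting and position helpers from the table can be combined in a single call. The sketch below uses a small hand-made data frame (`sales`, with hypothetical columns `item` and `units`) so the results are easy to verify by hand.

```r
library(dplyr)

# Hypothetical data, made up for illustration
sales <- data.frame(item = c("a", "b", "a", "c", "b"),
                    units = c(5, 3, 8, 2, 7))

sales_summary <- sales %>%
  summarise(rows = n(),                         # number of rows: 5
            distinct_items = n_distinct(item),  # unique items: 3
            first_units = first(units),         # first value: 5
            second_units = nth(units, 2),       # second value: 3
            last_units = last(units),           # last value: 7
            iqr_units = IQR(units))             # interquartile range: 4

sales_summary
```

Since the data is ungrouped, n() counts all rows of the data frame. After group_by it counts the rows of each group.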

In the example below we demonstrate how to generate a new data frame containing the average of the numerical variables from the original data frame.

library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Mean of 'x' and mean of 'y'
df_2 <- df %>%
  summarise(mean_x = mean(x), mean_y = mean(y))

df_2
  mean_x mean_y
1   21.6   38.2


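Note that if the summary function returns several values, the output will contain one row per returned value. Since dplyr 1.1.0 this usage is deprecated: summarise still returns the rows, but it emits a warning recommending reframe() for summaries that produce more than one row per group.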

library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Quartiles of 'x' and quartiles of 'y'
df_2 <- df %>%
  summarise(quartiles_x = quantile(x), quartiles_y = quantile(y))

df_2
  quartiles_x quartiles_y
1           3          18
2          12          37
3          19          42
4          30          44
5          44          50
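Multi-row summaries like the one above can also be written with reframe(), which accepts any number of rows per group. This is a minimal sketch that assumes dplyr 1.1.0 or later is installed. The `values` data frame and the extra `prob` column are made up for illustration.

```r
library(dplyr)

# Hypothetical data, made up for illustration
values <- data.frame(x = c(1, 2, 3, 4, 5))

# reframe() (assumes dplyr >= 1.1.0) allows several rows per group
quartiles <- values %>%
  reframe(prob = c(0, 0.25, 0.5, 0.75, 1),  # probabilities used by quantile()
          quartile_x = quantile(x))         # one row per quartile

quartiles
```

Adding the `prob` column makes each row self-describing, which the bare quantile() output in a data frame is not.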


Summarise data by group with group_by

The summarise function is particularly useful in conjunction with group_by. In this scenario, the new data frame will contain statistical summaries for each group.

The example below calculates the sum of each numeric column for each level of the group variable.

library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group
df_2 <- df %>%
  group_by(group) %>%
  summarise(sum_x = sum(x), sum_y = sum(y))

df_2
# A tibble: 2 × 3
  group sum_x sum_y
  <chr> <int> <int>
1 G1       52    97
2 G2       56    94
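Several statistics can be computed per group in the same call. The following sketch uses a small hand-made data frame (`df_small`, made up for illustration) whose group sizes and totals can be checked by hand.

```r
library(dplyr)

# Hypothetical data, made up for illustration
df_small <- data.frame(group = c("G1", "G1", "G2", "G2", "G2"),
                       x = c(1, 3, 10, 20, 30))

# One output row per level of 'group'
by_group <- df_small %>%
  group_by(group) %>%
  summarise(n_rows = n(), sum_x = sum(x), mean_x = mean(x))

by_group
# G1: n_rows = 2, sum_x = 4,  mean_x = 2
# G2: n_rows = 3, sum_x = 60, mean_x = 20
```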


Additionally, you can group by more than one categorical variable. In this scenario, the function calculates statistical summaries for each combination of group and subgroup. By default, summarise drops the last grouping level, so the output remains grouped by the first variable (group), as the message printed by the function indicates.

library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 group_2 = sample(c("G3", "G4"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group and group_2
df_2 <- df %>%
  group_by(group, group_2) %>%
  summarise(sum_x = sum(x), sum_y = sum(y))

df_2
`summarise()` has grouped output by 'group'. You can override using the `.groups` argument.
# A tibble: 4 × 4
# Groups:   group [2]
  group group_2 sum_x sum_y
  <chr> <chr>   <int> <int>
1 G1    G3         74    86
2 G1    G4         18    30
3 G2    G3         48    22
4 G2    G4         37    35


The .groups argument is optional and controls the grouping of the output. It can take one of the following values: "drop_last" to drop the last level of grouping, "drop" to drop all groups, "keep" to preserve the grouping of the input, or "rowwise" to treat each row as its own group.

library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 group_2 = sample(c("G3", "G4"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group and group_2, dropping all grouping
df_2 <- df %>%
  group_by(group, group_2) %>%
  summarise(sum_x = sum(x), sum_y = sum(y), .groups = "drop")

df_2
# A tibble: 4 × 4
  group group_2 sum_x sum_y
  <chr> <chr>   <int> <int>
1 G1    G3         74    86
2 G1    G4         18    30
3 G2    G3         48    22
4 G2    G4         37    35

Notice the difference from the previous output: the values are the same, but the "# Groups:" line is gone because the result is no longer grouped.
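To see what each .groups value leaves behind, you can inspect the grouping variables of the result with group_vars(). The sketch below uses a small hand-made data frame (`df_groups`, made up for illustration).

```r
library(dplyr)

# Hypothetical data, made up for illustration
df_groups <- data.frame(group = c("G1", "G1", "G2", "G2"),
                        group_2 = c("G3", "G4", "G3", "G4"),
                        x = 1:4)

grouped <- df_groups %>% group_by(group, group_2)

# group_vars() returns the names of the grouping variables of a data frame
vars_drop_last <- group_vars(summarise(grouped, sum_x = sum(x), .groups = "drop_last"))
vars_drop <- group_vars(summarise(grouped, sum_x = sum(x), .groups = "drop"))
vars_keep <- group_vars(summarise(grouped, sum_x = sum(x), .groups = "keep"))

vars_drop_last  # "group"
vars_drop       # character(0)
vars_keep       # "group" "group_2"
```

Dropping all groups is usually the safer choice when the result is passed on to further steps, since leftover grouping silently changes how later verbs such as mutate or summarise behave.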


Summarise multiple columns

Instead of manually specifying several columns, you can create summaries by selecting them based on a condition using summarise in combination with across. See the list of helper functions to select columns.

In the following example, the variance of all columns except group is calculated, and the resulting columns are renamed using the original column names with the "_var" suffix.

library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Summarise all columns by their variance except 'group'
df_2 <- df %>%
  summarise(across(!group, var, .names = "{.col}_var"))

df_2
  x_var y_var
1 254.3 149.2


The where function is highly useful as it enables the selection of columns based on a condition, like choosing only numeric columns using where(is.numeric).

For example, the following code calculates the median for all numeric variables.

library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Summarise ALL the NUMERIC columns by their median
df_2 <- df %>%
  summarise(across(where(is.numeric), median, .names = "{.col}_median"))

df_2
  x_median y_median
1       19       42
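across also accepts a named list of functions, in which case the "{.fn}" placeholder in .names refers to the name of each function. The following is a minimal sketch on a hand-made data frame (`df_num`, made up for illustration).

```r
library(dplyr)

# Hypothetical data, made up for illustration
df_num <- data.frame(group = c("G1", "G1", "G2"),
                     x = c(2, 4, 6),
                     y = c(10, 20, 60))

# Mean and maximum of every numeric column
stats <- df_num %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, max = max),
                   .names = "{.col}_{.fn}"))

names(stats)  # "x_mean" "x_max" "y_mean" "y_max"
stats         # x_mean = 4, x_max = 6, y_mean = 30, y_max = 60
```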
