# Create statistical summaries in R with the summarise() function from dplyr

The `summarise` (or `summarize`) function is used for aggregating and summarizing data. It’s particularly helpful for condensing data into a single row per group, offering various statistical summaries or computations for each group. This function creates a new data frame with the specified summary statistics.

## Syntax

The `summarise` or `summarize` function takes a dataset as input and creates a new one with columns calculated by applying a function to one or multiple columns from the original data. The syntax is as follows:

``summarise(data, new_column = function(column))``

`summarise` and `summarize` are both the same function.

## Statistical summaries of the data

Given a dataset, you can generate a new data frame containing statistical summaries of specific variables from the original data frame. The table below describes some of the most useful functions for use with `summarise`, such as `mean` or `sum`.

Function Description
mean() Mean of the values
median() Median of the values
sd(), var() Standard deviation and variance of the values
quantile() Quantiles of the values
IQR() Interquartile range
min(), max() Minimum and maximum value
first() First value
last() Last value
nth() Nth value
n() Number of elements per group
n_distinct() Number of unique values

In the example below we demonstrate how to generate a new data frame containing the average of the numerical variables from the original data frame.

``````library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
x = sample(1:50, 5), y = sample(1:50, 5))

# Mean of 'x' and mean of 'y'
df_2 <- df %>%
summarise(mean_x = mean(x), mean_y = mean(y))

df_2``````
``````  mean_x mean_y
1   21.6   38.2``````

Note that the resulting output will contain as many rows as the values returned by the input function.

``````library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
x = sample(1:50, 5), y = sample(1:50, 5))

# Quartiles of 'x' and quartiles of 'y'
df_2 <- df %>%
summarise(quartiles_x = quantile(x), quartiles_y = quantile(y))

df_2``````
``````  quartiles_x quartiles_y
1           3          18
2          12          37
3          19          42
4          30          44
5          44          50``````

## Summarise data by group with `group_by`

The `summarise` function is particularly useful in conjunction with `group_by`. In this scenario, the new data frame will contain statistical summaries for each group.

The example below calculates the mean for each column based on the groups of the group variable.

``````library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'B' by group
df_2 <- df %>%
group_by(group) %>%
summarise(mean_x = sum(x), mean_y = sum(y))

df_2``````
``````# A tibble: 2 × 3
group mean_x mean_y
<chr>  <int>  <int>
1 G1        52     97
2 G2        56     94``````

Additionally, you can group by more than one categorical variable. In this scenario, the function calculates statistical summaries for each group and subgroup. By default, the output is grouped by the first categorical variable, as indicated by a message.

``````library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
group_2 = sample(c("G3", "G4"), 5, replace = TRUE),
x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group
df_2 <- df %>%
group_by(group, group_2) %>%
summarise(sum_x = sum(x), sum_y = sum(y))

df_2``````
```````summarise()` has grouped output by 'group'. You can override using the `.groups` argument.
# A tibble: 4 × 4
# Groups:   group [2]
group group_2 mean_x mean_y
<chr> <chr>    <int>  <int>
1 G1    G3          74     86
2 G1    G4          18     30
3 G2    G3          48     22
4 G2    G4          37     35``````

The `.groups` argument is optional and can take one of the following values: `"drop_last"` to drop the last level of grouping, `"drop"` to drop all groups, `"keep"` to preserve the original grouping or `"rowwise"`, to treat each row as its own group.

``````library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
group_2 = sample(c("G3", "G4"), 5, replace = TRUE),
x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group
df_2 <- df %>%
group_by(group, group_2) %>%
summarise(sum_x = sum(x), sum_y = sum(y), .groups = "drop")

df_2``````
``````# A tibble: 4 × 4
group group_2 sum_x sum_y
<chr> <chr>   <int> <int>
1 G1    G3         74    86
2 G1    G4         18    30
3 G2    G3         48    22
4 G2    G4         37    35``````

Notice the difference from the previous output:

## Summarise multiple columns

Instead of manually specifying several columns, you can create summaries by selecting them based on a condition using `summarise` in combination with `across`. See the list of helper functions to select columns.

In the following example, the variance of all columns except `group` is calculated, and the resulting columns are renamed using the original column names with the `"_var"` suffix.

``````library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
x = sample(1:50, 5), y = sample(1:50, 5))

# Summarise all columns by their variance except 'group'
df_2 <- df %>%
summarise(across(!group, var, .names = "{.col}_var"))

df_2``````
``````  x_var y_var
1 254.3 149.2``````

The `where` function is highly useful as it enables the selection of columns based on a condition, like choosing only numeric columns using `where(is.numeric)`.

For example, the following code calculates the median for all numeric variables.

``````library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
x = sample(1:50, 5), y = sample(1:50, 5))

# Summarise ALL the NUMERIC columns by their median
df_2 <- df %>%
summarise(across(where(is.numeric), median, .names = "{.col}_median"))

df_2``````
``````  x_median y_median
1       19       42``````