# Create statistical summaries in R with the summarise() function from dplyr

The `summarise` (or `summarize`) function is used for aggregating and summarizing data. It’s particularly helpful for condensing data into a single row per group, offering various statistical summaries or computations for each group. This function creates a new data frame with the specified summary statistics.

## Syntax

The `summarise` or `summarize` function takes a dataset as input and creates a new one with columns calculated by applying a function to one or multiple columns from the original data. The syntax is as follows:

`summarise(data, new_column = function(column))`

`summarise` and `summarize` are the same function; only the spelling differs.
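As a quick check of the syntax, the sketch below runs the same call with both spellings on a small hand-made data frame. The `scores` data frame and its `points` column are hypothetical and exist only for this illustration.

```
library(dplyr)

# Hypothetical data, typed by hand so the result is easy to verify
scores <- data.frame(points = c(10, 20, 30, 40))

# The same summary written with both spellings
british  <- summarise(scores, mean_points = mean(points))
american <- summarize(scores, mean_points = mean(points))

british                        # one row, mean_points = 25
identical(british, american)   # TRUE
```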

## Statistical summaries of the data

Given a dataset, you can **generate a new data frame containing statistical summaries of specific variables from the original data frame**. The table below describes some of the most useful functions for use with `summarise`, such as `mean` or `sum`.

Function | Description |
---|---|
mean() | Mean of the values |
median() | Median of the values |
sd(), var() | Standard deviation and variance of the values |
quantile() | Quantiles of the values |
IQR() | Interquartile range |
min(), max() | Minimum and maximum value |
first() | First value |
last() | Last value |
nth() | Nth value |
n() | Number of elements per group |
n_distinct() | Number of unique values |
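
Several of these functions can be combined in a single `summarise` call. The sketch below uses a hypothetical `values` data frame, typed by hand so that each result can be checked by eye; the column names on the left-hand side are arbitrary.

```
library(dplyr)

# Hypothetical measurements, for illustration only
values <- data.frame(x = c(4, 8, 8, 15, 20))

overview <- values %>%
  summarise(
    n_rows   = n(),            # number of rows: 5
    n_unique = n_distinct(x),  # distinct values: 4
    first_x  = first(x),       # 4
    second_x = nth(x, 2),      # 8
    last_x   = last(x),        # 20
    median_x = median(x),      # 8
    iqr_x    = IQR(x)          # 15 - 8 = 7
  )
overview
```

Without a preceding `group_by`, the whole data frame is treated as one group, so `n()` simply returns the number of rows.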

In the example below we demonstrate how to generate a new data frame containing the **average of the numerical variables** from the original data frame.

```
library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Mean of 'x' and mean of 'y'
df_2 <- df %>%
  summarise(mean_x = mean(x), mean_y = mean(y))
df_2
```

```
  mean_x mean_y
1   21.6   38.2
```

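Note that the resulting output will contain as many rows as the values returned by the input function. In dplyr 1.1.0 and later, returning more than one value per group from `summarise` still works but raises a deprecation warning that recommends `reframe` instead.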

```
library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Quartiles of 'x' and quartiles of 'y'
df_2 <- df %>%
  summarise(quartiles_x = quantile(x), quartiles_y = quantile(y))
df_2
```

```
  quartiles_x quartiles_y
1           3          18
2          12          37
3          19          42
4          30          44
5          44          50
```
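
For multi-row results like the quartiles above, recent versions of dplyr provide `reframe`. The sketch below assumes dplyr 1.1.0 or later, where `reframe` was introduced; the `values` data frame and the `prob` and `quartile_x` column names are hypothetical.

```
library(dplyr)  # assumes dplyr >= 1.1.0, where reframe() is available

# Hypothetical data, typed by hand
values <- data.frame(x = c(3, 12, 19, 30, 44))
probs  <- c(0, 0.25, 0.5, 0.75, 1)

# reframe() may return any number of rows per group
quartiles <- values %>%
  reframe(prob = probs, quartile_x = unname(quantile(x, probs)))
quartiles
```

Storing the probabilities next to the quantiles keeps each row self-describing, which a bare `quantile(x)` column does not.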

## Summarise data by group with `group_by`

The `summarise` function is particularly useful in conjunction with `group_by`. In this scenario, the new data frame will contain **statistical summaries for each group**.

The example below calculates the sum of each column for each level of the `group` variable.

```
library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group
df_2 <- df %>%
  group_by(group) %>%
  summarise(sum_x = sum(x), sum_y = sum(y))
df_2
```

```
# A tibble: 2 × 3
  group sum_x sum_y
  <chr> <int> <int>
1 G1       52    97
2 G2       56    94
```
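
Because the example above relies on random sampling, it can help to see the same pattern on data small enough to verify by hand. The following sketch uses a hypothetical `sales` data frame; the store names and amounts are made up for illustration.

```
library(dplyr)

# Hypothetical sales records, typed by hand
sales <- data.frame(
  store  = c("north", "north", "south", "south", "south"),
  amount = c(10, 30, 5, 15, 40)
)

# One output row per store: row count, total and average amount
by_store <- sales %>%
  group_by(store) %>%
  summarise(n_sales = n(), total = sum(amount), average = mean(amount))
by_store
```

Here `north` has 2 rows totalling 40 and `south` has 3 rows totalling 60, so both stores end up with an average of 20.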

Additionally, you can **group by more than one categorical variable**. In this scenario, the function calculates statistical summaries for each combination of group and subgroup. By default, `summarise` drops only the last grouping level, so the output remains grouped by the first categorical variable, as indicated by a message.

```
library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 group_2 = sample(c("G3", "G4"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group and group_2
df_2 <- df %>%
  group_by(group, group_2) %>%
  summarise(sum_x = sum(x), sum_y = sum(y))
df_2
```

```
`summarise()` has grouped output by 'group'. You can override using the `.groups` argument.
# A tibble: 4 × 4
# Groups:   group [2]
  group group_2 sum_x sum_y
  <chr> <chr>   <int> <int>
1 G1    G3         74    86
2 G1    G4         18    30
3 G2    G3         48    22
4 G2    G4         37    35
```

The `.groups` argument is **optional** and can take one of the following values: `"drop_last"` to drop the last level of grouping, `"drop"` to drop all groups, `"keep"` to preserve the original grouping, or `"rowwise"` to treat each row as its own group.

```
library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 group_2 = sample(c("G3", "G4"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Sum of 'x' and sum of 'y' by group and group_2, returning an ungrouped result
df_2 <- df %>%
  group_by(group, group_2) %>%
  summarise(sum_x = sum(x), sum_y = sum(y), .groups = "drop")
df_2
```

```
# A tibble: 4 × 4
  group group_2 sum_x sum_y
  <chr> <chr>   <int> <int>
1 G1    G3         74    86
2 G1    G4         18    30
3 G2    G3         48    22
4 G2    G4         37    35
```

Notice the difference from the previous output: the message and the `# Groups:` line no longer appear, because the result is now ungrouped.
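
To compare the `.groups` options directly, the sketch below inspects the grouping that remains after summarising with `group_vars`. The `df_groups` data frame and its column names are hypothetical; the `"rowwise"` option is left out to keep the example short.

```
library(dplyr)

# Hypothetical data with two grouping columns
df_groups <- data.frame(
  g1 = c("a", "a", "b", "b"),
  g2 = c("u", "v", "u", "v"),
  x  = 1:4
)
grouped <- df_groups %>% group_by(g1, g2)

# group_vars() lists the grouping columns still attached to the result
after_drop_last <- grouped %>%
  summarise(total = sum(x), .groups = "drop_last") %>%
  group_vars()   # "g1"

after_drop <- grouped %>%
  summarise(total = sum(x), .groups = "drop") %>%
  group_vars()   # character(0)

after_keep <- grouped %>%
  summarise(total = sum(x), .groups = "keep") %>%
  group_vars()   # "g1" "g2"
```

Setting `.groups` explicitly also silences the informational message, which is convenient inside scripts and functions.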

## Summarise multiple columns

Instead of manually specifying several columns, you can create summaries by **selecting them based on a condition using** `summarise` in combination with `across`. See the list of helper functions to select columns.

In the following example, the variance of all columns except `group` is calculated, and the resulting columns are renamed using the original column names with the `"_var"` suffix.

```
library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Summarise all columns by their variance except 'group'
df_2 <- df %>%
  summarise(across(!group, var, .names = "{.col}_var"))
df_2
```

```
  x_var y_var
1 254.3 149.2
```
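
`across` also accepts a named list of functions, in which case `{.fn}` in `.names` expands to each list name. The sketch below is a minimal illustration on a hypothetical `measures` data frame; the list names `mean` and `sd` are arbitrary labels chosen for the output columns.

```
library(dplyr)

# Hypothetical data, typed by hand
measures <- data.frame(
  group = c("G1", "G1", "G2", "G2"),
  x = c(2, 4, 6, 8),
  y = c(10, 20, 30, 40)
)

# Apply two functions to each selected column; names become <column>_<function>
stats <- measures %>%
  summarise(across(c(x, y), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
names(stats)   # "x_mean" "x_sd" "y_mean" "y_sd"
```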

The `where` function is **highly useful** as it enables the selection of columns based on a condition, like choosing only numeric columns using `where(is.numeric)`.

For example, the following code **calculates the median for all numeric variables**.

```
library(dplyr)

set.seed(9)
df <- data.frame(group = sample(c("G1", "G2"), 5, replace = TRUE),
                 x = sample(1:50, 5), y = sample(1:50, 5))

# Summarise ALL the NUMERIC columns by their median
df_2 <- df %>%
  summarise(across(where(is.numeric), median, .names = "{.col}_median"))
df_2
```

```
  x_median y_median
1       19       42
```
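
The same selection also works after `group_by`, in which case the summary is computed per group and non-numeric columns are left out. The sketch below assumes a hypothetical `survey` data frame typed by hand; its `label` column is included only to show that character columns are skipped.

```
library(dplyr)

# Hypothetical data with two numeric columns and one character column
survey <- data.frame(
  group  = c("G1", "G1", "G2", "G2", "G2"),
  height = c(150, 170, 160, 180, 200),
  weight = c(50, 70, 60, 80, 100),
  label  = c("a", "b", "c", "d", "e")
)

# Median of every numeric column, computed separately for each group
medians <- survey %>%
  group_by(group) %>%
  summarise(across(where(is.numeric), median, .names = "{.col}_median"))
medians
```

The result has one row per group with the columns `group`, `height_median` and `weight_median`; `label` does not appear because it is not numeric.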