# Aggregate in R

In R, you can use the `aggregate` function to **compute summary statistics for subsets of the data**. This function is very similar to the `tapply` function, but it also accepts a formula or a time series object and, for data frame and formula input, the output is of class `data.frame`. In this tutorial you will learn how to use the R `aggregate` function, with several examples, to aggregate rows by a grouping factor.

## The aggregate() function in R

The syntax of the R `aggregate` function depends on the input data. There are three possible input types: a **data frame**, a **formula** and a **time series object**. The arguments of each method and their descriptions are summarized in the following block:

```
# Data frame
aggregate(x,               # R object
          by,              # List of variables (grouping elements)
          FUN,             # Function to be applied for summary statistics
          ...,             # Additional arguments to be passed to FUN
          simplify = TRUE, # Whether to simplify results as much as possible or not
          drop = TRUE)     # Whether to drop unused combinations of grouping values or not

# Formula
aggregate(formula,             # Input formula
          data,                # List or data frame where the variables are stored
          FUN,                 # Function to be applied for summary statistics
          ...,                 # Additional arguments to be passed to FUN
          subset,              # Observations to be used (optional)
          na.action = na.omit) # How to deal with NA values

# Time series
aggregate(x,                            # Time series object
          nfrequency = 1,               # Observations per unit of time (submultiple of the frequency of x)
          FUN = sum,                    # Function to be applied for summary statistics
          ndeltat = 1,                  # Fraction between successive observations
          ts.eps = getOption("ts.eps"), # Tolerance to determine if 'nfrequency' is a submultiple of the frequency of x
          ...)                          # Additional arguments to be passed to FUN
```

Remember that you can type `help(aggregate)` or `?aggregate` for additional information.
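The time series method is the only one of the three not used in the sections below, so here is a minimal sketch of it. It assumes the built-in `AirPassengers` dataset (a monthly `ts` object with frequency 12), which is not part of the original examples, and sums the monthly values into quarterly and yearly totals.

```r
# AirPassengers is a monthly time series (frequency = 12) shipped with R
quarterly <- aggregate(AirPassengers, nfrequency = 4, FUN = sum) # 4 values per year
yearly    <- aggregate(AirPassengers, nfrequency = 1, FUN = sum) # 1 value per year

frequency(quarterly) # Frequency of the aggregated series
head(yearly)         # First yearly totals
```

Note that `nfrequency` must divide the frequency of the original series, so a monthly series can be aggregated to 1, 2, 3, 4 or 6 observations per year.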

In the following sections we will show examples and use cases of aggregating data, such as computing the mean, the count or the quantiles by group. Using `aggregate` in R is very simple, and it is worth mentioning that **you can apply any function you want, even a custom function**.
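As an illustration of the last point, the following sketch applies a user-defined function by group. The helper `cv` (coefficient of variation, in percent) is a hypothetical name introduced only for this example, and the data come from the built-in `chickwts` dataset.

```r
# Hypothetical helper: coefficient of variation, in percent
cv <- function(x) 100 * sd(x) / mean(x)

# Apply the custom function to the chick weights, grouped by feed type
cv_by_feed <- aggregate(weight ~ feed, data = chickwts, FUN = cv)
cv_by_feed
```

Any function that takes a vector and returns a summary can be passed to `FUN` in the same way.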

## Aggregate mean in R by group

Consider, for instance, the following dataset, which contains the weight and the type of feed of a sample of chickens:

```
df <- chickwts
head(df)
```

```
weight feed
179 horsebean
160 horsebean
136 horsebean
227 horsebean
217 horsebean
168 horsebean
```

To compute the mean by group with the `aggregate` function, specify the numerical variable as the first argument, the categorical variable (as a list) as the second and the function to be applied (in this case `mean`) as the third. An alternative is to specify a formula of the form `numerical ~ categorical`.

```
group_mean <- aggregate(df$weight, list(df$feed), mean)
# Equivalent, but the output columns are named "feed" and "weight":
# group_mean <- aggregate(weight ~ feed, data = df, mean)
group_mean
```

```
Group.1 x
casein 323.5833
horsebean 160.2000
linseed 218.7500
meatmeal 276.9091
soybean 246.4286
sunflower 328.9167
```

Note that the grouping variable is coerced to a factor. Consequently, you can also use a numerical variable to represent groups.
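For example, the following sketch groups by a numeric column. It assumes the built-in `mtcars` dataset, which is not used elsewhere in this tutorial, and computes the mean miles per gallon for each number of cylinders.

```r
# cyl is stored as a numeric column, not as a factor
is.numeric(mtcars$cyl)

# The numeric values of cyl are treated as group labels
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl
```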

However, you might have noticed that, when the variables are passed as vectors rather than through a formula, the column names of the resulting data frame (`Group.1` and `x`) don't describe the variables. To modify the column names of the output, you can use the `colnames` function as follows:

```
colnames(group_mean) <- c("Group", "Mean")
group_mean
```

```
Group Mean
casein 323.5833
horsebean 160.2000
linseed 218.7500
meatmeal 276.9091
soybean 246.4286
sunflower 328.9167
```
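Another option is to name the inputs so that no renaming is needed afterwards. The sketch below passes a one-column data frame as `x` and a named list as `by`; the object name `group_mean_named` is chosen here for illustration only.

```r
# The output takes its column names from the names of `by` and of the columns of `x`
group_mean_named <- aggregate(chickwts["weight"],
                              by = list(feed = chickwts$feed),
                              FUN = mean)
names(group_mean_named)
```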

## Aggregate count

Sometimes it can be useful to know the number of elements in each group of a categorical variable. Although you could use the `table` function, if you want the output to be a data frame you can get the count by passing the `length` function to `aggregate`.

```
aggregate(chickwts$feed, by = list(chickwts$feed), FUN = length)
aggregate(weight ~ feed, data = chickwts, FUN = length) # Same counts; columns are named "feed" and "weight"
```

```
Group.1 x
casein 12
horsebean 10
linseed 12
meatmeal 11
soybean 14
sunflower 12
```
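To check the result against `table`, you can convert the table to a data frame and compare the counts. This is only a sketch; the object names `counts_aggregate` and `counts_table` are illustrative.

```r
# Counts with aggregate(): columns "feed" and "x"
counts_aggregate <- aggregate(chickwts$feed, by = list(feed = chickwts$feed), FUN = length)

# Counts with table(), converted to a data frame: columns "feed" and "Freq"
counts_table <- as.data.frame(table(feed = chickwts$feed))

# Both approaches give the same counts, in the same order of factor levels
all(counts_aggregate$x == counts_table$Freq)
```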

## Aggregate quantile

In this section we are going to use a time series object of class `xts` as an example, although you could use a data frame instead. Consider the following sample object, which represents the daily returns of an investment fund over a year. The `dmy` and `month` functions used below come from the `lubridate` package:

```
# install.packages(c("xts", "lubridate"))
library(xts)
library(lubridate) # Provides dmy() and month()

set.seed(1)
Dates <- seq(dmy("01/01/2014"), dmy("01/01/2015"), by = "day")
Return <- rnorm(length(Dates))
tserie <- xts(Return, Dates)
head(tserie)
```

```
[, 1]
2014-01-01 -0.6264538
2014-01-02 0.1836433
2014-01-03 -0.8356286
2014-01-04 1.5952808
2014-01-05 0.3295078
2014-01-06 -0.8204684
```

In this scenario, you may be interested in aggregating the quantiles by date (aggregating daily data to monthly or weekly, for instance). For example, you can calculate the 5% and 95% quantiles of the returns of each month by typing:

```
dat <- aggregate(tserie ~ month(index(tserie)), FUN = quantile,
                 probs = c(0.05, 0.95))
colnames(dat)[1] <- "Month"
dat
```

```
Month V1.5% V2.95%
1 -1.7041221 1.2301178
2 -0.9388331 1.7888182
3 -1.3886001 1.5261942
4 -1.0834517 1.5697239
5 -1.6527890 1.0883061
6 -1.4064641 2.1472166
7 -1.3117488 1.6377309
8 -1.8666917 1.1298708
9 -1.8513285 1.4615978
10 -1.0359233 1.7162873
11 -1.4453582 1.3389949
12 -2.0918996 1.5258865
```

Note that you can pass additional arguments to the function you are applying by adding them, separated by commas, after the `FUN` argument.
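A common use of this mechanism is passing `na.rm = TRUE` to ignore missing values. The sketch below assumes the built-in `airquality` dataset, whose `Ozone` column contains `NA` values, and uses the data frame interface, since the formula interface already removes incomplete rows through its default `na.action = na.omit`.

```r
# Without na.rm, any group that contains an NA gets an NA mean
ozone_raw <- aggregate(airquality$Ozone, by = list(Month = airquality$Month),
                       FUN = mean)

# na.rm = TRUE is forwarded to mean() through the ... argument
ozone_mean <- aggregate(airquality$Ozone, by = list(Month = airquality$Month),
                        FUN = mean, na.rm = TRUE)
ozone_mean
```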

## Aggregate by multiple columns in R

Finally, it is worth mentioning that it is possible to **aggregate by more than one variable**. There are three options: grouping by more than one categorical variable, summarizing multiple numerical variables, or both at the same time.

On the one hand, we are going to create a **new categorical variable** named `cat_var`.

```
set.seed(1)
cat_var <- sample(c("A", "B", "C"), nrow(df), replace = TRUE)
df_2 <- cbind(df, cat_var)
head(df_2)
```

```
weight feed cat_var
179 horsebean A
160 horsebean C
136 horsebean A
227 horsebean B
217 horsebean A
168 horsebean C
```

Now, you can use the `aggregate` function to compute the sum of the weights for each combination of the two categorical variables:

```
aggregate(df_2$weight, by = list(df_2$feed, df_2$cat_var), FUN = sum)
# Equivalent to:
aggregate(weight ~ feed + cat_var, data = df_2, FUN = sum)
```

```
feed cat_var weight
casein A 1005
horsebean A 532
linseed A 1079
meatmeal A 242
soybean A 1738
sunflower A 882
casein B 1131
horsebean B 494
linseed B 780
meatmeal B 2244
soybean B 1355
sunflower B 2109
casein C 1747
horsebean C 576
linseed C 766
meatmeal C 560
soybean C 357
sunflower C 956
```

When the `aggregate` function is applied to several categorical variables, the statistical summary is computed for each combination of their levels. With the default `drop = TRUE`, only the combinations that actually occur in the data appear in the output.
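The following sketch shows how to check which combinations appear in the result. It assumes the built-in `mtcars` dataset and groups by the number of cylinders and gears; the object name `combos` is illustrative.

```r
# Mean miles per gallon for each cylinders/gears combination
combos <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

# Cross-tabulation of the two grouping variables; cells with 0 have no rows
observed <- table(mtcars$cyl, mtcars$gear)

# One output row per combination that has at least one observation
nrow(combos) == sum(observed > 0)
```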

On the other hand, we are going to create a new numeric variable named `num_var`.

```
set.seed(1)
num_var <- rnorm(nrow(df))
df_3 <- cbind(num_var, df)
head(df_3)
```

```
num_var weight feed
-0.6264538 179 horsebean
0.1836433 160 horsebean
-0.8356286 136 horsebean
1.5952808 227 horsebean
0.3295078 217 horsebean
-0.8204684 168 horsebean
```

In this scenario, when working with two or more numerical variables, you can make use of the `cbind` function to combine them:

`aggregate(cbind(df_3$num_var, df_3$weight), list(df_3$feed), mean)`

```
Group.1 V1 V2
casein 0.4043795 323.5833
horsebean 0.1322028 160.2000
linseed 0.3491303 218.7500
meatmeal 0.2125804 276.9091
soybean -0.2314387 246.4286
sunflower 0.1651836 328.9167
```

Thus, the statistical summary is created for the numeric variables based on the factor.

You could also apply the function with multiple numerical and categorical variables. In this situation, there would be as many summary columns as numerical variables and as many rows as combinations of groups present in the data.
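A minimal sketch of this last case, assuming the built-in `mtcars` dataset rather than the data frames created above: two numerical variables (`mpg` and `hp`) are summarized for every observed combination of two grouping variables (`cyl` and `am`).

```r
# cbind() on the left-hand side selects several numerical variables;
# `+` on the right-hand side combines several grouping variables
multi <- aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)
multi
```

With the formula interface the output columns keep the variable names, so no renaming is needed.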