Aggregate in R

In R, you can use the aggregate function to compute summary statistics for subsets of the data. This function is very similar to the tapply function, but it also accepts a formula or a time series object and, for data frame and formula inputs, the output is of class data.frame. In this tutorial you will learn how to use the R aggregate function, with several examples, to aggregate rows by a grouping factor.
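As a quick illustration of the difference from tapply, the following sketch computes the same group means both ways on the built-in chickwts dataset. The object names tap and agg are only illustrative.

```r
# Mean weight per feed type, computed two ways
tap <- tapply(chickwts$weight, chickwts$feed, mean)          # named 1-d array
agg <- aggregate(weight ~ feed, data = chickwts, FUN = mean) # data frame

class(tap)
class(agg)
```

Both objects hold the same six means; only the container differs, which matters when the result is going to be merged or plotted as a data frame.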

The aggregate() function in R

The syntax of the R aggregate function depends on the input data. There are three possible input types: a data frame, a formula and a time series object. The arguments of each method and their descriptions are summarized in the following block:

# Data frame
aggregate(x,               # R object
          by,              # List of variables (grouping elements)
          FUN,             # Function to be applied for summary statistics
          ...,             # Additional arguments to be passed to FUN
          simplify = TRUE, # Whether to simplify results as much as possible or not
          drop = TRUE)     # Whether to drop unused combinations of grouping values or not.

# Formula
aggregate(formula,             # Input formula
          data,                # List or data frame where the variables are stored
          FUN,                 # Function to be applied for summary statistics
          ...,                 # Additional arguments to be passed to FUN
          subset,              # Observations to be used (optional)
          na.action = na.omit) # How to deal with NA values

# Time series
aggregate(x,                   # Time series object
          nfrequency = 1,      # Observations per unit of time (submultiple of x)
          FUN = sum,           # Function to be applied for summary statistics
          ndeltat = 1,         # Fraction between successive observations
          ts.eps = getOption("ts.eps"), # Tolerance to determine if 'nfrequency' is a submultiple of the frequency of x
          ...)                 # Additional arguments to be passed to FUN

Remember that you can type help(aggregate) or ?aggregate for additional information.
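The time series method is the least familiar of the three, so here is a minimal sketch using the built-in monthly AirPassengers series (an example dataset chosen here for illustration). It aggregates the monthly values to quarterly totals and to annual means.

```r
# AirPassengers is a monthly ts object (frequency = 12) covering 1949-1960
quarterly_total <- aggregate(AirPassengers, nfrequency = 4, FUN = sum)
annual_mean     <- aggregate(AirPassengers, nfrequency = 1, FUN = mean)

frequency(quarterly_total) # new number of observations per year
head(quarterly_total)
head(annual_mean)
```

Note that nfrequency must divide the original frequency evenly (12 is a multiple of both 4 and 1), and that the result is again a time series rather than a data frame.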

The following sections show examples and use cases of aggregating data, such as computing the mean, the count or the quantiles by group. Using aggregate in R is very simple, and it is worth mentioning that you can apply any function you want, even a custom function.
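To make the point about custom functions concrete, here is a small sketch. The helper range_width is a hypothetical function defined only for this example; it returns the difference between the largest and smallest value of each group.

```r
# Hypothetical custom summary: spread between the maximum and the minimum
range_width <- function(v) max(v) - min(v)

# Apply it to the weight of each feed group in the built-in chickwts data
spread <- aggregate(weight ~ feed, data = chickwts, FUN = range_width)
spread
```

An anonymous function works just as well, for example FUN = function(v) max(v) - min(v).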

Aggregate mean in R by group

Consider, for instance, the following dataset, which contains the weight and the type of feed of a sample of chickens:

df <- chickwts
head(df)

   weight     feed
    179    horsebean
    160    horsebean
    136    horsebean
    227    horsebean
    217    horsebean
    168    horsebean

To compute group means with aggregate, pass the numerical variable as the first argument, the categorical variable (wrapped in a list) as the second and the function to be applied (in this case mean) as the third. An alternative is to specify a formula of the form: numerical ~ categorical.

group_mean <- aggregate(df$weight, list(df$feed), mean)
# Same values with the formula interface (columns are then named feed and weight):
# aggregate(weight ~ feed, data = df, mean)
group_mean

    Group.1      x
    casein   323.5833
 horsebean   160.2000
   linseed   218.7500
  meatmeal   276.9091
   soybean   246.4286
 sunflower   328.9167

Note that the grouping variable is coerced to a factor. Consequently, you can also use a numerical variable to represent groups.
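For instance, the following sketch groups by a numeric column. It uses the built-in mtcars dataset (chosen here for illustration), where cyl is stored as a number but only takes the values 4, 6 and 8.

```r
# cyl is numeric in mtcars, yet it works as a grouping variable
is.numeric(mtcars$cyl)

mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl
```

Each distinct value of cyl is treated as one group, so the result has one row per number of cylinders.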

However, you might have noticed that the column names of the resulting data frame (Group.1 and x, produced by the non-formula call) don’t represent the variables. In order to modify the column names of the output, you can use the colnames function as follows:

colnames(group_mean) <- c("Group", "Mean")
group_mean

      Group     Mean
    casein   323.5833
 horsebean   160.2000
   linseed   218.7500
  meatmeal   276.9091
   soybean   246.4286
 sunflower   328.9167
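As an alternative to renaming afterwards, the names can be supplied in the call itself by passing named lists. This is a sketch of that approach; the names Group and Mean simply mirror the ones used above.

```r
# Named lists for both the values and the grouping variable
named_mean <- aggregate(list(Mean = chickwts$weight),
                        by = list(Group = chickwts$feed),
                        FUN = mean)
named_mean
```

The formula interface achieves something similar automatically, since it reuses the variable names that appear in the formula.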

Aggregate count

Sometimes it can be useful to know the number of elements in each group of a categorical variable. Although you could use the table function, if you want the output to be a data frame, you can get the count by applying the length function with aggregate.

aggregate(chickwts$feed, by = list(chickwts$feed), FUN = length)
# Same counts with the formula interface (columns are then named feed and weight):
# aggregate(weight ~ feed, data = chickwts, FUN = length)

    Group.1    x
     casein   12
  horsebean   10
    linseed   12
   meatmeal   11
    soybean   14
  sunflower   12
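For comparison, the table-based route mentioned above can also be turned into a data frame. The sketch below builds the counts both ways; the object names are illustrative.

```r
# Counts via aggregate (data frame with columns Group.1 and x)
counts_agg <- aggregate(chickwts$feed, by = list(chickwts$feed), FUN = length)

# Counts via table, converted to a data frame (columns feed and Freq)
counts_tab <- as.data.frame(table(feed = chickwts$feed))

counts_agg
counts_tab
```

One practical difference: table also reports factor levels with zero observations, whereas aggregate by default only returns groups that are present in the data.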

Aggregate quantile

In this section we are going to use a time series object of class xts as an example, although you could apply the function to a data frame instead. Consider the following sample object, which represents the daily returns of an investment fund over a year:

# install.packages("lubridate")
library(lubridate) # Provides dmy() and month()

set.seed(1)

Dates <- seq(dmy("01/01/2014"), dmy("01/01/2015"), by = "day")
Return <- rnorm(length(Dates))

# install.packages("xts")
library(xts)
tserie <- xts(Return, Dates)

head(tserie)

                 [, 1]
2014-01-01  -0.6264538
2014-01-02   0.1836433
2014-01-03  -0.8356286
2014-01-04   1.5952808
2014-01-05   0.3295078
2014-01-06  -0.8204684

In this scenario, you may be interested in aggregating the quantiles by date (from daily data to monthly or weekly summaries, for instance). Hence, you can calculate the 5% and 95% quantiles of the returns of each month by typing:

dat <- aggregate(tserie ~ month(index(tserie)), FUN = quantile,
                 probs = c(0.05, 0.95))
colnames(dat)[1] <- "Month"
dat

  Month    V1.5%      V2.95%
    1   -1.7041221   1.2301178
    2   -0.9388331   1.7888182
    3   -1.3886001   1.5261942
    4   -1.0834517   1.5697239
    5   -1.6527890   1.0883061
    6   -1.4064641   2.1472166
    7   -1.3117488   1.6377309
    8   -1.8666917   1.1298708
    9   -1.8513285   1.4615978
   10   -1.0359233   1.7162873
   11   -1.4453582   1.3389949
   12   -2.0918996   1.5258865

Note that any additional arguments of the function you are applying (here, probs) can be passed after the FUN argument, separated by commas.
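The same idea works on a plain data frame without any extra packages. The following sketch computes the 5% and 95% quantiles of weight for each feed type in chickwts. Because quantile returns two values per group, the weight column of the result is a matrix, and the last line is one possible way (an assumption of this example, not the only option) to flatten it into ordinary columns.

```r
# probs is forwarded to quantile() through the ... argument
q <- aggregate(weight ~ feed, data = chickwts, FUN = quantile,
               probs = c(0.05, 0.95))

is.matrix(q$weight) # the two quantiles are stored as a matrix column

# Flatten the matrix column into separate data frame columns
q_flat <- do.call(data.frame, q)
q_flat
```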

Aggregate by multiple columns in R

Finally, it is worth mentioning that it is possible to aggregate by more than one variable. There are three options: grouping by more than one categorical variable, summarizing multiple numerical variables, or both at the same time.

On the one hand, we are going to create a new categorical variable named cat_var.

set.seed(1)

cat_var <- sample(c("A", "B", "C"), nrow(df), replace = TRUE)
df_2 <- cbind(df, cat_var)
head(df_2)

 weight    feed      cat_var
  179    horsebean       A
  160    horsebean       C
  136    horsebean       A
  227    horsebean       B
  217    horsebean       A
  168    horsebean       C

Now, you can use the aggregate function to compute the sum of weight for each combination of the two categorical variables:

aggregate(df_2$weight, by = list(df_2$feed, df_2$cat_var), FUN = sum)

# Equivalent to:
aggregate(weight ~ feed + cat_var, data = df_2, FUN = sum)

      feed   cat_var  weight
    casein      A     1005
 horsebean      A      532
   linseed      A     1079
  meatmeal      A      242
   soybean      A     1738
 sunflower      A      882
    casein      B     1131
 horsebean      B      494
   linseed      B      780
  meatmeal      B     2244
   soybean      B     1355
 sunflower      B     2109
    casein      C     1747
 horsebean      C      576
   linseed      C      766
  meatmeal      C      560
   soybean      C      357
 sunflower      C      956

When the aggregate function is applied to several categorical variables, the statistical summary is computed for every combination of their levels that occurs in the data.
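How combinations of grouping levels are handled is controlled by the drop argument listed in the syntax block. The sketch below uses a small made-up data frame named toy, in which the combination g1 = "b" with g2 = "y" never occurs, to compare the default behavior with drop = FALSE.

```r
# Made-up data: the pair (g1 = "b", g2 = "y") is absent
toy <- data.frame(
  value = c(1, 2, 3, 4),
  g1    = c("a", "a", "b", "b"),
  g2    = c("x", "y", "x", "x")
)

groups <- list(g1 = toy$g1, g2 = toy$g2)

# Default (drop = TRUE): only combinations present in the data
observed <- aggregate(toy$value, by = groups, FUN = sum)

# drop = FALSE: unused combinations are kept, with NA as the summary
all_combos <- aggregate(toy$value, by = groups, FUN = sum, drop = FALSE)

observed
all_combos
```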

On the other hand, we are going to create a new numeric variable named num_var.

set.seed(1)

num_var <- rnorm(nrow(df))

df_3 <- cbind(num_var, df)

head(df_3)

    num_var   weight      feed
 -0.6264538    179    horsebean
  0.1836433    160    horsebean
 -0.8356286    136    horsebean
  1.5952808    227    horsebean
  0.3295078    217    horsebean
 -0.8204684    168    horsebean

In this scenario, when working with two or more numerical variables, you can use the cbind function to combine them as columns:

aggregate(cbind(df_3$num_var, df_3$weight), list(df_3$feed), mean)

   Group.1       V1       V2
    casein   0.4043795  323.5833
 horsebean   0.1322028  160.2000
   linseed   0.3491303  218.7500
  meatmeal   0.2125804  276.9091
   soybean  -0.2314387  246.4286
 sunflower   0.1651836  328.9167

Thus, the statistical summary is created for the numeric variables based on the factor.

You could also apply the function with multiple numerical and categorical variables. In this situation, there would be as many summaries as numerical variables and as many groups as combinations of the categorical variables.
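A minimal sketch of this combined case, using the built-in mtcars dataset as an assumed example: two numerical variables (mpg and hp) are summarized for every combination of two grouping variables (cyl and am).

```r
# cbind() on the left-hand side selects the variables to summarize;
# the terms joined by + on the right-hand side define the groups
both <- aggregate(cbind(mpg, hp) ~ cyl + am, data = mtcars, FUN = mean)
both
```

With the formula interface the output columns keep the original variable names (cyl, am, mpg and hp), so no renaming step is needed.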