tapply in R

What does tapply mean in R? The tapply function applies a function to the subsets of a vector defined by the levels of one or several factors, which makes it a convenient way to compute statistical summaries by group. In this tutorial you will learn how to use tapply in R in several scenarios with examples.

The tapply function

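The R tapply function belongs to the apply family, but instead of working over the rows or columns of a matrix as apply does, it splits a vector into groups and applies a function to each group. The following block of code shows the function syntax and a simplified description of each argument.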

tapply(X,               # Object that can be split by INDEX (typically a vector)
       INDEX,           # Factor, or list of factors, each of the same length as X
       FUN,             # Function to be applied to each group of X (or NULL)
       ...,             # Additional arguments to be passed to FUN
       default = NA,    # Value used to fill cells that contain no observations
       simplify = TRUE) # If set to FALSE, always returns an array of mode list

Note that the first three arguments are the ones most commonly used, and that it is common to omit the argument names when calling apply family functions because their syntax is simple.
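
As a minimal sketch of the syntax above, the following block calls tapply on a small made-up vector, first positionally and then with the argument names spelled out. The names scores and groups are hypothetical and only serve the illustration.

```r
# Hypothetical toy data: five scores and the group each one belongs to
scores <- c(10, 20, 30, 40, 50)
groups <- factor(c("a", "b", "a", "b", "a"))

# Positional call: X = scores, INDEX = groups, FUN = sum
totals <- tapply(scores, groups, sum)

# The same call with the argument names spelled out
totals_named <- tapply(X = scores, INDEX = groups, FUN = sum)

totals  # a = 90, b = 60
```

Both calls should return the same result, so the shorter positional form is usually preferred.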

How to use tapply in R?

The tapply function is very easy to use in R. First, consider the following example dataset, which represents the price of some objects, their type and the store where they were sold.

set.seed(2)

data_set <- data.frame(price = round(rnorm(25, sd = 10, mean = 30)),
                       type = sample(1:4, size = 25, replace = TRUE),
                       store = sample(paste("Store", 1:4),
                                      size = 25, replace = TRUE))

head(data_set)
price   type   store  
  21     2    Store 2
  32     3    Store 3
  46     4    Store 4
  19     3    Store 4
  29     1    Store 4
  31     3    Store 4

Second, store the columns as variables and convert the column named type to a factor.

price <- data_set$price
store <- data_set$store
type <- factor(data_set$type,
               labels = c("toy", "food", "electronics", "drinks"))
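
The following sketch shows how factor pairs numeric codes with labels, using a hypothetical vector named codes. The labels are assigned to the sorted unique values, not to the order in which the values appear.

```r
# Hypothetical numeric codes, deliberately unsorted
codes <- c(2, 3, 4, 1)

# Labels are matched to the sorted unique values: 1 -> toy, 2 -> food, ...
f <- factor(codes, labels = c("toy", "food", "electronics", "drinks"))
as.character(f)  # "food" "electronics" "drinks" "toy"

# Stating the levels explicitly keeps the mapping fixed even if a code is absent
g <- factor(c(2, 3), levels = 1:4,
            labels = c("toy", "food", "electronics", "drinks"))
levels(g)        # all four labels, although only two codes are present
```

Passing levels explicitly is a defensive choice: without it, a sample that happens to miss one of the codes would have fewer levels than labels.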

Finally, you can use the tapply function to calculate the mean price by type of object as follows:

# Mean price by product type
mean_prices <- tapply(price, type, mean)
mean_prices
        toy        food electronics      drinks 
   39.50000    30.33333    32.20000    29.33333 

Note that X and every factor passed to INDEX must have the same length, which you can verify with the length function. Note also that the default output is of class “array”.

class(mean_prices) # "array"

Hence, if needed, you can access each element of the output by specifying the desired index or name in square brackets.

mean_prices[2] # 30.33333
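
The length requirement and the array output can be checked on a small example. This is a sketch with hypothetical names (x, g, bad_g), not part of the dataset above.

```r
# Hypothetical data: four values in two groups
x <- c(5, 7, 9, 11)
g <- factor(c("u", "v", "u", "v"))

length(x) == length(g)        # TRUE, so the call below is valid
means <- tapply(x, g, mean)
is.array(means)               # TRUE
means["v"]                    # element selected by name: 9
means[1]                      # element selected by position: 7

# A grouping factor of a different length makes tapply stop with an error
bad_g <- factor(c("u", "v", "u"))
outcome <- tryCatch(tapply(x, bad_g, mean),
                    error = function(e) "error")
outcome                       # "error"
```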

However, if you set the simplify argument to FALSE, the result is returned as a list (more precisely, an array of mode list) instead of a numeric array.

# Mean price by product type
mean_prices_list <- tapply(price, type, mean, simplify = FALSE)
mean_prices_list
$toy
[1] 39.5

$food
[1] 30.33333

$electronics
[1] 32.2

$drinks
[1] 29.33333

In this case, you can access the output elements with the $ sign and the element name.

mean_prices_list$toy # 39.5
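
The list form is most useful when the applied function returns more than one value per group. The sketch below uses hypothetical data to show both cases: simplify = FALSE with mean, and range, which returns two values and therefore cannot be simplified to a numeric array.

```r
# Hypothetical data: four values in two groups
x <- c(5, 7, 9, 11)
g <- factor(c("u", "v", "u", "v"))

# simplify = FALSE keeps one list element per group
mean_list <- tapply(x, g, mean, simplify = FALSE)
is.list(mean_list)   # TRUE
mean_list[["u"]]     # 7

# range returns two values per group, so the result is list-like as well
ranges <- tapply(x, g, range)
ranges[["v"]]        # 7 11
```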

Additional arguments example: Ignore NA

Suppose that your data frame contains some NA values in its columns.

# Add some NA values to the data set
data_set[1, 1] <- NA
data_set[2, 3] <- NA

# Mean price by store
tapply(data_set$price, data_set$store, mean)
 Store 1  Store 2  Store 3  Store 4 
32.00000       NA 39.25000 33.14286 

Within the tapply function you can pass additional arguments to the function you are applying by placing them after the FUN argument. In this case, the mean function accepts the na.rm argument to remove NA values. Note that na.rm defaults to FALSE, which is why the mean for Store 2 above is NA.

tapply(data_set$price, data_set$store, mean, na.rm = TRUE)
 Store 1   Store 2   Store 3   Store 4 
32.00000  33.50000  39.25000  33.14286 

The previous call is equivalent to the following:

f <- function(x) mean(x, na.rm = TRUE)
tapply(data_set$price, data_set$store, f)
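
The same mechanism works for any function and for several extra arguments at once. As a sketch on hypothetical data, the block below forwards both probs and na.rm to quantile to compute a median per group.

```r
# Hypothetical data with one missing value
x <- c(1, NA, 3, 10, 20)
g <- factor(c("a", "a", "a", "b", "b"))

# probs and na.rm are not tapply arguments; both are forwarded to quantile
medians <- tapply(x, g, quantile, probs = 0.5, na.rm = TRUE)
medians  # a = 2, b = 15
```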

Tapply in R with multiple factors

You can group by several factors at once by passing them to the INDEX argument as a list. In this example, we are going to use the type and store factors to calculate the mean price of the objects by type and store.

# Mean price by product type and store
tapply(price, list(type, store), mean)
            Store 1   Store 2   Store 3   Store 4
toy           46      31.00000     49    36.66667
food          26      30.33333     39          NA
electronics   50      29.00000     32    25.00000
drinks        22      40.00000     20    36.00000

Note that as no food was sold in Store 4, the corresponding cell contains an NA value. To override this behavior you can set the default argument to the value you want instead of NA. In this example we set it to 0.

# Mean price by product type and store, changing default argument
tapply(price, list(type, store), mean, default = 0)
           Store 1    Store 2    Store 3    Store 4
toy          46      31.00000     49      36.66667
food         26      30.33333     39       0.00000
electronics  50      29.00000     32      25.00000
drinks       22      40.00000     20      36.00000
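
With two factors the result is a matrix, so single cells can be read by row and column name. The sketch below uses a small hypothetical dataset (the store names S1 and S2 are made up) and also counts the observations per cell, since a fill value of 0 cannot by itself be told apart from a genuine mean of 0.

```r
# Hypothetical data: four sales, with no food sold in store S2
price <- c(10, 20, 30, 40)
type  <- factor(c("toy", "toy", "food", "food"))
store <- factor(c("S1", "S2", "S1", "S1"))

m <- tapply(price, list(type, store), mean)
m["food", "S1"]        # 35
is.na(m["food", "S2"]) # TRUE: no food was sold in S2

m0 <- tapply(price, list(type, store), mean, default = 0)
m0["food", "S2"]       # 0, which here means "no observations"

# Counting observations per cell separates empty cells from genuine zeros
n <- tapply(price, list(type, store), length, default = 0)
n["food", "S2"]        # 0
```

Keeping the NA default and testing cells with is.na is often the safer choice when a fill value could be mistaken for real data.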