tapply in R
What does tapply mean in R? The tapply function allows you to create statistical summaries by group based on the levels of one or several factors. In this tutorial you will learn how to use tapply in R in several scenarios with examples.
The tapply function
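The R tapply function belongs to the apply family, but instead of iterating over the rows or columns of a matrix, as apply does, it splits a vector into groups and applies a function to each group. The following block of code shows the function syntax and a simplified description of each argument.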
tapply(X,                # Object that can be split into groups (typically a vector)
       INDEX,            # Factor, or list of factors, each of the same length as X
       FUN,              # Function to be applied to each group of X (or NULL)
       ...,              # Additional arguments to be passed to FUN
       default = NA,     # If simplify = TRUE, the array initialization value
       simplify = TRUE)  # If set to FALSE, always returns a list-mode array
Note that the first three arguments are the most commonly used, and that it is common to omit the argument names in apply family functions because of their simple syntax.
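As a minimal sketch of both calling styles, the following block uses a small made-up vector (values and groups are hypothetical names, not part of the tutorial data) and checks that positional and named arguments give the same result.

```r
# Hypothetical toy data: four values split into two groups
values <- c(10, 20, 30, 40)
groups <- factor(c("a", "a", "b", "b"))

# Positional arguments (the common style)
by_position <- tapply(values, groups, sum)

# The same call with every argument named explicitly
by_name <- tapply(X = values, INDEX = groups, FUN = sum)

by_position
identical(by_position, by_name)
```

Keep in mind that default and simplify come after ... in the signature, so they always have to be passed by name.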
How to use tapply in R?
The tapply function is very easy to use in R. First, consider the following example dataset, which represents the price of some objects, their type and the store where they were sold.
set.seed(2)
data_set <- data.frame(price = round(rnorm(25, sd = 10, mean = 30)),
                       type = sample(1:4, size = 25, replace = TRUE),
                       store = sample(paste("Store", 1:4),
                                      size = 25, replace = TRUE))
head(data_set)
  price type   store
1    21    2 Store 2
2    32    3 Store 3
3    46    4 Store 4
4    19    3 Store 4
5    29    1 Store 4
6    31    3 Store 4
Second, store the columns as separate variables and convert the column named type to a factor.
price <- data_set$price
store <- data_set$store
type <- factor(data_set$type,
               labels = c("toy", "food", "electronics", "drinks"))
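To illustrate how the labels are matched to the numeric codes in the conversion above, here is a small sketch with made-up codes (codes, type_demo and type_safe are hypothetical names). The labels are assigned to the sorted unique values, so 1 becomes "toy", 2 becomes "food", and so on.

```r
# Hypothetical numeric codes, similar to the 'type' column
codes <- c(2, 3, 4, 3, 1)
product_labels <- c("toy", "food", "electronics", "drinks")

# Levels are the sorted unique values (1, 2, 3, 4); the labels are
# assigned to those levels in that order
type_demo <- factor(codes, labels = product_labels)
type_demo

# If some code might be absent from the data, stating the levels
# explicitly keeps the code-to-label mapping fixed
type_safe <- factor(c(2, 2, 4), levels = 1:4, labels = product_labels)
table(type_safe)
```

The version without levels only works when every code appears at least once in the data, which is the case in the example dataset of this tutorial.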
Finally, you can use the tapply function to calculate the mean price for each type of object, across all stores, as follows:
# Mean price by product type
mean_prices <- tapply(price, type, mean)
mean_prices
        toy        food electronics      drinks
   39.50000    30.33333    32.20000    29.33333
Note that X and every factor passed to INDEX must have the same length. You can verify it with the length function. Also note that the default output is of class "array".
class(mean_prices) # "array"
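The same-length requirement mentioned above can be checked before calling tapply. The sketch below uses made-up vectors (x, g and g_short are hypothetical names) and captures the error that a length mismatch produces.

```r
# Hypothetical vectors; 'g_short' is deliberately one element too short
x <- c(5, 7, 9, 11)
g <- c("a", "b", "a", "b")
g_short <- c("a", "b", "a")

# Lengths match, so tapply works
length(x) == length(g)
ok <- tapply(x, g, mean)
ok

# A length mismatch makes tapply stop with an error; tryCatch keeps
# the error message instead of aborting the script
msg <- tryCatch(tapply(x, g_short, mean),
                error = function(e) conditionMessage(e))
msg
```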
Since the output is an array, you can access each of its elements by specifying the desired index in square brackets.
mean_prices[2] # 30.33333
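Because the output carries the factor levels as names, elements can also be selected by name, which is often less fragile than a numeric position. The following sketch assumes a small made-up data set (sales, kind and means are hypothetical names).

```r
# Hypothetical data: six sales in three product groups
sales <- c(10, 20, 30, 40, 50, 60)
kind <- factor(c("toy", "food", "toy", "food", "drinks", "drinks"))

means <- tapply(sales, kind, mean)
means

# By position: levels created without 'labels' are sorted
# alphabetically (drinks, food, toy), so position 2 is "food"
means[2]

# By name: independent of the order of the levels
means["food"]

# Several groups at once
means[c("toy", "drinks")]
```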
If you set the simplify argument to FALSE, the output is instead returned as a list (more precisely, an array of mode "list").
# Mean price by product type
mean_prices_list <- tapply(price, type, mean, simplify = FALSE)
mean_prices_list
$toy
[1] 39.5
$food
[1] 30.33333
$electronics
[1] 32.2
$drinks
[1] 29.33333
In this case, you can access the output elements with the $ sign and the element name.
mean_prices_list$toy # 39.5
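A list output is also what you get, even with the default simplify = TRUE, whenever FUN returns more than one value per group. The sketch below shows this with range on made-up data (x, g, ranges and means_list are hypothetical names).

```r
# Hypothetical data: two groups of three values
x <- c(3, 8, 1, 6, 4, 9)
g <- factor(c("a", "a", "a", "b", "b", "b"))

# range() returns two values per group, so the result cannot be
# simplified to a numeric array and is returned as a list
ranges <- tapply(x, g, range)
is.list(ranges)
ranges$a

# With a scalar summary, simplify = FALSE forces the list form
means_list <- tapply(x, g, mean, simplify = FALSE)
is.list(means_list)
means_list$a
```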
Additional arguments example: Ignore NA
Suppose that your data frame contains some NA values in its columns.
# Add some NA values to the data set
data_set[1, 1] <- NA
data_set[2, 3] <- NA
# Mean price by store
tapply(data_set$price, data_set$store, mean)
 Store 1  Store 2  Store 3  Store 4
32.00000       NA 39.25000 33.14286
Within the tapply function you can specify additional arguments of the function you are applying, after the FUN argument. In this case, the mean function accepts the na.rm argument to remove NA values before computing the mean. Note that this argument defaults to FALSE, which is why the mean of Store 2 above is NA.
tapply(data_set$price, data_set$store, mean, na.rm = TRUE)
 Store 1  Store 2  Store 3  Store 4
32.00000 33.50000 39.25000 33.14286
The previous call is equivalent to the following:
f <- function(x) mean(x, na.rm = TRUE)
tapply(data_set$price, data_set$store, f)
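The same mechanism works for any function and for several extra arguments at once. As a sketch on made-up data (x, g, medians and na_counts are hypothetical names), the block below forwards both probs and na.rm to quantile, and uses an anonymous function to count the missing values in each group.

```r
# Hypothetical data with one missing value in group "a"
x <- c(1, 2, 3, NA, 10, 20, 30, 40)
g <- c("a", "a", "a", "a", "b", "b", "b", "b")

# Everything after FUN is forwarded to quantile() for every group
medians <- tapply(x, g, quantile, probs = 0.5, na.rm = TRUE)
medians

# An anonymous function is handy when no ready-made function exists,
# for example to count the NA values per group
na_counts <- tapply(x, g, function(v) sum(is.na(v)))
na_counts
```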
Tapply in R with multiple factors
You can group by several factors at once by passing them to the INDEX argument inside a list. In this example, we apply the tapply function to the type and store factors to calculate the mean price of the objects by type and store.
# Mean price by product type and store
tapply(price, list(type, store), mean)
            Store 1  Store 2 Store 3  Store 4
toy              46 31.00000      49 36.66667
food             26 30.33333      39       NA
electronics      50 29.00000      32 25.00000
drinks           22 40.00000      20 36.00000
Note that, as no food was sold in Store 4, the corresponding cell contains an NA value. To override this behavior you can set the default argument to the value you want instead of NA. In this example we set it to 0.
# Mean price by product type and store, changing default argument
tapply(price, list(type, store), mean, default = 0)
            Store 1  Store 2 Store 3  Store 4
toy              46 31.00000      49 36.66667
food             26 30.33333      39  0.00000
electronics      50 29.00000      32 25.00000
drinks           22 40.00000      20 36.00000
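With two factors the output is a two-dimensional array, so individual cells can be read with row and column names. The following sketch uses made-up data (amount, kind, shop, totals and totals0 are hypothetical names) and compares the default NA with default = 0 for a combination that never occurs.

```r
# Hypothetical data: no "food" item was sold in shop "B"
amount <- c(10, 20, 30, 40, 50)
kind <- c("toy", "toy", "food", "food", "toy")
shop <- c("A", "B", "A", "A", "B")

# Naming the list elements labels the dimensions of the output
totals <- tapply(amount, list(kind = kind, shop = shop), sum)
totals

# Read single cells by row and column name
totals["toy", "B"]
totals["food", "B"]    # empty combination, NA by default

# Fill empty combinations with 0 instead of NA
totals0 <- tapply(amount, list(kind = kind, shop = shop), sum, default = 0)
totals0["food", "B"]
```

For sums or counts, 0 is a natural value for an empty cell. For means it may be safer to keep NA, because a 0 cannot be told apart from a genuine average of zero.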