Coefficient of variation in R
The coefficient of variation (CV) is a dispersion measure used to represent the relative variability of a dataset. It is calculated as the ratio of the standard deviation to the mean of a dataset, often expressed as a percentage.
This coefficient is an alternative to the standard deviation, as it allows comparing the variability of several datasets even if they have different scales or measurement units.
How to calculate the coefficient of variation?
The population coefficient of variation (CV) is the ratio between the standard deviation (\(\sigma\)) and the absolute value of the mean (\(\mu\)):
\[CV = \frac{\sigma}{|\mu|}\]
The coefficient of variation is often expressed as a percentage, multiplying the ratio by 100:
\[CV = \frac{\sigma}{|\mu|} \cdot 100\]
For a sample, it is computed as the ratio between the sample standard deviation (\(S\)) and the sample mean (\(\bar{X}\)):
\[CV = \frac{S}{|\bar{X}|}\]
Notice that the mean must be different from 0, as the ratio is undefined otherwise.
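To see why the mean must be non-zero, the following minimal sketch computes the ratio for data centred exactly at zero and for data whose mean is very close to zero. Both vectors (x_zero and x_near) are made up for illustration.

```r
# Illustrative data with a mean of exactly zero
x_zero <- c(-2, -1, 0, 1, 2)
cv_zero <- sd(x_zero) / abs(mean(x_zero))
cv_zero  # division by zero: R returns Inf

# Illustrative data with a mean close to zero
x_near <- c(-2, -1, 0, 1, 2.01)
cv_near <- sd(x_near) / abs(mean(x_near))
cv_near  # a very large value, dominated by the tiny mean
```

In practice this suggests the coefficient is only informative when the mean is clearly away from zero.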
In R, the coefficient of variation of a set of values can be computed as the ratio between the standard deviation, obtained with sd(), and the mean, obtained with mean().
# Sample data
x <- c(10, 30, 3, 44, 12, 15)
# Standard deviation and mean
sigma <- sd(x)
mu <- mean(x)
# Coefficient of variation
cv <- sigma / abs(mu)
cv
0.797503
If you want to express the coefficient of variation as a percentage, just multiply the previous result by 100:
# Sample data
x <- c(10, 30, 3, 44, 12, 15)
# Standard deviation and mean
sigma <- sd(x)
mu <- mean(x)
# Coefficient of variation in percentage
cv <- sigma / abs(mu) * 100
cv
79.7503
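Note that sd() in R uses the n - 1 denominator, so the code above yields the sample coefficient of variation. If the data represent a whole population, one possible sketch recomputes the standard deviation with denominator n. The names n, sigma_pop and cv_pop are illustrative.

```r
# Sample data
x <- c(10, 30, 3, 44, 12, 15)
n <- length(x)

# Population standard deviation (denominator n instead of n - 1)
sigma_pop <- sqrt(sum((x - mean(x))^2) / n)

# Population coefficient of variation
cv_pop <- sigma_pop / abs(mean(x))
cv_pop
```

The population value is always slightly smaller than the sample value, and the difference shrinks as n grows.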
Function to compute the coefficient of variation
You can create your own function to calculate the coefficient of variation of a set of values.
The following block of code contains a function named cv that takes a vector as input and provides two additional arguments: percentage, to calculate the coefficient as a percentage when set to TRUE, and na.rm, to remove missing values when set to TRUE.
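# cv function
cv <- function(x, percentage = TRUE, na.rm = FALSE) {
  # Ratio of the standard deviation to the absolute value of the mean
  ratio <- sd(x, na.rm = na.rm) / abs(mean(x, na.rm = na.rm))
  # Express the ratio as a percentage if requested
  if (percentage) ratio * 100 else ratio
}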
Using the previous function, you can now compute the coefficient of variation in a single line:
# Sample data
x <- c(50, 48, 65, 13, 4, 19)
# Coefficient of variation
cv(x, percentage = TRUE)
73.54353
Remember to set percentage = FALSE if you don't want to calculate the coefficient as a percentage:
# Sample data
x <- c(50, 48, 65, 13, 4, 19)
# Coefficient of variation
cv(x, percentage = FALSE)
0.7354353
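The na.rm argument has not been exercised so far. As a small sketch, the function is repeated here so the block runs on its own, and the vector y with a missing value is made up for illustration.

```r
# cv function (same behaviour as the definition above)
cv <- function(x, percentage = TRUE, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / abs(mean(x, na.rm = na.rm)) * ifelse(percentage, 100, 1)
}

# Illustrative data with a missing value
y <- c(50, 48, NA, 13, 4, 19)

# With the default na.rm = FALSE the result is NA
cv_with_na <- cv(y)
cv_with_na

# With na.rm = TRUE the missing value is dropped before computing
cv_without_na <- cv(y, na.rm = TRUE)
cv_without_na
```

Leaving na.rm = FALSE as the default mirrors sd() and mean(), so missing values are never dropped silently.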
Comparing the coefficient of variation across groups
The principal use case of the coefficient of variation is to compare the dispersion of several datasets with different scales or measurement units. Consider the following data for two different groups:
group1 | group2 |
---|---|
19 | 160 |
30 | 290 |
12 | 280 |
56 | 330 |
As the scales of the groups differ, the second group will have a higher standard deviation merely because of its scale. Therefore, the standard deviations alone don't allow a meaningful comparison of variability between groups.
# Sample data
df <- data.frame(group1 = c(19, 30, 12, 56),
group2 = c(160, 290, 280, 330))
# Standard deviation and mean by group
s <- rbind(apply(df, 2, sd), apply(df, 2, mean))
rownames(s) <- c("sd", "mean")
s
group1 group2
sd 19.31105 73.25754
mean 29.25000 265.00000
However, if you calculate the coefficient of variation for each group, the variability can be compared between groups, as this coefficient is relative.
# Sample data
df <- data.frame(group1 = c(19, 30, 12, 56), group2 = c(160, 290, 280, 330))
# Coefficient of variation for each column
# (using cv function defined in the previous section)
apply(df, 2, cv)
group1 group2
66.02069 27.64435
We can conclude that the first group has a higher relative variability (66.02%) than the second group (27.64%), even though the standard deviation of the second group is higher.
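The coefficient is comparable across units because rescaling the data changes the standard deviation and the mean by the same factor. A minimal sketch, assuming purely for illustration that the first group was measured in centimetres and converting it to metres:

```r
# First group, assumed to be in centimetres for illustration
group1_cm <- c(19, 30, 12, 56)
group1_m <- group1_cm / 100

# The standard deviation depends on the unit
sd(group1_cm)
sd(group1_m)

# The coefficient of variation does not
cv_cm <- sd(group1_cm) / abs(mean(group1_cm))
cv_m <- sd(group1_m) / abs(mean(group1_m))
all.equal(cv_cm, cv_m)
```

This holds for changes of scale only. Adding a constant to the data (for example, converting Celsius to Fahrenheit) shifts the mean without scaling the standard deviation by the same factor, so the coefficient changes.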