Mode estimation in R
The mode is a measure of location that can be defined as the most probable outcome of a random variable or as the most frequent value in a set of observations. It is a robust measure that coincides with the mean and the median in symmetric unimodal distributions. In this tutorial we will review how to calculate the mode in R for both discrete and continuous one-dimensional variables.
Discrete unimodal estimation
Consider the following vector x:
x <- c(1, 5, 1, 6, 2, 1, 6, 7, 1)
The mode can be calculated as the most repeated value within the variable, which in this case is 1. A simple way of calculating the mode in R in this case is to use the following function:
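# Note: this definition masks base R's mode(), which returns the storage mode
mode <- function(x) {
  # Count each value, pick the most frequent one, convert its label to numeric
  return(as.numeric(names(which.max(table(x)))))
}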
In this case, we can check that the mode is 1 by passing the vector to the function:
mode(x) # 1
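To see why the one-liner works, it can help to run its steps one at a time. The sketch below uses only base R; the intermediate names counts and top are introduced here for illustration and are not part of the original function.

```r
x <- c(1, 5, 1, 6, 2, 1, 6, 7, 1)

# Step 1: frequency of each distinct value (the names are the values as strings)
counts <- table(x)

# Step 2: position of the largest count (only the first one, if there are ties)
top <- which.max(counts)

# Step 3: recover the value from its label and convert it back to a number
as.numeric(names(top))  # 1
```

Because which.max returns only the first maximum, this approach reports a single value even when several values are tied for the highest frequency.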
If you want to visualize the number of times that each data point is repeated you can also create a bar chart.
barplot(table(x), col = c(4, rep("gray", 4)))
legend("topright", "Mode", fill = 4)
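The colors above are hard-coded for this particular vector. As a small sketch using only base R, the bar colors can instead be derived from the counts, so that every bar tied for the highest frequency is highlighted. The names counts and cols and the color "steelblue" are choices made for this example.

```r
x <- c(1, 5, 1, 6, 2, 1, 6, 7, 1)
counts <- table(x)

# Highlight the bar(s) with the highest frequency and draw the rest in gray
cols <- ifelse(as.vector(counts) == max(counts), "steelblue", "gray")

barplot(counts, col = cols)
legend("topright", "Mode", fill = "steelblue")
```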
Continuous unimodal estimation
If the variable of interest is continuous instead of discrete, the previous procedure no longer works, because almost every value appears only once in the sample. The most usual procedure in the literature is to estimate the density function of the data and take the point where that estimate reaches its maximum.
Consider the following normal data (unimodal) with mean 0 and standard deviation of 1. As the normal distribution is symmetric, we know that the mean, the median and the mode are equal (0).
set.seed(1234)
x2 <- rnorm(1000)
In order to visualize the mode you can draw the histogram and the density function estimation. Note that the selection of the bandwidth will determine the shape of the estimated density.
# Histogram
hist(x2, freq = FALSE)
# Density
dx <- density(x2)
lines(dx$x, dx$y, col = 2, lwd = 2)
# Theoretical mode
abline(v = 0, col = 4, lty = 2, lwd = 3)
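The idea of taking the maximum of the estimated density can be sketched with base R alone. This is a rough approximation and not the method of any particular package: it simply returns the grid point where density() is highest, so the result depends on the bandwidth and on the grid resolution (the argument n, set here to an arbitrary 2048). The name kde_mode is made up for this example.

```r
set.seed(1234)
x2 <- rnorm(1000)

# Kernel density estimate on a finer grid than the default 512 points
dx <- density(x2, n = 2048)

# Grid point at which the estimated density is highest
kde_mode <- dx$x[which.max(dx$y)]
kde_mode
```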
In order to perform the calculation you can use the mlv function of the modeest package, which allows selection among different algorithms. We recommend using the mean-shift algorithm, as displayed in the following block of code.
# install.packages("modeest")
library(modeest)
# Mode
mlv(x2, method = "meanshift") # -0.03912067
We can observe that the estimated mode (-0.039) is very close to the theoretical mode (0). Other available methods are "lientz", "naive", "venter", "grenander", "hsm", "parzen", "tsybakov" and "asselin".
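For intuition about what a mean-shift procedure does, here is a minimal base R sketch with a Gaussian kernel. It is not the modeest implementation, and its result need not match the value printed above. The helper name mean_shift_mode, the default bandwidth bw.nrd0, the starting point (the median), the tolerance and the iteration cap are all assumptions made for this example.

```r
# Hypothetical helper: Gaussian-kernel mean shift towards a single mode
mean_shift_mode <- function(v, h = bw.nrd0(v), start = median(v),
                            tol = 1e-8, max_iter = 1000) {
  m <- start
  for (i in seq_len(max_iter)) {
    # Weight each observation by its kernel distance to the current point
    w <- dnorm((v - m) / h)
    # Move to the weighted mean of the observations
    m_new <- sum(w * v) / sum(w)
    # Stop once the update no longer moves the estimate
    if (abs(m_new - m) < tol) break
    m <- m_new
  }
  m_new
}

set.seed(1234)
x2 <- rnorm(1000)
mean_shift_mode(x2)
```

Each iteration moves the current point towards the region where the data are densest, so the procedure climbs to a local maximum of the kernel density estimate that uses the same bandwidth.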
Discrete multimodal estimation
Unlike the median or mean, the mode can take multiple values at the same time. For instance, consider the vector y, which has two modes.
y <- c(3, 5, 3, 3, 5, 6, 5)
# Histogram
hist(y)
In this case the most repeated values are 3 and 5. In order to calculate several modes you can make use of the mlv function of the modeest package and apply the mfv method.
# install.packages("modeest")
library(modeest)
# Modes
mlv(y, method = "mfv") # 3 5
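If you prefer not to depend on a package, the same result can be obtained with a few lines of base R. The following is a sketch; the helper name all_modes is made up for this example. It returns every value tied for the highest frequency and also works with character data.

```r
# Hypothetical helper: every value tied for the highest frequency
all_modes <- function(v) {
  counts <- table(v)
  top <- names(counts)[as.vector(counts) == max(counts)]
  # table() stores the values as strings, so convert back for numeric input
  if (is.numeric(v)) as.numeric(top) else top
}

y <- c(3, 5, 3, 3, 5, 6, 5)
all_modes(y)                 # 3 5
all_modes(c("a", "b", "b"))  # "b"
```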
Continuous multimodal estimation
If you want to calculate several modes when your variable is continuous you can use the locmodes function of the multimode package.
Consider the following multimodal data, whose theoretical modes are 40 and 120, represented with vertical red lines.
n <- 1000
bin <- rbinom(n, 1, 0.6)
y2 <- rnorm(n, mean = 120, sd = 11) * bin +
  rnorm(n, mean = 40, sd = 5) * (1 - bin)
# Histogram
hist(y2)
# Theoretical mode 1
abline(v = 40, col = 2, lwd = 2)
# Theoretical mode 2
abline(v = 120, col = 2, lwd = 2)
You can apply the locmodes function to the previous data, indicating the number of modes that you expect to find in the mod0 argument.
# install.packages("multimode")
library(multimode)
modes <- locmodes(y2, mod0 = 2)
modes
Estimated location
Modes: 40.56825 120.8625
Antimode: 69.94661
Estimated value of the density
Modes: 0.02535653 0.02033563
Antimode: 8.184294e-08
Critical bandwidth: 3.746696
Warning message:
In locmodes(y2, mod0 = 2) :
If the density function has an unbounded support, artificial modes may have been created in the tails
In the previous output you can observe that the estimated modes are 40.57 and 120.86, very close to the theoretical values.
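Conceptually, locating several modes amounts to finding the local maxima of an estimated density. The base R sketch below illustrates that idea only; it is not the algorithm used by locmodes, and it relies on the default bandwidth of density() rather than on the critical bandwidth reported above. The seed and the names dy, peaks, top2 and kde_modes are assumptions made for this example, so the numbers will differ from the output shown.

```r
set.seed(1)  # arbitrary seed, used only to make this sketch reproducible
n <- 1000
bin <- rbinom(n, 1, 0.6)
y2 <- rnorm(n, mean = 120, sd = 11) * bin +
  rnorm(n, mean = 40, sd = 5) * (1 - bin)

# Kernel density estimate with the default bandwidth
dy <- density(y2)

# Interior grid points where the slope changes from positive to negative
peaks <- which(diff(sign(diff(dy$y))) == -2) + 1

# Keep the two highest peaks and report their locations in increasing order
top2 <- peaks[order(dy$y[peaks], decreasing = TRUE)][1:2]
kde_modes <- sort(dy$x[top2])
kde_modes
```

Keeping only the two highest peaks plays the role of the mod0 argument here: it discards small bumps that a kernel estimate can produce in the tails.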
The package provides an S3 plot method for the estimations returned by the locmodes function, which displays the location of the modes and antimodes and the bandwidth used.
plot(modes)
The package also provides the modetest function to test for multimodality and functions for exploring the number of modes, such as modetree, modeforest and sizer.