Boxplot in R
What is a box plot in R programming? A boxplot in R, also known as a box and whisker plot, is a graphical representation that summarizes the main characteristics of the data (position, dispersion, skewness, …) and reveals the presence of outliers. In this tutorial we will review how to make a base R box plot.
How to interpret a box plot in R?
The box of a boxplot starts at the first quartile (25%) and ends at the third (75%). Hence, the box represents the central 50% of the data, with a line inside that marks the median. On each side of the box a segment (whisker) is drawn to the furthest data point that is not an outlier; outliers, if there are any, are drawn as circles.
An outlier is an observation that lies far from the rest of the data. A data point is flagged as an outlier if it is greater than \(Q_3 + 1.5 \cdot IQR\) (right outlier) or less than \(Q_1 - 1.5 \cdot IQR\) (left outlier), where \(Q_1\) is the first quartile, \(Q_3\) the third quartile and \(IQR = Q_3 - Q_1\) the interquartile range, which corresponds to the width of the box in a horizontal boxplot.
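The rule above can be checked numerically. The following is a minimal sketch that uses the same sample vector as the single-vector examples of this tutorial; the names q1, q3, lower_fence and upper_fence are illustrative. Keep in mind that quantile() and the hinges that boxplot computes internally (see ?boxplot.stats) can differ slightly for small samples, so the comparison at the end is a sanity check rather than a guarantee.

```r
# Sample data (the vector used in the single-vector examples)
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4,
       4, 6, 8, 12, -8, 2, 0, -1, 5, 3)

# Quartiles and interquartile range
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points beyond the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# Outliers as computed by the helper behind boxplot()
boxplot.stats(x)$out
```

For this vector both approaches flag the same three observations.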
The boxplot function in R
A box and whisker plot in base R can be plotted with the boxplot function. You can plot this type of graph from different inputs, like vectors or data frames, as we will review in the following subsections. When plotting boxplots for multiple groups in the same graph, you can also specify a formula as input. In addition, you can customize the resulting box plot with several arguments.
Boxplot from vector
To make a box plot in R from a vector, you just need to pass the vector to the boxplot function. By default, the boxplot will be vertical, but you can change the orientation by setting the horizontal argument to TRUE.
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4,
4, 6, 8, 12, -8, 2, 0, -1, 5, 3)
boxplot(x, horizontal = TRUE)
Note that boxplots hide the underlying distribution of the data. To address this, you can add the data points to a boxplot in R with the stripchart function (jittering the points keeps them from being drawn on top of each other and of the outliers) as follows:
stripchart(x, method = "jitter", pch = 19, add = TRUE, col = "blue")
Since R 4.0.0 boxplots are gray by default instead of white.
Box plot with confidence interval for the median
You can represent an approximate 95% confidence interval for the median in an R boxplot by setting the notch argument to TRUE.
boxplot(x, notch = TRUE)
Note that if the notches of two boxplots do not overlap, this is strong evidence that their medians differ.
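According to ?boxplot.stats, the notches extend to the median plus and minus 1.58 times the interquartile range divided by the square root of the sample size. The following sketch recomputes those limits by hand from the hinges and compares them with the conf component; the names hinge_iqr, half_width and notch are illustrative.

```r
# Sample data (the vector used in the single-vector examples)
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4,
       4, 6, 8, 12, -8, 2, 0, -1, 5, 3)

# Five-number summary, sample size and notch limits
s <- boxplot.stats(x)

# Distance between the hinges (the box width)
hinge_iqr <- s$stats[4] - s$stats[2]

# Half-width of the notch around the median
half_width <- 1.58 * hinge_iqr / sqrt(s$n)

# Notch limits computed by hand
notch <- s$stats[3] + c(-1, 1) * half_width
notch

# Notch limits reported by boxplot.stats()
s$conf
```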
Boxplot by group in R
If your dataset has a categorical variable containing groups, you can create a boxplot from a formula. In this example, we are going to use the base R chickwts dataset.
head(chickwts)
weight feed
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
Now, you can create a boxplot of the weight against the type of feed. Notice that when working with a data frame you can refer to the variables by name if you pass the data frame to the data argument.
boxplot(chickwts$weight ~ chickwts$feed)
boxplot(weight ~ feed, data = chickwts) # Equivalent
In addition, in this example you could add points to each boxplot by typing:
stripchart(chickwts$weight ~ chickwts$feed, vertical = TRUE, method = "jitter",
pch = 19, add = TRUE, col = 1:length(levels(chickwts$feed)))
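To see how the formula splits the data without drawing anything, you can pass plot = FALSE and inspect the returned list. This is a small sketch; group_sizes is an illustrative name.

```r
# Compute the boxplot statistics by group without plotting
res <- boxplot(weight ~ feed, data = chickwts, plot = FALSE)

# One box per level of the grouping factor
res$names

# Number of observations behind each box
res$n

# The same counts obtained directly from the data
group_sizes <- table(chickwts$feed)
group_sizes
```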
Multiple boxplots
If all variables of your dataset are numeric, you can create a boxplot directly from the data frame. For illustration purposes we are going to use the trees dataset.
head(trees)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
Note the difference with respect to the chickwts dataset: here each column is a separate numeric variable. Nevertheless, you can convert this dataset to the same format as the chickwts dataset with the stack function.
stacked_df <- stack(trees)
head(stacked_df)
values ind
1 8.3 Girth
2 8.6 Girth
3 8.8 Girth
4 10.5 Girth
5 10.7 Girth
6 10.8 Girth
Now, you can plot the boxplot with the original or the stacked data frame as we did in the previous section. Note that you can change the boxplot color by group by passing a vector of colors to the col argument, so that each boxplot has a different color.
# Boxplot from the R trees dataset
boxplot(trees, col = rainbow(ncol(trees)))
# Equivalent to:
boxplot(stacked_df$values ~ stacked_df$ind,
col = rainbow(ncol(trees)))
You can stack data frame columns with the stack function.
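As a quick check that the wide and the stacked forms describe the same boxplots, the following sketch compares the statistics computed from both inputs without plotting. The names wide_res and long_res are illustrative.

```r
# Stack the three numeric columns into a values/ind pair
stacked_df <- stack(trees)

# One row per original cell: nrow(trees) * ncol(trees) rows
dim(stacked_df)

# The grouping factor keeps the original column order
levels(stacked_df$ind)

# Statistics from the wide and from the stacked data frame
wide_res <- boxplot(trees, plot = FALSE)
long_res <- boxplot(values ~ ind, data = stacked_df, plot = FALSE)

# Both inputs yield the same five-number summaries
all.equal(wide_res$stats, long_res$stats)
```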
If you need a separate boxplot for each column of your R data frame, you can use the lapply function to iterate over the columns. In this case, we split the graphics device with par into one row and as many columns as the dataset has, but you could also draw individual graphs. Note that the invisible function suppresses the text output of the lapply function.
par(mfrow = c(1, ncol(trees)))
invisible(lapply(1:ncol(trees), function(i) boxplot(trees[, i])))
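A possible variation, sketched below, iterates over the column names instead of the indices so that each panel can be titled with the variable it shows, and restores the previous graphical parameters afterwards. The names old_par and stats_list are illustrative.

```r
# Split the device and keep the previous settings
old_par <- par(mfrow = c(1, ncol(trees)))

# One titled boxplot per column; the returned statistics are kept in a list
stats_list <- lapply(names(trees), function(nm) {
  boxplot(trees[[nm]], main = nm)
})

# Restore the previous graphical parameters
par(old_par)
```

Storing the result of lapply in a variable also avoids printing it, so invisible is not needed here.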
Reorder boxplot in R
By default, boxplots are drawn in the order of the factor levels. However, you can reorder or sort a boxplot in R by any metric, like the median or the mean, with the reorder function.
par(mfrow = c(1, 2))
# Lower to higher
medians <- reorder(chickwts$feed, chickwts$weight, median)
# medians <- with(chickwts, reorder(feed, weight, median)) # Equivalent
boxplot(chickwts$weight ~ medians, las = 2, xlab = "", ylab = "")
# Higher to lower
medians <- reorder(chickwts$feed, -chickwts$weight, median)
# medians <- with(chickwts, reorder(feed, -weight, median)) # Equivalent
boxplot(chickwts$weight ~ medians, las = 2, xlab = "", ylab = "")
par(mfrow = c(1, 1))
If you want to order the boxplot by another metric, just replace median with the function you prefer.
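For instance, a sketch of the same idea with the mean as the ordering metric could look as follows; by_mean and group_means are illustrative names.

```r
# Reorder the factor levels by the mean weight of each feed
by_mean <- reorder(chickwts$feed, chickwts$weight, mean)

# New order of the levels, from lowest to highest mean
levels(by_mean)

# Group means sorted in increasing order, for comparison
group_means <- sort(tapply(chickwts$weight, chickwts$feed, mean))
group_means

# Boxplot with the groups sorted by their mean
boxplot(chickwts$weight ~ by_mean, las = 2, xlab = "", ylab = "")
```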
Boxplot customization
A boxplot can be fully customized for a nice result. The following block of code shows an extended example of how to customize an R box plot and how to add a grid. Note that there are even more arguments for customizing the boxplot than the ones used in the example, like boxlty, boxlwd, medlty or staplelwd. Review the full list of graphical boxplot parameters under the pars argument in help(bxp) or ?bxp.
plot.new()
set.seed(1)
# Light gray background
rect(par("usr")[1], par("usr")[3], par("usr")[2], par("usr")[4],
col = "#ebebeb")
# Add white grid
grid(nx = NULL, ny = NULL, col = "white", lty = 1,
lwd = par("lwd"), equilogs = TRUE)
# Boxplot
par(new = TRUE)
boxplot(rnorm(500), # Data
horizontal = FALSE, # Horizontal or vertical plot
lwd = 2, # Lines width
col = rgb(1, 0, 0, alpha = 0.4), # Color
xlab = "X label", # X-axis label
ylab = "Y label", # Y-axis label
main = "Customized boxplot in base R", # Title
notch = TRUE, # Add notch if TRUE
border = "black", # Boxplot border color
outpch = 25, # Outliers symbol
outbg = "green", # Outliers color
whiskcol = "blue", # Whisker color
whisklty = 2, # Whisker line type
lty = 1) # Line type (box and median)
# Add a legend
legend("topright", legend = "Boxplot", # Position and title
fill = rgb(1, 0, 0, alpha = 0.4), # Color
inset = c(0.03, 0.05), # Modify margins
bg = "white") # Legend background color
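The additional bxp parameters mentioned above (boxlty, boxlwd, medlty and staplelwd) can be passed to boxplot in the same way. The following is a small sketch with arbitrary values chosen only for illustration.

```r
set.seed(1)
y <- rnorm(100)

res <- boxplot(y,
               boxlty = 2,      # Box outline line type (dashed)
               boxlwd = 2,      # Box outline line width
               medlty = 1,      # Median line type (solid)
               medcol = "red",  # Median line color
               staplelwd = 3)   # Line width of the staples at the whisker ends
```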
Add mean point to a boxplot in R
By default, a boxplot displays the median. Nevertheless, you may also want to display the mean or another characteristic of the data. For that purpose, you can use the segments function if you want to draw a line like the one of the median, or the points function to just add a point. Note that the code is slightly different for vertical and horizontal boxplots.
The following code block shows how to add mean points and segments to both types of boxplot when working with a single boxplot.
par(mfrow = c(1, 2))
#-----------------
# Vertical boxplot
#-----------------
boxplot(x)
# Add mean line
segments(x0 = 0.8, y0 = mean(x),
x1 = 1.2, y1 = mean(x),
col = "red", lwd = 2)
# abline(h = mean(x), col = 2, lwd = 2) # Entire line
# Add mean point
points(mean(x), col = 3, pch = 19)
#-------------------
# Horizontal boxplot
#-------------------
boxplot(x, horizontal = TRUE)
# Add mean line
segments(x0 = mean(x), y0 = 0.8,
x1 = mean(x), y1 = 1.2,
col = "red", lwd = 2)
# abline(v = mean(x), col = 2, lwd = 2) # Entire line
# Add mean point
points(mean(x), 1, col = 3, pch = 19)
par(mfrow = c(1, 1))
Note that, in this case, the mean and the median are almost equal, as the distribution is roughly symmetric.
You can replace the mean function in the previous code with another function to display other measures.
You can also add the mean point to a boxplot by group. In this case, you can use the lapply function to avoid for loops. To calculate the mean of each group you can use the apply function by columns or the colMeans function. The following code block adds the lines and points to vertical and horizontal box and whisker diagrams.
par(mfrow = c(1, 2))
my_df <- trees
#--------------------------
# Vertical boxplot by group
#--------------------------
boxplot(my_df, col = rgb(0, 1, 1, alpha = 0.25))
# Add mean lines
invisible(lapply(1:ncol(my_df),
function(i) segments(x0 = i - 0.4,
y0 = mean(my_df[, i]),
x1 = i + 0.4,
y1 = mean(my_df[, i]),
col = "red", lwd = 2)))
# Add mean points
means <- apply(my_df, 2, mean)
means <- colMeans(my_df) # Equivalent (more efficient)
points(means, col = "red", pch = 19)
#----------------------------
# Horizontal boxplot by group
#----------------------------
boxplot(my_df, col = rgb(0, 1, 1, alpha = 0.25),
horizontal = TRUE)
# Add mean lines
invisible(lapply(1:ncol(my_df),
function(i) segments(x0 = mean(my_df[, i]),
y0 = i - 0.4,
x1 = mean(my_df[, i]),
y1 = i + 0.4,
col = "red", lwd = 2)))
# Add mean points
means <- apply(my_df, 2, mean)
means <- colMeans(my_df) # Equivalent (more efficient)
points(means, 1:ncol(my_df), col = "red", pch = 19)
par(mfrow = c(1, 1))
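When the groups come from a factor rather than from separate columns, as in the chickwts dataset, colMeans does not apply. One possible approach, sketched here, is to compute the group means with tapply; group_means is an illustrative name.

```r
# Mean weight for each level of feed
group_means <- tapply(chickwts$weight, chickwts$feed, mean)
group_means

# Vertical boxplot by group
boxplot(weight ~ feed, data = chickwts, las = 2, xlab = "")

# The boxes are centered at x = 1, 2, ..., one per factor level
points(seq_along(group_means), group_means, col = "red", pch = 19)
```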
Return values from boxplot
If you assign the result of boxplot to a variable, you obtain a list with different components. Create a boxplot with the trees dataset and store it in a variable:
res <- boxplot(trees)
res
$stats
      [,1] [,2] [,3]
[1,]  8.30   63 10.2
[2,] 11.05   72 19.4
[3,] 12.90   76 24.2
[4,] 15.25   80 37.3
[5,] 20.60   87 58.3
$n
[1] 31 31 31
$conf
         [,1]     [,2]    [,3]
[1,] 11.70814 73.72979 19.1204
[2,] 14.09186 78.27021 29.2796
$out
[1] 77
$group
[1] 3
$names
[1] "Girth"  "Height" "Volume"
The output contains six elements, described below:
- stats: each column contains the lower whisker, the first quartile, the median, the third quartile and the upper whisker of one group.
- n: number of observations in each group.
- conf: each column contains the lower and upper extremes of the notch, that is, of the confidence interval for the median.
- out: values of the data points that lie beyond the whiskers (the outliers); here, the single value 77.
- group: for each value in out, the index of the group it belongs to; here, 3 (Volume).
- names: names of the groups.
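As an illustration of how these components fit together, the following sketch pairs each outlier value in out with the name of its group by indexing names with group. The name outlier_table is illustrative, and plot = FALSE is used only to skip the drawing.

```r
# Compute the statistics without drawing the plot
res <- boxplot(trees, plot = FALSE)

# Pair each outlier value with the name of the group it belongs to
outlier_table <- data.frame(variable = res$names[res$group],
                            value = res$out)
outlier_table

# Number of outliers in total
length(res$out)
```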
It is worth mentioning that you can recreate the boxplot from the variable you have just created (res) with the bxp function.
bxp(res)
Boxplot and histogram
One limitation of box plots is that they are not designed to detect multimodality. For that reason, it is recommended to combine a boxplot with a histogram or a density line.
par(mfrow = c(1, 1))
# Multimodal data
n <- 20000
ii <- rbinom(n, 1, 0.5)
dat <- rnorm(n, mean = 110, sd = 11) * ii +
rnorm(n, mean = 70, sd = 5) * (1 - ii)
# Histogram
hist(dat, probability = TRUE, ylab = "", col = "grey",
axes = FALSE, main = "")
# Axis
axis(1)
# Density
lines(density(dat), col = "red", lwd = 2)
# Add boxplot
par(new = TRUE)
boxplot(dat, horizontal = TRUE, axes = FALSE,
lwd = 2, col = rgb(0, 1, 1, alpha = 0.15))
The boxplot can’t detect multimodality in the data.
To overcome this problem you can use violin plots or beanplots.
Boxplot in R ggplot2
The boxplots we created in the previous sections can also be plotted with the ggplot2 library. For further details read the complete ggplot2 boxplots tutorial.
Boxplot in ggplot2 from vector
The input of the ggplot function has to be a data frame, so you will need to convert the vector to the data.frame class. Then, you can use the geom_boxplot function to create and customize the box and the stat_boxplot function to add the error bars.
# install.packages("ggplot2")
library(ggplot2)
# Data
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4,
4, 6, 8, 12, -8, 2, 0, -1, 5, 3)
# Transform our 'x' vector
x <- data.frame(x)
# Boxplot with vector
ggplot(data = x, aes(x = "", y = x)) +
stat_boxplot(geom = "errorbar", # Error bars
width = 0.2) +
geom_boxplot(fill = "#4271AE", # Box color
outlier.colour = "red", # Outliers color
alpha = 0.9) + # Box color transparency
ggtitle("Boxplot with vector") + # Plot title
xlab("") + # X-axis label
coord_flip() # Horizontal boxplot
Boxplot in ggplot2 by group
If you want to create a ggplot boxplot by group, you will need to specify the variables in the aes argument as follows:
# Boxplot by group
ggplot(data = chickwts, aes(x = feed, y = weight)) +
stat_boxplot(geom = "errorbar", # Boxplot with error bars
width = 0.2) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", # Colors
alpha = 0.9, outlier.colour = "red") +
scale_y_continuous(name = "Weight") + # Continuous variable label
scale_x_discrete(name = "Feed") + # Group label
ggtitle("Boxplot by groups ggplot2") + # Plot title
theme(axis.line = element_line(colour = "black", # Theme customization
linewidth = 0.25)) # 'linewidth' requires ggplot2 >= 3.4.0; use 'size' in older versions
Boxplot in ggplot2 from wide format dataframe
Finally, to create a boxplot with ggplot2 from a wide format data frame like the trees dataset, you will need to stack the data with the stack function:
# Boxplot from dataframe
ggplot(data = stack(trees), aes(x = ind, y = values)) +
stat_boxplot(geom = "errorbar", # Boxplot with error bars
width = 0.2) +
geom_boxplot(fill = "#4271AE", colour = "#1F3552", # Colors
alpha = 0.9, outlier.colour = "red") +
scale_y_continuous(name = "Value") + # Continuous variable label
scale_x_discrete(name = "Variable") + # Group label
ggtitle("Boxplot from data frame ggplot2") + # Plot title
theme(axis.line = element_line(colour = "black", # Theme customization
linewidth = 0.25)) # 'linewidth' requires ggplot2 >= 3.4.0; use 'size' in older versions