- 1 How to interpret a box plot in R?
- 2 The boxplot function in R
- 3 Add mean point to a boxplot in R
- 4 Return values from boxplot
- 5 Boxplot and histogram
- 6 Boxplot in R ggplot2
How to interpret a box plot in R?
The box of a boxplot spans from the first quartile (25%) to the third quartile (75%). Hence, the box contains the central 50% of the data, with a line inside that represents the median. On each side of the box a segment (whisker) is drawn to the furthest data point that is not an outlier. Outliers, if there are any, are represented with circles.
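The quantities described above can be checked numerically. The following is a minimal sketch using the base R boxplot.stats function on an illustrative vector. Note that R computes the box limits as Tukey's hinges, which are close to, but not always identical to, the quartiles returned by quantile.

```r
# Illustrative data (the same vector used in the examples of the next section)
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4, 4, 6, 8, 12, -8, 2, 0, -1, 5, 3)

# Five numbers drawn by boxplot():
# lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(x)
s$stats

# With the default settings, whiskers reach at most 1.5 box lengths from the box
box_length  <- s$stats[4] - s$stats[2]
lower_fence <- s$stats[2] - 1.5 * box_length
upper_fence <- s$stats[4] + 1.5 * box_length

# Points beyond the fences are the outliers drawn as circles
manual_out <- x[x < lower_fence | x > upper_fence]
manual_out
s$out  # Same values, as reported by boxplot.stats
```

The multiplier 1.5 is the default value of the coef argument of boxplot.stats (range in boxplot), so other values would move the fences accordingly.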
The boxplot function in R
A box and whisker plot in base R can be plotted with the boxplot function. You can plot this type of graph from different inputs, like vectors or data frames, as we will review in the following subsections. When plotting boxplots for multiple groups in the same graph, you can also specify a formula as input. In addition, you can customize the resulting box plot with several arguments.
Boxplot from vector
If you are wondering how to make a box plot in R from a vector, you just need to pass the vector to the boxplot function. By default, the boxplot will be vertical, but you can change the orientation by setting the horizontal argument to TRUE.
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4, 4, 6, 8, 12, -8, 2, 0, -1, 5, 3)
boxplot(x, horizontal = TRUE)
Note that boxplots hide the underlying distribution of the data. To address this, you can add the data points to a boxplot in R with the stripchart function (jittering the points helps to avoid overplotting), as follows:
stripchart(x, method = "jitter", pch = 19, add = TRUE, col = "blue")
Box plot with confidence interval for the median
You can represent the 95% confidence interval for the median in an R boxplot by setting the notch argument to TRUE.
boxplot(x, notch = TRUE)
Note that if the notches of two boxplots do not overlap, there is strong evidence that their medians differ.
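According to the documentation of boxplot.stats, the notches extend to ±1.58 · IQR / sqrt(n) around the median. As a rough sketch (the vector is only illustrative), the interval can be reproduced by hand and compared with the conf component:

```r
# Illustrative data
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4, 4, 6, 8, 12, -8, 2, 0, -1, 5, 3)

s <- boxplot.stats(x)

# Box length (distance between the hinges) and sample size
iqr <- s$stats[4] - s$stats[2]
n   <- length(x)

# Notch limits: median -/+ 1.58 * IQR / sqrt(n)
manual_conf <- s$stats[3] + c(-1.58, 1.58) * iqr / sqrt(n)
manual_conf
s$conf  # Limits computed by boxplot.stats
```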
Boxplot by group in R
If your dataset has a categorical variable containing groups, you can create a boxplot from a formula. In this example, we are going to use the base R chickwts dataset, whose first rows are:
  weight      feed
1    179 horsebean
2    160 horsebean
3    136 horsebean
4    227 horsebean
5    217 horsebean
6    168 horsebean
Now, you can create a boxplot of the weight against the type of feed. Notice that when working with datasets you can refer to the variables by name if you specify the data frame in the data argument.
boxplot(chickwts$weight ~ chickwts$feed)
boxplot(weight ~ feed, data = chickwts) # Equivalent
In addition, in this example you could add points to each boxplot typing:
stripchart(chickwts$weight ~ chickwts$feed, vertical = TRUE, method = "jitter", pch = 19, add = TRUE, col = 1:length(levels(chickwts$feed)))
If all the variables of your dataset are numeric, you can directly create a boxplot from a data frame. For illustration purposes we are going to use the trees dataset, whose first rows are:
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
Note the difference with respect to the chickwts dataset. Nevertheless, you can convert this dataset to the same format as the chickwts dataset with the stack function.
stacked_df <- stack(trees)
head(stacked_df)
  values   ind
1    8.3 Girth
2    8.6 Girth
3    8.8 Girth
4   10.5 Girth
5   10.7 Girth
6   10.8 Girth
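As a small sanity check of the conversion, the sketch below inspects the stacked data and reverses it with unstack. The object name wide_again is only illustrative.

```r
# Stack the three numeric columns into a (values, ind) pair
stacked_df <- stack(trees)

dim(stacked_df)        # One row per original cell
table(stacked_df$ind)  # Number of observations per original column

# unstack() goes back from the long format to one column per group
wide_again <- unstack(stacked_df)
head(wide_again)
```

The long format is the one expected by the formula interface, while the wide format can be passed to boxplot directly.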
Now, you can plot the boxplot with the original or the stacked data frame as we did in the previous section. Note that you can change the boxplot color by group by passing a vector of colors to the col argument. Thus, each boxplot will have a different color.
# Boxplot from the R trees dataset
boxplot(trees, col = rainbow(ncol(trees)))
# Equivalent to:
boxplot(stacked_df$values ~ stacked_df$ind, col = rainbow(ncol(trees)))
If you need to plot a different boxplot for each column of your R data frame, you can use the lapply function and iterate over each column. In this case, we will divide the graphics device with par in one row and as many columns as the dataset has, but you could plot individual graphs. Note that the invisible function avoids displaying the output text of the lapply function.
par(mfrow = c(1, ncol(trees)))
invisible(lapply(1:ncol(trees), function(i) boxplot(trees[, i])))
Reorder boxplot in R
By default, boxplots are plotted in the order of the factor levels in the data. However, you can reorder or sort a boxplot in R by reordering the data by any metric, like the median or the mean, with the reorder function.
par(mfrow = c(1, 2))
# Lower to higher
medians <- reorder(chickwts$feed, chickwts$weight, median)
# medians <- with(chickwts, reorder(feed, weight, median)) # Equivalent
boxplot(chickwts$weight ~ medians, las = 2, xlab = "", ylab = "")
# Higher to lower
medians <- reorder(chickwts$feed, -chickwts$weight, median)
# medians <- with(chickwts, reorder(feed, -weight, median)) # Equivalent
boxplot(chickwts$weight ~ medians, las = 2, xlab = "", ylab = "")
par(mfrow = c(1, 1))
If you want to order the boxplots by another metric, just replace median with the function you prefer.
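To see what reorder actually changes, the following sketch compares the new factor levels with the group medians computed with tapply. The object names are only illustrative.

```r
# Factor whose levels are sorted by the median weight of each feed
ordered_feed <- reorder(chickwts$feed, chickwts$weight, median)
levels(ordered_feed)

# Group medians computed separately, sorted from lower to higher
group_medians <- tapply(chickwts$weight, chickwts$feed, median)
sort(group_medians)

# The same idea with another metric, for example the mean
ordered_by_mean <- reorder(chickwts$feed, chickwts$weight, mean)
levels(ordered_by_mean)
```

Only the order of the levels changes. The observations themselves are left untouched, which is why the reordered factor can be used directly on the right-hand side of the formula.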
A boxplot can be fully customized for a nice result. The following block of code shows an extensive example of how to customize an R box plot and how to add a grid. Note that there are even more arguments than the ones in the following example to customize the boxplot, like staplelwd. Review the full list of graphical boxplot parameters in the pars argument of the bxp function documentation (?bxp).
plot.new()
set.seed(1)

# Light gray background
rect(par("usr")[1], par("usr")[3],
     par("usr")[2], par("usr")[4],
     col = "#ebebeb")

# Add white grid
grid(nx = NULL, ny = NULL, col = "white", lty = 1,
     lwd = par("lwd"), equilogs = TRUE)

# Boxplot
par(new = TRUE)
boxplot(rnorm(500), # Data
        horizontal = FALSE, # Horizontal or vertical plot
        lwd = 2, # Lines width
        col = rgb(1, 0, 0, alpha = 0.4), # Color
        xlab = "X label", # X-axis label
        ylab = "Y label", # Y-axis label
        main = "Customized boxplot in base R", # Title
        notch = TRUE, # Add notch if TRUE
        border = "black", # Boxplot border color
        outpch = 25, # Outliers symbol
        outbg = "green", # Outliers fill color
        whiskcol = "blue", # Whisker color
        whisklty = 2, # Whisker line type
        lty = 1) # Line type (box and median)

# Add a legend
legend("topright", legend = "Boxplot", # Position and title
       fill = rgb(1, 0, 0, alpha = 0.4), # Color
       inset = c(0.03, 0.05), # Modify margins
       bg = "white") # Legend background color
Add mean point to a boxplot in R
By default, when you create a boxplot the median is displayed. Nevertheless, you may also want to display the mean or another characteristic of the data. For that purpose, you can use the segments function if you want to display a line like the one of the median, or the points function to just add points. Note that the code is slightly different for vertical and horizontal boxplots.
The following code block shows how to add mean points and segments to both types of boxplots when working with a single boxplot.
par(mfrow = c(1, 2))

#-----------------
# Vertical boxplot
#-----------------
boxplot(x)

# Add mean line
segments(x0 = 0.8, y0 = mean(x),
         x1 = 1.2, y1 = mean(x),
         col = "red", lwd = 2)
# abline(h = mean(x), col = 2, lwd = 2) # Entire line

# Add mean point
points(mean(x), col = 3, pch = 19)

#-------------------
# Horizontal boxplot
#-------------------
boxplot(x, horizontal = TRUE)

# Add mean line
segments(x0 = mean(x), y0 = 0.8,
         x1 = mean(x), y1 = 1.2,
         col = "red", lwd = 2)
# abline(v = mean(x), col = 2, lwd = 2) # Entire line

# Add mean point
points(mean(x), 1, col = 3, pch = 19)

par(mfrow = c(1, 1))
Note that, in this case, the mean and the median are almost equal, as the distribution is roughly symmetric.
You can replace the mean function in the previous code with another function to display other measures.
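For instance, a different summary can be drawn in exactly the same way. The sketch below marks the 90th percentile of an illustrative vector on a vertical boxplot. The choice of quantile and the name p90 are assumptions made only for this example.

```r
# Illustrative data
x <- c(8, 5, 14, -9, 19, 12, 3, 9, 7, 4, 4, 6, 8, 12, -8, 2, 0, -1, 5, 3)

boxplot(x)

# 90th percentile instead of the mean
p90 <- quantile(x, probs = 0.9)

# Short horizontal line over the box, as done for the mean
segments(x0 = 0.8, y0 = p90,
         x1 = 1.2, y1 = p90,
         col = "blue", lwd = 2)

# Point at the same height (the single box is centered at x = 1)
points(1, p90, col = "blue", pch = 19)
```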
You can also add the mean point to boxplots by group. In this case, you can make use of the lapply function to avoid for loops. In order to calculate the mean for each group you can use the apply function by columns or the colMeans function. The following code block adds the lines and points to vertical and horizontal box and whisker diagrams.
par(mfrow = c(1, 2))

my_df <- trees

#--------------------------
# Vertical boxplot by group
#--------------------------
boxplot(my_df, col = rgb(0, 1, 1, alpha = 0.25))

# Add mean lines
invisible(lapply(1:ncol(my_df),
                 function(i) segments(x0 = i - 0.4, y0 = mean(my_df[, i]),
                                      x1 = i + 0.4, y1 = mean(my_df[, i]),
                                      col = "red", lwd = 2)))

# Add mean points
means <- apply(my_df, 2, mean)
means <- colMeans(my_df) # Equivalent (more efficient)
points(means, col = "red", pch = 19)

#----------------------------
# Horizontal boxplot by group
#----------------------------
boxplot(my_df, col = rgb(0, 1, 1, alpha = 0.25), horizontal = TRUE)

# Add mean lines
invisible(lapply(1:ncol(my_df),
                 function(i) segments(x0 = mean(my_df[, i]), y0 = i - 0.4,
                                      x1 = mean(my_df[, i]), y1 = i + 0.4,
                                      col = "red", lwd = 2)))

# Add mean points
means <- apply(my_df, 2, mean)
means <- colMeans(my_df) # Equivalent (more efficient)
points(means, 1:ncol(my_df), col = "red", pch = 19)

par(mfrow = c(1, 1))
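When the groups come from a categorical variable instead of separate columns, colMeans does not apply. A minimal sketch for that case, assuming the chickwts dataset and using tapply to compute the mean of each group, could look like this:

```r
# Boxplot by group from a formula
boxplot(weight ~ feed, data = chickwts,
        col = rgb(0, 1, 1, alpha = 0.25), las = 2, xlab = "")

# Mean weight of each feed, in the order of the factor levels
group_means <- tapply(chickwts$weight, chickwts$feed, mean)

# The boxes are centered at 1, 2, ..., number of groups
points(seq_along(group_means), group_means, col = "red", pch = 19)
```

Because tapply returns the means in the order of the factor levels, the positions 1, 2, ... match the boxes drawn by boxplot.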
Return values from boxplot
If you assign the result of the boxplot function to a variable, you obtain a list with different components. Create a boxplot with the trees dataset and store it in a variable:
res <- boxplot(trees)
res
$stats
      [,1] [,2] [,3]
[1,]  8.30   63 10.2
[2,] 11.05   72 19.4
[3,] 12.90   76 24.2
[4,] 15.25   80 37.3
[5,] 20.60   87 58.3

$n
[1] 31 31 31

$conf
         [,1]     [,2]    [,3]
[1,] 11.70814 73.72979 19.1204
[2,] 14.09186 78.27021 29.2796

$out
[1] 77

$group
[1] 3

$names
[1] "Girth"  "Height" "Volume"
The output will contain six elements described below:
- stats: each column represents the lower whisker, the first quartile, the median, the third quartile and the upper whisker of each group.
- n: number of observations of each group.
- conf: each column represents the lower and upper extremes of the confidence interval of the median.
- out: values of the outliers (here, a single observation equal to 77).
- group: index of the group each outlier belongs to (here, the third group, Volume).
- names: names of each group.
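The out, group and names components can be combined to list each outlier next to the group it belongs to. The following is a small sketch in which the data frame name outliers is only illustrative:

```r
# plot = FALSE computes the statistics without drawing the boxplot
res <- boxplot(trees, plot = FALSE)

# One row per outlier: group name and observed value
outliers <- data.frame(group = res$names[res$group],
                       value = res$out)
outliers
```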
It is worth mentioning that you can create a boxplot from the variable you have just created (res) with the bxp function.
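As a brief sketch, the stored statistics can be drawn with the base R bxp function, which is the function boxplot itself uses for plotting. The title and fill color below are arbitrary choices for illustration.

```r
# Compute the statistics only
res <- boxplot(trees, plot = FALSE)

# Draw the boxplot from the stored statistics;
# bxp invisibly returns the positions of the box centers
centers <- bxp(res, main = "Redrawn with bxp", boxfill = "lightblue")
centers
```

This is handy when the statistics are expensive to compute or have been modified by hand before plotting.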
Boxplot and histogram
One limitation of box plots is that they are not designed to detect multimodality. For that reason, it is recommended to combine a boxplot with a histogram or a density line.
par(mfrow = c(1, 1))

# Multimodal data
n <- 20000
ii <- rbinom(n, 1, 0.5)
dat <- rnorm(n, mean = 110, sd = 11) * ii +
       rnorm(n, mean = 70, sd = 5) * (1 - ii)

# Histogram
hist(dat, probability = TRUE, ylab = "", col = "grey",
     axes = FALSE, main = "")

# Axis
axis(1)

# Density
lines(density(dat), col = "red", lwd = 2)

# Add boxplot
par(new = TRUE)
boxplot(dat, horizontal = TRUE, axes = FALSE,
        lwd = 2, col = rgb(0, 1, 1, alpha = 0.15))
As an alternative, you can use violin plots or beanplots, which display the shape of the distribution.
Boxplot in R ggplot2
The boxplots we created in the previous sections can also be plotted with the ggplot2 library. For further details read the complete ggplot2 boxplots tutorial.
Boxplot in ggplot2 from vector
The input of the ggplot function has to be a data frame, so you will need to convert the vector to the data.frame class. Then, you can use the geom_boxplot function to create and customize the box and the stat_boxplot function to add the error bars.
# install.packages("ggplot2")
library(ggplot2)

# Transform our 'x' vector
x <- data.frame(x)

# Boxplot with vector
ggplot(data = x, aes(x = "", y = x)) +
  stat_boxplot(geom = "errorbar", # Error bars
               width = 0.2) +
  geom_boxplot(fill = "#4271AE", # Box color
               outlier.colour = "red", # Outliers color
               alpha = 0.9) + # Box color transparency
  ggtitle("Boxplot with vector") + # Plot title
  xlab("") + # X-axis label
  coord_flip() # Horizontal boxplot
Boxplot in ggplot2 by group
If you want to create a ggplot boxplot by group, you will need to specify variables in the
aes argument as follows:
# Boxplot by group
ggplot(data = chickwts, aes(x = feed, y = weight)) +
  stat_boxplot(geom = "errorbar", # Boxplot with error bars
               width = 0.2) +
  geom_boxplot(fill = "#4271AE", colour = "#1F3552", # Colors
               alpha = 0.9, outlier.colour = "red") +
  scale_y_continuous(name = "Weight") + # Continuous variable label
  scale_x_discrete(name = "Feed") + # Group label
  ggtitle("Boxplot by groups ggplot2") + # Plot title
  theme(axis.line = element_line(colour = "black", # Theme customization
                                 size = 0.25))
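The mean of each group can also be added to a ggplot2 boxplot. The following is a minimal sketch, assuming the ggplot2 package is installed and recent enough to provide the fun argument of stat_summary (version 3.3.0 or later). The colors and sizes are arbitrary.

```r
library(ggplot2)

# Boxplot by group with the mean of each group drawn as a red point
p <- ggplot(data = chickwts, aes(x = feed, y = weight)) +
  geom_boxplot(fill = "#4271AE", alpha = 0.9) +
  stat_summary(fun = mean, geom = "point", # Mean of each group
               colour = "red", size = 3)
p
```

Here stat_summary computes the summary per group, so no separate call to tapply or colMeans is needed.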
Boxplot in ggplot2 from dataframe
Finally, to create a boxplot with ggplot2 from a data frame like the trees dataset, you will need to stack the data with the stack function.
# Boxplot from dataframe
ggplot(data = stack(trees), aes(x = ind, y = values)) +
  stat_boxplot(geom = "errorbar", # Boxplot with error bars
               width = 0.2) +
  geom_boxplot(fill = "#4271AE", colour = "#1F3552", # Colors
               alpha = 0.9, outlier.colour = "red") +
  scale_y_continuous(name = "Value") + # Continuous variable label
  scale_x_discrete(name = "Variable") + # Group label
  ggtitle("Boxplot from data frame ggplot2") + # Plot title
  theme(axis.line = element_line(colour = "black", # Theme customization
                                 size = 0.25))