# Histogram in R

## How to make a histogram in R? The R hist function

This tutorial explains how to plot a histogram in R. To illustrate the steps, we are going to use the following data, which represents the distance (in yards) traveled by a golf ball after being hit.

distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

You can plot a histogram in R with the hist function. By default, the function will create a frequency histogram.

hist(distance, main = "Frequency histogram") # Frequency

However, if you set the argument prob to TRUE (or, equivalently, freq to FALSE), you will get a density histogram.

hist(distance, prob = TRUE, main = "Density histogram") # Density
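To see what the density scale means, the following sketch (an illustration added here, with the names `h` and `areas` chosen for the example) computes the histogram without drawing it and checks that the bar areas add up to one.

```r
# Same golf ball data as above
distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
              291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
              253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
              283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
              263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

# plot = FALSE returns the histogram object without drawing it
h <- hist(distance, plot = FALSE)

# Bar area = density * bin width; on the density scale the areas sum to 1
areas <- h$density * diff(h$breaks)
sum(areas)

# On the frequency scale the bar heights are counts,
# which add up to the sample size instead
sum(h$counts)
```

This is why a density histogram, and not a frequency histogram, is the right scale for overlaying a probability density curve.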

In addition, you can also add a grid to the histogram with the grid function as follows:

hist(distance, prob = TRUE)
grid(nx = NA, ny = NULL, lty = 2, col = "gray", lwd = 1)
hist(distance, prob = TRUE, add = TRUE, col = "white")

Note that you have to plot the histogram twice to display the grid under the main plot.

Since R 4.0.0 histograms are gray by default, not white.

## Change histogram color

Now that you know how to create a histogram in R, you can also customize it. If you want to change the color of the bins, set the col parameter to the color you prefer. As with any other plot, you can customize many features of the graph, such as the title, the axes, the font size …

hist(distance, col = "lightblue")
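As a small sketch of the other customizations mentioned above, the call below sets the border color, the title, the axis label and the text sizes. The title and label texts are made up for illustration.

```r
# Same golf ball data as above
distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
              291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
              253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
              283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
              263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

h <- hist(distance,
          col = "lightblue",           # fill color of the bins
          border = "white",            # color of the bin borders
          main = "Golf ball distance", # title (illustrative text)
          xlab = "Distance (yards)",   # x-axis label
          cex.main = 1.5,              # title size
          cex.lab = 1.2,               # axis label size
          cex.axis = 0.9)              # tick label size
```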

## Breaks in R histogram

Histograms are very useful to represent the underlying distribution of the data if the number of bins is selected properly. However, the selection of the number of bins (or the binwidth) can be tricky:

1. With too few bins, the observations are grouped too much and the shape of the distribution is hidden.
2. With too many bins, each bin contains only a few observations, which increases the variability of the resulting plot.

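There are several rules to determine the number of bins. In R, the Sturges method is used by default. If you want to change the number of bins, you can set the argument breaks to the number you want. Note that hist treats a single number as a suggestion only, since the break points are set to pretty values.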

par(mfrow = c(1, 3))

hist(distance, breaks = 2, main = "Few bins")
hist(distance, breaks = 50, main = "Too many bins")
hist(distance, main = "Sturges method")

par(mfrow = c(1, 1))
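Besides Sturges, base R ships the Scott and Freedman-Diaconis rules. The sketch below prints the number of classes each rule suggests and passes the rule name to breaks as a string. The final bins can still differ from the suggested number, because the break points are rounded to pretty values.

```r
# Same golf ball data as above
distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
              291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
              253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
              283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
              263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

# Number of classes suggested by each built-in rule
n_sturges <- nclass.Sturges(distance)
n_scott <- nclass.scott(distance)
n_fd <- nclass.FD(distance)
c(Sturges = n_sturges, Scott = n_scott, FD = n_fd)

# breaks also accepts the name of the rule as a string
par(mfrow = c(1, 3))

hist(distance, breaks = "Sturges", main = "Sturges")
hist(distance, breaks = "Scott", main = "Scott")
hist(distance, breaks = "FD", main = "Freedman-Diaconis")

par(mfrow = c(1, 1))
```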

You can also select the bin width of a histogram with the plug-in methodology of Wand (1995), which is implemented in the KernSmooth package, as follows:

# Plug-in methodology
# install.packages("KernSmooth")
library(KernSmooth)

bin_width <- dpih(distance)

nbins <- seq(min(distance) - bin_width,
max(distance) + bin_width, by = bin_width)

hist(distance, breaks = nbins, main = "Plug-in")

## Histogram in R with two variables

Setting the argument add to TRUE allows you to plot a histogram over another plot. As an example, you could create an R histogram by group with the following code:

set.seed(1)

x <- rnorm(1000)    # First group
y <- rnorm(1000, 1) # Second group

hist(x, main = "Two variables")
hist(y, add = TRUE, col = rgb(1, 0, 0, 0.5))

The rgb function defines a color from its red, green and blue channels, and its fourth argument (alpha) sets the transparency. When combining plots, it is a good idea to use transparent colors so that the plot behind remains visible.
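When the two groups have different ranges, each call to hist picks its own breaks and the second histogram can extend beyond the axis limits of the first. A possible variation, sketched below under the assumption that shared bins are wanted, computes common breaks first. The names `common_breaks`, `hx`, `hy`, `col_x` and `col_y` are illustrative.

```r
set.seed(1)

x <- rnorm(1000)    # First group
y <- rnorm(1000, 1) # Second group

# Common breaks covering both samples, so the bins line up
common_breaks <- pretty(range(c(x, y)), n = 30)

# Compute both histograms without drawing them
hx <- hist(x, breaks = common_breaks, plot = FALSE)
hy <- hist(y, breaks = common_breaks, plot = FALSE)

col_x <- rgb(0, 0, 1, 0.5) # semi-transparent blue
col_y <- rgb(1, 0, 0, 0.5) # semi-transparent red

# Draw both with a y-axis tall enough for the highest bar
plot(hx, col = col_x, main = "Two variables, shared bins",
     xlab = "Value", ylim = c(0, max(hx$counts, hy$counts)))
plot(hy, col = col_y, add = TRUE)
legend("topright", legend = c("x", "y"),
       fill = c(col_x, col_y), bty = "n")
```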

## Add normal curve to histogram

To plot a normal curve over the histogram, you can use the dnorm and lines functions as follows:

hist(distance, prob = TRUE, main = "Histogram with normal curve")
x <- seq(min(distance), max(distance), length = 40)
f <- dnorm(x, mean = mean(distance), sd = sd(distance))
lines(x, f, col = "red", lwd = 2)
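The example above needs the density scale. If you prefer to keep a frequency histogram, one option is to rescale the normal density by the sample size times the bin width, since the expected count in a bin is roughly n × width × density. This is a sketch that assumes equal-width bins; the name `bin_width` is illustrative.

```r
# Same golf ball data as above
distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
              291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
              253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
              283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
              263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

# Frequency histogram; keep the returned object to read the breaks
h <- hist(distance, main = "Frequency histogram with normal curve")

# Width of the bins (assumes all bins have the same width)
bin_width <- diff(h$breaks)[1]

# Normal density rescaled to the count scale
x <- seq(min(distance), max(distance), length.out = 100)
f <- dnorm(x, mean = mean(distance), sd = sd(distance)) *
  length(distance) * bin_width

lines(x, f, col = "red", lwd = 2)
```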

## Add density line to histogram

In order to add a density curve over a histogram you can use the lines function for plotting the curve and density for calculating the underlying non-parametric (kernel) density of the distribution.

hist(distance, freq = FALSE, main = "Density curve")
lines(density(distance), lwd = 2, col = 'red')

Bandwidth selection for non-parametric density estimation is an area of intense research. Also note that, by default, the density function uses the Gaussian kernel. For more information call ?density.
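To get a feel for how much the bandwidth matters, the sketch below overlays three kernel density estimates on the same histogram: the default bandwidth, the Sheather-Jones selector (bw = "SJ") and the default bandwidth doubled with adjust = 2. The object names are illustrative.

```r
# Same golf ball data as above
distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
              291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
              253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
              283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
              263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

d_default <- density(distance)             # default bandwidth rule
d_sj <- density(distance, bw = "SJ")       # Sheather-Jones selector
d_smooth <- density(distance, adjust = 2)  # twice the default bandwidth

hist(distance, freq = FALSE, main = "Effect of the bandwidth")
lines(d_default, lwd = 2, col = "red")
lines(d_sj, lwd = 2, col = "blue")
lines(d_smooth, lwd = 2, col = "darkgreen")
legend("topleft", c("Default", "SJ", "adjust = 2"), bty = "n",
       lty = 1, lwd = 2, col = c("red", "blue", "darkgreen"))

# Bandwidths actually used by each estimate
c(default = d_default$bw, SJ = d_sj$bw, adjusted = d_smooth$bw)
```

A larger bandwidth gives a smoother curve, while a smaller one follows the data more closely.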

We are going to combine the previous code into a function that automatically creates a histogram with the normal and density curves:

histDenNorm <- function (x, main = "") {
  hist(x, prob = TRUE, main = main) # Histogram
  lines(density(x), col = "blue", lwd = 2) # Density
  x2 <- seq(min(x), max(x), length = 40)
  f <- dnorm(x2, mean(x), sd(x))
  lines(x2, f, col = "red", lwd = 2) # Normal
  legend("topright", c("Histogram", "Density", "Normal"), box.lty = 0,
         lty = 1, col = c("black", "blue", "red"), lwd = c(1, 2, 2))
}

Now, you can check the behavior of the function with sample data.

set.seed(1)

# Normal data
x <- rnorm(n = 5000, mean = 110, sd = 5)

# Exponential data
y <- rexp(n = 3000, rate = 1)

par(mfcol = c(1, 2))

histDenNorm(x, main = "Histogram of X")
histDenNorm(y, main = "Histogram of Y")

par(mfcol = c(1, 1))

## Combination: histogram and boxplot in R

You can add a boxplot over a histogram by calling par(new = TRUE) between the two plots.

hist(distance, probability = TRUE, ylab = "", main = "",
col = rgb(1, 0, 0, alpha = 0.5), axes = FALSE)
par(new = TRUE)
boxplot(distance, horizontal = TRUE, axes = FALSE,
lwd = 2, col = rgb(0, 0, 0, alpha = 0.2))

You could also add the normal or density curve to the previous plot.
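Keep in mind that with par(new = TRUE) each plot computes its own axis limits: the histogram spans the range of its breaks, while the boxplot spans the range of the data, so the two may not line up exactly. A possible fix, sketched below, passes the histogram breaks as limits to the boxplot. For a horizontal boxplot the data axis is controlled by ylim. The call to axis(1) is an addition for reading the values.

```r
# Same golf ball data as above
distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
              291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
              253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
              283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
              263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

# Keep the histogram object to reuse its breaks
h <- hist(distance, probability = TRUE, ylab = "", main = "",
          col = rgb(1, 0, 0, alpha = 0.5), axes = FALSE)
axis(1) # x-axis, to read the distances

par(new = TRUE)

# Same horizontal limits as the histogram
boxplot(distance, horizontal = TRUE, axes = FALSE,
        ylim = range(h$breaks),
        lwd = 2, col = rgb(0, 0, 0, alpha = 0.2))
```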

## Histogram in R with ggplot2

To create a histogram with the ggplot2 package, call ggplot and add a geom_histogram layer, passing the data as a data.frame. Inside aes you need to specify the name of the variable in the data frame.

# install.packages("ggplot2")
library(ggplot2)

ggplot(data.frame(distance), aes(x = distance)) +
geom_histogram(color = "gray", fill = "white")

This plot will return a message telling you that the histogram was calculated using 30 bins. That’s because, by default, ggplot2 uses 30 bins rather than the Sturges method.

Now we are going to calculate the number of bins with the Sturges method as the hist function does and set it with the breaks argument. Note you could also set the binwidth argument if preferred.

# Calculating the breaks like the hist() function
nbreaks <- pretty(range(distance), n = nclass.Sturges(distance),
min.n = 1)

ggplot(data.frame(distance), aes(x = distance)) +
geom_histogram(breaks = nbreaks, color = "gray", fill = "white")

As you can see, the bins now match those of the first histogram, created with the hist function.
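If you want to confirm that these breaks are the ones hist uses, a quick check in base R (no ggplot2 needed) is to compare them with the breaks stored in the histogram object:

```r
# Same golf ball data as above
distance <- c(241.1, 284.4, 220.2, 272.4, 271.1, 268.3,
              291.6, 241.6, 286.1, 285.9, 259.6, 299.6,
              253.1, 239.6, 277.8, 263.8, 267.2, 272.6,
              283.4, 234.5, 260.4, 264.2, 295.1, 276.4,
              263.1, 251.4, 264.0, 269.2, 281.0, 283.2)

# Breaks calculated like the hist() function does by default
nbreaks <- pretty(range(distance), n = nclass.Sturges(distance),
                  min.n = 1)

# Breaks chosen by hist() itself (without drawing the plot)
h <- hist(distance, plot = FALSE)

# TRUE if both sets of breaks coincide
all.equal(h$breaks, nbreaks)
```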

In ggplot2 you can also add the density curve with the geom_density function. Moreover, if you want to fill the area under the curve, set the argument fill to the color you prefer and alpha to the level of transparency of the color. Note that you need to map y to the density inside geom_histogram, so that the bars and the curve share the same scale, as follows:

# after_stat() requires ggplot2 >= 3.3.0
ggplot(data.frame(distance), aes(x = distance)) +
geom_histogram(aes(y = after_stat(density)), breaks = nbreaks,
color = "gray", fill = "white") +
geom_density(fill = "black", alpha = 0.2)

## Plotly histogram

An alternative for creating histograms is to use the plotly package (an adaptation of the JavaScript plotly library to R), which creates graphics in an interactive format. For instance, you could run the following:

# install.packages("plotly")
library(plotly)

# Frequency histogram
fig <- plot_ly(x = distance, type = "histogram")
fig

# Density histogram
fig <- plot_ly(x = distance, type = "histogram", histnorm = "probability density")
fig