Home » Statistics with R » Normal distribution in R

Normal distribution in R

The normal or gaussian distribution

Among continuous random variables, the most important is the Normal or Gaussian distribution. This variable was introduced by Carl Friedrich in the XIX century for studying error measures.

Let X \sim N(\mu, \sigma), namely a random variable following a normal distribution with mean \mu and standard deviation \sigma:

• The probability density function (PDF), also known as Bell curve, of x is f(x) = \frac{1}{\sqrt{2\pi \sigma^{2}}} e^{\frac{1}{2} (\frac{x - \mu}{\sigma})^2}.
• The cumulative distribution function (CDF) is F(x) = P(X \leq x).
• The quantile function is Q(p) = F^{-1}(p).
• The expencted mean and variance are E(X) = \mu and Var(X) = \sigma^2, respectively.

In R there exist the dnorm, pnorm and qnorm functions, which allows calculating the normal density, distribution and quantile function for a set of values. In addition, the rnorm function allows obtaining random observations that follow a normal distibution. The following table summarizes the functions related to the normal distribution:

By default, these functions consider the standard Normal distribution, which has a mean of zero and a standard deviation of one.

Although we will review each function in detail on the corresponding section, in the following illustration you can appreciate the relationship between the dnorm, pnorm and qnorm functions.

The dnorm function

In R, you can make use of the dnorm function to calculate the density function with mean \mu and standard deviation \sigma for any value of x, \mu and \sigma.

dnorm(x,           # X-axis values (grid)
mean = 0,    # Integer or vector representing the mean/s
sd = 1,      # Integer or vector representing the standard deviation/s
log = FALSE) # If TRUE, probabilities are given as log

Consider, for instance, that you want to obtain the PDF for x \in (-4, 4), with mean 1 and standard deviation of 3. In order to calculate it, you could type:

x <- -4:4
# x <- seq(-4, 4, length = 100) # More data points
dnorm(x, mean = 1, sd = 3) 
0.03315905 0.05467002 0.08065691 0.10648267 0.12579441
0.13298076 0.12579441 0.10648267 0.08065691

You can also specify vectors to the mean and sd arguments of the function. In the following example, the value of even elements are from \mu = 1, \sigma = 3 and odds are from \mu = 2, \sigma = 4.

dnorm(x, mean = c(1, 2), sd = c(3, 4))
0.03315905 0.04566227 0.08065691 0.07528436 0.12579441
0.09666703 0.12579441 0.09666703 0.08065691

Plot Normal distribution in R

Creating a normal distribution plot in R is easy. You just need to create a grid for the X-axis for the first argument of the plot function and pass as input of the second the dnorm function for the corresponding grid. In the following example we show how to plot normal distributions for different means and variances.

par(mfrow = c(1, 2))

# Grid of X-axis values
x <- seq(-4, 8, 0.1)

#-----------------------------------------
# Same standard deviation, different mean
#-----------------------------------------
# Mean 0, sd 1
plot(x, dnorm(x, mean = 0, sd = 1), type = "l",
ylim = c(0, 0.6), ylab = "", lwd = 2, col = "red")
# Mean 3, sd 1
lines(x, dnorm(x, mean = 3, sd = 1), col = "blue", lty = 1, lwd = 2)

legend("topright", legend = c("0 1", "3 1"), col = c("red", "blue"),
title = expression(paste(mu, " ", sigma)),
title.adj = 0.9, lty = 1, lwd = 2, box.lty = 0)

#-----------------------------------------
# Same mean, different standard deviation
#-----------------------------------------
# Mean 1, sd 1
plot(x, dnorm(x, mean = 1, sd = 1), type = "l",
ylim = c(0, 1), ylab = "", lwd = 2, col = "red")
# Mean 1, sd 0.5
lines(x, dnorm(x, mean = 1, sd = 0.5), col = "blue", lty = 1, lwd = 2)

legend("topright", legend = c("1 1", "1 0.5"), col = c("red", "blue"),
title = expression(paste(mu, " ", sigma)),
title.adj = 0.75, lty = 1, lwd = 2, box.lty = 0)

par(mfrow = c(1, 1))

The pnorm function

The pnorm function gives the Cumulative Distribution Function (CDF) of the Normal distribution in R, which is the probability that the variable X takes a value lower or equal to x.

The syntax of the function is the following:

pnorm(q,
mean = 0,
sd = 1,
lower.tail = TRUE, # If TRUE, probabilities are P(X <= x), or P(X > x) otherwise
log.p = FALSE)     # If TRUE, probabilities are given as log

As an example, taking into account that the Normal distribution is symmetric, the probability that the variable will take a value lower than the mean is 0.5:

pnorm(0, mean = 0, sd = 1) # 0.5

pnorm function example

Now, suppose that you have a machine that packages rice inside boxes. The process follows a Normal distribution and it is known that the mean of the weight of each box is 1000 grams and the standard deviation is 10 grams. You can plot the density function typing:

Mean <- 1000
Sd <- 10

# X grid for non-standard normal distribution
x <- seq(-3, 3, length = 100) * Sd + Mean

# Density function
f <- dnorm(x, Mean, Sd)

plot(x, f, type = "l", lwd = 2, col = "blue", ylab = "", xlab = "Weight")
abline(v = Mean) # Vertical line on the mean

First, if you want to calculate the probability of a box weighing less than 1010 grams (P(X < 1010) = P(X \leq 1010)), you can type the following:

pnorm(1010, Mean, Sd) # 0.8413447 or 84.13%
1 - pnorm(1010, Mean, Sd, lower.tail = FALSE) # Equivalent

So the probability of a box wheighing less than 1010 grams is 0.8413 or 84.13%, which corresponds to the following area:

lb <- min(x) # Lower bound
ub <- 1010   # Upper bound

x2 <- seq(min(x), ub, length = 100) # New Grid
y <- dnorm(x2, Mean, Sd) # Density

plot(x, f, type = "l", lwd = 2, col = "blue", ylab = "", xlab = "Weight")
abline(v = ub)

polygon(c(lb, x2, ub), c(0, y, 0), col = rgb(0, 0, 1, alpha = 0.5))
text(995, 0.01, "84.13%")

As shading the area under the Normal curve can be tricky and requires several lines of code, we have created a simple function to achieve it in a single line:

# mean: mean of the Normal variable
# sd: standard deviation of the Normal variable
# lb: lower bound of the area
# ub: upper bound of the area
# acolor: color of the area
# ...: additional arguments to be passed to lines function

normal_area <- function(mean = 0, sd = 1, lb, ub, acolor = "lightgray", ...) {
x <- seq(mean - 3 * sd, mean + 3 * sd, length = 100)

if (missing(lb)) {
lb <- min(x)
}
if (missing(ub)) {
ub <- max(x)
}

x2 <- seq(lb, ub, length = 100)
plot(x, dnorm(x, mean, sd), type = "n", ylab = "")

y <- dnorm(x2, mean, sd)
polygon(c(lb, x2, ub), c(0, y, 0), col = acolor)
lines(x, dnorm(x, mean, sd), type = "l", ...)
}

As an example, if you want to shade the area between -1 and 2 of a standard Normal distribution you can type:

normal_area(mean = 0, sd = 1, lb = -1, ub = 2, lwd = 2)

Second, in case that you want to calculate the probability of a box weighing more than 980 grams (P(X > 980) = P(X \geq 980)) you can use the lower.tail argument.

pnorm(980, Mean, Sd, lower.tail = FALSE) # 0.9772499 or 97.72%
1 - pnorm(980, Mean, Sd) # Equivalent
pnorm(1020, Mean, Sd)    # Equivalent by symmetry

The calculated probability corresponds to the following area:

normal_area(mean = Mean, sd = Sd, lb = 980, acolor = rgb(0, 0, 1, alpha = 0.5))
text(1000, 0.01, "97.72%")

Finally, if you want to calculate the probability of a box weighing more than 990 grams and less than 1000 you have to calculate P(X \leq 1000) - P(x \leq 990) = P(X < 1000) - P(x <990) and hence you can type:

pnorm(1000, Mean, Sd) - pnorm(990, Mean, Sd) # 0.3413447 or 34.13%

You can plot the area with the following code:

normal_area(mean = Mean, sd = Sd, lb = 990, ub = 1000,
acolor = rgb(0, 0, 1, alpha = 0.5))
text(995, 0.01, "34.13%", srt = 90)
For continuous variables, as P(X = x) = 0, P(X \geq x) = P(X > x) and P(X \leq x) = P(X < x).

Plot normal cumulative distribution function in R

With the pnorm function you can also plot the cumulative density function of the Gaussian or Normal distribution in R:

par(mfrow = c(1, 2))

# Grid of X-axis values
x <- seq(-4, 8, 0.1)

#-----------------------------------------
# Same standard deviation, different mean
#-----------------------------------------
# Mean 0, sd 1
plot(x, pnorm(x, mean = 0, sd = 1), type = "l",
ylim = c(0, 1), ylab = "", lwd = 2, col = "red")
# Mean 3, sd 1
lines(x, pnorm(x, mean = 3, sd = 1), col = "blue", lty = 1, lwd = 2)

# Legend
legend("topleft", legend = c("0 1", "3 1"), col = c("red", "blue"),
title = expression(paste(mu, " ", sigma)),
title.adj = 0.9, lty = 1, lwd = 2, box.lty = 0)

#-----------------------------------------
# Same mean, different standard deviation
#-----------------------------------------
# Mean 1, sd 1
plot(x, pnorm(x, mean = 1, sd = 1), type = "l",
ylim = c(0, 1), ylab = "", lwd = 2, col = "red")
# Mean 1, sd 0.5
lines(x, pnorm(x, mean = 1, sd = 0.5), col = "blue", lty = 1, lwd = 2)

# Legend
legend("topleft", legend = c("1 1", "1 0.5"), col = c("red", "blue"),
title = expression(paste(mu, " ", sigma)),
title.adj = 0.75, lty = 1, lwd = 2, box.lty = 0)

par(mfrow = c(1, 1))

Recall that P(X < 0) = 0.5 for a standard Normal distribution:

x <- seq(-4, 4, 0.1)
plot(x, pnorm(x, mean = 0, sd = 1), type = "l",
ylim = c(0, 1), ylab = "P(X < x)", lwd = 2, col = "red")
segments(0, 0, 0, 0.5, lwd = 2, lty = 2)
segments(-4, 0.5, 0, 0.5, lwd = 2, lty = 2)

The qnorm function

The qnorm function allows you to find the quantile (percentile) Q for any probability p. Hence, the qnorm function is the inverse of the pnorm function. The syntax of qnorm is as follows:

qnorm(p,                  # Integer or vector of probabilities
mean = 0,           # Integer or vector of means
sd = 1,             # Integer or vector of standard deviations
lower.tail = TRUE,  # If TRUE, probabilities are P(X <= x), or P(X > x) otherwise
log.p = FALSE)      # If TRUE, probabilities are given as log

As a first example, the quantile for probability 0.5 (Q(0.5)) on a symmetric distribution is equal to the mean:

qnorm(0.5, mean = 0, sd = 1) # 0

In addition, you can obtain the quantile for any given probability. Note the relation between pnorm and qnorm functions:

x <- pnorm(-1.5, mean = 0, sd = 1) # 0.0668072
qnorm(x, mean = 0, sd = 1) # -1.5

normal_area(mean = 0, sd = 1, ub = -1.5,
lwd = 2, acolor = rgb(0, 0, 1, alpha = 0.5))
arrows(-0.5, 0.1, -1.45, 0, lwd = 2, length = 0.2)
text(-0.25, 0.13, "-1.5", cex = 1.5)

If you want to calculate, for instance, the quantile Q(P(X > 1.5)) = Q(1 - P(X \leq 1.5)) = Q(0.067) you can set the lower.tail argument to TRUE.

x <- pnorm(1.5, mean = 0, sd = 1, lower.tail = FALSE) # 0.0668072
qnorm(x, mean = 0, sd = 1, lower.tail = FALSE) # 1.5

# Equivalent to:
# x <- 1- pnorm(1.5, mean = 0, sd = 1)
# qnorm(1 - x, mean = 0, sd = 1)

normal_area(mean = 0, sd = 1, lb = 1.5, lwd = 2,
acolor = rgb(0, 0, 1, alpha = 0.5))
arrows(0, 0.1, 1.45, 0, lwd = 2, length = 0.2)
text(0, 0.13, "1.5", cex = 1.5)

Plotting the Normal quantile function

You can plot the quantile function of a standard Normal distribution typing the following:

plot(qnorm, pnorm(-4), pnorm(4), lwd = 2, xlab = "p", ylab = "Q(p)")

# Equivalent to:
# x <- seq(pnorm(-4), pnorm(4), length = 100)
# plot(x, qnorm(x, mean = 0, sd = 1), type = "l",
#      lwd = 2, xlab = "p", ylab = "Q(p)")

The previous plot represents the possible outcomes of the qnorm function for the standard Normal distribution. Recall that Q(0.5) = 0:

plot(qnorm, pnorm(-4), pnorm(4), lwd = 2,
xlab = "p", ylab = "Q(p)")
segments(0.5, -4, 0.5, 0, lty = 2, lwd = 2)
segments(0, 0, 0.5, 0, lty = 2, lwd = 2)

The rnorm function

The rnorm function generates n observations from the Normal distribution with mean \mu and standard deviation \sigma. The syntax of the rnorm function in R is the following:

rnorm(n,        # Number of observations to be generated
mean = 0, # Integer or vector of means
sd = 1)   # Integer or vector of standard deviations

Hence, you can generate 10 observations of a standard Normal distribution in R with the following code:

rnorm(10)
0.3320065 -0.6454991 -0.3692699 -0.4160260  0.5756340
-0.8793241 -0.8095590  0.4926698 -0.8038873 -0.3446890

However, it should be noted that if you don’t specify a seed the output won’t be reproducible. You can make use of the set.seed function to make your code reproducible:

# Seed for reproducibility
set.seed(1)

rnorm(1) # -0.6264538

In addition, in the following plot you can observe how increasing the number of observations, the histogram of the data approaches to the true Normal density function:

par(mfrow = c(1, 3))

x <- seq(-10, 10, length = 200)

set.seed(3)
# n = 10
hist(rnorm(10, mean = 0, sd = 1), main = "n = 10",
xlab = "", prob = TRUE)
lines(x, dnorm(x), col = "red", lwd = 2)

# n = 100
hist(rnorm(100, mean = 0, sd = 1), main = "n = 100",
xlab = "", prob = TRUE)
lines(x, dnorm(x), col = "red", lwd = 2)

# n = 1000
hist(rnorm(1000, mean = 0, sd = 1), main = "n = 1000",
xlab = "", prob = TRUE)
lines(x, dnorm(x), col = "red", lwd = 2)

par(mfrow = c(1, 1))