Random samples and permutations in R

Statistics with R Simulation
Random samples and permutations in R

The sample function in R is used to create random samples or permutations (samples with or without replacement) and even select elements randomly based on specific probabilities assigned to each element (weighted sampling).

If you run the examples of this tutorial you will get other output. Set a seed with set.seed to make the results reproducible.

Syntax of sample

The sample function has the following syntax:

sample(x, size, replace = FALSE, prob = NULL)

Being:

  • x: a vector or list containing the elements from which to select a sample.
  • size: the number of items to select. If replace = TRUE, it specifies the number of items to sample with replacement.
  • replace: a logical value indicating whether sampling should be done with replacement (TRUE) or without replacement (FALSE). Default is FALSE.
  • prob: an optional vector of probability weights for obtaining the elements of x.

Random sample without replacement

By default, the sample function returns a random permutation of the input vector, this is, it returns the elements of the vector but in a different order (without repeating any). The following example illustrates how the function returns the vector named x but in one of the possible different orders.

x <- 1:10

# Permutation of 'x'
sample(x)
2  3 10  7  5  9  4  6  1  8

Note that you can also utilize the sample function with character or boolean vectors and even lists.

y <- c("A", "A", "C", "D")

# Permutation of 'y'
sample(y)
"D" "C" "A" "A"

With size you can control the length of the output. The example below illustrates how to return only one sample, this is, 1 out of the 10 elements of the input vector.

x <- 1:10

# One sample
sample(x, size = 1)
3

Notice that you can’t input a value greater than the length of the input vector. In this scenario an error will arise unless you set replace = TRUE.

x <- 1:10

# 'size' can't be greater than the length of 'x'
sample(x, size = 15)
Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'

Random sample with replacement

When replace = TRUE the sampling is performed with replacement, so if an element is choose it can also be chosen in the following sample and the same element can appear several times.

x <- 1:10

# Random sample with replacement
sample(x, replace = TRUE)
5  2  4  6  7  6  3  7 10  1

In this scenario, the sample size can be greater than the length of the input vector and the elements can be repeated.

x <- 1:10

# Random sample
sample(x, size = 15, replace = TRUE)
2  7  4  3  1  7  2  4  4  3 10  7  2  5  3

Weighted sampling

When a random sample is computed all the elements have the same probability. This can be illustrated with a bar plot with the proportions of a random sample of a vector with two different elements.

set.seed(21)
tab <- table(sample(c("A", "B"), size = 10000, replace = TRUE))

# Bar plot with the proportions
barplot(prop.table(tab), col = c(4, 3), main = "Random sampling")

Random sampling in R with the sample() function

However, the prob argument allows specifying different probabilities for each element to reflect real-world scenarios. In the following example the first element of the vector will have a probability of 0.8 while the second 0.2.

# Weighted sampling
sample(c("A", "B"), size = 10, prob = c(0.8, 0.2), replace = TRUE)
"A" "B" "B" "A" "A" "A" "A" "A" "A" "A"

The proportions of the elements of the output will be (as in this example) or will be close to the specified probabilities.

tab <- table(c("A", "B", "B", "A", "A", "A", "A", "A", "A", "A"))

# Bar plot with the proportions
barplot(prop.table(tab), col = c(4, 3), main = "Weighted sampling")

Weighted sampling in R with the sample() function

Notice that the proportions of elements of the output will converge to the specified probabilities as the sample size increases. The following block of code creates a line chart to illustrate this convergence.

# Sample sizes
s <- seq(5, 10000, 5)

# Empty data frame
df <- data.frame("A" = NA, "B" = NA)

# Calculate samples for different sample sizes
set.seed(2)
for(i in 1:length(s)) {
  
  a <- sample(c("A", "B"), size = i, replace = TRUE, prob = c(0.2, 0.8))
  
  df[i, ] <- prop.table(table(a))
  
}

# Line chart
plot(s, df$A, type = "l", xlab = "Sample size", ylab = "%")
lines(s, df$B, type = "l", col = 4)
abline(h = 0.8, lty = 2, col = "red")
abline(h = 0.2, lty = 2, col = "red")

Convergence of probabilities in R when the sample size increases

Additional examples

Reproducible samples

As stated before, if you don’t set a seed your results won’t be reproducible. The set.seed function allows to specify a seed for pseudo-random number generation and hence make the output reproducible.

# Set a seed (any integer)
set.seed(1)

# Reproducible sample
sample(1:10)
9  4  7  1  2  5  3 10  6  8

If you run the previous code you will get the same output of the block of code.

Sample of the rows of a data frame

A common use case of the sample function is to randomly select rows of a data frame. As rows in R can be selected using indices, you can create a sample of the desired size of a vector from 1 to the number of rows to create a sample of rows.

# Sample data
set.seed(21)
df <- data.frame(x = 1:10, y = rnorm(10))

# Select 5 rows randomly
df[sample(1:nrow(df), size = 5), ]
    x            y
10 10 -0.002393336
1   1  0.793013171
2   2  0.522251264
7   7 -1.570199630
8   8 -0.934905667