Subset in R

Learn how to subset data in R with square brackets and the subset function

Subsetting data consists of obtaining a subsample of the original data, that is, selecting specific elements based on their position, their name or some condition. In this tutorial you will learn in detail how to make a subset in R in the most common scenarios, explained with several examples.

How to subset data in R?

Subsetting data in R can be achieved in different ways, depending on the data you are working with. In general, you can subset:

  1. Using square brackets ([] and [[]] operators).
  2. Using the dollar sign ($) if the elements are named.
  3. With functions, like the subset command for conditional or logical subsets.

Single and double square brackets in R

Before going through each case, it is worth mentioning the difference between single and double square brackets when subsetting data in R, so that it does not need to be repeated for every use case. Suppose you have the following named numeric vector:

x <- c(one = 1, two = 2)

As we will explain in more detail in the corresponding section, you can access the first element of the vector with either single or double square brackets by specifying the index of the element.

The difference is that single square brackets maintain the structure of the original object, whereas double square brackets simplify the result as much as possible. This can be verified with the following example:

# Single square brackets
x[1]
one       # <-- Maintains the name of the element
  1
# Double square brackets
x[[1]]
[1] 1     # <-- Simplified output

Another interesting difference appears when you try to access observations out of the bounds of the vector. In this case, single square brackets return an NA value, whereas double square brackets raise an error.

# Single square brackets
x[6]
<NA> 
  NA 
# Double square brackets
x[[6]]
Error in x[[6]] : subscript out of bounds

However, double brackets cannot always be used, for instance in several cases when working with data frames and matrices, as will be pointed out in the corresponding sections.

Note that when a subset returns no observations, it means that the condition you specified is never met.
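As a minimal sketch of the last point, the following code reuses the vector x defined above with a condition that no element satisfies. The object names empty and empty_subset are only illustrative.

```r
# Named vector from the example above
x <- c(one = 1, two = 2)

# No element is greater than 10, so the result is an empty (length-zero) vector
empty <- x[x > 10]
length(empty)          # 0

# subset() behaves the same way when no element meets the condition
empty_subset <- subset(x, x > 10)
length(empty_subset)   # 0
```

Checking the length of the result is a simple way to detect a condition that matched nothing before continuing with an analysis.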

Subset function in R

The subset function allows conditional subsetting in R for vector-like objects, matrices and data frames.

# For vectors
subset(x,             # Numeric vector
       condition)     # Logical condition/s

# For matrices and data frames
subset(x,             # Matrix or data frame
       condition,     # Logical condition/s
       select,        # Selected columns
       drop = FALSE)  # Whether to maintain the object structure (default) or not

In the following sections we will use both this function and the bracket operators in most of the examples. Note that this function allows you to subset by one or multiple conditions.
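One practical difference between the two approaches is how missing values are handled. The sketch below uses a small illustrative vector v (not part of the examples of this tutorial) to show that square brackets keep an NA where the condition cannot be evaluated, while subset() drops it, and that several conditions can be combined with & or |.

```r
# Illustrative vector with a missing value
v <- c(3, NA, 8, 12)

# Square brackets keep an NA wherever the condition evaluates to NA
with_brackets <- v[v > 5]              # NA  8 12

# subset() treats NA conditions as FALSE and drops those elements
with_subset <- subset(v, v > 5)        # 8 12

# Several conditions combined with & (and); use | for "or"
between <- subset(v, v > 5 & v < 10)   # 8
```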

Subset vector in R

Subsetting a variable in R stored in a vector can be achieved in several ways:

  1. Selecting the indices you want to display. If more than one, select them using the c function.
  2. Using boolean indices to indicate if a value must be selected (TRUE) or not (FALSE).
  3. Using logical operators with the subset function.
  4. If you want to select all the values except one or some, make a subset indicating the indices with a negative sign.

The following summarizes the ways to subset vectors in R with several examples.

my_vector <- c(15, 21, 17, 25, 12, 51)
# Returns the full vector
my_vector[]

# Third value
my_vector[3]

# Third value, simplified
my_vector[[3]]

# Elements one to three
my_vector[1:3]

# Second and fifth elements
my_vector[c(2, 5)]

# Second element twice
my_vector[c(2, 2)]

# All values except the fourth
my_vector[-4]

# All values except the fourth and fifth
my_vector[-c(4, 5)]
my_vector[c(-4, -5)] # Equivalent

# First, third, fourth and sixth values
my_vector[c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)]

# Elements greater than 15
my_vector[my_vector > 15]

# With subset function
subset(my_vector, my_vector > 15)

my_vector[] is useful when you want to assign the same value to all the elements of an already created vector. As an example, my_vector[] <- 1 will replace all the values of the vector with 1, whereas my_vector <- 1 will overwrite the whole vector with a single number.
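The difference between both assignments can be checked with the short sketch below. The copies filled and overwritten are names introduced only for this illustration, so the original my_vector is left untouched.

```r
my_vector <- c(15, 21, 17, 25, 12, 51)

# Empty brackets on the left-hand side: every element is replaced,
# but the length of the vector is kept
filled <- my_vector
filled[] <- 1
filled               # 1 1 1 1 1 1
length(filled)       # 6

# Plain assignment: the whole object is replaced by a single number
overwritten <- my_vector
overwritten <- 1
length(overwritten)  # 1
```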

In addition, if your vector is named, you can use all the previous methods and also subset the data by name, specifying the names of the elements as character strings.

my_named_vector <- setNames(my_vector, letters[2:7])
# Element "b"
my_named_vector["b"]

# Element "b", simplified
my_named_vector[["b"]]

# Elements "d" and "f"
my_named_vector[c("d", "f")]

Note that vectors can be of any data type.
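For instance, the same operators work on character and logical vectors. The vectors fruits and flags below are made-up examples used only for illustration.

```r
# Character vector
fruits <- c("apple", "banana", "cherry", "date")

# By index
first_and_third <- fruits[c(1, 3)]       # "apple" "cherry"

# By condition: names with more than five characters
long_names <- fruits[nchar(fruits) > 5]  # "banana" "cherry"

# Logical vector: all elements except the first
flags <- c(TRUE, FALSE, TRUE)
without_first <- flags[-1]               # FALSE TRUE
```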

Subsetting a list in R

Consider the following sample list:

my_list <- list(1:10, c(TRUE, FALSE), 1)

You can use single or double square brackets to subset both the elements of the list and the subelements they contain.

# Second object of the list
my_list[2]

# Second object of the list, simplified
my_list[[2]]

# Second object simplified, first element
my_list[[2]][1]

# Second object, first element, all simplified
my_list[[2]][[1]]

If your list has names, you can access its elements by specifying the element name between brackets or with the dollar sign. Note that the dollar sign returns the element itself, like double square brackets.

my_named_list <- list(x = 1:10, y = c(TRUE, FALSE), z = 1)
# First element (returned as a list of length one)
my_named_list["x"]

# First element, simplified
my_named_list[["x"]]
my_named_list$x # Equivalent

# Second element, simplified
my_named_list[["y"]]

In addition, it is also possible to make a logical subsetting in R for lists. For example, you could replace the first element of the list with a subset of it in the following way:

my_list[[1]] <- subset(my_list[[1]], my_list[[1]] > 5)
my_list
[[1]]
[1]  6  7  8  9 10

[[2]]
[1]  TRUE FALSE

[[3]]
[1] 1
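A logical index can also be used to keep or discard whole elements of a list. The following sketch is one possible approach, not the only one: it builds the logical index with sapply and compares the result with the base R function Filter. The object names is_num, numeric_elements and numeric_elements_2 are illustrative.

```r
my_list <- list(1:10, c(TRUE, FALSE), 1)

# One TRUE/FALSE value per list element
is_num <- sapply(my_list, is.numeric)   # TRUE FALSE TRUE

# Keep only the numeric elements (the result is still a list)
numeric_elements <- my_list[is_num]
length(numeric_elements)                # 2

# Filter() from base R returns the same elements
numeric_elements_2 <- Filter(is.numeric, my_list)
```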

Subset R data frame

Subsetting a data frame consists of obtaining some rows or columns of the full data frame, or those that meet one or several conditions. It is very common to subset a data frame in R for analysis purposes. Consider, for instance, the following sample data frame:

set.seed(24)
my_df <- data.frame(x = 1:10,
                    y = 11:20,
                    z = 3:12,
                    w = sample(c("Group 1", "Group 2"), 10, replace = TRUE))
head(my_df)
x   y  z    w
1  11  3  Group 1
2  12  4  Group 1
3  13  5  Group 2
4  14  6  Group 2
5  15  7  Group 2
6  16  8  Group 2

Columns subset in R

You can subset a column in R in different ways:

  1. If you want to subset just one column, you can use single or double square brackets to specify the index or the name (between quotes) of the column.
  2. Specifying the indices after a comma (leaving the first argument blank selects all rows of the data frame). In this case, use the drop argument instead of double square brackets to control whether the result is simplified.
  3. In case of subsetting multiple columns of a data frame just indicate the columns inside a vector.

The following block of code shows some examples:

# First column (simplified as vector)
my_df[[1]]
my_df[, 1]   # Equivalent
my_df[["x"]] # Equivalent
my_df[, c(TRUE, FALSE, FALSE, FALSE)] # Equivalent

# First column (with column and row names)
my_df[1]
my_df[, 1, drop = FALSE] # Equivalent
my_df["x"]               # Equivalent
my_df[c(TRUE, FALSE, FALSE, FALSE)] # Equivalent

# Second and third column
my_df[c(2, 3)]
my_df[, c(2, 3)]   # Equivalent
my_df[c("y", "z")] # Equivalent
my_df[c(FALSE, TRUE, TRUE, FALSE)] # Equivalent

When you select a single column using a comma inside the brackets (for instance, my_df[, 1]), the result is simplified to a vector by default. Set drop = FALSE to maintain the original structure of the object. Selections of two or more columns are always returned as a data frame.
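The effect of the drop argument can be verified by checking the class of each result. The sketch below uses a small illustrative data frame df instead of my_df so that it runs on its own.

```r
# Small illustrative data frame
df <- data.frame(x = 1:3, y = 4:6, z = 7:9)

class_vector  <- class(df[, 1])                # "integer": simplified to a vector
class_kept    <- class(df[, 1, drop = FALSE])  # "data.frame": structure kept
class_single  <- class(df[1])                  # "data.frame": no comma, no simplification
class_several <- class(df[, c(1, 2)])          # "data.frame": two columns are not simplified
```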

Subset dataframe by column name

Subsetting a data frame by column name in R can also be achieved with the dollar sign ($), specifying the name of the column with or without quotes.

# First column
my_df$x

# Second column
my_df$y
my_df$"y" # Equivalent
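Keep in mind that the dollar sign expects the literal column name. If the name is stored in a variable, brackets are needed instead. A minimal sketch, where df and col are hypothetical names used only for illustration:

```r
df <- data.frame(x = 1:3, y = 4:6)
col <- "y"   # Hypothetical variable holding the column name

# Double brackets evaluate the variable and return the column as a vector
as_vector <- df[[col]]           # 4 5 6

# Single brackets return a data frame with that column
as_data_frame <- df[col]

# The dollar sign looks for a column literally called "col" and finds none
not_found <- is.null(df$col)     # TRUE
```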

Subset dataframe by column value

You can also subset a data frame depending on the values of the columns. As an example, you may want to make a subset with all values of the data frame where the corresponding value of the column z is greater than 5, or where the group of the w column is Group 1.

# Values where column z is greater than 5
my_df[my_df$z > 5, ]

# All values corresponding to Group 1
my_df[my_df$w == "Group 1", ]

Note that when subsetting a data frame by column value you have to specify the condition in the first argument, as the output will be a subset of rows of the data frame.

You can also apply a conditional subset by column values with the subset function as follows. Note that when using this function you can use the variable names directly.

# All values corresponding to Group 1
subset(my_df, w == "Group 1") # Equivalent
  x   y   z    w
  1  11   3  Group 1
  2  12   4  Group 1
  7  17   9  Group 1
 10  20  12  Group 1
# All values where 'y' is lower or equal to 14
subset(my_df, y <= 14)
 x   y  z     w
 1  11  3  Group 1
 2  12  4  Group 1
 3  13  5  Group 2
 4  14  6  Group 2

When using the subset function with a data frame you can also specify the columns you want to be returned, indicating them in the select argument.

# Select the columns to return
subset(my_df, x > 3, select = c(x, w))
  x    w
  4  Group 2
  5  Group 2
  6  Group 2
  7  Group 1
  8  Group 2
  9  Group 2
 10  Group 1

In addition, you can use multiple subset conditions at once. Subsetting with multiple conditions is just as easy as subsetting by one condition. In the following example we select the rows where the value of the column x is 1 or 6.

my_df[my_df$x == 1 | my_df$x == 6, ]
x   y  z    w
1  11  3  Group 1
6  16  8  Group 2
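Conditions on different columns can be combined in the same way. The following sketch assumes a small data frame df with fixed values (defined here so the result does not depend on random sampling) and shows an "and" condition with both brackets and subset, plus an "or" condition written with %in%.

```r
# Illustrative data frame with fixed groups
df <- data.frame(x = 1:6,
                 z = 3:8,
                 w = c("Group 1", "Group 1", "Group 2",
                       "Group 2", "Group 2", "Group 1"))

# AND: rows in Group 2 whose z value is greater than 5
both <- df[df$w == "Group 2" & df$z > 5, ]
both$x          # 4 5

# The same condition with subset(), using the column names directly
both_subset <- subset(df, w == "Group 2" & z > 5)
both_subset$x   # 4 5

# OR written with %in%, equivalent to df$x == 1 | df$x == 6
either <- df[df$x %in% c(1, 6), ]
either$x        # 1 6
```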

Subset rows in R

Analogously to column subset, you can subset rows of a data frame indicating the indices you want to subset as the first argument between square brackets.

# First row
my_df[1, ]

# Fourth and sixth row
my_df[c(4, 6), ]

Subset rows by list of values

In case you want to subset rows based on a vector you can use the %in% operator or the is.element function as follows:

values <- c(12, 14)

my_df[my_df$y %in% values, ]
my_df[is.element(my_df$y, values), ] # Equivalent
 x   y  z    w
 2  12  4  Group 1
 4  14  6  Group 2

If you need the opposite of %in% you can create a new operator with `%ni%` <- Negate(`%in%`) or write my_df[!(my_df$y %in% values), ]. Note that the operator names must be wrapped in backticks.
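As a short sketch of both alternatives, the code below applies them to a plain vector y with the same values as the y column of the sample data frame. The name %ni% is just a convention. Any name between percent signs would work.

```r
# Custom "not in" operator built from %in%
`%ni%` <- Negate(`%in%`)

y <- 11:20
values <- c(12, 14)

# Elements of y that are NOT in values
kept <- y[y %ni% values]            # 11 13 15 16 17 18 19 20

# Same result negating %in% directly
kept_negated <- y[!(y %in% values)]
```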

Subset by date

Many data frames have a column of dates, where each row represents a date and the remaining columns contain the events registered on those dates. To subset by date, the column must be in date format, so convert it with the as.Date function if needed.

dates <- seq(as.Date("2011/1/1"), by = "day", length.out = 10)

my_df_dates <- cbind(dates, my_df)
head(my_df_dates)
   dates    x   y  z     w
2011-01-01  1  11  3  Group 1
2011-01-02  2  12  4  Group 1
2011-01-03  3  13  5  Group 2
2011-01-04  4  14  6  Group 2
2011-01-05  5  15  7  Group 2
2011-01-06  6  16  8  Group 2

As an example, you can subset the values corresponding to dates later than January 5, 2011 with the following code:

subset(my_df_dates, dates > as.Date("2011-01-05"))
   dates     x   y   z    w
2011-01-06   6  16   8  Group 2
2011-01-07   7  17   9  Group 1
2011-01-08   8  18  10  Group 2
2011-01-09   9  19  11  Group 2
2011-01-10  10  20  12  Group 1
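Two conditions can be combined to select a range of dates. The following sketch builds a reduced data frame df_dates (an illustrative name, with only the dates and x columns) so that it can be run on its own.

```r
dates <- seq(as.Date("2011-01-01"), by = "day", length.out = 10)
df_dates <- data.frame(dates = dates, x = 1:10)

# Rows from January 3 to January 6, 2011 (both included)
in_range <- subset(df_dates,
                   dates >= as.Date("2011-01-03") &
                     dates <= as.Date("2011-01-06"))
in_range$x      # 3 4 5 6

# Rows of a given month, comparing the formatted year and month
january <- df_dates[format(df_dates$dates, "%Y-%m") == "2011-01", ]
nrow(january)   # 10
```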

Subsetting in R by unique date

Note that in case your date column contains the same date several times and you want to select all the rows that correspond to that date, you can use the == logical operator with the subset function as follows:

subset(my_df_dates, dates == as.Date("2011-01-02"))

Subset a matrix in R

Subsetting a matrix in R is very similar to subsetting a data frame. Consider the following sample matrix:

set.seed(45)

my_matrix <- matrix(sample(1:9), ncol = 3)
colnames(my_matrix) <- c("one", "two", "three")
my_matrix
     one two three
[1,]   6   8     1
[2,]   3   7     4
[3,]   2   5     9

You can subset the rows and columns specifying the indices of rows and then of columns. You can also use boolean data type.

# Subset matrix with rows and columns index
my_matrix[c(1, 3), c(1, 2)]

# Subset with logical values
my_matrix[c(TRUE, FALSE, TRUE), c(TRUE, TRUE, FALSE)] # Equivalent

# You can also mix
my_matrix[c(1, 3), c(TRUE, TRUE, FALSE)] # Equivalent
     one two
[1,]   6   8
[2,]   2   5

Note that if you subset the matrix to just one column or row it will be converted to a vector. In order to preserve the matrix class, you can set the drop argument to FALSE.

my_matrix[, 2]
[1] 8 7 5
my_matrix[, 2, drop = FALSE]
     two
[1,]   8
[2,]   7
[3,]   5

Subset matrix by column and row names

In case your matrix contains row or column names, you can use them instead of the index to subset the matrix. In the following example we selected the columns named ‘two’ and ‘three’.

my_matrix[, c("two", "three")]
     two three
[1,]   8     1
[2,]   7     4
[3,]   5     9

Subset matrix by column values

As with data frames, you can subset a matrix by the values of its columns. In this case, we are making a subset based on a condition over the values of the third column.

my_matrix[my_matrix[, 3] > 2, ]
     one  two  three
[1,]  3    7     4
[2,]  2    5     9
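When only one row meets the condition, the result is simplified to a vector unless drop = FALSE is set. The sketch below recreates the values of the sample matrix explicitly under the illustrative name m, so that it does not depend on the random seed.

```r
# Same values as the sample matrix, filled by columns
m <- matrix(c(6, 3, 2, 8, 7, 5, 1, 4, 9), ncol = 3,
            dimnames = list(NULL, c("one", "two", "three")))

# Only the third row has a value greater than 5 in column "three"
one_row <- m[m[, "three"] > 5, ]
is.null(dim(one_row))      # TRUE: simplified to a named vector

# drop = FALSE keeps a matrix with one row and three columns
one_row_matrix <- m[m[, "three"] > 5, , drop = FALSE]
dim(one_row_matrix)        # 1 3
```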

Subset time series

Time series are a type of R object with which you can create subsets of data based on time. We will use, for instance, the nottem time series.

nottem

class(nottem) # ts

The window function allows you to create subsets of time series, as shown in the following example:

# Data from 1930 (included)
window(nottem, start = c(1930))

# Data from April 1930 (included)
window(nottem, start = c(1930, 4))
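The window function also accepts an end argument, so you can extract a closed period. As a further sketch, the cycle function from the stats package gives the position of each observation within the year, which allows selecting the same month across all years. The object names year_1930 and july are illustrative.

```r
# Monthly data for the whole year 1930
year_1930 <- window(nottem, start = c(1930, 1), end = c(1930, 12))
length(year_1930)   # 12

# All July observations (the result is a plain numeric vector, not a ts object)
july <- nottem[cycle(nottem) == 7]
length(july)        # 20, one value per year from 1920 to 1939
```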