How to subset data in R?
Subsetting data in R can be achieved in different ways, depending on the data you are working with. In general, you can subset:
- Using square brackets (the [] and [[]] operators).
- Using the dollar sign ($) if the elements are named.
- With functions, like the subset command for conditional or logical subsets.
Single and double square brackets in R
Before going through each case, it is worth mentioning the difference between single and double square brackets when subsetting data in R, so that it does not have to be repeated in every section. Suppose you have the following named numeric vector:
x <- c(one = 1, two = 2)
As explained in more detail in the corresponding section, you can access the first element of the vector with either single or double square brackets by specifying the index of the element.
The difference is that single square brackets maintain the original input structure, whereas double square brackets simplify it as much as possible. This can be verified with the following example:
# Single square brackets
x[1]
one # <-- Maintains the name of the element
1
# Double square brackets
x[[1]]
1 # <-- Simplified output
Another difference appears when you try to access elements out of the bounds of the vector. In this case, single square brackets return an NA value, whereas double brackets throw an error.
# Single square brackets
x[6]
<NA>
NA
# Double square brackets
x[[6]]
Error in x[[6]] : subscript out of bounds
However, double brackets cannot always be used, for instance in several cases when working with data frames and matrices, as pointed out in the corresponding sections.
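The two behaviours described above can be checked programmatically. The following is a minimal sketch; the helper names out_single and out_double are only illustrative, and tryCatch is used so that the out-of-bounds error does not stop the script.

```r
x <- c(one = 1, two = 2)

# Single brackets keep the name, double brackets drop it
names(x[1])    # "one"
names(x[[1]])  # NULL

# Out of bounds: NA with single brackets, an error with double brackets
out_single <- x[6]
out_double <- tryCatch(x[[6]], error = function(e) "error caught")

is.na(out_single)  # TRUE
out_double         # "error caught"
```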
Subset function in R
The subset function allows conditional subsetting in R for vector-like objects, matrices and data frames.
# For vectors
subset(x, # Numeric vector
condition) # Logical condition/s
# For matrices and data frames
subset(x, # Matrix or data frame
condition, # Logical condition/s
select, # Selected columns
drop = FALSE) # Whether to maintain the object structure (default) or not
In the following sections we will use both this function and the bracket operators in most of the examples. Note that this function allows you to subset by one or multiple conditions.
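One practical difference between subset and logical indexing with brackets is how missing values in the condition are treated. The sketch below uses a hypothetical vector v (not part of the examples in this tutorial) to illustrate it.

```r
v <- c(3, NA, 8, 1)  # hypothetical vector with a missing value

# Brackets keep an NA wherever the condition itself is NA
with_brackets <- v[v > 2]
with_brackets  # 3 NA 8

# subset() drops the elements whose condition is NA
with_subset <- subset(v, v > 2)
with_subset    # 3 8
```

If your data may contain missing values, this is worth keeping in mind when choosing between the two approaches.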
Subset vector in R
Subsetting a vector in R can be achieved in several ways:
- Selecting the indices you want to display. If more than one, select them using the c function.
- Using boolean indices to indicate if a value must be selected (TRUE) or not (FALSE).
- Using logical operators with the subset function.
- If you want to select all the values except one or some, make a subset indicating the index with a negative sign.
The following summarizes the ways to subset vectors in R with several examples.
my_vector <- c(15, 21, 17, 25, 12, 51)
# Returns the full vector
my_vector[]
# Third value
my_vector[3]
# Third value, simplified
my_vector[[3]]
# Elements one to three
my_vector[1:3]
# Second and fifth elements
my_vector[c(2, 5)]
# Second element twice
my_vector[c(2, 2)]
# All values except the fourth
my_vector[-4]
# All values except the fourth and fifth
my_vector[-c(4, 5)]
my_vector[c(-4, -5)] # Equivalent
# First, third, fourth and sixth values
my_vector[c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)]
# Elements greater than 15
my_vector[my_vector > 15]
# With subset function
subset(my_vector, my_vector > 15)
Note that my_vector[] is useful when you want to assign the same value to all the elements of an already created vector. As an example, my_vector[] <- 1 will replace all the values of the vector with 1, but my_vector <- 1 will overwrite the whole vector with a single number.
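The difference between the two assignments can be seen by comparing the lengths of the results. This is a small sketch; the copies a and b are only illustrative names.

```r
my_vector <- c(15, 21, 17, 25, 12, 51)

a <- my_vector
a[] <- 1     # every element is replaced, the length is preserved
length(a)    # 6

b <- my_vector
b <- 1       # the whole object is replaced by a single number
length(b)    # 1
```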
In addition, if your vector is named, you can use the previous approaches and also subset by name, specifying the element names as character strings.
my_named_vector <- setNames(my_vector, letters[2:7])
# Element "b"
my_named_vector["b"]
# Element "b", simplified
my_named_vector[["b"]]
# Elements "d" and "f"
my_named_vector[c("d", "f")]
Subsetting a list in R
Consider the following sample list:
my_list <- list(1:10, c(TRUE, FALSE), 1)
You can use single or double brackets to subset the elements of the list and their subelements.
# Second object of the list
my_list[2]
# Second object of the list, simplified
my_list[[2]]
# Second object simplified, first element
my_list[[2]][1]
# Second object, first element, all simplified
my_list[[2]][[1]]
If the list has names, you can access its elements by name, either inside brackets or with the dollar sign.
my_named_list <- list(x = 1:10, y = c(TRUE, FALSE), z = 1)
# First element (returned as a list of length one)
my_named_list["x"]
# First element, simplified
my_named_list[["x"]]
my_named_list$x # Equivalent
# Second element, simplified
my_named_list[["y"]]
In addition, it is also possible to apply logical subsetting to the elements of a list. For example, you could replace the first element of the list with a subset of it in the following way:
my_list[[1]] <- subset(my_list[[1]], my_list[[1]] > 5)
my_list
[[1]]
[1] 6 7 8 9 10
[[2]]
[1] TRUE FALSE
[[3]]
[1] 1
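A logical vector can also be used to select whole elements of a list, not only values inside one element. The following sketch keeps only the numeric elements of the sample list; the names keep and numeric_only are illustrative.

```r
my_list <- list(1:10, c(TRUE, FALSE), 1)

# One TRUE/FALSE per list element
keep <- vapply(my_list, is.numeric, logical(1))
keep  # TRUE FALSE TRUE

# Single brackets with a logical vector return a shorter list
numeric_only <- my_list[keep]
length(numeric_only)  # 2
```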
Subset R data frame
Subsetting a data frame consists of obtaining some rows or columns of the full data frame, or those that meet one or several conditions. It is very common to subset a data frame in R for analysis purposes. Consider, for instance, the following sample data frame:
set.seed(24)
my_df <- data.frame(x = 1:10,
y = 11:20,
z = 3:12,
w = sample(c("Group 1", "Group 2"), 10, replace = TRUE))
head(my_df)
x y z w
1 11 3 Group 1
2 12 4 Group 1
3 13 5 Group 2
4 14 6 Group 2
5 15 7 Group 2
6 16 8 Group 2
Columns subset in R
You can subset a column in R in different ways:
- If you want to subset just one column, you can use single or double square brackets to specify the index or the name (between quotes) of the column.
- Specifying the indices after a comma (leaving the first argument blank selects all rows of the data frame). In this case you can't use double square brackets, but you can use the drop argument.
- To subset multiple columns of a data frame, just indicate the columns inside a vector.
The following block of code shows some examples:
# First column (simplified as vector)
my_df[[1]]
my_df[, 1] # Equivalent
my_df[["x"]] # Equivalent
my_df[, c(TRUE, FALSE, FALSE, FALSE)] # Equivalent
# First column (with column and row names)
my_df[1]
my_df[, 1, drop = FALSE] # Equivalent
my_df["x"] # Equivalent
my_df[c(TRUE, FALSE, FALSE, FALSE)] # Equivalent
# Second and third column
my_df[c(2, 3)]
my_df[, c(2, 3)] # Equivalent
my_df[c("y", "z")] # Equivalent
my_df[c(FALSE, TRUE, TRUE, FALSE)] # Equivalent
Note that when you index with a comma you can set drop = FALSE to maintain the original structure of the object, instead of using double square brackets.
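The effect of each form on the class of the result can be checked directly. The sketch below uses a small hypothetical data frame, small_df, with integer columns.

```r
small_df <- data.frame(x = 1:3, y = 4:6)  # hypothetical data

# Simplified to a vector
class(small_df[[1]])                # "integer"
class(small_df[, 1])                # "integer"

# Structure preserved
class(small_df[1])                  # "data.frame"
class(small_df[, 1, drop = FALSE])  # "data.frame"
```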
Subset dataframe by column name
Subsetting a data frame by column name in R can also be achieved using the dollar sign ($), specifying the name of the column with or without quotes.
# First column
my_df$x
# Second column
my_df$y
my_df$"y" # Equivalent
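One limitation of the dollar sign is that it takes the column name literally, so it does not work when the name is stored in a variable. In that case double square brackets can be used instead. A minimal sketch, with a hypothetical data frame small_df and variable col_name:

```r
small_df <- data.frame(x = 1:3, y = 4:6)  # hypothetical data
col_name <- "y"

# Double brackets evaluate the variable
small_df[[col_name]]  # 4 5 6

# The dollar sign looks for a column literally called "col_name"
small_df$col_name     # NULL
```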
Subset dataframe by column value
You can also subset a data frame depending on the values of its columns. As an example, you may want to select all the rows of the data frame where the value of the column z is greater than 5, or where the group in the w column is Group 1.
# Values where column z is greater than 5
my_df[my_df$z > 5, ]
# All values corresponding to Group 1
my_df[my_df$w == "Group 1", ]
You can also apply a conditional subset by column values with the subset function as follows. Note that when using this function you can use the variable names directly.
# All values corresponding to Group 1
subset(my_df, w == "Group 1") # Equivalent
x y z w
1 11 3 Group 1
2 12 4 Group 1
7 17 9 Group 1
10 20 12 Group 1
# All values where 'y' is less than or equal to 14
subset(my_df, y <= 14)
x y z w
1 11 3 Group 1
2 12 4 Group 1
3 13 5 Group 2
4 14 6 Group 2
When using the subset function with a data frame you can also specify the columns you want to be returned, indicating them in the select argument.
# Select the columns to return
subset(my_df, x > 3, select = c(x, w))
x w
4 Group 2
5 Group 2
6 Group 2
7 Group 1
8 Group 2
9 Group 2
10 Group 1
In addition, you can use multiple subset conditions at once. Subsetting with multiple conditions is just as easy as subsetting by one condition. In the following example we select the rows where the value of the column x is 1 or 6.
my_df[my_df$x == 1 | my_df$x == 6, ]
x y z w
1 11 3 Group 1
6 16 8 Group 2
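Conditions can also be combined with the & operator, so that a row is kept only when all of them hold. The sketch below uses a hypothetical data frame, small_df, whose values mirror the first six rows printed earlier, and shows the same filter with brackets and with subset.

```r
small_df <- data.frame(x = 1:6,
                       y = 11:16,
                       w = c("Group 1", "Group 1", "Group 2",
                             "Group 2", "Group 2", "Group 2"))

# Rows where x is greater than 2 AND the group is "Group 2"
both <- small_df[small_df$x > 2 & small_df$w == "Group 2", ]

# Same filter with subset(), using the column names directly
both_subset <- subset(small_df, x > 2 & w == "Group 2")

both$x  # 3 4 5 6
```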
Subset rows in R
As with columns, you can subset the rows of a data frame by indicating the indices you want as the first argument between square brackets.
# First row
my_df[1, ]
# Fourth and sixth row
my_df[c(4, 6), ]
Subset rows by list of values
In case you want to subset rows based on a vector of values, you can use the %in% operator or the is.element function as follows:
values <- c(12, 14)
my_df[my_df$y %in% values, ]
my_df[is.element(my_df$y, values), ] # Equivalent
x y z w
2 12 4 Group 1
4 14 6 Group 2
To negate %in%, you can create a new operator with '%ni%' <- Negate('%in%') or write my_df[!(my_df$y %in% values), ].
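As a quick illustration of the negated operator, the following sketch applies it to a plain vector y holding the same values as the y column of the sample data frame.

```r
# Define the "not in" operator
`%ni%` <- Negate(`%in%`)

y <- 11:20
values <- c(12, 14)

not_in <- y[y %ni% values]
not_in  # 11 13 15 16 17 18 19 20

# Same result without defining a new operator
same <- y[!(y %in% values)]
```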
Subset by date
Many data frames have a column of dates, where each row represents a date and the remaining columns contain events registered on those dates. To subset by date, that column must be in date format, so convert it with the as.Date function if needed.
dates <- seq(as.Date("2011/1/1"), by = "day", length.out = 10)
my_df_dates <- cbind(dates, my_df)
head(my_df_dates)
dates x y z w
2011-01-01 1 11 3 Group 1
2011-01-02 2 12 4 Group 1
2011-01-03 3 13 5 Group 2
2011-01-04 4 14 6 Group 2
2011-01-05 5 15 7 Group 2
2011-01-06 6 16 8 Group 2
As an example, you can subset the rows corresponding to dates after January 5, 2011 with the following code:
subset(my_df_dates, dates > as.Date("2011-01-05"))
dates x y z w
2011-01-06 6 16 8 Group 2
2011-01-07 7 17 9 Group 1
2011-01-08 8 18 10 Group 2
2011-01-09 9 19 11 Group 2
2011-01-10 10 20 12 Group 1
Subsetting in R by unique date
Note that in case your date column contains the same date several times and you want to select all the rows that correspond to that date, you can use the == logical operator with the subset function as follows:
subset(my_df_dates, dates == as.Date("2011-01-02"))
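Two date conditions can be combined to select a range of dates. The sketch below builds a small hypothetical data frame, dates_df, and keeps the rows between January 3 and January 5, 2011 (both included).

```r
dates <- seq(as.Date("2011-01-01"), by = "day", length.out = 10)
dates_df <- data.frame(dates = dates, x = 1:10)  # hypothetical data

in_range <- subset(dates_df,
                   dates >= as.Date("2011-01-03") &
                     dates <= as.Date("2011-01-05"))
in_range$x  # 3 4 5
```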
Subset a matrix in R
Subsetting a matrix in R is very similar to subsetting a data frame. Consider the following sample matrix:
set.seed(45)
my_matrix <- matrix(sample(1:9), ncol = 3)
colnames(my_matrix) <- c("one", "two", "three")
my_matrix
one two three
[1,] 6 8 1
[2,] 3 7 4
[3,] 2 5 9
You can subset the rows and columns by specifying the indices of the rows and then of the columns. You can also use logical vectors.
# Subset matrix with rows and columns index
my_matrix[c(1, 3), c(1, 2)]
# Subset with logical values
my_matrix[c(TRUE, FALSE, TRUE), c(TRUE, TRUE, FALSE)] # Equivalent
# You can also mix
my_matrix[c(1, 3), c(TRUE, TRUE, FALSE)] # Equivalent
one two
[1,] 6 8
[2,] 2 5
Note that if you subset the matrix to just one column or row it will be converted to a vector. In order to preserve the matrix class, you can set the drop argument to FALSE.
my_matrix[, 2]
[1] 8 7 5
my_matrix[, 2, drop = FALSE]
two
[1,] 8
[2,] 7
[3,] 5
Subset matrix by column and row names
In case your matrix contains row or column names, you can use them instead of the indices to subset the matrix. In the following example we select the columns named ‘two’ and ‘three’.
my_matrix[, c("two", "three")]
two three
[1,] 8 1
[2,] 7 4
[3,] 5 9
Subset matrix by column values
As with data frames, you can subset a matrix by the values of its columns. In this case, we are making a subset based on a condition on the values of the third column.
my_matrix[my_matrix[, 3] > 2, ]
one two three
[1,] 3 7 4
[2,] 2 5 9
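When a condition matches only one row, the result is simplified to a vector unless drop = FALSE is set. The following sketch rebuilds the values printed above in a hypothetical matrix m, so that it does not depend on the random seed.

```r
# Same values as the matrix printed above, filled column by column
m <- matrix(c(6, 3, 2, 8, 7, 5, 1, 4, 9), ncol = 3,
            dimnames = list(NULL, c("one", "two", "three")))

# Only the third row has a value greater than 5 in column "three"
one_row <- m[m[, "three"] > 5, ]
is.matrix(one_row)  # FALSE, simplified to a named vector

kept <- m[m[, "three"] > 5, , drop = FALSE]
dim(kept)           # 1 3
```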
Subset time series
Time series are a type of R object that can be subset based on time. We will use, for instance, the nottem time series.
nottem
class(nottem) # ts
The window function allows you to create subsets of time series, as shown in the following example:
# Data from 1930 (included)
window(nottem, start = c(1930))
# Data from April 1930 (included)
window(nottem, start = c(1930, 4))
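The window function also accepts an end argument, so both limits of the period can be set. As a sketch, the following extracts the twelve monthly observations of 1930 from nottem; the object name year_1930 is only illustrative.

```r
# Data from January 1930 to December 1930 (both included)
year_1930 <- window(nottem, start = c(1930, 1), end = c(1930, 12))

length(year_1930)  # 12
start(year_1930)   # 1930 1
end(year_1930)     # 1930 12
```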