- 1 How to subset data in R?
- 2 Subset function in R
- 3 Subset vector in R
- 4 Subsetting a list in R
- 5 Subset R data frame
- 6 Subset a matrix in R
- 7 Subset time series
How to subset data in R?
Subsetting data in R can be achieved in different ways, depending on the data you are working with. In general, you can subset:
- Using square brackets ([] and [[]] operators).
- Using the dollar sign ($) if the elements are named.
- With functions, like the subset command for conditional or logical subsets.
Single and double square brackets in R
Before the explanations for each case, it is worth mentioning the difference between single and double square brackets when subsetting data in R, so that it does not have to be repeated in every section. Suppose you have the following named numeric vector:
x <- c(one = 1, two = 2)
As we will explain in more detail in the corresponding section, you can access the first element of the vector with single or double square brackets, specifying the index of the element.
The difference is that single square brackets maintain the original input structure, while double brackets simplify it as much as possible. This can be verified with the following example:
# Single square brackets
x[1]
one   # <-- Maintains the name of the element
  1

# Double square brackets
x[[1]]
1     # <-- Simplified output
Another interesting characteristic appears when you try to access observations out of the bounds of the vector. In this case, single square brackets return an NA value, whereas double brackets raise an error.
# Single square brackets (any index beyond the length of the vector)
x[3]     # NA
# Double square brackets
x[[3]]
Error in x[[3]] : subscript out of bounds
However, in several cases it is not possible to use double brackets, for example when working with data frames and matrices, as will be pointed out in the corresponding sections.
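The behaviour described in this section can be checked programmatically. The following is a minimal sketch; the object names single, double, out_single and out_double are only illustrative, and tryCatch is used so that the out-of-bounds error does not stop the script.

```r
x <- c(one = 1, two = 2)

# Single brackets keep the name, double brackets drop it
single <- x[1]
double <- x[[1]]
names(single)   # "one"
names(double)   # NULL

# Out of bounds: single brackets return NA, double brackets raise an error
out_single <- x[3]
out_double <- tryCatch(x[[3]], error = function(e) "error")
is.na(out_single)   # TRUE
out_double          # "error"
```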
Subset function in R
The subset function allows conditional subsetting in R for vector-like objects, matrices and data frames.
# For vectors
subset(x,            # Numeric vector
       condition)    # Logical condition/s

# For matrices and data frames
subset(x,            # Matrix or data frame
       condition,    # Logical condition/s
       select,       # Selected columns
       drop = FALSE) # Whether to maintain the object structure (default) or not
In the following sections we will use both this function and the bracket operators in most of the examples. Note that this function allows you to subset by one or multiple conditions.
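One practical difference between the operators and the function is how missing values are handled. The sketch below uses a hypothetical vector named scores (not part of the examples of this tutorial): square brackets keep an NA for every position where the condition is NA, while subset treats those positions as FALSE.

```r
scores <- c(10, NA, 25, 40, NA, 5)   # hypothetical data with missing values

# Square brackets keep the positions where the condition is NA
with_brackets <- scores[scores > 8]          # 10 NA 25 40 NA

# subset() treats NA in the condition as FALSE
with_subset <- subset(scores, scores > 8)    # 10 25 40

# Multiple conditions are combined with & (and) or | (or)
two_conditions <- subset(scores, scores > 8 & scores < 30)   # 10 25
```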
Subset vector in R
Subsetting a variable in R stored in a vector can be achieved in several ways:
- Selecting the indices you want to display. If more than one, select them using the c function.
- Using boolean indices to indicate if a value must be selected (TRUE) or not (FALSE).
- Using logical operators with the subset function.
- If you want to select all the values except one or some, make a subset indicating the index with a negative sign.
The following summarizes the ways to subset vectors in R with several examples.
my_vector <- c(15, 21, 17, 25, 12, 51)
# Returns the full vector
my_vector[]
# Third value
my_vector[3]
# Third value, simplified
my_vector[[3]]
# Elements one to three
my_vector[1:3]
# Second and fifth elements
my_vector[c(2, 5)]
# Second element twice
my_vector[c(2, 2)]
# All values except the fourth
my_vector[-4]
# All values except the fourth and fifth
my_vector[-c(4, 5)]
my_vector[c(-4, -5)] # Equivalent
# First, third, fourth and sixth values
my_vector[c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)]
# Elements greater than 15
my_vector[my_vector > 15]
# With subset function
subset(my_vector, my_vector > 15)
The empty subset my_vector[] is useful when you want to assign the same value to all the elements of an already created vector. As an example, my_vector[] <- 1 will replace all the values of the vector with 1, but my_vector <- 1 will overwrite the vector with a single number.
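The difference between both assignments can be seen side by side. In this short sketch the copies filled and overwritten are illustrative names, used so that the sample vector itself is left untouched.

```r
my_vector <- c(15, 21, 17, 25, 12, 51)

# Empty brackets: every element is replaced and the length is kept
filled <- my_vector
filled[] <- 1
filled        # 1 1 1 1 1 1

# No brackets: the whole object is replaced by a single number
overwritten <- my_vector
overwritten <- 1
overwritten   # 1
```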
In addition, if your vector is named, you can subset it both in the ways shown above and by specifying the element names as character strings.
my_named_vector <- setNames(my_vector, letters[2:7])
# Element "b"
my_named_vector["b"]
# Element "b", simplified
my_named_vector[["b"]]
# Elements "d" and "f"
my_named_vector[c("d", "f")]
Subsetting a list in R
Consider the following sample list:
my_list <- list(1:10, c(TRUE, FALSE), 1)
You can use single or double brackets to access the elements of the list and their subelements.
# Second object of the list
my_list[2]
# Second object of the list, simplified
my_list[[2]]
# Second object simplified, first element
my_list[[2]][1]
# Second object, first element, all simplified
my_list[[2]][[1]]
In case you have a list with names, you can access them specifying the element name or accessing them with the dollar sign.
my_named_list <- list(x = 1:10, y = c(TRUE, FALSE), z = 1)
# First element (kept as a list)
my_named_list["x"]
# First element, simplified
my_named_list$x
my_named_list[["x"]] # Equivalent
# Second element, simplified
my_named_list[["y"]]
In addition, it is also possible to make a logical subsetting in R for lists. For example, you could replace the first element of the list with a subset of it in the following way:
my_list[[1]] <- subset(my_list[[1]], my_list[[1]] > 5)
my_list
[[1]]
[1]  6  7  8  9 10

[[2]]
[1]  TRUE FALSE

[[3]]
[1] 1
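Logical subsetting can also be applied to the list itself, in order to keep only the elements that meet a condition. The following sketch is an addition to the example above, not a replacement for it: it builds a logical vector with sapply and is.numeric (the names is_num and numeric_elements are illustrative) and compares the result with the base R function Filter.

```r
my_list <- list(1:10, c(TRUE, FALSE), 1)

# Logical vector with one entry per list element
is_num <- sapply(my_list, is.numeric)   # TRUE FALSE TRUE

# Single brackets keep the result as a list
numeric_elements <- my_list[is_num]
length(numeric_elements)                # 2

# Filter() is a base R shortcut for the same idea
same_result <- identical(Filter(is.numeric, my_list), numeric_elements)
```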
Subset R data frame
Subsetting a data frame consists of obtaining some rows or columns of the full data frame, or those that meet one or several conditions. It is very common to subset a data frame in R for analysis purposes. Consider, for instance, the following sample data frame:
set.seed(24)
my_df <- data.frame(x = 1:10,
                    y = 11:20,
                    z = 3:12,
                    w = sample(c("Group 1", "Group 2"), 10, replace = TRUE))
head(my_df)
  x  y z       w
1 1 11 3 Group 1
2 2 12 4 Group 1
3 3 13 5 Group 2
4 4 14 6 Group 2
5 5 15 7 Group 2
6 6 16 8 Group 2
Columns subset in R
You can subset a column in R in different ways:
- If you want to subset just one column, you can use single or double square brackets to specify the index or the name (between quotes) of the column.
- Specifying the indices after a comma (leaving the first argument blank selects all rows of the data frame). In this case you can't use double square brackets, but you can use the drop argument.
- To subset multiple columns of a data frame, just indicate the columns inside a vector.
The following block of code shows some examples:
# First column (simplified as vector)
my_df[[1]]
my_df[, 1]                             # Equivalent
my_df[["x"]]                           # Equivalent
my_df[, c(TRUE, FALSE, FALSE, FALSE)]  # Equivalent

# First column (with column and row names)
my_df[1]
my_df[, 1, drop = FALSE]               # Equivalent
my_df["x"]                             # Equivalent
my_df[c(TRUE, FALSE, FALSE, FALSE)]    # Equivalent

# Second and third column
my_df[c(2, 3)]
my_df[, c(2, 3)]                       # Equivalent
my_df[c("y", "z")]                     # Equivalent
my_df[c(FALSE, TRUE, TRUE, FALSE)]     # Equivalent
When you select a single column after a comma, set drop = FALSE to maintain the original structure of the object, instead of using double square brackets.
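The effect of the drop argument is easy to verify by checking the class of each result. This is a small sketch that, for simplicity, assumes a data frame with only the numeric columns of the sample; as_vector and as_df are illustrative names.

```r
my_df <- data.frame(x = 1:10, y = 11:20, z = 3:12)

# Single column after a comma: simplified to a vector by default
as_vector <- my_df[, 1]
class(as_vector)          # "integer"

# drop = FALSE keeps the data frame structure
as_df <- my_df[, 1, drop = FALSE]
class(as_df)              # "data.frame"
dim(as_df)                # 10 1

# Selecting two or more columns is never simplified
class(my_df[, c(1, 2)])   # "data.frame"
```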
Subset dataframe by column name
Subsetting a data frame by column name in R can also be achieved with the dollar sign ($), specifying the name of the column with or without quotes.
# First column
my_df$x
# Second column
my_df$y
my_df$"y" # Equivalent
Subset dataframe by column value
You can also subset a data frame depending on the values of the columns. As an example, you may want to keep all the rows of the data frame where the value of column z is greater than 5, or where the group in column w is Group 1.
# Rows where column z is greater than 5
my_df[my_df$z > 5, ]
# All rows corresponding to Group 1
my_df[my_df$w == "Group 1", ]
You can also apply a conditional subset by column values with the subset function as follows. Note that when using this function you can use the variable names directly.
# All rows corresponding to Group 1
subset(my_df, w == "Group 1") # Equivalent
    x  y  z       w
1   1 11  3 Group 1
2   2 12  4 Group 1
7   7 17  9 Group 1
10 10 20 12 Group 1
# All rows where 'y' is lower than or equal to 14
subset(my_df, y <= 14)
  x  y z       w
1 1 11 3 Group 1
2 2 12 4 Group 1
3 3 13 5 Group 2
4 4 14 6 Group 2
When using the subset function with a data frame you can also specify the columns you want to be returned, indicating them in the select argument.
# Select the columns to return
subset(my_df, x > 3, select = c(x, w))
    x       w
4   4 Group 2
5   5 Group 2
6   6 Group 2
7   7 Group 1
8   8 Group 2
9   9 Group 2
10 10 Group 1
In addition, you can use multiple subset conditions at once. Subsetting with multiple conditions is just as easy as subsetting by one condition. In the following example we select the rows where the value of column x is 1 or 6.
my_df[my_df$x == 1 | my_df$x == 6, ]
  x  y z       w
1 1 11 3 Group 1
6 6 16 8 Group 2
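Conditions can also be combined with the & operator, so that a row is kept only when all of them hold. The following sketch assumes a data frame with just the numeric columns of the sample; the names both, both_subset and only_y are illustrative.

```r
my_df <- data.frame(x = 1:10, y = 11:20, z = 3:12)

# Rows where x is greater than 3 AND z is lower than 10
both <- my_df[my_df$x > 3 & my_df$z < 10, ]
both_subset <- subset(my_df, x > 3 & z < 10)   # Same rows with subset()
both$x                                         # 4 5 6 7

# Rows where x is between 2 and 4 (both included), returning only y
only_y <- subset(my_df, x >= 2 & x <= 4, select = y)
only_y$y                                       # 12 13 14
```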
Subset rows in R
Analogously to column subsetting, you can subset the rows of a data frame by indicating the indices you want as the first argument between square brackets.
# First row
my_df[1, ]
# Fourth and sixth row
my_df[c(4, 6), ]
Subset rows by list of values
In case you want to subset rows based on a vector of values you can use the %in% operator or the is.element function as follows:
values <- c(12, 14)
my_df[my_df$y %in% values, ]
my_df[is.element(my_df$y, values), ] # Equivalent
  x  y z       w
2 2 12 4 Group 1
4 4 14 6 Group 2
To select the rows whose values are not in the vector, that is, to negate %in%, you can create a new operator with '%ni%' <- Negate('%in%') or write my_df[!(my_df$y %in% values), ].
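As a quick check, the sketch below defines the negated operator and confirms that both approaches return the same rows. It assumes a reduced version of the sample data frame with only the x and y columns; not_in is an illustrative name.

```r
my_df <- data.frame(x = 1:10, y = 11:20)
values <- c(12, 14)

# User-defined "not in" operator
`%ni%` <- Negate(`%in%`)

not_in <- my_df[my_df$y %ni% values, ]
nrow(not_in)    # 8

# Same result negating %in% directly
identical(not_in, my_df[!(my_df$y %in% values), ])   # TRUE
```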
Subset by date
Many data frames have a column of dates. In this case, each row represents a date and each column an event registered on those dates. To subset by date, that column must be in date format, which you can obtain with the as.Date function.
dates <- seq(as.Date("2011/1/1"), by = "day", length.out = 10)
my_df_dates <- cbind(dates, my_df)
head(my_df_dates)
       dates x  y z       w
1 2011-01-01 1 11 3 Group 1
2 2011-01-02 2 12 4 Group 1
3 2011-01-03 3 13 5 Group 2
4 2011-01-04 4 14 6 Group 2
5 2011-01-05 5 15 7 Group 2
6 2011-01-06 6 16 8 Group 2
As an example, you can subset the rows corresponding to dates after January 5, 2011 with the following code:
subset(my_df_dates, dates > as.Date("2011-01-05"))
        dates  x  y  z       w
6  2011-01-06  6 16  8 Group 2
7  2011-01-07  7 17  9 Group 1
8  2011-01-08  8 18 10 Group 2
9  2011-01-09  9 19 11 Group 2
10 2011-01-10 10 20 12 Group 1
Subsetting in R by unique date
Note that in case your date column contains the same date several times and you want to select all the rows that correspond to that date, you can use the == logical operator with the subset function as follows:
subset(my_df_dates, dates == as.Date("2011-01-02"))
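The same comparison operators can be combined to select a range of dates. The following sketch assumes a reduced data frame with only the dates and x columns; in_range is an illustrative name and the limits of the range are arbitrary.

```r
dates <- seq(as.Date("2011-01-01"), by = "day", length.out = 10)
my_df_dates <- data.frame(dates = dates, x = 1:10)

# Rows between January 3 and January 6, 2011 (both included)
in_range <- subset(my_df_dates,
                   dates >= as.Date("2011-01-03") &
                     dates <= as.Date("2011-01-06"))
in_range$x   # 3 4 5 6
```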
Subset a matrix in R
Subsetting a matrix in R is very similar to subsetting a data frame. Consider the following sample matrix:
set.seed(45)
my_matrix <- matrix(sample(1:9), ncol = 3)
colnames(my_matrix) <- c("one", "two", "three")
my_matrix
     one two three
[1,]   6   8     1
[2,]   3   7     4
[3,]   2   5     9
You can subset the rows and columns by specifying first the indices of the rows and then those of the columns. You can also use logical vectors.
# Subset matrix with rows and columns index
my_matrix[c(1, 3), c(1, 2)]
# Subset with logical values
my_matrix[c(TRUE, FALSE, TRUE), c(TRUE, TRUE, FALSE)] # Equivalent
# You can also mix both
my_matrix[c(1, 3), c(TRUE, TRUE, FALSE)] # Equivalent
     one two
[1,]   6   8
[2,]   2   5
Note that if you subset the matrix to just one column or row it will be converted to a vector. In order to preserve the matrix class, you can set the drop argument to FALSE.
my_matrix[, 2] # Simplified to a vector
8 7 5
my_matrix[, 2, drop = FALSE]
     two
[1,]   8
[2,]   7
[3,]   5
Subset matrix by column and row names
In case your matrix contains row or column names, you can use them instead of the index to subset the matrix. In the following example we selected the columns named ‘two’ and ‘three’.
my_matrix[, c("two", "three")]
     two three
[1,]   8     1
[2,]   7     4
[3,]   5     9
Subset matrix by column values
As with data frames, you can subset a matrix by the values of its columns. In this case, we make a subset based on a condition over the values of the third column.
my_matrix[my_matrix[, 3] > 2, ]
     one two three
[1,]   3   7     4
[2,]   2   5     9
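Keep in mind that a condition matched by a single row also triggers the simplification to a vector. The sketch below types in the values of the sample matrix by hand, so it does not depend on the random seed; one_row and kept are illustrative names.

```r
my_matrix <- matrix(c(6, 3, 2, 8, 7, 5, 1, 4, 9), ncol = 3)
colnames(my_matrix) <- c("one", "two", "three")

# Only the third row has a value greater than 5 in column "three"
one_row <- my_matrix[my_matrix[, "three"] > 5, ]
is.matrix(one_row)   # FALSE: simplified to a named vector

# drop = FALSE keeps the result as a matrix with one row
kept <- my_matrix[my_matrix[, "three"] > 5, , drop = FALSE]
dim(kept)            # 1 3
```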
Subset time series
Time series are a type of R object with which you can create subsets of data based on time. We will use, for instance, the nottem time series.
nottem
class(nottem) # "ts"
The window function allows you to create subsets of time series, as shown in the following example:
# Data from 1930 (included)
window(nottem, start = c(1930))
# Data from April 1930 (included)
window(nottem, start = c(1930, 4))
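The window function also accepts an end argument, so a subset can be bounded on both sides. This is a minimal sketch using the same nottem series, which is monthly and covers 1920 to 1939; the names spring_summer and last_years are illustrative.

```r
# Data from April 1930 to September 1930 (both included)
spring_summer <- window(nottem, start = c(1930, 4), end = c(1930, 9))
length(spring_summer)   # 6
start(spring_summer)    # 1930 4
end(spring_summer)      # 1930 9

# Data from January 1935 to the end of the series
last_years <- window(nottem, start = 1935)
length(last_years)      # 60
```

The result of window is still a ts object, so it keeps its time attributes and can be passed to other time series functions.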