Filter rows in R with dplyr

Filter rows in R with dplyr

The filter function from dplyr subsets rows of a data frame based on a single or multiple conditions. In this tutorial you will learn how to select rows using comparison and logical operators and how to filter by row number with slice.

Sample data

The examples inside this tutorial will use the women data set provided by R. This data set contains two numeric columns: height and weight.

# install.packages("dplyr")
library(dplyr)

# Sample data frame
df <- as_tibble(women)

df
# A tibble: 15 × 2
   height weight
    <dbl>  <dbl>
 1     58    115
 2     59    117
 3     60    120
 4     61    123
 5     62    126
 6     63    129
 7     64    132
 8     65    135
 9     66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

Filtering rows based on a single condition

The filter function allows to subset rows of a data frame based on a condition. You can filter the values equal to, not equal to, lower than or greater than a value by specifying the desired condition within the function.

The following table contains the comparison operators in R and their descriptions.

Comparison operator Description
> Greater than
< Lower than
>= Greater or equal than
<= Lower or equal than
== Equal to
!= Not equal to

For example, if you want to filter the rows where the height column is greater than 68 you can write the following:

library(dplyr)

df <- as_tibble(women)

# Rows with 'height' greater than 68
df_2 <- df %>%
  filter(height > 68)

df_2
  height weight
   <dbl>  <dbl>
1     69    150
2     70    154
3     71    159
4     72    164

Filter rows with dplyr in R

The filtering can be based on a function. The following example selects the rows of the data frame where the height is equal or lower to the mean of the column.

library(dplyr)

df <- as_tibble(women)

# Rows with 'height' lower or equal than the mean
df_2 <- df %>%
  filter(height <= mean(height))

df_2
# A tibble: 8 × 2
  height weight
   <dbl>  <dbl>
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
6     63    129
7     64    132
8     65    135

Filter rows with dplyr using the mean function

It is also possible to filter rows using logical operators or functions that return TRUE or FALSE or a combination of them. The most common are shown in the table below.

Operator/function Description
! Logical negation ‘NOT’
%in% In the set
!(x %in% y) x not in y
is.na() Is NA
!is.na() Is not NA
grepl() Contains a pattern
!grepl() Does not contain a pattern

Consider that you want to filter the rows in which the height column takes the value 65, 70 and 72. For this you can use the %in% operator and filter the rows by a vector.

library(dplyr)

df <- as_tibble(women)

# Rows with 'height' equal to 65, 70 and 72
df_2 <- df %>%
  filter(height %in% c(65, 70, 72))

df_2
# A tibble: 3 × 2
  height weight
   <dbl>  <dbl>
1     65    135
2     70    154
3     72    164

Select rows with dplyr in R

The opposite of a condition can be selected with the logical negation operator !. The example below shows how to select the opposite of the filtering made on the previous code.

library(dplyr)

df <- as_tibble(women)

# Rows with 'height' NOT equal to 65, 70 and 72
df_2 <- df %>%
  filter(!(height %in% c(65, 70, 72)))

df_2
# A tibble: 12 × 2
   height weight
    <dbl>  <dbl>
 1     58    115
 2     59    117
 3     60    120
 4     61    123
 5     62    126
 6     63    129
 7     64    132
 8     66    139
 9     67    142
10     68    146
11     69    150
12     71    159

Filter rows not in a vector with dplyr

To filter rows containing a specific string you can use grepl or str_detect. The following example filters the rows containing a specific pattern (e.g. rows of height containing a 5).

library(dplyr)

df <- as_tibble(women)

# Rows that contain a 5 inside the values of 'height'
df_2 <- df %>%
  filter(grepl("5", height))

df_2
# A tibble: 3 × 2
  height weight
   <dbl>  <dbl>
1     58    115
2     59    117
3     65    135

Subset rows that contain a string with the filter function from dplyr in R

Multiple conditions

Row filtering can also be based on multiple conditions to filter, for instance, rows where a value is in a specific range or to filter between dates. For this you will need to use logical operators, such as & to specify one AND another condition, and | to specify one OR another condition.

Logical operator Description
& Elementwise logical ‘AND’
| Elementwise logical ‘OR’
xor() Elementwise exclusive ‘OR’. Equivalent to !(x | y)

The example below selects rows whose values in the height column are greater than 65 and lower than 68.

library(dplyr)

df <- as_tibble(women)

# Rows with 'height' greater than 65 AND lower than to 68
df_2 <- df %>%
  filter(height > 65 & height < 68)

df_2
# A tibble: 2 × 2
  height weight
   <dbl>  <dbl>
1     66    139
2     67    142

Filter rows by multiple conditions with dplyr

The multiple conditions can be based on multiple columns. In the following block of code we are selecting the rows whose values in height are greater than 65 and whose values in weight are lower or equal to 150.

library(dplyr)

df <- as_tibble(women)

# Rows with 'height' greater than 65 AND weight lower or equal to 150
df_2 <- df %>%
  filter(height > 65 & weight <= 150)

df_2
# A tibble: 4 × 2
  height weight
   <dbl>  <dbl>
1     66    139
2     67    142
3     68    146
4     69    150

Filter rows using dplyr based on several conditions

In case you need to subset rows based on a condition OR on another you can use |. The example below filters the rows whose values in height area greater than 65 or whose values in weight are greater or equal to 150.

library(dplyr)

df <- as_tibble(women)

# Rows with 'height' greater than 65 OR 'weight' greater or equal to 150
df_2 <- df %>%
  filter(height > 65 | weight >= 150)

df_2
# A tibble: 7 × 2
  height weight
   <dbl>  <dbl>
1     66    139
2     67    142
3     68    146
4     69    150
5     70    154
6     71    159
7     72    164

Subset rows based on a condition or on another with the filter function from dplyr package

Filter by row number with slice

A similar function related to filter is slice, which allows to filter rows based on its index/position. The function takes a sequence or vector of indices (integers) as input, as shown below.

library(dplyr)

df <- as_tibble(women)

# First three rows
df_2 <- df %>%
  slice(1:3)

df_2
# A tibble: 3 × 2
  height weight
   <dbl>  <dbl>
1     58    115
2     59    117
3     60    120

First rows of a data frame with the slice function from dplyr

In addition, the slice_head function allows to select the first row of the data frame. This function provides an argument named n to select the n first rows.

library(dplyr)

df <- as_tibble(women)

# First row
df_2 <- df %>%
  slice_head()

df_2
# A tibble: 1 × 2
  height weight
   <dbl>  <dbl>
1     58    115

First row of a data frame with the slice_head function from dplyr

Finally, if you need to select the last row you can use slice_tail. This function also provides an argument named n to select the last n rows of the data frame.

library(dplyr)

df <- as_tibble(women)

# Last row
df_2 <- df %>%
  slice_tail()

df_2
# A tibble: 1 × 2
  height weight
   <dbl>  <dbl>
1     72    164

Last row of a data frame with the slice_tail function from dplyr

slice_sample selects rows randomly and slice_min and slice_max selects the rows with the lowest or highest values of a variable, respectively.