Filter rows in R with dplyr
The filter
function from dplyr subsets rows of a data frame based on a single or multiple conditions. In this tutorial you will learn how to select rows using comparison and logical operators and how to filter by row number with slice
.
Sample data
The examples inside this tutorial will use the women
data set provided by R. This data set contains two numeric columns: height
and weight
.
# install.packages("dplyr")
library(dplyr)
# Sample data frame
df <- as_tibble(women)
df
# A tibble: 15 × 2
height weight
<dbl> <dbl>
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
Filtering rows based on a single condition
The filter
function allows to subset rows of a data frame based on a condition. You can filter the values equal to, not equal to, lower than or greater than a value by specifying the desired condition within the function.
The following table contains the comparison operators in R and their descriptions.
Comparison operator | Description |
---|---|
> | Greater than |
< | Lower than |
>= | Greater or equal than |
<= | Lower or equal than |
== | Equal to |
!= | Not equal to |
For example, if you want to filter the rows where the height
column is greater than 68 you can write the following:
library(dplyr)
df <- as_tibble(women)
# Rows with 'height' greater than 68
df_2 <- df %>%
filter(height > 68)
df_2
height weight
<dbl> <dbl>
1 69 150
2 70 154
3 71 159
4 72 164
The filtering can be based on a function. The following example selects the rows of the data frame where the height is equal or lower to the mean of the column.
library(dplyr)
df <- as_tibble(women)
# Rows with 'height' lower or equal than the mean
df_2 <- df %>%
filter(height <= mean(height))
df_2
# A tibble: 8 × 2
height weight
<dbl> <dbl>
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
It is also possible to filter rows using logical operators or functions that return TRUE
or FALSE
or a combination of them. The most common are shown in the table below.
Operator/function | Description |
---|---|
! | Logical negation ‘NOT’ |
%in% | In the set |
!(x %in% y) | x not in y |
is.na() | Is NA |
!is.na() | Is not NA |
grepl() | Contains a pattern |
!grepl() | Does not contain a pattern |
Consider that you want to filter the rows in which the height
column takes the value 65, 70 and 72. For this you can use the %in%
operator and filter the rows by a vector.
library(dplyr)
df <- as_tibble(women)
# Rows with 'height' equal to 65, 70 and 72
df_2 <- df %>%
filter(height %in% c(65, 70, 72))
df_2
# A tibble: 3 × 2
height weight
<dbl> <dbl>
1 65 135
2 70 154
3 72 164
The opposite of a condition can be selected with the logical negation operator !
. The example below shows how to select the opposite of the filtering made on the previous code.
library(dplyr)
df <- as_tibble(women)
# Rows with 'height' NOT equal to 65, 70 and 72
df_2 <- df %>%
filter(!(height %in% c(65, 70, 72)))
df_2
# A tibble: 12 × 2
height weight
<dbl> <dbl>
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 66 139
9 67 142
10 68 146
11 69 150
12 71 159
To filter rows containing a specific string you can use grepl
or str_detect
. The following example filters the rows containing a specific pattern (e.g. rows of height
containing a 5).
library(dplyr)
df <- as_tibble(women)
# Rows that contain a 5 inside the values of 'height'
df_2 <- df %>%
filter(grepl("5", height))
df_2
# A tibble: 3 × 2
height weight
<dbl> <dbl>
1 58 115
2 59 117
3 65 135
Multiple conditions
Row filtering can also be based on multiple conditions to filter, for instance, rows where a value is in a specific range or to filter between dates. For this you will need to use logical operators, such as &
to specify one AND another condition, and |
to specify one OR another condition.
Logical operator | Description |
---|---|
& | Elementwise logical ‘AND’ |
| | Elementwise logical ‘OR’ |
xor() | Elementwise exclusive ‘OR’. Equivalent to !(x | y) |
The example below selects rows whose values in the height
column are greater than 65 and lower than 68.
library(dplyr)
df <- as_tibble(women)
# Rows with 'height' greater than 65 AND lower than to 68
df_2 <- df %>%
filter(height > 65 & height < 68)
df_2
# A tibble: 2 × 2
height weight
<dbl> <dbl>
1 66 139
2 67 142
The multiple conditions can be based on multiple columns. In the following block of code we are selecting the rows whose values in height
are greater than 65 and whose values in weight
are lower or equal to 150.
library(dplyr)
df <- as_tibble(women)
# Rows with 'height' greater than 65 AND weight lower or equal to 150
df_2 <- df %>%
filter(height > 65 & weight <= 150)
df_2
# A tibble: 4 × 2
height weight
<dbl> <dbl>
1 66 139
2 67 142
3 68 146
4 69 150
In case you need to subset rows based on a condition OR on another you can use |
. The example below filters the rows whose values in height
area greater than 65 or whose values in weight
are greater or equal to 150.
library(dplyr)
df <- as_tibble(women)
# Rows with 'height' greater than 65 OR 'weight' greater or equal to 150
df_2 <- df %>%
filter(height > 65 | weight >= 150)
df_2
# A tibble: 7 × 2
height weight
<dbl> <dbl>
1 66 139
2 67 142
3 68 146
4 69 150
5 70 154
6 71 159
7 72 164
Filter by row number with slice
A similar function related to filter
is slice
, which allows to filter rows based on its index/position. The function takes a sequence or vector of indices (integers) as input, as shown below.
library(dplyr)
df <- as_tibble(women)
# First three rows
df_2 <- df %>%
slice(1:3)
df_2
# A tibble: 3 × 2
height weight
<dbl> <dbl>
1 58 115
2 59 117
3 60 120
In addition, the slice_head
function allows to select the first row of the data frame. This function provides an argument named n
to select the n
first rows.
library(dplyr)
df <- as_tibble(women)
# First row
df_2 <- df %>%
slice_head()
df_2
# A tibble: 1 × 2
height weight
<dbl> <dbl>
1 58 115
Finally, if you need to select the last row you can use slice_tail
. This function also provides an argument named n
to select the last n
rows of the data frame.
library(dplyr)
df <- as_tibble(women)
# Last row
df_2 <- df %>%
slice_tail()
df_2
# A tibble: 1 × 2
height weight
<dbl> <dbl>
1 72 164
slice_sample
selects rows randomly and slice_min
and slice_max
selects the rows with the lowest or highest values of a variable, respectively.