Select columns in R with dplyr

Select columns in R with dplyr

Select or remove columns from a data frame with the select function from dplyr and learn how to use helper functions to select columns such as contains, matches, all_of, any_of, starts_with, ends_with, last_col, where, num_range and everything.

Sample data

In this tutorial we will use as example data the first five rows and the first six columns of the starwars data set from dplyr.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# View data
df
# A tibble: 5 × 6
  name           height  mass hair_color skin_color  eye_color
  <chr>           <int> <dbl> <chr>      <chr>       <chr>    
1 Luke Skywalker    172    77 blond      fair        blue     
2 C-3PO             167    75 NA         gold        yellow   
3 R2-D2              96    32 NA         white, blue red      
4 Darth Vader       202   136 none       white       yellow   
5 Leia Organa       150    49 brown      light       brown 

Select columns

The select function allows to filter columns from a data frame. Columns can be selected either by name or index/position.

By name

Given a dataset you will need to specify the desired columns inside select with or without quotes. Note that you can select one or multiple columns.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select columns named 'name' and 'height'
df_2 <- df %>%
  select(name, height) # Or select(c("name", "height"))

df_2
# A tibble: 6 × 2
  name           height
  <chr>           <int>
1 Luke Skywalker    172
2 C-3PO             167
3 R2-D2              96
4 Darth Vader       202
5 Leia Organa       150

select() function from dplyr in R

You can also select a sequence of columns using :. For instance, if you want to select the columns from height to skin_color, both included, you can type: height:skin_color

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select columns from 'height' to 'skin_color'
df_2 <- df %>%
  select(height:skin_color)

df_2
# A tibble: 5 × 4
  height  mass hair_color skin_color 
   <int> <dbl> <chr>      <chr>      
1    172    77 blond      fair       
2    167    75 NA         gold       
3     96    32 NA         white, blue
4    202   136 none       white      
5    150    49 brown      light 

Select a sequence of columns with dplyr

By index

Columns can also be selected by index. To do so you just have to specify the desired column numbers inside select. The following example selects the first, the fifth and the sixth column.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select first, fifth and sixth column
df_2 <- df %>%
  select(1, 5, 6)

df_2
# A tibble: 5 × 3
  name           skin_color  eye_color
  <chr>          <chr>       <chr>    
1 Luke Skywalker fair        blue     
2 C-3PO          gold        yellow   
3 R2-D2          white, blue red      
4 Darth Vader    white       yellow   
5 Leia Organa    light       brown 

Select columns with dplyr by index or position

In addition, you can create a sequence of numbers to select a range of columns. The example below can be used to select the first n columns.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# First 4 columns
n <- 4

# Select first four columns
df_2 <- df %>%
  select(1:n)

df_2
# A tibble: 5 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 Luke Skywalker    172    77 blond     
2 C-3PO             167    75 NA        
3 R2-D2              96    32 NA        
4 Darth Vader       202   136 none      
5 Leia Organa       150    49 brown   

Select first n columns with dplyr in R

Drop columns

The select function can also be used to filter out certain columns. To do this, it is only necessary to add the symbol - before each column name. In the following example we select all columns except mass.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select all columns but 'mass'
df_2 <- df %>%
  select(-mass)

df_2
# A tibble: 5 × 5
  name           height hair_color skin_color  eye_color
  <chr>           <int> <chr>      <chr>       <chr>    
1 Luke Skywalker    172 blond      fair        blue     
2 C-3PO             167 NA         gold        yellow   
3 R2-D2              96 NA         white, blue red      
4 Darth Vader       202 none       white       yellow   
5 Leia Organa       150 brown      light       brown 

Remove columns from a data frame with the dplyr select function

In case you want to drop several columns you can add - before each column name or before a vector with the names.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Drop 'height', 'mass' and 'hair_color'
df_2 <- df %>%
  select(-height, -mass, -hair_color) # Or -c(height, mass, hair_color)

df_2
# A tibble: 5 × 3
  name           skin_color  eye_color
  <chr>          <chr>       <chr>    
1 Luke Skywalker fair        blue     
2 C-3PO          gold        yellow   
3 R2-D2          white, blue red      
4 Darth Vader    white       yellow   
5 Leia Organa    light       brown 

Drop columns from a data set in R with the select function from dplyr

To remove a range of columns you will have to use the - operator and add the sequence of columns inside parenthesis, as in the example below.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select all columns except the fourth, the fifth and the sixth
df_2 <- df %>%
  select(-(4:6))

df_2
# A tibble: 5 × 3
  name           height  mass
  <chr>           <int> <dbl>
1 Luke Skywalker    172    77
2 C-3PO             167    75
3 R2-D2              96    32
4 Darth Vader       202   136
5 Leia Organa       150    49

Delete columns in R with the select function by using an index

Exclamation operator

Note that in addition to the minus operator (-) there is the exclamation operator (!), which negates the selection criteria. This means that selects all columns and excludes those matching the selection criteria.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select ALL EXCEPT "name"
df_2 <- df %>%
  select(!name)

df_2
# A tibble: 5 × 5
  height  mass hair_color skin_color  eye_color
   <int> <dbl> <chr>      <chr>       <chr>    
1    172    77 blond      fair        blue     
2    167    75 NA         gold        yellow   
3     96    32 NA         white, blue red      
4    202   136 none       white       yellow   
5    150    49 brown      light       brown  

Don’t confuse the minus symbol (-) with the exclamation mark (!). While both operators are used for excluding columns in dplyr’s select function, the - operator explicitly specifies columns to remove, whereas the ! operator selects ALL columns EXCEPT those that match the criteria specified.

Helper functions to select or remove columns

There are several helper functions to select columns based on patterns, such as contains, starts_with, ends_with, matches and num_range, based on a condition with where, based on character vectors such as any_of and all_of etc. The following table describes the most common functions and their usage.

Function Description
contains("text") Selects columns containing the given text (exact match)
all_of(c("col1", "col2")) Selects all columns based on a character vector
any_of(c("col1", "col2")) Selects columns based on a character vector, even if some do not exist
starts_with("prefix") Selects columns starting with the given prefix (exact match)
ends_with("suffix") Selects columns ending with the given suffix (exact match)
last_col() Select the last column
matches("regex") Selects columns that match a regular expression
num_range("prefix", 1:5) Selects columns with a prefix and numeric range in their names
where() Selects columns that meet a given condition, e.g. where(is.numeric) or where(is.character)
group_cols() Select the columns grouped with group_by
everything() Select all columns

contains

The contains function matches all columns containing a string. In the following example we select all the columns that include “color” in their name.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select 'name' and all columns containing 'color'
df_2 <- df %>%
  select(name, contains("color"))

df_2
# A tibble: 5 × 4
  name           hair_color skin_color  eye_color
  <chr>          <chr>      <chr>       <chr>    
1 Luke Skywalker blond      fair        blue     
2 C-3PO          NA         gold        yellow   
3 R2-D2          NA         white, blue red      
4 Darth Vader    none       white       yellow   
5 Leia Organa    brown      light       brown  

contains() function in dplyr to select columns containing a given text

To select columns that does not contain a string use !contains("string") or -contains("string").

all_of and any_of

The all_of and any_of function allow to select columns from character vectors. The difference between these functions is that all_of selects all the columns based on the vector and if any variable is missing an error is thrown, while any_of selects variables based on the vector without checking missing variables.

In the following block of code we are selecting the columns based on a vector named column_names using all_of.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Desired columns
column_names <- c("name", "eye_color")

df_2 <- df %>%
  select(name, all_of(column_names))

df_2
# A tibble: 5 × 2
  name           eye_color
  <chr>          <chr>    
1 Luke Skywalker blue     
2 C-3PO          yellow   
3 R2-D2          red      
4 Darth Vader    yellow   
5 Leia Organa    brown  

Select all columns from a vector in R with the all_of function from dplyr

Now we pass another vector in which there are two column names that do not exist, but the any_of function will select those columns that are in the data frame and ignore the others.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Desired columns
column_names <- c("name", "skin_color", "gender", "abc")

df_2 <- df %>%
  select(name, any_of(column_names))

df_2
# A tibble: 5 × 2
  name           skin_color 
  <chr>          <chr>      
1 Luke Skywalker fair       
2 C-3PO          gold       
3 R2-D2          white, blue
4 Darth Vader    white      
5 Leia Organa    light   

Select columns by a vector in R with the any_of function from dplyr

starts_with

If you want to select columns that start with a specific string you can use the starts_with function. In the following example we select the variable name and all variables that start with the letter h.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Selects 'name' and all columns starting with "h"
df_2 <- df %>%
  select(name, starts_with("h"))

df_2
# A tibble: 5 × 3
  name           height hair_color
  <chr>           <int> <chr>     
1 Luke Skywalker    172 blond     
2 C-3PO             167 NA        
3 R2-D2              96 NA        
4 Darth Vader       202 none      
5 Leia Organa       150 brown   

Select columns starting with a string with stats_with() function from dplyr

ends_with

Similar to the previous function, ends_with searches for the columns that end with a specific string. The example above we select the first variable (name) and all variables that end with "ss" (mass).

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Selects 'name' and all columns ending in "ss"
df_2 <- df %>%
  select(name, ends_with("ss"))

df_2
# A tibble: 5 × 2
  name            mass
  <chr>          <dbl>
1 Luke Skywalker    77
2 C-3PO             75
3 R2-D2             32
4 Darth Vader      136
5 Leia Organa       49 

Select columns ending with a string with ends_with() function from dplyr

last_col

The last_col function selects the last variable of the data frame. Note that this function provides an argument named offset that defaults to 0 to select the nth variable from the end.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Last column
df_2 <- df %>%
  select(name, last_col())

df_2
# A tibble: 5 × 2
  name           eye_color
  <chr>          <chr>    
1 Luke Skywalker blue     
2 C-3PO          yellow   
3 R2-D2          red      
4 Darth Vader    yellow   
5 Leia Organa    brown   

Select the last column of a data frame using the last_col() function from dplyr

Use !last_col() or -last_col() to select columns until last column.

matches

The matches function selects columns based on regular expressions. The following block of code selects all columns containing the letter a or the letter t.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Columns containing the letter "a" or the letter "t"
df_2 <- df %>%
  select(matches("a|t"))

df_2
# A tibble: 5 × 4
  name           height  mass hair_color
  <chr>           <int> <dbl> <chr>     
1 Luke Skywalker    172    77 blond     
2 C-3PO             167    75 NA        
3 R2-D2              96    32 NA        
4 Darth Vader       202   136 none      
5 Leia Organa       150    49 brown    

matches() function from dplyr to select columns based on regular expressions

num_range

If your variables are named with a prefix and a numeric range, such as x1, x2, … or y_1, y_2, … you can use the num_range function to quickly select a range of columns.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# New column names
colnames(df) <- paste0("x", 1:ncol(df))

# Selects x1, x2 and x3
df_2 <- df %>%
  select(num_range("x", 1:3))

df_2
# A tibble: 5 × 3
  x1                x2    x3
  <chr>          <int> <dbl>
1 Luke Skywalker   172    77
2 C-3PO            167    75
3 R2-D2             96    32
4 Darth Vader      202   136
5 Leia Organa      150    49   

The num_range function from dplyr package

where

The where function takes a function that returns TRUE or FALSE as input and selects columns based on the desired condition. For example, you can select or drop all numeric, character or factor variables using where(is.numeric), where(is.character) and where(is.factor), respectively.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

# Select all character variables
df_2 <- df %>%
  select(where(is.character))

df_2
# A tibble: 5 × 4
  name           hair_color skin_color  eye_color
  <chr>          <chr>      <chr>       <chr>    
1 Luke Skywalker blond      fair        blue     
2 C-3PO          NA         gold        yellow   
3 R2-D2          NA         white, blue red      
4 Darth Vader    none       white       yellow   
5 Leia Organa    brown      light       brown    

Select numeric, character and factor columns with the where function from dplyr

group_cols

The group_cols function selects the previously grouped columns with group_by.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

df_2 <- df %>%
  group_by(name, eye_color) %>%
  select(group_cols())

df_2
# A tibble: 5 × 2
# Groups:   name, eye_color [5]
  name           eye_color
  <chr>          <chr>    
1 Luke Skywalker blue     
2 C-3PO          yellow   
3 R2-D2          red      
4 Darth Vader    yellow   
5 Leia Organa    brown 

Select grouped columns in R with the group_cols function from dplyr

everything

The everything function allows to select all columns. This function is specially useful when combined with other dplyr functions.

# install.packages("dplyr")
library(dplyr)

# Sample data
df <- starwars[1:5, 1:6]

df_2 <- df %>%
  select(everything())

df_2
# A tibble: 5 × 6
  name           height  mass hair_color skin_color  eye_color
  <chr>           <int> <dbl> <chr>      <chr>       <chr>    
1 Luke Skywalker    172    77 blond      fair        blue     
2 C-3PO             167    75 NA         gold        yellow   
3 R2-D2              96    32 NA         white, blue red      
4 Darth Vader       202   136 none       white       yellow   
5 Leia Organa       150    49 brown      light       brown  

Select everything with dplyr in R

You can select the intersection or union of two sets of variables with & and |. For instance, select(starts_with("h") & ends_with("r")) selects all columns that starts with the letter hand ends with r.