Pattern matching in R with grepl() and grep()

The grepl and grep functions allow you to search for pattern matches inside a character vector. In this tutorial you will learn how they differ and how to use them in several common cases.

Syntax and differences

Both grepl and grep search for matches of a pattern inside a character vector. The difference is in what they return: grepl returns TRUE or FALSE for each element of the vector, depending on whether that element matches, while grep returns the indices of the elements that match the specified pattern.

The basic syntax of these functions is the same:

# Returns TRUE or FALSE for each element of x, depending on whether it matches
grepl(pattern, x)

# Returns the indices of the matching elements of x
grep(pattern, x)

Both functions also accept the arguments ignore.case, perl, fixed and useBytes. In addition, grep provides two arguments named value and invert. Remember that you can type ?grepl or ?grep for additional information.
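To make the difference in return values concrete, the following minimal sketch runs both functions on the same vector. The vector beach and the variable names are made up for illustration. The value and invert arguments shown here belong to grep only.

```r
# Hypothetical example vector
beach <- c("sand", "sea", "turtle")

# grepl(): one logical value per element
is_match <- grepl("s", beach)
is_match   # TRUE TRUE FALSE

# grep(): positions of the matching elements
idx <- grep("s", beach)
idx        # 1 2

# grep() with value = TRUE: the matching elements themselves
matched <- grep("s", beach, value = TRUE)
matched    # "sand" "sea"

# grep() with invert = TRUE: positions of the elements that do NOT match
not_idx <- grep("s", beach, invert = TRUE)
not_idx    # 3
```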

The grepl function

Suppose you want to check whether a string contains the letter a. To do so, pass "a" as pattern and the string or vector to search as x (the word "sand" in this case):

grepl(pattern = "a", x = "sand")

The previous example returns TRUE because there is an "a" inside "sand". Note that you can also search for longer substrings:

grepl("nd", "sand")

The function is case sensitive by default, so if the pattern is in upper case and the string is in lower case, or vice versa, the function will return FALSE.

grepl("S", "sand")

If you want to ignore case you will need to set ignore.case = TRUE.

grepl("S", "sand", ignore.case = TRUE)

Note that the function is vectorized over x, so if you pass a vector to x the function will return a logical vector of the same length as x indicating whether there was a match in each element.

grepl("a", c("sand", "sea", "turtle"))
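Because the result has one logical value per element, it can be used directly for subsetting or counting. The sketch below illustrates this with the same vector; the names words and has_a are chosen here for illustration.

```r
words <- c("sand", "sea", "turtle")

# Logical vector: TRUE where the element contains "a"
has_a <- grepl("a", words)

# Keep only the matching elements
kept <- words[has_a]
kept            # "sand" "sea"

# Count how many elements match (TRUE counts as 1)
n_match <- sum(has_a)
n_match         # 2
```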

Using regular expressions

By default the pattern is interpreted as a regular expression, which allows more flexible matching. The following example flags the elements of the vector that contain at least one digit. If you want the pattern to be matched literally, ‘as is’, set fixed = TRUE.

grepl("[0-9]", c("one", "2", "three3"))
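The effect of fixed = TRUE is easiest to see with a pattern that has a special meaning in regular expressions. In the sketch below (the vector codes is a made-up example) the dot matches any character as a regular expression, but only a literal dot when fixed = TRUE.

```r
# Hypothetical example vector
codes <- c("a.b", "ab")

# As a regular expression, "." matches any single character
as_regex <- grepl(".", codes)
as_regex     # TRUE TRUE

# With fixed = TRUE, "." only matches a literal dot
as_literal <- grepl(".", codes, fixed = TRUE)
as_literal   # TRUE FALSE
```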

Match multiple patterns

If you want to match multiple patterns you can separate them with | without spaces (a space would become part of the pattern). The example below will return TRUE if the corresponding element contains the letter "d" or the letter "t".

# Match "d" or "t"
grepl("d|t", c("sand", "sea", "turtle"))
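When every alternative is a single character, a character class gives the same result as the | operator. This is only a small sketch of that equivalence, with variable names chosen for illustration.

```r
x <- c("sand", "sea", "turtle")

# Alternation: "d" or "t"
with_alternation <- grepl("d|t", x)

# Character class: any one of the characters inside the brackets
with_class <- grepl("[dt]", x)

with_alternation   # TRUE FALSE TRUE
identical(with_alternation, with_class)   # TRUE
```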

You can also build the pattern programmatically by joining the individual patterns with the paste() or paste0() functions and collapse = "|":

# Desired matching patterns
patterns <- c("and", "ea", "w")
patterns <- paste0(patterns, collapse = "|") # "and|ea|w"

# Search for matches
grepl(patterns, c("sand", "sea", "turtle"))
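As a cross-check, the joined pattern should give the same result as running grepl once per pattern and combining the results element-wise with OR. The sketch below assumes that comparison is useful for understanding what the | operator does; the names parts, combined and per_pattern are illustrative.

```r
x <- c("sand", "sea", "turtle")
parts <- c("and", "ea", "w")

# Single call with the patterns joined by "|"
combined <- grepl(paste(parts, collapse = "|"), x)

# One grepl() call per pattern, combined element-wise with OR
per_pattern <- Reduce(`|`, lapply(parts, function(p) grepl(p, x)))

combined                          # TRUE TRUE FALSE
identical(combined, per_pattern)  # TRUE
```

The per-pattern form also works with fixed = TRUE when the patterns are literal strings, whereas in the joined form fixed = TRUE would make | a literal character.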

The grep function

If you want to know the indices of the elements of x that match the pattern you can use grep. In the example below we look for the positions of the elements that contain the string "The".

grep(pattern = "The", x = c("The sand", "water", "The crab"))
1 3

# Equivalent to:
# which(grepl(pattern = "The", x = c("The sand", "water", "The crab")))

The output is 1 and 3 as the string "The" is inside the first and the third element of the character vector.

If there are no matches, the function will return integer(0), as in the example below.

grep(pattern = "x", x = c("The sand", "water", "The crab"))
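An empty result deserves some care when the indices are used to remove elements. The sketch below shows that negative indexing with integer(0) drops everything, while negating a grepl result keeps the vector intact. Variable names are chosen for illustration.

```r
x <- c("The sand", "water", "The crab")

# No element contains "x", so the result is empty
idx <- grep("x", x)
no_match <- length(idx) == 0
no_match             # TRUE

# Pitfall: -integer(0) is still integer(0), so nothing is selected
dropped_wrong <- x[-idx]
length(dropped_wrong)   # 0

# Safer: a logical mask keeps every non-matching element
dropped_ok <- x[!grepl("x", x)]
length(dropped_ok)      # 3
```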

The function accepts the same patterns as grepl, so you can also search for multiple patterns by separating them with |.

grep(pattern = "The|crab", x = c("The sand", "water", "The crab"))
1 3

If you need to ignore case you can also set ignore.case = TRUE. Type ?grep to get more information about the function and its arguments.

grep(pattern = "b", x = c("A", "B", "C"), ignore.case = TRUE)

Finally, if you set value = TRUE the function will return the matching elements of x instead of their indices.

grep(pattern = "r", x = c("airplane", "boat", "car"), value = TRUE)
"airplane" "car"

R provides other functions for pattern matching. If you want to know the start position and the length of the first match in each element you can use regexpr, and if you want to know the position of every match you can use gregexpr. See also the regexec and gregexec functions, which additionally report the positions of parenthesized sub-expressions.
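As a brief illustration of these functions, the sketch below locates the pattern "an" inside the made-up string "banana". It also uses regmatches(), a base R helper not covered above, to extract the matched text.

```r
s <- "banana"

# regexpr(): start position and length of the first match
first <- regexpr("an", s)
first_start <- as.integer(first)
first_len <- attr(first, "match.length")
first_start   # 2
first_len     # 2

# gregexpr(): start positions of every match (a list, one entry per element)
all_starts <- as.integer(gregexpr("an", s)[[1]])
all_starts    # 2 4

# regmatches(): the matched substrings themselves
found <- regmatches(s, gregexpr("an", s))[[1]]
found         # "an" "an"
```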