Pattern matching and replacement in R with gsub() and sub()

Pattern matching and replacement can be achieved in R with the gsub and sub functions. In this tutorial you will learn the differences between these functions and how to remove or replace the patterns through examples explaining the most common use cases.

Syntax and differences

The gsub function replaces all matches of a pattern while sub replaces the first match of a pattern. The basic syntax of the functions is the same, as you will need to input the desired pattern for matching and the desired replacement string.

# Replace ALL matches of pattern
gsub(pattern, replacement, x)

# Replace THE FIRST match of pattern
sub(pattern, replacement, x)

In addition, these functions provide other arguments named ignore.case, perl, fixed and useBytes. Remember to type ?gsub or ?sub to read the official documentation of these functions.
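
To make the difference concrete, the following minimal sketch applies both functions to the same string. The input string and the variable names are illustrative only.

```r
# Same input and same pattern: only the number of replacements differs
x <- "one fish, two fish, red fish"

first_only  <- sub("fish", "cat", x)   # replaces the first match only
all_matches <- gsub("fish", "cat", x)  # replaces every match

print(first_only)
print(all_matches)
```

Both calls return a new string and leave x untouched, so the result has to be assigned if you want to keep it.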

The gsub function

Suppose you have a string and you want to replace every occurrence of the word “blue” with the word “red”. For this purpose you can use the gsub function as in the example below.

x <- "The jacket was blue"

gsub(pattern = "blue", replacement = "red", x = x)
"The jacket was red"

As mentioned before, if there are several matches, all will be replaced.

x <- "The jacket was blue and the shirt was also blue"

gsub("blue", "red", x)
"The jacket was red and the shirt was also red"

You can also input a character vector to replace the possible matches inside each element of the vector.

x <- c("red", "blue", "yellow", "blue")

gsub("blue", "green", x)
"red"  "green"  "yellow" "green" 

Notice that by default the pattern matching is case-sensitive, so if you want to ignore case set ignore.case = TRUE.

x <- "The jacket was blue"

gsub("JACKET", "shirt", x, ignore.case = TRUE)
"The shirt was blue"

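As a small illustration of case-insensitive matching on a vector, the sketch below compares the default behaviour with ignore.case = TRUE. The vector contents are made up for the example.

```r
# Mixed-case inputs: the default match is case-sensitive
x <- c("Blue", "BLUE", "blue")

case_sensitive <- gsub("blue", "red", x)                      # only "blue" matches
any_case       <- gsub("blue", "red", x, ignore.case = TRUE)  # all three match

print(case_sensitive)
print(any_case)
```

Note that the replacement is inserted as written, so the original capitalisation is not preserved.
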
The possibilities of this function are huge, so in the following subsections we are going to review some of the most common use cases.

Replace quotation marks

You can replace double quotation marks with single ones, or remove them. To match single quotes wrap them in double quotes ("'"), and to match double quotation marks wrap them in single quotes ('"').

x <- 'I said: "ok"'

# Replace double with single quotes
gsub('"', "'", x)

# Replace single with double quotes
# gsub("'", '"', x)
"I said: 'ok'"

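Removing quotation marks works the same way, with an empty string as replacement. The sketch below is one possible approach; the character class ["'] used to match both kinds of quotes at once is an addition to the patterns shown above.

```r
# A string containing both double and single quotation marks
x <- 'I said: "ok" and she said: \'fine\''

no_double <- gsub('"', "", x)       # remove double quotes only
no_quotes <- gsub("[\"']", "", x)   # remove both kinds with a character class

print(no_double)
print(no_quotes)
```
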
Replace spaces

It is possible to replace or remove spaces by passing " " as pattern.

x <- 'Name John'

gsub(" ", ": ", x)
"Name: John"

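The same pattern can be used to remove spaces, and small variations of it cover other common clean-up tasks. The following sketch, with an invented input string, removes all spaces, collapses runs of spaces and trims the ends.

```r
# Two leading spaces, three in the middle, two trailing
x <- "  Name   John  "

no_spaces <- gsub(" ", "", x)               # remove every space
collapsed <- gsub(" +", " ", x)             # collapse runs of spaces into one
trimmed   <- gsub("^ +| +$", "", collapsed) # drop leading and trailing spaces

print(no_spaces)
print(collapsed)
print(trimmed)
```

For trimming alone, base R also provides trimws(), which may be more readable than a hand-written pattern.
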
Replace backslash

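A backslash cannot be passed to pattern directly, because it is an escape character both in R strings and in regular expressions. To match one literal backslash you will need to use "\\\\": R reads this string as two backslashes, which the regular expression engine then interprets as a single literal backslash. In the following example we are replacing all backslashes with slashes.

x <- "E:\\Documents\\"

gsub("\\\\", "/", x)
"E:/Documents/"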
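
As an alternative sketch, setting fixed = TRUE removes the regular expression layer, so only the R string escaping remains and "\\" is enough to match one backslash. Both calls below should give the same result.

```r
# The string holds E:\Documents\ (each "\\" in the source is one backslash)
x <- "E:\\Documents\\"

with_regex <- gsub("\\\\", "/", x)              # regex: four backslashes
with_fixed <- gsub("\\", "/", x, fixed = TRUE)  # literal: two backslashes

cat(x, "\n")  # cat() shows the string as stored, with single backslashes
print(with_regex)
print(with_fixed)
```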

Replace dots

In a regular expression a dot matches any single character, so passing "." as pattern will match every character of the string. To replace literal dots you will need to escape the dot with two backslashes or set fixed = TRUE, as shown in the example below.

x <- "This is a text with a dot."

gsub("\\.", "", x)

# Equivalent to:
# gsub(".", "", x, fixed = TRUE)
"This is a text with a dot"

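To see why the escape matters, the sketch below runs the unescaped and escaped patterns on the same made-up input. The "[.]" character class is a further alternative not covered above.

```r
x <- "a.b.c"

unescaped <- gsub(".", "-", x)                # dot matches ANY character
escaped   <- gsub("\\.", "-", x)              # escaped dot matches a literal dot
literal   <- gsub(".", "-", x, fixed = TRUE)  # pattern treated as plain text
in_class  <- gsub("[.]", "-", x)              # a dot inside [] is also literal

print(unescaped)
print(escaped)
```
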
Remove brackets

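To remove round brackets (parentheses) you will need to input "\\(|\\)" as pattern, since both characters have a special meaning in regular expressions and must be escaped. The pattern to match square brackets is "\\[|\\]".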

x <- "(Text inside brackets)"

gsub("\\(|\\)", "", x)

# For square brackets:
# gsub("\\[|\\]", "", x)
"Text inside brackets"

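The following sketch shows a few variations on the same idea, under the assumption that the brackets are not nested: a character class for round brackets, all four bracket characters in one pattern, and removing the brackets together with their contents. The input strings are invented for the example.

```r
x <- "(Text) [inside] (brackets)"

# Round brackets only, using a character class instead of alternation
no_round <- gsub("[()]", "", x)

# Round and square brackets in a single pattern
no_brackets <- gsub("\\(|\\)|\\[|\\]", "", x)

# Remove round brackets AND what they contain (no nesting assumed),
# together with any spaces right before the opening bracket
y <- "Keep this (drop this) and this"
without_contents <- gsub(" *\\([^)]*\\)", "", y)

print(no_round)
print(no_brackets)
print(without_contents)
```
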
Remove numbers

By default, the pattern passed to gsub is interpreted as a regular expression. The following regular expression ("[0-9]+") matches any run of one or more consecutive digits, so it can be used to remove numbers.

x <- "12Text52"

gsub("[0-9]+", "", x)
"Text"

Note that if you set fixed = TRUE the function will not interpret the pattern as a regular expression and will match it as a literal string instead.

x <- "[0-9]+12Text52"

gsub("[0-9]+", "", x, fixed = TRUE)
"12Text52"
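
The effect of the + quantifier is easier to see when the replacement is not empty. In the sketch below, "#" is an arbitrary placeholder, and "[[:digit:]]" is the POSIX character class equivalent to "[0-9]" for this input.

```r
x <- "12Text52"

per_run   <- gsub("[0-9]+", "#", x)  # one replacement per run of digits
per_digit <- gsub("[0-9]", "#", x)   # one replacement per single digit

# POSIX class as an alternative way of writing the digit pattern
posix_class <- gsub("[[:digit:]]+", "", x)

print(per_run)
print(per_digit)
print(posix_class)
```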

Replace multiple patterns

It is possible to replace multiple patterns with the gsub function if you separate each pattern with | without spaces. In the following example the letter a or the letter b will be replaced by the letter c.

gsub("a|b", "c", "ab")
"cc"
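
The | operator maps every pattern to the same replacement. If each pattern needs its own replacement, one possible approach is to call gsub once per pattern. The sketch below assumes a named character vector, here called replacements (a hypothetical name), whose names are the patterns and whose values are the replacements.

```r
x <- "The cat chased the dog"

# Hypothetical lookup: names are the patterns, values the replacements
replacements <- c(cat = "lion", dog = "wolf")

result <- x
for (pattern in names(replacements)) {
  # fixed = TRUE treats each name as literal text, not as a regex
  result <- gsub(pattern, replacements[[pattern]], result, fixed = TRUE)
}

print(result)
```

Since the replacements are applied one after another, a later pattern can also match text inserted by an earlier replacement, so the order of the vector may matter.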

Remove or replace everything after

Sometimes you need to remove or replace everything after a string. With the pattern [x].*, where [x] stands for the string of your choice (not a regular expression character class), you will match [x] and everything after it.

x <- "Remove text after here: text to remove"

gsub(" here.*", "", x)
"Remove text after"

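If the marker itself should be kept, one option is to capture it in parentheses and refer to it as "\\1" in the replacement. The sketch below also shows that .* runs to the end of the string, so everything after the first occurrence of the marker is removed. The second input string is invented for the example.

```r
x <- "Remove text after here: text to remove"

# Capture the marker and put it back with the backreference \\1
keep_marker <- sub("(here).*", "\\1", x)

# Everything from the FIRST colon onwards is matched and removed
y <- "key: value: more"
before_first_colon <- sub(":.*", "", y)

print(keep_marker)
print(before_first_colon)
```
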
Remove or replace everything before

The opposite of the previous case is to remove or replace everything before a match. For this you can use .*[x] to match [x] and everything before it. Since .* is greedy, if [x] appears several times the match extends up to its last occurrence.

x <- "Remove text before here: text to keep"

gsub(".*here: ", "", x)
"text to keep"

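When the marker appears more than once, the greedy .* removes everything up to its last occurrence. The sketch below, with a made-up input, shows two ways to stop at the first occurrence instead: a negated character class, or a lazy quantifier with perl = TRUE.

```r
x <- "first: second: third"

up_to_last  <- sub(".*: ", "", x)                 # greedy: stops at the last ": "
up_to_first <- sub("^[^:]*: ", "", x)             # no colon allowed before the marker
lazy        <- sub(".*?: ", "", x, perl = TRUE)   # lazy quantifier, Perl-style regex

print(up_to_last)
print(up_to_first)
print(lazy)
```
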
The sub function

As stated before, the sub function is equivalent to gsub but will only match and replace the first occurrence of the input pattern. Consider the following example where the pattern appears twice but only the first occurrence is replaced.

x <- "red green blue red"

sub("red", "yellow", x)
"yellow green blue red"

The same happens when using regular expressions. In the example below only the first match is removed.

x <- "12Text52"

sub("[0-9]+", "", x)
"Text52"
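
With a character vector, "first match" means the first match within each element, not within the vector as a whole. The following sketch, using an invented vector, compares sub and gsub element by element.

```r
x <- c("banana", "papaya", "kiwi")

first_a <- sub("a", "A", x)   # first "a" of EACH element
all_a   <- gsub("a", "A", x)  # every "a" of each element

print(first_a)
print(all_a)
```

Elements without a match, such as "kiwi" here, are returned unchanged by both functions.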