Split strings in R with strsplit()

The strsplit() function in R

The strsplit function splits each string of a character vector into substrings at every match of a given separator. In this tutorial you will learn how to use this function in several use cases.

Syntax of strsplit

The strsplit function takes a string or character vector and a delimiter or separator as input. The basic syntax of the function is the following:

# x: character vector
# split: delimiter used for splitting
# fixed: if TRUE, matches 'split' as is. If FALSE (default) 'split' is considered a regular expression

strsplit(x, split, fixed = FALSE)

The output is a list with the same length as x, and each element of the list contains the substrings obtained by splitting the corresponding string.
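
As a quick check of the shape of the result, here is a minimal sketch. The input vector x is made up for illustration, and lengths() is the base R function that counts the elements in each list entry.

```r
# Hypothetical input vector used only for illustration
x <- c("a,b,c", "d,e")

pieces <- strsplit(x, split = ",")

# The result is a list with one element per input string
is.list(pieces)
length(pieces) == length(x)

# Number of substrings obtained from each input string
lengths(pieces)
```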

Splitting a string by a delimiter

The strsplit function splits strings into substrings based on a delimiter. For instance, you can split a string on spaces by passing it as input together with a single space (" ") as the delimiter.

strsplit("This is a string", split = " ")
[[1]]
[1] "This"   "is"     "a"      "string"

Any character or string can be used as separator and the function will use it to split the input data into substrings.

strsplit("This&is a string", split = "&")
[[1]]
[1] "This"        "is a string"

Notice that the output is a list, so to obtain a character vector you need to either extract the corresponding element with [[1]] or flatten the list with the unlist function.

strsplit("This is a string", split = " ")[[1]]

# Equivalent to:
unlist(strsplit("This is a string", split = " "))
[1] "This"   "is"     "a"      "string"
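
The two approaches give the same result only when the input is a single string. The sketch below, which uses a made-up two-element vector, shows how they differ otherwise: [[1]] returns the pieces of the first string only, while unlist() merges the pieces of all strings.

```r
# Hypothetical two-element input
x <- c("one two", "three four five")
parts <- strsplit(x, split = " ")

# [[1]] returns only the pieces of the first string
first_only <- parts[[1]]

# unlist() flattens the pieces of every string into a single vector
all_pieces <- unlist(parts)

first_only
all_pieces
```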

Now you can access each substring by index. In the following examples we access the first, second and last elements of the split string.

# Get the first element
strsplit("This is a string", split = " ")[[1]][1]

# Get the second element
strsplit("This is a string", split = " ")[[1]][2]

# Get the last element
splitted_string <- strsplit("This is a string", split = " ")[[1]]
splitted_string[length(splitted_string)]
[1] "This"  
[1] "is" 
[1] "string"

The input of the function can also be a character vector. In this scenario the output will be a list with as many elements as the length of the input, and each element will contain the substrings of the corresponding string.

strsplit(c("This is a string", "This is other string"), split = " ")
[[1]]
[1] "This"   "is"     "a"      "string"

[[2]]
[1] "This"   "is"     "other"  "string"
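
When the input is a vector, you may want the same piece from every string. One possible approach, sketched here with base R's vapply (the helper functions are illustrative, not part of strsplit), is to index into each list element:

```r
x <- c("This is a string", "This is other string")
parts <- strsplit(x, split = " ")

# Third substring of each input string
third <- vapply(parts, function(p) p[3], character(1))

# Last substring of each input string
last <- vapply(parts, function(p) p[length(p)], character(1))

third
last
```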

Multiple delimiters

The strsplit function can take a vector of separators when the length of x is greater than one, so that each string is split by its own delimiter. In the example below we set a space as the delimiter for the first string and a slash as the delimiter for the second. Note that if the input character vector is longer than the vector of delimiters, the delimiters will be recycled along x.

strsplit(c("This is a string", "This is/other string"), split = c(" ", "/"))
[[1]]
[1] "This"   "is"     "a"      "string"

[[2]]
[1] "This is"      "other string"
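
To illustrate the recycling rule, here is a small sketch with three made-up strings and only two delimiters. The first delimiter is reused for the third string.

```r
# Three strings but only two delimiters: " " is recycled for the third string
x <- c("a b", "c/d", "e f")
res <- strsplit(x, split = c(" ", "/"))
res
```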
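
To split a single string on several different delimiters, join them with the regex alternation operator (|):

strsplit("String-with/different&separators", split = "-|/|&")
[[1]]
[1] "String"     "with"       "different"  "separators"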

Splitting a date

A common use case of strsplit is splitting a data frame column of dates into three new columns holding the year, month and day. To do this, split the dates on "-" (or whatever separator they use), bind the resulting pieces by row into a matrix with do.call and rbind, and then attach that matrix to the original data frame with cbind. Note that the new columns contain character strings, not numbers.

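# Sample data frame with six consecutive dates starting today
# (seq() keeps the Date class, so no 'origin' is needed in any R version)
df <- data.frame(date = seq(Sys.Date(), by = "day", length.out = 6))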

# Split the dates with "-"
splitted_dates <- strsplit(as.character(df$date), split = "-")

# Bind the split dates by row and add them to the data frame
df <- cbind(df, do.call(rbind, splitted_dates))

# Change column names
colnames(df) <- c("date", "year", "month", "day")
df
        date year month day
1 2023-11-19 2023    11  19
2 2023-11-20 2023    11  20
3 2023-11-21 2023    11  21
4 2023-11-22 2023    11  22
5 2023-11-23 2023    11  23
6 2023-11-24 2023    11  24
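
Because strsplit returns character strings, the year, month and day columns are character columns. If you need numbers, one option is to convert each piece with as.integer. The sketch below uses two fixed dates taken from the output above, and the object name parts is only illustrative.

```r
# Small data frame with two fixed dates
df <- data.frame(date = as.Date(c("2023-11-19", "2023-11-20")))

# Character matrix with one row per date and columns year, month, day
parts <- do.call(rbind, strsplit(as.character(df$date), split = "-"))

# Convert each column of pieces to integer before adding it
df$year  <- as.integer(parts[, 1])
df$month <- as.integer(parts[, 2])
df$day   <- as.integer(parts[, 3])

str(df)
```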

Using regular expressions (regex) to split character vectors

The split argument of the function can take regular expressions as input. For example, to use any single digit as the delimiter you can set split to "[0-9]".

strsplit("A1B2C3D4", split = "[0-9]")
[[1]]
[1] "A" "B" "C" "D"
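
Note that "[0-9]" matches one digit at a time, so consecutive digits produce empty strings in the output. A small sketch with a made-up input shows the difference when the quantifier + is added to treat a run of digits as one delimiter:

```r
x <- "A12B3C"

# Each digit is a separate delimiter, so "12" leaves an empty string
single <- strsplit(x, split = "[0-9]")[[1]]
single   # "A" ""  "B" "C"

# A run of one or more digits is treated as a single delimiter
runs <- strsplit(x, split = "[0-9]+")[[1]]
runs     # "A" "B" "C"
```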

By default split is treated as a regular expression, in which a period matches any character. To split a string on literal periods you can either set fixed = TRUE, so that the delimiter is matched as is, or escape the period with "\\.".

strsplit("String.with.periods", split = ".", fixed = TRUE)

# Equivalent to:
# strsplit("String.with.periods", split = "\\.")
[[1]]
[1] "String"  "with"    "periods"
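
To see why this matters, the following sketch runs the same split without fixed = TRUE and without escaping. Since an unescaped period matches every character, each character is consumed as a delimiter and only empty strings remain.

```r
x <- "String.with.periods"

# "." is interpreted as a regex that matches any character
res <- strsplit(x, split = ".")[[1]]

length(res) == nchar(x)  # one empty piece per character
all(res == "")
```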

If you want to split the string but keep the delimiter, you can use a lookbehind pattern of the form "(?<=[DELIMITERS])" and set perl = TRUE. For instance, to split on "-" while keeping it attached to the preceding substring, you can type the following:

strsplit("a-b-c", split = "(?<=[-])", perl = TRUE)
[[1]]
[1] "a-" "b-" "c"
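
The same pattern can hold more than one delimiter inside the brackets. As a minimal sketch with a made-up input, the following keeps both "-" and "/" at the end of the piece that precedes them:

```r
# Lookbehind with two delimiters in the character class
kept <- strsplit("a-b/c", split = "(?<=[-/])", perl = TRUE)[[1]]
kept  # "a-" "b/" "c"
```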

The opposite of strsplit in R is the paste function with its collapse argument. Unlist the output and paste the pieces back together using the same delimiter that was used for splitting.

# Split string by white spaces
splitted <- strsplit("A B C", split = " ")

# [[1]]
# [1] "A" "B" "C"

# Unsplit
paste(unlist(splitted), collapse = " ")
# [1] "A B C"
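
When the input has several strings, unlist would merge all the pieces into one vector, so the strings have to be rejoined one list element at a time. One possible way to do this, sketched with vapply and a made-up input, is:

```r
x <- c("A B C", "D E")
parts <- strsplit(x, split = " ")

# Paste the pieces of each list element back together separately
rejoined <- vapply(parts, paste, character(1), collapse = " ")

identical(rejoined, x)
```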