split function in R

Learn how to split data in R

The split function divides data into groups based on factor levels. This tutorial shows how to split vectors and data frames in R through several examples, reviewing all the arguments of the function.

The split() function syntax

The split function divides the input data (x) into the groups defined by f and returns a list with one element per group. The following block summarizes the arguments of the function and their descriptions.

split(x,                 # Vector or data frame
      f,                 # Groups of class factor, vector or list
      drop = FALSE,      # Whether to drop unused levels or not
      sep = ".",         # Character string to separate groups when f is a list
      lex.order = FALSE, # Whether the factor concatenation should be lexically ordered or not
      ...)               # Additional arguments
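The effect of lex.order is easiest to see by comparing the names of the output. The following is a minimal sketch with hypothetical toy data (values, g1 and g2 are not part of this tutorial's running example): by default the levels of the first factor vary fastest, while lex.order = TRUE sorts the combinations lexically.

```r
# Hypothetical toy data, used only to illustrate lex.order
values <- 1:4
g1 <- factor(c("a", "b", "a", "b"))
g2 <- factor(c("x", "x", "y", "y"))

# Default: the first factor varies fastest in the combined levels
default_order <- names(split(values, f = list(g1, g2)))

# lex.order = TRUE: combinations are ordered lexically
lexical_order <- names(split(values, f = list(g1, g2), lex.order = TRUE))

default_order  # "a.x" "b.x" "a.y" "b.y"
lexical_order  # "a.x" "a.y" "b.x" "b.y"
```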

Split vector in R

Suppose you have a named vector where the name of each element is the group that element belongs to. You can split the vector into one vector per group by passing its names, obtained with the names function, to the argument f.

a <- c(x = 3, y = 5, x = 1, x = 4, y = 3)
a
x y x x y
3 5 1 4 3
split(a, f = names(a))
$x
x x x
3 1 4

$y
y y
5 3

In addition, you can pass a character vector to the argument f to indicate the group of each element, or pass a factor directly.

groups <- c("Group 1", "Group 1", "Group 2", "Group 1", "Group 2")

split(a, f = groups)
split(a, f = factor(groups)) # Equivalent
$`Group 1`
x y x
3 5 4

$`Group 2`
x y
1 3 
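The elements of the output follow the order of the factor levels. A character vector is converted to a factor with its levels sorted, so if you need a different order you can build the factor yourself. The sketch below assumes you want "Group 2" first; ordered_groups is a name introduced here only for illustration.

```r
a <- c(x = 3, y = 5, x = 1, x = 4, y = 3)
groups <- c("Group 1", "Group 1", "Group 2", "Group 1", "Group 2")

# Factor with an explicit level order (illustrative name)
ordered_groups <- factor(groups, levels = c("Group 2", "Group 1"))

names(split(a, f = groups))          # "Group 1" "Group 2"
names(split(a, f = ordered_groups))  # "Group 2" "Group 1"
```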

You can also split the data by several grouping variables at once, which produces one element per combination (interaction) of their levels. To do so, pass a list to the argument f.

# New group
groups_2 <- c("Type 1", "Type 1", "Type 1", "Type 2", "Type 1")

# Split "a" by two groups
split(a, f = list(groups, groups_2))


# Equivalent to:
f1 <- factor(c("Group 1", "Group 1", "Group 2", "Group 1", "Group 2"),
             levels = c("Group 1", "Group 2"))
f2 <- factor(c("Type 1", "Type 1", "Type 1", "Type 2", "Type 1"),
             levels = c("Type 1", "Type 2"))

split(a, f = list(f1, f2))
$`Group 1.Type 1`
x y
3 5

$`Group 2.Type 1`
x y
1 3

$`Group 1.Type 2`
x
4

$`Group 2.Type 2`
named numeric(0)

Note that, by default, the names of the combined groups are joined with a dot and the output contains every possible combination, even those with no observations. You can change this behavior with the sep and drop arguments, respectively.

# Remove the empty elements and change the separator
vec_split <- split(a, f = list(f1, f2), drop = TRUE, sep = ": ")
vec_split
$`Group 1: Type 1`
x y
3 5

$`Group 2: Type 1`
x y
1 3

$`Group 1: Type 2`
x
4
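Before dropping anything, it can be useful to see which combinations are actually empty. One possible way, sketched here with the same vectors as above, is to apply the base function lengths to the result of split (group_sizes is just an illustrative name).

```r
a <- c(x = 3, y = 5, x = 1, x = 4, y = 3)
f1 <- factor(c("Group 1", "Group 1", "Group 2", "Group 1", "Group 2"))
f2 <- factor(c("Type 1", "Type 1", "Type 1", "Type 2", "Type 1"))

# Number of elements in each combination of groups
group_sizes <- lengths(split(a, f = list(f1, f2)))
group_sizes
# Group 1.Type 1 Group 2.Type 1 Group 1.Type 2 Group 2.Type 2
#              2              2              1              0

# Combinations without observations (the ones drop = TRUE removes)
names(group_sizes)[group_sizes == 0]  # "Group 2.Type 2"
```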

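Note that the unsplit function recovers the original values in their original order, but the names are lost. Pass the same groups you used to create the split and, if you split with drop = TRUE, set drop = TRUE in unsplit as well.

unsplit(vec_split, list(f1, f2), drop = TRUE)
<NA> <NA> <NA> <NA> <NA> 
   3    5    1    4    3 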
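If you still have the original vector at hand, one simple way to get the names back is to reassign them after calling unsplit. This is only a sketch under that assumption; restored is a name introduced here for illustration.

```r
a <- c(x = 3, y = 5, x = 1, x = 4, y = 3)
f1 <- factor(c("Group 1", "Group 1", "Group 2", "Group 1", "Group 2"))
f2 <- factor(c("Type 1", "Type 1", "Type 1", "Type 2", "Type 1"))

vec_split <- split(a, f = list(f1, f2), drop = TRUE)

# Rebuild the vector, then copy the names from the original
restored <- unsplit(vec_split, f = list(f1, f2), drop = TRUE)
names(restored) <- names(a)
restored
# x y x x y
# 3 5 1 4 3
```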

Split data frame in R

You can split a data set into subsets based on one or more variables that represent groups of the data. Consider the following data frame:

set.seed(3)

df <- CO2[sample(1:nrow(CO2), 10), ]
head(df)
   Plant        Type  Treatment conc uptake
15   Qn3      Quebec nonchilled   95   16.2
68   Mc1 Mississippi    chilled  500   19.5
32   Qc2      Quebec    chilled  350   38.8
27   Qc1      Quebec    chilled  675   35.4
49   Mn1 Mississippi nonchilled 1000   35.5
48   Mn1 Mississippi nonchilled  675   32.4

You can use the split function to split the data frame into groups based, for example, on the Treatment variable.

split(df, f = df$Treatment)
$nonchilled
   Plant        Type  Treatment conc uptake
15   Qn3      Quebec nonchilled   95   16.2
49   Mn1 Mississippi nonchilled 1000   35.5
48   Mn1 Mississippi nonchilled  675   32.4
10   Qn2      Quebec nonchilled  250   37.1
44   Mn1 Mississippi nonchilled  175   19.2

$chilled
   Plant        Type Treatment conc uptake
68   Mc1 Mississippi   chilled  500   19.5
32   Qc2      Quebec   chilled  350   38.8
27   Qc1      Quebec   chilled  675   35.4
23   Qc1      Quebec   chilled  175   24.1
79   Mc3 Mississippi   chilled  175   18.0
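Since the result is a list of data frames, you can apply a function to each subset. As a small sketch, the following counts the rows per treatment using the full built-in CO2 data set rather than the random sample above, so the result does not depend on the seed (by_treatment is an illustrative name).

```r
# Split the complete CO2 data set (84 rows) by treatment
by_treatment <- split(CO2, f = CO2$Treatment)

# Number of rows in each subset
sapply(by_treatment, nrow)
# nonchilled    chilled
#         42         42
```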

As explained in the vectors section, you can divide a data frame into subsets that correspond to combinations of several groups. For example, you can split the sample data frame by the Type and Treatment columns, which creates four subsets, one for each combination of their levels. Note that the total number of subsets is the product of the number of levels of each grouping variable.

dfs <- split(df, f = list(df$Type, df$Treatment))
dfs
$Quebec.nonchilled
   Plant   Type  Treatment conc uptake
15   Qn3 Quebec nonchilled   95   16.2
10   Qn2 Quebec nonchilled  250   37.1

$Mississippi.nonchilled
   Plant        Type  Treatment conc uptake
49   Mn1 Mississippi nonchilled 1000   35.5
48   Mn1 Mississippi nonchilled  675   32.4
44   Mn1 Mississippi nonchilled  175   19.2

$Quebec.chilled
   Plant   Type Treatment conc uptake
32   Qc2 Quebec   chilled  350   38.8
27   Qc1 Quebec   chilled  675   35.4
23   Qc1 Quebec   chilled  175   24.1

$Mississippi.chilled
   Plant        Type Treatment conc uptake
68   Mc1 Mississippi   chilled  500   19.5
79   Mc3 Mississippi   chilled  175   18.0
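The product rule also means that some subsets can be empty when not every combination occurs in the data. The sketch below uses the full CO2 data set, where each plant belongs to a single type, to show the difference between the default and drop = TRUE (all_combos and observed are illustrative names).

```r
# 12 plants x 2 types = 24 combinations, many of them empty
all_combos <- split(CO2, f = list(CO2$Plant, CO2$Type))
length(all_combos)  # 24

# drop = TRUE keeps only the combinations present in the data
observed <- split(CO2, f = list(CO2$Plant, CO2$Type), drop = TRUE)
length(observed)    # 12
```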

Remember that you can recover the original data frame with the unsplit function, passing the list of subsets and the same group or groups you used to create the split.

unsplit(dfs, f = list(df$Type, df$Treatment))
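To check the round trip yourself, you can compare the rebuilt data frame with the original one. The following sketch uses a small hypothetical data frame (scores, with made-up columns) so that it runs on its own; it compares the row names and one column instead of assuming that every attribute is preserved.

```r
# Hypothetical data frame, used only for this check
scores <- data.frame(
  id    = 1:6,
  team  = factor(c("A", "B", "A", "B", "A", "B")),
  level = factor(c("low", "low", "high", "high", "low", "high")),
  score = c(10, 12, 9, 15, 11, 14)
)

# Split by two variables and rebuild with the same groups
pieces   <- split(scores, f = list(scores$team, scores$level))
restored <- unsplit(pieces, f = list(scores$team, scores$level))

# Rows come back in their original order
identical(rownames(restored), rownames(scores))
identical(restored$score, scores$score)
```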