Categorize data in R with the cut function

The cut function in R splits numeric data into bins and lets you label each bin, which makes it very useful for creating a factor from a continuous variable. In this tutorial you will learn how to use cut and, therefore, how to categorize data in R.

`cut` function in R

Sometimes it is useful to categorize the values of a continuous variable into different levels of a factor. For that purpose, you can use the R cut function. The following block of code shows the syntax of the function with a simplified description of its arguments.

cut(num_vector,              # Numeric input vector
    breaks,                  # Number of intervals (2 or more) or vector of break points
    labels = NULL,           # Labels for each group
    include.lowest = FALSE,  # Whether a value equal to the lowest break (highest if right = FALSE) is included
    right = TRUE,            # Whether the intervals are closed on the right (and open on the left) or vice versa
    dig.lab = 3,             # Number of significant digits used in the group labels if labels = NULL
    ordered_result = FALSE,  # Whether to return an ordered factor or not
    ...)                     # Additional arguments

Cut in R: the `breaks` argument

The breaks argument allows you to cut the data in bins and hence to categorize it. Consider the following vector:

x <- -5:5

On the one hand, you can set the breaks argument to an integer greater than or equal to 2, which creates as many intervals (levels) as the specified number. These intervals will all be of the same length.

cut(x, breaks = 2)

(-5.01,0] (-5.01,0] (-5.01,0] (-5.01,0] (-5.01,0]
(-5.01,0] (0,5.01] (0,5.01]  (0,5.01]  (0,5.01]  (0,5.01] 
Levels: (-5.01,0] (0,5.01]

On the other hand, you can specify the intervals you prefer.

cut(x, breaks = c(-6, 2, 5))

(-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (2,5]  (2,5] 
(2,5] 
Levels: (-6,2] (2,5]
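
The limits -5.01 and 5.01 in the equal-width example are not rounding noise. According to the cut documentation, when breaks is a single number the range of the data is divided into pieces of equal length and the outer limits are then moved away by 0.1% of the range, so that the extreme values fall inside an interval. The following sketch rebuilds those breaks by hand; the names rng, width and manual_breaks are only illustrative.

```r
x <- -5:5

# Range of the data and its width
rng <- range(x)
width <- diff(rng)

# Three equally spaced break points give two intervals
manual_breaks <- seq(rng[1], rng[2], length.out = 3)

# Move the outer limits away by 0.1% of the range
manual_breaks[1] <- rng[1] - width / 1000
manual_breaks[3] <- rng[2] + width / 1000

auto <- cut(x, breaks = 2)
manual <- cut(x, breaks = manual_breaks)

# Both approaches assign every value to the same bin
all(as.integer(auto) == as.integer(manual))
```

Because the break points depend on the range of the data, two vectors with different ranges get different bins when breaks is a single number. Passing an explicit vector of breaks keeps the bins comparable.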

It is worth mentioning that, if the break points have decimals, you can change the number of significant digits shown in the interval labels with the dig.lab argument, and you can return an ordered factor by setting the ordered_result argument to TRUE.
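
As a minimal sketch of these two arguments, the following example uses made-up data and break points (x and brks are assumptions for illustration only). It shortens the generated labels with dig.lab and then requests an ordered factor.

```r
x <- c(0.1234, 0.5678, 0.9)
brks <- c(0, 0.33333, 0.66667, 1)

# Default: break points are formatted with 3 significant digits
default_bins <- cut(x, breaks = brks)
levels(default_bins)

# Fewer significant digits in the generated labels
short_bins <- cut(x, breaks = brks, dig.lab = 2)
levels(short_bins)

# An ordered factor supports comparisons between levels
ordered_bins <- cut(x, breaks = brks, ordered_result = TRUE)
is.ordered(ordered_bins)
ordered_bins[1] < ordered_bins[3]
```

Note that dig.lab only affects how the labels are printed; the values are still binned with the exact break points.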

Cut in R: the `labels` argument

You can also change the levels of the output factor with the labels argument.

x <- c(12, 1, 25, 12, 65, 2, 6, 17)

cut(x, breaks = c(0, 3, 12, 15, 20, 80),
    labels = c("First", "Second", "Third", "Fourth", "Fifth"))

# Equivalent to (x_cut avoids naming a variable after the c function)
x_cut <- cut(x, breaks = c(0, 3, 12, 15, 20, 80))
levels(x_cut) <- c("First", "Second", "Third", "Fourth", "Fifth")

Second First Fifth Second Fifth First Second Fourth
Levels: First Second Third Fourth Fifth
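
The labels argument also accepts FALSE, in which case cut returns the integer code of the bin instead of a factor. The sketch below reuses the vector from the example above; group_names is an illustrative name, not part of the function.

```r
x <- c(12, 1, 25, 12, 65, 2, 6, 17)

# labels = FALSE returns the bin number of each value instead of a factor
codes <- cut(x, breaks = c(0, 3, 12, 15, 20, 80), labels = FALSE)
codes

# The codes can be used afterwards to index a vector of names
group_names <- c("First", "Second", "Third", "Fourth", "Fifth")
group_names[codes]
```

Integer codes can be convenient when the bins are only needed as an index, for example to look up a value per group.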

Include lowest value

The include.lowest argument specifies whether a value equal to the lowest break is included in the first interval. By default, it is set to FALSE.

x <- 15:25

cut(x, breaks = c(15, 20, 25), include.lowest = FALSE)

<NA> (15,20] (15,20] (15,20] (15,20]
(15,20] (20,25] (20,25] (20,25] (20,25] (20,25]
Levels: (15,20] (20,25]

In this case, the lowest value (15), which is also specified as a break, is not included in the first interval because that interval is open on the left. As the number 15 doesn’t belong to any of the intervals, it is categorized as NA. However, if you set include.lowest to TRUE, the value will be included, as the first interval will be closed on the left.

cut(x, breaks = c(15, 20, 25), include.lowest = TRUE)

[15,20] [15,20] [15,20] [15,20] [15,20] 
[15,20] (20,25] (20,25] (20,25] (20,25] (20,25]
Levels: [15,20] (20,25]
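
Despite its name, include.lowest applies to the highest break when right = FALSE, because in that case the last interval is the one left open at its outer end. A short sketch with the same data (open_top and closed_top are illustrative names):

```r
x <- 15:25

# Left-closed intervals: the highest break (25) is excluded by default
open_top <- cut(x, breaks = c(15, 20, 25), right = FALSE)
is.na(open_top[x == 25])

# include.lowest = TRUE now closes the last interval on the right
closed_top <- cut(x, breaks = c(15, 20, 25), right = FALSE,
                  include.lowest = TRUE)
levels(closed_top)
anyNA(closed_top)
```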

The argument `right`

Consider, for instance, that you want to categorize some data (\(x\)) into the following categories:

Low, if \(x \in\) [0, 150).
Medium, if \(x \in\) [150, 200).
High, if \(x \in\) [200, \(\infty\)).

By default, the argument right is set to TRUE, so the intervals are open on the left and closed on the right, (x, y].

x <- c(75, 150, 160, 151, 216, 149)

categories <- cut(x, breaks = c(0, 150, 200, Inf),
                  labels = c("low", "medium", "high"))

data.frame(x, categories)

In this scenario, not all the values are categorized as intended, because 150 falls in the interval (0, 150]:

  x   categories
  75        low
 150        low   # <-- Categorized as low
 160     medium
 151     medium 
 216       high
 149        low

However, if you set right = FALSE, the intervals will be closed on the left and open on the right.

categories <- cut(x, breaks = c(0, 150, 200, Inf),
                  labels = c("low", "medium", "high"),
                  right = FALSE)

data.frame(x, categories)

Now the data is categorized correctly:

  x    categories
  75        low
 150     medium   # <-- Categorized as medium
 160     medium
 151     medium
 216       high
 149        low

Changing the right and include.lowest arguments can easily lead to mistakes, so we recommend adjusting the values of the breaks argument instead.
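
As a sketch of that recommendation, the same grouping can be obtained with the default right = TRUE by shifting the break points. This relies on an assumption that is not part of the original example: x only contains whole numbers, so the interval (149, 199] holds the same values as [150, 200). The names group_labels, by_right and by_breaks are illustrative.

```r
x <- c(75, 150, 160, 151, 216, 149)
group_labels <- c("low", "medium", "high")

# Left-closed intervals through right = FALSE
by_right <- cut(x, breaks = c(0, 150, 200, Inf),
                labels = group_labels, right = FALSE)

# Same grouping with the default right = TRUE and shifted breaks.
# Assumption: x only contains whole numbers, so (149, 199] equals [150, 200)
by_breaks <- cut(x, breaks = c(-1, 149, 199, Inf),
                 labels = group_labels)

identical(by_right, by_breaks)
```

If the data can take decimal values such as 149.5, shifted breaks no longer match the intended categories, and right = FALSE is the safer choice.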

Example: How to categorize age groups in R?

Consider, for instance, that you want to categorize a numeric vector of ages into the following categories:

0-14: Children.
15-24: Youth.
25-64: Adult.
65 and over: Senior.

age <- c(0, 12, 89, 14, 25, 2, 65, 1, 16, 24, 67, 61, 64)

At first glance, you might think of writing the following, but it raises an error.

cut(age, breaks = c(14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"))

Error in cut.default(age, breaks = c(14, 24, 64, Inf), labels = c("Children", : lengths of 'breaks' and 'labels' differ

Although you have specified four break values and four labels, the breaks delimit the intervals, so four break values generate only three intervals (14-24, 24-64 and 64-Inf). Consequently, in this case you need to add the lowest value to obtain four intervals:

cut(age, breaks = c(0, 14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"))

<NA>  Children  Senior  Children  Adult  Children  Senior  Children
Youth  Youth  Senior  Adult  Adult   
Levels: Children Youth Adult Senior

But now the lowest age (0) is categorized as NA, because a value equal to the lowest break is not included by default. You can solve this by lowering the first break (for example, setting -0.01 instead of 0) or by setting the include.lowest argument to TRUE.

cut(age, breaks = c(-0.01, 14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"))

# Equivalent to:
cut(age, breaks = c(0, 14, 24, 64, Inf),
    labels = c("Children", "Youth", "Adult", "Senior"),
    include.lowest = TRUE)

Children  Children  Senior  Children  Adult  Children  Senior  Children
Youth  Youth  Senior  Adult  Adult   
Levels: Children Youth Adult Senior
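
A quick way to check a categorization like this one is to count the observations per level and confirm that no value was left as NA. The sketch below repeats the include.lowest version; age_group is an illustrative name.

```r
age <- c(0, 12, 89, 14, 25, 2, 65, 1, 16, 24, 67, 61, 64)

age_group <- cut(age, breaks = c(0, 14, 24, 64, Inf),
                 labels = c("Children", "Youth", "Adult", "Senior"),
                 include.lowest = TRUE)

# Number of observations per category
table(age_group)

# Confirm that no age was left uncategorized
anyNA(age_group)
```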

Example: How to categorize exam grades?

As another example, exam grades can be categorized as fail if the grade is lower than 5 points out of 10, and as pass otherwise. We will generate a simple data set of exam grades to categorize.

numeric <- c(6.1, 5.3, 8.9, 5.0, 8.8, 1.9, 6.6, 7.2, 9.4, 4.9,
             7.1, 3.9, 1.0, 9.3, 9.9, 5.9, 5.1, 8.4, 3.2, 10.0)

In this example you could implement the function as follows:

categorized_note <- cut(numeric, breaks = c(0, 4.9, 10),
                        labels = c("fail", "pass"))

# Equivalent for these grades, which have one decimal place:
# categorized_note <- cut(numeric, breaks = c(0, 5, 10.1),
#                         labels = c("fail", "pass"), right = FALSE)

# You could also set the factor levels afterwards with the levels function
# levels(categorized_note) <- c("fail", "pass")

# Generating the dataframe
final_notes <- data.frame(numeric, categorized_note)
head(final_notes)

Note that in the equivalent alternative we set right = FALSE because, with right = TRUE, a 5 would be a fail instead of a pass. However, when this argument is FALSE the intervals are open on the right, so a 10 would not fall into the last interval; that is why we set the third break to 10.1 instead of 10. The first rows of the result are as follows:

    numeric     categorized_note
1     6.1              pass 
2     5.3              pass 
3     8.9              pass 
4     5.0              pass 
5     8.8              pass 
6     1.9              fail
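
The 4.9 and 10.1 break points work because these grades have a single decimal place. As a hedged alternative for grades with more decimals, the sketch below keeps the natural breaks 0, 5 and 10 and combines right = FALSE with include.lowest = TRUE, which closes the last interval so that a 10 is still a pass. The name robust_note is illustrative.

```r
numeric <- c(6.1, 5.3, 8.9, 5.0, 8.8, 1.9, 6.6, 7.2, 9.4, 4.9,
             7.1, 3.9, 1.0, 9.3, 9.9, 5.9, 5.1, 8.4, 3.2, 10.0)

# [0, 5) is fail and [5, 10] is pass
robust_note <- cut(numeric, breaks = c(0, 5, 10),
                   labels = c("fail", "pass"),
                   right = FALSE, include.lowest = TRUE)

# Number of students per category
table(robust_note)

# A grade with two decimals just below 5 is still a fail
borderline <- cut(4.95, breaks = c(0, 5, 10),
                  labels = c("fail", "pass"),
                  right = FALSE, include.lowest = TRUE)
as.character(borderline)
```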