Cut in R
The cut function in R allows you to cut data into bins and specify ‘cut labels’, so it is very useful for creating a factor from a continuous variable. In this tutorial you will learn how to use cut in R and, therefore, how to categorize data in R.
cut function in R
Sometimes it is useful to categorize the values of a continuous variable into different levels of a factor. For that purpose, you can use the R cut function. The following block of code shows the syntax of the function and a simplified description of its arguments.
cut(num_vector,             # Numeric input vector
    breaks,                 # Number of intervals or vector of break points
    labels = NULL,          # Labels for each group
    include.lowest = FALSE, # Whether a value equal to the lowest break (or the highest, if right = FALSE) is included
    right = TRUE,           # Whether the intervals are closed on the right (and open on the left) or vice versa
    dig.lab = 3,            # Number of digits used in the group labels if labels = NULL
    ordered_result = FALSE, # Whether to return an ordered factor or not
    ...)                    # Additional arguments
Cut in R: the breaks argument
The breaks argument allows you to cut the data into bins and hence to categorize it. Consider the following vector:
x <- -5:5
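On the one hand, you can set the breaks argument to an integer greater than or equal to 2, creating as many intervals (levels) as the specified number. These intervals all have the same length, and the outer limits are moved away from the range of the data by 0.1% of that range so that the extreme values fall inside the intervals.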
cut(x, breaks = 2)
(-5.01,0] (-5.01,0] (-5.01,0] (-5.01,0] (-5.01,0]
(-5.01,0] (0,5.01] (0,5.01] (0,5.01] (0,5.01] (0,5.01]
Levels: (-5.01,0] (0,5.01]
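As a quick check of the equal-width behavior, the following sketch splits the same vector into three bins and counts the observations in each one with table(). The object name bins is only illustrative and does not appear elsewhere in this tutorial.

```r
x <- -5:5

# Three equal-width intervals spanning the range of x
bins <- cut(x, breaks = 3)

nlevels(bins)  # number of intervals created
table(bins)    # number of observations that fall in each interval
```

Counting the values per bin is a simple way to confirm that no observation has been left out as NA.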
On the other hand, you can specify the intervals you prefer.
cut(x, breaks = c(-6, 2, 5))
(-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (-6,2] (2,5] (2,5]
(2,5]
Levels: (-6,2] (2,5]
It is worth mentioning that if the interval limits have decimals you can modify the number of digits shown in the labels with the dig.lab argument, and you can decide whether the result is an ordered factor or not with the ordered_result argument.
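A minimal sketch of both arguments follows. The vector y and the object name groups are made up for illustration, they are not part of the tutorial's data.

```r
y <- c(0.12345, 0.5, 0.98765)  # illustrative data with several decimals

# Default labels (dig.lab = 3) versus labels with more digits
levels(cut(y, breaks = 2))
levels(cut(y, breaks = 2, dig.lab = 5))

# An ordered factor allows comparisons between levels
groups <- cut(y, breaks = 2, ordered_result = TRUE)
is.ordered(groups)     # TRUE
groups[1] < groups[3]  # TRUE: the first value is in a lower interval
```

With the default unordered factor, the last comparison would return NA with a warning, so ordered_result = TRUE is worth setting when the bins have a natural ranking.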
Cut in R: the labels argument
You can also change the levels of the output factor with the labels argument.
x <- c(12, 1, 25, 12, 65, 2, 6, 17)
cut(x, breaks = c(0, 3, 12, 15, 20, 80),
    labels = c("First", "Second", "Third", "Fourth", "Fifth"))
# Equivalent to (the result is stored in x_cut to avoid masking the c function)
x_cut <- cut(x, breaks = c(0, 3, 12, 15, 20, 80))
levels(x_cut) <- c("First", "Second", "Third", "Fourth", "Fifth")
x_cut
Second First Fifth Second Fifth First Second Fourth
Levels: First Second Third Fourth Fifth
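If you only need the position of the interval each value falls into, rather than a labelled factor, the labels argument also accepts FALSE. A short sketch with the same data (the name codes is illustrative):

```r
x <- c(12, 1, 25, 12, 65, 2, 6, 17)

# labels = FALSE returns integer interval codes instead of a factor
codes <- cut(x, breaks = c(0, 3, 12, 15, 20, 80), labels = FALSE)
codes
# 2 1 5 2 5 1 2 4
```

The codes match the order of the labels used above: 1 corresponds to "First", 2 to "Second", and so on.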
Include lowest value
The include.lowest argument specifies whether a value equal to the lowest break is included in the first interval or not. By default, it is set to FALSE.
x <- 15:25
cut(x, breaks = c(15, 20, 25), include.lowest = FALSE)
<NA> (15,20] (15,20] (15,20] (15,20]
(15,20] (20,25] (20,25] (20,25] (20,25] (20,25]
Levels: (15,20] (20,25]
In this case, the lowest value (15), which is also the lowest break, is not included in the first interval (the interval is open on the left), so the value is categorized as NA because the number 15 doesn’t belong to any of the intervals. However, if you set include.lowest to TRUE, the value will be included, as the first interval will be closed on the left.
cut(x, breaks = c(15, 20, 25), include.lowest = TRUE)
[15,20] [15,20] [15,20] [15,20] [15,20]
[15,20] (20,25] (20,25] (20,25] (20,25] (20,25]
Levels: [15,20] (20,25]
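Despite its name, include.lowest applies to the highest break when right = FALSE, because in that case it is the last interval that would otherwise be left open. A brief sketch with the same vector (the name res is illustrative):

```r
x <- 15:25

# Left-closed intervals; include.lowest = TRUE closes the last one on the right
res <- cut(x, breaks = c(15, 20, 25), right = FALSE, include.lowest = TRUE)

levels(res)  # "[15,20)" "[20,25]"
anyNA(res)   # FALSE: 25 is included in the last interval
```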
The argument right
Consider, for instance, that you want to categorize some data (\(x\)) into the following categories:
- Low, if \(x \in\) [0, 150).
- Medium, if \(x \in\) [150, 200).
- High, if \(x \in\) [200, \(\infty\)).
By default, the argument right is set to TRUE, so the intervals are open on the left and closed on the right, that is, (x, y].
x <- c(75, 150, 160, 151, 216, 149)
categories <- cut(x, breaks = c(0, 150, 200, Inf),
labels = c("low", "medium", "high"))
data.frame(x, categories)
In this scenario, not all the values are categorized as intended: 150 should be medium, but it falls in the right-closed interval (0, 150].
x categories
75 low
150 low # <-- Categorized as low
160 medium
151 medium
216 high
149 low
However, if you set right = FALSE, the intervals will be closed on the left and open on the right.
categories <- cut(x, breaks = c(0, 150, 200, Inf),
labels = c("low", "medium", "high"),
right = FALSE)
data.frame(x, categories)
Now the data is categorized correctly:
x categories
75 low
150 medium # <-- Categorized as medium
160 medium
151 medium
216 high
149 low
Changing the right and include.lowest arguments can lead to mistakes, so we recommend changing the values of the breaks argument instead.
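Whichever approach you choose, one way to reduce the risk of mistakes is to inspect the automatically generated interval labels, and to count the missing values, before assigning custom labels. The sketch below does this for the previous data; the name brks is illustrative.

```r
x <- c(75, 150, 160, 151, 216, 149)
brks <- c(0, 150, 200, Inf)

# Without 'labels', the levels show which side of each interval is closed
levels(cut(x, breaks = brks))                 # right-closed (default)
levels(cut(x, breaks = brks, right = FALSE))  # left-closed

# Values that fall outside every interval become NA
sum(is.na(cut(x, breaks = brks)))
```

Once the intervals look right, the labels argument can be added to the same call.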
Example: How to categorize age groups in R?
Consider, for instance, that you want to categorize a numeric vector of ages into the following categories:
- 0-14: Children.
- 15-24: Youth.
- 25-64: Adult.
- 65 and over: Senior.
age <- c(0, 12, 89, 14, 25, 2, 65, 1, 16, 24, 67, 61, 64)
At first glance, you might try the following, but an error will arise.
cut(age, breaks = c(14, 24, 64, Inf),
labels = c("Children", "Youth", "Adult", "Senior"))
Error in cut.default(age, breaks = c(14, 24, 64, Inf), labels = c("Children", : lengths of 'breaks' and 'labels' differ
Although you have specified four break values and four labels, the breaks are the limits of the intervals, so four breaks generate three intervals instead of four (14-24, 24-64 and 64-Inf). Consequently, in this case you need to add the lowest value to obtain four intervals:
cut(age, breaks = c(0, 14, 24, 64, Inf),
labels = c("Children", "Youth", "Adult", "Senior"))
<NA> Children Senior Children Adult Children Senior Children
Youth Youth Senior Adult Adult
Levels: Children Youth Adult Senior
But now the lowest age (0) will be categorized as NA, as the lowest value of the breaks is not included by default. You could solve this by changing the 0 of the breaks (for example, setting -0.01 instead of 0) or by setting the include.lowest argument to TRUE.
cut(age, breaks = c(-0.01, 14, 24, 64, Inf),
labels = c("Children", "Youth", "Adult", "Senior"))
# Equivalent to:
cut(age, breaks = c(0, 14, 24, 64, Inf),
labels = c("Children", "Youth", "Adult", "Senior"),
include.lowest = TRUE)
Children Children Senior Children Adult Children Senior Children
Youth Youth Senior Adult Adult
Levels: Children Youth Adult Senior
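The breaks above work for whole-number ages. If ages could be fractional (for example 14.5), and assuming such a value should still count as Children, a possible alternative is to use left-closed intervals whose breaks are the first age of each group. This is a sketch under that assumption, and the name age_group is illustrative.

```r
age <- c(0, 12, 89, 14, 25, 2, 65, 1, 16, 24, 67, 61, 64)

# Left-closed intervals: [0,15), [15,25), [25,65), [65,Inf)
age_group <- cut(age, breaks = c(0, 15, 25, 65, Inf),
                 labels = c("Children", "Youth", "Adult", "Senior"),
                 right = FALSE)

table(age_group)
# Children: 5, Youth: 2, Adult: 3, Senior: 3
```

For this vector the result matches the previous output, and the age 0 is included without changing include.lowest, since the first interval is closed on the left.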
Example: How to categorize exam grades?
As another example, exam grades can be categorized as fail, if the grade is lower than 5 points out of 10, or pass otherwise. We will generate a simple data set to categorize exam grades.
numeric <- c(6.1, 5.3, 8.9, 5.0, 8.8, 1.9, 6.6, 7.2, 9.4, 4.9,
7.1, 3.9, 1.0, 9.3, 9.9, 5.9, 5.1, 8.4, 3.2, 10.0)
In this example you could implement the function as follows:
# Works here because the grades have a single decimal: (0, 4.9] is fail and (4.9, 10] is pass
categorized_note <- cut(numeric, breaks = c(0, 4.9, 10),
                        labels = c("fail", "pass"))
# Equivalent for this data:
# categorized_note <- cut(numeric, breaks = c(0, 5, 10.1),
#                         labels = c("fail", "pass"), right = FALSE)
# You could specify the factor levels with the levels function
# levels(categorized_note) <- c("fail", "pass")
# Generating the data frame
final_notes <- data.frame(numeric, categorized_note)
head(final_notes)
Note that in the equivalent alternative we set right = FALSE because, if it were TRUE, a 5 would be fail instead of pass. However, when this argument is set to FALSE, the intervals are open on the right, so a 10 would not fall in any interval, which is why we set the third break to 10.1 instead of 10. The final result is as follows:
numeric categorized_note
1 6.1 pass
2 5.3 pass
3 8.9 pass
4 5.0 pass
5 8.8 pass
6 1.9 fail
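As a possible variant, you can keep 5 and 10 as the breaks and combine right = FALSE with include.lowest = TRUE, which closes the last interval on the right. The sketch below uses a few made-up grades (including 0, 4.95 and 10) to show the edge cases; the names grades and result are illustrative.

```r
grades <- c(6.1, 5.3, 8.9, 5.0, 8.8, 1.9, 10.0, 0, 4.95)  # illustrative values

# Intervals [0,5) and [5,10]: fail below 5, pass from 5 up to and including 10
result <- cut(grades, breaks = c(0, 5, 10),
              labels = c("fail", "pass"),
              right = FALSE, include.lowest = TRUE)

data.frame(grades, result)
```

With these settings a grade of 0 or 4.95 is a fail and a grade of exactly 5 or 10 is a pass, without adjusting the breaks to 4.9 or 10.1.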