Factor in R
Factors in R are used to represent categorical data. You can think of them as integer vectors in which each integer has an associated label. Note that factors with labels are preferred over plain integer vectors, as labels are self-descriptive. In this lesson you will learn how to create and work with factors in R.
What is a factor in R programming?
A factor in R is a data structure used to represent a vector as categorical data. Therefore, the factor object takes a bounded number of different values called levels. Factors are very useful when working with character columns of data frames, and for creating bar plots and statistical summaries of categorical variables.
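A quick way to see the "integer vector with labels" idea is to inspect a small factor. The sketch below uses a made-up vector called sizes, which is not part of this lesson's data.

```r
# Hypothetical example data (not taken from the lesson)
sizes <- factor(c("small", "large", "small", "medium"))

levels(sizes)      # The labels, sorted alphabetically: "large" "medium" "small"
as.integer(sizes)  # The integer codes pointing into the levels: 3 1 3 2
nlevels(sizes)     # Number of distinct levels: 3
```

Each integer code is simply the position of the observation's label inside levels(sizes).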
The factor function
The factor function allows you to create factors in R. The following block shows the arguments of the function with a brief description of each.
factor(x = character(), # Input vector data
levels, # Input of unique x values (optional)
labels = levels, # Output labels for the levels (optional)
exclude = NA, # Values to be excluded from levels
ordered = is.ordered(x), # Whether the input levels are ordered as given or not
nmax = NA) # Maximum number of levels
You can get a more detailed description of the function and its arguments by calling ?factor or help(factor).
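As a minimal sketch of how the levels, ordered and exclude arguments behave, consider the following block. The answers vector and the object names are hypothetical and only serve as an illustration.

```r
# Hypothetical survey answers; "unknown" should not become a level
answers <- c("low", "high", "unknown", "medium", "low")

# Explicit level order plus ordered = TRUE creates a ranked (ordinal) factor.
# "unknown" is not listed in 'levels', so it is stored as NA
ratings <- factor(answers,
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)

# Ordered factors can be compared against one of their levels
below_high <- ratings < "high"   # TRUE FALSE NA TRUE TRUE

# 'exclude' drops a value when the levels are built from the data
no_unknown <- factor(answers, exclude = "unknown")
levels(no_unknown)               # "high" "low" "medium"
```

In both cases the excluded value is kept as a missing value in the factor, so the length of the result matches the length of the input.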
Convert character to factor in R
Now we will review an example where the input is a character vector. Suppose, for instance, that you have a vector containing the weekdays on which some event happened. You can convert your character vector to a factor with the factor function.
days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
"Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")
# Levels in alphabetical order
my_factor <- factor(days)
my_factor
Friday Tuesday Thursday Monday Wednesday Monday
Wednesday Monday Monday Wednesday Sunday Saturday
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday
By default, converting a character vector to a factor orders the levels alphabetically.
If you want to preserve the order of the levels as they appear in the input data, pass the unique values of the vector to the levels argument:
factor(days, levels = unique(days))
Friday Tuesday Thursday Monday Wednesday Monday
Wednesday Monday Monday Wednesday Sunday Saturday
Levels: Friday Tuesday Thursday Monday Wednesday Sunday Saturday
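Neither alphabetical order nor order of appearance matches the calendar. A possible alternative, sketched below, is to write the levels by hand. The week_order vector is an assumption of this example (a week starting on Monday), and table is used only to show that the counts follow the order of the levels.

```r
days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
          "Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")

# Assumed calendar order, with the week starting on Monday
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

calendar_days <- factor(days, levels = week_order)

# The frequency table is printed following the order of the levels
day_counts <- table(calendar_days)
day_counts
```

With this data, Monday is counted four times, Wednesday three times and every other day once.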
Note that you can retrieve the factor levels as a character vector with the levels function.
levels(my_factor)
"Friday" "Monday" "Saturday" "Sunday" "Thursday" "Tuesday" "Wednesday"
Convert numeric to factor in R
Suppose you have registered the birth city of six individuals with the following codification:
- 1: Dublin.
- 2: London.
- 3: Sofia.
- 4: Pontevedra.
Hence, you will have something like the following data stored in a numeric vector:
city <- c(3, 2, 1, 4, 3, 2)
Now, you can call the factor function to convert the data into a factor and get it categorized for further analysis.
my_factor <- factor(city)
my_factor
The output will have the following structure:
3 2 1 4 3 2
Levels: 1 2 3 4
Change factor labels of the levels
If the input vector is numeric, as in the previous section, the corresponding label (the city) is not reflected. To solve this, you can create the factor with the factor function and pass the corresponding labels of the levels to the labels argument in order to rename the factor levels.
# Setting the labels in the corresponding order
factor_cities <- factor(city, labels = c("Dublin", "London", "Sofia", "Pontevedra"))
# Print the result
factor_cities
Sofia London Dublin Pontevedra Sofia London
Levels: Dublin London Sofia Pontevedra # <- Dublin: 1, London: 2, Sofia: 3, Pontevedra: 4
In the previous code block you can see the final output. As you can observe, now the data is categorized using the cities as labels.
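When only labels is given, the labels are matched to the sorted unique values found in the data, so a code that never appears would shift the mapping. A more defensive variant, sketched here, states the codes explicitly in levels as well. The city_subset vector is a hypothetical sample in which Dublin does not appear.

```r
city <- c(3, 2, 1, 4, 3, 2)
city_names <- c("Dublin", "London", "Sofia", "Pontevedra")

# Pairing 'levels' (input codes) with 'labels' (output names) fixes the mapping
factor_cities <- factor(city, levels = 1:4, labels = city_names)

# Hypothetical subset where code 1 (Dublin) is absent
city_subset <- c(3, 2, 4)
factor_subset <- factor(city_subset, levels = 1:4, labels = city_names)

as.character(factor_subset)  # "Sofia" "London" "Pontevedra"
levels(factor_subset)        # All four cities, including the unused "Dublin"
```

Without levels = 1:4, the call on city_subset would fail because four labels would be supplied for only three values.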
Difference between levels and labels in R
It is common to confuse the labels and levels arguments of the factor function. Consider the following vector with a single group and create a factor from it with the default arguments:
gender <- c("female", "female", "female", "female")
factor(gender)
female female female female
Levels: female
On the one hand, the labels argument allows you to modify the names of the factor levels. Hence, the labels argument is related to the output. Note that the vector passed to the labels argument must have as many elements as there are levels (unique groups) in the input vector.
factor(gender, labels = c("f"))
f f f f
Levels: f
On the other hand, the levels argument is related to the input. This argument allows you to specify how the levels are coded. Moreover, it allows you to add new levels to the factor:
factor(gender, levels = c("male", "female"))
female female female female
Levels: male female
Note that the levels you specify must include at least the values present in the input vector. Any value missing from levels is converted to NA:
factor(gender, levels = c("male", "f"))
<NA> <NA> <NA> <NA>
Levels: male f
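To catch this kind of mismatch before it silently produces missing values, you could compare the data against the intended levels first. This is only a sketch, and the names wanted_levels and missing_values are made up for the example.

```r
gender <- c("female", "female", "female", "female")
wanted_levels <- c("male", "f")  # Same mismatched levels as in the previous call

# Values that appear in the data but not in the requested levels
missing_values <- setdiff(unique(gender), wanted_levels)
missing_values                   # "female"

# Number of observations that would be turned into NA
n_lost <- sum(is.na(factor(gender, levels = wanted_levels)))
n_lost                           # 4
```

An empty missing_values vector would indicate that every observation is covered by the requested levels.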
Relevel and reorder factor levels
You may be wondering how to change the levels order (which can be important, for instance, in some graphical representations). The factor levels order can be changed in various ways, described in the following subsections.
Custom order of factor levels
If you want to create a custom order for the levels, you will have to create a vector with the desired order and pass it to the levels argument.
# Create a vector with the desired order
# (named 'level_order' to avoid masking the base function 'order')
level_order <- c("London", "Sofia", "Dublin", "Pontevedra")
# Indicate the order in the 'levels' argument
factor_cities <- factor(factor_cities, levels = level_order)
factor_cities
Sofia London Dublin Pontevedra Sofia London
Levels: London Sofia Dublin Pontevedra # <- Ordered as specified
In addition, you can order the levels of the factor alphabetically making use of the sort function. Pass the sorted levels to the levels argument, not to labels, so that only the order changes and each observation keeps its city:
# Alphabetical order
factor(factor_cities, levels = sort(levels(factor_cities)))
Sofia London Dublin Pontevedra Sofia London
Levels: Dublin London Pontevedra Sofia # <- Alphabetical order
Reorder factor levels
The reorder function is designed to order the levels of a factor based on a statistical measure of another variable. To demonstrate, consider a data frame where each row represents an individual, the 'city' column represents the city where they were born and the 'salary' column represents their current annual salary in thousands of dollars.
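# The salaries shown below were drawn with the default sampler used before R 3.6.0;
# newer R versions may draw different values for the same seed
set.seed(1)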
df <- data.frame(city = factor_cities, salary = sample(20:50, 6))
df
city salary
1 Sofia 28
2 London 31
3 Dublin 36
4 Pontevedra 45
5 Sofia 25
6 London 43
You can reorder the factor based, for example, on the mean salary of the individuals using the reorder function as follows:
reorder(df$city, df$salary, mean)
Sofia London Dublin Pontevedra Sofia London
attr(,"scores")
London Sofia Dublin Pontevedra
37.0 26.5 36.0 45.0
Levels: Sofia Dublin London Pontevedra # <- Ordered from lower to higher salary
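The scores used by reorder are the per-level results of the summary function, which you can check with tapply. The sketch below types the salaries of the printed data frame by hand, so it does not depend on the random seed, and it sorts in decreasing order by negating the values, which is one possible approach among others.

```r
# Same cities and salaries as in the printed data frame, entered manually
df <- data.frame(
  city = factor(c("Sofia", "London", "Dublin", "Pontevedra", "Sofia", "London")),
  salary = c(28, 31, 36, 45, 25, 43)
)

# Mean salary per city: the scores that reorder relies on
mean_salary <- tapply(df$salary, df$city, mean)

# Increasing mean salary
by_mean <- reorder(df$city, df$salary, FUN = mean)
levels(by_mean)       # "Sofia" "Dublin" "London" "Pontevedra"

# Decreasing mean salary, obtained by negating the values
by_mean_desc <- reorder(df$city, -df$salary, FUN = mean)
levels(by_mean_desc)  # "Pontevedra" "London" "Dublin" "Sofia"
```

Only the order of the levels changes. The observations in df$city stay in their original positions.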
Reverse order of levels
Recall that you can use the levels function to obtain the levels of a factor. At this point, the levels of the factor are the following:
levels(factor_cities)
"London" "Sofia" "Dublin" "Pontevedra"
With this in mind, you can reverse the order of the levels of a factor by passing the reversed levels, obtained with the rev function, to the levels argument:
factor(factor_cities, levels = rev(levels(factor_cities)))
Sofia London Dublin Pontevedra Sofia London
Levels: Pontevedra Dublin Sofia London # <- Reversed order
Relevel function
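Moreover, if you want to move just one level to the first position, you can use the relevel function. For example, starting from the original level order (Dublin, London, Sofia, Pontevedra), if you want the level 'London' to appear first while maintaining the order of the others, you can use:
# Recreate the factor with its original level order
factor_cities <- factor(city, labels = c("Dublin", "London", "Sofia", "Pontevedra"))
# Setting the level 'London' first
factor_cities <- relevel(factor_cities, ref = "London")
factor_cities
Sofia London Dublin Pontevedra Sofia London
Levels: London Dublin Sofia Pontevedra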
In the following sections we will review how to convert factors to other data types in the most efficient way.
Convert factor in R to numeric
If you have a factor in R that you want to convert to numeric, the most efficient way is illustrated in the following code block, which uses the as.numeric and levels functions to convert the levels to numbers and then index them by the factor itself.
my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)
as.numeric(levels(my_factor))[my_factor]
0 2 0 5 1 9 9 4
If you want to recover the original vector (with the same order), never use as.numeric(my_factor), as it returns the internal integer codes of the levels instead of the original values.
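The difference is easy to check side by side. The following sketch reuses the vector from this section and also shows as.numeric(as.character()), a common alternative that gives the same result as indexing the levels.

```r
my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)

# Wrong: returns the position of each value within the levels
codes <- as.numeric(my_factor)
codes                                     # 1 3 1 5 2 6 6 4

# Right: both approaches recover the original values
as.numeric(levels(my_factor))[my_factor]  # 0 2 0 5 1 9 9 4
as.numeric(as.character(my_factor))       # 0 2 0 5 1 9 9 4
```

The codes differ from the data because the levels are "0", "1", "2", "4", "5" and "9", so, for instance, the value 9 is stored as code 6.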
Convert factor to string
You may need to convert a factor to a string. For that purpose, you can make use of the as.character function.
my_factor_2 <- factor(c("June", "July", "January", "June"))
as.character(my_factor_2)
"June" "July" "January" "June"
Note that if you use the levels function instead, the output will be a character vector with the unique strings ordered alphabetically, as shown in one of the previous sections.
levels(my_factor_2)
"January" "July" "June"
Convert factor to date
Also, if you need to convert your factor object to a date, you can use the as.Date function, specifying in the format argument the date format you are working with.
my_date_factor <- factor(c("03/21/2020",
"03/22/2020",
"03/23/2020"))
as.Date(my_date_factor, format = "%m/%d/%Y")
"2020-03-21" "2020-03-22" "2020-03-23"