Factor in R

Create factors in R to represent categorical data

Factors in R are used to represent categorical data. You can think of them as integer vectors in which each integer has an associated label. Factors with labels are preferable to plain integer vectors because the labels are self-descriptive. In this lesson you will learn how to create a factor in R.

What is a factor in R programming?

A factor in R is a data structure that stores a vector as categorical data, so it can only take a limited set of distinct values, called levels. Factors are very useful when working with character columns of data frames, when creating bar plots and when computing statistical summaries of categorical variables.
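
To make the "integer vector with labels" idea concrete, the following minimal sketch looks at the two parts of a factor separately. The vector sizes and its values are made up for illustration and are not part of the lesson data.

```r
# Hypothetical example data (an assumption, not from the lesson)
sizes <- factor(c("small", "large", "small", "medium"))

# The labels, stored once, in alphabetical order by default
size_levels <- levels(sizes)     # "large" "medium" "small"

# The integer codes that point into the levels
size_codes <- as.integer(sizes)  # 3 1 3 2

# Indexing the levels by the codes rebuilds the original strings
size_levels[size_codes]
```

Each observation is stored as a small integer, and the text is kept only once in the levels.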

The factor function

The factor function allows you to create factors in R. In the following block we show the arguments of the function with a summarized description.

factor(x = character(),         # Input vector data
       levels,                  # Input of unique x values (optional)
       labels = levels,         # Output labels for the levels (optional)
       exclude = NA,            # Values to be excluded from levels
       ordered = is.ordered(x), # Whether the levels should be regarded as ordered (in the order given)
       nmax = NA)               # Maximum number of levels

You can get a more detailed description of the function and its arguments by calling ?factor or help(factor).
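
As a rough illustration of two of the less obvious arguments, the sketch below uses exclude and ordered on a small vector. The answers vector is a hypothetical example introduced here, not data from the lesson.

```r
# Hypothetical survey answers (an assumption for illustration)
answers <- c("low", "high", "unknown", "medium", "low")

# exclude: remove "unknown" from the levels; those entries become NA
no_unknown <- factor(answers, exclude = "unknown")
levels(no_unknown)

# ordered: regard the levels as ranked, in the order given
ratings <- factor(answers,
                  levels  = c("low", "medium", "high"),
                  ordered = TRUE)

# Comparisons are meaningful for ordered factors
below_high <- ratings < "high"
below_high
```

In the second call "unknown" is not among the levels, so it also becomes NA, and the comparison returns NA for that entry.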

Convert character to factor in R

Now we will review an example where the input is a character vector. Suppose, for instance, that you have a vector containing the days of the week on which some event happened. You can convert this character vector to a factor with the factor function.

days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
          "Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")

# Levels in alphabetical order
my_factor <- factor(days)
my_factor
Friday  Tuesday  Thursday  Monday  Wednesday  Monday
Wednesday  Monday  Monday  Wednesday  Sunday  Saturday 
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday

By default, converting a character vector to factor will order the levels alphabetically.

If you want to preserve the order of the levels as they appear in the input data, pass the unique values of the vector to the levels argument:

factor(days, levels = unique(days))
Friday  Tuesday  Thursday  Monday  Wednesday  Monday
Wednesday  Monday  Monday  Wednesday  Sunday  Saturday 
Levels: Friday Tuesday Thursday Monday Wednesday Sunday Saturday
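
Neither the alphabetical order nor the order of appearance matches the calendar. One possible approach, sketched below, is to write the desired order out by hand and pass it to the levels argument. The names week, days_factor and day_counts are illustrative choices.

```r
days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
          "Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")

# Calendar order, written out by hand (assumes the week starts on Monday)
week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

days_factor <- factor(days, levels = week)

# Summaries follow the level order, not the alphabetical order
day_counts <- table(days_factor)
day_counts
```

The table lists the days from Monday to Sunday, which is usually what you want in a bar plot as well.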

Note that the levels function returns the levels of a factor as a character vector.

levels(my_factor)
"Friday"  "Monday"  "Saturday"  "Sunday"  "Thursday"  "Tuesday"  "Wednesday"

Convert numeric to factor in R

Suppose you have recorded the birth city of six individuals using the following codes:

  • 1: Dublin.
  • 2: London.
  • 3: Sofia.
  • 4: Pontevedra.

Hence, you will have something like the following data stored in a numeric vector:

city <- c(3, 2, 1, 4, 3, 2)

Now, you can call the factor function to convert the data into a factor, so it is categorized for further analysis.

my_factor <- factor(city)
my_factor

The output will have the following structure:

3 2 1 4 3 2
Levels: 1 2 3 4

Change factor labels of the levels

If the input vector is numeric, as in the previous section, the corresponding labels (the cities) are not shown. To solve this, pass the labels of the levels to the labels argument of the factor function, which renames the factor levels.

# Setting the labels in the corresponding order
factor_cities <- factor(city, labels = c("Dublin", "London", "Sofia", "Pontevedra"))

# Print the result
factor_cities
Sofia London Dublin Pontevedra Sofia London    
Levels: Dublin London Sofia Pontevedra   # <- Dublin: 1, London: 2, Sofia: 3, Pontevedra: 4

In the previous code block you can see the final output. As you can observe, now the data is categorized using the cities as labels.
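
Passing only labels relies on every code appearing in the data. If one code is absent, factor rejects the labels vector because its length no longer matches the number of distinct values. A minimal sketch of a more explicit variant, which pairs each code with its label through both levels and labels, is shown below. The vector city_sample is a hypothetical sample introduced for this example.

```r
# Hypothetical sample in which code 4 (Pontevedra) does not occur
city_sample <- c(3, 2, 1, 3, 2)
city_names  <- c("Dublin", "London", "Sofia", "Pontevedra")

# 'levels' lists every possible code, 'labels' names them in the same order
cities <- factor(city_sample, levels = 1:4, labels = city_names)
cities

# The unused level is kept, so it shows up with a count of 0
table(cities)
```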

Difference between levels and labels in R

It is common to confuse the labels and levels arguments of the factor function. Consider the following vector with a single group and create a factor from it with the default arguments:

gender <- c("female", "female", "female", "female")
factor(gender)
female  female  female  female
Levels: female

On the one hand, the labels argument allows you to modify the names of the factor levels. Hence, the labels argument is related to the output. Note that the length of the vector passed to the labels argument must match the number of unique groups of the input vector.

factor(gender, labels = c("f"))
f f f f
Levels: f

On the other hand, the levels argument is related to input. This argument allows you to specify how the levels are coded. Moreover, this argument allows you to add new levels to the factor:

factor(gender, levels = c("male", "female"))
female female female female
Levels: male female

Note that the levels you specify must include every value present in the input vector. Any value that is not among the levels becomes NA:

factor(gender, levels = c("male", "f"))
<NA> <NA> <NA> <NA>
Levels: male f
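
The two arguments can also be combined: levels describes how the input is coded and labels describes how the result is displayed. The sketch below uses a hypothetical responses vector with a deliberate typo to show one way of counting values that did not match any level.

```r
# Hypothetical coded responses; "x" is a deliberate typo
responses <- c("f", "m", "f", "x")

gender_factor <- factor(responses,
                        levels = c("f", "m"),          # how the input is coded
                        labels = c("female", "male"))  # how the output is shown
gender_factor

# Values that are not listed in 'levels' turn into NA, so count them
n_unmatched <- sum(is.na(gender_factor) & !is.na(responses))
n_unmatched
```

Checking the number of unmatched values after the conversion is a cheap way to catch misspelled categories.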

Relevel and reorder factor levels

You may be wondering how to change the levels order (which can be important, for instance, in some graphical representations). The factor levels order can be changed in various ways, described in the following subsections.

Custom order of factor levels

If you want to create a custom order for the levels, create a vector with the desired order and pass it to the levels argument.

# Create a vector with the desired order
order <- c("London", "Sofia", "Dublin", "Pontevedra")

# Indicate the order in the 'levels' argument
factor_cities <- factor(factor_cities, levels = order)
factor_cities
Sofia London Dublin Pontevedra Sofia London
Levels: London Sofia Dublin Pontevedra                  # <- Ordered as specified

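In addition, you can order the levels of the factor alphabetically by sorting them with the sort function and passing the result to the levels argument. Use levels rather than labels here: labels would rename the existing levels and change which city each observation belongs to.

# Alphabetical order
factor(factor_cities, levels = sort(levels(factor_cities)))
Sofia London Dublin Pontevedra Sofia London
Levels: Dublin London Pontevedra Sofia                  # <- Alphabetical order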

Reorder factor levels

The reorder function is designed to order the levels of a factor based on a statistical measure of another variable. To demonstrate, consider a data frame where each row represents an individual, the ‘city’ column represents the city where they were born and the ‘salary’ column represents their current annual wage in thousands of dollars. The salaries are sampled at random, so the exact values you obtain can differ from those shown depending on your R version.

set.seed(1)
df <- data.frame(city = factor_cities, salary = sample(20:50, 6))
df
       city    salary
1      Sofia     28
2     London     31
3     Dublin     36
4 Pontevedra     45
5      Sofia     25
6     London     43

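You can reorder the factor levels based, for example, on the mean salary of each city using the reorder function as follows. The result carries a "scores" attribute with the mean salary of each level:

reorder(df$city, df$salary, mean)
Sofia London Dublin Pontevedra Sofia London
attr(,"scores")
    London      Sofia     Dublin Pontevedra 
      37.0       26.5       36.0       45.0 
Levels: Sofia Dublin London Pontevedra    # <- Ordered from lower to higher mean salary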
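
The summary function and the direction of the ordering can both be changed. The sketch below uses a hypothetical data frame named wages, with the salary values typed in by hand so that the result does not depend on random sampling. Negating the variable is one simple way to get a descending order.

```r
# Hand-typed data (an assumption), so the result is reproducible
wages <- data.frame(
  city   = factor(c("Sofia", "London", "Dublin", "Pontevedra", "Sofia", "London")),
  salary = c(28, 31, 36, 45, 25, 43)
)

# Ascending by median salary
by_median <- reorder(wages$city, wages$salary, FUN = median)
levels(by_median)

# Descending by mean salary: negate the variable before summarising
by_mean_desc <- reorder(wages$city, -wages$salary, FUN = mean)
levels(by_mean_desc)
```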

Reverse order of levels

Recall that you can use the levels function to obtain the levels of a factor. At this point, the levels of the factor are the following:

levels(factor_cities)
"London"  "Sofia"  "Dublin"  "Pontevedra"

With this in mind, you can reverse the order of the levels of a factor by reversing them with the rev function and passing the result to the levels argument:

factor(factor_cities, levels = rev(levels(factor_cities)))
Sofia  London  Dublin  Pontevedra  Sofia  London    
Levels: Pontevedra Dublin Sofia London     # <- Reversed order
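
A quick way to check that a reordering did not alter the data is to compare the values before and after. The following sketch, built on a small hypothetical factor called cities, contrasts passing the reversed vector to levels with passing it to labels.

```r
# Small hypothetical factor with a custom level order
cities <- factor(c("Sofia", "London", "Dublin"),
                 levels = c("London", "Sofia", "Dublin"))

# Reordering through 'levels' keeps every observation intact
reversed <- factor(cities, levels = rev(levels(cities)))
same_values <- identical(as.character(reversed), as.character(cities))
same_values

# Passing the same vector to 'labels' renames the levels instead
relabelled <- factor(cities, labels = rev(levels(cities)))
renamed_values <- !identical(as.character(relabelled), as.character(cities))
renamed_values
```

Both checks return TRUE: the first call only changes the level order, while the second one changes the values themselves.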

Relevel function

Moreover, if you want to move just one level to the first position you can use the relevel function. For example, if you want the level ‘Dublin’ to appear first while keeping the order of the others, you can use:

# Setting the level 'Dublin' first
factor_cities <- relevel(factor_cities, ref = "Dublin")
factor_cities
Sofia  London  Dublin  Pontevedra  Sofia  London    
Levels: Dublin London Sofia Pontevedra

In the following sections we will review how to convert factors to other data types in the most efficient way.

Convert factor in R to numeric

If you have a factor in R that you want to convert to numeric, the most efficient way is shown in the following code block: convert the levels to numeric with as.numeric and index the result by the factor, which uses its internal integer codes as positions.

my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)

as.numeric(levels(my_factor))[my_factor]
0 2 0 5 1 9 9 4

If you want to recover the original vector (in the same order), never use as.numeric(my_factor): it returns the internal integer codes of the levels (here 1 3 1 5 2 6 6 4) instead of the original values.
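
The difference between the approaches can be seen side by side in the short sketch below. The variable names codes, restored and also_ok are illustrative choices.

```r
my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)

# Pitfall: returns the position of each value among the levels
codes <- as.numeric(my_factor)

# Recommended: convert the levels once, then index them by the factor
restored <- as.numeric(levels(my_factor))[my_factor]

# Alternative with the same result: go through character first
also_ok <- as.numeric(as.character(my_factor))

identical(restored, my_data)
identical(also_ok, my_data)
```

The alternative converts every element to character rather than only the levels, which is why the indexing form is the more efficient of the two.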

Convert factor to string

You may need to convert a factor to string. For that purpose, you can make use of the as.character function.

my_factor_2 <- factor(c("June", "July", "January", "June"))

as.character(my_factor_2)
"June"  "July"  "January"  "June"

Note that the levels function, in contrast, returns a character vector with only the unique strings, ordered alphabetically, as shown in one of the previous sections.

levels(my_factor_2)
"January"  "July"  "June" 

Convert factor to date

Also, if you need to convert your factor to a date, you can use the as.Date function, specifying the date format you are working with in the format argument.

my_date_factor <- factor(c("03/21/2020",
                           "03/22/2020",
                           "03/23/2020"))

as.Date(my_date_factor, format = "%m/%d/%Y")
"2020-03-21" "2020-03-22" "2020-03-23"