# Factor in R


Factors in R are used to represent categorical data. You can think of them as integer vectors in which each integer has an associated label. Note that factors with labels are preferable to plain integer vectors, as labels are self-descriptive. In this lesson you will learn how to create and work with factors in R.

## What is a factor in R programming?

A factor in R is a data structure used to represent a vector as categorical data. A factor object can therefore take only a bounded number of different values, called levels. Factors are very useful when working with character columns of data frames, when creating barplots and when computing statistical summaries for categorical variables.
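
As a minimal sketch of the "integer vector with labels" idea, the following block builds a factor from a made-up `sizes` vector (an illustrative name and data set, not part of this lesson's examples) and inspects its levels and the integer codes stored underneath.

```r
# A small character vector with repeated categories (illustrative data)
sizes <- c("small", "large", "medium", "small", "large")
f <- factor(sizes)

levels(f)      # the distinct categories, sorted alphabetically
as.integer(f)  # the integer code stored for each element
nlevels(f)     # number of levels
table(f)       # number of observations per level
```

Each element is stored as the position of its category within `levels(f)`, which is why a factor behaves like a labelled integer vector.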

## The factor function

The `factor` function allows you to create factors in R. In the following block we show the arguments of the function with a summarized description.

``````factor(x = character(),         # Input vector data
levels,                  # Input of unique x values (optional)
labels = levels,         # Output labels for the levels (optional)
exclude = NA,            # Values to be excluded from levels
ordered = is.ordered(x), # Whether the input levels are ordered as given or not
nmax = NA)               # Maximum number of levels``````

You can get a more detailed description of the function and its arguments by calling `?factor` or `help(factor)`.
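
To make the `ordered` and `exclude` arguments more concrete, here is a short sketch using a hypothetical `ratings` vector (the data and variable names are assumptions made for illustration only).

```r
# Illustrative ratings, including one missing value
ratings <- c("low", "high", "medium", "low", NA)

# Ordered factor: the order given in 'levels' defines low < medium < high
ord <- factor(ratings, levels = c("low", "medium", "high"), ordered = TRUE)
ord
ord < "high"   # comparisons are allowed for ordered factors

# exclude = NULL keeps NA as a level instead of dropping it
with_na <- factor(ratings, exclude = NULL)
levels(with_na)
```

With the default `exclude = NA`, missing values are not treated as a level, so `levels(ord)` only contains the three ratings.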

## Convert character to factor in R

Now we will review an example where the input is a character vector. Suppose, for instance, that you have a vector containing the days of the week on which some event happened. You can convert this character vector to a factor with the `factor` function.

``````days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
"Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")

# Levels in alphabetical order
my_factor <- factor(days)
my_factor``````
``````Friday  Tuesday  Thursday  Monday  Wednesday  Monday
Wednesday  Monday  Monday  Wednesday  Sunday  Saturday
Levels: Friday Monday Saturday Sunday Thursday Tuesday Wednesday``````

By default, converting a character vector to factor will order the levels alphabetically.

If you want to preserve the order of the levels as they appear in the input data, pass the unique values of the input to the `levels` argument:

``factor(days, levels = unique(days))``
``````Friday  Tuesday  Thursday  Monday  Wednesday  Monday
Wednesday  Monday  Monday  Wednesday  Sunday  Saturday
Levels: Friday Tuesday Thursday Monday Wednesday Sunday Saturday``````

Note that you can retrieve the levels of a factor as a character vector with the `levels` function.

``levels(my_factor)``
``"Friday"  "Monday"  "Saturday"  "Sunday"  "Thursday"  "Tuesday"  "Wednesday"``
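
Neither alphabetical order nor order of appearance matches the calendar. As a sketch, assuming the week starts on Monday, you could pass the days in calendar order to `levels` (the `week` and `days_factor` names are introduced here for illustration):

```r
days <- c("Friday", "Tuesday", "Thursday", "Monday", "Wednesday", "Monday",
          "Wednesday", "Monday", "Monday", "Wednesday", "Sunday", "Saturday")

# Calendar order (assumed here to start on Monday)
week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

days_factor <- factor(days, levels = week)
levels(days_factor)
table(days_factor)   # counts are reported in calendar order
```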

## Convert numeric to factor in R

Suppose you have registered the birth city of six individuals with the following codification:

• 1: Dublin.
• 2: London.
• 3: Sofia.
• 4: Pontevedra.

Hence, you will have something like the following data stored in a numeric vector:

``city <- c(3, 2, 1, 4, 3, 2)``

Now, you can call the `factor` function to convert the data into a factor and get it categorized for further analysis.

``````my_factor <- factor(city)
my_factor``````

The output will have the following structure:

``````3 2 1 4 3 2
Levels: 1 2 3 4``````

## Change factor labels of the levels

If the input vector is numeric, as in the previous section, the corresponding label (the city) is not reflected in the output. To solve this, you can rename the factor levels by passing the corresponding labels to the `labels` argument of the `factor` function. The labels are matched to the levels by position, so they must be given in the same order as the sorted levels.

``````# Setting the labels in the corresponding order
factor_cities <- factor(city, labels = c("Dublin", "London", "Sofia", "Pontevedra"))

# Print the result
factor_cities``````
``````Sofia London Dublin Pontevedra Sofia London
Levels: Dublin London Sofia Pontevedra   # <- Dublin: 1, London: 2, Sofia: 3, Pontevedra: 4``````

In the previous code block you can see the final output. As you can observe, now the data is categorized using the cities as labels.
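
Matching labels by position only works when every code appears in the data. As a hedged sketch, the hypothetical `city_subset` vector below lacks codes 1 and 4, so the full set of codes is stated explicitly in `levels` and paired with `labels`:

```r
# Hypothetical sample in which code 1 (Dublin) and code 4 (Pontevedra) do not occur
city_subset <- c(3, 2, 3, 2)

cities <- factor(city_subset,
                 levels = 1:4,
                 labels = c("Dublin", "London", "Sofia", "Pontevedra"))
cities
table(cities)  # absent cities are kept as levels with a count of 0
```

Stating both arguments makes the code-to-city mapping explicit and independent of which values happen to be present.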

## Difference between levels and labels in R

It is common to confuse the `labels` and `levels` arguments of the R `factor` function. Consider the following vector with a single group and create a factor from it with default arguments:

``````gender <- c("female", "female", "female", "female")
factor(gender)``````
``````female  female  female  female
Levels: female``````

On the one hand, the `labels` argument allows you to modify the names of the factor levels. Hence, the `labels` argument is related to the output. Note that the vector passed to the `labels` argument must have as many elements as there are levels, that is, unique groups in the input vector.

``factor(gender, labels = c("f"))``
``````f f f f
Levels: f``````

On the other hand, the `levels` argument is related to input. This argument allows you to specify how the levels are coded. Moreover, this argument allows you to add new levels to the factor:

``factor(gender, levels = c("male", "female"))``
``````female female female female
Levels: male female``````

Note that the `levels` argument must include every value that appears in the input vector. Any value that is not listed among the levels is converted to `NA`:

``factor(gender, levels = c("male", "f"))``
``````<NA> <NA> <NA> <NA>
Levels: male f``````
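
The two arguments can also be combined. The following sketch uses a hypothetical `gender_mixed` vector (not the `gender` vector above) to show `levels` describing the input and `labels` describing the output:

```r
# Illustrative input with a value ("other") that is not listed in 'levels'
gender_mixed <- c("female", "male", "female", "other")

# 'levels' lists the values expected in the input (and their order);
# 'labels' gives the names shown in the output, position by position
g <- factor(gender_mixed,
            levels = c("male", "female"),
            labels = c("M", "F"))
g
sum(is.na(g))  # values missing from 'levels' become NA
```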

## Relevel and reorder factor levels

You may be wondering how to change the levels order (which can be important, for instance, in some graphical representations). The factor levels order can be changed in various ways, described in the following subsections.

### Custom order of factor levels

If you want to create a custom order for the levels, you will have to create a vector with the desired order and pass it to the `levels` argument.

``````# Create a vector with the desired order
# (named 'level_order' so it does not mask the base function order())
level_order <- c("London", "Sofia", "Dublin", "Pontevedra")

# Indicate the order in the 'levels' argument
factor_cities <- factor(factor_cities, levels = level_order)
factor_cities``````
``````Sofia London Dublin Pontevedra Sofia London
Levels: London Sofia Dublin Pontevedra                  # <- Ordered as specified``````

In addition, you can order the levels of the factor alphabetically by passing the sorted levels to the `levels` argument. Use `levels` here rather than `labels`, since `labels` would rename the existing levels and therefore change the data:

``````# Alphabetical order
factor(factor_cities, levels = sort(levels(factor_cities)))``````
``````Sofia  London  Dublin  Pontevedra  Sofia  London
Levels: Dublin London Pontevedra Sofia                  # <- Alphabetical order``````
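
When reordering, it is worth checking that the observations themselves are unchanged. The sketch below rebuilds the cities factor from the numeric codes and compares the effect of passing sorted names to `levels` versus `labels` (the `by_levels` and `by_labels` names are introduced here for illustration):

```r
city <- c(3, 2, 1, 4, 3, 2)
factor_cities <- factor(city, labels = c("Dublin", "London", "Sofia", "Pontevedra"))

# Reordering through 'levels' keeps every observation intact
by_levels <- factor(factor_cities, levels = sort(levels(factor_cities)))
identical(as.character(by_levels), as.character(factor_cities))  # TRUE

# Passing the sorted names to 'labels' renames the levels position by position,
# so the observations themselves change
by_labels <- factor(factor_cities, labels = sort(levels(factor_cities)))
identical(as.character(by_labels), as.character(factor_cities))  # FALSE
```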

### Reorder factor levels

The `reorder` function is designed to order the levels of a factor based on a statistical measure of another variable. To demonstrate, consider a data frame where each row represents an individual, the ‘city’ column contains the city where they were born and the ‘salary’ column contains their current annual wage in thousands of dollars.

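``````set.seed(1)
# Note: the values drawn by sample() depend on the R version
# (the default sampling algorithm changed in R 3.6.0)
df <- data.frame(city = factor_cities, salary = sample(20:50, 6))
df``````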
``````       city    salary
1      Sofia     28
2     London     31
3     Dublin     36
4 Pontevedra     45
5      Sofia     25
6     London     43``````

You can reorder the factor based, for example, on the mean wage of the individuals using the `reorder` function as follows:

``reorder(df$city, df$salary, mean)``
``````Dublin   London   Sofia    Pontevedra
36.0     37.0     26.5       45.0
Levels: Sofia Dublin London Pontevedra    # <- Ordered from lower to higher salary ``````
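
As a self-contained sketch of the same idea, the block below types in the salaries shown in the table by hand, so the result does not depend on random sampling, and also shows one way to get a descending order by negating the variable. The data frame name `df_fixed` is introduced here for illustration.

```r
# Salaries typed in by hand so the example does not depend on random sampling
df_fixed <- data.frame(
  city   = factor(c("Sofia", "London", "Dublin", "Pontevedra", "Sofia", "London")),
  salary = c(28, 31, 36, 45, 25, 43)
)

tapply(df_fixed$salary, df_fixed$city, mean)                  # mean salary per city

asc  <- reorder(df_fixed$city, df_fixed$salary, FUN = mean)   # ascending by mean
desc <- reorder(df_fixed$city, -df_fixed$salary, FUN = mean)  # descending by mean

levels(asc)
levels(desc)
```

Only the order of the levels changes. The observations in the `city` column stay as they were.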

### Reverse order of levels

Recall that you can use the `levels` function to obtain the levels of a factor. At this point, the levels of the factor are the following:

``levels(factor_cities)``
``"London"  "Sofia"  "Dublin"  "Pontevedra"``

With this in mind, you can reverse the order of levels of a factor with the `rev` function:

``factor(factor_cities, levels = rev(levels(factor_cities)))``
``````Sofia  London  Dublin  Pontevedra  Sofia  London
Levels: Pontevedra Dublin Sofia London     # <- Reversed order``````

### Relevel function

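Moreover, if you want to move just one level to the first position you can use the `relevel` function. For example, starting from the original level order (Dublin, London, Sofia, Pontevedra), you can put the level ‘London’ first and keep the order of the others as follows:

``````# Start again from the original level order
factor_cities <- factor(city, labels = c("Dublin", "London", "Sofia", "Pontevedra"))

# Setting the level 'London' first
factor_cities <- relevel(factor_cities, ref = "London")
factor_cities``````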
``````Sofia  London  Dublin  Pontevedra  Sofia  London
Levels: London Dublin Sofia Pontevedra``````

In the following sections we will review how to convert factors to other data types in the most efficient way.

## Convert factor in R to numeric

If you have a factor in R that you want to convert to numeric, the most efficient way is illustrated in the following code block. It converts the levels to numeric with `as.numeric` and `levels`, and then indexes that numeric vector by the factor itself, so each element is replaced by the numeric value of its level.

``````my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)

as.numeric(levels(my_factor))[my_factor]``````
``0 2 0 5 1 9 9 4``

If you want to recover the original vector (with the same values in the same order), never use `as.numeric(my_factor)`, as it returns the internal integer codes of the levels (here `1 3 1 5 2 6 6 4`) rather than the original values.
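
The difference can be checked directly. The following sketch compares the three calls on the same factor. The `as.numeric(as.character())` variant is a common alternative that is not part of the original example, shown here for comparison.

```r
my_data <- c(0, 2, 0, 5, 1, 9, 9, 4)
my_factor <- factor(my_data)

as.numeric(my_factor)                     # internal codes: 1 3 1 5 2 6 6 4
as.numeric(as.character(my_factor))       # original values: 0 2 0 5 1 9 9 4
as.numeric(levels(my_factor))[my_factor]  # same result, converting only the levels
```

Indexing the converted levels only has to convert one value per level, whereas `as.numeric(as.character())` converts every element.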

## Convert factor to string

You may need to convert a factor to string. For that purpose, you can make use of the `as.character` function.

``````my_factor_2 <- factor(c("June", "July", "January", "June"))

as.character(my_factor_2)``````
``"June"  "July"  "January"  "June"``

Note that if you use the `levels` function instead, you will get a character vector with only the unique strings, ordered alphabetically, as shown in a previous section.

``levels(my_factor_2)``
``"January"  "July"  "June" ``

## Convert factor to date

Also, if you need to convert a factor to a date, you can use the `as.Date` function, specifying in the `format` argument the date format you are working with.

``````my_date_factor <- factor(c("03/21/2020",
"03/22/2020",
"03/23/2020"))

as.Date(my_date_factor, format = "%m/%d/%Y")``````
``"2020-03-21" "2020-03-22" "2020-03-23"``
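
Once converted, the values behave like any other `Date` vector. As a brief sketch, the block below goes through `as.character` explicitly (an optional step added here for clarity) and then applies a few date operations:

```r
my_date_factor <- factor(c("03/21/2020", "03/22/2020", "03/23/2020"))

# Convert the labels to character first, then parse them as month/day/year
dates <- as.Date(as.character(my_date_factor), format = "%m/%d/%Y")

class(dates)   # "Date"
diff(dates)    # differences between consecutive dates, in days
range(dates)   # earliest and latest date
```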