Data frame in R
Are you interested in mastering R data frames? This comprehensive tutorial covers this fundamental R data structure. By the end of this post, you’ll grasp all the basic concepts required for working with data frames in R, including creating new data frames, accessing data, appending new data, and filtering or subsetting observations.
What is a data frame in R?
Data frames are the most common objects for storing data in R. Each row represents an observation (for example, an individual or a date), while each column represents a variable. Different columns can hold different data types.
Data frame or matrix?
A common question is when to use a data frame and when to use a matrix in R. The two structures are very similar, but a matrix stores a single (homogeneous) data type, whereas a data frame can store a different data type in each column (heterogeneous data). Suppose, for instance, that you have the following data:
Product <- c("Juice", "Cheese", "Yogurt")
Section <- c("Drinks", "Dairy products", "Dairy products")
Units <- c(2, 1, 10)
You could store those variables as a matrix using the cbind function:
x <- cbind(Product, Section, Units)
If you print your new variable, you will get the following output:
Product Section Units
[1,] "Juice" "Drinks" "2"
[2,] "Cheese" "Dairy products" "1"
[3,] "Yogurt" "Dairy products" "10"
However, the result is not satisfactory, as all the variables have been converted to the character class. If you use the data.frame function instead, each variable keeps its original type.
Data frames, unlike matrices, can store different types of objects.
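As a minimal sketch of the difference, the snippet below builds both structures from the same three vectors and checks the type of the Units values. The names m and df are chosen here only for illustration.

```r
Product <- c("Juice", "Cheese", "Yogurt")
Section <- c("Drinks", "Dairy products", "Dairy products")
Units <- c(2, 1, 10)

# cbind() on vectors builds a matrix, so every value is coerced to character
m <- cbind(Product, Section, Units)
class(m[, "Units"])   # "character"

# data.frame() keeps one type per column, so Units stays numeric
df <- data.frame(Product, Section, Units)
class(df$Units)       # "numeric"
```

Because Units is still numeric in the data frame, you can compute on it directly, for example with sum(df$Units).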
How to create a dataframe in R?
In R it is very straightforward to create a new data frame: you join your variables with the data.frame function. First, you need some variables stored in your workspace. In this example, we are going to define some variables of weather data. Note that all of them have the same length.
temp <- c(20.37, 18.56, 18.4, 21.96, 29.53, 28.16,
36.38, 36.62, 40.03, 27.59, 22.15, 19.85)
humidity <- c(88, 86, 81, 79, 80, 78,
71, 69, 78, 82, 85, 83)
rain <- c(72, 33.9, 37.5, 36.6, 31.0, 16.6,
1.2, 6.8, 36.8, 30.8, 38.5, 22.7)
month <- c("January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November", "December")
To join the data you can use the data.frame function. We are going to store the data frame, for instance, in a variable named data:
data <- data.frame(month = month, temperature = temp,
humidity = humidity, rain = rain)
names(data) # Names of the variables (columns)
"month" "temperature" "humidity" "rain"
First, it is very common to display the first rows to make some checks. For that purpose you can use the head function, which by default shows the first 6 rows of your data frame.
# First rows of our dataset
head(data)
month temperature humidity rain
1 January 20.37 88 72.0
2 February 18.56 86 33.9
3 March 18.40 81 37.5
4 April 21.96 79 36.6
5 May 29.53 80 31.0
6 June 28.16 78 16.6
Second, you could use the summary function, which returns a statistical summary of the variables (columns) of the dataset.
summary(data)
month temperature humidity rain
April :1 Min. :18.40 Min. :69.0 Min. : 1.20
August :1 1st Qu.:20.24 1st Qu.:78.0 1st Qu.:21.18
December:1 Median :24.87 Median :80.5 Median :32.45
February:1 Mean :26.63 Mean :80.0 Mean :30.37
January :1 3rd Qu.:31.24 3rd Qu.:83.5 3rd Qu.:36.98
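Two other quick checks are the dimensions and the type of each column. The sketch below uses a shortened copy of the weather data, stored under the illustrative name weather. Note that the type reported for month depends on the R version: before R 4.0.0, data.frame converted strings to factors by default, which is why a summary may list month as counts per level.

```r
# Shortened copy of the weather example (first three months only)
weather <- data.frame(
  month = c("January", "February", "March"),
  temperature = c(20.37, 18.56, 18.4),
  humidity = c(88, 86, 81),
  rain = c(72, 33.9, 37.5)
)

dim(weather)   # number of rows and columns: 3 4
str(weather)   # compact listing of every column and its type
```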
You can also make use of the example datasets that R provides. To list them, call the data function:
data()
Once executed, a window with a list of available datasets will open:
Data sets in package "datasets":
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
…
Now you can load any of them by typing:
data(name_of_dataset)
As an example, if you want to load the ‘AirPassengers’ dataset into the workspace you can write:
data(AirPassengers)
Create empty dataframe in R
Sometimes you want to initialize an empty data frame and fill it later, for example inside a loop. The recommended way is to call the data.frame function with empty, typed variables. The following code block shows that approach and several alternatives.
# Empty variables
dataset <- data.frame(month = character(),
temperature = numeric(),
rain = numeric(),
humidity = numeric())
# Copy the structure of other dataset
dataset <- data[FALSE, ] # We created the dataframe 'data' before
# Converting a matrix to data.frame and assigning column names
dataset <- data.frame(matrix(ncol = 4, nrow = 0))
column_names <- c("month", "temperature", "rain", "humidity")
colnames(dataset) <- column_names
# Equivalent to the last option
dataset <- data.frame(matrix(ncol = 4, nrow = 0,
dimnames = list(NULL, c("month", "temperature",
"rain", "humidity"))))
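The following is a minimal sketch of filling such an empty data frame inside a loop, reusing the first three values of the weather example. The helper names months, temps and new_row are introduced here only for illustration.

```r
# Start from an empty data frame with typed columns
dataset <- data.frame(month = character(),
                      temperature = numeric())

months <- c("January", "February", "March")
temps <- c(20.37, 18.56, 18.4)

for (i in seq_along(months)) {
  # Build a one-row data frame with the same column names and append it
  new_row <- data.frame(month = months[i], temperature = temps[i])
  dataset <- rbind(dataset, new_row)
}

nrow(dataset)   # 3
```

Growing a data frame row by row copies it on every iteration, so for large loops it is usually faster to collect the values in vectors first and call data.frame once at the end.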
Accessing data frame data
There are several ways to access the columns stored in data frame objects:
- Using the dollar sign ($) and the name of the column.
- Using square brackets with the index of the column after the comma.
As an example, if you want to select the month column of the data frame you created, just call the following:
data$month
data[, 1] # Equivalent
You can also select several variables at once. For that purpose you can:
- Create a sequence of indices.
- Create a vector with the c function containing the names or indices of the variables you want to select.
# Selecting columns 1 to 3 with a sequence
data[, 1:3]
# Selecting columns with c function
data[, c("temperature", "rain")]
data[, c(2, 4)] # Equivalent
Similarly, you can access rows of a data frame: data[1, ] selects the first row and data[1:2, ] selects the first and second rows. You can also select individual data points by specifying rows and columns at once:
# Data point of the first
# row and second column
data[1, 2]
# First and second row
# of the second column
data[1:2, 2]
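One detail worth checking when you select a single column is the type of the result. As a small sketch on a two-row data frame (named df here for illustration), bracket indexing returns a plain vector for one column unless you pass drop = FALSE:

```r
df <- data.frame(month = c("January", "February"),
                 temperature = c(20.37, 18.56))

is.data.frame(df[, 2])                 # FALSE: a plain numeric vector
is.data.frame(df[, 2, drop = FALSE])   # TRUE: still a data frame
is.data.frame(df[, 1:2])               # TRUE: several columns

# Double brackets with a name give the same vector as the dollar sign
df[["temperature"]]
```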
Direct access using attach function
If you do not want to write the name of the data frame again and again, you can attach it with the attach function to access its variables directly:
attach(data)
temperature # Now you have direct access of the variables
If you want to disable the direct access, you just have to use the detach function:
detach(data)
temperature # You can't access this variable. An error will show up
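The attach and detach round trip can be sketched as follows, using a small data frame with the illustrative names weather and temp_c. The last line shows base R's with function, which evaluates an expression inside a data frame without attaching it; it is offered here as one possible alternative, not as the only way to do this.

```r
weather <- data.frame(temp_c = c(20.37, 18.56, 18.4))

attach(weather)
mean(temp_c)       # columns are visible by name while attached
detach(weather)

exists("temp_c")   # FALSE, assuming no other object has that name

# with() evaluates the expression using the columns of weather
with(weather, mean(temp_c))
```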
Add columns and rows to dataframe in R
Sometimes you need to modify your data in order to append new rows or columns or to delete them. For the following examples we will be using the cars dataset, recorded in the 1920s, from the R example datasets. You can load it by running data(cars). The dataset contains 50 rows and 2 variables:
- speed: numeric speed (mph).
- dist: numeric stopping distance (ft).
If you call head(cars) in the console you can see the following output:
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
Suppose you want to create new variables that convert the speed to kilometers per hour (kph) and the distance to meters. Recall that:
kilometers = miles / 0.62137 and meters = feet / 3.2808
So now you could add two new columns called kph (kilometers per hour) and meters with the following code:
cars$kph <- cars$speed / 0.62137
cars$meters <- cars$dist / 3.2808
You could also use the cbind function to add the columns. To add a new row, you could use the rbind function instead.
kph <- cars$speed / 0.62137
meters <- cars$dist / 3.2808
cars <- cbind(cars[, c(1, 2)], kph, meters)
Resulting in:
speed dist kph meters
1 4 2 6.437388 0.6096074
2 4 10 6.437388 3.0480371
3 7 4 11.265430 1.2192148
4 7 22 11.265430 6.7056815
5 8 16 12.874777 4.8768593
6 9 10 14.484124 3.0480371
Append new rows with the rbind function and new columns with the cbind function.
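As a minimal sketch of appending a row with rbind, the code below works on a copy of the built-in cars data. The name cars_copy and the values in new_row are made up for illustration.

```r
cars_copy <- datasets::cars   # work on a copy of the built-in data

# The new row must use the same column names as the data frame
new_row <- data.frame(speed = 10, dist = 20)   # made-up values
cars_copy <- rbind(cars_copy, new_row)

nrow(cars_copy)      # 51
tail(cars_copy, 1)   # shows the appended row
```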
Delete columns and rows of a dataframe
Now, if you want to remove variables or rows of a data frame, you have several options:
- Use the minus sign (-) and indicate the columns or rows you want to delete.
- Create a subset of the data you want to keep.
As an example, we will delete the speed and dist columns. We will save the result in a new data frame called cars2 to avoid overwriting the original dataset.
# Delete with the - sign the first and second column
cars2 <- cars[, -c(1, 2)]
# Select only the columns we want
cars2 <- cars[, c("kph", "meters")]
If you use the head function again, you can see the new data frame.
head(cars2)
kph meters
1 6.437388 0.6096074
2 6.437388 3.0480371
3 11.265430 1.2192148
4 11.265430 6.7056815
5 12.874777 4.8768593
6 14.484124 3.0480371
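The same two options apply to rows if you put the indices before the comma. A short sketch on a copy of the built-in cars data follows; the names cars_copy, without_first_two and first_ten are illustrative.

```r
cars_copy <- datasets::cars

# Drop the first and second rows with negative indices
without_first_two <- cars_copy[-c(1, 2), ]
nrow(without_first_two)   # 48

# Or keep only the rows you want
first_ten <- cars_copy[1:10, ]
nrow(first_ten)           # 10
```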
Sorting and filtering data of dataframe in R
It is usual to sort or filter the data inside data frames by the values of some variable.
Sorting dataframes
Consider, for instance, the data in the mtcars dataset, which you can load with data(mtcars). You can obtain the sorting index of any variable with the order function.
ii <- order(mtcars$hp) # Sorting index with the hp variable
The resulting index vector gives the order in which the rows of the data frame have to be selected to obtain the desired ordering.
# Sorting by hp (lower to higher)
# We only show the first 4 columns
head(mtcars[ii, 1:4])
mpg cyl disp hp
Honda Civic 30.4 4 75.7 52
Merc 240D 24.4 4 146.7 62
Toyota Corolla 33.9 4 71.1 65
Fiat 128 32.4 4 78.7 66
Fiat X1-9 27.3 4 79.0 66
Porsche 914-2 26.0 4 120.3 91
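To see what the sorting index contains, a small vector is easier to follow than a full dataset. In this sketch, x is an arbitrary example vector:

```r
x <- c(30, 10, 20)

ii <- order(x)
ii      # 2 3 1: the smallest value is at position 2, then 3, then 1
x[ii]   # 10 20 30: indexing with ii returns the sorted values
```

Indexing the rows of a data frame with such a vector, as in mtcars[ii, ], reorders whole rows in the same way.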
You can also sort from higher to lower making use of the minus sign.
ii <- order(-mtcars$hp)
head(mtcars[ii, 1:4])
mpg cyl disp hp
Maserati Bora 15.0 8 301 335
Ford Pantera L 15.8 8 351 264
Duster 360 14.3 8 360 245
Camaro Z28 13.3 8 350 245
Chrysler Imperial 14.7 8 440 230
Lincoln Continental 10.4 8 460 215
In addition, you can combine several ordering criteria: order by one variable and, in case of ties, order by another one. In the following example we order the data frame by the variable named cyl and then by the variable hp.
ii <- order(mtcars$cyl, mtcars$hp)
head(mtcars[ii, 1:4])
mpg cyl disp hp
Honda Civic 30.4 4 75.7 52
Merc 240D 24.4 4 146.7 62
Toyota Corolla 33.9 4 71.1 65
Fiat 128 32.4 4 78.7 66
Fiat X1-9 27.3 4 79.0 66
Porsche 914-2 26.0 4 120.3 91
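The minus sign can also be combined with several criteria. As a sketch, the code below sorts by cyl in ascending order and, within each number of cylinders, by hp from higher to lower. The name sorted is illustrative.

```r
# Ascending cyl, then descending hp within each cyl group
ii <- order(mtcars$cyl, -mtcars$hp)
sorted <- mtcars[ii, c("cyl", "hp")]
head(sorted, 3)

# For a single variable, decreasing = TRUE is an alternative to the minus sign
hp_desc <- mtcars$hp[order(mtcars$hp, decreasing = TRUE)]
head(hp_desc, 3)
```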
Filtering data frames
Filtering a data frame consists of obtaining a subsample that meets some conditions. For this purpose, you can use the subset function to subset data frames by column values. We will provide some examples based on the mtcars dataset.
Subset of the dataset where the number of cylinders of the car is exactly 6 and the horsepower is greater than 110.
subset(mtcars, cyl == 6 & hp > 110)
mpg cyl disp hp drat wt qsec vs am gear carb
Merc 280 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.77 15.5 0 1 5 6
The same as the previous example, but showing only some variables (mpg, cyl and disp) by making use of the select argument.
subset(mtcars, cyl == 6 & hp > 110, select = c(mpg, cyl, disp))
mpg cyl disp
Merc 280 19.2 6 167.6
Merc 280C 17.8 6 167.6
Ferrari Dino 19.7 6 145.0
Now, instead of using the AND condition we will use the OR condition. In this case, we will select the cars where the variable wt is less than 2 or the variable hp is greater than 210.
subset(mtcars, wt < 2 | hp > 210)
mpg cyl disp hp drat wt qsec vs am gear carb
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
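The same filters can be written with square brackets and a logical condition before the comma. As a sketch, the code below repeats the first example (6 cylinders and more than 110 horsepower) both ways; the names by_subset and by_index are illustrative.

```r
by_subset <- subset(mtcars, cyl == 6 & hp > 110)

# With brackets, columns must be referenced through the data frame name
by_index <- mtcars[mtcars$cyl == 6 & mtcars$hp > 110, ]

rownames(by_index)   # "Merc 280" "Merc 280C" "Ferrari Dino"
```

One difference to keep in mind: subset drops rows where the condition evaluates to NA, while bracket indexing returns a row of NA values for them. The mtcars dataset has no missing values, so both calls select the same cars here.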