DataFrame in R

Create, modify and subset R data frames

Do you want to learn all about R data frames? This is a full tutorial on this R data structure. By the end of this post you will understand the basic concepts needed to work with a data frame in R: how to create new data frames, access the data, append new data, and filter or subset the observations.

What is a data frame in R?

Data frames are the most common objects for storing data in R. In this type of object each row corresponds to an individual or observation and each column corresponds to a variable. Different columns of the same data frame can hold different data types.

Data frame or matrix?

A common question is when you should use a data frame and when a matrix in R. Data frames are very similar to matrices, but their columns can have different data types: matrices store homogeneous data, whereas data frames store heterogeneous data. Suppose, for instance, that you have the following data:

Product <- c("Juice", "Cheese", "Yogurt")
Section <- c("Drinks", "Dairy products", "Dairy products")
Units <- c(2, 1, 10)

You could store those variables as a matrix using the cbind function:

x <- cbind(Product, Section, Units)

If you print your new variable, you will get the following output:

      Product      Section        Units
 [1,] "Juice"    "Drinks"          "2"
 [2,] "Cheese"   "Dairy products"  "1"
 [3,] "Yogurt"   "Dairy products"  "10"

However, you may have noticed that the result is not satisfactory, as all the variables have been transformed to character class. If you use the data.frame function, you will keep the original type of the variables.

Data frames, unlike matrices, can store different types of objects.
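
As a minimal sketch of that difference, the following block rebuilds the same three variables with both functions and compares the resulting types. The name products is an assumption introduced here for illustration.

```r
# Same variables as above
Product <- c("Juice", "Cheese", "Yogurt")
Section <- c("Drinks", "Dairy products", "Dairy products")
Units <- c(2, 1, 10)

# As a matrix: every value is coerced to character
x <- cbind(Product, Section, Units)
class(x[, "Units"])      # "character"

# As a data frame ('products' is a hypothetical name): Units stays numeric
products <- data.frame(Product, Section, Units)
class(products$Units)    # "numeric"
sapply(products, class)  # Class of every column
```

Note that the class of the text columns depends on the R version: since R 4.0.0 they are kept as character, while earlier versions convert them to factors unless you set stringsAsFactors = FALSE.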

How to create a dataframe in R?

In R it is very straightforward to create a new data frame: you can join your variables with the data.frame function. First, you need some variables stored in your workspace. In this example, we are going to define some variables of weather data. Note that all of them have the same length.

temp <- c(20.37, 18.56, 18.4, 21.96, 29.53, 28.16,
          36.38, 36.62, 40.03, 27.59, 22.15, 19.85)
humidity <- c(88, 86, 81, 79, 80, 78,
              71, 69, 78, 82, 85, 83)
rain <- c(72, 33.9, 37.5, 36.6, 31.0, 16.6,
          1.2, 6.8, 36.8, 30.8, 38.5, 22.7)
month <- c("January", "February", "March", "April", "May", "June",
           "July", "August", "September", "October", "November", "December")

To join the data you can use the data.frame function. We are going to store the dataframe, for instance, in a variable named data:

data <- data.frame(month = month, temperature = temp,
                   humidity = humidity, rain = rain)
names(data) # Names of the variables (columns)
"month"  "temperature"  "humidity"  "rain" 

First, it is very common to display the first values to make some checks. For that purpose you can make use of the head function in R, which by default will show the first 6 rows of your dataframe.

# First rows of our dataset
head(data)
     month temperature humidity  rain
1  January       20.37       88  72.0
2 February       18.56       86  33.9
3    March       18.40       81  37.5
4    April       21.96       79  36.6
5      May       29.53       80  31.0
6     June       28.16       78  16.6

Second, you could make use of the summary function that will return a statistical summary of the variables (columns) of the dataset.

summary(data)
    month    temperature       humidity         rain
April   :1   Min.   :18.40   Min.   :69.0   Min.   : 1.20
August  :1   1st Qu.:20.24   1st Qu.:78.0   1st Qu.:21.18
December:1   Median :24.87   Median :80.5   Median :32.45
February:1   Mean   :26.63   Mean   :80.0   Mean   :30.37
January :1   3rd Qu.:31.24   3rd Qu.:83.5   3rd Qu.:36.98
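
Beyond head and summary, a few other base R functions are handy for these first checks. The sketch below uses a shortened, three-month copy of the weather data, stored in a hypothetical variable named weather so that it does not overwrite data, to show dim, nrow, ncol, str and tail.

```r
# Shortened copy of the weather data ('weather' is a hypothetical name)
weather <- data.frame(month = c("January", "February", "March"),
                      temperature = c(20.37, 18.56, 18.4),
                      humidity = c(88, 86, 81),
                      rain = c(72, 33.9, 37.5))

dim(weather)      # Number of rows and columns: 3 4
nrow(weather)     # 3
ncol(weather)     # 4
str(weather)      # Compact description of each column and its type
tail(weather, 2)  # Last two rows
```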

Besides creating your own data, you can also use the example datasets that R provides. To list them, call the data function:

data()

Once executed, a window with a list of available datasets will open:

Data sets in package "datasets":
 AirPassengers Monthly Airline Passenger Numbers 1949-1960
 BJsales Sales Data with Leading Indicator
 …

Now you can load any of them by typing:

data(name_of_dataset)

As an example, if you want to load the ‘AirPassengers’ dataset into the workspace you can write:

data(AirPassengers)
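
Keep in mind that not every example dataset is a data frame. As a quick check, assuming the standard datasets package that ships with R, you can inspect the class of an object after loading it:

```r
data(AirPassengers)
class(AirPassengers)   # "ts": a time series, not a data frame
length(AirPassengers)  # 144 monthly observations (1949-1960)

data(cars)
is.data.frame(cars)    # TRUE
```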

Create empty dataframe in R

Sometimes you want to initialize an empty data frame, with no rows, and fill it afterwards inside a loop or in any other way. In this case, the recommended approach is to call the data.frame function with empty, typed variables. The following code block shows that approach and several alternatives.

# Empty variables
dataset <- data.frame(month = character(),
                      temperature = numeric(),
                      rain = numeric(),
                      humidity = numeric())

# Copy the structure of other dataset
dataset <- data[FALSE, ] # We created the dataframe 'data' before

# Converting a matrix to data.frame and assigning column names
dataset <- data.frame(matrix(ncol = 4, nrow = 0))
column_names <- c("month", "temperature", "rain", "humidity")
colnames(dataset) <- column_names

# Equivalent to the last option
dataset <- data.frame(matrix(ncol = 4, nrow = 0,
                      dimnames = list(NULL, c("month", "temperature",
                                              "rain", "humidity"))))
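
The following is a minimal sketch of how such an empty data frame could be filled inside a loop, using only two columns and the first three months of the weather data. The names months, temps and new_row are assumptions made for this example, and stringsAsFactors = FALSE is set explicitly so the code behaves the same on R versions older than 4.0.0.

```r
# Empty data frame with typed columns
dataset <- data.frame(month = character(),
                      temperature = numeric(),
                      stringsAsFactors = FALSE)

# Values to append, one row per iteration
months <- c("January", "February", "March")
temps <- c(20.37, 18.56, 18.4)

for (i in seq_along(months)) {
  new_row <- data.frame(month = months[i],
                        temperature = temps[i],
                        stringsAsFactors = FALSE)
  dataset <- rbind(dataset, new_row)  # Append the new row
}

nrow(dataset)           # 3
sapply(dataset, class)  # Column types are preserved
```

Growing a data frame row by row is fine for small data, but it copies the object on each iteration, so for long loops it is usually faster to collect the rows in a list and bind them once at the end.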

Accessing data frame data

There are several ways to access the columns stored in data frame objects:

  1. Using the dollar sign ($) and the name of the column.
  2. Using square brackets with the index of the column after the comma.

As an example, if you want to select the month column of the dataframe you created just call the following:

data$month
data[, 1] # Equivalent
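
Note that both forms return a plain vector. As a hedged illustration, using a small hypothetical data frame named weather, the sketch below also shows double brackets, which return a vector too, and two ways of keeping the result as a one-column data frame.

```r
# Small example data frame ('weather' is a hypothetical name)
weather <- data.frame(month = c("January", "February", "March"),
                      temperature = c(20.37, 18.56, 18.4),
                      stringsAsFactors = FALSE)

weather$month        # Vector
weather[, 1]         # Vector (equivalent)
weather[["month"]]   # Vector, selecting by name with double brackets

weather["month"]            # Data frame with one column
weather[, 1, drop = FALSE]  # Data frame with one column
```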

You can also select several variables at once. For that purpose you can:

  1. Create a sequence of indices.
  2. Create a vector with the c function with the names of the variables or indexes you want to select.

# Selecting columns 1 to 3 with a sequence
data[, 1:3]

# Selecting columns with c function
data[, c("temperature", "rain")]
data[, c(2, 4)] # Equivalent

Similarly, you can access the rows of a data frame by placing the index before the comma: data[1, ] selects the first row and data[1:2, ] the first and second rows. You can also select individual data points by indexing rows and columns at once:

# Data point of the first
# row and second column
data[1, 2]

# First and second row
# of the second column
data[1:2, 2]
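
Row indices can also be combined with column names. The sketch below assumes a shortened copy of the weather data stored in a hypothetical variable named weather:

```r
# Shortened copy of the weather data ('weather' is a hypothetical name)
weather <- data.frame(month = c("January", "February", "March"),
                      temperature = c(20.37, 18.56, 18.4),
                      rain = c(72, 33.9, 37.5),
                      stringsAsFactors = FALSE)

# First two rows of the month and rain columns
weather[1:2, c("month", "rain")]

# Single value: rain in the third row
weather[3, "rain"]   # 37.5

# Last row, whatever the number of rows
weather[nrow(weather), ]
```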

Direct access using attach function

If you don’t want to write the name of the data frame again and again, you can attach it with the attach function to access its variables directly:

attach(data)
temperature # Now you have direct access of the variables

If you want to disable the direct access, you just have to use the detach function:

detach(data)
temperature # You can't access this variable. An error will show up
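
An attached data frame can silently mask, or be masked by, other variables with the same names. As an alternative that is not covered above, the base R function with evaluates a single expression using the columns of a data frame without attaching anything. A minimal sketch, with a hypothetical data frame named weather:

```r
# Small example data frame ('weather' is a hypothetical name)
weather <- data.frame(temperature = c(20.37, 18.56, 18.4),
                      humidity = c(88, 86, 81))

# Evaluate an expression using the columns of 'weather' directly
avg_temp <- with(weather, mean(temperature))
avg_temp  # 19.11
```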

Add columns and rows to dataframe in R

Sometimes you need to modify your data to append new rows or columns, or to delete them. For the following examples we will use the cars dataset, recorded in the 1920s, from the R example datasets. You can load it by running data(cars). The dataset contains 50 rows and 2 variables:

  • speed: numeric speed (mph).
  • dist: numeric stopping distance (ft).

If you call head(cars) in the console you can see the following output:

   speed  dist
 1   4     2
 2   4    10
 3   7     4
 4   7    22
 5   8    16
 6   9    10

Suppose you want to create new variables that convert the speed to kilometers per hour (kph) and the distance to meters. Recall that:

kilometers = miles / 0.62137 and meters = feet / 3.2808

So now you could add two new columns called kph (kilometers per hour) and meters with the following code:

cars$kph <- cars$speed / 0.62137
cars$meters <- cars$dist / 3.2808

You could also make use of the cbind function. If you would like to add a new row, you could use the rbind function.

kph <- cars$speed / 0.62137
meters <- cars$dist / 3.2808
cars <- cbind(cars[, c(1, 2)], kph, meters)

Resulting in:

     speed dist     kph       meters
 1     4    2    6.437388   0.6096074
 2     4   10    6.437388   3.0480371
 3     7    4   11.265430   1.2192148
 4     7   22   11.265430   6.7056815
 5     8   16   12.874777   4.8768593
 6     9   10   14.484124   3.0480371

Append new rows with the rbind function and new columns with the cbind function.
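
Since only cbind has been shown so far, here is a minimal sketch of rbind. It works on a fresh copy of the original dataset taken from the datasets package, so it does not touch the modified cars object. The names cars_original, new_row and cars_extended, as well as the values of the new observation, are made up for illustration.

```r
# Fresh copy of the original dataset (50 rows, 2 columns)
cars_original <- datasets::cars

# Hypothetical new observation; column names must match those of the data frame
new_row <- data.frame(speed = 26, dist = 100)
cars_extended <- rbind(cars_original, new_row)

nrow(cars_extended)     # 51
tail(cars_extended, 2)  # Last original row and the appended one
```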

Delete columns and rows of a dataframe

Now, if you want to remove variables or rows of a data frame, you have several options:

  1. Use the minus sign (-) and indicate the columns or rows you want to delete.
  2. Create a subset of the data you want to keep.

As an example, we will delete the speed and dist columns. We will save the result in a new data frame called cars2, to avoid overwriting the original dataset.

# Delete with the - sign the first and second column
cars2 <- cars[, -c(1, 2)]

# Select only the columns we want
cars2 <- cars[, c("kph", "meters")] 

If you make use of the head function again, you can see the new data frame.

head(cars2)
       kph     meters
1   6.437388 0.6096074
2   6.437388 3.0480371
3  11.265430 1.2192148
4  11.265430 6.7056815
5  12.874777 4.8768593
6  14.484124 3.0480371
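
The examples above only remove columns. The sketch below shows how the same minus sign could be used for rows, plus another common way of dropping a single column by assigning NULL to it. It works on a fresh copy of the original dataset; cars_copy and cars_rows are hypothetical names.

```r
# Fresh copy of the original dataset (50 rows, 2 columns)
cars_copy <- datasets::cars

# Remove the first and second rows with the minus sign
cars_rows <- cars_copy[-c(1, 2), ]
nrow(cars_rows)   # 48

# Remove a column by assigning NULL to it
cars_copy$dist <- NULL
names(cars_copy)  # "speed"
```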

Sorting and filtering data of dataframe in R

It is common to sort or filter the data inside data frames by the values of some variable.

Sorting dataframes

Consider, for instance, the data in the mtcars dataset and load it with data(mtcars). You can access the sorting index of any variable with the order function.

ii <- order(mtcars$hp) # Sorting index with the hp variable

The sorting index is a vector that gives the order in which the rows of the data frame have to be selected to obtain the desired ordering.

# Sorting by hp (lower to higher)
# We only show the first 4 columns
head(mtcars[ii, 1:4]) 
                 mpg cyl  disp  hp
Honda Civic     30.4  4   75.7  52
Merc 240D       24.4  4  146.7  62
Toyota Corolla  33.9  4   71.1  65
Fiat 128        32.4  4   78.7  66
Fiat X1-9       27.3  4   79.0  66
Porsche 914-2   26.0  4  120.3  91

You can also sort from higher to lower making use of the minus sign.

ii <- order(-mtcars$hp)
head(mtcars[ii, 1:4]) # Again, only the first 4 columns
                    mpg  cyl disp  hp
Maserati Bora       15.0  8  301  335
Ford Pantera L      15.8  8  351  264
Duster 360          14.3  8  360  245
Camaro Z28          13.3  8  350  245
Chrysler Imperial   14.7  8  440  230
Lincoln Continental 10.4  8  460  215

In addition, you can establish different order conditions if you want. You can order by some variable and, in case of ties, order by another one. In the following example we will order the data frame by the variable named cyl and then by the variable hp.

ii <- order(mtcars$cyl, mtcars$hp)
head(mtcars[ii, 1:4])
               mpg  cyl  disp  hp
Honda Civic    30.4  4   75.7  52
Merc 240D      24.4  4  146.7  62
Toyota Corolla 33.9  4   71.1  65
Fiat 128       32.4  4   78.7  66
Fiat X1-9      27.3  4   79.0  66
Porsche 914-2  26.0  4  120.3  91
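
The minus sign only works with numeric variables. As a hedged complement, the order function also accepts a decreasing argument, which works for character values too, and the minus sign can be mixed with ascending keys. The index names ii, jj and kk are arbitrary.

```r
# Descending order without the minus sign
ii <- order(mtcars$hp, decreasing = TRUE)
head(mtcars[ii, 1:4])

# Ascending by cyl and, within ties, descending by hp
jj <- order(mtcars$cyl, -mtcars$hp)
head(mtcars[jj, 1:4])

# Descending order by a character vector (here, the row names with the car models)
kk <- order(rownames(mtcars), decreasing = TRUE)
head(mtcars[kk, 1:4])
```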

Filtering data frames

Filtering a data frame consists of obtaining a subset of rows that meets some conditions. For this purpose, you can use the subset function to subset data frames by column values. We will provide some examples based on the mtcars dataset.

Subset of the dataset where the number of cylinders of the car is exactly 6 and the horse power is greater than 110.

subset(mtcars, cyl == 6 & hp > 110)
              mpg cyl  disp  hp drat   wt qsec vs am gear carb
Merc 280     19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
Merc 280C    17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
Ferrari Dino 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6

The same as the previous example, but we only show some variables (mpg, cyl and disp) making use of the select argument.

subset(mtcars, cyl == 6 & hp > 110, select = c(mpg, cyl, disp))
              mpg cyl  disp
Merc 280     19.2   6 167.6
Merc 280C    17.8   6 167.6
Ferrari Dino 19.7   6 145.0
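
The same filter can also be written with square brackets and a logical vector, which is the indexing mechanism described earlier in this tutorial. This is a sketch of an equivalent approach; the names keep, filtered and same are assumptions.

```r
# Logical vector: TRUE for the rows that meet both conditions
keep <- mtcars$cyl == 6 & mtcars$hp > 110

# Rows selected by the logical vector, columns selected by name
filtered <- mtcars[keep, c("mpg", "cyl", "disp")]
filtered

# Compare with the result of subset
same <- subset(mtcars, cyl == 6 & hp > 110, select = c(mpg, cyl, disp))
all.equal(filtered, same)  # TRUE
```

One difference to keep in mind: when the condition evaluates to NA for some row, subset drops that row, whereas bracket indexing returns a row filled with NA. The mtcars dataset has no missing values, so both approaches agree here.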

Now, instead of using the AND condition we will use the OR condition. In this case, we will select the cars where the variable wt is less than 2 or the variable hp is greater than 210.

subset(mtcars, wt < 2 | hp > 210)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8