Contingency table in R

Create a contingency table in R and calculate association measures

In many situations it is interesting to study two or more variables at the same time. When working with categorical variables, you can summarize the data in a table known as a contingency table, and the relationship between the variables can also be assessed with association measures. In this article you will learn how to create a contingency table in R, how to plot it and how to measure the degree of relationship between the factor variables with association measures.

How to create a contingency table in R? The table() function

Qualitative variables, also known as nominal, categorical or factor variables, appear when you work with non-measurable characteristics (gender, birthplace, race, …). This kind of variable can be represented with text, where each name represents a unique category, or with numbers, where each number represents a unique category. When working with several categorical variables, you can represent their joint frequency distribution with a frequency or contingency table.

Consider the bidimensional case with factor variables X and Y. Denote by x_i, i = 1, \dots, k the different values that the variable X can take and by y_j, j = 1, \dots, h the values that the variable Y can take. With this in mind we can define:

  1. Joint absolute frequency of (x_i, y_j). The number of individuals that present at the same time the values x_i \in X and y_j \in Y; it is denoted by n_{ij}.
  2. Joint relative frequency. The proportion of individuals that present at the same time the values x_i \in X and y_j \in Y out of the total number of individuals; it is defined by f_{ij} = n_{ij}/n, where n is the sample size.

Note that the bidimensional distribution is the set of values that the bidimensional variable (X, Y) can take, together with the joint frequencies of those values. In order to represent a bidimensional distribution you can use a two-way contingency table of the form:

X \ Y         | y_1           | y_2           | \dots | y_j           | \dots | y_h           | n_{i \bullet}
x_1           | n_{11}        | n_{12}        | \dots | n_{1j}        | \dots | n_{1h}        | n_{1 \bullet}
x_2           | n_{21}        | n_{22}        | \dots | n_{2j}        | \dots | n_{2h}        | n_{2 \bullet}
\vdots        | \vdots        | \vdots        | \ddots | \vdots       | \ddots | \vdots       | \vdots
x_i           | n_{i1}        | n_{i2}        | \dots | n_{ij}        | \dots | n_{ih}        | n_{i \bullet}
\vdots        | \vdots        | \vdots        | \ddots | \vdots       | \ddots | \vdots       | \vdots
x_k           | n_{k1}        | n_{k2}        | \dots | n_{kj}        | \dots | n_{kh}        | n_{k \bullet}
n_{\bullet j} | n_{\bullet 1} | n_{\bullet 2} | \dots | n_{\bullet j} | \dots | n_{\bullet h} | n

Absolute frequency table in R

You can make use of the table function to create a contingency table in R. Consider, for instance, the following example with random data, where the variable X represents the votes for (Yes) and against (No) a law in an international committee composed of members from three different continents (variable Y):

set.seed(20)

X <- sample(c("Yes", "No"), 10, replace = TRUE)
# "No" "Yes" "Yes" "No"  "No" "Yes" "No" "Yes" "No" "No"

Y <- sample(c("Europe", "America", "Africa"), 10, replace = TRUE)
# "Europe" "Africa"  "Africa"  "Europe"  "Africa" 
# "Europe" "Europe"  "Europe"  "America" "America"

With the table function you can create a frequency table (the marginal distribution) for each variable:

table(X)
# X
#  No Yes 
#   6   4 
table(Y)
# Y
#  Africa America  Europe 
#       3       2       5 

However, if you pass two variables to the function, you can create a two-way contingency table. It is worth mentioning that you could add more variables to the function, resulting in a multidimensional array (see the sketch after the following example).

my_table <- table(X, Y)
my_table
Output
     Y
X     Africa America Europe
  No       1       2      3
  Yes      2       0      2
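
As a minimal sketch of the multi-way case, suppose the committee data also included a third factor; here Z is a hypothetical variable indicating whether each member attended the session in person:

Z <- sample(c("Present", "Absent"), 10, replace = TRUE) # hypothetical third factor
table(X, Y, Z) # three-dimensional array: one two-way table per level of Z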

Note that you can modify the row and column names of the table with the colnames and rownames functions.

rownames(my_table) <- c("0", "1")
colnames(my_table) <- c("Af", "Am", "Eu")
my_table
Output
   Y
X   Af Am Eu
  0  1  2  3
  1  2  0  2
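
Note that the renaming above is just for illustration. Since the rest of the examples in this article use the original labels, you can restore them before continuing:

rownames(my_table) <- c("No", "Yes")
colnames(my_table) <- c("Africa", "America", "Europe")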

You can also convert a data frame or a matrix to a table. For that purpose, you could use the as.table function.
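
As an illustrative sketch (the matrix below just reproduces the counts of my_table), you could build a labeled matrix and coerce it with as.table:

# Counts filled column by column, with dimension names X and Y
counts <- matrix(c(1, 2, 2, 0, 3, 2), nrow = 2,
                 dimnames = list(X = c("No", "Yes"),
                                 Y = c("Africa", "America", "Europe")))
as.table(counts) # coerces the matrix to a contingency table object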

Relative frequency table in R

The table created with the table function displays the joint absolute frequencies of the variables. Nonetheless, you can create a joint relative frequency table in R (expressing the entries as fractions of a marginal table) with the prop.table function. By default, the function calculates the proportion of each cell with respect to the total number of observations, so the sum of all the cells equals 1.

prop_table <- prop.table(my_table)
prop_table
Output
     Y
X     Africa America Europe
  No     0.1     0.2    0.3
  Yes    0.2     0.0    0.2

However, the margin argument allows you to set the index (1: rows, 2: columns) according to which the proportions will be calculated.

On the one hand, if you set margin = 1, the sum of each row will be equal to 1.

prop.table(my_table, margin = 1)
Output
     Y
X        Africa   America    Europe
  No  0.1666667 0.3333333 0.5000000
  Yes 0.5000000 0.0000000 0.5000000

On the other hand, if you specify margin = 2 the relative frequency will be calculated by columns, so each column will sum up to 1.

my_table_2 <- prop.table(my_table, margin = 2)
my_table_2 
Output
     Y
X        Africa   America    Europe
  No  0.3333333 1.0000000 0.6000000
  Yes 0.6666667 0.0000000 0.4000000

Note that you can also add the margins of a contingency table in R with the addmargins function, to display the marginal relative frequencies (or the marginal absolute frequencies, if you apply it to an absolute frequency table). In the following example we calculate the margins of prop_table, expressing the data as percentages.

my_table_3 <- addmargins(prop_table * 100)
my_table_3
Output
     Y
X     Africa America Europe Sum
  No      10      20     30  60
  Yes     20       0     20  40
  Sum     30      20     50 100
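
The same function can be applied to the absolute frequency table to obtain the row and column totals:

addmargins(my_table)
Output
     Y
X     Africa America Europe Sum
  No       1       2      3   6
  Yes      2       0      2   4
  Sum      3       2      5  10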

Two-way contingency graphs in R

In order to plot a contingency table in R you can make use of bar charts, created with the barplot function. There are two main types: grouped and stacked. By default, the function will create a stacked bar plot:

# Divide the graphical window into two columns
par(mfrow = c(1, 2))

colors <- c("#80FFFF", "#FFFFFF")

barplot(my_table, col = colors)

# Adding a legend
legend("topleft", legend = c("No", "Yes"), fill = colors)

barplot(my_table_2, col = colors)

# Back to the original graphical window
par(mfrow = c(1, 1))

Two way contingency graph in R

If you prefer a grouped bar plot you will need to set the beside argument to TRUE:

par(mfrow = c(1, 2))

barplot(my_table, col = colors, beside = TRUE)
legend("topleft", legend = c("No", "Yes"), fill = colors)
barplot(my_table_2, col = colors, beside = TRUE)

par(mfrow = c(1, 1))

Plot a table in R

An alternative to bar plots are mosaic plots, which allow you to display two or more categorical variables. This kind of plot can be created in base R with the mosaicplot function as follows:

par(mfrow = c(1, 2))

mosaicplot(my_table, cex = 1.1)
mosaicplot(my_table_2, cex = 1.1)

par(mfrow = c(1, 1))

Mosaic plot from table in R

When the boxes line up across categories (each category is split into the same proportions), this is a sign of independence between the variables.
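
To see what independence looks like, you can build, as a sketch, a table whose cells are exactly proportional to the products of the margins and draw its mosaic plot: its boxes line up across categories.

# Cell counts proportional to the product of the margins (exact independence)
ind_table <- as.table(outer(rowSums(my_table), colSums(my_table)) / sum(my_table))
mosaicplot(ind_table, main = "Exact independence")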

Association measures in R

You can check whether the variables are dependent (there exists some relationship between them) or independent. In order to define the condition of independence, X is independent of Y if, for all i:

\frac{n_{i1}}{n_{\bullet 1}} = \frac{n_{i2}}{n_{\bullet 2}} = \dots = \frac{n_{ij}}{n_{\bullet j}} = \dots = \frac{n_{ih}}{n_{\bullet h}} = \frac{n_{i \bullet}}{n}.

The previous condition is equivalent to:

\frac{n_{ij}}{n} = \frac{n_{i \bullet}}{n} \cdot \frac{n_{\bullet j}}{n} \quad \forall i, j.
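
As a minimal numerical check of this condition for our data, you can compare the joint relative frequencies with the product of the marginal relative frequencies (with such a small sample the two will differ noticeably):

joint <- prop.table(my_table)                                           # n_ij / n
indep <- outer(rowSums(my_table), colSums(my_table)) / sum(my_table)^2  # (n_i./n) * (n_.j/n)
round(joint - indep, 3) # differences far from 0 point to association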

Chi-square coefficient in R

You can check for independence with the chi-square test, comparing the observed frequencies n_{ij} with the expected frequencies under independence, e_{ij} = n_{i \bullet} n_{\bullet j} / n. The chi-square statistic is:

\chi^2 = \sum_{i, j} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}.

Note that this statistic measures, in relative terms, the distance between the joint distribution of the variables and the distribution expected under independence. Hence, the bigger the value of \chi^2, the greater the association between the variables. This measure goes from 0 (independence) to n(q - 1), where q is the minimum of the number of rows and columns.

However, as this metric is not bounded between 0 and 1, other association measures have been proposed, described in the following subsections.

You can calculate the Chi-square coefficient in R by performing a Chi-square test with the chisq.test function as follows:

chi <- chisq.test(my_table)
chi$statistic
Output
X-squared 2.222222

Therefore, in this case, the Chi-square coefficient is 2.2222.
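
The object returned by chisq.test also stores the observed and expected frequencies, so you can reproduce the statistic by hand. Note that, with expected counts this small, R also warns that the chi-squared approximation may be inaccurate.

chi$expected                                        # e_ij = n_i. * n_.j / n
sum((chi$observed - chi$expected)^2 / chi$expected) # 2.222222, the same value as chi$statistic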

Contingency coefficient

The contingency coefficient C is bounded between 0 (independence) and 1. It is defined as:

C = \sqrt{\frac{\chi^2}{\chi^2 + n}}.

In order to calculate the contingency coefficient in R you can make use of the following function:

Contingency <- function(x) {
    chi <- chisq.test(x)
    unname(sqrt(chi$statistic / (chi$statistic + sum(x))))
}
Contingency(my_table) # 0.4264014

An alternative is to use the ContCoef function of the DescTools package:

# install.packages("DescTools")
library(DescTools)

ContCoef(my_table) # 0.4264014

Let k be the number of rows and h the number of columns. For a square table (k = h), the maximum value that the contingency coefficient can take is \sqrt{(k-1)/k}, so the value 1 is never reached. Some maximum values of the coefficient are:

k = h | 2    | 3    | 4    | 5    | 6
max C | 0.71 | 0.82 | 0.87 | 0.89 | 0.91
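
These maxima come directly from the formula above; a quick sketch to reproduce them:

k <- 2:6
round(sqrt((k - 1) / k), 2) # 0.71 0.82 0.87 0.89 0.91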

Phi and Cramer’s V coefficients

The Phi coefficient is defined as:

\phi = \sqrt{\frac{\chi^2}{n}}.

Note that it takes values between 0 and 1 in 2×2 tables. However, it can take values greater than 1 in other cases. In order to avoid that issue, Cramer's V coefficient modifies the Phi coefficient, taking the form:

V = \sqrt{\frac{\chi^2}{n(q - 1)}}.

Recall that q is the minimum of the number of rows and columns.

In R you can make use of the following functions to calculate these coefficients:

# Phi coefficient
PhiCoef <- function(x){
    unname(sqrt(chisq.test(x)$statistic / sum(x)))
}

# Cramer's V coefficient
V <- function(x) {
    unname(sqrt(chisq.test(x)$statistic / (sum(x) * (min(dim(x)) - 1))))
}
PhiCoef(my_table) # 0.4714045
V(my_table)   # 0.4714045

An alternative is to use the Phi and CramerV functions of the DescTools package.

# install.packages("DescTools")
library(DescTools)

Phi(my_table) # 0.4714045
CramerV(my_table) # 0.4714045

As a final remark, it is worth mentioning that you can calculate several association measures in R at the same time with the assocstats function of the vcd package.

# install.packages("vcd")
library(vcd)

assocstats(my_table)
Output
                    X^2 df P(> X^2)
Likelihood Ratio 2.9110  2  0.23328
Pearson          2.2222  2  0.32919

Phi-Coefficient   : NA
Contingency Coeff.: 0.426
Cramer's V        : 0.471