Box Cox transformation in R
The Box-Cox transformation is a power transformation that can correct the skewness of a variable, unequal variances, or non-linearity between variables. It is therefore very useful for transforming a variable into a new one that is approximately normally distributed.
Box Cox family
The Box-Cox family of transformations is given, for different values of \(\lambda\), by the following expression:
\[\begin{cases} \dfrac{x^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(x) & \text{if } \lambda = 0 \end{cases},\]
where \(x\) is the variable to be transformed and \(\lambda\) is the transformation parameter. In practice, the most common values of \(\lambda\) and the simple transformations they correspond to are listed in the following table:
\(\lambda\) | Transformation |
---|---|
-2 | \(1/x^2\) |
-1 | \(1/x\) |
-0.5 | \(1/\sqrt{x}\) |
0 | \(\log(x)\) |
0.5 | \(\sqrt{x}\) |
1 | \(x\) |
2 | \(x^2\) |
If the estimated transformation parameter is close to one of the values in the table above, in practice it is recommended to use the table value instead of the exact estimate, as the table value is easier to interpret.
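As a minimal sketch of the expression and the recommendation above, the transformation can be written as a small R function. The names box_cox, table_lambdas and nearest_lambda are hypothetical helpers introduced here for illustration; they are not part of base R or MASS. Note that the exact formula differs from the simple forms in the table only by a shift and a rescaling.

```r
# Box-Cox transformation of a positive numeric vector x for a given lambda.
# box_cox is a hypothetical helper name, not a function from base R or MASS.
box_cox <- function(x, lambda) {
  stopifnot(is.numeric(x), all(x > 0))
  if (lambda == 0) {
    log(x)
  } else {
    (x^lambda - 1) / lambda
  }
}

# Values of lambda listed in the table
table_lambdas <- c(-2, -1, -0.5, 0, 0.5, 1, 2)

# Snap an estimated lambda to the closest value of the table
nearest_lambda <- function(lambda_hat) {
  table_lambdas[which.min(abs(table_lambdas - lambda_hat))]
}

x <- c(0.5, 1, 2, 4)
box_cox(x, 0)          # same as log(x)
box_cox(x, 0.5)        # same as 2 * (sqrt(x) - 1)
nearest_lambda(-0.02)  # 0, so the logarithm would be used
```

Whether an estimate is "close enough" to a table value is a judgment call; the confidence interval of the estimate is a more careful guide than the nearest value alone.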
The boxcox function in R
In R, we can use the boxcox function from the MASS package to estimate the transformation parameter by maximum likelihood. The function also shows the 95% confidence interval of the parameter. Its arguments are the following:
boxcox(object, # lm or aov objects or formulas
lambda = seq(-2, 2, 1/10), # Vector of values of lambda
plotit = TRUE, # Create a plot or not
interp, # Logical. Controls if spline interpolation is used
eps = 1/50, # Tolerance for lambda. Defaults to 0.02.
xlab = expression(lambda), # X-axis title
ylab = "log-Likelihood", # Y-axis title
...) # Additional arguments for model fitting
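The following sketch shows how some of these arguments might be combined: a finer grid of lambda values and no plot, so that only the numeric result is returned. The vector y is hypothetical simulated data used purely for illustration, and the grid limits are an arbitrary choice.

```r
library(MASS)

# Hypothetical right-skewed, strictly positive sample (illustration only)
set.seed(123)
y <- rexp(40)

# y = TRUE keeps the response in the fitted object, so boxcox() can use it directly
fit <- lm(y ~ 1, y = TRUE)

# Evaluate the profile log-likelihood on a custom grid, without plotting
lambda_grid <- seq(-1, 1, by = 0.05)
bc <- boxcox(fit, lambda = lambda_grid, plotit = FALSE)

# bc is a list with components x (lambda values) and y (log-likelihoods)
str(bc)
```

With plotit = FALSE no spline interpolation is applied by default, so the returned lambda values are the ones in the grid that was passed in.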
Box Cox transformation example
Consider the following sample vector x, which doesn't follow a normal distribution:
x <- c(0.103, 0.528, 0.221, 0.260, 0.091,
1.314, 1.732, 0.244, 1.981, 0.273,
0.461, 0.366, 1.407, 0.079, 2.266)
# Histogram of the data
hist(x)
To calculate the optimal \(\lambda\), fit a linear model with the lm function and pass it to the boxcox function as follows:
# install.packages("MASS")
library(MASS)
boxcox(lm(x ~ 1))
The output of the function will be the following plot:
Note that the central dashed vertical line marks the estimated parameter \(\hat{\lambda}\), and the other two mark the limits of its 95% confidence interval.
Since the plot shows that 0 lies inside the confidence interval of the optimal \(\lambda\), and the estimate itself is very close to 0 in this example, the best option is to apply the logarithmic transformation to the data (see the table in the first section).
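The interval can also be obtained numerically instead of being read off the plot. The sketch below assumes the dashed lines are based on the usual likelihood-ratio cutoff, that is, all values of lambda whose log-likelihood lies within qchisq(0.95, 1) / 2 of the maximum. The grid step of 0.01 and the names lambda_hat, cutoff and ci are choices made here for illustration.

```r
library(MASS)

x <- c(0.103, 0.528, 0.221, 0.260, 0.091,
       1.314, 1.732, 0.244, 1.981, 0.273,
       0.461, 0.366, 1.407, 0.079, 2.266)

# Profile log-likelihood on a fine grid, without plotting
bc <- boxcox(lm(x ~ 1, y = TRUE), lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

# Grid value with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within the likelihood-ratio cutoff
cutoff <- max(bc$y) - qchisq(0.95, df = 1) / 2
ci <- range(bc$x[bc$y > cutoff])

lambda_hat
ci

# TRUE if 0 (the log transformation) lies inside the interval
ci[1] <= 0 && 0 <= ci[2]
```

Because the interval is computed on a grid, its limits are only accurate up to the grid step.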
# Transformed data
new_x <- log(x)
# Histogram
hist(new_x)
The transformed data now looks closer to a normal distribution. You can also check this with a statistical test, such as the Shapiro-Wilk test:
shapiro.test(new_x)
Shapiro-Wilk normality test
data: new_x
W = 0.9, p-value = 0.2
As the p-value is greater than the usual levels of significance (1%, 5% and 10%) we have no evidence to reject the null hypothesis of normality.
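To see what the transformation changed, it may help to run the same test on the original data and to look at a normal Q-Q plot of the transformed data. This is only a sketch of one possible check; the names p_raw and p_log are introduced here for illustration.

```r
x <- c(0.103, 0.528, 0.221, 0.260, 0.091,
       1.314, 1.732, 0.244, 1.981, 0.273,
       0.461, 0.366, 1.407, 0.079, 2.266)
new_x <- log(x)

# Shapiro-Wilk p-values before and after the transformation
p_raw <- shapiro.test(x)$p.value
p_log <- shapiro.test(new_x)$p.value
c(raw = p_raw, log = p_log)

# Normal Q-Q plot of the transformed data:
# points close to the reference line are compatible with normality
qqnorm(new_x)
qqline(new_x)
```

With only 15 observations the test has little power, so a large p-value should be read as absence of evidence against normality rather than as proof of it.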
Extracting the exact lambda
If the confidence interval of the estimated parameter doesn't contain any value from the table, you can extract the exact lambda using the following code:
# install.packages("MASS")
library(MASS)
b <- boxcox(lm(x ~ 1))
# Exact lambda
lambda <- b$x[which.max(b$y)] # -0.02
Now you can transform the variable using the expression from the first section:
new_x_exact <- (x ^ lambda - 1) / lambda
new_x_exact
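Since the estimated lambda is so close to 0 in this example, the exact transformation and the logarithm should give almost the same result. The sketch below compares them; lambda is set by hand to the -0.02 reported in the comment above rather than re-estimated, and new_x_log is a name introduced here for illustration.

```r
x <- c(0.103, 0.528, 0.221, 0.260, 0.091,
       1.314, 1.732, 0.244, 1.981, 0.273,
       0.461, 0.366, 1.407, 0.079, 2.266)

# Assumed value, taken from the comment in the code above
lambda <- -0.02

new_x_exact <- (x^lambda - 1) / lambda
new_x_log <- log(x)

# Largest absolute difference between both versions
max(abs(new_x_exact - new_x_log))

# Correlation between both versions
cor(new_x_exact, new_x_log)
```

When the two versions are this similar, the logarithm is usually preferred because it is easier to interpret.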