# Violin plot in R

Violin plots are **an alternative to box plots** that address the difficulty of showing the underlying distribution of the observations, because they display a kernel density estimate of the data. In this tutorial, we will show you how to create a violin plot in base R from a vector and from data frames, how to add mean points, and how to split R violin plots by group.

## Vioplot from vector

To create a violin plot in R from a vector, pass the vector to the `vioplot` function of the package of the same name. Consider, for instance, the following vector:

```
x <- c(6, 9, 0, 19, -1, 8, 12, 5, 3, 7,
2, 4, 3, -8, -9, 8, 4, 12, 5, 14)
```

You can create a simple violin plot in R by typing:

```
# install.packages("vioplot")
library("vioplot")
vioplot(x)
```
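The outline of a violin is essentially a mirrored kernel density estimate. As a rough sketch of that idea, the following base R code mirrors the output of `density()` around a vertical axis. Using `density()` with its default settings is an assumption made for illustration: `vioplot` may rely on a different estimator and bandwidth internally, so the shape is not expected to match exactly.

```r
x <- c(6, 9, 0, 19, -1, 8, 12, 5, 3, 7,
       2, 4, 3, -8, -9, 8, 4, 12, 5, 14)

# Kernel density estimate restricted to the observed range of the data
d <- density(x, from = min(x), to = max(x))

# Empty canvas: mirrored density on the X-axis, data values on the Y-axis
plot(NA, xlim = c(-1, 1) * max(d$y), ylim = range(d$x),
     xlab = "Density (mirrored)", ylab = "x")

# Draw the density and its mirror image as a single closed shape
polygon(c(d$y, rev(-d$y)), c(d$x, rev(d$x)), col = 2)
```

The wider the shape at a given height, the more observations are concentrated around that value.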

**By default**, the `vioplot` function will create a **vertical violin plot in R**, but if you set the argument `horizontal` to `TRUE`, you can create a horizontal violin plot.

`vioplot(x, horizontal = TRUE)`

If you want to **customize the violin plot**, there are several arguments to control the graphical representation:

```
vioplot(x,
        col = 2,               # Color of the area
        rectCol = "red",       # Color of the rectangle
        lineCol = "white",     # Color of the line
        colMed = "green",      # Color of the median symbol
        border = "black",      # Color of the border of the violin
        pchMed = 16,           # Pch symbol for the median
        plotCentre = "points") # If "line", plots a median line
```

In addition, you can **add jittered data points** to a violin plot with the `stripchart` function as follows:

```
stripchart(x, method = "jitter", col = "blue",
vertical = TRUE, pch = 19, add = TRUE)
```

Note that if you have a horizontal violin plot, you will need to set `vertical = FALSE` in the previous function.

Moreover, you can **draw a violin plot in R without taking into account the outliers of the data**. For that purpose, you can assign the output of the `boxplot` function to a variable and then keep only the values of the original vector that are not outliers.

```
box <- boxplot(x)
x <- x[!(x %in% box$out)]
vioplot(x)
```
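If you prefer not to draw the intermediate box plot, one possible alternative is the `boxplot.stats` function of the base `grDevices` package, which applies the same rule (values further than 1.5 times the interquartile range from the hinges) without plotting anything. This is a minimal sketch, and the names `out` and `x_clean` are only illustrative:

```r
x <- c(6, 9, 0, 19, -1, 8, 12, 5, 3, 7,
       2, 4, 3, -8, -9, 8, 4, 12, 5, 14)

# Values flagged as outliers by the box plot rule, without drawing a plot
out <- boxplot.stats(x)$out

# Keep only the observations that are not flagged
x_clean <- x[!(x %in% out)]

length(x)       # Number of observations before filtering
length(x_clean) # Number of observations after filtering
```

You can then pass `x_clean` to `vioplot` as in the previous example.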

You can also set the argument `ylog` to `TRUE` if you want the Y-axis to be in **logarithmic scale**. Note that this will only work for positive data.

```
par(mfrow = c(1, 2))
vioplot(1:10)
vioplot(1:10, ylog = TRUE)
par(mfrow = c(1, 1))
```

### Histogram and violin plot

Finally, note that you can plot a violin plot over a histogram. Consider, for instance, that the **underlying distribution of your data is multimodal**. In this case, a box plot won’t reveal it, but a violin plot will. The following graphical representation will help you understand why a violin plot is useful:

```
set.seed(1)
# Multimodal data
n <- 10000
ii <- rbinom(n, 1, 0.5)
data <- rnorm(n, mean = 130, sd = 10) * ii +
rnorm(n, mean = 80, sd = 5) * (1 - ii)
# Histogram
hist(data, probability = TRUE, col = "grey", axes = FALSE,
main = "", xlab = "", ylab = "")
# X-axis
axis(1)
# Density
lines(density(data), lwd = 2, col = "red")
# Add violin plot
par(new = TRUE)
vioplot(data, horizontal = TRUE, yaxt = "n", axes = FALSE,
col = rgb(0, 1, 1, alpha = 0.15))
```
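To see numerically why a box plot hides the two modes, you can compare the density estimate at the median with the density at the centre of each group. The following is only a sketch using the base R functions `density` and `approx`; the evaluation points 80 and 130 are taken from the simulation parameters above, and the labels in `dens` are illustrative.

```r
set.seed(1)

# Same bimodal mixture as above
n <- 10000
ii <- rbinom(n, 1, 0.5)
data <- rnorm(n, mean = 130, sd = 10) * ii +
  rnorm(n, mean = 80, sd = 5) * (1 - ii)

# A box plot is built from the five-number summary, which carries no
# information about the number of modes
fivenum(data)

# Kernel density estimate evaluated at the median and at both group centres
d <- density(data)
dens <- approx(d$x, d$y, xout = c(median(data), 80, 130))$y
names(dens) <- c("median", "low_centre", "high_centre")
dens
```

The median, which is the central mark of a box plot, falls in the low-density valley between the two groups, whereas the violin plot shows both peaks.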

## Violin plot by group

On the one hand, if you have a **data frame with a variable containing groups**, you can draw a violin plot from a formula, specifying the numerical variable against the factor. We will show you an example using the `chickwts` dataset of base R.

`tail(chickwts) # Last rows`

```
   weight   feed
66    352 casein
67    359 casein
68    216 casein
69    222 casein
70    283 casein
71    332 casein
```

Now, you can **specify the formula as the first argument**, followed by the colors and any other graphical parameter you need:

```
data <- chickwts
# One color per level: six feed types need colors 2 to 7
vioplot(data$weight ~ data$feed, col = 2:(length(levels(data$feed)) + 1),
        xlab = "Feed", ylab = "Weight")
```

You can also add jittered data points to the previous violin plot with the `stripchart` function as follows:

```
stripchart(data$weight ~ data$feed, vertical = TRUE, method = "jitter",
pch = 19, add = TRUE, col = 3:8)
```
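As a possible alternative, and assuming a version of `vioplot` whose formula interface accepts a `data` argument in the same way `boxplot` does, the calls above can be written without repeating `data$`. The sketch below also builds one color per level with `nlevels`; the object name `feed_cols` is illustrative only.

```r
# One color per feed type (six levels give colors 2 to 7)
feed_cols <- seq_len(nlevels(chickwts$feed)) + 1

# Plot only if the vioplot package is installed
if (requireNamespace("vioplot", quietly = TRUE)) {
  library("vioplot")

  # Formula interface with a data argument (assumed to be supported)
  vioplot(weight ~ feed, data = chickwts, col = feed_cols,
          xlab = "Feed", ylab = "Weight")

  # Jittered points on top, using the same formula
  stripchart(weight ~ feed, data = chickwts, vertical = TRUE,
             method = "jitter", pch = 19, add = TRUE)
}
```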

On the other hand, if your data set contains **numeric columns that each represent a variable**, you can directly create the violin plot from the data frame. We will use, for instance, the `trees` dataset of base R.

`tail(trees) # Last rows`

```
   Girth Height Volume
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0
```

If you pass the data frame to the `vioplot` function, you can create the plot. Note that if you stack this data frame with the `stack` function, you can specify a formula as in the previous example.

```
data <- trees
vioplot(data, col = 2:4, border = 2:4)
# Equivalent to:
stacked_data <- stack(trees)
vioplot(stacked_data$values ~ stacked_data$ind, col = 2:4,
border = 2:4)
```
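To understand why both calls are equivalent, it can help to inspect what `stack` returns: a long data frame with a numeric column named `values` and a factor named `ind` that stores the name of the original column. A quick check in base R:

```r
stacked_data <- stack(trees)

# Two columns: the stacked measurements and the name of the source column
str(stacked_data)

# One level per original column, in the original column order
levels(stacked_data$ind)

# 31 trees times 3 variables
nrow(stacked_data)
```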

### Reorder violin plot

The violin plots are **ordered by default by the order of the levels of the categorical variable**. Recall the violin plot we created before with the `chickwts` dataset and check that the order of the levels is the following:

`levels(chickwts$feed)`

`[1] "casein" "horsebean" "linseed" "meatmeal" "soybean" "sunflower"`

However, you can override this behavior by reordering the categorical variable according to any characteristic of the data with the `reorder` function. In the following example we are going to use the median, but you could choose any function you want.

```
par(mfrow = c(1, 2))
data <- chickwts
#----------------
# Lower to higher
#----------------
medians <- reorder(data$feed, data$weight, median)
# medians <- with(data, reorder(feed, weight, median)) # Equivalent
vioplot(data$weight ~ medians, col = 2:(length(levels(data$feed)) + 1),
xlab = "", ylab = "Weight", las = 2)
#----------------
# Higher to lower
#----------------
medians <- reorder(data$feed, -data$weight, median)
# medians <- with(data, reorder(feed, -weight, median)) # Equivalent
vioplot(data$weight ~ medians, col = 2:(length(levels(data$feed)) + 1),
xlab = "", ylab = "Weight", las = 2)
par(mfrow = c(1, 1))
```
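Before plotting, you may want to verify the order that `reorder` produces. The following sketch uses only base R; the object names `by_median` and `group_medians` are illustrative.

```r
# Reorder the feed levels by the median weight of each group
by_median <- with(chickwts, reorder(feed, weight, median))

# New level order, from the lowest to the highest median
levels(by_median)

# Group medians listed in that order; they should be non-decreasing
group_medians <- tapply(chickwts$weight, by_median, median)
group_medians
```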

### Add mean to base R violin plot

The `vioplot` function displays the median of the data, but if the distribution is not symmetric the mean and the median can be far apart. Hence, you can **add the mean point**, or any other characteristic of the data, to a violin plot in base R with the `points` function. Note that the steps differ depending on whether the violin plot is horizontal or vertical and whether you are drawing a single plot or several.

On the one hand, to display the mean point of a single violin plot you can type:

```
par(mfrow = c(1, 2))
# Exponential data
set.seed(5)
x <- rexp(20)
#-------------------
# Vertical vioplot
#-------------------
vioplot(x, col = 4)
# Add mean point
points(mean(x), pch = 19, col = "green", cex = 1.5)
#-------------------
# Horizontal vioplot
#-------------------
vioplot(x, col = 4, horizontal = TRUE)
# Add mean point
points(mean(x), 1, pch = 19, col = "green", cex = 1.5)
legend("topright", pch = c(21, 19), col = c("black", "green"),
bg = "white", legend = c("Median", "Mean"), cex = 1.25)
par(mfrow = c(1, 1))
```

On the other hand, you can add mean points to a violin plot by group typing the following:

```
par(mfrow = c(1, 2))
set.seed(5)
df <- data.frame(x = rexp(20), y = rexp(20), z = rexp(20))
#--------------------------
# Vertical vioplot by group
#--------------------------
vioplot(df, col = 2:4)
# Add mean points
means <- colMeans(df)
# means <- apply(df, 2, mean) # Equivalent (less efficient)
points(means, pch = 19, col = "green", cex = 1.25)
legend("top", pch = c(21, 19), col = c("black", "green"),
bg = "white", legend = c("Median", "Mean"), cex = 1.25)
#----------------------------
# Horizontal vioplot by group
#----------------------------
vioplot(df, col = 2:4,
horizontal = TRUE)
# Add mean points
means <- colMeans(df)
# means <- apply(df, 2, mean) # Equivalent (less efficient)
points(means, 1:ncol(df), pch = 19, col = "green", cex = 1.25)
par(mfrow = c(1, 1))
```

You can add points for any other characteristic of the data by replacing the `mean` function with another one.
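When the groups come from a factor rather than from separate columns, as in the `chickwts` example, the group means can be computed with `tapply`. This is a minimal sketch that assumes the violins are drawn at the positions 1, 2, and so on, in the order of the factor levels:

```r
# Mean weight per feed type, in the order of the factor levels
means <- tapply(chickwts$weight, chickwts$feed, mean)
means

# Plot only if the vioplot package is installed
if (requireNamespace("vioplot", quietly = TRUE)) {
  library("vioplot")
  vioplot(chickwts$weight ~ chickwts$feed, col = 2:7,
          xlab = "Feed", ylab = "Weight")

  # One point per group, placed at the position of its violin
  points(seq_along(means), means, pch = 19, col = "green", cex = 1.25)
}
```

Replacing `mean` with another summary function inside `tapply` gives the corresponding points for that statistic.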

## Split R vioplots

It is worth mentioning that **you can split a violin plot in R**. Consider, for instance, that you have divided the `trees` dataset into two groups, representing tall and small trees, depending on their height. Then, you can make use of the `side` and `add` arguments as follows:

```
data <- trees
tall <- trees[trees$Height >= 76, ]
small <- trees[trees$Height < 76, ]
vioplot(tall, side = "left", plotCentre = "line", col = 2)
vioplot(small, side = "right", plotCentre = "line", col = 3, add = TRUE)
legend("topleft", legend = c("Tall", "Small"), fill = c(2, 3), cex = 1.25)
```
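The same `side` and `add` arguments can also be applied to a single variable. The sketch below compares only the `Volume` column between the two groups; the choice of column and the explicit `ylim` (set so that the second half is not clipped by the limits of the first) are assumptions made for illustration.

```r
tall <- trees[trees$Height >= 76, ]
small <- trees[trees$Height < 76, ]

# Every tree falls in exactly one of the two groups
c(tall = nrow(tall), small = nrow(small))

# Plot only if the vioplot package is installed
if (requireNamespace("vioplot", quietly = TRUE)) {
  library("vioplot")

  # Left half: tall trees. The limits cover the volumes of both groups
  vioplot(tall$Volume, side = "left", plotCentre = "line", col = 2,
          names = "Volume", ylim = range(trees$Volume))

  # Right half: small trees, added to the existing plot
  vioplot(small$Volume, side = "right", plotCentre = "line", col = 3,
          add = TRUE)

  legend("topleft", legend = c("Tall", "Small"), fill = c(2, 3), cex = 1.25)
}
```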

Although median points are also possible, it is recommended to plot median lines instead for split violin plots.