So-called raincloud plots show not only the summary statistics of a box plot but also the distribution of the raw values. This matters because different distributions can produce exactly the same box plot. The following example comes from a blog post by Cédric Scherer, which explains very well (in English) why box plots can be misleading.
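To make the claim concrete, here is a minimal base R sketch. The three vectors are hand-built toy data (hypothetical, not the dataset from the blog post), chosen so that their minimum, quartiles, median, and maximum coincide even though their shapes and sizes differ.

```r
# Hypothetical toy data: three groups constructed by hand
uniform <- seq(0, 100, by = 5)          # 21 evenly spread values
bimodal <- c(0, 21:29, 50, 71:79, 100)  # 21 values in two clusters
tiny    <- c(0, 25, 50, 75, 100)        # only 5 values

toy <- data.frame(
  group = rep(
    c("uniform", "bimodal", "tiny"),
    times = c(length(uniform), length(bimodal), length(tiny))
  ),
  value = c(uniform, bimodal, tiny)
)

# Minimum, quartiles, median, and maximum per group (one column per group)
summaries <- sapply(split(toy$value, toy$group), quantile)
print(summaries)

# Number of observations per group
print(table(toy$group))
```

All three columns of `summaries` contain the same five numbers, so a box plot of `toy` would draw three identical boxes, while `table()` already hints that the groups are not comparable.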
We have a fictional dataset that results in the following box plot:
So, at first glance, the box plot’s statistics suggest that the data is similar. Let’s now add the case numbers:
Even though the box plots are identical, the case numbers per group differ greatly. Now let's add the raw values with geom_point():
ggplot(
data,
aes(
x = group,
y = value
)
) +
geom_boxplot(fill = "grey92") +
geom_point(
size = 2,
alpha = .3,
position = position_jitter(
seed = 1,
width = .2
)
)
It is immediately clear that the distributions of the groups are not as similar as the box plots suggested.
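The call position_jitter(seed = 1, width = .2) used above spreads the points horizontally so that they do not overlap, and the seed makes the result reproducible. The following base R sketch only mimics that idea with hypothetical values; it is not ggplot2's internal code.

```r
# Hypothetical x positions: five points that all belong to group 1
x_position <- rep(1, 5)
width <- 0.2

# Fixing the seed makes the random offsets reproducible
set.seed(1)
jittered_first <- x_position + runif(length(x_position), min = -width, max = width)

set.seed(1)
jittered_second <- x_position + runif(length(x_position), min = -width, max = width)

# Same seed, same offsets; every point stays within 0.2 of its group position
print(identical(jittered_first, jittered_second))
print(range(jittered_first))
```

According to the ggplot2 documentation, leaving out the height argument also applies a small vertical jitter by default; passing height = 0 would keep the y values exact.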
A raincloud plot captures this information by plotting both the raw data and the distribution. To do this, you need to install and load the ggdist package:
install.packages("ggdist")
library("ggdist")
From ggdist, you use the function stat_halfeye() to display the distribution. Let's take this first step:
ggplot(data, aes(x = group, y = value)) +
## add half-violin from {ggdist} package
stat_halfeye(
## custom bandwidth
adjust = .5,
## adjust height
width = .6,
## move geom to the right
justification = -.2,
## remove slab interval
.width = 0,
point_colour = NA
) +
geom_boxplot(
width = .12,
## remove outliers
outlier.color = NA ## `outlier.shape = NA` works as well
)
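The adjust = .5 argument above scales the bandwidth of the kernel density estimate, so the half-violin follows the data more closely. As a rough illustration, and assuming that ggdist treats adjust as the same kind of multiplier as stats::density() does, the effect can be sketched in base R with a hypothetical sample:

```r
# Hypothetical sample with two modes; any numeric vector would do
set.seed(42)
values <- c(rnorm(100, mean = 20, sd = 3), rnorm(100, mean = 35, sd = 3))

default_kde <- density(values)                # adjust = 1
narrow_kde  <- density(values, adjust = 0.5)  # half the default bandwidth

# The bandwidth actually used is adjust times the default bandwidth
print(c(default = default_kde$bw, narrow = narrow_kde$bw))
```

A smaller bandwidth gives a wigglier curve that reveals more local structure, while a larger one smooths it out.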
Now you can see the distribution alongside the box plot. However, the raw data points were also very helpful for understanding the data visually. For this, you use the function stat_dots() from ggdist.
ggplot(
data,
aes(
x = group,
y = value
)
) +
stat_halfeye(
adjust = .5,
width = .6,
justification = -.2,
.width = 0,
point_colour = NA
) +
geom_boxplot(
width = .12,
outlier.color = NA
) +
## add raw data points
stat_dots(
## direction in which the dots stack; "right" is worth trying as well
side = "left",
## shift slightly away from geom_boxplot()
justification = 1.1,
## bin width, which controls the dot size
binwidth = .25
)
Finally, you can remove the white space by limiting the x-axis:
ggplot(
data,
aes(
x = group,
y = value
)
) +
stat_halfeye(
adjust = .5,
width = .6,
justification = -.2,
.width = 0,
point_colour = NA
) +
geom_boxplot(
width = .12,
outlier.color = NA
) +
## add raw data points
stat_dots(
## direction in which the dots stack; "right" is worth trying as well
side = "left",
## shift slightly away from geom_boxplot()
justification = 1.1,
## bin width, which controls the dot size
binwidth = .25
) +
## remove white space
coord_cartesian(
xlim = c(
1.2,
NA
)
)
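coord_cartesian() is a good fit here because it only changes the visible window of the plot and does not drop any data before the statistics are computed. The sketch below checks this on hypothetical toy data (not the blog's dataset) by comparing the medians that ggplot2 computes with and without the x-axis limit; it assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical toy data, unrelated to the blog's dataset
toy <- data.frame(
  group = rep(c("a", "b"), each = 5),
  value = c(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
)

base_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot()

# Same plot, but with the left edge of the x-axis cut off as in the text
zoomed <- base_plot +
  coord_cartesian(xlim = c(1.2, NA))

# layer_data() returns the values computed for the first layer (the box plots)
medians_full   <- layer_data(base_plot)$middle
medians_zoomed <- layer_data(zoomed)$middle

print(identical(medians_full, medians_zoomed))
```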
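When you have many data points and groups, the view with stat_dots() can become overwhelming. In that case, geom_half_point() is a good alternative. It comes from the gghalves package, which you need to install (if necessary) and load first:
library("gghalves")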
ggplot(
data,
aes(
x = group,
y = value
)
) +
stat_halfeye(
adjust = .5,
width = .6,
justification = -.2,
.width = 0,
point_colour = NA
) +
geom_boxplot(
width = .12,
outlier.color = NA
) +
geom_half_point(
## draw on the left side
side = "l",
## horizontal lines instead of points
shape = 95,
## no jitter
range_scale = 0,
size = 10,
alpha = .2
) +
## remove white space
coord_cartesian(
xlim = c(
1.2,
NA
)
)
Or with jittered points:
ggplot(
data,
aes(
x = group,
y = value
)
) +
stat_halfeye(
adjust = .5,
width = .6,
justification = -.2,
.width = 0,
point_colour = NA
) +
geom_boxplot(
width = .12,
outlier.color = NA
) +
geom_half_point(
## draw on the left side
side = "l",
## amount of jitter
range_scale = .4,
## transparency
alpha = .3
) +
coord_cartesian(
xlim = c(
1.2,
NA
)
)
So, a Raincloud plot gives you a much better overview than a box plot!