Rainclouds

Raincloud plots show not only the summary statistics of a box plot but also the distribution of the raw values. This matters because very different distributions can produce the same box plot. The following example comes from a blog post by Cédric Scherer, which explains very well (in English) why box plots can be misleading.

Code to create the fictitious dataset
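A minimal sketch of such a dataset, assuming made-up values rather than those from the blog post: three groups that differ in size and shape but share the same five-number summary, so their box plots come out identical. The group names and sizes are assumptions chosen for illustration.

```r
# Hypothetical stand-in data (not the dataset from the blog post).
# All three groups share the five-number summary 0, 5, 10, 15, 20.
data <- data.frame(
  group = rep(c("Group 1", "Group 2", "Group 3"), times = c(101, 201, 25)),
  value = c(
    # Group 1: evenly spread values
    seq(0, 20, length.out = 101),
    # Group 2: two tight clusters around 5 and 15
    0, seq(4.5, 5.5, length.out = 99), 10, seq(14.5, 15.5, length.out = 99), 20,
    # Group 3: only five distinct values, repeated five times each
    rep(seq(0, 20, by = 5), times = 5)
  )
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per group
five_num <- tapply(data$value, data$group, fivenum)
print(five_num)

# Case numbers per group
print(table(data$group))
```

Because the summaries match, a box plot alone cannot tell these groups apart, even though one is evenly spread, one is bimodal, and one has only a handful of distinct values.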

We have a fictional dataset that results in the following box plot. At first glance, the box plot's statistics suggest that the groups are similar. Let's now add the case numbers:
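One possible way to draw the box plot with case numbers is sketched here. It assumes ggplot2 is installed, uses hypothetical stand-in data (not the blog post's dataset), and introduces a hypothetical helper n_label() that is not part of any package.

```r
library(ggplot2)

# Hypothetical stand-in data: three groups with identical box plot statistics
data <- data.frame(
  group = rep(c("Group 1", "Group 2", "Group 3"), times = c(101, 201, 25)),
  value = c(
    seq(0, 20, length.out = 101),
    0, seq(4.5, 5.5, length.out = 99), 10, seq(14.5, 15.5, length.out = 99), 20,
    rep(seq(0, 20, by = 5), times = 5)
  )
)

# Hypothetical helper: label position and text for one group
n_label <- function(x) {
  data.frame(y = max(x) + 1, label = paste("n =", length(x)))
}

p <- ggplot(data, aes(x = group, y = value)) +
  geom_boxplot(fill = "grey92") +
  # one text label per group, placed just above the largest value
  stat_summary(geom = "text", fun.data = n_label)

p
```

stat_summary() calls the helper once per group, so the label always reflects the number of rows that went into each box.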

Even though the same box plot is generated, the case numbers per group are very different. Now let’s add the raw values (geom_point()):

ggplot(
  data, 
  aes(
    x = group, 
    y = value
  )
) +
  geom_boxplot(fill = "grey92") +
  geom_point(
    size = 2,
    alpha = .3,
    position = position_jitter(
      seed = 1,
      width = .2
    )
  )

It is immediately clear that the distributions of the groups are not as similar as the box plots suggested.

A raincloud plot combines this information by plotting both the raw data and the distribution. To do this, install and load the ggdist package:

install.packages("ggdist")
library("ggdist")

Use stat_halfeye() from ggdist to display the distribution. Let’s take this first step:

ggplot(data, aes(x = group, y = value)) + 
  ## add half-violin from {ggdist} package
  stat_halfeye(
    ## custom bandwidth
    adjust = .5, 
    ## adjust height
    width = .6, 
    ## move geom to the right
    justification = -.2, 
    ## remove slab interval
    .width = 0, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .12, 
    ## remove outliers
    outlier.color = NA ## `outlier.shape = NA` works as well
  )

Now you can see the distribution alongside the box plot. However, the raw data points were also very helpful for understanding the data visually. To add them, use stat_dots() from ggdist:

ggplot(
  data, 
  aes(
    x = group, 
    y = value
  )
) + 
  stat_halfeye(
    adjust = .5, 
    width = .6, 
    justification = -.2, 
    .width = 0, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .12, 
    outlier.color = NA 
  ) +
  ## add raw data points
  stat_dots(
    ## direction in which the dots stack up; try "right" as well
    side = "left", 
    ## shift the dots slightly away from geom_boxplot()
    justification = 1.1, 
    ## size of the dots
    binwidth = .25
  )

Finally, you can remove the white space by limiting the x-axis:

ggplot(
  data, 
  aes(
    x = group, 
    y = value
  )
) + 
  stat_halfeye(
    adjust = .5, 
    width = .6, 
    justification = -.2, 
    .width = 0, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .12, 
    outlier.color = NA 
  ) +
  ## add raw data points
  stat_dots(
    ## direction in which the dots stack up; try "right" as well
    side = "left", 
    ## shift the dots slightly away from geom_boxplot()
    justification = 1.1, 
    ## size of the dots
    binwidth = .25
  ) + 
  ## remove white space on the left
  coord_cartesian(
    xlim = c(
      1.2,
      NA
    )
  )

When you have many data points and groups, the view with stat_dots() can become overwhelming. As an alternative, geom_half_point() from the gghalves package works well; install and load gghalves the same way as ggdist.

## geom_half_point() comes from the gghalves package
library("gghalves")

ggplot(
  data, 
  aes(
    x = group, 
    y = value
  )
) + 
  stat_halfeye(
    adjust = .5, 
    width = .6, 
    justification = -.2, 
    .width = 0, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .12, 
    outlier.color = NA 
  ) +
  geom_half_point(
    ## draw on the left side
    side = "l", 
    ## horizontal lines instead of points
    shape = 95,
    ## no jittering
    range_scale = 0,
    size = 10, 
    alpha = .2
  ) + 
  ## remove white space on the left
  coord_cartesian(
    xlim = c(
      1.2,
      NA
    )
  )

Or with jittered points:

ggplot(
  data, 
  aes(
    x = group, 
    y = value
  )
) + 
  stat_halfeye(
    adjust = .5, 
    width = .6, 
    justification = -.2, 
    .width = 0, 
    point_colour = NA
  ) + 
  geom_boxplot(
    width = .12, 
    outlier.color = NA 
  ) +
  geom_half_point(
    ## draw on the left side
    side = "l", 
    ## amount of jittering
    range_scale = .4, 
    ## transparency
    alpha = .3
  ) + 
  ## remove white space on the left
  coord_cartesian(
    xlim = c(
      1.2,
      NA
    )
  )

So, a Raincloud plot gives you a much better overview than a box plot!
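The examples above repeat the same stat_halfeye(), geom_boxplot(), and coord_cartesian() calls. One possible way to avoid the repetition is to store the shared layers in a list and add that list to each plot. This is only a sketch: it assumes ggplot2 and ggdist are installed, the name raincloud_base is hypothetical, and the data is a made-up stand-in rather than the blog post's dataset.

```r
library(ggplot2)
library(ggdist)

# Hypothetical stand-in data: three groups with identical box plot statistics
data <- data.frame(
  group = rep(c("Group 1", "Group 2", "Group 3"), times = c(101, 201, 25)),
  value = c(
    seq(0, 20, length.out = 101),
    0, seq(4.5, 5.5, length.out = 99), 10, seq(14.5, 15.5, length.out = 99), 20,
    rep(seq(0, 20, by = 5), times = 5)
  )
)

# Shared layers (hypothetical name); settings copied from the examples above
raincloud_base <- list(
  stat_halfeye(
    adjust = .5, width = .6, justification = -.2,
    .width = 0, point_colour = NA
  ),
  geom_boxplot(width = .12, outlier.color = NA),
  coord_cartesian(xlim = c(1.2, NA))
)

# Add the shared layers, then choose how to show the raw data
p <- ggplot(data, aes(x = group, y = value)) +
  raincloud_base +
  stat_dots(side = "left", justification = 1.1, binwidth = .25)

p
```

Swapping the last layer for a different raw-data layer then changes only one line per plot.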