Start: Scatterplot

We use scatterplots to display the relationship between two (pseudo-)metric variables. The points are drawn with the geom_point() function.

Often, we only have pseudo-metric variables, but we can still visualize them with scatterplots. Here we use trstplt and trstprt. If you don’t remember what these variables stand for, check the codebook!

scatter <- ggplot(
  pss, 
  aes(
    trstplt, 
    trstprt
  )
) + 
  geom_point()

scatter

## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
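
The warning says that 28 rows could not be drawn, typically because at least one of the two variables is missing in those rows. A minimal sketch of how such rows can be counted, using a small made-up data frame (the object toy and its values are hypothetical and only stand in for pss):

```r
# Hypothetical toy data standing in for pss; the values are made up.
toy <- data.frame(
  trstplt = c(3, 5, NA, 7, 2),
  trstprt = c(4, NA, NA, 6, 2)
)

# Number of rows with a missing value in at least one of the two variables
n_missing <- sum(!complete.cases(toy[, c("trstplt", "trstprt")]))
n_missing
```

Run on pss instead of toy, the same call should give the number reported in the warning, assuming no values fall outside the axis limits.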

Jitter

Because pseudo-metric variables usually take only integer values, many observations share the same pair of values and are plotted on top of each other. To make the individual data points visible, we add a small amount of random noise to their positions. We use the geom_jitter() function for this:

scatter <- scatter +
  geom_jitter(
    width = 0.3,
    height = 0.3
  )

scatter

## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).

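In geom_jitter(), the arguments width and height specify how far the data points may be moved horizontally and vertically. Try a few different values to see what works best. Note that the plot now contains two layers, the original points from geom_point() and the jittered points from geom_jitter(), which is why the warning appears twice.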
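
Because the jitter is random, the points land in slightly different places each time the plot is drawn. If a reproducible plot is needed, one option is to pass position_jitter() with a seed to geom_point(). The following is only a sketch: the data frame toy, the object p_jitter, and the seed value are assumptions for illustration, not part of the course material.

```r
library(ggplot2)

# Hypothetical toy data standing in for pss; the values are made up.
toy <- data.frame(
  trstplt = c(3, 3, 3, 5, 5, 7),
  trstprt = c(4, 4, 4, 6, 6, 2)
)

# position_jitter() takes the same width and height arguments plus a seed,
# so the random offsets are identical every time the plot is built.
p_jitter <- ggplot(toy, aes(trstplt, trstprt)) +
  geom_point(
    position = position_jitter(width = 0.3, height = 0.3, seed = 123)
  )

p_jitter
```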

Labels

Now we add labels and titles.

scatter +
  labs(
    x = "Trust in Politicians", 
    y = "Trust in Legal System", 
    title = "Trust Scatterplot"
  )

Next, we change the appearance of the title. Within the theme() function, plot.title = element_text() sets the title in bold (face), centers it (hjust = 0.5), and sets its font size (size). You will learn more about what the arguments do in Chapter 3!

scatter <- scatter +
  labs(
    x = "Trust in Politicians",
    y = "Trust in Legal System", 
    title = "Trust Scatterplot"
  ) +
  theme(
    plot.title = element_text(
      face = "bold", 
      hjust = 0.5, 
      size = 16
    )
  )
scatter

## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).

We also specify the data source. We do this using the labs() function and its caption argument:

scatter <- scatter + 
  labs(caption = "Data source: Panem Social Survey.")

scatter

## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).

Axes

Both variables take only integer values, but by default the axes do not show a tick mark at every integer. Let’s change that now by setting the breaks explicitly:

scatter <- scatter +
  scale_y_continuous(
    breaks = seq(
      0, 
      10, 
      1
    )
  ) + 
  scale_x_continuous(
    breaks = seq(
      0,
      10,
      1
    )
  )

scatter

## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
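
For reference, seq(0, 10, 1) simply generates the integers from 0 to 10, so every integer gets its own tick mark. A quick check in plain R (the object name breaks is only used for illustration):

```r
# Sequence from 0 to 10 in steps of 1
breaks <- seq(0, 10, 1)
breaks

# The same values as the shorthand 0:10
all(breaks == 0:10)
```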

Regression Line

We can also already plot a regression line for the relationship between the two variables. To do this, we use the geom_smooth() function. With method = "lm" we request a linear model, se = TRUE draws the confidence interval around the line, and color and fill set the colors of the line and of the interval band.

scatter + 
  geom_smooth(
    method = "lm", 
    se = TRUE, 
    color = "darkred",
    fill = "orange"
  )

## Warning: Removed 28 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Removed 28 rows containing missing values or values outside the scale range
## (`geom_point()`).
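
The line drawn by geom_smooth(method = "lm") corresponds to a linear regression of the y variable on the x variable. As a rough sketch, the same kind of model can be fitted directly with lm() to inspect the intercept and slope. The data frame toy and its values are made up for illustration and only stand in for pss:

```r
# Hypothetical toy data standing in for pss; the values are made up.
toy <- data.frame(
  trstplt = c(0, 2, 4, 6, 8, 10),
  trstprt = c(1, 2, 5, 5, 8, 9)
)

# Linear model of the y variable (trstprt) on the x variable (trstplt)
fit <- lm(trstprt ~ trstplt, data = toy)
coef(fit)  # intercept and slope of the regression line

# Fitted values with a 95% confidence interval at selected x values
# (95% is also the default level used by geom_smooth())
predict(
  fit,
  newdata = data.frame(trstplt = c(0, 5, 10)),
  interval = "confidence"
)
```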

Let’s continue: in the next step, you will add a grouping variable!