Missing values

Often, before starting the actual data analysis, it is important to inspect the data and especially check for missing values. There are two comprehensive packages that build on ggplot2 for this purpose: naniar and UpSetR.

First, we need to install or load the packages:

install.packages("UpSetR")
install.packages("naniar")

library("UpSetR")
library("naniar")

Next, we will continue using the dataset pss. Our goal is to visualize the missing values per variable. To do this, we first filter the dataset for the ID variable and the following four variables: stfdem, stfeco, trstprl, and trstlgl. Then, we convert the dataset into a long format and create a third column indicating whether it is a missing value or not. We then group by variables and the new third column, count the missing values per variable (as well as the non-missing values), exclude the non-missing values, and sort the table in descending order. This will show us the number of missing values in each of the four variables.

missingValues <- pss %>%
  select(
    c(
      "idno", 
      "stfdem",
      "stfeco", 
      "trstprl", 
      "trstlgl"
    )
  ) %>% 
  pivot_longer(
    everything(),
    names_to = "variable",
    values_to = "val"
  ) %>%
  mutate(is.missing = is.na(val)) %>%
  group_by(
    variable, 
    is.missing
  ) %>%
  summarize(
    num.missing = n()
  ) %>%
  filter(is.missing == TRUE) %>%
  select(-is.missing) %>%
  arrange(desc(num.missing))

missingValues
## # A tibble: 4 × 2
## # Groups:   variable [4]
##   variable num.missing
##   <chr>          <int>
## 1 stfdem            95
## 2 trstprl           35
## 3 trstlgl           12
## 4 stfeco             6

Next, you can display a simple bar plot with this new dataset:

missingValues %>%
  ggplot() +
  geom_bar(
    aes(
      variable, 
      num.missing
    ), 
    stat = 'identity'
  ) +
  labs(
    x = 'Variable',
    y = "Anzahl MV", 
    title = 'Missing Values pro Variable'
  ) +
  theme(
    axis.text.x = element_text(
      angle = 45, 
      hjust = 1
    )
  )

Here you can test all the tricks from above. However, we now want to display percentages to see how much percent of the variable missings there are. To do this, we repeat the steps from before, but add a mutate step that gives us the percentages.

#Prozente
missingValues <-  pss %>%
  select(
    c(
      "idno", 
      "stfdem",
      "stfeco", 
      "trstprl", 
      "trstlgl"
    )
  ) %>% 
  pivot_longer(
               everything(),
               names_to = "key",
               values_to = "val"
               ) %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(
           key, 
           total,
           isna
           ) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)

levels <- (
           missingValues  %>% 
             filter(isna == T) %>%     
             arrange(desc(pct))
           )$key

missingsPercent <- missingValues %>%
  ggplot() +
  geom_bar(
           aes(
               x = reorder(
                           key, 
                           desc(pct)
                           ), 
               y = pct, 
               fill = isna
               ), 
           stat = 'identity',
           alpha = 0.8) +
  scale_x_discrete(limits = levels) +
  scale_fill_manual(
                    name = "", 
                    values = c(
                               'steelblue', 
                               'tomato3'
                               ), 
                    labels = c(
                               "vorhanden",
                               "fehlend"
                               )
                    ) +
  coord_flip() +
  labs(
       title = "Prozent von missing values", 
       x = 'Variable', 
       y = "% missing values"
       )

missingsPercent
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

Alternatively, you can also display the missings in a way that shows which case is missing on which variable. However, this quickly becomes confusing with large datasets.

# pro Fall (wird aber bei großen Datensätzen etwas schwer zu lesen)
rawPlot <-  pss %>%
  select(
    c(
      "idno", 
      "stfdem",
      "stfeco", 
      "trstprl", 
      "trstlgl"
    )
  ) %>% 
  pivot_longer(
               -c("idno"),
               names_to = "key",
               values_to = "val"
               ) %>%
  mutate(isna = is.na(val)) %>%
  ggplot(
         aes(
             key, 
             idno, 
             fill = isna)
        ) +
  geom_raster(alpha = 0.8) +
  scale_fill_manual(
                    name = "",
                    values = c(
                               "steelblue", 
                               "tomato3"
                               ),
                    labels = c(
                               "vorhanden",
                               "fehlend"
                               )
                    ) +
  scale_x_discrete(limits = levels) +
  labs(
       x = "Variable",
       y = "Row Number",
       title = "Missing values in rows"
       ) +
  coord_flip()

rawPlot

Let’s move on to the libraries naniar and UpSetR!