naniar & UpSetR

The naniar package makes the steps shown above much faster and easier. Most of its plotting functions return ggplot objects, so the adjustments learned above also apply here. We start with naniar's tabular summaries. The function miss_var_summary() returns the absolute frequency (n_miss) and the relative frequency in percent (pct_miss) of missing values for each variable.

library(naniar)  # missing-data summaries and plots
library(dplyr)   # provides %>% and group_by()
pss %>%
  miss_var_summary()
## # A tibble: 14 × 3
##    variable n_miss pct_miss
##    <chr>     <int>    <num>
##  1 edu         352     7.04
##  2 agea        157     3.14
##  3 stfdem       95     1.9 
##  4 trstprl      35     0.7 
##  5 trstprt      17     0.34
##  6 trstlgl      12     0.24
##  7 trstplt      11     0.22
##  8 lrscale       7     0.14
##  9 stfeco        6     0.12
## 10 idno          0     0   
## 11 district      0     0   
## 12 gndr          0     0   
## 13 wkhtot        0     0   
## 14 income        0     0
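
To make the two columns concrete, here is a minimal base R sketch of the same computation. It is not naniar's implementation; the data frame `toy` and its columns are hypothetical and stand in for pss.

```r
# Hypothetical toy data; `toy` and its columns are not part of pss.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)

n_miss   <- colSums(is.na(toy))          # absolute frequency per column
pct_miss <- 100 * colMeans(is.na(toy))   # relative frequency in percent

summary_tbl <- data.frame(
  variable = names(n_miss),
  n_miss   = unname(n_miss),
  pct_miss = unname(pct_miss)
)

# Sort by number of missing values, largest first
summary_tbl <- summary_tbl[order(-summary_tbl$n_miss), ]
summary_tbl
```

The sorting step mirrors the output above, where the variable with the most missing values comes first.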

The summary can also be computed per group, here by district:

pss %>% 
  group_by(district) %>% 
  miss_var_summary()
## # A tibble: 65 × 4
## # Groups:   district [5]
##    district   variable n_miss pct_miss
##    <fct>      <chr>     <int>    <num>
##  1 Distrikt 1 edu          30      3  
##  2 Distrikt 1 stfdem       16      1.6
##  3 Distrikt 1 trstprl       8      0.8
##  4 Distrikt 1 trstprt       4      0.4
##  5 Distrikt 1 lrscale       2      0.2
##  6 Distrikt 1 trstlgl       1      0.1
##  7 Distrikt 1 idno          0      0  
##  8 Distrikt 1 gndr          0      0  
##  9 Distrikt 1 agea          0      0  
## 10 Distrikt 1 wkhtot        0      0  
## # ℹ 55 more rows
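
The grouped summary can be sketched in base R as well. This is only an illustration under assumed names (`toy`, `group`, `a`, `b`), not what naniar does internally; it returns one row per group instead of one row per group and variable.

```r
# Hypothetical toy data; `toy`, `group`, `a`, and `b` are illustrative names.
toy <- data.frame(
  group = c("g1", "g1", "g2", "g2", "g2"),
  a     = c(NA, 1, NA, NA, 5),
  b     = c(1, 2, 3, NA, 5)
)

vars <- setdiff(names(toy), "group")   # variables to summarise
miss <- is.na(toy[vars])               # logical matrix: TRUE where a value is missing

# Absolute number of missing values per group and variable
n_miss_by_group <- aggregate(miss, by = list(group = toy$group), FUN = sum)

# Share of missing values (in percent) per group and variable
pct_miss_by_group <- aggregate(
  miss,
  by  = list(group = toy$group),
  FUN = function(x) 100 * mean(x)
)

n_miss_by_group
pct_miss_by_group
```

Comparing the percentages rather than the counts is usually more informative when the groups differ in size.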

Next, we can display how the missing values are distributed across the dataset. The gg_miss_var_cumsum() function plots the cumulative sum of missing values over the variables, so we can see how the missing values accumulate from one variable to the next.

gg_miss_var_cumsum(pss)
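
The quantity behind this plot can be approximated by hand. The sketch below assumes that the variables are accumulated in column order and uses a hypothetical data frame `toy`; it shows the idea, not naniar's code.

```r
# Hypothetical toy data; `toy` and its columns are not part of pss.
toy <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("x", "y", NA, "z"),
  c = 1:4
)

n_miss   <- colSums(is.na(toy))  # missing values per variable
cum_miss <- cumsum(n_miss)       # running total across variables, in column order
cum_miss
```

A flat segment in the running total marks a variable without missing values; a steep step marks a variable that contributes many.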

The function vis_miss() visualizes the missing values of an entire dataset (unless we specify a subset).

vis_miss(pss)

Another appealing alternative is the function gg_miss_upset() from naniar, which builds on the UpSetR package. It also displays how often each combination of missing values across variables occurs. With very large datasets this can quickly become overwhelming, but it can be insightful for subsets (e.g., when checking whether individuals answered an item battery only partially or not at all).

gg_miss_upset(pss)

The graph covers the five variables with the most missing values: edu, agea, stfdem, trstprl, and trstprt. The following combinations occur (each case is counted in exactly one combination):

  • 311 cases with missing values in edu,
  • 148 cases with missing values in agea,
  • 82 cases with missing values in stfdem,
  • 30 cases with missing values in trstprl,
  • 13 cases with missing values in trstprt,
  • 9 cases with missing values in both stfdem and edu,
  • 5 cases with missing values in both agea and edu,
  • 4 cases with missing values in both trstprl and edu,
  • 3 cases with missing values in both stfdem and agea,
  • 3 cases with missing values in both trstprt and edu,
  • 1 case with missing values in both trstprl and stfdem,
  • 1 case with missing values in both trstprt and agea.

In general, the maximum number of combinations is \(2^{k} - 1\), where \(k\) is the number of variables. With the five variables shown here, \(2^{5} - 1 = 31\) combinations are possible, but only 12 are displayed. Why?
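
The counting can be tried out on a small example. The following base R sketch uses a hypothetical data frame `toy` with three variables; it tabulates the missingness patterns that occur in the data and compares their number with the theoretical maximum.

```r
# Hypothetical toy data; `toy` and its columns are not part of pss.
toy <- data.frame(
  a = c(NA, 1, NA, 4, 5),
  b = c(1, NA, NA, 4, 5),
  c = c(1, 2, 3, 4, NA)
)

miss <- is.na(toy)  # logical matrix: TRUE where a value is missing

# Label each row by the variables that are missing in it
pattern <- apply(miss, 1, function(r) paste(colnames(miss)[r], collapse = " & "))
pattern <- pattern[pattern != ""]  # drop rows without any missing value

observed     <- table(pattern)     # frequency of each pattern that occurs
max_patterns <- 2^ncol(toy) - 1    # theoretical maximum for 3 variables

observed
length(observed)
max_patterns
```

Each row falls into exactly one pattern, which is the same exclusive counting used in the list above.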

Additionally, missing values of two variables can be easily visualized in a ggplot using the function geom_miss_point():

ggplot(
  pss,
  aes(
    x = district,
    y = agea
  )
) +
  geom_miss_point() 

This allows for easy identification of any potential clustering of missing values in specific combinations.
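
geom_miss_point() keeps cases with missing values in the plot by placing them slightly below the observed range of the variable. The base R sketch below illustrates that idea with a fixed offset of 10% of the range below the minimum; the vector `y` is hypothetical, and the exact offset and any jitter that naniar applies are not reproduced here.

```r
# Hypothetical values with two missing entries
y <- c(20, NA, 35, 50, NA)

rng   <- range(y, na.rm = TRUE)    # observed minimum and maximum
below <- rng[1] - 0.1 * diff(rng)  # position just below the observed minimum

# Missing values get the substitute position; observed values stay unchanged
y_plot <- ifelse(is.na(y), below, y)
status <- ifelse(is.na(y), "Missing", "Not Missing")

data.frame(y, y_plot, status)
```

The `status` column plays the role of the colour legend in the plot: shifted points are marked as missing, so they are not mistaken for real observations.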

Alternatively, the functions gg_miss_var() and gg_miss_fct() can also be used.

The function gg_miss_var() displays the number of missing values per variable. With the facet argument, the plot is split by the categories of a grouping variable, which helps to identify whether one group has considerably more missing values than another.

gg_miss_var(
  pss,
  facet = district
)

The function gg_miss_fct() draws a heatmap of the share of missing values in each variable, broken down by the levels of a factor (here district).

gg_miss_fct(
  x = pss, 
  fct = district
)

Because the result is a ggplot object, further layers such as a title can be added, which helps when comparing groups for strong differences:

gg_miss_fct(
  x = pss,
  fct = district
) +
  labs(title = "NA in PSS by district")

That’s it for missing values!