The naniar package makes the steps shown above much faster and easier. Most of its plotting functions return a ggplot object, so the adjustments learned above also apply here. We start with the naniar functions that produce tables. The first is miss_var_summary(), which returns the absolute and relative frequency (in percent) of missing values for each variable.
pss %>%
miss_var_summary()
## # A tibble: 14 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 edu 352 7.04
## 2 agea 157 3.14
## 3 stfdem 95 1.9
## 4 trstprl 35 0.7
## 5 trstprt 17 0.34
## 6 trstlgl 12 0.24
## 7 trstplt 11 0.22
## 8 lrscale 7 0.14
## 9 stfeco 6 0.12
## 10 idno 0 0
## 11 district 0 0
## 12 gndr 0 0
## 13 wkhtot 0 0
## 14 income 0 0
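To make the two columns concrete, here is a minimal sketch that rebuilds them by hand in base R. It uses the built-in airquality data as a stand-in (an assumption for illustration, since pss is not available here), and manual_summary is a made-up name.

```r
# Rebuild the n_miss and pct_miss columns by hand.
# airquality is a stand-in dataset (assumption); the text itself uses pss.
n_miss   <- colSums(is.na(airquality))        # absolute frequency per variable
pct_miss <- 100 * n_miss / nrow(airquality)   # relative frequency in percent

manual_summary <- data.frame(
  variable  = names(n_miss),
  n_miss    = as.integer(n_miss),
  pct_miss  = round(pct_miss, 2),
  row.names = NULL
)

# Sort by number of missing values, largest first
manual_summary <- manual_summary[order(-manual_summary$n_miss), ]
manual_summary
```

The relative frequency is simply the count of missing values divided by the number of rows, expressed in percent.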
We can also compute this summary by group:
pss %>%
group_by(district) %>%
miss_var_summary()
## # A tibble: 65 × 4
## # Groups: district [5]
## district variable n_miss pct_miss
## <fct> <chr> <int> <num>
## 1 Distrikt 1 edu 30 3
## 2 Distrikt 1 stfdem 16 1.6
## 3 Distrikt 1 trstprl 8 0.8
## 4 Distrikt 1 trstprt 4 0.4
## 5 Distrikt 1 lrscale 2 0.2
## 6 Distrikt 1 trstlgl 1 0.1
## 7 Distrikt 1 idno 0 0
## 8 Distrikt 1 gndr 0 0
## 9 Distrikt 1 agea 0 0
## 10 Distrikt 1 wkhtot 0 0
## # ℹ 55 more rows
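Because the grouped result is an ordinary tibble, it can be processed further with dplyr. The following sketch assumes the built-in airquality data with Month as the grouping variable (a stand-in for pss and district) and keeps only the combinations that actually contain missing values.

```r
library(dplyr)
library(naniar)

# Stand-in data (assumption): airquality grouped by Month instead of pss by district
by_month <- airquality %>%
  group_by(Month) %>%
  miss_var_summary()

# Keep only group/variable combinations with at least one missing value
by_month %>%
  filter(n_miss > 0) %>%
  arrange(desc(pct_miss))
```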
Next, we can plot how the missing values are distributed in the dataset. The gg_miss_var_cumsum() function shows the cumulative sum of missing values across the variables. This way, we can see how the missing values are spread over the variables.
gg_miss_var_cumsum(pss)
The function vis_miss() visualizes the missing values of an entire dataset (or of a subset, if we pass one).
vis_miss(pss)
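Restricting the plot to a subset can be done by selecting columns before the call. This is a small sketch on the built-in airquality data (an assumption for illustration); the object name p_subset is made up.

```r
library(dplyr)
library(naniar)

# Stand-in data (assumption): plot only three columns of airquality
p_subset <- airquality %>%
  select(Ozone, Solar.R, Wind) %>%
  vis_miss()

p_subset
```

With pss, the same pattern would apply to any selection of columns, for example the trust items.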
Another appealing alternative is the naniar function gg_miss_upset(). It also displays how often each combination of missing values across variables occurs. However, this can quickly become overwhelming with very large datasets. It is most insightful for subsets (e.g., when checking whether respondents answered an item battery only partially or not at all).
gg_miss_upset(pss)
The graph covers the five variables edu, agea, stfdem, trstprl, and trstprt (by default, gg_miss_upset() shows only the five variables with the most missing values). The following combinations exist:
edu
agea
stfdem
trstprl
trstprt
stfdem and edu
agea and edu
trstprl and edu
stfdem and agea
trstprt and edu
trstprl and stfdem
trstprt and agea
Overall, the maximum number of combinations is calculated as follows: \(2^{\text{number of variables}} - 1\). In this case, there would be 31 possible combinations, but only 12 are displayed. Why?
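One way to explore this question is to count the combinations that actually occur and compare them with the theoretical maximum. The sketch below uses base R only; the toy data frame df and its columns a, b, and c are made up for illustration and are not part of pss.

```r
# Toy data (hypothetical, not the pss data): three variables with missing values
df <- data.frame(
  a = c(NA, 1, 2, NA, 5, 6),
  b = c(1, NA, 3, NA, 5, 6),
  c = c(1, 2, 3, 4, NA, 6)
)

# For each row, paste together the names of the variables that are missing
pattern <- apply(is.na(df), 1, function(r) paste(names(df)[r], collapse = " & "))

# Drop rows without any missing value (empty pattern)
pattern <- pattern[pattern != ""]

n_possible <- 2^ncol(df) - 1           # theoretical maximum of combinations
n_observed <- length(unique(pattern))  # combinations that occur in the data

table(pattern)
c(possible = n_possible, observed = n_observed)
```

A combination is only counted if at least one row shows exactly that pattern of missing values.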
Additionally, missing values in two variables can easily be visualized in a ggplot using the function geom_miss_point():
ggplot(
pss,
aes(
x = district,
y = agea
)
) +
geom_miss_point()
This allows for easy identification of any potential clustering of missing values in specific combinations.
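As a further illustration, here is a sketch with two numeric variables from the built-in airquality data (an assumption, used in place of pss). geom_miss_point() places missing values slightly below the observed minimum of the respective axis, so they stay visible in the scatterplot; the facet by Month and the name p_points are choices made for this example.

```r
library(ggplot2)
library(naniar)

# Stand-in data (assumption): Ozone and Solar.R both contain missing values
p_points <- ggplot(airquality, aes(x = Ozone, y = Solar.R)) +
  geom_miss_point() +        # missing values are shifted below the axis minimum
  facet_wrap(~Month) +       # one panel per month to spot clustering by group
  labs(title = "Missing values in Ozone and Solar.R by month")

p_points
```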
Alternatively, the functions gg_miss_var() and gg_miss_fct() can also be used.
The function gg_miss_var() displays the number of missing values per variable. With the facet argument, this can be broken down by the categories of a grouping variable. This helps to identify whether one group has considerably more missing values than another.
gg_miss_var(
pss,
facet = district
)
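Groups of different size are easier to compare in percentages than in counts. The following sketch uses the show_pct argument of gg_miss_var() on the built-in airquality data, with Month as the facet (both are stand-ins for pss and district).

```r
library(naniar)

# Stand-in data (assumption): share of missing values per variable, by month
p_var <- gg_miss_var(
  airquality,
  facet = Month,
  show_pct = TRUE   # plot percentages instead of absolute counts
)

p_var
```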
The function gg_miss_fct() draws a heatmap of the missing values in each variable, broken down by the levels of a factor.
gg_miss_fct(
x = pss,
fct = district
)
To check for strong group differences along another variable, pass that variable to fct. Because the result is a ggplot object, you can also add layers such as a title:
gg_miss_fct(
  x = pss,
  fct = district
) +
  labs(title = "Missing values in PSS by district")
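A sketch of the same call with a different grouping variable, using the riskfactors example data that ships with naniar and its marital variable (an assumption for illustration; neither is part of pss):

```r
library(ggplot2)
library(naniar)

# Stand-in data (assumption): riskfactors from naniar, grouped by marital status
p_fct <- gg_miss_fct(
  x = riskfactors,
  fct = marital
) +
  labs(title = "Missing values in riskfactors by marital status")

p_fct
```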
That’s it for missing values!