Now that the most common functions have been introduced, we will explain how modularization, or piping, works with dplyr. As mentioned above, one major advantage of dplyr is that operations can be piped, meaning they are divided into individual steps that are easy to follow.
Through piping, a dataset or tibble is passed from one operation to the next. This is done using the %>% operator. Let’s look at a first example:
pss <- pss %>%
  group_by(district) %>%
  mutate(
    wkhtotMean = mean(
      wkhtot,
      na.rm = TRUE
    )
  ) %>%
  ungroup()
First, we pass the loaded dataset pss into the pipe. Then we perform three operations on it: group_by(), mutate(), and ungroup(). Finally, we assign the modified object back to the original object using the assignment arrow <- (so we are overwriting it!).
This example calculates the average working hours by district and stores the result in the new variable wkhtotMean. Because the result is assigned back to pss, the new variable is saved in the dataset.
Since we are passing the dataset along, we do not need to name it in each operation. The pipe operator does not have to be typed manually; in RStudio it can be inserted with [Ctrl] + [Shift] + [M] (Windows) or [Cmd] + [Shift] + [M] (Mac).
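To make the passing-on more concrete, the following sketch compares a nested call with its piped equivalent. The object toy and its values are made up for illustration and only stand in for pss.

```r
library(dplyr)

# Hypothetical toy data standing in for pss (values are made up)
toy <- tibble(
  district = c("A", "A", "B", "B"),
  wkhtot   = c(30, 40, 20, NA)
)

# Nested version: has to be read from the inside out
nested <- ungroup(
  mutate(
    group_by(toy, district),
    wkhtotMean = mean(wkhtot, na.rm = TRUE)
  )
)

# Piped version: the same three steps, read from top to bottom.
# x %>% f(y) is evaluated as f(x, y), so the data never has to be repeated.
piped <- toy %>%
  group_by(district) %>%
  mutate(wkhtotMean = mean(wkhtot, na.rm = TRUE)) %>%
  ungroup()

# Both versions produce the same district means
identical(nested$wkhtotMean, piped$wkhtotMean)
```

Both versions do the same work; the piped one simply lists the steps in the order in which they are carried out.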
Next, we want to perform various preparation steps using piping.
In this example, we want to create a new variable that groups the level of trust individuals have in politicians (trstplt). We want to differentiate between low, medium, and high trust.
pss <- pss %>%
  mutate(
    trstpltG = case_when(
      trstplt <= 3 ~ "low",
      trstplt > 3 & trstplt <= 6 ~ "medium",
      trstplt > 6 ~ "high"
    )
  )
table(pss$trstpltG)
##
## high low medium
## 844 1275 2870
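Two properties of case_when() are worth keeping in mind here: conditions are checked in order, and rows that match no condition become NA. The sketch below illustrates both with a made-up vector trust; the 0 to 10 range is an assumption for illustration, not something taken from pss.

```r
library(dplyr)

# Hypothetical trust scores (assumed 0-10 scale), including a missing value
trust <- c(0, 3, 4, 6, 7, 10, NA)

grouped <- case_when(
  trust <= 3 ~ "low",     # checked first
  trust <= 6 ~ "medium",  # only reached if the first condition was not TRUE
  trust > 6  ~ "high"
)

grouped
# "low" "low" "medium" "medium" "high" "high" NA

# table() drops NA by default; useNA = "ifany" makes missing values visible
table(grouped, useNA = "ifany")
```

Because the conditions are evaluated sequentially, the lower bound trstplt > 3 in the second condition is not strictly required, although writing it out can make the intent clearer. The counts above sum to 4,989 of the 5,000 rows, which would be consistent with a few missing values that table() leaves out.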
Checking the new variable reveals that it is stored as a character type.
str(pss$trstpltG)
## chr [1:5000] "medium" "medium" "medium" "medium" "medium" "low" "medium" ...
The type can be converted in a separate step afterwards, or the conversion can be written directly into the pipe:
pss <- pss %>%
  mutate(
    trstpltG = case_when(
      trstplt <= 3 ~ "low",
      trstplt > 3 & trstplt <= 6 ~ "medium",
      trstplt > 6 ~ "high"
    )
  ) %>%
  mutate(trstpltG = factor(trstpltG)) # step that converts character to factor
table(pss$trstpltG)
##
## high low medium
## 844 1275 2870
str(pss$trstpltG)
## Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 2 3 3 2 3 ...
Now we have a factor, but its levels are unordered and simply sorted alphabetically, which we can also fix:
pss <- pss %>%
  mutate(
    trstpltG = case_when(
      trstplt <= 3 ~ "low",
      trstplt > 3 & trstplt <= 6 ~ "medium",
      trstplt > 6 ~ "high"
    )
  ) %>%
  mutate(
    trstpltG = factor(
      trstpltG,
      ordered = TRUE,
      levels = c(
        "low",
        "medium",
        "high"
      )
    )
  )
table(pss$trstpltG)
##
## low medium high
## 1275 2870 844
str(pss$trstpltG)
## Ord.factor w/ 3 levels "low"<"medium"<..: 2 2 2 2 2 1 2 2 1 2 ...
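What the ordering buys us can be sketched with a small made-up vector x that stands in for trstpltG: only an ordered factor supports comparisons such as "at least medium".

```r
# Hypothetical character vector standing in for trstpltG
x <- c("medium", "low", "high", "medium")

unordered <- factor(x)
ordered_x <- factor(
  x,
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)

levels(unordered)      # alphabetical: "high" "low" "medium"
levels(ordered_x)      # as specified: "low" "medium" "high"

ordered_x >= "medium"  # TRUE FALSE TRUE TRUE
max(ordered_x)         # high
```

With the unordered factor, the same comparison is not meaningful: R returns NA and issues a warning.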
A more complex task: we want to calculate not only the average working hours per district but also each individual’s deviation from the average of their district.
pss <- pss %>%
  group_by(district) %>%
  mutate(
    wkhtotMean = mean(
      wkhtot,
      na.rm = TRUE
    ),
    wkhtotDist = wkhtot - wkhtotMean # we simply add this calculation of the distance
  ) %>%
  ungroup()
head(
  pss[,
    c(
      "district",
      "wkhtot",
      "wkhtotMean",
      "wkhtotDist"
    )
  ]
)
## # A tibble: 6 × 4
## district wkhtot wkhtotMean wkhtotDist
## <fct> <dbl> <dbl> <dbl>
## 1 Distrikt 1 34 31.8 2.19
## 2 Distrikt 1 20 31.8 -11.8
## 3 Distrikt 1 27 31.8 -4.81
## 4 Distrikt 1 30 31.8 -1.81
## 5 Distrikt 1 29 31.8 -2.81
## 6 Distrikt 1 30 31.8 -1.81
Here we can also see an advantage of the modular principle. When we calculate and pass on new variables (here wkhtotMean), we can directly use them in the subsequent operations.
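A quick way to check such a calculation is to verify that the deviations sum to zero within each district. The sketch below does this on made-up data; toy, toyDist, check, and sumDist are hypothetical names used only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for pss (values are made up)
toy <- tibble(
  district = c("A", "A", "B", "B"),
  wkhtot   = c(30, 40, 20, 24)
)

toyDist <- toy %>%
  group_by(district) %>%
  mutate(
    wkhtotMean = mean(wkhtot, na.rm = TRUE),  # created first ...
    wkhtotDist = wkhtot - wkhtotMean          # ... and used right away
  ) %>%
  ungroup()

# Sanity check: deviations from a group mean sum to zero within each group
check <- toyDist %>%
  group_by(district) %>%
  summarize(sumDist = sum(wkhtotDist))

check
```

On real data with decimal values, the sums may differ from zero by a tiny rounding error, and missing values in wkhtot would need na.rm = TRUE in sum() as well.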
Alternatively, we could group the data hierarchically by district and education (edu) and then simply output the group means using summarize():
meansDistriktEdu <- pss %>%
  group_by(
    district,
    edu
  ) %>%
  summarize(mean(wkhtot))
meansDistriktEdu
## # A tibble: 29 × 3
## # Groups: district [5]
## district edu `mean(wkhtot)`
## <fct> <fct> <dbl>
## 1 Distrikt 1 ES-ISCED I 33.8
## 2 Distrikt 1 ES-ISCED II 33.2
## 3 Distrikt 1 ES-ISCED III 32.4
## 4 Distrikt 1 ES-ISCED IV 31.8
## 5 Distrikt 1 ES-ISCED V 29.5
## 6 Distrikt 1 <NA> 31.0
## 7 Distrikt 5 ES-ISCED I 34.1
## 8 Distrikt 5 ES-ISCED II 33.6
## 9 Distrikt 5 ES-ISCED III 33.2
## 10 Distrikt 5 ES-ISCED IV 32.2
## # ℹ 19 more rows
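The output above has two small inconveniences: the column is named `mean(wkhtot)`, and the result is still grouped by district. A possible variant, sketched here on made-up data, names the summary column, skips missing values, and drops the grouping via the .groups argument of summarize(). The names toy, meansToy, and n are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for pss (values are made up)
toy <- tibble(
  district = c("A", "A", "A", "B", "B", "B"),
  edu      = c("I", "I", "II", "I", "I", "II"),
  wkhtot   = c(30, 40, 35, 20, NA, 24)
)

meansToy <- toy %>%
  group_by(district, edu) %>%
  summarize(
    wkhtotMean = mean(wkhtot, na.rm = TRUE),  # named column, NAs skipped
    n          = n(),                         # rows per group (including NA rows)
    .groups    = "drop"                       # return an ungrouped tibble
  )

meansToy
```

Unlike mutate(), which keeps every row, summarize() returns one row per group, here one per combination of district and edu.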
Let’s take a step back and see how we can subset datasets. This is especially relevant when working with secondary datasets, which often cover more individuals than one’s research question requires.