Piping

Now that the most common functions have been introduced, we will explain how modularization, or piping, works with dplyr. As mentioned above, one major advantage of dplyr is that operations are piped: a task is broken down into individual steps that are easy to follow.

With piping, the result of one operation (a dataset or tibble) is passed on to the next operation. This is done using the %>% operator. Let’s look at a first example:

pss <- pss %>%
  group_by(district) %>%
  mutate(
    wkhtotMean = mean(
      wkhtot, 
      na.rm = TRUE
    )
  ) %>%
  ungroup()

First, we pass the loaded dataset pss into the pipe. Then we perform three operations on it: group_by(), mutate(), and ungroup(). Finally, we assign the modified result back to the original object using the assignment arrow <- (so we are overwriting it!).

This example calculates the average working hours by district and stores them in the new variable wkhtotMean. To keep the new variable in the dataset, the result of the whole pipe is assigned back to pss with the assignment arrow in the first line.

Since we are passing the dataset along, we do not need to name it in each operation. The pipe operator does not have to be typed manually; in RStudio it can be inserted with [Ctrl] + [Shift] + [M] (Windows) or [Cmd] + [Shift] + [M] (Mac).
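To make the idea of passing the dataset along more concrete, the following sketch compares nested function calls with the piped version. The tibble toy and its values are invented for illustration and only stand in for pss; the object names nested and piped are likewise hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for pss (values invented for illustration)
toy <- tibble(
  district = c("A", "A", "B", "B"),
  wkhtot   = c(30, 40, 20, NA)
)

# Nested calls: have to be read from the inside out
nested <- ungroup(
  mutate(
    group_by(toy, district),
    wkhtotMean = mean(wkhtot, na.rm = TRUE)
  )
)

# Piped calls: read from top to bottom; the result of each step
# becomes the first argument of the next function
piped <- toy %>%
  group_by(district) %>%
  mutate(wkhtotMean = mean(wkhtot, na.rm = TRUE)) %>%
  ungroup()

# Both versions should yield the same district means
identical(nested$wkhtotMean, piped$wkhtotMean)
```

Both versions perform the same three operations; the piped version simply lists them in the order in which they are carried out.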

Next, we want to perform various preparation steps using piping.

Calculating and Recoding Variables

In this example, we want to create a new variable that distinguishes the level of trust individuals have in politicians (trstplt). We want to differentiate between low, medium, and high trust.

pss <- pss %>%
  mutate(
    trstpltG = case_when(
      trstplt <= 3 ~ "low", 
      trstplt > 3 & trstplt <= 6 ~ "medium", 
      trstplt > 6 ~ "high"
    )
  )
table(pss$trstpltG)

## 
##   high    low medium 
##    844   1275   2870
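The counts above add up to 4,989, while the variable has 5,000 entries. Presumably the remaining cases have a missing value in trstplt: case_when() returns NA when none of the conditions is met, and table() omits NA by default. The following sketch illustrates this on a small invented vector (toy is a hypothetical object, not part of pss).

```r
library(dplyr)

# Hypothetical toy data with one missing value (invented for illustration)
toy <- tibble(trstplt = c(0, 3, 4, 6, 7, 10, NA))

toy <- toy %>%
  mutate(
    trstpltG = case_when(
      trstplt <= 3 ~ "low",
      trstplt > 3 & trstplt <= 6 ~ "medium",
      trstplt > 6 ~ "high"
      # no condition matches a missing trstplt, so the result is NA
    )
  )

# table() drops NA by default; useNA = "ifany" makes missing values visible
table(toy$trstpltG, useNA = "ifany")
```

Adding useNA = "ifany" to the table() call is a quick way to check whether cases were lost during recoding.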

Checking the new variable reveals that it is stored as a character type.

str(pss$trstpltG)

##  chr [1:5000] "medium" "medium" "medium" "medium" "medium" "low" "medium" ...

The variable can be converted to a factor in a separate step afterwards, or the conversion can be written directly into the pipe:

pss <- pss %>%
  mutate(
    trstpltG = case_when(
      trstplt <= 3 ~ "low", 
      trstplt > 3 & trstplt <= 6 ~ "medium", 
      trstplt > 6 ~ "high"
    )
  ) %>% 
  mutate(trstpltG = factor(trstpltG)) # step that converts the character variable to a factor

table(pss$trstpltG)

## 
##   high    low medium 
##    844   1275   2870

str(pss$trstpltG)

##  Factor w/ 3 levels "high","low","medium": 3 3 3 3 3 2 3 3 2 3 ...

Now we have a factor, but its levels are unordered (they are simply sorted alphabetically). We can fix this as well:

pss <- pss %>%
  mutate(
    trstpltG = case_when(
      trstplt <= 3 ~ "low", 
      trstplt > 3 & trstplt <= 6 ~ "medium", 
      trstplt > 6 ~ "high"
    )
  ) %>% 
  mutate(
    trstpltG = factor(
      trstpltG,
      ordered = TRUE, 
      levels = c(
        "low", 
        "medium", 
        "high"
      )
    ) 
  )

table(pss$trstpltG)

## 
##    low medium   high 
##   1275   2870    844

str(pss$trstpltG)

##  Ord.factor w/ 3 levels "low"<"medium"<..: 2 2 2 2 2 1 2 2 1 2 ...
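What the ordering is good for can be sketched with a short base R example. The vector trust is invented for illustration; it uses the same three levels as trstpltG.

```r
# Hypothetical example vector with the same levels as trstpltG
trust <- factor(
  c("medium", "low", "high", "medium"),
  ordered = TRUE,
  levels = c("low", "medium", "high")
)

levels(trust)       # levels in the order we specified
trust >= "medium"   # comparisons are defined for ordered factors
max(trust)          # highest category present in the data
```

With an unordered factor, a comparison such as >= is not meaningful: R returns NA and issues a warning.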

A more complex task: we want to calculate not only the average working hours per district but also each individual’s deviation from the average of their district.

pss <- pss %>%
  group_by(district) %>%
  mutate(
    wkhtotMean = mean(
      wkhtot, 
      na.rm = TRUE
    ),
    wkhtotDist = wkhtot - wkhtotMean  # we simply add this calculation of the distance from the district mean
  ) %>%
  ungroup()

head(
  pss[, 
      c(
        "district", 
        "wkhtot",
        "wkhtotMean",
        "wkhtotDist"
      )
  ]
)

## # A tibble: 6 × 4
##   district   wkhtot wkhtotMean wkhtotDist
##   <fct>       <dbl>      <dbl>      <dbl>
## 1 Distrikt 1     34       31.8       2.19
## 2 Distrikt 1     20       31.8     -11.8 
## 3 Distrikt 1     27       31.8      -4.81
## 4 Distrikt 1     30       31.8      -1.81
## 5 Distrikt 1     29       31.8      -2.81
## 6 Distrikt 1     30       31.8      -1.81

Here we can also see an advantage of the modular principle: once a new variable has been created (here wkhtotMean), it can be used directly in the subsequent operations, even within the same mutate() call.
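A simple way to check such a calculation is to verify that the deviations average to zero within each district. The sketch below does this on invented toy data; toy and meanDist are hypothetical names used only for this illustration.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  district = c("A", "A", "B", "B", "B"),
  wkhtot   = c(30, 40, 20, 25, 30)
)

toy <- toy %>%
  group_by(district) %>%
  mutate(
    wkhtotMean = mean(wkhtot, na.rm = TRUE),
    wkhtotDist = wkhtot - wkhtotMean  # uses the column created one line above
  ) %>%
  ungroup()

# Sanity check: within each district the deviations should average to zero
toy %>%
  group_by(district) %>%
  summarize(meanDist = mean(wkhtotDist, na.rm = TRUE))
```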

Alternatively, we could group the data hierarchically by district and education (edu) and then output the group means using summarize():

meansDistriktEdu <- pss %>%
  group_by(
    district,
    edu
  ) %>%
  summarize(mean(wkhtot)) 

meansDistriktEdu

## # A tibble: 29 × 3
## # Groups:   district [5]
##    district   edu          `mean(wkhtot)`
##    <fct>      <fct>                 <dbl>
##  1 Distrikt 1 ES-ISCED I             33.8
##  2 Distrikt 1 ES-ISCED II            33.2
##  3 Distrikt 1 ES-ISCED III           32.4
##  4 Distrikt 1 ES-ISCED IV            31.8
##  5 Distrikt 1 ES-ISCED V             29.5
##  6 Distrikt 1 <NA>                   31.0
##  7 Distrikt 5 ES-ISCED I             34.1
##  8 Distrikt 5 ES-ISCED II            33.6
##  9 Distrikt 5 ES-ISCED III           33.2
## 10 Distrikt 5 ES-ISCED IV            32.2
## # ℹ 19 more rows
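Two details of this output are worth noting: the new column carries the unwieldy name `mean(wkhtot)`, and the header line "Groups: district [5]" shows that the result is still grouped by district. The following sketch, again on invented toy data, shows one possible way to handle both by naming the summary column and setting the .groups argument of summarize(). The objects toy and meansToy and the column n are assumptions made for this illustration.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  district = c("A", "A", "A", "B", "B", "B"),
  edu      = c("I", "I", "II", "I", "II", "II"),
  wkhtot   = c(30, NA, 35, 20, 25, 35)
)

meansToy <- toy %>%
  group_by(district, edu) %>%
  summarize(
    wkhtotMean = mean(wkhtot, na.rm = TRUE),  # named column instead of `mean(wkhtot)`
    n = n(),                                  # number of rows per group
    .groups = "drop"                          # return an ungrouped tibble
  )

meansToy
```

Without na.rm = TRUE, a single missing value in wkhtot would turn the mean of the whole group into NA.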

Let’s take a step back and see how we can divide datasets. This is especially relevant when working with secondary datasets, as they often cover more individuals than are needed for one’s own research purposes.