Study unit 2 > Tidyverse > Tidyverse - dplyr > Most Common Functions

Most Common Functions

Prepare Data

In most cases, preparatory work on the dataset is necessary to perform analysis steps. These steps are often of great importance and can be quite time-consuming.

There are several packages that allow for data manipulation. One of the most well-known and commonly used packages due to its clarity is dplyr. The idea behind dplyr is that it processes only tables (data frames or tibbles) and facilitates processing through what is known as piping. The code in dplyr is structured into a long chain, which may initially seem unnecessary. However, this makes the code easy to read, and only a few functions need to be learned, which together can cover the most common processing steps. By breaking down the process into steps, each function has a well-defined scope that is easy to understand. By combining multiple functions, more complex steps can also be performed. Additionally, functions from the dplyr package always return a tibble (table).

The extensive documentation for dplyr can be found here.

Most Common Functions

The most common functions used with dplyr are the following:

Funktion	Operation
select()	Spalte(n) wählen
slice()	Zeile(n) wählen
filter()	Zeile(n) filtern
arrange()	Zeile(n) ordnen
mutate()	neue Spalten / Variablen
summarize()	Werte zusammenfassen
group_by()	gruppieren

These will now be introduced to you step by step before we learn about piping, the second principle of dplyr. These examples are somewhat artificial, as dplyr is almost exclusively used with piping. However, for understanding the functions, it is helpful to get to know them first.

In all examples, we use a fictional dataset that surveyed the study motivation (mot) of students at the universities of Gießen, Marburg, and Frankfurt.

First, let’s get an overview of the dataset here:

The dataset includes questions about satisfaction with democracy (stfdem), district (district), trust in parliament (trstprl), and gender (gndr), among others.

The package dplyr can be loaded individually, but it is recommended to load the package tidyverse, which also loads additional packages like ggplot2 directly.

# install.packages(
#   "tidyverse", 
#   dependencies = TRUE
# )
library("tidyverse")

select()

With the select() function, you can choose one or more columns from a dataset. The output is always a tibble.

In this example, we only want to display the variables gender (gndr) and district (district).

head(pss)

##    idno   district   gndr agea         edu wkhtot     income stfdem stfeco
## 1 10000 Distrikt 1   male   41 ES-ISCED IV     34 7th decile      7      6
## 2 10001 Distrikt 1   male   65 ES-ISCED II     20 6th decile      8      7
## 3 10002 Distrikt 1   male   48 ES-ISCED IV     27 7th decile      6      6
## 4 10003 Distrikt 1 female   49  ES-ISCED V     30 6th decile      5      4
## 5 10004 Distrikt 1 female   48 ES-ISCED IV     29 5th decile      4      5
## 6 10005 Distrikt 1 female   64  ES-ISCED V     30 6th decile      6      6
##   trstprl trstprt trstplt trstlgl lrscale
## 1       3       5       4       6       4
## 2       5       5       5       4       3
## 3       4       4       6       5       6
## 4       2       7       4       3       6
## 5       6       6       6       6       2
## 6       1       3       2       4       7

select(
  pss,
  c(
    gndr, 
    district
  )
)

slice()

On the other hand, with slice(), you can display individual rows. For example, rows \(50\) to \(55\), or in combination with the seq() function, every \(100^{th}\) row.

slice(
  pss,
  50:55
)

##    idno   district   gndr agea          edu wkhtot     income stfdem stfeco
## 1 10049 Distrikt 1 female   65  ES-ISCED II     30 4th decile      5      7
## 2 10050 Distrikt 1 female   47  ES-ISCED IV     30 5th decile      5      2
## 3 10051 Distrikt 1 female   57 ES-ISCED III     21 5th decile      4      5
## 4 10052 Distrikt 1   male   42  ES-ISCED IV     27 9th decile      3      5
## 5 10053 Distrikt 1 female   47 ES-ISCED III     38 5th decile      5      5
## 6 10054 Distrikt 1 female   43 ES-ISCED III     42 2nd decile     10      6
##   trstprl trstprt trstplt trstlgl lrscale
## 1       6       5       6       7       7
## 2       2       5       7       4       5
## 3       5       6       2       4       3
## 4       5       4       5       5       5
## 5       4       3       0       5       2
## 6       8       5       7       4       3

slice(
  pss, 
  seq(
    0, 
    1000,
    100
  )
)

##     idno   district   gndr agea          edu wkhtot     income stfdem stfeco
## 1  10099 Distrikt 1 female   53         <NA>     41 7th decile      5      6
## 2  10199 Distrikt 1   male   58 ES-ISCED III     39 5th decile      6      5
## 3  10299 Distrikt 1 female   45 ES-ISCED III     29 5th decile      4      6
## 4  10399 Distrikt 1 female   63 ES-ISCED III     37 5th decile      7      6
## 5  10499 Distrikt 1 female   56  ES-ISCED IV     38 6th decile      5      5
## 6  10600 Distrikt 1   male   71  ES-ISCED II     44 5th decile      7      7
## 7  10700 Distrikt 1   male   70 ES-ISCED III     30 6th decile      4      6
## 8  10800 Distrikt 1   male   75  ES-ISCED II     48 5th decile      8      6
## 9  10900 Distrikt 1   male   67  ES-ISCED II     48 3rd decile      4      4
## 10 11000 Distrikt 1 female   69 ES-ISCED III     41 5th decile      4      5
##    trstprl trstprt trstplt trstlgl lrscale
## 1        1       4       4       8       8
## 2        6       4       4       5       8
## 3        7       3       6       2       5
## 4        4       5       6       6       8
## 5        3       5       4       7       6
## 6        5       4       4       4       7
## 7        3       7       7       3       8
## 8        5      10       4       5       6
## 9        3       3       4       3       5
## 10       6       3       5       4       7

filter()

In addition to selecting specific rows (cases) or columns (variables), we can narrow down the dataset with filter(). For example, we can display only the cases that live in district 1.

filter(
  pss, 
  district == "Distrikt 1"
)

We can also introduce multiple conditions. For example, only people living in District 5 and are male (male).

filter(
  pss,
  district == "Distrikt 5" & gndr == "male"
)

All known logical connections work here as well. As a reminder, here are the logical connectors:

logical and: &
logical or: |
logical equal: ==
logical not equal: !=
logical greater: >
logical less: <
logical less than or equal to: <=
logical greater than or equal to: >=

arrange()

To order data sets, the arrange() function can be used. You can sort either in ascending or descending order. For example, by working hours:

pssAsc <- arrange(
  pss, 
  wkhtot
)
head(pssAsc)

##    idno   district   gndr agea         edu wkhtot     income stfdem stfeco
## 1 20438 Distrikt 5   male   37 ES-ISCED II      6 6th decile      4      4
## 2 10078 Distrikt 1   male   54 ES-ISCED IV      7 9th decile      6      4
## 3 20249 Distrikt 5   male   48 ES-ISCED IV      7 9th decile      5      5
## 4 10072 Distrikt 1   male   52 ES-ISCED IV      8 9th decile      2      5
## 5 10757 Distrikt 1   male   37 ES-ISCED IV      9 8th decile      4      5
## 6 20103 Distrikt 5 female   25 ES-ISCED IV      9 6th decile      3      0
##   trstprl trstprt trstplt trstlgl lrscale
## 1       6       5       4       3       3
## 2       1       7       3       7       9
## 3       4       5       4       6       6
## 4       3       1       4       2       7
## 5       4       6       4       5       7
## 6       1       5       3       4       1

Using the desc() function within arrange(), cases are sorted in descending order. desc stands for descending. Alternatively, you can simply put a minus sign before the variable name to get cases sorted in descending order as well.

pssDesc <- arrange(
  pss,
  desc(wkhtot)
)

head(pssDesc)

##    idno    district   gndr agea          edu wkhtot     income stfdem stfeco
## 1 40446 Distrikt 10   male   63         <NA>     65 1st decile      5      6
## 2 50618 Distrikt 12 female   51         <NA>     63 2nd decile      5      6
## 3 50494 Distrikt 12 female   NA  ES-ISCED II     62 2nd decile      3      4
## 4 50491 Distrikt 12 female   NA ES-ISCED III     60 4th decile      5      4
## 5 20294  Distrikt 5   male   59 ES-ISCED III     59 6th decile      7      8
## 6 40525 Distrikt 10   male   40  ES-ISCED II     59 4th decile      3      7
##   trstprl trstprt trstplt trstlgl lrscale
## 1       2       2       9       5       7
## 2       5       4       6       4       5
## 3       3       3       4       2       1
## 4       3       6       6       6       3
## 5       6       4       7       3       5
## 6       1       3       4       1       4

# Alternativ Minuszeichen vor Variable
pssDesc2 <- arrange(
  pss, 
  -wkhtot
)

mutate()

To calculate new variables or recode a variable, mutate() is used. In this example, we calculate a variable that outputs the difference from the average working hours within our survey.

mutate(
  pss, 
  wkhtotCen = wkhtot - mean(wkhtot, na.rm = TRUE)
)

If we want to create new dummy or categorical variables, we need to use the case_when() function additionally. In the case_when() function, you hierarchically specify which condition to check and how to recode. For example, we could recode the variable district and turn the character variable into an integer variable. It is important that the checks are done hierarchically, similar to an if statement, from the most specific condition to the most general condition (otherwise, there will be nonsensical variables!).

mutate(
  pss,
  districtRec = case_when(
    district == "Distrikt 1" ~ 1, 
    district == "Distrikt 5" ~ 5, 
    district == "Distrikt 7" ~ 7,
    district == "Distrikt 10" ~ 10, 
    district == "Distrikt 12" ~ 12,
  )
)

Multiple conditions can also be linked here: Now we want to calculate a dummy variable that indicates whether people live in District 12 and are female (female).

mutate(
  pss, 
  d12gndr = case_when(
    district == "Distrikt 12" & gndr == "female" ~ 1
  )
)

In this example, we have only specified one condition for the value 1 on the new variable. As we can see, all other cases are automatically assigned NA. However, we want to assign the value 0 to all other cases. To avoid having to write out the different combinations, we use another argument of the case_when() function:

mutate(
  pss, 
  d12gndr = case_when(
    district == "Distrikt 12" & gndr == "female" ~ 1,
    TRUE ~ 0
  )
)

With the argument TRUE ~ 0, we specify that all other values will be assigned the value 0. This way, we do not need to write out all other combinations as code.

summarize()

The summarize() function allows us to summarize a value from columns. For example, the mean of a column.

summarize(
  pss, 
  mean(wkhtot)
)

##   mean(wkhtot)
## 1      34.3008

Any function that requires a column as input can be used: including first(), last(), nth(), n(), n_distinct(), IQR(), min(), max(), mean(), median(), var(), and sd().

group_by()

Often in datasets, we have categorical variables by which we want to group the dataset. For example, we could group the dataset by field of study. We do this with the group_by() function:

group_by(
  pss,
  gndr
)

Apparently, nothing has changed in the dataset, but here is the important difference from arrange(): group_by() does not sort the dataset, but rather groups the dataset. Therefore, the output does not change. If, for example, we want to display the mean semester by field of study, we concatenate group_by() and summarize() (we will do this more elegantly below with the Piping Operator):

summarize(
  group_by(
    pss, 
    gndr
  ), 
  mean(wkhtot)
)

## # A tibble: 2 × 2
##   gndr   `mean(wkhtot)`
##   <fct>           <dbl>
## 1 female           34.5
## 2 male             34.1

Here we can see what group_by() does: Instead of one mean, we now get four means here (one for each course of study in the dataset). Important: Grouping should always be resolved subsequently with the function ungroup().