In most cases, preparatory work on the dataset is necessary to perform analysis steps. These steps are often of great importance and can be quite time-consuming.
There are several packages that allow for data manipulation. One of the most well-known and commonly used packages due to its clarity is dplyr
. The idea behind dplyr
is that it processes only tables (data frames or tibbles) and facilitates processing through what is known as piping. The code in dplyr
is structured into a long chain, which may initially seem unnecessary. However, this makes the code easy to read, and only a few functions need to be learned, which together can cover the most common processing steps. By breaking down the process into steps, each function has a well-defined scope that is easy to understand. By combining multiple functions, more complex steps can also be performed. Additionally, functions from the dplyr
package always return a tibble (table).
The extensive documentation for dplyr
can be found here.
The most common functions used with dplyr
are the following:
Funktion | Operation |
---|---|
select() | Spalte(n) wählen |
slice() | Zeile(n) wählen |
filter() | Zeile(n) filtern |
arrange() | Zeile(n) ordnen |
mutate() | neue Spalten / Variablen |
summarize() | Werte zusammenfassen |
group_by() | gruppieren |
These will now be introduced to you step by step before we learn about piping, the second principle of dplyr
. These examples are somewhat artificial, as dplyr
is almost exclusively used with piping. However, for understanding the functions, it is helpful to get to know them first.
In all examples, we use a fictional dataset that surveyed the study motivation (mot
) of students at the universities of Gießen, Marburg, and Frankfurt.
The dataset includes questions about satisfaction with democracy (stfdem
), district (district
), trust in parliament (trstprl
), and gender (gndr
), among others.
The package dplyr
can be loaded individually, but it is recommended to load the package tidyverse
, which also loads additional packages like ggplot2
directly.
# install.packages(
# "tidyverse",
# dependencies = TRUE
# )
library("tidyverse")
With the select()
function, you can choose one or more columns from a dataset. The output is always a tibble.
In this example, we only want to display the variables gender (gndr
) and district (district
).
head(pss)
## idno district gndr agea edu wkhtot income stfdem stfeco
## 1 10000 Distrikt 1 male 41 ES-ISCED IV 34 7th decile 7 6
## 2 10001 Distrikt 1 male 65 ES-ISCED II 20 6th decile 8 7
## 3 10002 Distrikt 1 male 48 ES-ISCED IV 27 7th decile 6 6
## 4 10003 Distrikt 1 female 49 ES-ISCED V 30 6th decile 5 4
## 5 10004 Distrikt 1 female 48 ES-ISCED IV 29 5th decile 4 5
## 6 10005 Distrikt 1 female 64 ES-ISCED V 30 6th decile 6 6
## trstprl trstprt trstplt trstlgl lrscale
## 1 3 5 4 6 4
## 2 5 5 5 4 3
## 3 4 4 6 5 6
## 4 2 7 4 3 6
## 5 6 6 6 6 2
## 6 1 3 2 4 7
select(
pss,
c(
gndr,
district
)
)
On the other hand, with slice()
, you can display individual rows. For example, rows \(50\) to \(55\), or in combination with the seq()
function, every \(100^{th}\) row.
slice(
pss,
50:55
)
## idno district gndr agea edu wkhtot income stfdem stfeco
## 1 10049 Distrikt 1 female 65 ES-ISCED II 30 4th decile 5 7
## 2 10050 Distrikt 1 female 47 ES-ISCED IV 30 5th decile 5 2
## 3 10051 Distrikt 1 female 57 ES-ISCED III 21 5th decile 4 5
## 4 10052 Distrikt 1 male 42 ES-ISCED IV 27 9th decile 3 5
## 5 10053 Distrikt 1 female 47 ES-ISCED III 38 5th decile 5 5
## 6 10054 Distrikt 1 female 43 ES-ISCED III 42 2nd decile 10 6
## trstprl trstprt trstplt trstlgl lrscale
## 1 6 5 6 7 7
## 2 2 5 7 4 5
## 3 5 6 2 4 3
## 4 5 4 5 5 5
## 5 4 3 0 5 2
## 6 8 5 7 4 3
slice(
pss,
seq(
0,
1000,
100
)
)
## idno district gndr agea edu wkhtot income stfdem stfeco
## 1 10099 Distrikt 1 female 53 <NA> 41 7th decile 5 6
## 2 10199 Distrikt 1 male 58 ES-ISCED III 39 5th decile 6 5
## 3 10299 Distrikt 1 female 45 ES-ISCED III 29 5th decile 4 6
## 4 10399 Distrikt 1 female 63 ES-ISCED III 37 5th decile 7 6
## 5 10499 Distrikt 1 female 56 ES-ISCED IV 38 6th decile 5 5
## 6 10600 Distrikt 1 male 71 ES-ISCED II 44 5th decile 7 7
## 7 10700 Distrikt 1 male 70 ES-ISCED III 30 6th decile 4 6
## 8 10800 Distrikt 1 male 75 ES-ISCED II 48 5th decile 8 6
## 9 10900 Distrikt 1 male 67 ES-ISCED II 48 3rd decile 4 4
## 10 11000 Distrikt 1 female 69 ES-ISCED III 41 5th decile 4 5
## trstprl trstprt trstplt trstlgl lrscale
## 1 1 4 4 8 8
## 2 6 4 4 5 8
## 3 7 3 6 2 5
## 4 4 5 6 6 8
## 5 3 5 4 7 6
## 6 5 4 4 4 7
## 7 3 7 7 3 8
## 8 5 10 4 5 6
## 9 3 3 4 3 5
## 10 6 3 5 4 7
In addition to selecting specific rows (cases) or columns (variables), we can narrow down the dataset with filter()
. For example, we can display only the cases that live in district 1.
filter(
pss,
district == "Distrikt 1"
)
We can also introduce multiple conditions. For example, only people living in District 5 and are male (male).
filter(
pss,
district == "Distrikt 5" & gndr == "male"
)
All known logical connections work here as well. As a reminder, here are the logical connectors:
logical and: &
logical or: |
logical equal: ==
logical not equal: !=
logical greater: >
logical less: <
logical less than or equal to: <=
logical greater than or equal to: >=
To order data sets, the arrange()
function can be used. You can sort either in ascending or descending order. For example, by working hours:
pssAsc <- arrange(
pss,
wkhtot
)
head(pssAsc)
## idno district gndr agea edu wkhtot income stfdem stfeco
## 1 20438 Distrikt 5 male 37 ES-ISCED II 6 6th decile 4 4
## 2 10078 Distrikt 1 male 54 ES-ISCED IV 7 9th decile 6 4
## 3 20249 Distrikt 5 male 48 ES-ISCED IV 7 9th decile 5 5
## 4 10072 Distrikt 1 male 52 ES-ISCED IV 8 9th decile 2 5
## 5 10757 Distrikt 1 male 37 ES-ISCED IV 9 8th decile 4 5
## 6 20103 Distrikt 5 female 25 ES-ISCED IV 9 6th decile 3 0
## trstprl trstprt trstplt trstlgl lrscale
## 1 6 5 4 3 3
## 2 1 7 3 7 9
## 3 4 5 4 6 6
## 4 3 1 4 2 7
## 5 4 6 4 5 7
## 6 1 5 3 4 1
Using the desc()
function within arrange()
, cases are sorted in descending order. desc
stands for descending. Alternatively, you can simply put a minus sign before the variable name to get cases sorted in descending order as well.
pssDesc <- arrange(
pss,
desc(wkhtot)
)
head(pssDesc)
## idno district gndr agea edu wkhtot income stfdem stfeco
## 1 40446 Distrikt 10 male 63 <NA> 65 1st decile 5 6
## 2 50618 Distrikt 12 female 51 <NA> 63 2nd decile 5 6
## 3 50494 Distrikt 12 female NA ES-ISCED II 62 2nd decile 3 4
## 4 50491 Distrikt 12 female NA ES-ISCED III 60 4th decile 5 4
## 5 20294 Distrikt 5 male 59 ES-ISCED III 59 6th decile 7 8
## 6 40525 Distrikt 10 male 40 ES-ISCED II 59 4th decile 3 7
## trstprl trstprt trstplt trstlgl lrscale
## 1 2 2 9 5 7
## 2 5 4 6 4 5
## 3 3 3 4 2 1
## 4 3 6 6 6 3
## 5 6 4 7 3 5
## 6 1 3 4 1 4
# Alternativ Minuszeichen vor Variable
pssDesc2 <- arrange(
pss,
-wkhtot
)
To calculate new variables or recode a variable, mutate()
is used. In this example, we calculate a variable that outputs the difference from the average working hours within our survey.
mutate(
pss,
wkhtotCen = wkhtot - mean(wkhtot, na.rm = TRUE)
)
If we want to create new dummy or categorical variables, we need to use the case_when()
function additionally. In the case_when()
function, you hierarchically specify which condition to check and how to recode. For example, we could recode the variable district
and turn the character variable into an integer variable. It is important that the checks are done hierarchically, similar to an if statement, from the most specific condition to the most general condition (otherwise, there will be nonsensical variables!).
mutate(
pss,
districtRec = case_when(
district == "Distrikt 1" ~ 1,
district == "Distrikt 5" ~ 5,
district == "Distrikt 7" ~ 7,
district == "Distrikt 10" ~ 10,
district == "Distrikt 12" ~ 12,
)
)
Multiple conditions can also be linked here: Now we want to calculate a dummy variable that indicates whether people live in District 12 and are female (female).
mutate(
pss,
d12gndr = case_when(
district == "Distrikt 12" & gndr == "female" ~ 1
)
)
In this example, we have only specified one condition for the value 1
on the new variable. As we can see, all other cases are automatically assigned NA
. However, we want to assign the value 0
to all other cases. To avoid having to write out the different combinations, we use another argument of the case_when()
function:
mutate(
pss,
d12gndr = case_when(
district == "Distrikt 12" & gndr == "female" ~ 1,
TRUE ~ 0
)
)
With the argument TRUE ~ 0
, we specify that all other values will be assigned the value 0
. This way, we do not need to write out all other combinations as code.
The summarize()
function allows us to summarize a value from columns. For example, the mean of a column.
summarize(
pss,
mean(wkhtot)
)
## mean(wkhtot)
## 1 34.3008
Any function that requires a column as input can be used: including first()
, last()
, nth()
, n()
, n_distinct()
, IQR()
, min()
, max()
, mean()
, median()
, var()
, and sd()
.
Often in datasets, we have categorical variables by which we want to group the dataset. For example, we could group the dataset by field of study. We do this with the group_by()
function:
group_by(
pss,
gndr
)
Apparently, nothing has changed in the dataset, but here is the important difference from arrange()
: group_by()
does not sort the dataset, but rather groups the dataset. Therefore, the output does not change. If, for example, we want to display the mean semester by field of study, we concatenate group_by()
and summarize()
(we will do this more elegantly below with the Piping Operator):
summarize(
group_by(
pss,
gndr
),
mean(wkhtot)
)
## # A tibble: 2 × 2
## gndr `mean(wkhtot)`
## <fct> <dbl>
## 1 female 34.5
## 2 male 34.1
Here we can see what group_by()
does: Instead of one mean, we now get four means here (one for each course of study in the dataset). Important: Grouping should always be resolved subsequently with the function ungroup()
.