In the next step, we assume that the data were collected separately for each district, so there are five partial datasets that now need to be combined into one complete dataset. For this we use the function bind_rows(). In our example, all five partial datasets have exactly the same number of variables, and those variables also have exactly the same names. With the .id argument we create a variable named "origin" that records which partial dataset each case comes from. This variable is numbered automatically. With mutate() we turn it into a factor with more descriptive labels (process1, process2, process3, process4, process5).
pssAll <- pss1 %>%
  bind_rows(
    list(
      pss5,
      pss7,
      pss10,
      pss12
    ),
    .id = "origin"
  ) %>%
  mutate(
    origin = factor(
      origin,
      labels = c(
        "process1",
        "process2",
        "process3",
        "process4",
        "process5"
      )
    )
  )
table(pssAll$origin)
##
## process1 process2 process3 process4 process5
##     1000     1000     1000     1000     1000
head(pssAll$origin)
## [1] process1 process1 process1 process1 process1 process1
## Levels: process1 process2 process3 process4 process5
We have now built one complete dataset from the five partial datasets, containing all cases from each of them. Importantly, all variable names were identical in this case.
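As a variation on the step above, the labels can also be supplied directly by naming the elements of the list that is passed to bind_rows(). The following is a minimal sketch with toy data; part1 and part5 are hypothetical stand-ins, not the pss files used in this tutorial.

```r
library(dplyr)

# Hypothetical toy stand-ins for two partial datasets
part1 <- data.frame(
  district = c("Distrikt 1", "Distrikt 1"),
  gndr = c("male", "female")
)
part5 <- data.frame(
  district = c("Distrikt 5", "Distrikt 5"),
  gndr = c("female", "male")
)

# When the list elements are named, bind_rows() uses those names
# as the values of the .id column instead of a running number
combined <- bind_rows(
  list(process1 = part1, process2 = part5),
  .id = "origin"
)

table(combined$origin)
```

In this sketch origin is a character column; wrapping it in factor() inside mutate() would turn it into a factor, as in the example above.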
Next, we try out what happens when a variable name in one of the partial datasets contains a typo. First, we create two new datasets with three cases each, whose variable names differ slightly.
pssA <- pss[1:3, 2:3]
District <- c(
  "Distrikt 1",
  "Distrikt 5",
  "Distrikt 7"
)
gndr <- c(
  "male",
  "female",
  "female"
)
pssB <- data.frame(
  District,
  gndr
)
head(pssA)
##     district gndr
## 1 Distrikt 1 male
## 2 Distrikt 1 male
## 3 Distrikt 1 male
head(pssB)
##     District   gndr
## 1 Distrikt 1   male
## 2 Distrikt 5 female
## 3 Distrikt 7 female
Both datasets contain two variables indicating district and gender. In dataset pssB, however, the district variable is spelled differently (District, with a capital D). Let's try bind_rows().
pssTest <- pssA %>%
  bind_rows(pssB)
pssTest
##     district   gndr   District
## 1 Distrikt 1   male       <NA>
## 2 Distrikt 1   male       <NA>
## 3 Distrikt 1   male       <NA>
## 4       <NA>   male Distrikt 1
## 5       <NA> female Distrikt 5
## 6       <NA> female Distrikt 7
Since the variable names are not exactly the same, three variables are created: district, gndr, and District. Wherever a variable is missing for a case, NA is filled in automatically. This can be useful, but it becomes tricky if the datasets were not created with a strict variable naming scheme. Solution: check and rename variables beforehand. Otherwise, you can use full_join().
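The renaming solution mentioned above can be sketched as follows. This is a minimal example with toy data that mirrors the structure of pssA and pssB; toyA and toyB are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same structure as pssA and pssB
toyA <- data.frame(
  district = rep("Distrikt 1", 3),
  gndr = rep("male", 3)
)
toyB <- data.frame(
  District = c("Distrikt 1", "Distrikt 5", "Distrikt 7"),
  gndr = c("male", "female", "female")
)

# rename(new_name = old_name) harmonises the spelling before stacking
stacked <- toyA %>%
  bind_rows(toyB %>% rename(district = District))

names(stacked)
nrow(stacked)
```

After renaming, only the two expected columns remain and all six cases are kept. The alternative, full_join(), is described next.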
Using full_join() is no less complex than renaming columns, but it offers an alternative. With full_join(), we combine two datasets and specify in the by argument which columns hold the same content. The drawback is that columns that already share a name must be listed as well; otherwise, variables such as gndr.x and gndr.y would be created. This is because full_join() is really intended to add new or additional variables.
In our example, we specify that the column district from dataset pssA corresponds to the column District from dataset pssB. The same applies to the gndr variable.
pssTest2 <- pssA %>%
  full_join(
    pssB,
    by = c(
      "district" = "District",
      "gndr" = "gndr"
    )
  )
head(pssTest2)
##     district   gndr
## 1 Distrikt 1   male
## 2 Distrikt 1   male
## 3 Distrikt 1   male
## 4 Distrikt 5 female
## 5 Distrikt 7 female
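This is how you can combine datasets with different variable names. Note, however, that full_join() matches rows on the key columns rather than stacking them: the result has five rows instead of six, because the case "Distrikt 1 / male" from pssB matches the existing cases in pssA and is not appended as a new row.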
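To check whether a join has kept every case, it can help to compare row counts. The sketch below uses toy data with the same structure as pssA and pssB (toyA and toyB are hypothetical names) and compares the number of rows after full_join() with the number of cases in the two inputs.

```r
library(dplyr)

# Hypothetical toy data with the same structure as pssA and pssB
toyA <- data.frame(
  district = rep("Distrikt 1", 3),
  gndr = rep("male", 3)
)
toyB <- data.frame(
  District = c("Distrikt 1", "Distrikt 5", "Distrikt 7"),
  gndr = c("male", "female", "female")
)

# full_join() matches rows on the listed key columns
joined <- toyA %>%
  full_join(
    toyB,
    by = c("district" = "District", "gndr" = "gndr")
  )

# Compare the joined row count with the total number of input cases
nrow(joined)
nrow(toyA) + nrow(toyB)
```

If the two numbers differ, some cases were matched instead of appended. When every case should be kept as its own row, renaming the columns and using bind_rows() is probably the safer choice.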