Merging Datasets (Adding Cases)

In the next step, we assume that the data were collected separately for each district, so there are five sub-datasets that now need to be combined into one complete dataset. For this we use the function bind_rows(). In our example, all five sub-datasets have exactly the same number of variables, and these variables are also named identically. With the .id argument we create a variable called "origin" that records which sub-dataset each case comes from. Its values are numbered automatically. With mutate() we then turn it into a factor with more descriptive labels (process1, process2, process3, process4, process5).

pssAll <- pss1 %>%
  bind_rows(
    list(
      pss5,
      pss7, 
      pss10, 
      pss12
    ), 
    .id = "origin"
  ) %>%
  mutate(
    origin = factor(
      origin, 
      labels = c(
        "process1", 
        "process2", 
        "process3", 
        "process4",
        "process5"
      )
    )
  )

table(pssAll$origin)

## 
## process1 process2 process3 process4 process5 
##     1000     1000     1000     1000     1000

head(pssAll$origin)

## [1] process1 process1 process1 process1 process1 process1
## Levels: process1 process2 process3 process4 process5

We have now built one complete dataset from the five sub-datasets, containing all of their cases. Important: in this case, all variable names were identical!
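As a side note, the relabelling step can be skipped when the sub-datasets are passed as a named list, because bind_rows() then uses the list names as the values of the .id column. The following is only a minimal sketch: partA and partB are hypothetical toy data frames standing in for the district files, not the course data.

```r
library(dplyr)

# Hypothetical toy sub-datasets (assumption: not the real pss files)
partA <- data.frame(district = "Distrikt 1", gndr = c("male", "female"))
partB <- data.frame(district = "Distrikt 5", gndr = c("female", "male"))

# With a named list, .id takes the list names instead of running numbers
combined <- bind_rows(
  list(process1 = partA, process2 = partB),
  .id = "origin"
)

# Count the cases per sub-dataset
table(combined$origin)
```

Here origin is a character variable; if a factor is needed, it can still be converted with mutate(origin = factor(origin)).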

Next, let's see what happens if, for example, a variable name contains a typo in one of the sub-datasets. First, we create two new datasets with only three cases each, whose variable names differ slightly.

pssA <- pss[1:3, 2:3]

District <- c(
  "Distrikt 1", 
  "Distrikt 5",
  "Distrikt 7"
)

gndr <- c(
  "male",
  "female",
  "female"
)

pssB <- data.frame(
  District, 
  gndr
)

head(pssA)

##     district gndr
## 1 Distrikt 1 male
## 2 Distrikt 1 male
## 3 Distrikt 1 male

head(pssB)

##     District   gndr
## 1 Distrikt 1   male
## 2 Distrikt 5 female
## 3 Distrikt 7 female

So both datasets contain two variables, one for the district and one for gender. In pssB, however, the district variable is spelled differently (District instead of district). Let's try bind_rows().

pssTest <- pssA %>% 
  bind_rows(pssB)

pssTest

##     district   gndr   District
## 1 Distrikt 1   male       <NA>
## 2 Distrikt 1   male       <NA>
## 3 Distrikt 1   male       <NA>
## 4       <NA>   male Distrikt 1
## 5       <NA> female Distrikt 5
## 6       <NA> female Distrikt 7

Since the variable names are not exactly the same, the result has three variables: district, gndr, and District. Wherever a variable is missing in one of the datasets, NAs are filled in automatically. This can be useful, but it becomes tricky if the datasets were not created with a consistent naming scheme. Solution: check and rename the variables beforehand. Alternatively, you can use full_join().
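The renaming solution could look roughly like the sketch below. It rebuilds the two small datasets under hypothetical names (pssA_demo, pssB_demo) so that it runs on its own; with the course data you would apply rename() to pssB directly.

```r
library(dplyr)

# Stand-ins for pssA and pssB (assumption: same structure as shown above)
pssA_demo <- data.frame(
  district = rep("Distrikt 1", 3),
  gndr     = rep("male", 3)
)
pssB_demo <- data.frame(
  District = c("Distrikt 1", "Distrikt 5", "Distrikt 7"),
  gndr     = c("male", "female", "female")
)

# Harmonize the column name first (new name = old name), then stack the cases
pssFixed <- pssA_demo %>%
  bind_rows(
    pssB_demo %>% rename(district = District)
  )

pssFixed
```

Because both datasets now share the same column names, the result keeps all six cases in two columns and contains no NAs.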

Merging Datasets (Different Column Names)

This approach is no less effort than renaming the columns, but it offers an alternative. With full_join(), we combine two datasets and specify in the by argument which columns correspond to each other. The drawback is that columns with identical names must be listed as well; otherwise, any shared column left out of by (here, gndr) is duplicated as gndr.x and gndr.y. This is because full_join() is actually intended to add new variables, not new cases.

In our example, we would specify that from dataset pssA, the column district is equal to the column District from dataset pssB. The same applies to the gndr variable.

pssTest2 <- pssA %>% 
  full_join(
    pssB,
    by = c(
      "district" = "District", 
      "gndr" = "gndr"
    )
  )

head(pssTest2)

##     district   gndr
## 1 Distrikt 1   male
## 2 Distrikt 1   male
## 3 Distrikt 1   male
## 4 Distrikt 5 female
## 5 Distrikt 7 female

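This is how you can merge datasets with different variable names. Note, however, that full_join() matches rows on the key columns instead of simply stacking them: the result has five rows rather than six, because the "Distrikt 1"/"male" case from pssB matched the existing cases in pssA and was not appended as a new row. If every case has to be kept, rename the columns and use bind_rows().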
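To illustrate the drawback mentioned above, the following sketch leaves gndr out of the by argument. It again uses hypothetical stand-ins (pssA_demo, pssB_demo) with the same structure as pssA and pssB, so it can be run independently of the course data.

```r
library(dplyr)

# Stand-ins for pssA and pssB (assumption: same structure as shown above)
pssA_demo <- data.frame(
  district = rep("Distrikt 1", 3),
  gndr     = rep("male", 3)
)
pssB_demo <- data.frame(
  District = c("Distrikt 1", "Distrikt 5", "Distrikt 7"),
  gndr     = c("male", "female", "female")
)

# Only the district columns are declared as keys; gndr is not listed
pssPartial <- pssA_demo %>%
  full_join(
    pssB_demo,
    by = c("district" = "District")
  )

# gndr now appears twice, once per input dataset
names(pssPartial)
# → "district" "gndr.x" "gndr.y"
```

The suffixes .x and .y mark the left and right dataset. Listing every shared column in by, as in the example above, avoids the duplication.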