In addition to vectors and factors, one more object type matters for us: the data frame. A data frame combines multiple vectors (variables) of the same length into a single table. Unlike a matrix, its columns may have different types. In the conventional layout (wide format), the variables are in the columns and the respondents are in the rows.
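As a minimal sketch of this idea, the following R snippet combines three vectors of equal length into a data frame with `data.frame()`. The object name `toy` and all values are hypothetical and serve only as an illustration; they are not part of the course data.

```r
# Three vectors of the same length (hypothetical toy values)
id   <- c(1, 2, 3)
name <- c("Ann", "Ben", "Cem")
age  <- c(34, 51, 27)

# Combine the vectors column-wise into a data frame:
# each vector becomes one variable (column), each position one case (row)
toy <- data.frame(id, name, age)

toy             # print the whole table
nrow(toy)       # number of rows (cases)
ncol(toy)       # number of columns (variables)
class(toy$name) # each column keeps its own type
class(toy$age)
```

Because every column is still a vector, a single variable can be pulled out with `$`, as in `toy$age`.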
Let’s illustrate this with the dataset we will use throughout the course: the Panem Social Survey (pss). This is a training dataset based on the European Social Survey, but with significantly fewer variables and cases (only 10 cases and 4 variables):
idno | district | gndr | agea |
---|---|---|---|
10000 | Distrikt 1 | male | 41 |
10001 | Distrikt 1 | male | 65 |
10002 | Distrikt 1 | male | 48 |
10003 | Distrikt 1 | female | 49 |
10004 | Distrikt 1 | female | 48 |
10005 | Distrikt 1 | female | 64 |
10006 | Distrikt 1 | male | 63 |
10007 | Distrikt 1 | female | 70 |
10008 | Distrikt 1 | female | 80 |
10009 | Distrikt 1 | male | 57 |
In this example dataset, we have four variables: idno, district, gndr, and agea. These are self-explanatory: idno is the unique ID, district is the district of the respondent, gndr is the gender, and agea is the age. Variable names are often less intuitive than these, so you may need to consult a codebook. Handling larger datasets will be covered in the next learning block.
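To make the structure concrete, here is a hedged sketch that rebuilds the ten rows of the table by hand and inspects them in R. Storing district and gndr as factors and idno and agea as numbers is an assumption made for this illustration; the actual course dataset may use different types.

```r
# Rebuild the example table by hand (values copied from the table;
# the column types are an assumption for this sketch)
pss <- data.frame(
  idno     = 10000:10009,
  district = factor(rep("Distrikt 1", 10)),
  gndr     = factor(c("male", "male", "male", "female", "female",
                      "female", "male", "female", "female", "male")),
  agea     = c(41, 65, 48, 49, 48, 64, 63, 70, 80, 57)
)

dim(pss)        # number of rows (cases) and columns (variables)
names(pss)      # variable names
str(pss)        # type and first values of each variable
head(pss, 3)    # first three respondents

pss$agea        # a single variable, returned as a vector
table(pss$gndr) # frequency of each gender category
```

Functions such as `str()` and `names()` give a quick overview of a data frame, but they do not explain what a variable means, which is where the codebook comes in.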
Let’s go to the final challenge!