
Multivariate Linear Regression with Polytomous Categorical Variables

We can add not only dichotomous but also polytomous categorical variables to the regression model. Now, we want to include the variable edu in the model. This variable represents the highest level of education attained by the surveyed individual. What theoretical assumptions could we make about the effect of edu?
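Before adding a categorical variable to a model, it can help to check how it is stored and in which order its categories appear. The following is a minimal sketch on a small made-up factor; `edu_toy` is a hypothetical stand-in for `pss$edu` and does not contain the real survey data.

```r
# Hypothetical toy factor standing in for pss$edu (NOT the real survey data)
edu_toy <- factor(
  c("ES-ISCED I", "ES-ISCED III", "ES-ISCED II", "ES-ISCED V",
    "ES-ISCED IV", "ES-ISCED III", NA),
  levels = c("ES-ISCED I", "ES-ISCED II", "ES-ISCED III",
             "ES-ISCED IV", "ES-ISCED V")
)

is.factor(edu_toy)               # lm() dummy-codes factors automatically
levels(edu_toy)                  # the first level is used as the reference category
table(edu_toy, useNA = "ifany")  # cases per category, including missing values
```

With the real data, the same three calls would be applied to `pss$edu`.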

Calculating the Model

We simply add the variable to the lm() function as before:

olsModel4 <- lm(
  stfdem ~ 1 + stfeco + trstlgl + gndr + edu,
  data = pss
)
summary(olsModel4)

## 
## Call:
## lm(formula = stfdem ~ 1 + stfeco + trstlgl + gndr + edu, data = pss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7867 -1.1246  0.0123  1.1391  5.8527 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.535964   0.112218   4.776 1.84e-06 ***
## stfeco           0.854641   0.014166  60.332  < 2e-16 ***
## trstlgl         -0.044393   0.013572  -3.271  0.00108 ** 
## gndrmale         0.001836   0.051229   0.036  0.97142    
## eduES-ISCED II   0.168395   0.076925   2.189  0.02864 *  
## eduES-ISCED III  0.343037   0.076832   4.465 8.21e-06 ***
## eduES-ISCED IV   0.419061   0.085739   4.888 1.06e-06 ***
## eduES-ISCED V    0.870502   0.125865   6.916 5.29e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.716 on 4542 degrees of freedom
##   (450 observations deleted due to missingness)
## Multiple R-squared:  0.4643,	Adjusted R-squared:  0.4635 
## F-statistic: 562.3 on 7 and 4542 DF,  p-value: < 2.2e-16

How did the lm() function incorporate the polytomous variable into the model?

What is the reference category?

And what would the regression equation look like?

lm() automatically converts edu into dummy variables and estimates the additional effect of each of the four higher categories (ES-ISCED II, ES-ISCED III, ES-ISCED IV, and ES-ISCED V) compared to the lowest category (ES-ISCED I).

The reference category is therefore ES-ISCED I (omitted category).
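The dummy coding behind this can be made visible with `contrasts()` and `model.matrix()`. The sketch below again uses a hypothetical toy factor (`edu_toy`), not the pss data: each non-reference category gets its own 0/1 column, and the reference category is the row in which all of these columns are 0.

```r
# Hypothetical toy factor with one observation per category (NOT the pss data)
edu_levels <- c("ES-ISCED I", "ES-ISCED II", "ES-ISCED III",
                "ES-ISCED IV", "ES-ISCED V")
edu_toy <- factor(edu_levels, levels = edu_levels)

# Coding scheme: rows are categories, columns are the dummy variables
contrasts(edu_toy)

# Design matrix as lm() would build it: intercept plus four dummy columns
dummies <- model.matrix(~ edu_toy)
dummies
```

The first row (ES-ISCED I) contains only zeros in the dummy columns, which is why its effect is absorbed into the intercept.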

\[\begin{align*}stfdem = &\beta_0 + \beta_1 \cdot stfeco + \beta_2 \cdot trstlgl + \beta_3 \cdot gndr + \\ &\beta_4 \cdot eduLevelII + \beta_5 \cdot eduLevelIII + \\ &\beta_6 \cdot eduLevelIV + \beta_7 \cdot eduLevelV + \\&e \end{align*}\]
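To see how the equation works in practice, it can be filled in by hand. The sketch below copies the estimates from the summary(olsModel4) output and computes the predicted value for one respondent. The respondent's values (female, stfeco = 5, trstlgl = 5, ES-ISCED V) are made up for illustration, and the short names in the vectors are hypothetical labels, not names used by lm().

```r
# Estimates copied from the summary(olsModel4) output
b <- c(intercept = 0.535964, stfeco = 0.854641, trstlgl = -0.044393,
       male = 0.001836, eduII = 0.168395, eduIII = 0.343037,
       eduIV = 0.419061, eduV = 0.870502)

# Hypothetical respondent: female, stfeco = 5, trstlgl = 5, ES-ISCED V
# Only the dummy of the respondent's own education category is 1
x <- c(intercept = 1, stfeco = 5, trstlgl = 5,
       male = 0, eduII = 0, eduIII = 0, eduIV = 0, eduV = 1)

# Predicted stfdem: sum of each coefficient times its value
y_hat <- sum(b * x)
y_hat
```

Setting `eduV` to 0 as well gives the prediction for the same respondent in the reference category ES-ISCED I; the two predictions differ by exactly the coefficient of ES-ISCED V.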

Changing the Reference Category

In this example, the lowest (first) category was automatically chosen as the reference category. But what if you want a different one, for example the middle category (ES-ISCED III)? You can change it with the relevel() function. The first argument is the factor itself (here pss$edu), and the second argument, ref, names the category that should become the reference (here "ES-ISCED III"). Important: relevel() returns a new factor and does not modify the dataset, so you must assign the result back to the variable in the dataset with the assignment arrow!

pss$edu <- relevel(
  pss$edu, 
  ref = "ES-ISCED III"
)
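What relevel() does, and why the assignment matters, can be tried out on a small example. This is a sketch with a hypothetical toy factor (`edu_toy`), not the pss data.

```r
# Hypothetical toy factor (NOT the pss data)
edu_levels <- c("ES-ISCED I", "ES-ISCED II", "ES-ISCED III",
                "ES-ISCED IV", "ES-ISCED V")
edu_toy <- factor(edu_levels, levels = edu_levels)

# Without assignment the result is only printed; edu_toy stays unchanged
relevel(edu_toy, ref = "ES-ISCED III")
first_before <- levels(edu_toy)[1]   # still "ES-ISCED I"

# With assignment the new level order is stored
edu_toy <- relevel(edu_toy, ref = "ES-ISCED III")
levels(edu_toy)                      # "ES-ISCED III" now comes first
```

relevel() only moves the chosen category to the first position; the remaining categories keep their relative order and the data values themselves are not changed.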

Then you just need to recalculate the model:

olsModel5 <- lm(
  stfdem ~ 1 + stfeco + trstlgl + gndr + edu,
  data = pss
)

summary(olsModel5)

## 
## Call:
## lm(formula = stfdem ~ 1 + stfeco + trstlgl + gndr + edu, data = pss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7867 -1.1246  0.0123  1.1391  5.8527 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.879001   0.106299   8.269  < 2e-16 ***
## stfeco          0.854641   0.014166  60.332  < 2e-16 ***
## trstlgl        -0.044393   0.013572  -3.271  0.00108 ** 
## gndrmale        0.001836   0.051229   0.036  0.97142    
## eduES-ISCED I  -0.343037   0.076832  -4.465 8.21e-06 ***
## eduES-ISCED II -0.174643   0.066577  -2.623  0.00874 ** 
## eduES-ISCED IV  0.076024   0.075825   1.003  0.31610    
## eduES-ISCED V   0.527465   0.119052   4.431 9.62e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.716 on 4542 degrees of freedom
##   (450 observations deleted due to missingness)
## Multiple R-squared:  0.4643,	Adjusted R-squared:  0.4635 
## F-statistic: 562.3 on 7 and 4542 DF,  p-value: < 2.2e-16
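The two outputs describe the same model: residuals, R-squared, and the coefficients of stfeco, trstlgl, and gndr are identical, and only the baseline for edu differs. A short check using only the numbers printed in the two summaries shows how the education coefficients are linked. The vector names are hypothetical labels chosen for this sketch.

```r
# Education effects from olsModel4 (reference ES-ISCED I, so its effect is 0)
m4 <- c(I = 0, II = 0.168395, III = 0.343037, IV = 0.419061, V = 0.870502)

# Subtracting the coefficient of the new reference category re-expresses
# every effect relative to ES-ISCED III
m5_from_m4 <- m4 - m4["III"]
round(m5_from_m4, 6)

# The intercept shifts by the same amount in the opposite direction
intercept_m5 <- 0.535964 + 0.343037
intercept_m5
```

The results agree with the olsModel5 output up to rounding in the last printed digit.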

Interpret the re-specified model. Write a few sentences in the script.

The model explains $46.35\,\%$ of the variance in stfdem (adjusted $R^2 = 0.4635$). The effect of stfeco is positive and significant ($\beta_1 = 0.854641$, $p<0.001$). Individuals with higher trust in the legal system report lower satisfaction ($\beta_2 = -0.044393$, $p<0.01$). Male and female respondents do not differ significantly in their satisfaction ($\beta_3 = 0.001836$, $p>0.05$). Compared to individuals with a medium level of education (ES-ISCED III), individuals with a very low (ES-ISCED I) or low (ES-ISCED II) level of education have lower satisfaction ($\beta_4 = -0.343037$ and $\beta_5 = -0.174643$, respectively). Both effects are significant ($p<0.01$). Individuals with the highest education level (ES-ISCED V) have significantly higher satisfaction than those with a medium education level ($\beta_7 = 0.527465$, $p<0.001$). Individuals with the second-highest education level (ES-ISCED IV) have slightly higher satisfaction ($\beta_6 = 0.076024$), but this effect is not significant ($p = 0.316$). Each effect applies with the other variables in the model held constant.

So now you can also add polytomous categorical variables and interpret the regression model in the R output!