Topics covered
5.1 Simple Coding
5.2 Deviation Coding
5.3 Orthogonal Polynomial Coding
5.4 Helmert Coding
5.5 Reverse Helmert Coding
5.6 Forward Difference Coding
5.7 Backward Difference Coding
5.8 User-Defined Coding
5.9 Summary
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. For example, if you have a variable called race that is coded 1 = Hispanic, 2 = Asian, 3 = Black, 4 = White, then entering race in your regression will look at the linear effect of race, which is probably not what you intended. Instead, categorical variables like this need to be recoded into a series of variables which can then be entered into the regression model. There are a variety of coding systems that can be used when coding categorical variables. Ideally, you would choose a coding system that reflects the comparisons that you want to make. For example, you may want to compare each level to the next higher level, in which case you would want to use "forward difference" coding, or you might want to compare each level to the mean of the subsequent levels of the variable, in which case you would want to use "Helmert" coding. By deliberately choosing a coding system, you can obtain comparisons that are most meaningful for testing your hypotheses. Regardless of the coding system you choose, the test of the overall effect of the categorical variable (i.e., the overall effect of race) will remain the same. Below is a table listing various types of contrasts and the comparison that they make.
| Name of contrast | Comparison made |
| Simple Coding | Compares each level of a variable to the reference level |
| Deviation Coding | Compares deviations from the grand mean |
| Orthogonal Polynomial Coding | Orthogonal polynomial contrasts |
| Helmert Coding | Compares levels of a variable with the mean of the subsequent levels of the variable |
| Reverse Helmert Coding | Compares levels of a variable with the mean of the previous levels of the variable |
| Forward Difference Coding | Adjacent levels of a variable (each level minus the next level) |
| Backward Difference Coding | Adjacent levels of a variable (each level minus the prior level) |
| User-Defined Coding | User-defined contrast |
There are a couple of notes to be made about the coding systems listed above. The first is that they represent planned comparisons and not post hoc comparisons. In other words, they are comparisons that you plan to do before you begin analyzing your data, not comparisons that you think of once you have seen the results of preliminary analyses. Also, some forms of coding make more sense with ordinal categorical variables than with nominal categorical variables. Below we will show examples using race as a categorical variable, which is a nominal variable. Because simple coding compares the mean of the dependent variable for each level of the categorical variable to the mean of the dependent variable for the reference level, it makes sense with a nominal variable. However, it may not make as much sense to use a coding scheme that tests the linear effect of race. As we describe each type of coding system, we note those coding systems with which it does not make as much sense to use a nominal variable. Also, you may notice that we follow several rules when creating the contrast coding schemes. For more information about these rules, please see the section on User-Defined Coding.
The Example Data File
R data frame: hsb2.csv
The examples on this page use a data frame called hsb2. We focus on the categorical variable race, which has four levels (1 = Hispanic, 2 = Asian, 3 = African American and 4 = Caucasian), and we use write as our dependent variable. Although our example uses a variable with four levels, these coding systems work with variables that have more or fewer categories. No matter which coding system you select, you will always have a contrast matrix with one less column than levels of the original variable. In our example, our categorical variable has four levels so we will have contrast matrices with three columns and four rows.
First, we read in the data frame and then we create a factor variable, race.f, based on race.
hsb2 = read.table('c:/hsb2.csv', header=TRUE, sep=",")
#creating the factor variable race.f inside the hsb2 data frame
hsb2$race.f = factor(hsb2$race, labels=c("Hispanic", "Asian", "African-Am", "Caucasian"))
Before considering any analyses, let's look at the mean of the dependent variable, write, for each level of race. This will help in interpreting the output from later analyses.
tapply(hsb2$write, hsb2$race.f, mean)
  Hispanic      Asian African-Am  Caucasian
  46.45833         58       48.2   54.05517
In R there are four built-in contrasts (dummy, deviation, Helmert, orthogonal polynomial) which we will consider first. Then we will demonstrate how to create the other commonly used contrast systems.
5.1 Simple Coding
The results of simple coding are very similar to dummy coding in that each level is compared to the reference level. In the example below, level 1 is the reference level and the first comparison compares level 1 to level 2, the second comparison compares level 1 to level 3, and the third comparison compares level 1 to level 4.
Let's first look at this kind of comparison using the built-in contrast function contr.treatment, which uses group 1 as the reference group. Strictly speaking, contr.treatment produces dummy coding: its slope estimates are the same as those from simple coding, but its intercept is the mean of the reference group rather than the mean of the level means. We will then show how to manually create the simple coding contrast matrix, which can use any group as the reference group.
#the contrast matrix for categorical variable with four levels
contr.treatment(4)
2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
#assigning the treatment contrasts to race.f
contrasts(hsb2$race.f) = contr.treatment(4)
#the regression
summary(lm(write ~ race.f, hsb2))
Residuals:
Min 1Q Median 3Q Max
-23.06 -5.458 0.9724 7 18.8
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 46.4583 1.8422 25.2184 0.0000
race.f2 11.5417 3.2861 3.5122 0.0006
race.f3 1.7417 2.7325 0.6374 0.5246
race.f4 7.5968 1.9889 3.8197 0.0002
Residual standard error: 9.025 on 196 degrees of freedom
Multiple R-Squared: 0.1071
F-statistic: 7.833 on 3 and 196 degrees of freedom, the p-value is 0.00005785
The parameter estimate for the first contrast is the mean of the dependent variable, write, for level 2 minus the mean for level 1 (58 - 46.4583 = 11.5417), and it is statistically significant (p = .0006). The t-value associated with this test is 3.5122. The result of the second contrast, comparing the mean of write for levels 1 and 3, is not statistically significant (t = 0.6374, p = .5246), while the third contrast is statistically significant (t = 3.8197, p = .0002).
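The relationship between these coefficients and the level means can be checked by hand. The sketch below is only an illustration: it assumes the four level means copied (already rounded) from the tapply output above instead of refitting the model, so it agrees with the regression output only up to rounding. The names means, intercept and dummy_coefs are ours.

```r
# Level means of write, copied from the tapply() output above (rounded values)
means <- c(Hispanic = 46.45833, Asian = 58, "African-Am" = 48.2, Caucasian = 54.05517)

# With contr.treatment the intercept is the mean of the reference level (level 1)
intercept <- unname(means[1])

# Each remaining coefficient is a level mean minus the reference-level mean
dummy_coefs <- unname(means[-1] - means[1])

print(intercept)              # compare with (Intercept) in the output above
print(round(dummy_coefs, 4))  # compare with race.f2, race.f3 and race.f4
```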
In our example below, level 1 is the reference level and race.f1 compares level 1 to level 2, race.f2 compares level 1 to level 3, and race.f3 compares level 1 to level 4. For race.f1 the coding is 3/4 for level 2, and -1/4 for all other levels. Likewise, for race.f2 the coding is 3/4 for level 3, and -1/4 for all other levels, and for race.f3 the coding is 3/4 for level 4, and -1/4 for all other levels. The general rule is that the reference group is never coded anything but -1/4 and for each contrast the level that is being contrasted is coded 3/4. Thus, for the first contrast it is level 2 which is coded 3/4 and all other levels are coded -1/4. Since there are four groups and the values in each column have to add to zero, there must be three levels coded as -1/4 and one level coded as 3/4.
SIMPLE regression coding
| Level of race | race.f1 | race.f2 | race.f3 |
| 1 (Hispanic) | -1/4 | -1/4 | -1/4 |
| 2 (Asian) | 3/4 | -1/4 | -1/4 |
| 3 (African American) | -1/4 | 3/4 | -1/4 |
| 4 (Caucasian) | -1/4 | -1/4 | 3/4 |
Below we show the more general rule for creating this kind of coding scheme using regression coding, where k is the number of levels of the categorical variable (in the case of the variable race.f k = 4).
SIMPLE regression coding
| Level of race | level 1 vs. 2 | level 1 vs. 3 | level 1 vs. 4 |
| 1 (Hispanic) | -1 / k | -1 / k | -1 / k |
| 2 (Asian) | (k-1) / k | -1 / k | -1 / k |
| 3 (African American) | -1 / k | (k-1) / k | -1 / k |
| 4 (Caucasian) | -1 / k | -1 / k | (k-1) / k |
Let's create the contrast matrix manually using the scheme shown above.
#creating the contrast matrix manually
my.treat = matrix(c( -1/4, 3/4, -1/4, -1/4, -1/4, -1/4, 3/4, -1/4, -1/4, -1/4, -1/4, 3/4), ncol=3)
my.treat
[,1] [,2] [,3]
[1,] -0.25 -0.25 -0.25
[2,] 0.75 -0.25 -0.25
[3,] -0.25 0.75 -0.25
[4,] -0.25 -0.25 0.75
contrasts(hsb2$race.f) = my.treat
#the regression
summary(lm(write~race.f, hsb2))
Residuals:
Min 1Q Median 3Q Max
-23.06 -5.458 0.9724 7 18.8
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f1 11.5417 3.2861 3.5122 0.0006
race.f2 1.7417 2.7325 0.6374 0.5246
race.f3 7.5968 1.9889 3.8197 0.0002
Residual standard error: 9.025 on 196 degrees of freedom
Multiple R-Squared: 0.1071
F-statistic: 7.833 on 3 and 196 degrees of freedom, the p-value is 0.00005785
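The general rule in the table above, (k-1)/k for the contrasted level and -1/k elsewhere, amounts to subtracting 1/k from every entry of the dummy coding matrix. The following is a minimal sketch of that idea; simple_coding is a hypothetical helper name of ours, not a function provided by R, and it always uses level 1 as the reference.

```r
# Hypothetical helper (not part of base R): simple coding for k levels,
# with level 1 as the reference level.
# contr.treatment(k) holds 1 for the contrasted level and 0 elsewhere, so
# subtracting 1/k gives (k-1)/k for the contrasted level and -1/k elsewhere.
simple_coding <- function(k) {
  contr.treatment(k) - 1 / k
}

# The matrix typed in by hand above, for comparison
my.treat <- matrix(c(-1/4,  3/4, -1/4, -1/4,
                     -1/4, -1/4,  3/4, -1/4,
                     -1/4, -1/4, -1/4,  3/4), ncol = 3)

# unname() drops the row and column labels so only the numbers are compared
print(all.equal(unname(simple_coding(4)), my.treat))
```

The same helper can be called with other values of k, for example simple_coding(5) for a factor with five levels.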
5.2 Deviation Coding
This coding system compares the mean of the dependent variable for a given level to the overall mean of the dependent variable. In our example below, the first comparison compares level 1 (Hispanics) to all levels of race, the second comparison compares level 2 (Asians) to all levels of race, and the third comparison compares level 3 (African Americans) to all levels of race.
As you see in the example below, the regression coding is accomplished by assigning 1 to level 1 for the first comparison (because level 1 is the level to be compared to all others), a 1 to level 2 for the second comparison (because level 2 is to be compared to all others), and 1 to level 3 for the third comparison (because level 3 is to be compared to all others). Note that a -1 is assigned to level 4 for all three comparisons (because it is the level that is never compared to the other levels) and all other values are assigned a 0. This regression coding scheme yields the comparisons described above.
We will not create the contrast matrix manually because the contr.sum function creates it for us.
DEVIATION regression coding
| Level of race | Level 1 v. Mean | Level 2 v. Mean | Level 3 v. Mean |
| 1 (Hispanic) | 1 | 0 | 0 |
| 2 (Asian) | 0 | 1 | 0 |
| 3 (African American) | 0 | 0 | 1 |
| 4 (Caucasian) | -1 | -1 | -1 |
#the contrast matrix for categorical variable with four levels
contr.sum(4)
[,1] [,2] [,3]
1 1 0 0
2 0 1 0
3 0 0 1
4 -1 -1 -1
#assigning the deviation contrasts to race.f
contrasts(hsb2$race.f) = contr.sum(4)
#the regression
summary(lm(write ~ race.f, hsb2))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f1 -5.2200 1.6314 -3.1997 0.0016
race.f2 6.3216 2.1603 2.9263 0.0038
race.f3 -3.4784 1.7323 -2.0079 0.0460
The contrast estimate is the mean for level 1 minus the grand mean. However, this grand mean is not the mean of write across all observations. Rather it is the mean of the level means of the dependent variable (shown in the tapply output above): (46.4583 + 58 + 48.2 + 54.0552) / 4 = 51.678375. This contrast estimate is then 46.4583 - 51.678375 = -5.220. The difference between this value and zero (the null hypothesis that the contrast coefficient is zero) is statistically significant (p = .0016), and the t-value for this test is -3.20. The results for the next two contrasts were computed in a similar manner.
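The arithmetic in the paragraph above can be reproduced in a few lines. This is a sketch that assumes the rounded level means from the tapply output instead of refitting the model, so the last digits may differ slightly from the regression output; the variable names are ours.

```r
# Level means of write, copied from the tapply() output above (rounded values)
means <- c(Hispanic = 46.45833, Asian = 58, "African-Am" = 48.2, Caucasian = 54.05517)

# Unweighted mean of the four level means (not the mean over all observations)
grand_mean <- mean(means)

# Deviation of each level mean from that grand mean
deviations <- means - grand_mean

print(grand_mean)            # compare with (Intercept)
print(round(deviations, 4))  # the first three match race.f1, race.f2 and race.f3
```

The model does not report the deviation for level 4 directly. Because the four deviations sum to zero, it equals minus the sum of the other three.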
5.3 Orthogonal Polynomial Coding
Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable. This type of coding system should be used only with an ordinal variable in which the levels are equally spaced. Examples of such a variable might be income or education. The table below shows the contrast coefficients for the linear, quadratic and cubic trends for the four levels. In R it is not necessary to compute these values since this contrast can be obtained for any categorical variable by using the contr.poly function. This is also the default contrast used for ordered factor variables.
POLYNOMIAL
| Level of race | Linear (race.f.L) | Quadratic (race.f.Q) | Cubic (race.f.C) |
| 1 (Hispanic) | -.671 | .5 | -.224 |
| 2 (Asian) | -.224 | -.5 | .671 |
| 3 (African American) | .224 | -.5 | -.671 |
| 4 (Caucasian) | .671 | .5 | .224 |
#the contrast matrix for categorical variable with four levels
contr.poly(4)
.L .Q .C
1 -0.6708204 0.5 -0.2236068
2 -0.2236068 -0.5 0.6708204
3 0.2236068 -0.5 -0.6708204
4 0.6708204 0.5 0.2236068
#assigning the orthogonal polynomial contrasts to race.f
contrasts(hsb2$race.f) = contr.poly(4)
#the regression
summary(lm(write ~ race.f, hsb2))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f.L 2.9048 1.5342 1.8933 0.0598
race.f.Q -2.8432 1.9642 -1.4475 0.1494
race.f.C 8.2727 2.3157 3.5724 0.0004
To calculate the contrast estimates for these comparisons, you need to multiply the code used in the contrast by the mean for the dependent variable for each level of the categorical variable, and then sum the values. For example, the code used in race.f.L for level 1 of race is -.671 and the mean of write for level 1 is 46.4583. Hence, you would multiply -.671 and 46.4583 and add that to the product of the code for level 2 of race.f.L and its mean, and so on. To obtain the contrast estimate for the linear contrast, you would do the following: -.671*46.4583 + -.224*58 + .224*48.2 + .671*54.0552 = 2.9048 (with rounding error). This result is not statistically significant at the .05 alpha level, but it is close. The quadratic component is also not statistically significant, but the cubic component is. This suggests that, if the means of the dependent variable for each level of race.f were plotted against race, the line would tend to have two bends. As noted earlier, this type of coding system does not make much sense with a nominal variable such as race.
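The multiply-and-sum calculation described above can be done for all three trends at once. The sketch below assumes the rounded level means from the tapply output and uses the unrounded codes from contr.poly(4). It relies on the fact that the columns of contr.poly are orthonormal and sum to zero, which is why a plain sum of code times mean gives the coefficient here. The variable names are ours.

```r
# Level means of write, copied from the tapply() output above (rounded values)
means <- c(Hispanic = 46.45833, Asian = 58, "African-Am" = 48.2, Caucasian = 54.05517)

# Orthogonal polynomial codes for four levels (columns .L, .Q and .C)
poly_codes <- contr.poly(4)

# For each column: sum over levels of (code * level mean)
poly_coefs <- drop(crossprod(poly_codes, means))

print(round(poly_coefs, 4))  # compare with race.f.L, race.f.Q and race.f.C
```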
5.4 Helmert Coding
Helmert coding compares each level of a categorical variable to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for level 1 of race with the mean of all of the subsequent levels of race (levels 2, 3, and 4), the second contrast compares the mean of the dependent variable for level 2 of race with the mean of all of the subsequent levels of race (levels 3 and 4), and the third contrast compares the mean of the dependent variable for level 3 of race with the mean of all of the subsequent levels of race (level 4).
In R there is a built-in function, contr.helmert, which will generate one version of the Helmert contrast coding for a factor variable. However, there is some dispute about the correct method for generating the Helmert contrast coding and therefore we will show how to do it manually rather than by using the built-in function. For a more detailed discussion of the Helmert coding as it is implemented by the contr.helmert function please refer to pp. 156-158 of Modern Applied Statistics with S-PLUS by Venables and Ripley.
Below we see an example of Helmert regression coding. For the first comparison (comparing level 1 with levels 2, 3 and 4) the codes are 3/4 and -1/4 -1/4 -1/4. The second comparison compares level 2 with levels 3 and 4 and is coded 0 2/3 -1/3 -1/3. The third comparison compares level 3 to level 4 and is coded 0 0 1/2 -1/2.
HELMERT regression coding
| Level of race | race.f1: Level 1 v. Later | race.f2: Level 2 v. Later | race.f3: Level 3 v. Later |
| 1 (Hispanic) | 3/4 | 0 | 0 |
| 2 (Asian) | -1/4 | 2/3 | 0 |
| 3 (African American) | -1/4 | -1/3 | 1/2 |
| 4 (Caucasian) | -1/4 | -1/3 | -1/2 |
#helmert for factor variable with 4 levels
my.helmert = matrix(c(3/4, -1/4, -1/4, -1/4, 0, 2/3, -1/3, -1/3, 0, 0, 1/2, -1/2), ncol = 3)
my.helmert
[,1] [,2] [,3]
[1,] 0.75 0.0000000 0.0
[2,] -0.25 0.6666667 0.0
[3,] -0.25 -0.3333333 0.5
[4,] -0.25 -0.3333333 -0.5
#assigning the new helmert coding to race.f
contrasts(hsb2$race.f) = my.helmert
#the regression
summary(lm(write ~ race.f, hsb2))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f1 -6.9601 2.1752 -3.1997 0.0016
race.f2 6.8724 2.9263 2.3485 0.0198
race.f3 -5.8552 2.1528 -2.7198 0.0071
The contrast estimate for the comparison between level 1 and the remaining levels is calculated by taking the mean of the dependent variable for level 1 and subtracting the mean of the dependent variable for levels 2, 3 and 4: 46.4583 - [(58 + 48.2 + 54.0552) / 3] = -6.960, which is statistically significant. This means that the mean of write for level 1 of race is statistically significantly different from the mean of write for levels 2 through 4. To calculate the contrast coefficient for the comparison between level 2 and the later levels, you subtract the mean of the dependent variable for levels 3 and 4 from the mean of the dependent variable for level 2: 58 - [(48.2 + 54.0552) / 2] = 6.872, which is statistically significant. The contrast estimate for the comparison between level 3 and level 4 is the difference between the mean of the dependent variable for the two levels: 48.2 - 54.0552 = -5.855, which is also statistically significant.
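The pattern in the Helmert table extends to any number of levels: in column j, level j is coded m/(m+1) and each later level is coded -1/(m+1), where m is the number of later levels. The sketch below is one way to write this down; helmert_coding is a hypothetical helper of ours (it is not the built-in contr.helmert), and the level means are the rounded values from the tapply output.

```r
# Level means of write, copied from the tapply() output above (rounded values)
means <- c(Hispanic = 46.45833, Asian = 58, "African-Am" = 48.2, Caucasian = 54.05517)

# Hypothetical helper: Helmert coding as defined on this page, for k levels
helmert_coding <- function(k) {
  C <- matrix(0, nrow = k, ncol = k - 1)
  for (j in seq_len(k - 1)) {
    later <- k - j                        # number of levels after level j
    C[j, j] <- later / (later + 1)        # the level being compared
    C[(j + 1):k, j] <- -1 / (later + 1)   # each of the subsequent levels
  }
  C
}

my.helmert <- matrix(c(3/4, -1/4, -1/4, -1/4, 0, 2/3, -1/3, -1/3, 0, 0, 1/2, -1/2), ncol = 3)
print(all.equal(helmert_coding(4), my.helmert))

# Each estimate: mean of level j minus the mean of the later level means
k <- length(means)
helmert_estimates <- sapply(seq_len(k - 1),
                            function(j) means[[j]] - mean(means[(j + 1):k]))
print(round(helmert_estimates, 4))  # compare with race.f1, race.f2 and race.f3
```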
5.5 Reverse Helmert Coding
Reverse Helmert coding (also known as difference coding) is just the opposite of Helmert coding: instead of comparing each level of the categorical variable to the mean of the subsequent level(s), each is compared to the mean of the previous level(s). In our example, the first contrast codes the comparison of the mean of the dependent variable for level 2 of race to the mean of the dependent variable for level 1 of race. The second comparison compares the mean of the dependent variable for level 3 of race with the mean for levels 1 and 2 of race, and the third comparison compares the mean of the dependent variable for level 4 of race with the mean for levels 1, 2 and 3.
The regression coding for reverse Helmert coding is shown below. For the first comparison, where the first and second level are compared, race.f1 is coded -1/2 and 1/2 and 0 otherwise. For the second comparison, the values of race.f2 are coded -1/3 -1/3 2/3 and 0. Finally, for the third comparison, the values of race.f3 are coded -1/4 -1/4 -1/4 and 3/4.
REVERSE HELMERT regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
| 1 (Hispanic) | -1/2 | -1/3 | -1/4 |
| 2 (Asian) | 1/2 | -1/3 | -1/4 |
| 3 (African American) | 0 | 2/3 | -1/4 |
| 4 (Caucasian) | 0 | 0 | 3/4 |
#reverse helmert for factor variable with 4 levels
my.rev.helmert = matrix(c(-1/2, 1/2, 0, 0, -1/3, -1/3, 2/3, 0, -1/4, -1/4,
-1/4, 3/4), ncol = 3)
my.rev.helmert
[,1] [,2] [,3]
[1,] -0.5 -0.3333333 -0.25
[2,] 0.5 -0.3333333 -0.25
[3,] 0.0 0.6666667 -0.25
[4,] 0.0 0.0000000 0.75
#assigning the reverse helmert coding to race.f
contrasts(hsb2$race.f) = my.rev.helmert
#the regression
summary(lm(write ~ race.f, hsb2))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f1 11.5417 3.2861 3.5122 0.0006
race.f2 -4.0292 2.6024 -1.5483 0.1232
race.f3 3.1691 1.4880 2.1298 0.0344
The contrast estimate for the first comparison shown in this output was calculated by subtracting the mean of the dependent variable for level 1 of the categorical variable from the mean of the dependent variable for level 2: 58 - 46.4583 = 11.542. This result is statistically significant. The contrast estimate for the second comparison (between level 3 and the previous levels) was calculated by subtracting the mean of the dependent variable for levels 1 and 2 from that of level 3: 48.2 - [(46.4583 + 58) / 2] = -4.029. This result is not statistically significant, meaning that there is not a reliable difference between the mean of write for level 3 of race compared to the mean of write for levels 1 and 2 (Hispanics and Asians). As noted above, this type of coding system does not make much sense for a nominal variable such as race. For the comparison of level 4 and the previous levels, you take the mean of the dependent variable for those levels and subtract it from the mean of the dependent variable for level 4: 54.0552 - [(46.4583 + 58 + 48.2) / 3] = 3.169. This result is statistically significant.
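The reverse Helmert matrix used above is closely related to the built-in contr.helmert function. Its columns contain the same pattern in whole numbers, (-1, 1, 0, 0), (-1, -1, 2, 0) and (-1, -1, -1, 3), so dividing column j by j + 1 gives the fractions in the table. The sketch below checks this for four levels; the name rescaled is ours.

```r
# The reverse Helmert matrix typed in by hand above
my.rev.helmert <- matrix(c(-1/2, 1/2, 0, 0,
                           -1/3, -1/3, 2/3, 0,
                           -1/4, -1/4, -1/4, 3/4), ncol = 3)

# Built-in Helmert contrasts, with columns 1, 2 and 3 divided by 2, 3 and 4
rescaled <- sweep(contr.helmert(4), 2, 2:4, "/")

# unname() drops the row labels so only the numbers are compared
print(all.equal(unname(rescaled), my.rev.helmert))
```

Since the built-in codes are larger by a factor of j + 1, we would expect the coefficients from contr.helmert to be smaller by the same factor, with unchanged t-values and p-values.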
5.6 Forward Difference Coding
In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the next (adjacent) level. In our example below, the first comparison compares the mean of write for level 1 with the mean of write for level 2 of race (Hispanics minus Asians). The second comparison compares the mean of write for level 2 minus level 3, and the third comparison compares the mean of write for level 3 minus level 4. This type of coding may be useful with either a nominal or an ordinal variable.
For the first comparison, where the first and second levels are compared, race.f1 is coded 3/4 for level 1 and the other levels are coded -1/4. For the second comparison where level 2 is compared with level 3, race.f2 is coded 1/2 1/2 -1/2 -1/2, and for the third comparison where level 3 is compared with level 4, race.f3 is coded 1/4 1/4 1/4 -3/4.
FORWARD DIFFERENCE regression coding
| Level of race | race.f1: Level 1 v. Level 2 | race.f2: Level 2 v. Level 3 | race.f3: Level 3 v. Level 4 |
| 1 (Hispanic) | 3/4 | 1/2 | 1/4 |
| 2 (Asian) | -1/4 | 1/2 | 1/4 |
| 3 (African American) | -1/4 | -1/2 | 1/4 |
| 4 (Caucasian) | -1/4 | -1/2 | -3/4 |
The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case k = 4).
FORWARD DIFFERENCE regression coding
| Level of race | contrast 1: Level 1 v. Level 2 | contrast 2: Level 2 v. Level 3 | contrast 3: Level 3 v. Level 4 |
| 1 (Hispanic) | (k-1)/k | (k-2)/k | (k-3)/k |
| 2 (Asian) | -1/k | (k-2)/k | (k-3)/k |
| 3 (African American) | -1/k | -2/k | (k-3)/k |
| 4 (Caucasian) | -1/k | -2/k | -3/k |
#forward difference for factor variable with 4 levels
my.forward.diff = matrix(c(3/4, -1/4, -1/4, -1/4, 1/2, 1/2, -1/2, -1/2, 1/4, 1/4, 1/4, -3/4), ncol = 3)
my.forward.diff
[,1] [,2] [,3]
[1,] 0.75 0.5 0.25
[2,] -0.25 0.5 0.25
[3,] -0.25 -0.5 0.25
[4,] -0.25 -0.5 -0.75
#assigning the forward difference coding to race.f
contrasts(hsb2$race.f) = my.forward.diff
#the regression
summary(lm(write ~ race.f, hsb2))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f1 -11.5417 3.2861 -3.5122 0.0006
race.f2 9.8000 3.3878 2.8927 0.0043
race.f3 -5.8552 2.1528 -2.7198 0.0071
With this coding system, adjacent levels of the categorical variable are compared. Hence, the mean of the dependent variable at level 1 is compared to the mean of the dependent variable at level 2: 46.4583 - 58 = -11.542, which is statistically significant. For the comparison between levels 2 and 3, the calculation of the contrast coefficient would be 58 - 48.2 = 9.8, which is also statistically significant. Finally, comparing levels 3 and 4, 48.2 - 54.0552 = -5.855, a statistically significant difference. One would conclude from this that each pair of adjacent levels of race is statistically significantly different.
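The general rule for forward difference coding shown in the table above can be turned into a small function: in column j, levels 1 through j are coded (k-j)/k and the remaining levels are coded -j/k. This is a sketch only; forward_diff_coding is a hypothetical helper name of ours and not a function provided by R.

```r
# Hypothetical helper: forward difference coding for k levels
forward_diff_coding <- function(k) {
  C <- matrix(0, nrow = k, ncol = k - 1)
  for (j in seq_len(k - 1)) {
    C[1:j, j] <- (k - j) / k     # levels up to and including level j
    C[(j + 1):k, j] <- -j / k    # levels after level j
  }
  C
}

# The matrix typed in by hand above, for comparison
my.forward.diff <- matrix(c(3/4, -1/4, -1/4, -1/4,
                            1/2, 1/2, -1/2, -1/2,
                            1/4, 1/4, 1/4, -3/4), ncol = 3)
print(all.equal(forward_diff_coding(4), my.forward.diff))

# The same rule for a factor with three levels
print(forward_diff_coding(3))
```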
5.7 Backward Difference Coding
In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the prior adjacent level. In our example below, the first comparison compares the mean of write for level 2 with the mean of write for level 1 of race (Asians minus Hispanics). The second comparison compares the mean of write for level 3 minus level 2, and the third comparison compares the mean of write for level 4 minus level 3. This type of coding may be useful with either a nominal or an ordinal variable.
For the first comparison, where the first and second levels are compared, race.f1 is coded -3/4 for level 1 while the other levels are coded 1/4. For the second comparison where level 3 is compared with level 2, race.f2 is coded -1/2 -1/2 1/2 1/2, and for the third comparison where level 4 is compared with level 3, race.f3 is coded -1/4 -1/4 -1/4 3/4.
BACKWARD DIFFERENCE regression coding
| Level of race | race.f1: Level 2 v. Level 1 | race.f2: Level 3 v. Level 2 | race.f3: Level 4 v. Level 3 |
| 1 (Hispanic) | -3/4 | -1/2 | -1/4 |
| 2 (Asian) | 1/4 | -1/2 | -1/4 |
| 3 (African American) | 1/4 | 1/2 | -1/4 |
| 4 (Caucasian) | 1/4 | 1/2 | 3/4 |
The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case, k = 4).
BACKWARD DIFFERENCE regression coding
| Level of race | contrast 1: Level 2 v. Level 1 | contrast 2: Level 3 v. Level 2 | contrast 3: Level 4 v. Level 3 |
| 1 (Hispanic) | -(k-1)/k | -(k-2)/k | -(k-3)/k |
| 2 (Asian) | 1/k | -(k-2)/k | -(k-3)/k |
| 3 (African American) | 1/k | 2/k | -(k-3)/k |
| 4 (Caucasian) | 1/k | 2/k | 3/k |
#backward difference for factor variable with 4 levels
my.backward.diff = matrix(c(-3/4, 1/4, 1/4, 1/4, -1/2, -1/2, 1/2, 1/2,
-1/4, -1/4, -1/4, 3/4), ncol = 3)
my.backward.diff
[,1] [,2] [,3]
[1,] -0.75 -0.5 -0.25
[2,] 0.25 -0.5 -0.25
[3,] 0.25 0.5 -0.25
[4,] 0.25 0.5 0.75
#assigning the backward difference coding to race.f
contrasts(hsb2$race.f) = my.backward.diff
#the regression
summary(lm(write ~ race.f, hsb2))
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f1 11.5417 3.2861 3.5122 0.0006
race.f2 -9.8000 3.3878 -2.8927 0.0043
race.f3 5.8552 2.1528 2.7198 0.0071
With this coding system, adjacent levels of the categorical variable are compared, with each level compared to the prior level. Hence, the mean of the dependent variable at level 2 is compared to the mean of the dependent variable at level 1: 58 - 46.4583 = 11.542, which is statistically significant. For the comparison between levels 3 and 2, the calculation of the contrast coefficient is 48.2 - 58 = -9.8, which is also statistically significant. Finally, comparing levels 4 and 3, 54.0552 - 48.2 = 5.855, a statistically significant difference. One would conclude from this that each adjacent level of race is statistically significantly different.
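One way to see why this coding matrix yields successive differences is to solve for the coefficients directly from the level means. In a model with a single factor, the level means equal the design matrix (a column of ones followed by the coding matrix) times the coefficients, so the coefficients can be recovered with solve. The sketch below assumes the rounded level means from the tapply output, so it matches the regression output only up to rounding; the name coefs is ours.

```r
# Level means of write, copied from the tapply() output above (rounded values)
means <- c(Hispanic = 46.45833, Asian = 58, "African-Am" = 48.2, Caucasian = 54.05517)

my.backward.diff <- matrix(c(-3/4, 1/4, 1/4, 1/4,
                             -1/2, -1/2, 1/2, 1/2,
                             -1/4, -1/4, -1/4, 3/4), ncol = 3)

# Solve [1, coding matrix] %*% coefs = level means for the coefficients
coefs <- solve(cbind(1, my.backward.diff), unname(means))

print(round(coefs[1], 4))             # intercept: mean of the level means
print(round(coefs[-1], 4))            # compare with race.f1, race.f2 and race.f3
print(round(diff(unname(means)), 4))  # successive differences of the level means
```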
5.8 User-Defined Coding
In R it is possible to use any general kind of coding scheme. For our example, we would like to make the following three comparisons:
1) level 1 to level 3
2) level 2 to levels 1 and 4
3) levels 1 and 2 to levels 3 and 4.
In order to compare level 1 to level 3, we use the contrast coefficients 1 0 -1 0. To compare level 2 to levels 1 and 4 we use the contrast coefficients -1/2 1 0 -1/2. Finally, to compare levels 1 and 2 with levels 3 and 4 we use the coefficients -1/2 -1/2 1/2 1/2, which give the mean of levels 3 and 4 minus the mean of levels 1 and 2. Before proceeding to the code necessary to conduct these analyses, let's take a moment to more fully explain the logic behind the selection of these contrast coefficients.
For the first contrast, we are comparing level 1 to level 3, and the contrast coefficients are 1 0 -1 0. This means that the levels associated with the contrast coefficients with opposite signs are being compared. In fact, the mean of the dependent variable is multiplied by the contrast coefficient. Hence, levels 2 and 4 are not involved in the comparison: they are multiplied by zero and "dropped out." You will also notice that the contrast coefficients sum to zero. This is necessary for the coefficients to define a contrast among the level means. Note that R will not necessarily issue an error message if the coefficients do not sum to zero, so it is up to you to check this. Which level of the categorical variable is assigned a positive or negative value is not terribly important: 1 0 -1 0 is the same as -1 0 1 0 in that both of these codings compare the first and the third levels of the variable. However, the sign of the regression coefficient would change.
Now let's look at the contrast coefficients for the second and third comparisons. You will notice that in both cases we use fractions whose positive values sum to one and whose negative values sum to minus one. They do not have to sum to one (or minus one). You may wonder why we would use fractions like -1/2 1 0 -1/2 instead of whole numbers such as -1 2 0 -1. While -1/2 1 0 -1/2 and -1 2 0 -1 both compare level 2 with levels 1 and 4 and both will give you the same t-value and p-value for the regression coefficient, the regression coefficients themselves would be different, as would their interpretation. The coefficient for the -1/2 1 0 -1/2 contrast is the mean of level 2 minus the mean of the means for levels 1 and 4: 58 - (46.4583 + 54.0552)/2 = 7.74325. (Alternatively, you can multiply the contrasts by the mean of the dependent variable for each level of the categorical variable: -1/2*46.4583 + 1*58.00 + 0*48.20 + -1/2*54.0552 = 7.74325. Clearly these are equivalent ways of thinking about how the contrast coefficient is calculated.) By comparison, the coefficient for the -1 2 0 -1 contrast is two times the mean for level 2 minus the means of the dependent variable for levels 1 and 4: 2*58 - (46.4583 + 54.0552) = 15.4865, which is the same as -1*46.4583 + 2*58 + 0*48.20 - 1*54.0552 = 15.4865. Note that the regression coefficient using the contrast coefficients -1 2 0 -1 is twice the regression coefficient obtained when -1/2 1 0 -1/2 is used.
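The effect of rescaling the contrast coefficients can be checked with the arithmetic described above. This is a small sketch that assumes the rounded level means from the tapply output; the variable names are ours.

```r
# Level means of write, copied from the tapply() output above (rounded values)
means <- c(Hispanic = 46.45833, Asian = 58, "African-Am" = 48.2, Caucasian = 54.05517)

# Two scalings of the same comparison (level 2 versus levels 1 and 4)
half_scale  <- c(-1/2, 1, 0, -1/2)
whole_scale <- c(-1, 2, 0, -1)

# Multiply each coefficient by its level mean and sum
est_half  <- sum(half_scale * means)
est_whole <- sum(whole_scale * means)

print(est_half)              # level 2 mean minus the mean of levels 1 and 4
print(est_whole)             # the same comparison on a doubled scale
print(est_whole / est_half)  # ratio of the two estimates
```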
Let's turn our attention to how we would implement this in R.
#initial contrast matrix
mat = matrix(c(1, 0, -1, 0, -1/2, 1, 0, -1/2, -1/2, -1/2, 1/2, 1/2), ncol = 3)
mat
[,1] [,2] [,3]
[1,] 1 -0.5 -0.5
[2,] 0 1.0 -0.5
[3,] -1 0.0 0.5
[4,] 0 -0.5 0.5
my.contrasts = mat %*% solve(t(mat) %*% mat)
my.contrasts
[,1] [,2] [,3]
[1,] -0.5 -1 -1.5
[2,] 0.5 1 0.5
[3,] -1.5 -1 -1.5
[4,] 1.5 1 2.5
#assigning my.contrasts to race.f
contrasts(hsb2$race.f) = my.contrasts
#the regression
summary(lm(write ~ race.f, hsb2))
Call: lm(formula = write ~ race.f, data = hsb2)
Residuals:
Min 1Q Median 3Q Max
-23.06 -5.458 0.9724 7 18.8
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 51.6784 0.9821 52.6191 0.0000
race.f1 -1.7417 2.7325 -0.6374 0.5246
race.f2 7.7432 2.8972 2.6727 0.0082
race.f3 -1.1016 1.9642 -0.5608 0.5756
The contrast estimate for the first comparison is the mean of level 1 minus the mean for level 3, and the significance of this is .5246, i.e., not significant. The second contrast estimate is 7.7432, which is the mean of level 2 minus the mean of level 1 and level 4, and this difference is significant, p = 0.0082. The final contrast estimate is -1.1016, which, given the signs used in the third column of mat, is the mean of levels 3 and 4 minus the mean of levels 1 and 2, and this contrast is not statistically significant, p = .5756.
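The step mat %*% solve(t(mat) %*% mat) can look opaque, so here is a sketch of what it does. It converts the contrast coefficients in mat into a coding matrix whose cross-product with mat is the identity matrix. Because the columns of mat also sum to zero, each regression coefficient is then the corresponding column of mat applied to the level means. The check below assumes the rounded level means from the tapply output; the name estimates is ours.

```r
# Level means of write, copied from the tapply() output above (rounded values)
means <- c(Hispanic = 46.45833, Asian = 58, "African-Am" = 48.2, Caucasian = 54.05517)

# Contrast coefficients, one column per comparison (as entered above)
mat <- matrix(c(1, 0, -1, 0, -1/2, 1, 0, -1/2, -1/2, -1/2, 1/2, 1/2), ncol = 3)

# Coding matrix that is assigned to the factor
my.contrasts <- mat %*% solve(t(mat) %*% mat)

# t(mat) %*% my.contrasts is the 3 x 3 identity matrix
print(all.equal(crossprod(mat, my.contrasts), diag(3)))

# Each estimate is a column of mat multiplied by the level means and summed
estimates <- drop(crossprod(mat, unname(means)))
print(round(estimates, 4))  # compare with race.f1, race.f2 and race.f3
```

With the third column of mat entered as -1/2 -1/2 1/2 1/2, the third value is the mean of levels 3 and 4 minus the mean of levels 1 and 2.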
These pages are Copyrighted (c) by UCLA Academic Technology Services