### Regression with Stata Chapter 5 - Additional coding systems for categorical variables in regression analysis

Chapter Outline
5.1 Simple Coding
5.2 Forward Difference Coding
5.3 Backward Difference Coding
5.4 Helmert Coding
5.5 Reverse Helmert Coding
5.6 Deviation Coding
5.7 Orthogonal Polynomial Coding
5.8 User-Defined Coding
5.9 Summary

Please note: This page makes use of the program xi3, which is no longer being maintained and has been removed from our archives. References to xi3 will be left on this page because they illustrate specific principles of coding categorical variables.

#### 5.0 Introduction

Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. For example, if you have a variable called race that is coded 1 = Hispanic, 2 = Asian, 3 = Black, 4 = White, then entering race in your regression will look at the linear effect of race, which is probably not what you intended. Instead, categorical variables like this need to be recoded into a series of variables which can then be entered into the regression model. There are a variety of coding systems that can be used when coding categorical variables. Ideally, you would choose a coding system that reflects the comparisons that you want to make. In Chapter 3 of the Regression with Stata Web Book we covered the use of categorical variables in regression analysis focusing on the use of dummy variables, but that is not the only coding scheme that you can use. For example, you may want to compare each level to the next higher level, in which case you would want to use "forward difference" coding, or you might want to compare each level to the mean of the subsequent levels of the variable, in which case you would want to use "Helmert" coding. By deliberately choosing a coding system, you can obtain comparisons that are most meaningful for testing your hypotheses. Regardless of the coding system you choose, the test of the overall effect of the categorical variable (i.e., the overall effect of race) will remain the same. Below is a table listing various types of contrasts and the comparison that they make.

| Name of contrast | Comparison made |
|---|---|
| Simple Coding | Compares each level of a variable to the reference level |
| Forward Difference Coding | Adjacent levels of a variable (each level minus the next level) |
| Backward Difference Coding | Adjacent levels of a variable (each level minus the prior level) |
| Helmert Coding | Compares levels of a variable with the mean of the subsequent levels of the variable |
| Reverse Helmert Coding | Compares levels of a variable with the mean of the previous levels of the variable |
| Deviation Coding | Compares deviations from the grand mean |
| Orthogonal Polynomial Coding | Orthogonal polynomial contrasts |
| User-Defined Coding | User-defined contrast |

There are a couple of notes to be made about the coding systems listed above. The first is that they represent planned comparisons and not post hoc comparisons. In other words, they are comparisons that you plan to do before you begin analyzing your data, not comparisons that you think of once you have seen the results of preliminary analyses. Also, some forms of coding make more sense with ordinal categorical variables than with nominal categorical variables. Below we will show examples using race as a categorical variable, which is a nominal variable. Because simple coding compares the mean of the dependent variable for each level of the categorical variable to the mean of the dependent variable for the reference level, it makes sense with a nominal variable. However, it may not make as much sense to use a coding scheme that tests the linear effect of race. As we describe each type of coding system, we note those coding systems with which it does not make as much sense to use a nominal variable. Also, you may notice that we follow several rules when creating the contrast coding schemes. For more information about these rules, please see the section on User-Defined Coding.

This page will illustrate two ways that you can conduct analyses using these coding schemes: 1) using the xi3 command (an extended version of the xi command) and 2) manually coding the variables and entering them using the regress command. When using regress to do contrasts, you first need to create k-1 new variables (where k is the number of levels of the categorical variable) and use these new variables as predictors in your regression model.

#### The Example Data File

use http://www.ats.ucla.edu/stat/stata/notes/hsb2

Within this data file, we will focus on the categorical variable race, which has four levels (1 = Hispanic, 2 = Asian, 3 = African American and 4 = white), and we will use write as our dependent variable. Although our example uses a variable with four levels, these coding systems work with variables that have more or fewer categories. No matter which coding system you select, you will always have one fewer recoded variable than levels of the original variable. In our example, our categorical variable has four levels so we will have three new variables (a variable corresponding to the final level of the categorical variable would be redundant and therefore unnecessary).

Before considering any analyses, let's look at the mean of the dependent variable, write, for each level of race.  This will help in interpreting the output from later analyses.

tabulate race, summarize(write)

|      Summary of writing score
race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
hispanic |   46.458333   8.2724223          24
asian |          58   7.8993671          11
african-a |        48.2   9.3222992          20
white |   54.055172   9.1725582         145
------------+------------------------------------
Total |      52.775    9.478586         200
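Because xi3 is no longer maintained, readers on a recent version of Stata may prefer a built-in route to the same comparisons. The sketch below is not part of the original chapter. It assumes Stata 12 or later (where factor-variable notation and the contrast postestimation command are available) and that the hsb2 data are already in memory. Be aware that the contrast operator letters differ from the xi3 prefixes used on this page: with contrast, r. means reference level and g. means grand mean.

```stata
* Sketch only: assumes Stata 12+ and that hsb2 is already in memory.
* Fit the model once with ordinary indicator (dummy) coding.
regress write i.race

* Each level minus level 1 (compare with section 5.1)
contrast r.race, effects
* Each level minus the next level (compare with section 5.2)
contrast a.race, effects
* Each level minus the previous level (compare with section 5.3)
contrast ar.race, effects
* Each level minus the unweighted mean of later levels (section 5.4)
contrast h.race, effects
* Each level minus the unweighted mean of earlier levels (section 5.5)
contrast j.race, effects
* Each level minus the unweighted grand mean of the level means (section 5.6)
contrast g.race, effects
```

The effects option asks contrast to report each comparison with its standard error, test statistic and confidence interval, which is the information shown in the regression tables on this page.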

#### 5.1 Simple Coding

The results of simple coding are very similar to dummy coding in that each level is compared to the reference level. In the example below, level 1 is the reference level and the first comparison compares level 2 to level 1, the second comparison compares level 3 to level 1, and the third comparison compares level 4 to level 1.

Method 1: Using xi3

When using xi3, we can refer to g.race to indicate that we wish to code race using simple coding comparing each group to a reference group, as shown in the example below.

xi3: regress write g.race
s.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
_Irace_3 |   1.741667   2.732488     0.64   0.525    -3.647186    7.130519
_Irace_4 |   7.596839    1.98887     3.82   0.000     3.674507    11.51917
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The coefficient for _Irace_2 compares the mean of the dependent variable, write, for levels 2 and 1, yielding 58 - 46.458 = 11.54, and is statistically significant (p = .001). The coefficient for _Irace_3 compares the mean of the dependent variable, write, for levels 3 and 1, yielding 48.2 - 46.46 = 1.74, and this is not statistically significant (p = .525). Finally, the coefficient for _Irace_4 compares the mean of the dependent variable, write, for levels 4 and 1, yielding 54.06 - 46.46 = 7.60, and that is statistically significant (p < .001).

Method 2: Manual Coding

If we wished, we could manually code race instead of allowing xi3 to do the coding for us. Below we see the coding that replicates the results we saw in the example above. In the coding below, level 1 is the reference level and x1 compares level 2 to level 1, x2 compares level 3 to level 1, and x3 compares level 4 to level 1. For x1 the coding is 3/4 for level 2, and -1/4 for all other levels. Likewise, for x2 the coding is 3/4 for level 3, and -1/4 for all other levels, and for x3 the coding is 3/4 for level 4, and -1/4 for all other levels. It is not intuitive that this regression coding scheme yields these comparisons; however, if you desire simple comparisons, you can follow this general rule to obtain these comparisons.

SIMPLE regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| 1 (Hispanic) | -1/4 | -1/4 | -1/4 |
| 2 (Asian) | 3/4 | -1/4 | -1/4 |
| 3 (African American) | -1/4 | 3/4 | -1/4 |
| 4 (white) | -1/4 | -1/4 | 3/4 |

Below we show the more general rule for creating this kind of coding scheme using regression coding, where k is the number of levels of the categorical variable (in this instance, k = 4).

SIMPLE regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| 1 (Hispanic) | -1/k | -1/k | -1/k |
| 2 (Asian) | (k-1)/k | -1/k | -1/k |
| 3 (African American) | -1/k | (k-1)/k | -1/k |
| 4 (white) | -1/k | -1/k | (k-1)/k |
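The reason this coding yields level-versus-reference comparisons can be sketched with a little algebra. The notation is ours and not part of the Stata output: write \mu_j for the mean of write at level j of race and b_0, b_1, b_2, b_3 for the intercept and the coefficients of x1, x2 and x3. With three predictors for four groups, the model reproduces each group mean exactly, so for k = 4:

```latex
\begin{aligned}
\mu_1 &= b_0 - \tfrac14 b_1 - \tfrac14 b_2 - \tfrac14 b_3 \\
\mu_2 &= b_0 + \tfrac34 b_1 - \tfrac14 b_2 - \tfrac14 b_3 \\
\mu_3 &= b_0 - \tfrac14 b_1 + \tfrac34 b_2 - \tfrac14 b_3 \\
\mu_4 &= b_0 - \tfrac14 b_1 - \tfrac14 b_2 + \tfrac34 b_3 \\[4pt]
\mu_2 - \mu_1 &= b_1, \qquad \mu_3 - \mu_1 = b_2, \qquad \mu_4 - \mu_1 = b_3 \\
\tfrac14\,(\mu_1 + \mu_2 + \mu_3 + \mu_4) &= b_0
\end{aligned}
```

Subtracting the first line from any of the next three cancels everything except one coefficient. Because each column of codes sums to zero, adding the four lines leaves only 4 b_0, so the intercept is the unweighted mean of the four level means rather than the mean of the reference level, which is the main difference from dummy coding.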

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 = -1/4
replace x1 = 3/4 if race==2

generate x2 = -1/4
replace x2 = 3/4 if race==3

generate x3 = -1/4
replace x3 = 3/4 if race==4

regress write x1 x2 x3

As you can see, the results below match those when we used the xi3 command above.

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
x2 |   1.741667   2.732488     0.64   0.525    -3.647186    7.130519
x3 |   7.596839    1.98887     3.82   0.000     3.674507    11.51917
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

#### 5.2 Forward Difference Coding

In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the next (adjacent) level.  In our example below, the first comparison compares the mean of write for level 1 with the mean of write for level 2 of race (Hispanics minus Asians).  The second comparison compares the mean of write for level 2 minus level 3, and the third comparison compares the mean of write for level 3 minus level 4.  This type of coding may be useful with either a nominal or an ordinal variable.

Method 1: Using xi3

We can indicate that we want forward adjacent difference coding for race by specifying a.race as shown below.

xi3 : regress write a.race

f.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_1 |  -11.54167   3.286129    -3.51   0.001    -18.02238   -5.060956
_Irace_2 |        9.8   3.387834     2.89   0.004     3.118714    16.48129
_Irace_3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

With this coding system, adjacent levels of the categorical variable are compared.  Hence, the mean of the dependent variable at level 1 is compared to the mean of the dependent variable at level 2:  46.4583 - 58 = -11.542, which is statistically significant.  For the comparison between levels 2 and 3, the calculation of the contrast coefficient would be 58 - 48.2 = 9.8, which is also statistically significant.  Finally, comparing levels 3 and 4, 48.2 - 54.0552 = -5.855, a statistically significant difference.  One would conclude from this that each adjacent level of race is statistically significantly different.

Method 2: Manual Coding

For the first comparison, where the first and second levels are compared, x1 is coded 3/4 for level 1 and the other levels are coded -1/4.  For the second comparison where level 2 is compared with level 3, x2 is coded 1/2 1/2 -1/2 -1/2, and for the third comparison where level 3 is compared with level 4, x3 is coded 1/4 1/4 1/4 -3/4.

FORWARD DIFFERENCE regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| | Level 1 v. Level 2 | Level 2 v. Level 3 | Level 3 v. Level 4 |
| 1 (Hispanic) | 3/4 | 1/2 | 1/4 |
| 2 (Asian) | -1/4 | 1/2 | 1/4 |
| 3 (African American) | -1/4 | -1/2 | 1/4 |
| 4 (white) | -1/4 | -1/2 | -3/4 |

The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case k = 4).

FORWARD DIFFERENCE regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| | Level 1 v. Level 2 | Level 2 v. Level 3 | Level 3 v. Level 4 |
| 1 (Hispanic) | (k-1)/k | (k-2)/k | (k-3)/k |
| 2 (Asian) | -1/k | (k-2)/k | (k-3)/k |
| 3 (African American) | -1/k | -2/k | (k-3)/k |
| 4 (white) | -1/k | -2/k | -3/k |

generate x1 = 3/4 if race==1
replace x1 = -1/4 if inlist(race,2,3,4)
generate x2 = 1/2 if inlist(race,1,2)
replace x2 = -1/2 if inlist(race,3,4)
generate x3 = 1/4 if inlist(race,1,2,3)
replace x3 = -3/4 if race==4
regress write x1 x2 x3
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |  -11.54167   3.286129    -3.51   0.001    -18.02238   -5.060956
x2 |        9.8   3.387834     2.89   0.004     3.118714    16.48129
x3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

You can see the regression coefficient for x1 is the mean of write for level 1 (Hispanic) minus the mean of write for level 2 (Asian).  Likewise, the regression coefficient for x2 is the mean of write for level 2 (Asian) minus the mean of write for level 3 (African American), and the regression coefficient for x3 is the mean of write for level 3 (African American) minus the mean of write for level 4 (white).
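A short worked check of why the forward difference codes isolate adjacent differences, using notation that is ours rather than Stata's: \mu_j is the mean of write at level j, and b_1, b_2, b_3 are the coefficients of x1, x2 and x3. Each group mean equals the intercept plus the codes in that row of the table times the coefficients, so subtracting adjacent rows gives:

```latex
\begin{aligned}
\mu_1 - \mu_2 &= \left(\tfrac34 + \tfrac14\right) b_1 + \left(\tfrac12 - \tfrac12\right) b_2 + \left(\tfrac14 - \tfrac14\right) b_3 = b_1 \\
\mu_2 - \mu_3 &= \left(-\tfrac14 + \tfrac14\right) b_1 + \left(\tfrac12 + \tfrac12\right) b_2 + \left(\tfrac14 - \tfrac14\right) b_3 = b_2 \\
\mu_3 - \mu_4 &= \left(-\tfrac14 + \tfrac14\right) b_1 + \left(-\tfrac12 + \tfrac12\right) b_2 + \left(\tfrac14 + \tfrac34\right) b_3 = b_3
\end{aligned}
```

In each line the codes of the two rows being compared differ by exactly 1 in one column and by 0 in the others, which is why each coefficient equals a single adjacent difference (for example, b_1 = 46.4583 - 58 = -11.542).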

#### 5.3 Backward Difference Coding

In this coding system, the mean of the dependent variable for one level of the categorical variable is compared to the mean of the dependent variable for the prior adjacent level. In our example below, the first comparison compares the mean of write for level 2 with the mean of write for level 1 of race (Asians minus Hispanics). The second comparison compares the mean of write for level 3 minus level 2, and the third comparison compares the mean of write for level 4 minus level 3. This type of coding may be useful with either a nominal or an ordinal variable.

Method 1: Using xi3

We can indicate that we want backward difference coding for race by specifying b.race as shown below.

xi3 : regress write b.race
b.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
_Irace_3 |       -9.8   3.387834    -2.89   0.004    -16.48129   -3.118714
_Irace_4 |   5.855172    2.15276     2.72   0.007     1.609626    10.10072
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

With this coding system, adjacent levels of the categorical variable are compared, with each level compared to the prior level.  Hence, the mean of the dependent variable at level 2 is compared to the mean of the dependent variable at level 1:  58-46.4583 = 11.542, which is statistically significant.  For the comparison between levels 3 and 2, we calculate 48.2 - 58 = -9.8, which is also statistically significant.  Finally, comparing levels 4 and 3, 54.0552 - 48.2 = 5.855, a statistically significant difference.  One would conclude from this that each adjacent level of race is statistically significantly different.

Method 2: Manual Coding

For the first comparison, where the second level is compared with the first, x1 is coded -3/4 for level 1 while the other levels are coded 1/4. For the second comparison, where level 3 is compared with level 2, x2 is coded -1/2 -1/2 1/2 1/2, and for the third comparison, where level 4 is compared with level 3, x3 is coded -1/4 -1/4 -1/4 3/4.

BACKWARD DIFFERENCE regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| | Level 2 v. Level 1 | Level 3 v. Level 2 | Level 4 v. Level 3 |
| 1 (Hispanic) | -3/4 | -1/2 | -1/4 |
| 2 (Asian) | 1/4 | -1/2 | -1/4 |
| 3 (African American) | 1/4 | 1/2 | -1/4 |
| 4 (white) | 1/4 | 1/2 | 3/4 |

The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case, k = 4).

BACKWARD DIFFERENCE regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| | Level 2 v. Level 1 | Level 3 v. Level 2 | Level 4 v. Level 3 |
| 1 (Hispanic) | -(k-1)/k | -(k-2)/k | -(k-3)/k |
| 2 (Asian) | 1/k | -(k-2)/k | -(k-3)/k |
| 3 (African American) | 1/k | 2/k | -(k-3)/k |
| 4 (white) | 1/k | 2/k | 3/k |

generate x1 = -3/4 if race==1
replace  x1 =  1/4 if inlist(race,2,3,4)

generate x2 = -1/2 if inlist(race,1,2)
replace  x2 =  1/2 if inlist(race,3,4)

generate x3 = -1/4 if inlist(race,1,2,3)
replace  x3 =  3/4 if race==4

regress write x1 x2 x3
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
x2 |       -9.8   3.387834    -2.89   0.004    -16.48129   -3.118714
x3 |   5.855172    2.15276     2.72   0.007     1.609626    10.10072
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

In the above example, the regression coefficient for x1 is the mean of write for level 2 minus the mean of write for level 1 (58- 46.4583 = 11.542).  Likewise, the regression coefficient for x2 is the mean of write for level 3 minus the mean of write for level 2, and the regression coefficient for x3 is the mean of write for level 4 minus the mean of write for level 3.
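One way to see how backward difference coding relates to simple coding is to add adjacent differences together. The sketch below is an illustration rather than part of the original analysis. It assumes the manually coded backward difference regression above (regress write x1 x2 x3) was the most recent estimation command, and it uses Stata's lincom command for linear combinations of coefficients.

```stata
* Assumes the backward difference regression on x1 x2 x3 was just run.

* (level 2 - level 1) + (level 3 - level 2) = level 3 - level 1
lincom x1 + x2

* Adding all three adjacent differences gives level 4 - level 1
lincom x1 + x2 + x3
```

Since the adjacent differences telescope, the first estimate should equal the level 3 versus level 1 difference from section 5.1 (11.54167 - 9.8, about 1.74), and the second should equal the level 4 versus level 1 difference (about 7.60).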

#### 5.4 Helmert Coding

Helmert coding compares each level of a categorical variable to the mean of the subsequent levels.  Hence, the first contrast compares the mean of the dependent variable for level 1 of race with the mean of all of the subsequent levels of race (levels 2, 3, and 4), the second contrast compares the mean of the dependent variable for level 2 of race with the mean of all of the subsequent levels of race (levels 3 and 4), and the third contrast compares the mean of the dependent variable for level 3 of race with the mean of all of the subsequent levels of race (level 4). While this type of coding system does not make much sense with a nominal variable like race, it is useful in situations where the levels of the categorical variable are ordered say, from lowest to highest, or smallest to largest, etc.

Method 1: Using xi3

We can specify Helmert coding for race using h.race as shown below.

xi3 : regress write h.race
h.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_1 |  -6.960057   2.175211    -3.20   0.002    -11.24988   -2.670234
_Irace_2 |   6.872414   2.926325     2.35   0.020     1.101287    12.64354
_Irace_3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficient for the comparison between level 1 and the remaining levels is calculated by taking the mean of the dependent variable for level 1 and subtracting the mean of the dependent variable for levels 2, 3 and 4: 46.4583 - [(58 + 48.2 + 54.0552) / 3] = -6.960, which is statistically significant.  This means that the mean of write for level 1 of race is statistically significantly different from the mean of write for levels 2 through 4.  As noted above, this comparison probably is not meaningful because the variable race is nominal.  This type of comparison would be more meaningful if the categorical variable was ordinal.

To calculate the contrast coefficient for the comparison between level 2 and the later levels, you subtract the mean of the dependent variable for levels 3 and 4 from the mean of the dependent variable for level 2:  58 - [(48.2 + 54.0552) / 2] = 6.872, which is statistically significant.  The regression coefficient for the comparison between level 3 and level 4 is the difference between the mean of the dependent variable for the two levels:  48.2 - 54.0552 = -5.855, which is also statistically significant.

Method 2: Manual Coding

Below we see an example of Helmert regression coding.  For the first comparison (comparing level 1 with levels 2, 3 and 4) the codes are 3/4 and -1/4 -1/4 -1/4.  The second comparison compares level 2 with levels 3 and 4 and is coded 0 2/3 -1/3 -1/3.  The third comparison compares level 3 to level 4 and is coded 0 0 1/2 -1/2.

HELMERT regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| | Level 1 v. Later | Level 2 v. Later | Level 3 v. Later |
| 1 (Hispanic) | 3/4 | 0 | 0 |
| 2 (Asian) | -1/4 | 2/3 | 0 |
| 3 (African American) | -1/4 | -1/3 | 1/2 |
| 4 (white) | -1/4 | -1/3 | -1/2 |

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 =  3/4 if race==1
replace  x1 = -1/4 if inlist(race,2,3,4)

generate x2 =    0 if race==1
replace  x2 =  2/3 if race==2
replace  x2 = -1/3 if inlist(race,3,4)

generate x3 =    0 if inlist(race,1,2)
replace  x3 =  1/2 if race==3
replace  x3 = -1/2 if race==4

regress write x1 x2 x3
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |  -6.960057   2.175211    -3.20   0.002    -11.24988   -2.670234
x2 |   6.872414   2.926325     2.35   0.020     1.101287    12.64354
x3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

As you see above, the regression coefficient for x1 is the mean of write for level 1 (Hispanic) minus the mean of write for all subsequent levels (levels 2, 3 and 4). Likewise, the regression coefficient for x2 is the mean of write for level 2 minus the mean of write for levels 3 and 4. Finally, the regression coefficient for x3 is the mean of write for level 3 minus the mean of write for level 4.
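The link between the Helmert codes in the table and these comparisons can be sketched algebraically. The symbols are our own notation: \mu_j is the mean of write at level j, b_0 is the intercept, and b_1, b_2, b_3 are the coefficients of x1, x2 and x3. Reading the codes off the rows of the Helmert table:

```latex
\begin{aligned}
\mu_1 &= b_0 + \tfrac34 b_1 \\
\mu_2 &= b_0 - \tfrac14 b_1 + \tfrac23 b_2 \\
\mu_3 &= b_0 - \tfrac14 b_1 - \tfrac13 b_2 + \tfrac12 b_3 \\
\mu_4 &= b_0 - \tfrac14 b_1 - \tfrac13 b_2 - \tfrac12 b_3 \\[4pt]
\mu_1 - \tfrac13(\mu_2 + \mu_3 + \mu_4) &= \left(b_0 + \tfrac34 b_1\right) - \left(b_0 - \tfrac14 b_1\right) = b_1 \\
\mu_2 - \tfrac12(\mu_3 + \mu_4) &= \tfrac23 b_2 + \tfrac13 b_2 = b_2 \\
\mu_3 - \mu_4 &= b_3
\end{aligned}
```

The averages here are unweighted averages of the level means, so each level counts equally regardless of how many observations it contains. That matches the hand calculations earlier in this section, such as 46.4583 - [(58 + 48.2 + 54.0552) / 3] = -6.960.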

#### 5.5 Reverse Helmert Coding

Reverse Helmert coding (also known as difference coding) is just the opposite of Helmert coding: instead of comparing each level of the categorical variable to the mean of the subsequent level(s), each is compared to the mean of the previous level(s). In our example, the first contrast codes the comparison of the mean of the dependent variable for level 2 of race to the mean of the dependent variable for level 1 of race. The second comparison compares the mean of the dependent variable for level 3 of race with the mean for levels 1 and 2 of race, and the third comparison compares the mean of the dependent variable for level 4 of race with the mean for levels 1, 2 and 3. Clearly, this coding system does not make much sense with our example of race because it is a nominal variable. However, this system is useful when the levels of the categorical variable are ordered in a meaningful way. For example, if we had a categorical variable in which work-related stress was coded as low, medium or high, then comparing each level with the mean of the previous levels of the variable would make more sense.

Method 1: Using xi3

We can specify reverse Helmert coding for race using r.race as shown below.

xi3 : regress write r.race

r.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
_Irace_3 |  -4.029167   2.602363    -1.55   0.123    -9.161394    1.103061
_Irace_4 |   3.169061   1.487987     2.13   0.034     .2345401    6.103582
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficient for the first comparison shown in this output was calculated by subtracting the mean of the dependent variable for level 1 of the categorical variable from the mean of the dependent variable for level 2: 58 - 46.4583 = 11.542. This result is statistically significant. The regression coefficient for the second comparison (between level 3 and the previous levels) was calculated by subtracting the mean of the dependent variable for levels 1 and 2 from that of level 3: 48.2 - [(46.4583 + 58) / 2] = -4.029. This result is not statistically significant, meaning that there is not a reliable difference between the mean of write for level 3 of race compared to the mean of write for levels 1 and 2 (Hispanics and Asians). As noted above, this type of coding system does not make much sense for a nominal variable such as race. For the comparison of level 4 and the previous levels, you take the mean of the dependent variable for those levels and subtract it from the mean of the dependent variable for level 4: 54.0552 - [(46.4583 + 58 + 48.2) / 3] = 3.169. This result is statistically significant.

Method 2: Manual Coding

The regression coding for reverse Helmert coding is shown below. For the first comparison, where the first and second levels are compared, x1 is coded -1/2 for level 1, 1/2 for level 2, and 0 otherwise. For the second comparison, the values of x2 are coded -1/3, -1/3, 2/3 and 0. Finally, for the third comparison, the values of x3 are coded -1/4, -1/4, -1/4 and 3/4.

REVERSE HELMERT regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| 1 (Hispanic) | -1/2 | -1/3 | -1/4 |
| 2 (Asian) | 1/2 | -1/3 | -1/4 |
| 3 (African American) | 0 | 2/3 | -1/4 |
| 4 (white) | 0 | 0 | 3/4 |

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 = -1/2 if race==1
replace  x1 =  1/2 if race==2
replace  x1 =    0 if inlist(race,3,4)

generate x2 = -1/3 if inlist(race,1,2)
replace  x2 =  2/3 if race==3
replace  x2 =    0 if race==4

generate x3 = -1/4 if inlist(race,1,2,3)
replace  x3 =  3/4 if race==4

regress write x1 x2 x3
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
x2 |  -4.029167   2.602363    -1.55   0.123    -9.161394    1.103061
x3 |   3.169061   1.487987     2.13   0.034     .2345401    6.103582
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

In the above example, the regression coefficient for x1 is the mean of write for level 2 (Asian) minus the mean of write for level 1 (Hispanic).  Likewise, the regression coefficient for x2 is the mean of write for level 3 minus the mean of write for levels 1 and 2 combined.  Finally, the regression coefficient for x3 is the mean of write for level 4 minus the mean of write for levels 1, 2 and 3 combined.  These are the same comparisons, with the same coefficients, as in the xi3 output.
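The pattern in the reverse Helmert regression coding generalizes to any number of levels: for comparison j, levels 1 through j are coded -1/(j+1), level j+1 is coded j/(j+1), and later levels are coded 0. A small Python sketch of that rule follows; the helper name reverse_helmert_coding is hypothetical and is not a Stata or xi3 function.

```python
from fractions import Fraction

def reverse_helmert_coding(k):
    """Regression coding for k levels: one row per level, one column per comparison."""
    rows = []
    for level in range(1, k + 1):
        row = []
        for j in range(1, k):  # comparison j: level j+1 vs. levels 1..j
            if level <= j:
                row.append(Fraction(-1, j + 1))
            elif level == j + 1:
                row.append(Fraction(j, j + 1))
            else:
                row.append(Fraction(0))
        rows.append(row)
    return rows

# Print the coding for the four levels of race used in this chapter.
for level, row in enumerate(reverse_helmert_coding(4), start=1):
    print(level, [str(v) for v in row])
```

For k = 4 this reproduces the x1, x2 and x3 values assigned with generate and replace above, and each column sums to zero.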

#### 5.6 Deviation Coding

This coding system compares the mean of the dependent variable for a given level to the mean of the dependent variable across all levels of the variable.  In our example below, the first comparison compares level 2 (Asians) to all levels of race, the second compares level 3 (African Americans) to all levels of race, and the third compares level 4 (White) to all levels of race.

Method 1: Using xi3

We indicate that we would like race to be coded using deviation (effect) coding by specifying e.race, as shown below.

. xi3 : regress write e.race
e.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_2 |   6.321624   2.160314     2.93   0.004     2.061179    10.58207
_Irace_3 |  -3.478376   1.732305    -2.01   0.046    -6.894726    -.062027
_Irace_4 |   2.376796   1.115991     2.13   0.034     .1759051    4.577687
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficient for _Irace_2 is the mean for level 2 minus the grand mean.  However, this grand mean is not the overall mean of the dependent variable that you would get from the summarize command.  Rather, it is the mean of the means of the dependent variable at each level of the categorical variable:  (46.4583 + 58 + 48.2 + 54.0552) / 4 = 51.678375.  This regression coefficient is then 58 - 51.678375 = 6.32.  Likewise, the coefficient for _Irace_3 is the mean for level 3 of race minus the grand mean, i.e., 48.2 - 51.678 = -3.48, and the coefficient for _Irace_4 is the mean for level 4 of race minus the grand mean, 54.055 - 51.678 = 2.38.  The constant, 51.678, is this grand mean.
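As a quick check of these figures, the sketch below (Python, used here only as a calculator and not part of the original Stata session) computes the mean of the level means and each level's deviation from it. It also shows something the output does not print directly: because the four deviations sum to zero, the deviation for the omitted level 1 is the negative of the sum of the other three.

```python
# Cell means of write for race levels 1-4, as quoted in the text above.
means = {1: 46.4583, 2: 58.0, 3: 48.2, 4: 54.0552}

# Unweighted grand mean: the mean of the four level means.
grand_mean = sum(means.values()) / len(means)
deviations = {level: m - grand_mean for level, m in means.items()}

print(f"grand mean (mean of level means): {grand_mean:.6f}")
for level in (2, 3, 4):
    print(f"level {level} minus grand mean: {deviations[level]:.3f}")

# The omitted level's deviation is implied: all four deviations sum to zero.
implied_level_1 = -(deviations[2] + deviations[3] + deviations[4])
print(f"level 1 minus grand mean (implied): {implied_level_1:.3f}")
```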

Method 2: Manual Coding

As you see in the example below, the regression coding is accomplished by assigning 1 to level 2 for the first comparison (because level 2 is the level to be compared to all), 1 to level 3 for the second comparison (because level 3 is to be compared to all), and 1 to level 4 for the third comparison (because level 4 is to be compared to all).  Note that a -1 is assigned to level 1 for all three comparisons (because it is the level that is never compared to the other levels) and all other values are assigned a 0.  This regression coding scheme yields the comparisons described above.

DEVIATION regression coding
| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| | Level 2 v. Mean | Level 3 v. Mean | Level 4 v. Mean |
| 1 (Hispanic) | -1 | -1 | -1 |
| 2 (Asian) | 1 | 0 | 0 |
| 3 (African American) | 0 | 1 | 0 |
| 4 (white) | 0 | 0 | 1 |

Below we illustrate how to create x1, x2 and x3 and enter these new variables into the regression model using the regress command.

generate x1 = -1 if race==1
replace  x1 =  1 if race==2
replace  x1 =  0 if inlist(race,3,4)

generate x2 = -1 if race==1
replace  x2 =  1 if race==3
replace  x2 =  0 if inlist(race,2,4)

generate x3 = -1 if race==1
replace  x3 =  1 if race==4
replace  x3 =  0 if inlist(race,2,3)

regress write x1 x2 x3
     Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |   6.321624   2.160314     2.93   0.004     2.061179    10.58207
x2 |  -3.478376   1.732305    -2.01   0.046    -6.894726    -.062027
x3 |   2.376796   1.115991     2.13   0.034     .1759051    4.577687
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The regression coefficients for this analysis match those in the example above and have the same interpretation.
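To see why the -1/0/1 coding produces deviations from the grand mean, note that a model with a constant and three coded variables fits the four level means exactly, so the coefficients solve a small system of four equations. The Python sketch below solves that system with exact fractions. The solve helper is our own illustration, not a Stata routine, and the level means are the rounded values quoted earlier in this section.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Columns: constant, x1, x2, x3; rows: race levels 1-4 (deviation coding).
design = [
    [1, -1, -1, -1],
    [1,  1,  0,  0],
    [1,  0,  1,  0],
    [1,  0,  0,  1],
]
means = [Fraction("46.4583"), Fraction("58"), Fraction("48.2"), Fraction("54.0552")]

cons, b1, b2, b3 = solve(design, means)
print(float(cons), float(b1), float(b2), float(b3))
```

The solution is the grand mean followed by the deviations of levels 2, 3 and 4 from it, which agrees with the regress output up to rounding of the level means.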

#### 5.7 Orthogonal Polynomial Coding

Orthogonal polynomial coding is a form of trend analysis in that it looks for linear, quadratic and cubic trends across the levels of the categorical variable.  This type of coding system should be used only with an ordinal variable in which the levels are equally spaced.  Examples of such a variable might be income or education.  The table below shows the contrast coefficients for the linear, quadratic and cubic trends for the four levels.  These can be found in most statistics books on linear models.

POLYNOMIAL
| Level of race | Linear (x1) | Quadratic (x2) | Cubic (x3) |
|---|---|---|---|
| 1 (Hispanic) | -.671 | .5 | -.224 |
| 2 (Asian) | -.224 | -.5 | .671 |
| 3 (African American) | .224 | -.5 | -.671 |
| 4 (white) | .671 | .5 | .224 |
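The coefficients in the table above match the familiar integer trend contrasts for four equally spaced levels, (-3 -1 1 3), (1 -1 -1 1) and (-1 3 -3 1), each rescaled to unit length. The short Python sketch below is offered only as a check on the table, not as the way xi3 computes its coding; it performs the rescaling and confirms that the three contrasts are mutually orthogonal.

```python
import math

# Integer trend contrasts for four equally spaced levels.
raw = {
    "linear":    [-3, -1,  1, 3],
    "quadratic": [ 1, -1, -1, 1],
    "cubic":     [-1,  3, -3, 1],
}

def normalize(v):
    """Scale a contrast so that its squared coefficients sum to one."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(u, v):
    """Inner product of two contrasts; zero means they are orthogonal."""
    return sum(a * b for a, b in zip(u, v))

poly = {name: normalize(v) for name, v in raw.items()}
for name, v in poly.items():
    print(name, [round(x, 3) for x in v])

print(dot(poly["linear"], poly["quadratic"]),
      dot(poly["linear"], poly["cubic"]),
      dot(poly["quadratic"], poly["cubic"]))
```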

Method 1: Using xi3

We indicate we would like race to be coded using orthogonal polynomials by using o.race as shown below.

. xi3 : regress write o.race
o.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_1 |   2.080058   .6381718     3.26   0.001     .8214929    3.338622
_Irace_2 |  -.2159021   .6381718    -0.34   0.735    -1.474467    1.042663
_Irace_3 |   2.279811   .6381718     3.57   0.000     1.021246    3.538375
_cons |     52.775   .6381718    82.70   0.000     51.51644    54.03356
------------------------------------------------------------------------------

The three coded variables, _Irace_1, _Irace_2 and _Irace_3, represent the linear, quadratic and cubic trends respectively. Of course, the term 'trend' doesn't make sense if the variable is nominal, like race. But if we pretend that race is ordinal, then there would be significant linear and cubic trends. It is also easy to test for a nonlinear trend by testing the quadratic and cubic terms jointly.

. test _Irace_2 _Irace_3

( 1)  _Irace_2 = 0.0
( 2)  _Irace_3 = 0.0

F(  2,   196) =    6.44
Prob > F =    0.0020

The test for nonlinear trend is statistically significant. This example serves to show how to use xi3, but because race is not ordered, the trends cannot be interpreted. We therefore turn to an ordered example.

Example 2

We will create our own categorical variable, readcat, from the continuous variable read.

. gen readcat = read
. recode readcat 1/43=1 44/49=2 50/59=3 60/100=4
. tabulate readcat

readcat |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |         39       19.50       19.50
2 |         44       22.00       41.50
3 |         61       30.50       72.00
4 |         56       28.00      100.00
------------+-----------------------------------
Total |        200      100.00

Now we can run the regression with xi3.

. xi3: regress write o.readcat

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   29.64
Model |  5579.22989     3   1859.7433           Prob > F      =  0.0000
Residual |  12299.6451   196  62.7532914           R-squared     =  0.3121
Total |   17878.875   199   89.843593           Root MSE      =  7.9217

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ireadcat_1 |    5.27249   .5601486     9.41   0.000     4.167798    6.377182
_Ireadcat_2 |   .3097532   .5601486     0.55   0.581     -.794939    1.414445
_Ireadcat_3 |  -.0324612   .5601486    -0.06   0.954    -1.137153    1.072231
_cons |     52.775   .5601486    94.22   0.000     51.67031    53.87969
------------------------------------------------------------------------------


We see from the coefficient for _Ireadcat_1 that the linear trend is significant, while neither the quadratic nor the cubic trend (_Ireadcat_2 and _Ireadcat_3) is significant. The joint test for nonlinear trend is also nonsignificant.

. test _Ireadcat_2 _Ireadcat_3

F(  2,   196) =    0.15
Prob > F =    0.8569


Method 2: Manual Coding

For the moment we are skipping manual coding.

#### 5.8 User-Defined Coding

You can use the xi3 command to create your own regression coding system.  For our example, we will make the following three comparisons:

1) level 1 to level 3
2) level 2 to levels 1 and 4
3) levels 1 and 2 to levels 3 and 4.

In order to compare level 1 to level 3, we use the contrast coefficients 1 0 -1 0.  To compare level 2 to levels 1 and 4, we use the contrast coefficients -1/2 1 0 -1/2.  Finally, to compare levels 1 and 2 with levels 3 and 4, we use the coefficients 1/2 1/2 -1/2 -1/2.  Before proceeding to the Stata code necessary to conduct these analyses, let's take a moment to more fully explain the logic behind the selection of these contrast coefficients.

For the first contrast, we are comparing level 1 to level 3, and the contrast coefficients are 1 0 -1 0.  The levels whose contrast coefficients have opposite signs are the ones being compared.  The mean of the dependent variable at each level is multiplied by that level's contrast coefficient, and the products are summed.  Hence, levels 2 and 4 are not involved in the comparison:  they are multiplied by zero and "dropped out."  You will also notice that the contrast coefficients sum to zero.  This is necessary.  If the contrast coefficients do not sum to zero, the contrast is not estimable and Stata will issue an error message.  Which level of the categorical variable is assigned a positive or negative value is not terribly important:  1 0 -1 0 is the same as -1 0 1 0 in that both of these codings compare the first and the third levels of the variable.  However, the sign of the regression coefficient would change.

Now let's look at the contrast coefficients for the second and third comparisons.  You will notice that in both cases the positive coefficients sum to one and the negative coefficients sum to minus one.  They do not have to do so.  You may wonder why we would use fractions like -1/2 1 0 -1/2 instead of whole numbers such as -1 2 0 -1.  While -1/2 1 0 -1/2 and -1 2 0 -1 both compare level 2 with levels 1 and 4 and both will give you the same t-value and p-value for the regression coefficient, the regression coefficients themselves would be different, as would their interpretation.  The coefficient for the -1/2 1 0 -1/2 contrast is the mean of level 2 minus the mean of the means for levels 1 and 4:  58 - (46.4583 + 54.0552)/2 = 7.74325.  (Alternatively, you can multiply the contrast coefficients by the mean of the dependent variable for each level of the categorical variable: -1/2*46.4583 + 1*58.00 + 0*48.20 + -1/2*54.0552 = 7.74325.  Clearly these are equivalent ways of thinking about how the regression coefficient is calculated.)  By comparison, the coefficient for the -1 2 0 -1 contrast is two times the mean for level 2 minus the sum of the means of the dependent variable for levels 1 and 4:  2*58 - (46.4583 + 54.0552) = 15.4865, which is the same as -1*46.4583 + 2*58 + 0*48.20 - 1*54.0552 = 15.4865.  Note that the regression coefficient using the contrast coefficients -1 2 0 -1 is twice the regression coefficient obtained when -1/2 1 0 -1/2 is used.
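The points made in the last two paragraphs (coefficients must sum to zero, flipping signs flips the sign of the result, and doubling the coefficients doubles the result) can be checked with a few lines of arithmetic. The Python sketch below is an illustration only; contrast_value is a hypothetical helper, and the zero-sum check merely mimics the estimability requirement described above rather than reproducing Stata's own error handling.

```python
from fractions import Fraction as F

# Cell means of write for race levels 1-4, as quoted in the text.
means = [F("46.4583"), F("58"), F("48.2"), F("54.0552")]

def contrast_value(coefs, means):
    """Sum of coefficient * level mean; rejects coefficients that do not sum to zero."""
    coefs = [F(c) for c in coefs]
    if sum(coefs) != 0:
        raise ValueError("contrast coefficients must sum to zero")
    return sum(c * m for c, m in zip(coefs, means))

halves = contrast_value([F(-1, 2), 1, 0, F(-1, 2)], means)
wholes = contrast_value([-1, 2, 0, -1], means)
flipped = contrast_value([F(1, 2), -1, 0, F(1, 2)], means)

print(float(halves))   # level 2 minus the mean of levels 1 and 4
print(float(wholes))   # same comparison on a doubled scale
print(wholes / halves, flipped == -halves)
```

Only the scale and sign of the coefficient change across these three versions; the comparison being tested is the same.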

Method 1: Using xi3

We use the char command to indicate the contrast coefficients to be used for race, as shown below.  The three contrasts described above (1 0 -1 0, then -1/2 1 0 -1/2, then 1/2 1/2 -1/2 -1/2) are entered in the char race[user] command as the rows of a matrix, separated by backslashes.  This specifies that the user-defined coding for race consists of three contrasts (because race has four levels):  (1 0 -1 0 \ -.5 1 0 -.5 \ .5 .5 -.5 -.5).

char race[user] (1 0 -1 0 \ -.5 1 0 -.5 \ .5 .5 -.5 -.5)

xi3 : regress write u.race
u.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Irace_1 |  -1.741667   2.732488    -0.64   0.525    -7.130519    3.647186
_Irace_2 |   7.743247   2.897186     2.67   0.008     2.029588    13.45691
_Irace_3 |    1.10158   1.964244     0.56   0.576    -2.772186    4.975347
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

The coefficient for _Irace_1 corresponds to the first contrast comparing level 1 to level 3 of race.  The coefficient, -1.74, is the mean of write for level 1 minus the mean of write for level 3, and it is not significant, p = 0.525.  The coefficient for _Irace_2 is 7.743, which is the mean of level 2 minus the mean of the means for levels 1 and 4, and this difference is significant, p = 0.008.  The final regression coefficient is 1.1, which is the mean of levels 1 and 2 minus the mean of levels 3 and 4, and this contrast is not statistically significant, p = 0.576.

Method 2: Manual Coding

As in the xi3 example above, we will make the following three comparisons:

1) level 1 to level 3,
2) level 2 to levels 1 and 4 and
3) levels 1 and 2 to levels 3 and 4.

The xi3 command converts the contrast coding into regression coding for us.  However, we could do this process manually as well.

With xi3 it was quite easy to translate the comparisons we wanted to make into contrast codings, but it is not as easy to translate the comparisons we want directly into a regression coding scheme.  If we know the contrast coding system, then we can convert that into a regression coding system using the Stata commands shown below.  As you can see, we place the three contrast codings we want into the matrix c and then perform a set of matrix operations on c, yielding the matrix x.  We then display x using the matrix list command.

matrix input c = (1 0 -1 0 \ -.5 1 0 -.5 \ .5 .5 -.5 -.5)
matrix x = c'*inv(c*c')
matrix list x

x[4,3]
r1    r2    r3
c1   -.5    -1   1.5
c2    .5     1   -.5
c3  -1.5    -1   1.5
c4   1.5     1  -2.5

This converted the contrast coding into the regression coding that we need for running this analysis with the regress command.  Below, we use the generate and replace commands to create x1, x2 and x3 according to the coding shown above and then enter them into the regression analysis.

generate x1 =  -.5 if race == 1
replace  x1 =   .5 if race == 2
replace  x1 = -1.5 if race == 3
replace  x1 =  1.5 if race == 4
generate x2 =  -1 if race == 1
replace  x2 =   1 if race == 2
replace  x2 =  -1 if race == 3
replace  x2 =   1 if race == 4
generate x3 =  1.5 if race == 1
replace  x3 =  -.5 if race == 2
replace  x3 =  1.5 if race == 3
replace  x3 = -2.5 if race == 4
regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
x1 |  -1.741667   2.732488    -0.64   0.525    -7.130519    3.647186
x2 |   7.743247   2.897186     2.67   0.008     2.029588    13.45691
x3 |    1.10158   1.964244     0.56   0.576    -2.772186    4.975347
_cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------

As you can see, the results of this analysis match those produced using xi3.
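The matrix step used above, x = c'*inv(c*c'), can also be verified independently of Stata. The following Python sketch uses exact fractions and a small hand-written Gauss-Jordan inverse. The helpers matmul, transpose and inverse are our own and stand in for Stata's matrix operators; inverse assumes its input already contains Fraction entries.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Matrix product of two lists-of-rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(a):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(a)
    aug = [list(row) + [F(int(i == j)) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Contrast coding: one row per comparison, one column per level of race.
c = [
    [F(1), F(0), F(-1), F(0)],
    [F(-1, 2), F(1), F(0), F(-1, 2)],
    [F(1, 2), F(1, 2), F(-1, 2), F(-1, 2)],
]
ct = transpose(c)
x = matmul(ct, inverse(matmul(c, ct)))   # regression coding, one row per level

for level, row in enumerate(x, start=1):
    print(level, [float(v) for v in row])
```

Each printed row corresponds to one level of race and gives the values assigned to x1, x2 and x3 for that level. Multiplying c by x gives the identity matrix, which is one way to see that each coded variable picks out exactly one of the requested contrasts.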

#### 5.9 Summary

This page has described a number of different coding systems that you could use for categorical data, and two different strategies you could use for performing the analyses.  You can choose a coding system that yields the comparisons that make the most sense for testing your hypotheses.  Of the two strategies (xi3 and manual coding), xi3 automates the process of creating the coding, but at the cost of some control.  Manual coding gives you more control over how the variables are coded, but can be more laborious and tedious.  In general we would recommend using the easiest method that accomplishes your goals.