### Regression with Stata Chapter 3 - Regression with Categorical Predictors

Chapter Outline
3.0 Regression with Categorical Predictors
3.1 Regression with a 0/1 variable
3.2 Regression with a 1/2 variable
3.3 Regression with a 1/2/3 variable
3.4 Regression with multiple categorical predictors
3.5 Categorical predictor with interactions
3.6 Continuous and Categorical variables
3.7 Interactions of Continuous by 0/1 Categorical variables
3.8 Continuous and Categorical variables, interaction with 1/2/3 variable
3.9 Summary
3.10 Self assessment

Please note: This page makes use of the program xi3 which is no longer being maintained and has been from our archives. References to xi3 will be left on this page because they illustrate specific principles of coding categorical variables.

3.0 Introduction

In the previous two chapters, we have focused on regression analyses using continuous variables. However, it is possible to include categorical predictors in a regression analysis, but it requires some extra work in performing the analysis and extra work in properly interpreting the results.  This chapter will illustrate how you can use Stata for including categorical predictors in your analysis and describe how to interpret the results of such analyses. Stata has some great tools that really ease the process of including categorical variables in your regression analysis, and we will emphasize the use of these timesaving tools.

This chapter will use the elemapi2 data that you have seen in the prior chapters. We will focus on four variables api00, some_col, yr_rnd and mealcat, which takes meals and breaks it up into 3 categories. Let's have a quick look at these variables.

use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2
describe api00  some_col yr_rnd  mealcat

storage display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
api00           int    %6.0g                  api 2000
some_col        byte   %4.0f                  parent some college
yr_rnd          byte   %4.0f       yr_rnd     year round school
mealcat         byte   %18.0g      mealcat    Percentage free meals in 3
categories

The variable api00 is a measure of the performance of the schools. Below we see the codebook information for api00

codebook api00
api00 ---------------------------------------------------------------- api 2000
type:  numeric (int)

range:  [369,940]                    units:  1
unique values:  271                  coded missing:  0 / 400

mean:   647.622
std. dev:   142.249

percentiles:        10%       25%       50%       75%       90%
465.5     523.5       643     762.5       850


The variable some_col is a continuous variable that measures the percentage of the parents in the school who have attended college, and the codebook information is shown below.

codebook some_col
some_col -------------------------------------------------- parent some college
type:  numeric (byte)

range:  [0,67]                       units:  1
unique values:  49                   coded missing:  0 / 400

mean:   19.7125
std. dev:   11.3369

percentiles:        10%       25%       50%       75%       90%
2.5        12        19        28        34


The variable yr_rnd is a categorical variable that is coded 0 if the school is not year round, and 1 if year round, see below.

codebook yr_rnd
yr_rnd ------------------------------------------------------ year round school
type:  numeric (byte)
label:  yr_rnd

range:  [0,1]                        units:  1
unique values:  2                    coded missing:  0 / 400

tabulation:  Freq.   Numeric  Label
308         0  No
92         1  Yes

The variable meals is the percentage of students who are receiving state sponsored free meals and can be used as an indicator of poverty. This was broken into 3 categories (to make equally sized groups) creating the variable mealcat. The codebook information for mealcat is shown below.

codebook mealcat
mealcat ----------------------------------------------------------- (unlabeled)
type:  numeric (float)
label:  mealcat

range:  [1,3]                        units:  1
unique values:  3                    coded missing:  0 / 400

tabulation:  Freq.   Numeric  Label
131         1  0-46% free meals
132         2  47-80% free meals
137         3  81-100% free meals

3.1 Regression with a 0/1 variable

The simplest example of a categorical predictor in a regression analysis is a 0/1 variable, also called a dummy variable. Let's use the variable yr_rnd as an example of a dummy variable. We can include a dummy variable as a predictor in a regression analysis as shown below.

regress api00 yr_rnd
      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  1,   398) =  116.24
Model |  1825000.56     1  1825000.56           Prob > F      =  0.0000
Residual |  6248671.43   398  15700.1795           R-squared     =  0.2260
Total |  8073672.00   399  20234.7669           Root MSE      =  125.30

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yr_rnd |  -160.5064    14.8872   -10.78   0.000    -189.7737    -131.239
_cons |    684.539    7.13965    95.88   0.000     670.5028    698.5751
------------------------------------------------------------------------------


This may seem odd at first, but this is a legitimate analysis. But what does this mean? Let's go back to basics and write out the regression equation that this model implies.

api00 = _cons + Byr_rnd * yr_rnd

where _cons is the intercept (or constant) and we use Byr_rnd to represent the coefficient for variable yr_rnd.  Filling in the values from the regression equation, we get

api00 = 684.539 + -160.5064 * yr_rnd

If a school is not a year-round school (i.e. yr_rnd is 0) the regression equation would simplify to

api00 = constant    + 0 * Byr_rnd
api00 = 684.539     + 0 * -160.5064
api00 = 684.539

If a school is a year-round school, the regression equation would simplify to

api00 = constant + 1 * Byr_rnd
api00 = 684.539  + 1 * -160.5064
api00 = 524.0326

We can graph the observed values and the predicted values using the scatter command as shown below. Although yr_rnd only has 2 values, we can still draw a regression line showing the relationship between yr_rnd and api00.  Based on the results above, we see that the predicted value for non-year round schools is 684.539 and the predicted value for the year round schools is 524.032, and the slope of the line is negative, which makes sense since the coefficient for yr_rnd was negative (-160.5064).

twoway (scatter api00 yr_rnd) (lfit api00 yr_rnd)

Let's compare these predicted values to the mean api00 scores for the year-round and non-year-round schools.

tabulate yr_rnd, sum(api00)
 year round |         Summary of api 2000
school |        Mean   Std. Dev.       Freq.
------------+------------------------------------
No |   684.53896   132.11253         308
Yes |   524.03261   98.916043          92
------------+------------------------------------
Total |    647.6225   142.24896         400


As you see, the regression equation predicts that the value of api00 will be the mean value, depending on whether a school is a year round school or non-year round school.

Let's relate these predicted values back to the regression equation. For the non-year-round schools, their mean is the same as the intercept (684.539). The coefficient for yr_rnd is the amount we need to add to get the mean for the year-round schools, i.e., we need to add -160.5064 to get 524.0326, the mean for the non year-round schools. In other words, Byr_rnd is the mean api00 score for the year-round schools minus the mean api00 score for the non year-round schools, i.e., mean(year-round) - mean(non year-round).

It may be surprising to note that this regression analysis with a single dummy variable is the same as doing a t-test comparing the mean api00 for the year-round schools with the non year-round schools (see below). You can see that the t value below is the same as the t value for yr_rnd in the regression above. This is because Byr_rnd compares the year-rounds and non year-rounds (since the coefficient is mean(year round)-mean(non year-round)).

ttest api00, by(yr_rnd)

Two-sample t test with equal variances

------------------------------------------------------------------------------
Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
No |     308     684.539     7.52781    132.1125    669.7263    699.3516
Yes |      92    524.0326    10.31271    98.91604    503.5477    544.5175
---------+--------------------------------------------------------------------
combined |     400    647.6225    7.112448     142.249    633.6399    661.6051
---------+--------------------------------------------------------------------
diff |            160.5064     14.8872                 131.239    189.7737
------------------------------------------------------------------------------
Degrees of freedom: 398

Ho: mean(No) - mean(Yes) = diff = 0

Ha: diff < 0               Ha: diff ~= 0              Ha: diff > 0
t =  10.7815                t =  10.7815              t =  10.7815
P < t =   1.0000          P > |t| =   0.0000          P > t =   0.0000

Since a t-test is the same as doing an anova, we can get the same results using the anova command as well.

anova api00 yr_rnd

Number of obs =     400     R-squared     =  0.2260
Root MSE      =  125.30     Adj R-squared =  0.2241

Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
Model |  1825000.56     1  1825000.56     116.24     0.0000
|
yr_rnd |  1825000.56     1  1825000.56     116.24     0.0000
|
Residual |  6248671.43   398  15700.1795
-----------+----------------------------------------------------
Total |  8073672.00   399  20234.7669   

If we square the t-value from the t-test, we get the same value as the F-value from the anova.

di 10.7815^2

116.24074


3.2 Regression with a 1/2 variable

A categorical predictor variable does not have to be coded 0/1 to be used in a regression model. It is easier to understand and interpret the results from a model with dummy variables, but the results from a variable coded 1/2 yield essentially the same results.

Lets make a copy of the variable yr_rnd called yr_rnd2 that is coded 1/2, 1=non year-round and 2=year-round.

generate yr_rnd2=yr_rnd
recode yr_rnd2 0=1 1=2

(400 changes made)

Let's perform a regression predicting api00 from yr_rnd2.

regress api00 yr_rnd2

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  1,   398) =  116.24
Model |  1825000.56     1  1825000.56           Prob > F      =  0.0000
Residual |  6248671.43   398  15700.1795           R-squared     =  0.2260
Total |  8073672.00   399  20234.7669           Root MSE      =  125.30

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yr_rnd2 |  -160.5064    14.8872   -10.78   0.000    -189.7737    -131.239
_cons |   845.0453   19.35336    43.66   0.000     806.9977    883.0929
------------------------------------------------------------------------------


Note that the coefficient for yr_rnd is the same as yr_rnd2. So, you can see that if you code yr_rnd as 0/1 or as 1/2, the regression coefficient works out to be the same. However the intercept (_cons) is a bit less intuitive. When we used yr_rnd, the intercept was the mean for the non year-rounds. When using yr_rnd2, the intercept is the mean for the non year-rounds minus Byr_rnd2, i.e., 684.539 - (-160.506) = 845.045

Note that you can use 0/1 or 1/2 coding and the results for the coefficient come out the same, but the interpretation of the constant in the regression equation is different. It is often easier to interpret the estimates for 0/1 coding.

In summary, these results indicate that the api00 scores are significantly different for the schools depending on the type of school, year round school vs. non-year round school. Non year-round schools have significantly higher API scores than year-round schools. Based on the regression results, non year- round schools have scores that are 160.5 points higher than year- round schools.

3.3 Regression with a 1/2/3 variable

3.3.1 Manually Creating Dummy Variables

Say, that we would like to examine the relationship between the amount of poverty and api scores. We don't have a measure of poverty, but we can use mealcat as a proxy for a measure of poverty. Below we repeat the codebook info for mealcat showing the values for the three categories.

codebook mealcat

mealcat ----------------------------------------------------------- (unlabeled)
type:  numeric (float)
label:  mealcat

range:  [1,3]                        units:  1
unique values:  3                    coded missing:  0 / 400

tabulation:  Freq.   Numeric  Label
131         1  0-46% free meals
132         2  47-80% free meals
137         3  81-100% free meals

You might be tempted to try including mealcat in a regression like this.

regress api00 mealcat

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  1,   398) = 1207.74
Model |  6072527.52     1  6072527.52           Prob > F      =  0.0000
Residual |  2001144.48   398   5028.0012           R-squared     =  0.7521
Total |  8073672.00   399  20234.7669           Root MSE      =  70.908

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mealcat |  -150.5533   4.332147   -34.75   0.000    -159.0701   -142.0365
_cons |   950.9874   9.421798   100.93   0.000     932.4647    969.5101
------------------------------------------------------------------------------


But this is looking at the linear effect of mealcat with api00, but mealcat is not an interval variable. Instead, you will want to code the variable so that all the information concerning the three levels is accounted for. You can dummy code mealcat like this.

tabulate mealcat, gen(mealcat)

mealcat |      Freq.     Percent        Cum.
-------------------+-----------------------------------
0-46% free meals |        131       32.75       32.75
47-80% free meals |        132       33.00       65.75
81-100% free meals |        137       34.25      100.00
-------------------+-----------------------------------
Total |        400      100.00

We now have created mealcat1 that is 1 if mealcat is 1, and 0 otherwise. Likewise, mealcat2 is 1 if mealcat is 2, and 0 otherwise and likewise mealcat3 was created. We can see this below.

list mealcat mealcat1 mealcat2 mealcat3 in 1/10, nolabel

mealcat  mealcat1  mealcat2  mealcat3
1.                  1         1         0         0
2.                  2         0         1         0
3.                  3         0         0         1
4.                  1         1         0         0
5.                  1         1         0         0
6.                  1         1         0         0
7.                  1         1         0         0
8.                  1         1         0         0
9.                  1         1         0         0
10.                  1         1         0         0  

We can now use two of these dummy variables (mealcat2 and mealcat3) in the regression analysis.

regress api00 mealcat2 mealcat3

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  611.12
Model |  6094197.67     2  3047098.83           Prob > F      =  0.0000
Residual |  1979474.33   397  4986.08143           R-squared     =  0.7548
Total |  8073672.00   399  20234.7669           Root MSE      =  70.612

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mealcat2 |  -166.3236   8.708331   -19.10   0.000    -183.4438   -149.2034
mealcat3 |   -301.338   8.628815   -34.92   0.000    -318.3019   -284.3741
_cons |   805.7176   6.169416   130.60   0.000     793.5887    817.8464
------------------------------------------------------------------------------

We can test the overall differences among the three groups by using the test command as shown below. This shows that the overall differences among the three groups are significant.

test mealcat2 mealcat3

( 1)  mealcat2 = 0.0
( 2)  mealcat3 = 0.0

F(  2,   397) =  611.12
Prob > F =    0.0000

The interpretation of the coefficients is much like that for the binary variables. Group 1 is the omitted group, so _cons is the mean for group 1. The coefficient for mealcat2 is the mean for group 2 minus the mean of the omitted group (group 1). And the coefficient for mealcat3 is the mean of group 3 minus the mean of group 1. You can verify this by comparing the coefficients with the means of the groups.

tabulate mealcat, summarize(api00)

|         Summary of api 2000
mealcat |        Mean   Std. Dev.       Freq.
------------+------------------------------------
0-46% fre |   805.71756   65.668664         131
47-80% fr |   639.39394    82.13513         132
81-100% f |   504.37956   62.727015         137
------------+------------------------------------
Total |    647.6225   142.24896         400


Based on these results, we can say that the three groups differ in their api00 scores, and that in particular group2 is significantly different from group1 (because mealcat2 was significant) and group 3 is significantly different from group 1 (because mealcat3 was significant).

3.3.2 Using the xi command

We can use the xi command to do the work for us to create the indicator variables and run the regression all in one command, as shown below.

xi : regress api00 i.mealcat

i.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_1 omitted)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  611.12
Model |  6094197.67     2  3047098.83           Prob > F      =  0.0000
Residual |  1979474.33   397  4986.08143           R-squared     =  0.7548
Total |  8073672.00   399  20234.7669           Root MSE      =  70.612

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_2 |  -166.3236   8.708331   -19.10   0.000    -183.4438   -149.2034
_Imealcat_3 |   -301.338   8.628815   -34.92   0.000    -318.3019   -284.3741
_cons |   805.7176   6.169416   130.60   0.000     793.5887    817.8464
------------------------------------------------------------------------------

When we use xi and include the term i.mealcat in the model, Stata creates the variables _Imealcat_2 and _Imealcat_3 that are dummy variables just like mealcat2 and mealcat3 that we created before. There really is no difference between mealcat2 and _Imealcat_2.

As you can see, the results are the same as in the prior analysis. If we want to test the overall effect of mealcat we use the test command as shown below, which also gives us the same results as we found using the dummy variables mealcat2 and mealcat3.

test _Imealcat_2 _Imealcat_3

( 1)  _Imealcat_2 = 0.0
( 2)  _Imealcat_3 = 0.0

F(  2,   397) =  611.12
Prob > F =    0.0000

Note that if you are doing this in Stata version 6 the variables would be named Imealc_2 and Imealc_3 instead of _Imealcat_2 and _Imealcat_3. One of the improvements in Stata 7 is that variable names can be longer than 8 characters, so the names of the variables created by the xi command are easier to understand than in version 6. From this point forward, we will use the variable names that would be created in version 7.

What if we wanted a different group to be the reference group? If we create dummy variables via tabulate , generate() then we can easily choose which variable will be the omitted group, for example, let's omit group 3.

regress api00 mealcat1 mealcat2

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  611.12
Model |  6094197.67     2  3047098.83           Prob > F      =  0.0000
Residual |  1979474.33   397  4986.08143           R-squared     =  0.7548
Total |  8073672.00   399  20234.7669           Root MSE      =  70.612

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
mealcat1 |    301.338   8.628815    34.92   0.000     284.3741    318.3019
mealcat2 |   135.0144    8.61209    15.68   0.000     118.0834    151.9454
_cons |   504.3796   6.032807    83.61   0.000     492.5193    516.2398
------------------------------------------------------------------------------

With group 3 omitted, the constant is now the mean of group 3 and mealcat1 is group1-group3 and mealcat2 is group2-group3. We see that both of these coefficients are significant, indicating that group 1 is significantly different from group 3 and group 2 is significantly different from group 3.

When we use the xi command, how can we choose which group is the omitted group? By default, the first group is omitted, but say we want group 3 to be omitted. We can use the char command as shown below to tell Stata that we want the third group to be the omitted group for the variable mealcat.

char mealcat[omit] 3

Then, when we use the xi command using mealcat the mealcat=3 group will be omitted. If you save the data file, Stata will remember this for future Stata sessions.

xi : regress api00 i.mealcat

i.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_3 omitted)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  611.12
Model |  6094197.67     2  3047098.83           Prob > F      =  0.0000
Residual |  1979474.33   397  4986.08143           R-squared     =  0.7548
Total |  8073672.00   399  20234.7669           Root MSE      =  70.612

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_1 |    301.338   8.628815    34.92   0.000     284.3741    318.3019
_Imealcat_2 |   135.0144    8.61209    15.68   0.000     118.0834    151.9454
_cons |   504.3796   6.032807    83.61   0.000     492.5193    516.2398
------------------------------------------------------------------------------

You can compare and see that these results are identical to those found using mealcat1 and mealcat2 as predictors.

3.3.3 Using the anova command

We can also do this analysis using the anova command. The benefit of the anova command is that it gives us the test of the overall effect of mealcat without needing to subsequently use the test command as we did with the regress command.

anova api00 mealcat

Number of obs =     400     R-squared     =  0.7548
Root MSE      = 70.6122     Adj R-squared =  0.7536

Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
Model |  6094197.67     2  3047098.83     611.12     0.0000
|
mealcat |  6094197.67     2  3047098.83     611.12     0.0000
|
Residual |  1979474.33   397  4986.08143
-----------+----------------------------------------------------
Total |  8073672.00   399  20234.7669   

We can see the anova test of the effect of mealcat is the same as the test command from the regress command.

We can even follow this with the anova, regress command and compare the parameter estimates with those we performed previously.

anova, regress

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  611.12
Model |  6094197.67     2  3047098.83           Prob > F      =  0.0000
Residual |  1979474.33   397  4986.08143           R-squared     =  0.7548
Total |  8073672.00   399  20234.7669           Root MSE      =  70.612

------------------------------------------------------------------------------
api00        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------------------------------------------------------
_cons            504.3796   6.032807    83.61   0.000     492.5193    516.2398
mealcat
1      301.338   8.628815    34.92   0.000     284.3741    318.3019
2     135.0144    8.61209    15.68   0.000     118.0834    151.9454
3    (dropped)
------------------------------------------------------------------------------

Note: the parameter estimates are the same because mealcat is coded the same way in the regress command and in the anova command, in both cases the last category (category 3) being dropped. While you can control which category is the omitted category when you use the regress command, the anova, regress command always drops the last category.

3.3.4 Other coding schemes

It is generally very convenient to use dummy coding but that is not the only kind of coding that can be used. As you have seen, when you use dummy coding one of the groups becomes the reference group and all of the other groups are compared to that group. This may not be the most interesting set of comparisons.

Say you want to compare group 1 with groups 2 and 3, and for a second comparison compare group 2 with group 3. You need to generate a coding scheme that forms these 2 comparisons. We will illustrate this using a Stata program, xi3, (an enhanced version of xi) that will create the variables you would need for such comparisons (as well as a variety of other common comparisons).

The comparisons that we have described (comparing group 1 with 2 and 3, and then comparing groups 2 and 3) correspond to Helmert comparisons (see Chapter 5 for more details). We use the h. prefix (instead of the i. prefix) to indicate that we desire Helmert comparisons on the variable mealcat. Otherwise, you see that xi3 works much like the xi command.

xi3: regress api00 h.mealcat

h.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_3 omitted)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  611.12
Model |  6094197.67     2  3047098.83           Prob > F      =  0.0000
Residual |  1979474.33   397  4986.08143           R-squared     =  0.7548
Total |  8073672.00   399  20234.7669           Root MSE      =  70.612

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_1 |   233.8308   7.523544    31.08   0.000     219.0398    248.6218
_Imealcat_2 |   135.0144    8.61209    15.68   0.000     118.0834    151.9454
_cons |   649.8304   3.531285   184.02   0.000      642.888    656.7727
----------------------------------------------------------------------------

If you compare the parameter estimates with the means (see below) you can verify that the coefficient for _Imealcat_1 is the mean of group 1 minus the mean of groups 2 and 3 (805.71756 - (639.39394 + 504.37956) / 2 = 233.83081) and the coefficient for _Imealcat_2 is the mean of group 2 minus group 3 (639.39 - 504.37 = 135.01).  Both of these comparisons are significant, indicating that group 1 differs significantly from groups 2 and 3 combined, and group 2 differs significantly from group 3.

tabulate mealcat, sum(api00)

|         Summary of api 2000
mealcat |        Mean   Std. Dev.       Freq.
------------+------------------------------------
0-46% fre |   805.71756   65.668664         131
47-80% fr |   639.39394    82.13513         132
81-100% f |   504.37956   62.727015         137
------------+------------------------------------
Total |    647.6225   142.24896         400

And the value of _cons is the unweighted average of the means of the 3 groups.

display (805.71756 +639.39394 +504.37956)/3
649.83035

Using the coding scheme provided by xi3, we were able to form perhaps more interesting tests than those provided by dummy coding.  The xi3 program can create variables according to other coding schemes, as well as custom coding schemes that you create, see help xi3 and Chapter 5 for more information.

3.4 Regression with two categorical predictors

3.4.1 Using the xi: command

Previously we looked at using yr_rnd to predict api00

regress api00 yr_rnd
      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  1,   398) =  116.24
Model |  1825000.56     1  1825000.56           Prob > F      =  0.0000
Residual |  6248671.43   398  15700.1795           R-squared     =  0.2260
Total |  8073672.00   399  20234.7669           Root MSE      =  125.30

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yr_rnd |  -160.5064    14.8872   -10.78   0.000    -189.7737    -131.239
_cons |    684.539    7.13965    95.88   0.000     670.5028    698.5751
------------------------------------------------------------------------------


And we have also looked at mealcat using the xi command

xi : regress api00 i.mealcat

i.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_3 omitted)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =  611.12
Model |  6094197.67     2  3047098.83           Prob > F      =  0.0000
Residual |  1979474.33   397  4986.08143           R-squared     =  0.7548
Total |  8073672.00   399  20234.7669           Root MSE      =  70.612

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_1 |    301.338   8.628815    34.92   0.000     284.3741    318.3019
_Imealcat_2 |   135.0144    8.61209    15.68   0.000     118.0834    151.9454
_cons |   504.3796   6.032807    83.61   0.000     492.5193    516.2398
------------------------------------------------------------------------------

We can include both yr_rnd and mealcat together in the same model.

xi : regress api00 i.mealcat yr_rnd

i.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_3 omitted)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =  435.02
Model |  6194144.30     3  2064714.77           Prob > F      =  0.0000
Residual |  1879527.69   396  4746.28206           R-squared     =  0.7672
Total |  8073672.00   399  20234.7669           Root MSE      =  68.893

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_1 |   281.6832   9.445676    29.82   0.000     263.1132    300.2531
_Imealcat_2 |   117.9458   9.188911    12.84   0.000     99.88066     136.011
yr_rnd |  -42.96006   9.361761    -4.59   0.000    -61.36502   -24.55509
_cons |     526.33   7.584533    69.40   0.000      511.419    541.2409
------------------------------------------------------------------------------

We can test the overall effect of mealcat with the test command, which is significant.

test  _Imealcat_1 _Imealcat_2

( 1)  _Imealcat_1 = 0.0
( 2)  _Imealcat_2 = 0.0

F(  2,   396) =  460.27
Prob > F =    0.0000

Because this model has only main effects (no interactions) you can interpret Byr_rnd as the difference between the year round and non-year round group. The coefficient for I_mealcat_1 (which we will call B_Imealcat_1) is the difference between mealcat=1 and mealcat=3, and B_Imealcat_2 as the difference between mealcat=2 and mealcat=3.

Let's dig below the surface and see how the coefficients relate to the predicted values. Let's view the cells formed by crossing yr_rnd and mealcat and number the cells from cell1 to cell6.

           mealcat=1     mealcat=2      mealcat=3
yr_rnd=0  cell1         cell2          cell3
yr_rnd=1  cell4         cell5          cell6

With respect to mealcat, the group mealcat=3 is the reference category, and with respect to yr_rnd the group yr_rnd=0 is the reference category. As a result, cell3 is the reference cell. The constant is the predicted value for this cell.

The coefficient for yr_rnd is the difference between cell3 and cell6. Since this model has only main effects, it is also the difference between cell2 and cell5, or from cell1 and cell4. In other words, Byr_rnd is the amount you add to the predicted value when you go from non-year round to year round schools.

The coefficient for _Imealcat_1 is the predicted difference between cell1 and cell3. Since this model only has main effects, it is also the predicted difference between cell4 and cell6. Likewise, B_Imealcat_2 is the predicted difference between cell2 and cell3, and also the predicted difference between cell5 and cell6.

So, the predicted values, in terms of the coefficients, would be

           mealcat=1         mealcat=2         mealcat=3
-----------------------------------------------
yr_rnd=0  _cons             _cons             _cons
+BImealcat1       +BImealcat2
-----------------------------------------------
yr_rnd=1  _cons             _cons             _cons
+Byr_rnd          +Byr_rnd          +Byr_rnd
+BImealcat1       +BImealcat2

We should note that if you computed the predicted values for each cell, they would not exactly match the means in the 6 cells.  The predicted means would be close to the observed means in the cells, but not exactly the same.  This is because our model only has main effects and assumes that the difference between cell1 and cell4 is exactly the same as the difference between cells 2 and 5 which is the same as the difference between cells 3 and 6.  Since the observed values don't follow this pattern, there is some discrepancy between the predicted means and observed means.

3.4.2 Using the anova command

We can run the same analysis using the anova command with just main effects

anova api00 yr_rnd mealcat

Number of obs =     400     R-squared     =  0.7672
Root MSE      = 68.8933     Adj R-squared =  0.7654

Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
Model |  6194144.30     3  2064714.77     435.02     0.0000
|
yr_rnd |  99946.6332     1  99946.6332      21.06     0.0000
mealcat |  4369143.74     2  2184571.87     460.27     0.0000
|
Residual |  1879527.69   396  4746.28206
-----------+----------------------------------------------------
Total |  8073672.00   399  20234.7669   

Note that we get the same information that we do from the xi : regress command, followed by the test command. The anova command automatically provides the information provided by the test command. If we like, we can also request the parameter estimates later just by doing this.

anova, regress

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =  435.02
Model |  6194144.30     3  2064714.77           Prob > F      =  0.0000
Residual |  1879527.69   396  4746.28206           R-squared     =  0.7672
Total |  8073672.00   399  20234.7669           Root MSE      =  68.893

------------------------------------------------------------------------------
api00        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------------------------------------------------------
_cons            483.3699    7.45694    64.82   0.000     468.7098      498.03
yr_rnd
1     42.96006   9.361761     4.59   0.000     24.55509    61.36502
2    (dropped)
mealcat
1     281.6832   9.445676    29.82   0.000     263.1132    300.2531
2     117.9458   9.188911    12.84   0.000     99.88066     136.011
3    (dropped)
------------------------------------------------------------------------------

anova will display the parameter estimates from the last anova model. However, the anova command is rigid in its determination of which group will be the omitted group and the last group is dropped.  Since this differs from the coding we used in the regression commands above, the parameter estimates from this anova command will differ from the regress command above.

In summary, these results indicate the differences between year round and non-year round schools is significant, and the differences among the three mealcat groups are significant.

3.5 Categorical predictor with interactions

3.5.1 using xi

Let's perform the same analysis that we performed above, this time let's include the interaction of mealcat by yr_rnd. When using xi, it is easy to include an interaction term, as shown below.

xi : regress api00 i.mealcat*yr_rnd

i.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_3 omitted)
i.meal~t*yr_rnd   _ImeaXyr_rn_#       (coded as above)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  5,   394) =  261.61
Model |  6204727.82     5  1240945.56           Prob > F      =  0.0000
Residual |  1868944.18   394  4743.51314           R-squared     =  0.7685
Total |  8073672.00   399  20234.7669           Root MSE      =  68.873

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_1 |   288.1929   10.44284    27.60   0.000     267.6623    308.7236
_Imealcat_2 |    123.781   10.55185    11.73   0.000      103.036    144.5259
yr_rnd |  -33.49254   11.77129    -2.85   0.005    -56.63492   -10.35015
_ImeaXyr_r~1 |  -40.76438   29.23118    -1.39   0.164    -98.23297    16.70422
_ImeaXyr_r~2 |  -18.24763   22.25624    -0.82   0.413    -62.00347     25.5082
_cons |   521.4925   8.414197    61.98   0.000     504.9502    538.0349
------------------------------------------------------------------------------


We can test the overall interaction with the test command. This interaction effect is not significant.

test  _ImeaXyr_rn_1 _ImeaXyr_rn_2

( 1)  _ImeaXyr_rn_1 = 0.0
( 2)  _ImeaXyr_rn_2 = 0.0

F(  2,   394) =    1.12
Prob > F =    0.3288

It is important to note how the meaning of the coefficients change in the presence of these interaction terms. For example, in the prior model, with only main effects, we could interpret Byr_rnd as the difference between the year round and non year round schools. However, now that we have added the interaction term, the term Byr_rnd represents the difference between cell3 and cell6, or the difference between the year round and non-year round schools when mealcat=3 (because mealcat=3 was the omitted group). The presence of an interaction would imply that the difference between year round and non-year round schools depends on the level of mealcat. The interaction terms B_ImeaXyr_rn_1 and B_ImeaXyr_rn_2 represent the extent to which the difference between the year round/non year round schools changes when mealcat=1 and when mealcat=2 (as compared to the reference group, mealcat=3). For example the term B_ImeaXyr_rn_1 represents the difference between year round and non-year round for mealcat=1 vs. the difference for mealcat=3. In other words, B_ImeaXyr_rn_1 in this design is (cell1-cell4) - (cell3-cell6), or it represents how much the effect of yr_rnd differs between mealcat=1 and mealcat=3.

Below we have shown the predicted values for the six cells in terms of the coefficients in the model.  If you compare this to the main effects model, you will see that the predicted values are the same except for the addition of _ImeaXyr_rn_1 (in cell 4) and _ImeaXyr_rn_2 (in cell 5).

           mealcat=1           mealcat=2         mealcat=3
-------------------------------------------------
yr_rnd=0  _cons               _cons             _cons
+BImealcat1         +BImealcat2
-------------------------------------------------
yr_rnd=1  _cons               _cons             _cons
+Byr_rnd            +Byr_rnd          +Byr_rnd
+BImealcat1         +BImealcat2
+B_ImeaXyr_rn_1     +B_ImeaXyr_rn_2 

It can be very tricky to interpret these interaction terms if you wish to form specific comparisons. For example, if you wanted to perform a test of the simple main effect of yr_rnd when mealcat=1, i.e., comparing cell1 with cell4, you would want to compare _cons+ BImealcat1 vs. _cons + B yr_rnd + BImealcat1+ BImeaXyr_rn_1 and since _cons and Imealcat1 would drop out, we would test

test _b[yr_rnd] +_b[_ImeaXyr_rn_1] ==0

( 1)  yr_rnd + _ImeaXyr_rn_1 = 0.0

F(  1,   394) =    7.70
Prob > F =    0.0058

This test is significant, indicating that the effect of yr_rnd is significant for the mealcat = 1 group.

As we will see, such tests can be more easily done via anova.

3.5.2 Using anova

Constructing these interactions can be somewhat easier when using the anova command.  As you see below, the anova command gives us the test of the overall main effects and interactions without the need to perform subsequent test commands.

anova api00 yr_rnd mealcat yr_rnd*mealcat

Number of obs =     400     R-squared     =  0.7685
Root MSE      = 68.8732     Adj R-squared =  0.7656

Source |  Partial SS    df       MS           F     Prob > F
---------------+----------------------------------------------------
Model |  6204727.82     5  1240945.56     261.61     0.0000
|
yr_rnd |  99617.3706     1  99617.3706      21.00     0.0000
mealcat |  1796232.80     2  898116.399     189.34     0.0000
yr_rnd*mealcat |  10583.5187     2  5291.75936       1.12     0.3288
|
Residual |  1868944.18   394  4743.51314
---------------+----------------------------------------------------
Total |  8073672.00   399  20234.7669


It is easy to perform tests of simple main effects using the sme command. You can download sme from within Stata by typing findit sme (see How can I used the findit command to search for programs and get additional help? for more information about using findit).

Now we can test the simple main effects of yr_rnd at each level of mealcat.

sme yr_rnd mealcat

Test of yr_rnd at mealcat(1): F(1/394)  = 7.7023296
Test of yr_rnd at mealcat(2): F(1/394)  = 7.5034121
Test of yr_rnd at mealcat(3): F(1/394)  = 8.0955856

Critical value of F for alpha = .05 using ...
--------------------------------------------------
Dunn's procedure              = 4.7435944
Marascuilo & Levin            = 5.4561926
per family error rate         = 5.7804
simultaneous test procedure   = 8.1680324

The results from sme show us the effect of yr_rnd at each of the 3 levels of mealcat. We can see that the comparison for mealcat = 1 matches those we computed above using the test statement, however, it was much easier and less error prone using the sme command.

Although this section has focused on how to handle analyses involving interactions, these particular results show no indication of interaction. We could decide to omit interaction terms from future analyses having found the interactions to be non-significant. This would simplify future analyses, however including the interaction term can be useful to assure readers that the interaction term is non-significant.

3.6 Continuous and Categorical variables

3.6.1 Using regress

Say that we wish to analyze both continuous and categorical variables in one analysis. For example, let's include yr_rnd and some_col in the same analysis.

regress api00 yr_rnd some_col

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  2,   397) =   68.54
Model |  2072201.84     2  1036100.92           Prob > F      =  0.0000
Residual |  6001470.16   397  15117.0533           R-squared     =  0.2567
Total |  8073672.00   399  20234.7669           Root MSE      =  122.95

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
yr_rnd |  -149.1591   14.87519   -10.03   0.000    -178.4031   -119.9151
some_col |   2.235689   .5528656     4.04   0.000     1.148779    3.322599
_cons |   637.8581   13.50332    47.24   0.000     611.3111     664.405
------------------------------------------------------------------------------

We can create the predicted values using the predict command.

predict yhat

(option xb assumed; fitted values)

Let's graph the predicted values by some_col.

scatter yhat some_col



The coefficient for some_col indicates that for every unit increase in some_col the api00 score is predicted to increase by 2.23 units. This is the slope of the lines shown in the above graph. The graph has two lines, one for the year round schools and one for the non-year round schools. The coefficient for yr_rnd is -149.16, indicating that as yr_rnd increases by 1 unit, the api00 score is expected to decrease by about 149 units. As you can see in the graph, the top line is about 150 units higher than the lower line. You can see that the intercept is 637 and that is where the upper line crosses the Y axis when X is 0. The lower line crosses the line about 150 units lower at about 487.

3.6.2 Using anova

We can run this analysis using the anova command. The anova command assumes that the variables are categorical, thus, we need to use the continuous() option (which can be abbreviated as cont()) to specify that some_col is a continuous variable.

anova api00 yr_rnd some_col, cont(some_col)

Number of obs =     400     R-squared     =  0.2567
Root MSE      = 122.951     Adj R-squared =  0.2529

Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
Model |  2072201.84     2  1036100.92      68.54     0.0000
|
yr_rnd |  1519992.67     1  1519992.67     100.55     0.0000
some_col |  247201.276     1  247201.276      16.35     0.0001
|
Residual |  6001470.16   397  15117.0533
-----------+----------------------------------------------------
Total |  8073672.00   399  20234.7669   

If we square the t-values from the regress command (above), we would find that they match those of the anova command.

3.7 Interactions of Continuous by 0/1 Categorical variables

Above we showed an analysis that looked at the relationship between some_col and api00 and also included yr_rnd.  We saw that this produced a graph where we saw the relationship between some_col and api00 but there were two regression lines, one higher than the other but with equal slope.  Such a model assumed that the slope was the same for the two groups.  Perhaps the slope might be different for these groups.  Let's run the regressions separately for these two groups beginning with the non-year round schools.

regress api00 some_col if yr_rnd==0

Source |       SS       df       MS              Number of obs =     308
-------------+------------------------------           F(  1,   306) =    4.91
Model |  84700.8576     1  84700.8576           Prob > F      =  0.0274
Residual |  5273591.67   306  17233.9597           R-squared     =  0.0158
Total |  5358292.53   307  17453.7216           Root MSE      =  131.28

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
some_col |   1.409427   .6357572     2.22   0.027     .1584181    2.660436
_cons |   655.1103   15.23704    42.99   0.000     625.1277    685.0929
------------------------------------------------------------------------------

predict yhat0 if yr_rnd==0

(option xb assumed; fitted values)
(92 missing values generated)

scatter yhat0 api00 some_col if yr_rnd==0, connect(l i) msymbol(i o) sort



Likewise, let's look at the year round schools.

regress api00 some_col if yr_rnd==1

Source |       SS       df       MS              Number of obs =      92
-------------+------------------------------           F(  1,    90) =   65.08
Model |  373644.064     1  373644.064           Prob > F      =  0.0000
Residual |  516734.838    90   5741.4982           R-squared     =  0.4196
Total |  890378.902    91  9784.38354           Root MSE      =  75.773

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
some_col |   7.402618   .9176327     8.07   0.000      5.57958    9.225655
_cons |   407.0391   16.51462    24.65   0.000     374.2299    439.8482
------------------------------------------------------------------------------

predict yhat1 if yr_rnd==1

(option xb assumed; fitted values)
(308 missing values generated)
scatter yhat1 api00 some_col if yr_rnd==1, connect(l i) msymbol(i o) sort



Note that the slope of the regression line looks much steeper for the year round schools than for the non-year round schools. This is confirmed by the regression equations that show the slope for the year round schools to be higher (7.4) than non-year round schools (1.3). We can compare these to see if these are significantly different from each other by including the interaction of some_col by yr_rnd, an interaction of a continuous variable by a categorical variable.

3.7.1 Computing interactions manually

We will start by manually computing the interaction of some_col by yr_rnd. Let's start fresh and use the elemapi2 data file using the , clear option to clear out any variables we have previously created.

use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear

Next, let's make a variable that is the interaction of some college (some_col) and year round schools (yr_rnd) called yrXsome.

gen yrXsome = yr_rnd*some_col

We can now run the regression that tests whether the coefficient for some_col is significantly different for year round schools and non-year round schools. Indeed, the yrXsome interaction effect is significant.

regress api00 some_col yr_rnd yrXsome

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =   52.05
Model |  2283345.48     3  761115.162           Prob > F      =  0.0000
Residual |  5790326.51   396  14622.0366           R-squared     =  0.2828
Total |  8073672.00   399  20234.7669           Root MSE      =  120.92

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
some_col |   1.409427   .5856022     2.41   0.017     .2581494    2.560705
yr_rnd |  -248.0712   29.85895    -8.31   0.000    -306.7731   -189.3694
yrXsome |    5.99319    1.57715     3.80   0.000     2.892557    9.093824
_cons |   655.1103   14.03499    46.68   0.000     627.5179    682.7027
------------------------------------------------------------------------------

We can make a graph showing the regression lines for the two types of schools showing how different their regression lines are. We first create the predicted value, we call it yhata.

predict yhata

(option xb assumed; fitted values)

Then, we create separate variables for the two types of schools which will be called yhata0 for non-year round schools and yhata1 for year round schools.

separate yhata, by(yr_rnd)

storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
yhata0          float  %9.0g                  yhata, yr_rnd == 0
yhata1          float  %9.0g                  yhata, yr_rnd == 1

We can then graph the predicted values for the two types of schools by some_col. You can see how the two lines have quite different slopes, consistent with the fact that the yrXsome interaction was significant.  The c(ll[_])  option indicates that yhata0 should be connected with a line, and yhata1 should be connected with dashed lines (because we included [_] after the l ).  If we had used l[.] it would have made a dotted line.  The options to make dashed and dotted lines are new to Stata 7 and you can find more information via help grsym

line yhata0 yhata1 some_col, sort


We can replot the same graph including the data points.

twoway (line yhata0 yhata1 some_col, sort) (scatter api00 some_col)

The graph above used the same kind of dots for the data points for both types of schools. Let's make separate variables for the api00 scores for the two types of schools called api000 for the non-year round schools and api001 for the year round schools.

separate api00, by(yr_rnd)

storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
api000          int    %6.0g                  api00, yr_rnd == 0
api001          int    %6.0g                  api00, yr_rnd == 1


We can then make the same graph as above except show the points differently for the two types of schools.  Below we use small circles for the non-year round schools, and triangles for the year round schools.

twoway (line yhata0 yhata1 some_col, sort) (scatter api000 api001 some_col)

Let's quickly run the regressions again where we performed separate regressions for the two groups

Non-year round

regress api00 some_col if yr_rnd==0

Source |       SS       df       MS              Number of obs =     308
-------------+------------------------------           F(  1,   306) =    4.91
Model |  84700.8576     1  84700.8576           Prob > F      =  0.0274
Residual |  5273591.67   306  17233.9597           R-squared     =  0.0158
Total |  5358292.53   307  17453.7216           Root MSE      =  131.28

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
some_col |   1.409427   .6357572     2.22   0.027     .1584181    2.660436
_cons |   655.1103   15.23704    42.99   0.000     625.1277    685.0929
------------------------------------------------------------------------------


Year round

regress api00 some_col if yr_rnd==1

Source |       SS       df       MS              Number of obs =      92
-------------+------------------------------           F(  1,    90) =   65.08
Model |  373644.064     1  373644.064           Prob > F      =  0.0000
Residual |  516734.838    90   5741.4982           R-squared     =  0.4196
Total |  890378.902    91  9784.38354           Root MSE      =  75.773

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
some_col |   7.402618   .9176327     8.07   0.000      5.57958    9.225655
_cons |   407.0391   16.51462    24.65   0.000     374.2299    439.8482
------------------------------------------------------------------------------

Now, let's show the regression for both types of schools with the interaction term.

regress api00 some_col yr_rnd yrXsome

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =   52.05
Model |  2283345.48     3  761115.162           Prob > F      =  0.0000
Residual |  5790326.51   396  14622.0366           R-squared     =  0.2828
Total |  8073672.00   399  20234.7669           Root MSE      =  120.92

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
some_col |   1.409427   .5856022     2.41   0.017     .2581494    2.560705
yr_rnd |  -248.0712   29.85895    -8.31   0.000    -306.7731   -189.3694
yrXsome |    5.99319    1.57715     3.80   0.000     2.892557    9.093824
_cons |   655.1103   14.03499    46.68   0.000     627.5179    682.7027
------------------------------------------------------------------------------

Note that the coefficient for some_col in the combined analysis is the same as the coefficient for some_col for the non-year round schools? This is because non-year round schools are the reference group.  Then, the coefficient for the yrXsome interaction in the combined analysis is the Bsome_col for the year round schools (7.4) minus Bsome_col for the non year round schools (1.41) yielding 5.99. This interaction is the difference in the slopes of some_col for the two types of schools, and this is why this is useful for testing whether the regression lines for the two types of schools are equal. If the two types of schools had the same regression coefficient for some_col, then the coefficient for the yrXsome interaction would be 0. In this case, the difference is significant, indicating that the regression lines are significantly different.

So, if we look at the graph of the two regression lines we can see the difference in the slopes of the regression lines (see graph below).  Indeed, we can see that the non-year round schools (the solid line) have a smaller slope (1.4) than the slope for the year round schools (7.4).  The difference between these slopes is 5.99, the coefficient for yrXsome.

line yhata0 yhata1 some_col, sort


3.7.2 Computing interactions with xi

We can use the xi command for doing this kind of analysis as well. Let's start fresh and use the elemapi2 file.

use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear

We can run a model just like the model we showed above using the xi command. You can compare the results to those above and see that we get the exact same results.

xi : regress api00 i.yr_rnd*some_col

i.yr_rnd          _Iyr_rnd_1-2        (naturally coded; _Iyr_rnd_1 omitted)
i.yr_rnd*some~l   _Iyr_Xsome__#       (coded as above)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =   52.05
Model |  2283345.48     3  761115.162           Prob > F      =  0.0000
Residual |  5790326.51   396  14622.0366           R-squared     =  0.2828
Total |  8073672.00   399  20234.7669           Root MSE      =  120.92

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iyr_rnd_2 |  -248.0712   29.85895    -8.31   0.000    -306.7731   -189.3694
some_col |   1.409427   .5856022     2.41   0.017     .2581494    2.560705
_Iyr_Xsome~2 |    5.99319    1.57715     3.80   0.000     2.892557    9.093824
_cons |   655.1103   14.03499    46.68   0.000     627.5179    682.7027
------------------------------------------------------------------------------

The i.yr_rnd*some_col term creates 3 terms, some_col, _Iyr_rnd_2 an indicator variable for yr_rnd representing whether the school is year round and the variable _Iyr_Xsome~2 representing the interaction of yr_rnd by some_col.

As we did above, we can create predicted values and create graphs showing the regression lines for the two types of schools.  We omit showing these commands.

3.7.3 Computing interactions with anova

We can also run a model just like the model we showed above using the anova command. We include the terms yr_rnd some_col and the interaction yr_rnr*some_col

anova api00 yr_rnd some_col yr_rnd*some_col, contin(some_col)

Number of obs =     400     R-squared     =  0.2828
Root MSE      = 120.922     Adj R-squared =  0.2774

Source |  Partial SS    df       MS           F     Prob > F
----------------+----------------------------------------------------
Model |  2283345.48     3  761115.162      52.05     0.0000
|
yr_rnd |  1009279.99     1  1009279.99      69.02     0.0000
some_col |  456473.187     1  456473.187      31.22     0.0000
yr_rnd*some_col |  211143.646     1  211143.646      14.44     0.0002
|
Residual |  5790326.51   396  14622.0366
----------------+----------------------------------------------------
Total |  8073672.00   399  20234.7669   

As we illustrated above, we can compute the predicted values using the predict command and graph the separate regression lines.  These commands are omitted.

In this section we found that the relationship between some_col and api00 depended on whether the school is a  year round school or a non-year round school.  For the year round schools, the relationship between some_col and api00 was significantly stronger than for non-year round schools.  In general, this type of analysis allows you to test whether the strength of the relationship between two continuous variables varies based on the categorical variable.

3.8 Continuous and Categorical variables, interaction with 1/2/3 variable

The prior examples showed how to do regressions with a continuous variable and a categorical variable that has 2 levels.  These examples will extend this further by using a categorical variable with 3 levels, mealcat.

3.8.1 using xi

We can use the xi command to run a model with some_col, mealcat and the interaction of these two variables.

xi : regress api00 i.mealcat*some_col

i.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_1 omitted)
i.meal~t*some~l   _ImeaXsome__#       (coded as above)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  5,   394) =  263.00
Model |  6212306.88     5  1242461.38           Prob > F      =  0.0000
Residual |  1861365.12   394  4724.27696           R-squared     =  0.7695
Total |  8073672.00   399  20234.7669           Root MSE      =  68.733

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_2 |    -239.03   18.66502   -12.81   0.000    -275.7255   -202.3345
_Imealcat_3 |  -344.9476   17.05743   -20.22   0.000    -378.4825   -311.4126
some_col |  -.9473385   .4873679    -1.94   0.053    -1.905505    .0108284
_ImeaXsome~2 |   3.140945   .7292897     4.31   0.000     1.707159     4.57473
_ImeaXsome~3 |   2.607308   .8960435     2.91   0.004     .8456841    4.368933
_cons |   825.8937   11.99182    68.87   0.000     802.3177    849.4697
------------------------------------------------------------------------------

The interaction now has two terms (_ImeaXsome~2 and _ImeaXsome~3).  To get an overall test of this interaction, we can use the test command.

test  _ImeaXsome__2 _ImeaXsome__3

( 1)  _ImeaXsome__2 = 0.0
( 2)  _ImeaXsome__3 = 0.0

F(  2,   394) =   10.32
Prob > F =    0.0000

These results indicate that the overall interaction is indeed significant.  This means that the regression lines from the 3 groups differ significantly. As we have done before, let's compute the predicted values and make a graph of the predicted values so we can see how the regression lines differ.

predict yhatc

(option xb assumed; fitted values)

separate yhatc, by(mealcat)

storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
yhatc1          float  %9.0g                  yhatc, mealcat == 1
yhatc2          float  %9.0g                  yhatc, mealcat == 2
yhatc3          float  %9.0g                  yhatc, mealcat == 3

Since we had three groups, we get three regression lines, one for each category of mealcat. The solid line is for group 1, the dashed line for group 2, and the dotted line is for group 3.

line yhatc1 yhatc2 yhatc3 some_col, sort

Group 1 was the omitted group, therefore the slope of the line for group 1 is the coefficient for some_col which is -.94.  Indeed, this line has a downward slope.  If we add the coefficient for some_col to the coefficient for _ImeaXsome~2 we get the coefficient for group 2, i.e., 3.14 + -.94 yields 2.2, the slope for group 2. Indeed, group 2 shows an upward slope. Likewise,  if we add the coefficient for some_col to the coefficient for _ImeaXsome~3 we get the coefficient for group 3, i.e., 2.6 + -.94 yields 1.66, the slope for group 3,.  So, the slopes for the 3 groups are

group 1: -0.94
group 2:  2.2
group 3:  1.66

The test of the coefficient for _ImeaXsome~2 tested whether the coefficient for group 2 differed from group 1, and indeed this was significant.  Likewise, the test of the coefficient for _ImeaXsome~3 tested whether the coefficient for group 3 differed from group 1, and indeed this was significant.  What did the test of the coefficient some_col test?  This coefficient represents the coefficient for group 1, so this tested whether the coefficient for group 1 (-0.94) was significantly different from 0.  This is probably a non-interesting test.

The comparisons in the above analyses don't seem to be as interesting as comparing group 1 vs. 2 and then comparing group 2 vs. 3.  These successive comparisons seem much more interesting. We can do this by making group 2 the omitted group, and then each group would be compared to group 2.  As we have done before, we will use the char command to indicate that we want group 2 to be the omitted category and then rerun the regression.

char mealcat[omit] 2
xi : regress api00 i.mealcat*some_col

i.mealcat         _Imealcat_1-3       (naturally coded; _Imealcat_2 omitted)
i.meal~t*some~l   _ImeaXsome__#       (coded as above)

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  5,   394) =  263.00
Model |  6212306.88     5  1242461.38           Prob > F      =  0.0000
Residual |  1861365.12   394  4724.27696           R-squared     =  0.7695
Total |  8073672.00   399  20234.7669           Root MSE      =  68.733

------------------------------------------------------------------------------
api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Imealcat_1 |     239.03   18.66502    12.81   0.000     202.3345    275.7255
_Imealcat_3 |  -105.9176    18.7545    -5.65   0.000     -142.789    -69.0462
some_col |   2.193606   .5425274     4.04   0.000     1.126996    3.260217
_ImeaXsome~1 |  -3.140945   .7292897    -4.31   0.000     -4.57473   -1.707159
_ImeaXsome~3 |  -.5336362   .9272014    -0.58   0.565    -2.356517    1.289245
_cons |   586.8637   14.30311    41.03   0.000     558.7438    614.9837
------------------------------------------------------------------------------

Now, the test of _ImeaXsome~1 tests whether the coefficient for group 1 differs from group 2, and it does.  Then, the test of _ImeaXsome~3 tests whether the coefficient for group 3 significantly differs from group 2, and it does not. This makes sense given the graph and given the estimates of the coefficients that we have, that -.94 is significantly different from 2.2 but 2.2 is not significantly different from 1.66.

3.8.2 Using Anova

We can perform the same analysis using the anova command, as shown below.  The anova command gives us somewhat less flexibility since we cannot choose which group is the omitted group.

use elemapi2, clear
anova api00 mealcat some_col mealcat*some_col, cont(some_col)

Number of obs =     400     R-squared     =  0.7695
Root MSE      = 68.7334     Adj R-squared =  0.7665

Source |  Partial SS    df       MS           F     Prob > F
-----------------+----------------------------------------------------
Model |  6212306.88     5  1242461.38     263.00     0.0000
|
mealcat |  2012065.49     2  1006032.75     212.95     0.0000
some_col |  36366.3662     1  36366.3662       7.70     0.0058
mealcat*some_col |  97468.1685     2  48734.0843      10.32     0.0000
|
Residual |  1861365.12   394  4724.27696
-----------------+----------------------------------------------------
Total |  8073672.00   399  20234.7669   

Because the anova command omits the 3rd category, and the analysis we showed above omitted the second category, the parameter estimates will not be the same. You can compare the results from below with the results above and see that the parameter estimates are not the same.  Because group 3 is dropped, that is the reference category and all comparisons are made with group 3.

anova, regress

Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  5,   394) =  263.00
Model |  6212306.88     5  1242461.38           Prob > F      =  0.0000
Residual |  1861365.12   394  4724.27696           R-squared     =  0.7695
Total |  8073672.00   399  20234.7669           Root MSE      =  68.733

------------------------------------------------------------------------------
api00        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
------------------------------------------------------------------------------
_cons            480.9461   12.13063    39.65   0.000     457.0973     504.795
meals3
1     344.9476   17.05743    20.22   0.000     311.4126    378.4825
2     105.9176    18.7545     5.65   0.000      69.0462     142.789
3    (dropped)
some_col          1.65997   .7519086     2.21   0.028     .1817153    3.138225
meals3*some_col
1       -2.607308   .8960435    -2.91   0.004    -4.368933   -.8456841
2        .5336362   .9272014     0.58   0.565    -1.289245    2.356517
3       (dropped)
------------------------------------------------------------------------------

These analyses showed that the relationship between some_col and api00 varied, depending on the level of mealcat.  In comparing group 1 with group 2, the coefficient for some_col was significantly different, but there was no difference in the coefficient for some_col in comparing groups 2 and 3.

3.9 Summary

This covered four techniques for analyzing data with categorical variables, 1) manually constructing indicator variables, 2) creating indicator variables using the xi command, 3) coding variables using xi3, and 4) using the anova command. Each method has its advantages and disadvantages, as described below.

Manually constructing indicator variables can be very tedious and even error prone. For very simple models, it is not very difficult to create your own indicator variables, but if you have categorical variables with many levels and/or interactions of categorical variables, it can be laborious to manually create indicator variables. However, the advantage is that you can have quite a bit of control over how the variables are created and the terms that are entered into the model.

The xi command can really ease the creation of indicator variables, and make it easier to include interactions in your models by allowing you to include interaction terms such as i.prog*female .  The xi command also gives you the flexibility to decide which category would be the omitted category (unlike the anova command).

The anova command eliminates the need to create indicator variables making it easy to include variables that have lots of categories, and making it easy to create interactions by allowing you to include terms like some_col*mealcat. It can be easier to perform tests of simple main effects with the anova command. However, the anova command is not flexible in letting you choose which category is the omitted category (the last category is always the omitted category).

As you will see in the next chapter, the regress command includes additional options like the robust option and the cluster option that allow you to perform analyses when you don't exactly meet the assumptions of ordinary least squares regression.  In such cases, the regress command offers features not available in the anova command and may be more advantageous to use.

See the Stata Topics: Regression page for more information and resources on regression with categorical predictors in Stata.

3.10 Self Assessment

1. Using the elemapi2 data file ( use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2 ) convert the variable ell into 2 categories using the following coding, 0-25 on ell becomes 0, and 26-100 on ell becomes 1. Use this recoded version of ell to predict api00 and interpret the results.

2. Convert the variable ell into 3 categories coding those scoring 0-14 on ell as 1, and those 15/41 as 2 and 42/100 as 3. Do an analysis predicting api00 from the ell variable converted to a 1/2/3 variable. Interpret the results.

3. Do a regression analysis predicting api00 from yr_rnd and the ell variable converted to a 0/1 variable. Then create an interaction term and run the analysis again. Interpret the results of these analyses.

4. Do a regression analysis predicting api00 from ell coded as 0/1 (from question 1) and some_col, and the interaction of these two variables. Interpret the results, including showing a graph of the results.

5. Use the variable ell converted into 3 categories (from question 2) and predict api00 from ell in 3 categories, from some_col and the interaction. of these two variables. Interpret the results, including showing a graph.