Regression with Stata
Chapter 3 - Regression with Categorical Predictors

Chapter Outline
    3.0 Regression with Categorical Predictors
    3.1 Regression with a 0/1 variable
    3.2 Regression with a 1/2 variable
    3.3 Regression with a 1/2/3 variable
    3.4 Regression with multiple categorical predictors
    3.5 Categorical predictor with interactions
    3.6 Continuous and Categorical variables
    3.7 Interactions of Continuous by 0/1 Categorical variables
    3.8 Continuous and Categorical variables, interaction with 1/2/3 variable
    3.9 Summary
    3.10 Self assessment
    3.11 For more information

Please note: This page makes use of the program xi3, which is no longer being maintained and has been removed from our archives. References to xi3 will be left on this page because they illustrate specific principles of coding categorical variables.

3.0 Introduction

In the previous two chapters, we have focused on regression analyses using continuous variables. It is possible to include categorical predictors in a regression analysis as well, but it requires some extra work in performing the analysis and in properly interpreting the results. This chapter will illustrate how you can use Stata to include categorical predictors in your analysis and describe how to interpret the results of such analyses. Stata has some great tools that really ease the process of including categorical variables in your regression analysis, and we will emphasize the use of these timesaving tools.

This chapter will use the elemapi2 data that you have seen in the prior chapters. We will focus on four variables: api00, some_col, yr_rnd and mealcat, which divides the variable meals into 3 categories. Let's have a quick look at these variables.

The variable api00 is a measure of the performance of the schools. Below we see the codebook information for api00

The variable some_col is a continuous variable that measures the percentage of the parents in the school who have attended college, and the codebook information is shown below.

The variable yr_rnd is a categorical variable that is coded 0 if the school is not year round, and 1 if year round, see below.

The variable meals is the percentage of students who are receiving state sponsored free meals and can be used as an indicator of poverty. This was broken into 3 categories (to make equally sized groups) creating the variable mealcat. The codebook information for mealcat is shown below.
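
We can look at each of these variables with the codebook command, for example:

codebook api00
codebook some_col
codebook yr_rnd
codebook mealcat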

3.1 Regression with a 0/1 variable

The simplest example of a categorical predictor in a regression analysis is a 0/1 variable, also called a dummy variable. Let's use the variable yr_rnd as an example of a dummy variable. We can include a dummy variable as a predictor in a regression analysis as shown below.
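
A command like this fits the model.

regress api00 yr_rnd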

This may seem odd at first, but this is a legitimate analysis. But what does this mean? Let's go back to basics and write out the regression equation that this model implies.

api00 = _cons + Byr_rnd * yr_rnd 

where _cons is the intercept (or constant) and we use Byr_rnd to represent the coefficient for variable yr_rnd.  Filling in the values from the regression equation, we get

api00 = 684.539 + -160.5064 * yr_rnd 

If a school is not a year-round school (i.e. yr_rnd is 0) the regression equation would simplify to

api00 = constant    + 0 * Byr_rnd 
api00 = 684.539     + 0 * -160.5064  
api00 = 684.539

If a school is a year-round school, the regression equation would simplify to 

api00 = constant + 1 * Byr_rnd 
api00 = 684.539  + 1 * -160.5064 
api00 = 524.0326

We can graph the observed values and the predicted values using the scatter command as shown below. Although yr_rnd only has 2 values, we can still draw a regression line showing the relationship between yr_rnd and api00.  Based on the results above, we see that the predicted value for non-year round schools is 684.539 and the predicted value for the year round schools is 524.032, and the slope of the line is negative, which makes sense since the coefficient for yr_rnd was negative (-160.5064).   

twoway (scatter api00 yr_rnd) (lfit api00 yr_rnd)

Let's compare these predicted values to the mean api00 scores for the year-round and non-year-round schools.
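
For example, the means by group can be obtained with a command like this.

tabulate yr_rnd, summarize(api00)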

As you see, the regression equation predicts that the value of api00 will be the mean value, depending on whether a school is a year round school or non-year round school.

Let's relate these predicted values back to the regression equation. For the non-year-round schools, their mean is the same as the intercept (684.539). The coefficient for yr_rnd is the amount we need to add to the intercept to get the mean for the year-round schools, i.e., we add -160.5064 to 684.539 to get 524.0326, the mean for the year-round schools. In other words, Byr_rnd is the mean api00 score for the year-round schools minus the mean api00 score for the non year-round schools, i.e., mean(year-round) - mean(non year-round).

It may be surprising to note that this regression analysis with a single dummy variable is the same as doing a t-test comparing the mean api00 for the year-round schools with the non year-round schools (see below). You can see that the t value below is the same as the t value for yr_rnd in the regression above. This is because Byr_rnd compares the year-rounds and non year-rounds (since the coefficient is mean(year round)-mean(non year-round)).
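
For example, a t-test along these lines could be used.

ttest api00, by(yr_rnd)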

Since a t-test comparing two groups is equivalent to a one-way anova, we can get the same results using the anova command as well.
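
For example:

anova api00 yr_rnd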

If we square the t-value from the t-test, we get the same value as the F-value from the anova.

3.2 Regression with a 1/2 variable

A categorical predictor variable does not have to be coded 0/1 to be used in a regression model. It is easier to understand and interpret the results from a model with dummy variables, but a variable coded 1/2 yields essentially the same results.

Let's make a copy of the variable yr_rnd called yr_rnd2 that is coded 1/2, where 1=non year-round and 2=year-round.
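
One way to create this variable would be:

generate yr_rnd2 = yr_rnd + 1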

Let's perform a regression predicting api00 from yr_rnd2.
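
For example:

regress api00 yr_rnd2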

Note that the coefficient for yr_rnd2 is the same as the coefficient for yr_rnd. So, whether you code yr_rnd as 0/1 or as 1/2, the regression coefficient works out to be the same. However, the intercept (_cons) is a bit less intuitive. When we used yr_rnd, the intercept was the mean for the non year-round schools. When using yr_rnd2, the intercept is the mean for the non year-round schools minus Byr_rnd2, i.e., 684.539 - (-160.506) = 845.045.

Note that you can use 0/1 or 1/2 coding and the results for the coefficient come out the same, but the interpretation of the constant in the regression equation is different. It is often easier to interpret the estimates for 0/1 coding.

In summary, these results indicate that the api00 scores are significantly different for the schools depending on the type of school, year round school vs. non-year round school. Non year-round schools have significantly higher API scores than year-round schools. Based on the regression results, non year-round schools have scores that are 160.5 points higher than year-round schools.

3.3 Regression with a 1/2/3 variable

3.3.1 Manually Creating Dummy Variables

Say that we would like to examine the relationship between the amount of poverty and api scores. We don't have a measure of poverty, but we can use mealcat as a proxy for one. Below we repeat the codebook info for mealcat showing the values for the three categories.

You might be tempted to try including mealcat in a regression like this.
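
That is, with a command along these lines.

regress api00 mealcat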

However, this looks at the linear effect of mealcat on api00, and mealcat is not an interval variable. Instead, you will want to code the variable so that all the information concerning the three levels is accounted for. You can dummy code mealcat as shown below.
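
For example, the tabulate command with the generate() option creates one dummy variable for each level of mealcat.

tabulate mealcat, generate(mealcat)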

We now have created mealcat1, which is 1 if mealcat is 1 and 0 otherwise. Likewise, mealcat2 is 1 if mealcat is 2 (and 0 otherwise), and mealcat3 is 1 if mealcat is 3 (and 0 otherwise). We can see this below.
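
For example, listing a few observations shows how the dummy variables correspond to mealcat.

list mealcat mealcat1 mealcat2 mealcat3 in 1/10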

We can now use two of these dummy variables (mealcat2 and mealcat3) in the regression analysis.
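
For example, including mealcat2 and mealcat3 (with group 1 omitted):

regress api00 mealcat2 mealcat3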

We can test the overall differences among the three groups by using the test command as shown below. This shows that the overall differences among the three groups are significant.
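
For example:

test mealcat2 mealcat3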

The interpretation of the coefficients is much like that for the binary variables. Group 1 is the omitted group, so _cons is the mean for group 1. The coefficient for mealcat2 is the mean for group 2 minus the mean of the omitted group (group 1). And the coefficient for mealcat3 is the mean of group 3 minus the mean of group 1. You can verify this by comparing the coefficients with the means of the groups.

Based on these results, we can say that the three groups differ in their api00 scores, and that in particular group 2 is significantly different from group 1 (because mealcat2 was significant) and group 3 is significantly different from group 1 (because mealcat3 was significant).

3.3.2 Using the xi command

We can use the xi command to do the work for us to create the indicator variables and run the regression all in one command, as shown below.
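
For example, a command like this creates the indicators and runs the regression in one step.

xi: regress api00 i.mealcat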

When we use xi and include the term i.mealcat in the model, Stata creates the variables _Imealcat_2 and _Imealcat_3 that are dummy variables just like mealcat2 and mealcat3 that we created before. There really is no difference between mealcat2 and _Imealcat_2.

As you can see, the results are the same as in the prior analysis. If we want to test the overall effect of mealcat we use the test command as shown below, which also gives us the same results as we found using the dummy variables mealcat2 and mealcat3.
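
For example:

test _Imealcat_2 _Imealcat_3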

Note that if you are doing this in Stata version 6 the variables would be named Imealc_2 and Imealc_3 instead of _Imealcat_2 and _Imealcat_3. One of the improvements in Stata 7 is that variable names can be longer than 8 characters, so the names of the variables created by the xi command are easier to understand than in version 6. From this point forward, we will use the variable names that would be created in version 7.

What if we wanted a different group to be the reference group? If we create dummy variables via tabulate, generate() then we can easily choose which category will be the omitted group; for example, let's omit group 3.
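
Using the dummy variables created above, we simply include mealcat1 and mealcat2 and leave mealcat3 out of the model, for example:

regress api00 mealcat1 mealcat2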

With group 3 omitted, the constant is now the mean of group 3 and mealcat1 is group1-group3 and mealcat2 is group2-group3. We see that both of these coefficients are significant, indicating that group 1 is significantly different from group 3 and group 2 is significantly different from group 3.

When we use the xi command, how can we choose which group is the omitted group? By default, the first group is omitted, but say we want group 3 to be omitted. We can use the char command as shown below to tell Stata that we want the third group to be the omitted group for the variable mealcat.
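
For example:

char mealcat[omit] 3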

Then, when we use the xi command using mealcat the mealcat=3 group will be omitted. If you save the data file, Stata will remember this for future Stata sessions.
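
For example, rerunning the earlier command now omits group 3.

xi: regress api00 i.mealcat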

You can compare and see that these results are identical to those found using mealcat1 and mealcat2 as predictors.

3.3.3 Using the anova command

We can also do this analysis using the anova command. The benefit of the anova command is that it gives us the test of the overall effect of mealcat without needing to subsequently use the test command as we did with the regress command.
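
For example:

anova api00 mealcat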

We can see the anova test of the effect of mealcat is the same as the test command from the regress command.

We can even follow this with the anova, regress command and compare the parameter estimates with those we performed previously.
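
For example:

anova, regress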

Note: the parameter estimates are the same because mealcat is coded the same way in the regress command and in the anova command, with the last category (category 3) being dropped in both cases. While you can control which category is the omitted category when you use the regress command, the anova, regress command always drops the last category.

3.3.4 Other coding schemes

It is generally very convenient to use dummy coding but that is not the only kind of coding that can be used. As you have seen, when you use dummy coding one of the groups becomes the reference group and all of the other groups are compared to that group. This may not be the most interesting set of comparisons.

Say you want to compare group 1 with groups 2 and 3, and for a second comparison compare group 2 with group 3. You need to generate a coding scheme that forms these 2 comparisons. We will illustrate this using a Stata program, xi3, (an enhanced version of xi) that will create the variables you would need for such comparisons (as well as a variety of other common comparisons). 

The comparisons that we have described (comparing group 1 with 2 and 3, and then comparing groups 2 and 3) correspond to Helmert comparisons (see Chapter 5 for more details). We use the h. prefix (instead of the i. prefix) to indicate that we desire Helmert comparisons on the variable mealcat. Otherwise, you see that xi3 works much like the xi command.
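
Assuming xi3 is installed, a command along these lines requests Helmert coding for mealcat.

xi3: regress api00 h.mealcat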

If you compare the parameter estimates with the means (see below) you can verify that the coefficient for _Imealcat_1 is the mean of group 1 minus the mean of groups 2 and 3 (805.71756 - (639.39394 + 504.37956) / 2 = 233.83081) and the coefficient for _Imealcat_2 is the mean of group 2 minus the mean of group 3 (639.39394 - 504.37956 = 135.01). Both of these comparisons are significant, indicating that group 1 differs significantly from groups 2 and 3 combined, and group 2 differs significantly from group 3.

And the value of _cons is the unweighted average of the means of the 3 groups.

Using the coding scheme provided by xi3, we were able to form perhaps more interesting tests than those provided by dummy coding.  The xi3 program can create variables according to other coding schemes, as well as custom coding schemes that you create, see help xi3 and Chapter 5 for more information. 

3.4 Regression with two categorical predictors

3.4.1 Using the xi: command

Previously we looked at using yr_rnd to predict api00

And we have also looked at mealcat using the xi command

We can include both yr_rnd and mealcat together in the same model.
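
For example:

xi: regress api00 yr_rnd i.mealcat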

We can test the overall effect of mealcat with the test command, which is significant.
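
For example:

test _Imealcat_1 _Imealcat_2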

Because this model has only main effects (no interactions) you can interpret Byr_rnd as the difference between the year round and non-year round groups. The coefficient for _Imealcat_1 (which we will call B_Imealcat_1) is the difference between mealcat=1 and mealcat=3, and B_Imealcat_2 is the difference between mealcat=2 and mealcat=3.

Let's dig below the surface and see how the coefficients relate to the predicted values. Let's view the cells formed by crossing yr_rnd and mealcat and number the cells from cell1 to cell6.

           mealcat=1     mealcat=2      mealcat=3
 yr_rnd=0  cell1         cell2          cell3
 yr_rnd=1  cell4         cell5          cell6

With respect to mealcat, the group mealcat=3 is the reference category, and with respect to yr_rnd the group yr_rnd=0 is the reference category. As a result, cell3 is the reference cell. The constant is the predicted value for this cell.

The coefficient for yr_rnd is the difference between cell3 and cell6. Since this model has only main effects, it is also the difference between cell2 and cell5, or between cell1 and cell4. In other words, Byr_rnd is the amount you add to the predicted value when you go from non-year round to year round schools.

The coefficient for _Imealcat_1 is the predicted difference between cell1 and cell3. Since this model only has main effects, it is also the predicted difference between cell4 and cell6. Likewise, B_Imealcat_2 is the predicted difference between cell2 and cell3, and also the predicted difference between cell5 and cell6.

So, the predicted values, in terms of the coefficients, would be

           mealcat=1         mealcat=2         mealcat=3
          -----------------------------------------------
 yr_rnd=0  _cons             _cons             _cons
           +BImealcat1       +BImealcat2
          -----------------------------------------------
 yr_rnd=1  _cons             _cons             _cons    
           +Byr_rnd          +Byr_rnd          +Byr_rnd 
           +BImealcat1       +BImealcat2

We should note that if you computed the predicted values for each cell, they would not exactly match the means in the 6 cells.  The predicted means would be close to the observed means in the cells, but not exactly the same.  This is because our model only has main effects and assumes that the difference between cell1 and cell4 is exactly the same as the difference between cells 2 and 5 which is the same as the difference between cells 3 and 6.  Since the observed values don't follow this pattern, there is some discrepancy between the predicted means and observed means.

3.4.2 Using the anova command

We can run the same analysis using the anova command with just main effects
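
For example:

anova api00 yr_rnd mealcat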

Note that we get the same information that we do from the xi: regress command followed by the test command. The anova command automatically provides the information provided by the test command. If we like, we can also request the parameter estimates afterwards, as shown below.
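
For example:

anova, regress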

The anova, regress command will display the parameter estimates from the last anova model. However, the anova command is rigid in its determination of which group will be the omitted group: the last group is dropped. Since this differs from the coding we used in the regression commands above, the parameter estimates from this anova command will differ from those of the regress command above.

In summary, these results indicate the differences between year round and non-year round schools is significant, and the differences among the three mealcat groups are significant.

3.5 Categorical predictor with interactions

3.5.1 Using xi

Let's perform the same analysis that we performed above, this time let's include the interaction of mealcat by yr_rnd. When using xi, it is easy to include an interaction term, as shown below.
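
For example, a command along these lines includes the main effects of mealcat and yr_rnd along with their interaction (creating the _ImeaXyr_rn_ variables described below).

xi: regress api00 i.mealcat*yr_rnd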

We can test the overall interaction with the test command. This interaction effect is not significant.
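
For example:

test _ImeaXyr_rn_1 _ImeaXyr_rn_2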

It is important to note how the meaning of the coefficients changes in the presence of these interaction terms. For example, in the prior model, with only main effects, we could interpret Byr_rnd as the difference between the year round and non year round schools. However, now that we have added the interaction term, the term Byr_rnd represents the difference between cell3 and cell6, or the difference between the year round and non-year round schools when mealcat=3 (because mealcat=3 was the omitted group). The presence of an interaction would imply that the difference between year round and non-year round schools depends on the level of mealcat. The interaction terms B_ImeaXyr_rn_1 and B_ImeaXyr_rn_2 represent the extent to which the difference between the year round/non year round schools changes when mealcat=1 and when mealcat=2 (as compared to the reference group, mealcat=3). For example, the term B_ImeaXyr_rn_1 represents the difference between year round and non-year round schools for mealcat=1 vs. the difference for mealcat=3. In other words, B_ImeaXyr_rn_1 in this design is (cell4-cell1) - (cell6-cell3), or it represents how much the effect of yr_rnd differs between mealcat=1 and mealcat=3.

Below we have shown the predicted values for the six cells in terms of the coefficients in the model.  If you compare this to the main effects model, you will see that the predicted values are the same except for the addition of _ImeaXyr_rn_1 (in cell 4) and _ImeaXyr_rn_2 (in cell 5). 

           mealcat=1           mealcat=2         mealcat=3
           -------------------------------------------------
 yr_rnd=0  _cons               _cons             _cons    
           +BImealcat1         +BImealcat2 
           -------------------------------------------------
 yr_rnd=1  _cons               _cons             _cons    
           +Byr_rnd            +Byr_rnd          +Byr_rnd
           +BImealcat1         +BImealcat2           
           +B_ImeaXyr_rn_1     +B_ImeaXyr_rn_2 

It can be very tricky to interpret these interaction terms if you wish to form specific comparisons. For example, if you wanted to perform a test of the simple main effect of yr_rnd when mealcat=1, i.e., comparing cell1 with cell4, you would want to compare _cons + BImealcat1 vs. _cons + Byr_rnd + BImealcat1 + B_ImeaXyr_rn_1, and since _cons and BImealcat1 drop out, we would test whether Byr_rnd + B_ImeaXyr_rn_1 equals 0.
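
Assuming the variable names created by xi above (with yr_rnd entering the model as is), this could be done with a command such as:

test yr_rnd + _ImeaXyr_rn_1 = 0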

This test is significant, indicating that the effect of yr_rnd is significant for the mealcat = 1 group.

As we will see, such tests can be more easily done via anova.

3.5.2 Using anova

Constructing these interactions can be somewhat easier when using the anova command.  As you see below, the anova command gives us the test of the overall main effects and interactions without the need to perform subsequent test commands.
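
For example:

anova api00 yr_rnd mealcat yr_rnd*mealcat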

It is easy to perform tests of simple main effects using the sme command. You can download sme from within Stata by typing findit sme (see How can I use the findit command to search for programs and get additional help? for more information about using findit).

Now we can test the simple main effects of yr_rnd at each level of mealcat.
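
For example, a command along these lines tests the effect of yr_rnd at each level of mealcat (see help sme for the exact syntax; here we assume the effect variable is given first, followed by the moderator variable).

sme yr_rnd mealcat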

The results from sme show us the effect of yr_rnd at each of the 3 levels of mealcat. We can see that the comparison for mealcat = 1 matches the one we computed above using the test command; however, it was much easier and less error prone using the sme command.

Although this section has focused on how to handle analyses involving interactions, these particular results show no indication of interaction. We could decide to omit interaction terms from future analyses, having found the interactions to be non-significant. This would simplify future analyses; however, including the interaction term can be useful to assure readers that it is non-significant.

3.6 Continuous and Categorical variables  

3.6.1 Using regress

Say that we wish to analyze both continuous and categorical variables in one analysis. For example, let's include yr_rnd and some_col in the same analysis.
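
For example:

regress api00 yr_rnd some_col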

We can create the predicted values using the predict command.
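
Here we assume the predicted values are stored in a new variable called yhat.

predict yhat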

Let's graph the predicted values by some_col.
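
For example, assuming the predicted values were saved in yhat, a command like this draws one line for each type of school.

twoway (line yhat some_col if yr_rnd==0, sort) (line yhat some_col if yr_rnd==1, sort)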

The coefficient for some_col indicates that for every unit increase in some_col the api00 score is predicted to increase by 2.23 units. This is the slope of the lines shown in the above graph. The graph has two lines, one for the year round schools and one for the non-year round schools. The coefficient for yr_rnd is -149.16, indicating that as yr_rnd increases by 1 unit, the api00 score is expected to decrease by about 149 units. As you can see in the graph, the top line is about 150 units higher than the lower line. You can see that the intercept is 637, which is where the upper line crosses the Y axis when X is 0. The lower line crosses the Y axis about 150 units lower, at about 487.

3.6.2 Using anova

We can run this analysis using the anova command. The anova command assumes that the variables are categorical, thus, we need to use the continuous() option (which can be abbreviated as cont()) to specify that some_col is a continuous variable.
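
For example:

anova api00 yr_rnd some_col, continuous(some_col)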

If we square the t-values from the regress command (above), we would find that they match those of the anova command. 

3.7 Interactions of Continuous by 0/1 Categorical variables

Above we showed an analysis that looked at the relationship between some_col and api00 and also included yr_rnd.  We saw that this produced a graph where we saw the relationship between some_col and api00 but there were two regression lines, one higher than the other but with equal slope.  Such a model assumed that the slope was the same for the two groups.  Perhaps the slope might be different for these groups.  Let's run the regressions separately for these two groups beginning with the non-year round schools.
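
For example:

regress api00 some_col if yr_rnd==0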

Likewise, let's look at the year round schools.
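
For example:

regress api00 some_col if yr_rnd==1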

Note that the slope of the regression line looks much steeper for the year round schools than for the non-year round schools. This is confirmed by the regression equations that show the slope for the year round schools (7.4) to be higher than the slope for the non-year round schools (1.4). We can test whether these slopes are significantly different from each other by including the interaction of some_col by yr_rnd, an interaction of a continuous variable by a categorical variable.

3.7.1 Computing interactions manually

We will start by manually computing the interaction of some_col by yr_rnd. Let's start fresh and use the elemapi2 data file using the , clear option to clear out any variables we have previously created.
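
For example:

use elemapi2, clear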

Next, let's make a variable that is the interaction of some college (some_col) and year round schools (yr_rnd) called yrXsome.
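
For example:

generate yrXsome = yr_rnd*some_col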

We can now run the regression that tests whether the coefficient for some_col is significantly different for year round schools and non-year round schools. Indeed, the yrXsome interaction effect is significant.
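
For example:

regress api00 some_col yr_rnd yrXsome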

We can make a graph showing the regression lines for the two types of schools showing how different their regression lines are. We first create the predicted value, we call it yhata.
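
For example:

predict yhata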

Then, we create separate variables for the two types of schools which will be called yhata0 for non-year round schools and yhata1 for year round schools.
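
For example:

generate yhata0 = yhata if yr_rnd==0
generate yhata1 = yhata if yr_rnd==1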

We can then graph the predicted values for the two types of schools by some_col. You can see how the two lines have quite different slopes, consistent with the fact that the yrXsome interaction was significant.  The c(ll[_])  option indicates that yhata0 should be connected with a line, and yhata1 should be connected with dashed lines (because we included [_] after the l ).  If we had used l[.] it would have made a dotted line.  The options to make dashed and dotted lines are new to Stata 7 and you can find more information via help grsym
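
A command along these lines (using the Stata 7 graph syntax with the connect option described above) would draw the two lines.

graph yhata0 yhata1 some_col, c(ll[_]) sort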

We can replot the same graph including the data points.

The graph above used the same kind of dots for the data points for both types of schools. Let's make separate variables for the api00 scores for the two types of schools called api000 for the non-year round schools and api001 for the year round schools.
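
For example:

generate api000 = api00 if yr_rnd==0
generate api001 = api00 if yr_rnd==1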

We can then make the same graph as above except show the points differently for the two types of schools.  Below we use small circles for the non-year round schools, and triangles for the year round schools.

Let's quickly run the regressions again where we performed separate regressions for the two groups

Non-year round

Year round

Now, let's show the regression for both types of schools with the interaction term.

Note that the coefficient for some_col in the combined analysis is the same as the coefficient for some_col for the non-year round schools. This is because non-year round schools are the reference group. Then, the coefficient for the yrXsome interaction in the combined analysis is Bsome_col for the year round schools (7.4) minus Bsome_col for the non year round schools (1.41), yielding 5.99. This interaction is the difference in the slopes of some_col for the two types of schools, and this is why it is useful for testing whether the regression lines for the two types of schools are equal. If the two types of schools had the same regression coefficient for some_col, then the coefficient for the yrXsome interaction would be 0. In this case, the difference is significant, indicating that the regression lines are significantly different.

So, if we look at the graph of the two regression lines we can see the difference in the slopes of the regression lines (see graph below).  Indeed, we can see that the non-year round schools (the solid line) have a smaller slope (1.4) than the slope for the year round schools (7.4).  The difference between these slopes is 5.99, the coefficient for yrXsome.

line yhata0 yhata1 some_col, sort
 
3.7.2 Computing interactions with xi

We can use the xi command for doing this kind of analysis as well. Let's start fresh and use the elemapi2 file.

We can run a model just like the model we showed above using the xi command. You can compare the results to those above and see that we get the exact same results.
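
For example:

xi: regress api00 i.yr_rnd*some_col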

The i.yr_rnd*some_col term creates 3 terms, some_col, _Iyr_rnd_2 an indicator variable for yr_rnd representing whether the school is year round and the variable _Iyr_Xsome~2 representing the interaction of yr_rnd by some_col.

As we did above, we can create predicted values and create graphs showing the regression lines for the two types of schools.  We omit showing these commands.

3.7.3 Computing interactions with anova

We can also run a model just like the model we showed above using the anova command. We include the terms yr_rnd, some_col and the interaction yr_rnd*some_col.
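
For example:

anova api00 yr_rnd some_col yr_rnd*some_col, continuous(some_col)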

As we illustrated above, we can compute the predicted values using the predict command and graph the separate regression lines.  These commands are omitted.

In this section we found that the relationship between some_col and api00 depended on whether the school is a  year round school or a non-year round school.  For the year round schools, the relationship between some_col and api00 was significantly stronger than for non-year round schools.  In general, this type of analysis allows you to test whether the strength of the relationship between two continuous variables varies based on the categorical variable.

3.8 Continuous and Categorical variables, interaction with 1/2/3 variable  

The prior examples showed how to do regressions with a continuous variable and a categorical variable that has 2 levels.  These examples will extend this further by using a categorical variable with 3 levels, mealcat.   

3.8.1 Using xi

We can use the xi command to run a model with some_col, mealcat and the interaction of these two variables.
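
For example:

xi: regress api00 i.mealcat*some_col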

The interaction now has two terms (_ImeaXsome~2 and _ImeaXsome~3).  To get an overall test of this interaction, we can use the test command.  
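
For example, using the (abbreviated) interaction variable names as they appear in the output:

test _ImeaXsome~2 _ImeaXsome~3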

These results indicate that the overall interaction is indeed significant.  This means that the regression lines from the 3 groups differ significantly. As we have done before, let's compute the predicted values and make a graph of the predicted values so we can see how the regression lines differ.

Since we had three groups, we get three regression lines, one for each category of mealcat. The solid line is for group 1, the dashed line for group 2, and the dotted line is for group 3.

Group 1 was the omitted group, therefore the slope of the line for group 1 is the coefficient for some_col, which is -.94. Indeed, this line has a downward slope. If we add the coefficient for some_col to the coefficient for _ImeaXsome~2 we get the slope for group 2, i.e., 3.14 + -.94 yields 2.2. Indeed, group 2 shows an upward slope. Likewise, if we add the coefficient for some_col to the coefficient for _ImeaXsome~3 we get the slope for group 3, i.e., 2.6 + -.94 yields 1.66. So, the slopes for the 3 groups are

group 1: -0.94
group 2:  2.2
group 3:  1.66

The test of the coefficient for _ImeaXsome~2 tested whether the coefficient for group 2 differed from group 1, and indeed this was significant.  Likewise, the test of the coefficient for _ImeaXsome~3 tested whether the coefficient for group 3 differed from group 1, and indeed this was significant.  What did the test of the coefficient some_col test?  This coefficient represents the coefficient for group 1, so this tested whether the coefficient for group 1 (-0.94) was significantly different from 0.  This is probably a non-interesting test.

The comparisons in the above analyses don't seem to be as interesting as comparing group 1 vs. 2 and then comparing group 2 vs. 3.  These successive comparisons seem much more interesting. We can do this by making group 2 the omitted group, and then each group would be compared to group 2.  As we have done before, we will use the char command to indicate that we want group 2 to be the omitted category and then rerun the regression.
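
For example:

char mealcat[omit] 2
xi: regress api00 i.mealcat*some_col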

Now, the test of _ImeaXsome~1 tests whether the coefficient for group 1 differs from group 2, and it does.  Then, the test of _ImeaXsome~3 tests whether the coefficient for group 3 significantly differs from group 2, and it does not. This makes sense given the graph and given the estimates of the coefficients that we have, that -.94 is significantly different from 2.2 but 2.2 is not significantly different from 1.66.

3.8.2 Using anova

We can perform the same analysis using the anova command, as shown below.  The anova command gives us somewhat less flexibility since we cannot choose which group is the omitted group.
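
For example:

anova api00 mealcat some_col mealcat*some_col, continuous(some_col)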

Because the anova command omits the 3rd category, and the analysis we showed above omitted the second category, the parameter estimates will not be the same. You can compare the results from below with the results above and see that the parameter estimates are not the same.  Because group 3 is dropped, that is the reference category and all comparisons are made with group 3.

These analyses showed that the relationship between some_col and api00 varied, depending on the level of mealcat.  In comparing group 1 with group 2, the coefficient for some_col was significantly different, but there was no difference in the coefficient for some_col in comparing groups 2 and 3.

3.9 Summary

This chapter covered four techniques for analyzing data with categorical variables: 1) manually constructing indicator variables, 2) creating indicator variables using the xi command, 3) coding variables using xi3, and 4) using the anova command. Each method has its advantages and disadvantages, as described below.

Manually constructing indicator variables can be very tedious and even error prone. For very simple models, it is not very difficult to create your own indicator variables, but if you have categorical variables with many levels and/or interactions of categorical variables, it can be laborious to manually create indicator variables. However, the advantage is that you can have quite a bit of control over how the variables are created and the terms that are entered into the model.

The xi command can really ease the creation of indicator variables, and it makes it easier to include interactions in your models by allowing you to include interaction terms such as i.prog*female. The xi command also gives you the flexibility to decide which category will be the omitted category (unlike the anova command).

The anova command eliminates the need to create indicator variables making it easy to include variables that have lots of categories, and making it easy to create interactions by allowing you to include terms like some_col*mealcat. It can be easier to perform tests of simple main effects with the anova command. However, the anova command is not flexible in letting you choose which category is the omitted category (the last category is always the omitted category).

As you will see in the next chapter, the regress command includes additional options like the robust option and the cluster option that allow you to perform analyses when you don't exactly meet the assumptions of ordinary least squares regression.  In such cases, the regress command offers features not available in the anova command and may be more advantageous to use.

See the Stata Topics: Regression page for more information and resources on regression with categorical predictors in Stata.

3.10 Self Assessment

1. Using the elemapi2 data file ( use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2 ) convert the variable ell into 2 categories using the following coding, 0-25 on ell becomes 0, and 26-100 on ell becomes 1. Use this recoded version of ell to predict api00 and interpret the results.

2. Convert the variable ell into 3 categories, coding those scoring 0-14 on ell as 1, those scoring 15-41 as 2, and those scoring 42-100 as 3. Do an analysis predicting api00 from the ell variable converted to a 1/2/3 variable. Interpret the results.

3. Do a regression analysis predicting api00 from yr_rnd and the ell variable converted to a 0/1 variable. Then create an interaction term and run the analysis again. Interpret the results of these analyses.

4. Do a regression analysis predicting api00 from ell coded as 0/1 (from question 1) and some_col, and the interaction of these two variables. Interpret the results, including showing a graph of the results.

5. Use the variable ell converted into 3 categories (from question 2) and predict api00 from ell in 3 categories, from some_col, and from the interaction of these two variables. Interpret the results, including showing a graph.

Click here for our answers to these self assessment questions.

3.11 For more information

How to cite this page

