Regression with SPSS
Chapter 3 - Regression with Categorical Predictors

Chapter Outline
    3.0 Regression with Categorical Predictors
    3.1 Regression with a 0/1 variable
    3.2 Regression with a 1/2 variable
    3.3 Regression with a 1/2/3 variable
    3.4 Regression with multiple categorical predictors
    3.5 Categorical predictor with interactions
    3.6 Continuous and Categorical variables
    3.7 Interactions of Continuous by 0/1 Categorical variables
    3.8 Continuous and Categorical variables, interaction with 1/2/3 variable
    3.9 Summary
    3.10 For more information

3.0 Introduction

In the previous two chapters, we focused on regression analyses using continuous variables. It is also possible to include categorical predictors in a regression analysis, but doing so requires some extra work both in performing the analysis and in properly interpreting the results. This chapter illustrates how you can use SPSS to include categorical predictors in your analysis and describes how to interpret the results of such analyses.

This chapter will use the elemapi2 data that you have seen in the prior chapters. We will focus on four variables: api00, some_col, yr_rnd and mealcat. The variable api00 is a measure of the performance of the students. The variable some_col is a continuous variable that measures the percentage of the parents of the children in the school who have attended college. The variable yr_rnd is a categorical variable that is coded 0 if the school is not year round and 1 if year round. The variable mealcat is derived from meals, the percentage of students who are receiving state-sponsored free meals, which can be used as an indicator of poverty. The meals variable was broken into 3 categories (to make roughly equally sized groups) to create mealcat.

3.1 Regression with a 0/1 variable

The simplest example of a categorical predictor in a regression analysis is a 0/1 variable, also called a dummy variable. Let's use the variable yr_rnd as an example of a dummy variable. We can include a dummy variable as a predictor in a regression analysis as shown below.
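A minimal sketch of the syntax for this model, assuming the elemapi2 data file is already open and using the same regression form that appears elsewhere in this chapter:

```spss
* Regress api00 on the 0/1 dummy variable yr_rnd.
regression
 /dep api00
 /method = enter yr_rnd.
```

The Coefficients table from this command supplies the constant and the coefficient for yr_rnd used in the equations that follow.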

This may seem odd at first, but this is a legitimate analysis. But what does this mean? Let's go back to basics and write out the regression equation that this model implies.

api00 = constant + Byr_rnd * yr_rnd 

where constant is the intercept and we use Byr_rnd to represent the coefficient for variable yr_rnd.  Filling in the values from the regression output, we get

api00 = 684.539 + -160.5064 * yr_rnd 

If a student is not in year-round school (i.e., yr_rnd is 0) the regression equation would simplify to

api00 = constant    + 0 * Byr_rnd 
api00 = 684.539     + 0 * -160.5064  
api00 = 684.539

If a student is in a year-round school (i.e., yr_rnd is 1), the regression equation would simplify to

api00 = constant + 1 * Byr_rnd 
api00 = 684.539  + 1 * -160.5064 
api00 = 524.0326
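One way to check this arithmetic is to apply the fitted equation to every school and summarize the result by group. This is only a sketch: pred_api is a hypothetical variable name, and the two constants are copied from the equation above.

```spss
* Apply the fitted equation to each case (pred_api is a hypothetical name).
compute pred_api = 684.539 + (-160.5064) * yr_rnd.
execute.

* pred_api should be 684.539 when yr_rnd is 0 and 524.0326 when yr_rnd is 1.
means
  tables = pred_api by yr_rnd.
```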

We can graph the observed values and the predicted values using the igraph command as shown below. Although yr_rnd only has 2 values, we can still draw a regression line showing the relationship between yr_rnd and api00.  Based on the results above, we see that the predicted value for non-year round schools is 684.539 and the predicted value for the year round schools is 524.032, and the slope of the line is negative, which makes sense since the coefficient for yr_rnd was negative (-160.5064).  Note that the "type = scale" option is needed here because yr_rnd is an ordinal variable in the dataset.

IGRAPH
 /X1 = VAR(yr_rnd) TYPE = scale
 /Y = VAR (api00) TYPE = SCALE
 /FITLINE METHOD = REGRESSION  LINEAR LINE = TOTAL MEFFECT
 /CATORDER VAR(yr_rnd) (ASCENDING VALUES  OMITEMPTY)
 /SCATTER COINCIDENT = NONE.
[Interactive graph: scatterplot of api00 (Y) against yr_rnd (X) with the fitted linear regression line]

Let's compare these predicted values to the mean api00 scores for the year-round and non-year-round students.
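A sketch of how these group means can be requested, assuming the same means command form used later in this chapter:

```spss
* Mean api00 for non-year-round (0) and year-round (1) schools.
means
  tables = api00 by yr_rnd.
```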

As you see, the regression equation predicts that the value of api00 will be the mean value of your group, depending on whether you went to year round school or non-year round school.

Let's relate these predicted values back to the regression equation. For the non-year-round students, their mean is the same as the intercept (684.539). The coefficient for yr_rnd is the amount we need to add to get the mean for the year-round students, i.e., we need to add -160.5064 to get 524.0326, the mean for the year-round students. In other words, Byr_rnd is the mean api00 score for the year-round students minus the mean api00 score for the non year-round students, i.e., mean(year-round) - mean(non year-round).

It may be surprising to note that this regression analysis with a single dummy variable is the same as doing a t-test comparing the mean api00 for the year-round students with the non year-round students (see below). You can see that the t-value below is the same as the t-value for yr_rnd in the regression above. This is because Byr_rnd compares the year-rounds and non year-rounds (since the coefficient is mean(year-round) - mean(non year-round)).
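A sketch of a t-test syntax for this comparison, assuming the standard t-test command with yr_rnd values 0 and 1 defining the two groups:

```spss
* Independent-samples t-test of api00 between yr_rnd = 0 and yr_rnd = 1.
t-test groups = yr_rnd(0 1)
 /variables = api00.
```

The sign of the t-value depends on which group is listed first, so it may be positive here while the regression t-value is negative; the magnitude is what should match.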

Since a t-test comparing two groups is equivalent to a one-way ANOVA, we can get the same results using the oneway command as well.  Note that in SPSS, when you click on "analyze" and "compare means," you can select a one-way ANOVA test.  The code for conducting a one-way ANOVA is shown below.  After this analysis, however, we will use the glm (for general linear model) command instead of the oneway command.

ONEWAY
  api00 BY yr_rnd.
ANOVA
api 2000

                 Sum of Squares   df    Mean Square    F         Sig.
Between Groups   1825000.563      1     1825000.563    116.241   .000
Within Groups    6248671.435      398   15700.179
Total            8073671.998      399

Remember that if you square the t-value, you will get the F-value:  10.7815**2 = 116.24074 , showing another way in which the t-test is the same as the ANOVA test.

3.2 Regression with a 1/2 variable

A categorical predictor variable does not have to be coded 0/1 to be used in a regression model. It is easier to understand and interpret the results from a model with dummy variables, but a variable coded 1/2 yields essentially the same results.

Let's make a copy of the variable yr_rnd called yr_rnd2 that is coded 1/2, 1=non year-round and 2=year-round.

compute yr_rnd2 = yr_rnd.
recode yr_rnd2 (0=1) (1=2).
execute.

REGRESSION
  /DEPENDENT api00
  /METHOD=ENTER yr_rnd2.
 

<some output omitted to save space>

Coefficients(a)

               Unstandardized          Standardized
               Coefficients            Coefficients
Model          B          Std. Error   Beta           t         Sig.
1 (Constant)   845.045    19.353                      43.664    .000
YR_RND2        -160.506   14.887       -.475          -10.782   .000
a Dependent Variable: api 2000

Note that the coefficient for yr_rnd2 is the same as the coefficient for yr_rnd. So, whether you code yr_rnd as 0/1 or as 1/2, the regression coefficient works out to be the same. However, the intercept is a bit less intuitive. When we used yr_rnd, the intercept was the mean for the non year-rounds. When using yr_rnd2, the intercept is the mean for the non year-rounds minus Byr_rnd2, i.e., 684.539 - (-160.506) = 845.045.
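To see that the 1/2 coding still reproduces the two group means, the fitted equation can be applied to each school. This is a sketch only: pred_api2 is a hypothetical variable name, and the constants are copied from the Coefficients table above.

```spss
* Apply the yr_rnd2 equation to each case (pred_api2 is a hypothetical name).
compute pred_api2 = 845.045 + (-160.506) * yr_rnd2.
execute.

* Expect 684.539 for yr_rnd2 = 1 and 524.033 for yr_rnd2 = 2.
means
  tables = pred_api2 by yr_rnd2.
```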

Note that you can use 0/1 or 1/2 coding and the results for the coefficient come out the same, but the interpretation of constant in the regression equation is different. It is often easier to interpret the estimates for 0/1 coding.

In summary, these results indicate that the api00 scores are significantly different for the students depending on the type of school they attend, year round school vs. non-year round school. Those who attend non-year round school have significantly higher scores. Based on the regression results, those who attend non-year round schools have scores that are 160.5 points higher than those who attend year-round schools.

3.3 Regression with a 1/2/3 variable

3.3.1 Manually Creating Dummy Variables

Say that we would like to examine the relationship between the amount of poverty and api scores. We don't have a measure of poverty, but we can use mealcat as a proxy for a measure of poverty. You might be tempted to try including mealcat in a regression like this.
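A sketch of the tempting (but inappropriate) model, assuming the same regression form used elsewhere in this chapter:

```spss
* Treats mealcat (coded 1, 2, 3) as if it were an interval variable.
regression
 /dep api00
 /method = enter mealcat.
```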

This looks at the linear effect of mealcat on api00, but mealcat is not an interval variable. Instead, you will want to code the variable so that all the information concerning the three levels is accounted for. You can dummy code mealcat like this.
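One possible way to write the dummy coding, offered as a sketch rather than the exact syntax originally used here. It relies on the fact that a logical expression in compute evaluates to 1 when true and 0 when false:

```spss
* Each logical expression evaluates to 1 when true and 0 when false.
compute mealcat1 = (mealcat = 1).
compute mealcat2 = (mealcat = 2).
compute mealcat3 = (mealcat = 3).
execute.
```

If mealcat were missing for a case, the three new variables would be missing for that case as well.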

We now have created mealcat1 that is 1 if mealcat is 1, and 0 otherwise. Likewise, mealcat2 is 1 if mealcat is 2, and 0 otherwise; and likewise mealcat3 was created. We can see this below.
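A quick way to inspect the new variables is to list them next to mealcat for a handful of cases. This is a sketch; the choice of the first 10 cases is arbitrary.

```spss
* Show the category and its three dummy variables for the first 10 cases.
list variables = mealcat mealcat1 mealcat2 mealcat3
 /cases = from 1 to 10.
```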

We can now use two of these dummy variables (mealcat2 and mealcat3) in the regression analysis.
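A sketch of the corresponding syntax, assuming mealcat2 and mealcat3 have been created as described above:

```spss
* Group 1 is the omitted (reference) group because mealcat1 is left out.
regression
 /dep api00
 /method = enter mealcat2 mealcat3.
```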

We can test the overall differences among the three groups by using the /method = test statement as shown below. This shows that the overall differences among the three groups are significant, with an F value of 611.121 and a p value of .000.
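A sketch of how the /method = test statement might be written for this model, with both dummy variables placed in one parenthesized set so that they are tested together:

```spss
* Joint test of mealcat2 and mealcat3 as a single set of predictors.
regression
 /dep api00
 /method = test (mealcat2 mealcat3).
```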

The interpretation of the coefficients is much like that for the binary variables. Group 1 is the omitted group, so the constant is the mean for group 1. The coefficient for mealcat2 is the mean for group 2 minus the mean of the omitted group (group 1), and the coefficient for mealcat3 is the mean of group 3 minus the mean of group 1. You can verify this by comparing the coefficients with the means of the groups, shown below.
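A sketch of how the group means can be obtained for this comparison. The values in the comments are the group means reported in the means output later in this chapter (805.72, 639.39 and 504.38).

```spss
* Group means of api00 for the three mealcat categories.
means
  tables = api00 by mealcat.

* With group 1 omitted, expect a constant near 805.72.
* Expect mealcat2 near 639.39 - 805.72 and mealcat3 near 504.38 - 805.72.
```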

Based on these results, we can say that the three groups differ in their api00 scores, and that in particular group 2 is significantly different from group 1 (because mealcat2 was significant) and group 3 is significantly different from group 1 (because mealcat3 was significant).

3.3.2 Using Do Loops

We can use the do repeat command to do the work for us to create the indicator (dummy) variables.  This method is particularly useful when you need to create many indicator variables.

DO REPEAT A=mealcat1 mealcat2 mealcat3 
 /B=1 2 3.
COMPUTE A=(mealcat=B).
END REPEAT.

We will then do a crosstab to verify that our indicator variables were created correctly.


crosstab /tables = mealcat by mealcat1
         /tables = mealcat by mealcat2
         /tables = mealcat by mealcat3.
 
Case Processing Summary

                                                   Cases
                                                   Valid           Missing         Total
                                                   N     Percent   N     Percent   N     Percent
Percentage free meals in 3 categories * MEALCAT1   400   100.0%    0     .0%       400   100.0%
Percentage free meals in 3 categories * MEALCAT2   400   100.0%    0     .0%       400   100.0%
Percentage free meals in 3 categories * MEALCAT3   400   100.0%    0     .0%       400   100.0%

Percentage free meals in 3 categories * MEALCAT1 Crosstabulation

Count
                                                            MEALCAT1
                                                            .00    1.00   Total
Percentage free meals in 3 categories   0-46% free meals           131    131
                                        47-80% free meals   132           132
                                        81-100% free meals  137           137
Total                                                       269    131    400

Percentage free meals in 3 categories * MEALCAT2 Crosstabulation

Count
                                                            MEALCAT2
                                                            .00    1.00   Total
Percentage free meals in 3 categories   0-46% free meals    131           131
                                        47-80% free meals          132    132
                                        81-100% free meals  137           137
Total                                                       268    132    400

Percentage free meals in 3 categories * MEALCAT3 Crosstabulation

Count
                                                            MEALCAT3
                                                            .00    1.00   Total
Percentage free meals in 3 categories   0-46% free meals    131           131
                                        47-80% free meals   132           132
                                        81-100% free meals         137    137
Total                                                       263    137    400

What if we wanted a different group to be the reference group? For example, let's omit group 3.
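A sketch of the syntax with group 3 omitted, matching the regression of api00 on mealcat1 and mealcat2 that appears in section 3.4:

```spss
* Group 3 is the reference group because mealcat3 is left out.
regression
 /dep api00
 /method = enter mealcat1 mealcat2.
```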

With group 3 omitted, the constant is now the mean of group 3 and mealcat1 is group1-group3 and mealcat2 is group2-group3. We see that both of these coefficients are significant, indicating that group 1 is significantly different from group 3 and group 2 is significantly different from group 3.

3.3.3 Using the glm command

We can also do this analysis using the glm command. The benefit of the glm command is that we don't need to manually create dummy variables, and it gives us the test of the overall effect of mealcat without needing to subsequently use the /method = test statement as we did with the regression command.

glm
 api00 by mealcat.
Between-Subjects Factors

                                              Value Label          N
Percentage free meals in 3 categories   1     0-46% free meals     131
                                        2     47-80% free meals    132
                                        3     81-100% free meals   137

Tests of Between-Subjects Effects

Dependent Variable: api 2000
Source            Type III Sum of Squares   df    Mean Square     F           Sig.
Corrected Model   6094197.670(a)            2     3047098.835     611.121     .000
Intercept         168847142.059             1     168847142.059   33863.695   .000
MEALCAT           6094197.670               2     3047098.835     611.121     .000
Error             1979474.328               397   4986.081
Total             175839633.000             400
Corrected Total   8073671.997               399
a R Squared = .755 (Adjusted R Squared = .754)

We can use the /print=parameter statement with the glm command to obtain the parameter estimates.  Note that the estimates are based on dummy coding with the last (third) category omitted, and correspond to the results shown above where the third category was omitted.

glm
 api00 by mealcat
 /print=parameter.
Between-Subjects Factors

                                              Value Label          N
Percentage free meals in 3 categories   1     0-46% free meals     131
                                        2     47-80% free meals    132
                                        3     81-100% free meals   137

Tests of Between-Subjects Effects

Dependent Variable: api 2000
Source            Type III Sum of Squares   df    Mean Square     F           Sig.
Corrected Model   6094197.670(a)            2     3047098.835     611.121     .000
Intercept         168847142.059             1     168847142.059   33863.695   .000
MEALCAT           6094197.670               2     3047098.835     611.121     .000
Error             1979474.328               397   4986.081
Total             175839633.000             400
Corrected Total   8073671.997               399
a R Squared = .755 (Adjusted R Squared = .754)

Parameter Estimates

Dependent Variable: api 2000
                                                     95% Confidence Interval
Parameter     B         Std. Error   t        Sig.   Lower Bound   Upper Bound
Intercept     504.380   6.033        83.606   .000   492.519       516.240
[MEALCAT=1]   301.338   8.629        34.922   .000   284.374       318.302
[MEALCAT=2]   135.014   8.612        15.677   .000   118.083       151.945
[MEALCAT=3]   0(a)      .            .        .      .             .
a This parameter is set to zero because it is redundant.

Note that the parameter estimates are the same as those from the regression command because mealcat is coded the same way in both cases: the last category (category 3) is dropped.

3.3.4 Other coding schemes

It is generally very convenient to use dummy coding, but that is not the only kind of coding that can be used. As you have seen, when you use dummy coding one of the groups becomes the reference group and all of the other groups are compared to that group. This may not be the most interesting set of comparisons.  Below is a list of the types of coding schemes that SPSS will create for you.  You can access these through the pull-down menus, or you can request them on the /CONTRAST statement when using GLM (described later).  First, we show you how to manually create the codes.

Deviation(refcat): The deviations from the grand mean. 
Difference: The difference or reverse Helmert contrast - compare levels of a factor with the mean of the previous levels of the factor.
Simple(refcat): Compare each level of a factor to the last level.
Helmert: Compare levels of a factor with the mean of the subsequent levels of the factor.
Polynomial: Orthogonal polynomial contrasts. 
Repeated: Adjacent levels of a factor. 
Special: A user-defined contrast.
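As a sketch of how one of the schemes listed above can be requested, the following asks glm for simple contrasts on mealcat with the first category as the reference. The choice of category 1 is only an illustration; the number in parentheses is the position of the reference category.

```spss
* Simple contrasts: compare levels 2 and 3 of mealcat with level 1.
glm
 api00 by mealcat
 /contrast (mealcat) = simple(1)
 /print = parameter.
```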

Let's create a variable that compares group 1 with 2 and another variable that compares group 2 with 3, and include those variables in the regression model.  In other words, we wish to create coefficients that are comparisons of successive groups, starting from group 1 (i.e., the first comparison comparing group 1 vs. 2, and the second comparison comparing group 2 vs. 3).  Below we show how to manually generate a coding scheme that forms these 2 comparisons.

if mealcat = 1 grp1 = .667.
if mealcat = 2 grp1 = -.333.
if mealcat = 3 grp1 = -.333.

if mealcat = 1 grp2 = .333.
if mealcat = 2 grp2 = .333.
if mealcat = 3 grp2 = -.667.
execute.

regression
 /dep = api00
 /method = enter grp1 grp2.
 
Variables Entered/Removed(b)
Model   Variables Entered   Variables Removed   Method
1       GRP2, GRP1(a)       .                   Enter
a All requested variables entered.
b Dependent Variable: api 2000

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .869(a)   .755       .754                70.612
a Predictors: (Constant), GRP2, GRP1

ANOVA(b)
Model          Sum of Squares   df    Mean Square    F         Sig.
1 Regression   6094197.670      2     3047098.835    611.121   .000(a)
Residual       1979474.328      397   4986.081
Total          8073671.997      399
a Predictors: (Constant), GRP2, GRP1
b Dependent Variable: api 2000

Coefficients(a)

               Unstandardized          Standardized
               Coefficients            Coefficients
Model          B          Std. Error   Beta           t         Sig.
1 (Constant)   649.820    3.531                       184.016   .000
GRP1           166.324    8.708        .549           19.099    .000
GRP2           135.014    8.612        .451           15.677    .000
a Dependent Variable: api 2000

We can perform this same series of comparisons much more easily using the glm command with the contrast statement.

glm
 api00 by mealcat
 /contrast (mealcat)=repeated
 /print = parameter TEST(LMATRIX).
Between-Subjects Factors

                                              Value Label          N
Percentage free meals in 3 categories   1     0-46% free meals     131
                                        2     47-80% free meals    132
                                        3     81-100% free meals   137

Tests of Between-Subjects Effects

Dependent Variable: api 2000
Source            Type III Sum of Squares   df    Mean Square     F           Sig.
Corrected Model   6094197.670(a)            2     3047098.835     611.121     .000
Intercept         168847142.059             1     168847142.059   33863.695   .000
MEALCAT           6094197.670               2     3047098.835     611.121     .000
Error             1979474.328               397   4986.081
Total             175839633.000             400
Corrected Total   8073671.997               399
a R Squared = .755 (Adjusted R Squared = .754)

Parameter Estimates

Dependent Variable: api 2000
                                                     95% Confidence Interval
Parameter     B         Std. Error   t        Sig.   Lower Bound   Upper Bound
Intercept     504.380   6.033        83.606   .000   492.519       516.240
[MEALCAT=1]   301.338   8.629        34.922   .000   284.374       318.302
[MEALCAT=2]   135.014   8.612        15.677   .000   118.083       151.945
[MEALCAT=3]   0(a)      .            .        .      .             .
a This parameter is set to zero because it is redundant.

Intercept

              Contrast
Parameter     L1
Intercept     1.000
[MEALCAT=1]   .333
[MEALCAT=2]   .333
[MEALCAT=3]   .333
The default display of this matrix is the transpose of the corresponding L matrix.
Based on Type III Sums of Squares.

MEALCAT

              Contrast
Parameter     L2    L3
Intercept     0     0
[MEALCAT=1]   1     0
[MEALCAT=2]   0     1
[MEALCAT=3]   -1    -1
The default display of this matrix is the transpose of the corresponding L matrix.
Based on Type III Sums of Squares.

Contrast Coefficients (L' Matrix)

              Percentage free meals in 3 categories Repeated Contrast
Parameter     Level 1 vs. Level 2   Level 2 vs. Level 3
Intercept     0                     0
[MEALCAT=1]   1                     0
[MEALCAT=2]   -1                    1
[MEALCAT=3]   0                     -1
The default display of this matrix is the transpose of the corresponding L matrix.

Contrast Results (K Matrix)

                                                                            Dependent Variable
Percentage free meals in 3 categories Repeated Contrast                     api 2000
Level 1 vs. Level 2   Contrast Estimate                                     166.324
                      Hypothesized Value                                    0
                      Difference (Estimate - Hypothesized)                  166.324
                      Std. Error                                            8.708
                      Sig.                                                  .000
                      95% Confidence Interval for Difference: Lower Bound   149.203
                      95% Confidence Interval for Difference: Upper Bound   183.444
Level 2 vs. Level 3   Contrast Estimate                                     135.014
                      Hypothesized Value                                    0
                      Difference (Estimate - Hypothesized)                  135.014
                      Std. Error                                            8.612
                      Sig.                                                  .000
                      95% Confidence Interval for Difference: Lower Bound   118.083
                      95% Confidence Interval for Difference: Upper Bound   151.945

Test Results

Dependent Variable: api 2000
Source     Sum of Squares   df    Mean Square    F         Sig.
Contrast   6094197.670      2     3047098.835    611.121   .000
Error      1979474.328      397   4986.081

If you compare the contrast estimates with the means shown below, you can verify that the first estimate (Level 1 vs. Level 2, the same value as the coefficient for grp1) is the mean of group 1 minus the mean of group 2, and the second estimate (Level 2 vs. Level 3, the same value as the coefficient for grp2) is the mean of group 2 minus the mean of group 3.  Both of these comparisons are significant, indicating that group 1 significantly differs from group 2, and group 2 significantly differs from group 3.
MEANS
  TABLES=api00 BY mealcat.
 
Case Processing Summary

                                                   Cases
                                                   Included        Excluded        Total
                                                   N     Percent   N     Percent   N     Percent
api 2000 * Percentage free meals in 3 categories   400   100.0%    0     .0%       400   100.0%

Report

api 2000
Percentage free meals in 3 categories   Mean     N     Std. Deviation
0-46% free meals                        805.72   131   65.669
47-80% free meals                       639.39   132   82.135
81-100% free meals                      504.38   137   62.727
Total                                   647.62   400   142.249

3.4 Regression with two categorical predictors

Previously we looked at using yr_rnd to predict api00

    regression
     /dep api00
     /method = enter yr_rnd.
    Variables Entered/Removed(b)
    Model   Variables Entered      Variables Removed   Method
    1       year round school(a)   .                   Enter
    a All requested variables entered.
    b Dependent Variable: api 2000

    Model Summary
    Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
    1       .475(a)   .226       .224                125.300
    a Predictors: (Constant), year round school

    ANOVA(b)
    Model          Sum of Squares   df    Mean Square    F         Sig.
    1 Regression   1825000.563      1     1825000.563    116.241   .000(a)
    Residual       6248671.435      398   15700.179
    Total          8073671.997      399
    a Predictors: (Constant), year round school
    b Dependent Variable: api 2000

    Coefficients(a)

                        Unstandardized          Standardized
                        Coefficients            Coefficients
    Model               B          Std. Error   Beta           t         Sig.
    1 (Constant)        684.539    7.140                       95.878    .000
    year round school   -160.506   14.887       -.475          -10.782   .000
    a Dependent Variable: api 2000

And we have also looked at mealcat using the regression command

    regression
     /dep api00
     /method =  enter mealcat1 mealcat2.
     
    Variables Entered/Removed(b)
    Model   Variables Entered       Variables Removed   Method
    1       MEALCAT2, MEALCAT1(a)   .                   Enter
    a All requested variables entered.
    b Dependent Variable: api 2000

    Model Summary
    Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
    1       .869(a)   .755       .754                70.612
    a Predictors: (Constant), MEALCAT2, MEALCAT1

    ANOVA(b)
    Model          Sum of Squares   df    Mean Square    F         Sig.
    1 Regression   6094197.670      2     3047098.835    611.121   .000(a)
    Residual       1979474.328      397   4986.081
    Total          8073671.997      399
    a Predictors: (Constant), MEALCAT2, MEALCAT1
    b Dependent Variable: api 2000

    Coefficients(a)

                   Unstandardized          Standardized
                   Coefficients            Coefficients
    Model          B          Std. Error   Beta           t         Sig.
    1 (Constant)   504.380    6.033                       83.606    .000
    MEALCAT1       301.338    8.629        .995           34.922    .000
    MEALCAT2