Statistical Computing Seminar Proc Logistic and Logistic Regression Models

• Introduction
• Binary Logistic Regression
• Exact Logistic Regression
• Generalized Logits Model - Multinomial Logistic Regression
• Proportional Odds Model - Ordinal Logistic Regression

Introduction

Logistic regression describes the relationship between a categorical response variable and a set of predictor variables. A categorical response variable can be a binary variable, an ordinal variable or a nominal variable. Each type of categorical variables requires different techniques to model its relationship with the predictor variables. In this seminar, we illustrate how to perform different types of analyses using SAS proc logistic. For a binary response variable, such as a response to a yes-no question, a commonly used model is the logistic regression model. We also touch the surface of exact logistic regression, which is very useful when the sample size is too small or the events are too sparse. For a nominal response variable, such as Democrats, Republicans and Independents, we can fit a generalized logits model. For an ordinal response variable, such as low, medium and high, we can fit it to a proportional odds model.

Logistic Regression Models

In this section, we will use the High School and Beyond data set, hsb2.sas7bdat to describe what a logistic model is, how to perform a logistic regression model analysis and how to interpret the model. Our dependent variable is created as a dichotomous variable indicating if a student's writing score is higher than or equal to 52. We call it hiwrite. The predictor variables will include prog, female and other test scores. Our data set has 200 observations.

data hsb2;
set hsb2;
hiwrite = write >=52;
run;
proc means data = hsb2 mean std;
run;
Variable            Mean         Std Dev
----------------------------------------
ID           100.5000000      57.8791845
FEMALE         0.5450000       0.4992205
RACE           3.4300000       1.0394722
SES            2.0550000       0.7242914
SCHTYP         1.1600000       0.3675260
PROG           2.0250000       0.6904772
WRITE         52.7750000       9.4785860
MATH          52.6450000       9.3684478
SCIENCE       51.8500000       9.9008908
SOCST         52.4050000      10.7357935
hiwrite        0.6300000       0.4840159
----------------------------------------

Let π be the probability of scoring higher than 51 in writing test. The odds is π/(1-π). For example, the overall probability of scoring higher than 51 is .63. The odds will be .63/(1-.63) = 1.703. A logistic regression model describes a linear relationship between the logit, which is the log of odds, and a set of predictors.

logit(π) = log(π/(1-π)) = α + β1*x1 + + ... + βk*xk = α + x β

We can either interpret the model using the logit scale, or we can convert the log of odds back to the probability such that

β)).

The advantage of using the logit scale for interpretation is that the relationship between the logit and the predictors is a linear relationship. But sometimes it is easier to interpret the model in terms of probabilities. Then we have to keep in mind that the relationship between the probabilities and the predictors is not a linear relationship. For more details on odds ratio, please see our FAQ page on how to interpret odds ratios in logistic regression.

A Simple Model

Let's consider the model where female is the only predictor. We will use this example to understand the concepts of odds and odds ratios and to understand how they are related to the parameter estimates.

proc logistic data = hsb2 ;
model hiwrite (event='1') = female ;
ods output ParameterEstimates = model_female;
run;
             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1      0.0220      0.2097        0.0110        0.9165
FEMALE        1      0.9928      0.3016       10.8369        0.0010

Notice that we can specify which event to model using the event = option in the model statement. This is new in SAS 8.2. The other way of specifying that we want to model 1 as event instead of 0 is to use the descending option in the proc logistic statement.  One thing that is worth noticing is the use of quotes in the option event = '1'. Even though, the variable hiwrite is a numeric variable, it is still necessary to surround 1 with a pair of quotes. It comes handy when the outcome variable is coded as a character variable. Using the ODS output statement, we created a data set called model_female containing the parameter estimates shown above. We can then use the data set to create the odds and odds ratio.

data model_fem;
set model_female;
o = exp(estimate);
run;
proc print data = model_fem;
var variable estimate o;
run;
Obs    Variable     Estimate        o
 1     Intercept      0.0220    1.02222
2     FEMALE         0.9928    2.69865

The intercept has a parameter estimate of .022. This is the estimated logit when female = 0, that is when the student is a male student. Therefore, the odds = exp(logit) = exp(.0220) = 1.02222 is the estimated odds for a male student to score 52 or higher in writing test. The coefficient for variable female is .9928. That means that for a one unit increase in female (that is changing from male to female) the expected change in log of odds is .9928.  We can also interpret it in the scale of odds ratio. The odds for a male student is exp() = exp(.022) and the odds for a female student is exp(.022 + .9928*1). Therefore, taking the ratio of these two odds, we get the odds ratio for female versus male is exp(.9928)  = 2.699. In terms of probabilities, the probability for females to score 52 or higher on the writing test is exp(.022 + .9928) / (1 + exp(.022 + .9928)) = .734. The probability for males is exp(.022 )/(1 + exp(.022)) = .505.

With this simple example, we can actually compute the odds ratio from the 2x2 table of hiwrite*female.

proc freq data = hsb2;
tables hiwrite*female /nocum nopercent;
run;
hiwrite     FEMALE
Frequency|
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
0 |     45 |     29 |     74
|  60.81 |  39.19 |
|  49.45 |  26.61 |
---------+--------+--------+
1 |     46 |     80 |    126
|  36.51 |  63.49 |
|  50.55 |  73.39 |
---------+--------+--------+
Total          91      109      200

For example, for males, the odds is 46/45 = 1.022, which is the exponentiated value of the intercept from the model. The odds ratio for females versus males is (80/29)/(46/45) = 2.699. It is usually written as a cross-product (45*80)/(29*46) = 2.699. This is the exponentiated value of the parameter estimate for variable female.

A Model with a Continuous Predictor and a Categorical Predictor

Let's now take a look at a model with both a continuous variable math and a categorical variable female as predictors. We will focus on how to interpret the parameter estimate for the continuous variable.

proc logistic data = hsb2;
model hiwrite (event='1') = female math;
output out = m2 p = prob xbeta = logit;
run;
             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1    -10.3651      1.5535       44.5153        <.0001
FEMALE        1      1.6304      0.4052       16.1922        <.0001
MATH          1      0.1979      0.0293       45.5559        <.0001

           Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate      Confidence Limits
FEMALE       5.106       2.308      11.298
MATH         1.219       1.151       1.291

The interpretation for the parameter estimate of  math is very similar to that for the categorical variable female. In terms of logit scale, we can say that for every unit increase in the math score, the logit will increase by .198, holding everything else constant. We can also say that for a one unit increase in math score, the odds of scoring 51 or higher in writing test increases by (1.219-1)*100% = 22%.

We used an output statement to create a data set containing the predicted probabilities based on the model. We can compare the linear predictions and the probabilities in terms of the math scores for the males and females.

proc sort data = m2;
by math;
run;
symbol1 i = join v=star l=32  c = black;
symbol2 i = join v=circle l = 1 c=black;
proc gplot data = m2;
plot logit*math = female;
plot prob*math = female;
run;
quit;
Sometimes, a one unit change may not be a desirable scale to use. We can ask SAS to give us odds ratio for different units of change. For example, it may make more sense to talk about change of every 5 units in math score. This can be done using units statement. We also include the option clodds = wald to the model statement so that the confidence interval will also be calculated for the odds ratio calculated in the unit statement. Of course, you can always manually compute the odds ratio for every 5 units change in math score as 1.219^5 = 2.69.
proc logistic data = hsb2 ;
model hiwrite (event='1') = female math /clodds=wald;
units math = 5;
run;

Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate      Confidence Limits
FEMALE       5.106       2.308      11.298
MATH         1.219       1.151       1.291

    Wald Confidence Interval for Adjusted Odds Ratios
Effect         Unit     Estimate     95% Confidence Limits
MATH         5.0000        2.689        2.018        3.584

Other Features of Proc Logistic

We will illustrate other features of proc logistic by using a model with more predictors. We will include categorical variables prog and female, continuous variables math and read. This model is merely for the purpose of demonstrating proc logistic, not really a model developed based on any theory.

proc logistic data = hsb2 ;
class prog (ref='1') /param = ref;
model hiwrite (event='1') = female read math prog ;
run;

Response Profile
 Ordered                      Total
Value      hiwrite     Frequency
       1            0            74
2            1           126
Probability modeled is hiwrite=1.

   Class Level Information
                      Design
Class     Value     Variables
PROG      1          0      0
2          1      0
3          0      1

              Analysis of Maximum Likelihood Estimates
                                 Standard          Wald
Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept       1    -12.3140      2.0374       36.5311        <.0001
FEMALE          1      1.9576      0.4533       18.6541        <.0001
READ            1      0.1037      0.0298       12.1453        0.0005
MATH            1      0.1310      0.0329       15.8738        <.0001
PROG      2     1      0.2721      0.4889        0.3098        0.5778
PROG      3     1     -0.5776      0.5478        1.1116        0.2917


CLASS statement

Notice that we have used the class statement for variable prog. SAS will create dummy variables for a categorical variable on-the-fly. There are various coding schemes from which to choose. The default coding for all the categorical variables in proc logistic is the effect coding. Here we changed it to dummy coding by using the param = ref option. We can specify the comparison group by using ref = option after the variable name. There are other coding schemes available, such as orthogonal polynomial coding scheme and reference cell coding. We can double check what coding scheme is used and which group is the reference group by looking at the Class Level Information part of the output.

CONTRAST statement

In the parameter estimates, we only see the comparison of level 2 vs. 1 and level 3 vs. 1 for variable prog. If we want to compare level 2 vs. level 3, we can use the contrast statement. Usually, contrast is done using less than full rank, reference cell coding as used in proc glm. We chose this type of coding by using param = glm option in the class statement. We also used estimate option at the end of contrast statement to get the estimate of the difference between group 1 and group 2. It is always a good idea to check the Class Level Information to see how the variable is coded so we know that the contrast statement gives us the expected contrast among groups.
proc logistic data = hsb2 ;
class prog /param = glm ;
model hiwrite (event='1') = female read math prog;
contrast '1 vs 2 of prog' prog 1 -1 0 / estimate;
run;
      Class Level Information
Class     Value     Design Variables
PROG      1          1      0      0
2          0      1      0
3          0      0      1
              Contrast Test Results
                                Wald
Contrast            DF    Chi-Square    Pr > ChiSq
1 vs 2 of prog       1        0.3098        0.5778

                 Contrast Rows Estimation and Testing Results
                                          Standard
Contrast        Type       Row  Estimate     Error   Alpha   Confidence Limits
1 vs 2 of prog  PARM         1   -0.2721    0.4889    0.05   -1.2303    0.6861
      Contrast Rows Estimation and Testing Results
                                      Wald
Contrast        Type       Row  Chi-Square    Pr > ChiSq
1 vs 2 of prog  PARM         1      0.3098        0.5778

TEST Statement

The parameter estimates offers all the one degree of freedom test on each of the parameters. We can also test the combined effect of multiple parameters using the test statement. In the example below, we first tested on the joint effect of read and math. Next we tested on the hypothesis that the effect of read and math are the same.

proc logistic data = hsb2 ;
class prog(ref='1') /param = ref;
model hiwrite (event='1') = prog female read math;
run;

Linear Hypotheses Testing Results
                         Wald
Label             Chi-Square      DF    Pr > ChiSq
 test_read_math       37.2236       2        <.0001
test_equal            0.3041       1        0.5813

LACKFIT and RSQUARE Option

The Hosmer-Lemeshow test of goodness-of-fit can be performed by using the lackfit option after the model statement. This test divides subjects into deciles based on predicted probabilities, then computes a chi-square from observed and expected frequencies. It tests the null hypothesis that there is no difference between the observed and predicted values of the response variable. Therefore, when the test is not significant, as in this example, we can not reject the null hypothesis and say that the model fits the data well. We can also request the generalized R-square measure for the model by using rsquare option after the model statement. SAS gives the likelihood-based pseudo R-square measure and its rescaled measure. Categorical Data Analysis Using The SAS System, by M. Stokes, C. Davis and G. Koch offers more details on how the generalized R-square measures that you can request are constructed and how to interpret them.
proc logistic data = hsb2;
class prog(ref='1') /param = ref;
model hiwrite(event='1') = female prog read math / rsq lackfit;
run;

Model Fit Statistics
                             Intercept
Intercept            and
Criterion          Only     Covariates
AIC             265.582        167.035
SC              268.881        186.825
-2 Log L        263.582        155.035

R-Square    0.4188    Max-rescaled R-Square    0.5720

             Partition for the Hosmer and Lemeshow Test
                             hiwrite = 1             hiwrite = 0
Group       Total    Observed    Expected    Observed    Expected
1          20           0        1.08          20       18.92
2          20           4        3.45          16       16.55
3          20           9        6.54          11       13.46
4          21          10       10.86          11       10.14
5          20          13       13.64           7        6.36
6          20          17       15.70           3        4.30
7          20          16       17.44           4        2.56
8          20          18       18.76           2        1.24
9          20          20       19.61           0        0.39
10          19          19       18.90           0        0.10

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square       DF     Pr > ChiSq
    5.2766        8         0.7276

Influence Statistics

One important topic in logistic regression is regression diagnostics. Proc logistic can generate a lot of diagnostic measures for detecting outliers and influential data points for a binary outcome variable. These diagnostic measures can be requested by using the output statement. For example, we can request for residual deviance, the hat matrix diagonal and residual chi-squared deviance and the difference between chi-square goodness-of-fit  when an observation is deleted. We can then plot these variables against the predicted values to investigate the influence of each point on the model. By using the pointlabel option in the symbol statement, we can see that the observation with id = 187 has the highest influence on the chi-square goodness-of-fit.

proc logistic data = hsb2 ;
class prog(ref='1') /param = ref;
model hiwrite(event='1') = female prog read math ;
output out=dinf prob=p resdev=dr h=pii reschi=pr difchisq=difchi;
run;

goptions reset = all;
symbol1 pointlabel = ("#id" h=1 )  value=none;
proc gplot data = dinf;
plot difchi*p;
run;
quit;

Scoring a New Data Set

There are situations where we want to produce predicted probabilities for a specific combination of the values of the predictors. For example, we may want to know the predicted probabilities for groups defined by female and prog when math and read are held at their grand means. Let's first create a data set with the groups and grand means for math and read.

proc sql;
create table gdata as
select distinct female, (prog=2) as prog2,(prog=3) as prog3,
from hsb2;
quit;
proc print data = gdata;
run;
Obs    FEMALE    prog2    prog3     read     math
 1        0        0        0      52.23    52.645
2        0        0        1      52.23    52.645
3        0        1        0      52.23    52.645
4        1        0        0      52.23    52.645
5        1        0        1      52.23    52.645
6        1        1        0      52.23    52.645

We can use SAS proc score to generate the linear predicted values and then compute the odds or probabilities afterwards. Notice that the score procedure does not care what model we have run. It uses the estimated parameters to generate linear predictions. In our logistic regression case, the predicted values are therefore in the logit scale. In the output data set created by proc score, we have a variable called hiwrite.  This is the new variable that proc score created for predicted values.

proc logistic data = hsb2 outest=mg;
class prog(ref='1') /param = ref;
model hiwrite(event='1') = female prog read math ;
run;
*Scoring the data set to get the linear predictions;
proc score data=gdata score=mg out=gpred type=parms;
var female prog2 prog3 read math;
run;
data gpred;
set gpred;
odds = exp(hiwrite);
p_1 = odds /(1+odds);
p_0 = 1 - p_1;
run;
proc print data = gpred;
run;
Obs  FEMALE  prog2  prog3   read   math    hiwrite    odds     p_1      p_0
 1      0      0      0    52.23  52.645   0.00012  1.00012  0.50003  0.49997
2      0      0      1    52.23  52.645  -0.57747  0.56132  0.35952  0.64048
3      0      1      0    52.23  52.645   0.27223  1.31289  0.56764  0.43236
4      1      0      0    52.23  52.645   1.95774  7.08332  0.87629  0.12371
5      1      0      1    52.23  52.645   1.38016  3.97552  0.79902  0.20098
6      1      1      0    52.23  52.645   2.22986  9.29856  0.90290  0.09710

Remarks: This process will be simplified with SAS 9.0 and above with the new statement score in proc logistic. The syntax one will use looks like the following:

proc sql;
create table gdata1 as
select distinct female, ses,
from hsb2;
quit;
proc logistic data = hsb2 outmodel=mg1;
class prog(ref='1') /param = ref;
model hiwrite(event='1') = female prog read math  ;
run;
proc logistic inmodel=mg1;
score data = gdata1 out=gpred1;
run;
proc print data = gpred1;
run;

Exact Logistic Regression

All of the models we have inspected so far require large sample sizes. When the data sets are too small or when the event occurs very infrequently,  the maximum likelihood method may not work or may not provide reliable estimates. Exact logistic regression provides a way to get around these difficulties. What it does is to enumerate the exact distributions of the parameters of interest, conditional on the remaining parameters. Here is a simple example from Performing Exact Logistic Regression with the SAS System. The data set has very small cells, with each cell having only 3 observations. Let's run the exact logistic regression on this data set.

data dose;
input dose deaths total;
datalines;
0 0 3
1 0 3
2 0 3
3 0 3
4 1 3
5 2 3
;
run;
proc logistic data = dose desc;
model deaths/total = dose;
exact dose /estimate = both;
run;

Model Fit Statistics
                             Intercept
Intercept            and
Criterion          Only     Covariates
AIC              18.220         12.072
SC               19.111         13.853
-2 Log L         16.220          8.072

        Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square       DF     Pr > ChiSq
Likelihood Ratio         8.1478        1         0.0043
Score                    5.7943        1         0.0161
Wald                     2.7249        1         0.0988

             Analysis of Maximum Likelihood Estimates
                               Standard          Wald
Parameter    DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept     1     -9.4745      5.5677        2.8958        0.0888
dose          1      2.0804      1.2603        2.7249        0.0988

           Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate      Confidence Limits
dose         8.007       0.677      94.679

Exact Conditional Analysis
             Conditional Exact Tests
                                   --- p-Value ---
Effect   Test          Statistic    Exact      Mid
dose     Score            5.4724   0.0245   0.0190
Probability      0.0110   0.0245   0.0190

               Exact Parameter Estimates
                            95% Confidence
Parameter    Estimate           Limits           p-Value
dose           1.8000      0.1157      5.8665     0.0245

                  Exact Odds Ratios
                          95% Confidence
Parameter   Estimate          Limits          p-Value
dose           6.049      1.123    353.000     0.0245

Notice first of all that the syntax for model statement is slight different than we have seen so far. This is the syntax used for grouped data. That is we have frequencies of the events for each of the cells. This type of syntax works for both the maximum likelihood logistic regression and exact logistic regression. With estimate = both, we request that both the parameters and the odds ratios are being estimated.

Generalized Logits Model for Multinomial Logistic Models

Proc logistic also perform analysis on nominal response variables. Since the response variable no longer has the ordering, we can no longer fit a proportional odds model to our data. But we can fit a generalized logits model. This analysis can be done using proc catmod and that is how it is used to be done. SAS 8.2 added some new features to its proc logistic and now proc logistic does analysis on nominal responses with ease. In this section, we are going to use a data file called school used in Categorical Data Analysis Using The SAS System, by M. Stokes, C. Davis and G. Koch. We will illustrate what a generalized logits model is and how to perform an analysis using proc logistic.

data school;
input school program  style \$ count;
cards;
1 1 self  10
1 1 team  17
1 1 class 26
1 2 self   5
1 2 team  12
1 2 class 50
2 1 self  21
2 1 team  17
2 1 class 26
2 2 self  16
2 2 team  12
2 2 class 36
3 1 self  15
3 1 team  15
3 1 class 16
3 2 self  12
3 2 team  12
3 2 class 20
;
run;

In this data set, three different teaching styles have been implemented in teaching third grade math. School children in experimental learning settings were surveyed to determine which teaching styles they preferred. The response variable style takes three values: class, self and team. We want to determine the preference of students by their schools and programs. The programs are regular and after-school programs with 1 being regular and 2 being after-school.

In a generalized logit model, we will pick a particular category of responses as the baseline reference and compare every other category with the baseline response. In our example, we will choose team as the baseline category. Consider the probabilities:

class',
π2 = probability of 'Preferring self',
π3 = probability of 'Preferring team'.

The generalized logits are defined as

logit(θ2) = log(π23).

The generalized logits model for our example is then defined as

βi,

where i = 1 and 2 indicating the two logits. This means that we allow two different sets of regression parameters, one for each logit.

Saturated Model Example

In this data set, there are three schools and two types of programs. That is, for each of the preference choices there are possible six cell counts. If we use both school and program and also include their interaction, we will use up all the degrees of freedom. That is we have a saturated model. This is the best model we can get, fitting each cell with its own parameter.

proc logistic data=school order = internal;
freq count;
class school program / order = data;
model style = school program school*program /link = glogit scale = none aggregate;
run;
The LOGISTIC Procedure
               Model Information
Data Set                      WORK.SCHOOL
Response Variable             style
Number of Response Levels     3
Number of Observations        18
Frequency Variable            count
Sum of Frequencies            338
Model                         generalized logit
Optimization Technique        Fisher's scoring

          Response Profile
 Ordered                      Total
Value     style        Frequency
       1     class              174
2     self                79
3     team                85
Logits modeled use style='team' as the reference category.

    Class Level Information
                        Design
Class       Value     Variables
school      1          1      0
2          0      1
3         -1     -1
program     1          1
2         -1

                    Model Convergence Status
         Convergence criterion (GCONV=1E-8) satisfied.

       Deviance and Pearson Goodness-of-Fit Statistics
Criterion          Value       DF     Value/DF     Pr > ChiSq
Deviance          0.0000        0        .              .
Pearson           0.0000        0        .              .
Number of unique profiles: 6
         Model Fit Statistics
                             Intercept
Intercept            and
Criterion          Only     Covariates
AIC             699.404        689.156
SC              707.050        735.033
-2 Log L        695.404        665.156

        Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square       DF     Pr > ChiSq
Likelihood Ratio        30.2480       10         0.0008
Score                   28.3738       10         0.0016
Wald                    25.6828       10         0.0042

            Type 3 Analysis of Effects
                                Wald
Effect              DF    Chi-Square    Pr > ChiSq
school               4       14.5522        0.0057
program              2       10.4815        0.0053
school*program       4        1.7439        0.7827


We have included most parts of the output from SAS, excluding the parameter estimates. The Deviance and Pearson Goodness-of-Fit Statistics output is new here. They were requested by using option scale = none aggregate. Because our model is saturated, the goodness-of-fit statistics are zero with zero degree of freedom. We also see that the default type of coding scheme, e.g. effect coding, that proc logistic has for categorical variables. We also see that the overall effect of the interaction of school and program is not significant. This leads us to a simpler model with only the main effect.

Model With Only Main Effect

proc logistic data=school order = internal;
freq count;
class school program /order = data;
model style = school program  /link = glogit scale = none aggregate;
run;
 Odds Ratio Estimates
                              Point          95% Wald
Effect            style    Estimate      Confidence Limits
school  1 vs 3    class       1.926       0.990       3.747
school  1 vs 3    self        0.517       0.228       1.175
school  2 vs 3    class       1.609       0.820       3.155
school  2 vs 3    self        1.276       0.620       2.626
program 1 vs 2    class       0.476       0.280       0.809
program 1 vs 2    self        1.005       0.538       1.877

We will focus on the interpretation of parameters. For example the odds ratio of class to team for program1 versus program 2 is .476. We can say that the odds for students in program 1 to choose class over team is .476 times the odds for students in program 2. Or we can say that the odds for students in program 1 to choose class over team is .524 times less than the odds for students in program 2. Similarly, we can say that the odds for students in school 1 to choose class over team  is 1.926 times the odds for students in school 3. Or we can say that the odds for students in school 1 to choose class over team is .926 times more than the odds for students in school 3. It is oftentimes easier to describe in terms of probabilities. We can use the output statement to generate these probabilities as shown below.

proc logistic data=school order = internal;
freq count;
class school program ;
model style = school program  / link = glogit;
output out = smodel p=prob;
run;
proc freq data = smodel;
where school = 1 or school = 2;
format prob 5.4;
tables school*program*_level_*prob /list nopercent nocum;
run;
school    program    _LEVEL_     prob    Frequency
--------------------------------------------------
1          1    class      .5371           3
1          1    self       .1580           3
1          1    team       .3049           3
1          2    class      .7095           3
1          2    self       .0989           3
1          2    team       .1917           3
2          1    class      .3924           3
2          1    self       .3409           3
2          1    team       .2667           3
2          2    class      .5764           3
2          2    self       .2372           3
2          2    team       .1864           3

Proportional Odds Model for Ordinal Logistic Models

The proportional odds model is also referred as the logit version of an ordinal regression model. It extends logistic regression to handle ordinal response variables. In this section, we are going to use SAS data set ordwarm2.sas7bdat to illustrate what a proportional odds model is and how to perform a proportional odds model analysis.

Let's first take a look at the data set. This data set is taken from Regression Models For Categorical Dependent Variables Using Stata by Long and Freese. Each subject in the data set was asked to evaluate the following statement: "A working mother can establish just as warm and secure of a relationship with her child as a mother who does not work". The response is recoded in a variable called warm. It has four levels: 1 = Strongly Disagree (SD), 2 = Disagree (D), 3 = Agree (A) and 4 = Strongly Agree (SA). This will be the response variable in our analysis. Other variables in the data set include age, education level, gender of the subject, and other subject related variables.

options nocenter nodate label;
proc contents data = ordwarm2;
run;
The CONTENTS Procedure
Data Set Name: WORK.ORDWARM2                          Observations:         2293
Member Type:   DATA                                   Variables:            10

-----Alphabetic List of Variables and Attributes-----
 #    Variable    Type    Len    Pos    Label
------------------------------------------------------------------------------
2    age         Num       3      8    Age in years
3    ed          Num       3     11    Years of education
5    male        Num       3     17    Gender: 1=male 0=female
4    prst        Num       3     14    Occupational prestige
1    warm        Num       8      0    Mom can have warm relations with child
8    warmlt2     Num       3     26    1=SD; 0=D,A,SA
9    warmlt3     Num       3     29    1=SD,D; 0=A,SA
10    warmlt4     Num       3     32    1=SD,D,A; 0=SA
7    white       Num       3     23    Race: 1=white  0=not white
6    yr89        Num       3     20    Survey year: 1=1989 0=1977

We are interested in building up a model to describe the relationship between the response variable warm and some of the explanatory variables, such as the age, level of education and race. Let's consider the probabilities
θ1 = π1, probability of  'Strongly Disagree',
θ2 = π1 + π2,  probability of 'Strongly Disagree' or 'Disagree',
θ3 = π1 + π2 + π3, probability of 'Not Strongly Agree',

where
π1 = probability of 'Strongly Disagree',
π2 = probability of 'Disagree',
π3 = probability of 'Agree',
π4 = probability of 'Strongly Agree'.

logit(θ1)  =  log( θ1/(1 - θ1)) =   log(π1/(π2 + π3 + π4)),
logit(θ2)  =  log( θ2/(1 - θ2)) =  log((π1 + π2)/(π3 + π4)),
logit(θ3)  =  log( θ3/(1 - θ3)) =  log((π1 + π2 + π3))/π4).

Thus we allow the intercept to be different for different cumulative logit functions, but the effect of the explanatory variables will be the same across different logit functions. That is, we allow different αs for each of the cumulative odds, but only one set of βs for all the cumulative odds. This is the proportionality assumption and this is why this type model is called proportional odds model. Also notice that although this is a model in terms of cumulative odds, we can always recover the probabilities of each response category as follows.

π2 = θ2 - θ1
π3 = θ3 - θ2
π4 = 1 - θ3

An Example With Only Categorical Predictors

In SAS, a proportional odds model analysis can be performed using proc logistic with the option link = clogit. Here clogit stands for cumulative logit. In this example, we are going to use only categorical predictors, white (1=white  0=not white) and male (1=male 0=female), and we will focus more on the interpretation of the regression coefficients. Our model can be written as ) = αi + β1*white + β2*male, i = 1, 2, 3. The formula for the odds is shown in the table below. For example, we can see that the odds ratio for males versus females is exp(β2) and the odds ratio for the whites versus non-whites is exp(β1).

 Race             Gender SD vs. all other choices SD and D vs. all other choices SD, D and A vs. SA White             Male exp(+ β1+ β2) + β1+ β2) + β1 + β2) White             Female + β1) + β1) + β1) Non-White     Male + β2) + β2) + β2) None-White   Female
proc logistic data = ordwarm2 ;
model warm =  white male /link = clogit;
run;
          Response Profile
 Ordered                      Total
Value         warm     Frequency
       1            1           297
2            2           723
3            3           856
4            4           417
Probabilities modeled are cumulated over the lower Ordered Values.

              Analysis of Maximum Likelihood Estimates
                                 Standard          Wald
Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept 1     1     -2.5550      0.1277      400.0337        <.0001
Intercept 2     1     -0.8417      0.1159       52.7347        <.0001
Intercept 3     1      0.9326      0.1167       63.8455        <.0001
white           1      0.3422      0.1163        8.6594        0.0033
male            1      0.6450      0.0774       69.5178        <.0001

           Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate      Confidence Limits
white        1.408       1.121       1.769
male         1.906       1.638       2.218

From the output above, we can say that males have 1.906 times greater odds of somewhat disagreeing with the statement as females, no matter at what level.

A Example with a Continuous Predictor

In this example, we are going to use a continuous predictor, age and show how to output the predicted values and how to graph them. The output statement below requests that SAS output predicted probabilities and the linear predictions and save them to a data set. Based on the proportionality assumption, we should expect that the lines for the linear predictions will be parallel to each other.

proc logistic data = ordwarm2 ;
model warm =  age /link = clogit;
output out = pred p=p xbeta=linp;
run;
proc sort data = pred;
by age;
run;

goptions reset = all ;
symbol i = join w=.4 ;
proc gplot data = pred;
plot p*age=_level_;
plot linp*age=_level_;
run;
quit;

Another Example -- Proportional Odds Assumption Test and Goodness of Fit

• Proportionality Assumption

In general, we can model the cumulative odds model in such as a way that each of the cumulative odds has its own regression model:

A proportional odds model simplifies the model so that the effects of the predictors are the same across different levels. This is the main assumption of the model. We need to test this assumption. That is, we need  test the hypothesis that  β1 = β2 = β3. SAS proc logistic performs a score test to test this hypothesis. Let's look at the model with male and white as predictors again.

proc logistic data = ordwarm2 ;
model warm =  white male /link = clogit;
run;
Score Test for the Proportional Odds Assumption
Chi-Square       DF     Pr > ChiSq
21.6592        4         0.0002

The p-value is really small, so we have to reject the null hypothesis of proportionality. The degrees of freedom is calculated as k*(r-2), where k is the number of predictors and r is the number of levels of response variables. In our example, we have two predictors and four levels of responses. It is not uncommon for a model not to satisfy the proportionality assumption (which is also called parallel regression assumption). When the test fails, other alternative models should be considered, such as multinomial logistic models.

• A Model with More Predictors

Now let's take a look at a model where we use white, age and ed as our predictors. We also add options scale = none aggregate to get the goodness of fit tests. The deviance and Pearson tests compare the current model with the saturated model. This test being not significant tells us our model fits the data well.

proc logistic data = ordwarm2 ;
model warm =  white age ed /link = clogit scale=none aggregate;
run;
          Response Profile
 Ordered                      Total
Value         warm     Frequency
       1            1           297
2            2           723
3            3           856
4            4           417
Probabilities modeled are cumulated over the lower Ordered Values.
Score Test for the Proportional Odds Assumption

Chi-Square       DF     Pr > ChiSq
10.3962        6         0.1089
       Deviance and Pearson Goodness-of-Fit Statistics
Criterion        DF          Value     Value/DF     Pr > ChiSq
Deviance       2628      2523.3191       0.9602         0.9271
Pearson        2628      2588.2232       0.9849         0.7062
Number of unique profiles: 878
         Model Fit Statistics
                              Intercept
Intercept         and
Criterion        Only        Covariates
AIC             5997.541       5841.101
SC              6014.754       5875.526
-2 Log L        5991.541       5829.101
        Testing Global Null Hypothesis: BETA=0
Test                 Chi-Square       DF     Pr > ChiSq
Likelihood Ratio       162.4403        3         <.0001
Score                  157.6156        3         <.0001
Wald                   157.7599        3         <.0001

Analysis of Maximum Likelihood Estimates
                                 Standard          Wald
Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq
Intercept 1     1     -2.0393      0.2342       75.8321        <.0001
Intercept 2     1     -0.2698      0.2294        1.3829        0.2396
Intercept 3     1      1.5397      0.2318       44.1185        <.0001
white           1      0.4376      0.1176       13.8470        0.0002
age             1      0.0179     0.00241       54.8182        <.0001
ed              1     -0.0933      0.0129       52.6988        <.0001
           Odds Ratio Estimates
             Point          95% Wald
Effect    Estimate      Confidence Limits
white        1.549       1.230       1.951
age          1.018       1.013       1.023
ed           0.911       0.888       0.934

Summary

This seminar illustrates how to perform binary logistic, exact logistic, multinomial logistic (generalized logits model) and ordinal logistic (proportional odds model) regression analysis using SAS proc logistic. It focuses on some new features of proc logistic available since SAS 8.1.

Some References:

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.