Help the Stat Consulting Group by giving a gift

Proc Logistic and Logistic Regression Models

- Introduction
- Binary Logistic Regression
- Exact Logistic Regression
- Generalized Logits Model - Multinomial Logistic Regression
- Proportional Odds Model - Ordinal Logistic Regression

Logistic regression describes the relationship between a categorical response
variable and a set of predictor variables. A categorical response variable can
be a binary variable, an ordinal variable or a nominal variable. Each type of
categorical variables requires different techniques to model its relationship
with the predictor variables. In this seminar, we illustrate how to perform
different types of analyses using SAS **proc logistic**. For a binary
response variable, such as a response to a yes-no question, a commonly used
model is the logistic regression model. We also touch the surface of exact
logistic regression, which is very useful when the sample size is too small or
the events are too sparse. For a nominal response variable, such as Democrats,
Republicans and Independents, we can fit a generalized logits model. For an
ordinal response variable, such as low, medium and high, we can fit it to a
proportional odds model.

In this section, we will use the High School and Beyond data set,
hsb2.sas7bdat to describe what a logistic model is, how to perform a logistic
regression model analysis and how to interpret the model. Our dependent variable
is created as a dichotomous variable indicating if a student's writing score is
higher than or equal to 52. We call it **hiwrite**. The predictor variables will include
**prog**, **female** and other test scores. Our data set has 200 observations.

data hsb2; set hsb2; hiwrite = write >=52; run;

proc means data = hsb2 mean std; run;

Variable Mean Std Dev ---------------------------------------- ID 100.5000000 57.8791845 FEMALE 0.5450000 0.4992205 RACE 3.4300000 1.0394722 SES 2.0550000 0.7242914 SCHTYP 1.1600000 0.3675260 PROG 2.0250000 0.6904772 READ 52.2300000 10.2529368 WRITE 52.7750000 9.4785860 MATH 52.6450000 9.3684478 SCIENCE 51.8500000 9.9008908 SOCST 52.4050000 10.7357935 hiwrite 0.6300000 0.4840159 ----------------------------------------

Let π be the probability of scoring higher than 51 in writing test. The odds is π/(1-π). For example, the overall probability of scoring higher than 51 is .63. The odds will be .63/(1-.63) = 1.703. A logistic regression model describes a linear relationship between the logit, which is the log of odds, and a set of predictors.

logit(π) = log(π/(1-π)) = α +
β_{1}***x**_{1} +
+ ... +
β_{k}***x**_{k} = α + **x** **β**

We can either interpret the model using the logit scale, or we can convert the log of odds back to the probability such that

**β**)).

The advantage of using the logit scale for interpretation is that the relationship between the logit and the predictors is a linear relationship. But sometimes it is easier to interpret the model in terms of probabilities. Then we have to keep in mind that the relationship between the probabilities and the predictors is not a linear relationship. For more details on odds ratio, please see our FAQ page on how to interpret odds ratios in logistic regression.

Let's consider the model where **female** is the only predictor. We will
use this example to understand the concepts of odds and odds ratios and to
understand how they
are related to the parameter estimates.

proc logistic data = hsb2 ; model hiwrite (event='1') = female ; ods output ParameterEstimates = model_female; run;

Analysis of Maximum Likelihood Estimates

Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 0.0220 0.2097 0.0110 0.9165 FEMALE 1 0.9928 0.3016 10.8369 0.0010

Notice that we can specify which event to model using the **event = **
option in the model statement. This is new in SAS
8.2. The other way of specifying that we want to model 1 as event instead of 0
is to use the **descending** option in the **proc logistic** statement. One
thing that is worth noticing is the use of quotes in the option **event = '1'**.
Even though, the variable **hiwrite** is a numeric variable, it is still
necessary to surround 1 with a pair of quotes. It comes handy when the outcome
variable is coded as a character variable. Using the **ODS** output statement, we
created a data set called **model_female** containing the parameter estimates shown
above. We can then use the data set to create the odds and odds ratio.

data model_fem; set model_female; o = exp(estimate); run; proc print data = model_fem; var variable estimate o; run;

Obs Variable Estimate o

1 Intercept 0.0220 1.02222 2 FEMALE 0.9928 2.69865

The intercept has a parameter estimate of .022. This is the estimated logit
when female = 0, that is when the student is a male student. Therefore, the
odds = exp(logit) = exp(.0220) = 1.02222 is the estimated odds for a male student
to score 52 or higher in writing test. The coefficient for variable **female**
is .9928. That means that for a one unit increase in **female** (that is changing from male to
female) the expected change in log of odds is .9928. ** **We can also
interpret it in the scale of odds ratio. The odds for a male student is exp()
= exp(.022) and the odds for a female student is exp(.022
+ .9928*1). Therefore, taking the ratio of these two odds, we get the odds ratio for female versus male is exp(.9928)
= 2.699.
In terms of probabilities, the probability
for females to score 52 or higher on the writing test is exp(.022 + .9928) /
(1 + exp(.022 + .9928)) = .734. The probability for males is exp(.022 )/(1 +
exp(.022)) = .505.

With this simple example, we can actually
compute the odds ratio from the 2x2 table of **hiwrite*female**.

proc freq data = hsb2; tables hiwrite*female /nocum nopercent; run;

hiwrite FEMALE

Frequency| Row Pct | Col Pct | 0| 1| Total ---------+--------+--------+ 0 | 45 | 29 | 74 | 60.81 | 39.19 | | 49.45 | 26.61 | ---------+--------+--------+ 1 | 46 | 80 | 126 | 36.51 | 63.49 | | 50.55 | 73.39 | ---------+--------+--------+ Total 91 109 200

For example, for males, the odds is 46/45 = 1.022, which is the exponentiated
value of the intercept from the model. The odds ratio for females versus males
is (80/29)/(46/45) = 2.699. It is usually written as a cross-product (45*80)/(29*46) = 2.699.
This is the exponentiated value of the parameter
estimate for variable **female**.

Let's now take a look at a model with both a continuous variable
**math** and a categorical variable **female** as predictors. We will
focus on how to interpret the parameter estimate for the continuous variable.

proc logistic data = hsb2; model hiwrite (event='1') = female math; output out = m2 p = prob xbeta = logit; run;

Analysis of Maximum Likelihood Estimates

Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 -10.3651 1.5535 44.5153 <.0001 FEMALE 1 1.6304 0.4052 16.1922 <.0001 MATH 1 0.1979 0.0293 45.5559 <.0001

Odds Ratio Estimates

Point 95% Wald Effect Estimate Confidence Limits

FEMALE 5.106 2.308 11.298 MATH 1.219 1.151 1.291

The interpretation for the parameter estimate of **math** is very
similar to that for the categorical variable **female**. In terms of logit scale, we can say
that for every unit increase in the math score, the logit will increase by
.198, holding everything else constant. We can also say that for a one unit
increase in math score, the odds of scoring 51 or higher in writing test
increases by (1.219-1)*100% = 22%.

We used an **output** statement to create a data set containing the
predicted probabilities based on the model. We can compare the
linear predictions and the probabilities in terms of the math scores for the males and females.

proc sort data = m2; by math; run;

Sometimes, a one unit change may not be a desirable scale to use. We can ask SAS to give us odds ratio for different units of change. For example, it may make more sense to talk about change of every 5 units in math score. This can be done usingsymbol1 i = join v=star l=32 c = black; symbol2 i = join v=circle l = 1 c=black; proc gplot data = m2; plot logit*math = female; plot prob*math = female; run; quit;

proc logistic data = hsb2 ; model hiwrite (event='1') = female math /clodds=wald; units math = 5; run;Odds Ratio Estimates

Point 95% Wald Effect Estimate Confidence Limits FEMALE 5.106 2.308 11.298 MATH 1.219 1.151 1.291

Wald Confidence Interval for Adjusted Odds Ratios

Effect Unit Estimate 95% Confidence Limits MATH 5.0000 2.689 2.018 3.584

We will illustrate other features of **proc logistic** by using a model with more
predictors. We will include categorical variables **prog** and **female**,
continuous variables **math** and **read**. This model is merely for the
purpose of demonstrating **proc logistic**, not really a model developed
based on any theory.

proc logistic data = hsb2 ; class prog (ref='1') /param = ref; model hiwrite (event='1') = female read math prog ; run;Response Profile

Ordered Total Value hiwrite Frequency

1 0 74 2 1 126

Probability modeled is hiwrite=1.

Class Level Information

Design Class Value Variables

PROG 1 0 0 2 1 0 3 0 1

Analysis of Maximum Likelihood Estimates

Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -12.3140 2.0374 36.5311 <.0001 FEMALE 1 1.9576 0.4533 18.6541 <.0001 READ 1 0.1037 0.0298 12.1453 0.0005 MATH 1 0.1310 0.0329 15.8738 <.0001 PROG 2 1 0.2721 0.4889 0.3098 0.5778 PROG 3 1 -0.5776 0.5478 1.1116 0.2917

**CLASS statement
**

Notice that we have used the

**CONTRAST statement**

proc logistic data = hsb2 ; class prog /param = glm ; model hiwrite (event='1') = female read math prog; contrast '1 vs 2 of prog' prog 1 -1 0 / estimate; run;

Class Level Information

Class Value Design Variables

PROG 1 1 0 0 2 0 1 0 3 0 0 1

Contrast Test Results

Wald Contrast DF Chi-Square Pr > ChiSq

1 vs 2 of prog 1 0.3098 0.5778

Contrast Rows Estimation and Testing Results

Standard Contrast Type Row Estimate Error Alpha Confidence Limits

1 vs 2 of prog PARM 1 -0.2721 0.4889 0.05 -1.2303 0.6861

Contrast Rows Estimation and Testing Results

Wald Contrast Type Row Chi-Square Pr > ChiSq

1 vs 2 of prog PARM 1 0.3098 0.5778

**TEST Statement**

The parameter estimates offers all the one degree of freedom test on each of
the parameters. We can also test the combined effect of multiple parameters
using the **test** statement. In the example below, we first tested on the joint effect of read and math.
Next we tested on the hypothesis that the effect of read and math are the same.

proc logistic data = hsb2 ; class prog(ref='1') /param = ref; model hiwrite (event='1') = prog female read math; test_read_math: test read, math; test_equal: test read = math; run;Linear Hypotheses Testing Results

Wald Label Chi-Square DF Pr > ChiSq

test_read_math 37.2236 2 <.0001 test_equal 0.3041 1 0.5813

**LACKFIT and RSQUARE Option**

proc logistic data = hsb2; class prog(ref='1') /param = ref; model hiwrite(event='1') = female prog read math / rsq lackfit; run;Model Fit Statistics

Intercept Intercept and Criterion Only Covariates AIC 265.582 167.035 SC 268.881 186.825 -2 Log L 263.582 155.035

R-Square 0.4188 Max-rescaled R-Square 0.5720

Partition for the Hosmer and Lemeshow Test

hiwrite = 1 hiwrite = 0 Group Total Observed Expected Observed Expected 1 20 0 1.08 20 18.92 2 20 4 3.45 16 16.55 3 20 9 6.54 11 13.46 4 21 10 10.86 11 10.14 5 20 13 13.64 7 6.36 6 20 17 15.70 3 4.30 7 20 16 17.44 4 2.56 8 20 18 18.76 2 1.24 9 20 20 19.61 0 0.39 10 19 19 18.90 0 0.10

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square DF Pr > ChiSq

5.2766 8 0.7276

**Influence Statistics**

One important topic in logistic regression is regression diagnostics. **Proc logistic** can
generate a lot of diagnostic measures for detecting outliers and influential
data points for a binary outcome variable. These diagnostic measures can be
requested by using the **output** statement. For example, we can request for
residual deviance, the hat matrix diagonal and residual chi-squared deviance
and the difference between chi-square goodness-of-fit when an observation is deleted. We can
then plot these variables against the predicted values to investigate the
influence of each point on the model. By using the **pointlabel** option in
the **symbol** statement, we can see that the observation with id = 187 has
the highest influence on the chi-square goodness-of-fit.

proc logistic data = hsb2 ; class prog(ref='1') /param = ref; model hiwrite(event='1') = female prog read math ; output out=dinf prob=p resdev=dr h=pii reschi=pr difchisq=difchi; run; goptions reset = all; symbol1 pointlabel = ("#id" h=1 ) value=none; proc gplot data = dinf; plot difchi*p; run; quit;

**Scoring a New Data Set**

There are situations where we want to produce predicted probabilities for a
specific combination of the values of the predictors. For example, we may want
to know the predicted probabilities for groups defined by **female** and **
prog**
when **math** and **read** are held at their grand means. Let's first
create a data set with the groups and grand means for **math** and **read**.

proc sql; create table gdata as select distinct female, (prog=2) as prog2,(prog=3) as prog3, mean(read) as read, mean(math) as math from hsb2; quit; proc print data = gdata; run;

Obs FEMALE prog2 prog3 read math

1 0 0 0 52.23 52.645 2 0 0 1 52.23 52.645 3 0 1 0 52.23 52.645 4 1 0 0 52.23 52.645 5 1 0 1 52.23 52.645 6 1 1 0 52.23 52.645

We can use SAS **proc score** to generate the linear predicted values and then
compute the odds or probabilities afterwards. Notice that the **score**
procedure does not care what model we have run. It uses the estimated
parameters to generate linear predictions. In our logistic regression case, the
predicted values are therefore in the logit scale. In the output data set
created by **proc score**, we have a variable called **hiwrite**.
This is the new variable that proc score created for predicted values.

proc logistic data = hsb2 outest=mg; class prog(ref='1') /param = ref; model hiwrite(event='1') = female prog read math ; run; *Scoring the data set to get the linear predictions; proc score data=gdata score=mg out=gpred type=parms; var female prog2 prog3 read math; run; data gpred; set gpred; odds = exp(hiwrite); p_1 = odds /(1+odds); p_0 = 1 - p_1; run; proc print data = gpred; run;

Obs FEMALE prog2 prog3 read math hiwrite odds p_1 p_0

1 0 0 0 52.23 52.645 0.00012 1.00012 0.50003 0.49997 2 0 0 1 52.23 52.645 -0.57747 0.56132 0.35952 0.64048 3 0 1 0 52.23 52.645 0.27223 1.31289 0.56764 0.43236 4 1 0 0 52.23 52.645 1.95774 7.08332 0.87629 0.12371 5 1 0 1 52.23 52.645 1.38016 3.97552 0.79902 0.20098 6 1 1 0 52.23 52.645 2.22986 9.29856 0.90290 0.09710

Remarks: This process will be simplified with SAS 9.0 and above with the new statementscoreinproc logistic. The syntax one will use looks like the following:proc sql; create table gdata1 as select distinct female, ses, mean(read) as read, mean(math) as math from hsb2; quit; proc logistic data = hsb2 outmodel=mg1; class prog(ref='1') /param = ref; model hiwrite(event='1') = female prog read math ; run; proc logistic inmodel=mg1; score data = gdata1 out=gpred1; run; proc print data = gpred1; run;

All of the models we have inspected so far require large sample sizes. When the data sets are too small or when the event occurs very infrequently, the maximum likelihood method may not work or may not provide reliable estimates. Exact logistic regression provides a way to get around these difficulties. What it does is to enumerate the exact distributions of the parameters of interest, conditional on the remaining parameters. Here is a simple example from Performing Exact Logistic Regression with the SAS System. The data set has very small cells, with each cell having only 3 observations. Let's run the exact logistic regression on this data set.

data dose; input dose deaths total; datalines; 0 0 3 1 0 3 2 0 3 3 0 3 4 1 3 5 2 3 ; run;

proc logistic data = dose desc; model deaths/total = dose; exact dose /estimate = both; run;Model Fit Statistics

Intercept Intercept and Criterion Only Covariates AIC 18.220 12.072 SC 19.111 13.853 -2 Log L 16.220 8.072

Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq Likelihood Ratio 8.1478 1 0.0043 Score 5.7943 1 0.0161 Wald 2.7249 1 0.0988

Analysis of Maximum Likelihood Estimates

Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq Intercept 1 -9.4745 5.5677 2.8958 0.0888 dose 1 2.0804 1.2603 2.7249 0.0988

Odds Ratio Estimates

Point 95% Wald Effect Estimate Confidence Limits dose 8.007 0.677 94.679

Exact Conditional Analysis

Conditional Exact Tests

--- p-Value --- Effect Test Statistic Exact Mid dose Score 5.4724 0.0245 0.0190 Probability 0.0110 0.0245 0.0190

Exact Parameter Estimates

95% Confidence Parameter Estimate Limits p-Value dose 1.8000 0.1157 5.8665 0.0245

Exact Odds Ratios

95% Confidence Parameter Estimate Limits p-Value dose 6.049 1.123 353.000 0.0245

Notice first of all that the syntax for **model** statement is slight
different than we have seen so far. This is the syntax used for grouped data.
That is we have frequencies of the events for each of the cells. This type of
syntax works for both the maximum likelihood logistic regression and exact
logistic regression. With **estimate = both**, we request that both the parameters and the
odds ratios are being estimated.

**Proc logistic** also perform analysis on nominal response variables. Since the
response variable no longer has the ordering, we can no longer fit a
proportional odds model to our data. But we can fit a generalized logits model.
This analysis can be done using **proc catmod** and that is how it is used to
be done. SAS 8.2 added some new features to its **proc logistic** and now **proc logistic**
does analysis on nominal responses with ease. In this section, we are going to
use a data file called **school** used in *
Categorical Data Analysis Using The SAS System*, by M. Stokes, C. Davis
and G. Koch. We will illustrate what a generalized logits model is and how to
perform an analysis using **proc logistic**.

data school; input school program style $ count; cards; 1 1 self 10 1 1 team 17 1 1 class 26 1 2 self 5 1 2 team 12 1 2 class 50 2 1 self 21 2 1 team 17 2 1 class 26 2 2 self 16 2 2 team 12 2 2 class 36 3 1 self 15 3 1 team 15 3 1 class 16 3 2 self 12 3 2 team 12 3 2 class 20 ; run;

In this data set, three different teaching styles have been implemented in
teaching third grade math. School children in experimental learning settings
were surveyed to determine which teaching styles they preferred. The response
variable **style** takes three values: **class, self** and **team**.
We want to determine the preference of students by their schools and programs.
The programs are regular and after-school programs with 1 being regular and 2
being after-school.

In a generalized logit model, we will pick a particular category of responses
as the baseline reference and compare every other category with the baseline
response. In our example, we will choose **team** as the
baseline category. Consider the probabilities:

class',

π_{2} = probability of 'Preferring **self**',

π_{3} = probability of 'Preferring **team**'.

The generalized logits are defined as

logit(θ_{2}) = log(π_{2}/π_{3}).

The generalized logits model for our example is then defined as

β_{i},

where i = 1 and 2 indicating the two logits. This means that we allow two different sets of regression parameters, one for each logit.

We can calculate the generalized odds from the frequency table, similar to what we have done in the case of proportional odds model.

proc freq data = school; weight count; tables style /list chisq relrisk; ods output OneWayFreqs = test; run; data test1; set test; godds = frequency/85; run; proc print data = test1; var style frequency godds; run;

Obs style Frequency godds

1 class 174 2.04706 2 self 79 0.92941 3 team 85 1.00000

The other way of getting the same results is to run the generalized logits
model. In SAS, we can simply use **proc logistic** with the **link = glogit**
option.

proc logistic data = school order = internal; freq count; model style = /link = glogit ; ods output parameterestimates= odds; run; data odds1; set odds; odds = exp(estimate); run; proc print data = odds1; var response estimate odds; run;

Obs Response Estimate odds

1 class 0.7164 2.04706 2 self -0.0732 0.92941

In this data set, there are three schools and two types of programs. That is, for
each of the preference choices there are possible six cell counts. If we use both
**school** and **program** and also include their interaction, we will use up
all the degrees of freedom. That is we have a saturated model. This is the best
model we can get, fitting each cell with its own parameter.

proc logistic data=school order = internal; freq count; class school program / order = data; model style = school program school*program /link = glogit scale = none aggregate; run;

The LOGISTIC Procedure

Model Information

Data Set WORK.SCHOOL Response Variable style Number of Response Levels 3 Number of Observations 18 Frequency Variable count Sum of Frequencies 338 Model generalized logit Optimization Technique Fisher's scoring

Response Profile

Ordered Total Value style Frequency

1 class 174 2 self 79 3 team 85

Logits modeled use style='team' as the reference category.

Class Level Information

Design Class Value Variables

school 1 1 0 2 0 1 3 -1 -1

program 1 1 2 -1

Model Convergence Status

Convergence criterion (GCONV=1E-8) satisfied.

Deviance and Pearson Goodness-of-Fit Statistics

Criterion Value DF Value/DF Pr > ChiSq

Deviance 0.0000 0 . . Pearson 0.0000 0 . .

Number of unique profiles: 6

Model Fit Statistics

Intercept Intercept and Criterion Only Covariates

AIC 699.404 689.156 SC 707.050 735.033 -2 Log L 695.404 665.156

Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio 30.2480 10 0.0008 Score 28.3738 10 0.0016 Wald 25.6828 10 0.0042

Type 3 Analysis of Effects

Wald Effect DF Chi-Square Pr > ChiSq

school 4 14.5522 0.0057 program 2 10.4815 0.0053 school*program 4 1.7439 0.7827

We have included most parts of the output from SAS, excluding the parameter
estimates. The Deviance and Pearson Goodness-of-Fit Statistics output is new
here. They were requested by using option **scale = none aggregate**. Because our model is saturated, the goodness-of-fit statistics are
zero with zero degree of freedom. We also see that the default type of coding scheme,
e.g. effect coding, that **proc logistic** has
for categorical variables. We also see that the overall effect of the
interaction of **school** and **program** is not significant. This leads
us to a simpler model with only the main effect.

proc logistic data=school order = internal; freq count; class school program /order = data; model style = school program /link = glogit scale = none aggregate; run;

Odds Ratio Estimates

Point 95% Wald Effect style Estimate Confidence Limits

school 1 vs 3 class 1.926 0.990 3.747 school 1 vs 3 self 0.517 0.228 1.175 school 2 vs 3 class 1.609 0.820 3.155 school 2 vs 3 self 1.276 0.620 2.626 program 1 vs 2 class 0.476 0.280 0.809 program 1 vs 2 self 1.005 0.538 1.877

We will focus on the interpretation of parameters. For example the odds ratio of
**class** to **team** for program1 versus
program 2 is .476. We can say that the odds for students in program 1 to choose **
class **over** team **is .476 times the odds for students in program 2. Or we
can say that the odds for students in program 1 to choose class over team is .524
times less than the odds for students in program 2. Similarly, we can say that the
odds for students in school 1 to choose **class** over **team** is
1.926 times the odds for students in school 3. Or we can say that the odds for
students in school 1 to choose **class** over **team** is .926 times more
than the odds for students in school 3. It is oftentimes easier to
describe in terms of probabilities. We can use the **output** statement to generate
these probabilities as shown below.

proc logistic data=school order = internal; freq count; class school program ; model style = school program / link = glogit; output out = smodel p=prob; run; proc freq data = smodel; where school = 1 or school = 2; format prob 5.4; tables school*program*_level_*prob /list nopercent nocum; run;

school program _LEVEL_ prob Frequency -------------------------------------------------- 1 1 class .5371 3 1 1 self .1580 3 1 1 team .3049 3 1 2 class .7095 3 1 2 self .0989 3 1 2 team .1917 3 2 1 class .3924 3 2 1 self .3409 3 2 1 team .2667 3 2 2 class .5764 3 2 2 self .2372 3 2 2 team .1864 3

The proportional odds model is also referred as the logit version of an ordinal regression model. It extends logistic regression to handle ordinal response variables. In this section, we are going to use SAS data set ordwarm2.sas7bdat to illustrate what a proportional odds model is and how to perform a proportional odds model analysis.

Let's first take a look at the data set. This data set is taken from *
Regression Models For Categorical Dependent Variables Using Stata *by Long
and Freese. Each subject in the data set was asked to evaluate the following statement:
"A working mother can establish just as warm and secure of a relationship with
her child as a mother who does not work". The response is recoded in a variable called **
warm**. It has four levels:
1 = Strongly Disagree (SD), 2 = Disagree (D), 3 = Agree (A) and 4 = Strongly
Agree (SA). This will be the response variable in our analysis. Other variables in the data set include age, education level, gender
of the subject, and other subject related variables.

options nocenter nodate label; proc contents data = ordwarm2; run;

The CONTENTS Procedure

Data Set Name: WORK.ORDWARM2 Observations: 2293 Member Type: DATA Variables: 10 -----Alphabetic List of Variables and Attributes-----

# Variable Type Len Pos Label ------------------------------------------------------------------------------ 2 age Num 3 8 Age in years 3 ed Num 3 11 Years of education 5 male Num 3 17 Gender: 1=male 0=female 4 prst Num 3 14 Occupational prestige 1 warm Num 8 0 Mom can have warm relations with child 8 warmlt2 Num 3 26 1=SD; 0=D,A,SA 9 warmlt3 Num 3 29 1=SD,D; 0=A,SA 10 warmlt4 Num 3 32 1=SD,D,A; 0=SA 7 white Num 3 23 Race: 1=white 0=not white 6 yr89 Num 3 20 Survey year: 1=1989 0=1977

We are interested in building up a model to describe the relationship between the response variable
**warm **and some of the explanatory variables, such as the age, level of
education and race. Let's consider the probabilities

θ_{1} = π_{1}, probability of 'Strongly
Disagree',

θ_{2} = π_{1}
+ π_{2}, _{ }probability of 'Strongly Disagree' or
'Disagree',

θ_{3} = π_{1} + π_{2} + π_{3},
probability of 'Not Strongly Agree',

where

π_{1 }= probability of 'Strongly Disagree',

π_{2} = probability of 'Disagree',

π_{3} = probability of 'Agree',

π_{4} = probability of 'Strongly Agree'.

logit(θ_{1}) = log( θ_{1}/(1
- θ_{1})) = _{ }log(π_{1}/(π_{2}
+ π_{3} + π_{4})),

logit(θ_{2}) = log( θ_{2}/(1 - θ_{2}))
= log((π_{1} + π_{2})/(π_{3}
+ π_{4})),

logit(θ_{3}) = log( θ_{3}/(1 - θ_{3}))
= log((π_{1} + π_{2} + π_{3}))/π_{4}).

Thus we allow the intercept to be different for different cumulative logit functions, but the effect of the explanatory variables will be the same across different logit functions. That is, we allow different αs for each of the cumulative odds, but only one set of βs for all the cumulative odds. This is the proportionality assumption and this is why this type model is called proportional odds model. Also notice that although this is a model in terms of cumulative odds, we can always recover the probabilities of each response category as follows.

π_{2} = θ_{2 }- θ_{1
}π_{3} = θ_{3} - θ_{2
}π_{4} = 1 - θ_{3}

We can calculate the cumulative odds from the frequency table.

proc freq data = ordwarm2; table warm ; ods output onewayfreqs = test (keep = warm frequency cumfrequency); run; data test1; set test; if _n_ <=3 then odds = cumfrequency /(2293-cumfrequency); run; proc print data= test1; run;

Cum Obs warm Frequency Frequency odds

1 1 297 297 0.14880 2 2 723 1020 0.80126 3 3 856 1876 4.49880 4 4 417 2293 .

The other way of getting the same result is to run a proportional odds model with only the intercept as a predictor.

proc logistic data = ordwarm2 ; model warm = /link = clogit; ods output ParameterEstimates = model_based; run; data test2; set model_based; odds = exp(estimate); run; proc print data = test2 noobs; var variable estimate odds; run;

Variable Estimate odds

Intercept -1.9052 0.14880 Intercept -0.2216 0.80126 Intercept 1.5038 4.49880

In SAS, a proportional odds model analysis can be performed using **proc
logistic** with the option **link = clogit**. Here **clogit** stands for
cumulative logit. In this example, we are going to use only categorical
predictors, **white** (1=white 0=not white) and **male** (1=male
0=female), and we will focus more on the interpretation of the regression
coefficients. Our model can be written as )
= α_{i} + β_{1}***white** + β_{2}***male**, i = 1,
2, 3. The formula for the odds is shown in the table below. For example, we can
see that the odds ratio for males versus females is exp(β_{2}) and the
odds ratio for the whites versus non-whites is exp(β_{1}).

Race Gender |
SD vs. all other choices |
SD and D vs. all other choices | SD, D and A vs. SA |

White Male | exp(+
β_{1}+ β_{2}) |
+
β_{1}+ β_{2}) |
+
β_{1} + β_{2}) |

White Female | +
β_{1}) |
+
β_{1}) |
+
β_{1}) |

Non-White Male | +
β_{2}) |
+
β_{2}) |
+
β_{2}) |

None-White Female |

proc logistic data = ordwarm2 ; model warm = white male /link = clogit; run;

Response Profile

Ordered Total Value warm Frequency

1 1 297 2 2 723 3 3 856 4 4 417

Probabilities modeled are cumulated over the lower Ordered Values.

Analysis of Maximum Likelihood Estimates

Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 1 -2.5550 0.1277 400.0337 <.0001 Intercept 2 1 -0.8417 0.1159 52.7347 <.0001 Intercept 3 1 0.9326 0.1167 63.8455 <.0001 white 1 0.3422 0.1163 8.6594 0.0033 male 1 0.6450 0.0774 69.5178 <.0001

Odds Ratio Estimates

Point 95% Wald Effect Estimate Confidence Limits

white 1.408 1.121 1.769 male 1.906 1.638 2.218

From the output above, we can say that males have 1.906 times greater odds of somewhat disagreeing with the statement as females, no matter at what level.

In this example, we are going to use a continuous predictor, **age** and
show how to output the predicted values and how to graph them. The **output**
statement below requests that SAS output predicted probabilities and the linear
predictions and save them to a data set. Based on the proportionality
assumption, we should expect that the lines for the linear predictions will be
parallel to each other.

proc logistic data = ordwarm2 ; model warm = age /link = clogit; output out = pred p=p xbeta=linp; run;

proc sort data = pred; by age; run; goptions reset = all ; symbol i = join w=.4 ; proc gplot data = pred; plot p*age=_level_; plot linp*age=_level_; run; quit;

- Proportionality Assumption

In general, we can model the cumulative odds model in such as a way that each
of the cumulative odds has its own regression model:

A proportional odds model simplifies the model so
that the effects of the predictors are the same across different levels. This is
the main assumption of the model. We need to test this assumption. That is, we
need test the hypothesis that
β_{1 }=
β_{2 }=
β_{3}. SAS **proc logistic** performs a score test to test this
hypothesis. Let's look at the model with **male** and **white** as
predictors again.

proc logistic data = ordwarm2 ; model warm = white male /link = clogit; run;

Score Test for the Proportional Odds Assumption

Chi-Square DF Pr > ChiSq 21.6592 4 0.0002

The p-value is really small, so we have to reject the null hypothesis of proportionality. The degrees of freedom is calculated as k*(r-2), where k is the number of predictors and r is the number of levels of response variables. In our example, we have two predictors and four levels of responses. It is not uncommon for a model not to satisfy the proportionality assumption (which is also called parallel regression assumption). When the test fails, other alternative models should be considered, such as multinomial logistic models.

- A Model with More Predictors

Now let's take a look at a model where we use **white**, **age** and **
ed** as our predictors. We also add options **scale = none aggregate** to
get the goodness of fit tests. The deviance and Pearson tests compare the
current model with the saturated model. This test being not significant tells us
our model fits the data well.

proc logistic data = ordwarm2 ; model warm = white age ed /link = clogit scale=none aggregate; run;

Response Profile

Ordered Total Value warm Frequency

1 1 297 2 2 723 3 3 856 4 4 417

Probabilities modeled are cumulated over the lower Ordered Values.

Score Test for the Proportional Odds Assumption Chi-Square DF Pr > ChiSq 10.3962 6 0.1089

Deviance and Pearson Goodness-of-Fit Statistics

Criterion DF Value Value/DF Pr > ChiSq Deviance 2628 2523.3191 0.9602 0.9271 Pearson 2628 2588.2232 0.9849 0.7062

Number of unique profiles: 878

Model Fit Statistics

Intercept Intercept and Criterion Only Covariates

AIC 5997.541 5841.101 SC 6014.754 5875.526 -2 Log L 5991.541 5829.101

Testing Global Null Hypothesis: BETA=0

Test Chi-Square DF Pr > ChiSq

Likelihood Ratio 162.4403 3 <.0001 Score 157.6156 3 <.0001 Wald 157.7599 3 <.0001 Analysis of Maximum Likelihood Estimates

Standard Wald Parameter DF Estimate Error Chi-Square Pr > ChiSq

Intercept 1 1 -2.0393 0.2342 75.8321 <.0001 Intercept 2 1 -0.2698 0.2294 1.3829 0.2396 Intercept 3 1 1.5397 0.2318 44.1185 <.0001 white 1 0.4376 0.1176 13.8470 0.0002 age 1 0.0179 0.00241 54.8182 <.0001 ed 1 -0.0933 0.0129 52.6988 <.0001

Odds Ratio Estimates

Point 95% Wald Effect Estimate Confidence Limits

white 1.549 1.230 1.951 age 1.018 1.013 1.023 ed 0.911 0.888 0.934

This seminar illustrates how to perform binary logistic, exact logistic,
multinomial logistic
(generalized logits model) and ordinal logistic (proportional odds model)
regression analysis using SAS **proc logistic**. It focuses on some new features of
**proc logistic** available since SAS
8.1.

Analysis of Categorical Dependent Variables with SAS and SPSS

SAS
Topics: Logistic Regression

Performing
Exact Logistic Regression with the SAS System

SAS/STAT Software: Changes and Enhancements,
Release 8.2