This article was originally published in Perspective, Volume 18, Number 2, 1995, pp. 30-35.


Comparing Logistic Regression Procedures for Binary Response Data in Statistical Packages SPSS, SAS, and SUDAAN

by Vivian Lew

Dichotomous outcomes such as yes/no, pass/fail, and lived/died arise in many disciplines. Logistic regression analysis is a popular method for examining the relationship between binary outcome variables such as these and a set of explanatory variables. Many statistical software packages include logistic regression procedures. Unfortunately, incorrect use of these procedures may lead to unexpected results, and results obtained from different statistical packages may appear contradictory.

In response to numerous questions put to OAC Consulting about the implementation of logistic regression in SPSS, SAS, and SUDAAN, this article compares how to estimate this type of model for binary response data in each of these packages. The following sections explain how to use the logistic regression procedures found in SPSS, SAS, and SUDAAN, all of which are available on the MVS/ESA system at OAC. Sample setups for each package are included. The article concludes with a comparison of the results.

Background

Before turning to the statistical packages, note three points. First, for any given unweighted dataset, all three packages will yield the same parameter estimates in both magnitude and direction if the procedures and their associated options are properly specified. Second, the sample dataset used in this article is for illustrative purposes only. It consists of 27 observations and is available online to anyone who wishes to try this exercise on their own. It is stored in partitioned dataset APP4.SAS.SAMPLE, member LOGISTEX, and it is the second of five logistic regression examples in the dataset. Refer to OAC Writeup SS03, "SAS Statistical Package: Version 6", for more information on how to access sample SAS setups. The raw data can be used by SPSS, SAS, or SUDAAN for analysis. Third, to illustrate the use of independent categorical variables, a three-category variable, ANCESTRY, was added to the dataset. It takes values 0, 1, and 2 (where 0=African, 1=European, and 2=Other). The first nine observations were designated as ANCESTRY=0, the next ten as ANCESTRY=1, and the last eight as ANCESTRY=2.

SPSS - Logistic Regression

SPSS is a comprehensive integrated system for statistical analysis produced by SPSS Inc. of Chicago. The Logistic Regression procedure is included in Release 4.1, the current default version of SPSS on MVS/ESA. SPSS Logistic Regression is relatively easy to use and handles categorical independent variables and their contrasts well.

The raw Cancer Remission data is read using a DATA LIST command:

  DATA LIST FREE
     /REMISS CELL SMEAR INFIL LI BLAST 
      TEMP ANCESTRY
  BEGIN DATA
      1 .8 .83 .66 1.9 1.1 .996 0
      1 .9 .36 .32 1.4 .74 .992 0
      0 .8 .88 .7 .8 .176 .982  0
      0 1 .87 .87 .7 1.053 .986 0
      1 .9 .75 .68 1.3 .519 .98 0
      0 1 .65 .65 .6 .519 .982  0
      1 .95 .97 .92 1 1.23 .992 0
      0 .95 .87 .83 1.9 1.354 1.02 0
      0 1 .45 .45 .8 .322 .999 0
      0 .95 .36 .34 .5 0 1.038 1
      0 .85 .39 .33 .7 .279 .988 1
      0 .7 .76 .53 1.2 .146 .982 1
      0 .8 .46 .37 .4 .38 1.006 1
      0 .2 .39 .08 .8 .114 .99 1
      0 1 .9 .9 1.1 1.037 .99 1
      1 1 .84 .84 1.9 2.064 1.02 1
      0 .65 .42 .27 .5 .114 1.014 1
      0 1 .75 .75 1 1.322 1.004 1
      0 .5 .44 .22 .6 .114 .99 1
      1 1 .63 .63 1.1 1.072 .986 2
      0 1 .33 .33 .4 .176 1.01 2
      0 .9 .93 .84 .6 1.591 1.02 2
      1 1 .58 .58 1 .531 1.002 2
      0 .95 .32 .3 1.6 .886 .988 2
      1 1 .6 .6 1.7 .964 .99 2
      1 1 .69 .69 .9 .398 .986 2
      0 1 .73 .73 .7 .398 .986 2
  END DATA

Variable REMISS is the binary (0,1) dependent variable where 0=no remission occurred and 1=remission occurred. Unless you indicate otherwise, SPSS assumes that the larger value of the dependent always signifies the outcome in which the researcher is interested. For example, if the binary dependent was coded (1,2) or (0,2), SPSS would treat outcome 2 as the outcome of interest.
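
The convention described above can be mimicked outside SPSS when preparing data by hand. The following Python sketch is only an illustration of the rule "the larger code is the outcome of interest"; the function name to_event_indicator is hypothetical and is not part of SPSS.

```python
def to_event_indicator(values):
    """Recode a two-valued outcome so that the larger code becomes 1 (the event)."""
    levels = sorted(set(values))
    if len(levels) != 2:
        raise ValueError("expected exactly two distinct outcome codes")
    event = levels[1]  # the larger code is treated as the outcome of interest
    return [1 if v == event else 0 for v in values]


print(to_event_indicator([1, 2, 2, 1]))  # [0, 1, 1, 0]
print(to_event_indicator([0, 2, 0, 2]))  # [0, 1, 0, 1]
print(to_event_indicator([0, 1, 1, 0]))  # [0, 1, 1, 0]
```

A dependent variable that is already coded (0,1), such as REMISS, passes through unchanged.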

The interval-level independent variables (CELL, SMEAR, INFIL, LI, BLAST, and TEMP) are the results of medical tests administered to a set of cancer patients. ANCESTRY is a nominal variable having a number of unordered categories (SPSS accepts ordered categorical variables, too).

Suppose you want to model the probability of remission, REMISS=1, with all of the independent variables listed above. For the categorical independent variable ANCESTRY, the second category, European ancestry, will be treated as the referent group. This model, in its simplest form, is specified using the following SPSS control statements:

   LOGISTIC REGRESSION
      REMISS WITH CELL SMEAR INFIL LI 
      BLAST TEMP ANCESTRY
        /CATEGORICAL=ANCESTRY
        /CONTRAST(ANCESTRY)=INDICATOR(2)

In SPSS, each model that is estimated must have its own LOGISTIC REGRESSION command. Each model is limited to a single dependent variable. A model can have as many independent variables as the dataset can support and/or the SPSS work space will allow. Independent variables which are not interval or ratio need to be identified in the CATEGORICAL subcommand. Variables specified in the CATEGORICAL subcommand must also appear in the variable list associated with the model or in a METHOD subcommand.

Variables that are declared categorical are automatically transformed into a set of deviation contrast variables in which the last category is assumed to be the referent category (its parameter estimate will not be calculated). The parameter estimates for the other contrast variables represent deviations from the overall effect. This is the default unless a different type of contrast or referent category is explicitly specified. For this exercise, the contrast type is specified as INDICATOR and the referent category is identified as (2). The INDICATOR contrast type creates a set of "dummy variables" that indicates the presence or absence of category membership. Cell sizes are not assumed to be equal under this contrast type. SPSS will print out the design matrices used in the model. This is useful for error checking, and you should examine them to make certain that the comparisons are being made correctly. The design matrix generated for the ANCESTRY variable is:

                 Design      Variables
                   D1           D2
     ANCESTRY 
         0        1.000       0.000
         1        0.000       0.000
         2        0.000       1.000

The referent category identifier (2) is not the value of the ANCESTRY variable for that category but its sequence number (i.e., the second category of three). The design-matrix row for the second category is set to zero so that (a) no parameter estimate is displayed for this category and (b) the parameter estimates for the other two categories represent deviations from the effect of being a member of the omitted category.
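
To make the sequence-number rule concrete, here is a small Python sketch that rebuilds the same indicator coding. It is an illustration only, not SPSS's own algorithm; the function name indicator_design and its arguments are hypothetical.

```python
def indicator_design(categories, referent_position):
    """Return indicator (dummy) coding for each category value.

    categories        -- the distinct category values in ascending order
    referent_position -- 1-based sequence number of the referent category
                         (not the category's value)
    """
    kept = [c for i, c in enumerate(categories, start=1) if i != referent_position]
    return {c: [1.0 if c == k else 0.0 for k in kept] for c in categories}


design = indicator_design([0, 1, 2], referent_position=2)
for value, row in design.items():
    print(value, row)
# 0 [1.0, 0.0]
# 1 [0.0, 0.0]
# 2 [0.0, 1.0]
```

The second category in sequence (ANCESTRY=1, European) receives a row of zeros, which matches the design matrix shown above.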

Rather than allow SPSS to generate the design matrix based on your specifications in the CONTRAST subcommand, you can alternatively create dummy variables using a pair of COMPUTE and IF statements:

    COMPUTE AFRICAN=0
    IF (ANCESTRY EQ 0) AFRICAN=1
    COMPUTE OTHER=0
    IF (ANCESTRY EQ 2) OTHER=1

Then include these two variables in the model. If you do this, you will not need the CATEGORICAL or CONTRAST subcommands. The equivalent model now looks like this:

   LOGISTIC REGRESSION
      REMISS WITH CELL SMEAR INFIL LI
      BLAST TEMP AFRICAN OTHER

The output will look slightly different because you have added the dummy variables AFRICAN and OTHER and removed the three-category ANCESTRY variable, but the estimated parameters will be the same. For more information on using SPSS at OAC, please refer to OAC Writeup SS02 "SPSS Statistical Package." For more information on Logistic Regression in SPSS, please refer to the SPSS Reference Guide (1990).

SAS - Proc Logistic

The SAS Institute of Cary, North Carolina produces the SAS Information Delivery System. The LOGISTIC procedure became available as a part of SAS/STAT in SAS Version 6. Although logistic regression can be performed using other SAS statistical procedures, we limit this discussion to PROC LOGISTIC. SAS 6.08 is currently the default version of SAS on MVS/ESA.

The command syntax and procedure options in PROC LOGISTIC are similar to those available in other SAS regression procedures. The only required statements in the logistic procedure are a PROC statement and a MODEL statement. In the logistic procedure, the option to analyze models BY subgroup is available; you can create OUTPUT SAS datasets; and you can WEIGHT the data. You are limited to one model per PROC LOGISTIC and you can only have one dependent variable per model. Models can have as many independent variables as the dataset can support and/or the SAS workspace will allow. PROC LOGISTIC in SAS can use categorical independent variables but you must transform them into a set of dummy variables before invoking this procedure. SAS PROC LOGISTIC does not have a subcommand equivalent to CATEGORICAL in SPSS for automatically creating a set of contrast variables from a categorical variable.

PROC LOGISTIC has far too many options to be discussed in this article. Instead we focus on the model detailed in the SPSS example above and generate its equivalent in SAS:

   DATA CANCER;
     INPUT REMISS CELL SMEAR INFIL LI
           BLAST TEMP ANCESTRY;

     /* create a set of dummy variables from ANCESTRY */
     IF ANCESTRY=0 THEN AFRICAN=1;
        ELSE AFRICAN=0;
     IF ANCESTRY=2 THEN OTHER=1;
        ELSE OTHER=0;

     /* create a new dependent variable by reversing */
     /* the original dependent REMISS, which gives   */
     /* NEWREMIS 0=remission, 1=no remission         */
     NEWREMIS = 1 - REMISS;

     /* the data lines follow CARDS and end at the   */
     /* lone semicolon (only the first line is shown */
     /* here, the next 26 lines of data are omitted) */
   CARDS;
       1 .8 .83 .66 1.9 1.1 .996 0
   ;
   PROC LOGISTIC;
     MODEL NEWREMIS=CELL SMEAR INFIL LI
                    BLAST TEMP AFRICAN OTHER;
   RUN;

Note that a new dependent variable, NEWREMIS, whose values are the reverse of REMISS, was created for use in the MODEL statement. This change is necessary for SAS to generate parameter estimates for the probability of REMISS=1 (remission). By default, PROC LOGISTIC assumes that the values of the response variable are ordered according to their formatted values (ORDER=FORMATTED). If you have not specified formats, the values of the response variable are ordered according to their internal values (ORDER=INTERNAL). This means that SAS orders the values of the response variable from low to high, i.e., 0 and then 1, and designates the low value as the outcome of interest. Thus, if we allow PROC LOGISTIC to use the default ORDER with the original dependent variable REMISS, the parameter estimates generated will correspond to the probability of REMISS=0 (no remission). Compared with the SPSS results, the estimated parameters then have reversed signs. Whether you are making comparisons across SPSS and SAS or using SAS exclusively, it is important to know which outcome PROC LOGISTIC uses by default.
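
The sign reversal can be checked independently of either package. The Python sketch below fits a deliberately simplified model, an intercept plus the single predictor LI, by Newton-Raphson, once for REMISS and once for NEWREMIS = 1 - REMISS. The REMISS and LI values are copied from the data listed earlier; the one-predictor model and the function name fit_logistic are assumptions made for illustration, and this is not the algorithm used by SAS or SPSS.

```python
import math

# REMISS and LI columns from the 27 observations listed in the SPSS section
REMISS = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
          0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0]
LI = [1.9, 1.4, 0.8, 0.7, 1.3, 0.6, 1.0, 1.9, 0.8, 0.5, 0.7, 1.2, 0.4, 0.8,
      1.1, 1.9, 0.5, 1.0, 0.6, 1.1, 0.4, 0.6, 1.0, 1.6, 1.7, 0.9, 0.7]


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


NEWREMIS = [1 - r for r in REMISS]
b0, b1 = fit_logistic(LI, REMISS)      # models P(REMISS=1)
r0, r1 = fit_logistic(LI, NEWREMIS)    # models P(REMISS=0)
print("event = remission:    ", b0, b1)
print("event = no remission: ", r0, r1)
```

The two fits give the same magnitudes with opposite signs, which is the pattern to expect when one package models REMISS=1 and the other models REMISS=0.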

For more information on using SAS at OAC, refer to OAC Writeup SS03 "SAS Statistical Package: Version 6." For more information on the Logistic procedure in SAS, please refer to "The SAS/STAT User's Guide: Volume 2, GLM-VARCOMP."

SUDAAN - Proc Logistic

SUDAAN, or SUrvey DAta ANalysis Software, is a product of the Research Triangle Institute. This is a highly specialized statistical package intended primarily to analyze data collected from complex sample designs. SUDAAN does not assume that the observations in a dataset are independent and identically distributed. Instead, sample design effects are incorporated when computing standard errors and test statistics. SUDAAN Release 6.34 is the default version on MVS/ESA.

Generally, a researcher will perform data management and estimate preliminary models using a package such as SAS and then turn to SUDAAN for its ability to incorporate sample design information into its estimation procedures. Since SUDAAN will accept SAS version 5 datasets as input, it is recommended that data management be performed in SAS. Although the PROC LOGISTIC command in SUDAAN is very close in syntax to PROC LOGISTIC in SAS, there are enough differences in the two packages (in addition to the sample assumptions) to make replication of the same model difficult.

SUDAAN is the most restrictive of the three packages. In SUDAAN's PROC LOGISTIC, the dependent variable MUST be coded (0,1) and 1 is always the outcome of interest. Although the logistic regression routine in SUDAAN has an option which transforms independent categorical variables into contrast vectors, these variables may not have a value of 0. During the estimation of the model, SUDAAN will omit any observations containing an independent categorical variable which takes a value of 0. Finally, for categorical variables, SUDAAN always treats the highest value as the referent category. There is no option to modify this. All of these limitations require that a user process data appropriately before starting analysis in SUDAAN. As a general rule, prior to estimation, check the sample size and summary statistics for the variables that are going to be used in the logistic regression to make certain that the data have been processed by SUDAAN as you intended.
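
The checks recommended above can be sketched in a few lines of Python. This is only an illustration of the three restrictions as described in this article; the function name sudaan_precheck and the dictionary keys are hypothetical, and the recode shown is one mapping that satisfies the restrictions while making European the highest category.

```python
# REMISS and ANCESTRY as coded in the sample dataset
REMISS = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
          0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0]
ANCESTRY = [0] * 9 + [1] * 10 + [2] * 8


def sudaan_precheck(outcome, categorical):
    """Summarize how the restrictions described above would treat the data."""
    return {
        "outcome_is_0_1": set(outcome) <= {0, 1},
        # observations with a categorical value of 0 would be omitted
        "rows_dropped_for_zero": sum(1 for v in categorical if v == 0),
        # the highest value is always the referent category
        "referent_category": max(categorical),
    }


print(sudaan_precheck(REMISS, ANCESTRY))
# {'outcome_is_0_1': True, 'rows_dropped_for_zero': 9, 'referent_category': 2}

# 0=African -> 1, 2=Other -> 2, 1=European -> 3
recode = {0: 1, 2: 2, 1: 3}
RECODED = [recode[v] for v in ANCESTRY]
print(sudaan_precheck(REMISS, RECODED))
# {'outcome_is_0_1': True, 'rows_dropped_for_zero': 0, 'referent_category': 3}
```

With the original coding, the nine African observations would be lost and Other would become the referent group; after the recode, all 27 observations are kept and European is the referent group.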

Returning to our model, the ANCESTRY variable, originally coded 0=African, 1=European, 2=Other, needs to be recoded to 1=African, 2=Other, 3=European, in order to use European as the referent group in the model and to keep all of our cases. We use SAS to recode ANCESTRY into a form that SUDAAN can use. What follows is an example, written in SAS Version 6, of how to recode the ANCESTRY variable to suit SUDAAN and save the data as a SAS Version 5 dataset.

   LIBNAME OUT V5 'aaaaiii.dsname'
      DISP=NEW;
   DATA OUT.CANCER;
     INPUT REMISS CELL SMEAR INFIL
           LI BLAST TEMP ANCESTRY;
     /* recode the ANCESTRY variable to suit SUDAAN  */
     /* (order matters: 1 must become 3 before 0     */
     /* becomes 1, or the African cases would end    */
     /* up coded 3 as well)                          */
     IF ANCESTRY=1 THEN ANCESTRY=3;
     IF ANCESTRY=0 THEN ANCESTRY=1;
     /* end recode */
     /* the lines of data follow CARDS (omitted here) */
   CARDS;

Note the use of the V5 engine on the SAS Libname statement which causes SAS to write a SAS Version 5 dataset that can be read by SUDAAN. Since we do not have any sample design information for this dataset, we approximate the results of the SPSS and SAS logistic regressions by assuming the data were collected by simple random sampling using the option DESIGN=SRS. The model now looks like this in SUDAAN's logistic procedure:

   PROC LOGISTIC DATA="ddname.CANCER" 
      DESIGN=SRS FILETYPE=SAS;
   TITLE "LOGISTIC EXAMPLE USING A SAS
          VERSION 5 DATASET AS INPUT";
   MODEL REMISS=CELL SMEAR INFIL LI 
                BLAST TEMP ANCESTRY;
   SUBGROUP ANCESTRY;
   LEVELS 3;

Here, ddname comes from the DD statement (in JCL) that references the SAS Version 5 dataset created in the previous example (refer to OAC Writeup CJ02 "JCL for Disk Datasets" for more information on issuing JCL commands). The parameter DESIGN=SRS is specified because both SAS and SPSS assume that the data were collected by simple random sampling, and we want SUDAAN to generate results comparable to those obtained from SAS and SPSS. The MODEL statements in SUDAAN and SAS are virtually the same, except that in SUDAAN the categorical independent variable, ANCESTRY, is put directly into the MODEL statement rather than being replaced by a set of dummy variables as in the SAS example. The SUBGROUP command in SUDAAN is the equivalent of the CATEGORICAL subcommand with contrast type INDICATOR in SPSS: it automatically creates a set of contrast variables from a categorical independent variable. The LEVELS subcommand must be used along with SUBGROUP. It tells SUDAAN that there are three categories in variable ANCESTRY (which should be coded to have integer values greater than zero). There is no option to choose the referent category; SUDAAN automatically omits the highest category.

SUDAAN uses a different method (Taylor series linearization) to estimate standard errors than SPSS and SAS do, but the coefficients will have the same magnitude and direction as those generated by SAS and SPSS. The probability levels of the parameter estimates will be affected, however. Therefore, a coefficient that is "significant" in SAS or SPSS may not attain statistical significance in SUDAAN, or vice versa.
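
To see why the coefficients agree while the standard errors do not, the following Python sketch computes two sets of standard errors for the same fitted coefficients: the usual model-based ones (from the inverse information matrix) and a generic linearization ("sandwich") version that treats each observation as its own sampling unit. The one-predictor model (intercept plus LI), the n/(n-1) scaling, and all function names are assumptions made for illustration; this is not claimed to be SUDAAN's exact computation under DESIGN=SRS.

```python
import math

REMISS = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
          0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0]
LI = [1.9, 1.4, 0.8, 0.7, 1.3, 0.6, 1.0, 1.9, 0.8, 0.5, 0.7, 1.2, 0.4, 0.8,
      1.1, 1.9, 0.5, 1.0, 0.6, 1.1, 0.4, 0.6, 1.0, 1.6, 1.7, 0.9, 0.7]


def fit(x, y, iterations=25):
    """Newton-Raphson fit of logit P(y=1) = b0 + b1*x; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def standard_errors(x, y):
    """Return (model_based_se, linearization_se), each as (se_b0, se_b1)."""
    b0, b1 = fit(x, y)
    n = len(y)
    h00 = h01 = h11 = m00 = m01 = m11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        r2 = (yi - p) ** 2
        h00 += w; h01 += w * xi; h11 += w * xi * xi      # information matrix H
        m00 += r2; m01 += r2 * xi; m11 += r2 * xi * xi   # sum of score outer products M
    det = h00 * h11 - h01 * h01
    i00, i01, i11 = h11 / det, -h01 / det, h00 / det     # inverse of H
    model = (math.sqrt(i00), math.sqrt(i11))
    # diagonal of inv(H) * M * inv(H), scaled by n/(n-1)
    a00 = i00 * m00 + i01 * m01
    a01 = i00 * m01 + i01 * m11
    a10 = i01 * m00 + i11 * m01
    a11 = i01 * m01 + i11 * m11
    scale = n / (n - 1)
    linearized = (math.sqrt(scale * (a00 * i00 + a01 * i01)),
                  math.sqrt(scale * (a10 * i01 + a11 * i11)))
    return model, linearized


model_se, linearized_se = standard_errors(LI, REMISS)
print("model-based SEs:   ", model_se)
print("linearization SEs: ", linearized_se)
```

Both sets of standard errors belong to the same coefficients, so only the test statistics and probability levels change between the two approaches.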

For more information on using the SUDAAN statistical package, please refer to OAC Writeup SS17 "SUDAAN Statistical Package," the SUDAAN User's Manual, and two recent Perspective articles, "Statistical Software Updates: SUDAAN, SAS, BMDP, and SPSS," Vol. 17(2), pp.8-10, and "Using the SUDAAN Statistical Package," Vol.17(4), pp.18-19.

SPSS, SAS and SUDAAN Results

The results of this exercise can be summarized in a single table (see Table 1). The estimated coefficients are nearly identical for the three packages. The differences are attributable to the algorithms employed by each one. The standard errors are close to each other for SAS and SPSS because they use similar methods of estimation. Although simple random sampling is being assumed for SUDAAN, its standard errors reflect a different method of estimation. By default, all three packages generate considerably more information than is shown here, and each has a variety of options that can generate goodness-of-fit statistics.

Summary

Logistic Regression is a popular method which can be used to examine the response probabilities of dichotomous outcome variables given a set of explanatory variables. Statistical packages SAS and SPSS offer data management capabilities along with procedures for performing this type of logistic regression. Statistical package SUDAAN also offers a logistic regression procedure with the ability to incorporate sample design information during model estimation. Properly used, these packages and their logistic regression procedures offer researchers a powerful set of tools to enhance their research capabilities.


Vivian Lew is a recent addition to OAC's User Services Group. She is a Statistical Consultant and has extensive work and teaching experience with SPSS, SAS, and SUDAAN.

*OAC/CS 21 Jun 95; Rev. 19 Dec 95