This article was originally published in Perspective, Volume 17, Number 4, 1993, p. 18-19.


Using the SUDAAN Statistical Package

by Linda R. Ferguson

The SUDAAN statistical package is available on the MVS system at OAC, as announced in a previous article of Perspective (17(2), 1993). This article explains when SUDAAN should be used, provides a sample setup for a logistic regression analysis, and describes how the output from SUDAAN differs from the output obtained from other statistical packages, such as SAS.

When Should You Use SUDAAN?

SUDAAN performs SUrvey DAta ANalysis for multi-stage sample designs. With SUDAAN, you can analyze survey data collected from complex sample designs, including stratified designs, clustered designs, and designs in which observations have unequal probabilities of selection.

Most other statistical packages, such as SAS and SPSS, assume that the data were collected by simple random sampling with replacement. Their standard errors and test statistics are calculated on the assumption that the observations are independent and identically distributed. SUDAAN should be used when the observations do not meet this standard statistical assumption. SUDAAN takes the sample design into account when computing standard errors and test statistics, and it estimates the standard errors by the Taylor series linearization method.

For example, suppose you are involved in a nationwide study to evaluate student academic performance in high school. Since it is not practical or even necessary to sample all high schools in the U.S., a stratified sample of schools is obtained. This is done by stratifying (grouping) all high schools by state, and then by region within state. Schools are selected at random from each region, and students are selected at random from their high schools.

In this example, state and region are stratifying variables, and school is the primary sampling unit (PSU). Students are considered to be clustered within high school, since more than one student may be selected from the same school. Student is the elementary unit, and the data will be used to obtain person-level estimates of academic performance.
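As a rough illustration of the linearization approach mentioned earlier, the Python sketch below estimates a weighted mean and its standard error for a stratified, with-replacement design in which students are clustered within schools. It is not SUDAAN's implementation. The function name, the record layout, and the toy data are assumptions made for this example, and each stratum (a state and region pair) is assumed to contain at least two sampled schools.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean_and_se(records):
    """Weighted mean and its linearized standard error.

    records: iterable of (stratum, psu, weight, y) tuples, one per student.
    Assumes PSUs were drawn with replacement within each stratum.
    """
    records = list(records)
    total_w = sum(w for _, _, w, _ in records)
    mean = sum(w * y for _, _, w, y in records) / total_w

    # Linearized value of each observation, summed to PSU totals by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, w, y in records:
        psu_totals[stratum][psu] += w * (y - mean) / total_w

    # Between-PSU variability within each stratum, summed over strata.
    variance = 0.0
    for psus in psu_totals.values():
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(psus.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - z_bar) ** 2 for z in psus.values())
    return mean, sqrt(variance)


# Hypothetical toy data: (stratum, school, weight, score).
records = [
    (("State 1", "Region 1"), "School A", 1.0, 1.0),
    (("State 1", "Region 1"), "School A", 1.0, 1.0),
    (("State 1", "Region 1"), "School B", 1.0, 3.0),
    (("State 1", "Region 1"), "School B", 1.0, 3.0),
    (("State 1", "Region 2"), "School C", 1.0, 2.0),
    (("State 1", "Region 2"), "School C", 1.0, 2.0),
    (("State 1", "Region 2"), "School D", 1.0, 2.0),
    (("State 1", "Region 2"), "School D", 1.0, 2.0),
]
mean, se = linearized_mean_and_se(records)
```

Note that the variance is driven by differences between school totals within a stratum, not by differences between individual students. This is why clustering can change the standard errors so markedly.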

How Do You Set Up a SUDAAN Program?

If you are familiar with the syntax of PROCedure statements in the SAS System, then you should find SUDAAN programs relatively easy to write. The procedures available in SUDAAN include CROSSTAB, RATIO, DESCRIPT, REGRESS, LOGISTIC, SURVIVAL, and CATAN. SUDAAN will accept a SAS version 5 data set as an input file. Unlike SAS procedures, each SUDAAN procedure requires that the sample design be specified.

The design specification identifies the sampling protocol that was used to collect the data, and is used by SUDAAN to read the data and compute statistics. Two components are necessary to specify a sample design. The first component is the DESIGN= parameter that appears on a PROCedure statement. It declares what sampling method was used to collect your data. The second component consists of one or more of the following design statements: NEST, WEIGHT, TOTCNT, JOINTPROB, or others. These statements identify the variables that are used to read the data and to compute variances and test statistics.

The design for the stratified random sample of high schools described above would be specified in SUDAAN as follows:

    PROC LOGISTIC DATA=ACADPERF DESIGN=STRWR;
    NEST STATE REGION SCHOOL / PSULEV=3;
    WEIGHT WTSTUDNT;

Notes:

  1. A logistic regression analysis is requested on the PROC statement. The DATA= option identifies the input file that will be used for the analysis. The DESIGN= option identifies the sample design as STratified Random sampling With Replacement.
  2. The NEST statement identifies the sampling levels, or stages, used in the sample design. The option PSULEV=3 indicates that the third nesting variable, SCHOOL, is the primary sampling unit (PSU).
  3. The WEIGHT statement identifies the variable to use for analysis weights.

What Kind of Output Do You Get From SUDAAN?

The complete model specifications for a logistic regression analysis are given in the setup below. The binary dependent variable, GRAD4YRS, represents whether or not students completed high school within four years. It is modeled as a function of student sex, race, and grade point average. The first two predictor variables are categorical, as specified on the SUBGROUP and LEVELS statements: SEX has 2 levels (categories) and RACE has 3. GPA is a continuous variable. The PRINT statement indicates the statistics that will appear in the output.

A JOB card, EXEC statement, and appropriate DD statements should be included as part of this setup. Refer to OAC's writeup "SUDAAN Statistical Package" (SS17).

    PROC LOGISTIC DATA=ACADPERF DESIGN=STRWR;
    NEST STATE REGION SCHOOL / PSULEV=3;
    WEIGHT WTSTUDNT;
    SUBGROUP SEX RACE;
    LEVELS 2 3;
    MODEL GRAD4YRS=SEX RACE GPA;
    PRINT BETA SEBETAS T_BETA P_BETA DEFT;
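To make the MODEL statement more concrete, the Python sketch below shows how a fitted logistic model of this form turns sex, race, and GPA into a predicted probability of graduating within four years. The coefficient values are hypothetical, and the choice of the last level of each categorical variable as the reference category is an assumption made for this example, not a statement about how SUDAAN codes its subgroup variables.

```python
from math import exp


def predicted_probability(coef, sex, race, gpa):
    """P(GRAD4YRS = 1) from a logistic model with indicator-coded SEX and RACE."""
    linear = coef["intercept"] + coef["gpa"] * gpa
    # A level missing from the dictionary is the reference level and adds 0.
    linear += coef["sex"].get(sex, 0.0)
    linear += coef["race"].get(race, 0.0)
    return 1.0 / (1.0 + exp(-linear))


# Hypothetical coefficients, for illustration only.
coef = {
    "intercept": -2.0,
    "gpa": 1.0,
    "sex": {1: 0.5},              # SEX level 2 is the assumed reference
    "race": {1: -0.5, 2: 0.25},   # RACE level 3 is the assumed reference
}

p_reference = predicted_probability(coef, sex=2, race=3, gpa=2.0)
p_other = predicted_probability(coef, sex=1, race=3, gpa=2.0)
```

A categorical variable with k levels contributes k-1 coefficients under this coding, so SEX contributes one and RACE contributes two.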

In the program output, goodness-of-fit statistics are given for the overall model (minus log likelihood and multiple R-squared), and beta coefficients are given for each predictor variable. The calculation of the standard errors for the beta coefficients is consistent with the sample design, and therefore may differ from standard errors that would be obtained from a statistical package that calculates standard errors under the assumption of simple random sampling. The output also contains a t-test for each beta coefficient to evaluate the null hypothesis B=0. The value of the standard error will influence the value of this t-test and its corresponding probability value. Consequently, coefficients estimated by SUDAAN and identified as "significant" may not be the same coefficients that attain statistical significance when estimated by other statistical packages.
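The effect of the standard error on the test can be seen in a small Python sketch. The coefficient and both standard errors are hypothetical numbers chosen for illustration, and the p-value uses a standard normal approximation as a simplifying assumption; it is not claimed to match the reference distribution SUDAAN uses.

```python
from math import erfc, sqrt


def wald_test(beta, se):
    """Return the test statistic beta/se and a two-sided normal p-value."""
    t = beta / se
    p = erfc(abs(t) / sqrt(2.0))  # equals 2 * (1 - Phi(|t|))
    return t, p


beta = 0.30          # hypothetical coefficient
se_srs = 0.10        # hypothetical SE assuming simple random sampling
se_design = 0.20     # hypothetical SE that reflects the sample design

t_srs, p_srs = wald_test(beta, se_srs)
t_design, p_design = wald_test(beta, se_design)
```

With these numbers the same coefficient falls below the 0.05 level when simple random sampling is assumed, but not when the larger design-based standard error is used.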

Unlike other statistical packages, the SUDAAN output also includes a design effect for each beta coefficient (labeled as "DEFF Beta" and specified by the DEFT parameter on the PRINT statement). This design effect is the ratio of the variance obtained under the complex sample design to the variance that would be obtained under the assumption of simple random sampling. It can be used to assess how much efficiency is gained or lost relative to simple random sampling, and therefore how misleading the standard errors would be if simple random sampling were assumed.

For example, if a subgroup is larger than expected by simple random sampling because that subgroup was oversampled, then the design effect will be less than one. This is not uncommon when stratified samples are drawn to study targeted subgroups, such as the elderly or ethnic groups, since the variance estimates for the stratifying and related variables will be smaller (more efficient) than if simple random sampling were used. On the other hand, if there is a clustering of respondents within geographic region or household, as in clustered samples, then the design effect tends to be greater than one. This occurs because individuals within the same cluster often share similar social and health characteristics, making variance estimates of these characteristics larger (less efficient) than if this clustering effect were not present.
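The ratio described above is simple to compute once both variances are available. The Python sketch below uses hypothetical standard errors; the function name and the numbers are assumptions made for this example.

```python
def design_effect(var_design, var_srs):
    """Ratio of the design-based variance to the simple-random-sampling variance."""
    if var_srs <= 0:
        raise ValueError("the simple random sampling variance must be positive")
    return var_design / var_srs


# Hypothetical standard errors for two coefficients, squared to get variances.
deff_clustered = design_effect(0.20 ** 2, 0.10 ** 2)    # greater than one
deff_stratified = design_effect(0.08 ** 2, 0.10 ** 2)   # less than one
```

A value above one means the design yields a larger variance than simple random sampling would (less efficient); a value below one means a smaller variance (more efficient).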

It is difficult to predict what the overall design effect will be for the example given in this article. The stratification by state and region that tends to yield more efficient variance estimates will be offset by the clustering of students within high schools that tends to yield less efficient estimates.

This article has briefly described some of the circumstances in which you should consider using SUDAAN for data analysis. The example shows you how to set up the design specifications for a stratified random sample and what to look for in the output of a logistic regression analysis. All SUDAAN procedures and statements are documented in the SUDAAN User's Manual, and more technical issues are discussed in Statistical Methods and Mathematical Algorithms Used in SUDAAN.


Originally revised: 05 July 95; Rev. 19 Dec 95