SAS Data Analysis Examples Negative Binomial Regression

Examples of Negative Binomial Regression

Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.

Description of the Data

Let's pursue Example 1 from above. This is the same example that was used on the page on Poisson regression.

We have attendance data on 316 high school juniors from two urban high schools in the file poissonreg.csv. The response variable of interest is days absent, daysabs. The variables math and langarts give the standardized test scores for math and language arts respectively. The variable male is a binary indicator of student gender.

Let's look at the data.

Data poissonreg;
infile "d:\work\data\raw\poissonreg.csv" delimiter="," firstobs=2;
input id school male math langarts daysatt daysabs;
run;
proc means data = poissonreg mean std min max var;
var daysabs math langarts male;
run;
The MEANS Procedure
Variable          Mean       Std Dev       Minimum       Maximum      Variance
------------------------------------------------------------------------------
daysabs      5.8101266     7.4490028             0    45.0000000    55.4876432
math        48.7510115    17.8807562     1.0071140    98.9928900   319.7214429
langarts    50.0637938    17.9392106     1.0071140    98.9928900   321.8152757
male         0.4873418     0.5006325             0     1.0000000     0.2506329
------------------------------------------------------------------------------
proc univariate data = poissonreg noprint;
histogram daysabs / midpoints = 0 to 50 by 1 vscale = count ;
run;
proc freq data = poissonreg;
tables male;
run;
The FREQ Procedure

                                 Cumulative    Cumulative
male    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
0         162       51.27           162        51.27
1         154       48.73           316       100.00

Some Strategies You Might Be Tempted To Try

Before we show how you can analyze this with a negative binomial regression model, let's consider some other methods that you might use.
• OLS Regression - You could try to analyze these data using OLS regression. However, count outcomes are non-negative, discrete, and usually right-skewed, so they are not well estimated by OLS regression.
• Poisson Regression - Ordinary Poisson regression will have difficulty with overdispersed data, i.e., data whose variance is much larger than the mean. Here the variance of daysabs (55.49) is far larger than its mean (5.81).
• Zero-inflated Regression Model - This might be a good choice if there are more zeros than would be expected under either a Poisson or a negative binomial model.
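
The overdispersion mentioned in the Poisson bullet can be checked directly before any model is fit. The following is a minimal sketch, assuming the poissonreg dataset created earlier on this page; the dataset name disp and the variable names mean_abs, var_abs, and ratio are made up for illustration.

```sas
/* Save the mean and variance of daysabs to a dataset (names are illustrative) */
proc means data = poissonreg noprint;
  var daysabs;
  output out = disp mean = mean_abs var = var_abs;
run;

/* Variance-to-mean ratio: close to 1 under a Poisson distribution */
data disp;
  set disp;
  ratio = var_abs / mean_abs;
run;

proc print data = disp noobs;
  var mean_abs var_abs ratio;
run;
```

Using the values in the proc means output above, the ratio is 55.4876432 / 5.8101266, or about 9.55. This is only a rough, unconditional check, since it ignores the predictors, but a ratio this far above 1 is a strong hint that a plain Poisson model will be too restrictive.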

SAS Negative Binomial Regression Analysis

proc genmod data = poissonreg;
model daysabs = male math langarts /dist=negbin;
run;
The GENMOD Procedure

Model Information
Data Set                WORK.POISSONREG
Distribution          Negative Binomial
Dependent Variable              daysabs

Number of Observations Used         316

Criteria For Assessing Goodness Of Fit
Criterion                     DF           Value        Value/DF
Deviance                     312        356.9348          1.1440
Scaled Deviance              312        356.9348          1.1440
Pearson Chi-Square           312        337.0888          1.0804
Scaled Pearson X2            312        337.0888          1.0804
Log Likelihood                         2149.3649
Full Log Likelihood                    -880.8731
AIC (smaller is better)                1771.7462
AICC (smaller is better)               1771.9398
BIC (smaller is better)                1790.5250

Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
                                Standard     Wald 95% Confidence          Wald
Parameter     DF    Estimate       Error           Limits           Chi-Square    Pr > ChiSq
Intercept      1      2.7161      0.2326      2.2602      3.1719        136.38        <.0001
male           1     -0.4312      0.1397     -0.7049     -0.1574          9.53        0.0020
math           1     -0.0016      0.0048     -0.0111      0.0079          0.11        0.7413
langarts       1     -0.0143      0.0056     -0.0253     -0.0034          6.61        0.0102
Dispersion     1      1.2884      0.1231      1.0471      1.5296

NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.

The output looks very much like the output from an OLS regression. It begins with a summary of the dataset and model, followed by a list of goodness-of-fit statistics, which are likelihood based. Below the fit statistics, you will find the negative binomial regression coefficients for each of the variables along with the corresponding standard errors, Wald 95% confidence intervals, Wald chi-square statistics, and p-values. After the coefficients for the predictors, there is an estimate of the dispersion parameter. If the dispersion is 0, the negative binomial model reduces to a Poisson model, and a Poisson model would be more appropriate for the data. The 95% confidence limits for our dispersion parameter (1.0471 to 1.5296) exclude 0, so we can say that the dispersion is significantly different from 0 and that we are justified in using a negative binomial model.
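
For comparison, the same predictors can be fit with a Poisson distribution. PROC GENMOD's negative binomial model uses the variance function mu + k*mu**2, where k is the dispersion parameter, so k = 0 corresponds to the Poisson variance mu. The sketch below was not run on this page, so no output is shown; it only assumes the poissonreg dataset from above.

```sas
/* Poisson fit with the same predictors, for comparison only */
/* In the goodness-of-fit table, Value/DF for the deviance and */
/* Pearson chi-square far above 1 suggests overdispersion.     */
proc genmod data = poissonreg;
  model daysabs = male math langarts / dist=poisson;
run;
```

If the Value/DF entries from this fit are much larger than the values near 1 reported for the negative binomial model, that is informal support for the conclusion drawn from the dispersion estimate.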

Now, just to be on the safe side, let's rerun proc genmod with the repeated statement in order to obtain robust standard errors for the negative binomial regression coefficients.

proc genmod data = poissonreg;
class id;
model daysabs = male math langarts /dist=negbin;
repeated  subject=id  /type=cs;
run;
             GEE Model Information

Correlation Structure              Exchangeable
Subject Effect                  id (316 levels)
Number of Clusters                          316
Correlation Matrix Dimension                  1
Maximum Cluster Size                          1
Minimum Cluster Size                          1

Algorithm converged.

The GENMOD Procedure

GEE Fit Criteria

QIC        -3969.2402
QICu       -3970.7849

Analysis Of GEE Parameter Estimates
Empirical Standard Error Estimates

                   Standard   95% Confidence
Parameter Estimate    Error       Limits            Z Pr > |Z|

Intercept   2.7161   0.2323   2.2608   3.1714   11.69   <.0001
male       -0.4312   0.1446  -0.7146  -0.1478   -2.98   0.0029
math       -0.0016   0.0079  -0.0170   0.0138   -0.20   0.8388
langarts   -0.0143   0.0053  -0.0248  -0.0039   -2.69   0.0071


The robust (empirical) standard errors attempt to adjust for heterogeneity in the model. Because every student forms a cluster of size one, the coefficient estimates are identical to those from the first model, and only the standard errors change. The changes are modest (the largest is for math, from 0.0048 to 0.0079), and the z-tests lead to the same conclusions as the Wald chi-square tests above.

The variable math is not significant with or without the repeated statement, so we will rerun the model without it.
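
The Z statistic, p-value, and confidence limits in the GEE table all follow from the estimate and its standard error. As a hedged illustration, the data step below recomputes them for male from the rounded values printed above; the dataset name wald_check and the variable names are made up for this sketch.

```sas
/* Recompute the Wald z-test and 95% limits for male (illustrative names) */
data wald_check;
  estimate = -0.4312;                  /* male coefficient from the GEE table */
  stderr   = 0.1446;                   /* empirical (robust) standard error   */
  z        = estimate / stderr;        /* Wald z statistic                    */
  p        = 2 * (1 - probnorm(abs(z)));   /* two-sided p-value               */
  crit     = quantile('normal', 0.975);    /* about 1.96                      */
  lower    = estimate - crit * stderr;
  upper    = estimate + crit * stderr;
run;

proc print data = wald_check noobs;
run;
```

Up to rounding of the inputs, this should reproduce the table entries for male: Z = -2.98, Pr > |Z| = 0.0029, and limits of -0.7146 and -0.1478.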

proc genmod data = poissonreg;
model daysabs = male langarts /dist=negbin;
run;

The GENMOD Procedure
Model Information
Data Set                WORK.POISSONREG
Distribution          Negative Binomial
Dependent Variable              daysabs

Number of Observations Used         316

Criteria For Assessing Goodness Of Fit
Criterion                     DF           Value        Value/DF
Deviance                     313        356.9042          1.1403
Scaled Deviance              313        356.9042          1.1403
Pearson Chi-Square           313        334.4317          1.0685
Scaled Pearson X2            313        334.4317          1.0685
Log Likelihood                         2149.3106
Full Log Likelihood                    -880.9274
AIC (smaller is better)                1769.8548
AICC (smaller is better)               1769.9834
BIC (smaller is better)                1784.8778

Algorithm converged.

Analysis Of Maximum Likelihood Parameter Estimates
                                Standard     Wald 95% Confidence          Wald
Parameter     DF    Estimate       Error           Limits           Chi-Square    Pr > ChiSq
Intercept      1      2.7034      0.2293      2.2541      3.1528        139.03        <.0001
male           1     -0.4312      0.1397     -0.7050     -0.1574          9.53        0.0020
langarts       1     -0.0156      0.0039     -0.0234     -0.0079         15.71        <.0001
Dispersion     1      1.2891      0.1231      1.0478      1.5304
NOTE: The negative binomial dispersion parameter was estimated by maximum likelihood.


This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can run the null model and compare it with the current model using a likelihood ratio chi-squared test, which is based on twice the difference in the log likelihoods.

proc genmod data = poissonreg;
model daysabs = / dist=negbin;
run;
quit;
Criteria For Assessing Goodness Of Fit

Criterion                     DF           Value        Value/DF

Deviance                     315        356.9918          1.1333
Scaled Deviance              315        356.9918          1.1333
Pearson Chi-Square           315        329.9199          1.0474
Scaled Pearson X2            315        329.9199          1.0474
Log Likelihood                         2138.9953
Full Log Likelihood                    -891.2427
AIC (smaller is better)                1786.4854
AICC (smaller is better)               1786.5238
BIC (smaller is better)                1793.9969


The full log likelihood is -880.9274 for the current model and -891.2427 for the null model. The chi-squared value is 2*(-880.9274 - (-891.2427)) = 20.6306. Since the current model has two predictor variables, the chi-squared test has 2 degrees of freedom. This yields a p-value <.0001. Thus, our overall model is statistically significant.
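
The arithmetic of this likelihood ratio test can be reproduced in a short data step. This is a minimal sketch using the two full log likelihood values reported above; the dataset name lrtest and the variable names are made up for illustration.

```sas
/* Likelihood ratio test of the current model against the null model */
data lrtest;
  ll_model = -880.9274;   /* full log likelihood, model with male and langarts */
  ll_null  = -891.2427;   /* full log likelihood, intercept-only model         */
  df       = 2;           /* number of predictors dropped in the null model    */
  chisq    = 2 * (ll_model - ll_null);
  p_value  = 1 - probchi(chisq, df);   /* upper-tail chi-squared probability   */
  format p_value pvalue6.4;            /* print small p-values as <.0001       */
run;

proc print data = lrtest noobs;
run;
```

The chi-squared statistic comes out to 20.6306, and the unformatted p-value is roughly 0.00003, which the pvalue format displays as <.0001.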

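Finally, we will use the estimate statement to get the predicted number of days absent for males and for females when langarts is held at its mean. The exp option exponentiates each linear predictor, so the Exp(male) and Exp(female) rows of the output are on the scale of the count.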
proc genmod data = poissonreg;
class id ;
model daysabs = male  langarts /dist=negbin;
repeated  subject=id  /type=cs;
estimate "male" langarts 50.0637938 male 1   intercept 1 /exp;
estimate "female" langarts 50.0637938 male 0 intercept 1 /exp;
run;
                           Contrast Estimate Results

                       Standard                                Chi-
Label        Estimate     Error   Alpha   Confidence Limits  Square  Pr > ChiSq

male           1.5032    0.0966    0.05    1.3138    1.6926  241.98      <.0001
Exp(male)      4.4960    0.4345    0.05    3.7202    5.4335
female         1.9125    0.0992    0.05    1.7182    2.1069  372.06      <.0001
Exp(female)    6.7703    0.6713    0.05    5.5745    8.2225

Using Proc Countreg

If you are using SAS version 9.2 or higher, you could run a negative binomial regression using proc countreg.  This procedure allows a few more options specific to count outcomes than proc genmod.  The proc countreg code for the original model run on this page appears below.

proc countreg data = poissonreg;
model daysabs = male math langarts /dist=negbin (p=2);
run; 

Sample Write-Up of the Analysis

In the negative binomial regression model predicting days absent from school from language arts score and gender, our predictors langarts and male were each statistically significant. For these data, the expected change in the log count of days absent for a one-unit increase in language arts score was -0.0156. Male students had an expected log count 0.4312 lower than that of female students.
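
Differences in log counts are often easier to report after exponentiating, which turns each coefficient into a ratio of expected counts (an incidence rate ratio). The data step below is a hedged sketch that does this for the two rounded coefficients from the final model; the dataset name irr and the variable names are made up for illustration.

```sas
/* Convert log-count coefficients to incidence rate ratios (illustrative names) */
data irr;
  input parameter $ estimate;
  irr        = exp(estimate);       /* ratio of expected counts          */
  pct_change = 100 * (irr - 1);     /* percent change in expected count  */
  datalines;
male     -0.4312
langarts -0.0156
;
run;

proc print data = irr noobs;
run;
```

With these inputs, the ratio for male is about 0.65, i.e., roughly 35% fewer expected days absent for males than for females at the same language arts score, and the ratio for langarts is about 0.985, i.e., roughly 1.5% fewer expected days absent per additional point.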

Cautions, Flies in the Ointment

• It is not recommended that negative binomial models be applied to small samples. What constitutes a small sample does not seem to be clearly defined in the literature.


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.