Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.
Example 2. State wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, whether there were children in the group, and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish caught nothing, so the data contain excess zeros: zeros from groups that did not fish, in addition to zeros from groups that fished without success.
We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).
In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables child, persons, and camper in our model. Let's look at the data.
proc means data = fish mean std min max var;
  var count child persons;
run;

The MEANS Procedure

Variable           Mean        Std Dev        Minimum        Maximum       Variance
-----------------------------------------------------------------------------------
count         3.2960000     11.6350281              0    149.0000000    135.3738795
child         0.6840000      0.8503153              0      3.0000000      0.7230361
persons       2.5280000      1.1127303      1.0000000      4.0000000      1.2381687
-----------------------------------------------------------------------------------

proc univariate data = fish noprint;
  histogram count / midpoints = 0 to 50 by 1 vscale = count;
run;

proc freq data = fish;
  tables camper;
run;

The FREQ Procedure

                                    Cumulative    Cumulative
camper    Frequency     Percent      Frequency       Percent
------------------------------------------------------------
     0          103       41.20            103         41.20
     1          147       58.80            250        100.00
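Because the motivation for a zero-inflated model is the share of zero counts, it can help to tabulate that share directly. The following is a minimal sketch, assuming the data set is WORK.FISH with the variable count as described above; the column names n_zero and prop_zero are illustrative and not part of the original analysis.

```sas
/* Sketch: count and proportion of groups that reported zero fish.        */
/* In PROC SQL a comparison such as (count = 0) evaluates to 1 or 0,      */
/* so SUM gives the number of zeros and MEAN gives their proportion.      */
proc sql;
  select sum(count = 0)  as n_zero,
         mean(count = 0) as prop_zero format = 6.3
  from fish;
quit;
```

A large proportion of zeros, together with a variance of count far above its mean in the PROC MEANS output, is the kind of pattern that suggests considering a zero-inflated negative binomial model.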
If you have SAS version 9.2 or higher, you can carry out a zero-inflated negative binomial regression using proc countreg.
In the code below, we predict the number of fish caught with child and camper, and predict the excess zeros with persons.
proc countreg data = fish method = qn;
model count = child camper / dist= zinegbin;
zeromodel count ~ persons;
run;
The COUNTREG Procedure
Model Fit Summary
Dependent Variable count
Number of Observations 250
Data Set WORK.FISH
Model ZINB
ZI Link Function Logistic
Log Likelihood -432.89091
Maximum Absolute Gradient 7.13505E-8
Number of Iterations 28
Optimization Method Dual Quasi-Newton
AIC 877.78181
SBC 898.91058
Algorithm converged.
Parameter Estimates
Standard Approx
Parameter DF Estimate Error t Value Pr > |t|
Intercept 1 1.371048 0.256114 5.35 <.0001
child 1 -1.515255 0.195591 -7.75 <.0001
camper 1 0.879051 0.269274 3.26 0.0011
Inf_Intercept 1 1.603106 0.836493 1.92 0.0553
Inf_persons 1 -1.666566 0.679265 -2.45 0.0141
_Alpha 1 2.678759 0.471328 5.68 <.0001
The output begins with the Model Fit Summary. This includes the model type ("ZINB"), the link function used to model the inflation ("Logistic"), and the optimization method, as well as fit statistics such as the log likelihood, AIC, and SBC.
Next come the parameter estimates. In one block of output, we see the parameter estimates, standard errors, t statistics, and p-values from both the count and inflation models. The parameters associated with the inflation model are indicated with the prefix Inf_. Though all appear in the same block, these parameters must be interpreted in different ways: the first three (Intercept, child, camper) correspond to the count model and should be interpreted as you would parameters from a negative binomial model; the Inf_ estimates correspond to the inflation model and should be interpreted as you would estimates from a logistic regression predicting membership in the excess-zero group. _Alpha is the dispersion parameter. If _Alpha is zero, the negative binomial reduces to a Poisson distribution, so a Poisson model would be more appropriate than a negative binomial model.
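To make the two parts of the model concrete, the fitted structure can be written out as below. This is the standard zero-inflated negative binomial form with a logistic inflation link and variance \mu_i + \alpha\mu_i^2; it is assumed here to match the parametrization behind dist=zinegbin, and the symbols \pi_i, \mu_i, \beta, and \gamma are notation introduced for illustration, not names from the SAS output.

```latex
% Inflation (logistic) part: probability that group i is an excess zero
\operatorname{logit}(\pi_i) = \gamma_0 + \gamma_1\,\text{persons}_i

% Count (negative binomial) part: mean of the count process
\log \mu_i = \beta_0 + \beta_1\,\text{child}_i + \beta_2\,\text{camper}_i

% Mixture probabilities, with dispersion parameter \alpha (_Alpha)
\Pr(Y_i = 0) = \pi_i + (1 - \pi_i)\,(1 + \alpha\mu_i)^{-1/\alpha}

\Pr(Y_i = y) = (1 - \pi_i)\,
  \frac{\Gamma(y + 1/\alpha)}{\Gamma(y + 1)\,\Gamma(1/\alpha)}
  \left(\frac{1}{1 + \alpha\mu_i}\right)^{1/\alpha}
  \left(\frac{\alpha\mu_i}{1 + \alpha\mu_i}\right)^{y},
  \qquad y = 1, 2, \dots
```

In this notation, Inf_Intercept and Inf_persons estimate \gamma_0 and \gamma_1, while Intercept, child, and camper estimate \beta_0, \beta_1, and \beta_2.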
All of the predictors in both the count and inflation portions of the model are statistically significant at the 0.05 level. This model also fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can fit the null model and compare it with the current model using a chi-squared test on twice the difference in log likelihoods (a likelihood ratio test).
proc countreg data = fish method = qn;
model count = / dist= zinegbin;
zeromodel count ~ ;
run;
The COUNTREG Procedure
Model Fit Summary
Dependent Variable count
Number of Observations 250
Data Set WORK.FISH
Model ZINB
ZI Link Function Logistic
Log Likelihood -464.43931
Maximum Absolute Gradient 0.0000371
Number of Iterations 35
Optimization Method Dual Quasi-Newton
AIC 934.87863
SBC 945.44301
Algorithm converged.
Parameter Estimates
Standard Approx
Parameter DF Estimate Error t Value Pr > |t|
Intercept 1 1.192709 0.151551 7.87 <.0001
Inf_Intercept 0 -23.388418 . . .
_Alpha 1 5.438516 0.664078 8.19 <.0001
The log likelihood is -432.89091 for the full model and -464.43931 for the null model. The chi-squared value is 2*(-432.89091 - (-464.43931)) = 63.0968. Since the full model adds three predictor variables (child, camper, and persons), the chi-squared test has 3 degrees of freedom. This yields a p-value <.0001. Thus, our overall model is statistically significant.
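The arithmetic of this likelihood ratio test can be reproduced in a short DATA step. This is a sketch only: the two log likelihoods are copied by hand from the Model Fit Summary tables above, and the variable names (ll_full, ll_null, lr, df, p) are illustrative.

```sas
/* Sketch: likelihood ratio test of the full model against the null model. */
data _null_;
  ll_full = -432.89091;            /* log likelihood, full model            */
  ll_null = -464.43931;            /* log likelihood, intercept-only model  */
  lr = 2 * (ll_full - ll_null);    /* chi-squared statistic, 63.0968        */
  df = 3;                          /* child, camper, persons                */
  p  = sdf('chisq', lr, df);       /* upper-tail chi-squared probability    */
  put lr= df= p=;                  /* written to the SAS log                */
run;
```

The SDF function is used instead of 1 - probchi(lr, df) because it returns the upper tail directly, which avoids losing precision when the p-value is very small.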
For those using a version of SAS lower than 9.2, a zero-inflated negative binomial model is doable, though significantly more difficult. Please see this code fragment: Zero-inflated Poisson and Negative Binomial Using Proc Nlmixed.
The zero-inflated negative binomial regression model predicting fish caught (count) from child, camper, and persons was statistically significant (chi-squared = 63.0968, df = 3, p < .0001). The predictor of excess zeros, persons, was statistically significant. The count predictors child and camper were also each statistically significant. For these data, the expected change in log(count) for a one-unit increase in child was -1.515255. Groups with campers (camper = 1) had an expected log count 0.879051 higher than groups without campers (camper = 0). The model allows us to reject the null hypothesis that _Alpha = 0, suggesting that our negative binomial model is more appropriate than a Poisson model.
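Coefficients on the log and logit scales are often easier to report after back-transformation. The sketch below exponentiates the count-model estimates and applies the inverse logit to the inflation-model estimates for group sizes 1 through 4. The estimates are copied by hand from the Parameter Estimates table; the data set name zinb_effects and the variable names are hypothetical.

```sas
/* Sketch: back-transform the reported ZINB estimates. */
data zinb_effects;
  /* Count model: multiplicative effects on the expected count of the
     negative binomial component (not of the zero-inflated mixture). */
  irr_child  = exp(-1.515255);   /* per additional child             */
  irr_camper = exp(0.879051);    /* camper = 1 versus camper = 0     */

  /* Inflation model: inverse logit of Inf_Intercept + Inf_persons*persons
     gives the estimated probability of being an excess zero. */
  do persons = 1 to 4;
    p_excess_zero = 1 / (1 + exp(-(1.603106 - 1.666566 * persons)));
    output;
  end;
run;

proc print data = zinb_effects noobs;
run;
```

Under these estimates, each additional child multiplies the expected count by about 0.22, and groups with a camper have an expected count about 2.41 times that of groups without one. The estimated probability of being an excess zero is about 0.48 for a single visitor and about 0.15 for a group of two.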
These pages are Copyrighted (c) by UCLA Academic Technology Services