Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.
Example 2. State wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, whether there were children in the group, and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish, so the data contain excess zeros: zeros from people who did not fish, in addition to zeros from people who fished without success.
We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).
In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the zeroes that were not simply a result of bad luck fishing. We will use the variables child, persons, and camper in our model. Let's look at the data.
proc means data = fish mean std min max var;
  var count child persons;
run;

The MEANS Procedure

Variable          Mean       Std Dev       Minimum       Maximum      Variance
------------------------------------------------------------------------------
count        3.2960000    11.6350281             0   149.0000000   135.3738795
child        0.6840000     0.8503153             0     3.0000000     0.7230361
persons      2.5280000     1.1127303     1.0000000     4.0000000     1.2381687
------------------------------------------------------------------------------

proc univariate data = fish noprint;
  histogram count / midpoints = 0 to 50 by 1 vscale = count;
run;

proc freq data = fish;
  tables camper;
run;

The FREQ Procedure

                                   Cumulative    Cumulative
camper    Frequency     Percent     Frequency      Percent
-----------------------------------------------------------
     0          103       41.20           103        41.20
     1          147       58.80           250       100.00
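Because the model is concerned with excess zeros, it can also help to see how many groups reported catching no fish at all. The following is a minimal sketch, assuming the data set fish is in the WORK library as above. The data set name fish_zero and the indicator variable zero are hypothetical names introduced only for this illustration.

```sas
/* Flag the groups that reported zero fish (zero = 1) versus at least one (zero = 0). */
data fish_zero;
  set fish;
  zero = (count = 0);
run;

/* Tabulate the share of zero counts. */
proc freq data = fish_zero;
  tables zero;
run;
```

A large share of zeros relative to what a Poisson distribution with the observed mean would produce is one informal sign that a zero-inflated model may be worth considering.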
If you are using SAS version 9.2 or higher, you can run a zero-inflated Poisson model using proc genmod.
proc genmod data = fish;
model count = child camper /dist=zip;
zeromodel persons /link = logit ;
run;
The GENMOD Procedure
Model Information
Data Set WORK.FISH
Distribution Zero Inflated Poisson
Link Function Log
Dependent Variable count
Number of Observations Read 250
Number of Observations Used 250
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 2063.2168
Scaled Deviance 2063.2168
Pearson Chi-Square 245 1543.4597 6.2998
Scaled Pearson X2 245 1543.4597 6.2998
Log Likelihood 774.8999
Full Log Likelihood -1031.6084
AIC (smaller is better) 2073.2168
AICC (smaller is better) 2073.4627
BIC (smaller is better) 2090.8241
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square Pr > ChiSq
Intercept 1 1.5979 0.0855 1.4302 1.7655 348.96 <.0001
child 1 -1.0428 0.1000 -1.2388 -0.8469 108.78 <.0001
camper 1 0.8340 0.0936 0.6505 1.0175 79.35 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE: The scale parameter was held fixed.
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square Pr > ChiSq
Intercept 1 1.2974 0.3739 0.5647 2.0302 12.04 0.0005
persons 1 -0.5643 0.1630 -0.8838 -0.2449 11.99 0.0005
The output begins with a summary of the model and the data.
This is followed by a list of goodness of fit statistics. The next block of output includes parameter estimates from the count portion of the model. It also includes the standard errors, Wald 95% confidence intervals, Wald Chi-square statistics, and p-values for the parameter estimates.
The last block of output corresponds to the zero-inflation portion of the model. This is a logistic model predicting the zeroes. The output includes parameter estimates for the inflation model predictors and their standard errors, Wald 95% confidence intervals, Wald Chi-square statistics, and p-values.
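The two portions of the model can be written out as follows. This is the standard form of a zero-inflated Poisson model with a log link for the count portion and a logit link for the inflation portion. The symbols used here (pi for the probability of an excess zero, mu for the Poisson mean, beta and gamma for the coefficients) are notation assumed for this sketch and do not appear in the SAS output.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\mu_i}, \\
P(Y_i = k) &= (1 - \pi_i)\, \frac{e^{-\mu_i}\, \mu_i^{k}}{k!}, \qquad k = 1, 2, \dots \\
\log \mu_i &= \beta_0 + \beta_1\, \text{child}_i + \beta_2\, \text{camper}_i, \\
\operatorname{logit}(\pi_i) &= \gamma_0 + \gamma_1\, \text{persons}_i .
\end{aligned}
```

Under this parameterization, the negative coefficient for persons (-0.5643) in the zero-inflation portion means that larger groups are less likely to be excess zeros.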
All of the predictors in both the count and inflation portions of the model are statistically significant. This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can run the null model (a model without any predictors) and compare it with the current model using a chi-squared test on twice the difference in log likelihoods.
proc genmod data = fish;
model count = /dist=zip;
zeromodel /link = logit ;
run;
The GENMOD Procedure
Model Information
Data Set WORK.FISH
Distribution Zero Inflated Poisson
Link Function Log
Dependent Variable count
Number of Observations Read 250
Number of Observations Used 250
Criteria For Assessing Goodness Of Fit
Criterion DF Value Value/DF
Deviance 2254.0459
Scaled Deviance 2254.0459
Pearson Chi-Square 248 1918.7890 7.7371
Scaled Pearson X2 248 1918.7890 7.7371
Log Likelihood 679.4854
Full Log Likelihood -1127.0229
AIC (smaller is better) 2258.0459
AICC (smaller is better) 2258.0945
BIC (smaller is better) 2265.0888
Algorithm converged.
Analysis Of Maximum Likelihood Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square Pr > ChiSq
Intercept 1 2.0316 0.0349 1.9631 2.1000 3388.16 <.0001
Scale 0 1.0000 0.0000 1.0000 1.0000
NOTE: The scale parameter was held fixed.
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates
Standard Wald 95% Confidence Wald
Parameter DF Estimate Error Limits Chi-Square Pr > ChiSq
Intercept 1 0.2728 0.1277 0.0225 0.5232 4.56 0.0327
The log likelihoods for the full model and null model are -1031.6084 and -1127.0229, respectively. The chi-squared value is 2*(-1031.6084 - (-1127.0229)) = 190.829. Since the full model has three more parameters than the null model (child and camper in the count portion and persons in the zero-inflation portion), the degrees of freedom for the chi-squared test is 3. This yields a p-value <.0001. Thus, our overall model is statistically significant.
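The likelihood ratio test above can be reproduced in a short data step. This is a sketch that simply plugs in the two Full Log Likelihood values reported by proc genmod. The variable names (ll_full, ll_null, chisq, df, p) are chosen here for illustration.

```sas
data _null_;
  ll_full = -1031.6084;              /* Full Log Likelihood, model with predictors */
  ll_null = -1127.0229;              /* Full Log Likelihood, intercept-only model  */
  chisq   = 2 * (ll_full - ll_null); /* likelihood ratio chi-squared statistic     */
  df      = 3;                       /* child, camper, persons                     */
  p       = sdf('CHISQ', chisq, df); /* upper-tail probability                     */
  put chisq= df= p=;
run;
```

The sdf function returns the upper-tail probability directly, which avoids the loss of precision that 1 - probchi(chisq, df) would suffer when the p-value is very small.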
Proc countreg is another option for running a zero-inflated Poisson regression in SAS (again, version 9.2 or higher). This procedure allows a few more options specific to count outcomes than proc genmod. The proc countreg code for the original model run on this page appears below. We indicate method = qn to specify the quasi-Newton optimization process that matches the proc genmod results.
proc countreg data = fish method = qn;
model count = child camper / dist= zip;
zeromodel count ~ persons;
run;
The COUNTREG Procedure
Model Fit Summary
Dependent Variable count
Number of Observations 250
Data Set WORK.FISH
Model ZIP
ZI Link Function Logistic
Log Likelihood -1032
Maximum Absolute Gradient 4.61991E-7
Number of Iterations 15
Optimization Method Dual Quasi-Newton
AIC 2073
SBC 2091
Algorithm converged.
Parameter Estimates
Standard Approx
Parameter DF Estimate Error t Value Pr > |t|
Intercept 1 1.597889 0.085538 18.68 <.0001
child 1 -1.042838 0.099988 -10.43 <.0001
camper 1 0.834022 0.093627 8.91 <.0001
Inf_Intercept 1 1.297439 0.373850 3.47 0.0005
Inf_persons 1 -0.564347 0.162962 -3.46 0.0005
For those using a version of SAS prior to 9.2, a zero-inflated Poisson model is doable, though significantly more difficult. Please see this code fragment: Zero-inflated Poisson and Negative Binomial Using Proc Nlmixed.
The zero-inflated Poisson regression model predicting fish caught (count) from child, camper, and persons was statistically significant (chi-squared = 190.829, df = 3, p<.0001). The predictor of excess zeros, persons, was statistically significant. The count predictors child and camper were also each statistically significant. For these data, the expected change in log(count) for a one-unit increase in child was -1.0428, holding camper constant. Groups that brought a camper (camper = 1) had an expected log count 0.8340 higher than groups that did not (camper = 0), holding child constant.
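Coefficients on the log and logit scales are often easier to read after exponentiating. The sketch below is an illustration and not part of the original analysis. It plugs the reported estimates into a data step to obtain rate ratios for the count portion, an odds ratio for the zero-inflation portion, and the implied probability of an excess zero for groups of one to four people. The variable names are hypothetical.

```sas
data _null_;
  /* Count portion: exponentiated coefficients are rate ratios. */
  irr_child  = exp(-1.0428);
  irr_camper = exp(0.8340);
  /* Zero-inflation portion: the exponentiated coefficient is an odds ratio. */
  or_persons = exp(-0.5643);
  put irr_child= 6.4 irr_camper= 6.4 or_persons= 6.4;

  /* Implied probability of an excess zero for each observed group size. */
  do persons = 1 to 4;
    xb = 1.2974 - 0.5643 * persons;
    p_excess_zero = 1 / (1 + exp(-xb));
    put persons= p_excess_zero= 6.4;
  end;
run;
```

The rate ratios come out to roughly 0.35 for child and 2.30 for camper, and the odds ratio for persons is roughly 0.57. In other words, each additional child is associated with about a 65% lower expected count, bringing a camper with an expected count about 2.3 times as high, and each additional person with about 43% lower odds of being an excess zero.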
These pages are Copyrighted (c) by UCLA Academic Technology Services