|
|
|
||||
|
|
|||||
Note: the code on this page works with R 2.4.1.
Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.
Example 2. The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.
We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).
In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables child, persons, and camper in our model.
Let's look at the data. For a nice layout for displaying the summary statistics, we made use of the package called fields. We can also look at the variance and a histogram of our outcome variable count.
zinb<-read.table("http://www.ats.ucla.edu/stat/R/dae/fish.csv", sep=",", header = TRUE)
attach(zinb)
library(fields)
c<-cbind(count, child, persons, camper)
stats(c)
count child persons camper
N 250.00000 250.0000000 250.000000 250.0000000
mean 3.29600 0.6840000 2.528000 0.5880000
Std.Dev. 11.63503 0.8503153 1.112730 0.4931824
min 0.00000 0.0000000 1.000000 0.0000000
Q1 0.00000 0.0000000 2.000000 0.0000000
median 0.00000 0.0000000 2.000000 1.0000000
Q3 2.00000 1.0000000 4.000000 1.0000000
max 149.00000 3.0000000 4.000000 1.0000000
missing values 0.00000 0.0000000 0.000000 0.0000000
hist(count, breaks=50, col="gray")
Though we can run a Poisson regression in R using the glm function in one of the base packages, we will need to install an additional R package to run a zero-inflated Poisson regression. We have opted for pscl. To install this, enter install.packages("pscl"). In order to load the pscl package, you may need to install additional packages in the same manner.
library(pscl)
zip<-zeroinfl(count ~ child + camper | persons)
summary(zip)
Call:
zeroinfl(formula = count ~ child + camper | persons)
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.59789 0.08554 18.680 <2e-16 ***
child -1.04284 0.09999 -10.430 <2e-16 ***
camper 0.83402 0.09363 8.908 <2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.2974 0.3739 3.470 0.000520 ***
persons -0.5643 0.1630 -3.463 0.000534 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 10
Log-likelihood: -1032 on 5 Df
The output looks very much like the output from two OLS regressions in R.
Below the model call, you will find a block of output containing Poisson regression coefficients for each of the variables along with standard errors, z-scores, and p-values for the coefficients. A second block follows that corresponds to the inflation model. This includes logit coefficients for predicting excess zeros along with their standard errors, z-scores, and p-values.
All of the predictors in both the count and inflation portions of the model are statistically significant. This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can compare with the current model to a null model without predictors using chi-squared test on the difference of log likelihoods.
library(lmtest) lrtest(zip) Likelihood ratio test Model 1: count ~ child + camper | persons Model 2: count ~ 1 #Df LogLik Df Chisq Pr(>Chisq) 1 5 -1031.6 2 2 -1127.0 -3 190.83 < 2.2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output above, we can see that the chi-squared value 190.83. Since we have three predictor variables in the full model, the degrees of freedom for the chi-squared test is 3. This yields a p-value <.0001. Thus, our overall model is statistically significant.
Note that the model output above does not indicate in any way if our zero-inflated model is an improvement over a standard Poisson regression. We can determine this by running the corresponding standard Poisson model and then performing a Vuong test of the two models.
p1<-glm(count ~ child + camper, family=poisson)
summary(p1)
Call:
glm(formula = count ~ child + camper, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7736 -2.2293 -1.2024 -0.3498 24.9492
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.91026 0.08119 11.21 <2e-16 ***
child -1.23476 0.08029 -15.38 <2e-16 ***
camper 1.05267 0.08871 11.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2958.4 on 249 degrees of freedom
Residual deviance: 2380.1 on 247 degrees of freedom
AIC: 2723.2
Number of Fisher Scoring iterations: 6
vuong(p1, zip)
Vuong Non-Nested Hypothesis Test-Statistic: -3.57426
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model2 > model1, with p-value 0.0001756096
The Vuong test compares the zero-inflated model with an ordinary Poisson regression model. In this example, we can see that our test statistic is significant, indicating that the zero-inflated model is superior to the standard Poisson model.
The zero-inflated Poisson regression model predicting fish caught (count) from child, camper, and persons was statistically significant (chi-squared = 190.83, df = 3, p<.01). The predictor of excess zeros, persons, was statistically significant. The count predictors child and camper were also each statically significant. For these data, the expected change in log(count) for a one-unit increase in child was -1.04284. Groups with campers (camper = 1) had an expected log count 0.83402 higher than groups without campers (camper = 0). The Vuong test suggests that our zero-inflated model is a significant improvement over a standard Poisson model.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services