|
|
|
||||
|
|
|||||
Note: the code on this page works with R 2.4.1.
Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.
Example 2. The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.
We have data on 250 groups that went to a park. Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).
In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables child, persons, and camper in our model.
Let's look at the data. For a nice layout for displaying the summary statistics, we made use of the package called fields. We can also look at the variance and a histogram of our outcome variable count.
zinb<-read.table("http://www.ats.ucla.edu/stat/R/dae/fish.csv", sep=",", header = TRUE)
attach(zinb)
library(fields)
c<-cbind(count, child, persons, camper)
stats(c)
count child persons camper
N 250.00000 250.0000000 250.000000 250.0000000
mean 3.29600 0.6840000 2.528000 0.5880000
Std.Dev. 11.63503 0.8503153 1.112730 0.4931824
min 0.00000 0.0000000 1.000000 0.0000000
Q1 0.00000 0.0000000 2.000000 0.0000000
median 0.00000 0.0000000 2.000000 1.0000000
Q3 2.00000 1.0000000 4.000000 1.0000000
max 149.00000 3.0000000 4.000000 1.0000000
missing values 0.00000 0.0000000 0.000000 0.0000000
var(count)
[1] 135.3739
hist(count, breaks=50, col="gray")
We will need to install an additional R package to run a zero-inflated negative binomial regression. We have opted for pscl. To install this, enter install.packages("pscl"). In order to load the pscl package, you may need to install additional packages in the same manner. We begin by estimating the model with the variables of interest.
library(pscl)
zinb<-zeroinfl(count ~ child + camper | persons, dist = "negbin", EM = TRUE)
summary(zinb)
Call:
zeroinfl(formula = count ~ child + camper | persons, dist = "negbin",
EM = TRUE)
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3711 0.2561 5.353 8.63e-08 ***
child -1.5152 0.1956 -7.747 9.42e-15 ***
camper 0.8790 0.2693 3.264 0.00110 **
Log(theta) -0.9854 0.1759 -5.601 2.14e-08 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.6028 0.8363 1.917 0.0553 .
persons -1.6662 0.6789 -2.454 0.0141 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 0.3733
Number of iterations in BFGS optimization: 2
Log-likelihood: -432.9 on 6 Df
The output looks very much like the output from two OLS regressions in R.
Below the model call, you will find a block of output containing negative binomial regression coefficients for each of the variables along with standard errors, z-scores, and p-values for the coefficients. A second block follows that corresponds to the inflation model. This includes logit coefficients for predicting excess zeros along with their standard errors, z-scores, and p-values.
All of the predictors in both the count and inflation portions of the model are statistically significant. This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can compare with the current model to a null model without predictors using chi-squared test on the difference of log likelihoods.
library(lmtest) lrtest(zinb) Likelihood ratio test Model 1: count ~ child + camper | persons Model 2: count ~ 1 #Df LogLik Df Chisq Pr(>Chisq) 1 6 -432.89 2 3 -464.44 -3 63.097 1.280e-13 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the output above, we can see that the chi-squared value 63.097. Since we have three predictor variables in the full model, the degrees of freedom for the chi-squared test is 3. This yields a p-value <.0001. Thus, our overall model is statistically significant.
Note that the model output above does not indicate in any way if our zero-inflated model is an improvement over a standard negative binomial regression. We can determine this by running the corresponding standard negative binomial model and then performing a Vuong test of the two models. We require the MASS package to run the standard negative binomial regression.
library(MASS)
nb<-glm.nb(count ~ child + camper)
summary(nb)
Call:
glm.nb(formula = count ~ child + camper, init.theta = 0.255293111911724,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3141 -1.0361 -0.7266 -0.1720 4.0163
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0727 0.2425 4.424 9.69e-06 ***
child -1.3753 0.1958 -7.025 2.14e-12 ***
camper 0.9094 0.2836 3.206 0.00135 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.2553) family taken to be 1)
Null deviance: 258.93 on 249 degrees of freedom
Residual deviance: 201.89 on 247 degrees of freedom
AIC: 887.42
Number of Fisher Scoring iterations: 1
Correlation of Coefficients:
(Intercept) child
child -0.39
camper -0.72 -0.03
Theta: 0.2553
Std. Err.: 0.0329
2 x log-likelihood: -879.4210
vuong(nb, zinb)
Vuong Non-Nested Hypothesis Test-Statistic: -1.701802
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
in this case:
model2 > model1, with p-value 0.04439626
The zero-inflated negative binomial regression model predicting fish caught (count) from child, camper, and persons was statistically significant (both with and without robust standard errors). The predictor of excess zeros, persons, was statistically significant. The count predictors child and camper were also each statically significant. For these data, the expected change in log(count) for a one-unit increase in child was -1.5152. Groups with campers (camper = 1) had an expected log count 0.8790 higher than groups without campers (camper = 0). We can see in our model that the dispersion parameter Log(theta) is significantly different from zero. This suggests that our outcome is overdispersed and that a negative binomial model is more appropriate than a Poisson model. The Vuong suggests that our zero-inflated model is a significant improvement over a standard negative binomial model.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services