UCLA Academic Technology Services HomeServicesClassesContactJobs

R Data Analysis Examples
Zero-Inflated Negative Binomial Regression

Note: the code on this page works with R 2.4.1.


Examples of Negative Binomial Regression

Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.

Example 2. The state wildlife biologists want to model how many fish are being caught by fishermen at a state park. Visitors are asked how long they stayed, how many people were in the group, were there children in the group and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish so there are excess zeros in the data because of the people that did not fish.

Description of the Data

Let's pursue Example 2 from above. 

We have data on 250 groups that went to a park.  Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper).   

In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros, i.e., the probability that a group caught zero fish. We will use the variables child, persons, and camper in our model.

Let's look at the data. For a nice layout for displaying the summary statistics, we made use of the package called fields. We can also look at the variance and a histogram of our outcome variable count.

Some Strategies You Might Be Tempted To Try

Before we show how you can analyze this with a negative binomial regression analysis, let's consider some other methods that you might use.

R Negative Binomial Regression Analysis

We will need to install an additional R package to run a zero-inflated negative binomial regression.  We have opted for pscl.  To install this, enter install.packages("pscl"). In order to load the pscl package, you may need to install additional packages in the same manner. We begin by estimating the model with the variables of interest.

library(pscl)
zinb<-zeroinfl(count ~ child + camper | persons, dist = "negbin", EM = TRUE)
summary(zinb)
	
Call:
zeroinfl(formula = count ~ child + camper | persons, dist = "negbin", 
    EM = TRUE)


Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.3711     0.2561   5.353 8.63e-08 ***
child        -1.5152     0.1956  -7.747 9.42e-15 ***
camper        0.8790     0.2693   3.264  0.00110 ** 
Log(theta)   -0.9854     0.1759  -5.601 2.14e-08 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   1.6028     0.8363   1.917   0.0553 .
persons      -1.6662     0.6789  -2.454   0.0141 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 0.3733 
Number of iterations in BFGS optimization: 2 
Log-likelihood: -432.9 on 6 Df
The output looks very much like the output from two OLS regressions in R.

Below the model call, you will find a block of output containing negative binomial regression coefficients for each of the variables along with standard errors, z-scores, and p-values for the coefficients. A second block follows that corresponds to the inflation model.  This includes logit coefficients for predicting excess zeros along with their standard errors, z-scores, and p-values.

All of the predictors in both the count and inflation portions of the model are statistically significant.  This model fits the data significantly better than the null model, i.e., the intercept-only model. To show that this is the case, we can compare with the current model to a null model without predictors using chi-squared test on the difference of log likelihoods.


library(lmtest)
lrtest(zinb)

Likelihood ratio test

Model 1: count ~ child + camper | persons
Model 2: count ~ 1
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   6 -432.89                         
2   3 -464.44 -3 63.097  1.280e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output above, we can see that the chi-squared value 63.097. Since we have three predictor variables in the full model, the degrees of freedom for the chi-squared test is 3. This yields a p-value <.0001.  Thus, our overall model is statistically significant.

Note that the model output above does not indicate in any way if our zero-inflated model is an improvement over a standard negative binomial regression.  We can determine this by running the corresponding standard negative binomial model and then performing a Vuong test of the two models.  We require the MASS package to run the standard negative binomial regression.

library(MASS)
nb<-glm.nb(count ~ child + camper)
summary(nb)

Call:
glm.nb(formula = count ~ child + camper, init.theta = 0.255293111911724, 
    link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3141  -1.0361  -0.7266  -0.1720   4.0163  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0727     0.2425   4.424 9.69e-06 ***
child        -1.3753     0.1958  -7.025 2.14e-12 ***
camper        0.9094     0.2836   3.206  0.00135 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for Negative Binomial(0.2553) family taken to be 1)

    Null deviance: 258.93  on 249  degrees of freedom
Residual deviance: 201.89  on 247  degrees of freedom
AIC: 887.42

Number of Fisher Scoring iterations: 1

Correlation of Coefficients:
       (Intercept) child
child  -0.39            
camper -0.72       -0.03


              Theta:  0.2553 
          Std. Err.:  0.0329 

 2 x log-likelihood:  -879.4210

vuong(nb, zinb)

Vuong Non-Nested Hypothesis Test-Statistic: -1.701802 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
in this case:
model2 > model1, with p-value 0.04439626

Sample Write-Up of the Analysis

The zero-inflated negative binomial regression model predicting fish caught (count) from child, camper, and persons was statistically significant (both with and without robust standard errors). The predictor of excess zeros, persons, was statistically significant. The count predictors child and camper were also each statically significant. For these data, the expected change in log(count) for a one-unit increase in child was -1.5152. Groups with campers (camper = 1) had an expected log count 0.8790 higher than groups without campers (camper = 0).  We can see in our model that the dispersion parameter Log(theta) is significantly different from zero.  This suggests that our outcome is overdispersed and that a negative binomial model is more appropriate than a Poisson model.  The Vuong suggests that our zero-inflated model is a significant improvement over a standard negative binomial model.

Cautions, Flies in the Ointment

  • It is not recommended that zero-inflated negative binomial models be applied to small samples. What constitutes a small sample does not seem to be clearly defined in the literature.
  • See Also

     

    How to cite this page

    Report an error on this page

    UCLA Researchers are invited to our Statistical Consulting Services
    We recommend others to our list of Other Resources for Statistical Computing Help
    These pages are Copyrighted (c) by UCLA Academic Technology Services


    The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.