Regression Models with Count Data

Statistical Consulting Group
UCLA Academic Technology Services

April 2007






    About This Presentation

  • It is a broad survey of count regression models

  • It is designed to demonstrate the range of analyses available for count regression models

  • It is not an in-depth statistical presentation

  • It is not a how-to manual that will train you in count data analysis


    Why Use Count Regression Models


  • Count data is common in many disciplines

  • Count models can be used for rate data in many instances by using exposure

  • Count data is often analyzed incorrectly with OLS regression



    Regression Models with Count Data Outline

  • Poisson Regression

  • Negative Binomial Regression

  • Zero-Inflated Count Models
  • Zero-Truncated Count Models
  • Hurdle Models

  • Random-effects Count Models


    Poisson Distribution

    The poisson probability distribution.
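
    Written in its standard textbook form, the poisson probability mass function for a count y with parameter λ is shown below. The symbols Y and y are notation assumed here; only λ is named in this presentation.

```latex
P(Y = y \mid \lambda) = \frac{e^{-\lambda}\,\lambda^{y}}{y!},
\qquad y = 0, 1, 2, \ldots
```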



  • λ is the mean or expected value of a poisson distribution

  • λ is also the variance of a poisson distribution

  • The poisson distribution has a single parameter, λ (lambda)
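
    The properties listed above can be checked numerically. The following is a minimal sketch, not part of the original presentation: the function name poisson_pmf and the value λ = 2.5 are illustrative assumptions.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a poisson distribution with mean lam."""
    # Work on the log scale so that a large y does not overflow the factorial.
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

lam = 2.5  # illustrative value, not taken from the presentation
support = range(0, 100)  # mass beyond 100 is negligible for this lam

mean = sum(y * poisson_pmf(y, lam) for y in support)
variance = sum((y - mean) ** 2 * poisson_pmf(y, lam) for y in support)

# Both numbers should agree with lam, the single parameter.
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```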


    Likelihood function for the poisson model.
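
    For reference, the usual form of the poisson regression likelihood is sketched below. It assumes the standard log link, λ_i = exp(x_i β); the symbols x_i, β and N are notation introduced here rather than taken from the slides.

```latex
L(\beta \mid y, X) = \prod_{i=1}^{N} \frac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!},
\qquad \lambda_i = \exp(x_i \beta)

\ln L(\beta \mid y, X) = \sum_{i=1}^{N}
  \left[\, y_i\, x_i \beta - \exp(x_i \beta) - \ln(y_i!) \,\right]
```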








    Some History of the Poisson Distribution



    The poisson distribution was first published by Siméon-Denis Poisson in 1837.



    A famous example comes from an 1898 book by Ladislaus von Bortkiewicz, The Law of Small Numbers, which showed that the number of soldiers killed by horse kicks each year in Prussian army corps followed a poisson distribution.





    Poisson Examples




    As a comparison, here is a normal distribution with the same mean and variance as the poisson distribution above.





    Negative Binomial Distribution


    One formulation of the negative binomial distribution can be used to model count data with overdispersion.
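
    One common parameterization, often called NB2, is sketched below with mean λ and overdispersion parameter α. That this is the formulation the slide displayed is an assumption, though it matches the two parameters this presentation names.

```latex
P(Y = y \mid \lambda, \alpha) =
  \frac{\Gamma(y + \alpha^{-1})}{y!\;\Gamma(\alpha^{-1})}
  \left( \frac{\alpha^{-1}}{\alpha^{-1} + \lambda} \right)^{\alpha^{-1}}
  \left( \frac{\lambda}{\alpha^{-1} + \lambda} \right)^{y},
\qquad
\operatorname{Var}(Y) = \lambda + \alpha \lambda^{2}
```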



  • The negative binomial distribution has two parameters: λ and α

  • λ is the mean or expected value of the distribution

  • α is the overdispersion parameter

  • When α = 0 the negative binomial distribution is the same as a poisson distribution
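
    The bullets above can be illustrated numerically. This is a sketch under the assumption that the NB2 parameterization (variance λ + αλ²) is intended; the function names and the values λ = 3 and α = 0.5 are illustrative only.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a poisson distribution with mean lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def negbin_pmf(y: int, lam: float, alpha: float) -> float:
    """NB2 pmf with mean lam and overdispersion alpha > 0."""
    r = 1.0 / alpha  # the "size" parameter, alpha^{-1}
    log_p = (math.lgamma(y + r) - math.lgamma(y + 1) - math.lgamma(r)
             + r * math.log(r / (r + lam))
             + y * math.log(lam / (r + lam)))
    return math.exp(log_p)

lam, alpha = 3.0, 0.5  # illustrative values
support = range(0, 400)  # mass beyond 400 is negligible here

mean = sum(y * negbin_pmf(y, lam, alpha) for y in support)
variance = sum((y - mean) ** 2 * negbin_pmf(y, lam, alpha) for y in support)
# The mean stays at lam, while the variance is inflated to lam + alpha * lam**2.
print(f"mean = {mean:.4f}, variance = {variance:.4f}")

# As alpha shrinks toward 0, the pmf approaches the poisson pmf.
gap = max(abs(negbin_pmf(y, lam, 1e-6) - poisson_pmf(y, lam)) for y in range(30))
print(f"largest pmf gap at alpha = 1e-6: {gap:.2e}")
```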


    Likelihood function for the negative binomial model.








    Negative Binomial Examples









    Maximum Likelihood Estimation

    Count models are estimated using maximum likelihood.

    Log likelihood:

  • Computationally, the log of the likelihood function is easier to work with.
  • Analysis proceeds iteratively until the log likelihood converges.
  • Larger log likelihoods (that is, values closer to zero) are better.

    BIC:
  • Schwarz's Bayesian information criterion: BIC = -2*ln(L) + k*ln(n), where L is the maximized likelihood, k is the number of estimated parameters, and n is the number of observations.
  • Given any two estimated models, the model with the lower BIC is preferred.
  • BIC can be used to compare different models, even non-nested ones.
  • Strength of evidence implied by the absolute difference in BIC between two models: 0-2 Weak, 2-6 Positive, 6-10 Strong, and >10 Very Strong.
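
    The BIC formula and the evidence scale can be sketched in a few lines. The function names are hypothetical, and assigning each boundary value (2, 6, 10) to the lower category is an assumption, since the scale above leaves the boundaries ambiguous. The inputs are the log likelihoods and parameter counts reported for the deaths example in this presentation (n = 10).

```python
import math

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Schwarz's BIC: -2*ln(L) + k*ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)

def evidence(bic_difference: float) -> str:
    """Label an absolute BIC difference using the 0-2 / 2-6 / 6-10 / >10 scale."""
    d = abs(bic_difference)
    if d <= 2:
        return "Weak"
    if d <= 6:
        return "Positive"
    if d <= 10:
        return "Strong"
    return "Very Strong"

# Values from the deaths example: poisson (df = 6) and negative binomial (df = 7).
bic_poisson = bic(-33.60015, k=6, n=10)
bic_negbin = bic(-33.60014, k=7, n=10)

print(f"poisson BIC = {bic_poisson:.4f}")
print(f"negative binomial BIC = {bic_negbin:.4f}")
print("evidence for the lower-BIC model:", evidence(bic_negbin - bic_poisson))
```

    The extra parameter costs ln(10), about 2.3 BIC points, which is nearly the entire difference between the two models because their log likelihoods are almost identical.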

    Sample size:
  • The small sample behavior of ML estimators for count models is largely unknown.
  • It is risky to use ML with samples smaller than 100.
  • Samples over 500 seem adequate.





    Soccer Goals Example

    Number of goals scored for each game in a season (mean = 2.16, variance = 3.00)
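
    A quick check of overdispersion can be made from these two summary statistics alone. The sketch below is not a fitted model: it uses a method-of-moments estimate of α, which assumes the negative binomial variance has the form λ + αλ², and the variable names are illustrative.

```python
mean_goals = 2.16  # reported mean
var_goals = 3.00   # reported variance

# Under a poisson distribution the variance equals the mean, so this ratio would be 1.
dispersion_ratio = var_goals / mean_goals

# Solving var = mean + alpha * mean**2 for alpha (method of moments, not ML).
alpha_mom = (var_goals - mean_goals) / mean_goals ** 2

print(f"variance / mean = {dispersion_ratio:.3f}")
print(f"moment estimate of alpha = {alpha_mom:.3f}")
```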




    Number of goals displayed as a connected line graph




    Observed number of goals overlaid with a poisson distribution




    Observed number of goals overlaid with a negative binomial distribution



    Exposure

    Count models need a mechanism to deal with the fact that counts can be made over observation periods, or other measures of exposure, of different sizes. For example, suppose the number of accidents is recorded for 50 different intersections. The number of vehicles that pass through the intersections can vary greatly: fifteen accidents for 30,000 vehicles is very different from 15 accidents for 1,500 vehicles. Count models account for these differences by including the log of the exposure variable in the model with its coefficient constrained to be one.

    The use of exposure is superior in many instances to analyzing rates as response variables because it makes use of the correct probability distributions. It should be noted that exposure adjusts counts on the response variable; it is still possible to use various kinds of rates, indexes, or per capita measures as predictors.




    Deaths

    The response variable is the number of deaths recorded in each combination of five age groups and two smoking categories. The difference in the number of patient years will be accounted for with an exposure variable, pyears. Below, note that rows 1 and 10 have almost identical numbers of deaths but very different values for patient years.

       agecat     smokes     deaths     pyears
            1          1         32     52,407
            2          1        104     43,248
            3          1        206     28,612
            4          1        186     12,663
            5          1        102      5,317
            1          0          2     18,790
            2          0         12     10,673
            3          0         28      5,710
            4          0         28      2,585
            5          0         31      1,462
      
        variable |         N      mean  variance       min       max
    -------------+--------------------------------------------------
          deaths |        10      73.1  5390.767         2       206
          pyears |        10   18146.7  3.16e+08      1462     52407
    ----------------------------------------------------------------
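
    To see why exposure matters for these data, the crude death rates can be computed directly from the table. This is a simple descriptive sketch, not one of the models in this presentation; the per-1,000 scaling is an arbitrary choice for readability.

```python
# (agecat, smokes, deaths, pyears) rows copied from the table above.
rows = [
    (1, 1, 32, 52407), (2, 1, 104, 43248), (3, 1, 206, 28612),
    (4, 1, 186, 12663), (5, 1, 102, 5317),
    (1, 0, 2, 18790), (2, 0, 12, 10673), (3, 0, 28, 5710),
    (4, 0, 28, 2585), (5, 0, 31, 1462),
]

# Deaths per 1,000 patient years for each row.
rates = [1000 * deaths / pyears for _, _, deaths, pyears in rows]

for (agecat, smokes, deaths, pyears), rate in zip(rows, rates):
    print(f"agecat={agecat} smokes={smokes} deaths={deaths:3d} "
          f"pyears={pyears:6d} rate per 1,000 = {rate:6.2f}")
```

    Rows 1 and 10 have 32 and 31 deaths respectively, yet their rates differ by a factor of more than 30 once patient years are taken into account.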

    The predictor variables are four age-group dummy variables and a dummy variable to indicate smokers.


    These data can be analyzed with either a poisson regression model or a negative binomial regression model.

         ll    df    BIC     model
     -33.60015  6   81.0158  poisson 
     -33.60014  7   83.3184  negative binomial

    There is not much difference between the two models based on the log likelihood and the BIC, but the poisson model has a slightly better BIC.

    Poisson regression                                Number of obs   =         10
                                                      LR chi2(5)      =     922.93
                                                      Prob > chi2     =     0.0000
    Log likelihood = -33.600153                       Pseudo R2       =     0.9321
    
    ------------------------------------------------------------------------------
          deaths |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          smokes |   .3545356   .1073741     3.30   0.001     .1440862     .564985
              a2 |   1.484007   .1951034     7.61   0.000     1.101611    1.866403
              a3 |   2.627505   .1837273    14.30   0.000     2.267406    2.987604
              a4 |   3.350493   .1847992    18.13   0.000     2.988293    3.712693
              a5 |   3.700096   .1922195    19.25   0.000     3.323353     4.07684
           _cons |  -7.919326   .1917618   -41.30   0.000    -8.295172   -7.543479
          pyears | (exposure)
    ------------------------------------------------------------------------------
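
    The coefficients above are on the log scale, so exponentiating them gives incidence rate ratios. The sketch below also reproduces fitted counts from the coefficients and the exposure. It assumes that a2 through a5 are dummies for age categories 2 through 5 with category 1 as the reference group, which the output suggests but does not state; the function and variable names are illustrative.

```python
import math

# Coefficients copied from the poisson regression output above.
coef = {
    "smokes": 0.3545356, "a2": 1.484007, "a3": 2.627505,
    "a4": 3.350493, "a5": 3.700096, "_cons": -7.919326,
}

# Incidence rate ratios: exp(coefficient) for every predictor.
irr = {name: math.exp(b) for name, b in coef.items() if name != "_cons"}

def expected_deaths(smokes: int, agecat: int, pyears: float) -> float:
    """Fitted count = pyears * exp(xb); ln(pyears) enters with coefficient 1."""
    xb = coef["_cons"] + coef["smokes"] * smokes
    if agecat > 1:  # age category 1 is assumed to be the reference group
        xb += coef[f"a{agecat}"]
    return pyears * math.exp(xb)

print(f"IRR for smokers = {irr['smokes']:.3f}")
print(f"fitted deaths, smokers in agecat 1 = {expected_deaths(1, 1, 52407):.1f} (observed 32)")

# (agecat, smokes, pyears) for all ten rows of the data.
cells = [(1, 1, 52407), (2, 1, 43248), (3, 1, 28612), (4, 1, 12663), (5, 1, 5317),
         (1, 0, 18790), (2, 0, 10673), (3, 0, 5710), (4, 0, 2585), (5, 0, 1462)]
total_fitted = sum(expected_deaths(s, a, p) for a, s, p in cells)
print(f"total fitted deaths = {total_fitted:.1f} (observed 731)")
```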



    Zero-inflated Models

    Zero-inflated models attempt to account for excess zeros, i.e., there are thought to be two kinds of zeros, "true zeros" and excess zeros. Zero-inflated models estimate two equations, one for the count model and one for the excess zeros.




    Fishing Example

    Somehow, it seems very appropriate to discuss fish and poisson in the same example.

    Response variable:
    count -- the number of fish caught by visitors to a state park

    Predictors:
    child -- number of children in group
    camper -- camping one or more nights during stay (0/1)
    persons -- number of people in group

    child is also used as the predictor of the excess zeros.

    In our example "true zeros" are obtained from people who fish but do not catch anything. The excess zeros come from those who do not fish at all. Separate equations are used to predict each kind of zero.
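
    The two kinds of zeros can be written as a mixture. The following is a minimal sketch of a zero-inflated poisson probability function with a constant excess-zero probability pi; in the fitted model that probability depends on child through a separate equation. The function names and the values of lam and pi are illustrative assumptions.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a poisson distribution with mean lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def zip_pmf(y: int, lam: float, pi: float) -> float:
    """Zero-inflated poisson: pi is the probability of an excess zero."""
    if y == 0:
        # Excess zeros (did not fish) plus "true zeros" (fished, caught nothing).
        return pi + (1.0 - pi) * poisson_pmf(0, lam)
    return (1.0 - pi) * poisson_pmf(y, lam)

lam, pi = 4.0, 0.3  # illustrative values, not estimates from the fishing data

excess_zeros = pi
true_zeros = (1.0 - pi) * poisson_pmf(0, lam)
total_mass = sum(zip_pmf(y, lam, pi) for y in range(200))

print(f"P(excess zero) = {excess_zeros:.4f}, P(true zero) = {true_zeros:.4f}")
print(f"probabilities sum to {total_mass:.6f}")
```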


    We will look at two count models, two zero-inflated models, and an OLS regression for comparison.

        ll    df     BIC    model
     -749.3503 4  1520.754  ols
     -645.2568 4  1312.567  poisson 
     -391.0271 5   809.621  negative binomial
     -561.5176 6  1156.116  zero-inflated poisson
     -384.8586 7   808.311  zero-inflated negative binomial

    The results for the zero-inflated negative binomial are given below. Note that this model is only marginally better than the ordinary negative binomial model and that the Vuong test is not significant, even though the data were clearly generated by a process that leads to inflated zeros.



    Here are the results from several of the other models, so that we can look at the differences in the output.




    Days Absent From School

    Days absent from school during one school year (mean = 5.81, variance = 55.49)

    Predictors of the count variable are female, school and reading.

    Several variables were tried as predictors of excess zeros for zero-inflated models. The large number of zeros in the data might seem to suggest some type of zero-inflated model. However, the zero-inflated models were not used, both for empirical reasons and because there did not seem to be a reasonable way to explain what kind of process would generate excess zeros in days absent.



    Histogram of days absent by school.



    We will look at five different models: two count models, two zero-inflated count models, and an OLS regression thrown in for good measure.

        ll    df     BIC    model
    -1060.365  4  2143.75   ols regression  
    -1435.846  3  2894.72   poisson  
     -867.240  4  1763.26   negative binomial 
    -1278.182  5  2590.90   zero-inflated poisson       
     -867.200  6  1774.69   zero-inflated negative binomial

    The negative binomial and the zero-inflated negative binomial are very close in log likelihood and BIC. There is a slight edge to the straight negative binomial because it uses fewer degrees of freedom.






    Zero-truncated Models

    There are a number of interesting situations in which the count variable cannot take on the value zero.

    Take, for example, the response variable length of hospital stay (mean = 5.37, variance = 49.24). The predictor variables are age, hmo and died (died before discharge). Note that there are no zero counts in the data.


    Note that both the ordinary poisson and negative binomial distributions assign a nonzero probability to a zero length of hospital stay, which is why zero-truncated versions are needed. The negative binomial provides a closer fit to the observed data than does the poisson.
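
    Zero truncation amounts to removing the probability at zero and rescaling the rest. The sketch below shows this for the poisson case; the function names and the value of lam are illustrative assumptions, not estimates from the hospital stay data.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a poisson distribution with mean lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def ztp_pmf(y: int, lam: float) -> float:
    """Zero-truncated poisson: P(Y = y | Y > 0)."""
    if y < 1:
        return 0.0  # zero is not a possible outcome
    return poisson_pmf(y, lam) / (1.0 - poisson_pmf(0, lam))

lam = 1.5  # illustrative value

total_mass = sum(ztp_pmf(y, lam) for y in range(200))
truncated_mean = sum(y * ztp_pmf(y, lam) for y in range(200))

print(f"probabilities sum to {total_mass:.6f}")
# The truncated mean exceeds lam because the zeros have been removed.
print(f"mean of truncated distribution = {truncated_mean:.4f}")
```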

        ll     df      BIC   model
    -6908.799   4  13846.83  zero-truncated poisson  
    -4755.28    5   9547.10  zero-truncated negative binomial

    The results for the zero-truncated negative binomial are given below.




    Hurdle Models

    A hurdle model is a modified count model in which there are two processes, one generating the zeros and one generating the positive values. The two models are not constrained to be the same. The concept underlying the hurdle model is that a binomial probability model governs the binary outcome of whether a count variable has a zero or a positive value. If the value is positive, the "hurdle is crossed," and the conditional distribution of the positive values is governed by a zero-truncated count model.
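
    The two-part structure described above can be sketched as a probability function. Here the zero probability p_zero is a constant and the positive counts follow a zero-truncated poisson; in a fitted hurdle model both parts would depend on predictors. All names and values are illustrative assumptions.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a poisson distribution with mean lam."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def hurdle_pmf(y: int, p_zero: float, lam: float) -> float:
    """Hurdle model: a binary part for zero vs. positive, then a zero-truncated poisson."""
    if y == 0:
        return p_zero  # the hurdle is not crossed
    # The hurdle is crossed: positive values follow a zero-truncated poisson.
    truncated = poisson_pmf(y, lam) / (1.0 - poisson_pmf(0, lam))
    return (1.0 - p_zero) * truncated

p_zero, lam = 0.3, 4.0  # illustrative values
total_mass = sum(hurdle_pmf(y, p_zero, lam) for y in range(200))
print(f"P(Y = 0) = {hurdle_pmf(0, p_zero, lam):.3f}, probabilities sum to {total_mass:.6f}")
```

    Unlike the zero-inflated model, every zero here comes from the binary part, so the zero probability can be either larger or smaller than a plain poisson would imply.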

    We will illustrate this model using artificially generated data with response variable y and predictors x1 and x2.

              y |      Freq.
    ------------+---------------
              0 |        315       
              1 |        123       
              2 |        138       
              3 |        117       
              4 |         88       
              5 |         60       
              6 |         35       
              7 |         23       
              8 |         24       
              9 |         19       
             10 |         16       
             11 |         11       
             12 |         10       
             13 |          5       
             14 |          1       
             15 |          4       
             16 |          1       
             19 |          2       
             21 |          2       
             22 |          2       
             26 |          1       
             32 |          2       
             34 |          1       
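
    A rough look at the frequency table shows why a plain poisson model struggles with these data. The sketch below compares the observed share of zeros with the share a poisson distribution with the same mean would produce. It ignores the predictors x1 and x2, so it is only a descriptive check and not one of the fitted models.

```python
import math

# Frequencies copied from the table above (value of y -> number of observations).
freq = {
    0: 315, 1: 123, 2: 138, 3: 117, 4: 88, 5: 60, 6: 35, 7: 23, 8: 24,
    9: 19, 10: 16, 11: 11, 12: 10, 13: 5, 14: 1, 15: 4, 16: 1, 19: 2,
    21: 2, 22: 2, 26: 1, 32: 2, 34: 1,
}

n = sum(freq.values())
mean_y = sum(y * f for y, f in freq.items()) / n

observed_zero_share = freq[0] / n
poisson_zero_share = math.exp(-mean_y)  # P(Y = 0) for a poisson with this mean

print(f"n = {n}, mean = {mean_y:.2f}")
print(f"observed share of zeros = {observed_zero_share:.3f}")
print(f"share expected under poisson = {poisson_zero_share:.3f}")
```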


    Since the count part of the model was generated using a random poisson generator, we will look at three different poisson-based models.

        ll     df        BIC   model
    -2133.079   3     4286.881 poisson
    -1748.863   6     3539.173 zero-inflated poisson
    -1730.438   6     3502.322 poisson-logit hurdle

    The poisson-logit hurdle model is clearly the best choice here. The results for this model are given below.




    Days Absent Clustered in Schools

    We will take another look at days absent (mean = 5.37 variance = 49.24), this time obtained from 12 schools with about 50 students per school. The predictors are gender and reading. Any count model will need to account for the lack of independence within schools.



    We have a number of options for analyzing these data. We can use a standard poisson or negative binomial model with cluster-robust standard errors to account for within-school association. Or, we can use one of the random-effects models for poisson or negative binomial. These are random-intercept-only models.
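
    As a sketch of what "random intercept only" means, a random-effects poisson model for student i in school j can be written as below. The notation (x_ij, β, u_j, σ_u) is assumed here, and the normal distribution shown corresponds to the gaussian random-effects variant only.

```latex
y_{ij} \mid u_j \sim \text{Poisson}(\lambda_{ij}),
\qquad
\ln \lambda_{ij} = x_{ij}\beta + u_j,
\qquad
u_j \sim N(0, \sigma_u^{2})
```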

        ll    df     BIC    model
    -2938.653  3  5896.653  poisson with cluster robust standard errors
    -1718.321  4  3462.437  negative binomial with cluster robust standard errors
    -2718.767  4  5463.33   random-effects poisson w/ gaussian random effects
    -2718.864  4  5463.523  random-effects poisson w/ gamma random effects
    -1696.797  5  3425.838  random-effects negative binomial w/ beta random effects

    The random-effects negative binomial has the best log likelihood and BIC. The results of the model are given below.








    Summary of Count Regression Models




    Some Stata Specific Information

    Here are the Stata commands for selected models shown above:

    poisson regression with exposure, deaths (also see DAE page):
    . poisson deaths smokes a2 a3 a4 a5, exposure(pyears)

    zero-inflated negative binomial, fish (also see DAE page):
    . zinb count child camper persons, inflate(child) vuong zip

    negative binomial regression, days absent (also see DAE page):
    . nbreg daysabs female school

    zero-truncated negative binomial regression, length of hospital stay (also see DAE page):
    . ztnb stay age hmo died

    poisson-logit hurdle regression, artificial data:
    . hplogit y x1 x2

    negative binomial regression with cluster robust standard errors, days absent:
    . nbreg daysabs female reading, cluster(sid)

    random effects negative binomial, days absent:
    . xtnbreg daysabs female reading, i(sid)



    References


    Here are some places to read more about regression models with count data.

    Agresti, A. (2002) Categorical Data Analysis (2nd ed). New York: Wiley.

    Agresti, A. (1996) An Introduction to Categorical Data Analysis. New York: Wiley.

    Long, J. S. (1997) Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: SAGE Publications, Inc.

    Long, J. S. & Freese, J. (2001) Regression Models for Categorical Dependent Variables using Stata. College Station, TX: Stata Press.