UCLA Academic Technology Services

Stata Library
Analyzing Count Data

Stata has several estimation commands designed for analyzing count data. Let's begin by loading and describing a dataset on 316 students at two Los Angeles high schools.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh

describe

Contains data from lahigh.dta

  obs:           316                          
 vars:            10                          3 Dec 1999 09:43
 size:        13,904 (98.5% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
   1. id        float  %9.0g                  
   2. gender    float  %9.0g       gl         
   3. ethnic    float  %10.0g      el         ethnicity
   4. school    float  %9.0g                  school 1 or 2
   5. mathpr    float  %9.0g                  ctbs math pct rank
   6. langpr    float  %9.0g                  ctbs lang pct rank
   7. mathnce   float  %9.0g                  ctbs math nce
   8. langnce   float  %9.0g                  ctbs lang nce
   9. biling    float  %12.0g      bl         bilingual status
  10. daysabs   float  %9.0g                  number days absent
-------------------------------------------------------------------------------
Sorted by:
We wish to analyze the variable daysabs, the number of days absent during the school year, to see if there is an effect due to gender and to ability as measured by mathnce and langnce, the math and language standardized test scores reported as normal curve equivalents.

A naive analysis might be to use OLS regression to predict daysabs using gender, mathnce and langnce. However, a simple histogram can show us that this is not a very good idea.

hist daysabs
The data are strongly skewed to the right, so OLS regression would clearly be inappropriate. Students are taught that count data often follow a Poisson distribution, so some type of Poisson analysis might be appropriate. Recall from statistical theory that in a Poisson distribution the mean and variance are the same. Let's summarize daysabs using the detail option.
summarize daysabs, detail

                     number days absent
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 316
25%            1              0       Sum of Wgt.         316
50%            3                      Mean           5.810127
                        Largest       Std. Dev.      7.449003
75%            8             35
90%           14             35       Variance       55.48764
95%           23             41       Skewness       2.250587
99%           35             45       Kurtosis       8.949302
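As a quick check of the equal mean and variance property, the ratio of the two can be computed from the results that summarize stores. This is a minimal sketch; r(Var) and r(mean) are summarize's stored results, and a ratio near 1 would be consistent with a Poisson distribution.

```stata
* summarize leaves the mean in r(mean) and the variance in r(Var)
quietly summarize daysabs
display "variance / mean = " r(Var)/r(mean)
```

With the values shown above (55.48764 and 5.810127), the ratio is about 9.55.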
The variance of daysabs is nearly 10 times larger than the mean. The distribution of daysabs is displaying signs of over-dispersion, that is, greater variance than might be expected in a Poisson distribution. Before we get to an alternative analysis, let's run a Poisson regression, even though we believe that the Poisson distribution is not the best choice. Poisson regression will be followed up with the poisgof command which tests the Poisson goodness-of-fit. Here is what these commands look like.
poisson daysabs gender mathnce langnce

Poisson regression                                Number of obs   =        316
                                                  LR chi2(3)      =     175.27
                                                  Prob > chi2     =     0.0000
Log likelihood = -1547.9709                       Pseudo R2       =     0.0536
------------------------------------------------------------------------------
 daysabs |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  gender |  -.4009209   .0484122     -8.281   0.000       -.495807   -.3060348
 mathnce |  -.0035232   .0018213     -1.934   0.053       -.007093    .0000466
 langnce |  -.0121521   .0018348     -6.623   0.000      -.0157483   -.0085559
   _cons |   3.088587   .1017365     30.359   0.000       2.889187    3.287987
------------------------------------------------------------------------------
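
The coefficients above are on the log scale, so exponentiating them gives incidence-rate ratios, which are often easier to read. A minimal sketch, assuming the same model is refit; irr is a reporting option of poisson, and the second command simply redoes the arithmetic for gender by hand.

```stata
* redisplay the model with exponentiated coefficients (incidence-rate ratios)
poisson daysabs gender mathnce langnce, irr

* by hand for gender, using the coefficient reported above
display exp(-.4009209)
```

The hand calculation gives roughly 0.67, that is, a one-unit increase in gender is associated with about a 33% lower expected number of days absent under this model.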

* Stata 8 code.
poisgof

* Stata 9 code and output.
estat gof

Goodness of fit chi-2 =  2234.546
Prob > chi2(312)      =    0.0000
The large chi-square value in the goodness-of-fit test is another indicator that the Poisson distribution is not a good choice.
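
The p-value reported by the goodness-of-fit test is the upper tail of a chi-square distribution with the residual degrees of freedom. As an illustrative check using the figures printed above, Stata's chi2tail() function can reproduce it:

```stata
* upper-tail chi-square probability: chi2tail(df, x)
display chi2tail(312, 2234.546)
```

The result is effectively zero, in line with the 0.0000 shown in the output.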

The Stata glm command can also be used to run this analysis. In glm we need to indicate both the probability distribution family (Poisson) and the appropriate link function (in this case the log link).

glm daysabs gender mathnce langpr, family(Poisson) link(log) 


Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       312
                                                   Scale param     =         1
Deviance         =  2232.107093                    (1/df) Deviance =  7.154189
Pearson          =   2788.85834                    (1/df) Pearson  =  8.938649

Variance function: V(u) = u                        [Poisson]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -1546.751398                    AIC             =  9.814882
BIC              =  2209.084124

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -.4019586   .0483947    -8.31   0.000    -.4968105   -.3071067
     mathnce |  -.0033221   .0018193    -1.83   0.068    -.0068879    .0002436
      langpr |  -.0089998   .0013326    -6.75   0.000    -.0116117   -.0063879
       _cons |   2.924353   .0953732    30.66   0.000     2.737425    3.111281
------------------------------------------------------------------------------
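Note that the glm command above uses langpr (the language percentile rank) where the poisson command used langnce, so the two are not fitting the same model; this difference in predictors accounts for the differences in the coefficients and log likelihoods. With identical predictors, any remaining discrepancies between poisson and glm would come from differences in starting values and convergence criteria of the algorithms.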

Our first attempt to deal with the over-dispersion is to use the scale option in glm to scale the standard errors using the square root of the Pearson chi-square dispersion. The coefficients are identical to the previous analysis but the standard errors are adjusted to compensate for the over-dispersion in the Poisson distribution.

glm daysabs gender mathnce langpr, family(Poisson) link(log) scale(x2) 

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       312
                                                   Scale param     =         1
Deviance         =  2232.107093                    (1/df) Deviance =  7.154189
Pearson          =   2788.85834                    (1/df) Pearson  =  8.938649

Variance function: V(u) = u                        [Poisson]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -1546.751398                    AIC             =  9.814882
BIC              =  2209.084124

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -.4019586   .1446885    -2.78   0.005    -.6855428   -.1183744
     mathnce |  -.0033221   .0054392    -0.61   0.541    -.0139828    .0073385
      langpr |  -.0089998   .0039842    -2.26   0.024    -.0168086   -.0011909
       _cons |   2.924353   .2851426    10.26   0.000     2.365484    3.483222
------------------------------------------------------------------------------
(Standard errors scaled using square root of Pearson X2-based dispersion)
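
The adjustment can be checked by hand: each scaled standard error is the unscaled one multiplied by the square root of the Pearson dispersion, (1/df) Pearson = 8.938649. A short sketch using the gender row from the two tables above:

```stata
* square root of the Pearson-based dispersion
display sqrt(8.938649)

* unscaled standard error for gender times that factor
display .0483947 * sqrt(8.938649)
```

The second command should come out at about .1447, matching the scaled standard error for gender.
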
An alternative to scaling the standard errors would be to select a distribution other than Poisson, one that allows the variance to be greater than the mean. The negative binomial distribution is often more appropriate in cases of over-dispersion. Here is the negative binomial analysis.
nbreg daysabs gender mathnce langnce

Negative binomial regression                      Number of obs   =        316
                                                  LR chi2(3)      =      20.74
                                                  Prob > chi2     =     0.0001
Log likelihood = -880.87312                       Pseudo R2       =     0.0116
------------------------------------------------------------------------------
 daysabs |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
  gender |  -.4311844   .1396656     -3.087   0.002       -.704924   -.1574448
 mathnce |   -.001601     .00485     -0.330   0.741      -.0111067    .0079048
 langnce |  -.0143475   .0055815     -2.571   0.010      -.0252871    -.003408
   _cons |   3.147254   .3211669      9.799   0.000       2.517778    3.776729
---------+--------------------------------------------------------------------
/lnalpha |   .2533877   .0955362                          .0661402    .4406351
---------+--------------------------------------------------------------------
   alpha |   1.288383   .1230871     10.467   0.000       1.068377    1.553694
------------------------------------------------------------------------------

Likelihood ratio test of alpha=0:    chi2(1) =  1334.20   Prob > chi2 = 0.0000
The likelihood ratio test at the bottom of the analysis is a test of the over-dispersion parameter alpha. When the over-dispersion parameter is zero the negative binomial distribution is equivalent to a Poisson distribution. In this case, alpha is significantly different from zero and thus reinforces that the Poisson distribution is not appropriate.
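
The test statistic is twice the difference between the negative binomial and Poisson log likelihoods. As a sketch using the log likelihoods printed in the nbreg and poisson output above:

```stata
* LR statistic = 2 * (ll_nbreg - ll_poisson)
display 2*(-880.87312 - (-1547.9709))
```

This gives about 1334.20, the chi2(1) value shown under the nbreg table.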

It is also possible to run this analysis using glm with similar results. It is only necessary to change the family option to nbinomial.

glm daysabs gender mathnce langpr, family(nbinomial) link(log)

Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       312
                                                   Scale param     =         1
Deviance         =  425.1272067                    (1/df) Deviance =  1.362587
Pearson          =  423.8531355                    (1/df) Pearson  =  1.358504

Variance function: V(u) = u+(1)u^2                 [Neg. Binomial]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -884.2572249                    AIC             =  5.621881
BIC              =  402.1042379

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -.4324168   .1253785    -3.45   0.000    -.6781543   -.1866794
     mathnce |  -.0014521   .0043635    -0.33   0.739    -.0100045    .0071002
      langpr |  -.0103129   .0035236    -2.93   0.003    -.0172191   -.0034067
       _cons |   2.942868   .2642829    11.14   0.000     2.424883    3.460853
------------------------------------------------------------------------------
In this analysis, both gender and the language score (langnce in nbreg, langpr in the glm run) are significant while mathnce is not. From the coding of gender (1=female, 2=male) it is evident that females are absent significantly more often than males. The significant coefficient for the language score suggests that higher ability students are absent less often than lower ability students.
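
Note the line V(u) = u+(1)u^2 in the glm output: with family(nbinomial) alone, glm holds the dispersion parameter fixed at 1 rather than estimating it as nbreg does. A hedged sketch of two alternatives follows; the value 1.288383 is simply the alpha estimated by nbreg above, and the ml suboption is only available in more recent versions of Stata.

```stata
* fix the dispersion parameter at the nbreg estimate of alpha
glm daysabs gender mathnce langnce, family(nbinomial 1.288383) link(log)

* or let glm estimate it by maximum likelihood (newer Stata versions)
glm daysabs gender mathnce langnce, family(nbinomial ml) link(log)
```

Either form should bring the glm results closer to those from nbreg, provided the same predictors are used.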

Now let's check how well the variable daysabs fits both the Poisson and negative binomial distributions using the nbvargr command available from ATS. You can download nbvargr over the internet by typing findit nbvargr (see How can I use the findit command to search for programs and get additional help? for more information about using findit). nbvargr graphs the variable against a Poisson distribution with the same mean and a negative binomial distribution with the same mean and variance.

nbvargr daysabs

Obtaining Parameter Estimates

(23 observations deleted)

  Negative Binomial Probabilities
  with mean = 5.810127 & over-dispersion = 1.397268

        k     nbprob      nbcum
  1.    0  0.20559212  0.20559211 
  2.    1  0.13100202  0.33659413 
  3.    2  0.10005438  0.43664852 
  4.    3  0.08063899  0.51728749 
  5.    4  0.06669218  0.58397967 
  6.    5  0.05600163  0.63998133 
  7.    6  0.04749728  0.68747860 
  8.    7  0.04057066  0.72804928 
  9.    8  0.03483756  0.76288682 
 10.    9  0.03003709  0.79292393 
 11.   10  0.02598259  0.81890649 

 Poisson Probabilities for lambda = 5.810127

        k      pprob       pcum
  1.    0  0.00299705  0.00299705 
  2.    1  0.01741324  0.02041029 
  3.    2  0.05058656  0.07099685 
  4.    3  0.09797145  0.16896829 
  5.    4  0.14230664  0.31127495 
  6.    5  0.16536394  0.47663888 
  7.    6  0.16013090  0.63676977 
  8.    7  0.13291156  0.76968133 
  9.    8  0.09652913  0.86621046 
 10.    9  0.06231628  0.92852676 
 11.   10  0.03620655  0.96473330 
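
Individual entries in these tables can be reproduced with Stata's built-in probability functions. This is an illustrative sketch for k = 0 only; it assumes a version of Stata that provides poissonp() and nbinomialp(), and it uses the common conversion from a mean mu and over-dispersion alpha to size n = 1/alpha and success probability p = 1/(1 + alpha*mu).

```stata
* Poisson probability of zero days absent, two equivalent ways
display poissonp(5.810127, 0)
display exp(-5.810127)

* negative binomial probability of zero, with n = 1/alpha and p = 1/(1 + alpha*mu)
display nbinomialp(1/1.397268, 0, 1/(1 + 1.397268*5.810127))
```

These should come out near 0.0030 and 0.2056, matching the first rows of the two tables: the negative binomial puts far more probability on zero than the Poisson does.
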
Many researchers would be satisfied with this analysis; however, there is one more analysis that we can try. Using glm and keeping the log link, let's try the gamma distribution to see how it does.
glm daysabs gender mathnce langpr, family(gamma) link(log) 


Generalized linear models                          No. of obs      =       316
Optimization     : ML: Newton-Raphson              Residual df     =       312
                                                   Scale param     =   1.62252
Deviance         =  253.0860658                    (1/df) Deviance =  .8111733
Pearson          =  506.2263775                    (1/df) Pearson  =   1.62252

Variance function: V(u) = u^2                      [Gamma]
Link function    : g(u) = ln(u)                    [Log]
Standard errors  : OIM

Log likelihood   = -855.9673534                    AIC             =  5.442831
BIC              =  230.0630969

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      gender |  -.4357982   .1462805    -2.98   0.003    -.7225027   -.1490937
     mathnce |  -.0011621    .005043    -0.23   0.818    -.0110461    .0087219
      langpr |  -.0105212    .004128    -2.55   0.011     -.018612   -.0024303
       _cons |   2.944013   .3107076     9.48   0.000     2.335037    3.552988
------------------------------------------------------------------------------
The coefficients and p-values are very similar to those obtained from the negative binomial example; the deviance is much lower, although deviances from different families are not directly comparable. In any case, there do not seem to be many substantive differences between these analyses.
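
One rough way to put the three glm fits side by side is the AIC that glm prints, which for these outputs equals (-2*ll + 2*k)/N with k = 4 estimated coefficients and N = 316. The sketch below recomputes it from the log likelihoods shown above; whether AIC values are comparable between a discrete and a continuous family is itself debatable, so treat this as a descriptive check only.

```stata
* AIC as reported by glm: (-2*ll + 2*k) / N
* Poisson
display (-2*(-1546.751398) + 2*4)/316
* negative binomial
display (-2*(-884.2572249) + 2*4)/316
* gamma
display (-2*(-855.9673534) + 2*4)/316
```

These reproduce the printed values of about 9.81, 5.62 and 5.44.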

There actually is one other analysis that we could try, generalized negative binomial regression. Generalized negative binomial regression is a generalization of the negative binomial model in which the shape parameter itself is parameterized (predicted). The data in the lahigh dataset come from two different schools. It is conceivable that the shape parameter could be different depending on the school. We will investigate this using the gnbreg command.

gnbreg daysabs gender mathnce langnce, lnalpha(school)

Generalized negative binomial regression          Number of obs   =        316
                                                  LR chi2(3)      =      20.51
                                                  Prob > chi2     =     0.0001
Log likelihood = -876.75565                       Pseudo R2       =     0.0116

------------------------------------------------------------------------------
     daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
daysabs      |
      gender |  -.4742854   .1386001    -3.42   0.001    -.7459366   -.2026343
     mathnce |  -.0026143   .0047886    -0.55   0.585    -.0119997    .0067712
     langnce |  -.0120342   .0056573    -2.13   0.033    -.0231222   -.0009461
       _cons |   2.751023   .2236062    12.30   0.000     2.312763    3.189283
-------------+----------------------------------------------------------------
lnalpha      |
      school |   .5960854   .2061477     2.89   0.004     .1920433    1.000128
       _cons |  -.6219761    .316587    -1.96   0.049    -1.242475    -.001477
------------------------------------------------------------------------------
It does appear that the shape parameter is different depending on the school but it is not clear whether this model is superior to negative binomial or gamma models.
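
The estimated shape parameter for each school follows from the lnalpha equation, and because nbreg is the special case with a constant lnalpha, the two log likelihoods give a rough likelihood ratio comparison. A sketch using the figures printed above, assuming school is coded 1 and 2 as its label says:

```stata
* alpha for each school: exp(_cons + b*school)
display exp(-.6219761 + .5960854*1)
display exp(-.6219761 + .5960854*2)

* LR statistic and p-value for gnbreg versus nbreg (1 extra parameter)
display 2*(-876.75565 - (-880.87312))
display chi2tail(1, 2*(-876.75565 - (-880.87312)))
```

This gives alpha of roughly 0.97 for school 1 and 1.77 for school 2, and an LR statistic of about 8.23 (p about 0.004), which agrees with the z test on school in the lnalpha equation. It says nothing about the comparison with the gamma model, which is not nested.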

These pages are Copyrighted (c) by UCLA Academic Technology Services

