We wish to analyze the variable daysabs, the number of days absent during the school year, to see whether there is an effect due to gender and to ability as measured by mathnce and langnce, the math and language standardized test scores reported as normal curve equivalents.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh
describe
Contains data from lahigh.dta
  obs:           316
 vars:            10                          3 Dec 1999 09:43
 size:        13,904 (98.5% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
  1. id        float  %9.0g
  2. gender    float  %9.0g       gl
  3. ethnic    float  %10.0g      el        ethnicity
  4. school    float  %9.0g                 school 1 or 2
  5. mathpr    float  %9.0g                 ctbs math pct rank
  6. langpr    float  %9.0g                 ctbs lang pct rank
  7. mathnce   float  %9.0g                 ctbs math nce
  8. langnce   float  %9.0g                 ctbs lang nce
  9. biling    float  %12.0g      bl        bilingual status
 10. daysabs   float  %9.0g                 number days absent
-------------------------------------------------------------------------------
Sorted by:
A naive analysis might use OLS regression to predict daysabs from gender, mathnce and langnce. However, a simple histogram shows that this is not a very good idea.
hist daysabs
The data are strongly skewed to the right, so OLS regression would clearly be inappropriate. Count data often follow a Poisson distribution, so some type of Poisson analysis might be appropriate. Recall from statistical theory that in a Poisson distribution the mean and variance are equal. Let's summarize daysabs using the detail option.
summarize daysabs, detail
                     number days absent
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs                 316
25%            1              0       Sum of Wgt.         316
50%            3                      Mean           5.810127
                        Largest       Std. Dev.      7.449003
75%            8             35
90%           14             35       Variance       55.48764
95%           23             41       Skewness       2.250587
99%           35             45       Kurtosis       8.949302
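The summary statistics above are enough for a quick check of the Poisson assumption that the mean and variance are equal. The following is a minimal sketch in plain Python (used here only as a calculator outside Stata; the variable names are illustrative), with the numbers copied from the summarize output.

```python
# Summary statistics copied from the `summarize daysabs, detail` output above.
mean = 5.810127
std_dev = 7.449003
variance = 55.48764

# Sanity check: the reported variance is the squared standard deviation.
variance_from_sd = std_dev ** 2

# Under a Poisson distribution this ratio would be close to 1.
dispersion_ratio = variance / mean

print(f"std_dev squared  = {variance_from_sd:.5f}")
print(f"variance / mean  = {dispersion_ratio:.2f}")
```

A ratio far above 1 is the informal signature of over-dispersion; it is a descriptive check only, since it ignores the predictors in the regression model.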
The variance of daysabs is nearly 10 times larger than the mean. The distribution of
daysabs is displaying signs of over-dispersion, that is, greater variance than might
be expected in a Poisson distribution. Before we get to an alternative analysis, let's run a
Poisson regression, even though we believe that the Poisson distribution is not the best choice.
Poisson regression will be followed by the poisgof command (estat gof in Stata 9 and later), which tests the
Poisson goodness-of-fit. Here is what these commands look like.
poisson daysabs gender mathnce langnce
Poisson regression Number of obs = 316
LR chi2(3) = 175.27
Prob > chi2 = 0.0000
Log likelihood = -1547.9709 Pseudo R2 = 0.0536
------------------------------------------------------------------------------
daysabs | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gender | -.4009209 .0484122 -8.281 0.000 -.495807 -.3060348
mathnce | -.0035232 .0018213 -1.934 0.053 -.007093 .0000466
langnce | -.0121521 .0018348 -6.623 0.000 -.0157483 -.0085559
_cons | 3.088587 .1017365 30.359 0.000 2.889187 3.287987
------------------------------------------------------------------------------
* Stata 8 code.
poisgof
* Stata 9 code and output.
estat gof
Goodness of fit chi-2 = 2234.546
Prob > chi2(312) = 0.0000
The large value for chi-square in the goodness-of-fit test is another indicator that the
Poisson distribution is not a good choice.
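To get a feel for how large the goodness-of-fit statistic is, it can be compared with its degrees of freedom, since a chi-square variable has an expected value equal to its degrees of freedom. The sketch below, in plain Python, also uses the Wilson-Hilferty normal approximation to the chi-square distribution to turn the statistic into an approximate z-score; the approximation is an assumption made for this illustration and is not how Stata computes the reported p-value.

```python
import math

chi2_stat = 2234.546   # goodness-of-fit chi-square reported by estat gof
df = 312               # degrees of freedom reported on the same output

# A well-fitting Poisson model would give a statistic close to its df.
ratio = chi2_stat / df

# Wilson-Hilferty: for X ~ chi-square(df), (X/df)**(1/3) is approximately
# normal with mean 1 - 2/(9*df) and variance 2/(9*df).
wh_mean = 1 - 2 / (9 * df)
wh_sd = math.sqrt(2 / (9 * df))
z = ((chi2_stat / df) ** (1 / 3) - wh_mean) / wh_sd

# Approximate upper-tail probability from the standard normal distribution.
p_upper = 0.5 * math.erfc(z / math.sqrt(2))

print(f"chi-square / df = {ratio:.2f}")
print(f"approximate z   = {z:.1f}")
```

A z-score this far into the tail corresponds to a probability that prints as 0.0000 at four decimal places, which matches the Prob > chi2 line above.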
The Stata glm command can also be used to run this analysis. In glm we need to indicate both the probability distribution family (Poisson) and the appropriate link function (in this case the log link).
glm daysabs gender mathnce langpr, family(poisson) link(log)
Generalized linear models No. of obs = 316
Optimization : ML: Newton-Raphson Residual df = 312
Scale param = 1
Deviance = 2232.107093 (1/df) Deviance = 7.154189
Pearson = 2788.85834 (1/df) Pearson = 8.938649
Variance function: V(u) = u [Poisson]
Link function : g(u) = ln(u) [Log]
Standard errors : OIM
Log likelihood = -1546.751398 AIC = 9.814882
BIC = 2209.084124
------------------------------------------------------------------------------
daysabs | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -.4019586 .0483947 -8.31 0.000 -.4968105 -.3071067
mathnce | -.0033221 .0018193 -1.83 0.068 -.0068879 .0002436
langpr | -.0089998 .0013326 -6.75 0.000 -.0116117 -.0063879
_cons | 2.924353 .0953732 30.66 0.000 2.737425 3.111281
------------------------------------------------------------------------------
The small differences between the poisson results and the glm results arise mainly because the glm model uses langpr (the language percentile rank) where the poisson model used langnce. Differences in starting values and convergence criteria between the two algorithms would by themselves change the estimates only negligibly.
Our first attempt to deal with the over-dispersion is to use the scale option in glm to scale the standard errors using the square root of the Pearson chi-square dispersion. The coefficients are identical to the previous analysis but the standard errors are adjusted to compensate for the over-dispersion relative to the Poisson distribution.
glm daysabs gender mathnce langpr, family(poisson) link(log) scale(x2)
Generalized linear models No. of obs = 316
Optimization : ML: Newton-Raphson Residual df = 312
Scale param = 1
Deviance = 2232.107093 (1/df) Deviance = 7.154189
Pearson = 2788.85834 (1/df) Pearson = 8.938649
Variance function: V(u) = u [Poisson]
Link function : g(u) = ln(u) [Log]
Standard errors : OIM
Log likelihood = -1546.751398 AIC = 9.814882
BIC = 2209.084124
------------------------------------------------------------------------------
daysabs | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -.4019586 .1446885 -2.78 0.005 -.6855428 -.1183744
mathnce | -.0033221 .0054392 -0.61 0.541 -.0139828 .0073385
langpr | -.0089998 .0039842 -2.26 0.024 -.0168086 -.0011909
_cons | 2.924353 .2851426 10.26 0.000 2.365484 3.483222
------------------------------------------------------------------------------
(Standard errors scaled using square root of Pearson X2-based dispersion)
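The scaling applied by scale(x2) can be reproduced by hand from the two glm outputs above. The sketch below is plain Python used as a calculator; it takes the Pearson statistic, the residual degrees of freedom and the unscaled standard errors from the first glm run and multiplies each standard error by the square root of the Pearson dispersion. The dictionary names are illustrative.

```python
import math

# Values copied from the unscaled glm Poisson output.
pearson_chi2 = 2788.85834
residual_df = 312
unscaled_se = {
    "gender": 0.0483947,
    "mathnce": 0.0018193,
    "langpr": 0.0013326,
    "_cons": 0.0953732,
}

# Pearson dispersion, reported by glm as (1/df) Pearson.
dispersion = pearson_chi2 / residual_df
scale = math.sqrt(dispersion)

# Each scaled standard error is the unscaled one times sqrt(dispersion).
scaled_se = {name: se * scale for name, se in unscaled_se.items()}

for name, se in scaled_se.items():
    print(f"{name:8s} {se:.7f}")
```

The results agree with the Std. Err. column of the scaled output to the printed precision, which confirms that only the standard errors, and not the coefficients, are affected by the scale option.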
An alternative to scaling the standard errors is to select a distribution other than
Poisson, one that allows the variance to be greater than the mean. The negative
binomial distribution is often more appropriate in cases of over-dispersion.
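To illustrate why the extra parameter matters, the sketch below compares the probability of zero absences under a Poisson and under a negative binomial distribution with the same mean. It is plain Python, and it makes one simplifying assumption that is not part of the analysis on this page: the over-dispersion parameter alpha is obtained by matching the sample mean and variance to the negative binomial variance function mu + alpha*mu^2, rather than by maximum likelihood as nbreg does.

```python
import math

# Mean and variance of daysabs, copied from the summarize output.
mean, variance = 5.810127, 55.48764

# Negative binomial variance: Var(Y) = mu + alpha * mu**2; alpha = 0 is Poisson.
# Matching moments is an illustrative shortcut, not the nbreg estimator.
alpha_mom = (variance - mean) / mean ** 2


def poisson_pmf(k, mu):
    """Poisson probability of the count k when the mean is mu."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def negbin_pmf(k, mu, alpha):
    """Negative binomial probability of k with mean mu and over-dispersion alpha."""
    r = 1.0 / alpha
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)


p0_poisson = poisson_pmf(0, mean)
p0_negbin = negbin_pmf(0, mean, alpha_mom)

print(f"alpha (moments) = {alpha_mom:.3f}")
print(f"P(0 days): Poisson = {p0_poisson:.4f}, negative binomial = {p0_negbin:.4f}")
```

Because the 10th percentile of daysabs is 0, at least 10% of the students have no absences. The Poisson distribution with this mean makes that outcome very rare, while the negative binomial allows for it.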
Here is the negative binomial analysis.
nbreg daysabs gender mathnce langnce
Negative binomial regression Number of obs = 316
LR chi2(3) = 20.74
Prob > chi2 = 0.0001
Log likelihood = -880.87312 Pseudo R2 = 0.0116
------------------------------------------------------------------------------
daysabs | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gender | -.4311844 .1396656 -3.087 0.002 -.704924 -.1574448
mathnce | -.001601 .00485 -0.330 0.741 -.0111067 .0079048
langnce | -.0143475 .0055815 -2.571 0.010 -.0252871 -.003408
_cons | 3.147254 .3211669 9.799 0.000 2.517778 3.776729
---------+--------------------------------------------------------------------
/lnalpha | .2533877 .0955362 .0661402 .4406351
---------+--------------------------------------------------------------------
alpha | 1.288383 .1230871 10.467 0.000 1.068377 1.553694
------------------------------------------------------------------------------
Likelihood ratio test of alpha=0: chi2(1) = 1334.20 Prob > chi2 = 0.0000
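The chi-square value on the last line of the nbreg output can be reproduced from the log likelihoods of the two models, since a likelihood ratio statistic is twice the difference between them. The following sketch is plain Python used as a calculator, with both log likelihoods copied from the poisson and nbreg outputs on this page; the chi-square(1) tail probability is written with the complementary error function.

```python
import math

ll_poisson = -1547.9709    # log likelihood from the poisson output
ll_negbin = -880.87312     # log likelihood from the nbreg output

# Likelihood ratio statistic: twice the gain in log likelihood.
lr_stat = 2 * (ll_negbin - ll_poisson)

# Upper tail of a chi-square with 1 df: P(Z**2 > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))

print(f"LR chi2(1) = {lr_stat:.2f}")
```

The statistic agrees with the reported value to two decimals, and the tail probability is far smaller than anything Stata would print as nonzero at four decimal places.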
The likelihood ratio test at the bottom of the output is a test of the over-dispersion parameter alpha. When alpha is zero, the negative binomial distribution is equivalent to a Poisson distribution. In this case, alpha is significantly different from zero, which reinforces the conclusion that the Poisson distribution is not appropriate.
It is also possible to run this analysis using glm with similar results. It is only necessary to change the family option to nbinomial. Note that glm with family(nbinomial) holds the over-dispersion parameter fixed at 1 (the variance function in the output is u+(1)u^2), whereas nbreg estimates it from the data.
glm daysabs gender mathnce langpr, family(nbinomial) link(log)
Generalized linear models No. of obs = 316
Optimization : ML: Newton-Raphson Residual df = 312
Scale param = 1
Deviance = 425.1272067 (1/df) Deviance = 1.362587
Pearson = 423.8531355 (1/df) Pearson = 1.358504
Variance function: V(u) = u+(1)u^2 [Neg. Binomial]
Link function : g(u) = ln(u) [Log]
Standard errors : OIM
Log likelihood = -884.2572249 AIC = 5.621881
BIC = 402.1042379
------------------------------------------------------------------------------
daysabs | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -.4324168 .1253785 -3.45 0.000 -.6781543 -.1866794
mathnce | -.0014521 .0043635 -0.33 0.739 -.0100045 .0071002
langpr | -.0103129 .0035236 -2.93 0.003 -.0172191 -.0034067
_cons | 2.942868 .2642829 11.14 0.000 2.424883 3.460853
------------------------------------------------------------------------------
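Because all of these models use a log link, each coefficient is a difference in the log of the expected count, and exponentiating it gives a multiplicative effect on the expected number of days absent (an incidence rate ratio). The sketch below is plain Python applied to the coefficients from the nbreg output above; the 10-point comparison for langnce is an arbitrary choice made for illustration.

```python
import math

# Coefficients copied from the nbreg output.
coefs = {"gender": -0.4311844, "mathnce": -0.001601, "langnce": -0.0143475}

# exp(b) is the ratio of expected counts for a one-unit increase in the predictor.
irr = {name: math.exp(b) for name, b in coefs.items()}

# Ratio of expected counts for a 10-point increase in langnce.
irr_langnce_10 = math.exp(10 * coefs["langnce"])

for name, ratio in irr.items():
    print(f"{name:8s} IRR = {ratio:.4f}")
print(f"langnce  IRR per 10 points = {irr_langnce_10:.4f}")
```

With gender coded 1=female and 2=male, a ratio below 1 for gender means that the expected number of absences for males is a fraction of that for females, holding the test scores constant.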
In this analysis, both gender and the language score (langnce in nbreg, langpr in glm) are significant while mathnce is not. From the coding of gender (1=female, 2=male) and the negative coefficient, it is evident that females are absent significantly more than males. The significant coefficient for the language score suggests that higher ability students are absent less often than lower ability students.
Now let's check how well the variable daysabs fits both the Poisson and negative binomial distributions using the nbvargr command available from ATS. You can download nbvargr over the internet by typing findit nbvargr (see How can I use the findit command to search for programs and get additional help? for more information about using findit). nbvargr graphs the variable against a Poisson distribution with the same mean and a negative binomial distribution with the same mean and variance.
nbvargr daysabs
Obtaining Parameter Estimates
(23 observations deleted)
Negative Binomial Probabilities with mean = 5.810127 & over-dispersion = 1.397268
          k      nbprob       nbcum
  1.      0  0.20559212  0.20559211
  2.      1  0.13100202  0.33659413
  3.      2  0.10005438  0.43664852
  4.      3  0.08063899  0.51728749
  5.      4  0.06669218  0.58397967
  6.      5  0.05600163  0.63998133
  7.      6  0.04749728  0.68747860
  8.      7  0.04057066  0.72804928
  9.      8  0.03483756  0.76288682
 10.      9  0.03003709  0.79292393
 11.     10  0.02598259  0.81890649
Poisson Probabilities for lambda = 5.810127
          k       pprob        pcum
  1.      0  0.00299705  0.00299705
  2.      1  0.01741324  0.02041029
  3.      2  0.05058656  0.07099685
  4.      3  0.09797145  0.16896829
  5.      4  0.14230664  0.31127495
  6.      5  0.16536394  0.47663888
  7.      6  0.16013090  0.63676977
  8.      7  0.13291156  0.76968133
  9.      8  0.09652913  0.86621046
 10.      9  0.06231628  0.92852676
 11.     10  0.03620655  0.96473330
Many researchers would be satisfied with this analysis; however, there is one more analysis that we can try. Using glm and keeping the log link, let's try the gamma distribution to see how it does.
glm daysabs gender mathnce langpr, family(gamma) link(log)
Generalized linear models No. of obs = 316
Optimization : ML: Newton-Raphson Residual df = 312
Scale param = 1.62252
Deviance = 253.0860658 (1/df) Deviance = .8111733
Pearson = 506.2263775 (1/df) Pearson = 1.62252
Variance function: V(u) = u^2 [Gamma]
Link function : g(u) = ln(u) [Log]
Standard errors : OIM
Log likelihood = -855.9673534 AIC = 5.442831
BIC = 230.0630969
------------------------------------------------------------------------------
daysabs | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
gender | -.4357982 .1462805 -2.98 0.003 -.7225027 -.1490937
mathnce | -.0011621 .005043 -0.23 0.818 -.0110461 .0087219
langpr | -.0105212 .004128 -2.55 0.011 -.018612 -.0024303
_cons | 2.944013 .3107076 9.48 0.000 2.335037 3.552988
------------------------------------------------------------------------------
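The fit statistics in the three glm headers (Poisson, negative binomial and gamma) can be lined up and recomputed from the reported deviance, Pearson statistic and log likelihood. The sketch below is plain Python used as a calculator. The AIC formula in it, (-2*ll + 2*k)/N with k = 4 estimated coefficients, is an assumption inferred from the fact that it reproduces the printed AIC values; it is not taken from the Stata documentation.

```python
n_obs = 316        # No. of obs in every glm header
resid_df = 312     # Residual df in every glm header
n_coef = 4         # gender, mathnce, langpr and _cons

# Deviance, Pearson statistic and log likelihood copied from the glm outputs.
fits = {
    "poisson":   {"deviance": 2232.107093, "pearson": 2788.85834,  "ll": -1546.751398},
    "nbinomial": {"deviance": 425.1272067, "pearson": 423.8531355, "ll": -884.2572249},
    "gamma":     {"deviance": 253.0860658, "pearson": 506.2263775, "ll": -855.9673534},
}

for name, fit in fits.items():
    fit["deviance_df"] = fit["deviance"] / resid_df
    fit["pearson_df"] = fit["pearson"] / resid_df
    # Assumed per-observation AIC; it matches the values printed by glm.
    fit["aic"] = (-2 * fit["ll"] + 2 * n_coef) / n_obs
    print(f"{name:10s} dev/df = {fit['deviance_df']:.4f}  "
          f"pearson/df = {fit['pearson_df']:.4f}  aic = {fit['aic']:.4f}")
```

The Pearson statistic divided by its degrees of freedom is close to 9 for the Poisson model and much closer to 1 for the other two. Deviances computed under different families are on different scales, so the comparison of deviances across models should be read with some caution.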
The coefficients and p-values are very similar to those obtained from the negative binomial example; however, the deviance is much lower. In any case, there does not seem to be much substantive difference between these analyses.
There actually is one other analysis that we could try, generalized negative binomial regression. Generalized negative binomial regression is a generalization of the negative binomial model in which the shape parameter itself is parameterized (predicted). The data in the lahigh dataset come from two different schools. It is conceivable that the shape parameter could be different depending on the school. We will investigate this using the gnbreg command.
gnbreg daysabs gender mathnce langnce, lnalpha(school)
Generalized negative binomial regression Number of obs = 316
LR chi2(3) = 20.51
Prob > chi2 = 0.0001
Log likelihood = -876.75565 Pseudo R2 = 0.0116
------------------------------------------------------------------------------
daysabs | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
daysabs |
gender | -.4742854 .1386001 -3.42 0.001 -.7459366 -.2026343
mathnce | -.0026143 .0047886 -0.55 0.585 -.0119997 .0067712
langnce | -.0120342 .0056573 -2.13 0.033 -.0231222 -.0009461
_cons | 2.751023 .2236062 12.30 0.000 2.312763 3.189283
-------------+----------------------------------------------------------------
lnalpha |
school | .5960854 .2061477 2.89 0.004 .1920433 1.000128
_cons | -.6219761 .316587 -1.96 0.049 -1.242475 -.001477
------------------------------------------------------------------------------
It does appear that the shape parameter is different depending on the school, but it is not clear
whether this model is superior to the negative binomial or gamma models.
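The lnalpha equation can be turned into an over-dispersion estimate for each school by exponentiating the linear predictor. The sketch below is plain Python applied to the gnbreg coefficients above, with school coded 1 or 2 as stated in the describe output. It also computes a likelihood ratio statistic against the nbreg model, which is the special case in which the school coefficient in lnalpha is zero; offering this test is a suggestion for illustration and is not part of the original analysis.

```python
import math

# Coefficients from the lnalpha equation of the gnbreg output.
lnalpha_cons = -0.6219761
lnalpha_school = 0.5960854

# alpha = exp(_cons + b_school * school), with school coded 1 or 2.
alpha_by_school = {
    school: math.exp(lnalpha_cons + lnalpha_school * school)
    for school in (1, 2)
}

# Log likelihoods reported by nbreg (constant alpha) and gnbreg (alpha by school).
ll_nbreg = -880.87312
ll_gnbreg = -876.75565

# Likelihood ratio statistic with 1 df (one extra parameter in gnbreg).
lr_stat = 2 * (ll_gnbreg - ll_nbreg)
p_value = math.erfc(math.sqrt(lr_stat / 2))   # chi-square(1) upper tail

for school, alpha in alpha_by_school.items():
    print(f"school {school}: alpha = {alpha:.4f}")
print(f"LR chi2(1) = {lr_stat:.2f}, p = {p_value:.4f}")
```

The implied alpha is larger for school 2 than for school 1, and the likelihood ratio comparison points in the same direction as the z test on the school coefficient. It says nothing about the gamma model, which is not nested in either negative binomial model.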
These pages are Copyrighted (c) by UCLA Academic Technology Services