|
|
|
||||
|
|
|||||
Note: the code on this page works with R 2.4.1.
Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts.
Let's look at the data.
p<-read.table("http://www.ats.ucla.edu/stat/R/dae/poissonreg.csv", sep=",", header = TRUE)
attach(p)
names(p)
[1] "id" "school" "male" "math" "langarts" "daysatt" "daysabs"
The response variable of interest is days absent, daysabs. The variables math and langarts give the standardized test scores for math and language arts respectively. The variable male is a binary indicator of student gender.
Let's look at the data. For a nice layout for displaying the summary statistics, we made use of the package called fields.
library(fields)
c<-cbind(daysabs, math, langarts, male)
stats(c)
daysabs math langarts male
N 316.000000 316.000000 316.000000 316.0000000
mean 5.810127 48.751012 50.063794 0.4873418
Std.Dev. 7.449003 17.880757 17.939211 0.5006325
min 0.000000 1.007114 1.007114 0.0000000
Q1 1.000000 37.725357 40.150265 0.0000000
median 3.000000 48.943764 50.000000 0.0000000
Q3 8.000000 61.043877 61.043877 1.0000000
max 45.000000 98.992889 98.992889 1.0000000
missing values 0.000000 0.000000 0.000000 0.0000000
var(daysabs)
[1] 55.48764
hist(daysabs, breaks=150, xlim=c(0, 50), col="gray")
table(male)
male 0 1 162 154
R function glm.nb (generalized linear model) is used here for fitting the negative binomial model.
We begin by estimating the model with the variables of interest.
library(MASS)
m1<-glm.nb(daysabs~math+langarts+male)
summary(m1)
Call:
glm.nb(formula = daysabs ~ math + langarts + male, init.theta = 0.776166936578963,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9785 -1.0627 -0.4147 0.2865 2.8193
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.716069 0.234174 11.598 < 2e-16 ***
math -0.001601 0.005300 -0.302 0.76259
langarts -0.014348 0.005372 -2.671 0.00756 **
male -0.431185 0.139516 -3.091 0.00200 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.7762) family taken to be 1)
Null deviance: 378.43 on 315 degrees of freedom
Residual deviance: 356.93 on 312 degrees of freedom
AIC: 1771.7
Number of Fisher Scoring iterations: 1
Correlation of Coefficients:
(Intercept) math langarts
math -0.28
langarts -0.43 -0.69
male -0.40 -0.09 0.19
Theta: 0.7762
Std. Err.: 0.0742
2 x log-likelihood: -1761.7460
The output looks very much like the output from an OLS regression. The output begins with the information on deviance residuals. It gives us some idea on how well the model fits the data. You will then see the negative binomial regression coefficients for each of the variables along with the standard errors, z-scores, and p-values. The last part of the output gives the model fit information, including AIC.
Since math is not significant in the model with robust standard errors, we will rerun the model dropping that variable.
m2 <- glm.nb(daysabs ~ langarts + male) summary(m2)
Call:
glm.nb(formula = daysabs ~ langarts + male, init.theta = 0.775738355950073,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0040 -1.0659 -0.4137 0.2747 2.7043
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.703440 0.224672 12.033 < 2e-16 ***
langarts -0.015649 0.003883 -4.030 5.57e-05 ***
male -0.431207 0.139017 -3.102 0.00192 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(0.7757) family taken to be 1)
Null deviance: 378.29 on 315 degrees of freedom
Residual deviance: 356.90 on 313 degrees of freedom
AIC: 1769.9
Number of Fisher Scoring iterations: 1
Correlation of Coefficients:
(Intercept) langarts
langarts -0.91
male -0.44 0.17
Theta: 0.7757
Std. Err.: 0.0741
2 x log-likelihood: -1761.8550
Finally, we will use the predict function to get the predicted values in days absent.
For example, we fix the value of langarts at its mean and male at
0 and 1, the predict function returns the expected count for both male and
female when the language arts score is held at its mean.
male<-c(0, 1) ml<-mean(langarts) langarts<-c(ml, ml) pnew<-cbind(male, langarts) fitted.m2<-predict(m2, newdata=data.frame(pnew), type="response") fitted.m2
1 2 6.820807 4.431645
Finally, we compute the likehood ratio chi-square and degrees of freedom for our final model.
library(lmtest) lrtest(m2) Likelihood ratio test Model 1: daysabs ~ male + langarts Model 2: daysabs ~ 1 #Df LogLik Df Chisq Pr(>Chisq) 1 4 -880.93 2 2 -891.24 -2 20.631 3.312e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(xtable)
x2<-xtable(m2)
x
% latex table generated in R 2.5.0 by xtable 1.4-6 package
% Tue Jun 12 11:44:07 2007
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrrrr}
\hline
& Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\
\hline
(Intercept) & 2.7034 & 0.2247 & 12.03 & 0.0000 \\
langarts & $-$0.0156 & 0.0039 & $-$4.03 & 0.0001 \\
male & $-$0.4312 & 0.1390 & $-$3.10 & 0.0019 \\
\hline
\end{tabular}
\end{center}
\end{table}
The LaTex code above generates a table as shown below:
The negative binomial regression model predicting days absent from school stay from language arts and gender was statistically significant (chi-squared = 20.63, df = 2, p<.00003). The predictors langarts and male were each statically significant. For these data, the expected log count increase for a one-unit increase in language arts was -0.0156. This translates to a decrease of about 1.56 days absent for a one standard deviation increase in language arts when gender is held constant. Male students had an expected log count -0.43 less than female students which amounts to about 2.39 fewer days absent than females while holding language arts constant.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services