|
|
|
||||
|
Help the Stat Consulting Group by
giving a gift
| |||||
|
Loading
|
|||||
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up analyses.
This page was updated using R 2.11.
Example 1. School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include the type of program in which the student is enrolled and a standardized test in math.
Example 2. A health-related researcher is studying the number of hospital visits in past 12 months by senior citizens in a community based on the characteristics of the individuals and the types of health plans under which each one is covered.
We have attendance data on 314 high school juniors from two urban high schools in the file nb_data. The response variable of interest is days absent, daysabs. The variable math gives the standardized math score for each student. The variable prog is a three-level nominal variable indicating the type of instructional program in which the student is enrolled.
Let's look at the data.
library(foreign)
library(fields)
library(MASS)
library(car)
nbdata <- read.dta("http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta")
attach(nbdata)
stats(cbind(daysabs, math))
daysabs math
N 314.000000 314.00000
mean 5.955414 48.26752
Std.Dev. 7.036958 25.36239
min 0.000000 1.00000
Q1 1.000000 28.00000
median 4.000000 48.00000
Q3 8.000000 70.00000
max 35.000000 99.00000
missing values 0.000000 0.00000
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.
m1<-glm.nb(daysabs~math+as.factor(prog))
summary(m1)
Call:
glm.nb(formula = daysabs ~ math + as.factor(prog), init.theta = 1.03271315570559,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1548 -1.0192 -0.3694 0.2285 2.5273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 2e-16 ***
math -0.005993 0.002505 -2.392 0.0167 *
as.factor(prog)2 -0.440760 0.182610 -2.414 0.0158 *
as.factor(prog)3 -1.278651 0.200720 -6.370 1.89e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
Null deviance: 427.54 on 313 degrees of freedom
Residual deviance: 358.52 on 310 degrees of freedom
AIC: 1741.3
Number of Fisher Scoring iterations: 1
Theta: 1.033
Std. Err.: 0.106
2 x log-likelihood: -1731.258
Anova(m1, type = "III")
Anova Table (Type III tests)
Response: daysabs
LR Chisq Df Pr(>Chisq)
math 5.677 1 0.01719 *
as.factor(prog) 48.827 2 2.496e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As we mentioned earlier, negative binomial models assume the conditional means are not equal to the conditional variances. This inequality is captured by estimating a dispersion parameter (not shown in the output) that is held constant in a Poisson model. Thus, the Poisson model is actually nested in the negative binomial model. We can then use a likelihood ratio test to compare these two and test this model assumption. To do this, we will run our model as a Poisson and then use the lrtest command from the lmtest package.
In this example the associated chi-squared value is 926.03 with one degree of freedom. This strongly suggests the negative binomial model is more appropriate than the Poisson model. We can see that estimating the dispersion parameter improves the model significantly, so our negative binomial is more appropriate than a Poisson model.library(lmtest) m2 <- glm(daysabs~math+as.factor(prog), family = "poisson") lrtest(m2, m1) Likelihood ratio test Model 1: daysabs ~ math + as.factor(prog) Model 2: daysabs ~ math + as.factor(prog) #Df LogLik Df Chisq Pr(>Chisq) 1 4 -1328.64 2 5 -865.63 1 926.03 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We might be interested in looking at incident rate ratios rather than coefficients. To do this, we can exponentiate our model coefficients.
exp(m1$coefficients)
(Intercept) math as.factor(prog)2 as.factor(prog)3
13.6708448 0.9940249 0.6435471 0.2784127
The output above indicates that the incident rate for prog=2 is 0.64 times the incident rate for the reference group (prog=1). Likewise, the incident rate for prog=3 is 0.28 times the incident rate for the reference group holding the other variables constant. The percent change in the incident rate of daysabs is a 1% decrease for every unit increase in math.
The form of the model equation for negative binomial regression is the same as that for Poisson regression. The log of the outcome is predicted with a linear combination of the predictors:
log(daysabs) = Intercept + b1(prog=2) + b2(prog=3) + b3math.
This implies:
daysabs = exp(Intercept + b1(prog=2) + b2(prog=3)+ b3math) = exp(Intercept) * exp(b1(prog=2)) * exp(b2(prog=3)) * exp(b3math)
The coefficients have an additive effect in the log(y) scale and the IRR have a multiplicative effect in the y scale. The dispersion parameter in negative binomial regression does not effect the expected counts, but it does effect the estimated variance of the expected counts. More details can be found in the Modern Applied Statistics with S by W.N. Venables and B.D. Ripley (the book companion of the MASS package).
For additional information on the various metrics in which the results can be presented, and the interpretation of such, please see Regression Models for Categorical Dependent Variables Using Stata, Second Edition by J. Scott Long and Jeremy Freese (2006).
For assistance in further understanding the model, we can look at predicted counts for various levels of our predictors. Below we create new datasets with values of math and prog and then use the predict command to calculate the predicted number of events.
First, we can look at predicted counts for each value of prog while holding math at its mean. To do this, we create a new dataset with the combinations of prog and math for which we would like to find predicted values, then use the predict command. The predict command generates the predicted log counts, so to see the predicted counts, we exponentiate these.
newdata1 <- data.frame(cbind(rep(mean(math), 3), 1:3))
colnames(newdata1) <- c("math", "prog")
plogcounts <- predict(m1, newdata = newdata1)
pcounts <- exp(plogcounts)
cbind(newdata1, pcounts)
math prog pcounts
1 48.26752 1 10.236899
2 48.26752 2 6.587927
3 48.26752 3 2.850083
In the output above, we see that the predicted number of events (e.g., days absent) for level 1 of prog is about 10.24, holding math at its mean. The predicted number of events for level 2 of prog is lower at 6.59, and the predicted number of events for level 3 of prog is about 2.85.
Below we will obtain the mean predicted number of events for values of math that range from 0 to 100 in increments of 20 for each level of prog.
newdata2 <- data.frame(cbind(
rep(seq(0, 100, 20), 3),
c(rep(1, 6), rep(2, 6), rep(3, 6))
))
colnames(newdata2) <- c("math", "prog")
plogcounts <- predict(m1, newdata = newdata2)
pcounts <- exp(plogcounts)
cbind(newdata2, pcounts)
math prog pcounts
1 0 1 13.670845
2 20 1 12.126652
3 40 1 10.756884
4 60 1 9.541838
5 80 1 8.464038
6 100 1 7.507981
7 0 2 8.797833
8 20 2 7.804072
9 40 2 6.922562
10 60 2 6.140623
11 80 2 5.447007
12 100 2 4.831740
13 0 3 3.806137
14 20 3 3.376214
15 40 3 2.994853
16 60 3 2.656569
17 80 3 2.356496
18 100 3 2.090317
The table above shows that when prog is held at 1 and math increases from 0 to 100, the predicted counts decrease from 13.67 to 7.51. When prog is held at 2 and math increases from 0 to 100, the predicted counts decrease from 8.80 to 4.83. When prog is held at 3 and math increases from 0 to 100, the predicted counts decrease from 3.81 to 2.09. If we look at the ratio of these values, (2.09/3.81) = 0.55, we can see that this matches the IRR of 0.994 for a 100 unit change: 0.994^100 = 0.55.
Since this is a log-linear model, these decreases will not be linearly related to math. We can plot these predicted values to see the relationship between math and daysabs.
xvals <- seq(0, 100, 20)
plot(xvals, pcounts[1:6], ylim = c(0, 15),
type = "l", xlab = "math", ylab = "predicted counts")
points(xvals, pcounts[7:12], type = "l", lty = 2)
points(xvals, pcounts[13:18], type = "l", lty = 3)
legend("topright", c("prog1", "prog2", "prog3"), lty = c(1,2,3))

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services