UCLA Academic Technology Services HomeServicesClassesContactJobs
Help the Stat Consulting Group by giving a gift             
Loading

R Data Analysis Examples
Poisson Regression

Poisson regression is used to model count variables with the assumption that the conditional means equal the conditional variances.

Please note: The purpose of this page is to show how to use various data analysis commands.  It does not cover all aspects of the research process which researchers are expected to do.  In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up analyses.

This page uses R 2.11. 

Examples of Poisson regression

Example 1.  The number of persons killed by mule or horse kicks in the Prussian army per year. Ladislaus Bortkiewicz collected data from 20 volumes of Preussischen Statistik.  These data were collected on 10 corps of the Prussian army in the late 1800s over the course of 20 years.

Example 2.  The number of people in line in front of you at the grocery store.  Predictors may include the number of items currently offered at a special discounted price and whether a special event (e.g., a holiday, a big sporting event) is three or fewer days away.

Example 3.  The number of awards earned by students at one high school.  Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., vocational, general or academic) and the score on their final exam in math.

Description of the data

For the purpose of illustration, we have simulated a data set for Example 3 above.  In this example, num_awards is the outcome variable and indicates the number of awards earned by students at a high school in a year, math is a continuous predictor variable and represents students' scores on their math final exam, and prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled. It is coded as 1 = "General", 2 = "Academic" and 3 = "Vocational".

Let's start with loading the data and looking at some descriptive statistics.

p<-read.table("http://www.ats.ucla.edu/stat/R/dae/poisson_sim.csv", sep=",", header = TRUE) 
summary(p)
       id           num_awards        prog            math      
 Min.   :  1.00   Min.   :0.00   Min.   :1.000   Min.   :33.00  
 1st Qu.: 50.75   1st Qu.:0.00   1st Qu.:2.000   1st Qu.:45.00  
 Median :100.50   Median :0.00   Median :2.000   Median :52.00  
 Mean   :100.50   Mean   :0.63   Mean   :2.025   Mean   :52.65  
 3rd Qu.:150.25   3rd Qu.:1.00   3rd Qu.:2.250   3rd Qu.:59.00  
 Max.   :200.00   Max.   :6.00   Max.   :3.000   Max.   :75.00  

Each variable has 200 valid observations and their distributions seem quite reasonable. The unconditional mean and variance of our outcome variable are not extremely different. Our model assumes that these values, conditioned on the predictor variables, will be equal (or at least roughly so).

We can use the tapply function to display the summary statistics by program type. The table below shows the average numbers of awards by program type and seems to suggest that program type is a good candidate for predicting the number of awards, our outcome variable, because the mean value of the outcome appears to vary by prog. Additionally, the means and variances within each level of prog--the conditional means and variances--are similar. A bar plot is also produced to display the distribution of the outcome variable.

p$pcat <-factor(p$prog, labels=c("General", "Academic", "Vocational"))
table(p$pcat)
General   Academic Vocational 
     45        105         50
tapply(p$num_awards, p$pcat, mean)
   General   Academic Vocational 
      0.20       1.00       0.24 
tapply(p$num_awards, p$pcat, sd)
   General   Academic Vocational 
 0.4045199  1.2785208  0.5174506 
barplot(table(p$num_awards), space=0, ylim=c(0, 160), xlab="number of awards")

Analysis methods you might consider

Below is a list of some analysis methods you may have encountered.  Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations. 

Poisson regression analysis

At this point, we are ready to perform our Poisson model analysis using function glm and we will attach our data so we will have easier time referring our variables. By using the function relevel, we specify that we will use pcat = "General" as the reference group.

attach(p)
b.<-relevel(pcat, "General")
m1<-glm(num_awards ~ b. + math, family="poisson", p)
summary(m1)
Call:
glm(formula = num_awards ~ b. + math, family = "poisson", data = p)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2043  -0.8436  -0.5106   0.2558   2.6796  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -5.24712    0.65845  -7.969 1.60e-15 ***
b.Academic    1.08386    0.35825   3.025  0.00248 ** 
b.Vocational  0.36981    0.44107   0.838  0.40179    
math          0.07015    0.01060   6.619 3.63e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 287.67  on 199  degrees of freedom
Residual deviance: 189.45  on 196  degrees of freedom
AIC: 373.50

Number of Fisher Scoring iterations: 6

Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distribution assumption that the variance equals the mean. We use R package sandwich below to obtain the robust standard errors and calculated the p-values accordingly. Together with the p-values, we have also calculated the 95% confidence interval using the parameter estimates and their robust standard errors.

library(sandwich)
cov.m1<-vcovHC (m1, type="HC0")
std.err<-sqrt(diag(cov.m1))
r.est<-cbind(estimate=m1$coefficients, std.err, 
       pvalue=round(2*(1-pnorm(abs(m1$coefficients/std.err))),5), 
       lower=m1$coefficients-1.96*std.err, 
       upper=m1$coefficients+1.96*std.err)
       
r.est
               estimate    std.err  pvalue       lower       upper
(Intercept)  -5.2471244 0.64599839 0.00000 -6.51328124 -3.98096756
b.Academic    1.0838591 0.32104816 0.00074  0.45460476  1.71311353
b.Vocational  0.3698092 0.40041731 0.35572 -0.41500870  1.15462716
math          0.0701524 0.01043516 0.00000  0.04969947  0.09060532

Now let's look at the output of function glm more closely.

We can also test the overall effect of pcat by comparing the deviance of the full model with the deviance of the model excluding pcat. The two degree-of-freedom chi-square test indicates that pcat, taken together, is a statistically significant predictor of num_awards.

m1<-glm(num_awards ~ b. + math, family="poisson", p)
m2<-glm(num_awards ~ math, family="poisson", p)
dev.dff<-anova(m2, m1)
1-pchisq(dev.dff[2,4], dev.dff[2,3])
[1] 0.0006851718

Sometimes, we might want to present the regression results as incident rate ratios and their standard errors, together with the confidence interval. To compute the standard error for the incident rate ratios, we will use the Delta method. To this end, we make use the function deltamethod implemented in R package msm.

library(msm)
estmean<-coef(m1)
var<-cov.m1
s2<-deltamethod (~ exp(x2), estmean, var) 
s3<-deltamethod (~ exp(x3), estmean, var) 
s4<-deltamethod (~ exp(x4), estmean, var) 
irr<-cbind(irr=exp(estmean)[2:4], r.serr=rbind(s2, s3, s4), 
           lower=exp(m1$coefficients-1.96*std.err)[2:4], 
           upper=exp(m1$coefficients+1.96*std.err)[2:4])
           
irr
                  irr                lower    upper
b.Academic   2.956065 0.94903937 1.5755505 5.546203
b.Vocational 1.447458 0.57958742 0.6603345 3.172840
math         1.072672 0.01119351 1.0509552 1.094837

The output above indicates that the incident rate for pcat = "Academic" is 2.96 times the incident rate for the reference group (pcat = "General").  Likewise, the incident rate for pcat = "Vocational" is 1.45 times the incident rate for the reference group holding the other variables at constant.  The percent change in the incident rate of num_awards is by 7% for every unit increase in math. For additional information on the various metrics in which the results can be presented, and the interpretation of such, please see Regression Models for Categorical Dependent Variables Using Stata, Second Edition by J. Scott Long and Jeremy Freese (2006).

Sometimes, we might want to look at the expected marginal means. For example, what are the expected counts for each program type holding math score at its overall mean? To answer this question, we can make use of the predict function. First off, we will make a small data set so to apply the predict function to it.

s1<-as.data.frame(rep(mean(math), 3))
colnames(s1)<-c("math")
mypcat <-factor(c(1, 2, 3), labels=c("General", "Academic", "Vocational"))
s1$b.<-relevel(mypcat, "General")
s1
    math         b.
1 52.645    General
2 52.645   Academic
3 52.645 Vocational

Now we can get the expected marginal means for each program type.

phat1<-predict(m1, s1, type="response", se.fit=TRUE)
print(phat1)
$fit
        1         2         3 
0.2114109 0.6249446 0.3060086 
$se.fit
         1          2          3 
0.07050108 0.08628117 0.08833706 
$residual.scale
[1] 1

In the output above, we see that the predicted number of events for level 1 of prog is about .21, holding math at its mean.  The predicted number of events for level 2 of prog is higher at .62, and the predicted number of events for level 3 of prog is about .31. The ratios of these predicted counts (.625/.211 = 2.96, .306/.211 = 1.45) match what we saw looking at the IRR.

We can also graph the predicted number of events with the commands below.  The graph indicates that the most awards are predicted for those in the academic program (prog = 2), especially if the student has a high math score.  The lowest number of predicted awards is for those students in the general program (prog = 1).

p$phat<-predict(m1, type="response")
pall<-p[, c("pcat", "phat", "math")][order(p$math),]

pg<-pall[pall$pcat=="General", ]
pa<-pall[pall$pcat=="Academic", ]
pv<-pall[pall$pcat=="Vocational", ]

plot(pg$math, pg$phat, xlim=c(30, 80), ylim=c(0, 3), type="l", lty=3,
     xlab="Math Score", ylab="Expected number of awards")
lines(pa$math, pa$phat, type="l", lty=1)
lines(pv$math, pv$phat, type="l", lty=2)
legend("topleft",c("General","Academic", "Vocation"), lty=c(3, 1, 2))

Things to consider

References

Here is the link to the R syntax file used for this page.


How to cite this page

Report an error on this page or leave a comment

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.