
Poisson Regression

This page shows an example of Poisson regression analysis with footnotes explaining the output. The data
collected were academic information on 316 students. The response variable is
days absent during the school year (**daysabs**), and we explore its relationship with
standardized math test score (**mathnce**), standardized language test score (**langnce**)
and gender (**female**).

As assumed for a Poisson model, our response variable is a count variable, and
each subject has the same length of observation time. Had the observation time
varied across subjects, the Poisson model would need to be adjusted to account for
the varying length of observation time per subject. This point is discussed later
in the page. Also, the Poisson model, as compared to other count models (e.g.,
negative binomial or zero-inflated models), is assumed to be the appropriate model.
In other words, we assume that the dependent variable is not over-dispersed and
does not have an excessive number of zeros. The first half of this page interprets
the coefficients in terms of Poisson regression coefficients, and the second half
interprets the coefficients in terms of incidence rate ratios.
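
The assumptions above can be probed informally before fitting anything. The following is a minimal sketch, not part of the original analysis: it compares the sample variance of **daysabs** to its mean (equal for a Poisson variable) and counts the zeros, using only the dataset named on this page and results stored by `summarize` and `count`. It is a marginal check that ignores the predictors, so it can only hint at a problem.

```stata
* Informal look at the Poisson assumptions (a sketch, not a formal test).
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear

* summarize stores the sample mean in r(mean) and the variance in r(Var).
summarize daysabs
display "variance/mean = " r(Var)/r(mean)

* count stores the number of matching observations in r(N).
count if daysabs == 0
display "proportion of zeros = " r(N)/_N
```

A variance-to-mean ratio far above 1, or a large share of zeros, would suggest looking at the other count models mentioned above.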

We also run the **estat ic** command, which reports the log likelihoods of the
null and fitted models used to calculate the likelihood ratio chi-square statistic.

    use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
    generate female = (gender == 1)
    poisson daysabs mathnce langnce female

    Iteration 0:   log likelihood = -1547.9709
    Iteration 1:   log likelihood = -1547.9709

    Poisson regression                                Number of obs   =        316
                                                      LR chi2(3)      =     175.27
                                                      Prob > chi2     =     0.0000
    Log likelihood = -1547.9709                       Pseudo R2       =     0.0536

    ------------------------------------------------------------------------------
         daysabs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         mathnce |  -.0035232   .0018213    -1.93   0.053     -.007093    .0000466
         langnce |  -.0121521   .0018348    -6.62   0.000    -.0157483   -.0085559
          female |   .4009209   .0484122     8.28   0.000     .3060348     .495807
           _cons |   2.286745   .0699539    32.69   0.000     2.149638    2.423852
    ------------------------------------------------------------------------------

    estat ic

    ------------------------------------------------------------------------------
           Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
    -------------+----------------------------------------------------------------
               . |    316   -1635.608   -1547.971      4     3103.942    3118.965
    ------------------------------------------------------------------------------

    Iteration 0:   log likelihood = -1547.9709
    Iteration 1:   log likelihood = -1547.9709^{a}

    Poisson regression                                Number of obs^{c}   =        316
                                                      LR chi2(3)^{d}      =     175.27
                                                      Prob > chi2^{e}     =     0.0000
    Log likelihood = -1547.9709^{b}                   Pseudo R2^{f}       =     0.0536

    estat ic

    ------------------------------------------------------------------------------
           Model |    Obs    ll(null)^{d}   ll(model)^{d}     df          AIC         BIC
    -------------+----------------------------------------------------------------
               . |    316   -1635.608      -1547.971          4     3103.942    3118.965
    ------------------------------------------------------------------------------

a. **Iteration Log** - This is a listing of the log likelihood at each iteration.
Poisson regression uses maximum likelihood estimation, which is an iterative
procedure to obtain parameter estimates. If you are familiar with other
regression models that use maximum likelihood (e.g., logistic regression), you may
notice this iteration log behaves differently. Specifically, the log likelihood
at iteration 0 does not correspond to the likelihood for the empty (or null)
model. This is evident when we look under ll(null) from the **estat ic**
command, which provides the log likelihood for the empty model. The log likelihood for the fitted
model is given in the last
iteration of the iteration log and under ll(model) from **estat ic**; note
that both values are equal (unlike ll(null) and the log likelihood from
iteration 0). The log likelihood for the fitted model is then used with ll(null) to calculate
the likelihood ratio chi-square test statistic.

b. **Log Likelihood** - This is the log likelihood of the fitted model. It is used in the
calculation of the likelihood ratio (LR) chi-square test of whether all predictor
variables' regression coefficients are simultaneously zero and in tests of nested models.

c. **Number of obs** - This is the number of observations used in the
Poisson regression. It may be less than the number of cases in the dataset if
there are missing values for some variables in the model. By default, Stata does
a listwise deletion of incomplete cases.

d. **LR chi2(3)**, **ll(null)** and **ll(model)** from **estat ic** - This is the LR test statistic for the omnibus test that at
least one predictor variable's regression coefficient is not equal to zero in the
model. The degrees of freedom (the number in parentheses) of the LR test
statistic are defined by the number of predictor variables (3). **LR chi2(3)** is calculated as -2*[ll(null) - ll(model)]
= -2*[-1635.608 - (-1547.971)] = 175.274.
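
The arithmetic in footnote d can be reproduced from the results that `poisson` leaves behind. This is a small sketch under the assumption that the model is refit exactly as on this page; `e(ll_0)`, `e(ll)` and `e(chi2)` are results stored by `poisson`, and the scalar name `lr_hand` is only illustrative.

```stata
* Refit the model from this page, then rebuild LR chi2(3) by hand.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)
quietly poisson daysabs mathnce langnce female

* e(ll_0) is the null-model log likelihood, e(ll) the fitted-model one.
scalar lr_hand = -2*(e(ll_0) - e(ll))
display "LR chi2 by hand  = " lr_hand

* e(chi2) is the statistic Stata prints in the output header.
display "LR chi2 reported = " e(chi2)

* The same arithmetic with the rounded values shown by estat ic.
display -2*(-1635.608 - (-1547.971))
```

The first two lines of output should agree with each other; the last line uses the rounded log likelihoods and gives the 175.274 quoted above.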

e. **Prob > chi2** - This is the probability of obtaining an LR test statistic as extreme as, or more extreme than, the
one observed if the null hypothesis is true; the null hypothesis is that all of the regression coefficients
are simultaneously equal to zero. In other words, this is the probability of obtaining a
chi-square test statistic of at least 175.274 if there is in fact no effect of the predictor variables. This p-value is compared to a specified alpha level, our willingness
to accept a Type I error, which is typically set at 0.05 or 0.01. The small p-value from the LR test,
p < 0.0001, would lead us to conclude that at least
one of the regression coefficients in the model is not equal to zero. The parameter of the
chi-square distribution used to test the null hypothesis is defined
by the degrees of freedom in the prior line, **chi2(3)**.
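
As a sketch of where this p-value comes from, the upper-tail area of a chi-square distribution with 3 degrees of freedom can be evaluated with Stata's `chi2tail()` function. The refit below assumes the same data and model as on this page; `e(df_m)` and `e(chi2)` are results stored by `poisson`.

```stata
* Upper-tail chi-square probability for the omnibus LR test.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)
quietly poisson daysabs mathnce langnce female

* From the stored model degrees of freedom and LR statistic.
display "Prob > chi2 = " chi2tail(e(df_m), e(chi2))

* From the rounded numbers quoted in the text.
display chi2tail(3, 175.274)
```

Both values are far smaller than 0.0001, which is why the header shows 0.0000.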

f. **Pseudo R2** - This is McFadden's pseudo R-squared. It is calculated
as 1 - **ll(model)**/**ll(null)** = 1 - (-1547.971)/(-1635.608) = 0.0536. Poisson regression does not have an equivalent to the R-squared found in OLS regression;
however, many have tried to derive an equivalent measure. There are a variety of pseudo-R-square statistics. Because this statistic does not mean what
R-square means in OLS regression (the proportion of variance of the response variable explained by the predictors), we suggest interpreting this statistic with
caution.
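
The pseudo R-squared can be rebuilt the same way as the LR statistic. A minimal sketch, assuming the model is refit as on this page; `e(r2_p)` is the value `poisson` stores, and the scalar name `r2_hand` is only illustrative.

```stata
* McFadden's pseudo R-squared from the two log likelihoods.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)
quietly poisson daysabs mathnce langnce female

scalar r2_hand = 1 - e(ll)/e(ll_0)
display "Pseudo R2 by hand  = " r2_hand
display "Pseudo R2 reported = " e(r2_p)
```

Both lines should round to the 0.0536 shown in the header.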

    ------------------------------------------------------------------------------
         daysabs^{g} |   Coef.^{h}   Std. Err.^{i}   z^{j}   P>|z|^{j}   [95% Conf. Interval]^{k}
    -------------+----------------------------------------------------------------
         mathnce |  -.0035232   .0018213    -1.93   0.053     -.007093    .0000466
         langnce |  -.0121521   .0018348    -6.62   0.000    -.0157483   -.0085559
          female |   .4009209   .0484122     8.28   0.000     .3060348     .495807
           _cons |   2.286745   .0699539    32.69   0.000     2.149638    2.423852
    ------------------------------------------------------------------------------

g. **daysabs** - This is the response variable in the Poisson regression. Underneath
**daysabs** are the predictor variables and the intercept (_cons).

h. **Coef.** - These are the estimated
Poisson regression coefficients for the model. Recall that the dependent variable is
a count variable, and Poisson regression models the log of the expected count
as a function of the predictor variables. We can interpret the Poisson
regression coefficient as follows: for a one unit change in the predictor variable, the
log of the expected count is expected to change by the respective
regression coefficient, given the other predictor variables in the model are held
constant. Equivalently, the coefficient is the difference in the logs of expected
counts between x+1 and x.

**mathnce** - This is the Poisson regression estimate for a one unit increase in
standardized math test score, given the other
variables are held constant in the model. If a student
were to increase her **mathnce** test score by one point, the log of her
expected count would be expected to decrease by 0.0035 units, while holding
the other variables in the model constant.

**langnce** - This is the Poisson regression estimate
for a one unit increase in standardized language test score, given the other
variables are held constant in the model. If a student
were to increase her **langnce** test score by one point, the log of her
expected count would be expected to decrease by 0.0122 units, while holding
the other variables in the model constant.

**female** - This is the estimated Poisson
regression coefficient comparing females to males, given the other variables are
held constant in the model. The log of the expected count is
expected to be 0.4009 units higher for females compared to males, while holding
the other variables constant in the model.

**_cons** - This is the Poisson regression estimate
when all variables in the model are evaluated at zero. For males (the variable
**female** evaluated at zero) with zero **mathnce** and **langnce**
test scores, the log of the expected count for **daysabs** is 2.2867 units. Note that
evaluating **mathnce** and **langnce** at zero is out of the range of plausible test
scores. If the test scores were mean-centered, the intercept would have a
natural interpretation: the log of the expected count for males with average
**mathnce** and **langnce** test scores.
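
The mean-centering mentioned for **_cons** could be sketched as follows. The centered variable names `mathnce_c` and `langnce_c` are hypothetical (they are not in the original dataset), and the sketch assumes no missing values in the test scores, so that the sample means match the estimation sample.

```stata
* Refit the model with test scores centered at their sample means.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)

* summarize, meanonly stores the sample mean in r(mean).
summarize mathnce, meanonly
generate mathnce_c = mathnce - r(mean)
summarize langnce, meanonly
generate langnce_c = langnce - r(mean)

* The slopes are unchanged by centering; only _cons changes, and it now
* refers to males with average math and language scores.
poisson daysabs mathnce_c langnce_c female
```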

i. **Std. Err.** - These are the standard errors of the individual
regression coefficients. They are used both in the calculation of the **z** test
statistic (superscript j) and the confidence interval of the regression coefficient
(superscript k).

j. **z** and **P>|z|** - These are the test statistic and p-value, respectively,
for the null hypothesis that an individual predictor's regression
coefficient is zero, given that the rest of the predictors are in the model. The test statistic **z** is the ratio of the **Coef.** to the
**Std. Err.** of the respective predictor. Under the null hypothesis, **z** follows a standard normal distribution, which is used to test against a two-sided
alternative hypothesis that the **Coef.** is not equal to zero. **P>|z|** is the probability, under the null hypothesis, of a **z** test statistic as extreme as, or more
extreme than, the one observed.

**mathnce** - The **z** test statistic for the null hypothesis that
the slope for **mathnce** on **daysabs** is zero,
given the other variables are in the model, is (-0.0035232/0.0018213) = -1.93, with an
associated p-value of 0.053. If we set our alpha level at 0.05, we would
fail to reject the null hypothesis and conclude the Poisson regression
coefficient for **mathnce** is not statistically different from zero given **langnce** and **female** are in the model.

**langnce** - The **z** test statistic for the null hypothesis that the
slope for **langnce** on **daysabs** is zero, given
the other variables are in the model, is (-0.0121521/0.0018348) = -6.62, with an
associated p-value of <0.0001. If we set our alpha level at 0.05, we would
reject the null hypothesis and conclude the Poisson regression
coefficient for **langnce** is statistically different from zero given **mathnce** and **female** are in the model.

**female** - The **z** test statistic for the null hypothesis that
the difference in the log of expected counts of **daysabs** between females and males
is zero, given the other variables are in the model, is (0.4009209/0.0484122) = 8.28,
with an associated p-value of <0.0001. If we set our alpha level at 0.05,
we would reject the null hypothesis and conclude that the
coefficient for **female** is statistically different from zero given **mathnce** and
**langnce** are in the model.

**_cons** - The **z** test statistic for the null hypothesis that **_cons** is zero, given the other variables are in
the model and evaluated at zero, is (2.286745/0.0699539) = 32.69, with an associated p-value of <0.0001. If we
set our alpha level at 0.05, we would reject the null hypothesis and conclude
that **_cons** is statistically different from zero given **mathnce**,
**langnce** and **female** are in the model and evaluated at zero.
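
The z statistics and p-values in footnote j can be recomputed from the stored coefficients and standard errors. This is a sketch for **langnce** only, assuming the model is refit as on this page; `_b[]`, `_se[]` and `normal()` are standard Stata, while the scalar names `z_stat` and `p_two` are illustrative.

```stata
* Rebuild z and the two-sided p-value for langnce by hand.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)
quietly poisson daysabs mathnce langnce female

* z is the coefficient divided by its standard error.
scalar z_stat = _b[langnce]/_se[langnce]

* normal() is the standard normal CDF; doubling the lower tail at -|z|
* gives the two-sided p-value.
scalar p_two = 2*normal(-abs(z_stat))

display "z = " z_stat "   P>|z| = " p_two
```

Replacing `langnce` with another predictor name, or with `_cons`, should reproduce the corresponding row of the table.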

k. **[95% Conf. Interval]** - This is the confidence interval (CI) of an
individual Poisson regression coefficient, given the other predictors are in the
model. For a given predictor variable at the 95% confidence level, we'd say
that upon repeated sampling, 95% of the CIs constructed this way would
include the "true" population Poisson regression coefficient. It is calculated
as **Coef.** ± (z_{α/2})*(**Std. Err.**), where z_{α/2}
is a critical value on the standard normal distribution. The CI is equivalent to the **z** test: if the CI includes zero, we'd fail to
reject the null hypothesis that a particular regression coefficient is zero, given the other predictors are in the model.
An advantage of a CI is that it is illustrative; it provides information on where the "true" parameter may lie
and the precision of the point estimate.
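
The formula in footnote k can be evaluated directly. A minimal sketch for the **female** coefficient, assuming the model is refit as on this page; `invnormal(0.975)` supplies the critical value z_{α/2} for a 95% interval, and the scalar names are illustrative.

```stata
* 95% confidence interval for the female coefficient by hand.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)
quietly poisson daysabs mathnce langnce female

* Critical value of the standard normal for a two-sided 95% interval.
scalar z_crit = invnormal(0.975)

scalar ci_lo = _b[female] - z_crit*_se[female]
scalar ci_hi = _b[female] + z_crit*_se[female]
display "95% CI for female: [" ci_lo ", " ci_hi "]"
```

The endpoints should match the .3060348 and .495807 shown in the table, up to display rounding.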

The following is the interpretation of the Poisson regression in terms of
incidence rate ratios, which can be obtained by **poisson, irr** after running the
Poisson model or by specifying the **irr** option
when the full model is specified. This part of the interpretation applies to the output below.

Before we interpret the coefficients in terms of incidence rate ratios, we
must address how we can go from interpreting the Poisson regression coefficients
as a difference between the logs of expected counts to incidence rate ratios. In the
discussion above, Poisson regression coefficients were interpreted as the difference
between the logs of expected counts. Formally, this can be written as
β = log(μ_{x+1}) - log(μ_{x}), where β is the regression coefficient,
μ is the expected count and the subscripts indicate where the predictor
variable, say x, is evaluated: at x and at x+1 (implying a one unit change in the
predictor variable x). Recall that the difference of two logs is equal to the log
of their quotient, log(μ_{x+1}) - log(μ_{x}) = log(μ_{x+1} / μ_{x}),
and therefore we could have also interpreted the parameter
estimate as the log of the ratio of expected counts. This explains the "ratio"
in incidence rate ratios. In addition, what we
referred to as a count can also be called a rate. By definition, a rate is the
number of events per unit of time (or space), which our response variable qualifies as.
Hence, we could
also interpret the Poisson regression coefficients as the log of the rate ratio.
This explains the "rate" in incidence rate ratio. Finally, the rate at which
events occur is called the incidence rate; thus we arrive at
interpreting the coefficients in terms of incidence rate ratios from our
interpretation above.

Also, note that each subject in our sample was followed for one school year.
If this were not the case (e.g., some subjects were followed for half a
year, some for a year and the rest for two years) and we were to neglect the
exposure time, our Poisson regression estimates would be biased, since our model
assumes all subjects had the same follow-up time. If this were an issue, we would
use the exposure option, **exposure(varname)**,
where **varname** is the variable recording the length of time each subject was followed.
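
As a sketch of the exposure option in use: the lahigh data contain no follow-up-time variable, so the example below creates a hypothetical one, `followup`, set to 1 school year for everyone purely so that the command runs. With real data, `followup` would hold each subject's actual observation time.

```stata
* Poisson regression with an exposure variable (illustrative only).
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)

* Hypothetical follow-up time in school years; not in the original data.
generate followup = 1

* exposure() enters ln(followup) in the linear predictor with its
* coefficient constrained to 1, so the model describes absences per year.
poisson daysabs mathnce langnce female, exposure(followup)
```

Because ln(1) = 0, this particular fit reproduces the estimates shown earlier; the estimates would differ only once follow-up times vary across subjects.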

    poisson daysabs mathnce langnce female, irr

    Iteration 0:   log likelihood = -1547.9709
    Iteration 1:   log likelihood = -1547.9709

    Poisson regression                                Number of obs   =        316
                                                      LR chi2(3)      =     175.27
                                                      Prob > chi2     =     0.0000
    Log likelihood = -1547.9709                       Pseudo R2       =     0.0536

    ------------------------------------------------------------------------------
         daysabs |     IRR^{a}   Std. Err.      z    P>|z|     [95% Conf. Interval]^{b}
    -------------+----------------------------------------------------------------
         mathnce |    .996483   .0018149    -1.93   0.053     .9929321    1.000047
         langnce |   .9879214   .0018127    -6.62   0.000      .984375    .9914806
          female |   1.493199    .072289     8.28   0.000      1.35803    1.641823
    ------------------------------------------------------------------------------

a. **IRR** - These are the incidence rate ratios for the
Poisson model shown earlier. We obtain the incidence rate ratio by exponentiating
the Poisson regression coefficient.

**mathnce** - This is the estimated
rate ratio for a one unit increase in
standardized math test score, given the other
variables are held constant in the model. If a student
were to increase his **mathnce** test score by one point, his incidence rate
for **daysabs** would
be expected to change by a factor of 0.9965 (a 0.35% decrease), while holding all other variables in the model constant.

**langnce** - This is the estimated rate ratio for a one unit increase in
standardized language test score, given the other variables are held constant in
the model. If a student
were to increase his **langnce** test score by one point, his incidence rate
for **daysabs** would
be expected to change by a factor of 0.9879 (a 1.2% decrease), while holding all other variables in the model constant.

**female** - This is the estimated
rate ratio comparing females to males. Females compared to males, while holding the other variables constant
in the model, are expected to have an incidence rate for **daysabs** 1.493 times that of males (a 49.3% increase).
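
The link between the two halves of this page can be checked by exponentiating the stored coefficients. A minimal sketch, assuming the model is refit as on this page; `exp()` and `_b[]` are standard Stata.

```stata
* Incidence rate ratios as exponentiated Poisson coefficients.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)
quietly poisson daysabs mathnce langnce female

display "IRR mathnce = " exp(_b[mathnce])
display "IRR langnce = " exp(_b[langnce])
display "IRR female  = " exp(_b[female])

* Percent change in the incidence rate implied by a rate ratio.
display "percent change, female = " 100*(exp(_b[female]) - 1)
```

The three ratios should reproduce the IRR column of the output, and the last line expresses the female rate ratio as the percent increase quoted above.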

b. **[95% Conf. Interval]** - This is the CI for the rate ratio,
given the other predictors are in the model. For a given predictor at the
95% confidence level, we'd say that upon repeated sampling,
95% of the CIs constructed this way would include the "true" population incidence rate ratio, given
the other variables are in the model.
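
The interval for a rate ratio can be understood as the exponentiated interval for the coefficient. The sketch below uses the **female** row; the endpoint values are copied from the coefficient table earlier on this page, and the refit assumes the same data and model.

```stata
* CI for the female IRR from the CI of the female coefficient.
* Endpoints copied from the coefficient table.
display "lower = " exp(.3060348)
display "upper = " exp(.495807)

* The same interval from a refit, using stored results.
use http://www.ats.ucla.edu/stat/stata/notes/lahigh, clear
generate female = (gender == 1)
quietly poisson daysabs mathnce langnce female
display "lower = " exp(_b[female] - invnormal(0.975)*_se[female])
display "upper = " exp(_b[female] + invnormal(0.975)*_se[female])
```

Both pairs should agree with the 1.35803 and 1.641823 reported in the IRR output. Because the interval is built on the log scale and then exponentiated, it is not symmetric around the IRR.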
