### SAS Data Analysis Examples Zero-Truncated Poisson Regression

Zero-truncated poisson regression is used to model count data for which the value zero cannot occur.

Please Note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and verification, verification of assumptions, model diagnostics and potential follow-up analyses.

#### Examples of zero-truncated Poisson regression

Example 1. A study of length of hospital stay, in days, as a function of age, kind of health insurance and whether or not the patient died while in the hospital. Length of hospital stay is recorded as a minimum of at least one day.

Example 2. A study of the number of journal articles published by tenured faculty as a function of discipline (fine arts, science, social science, humanities, medical, etc). To get tenure faculty must publish, therefore, there are no tenured faculty with zero publications.

Example 3. A study by the county traffic court on the number of tickets received by teenagers as predicted by school performance, amount of driver training and gender. Only individuals who have received at least one citation are in the traffic court files.

#### Description of the data

Let's pursue Example 1 from above.

We have a hypothetical data file, ztp, with 1,493 observations available here . The length of hospital stay variable is stay. The variable age gives the age group from 1 to 9 which will be treated as interval in this example. The variables hmo and died are binary indicator variables for HMO insured patients and patients who died while in the hospital, respectively.

Let's look at the data.

proc means data=mylib.ztp;
var stay;
run;

The MEANS Procedure

Analysis Variable : stay Length of Stay

N            Mean         Std Dev         Minimum         Maximum
--------------------------------------------------------------------
1493       9.7287341       8.1329081       1.0000000      74.0000000
--------------------------------------------------------------------

proc univariate data=mylib.ztp noprint;
histogram stay / midpoints = 0 to 80 by 2 vscale = count;
run;

proc freq data=mylib.ztp;
tables age hmo died;
run;

The FREQ Procedure

Age Group

Cumulative    Cumulative
age    Frequency     Percent     Frequency      Percent
--------------------------------------------------------
1           6        0.40             6         0.40
2          60        4.02            66         4.42
3         163       10.92           229        15.34
4         291       19.49           520        34.83
5         317       21.23           837        56.06
6         327       21.90          1164        77.96
7         190       12.73          1354        90.69
8          93        6.23          1447        96.92
9          46        3.08          1493       100.00

hmo

Cumulative    Cumulative
hmo    Frequency     Percent     Frequency      Percent
--------------------------------------------------------
0        1254       83.99          1254        83.99
1         239       16.01          1493       100.00

died

Cumulative    Cumulative
died    Frequency     Percent     Frequency      Percent
---------------------------------------------------------
0         981       65.71           981        65.71
1         512       34.29          1493       100.00


#### Analysis methods you might consider

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.

• Zero-truncated Poisson Regression - The focus of this web page.
• Zero-truncated Negative Binomial Regression - If you have overdispersion in addition to zero truncation. See the Data Analysis Example for ztnb.
• Poisson Regression - Ordinary Poisson regression will have difficulty with zero-truncated data. It will try to predict zero counts even though there are no zero values.
• Negative Binomial Regression - Ordinary Negative Binomial regression will have difficulty with zero-truncated data. It will try to predict zero counts even though there are no zero values.
• OLS Regression - You could try to analyze these data using OLS regression. However, count data are highly non-normal and are not well estimated by OLS regression.

#### Zero-truncated Poisson regression using proc nlmixed

In order to use proc nlmixed to perform truncated Poisson regression, we must supply it with a likelihood function.

The probability that an observation has count $$y$$ under the Poisson distribution (without zero truncation) is given by the equation: $P(Y=y) = \frac{\lambda^re^{-\lambda}}{y!}$ With zero truncation, we calculate the probability that $$Y=y$$ conditional on $$Y>0$$, that is, that $$Y$$ is observed as 0 values are not observed. Thus: $P(Y=y|Y>0) = \frac{P(Y=y)}{P(Y>0)} = \frac{P(Y=y)}{1-P(Y=0)} = \frac{\lambda^re^{-\lambda}}{y!(1-e^{-\lambda})}$ The log-likelihood function for the zero-truncated Poisson distribution is then: $\mathcal{L}=\sum\limits_{i=1}^n[y_ilog(\lambda_i) - \lambda_i - log(1 - e^{-\lambda_i}) - log(y_i!)]$ In Poisson regression, we model $$log(\lambda)$$, the log of the expected counts, as a linear combination of a set of predictors: $log(\lambda) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3$ We supply the last two equations to proc nlmixed to model our data using a zero truncated Poisson distribution.  Additionally, proc nlmixed does not support a class statement, so categorical variables should be dummy-coded before running the analysis.

proc nlmixed data = mylib.ztp;
log_lambda = intercept + b_age*age + b_died*died + b_hmo*hmo;
lambda = exp(log_lambda);
ll = stay*log_lambda - lambda - log(1-exp(-lambda)) - lgamma(stay+1);
model stay ~ general(ll);
run;
The NLMIXED Procedure

Specifications

Data Set                                    MYLIB.ZTP
Dependent Variable                          stay
Distribution for Dependent Variable         General
Optimization Technique                      Dual Quasi-Newton
Integration Method                          None

Dimensions

Observations Used                   1493
Observations Not Used                  0
Total Observations                  1493
Parameters                             4

Parameters

intercept       b_age      b_died       b_hmo    NegLogLike

1           1           1           1    6176595.73

Iteration History

Iter     Calls    NegLogLike        Diff     MaxGrad       Slope

1         9    21118.0877     6155478    145098.1    -2.38E13
2        12    18578.3704    2539.717    137452.3     -436416
3        14    18437.2943    141.0761      120107    -13864.4
4        16    18394.0864    43.20793    118005.9    -3700.37
5        18     17809.511    584.5754    126017.3    -788.358
6        20    12597.6045    5211.906     71574.7    -383.804
7        21    7312.74337    5284.861    7488.696    -3740.01
8        23    7029.45115    283.2922     684.374    -459.944
9        25    6943.66162    85.78953    2049.627    -98.7188
10        27    6911.69968    31.96194    1391.058    -53.8698
11        29    6909.12237    2.577317    98.24019    -4.28966
12        31    6908.81335    0.309012    23.25771    -0.33228
13        33    6908.80066    0.012691    29.08574    -0.01855
14        35    6908.79908    0.001588    0.573832    -0.00203
15        37    6908.79907    3.373E-6    0.019132    -6.95E-6

NOTE: GCONV convergence criterion satisfied.

The SAS System           09:40 Tuesday, May 29, 2012   2

The NLMIXED Procedure

Fit Statistics

-2 Log Likelihood                  13818
AIC (smaller is better)            13826
AICC (smaller is better)           13826
BIC (smaller is better)            13847

Parameter Estimates

Standard
Parameter  Estimate     Error    DF  t Value  Pr > |t|   Alpha     Lower     Upper  Gradient

intercept    2.4358   0.02733  1493    89.12    <.0001    0.05    2.3822    2.4894  0.003988
b_age      -0.01444  0.005035  1493    -2.87    0.0042    0.05  -0.02432  -0.00457  0.019132
b_died      -0.2038   0.01837  1493   -11.09    <.0001    0.05   -0.2398   -0.1677  0.000506
b_hmo       -0.1359   0.02374  1493    -5.72    <.0001    0.05   -0.1825  -0.08933  -0.00231


The output looks very much like the output from an OLS regression:

• Towards the top is the iteration history, giving the values of the log pseudolikelihoods.
• The last value in the log (-6908.7990731) is the final value of the log pseudolikelihood for the full model.
• Next comes a number of fit statistics, which can be used to compare the fit of nested models.
• Below the fit statistics are the zero-truncated poisson coefficients for each of the variables along with standard errors, t-scores, and p-values.

Looking through the results we see the following:

• The value of the coefficient for age, -.01444, suggests that the log count of stay decreases by .01444 for each year increase in age. This coefficient is statistically significant.
• The coefficient for hmo, -.1359, is significant and indicates that the log count of stay for HMO patient is .1359 less than for non-HMO patients.
• The log count of stay for patients who died while in the hospital was .20377 less than those of patients who did not die.
• Finally, the value of the constant (intercept), 2..4358 is log count of the stay when age = 0, hmo = 0, and died = 0.

We can also use estimate statments to help understand our model. For example we can predict the expected number of days spent at the hospital across age groups for the two hmo statuses for patients who died. The estimate statement for proc nlmixed works slightly differently from how it works within other procs.  Here, each parameter must be explicitly multiplied by the value at which is to be held for that estimate statment.  Additionally, because we would like to predict actual number of days rather than log number of days, we need to exponentiate the estimate.

proc nlmixed data = mylib.ztp;
log_lambda = intercept + b_age*age + b_died*died + b_hmo*hmo;
lambda = exp(log_lambda);
ll = stay*log_lambda - lambda - log(1-exp(-lambda)) - lgamma(stay+1);
model stay ~ general(ll);
estimate 'age 1 died 1 hmo 0' exp(intercept * 1 + b_age * 1 + b_died * 1 + b_hmo * 0);
estimate 'age 1 died 1 hmo 1' exp(intercept * 1 + b_age * 1 + b_died * 1 + b_hmo * 1);
estimate 'age 3 died 1 hmo 0' exp(intercept * 1 + b_age * 3 + b_died * 1 + b_hmo * 0);
estimate 'age 3 died 1 hmo 1' exp(intercept * 1 + b_age * 3 + b_died * 1 + b_hmo * 1);
estimate 'age 5 died 1 hmo 0' exp(intercept * 1 + b_age * 5 + b_died * 1 + b_hmo * 0);
estimate 'age 5 died 1 hmo 1' exp(intercept * 1 + b_age * 5 + b_died * 1 + b_hmo * 1);
estimate 'age 7 died 1 hmo 0' exp(intercept * 1 + b_age * 7 + b_died * 1 + b_hmo * 0);
estimate 'age 7 died 1 hmo 1' exp(intercept * 1 + b_age * 7 + b_died * 1 + b_hmo * 1);
estimate 'age 9 died 1 hmo 0' exp(intercept * 1 + b_age * 9 + b_died * 1 + b_hmo * 0);
estimate 'age 9 died 1 hmo 1' exp(intercept * 1 + b_age * 9 + b_died * 1 + b_hmo * 1);
run;

<**SOME OUTPUT OMITTED**>

Standard
Label               Estimate     Error    DF  t Value  Pr > |t|   Alpha     Lower     Upper

age 1 died 1 hmo 0    9.1852    0.2548  1493    36.05    <.0001    0.05    8.6855    9.6849
age 1 died 1 hmo 1    8.0180    0.2736  1493    29.31    <.0001    0.05    7.4814    8.5546
age 3 died 1 hmo 0    8.9237    0.1805  1493    49.45    <.0001    0.05    8.5697    9.2777
age 3 died 1 hmo 1    7.7897    0.2213  1493    35.19    <.0001    0.05    7.3556    8.2239
age 5 died 1 hmo 0    8.6696    0.1375  1493    63.06    <.0001    0.05    8.4000    8.9393
age 5 died 1 hmo 1    7.5680    0.1934  1493    39.12    <.0001    0.05    7.1885    7.9474
age 7 died 1 hmo 0    8.4228    0.1451  1493    58.05    <.0001    0.05    8.1382    8.7074
age 7 died 1 hmo 1    7.3525    0.1948  1493    37.74    <.0001    0.05    6.9703    7.7347
age 9 died 1 hmo 0    8.1830    0.1910  1493    42.84    <.0001    0.05    7.8083    8.5577
age 9 died 1 hmo 1    7.1432    0.2206  1493    32.38    <.0001    0.05    6.7104    7.5759


We can see that the number of days spent tends to decrease as we move up age groups and that patients enrolled in an hmo (hmo = 1) tend to spend fewer days at the hospital as well than those not in hmos. For example, we expect that a non-hmo patient who died in age group 1 to stay for 9.1852 days whereas an hmo patient who died in age group 1 is expected to stay 8.0180 days.

It may be illustrative for us to plot the predicted number of days stayed as a function of age and hmo status.  To do this, we must tell SAS to save this table of predicted values as a dataset.  Tables and graphics produced by procedures are given names upon creation.  We will need the name of this prediction table to tell SAS to save it.  Place ods trace on and ods trace off statements around the procedure which produced this table to obtain its name.  Output from the ods trace statements is located in the log, not the output.

ods trace on;
proc nlmixed data = mylib.ztp;
log_lambda = intercept + b_age*age + b_died*died + b_hmo*hmo;
lambda = exp(log_lambda);
ll = stay*log_lambda - lambda - log(1-exp(-lambda)) - lgamma(stay+1);
model stay ~ general(ll);
estimate 'age 1 died 1 hmo 0' exp(intercept * 1 + b_age * 1 + b_died * 1 + b_hmo * 0);
estimate 'age 1 died 1 hmo 1' exp(intercept * 1 + b_age * 1 + b_died * 1 + b_hmo * 1);
estimate 'age 3 died 1 hmo 0' exp(intercept * 1 + b_age * 3 + b_died * 1 + b_hmo * 0);
estimate 'age 3 died 1 hmo 1' exp(intercept * 1 + b_age * 3 + b_died * 1 + b_hmo * 1);
estimate 'age 5 died 1 hmo 0' exp(intercept * 1 + b_age * 5 + b_died * 1 + b_hmo * 0);
estimate 'age 5 died 1 hmo 1' exp(intercept * 1 + b_age * 5 + b_died * 1 + b_hmo * 1);
estimate 'age 7 died 1 hmo 0' exp(intercept * 1 + b_age * 7 + b_died * 1 + b_hmo * 0);
estimate 'age 7 died 1 hmo 1' exp(intercept * 1 + b_age * 7 + b_died * 1 + b_hmo * 1);
estimate 'age 9 died 1 hmo 0' exp(intercept * 1 + b_age * 9 + b_died * 1 + b_hmo * 0);
estimate 'age 9 died 1 hmo 1' exp(intercept * 1 + b_age * 9 + b_died * 1 + b_hmo * 1);
run;
ods trace off;

<***SOME OF THE LOG OMITTED***>

-------------
-------------
NOTE: PROCEDURE NLMIXED used (Total process time):
real time           0.20 seconds
cpu time            0.12 seconds

140  ods trace off;


Towards the end of the log we find the name of this table, which as expected by its heading in the output above, is "AdditionalEstimates".  We can now tell SAS to save this output table as the dataset "mylib.addest" using an ods output statement.

ods output AdditionalEstimates = mylib.addest;
proc nlmixed data = mylib.ztp;
log_lambda = intercept + b_age*age + b_died*died + b_hmo*hmo;
lambda = exp(log_lambda);
ll = stay*log_lambda - lambda - log(1-exp(-lambda)) - lgamma(stay+1);
model stay ~ general(ll);
estimate 'age 1 died 1 hmo 0' exp(intercept * 1 + b_age * 1 + b_died * 1 + b_hmo * 0);
estimate 'age 1 died 1 hmo 1' exp(intercept * 1 + b_age * 1 + b_died * 1 + b_hmo * 1);
estimate 'age 3 died 1 hmo 0' exp(intercept * 1 + b_age * 3 + b_died * 1 + b_hmo * 0);
estimate 'age 3 died 1 hmo 1' exp(intercept * 1 + b_age * 3 + b_died * 1 + b_hmo * 1);
estimate 'age 5 died 1 hmo 0' exp(intercept * 1 + b_age * 5 + b_died * 1 + b_hmo * 0);
estimate 'age 5 died 1 hmo 1' exp(intercept * 1 + b_age * 5 + b_died * 1 + b_hmo * 1);
estimate 'age 7 died 1 hmo 0' exp(intercept * 1 + b_age * 7 + b_died * 1 + b_hmo * 0);
estimate 'age 7 died 1 hmo 1' exp(intercept * 1 + b_age * 7 + b_died * 1 + b_hmo * 1);
estimate 'age 9 died 1 hmo 0' exp(intercept * 1 + b_age * 9 + b_died * 1 + b_hmo * 0);
estimate 'age 9 died 1 hmo 1' exp(intercept * 1 + b_age * 9 + b_died * 1 + b_hmo * 1);
run;

Now we can use this predicted values for plotting. We need to add actual values of age and hmo to the dataset for plotting as well.

data mylib.addest;
input age hmo;
datalines;
1 0
1 1
3 0
3 1
5 0
5 1
7 0
7 1
9 0
9 1
;
run;

Finally, we use proc sgplot to plot our predicted number of days stayed as well as 95% confidence interval bands. The predicted values, lines connecting them, and confidence interval bands are all specified separately within the same proc sgplot.  The group option will produce separate points, lines, and bands by the grouping variable.

proc sgplot data = mylib.addest;
title 'Predicted number of days stayed (with 95% CL) by age and hmo status for patients who died';
band x = age lower = lower upper = upper / group=hmo;
scatter x= age y = estimate / group = hmo;
series x = age y = estimate / group = hmo;
run; 

#### Zero-truncated Poisson regression using proc fmm

As of SAS 9.3, you can use proc fmm to fit zero-truncated Poisson models.  You simply specify "dist = truncpoisson" in the model statement. Although compared to proc nlmixed, proc fmm provides a much easier way to specify a zero-truncated poisson regression, it has much more limited postestimation options.  For example, estimate and predict statements are not available with proc fmm.  Additionally, proc fmm like most other SAS procs, uses the last group within a categorical variable as the reference group.

proc fmm data = mylib.ztp;
class hmo died;
model stay = age hmo died / dist = truncpoisson;
run;

The FMM Procedure

Model Information

Data Set             MYLIB.ZTP
Response Variable    stay
Type of Model        Homogeneous Regression Mixture
Distribution         Truncated Poisson
Components           1
Estimation Method    Maximum Likelihood

Class Level Information

Class    Levels    Values

hmo           2    0 1
died          2    0 1

Number of Observations Used        1493

Optimization Information

Optimization Technique        Dual Quasi-Newton
Parameters in Optimization    4
Mean Function Parameters      4
Scale Parameters              0

Iteration History

Objective                         Max

0              5     7444.776354       .            4192.321
1              4    6918.2708098    526.50554412    359.7548
2              4    6913.2026998      5.06811003    327.6266
3              4    6913.0538752      0.14882461    320.7117
4              4    6910.8911491      2.16272613    145.0423
5              3    6908.8042839      2.08686512    9.443069
6              3    6908.7990735      0.00521050     0.07041
7              3    6908.7990731      0.00000040    0.000569

Convergence criterion (GCONV=1E-8) satisfied.

The SAS System          09:48 Thursday, May 24, 2012  14

The FMM Procedure

Fit Statistics

-2 Log Likelihood            13817.6
AIC  (smaller is better)     13825.6
AICC (smaller is better)     13825.6
BIC  (smaller is better)     13846.8
Pearson Statistic             9968.8

Parameter Estimates for 'Truncated Poisson' Model

Standard
Effect       hmo    died    Estimate       Error    z Value    Pr > |z|

Intercept                     2.0961     0.03766      55.65      <.0001
age                         -0.01444    0.005035      -2.87      0.0041
hmo          0                0.1359     0.02374       5.72      <.0001
hmo          1                     0           .        .         .
died                0         0.2038     0.01837      11.09      <.0001
died                1              0           .        .         .

#### Things to consider

• Count data often use exposure variable to indicate the number of times the event could have happened. You can incorporate exposure into your model by including a log-linear term for exposure in the log-likehood function specification.
• It is not recommended that zero-truncated poisson models be applied to small samples. What constitutes a small sample does not seem to be clearly defined in the literature.
• Pseudo-R-squared values differ from OLS R-squareds, please see FAQ: What are pseudo R-squareds? for a discussion on this issue.

#### References

• Cameron, A. Colin and Trivedi, P.K. (2009) Microeconometrics using stata. College Station, TX: Stata Press.
• Long, J. Scott, & Freese, Jeremy (2006). Regression Models for Categorical Dependent Variables Using Stata (Second Edition). College Station, TX: Stata Press.
• Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.