### Stata Library: Discrete Time Survival Analysis

Discrete time survival analysis treats time not as a continuous variable, but as being divided into discrete units or chunks. We can analyze discrete time data using logistic regression with an indicator variable for each of the time periods. We will illustrate discrete time survival analysis using the cancer.dta dataset.

#### Cancer Example

After reading in the dataset, we will describe the variables, tabulate distime, summarize age, and list several variables for patients 5, 10 and 20.
use http://www.ats.ucla.edu/stat/stata/library/cancer

describe

Contains data from cancer.dta
obs:            48                          Patient Survival in Drug Trial
vars:             7                          2 Jan 1904 13:58
size:         1,248 (99.1% of memory free)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g
studytime       int    %8.0g                  Months to death or end of exp.
died            int    %8.0g                  1 if patient died
drug            float  %9.0g
age             int    %8.0g                  Patient's age at start of exp.
distime         float  %9.0g
censor          float  %9.0g
-------------------------------------------------------------------------------

tab distime

distime |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |         11       22.92       22.92
2 |         13       27.08       50.00
3 |          6       12.50       62.50
4 |          8       16.67       79.17
5 |          4        8.33       87.50
6 |          6       12.50      100.00
------------+-----------------------------------
Total |         48      100.00

univar age
-------------- Quantiles --------------
Variable       n     Mean     S.D.      Min      .25      Mdn      .75      Max
-------------------------------------------------------------------------------
age      48    55.88     5.66    47.00    50.50    56.00    60.00    67.00
-------------------------------------------------------------------------------

list distime drug age died censor if id==5

distime       drug       age      died     censor
5.         1          0        56         1          0

list distime drug age died censor if id==10

distime       drug       age      died     censor
10.         2          0        58         0          1

list distime drug age died censor if id==20

distime       drug       age      died     censor
20.         4          0        52         1          0
Patient 5 (56 years old, did not receive a drug treatment) was observed for one time period and died, so the observation for this patient was not censored. Patient 10 (58, no drug) was observed for two time periods and did not die, i.e., the observation was censored. Finally, patient 20 (52, no drug) was observed for four time periods and died (not censored).

In this dataset there is one observation for each patient. In order to do discrete time survival analysis we need to have one observation per patient for each time period in which that patient was observed. For patients that die we need a response variable that is zero until the last time period, when it is coded one. For patients that don't die the response variable will be zero for every observation.
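
The restructuring described above can be sketched outside Stata. The following is a minimal Python illustration, not the code of any Stata command. The function name `person_period` and the field names `period`, `y` and `d1` through `d6` are chosen here for illustration only. The three patient records and the frequency table are copied from the listings above.

```python
def person_period(patients, n_periods=6):
    """Expand one record per patient into one record per patient-period."""
    rows = []
    for p in patients:
        for period in range(1, p["distime"] + 1):
            is_last = period == p["distime"]
            row = {
                "id": p["id"],
                "period": period,
                # Response is 1 only in the final period of a patient who died.
                "y": 1 if (is_last and p["censor"] == 0) else 0,
            }
            # One indicator (dummy) variable per time period.
            for j in range(1, n_periods + 1):
                row[f"d{j}"] = 1 if j == period else 0
            rows.append(row)
    return rows


# Patients 5, 10 and 20 as listed above (distime and censor only).
patients = [
    {"id": 5, "distime": 1, "censor": 0},
    {"id": 10, "distime": 2, "censor": 1},
    {"id": 20, "distime": 4, "censor": 0},
]

for row in person_period(patients):
    print(row["id"], row["period"], row["y"])

# Frequency table of distime from above: each patient contributes
# distime rows, so the expanded dataset has sum(distime) rows.
distime_freq = {1: 11, 2: 13, 3: 6, 4: 8, 5: 4, 6: 6}
total_rows = sum(periods * count for periods, count in distime_freq.items())
print(total_rows)
```

Each patient contributes as many rows as the number of periods observed, so the 48 patients expand to 143 rows in total.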

A collection of Stata commands written by Alexis Dinno (Harvard School of Public Health) will help us with the analysis. You can download this family of commands from within Stata by typing findit dthaz (see "How can I use the findit command to search for programs and get additional help?" for more information about using findit).

The command that we are interested in is prsnperd, which creates the type of dataset that we want. prsnperd takes a variable that indicates whether the observation is censored, which in our dataset is the variable censor. It creates the following variables: _period, which is the time period; _Y, which is the response variable; and _d1 through _d6, which are the dummy coded time periods. Here is what it looks like.

prsnperd id distime censor

list id _period _Y if id==5

id    _period         _Y
5.         5          1          1

list id _period _Y if id==10

id    _period         _Y
11.        10          1          0
12.        10          2          0

list id _period _Y if id==20

id    _period         _Y
35.        20          1          0
36.        20          2          0
37.        20          3          0
38.        20          4          1
Now we can do the discrete time survival analysis using the logit command. We will run logit with and without the cluster option, using the nocons option both times. The nocons option is used so that the dummy variables for all six time periods can be included.
logit _Y drug age _d1-_d6, cluster(id) nocons

Logit estimates                                   Number of obs   =        143
Wald chi2(8)    =      45.39
Log likelihood =  -55.65503                       Prob > chi2     =     0.0000

(standard errors adjusted for clustering on id)
------------------------------------------------------------------------------
|               Robust
_Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
drug |  -3.024052   .6859866    -4.41   0.000    -4.368561   -1.679543
age |   .1607128   .0497324     3.23   0.001      .063239    .2581866
_d1 |  -9.309867   2.754574    -3.38   0.001    -14.70873   -3.911001
_d2 |  -8.335442   2.641359    -3.16   0.002    -13.51241   -3.158473
_d3 |  -8.326742   2.533321    -3.29   0.001    -13.29196   -3.361525
_d4 |  -7.071942   2.564526    -2.76   0.006    -12.09832   -2.045564
_d5 |   -7.19799   2.490689    -2.89   0.004    -12.07965   -2.316328
_d6 |  -7.622593   2.722941    -2.80   0.005    -12.95946   -2.285726
------------------------------------------------------------------------------
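
For reference, the model that this logit command fits can be written out. The notation below is an assumption introduced here and is not part of the Stata output: h(t_j) is the conditional probability of dying in period j given survival to the start of that period, D_j stands for the dummy variable _dj, and the alpha and beta symbols stand for the reported coefficients.

$$
\operatorname{logit} h(t_j) = \ln\frac{h(t_j)}{1 - h(t_j)} = \alpha_1 D_1 + \alpha_2 D_2 + \cdots + \alpha_6 D_6 + \beta_1\,\text{drug} + \beta_2\,\text{age}
$$

Exactly one of the six dummies equals one in every row, so they sum to a column of ones. A constant term would therefore be collinear with them, which is why nocons is needed to keep all six. Each alpha_j is the log odds of death in period j for a patient with drug=0 and age=0.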

logit _Y drug age _d1-_d6, nocons

Logit estimates                                   Number of obs   =        143
LR chi2(8)      =          .
Log likelihood =  -55.65503                       Prob > chi2     =          .

------------------------------------------------------------------------------
_Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
drug |  -3.024052   .6347086    -4.76   0.000    -4.268058   -1.780046
age |   .1607128    .051414     3.13   0.002     .0599433    .2614823
_d1 |  -9.309867   2.922645    -3.19   0.001    -15.03815   -3.581589
_d2 |  -8.335442   2.780394    -3.00   0.003    -13.78491   -2.885969
_d3 |  -8.326742   2.823744    -2.95   0.003    -13.86118   -2.792306
_d4 |  -7.071942   2.734906    -2.59   0.010    -12.43226   -1.711624
_d5 |   -7.19799   2.811519    -2.56   0.010    -12.70847   -1.687513
_d6 |  -7.622593   2.988678    -2.55   0.011    -13.48029   -1.764892
------------------------------------------------------------------------------

logit, or

Logit estimates                                   Number of obs   =        143
LR chi2(8)      =          .
Log likelihood =  -55.65503                       Prob > chi2     =          .

------------------------------------------------------------------------------
_Y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
drug |   .0486039   .0308493    -4.76   0.000      .014009    .1686304
age |   1.174348   .0603779     3.13   0.002     1.061776    1.298854
_d1 |   .0000905   .0002646    -3.19   0.001     2.94e-07    .0278315
_d2 |   .0002399   .0006669    -3.00   0.003     1.03e-06    .0558007
_d3 |    .000242   .0006832    -2.95   0.003     9.55e-07    .0612797
_d4 |   .0008486   .0023208    -2.59   0.010     3.99e-06    .1805723
_d5 |   .0007481   .0021033    -2.56   0.010     3.03e-06     .184979
_d6 |   .0004893   .0014623    -2.55   0.011     1.40e-06    .1712052
------------------------------------------------------------------------------
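
The odds ratio table is the coefficient table on the exponential scale. As a quick check, the Python sketch below exponentiates the non-clustered coefficients and their confidence limits for drug and age. The function name `odds_ratio` is hypothetical, and the use of the normal quantile 1.959964 for the 95% interval is an assumption that is consistent with the limits printed above.

```python
import math

# Coefficients and standard errors copied from the non-clustered logit output.
estimates = {
    "drug": (-3.024052, 0.6347086),
    "age": (0.1607128, 0.051414),
}

Z_95 = 1.959964  # two-sided 95% normal quantile (assumed)


def odds_ratio(coef, se, z=Z_95):
    """Return (odds ratio, lower limit, upper limit) for one coefficient."""
    return (
        math.exp(coef),
        math.exp(coef - z * se),
        math.exp(coef + z * se),
    )


for name, (coef, se) in estimates.items():
    ratio, lower, upper = odds_ratio(coef, se)
    print(f"{name}: OR={ratio:.4f}  95% CI=({lower:.4f}, {upper:.4f})")
```

An odds ratio below one (drug) means lower odds of death in any given period, and an odds ratio above one (age) means higher odds for each additional year of age.
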
Both drug and age are significant: older patients are more likely to die and those on drug therapy are less likely to die. It is useful to look at the hazard function (and survival function) to ascertain the effects over time. The dthaz command (from Dinno) will produce a table with hazard and survival values for each time period. We will specify the function for drug=1 (drug therapy) and age=56 (the median age).
dthaz drug age, specify(1 56)

Discrete-Time Estimation of Conditional Hazard and Survival Probabilities
------------------------------------------------------------------------------
Time Parameterization: Fully Discrete

drug = 1
age = 56

-----------------------------------------
Period      p(Hazard)   p(Survival)
(T_j)       ^H(T_j)     ^S(T_j)
-----------------------------------------
0            --        1
1          0.0344      0.9656
2          0.0863      0.8822
3          0.0870      0.8055
4          0.2505      0.6037
5          0.2276      0.4663
6          0.1616      0.3910
-----------------------------------------
Logit Link (assumes proportional odds)
Notice that the hazard peaks in time period four (0.2505) and then declines.
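
The hazard and survival columns can be recovered by hand from the logit coefficients reported earlier. The Python sketch below is an illustration, not the dthaz implementation. It applies the inverse logit to the linear predictor for each period, then multiplies the complements of the hazards to obtain survival. The function name `hazard_and_survival` is hypothetical.

```python
import math

# Coefficients copied from the logit output above.
B_DRUG = -3.024052
B_AGE = 0.1607128
# Period effects _d1 through _d6.
ALPHAS = [-9.309867, -8.335442, -8.326742, -7.071942, -7.19799, -7.622593]


def hazard_and_survival(drug, age):
    """Return a list of (period, hazard, survival) for given covariates."""
    rows = []
    survival = 1.0
    for period, alpha in enumerate(ALPHAS, start=1):
        xb = alpha + B_DRUG * drug + B_AGE * age
        hazard = 1.0 / (1.0 + math.exp(-xb))  # inverse logit
        survival *= 1.0 - hazard  # survive this period given survival so far
        rows.append((period, hazard, survival))
    return rows


for period, hazard, survival in hazard_and_survival(drug=1, age=56):
    print(f"{period}  {hazard:.4f}  {survival:.4f}")
```

Up to rounding in the last digit, the values agree with the dthaz table for drug=1 and age=56. Calling the function with drug=0 gives the corresponding curve for untreated patients of the same age.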
