Stata Data Analysis Examples
Interval Regression

Version info: Code for this page was tested in Stata 12.

Interval regression is used to model outcomes that have interval censoring.  In other words, you know the ordered category into which each observation falls, but you do not know the exact value of the observation.  Interval regression is a generalization of censored regression.

Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover every aspect of the research process that researchers are expected to carry out. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up analyses.

Examples of interval regression

Example 1. We wish to model annual income using years of education and marital status. However, we do not have access to the precise values for income. Rather, we only have data on the income ranges: <$15,000, $15,000-$25,000, $25,000-$50,000, $50,000-$75,000, $75,000-$100,000, and >$100,000. Note that the categories at either end of the range are open-ended: the lowest (<$15,000) is left-censored and the highest (>$100,000) is right-censored. The other categories are interval-censored, that is, each is bounded on both the left and the right. Analyses of this type require a generalization of censored regression known as interval regression.
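As a rough illustration of how the brackets in Example 1 map onto censoring types, here is a minimal Python sketch (outside Stata). It stores each bracket as a (lower, upper) pair and uses None for an unknown bound, which mirrors the way Stata's intreg expects a missing value in the lower-limit variable for left-censored observations and in the upper-limit variable for right-censored ones. The names BRACKETS and censoring_type are hypothetical, chosen only for this example.

```python
# Income brackets from Example 1, in the order listed in the text.
# None marks a bound that is not known (an open-ended category).
BRACKETS = [
    (None, 15000),      # <$15,000
    (15000, 25000),
    (25000, 50000),
    (50000, 75000),
    (75000, 100000),
    (100000, None),     # >$100,000
]


def censoring_type(lower, upper):
    """Classify one observation by which of its bounds are known."""
    if lower is None and upper is None:
        return "missing"            # nothing is known about the outcome
    if lower is None:
        return "left-censored"      # only an upper bound is known
    if upper is None:
        return "right-censored"     # only a lower bound is known
    if lower == upper:
        return "uncensored"         # the exact value is known
    return "interval-censored"      # both bounds known, value in between


labels = [censoring_type(lo, hi) for lo, hi in BRACKETS]
for (lo, hi), label in zip(BRACKETS, labels):
    print(lo, hi, label)
```

The same classification applies to the GPA categories in the later examples: a category with both limits stated is interval-censored, while one that is open at either end is left- or right-censored.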

Example 2.  We wish to predict GPA from teacher ratings of effort and from reading and writing test scores.  The measure of GPA is a self-report response to the following item:

Select the category that best represents your overall GPA.
  less than 2.0
  2.0 to 2.5
  2.5 to 3.0
  3.0 to 3.4
  3.4 to 3.8
  3.8 to 3.9
  4.0 or greater
Again, we have a situation with both interval censoring and left- and right-censoring.  We do not know the exact value of GPA for each student; we only know the interval in which their GPA falls.

Example 3. We wish to predict GPA from teacher ratings of effort, writing test scores and the type of program in which the student was enrolled (vocational, general or academic).  The measure of GPA is a self-report response to the following item:

Select the category that best represents your overall GPA.
  0.0 to 2.0
  2.0 to 2.5
  2.5 to 3.0
  3.0 to 3.4
  3.4 to 3.8
  3.8 to 4.0
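This is a slight variation of Example 2. In this example, there is only interval censoring, because every category, including the lowest (0.0 to 2.0) and the highest (3.8 to 4.0), has both a lower and an upper limit.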

Description of the data

Let's pursue Example 3 from above.

We have a hypothetical data file, intreg_data.dta, with 30 observations. The GPA score is represented by two values, the lower interval score (lgpa) and the upper interval score (ugpa). The writing test scores, the teacher rating and the type of program (a nominal variable which has three levels) are write, rating and type, respectively.

Let's look at the data.  It is always a good idea to start with descriptive statistics.

Note that there are two GPA responses for each observation, lgpa for the lower end of the interval and ugpa for the upper end. Graphing these data can be rather tricky.  Just to get an idea of what the distribution of GPA is, we will do separate histograms for lgpa and ugpa.  We will also correlate the variables in the dataset.

Analysis methods you might consider

Below is a list of some analysis methods you may have encountered.  Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations. 

Interval regression

We will use the intreg command to run the interval regression analysis. The intreg command requires two outcome variables, given in this order: the lower limit of the interval (here lgpa) and the upper limit of the interval (here ugpa). The i. before type indicates that it is a factor variable (i.e., categorical variable), and that it should be included in the model as a series of indicator variables. Note that this syntax was introduced in Stata 11.
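To make concrete what an interval regression maximizes, the following Python sketch (outside Stata) computes the log-likelihood contribution of a single observation under the usual assumption of normally distributed errors: the log of the probability that the latent outcome falls between the lower and upper limits. The function names and the numeric values in the usage lines (a predicted mean of 2.85 and a residual standard deviation of 0.3) are hypothetical and are not taken from the fitted model.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_loglik(lower, upper, mu, sigma):
    """Log-likelihood of one interval-censored observation.

    lower / upper: interval limits; None means the interval is open
                   on that side (left- or right-censored).
    mu:            linear prediction for this observation.
    sigma:         residual standard deviation.
    """
    p_upper = 1.0 if upper is None else normal_cdf((upper - mu) / sigma)
    p_lower = 0.0 if lower is None else normal_cdf((lower - mu) / sigma)
    return math.log(p_upper - p_lower)


# Hypothetical usage: a student who reported "2.5 to 3.0".
contribution = interval_loglik(2.5, 3.0, mu=2.85, sigma=0.3)
print(contribution)
```

Summing these contributions over all observations gives the log likelihood that is maximized with respect to the regression coefficients and sigma.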

contrast type


Contrasts of marginal linear predictions

Margins      : asbalanced

------------------------------------------------
             |         df        chi2     P>chi2
-------------+----------------------------------
model        |
        type |          2       18.71     0.0001
------------------------------------------------
The two-degree-of-freedom chi-square test (chi2 = 18.71, p = 0.0001) indicates that type, taken as a whole, is a statistically significant predictor of lgpa and ugpa.

We can use the margins command to obtain the expected cell means. Note that these are different from the means we obtained with the tabstat command above, because they are also adjusted for write and rating.

margins type

Predictive margins                                Number of obs   =         30
Model VCE    : OIM

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        type |
          1  |   2.471456     .13236    18.67   0.000     2.212035    2.730877
          2  |   2.846309    .118957    23.93   0.000     2.613157     3.07946
          3  |   3.181203   .0969802    32.80   0.000     2.991125     3.37128
------------------------------------------------------------------------------

The expected mean GPA for students in program type 1 (vocational) is 2.47; for students in program type 2 (general) it is 2.85; and for students in program type 3 (academic) it is 3.18.

If you would like to compare interval regression models, you can issue the estat ic command to get the log likelihood, AIC and BIC values.

estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+---------------------------------------------------------------
           . |     30   -51.74729   -33.12891      6     78.25781    86.66499
-----------------------------------------------------------------------------
               Note:  N=Obs used in calculating BIC; see [R] BIC note
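As a check on how the information criteria relate to the log likelihood, this short Python sketch (outside Stata) recomputes AIC and BIC from the ll(model), df and Obs values shown in the table above, using the standard definitions AIC = -2*ll + 2*df and BIC = -2*ll + df*ln(N). The variable names are chosen for this example only.

```python
import math

# Values copied from the estat ic table above.
ll_model = -33.12891   # log likelihood of the fitted model
df = 6                 # number of estimated parameters
n_obs = 30             # observations used in the BIC calculation

aic = -2.0 * ll_model + 2.0 * df
bic = -2.0 * ll_model + df * math.log(n_obs)

print(round(aic, 4), round(bic, 4))
```

The results should agree with the AIC and BIC columns up to the rounding of the displayed log likelihood. When comparing models fit to the same data, smaller values indicate a better trade-off between fit and complexity.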

The intreg command does not compute an R² or pseudo-R². You can compute an approximate measure of fit by calculating the R², that is, the squared correlation, between the predicted and observed values.
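The approximate measure of fit described here is simply a squared Pearson correlation. The Python sketch below (outside Stata) shows the calculation; because the observed outcome is recorded as two limits, it would presumably be computed once against the lower limits and once against the upper limits. The function name and the data values are hypothetical and are not taken from intreg_data.dta.

```python
def squared_correlation(predicted, observed):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov * cov / (var_p * var_o)


# Hypothetical values, for illustration only.
predicted = [2.4, 2.9, 3.1, 3.3]       # model predictions
lower_limits = [2.0, 2.5, 3.0, 3.4]    # lower interval limits
upper_limits = [2.5, 3.0, 3.4, 3.8]    # upper interval limits

r2_lower = squared_correlation(predicted, lower_limits)
r2_upper = squared_correlation(predicted, upper_limits)
print(r2_lower, r2_upper)
```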

The calculated values of approximately .56 and .71 are probably close to what you would find in an OLS regression if you had actual GPA scores. You can also use the Long and Freese utility command fitstat (findit spostado), which provides a number of pseudo-R²s in addition to other measures of fit; for more information about using findit, see "How can I use the findit command to search for programs and get additional help?". The Cox-Snell pseudo-R², which is based on the ratio of the intercept-only likelihood to the full-model likelihood and so reflects the improvement of the full model over the intercept-only model, is close to our approximate estimates above.
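The Cox-Snell pseudo-R² can also be reproduced by hand from the two log likelihoods reported by estat ic. The Python sketch below (outside Stata) uses the standard formula 1 - (L0/L1)^(2/N), written in terms of log likelihoods; it is an illustration of the formula and not the output of fitstat itself.

```python
import math

# Values copied from the estat ic table above.
ll_null = -51.74729    # log likelihood of the intercept-only model
ll_model = -33.12891   # log likelihood of the full model
n_obs = 30

# Cox-Snell: 1 - (L0 / L1)^(2/N) = 1 - exp(2 * (ll_null - ll_model) / N)
cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n_obs)
print(round(cox_snell, 3))
```

This gives a value of about .71, in line with the larger of the two approximate estimates mentioned above.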

