Example 1. We wish to model annual income using years of education and marital status. However, we do not have access to the precise values for income; we only have data on the income ranges: <$15,000, $15,000-$25,000, $25,000-$50,000, $50,000-$75,000, $75,000-$100,000, and >$100,000. Note that the categories at either end of the range are censored on one side only: the lowest category is left-censored and the highest is right-censored. The other categories are interval-censored, that is, each interval is both left- and right-censored. Analyses of this type require a generalization of censored regression known as interval regression.
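To make the censoring in Example 1 concrete, here is a minimal sketch of how the six income ranges could be turned into a pair of bound variables for Stata's intreg, which treats a missing lower bound as left-censored and a missing upper bound as right-censored. The variable names inccat, linc and uinc, and the 1-6 coding of the categories, are assumptions made for illustration; they do not come from a real data file.

```stata
* Hypothetical data: inccat codes the six income ranges as 1-6
* (an assumed coding, not a variable from the original example)
clear
input byte inccat
1
2
3
4
5
6
end

* Lower bound: left at missing (.) for the left-censored category
generate double linc = .
replace linc = 15000  if inccat == 2
replace linc = 25000  if inccat == 3
replace linc = 50000  if inccat == 4
replace linc = 75000  if inccat == 5
replace linc = 100000 if inccat == 6

* Upper bound: left at missing (.) for the right-censored category
generate double uinc = .
replace uinc = 15000  if inccat == 1
replace uinc = 25000  if inccat == 2
replace uinc = 50000  if inccat == 3
replace uinc = 75000  if inccat == 4
replace uinc = 100000 if inccat == 5

list inccat linc uinc, clean
```

With real data the model would then be fit with the two bounds in place of a single outcome, for example intreg linc uinc followed by the education and marital status predictors.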
Example 2. We wish to predict GPA from teacher ratings of effort and from reading and writing test scores. The measure of GPA is a self-report response to the following item:
Select the category that best represents your overall GPA.
    less than 2.0
    2.0 to 2.5
    2.5 to 3.0
    3.0 to 3.4
    3.4 to 3.8
    3.8 to 3.9
    4.0 or greater
Again, we have a situation with both interval censoring and left- and right-censoring. We do not know the exact value of GPA for each student; we only know the interval in which their GPA falls.
Example 3. We wish to predict GPA from teacher ratings of effort and from reading and writing test scores. The measure of GPA is a self-report response to the following item:
Select the category that best represents your overall GPA.
    0.0 to 2.0
    2.0 to 2.5
    2.5 to 3.0
    3.0 to 3.4
    3.4 to 3.8
    3.8 to 4.0
This is a slight variation of Example 2, in which there is only interval censoring.
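The three examples differ only in which kinds of censoring occur, so one likelihood covers all of them. As a sketch in standard notation (the symbols are chosen here for illustration and do not appear in the original text), let the latent outcome be y*, with lower and upper bounds l and u, linear predictor x'β, normal error standard deviation σ, and Φ the standard normal distribution function:

```latex
\begin{align*}
y_i^* &= x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \\
\ln L &= \sum_{i \in \text{interval}}
           \ln\left[ \Phi\!\left(\frac{u_i - x_i'\beta}{\sigma}\right)
                   - \Phi\!\left(\frac{l_i - x_i'\beta}{\sigma}\right) \right] \\
      &\quad + \sum_{i \in \text{left-censored}}
           \ln \Phi\!\left(\frac{u_i - x_i'\beta}{\sigma}\right)
        + \sum_{i \in \text{right-censored}}
           \ln\left[ 1 - \Phi\!\left(\frac{l_i - x_i'\beta}{\sigma}\right) \right]
\end{align*}
```

In Example 3 only the first sum is present, because every response falls in a closed interval.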
We have a hypothetical data file, intregex.dta, with 30 observations. The GPA score is represented by two values, the lower interval score (lgpa) and the upper interval score (ugpa). The reading and writing test scores and the teacher rating are read, write and rating, respectively.
Let's look at the data.
use http://www.ats.ucla.edu/stat/stata/dae/intregex, clear
list lgpa ugpa, clean
lgpa ugpa
1. 2.5 3
2. 3.4 3.8
3. 2.5 3
4. 0 2
5. 3 3.4
6. 3.4 3.8
7. 3.8 4
8. 2 2.5
9. 3 3.4
10. 3.4 3.8
11. 2 2.5
12. 2 2.5
13. 2 2.5
14. 2.5 3
15. 2.5 3
16. 2.5 3
17. 3.4 3.8
18. 2.5 3
19. 2 2.5
20. 3 3.4
21. 3.4 3.8
22. 3.8 4
23. 2 2.5
24. 3 3.4
25. 3.4 3.8
26. 2 2.5
27. 2 2.5
28. 2 2.5
29. 2.5 3
30. 2.5 3
Note that there are two GPA responses for each observation, lgpa for the lower end of the
interval and ugpa for the upper end.
summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
id | 30 15.5 8.803408 1 30
lgpa | 30 2.6 .7754865 0 3.8
ugpa | 30 3.096667 .5708332 2 4
write | 30 113.8333 49.94278 50 205
rating | 30 57.53333 8.303441 48 72
read | 30 171.6667 94.39767 50 350
Graphing these data can be rather tricky. To get an idea of the distribution of
GPA, we will draw separate histograms for lgpa and ugpa.
histogram lgpa, normal
histogram ugpa, normal
correlate lgpa ugpa write rating read
(obs=30)
| lgpa ugpa write rating read
-------------+---------------------------------------------
lgpa | 1.0000
ugpa | 0.9488 1.0000
write | 0.6206 0.6572 1.0000
rating | 0.5355 0.5904 0.4763 1.0000
read | 0.5747 0.5997 0.3182 0.4489 1.0000
graph matrix write rating read lgpa ugpa, half jitter(2)
intreg lgpa ugpa write rating read
Fitting constant-only model:
Iteration 0: log likelihood = -52.129849
Iteration 1: log likelihood = -51.74803
Iteration 2: log likelihood = -51.747288
Iteration 3: log likelihood = -51.747288
Fitting full model:
Iteration 0: log likelihood = -38.212102
Iteration 1: log likelihood = -36.680551
Iteration 2: log likelihood = -36.662189
Iteration 3: log likelihood = -36.662185
Iteration 4: log likelihood = -36.662185
Interval regression Number of obs = 30
LR chi2(3) = 30.17
Log likelihood = -36.662185 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write | .0052829 .0015363 3.44 0.001 .0022718 .0082939
rating | .016789 .009751 1.72 0.085 -.0023226 .0359005
read | .002329 .0008046 2.89 0.004 .000752 .003906
_cons | .9133711 .4794007 1.91 0.057 -.026237 1.852979
-------------+----------------------------------------------------------------
/lnsigma | -1.090882 .1516747 -7.19 0.000 -1.388159 -.7936051
-------------+----------------------------------------------------------------
sigma | .3359201 .0509506 .2495343 .4522116
------------------------------------------------------------------------------
Observation summary: 0 left-censored observations
0 uncensored observations
0 right-censored observations
30 interval observations
Just to be a bit more conservative we will run the model again using the robust
option to obtain the Huber-White robust standard errors.
intreg lgpa ugpa write rating read, robust
Fitting constant-only model:
Iteration 0: log pseudolikelihood = -52.129849
Iteration 1: log pseudolikelihood = -51.74803
Iteration 2: log pseudolikelihood = -51.747288
Iteration 3: log pseudolikelihood = -51.747288
Fitting full model:
Iteration 0: log pseudolikelihood = -38.212102
Iteration 1: log pseudolikelihood = -36.680551
Iteration 2: log pseudolikelihood = -36.662189
Iteration 3: log pseudolikelihood = -36.662185
Iteration 4: log pseudolikelihood = -36.662185
Interval regression Number of obs = 30
Wald chi2(3) = 54.85
Log pseudolikelihood = -36.662185 Prob > chi2 = 0.0000
------------------------------------------------------------------------------
| Robust
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write | .0052829 .0014818 3.57 0.000 .0023786 .0081872
rating | .016789 .0106434 1.58 0.115 -.0040717 .0376497
read | .002329 .0010766 2.16 0.031 .0002189 .004439
_cons | .9133711 .4883437 1.87 0.061 -.043765 1.870507
-------------+----------------------------------------------------------------
/lnsigma | -1.090882 .1267837 -8.60 0.000 -1.339373 -.8423906
-------------+----------------------------------------------------------------
sigma | .3359201 .0425892 .2620098 .4306797
------------------------------------------------------------------------------
Observation summary: 0 left-censored observations
0 uncensored observations
0 right-censored observations
30 interval observations
The output looks very much like the output from an OLS regression. At the top, the number of
observations used in the analysis is given along with a chi-squared statistic: a likelihood-ratio
chi-squared in the first model and a Wald chi-squared when the robust option is used. The
likelihood-ratio chi-squared tests the difference between the full model (with predictors)
and the constant-only model. Below that is the p-value for the chi-squared with three
degrees of freedom. With p < 0.0001, this model, as a whole, is statistically significant.
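As a quick check, the likelihood-ratio statistic can be reproduced by hand from the two final log likelihoods in the first iteration log. The sketch below uses only Stata's display and chi2tail(); the scalar names ll_0, ll_full and lr are illustrative and are not part of the original analysis.

```stata
* Final log likelihoods copied from the first intreg iteration log
scalar ll_0    = -51.747288
scalar ll_full = -36.662185

* LR statistic: twice the gain in log likelihood over the constant-only model
scalar lr = 2*(ll_full - ll_0)
display "LR chi2(3)  = " %6.2f lr

* Upper-tail probability of a chi-squared with 3 degrees of freedom
scalar p_lr = chi2tail(3, lr)
display "Prob > chi2 = " %6.4f p_lr
```

The degrees of freedom equal the number of predictors added to the constant-only model (write, rating and read).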
Next comes the log-likelihood for the full model.
The intreg command does not compute an R2 or pseudo-R2. You can compute a rough-and-ready measure of fit by squaring the correlation between the predicted and observed values.
predict p
correlate lgpa ugpa p
(obs=30)
| lgpa ugpa p
-------------+---------------------------
lgpa | 1.0000
ugpa | 0.9488 1.0000
p | 0.7494 0.7963 1.0000
display .7494^2
.56160036
display .7963^2
.63409369
The calculated values of approximately .60 are probably close to what you would find in an OLS
regression if you had actual GPA scores.
You can also use the Long and Freese utility command fitstat (findit spostado),
which provides a number of pseudo-R2s in addition to other measures of fit. The
McKelvey-Zavoina pseudo-R2, which computes the R2 using the variances of
the latent variable and the latent predicted variable [ Var(predicted-y*)/Var(y*) ],
is close to our rough-and-ready estimates above.
fitstat
Measures of Fit for intreg of lgpa ugpa
Log-Lik Intercept Only: -51.747 Log-Lik Full Model: -36.662
D(25): 73.324 LR(3): 30.170
Prob > LR: 0.000
McFadden's R2: 0.292 McFadden's Adj R2: 0.195
ML (Cox-Snell) R2: 0.634 Cragg-Uhler(Nagelkerke) R2: 0.655
McKelvey & Zavoina's R2: 0.677
Variance of y*: 0.350 Variance of error: 0.113
AIC: 2.777 AIC*n: 83.324
BIC: -11.706 BIC': -19.967
BIC used by Stata: 90.330 AIC used by Stata: 83.324
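Several of these measures can be recovered by hand from numbers already shown, which helps clarify what each one means. The sketch below applies the usual textbook formulas to values copied from the output above; it is not a description of how fitstat is implemented, and the scalar names are illustrative. The parameter count k = 5 (three slopes, the constant and lnsigma) is inferred from D(25) with 30 observations.

```stata
* Inputs copied from the intreg and fitstat output above
scalar ll_0      = -51.747288
scalar ll_full   = -36.662185
scalar n         = 30
scalar k         = 5
scalar sigma     = .3359201
* "Variance of y*" is only reported to three decimals
scalar var_ystar = 0.350

scalar lr       = 2*(ll_full - ll_0)
scalar r2_mcf   = 1 - ll_full/ll_0
scalar r2_mcfa  = 1 - (ll_full - k)/ll_0
scalar r2_cs    = 1 - exp(-lr/n)
scalar r2_cu    = r2_cs/(1 - exp(2*ll_0/n))
* Explained variance = Var(y*) minus the error variance sigma^2
scalar r2_mz    = (var_ystar - sigma^2)/var_ystar
scalar aic_n    = -2*ll_full + 2*k

display "McFadden R2        = " %5.3f r2_mcf
display "McFadden adj R2    = " %5.3f r2_mcfa
display "Cox-Snell R2       = " %5.3f r2_cs
display "Cragg-Uhler R2     = " %5.3f r2_cu
display "McKelvey-Zavoina   = " %4.2f r2_mz
display "AIC*n              = " %6.3f aic_n
```

Because the variance of y* is rounded in the output, the McKelvey-Zavoina value computed this way agrees with fitstat only to about two decimals.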
In the main body of the table we have the interval regression coefficients, the standard error
of the coefficients, a Wald z-test (coefficient/se) and the p-value associated with each z-test.
By default, we also get a 95% confidence interval for the coefficients. With the level()
option you can request a different confidence interval.
The ancillary statistic sigma is equivalent to the standard error of estimate in OLS regression. The value of 0.34 can be compared to the standard deviations for lgpa and ugpa of 0.78 and 0.57. This shows a substantial reduction. The output also contains an estimate of the standard error of sigma as well as a 95% confidence interval. Stata does not estimate sigma directly; it estimates the log of sigma (/lnsigma in the output) and transforms the result.
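The relationship between /lnsigma and sigma can be checked with a few lines of arithmetic. The sketch below uses the values from the first (non-robust) output; the scalar names are illustrative. The delta-method approximation for the standard error and the exponentiated interval end points are standard results, and they reproduce the reported figures.

```stata
* Values copied from the /lnsigma row of the first intreg output
scalar lnsig    = -1.090882
scalar se_lnsig = .1516747

* sigma is the exponential of lnsigma
scalar sig = exp(lnsig)

* Delta method: se(sigma) is approximately sigma * se(lnsigma)
scalar se_sig = sig*se_lnsig

* Confidence limits: exponentiate the limits reported for /lnsigma
scalar sig_lo = exp(-1.388159)
scalar sig_hi = exp(-.7936051)

display "sigma     = " %9.7f sig
display "se(sigma) = " %9.7f se_sig
display "CI limits = " %9.7f sig_lo "  " %9.7f sig_hi
```

Estimating on the log scale keeps sigma positive, which is why the interval for sigma is not symmetric around the point estimate.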
Finally, the output provides a summary of the number of left-censored, uncensored, right-censored and interval-censored values.
estout, cells(b(star fmt(%8.2f)) se(par fmt(%8.2f))) stats(sigma ll chi2, fmt(%8.2f))
b/se
model
write 0.0053***
(0.0015)
rating 0.0168
(0.0106)
read 0.0023*
(0.0011)
_cons 0.9134
(0.4883)
lnsigma
_cons -1.0909***
(0.1268)
sigma 0.34
ll -36.66
chi2 54.85
With a little bit of manual editing, and remembering to add the McKelvey-Zavoina pseudo-R2,
we can produce an acceptable table of the output.
model
write 0.0053***
(0.0015)
rating 0.0168
(0.0106)
read 0.0023*
(0.0011)
constant 0.9134
(0.4883)
standard error
of estimate 0.34
log-likelihood -36.66
chi-squared 54.85
pseudo R-squared 0.68
legend: coefficient/(standard error) * p<0.05 *** p<0.001
The interval regression model predicting GPA from reading, writing and teacher ratings
was statistically significant (chi-squared = 54.85, df = 3, p < 0.001).
Both the writing and reading test scores were
statistically significant, at the 0.001 and 0.05 levels, respectively. The teacher ratings
were not significant (p = 0.12).
The McKelvey and Zavoina pseudo-R2 was 0.68,
indicating that these three predictors
accounted for approximately 68% of the variability in the latent outcome variable.
A one-unit increase in the writing or reading score
leads to a 0.0053 or 0.0023 increase in the predicted GPA, respectively, holding the other predictors constant.
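To see what the coefficients imply on the GPA scale, the linear prediction can be evaluated by hand. The sketch below plugs the sample means from the summarize output into the fitted equation; evaluating at the means is an illustrative choice, not part of the original write-up, and the scalar names are hypothetical.

```stata
* Coefficients from the intreg output; means from the summarize output
scalar b_cons   = .9133711
scalar b_write  = .0052829
scalar b_rating = .016789
scalar b_read   = .002329

* Linear prediction at the sample means of write, rating and read
scalar xb = b_cons + b_write*113.8333 + b_rating*57.53333 + b_read*171.6667
display "Predicted GPA at the sample means  = " %5.3f xb

* Change in predicted GPA for a 10-point increase in write, others fixed
scalar d_write10 = 10*b_write
display "Change for +10 points in write     = " %6.4f d_write10
```

The prediction at the means comes out at about 2.88, which lies between the mean lower bound (2.6) and the mean upper bound (3.10).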
These pages are Copyrighted (c) by UCLA Academic Technology Services