Example 1. We wish to model annual income using years of education and marital status. However, we do not have access to the precise values of income; we only have data on income ranges: <$15,000, $15,000-$25,000, $25,000-$50,000, $50,000-$75,000, $75,000-$100,000, and >$100,000. Note that the categories at either end of the range are open-ended: the lowest is left-censored and the highest is right-censored. The other categories are interval-censored, that is, each interval is censored on both the left and the right. Analyses of this type require a generalization of censored regression known as interval regression.
Example 2. We wish to predict GPA from teacher ratings of effort and from reading and writing test scores. The measure of GPA is a self-report response to the following item:
Select the category that best represents your overall GPA.
    less than 2.0
    2.0 to 2.5
    2.5 to 3.0
    3.0 to 3.4
    3.4 to 3.8
    3.8 to 3.9
    4.0 or greater
Again, we have a situation with both interval censoring and left- and right-censoring. We do not know the exact value of GPA for each student; we only know the interval in which their GPA falls.
Example 3. We wish to predict GPA from teacher ratings of effort and from reading and writing test scores. The measure of GPA is a self-report response to the following item:
Select the category that best represents your overall GPA.
    0.0 to 2.0
    2.0 to 2.5
    2.5 to 3.0
    3.0 to 3.4
    3.4 to 3.8
    3.8 to 4.0
This is a slight variation of Example 2, in which there is only interval censoring.
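The three examples differ only in which kinds of censoring occur. As a sketch of why a single model covers all of them, assume the unobserved outcome follows a normal linear model with mean x_i'β and standard deviation σ (this notation is introduced here for illustration and does not appear in the SAS output). With l_i and u_i denoting the known lower and upper bounds, each observation contributes one of four terms to the log likelihood, depending on what is known about it:

```latex
\begin{aligned}
\ell(\beta,\sigma)
  &= \sum_{i \in \text{interval}}
       \log\!\left[\Phi\!\left(\frac{u_i - x_i'\beta}{\sigma}\right)
                 - \Phi\!\left(\frac{l_i - x_i'\beta}{\sigma}\right)\right]
   + \sum_{i \in \text{left}}
       \log \Phi\!\left(\frac{u_i - x_i'\beta}{\sigma}\right) \\
  &\quad
   + \sum_{i \in \text{right}}
       \log\!\left[1 - \Phi\!\left(\frac{l_i - x_i'\beta}{\sigma}\right)\right]
   + \sum_{i \in \text{uncensored}}
       \log\!\left[\frac{1}{\sigma}\,
         \phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right]
\end{aligned}
```

Here Φ and φ are the standard normal distribution and density functions. In Example 3 only the first sum is present; Examples 1 and 2 also include the left- and right-censored sums.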
We have a hypothetical data file, intregex.sas7bdat, with 30 observations. The GPA score is represented by two values, the lower interval score (lgpa) and the upper interval score (ugpa). The reading test score, the writing test score and the teacher rating are read, write and rating, respectively.
Let's look at the data.
proc print data = intregex noobs; run;
ID      LGPA      UGPA    WRITE    RATING    READ
 1   2.50000   3.00000      175      54.0     150
 2   3.40000   3.80000      125      68.0     250
 3   2.50000   3.00000       70      48.0     150
 4   0.00000   2.00000       50      52.0      50
 5   3.00000   3.40000       70      49.0     250
 6   3.40000   3.80000      205      53.5     150
 7   3.80000   4.00000      180      72.0     250
 8   2.00000   2.50000       50      50.0     250
 9   3.00000   3.40000      155      57.5     150
10   3.40000   3.80000      105      69.0     250
11   2.00000   2.50000       55      54.5      50
12   2.00000   2.50000       90      51.5      50
13   2.00000   2.50000       70      52.0      50
14   2.50000   3.00000      135      71.0     150
15   2.50000   3.00000      150      61.0     350
16   2.50000   3.00000      175      54.0     150
17   3.40000   3.80000      125      68.0     250
18   2.50000   3.00000       70      48.0     150
19   2.00000   2.50000       95      52.0      50
20   3.00000   3.40000       70      49.0     250
21   3.40000   3.80000      205      53.5     150
22   3.80000   4.00000      180      72.0     300
23   2.00000   2.50000       50      50.0     250
24   3.00000   3.40000      155      57.5     150
25   3.40000   3.80000      105      69.0     250
26   2.00000   2.50000       55      54.5      50
27   2.00000   2.50000       90      51.5      50
28   2.00000   2.50000       70      52.0      50
29   2.50000   3.00000      135      71.0     150
30   2.50000   3.00000      150      61.0     350

Note that there are two GPA responses for each observation, lgpa for the lower end of the interval and ugpa for the upper end.
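The data above follow Example 3, so every observation has both endpoints. For data like Example 2, with open-ended categories, proc lifereg treats a missing lower value as left-censored and a missing upper value as right-censored. A minimal sketch of that coding follows; the data set name gpa_ex2, the category variable gpacat (coded 1 to 7) and the three data lines are hypothetical and are not part of intregex.

```sas
/* Hypothetical sketch: coding the Example 2 categories for proc lifereg.   */
/* The data set gpa_ex2 and the variable gpacat (1-7) are assumptions made  */
/* for illustration; they do not come from the intregex data file.          */
data gpa_ex2;
  input id gpacat;
  select (gpacat);
    when (1) do; lgpa = .;   ugpa = 2.0; end;  /* less than 2.0: left-censored   */
    when (2) do; lgpa = 2.0; ugpa = 2.5; end;
    when (3) do; lgpa = 2.5; ugpa = 3.0; end;
    when (4) do; lgpa = 3.0; ugpa = 3.4; end;
    when (5) do; lgpa = 3.4; ugpa = 3.8; end;
    when (6) do; lgpa = 3.8; ugpa = 3.9; end;
    when (7) do; lgpa = 4.0; ugpa = .;   end;  /* 4.0 or greater: right-censored */
    otherwise do; lgpa = .; ugpa = .; end;     /* unknown category: both missing */
  end;
  datalines;
1 1
2 4
3 7
;
run;

proc print data = gpa_ex2 noobs; run;
```

With this coding, the same model statement form, model (lgpa ugpa) = ..., handles all three kinds of censoring. Observations with both bounds missing are not used in the analysis.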
proc means data = intregex; run;
The MEANS Procedure

Variable     N            Mean         Std Dev         Minimum         Maximum
------------------------------------------------------------------------------
ID          30      15.5000000       8.8034084       1.0000000      30.0000000
LGPA        30       2.6000000       0.7754865               0       3.8000000
UGPA        30       3.0966666       0.5708332       2.0000000       4.0000000
WRITE       30     113.8333333      49.9427834      50.0000000     205.0000000
RATING      30      57.5333333       8.3034406      48.0000000      72.0000000
READ        30     171.6666667      94.3976670      50.0000000     350.0000000
------------------------------------------------------------------------------

Graphing these data can be rather tricky. So, just to get an idea of the distribution of GPA, we will do separate histograms for lgpa and ugpa.
proc univariate data = intregex plot; var lgpa; histogram / normal; run;
proc univariate data = intregex plot; var ugpa; histogram / normal; run;
proc corr data = intregex nosimple; var lgpa ugpa write rating read; run;
The CORR Procedure

5 Variables:    LGPA     UGPA     WRITE    RATING   READ

Pearson Correlation Coefficients, N = 30
Prob > |r| under H0: Rho=0

              LGPA      UGPA     WRITE    RATING      READ
LGPA       1.00000   0.94878   0.62057   0.53551   0.57468
                      <.0001    0.0003    0.0023    0.0009
UGPA       0.94878   1.00000   0.65724   0.59039   0.59972
            <.0001              <.0001    0.0006    0.0005
WRITE      0.62057   0.65724   1.00000   0.47635   0.31823
            0.0003    <.0001              0.0078    0.0866
RATING     0.53551   0.59039   0.47635   1.00000   0.44887
            0.0023    0.0006    0.0078              0.0128
READ       0.57468   0.59972   0.31823   0.44887   1.00000
            0.0009    0.0005    0.0866    0.0128
proc insight data = intregex; scatter write rating read lgpa ugpa * write rating read lgpa ugpa; run; quit;
proc lifereg data = intregex; model (lgpa ugpa) = write rating read /d=normal; run;
The LIFEREG Procedure
Model Information
Data Set WORK.INTREGEX
Dependent Variable LGPA
Dependent Variable UGPA
Number of Observations 30
Noncensored Values 0
Right Censored Values 0
Left Censored Values 0
Interval Censored Values 30
Name of Distribution Normal
Log Likelihood -36.66218493
Number of Observations Read 30
Number of Observations Used 30
Algorithm converged.
Type III Analysis of Effects

                       Wald
Effect     DF    Chi-Square    Pr > ChiSq
WRITE       1       11.8250        0.0006
RATING      1        2.9645        0.0851
READ        1        8.3783        0.0038

Analysis of Parameter Estimates

                           Standard     95% Confidence       Chi-
Parameter   DF  Estimate      Error         Limits         Square   Pr > ChiSq
Intercept    1    0.9134     0.4794    -0.0262    1.8530     3.63       0.0567
WRITE        1    0.0053     0.0015     0.0023    0.0083    11.83       0.0006
RATING       1    0.0168     0.0098    -0.0023    0.0359     2.96       0.0851
READ         1    0.0023     0.0008     0.0008    0.0039     8.38       0.0038
Scale        1    0.3359     0.0510     0.2495    0.4522
At the top of the output, we see information about the model and the data set. This includes a summary of the number of left-censored, uncensored, right-censored and interval-censored values; here all 30 observations are interval-censored. After that, the output looks very much like the output from an OLS regression.

In the table entitled "Type III Analysis of Effects," we see each variable in the model along with its degrees of freedom, Wald chi-square and p-value. In this example, both write and read are statistically significant, while rating is not (p = 0.0851). In the table entitled "Analysis of Parameter Estimates," we have the interval regression coefficients, the standard errors of the coefficients, the 95% confidence intervals for the coefficients, a chi-square test and the associated p-value. In this example, the information in the last two tables is redundant because every predictor has one degree of freedom. If we had used a categorical variable with more than two levels, the information in the two tables would not be redundant. Rather, we would see the multi-degree-of-freedom test in the Type III Analysis of Effects, which tells us whether the variable as a whole is statistically significant, while in the Analysis of Parameter Estimates table we would see the coefficients for each dummy variable.

The ancillary statistic scale is equivalent to the standard error of the estimate in OLS regression. The value of 0.34 can be compared to the standard deviations for lgpa and ugpa of 0.78 and 0.57. This shows a substantial reduction. The output also contains the standard error of the scale estimate (0.0510) as well as its 95% confidence interval (0.2495 to 0.4522).
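As with OLS, the coefficients can be combined into a predicted value on the GPA scale. The worked example below is a sketch that uses the rounded estimates from the table above and the predictor values for the first observation (write = 175, rating = 54.0, read = 150), so it will differ slightly from the value SAS computes with unrounded estimates.

```latex
\begin{aligned}
\widehat{\text{GPA}}
  &= 0.9134 + 0.0053\,(175) + 0.0168\,(54.0) + 0.0023\,(150) \\
  &= 0.9134 + 0.9275 + 0.9072 + 0.3450 \\
  &= 3.0931 \approx 3.09
\end{aligned}
```

That student reported a GPA between 2.5 and 3.0, so the prediction lies about 0.09 above the upper end of the interval, which is small relative to the scale estimate of 0.34.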
The lifereg procedure does not compute an R-squared or pseudo-R-squared. You can compute a rough-and-ready measure of fit by squaring the correlation between the predicted values and each of the observed interval endpoints, lgpa and ugpa.
proc lifereg data = intregex; model (lgpa ugpa) = write rating read /d=normal; output out = t xbeta=xb; run;
/* save the correlation matrix of xb, lgpa and ugpa as a data set */
ods output PearsonCorr=int_corr;
proc corr data = t nosimple;
  var xb lgpa ugpa;
run;

/* square each endpoint's correlation with the predicted value xb */
data _null_;
  set int_corr;
  file print;
  if variable = "LGPA" then do;
    a = round((xb)**2, .0001);
    put "The squared multiple correlation between lgpa and the predicted value is " a;
  end;
  if variable = "UGPA" then do;
    b = round((xb)**2, .0001);
    put "The squared multiple correlation between ugpa and the predicted value is " b;
  end;
run;
The squared multiple correlation between lgpa and the predicted value is 0.5616
The squared multiple correlation between ugpa and the predicted value is 0.6342
These pages are Copyrighted (c) by UCLA Academic Technology Services