Version info: Code for this page was tested in SAS 9.3.
Truncated regression is used to model dependent variables for which some of the observations are not included in the analysis because of the value of the dependent variable.
Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up analyses.
Example 1. A study of students in a special GATE (gifted and talented education) program wishes to model achievement as a function of language skills and the type of program in which the student is currently enrolled. A major concern is that students are required to have a minimum achievement score of 40 to enter the special program. Thus, the sample is truncated at an achievement score of 40.
Example 2. A researcher has data for a sample of Americans whose income is above the poverty line. Hence, the lower part of the distribution of income is truncated. If the researcher had a sample of Americans whose income was at or below the poverty line, then the upper part of the income distribution would be truncated. In other words, truncation is a result of sampling only part of the distribution of the outcome variable.
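The effect of truncation can be made concrete through the likelihood. What follows is a sketch under the usual assumption of normally distributed errors; the symbols (y_i for the outcome, x_i for the predictors, beta, sigma, and the lower bound a, which is 40 in Example 1) are introduced here for illustration and are not used elsewhere on this page. When the outcome is observed only above a, each observation contributes the normal density rescaled by the probability of falling above the bound:

```latex
\begin{align*}
y_i &= x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),
      \qquad y_i \text{ observed only if } y_i > a \\[4pt]
f(y_i \mid y_i > a) &=
  \frac{\dfrac{1}{\sigma}\,\phi\!\left(\dfrac{y_i - x_i^\top \beta}{\sigma}\right)}
       {1 - \Phi\!\left(\dfrac{a - x_i^\top \beta}{\sigma}\right)} \\[4pt]
\log L(\beta, \sigma) &= \sum_{i=1}^{n} \left[
  -\log \sigma
  + \log \phi\!\left(\frac{y_i - x_i^\top \beta}{\sigma}\right)
  - \log\!\left(1 - \Phi\!\left(\frac{a - x_i^\top \beta}{\sigma}\right)\right)
\right]
\end{align*}
```

Here phi and Phi are the standard normal density and distribution function. The last term is what separates this from an ordinary regression likelihood: it corrects for the part of the distribution that could never appear in the sample.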
We have a hypothetical data file, truncreg.sas7bdat, with 178 observations. The outcome variable is called achiv, and the language test score variable is called langscore. The variable prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled.
Let's look at the data. It is always a good idea to start with descriptive statistics.
proc means data = mylib.truncreg;
var achiv langscore;
run;
The MEANS Procedure
Variable Label N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------------------------
achiv 178 54.2359551 8.9632299 41.0000000 76.0000000
langscore writing score 178 54.0112360 8.9448964 31.0000000 67.0000000
-------------------------------------------------------------------------------------------------
proc sort data = mylib.truncreg;
by prog;
run;
proc means data = mylib.truncreg;
by prog;
var achiv langscore;
run;
--------------------------------------- type of program=1 ----------------------------------------
The MEANS Procedure
Variable Label N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------------------------
achiv 40 51.5750000 7.9707398 42.0000000 68.0000000
langscore writing score 40 51.6750000 9.4391099 31.0000000 67.0000000
-------------------------------------------------------------------------------------------------
--------------------------------------- type of program=2 ----------------------------------------
Variable Label N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------------------------
achiv 101 56.8910891 9.0187593 41.0000000 76.0000000
langscore writing score 101 56.7326733 7.5748150 37.0000000 67.0000000
-------------------------------------------------------------------------------------------------
--------------------------------------- type of program=3 ----------------------------------------
Variable Label N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------------------------
achiv 37 49.8648649 7.2769124 41.0000000 68.0000000
langscore writing score 37 49.1081081 9.2699748 31.0000000 67.0000000
-------------------------------------------------------------------------------------------------
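As an aside, the same group-wise summaries can be requested without sorting first. The sketch below assumes the same mylib.truncreg dataset and uses a class statement in place of the by statement. The statistics should be the same, but they are printed in a single table with one block of rows per level of prog.

```sas
/* Group-wise descriptives without a prior proc sort:
   class, unlike by, does not require sorted input */
proc means data = mylib.truncreg;
  class prog;
  var achiv langscore;
run;
```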
proc sgplot data = mylib.truncreg;
histogram achiv / scale = count showbins;
density achiv;
run;
proc freq data = mylib.truncreg;
tables prog;
run;
The FREQ Procedure
type of program
Cumulative Cumulative
prog Frequency Percent Frequency Percent
---------------------------------------------------------
1 40 22.47 40 22.47
2 101 56.74 141 79.21
3 37 20.79 178 100.00
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.
We will use proc qlim to run our truncated regression analysis. The variables langscore and prog are predictors in the model, while achiv is the outcome. We will specify that prog is a categorical variable using a class statement. The lb= option on the endogenous statement indicates the value at which the left truncation takes place. There is also a ub= option to indicate the value of the right truncation, which was not needed in this example. We will use the test statement to obtain the two degree-of-freedom test of prog. To save our parameter estimates in a dataset we can use later, we specify a dataset name using the outest= option on the proc qlim statement.
proc qlim data = mylib.truncreg outest = mylib.truncreg_outest;
class prog;
model achiv = langscore prog;
endogenous achiv ~ truncated (lb = 40);
overall_prog: test prog_1 = 0, prog_2 = 0;
run;
The QLIM Procedure
Summary Statistics of Continuous Responses
N Obs N Obs
Standard Lower Upper Lower Upper
Variable Mean Error Type Bound Bound Bound Bound
achiv 54.23596 8.963230 Truncated 40
Class Level Information
Class Levels Values
prog 3 1 2 3
Model Fit Summary
Number of Endogenous Variables 1
Endogenous Variable achiv
Number of Observations 178
Log Likelihood -591.30981
Maximum Absolute Gradient 4.46555E-8
Number of Iterations 21
Optimization Method Quasi-Newton
AIC 1193
Schwarz Criterion 1209
Algorithm converged.
Parameter Estimates
Standard Approx
Parameter DF Estimate Error t Value Pr > |t|
Intercept 1 10.165659 6.676185 1.52 0.1278
langscore 1 0.712578 0.114485 6.22 <.0001
prog 1 1 1.135863 2.669958 0.43 0.6705
prog 2 1 5.201081 2.306222 2.26 0.0241
prog 3 0 0 . . .
_Sigma 1 8.755314 0.666880 13.13 <.0001
The QLIM Procedure
Test Results
Test Type Statistic Pr > ChiSq Label
OVERALL_PROG Wald 7.19 0.0274 prog_1 = 0,
prog_2 = 0
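The ub= option mentioned above is used in the same way as lb=. The sketch below illustrates the syntax only: the upper bound of 80 is a hypothetical value chosen for the example, not a feature of these data, which are truncated only from below.

```sas
/* Syntax illustration only: ub = 80 is a hypothetical upper truncation point */
proc qlim data = mylib.truncreg;
  class prog;
  model achiv = langscore prog;
  /* lb= and ub= can be given together when both tails are truncated */
  endogenous achiv ~ truncated (lb = 40 ub = 80);
run;
```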
Using the parameter estimates saved in mylib.truncreg_outest, we can compute the predicted value of achiv for each level of prog, holding langscore at its mean (54.011236).
data _null_;
set mylib.truncreg_outest;
/* keep only the row holding the parameter estimates */
where _TYPE_ = "PARM";
prog1 = intercept + 54.011236 * langscore + prog_1;
prog2 = intercept + 54.011236 * langscore + prog_2;
prog3 = intercept + 54.011236 * langscore;
file print;
put "predicted achiv for langscore = mean and prog = 1: " prog1;
put "predicted achiv for langscore = mean and prog = 2: " prog2;
put "predicted achiv for langscore = mean and prog = 3: " prog3;
run;
<**SOME OUTPUT OMITTED**>
predicted achiv for langscore = mean and prog = 1: 49.78871363
predicted achiv for langscore = mean and prog = 2: 53.853932015
predicted achiv for langscore = mean and prog = 3: 48.652851052
In the output we see the results of our put statements, where we printed the predicted values. Now, using test statements within proc qlim, we assess whether these predicted means differ from one another.
proc qlim data = mylib.truncreg;
class prog;
model achiv = langscore prog;
endogenous achiv ~ truncated (lb = 40);
prog1_vs_prog2: test intercept + 54.01124 * langscore + prog_1 = intercept + 54.01124 * langscore + prog_2;
prog1_vs_prog3: test intercept + 54.01124 * langscore + prog_1 = intercept + 54.01124 * langscore;
prog2_vs_prog3: test intercept + 54.01124 * langscore + prog_2 = intercept + 54.01124 * langscore;
run;
<**SOME OUTPUT OMITTED**>
Test Results
Test Type Statistic Pr > ChiSq Label
PROG1_VS_ Wald 3.91 0.0479 intercept +
PROG2 54.01124 * langscore
+ prog_1 =
intercept + 54.01124
* langscore + prog_2
PROG1_VS_ Wald 0.18 0.6705 intercept +
PROG3 54.01124 * langscore
+ prog_1 =
intercept + 54.01124
* langscore
PROG2_VS_ Wald 5.09 0.0241 intercept +
PROG3 54.01124 * langscore
+ prog_2 =
intercept + 54.01124
* langscore
The predicted value for level 2 of prog differs significantly from those for level 1 (p = 0.0479) and level 3 (p = 0.0241), while levels 1 and 3 do not differ significantly from each other (p = 0.6705).
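Because intercept + 54.01124 * langscore appears on both sides of each test equation, it cancels, and each test reduces to a comparison of the prog coefficients. As a sketch, the same hypotheses could be written more compactly as follows. This version has not been run for this page, but it should yield the same Wald statistics.

```sas
/* Equivalent, shorter form of the three pairwise tests:
   the common terms cancel, and prog 3 is the reference level (coefficient 0) */
proc qlim data = mylib.truncreg;
  class prog;
  model achiv = langscore prog;
  endogenous achiv ~ truncated (lb = 40);
  prog1_vs_prog2: test prog_1 = prog_2;
  prog1_vs_prog3: test prog_1 = 0;
  prog2_vs_prog3: test prog_2 = 0;
run;
```

This also explains why the p-values for the two comparisons with prog 3 (0.6705 and 0.0241) match those for prog 1 and prog 2 in the Parameter Estimates table: with prog 3 as the reference level, each one-degree-of-freedom Wald chi-square is the square of the corresponding t value (for example, 0.43 squared is about 0.18).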
The qlim procedure produces neither an R² nor a pseudo-R². You can compute a rough estimate of the degree of association by correlating achiv with the predicted value and squaring the result. Below, we rerun the analysis, this time including an output statement to obtain the predicted values. Next, we use the ods output statement to save the correlation matrix that proc corr produces for the outcome variable (achiv) and the predicted value (called p_achiv by default) to a dataset called corr. Finally, we use a data step to square the correlation, round it to four decimal places, and print the result to the output window.
proc qlim data=mylib.truncreg;
class prog;
model achiv = langscore prog;
endogenous achiv ~ truncated (lb = 40);
output out = mylib.trunc_temp predicted;
run;
ods output PearsonCorr=mylib.corr;
proc corr data = mylib.trunc_temp nosimple;
var achiv p_achiv;
run;
data _null_;
set mylib.corr;
if variable = "achiv";
file print;
a = round((P_achiv)**2, .0001);
put "The squared multiple correlation between achieve and the predicted value is " a;
run;
The squared multiple correlation between achieve and the predicted value is 0.3052
The calculated value of approximately 0.31 is a rough estimate of the R² you would find in an OLS regression. In other words, the squared correlation between the observed and predicted values of achiv is about 0.31, indicating that the predictors account for roughly 31% of the variability in the outcome variable.