SAS Data Analysis Examples
Truncated Regression

Examples of Truncated Regression Analysis

Example 1. A researcher has data for a sample of employed persons and wishes to model wages as predicted by years of schooling and gender. Since the sample excludes individuals who are not employed, the data can be considered to be truncated at zero, i.e., wages need to be greater than zero to be included in the sample.

Example 2. A study of students in a special GATE (gifted and talented education) program wishes to model achievement as a function of gender, language skills and math skills. A major concern is that students require a minimum achievement score of 40 to enter the special program. Thus, the sample is truncated at an achievement score of 40.

Description of the Data

Let's pursue Example 2 from above.

We have a hypothetical data file, truncreg2.sas7bdat, with 178 observations. The achievement variable is achiv, and the language and math test scores are langscore and mathscore, respectively. The variable female is a zero-one indicator variable, with one indicating a female student.

Let's look at the data.
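
As a minimal sketch of how the data might be summarized, the code below assumes that truncreg2.sas7bdat sits in a folder assigned to a libref. The libref ex and the path C:\data are hypothetical; substitute your own location.

```sas
* Hypothetical libref and path. Point ex at the folder that holds truncreg2.sas7bdat;
libname ex "C:\data";

* Descriptive statistics for the outcome and the two test scores;
proc means data=ex.truncreg2 n mean std min max;
  var achiv langscore mathscore;
run;

* Frequency table for the zero-one indicator;
proc freq data=ex.truncreg2;
  tables female;
run;
```

If the sample is truncated as described in Example 2, the minimum of achiv reported by proc means should be no lower than 40.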

Some Strategies You Might Be Tempted To Try

Before we show how you can analyze this with a truncated regression analysis, let's consider some other methods that you might use.

SAS Truncated Regression Analysis

The lb= option on the endogenous statement of proc qlim indicates the value at which the left truncation takes place.  There is also a ub= option to indicate the value of the right truncation, which was not needed in this example.
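
The call described above might look like the following sketch. The libref ex and its path are hypothetical, and the model is assumed to use the three predictors named in the data description (female, langscore and mathscore), with left truncation at the entry score of 40.

```sas
* Hypothetical libref and path. Point ex at the folder that holds truncreg2.sas7bdat;
libname ex "C:\data";

proc qlim data=ex.truncreg2;
  model achiv = female langscore mathscore;
  * Left truncation at 40, the minimum score for program entry;
  endogenous achiv ~ truncated (lb=40);
run;
```

For data truncated from above, the same statement would take ub= instead of (or in addition to) lb=.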

The output looks very much like the output from an OLS regression.  In the upper right of the output we can see that zero observations were truncated.  This is because our sample contains no observations with achievement values below 40.  The Parameter Estimates table gives the truncated regression coefficients, their standard errors, the t-values and the associated p-values.  The ancillary statistic _sigma is equivalent to the standard error of estimate in OLS regression.  Its value of 7.74 can be compared to the standard deviation of achievement, which was 8.96; this is a modest reduction.  The output also contains the standard error of _sigma, along with its t-value and p-value.  A statistically significant _sigma means only that 7.74 differs significantly from 0.  The validity of this test is a matter of debate among statisticians, and some programs report the estimate and standard error of _sigma but not the test of statistical significance.

The qlim procedure produces neither an R² nor a pseudo-R².  You can compute a rough estimate of the degree of association by correlating achiv with the predicted value and squaring the result.  Below, we rerun the analysis, this time including an output statement to obtain the predicted values.  Next, we use proc corr to get the correlation between the outcome variable (achiv) and the predicted value (called p_achiv by default), and use the ods output statement to save the correlation matrix to a data set called corr.  Finally, we use a data step to square the correlation (and round it to four decimal places), and output the answer to the output window.
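
The three steps just described can be sketched as follows. The libref ex, its path, the intermediate data set name temp and the variable name rsq are hypothetical choices; the data set name corr and the predicted-value name p_achiv come from the description above.

```sas
* Hypothetical libref and path. Point ex at the folder that holds truncreg2.sas7bdat;
libname ex "C:\data";

* Step 1. Rerun the model and save the predicted values (p_achiv) to temp;
proc qlim data=ex.truncreg2;
  model achiv = female langscore mathscore;
  endogenous achiv ~ truncated (lb=40);
  output out=temp predicted;
run;

* Step 2. Correlate the outcome with its prediction and save the matrix as corr;
ods output PearsonCorr=corr;
proc corr data=temp nosimple;
  var achiv p_achiv;
run;

* Step 3. Square the correlation, round to four decimals, and print it;
data _null_;
  set corr;
  if variable = "achiv" then do;
    rsq = round(p_achiv**2, .0001);
    file print;
    put "The squared correlation is " rsq;
  end;
run;
```

The file print statement routes the put output to the output window rather than the log. The resulting value is only a rough descriptive measure and should not be read as the proportion of variance explained in the OLS sense.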