Regression with SAS
Chapter 4 - Beyond OLS

Chapter Outline
    4.1 Robust Regression Methods
        4.1.1 Regression with Robust Standard Errors
        4.1.2 Using Proc Genmod for Clustered Data
        4.1.3 Robust Regression
        4.1.4 Quantile Regression
    4.2 Constrained Linear Regression
    4.3 Regression with Censored or Truncated Data
        4.3.1 Regression with Censored Data
        4.3.2 Regression with Truncated Data
    4.4 Regression with Measurement Error
    4.5 Multiple Equation Regression Models
        4.5.1 Seemingly Unrelated Regression
        4.5.2 Multivariate Regression
    4.6 Summary     

In this chapter we will explore a number of SAS procedures and techniques that go beyond OLS. This chapter is a bit different from the others in that it covers several concepts, some of which may be new to you. These extensions have much of the look and feel of OLS but provide additional tools for working with linear models.

The topics will include robust regression methods, constrained linear regression, regression with censored and truncated data, regression with measurement error, and multiple equation models.

4.1 Robust Regression Methods

It seems to be a rare dataset that meets all of the assumptions underlying multiple regression. We know that failure to meet assumptions can lead to biased estimates of coefficients and especially biased estimates of the standard errors. This fact explains a lot of the activity in the development of robust regression methods.

The idea behind robust regression methods is to make adjustments in the estimates that take into account some of the flaws in the data itself. We are going to look at four robust methods: regression with robust standard errors, regression with clustered data, robust regression, and quantile regression.

Before we look at these approaches, let's look at a standard OLS regression using the elementary school academic performance index (elemapi2) dataset. We will look at a model that predicts the api00 score (the academic performance index in 2000) using the average class size in K through 3 (acs_k3), the average class size in grades 4 through 6 (acs_46), the percent of fully credentialed teachers (full), and the size of the school (enroll). First let's look at the descriptive statistics for these variables. Note the missing values for acs_k3 and acs_46.


proc means data = "c:\sasreg\elemapi2" mean std max min;
  var api00 acs_k3 acs_46 full enroll;
run;
The MEANS Procedure

Variable            Mean         Std Dev         Minimum         Maximum
------------------------------------------------------------------------
api00        647.6225000     142.2489610     369.0000000     940.0000000
acs_k3        19.1608040       1.3686933      14.0000000      25.0000000
acs_46        29.6851385       3.8407840      20.0000000      50.0000000
full          84.5500000      14.9497907      37.0000000     100.0000000
enroll       483.4650000     226.4483847     130.0000000         1570.00
------------------------------------------------------------------------

Below we see the regression predicting api00 from acs_k3 acs_46 full and enroll. We see that all of the variables are significant except for acs_k3.
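The proc reg step looks like this; we hold off on the quit statement so that the procedure stays active for the test statement used below.

proc reg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll;
run;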

The REG Procedure
Model: MODEL1
Dependent Variable: api00

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     4        3071909         767977      61.01    <.0001
Error                   390        4909501          12588
Corrected Total         394        7981410


Root MSE            112.19832    R-Square     0.3849
Dependent Mean      648.65063    Adj R-Sq     0.3786
Coeff Var            17.29719


                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       -5.20041       84.95492      -0.06      0.9512
acs_k3        1        6.95438        4.37110       1.59      0.1124
acs_46        1        5.96601        1.53105       3.90      0.0001
full          1        4.66822        0.41425      11.27      <.0001
enroll        1       -0.10599        0.02695      -3.93      <.0001

Since the regression procedure is interactive and we haven't issued the quit command, we can test both of the class size variables, and we find the overall test of these two variables is significant.
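Since proc reg is still running, we can simply submit a test statement; the two variables separated by a comma are tested jointly, producing the 2 df test below.

  test acs_k3, acs_46;
run;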

       Test 1 Results for Dependent Variable api00

                                Mean
Source             DF         Square    F Value    Pr > F

Numerator           2         139437      11.08    <.0001
Denominator       390          12588

Here is the residual versus fitted plot for this regression. Notice that the pattern of the residuals is not exactly as we would hope.  The spread of the residuals is somewhat wider toward the middle right of the graph than at the left, where the variability of the residuals is somewhat smaller, suggesting some heteroscedasticity.

Here is the index plot of Cook's D for this regression. We see 4 points that are somewhat high in both their leverage and their residuals. 
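Neither plot is reproduced here, but here is a sketch of how we might generate them; the output statement saves the fitted values, residuals and Cook's D, and the variable names (fv, rs, cd, obsno) are our own.

proc reg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll;
  output out = diag p = fv r = rs cookd = cd;  /* fitted values, residuals, Cook's D */
run;
quit;
data diag;
  set diag;
  obsno = _n_;   /* observation index for the index plot */
run;
proc gplot data = diag;
  plot rs*fv / vref = 0;   /* residuals vs. fitted, reference line at zero */
  plot cd*obsno;           /* index plot of Cook's D */
run;
quit;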

None of these results are dramatic problems, but the plot of residual vs. predicted value suggests that there might be some outliers and some possible heteroscedasticity and the index plot of Cook's D shows some points in the upper right quadrant that could be influential. We might wish to use something other than OLS regression to estimate this model. In the next several sections we will look at some robust regression methods.

4.1.1 Regression with Robust Standard Errors

The SAS proc reg includes an option called acov in the model statement for estimating the asymptotic covariance matrix of the estimates under the hypothesis of heteroscedasticity. The standard errors obtained from this asymptotic covariance matrix are considered to be more robust and can deal with a collection of minor concerns about failure to meet assumptions, such as minor problems with normality or heteroscedasticity, or some observations that exhibit large residuals, leverage or influence. For such minor problems, the standard errors based on acov may effectively deal with these concerns.

With the acov option, the point estimates of the coefficients are exactly the same as in ordinary OLS, but the standard errors are calculated from the asymptotic covariance matrix. Here is the same regression as above using the acov option. We also use SAS ODS (Output Delivery System) to output the parameter estimates along with the asymptotic covariance matrix; we then calculate the robust standard errors in a data step, merge them with the parameter estimates using proc sql, and create the t-values and corresponding probabilities. Note the changes in the standard errors and t-tests (but no change in the coefficients). In this particular example, using robust standard errors did not change any of the conclusions from the original OLS regression. We should also mention that the robust standard errors have been adjusted for the sample size correction.
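Here is a sketch of the first steps; the ODS table ACovEst holds the asymptotic covariance matrix, and the proc sql merge and t-value computations are omitted. The n/(n-k) sample size correction shown is our assumption about the adjustment.

proc reg data = "c:\sasreg\elemapi2";
  model api00 = acs_k3 acs_46 full enroll / acov;
  ods output ACovEst = acov ParameterEstimates = pest;
run;
quit;
/* the robust standard errors are the square roots of the diagonal of the
   (sample-size corrected) asymptotic covariance matrix */
data rse;
  set acov nobs = k;   /* k = number of parameters (rows of the matrix) */
  array v(*) _numeric_;
  n = 395;             /* observations used in the fit (390 error df + 5 parameters) */
  robust_stderr = sqrt((n/(n-k)) * v(_n_));   /* _n_-th diagonal element */
  keep variable robust_stderr;
run;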

                                                      robust_
Variable   Estimate     StdErr  tValue   Probt     stderr  tvalue_rb  probt_rb
------------------------------------------------------------------------------
Intercept  -5.20041   84.95492   -0.06  0.9512   86.66308   -0.06001   0.95218
acs_k3      6.95438    4.37110    1.59  0.1124   4.620599   1.505082  0.133104
acs_46      5.96601    1.53105    3.90  0.0001   1.573214   3.792246  0.000173
full        4.66822    0.41425   11.27  <.0001   0.414681   11.25737         0
enroll     -0.10599    0.02695   -3.93  <.0001   0.028015   -3.78331  0.000179

4.1.2 Using Proc Genmod for Clustered Data

As described in Chapter 2, OLS regression assumes that the residuals are independent. The elemapi2 dataset contains data on 400 schools that come from 37 school districts. It is very possible that the scores within each school district may not be independent, and this could lead to residuals that are not independent within districts. SAS proc genmod is used to model correlated data.  We can use the class statement and the repeated statement  to indicate that the observations are clustered into districts (based on dnum) and that the observations may be correlated within districts, but would be independent between districts. 

The GENMOD Procedure

             Analysis Of GEE Parameter Estimates
              Empirical Standard Error Estimates
                   Standard   95% Confidence
Parameter Estimate    Error       Limits            Z Pr > |Z|
Intercept  -5.2004 119.5172 -239.450 229.0490   -0.04   0.9653
acs_k3      6.9544   6.7726  -6.3196  20.2284    1.03   0.3045
acs_46      5.9660   2.4839   1.0976  10.8344    2.40   0.0163
full        4.6682   0.6904   3.3151   6.0213    6.76   <.0001
enroll     -0.1060   0.0421  -0.1886  -0.0234   -2.51   0.0119

As with the regression with robust standard errors, the estimates of the coefficients are the same as the OLS estimates, but the standard errors take into account that the observations within districts are non-independent. Even though the standard errors are larger in this analysis, the three variables that were significant in the OLS analysis are significant in this analysis as well.

We notice that the standard error estimates given here differ from what Stata reports using regress with the cluster option. This is because Stata applies a further finite-sample adjustment. We can do some SAS programming to make the same adjustment. The adjusted variance is a constant times the variance obtained from the empirical standard error estimates; the constant is (N-1)/(N-k) * M/(M-1), where N is the number of observations, M is the number of clusters, and k is the number of parameters.

data em;
  set 'c:\sasreg\elemapi2';
run;
proc genmod data = em;
  class dnum;
  model api00 = acs_k3 acs_46 full enroll;
  repeated subject = dnum / type = ind covb;
  ods output geercov = gcov;      /* empirical (robust) covariance matrix */
  ods output GEEEmpPEst = parms;  /* GEE parameter estimates */
run;
quit;
/* N = number of observations, M = number of clusters (districts) */
proc sql;
  select count(dnum), count(distinct dnum) into :n, :m
  from em;
quit;
/* k = number of parameters, i.e. rows of the covariance matrix */
proc sql;
  select count(prm1) into :k
  from gcov;
quit;
/* scale the covariance matrix and take square roots of its diagonal */
data gcov_ad;
  set gcov;
  array all(*) _numeric_;
  do i = 1 to dim(all);
    all(i) = all(i)*((&n-1)/(&n-&k))*(&m/(&m-1));
    if i = _n_ then std_ad = sqrt(all(i));   /* diagonal element of this row */
  end;
  keep std_ad;
run;
data all;
  merge parms gcov_ad;
run;
proc print data = all noobs;
run;
Parm         Estimate      Stderr     LowerCL     UpperCL          Z     ProbZ     std_ad
Intercept     -5.2004    119.5172    -239.450    229.0490      -0.04    0.9653    121.778
acs_k3         6.9544      6.7726     -6.3196     20.2284       1.03    0.3045      6.901
acs_46         5.9660      2.4839      1.0976     10.8344       2.40    0.0163      2.531
full           4.6682      0.6904      3.3151      6.0213       6.76    <.0001      0.703
enroll        -0.1060      0.0421     -0.1886     -0.0234      -2.51    0.0119      0.043

4.1.3 Robust Regression

In SAS, there is no single proc we can simply execute to perform a robust regression using iteratively reweighted least squares; we have to write our own macro. To this end, ATS has written a macro called robust_hb.sas. This macro first uses Huber weighting and later switches to biweighting, which is why it is called robust_hb, where h and b stand for Huber and biweight respectively. Robust regression assigns a weight to each observation, with higher weights given to better behaved observations. In fact, extremely deviant cases, those with Cook's D greater than 1, can have their weights set to missing so that they are not included in the analysis at all.

The macro robust_hb.sas uses another macro called mad.sas to generate the MAD (median absolute deviation) during the iteration process. We include both macros and then invoke the analysis, as sketched below. Note that in this analysis both the coefficients and the standard errors differ from the original OLS regression.
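The invocation might look something like this; the argument list for %robust_hb is our own guess at the macro's interface, so treat it as illustrative.

%include 'c:\sasreg\mad.sas';        /* helper macro: median absolute deviation */
%include 'c:\sasreg\robust_hb.sas';  /* the robust regression macro */
/* hypothetical argument list: data set, dependent variable, predictors */
%robust_hb(elemapi2, api00, acs_k3 acs_46 full enroll);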

The REG Procedure
Model: MODEL1
Dependent Variable: api00

Weight: _w2_

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     4        3053248         763312      73.19    <.0001
Error                   390        4067269          10429
Corrected Total         394        7120516


Root MSE            102.12196    R-Square     0.4288
Dependent Mean      647.38090    Adj R-Sq     0.4229
Coeff Var            15.77463


                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       -6.79858       80.45406      -0.08      0.9327
acs_k3        1        6.11031        4.15015       1.47      0.1417
acs_46        1        6.25588        1.45280       4.31      <.0001
full          1        4.79575        0.39212      12.23      <.0001
enroll        1       -0.10924        0.02558      -4.27      <.0001

If you compare the robust regression results (directly above) with the OLS results previously presented, you can see that the coefficients and standard errors are quite similar, and the t values and p values are also quite similar. Despite the minor problems that we found in the data when we performed the OLS analysis, the robust regression analysis yielded quite similar results suggesting that indeed these were minor problems.   Had the results been substantially different, we would have wanted to further investigate the reasons why the OLS and robust regression results were different, and among the two results the robust regression results would probably be the more trustworthy.

Let's look at the predicted (fitted) values (p), the residuals (r), and the leverage (hat) values (h). The macro robust_hb generates a final data set called _tempout_ containing the original data together with the predicted values, raw residuals, and leverage values.

Now, let's check on the various predicted values and the weighting. First, we will sort by _w2_, the weight generated at the last iteration. Then we will look at the first 10 observations. Notice that the smallest weights are near one-half but quickly get into the .6 range. The first five values are missing due to the missing values of predictors.


proc sort data = _tempout_;
  by _w2_;
run;
proc print data = _tempout_ (obs=10);
  var snum api00 p r h _w2_;
run;
Obs    snum    api00       p           r           h         _w2_

  1     116     513        .           .        .           .
  2    3072     763        .           .        .           .
  3    3055     590        .           .        .           .
  4    4488     521        .           .        .           .
  5    4534     445        .           .        .           .
  6     637     447     733.426    -286.144    0.003657    0.55684
  7    5387     892     611.709     280.464    0.002278    0.57183
  8    2267     897     622.062     275.512    0.009887    0.58481
  9      65     903     631.930     271.731    0.010198    0.59466
 10    3759     585     842.664    -257.473    0.040009    0.63128

Now, let's look at the last 10 observations, i.e., those with the largest weights. The weights for observations with snum 1678, 4486 and 1885 are all very close to one, since the residuals are fairly small. Note that the observations above that have the lowest weights are also those with the largest residuals (residuals over 200), while the observations below with the highest weights have very small residuals (all about 4 or less).
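To display them, we can re-sort in descending order and again print the first 10 observations; a sketch:

proc sort data = _tempout_;
  by descending _w2_;
run;
proc print data = _tempout_ (obs=10);
  var snum api00 p r h _w2_;
run;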

Obs    snum    api00       p           r           h         _w2_

  1    1678     497     497.065     0.18424    0.023137    1.00000
  2    4486     706     706.221     0.20151    0.013740    1.00000
  3    1885     605     606.066    -0.41421    0.013745    1.00000
  4    3535     705     703.929     1.16143    0.004634    0.99999
  5    3024     727     729.278    -2.01559    0.010113    0.99997
  6    3700     717     719.803    -2.43802    0.007317    0.99996
  7    1596     536     533.918     2.77985    0.012143    0.99995
  8     181     724     728.068    -3.53209    0.013665    0.99992
  9    1949     696     691.898     3.96736    0.020426    0.99990
 10    5917     735     731.141     4.03336    0.005831    0.99990

After running the macro robust_hb.sas, we can use the dataset _tempout_ to create some graphs for regression diagnostic purposes. For example, we can create a graph of residuals versus fitted (predicted) values with a reference line at zero, as sketched below. This plot looks much like the OLS plot, except that in OLS all of the observations are weighted equally; as we saw above, the observations with the greatest residuals are weighted less and hence have less influence on the results.
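A minimal version of the plot, using the p and r variables from _tempout_:

proc gplot data = _tempout_;
  plot r*p / vref = 0;   /* residuals vs. fitted, reference line at zero */
run;
quit;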

4.1.4 Quantile Regression

Quantile regression, in general, and median regression, in particular, might be considered as an alternative to robust regression. SAS does quantile regression using a little bit of proc iml. Inside proc iml, a subroutine called LAV is called, and it performs a median regression in which the coefficients are estimated by minimizing the absolute deviations from the median. Of course, as an estimate of central tendency, the median is a resistant measure that is not as greatly affected by outliers as is the mean. It is not clear that median regression is a resistant estimation procedure; in fact, there is some evidence that it can be affected by high leverage values.

Here is what the quantile regression looks like using SAS proc iml. The first data step makes sure that the data set proc iml takes as input has no missing values. Inside proc iml we first generate the necessary matrices for the regression computation and then call the LAV subroutine. After calling LAV we can calculate the predicted values and residuals. Finally, we create a data set called _temp_ containing the dependent variable and all the predictors plus the predicted values and residuals.
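A sketch of these steps (for brevity, the output data set here keeps only the dependent variable, the predicted values and the residuals):

data nomiss;
  set "c:\sasreg\elemapi2";
  if nmiss(api00, acs_k3, acs_46, full, enroll) = 0;  /* complete cases only */
run;
proc iml;
  use nomiss;
  read all var {acs_k3 acs_46 full enroll} into x;
  read all var {api00} into y;
  close nomiss;
  a = j(nrow(x), 1, 1) || x;   /* design matrix with an intercept column */
  call lav(rc, xr, a, y);      /* L1 (median) regression */
  p = a * t(xr);               /* predicted values (xr is returned as a row vector) */
  r = y - p;                   /* residuals */
  out = y || p || r;
  create _temp_ from out[colname = {"api00" "p" "r"}];
  append from out;
quit;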

                                 LAV (L1) Estimation
                               Start with LS Solution
                      Start Iter: gamma=284.75134813 ActEqn=395
 Iter  N Huber  Act Eqn     Rank    Gamma         L1(x)      F(Gamma)
   31       36        5        5   0.0000    36268.1049    36240.6335

Algorithm converged
Objective Function L1(x)= 36268.104941
Number Search Directions=       68
Number Refactorizations =        2
Total Execution Time=       0.0200

Necessary and sufficient optimality condition satisfied.


                                    L1 Solution with ASE

             Intercept        acs_k3        acs_46          full        enroll
Est         17.1505029  1.2690656888  7.2240793844  5.3238408715  -0.124573396
ASE         123.531545  6.3559394265  2.2262729207   0.602359504  0.0391932684

The coefficient and standard error for acs_k3 are considerably different as compared to OLS (the coefficients are 1.2 vs 6.9 and the standard errors are 6.4 vs 4.3). The coefficients and standard errors for the other variables are also different, but not as dramatically different. Nevertheless, the  quantile regression results indicate that, like the OLS results, all of the variables except acs_k3 are significant. Using the data set _temp_ we created above we obtain a plot of residuals vs. predicted values shown below.

4.2 Constrained Linear Regression

Let's begin this section by looking at a regression model using the hsb2 dataset. The hsb2 file is a sample of 200 cases from the High School and Beyond Study (Rock, Hilton, Pollack, Ekstrom & Goertz, 1985). It includes the following variables: id female race ses schtyp prog read write math science socst. The variables read write math science socst are the results of standardized tests on reading, writing, math, science and social studies (respectively), and the variable female is coded 1 if female, 0 if male.

Let's start by doing an OLS regression where we predict the socst score from read, write, math, science and female (gender).
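The call is straightforward; as before, we leave proc reg running so we can submit test statements afterwards.

proc reg data = "c:\sasreg\hsb2";
  model socst = read write math science female;
run;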

The REG Procedure
Model: MODEL1
Dependent Variable: socst

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     5          10949     2189.85150      35.44    <.0001
Error                   194          11987       61.78834
Corrected Total         199          22936


Root MSE              7.86056    R-Square     0.4774
Dependent Mean       52.40500    Adj R-Sq     0.4639
Coeff Var            14.99963


                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        7.33934        3.65024       2.01      0.0457
read          1        0.37840        0.08063       4.69      <.0001
write         1        0.38587        0.08893       4.34      <.0001
math          1        0.13033        0.08938       1.46      0.1464
science       1       -0.03339        0.08187      -0.41      0.6838
female        1       -0.35326        1.24537      -0.28      0.7770

Notice that the coefficients for read and write are very similar, which makes sense since they are both measures of language ability.  Also, the coefficients for math and science are similar (in that they are both not significantly different from 0). Suppose that we have a theory that suggests that read and write should have equal coefficients. We can test the equality of the coefficients using the test statement.

   test read = write;
 run;
        Test 1 Results for Dependent Variable socst

                                Mean
Source             DF         Square    F Value    Pr > F

Numerator           1        0.19057       0.00    0.9558
Denominator       194       61.78834

The test result indicates that there is no significant difference in the coefficients for the reading and writing scores. Since it appears that the coefficients for math and science are also equal, let's test the equality of those as well.


  test math = science;
  run;
       Test 2 Results for Dependent Variable socst

                                Mean
Source             DF         Square    F Value    Pr > F

Numerator           1       89.63950       1.45    0.2299
Denominator       194       61.78834

Let's now perform both of these tests together, simultaneously testing that the coefficient for read equals that for write and that the coefficient for math equals that for science, using the mtest statement.

  mtest math - science, read - write;
  run;
Multivariate Test 1

                Multivariate Statistics and Exact F Statistics

                              S=1    M=0    N=96

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.99257173       0.73         2       194    0.4852
Pillai's Trace              0.00742827       0.73         2       194    0.4852
Hotelling-Lawley Trace      0.00748386       0.73         2       194    0.4852
Roy's Greatest Root         0.00748386       0.73         2       194    0.4852

Note this second test has 2 df, since it is testing both of the hypotheses listed, and this test is not significant, suggesting these pairs of coefficients are not significantly different from each other.  We can estimate regression models where we constrain coefficients to be equal to each other.  For example, let's begin on a limited scale and constrain read to equal write. Proc reg uses the restrict statement to accomplish this, as shown below.
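Here is the restricted model; the restrict statement imposes the equality while the model is estimated.

proc reg data = "c:\sasreg\hsb2";
  model socst = read write math science female;
  restrict read = write;   /* constrain the two coefficients to be equal */
run;
quit;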

The REG Procedure
Model: MODEL1
Dependent Variable: socst

NOTE: Restrictions have been applied to parameter estimates.


                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     4          10949     2737.26674      44.53    <.0001
Error                   195          11987       61.47245
Corrected Total         199          22936


Root MSE              7.84044    R-Square     0.4774
Dependent Mean       52.40500    Adj R-Sq     0.4667
Coeff Var            14.96124


                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        7.35415        3.63118       2.03     0.0442
read          1        0.38185        0.05139       7.43     <.0001
write         1        0.38185        0.05139       7.43     <.0001
math          1        0.13030        0.08915       1.46     0.1454
science       1       -0.03328        0.08164      -0.41     0.6840
female        1       -0.32962        1.16736      -0.28     0.7780
RESTRICT     -1      -25.51302      458.21611      -0.06     0.9558*

* Probability computed using beta distribution.

Notice that the coefficients for read and write are identical, along with their standard errors, t-tests, etc. Also note that the model degrees of freedom for the F test is four, not five, as in the OLS model. This is because only one coefficient is estimated for read and write, in effect estimated like a single variable equal to the sum of their values.  Notice also that the Root MSE is slightly higher for the constrained model, but only slightly higher.  This is because we have forced the model to estimate coefficients for read and write that are not as good at minimizing the Sum of Squares Error (the coefficients that would minimize the SSE would be the coefficients from the unconstrained model).

Next, we will define a second constraint, setting math equal to science, together with the first constraint we set before, as shown below.
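Each constraint gets its own restrict statement.

proc reg data = "c:\sasreg\hsb2";
  model socst = read write math science female;
  restrict read = write;
  restrict math = science;
run;
quit;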

The REG Procedure
Model: MODEL1
Dependent Variable: socst

NOTE: Restrictions have been applied to parameter estimates.


                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3          10860     3619.84965      58.75    <.0001
Error                   196          12077       61.61554
Corrected Total         199          22936


Root MSE              7.84956    R-Square     0.4735
Dependent Mean       52.40500    Adj R-Sq     0.4654
Coeff Var            14.97864


                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1        7.50566        3.63323       2.07     0.0402
read          1        0.38604        0.05133       7.52     <.0001
write         1        0.38604        0.05133       7.52     <.0001
math          1        0.04281        0.05192       0.82     0.4107
science       1        0.04281        0.05192       0.82     0.4107
female        1       -0.20087        1.16383      -0.17     0.8631
RESTRICT     -1      -15.36285      458.82638      -0.03     0.9734*
RESTRICT     -1      547.24384      454.01563       1.21     0.2290*

* Probability computed using beta distribution.

Now the coefficients for read and write are constrained to be equal, as are the coefficients for math and science, and the degrees of freedom for the model has dropped to three.  Again, the Root MSE is slightly larger than in the prior model, but we should emphasize only very slightly larger.  If indeed the population coefficients for read and write are equal, and likewise for math and science, then these combined (constrained) estimates may be more stable and generalize better to other samples.  So although these estimates may lead to a slightly higher standard error of prediction in this sample, they may generalize better to the population from which they came.

4.3 Regression with Censored or Truncated Data

Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study.

We will begin by looking at analyzing data with censored values.

4.3.1 Regression with Censored Data

In this example we have a variable called acadindx which is a weighted combination of standardized test scores and academic grades. The maximum possible score on acadindx is 200 but it is clear that the 16 students who scored 200 are not exactly equal in their academic abilities. In other words, there is variability in academic ability that is not being accounted for when students score 200 on acadindx. The variable acadindx is said to be censored, in particular, it is right censored.

Let's look at the example. We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.
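The default statistics from proc means (N, mean, standard deviation, minimum, maximum) are what we need here.

proc means data = "c:\sasreg\acadindx";
  var id female reading writing acadindx;
run;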

The MEANS Procedure

Variable      N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
id          200     100.5000000      57.8791845       1.0000000     200.0000000
female      200       0.5450000       0.4992205               0       1.0000000
reading     200      52.2300000      10.2529368      28.0000000      76.0000000
writing     200      52.7750000       9.4785860      31.0000000      67.0000000
acadindx    200     172.1850000      16.8173987     138.0000000     200.0000000
-------------------------------------------------------------------------------
proc means data = "c:\sasreg\acadindx" N;
  where acadindx = 200;
run;
The MEANS Procedure

Variable      N
---------------
id           16
female       16
reading      16
writing      16
acadindx     16
---------------
proc corr data = "c:\sasreg\acadindx" nosimple noprob;
  var acadindx female reading writing;
run;
The CORR Procedure

   4  Variables:    acadindx female   reading  writing

           Pearson Correlation Coefficients, N = 200

              acadindx        female       reading       writing

acadindx       1.00000      -0.08210       0.71309       0.66256

female        -0.08210       1.00000      -0.05308       0.25649

reading        0.71309      -0.05308       1.00000       0.59678

writing        0.66256       0.25649       0.59678       1.00000

Now, let's run a standard OLS regression on the data and generate predicted scores in p1.
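The output statement saves the predicted scores as p1 in a new data set, which we call reg1 here for later use.

proc reg data = "c:\sasreg\acadindx";
  model acadindx = female reading writing;
  output out = reg1 p = p1;   /* predicted scores */
run;
quit;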

The REG Procedure
Model: MODEL1
Dependent Variable: acadindx

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3          34994          11665     107.40    <.0001
Error                   196          21288      108.61160
Corrected Total         199          56282

Root MSE             10.42169    R-Square     0.6218
Dependent Mean      172.18500    Adj R-Sq     0.6160
Coeff Var             6.05261

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       96.11841        4.48956      21.41      <.0001
female        1       -5.83250        1.58821      -3.67      0.0003
reading       1        0.71842        0.09315       7.71      <.0001
writing       1        0.79057        0.10410       7.59      <.0001

Proc lifereg is one of the procedures in SAS that can be used for regression with censored data. The syntax is similar to proc reg, with the addition of a variable indicating whether an observation is censored. Therefore, we first create a data set carrying the censoring information, as sketched below.
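A sketch of the steps, consistent with the model information shown below (censoring value 1, normal distribution); the output data set reg2 with predicted scores p2 is used later.

data tobit_model;
  set "c:\sasreg\acadindx";
  censor = (acadindx >= 200);   /* 1 = right censored at the ceiling of 200 */
run;
proc lifereg data = tobit_model;
  model acadindx*censor(1) = female reading writing / d = normal;
  output out = reg2 p = p2;     /* predicted scores */
run;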

The LIFEREG Procedure

             Model Information

Data Set                    WORK.TOBIT_MODEL
Dependent Variable                  acadindx
Censoring Variable                    censor
Censoring Value(s)                         1
Number of Observations                   200
Noncensored Values                       184
Right Censored Values                     16
Left Censored Values                       0
Interval Censored Values                   0
Name of Distribution                  Normal
Log Likelihood                  -718.0636168


Algorithm converged.

       Type III Analysis of Effects

                         Wald
Effect       DF    Chi-Square    Pr > ChiSq

female        1       14.0654        0.0002
reading       1       60.8529        <.0001
writing       1       54.1655        <.0001

                  Analysis of Parameter Estimates

                      Standard   95% Confidence     Chi-
Parameter DF Estimate    Error       Limits       Square Pr > ChiSq

Intercept  1  92.7378   4.8034  83.3233 102.1524  372.74     <.0001
female     1  -6.3473   1.6924  -9.6644  -3.0302   14.07     0.0002
reading    1   0.7777   0.0997   0.5823   0.9731   60.85     <.0001
writing    1   0.8111   0.1102   0.5951   1.0271   54.17     <.0001
Scale      1  10.9897   0.5817   9.9067  12.1912

Let's merge the two data sets we created to compare the predicted scores p1 and p2.
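A minimal sketch, assuming the reg1 and reg2 data sets created above:

data compare;
  merge reg1 reg2 (keep = p2);   /* one-to-one merge; both come from acadindx */
run;
proc means data = compare;
  var acadindx p1 p2;
run;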

The MEANS Procedure

Variable      N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
acadindx    200     172.1850000      16.8173987     138.0000000     200.0000000
p1          200     172.1850000      13.2608696     142.3820725     201.5311070
p2          200     172.7040275      14.0029221     141.2210940     203.8540576
-------------------------------------------------------------------------------

It shows that the censored regression model predicted values have a larger standard deviation and a greater range of values. When we look at a listing of p1 and p2 for all students who scored the maximum of 200 on acadindx, we see that in every case the censored regression model predicted value is greater than the OLS predicted value. These predictions represent an estimate of what the variability would be if the values of acadindx could exceed 200.
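A sketch of the listing:

proc print data = compare;
  where acadindx = 200;
  var acadindx p1 p2;
run;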

Obs    acadindx       p1         p2

 32       200      179.175    179.620
 57       200      192.681    194.329
 68       200      201.531    203.854
 80       200      191.831    193.577
 82       200      188.154    189.563
 88       200      186.573    187.940
 95       200      195.997    198.176
100       200      186.933    188.108
132       200      197.578    199.798
136       200      189.459    191.144
143       200      191.185    192.833
157       200      191.614    193.477
161       200      180.251    181.008
169       200      182.275    183.367
174       200      191.614    193.477
200       200      187.662    189.421

Proc lifereg handles right censored, left censored and interval censored data. Other procedures in SAS, such as proc qlim, can also perform censored regression analysis.

4.3.2 Regression with Truncated Data

Truncated data occurs when some observations are not included in the analysis because of the value of the variable. We will illustrate analysis with truncation using the dataset, acadindx, that was used in the previous section. 

Let's imagine that in order to get into a special honors program, students need to score at least 160 on acadindx. So we will drop all observations in which the value of acadindx is less than or equal to 160. Now, let's estimate the same model that we used in the section on censored data, only this time we will pretend that a 200 for acadindx is not censored.
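A sketch of the truncation and the OLS fit:

data trunc;
  set "c:\sasreg\acadindx";
  if acadindx > 160;   /* keep only scores above 160 */
run;
proc reg data = trunc;
  model acadindx = female reading writing;
run;
quit;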

The REG Procedure
Model: MODEL1
Dependent Variable: acadindx

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3     8074.79638     2691.59879      33.01    <.0001
Error                   140          11416       81.54545
Corrected Total         143          19491


Root MSE              9.03025    R-Square     0.4143
Dependent Mean      180.42361    Adj R-Sq     0.4017
Coeff Var             5.00503


                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1      125.63547        5.89156      21.32      <.0001
female        1       -5.23849        1.61563      -3.24      0.0015
reading       1        0.44111        0.09635       4.58      <.0001
writing       1        0.58733        0.11508       5.10      <.0001

It is clear that the estimates of the coefficients are distorted due to the fact that 56 observations are no longer in the dataset. This amounts to a restriction of range on both the response variable and the predictor variables. For example, the coefficient for writing dropped from .79 to .59. What this means is that if our goal is to find the relation between acadindx and the predictor variables in the population, then the truncation of acadindx in our sample is going to lead to biased estimates. A better approach to analyzing these data is to use truncated regression. In SAS this can be accomplished using proc qlim, an experimental procedure first available in SAS version 8.1. Proc qlim (Qualitative and LImited dependent variable model) analyzes univariate (and multivariate) limited dependent variable models where the dependent variable takes discrete values or is observed only in a limited range of values.
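A sketch consistent with the output below, where the endogenous variable is named y and the 56 values at or below 160 are set to missing:

data trunc_model;
  set "c:\sasreg\acadindx";
  y = acadindx;
  if acadindx <= 160 then y = .;   /* below the truncation point */
run;
proc qlim data = trunc_model;
  model y = female reading writing;
  endogenous y ~ truncated(lb = 160);   /* lower truncation bound */
run;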

The QLIM Procedure

                            Summary Statistics of Continuous Responses

                                                                                    N Obs    N Obs
                            Standard                           Lower       Upper    Lower    Upper
Variable        Mean           Error          Type             Bound       Bound    Bound    Bound

   y        180.4236       11.674837       Truncated             160


             Model Fit Summary

Number of Endogenous Variables             1
Endogenous Variable                        y
Number of Observations                   144
Missing Values                            56
Log Likelihood                    -510.00768
Maximum Absolute Gradient         3.87471E-6
Number of Iterations                      14
AIC                                     1030
Schwarz Criterion                       1045

Algorithm converged.


                      Parameter Estimates

                                 Standard                 Approx
Parameter        Estimate           Error    t Value    Pr > |t|

Intercept      110.289206        8.673847      12.72      <.0001
female          -6.099602        1.925245      -3.17      0.0015
reading          0.518179        0.116829       4.44      <.0001
writing          0.766164        0.152620       5.02      <.0001
_Sigma           9.803571        0.721646      13.59      <.0001

The coefficients from proc qlim are closer to the original OLS results; for example, the coefficient for writing is .77, which is closer to the original OLS result of .79.  However, the results are still somewhat different for the other variables. For example, the coefficient for reading is .52 in proc qlim as compared to .72 in the original OLS with the unrestricted data, though this is better than the OLS estimate of .44 with the restricted data.  While proc qlim may improve the estimates on a restricted data file as compared to OLS, it is certainly no substitute for analyzing the complete unrestricted data file.

4.4 Regression with Measurement Error

As you will most likely recall, one of the assumptions of regression is that the predictor variables are measured without error. The problem is that measurement error in predictor variables leads to underestimation of the regression coefficients.

This section is under development.

4.5 Multiple Equation Regression Models

If a dataset has enough variables, we may want to estimate more than one regression model. For example, we may want to predict y1 from x1 and also predict y2 from x2. Even though there are no variables in common, these two models are not independent of one another because the data come from the same subjects. This is an example of one type of multiple equation regression known as seemingly unrelated regression. We can estimate the coefficients and obtain standard errors taking into account the correlated errors in the two models. An important feature of multiple equation models is that we can test predictors across equations.

Another example of multiple equation regression is if we wished to predict y1, y2 and y3 from x1 and x2. This is a three equation system, known as multivariate regression, with the same predictor variables for each model. Again, we have the capability of testing coefficients across the different equations.

Multiple equation models are a powerful extension to our data analysis tool kit.

4.5.1 Seemingly Unrelated Regression

Let's continue using the hsb2 data file to illustrate the use of seemingly unrelated regression. This time let's look at two regression models.

    science = math female
    write   = read female    

It is the case that the errors (residuals) from these two models would be correlated. This would be true even if the predictor female were not found in both models. The errors would be correlated because all of the values of the variables are collected on the same set of observations. This is a situation tailor-made for seemingly unrelated regression using proc syslin with the sur option. With proc syslin we can estimate both models simultaneously while accounting for the correlated errors, leading to efficient estimates of the coefficients and standard errors. The syntax is as follows.
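The model labels (science, write) name the equations in the output.

proc syslin data = "c:\sasreg\hsb2" sur;
  science: model science = math female;
  write:   model write   = read female;
run;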

The first part of the output consists of the OLS estimate for each model. Here is the OLS estimate for the first model. 

               The SYSLIN Procedure
Ordinary Least Squares Estimation

Model                  SCIENCE
Dependent Variable     science


                         Analysis of Variance

                               Sum of        Mean
Source                 DF     Squares      Square    F Value    Pr > F

Model                   2    7993.550    3996.775      68.38    <.0001
Error                 197    11513.95    58.44645
Corrected Total       199    19507.50


Root MSE             7.64503    R-Square       0.40977
Dependent Mean      51.85000    Adj R-Sq       0.40378
Coeff Var           14.74451


                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     18.11813    3.167133       5.72      <.0001
math              1     0.663190    0.057872      11.46      <.0001
female            1     -2.16840    1.086043      -2.00      0.0472

And here is the OLS estimate for the second model.

The SYSLIN Procedure
Ordinary Least Squares Estimation

Model                    WRITE
Dependent Variable       write


                         Analysis of Variance

                               Sum of        Mean
Source                 DF     Squares      Square    F Value    Pr > F

Model                   2    7856.321    3928.161      77.21    <.0001
Error                 197    10022.55    50.87591
Corrected Total       199    17878.88


Root MSE             7.13273    R-Square       0.43942
Dependent Mean      52.77500    Adj R-Sq       0.43373
Coeff Var           13.51537


                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     20.22837    2.713756       7.45      <.0001
read              1     0.565887    0.049385      11.46      <.0001
female            1     5.486894    1.014261       5.41      <.0001

Proc syslin with the sur option also gives an estimate of the correlation between the errors of the two models. Here is the corresponding output.

The SYSLIN Procedure
Seemingly Unrelated Regression Estimation

       Cross Model Covariance

              SCIENCE       WRITE

SCIENCE        58.4464        7.8908
WRITE           7.8908       50.8759


      Cross Model Correlation

              SCIENCE       WRITE

SCIENCE        1.00000       0.14471
WRITE          0.14471       1.00000


  Cross Model Inverse Correlation

              SCIENCE       WRITE

SCIENCE        1.02139      -0.14780
WRITE         -0.14780       1.02139


   Cross Model Inverse Covariance

              SCIENCE       WRITE

SCIENCE       0.017476      -.002710
WRITE         -.002710      0.020076


System Weighted MSE            0.9981
Degrees of freedom                394
System Weighted R-Square       0.3875

Finally, we have the seemingly unrelated regression estimation for our models. Note that both the estimates of the coefficients and their standard errors are different from the OLS model estimates shown above.

The SYSLIN Procedure
Seemingly Unrelated Regression Estimation

Model                  SCIENCE
Dependent Variable     science

                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     20.13265    3.149485       6.39      <.0001
math              1     0.625141    0.057528      10.87      <.0001
female            1     -2.18934    1.086038      -2.02      0.0452


Model                    WRITE
Dependent Variable       write

                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     21.83439    2.698827       8.09      <.0001
read              1     0.535484    0.049091      10.91      <.0001
female            1     5.453748    1.014244       5.38      <.0001

Now that we have estimated our models, let's test the predictor variables using the stest statement, as sketched below. The test for female combines information from both models. The tests for math and read are actually equivalent to the t-tests above except that the results are displayed as F-tests.
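The stest statements refer to coefficients as model_label.variable; the labels here (female, math, read) name the tests in the output.

proc syslin data = "c:\sasreg\hsb2" sur;
  science: model science = math female;
  write:   model write   = read female;
  female:  stest science.female = write.female = 0;   /* joint test across equations */
  math:    stest science.math = 0;
  read:    stest write.read = 0;
run;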

       Test Results for Variable FEMALE

  Num DF      Den DF    F Value    Pr > F

       2         394      18.48    0.0001


      Test Results for Variable MATH

  Num DF      Den DF    F Value    Pr > F

       1         394     118.31    0.0001


      Test Results for Variable READ

  Num DF      Den DF    F Value    Pr > F

       1         394     119.21    0.0001

Now, let's estimate 3 models where we use the same predictors in each model as shown below.

     read  = female prog1 prog3
     write = female prog1 prog3
     math  = female prog1 prog3

Here the variables prog1 and prog3 are dummy variables for the variable prog. Let's generate these variables before estimating our three models using proc syslin, as shown below.
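A sketch of the data step and the three-equation call (prog is coded 1, 2, 3):

data hsb2;
  set "c:\sasreg\hsb2";
  prog1 = (prog = 1);   /* dummy for program type 1 */
  prog3 = (prog = 3);   /* dummy for program type 3 */
run;
proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model math  = female prog1 prog3;
run;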

The OLS regression estimates of our three models are as follows.

<some output omitted>
The SYSLIN Procedure
Ordinary Least Squares Estimation

Model                   MODEL1
Dependent Variable        read

                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     56.82950    1.170562      48.55      <.0001
female            1     -1.20858    1.327672      -0.91      0.3638
prog1             1     -6.42937    1.665893      -3.86      0.0002
prog3             1     -9.97687    1.606428      -6.21      <.0001


Model                   MODEL2
Dependent Variable       write

                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     53.62162    1.042019      51.46      <.0001
female            1     4.771211    1.181876       4.04      <.0001
prog1             1     -4.83293    1.482956      -3.26      0.0013
prog3             1     -9.43807    1.430021      -6.60      <.0001


Model                   MODEL3
Dependent Variable        math

                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     57.10551    1.036890      55.07      <.0001
female            1     -0.67377    1.176059      -0.57      0.5674
prog1             1     -6.72394    1.475657      -4.56      <.0001
prog3             1     -10.3217    1.422983      -7.25      <.0001

These regressions provide fine estimates of the coefficients and standard errors, but these results assume the residuals of each analysis are completely independent of the others. Also, if we wished to test female, we would have to do it three times and would not be able to combine the information from all three tests into a single overall test.

Now let's see the output of the estimate using seemingly unrelated regression.

The SYSLIN Procedure
Seemingly Unrelated Regression Estimation
Model                   MODEL1
Dependent Variable       read
                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     56.82950    1.170562      48.55      <.0001
female            1     -1.20858    1.327672      -0.91      0.3638
prog1             1     -6.42937    1.665893      -3.86      0.0002
prog3             1     -9.97687    1.606428      -6.21      <.0001


Model                   MODEL2
Dependent Variable       write

                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     53.62162    1.042019      51.46      <.0001
female            1     4.771211    1.181876       4.04      <.0001
prog1             1     -4.83293    1.482956      -3.26      0.0013
prog3             1     -9.43807    1.430021      -6.60      <.0001


Model                   MODEL3
Dependent Variable        math

                        Parameter Estimates

                       Parameter    Standard
Variable         DF     Estimate       Error    t Value    Pr > |t|

Intercept         1     57.10551    1.036890      55.07      <.0001
female            1     -0.67377    1.176059      -0.57      0.5674
prog1             1     -6.72394    1.475657      -4.56      <.0001
prog3             1     -10.3217    1.422983      -7.25      <.0001

Note that the coefficients and standard errors are identical in the OLS results and in the seemingly unrelated regression estimation. When every equation contains exactly the same set of predictors, as these do, seemingly unrelated regression reduces to equation-by-equation OLS, so the gain here is not in the estimates themselves.

The gain is that we can test the effects of the predictors across the equations. We can test the hypothesis that the coefficient for female is 0 for all three outcome variables, as shown below.
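The stest statement for this joint test might look like the following (submitted with the proc syslin step above):

  female: stest model1.female = model2.female = model3.female = 0;
run;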

     Test Results for Variable FEMALE

  Num DF      Den DF    F Value    Pr > F

       3         588      11.63    0.0001

We can also test the hypothesis that the coefficient for female is 0 for just read and math, as shown below.

proc syslin data = hsb2 sur;
  model1: model read  = female prog1 prog3;
  model2: model write = female prog1 prog3;
  model3: model  math  = female prog1 prog3;
  f1: stest model1.female = model3.female = 0;
run;
       Test Results for Variable F1

  Num DF      Den DF    F Value    Pr > F

       2         588       0.42    0.6599

We can also test the hypothesis that the coefficients for prog1 and prog3 are 0 for all three outcome variables, as shown below.
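One way to write this six-restriction test:

  progs: stest model1.prog1 = model1.prog3 = model2.prog1 = model2.prog3 = model3.prog1 = model3.prog3 = 0;
run;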

     Test Results for Variable PROGS

  Num DF      Den DF    F Value    Pr > F

       6         588      11.83    0.0001

4.5.2 Multivariate Regression

Let's now use multivariate regression via proc reg to look at the same analysis that we saw in the proc syslin example above, estimating the following 3 models.

     read  = female prog1 prog3
     write = female prog1 prog3
     math  = female prog1 prog3

Below we use proc reg to predict read, write and math from female, prog1 and prog3, as sketched below. The top part of the output (largely omitted here) gives an overall summary of the model for each outcome variable. The lower part appears similar to the proc syslin output, and the standard errors here are simply the OLS standard errors, so these results do not take into account the correlations among the residuals. What proc reg adds is the mtest statement, which provides traditional multivariate tests of the predictors across the equations.
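Proc reg accepts several dependent variables in one model statement; we leave the procedure running so that mtest statements can follow.

proc reg data = hsb2;
  model read write math = female prog1 prog3;
run;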

The REG Procedure [Some output omitted] 

Dependent Variable: read

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       56.82950        1.17056      48.55      <.0001
female        1       -1.20858        1.32767      -0.91      0.3638
prog1         1       -6.42937        1.66589      -3.86      0.0002
prog3         1       -9.97687        1.60643      -6.21      <.0001


Dependent Variable: write

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       53.62162        1.04202      51.46      <.0001
female        1        4.77121        1.18188       4.04      <.0001
prog1         1       -4.83293        1.48296      -3.26      0.0013
prog3         1       -9.43807        1.43002      -6.60      <.0001


Dependent Variable: math

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1       57.10551        1.03689      55.07      <.0001
female        1       -0.67377        1.17606      -0.57      0.5674
prog1         1       -6.72394        1.47566      -4.56      <.0001
prog3         1      -10.32168        1.42298      -7.25      <.0001

Now, let's test female. Note that female was statistically significant in only one of the three equations. Using the mtest statement after proc reg allows us to test female across all three equations simultaneously. And, guess what? It is significant. This is consistent with what we found using seemingly unrelated regression estimation.
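Continuing the interactive proc reg run, the statement might look like this:

  female: mtest female = 0;
run;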

Multivariate Test: female

                Multivariate Statistics and Exact F Statistics

                             S=1    M=0.5    N=96

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.84892448      11.51         3       194    <.0001
Pillai's Trace              0.15107552      11.51         3       194    <.0001
Hotelling-Lawley Trace      0.17796108      11.51         3       194    <.0001
Roy's Greatest Root         0.17796108      11.51         3       194    <.0001

We can also test prog1 and prog3, both separately and combined. Remember these are multivariate tests.
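For example, for prog1:

  prog1: mtest prog1 = 0;
run;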

Multivariate Test: prog1

                Multivariate Statistics and Exact F Statistics

                             S=1    M=0.5    N=96

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.89429287       7.64         3       194    <.0001
Pillai's Trace              0.10570713       7.64         3       194    <.0001
Hotelling-Lawley Trace      0.11820192       7.64         3       194    <.0001
Roy's Greatest Root         0.11820192       7.64         3       194    <.0001
  prog3: mtest prog3 = 0;
run;
Multivariate Test: prog3

                Multivariate Statistics and Exact F Statistics

                             S=1    M=0.5    N=96

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.75267026      21.25         3       194    <.0001
Pillai's Trace              0.24732974      21.25         3       194    <.0001
Hotelling-Lawley Trace      0.32860304      21.25         3       194    <.0001
Roy's Greatest Root         0.32860304      21.25         3       194    <.0001
   prog: mtest prog1 = prog3 =0;
run;
quit;
Multivariate Test: prog

                 Multivariate Statistics and F Approximations

                              S=2    M=0    N=96

Statistic                        Value    F Value    Num DF    Den DF    Pr > F

Wilks' Lambda               0.73294667      10.87         6       388    <.0001
Pillai's Trace              0.26859190      10.08         6       390    <.0001
Hotelling-Lawley Trace      0.36225660      11.68         6     256.9    <.0001
Roy's Greatest Root         0.35636617      23.16         3       195    <.0001

NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.

Proc syslin with the sur option and proc reg both allow you to test multi-equation models while taking into account the fact that the equations are not independent.  Proc syslin with the sur option allows you to get estimates for each equation that adjust for the non-independence of the equations, and it allows you to estimate equations that don't necessarily have the same predictors. By contrast, proc reg is restricted to equations that have the same set of predictors, and the estimates it provides for the individual equations are the same as the OLS estimates.  However, proc reg allows you to perform more traditional multivariate tests of predictors.

4.6 Summary 

This chapter has covered a variety of topics that go beyond ordinary least squares regression, but there still remain a variety of topics we wish we could have covered, including the analysis of survey data, dealing with missing data, panel data analysis, and more. And, for the topics we did cover, we wish we could have gone into even more detail. One of our main goals for this chapter was to help you be aware of some of the techniques that are available in SAS for analyzing data that do not fit the assumptions of OLS regression and some of the remedies that are possible. If you are a member of the UCLA research community, and you have further questions, we invite you to use our consulting services to discuss issues specific to your data analysis.
