Help the Stat Consulting Group by giving a gift

Chapter 4 - Beyond OLS

**Chapter Outline
** 4.1 Robust Regression Methods

4.1.1 Regression with Robust Standard Errors

4.1.2 Using the Proc Genmod for Clustered Data

4.1.3 Robust Regression

4.1.4 Quantile Regression

4.2 Constrained Linear Regression

4.3 Regression with Censored or Truncated Data

4.3.1 Regression with Censored Data

4.3.2 Regression with Truncated Data

4.4 Regression with Measurement Error

4.5 Multiple Equation Regression Models

4.5.1 Seemingly Unrelated Regression

4.5.2 Multivariate Regression

4.6 Summary

In this chapter we will go into various commands that go beyond OLS. This chapter is a bit different from the others in that it covers a number of different concepts, some of which may be new to you. These extensions, beyond OLS, have much of the look and feel of OLS but will provide you with additional tools to work with linear models.

The topics will include robust regression methods, constrained linear regression, regression with censored and truncated data, regression with measurement error, and multiple equation models.

**4.1 Robust Regression Methods**

It seems to be a rare dataset that meets all of the assumptions underlying multiple regression. We know that failure to meet assumptions can lead to biased estimates of coefficients and especially biased estimates of the standard errors. This fact explains a lot of the activity in the development of robust regression methods.

The idea behind robust regression methods is to make adjustments in the estimates that take into account some of the flaws in the data itself. We are going to look at three robust methods: regression with robust standard errors, regression with clustered data, robust regression, and quantile regression.

Before we look at these approaches, let's look at a standard OLS regression using the
elementary school academic performance index (elemapi2.dta) dataset. We will look at a model that predicts the api 2000 scores using the average class size
in K through 3 (**acs_k3**), average class size 4 through 6 (**acs_46**), the
percent of fully credentialed teachers (**full**), and the size of the school (**enroll**).
First let's look at the descriptive statistics for these variables. Note the missing
values for **acs_k3** and **acs_k6**.

proc means data = "c:\sasreg\elemapi2" mean std max min; var api00 acs_k3 acs_46 full enroll; run;

The MEANS Procedure Variable Mean Std Dev Minimum Maximum ------------------------------------------------------------------------ api00 647.6225000 142.2489610 369.0000000 940.0000000 acs_k3 19.1608040 1.3686933 14.0000000 25.0000000 acs_46 29.6851385 3.8407840 20.0000000 50.0000000 full 84.5500000 14.9497907 37.0000000 100.0000000 enroll 483.4650000 226.4483847 130.0000000 1570.00 ------------------------------------------------------------------------

Below we see the regression predicting **api00** from **acs_k3 acs_46
full **and** enroll**. We see that all of the variables are significant except for **acs_k3**.

proc reg data = "c:\sasreg\elemapi2"; model api00 = acs_k3 acs_46 full enroll ; run;

The REG Procedure Model: MODEL1 Dependent Variable: api00 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 4 3071909 767977 61.01 <.0001 Error 390 4909501 12588 Corrected Total 394 7981410 Root MSE 112.19832 R-Square 0.3849 Dependent Mean 648.65063 Adj R-Sq 0.3786 Coeff Var 17.29719 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 -5.20041 84.95492 -0.06 0.9512 acs_k3 1 6.95438 4.37110 1.59 0.1124 acs_46 1 5.96601 1.53105 3.90 0.0001 full 1 4.66822 0.41425 11.27 <.0001 enroll 1 -0.10599 0.02695 -3.93 <.0001

Since the regression procedure is interactive and we haven't issued the **quit**
command, we can test both of the class size variables,
and we find the overall test of these two variables is significant.

test acs_k3 = acs_46 = 0; run;

Test 1 Results for Dependent Variable api00 Mean Source DF Square F Value Pr > F Numerator 2 139437 11.08 <.0001 Denominator 390 12588

Here is the residual versus fitted plot for this regression. Notice that the pattern of the residuals is not exactly as we would hope. The spread of the residuals is somewhat wider toward the middle right of the graph than at the left, where the variability of the residuals is somewhat smaller, suggesting some heteroscedasticity.

plot r.*p.; run;

Here is the index plot of Cook's D for this regression. We see 4 points that are somewhat high in both their leverage and their residuals.

plot cookd.*obs.; run;

None of these results are dramatic problems, but the plot of residual vs. predicted value suggests that there might be some outliers and some possible heteroscedasticity and the index plot of Cook's D shows some points in the upper right quadrant that could be influential. We might wish to use something other than OLS regression to estimate this model. In the next several sections we will look at some robust regression methods.

**4.1.1 Regression with Robust Standard Errors**

The SAS **proc reg** includes an option called **acov** in the
model statement for
estimating the asymptotic covariance matrix of the estimates under the
hypothesis of heteroscedasticity. The standard error obtained from the
asymptotic covariance matrix is considered to be more robust and can deal with a collection of minor concerns about failure to meet
assumptions, such as minor problems about normality, heteroscedasticity, or some
observations that exhibit large residuals, leverage or influence. For such minor problems,
the standard error based on **acov** may effectively deal with these concerns.

With the **acov** option, the point estimates of the coefficients are exactly the
same as in ordinary OLS, but we will calculate the standard errors based on the
asymptotic covariance matrix. Here is the same regression as above using the **acov**
option. We also use SAS **ODS** (Output Delivery System) to output the parameter
estimates along with the asymptotic covariance matrix. We calculated the robust
standard error in a data step and merged them with the parameter estimate using **proc
sql **and created the t-values and corresponding probabilities. Note the changes in the standard errors and t-tests (but no change in the
coefficients). In this particular example, using robust standard errors did not change any
of the conclusions from the original OLS regression. We should also mention that
the robust standard error has been adjusted for the sample size
correction.

proc reg data = "c:\sasreg\elemapi2"; model api00 = acs_k3 acs_46 full enroll /acov; ods output ACovEst = estcov; ods output ParameterEstimates=pest; run; quit; data temp_dm; set estcov; drop model dependent; array a(5) intercept acs_k3 acs_46 full enroll; array b(5) std1-std5; b(_n_) = sqrt((395/390)*a(_n_)); std = max(of std1-std5); keep variable std; run; proc sql; select pest.variable, estimate, stderr, tvalue, probt, std as robust_stderr, estimate/robust_stderr as tvalue_rb, (1 - probt(abs(estimate/robust_stderr), 394))*2 as probt_rb from pest, temp_dm where pest.variable=temp_dm.variable; quit;

robust_ Variable Estimate StdErr tValue Probt stderr tvalue_rb probt_rb ------------------------------------------------------------------------------ Intercept -5.20041 84.95492 -0.06 0.9512 86.66308 -0.06001 0.95218 acs_k3 6.95438 4.37110 1.59 0.1124 4.620599 1.505082 0.133104 acs_46 5.96601 1.53105 3.90 0.0001 1.573214 3.792246 0.000173 full 4.66822 0.41425 11.27 <.0001 0.414681 11.25737 0 enroll -0.10599 0.02695 -3.93 <.0001 0.028015 -3.78331 0.000179

**4.1.2 Using the Proc Genmod for Clustered Data**

As described in Chapter 2, OLS regression assumes that the residuals are independent.
The **elemapi2** dataset contains data on 400 schools that come from 37
school districts. It is very possible that the scores within each school
district may not be independent, and this could lead to residuals that are not
independent within districts. SAS **proc genmod** is used to model correlated
data. We can use the
**class** statement and the** repeated** statement to indicate that the observations
are clustered into districts (based on **dnum**) and that the
observations may be correlated within districts, but would be independent
between districts.

proc genmod data="c:\sasreg\elemapi2"; class dnum; model api00 = acs_k3 acs_46 full enroll ; repeated subject=dnum / type=ind ; run; quit;

The GENMOD Procedure

Analysis Of GEE Parameter Estimates Empirical Standard Error Estimates

Standard 95% Confidence Parameter Estimate Error Limits Z Pr > |Z|

Intercept -5.2004 119.5172 -239.450 229.0490 -0.04 0.9653 acs_k3 6.9544 6.7726 -6.3196 20.2284 1.03 0.3045 acs_46 5.9660 2.4839 1.0976 10.8344 2.40 0.0163 full 4.6682 0.6904 3.3151 6.0213 6.76 <.0001 enroll -0.1060 0.0421 -0.1886 -0.0234 -2.51 0.0119

As with the regression with robust error, the estimate of the coefficients are the same as the OLS estimates, but the standard errors take into account that the observations within districts are non-independent. Even though the standard errors are larger in this analysis, the three variables that were significant in the OLS analysis are significant in this analysis as well.

We notice that the standard error estimates given here are different from
what Stata's result using regress with the cluster option. This is because that
Stata further does a finite-sample adjustment. We can do some SAS programming
here for the adjustment. The adjusted variance is a constant times the variance
obtained from the empirical standard error estimates. This particular constant
is

(N-1)/(N-k)*M/(M-1).

data em; set 'c:\sasreg\elemapi2'; run;

proc genmod data=em; class dnum; model api00 = acs_k3 acs_46 full enroll ; repeated subject=dnum / type = ind covb ; ods output geercov = gcov; ods output GEEEmpPEst = parms; run; quit;

proc sql; select count(dnum),count(distinct dnum) into :n, :m from em; quit; proc sql; select count(prm1) into :k from gcov; quit; data gcov_ad; set gcov; array all(*) _numeric_; do i = 1 to dim(all); all(i) = all(i)*((&n-1)/(&n-&k))*(&m/(&m-1)); if i = _n_ then std_ad = sqrt(all(i)); end; drop i; keep std_ad; run; data all; merge parms gcov_ad; run; proc print data = all noobs; run;

Parm Estimate Stderr LowerCL UpperCL Z ProbZ std_ad

Intercept -5.2004 119.5172 -239.450 229.0490 -0.04 0.9653 121.778 acs_k3 6.9544 6.7726 -6.3196 20.2284 1.03 0.3045 6.901 acs_46 5.9660 2.4839 1.0976 10.8344 2.40 0.0163 2.531 full 4.6682 0.6904 3.3151 6.0213 6.76 <.0001 0.703 enroll -0.1060 0.0421 -0.1886 -0.0234 -2.51 0.0119 0.043

**4.1.3 Robust Regression**

In SAS, we can not simply execute some proc to perform a robust regression using iteratively reweighted least squares. In order to perform a robust regression, we have to write our own macro. To this end, ATS has written a macro called robust_hb.sas. This macro first uses Hubert weight and later switches to biweight. This is why the macro is called robust_hb where h and b stands for Hubert and biweight respectively. Robust regression assigns a weight to each observation with higher weights given to better behaved observations. In fact, extremely deviant cases, those with Cook's D greater than 1, can have their weights set to missing so that they are not included in the analysis at all.

The macro robust_hb.sas uses another macro called mad.sas to generate MAD (median absolute deviation) during the iteration process. We will include both macros to perform the robust regression analysis as shown below. Note that in this analysis both the coefficients and the standard errors differ from the original OLS regression.

%include 'c:\sasreg\mad.sas'; %include 'c:\sasreg\robust_hb.sas'; %robust_hb("c:\sasreg\elemapi2", api00, acs_k3 acs_46 full enroll, .01, 0.00005, 10);

The REG Procedure Model: MODEL1 Dependent Variable: api00 Weight: _w2_ Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 4 3053248 763312 73.19 <.0001 Error 390 4067269 10429 Corrected Total 394 7120516 Root MSE 102.12196 R-Square 0.4288 Dependent Mean 647.38090 Adj R-Sq 0.4229 Coeff Var 15.77463 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 -6.79858 80.45406 -0.08 0.9327 acs_k3 1 6.11031 4.15015 1.47 0.1417 acs_46 1 6.25588 1.45280 4.31 <.0001 full 1 4.79575 0.39212 12.23 <.0001 enroll 1 -0.10924 0.02558 -4.27 <.0001

If you compare the robust regression results (directly above) with the OLS results previously presented, you can see that the coefficients and standard errors are quite similar, and the t values and p values are also quite similar. Despite the minor problems that we found in the data when we performed the OLS analysis, the robust regression analysis yielded quite similar results suggesting that indeed these were minor problems. Had the results been substantially different, we would have wanted to further investigate the reasons why the OLS and robust regression results were different, and among the two results the robust regression results would probably be the more trustworthy.

Let's look at the predicted (fitted) values (**p**), the
residuals (**r**), and the leverage (hat) values (**h**).
The macro robust_hb generates a final data set with predicted values, raw
residuals and leverage values together with the original data called _tempout_.

Now, let's check on the various predicted values and the weighting. First, we will sort
by **_w2_, **the weight generated at last iteration. Then we will look at the first 15 observations. Notice that the smallest
weights are near one-half but quickly get into the .6 range. The first five values
are missing due to
the missing values of predictors.

proc sort data = _tempout_; by descending _w2_; run; proc print data = _tempout_ (obs=10); var snum api00 p r h _w2_; run;

Obs snum api00 p r h _w2_ 1 116 513 . . . . 2 3072 763 . . . . 3 3055 590 . . . . 4 4488 521 . . . . 5 4534 445 . . . . 6 637 447 733.426 -286.144 0.003657 0.55684 7 5387 892 611.709 280.464 0.002278 0.57183 8 2267 897 622.062 275.512 0.009887 0.58481 9 65 903 631.930 271.731 0.010198 0.59466 10 3759 585 842.664 -257.473 0.040009 0.63128

Now, let's look at the last 10 observations. The weights for observations with snum 1678, 4486 and 1885 are all very close to one, since the residuals are fairly small. Note that the observations above that have the lowest weights are also those with the largest residuals (residuals over 200) and the observations below with the highest weights have very low residuals (all less than 3).

proc sort data = _tempout_; by _w2_; run; proc print data = _tempout_ (obs=10); var snum api00 p r h _w2_; run;

Obs snum api00 p r h _w2_ 1 1678 497 497.065 0.18424 0.023137 1.00000 2 4486 706 706.221 0.20151 0.013740 1.00000 3 1885 605 606.066 -0.41421 0.013745 1.00000 4 3535 705 703.929 1.16143 0.004634 0.99999 5 3024 727 729.278 -2.01559 0.010113 0.99997 6 3700 717 719.803 -2.43802 0.007317 0.99996 7 1596 536 533.918 2.77985 0.012143 0.99995 8 181 724 728.068 -3.53209 0.013665 0.99992 9 1949 696 691.898 3.96736 0.020426 0.99990 10 5917 735 731.141 4.03336 0.005831 0.99990

After using macro **robust_hb.sas**, we can use the dataset _tempout_ to
create some graphs for regression diagnostic purposes. For example, we can
create a graph of
residuals versus fitted (predicted) with a line at zero. This plot looks much like the OLS
plot, except that in the OLS all of the observations would be weighted equally, but as we
saw above the observations with the greatest residuals are weighted less and hence have
less influence on the results.

axis1 order = (-300 to 300 by 100) label=(a=90) minor=none; axis2 order = (300 to 900 by 300) minor=none; symbol v=star h=0.8 c=blue; proc gplot data = _tempout_; plot r*p = 1 /haxis=axis2 vaxis=axis1 vref=0; label r ="Residual"; label p ="Fitted Value"; run; quit;

**4.1.4 Quantile Regression**

Quantile regression, in general, and median regression, in particular, might be
considered as an alternative to robust regression. SAS does quantile regression using a little bit of **proc iml**.
Inside **proc iml**, a procedure called LAV is called and it does a median regression in which
the coefficients will be estimated by minimizing the absolute deviations from the median.
Of course, as an estimate of central tendency, the median is a resistant measure that is
not as greatly affected by outliers as is the mean. It is not clear that median regression
is a resistant estimation procedure, in fact, there is some evidence that it can be
affected by high leverage values.

Here is what the quantile regression looks like using SAS **proc iml**.
The first data step is to make sure that the data set that **proc iml **takes
as input does not have any missing values. Inside **proc iml** we first
generate necessary matrices for regression computation and then call the
procedure LAV. After calling LAV we can calculate the predicted values and
residuals. At last, we create a data set called _temp_ containing the dependent
variables and all the predictors plus the predicted values and residuals.

data elemapi2; set "c:\sasreg\elemapi2"; cons = 1; if api00 ~=. & acs_k3 ~= . & acs_46 ~=. & full ~=. & enroll ~=.; run; proc iml ; /*Least absolute values*/ use elemapi2; read all; a = cons || acs_k3 || acs_46 || full || enroll; b=api00; opt= { . 3 0 1 }; call lav(rc,xr,a,b,,opt); /*print out the estimates*/ opt= { . 0 . 1 }; call lav(rc,xr,a,b,,opt); /*no print out but to create the xr*/ pred = a*t(xr); resid = b - pred; create _temp_ var { api00 cons acs_k3 acs_46 full enroll pred resid}; append; quit;

LAV (L1) Estimation Start with LS Solution Start Iter: gamma=284.75134813 ActEqn=395 Iter N Huber Act Eqn Rank Gamma L1(x) F(Gamma) 31 36 5 5 0.0000 36268.1049 36240.6335 Algorithm converged Objective Function L1(x)= 36268.104941 Number Search Directions= 68 Number Refactorizations = 2 Total Execution Time= 0.0200 Necessary and sufficient optimality condition satisfied. L1 Solution with ASE Est 17.1505029 1.2690656888 7.2240793844 5.3238408715 -0.124573396 ASE 123.531545 6.3559394265 2.2262729207 0.602359504 0.0391932684

The
coefficient and standard error for **acs_k3** are considerably different
as compared to OLS
(the coefficients are 1.2 vs 6.9 and the standard errors are 6.4 vs 4.3). The coefficients
and standard errors for the other variables are also different, but not as dramatically
different. Nevertheless, the quantile regression results indicate that, like the OLS
results, all of the variables except **acs_k3** are significant.
Using the data set _temp_ we created above we obtain a plot of residuals vs.
predicted values shown below.

symbol v=star h=0.8 c=blue; axis1 order = (-300 to 300 by 100) label=(a=90) minor=none; axis2 order = (300 to 900 by 300) minor=none; proc gplot data = _temp_; plot resid*pred = 1 / vaxis=axis1 haxis=axis2 vref=0; label resid="Residual"; label pred = "Fitted Value"; run; quit;

**4.2 Constrained Linear Regression**

Let's begin this section by looking at a regression model using the ** hsb2** dataset.
The **hsb2** file is a sample of 200 cases from the Highschool and Beyond
Study (Rock, Hilton, Pollack, Ekstrom & Goertz, 1985). It includes the
following variables: **id** **female** **race** **ses** **schtyp**
**program** **read** **write** **math** **science** **socst**.
The variables **read** **write** **math** **science** **socst**
are the results of standardized tests on reading, writing, math, science and
social studies (respectively), and the variable **female** is coded 1 if
female, 0 if male.

Let's start by doing an OLS regression where we predict **socst** score
from **read**, **write**, **math**, **science**
and **female** (gender)

proc reg data="c:\sasreg\hsb2"; model socst = read write math science female ; run;

The REG Procedure Model: MODEL1 Dependent Variable: socst Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 5 10949 2189.85150 35.44 <.0001 Error 194 11987 61.78834 Corrected Total 199 22936 Root MSE 7.86056 R-Square 0.4774 Dependent Mean 52.40500 Adj R-Sq 0.4639 Coeff Var 14.99963 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 7.33934 3.65024 2.01 0.0457 read 1 0.37840 0.08063 4.69 <.0001 write 1 0.38587 0.08893 4.34 <.0001 math 1 0.13033 0.08938 1.46 0.1464 science 1 -0.03339 0.08187 -0.41 0.6838 female 1 -0.35326 1.24537 -0.28 0.7770

Notice that the coefficients for **read** and **write** are very similar, which
makes sense since they are both measures of language ability. Also, the coefficients
for **math** and **science** are similar (in that they are both
not significantly different from 0). Suppose that we have a theory that suggests that **read**
and **write** and **math** should have equal coefficients. We can test the equality
of the coefficients using the **test** command.

test read = write; run;

Test 1 Results for Dependent Variable socst Mean Source DF Square F Value Pr > F Numerator 1 0.19057 0.00 0.9558 Denominator 194 61.78834

The test result indicates that there is no significant difference in the
coefficients for the reading and writing scores. Since it appears that the coefficients
for **math** and **science** are also equal, let's test the
equality of those as well.

test math = science; run;

Test 2 Results for Dependent Variable socst Mean Source DF Square F Value Pr > F Numerator 1 89.63950 1.45 0.2299 Denominator 194 61.78834

Let's now perform both of these tests together, simultaneously testing that the
coefficient for **read** equals **write** and **math**
equals **science **using **mtest** statement.

mtest math - science, read - write; run;

Multivariate Test 1 Multivariate Statistics and Exact F Statistics S=1 M=0 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.99257173 0.73 2 194 0.4852 Pillai's Trace 0.00742827 0.73 2 194 0.4852 Hotelling-Lawley Trace 0.00748386 0.73 2 194 0.4852 Roy's Greatest Root 0.00748386 0.73 2 194 0.4852

Note this second test has 2 df, since it is testing both of the hypotheses listed, and
this test is not significant, suggesting these pairs of coefficients are not significantly
different from each other. We can estimate regression models where we constrain
coefficients to be equal to each other. For example, let's begin on a limited scale
and constrain **read** to equal **write**. **Proc reg** uses **restrict
**statement to accomplish this.

proc reg data="c:\sasreg\hsb2"; model socst = read write math science female ; restrict read=write; run;

The REG Procedure Model: MODEL1 Dependent Variable: socst NOTE: Restrictions have been applied to parameter estimates. Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 4 10949 2737.26674 44.53 <.0001 Error 195 11987 61.47245 Corrected Total 199 22936 Root MSE 7.84044 R-Square 0.4774 Dependent Mean 52.40500 Adj R-Sq 0.4667 Coeff Var 14.96124 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 7.35415 3.63118 2.03 0.0442 read 1 0.38185 0.05139 7.43 <.0001 write 1 0.38185 0.05139 7.43 <.0001 math 1 0.13030 0.08915 1.46 0.1454 science 1 -0.03328 0.08164 -0.41 0.6840 female 1 -0.32962 1.16736 -0.28 0.7780 RESTRICT -1 -25.51302 458.21611 -0.06 0.9558* * Probability computed using beta distribution.

Notice that the coefficients for **read** and **write** are identical, along with
their standard errors, t-test, etc. Also note that the degrees of freedom for the F test
is four, not five, as in the OLS model. This is because only one coefficient is estimated
for **read** and **write**, estimated like a single variable equal to the sum of
their values. Notice also that the Root MSE is slightly higher for the constrained
model, but only slightly higher. This is because we have forced the model to
estimate the coefficients for **read** and **write** that are
not as good at minimizing the Sum of Squares Error (the coefficients that would minimize
the SSE would be the coefficients from the unconstrained model).

Next, we will define a second constraint, setting **math** equal to **science
**together with the first constraint we set before.

proc reg data="c:\sasreg\hsb2"; model socst = read write math science female ; restrict read = write, math = science; run;

The REG Procedure Model: MODEL1 Dependent Variable: socst NOTE: Restrictions have been applied to parameter estimates. Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 10860 3619.84965 58.75 <.0001 Error 196 12077 61.61554 Corrected Total 199 22936 Root MSE 7.84956 R-Square 0.4735 Dependent Mean 52.40500 Adj R-Sq 0.4654 Coeff Var 14.97864 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 7.50566 3.63323 2.07 0.0402 read 1 0.38604 0.05133 7.52 <.0001 write 1 0.38604 0.05133 7.52 <.0001 math 1 0.04281 0.05192 0.82 0.4107 science 1 0.04281 0.05192 0.82 0.4107 female 1 -0.20087 1.16383 -0.17 0.8631 RESTRICT -1 -15.36285 458.82638 -0.03 0.9734* RESTRICT -1 547.24384 454.01563 1.21 0.2290* * Probability computed using beta distribution.

Now the coefficients for **read** = **write** and **math** = **science**
and the degrees of freedom for the model has dropped to three. Again, the Root MSE
is slightly larger than in the prior model, but we should emphasize only very slightly
larger. If indeed the population coefficients for **read** = **write**
and **math** = **science**, then these combined (constrained) estimates
may be more stable and generalize better to other samples. So although these
estimates may lead to slightly higher standard error of prediction in this sample, they
may generalize better to the population from which they came.

**4.3 Regression with Censored or Truncated Data **

Analyzing data that contain censored values or are truncated is common in many research disciplines. According to Hosmer and Lemeshow (1999), a censored value is one whose value is incomplete due to random factors for each subject. A truncated observation, on the other hand, is one which is incomplete due to a selection process in the design of the study.

We will begin by looking at analyzing data with censored values.

**4.3.1 Regression with Censored Data**

In this example we have a variable called **acadindx** which is a weighted
combination of standardized test scores and academic grades. The maximum possible score on
**acadindx** is 200 but it is clear that the 16 students who scored 200 are not exactly
equal in their academic abilities. In other words, there is variability in academic
ability that is not being accounted for when students score 200 on acadindx. The variable **acadindx**
is said to be censored, in particular, it is right censored.

Let's look at the example. We will begin by looking at a description of the data, some descriptive statistics, and correlations among the variables.

proc means data = "c:\sasreg\acadindx"; run;

The MEANS Procedure Variable N Mean Std Dev Minimum Maximum ------------------------------------------------------------------------------- id 200 100.5000000 57.8791845 1.0000000 200.0000000 female 200 0.5450000 0.4992205 0 1.0000000 reading 200 52.2300000 10.2529368 28.0000000 76.0000000 writing 200 52.7750000 9.4785860 31.0000000 67.0000000 acadindx 200 172.1850000 16.8173987 138.0000000 200.0000000 -------------------------------------------------------------------------------

proc means data = "c:\sasreg\acadindx" N; where acadindx = 200; run;

The MEANS Procedure Variable N --------------- id 16 female 16 reading 16 writing 16 acadindx 16 ---------------

proc corr data = "c:\sasreg\acadindx" nosimple noprob; var acadindx female reading writing; run;

The CORR Procedure 4 Variables: acadindx female reading writing Pearson Correlation Coefficients, N = 200 acadindx female reading writing acadindx 1.00000 -0.08210 0.71309 0.66256 female -0.08210 1.00000 -0.05308 0.25649 reading 0.71309 -0.05308 1.00000 0.59678 writing 0.66256 0.25649 0.59678 1.00000

Now, let's run a standard OLS regression on the data and generate predicted scores in **p1**.

proc reg data = "c:\sasreg\acadindx"; model acadindx =female reading writing; output out = reg1 p = p1; run; quit;

The REG Procedure Model: MODEL1 Dependent Variable: acadindx Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 34994 11665 107.40 <.0001 Error 196 21288 108.61160 Corrected Total 199 56282 Root MSE 10.42169 R-Square 0.6218 Dependent Mean 172.18500 Adj R-Sq 0.6160 Coeff Var 6.05261 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 96.11841 4.48956 21.41 <.0001 female 1 -5.83250 1.58821 -3.67 0.0003 reading 1 0.71842 0.09315 7.71 <.0001 writing 1 0.79057 0.10410 7.59 <.0001

The **proc** **lifereg ** is one of the procedures in SAS that can be used for regression with
censored data. The syntax of the command is similar to **proc reg** with the addition of the
variable indicating if an observation is censored. Therefore, we have to create
a data set with the information on censoring.

data tobit_model; set "c:\sasreg\acadindx"; censor = ( acadindx >= 200 ); run; proc lifereg data = tobit_model; model acadindx*censor(1) = female reading writing /d=normal; output out = reg2 p = p2; run;

The LIFEREG Procedure Model Information Data Set WORK.TOBIT_MODEL Dependent Variable acadindx Censoring Variable censor Censoring Value(s) 1 Number of Observations 200 Noncensored Values 184 Right Censored Values 16 Left Censored Values 0 Interval Censored Values 0 Name of Distribution Normal Log Likelihood -718.0636168 Algorithm converged. Type III Analysis of Effects Wald Effect DF Chi-Square Pr > ChiSq female 1 14.0654 0.0002 reading 1 60.8529 <.0001 writing 1 54.1655 <.0001 Analysis of Parameter Estimates Standard 95% Confidence Chi- Parameter DF Estimate Error Limits Square Pr > ChiSq Intercept 1 92.7378 4.8034 83.3233 102.1524 372.74 <.0001 female 1 -6.3473 1.6924 -9.6644 -3.0302 14.07 0.0002 reading 1 0.7777 0.0997 0.5823 0.9731 60.85 <.0001 writing 1 0.8111 0.1102 0.5951 1.0271 54.17 <.0001 Scale 1 10.9897 0.5817 9.9067 12.1912

Let's merge the two data sets we created together to compare the predicted
score **p1** and **p2. **

data compare; merge reg1 reg2; by id; run; proc means data = compare; var acadindx p1 p2; run;

The MEANS Procedure Variable N Mean Std Dev Minimum Maximum ------------------------------------------------------------------------------- acadindx 200 172.1850000 16.8173987 138.0000000 200.0000000 p1 200 172.1850000 13.2608696 142.3820725 201.5311070 p2 200 172.7040275 14.0029221 141.2210940 203.8540576 -------------------------------------------------------------------------------

It shows that the censored regression model predicted
values have a larger standard deviation and a greater range of values. When we look at a listing of **p1** and **p2** for all students who scored the
maximum of 200 on acadindx, we see that in every case the censored regression
model predicted value is
greater than the OLS predicted value. These predictions represent an estimate of what the
variability would be if the values of **acadindx** could exceed 200.

proc print data = compare; var acadindx p1 p2; where acadindx = 200; run;

Obs acadindx p1 p2 32 200 179.175 179.620 57 200 192.681 194.329 68 200 201.531 203.854 80 200 191.831 193.577 82 200 188.154 189.563 88 200 186.573 187.940 95 200 195.997 198.176 100 200 186.933 188.108 132 200 197.578 199.798 136 200 189.459 191.144 143 200 191.185 192.833 157 200 191.614 193.477 161 200 180.251 181.008 169 200 182.275 183.367 174 200 191.614 193.477 200 200 187.662 189.421

**Proc lifereg ** handles right censored, left censored and
interval censored data. There are two other commands in SAS that perform
censored regression analysis such as **proc qlim**.

**4.3.2 Regression with Truncated Data**

Truncated data occurs when some observations are not included in the analysis because
of the value of the variable. We will illustrate analysis with truncation using the
dataset, **acadindx**, that was used in the previous section.

Let's imagine that in order to get into a special honors program, students need to
score at least 160 on **acadindx**. So we will drop all observations in which the value
of **acadindx** is less than or equal 160. Now, let's estimate the same model that we used in the section on censored data, only
this time we will pretend that a 200 for **acadindx** is not censored.

proc reg data = "c:\sasreg\acadindx"; model acadindx = female reading writing; where acadindx >160; run; quit;

The REG Procedure Model: MODEL1 Dependent Variable: acadindx Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 8074.79638 2691.59879 33.01 <.0001 Error 140 11416 81.54545 Corrected Total 143 19491 Root MSE 9.03025 R-Square 0.4143 Dependent Mean 180.42361 Adj R-Sq 0.4017 Coeff Var 5.00503 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 125.63547 5.89156 21.32 <.0001 female 1 -5.23849 1.61563 -3.24 0.0015 reading 1 0.44111 0.09635 4.58 <.0001 writing 1 0.58733 0.11508 5.10 <.0001

It is clear that the estimates of the coefficients are distorted due to the fact that
53 observations are no longer in the dataset. This amounts to restriction of range on both
the response variable and the predictor variables. For example, the coefficient for
writing dropped from .79 to .58. What this means is that if our goal is to find the
relation between **adadindx** and the predictor variables in the populations, then the
truncation of **acadindx** in our sample is going to lead to biased estimates. A better
approach to analyzing these data is to use truncated regression. In SAS this can be
accomplished using **proc qlim**. **Proc qlim **is an experimental
procedure first available in SAS version 8.1. **Proc qlim **(Qualitative and
LImited dependent variable model) analyzes univariate (and multivariate) limited
dependent variable models where dependent variables takes discrete values or
dependent variables are observed only in a limited range of values.

data trunc_model; set "c:\sasreg\acadindx"; y = .; if acadindx > 160 & acadindx ~=. then y = acadindx; run; proc qlim data=trunc_model ; model y = female reading writing; endogenous y~ truncated (lb=160); run;

The QLIM Procedure Summary Statistics of Continuous Responses N Obs N Obs Standard Lower Upper Lower Upper Variable Mean Error Type Bound Bound Bound Bound y 180.4236 11.674837 Truncated 160 Model Fit Summary Number of Endogenous Variables 1 Endogenous Variable y Number of Observations 144 Missing Values 56 Log Likelihood -510.00768 Maximum Absolute Gradient 3.87471E-6 Number of Iterations 14 AIC 1030 Schwarz Criterion 1045 Algorithm converged. Parameter Estimates Standard Approx Parameter Estimate Error t Value Pr > |t| Intercept 110.289206 8.673847 12.72 <.0001 female -6.099602 1.925245 -3.17 0.0015 reading 0.518179 0.116829 4.44 <.0001 writing 0.766164 0.152620 5.02 <.0001 _Sigma 9.803571 0.721646 13.59 <.0001

The coefficients from the **proc qlim** are closer to the OLS results, for
example the coefficient for **writing** is .77 which is closer to the OLS
results of .79. However, the results are still somewhat different on the other
variables, for example the coefficient for reading is .52 in the **proc
qlim** as compared to .72 in the original OLS with the unrestricted data, and better than the OLS
estimate of .47 with the restricted data. While **proc qlim **may
improve the estimates on a restricted data file as compared to OLS, it is certainly no
substitute for analyzing the complete unrestricted data file.

**4.4 Regression with Measurement Error**

As you will most likely recall, one of the assumptions of regression is that the predictor variables are measured without error. The problem is that measurement error in predictor variables leads to under estimation of the regression coefficients.

This section is under development.

**4.5 Multiple Equation Regression Models**

If a dataset has enough variables we may want to estimate more than one regression model. For example, we may want to predict y1 from x1 and also predict y2 from x2. Even though there are no variables in common these two models are not independent of one another because the data come from the same subjects. This is an example of one type multiple equation regression known as seemly unrelated regression.. We can estimate the coefficients and obtain standard errors taking into account the correlated errors in the two models. An important feature of multiple equation modes is that we can test predictors across equations.

Another example of multiple equation regression is if we wished to predict y1, y2 and y3 from x1 and x2. This is a three equation system, known as multivariate regression, with the same predictor variables for each model. Again, we have the capability of testing coefficients across the different equations.

Multiple equation models are a powerful extension to our data analysis tool kit.

**4.5.1 Seemingly Unrelated Regression**

Let's continue using the **hsb2** data file to illustrate the use of
seemingly unrelated regression. This time let's look at two regression models.

science = math female write = read female

It is the case that the errors (residuals) from these two models would be correlated. This
would be true even if the predictor female were not found in both models. The errors would
be correlated because all of the values of the variables are collected on the same set of
observations. This is a situation tailor made for seemingly unrelated regression using the
**proc syslin **with option **sur**. With the **proc syslin** we can estimate both models simultaneously while
accounting for the correlated errors at the same time, leading to efficient estimates of
the coefficients and standard errors. The syntax is as follows.

proc syslin data="c:\sasreg\hsb2" sur ; science: model science = math female ; write: model write = read female ; run;

The first part of the output consists of the OLS estimate for each model. Here is the OLS estimate for the first model.

The SYSLIN Procedure Ordinary Least Squares Estimation Model SCIENCE Dependent Variable science Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 2 7993.550 3996.775 68.38 <.0001 Error 197 11513.95 58.44645 Corrected Total 199 19507.50 Root MSE 7.64503 R-Square 0.40977 Dependent Mean 51.85000 Adj R-Sq 0.40378 Coeff Var 14.74451 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 18.11813 3.167133 5.72 <.0001 math 1 0.663190 0.057872 11.46 <.0001 female 1 -2.16840 1.086043 -2.00 0.0472

And here is OLS estimate for the second model.

The SYSLIN Procedure Ordinary Least Squares Estimation Model WRITE Dependent Variable write Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 2 7856.321 3928.161 77.21 <.0001 Error 197 10022.55 50.87591 Corrected Total 199 17878.88 Root MSE 7.13273 R-Square 0.43942 Dependent Mean 52.77500 Adj R-Sq 0.43373 Coeff Var 13.51537 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 20.22837 2.713756 7.45 <.0001 read 1 0.565887 0.049385 11.46 <.0001 female 1 5.486894 1.014261 5.41 <.0001

**Proc syslin **with option **sur**
also gives an estimate of the correlation between the errors of the two models.
Here is the corresponding output.

The SYSLIN Procedure Seemingly Unrelated Regression Estimation Cross Model Covariance SCIENCE WRITE SCIENCE 58.4464 7.8908 WRITE 7.8908 50.8759 Cross Model Correlation SCIENCE WRITE SCIENCE 1.00000 0.14471 WRITE 0.14471 1.00000 Cross Model Inverse Correlation SCIENCE WRITE SCIENCE 1.02139 -0.14780 WRITE -0.14780 1.02139 Cross Model Inverse Covariance SCIENCE WRITE SCIENCE 0.017476 -.002710 WRITE -.002710 0.020076 System Weighted MSE 0.9981 Degrees of freedom 394 System Weighted R-Square 0.3875Finally, we have the seemingly unrelated regression estimation for our models. Note that both the estimates of the coefficients and their standard errors are different from the OLS model estimates shown above.

The SYSLIN Procedure Seemingly Unrelated Regression Estimation Model SCIENCE Dependent Variable science Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 20.13265 3.149485 6.39 <.0001 math 1 0.625141 0.057528 10.87 <.0001 female 1 -2.18934 1.086038 -2.02 0.0452 Model WRITE Dependent Variable write Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 21.83439 2.698827 8.09 <.0001 read 1 0.535484 0.049091 10.91 <.0001 female 1 5.453748 1.014244 5.38 <.0001

Now that we have estimated our models let's test the predictor variables. The test for **female**
combines information from both models. The tests for **math** and **read** are
actually equivalent to the t-tests above except that the results are displayed as
F-tests.

proc syslin data = "c:\sasreg\hsb2" sur ; science: model science = math female ; write: model write = read female ; female: stest science.female = write.female =0; math: stest science.math = 0; read: stest write.read = 0; run;

Test Results for Variable FEAMLE Num DF Den DF F Value Pr > F 2 394 18.48 0.0001 Test Results for Variable MATH Num DF Den DF F Value Pr > F 1 394 118.31 0.0001 Test Results for Variable READ Num DF Den DF F Value Pr > F 1 394 119.21 0.0001

Now, let's estimate 3 models where we use the same predictors in each model as shown below.

read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3

Here variable **prog1** and **prog3 **are dummy variables for the
variable **prog**. Let's generate these variables before estimating our three
models using **proc syslin**.

data hsb2; set "c:\sasreg\hsb2"; prog1 = (prog = 1); prog3 = (prog = 3); run; proc syslin data = hsb2 sur; model1: model read = female prog1 prog3; model2: model write = female prog1 prog3; model3: model math = female prog1 prog3; run;

The OLS regression estimate of our three models are as follows.

<some output omitted>

The SYSLIN Procedure Ordinary Least Squares Estimation ModelMODEL1Dependent VariablereadParameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 56.82950 1.170562 48.55 <.0001 female 1 -1.20858 1.327672 -0.91 0.3638 prog1 1 -6.42937 1.665893 -3.86 0.0002 prog3 1 -9.97687 1.606428 -6.21 <.0001 ModelMODEL2Dependent VariablewriteParameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 53.62162 1.042019 51.46 <.0001 female 1 4.771211 1.181876 4.04 <.0001 prog1 1 -4.83293 1.482956 -3.26 0.0013 prog3 1 -9.43807 1.430021 -6.60 <.0001 ModelMODEL3Dependent VariablemathParameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 57.10551 1.036890 55.07 <.0001 female 1 -0.67377 1.176059 -0.57 0.5674 prog1 1 -6.72394 1.475657 -4.56 <.0001 prog3 1 -10.3217 1.422983 -7.25 <.0001

These regressions provide fine estimates of the coefficients and standard errors but
these results assume the residuals of each analysis are completely independent of the
other. Also, if we wish to test **female**, we would have to do it three times and
would not be able to combine the information from all three tests into a single overall
test.

Now let's see the output of the estimate using seemingly unrelated regression.

The SYSLIN Procedure Seemingly Unrelated Regression Estimation

ModelMODEL1Dependent Variableread

Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 56.82950 1.170562 48.55 <.0001 female 1 -1.20858 1.327672 -0.91 0.3638 prog1 1 -6.42937 1.665893 -3.86 0.0002 prog3 1 -9.97687 1.606428 -6.21 <.0001 ModelMODEL2Dependent VariablewriteParameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 53.62162 1.042019 51.46 <.0001 female 1 4.771211 1.181876 4.04 <.0001 prog1 1 -4.83293 1.482956 -3.26 0.0013 prog3 1 -9.43807 1.430021 -6.60 <.0001 ModelMODEL3Dependent VariablemathParameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 57.10551 1.036890 55.07 <.0001 female 1 -0.67377 1.176059 -0.57 0.5674 prog1 1 -6.72394 1.475657 -4.56 <.0001 prog3 1 -10.3217 1.422983 -7.25 <.0001

Note that the coefficients are identical in the OLS results and in the seemingly unrelated regression estimate, however the standard errors are different, only slightly, due to the correlation among the residuals in the multiple equations.

In addition to getting more appropriate standard errors,
we can test the effects of the predictors across the equations. We can test the
hypothesis that the coefficient for **female** is 0 for all three outcome
variables, as shown below.

proc syslin data = hsb2 sur; model1: model read = female prog1 prog3; model2: model write = female prog1 prog3; model3: model math = female prog1 prog3; feamle: stest model1.female = model2.female = model3.female = 0; run;

Test Results for Variable FEAMLE Num DF Den DF F Value Pr > F 3 588 11.63 0.0001

We can also test the hypothesis that the coefficient for **female** is 0
for just **read** and **math**.

proc syslin data = hsb2 sur; model1: model read = female prog1 prog3; model2: model write = female prog1 prog3; model3: model math = female prog1 prog3; f1: stest model1.female = model3.female = 0; run;

Test Results for Variable F1 Num DF Den DF F Value Pr > F 2 588 0.42 0.6599

We can also test the hypothesis that the coefficients for **prog1 **and **prog3**
are 0 for all three outcome variables, as shown below.

proc syslin data = hsb2 sur; model1: model read = female prog1 prog3; model2: model write = female prog1 prog3; model3: model math = female prog1 prog3; progs: stest model1.prog1 = model2.prog1 = model3.prog1 =0, model1.prog3 = model2.prog3 = model3.prog3 =0 ; run;

Test Results for Variable PROGS Num DF Den DF F Value Pr > F 6 588 11.83 0.0001

**4.5.2 Multivariate Regression**

Let's now use multivariate regression using **proc reg** to look
at the same analysis say that we saw in the **proc syslin** example above,
estimating the following 3 models.

read = female prog1 prog3 write = female prog1 prog3 math = female prog1 prog3

Below we use **proc reg** to predict **read write** and **math**
from **female prog1** and **prog3**. Note that the top part of
the output is similar to the **sureg** output in that it gives an overall
summary of the model for each outcome variable, however the results are somewhat different
and the **sureg** uses a **Chi-Square** test for the overall fit
of the model, and **mvreg** uses an **F-test**. The lower part
of the output appears similar to the **sureg** output, however when you
compare the standard errors you see that the results are not the same. These standard errors
correspond to the OLS standard errors, so these results below do not take into account the
correlations among the residuals (as do the **sureg** results).

proc reg data =hsb2; model read write math = female prog1 prog3 ; run;

The REG Procedure [Some output omitted] Dependent Variable: read Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 56.82950 1.17056 48.55 <.0001 female 1 -1.20858 1.32767 -0.91 0.3638 prog1 1 -6.42937 1.66589 -3.86 0.0002 prog3 1 -9.97687 1.60643 -6.21 <.0001 Dependent Variable: write Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 53.62162 1.04202 51.46 <.0001 female 1 4.77121 1.18188 4.04 <.0001 prog1 1 -4.83293 1.48296 -3.26 0.0013 prog3 1 -9.43807 1.43002 -6.60 <.0001 Dependent Variable: math Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 57.10551 1.03689 55.07 <.0001 female 1 -0.67377 1.17606 -0.57 0.5674 prog1 1 -6.72394 1.47566 -4.56 <.0001 prog3 1 -10.32168 1.42298 -7.25 <.0001

Now, let's **test female**. Note, that **female** was statistically significant
in only one of the three equations. Using the **mtest** statement after **proc
reg** allows us to
test **female** across all three equations simultaneously. And, guess what? It is
significant. This is consistent with what we found using seemingly unrelated
regression estimation.

female: mtest female=0; run;

Multivariate Test: female Multivariate Statistics and Exact F Statistics S=1 M=0.5 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.84892448 11.51 3 194 <.0001 Pillai's Trace 0.15107552 11.51 3 194 <.0001 Hotelling-Lawley Trace 0.17796108 11.51 3 194 <.0001 Roy's Greatest Root 0.17796108 11.51 3 194 <.0001

We can also test **prog1** and **prog3**, both separately and combined. Remember
these are multivariate tests.

prog1: mtest prog1 = 0; run;

Multivariate Test: prog1 Multivariate Statistics and Exact F Statistics S=1 M=0.5 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.89429287 7.64 3 194 <.0001 Pillai's Trace 0.10570713 7.64 3 194 <.0001 Hotelling-Lawley Trace 0.11820192 7.64 3 194 <.0001 Roy's Greatest Root 0.11820192 7.64 3 194 <.0001

prog3: mtest prog3 = 0; run;

Multivariate Test: prog3 Multivariate Statistics and Exact F Statistics S=1 M=0.5 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.75267026 21.25 3 194 <.0001 Pillai's Trace 0.24732974 21.25 3 194 <.0001 Hotelling-Lawley Trace 0.32860304 21.25 3 194 <.0001 Roy's Greatest Root 0.32860304 21.25 3 194 <.0001

prog: mtest prog1 = prog3 =0; run; quit;

Multivariate Test: prog Multivariate Statistics and F Approximations S=2 M=0 N=96 Statistic Value F Value Num DF Den DF Pr > F Wilks' Lambda 0.73294667 10.87 6 388 <.0001 Pillai's Trace 0.26859190 10.08 6 390 <.0001 Hotelling-Lawley Trace 0.36225660 11.68 6 256.9 <.0001 Roy's Greatest Root 0.35636617 23.16 3 195 <.0001 NOTE: F Statistic for Roy's Greatest Root is an upper bound. NOTE: F Statistic for Wilks' Lambda is exact.

**Proc syslin** with **sur** option and **proc reg** both allow you to test
multi-equation models while taking into account the fact that the equations are not
independent. The **proc syslin ** with **sur** option allows you to get estimates for each
equation which adjust for the non-independence of the equations, and it allows you to
estimate equations which don't necessarily have the same predictors. By contrast,
**proc reg** is restricted to equations that have the same set of predictors, and the estimates it
provides for the individual equations are the same as the OLS estimates. However,
**proc reg** allows you to perform more
traditional multivariate tests of predictors.

**
4.6 Summary
**

This chapter has covered a variety of topics that go beyond ordinary least squares regression, but there still remain a variety of topics we wish we could have covered, including the analysis of survey data, dealing with missing data, panel data analysis, and more. And, for the topics we did cover, we wish we could have gone into even more detail. One of our main goals for this chapter was to help you be aware of some of the techniques that are available in SAS for analyzing data that do not fit the assumptions of OLS regression and some of the remedies that are possible. If you are a member of the UCLA research community, and you have further questions, we invite you to use our consulting services to discuss issues specific to your data analysis.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.