
Multivariate regression in SAS

In many ways, multivariate regression is similar to MANOVA. The hypotheses, the methods used to obtain the estimates, and the assumptions are all similar, and the multivariate test statistics are the same. The hypothesis being tested by a multivariate regression is that there is a joint linear effect of the set of predictors on the set of responses. Hence, the null hypothesis is that all of the slope coefficients are simultaneously zero. Note that the "set" of predictors may include no predictor or only one predictor, but usually it contains more.
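In matrix form, the model and null hypothesis described above can be written as follows. The symbols are notation introduced here for illustration and do not appear in the original text: $Y$ is the $n \times p$ matrix of responses, $X$ is the $n \times (q+1)$ design matrix with a leading column of ones, $B$ is the coefficient matrix, $E$ is the matrix of residuals, $\beta_0$ is the vector of $p$ intercepts, and $B_1$ is the $q \times p$ block of slopes.

```latex
Y = XB + E, \qquad
B = \begin{pmatrix} \beta_0^{\top} \\ B_1 \end{pmatrix}, \qquad
H_0 : B_1 = 0
```

Each column of $B$ holds the coefficients for one dependent variable, so $H_0$ states that every slope is zero for every response at once.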

The basic assumptions of multivariate regression are 1) multivariate normality of the residuals, 2) homogeneous variances of residuals conditional on predictors, 3) common covariance structure across observations, and 4) independent observations. Unfortunately, testing the first three assumptions is very difficult.

Currently, many of the common statistical packages, such as SAS and SPSS, do not offer a test of multivariate normality. However, you can see if your data are close to being multivariate normal by creating some graphs. First, check whether the residuals for each dependent variable are normal by themselves. This is necessary, but not sufficient, for multivariate normality. Next, create scatterplots of the residuals of one dependent variable against those of another. You want the points on the graph to form an ellipse (as opposed to a V-shape, a wedge-shape, or some other kind of shape). Remember that a circle is a special case of an ellipse. You would like the points to line up in a "flattened" ellipse, because the dependent variables are supposed to be correlated for MANOVA or multivariate regression to be the analysis of choice, but this is not necessary for multivariate normality.

Regarding the second assumption, homogeneity of variances, several tests are available. However, most of them are very sensitive to nonnormality. Fortunately, the F statistic is fairly robust against violations of this assumption. As for the third assumption, the covariance matrices are rarely equal. Monte Carlo studies have shown that keeping the number of observations (subjects) per group approximately equal is an effective way of ensuring that violations of this assumption will not be too problematic.

There is no statistical test for the independence of observations; that is an issue of methodology. Care should be taken to ensure that the observations are independent, because even small intraclass correlations can cause serious problems. For example, suppose an experimenter had three groups with 30 subjects per group and a small dependence between the observations, say an intraclass correlation of .10. The actual alpha value would be .4917, rather than the nominal .05.
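The graphical checks described above could be sketched in SAS roughly as follows, using the hsb2 variables from the example further down this page. This is a minimal sketch, not a definitive diagnostic routine: the output data set name `resids` and the residual variable names `r_read` and `r_socst` are hypothetical names chosen for illustration.

```sas
/* Sketch only: save the residuals for each dependent variable.
   The names resids, r_read and r_socst are hypothetical. */
proc reg data = "g:\SAS\hsb2";
  model read socst = write math science;
  output out = resids r = r_read r_socst;
run;
quit;

/* Univariate normality checks for each set of residuals
   (necessary, but not sufficient, for multivariate normality). */
proc univariate data = resids normal;
  var r_read r_socst;
  qqplot r_read r_socst / normal(mu = est sigma = est);
run;

/* Scatterplot of one set of residuals against the other,
   with an ellipse overlaid to help judge the shape of the cloud. */
proc sgplot data = resids;
  scatter x = r_read y = r_socst;
  ellipse x = r_read y = r_socst;
run;
```

With more than two dependent variables, the scatterplot step would be repeated for each pair of residual variables.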

If all of these assumptions are met, then the coefficients will be
unbiased, the least-squares estimates will have minimum variance, and the
relationships among the coefficients will reflect the relationships among the
predictors. Furthermore, a multivariate hypothesis test will account for the
relationship among the coefficients, whereas a univariate F test would not.

With all of this in mind, let's try a multivariate multiple regression. We will
use the hsb2 data set for our example,
and we will use **read** and **socst** as our dependent variables and
**write**, **math** and **science** as our independent variables. The
**proc reg** statement is the same as it would be in a univariate regression,
but the **model** statement is a little different: we now have two (we could
have more) dependent variables listed before the equals sign. Also, we have
included the **mtest** statement, which is used to test hypotheses in
multivariate regression. If no equations are listed on the **mtest**
statement, SAS tests the hypothesis that all coefficients except the intercept
are zero. You can specify some options on the **mtest** statement, including
**canprint**, which will print the canonical correlations for the hypothesis
combinations and the dependent variable combinations. The **details** option
will display the M matrix, and the **print** option will display the H and E
matrices.

    proc reg data = "g:\SAS\hsb2";
      model read socst = write math science;
      mtest / details print;
    run;
    quit;

    The REG Procedure
    Model: MODEL1
    Dependent Variable: read

                          Analysis of Variance

                                 Sum of          Mean
    Source              DF      Squares        Square   F Value   Pr > F
    Model                3        11313    3771.09916     76.94   <.0001
    Error              196   9606.12253      49.01083
    Corrected Total    199        20919

    Root MSE           7.00077    R-Square    0.5408
    Dependent Mean    52.23000    Adj R-Sq    0.5338
    Coeff Var         13.40374

                          Parameter Estimates

                      Parameter    Standard
    Variable    DF     Estimate       Error   t Value   Pr > |t|
    Intercept    1      4.36993     3.20878      1.36     0.1748
    write        1      0.23767     0.06969      3.41     0.0008
    math         1      0.37840     0.07463      5.07     <.0001
    science      1      0.29693     0.06763      4.39     <.0001

    The REG Procedure
    Model: MODEL1
    Dependent Variable: socst

                          Analysis of Variance

                                 Sum of          Mean
    Source              DF      Squares        Square   F Value   Pr > F
    Model                3   9551.66620    3183.88873     46.62   <.0001
    Error              196        13385      68.28841
    Corrected Total    199        22936

    Root MSE           8.26368    R-Square    0.4164
    Dependent Mean    52.40500    Adj R-Sq    0.4075
    Coeff Var         15.76888

                          Parameter Estimates

                      Parameter    Standard
    Variable    DF     Estimate       Error   t Value   Pr > |t|
    Intercept    1      8.86989     3.78763      2.34     0.0202
    write        1      0.46567     0.08227      5.66     <.0001
    math         1      0.27630     0.08810      3.14     0.0020
    science      1      0.08512     0.07984      1.07     0.2877

    The REG Procedure
    Model: MODEL1
    Multivariate Test 1

                  L Ginv(X'X) L'                              LB-cj
    0.0000991078   -0.000042904   -0.000028518    0.2376705687   0.4656741023
    -0.000042904   0.0001136529   -0.000044399    0.3784014963   0.2763008055
    -0.000028518   -0.000044399   0.0000933347    0.2969346843   0.0851168364

             Inv(L Ginv(X'X) L')                  Inv()(LB-cj)
    17878.875   10911.025    10653.25        11541.35   12247.225
    10911.025   17465.795    11642.35        12659.33   10897.755
     10653.25    11642.35     19507.5         12729.9     9838.15

    Error Matrix (E)
    9606.1225306   3657.5503071
    3657.5503071   13384.528803

    Hypothesis Matrix (H)
    11313.297469   9955.8196929
    9955.8196929   9551.6661967

    Hypothesis + Error Matrix (T)
    20919.42    13613.37
    13613.37   22936.195

    Eigenvectors
     0.004986   0.002488
    -0.007281   0.008053

    Eigenvalues
    0.587507
    0.051687

    The REG Procedure
    Model: MODEL1
    Multivariate Test 1

              Multivariate Statistics and F Approximations
                          S=2   M=0   N=96.5

    Statistic                      Value   F Value   Num DF   Den DF   Pr > F
    Wilks' Lambda             0.39117291     38.93        6      390   <.0001
    Pillai's Trace            0.63919333     30.69        6      392   <.0001
    Hotelling-Lawley Trace    1.47878554     47.94        6   258.23   <.0001
    Roy's Greatest Root       1.42428180     93.05        3      196   <.0001

    NOTE: F Statistic for Roy's Greatest Root is an upper bound.
    NOTE: F Statistic for Wilks' Lambda is exact.

Looking at the very bottom of the output, we can see that the overall model is
statistically significant. We can look at the first half of the output to see
the univariate results. Here we see that with only the dependent variable
**read**, the overall model is statistically significant, as is each of the
predictors. When we look at the univariate results for **socst**, we see that
the overall model is statistically significant, as are the predictors **write**
and **math**, but not **science**. In other words, the multivariate tests tell
us that the set of predictors accounts for a statistically significant portion
of the variance in the dependent variables, and the univariate tests break this
down for us so that we can see which predictors are significant for each
dependent variable.
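The four multivariate statistics at the bottom of the output can be recovered from the two eigenvalues printed just above them. As a rough check, write $\theta_1 = 0.587507$ and $\theta_2 = 0.051687$. The symbols are notation introduced here, and reading the printed values as eigenvalues of $H(H+E)^{-1}$ is an inference from the fact that the arithmetic reproduces the reported statistics.

```latex
\begin{aligned}
\text{Wilks' Lambda} &= (1-\theta_1)(1-\theta_2) = (0.412493)(0.948313) \approx 0.3912 \\
\text{Pillai's Trace} &= \theta_1 + \theta_2 = 0.587507 + 0.051687 \approx 0.6392 \\
\text{Hotelling-Lawley Trace} &= \frac{\theta_1}{1-\theta_1} + \frac{\theta_2}{1-\theta_2} \approx 1.4243 + 0.0545 = 1.4788 \\
\text{Roy's Greatest Root} &= \frac{\theta_1}{1-\theta_1} \approx 1.4243
\end{aligned}
```

These agree with the values SAS reports (0.39117291, 0.63919333, 1.47878554 and 1.42428180) to the precision of the printed eigenvalues.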

Let's run the same model again, but this time, we will specify some hypotheses
to be tested on the **mtest** statement. In the first **mtest** statement,
we will test the hypothesis that the parameter for **write** is the same for
**read** and **socst**. In the second **mtest** statement, we will test
the hypothesis that the parameter for **science** is the same for **read**
and **socst**. You will notice that, as with **test** statements in other
procs, we can use a label before the statement so that it is labeled in the
output.

    proc reg data = "g:\SAS\hsb2";
      model read socst = write math science;
      write: mtest read - socst, write / details print;
      science: mtest read - socst, science / details print;
    run;
    quit;

    The REG Procedure
    Model: MODEL1
    Dependent Variable: read

                          Analysis of Variance

                                 Sum of          Mean
    Source              DF      Squares        Square   F Value   Pr > F
    Model                3        11313    3771.09916     76.94   <.0001
    Error              196   9606.12253      49.01083
    Corrected Total    199        20919

    Root MSE           7.00077    R-Square    0.5408
    Dependent Mean    52.23000    Adj R-Sq    0.5338
    Coeff Var         13.40374

                          Parameter Estimates

                      Parameter    Standard
    Variable    DF     Estimate       Error   t Value   Pr > |t|
    Intercept    1      4.36993     3.20878      1.36     0.1748
    write        1      0.23767     0.06969      3.41     0.0008
    math         1      0.37840     0.07463      5.07     <.0001
    science      1      0.29693     0.06763      4.39     <.0001

    The REG Procedure
    Model: MODEL1
    Dependent Variable: socst

                          Analysis of Variance

                                 Sum of          Mean
    Source              DF      Squares        Square   F Value   Pr > F
    Model                3   9551.66620    3183.88873     46.62   <.0001
    Error              196        13385      68.28841
    Corrected Total    199        22936

    Root MSE           8.26368    R-Square    0.4164
    Dependent Mean    52.40500    Adj R-Sq    0.4075
    Coeff Var         15.76888

                          Parameter Estimates

                      Parameter    Standard
    Variable    DF     Estimate       Error   t Value   Pr > |t|
    Intercept    1      8.86989     3.78763      2.34     0.0202
    write        1      0.46567     0.08227      5.66     <.0001
    math         1      0.27630     0.08810      3.14     0.0020
    science      1      0.08512     0.07984      1.07     0.2877

    The REG Procedure
    Model: MODEL1
    Multivariate Test: write

              Multivariate Statistics and Exact F Statistics
                          S=1   M=-0.5   N=97

    Statistic                      Value   F Value   Num DF   Den DF   Pr > F
    Wilks' Lambda             0.96762141      6.56        1      196   0.0112
    Pillai's Trace            0.03237859      6.56        1      196   0.0112
    Hotelling-Lawley Trace    0.03346205      6.56        1      196   0.0112
    Roy's Greatest Root       0.03346205      6.56        1      196   0.0112

    The REG Procedure
    Model: MODEL1
    Multivariate Test: science

              Multivariate Statistics and Exact F Statistics
                          S=1   M=-0.5   N=97

    Statistic                      Value   F Value   Num DF   Den DF   Pr > F
    Wilks' Lambda             0.97024627      6.01        1      196   0.0151
    Pillai's Trace            0.02975373      6.01        1      196   0.0151
    Hotelling-Lawley Trace    0.03066616      6.01        1      196   0.0151
    Roy's Greatest Root       0.03066616      6.01        1      196   0.0151
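Because each of these two tests has S=1 with one numerator degree of freedom, all four statistics give the same exact F value, and it can be checked by hand from Wilks' Lambda. The formula below is a sketch for this special case only. It uses the numerator and denominator degrees of freedom printed in the output, and the symbol $\Lambda$ for Wilks' Lambda is notation introduced here.

```latex
\begin{aligned}
F &= \frac{1-\Lambda}{\Lambda}\cdot\frac{\text{Den DF}}{\text{Num DF}} \\
F_{\text{write}} &= \frac{1-0.96762141}{0.96762141}\cdot\frac{196}{1} \approx 0.033462 \times 196 \approx 6.56 \\
F_{\text{science}} &= \frac{1-0.97024627}{0.97024627}\cdot\frac{196}{1} \approx 0.030666 \times 196 \approx 6.01
\end{aligned}
```

The intermediate ratios 0.033462 and 0.030666 match the Hotelling-Lawley Trace values in the two tables.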

For the dependent variable **read**, the predictors **write**, **math**
and **science** are significant. For the dependent variable **socst**,
the predictors **write** and **math** are significant. The last two
pages of the output indicate that both of the hypothesis tests regarding the
parameters were statistically significant (F = 6.56, p = 0.0112 and F = 6.01,
p = 0.0151, respectively). Hence, based on the results of the first test
(which we labeled **write**), we would conclude that the parameter for
**write** is not the same for **read** and **socst**. The second test
(which we labeled **science**) suggests that the parameter for **science**
is not the same for **read** and **socst** either.
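As a possible follow-up, the two single-parameter tests could be complemented by one test across all of the predictors. The sketch below assumes the **mtest** behavior in which listing only an equation in the dependent variables tests whether all parameters except the intercept are the same for those dependent variables. The label `allequal` is a hypothetical name chosen for illustration, and **canprint** is the option mentioned earlier for printing canonical correlations. No output is shown because this model was not run for this page.

```sas
/* Sketch only: test whether every slope (write, math, science)
   is the same for read and socst.
   The label "allequal" is a hypothetical name. */
proc reg data = "g:\SAS\hsb2";
  model read socst = write math science;
  allequal: mtest read - socst / canprint;
run;
quit;
```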
