Help the Stat Consulting Group by giving a gift

Multivariate Regression Analysis

**Version info: **Code for this page was tested in Stata 12.

As the name implies, multivariate regression is a technique that estimates a single regression model with more than one outcome variable. When there is more than one predictor variable in a multivariate regression model, the model is a multivariate multiple regression.

**Please Note:** The purpose of this page is to show how to use various data analysis commands.
It does not cover all aspects of the research process which researchers are expected to do. In
particular, it does not cover data cleaning and checking, verification of assumptions, model
diagnostics and potential follow-up analyses.

Example 1. A researcher has collected data on three psychological variables, four academic variables (standardized test scores), and the type of educational program the student is in for 600 high school students. She is interested in how the set of psychological variables is related to the academic variables and the type of program the student is in.

Example 2. A doctor has collected data on cholesterol, blood pressure, and weight. She also collected data on the eating habits of the subjects (e.g., how many ounces of red meat, fish, dairy products, and chocolate consumed per week). She wants to investigate the relationship between the three measures of health and eating habits.

Example 3. A researcher is interested in determining what factors influence the health African Violet plants. She collects data on the average leaf diameter, the mass of the root ball, and the average diameter of the blooms, as well as how long the plant has been in its current container. For predictor variables, she measures several elements in the soil, as well as the amount of light and water each plant receives.

Let's look at the data (note that there are no missing values in this data set).

use http://www.ats.ucla.edu/stat/stata/dae/mvreg, clearsummarize locus_of_control self_concept motivation read write scienceVariable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- locus_of_c~l | 600 .0965333 .6702799 -1.995957 2.205511 self_concept | 600 .0049167 .7055125 -2.53275 2.093563 motivation | 600 .0038979 .8224 -2.746669 2.583752 read | 600 51.90183 10.10298 24.62001 80.58649 write | 600 52.38483 9.726455 20.06888 83.93482 -------------+-------------------------------------------------------- science | 600 51.76333 9.706179 21.98953 80.36942tabulate progprogram | type | Freq. Percent Cum. ------------+----------------------------------- general | 138 23.00 23.00 academic | 271 45.17 68.17 vocational | 191 31.83 100.00 ------------+----------------------------------- Total | 600 100.00correlate locus_of_control self_concept motivation(obs=600) | locus_~l self_c~t motiva~n -------------+--------------------------- locus_of_c~l | 1.0000 self_concept | 0.1712 1.0000 motivation | 0.2451 0.2886 1.0000correlate read write science(obs=600) | read write science -------------+--------------------------- read | 1.0000 write | 0.6286 1.0000 science | 0.6907 0.5691 1.0000

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.

- Multivariate multiple regression, the focus of this page.
- Separate OLS Regressions - You could analyze these data using separate OLS regression analyses for each outcome variable. The individual coefficients, as well as their standard errors will be the same as those produced by the multivariate regression. However, the OLS regressions will not produce multivariate results, nor will they allow for testing of coefficients across equations.
- Canonical correlation analysis might be feasible if you don't want to consider one set of variables as outcome variables and the other set as predictor variables.

To conduct a multivariate regression in Stata, we need to use two commands,
**manova** and **mvreg**. The **manova** command will indicate if
all of the equations, taken together, are statistically significant. The
F-ratios and p-values for four
multivariate criterion are given, including Wilks' lambda, Lawley-Hotelling
trace, Pillai's trace, and Roy's largest root. Next, we use the **mvreg**
command to obtain the coefficients, standard errors, etc., for each of the predictors in
each part of the
model. We will also show the use of the **test** command after the **
mvreg** command. The use of the **test** command is one of the
compelling reasons for conducting a multivariate regression analysis.

Below we run the **manova** command. Note the use of **c.** in front of the
names of the continuous outcome variables, this is part of the factor variable
syntax introduced in Stata 11. It is necessary to use the **c.** to identify
the continuous variables, because, by default, the **manova** command assumes all
predictor variables are categorical.

manova locus_of_control self_concept motivation = c.read c.write c.science progNumber of obs = 600 W = Wilks' lambda L = Lawley-Hotelling trace P = Pillai's trace R = Roy's largest root Source | Statistic df F(df1, df2) = F Prob>F -----------+-------------------------------------------------- Model | W 0.6992 5 15.0 1634.7 15.08 0.0000 a | P 0.3196 15.0 1782.0 14.17 0.0000 a | L 0.4035 15.0 1772.0 15.89 0.0000 a | R 0.3271 5.0 594.0 38.86 0.0000 u |-------------------------------------------------- Residual | 594 -----------+-------------------------------------------------- read | W 0.9764 1 3.0 592.0 4.76 0.0027 e | P 0.0236 3.0 592.0 4.76 0.0027 e | L 0.0241 3.0 592.0 4.76 0.0027 e | R 0.0241 3.0 592.0 4.76 0.0027 e |-------------------------------------------------- write | W 0.9474 1 3.0 592.0 10.96 0.0000 e | P 0.0526 3.0 592.0 10.96 0.0000 e | L 0.0555 3.0 592.0 10.96 0.0000 e | R 0.0555 3.0 592.0 10.96 0.0000 e |-------------------------------------------------- science | W 0.9834 1 3.0 592.0 3.33 0.0193 e | P 0.0166 3.0 592.0 3.33 0.0193 e | L 0.0169 3.0 592.0 3.33 0.0193 e | R 0.0169 3.0 592.0 3.33 0.0193 e |-------------------------------------------------- prog | W 0.8914 2 6.0 1184.0 11.67 0.0000 e | P 0.1086 6.0 1186.0 11.35 0.0000 a | L 0.1217 6.0 1182.0 11.99 0.0000 a | R 0.1209 3.0 593.0 23.89 0.0000 u |-------------------------------------------------- Residual | 594 -----------+-------------------------------------------------- Total | 599 -------------------------------------------------------------- e = exact, a = approximate, u = upper bound on F

- The tests for the overall mode, shown in the section labeled Model (under
Source), indicate that the model is statistically significant, regardless of the type of
multivariate criteria that is used (i.e. all of the p-values are less than 0.0001). If the
overall model was not statistically significant, you might want to modify it
before running
**mvreg**. - Below the overall model tests, are the multivariate tests for each of the predictor variables. Each of the predictors is statistically significant overall, regardless of which test is used.

We can use **mvreg** to obtain estimates of the coefficients in our model.
Normally **mvreg** requires the user to specify both outcome and predictor
variables, however, because we have just run the **manova** command, we can use the **mvreg** command, without
additional input, to run a multivariate regression corresponding to the model just
estimated by **maova** (note that this feature was introduced in Stata 11, if
you are using an earlier version of Stata, you'll need to use the full syntax for **mvreg**).

mvregEquation Obs Parms RMSE "R-sq" F P ---------------------------------------------------------------------- locus_of_c~l 600 6 .6069966 0.1868 27.28199 0.0000 self_concept 600 6 .6890684 0.0540 6.786094 0.0000 motivation 600 6 .761402 0.1500 20.96388 0.0000 ------------------------------------------------------------------------------ | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- locus_of_c~l | read | .0125046 .0037178 3.36 0.001 .005203 .0198062 write | .012145 .0033914 3.58 0.000 .0054845 .0188056 science | .0057615 .0036412 1.58 0.114 -.0013896 .0129126 | prog | 2 | .1277951 .063955 2.00 0.046 .0021896 .2534005 3 | .2516705 .0684699 3.68 0.000 .117198 .386143 | _cons | -1.624765 .1570053 -10.35 0.000 -1.933118 -1.316412 -------------+---------------------------------------------------------------- self_concept | read | .0013076 .0042205 0.31 0.757 -.0069812 .0095965 write | -.0042934 .0038499 -1.12 0.265 -.0118545 .0032676 science | .0053059 .0041335 1.28 0.200 -.0028121 .013424 | prog | 2 | .2764834 .0726023 3.81 0.000 .1338949 .4190719 3 | .4233592 .0777277 5.45 0.000 .2707047 .5760137 | _cons | -.3723412 .1782339 -2.09 0.037 -.7223865 -.0222958 -------------+---------------------------------------------------------------- motivation | read | .0096735 .0046635 2.07 0.038 .0005146 .0188325 write | .0175354 .004254 4.12 0.000 .0091807 .0258902 science | -.0090015 .0045674 -1.97 0.049 -.0179716 -.0000313 | prog | 2 | .3603294 .0802236 4.49 0.000 .2027729 .5178859 3 | .619696 .085887 7.22 0.000 .4510169 .7883751 | _cons | -1.310842 .1969437 -6.66 0.000 -1.697633 -.9240513 ------------------------------------------------------------------------------

- The output from the
**mvreg**command looks much like the output from the**regress**command, except that the output is for three equations (one for each outcome measure) instead of one. In addition to looking like the output from an OLS regression, the output is interpreted much like the output from an OLS regression. - The first table gives the number of observations, number of parameters, RMSE,
R-squared, F-ratio, and p-value for each of the three models.
- Looking at the column labeled P, we see that each of the three univariate models are statistically significant.
- In the column labeled R-sq, we see that the five predictor variables explain
19%, 5%, and 15% of the variance in the outcome variables
**locus_of_control**,**self_concept**, and**motivation**, respectively. (Note this value is a standard R-squared, not an adjusted R-squared.)

- The second table contains the coefficients, their standard errors, test statistic (t), p-values,
and 95% confidence interval, for each predictor variable in the model, grouped
by outcome. As mentioned above, the coefficients are interpreted in the
same way coefficients from an OLS regression are interpreted. For example, looking at the top of
the table, a one unit change in
**read**is associated with a 0.013 unit change in the predicted value of**locus_of_control**.

If you ran a separate OLS regression
for each outcome variable, you would get exactly the same coefficients, standard
errors, t- and
p-values, and confidence intervals as shown above. So why conduct a
multivariate regression? As we mentioned earlier, one of the advantages of using **mvreg** is that you
can conduct tests of the coefficients across the different outcome variables. (Please
note that many of these tests can be preformed after the **manova** command,
although the process can be more difficult because a series of contrasts needs
to be created.) In the
examples below, we test four different hypotheses.

For the first test, the null hypothesis is that the coefficients for the variable **read**
are equal to 0 in all three equations. (Note that this duplicates the
test for the variable **read** in the **manova** output above.)

test read( 1) [locus_of_control]read = 0 ( 2) [self_concept]read = 0 ( 3) [motivation]read = 0 F( 3, 594) = 4.78 Prob > F = 0.0027

The results of this test reject the null hypothesis that the coefficients for
**read** across the three equations are simultaneously equal to 0, in other
words, the coefficients for **read**, taken for all three outcomes together,
are statistically significant.

Second, we can test the null hypothesis that the coefficients for **prog**=2
(identified as **2.prog**) and **prog**=3 (identified as **3.prog**) are simultaneously equal to 0 in the
equation for **
locus_of_control**. When used to test the coefficients for dummy variables
that form a single categorical predictor, this type of test is sometimes called an overall test
for the effect of the categorical predictor (i.e. **prog**). Note that the variable name in brackets (i.e. **
locus_of_control**) indicates which equation the coefficient being tested
belongs to, with the equation identified by the name of the outcome variable.

test [locus_of_control]2.prog [locus_of_control]3.prog( 1) [locus_of_control]2.prog = 0 ( 2) [locus_of_control]3.prog = 0 F( 2, 594) = 6.83 Prob > F = 0.0012

The results of the above test indicate that the two coefficients together are
significantly different from 0, in other words, the overall effect of **prog**
on **locus_of_control**
is statistically significant.

The next example tests the null hypothesis that the coefficient for the variable
**write** in the equation with **
locus_of_control** as the outcome is equal to the coefficient for **write**
in the equation with **self_concept** as the outcome. The null hypothesis
printed by the **test** command is that the difference in the coefficients is 0,
which is another way of saying two coefficients are equal. Another way of
stating this null hypothesis is that,
that the effect of **write** on **locus_of_control** is equal to the
effect of **write** on **self_concept**.

test [locus_of_control]write = [self_concept]write( 1) [locus_of_control]write - [self_concept]write = 0 F( 1, 594) = 11.89 Prob > F = 0.0006

The results of this test indicate that the difference between the
coefficients for **write** with **locus_of_control** and **
self_concept** as the outcome is significantly different from 0, in other
words, the coefficients are significantly different.

For the final example, we test the null hypothesis that the
coefficient of **science** in the equation for **
locus_of_control** is equal to the coefficient for **science** in the
equation for **self_concept**, and that the coefficient for the variable
**write** in the equation with the outcome variable **
locus_of_control** equals the coefficient for **write** in the
equation with the outcome variable **self_concept**. We tested the
difference in the coefficients for **write** in the last example, so we can use
the **accum** option to add the test of the difference in coefficients
for **science**, allowing us to test both sets of coefficients at the
same time.

test [locus_of_control]science = [self_concept]science, accum( 1) [locus_of_control]write - [self_concept]write = 0 ( 2) [locus_of_control]science - [self_concept]science = 0 F( 2, 594) = 6.39 Prob > F = 0.0018

The results of the above test indicate that taken together the differences in the two sets of coefficients is statistically significant.

- The residuals from multivariate regression models are assumed to be multivariate normal. This is analogous to the assumption of normally distributed errors in univariate linear regression (i.e. ols regression).
- Multivariate regression analysis is not recommended for small samples.
- The outcome variables should be at least moderately correlated for the multivariate regression analysis to make sense.
- If the outcome variables are
dichotomous, then you will want to use either
**mvprobit**or**biprobit**.

Stata Online Manual

- Afifi, A., Clark, V. and May, S. (2004). Computer-Aided Multivariate Analysis. 4th ed. Boca Raton, Fl: Chapman & Hall/CRC.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.