
Regression Analysis

This page shows an example regression analysis with footnotes explaining the
output. These data were collected on 200 high school students and are
scores on various tests, including science, math, reading and social studies (**socst**).
The variable **female** is a dichotomous variable coded 1 if the student was
female and 0 if male.

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2
    (highschool and beyond (200 cases))

    regress science math female socst read

          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F( 4, 195)    =   46.69
           Model |  9543.72074     4  2385.93019           Prob > F      =  0.0000
        Residual |  9963.77926   195  51.0963039           R-squared     =  0.4892
    -------------+------------------------------           Adj R-squared =  0.4788
           Total |     19507.5   199  98.0276382           Root MSE      =  7.1482

    ------------------------------------------------------------------------------
         science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
          female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
           socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
            read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
           _cons |   12.32529   3.193557     3.86   0.000     6.026943    18.62364
    ------------------------------------------------------------------------------

      Source^{a} |     SS^{b}    df^{c}     MS^{d}
    -------------+------------------------------
           Model |  9543.72074     4  2385.93019
        Residual |  9963.77926   195  51.0963039
    -------------+------------------------------
           Total |     19507.5   199  98.0276382

b. **SS** - These are the Sum of Squares associated with the three sources of
variance, Total, Model and Residual.

d. **MS** - These are the Mean Squares, the Sum of Squares divided by their
respective DF.
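As a quick check on the ANOVA table, the Mean Squares can be recomputed by hand. The following is a minimal sketch in Python (not Stata), using only the SS and df values printed in the table; the dictionary names are illustrative.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss = {"Model": 9543.72074, "Residual": 9963.77926, "Total": 19507.5}
df = {"Model": 4, "Residual": 195, "Total": 199}

# Mean Square = Sum of Squares / degrees of freedom, source by source.
ms = {source: ss[source] / df[source] for source in ss}

for source, value in ms.items():
    print(f"{source:<9} MS = {value:.7f}")

# The Model and Residual rows add up to the Total row.
print("SS check:", abs(ss["Model"] + ss["Residual"] - ss["Total"]) < 1e-6)
print("df check:", df["Model"] + df["Residual"] == df["Total"])
```

The recomputed values agree with the MS column up to the rounding shown in the output.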

    Number of obs^{e} =     200
    F( 4, 195)^{f}    =   46.69
    Prob > F^{g}      =  0.0000
    R-squared^{h}     =  0.4892
    Adj R-squared^{i} =  0.4788
    Root MSE^{j}      =  7.1482

f. **F( 4, 195)** - This is the F-statistic: the Mean
Square Model (2385.93019) divided by the Mean Square Residual (51.0963039),
yielding F=46.69. The numbers in parentheses are the Model and Residual
degrees of freedom from the ANOVA table above.
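The division described above can be reproduced directly. This is a small Python sketch using the two Mean Squares from the ANOVA table; the variable names are illustrative.

```python
# Mean Squares copied from the ANOVA table.
ms_model = 2385.93019
ms_residual = 51.0963039

# F-statistic = Mean Square Model / Mean Square Residual.
f_stat = ms_model / ms_residual
print(f"F(4, 195) = {f_stat:.2f}")  # prints "F(4, 195) = 46.69"
```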

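g. **Prob > F** - This is the p-value associated with the above
F-statistic. It is used in testing the null hypothesis that all of the
model coefficients (other than the constant) are 0. Here the p-value of
0.0000 is smaller than 0.05, so that null hypothesis is rejected.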

h. **R-squared** - R-Squared is the proportion
of variance in the dependent variable (**science**) which can be explained by the
independent variables (**math,** **female**, **socst** and **read**).
This is an overall
measure of the strength of association and does not reflect the extent to which
any particular independent variable is associated with the dependent variable.
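R-squared can be recovered from the Sum of Squares column of the ANOVA table. The sketch below, in Python, assumes only the SS values printed above and computes the proportion two equivalent ways.

```python
# Sums of squares copied from the ANOVA table.
ss_model = 9543.72074
ss_residual = 9963.77926
ss_total = 19507.5

# Proportion of the total variation in science explained by the model.
r_squared = ss_model / ss_total

# Equivalent form: one minus the unexplained proportion.
r_squared_alt = 1 - ss_residual / ss_total

print(f"R-squared = {r_squared:.4f}")  # prints "R-squared = 0.4892"
```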

i. **Adj R-squared** - This is an adjustment of the R-squared that penalizes
the addition of extraneous predictors to the model. Adjusted R-squared is
computed using the formula 1 - (1 - Rsq)((N - 1) / (N - k - 1)), where N is
the number of observations and k is the number of predictors.
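Plugging this model's numbers into the adjusted R-squared formula gives the value in the output. A minimal Python sketch, with N = 200 observations and k = 4 predictors taken from the output and R-squared recomputed from the Sum of Squares to avoid rounding error:

```python
n_obs = 200        # Number of obs
k_predictors = 4   # math, female, socst, read

# Unrounded R-squared from the ANOVA table (SS Model / SS Total).
r_squared = 9543.72074 / 19507.5

# Adjusted R-squared = 1 - (1 - Rsq) * (N - 1) / (N - k - 1).
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k_predictors - 1)

print(f"Adj R-squared = {adj_r_squared:.4f}")  # prints "Adj R-squared = 0.4788"
```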

j. **Root MSE** - Root MSE is the standard
deviation of the error term, and is the square root of the Mean Square Residual
(or Error).
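The relationship between Root MSE and the Mean Square Residual is easy to confirm. A short Python sketch using the value from the ANOVA table:

```python
import math

# Mean Square Residual copied from the ANOVA table.
ms_residual = 51.0963039

# Root MSE is the square root of the Mean Square Residual.
root_mse = math.sqrt(ms_residual)

print(f"Root MSE = {root_mse:.4f}")  # prints "Root MSE = 7.1482"
```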

    ------------------------------------------------------------------------------
     science^{k} |  Coef.^{l} Std. Err.^{m}  t^{n}  P>|t|^{o}  [95% Conf. Interval]^{p}
    -------------+----------------------------------------------------------------
            math |   .3893102   .0741243     5.25   0.000      .243122    .5354983
          female |  -2.009765   1.022717    -1.97   0.051    -4.026772    .0072428
           socst |   .0498443    .062232     0.80   0.424    -.0728899    .1725784
            read |   .3352998   .0727788     4.61   0.000     .1917651    .4788345
           _cons |   12.32529   3.193557     3.86   0.000     6.026943    18.62364
    ------------------------------------------------------------------------------

k. **science** - This column shows the
dependent variable at the top (**science**) with the predictor variables below it
(**math,** **female**, **socst**, **read** and **_cons**).
The last variable (**_cons**) represents the
constant or intercept.

l. **Coef.** - These are the values for the regression equation for
predicting the dependent variable from the independent variables. The regression
equation is presented in many different ways, for example:

**Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4**

The column of estimates provides the values for b0, b1, b2, b3 and b4 for this equation.
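To make the equation concrete, the sketch below evaluates it in Python with the coefficients from the table. The function name `predict_science` and the example student (scores of 50 on each test) are hypothetical, chosen only for illustration.

```python
# Coefficients copied from the Coef. column (b0 is _cons).
COEF = {
    "_cons": 12.32529,
    "math": 0.3893102,
    "female": -2.009765,
    "socst": 0.0498443,
    "read": 0.3352998,
}

def predict_science(math_score, female, socst, read):
    """Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4."""
    return (COEF["_cons"]
            + COEF["math"] * math_score
            + COEF["female"] * female
            + COEF["socst"] * socst
            + COEF["read"] * read)

# Hypothetical students who differ only in the variable of interest.
male = predict_science(math_score=50, female=0, socst=50, read=50)
female = predict_science(math_score=50, female=1, socst=50, read=50)
one_more_math = predict_science(math_score=51, female=0, socst=50, read=50)

print(f"male: {male:.3f}, female: {female:.3f}")
print(f"female - male       = {female - male:.6f}")         # the female coefficient
print(f"one extra math point = {one_more_math - male:.7f}")  # the math coefficient
```

Changing one predictor by one unit while holding the others fixed changes the prediction by exactly that predictor's coefficient, which is the interpretation given for each variable.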

**math** - The coefficient is
.3893102. So for every unit
increase in **math**, a .3893102 unit increase in **science** is
predicted, holding all other variables constant.

**female** - For every unit increase in **female**, we expect
a 2.009765 unit decrease in the **science** score, holding all other variables constant. Since **female** is coded 0/1 (0=male,
1=female) the interpretation is more simply: for females, the predicted science
score would be about 2 points lower than for males with the same scores on the other tests.

**socst** - The coefficient for **socst** is .0498443.
So for every unit increase in **socst**, we expect an approximately .05 point
increase in the science score, holding all other variables constant.

**read** - The coefficient for **read** is .3352998.
So for every unit increase in **read**, we expect an approximately .34 point increase in the
science score, holding all other variables constant.

m. **Std. Err.** - These are the standard errors associated with the
coefficients.

n. **t** - These are the t-statistics used in testing whether a
given coefficient is significantly different from zero.
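Each t-statistic is the coefficient divided by its standard error. The following Python sketch recomputes the t column from the Coef. and Std. Err. columns of the table; the dictionary layout is illustrative.

```python
# (coefficient, standard error) pairs copied from the coefficient table.
estimates = {
    "math":   (0.3893102, 0.0741243),
    "female": (-2.009765, 1.022717),
    "socst":  (0.0498443, 0.062232),
    "read":   (0.3352998, 0.0727788),
    "_cons":  (12.32529, 3.193557),
}

# t = coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_stats.items():
    print(f"{name:<7} t = {t:6.2f}")
```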

o. **P>|t|** - This column shows the 2-tailed p-values used in testing
the null hypothesis that the coefficient (parameter) is 0. Using an
alpha of 0.05:

The coefficient for **math** (.3893102) is statistically significantly different from 0 because its p-value of 0.000 is smaller than 0.05.

The coefficient for **female** (-2.009765) is not statistically significantly different from 0 at the 0.05 level because its p-value of 0.051 is slightly larger than 0.05.

The coefficient for **socst** (.0498443) is not statistically significantly different from 0 because
its p-value of 0.424 is larger than 0.05.

The coefficient for **read** (.3352998) is statistically significant because its
p-value of 0.000 is less than 0.05.

The constant (**_cons**) is significantly different from 0 at
the 0.05 alpha level, with a p-value of 0.000.
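The 2-tailed p-values can be approximated from the t-statistics and the 195 residual degrees of freedom. The Python sketch below is not how Stata computes them; it assumes a simple Simpson's-rule integration of the t density, and the helper names `t_pdf` and `two_tailed_p` are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, n=2000):
    """P(|T| > |t|), integrating the density on [0, |t|] by Simpson's rule."""
    upper = abs(t)
    h = upper / n  # n must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 2 * (0.5 - area)

# (coefficient, standard error) pairs copied from the coefficient table.
estimates = {
    "math":   (0.3893102, 0.0741243),
    "female": (-2.009765, 1.022717),
    "socst":  (0.0498443, 0.062232),
    "read":   (0.3352998, 0.0727788),
    "_cons":  (12.32529, 3.193557),
}
RESIDUAL_DF = 195
ALPHA = 0.05

p_values = {name: two_tailed_p(coef / se, RESIDUAL_DF)
            for name, (coef, se) in estimates.items()}

for name, p in p_values.items():
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{name:<7} p = {p:.3f}  ({verdict} at alpha = {ALPHA})")
```

With these inputs, only **female** and **socst** come out above 0.05, matching the P>|t| column.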

p. **[95% Conf. Interval]** - These are the 95%
confidence intervals for the coefficients. The confidence intervals are
related to the p-values such that the coefficient will not be statistically
significant at alpha = 0.05 if the 95% confidence interval includes 0. These confidence intervals
can help you to put the estimate
from the coefficient into perspective by seeing how much the value could vary.
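Each interval is the coefficient plus or minus a critical t value times the standard error. The Python sketch below assumes a critical value of about 1.9722 (the 97.5th percentile of the t distribution with 195 degrees of freedom, hard-coded here as an approximation) and reproduces the intervals to within rounding.

```python
# Approximate 97.5th percentile of t with 195 df (assumed, hard-coded).
T_CRIT = 1.9722

# (coefficient, standard error) pairs copied from the coefficient table.
estimates = {
    "math":   (0.3893102, 0.0741243),
    "female": (-2.009765, 1.022717),
    "socst":  (0.0498443, 0.062232),
    "read":   (0.3352998, 0.0727788),
    "_cons":  (12.32529, 3.193557),
}

intervals = {}
for name, (coef, se) in estimates.items():
    margin = T_CRIT * se
    intervals[name] = (coef - margin, coef + margin)

for name, (low, high) in intervals.items():
    includes_zero = low <= 0 <= high
    print(f"{name:<7} [{low:10.6f}, {high:10.6f}]  includes 0: {includes_zero}")
```

The intervals for **female** and **socst** include 0, which agrees with their p-values being larger than 0.05.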
