### SAS Annotated Output Regression Analysis

This page shows an example regression analysis with footnotes explaining the output.  These data (hsb2) were collected on 200 high school students and are scores on various tests, including science, math, reading and social studies (socst).  The variable female is a dichotomous variable coded 1 if the student was female and 0 if male.

In the code below, the data = option on the proc reg statement tells SAS where to find the SAS data set to be used in the analysis.  On the model statement, we specify the regression model that we want to run, with the dependent variable (in this case, science) on the left of the equals sign, and the independent variables on the right-hand side.  We use the clb option after the slash on the model statement to get the 95% confidence limits of the parameter estimates.  The quit statement is included because proc reg is an interactive procedure, and quit tells SAS not to expect further proc reg statements.

proc reg data = "d:\hsb2";
model science = math female socst read / clb;
run;
quit;

The REG Procedure
Model: MODEL1
Dependent Variable: science science score

                             Analysis of Variance
                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     4     9543.72074     2385.93019      46.69    <.0001
Error                   195     9963.77926       51.09630
Corrected Total         199          19507

Root MSE              7.14817    R-Square     0.4892
Dependent Mean       51.85000    Adj R-Sq     0.4788
Coeff Var            13.78624

                                    Parameter Estimates
                                             Parameter       Standard
Variable     Label                   DF       Estimate          Error    t Value    Pr > |t|
Intercept    Intercept                1       12.32529        3.19356       3.86      0.0002
math         math score               1        0.38931        0.07412       5.25      <.0001
female                                1       -2.00976        1.02272      -1.97      0.0508
socst        social studies score     1        0.04984        0.06223       0.80      0.4241
read         reading score            1        0.33530        0.07278       4.61      <.0001

                         Parameter Estimates
Variable     Label                   DF       95% Confidence Limits
Intercept    Intercept                1        6.02694       18.62364
math         math score               1        0.24312        0.53550
female                                1       -4.02677        0.00724
socst        social studies score     1       -0.07289        0.17258
read         reading score            1        0.19177        0.47883

#### Anova Table

                             Analysis of Variance
                                    Sum of           Mean
Source(a)                DF(b)   Squares(c)      Square(d)  F Value(e)  Pr > F(f)
Model                     4     9543.72074     2385.93019      46.69    <.0001
Error                   195     9963.77926       51.09630
Corrected Total         199          19507

a. Source - Looking at the breakdown of variance in the outcome variable, these are the categories we will examine: Model, Error, and Corrected Total. The Total variance is partitioned into the variance which can be explained by the independent variables (Model) and the variance which is not explained by the independent variables (Error).

b. DF - These are the degrees of freedom associated with the sources of variance.  The total variance has N-1 degrees of freedom (200 - 1 = 199).  The model degrees of freedom correspond to the number of coefficients estimated minus 1.  Including the intercept, there are 5 coefficients, so the model has 5 - 1 = 4 degrees of freedom.  The Error degrees of freedom is the DF total minus the DF model, 199 - 4 = 195.

c. Sum of Squares - These are the Sum of Squares associated with the three sources of variance, Total, Model and Error.

d. Mean Square - These are the Mean Squares, the Sum of Squares divided by their respective DF.

e. F Value - This is the F-statistic: the Mean Square Model (2385.93019) divided by the Mean Square Error (51.09630), yielding F=46.69.

f. Pr > F - This is the p-value associated with the above F-statistic.  It is used in testing the null hypothesis that all of the model coefficients (other than the intercept) are 0.
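
The relationships described in footnotes d through f can be checked by hand. The data step below is a minimal sketch, not part of the original analysis: the data set name anova_check and the variable names are made up for illustration, and the sums of squares and degrees of freedom are typed in from the table above. It uses the probf function to get the upper-tail probability of the F distribution.

```sas
/* Sketch: recompute Mean Square, F Value and Pr > F from the ANOVA table. */
/* anova_check and all variable names are hypothetical.                    */
data anova_check;
    ss_model = 9543.72074;  df_model = 4;
    ss_error = 9963.77926;  df_error = 195;
    ms_model = ss_model / df_model;   /* Mean Square for Model */
    ms_error = ss_error / df_error;   /* Mean Square for Error */
    f_value  = ms_model / ms_error;   /* F statistic */
    /* probf returns the lower-tail probability, so subtract from 1 */
    p_value  = 1 - probf(f_value, df_model, df_error);
run;

proc print data = anova_check noobs;
run;
```

The printed mean squares and F value should agree with the table up to rounding, and the p-value should be far below .0001.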

#### Overall Model Fit

Root MSE(g)            7.14817    R-Square(j)   0.4892
Dependent Mean(h)     51.85000    Adj R-Sq(k)   0.4788
Coeff Var(i)          13.78624


g.  Root MSE - Root MSE is the standard deviation of the error term, and is the square root of the Mean Square Error.

h.  Dependent Mean - This is the mean of the dependent variable.

i.  Coeff Var - This is the coefficient of variation, which is a unit-less measure of variation in the data.  It is the root MSE divided by the mean of the dependent variable, multiplied by 100: (100*(7.15/51.85) = 13.79).

j. R-Square - R-Squared is the proportion of variance in the dependent variable (science) which can be explained by the independent variables (math, female, socst and read).  This is an overall measure of the strength of association and does not reflect the extent to which any particular independent variable is associated with the dependent variable.

k. Adj R-Sq - This is an adjustment of the R-squared that penalizes the addition of extraneous predictors to the model.  Adjusted R-squared is computed using the formula 1 - ((1 - Rsq)(N - 1) / (N - k - 1)), where N is the number of observations and k is the number of predictors.
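
As a rough check on footnotes g through k, the fit statistics can be rebuilt from the sums of squares in the ANOVA table. The following data step is only a sketch under that assumption; fit_check and the variable names are hypothetical, and the inputs are copied from the output above.

```sas
/* Sketch: recompute the overall fit statistics from the ANOVA table. */
/* fit_check and all variable names are hypothetical.                 */
data fit_check;
    ss_model = 9543.72074;
    ss_error = 9963.77926;
    ss_total = ss_model + ss_error;              /* Corrected Total */
    n = 200;                                     /* observations */
    k = 4;                                       /* predictors */
    dep_mean  = 51.85;                           /* mean of science */
    root_mse  = sqrt(ss_error / (n - k - 1));    /* square root of MSE */
    coeff_var = 100 * root_mse / dep_mean;       /* coefficient of variation */
    r_square  = ss_model / ss_total;             /* proportion explained */
    adj_r_sq  = 1 - (1 - r_square) * (n - 1) / (n - k - 1);
run;

proc print data = fit_check noobs;
run;
```

Working from the unrounded sums of squares rather than the rounded R-Square of 0.4892 matters slightly here: the adjusted value comes out near 0.4788 as reported, whereas plugging in 0.4892 gives about 0.4787.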

#### Parameter Estimates

                                    Parameter Estimates
                                             Parameter       Standard
Variable(l)  Label(m)               DF(n)   Estimate(o)      Error(p)  t Value(q)  Pr > |t|(r)
Intercept    Intercept                1       12.32529        3.19356       3.86      0.0002
math         math score               1        0.38931        0.07412       5.25      <.0001
female                                1       -2.00976        1.02272      -1.97      0.0508
socst        social studies score     1        0.04984        0.06223       0.80      0.4241
read         reading score            1        0.33530        0.07278       4.61      <.0001

                         Parameter Estimates
Variable(l)  Label(m)               DF(n)     95% Confidence Limits(s)
Intercept    Intercept                1        6.02694       18.62364
math         math score               1        0.24312        0.53550
female                                1       -4.02677        0.00724
socst        social studies score     1       -0.07289        0.17258
read         reading score            1        0.19177        0.47883

l. Variable - This column lists the terms in the model (Intercept, math, female, socst, read).  The first refers to the model intercept, the height of the regression line when it crosses the Y axis.  In other words, this is the predicted value of science when all of the predictor variables are 0.

m.  Label - This column gives the label for the variable.  Usually, variable labels are added when the data set is created so that it is clear what the variable is (as the name of the variable can sometimes be ambiguous).  SAS has labeled the variable Intercept for us by default.  Note that this variable is not added to the data set.

n.  DF - This column gives the degrees of freedom associated with each independent variable.  All continuous variables have one degree of freedom, as do binary variables (such as female).

o. Parameter Estimates - These are the values for the regression equation for predicting the dependent variable from the independent variables. The regression equation is presented in many different ways, for example:

Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4

The column of estimates provides the values for b0, b1, b2, b3 and b4 for this equation.

math - The coefficient for math is 0.38931.  So for every unit increase in math, a 0.38931 unit increase in science is predicted, holding all other variables constant.
female - For every unit increase in female, we expect a 2.00976 unit decrease in the science score, holding all other variables constant.  Since female is coded 0/1 (0=male, 1=female) the interpretation is more simply: for females, the predicted science score would be about 2 points lower than for males.
socst - The coefficient for socst is 0.04984.  So for every unit increase in socst, we expect an approximately .05 point increase in the science score, holding all other variables constant.
read - The coefficient for read is 0.33530.  So for every unit increase in read, we expect a .34 point increase in the science score, holding all other variables constant.
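
To make the equation concrete, the estimates can be plugged into the formula for Ypredicted. The data step below is an illustrative sketch: the data set name predict_example and the two students (one male, one female, each scoring 50 on math, socst and read) are invented for the example and do not come from hsb2.

```sas
/* Sketch: apply the fitted equation to two made-up students. */
/* predict_example and the input values are hypothetical.     */
data predict_example;
    input math female socst read;
    /* b0 + b1*math + b2*female + b3*socst + b4*read */
    pred_science = 12.32529 + 0.38931*math - 2.00976*female
                 + 0.04984*socst + 0.33530*read;
    datalines;
50 0 50 50
50 1 50 50
;
run;

proc print data = predict_example noobs;
run;
```

The two predictions come out at about 51.05 for the male student and 49.04 for the female student. They differ by exactly the female coefficient, which is the "holding all other variables constant" interpretation in action.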

p. Standard Error - These are the standard errors associated with the coefficients.

q. t Value - These are the t-statistics used in testing whether a given coefficient is significantly different from zero.
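
The output does not spell out how the t value is obtained, but under the usual relationship it is the parameter estimate divided by its standard error. As a worked illustration for math and female, using the values in the table:

$$
t_{\text{math}} = \frac{0.38931}{0.07412} \approx 5.25,
\qquad
t_{\text{female}} = \frac{-2.00976}{1.02272} \approx -1.97
$$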

r. Pr > |t| - This column shows the 2-tailed p-values used in testing the null hypothesis that the coefficient (parameter) is 0.   Using an alpha of 0.05:
The coefficient for math (0.38931) is significantly different from 0 because its p-value (<.0001) is smaller than 0.05.
The coefficient for female (-2.00976) is not statistically significant at the 0.05 level since its p-value (0.0508) is slightly greater than 0.05.
The coefficient for socst (0.04984) is not statistically significantly different from 0 because its p-value (0.4241) is larger than 0.05.
The coefficient for read (0.33530) is statistically significant because its p-value (<.0001) is less than 0.05.
The intercept (12.32529) is significantly different from 0 at the 0.05 alpha level because its p-value (0.0002) is less than 0.05.
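
The two-tailed p-values can be reproduced from the estimates and standard errors. This is a sketch under the assumption that each test uses a t distribution with the Error degrees of freedom (195); the data set name t_check and its variable names are hypothetical, and the numbers are typed in from the table above.

```sas
/* Sketch: recompute t Value and Pr > |t| for each coefficient. */
/* t_check and all variable names are hypothetical.             */
data t_check;
    length variable $ 9;                /* room for "Intercept" */
    input variable $ estimate stderr;
    df_error = 195;                     /* Error DF from the ANOVA table */
    t_value  = estimate / stderr;
    /* two-tailed p-value: double the upper-tail area beyond |t| */
    p_value  = 2 * (1 - probt(abs(t_value), df_error));
    datalines;
Intercept 12.32529 3.19356
math 0.38931 0.07412
female -2.00976 1.02272
socst 0.04984 0.06223
read 0.33530 0.07278
;
run;

proc print data = t_check noobs;
run;
```

Because the inputs are rounded to five decimals, the results should match the Pr > |t| column only up to rounding.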

s. 95% Confidence Limits - These are the 95% confidence intervals for the coefficients.  The confidence intervals are related to the p-values such that the coefficient will not be statistically significant at alpha = .05 if the 95% confidence interval includes zero.  These confidence intervals can help you to put the estimate from the coefficient into perspective by seeing how much the value could vary.
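
The confidence limits can be approximated as the estimate plus or minus a critical t value times the standard error. The sketch below assumes the critical value is the 97.5th percentile of a t distribution with 195 degrees of freedom (roughly 1.97), obtained with the tinv function; ci_check and the variable names are hypothetical.

```sas
/* Sketch: rebuild the 95% confidence limits for each coefficient. */
/* ci_check and all variable names are hypothetical.               */
data ci_check;
    length variable $ 9;                /* room for "Intercept" */
    input variable $ estimate stderr;
    t_crit = tinv(0.975, 195);          /* two-sided 95% critical value */
    lower  = estimate - t_crit * stderr;
    upper  = estimate + t_crit * stderr;
    datalines;
Intercept 12.32529 3.19356
math 0.38931 0.07412
female -2.00976 1.02272
socst 0.04984 0.06223
read 0.33530 0.07278
;
run;

proc print data = ci_check noobs;
run;
```

The interval for female runs from just below -4 to just above 0, which is why its coefficient narrowly misses significance at alpha = .05.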
