UCLA Academic Technology Services

Regression with SAS
Chapter 1 - Simple and Multiple Regression

Chapter Outline
    1.0 Introduction
    1.1 A First Regression Analysis
    1.2 Examining Data
    1.3 Simple linear regression
    1.4 Multiple regression
    1.5 Transforming variables
    1.6 Summary
    1.7 For more information

1.0 Introduction

This web book is composed of four chapters covering a variety of topics about using SAS for regression. We should emphasize that this book is about "data analysis" and that it demonstrates how SAS can be used for regression analysis, as opposed to a book that covers the statistical basis of multiple regression.  We assume that you have had at least one statistics course covering regression analysis and that you have a regression book that you can use as a reference (see the Regression With SAS page and our Statistics Books for Loan page for recommended regression analysis books). This book is designed to help you apply your knowledge of regression, combined with instruction on SAS, to perform, understand and interpret regression analyses.

This first chapter will cover topics in simple and multiple regression, as well as the supporting tasks that are important in preparing to analyze your data, e.g., data checking, getting familiar with your data file, and examining the distribution of your variables.  We will illustrate the basics of simple and multiple regression and demonstrate the importance of inspecting, checking and verifying your data before accepting the results of your analysis. In general, we hope to show that the results of your regression analysis can be misleading without further probing of your data, which could reveal relationships that a casual analysis could overlook. 

In this chapter, and in subsequent chapters, we will be using a data file that was created by randomly sampling 400 elementary schools from the California Department of Education's API 2000 dataset.  This data file contains a measure of school academic performance as well as other attributes of the elementary schools, such as, class size, enrollment, poverty, etc.

You can access this data file over the web by clicking on elemapi.sas7bdat, or by visiting the Regression with SAS page where you can download all of the data files used in all of the chapters of this book.  The examples assume you have stored your files in a folder called c:\sasreg. You can store the files in any folder you choose, but if you run these examples, be sure to change c:\sasreg\ to the name of the folder you have selected.

1.1 A First Regression Analysis

Let's dive right in and perform a regression analysis using the variables api00, acs_k3, meals and full. These measure the academic performance of the school (api00), the average class size in kindergarten through 3rd grade (acs_k3), the percentage of students receiving free meals (meals) - which is an indicator of poverty, and the percentage of teachers who have full teaching credentials (full). We expect that better academic performance would be associated with lower class size, fewer students receiving free meals, and a higher percentage of teachers having full teaching credentials.   Below, we use proc reg for running this regression model followed by the SAS output.
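The call described above can be sketched as follows. It assumes the data file sits in c:\sasreg, as stated earlier, and mirrors the proc reg step that is repeated later in this chapter.

```sas
/* Regress api00 on class size, free meals and teacher credentials */
proc reg data="c:\sasreg\elemapi";
  model api00 = acs_k3 meals full;
run;
```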

Let's focus on the three predictors, whether they are statistically significant and, if so, the direction of the relationship. The average class size (acs_k3, b=-2.68) is not significant (p=0.0553), but only just so, and the coefficient is negative, which would indicate that larger class sizes are related to lower academic performance -- which is what we would expect.   Next, the effect of meals (b=-3.70, p<.0001) is significant and its coefficient is negative, indicating that the greater the proportion of students receiving free meals, the lower the academic performance.  Please note that we are not saying that free meals are causing lower academic performance.  The meals variable is highly related to income level and functions more as a proxy for poverty. Thus, higher levels of poverty are associated with lower academic performance. This result also makes sense.  Finally, the percentage of teachers with full credentials (full, b=0.11, p=.2321) seems to be unrelated to academic performance. This would seem to indicate that the percentage of teachers with full credentials is not an important factor in predicting academic performance -- this result was somewhat unexpected.

Should we take these results and write them up for publication?  From these results, we would conclude that lower class sizes are related to higher performance, that fewer students receiving free meals is associated with higher performance, and that the percentage of teachers with full credentials was not related to academic performance in the schools.  Before we write this up for publication, we should do a number of checks to make sure we can firmly stand behind these results.  We start by getting more familiar with the data file, doing preliminary data checking, looking for errors in the data. 

1.2 Examining data

First, let's use proc contents to learn more about this data file.  We can verify how many observations it has and see the names of the variables it contains. 
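A minimal sketch of such a call, assuming the same c:\sasreg folder used elsewhere in this chapter:

```sas
/* List the number of observations and the variable names, types and labels */
proc contents data="c:\sasreg\elemapi";
run;
```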

We will not go into all of the details of this output.  Note that there are 400 observations and 21 variables.  We have variables about academic performance in 2000 and 1999 and the change in performance, api00, api99 and growth respectively. We also have various characteristics of the schools, e.g., class size, parents' education, percent of teachers with full and emergency credentials, and number of students.  Note that our original regression analysis reported 313 observations, but the proc contents output indicates that we have 400 observations in the data file.

If you want to learn more about the data file, you could use proc print to show some of the observations.  For example, below we use proc print to show the first five observations.
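One way to limit the listing to five observations is the obs= data set option, sketched here under the assumption that the file is stored in c:\sasreg:

```sas
/* Print every variable for the first five observations only */
proc print data="c:\sasreg\elemapi"(obs=5);
run;
```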

This takes up lots of space on the page, but does not give us a lot of information.  Listing our data can be very helpful, but it is more helpful if you list just the variables you are interested in.  Let's list the first 10 observations for the variables that we looked at in our first regression analysis.
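A sketch of the narrower listing, using a var statement to keep only the four variables from the first regression:

```sas
/* First 10 observations, restricted to the regression variables */
proc print data="c:\sasreg\elemapi"(obs=10);
  var api00 acs_k3 meals full;
run;
```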

We see that among the first 10 observations, we have four missing values for meals.  It is likely that the missing data for meals had something to do with the fact that the number of observations in our first regression analysis was 313 and not 400.

Another useful tool for learning about your variables is proc means. Below we use proc means to learn more about the variables api00, acs_k3, meals, and full.  
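A possible form of this step is sketched here. The explicit list of statistics (n, nmiss, mean, std, min, max) is our own choice to make the missing-value counts visible; the default proc means output already reports N, mean, standard deviation, minimum and maximum.

```sas
/* Summary statistics, including the count of missing values per variable */
proc means data="c:\sasreg\elemapi" n nmiss mean std min max;
  var api00 acs_k3 meals full;
run;
```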

We see that the api00 scores don't have any missing values (because the N is 400) and the scores range from 369-940.  This makes sense since the api scores can range from 200 to 1000.  We see that the average class size (acs_k3) had 398 valid values ranging from -21 to 25 and 2 are missing. It seems odd for a class size to be -21. The percent receiving free meals (meals) ranges from 6 to 100, but there are only 315 valid values (85 are missing). This seems like a large number of missing values.  The percent with full credentials (full) ranges from .42 to 100 with no missing.  

We can also use proc freq to learn more about any categorical variables, such as yr_rnd, as shown below.
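A minimal sketch of the frequency table request:

```sas
/* One-way frequency table for the year-round indicator */
proc freq data="c:\sasreg\elemapi";
  tables yr_rnd;
run;
```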

The variable yr_rnd is coded 0=No (not year round) and 1=Yes (year round).  Of the 400 schools, 308 are non-year round and 92 are year round, and none are missing.

The above commands have uncovered a number of peculiarities worthy of further examination. For example, let us look further into the average class size by getting more detailed summary statistics for acs_k3 using proc univariate.  
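A sketch of the proc univariate step; only part of what it prints (the quantiles, extreme observations and missing-value summary) is reproduced in this chapter.

```sas
/* Detailed summary statistics for average class size */
proc univariate data="c:\sasreg\elemapi";
  var acs_k3;
run;
```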

Quantiles (Definition 5)

Quantile      Estimate
100% Max            25
99%                 23
95%                 21
90%                 21
75% Q3              20
50% Median          19
25% Q1              18
10%                 17
5%                  16
1%                 -20
0% Min             -21

        Extreme Observations

----Lowest----        ----Highest---
Value      Obs        Value      Obs
  -21       43           22      365
  -21       42           23       36
  -21       41           23       79
  -20       40           23      361
  -20       38           25      274

               Missing Values

                       -----Percent Of-----
Missing                             Missing
  Value       Count     All Obs         Obs
      .           2        0.50      100.00

Looking in the section labeled Extreme Observations, we see some of the class sizes are -21 and -20, so it seems as though some of the class sizes somehow became negative, as though a negative sign was incorrectly typed in front of them.   Let's do a proc freq for class size to see if this seems plausible.
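The frequency table for class size could be requested like this (a sketch, same folder assumption as before):

```sas
/* Tabulate every distinct value of acs_k3 to see the negative entries */
proc freq data="c:\sasreg\elemapi";
  tables acs_k3;
run;
```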

Indeed, it seems that some of the class sizes somehow got negative signs put in front of them.  Let's look at the school and district number for these observations to see if they come from the same district.   Indeed, they all come from district 140.  
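A listing of this kind could be requested with a simple where condition, sketched here. Because SAS orders missing numeric values below every number, the condition acs_k3 < 0 also selects rows where acs_k3 is missing.

```sas
/* School number, district number and class size for negative class sizes */
/* Note: rows with a missing acs_k3 also satisfy acs_k3 < 0 in SAS */
proc print data="c:\sasreg\elemapi";
  where (acs_k3 < 0);
  var snum dnum acs_k3;
run;
```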

Notice that when we looked at the observations where (acs_k3 < 0) this also included observations where acs_k3 is missing (represented as a period).  To be more precise, the above command should exclude such observations like this.

proc print data="c:\sasreg\elemapi";
  where (acs_k3 < 0) and (acs_k3 ^= .);
  var snum dnum acs_k3;
run;

Obs    snum    dnum    acs_k3
 38     600     140      -20
 39     596     140      -19
 40     611     140      -20
 41     595     140      -21
 42     592     140      -21
 43     602     140      -21

Now,  let's look at all of the observations for district 140.
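A sketch of the district-level listing, reusing the variables from the previous step:

```sas
/* Every observation from district 140, whatever its class size */
proc print data="c:\sasreg\elemapi";
  where (dnum = 140);
  var snum dnum acs_k3;
run;
```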

All of the observations from district 140 seem to have this problem.  When you find such a problem, you want to go back to the original source of the data to verify the values. We have to reveal that we fabricated this error for illustration purposes, and that the actual data had no such problem. Let's pretend that we checked with district 140 and found that there was a problem with the data there: a hyphen was accidentally put in front of the class sizes, making them negative.  We will make a note to fix this!  Let's continue checking our data.

Let's take a look at some graphical methods for inspecting data.  For each variable, it is useful to inspect a histogram, boxplot, and stem-and-leaf plot.  These graphs can show you information about the shape of your variables better than simple numeric statistics can. We already know about the problem with acs_k3, but let's see how these graphical methods would have revealed the problem with this variable.

First, we show a histogram for acs_k3. This shows us the observations where the average class size is negative.
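A histogram of this kind can be requested with the histogram statement of proc univariate. The sketch below borrows the cfill= option used later in this chapter; the gray fill is only a cosmetic assumption.

```sas
/* Histogram of average class size; negative values appear far to the left */
proc univariate data="c:\sasreg\elemapi";
  var acs_k3;
  histogram / cfill=gray;
run;
```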

Likewise, a boxplot and stem-and-leaf plot would have called these observations to our attention as well.   In SAS you can use the plot option with proc univariate to request a boxplot and stem and leaf plot. Below we show just the combined boxplot and stem and leaf plot from this output. You can see the outlying negative observations way at the bottom of the boxplot.
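A sketch of the call with the plot option, which adds the text-based plots to the usual proc univariate output:

```sas
/* The plot option requests the stem-and-leaf (or bar chart) and boxplot */
proc univariate data="c:\sasreg\elemapi" plot;
  var acs_k3;
run;
```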

                      Histogram                       #  Boxplot
     25+*                                              1     0
       .**                                            10     |
       .****************************                 137  +-----+
       .******************************************   207  *--+--*
       .*******                                       34     |
       .*                                              3     0
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .
       .*                                              3     *
    -21+*                                              3     *
        ----+----+----+----+----+----+----+----+--
        * may represent up to 5 counts

We recommend plotting all of these graphs for the variables you will be analyzing. We will omit, due to space considerations, showing these graphs for all of the variables. However, in examining the variables, the stem-and-leaf plot for full seemed rather unusual.  Up to now, we have not seen anything problematic with this variable, but look at the stem and leaf plot for full below. It shows 104 observations where the percent with a full credential is much lower than for all other observations.  This is over 25% of the schools and seems very unusual.
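The plot for full would come from the same kind of call used for acs_k3, sketched here:

```sas
/* Text-based distribution plots for percent of teachers with full credentials */
proc univariate data="c:\sasreg\elemapi" plot;
  var full;
run;
```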

Let's look at the frequency distribution of full to see if we can understand this better.  The values go from 0.42 to 1.0, then jump to 37 and go up from there.   It appears as though some of the percentages are actually entered as proportions, e.g., 0.42 was entered instead of 42, or 0.96 instead of 96.
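A minimal sketch of the frequency request for full:

```sas
/* Tabulate the distinct values of full to expose the 0.42-1.0 cluster */
proc freq data="c:\sasreg\elemapi";
  tables full;
run;
```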

Let's see which district(s) these data came from. 
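One way to do this is to tabulate the district number for the suspect rows only, as sketched below. Since proc means showed no missing values for full, the where condition does not need to exclude missing values here.

```sas
/* Districts contributing observations with full recorded as a proportion */
proc freq data="c:\sasreg\elemapi";
  where (full <= 1);
  tables dnum;
run;
```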

We note that all 104 observations in which full was less than or equal to one came from district 401.  Let's see if this accounts for all of the observations that come from district 401.
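A sketch of that check: counting all observations in district 401 and comparing the count with the 104 suspect rows.

```sas
/* Total number of observations in district 401 */
proc freq data="c:\sasreg\elemapi";
  where (dnum = 401);
  tables dnum;
run;
```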

All of the observations from this district seem to be recorded as proportions instead of percentages.  Again, let us state that this is a pretend problem that we inserted into the data for illustration purposes.  If this were a real life problem, we would check with the source of the data and verify the problem.  We will make a note to fix this problem in the data as well.

Another useful graphical technique for screening your data is a scatterplot matrix. While this is probably more relevant as a diagnostic tool searching for non-linearities and outliers in your data, it can also be a useful data screening tool, possibly revealing information in the joint distributions of your variables that would not be apparent from examining univariate distributions.  Let's look at the scatterplot matrix for the variables in our regression model.  This reveals the problems we have already identified, i.e., the negative class sizes and the percent full credential being entered as proportions. 
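A scatterplot matrix can be drawn in several ways. The sketch below uses the matrix statement of proc sgscatter, which assumes a SAS release with ODS Graphics available; it is not necessarily the procedure used to produce the plot discussed above.

```sas
/* Pairwise scatterplots of the outcome and the three predictors */
proc sgscatter data="c:\sasreg\elemapi";
  matrix api00 acs_k3 meals full;
run;
```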

We have identified three problems in our data.  There are numerous missing values for meals, there were negatives accidentally inserted before some of the class sizes (acs_k3) and over a quarter of the values for full were proportions instead of percentages.  The corrected version of the data is called elemapi2.  Let's use that data file and repeat our analysis and see if the results are the same as our original analysis. First, let's repeat our original regression analysis below.

proc reg data="c:\sasreg\elemapi";
  model api00 = acs_k3 meals full;
run;

Dependent Variable: api00 api 2000

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     3        2634884         878295     213.41    <.0001
Error                   309        1271713     4115.57673
Corrected Total         312        3906597

Root MSE             64.15276    R-Square     0.6745
Dependent Mean      596.40575    Adj R-Sq     0.6713
Coeff Var            10.75656

                                    Parameter Estimates

                                            Parameter       Standard
Variable     Label                  DF       Estimate          Error    t Value    Pr > |t|

Intercept    Intercept               1      906.73916       28.26505      32.08      <.0001
acs_k3       avg class size k-3      1       -2.68151        1.39399      -1.92      0.0553
meals        pct free meals          1       -3.70242        0.15403     -24.04      <.0001
full         pct full credential     1        0.10861        0.09072       1.20      0.2321

Now, let's use the corrected data file and repeat the regression analysis.  We see quite a difference in the results!  In the original analysis (above), acs_k3 was nearly significant, but in the corrected analysis (below) the results show this variable to be not significant, perhaps due to the cases where class size was given a negative value.  Likewise, the percentage of teachers with full credentials was not significant in the original analysis, but is significant in the corrected analysis, perhaps due to the cases where the value was given as the proportion with full credentials instead of the percent.   Also, note that the corrected analysis is based on 398 observations instead of 313 observations, due to getting the complete data for the meals variable which had lots of missing values.
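The corrected analysis uses the same model statement with the elemapi2 file, sketched here under the usual c:\sasreg folder assumption:

```sas
/* Same model as before, now run on the corrected data file */
proc reg data="c:\sasreg\elemapi2";
  model api00 = acs_k3 meals full;
run;
quit;
```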

From this point forward, we will use the corrected, elemapi2, data file.  

So far we have covered some topics in data checking/verification, but we have not really discussed regression analysis itself.  Let's now talk more about performing regression analysis in SAS.

1.3 Simple Linear Regression

Let's begin by showing some examples of simple linear regression using SAS. In this type of regression, we have only one predictor variable. This variable may be continuous, meaning that it may assume all values within a range, for example, age or height, or it may be dichotomous, meaning that the variable may assume only one of two values, for example, 0 or 1. The use of categorical variables with more than two levels will be covered in Chapter 3. There is only one response or dependent variable, and it is continuous.

In SAS, the dependent variable is listed immediately after the keyword model in the model statement, followed by an equal sign and then one or more predictor variables. Let's examine the relationship between the size of school and academic performance to see if the size of the school is related to academic performance.  For this example, api00 is the dependent variable and enroll is the predictor.
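A sketch of this simple regression, with the dependent variable on the left of the equal sign and the single predictor on the right:

```sas
/* Simple linear regression of api00 on enrollment */
proc reg data="c:\sasreg\elemapi2";
  model api00 = enroll;
run;
quit;
```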

Let's review this output a bit more carefully. First, we see that the F-test is statistically significant, which means that the model is statistically significant. An R-squared of .1012 means that approximately 10% of the variance of api00 is accounted for by the model, in this case, enroll. The t-test for enroll equals -6.70, and is statistically significant, meaning that the regression coefficient for enroll is significantly different from zero. Note that (-6.70)^2 = 44.89, which is the same as the F-statistic (with some rounding error). The coefficient for enroll is -.19987, or approximately -0.2, meaning that for a one unit increase in enroll, we would expect a 0.2-unit decrease in api00. In other words, a school with 1100 students would be expected to have an api score 20 units lower than a school with 1000 students.  The constant is 744.2514, and this is the predicted value when enroll equals zero.  In most cases, the constant is not very interesting.  We have prepared an annotated output which shows the output from this regression along with an explanation of each of the items in it.
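The arithmetic behind these statements can be written out using the estimates quoted above. The hat notation for a predicted value is our own shorthand and does not appear in the SAS output.

```latex
\begin{align*}
\widehat{\text{api00}} &= 744.2514 - 0.19987 \times \text{enroll} \\
\widehat{\text{api00}}(1100) - \widehat{\text{api00}}(1000) &= -0.19987 \times 100 = -19.987 \approx -20 \\
t^2 &= (-6.70)^2 = 44.89 \approx F
\end{align*}
```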

In addition to getting the regression table, it can be useful to see a scatterplot of the predictor and outcome variables with the regression line plotted.  SAS makes this very easy for you by using the plot statement as part of proc reg.  For example, below we show how to make a scatterplot of the outcome variable, api00 and the predictor, enroll. Note that the graph also includes the predicted values in the form of the regression line.
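A sketch of the plot statement in its simplest form, with the outcome on the vertical axis and the predictor on the horizontal axis. It assumes the traditional proc reg graphics that the plot statement produces.

```sas
/* Scatterplot of api00 against enroll with the fitted regression line */
proc reg data="c:\sasreg\elemapi2";
  model api00 = enroll;
  plot api00 * enroll;
run;
quit;
```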

As you see, this one command produces a scatterplot and regression line, and it also includes the regression model with the correlation of the two variables in the title.  We could include a 95% prediction interval using the pred option on the plot statement as illustrated below.

proc reg data="c:\sasreg\elemapi2"  ;
  model api00 = enroll ;
  plot api00 * enroll / pred;
run;  
quit;

Another kind of graph that you might want to make is a residual versus fitted plot.  As shown below, we can use the plot statement to make this graph.  The keywords residual. and predicted. in this context refer to the residual value and predicted value from the regression analysis and can be abbreviated as r. and p. .
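A sketch of the residual versus fitted plot, using the full keywords; note that each keyword ends with a period.

```sas
/* Residuals on the vertical axis, predicted values on the horizontal axis */
proc reg data="c:\sasreg\elemapi2";
  model api00 = enroll;
  plot residual. * predicted.;   /* equivalently: plot r. * p.; */
run;
quit;
```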

The table below shows a number of other keywords that can be used with the plot statement and the statistics they display.

Keyword                     Statistic
COOKD.                      Cook's D influence statistics
COVRATIO.                   standard influence of observation on covariance of betas
DFFITS.                     standard influence of observation on predicted value
H.                          leverage
LCL.                        lower bound of 100(1-alpha)% confidence interval for individual prediction
LCLM.                       lower bound of 100(1-alpha)% confidence interval for the mean of the dependent variable
PREDICTED. | PRED. | P.     predicted values
PRESS.                      residuals from refitting the model with current observation deleted
RESIDUAL. | R.              residuals
RSTUDENT.                   studentized residuals with the current observation deleted
STDI.                       standard error of the individual predicted value
STDP.                       standard error of the mean predicted value
STDR.                       standard error of the residual
STUDENT.                    residuals divided by their standard errors
UCL.                        upper bound of 100(1-alpha)% confidence interval for individual prediction
UCLM.                       upper bound of 100(1-alpha)% confidence interval for the mean of the dependent variable

1.4 Multiple Regression

Now, let's look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors. For this multiple regression example, we will regress the dependent variable, api00, on all of the predictor variables in the data set.
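The exact predictor list behind the output discussed in this section is not shown here, so the list in the sketch below is an assumption: it contains only predictors that are named somewhere in this chapter, and the model that produced the reported results may include others.

```sas
/* Multiple regression; the predictor list is illustrative (see note above) */
proc reg data="c:\sasreg\elemapi2";
  model api00 = ell meals yr_rnd acs_k3 acs_46 full emer enroll;
run;
quit;
```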

Let's examine the output from this regression analysis.  As with the simple regression, we look to the p-value of the F-test to see if the overall model is significant. With a p-value of zero to four decimal places, the model is statistically significant. The R-squared is 0.8446, meaning that approximately 84% of the variability of api00 is accounted for by the variables in the model. In this case, the adjusted R-squared indicates that about 84% of the variability of api00 is accounted for by the model, even after taking into account the number of predictor variables in the model. The coefficient for each of the variables indicates the amount of change one could expect in api00 given a one-unit change in the value of that variable, given that all other variables in the model are held constant. For example, consider the variable ell.   We would expect a decrease of 0.86 in the api00 score for every one unit increase in ell, assuming that all other variables in the model are held constant.  The interpretation of much of the output from the multiple regression is the same as it was for the simple regression.  We have prepared an annotated output that more thoroughly explains the output of this multiple regression analysis.

You may be wondering what a 0.86 change in ell really means, and how you might compare the strength of that coefficient to the coefficient for another variable, say meals. To address this problem, we can use the stb option on the model statement to request that, in addition to the standard output, SAS also display a table of the standardized values, sometimes called beta coefficients.  Below we show just the portion of the output that includes these standardized values.  The beta coefficients are used by some researchers to compare the relative strength of the various predictors within the model. Because the beta coefficients are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients that you would obtain if the outcome and predictor variables were all transformed to standard scores, also called z-scores, before running the regression.
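Options on the model statement follow a slash. A sketch with stb is given here; the predictor list is again an assumption built from variables named in this chapter and may not match the model behind the reported values.

```sas
/* stb adds standardized (beta) coefficients to the parameter estimates table */
proc reg data="c:\sasreg\elemapi2";
  model api00 = ell meals yr_rnd acs_k3 acs_46 full emer enroll / stb;
run;
quit;
```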

Because these standardized coefficients are all in the same standardized units you can compare these coefficients to assess the relative strength of each of the predictors.  In this example, meals has the largest Beta coefficient, -0.66, and acs_k3 has the smallest Beta, 0.013.  Thus, a one standard deviation increase in meals leads to a 0.66 standard deviation decrease in predicted api00, with the other variables held constant. And, a one standard deviation increase in acs_k3, in turn, leads to a 0.013 standard deviation increase in predicted api00 with the other variables in the model held constant.

In interpreting this output, remember that the difference between the regular coefficients (from the prior output) and the standardized coefficients above is the units of measurement.  For example, to describe the raw coefficient for ell you would say  "A one-unit decrease in ell would yield a .86-unit increase in the predicted api00." However, for the standardized coefficient (Beta) you would say, "A one standard deviation decrease in ell would yield a .15 standard deviation increase in the predicted api00."

So far, we have concerned ourselves with testing a single variable at a time, for example looking at the coefficient for ell and determining if that is significant. We can also test sets of variables, using the test statement, to see if the set of variables is significant.  First, let's start by testing a single variable, ell, using the test statement.  Note that the part before the test statement, test1:, is merely a label to identify the output of the test.  This label could be any short label to identify the output.
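A sketch of a labeled test statement placed after the model statement. The predictor list is an assumption assembled from variables named in this chapter, so degrees of freedom from this sketch need not match the output shown.

```sas
/* test1: is only a label; the hypothesis tested is that the ell coefficient is 0 */
proc reg data="c:\sasreg\elemapi2";
  model api00 = ell meals yr_rnd acs_k3 acs_46 full emer enroll;
  test1: test ell = 0;
run;
quit;
```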

     Test TEST1 Results for Dependent Variable api00

                                Mean
Source             DF         Square    F Value    Pr > F

Numerator           1          53732      16.67    <.0001
Denominator       385     3222.61761

If you compare this output with the output from the last regression you can see that the result of the F-test, 16.67, is the same as the square of the result of the t-test in the regression ((-4.083)^2 = 16.67). Note that you could get the same results if you typed the following since SAS defaults to comparing the term(s) listed to 0.
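The shorter form simply omits the "= 0" part. In this sketch the label test2 and the predictor list are our own illustrative choices.

```sas
/* Listing a term without "= 0" tests that its coefficient equals 0 */
proc reg data="c:\sasreg\elemapi2";
  model api00 = ell meals yr_rnd acs_k3 acs_46 full emer enroll;
  test2: test ell;
run;
quit;
```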

Perhaps a more interesting test would be to see if the contribution of class size is significant.  Since the information regarding class size is contained in two variables, acs_k3 and acs_46, we include both of these, separated by a comma, on the test statement.  Below we show just the output from the test statement.
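A sketch of the joint test; the label test_class_size and the predictor list are illustrative assumptions.

```sas
/* Joint test that the acs_k3 and acs_46 coefficients are both 0 */
proc reg data="c:\sasreg\elemapi2";
  model api00 = ell meals yr_rnd acs_k3 acs_46 full emer enroll;
  test_class_size: test acs_k3, acs_46;
run;
quit;
```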

The significant F-test, 3.95, means that the collective contribution of these two variables is significant.  One way to think of this is that there is a significant difference between a model with acs_k3 and acs_46 as compared to a model without them, i.e., there is a significant difference between the "full" model and the "reduced" model.

Finally, as part of doing a multiple regression analysis you might be interested in seeing the correlations among the variables in the regression model.  You can do this with proc corr as shown below.
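A sketch of the proc corr call. The variable list is an assumption made up of variables named in this chapter, so the resulting matrix may be smaller than the one discussed here.

```sas
/* Pairwise Pearson correlations among the outcome and the predictors */
proc corr data="c:\sasreg\elemapi2";
  var api00 ell meals yr_rnd acs_k3 acs_46 full emer enroll;
run;
```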

We can see that the strongest correlation with api00 is meals, with a correlation exceeding 0.9 in magnitude (more negative than -0.9).  The variables ell and emer are also strongly correlated with api00. All three of these correlations are negative, meaning that as the value of one variable goes down, the value of the other variable tends to go up. Knowing that these variables are strongly associated with api00, we might predict that they would be statistically significant predictor variables in the regression model. Note that the number of cases used for each correlation is determined on a "pairwise" basis, for example there are 398 valid pairs of data for enroll and acs_k3, so that correlation of .1089 is based on 398 observations.

1.5 Transforming Variables

Earlier we focused on screening your data for potential errors.  In the next chapter, we will focus on regression diagnostics to verify whether your data meet the assumptions of linear regression.  Here, we will focus on the issue of normality.  Some researchers believe that linear regression requires that the outcome (dependent) and predictor variables be normally distributed. We need to clarify this issue. In actuality, it is the residuals that need to be normally distributed.  In fact, the residuals need to be normal only for the t-tests to be valid. The estimation of the regression coefficients does not require normally distributed residuals. As we are interested in having valid t-tests, we will investigate issues concerning normality.

A common cause of non-normally distributed residuals is non-normally distributed outcome and/or predictor variables.  So, let us explore the distribution of our variables and how we might transform them to a more normal shape.  Let's start by making a histogram of the variable enroll, which we looked at earlier in the simple regression.
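A basic histogram of enroll can be sketched with the histogram statement; the gray fill matches the cfill= option used elsewhere in this chapter.

```sas
/* Histogram of school enrollment with default bins */
proc univariate data="c:\sasreg\elemapi2";
  var enroll;
  histogram / cfill=gray;
run;
```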

We can use the normal option to superimpose a normal curve on this graph and the midpoints option to indicate that we want bins with midpoints from 100 to 1500 going in increments of 100.

proc univariate data="c:\sasreg\elemapi2";
  var enroll ;
  histogram / cfill=gray normal midpoints=100 to 1500 by 100;
run;

Histograms are sensitive to the number of bins or columns that are used in the display. An alternative to histograms is the kernel density plot, which approximates the probability density of the variable. Kernel density plots have the advantage of being smooth and of being independent of the choice of origin, unlike histograms. You can add a kernel density plot to the above plot with the kernel option as illustrated below.
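A sketch that adds the kernel option to the histogram statement shown earlier, keeping the normal overlay and the same midpoints:

```sas
/* Histogram of enroll with a normal curve and a kernel density estimate */
proc univariate data="c:\sasreg\elemapi2";
  var enroll;
  histogram / cfill=gray normal kernel midpoints=100 to 1500 by 100;
run;
```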

Not surprisingly, the kernel density plot also indicates that the variable enroll does not look normal.

There are two other types of graphs that are often used to examine the distribution of variables: quantile-quantile plots and normal probability plots.

A quantile-quantile plot graphs the quantiles of a variable against the quantiles of a normal (Gaussian) distribution. Such plots are  sensitive to non-normality near the tails, and indeed we see considerable deviations from normal, the diagonal line, in the tails. This plot is typical of variables that are strongly skewed to the right.
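A quantile-quantile plot can be requested with the qqplot statement of proc univariate. In this sketch the mu=est and sigma=est suboptions are our own addition; they ask SAS to draw the reference line using the sample mean and standard deviation.

```sas
/* Normal Q-Q plot of enroll with an estimated reference line */
proc univariate data="c:\sasreg\elemapi2" noprint;
  var enroll;
  qqplot / normal(mu=est sigma=est);
run;
```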

The normal probability plot is also useful for examining the distribution of variables and is sensitive to deviations from normality nearer to the center of the distribution. We will use SAS proc capability to get the normal probability plot. Again, we see indications of non-normality in enroll.
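The sketch below assumes that SAS/QC, which provides proc capability, is installed. It uses the ppplot statement, which compares the data with a normal distribution by default; whether this is the exact statement behind the plot described above is an assumption.

```sas
/* Probability-probability plot of enroll against the normal distribution */
proc capability data="c:\sasreg\elemapi2" noprint;
  ppplot enroll;
run;
```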

Given the skewness to the right in enroll, let us try a log transformation to see if that makes it more normal.  Below we create a variable lenroll that is the natural log of enroll and then we repeat some of the above commands to see if lenroll is more normally distributed.

data elemapi3;
  set "c:\sasreg\elemapi2";
  lenroll = log(enroll);
run;

Now let's try showing a histogram for lenroll with a normal overlay and a kernel density estimate.

proc univariate data=elemapi3 noprint;
  var lenroll ;
  histogram / cfill=grayd0  normal kernel (color = red);
run;

We can see that lenroll looks quite normal.  We could then create a quantile-quantile plot and a normal probability plot to further assess whether lenroll seems normal, as well as seeing how lenroll impacts the residuals, which is really the important consideration.
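These follow-up plots could be sketched as below. The code assumes the elemapi3 data set created by the data step earlier in this section still exists in the current SAS session, and the mu=est and sigma=est suboptions are our own choice for drawing the reference lines.

```sas
/* Q-Q plot and normal probability plot for the log-transformed enrollment */
proc univariate data=elemapi3 noprint;
  var lenroll;
  qqplot   / normal(mu=est sigma=est);
  probplot / normal(mu=est sigma=est);
run;
```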

1.6 Summary

In this chapter we have discussed the basics of how to perform simple and multiple regressions, the basics of interpreting output, as well as some related commands. We examined some tools and techniques for screening for bad data and the consequences such data can have on your results.  Finally, we touched on the assumptions of linear regression and illustrated how you can check the normality of your variables and how you can transform your variables to achieve normality.  The next chapter will pick up where this chapter has left off, going into a more thorough discussion of the assumptions of linear regression and how you can use SAS to assess these assumptions for your data.

1.7 For more information


These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.