Stata Web Books
Regression with Stata
Chapter 2 - Regression Diagnostics

Chapter Outline
    2.0 Regression Diagnostics
    2.1 Unusual and Influential data
    2.2 Checking Normality of Residuals
    2.3 Checking Homoscedasticity
    2.4 Checking for Multicollinearity
    2.5 Checking Linearity
    2.6 Model Specification
    2.7 Issues of Independence
    2.8 Summary
    2.9 Self assessment
    2.10 For more information

2.0 Regression Diagnostics

 In the previous chapter, we learned how to do ordinary linear regression with Stata, concluding with methods for examining the distribution of our variables.   Without verifying that your data have met the assumptions underlying OLS regression, your results may be misleading.  This chapter will explore how you can use Stata to check on how well your data meet the assumptions of OLS regression.  In particular, we will consider the following assumptions.

Additionally, there are issues that can arise during the analysis that, while strictly speaking are not assumptions of regression, are none the less, of great concern to data analysts.

Many graphical methods and numerical tests have been developed over the years for regression diagnostics. Stata has many of these methods built-in, and others are available that can be downloaded over the internet. In particular, Nicholas J. Cox (University of Durham) has produced a collection of convenience commands which can be downloaded from SSC (ssc install commandname). These commands include indexplot, rvfplot2, rdplot, qfrplot and ovfplot. In this chapter, we will explore these methods and show how to verify regression assumptions and detect potential problems using Stata.

2.1 Unusual and influential data

A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis.  If a single observation (or small group of observations) substantially changes your results, you would want to know about this and investigate further.  There are three ways that an observation can be unusual.

Outliers: In linear regression, an outlier is an observation with large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.

Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. These leverage points can have an effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.

How can we identify these three types of observations? Let's look at an example dataset called crime. This dataset  appears in Statistical Methods for Social Sciences, Third Edition by Alan Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name (state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder),  the percent of the population living in metropolitan areas (pctmetro), the percent of the population that is white (pctwhite), percent of population with a high school education or above (pcths), percent of population living under poverty line (poverty), and percent of population that are single parents (single). 

Let's say that we want to predict crime by pctmetro, poverty, and single. That is to say, we want to build a linear regression model between the response variable crime and the independent variables pctmetro, poverty and single. We will first look at the scatter plots of crime against each of the predictor variables before the regression analysis so we will have some ideas about potential problems. We can create a scatterplot matrix of these variables as shown below.

The graphs of crime with other variables show some potential problems.  In every plot, we see a data point that is far away from the rest of the data points. Let's make individual graphs of crime with pctmetro and poverty and single so we can get a better view of these scatterplots.  We will add the mlabel(state) option to label each marker with the state name to identify outlying states.

All the scatter plots suggest that the observation for state = dc is a point that requires extra attention since it stands out away from all of the other points. We will keep it in mind when we do our regression analysis.

Now let's try the regression command predicting crime from pctmetro poverty and single.  We will go step-by-step to identify all the potentially unusual or influential points afterwards. 

Let's examine the studentized residuals as a first means for identifying outliers. Below we use the predict command with the rstudent option to generate studentized residuals and we name the residuals r.   We can choose any name we like as long as it is a legal Stata variable name. Studentized residuals are a type of standardized residual that can be used to identify outliers.

Let's examine the residuals with a stem and leaf plot. We see three residuals that stick out, -3.57, 2.62 and 3.77. 

stem r

Stem-and-leaf plot for r (Studentized residuals)

r rounded to nearest multiple of .01
plot in units of .01

-3** | 57
-3** | 
-2** | 
-2** | 
-1** | 84,69
-1** | 30,15,13,04,02
-0** | 87,85,65,58,56,55,54
-0** | 47,46,45,38,36,30,28,21,08,02
 0** | 05,06,08,13,27,28,29,31,35,41,48,49
 0** | 56,64,70,80,82
 1** | 01,03,03,08,15,29
 1** | 59
 2** | 
 2** | 62
 3** | 
 3** | 77

The stem and leaf display helps us see some potential outliers, but we cannot see which state (which observations) are potential outliers.  Let's sort the data on the residuals and show the 10 largest and 10 smallest residuals along with the state id and state name.  Note that in the second list command the -10/l the last value is the letter "l", NOT the number one.

We should pay attention to studentized residuals that exceed +2 or -2, and get even more concerned about residuals that exceed +2.5 or -2.5 and even yet more concerned about residuals that exceed +3 or -3.  These results show that DC and MS are the most worrisome observations followed by FL.

Another way to get this kind of output is with a command called hilo.  You can download hilo from within Stata by typing findit hilo (see How can I used the findit command to search for programs and get additional help? for more information about using findit).

Once installed, you can type the following and get output similar to that above by typing just one command.

hilo r state
10 smallest and largest observations on r

        r      state 
-3.570789         ms  
-1.838577         la  
-1.685598         ri  
-1.303919         wa  
 -1.14833         oh  
 -1.12934         wi  
-1.044952         co  
-1.022727         mi  
-.8699151         az  
-.8520518         ut  

        r      state 
8211724         mo
  1.01299         md  
 1.028869         ne  
 1.030343         sc  
 1.076718         ks  
 1.151702         il  
 1.293477         id  
 1.589644         ia  
 2.619523         fl  
 3.765847         dc  

Let's show all of the variables in our regression where the studentized residual exceeds +2 or -2, i.e., where the absolute value of the residual exceeds 2.  We see the data for the three potential outliers we identified, namely Florida, Mississippi and Washington D.C. Looking carefully at these three observations, we couldn't find any data entry error, though we may want to do another regression analysis with the extreme point such as DC deleted. We will return to this issue later.  

Now let's look at the leverage's to identify observations that will have potential great influence on regression coefficient estimates.  

We use the show(5) high options on the hilo command to show just the 5 largest observations (the high option can be abbreviated as h).  We see that DC has the largest leverage.

hilo lev state, show(5) high

5 largest observations on lev

      lev      state 
.1652769         la  
.1802005         wv  
 .191012         ms  
.2606759         ak  
 .536383         dc  

Generally, a point with leverage greater than (2k+2)/n should be carefully examined. Here k is the number of predictors and n is the number of observations. In our example, we can do the following. 

As we have seen, DC is an observation that both has a large residual and large leverage.  Such points are potentially the most influential.  We can make a plot that shows the leverage by the residual squared and look for observations that are jointly high on both of these measures.  We can do this using the lvr2plot command. lvr2plot stands for leverage versus residual squared plot. Using residual squared instead of residual itself, the graph is restricted to the first quadrant and the relative positions of data points are preserved. This is a quick way of checking potential influential observations and outliers at the same time. Both types of points are of great concern for us.  

The two reference lines are the means for leverage, horizontal, and for the normalized residual squared, vertical. The points that immediately catch our attention is DC (with the largest leverage) and MS (with the largest residual squared). We'll look at those observations more carefully by listing them.

Now let's move on to overall measures of influence, specifically let's look at Cook's D and DFITS.  These measures both combine information on the residual and leverage. Cook's D and DFITS are very similar except that they scale differently but they give us similar answers.

The lowest value that Cook's D can assume is zero, and the higher the Cook's D is, the more influential the point. The convention cut-off point is 4/n. We can list any observation above the cut-off point by doing the following. We do see that the Cook's D for DC is by far the largest. 

Now let's take a look at DFITS. The cut-off point for DFITS is 2*sqrt(k/n). DFITS can be either positive or negative, with numbers close to zero corresponding to the points with small or zero influence. As we see, dfit also indicates that DC is, by far, the most influential observation.

The above measures are general measures of influence.  You can also consider more specific measures of influence that assess how each coefficient is changed by deleting the observation. This measure is called DFBETA and is created for each of the predictors. Apparently this is more computational intensive than summary statistics such as Cook's D since the more predictors a model has, the more computation it may involve. We can restrict our attention to only those predictors that we are most concerned with to see how well behaved those predictors are. In Stata, the dfbeta command will produce the DFBETAs for each of the predictors. The names for the new variables created are chosen by Stata automatically and begin with the letters DF. 

This created three variables, DFpctmetro, DFpoverty and DFsingle.  Let's look at the first 5 values. 

list state DFpctmetro DFpoverty DFsingle in 1/5

         state  DFpctme~o  DFpoverty   DFsingle 
  1.        ak  -.1061846  -.1313398   .1451826  
  2.        al   .0124287   .0552852  -.0275128  
  3.        ar  -.0687483   .1753482  -.1052626  
  4.        az  -.0947614  -.0308833    .001242  
  5.        ca   .0126401   .0088009  -.0036361  

The value for DFsingle for Alaska is .14, which means that by being included in the analysis (as compared to being excluded), Alaska increases the coefficient for single by 0.14 standard errors, i.e., .14 times the standard error for BSingle or by (0.14 * 15.5).  Since the inclusion of an observation could either contribute to an increase or decrease in a regression coefficient, DFBETAs can be either positive or negative.  A DFBETA value in excess of  2/sqrt(n) merits further investigation. In this example, we would be concerned about absolute values in excess of 2/sqrt(51) or .28.

We can plot all three DFBETA values against the state id in one graph shown below. We add a line at .28 and -.28 to help us see potentially troublesome observations.  We see the largest value is about 3.0 for DFsingle.

We can repeat this graph with the mlabel() option in the graph command to label the points. With the graph above we can identify which DFBeta is a problem, and with the graph below we can associate that observation with the state that it originates from. 

Now let's list those observations with DFsingle larger than the cut-off value.

The following table summarizes the general rules of thumb we use for these measures to identify observations worthy of further investigation (where k is the number of predictors and n is the number of observations).

Measure Value
leverage >(2k+2)/n
abs(rstu) > 2
Cook's D > 4/n
abs(DFITS) > 2*sqrt(k/n)
abs(DFBETA) > 2/sqrt(n)

We have used the predict command to create a number of variables associated with regression analysis and regression diagnostics. The help regress command not only gives help on the regress command, but also lists all of the statistics that can be generated via the predict command.  Below we show a snippet of the Stata help file illustrating the various statistics that can be computed via the predict command.

We have explored a number of the statistics that we can get after the regress command. There are also several graphs that can be used to search for unusual and influential observations. The avplot command graphs an added-variable plot. It is also called a partial-regression plot and is very useful in identifying influential points. For example, in the avplot for single shown below, the graph shows crime by single after both crime and single have been adjusted for all other predictors in the model.   The line plotted has the same slope as the coefficient for single.   This plot shows how the observation for DC influences the coefficient.  You can see how the regression line is tugged upwards trying to fit through the extreme value of DC.  Alaska and West Virginia may also exert substantial leverage on the coefficient of single

Stata also has the avplots command that creates an added variable plot for all of the variables, which can be very useful when you have many variables.  It does produce small graphs, but these graphs can quickly reveal whether you have problematic observations based on the added variable plots.

DC has appeared as an outlier as well as an influential point in every analysis.   Since DC is really not a state, we can use this to justify omitting it from the analysis saying that we really wish to just analyze states. First, let's repeat our analysis including DC by just typing regress.

Now, let's run the analysis omitting DC by including if state != "dc" on the regress command (here != stands for "not equal to" but you could also use ~= to mean the same thing). As we expect, deleting DC made a large change in the coefficient for single. The coefficient for single dropped from 132.4 to 89.4.  After having deleted DC, we would repeat the process we have illustrated in this section to search for any other outlying and influential observations.

regress crime pctmetro poverty single if state!="dc"

  Source |       SS       df       MS                  Number of obs =      50
---------+------------------------------               F(  3,    46) =   39.90
   Model |  3098767.11     3  1032922.37               Prob > F      =  0.0000
Residual |  1190858.11    46  25888.2199               R-squared     =  0.7224
---------+------------------------------               Adj R-squared =  0.7043
   Total |  4289625.22    49  87543.3718               Root MSE      =  160.90

   crime |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
pctmetro |   7.712334   1.109241      6.953   0.000       5.479547     9.94512
 poverty |   18.28265   6.135958      2.980   0.005       5.931611     30.6337
  single |   89.40078   17.83621      5.012   0.000       53.49836    125.3032
   _cons |  -1197.538   180.4874     -6.635   0.000       -1560.84   -834.2358

Finally, we showed that the avplot command can be used to searching for outliers among existing variables in your model, but we should note that the avplot command not only works for the variables in the model, it also works for variables that are not in the model, which is why it is called added-variable plot. Let's use the regression that includes DC as we want to continue to see ill-behavior caused by DC as a demonstration for doing regression diagnostics. We can do an avplot on variable pctwhite

At the top of the plot, we have "coef=-3.509". It is the coefficient for pctwhite if it were put in the model. We can check that by doing a regression as below.


In this section, we explored a number of methods of identifying outliers and influential points. In a typical analysis, you would probably use only some of these methods. Generally speaking, there are two types of methods for assessing outliers: statistics such as residuals, leverage, Cook's D and DFITS, that assess the overall impact of an observation on the regression results, and statistics such as DFBETA that assess the specific impact of an observation on the regression coefficients. 

In our example, we found that  DC was a point of major concern. We performed a regression with it and without it and the regression equations were very different. We can justify removing it from our analysis by reasoning that our model  is to predict crime rate for states, not for metropolitan areas.

2.2 Checking Normality of Residuals

Many researchers believe that multiple regression requires normality. This is not the case. Normality of residuals is only required for valid hypothesis testing, that is, the normality assumption assures that the p-values for the t-tests and F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be identically and independently distributed. Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If this were the case than we would not be able to use dummy coded variables in our models.

After we run a regression analysis, we can use the predict command to create residuals and then use commands such as kdensity, qnorm and pnorm to check the normality of the residuals.

Let's use the elemapi2 data file we saw in Chapter 1 for these analyses.  Let's predict academic performance (api00) from percent receiving free meals (meals), percent of English language learners (ell), and percent of teachers with emergency credentials (emer).

We then use the predict command to generate residuals.

Below we use the kdensity command to produce a kernel density plot with the normal option requesting that a normal density be overlaid on the plot. kdensity stands for kernel density estimate. It can be thought of as a histogram with narrow bins and moving average. 

The pnorm command graphs a standardized normal probability (P-P) plot while qnorm plots the quantiles of a variable against the quantiles of a normal distribution. pnorm is sensitive to non-normality in the middle range of data and qnorm is sensitive to non-normality near the tails. As you see below, the results from pnorm show no indications of non-normality, while the qnorm command shows a slight deviation from normal at the upper tail, as can be seen in the kdensity above.  Nevertheless, this seems to be a minor and trivial deviation from normality. We can accept that  the residuals are close to a normal distribution.

There are also numerical tests for testing normality. One of the tests is the test written by Lawrence C. Hamilton, Dept. of Sociology, Univ. of New Hampshire, called iqr. You can get this program from Stata by typing findit iqr (see How can I used the findit command to search for programs and get additional help? for more information about using findit).

iqr stands for inter-quartile range and assumes the symmetry of the distribution. Severe outliers consist of those points that are either 3 inter-quartile-ranges below the first quartile or 3 inter-quartile-ranges above the third quartile. The presence of any severe outliers should be sufficient evidence to reject normality at a 5% significance level. Mild outliers are common in samples of any size. In our case, we don't have any severe outliers and the distribution seems fairly symmetric. The residuals have an approximately normal distribution. 

Another test available is the swilk test which performs the Shapiro-Wilk W test for normality. The p-value  is based on the assumption that the distribution is normal. In our example, it is very large (.51), indicating that we cannot reject that r is normally distributed. 

2.3 Checking Homoscedasticity of Residuals

One of the main assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals. If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the variance of the residuals is non-constant then the residual variance is said to be "heteroscedastic." There are graphical and non-graphical methods for detecting heteroscedasticity. A commonly used graphical method is to plot the residuals versus fitted (predicted) values.  We do this by issuing the rvfplot command. Below we use the rvfplot command with the yline(0) option to put a reference line at y=0. We see that the pattern of the data points is getting a little narrower towards the right end, which is an indication of heteroscedasticity.


Now let's look at a couple of commands that test for heteroscedasticity.

estat imtest

Cameron & Trivedi's decomposition of IM-test

              Source |       chi2     df      p
  Heteroskedasticity |      18.35      9    0.0313
            Skewness |       7.78      3    0.0507
            Kurtosis |       0.27      1    0.6067
               Total |      26.40     13    0.0150
estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity 
         Ho: Constant variance
         Variables: fitted values of api00
         chi2(1)      =     8.75
         Prob > chi2  =   0.0031

The first test on heteroskedasticity given by imest is the White's test and the second one given by hettest is the Breusch-Pagan test. Both test the null hypothesis that the variance of the residuals is homogenous. Therefore, if the p-value is very small, we would have to reject the hypothesis and accept the alternative hypothesis that the variance is not homogenous. So in this case, the evidence is against the null hypothesis that the variance is homogeneous. These tests are very sensitive to model assumptions, such as the assumption of normality. Therefore it is a common practice to combine the tests with diagnostic plots to make a judgment on the severity of the heteroscedasticity and to decide if any correction is needed for heteroscedasticity. In our case, the plot above does not show too strong an evidence. So we are not going to get into details on how to correct for heteroscedasticity even though there are methods available.

2.4 Checking for Multicollinearity

When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed. The term collinearity implies that two variables are near perfect linear combinations of one another. When more than two variables are involved it is often called multicollinearity, although the two terms are often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression model estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated. In this section, we will explore some Stata commands that help to detect multicollinearity.

We can use the vif command after the regression to check for multicollinearity. vif stands for variance inflation factor. As a rule of thumb, a variable whose VIF values are greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10. It means that the variable could be considered as a linear combination of other independent variables. Let's first look at the regression we did from the last section, the regression model predicting api00 from meals, ell and emer and then issue the vif command. 

The VIFs look fine here. Here is an example where the VIFs are more worrisome.

In this example, the VIF and tolerance (1/VIF) values for avg_ed grad_sch and col_grad are worrisome.  All of these variables measure education of the parents and the very high VIF values indicate that these variables are possibly redundant.  For example, after you know grad_sch and col_grad, you probably can predict avg_ed very well.  In this example, multicollinearity arises because we have put in too many variables that measure the same thing, parent education. 

Let's omit one of the parent education variables, avg_ed.  Note that the VIF values in the analysis below appear much better.  Also, note how the standard errors are reduced for the parent education variables, grad_sch and col_grad.  This is because the high degree of collinearity caused the standard errors to be inflated.   With the multicollinearity eliminated, the coefficient for grad_sch, which had been non-significant, is now significant.

Let's introduce another command on collinearity. The collin command displays several different measures of collinearity. For example, we can test for collinearity among the variables we used in the two examples above.  Note that the collin command does not need to be run in connection with a regress command, unlike the vif command which follows a regress command. Also note that only predictor (independent) variables are used with the collin command. You can download collin from within Stata by typing findit collin (see How can I used the findit command to search for programs and get additional help? for more information about using findit).

We now remove avg_ed and see the collinearity diagnostics improve considerably.

The condition number is a commonly used index of the global instability of the regression coefficients -- a large condition number, 10 or more, is an indication of instability.

 2.5 Checking Linearity

When we do linear regression, we assume that the relationship between the response variable and the predictors is linear. This is the assumption of linearity. If this assumption is violated, the linear regression will try to fit a straight line to data that does not follow a straight line. Checking the linear assumption in the case of simple regression is straightforward, since we only have one predictor. All we have to do is a scatter plot between the response variable and the predictor to see if nonlinearity is present, such as a curved band or a big wave-shaped curve. For example, recall we did a simple linear regression in Chapter 1 using dataset elemapi2.

Below we use the scatter command to show a scatterplot predicting api00 from enroll and use lfit to show a linear fit, and then lowess to show a lowess smoother predicting api00 from enroll. We clearly see some degree of nonlinearity. 

Checking the linearity assumption is not so straightforward in the case of multiple regression. We will try to illustrate some of the techniques that you can use. The most straightforward thing to do is to plot the standardized residuals against each of the predictor variables in the regression model. If there is a clear nonlinear pattern, there is a problem of nonlinearity. Otherwise, we should see for each of the plots just a random scatter of points. Let's continue to use dataset elemapi2 here. Let's use a different model. 

The two residual versus predictor variable plots above do not indicate strongly a clear departure from linearity. Another command for detecting non-linearity is acprplot. acprplot graphs an augmented component-plus-residual plot, a.k.a. augmented partial residual plot. It can be used to identify nonlinearities in the data. Let's use the acprplot command for meals and some_col and use the lowess lsopts(bwidth(1)) options to request lowess smoothing with a bandwidth of 1.

In the first plot below the smoothed line is very close to the ordinary regression line, and the entire pattern seems pretty uniform. The second plot does seem more problematic at the right end. This may come from some potential influential points. Overall, they don't look too bad and we shouldn't be too concerned about non-linearities in the data.

We have seen how to use acprplot to detect nonlinearity. However our last example didn't show much nonlinearity. Let's look at a more interesting example. This example is taken from "Statistics with Stata 5" by Lawrence C. Hamilton (1997, Duxbery Press). The dataset we will use is called nations.dta. We can get the dataset from the Internet.

Let's build a model that predicts birth rate (birth), from per capita gross national product (gnpcap), and urban population (urban). If this were a complete regression analysis, we would start with examining the variables, but for the purpose of illustrating nonlinearity, we will jump directly to the regression.

Now, let's do the acprplot on our predictors.

The acprplot plot for gnpcap shows clear deviation from linearity and the one for urban does not show nearly as much deviation from linearity. Now, let's look at these variables more closely.

We see that the relation between birth rate and per capita gross national product is clearly nonlinear and the relation between birth rate and urban population is not too far off from being linear. So let's focus on variable gnpcap. First let's look at the distribution of gnpcap. We suspect that gnpcap may be very skewed. This may affect the appearance of the acprplot.

Indeed, it is very skewed. This suggests to us that some transformation of the variable may be necessary. One of the commonly used transformations is log transformation. Let's try it here.

The transformation does seem to help correct the skewness greatly. Next, let's do the regression again replacing gnpcap by lggnp.

The plot above shows less deviation from nonlinearity than before, though the problem of nonlinearity has not been completely solved yet.

2.6 Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimate of regression coefficients.

Consider the model below. This regression suggests that as class size increases the academic performance increases. Before we publish results saying that increased class size is associated with higher academic performance, let's check the model specification.

There are a couple of methods to detect specification errors. The linktest command performs a model specification link test for single-equation models. linktest is based on the idea that if a regression is properly specified, one should not be able to find any additional independent variables that are significant except by chance. linktest creates two new variables, the variable of prediction, _hat, and the variable of squared prediction, _hatsq. The model is then refit using these two variables as predictors. _hat should be significant since it is the predicted value. On the other hand, _hatsq shouldn't, because if our model is specified correctly, the squared predictions should not have much explanatory power. That is we wouldn't  expect  _hatsq to be a significant predictor if our model is specified correctly. So we will be looking at the p-value for _hatsq.

From the above linktest, the test of _hatsq is not significant. This is to say that linktest has failed to reject the assumption that the model is specified correctly. Therefore, it seems to us that we don't have a specification error. But now, let's look at another test before we jump to the conclusion.

The ovtest command performs another test of regression model specification. It performs a regression specification error test (RESET) for omitted variables. The idea behind ovtest is very similar to linktest. It also creates new variables based on the predictors and refits the model using those new variables to see if any of them would be significant.  Let's try ovtest on our model.

The ovtest command indicates that there are omitted variables. So we have tried both the linktest and ovtest, and one of them (ovtest) tells us that we have a specification error. We therefore have to reconsider our model. 

Let's try adding the variable full to the model. Now, both the linktest and ovtest are significant, indicating we have a specification error.

Let's try adding one more variable, meals, to the above model. 

The linktest is once again non-significant while the p-value for ovtest is slightly greater than .05. Note that after including meals and full, the coefficient for class size is no longer significant. While acs_k3 does have a positive relationship with api00 when no other variables are in the model, when we include, and hence control for, other important variables, acs_k3 is no longer significantly related to api00 and its relationship to api00 is no longer positive

linktest and ovtest are tools available in Stata for checking specification errors, though linktest can actually do more than check omitted variables as we used here, e.g., checking the correctness of link function specification. For more details on those tests, please refer to Stata manual.

2.7 Issues of Independence

The statement of this assumption that the errors associated with one observation are not correlated with the errors of any other observation cover several different situations. Consider the case of collecting data from students in eight different elementary schools. It is likely that the students within each school will tend to be more like one another than students from different schools, that is, their errors are not independent. We will deal with this type of situation in Chapter 4 when we demonstrate the regress command with cluster option.

Another way in which the assumption of independence can be broken is when data are collected on the same variables over time. Let's say that we collect truancy data every semester for 12 years. In this situation it is likely that the errors for observation between adjacent semesters will be more highly correlated than for observations more separated in time. This is known as autocorrelation. When you have data that can be considered to be time-series you should use the dwstat command that performs a Durbin-Watson test for correlated residuals.

We don't have any time-series data, so we will use the elemapi2 dataset and pretend that snum indicates the time at which the data were collected. We will also need to use the tsset command to let Stata know which variable is the time variable.

The Durbin-Watson statistic has a range from 0 to 4 with a midpoint of 2. The observed value in our example is very small, close to zero, which is not surprising since our data are not truly time-series. A simple visual check would be to plot the residuals versus the time variable.

2.8 Summary

In this chapter, we have used a number of tools in Stata for determining whether our data meets the regression assumptions. Below, we list the major commands we demonstrated organized according to the assumption the command was shown to test.

See the Stata Topics: Regression page for more information and resources on regression diagnostics in Stata.

2.9 Self Assessment

1. The following data set consists of measured weight, measured height, reported weight and reported height of some 200 people. You can get it from within Stata by typing use   We tried to build a model to predict measured weight by reported weight, reported height and measured height. We did an lvr2plot after the regression and here is what we have. Explain what you see in the graph and try to use other STATA commands to identify the problematic observation(s). What do you think the problem is and what is your solution?
. regress  measwt measht reptwt reptht 
  Source |       SS       df       MS                  Number of obs =     181
---------+------------------------------               F(  3,   177) = 1640.88
   Model |  40891.9594     3  13630.6531               Prob > F      =  0.0000
Residual |   1470.3279   177  8.30693727               R-squared     =  0.9653
---------+------------------------------               Adj R-squared =  0.9647
   Total |  42362.2873   180  235.346041               Root MSE      =  2.8822

  measwt |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
  measht |  -.9607757   .0260189    -36.926   0.000      -1.012123   -.9094285
  reptwt |    1.01917   .0240778     42.328   0.000        .971654    1.066687
  reptht |   .8184156   .0419658     19.502   0.000       .7355979    .9012334
   _cons |    24.8138   4.888302      5.076   0.000       15.16695    34.46065


2. Using the data from the last exercise, what measure would you use if you want to know how much change an observation would make on a coefficient for a predictor? For example, show how much change would it be for the coefficient of predictor reptht if we omit observation 12 from our regression analysis? What are the other measures that you would use to assess the influence of an observation on regression? What are the cut-off values for them?

3. The following data file is called bbwt.dta and it is from Weisberg's Applied Regression Analysis. You can obtain it from within Stata by typing use It consists of the body weights and brain weights of some 60 animals.  We want to predict the brain weight by body weight, that is, a simple linear regression of brain weight against body weight. Show what you have to do to verify the linearity assumption. If you think that it violates the linearity assumption, show some possible remedies that you would consider. 

use, clear
regress brainwt bodywt
  Source |       SS       df       MS                  Number of obs =      62
---------+------------------------------               F(  1,    60) =  411.12
   Model |  46067326.8     1  46067326.8               Prob > F      =  0.0000
Residual |  6723217.18    60   112053.62               R-squared     =  0.8726
---------+------------------------------               Adj R-squared =  0.8705
   Total |  52790543.9    61  865418.753               Root MSE      =  334.74

 brainwt |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
  bodywt |   .9664599   .0476651     20.276   0.000       .8711155    1.061804
   _cons |   91.00865   43.55574      2.089   0.041       3.884201    178.1331

4. We did a regression analysis using the data file elemapi2 in chapter 2. Continuing with the analysis we did, we did an avplot here.  Explain what an avplot is and what type of information you would get from the plot. If variable full were put in the model, would it be a significant predictor?

use, clear
regress api00 meals ell emer
  Source |       SS       df       MS                  Number of obs =     400
---------+------------------------------               F(  3,   396) =  673.00
   Model |  6749782.75     3  2249927.58               Prob > F      =  0.0000
Residual |  1323889.25   396  3343.15467               R-squared     =  0.8360
---------+------------------------------               Adj R-squared =  0.8348
   Total |  8073672.00   399  20234.7669               Root MSE      =   57.82

   api00 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
   meals |  -3.159189   .1497371    -21.098   0.000      -3.453568   -2.864809
     ell |  -.9098732   .1846442     -4.928   0.000      -1.272878   -.5468678
    emer |  -1.573496    .293112     -5.368   0.000      -2.149746   -.9972456
   _cons |   886.7033    6.25976    141.651   0.000       874.3967    899.0098
avplot full, mlabel(snum)

5. The data set wage.dta is from a national sample of 6000 households with a male head earning less than $15,000 annually in 1966. You can get this data file by typing use from within Stata. The data were classified into 39 demographic groups for analysis. We tried to predict the average hours worked by average age of respondent and average yearly non-earned income.

use, clear
. regress HRS AGE NEIN
  Source |       SS       df       MS                  Number of obs =      39
---------+------------------------------               F(  2,    36) =   39.72
   Model |  107205.109     2  53602.5543               Prob > F      =  0.0000
Residual |  48578.1222    36  1349.39228               R-squared     =  0.6882
---------+------------------------------               Adj R-squared =  0.6708
   Total |  155783.231    38   4099.5587               Root MSE      =  36.734

     HRS |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
     AGE |  -8.281632   1.603736     -5.164   0.000      -11.53416   -5.029104
    NEIN |   .4289202   .0484882      8.846   0.000       .3305816    .5272588
   _cons |    2321.03   57.55038     40.330   0.000       2204.312    2437.748

Both predictors are significant. Now if we add ASSET to our predictors list, neither NEIN nor ASSET is significant.

  Source |       SS       df       MS                  Number of obs =      39
---------+------------------------------               F(  3,    35) =   25.83
   Model |   107317.64     3  35772.5467               Prob > F      =  0.0000
Residual |  48465.5908    35  1384.73117               R-squared     =  0.6889
---------+------------------------------               Adj R-squared =  0.6622
   Total |  155783.231    38   4099.5587               Root MSE      =  37.212

     HRS |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
     AGE |  -8.007181    1.88844     -4.240   0.000      -11.84092   -4.173443
    NEIN |   .3338277    .337171      0.990   0.329      -.3506658    1.018321
   ASSET |   .0044232    .015516      0.285   0.777       -.027076    .0359223
   _cons |   2314.054   63.22636     36.600   0.000       2185.698    2442.411

Can you explain why?

6. Continue to use the previous data set. This time we want to predict the average hourly wage by average percent of white respondents. Carry out the regression analysis and list the STATA commands that you can use to check for heteroscedasticity. Explain the result of your test(s).

Now we want to build another model to predict the average percent of white respondents by the average hours worked. Repeat the analysis you performed on the previous regression model. Explain your results.

7. We have a data set that consists of volume, diameter and height of some objects. Someone did a regression of volume on diameter and height.

use, clear
regress vol dia height
  Source |       SS       df       MS                  Number of obs =      31
---------+------------------------------               F(  2,    28) =  254.97
   Model |  7684.16254     2  3842.08127               Prob > F      =  0.0000
Residual |  421.921306    28  15.0686181               R-squared     =  0.9480
---------+------------------------------               Adj R-squared =  0.9442
   Total |  8106.08385    30  270.202795               Root MSE      =  3.8818

     vol |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
     dia |   4.708161   .2642646     17.816   0.000       4.166839    5.249482
  height |   .3392513   .1301512      2.607   0.014       .0726487    .6058538
   _cons |  -57.98766   8.638225     -6.713   0.000      -75.68226   -40.29306

Explain what tests you can use to detect model specification errors and if there is any, your solution to correct it.

Click here for our answers to these self assessment questions.

2.10 For more information


How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.