Regression with Stata
Chapter 4: Answers to Exercises

1. Use the crime data file that was used in chapter 2 (use http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime ) and look at a regression model predicting murder from pctmetro, poverty, pcths and single using OLS, and make avplots and an lvr2plot following the regression. Are there any states that look worrisome? Repeat this analysis using regression with robust standard errors and show avplots for the analysis. Repeat the analysis using robust regression and make a manually created lvr2plot. Also run the results using qreg. Compare the results of the different analyses. Look at the weights from the robust regression and comment on them.

Answer 1.
First, consider the OLS regression predicting murder from pctmetro, poverty, pcths and single.
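For example, the data could be loaded and this model fit with commands along these lines (a sketch; output omitted):

* load the crime data and fit the OLS model
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/crime, clear
regress murder pctmetro poverty pcths single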

These results suggest that single is the only predictor significantly related to number of murders in a state. Let's look at the lvr2plot for this analysis. Washington DC looks like it has both a very high leverage and a very high residual.
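For example, the plot can be requested right after the regression, labeling the points with the state variable:

* leverage versus normalized residual squared, labeled by state
lvr2plot, mlabel(state)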

Let's consider the same analysis using robust standard errors. The results are largely the same, except that the p value for pctmetro fell from 0.08 to 0.049, which would make it a significant predictor; however, we would be somewhat skeptical of this particular result without further investigation.
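For example, the robust-standard-error analysis is the same regression with the robust option added:

* same model with Huber-White (robust) standard errors
regress murder pctmetro poverty pcths single, robust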

Stata allows us to compute the residual for this analysis but will not allow us to compute the leverage (hat) value. So instead of showing an lvr2plot, let's look at the avplots for this analysis.
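For example, the added-variable plots follow the regression with:

* added-variable plot for each predictor
avplots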

As you can see, we still have an observation that sticks out from the rest, and this is Washington DC. This is especially pronounced for the lower right graph for single where DC would seem to have very strong leverage to influence the coefficient for single.

Now, let's look at the analysis using robust regression and save the weights, calling them rrwt.
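For example, rreg can save its final weights through the genwt() option:

* robust regression, saving the final weights in rrwt
rreg murder pctmetro poverty pcths single, genwt(rrwt)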

The avplots command is not available after rreg, and neither is lvr2plot. But we can manually create the residual and hat values and build an lvr2plot of our own; see below.
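One possible sketch: because the final rreg fit is (approximately) a weighted least-squares fit, we can refit the model by OLS using the saved weights rrwt and take the hat values and residuals from that fit. The variable names rr_hat, rr_res, rr_res2 and rr_nres2 below are just illustrative.

* approximate the rreg fit by weighted OLS, then build an lvr2plot by hand
quietly regress murder pctmetro poverty pcths single [aweight=rrwt]
predict rr_hat, hat
predict rr_res, residuals
gen rr_res2 = rr_res^2
quietly summarize rr_res2
gen rr_nres2 = rr_res2 / r(sum)
scatter rr_hat rr_nres2, mlabel(state)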

As you see above, with the robust regression none of the observations is high on both leverage and squared residual. Let's recap the regress results and the rreg results below and compare them.
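For example, the two fits can be recapped side by side with estimates table:

* compare OLS and robust regression coefficients, standard errors, and p values
quietly regress murder pctmetro poverty pcths single
estimates store ols
quietly rreg murder pctmetro poverty pcths single
estimates store rr
estimates table ols rr, b(%9.3f) se p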

The results are consistent for poverty and single: poverty was not significant in either analysis, and single was significant in both. However, pctmetro and pcths were not significant in the OLS analysis but were significant in the robust regression analysis.

Let's look at the weights used in the robust regression to further understand why the results were so different. Note that the weight for dc is . (missing), meaning that it was eliminated from the analysis entirely (because it had such a high residual). Also, ri received a weight of less than one half.
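For example, the heavily downweighted states can be listed directly (the cutoff of .5 here is just for display; missing() picks up dc):

* show states that rreg dropped or strongly downweighted
list state murder rrwt if rrwt < .5 | missing(rrwt)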

In our analyses in chapter 2 (involving different variables) we found dc to be a very serious outlier and decided that it should be excluded because it is not a state. If we investigated these variables further, we might reach the same conclusion and decide that dc should be excluded. If we did, we could rerun the OLS regression without dc, as sketched below; those results are quite similar to the rreg results. The benefit of rreg is that it deals not only with serious problems (like dc being a very bad outlier) but also with minor ones.
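For example, assuming state holds the lowercase abbreviation "dc":

* OLS excluding Washington DC
regress murder pctmetro poverty pcths single if state != "dc"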

Let's try running the results using qreg and compare them with rreg.
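For example, the median (quantile) regression uses the same variable list:

* median regression of murder on the same predictors
qreg murder pctmetro poverty pcths single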

While the coefficients do not always match up, the variables that were significant in the qreg are also significant in the rreg and likewise for the non-significant variables. Even though these techniques use different strategies for resisting the influence of very deviant observations, they both arrive at the same conclusions regarding which variables are significantly related to murder, although they do not always agree in the strength of the relationship, i.e. the size of the coefficients.

2. Using the elemapi2 data file (use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2 ) pretend that 550 is the lowest score that a school could achieve on api00, i.e., create a new variable with the api00 score and recode it such that any score of 550 or below becomes 550. Use meals, ell and emer to predict api scores using 1) OLS to predict the original api score (before recoding) 2) OLS to predict the recoded score where 550 was the lowest value, and 3) using tobit to predict the recoded api score indicating the lowest value is 550. Compare the results of these analyses.

Answer 2.
First, we will use the elemapi2 data file and create the recoded version of the api score where the lowest value is 550. We will call this new variable api00x.
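For example, the recoding could be done like this:

* load the data, copy api00, and floor the copy at 550
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear
generate api00x = api00
replace api00x = 550 if api00 < 550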

Analysis 1. Now, we will run an OLS regression on the un-recoded version of api.

regress api00 meals ell emer 
      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =  673.00
       Model |  6749782.75     3  2249927.58           Prob > F      =  0.0000
    Residual |  1323889.25   396  3343.15467           R-squared     =  0.8360
-------------+------------------------------           Adj R-squared =  0.8348
       Total |  8073672.00   399  20234.7669           Root MSE      =   57.82

------------------------------------------------------------------------------
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -3.159189   .1497371   -21.10   0.000    -3.453568   -2.864809
         ell |  -.9098732   .1846442    -4.93   0.000    -1.272878   -.5468678
        emer |  -1.573496    .293112    -5.37   0.000    -2.149746   -.9972456
       _cons |   886.7033    6.25976   141.65   0.000     874.3967    899.0098
------------------------------------------------------------------------------

Analysis 2. Now, we run an OLS regression on the recoded version of api.

regress api00x meals ell emer  
      Source |       SS       df       MS              Number of obs =     400
-------------+------------------------------           F(  3,   396) =  682.88
       Model |  4567355.46     3  1522451.82           Prob > F      =  0.0000
    Residual |  882862.941   396  2229.45187           R-squared     =  0.8380
-------------+------------------------------           Adj R-squared =  0.8368
       Total |  5450218.40   399  13659.6952           Root MSE      =  47.217

------------------------------------------------------------------------------
      api00x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meals |  -3.010788   .1222786   -24.62   0.000    -3.251184   -2.770392
         ell |  -.3034092   .1507844    -2.01   0.045    -.5998472   -.0069713
        emer |  -.7484733   .2393616    -3.13   0.002    -1.219052    -.277895
       _cons |     869.31   5.111854   170.06   0.000     859.2602    879.3597
------------------------------------------------------------------------------

Analysis 3. And we use tobit to perform the analysis indicating that the lowest value possible was 550.
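For example, the left-censoring limit is given in the ll() option:

* tobit with a lower censoring limit of 550
tobit api00x meals ell emer, ll(550)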

First, let's compare analysis 1 and 2. When the range of api was restricted in analysis 2, the size of the coefficients dropped due to the restriction in range of the api scores. For example, the coefficient for ell dropped from -.9 to -.3 and its significance level changed to 0.045 (from highly significant to only marginally significant). Let's see how well the tobit analysis compensated for the restriction in range by comparing analyses 1 and 3. The coefficients are quite similar in these two analyses. The standard errors are slightly larger in the tobit analysis, leading the t values to be somewhat smaller. Nevertheless, the tobit estimates are much more on target than the second OLS analysis on the recoded data.

3. Using the elemapi2 data file (use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2 ) pretend that only schools with api scores of 550 or higher were included in the sample. Use meals ell and emer to predict api scores using 1) OLS to predict api from the full set of observations, 2) OLS to predict api using just the observations with api scores of 550 or higher, and 3) using truncreg to predict api using just the observations where api is 550 or higher. Compare the results of these analyses.

Answer 3.
First, we use the elemapi2 data file and run the analysis on the complete data.

Analysis 1 using all of the data.
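This is the same OLS model used in question 2, for example:

* OLS on the full sample of schools
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear
regress api00 meals ell emer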

Now let's keep just the schools with api scores of 550 or higher for the next 2 analyses.
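For example:

* keep only schools with api00 of 550 or higher
keep if api00 >= 550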

Analysis 2 using OLS on just the schools with api scores of 550 or higher.
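The same regression command, now run on the restricted sample:

* OLS on the restricted sample
regress api00 meals ell emer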

Analysis 3 using truncreg on just the schools with api scores of 550 or higher.
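For example, the lower truncation limit is given in the ll() option:

* truncated regression with a lower limit of 550
truncreg api00 meals ell emer, ll(550)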

Let's first compare the results of analysis 1 with analysis 2. When the schools with api scores of less than 550 are omitted, the coefficient for ell changes from -.9 to .35 and is no longer statistically significant. The coefficients for meals and emer remain significant, although they both drop as well.

Now, let's compare analysis 3 using truncreg with the original OLS analysis of the complete data. In both of these analyses, all of the variables are significant and the coefficients are quite similar, although the standard errors are larger in the truncreg. The truncreg did a pretty good job of recovering the coefficients of the complete sample using only the restricted sample.

4. Using the hsb2 data file (use http://www.ats.ucla.edu/stat/stata/webbooks/reg/hsb2 ) predict read from science, socst, math and write. Use the testparm and test commands to test the equality of the coefficients for science, socst and math. Use cnsreg to estimate a model where these three parameters are equal.

Answer 4.
We start by using the hsb2 data file.

We first run an ordinary regression predicting read from science, socst, math and write.
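For example:

* load hsb2 and run the ordinary regression
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/hsb2, clear
regress read science socst math write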

We use the testparm command to test that the coefficients for science, socst and math are equal.
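For example, the equal option of testparm tests that the listed coefficients are all equal to one another:

* test science = socst = math
testparm science socst math, equal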

We can also use the test command to test that the coefficients for science, socst and math are equal.
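For example, the same joint test can be built up with test and its accumulate option:

* accumulate the two pairwise equality constraints into one joint test
test science = socst
test socst = math, accumulate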

We now constrain these three coefficients to be equal.
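For example, two constraints are enough to make the three coefficients equal:

* define the equality constraints
constraint define 1 science = socst
constraint define 2 socst = math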

And we use cnsreg to estimate the model with these constraints in place.
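For example, assuming the constraints were defined as 1 and 2 above:

* constrained regression with both constraints applied
cnsreg read science socst math write, constraints(1 2)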

5. Using the elemapi2 data file (use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2 ) consider the following 2 regression equations.

api00 = meals ell emer 
api99 = meals ell emer 

Estimate the coefficients for these predictors in predicting api00 and api99 taking into account the non-independence of the schools. Test the overall contribution of each of the predictors in jointly predicting api scores in these two years. Test whether the contribution of emer is the same for api00 and api99.

Answer 5.
First, let's use the elemapi2 data file.

Next, let's analyze these equations separately.
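For example:

* load the data and fit the two equations separately by OLS
use http://www.ats.ucla.edu/stat/stata/webbooks/reg/elemapi2, clear
regress api00 meals ell emer
regress api99 meals ell emer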

Now, let's analyze them using sureg that takes into account the non-independence of these equations.
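For example, each equation is given in its own set of parentheses:

* seemingly unrelated regression for the two equations
sureg (api00 meals ell emer) (api99 meals ell emer)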

We can test the contribution of meals ell and emer as shown below.
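For example, after sureg, test with just a variable name tests that variable's coefficients in both equations jointly:

* joint test of each predictor across the two equations
test meals
test ell
test emer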

We can test whether the coefficients for emer were the same in predicting api00 and api99 as shown below.
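For example, the cross-equation test references each equation by name:

* test that the emer coefficient is the same in both equations
test [api00]emer = [api99]emer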

We can also test the contribution of meals ell and emer using more traditional multivariate tests using the mvreg and mvtest commands as shown below.
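For example, the multivariate regression can be fit with mvreg (the mvtest command whose output is shown below is a user-written add-on run after it):

* multivariate regression of both outcomes on the predictors
mvreg api00 api99 = meals ell emer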

Below we show the multivariate tests for meals, ell, and emer.

mvtest meals

                      MULTIVARIATE TESTS OF SIGNIFICANCE

Multivariate Test Criteria and Exact F Statistics for
the Hypothesis of no Overall "meals" Effect(s)

                            S=1    M=0    N=196.5

Test                          Value          F       Num DF     Den DF   Pr > F
Wilks' Lambda              0.43558762   255.9105          2   395.0000   0.0000
Pillai's Trace             0.56441238   255.9105          2   395.0000   0.0000
Hotelling-Lawley Trace     1.29574936   255.9105          2   395.0000   0.0000

mvtest ell
                      MULTIVARIATE TESTS OF SIGNIFICANCE

Multivariate Test Criteria and Exact F Statistics for
the Hypothesis of no Overall "ell" Effect(s)

                            S=1    M=0    N=196.5

Test                          Value          F       Num DF     Den DF   Pr > F
Wilks' Lambda              0.94161436    12.2462          2   395.0000   0.0000
Pillai's Trace             0.05838564    12.2462          2   395.0000   0.0000
Hotelling-Lawley Trace     0.06200590    12.2462          2   395.0000   0.0000

mvtest emer

                      MULTIVARIATE TESTS OF SIGNIFICANCE

Multivariate Test Criteria and Exact F Statistics for
the Hypothesis of no Overall "emer" Effect(s)

                            S=1    M=0    N=196.5

Test                          Value          F       Num DF     Den DF   Pr > F
Wilks' Lambda              0.93136794    14.5537          2   395.0000   0.0000
Pillai's Trace             0.06863206    14.5537          2   395.0000   0.0000
Hotelling-Lawley Trace     0.07368952    14.5537          2   395.0000   0.0000
