UCLA Academic Technology Services

Stata Textbook Examples
Applied Linear Statistical Models by Neter, Kutner, et al.
Chapter 9: Building the Regression Model II: Diagnostics

Inputting the Life Insurance data, table 9.1, p. 364.
clear
input x1 x2 y
45.010   6   91
57.204   4  162
26.852   5   11
66.290   7  240
40.964   5   73
72.996  10  311
79.380   1  316
52.766   8  154
55.916   6  164
38.122   4   54
35.840   6   53
75.796   9  326
37.408   5   55
54.376   2  130
46.186   7  112
46.130   4   91
30.366   3   14
39.060   5   63
end
First-order linear regression of y on x1 x2 (9.3), p. 364.
regress y x1 x2

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  2,    15) =  542.33
       Model |  173919.296     2  86959.6481           Prob > F      =  0.0000
    Residual |  2405.14824    15  160.343216           R-squared     =  0.9864
-------------+------------------------------           Adj R-squared =  0.9845
       Total |  176324.444    17  10372.0261           Root MSE      =  12.663

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   6.288029   .2041495    30.80   0.000     5.852895    6.723163
          x2 |   4.737601    1.37808     3.44   0.004     1.800294    7.674908
       _cons |  -205.7187   11.39268   -18.06   0.000    -230.0016   -181.4357
------------------------------------------------------------------------------
Regressing both y and x1 on x2 (9.4a and 9.4b), p. 364.
regress y x2

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =    2.26
       Model |  21800.4617     1  21800.4617           Prob > F      =  0.1525
    Residual |  154523.983    16  9657.74892           R-squared     =  0.1236
-------------+------------------------------           Adj R-squared =  0.0689
       Total |  176324.444    17  10372.0261           Root MSE      =  98.274

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   15.53969   10.34302     1.50   0.152    -6.386539    37.46592
       _cons |   50.70277   60.35893     0.84   0.413    -77.25244     178.658
------------------------------------------------------------------------------

regress x1 x2

      Source |       SS       df       MS              Number of obs =      18
-------------+------------------------------           F(  1,    16) =    1.11
       Model |  266.420425     1  266.420425           Prob > F      =  0.3082
    Residual |  3847.28124    16  240.455077           R-squared     =  0.0648
-------------+------------------------------           Adj R-squared =  0.0063
       Total |  4113.70166    17  241.982451           Root MSE      =  15.507

------------------------------------------------------------------------------
          x1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2 |   1.717882   1.632024     1.05   0.308    -1.741854    5.177618
       _cons |    40.7793   9.524025     4.28   0.001     20.58927    60.96933
------------------------------------------------------------------------------
Fig. 9.3a and 9.3b, p. 365.
Note 1: We do not need to see the regression output again, so we use the quietly prefix, which suppresses it.
Note 2: The rvpplot command plots the residuals against a predictor, and the avplot command generates the partial regression (added-variable) plot.
quietly regress y x1 x2
rvpplot x1, ylabel(-20(5)25) xlabel(20(10)80)
avplot x1, ylabel(-100(50)250) xlabel(-25(25)50)
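The partial regression plot for x1 can also be built by hand from the two auxiliary regressions (9.4a) and (9.4b) above. The following is a minimal sketch, assuming the Life Insurance data are still in memory; the variable names e_y and e_x1 are our own. The slope from regressing one set of residuals on the other should match the x1 coefficient, 6.288029, from the full regression.

```stata
* Residuals of y after removing the linear effect of x2 (9.4a)
quietly regress y x2
predict e_y, residuals
* Residuals of x1 after removing the linear effect of x2 (9.4b)
quietly regress x1 x2
predict e_x1, residuals
* The slope here equals the x1 coefficient in regress y x1 x2
regress e_y e_x1
* Scatterplot of the two sets of residuals, as drawn by avplot x1
graph twoway scatter e_y e_x1
```

Because the last regression replaces the stored estimation results, rerun regress y x1 x2 before using further postestimation commands on the original model.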
Inputting the Body Fat data, table 7.1, p. 261.
Note: We use the clear command to drop the previous data set because Stata can hold only one data set in memory at a time.
clear
input x1 x2 x3 y
  19.5  43.1  29.1  11.9
  24.7  49.8  28.2  22.8
  30.7  51.9  37.0  18.7
  29.8  54.3  31.1  20.1
  19.1  42.2  30.9  12.9
  25.6  53.9  23.7  21.7
  31.4  58.5  27.6  27.1
  27.9  52.1  30.6  25.4
  22.1  49.9  23.2  21.3
  25.5  53.5  24.8  19.3
  31.1  56.6  30.0  25.4
  30.4  56.7  28.3  27.2
  18.7  46.5  23.0  11.7
  19.7  44.2  28.6  17.8
  14.6  42.7  21.3  12.8
  29.5  54.4  30.1  23.9
  27.7  55.3  25.7  22.6
  30.2  58.6  24.6  25.4
  22.7  48.2  27.1  14.8
  25.2  51.0  27.5  21.1
end
Regressing y on x1 x2, p. 365.
regress y x1 x2

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  2,    17) =   29.80
       Model |  385.438738     2  192.719369           Prob > F      =  0.0000
    Residual |  109.950775    17  6.46769267           R-squared     =  0.7781
-------------+------------------------------           Adj R-squared =  0.7519
       Total |  495.389513    19  26.0731323           Root MSE      =  2.5432

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .2223526   .3034389     0.73   0.474    -.4178475    .8625527
          x2 |   .6594218   .2911873     2.26   0.037     .0450704    1.273773
       _cons |  -19.17425    8.36064    -2.29   0.035    -36.81366   -1.534839
------------------------------------------------------------------------------
Fig. 9.4a, p. 366.
rvpplot x1, ylabel(-4(1)5) xlabel(10(5)35)
Fig. 9.4b, p. 366.
avplot x1, ylabel(-6(1)5) xlabel(-5(1)5)
Fig. 9.4c, p. 366.
rvpplot x2, ylabel(-4(1)5) xlabel(40(5)60)
Fig. 9.4d, p. 366.
avplot x2, ylabel(-6(1)5) xlabel(-5(1)5)
Inputting data for illustration of hat matrix, table 9.2, p. 371.
clear
input x1 x2 y
 14 25 301
 19 32 327
 12 22 246
 11 15 187
end
Regressing y on x1 x2 (9.17), p. 370.
regress y x1 x2

      Source |       SS       df       MS              Number of obs =       4
-------------+------------------------------           F(  2,     1) =    9.58
       Model |  11009.8607     2  5504.93035           Prob > F      =  0.2228
    Residual |  574.889291     1  574.889291           R-squared     =  0.9504
-------------+------------------------------           Adj R-squared =  0.8511
       Total |    11584.75     3  3861.58333           Root MSE      =  23.977

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -5.844605   11.74463    -0.50   0.706    -155.0743    143.3851
          x2 |   11.32528   5.931139     1.91   0.307    -64.03699    86.68755
       _cons |   80.93035   57.94355     1.40   0.396    -655.3123     817.173
------------------------------------------------------------------------------
predict yhat, xb
predict e, resid
predict hat, hat
predict stdp, stdp
gen s2e =  (e(rss)/e(df_r) )*(1-hat)
list yhat e hat stdp s2e

          yhat          e        hat       stdp        s2e
  1.  282.2379   18.76208   .3876812   14.92896   352.0155
  2.  332.2919  -5.291868   .9512882   23.38558   28.00388
  3.  259.9513  -13.95129   .6614332   19.50002   194.6384
  4.  186.5189   .4810789   .9995974   23.97202    .231433 
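The s2e column is the estimated variance of each residual. Written in the textbook's notation, with MSE equal to e(rss)/e(df_r) from the regression above, a worked check for the first observation is:

```latex
s^2\{e_i\} = MSE\,(1 - h_{ii}), \qquad
s^2\{e_1\} = 574.889291 \times (1 - 0.3876812) \approx 352.02
```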
Generating the hat matrix and the estimated variance-covariance matrix of the residuals, table 9.2b and 9.2c, p. 371.
gen constant = 1
mkmat constant x1 x2, matrix(X)
matrix list X

X[4,3]
    constant        x1        x2
r1         1        14        25
r2         1        19        32
r3         1        12        22
r4         1        11        15
matrix H = X*inv(X'*X)*X'
matrix list H

symmetric H[4,4]
            r1          r2          r3          r4
r1   .38768116
r2   .17270531   .95128824
r3   .45531401   -.1284219   .66143317
r4  -.01570048   .00442834   .01167472   .99959742

matrix s2e = (e(rss)/e(df_r) )*(I(4) -H)
matrix list s2e

symmetric s2e[4,4]
            r1          r2          r3          r4
r1   352.01554
r2  -99.286436   28.003866
r3  -261.75515   73.828375   194.63844
r4   9.0260396   -2.545806  -6.7116705   .23143691
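Two quick checks on H, offered as a sketch that assumes the matrices X and H created above are still in memory (HH is a name of our own): the trace of H equals the number of regression parameters, here 3, and H is idempotent.

```stata
* Trace of the hat matrix: equals p, the number of columns of X
display trace(H)
* Idempotency: H*H should reproduce H up to rounding error,
* so the relative difference should be essentially zero
matrix HH = H*H
display mreldif(HH, H)
```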
Returning to the Body Fat data and computing the residuals, the diagonal elements of the hat matrix, and the studentized deleted residuals for the regression of y on x1 x2, table 9.3, p. 375.
Note: Since we are not interested in the output from the regression, we use the quietly prefix, which suppresses the output.
clear
input x1 x2 x3 y
  19.5  43.1  29.1  11.9
  24.7  49.8  28.2  22.8
  30.7  51.9  37.0  18.7
  29.8  54.3  31.1  20.1
  19.1  42.2  30.9  12.9
  25.6  53.9  23.7  21.7
  31.4  58.5  27.6  27.1
  27.9  52.1  30.6  25.4
  22.1  49.9  23.2  21.3
  25.5  53.5  24.8  19.3
  31.1  56.6  30.0  25.4
  30.4  56.7  28.3  27.2
  18.7  46.5  23.0  11.7
  19.7  44.2  28.6  17.8
  14.6  42.7  21.3  12.8
  29.5  54.4  30.1  23.9
  27.7  55.3  25.7  22.6
  30.2  58.6  24.6  25.4
  22.7  48.2  27.1  14.8
  25.2  51.0  27.5  21.1
end
quietly regress y x1 x2
predict resid, residuals
predict hat, hat
predict student, rstudent
list resid hat student

         resid        hat    student
  1. -1.682708   .2010126  -.7299849
  2.  3.642931   .0588948   1.534254
  3. -3.175971   .3719329   -1.65433
  4. -3.158464   .1109401  -1.348484
  5. -.0002889   .2480103  -.0001271
  6. -.3608158   .1286163  -.1475492
  7.  .7161994   .1555175   .2981277
  8.  4.014733   .0962878   1.760093
  9.  2.655104   .1146357   1.117648
 10. -2.474812   .1102444  -1.033729
 11.  .3358067   .1203366   .1366612
 12.  2.225511   .1092663   .9231787
 13. -3.946861   .1783818  -1.825903
 14.  3.447455   .1480068   1.524763
 15.  .5705876    .333212   .2671503
 16.   .642297   .0952774   .2581318
 17. -.8509458   .1055946  -.3445088
 18. -.7829196   .1967927   -.334408
 19. -2.857289   .0669542  -1.176171
 20.  1.040449   .0500853   .4093566 
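The student column holds the studentized deleted residuals. In the textbook's notation they can be computed from the ordinary residuals and leverages without refitting the model. As a worked check, observation 13 with SSE = 109.950775, n = 20 and p = 3 gives:

```latex
t_i = e_i \left[ \frac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2} \right]^{1/2},
\qquad
t_{13} = -3.946861 \left[ \frac{16}{109.950775\,(1 - 0.1783818) - (-3.946861)^2} \right]^{1/2} \approx -1.826
```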
Identifying potential outliers just by looking at the observations with the largest absolute studentized residuals, p. 374.
list resid hat student if abs(student) > 1.6

         resid        hat    student
  3. -3.175971   .3719329   -1.65433
  8.  4.014733   .0962878   1.760093
 13. -3.946861   .1783818  -1.825903 
Using the Bonferroni simultaneous test procedure to determine whether any of these three really are outliers, alpha = .10, p. 375.
Note: All the potential outliers are nonsignificant when tested using the Bonferroni simultaneous test procedure.
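Note: We first demonstrate how to get the correct number for n-p-1 and for 1-alpha/(2*n) before we calculate the critical value using the invttail function. Because invttail takes an upper-tail probability, we pass it alpha/(2*n) rather than 1-alpha/(2*n). Then we calculate the Bonferroni-adjusted p-value, which is 2*n times the upper-tail probability of the absolute studentized deleted residual, capped at 1. For observation 13 the absolute value 1.826 is well below the critical value of 3.252, so the adjusted p-value is 1, which is not significant at the .10 level, and we conclude that observation 13 is not an outlier.
display e(df_r) -1
16

display 1 - .1/(2*e(N))
.9975

display invttail(e(df_r) -1 , .1/(2*e(N)) )
3.2519929

gen p_val = min(1, 2*e(N)*ttail(e(df_r) -1 , abs(student)))
list student p_val if abs(student) > 1.8

       student      p_val
 13. -1.825903          1 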
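The decision rule behind these commands, in the textbook's notation with alpha = .10, n = 20 and p = 3, is:

```latex
|t_i| > t\!\left(1 - \frac{\alpha}{2n};\; n - p - 1\right) = t(0.9975;\, 16) \approx 3.252
\quad\Longrightarrow\quad \text{case } i \text{ is declared an outlier}
```

The largest absolute studentized deleted residual in the table, 1.826 for observation 13, is well below this bound.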
Using the data for illustration of hat matrix again, table 9.2, p. 371.
clear
input x1 x2 y
14 25 301
 19 32 327
 12 22 246
 11 15 187
end
Calculating the diagonal elements of the hat matrix and generating fig. 9.6, p. 376.
Note: The mlabel(hat) option labels each plotted point with its value of the hat variable.
quietly regress y x1 x2
predict hat, hat
replace hat=round(hat, .0001)
graph twoway scatter x2 x1, mlabel(hat) ylabel(15 25 35) xlabel(10 15 20)
Back to the Body Fat data.
clear
input x1 x2 x3 y
  19.5  43.1  29.1  11.9
  24.7  49.8  28.2  22.8
  30.7  51.9  37.0  18.7
  29.8  54.3  31.1  20.1
  19.1  42.2  30.9  12.9
  25.6  53.9  23.7  21.7
  31.4  58.5  27.6  27.1
  27.9  52.1  30.6  25.4
  22.1  49.9  23.2  21.3
  25.5  53.5  24.8  19.3
  31.1  56.6  30.0  25.4
  30.4  56.7  28.3  27.2
  18.7  46.5  23.0  11.7
  19.7  44.2  28.6  17.8
  14.6  42.7  21.3  12.8
  29.5  54.4  30.1  23.9
  27.7  55.3  25.7  22.6
  30.2  58.6  24.6  25.4
  22.7  48.2  27.1  14.8
  25.2  51.0  27.5  21.1
end
Fig. 9.7, p. 378.
gen id = _n
graph twoway scatter x2 x1, mlabel(id) msymbol(i) ylabel(42(2)60) xlabel(14(2)32)
Regressing y on x1 x2 and calculating the DFFITS, Cook's distance, and DFBETAs, table 9.4, p. 380.
quietly regress y x1 x2
predict dfits, dfits
predict cooksd, cooksd
predict dfbeta1, dfbeta(x1)
predict dfbeta2, dfbeta(x2)
list dfits cooksd dfbeta1 dfbeta2

         dfits     cooksd    dfbeta1    dfbeta2
  1. -.3661471   .0459505  -.1314856   .2320319
  2.  .3838104   .0454812   .1150253  -.1426131
  3. -1.273067   .4901566  -1.182525   1.066903
  4. -.4763481   .0721618  -.2935194   .1960718
  5.  -.000073   1.89e-09  -.0000306   .0000503
  6. -.0566866   .0011365   .0400812  -.0442677
  7.  .1279371   .0057649   -.015613   .0543164
  8.  .5745215   .0979386   .3911265  -.3324536
  9.  .4021648   .0531335  -.2946556   .2469092
 10. -.3638727   .0439571   .2446011  -.2688087
 11.  .0505459   .0009038   .0170564  -.0024845
 12.  .3233367   .0351544   .0224579   .0699963
 13. -.8507811   .2121502   .5924201  -.3894911
 14.   .635514   .1248925   .1131721  -.2977041
 15.  .1888523   .0125753  -.1247569   .0687694
 16.  .0837681   .0024749   .0431133  -.0251249
 17. -.1183734   .0049261   .0550435    -.07609
 18. -.1655264   .0096365   .0753286  -.1161002
 19. -.3150707   .0323601   -.004072   .0644293
 20.  .0939971   .0030968   .0022908  -.0033142 
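A common next step is to flag cases that exceed rule-of-thumb cutoffs. The sketch below uses an absolute value of 1 for DFFITS and DFBETAS, a guideline often suggested for small to medium data sets, and refers the largest Cook's distance to the F(p, n-p) distribution. The cutoffs are assumptions rather than fixed rules, and the code relies on the variables and the regression above still being in memory.

```stata
* Cases whose DFFITS or DFBETAS exceed 1 in absolute value
list id dfits cooksd dfbeta1 dfbeta2 if abs(dfits) > 1 | abs(dfbeta1) > 1 | abs(dfbeta2) > 1
* Percentile of the largest Cook's distance in the F(p, n-p) distribution
quietly summarize cooksd
display F(e(df_m)+1, e(df_r), r(max))
```

In the table above, only observation 3 meets the first condition.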
Fig. 9.8a, p. 382.
predict predict, xb
predict resid, resid
graph twoway scatter resid predict [w=cooksd] ///
	, ylabel(-4.5(1.5)4.5) xlabel(10(5)30) msymbol(o)
Fig. 9.8b, p. 382.
graph twoway scatter cooksd id, connect(l) ylabel(0(.1).5) xlabel(0(5)25)
Calculating the regression equations for regressing y on x1 x2 with and without observation 3, p. 384.
regress y x1 x2

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  2,    17) =   29.80
       Model |  385.438738     2  192.719369           Prob > F      =  0.0000
    Residual |  109.950775    17  6.46769267           R-squared     =  0.7781
-------------+------------------------------           Adj R-squared =  0.7519
       Total |  495.389513    19  26.0731323           Root MSE      =  2.5432

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .2223526   .3034389     0.73   0.474    -.4178475    .8625527
          x2 |   .6594218   .2911873     2.26   0.037     .0450704    1.273773
       _cons |  -19.17425    8.36064    -2.29   0.035    -36.81366   -1.534839
------------------------------------------------------------------------------
regress y x1 x2 if id ~= 3

      Source |       SS       df       MS              Number of obs =      19
-------------+------------------------------           F(  2,    16) =   34.01
       Model |  399.146133     2  199.573067           Prob > F      =  0.0000
    Residual |  93.8907246    16  5.86817029           R-squared     =  0.8096
-------------+------------------------------           Adj R-squared =  0.7858
       Total |  493.036858    18  27.3909366           Root MSE      =  2.4224

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5641417   .3552815     1.59   0.132    -.1890215    1.317305
          x2 |    .363502   .3300409     1.10   0.287    -.3361535    1.063158
       _cons |  -12.42817   8.947045    -1.39   0.184    -31.39506     6.53872
------------------------------------------------------------------------------
Calculating the VIF for the regression of y on x1 x2 x3, p. 388.
quietly regress y x1 x2 x3

* Stata 8 code.
vif

* Stata 9 code and output.
estat vif

    Variable |       VIF       1/VIF
-------------+----------------------
          x1 |    708.84    0.001411
          x2 |    564.34    0.001772
          x3 |    104.61    0.009560
-------------+----------------------
    Mean VIF |    459.26
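The VIF for a predictor is 1/(1 - R-squared), where R-squared comes from regressing that predictor on the other predictors. A minimal check for x1, assuming the Body Fat data are still in memory, is sketched below; the result should agree with the 708.84 in the table.

```stata
* Auxiliary regression of x1 on the remaining predictors
quietly regress x1 x2 x3
* VIF for x1 is 1/(1 - R-squared) from that regression
display 1/(1 - e(r2))
```

The auxiliary regression replaces the stored estimation results, so rerun the original regression before using other postestimation commands.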
Returning to the Surgical Unit Example, inputting the data, p. 388.
clear
input x1 x2 x3 x4 y logy
   6.7  62   81  2.59  200  2.3010
   5.1  59   66  1.70  101  2.0043
   7.4  57   83  2.16  204  2.3096
   6.5  73   41  2.01  101  2.0043
   7.8  65  115  4.30  509  2.7067
   5.8  38   72  1.42   80  1.9031
   5.7  46   63  1.91   80  1.9031
   3.7  68   81  2.57  127  2.1038
   6.0  67   93  2.50  202  2.3054
   3.7  76   94  2.40  203  2.3075
   6.3  84   83  4.13  329  2.5172
   6.7  51   43  1.86   65  1.8129
   5.8  96  114  3.95  830  2.9191
   5.8  83   88  3.95  330  2.5185
   7.7  62   67  3.40  168  2.2253
   7.4  74   68  2.40  217  2.3365
   6.0  85   28  2.98   87  1.9395
   3.7  51   41  1.55   34  1.5315
   7.3  68   74  3.56  215  2.3324
   5.6  57   87  3.02  172  2.2355
   5.2  52   76  2.85  109  2.0374
   3.4  83   53  1.12  136  2.1335
   6.7  26   68  2.10   70  1.8451
   5.8  67   86  3.40  220  2.3424
   6.3  59  100  2.95  276  2.4409
   5.8  61   73  3.50  144  2.1584
   5.2  52   86  2.45  181  2.2577
  11.2  76   90  5.59  574  2.7589
   5.2  54   56  2.71   72  1.8573
   5.8  76   59  2.58  178  2.2504
   3.2  64   65  0.74   71  1.8513
   8.7  45   23  2.52   58  1.7634
   5.0  59   73  3.50  116  2.0645
   5.8  72   93  3.30  295  2.4698
   5.4  58   70  2.64  115  2.0607
   5.3  51   99  2.60  184  2.2648
   2.6  74   86  2.05  118  2.0719
   4.3   8  119  2.85  120  2.0792
   4.8  61   76  2.45  151  2.1790
   5.4  52   88  1.81  148  2.1703
   5.2  49   72  1.84   95  1.9777
   3.6  28   99  1.30   75  1.8751
   8.8  86   88  6.40  483  2.6840
   6.5  56   77  2.85  153  2.1847
   3.4  77   93  1.48  191  2.2810
   6.5  40   84  3.00  123  2.0899
   4.5  73  106  3.05  311  2.4928
   4.8  86  101  4.10  398  2.5999
   5.1  67   77  2.86  158  2.1987
   3.9  82  103  4.55  310  2.4914
   6.6  77   46  1.95  124  2.0934
   6.4  85   40  1.21  125  2.0969
   6.4  59   85  2.33  198  2.2967
   8.8  78   72  3.20  313  2.4955
end
Calculating the VIF for the regression model logy regressed on x1 x2 x3, p. 389.
quietly regress logy x1 x2 x3

* Stata 8 code.
vif

* Stata 9 code and output.
estat vif

    Variable |       VIF       1/VIF
-------------+----------------------
          x1 |      1.03    0.970108
          x3 |      1.02    0.977506
          x2 |      1.01    0.991774
-------------+----------------------
    Mean VIF |      1.02
Fig. 9.9a, p. 390.
predict resid, residual
rvfplot, yline(0) ylabel(-.15(.05).15) xlabel(1.5(.25)3)
Fig. 9.9b, p. 390.
graph twoway scatter resid x4, yline(0) ylabel(-.15(.05).15) xlabel(0(1)7)
Fig. 9.9c, p. 390.
avplot x1, ylabel(-.3(.1).4) xlabel(-4(2)6)
Fig. 9.9d, p. 390.
qnorm resid, ylabel(-.1(.05).15) xlabel(-.15(.05).15)
Computing various diagnostics for outlying cases.
quietly regress logy x1 x2 x3
predict hat, hat
predict student, rstudent
predict dfits, dfits
predict cooksd, cooksd
gen id = _n
Computing the Bonferroni test procedure for observation 22.
Note: We first demonstrate how to get the correct number for n-p-1 and for 1-alpha/(2*n) before we calculate the critical value using the invttail function, which takes the upper-tail probability alpha/(2*n). Then we calculate the Bonferroni-adjusted p-value, which is 2*n times the upper-tail probability of the absolute studentized deleted residual, capped at 1. The studentized deleted residual for observation 22, 3.495, falls just below the critical value of 3.526, so its adjusted p-value is slightly above .05 and it is not significant at the .05 level. However, the studentized residual is close enough to the critical value to warrant closer scrutiny.
display e(df_r) -1
49

display 1-.05/(2*e(N))
.99953704

display invttail(e(df_r) -1 , .05/(2*e(N)) )
3.5260926

gen p_val = min(1, 2*e(N)*ttail(e(df_r) -1 , abs(student)))
list student if p_val < .1

       student
 22.  3.495166 
Calculating the leverage cutoff, 2p/n, for the outlying X observations.
display ( 2*(e(df_m)+1) )/e(N)
.14814815
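This is the 2p/n guideline for leverage values, where p = e(df_m)+1 = 4 regression parameters and n = 54 cases:

```latex
h_{ii} > \frac{2p}{n} = \frac{2 \times 4}{54} \approx 0.148
```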
Table 9.6, p. 391.
list resid hat student dfits cooksd if ( p_val < .1 | hat > ( 2*(e(df_m)+1) )/e(N) | resid > .11)

         resid        hat    student      dfits     cooksd
 13.  .0560032   .1494899   1.304571   .5469329   .0737486
 17.  -.016169   .1499111  -.3708858  -.1557489   .0061709
 22.  .1383144   .1273679   3.495166    1.33531   .3640895
 27.  .1117597   .0310852   2.552272   .4571527   .0470576
 28. -.0635543   .2618725  -1.602709  -.9546278   .2208982
 32.   .040223   .2112798   .9655761   .4997514   .0625225
 38.  .0902419   .2901519    2.39032    1.52822   .5335639 
The final regression model (9.46), p. 392.
regress logy x1 x2 x3

      Source |       SS       df       MS              Number of obs =      54
-------------+------------------------------           F(  3,    50) =  586.04
       Model |  3.86291372     3  1.28763791           Prob > F      =  0.0000
    Residual |  .109858708    50  .002197174           R-squared     =  0.9723
-------------+------------------------------           Adj R-squared =  0.9707
       Total |  3.97277243    53   .07495797           Root MSE      =  .04687

------------------------------------------------------------------------------
        logy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .0692251   .0040779    16.98   0.000     .0610343    .0774159
          x2 |   .0092945   .0003825    24.30   0.000     .0085263    .0100628
          x3 |   .0095236   .0003064    31.08   0.000     .0089082    .0101391
       _cons |   .4836209   .0426287    11.34   0.000     .3979985    .5692432
------------------------------------------------------------------------------

These pages are Copyrighted (c) by UCLA Academic Technology Services