UCLA Academic Technology Services

Stata Textbook Examples
Applied Linear Statistical Models by Neter, Kutner, et al.
Chapter 3: Diagnostics and Remedial Measures

Inputting the Toluca Company data.

clear
input x y
   80  399
   30  121
   50  221
   90  376
   70  361
   60  224
  120  546
   80  352
  100  353
   50  157
   40  160
   70  252
   90  389
   20  113
  110  435
  100  420
   30  212
   50  268
   90  377
  110  421
   30  273
   90  468
   40  244
   80  342
   70  323
end

Generate an id variable for the sequence plot. Label the x variable "Lot size", the y variable "Work hrs." and the id variable "Run".

generate id = _n
label variable x "Lot size"
label variable y "Work hrs."
label variable id "Run"

Fig. 3.1a, p. 96.

dotplot x, nx(50)

Fig. 3.1b, p. 96.

twoway connected x id

Fig. 3.1c, p. 96.

stem x

   2* | 0
   3* | 000
   4* | 00
   5* | 000
   6* | 0
   7* | 000
   8* | 000
   9* | 0000
  10* | 00
  11* | 00
  12* | 0

Fig. 3.1d, p. 96.

graph box x

Fig. 3.2a, p. 99.

regress y x
predict r, resid
twoway scatter r x

Fig. 3.2b, p. 99.

twoway connected r id, sort(id)

Fig. 3.2c, p. 99.

graph box r

Fig. 3.2d, p. 99.

qnorm r

Inputting and labeling the Transit data, p. 100.

clear
input y x
.60   80
6.70  220
5.30  140
4.00  120
6.55  180
2.15  100
6.60  200
5.75  160
end
label variable y "Ridership"
label variable x "Maps"

Fig. 3.3a, p. 100.

twoway (scatter y x) (lfit y x)

Fig. 3.3b, p. 100.

regress y x
predict r, resid
twoway scatter r x, yline(0)

Table 3.1, p. 100.

regress y x
predict yhat
* r already exists from Fig. 3.3b, so drop it before predicting again
capture drop r
predict r, resid
list y x yhat r
     +-----------------------------------+
     |    y     x       yhat           r |
     |-----------------------------------|
  1. |   .6    80     1.6625     -1.0625 |
  2. |  6.7   220       7.75       -1.05 |
  3. |  5.3   140   4.271429    1.028572 |
  4. |    4   120   3.401786    .5982142 |
  5. | 6.55   180   6.010714    .5392859 |
     |-----------------------------------|
  6. | 2.15   100   2.532143   -.3821428 |
  7. |  6.6   200   6.880357   -.2803572 |
  8. | 5.75   160   5.141071    .6089286 |
     +-----------------------------------+
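Table 3.2, p. 106.
Note: This table uses the Toluca Company data. Re-enter that data set and regenerate id, then rerun regress y x and predict r, resid to recreate the residuals before running the commands below. The constant 2384 is the MSE from that regression and 25 is the number of observations.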
egen rank = rank(r)
gen prob = (rank-.375)/(25+.25) 
gen expected = sqrt(2384)*invnormal(prob)
list id r rank expected
     +------------------------------------+
     | run           r   rank    expected |
     |------------------------------------|
  1. |   1    51.01798     22    51.97266 |
  2. |   2   -48.47192      5   -44.10749 |
  3. |   3   -19.87596     10   -14.76319 |
  4. |   4   -7.684041     11   -9.758779 |
  5. |   5       48.72     21     44.1075 |
     |------------------------------------|
  6. |   6   -52.57798      4   -51.97266 |
  7. |   7     55.2099     23    61.48703 |
  8. |   8     4.01798     15    9.758775 |
  9. |   9   -66.38606      2   -74.17666 |
 10. |  10   -83.87596      1   -95.90529 |
     |------------------------------------|
 11. |  11   -45.17394      6   -37.24776 |
 12. |  12      -60.28      3   -61.48703 |
 13. |  13    5.315959     16    14.76319 |
 14. |  14    -20.7699      8   -25.32683 |
 15. |  15   -20.08808      9   -19.92811 |
     |------------------------------------|
 16. |  16    .6139394     14    4.855084 |
 17. |  17    42.52808     20    37.24776 |
 18. |  18    27.12404     18    25.32683 |
 19. |  19   -6.684041     12   -4.855084 |
 20. |  20   -34.08808      7   -31.05527 |
     |------------------------------------|
 21. |  21    103.5281     25    95.90527 |
 22. |  22    84.31596     24    74.17666 |
 23. |  23    38.82606     19    31.05527 |
 24. |  24    -5.98202     13           0 |
 25. |  25       10.72     17    19.92811 |
     +------------------------------------+
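
The hard-coded constants above (2384 and 25) can also be taken from the regression results. The following is a minimal sketch, assuming the Toluca data are in memory with x, y and id defined. The names rank2, expected2, rmse and nobs are hypothetical, chosen only to avoid clashing with the variables created above. Because e(rmse) is the unrounded square root of the MSE, the values may differ slightly from the table. The final correlate command gives the correlation between the residuals and their expected values under normality.

```stata
* Assumes the Toluca data are in memory with variables x, y and id.
quietly regress y x
scalar rmse = e(rmse)
scalar nobs = e(N)
capture drop r
predict r, resid
* rank2 and expected2 are hypothetical names used only in this sketch
egen rank2 = rank(r)
gen expected2 = rmse*invnormal((rank2 - .375)/(nobs + .25))
list id r rank2 expected2
correlate r expected2
```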

The modified Levene test of constancy of error variance, pp. 112-114, and Table 3.3, p. 114. The variable mr holds the median of the residuals for each group, and md holds the mean of the absolute deviations d for each group (the group means reported by ttest).
Note: Stata's robvar command reports Levene's statistic (W0) together with median-based (W50) and trimmed-mean (W10) variants. The commands below instead compute the modified Levene t statistic step by step, as in the text.

regress y x
* drop r first in case it is left over from Table 3.2
capture drop r
predict r, resid
predict yhat
gen group = 1
replace group = 2 if x>70
sort group
by group: egen mr = median(r)
gen d = abs(r-mr)
by group: egen md = mean(d)
gen ddif = (d-md)^2
sort group
by group: list id x r d ddif
ttest d, by(group)
-> group = 1

     +--------------------------------------------+
     | run    x           r          d       ddif |
     |--------------------------------------------|
  1. |   6   60   -52.57798   32.70202   146.7261 |
  2. |  10   50   -83.87596         64   368.0613 |
  3. |  14   20    -20.7699     .89394   1929.066 |
  4. |  11   40   -45.17394   25.29798    380.917 |
  5. |  17   30    42.52808   62.40404   309.3716 |
     |--------------------------------------------|
  6. |   2   30   -48.47192   28.59596   263.0597 |
  7. |  18   50    27.12404         47   4.773898 |
  8. |   3   50   -19.87596          0   2008.391 |
  9. |  23   40    38.82606   58.70202   192.8472 |
 10. |  21   30    103.5281    123.404   6176.226 |
     |--------------------------------------------|
 11. |  25   70       10.72   30.59596   202.1833 |
 12. |  12   70      -60.28   40.40404   19.45725 |
 13. |   5   70       48.72   68.59596   565.5306 |
     +--------------------------------------------+

----------------------------------------------------
-> group = 2

     +---------------------------------------------+
     | run     x           r          d       ddif |
     |---------------------------------------------|
  1. |   8    80     4.01798    6.70202   472.9893 |
  2. |  13    90    5.315959          8   418.2162 |
  3. |   1    80    51.01798   53.70202   637.6475 |
  4. |  24    80    -5.98202    3.29798   632.6411 |
  5. |  15   110   -20.08808   17.40404   122.0206 |
     |---------------------------------------------|
  6. |  20   110   -34.08808   31.40404   8.724372 |
  7. |  22    90    84.31596         87   3428.063 |
  8. |   9   100   -66.38606   63.70202   1242.681 |
  9. |   4    90   -7.684041          5   549.9183 |
 10. |   7   120     55.2099   57.89394   866.9258 |
     |---------------------------------------------|
 11. |  16   100    .6139394    3.29798   632.6411 |
 12. |  19    90   -6.684041          4    597.819 |
     +---------------------------------------------+
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |      13    44.81507    8.975255    32.36074    25.25967    64.37047
       2 |      12    28.45034    8.532597    29.55778    9.670218    47.23046
---------+--------------------------------------------------------------------
combined |      25       36.96    6.304496    31.52248    23.94816    49.97184
---------+--------------------------------------------------------------------
    diff |            16.36474    12.43066               -9.350043    42.07952
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.3165
Ho: diff = 0                                     degrees of freedom =       23

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.8995         Pr(|T| > |t|) = 0.2010          Pr(T > t) = 0.1005
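
As a cross-check, the median-centered statistic W50 reported by robvar is, as far as we can tell, the F statistic from a one-way analysis of variance on the same absolute deviations d. With two groups it should therefore equal the square of the t statistic above. This is a sketch that assumes r and group still exist from the commands above.

```stata
* Assumes r and group exist from the modified Levene steps above.
robvar r, by(group)
* Square of the t statistic reported by ttest, for comparison with W50
display 1.3165^2
```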

The Breusch-Pagan test, p. 115.
Note 1: Stata has a user-written command, bpagan, that computes the Breusch-Pagan test statistic and p-value. You can download it from within Stata while you are online by typing findit bpagan (see "How can I use the findit command to search for programs and get additional help?" for more information about using findit).
Note 2: The book reports the p-value for this test as .64. We believe the correct p-value is .36 = 1 - .64, as reported in the output below.

regress y x
bpagan x
Breusch-Pagan LM statistic:  .8209193  Chi-sq( 1)  P-value =  .3649
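
If bpagan is not installed, the built-in postestimation command estat hettest offers a similar test. The sketch below assumes the Toluca data are in memory. With its default settings, estat hettest x should give a chi-squared statistic close to the LM statistic above, but treat that as an expectation to verify rather than a guarantee. The last line recomputes the p-value directly from the reported statistic with chi2tail().

```stata
* Assumes the Toluca data are in memory.
quietly regress y x
estat hettest x
* Upper-tail chi-squared probability for the LM statistic on 1 df
display chi2tail(1, .8209193)
```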
Inputting the Bank data, p. 117.
clear
input x y
  125  160
  100  112
  200  124
   75   28
  150  152
  175  156
   75   42
  175  124
  125  150
  200  104
  100  136
end
label variable x "deposit"
label variable y "new accounts"
gen branch = _n

Table 3.4a, p. 117

list branch x y

     +--------------------+
     | branch     x     y |
     |--------------------|
  1. |      1   125   160 |
  2. |      2   100   112 |
  3. |      3   200   124 |
  4. |      4    75    28 |
  5. |      5   150   152 |
     |--------------------|
  6. |      6   175   156 |
  7. |      7    75    42 |
  8. |      8   175   124 |
  9. |      9   125   150 |
 10. |     10   200   104 |
     |--------------------|
 11. |     11   100   136 |
     +--------------------+
Table 3.4b, the ANOVA table, p. 117.
regress y x
      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) =    3.14
       Model |  5141.33841     1  5141.33841           Prob > F      =  0.1102
    Residual |  14741.5707     9   1637.9523           R-squared     =  0.2586
-------------+------------------------------           Adj R-squared =  0.1762
       Total |  19882.9091    10  1988.29091           Root MSE      =  40.472

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .4867016   .2747105     1.77   0.110    -.1347368     1.10814
       _cons |   50.72251   39.39791     1.29   0.230    -38.40176    139.8468
------------------------------------------------------------------------------
Table 3.5, p. 117.
drop branch
sort x
by x: egen mean = mean(y)
by x: gen replicate = _n
reshape wide y, i(x) j(replicate)
list
     +------------------------+
     |   x    y1    y2   mean |
     |------------------------|
  1. |  75    28    42     35 |
  2. | 100   112   136    124 |
  3. | 125   150   160    155 |
  4. | 150   152     .    152 |
  5. | 175   156   124    140 |
     |------------------------|
  6. | 200   104   124    114 |
     +------------------------+    
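
One way to build this table without leaving the data in wide form is to wrap the commands in preserve and restore. This is a sketch to run in place of the commands above, assuming the Bank data are in memory in long form as entered. After restore, the data return to the state they were in before preserve, so no reshape long is needed later.

```stata
* Assumes the Bank data are in memory in long form (x, y and possibly branch).
preserve
capture drop branch
sort x
by x: egen mean = mean(y)
by x: gen replicate = _n
reshape wide y, i(x) j(replicate)
list
* Return to the original long-form data
restore
```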

Table 3.6b, p. 123. The F statistic in the output is the test statistic used in the test for lack of fit.

Note: Stata has a user-written command, maxr2, that computes the lack-of-fit test. You can download it from within Stata while you are online by typing findit maxr2 (see "How can I use the findit command to search for programs and get additional help?" for more information about using findit).

Also note that if you used the reshape command to generate the previous table, you need to run the first line below (reshape long) to return the data to long form. If you did not use reshape, skip that line.

reshape long
regress y x
maxr2
maximum R-square           = 0.9423
relative R-square          = 0.2744
relative adjusted R-square = 0.1938

SSLF (df) = 13593.571 (4) MSLF = 3398.3927
SSPE (df) = 1148 (5) MSPE = 229.6

F (dfn, dfd) for lack-of-fit test (MSLF/MSPE) =   14.8014 (4,5)
                                     prob > F =    0.0056

number of covariate patterns = 6
    as ratio of observations = 0.545
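
The lack-of-fit F statistic can also be computed by hand by comparing the straight-line (reduced) model with a full model that fits a separate mean at each level of x. The following is a minimal sketch, assuming the Bank data are in long form and a Stata version that supports factor-variable notation (i.x). The scalar names are hypothetical. The result should agree with the maxr2 output above (F = 14.8014 on 4 and 5 degrees of freedom).

```stata
* Assumes the Bank data are in memory in long form.
* Reduced model: straight line in x
quietly regress y x
scalar sse_r = e(rss)
scalar df_r = e(df_r)
* Full model: one mean per level of x, so its residual SS is pure error
quietly regress y i.x
scalar sspe = e(rss)
scalar df_pe = e(df_r)
scalar sslf = sse_r - sspe
scalar df_lf = df_r - df_pe
scalar fstat = (sslf/df_lf)/(sspe/df_pe)
display "SSLF = " sslf " (" df_lf ")   SSPE = " sspe " (" df_pe ")"
display "F = " fstat "   prob > F = " Ftail(df_lf, df_pe, fstat)
```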
Inputting the Sales Training data, Table 3.7, p. 127.
clear
input x y
  0.5   42.5
  0.5   50.6
  1.0   68.5
  1.0   80.7
  1.5   89.0
  1.5   99.6
  2.0  105.3
  2.0  111.8
  2.5  112.3
  2.5  125.7
end
label variable x "Training"
label variable y "Performance"
gen trainee = _n
gen sqrtx = sqrt(x)
list trainee x y sqrtx
     +----------------------------------+
     | trainee     x       y      sqrtx |
     |----------------------------------|
  1. |       1    .5    42.5   .7071068 |
  2. |       2    .5    50.6   .7071068 |
  3. |       3     1    68.5          1 |
  4. |       4     1    80.7          1 |
  5. |       5   1.5      89   1.224745 |
     |----------------------------------|
  6. |       6   1.5    99.6   1.224745 |
  7. |       7     2   105.3   1.414214 |
  8. |       8     2   111.8   1.414214 |
  9. |       9   2.5   112.3   1.581139 |
 10. |      10   2.5   125.7   1.581139 |
     +----------------------------------+
Fig. 3.14 and the fitted regression function at the bottom of p. 128.
twoway scatter y x, name(ch3_14a)
twoway scatter y sqrtx, name(ch3_14b)
regress y sqrtx
predict r, resid
twoway scatter r x, name(ch3_14c)
qnorm r, name(ch3_14d)

Panels (a), (b), (c) and (d) correspond to the graphs named ch3_14a, ch3_14b, ch3_14c and ch3_14d.

Inputting the Plasma Levels data, Table 3.8, p. 130.
clear
input x y logy
    0  13.44  1.1284
    0  12.84  1.1086
    0  11.91  1.0759
    0  20.09  1.3030
    0  15.60  1.1931
  1.0  10.11  1.0048
  1.0  11.38  1.0561
  1.0  10.28  1.0120
  1.0   8.96   .9523
  1.0   8.59   .9340
  2.0   9.83   .9926
  2.0   9.00   .9542
  2.0   8.65   .9370
  2.0   7.85   .8949
  2.0   8.88   .9484
  3.0   7.94   .8998
  3.0   6.01   .7789
  3.0   5.14   .7110
  3.0   6.90   .8388
  3.0   6.77   .8306
  4.0   4.86   .6866
  4.0   5.10   .7076
  4.0   5.67   .7536
  4.0   5.75   .7597
  4.0   6.23   .7945
end
label variable x "Age"
label variable y "Plasma"
label variable logy "Log(plasma)"
gen child = _n
list child x y logy
     +----------------------------+
     | child   x       y     logy |
     |----------------------------|
  1. |     1   0   13.44   1.1284 |
  2. |     2   0   12.84   1.1086 |
  3. |     3   0   11.91   1.0759 |
  4. |     4   0   20.09    1.303 |
  5. |     5   0    15.6   1.1931 |
     |----------------------------|
  6. |     6   1   10.11   1.0048 |
  7. |     7   1   11.38   1.0561 |
  8. |     8   1   10.28    1.012 |
  9. |     9   1    8.96    .9523 |
 10. |    10   1    8.59     .934 |
     |----------------------------|
 11. |    11   2    9.83    .9926 |
 12. |    12   2       9    .9542 |
 13. |    13   2    8.65     .937 |
 14. |    14   2    7.85    .8949 |
 15. |    15   2    8.88    .9484 |
     |----------------------------|
 16. |    16   3    7.94    .8998 |
 17. |    17   3    6.01    .7789 |
 18. |    18   3    5.14     .711 |
 19. |    19   3     6.9    .8388 |
 20. |    20   3    6.77    .8306 |
     |----------------------------|
 21. |    21   4    4.86    .6866 |
 22. |    22   4     5.1    .7076 |
 23. |    23   4    5.67    .7536 |
 24. |    24   4    5.75    .7597 |
 25. |    25   4    6.23    .7945 |
     +----------------------------+
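
The logy column entered above is the base-10 logarithm of y (for example, log10(13.44) is about 1.1284). A quick way to confirm this, or to build the transformed response directly from y, is sketched below. The name logy_check is hypothetical and used only for the comparison.

```stata
* Assumes the Plasma Levels data are in memory.
* logy_check is a hypothetical name used only for this comparison
gen logy_check = log10(y)
list y logy logy_check in 1/5
```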
Fitted regression function at the bottom of p. 129 and Fig. 3.16, p. 131.
twoway scatter y x, name(ch3_16a)
twoway scatter logy x, name(ch3_16b)
regress logy x
predict r, resid
twoway scatter r x, yline(0) name(ch3_16c)
qnorm r, name(ch3_16d)

Panels (a), (b), (c) and (d) correspond to the graphs named ch3_16a, ch3_16b, ch3_16c and ch3_16d.

Table 3.9, p. 134

* K2 is the geometric mean of y
means y
scalar k2 = r(mean_g)

capture drop myw
gen myw = .

* SSE of the standardized Box-Cox response for lambda = -1, -.9, ..., 1
foreach n of numlist 0/20 {
  local lambda = (`n'-10)/10
  if (`lambda' == 0) {
    quietly replace myw = k2*ln(y)
  }
  else {
    scalar k1 = k2^(1-(`lambda'))/(`lambda')
    quietly replace myw = k1*(y^(`lambda') - 1)
  }
  quietly regress myw x
  display in yellow "`lambda'" _col(10) %4.2f e(rss)
}
-1       33.91
-.9      32.70
-.8      31.76
-.7      31.09
-.6      30.69
-.5      30.56
-.4      30.72
-.3      31.18
-.2      31.95
-.1      33.06
0        34.52
.1       36.37
.2       38.64
.3       41.36
.4       44.59
.5       48.37
.6       52.76
.7       57.84
.8       63.67
.9       70.35
1        77.98
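
Stata also has a built-in boxcox command that estimates the transformation parameter by maximum likelihood instead of a grid search. This is a sketch, assuming the Plasma Levels data are in memory. Since minimizing the SSE of the standardized response is closely related to maximizing the likelihood, the estimate would be expected to fall near the value of lambda with the smallest SSE in the list above (-.5), but that is an expectation to check, not a reported result.

```stata
* Assumes the Plasma Levels data are in memory (variables x and y).
* model(lhsonly) transforms only the response variable
boxcox y x, model(lhsonly)
```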

Fig. 3.18, p. 138.
Note: This figure uses the Toluca Company data, so re-enter that data set first.

twoway (lowess y x, bw(.7)) (scatter y x), xscale(r(0 150)) legend(off) name(ch3_18a)
twoway (lfitci y x, nofit level(90) ciplot(rline)) (lowess y x, bw(.7)), xscale(r(0 150)) legend(off) name(ch3_18b)

Panels (a) and (b) correspond to the graphs named ch3_18a and ch3_18b.

Inputting the Plutonium Measurement data, Table 3.10, p. 139.
clear
input y x 
  0.150  20
  0.004   0
  0.069  10
  0.030   5
  0.011   0
  0.004   0
  0.041   5
  0.109  20
  0.068  10
  0.009   0
  0.009   0
  0.048  10
  0.006   0
  0.083  20
  0.037   5
  0.039   5
  0.132  20
  0.004   0
  0.006   0
  0.059  10
  0.051  10
  0.002   0
  0.049   5
  0.106   0
end
label variable x "Plutonium Activity, pCi/g"
label variable y "Alpha Count, #/sec."
gen case = _n
list case x y
     +------------------+
     | case    x      y |
     |------------------|
  1. |    1   20    .15 |
  2. |    2    0   .004 |
  3. |    3   10   .069 |
  4. |    4    5    .03 |
  5. |    5    0   .011 |
     |------------------|
  6. |    6    0   .004 |
  7. |    7    5   .041 |
  8. |    8   20   .109 |
  9. |    9   10   .068 |
 10. |   10    0   .009 |
     |------------------|
 11. |   11    0   .009 |
 12. |   12   10   .048 |
 13. |   13    0   .006 |
 14. |   14   20   .083 |
 15. |   15    5   .037 |
     |------------------|
 16. |   16    5   .039 |
 17. |   17   20   .132 |
 18. |   18    0   .004 |
 19. |   19    0   .006 |
 20. |   20   10   .059 |
     |------------------|
 21. |   21   10   .051 |
 22. |   22    0   .002 |
 23. |   23    5   .049 |
 24. |   24    0   .106 |
     +------------------+
Fig. 3.19, p. 139.
twoway scatter y x, name(ch3_19a)
twoway (lowess y x, bw(.6)) (scatter y x), legend(off) name(ch3_19b)

Panels (a) and (b) correspond to the graphs named ch3_19a and ch3_19b.

Fig. 3.20, p. 140.

drop if case == 24
regress y x
predict r, resid
predict yhat
twoway scatter r yhat, yline(0)
qnorm r

(a).

      Source |       SS       df       MS              Number of obs =      23
-------------+------------------------------           F(  1,    21) =  229.00
       Model |  .036190422     1  .036190422           Prob > F      =  0.0000
    Residual |  .003318796    21  .000158038           R-squared     =  0.9160
-------------+------------------------------           Adj R-squared =  0.9120
       Total |  .039509218    22  .001795874           Root MSE      =  .01257

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |    .005537   .0003659    15.13   0.000     .0047761    .0062979
       _cons |   .0070331   .0035988     1.95   0.064     -.000451    .0145173
------------------------------------------------------------------------------

(b). Plot of residuals against fitted values.

(c). Normal quantile plot of residuals.

Transforming the y variable.
gen sqrty = sqrt(y)

Fig. 3.21, p. 141. Repeating the whole analysis from Fig. 3.20 with the transformed response variable.

regress sqrty x
predict r2, resid
predict yhat2
twoway scatter r2 yhat2, yline(0)
qnorm r2

(a).

      Source |       SS       df       MS              Number of obs =      23
-------------+------------------------------           F(  1,    21) =  188.80
       Model |  .210846556     1  .210846556           Prob > F      =  0.0000
    Residual |  .023452708    21  .001116796           R-squared     =  0.8999
-------------+------------------------------           Adj R-squared =  0.8951
       Total |  .234299264    22  .010649967           Root MSE      =  .03342

------------------------------------------------------------------------------
       sqrty |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .0133648   .0009727    13.74   0.000      .011342    .0153876
       _cons |   .0947596   .0095668     9.91   0.000     .0748643    .1146549
------------------------------------------------------------------------------

(b). Plot of residuals against fitted values.

(c). Normal quantile plot of residuals.

Transforming X.
gen sqrtx = sqrt(x)
Fig. 3.22a, b and c, p. 142. Repeating the whole analysis from Fig. 3.20 with the transformed response variable and the transformed predictor.
regress sqrty sqrtx
predict r3, resid
predict yhat3
twoway scatter r3 yhat3, yline(0)
qnorm r3
graph twoway (lfitci sqrty sqrtx, nofit  level(90)  ciplot(rline) ) (lowess sqrty sqrtx, bw(.7)) (scatter sqrty sqrtx), legend(off) name(ch3_22c)

      Source |       SS       df       MS              Number of obs =      23
-------------+------------------------------           F(  1,    21) =  360.92
       Model |  .221416125     1  .221416125           Prob > F      =  0.0000
    Residual |  .012883139    21  .000613483           R-squared     =  0.9450
-------------+------------------------------           Adj R-squared =  0.9424
       Total |  .234299264    22  .010649967           Root MSE      =  .02477

------------------------------------------------------------------------------
       sqrty |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       sqrtx |   .0573055   .0030164    19.00   0.000     .0510325    .0635785
       _cons |   .0730056   .0078306     9.32   0.000      .056721    .0892902
------------------------------------------------------------------------------

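(a). Plot of residuals against fitted values.

(b). Normal quantile plot of residuals.

(c). Graph ch3_22c: confidence band, lowess curve and scatter plot.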

 


These pages are Copyrighted (c) by UCLA Academic Technology Services

