UCLA Academic Technology Services

SAS Textbook Examples
Regression Analysis by Example by Chatterjee, Hadi and Price
Chapter 4: Regression Diagnostics: Detection of Model Violations 

Inputting the Hamilton data, table 4.1, p. 95
data p095;
  input Y X1 X2;
cards;
12.37 2.23 9.66
12.66 2.57 8.94
12    3.87 4.4
11.93 3.1  6.64
11.06 3.39 4.91
13.03 2.83 8.52
13.13 3.02 8.04
11.44 2.14 9.05
12.86 3.04 7.71
10.84 3.26 5.11
11.2  3.39 5.05
11.56 2.35 8.51
10.83 2.76 6.59
12.63 3.9  4.9
12.46 3.16 6.96
;
run;
Correlations and scatter matrix, p. 94
The %scatter macro, which must be defined before it is called, creates the scatter plot matrix.
proc corr data = p095;
  var y x1 x2;
run;
%scatter(data=p095, y x1 x2);
The CORR Procedure
   3  Variables:    Y        X1       X2
                                    Simple Statistics

Variable           N          Mean       Std Dev           Sum       Minimum       Maximum
Y                 15      12.00000       0.80217     180.00000      10.83000      13.13000
X1                15       3.00067       0.53470      45.01000       2.14000       3.90000
X2                15       6.99933       1.78145     104.99000       4.40000       9.66000
 Pearson Correlation Coefficients, N = 15
         Prob > |r| under H0: Rho=0

               Y            X1            X2
Y        1.00000       0.00250       0.43407
                        0.9930        0.1060

X1       0.00250       1.00000      -0.89978
          0.9930                      <.0001

X2       0.43407      -0.89978       1.00000
          0.1060        <.0001
Fitted Equations, p. 95.
proc reg data = p095;
 model y = x1;
 model y = x2;
 model y=x1 x2;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: Y

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     1     0.00005621     0.00005621       0.00    0.9930
Error                    13        9.00854        0.69296
Corrected Total          14        9.00860

Root MSE              0.83245    R-Square     0.0000
Dependent Mean       12.00000    Adj R-Sq    -0.0769
Coeff Var             6.93704
                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|
Intercept     1       11.98876        1.26689       9.46      <.0001
X1            1        0.00375        0.41608       0.01      0.9930

The REG Procedure
Model: MODEL2
Dependent Variable: Y

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     1        1.69736        1.69736       3.02    0.1060
Error                    13        7.31124        0.56240
Corrected Total          14        9.00860

Root MSE              0.74994    R-Square     0.1884
Dependent Mean       12.00000    Adj R-Sq     0.1260
Coeff Var             6.24946

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|
Intercept     1       10.63194        0.81094      13.11      <.0001
X2            1        0.19546        0.11251       1.74      0.1060

The REG Procedure
Model: MODEL3
Dependent Variable: Y

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     2        9.00722        4.50361    39222.3    <.0001
Error                    12        0.00138     0.00011482
Corrected Total          14        9.00860


Root MSE              0.01072    R-Square     0.9998
Dependent Mean       12.00000    Adj R-Sq     0.9998
Coeff Var             0.08930

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|
Intercept     1       -4.51541        0.06114     -73.85      <.0001
X1            1        3.09701        0.01227     252.31      <.0001
X2            1        1.03186        0.00368     280.08      <.0001
Inputting the New York Rivers data, p. 10.
data p010;
  length river $ 20 ;
  input river x1 x2 x3 x4 y;
  label x1 = 'Agriculture'
        x2 = 'Forest'
        x3 = 'Residential'
        x4 = 'Commercial/Industrial'
        y  = 'Nitrogen';
cards;
Olean 26 63 1.2 0.29 1.1
Cassadaga 29 57 0.7 0.09 1.01
Oatka 54 26 1.8 0.58 1.9
Neversink 2 84 1.9 1.98 1
Hackensack 3 27 29.4 3.11 1.99
Wappinger 19 61 3.4 0.56 1.42
Fishkill 16 60 5.6 1.11 2.04
Honeoye  40 43 1.3 0.24 1.65
Susquehanna 28 62 1.1 0.15 1.01
Chenango 26 60 0.9 0.23 1.21
Tioughnioga 26 53 0.9 0.18 1.33
West_Canada 15 75 0.7 0.16 0.75
East_Canada 6 84 0.5 0.12 0.73
Saranac 3 81 0.8 0.35 0.8
Ausable 2 89 0.7 0.35 0.76
Black 6 82 0.5 0.15 0.87
Schoharie 22 70 0.9 0.22 0.8
Raquette 4 75 0.4 0.18 0.87
Oswegatchie 21 56 0.5 0.13 0.66
Cohocton 40 49 1.1 0.13 1.25
;
run;
Table 4.2, p. 99
proc reg data = p010;
  model y = x1-x4;
run; 
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: y Nitrogen

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     4        2.56985        0.64246       9.15    0.0006
Error                    15        1.05273        0.07018
Corrected Total          19        3.62257

Root MSE              0.26492    R-Square     0.7094
Dependent Mean        1.15750    Adj R-Sq     0.6319
Coeff Var            22.88715
                                     Parameter Estimates

                                              Parameter       Standard
Variable     Label                    DF       Estimate          Error    t Value    Pr > |t|
Intercept    Intercept                 1        1.72221        1.23408       1.40      0.1832
x1           Agriculture               1        0.00581        0.01503       0.39      0.7046
x2           Forest                    1       -0.01297        0.01393      -0.93      0.3667
x3           Residential               1       -0.00723        0.03383      -0.21      0.8337
x4           Commercial/Industrial     1        0.30503        0.16382       1.86      0.0823
Excluding the observation where River = Neversink.
proc reg data = p010; 
  model y = x1-x4;
  where river ~= 'Neversink';
run; 
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: y Nitrogen

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     4        3.07765        0.76941      20.76    <.0001
Error                    14        0.51881        0.03706
Corrected Total          18        3.59646

Root MSE              0.19250    R-Square     0.8557
Dependent Mean        1.16579    Adj R-Sq     0.8145
Coeff Var            16.51280
                                     Parameter Estimates

                                              Parameter       Standard
Variable     Label                    DF       Estimate          Error    t Value    Pr > |t|
Intercept    Intercept                 1        1.09947        0.91164       1.21      0.2478
x1           Agriculture               1        0.01014        0.01098       0.92      0.3717
x2           Forest                    1       -0.00759        0.01022      -0.74      0.4701
x3           Residential               1       -0.12379        0.03934      -3.15      0.0071
x4           Commercial/Industrial     1        1.52896        0.34372       4.45      0.0006
Excluding the observation where River = Hackensack.
proc reg data = p010; 
  model y = x1-x4;
  where river ~= 'Hackensack';
run; 
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: y Nitrogen

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F
Model                     4        2.49968        0.62492      22.24    <.0001
Error                    14        0.39336        0.02810
Corrected Total          18        2.89304

Root MSE              0.16762    R-Square     0.8640
Dependent Mean        1.11368    Adj R-Sq     0.8252
Coeff Var            15.05109

                                     Parameter Estimates

                                              Parameter       Standard
Variable     Label                    DF       Estimate          Error    t Value    Pr > |t|
Intercept    Intercept                 1        1.62601        0.78109       2.08      0.0562
x1           Agriculture               1        0.00235        0.00954       0.25      0.8088
x2           Forest                    1       -0.01276        0.00881      -1.45      0.1698
x3           Residential               1        0.18116        0.04439       4.08      0.0011
x4           Commercial/Industrial     1        0.07562        0.11396       0.66      0.5177
Fig. 4.5, p. 102.
symbol v=dot h=.8 c=blue;
proc gplot data = p010;
  plot y*x4;
run;
quit;

Fig. 4.6, p. 102-103.
proc reg data = p010 noprint;
  model y=x4;
  plot student.*obs. h.*obs.;
  output out=resid student=stdresid H=leverage; 
run;
quit;

Table 4.3, p. 103. The standardized residuals and the leverage values.
proc print data = resid;
 var stdresid leverage;
run;
Obs    stdresid    leverage

  1     0.03228     0.05469
  2    -0.04502     0.06670
  3     1.95292     0.05038
  4    -1.84723     0.24787
  5     0.15529     0.67101
  6     0.67231     0.05018
  7     1.92326     0.08261
  8     1.56562     0.05700
  9    -0.09515     0.06232
 10     0.38082     0.05752
 11     0.74924     0.06038
 12    -0.81033     0.06166
 13    -0.83246     0.06443
 14    -0.82939     0.05253
 15    -0.93761     0.05253
 16    -0.47590     0.06232
 17    -0.72323     0.05806
 18    -0.50049     0.06038
 19    -1.03103     0.06371
 20     0.57473     0.06371
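The values in Table 4.3 can be screened with simple cutoffs. The sketch below assumes two common rules of thumb that are not taken from this page: leverage above 2(p+1)/n and absolute standardized residual above 2. The data set name flagged and the variables high_leverage and large_resid are illustrative.

```sas
/* Flag observations using rule-of-thumb cutoffs (assumed, not from this page) */
data flagged;
  set resid;
  n = 20;   /* number of rivers */
  p = 1;    /* predictors in the model y = x4 */
  high_leverage = (leverage > 2*(p + 1)/n);   /* cutoff is 0.2 here */
  large_resid   = (abs(stdresid) > 2);
run;

proc print data = flagged;
  where high_leverage or large_resid;
  var stdresid leverage high_leverage large_resid;
run;
```

With the values listed in Table 4.3, only observations 4 (Neversink) and 5 (Hackensack) have leverage above 0.2, and no standardized residual exceeds 2 in absolute value.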
Fig. 4.7(a) and 4.7(b), p. 107.
Plotting Cook's distance and DFFITS.
symbol v=dot h=.8 c=blue;
proc reg data = p010 noprint;
  model y = x4;
  plot cookd.*obs. dffits.*obs.;
  output out=resid r=r H=h cookd=CookD dffits=dffits ; 
run;
quit;
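For reference, the two measures requested with cookd= and dffits= can be written in terms of the leverage and the studentized residuals. The notation below is assumed for this sketch: e_i is the ordinary residual, h_i the leverage, n the number of observations, p the number of predictors, and sigma-hat squared is SSE/(n-p-1).

```latex
% Internally and externally studentized residuals
r_i = \frac{e_i}{\hat\sigma\sqrt{1-h_i}}, \qquad
r_i^{*} = r_i \sqrt{\frac{n-p-2}{n-p-1-r_i^{2}}}

% Cook's distance and DFFITS
C_i = \frac{r_i^{2}}{p+1}\cdot\frac{h_i}{1-h_i}, \qquad
\mathrm{DFFITS}_i = r_i^{*}\sqrt{\frac{h_i}{1-h_i}}

% Check for observation 4 (Neversink) with p = 1,
% using stdresid = -1.84723 and leverage = 0.24787 from Table 4.3
C_4 = \frac{(-1.84723)^{2}}{2}\cdot\frac{0.24787}{0.75213} \approx 0.562
```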

Creating the values for the Hadi influence measure.
ods listing close;
proc reg data = p010;
  model y = x4;
  ods output anova=temp;
run;
quit;
ods listing;
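data _null_;
  set temp;
  /* symputx strips blanks and avoids the numeric-to-character conversion note */
  if source = 'Error' then call symputx('sse', ss);  /* residual sum of squares */
  if source = 'Model' then call symputx('p', df);    /* number of predictors */
run;
%put &sse &p ; /* To make the numbers sse and p appear in the log file. */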
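data resid2;
   set resid;
   id = _N_;              /* observation number, used as the plot axis */
   d = r/sqrt(&sse);      /* normalized residual */
   /* Hadi's influence measure: potential function plus residual function */
   hadi = h/(1-h) + (&p + 1)*d**2/((1-h)*(1-d**2));
   keep hadi d id  dffits cookd h;
run;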
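Written out, the data step above computes the following quantity, where h_i is the leverage, e_i the ordinary residual, SSE the residual sum of squares, and p the number of predictors (p = 1 for the model y = x4). The labels on the two terms follow the names used for the potential-residual plot.

```latex
% Normalized residual
d_i = \frac{e_i}{\sqrt{\mathrm{SSE}}}

% Hadi's influence measure as the sum of a potential term and a residual term
H_i = \underbrace{\frac{h_i}{1-h_i}}_{\text{potential}}
    + \underbrace{\frac{p+1}{1-h_i}\cdot\frac{d_i^{2}}{1-d_i^{2}}}_{\text{residual}}
```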
Table 4.4, p. 106.
proc print data = resid2;
  var CookD dffits hadi;
run;
Obs     CookD      dffits       hadi

  1    0.00003     0.00755    0.05797
  2    0.00007    -0.01170    0.07170
  3    0.10118     0.49244    0.58357
  4    0.56225    -1.14475    0.77174
  5    0.02459     0.21567    2.04228
  6    0.01194     0.15210    0.10428
  7    0.16653     0.62923    0.59652
  8    0.07408     0.40249    0.37293
  9    0.00030    -0.02385    0.06747
 10    0.00443     0.09180    0.07727
 11    0.01804     0.18753    0.12852
 12    0.02157    -0.20566    0.14126
 13    0.02386    -0.21651    0.14874
 14    0.01907    -0.19351    0.13474
 15    0.02437    -0.21998    0.15786
 16    0.00753    -0.11999    0.09193
 17    0.01612    -0.17708    0.12139
 18    0.00805    -0.12417    0.09247
 19    0.03617    -0.26945    0.19307
 20    0.01124     0.14705    0.10539
Fig. 4.7(c), p. 107. Plotting the Hadi influence measure.
symbol v=dot h=.8 c=blue;
axis2 order=(0 to 20 by 4);
 
proc gplot data = resid2;
  plot hadi*id / haxis=axis2;  
run; 
quit;

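Creating the potential and residual functions, the two components of the Hadi influence measure.
data resid3;
  set resid2;
  po = h/(1-h);                          /* potential function */
  re = (1+1)*d**2/( (1-h)*(1-d**2) );    /* residual function; 1+1 is p + 1 with p = 1 */
run;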
Fig 4.8, p. 108. The Potential-Residual Plot.
symbol v=dot h=.8 c=blue;
axis2 order=(0 to 2.5 by .5);
 
proc gplot data = resid3;
  plot po*re/ vaxis=axis2;
run; 
quit;

Inputting the Scottish Hills Race data, table 4.5, p. 112
data p112;
  length Race $ 30;
  input Race Time  Distance  Climb;
cards;
Greenmantle_New_Year_Dash  	965  	2.5  	650  
Carnethy  		 	2901  	6  	2500  
Craig_Dunain  			2019  	6  	900  
Ben_Rha  		 	2736  	7.5  	800  
Ben_Lomond  			3736  	8  	3070  
Goatfell  			4393  	8  	2866  
Bens_of_Jura  		  	12277  	16  	7500  
Cairnpapple  			2182  	6  	800  
Scolty  			1785  	5  	800  
Traprain_Law  			2385  	6  	650  
Lairig_Ghru  			11560  	28  	2100  
Dollar  			2583  	5  	2000  
Lomonds_of_Fife  		3900  	9.5  	2200  
Cairn_Table  			2648  	6  	500  
Eildon_Two  			1616  	4.5  	1500  
Cairngorm  			4335  	10  	3000  
Seven_Hills_of_Edinburgh  	5905  	14  	2200  
Knock_Hill  			4719  	3  	350  
Black_Hill  			1045  	4.5  	1000  
Creag_Beag  			1954  	5.5  	600  
Kildoon  			957  	3  	300  
Meall_Ant_Suiche  		1674  	3.5	1500  
Half_Ben_Nevis  		2859  	6  	2200  
Cow_Hill  			1076  	2  	900  
North_Berwick_Law  		1121  	3  	600  
Creag_Dubh  			1573  	4  	2000  
Burnswark  			2066  	6  	800  
Largo  	        		1714  	5  	950  
Criffel  			3030  	6.5  	1750  
Achmony  			1257  	5  	500  
Ben_Nevis  			5135  	10  	4400  
Knockfarrel  			1943  	6  	600  
Two_Breweries_Fell  		10215  	18  	5200  
Cockleroi  			1686  	4.5  	850  
Moffat_Chase  			9590  	20  	5000  
;
run;
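Fig. 4.11(a) and 4.11(b), p. 114. The partial option produces the added-variable plots; the output and ods output statements save the residuals, leverage values, and ANOVA table needed later for the potential-residual plot. Note that the output below is based on 30 observations (Corrected Total DF = 29), although the data step above lists 35 races.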
proc reg data = p112;
  model time = distance climb/partial;
  output out=resid H=h r=r ;
  ods output anova=temp;
run;
quit;
The REG Procedure
Model: MODEL1
Dependent Variable: Time

                             Analysis of Variance

                                    Sum of           Mean
Source                   DF        Squares         Square    F Value    Pr > F

Model                     2      202712289      101356144     147.51    <.0001
Error                    27       18552505         687130
Corrected Total          29      221264794


Root MSE            828.93294    R-Square     0.9162
Dependent Mean     3341.26667    Adj R-Sq     0.9099
Coeff Var            24.80894

                        Parameter Estimates

                     Parameter       Standard
Variable     DF       Estimate          Error    t Value    Pr > |t|

Intercept     1     -304.14317      263.96279      -1.15      0.2593
Distance      1      395.48217       35.81177      11.04      <.0001
Climb         1        0.40133        0.14962       2.68      0.0123

The REG Procedure
Model: MODEL1
Partial Regression Residual Plot

      --+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+--
 Time |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 4000 +                                                                                        +
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                     1  |
      |                                                                                        |
      |                                                                                        |
 3000 +                                                                                        +
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 2000 +                                                                                        +
      |                                                                                        |
      |                                                                                        |
      |      1                                                                                 |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 1000 +                                                                                        +
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |  1                                                 1                                   |
      |    1                                                                       1           |
    0 +                                                                 1 1               1    +
      |                                  1      1                    1            1        1   |
      |                                                                            11     1  1 |
      |                                                  1  1                   2  1           |
      |                                             1                          11 1            |
      |                                                                     1                  |
      |                                                                               1        |
-1000 +                                                                            1           +
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      --+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+--
      -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8

                                              Intercept
The REG Procedure
Model: MODEL1
Partial Regression Residual Plot

      -----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------
 8000 +                                                                                        +
      |                                                                                1       |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 6000 +                                                                                        +
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 4000 +                                                                                        +
      |                                                                                        |
      |                   1                                                                    |
 Time |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 2000 +                                                                                        +
      |                        1            1                                                  |
      |                                1                                                       |
      |                           1                                                            |
      |                                                                                        |
      |                          1                                                             |
      |                         11                                                             |
    0 +                         2                                                              +
      |                  1 2 1 3                                                               |
      |                 1                                                                      |
      |              11   1                                                                    |
      |            21                                                                          |
      |              1    1                                                                    |
      |            1                                                                           |
-2000 +                                                                                        +
      |       1 1                                                                              |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
-4000 +                                                                                        +
      -----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------
          -6    -4    -2     0     2     4     6     8    10    12    14    16    18    20

                                               Distance
The REG Procedure
Model: MODEL1
Partial Regression Residual Plot

      ------+------+------+------+------+------+------+------+------+------+------+------+------
 4000 +                                                                                        +
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                      1                                                 |
      |                                                                                        |
      |                                                                                        |
 3000 +                                                                                        +
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 2000 +                                                                         1              +
      |                                                                                        |
      |                                                                                        |
 Time |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
 1000 +                                                                                        +
      |                                                               1                        |
      |                                                                                1       |
      |                                                                  1                     |
      |                                                         1                              |
      |                                                1                                       |
      |                                                  1       1      1                      |
    0 +                                  1                                                     +
      |                                          1 1              11                           |
      |                                    1 1   1       11                                    |
      |                                    1 1 2                                               |
      |                                      11                                                |
      |                                   1                                                    |
      |                                            1                                           |
-1000 +                                    1                                                   +
      |        1                                                                               |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
      |                                                                                        |
-2000 +                                                                                        +
      ------+------+------+------+------+------+------+------+------+------+------+------+------
          -3000  -2500  -2000  -1500  -1000  -500     0     500   1000   1500   2000   2500

                                                Climb
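The added-variable plot for a predictor can also be built by hand: regress the response on the other predictors, regress the predictor of interest on the other predictors, and plot the first set of residuals against the second. The sketch below does this for Distance; the data set names av1 and av2 and the variables e_time and e_distance are illustrative, not part of the original example.

```sas
/* Residuals of Time after removing the linear effect of Climb */
proc reg data = p112 noprint;
  model time = climb;
  output out = av1 r = e_time;
run;
quit;

/* Residuals of Distance after removing the linear effect of Climb */
proc reg data = av1 noprint;
  model distance = climb;
  output out = av2 r = e_distance;
run;
quit;

/* Added-variable plot for Distance */
symbol v=dot h=.8 c=blue;
proc gplot data = av2;
  plot e_time*e_distance;
run;
quit;
```

The least-squares slope through these points equals the Distance coefficient of the full model time = distance climb, which is what makes the plot useful for judging that predictor's contribution.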
Fig. 4.12, the partial residual plot for each predictor.
Note: The par_resid_plot macro generates only one plot per call, so it is invoked once for each predictor.
%par_resid_plot(p112, time, climb distance, distance)
%par_resid_plot(p112, time, climb distance, climb)
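The par_resid_plot macro is not defined on this page. As a rough sketch of what a partial residual (residual plus component) plot involves, and not the macro's actual implementation, the code below adds each predictor's fitted component back to the residual and plots the sum against that predictor. The names est, pr, pr2, b_distance, b_climb, pr_distance and pr_climb are illustrative.

```sas
/* Fit the full model; keep the coefficients (outest=) and the residuals */
proc reg data = p112 noprint outest = est;
  model time = distance climb;
  output out = pr r = r;
run;
quit;

/* Residual plus component: r + b_j * x_j for each predictor */
data pr2;
  /* read the one-row coefficient data set once; its values are retained */
  if _n_ = 1 then set est(keep = distance climb
                          rename = (distance = b_distance climb = b_climb));
  set pr;
  pr_distance = r + b_distance*distance;
  pr_climb    = r + b_climb*climb;
run;

symbol v=dot h=.8 c=blue;
proc gplot data = pr2;
  plot pr_distance*distance pr_climb*climb;
run;
quit;
```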
Generating the variables to be used in the Potential-Residual plot.
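data _null_;
  set temp;
  if source = 'Error' then call symputx('sse', ss);  /* residual sum of squares */
run;
data resid2;
  set resid;
  d = r/sqrt(&sse);                      /* normalized residual */
  po = h/(1-h);                          /* potential function */
  re = 3*d**2/( (1-h)*(1-d**2) );        /* residual function with p + 1 = 3 */
run;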
The factor 3 multiplying d squared is p + 1, the number of predictors in the model plus one (here 2 + 1 = 3).
The Potential-Residual Plot, fig. 4.13, p. 114.
symbol v=dot h=.8 c=blue; 
proc gplot data = resid2; 
  plot po*re ;
run; 
quit;


These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California