|
|
|
||||
|
|
|||||
Note: the code on this page works with R 2.4.1.
Robust regression can be used in any situation in which you would use OLS regression. When doing the regression diagnostics, you might discover that one or more data points are moderately outlying. These are points that you have determined are not data entry errors, from a different population than the rest of your data, and for which you have no compelling reason to exclude them from the analysis. Robust regression is a compromise between deleting these points, and allowing them to violate the assumptions of OLS regression.
Before we continue with our discussion of robust regression, we need to define some terms.
Residual: The difference between the predicted value (based on the regression equation) and the actual, observed value.
Outlier: In linear regression, an outlier is an observation with large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.
Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. These leverage points can have an effect on the estimate of regression coefficients.
Influence: An observation is said to be influential if removing the observation substantially changes the estimate of coefficients. Influence can be thought of as the product of leverage and outlierness.
Robust regression deals with cases that have very high leverage, and cases that are outliers. Robust regression is essentially a compromise between dropping the case(s) that are moderate outliers and seriously violating the assumptions of OLS regression. It is a form of weighted least squares regression and is done iteratively. At each step a new set of weights are determined based on the residuals. In general, the larger the residuals, the smaller the weights. So the weights depend on residuals. At the same time, the residuals depend on the model and the model depends on the weights. This generates an iteration process and it goes on until the change in the parameter estimates are small enough. A few types of weighting schemes are implemented. In Huber weighting, observations with small residuals get a weight of 1, the larger the residual, the smaller the weight. With bisquare, all cases with a non-zero residual get down-weighted at least a little.
crime<-read.table("http://www.ats.ucla.edu/stat/R/dae/crime.csv", sep=",", header = TRUE)
attach(crime)
crime2<-crime[ sid !=51, ]
In most cases, you begin by running an OLS regression and doing some diagnostics. We will begin by running an OLS regression.
m1 <- lm(crime ~ poverty + single, data=crime2) opar <- par(mfrow = c(2,2), oma = c(0, 0, 1.1, 0)) plot(m1, las = 1) par(opar)
As we can see, observation 9 and 25 have high residual and leverage values. Let us display these observations.
crime2[9,] sid state crime murder pctmetro pctwhite pcths poverty single 9 9 fl 1206 8.9 93 83.5 74.4 17.8 10.6 crime2[25,] sid state crime murder pctmetro pctwhite pcths poverty single 25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7
We can also generate the Cook's D and standardized residuals using the package MASS. The lowest value that Cook's D can assume is zero, and the higher the Cook's D is, the more influential the point. The conventional cut-off point is 4/n, where n is the number of observations in the data set.
library(MASS) d1<-cooks.distance(m1) r<-stdres(m1) a<-cbind(crime2, d1, r) a[d1>4/50,]sid state crime murder pctmetro pctwhite pcths poverty single d1 r 9 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 0.15731688 3.023632 25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 0.79632761 -3.158753 49 49 wv 208 6.9 41.8 96.3 66.0 22.2 9.4 0.08725893 -1.047394
Now we will look at the residuals. We will first generate a new variable rabs containing the absolute value of standardized residuals. Then we sort the data on rabs in descending order. We then list the first ten observations.
rabs <-abs(r) a<cbind{crime2, d1, r, rabs) asorted<-a[order(-rabs), ] asorted[1:10,]sid state crime murder pctmetro pctwhite pcths poverty single d1 r rabs 25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 0.79632761 -3.158753 3.158753 9 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 0.15731688 3.023632 3.023632 46 46 vt 114 3.6 27.0 98.4 80.8 10.0 11.0 0.04732933 -1.831356 1.831356 20 20 md 998 12.7 92.8 68.9 78.4 9.7 12.0 0.06348335 1.620750 1.620750 26 26 mt 178 3.0 24.0 92.6 81.0 14.9 10.8 0.01990241 -1.588842 1.588842 21 21 me 126 1.6 35.7 98.5 78.8 10.7 10.6 0.02764574 -1.578434 1.578434 14 14 il 960 11.4 84.0 81.0 76.2 13.6 11.5 0.01848024 1.550569 1.550569 5 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5 0.03175037 1.401128 1.401128 34 34 ny 1074 13.3 91.7 77.2 74.8 16.4 12.7 0.02727377 1.334536 1.334536 19 19 ma 805 3.9 96.2 91.1 80.0 10.7 10.9 0.01896076 1.288611 1.288611
Now let's run our first robust regression. Robust regression is done by iterated re-weighted least squares. The function for running robust regression is rlm from the package MASS. There are a couple of estimators for IWLS. The default estimator that rlm uses is Huber estimator. We can save the final weights created by the IWLS process. This can be very useful.
m1.huber<-rlm(crime ~poverty + single, data=crime2) summary(m1.huber)Call: rlm(formula = crime ~ poverty + single, data = crime2) Residuals: Min 1Q Median 3Q Max -706.88 -96.04 -21.72 115.24 675.41 Coefficients: Value Std. Error t value (Intercept) -1046.8870 224.9294 -4.6543 poverty 8.9435 7.6522 1.1688 single 133.8002 22.2370 6.0170 Residual standard error: 165.8 on 47 degrees of freedom Correlation of Coefficients: (Intercept) poverty poverty -0.0042 single -0.8932 -0.4303 a<-cbind(crime2, d1, r, m1.huber$w) asorted<-a[order(m1.huber$w),] asorted[1:10,] sid state crime murder pctmetro pctwhite pcths poverty single d1 r m1.huber$w 25 25 ms 434 13.5 30.7 63.3 64.3 24.7 14.7 0.79632761 -3.158753 0.3155457 9 9 fl 1206 8.9 93.0 83.5 74.4 17.8 10.6 0.15731688 3.023632 0.3302352 46 46 vt 114 3.6 27.0 98.4 80.8 10.0 11.0 0.04732933 -1.831356 0.5571331 26 26 mt 178 3.0 24.0 92.6 81.0 14.9 10.8 0.01990241 -1.588842 0.6311308 20 20 md 998 12.7 92.8 68.9 78.4 9.7 12.0 0.06348335 1.620750 0.6326876 14 14 il 960 11.4 84.0 81.0 76.2 13.6 11.5 0.01848024 1.550569 0.6436023 21 21 me 126 1.6 35.7 98.5 78.8 10.7 10.6 0.02764574 -1.578434 0.6539268 19 19 ma 805 3.9 96.2 91.1 80.0 10.7 10.9 0.01896076 1.288611 0.7490496 31 31 nj 627 5.3 100.0 80.8 76.7 10.9 9.6 0.02156761 1.193654 0.7640642 5 5 ca 1078 13.1 96.7 79.3 76.2 18.2 12.5 0.03175037 1.401128 0.7701236
Roughly, as the residual goes down, the weight goes up. In other words, cases with a large residual tend to be down-weighted, and the values of Cook's D don't really correspond to the weights. This output shows us that the observation for Mississippi will be down-weighted the most. Florida will also be substantially down-weighted. In OLS regression, all cases have a weight of 1. Hence, the more cases in the robust regression that have a weight close to one, the closer the results of the OLS and robust regressions.
Now that we have seen the values of Cook's D and the residuals from the OLS regression model, let's look compare the results of a regular OLS regression and a robust regression. If the results are very different, you will most likely want to use the results from the robust regression.
m1<-lm(crime ~poverty + single, data=crime2) summary(m1) Call: lm(formula = crime ~ poverty + single, data = crime2) Residuals: Min 1Q Median 3Q Max -646.78 -106.42 -15.34 112.66 672.13 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -879.797 247.378 -3.556 0.00087 *** poverty 7.591 8.416 0.902 0.37164 single 120.617 24.456 4.932 1.06e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 228 on 47 degrees of freedom Multiple R-Squared: 0.4306, Adjusted R-squared: 0.4064 F-statistic: 17.77 on 2 and 47 DF, p-value: 1.785e-06m1.huber<-rlm(crime ~poverty + single, data=crime2) summary(m1.huber) Call: rlm(formula = crime ~ poverty + single, data = crime2) Residuals: Min 1Q Median 3Q Max -706.88 -96.04 -21.72 115.24 675.41 Coefficients: Value Std. Error t value (Intercept) -1046.8870 224.9294 -4.6543 poverty 8.9435 7.6522 1.1688 single 133.8002 22.2370 6.0170 Residual standard error: 165.8 on 47 degrees of freedom Correlation of Coefficients: (Intercept) poverty poverty -0.0042 single -0.8932 -0.4303
As you can see, the results from the two analyses are fairly different, especially with respect to the coefficients of single and the constant (_cons). While normally we are not interested in the constant, if you had centered one or both of the predictor variables, the constant would be useful. On the other hand, you will notice that poverty is not statistically significant in either analysis, while single is significant in both analyses. You will also notice that no R-squared, adjusted R-squared in rlm output.
The results of the robust regression would be written up in the same way that OLS results would be; the interpretation of the coefficients and overall significance of the model mean exactly the same as in OLS regression.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services