### Stata FAQ: Why doesn't the test of the overall survey regression model in Stata match the results from SAS and SUDAAN?

NOTE:  We will use the NHANES II data as an example.

#### The question

Let's say that you ran an OLS regression model with survey data in Stata.

use http://www.stata-press.com/data/r12/nhanes2.dta, clear

svyset psu [pw=finalwgt], strata(strata)

pweight: finalwgt
VCE: linearized
Single unit: missing
Strata 1: strata
SU 1: psu
FPC 1: <zero>

svy: regress weight height age female
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        31                 Number of obs      =      10351
Number of PSUs     =        62                 Population size    =  117157513
Design df          =         31
F(   3,     29)    =    1177.18
Prob > F           =     0.0000
R-squared          =     0.2827

------------------------------------------------------------------------------
             |             Linearized
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .7405073    .027744    26.69   0.000     .6839229    .7970917
         age |   .1484546   .0116501    12.74   0.000      .124694    .1722153
      female |  -2.898197   .5888597    -4.92   0.000    -4.099184   -1.697209
       _cons |   -57.6088   4.955696   -11.62   0.000      -67.716   -47.50159
------------------------------------------------------------------------------

At the top of the output, you see the test of the overall regression model: F(3, 29) = 1177.18, p < 0.0001.

Next, you run the same model in SAS.

proc surveyreg data = nhanes2;
cluster psu;
strata strata;
weight finalwgt;
model weight = height age female ;
run;

The SURVEYREG Procedure

Regression Analysis for Dependent Variable weight

Data Summary

Number of Observations          10351
Sum of Weights              117157513
Weighted Mean of weight      71.90064
Weighted Sum of weight     8423699699

Design Summary

Number of Strata              31
Number of Clusters            62

Fit Statistics

R-square            0.2827
Root MSE           13.0725
Denominator DF          31

Tests of Model Effects

Effect       Num DF    F Value    Pr > F

Model             3    1258.00    <.0001
Intercept         1     135.10    <.0001
height            1     712.19    <.0001
age               1     162.33    <.0001
female            1      24.22    <.0001

NOTE: The denominator degrees of freedom for the F tests is 31.

Estimated Regression Coefficients

                             Standard
Parameter      Estimate         Error    t Value    Pr > |t|

Intercept    -57.608796    4.95641443     -11.62      <.0001
height         0.740507    0.02774807      26.69      <.0001
age            0.148455    0.01165183      12.74      <.0001
female        -2.898197    0.58894508      -4.92      <.0001

NOTE: The denominator degrees of freedom for the t tests is 31.

SAS reports the overall test of the regression model as F(3, 31) = 1258.00, p < .0001. Both the test statistic and the denominator degrees of freedom differ from the Stata output, so you decide to run the model in SUDAAN.

proc regress data = nhanes2 filetype = sas design = wr;
weight finalwgt;
nest strata psu;
model weight = height age female;
run;

                                  S U D A A N
Software for the Statistical Analysis of Correlated Data
Copyright      Research Triangle Institute      October 2009
Release 10.0.1

DESIGN SUMMARY: Variances will be computed using the Taylor Linearization Method, Assuming a
With Replacement (WR) Design
Sample Weight: FINALWGT
Stratification Variables(s): STRATA
Primary Sampling Unit: PSU

Number of observations read       :  10351    Weighted count:117157513
Observations used in the analysis :  10351    Weighted count:117157513
Denominator degrees of freedom    :     31

Maximum number of estimable parameters for the model is  4

File NHANES2 contains   62 Clusters
62 clusters were used to fit the model
Maximum cluster size is 288 records
Minimum cluster size is  67 records

Weighted mean response is 71.900636

Multiple R-Square for the dependent variable WEIGHT: 0.282704
------------------------------------------------------------------------------------------------
Independent                                                                             P-value
Variables and        Beta                      Lower 95%    Upper 95%                 T-Test
Effects              Coeff.          SE Beta   Limit Beta   Limit Beta   T-Test B=0   B=0
------------------------------------------------------------------------------------------------
Intercept                  -57.61         4.96       -67.72       -47.50       -11.62     0.0000
HEIGHT                       0.74         0.03         0.68         0.80        26.69     0.0000
AGE                          0.15         0.01         0.12         0.17        12.74     0.0000
FEMALE                      -2.90         0.59        -4.10        -1.70        -4.92     0.0000
------------------------------------------------------------------------------------------------
-------------------------------------------------------

Contrast               Degrees
                       of                      P-value
                       Freedom        Wald F   Wald F
-------------------------------------------------------
OVERALL MODEL                 4     58649.64     0.0000
MODEL MINUS
  INTERCEPT                   3      1258.36     0.0000
INTERCEPT                     1       135.14     0.0000
HEIGHT                        1       712.39     0.0000
AGE                           1       162.38     0.0000
FEMALE                        1        24.22     0.0000
-------------------------------------------------------

The test of the overall model (the MODEL MINUS INTERCEPT contrast) is F(3, 31) = 1258.36, p < 0.0001. The test statistic is very close to the SAS value of 1258.00, and the denominator degrees of freedom match the SAS output. What is going on?

By default, Stata reports an adjusted Wald F test in the output, while SAS and SUDAAN do not.  To have Stata match the results given by SAS and SUDAAN, you can use the nosvyadjust option on the test command.  (We use the test command with all of the predictor variables in the model to recreate the test of the overall regression shown at the top of the Stata output.)

svy: regress weight height age female
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        31                 Number of obs      =      10351
Number of PSUs     =        62                 Population size    =  117157513
Design df          =         31
F(   3,     29)    =    1177.18
Prob > F           =     0.0000
R-squared          =     0.2827

------------------------------------------------------------------------------
             |             Linearized
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .7405073    .027744    26.69   0.000     .6839229    .7970917
         age |   .1484546   .0116501    12.74   0.000      .124694    .1722153
      female |  -2.898197   .5888597    -4.92   0.000    -4.099184   -1.697209
       _cons |   -57.6088   4.955696   -11.62   0.000      -67.716   -47.50159
------------------------------------------------------------------------------

test height age female

( 1)  height = 0
( 2)  age = 0
( 3)  female = 0

F(  3,    29) = 1177.18
Prob > F =    0.0000

The results from svy: regress and test match.

test height age female, nosvyadjust

( 1)  height = 0
( 2)  age = 0
( 3)  female = 0

F(  3,    31) = 1258.36
Prob > F =    0.0000

The output from test, nosvyadjust differs from the Stata output above: it matches the SUDAAN output exactly and the SAS output closely (1258.36 versus 1258.00). Alternatively, you could use the adjwaldf and adjwaldp options on the print statement in SUDAAN to reproduce the results that Stata gives by default.
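The two Stata results differ only by a scale factor, which can be checked from the stored results. The following is a minimal sketch, not part of the original analysis: it assumes that svy: regress has just been run, that it is run from a do-file (so that the // comments are allowed), and that the adjustment factor is (d - p + 1)/d as described in Stata's documentation of the adjusted Wald test. The local macro names are illustrative only.

```stata
* Assumes: svy: regress weight height age female was the last estimation command.
test height age female, nosvyadjust

local Fu = r(F)        // unadjusted Wald F (1258.36 in the output above)
local p  = r(df)       // numerator df: number of terms tested (3)
local d  = e(df_r)     // design df: # of PSUs - # of strata (31)

local df2  = `d' - `p' + 1        // adjusted denominator df (29)
local Fadj = `Fu' * `df2' / `d'   // adjusted Wald F

display "Adjusted F(" `p' ", " `df2' ") = " %9.2f `Fadj'
display "Prob > F = " %6.4f Ftail(`p', `df2', `Fadj')
```

If the assumptions hold, the first display line should reproduce the F(3, 29) = 1177.18 reported at the top of the svy: regress output.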

#### The "why" and the degrees of freedom

A discussion of the adjusted Wald test is given on page 2184 of the Stata 12 Base Reference Manual (in the section for the -test- command). It cites the 1990 American Statistician article by Edward Korn and Barry Graubard entitled "Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics". Basically, they argue that this test statistic is more appropriate when more than a few terms are being tested simultaneously (in other words, when there are more predictors in the model).

The test statistic (what the authors call the Wald procedure) has numerator degrees of freedom of p, the number of predictors (excluding the intercept), and denominator degrees of freedom of # of PSUs - # of strata - p + 1. In the example above, we have 62 PSUs, 31 strata and 3 predictors. Hence, the denominator degrees of freedom are calculated as 62 - 31 - 3 + 1 = 29. In SAS and SUDAAN, you see notes indicating that the denominator degrees of freedom equal 31, which is simply 62 - 31 = 31.
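The relationship between the adjusted and unadjusted statistics can be written out explicitly. This is a sketch based on the adjusted Wald test as described in Stata's documentation. The symbols W (the Wald chi-squared statistic) and d (the design degrees of freedom, # of PSUs - # of strata) are notation introduced here for illustration; p is the number of predictors tested, as above.

```latex
\begin{aligned}
F_{\text{unadj}} &= \frac{W}{p}, &
  &\text{referred to } F(p,\ d), \\
F_{\text{adj}} &= \frac{d - p + 1}{p\,d}\,W
               = \frac{d - p + 1}{d}\,F_{\text{unadj}}, &
  &\text{referred to } F(p,\ d - p + 1).
\end{aligned}
```

With d = 62 - 31 = 31 and p = 3, the factor is 29/31, so 1258.36 x 29/31 is approximately 1177.18, which agrees with the F(3, 29) shown in the default Stata output.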

#### References

Korn, E. and Graubard, B. (1990). Simultaneous testing of regression coefficients with complex survey data: Use of Bonferroni t statistics. The American Statistician, Vol. 44, No. 4, pages 270-276.

Stata 12 Base Reference Manual. College Station, TX: Stata Press.
