### Statistical Computing Seminars: Introduction to Survey Data Analysis

The purpose of this seminar is to introduce you to the use of Stata, SUDAAN, WesVar and SAS for the analysis of survey data. It will draw much of its material and many of its examples from Choosing the Correct Analysis for Various Survey Designs.

#### Why do we need survey data analysis software?

Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling.  For experimental and quasi-experimental designs, this is exactly what we want.  However, when surveys are conducted, a simple random sample is rarely collected.  Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods.  When any sampling method other than simple random sampling is used, we need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling.  The sampling design affects the calculation of the standard errors of the estimates.  If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant, when in fact, they are not.  The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set.  While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.
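
To make the underestimation concrete, here is a minimal Python sketch (Python is not one of the packages covered in this seminar, and the data are made up for illustration). It compares the standard error of a mean computed as if 20 responses were a simple random sample with a standard error based on the 4 clusters they came from, assuming equal-sized clusters and no finite population correction.

```python
import math
from statistics import mean, variance

# Made-up data: 4 sampled clusters (PSUs) of 5 yes/no responses each.
# Responses within a cluster are deliberately similar to each other.
clusters = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]

# Naive analysis: treat all 20 responses as a simple random sample.
values = [y for cluster in clusters for y in cluster]
naive_se = math.sqrt(variance(values) / len(values))

# Design-based analysis for equal-sized clusters (no FPC):
# the variability comes from differences between the cluster means.
cluster_means = [mean(c) for c in clusters]
cluster_se = math.sqrt(variance(cluster_means) / len(cluster_means))

# Design effect: ratio of the design-based variance to the naive variance.
deff = (cluster_se / naive_se) ** 2

print(f"mean = {mean(values):.3f}")
print(f"naive SE = {naive_se:.4f}, cluster SE = {cluster_se:.4f}, deff = {deff:.2f}")
```

With these made-up values the point estimate is the same either way, but the naive standard error is roughly half of the cluster-based one.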

#### Sampling designs

Most people do not conduct their own surveys.  Rather, they use survey data that some agency or company collected and made available to the public.  The documentation must be read carefully to find out what kind of sampling design was used to collect the data.  This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs.  Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.

Below are some common features of many sampling designs.

Weights: There are many types of weights that can be associated with a survey. Perhaps the most common is the sampling weight, sometimes called a pweight, which is the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below). In the simplest case, the pweight is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample. For example, if a population has 10 elements and 3 are sampled at random with replacement, then the pweight would be 10/3 = 3.33. The sum of the pweights should equal the population size.
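
The arithmetic in this example can be sketched in a few lines of Python. The variable names are only illustrative.

```python
# The example from the text: n = 3 elements sampled from a population of N = 10.
N = 10
n = 3

pweight = N / n                 # each sampled element "stands for" N / n elements
weights = [pweight] * n         # every sampled element gets the same weight

print(round(pweight, 2))        # pweight rounded to two decimals
print(round(sum(weights), 6))   # the weights add up to the population size
```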

PSU:  This is the primary sampling unit.  This is the first unit that is sampled in the design.  For example, school districts from California may be sampled and then schools within districts may be sampled.  The school district would be the PSU.  If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU.  One does not need to use the same sampling method at all levels of sampling.  For example,  probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts).  In the case of a simple random sample, the PSUs and the elementary units are the same.

Strata: Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES. Once these groups have been defined, one samples from each group as if it were independent of all of the other groups. For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another. This means that the pweights for men will likely be different from the pweights for women. In most cases, you need to have two or more PSUs in each stratum. The purpose of stratification is to improve the precision of the estimates.
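
As a rough Python illustration of why pweights differ across strata, the sketch below uses the stratum sizes (42, 99 and 17 hospitals) and sample sizes (4, 5 and 6) from the stratified hospital example later in this seminar. Computing the weight as N_h / n_h within each stratum is an assumption, but it is consistent with the weights listed in that data set.

```python
# Stratum population sizes (tothosp) and sample sizes taken from the
# hospital example later in this seminar.
population_sizes = {1: 42, 2: 99, 3: 17}
sample_sizes = {1: 4, 2: 5, 3: 6}

# Each stratum is sampled independently, so the weight is computed
# stratum by stratum as N_h / n_h.
pweights = {h: population_sizes[h] / sample_sizes[h] for h in population_sizes}

for h, w in pweights.items():
    print(f"stratum {h}: pweight = {w:.4f}")

# Summed over every sampled unit, the weights give the population size.
total_weight = sum(pweights[h] * sample_sizes[h] for h in pweights)
print(total_weight)
```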

FPC: This is the finite population correction. It matters when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large. The FPC is used in the calculation of the standard error of the estimate. If the value of the FPC is close to 1, it will have little impact and can be safely ignored. In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of "without replacement"). To see the impact of the FPC for samples of various sizes, suppose that you had a population of 10,000 elements:

| Sample size (n) | FPC |
|---|---|
| 1 | 1.0 |
| 10 | .9995 |
| 100 | .9950 |
| 500 | .9747 |
| 1000 | .9487 |
| 5000 | .7071 |
| 9000 | .3162 |
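
The seminar does not state the formula behind these values. The Python sketch below assumes the common form sqrt((N - n)/(N - 1)), which matches the figures listed for N = 10,000. The simpler form sqrt((N - n)/N), which is the one used in the Stata example later in this seminar, gives the same values to four decimal places for these sample sizes.

```python
import math

def fpc(N, n):
    """Finite population correction factor (assumed form, see lead-in)."""
    return math.sqrt((N - n) / (N - 1))

N = 10_000
for n in (1, 10, 100, 500, 1000, 5000, 9000):
    print(f"{n:>5}  {fpc(N, n):.4f}")
```

The factor multiplies the standard error, so values near 1 leave it almost unchanged while values such as .3162 shrink it considerably.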

Imputation flag:  This is a 0/1 variable that is associated with a variable in the data set and indicates whether the corresponding value in the associated variable was imputed or given by the respondent.  For example, in the data set below

| Subject | Response | ImputeFlag |
|---|---|---|
| 1 | 60 | 0 |
| 2 | 60 | 1 |
| 3 | 63 | 0 |

the data for subject number 2 were imputed. The flag does not tell you how the imputation was done (e.g., mean substitution or multiple imputation). These variables are useful for determining how much missing data each variable has.
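
As a small Python sketch of how such a flag might be used, the snippet below counts the imputed values in the three-row example and computes the mean from the reported values only. The field names are illustrative, not taken from any real data set.

```python
# The three-row example, one dictionary per subject.
rows = [
    {"subject": 1, "response": 60, "impute_flag": 0},
    {"subject": 2, "response": 60, "impute_flag": 1},
    {"subject": 3, "response": 63, "impute_flag": 0},
]

# How much of this variable was imputed?
n_imputed = sum(row["impute_flag"] for row in rows)
share_imputed = n_imputed / len(rows)

# Mean using only the values the respondents actually reported.
reported = [row["response"] for row in rows if row["impute_flag"] == 0]
mean_reported = sum(reported) / len(reported)

print(n_imputed, round(share_imputed, 3), mean_reported)
```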

Non-response weight: There are both unit and item non-response weights. The former down-weights an entire case because the respondent did not respond to any of the items on the survey (perhaps he wasn't home that day). The latter down-weights "responses" from respondents who did not answer that item.

Certainty PSU:  This is a PSU that was guaranteed to be in the sample.  This is independent of the sampling design:  any sampling design can have one or more certainty PSUs.  Certainty PSUs are also called self-representing units.

Poststratification:  This is stratification that happens after the sample has been collected, either because the information needed to do stratification was not available when the sample was collected, or because it was not known at the time of data collection that stratification on this variable would be necessary/desirable.  The purpose of poststratification is to improve the precision of the estimates or to reduce bias caused by non-response.
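
One simple way poststratification can be carried out is to rescale the weights within each group so that the weighted counts match known population counts. The Python sketch below assumes a hypothetical simple random sample of 10 people from a population of 100, and assumes the population is known to be split 50/50 between two groups. None of these numbers come from the seminar's data sets.

```python
# Hypothetical SRS of 10 people from a population of 100, so every
# respondent starts with a base weight of 100 / 10 = 10.
sample_groups = ["m", "m", "m", "f", "f", "f", "f", "f", "f", "f"]
base_weight = 10.0

# Known population counts for the poststratification variable (assumed).
population_counts = {"m": 50, "f": 50}

# Weighted count per group before adjustment (30 and 70 here).
weighted_counts = {g: base_weight * sample_groups.count(g) for g in population_counts}

# Scale each group's weights so the weighted count hits the known total.
adjust = {g: population_counts[g] / weighted_counts[g] for g in population_counts}
post_weights = [base_weight * adjust[g] for g in sample_groups]

for g in population_counts:
    group_sum = sum(w for w, s in zip(post_weights, sample_groups) if s == g)
    print(g, round(group_sum, 6))
```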

Clearly, not all surveys will have all of the features listed above.  We will concentrate only on the first four features because they are the most common.

#### Sampling with and without replacement

Most samples collected in the real world are collected "without replacement". This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.
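
The distinction can be sketched with Python's standard `random` module. The ten respondent IDs are hypothetical.

```python
import random

random.seed(1)                      # fixed seed so the sketch is repeatable
population = list(range(1, 11))     # ten hypothetical respondent IDs

# Without replacement: a respondent can be selected at most once.
without_replacement = random.sample(population, k=3)

# With replacement: the same respondent may be drawn more than once.
with_replacement = random.choices(population, k=3)

print(without_replacement, with_replacement)
```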

#### Replicate weights

Replicate weights are a feature of an increasing number of public use survey data sets. Replicate weights are a series of weight variables that are used instead of PSUs and strata in an effort to protect the respondents' identity. Either replicate weights or a Taylor series linearization, which is based on PSUs and/or strata, is necessary for variance estimation.
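
To give a feel for how replicate weights produce a variance estimate, here is a minimal Python sketch of a delete-one jackknife. This is only one of several replication methods, and public use files ship weights that the data producer has already built, possibly by a different method. The data are an assumption: 25 yes/no responses with 23 "yes" answers, chosen to match the mean of .92 reported for momsag in the Stata example later in this seminar.

```python
import math

# Assumed data: 25 yes/no responses with 23 "yes" answers (mean .92).
y = [1] * 23 + [0] * 2
n = len(y)
full_weights = [1.0] * n

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# One replicate weight variable per PSU: drop that PSU (weight 0) and
# scale up the remaining weights by n / (n - 1).
replicate_weights = [
    [0.0 if i == r else w * n / (n - 1) for i, w in enumerate(full_weights)]
    for r in range(n)
]

# Re-estimate the mean once with the full weights and once per replicate.
theta = weighted_mean(y, full_weights)
replicates = [weighted_mean(y, w) for w in replicate_weights]

# Delete-one jackknife variance (no finite population correction).
jk_var = (n - 1) / n * sum((t - theta) ** 2 for t in replicates)
jk_se = math.sqrt(jk_var)

print(round(theta, 2), round(jk_se, 7))
```

For a mean, this jackknife standard error agrees with the usual simple random sampling formula, which is why the analyst only needs the weight columns and not the PSU or strata identifiers.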

#### Summary of four survey data analysis packages

We are now going to summarize some of the features of four survey data analysis packages: Stata, SUDAAN, WesVar and SAS. One feature that all four programs share is that once you specify the sampling design, it is either 1) applied to all analyses until you change it or exit the program (Stata and WesVar) or 2) very easy to apply to all analyses (SUDAAN and SAS). In other words, you only need to go through the work of specifying the design once, and then it applies to all analyses of that data.

Stata:

• handles most sampling designs, except two-stage cluster sampling, probability-proportional-to-size sampling, poststratification and certainty PSUs
• has the most statistical procedures of any of the packages
• does not handle replicate weights
• has a relatively easy to use command interface (point and click in Stata version 8)

SUDAAN:

• handles all sampling designs
• has a fair number of statistical procedures
• handles replicate weights (except for survival analysis)
• has a relatively more difficult to use command interface

WesVar:

• handles all sampling designs except two-stage cluster sampling
• has a fair number of statistical procedures
• handles replicate weights (and can create them from PSUs and strata)
• has a relatively easy to use point-and-click interface

SAS:

• handles all sampling designs except poststratification and two-stage cluster sampling
• has a VERY limited number of statistical features (only means and regression in version 8; frequencies and possibly logistic regression in version 9)
• does not handle replicate weights
• has a relatively more difficult to use command interface

#### Examples

Now we are going to try some analyses using the different packages.  All of the examples shown below were presented in Levy and Lemeshow's Sampling of Populations.  These and other examples from that text and other texts can be found on our website.  We will focus on Stata and SUDAAN.  If there is time, we will show an example in WesVar.  The code to do these examples in SAS is given at the end of this handout, along with some explanation.

#### Simple random sample in Stata

The entire Stata version 7 program can be downloaded here.  The entire Stata version 8 program can be downloaded here.  We will use the Stata version 7 code for this seminar.

Although simple random sampling (SRS) is almost never used, we start with this example because it is the least complex and it will serve as a comparison for later examples. Note that in SRS, each observation is a PSU. The Stata code that would be used with an SRS design is given below.

use http://www.ats.ucla.edu/stat/books/sop/momsag.dta, clear
list birth weight1 momsag in 1/10
            birth    weight1        momsag
1.          773      30.92             0
2.          773      30.92             1
3.          773      30.92             1
4.          773      30.92             1
5.          773      30.92             1
6.          773      30.92             1
7.          773      30.92             1
8.          773      30.92             1
9.          773      30.92             1
10.          773      30.92             1
svyset pweight weight1
svyset fpc birth
svymean momsag

Survey mean estimation

pweight:  weight1                                 Number of obs    =        25
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =        25
FPC:      birth                                   Population size  =       773

------------------------------------------------------------------------------
Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
momsag |        .92    .0544746    .8075699     1.03243           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

Now let's see what happens when you ignore the sampling design. We will clear the survey settings from Stata and use the ci command to get the mean, standard error and confidence interval. Hence, in this analysis, the pweight and the fpc are ignored. We will use the svyset command with no options to check that no variables have been set.

* PLEASE REMEMBER THAT THE ANALYSIS BELOW IS INCORRECT!!
svyset, clear
svyset
no variables have been set
ci momsag
    Variable |     Obs         Mean    Std. Err.       [95% Conf. Interval]
-------------+-------------------------------------------------------------
momsag |      25          .92    .0553775        .8057065    1.034294

As you can see, the mean is the same as that obtained when including the sampling design information. However, the standard error is larger, because no finite population correction has been applied. If we multiply the standard error by the square root of the fpc, here (N - n)/N with N = 773 and n = 25, we will obtain the correct standard error.

display sqrt((773-25)/773)*.0553775
.05447464

Next, we restore the survey settings and estimate the population total.

svyset pweight weight1
svyset fpc birth
svytotal momsag

Survey total estimation

pweight:  weight1                                 Number of obs    =        25
Strata:   <one>                                   Number of strata =         1
PSU:      <observations>                          Number of PSUs   =        25
FPC:      birth                                   Population size  =       773

------------------------------------------------------------------------------
Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
momsag |     711.16    42.10889    624.2515    798.0685           1
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
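
The numbers in this section can be reproduced by hand. The Python sketch below assumes that momsag is a 0/1 variable, so a mean of .92 over 25 observations implies 23 ones and 2 zeros. The data file itself is not read here.

```python
import math
from statistics import stdev

# Assumed data: 23 ones and 2 zeros (mean .92 over 25 observations).
y = [1] * 23 + [0] * 2
n = len(y)
N = 773                               # population size (the fpc variable, birth)
pweight = N / n                       # 30.92, as in weight1

srs_se = stdev(y) / math.sqrt(n)      # SE ignoring the design, as from ci
fpc = math.sqrt((N - n) / N)          # correction used in the display command
se_mean = fpc * srs_se                # SE with the FPC, as from svymean

total = pweight * sum(y)              # weighted total
se_total = N * se_mean                # SE of the total, as from svytotal

print(round(srs_se, 7), round(se_mean, 7))
print(round(total, 2), round(se_total, 5))
```

Under these assumptions the results line up with the ci, svymean and svytotal output shown in this section.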

#### Stratified random sampling in Stata

The difference between the example above and the example below is that stratification has been added.

use http://www.ats.ucla.edu/stat/books/sop/hospsamp.dta, clear
list in 1/10
         hospno     oblevel     weighta     tothosp      births
1.         15           1        10.5          42         480
2.         80           1        10.5          42         426
3.         86           1        10.5          42         342
4.        136           1        10.5          42         174
5.          7           2   19.799988          99        2022
6.         26           2   19.799988          99         576
7.         62           2   19.799988          99        1999
8.         90           2   19.799988          99         482
9.        101           2   19.799988          99         836
10.         28           3   2.8333321          17        3108

svyset pweight weighta
svyset strata oblevel
svyset fpc tothosp
svytotal births

Survey total estimation

pweight:  weighta                                 Number of obs    =        15
Strata:   oblevel                                 Number of strata =         3
PSU:      <observations>                          Number of PSUs   =        15
FPC:      tothosp                                 Population size  = 157.99993

------------------------------------------------------------------------------
Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
births |   183982.9    34014.33      109872    258093.8    .7035474
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

svytotal births, by (oblevel)

Survey total estimation

pweight:  weighta                                 Number of obs    =        15
Strata:   oblevel                                 Number of strata =         3
PSU:      <observations>                          Number of PSUs   =        15
FPC:      tothosp                                 Population size  = 157.99993

------------------------------------------------------------------------------
Total  Subpop. |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------------+--------------------------------------------------------------
births         |
oblevel==1 |      14931    2669.857    9113.882    20748.12      .15648
oblevel==2 |   117116.9    33067.66    45068.68    189165.2    1.089405
oblevel==3 |   51934.98    7508.399    35575.58    68294.37    .0330073
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.

#### One-stage cluster sampling in Stata

use http://www.ats.ucla.edu/stat/books/sop/tab9_1a.dta, clear
list devlpmnt  HH wt1 M NVSTNRS NGE65 hhneedvn in 1/10
     devlpmnt        HH        wt1          M   NVSTNRS     NGE65   hhneedvn
1.        2         1        2.5          5         1         2          1
2.        2         2        2.5          5         0         1          0
3.        2         3        2.5          5         0         2          0
4.        2         4        2.5          5         1         1          1
5.        2         5        2.5          5         0         1          0
6.        2         6        2.5          5         0         1          0
7.        2         7        2.5          5         1         2          1
8.        2         8        2.5          5         1         1          1
9.        2         9        2.5          5         1         3          1
10.        2        10        2.5          5         1         1          1
svyset pweight wt1
svyset fpc M
svyset psu devlpmnt
svytotal NVSTNRS NGE65

Survey total estimation

pweight:  wt1                                     Number of obs    =        40
Strata:   <one>                                   Number of strata =         1
PSU:      devlpmnt                                Number of PSUs   =         2
FPC:      M                                       Population size  =       100

------------------------------------------------------------------------------
Total |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
NVSTNRS |       57.5    1.936492    32.89454    82.10546    .0707804
NGE65 |      167.5    1.936492    142.8945    192.1055    .0393542
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
svymean NVSTNRS hhneedvn

Survey mean estimation

pweight:  wt1                                     Number of obs    =        40
Strata:   <one>                                   Number of strata =         1
PSU:      devlpmnt                                Number of PSUs   =         2
FPC:      M                                       Population size  =       100

------------------------------------------------------------------------------
Mean |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
---------+--------------------------------------------------------------------
NVSTNRS |       .575    .0193649    .3289454    .8210546    .0707804
hhneedvn |       .525    .0193649    .2789454    .7710546    .0977444
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
svyratio NVSTNRS NGE65

Survey ratio estimation

pweight:  wt1                                     Number of obs    =        40
Strata:   <one>                                   Number of strata =         1
PSU:      devlpmnt                                Number of PSUs   =         2
FPC:      M                                       Population size  =       100

------------------------------------------------------------------------------
Ratio       |   Estimate    Std. Err.   [95% Conf. Interval]        Deff
------------------+-----------------------------------------------------------
NVSTNRS/NGE65    |   .3432836    .0075924    .2468131    .4397541    .0325067
------------------------------------------------------------------------------
Finite population correction (FPC) assumes simple random sampling without
replacement of PSUs within each stratum with no subsampling within PSUs.
Weights must represent population totals for deff to be correct when
using an FPC.  Note: deft is invariant to the scale of weights.
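
With only two PSUs, the standard error of a total depends entirely on how much the two cluster totals differ. The Python sketch below uses hypothetical per-development totals of 12 and 11 for NVSTNRS. These values were chosen because they reproduce the reported estimate and standard error; they were not read from the data file. The variance formula is the textbook one for a simple one-stage cluster sample drawn without replacement, which is an assumption about what the software computes for this design.

```python
import math
from statistics import variance

M = 5             # developments in the population (the fpc variable, M)
m = 2             # developments in the sample
weight = M / m    # 2.5, as in wt1
households = 100  # population size reported in the output

# Hypothetical cluster totals of NVSTNRS, chosen to reproduce the output.
cluster_totals = [12, 11]

total = weight * sum(cluster_totals)

# Variance of the estimated total for one-stage cluster sampling
# without replacement: M^2 * (1 - m/M) * s^2 / m.
var_total = M**2 * (1 - m / M) * variance(cluster_totals) / m
se_total = math.sqrt(var_total)

# With equal-sized clusters the mean is the total divided by a constant.
se_mean = se_total / households

print(total, round(se_total, 6), round(se_mean, 7))
```

Because there are only 2 PSUs, there is just 1 degree of freedom, which is why the confidence intervals in the output are so wide relative to the standard errors.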

#### Simple random sampling using SUDAAN

The SAS data files that are used for the SUDAAN and SAS examples can be downloaded here:

momsag.sas7bdat
hospsamp.sas7bdat
tab9_1c.sas7bdat

proc descript data = momsag filetype = sas design = wor total ;
weight weight1;
nest _one_;
totcnt birth;
var momsag;
run;

Number of observations read    :     25    Weighted count :      773
Denominator degrees of freedom :     24

Variance Estimation Method: Taylor Series (WOR)
by: Variable, One.

-----------------------------------------------------
|                 |                  |
| Variable        |                  | One
|                 |                  | 1            |
-----------------------------------------------------
|                 |                  |              |
| MOMSAG          | Sample Size      |           25 |
|                 | Weighted Size    |       773.00 |
|                 | Total            |       711.16 |
|                 | SE Total         |        42.11 |
|                 | Mean             |         0.92 |
|                 | SE Mean          |         0.05 |
-----------------------------------------------------

#### Stratified random sampling using SUDAAN

proc descript data = hospsamp filetype = sas design = wor totals;
nest oblevel;
weight weighta;
totcnt tothosp;
var births;
subgroup oblevel;
levels 3;
setenv decwidth = 3;
run;
Number of observations read    :     15    Weighted count :      158
Denominator degrees of freedom :     12

Variance Estimation Method: Taylor Series (WOR)
by: Variable, OBLEVEL.

------------------------------------------------------------------------------------
|                 |                  |
| Variable        |                  | OBLEVEL
|                 |                  | Total                | 1                    |
------------------------------------------------------------------------------------
|                 |                  |                      |                      |
| BIRTHS          | Sample Size      |               15.000 |                4.000 |
|                 | Weighted Size    |              158.000 |               42.000 |
|                 | Total            |           183982.904 |            14931.000 |
|                 | SE Total         |            34014.329 |             2669.857 |
|                 | Mean             |             1164.449 |              355.500 |
|                 | SE Mean          |              215.281 |               63.568 |
------------------------------------------------------------------------------------

Variance Estimation Method: Taylor Series (WOR)
by: Variable, OBLEVEL.

------------------------------------------------------------------------------------
|                 |                  |
| Variable        |                  | OBLEVEL
|                 |                  | 2                    | 3                    |
------------------------------------------------------------------------------------
|                 |                  |                      |                      |
| BIRTHS          | Sample Size      |                5.000 |                6.000 |
|                 | Weighted Size    |               99.000 |               17.000 |
|                 | Total            |           117116.928 |            51934.977 |
|                 | SE Total         |            33067.664 |             7508.399 |
|                 | Mean             |             1183.000 |             3055.000 |
|                 | SE Mean          |              334.017 |              441.671 |
------------------------------------------------------------------------------------
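
Because the strata are sampled independently, the overall total is the sum of the stratum totals and its variance is the sum of the stratum variances. The short Python check below copies the stratum totals and standard errors from the output above and combines them.

```python
import math

# Stratum totals and their standard errors, copied from the output above.
stratum_totals = {1: 14931.000, 2: 117116.928, 3: 51934.977}
stratum_ses = {1: 2669.857, 2: 33067.664, 3: 7508.399}

# Independent strata: totals add, and variances (squared SEs) add.
overall_total = sum(stratum_totals.values())
overall_se = math.sqrt(sum(se**2 for se in stratum_ses.values()))

print(round(overall_total, 3), round(overall_se, 3))
```

Up to rounding of the inputs, the results match the Total column of the first table.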

#### One-stage cluster sampling using SUDAAN

proc descript data = tab9_1c filetype =sas design = wor means totals;
nest _one_  devlpmnt;
totcnt m _zero_;
weight wt1;
var nge65 nvstnrs hhneedvn;
run;
Number of observations read    :     40    Weighted count :      100
Denominator degrees of freedom :      1

Variance Estimation Method: Taylor Series (WOR)
by: Variable, One.

------------------------------------------------------
|                 |                  |
| Variable        |                  | One
|                 |                  | 1             |
------------------------------------------------------
|                 |                  |               |
| NGE65           | Sample Size      |      40.00000 |
|                 | Weighted Size    |     100.00000 |
|                 | Total            |     167.50000 |
|                 | SE Total         |       1.93649 |
|                 | Mean             |       1.67500 |
|                 | SE Mean          |       0.01936 |
------------------------------------------------------
|                 |                  |               |
| NVSTNRS         | Sample Size      |      40.00000 |
|                 | Weighted Size    |     100.00000 |
|                 | Total            |      57.50000 |
|                 | SE Total         |       1.93649 |
|                 | Mean             |       0.57500 |
|                 | SE Mean          |       0.01936 |
------------------------------------------------------
|                 |                  |               |
| HHNEEDVN        | Sample Size      |      40.00000 |
|                 | Weighted Size    |     100.00000 |
|                 | Total            |      52.50000 |
|                 | SE Total         |       1.93649 |
|                 | Mean             |       0.52500 |
|                 | SE Mean          |       0.01936 |
------------------------------------------------------

proc ratio data = tab9_1c filetype = sas design = wor;
nest _one_ devlpmnt;
totcnt M _zero_;
weight wt1;
numer nvstnrs;
denom nge65;
setenv decwidth = 5;
run;
Number of observations read    :     40    Weighted count :      100
Denominator degrees of freedom :      1

Variance Estimation Method: Taylor Series (WOR)
by: Variable, One.

------------------------------------------------------
|                 |                  |
| Variable        |                  | One
|                 |                  | 1             |
------------------------------------------------------
|                 |                  |               |
| NVSTNRS/NGE65   | Sample Size      |      40.00000 |
|                 | Weighted Size    |     100.00000 |
|                 | Weighted X-Sum   |     167.50000 |
|                 | Weighted Y-Sum   |      57.50000 |
|                 | Ratio Est.       |       0.34328 |
|                 | SE Ratio         |       0.00759 |
------------------------------------------------------

#### Simple random sampling using SAS

proc surveymeans data = momsag n = 773 mean sum std;
weight weight1;
var momsag;
run;
The SURVEYMEANS Procedure

Data Summary

Number of Observations            25
Sum of Weights            773.000002

Statistics

Std Error
Variable            Mean         of Mean             Sum         Std Dev
------------------------------------------------------------------------
MOMSAG          0.920000        0.054475      711.160002       42.108894
------------------------------------------------------------------------

#### Stratified random sampling using SAS

data second138;
input id _TOTAL_ oblevel;
cards;
1 42 1
2 42 1
3 42 1
4 42 1
5 99 2
6 99 2
7 99 2
8 99 2
9 99 2
10 17 3
11 17 3
12 17 3
13 17 3
14 17 3
15 17 3
;
run;

NOTE:  You cannot get the totals for both the whole group and the sub-groups in the same proc surveymeans.
NOTE: The data set second138 is used to tell SAS what the totals are in each stratum. These totals are used to compute the finite population correction (fpc). SAS allows only one number to be supplied on the proc surveymeans statement. Because the totals change from one stratum to the next, we need to supply them to SAS in a data set. You can include these data in the primary data set or in a secondary data set. In this example, we will use a secondary data set. Also note that the secondary data set can be "collapsed"; in other words, it needs just one line (observation) for each stratum. In the secondary data set, the variable that contains the totals must be called _TOTAL_. The variable oblevel is copied from the original data set because SAS requires all of the variables listed on the strata statement to appear in this data set. In our example, there is only one variable listed on the strata statement, but in other cases, there may be two or more variables listed.

proc surveymeans data = hospsamp n = second138 sum ;
weight weighta;
strata oblevel;
var births;
run;
The SURVEYMEANS Procedure

Data Summary

Number of Strata                   3
Number of Observations            15
Sum of Weights            157.999931

Statistics

Variable             Sum         Std Dev
----------------------------------------
births            183983           34014
----------------------------------------
proc surveymeans data = hospsamp n = second138 sum;
weight weighta;
strata oblevel;
by oblevel;
var births;
run;
oblevel=1

The SURVEYMEANS Procedure

Data Summary

Number of Strata                   1
Number of Observations             4
Sum of Weights                    42

Statistics

Variable             Sum         Std Dev
----------------------------------------
births             14931     2669.856738
----------------------------------------

oblevel=2

The SURVEYMEANS Procedure

Data Summary

Number of Strata                   1
Number of Observations             5
Sum of Weights             98.999939

Statistics

Variable             Sum         Std Dev
----------------------------------------
births            117117           33068
----------------------------------------

oblevel=3

The SURVEYMEANS Procedure

Data Summary

Number of Strata                   1
Number of Observations             6
Sum of Weights            16.9999924

Statistics

Variable             Sum         Std Dev
----------------------------------------
births             51935     7508.399372
----------------------------------------

#### One-stage cluster sampling using SAS

proc surveymeans data = tab9_1c n = 5 sum mean;
weight wt1;
cluster devlpmnt;
var nge65 nvstnrs hhneedvn;
run;
The SURVEYMEANS Procedure

Data Summary

Number of Clusters                 2
Number of Observations            40
Sum of Weights                   100

Statistics

Std Error
Variable            Mean         of Mean             Sum         Std Dev
------------------------------------------------------------------------
NGE65           1.675000        0.019365      167.500000        1.936492
NVSTNRS         0.575000        0.019365       57.500000        1.936492
HHNEEDVN        0.525000        0.019365       52.500000        1.936492
------------------------------------------------------------------------
