Statistical Computing Seminars
Introduction to Survey Data Analysis in Stata 9

The purpose of this seminar is to explore how to analyze survey data collected under different sampling plans using Stata 9.  (If you are using Stata 8, please see Survey Data Analysis with Stata 8.)  Other examples, including those using other survey data analysis packages, can be found at Choosing the Correct Analysis for Various Survey Designs.  Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.

Why do we need survey data analysis software?

Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling.  For experimental and quasi-experimental designs, this is exactly what we want.  However, when surveys are conducted, a simple random sample is rarely collected.  Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods.  When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling.  This is because the sampling design affects the calculation of the standard errors of the estimates.  If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant, when in fact, they are not.  The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set.  While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be. 

Sampling designs

Most people do not conduct their own surveys.  Rather, they use survey data that some agency or company collected and made available to the public.  The documentation must be read carefully to find out what kind of sampling design was used to collect the data.  This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs.  Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong. 

Below are some common features of many sampling designs.

Weights:  There are many types of weights that can be associated with a survey.  Perhaps the most common is the sampling weight, sometimes called a pweight, which is used to denote the inverse of the probability of being included in the sample due to the sampling design (except for a certainty PSU, see below).  When every element has the same chance of selection, the pweight is calculated as N/n, where N = the number of elements in the population and n = the number of elements in the sample.  For example, if a population has 10 elements and 3 are sampled at random with replacement, then the pweight would be 10/3 = 3.33.  In a two-stage design, the pweight is calculated as f1*f2, where f1 is the inverse of the sampling fraction for the first stage and f2 is the inverse of the sampling fraction for the second stage.  Under many sampling plans, the sum of the pweights will equal the population total.

PSU:  This is the primary sampling unit.  This is the first unit that is sampled in the design.  For example, school districts from California may be sampled and then schools within districts may be sampled.  The school district would be the PSU.  If states from the US were sampled, and then school districts from within each state, and then schools from within each district, then states would be the PSU.  One does not need to use the same sampling method at all levels of sampling.  For example,  probability-proportional-to-size sampling may be used at level 1 (to select states), while cluster sampling is used at level 2 (to select school districts).  In the case of a simple random sample, the PSUs and the elementary units are the same.

Strata:  Stratification is a method of breaking up the population into different groups, often by demographic variables such as gender, race or SES.  Once these groups have been defined, one samples from each group as if it were independent of all of the other groups.  For example, if a sample is to be stratified on gender, men and women would be sampled independently of one another.  This means that the pweights for the men will likely be different from the pweights for the women.  In most cases, you need to have two or more PSUs in each stratum.  The purpose of stratification is to improve the precision of the estimates, and stratification works most effectively when the variance of the dependent variable is smaller within the strata than in the sample as a whole.

FPC:  This is the finite population correction.  This is used when the sampling fraction (the number of elements or respondents sampled relative to the population) becomes large.  The FPC is used in the calculation of the standard error of the estimate.  If the value of the FPC is close to 1, it will have little impact and can be safely ignored.  In some survey data analysis programs, such as SUDAAN, this information will be needed if you specify that the data were collected without replacement (see below for a definition of "without replacement").  The formula for calculating the FPC is ((N-n)/(N-1))^(1/2), that is, the square root of (N-n)/(N-1), where N is the number of elements in the population and n is the number of elements in the sample.  To see the impact of the FPC for samples of various proportions, suppose that you had a population of 10,000 elements.

Sample size (n)    FPC
1                1.0000
10                .9995 
100               .9950
500               .9747
1000              .9487
5000              .7071
9000              .3162
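
The table above can be reproduced directly from the formula. The following is a minimal sketch in Python (not part of the Stata session); the function name fpc is our own choice for illustration.

```python
import math

def fpc(N, n):
    """Finite population correction: square root of (N - n) / (N - 1)."""
    return math.sqrt((N - n) / (N - 1))

N = 10_000  # population size used in the table above
for n in (1, 10, 100, 500, 1000, 5000, 9000):
    # One row per sample size, rounded to four decimals as in the table
    print(f"{n:>5}  {fpc(N, n):.4f}")
```

The correction stays close to 1 until the sample is a sizable share of the population, which is why it matters little for small sampling fractions.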

Sampling with and without replacement

Most samples collected in the real world are collected "without replacement".  This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample.  Many of the calculations change depending on whether a sample is collected with or without replacement.  Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is used, even if the value of the FPC is very close to one.

Examples

In the examples that follow, we have data that represent a population, and we will discuss the analysis of these survey data as if they had been collected under five sampling plans:  simple random sampling, stratified random sampling, systematic sampling, one-stage cluster sampling and two-stage cluster sampling with stratification.  The Stata code necessary to generate the samples using each of these sampling plans is shown here.  The variables from the data set with which we will be working include api00 and api99, which are aggregates of student test scores for each school for the years 2000 and 1999, respectively; yr_rnd, which is a 0/1 variable indicating if the school is on a year-round calendar; awards, which indicates whether or not the school met its target; meals, which indicates the percentage of children receiving free or reduced-price meals at school; both, which indicates that the school met both targets; and growth, which is the difference between the api scores in the current year and those of the last year.

One of the most important points to remember is that all svy commands can be used with any sampling plan.  To help illustrate this, we will use the svy: mean and the svy: total commands with each sampling plan.  Another important point is that the interpretation of the results from the svy commands is usually no different than the interpretation that you would have if you had used the equivalent non-survey command.  For example, there is no special interpretation of regression coefficients just because you obtained them using svy: reg instead of regress.

Simple random sample

We will start by showing how you can take a simple random sample (SRS) from your data file. While we will not go through the commands necessary for obtaining any other type of sample, we will go over how to draw an SRS. Simple random samples are very rare in actual practice; however, researchers will often draw an SRS of their data set so that they can work out their data analysis programs on a relatively small data file. This saves computing time and resources, as the analysis program may have to be run many times before it is satisfactory.

set mem 5m
use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
count

 6194

set seed 1003002849
sample 5
count

 310

Because we have eliminated elements of our population to create our sample, we need to create pweights (probability weights). We selected 5% of the elements in our population into our sample, so our sampling fraction is 310/6194. The pweight is the inverse of the sampling fraction, or N/n, where N is the population total (6194) and n is the number of elements selected into the sample (310). Another way to think of this is: "How many elements (schools, people, whatever) in the population should each element in the sample represent?" Each school in our current sample should represent about twenty schools in the population, so all of the pweights will be the same: 6194/310, or approximately 20.

gen pw = 6194/310
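
As a quick check of the arithmetic behind this weight, here is a minimal sketch in Python (outside the Stata session). The helper name pweight and the two-stage sizes are hypothetical, chosen only for illustration.

```python
# Sketch: sampling weights (pweights) as inverse sampling fractions.
N = 6194   # schools in the population (from the first count above)
n = 310    # schools kept by `sample 5` (from the second count above)

def pweight(population_size, sample_size):
    """Inverse of the sampling fraction for an equal-probability sample."""
    return population_size / sample_size

pw = pweight(N, n)
print(round(pw, 2))     # 19.98
print(round(pw * n))    # 6194: the weights sum back to the population total

# Two-stage design: multiply the stage-wise inverse sampling fractions.
# Hypothetical sizes: 25 of 100 clusters, then 3 of 12 elements per cluster.
pw_two_stage = pweight(100, 25) * pweight(12, 3)
print(pw_two_stage)     # 16.0
```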

Next, we need to consider how large our sample is relative to our population to determine if we need to use a finite population correction. (For a quick review of FPCs, please see the summary at the beginning of this handout.) In Stata, we only need to give the population total, and Stata will make the necessary calculations to obtain the correct FPC.  Note that the svyset command is very different in Stata 8 than it was in Stata 7.

gen fpc = 6194

We use the svyset command to tell Stata about the features of the sampling design that we have. In this case, we only need to specify the pweight and the FPC.

svyset [pweight=pw], fpc(fpc)
      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc

Next, we will use the svydes command to display the information Stata has regarding our sampling plan. As you can see, the number of PSUs and observations is the same, which reassures us that Stata understands that we have a simple random sample. We also see that there is only one stratum, which is correct for this type of sampling plan. Note that once you have used the svyset command, Stata will remember this information for your entire session; you do not need to reissue this command (unless you want to change something). Also, if you save your data, Stata will save the survey information with the data set, so that when you open the data in your next session, the survey information will be used when you issue svy commands.

svydes
Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max   
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1

We will start our analysis of these data with some basic descriptive statistics. We will use the svy: mean and svy: total commands. The svy: mean command is used to estimate the mean of a variable in the population. In our example, we will estimate the mean for api00 and growth.  Please note that svy: mean is an estimation command, and Stata will do a listwise deletion of missing data.  For example, if we had missing data on api00, the command below would probably give a different mean for growth than the command svy: mean growth would, because a different number of cases would be used in the calculation of the mean.

svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =     310
Number of PSUs   =     310          Population size  =    6194
                                    Design df        =     309

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   663.2645    7.21478      649.0682    677.4608
      growth |   33.84516   1.667394      30.56428    37.12604
--------------------------------------------------------------

The svy: total command is used to get estimates of population totals. In our example, we will get an estimate of how many schools are on a year-round calendar. From the output of the svy: total command, we can see that approximately 719 schools are on a year-round calendar.

svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =     310
Number of PSUs   =     310          Population size  =    6194
                                    Design df        =     309

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   719.3032   110.0291      502.8022    935.8042
--------------------------------------------------------------
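
The estimate and its standard error can be reproduced by hand. The sketch below is in Python and rests on two assumptions: the number of year-round schools in the sample (36) is not reported above and is inferred from 719.3032 / (6194/310), and the variance formula is the textbook one for a total under simple random sampling without replacement, which appears to agree with Stata's linearized result here.

```python
import math

N, n = 6194, 310   # population and sample sizes from the output above
k = 36             # year-round schools in the sample (inferred, not in the source)

pw = N / n
total_hat = k * pw                  # each sampled school stands for N/n schools

p = k / n                           # sample proportion of year-round schools
s2 = n / (n - 1) * p * (1 - p)      # sample variance of a 0/1 variable
# Standard error of the total, with the correction factor (1 - n/N)
se_total = N * math.sqrt((1 - n / N) * s2 / n)

print(round(total_hat, 4))   # 719.3032
print(round(se_total, 2))    # 110.03
```

Note that the factor used here is 1 - n/N applied to the variance, which is nearly the square of the FPC as defined earlier in this handout.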

We will now do a multiple regression. We will use api00 as the dependent variable and awards and meals as independent variables. We can see from the output that the model is statistically significant (F = 464.21, p < .001), and that each of the predictors is also statistically significant. You can interpret the output from the svy commands in the same way that you would the non-svy command. In this example, you interpret the output from the svy: reg command in the same way that you would the output from the regress command. Remember that the difference between the svy: reg and the regress commands is how the standard errors are calculated. The svy: reg command takes into account the survey sampling plan, while the regress command does not.

svy: reg api00 awards meals
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       310
Number of PSUs     =       310                  Population size    = 6193.9997
                                                Design df          =       309
                                                F(   2,    308)    =    464.21
                                                Prob > F           =    0.0000
                                                R-squared          =    0.7124

------------------------------------------------------------------------------
             |             Linearized
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   53.37164   9.047051     5.90   0.000     35.57002    71.17326
       meals |  -3.329605   .1285952   -25.89   0.000    -3.582638   -3.076571
       _cons |   738.9207   19.47419    37.94   0.000     700.6019    777.2395
------------------------------------------------------------------------------

Stratified random sampling

The difference between the example above and the next example is that stratification has been added to the sampling design.  For this example, we have calculated the mean of api99 and stratified schools based on this.  Schools that were above the mean were placed into one stratum, and schools that were below the mean were placed in the other stratum.  Simple random samples of schools were then drawn from each stratum.  Although we have created only two strata, in many public-use data sets, you can have dozens of strata.

We have used the svyset, clear(all) command here to show how it is used.  After issuing the svyset command, we again use the svydes command to ensure that Stata is handling the survey design appropriately. Next, we use the svy: mean command to obtain the estimated means of api00 and growth.  We can compare these estimates to those obtained from the SRS above.  (Please see the table at the end of this handout.)

use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear

svyset, clear(all)
svyset [pweight = pw], strata(strat) fpc(fpc)
      pweight: pw
          VCE: linearized
     Strata 1: strat
         SU 1: <observations>
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: strat
         SU 1: <observations>
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max   
--------  --------  --------  --------  --------  --------
       1       310       310         1       1.0         1
       2       310       310         1       1.0         1
--------  --------  --------  --------  --------  --------
       2       620       620         1       1.0         1

Below we use the svy: mean command to get the population estimate of the mean of api00.  We can  use the estat effects command to get the design effect.  Notice the value of the design effect, labeled Deff in the output.  The design effect compares the current sampling design (in this case, stratified random sampling) with simple random sampling.  Design effects of 1 (or close to 1) indicate that the current sampling design is about as efficient as a simple random sample.  Design effects that are smaller than 1 indicate that the current design is more efficient than simple random sampling, while design effects that are larger than 1 indicate that the current sampling design is less efficient than simple random sampling.  Here, we can see the benefit of the stratification:  the design effect for api00 is .35, well below 1.  However, you will remember that we stratified on the mean of api99, which is closely related to api00, the variable for which we are getting an estimate. 

svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2          Number of obs    =     620
Number of PSUs   =     620          Population size  =    6194
                                    Design df        =     618

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   665.6216   2.957053      659.8145    671.4287
      growth |   33.26666    1.10437      31.09788    35.43543
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       Deff      Deft
-------------+--------------------------------------------
       api00 |   665.6216   2.957053     .346875   .558707
      growth |   33.26666    1.10437     .962983   .930909
----------------------------------------------------------
Note: Weights must represent population totals for deff to be correct when using an FPC; however, deft is invariant to the scale of
      weights.
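
The Deff and Deft columns above are related through the sampling fraction. The Python sketch below checks this numerically. The relationship deft^2 = deff * (1 - n/N) is our reading of how the two are defined when an FPC is set (Deff relative to simple random sampling without replacement, Deft relative to sampling with replacement); treat it as an assumption that happens to match the output, not as a statement from the Stata manual.

```python
import math

N, n = 6194, 620   # population size and number of observations from the output

# (variable, Deff, Deft) copied from the estat effects output above
rows = (("api00", 0.346875, 0.558707), ("growth", 0.962983, 0.930909))

implied = {}
for name, deff, deft in rows:
    # Deft implied by Deff under the assumed relationship
    implied[name] = math.sqrt(deff * (1 - n / N))
    print(name, round(implied[name], 4), deft)
```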

In the results of the svy: total shown below, you will see that the design effect is not much smaller than 1; in other words, we get relatively little benefit from the stratification.  That is because there is not much of a relationship between api99 and yr_rnd.  The point here is that for stratification to be genuinely useful, you need to stratify on variable(s) closely related to the variable of interest.  In many cases, this will mean that while stratification will make some estimates more efficient, it will not do so for others.

svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       2          Number of obs    =     620
Number of PSUs   =     620          Population size  =    6194
                                    Design df        =     618

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   789.5516   76.59607      639.1314    939.9717
--------------------------------------------------------------

When estimates are made for each stratum, they are made independently of all other strata.  In other words, the estimate of yr_rnd for stratum 1 was made independently of the estimate for stratum 2.  Also note that the sum of the estimates for stratum 1 and stratum 2 equals the value shown above.

svy: total yr_rnd, over(strat)

(running total on estimation sample)

Survey: Total estimation

Number of strata =       2          Number of obs    =     620
Number of PSUs   =     620          Population size  =    6194
                                    Design df        =     618

            1: strat = 1
            2: strat = 2

--------------------------------------------------------------
             |             Linearized
        Over |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
yr_rnd       |
           1 |   639.7935   67.69423      506.8549    772.7321
           2 |   149.7581   35.83923      79.37663    220.1395
--------------------------------------------------------------

Systematic sampling

Systematic sampling is just that:  drawing a sample from elements that are ordered in a systematic way.  For example, you might take a systematic sample of library books by selecting every k-th book from the books on the shelf.  (Remember that librarians hate when people actually do this!)  Of course, first you need to determine how large a sample you want to select.  There are 6194 schools in our population, and we would like to use systematic sampling to select a sample of about 500.  First, we need to determine the "rate" at which schools should be selected.  We do this by dividing the number of elements (e.g., schools) by the number desired in the sample.  Therefore, k = 6194/500 = 12.39, which we will round up to 13.  Hence, we will select every 13th school.  We will also randomly select a number from 1 to 13 and start counting from there.  In our example, we selected the number 4.  Hence, we ordered the schools from lowest id number to highest id number, started with school number 4, and then selected into our sample every 13th school.  Because we rounded k up, the resulting sample has somewhat fewer than 500 schools.  After creating our sample, we follow the same procedure as before: open the correct data file, issue the svyset command, check to see that everything is OK with the svydes command, and then begin our analysis with descriptive statistics.
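
The selection rule just described can be sketched in a few lines of Python (this is not the Stata code used to build the data file). Positions are 1-based ranks in the list of schools sorted by id number.

```python
import math

N = 6194                        # schools in the population
target_n = 500                  # desired sample size
k = math.ceil(N / target_n)     # 6194/500 = 12.39, rounded up to 13
start = 4                       # random start between 1 and k used in the text

# Positions of the selected schools: 4, 17, 30, ...
selected = list(range(start, N + 1, k))
print(k, len(selected), selected[:3], selected[-1])
```

Under these assumptions the rule yields 477 schools, which agrees with the number of observations that svydes reports below.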

use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/systematic.dta, clear
svyset [pweight = pw], fpc(fpc)
      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max   
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1
--------  --------  --------  --------  --------  --------
       1       477       477         1       1.0         1

Below we get the population estimates for the mean of api00 and growth.

svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =     477
Number of PSUs   =     477          Population size  =    6194
                                    Design df        =     476

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   656.3061   5.655353      645.1935    667.4186
      growth |   33.08595   1.226588      30.67576    35.49615
--------------------------------------------------------------

estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.       Deff      Deft
-------------+--------------------------------------------
       api00 |   656.3061   5.655353           1   .960724
      growth |   33.08595   1.226588           1   .960724
----------------------------------------------------------
Note: Weights must represent population totals for deff to be correct when using an FPC; however, deft is invariant to the scale of
      weights.

Notice that the design effect for all variables is 1.  This is not necessarily because systematic sampling is always just as efficient as simple random sampling.  Rather, it has to do with the information that you have given to Stata.  The design effect is influenced by setting the strata and PSU.  In both simple random sampling and systematic sampling, we set neither the strata nor the PSU.  Hence, Stata "can't tell the two sampling plans apart."  Because the specification of the sampling design is exactly the same as with simple random sampling, the design effect is 1.  However, you can calculate the design effect by hand by dividing the variance of the estimate for the variable of interest under the current sampling design by the variance of the same estimate under simple random sampling.  We did this and found that the design effects were fairly close to 1.  We found them to be .96 for api00, .93 for growth and 1.2 for yr_rnd.

svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =     477
Number of PSUs   =     477          Population size  =    6194
                                    Design df        =     476

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   779.1195   90.44644      601.3958    956.8432
--------------------------------------------------------------

Below we show the use of the svy: tab command.  This can be used to make one- and two-way crosstabulations.  Here we will make a crosstab of both and awards.  The values in the cells are proportions.  You can use the count option (as shown below) to obtain the counts in each cell.  The svy: tab command also gives us the chi-square test for these two variables.  We can see that the relationship between them is statistically significant.

svy: tab both awards
(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       477
Number of PSUs     =       477                  Population size    =      6194
                                                Design df          =       476

-------------------------------
met both  | eligible for awards
targets   |    no    yes  Total
----------+--------------------
       No | .3019      0  .3019
      Yes | .0503  .6478  .6981
          | 
    Total | .3522  .6478      1
-------------------------------
  Key:  cell proportions

  Pearson:
    Uncorrected   chi2(1)         =  379.3900
    Design-based  F(1, 476)       =  427.4673     P = 0.0000

svy: tab both awards, count

(running tabulate on estimation sample)

Number of strata   =         1                  Number of obs      =       477
Number of PSUs     =       477                  Population size    =      6194
                                                Design df          =       476

-------------------------------
met both  | eligible for awards
targets   |    no    yes  Total
----------+--------------------
       No |  1870      0   1870
      Yes | 311.6   4012   4324
          | 
    Total |  2182   4012   6194
-------------------------------
  Key:  weighted counts

  Pearson:
    Uncorrected   chi2(1)         =  379.3900
    Design-based  F(1, 476)       =  427.4673     P = 0.0000

svy: reg api00 award meals

(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         1                  Number of obs      =       477
Number of PSUs     =       477                  Population size    =      6194
                                                Design df          =       476
                                                F(   2,    475)    =    679.67
                                                Prob > F           =    0.0000
                                                R-squared          =    0.6967

------------------------------------------------------------------------------
             |             Linearized
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   46.30969   7.237096     6.40   0.000     32.08908    60.53029
       meals |  -3.406531   .1056495   -32.24   0.000    -3.614128   -3.198934
       _cons |   791.0985   9.321325    84.87   0.000     772.7825    809.4146
------------------------------------------------------------------------------

One-stage cluster sampling in Stata

In a one-stage cluster sample, the data are divided into two "levels", one "nested" in the other. At the first level, the data are grouped into clusters. In a one-stage cluster sample, clusters are selected first and are called primary sampling units, or PSUs.  All of the elements in each selected cluster are selected into the sample.  These elements represent the second "level" of the data.  In our one-stage cluster sample, the districts will be the clusters and the schools will be the elementary or sampling units.  Hence, we randomly select school districts and then select all schools within each selected district.  You can use any sampling plan to select the clusters; we have used SRS only for the sake of simplicity.

Typically, data values within a cluster are more similar to one another than they are to data values in other clusters. For example, if we surveyed people in households (e.g., people nested within households), we would expect that people in one household would be more similar to one another than they would be to people in another household.  Unfortunately, this similarity makes our estimates less precise: each additional element from the same cluster adds less new information, so the standard errors tend to be larger than they would be for a simple random sample of the same size. However, because of financial and/or logistical considerations, most surveys employ some sort of cluster sampling.

use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/oscs1, clear
svyset dnum [pweight = pw], fpc(fpc)

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: dnum
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pw
          VCE: linearized
     Strata 1: <one>
         SU 1: dnum
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max   
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100
--------  --------  --------  --------  --------  --------
       1       189      1463         1       7.7       100

svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       1          Number of obs    =    1463
Number of PSUs   =     189          Population size  = 5859.74
                                    Design df        =     188

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |   670.5202   11.09702      648.6295    692.4108
      growth |   32.85783   1.440905      30.01541    35.70025
--------------------------------------------------------------

svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       1          Number of obs    =    1463
Number of PSUs   =     189          Population size  = 5859.74
                                    Design df        =     188

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   797.0529   176.0585      449.7489    1144.357
--------------------------------------------------------------

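As you can see, the standard errors for the mean of api00 and the total of yr_rnd are much larger than they were under any of the previous sampling plans.  The standard error for the mean of growth is larger than under the stratified and systematic samples, although not larger than under the simple random sample (see the summary table at the end of this page).  Although we don't show an example here, you can easily combine stratification with cluster sampling, and this will usually help to reduce the standard errors.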
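As a rough sketch of what combining stratification with one-stage cluster sampling might look like, the design can be declared by adding the strata() option to svyset.  The variables stratvar, pw_str and fpc_str are hypothetical and are not part of the oscs1 data set; they stand in for a stratum identifier, sampling weights, and population counts of districts from a sample that was actually drawn within strata.

```stata
* Sketch only: stratvar, pw_str and fpc_str are hypothetical variables.
* stratvar = stratum identifier for each school
* pw_str   = sampling weight computed within each stratum
* fpc_str  = number of districts in the population of each stratum
svyset dnum [pweight = pw_str], strata(stratvar) fpc(fpc_str)

* Check that the strata and PSUs were declared as intended
svydes

* Estimation commands are then used exactly as before
svy: mean api00 growth
```

The weights and finite population corrections from the unstratified sample cannot simply be reused, because both depend on how many districts were available and selected within each stratum.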

Two-stage cluster sampling with stratification

In this last example, we will take a stratified two-stage cluster sample.  As with the stratified random sample illustrated above, the sampling within each stratum is done independently of every other stratum.  A two-stage cluster sample means that clusters are sampled (using whatever sampling plan the researcher chooses), and then elements within each of the selected clusters are also sampled.  This differs from a one-stage cluster sample, in which all of the elements in each selected cluster are selected into the sample; in a two-stage cluster sample, (usually) only some of the elements are selected.  In our example, we will take an SRS of school districts (clusters), and then we will take an SRS of schools (elements) within each selected district.  Just as you can use almost any sampling plan to select clusters, you can use almost any sampling plan to select elements from within the selected clusters; the two plans do not have to be the same. Also, you do not have to use the same sampling plan from one stratum to the next, as the sampling between strata is independent.

To obtain the sample used below, we first used the same stratification as before, stratifying schools based on their mean api99 score. Next, we randomly selected 25% of the school districts from each stratum. Finally, we randomly selected three schools from each selected district. The choice of three schools, as opposed to two or four, was rather arbitrary. When deciding how many elements to select from a cluster, remember that you need enough to get stable estimates; on the other hand, because data values within each cluster are likely to be correlated, taking lots of them is often a waste of resources:  200 elements probably won't be much more informative than 100.  (This, of course, depends on how strong the correlation is.)
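The diminishing return from taking more elements per cluster can be illustrated with a common textbook approximation, which is not part of the analysis on this page.  It assumes that every cluster contributes the same number of elements m and that elements in the same cluster share an intraclass correlation rho; the value rho = 0.2 used below is purely hypothetical.

```latex
% Approximate design effect from clustering and the resulting
% "effective" number of independent observations per cluster
\[
\mathrm{deff} \approx 1 + (m - 1)\rho,
\qquad
m_{\mathrm{eff}} = \frac{m}{1 + (m - 1)\rho}
\]

% Illustration with an assumed (hypothetical) rho = 0.2
\[
m = 100:\quad m_{\mathrm{eff}} = \frac{100}{1 + 99(0.2)} = \frac{100}{20.8} \approx 4.8,
\qquad
m = 200:\quad m_{\mathrm{eff}} = \frac{200}{1 + 199(0.2)} = \frac{200}{40.8} \approx 4.9
\]
```

Under these assumptions, doubling the number of elements taken from a cluster adds almost no information, because the effective number of observations per cluster can never exceed 1/rho (here, 5).  With a weaker correlation the gain from extra elements would be larger.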

use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
svyset dnum [pweight = pwt], fpc(fpc) strata(strata)

      pweight: pwt
          VCE: linearized
     Strata 1: strata
         SU 1: dnum
        FPC 1: fpc

svydes

Survey: Describing stage 1 sampling units

      pweight: pwt
          VCE: linearized
     Strata 1: strata
         SU 1: dnum
        FPC 1: fpc

                                      #Obs per Unit
                              ----------------------------
Stratum    #Units     #Obs      min       mean      max   
--------  --------  --------  --------  --------  --------
       1        94       227         1       2.4         3
       2        95       239         1       2.5         3
--------  --------  --------  --------  --------  --------
       2       189       466         1       2.5         3

svy: mean api00 growth
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =       2          Number of obs    =     466
Number of PSUs   =     189          Population size  =  6032.9
                                    Design df        =     187

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       api00 |     681.84   10.44856      661.2278    702.4522
      growth |   30.71763    2.22572      26.32688    35.10838
--------------------------------------------------------------

svy: total yr_rnd
(running total on estimation sample)

Survey: Total estimation

Number of strata =       2          Number of obs    =     466
Number of PSUs   =     189          Population size  =  6032.9
                                    Design df        =     187

--------------------------------------------------------------
             |             Linearized
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      yr_rnd |   718.9149   214.9205      294.9345    1142.895
--------------------------------------------------------------

svy: reg api00 awards meals
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =         2                  Number of obs      =       466
Number of PSUs     =       189                  Population size    = 6032.9042
                                                Design df          =       187
                                                F(   2,    186)    =    556.68
                                                Prob > F           =    0.0000
                                                R-squared          =    0.7114

------------------------------------------------------------------------------
             |             Linearized
       api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      awards |   66.19885   5.867421    11.28   0.000       54.624    77.77369
       meals |  -3.192264   .1135934   -28.10   0.000    -3.416353   -2.968175
       _cons |   772.7654    6.72774   114.86   0.000     759.4934    786.0374
------------------------------------------------------------------------------

We have seen examples of how to do OLS regression with survey data, so now let's do a logistic regression. First, we need to recode our dependent variable so that it is coded 0/1. Next, we issue the svy: logit command. If you want odds ratios, you can use the or option with svy: logit. In this example, we use some new variables. The variable comp_imp1 is coded 0/1 and indicates whether the school met a comparable improvement target; growth is the difference between the current year's api score and last year's api score; ell is the percent of English language learners; and mobility is the percent of students for whom this is their first year at the school.
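The recoding step is not shown on this page, so the following is only a sketch of one way it might be done.  It assumes that the original variable is named comp_imp and is coded 1 = no and 2 = yes; both the name and the coding are assumptions that you should check against your own data.  If comp_imp1 already exists in your data set, this step is not needed.

```stata
* Assumption: the original variable is named comp_imp and is coded
* 1 = no, 2 = yes.  Check the actual coding before recoding.
tabulate comp_imp, missing

* Create a 0/1 copy of the variable; missing values remain missing
recode comp_imp (1 = 0) (2 = 1), generate(comp_imp1)

* Confirm that the new variable lines up with the old one
tabulate comp_imp comp_imp1, missing
```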

svy: logit comp_imp1 growth ell mobility
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =         2                  Number of obs      =       466
Number of PSUs     =       189                  Population size    = 6032.9042
                                                Design df          =       187
                                                F(   3,    185)    =     20.80
                                                Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1213203   .0159442     7.61   0.000     .0898667    .1527739
         ell |  -.0702944   .0119777    -5.87   0.000    -.0939231   -.0466657
    mobility |  -.0781154   .0202496    -3.86   0.000    -.1180624   -.0381684
       _cons |   .6391637   .3169899     2.02   0.045      .013828    1.264499
------------------------------------------------------------------------------
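To see the same model reported as odds ratios, the or option mentioned above can be added to the command.  The following is a minimal sketch that assumes the survey design has already been declared with svyset, as it was earlier in this section.

```stata
* Same model as above, with coefficients displayed as odds ratios
svy: logit comp_imp1 growth ell mobility, or

* An odds ratio is the exponentiated coefficient; _b[growth] is the
* coefficient stored by the most recent estimation command
display exp(_b[growth])
```

With the coefficient for growth shown in the output above (.1213203), the odds ratio is about 1.13.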

Now we will use a three-level variable to show the use of the test command. (Please note that "svytest" is an out-of-date command; use test after a svy estimation command instead.) As you can see, the xi prefix works with the svy commands (and so does xi3).  However, you need to use the prefixes in the correct order:  "xi: svy: logit" works, but "svy: xi: logit" does not.

xi: svy: logit comp_imp1 growth ell mobility i.meals3

i.meals3          _Imeals3_1-3        (naturally coded; _Imeals3_1 omitted)
(running logit on estimation sample)

Survey: Logistic regression

Number of strata   =         2                  Number of obs      =       466
Number of PSUs     =       189                  Population size    = 6032.9042
                                                Design df          =       187
                                                F(   5,    183)    =     14.28
                                                Prob > F           =    0.0000

------------------------------------------------------------------------------
             |             Linearized
   comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      growth |   .1333139   .0177526     7.51   0.000     .0982928     .168335
         ell |  -.0335437   .0134298    -2.50   0.013    -.0600371   -.0070503
    mobility |  -.0528434   .0194839    -2.71   0.007      -.09128   -.0144068
  _Imeals3_2 |  -1.976366   .3789415    -5.22   0.000    -2.723916   -1.228817
  _Imeals3_3 |   -2.54474   .9051281    -2.81   0.005    -4.330314   -.7591659
       _cons |   .5236906   .2685344     1.95   0.053    -.0060555    1.053437
------------------------------------------------------------------------------

test  _Imeals3_2 _Imeals3_3
Adjusted Wald test
 ( 1)  _Imeals3_2 = 0
 ( 2)  _Imeals3_3 = 0
       F(  2,   186) =   15.94
            Prob > F =    0.0000
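The denominator degrees of freedom of 186 in this test follow from the design degrees of freedom.  As a sketch, and as we understand the adjusted Wald test that Stata reports after svy estimation, the calculation is as follows, where W is the ordinary Wald chi-square statistic, k is the number of coefficients being tested, and d is the design degrees of freedom.

```latex
% Design degrees of freedom: number of PSUs minus number of strata
\[
d = n_{\mathrm{PSU}} - n_{\mathrm{strata}} = 189 - 2 = 187
\]

% Adjusted Wald statistic and its reference distribution
\[
F = \frac{d - k + 1}{k\,d}\,W \;\sim\; F(k,\; d - k + 1)
\]

% For the test of the two meals3 indicators, k = 2
\[
F(2,\; 187 - 2 + 1) = F(2,\; 186)
\]
```

The same rule accounts for the model F statistics shown earlier: F(2, 186) for the regression with two predictors, F(3, 185) for the logistic regression with three predictors, and F(5, 183) for the model with five.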

Summary of population values, estimates, standard errors, design effects and estimated population totals for each sampling plan

The table below summarizes the values obtained from the descriptive statistics that we ran under each of the sampling plans, as well as the estimated population size.  It also contains the population values, which, of course, are not estimates, and hence do not have standard errors or design effects associated with them.  (To obtain the design effects, you will need to issue the estat effects command after the analysis command.)  The design effect is the ratio of the estimated variance of a statistic under the current sampling design to the estimated variance of the same statistic under simple random sampling.  In other words, it is an estimate of the efficiency of the current sampling design relative to simple random sampling: values below 1 indicate a more efficient design, and values above 1 a less efficient one.  As you can see, the standard errors and the design effects for the stratified simple random sample are the smallest.  The simple random sample and the systematic sample both show design effects of 1, and the design effects become much larger when cluster sampling is used.  The largest design effects are obtained using one-stage cluster sampling.  Also notice that cluster sampling yields estimates of the population size that are considerably different from those obtained using other types of sampling plans.  You should not assume that this pattern of results will be obtained every time these sampling plans are compared.  Some plans that look relatively inefficient in this example may appear to be more efficient with other samples and/or other data.

Sampling plan                |       mean api00       |      mean growth       |      total yr_rnd      | Estimated
                             | Estimate      SE  Deff | Estimate      SE  Deff | Estimate      SE  Deff | pop. size
-----------------------------+------------------------+------------------------+------------------------+----------
Population values            |   664.71     N/A   N/A |    32.80     N/A   N/A |      874     N/A   N/A |      6194
SRS                          |   663.26    7.21     1 |    33.85    1.67     1 |   719.30  110.03     1 |      6194
Stratified SRS               |   665.62    2.96   .35 |    33.27    1.10   .96 |   789.55   76.60   .95 |      6194
Systematic                   |   656.31    5.66     1 |    33.09    1.23     1 |   779.12   90.45     1 |      6194
One-stage cluster            |   670.52   11.10 15.28 |    32.86    1.44  4.55 |   797.05  176.06 14.97 |      5860
Stratified two-stage cluster |   681.84   10.45  3.90 |    30.72    2.23  3.01 |   718.91  214.92  6.09 |      6033

SE = standard error; Deff = design effect.
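As a minimal sketch of how the design effects in the table can be requested, estat effects is issued immediately after each svy estimation command.  This assumes that the data set for the sampling plan of interest is in memory and that its design has already been declared with svyset.

```stata
* Means, followed by their design effects
svy: mean api00 growth
estat effects

* Total, followed by its design effect
svy: total yr_rnd
estat effects
```

Because estat effects is a postestimation command, it always refers to the most recent estimation results, so it must be run before the next estimation command replaces them.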
