Stata FAQ
Sample setups for commonly used survey data sets

This page shows the survey setups for common public use data sets in various statistical packages, including SUDAAN 9, Stata 9, WesVar 4.2 and SAS 9.  If you are using an earlier version of one of these packages, the code provided below may not work.  Also, please note that for your particular analysis, different pweight and/or replicate weights may be necessary.  For data sets that contain multiple pweights and/or replicate weights, the documentation for the survey will indicate when each set of weights should be used.  Many of the setups below show the use of different weights with the same data set.  Pay special attention to this issue when merging data sets.  This page is in no way intended to be a substitute for reading the documentation for the data set.

If you would like more information on the elements of survey designs, including pweights, PSUs, stratification and replicate weights, please see our page on replicate weights.  For more information on data analysis in Stata, please see our seminar on Survey Data Analysis in Stata.  For more information on using SUDAAN to analyze survey data, please see our seminar Introduction to SUDAAN.

A note about missing data:  Many of the variables in these data sets have special values for missing data, such as 8888 or -9.  In most cases, the statistical package (e.g., Stata, SUDAAN, SAS) will not know that these values should be considered missing, and they will be included as legitimate values in any analysis that is run.  To convert these values to missing, please see our Stata FAQ if you are using Stata, and our SAS learning module on missing data or our SAS FAQ if you are using either SAS or SUDAAN.  If you are using WesVar, convert these values to missing before importing the data into WesVar.  Also note that the different programs handle missing data differently when you use more than one variable in a descriptive command.  For example, in Stata, svy: mean x y will give you different results than if you used two commands, svy: mean x and svy: mean y.  In the former command, listwise deletion is used: a case that is missing on either x or y is dropped from both means.  You can quickly tell whether listwise deletion is being used by checking the number of observations used in the analysis.
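The two points above can be illustrated with a minimal Python sketch.  It is not taken from any of the packages discussed on this page; the toy records, the variable names x, y and w, and the helper functions are all hypothetical.  It recodes the special values 8888 and -9 to missing and then compares a weighted mean computed one variable at a time with the same mean computed after listwise deletion.

```python
# Hypothetical toy data: 8888 and -9 stand for the survey's missing-data codes.
MISSING_CODES = {8888, -9}

rows = [
    {"x": 10,   "y": 1,  "w": 2.0},
    {"x": 20,   "y": -9, "w": 1.0},
    {"x": 8888, "y": 3,  "w": 1.0},
    {"x": 30,   "y": 5,  "w": 4.0},
]

def recode(value):
    """Turn a special missing-data code into None."""
    return None if value in MISSING_CODES else value

def weighted_mean(records, var):
    """Weighted mean of var over records where var is not None.

    Returns the mean and the number of records used.
    """
    used = [r for r in records if r[var] is not None]
    total_w = sum(r["w"] for r in used)
    return sum(r["w"] * r[var] for r in used) / total_w, len(used)

clean = [{"x": recode(r["x"]), "y": recode(r["y"]), "w": r["w"]} for r in rows]

# Listwise deletion: keep only records that are complete on both x and y.
complete = [r for r in clean if r["x"] is not None and r["y"] is not None]

naive_mean_x, _ = weighted_mean(rows, "x")              # 8888 treated as real data
mean_x_separate, n_separate = weighted_mean(clean, "x")     # x analyzed on its own
mean_x_listwise, n_listwise = weighted_mean(complete, "x")  # x analyzed jointly with y

print(naive_mean_x, mean_x_separate, n_separate, mean_x_listwise, n_listwise)
```

In this sketch, the mean of x changes once the special codes are recoded, and it changes again under listwise deletion; the number of records used (3 versus 2) is what reveals the difference.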

A note about non-positive probability weights:  The different programs handle non-positive probability weights differently.  SUDAAN does not count these cases as cases read in and gives a note at the top of the output.  The cases with a non-positive weight are not included in the raw frequency of cases for each category shown in the first part of the output.  The top of the SAS output indicates the total number of cases in the data file, as well as the number of cases with a non-positive probability weight and the number of cases used.  The raw number of cases matches that given by SUDAAN.  Stata can use cases with non-positive (i.e., zero) probability weights, and so the total number of cases read is the total number of cases used.  Hence, the raw number of cases used in each category in the Stata output is different from that shown by SUDAAN or SAS.  However, the percent of weighted cases for each category is the same in all packages.
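The reason the weighted percentages agree while the raw counts differ can be seen in a small Python sketch.  The records and the summarize function below are hypothetical and do not reproduce the output of any package; they only show that a case with a weight of zero adds to the raw count but contributes nothing to the weighted total.

```python
# Hypothetical records: (category, probability weight).
cases = [("a", 2.0), ("a", 0.0), ("b", 1.0), ("b", 1.0), ("b", 0.0)]

def summarize(records):
    """Return raw counts and weighted percentages by category."""
    raw, weighted = {}, {}
    for category, weight in records:
        raw[category] = raw.get(category, 0) + 1
        weighted[category] = weighted.get(category, 0.0) + weight
    total = sum(weighted.values())
    pct = {category: 100.0 * w / total for category, w in weighted.items()}
    return raw, pct

# Zero-weight cases kept (as described above for Stata).
raw_all, pct_all = summarize(cases)
# Zero-weight cases dropped (as described above for SUDAAN and SAS).
raw_pos, pct_pos = summarize([c for c in cases if c[1] > 0])

print(raw_all, raw_pos)   # raw counts differ
print(pct_all, pct_pos)   # weighted percentages agree
```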

This page contains the setups for the following data sets:

2000 US Census
CHIS
NHANES III
NCS

2000 US Census

Census data can be obtained from the Census website.  The documentation can be found here.  Chapter 5 describes the sampling used, and chapter 4 describes the calculations necessary to obtain the correct standard errors (pages 4-3 to 4-15).

The 2000 US Census was released with person and household weights to weight the sample (either the 1% or the 5% PUMS) back to the national totals.  In our examples, we will use the person weights with person-level variables.  The data are clustered within household; every person within a selected household is included in the sample.  For both institutional and non-institutional group quarters, a pseudo household record number was assigned (see pages 2-3 and 3-1 of the documentation).  Although the documentation clearly states that the sample data set was constructed using stratified sampling, the stratification variable was not released with the data set.  Furthermore, some of the variables used in the stratification were also not released, so the stratification variable cannot be reconstructed by the user of the data set.  Hence, in our example setup, we will ignore the stratification.  Please see chapter 4 of the documentation for instructions on how to obtain correct standard errors.  In our examples, we use the 5% PUMS data for California.
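To show what treating the household as the cluster does to a standard error, here is a minimal Python sketch of a linearized (Taylor series) standard error for a weighted mean, with a single stratum and with-replacement sampling of clusters.  This is a textbook formula used as an illustration under our own assumptions; it is not the method given in chapter 4 of the Census documentation, and the household ids, weights and values are hypothetical.

```python
import math

# Hypothetical person records: (household id, person weight, value).
people = [
    ("h1", 20.0, 1.0), ("h1", 20.0, 0.0),
    ("h2", 25.0, 1.0),
    ("h3", 18.0, 0.0), ("h3", 18.0, 0.0), ("h3", 18.0, 1.0),
    ("h4", 22.0, 1.0), ("h4", 22.0, 1.0),
]

def clustered_mean_se(records):
    """Weighted mean and its linearized SE, clusters as PSUs, no strata."""
    total_w = sum(w for _, w, _ in records)
    mean = sum(w * y for _, w, y in records) / total_w
    # Linearized score for each person, summed within cluster.
    totals = {}
    for cluster, w, y in records:
        totals[cluster] = totals.get(cluster, 0.0) + w * (y - mean) / total_w
    n = len(totals)
    zbar = sum(totals.values()) / n
    variance = n / (n - 1) * sum((z - zbar) ** 2 for z in totals.values())
    return mean, math.sqrt(variance)

# Households as clusters, as in the design described above.
mean_hh, se_hh = clustered_mean_se(people)

# For comparison only: each person treated as his or her own cluster.
as_persons = [(i, w, y) for i, (_, w, y) in enumerate(people)]
mean_person, se_person = clustered_mean_se(as_persons)

print(mean_hh, se_hh, se_person)
```

The point estimate is the same either way; only the standard error depends on which variable is declared as the cluster.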

The Jackknife delete-1 method of creating replicate weights should be used with the Census data because there is no stratification variable.  The output below was obtained by using the "super-PSUs" created in the SAS code above and the Jackknife delete-1 method of creating the replicate weights.  You will notice that while the point estimates match those obtained from the other packages, the standard errors do not.  This is because of the "super-PSUs" that were used instead of the PSU implied by the sampling design (serialno).

NOTE:  This is a very computationally intensive procedure.  When creating the replicate weights, WesVar may appear to "hang" or "crash".  Just leave it alone for a while; it may take 10-15 minutes for WesVar to write the file with the replicate weights.  Also, be aware that the resulting .var file may be 2 or more gigabytes.

pweight: pweight
VarUnit:  s
method:  JK1
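For readers unfamiliar with the JK1 method named in the settings above, the following Python sketch shows the usual delete-1 jackknife construction: each replicate sets the weights in one variance unit to zero and multiplies the remaining weights by n/(n-1), where n is the number of variance units, and the variance is (n-1)/n times the sum of squared deviations of the replicate estimates from the full-sample estimate.  This is an illustration of the general technique, not a reproduction of WesVar's internals; the data and function names are hypothetical.

```python
# Hypothetical data: (variance unit, weight, value).
records = [
    (1, 10.0, 4.0), (1, 10.0, 6.0),
    (2, 12.0, 5.0),
    (3, 8.0, 9.0), (3, 8.0, 7.0),
    (4, 11.0, 3.0),
]
units = [u for u, _, _ in records]
weights = [w for _, w, _ in records]
values = [y for _, _, y in records]

def weighted_mean(wts, vals):
    return sum(w * y for w, y in zip(wts, vals)) / sum(wts)

def jk1_replicate_weights(unit_ids, wts):
    """One set of replicate weights per variance unit (delete-1 jackknife)."""
    ids = sorted(set(unit_ids))
    n = len(ids)
    return [
        [0.0 if u == dropped else w * n / (n - 1) for u, w in zip(unit_ids, wts)]
        for dropped in ids
    ]

def jk1_variance(unit_ids, wts, vals):
    """Full-sample weighted mean and its JK1 variance estimate."""
    full = weighted_mean(wts, vals)
    reps = jk1_replicate_weights(unit_ids, wts)
    n = len(reps)
    variance = (n - 1) / n * sum((weighted_mean(rw, vals) - full) ** 2 for rw in reps)
    return full, variance

replicates = jk1_replicate_weights(units, weights)
estimate, variance = jk1_variance(units, weights, values)
print(len(replicates), estimate, variance ** 0.5)
```

Because one set of replicate weights is written for every variance unit, the size of the output file grows with both the number of cases and the number of units.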



CHIS

The data and documentation can be obtained from the CHIS web site.  CHIS (the California Health Interview Survey) was released with a pweight and jackknife replicate weights.  The adjustment value is 1.
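When replicate weights are supplied with the data, as they are here, the variance is computed from the replicate estimates and a multiplier (the "adjustment value").  The Python sketch below shows that calculation with a multiplier of 1.  The four records and four sets of replicate weights are hypothetical and far smaller than the real file; they are not CHIS weights, and the function names are our own.

```python
def weighted_mean(wts, vals):
    return sum(w * y for w, y in zip(wts, vals)) / sum(wts)

def replicate_variance(full_weights, replicate_weights, vals, multiplier=1.0):
    """Variance of a weighted mean from supplied replicate weights.

    variance = multiplier * sum over replicates of (theta_r - theta)^2
    """
    theta = weighted_mean(full_weights, vals)
    squared_devs = sum(
        (weighted_mean(rw, vals) - theta) ** 2 for rw in replicate_weights
    )
    return theta, multiplier * squared_devs

# Hypothetical toy data, not taken from CHIS.
values = [1.0, 2.0, 3.0, 4.0]
pweight = [1.0, 1.0, 1.0, 1.0]
rep_weights = [
    [2.0, 0.0, 1.0, 1.0],
    [0.0, 2.0, 1.0, 1.0],
    [1.0, 1.0, 2.0, 0.0],
    [1.0, 1.0, 0.0, 2.0],
]

theta, variance = replicate_variance(pweight, rep_weights, values, multiplier=1.0)
print(theta, variance, variance ** 0.5)
```

Changing the multiplier rescales the variance directly, which is why the adjustment value must match the way the replicate weights were constructed.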

NHANES III

The data and documentation can be obtained from the NHANES website.

The NHANES III (1988 - 1994) data sets were released with the variables necessary to correct the standard errors of the estimates by either Taylor series linearization or the replicate weight method.  To ensure the privacy of the survey respondents, instead of releasing the actual strata and PSU variables, pseudo-strata and pseudo-PSU variables were released.  These are used in the same way that the "real" variables would be used.  The data sets also contain balanced repeated replication (BRR) replicate weights.  Fay's adjustment is 1.7 or .3, depending on the statistical package that you are using.  Please note that these data sets were released with multiple pweights and multiple sets of replicate weights.  Care must be taken to ensure that the correct weights are being used with each analysis.  The choice of weights depends on the particular data set and variables being analyzed.  In the examples below, we use the adult data set.  Note that before using the pseudo-strata and pseudo-PSU variables, the data set must be sorted by the pseudo-strata and pseudo-PSU.  (For the data set containing only the data from 1999-2000, replicate weights using JK-1 are included with the data, with an adjustment of .980769 ( = 51/52).  In SUDAAN, the statements would be weight wtmec2yr; jackwgts wtmrep01 - wtmrep52 / adjjack = .980769.  See guidelines1.pdf for details.)
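The arithmetic behind the two adjustments mentioned above can be checked with a short Python sketch.  We assume here that the two values 1.7 and .3 are the complementary pair 2 - k and k from Fay's method with k = .3 (replicate weights are multiplied by one or the other, and packages differ in which one they ask for); please confirm this reading against the documentation for your package.  The function names are hypothetical, and the replicate count of 52 passed to the Fay variance multiplier is used only as an example value.

```python
def fay_factors(k):
    """Pair of weight multipliers used by Fay's method: (k, 2 - k)."""
    return k, 2.0 - k

def fay_variance_multiplier(k, n_replicates):
    """Multiplier for the sum of squared deviations under Fay's BRR:
    1 / (R * (1 - k)^2)."""
    return 1.0 / (n_replicates * (1.0 - k) ** 2)

def jk1_multiplier(n_replicates):
    """Delete-1 jackknife multiplier: (R - 1) / R."""
    return (n_replicates - 1) / n_replicates

low, high = fay_factors(0.3)
fay_mult = fay_variance_multiplier(0.3, 52)   # example replicate count (assumption)
adjjack = jk1_multiplier(52)                  # 52 replicate weights, wtmrep01 - wtmrep52

print(low, high)
print(fay_mult)
print(round(adjjack, 6))
```

The last line reproduces the value 51/52 that is supplied to adjjack in the SUDAAN statements above.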


National Comorbidity Survey

The NCS data and documentation can be obtained from the NCS website.  (NOTE:  These examples are taken from the DS2: NCS Diagnosis/Demographic Data.)


These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California