If you would like more information on the elements of survey designs, including pweights, PSUs, stratification and replicate weights, please see our page on replicate weights. For more information on data analysis in Stata, please see our seminar on Survey Data Analysis in Stata. For more information on using SUDAAN to analyze survey data, please see our seminar Introduction to SUDAAN.
A note about missing data: Many of the variables in these data sets use special values for missing data, such as 8888 or -9. In most cases, the statistical package (e.g., Stata, SUDAAN, SAS) will not know that these values should be treated as missing, and they will be included as legitimate values in any analysis that is run. To convert these values to missing, please see our Stata FAQ if you are using Stata, and our SAS learning module on missing data or our SAS FAQ if you are using either SAS or SUDAAN. If you are using WesVar, convert these values to missing before importing the data into WesVar. Also note that the programs handle missing data differently when you use more than one variable in a descriptive command. For example, in Stata, svy: mean x y will give you different results than the two separate commands svy: mean x and svy: mean y. The combined command uses listwise deletion, so a case that is missing on either variable is dropped from both means. You can quickly tell whether listwise deletion is being used by checking the number of observations used in the analysis.
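The effect of sentinel codes and of listwise deletion can be illustrated with a small sketch. This is not taken from any of the packages above; the function names and the toy data are hypothetical, and the codes 8888 and -9 are simply the examples mentioned in the note, so check the codebook of your data set for the real ones.

```python
# Hypothetical sentinel codes; consult the codebook for the actual values.
MISSING_CODES = {8888, -9}

def recode_missing(values, codes=MISSING_CODES):
    """Replace sentinel codes with None so they drop out of analyses."""
    return [None if v in codes else v for v in values]

def mean(values):
    """Unweighted mean of the non-missing values."""
    kept = [v for v in values if v is not None]
    return sum(kept) / len(kept)

def listwise(*columns):
    """Keep only the rows in which every column is non-missing."""
    rows = [r for r in zip(*columns) if all(v is not None for v in r)]
    return [list(col) for col in zip(*rows)]

# Toy data: one sentinel in each variable, on different rows.
x = recode_missing([1, 2, 8888, 4])
y = recode_missing([10, -9, 30, 40])

# Two separate commands: each mean uses every non-missing value.
separate = (mean(x), mean(y))

# One command on both variables: rows missing on either one are dropped.
x_lw, y_lw = listwise(x, y)
joint = (mean(x_lw), mean(y_lw))
```

In this toy example the separate means use three cases each, while the listwise version uses only two, which is the kind of drop in the number of observations that signals listwise deletion.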
A note about non-positive probability weights: The different programs handle non-positive probability weights differently. SUDAAN does not count these cases as cases read in and gives a note at the top of the output. The cases with a non-positive weight are not included in the raw frequency of cases for each category shown in the first part of the output. The top of the SAS output indicates the total number of cases in the data file, as well as the number of cases with a non-positive probability weight and the number of cases used. The raw number of cases matches that given by SUDAAN. Stata can use cases with non-positive (i.e., zero) probability weights, and so the total number of cases read is the total number of cases used. Hence the number of raw cases used in each category in the Stata output is different from that shown by SUDAAN or SAS. However, the percent of weighted cases for each category is the same in all packages.
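A minimal sketch of why the raw counts differ while the weighted percentages agree. The function name and the toy data are hypothetical, and the sketch assumes the only non-positive weights are zeros, as in the note above.

```python
def weighted_table(categories, weights, keep_zero_weights):
    """Return (raw counts, weighted percents) per category.

    keep_zero_weights=False mimics dropping zero-weight cases before
    counting; keep_zero_weights=True mimics counting them as used cases.
    """
    counts, wsums = {}, {}
    for cat, w in zip(categories, weights):
        if w <= 0 and not keep_zero_weights:
            continue  # case is skipped entirely
        counts[cat] = counts.get(cat, 0) + 1
        wsums[cat] = wsums.get(cat, 0.0) + w
    total = sum(wsums.values())
    percents = {cat: 100.0 * s / total for cat, s in wsums.items()}
    return counts, percents

# Toy data: two of the five cases carry a zero weight.
cats = ["a", "a", "b", "b", "b"]
wts = [2.0, 0.0, 1.0, 3.0, 0.0]

counts_drop, pct_drop = weighted_table(cats, wts, keep_zero_weights=False)
counts_keep, pct_keep = weighted_table(cats, wts, keep_zero_weights=True)
```

A zero-weight case adds one to the raw count but nothing to the weighted sum, so only the raw frequencies change between the two calls.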
This page contains the setups for the following data sets:
2000 US Census
CHIS
NHANES III
NCS
The 2000 US Census was released with person and household weights to weight the sample (either the 1% or the 5% PUMS) back to the national totals. In our examples, we will use the person weights with person level variables. The data are clustered within household; every person within a selected household is included in the sample. For both institutional and non-institutional group quarters, a pseudo household record number was assigned (see pages 2-3 and 3-1 of the documentation). Although it is clearly stated in the documentation that the sample data set was constructed using stratified sampling, the stratification variable was not released with the data set. Furthermore, some of the variables used in the stratification were also not released, so that the stratification variable cannot be reconstructed by the user of the data set. Hence, in our example setup, we will ignore the stratification. Please see chapter 4 of the documentation for instructions on how to obtain correct standard errors. In our examples, we use the 5% PUMS data for California.
proc sort data = census2000; by serialno; run;
proc crosstab data = census2000 filetype = sas design = wr;
  nest _one_ serialno;
  weight pweight;
  class carpool;
  setenv colwidth = 14;
  print nsum wsum sewgt rowper serow;
run;
Number of observations read    : 1690362    Weighted count : 33884660
Number of observations skipped :     280
(WEIGHT variable nonpositive)
Denominator degrees of freedom :  616109

Frequencies and Values for CLASS Variables
by: vehicle occupancy.
--------------------------------------
vehicle occupancy    Frequency   Value
--------------------------------------
Ordered Position: 1    1073728   0
Ordered Position: 2     511854   1
Ordered Position: 3      77629   2
Ordered Position: 4      16250   3
Ordered Position: 5       5941   4
Ordered Position: 6       2851   5
Ordered Position: 7       2109   6
--------------------------------------

Variance Estimation Method: Taylor Series (WR)
by: vehicle occupancy.
----------------------------------------------------
|                | vehicle occupancy               |
|                | Total          | 0              |
----------------------------------------------------
| Sample Size    |        1690362 |        1073728 |
| Weighted Size  |    33884660.00 |    21347148.00 |
| SE Weighted    |       35600.88 |       29827.08 |
| Row Percent    |         100.00 |          63.00 |
| SE Row Percent |           0.00 |           0.05 |
----------------------------------------------------
----------------------------------------------------
|                | vehicle occupancy               |
|                | 1              | 2              |
----------------------------------------------------
| Sample Size    |         511854 |          77629 |
| Weighted Size  |    10418251.00 |     1572572.00 |
| SE Weighted    |       15946.86 |        7244.40 |
| Row Percent    |          30.75 |           4.64 |
| SE Row Percent |           0.04 |           0.02 |
----------------------------------------------------
----------------------------------------------------
|                | vehicle occupancy               |
|                | 3              | 4              |
----------------------------------------------------
| Sample Size    |          16250 |           5941 |
| Weighted Size  |      327968.00 |      120283.00 |
| SE Weighted    |        3364.46 |        2240.21 |
| Row Percent    |           0.97 |           0.35 |
| SE Row Percent |           0.01 |           0.01 |
----------------------------------------------------
----------------------------------------------------
|                | vehicle occupancy               |
|                | 5              | 6              |
----------------------------------------------------
| Sample Size    |           2851 |           2109 |
| Weighted Size  |       56246.00 |       42192.00 |
| SE Weighted    |        1504.68 |        1282.72 |
| Row Percent    |           0.17 |           0.12 |
| SE Row Percent |           0.00 |           0.00 |
----------------------------------------------------
svyset serialno [pweight = pweight]
svy: tab carpool, count se
(running tabulate on estimation sample)
Number of strata = 1 Number of obs = 1690642
Number of PSUs = 616115 Population size = 33884660
Design df = 616114
----------------------------------
vehicle |
occupancy | count se
----------+-----------------------
not in u | 2.1e+07 3.0e+04
drove al | 1.0e+07 1.6e+04
2 people | 1.6e+06 7244
3 people | 3.3e+05 3364
4 people | 1.2e+05 2240
5 or 6 p | 5.6e+04 1505
7 or mor | 4.2e+04 1283
|
Total | 3.4e+07
----------------------------------
Key: count = weighted counts
se = linearized standard errors of weighted counts
ereturn list <output omitted to save space>
mat list e(b)
e(b)[1,7]
p11 p21 p31 p41 p51 p61 p71
y1 21347148 10418251 1572572 327968 120283 56246 42192
mat list e(V)
symmetric e(V)[7,7]
p11 p21 p31 p41 p51 p61 p71
p11 8.897e+08
p21 -549708.11 2.543e+08
p31 16052242 -4188780.9 52481384
p41 7773995.8 -1033481.5 689996.56 11319608
p51 4045734.1 -522758.86 272388.1 274547.86 5018560.5
p61 1948697.2 -250295.11 94854.857 94196.495 116127.39 2264058.9
p71 615052.89 -209080.69 30285.183 21298.549 35392.991 61222.333 1645365.3
di (8.897e+08)^.5
29827.839
mat list e(Obs)
e(Obs)[7,1]
c1
r1 1073916
r2 511926
r3 77640
r4 16254
r5 5944
r6 2852
r7 2110
The Jackknife delete-1 method of creating replicate weights should be used with the Census data because there is no stratification variable. The output below was obtained by using the "super-PSUs" created in the SAS code above and the Jackknife delete-1 method of creating the replicate weights. You will notice that while the point estimates match those obtained from the other packages, the standard errors do not. This is because of the "super-PSUs" that were used instead of the PSU implied by the sampling design (serialno).
NOTE: This is a very computationally intensive procedure. When creating the replicate weights, WesVar may appear to "hang" or "crash". Just leave it alone for a while; it may take 10-15 minutes for WesVar to write the file with the replicate weights. Also, be aware that the resulting .var file may be 2 gigabytes or more.
pweight: pweight
VarUnit: s
method: JK1
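To make the jackknife delete-1 (JK1) idea concrete, here is a minimal sketch of how JK1 replicate weights and the resulting variance of a weighted total can be computed. It is not WesVar's implementation: the function names are hypothetical, the `groups` argument plays the role of the variance unit (VarUnit), and the sketch assumes the usual JK1 formula in which each replicate drops one group, scales the remaining weights by G/(G-1), and the squared deviations are multiplied by (G-1)/G.

```python
def jk1_replicate_weights(weights, groups):
    """One replicate per group: zero out that group, upweight the rest."""
    labels = sorted(set(groups))
    g = len(labels)
    factor = g / (g - 1)
    return [
        [0.0 if grp == label else w * factor
         for w, grp in zip(weights, groups)]
        for label in labels
    ]

def weighted_total(y, weights):
    """Weighted total of y."""
    return sum(v * w for v, w in zip(y, weights))

def jk1_variance(y, weights, groups):
    """JK1 variance estimate of the weighted total of y."""
    replicates = jk1_replicate_weights(weights, groups)
    g = len(replicates)
    full = weighted_total(y, weights)
    return (g - 1) / g * sum(
        (weighted_total(y, rep) - full) ** 2 for rep in replicates
    )

# Toy data: four cases, each its own variance unit.
y = [1, 2, 3, 4]
w = [1.0, 1.0, 1.0, 1.0]
units = [1, 2, 3, 4]
var_total = jk1_variance(y, w, units)
```

Grouping many households into a single variance unit, as with the "super-PSUs", reduces the number of replicates but changes the deviations that enter the sum, which is why the standard errors differ from those based on serialno.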
proc surveyfreq data = census2000;
  weight pweight;
  cluster serialno;
  tables carpool;
run;
The SURVEYFREQ Procedure
Data Summary
Number of Clusters 616115
Number of Observations 1690642
Number of Observations Used 1690362
Number of Obs with Nonpositive Weights 280
Sum of Weights 33884660
vehicle occupancy
Weighted Std Dev of Std Err of
CARPOOL Frequency Frequency Wgt Freq Percent Percent
--------------------------------------------------------------------------
0 1073728 21347148 29827 62.9994 0.0452
1 511854 10418251 15947 30.7462 0.0440
2 77629 1572572 7244 4.6410 0.0207
3 16250 327968 3364 0.9679 0.0098
4 5941 120283 2240 0.3550 0.0066
5 2851 56246 1505 0.1660 0.0044
6 2109 42192 1283 0.1245 0.0038
Total 1690362 33884660 35601 100.000
--------------------------------------------------------------------------
proc crosstab data=CHIS2001_PUFA2_082802 filetype=sas design=jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80/adjjack=1;
  tables srsex RACEHPRA srsex*racehpra;
  subgroup srsex RACEHPRA;
  levels 2 7;
  rformat srsex srsexf.;
  rformat racehpra rachpaf.;
  rtitle " " "SUDAAN using replicate weights for JACKKNIFE with adjusted factor" "FOR row percentage";
run;
Variance Estimation Method: Replicate Weight Jackknife
SUDAAN using replicate weights for JACKKNIFE with adjusted factor
FOR row percentage
by: SRSEX.
-----------------------------------------------------------------
|                        | SRSEX                                |
|                        | Total      | MALE       | FEMALE     |
-----------------------------------------------------------------
| Sample Size            |      55428 |      23002 |      32426 |
| Weighted Size          | ********** | ********** | ********** |
| SE Weighted            |     587.03 |     515.26 |     291.17 |
| Row Percent            |     100.00 |      48.78 |      51.22 |
| SE Row Percent         |       0.00 |       0.00 |       0.00 |
| Lower 95% Limit ROWPER |          . |      48.77 |      51.22 |
| Upper 95% Limit ROWPER |          . |      48.78 |      51.23 |
| Col Percent            |     100.00 |      48.78 |      51.22 |
| SE Col Percent         |       0.00 |       0.00 |       0.00 |
| Lower 95% Limit COLPER |          . |      48.77 |      51.22 |
| Upper 95% Limit COLPER |          . |      48.78 |      51.23 |
| Tot Percent            |     100.00 |      48.78 |      51.22 |
| SE Tot Percent         |       0.00 |       0.00 |       0.00 |
| Lower 95% Limit TOTPER |          . |      48.77 |      51.22 |
| Upper 95% Limit TOTPER |          . |      48.78 |      51.23 |
-----------------------------------------------------------------
< the rest of the output has been omitted to save space >
svyset [pw = rakedw0], jkrw(rakedw1 - rakedw80, multiplier(1)) vce(jack) mse
pweight: rakedw0
VCE: jackknife
MSE: on
jkrweight: rakedw1 rakedw2 rakedw3 rakedw4 rakedw5 rakedw6 rakedw7 rakedw8 rakedw9 rakedw10 rakedw11 rakedw12 rakedw13
rakedw14 rakedw15 rakedw16 rakedw17 rakedw18 rakedw19 rakedw20 rakedw21 rakedw22 rakedw23 rakedw24 rakedw25
rakedw26 rakedw27 rakedw28 rakedw29 rakedw30 rakedw31 rakedw32 rakedw33 rakedw34 rakedw35 rakedw36 rakedw37
rakedw38 rakedw39 rakedw40 rakedw41 rakedw42 rakedw43 rakedw44 rakedw45 rakedw46 rakedw47 rakedw48 rakedw49
rakedw50 rakedw51 rakedw52 rakedw53 rakedw54 rakedw55 rakedw56 rakedw57 rakedw58 rakedw59 rakedw60 rakedw61
rakedw62 rakedw63 rakedw64 rakedw65 rakedw66 rakedw67 rakedw68 rakedw69 rakedw70 rakedw71 rakedw72 rakedw73
rakedw74 rakedw75 rakedw76 rakedw77 rakedw78 rakedw79 rakedw80
Strata 1: <one>
SU 1: <observations>
FPC 1: <zero>
svy: tab srsex
(running tabulate on estimation sample)
Number of strata = 1 Number of obs = 55428
Population size = 23847415
Replications = 80
Design df = 79
-----------------------
srsex | proportions
----------+------------
1 | .4878
2 | .5122
|
Total | 1
-----------------------
Key: proportions = cell proportions
svy: tab srsex, count se
(running tabulate on estimation sample)
Jackknife: for cell counts
Number of strata = 1 Number of obs = 55428
Population size = 23847415
Replications = 80
Design df = 79
----------------------------------
srsex | count se
----------+-----------------------
1 | 1.2e+07 515.3
2 | 1.2e+07 291.4
|
Total | 2.4e+07
----------------------------------
Key: count = weighted counts
se = jackknife standard errors of weighted counts

The data and documentation can be obtained from the NHANES website.
The NHANES III (1988 - 1994) data sets were released with the variables necessary to correct the standard errors of the estimates by either Taylor series linearization or the replicate weight method. To ensure the privacy of the survey respondents, pseudo-strata and pseudo-PSU variables were released instead of the actual strata and PSU variables. These are used in the same way that the "real" variables would be used. The data sets also contain balanced repeated replication (BRR) weights. The Fay's adjustment is 1.7 or .3, depending on the statistical package that you are using. Please note that these data sets were released with multiple pweights and multiple sets of replicate weights. Care must be taken to ensure that the correct weights are being used with each analysis. The choice of weights depends on the particular data set and variables being analyzed. In the examples below, we use the adult data set. Note that before using the pseudo-strata and pseudo-PSU variables, the data set must be sorted by the pseudo-strata and pseudo-PSU. (For the data set containing only the data from 1999-2000, replicate weights using JK-1 are included with the data, with an adjustment of .980769 ( = 51/52). In SUDAAN, the statements would be weight wtmec2yr; jackwgts wtmrep01 - wtmrep52 / adjjack = .980769. See guidelines1.pdf for details.)
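The two adjustment factors mentioned here are simple arithmetic, sketched below. The conversion Fay = 1 - 1/sqrt(adjfay) is the one given in the Stata notes on this page, and 51/52 is the JK-1 adjustment quoted above; the function names are hypothetical and only illustrate the calculation.

```python
import math

def stata_fay_from_adjfay(adjfay):
    """Convert the documented Fay adjustment (e.g., 1.7) to the value
    expected by Stata's fay() option, using Fay = 1 - 1/sqrt(adjfay)."""
    return 1.0 - 1.0 / math.sqrt(adjfay)

def jk1_adjustment(n_replicates):
    """JK-1 adjustment factor (R - 1) / R for R replicate weights."""
    return (n_replicates - 1) / n_replicates

fay = stata_fay_from_adjfay(1.7)   # approximately .23303501
adj = jk1_adjustment(52)           # 51/52, approximately .980769
```

The first value is the one passed to fay() in the svyset command, and the second is the value supplied to adjjack in the SUDAAN statements for the 1999-2000 data.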
* with brr replicate weights;
proc descript data = adult1 filetype = sas design=brr;
  repwgt WTPQRP1 - WTPQRP52 / adjfay = 1.7;
  weight WTPFQX6;
  var HAZNOK5R;
  setenv colwidth = 19;
  setenv decwidth = 7;
  print nsum wsum mean semean / nohead;
run;
Number of observations read    :     20050    Weighted count : 187647206
Denominator degrees of freedom :        52

Variance Estimation Method: BRR
by: Variable, One.
------------------------------------------------
| Variable |               | One               |
|          |               | 1                 |
------------------------------------------------
| HAZNOK5R | Sample Size   |     20014.0000000 |
|          | Weighted Size | 187378050.8999992 |
|          | Mean          |         6.8511168 |
|          | SE Mean       |         0.1024656 |
------------------------------------------------
* with pseudo-strata and pseudo-PSUs;
proc sort data = adult1; by sdpstra6 sdppsu6; run;
proc descript data = adult1 filetype = sas design = wr;
  nest sdpstra6 sdppsu6 / missunit;
  weight WTPFQX6;
  var HAZNOK5R;
  setenv colwidth = 19;
  setenv decwidth = 7;
  print nsum wsum mean semean / nohead;
run;
Number of observations read    :     20050    Weighted count : 187647206
Denominator degrees of freedom :        49

Variance Estimation Method: Taylor Series (WR)
by: Variable, One.
------------------------------------------------
| Variable |               | One               |
|          |               | 1                 |
------------------------------------------------
| HAZNOK5R | Sample Size   |     20014.0000000 |
|          | Weighted Size | 187378050.8999992 |
|          | Mean          |         6.8511168 |
|          | SE Mean       |         0.1237399 |
------------------------------------------------
* with replicate weights
* NOTE: You need to use the formula Fay=1-1/sqrt(adjfay) to convert the value of Fay's adjustment
given in the documentation to the form that Stata wants. You need to use the -vce(brr)- and -mse- options
to obtain the standard errors given by SUDAAN.
display 1-(1/sqrt(1.7))
.23303501
svyset [pweight = wtpfqx6], brrweight(wtpqrp1 - wtpqrp52) fay(.23303501) vce(brr) mse
pweight: wtpfqx6
VCE: brr
MSE: on
brrweight: wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7 wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14
wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21 wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27
wtpqrp28 wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40
wtpqrp41 wtpqrp42 wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49 wtpqrp50 wtpqrp51 wtpqrp52
fay: .23303501
Strata 1: <one>
SU 1: <observations>
FPC 1: <zero>
svy: mean haznok5r
(running mean on estimation sample)
BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.................................................. 50
..
Survey: Mean estimation Number of obs = 20014
Population size = 1.9e+08
Replications = 52
Design df = 51
--------------------------------------------------------------
| BRR *
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
haznok5r | 6.851117 .1024657 6.645408 7.056825
--------------------------------------------------------------
* with pseudo-strata and pseudo-PSUs;
svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6)
pweight: wtpfqx6
VCE: linearized
Strata 1: sdpstra6
SU 1: sdppsu6
FPC 1: <zero>
svy : mean haznok5r
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 49 Number of obs = 20014
Number of PSUs = 98 Population size = 1.9e+08
Design df = 49
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
haznok5r | 6.851117 .1237399 6.602452 7.099781
--------------------------------------------------------------

proc sort data = adult1; by sdpstra6 sdppsu6; run;
proc surveymeans data = adult1;
  weight wtpfqx6;
  strata sdpstra6;
  cluster sdppsu6;
  var HAZNOK5R;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 49
Number of Clusters 98
Number of Observations 20050
Sum of Weights 187647206
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
HAZNOK5R 20014 6.851117 0.123740 6.60245228 7.09978141
---------------------------------------------------------------------------------
The NCS data and documentation can be obtained from the NCS website. (NOTE: These examples are taken from the DS2: NCS Diagnosis/Demographic Data)
proc sort data = ncs2; by str secu; run;
proc descript data = ncs2 filetype = sas design = wr;
  weight p1fwt;
  nest str secu;
  var deplt1 gadlt1;
  setenv colwidth = 12;
  setenv decwidth = 6;
  print nsum wsum mean semean lowmean upmean;
run;
Number of observations read    :   8098    Weighted count : 8098
Denominator degrees of freedom :     42

Variance Estimation Method: Taylor Series (WR)
by: Variable, One.
--------------------------------------------------
| Variable |                      | One          |
|          |                      | 1            |
--------------------------------------------------
| DEPLT1   | Sample Size          |  8098.000000 |
|          | Weighted Size        |  8097.996600 |
|          | Mean                 |     0.170652 |
|          | SE Mean              |     0.006726 |
|          | Lower 95% Limit Mean |     0.157078 |
|          | Upper 95% Limit Mean |     0.184227 |
--------------------------------------------------
| GADLT1   | Sample Size          |  8098.000000 |
|          | Weighted Size        |  8097.996600 |
|          | Mean                 |     0.051491 |
|          | SE Mean              |     0.003194 |
|          | Lower 95% Limit Mean |     0.045046 |
|          | Upper 95% Limit Mean |     0.057937 |
--------------------------------------------------
An example using a different pweight. Please consult the documentation for guidance on the correct use of a given pweight.
proc descript data = ncs2 filetype = sas design = wr;
  nest str secu;
  weight p2wtv3;
  var aablt;
  setenv colwidth = 12;
  setenv decwidth = 6;
  print nsum wsum mean semean lowmean upmean;
run;
Number of observations read    :   5877    Weighted count : 5877
Number of observations skipped :   2221
(WEIGHT variable nonpositive)
Denominator degrees of freedom :     42

Variance Estimation Method: Taylor Series (WR)
by: Variable, One.
--------------------------------------------------
| Variable |                      | One          |
|          |                      | 1            |
--------------------------------------------------
| AABLT    | Sample Size          |  5877.000000 |
|          | Weighted Size        |  5877.001100 |
|          | Mean                 |     0.054119 |
|          | SE Mean              |     0.002884 |
|          | Lower 95% Limit Mean |     0.048300 |
|          | Upper 95% Limit Mean |     0.059939 |
--------------------------------------------------
svyset secu [pweight = p1fwt], strata(str)
pweight: p1fwt
VCE: linearized
Strata 1: str
SU 1: secu
FPC 1: <zero>
svy: mean deplt1
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 42 Number of obs = 8098
Number of PSUs = 84 Population size = 8098
Design df = 42
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
deplt1 | .1706523 .0067263 .1570781 .1842266
--------------------------------------------------------------
svy: mean gadlt1
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 42 Number of obs = 8098
Number of PSUs = 84 Population size = 8098
Design df = 42
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
gadlt1 | .0514915 .003194 .0450456 .0579373
--------------------------------------------------------------
An example using a different pweight. Please consult the documentation for guidance on the correct use of a given pweight.
svyset, clear
no survey characteristics are set
svyset secu [pweight = p2wtv3], strata(str)
pweight: p2wtv3
VCE: linearized
Strata 1: str
SU 1: secu
FPC 1: <zero>
svy: mean aablt
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 42 Number of obs = 5877
Number of PSUs = 84 Population size = 5877
Design df = 42
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
aablt | .0541191 .0028836 .0482997 .0599385
--------------------------------------------------------------
proc surveymeans data = ncs2;
  strata str;
  cluster secu;
  weight p1fwt;
  var deplt1 gadlt1;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 42
Number of Clusters 84
Number of Observations 8098
Sum of Weights 8097.9966
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
DEPLT1 8098 0.170652 0.006726 0.15707807 0.18422659
GADLT1 8098 0.051491 0.003194 0.04504561 0.05793729
---------------------------------------------------------------------------------
An example using a different pweight. Please consult the documentation for guidance on the correct use of a given pweight.
proc surveymeans data = ncs2;
  weight p2wtv3;
  strata str;
  cluster secu;
  var aablt;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Strata 42
Number of Clusters 84
Number of Observations 8098
Number of Observations Used 5877
Number of Obs with Nonpositive Weights 2221
Sum of Weights 5877.0011
Statistics
Std Error
Variable N Mean of Mean 95% CL for Mean
---------------------------------------------------------------------------------
AABLT 5877 0.054119 0.002884 0.04829968 0.05993852
---------------------------------------------------------------------------------
These pages are Copyrighted (c) by UCLA Academic Technology Services