Stata takes these characteristics into account through the use of survey procedures. Before issuing any survey commands, it is necessary to use svyset to set one or more of the following items: the sampling weights (pweight), the strata, the primary sampling units (PSU), and the finite population correction (fpc).
Analyzing survey sampling designs without taking these characteristics into account can result in inaccurate point estimates and/or inaccurate estimates of standard errors.
In this unit we will be using data from the book Sampling of Populations by Levy and Lemeshow (1999) with permission of the authors.
Of course, in the normal course of events you wouldn't actually have access to data from the whole population. We were lucky in this instance that California collects and releases these data.
Let's try several computations on the population data.
use http://www.ats.ucla.edu/stat/stata/library/apipop, clear
tabulate stype
stype | Freq. Percent Cum.
------------+-----------------------------------
E | 4421 71.38 71.38
H | 755 12.19 83.56
M | 1018 16.44 100.00
------------+-----------------------------------
Total | 6194 100.00
summarize api00
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
api00 | 6194 664.7126 128.2441 346 969
quietly summarize enroll
display %10.0fc r(sum)
3,811,472
regress api00 meals ell avg_ed
Source | SS df MS Number of obs = 6016
---------+------------------------------ F( 3, 6012) = 5837.12
Model | 73775065.7 3 24591688.6 Prob > F = 0.0000
Residual | 25328472.8 6012 4212.98616 R-squared = 0.7444
---------+------------------------------ Adj R-squared = 0.7443
Total | 99103538.5 6015 16476.0662 Root MSE = 64.908
------------------------------------------------------------------------------
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+--------------------------------------------------------------------
meals | -1.672069 .0568866 -29.393 0.000 -1.783587 -1.560551
ell | -.6775632 .0616073 -10.998 0.000 -.7983355 -.5567908
avg_ed | 72.30502 2.09055 34.587 0.000 68.20679 76.40325
_cons | 558.443 7.969069 70.076 0.000 542.8207 574.0652
------------------------------------------------------------------------------
In this example, the sampling frame contains the 6,194 schools, so fpc = 6194 and the sampling weights (pw) = 6194/200 = 30.97.
generate i = uniform()
sort i
keep in 1/200
Of course, in the real world you probably wouldn't take a sample of 200 schools from a computer file of 6,194; you would just analyze the entire dataset. But suppose you had to go out to each school to collect the data that you needed. It would then take much less time and cost much less money to go to 200 schools than to over 6,000 schools.
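The sort-and-keep approach above can also be written with Stata's sample command. The following is only a sketch of that alternative, not the procedure used to build apisrs.dta. It assumes the population file is already in memory, and the seed value is an arbitrary choice added here so that the draw can be repeated.

```stata
* Assumes the population data (apipop) are already in memory.
set seed 12345          // arbitrary seed (an assumption), makes the draw repeatable
sample 200, count       // keep a simple random sample of exactly 200 observations
count                   // confirm how many observations remain
```

Without the count option, sample interprets the number as a percentage of observations rather than a number of cases.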
The file apisrs.dta has a simple random sample of 200 cases.
use http://www.ats.ucla.edu/stat/stata/library/apisrs, clear
tabulate stype
stype | Freq. Percent Cum.
------------+-----------------------------------
E | 145 72.50 72.50
H | 25 12.50 85.00
M | 30 15.00 100.00
------------+-----------------------------------
Total | 200 100.00
tabulate dnum
district |
number | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 0.50 0.50
40 | 1 0.50 1.00
41 | 1 0.50 1.50
43 | 1 0.50 2.00
46 | 3 1.50 3.50
48 | 1 0.50 4.00
55 | 1 0.50 4.50
56 | 2 1.00 5.50
57 | 1 0.50 6.00
60 | 1 0.50 6.50
67 | 1 0.50 7.00
80 | 1 0.50 7.50
90 | 2 1.00 8.50
98 | 1 0.50 9.00
103 | 1 0.50 9.50
105 | 1 0.50 10.00
108 | 2 1.00 11.00
124 | 1 0.50 11.50
131 | 1 0.50 12.00
135 | 2 1.00 13.00
148 | 2 1.00 14.00
154 | 1 0.50 14.50
159 | 1 0.50 15.00
162 | 1 0.50 15.50
166 | 3 1.50 17.00
175 | 1 0.50 17.50
176 | 1 0.50 18.00
184 | 1 0.50 18.50
190 | 1 0.50 19.00
209 | 1 0.50 19.50
217 | 1 0.50 20.00
222 | 1 0.50 20.50
229 | 1 0.50 21.00
231 | 1 0.50 21.50
238 | 1 0.50 22.00
248 | 2 1.00 23.00
253 | 3 1.50 24.50
255 | 1 0.50 25.00
259 | 1 0.50 25.50
266 | 1 0.50 26.00
272 | 1 0.50 26.50
274 | 1 0.50 27.00
278 | 2 1.00 28.00
293 | 1 0.50 28.50
301 | 1 0.50 29.00
304 | 1 0.50 29.50
335 | 1 0.50 30.00
351 | 1 0.50 30.50
352 | 1 0.50 31.00
353 | 1 0.50 31.50
358 | 1 0.50 32.00
360 | 1 0.50 32.50
379 | 1 0.50 33.00
390 | 1 0.50 33.50
393 | 1 0.50 34.00
395 | 2 1.00 35.00
401 | 18 9.00 44.00
416 | 1 0.50 44.50
418 | 2 1.00 45.50
436 | 1 0.50 46.00
444 | 1 0.50 46.50
445 | 1 0.50 47.00
451 | 1 0.50 47.50
457 | 2 1.00 48.50
459 | 1 0.50 49.00
460 | 1 0.50 49.50
470 | 1 0.50 50.00
473 | 1 0.50 50.50
479 | 1 0.50 51.00
491 | 1 0.50 51.50
495 | 1 0.50 52.00
498 | 1 0.50 52.50
503 | 2 1.00 53.50
507 | 5 2.50 56.00
509 | 1 0.50 56.50
513 | 2 1.00 57.50
529 | 2 1.00 58.50
532 | 1 0.50 59.00
533 | 1 0.50 59.50
536 | 1 0.50 60.00
537 | 2 1.00 61.00
539 | 3 1.50 62.50
541 | 1 0.50 63.00
542 | 1 0.50 63.50
547 | 1 0.50 64.00
556 | 2 1.00 65.00
564 | 1 0.50 65.50
570 | 1 0.50 66.00
579 | 1 0.50 66.50
590 | 1 0.50 67.00
600 | 1 0.50 67.50
602 | 1 0.50 68.00
605 | 1 0.50 68.50
614 | 2 1.00 69.50
620 | 3 1.50 71.00
623 | 1 0.50 71.50
627 | 3 1.50 73.00
629 | 1 0.50 73.50
630 | 2 1.00 74.50
632 | 5 2.50 77.00
633 | 1 0.50 77.50
635 | 1 0.50 78.00
636 | 2 1.00 79.00
637 | 1 0.50 79.50
640 | 1 0.50 80.00
642 | 1 0.50 80.50
643 | 1 0.50 81.00
644 | 1 0.50 81.50
645 | 1 0.50 82.00
648 | 1 0.50 82.50
651 | 1 0.50 83.00
653 | 1 0.50 83.50
658 | 1 0.50 84.00
665 | 1 0.50 84.50
688 | 1 0.50 85.00
689 | 1 0.50 85.50
702 | 1 0.50 86.00
711 | 1 0.50 86.50
716 | 1 0.50 87.00
720 | 1 0.50 87.50
731 | 1 0.50 88.00
739 | 1 0.50 88.50
744 | 3 1.50 90.00
745 | 1 0.50 90.50
750 | 1 0.50 91.00
751 | 1 0.50 91.50
754 | 1 0.50 92.00
756 | 1 0.50 92.50
761 | 1 0.50 93.00
779 | 2 1.00 94.00
780 | 1 0.50 94.50
782 | 1 0.50 95.00
788 | 1 0.50 95.50
796 | 4 2.00 97.50
797 | 1 0.50 98.00
803 | 1 0.50 98.50
815 | 1 0.50 99.00
830 | 1 0.50 99.50
834 | 1 0.50 100.00
------------+-----------------------------------
Total | 200 100.00
svyset
pweight: pw
VCE: linearized
Strata 1: <one>
SU 1: <observations>
FPC 1: fpc
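The settings displayed above could be declared with a command along the following lines. This is a sketch that assumes apisrs.dta is in memory; the two generate lines are only needed if pw and fpc are not already stored in the file, which is why they are wrapped in capture.

```stata
* Simple random sample: no strata, each observation is its own sampling unit.
capture generate double pw  = 6194/200   // sampling weight N/n, about 30.97
capture generate double fpc = 6194       // number of schools in the sampling frame
svyset _n [pweight=pw], fpc(fpc)         // _n is displayed as <observations>
```

Typing svyset with no arguments afterwards redisplays the current settings.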
svy: mean api00
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 1 Number of obs = 200
Number of PSUs = 200 Population size = 6194
Design df = 199
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
api00 | 660.165 9.186887 642.0489 678.2811
--------------------------------------------------------------
svy: total enroll
(running total on estimation sample)
Survey: Total estimation
Number of strata = 1 Number of obs = 200
Number of PSUs = 200 Population size = 6194
Design df = 199
--------------------------------------------------------------
| Linearized
| Total Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
enroll | 3924828 220705.4 3489607 4360049
--------------------------------------------------------------
svy: regress api00 meals ell avg_ed
(running regress on estimation sample)
Survey: Linear regression
Number of strata = 1 Number of obs = 200
Number of PSUs = 200 Population size = 6193.9999
Design df = 199
F( 3, 197) = 217.11
Prob > F = 0.0000
R-squared = 0.7640
------------------------------------------------------------------------------
| Linearized
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.367668 .3544273 -3.86 0.000 -2.066583 -.6687524
ell | -1.266818 .3895673 -3.25 0.001 -2.035028 -.4986079
avg_ed | 75.49145 14.28649 5.28 0.000 47.31912 103.6638
_cons | 544.7082 56.15402 9.70 0.000 433.9749 655.4414
------------------------------------------------------------------------------
In this example, there are three sampling frames: 4,421 elementary schools, 755 high schools, and 1,018 middle schools.
The file apistrat.dta contains the data for the stratified random sample.
use http://www.ats.ucla.edu/stat/stata/library/apistrat, clear
tabulate stype
stype | Freq. Percent Cum.
------------+-----------------------------------
E | 100 50.00 50.00
H | 50 25.00 75.00
M | 50 25.00 100.00
------------+-----------------------------------
Total | 200 100.00
tabulate dnum
district |
number | Freq. Percent Cum.
------------+-----------------------------------
19 | 1 0.50 0.50
20 | 1 0.50 1.00
25 | 1 0.50 1.50
27 | 1 0.50 2.00
40 | 1 0.50 2.50
41 | 1 0.50 3.00
64 | 1 0.50 3.50
69 | 1 0.50 4.00
105 | 1 0.50 4.50
108 | 1 0.50 5.00
114 | 1 0.50 5.50
135 | 1 0.50 6.00
140 | 1 0.50 6.50
148 | 2 1.00 7.50
153 | 5 2.50 10.00
155 | 1 0.50 10.50
158 | 2 1.00 11.50
160 | 1 0.50 12.00
162 | 1 0.50 12.50
176 | 1 0.50 13.00
182 | 1 0.50 13.50
185 | 2 1.00 14.50
196 | 1 0.50 15.00
202 | 1 0.50 15.50
208 | 1 0.50 16.00
214 | 1 0.50 16.50
215 | 2 1.00 17.50
216 | 1 0.50 18.00
223 | 1 0.50 18.50
225 | 1 0.50 19.00
226 | 1 0.50 19.50
233 | 1 0.50 20.00
238 | 2 1.00 21.00
247 | 1 0.50 21.50
253 | 4 2.00 23.50
259 | 4 2.00 25.50
266 | 2 1.00 26.50
270 | 2 1.00 27.50
273 | 1 0.50 28.00
275 | 1 0.50 28.50
279 | 1 0.50 29.00
284 | 1 0.50 29.50
294 | 1 0.50 30.00
308 | 1 0.50 30.50
316 | 1 0.50 31.00
324 | 1 0.50 31.50
333 | 1 0.50 32.00
339 | 1 0.50 32.50
348 | 1 0.50 33.00
349 | 1 0.50 33.50
351 | 1 0.50 34.00
358 | 1 0.50 34.50
364 | 1 0.50 35.00
376 | 1 0.50 35.50
382 | 2 1.00 36.50
390 | 1 0.50 37.00
394 | 1 0.50 37.50
395 | 3 1.50 39.00
401 | 16 8.00 47.00
419 | 1 0.50 47.50
423 | 1 0.50 48.00
432 | 1 0.50 48.50
439 | 1 0.50 49.00
448 | 1 0.50 49.50
450 | 1 0.50 50.00
457 | 1 0.50 50.50
459 | 1 0.50 51.00
460 | 1 0.50 51.50
465 | 1 0.50 52.00
473 | 3 1.50 53.50
475 | 1 0.50 54.00
478 | 1 0.50 54.50
484 | 1 0.50 55.00
492 | 1 0.50 55.50
495 | 1 0.50 56.00
497 | 1 0.50 56.50
498 | 1 0.50 57.00
499 | 1 0.50 57.50
501 | 1 0.50 58.00
507 | 4 2.00 60.00
509 | 1 0.50 60.50
512 | 1 0.50 61.00
513 | 2 1.00 62.00
514 | 1 0.50 62.50
515 | 1 0.50 63.00
531 | 2 1.00 64.00
532 | 1 0.50 64.50
537 | 1 0.50 65.00
541 | 3 1.50 66.50
550 | 1 0.50 67.00
554 | 1 0.50 67.50
569 | 1 0.50 68.00
575 | 2 1.00 69.00
590 | 2 1.00 70.00
596 | 1 0.50 70.50
602 | 2 1.00 71.50
605 | 1 0.50 72.00
620 | 2 1.00 73.00
621 | 3 1.50 74.50
627 | 1 0.50 75.00
630 | 2 1.00 76.00
632 | 4 2.00 78.00
635 | 2 1.00 79.00
636 | 2 1.00 80.00
639 | 2 1.00 81.00
650 | 1 0.50 81.50
653 | 2 1.00 82.50
655 | 1 0.50 83.00
656 | 1 0.50 83.50
662 | 1 0.50 84.00
685 | 1 0.50 84.50
689 | 5 2.50 87.00
702 | 1 0.50 87.50
706 | 1 0.50 88.00
722 | 1 0.50 88.50
725 | 2 1.00 89.50
735 | 1 0.50 90.00
738 | 1 0.50 90.50
751 | 1 0.50 91.00
756 | 1 0.50 91.50
760 | 1 0.50 92.00
766 | 1 0.50 92.50
767 | 2 1.00 93.50
774 | 1 0.50 94.00
780 | 2 1.00 95.00
781 | 1 0.50 95.50
784 | 1 0.50 96.00
787 | 1 0.50 96.50
796 | 1 0.50 97.00
797 | 1 0.50 97.50
802 | 1 0.50 98.00
806 | 1 0.50 98.50
813 | 1 0.50 99.00
819 | 1 0.50 99.50
825 | 1 0.50 100.00
------------+-----------------------------------
Total | 200 100.00
svyset
pweight: pw
VCE: linearized
Strata 1: stype
SU 1: <observations>
FPC 1: fpc
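For the stratified design, the declaration adds the strata() option. The sketch below assumes that apistrat.dta is in memory and that pw and fpc are already stored in the file with one value per stratum, as the display above suggests. Under the design described here, the weights would be 4421/100 = 44.21 for elementary schools, 755/50 = 15.1 for high schools, and 1018/50 = 20.36 for middle schools.

```stata
* Stratified random sample: strata are the school types, sampling units are schools.
svyset _n [pweight=pw], strata(stype) fpc(fpc)

* Check the weights and population sizes stored for each stratum.
tabulate stype, summarize(pw)
tabulate stype, summarize(fpc)
```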
svy: mean api00
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 3 Number of obs = 200
Number of PSUs = 200 Population size = 6194
Design df = 197
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
api00 | 662.2874 9.408941 643.7322 680.8425
--------------------------------------------------------------
svy: total enroll
(running total on estimation sample)
Survey: Total estimation
Number of strata = 3 Number of obs = 200
Number of PSUs = 200 Population size = 6194
Design df = 197
--------------------------------------------------------------
| Linearized
| Total Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
enroll | 3687178 114641.7 3461095 3913260
--------------------------------------------------------------
svy: regress api00 meals ell avg_ed
(running regress on estimation sample)
Survey: Linear regression
Number of strata = 3 Number of obs = 200
Number of PSUs = 200 Population size = 6194
Design df = 197
F( 3, 195) = 190.97
Prob > F = 0.0000
R-squared = 0.7125
------------------------------------------------------------------------------
| Linearized
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -1.818234 .4076227 -4.46 0.000 -2.622098 -1.01437
ell | -.0191524 .3890413 -0.05 0.961 -.7863727 .7480679
avg_ed | 77.47879 16.93665 4.57 0.000 44.07838 110.8792
_cons | 534.4453 65.57342 8.15 0.000 405.1294 663.7613
------------------------------------------------------------------------------
In this example, the sampling frame contains the 757 school districts.
The file apiclus1.dta contains the data for the one-stage cluster sampling design.
use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear
tabulate stype
stype | Freq. Percent Cum.
------------+-----------------------------------
E | 144 78.69 78.69
H | 14 7.65 86.34
M | 25 13.66 100.00
------------+-----------------------------------
Total | 183 100.00
tabulate dnum
district |
number | Freq. Percent Cum.
------------+-----------------------------------
61 | 13 7.10 7.10
135 | 34 18.58 25.68
178 | 4 2.19 27.87
197 | 13 7.10 34.97
255 | 16 8.74 43.72
406 | 2 1.09 44.81
413 | 1 0.55 45.36
437 | 4 2.19 47.54
448 | 12 6.56 54.10
510 | 21 11.48 65.57
568 | 9 4.92 70.49
637 | 11 6.01 76.50
716 | 37 20.22 96.72
778 | 2 1.09 97.81
815 | 4 2.19 100.00
------------+-----------------------------------
Total | 183 100.00
svyset dnum [pw=pw], fpc(fpc)
pweight: pw
VCE: linearized
Strata 1: <one>
SU 1: dnum
FPC 1: fpc
/* list fpc pw dnum -- to see the values for these items */
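In a one-stage cluster sample, every school in a selected district is included, so each school carries the weight of its district. The arithmetic below is a sketch that assumes pw is constant at 757/15 (757 districts in the frame, 15 sampled); that value is inferred from the output and is not stated in the file itself. It shows why the estimated population size reported below is 9235.4 rather than 6,194.

```stata
* Weight for each school: districts in the frame / districts sampled.
display 757/15            // about 50.47

* Sum of the weights over the 183 sampled schools.
display 183*757/15        // 9235.4

* Compare with the values actually stored in the data.
summarize pw fpc
```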
svy: mean api00
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 1 Number of obs = 183
Number of PSUs = 15 Population size = 9235.4
Design df = 14
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
api00 | 644.1694 23.54224 593.6763 694.6625
--------------------------------------------------------------
svy: total enroll
(running total on estimation sample)
Survey: Total estimation
Number of strata = 1 Number of obs = 183
Number of PSUs = 15 Population size = 9235.4
Design df = 14
--------------------------------------------------------------
| Linearized
| Total Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
enroll | 5076846 1389984 2095626 8058066
--------------------------------------------------------------
svy: regress api00 meals ell avg_ed
(running regress on estimation sample)
Survey: Linear regression
Number of strata = 1 Number of obs = 157
Number of PSUs = 15 Population size = 9235.4001
Design df = 14
F( 3, 12) = 54.36
Prob > F = 0.0000
R-squared = 0.6978
------------------------------------------------------------------------------
| Linearized
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
meals | -2.948702 .3266161 -9.03 0.000 -3.649224 -2.24818
ell | -.2227005 .3938377 -0.57 0.581 -1.067398 .6219974
avg_ed | 16.42832 15.32151 1.07 0.302 -16.43304 49.28968
_cons | 755.4386 55.61202 13.58 0.000 636.1626 874.7145
------------------------------------------------------------------------------
These pages are Copyrighted (c) by UCLA Academic Technology Services