Sometimes you may be analyzing a very large data file and want to work with just a simple random sample of it. Other times you may want to draw a simple random sample with replacement from a small data file. Either way, SAS proc surveyselect is a fairly straightforward way to do it. Let's use the following data set for the purpose of demonstration.
DATA hsb25;
INPUT id gender $ race ses schtype $ prog
read write math science socst;
DATALINES;
147 f 1 3 pub 1 47 62 53 53 61
108 m 1 2 pub 2 34 33 41 36 36
18 m 3 2 pub 3 50 33 49 44 36
153 m 1 2 pub 3 39 31 40 39 51
50 m 2 2 pub 2 50 59 42 53 61
51 f 2 1 pub 2 42 36 42 31 39
102 m 1 1 pub 1 52 41 51 53 56
57 f 1 2 pub 1 71 65 72 66 56
160 f 1 2 pub 1 55 65 55 50 61
136 m 1 2 pub 1 65 59 70 63 51
88 f 1 1 pub 1 68 60 64 69 66
177 m 1 2 pri 1 55 59 62 58 51
95 m 1 1 pub 1 73 60 71 61 71
144 m 1 1 pub 2 60 65 58 61 66
139 f 1 2 pub 1 68 59 61 55 71
135 f 1 3 pub 1 63 60 65 54 66
191 f 1 1 pri 1 47 52 43 48 61
171 m 1 2 pub 1 60 54 60 55 66
22 m 3 2 pub 3 42 39 39 56 46
47 f 2 3 pub 1 47 46 49 33 41
56 m 1 2 pub 3 55 45 46 58 51
128 m 1 1 pub 1 39 33 38 47 41
36 f 2 3 pub 2 44 49 44 35 51
53 m 2 2 pub 3 34 37 46 39 31
26 f 4 1 pub 1 60 59 62 61 51
;
RUN;
In a simple random sample without replacement, each observation in the data set has an equal chance of being selected, and once selected it cannot be chosen again. The following code creates a simple random sample of size 10 from data set hsb25. Here the method option in the proc surveyselect statement specifies the method to be SRS (simple random sampling). The rep (replicate) option specifies the number of simple random samples you want to create. The sampsize option specifies the size of each random sample. This number cannot exceed the number of observations in the original data set, since the sampling is done without replacement. You can also specify the seed so that exactly the same sample can be reproduced later using the same seed. The out option names the output data set. The id statement specifies the variables to be included in the sample. Here we use _all_ to include all the variables in the sample.
proc surveyselect data = hsb25 method = SRS rep = 1
sampsize = 10 seed = 12345 out = hsbs1;
id _all_;
run;
proc print data = hsbs1 noobs;
run;
id gender race ses schtype prog read write math science socst
108 m 1 2 pub 2 34 33 41 36 36
153 m 1 2 pub 3 39 31 40 39 51
51 f 2 1 pub 2 42 36 42 31 39
95 m 1 1 pub 1 73 60 71 61 71
139 f 1 2 pub 1 68 59 61 55 71
135 f 1 3 pub 1 63 60 65 54 66
191 f 1 1 pri 1 47 52 43 48 61
22 m 3 2 pub 3 42 39 39 56 46
47 f 2 3 pub 1 47 46 49 33 41
53 m 2 2 pub 3 34 37 46 39 31
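The rep option described above can also request several independent samples in a single call. The following is a minimal sketch, assuming the same hsb25 data set; the output data set name hsbs3, the sample size of 5 and the choice of three replicates are arbitrary values chosen for illustration. With rep greater than 1, proc surveyselect adds a variable named Replicate to the output data set to identify the sample each observation belongs to.

```sas
/* Draw 3 independent simple random samples (without replacement), */
/* each of size 5. hsbs3 is a hypothetical output data set name.   */
proc surveyselect data = hsb25 method = SRS rep = 3
  sampsize = 5 seed = 12345 out = hsbs3;
  id _all_;
run;

/* Replicate is added by proc surveyselect and numbers the samples. */
proc print data = hsbs3 noobs;
  var Replicate id gender read write math;
run;
```

The output should contain 15 observations, five for each value of Replicate. An observation may appear in more than one replicate, but not twice within the same replicate.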
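In a simple random sample with replacement, each observation has an equal chance of being selected on every draw and can be selected more than once. To sample with replacement, set the method option to URS (unrestricted random sampling). In the code below, the id statement keeps only the listed variables. The output data set contains one row for each distinct selected observation, and the variable NumberHits records how many times that observation was selected.
proc surveyselect data=hsb25 method = urs sampsize = 10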
rep=1 seed=12345 out=hsbs2;
id id read write math science socst;
run;
proc print data = hsbs2 noobs;
run;
id read write math science socst NumberHits
22 42 39 39 56 46 1
47 47 46 49 33 41 1
51 42 36 42 31 39 1
57 71 65 72 66 56 1
139 68 59 61 55 71 1
144 60 65 58 61 66 3
147 47 62 53 53 61 1
153 39 31 40 39 51 1
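The data set hsbs2 has only 8 observations because two observations were selected more than once: the observation with id = 144 was selected three times, as recorded in the variable NumberHits. Here is sample code that creates a data set with 10 observations from hsbs2 by writing each observation out NumberHits times.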
data hsbs2f;
set hsbs2;
do i = 1 to numberhits;
output;
end;
drop i;
run;
proc print data = hsbs2f noobs;
run;
id read write math science socst NumberHits
22 42 39 39 56 46 1
47 47 46 49 33 41 1
51 42 36 42 31 39 1
57 71 65 72 66 56 1
139 68 59 61 55 71 1
144 60 65 58 61 66 3
144 60 65 58 61 66 3
144 60 65 58 61 66 3
147 47 62 53 53 61 1
153 39 31 40 39 51 1
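As an alternative to expanding the sample in a data step, proc surveyselect has an outhits option that writes a separate row for every hit. The following is a minimal sketch, assuming your SAS/STAT release supports outhits; the output data set name hsbs2h is a hypothetical name chosen for illustration.

```sas
/* Sample 10 observations with replacement. With outhits, an observation */
/* selected k times appears as k rows in the output data set.            */
/* hsbs2h is a hypothetical output data set name.                        */
proc surveyselect data = hsb25 method = urs sampsize = 10
  rep = 1 seed = 12345 outhits out = hsbs2h;
  id id read write math science socst;
run;

proc print data = hsbs2h noobs;
run;
```

The resulting data set should have 10 rows, so no further data step is needed. The NumberHits variable is still included on every row.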
These pages are Copyrighted (c) by UCLA Academic Technology Services
The %srs macro was written by Jonah Schlackman.