
How can I draw a random sample of my data?

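There are two commands in Stata that can be used to take a random sample of your data set. Use the sample command to draw a sample without replacement, and the bsample command to draw a sample with replacement.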

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
    count
      200
    sample 10
    (180 observations deleted)
    count
      20

As you can see, only 20 of the original 200 observations remain in the data set (20 is 10% of 200). You may want to save this smaller data set with a new name, so that you do not overwrite your original data set.
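The advice about saving the sample under a new name can be sketched as follows. The file name hsb2_sample10 is a hypothetical choice, and the preserve/restore pair is just one way to get the full data back afterwards; run this from a do-file so that the // comments are accepted.

```stata
* Minimal sketch: draw a 10% sample and keep the full data set intact.
* The file name hsb2_sample10 is a hypothetical choice for illustration.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
preserve                        // hold a copy of the full data
sample 10                       // keep a random 10% of the observations
save hsb2_sample10, replace     // write the sample to its own file
restore                         // bring back the full data set
count
```

Because the sample goes to a separate file, the original data set is never overwritten.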

Now let's specify the number of observations, say 50, that we want in our sample, instead of the percentage of the data set. To do this, we will use the count option.

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
    sample 50, count
    (150 observations deleted)
    count
      50

What will happen if we specify a number that is larger than the number of observations in the data set?

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
    sample 250, count
    count
      200

As you can see, all of the observations from the data set were kept, but none were sampled a second time to increase the sample size to the desired number. Notice also that Stata did not issue an error message when the sample size exceeded the number of observations in the data set.
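Since Stata stays silent when the requested count is too large, you may want to check the request yourself. The following is a minimal sketch of such a check, not a built-in feature of sample; the local macro name n is an assumption made for this example.

```stata
* Minimal sketch: refuse to sample when the request exceeds the data.
* The local macro name n is a hypothetical choice for illustration.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
local n = 250
if `n' > _N {
    display as error "requested `n' observations, but only " _N " are available"
}
else {
    sample `n', count
}
```

Here _N is Stata's built-in count of observations currently in memory, so the comparison always reflects the data set as loaded.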

You can also select a sample with a given percentage or number from each level of a grouping variable. (This might be a stratification variable.) In the example below, we select 15% of the observations from each level of prog.

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
    sort prog
    by prog: count
    -> prog = general
       45
    -> prog = academic
      105
    -> prog = vocation
       50
    by prog: sample 15
    (169 observations deleted)
    count
      31
    by prog: count
    -> prog = general
        7
    -> prog = academic
       16
    -> prog = vocation
        8

You can also specify conditions by which the sample should be selected. For example, consider the code below.

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
    sort prog
    by prog: count
    -> prog = general
       45
    -> prog = academic
      105
    -> prog = vocation
       50
    sample 12 if prog == 3
    (44 observations deleted)
    count
      156
    sort prog
    by prog: count
    -> prog = general
       45
    -> prog = academic
      105
    -> prog = vocation
        6

As you can see, all of the observations from the non-vocation (general and academic) categories were kept in the sample, along with approximately 12% of the cases from the vocation category (.12*50 = 6). Now let's consider writing the code as shown below.

    use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
    sample 12 if prog != 3
    (132 observations deleted)
    count
      68
    sort prog
    by prog: count
    -> prog = general
        7
    -> prog = academic
       11
    -> prog = vocation
       50

We can see that all 50 cases from the vocation category were included, as well as 12% of the 150 cases in the other two categories combined (.12*150 = 18). Note that the 12% is taken from the pooled general and academic cases, so the share kept within each category (7 of 45 and 11 of 105) is only approximately 12%.
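Returning to sampling within the levels of a grouping variable: the same by prefix can be combined with the count option to request a fixed number of observations per level instead of a percentage. The sketch below assumes that by and count combine in this way, and the choice of 5 observations per level is arbitrary.

```stata
* Minimal sketch: keep a fixed number of observations from each level of prog.
* Assumes the by prefix and the count option can be combined as shown.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
bysort prog: sample 5, count    // request 5 observations per program type
tabulate prog                   // check how many cases remain in each level
```

The tabulate command at the end lets you confirm the resulting group sizes rather than taking them on trust.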

To illustrate sampling with replacement, we first enter a small data set and save it.

    clear
    input id wt strata1 cluster1 x
    1 4 1 1 15
    2 4 1 1 29
    3 4 2 2 14
    4 4 2 2 25
    5 4 3 2 17
    6 5 3 3 19
    7 5 4 3 20
    8 5 4 3 27
    9 5 5 4 26
    10 5 5 4 28
    end
    save "d:\wrsample.dta", replace

The basic command is bsample.

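    bsample 5
    list
         +-----------------------------------+
         | id   wt   strata1   cluster1    x |
         |-----------------------------------|
      1. | 10    5         5          4   28 |
      2. |  3    4         2          2   14 |
      3. |  5    4         3          2   17 |
      4. | 10    5         5          4   28 |
      5. |  9    5         5          4   26 |
         +-----------------------------------+

You can use the weight() option to have bsample record how many times each observation was selected in a variable, instead of dropping and duplicating observations.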

    use "d:\wrsample.dta", clear
    bsample 4, weight(wt)
    list
         +-----------------------------------+
         | id   wt   strata1   cluster1    x |
         |-----------------------------------|
      1. |  5    0         3          2   17 |
      2. |  2    2         1          1   29 |
      3. |  8    1         4          3   27 |
      4. |  3    1         2          2   14 |
      5. |  9    0         5          4   26 |
         |-----------------------------------|
      6. | 10    0         5          4   28 |
      7. |  1    0         1          1   15 |
      8. |  4    0         2          2   25 |
      9. |  7    0         4          3   20 |
     10. |  6    0         3          3   19 |
         +-----------------------------------+

In this example, observation number 2 was selected twice, and observations 8 and 3 were each selected once.

You still have all 10 observations, but the weights have been changed to reflect which observations should be included in the sample. Try running the code multiple times and you will see that you get different results each time that you run it.
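One way to work with these weights is sketched below. Because weight() overwrites the variable you name, the sketch uses a separate placeholder variable, bsw (a hypothetical name), so that the original wt values are left alone. It then treats the frequencies as frequency weights and, as an alternative, turns them into actual rows.

```stata
* Minimal sketch: store the sampling frequencies in a separate variable.
* The variable name bsw is a hypothetical choice for illustration.
use "d:\wrsample.dta", clear
generate bsw = .                 // placeholder; bsample overwrites its values
bsample 4, weight(bsw)           // wt keeps its original values
summarize x [fweight = bsw]      // analyze the sample through the weights

* To materialize the sample as actual rows instead:
drop if bsw == 0                 // remove observations that were not drawn
expand bsw                       // duplicate observations drawn more than once
list
```

Keeping the frequencies in their own variable is a design choice: it lets you compare the drawn sample against the original weights in the same data set.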

If your data are stratified, you can sample from each of the strata. You need to provide Stata with the number of observations that you want from each stratum, not the total number of observations that you want in the sample. In the following example, we will ask for one observation from each stratum, giving us a total sample size of 5.

    use "d:\wrsample.dta", clear
    bsample 1, strata(strata1)
    list
         +-----------------------------------+
         | id   wt   strata1   cluster1    x |
         |-----------------------------------|
      1. |  2    4         1          1   29 |
      2. |  3    4         2          2   14 |
      3. |  5    4         3          2   17 |
      4. |  8    5         4          3   27 |
      5. |  9    5         5          4   26 |
         +-----------------------------------+

You can also sample clusters of your data using the cluster() option.

    use "d:\wrsample.dta", clear
    bsample 3, cluster(cluster1)
    list
         +-----------------------------------+
         | id   wt   strata1   cluster1    x |
         |-----------------------------------|
      1. |  6    5         3          3   19 |
      2. |  7    5         4          3   20 |
      3. |  8    5         4          3   27 |
      4. |  6    5         3          3   19 |
      5. |  7    5         4          3   20 |
         |-----------------------------------|
      6. |  8    5         4          3   27 |
      7. |  1    4         1          1   15 |
      8. |  2    4         1          1   29 |
         +-----------------------------------+

In this example, Stata chose cluster 3 twice and cluster 1 once for a total of three clusters.
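When a cluster is drawn more than once, as cluster 3 was here, the two copies share the same cluster1 value and cannot be told apart. A minimal sketch of one way around this uses the idcluster() option of bsample, which creates a new identifier for each resampled cluster; the variable name newcl is a hypothetical choice.

```stata
* Minimal sketch: give each resampled cluster its own identifier.
* The variable name newcl is a hypothetical choice for illustration.
use "d:\wrsample.dta", clear
bsample 3, cluster(cluster1) idcluster(newcl)
sort newcl id
list
```

Any later analysis that needs to treat the drawn clusters as distinct units can then use newcl in place of cluster1.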

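To get the same sample each time you run the code, set the random-number seed before drawing the sample. With the data in the same order, the same seed yields the same sample.

    set seed 2038947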
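A quick way to convince yourself that the seed makes a sample reproducible is to draw the same sample twice and compare the results. The sketch below is one such check, meant to be run as a do-file; it relies on a temporary file and on the cf command, which reports any variables that differ between the data in memory and a saved data set.

```stata
* Minimal sketch: the same seed should reproduce the same sample.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
set seed 2038947
sample 10
tempfile first                  // temporary file; exists only for this run
save `first'

use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
set seed 2038947                // reset to the same seed before the second draw
sample 10
cf _all using `first'           // compare the two samples variable by variable
```

If you remove the second set seed line, the two draws should generally differ and cf should report mismatches.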
