use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
count
200
sample 10
(180 observations deleted)
count
20
As you can see, only 20 of the original 200 observations remain in the data set (20 is 10% of 200). You may want to save this smaller data set under a new name so that you do not overwrite your original data set.
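If you want to keep the sample for later use, one way is to save it under a different file name right after drawing it. The sketch below is only an illustration: it builds a 200-observation stand-in for hsb2 so that it runs anywhere, and the seed and the file name hsb2_sample10 are arbitrary choices that are not part of the original example.

```stata
* Stand-in for hsb2: 200 observations with an id variable (hypothetical data)
clear
set obs 200
generate id = _n

* Arbitrary seed so that the same 10% can be drawn again later
set seed 12345
sample 10
count

* Hypothetical file name; the full data set on disk is left untouched
save hsb2_sample10, replace
```

Saving under a separate name keeps the full data set available if you later need a different sample.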
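If you add the count option, the number is interpreted as the number of observations to keep rather than as a percentage.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
sample 50, count
(150 observations deleted)
count
50
What will happen if we specify a number that is larger than the number of observations in the data set?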
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
sample 250, count
count
200
As you can see, all of the observations from the data set were kept, but none were sampled a second time to increase the sample size to the desired number. Notice also that Stata did not issue an error message when the sample size exceeded the number of observations in the data set.
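Because sample stays silent when the request is too large, you may want to check the request against _N yourself. The following is a minimal sketch under stated assumptions: the data are a generated 200-observation stand-in for hsb2, and the local macro want is a hypothetical name for the requested sample size.

```stata
* Stand-in for hsb2: 200 observations (hypothetical data)
clear
set obs 200
generate id = _n

* Hypothetical requested sample size
local want 250

* Compare the request with the number of observations before sampling
if `want' > _N {
    display as text "requested `want' observations, but only " _N " are available"
}
else {
    sample `want', count
}
count
```

With a request of 250 the message is shown and the data set is left as it was.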
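You can also draw the sample within groups by combining sample with the by prefix. In the example below, 15% of the observations in each category of prog are kept.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear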
sort prog
by prog: count
______________________________________________________________________________
-> prog = general
45
_______________________________________________________________________________
-> prog = academic
105
_______________________________________________________________________________
-> prog = vocation
50
by prog: sample 15
(169 observations deleted)
count
31
by prog: count
_______________________________________________________________________________
-> prog = general
7
_______________________________________________________________________________
-> prog = academic
16
_______________________________________________________________________________
-> prog = vocation
8
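The group sizes after a by-group sample can also be checked in a single table. This is a small sketch, assuming generated data with the same group sizes as prog (45, 105, and 50); the numeric codes 1, 2, and 3 stand in for general, academic, and vocation, and the seed is arbitrary.

```stata
* Stand-in data with the same group sizes as prog in hsb2 (hypothetical)
clear
set obs 200
generate prog = cond(_n <= 45, 1, cond(_n <= 150, 2, 3))

* Arbitrary seed so the draw can be repeated
set seed 12345

* bysort sorts by prog and then samples 15% within each group
bysort prog: sample 15

* One table with the number of observations kept in each group
tabulate prog
```

Which observations are kept depends on the seed, but the group counts should match the output above because they depend only on the group sizes.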
You can also specify conditions by which the sample should be selected.
For example, consider the code below.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
sort prog
by prog: count
_______________________________________________________________________________
-> prog = general
45
_______________________________________________________________________________
-> prog = academic
105
_______________________________________________________________________________
-> prog = vocation
50
sample 12 if prog == 3
(44 observations deleted)
count
156
sort prog
by prog: count
____________________________________________________________________________
-> prog = general
45
_______________________________________________________________________________
-> prog = academic
105
_______________________________________________________________________________
-> prog = vocation
6
As you can see, all of the observations from the non-vocation
(general and academic) categories were kept, together with 12% of the
cases from the vocation category (.12*50 = 6). Now let's
consider writing the code as shown below.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
sample 12 if prog != 3
(132 observations deleted)
count
68
sort prog
by prog: count
_______________________________________________________________________________
-> prog = general
7
_______________________________________________________________________________
-> prog = academic
11
_______________________________________________________________________________
-> prog = vocation
50
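The counts in the output above can also be checked directly, which makes the role of the if condition easier to see. The sketch below assumes generated data with the same group sizes as prog (codes 1, 2, and 3 standing in for general, academic, and vocation) and an arbitrary seed.

```stata
* Stand-in data with the same group sizes as prog in hsb2 (hypothetical)
clear
set obs 200
generate prog = cond(_n <= 45, 1, cond(_n <= 150, 2, 3))

* Arbitrary seed so the draw can be repeated
set seed 12345

* Only observations that satisfy the condition are sampled
sample 12 if prog != 3

* Observations that do not satisfy the condition are all kept
count if prog == 3
assert r(N) == 50

* Observations that satisfy the condition are reduced to 12%
count if prog != 3
```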
We can see that all 50 cases from the vocation category were included, together
with 12% of the 150 cases in the other two categories (.12*150 = 18). Note that the 18 cases are drawn from the general and academic cases as a whole, so the share taken from each category (here 7 of 45 and 11 of 105) is only approximately 12%.
To sample with replacement, the basic command is bsample followed by the number of observations that you want in the sample. Note that the sample size cannot exceed the number of observations in the data set.
clear
input id wt strata1 cluster1 x
1 4 1 1 15
2 4 1 1 29
3 4 2 2 14
4 4 2 2 25
5 4 3 2 17
6 5 3 3 19
7 5 4 3 20
8 5 4 3 27
9 5 5 4 26
10 5 5 4 28
end
save "d:\wrsample.dta", replace
bsample 5
list
     +-----------------------------------+
     | id   wt   strata1   cluster1    x |
     |-----------------------------------|
  1. | 10    5         5          4   28 |
  2. |  3    4         2          2   14 |
  3. |  5    4         3          2   17 |
  4. | 10    5         5          4   28 |
  5. |  9    5         5          4   26 |
     +-----------------------------------+
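Because bsample draws with replacement, the same observation can appear more than once, as id 10 does above. One way to see this without reading the listing is duplicates report. The sketch below assumes a generated ten-observation data set with only an id variable, and the seed is arbitrary.

```stata
* Ten observations identified by id (hypothetical stand-in data)
clear
set obs 10
generate id = _n

* Arbitrary seed so the draw can be repeated
set seed 12345

* Draw 5 observations with replacement
bsample 5

* Any id drawn more than once is reported as having surplus copies
duplicates report id
count
```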
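With the weight option, bsample does not drop or duplicate observations. Instead, it stores
in the named variable the number of times each observation was drawn, which you can then use as a frequency weight. Note that the variable (here wt) must already exist in the data set, and its previous values are overwritten.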
use "d:\wrsample.dta", clear
bsample 4, weight(wt)
list
     +-----------------------------------+
     | id   wt   strata1   cluster1    x |
     |-----------------------------------|
  1. |  5    0         3          2   17 |
  2. |  2    2         1          1   29 |
  3. |  8    1         4          3   27 |
  4. |  3    1         2          2   14 |
  5. |  9    0         5          4   26 |
     |-----------------------------------|
  6. | 10    0         5          4   28 |
  7. |  1    0         1          1   15 |
  8. |  4    0         2          2   25 |
  9. |  7    0         4          3   20 |
 10. |  6    0         3          3   19 |
     +-----------------------------------+
In this example, the observation with id 2 was selected twice, and those with id 8
and id 3 were each selected once; every other observation has a weight of 0.
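Once the frequencies are stored, they can be used in two ways: as frequency weights in a later command, or to build the bootstrap sample explicitly. The following is a sketch of both, using the ten-observation data set from this page typed in directly; the seed is an arbitrary choice, so the observations drawn will differ from the listing above.

```stata
* The ten-observation example data set from this page
clear
input id wt strata1 cluster1 x
1 4 1 1 15
2 4 1 1 29
3 4 2 2 14
4 4 2 2 25
5 4 3 2 17
6 5 3 3 19
7 5 4 3 20
8 5 4 3 27
9 5 5 4 26
10 5 5 4 28
end

* Arbitrary seed so the draw can be repeated
set seed 12345

* Store the sampling frequencies in wt (its old values are replaced)
bsample 4, weight(wt)

* Use the stored frequencies as frequency weights
summarize x [fweight = wt]

* Or expand the data so each drawn observation appears wt times
drop if wt == 0
expand wt
count
```

The observations with a weight of 0 are dropped first because expand keeps a single copy of any observation whose count is below 1.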
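You can also sample within strata using the strata option. In that case the number you give is the number of observations drawn from each stratum, so the code below takes one observation from each of the five strata.
use "d:\wrsample.dta", clear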
bsample 1, strata(strata1)
list
     +-----------------------------------+
     | id   wt   strata1   cluster1    x |
     |-----------------------------------|
  1. |  2    4         1          1   29 |
  2. |  3    4         2          2   14 |
  3. |  5    4         3          2   17 |
  4. |  8    5         4          3   27 |
  5. |  9    5         5          4   26 |
     +-----------------------------------+
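To confirm that a stratified draw really contains the requested number of observations per stratum, you can test the group sizes with assert. This is a minimal sketch, assuming a generated data set in which strata1 pairs up consecutive ids as in the example data; the seed is arbitrary.

```stata
* Ten observations in five strata of two, as in the example data
clear
set obs 10
generate id = _n
generate strata1 = ceil(_n/2)

* Arbitrary seed so the draw can be repeated
set seed 12345

* Draw one observation with replacement from each stratum
bsample 1, strata(strata1)

* Stops with an error if any stratum does not hold exactly one observation
bysort strata1: assert _N == 1
count
```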
You can also sample clusters of your data using the cluster option.
Note that the number you give is the number of clusters to draw, not the number of
observations.
use "d:\wrsample.dta", clear
bsample 3, cluster(cluster1)
list
     +-----------------------------------+
     | id   wt   strata1   cluster1    x |
     |-----------------------------------|
  1. |  6    5         3          3   19 |
  2. |  7    5         4          3   20 |
  3. |  8    5         4          3   27 |
  4. |  6    5         3          3   19 |
  5. |  7    5         4          3   20 |
     |-----------------------------------|
  6. |  8    5         4          3   27 |
  7. |  1    4         1          1   15 |
  8. |  2    4         1          1   29 |
     +-----------------------------------+
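When a cluster is drawn more than once, its copies carry the same cluster1 value and cannot be told apart. The idcluster() option of bsample creates a new variable that identifies each resampled cluster. The sketch below assumes a generated data set with the same cluster layout as the example data; the variable name newclus and the seed are arbitrary choices.

```stata
* Ten observations in four clusters, laid out as in the example data
clear
set obs 10
generate id = _n
generate cluster1 = cond(id <= 2, 1, cond(id <= 5, 2, cond(id <= 8, 3, 4)))

* Arbitrary seed so the draw can be repeated
set seed 12345

* Draw three clusters with replacement and label each draw in newclus
bsample 3, cluster(cluster1) idcluster(newclus)

* List the draws with a separator line between resampled clusters
sort newclus id
list newclus cluster1 id, sepby(newclus)
```

If the same cluster is drawn twice, it appears under two different values of newclus.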
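In this example, Stata chose cluster 3 twice and cluster 1 once for a total
of three clusters.
Because these samples are random, they will differ from run to run. To get the same sample each time, set the random-number seed before drawing the sample.
set seed 2038947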
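Setting the seed before a draw is what makes a random sample repeatable. The sketch below draws the same 10% sample twice from a generated 200-observation stand-in data set and compares the two draws with cf. It is meant to be run as a do-file, because it relies on temporary files held in local macros; the names full and first are arbitrary.

```stata
* Stand-in data: 200 observations with an id variable (hypothetical)
clear
set obs 200
generate id = _n
tempfile full first
save "`full'"

* First draw
set seed 2038947
sample 10
sort id
save "`first'"

* Second draw from the same data with the same seed
use "`full'", clear
set seed 2038947
sample 10
sort id

* cf stops with an error if the two samples differ
cf _all using "`first'"
```

The seed must be set again before the second draw; otherwise the random-number generator simply continues from where the first draw left off.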