SPSS Learning Module
How can I analyze a subset of my data?
There are several ways to analyze either a temporary or a permanent subset of your data. The examples below illustrate some of these methods.
Example 1: Creating a random sample
There may be times when you would like to analyze only a subset of your data. For example, suppose that you have a huge data file with thousands of cases and that you have written a syntax file to analyze the data. Because the syntax may take hours to run, you may want to run it first on a relatively small sample of your data to see if it works properly. There are several ways to create a sub-sample, such as using only the first 100 cases. However, in this situation, it may be best to take a random sample of your data. The SPSS command to do this is sample. For this example, we will randomly select approximately 20% of the data, and we will use the means command to show the effect of taking the subset.
Let's consider the following data set. It has two independent
variables (iv1 and iv2) and two dependent variables (dv1 and
dv2).
data list list / sub iv1 iv2 dv1 dv2.
begin data
1 1 1 . 25
2 1 1 49 37
3 1 1 50 55
4 2 1 . 19
5 2 1 20 38
6 2 0 23 48
7 2 0 28 44
8 3 0 28 68
9 3 0 . 30
10 3 0 32 36
end data.
save outfile 'c:\sset.sav'.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 10         | 100.0%           | 0          | .0%              | 10      | 100.0%        |
Report
DV2
| IV1   | Mean    | N  | Std. Deviation |
| 1.00  | 39.0000 | 3  | 15.09967       |
| 2.00  | 37.2500 | 4  | 12.84199       |
| 3.00  | 44.6667 | 3  | 20.42874       |
| Total | 40.0000 | 10 | 14.46836       |
sample .20.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 4          | 100.0%           | 0          | .0%              | 4       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 25.0000 | 1 | .              |
| 2.00  | 38.0000 | 1 | .              |
| 3.00  | 33.0000 | 2 | 4.24264        |
| Total | 32.2500 | 4 | 5.90903        |
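Because sample draws cases at random, rerunning the same syntax will generally select different cases. The following is a minimal sketch of one way to make the draw repeatable. It assumes the default random number generator is in use, and the seed value 12345 is an arbitrary choice that does not come from the original example.

```spss
* Reload the full data set before drawing a new sample.
get file 'c:\sset.sav'.
* Fix the random number seed so that the same cases are drawn on each run.
set seed=12345.
sample .20.
means dv2 by iv1.
```

With a fixed seed, running this block again from the get file command should select the same cases each time.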
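Be aware that sample takes a permanent sample of the data in the working file. In other words, the cases that are not selected are deleted. Note also that a proportion gives only an approximate sample size: here 4 of the 10 cases were selected rather than exactly 2. The next example will illustrate how to take a subset without deleting the non-selected cases.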
Example 2: Creating a temporary random sample
The temporary command can be used with most SPSS transformation commands, and we will use it here to create a temporary subset of the data in the working file. Any transformations that follow the temporary command, including selecting cases and creating or transforming variables, are in effect only until the next procedure is executed. In the example below, the means command is the procedure that will terminate the temporary command. To illustrate this, we will issue the means command twice. The first time, the temporary command will be in effect, and the descriptive statistics will reflect the reduced number of cases. That first means command also terminates the temporary command, so the second means command will be run on the full data set.
get file 'c:\sset.sav'.
temporary.
sample .20.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 4          | 100.0%           | 0          | .0%              | 4       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 39.0000 | 3 | 15.09967       |
| 2.00  | 19.0000 | 1 | .              |
| Total | 34.0000 | 4 | 15.87451       |
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 10         | 100.0%           | 0          | .0%              | 10      | 100.0%        |
Report
DV2
| IV1   | Mean    | N  | Std. Deviation |
| 1.00  | 39.0000 | 3  | 15.09967       |
| 2.00  | 37.2500 | 4  | 12.84199       |
| 3.00  | 44.6667 | 3  | 20.42874       |
| Total | 40.0000 | 10 | 14.46836       |
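The temporary command applies to variable transformations as well as to case selection. As a rough sketch of that behavior, the block below creates a variable that exists only for the next procedure. The variable name dvsum is hypothetical and is introduced here only for illustration.

```spss
get file 'c:\sset.sav'.
temporary.
* dvsum is a hypothetical variable that exists only until the next procedure.
compute dvsum = dv1 + dv2.
sample .20.
* This procedure uses the temporary variable and the temporary sample.
means dvsum by iv1.
* This procedure runs on all cases, and dvsum is no longer available.
means dv2 by iv1.
```

Cases with a missing value for dv1 will also have a missing value for dvsum, so they are excluded from the first means table.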
Example 3: Selecting a specific number of cases
The sample command can also be used to select a specific number of
cases. For example, suppose that you wanted to obtain descriptive statistics on
four cases randomly drawn from the first eight cases in the data set.
temporary.
sample 4 from 8.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 4          | 100.0%           | 0          | .0%              | 4       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 25.0000 | 1 | .              |
| 2.00  | 43.3333 | 3 | 5.03322        |
| Total | 38.7500 | 4 | 10.04573       |
If you wanted the four cases to be drawn from the entire data set, you would simply put the total number of cases after the keyword from.
temporary.
sample 4 from 10.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 4          | 100.0%           | 0          | .0%              | 4       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 37.0000 | 1 | .              |
| 2.00  | 38.0000 | 1 | .              |
| 3.00  | 33.0000 | 2 | 4.24264        |
| Total | 35.2500 | 4 | 3.59398        |
Example 4: Selecting a specific number of the first cases
Suppose that you read a large data file into SPSS and you just wanted
to see if the data were read in properly. Because the file is large, running descriptive statistics on the entire data set would be time
consuming, and would probably not be more informative than running the descriptive statistics on a small sub-set. You could use the
n of cases command to select, say, the first 50 cases in the data file. As with sample, this command
permanently modifies your data set. If you do not want the rest of the cases to be deleted, you will need to
use the temporary command just before the n of cases command. Also
note that the n of cases command can be shortened to n.
temporary.
n of cases 5.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 5          | 100.0%           | 0          | .0%              | 5       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 39.0000 | 3 | 15.09967       |
| 2.00  | 28.5000 | 2 | 13.43503       |
| Total | 34.8000 | 5 | 13.86362       |
Equivalently,
temporary.
n 5.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 5          | 100.0%           | 0          | .0%              | 5       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 39.0000 | 3 | 15.09967       |
| 2.00  | 28.5000 | 2 | 13.43503       |
| Total | 34.8000 | 5 | 13.86362       |
Example 5: Selecting cases based on value of one or more variables
Sometimes you may want to select cases based on the value of one or more variables. For example, suppose that you wanted to obtain descriptive statistics for only those cases where iv1 was greater than one and dv2 was less than 40. The select if command will permanently select those cases from your data set. As with the other commands, you can use the temporary command to temporarily select the desired cases.
temporary.
select if (iv1 gt 1 and dv2 lt 40).
means dv2 by iv2.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV2 | 4          | 100.0%           | 0          | .0%              | 4       | 100.0%        |
Report
DV2
| IV2   | Mean    | N | Std. Deviation |
| .00   | 33.0000 | 2 | 4.24264        |
| 1.00  | 28.5000 | 2 | 13.43503       |
| Total | 30.7500 | 4 | 8.53913        |
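If you do want the selection to be permanent, one cautious approach is to save the selected cases under a new file name so that the original file is left untouched. The sketch below assumes that approach. The file name sset_sel.sav is hypothetical and is not part of the original example.

```spss
get file 'c:\sset.sav'.
* Without temporary, select if deletes the non-selected cases from the working file.
select if (iv1 gt 1 and dv2 lt 40).
* Save under a new, hypothetical name so the original file is preserved.
save outfile 'c:\sset_sel.sav'.
means dv2 by iv2.
```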
You can also use the select if command to select cases that have
a missing value for the variable of interest. For example, suppose that you wanted to select and analyze the cases for which
dv1 had a missing value.
temporary.
select if (sysmis(dv1)).
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 3          | 100.0%           | 0          | .0%              | 3       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 25.0000 | 1 | .              |
| 2.00  | 19.0000 | 1 | .              |
| 3.00  | 30.0000 | 1 | .              |
| Total | 24.6667 | 3 | 5.50757        |
Example 6: Filtering by a variable
Another way to subset your data is to filter them by a variable. The variable that is to be used as a filter must be a numeric variable that is coded zero/one (i.e., a dummy variable). The cases coded as zero will be filtered out, that is, excluded from the analysis. If the filter variable is dichotomous but coded, say, one/two, SPSS will execute the requested command without filtering out any cases, and it will not issue either an error message or a warning message. You can tell if the filter is on by looking in the lower right-hand corner of the data editor for the "filter on" message. You can see which cases are being filtered out by looking at the left-most column of the data editor: cases with a slash through the case number are being filtered out. The filter command does not make permanent changes to your data set, and you can turn it off by issuing the filter off command. Let's suppose that you wanted to use iv2 as a filter.
filter by iv2.
means dv2 by iv1.
Case Processing Summary
|           | Included N | Included Percent | Excluded N | Excluded Percent | Total N | Total Percent |
| DV2 * IV1 | 5          | 100.0%           | 0          | .0%              | 5       | 100.0%        |
Report
DV2
| IV1   | Mean    | N | Std. Deviation |
| 1.00  | 39.0000 | 3 | 15.09967       |
| 2.00  | 28.5000 | 2 | 13.43503       |
| Total | 34.8000 | 5 | 13.86362       |
filter off.
execute.
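When no suitable zero/one variable exists, one option is to compute a filter variable from a logical condition, which evaluates to one when the condition is true and to zero when it is false. The sketch below assumes that approach and reuses the condition from Example 5. The variable name filt is hypothetical.

```spss
get file 'c:\sset.sav'.
* filt is a hypothetical zero/one variable: 1 if the condition is true, 0 if false.
compute filt = (iv1 gt 1 and dv2 lt 40).
filter by filt.
means dv2 by iv2.
* Turn the filter off so that later procedures use all cases.
filter off.
```

Because the condition matches the one used with select if in Example 5, the means output should be based on the same cases, but no cases are deleted from the working file.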
Example 7: Subsetting to match percentage in sample to percentage in population
Suppose that you conducted a survey of 10000 people and 70% of your respondents were female (that is, 7000 females and 3000 males). You know that females make up only about 52% of the population, so you would like to take a subset of your female respondents such that the proportion of females to males in your data is more similar to that found in the population. First, you need to calculate how many female respondents you want to keep in your data set: if the 3000 males are to make up 48% of the subset, the number of females must be 3000 * 52 / 48 = 3250. Next, you would put the sample command in a do if structure, so that only the female respondents are sampled, to create the subset. Finally, you would save the file with a new name, so that your original data would be preserved.
do if gender = 'female'.
sample 3250 from 7000.
end if.
save outfile 'c:\subset.sav'.
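As a quick check on the result, you could reopen the new file and tabulate gender. This is only a sketch. It assumes the variable is named gender as in the syntax above and that the original file really contains 7000 females and 3000 males.

```spss
get file 'c:\subset.sav'.
* The percentage of females should be close to 52% (3250 of 6250 cases).
frequencies variables=gender.
```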
These pages are Copyrighted (c) by UCLA Academic Technology Services
The content of this web site should not be
construed as an endorsement of any particular web site, book, or software
product by the University of California.