Stata Learning Module Subsetting data

This module shows how you can subset data in Stata. You can subset data by keeping or dropping variables, and you can subset data by keeping or dropping observations. You can also subset data as you use a data file if you are trying to read a file that is too big to fit into the memory on your computer.

Keeping and dropping variables

Sometimes you do not want all of the variables in a data file. You can use the keep and drop commands to subset variables. If we think of your data like a spreadsheet, this section will show how you can remove columns (variables) from your data. Let's illustrate this with the auto data file.

sysuse auto

We can use the describe command to see its variables.

describe

obs:            74                          1978 Automobile Data
vars:            12                          13 Apr 2007 17:45
size:         3,478 (99.7% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign

Suppose we want to just have make mpg and price, we can keep just those variables, as shown below.

keep make mpg price

If we issue the describe command again, we see that indeed those are the only variables left.

describe

obs:            74                          1978 Automobile Data
vars:             3                          13 Apr 2007 17:45
size:         1,924 (99.8% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved

Remember, this has not changed the file on disk, but only the copy we have in memory. If we saved this file calling it auto, it would mean that we would replace the existing file (with all the variables) with this file which just has make, mpg and price. In effect, we would permanently lose all of the other variables in the data file. It is important to be careful when using the save command after you have eliminated variables, and it is recommended that you save such files to a file with a new name, e.g., save auto2. Let's show how to use the drop command to drop variables. First, let's clear out the data in memory and use the auto data file.

sysuse auto, clear

perhaps we are not interested in the variables displ and gear_ratio. We can get rid of them using the drop command shown below.

drop displ gear_ratio

Again, using describe shows that the variables have been eliminated.

describe

obs:            74                          1978 Automobile Data
vars:            10                          13 Apr 2007 17:45
size:         3,034 (99.7% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign
Note:  dataset has changed since last save

If we wanted to make this change permanent, we could save the file as auto2.dta as shown below.

save auto2

file auto2.dta saved 

Keeping and dropping observations

The above showed how to use keep and drop variables to eliminate variables from your data file. The keep if and drop if commands can be used to keep and drop observations. Thinking of your data like a spreadsheet, the keep if and drop if commands can be used to eliminate rows of your data. Let's illustrate this with the auto data. Let's use the auto file and clear out the data currently in memory.

sysuse auto , clear

The variable rep78 has values 1 to 5, and also has some missing values, as shown below.

tabulate rep78 , missing
    Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          2        2.70        2.70
2 |          8       10.81       13.51
3 |         30       40.54       54.05
4 |         18       24.32       78.38
5 |         11       14.86       93.24
. |          5        6.76      100.00
------------+-----------------------------------
Total |         74      100.00

We may want to eliminate the observations which have missing values using drop if as shown below. The portion after the drop if specifies which observations that should be eliminated.

drop if missing(rep78)

(5 observations deleted) 

Using the tabulate command again shows that these observations have been eliminated.

tabulate rep78 , missing

rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          2        2.90        2.90
2 |          8       11.59       14.49
3 |         30       43.48       57.97
4 |         18       26.09       84.06
5 |         11       15.94      100.00
------------+-----------------------------------
Total |         69      100.00 

We could make this change permanent by using the save command to save the file. Let's illustrate using keep if to eliminate observations. First let's clear out the current file and use the auto data file.

sysuse auto , clear

The keep if command can be used to eliminate observations, except that the part after the keep if specifies which observations should be kept. Suppose we want to keep just the cars which had a repair rating of 3 or less. The easiest way to do this would be using the keep if command, as shown below.

keep if (rep78 <= 3)

(34 observations deleted) 

The tabulate command shows that this was successful.

tabulate rep78, missing

rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          2        5.00        5.00
2 |          8       20.00       25.00
3 |         30       75.00      100.00
------------+-----------------------------------
Total |         40      100.00 

Before we go on to the next section, let's clear out the data that is currently in memory.

clear

Selecting variables and observations with "use"

The above sections showed how to use keep, drop, keep if, and drop if for eliminating variables and observations. Sometimes, you may want to use a data file which is bigger than you can fit into memory and you would wish to eliminate variables and/or observations as you use the file. This is illustrated below with the auto data file. Selecting variables. You can specify just the variables you wish to bring in on the use command. For example, let's use the auto data file with just make price and mpg.

use make price mpg using http://www.stata-press.com/data/r10/auto

The describe command shows us that this worked.

describe

Contains data from http://www.stata-press.com/data/r10/auto.dta
obs:            74                          1978 Automobile Data
vars:             3                          13 Apr 2007 17:45
size:         1,924 (99.8% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
-------------------------------------------------------------------------------
Sorted by:

Let's clear out the data before the next example.

clear

Suppose we want to just bring in the observations where rep78 is 3 or less. We can do this as shown below.

use http://www.stata-press.com/data/r10/auto if (rep78 <= 3)

We can use tabulate to double check that this worked.

tabulate rep78, missing

rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          2        5.00        5.00
2 |          8       20.00       25.00
3 |         30       75.00      100.00
------------+-----------------------------------
Total |         40      100.00 

Let's clear out the data before the next example.

clear

Let's show another example. Lets read in just the cars that had a rating of 4 or higher.

use http://www.stata-press.com/data/r10/auto if (rep78 >= 4) & (rep78 <.)

Let's check this using the tabulate command.

tabulate rep78, missing

rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
4 |         18       62.07       62.07
5 |         11       37.93      100.00
------------+-----------------------------------
Total |         29      100.00 

Let's clear out the data before the next example.

clear

You can both eliminate variables and observations with the use command. Let's read in just make mpg price and rep78 for the cars with a repair record of 3 or lower.

use make mpg price rep78 if (rep78 <= 3) using http://www.stata-press.com/data/r10/auto

Let's check this using describe and tabulate.

describe

Contains data from http://www.stata-press.com/data/r10/auto.dta
obs:            40                          1978 Automobile Data
vars:             4                          13 Apr 2007 17:45
size:         1,120 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------------
Sorted by: 


tabulate rep78

rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          2        5.00        5.00
2 |          8       20.00       25.00
3 |         30       75.00      100.00
------------+-----------------------------------
Total |         40      100.00 

Let's clear out the data before the next example.

clear

Note that the ordering of if and using is arbitrary.

use make mpg price rep78 using http://www.stata-press.com/data/r10/auto if (rep78 <= 3)

Let's check this using describe and tabulate.

describe

Contains data from http://www.stata-press.com/data/r10/auto.dta
obs:            40                          1978 Automobile Data
vars:             4                          13 Apr 2007 17:45
size:         1,120 (99.9% of memory free)   (_dta has notes)
-------------------------------------------------------------------------------
storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
-------------------------------------------------------------------------------
Sorted by:
tabulate rep78

rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
1 |          2        5.00        5.00
2 |          8       20.00       25.00
3 |         30       75.00      100.00
------------+-----------------------------------
Total |         40      100.00 

Have a look at this command. Do you think it will work?

use make mpg if (rep78 <= 3) using http://www.stata-press.com/data/r10/auto
r(111); 

You see, rep78 was not one of the variables read in, so it could not be used in the if portion. To use a variable in the if portion, it has to be one of the variables that is read in.

Summary

Using keep/drop to eliminate variables
keep make price mpg

drop displ gear_ratio

Using keep if/drop if to eliminate observations
drop if missing(rep78)

keep if (rep78 <= 3)

Eliminating variables and/or observations with use
use make mpg price rep78 using auto

use auto if (rep78 <= 3)

use make mpg price rep78 using auto if (rep78 <= 3)

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.