Stata Learning Modules
Collapsing data across observations

Sometimes you have data files that need to be collapsed to be useful to you. For example, you might have student data but you really want classroom data, or you might have weekly data but you want monthly data, etc. We will illustrate this using an example showing how you can collapse data across kids to make family level data.

Here is a file containing information about the kids in three families. There is one record per kid. Birth is the order of birth (i.e., 1 is first), age wt and sex are the child's age, weight and sex. We will use this file for showing how to collapse data across observations.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear 
list 
          famid    kidname      birth        age         wt        sex 
  1.         1       Beth          1          9         60          f  
  2.         1        Bob          2          6         40          m  
  3.         1       Barb          3          3         20          f  
  4.         2       Andy          1          8         80          m  
  5.         2         Al          2          6         50          m  
  6.         2        Ann          3          2         20          f  
  7.         3       Pete          1          6         60          m  
  8.         3        Pam          2          4         40          f  
  9.         3       Phil          3          2         20          m   

Consider the collapse command below. It collapses across all of the observations to make a single record with the average age of the kids.

collapse age 
list 
           age 
  1.  5.111111   

The above collapse command was not very useful, but you can combine it with the by(famid) option, and then it creates one record for each family that contains the average age of the kids in the family.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear 
collapse age, by(famid) 
list 
         famid        age 
  1.         1          6  
  2.         2   5.333333  
  3.         3          4   

The following collapse command does the exact same thing as above, except that the average of age is named avgage and we have explicitly told the collapse command that we want it to compute the mean.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear 
collapse (mean) avgage=age, by(famid) 
list 
         famid     avgage 
  1.         1          6  
  2.         2   5.333333  
  3.         3          4  

We can request averages for more than one variable. Here we get the average for age and for wt all in the same command.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear 
collapse (mean) avgage=age avgwt=wt, by(famid) 
list 
         famid     avgage      avgwt 
  1.         1          6         40  
  2.         2   5.333333         50  
  3.         3          4         40   

This command gets the average of age and wt like the command above, and also computes numkids which is the count of the number of kids in each family (obtained by counting the number of observations with valid values of birth).

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear  
collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid) 
list 
         famid     avgage      avgwt    numkids 
  1.         1          6         40          3  
  2.         2   5.333333         50          3  
  3.         3          4         40          3   

Suppose you wanted a count of the number of boys and girls in the family. We can do that with one extra step. We will create a dummy variable that is 1 if the kid is a boy (0 if not), and a dummy variable that is 1 if the kid is a girl (and 0 if not). The sum of the boy dummy variable is the number of boys and the sum of the girl dummy variable is the number of girls.

First, let's use the kids file (and clear out the existing data).

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear  

We use tabulate with the generate option to make the dummy variables.

tabulate sex, generate(sexdum) 
        sex |      Freq.     Percent        Cum.
------------+-----------------------------------
          f |          4       44.44       44.44
          m |          5       55.56      100.00
------------+-----------------------------------
      Total |          9      100.00 

We can look at the dummy variables. Sexdum1 is the dummy variable for girls. Sexdum2 is the dummy variable for boys. The sum of sexdum1 is the number of girls in the family. The sum of sexdum2 is the number of boys in the family.

list famid sex sexdum1 sexdum2 
          famid        sex   sexdum1   sexdum2 
  1.         1          f         1         0  
  2.         1          m         0         1  
  3.         1          f         1         0  
  4.         2          m         0         1  
  5.         2          m         0         1  
  6.         2          f         1         0  
  7.         3          m         0         1  
  8.         3          f         1         0  
  9.         3          m         0         1   

The command below creates girls which is the number of girls in the family, and boys which is the number of boys in the family.

collapse (count) numkids=birth (sum) girls=sexdum1 boys=sexdum2, by(famid) 

We can list out the data to confirm that it worked correctly.

list famid boys girls numkids 
         famid      boys     girls    numkids 
  1.         1         1         2          3  
  2.         2         2         1          3  
  3.         3         2         1          3   

Summary

To create one record per family (famid) with the average of age within each family.

collapse age, by(famid)

To create one record per family (famid) with the average of age (called avgage) and average weight (called avgwt) within each family.

collapse (mean) avgage=age avgwt=wt,  by(famid)

Same as above example, but also counts the number of kids within each family calling that numkids.

collapse (mean) avgage=age  avgwt=wt (count) numkids=birth, by(famid)

Counts the number of boys and girls in each family by using tabulate to create dummy variables based on sex and then summing the dummy variables within each family.

tabulate sex, generate(sexdum)
collapse (sum) girls=sexdum1 boys=sexdum2, by(famid) 

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.