SAS Learning Module
Collapsing across observations, intermediate

1. Introduction

This module will illustrate how to collapse across variables. First, let's read in a sample dataset named kids which includes the variables famid (family id) and wt (kids weight in pounds).

DATA kids;
  LENGTH kidname $ 4 sex $ 1;
  INPUT famid kidname birth age wt sex ;
CARDS;
1 Beth 1  9  60  f
1 Bob  2  6  40  m
1 Barb 3  3  20  f
2 Andy 1  8  80  m
2 Al   2  6  50  m
2 Ann  3  2  20  f
3 Pete 1  6  60  m
3 Pam  2  4  40  f
3 Phil 3  2  20  m
4 Sam  1 11 100  m
4 Stu  2  8  90  m
;
RUN;
 
PROC PRINT DATA=kids;
RUN;

The output is shown below.

OBS    KIDNAME    SEX    FAMID    BIRTH    AGE     WT
  1     Beth       f       1        1        9     60
  2     Bob        m       1        2        6     40
  3     Barb       f       1        3        3     20
  4     Andy       m       2        1        8     80
  5     Al         m       2        2        6     50
  6     Ann        f       2        3        2     20
  7     Pete       m       3        1        6     60
  8     Pam        f       3        2        4     40
  9     Phil       m       3        3        2     20
 10     Sam        m       4        1       11    100
 11     Stu        m       4        2        8     90

2. Collapsing and computing average weights using proc means

Next, by using proc means, one can create a variable that represents the sum of ALL the weights of each person within a family, a variable that represents the average weight of each person within a family, and a variable that counts the number of people within a family. This can be seen in the example below, where three new variables, sumwt, meanwt and cnt, are created by famid, and then written to the new dataset fam1.

PROC MEANS DATA=kids NWAY ;
  CLASS famid ;
  VAR wt ;
  OUTPUT OUT=fam1 SUM=sumwt MEAN=meanwt N=cnt ;
RUN;
 
PROC PRINT DATA=fam1;
  VAR famid sumwt meanwt cnt;
RUN;

The output is shown below.

Analysis Variable : WT


       FAMID  N Obs   N          Mean       Std Dev       Minimum       Maximum
-------------------------------------------------------------------------------
           1      3   3    40.0000000    20.0000000    20.0000000    60.0000000

           2      3   3    50.0000000    30.0000000    20.0000000    80.0000000

           3      3   3    40.0000000    20.0000000    20.0000000    60.0000000

           4      2   2    95.0000000     7.0710678    90.0000000   100.0000000
-------------------------------------------------------------------------------
OBS    FAMID    SUMWT    MEANWT    CNT

 1       1       120       40       3
 2       2       150       50       3
 3       3       120       40       3
 4       4       190       95       2

3. Collapsing and computing average weights manually (collapsing across observations)

Of course, collapsing can always be done manually within a data step. This, however, requires a bit more complex SAS programming. To create sum, mean, and N (sample size) variables that summarize values within a group (e.g., families), one can count over observations within a group by using a retained variable and a counter. In the example below, retained counter variables are created that count across observations within families until the last record within a family is encountered. (This is possible because a retained variable allows the value for the last observation to be available for use when accessing the current observation.) Then, the retained variable from the last observation within each family is written to the new SAS dataset fam2. At the final step, only the variables famid, sumwt, meanwt, and cnt are kept in the dataset fam2. Note that the variable meanwt does NOT need to be retained. This is because at each step, it is simply a function of the retained variables sumwt and cnt.

PROC SORT DATA=kids OUT=sortkids ;
  BY famid ;
RUN ;
 
DATA fam2 ;
 SET sortkids ; 
 
  BY famid ;
 
  RETAIN sumwt cnt;
 
  IF first.famid THEN
   DO;
     sumwt = 0;
     cnt   = 0;
   END;
 
  sumwt = sumwt + wt ;
  cnt = cnt + 1;
 
  meanwt=sumwt/cnt;
 
  /* this outputs a record ONLY when at the last obs in a family*/ ;
 
  IF last.famid THEN OUTPUT;
 
  KEEP famid sumwt meanwt cnt ;
 
RUN;
 
PROC PRINT DATA=fam2 ;
RUN;

The output is shown below.

OBS    FAMID    SUMWT    CNT    MEANWT
 1       1       120      3       40
 2       2       150      3       50
 3       3       120      3       40
 4       4       190      2       95

4. Computing sums, counts and other summary information

The above example illustrated how one can compute sums, means, and counts within groups using the retain statement within a data step. Other variables, such as dummy or flag variables, can also be computed using the retain statement. For example, say a study is interested in (1) the number of boys in each family, (2) whether or not there is a girl in the family and (3) if any of the children in each family are over 85 pounds in weight. All of this information can be collected and stored using the retain statement. The example below works similarly to the example above; however, this example additionally creates a variable numboys, which counts the number of boys in each family, and the flag variables hasgirl and over85, which take on the values of '1' or '0', depending on whether or not there is a girl in the family, or if a family has a child over 85 pounds, respectively.

PROC SORT DATA=kids OUT=sortkids ;
  BY famid ;
RUN ;
 
DATA fam3 ;
  SET sortkids ;
  BY famid ;
  RETAIN sumwt cnt numboys hasgirl over85 ;
  IF first.famid THEN
  DO;
    sumwt  = 0;  /* sum of weights for family */ ;
    cnt    = 0;  /* count of kids in family */;
    numboys= 0;  /* number of boys in family */;
    hasgirl= 0;  /* 1 if family has girl, 0 if no girl */;
    over85 = 0;  /* 1 if family has child with wt over 85, 0 if not */;
  END;
 
  sumwt = sumwt + wt ;
  cnt = cnt + 1;
  IF (sex = 'm') THEN numboys = numboys + 1 ;
  IF (sex = 'f') THEN hasgirl = 1 ;
  IF (wt > 85) THEN over85 =  1 ;
 
   /* this outputs a record ONLY when at the last obs in a family */;
  
  IF last.famid THEN 
   DO;
    meanwt = sumwt / cnt ; /* do any final computations before outputting record */;
    OUTPUT;
   END;
 
  KEEP famid sumwt cnt numboys hasgirl over85 meanwt ;
 
RUN;
 
PROC PRINT DATA=fam3 ;
RUN;

The output is shown below.

OBS    FAMID    SUMWT    CNT    NUMBOYS    HASGIRL    OVER85    MEANWT

 1       1       120      3        1          1          0        40
 2       2       150      3        2          1          0        50
 3       3       120      3        2          1          0        40
 4       4       190      2        2          0          1        95

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.