SAS Learning Module
Reshaping data long to wide using the data step

There are several ways to reshape data from a long to a wide format in SAS.  For example, you can reshape your data using proc transpose or reshaping the data in a data step. The following will illustrate how to reshape data from long to wide using the data step.  

Example 1:  Reshaping one variable

We will begin with a small data set with only one variable to be reshaped.  We will use the variables year and faminc (for family income) to create three new variables:  faminc96, faminc97 and faminc98.  First, let's look at the data set and use proc print to display it.

DATA long ; 
  INPUT famid year faminc ; 
CARDS ; 
1 96 40000 
1 97 40500 
1 98 41000 
2 96 45000 
2 97 45400 
2 98 45800 
3 96 75000 
3 97 76000 
3 98 77000 
; 
RUN ;
PROC PRINT DATA=long ;
RUN ;
Obs    famid    year    faminc

 1       1       96      40000
 2       1       97      40500
 3       1       98      41000
 4       2       96      45000
 5       2       97      45400
 6       2       98      45800
 7       3       96      75000
 8       3       97      76000
 9       3       98      77000

Now let's look at the program.  The first step in the reshaping process is sorting the data (using proc sort) on an identification variable (famid) and saving the sorted data set (longsort).  Next we write a data step to do the actual reshaping.  We will explain each of the statements in the data step in order.  

PROC SORT DATA=long OUT=longsort ;
  BY famid ;
RUN ;

DATA wide1 ;
  SET longsort ;
  BY famid ;

  KEEP famid faminc96 -faminc98 ;
  RETAIN faminc96 - faminc98 ;

  ARRAY afaminc(96:98) faminc96 - faminc98 ;

  IF first.famid THEN
  DO;
    DO i = 96 to 98 ;
      afaminc( i ) = . ;
    END;
  END;

  afaminc( year ) = faminc ;

  IF last.famid THEN OUTPUT ;

RUN;

The new data set is named wide1 on the data statement.  The old set, longsort, upon which the new data set will based, is named on the set statement.  Please note that you must use the sorted version of the old set in order for the reshaping to work properly.  The identification variable (famid) is used on the by statement.  Please note that this must be the same identification variable on which the data were sorted.  If the by statement is omitted, then the if-then-do statements (which occur later in the data step) will not work properly, and SAS will issue an error message.  

Next, a keep statement is used to keep the desired variables in the new data set.  We will keep the identification variable (famid) and the three variables we are creating, faminc96, faminc97 and faminc98.  Any variable that is not listed on the keep statement will not be present in the new data set.  A retain statement is used to tell SAS to retain the current values of the variables listed.  If the retain statement is omitted, only the values for faminc98 are placed in the new data set, while all of the data for faminc96 and faminc97 will be missing.  

An array statement is used to define the variables faminc96, faminc97 and faminc98.  In our example, we have named the array afaminc.  In parenthesis we have given the first and last value of the array separated by a colon (:).  The new variable names are then listed.  We can list the first and last new variable names separated by a dash because the names are numerically consecutive.  This is especially convenient when there are a large number of variables being defined by the array.  

We use an if-then-do statement to set the conditions for a do-loop.  The by statement that we used above not only caused SAS to process the data in the groups defined by the variable (famid) given on the by statement, it also caused SAS to create two temporary variables:  first.famid and last.famid.  Temporary variables are variables that you can use during a data step but do not appear in the new data set.  The value of first.famid is always zero except when SAS is processing the first row of data for a given value of famid, then first.famid is one.  In other words, first.famid is an indicator variable that is one when it is true that SAS is processing the first row of data for a given value of famid, and zero when it is not.  The variable last.famid is created with the same logic, except that it equals one when SAS is processing the last row of data for a given value of famid and zero otherwise.  You can now see why the data needed to be sorted on famid before we began the data step.  

In the do-loop, we set i (you can use any name that you like) equal to the range we used in the array statement.  We then set the array with i as the index variable equal to missing, which in SAS is a dot (.).  As SAS processes the data through this loop, the missing values will be replaced with the data.  If some of the data are missing, the missing value will not be altered.  Please note that you can set the array equal to any value and that value will appear in any place that there is missing data.  

After ending the do-loop and the if-then-do statement, we set the array with year as the index variable equal to faminc, the variable in the old data set.  (Please note that you must have an end statement for each do.)  It is at this point that the data in the variable faminc are associated with the new variables.  

Finally, we use an if-then statement to have SAS output the data to the new data set each time it encounters the last occurrence of each value of famid.  If this statement is omitted, SAS will output the results of its processing after each cycle of the loop, and we will end up with nine records instead of three.  At long last, it is time to look at the new data set with a proc print to ensure that the reshaping went as desired, and we can see that it did.

PROC PRINT DATA=wide1;
RUN;
Obs    famid    faminc96    faminc97    faminc98

 1       1        40000       40500       41000
 2       2        45000       45400       45800
 3       3        75000       76000       77000

Example 2:  Reshaping two variables

This example is very similar to the last example, except that in this example we will be reshaping two variables, faminc and spend.  

DATA long2; 
  INPUT famid year faminc spend ; 
CARDS; 
1 96 40000 38000 
1 97 40500 39000 
1 98 41000 40000 
2 96 45000 42000 
2 97 45400 43000 
2 98 45800 44000 
3 96 75000 70000 
3 97 76000 71000 
3 98 77000 72000 
; 
RUN ;

We will modify the previous data step by adding the new variable to two statements that were present in the previous example and adding three new statements.  In reshaping the data from long to wide, we will create six new variables:  faminc96, faminc97, faminc98, spend96, spend97 and spend98.  The six variables are listed on both the keep and the retain statements.  Following the same logic as in the previous example, we include an array statement to define the variables spend96, spend97 and spend98.  We have named the array aspend.  We have also included this array in the do-loop.  The third added statement sets aspend with year as the index variable equal to spend.  As before, we run a proc print on the new data set to ensure that the reshaping was done correctly.

PROC SORT DATA=long2 OUT=longsrt2 ;
  BY famid ;
RUN ;

DATA wide2 ;
  SET longsrt2 ;
  BY famid ;

  KEEP famid faminc96-faminc98 spend96-spend98 ;
  RETAIN faminc96-faminc98 spend96-spend98 ;

  ARRAY afaminc(96:98) faminc96-faminc98 ;
  ARRAY aspend(96:98) spend96-spend98 ;

  IF first.famid THEN
  DO;
    DO i = 96 to 98 ;
      afaminc( i ) = 0 ;
      aspend( i ) = 0 ;
    END;
  END;

  afaminc( year ) = faminc ;
  aspend( year ) = spend ;

  IF last.famid THEN OUTPUT ;

RUN;

PROC PRINT DATA=wide2;
RUN;
Obs   famid   faminc96   faminc97   faminc98   spend96   spend97   spend98

 1      1       40000      40500      41000     38000     39000     40000
 2      2       45000      45400      45800     42000     43000     44000
 3      3       75000      76000      77000     70000     71000     72000

Please note that you can reshape as many variables as you want in a single data step.  All you need to do is list the variables on both the keep and retain statements and add an array statement, include the array in the do-loop and add a statement setting the array with the proper index variable equal to the desired variable in the old data set.

Example 3:  Two variables that identify the wide record

In this example, we have two variables that, when taken together, identify the wide record.  In the data set below, the variables famid and birth together uniquely identify each wide record.  (Please note that the data shown below are in long format, so famid and birth do not uniquely identify each record.)  We will reshape one variable, ht (for height) from long to wide format.  

DATA long3; 
  INPUT famid birth age ht ; 
CARDS; 
1 1 1 2.8 
1 1 2 3.4 
1 2 1 2.9 
1 2 2 3.8 
1 3 1 2.2 
1 3 2 2.9 
2 1 1 2.0 
2 1 2 3.2 
2 2 1 1.8 
2 2 2 2.8 
2 3 1 1.9 
2 3 2 2.4 
3 1 1 2.2 
3 1 2 3.3 
3 2 1 2.3 
3 2 2 3.4 
3 3 1 2.1 
3 3 2 2.9 
; 
RUN; 

You will notice that the program used to reshape these data is very similar to the program used in example 1.  Clearly, the variable names are different.  The other important difference is that the data are sorted on both famid and birth in the proc sort, and both famid and birth are given in the by statement.  

PROC SORT DATA=long3 OUT=longsrt3 ;
  BY famid birth ;
RUN ;

DATA wide3 ;
  SET longsrt3 ;
  BY famid birth ;

  KEEP famid ht1-ht2 ;
  RETAIN ht1-ht2 ;

  ARRAY aht(1:2) ht1-ht2 ;

  IF first.birth THEN
  DO;
    DO i = 1 to 2 ;
      aht( i ) = 0 ;
    END;
  END;

  aht( age ) = ht ;

  IF last.birth THEN OUTPUT ;

RUN;

We will run a proc print on the new data set to ensure that the reshaping went as expected.

PROC PRINT DATA=wide3;
RUN;

Example 4:  A more realistic example

This example is very much like the first example, except that the variable names are different.  In this example, we will reshape the variable inc (for income).  This is a more realistic example of reshaping data from long to wide in that there are many more records (300, instead of nine) and six time points (instead of three).  Despite being a much larger data set, you will notice that the program to reshape it is almost identical to the program used to reshape the tiny data set in example 1.

data long4; 
  input id year inc ; 
cards; 
 1 90 66483 
 1 91 69146 
 1 92 74643 
 1 93 79783 
 1 94 81710 
 1 95 86143 
 2 90 17510 
 2 91 17947 
 2 92 19484 
 2 93 20979 
 2 94 21268 
 2 95 22998 
 3 90 57947 
 3 91 62964 
 3 92 68717 
 3 93 70957 
 3 94 75198 
 3 95 75722 
 4 90 64831 
 4 91 71060 
 4 92 71918 
 4 93 72514 
 4 94 73100 
 4 95 74379 
 5 90 18904 
 5 91 19949 
 5 92 21335 
 5 93 22237 
 5 94 23829 
 5 95 23913 
 6 90 32057 
 6 91 34770 
 6 92 35834 
 6 93 37387 
 6 94 40899 
 6 95 42372 
 7 90 60551 
 7 91 64869 
 7 92 67983 
 7 93 70498 
 7 94 71253 
 7 95 75177 
 8 90 16553 
 8 91 18189 
 8 92 18349 
 8 93 19815 
 8 94 21739 
 8 95 22980 
 9 90 32611 
 9 91 33465 
 9 92 35961 
 9 93 36416 
 9 94 37183 
 9 95 40627 
10 90 61379 
10 91 66002 
10 92 67936 
10 93 70513 
10 94 74405 
10 95 76009 
11 90 24065 
11 91 24229 
11 92 25709 
11 93 26121 
11 94 26617 
11 95 28142 
12 90 32975 
12 91 36185 
12 92 37601 
12 93 41336 
12 94 43399 
12 95 43670 
13 90 69548 
13 91 71341 
13 92 72455 
13 93 76552 
13 94 80538 
13 95 85330 
14 90 50274 
14 91 53349 
14 92 55900 
14 93 59375 
14 94 61216 
14 95 63911 
15 90 72011 
15 91 73334 
15 92 76248 
15 93 77724 
15 94 78638 
15 95 80582 
16 90 18911 
16 91 20046 
16 92 21343 
16 93 21630 
16 94 22330 
16 95 23081 
17 90 68841 
17 91 75410 
17 92 80806 
17 93 81327 
17 94 81571 
17 95 86499 
18 90 28099 
18 91 30716 
18 92 32986 
18 93 36097 
18 94 39124 
18 95 39866 
19 90 17302 
19 91 18778 
19 92 18872 
19 93 19884 
19 94 20665 
19 95 21855 
20 90 16291 
20 91 16674 
20 92 16770 
20 93 17182 
20 94 17979 
20 95 18917 
21 90 43244 
21 91 46545 
21 92 47633 
21 93 50744 
21 94 54734 
21 95 59075 
22 90 56393 
22 91 59120 
22 92 60801 
22 93 61404 
22 94 63111 
22 95 69278 
23 90 47347 
23 91 49571 
23 92 50101 
23 93 51345 
23 94 56463 
23 95 56927 
24 90 16076 
24 91 17217 
24 92 17296 
24 93 17900 
24 94 18171 
24 95 18366 
25 90 65906 
25 91 69679 
25 92 76131 
25 93 77676 
25 94 81980 
25 95 85426 
26 90 58586 
26 91 61188 
26 92 66542 
26 93 69267 
26 94 71063 
26 95 74549 
27 90 61674 
27 91 66584 
27 92 69185 
27 93 75193 
27 94 78647 
27 95 81898 
28 90 31673 
28 91 31883 
28 92 32774 
28 93 34485 
28 94 36929 
28 95 39751 
29 90 63412 
29 91 67593 
29 92 69911 
29 93 73092 
29 94 80105 
29 95 81840 
30 90 27684 
30 91 28439 
30 92 30861 
30 93 31406 
30 94 32960 
30 95 35530 
31 90 71873 
31 91 76449 
31 92 80848 
31 93 88691 
31 94 94149 
31 95 97431 
32 90 62177 
32 91 63812 
32 92 64235 
32 93 65703 
32 94 69985 
32 95 71136 
33 90 37684 
33 91 38258 
33 92 39208 
33 93 39489 
33 94 39745 
33 95 41236 
34 90 64013 
34 91 66398 
34 92 71877 
34 93 75610 
34 94 76395 
34 95 79644 
35 90 16011 
35 91 16847 
35 92 17746 
35 93 19123 
35 94 19183 
35 95 19996 
36 90 49215 
36 91 52195 
36 92 52343 
36 93 56365 
36 94 58752 
36 95 59354 
37 90 15774 
37 91 16643 
37 92 17605 
37 93 18781 
37 94 18996 
37 95 19685 
38 90 29106 
38 91 31693 
38 92 31852 
38 93 34505 
38 94 35806 
38 95 36179 
39 90 25147 
39 91 26923 
39 92 28785 
39 93 30987 
39 94 34036 
39 95 34106 
40 90 71978 
40 91 79144 
40 92 80453 
40 93 86580 
40 94 95164 
40 95 96155 
41 90 46166 
41 91 47579 
41 92 49455 
41 93 53849 
41 94 56630 
41 95 57473 
42 90 55810 
42 91 59443 
42 92 65291 
42 93 66065 
42 94 69009 
42 95 74365 
43 90 49642 
43 91 50603 
43 92 53917 
43 93 54858 
43 94 58470 
43 95 59767 
44 90 21348 
44 91 22361 
44 92 23412 
44 93 24038 
44 94 24774 
44 95 25828 
45 90 44361 
45 91 48720 
45 92 51356 
45 93 54927 
45 94 56670 
45 95 58800 
46 90 56509 
46 91 60517 
46 92 61532 
46 93 65077 
46 94 69594 
46 95 73089 
47 90 39097 
47 91 40293 
47 92 43237 
47 93 44809 
47 94 48782 
47 95 53091 
48 90 18685 
48 91 19405 
48 92 20165 
48 93 20316 
48 94 22197 
48 95 23557 
49 90 73103 
49 91 76243 
49 92 76778 
49 93 82734 
49 94 86279 
49 95 86784 
50 90 48129 
50 91 49267 
50 92 53799 
50 93 58768 
50 94 63011 
50 95 66461 
; 
run; 

PROC SORT DATA=long4 OUT=longsrt4 ;
  BY id ;
RUN ;

DATA wide4 ;
  SET longsrt4 ;
  BY id ;

  KEEP id inc90-inc95 ;
  RETAIN inc90-inc95 ;

  ARRAY ainc(90:95) inc90-inc95 ;

  IF first.id THEN
  DO;
    DO i = 90 to 95 ;
      ainc( i ) = 0 ;
    END;
  END;

  ainc( year ) = inc ;

  IF last.id THEN OUTPUT ;

RUN;

Because of the size of the data set, we will limit the proc print to the first five observations in the new data set to save space.  Please note that the first five observations in the wide data set consist of the first 25 observations in the long data set.  We will also run a proc means both data sets as well as our usual proc print to ensure that the reshaping was successful.

PROC PRINT DATA=wide4(obs=5) ;
RUN;
Obs    id    inc90    inc91    inc92    inc93    inc94    inc95

  1     1    66483    69146    74643    79783    81710    86143
  2     2    17510    17947    19484    20979    21268    22998
  3     3    57947    62964    68717    70957    75198    75722
  4     4    64831    71060    71918    72514    73100    74379
  5     5    18904    19949    21335    22237    23829    23913

PROC PRINT DATA=long4(obs=25) ;
RUN;
Obs    id    year     inc

  1     1     90     66483
  2     1     91     69146
  3     1     92     74643
  4     1     93     79783
  5     1     94     81710
  6     1     95     86143
  7     2     90     17510
  8     2     91     17947
  9     2     92     19484
 10     2     93     20979
 11     2     94     21268
 12     2     95     22998
 13     3     90     57947
 14     3     91     62964
 15     3     92     68717
 16     3     93     70957
 17     3     94     75198
 18     3     95     75722
 19     4     90     64831
 20     4     91     71060
 21     4     92     71918
 22     4     93     72514
 23     4     94     73100
 24     4     95     74379
 25     5     90     18904

PROC MEANS DATA=wide4 ;
  VAR inc90-inc95 ;
RUN;
The MEANS Procedure
Variable     N            Mean         Std Dev         Minimum         Maximum
------------------------------------------------------------------------------
inc90       50        43899.32        19523.39        15774.00        73103.00
inc91       50        46380.70        20749.43        16643.00        79144.00
inc92       50        48519.58        21720.12        16770.00        80848.00
inc93       50        50842.28        22780.12        17182.00        88691.00
inc94       50        53289.02        23824.01        17979.00        95164.00
inc95       50        55379.00        24592.83        18366.00        97431.00
------------------------------------------------------------------------------

PROC MEANS DATA=long4 ;
  CLASS year ;
  VAR inc ;
RUN;
The MEANS Procedure
                           Analysis Variable : inc
                N
        year  Obs    N          Mean       Std Dev       Minimum       Maximum
------------------------------------------------------------------------------
          90   50   50      43899.32      19523.39      15774.00      73103.00

          91   50   50      46380.70      20749.43      16643.00      79144.00

          92   50   50      48519.58      21720.12      16770.00      80848.00

          93   50   50      50842.28      22780.12      17182.00      88691.00

          94   50   50      53289.02      23824.01      17979.00      95164.00

          95   50   50      55379.00      24592.83      18366.00      97431.00
------------------------------------------------------------------------------

Example 5:  Reshaping with string (character) variables

This example is similar to example 2 except that we have a third variable to reshape, and this new variable, debt, is a string (i.e., character) variable.  There are only two minor differences between adding a numeric variable to our "template" program and adding a string variable.  The first is in the array statement for the string variable.  Before listing the names of the variables to be created, you need to include a dollar sign ($) to tell SAS to create a string variable and the number of variable to be created.  In our example, we are creating three variables.  The second difference is that when setting the string variable equal to missing in the do-loop, you use open-quote close-quote to indicate a null string instead of setting it equal to a dot.  A dot is a missing value only for a numeric variable; a null string is a missing value for a string variable.

data long5; 
  length debt $ 3; 
  input famid year faminc spend debt $ ; 
cards; 
1 96 40000 38000 yes 
1 97 40500 39000 yes 
1 98 41000 40000 no 
2 96 45000 42000 yes 
2 97 45400 43000 no 
2 98 45800 44000 no 
3 96 75000 70000 no 
3 97 76000 71000 no 
3 98 77000 72000 no 
; 
run; 

PROC SORT DATA=long5 OUT=long5srt ;
  BY famid ;
RUN; 

DATA wide5 ;
  SET long5srt ;
  BY famid ;

  KEEP famid faminc96-faminc98 spend96-spend98 debt96-debt98 ;
  RETAIN faminc96-faminc98 spend96-spend98 debt96-debt98 ;

  ARRAY afaminc(96:98) faminc96-faminc98 ;
  ARRAY aspend(96:98) spend96-spend98 ;
  ARRAY adebt(96:98) $ 3 debt96-debt98 ;  

  IF first.famid THEN
  DO;
    DO i = 96 to 98 ;
      afaminc( i ) = 0 ;
      aspend( i ) = 0 ;
      adebt( i ) = " " ;
    END;
  END;

  afaminc( year ) = faminc ;
  aspend( year ) = spend ;
  adebt( year ) = debt ;

  IF last.famid THEN OUTPUT ;

RUN;

We will run a proc print on the new data set to ensure that the reshaping was successful.

PROC PRINT DATA=wide5 ;
RUN;  

            f        f        f
            a        a        a        s        s        s
            m        m        m        p        p        p       d      d     d
     f      i        i        i        e        e        e       e      e     e
     a      n        n        n        n        n        n       b      b     b
O    m      c        c        c        d        d        d       t      t     t
b    i      9        9        9        9        9        9       9      9     9
s    d      6        7        8        6        7        8       6      7     8

1    1    40000    40500    41000    38000    39000    40000    yes    yes    no
2    2    45000    45400    45800    42000    43000    44000    yes    no     no
3    3    75000    76000    77000    70000    71000    72000    no     no     no

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.