SAS Library
Data Transformations and Data Manipulation in SAS


This page was adapted from a page created by Professor Oliver Schabenberger .  We thank Professor Schabenberger for permission to adapt and distribute this page via our web site.


1. Making a copy of a SAS data set
2. Creating new variables
   2.1. Transformations
   2.2. Operators
   2.3. Algebra with logical expressions
3.
Dropping variables from a data set
4. Dropping observations from a data set (Subsetting data)
5. Setting and merging multiple data sets
6. Sorting your data
   6.1. By-processing in PROC steps
   6.2. By-merging of sorted data sets (sorted matching)
7.
Formatting
   7.1. Labels
   7.2. Formats
   7.3. SAS Dates
8.
Advanced data manipulation
   8.1. Renaming of variables
   8.2. Retaining of variables
   8.3. DO Blocks and DO Loops
   8.4. IF .. THEN .. ELSE statements
   8.5. SELECT case distinction
   8.6. Arrays
   8.7. Lagged variables
   8.8. Generating multiple data sets in a DATA step
   8.9. Converting character variables to numeric variables


1. Making a copy of a SAS data set

data mydata;
   set mydata;
  < more statements>
run;

data newdata;
  set mydata;
  <more statements>
run;

2. Creating new variables
2.1. Transformations

data survey; set survey;
  x      = ranuni(123);  /* A uniform(0,1) random number   */
  lny    = log(inc);     /* The natural logarithm (base e) */
  logy   = log10(inc);   /* The log to base 10              */
  rooty  = sqrt(inc);    /* The square root                 */
  expy   = exp(inc/10);  /* The exponential function        */
  cos    = cos(x);       /* The cosine function            */
  sin    = sin(x);       /* The sine function              */
  tan    = tan(x);       /* The tangent function           */
  cosz   = cos(z);
  z      = ranuni(0);
run;

OBS  ID  SEX  AGE  INC  R1  R2  R3      X       LNY       LOGY    ROOTY

  1   1   F    35   17    7   2   2  0.75040  2.83321  1.23045   4.12311
  2  17   M    50   14   5    5   3  0.17839  2.63906  1.14613  3.74166
  3  33   F    45    6   7    2   7  0.35712  1.79176  0.77815  2.44949
  4  49   M    24   14   7    5   7  0.78644  2.63906  1.14613  3.74166
  5  65   F    52    9   4    7   7  0.12467  2.19722  0.95424  3.00000
  6  81   M    44   11   7    7   7  0.77618  2.39790  1.04139  3.31662
  7   2   F    34   17   6    5   3  0.96750  2.83321  1.23045  4.12311
  8  18   M    40   14   7    5   2  0.71393  2.63906  1.14613  3.74166
  9  34   F    47    6   6    5   6  0.53125  1.79176  0.77815  2.44949
 10  50   M    35   17   5    7   5  0.14208  2.83321  1.23045  4.12311

OBS    EXPY       COS         SIN         TAN      COSZ        Z

  1  5.47395    0.73142    0.68193     0.93234      .     0.32091
  2  4.05520    0.98413    0.17745     0.18031      .     0.90603
  3  1.82212    0.93691    0.34957     0.37312      .     0.22111
  4  4.05520    0.70637    0.70784     1.00208      .     0.39808
  5  2.45960    0.99224    0.12434     0.12532      .     0.18769
  6  3.00417    0.71359    0.70056     0.98174      .     0.43607
  7  5.47395    0.56736    0.82347     1.45141      .     0.26370
  8  4.05520    0.75579    0.65481     0.86639      .     0.55486
  9  1.82212    0.86217    0.50661     0.58760      .     0.86134
 10  5.47395    0.98992    0.14160     0.14304      .     0.86042

2.2. Operators

2.3. Algebra with logical expressions

data survey; set survey;
  agegr = (age <= 25) + 2*((age > 25) and (age <=40)) + 3*(age > 41);
run;

data survey; set survey;
  if age <= 25 then agegr = 1;
     else if age <= 40 then agegr = 2;
          else agegr = 3;
run;

3. Dropping variables from a data set

data survey; set survey;
   drop r1 r2 r3;
run;

data survey; set survey(drop=r1 r2 r3); run;

data survey; set survey;
   total = r1 + r2 + r3;
   drop r1 r2 r3;
run;

data survey; set survey(drop=r1 r2 r3);
  total = r1 + r2 + r3;
run;

data survey; set survey(drop=r1-r3); run;

data survey; set survey(keep=age inc); run;

4. Dropping observations from a data set (Subsetting data)

data survey; set survey;
  if age > 35;
run;

data survey; set survey;
  if age <= 35 then delete; /* or use: if not (age > 35) then delete; */
run;

data survey; set survey;
  where age > 35;
run;

data survey; set survey(where=(age > 35));  run;

proc print data=survey(where=(age >= 35)); run;

proc means data=yielddat(where=(year = 1994)); run;

5. Setting and merging multiple data sets

data growth1;
  input block trtmnt growth @@;
  year = 1997;
  datalines;
1 1 7.84  2 1 8.69  3 1 8.11  4 1 7.74   5 1 8.35
1 2 6.78  2 2 6.69  3 2 6.95  4 2 6.41   5 2 6.64
1 3 6.79  2 3 6.79  3 3 6.79  4 3 6.43   5 3 6.61
;
run;

proc print data=growth1(obs=10); run;

OBS    BLOCK    TRTMNT    GROWTH     YEAR

  1      1          1        7.84     1997
  2      2          1        8.69     1997
  3      3          1        8.11     1997
  4      4          1        7.74     1997
  5      5          1        8.35     1997
  6      1          2        6.78     1997
  7      2          2        6.69     1997
  8      3          2        6.95     1997
  9      4          2        6.41     1997
 10      5          2        6.64     1997

data growth2;
  input block trtmnt growth @@;
  datalines;
1 4 6.64  2 4 6.57   3 4 6.78   4 4 6.54   5 4 6.48
1 5 7.31  2 5 7.65   3 5 7.26   4 5 6.98   5 5 7.39
run;

data growth; set growth1 growth2; run;
proc print data=growth; run;

OBS    BLOCK    TRTMNT    GROWTH     YEAR

  1      1          1        7.84     1997
  2      2          1        8.69     1997
  3      3          1        8.11     1997
  4      4          1        7.74     1997
  5      5          1        8.35     1997
  6      1          2        6.78     1997
  7      2          2        6.69     1997
  8      3          2        6.95     1997
  9      4          2        6.41     1997
 10      5          2        6.64     1997
 11      1          3        6.79     1997
 12      2          3        6.79     1997
 13      3          3        6.79     1997
 14      4          3        6.43     1997
 15      5          3        6.61     1997
 16      1          4        6.64        .
 17      2          4        6.57        .
 18      3          4        6.78        .
 19      4          4        6.54        .
 20      5          4        6.48        .
 21      1          5        7.31        .
 22      2          5        7.65        .
 23      3          5        7.26        .
 24      4          5        6.98        .
 25      5          5        7.39        .

DATA baseline;
   INPUT id sex $ age inc;
   DATALINES;
 1  F  35 17
17  M  50 14
33  F  45  6
49  M  24 14
65  F  52  9
81  M  44 11
2   F  34 17
18  M  40 14
34  F  47  6
50  M  35 17
;
DATA rating;
   INPUT r1 r2 r3 ;
   DATALINES;
7 2 2
5 5 3
7 2 7
7 5 7
4 7 7
7 7 7
6 5 3
7 5 2
6 5 6
5 7 5
;
run;

data survey; merge baseline rating; run;
proc print data=survey; run;

OBS    ID    SEX    AGE     INC    R1    R2    R3

  1     1     F       35     17     7      2     2
  2    17     M       50     14     5      5     3
  3    33     F       45      6      7     2     7
  4    49     M       24     14     7      5     7
  5    65     F       52      9      4     7     7
  6    81     M       44     11     7      7     7
  7     2     F       34     17     6      5     3
  8    18     M       40     14     7      5     2
  9    34     F       47      6      6     5     6
 10    50     M       35     17     5      7     5

6. Sorting your data

proc sort data=yourdata;
   by <variable list>
run;

by block trtmnt;

by descending block trtmnt;

proc sort data=growth; by block trtmnt; run;

first.block
last.block
first.trtmnt
last.trtmnt

                          first.    last.
OBS    BLOCK    TRTMNT    block     block    GROWTH

  1      1          1         1        0        7.84
  2      1          2         0        0        6.78
  3      1          3         0        0        6.79
  4      1          4         0        0        6.64
  5      1          5         0        1        7.31
  6      2          1         1        0        8.69
  7      2          2         0        0        6.69
  8      2          3         0        0        6.79
  9      2          4         0        0        6.57
 10      2          5         0        1        7.65
 11      3          1         1        0        8.11
 12      3          2         0        0        6.95
 13      3          3         0        0        6.79
 14      3          4         0        0        6.78
 15      3          5         0        1        7.26
 16      4          1         1        0        7.74
 17      4          2         0        0        6.41
 18      4          3         0        0        6.43
 19      4          4         0        0        6.54
 20      4          5         0        1        6.98
 21      5          1         1        0        8.35
 22      5          2         0        0        6.64
 23      5          3         0        0        6.61
 24      5          4         0        0        6.48
 25      5          5         0        1        7.39

data blocks;
  set growth; by block;
  if first.block;
run;
proc print data=blocks; run;

OBS    BLOCK    TRTMNT    GROWTH     YEAR

 1       1          1        7.84     1997
 2       2          1        8.69     1997
 3       3          1        8.11     1997
 4       4          1        7.74     1997
 5       5          1        8.35     1997

proc sort data=growth out=newdata; by descending block; run;

proc sort data=growth out=blocks nodupkey; by block; run;

6.1. By-processing in PROC steps

by <variable list>

by block tx a b ;

by block;
by block tx;
by block tx a;
by block tx a b;

by b;
by tx a;
by a block;

proc means data=growth;
  var growth;
run;

proc means data=growth;
  var growth;
  by block;
run;

6.2. By-merging of sorted data sets (sorted matching)

/* 1996 yield data */
data spacing1;
  input P space rep yield96;
  datalines;
0   40  1 57
0   40  2 58
0   80  1 57
0   80  2 58
0   80  3 56
0  120  1 49
0  120  2 54
0  120  3 53
25  40  1 53
25  40  2 45
25  40  3 46
25  80  1 54
25  80  2 50
25  80  3 48
25 120  1 63
25 120  2 57
25 120  3 53
;;
run;

/* 1997 yield data */
data spacing2;
  input P space rep yield97;
  datalines;
0   40  1 35
0   40  2 28
0   40  3 29
0   80  1 38
0   80  2 29
0   80  3 27
0  120  1 10
0  120  2 25
0  120  3 34
25  40  1 24
25  40  2 24
25  40  3 17
25  80  1 25
25  80  2 31
25  80  3 29
25 120  1 44
25 120  3 28
;;
run;

data spacing;
  merge spacing1 spacing2;
run;
proc print data=spacing; run;

OBS     P    SPACE    REP     YIELD96    YIELD97

  1     0      40       1        57          35
  2     0      40       2        58          28
  3     0      40       3        57          29
  4     0      80       1        58          38
  5     0      80       2        56          29
  6     0      80       3        49          27
  7     0     120       1        54          10
  8     0     120       2        53          25
  9     0     120       3        53          34
 10    25      40       1        45          24
 11    25      40       2        46          24
 12    25      40       3        54          17
 13    25      80       1        50          25
 14    25      80       2        48          31
 15    25      80       3        63          29
 16    25     120       1        57          44
 17    25     120       3        53          28

proc sort data=spacing1; by p space rep; run;
proc sort data=spacing2; by p space rep; run;

data spacing;
  merge spacing1 spacing2;
  by p space rep;
run;
proc print data=spacing; run;

OBS     P    SPACE    REP     YIELD96    YIELD97

  1     0      40       1        57          35
  2     0      40       2        58          28
  3     0      40       3         .          29
  4     0      80       1        57          38
  5     0      80       2        58          29
  6     0      80       3        56          27
  7     0     120       1        49          10
  8     0     120       2        54          25
  9     0     120       3        53          34
 10    25      40       1        53          24
 11    25      40       2        45          24
 12    25      40       3        46          17
 13    25      80       1        54          25
 14    25      80       2        50          31
 15    25      80       3        48          29
 16    25     120       1        63          44
 17    25     120       2        57           .
 18    25     120       3        53          28

data rabbits;
  input Treat Rabbit day0 day4;
  datalines;
  1  1  -0.3  -0.2
  1  2  -0.5   2.2
  1  3  -1.1   2.4
  1  4   1.0   1.7
  1  5  -0.3   0.8
  2  1  -1.1  -2.2
  2  2  -1.4  -0.2
  2  3  -0.1  -0.1
  2  4  -0.2   0.1
  2  5  -0.1  -0.2
  3  1  -1.8   0.2
  3  2  -0.5   0.0
  3  3  -1.0  -0.3
  3  4   0.4   0.4
  3  5  -0.5   0.9
;;
run;

data means;
  input treat day0mn day4mn;
  datalines;
  1       -0.24      1.38
  2       -0.58     -0.52
  3       -0.68      0.24
;
run;

proc sort data=rabbits; by treat; run;
proc sort data=means; by treat; run;
data deviate;
  merge rabbits means; by treat;
  dev0 = day0 - day0mn;
  dev4 = day4 - day4mn;
run;
proc print data=deviate; run;

OBS    TREAT    RABBIT    DAY0     DAY4    DAY0MN    DAY4MN      DEV0     DEV4

  1      1          1      -0.3     -0.2     -0.24      1.38     -0.06    -1.58
  2      1          2      -0.5      2.2     -0.24       1.38    -0.26     0.82
  3      1          3      -1.1      2.4     -0.24       1.38    -0.86     1.02
  4      1          4       1.0      1.7     -0.24       1.38     1.24      0.32
  5      1          5      -0.3      0.8     -0.24       1.38    -0.06    -0.58
  6      2          1      -1.1     -2.2     -0.58     -0.52     -0.52    -1.68
  7      2          2      -1.4     -0.2     -0.58     -0.52     -0.82     0.32
  8      2          3      -0.1     -0.1     -0.58     -0.52      0.48     0.42
  9      2          4      -0.2      0.1     -0.58     -0.52      0.38     0.62
 10      2          5      -0.1     -0.2     -0.58     -0.52      0.48     0.32
 11      3          1      -1.8      0.2     -0.68       0.24    -1.12    -0.04
 12      3          2      -0.5      0.0     -0.68       0.24     0.18    -0.24
 13      3          3      -1.0     -0.3     -0.68      0.24     -0.32    -0.54
 14      3          4       0.4      0.4     -0.68       0.24     1.08      0.16
 15      3          5      -0.5      0.9     -0.68       0.24     0.18      0.66

data means;
  input treat day0mn day4mn;
  datalines;
  1       -0.24      1.38
  2       -0.58     -0.52
;
run;

proc sort data=rabbits; by treat; run;
proc sort data=means; by treat; run;
data deviate;
    merge rabbits means(in=y); by treat;
    if y;
    dev0 = day0 - day0mn;
    dev4 = day4 - day4mn;
run;
proc print data=deviate; run;

OBS    TREAT    RABBIT    DAY0     DAY4    DAY0MN    DAY4MN      DEV0     DEV4

  1      1          1      -0.3     -0.2     -0.24      1.38     -0.06    -1.58
  2      1          2      -0.5      2.2     -0.24       1.38    -0.26     0.82
  3      1          3      -1.1      2.4     -0.24       1.38    -0.86     1.02
  4      1          4       1.0      1.7     -0.24       1.38     1.24      0.32
  5      1          5      -0.3      0.8     -0.24       1.38    -0.06    -0.58
  6      2          1      -1.1     -2.2     -0.58     -0.52     -0.52    -1.68
  7      2          2      -1.4     -0.2     -0.58     -0.52     -0.82     0.32
  8      2          3      -0.1     -0.1     -0.58     -0.52      0.48     0.42
  9      2          4      -0.2      0.1     -0.58     -0.52      0.38     0.62
 10      2          5      -0.1     -0.2     -0.58     -0.52      0.48     0.32

7. Formatting

7.1. Labels

data chloro;
  label block  = 'Experimental replicate'
        nitro  = 'Nitrogen Source'
        thatch = 'Thatch management system'
        chloro = 'Amount chlorophyll in leaves (mg/g)';
  input block nitro thatch chloro;
  datalines;
  1  1  1  3.8
  1  1  2  5.3
  1  1  3  5.9
<and so forth>
;;
run;
proc print data=chloro label; run;

 produces
                                                   Amount
                                     Thatch       chlorophyll
       Experimental    Nitrogen     management     in leaves
OBS      replicate      Source        system          (mg/g)

  1          1              1             1             3.8
  2          1              1             2             5.3
  3          1              1             3             5.9
etc.

7.2. Formats

proc format;
    value Ni
      1='Urea' 2='Amm. sulf.' 3='IBDU' 4='Urea(SC)';
    value
      Yea 1='2 years' 2='5 years' 3='8 years';
run;
 

data chloro;
  label block  = 'Experimental replicate'
        nitro  = 'Nitrogen Source'
        thatch = 'Thatch management system'
        chloro = 'Amount chlorophyll in leaves (mg/g)';
  format nitro ni.
         thatch yea.;
  input block nitro thatch chloro;
  datalines;
  1  1  1  3.8
  1  1  2  5.3
  1  1  3  5.9
<and so forth>
;;
run;

                                                      Amount
                                       Thatch       chlorophyll
       Experimental    Nitrogen       management     in leaves
OBS      replicate     Source           system          (mg/g)

  1          1          Urea            2 years           3.8
  2          1          Urea            5 years           5.3
  3          1          Urea            8 years           5.9
  4          1          Amm. sulf.     2 years          5.2
  5          1          Amm. sulf.     5 years          5.6
  6          1          Amm. sulf.     8 years          5.4
  7          1          IBDU            2 years           6.0
  8          1          IBDU            5 years           5.6
  9          1          IBDU            8 years           7.8
 10          1          Urea(SC)        2 years           6.8
 11          1          Urea(SC)        5 years           8.6

proc means data=chloro;
  class nitro thatch;
  var chloro;
run;

Analysis Variable : CHLORO Amount chlorophyll in leaves (mg/g)
 

     NITRO   THATCH  N Obs   N           Mean        Std Dev       Minimum
------------------------------------------------------------------------
Urea        2 years      2    2     3.8500000     0.0707107      3.8000000

            5 years       2   2     5.3500000      0.0707107     5.3000000

            8 years       2   2     5.1000000      1.1313708     4.3000000

Amm. sulf.  2 years      2   2      5.6000000     0.5656854      5.2000000

            5 years       2   2     5.8500000      0.3535534     5.6000000

            8 years       2   2     5.8000000      0.5656854     5.4000000

IBDU        2 years       2   2     6.5000000      0.7071068     6.0000000

data urea;
  set chloro;
  if nitro = 1;
run;
 

7.3. SAS Dates

data one; set one;
  if (date >= 13769) and (date <= 14015);
run;

 data time;
 input date date7.;
 sasdate = date;
 format date mmddyy8.;
 datalines;
01JAN60
12SEP97
16MAY98
;
run;
proc print; run;

produces

OBS        DATE    SASDATE

 1     01/01/60          0
 2     09/12/97     13769
 3     05/16/98     14015

8. Advanced data manipulation
8.1. Renaming of variables

Renaming variables in SAS is not really an advanced data manipulation. But it has a quirk, which causes novice users frequently trouble. Variables are renamed with either the RENAME= data set option or the RENAME statement. The data set option form is shown here:

data two; set one(rename=(y=x1 u=z1)); run;

The statement form is part of the data step:
data two; set one;
  rename y=x1 u=z1;
run;
The difference between the two statements is that the RENAME= data set option is executed when data set one is copied into the new data set. The RENAME statement is executed upon successful completion of the DATA step. Consequently, if you wish to access the variables being renamed inside the DATA step, you have to access them under their new name, if you chose the RENAME= data set option and under their old name if you chose the RENAME statement. The following code examples are correct and produce the same result:

data two; set one(rename=(y=x1 u=z1));
    logx = log(x1);
    /* logy  = log(y);  This will produce an error, since variable Y */
    /*                   is unknown inside the DATA step               */
    /*                   It has been renamed to X1                     */
run;

date two; set one;
   /* logx = log(x1); This will not work, since x1 is unknown. */
   /*                  The RENAME statement is not executed     */
   /*                  until the end of the DATA step.          */
   logy = log(y);
   rename y = x1 u=z1;
run;

8.2. Retaining of variables

To understand the concept of variable retaining, it is helpful to look at DATA steps in some more detail.

data one;
  input x ;
  datalines;
0.1
0.5
1.0
2.0
;
data one; set one;
    y = log(x);
    if y <  0 then z = 'negative';
    if y >= 0 then u = sqrt(y);
    keep u x y z;
    rename  z = isneg;
run;

The second data step consists of five statements. The KEEP and RENAME statement are executed at the very end of the DATA step. They are sort of "set aside" until all observations have been processed. The first three statements are executed in sequence for every observation in the data set. SAS picks the first observation and executes the three statements. Then it moves to the second observation and executes the statements again. At the beginning of each cycle, variables that are not in the data set (y, z, and u) are being initialized with missing values in the case of numeric variables or blanks in the case of character variables. If a statement generates a value for the variable the missing value is overwritten. If a value is not generated, the missing value code remains. In the example above, u is assigned a value only if y is positive. Since y is the natural logarithm of x, this is the case for the thrid and fourth observation. The variable z is assigned a value only, if y is negative. The printout of this data set is as follows:

OBS     X         Y         ISNEG          U

 1     0.1    -2.30259     negative     .
 2     0.5    -0.69315     negative     .
 3     1.0     0.00000                 0.00000
 4     2.0     0.69315                 0.83255

U contains missing values for the first two observation, ISNEG is set to an empty string for observation three and four.
Since SAS sets new variables to missing for each observation, how can one, for example, add variables successively. In the DATA step

data one;
  input x ;
  datalines;
0.1
0.5
1.0
2.0
;
data one; set one;
  c = c + x;
run;

the variable c will contain missing values, since c has no value assigned to it at the beginning of the data cycle. This can be remedied with the RETAIN statement. It causes SAS to do two things. The variables being retained are not initialized with missing values at the top of the cycle, but retain their previous value. Also, a starting value is assigned to the retained variable. The DATA step

data one; set one;
  retain c 0;
  c = c + x;
run;
proc print; run;

sets initially variable c to 0. Since c is part of a RETAIN statement, it keeps the value from the previous data cycle. In successive cycles c on the right hand side of the "=" sign contain the previous sum of variable x and the current value of x is being added to it. The data set one at the end of the DATA step is

OBS     X      C

 1     0.1    0.1
 2     0.5    0.6
 3     1.0    1.6
 4     2.0    3.6

A short-cut for this syntax construction can be used:

data one; set one;
   c+x;
run;

The syntax c+x implies that c is automatically retained and the values of variable x are successively to be added to it.
Retaining variables is also very convenient if you work with sorted data sets. In the next example

data spike;
 input rating $ tx;
 datalines;
apparent        1
discrete        2
apparent        3
discrete        4
discrete        5
discrete        6
none            7
discrete        8
discrete        9
apparent       10
;;
run;

data set spike contains a character variable (rating). We want to find out how many observations are in the data set for each unique value of variable rating and how many unique values there are. First, sort the data by rating. Then define two new variables, both of which are initialized with 0 and retained. For each new value of rating, the variable cnt is reset to 1. Variable howmany is incremented only when a new value of rating is found.

proc sort data=spike; by rating; run;

data countem; set spike; by rating;
  retain cnt 0 howmany 0;
  if first.rating then cnt=1; else cnt+1;
  if first.rating then howmany+1;
  if last.rating then output;
run;
proc print data=countem; run;

Here is the printout of data set countem:

OBS    RATING      TX    CNT     HOWMANY

 1     apparent    10      3        1
 2     discrete     9      6        2
 3     none         7      1        3

There are three observations with rating='apparent', six observations with rating='discrete' and one with rating='none' and a total of three unique values of rating.

8.3. DO Blocks and DO Loops

A DO block begins with the reserved word DO and ends with the reserved word END. The statements enclosed inside DO..END are called a block. The DO..END construct is an important device to group statements inside a DATA step.

data two; set one;
  if x < 0 then do;
    y = .;
    z = sqrt(-x) + rannor(123);
    Tx = 'Control';
  end;
run;

If the condition in the IF statement is true, SAS executes the statements in the DO..END block. Otherwise the statements are ignored. Without the DO..END block, the DATA statement would require three IF statements. The DO statement also is part of looping constructs (iterative DO). These can be written in various ways. Here are examples:

do i = 1 to 4;  /* A index loop, runs from 1 to 4 in increments of 1 */
  < SAS statements>
end;

do i = 1 to 10 by 2; /* index loop, runs from 1 to 10 in increments of 2 */
  <SAS statements>
end;

/* index loop over x=10, 20, 30, 50, 55, 60, 65, ... , 100 */
do x = 10, 20, 30, 50 to 100 by 5;
  <SAS statements>
end;

do month = 'FEB', 'MAR', 'APR';
 <SAS statements>
end;

/* the statements inside the loop are executed only while the */
/* expression in parentheses is true                           */
do k =1 to 12 while(month='APR');
  <SAS statements>
end;

The next example generates 100 observations from a Gaussian distribution with mean 12 and variance 3. For each observation, it calculates its right-tail probability:

data Gauss;
  do i = 1 to 100;
     z = rannor(8923);
     p = 1 - Probnorm(z);
     x = z*Sqrt(3) + 12;
     output;
  end;
run;

Notice that the DO i1 to 100; .. END; construct is executed for each observation in the data set. Since no observations are input or transferred from another SAS data set, you need the OUTPUT statement inside the DO loop to instruct SAS to write to the data set when the loop is completed.
executes the loop do i = 1 to 4; output; for each observation and simply produces four copies of each observation. The OUTPUT statement should be the last statement inside the loop.
Other forms of DO loops are the DO..WHILE() and DO..UNTIL() constructs. A logical expression inside the parentheses is evaluated for each repetition of the loop. The loop continues as long as the statement is true for the DO..WHILE() loop or until the statement becomes true (DO..UNTIL()). DO..WHILE and DO..UNTIL loops are dangerous. For example, if the logical statment in the WHILE() expression is not true, the loop will never executes. If it is not true, there must be a mechanism inside the loop that eventually makes the statement false, otherwise the loop will continue infinitely. It is important to remember that the loop is executed for each observation in the data set. Care must be exercised not to write infinite loops with DO..WHILE. For example, the following loop

n=0;
do while(n <= 5);
  <Statements>
  n+1;
end;

works, since n is changed inside the loop and the WHILE() condition eventually will become false. If you forget to increment n inside the loop, the statements will be processed indefinitely. In this case, a DO n = .. END; loop is much safer:

do n=1 to 5;
  <statements>
end;

8.4. IF .. THEN .. ELSE statements

Also known as conditional statements, these are very important when subsetting data or processing observations conditionally. The ELSE part is not required as we have seen earlier with statements like
if x < 0 then delete;
if name='Smith';
In the form IF <condition> THEN <statement1> ELSE <statement2> SAS evaluates for each observation the logical condition. If the condition is true, it executes statement1, if it is false statement2. Important to note is that only a single statement follows the THEN and ELSE clause. For example,

data two; set one;
  if (x < 0) then
    y = .;
    z = sqrt(-x);
  else z = sqrt(x);
run;

will cause an error, since SAS expects the ELSE clause after the y=.; statement. If more than one statment is to be executed in the THEN or ELSE clause, group them into DO blocks:

data two; set one;
  if (x < 0) then do;
  y = .;
  z = sqrt(-x);
end; else z = sqrt(x);
run;

IF .. THEN .. ELSE statements can be nested:

data two; set one;
  if score < 4 then rating='below ave';
   else if score < 6 then rating = 'average';
       else if score < 8 then rating = 'above ave.'
          else rating = 'superior';
run;

If many cases are to be distinguished, or multiple statements are to be executed for one or all conditions, this construct is hard to read and debug. A simpler method uses the SELECT case distinction.

8.5. SELECT case distinction

SELECT expressions are more convenient to program and easier to read than a series of nested (and convoluted) IF..THEN expressions. Recall the survey data set

DATA survey;
   INPUT id sex $ age inc r1 r2 r3 ;
   DATALINES;
 1  F  35 17  7 2 2
17  M  50 14  5 5 3
33  F  45  6  7 2 7
49  M  24 14  7 5 7
65  F  52  9  4 7 7
81  M  44 11  7 7 7
2   F  34 17  6 5 3
18  M  40 14  7 5 2
34  F  47  6  6 5 6
50  M  35 17  5 7 5
;

We wish to recode variable r3 and assign character strings. This could be accomplished with a series of IF..THEN statements:
data survey; set survey;
  if r3 < 3 then rating='below ave';
   else if r3 < 5 then rating = 'average';
       else if r3 < 6 then rating = 'above ave.'
          else r3 = 'superior';
run;
The SELECT construct would be

data two; set survey;
  select ;
    when ( r3 < 3 ) rating = 'below ave.';
    when ( r3 < 5 ) rating = 'average';
    when ( r3 < 6 ) rating = 'above ave.';
    otherwise rating = 'superior';
  end;
run;

SAS evaluates a logical expression for each of the WHEN expressions. For the first expression that returns true, the statement is executed. If no WHEN expression is true, SAS executes the statement following OTHERWISE. The OTHERWISE statement is optional, but should be used as a safeguard. If none of the WHEN expressions is true and the OTHERWISE clause is missing, SAS stops the DATA step with an error message. An alternative method of writng the SELECT..END construct in this example is

data two; set survey;
  select ( r3 );
    when ( 1,2 ) rating = 'below ave.';
    when ( 3,4 ) rating = 'average';
    when ( 5   ) rating = 'above ave.';
    otherwise rating = 'superior';
  end;
run;

The variable which follows the select keyword in parentheses is compared against the values in the WHEN expressions. The first WHEN expression that is true will be executed, all others will be ignored.

8.6. Arrays

Arrays in SAS are different from arrays in other programming languages. A SAS array is simply a grouping of variables in a data set for ease of processing. Array processing in SAS consists of three steps:
1. Define the array which groups variables
2. Process the array by repeating an action
3. Select an individual element of the array at each repetition of the action.
This probably sounds more complicated than it should. Here is an example. Three variables (X, Y, Z) are read into a data set. Initially, missing values were coded as 9999. You want to replace them with a missing value  (.), to prevent accidental involvement in calculations. You could of course do this with three IF statements:

data two; set one(keep=X Y Z);
  if x = 9999 then x = . ;
  if y = 9999 then y = . ;
  if z = 9999 then z = .;
run;

Notice that the structure of the IF statements is very similar. Each variable is compared against 9999 and if it contains that value, it is replaced with . ; Processing the same problem with arrays could proceed as follows:

data two; set one(keep = X Y Z);
  array vars{3} x y z; /* define the array, i.e., group the variables x y z into array vars */
  do i =1  to 3;       /* repeat an action for each element of the array */
    if vars{i} = 9999 then vars{i} = .;  /* select individual elements of the array */
  end;
run;

The ARRAY statement tells SAS that you are about to define an array. It is followed by the name of the array and its dimension (the number of variables being grouped) in curly braces. Then, you follow it with the names of the variables being grouped. The number of variable names should match the number in curly braces. If the variables being listed are not part of the data set they will be created. You can omit the number of array elements in curly braces and replace it with an asterisk. SAS will then determine the number of array elements from the number of variables in the list:

array vars{*} x y z;

Alternatively, you can omit the variable names and SAS will create variables named ArrayName1, ArrayName2, .... For example

array vars{4}

will create variables VARS1, VARS2, VARS3, VARS4.
A particular element of the array is accessed inside the DATA step by following the array name with a number in curly braces. The second element of the vars array, the variable y, is accessed as vars{2}, for example. DO i=1... loops are very common to process the elements of an array in turn.
If the variables in the list of the ARRAY statement are not part of the data, they are added to the data set. Sometimes, you need an array only during the DATA step, then you can follow the array definition with the reserved word _TEMPORARY_:

array temparr{10} x1-x10 _temporary_;

A typical example where array processing is helpful is when switching a multivariate data set into a univariate data set. The next data set stems from a repeated measures study in which a temperature difference was measured at four time points for 15 subjects. 5 subjects each received one of three treatments. The time points are spaced 15 days apart.

/* The data as a multivariate data set */
data repmeas;
  input Treat subject time1 time2 time3 time4;
  datalines;
  1  1  -0.3  -0.2   1.2   3.1
  1  2  -0.5   2.2   3.3   3.7
  1  3  -1.1   2.4   2.2   2.7
  1  4   1.0   1.7   2.1   2.5
  1  5  -0.3   0.8   0.6   0.9
  2  1  -1.1  -2.2   0.2   0.3
  2  2  -1.4  -0.2  -0.5  -0.1
  2  3  -0.1  -0.1  -0.5  -0.3
  2  4  -0.2   0.1  -0.2   0.4
  2  5  -0.1  -0.2   0.7  -0.3
  3  1  -1.8   0.2   0.1   0.6
  3  2  -0.5   0.0   1.0   0.5
  3  3  -1.0  -0.3  -2.1   0.6
  3  4   0.4   0.4  -0.7  -0.3
  3  5  -0.5   0.9  -0.4  -0.3
;;
run;

This data set is setup appropriately for a repeated measures analysis with PROC GLM, but not for a repeated measures analysis with PROC MIXED. PROC MIXED requires that the four measurements appear as separate observations indexed by a variable measuring passage of time. The next data step creates this univariate structure with the help of an array:
/* create the univariate data set */

data univ; set repmeas;
  array t{4} time1-time4; /* group variable time1 through time4 into an array */
  do i=1 to 4;             /* loop through the array */
     time = 15*(i-1);     /* calculate the variable measuring passage of time */
     temp = t{i};          /* record the temperature as the ith element of the array */
     output;               /* write the observation to the new data set       */
  end;
  drop time1-time4 i;     /* no need for those anymore                        */
run;
proc print data=univ(obs=20); run;

The DO i = 1 to 4 loop creates four observation from each observation in the multivariate data set. TREAT and SUBJECT variable remain unchanged, but a variable for time and one for the actual measured temperature must be created. Since each temperature is stored in a different variable (time1 .. time4) in the multivariate data set, the array is perfectly suited to deal with the problem. Here are the first twenty observations of the univariate data set:

OBS    TREAT    SUBJECT    TIME     TEMP

  1      1          1          0     -0.3
  2      1          1         15     -0.2
  3      1          1         30      1.2
  4      1          1         45      3.1
  5      1          2          0     -0.5
  6      1          2         15      2.2
  7      1          2         30      3.3
  8      1          2         45      3.7
  9      1          3          0     -1.1
 10      1          3         15      2.4
 11      1          3         30      2.2
 12      1          3         45      2.7
 13      1          4          0      1.0
 14      1          4         15      1.7
 15      1          4         30      2.1
 16      1          4         45      2.5
 17      1          5          0     -0.3
 18      1          5         15      0.8
 19      1          5         30      0.6
 20      1        5         45      0.9
 

8.7. Lagged variables

At any cycle of the DATA step, only the values of the variables for the current observation are available for processing. Values for previous or upcoming observations are not accessible. You can make the values of previously processed observation available through the lag() functions
LAG(variable name) returns the previous value
LAG2(variable name) returns the value of the observation processed second to last
LAG5(variable name) returns the value of the observation processed 5 cycles ago
and so forth. Lagging temperatures in the previous example:

data univ; set univ;
  lagt = lag(temp);
  lag2t = lag2(temp);
run;
proc print data=univ(obs=20); run;

produces

OBS    TREAT    SUBJECT    TIME     TEMP    LAGT    LAG2T

  1      1          1          0     -0.3       .        .
  2      1          1         15     -0.2     -0.3       .
  3      1          1         30      1.2     -0.2     -0.3
  4      1          1         45      3.1      1.2     -0.2
  5      1          2          0     -0.5      3.1      1.2
  6      1          2         15      2.2     -0.5      3.1
  7      1          2         30      3.3      2.2     -0.5
  8      1          2         45      3.7      3.3      2.2
  9      1          3          0     -1.1      3.7      3.3
 10      1          3         15      2.4     -1.1      3.7
 11      1          3         30      2.2      2.4     -1.1
 12      1          3         45      2.7      2.2      2.4
 13      1          4          0      1.0      2.7      2.2
 14      1          4         15      1.7      1.0      2.7
 15      1          4         30      2.1      1.7      1.0
 16      1          4         45      2.5      2.1      1.7
 17      1          5          0     -0.3      2.5      2.1
 18      1          5         15      0.8     -0.3      2.5
 19      1          5         30      0.6      0.8     -0.3
 20      1          5         45      0.9      0.6      0.8

 

8.8. Generating multiple data sets in a DATA step

Sometimes you want to generate multiple data sets in a single DATA step. For example, when reading a large amount of data from a file, you may want to separate it into data sets according to the year of measurement. This can be done by listing multiple data set names after the DATA keyword and using separate OUTPUT statements for each:

filename inf 'C:\Research\Data\Allyears\The Whole Thing.txt';
data y1994 y1995 y1996 y1997;
  infile inf firstobs=20;
  input year location block a b absorp transloc;
  select (year);
    when (1994) output y1994;
    when (1995) output y1995;
    when (1996) output y1996;
  otherwise output y1997;
end;
run;

Notice that the SELECT..END construction was used here instead of multiple IF..THEN statements.

8.9. Converting character variables to numeric variables

SAS uses several rules to convert character to numeric variables. For example, if a character variable is used with a numeric operand such as addition, multiplication, etc., the variable is automatically converted to a numeric variable. If a numeric variable is used on the left side and a character variable on the right side of a statement, the character variable is converted to numeric format.

data test;
  input s $ ;
  x = 0;
  x = s;
  datalines;
  1
  2
  3
  4
  5
;;
run;

In this example, variable S is defined as read as a character variable. The statement x = 0; defines as a numeric variable. The next statement, x = s;, invokes the character-to-numeric conversion, x will contain the values of s in numeric format. Alternatively, you could have combined the two statements into a single statement

x = s + 0;


This page was adapted from a page created by Professor Oliver Schabenberger.  We thank Professor Schabenberger for permission to adapt and distribute this page via our web site.