UCLA Academic Technology Services HomeServicesClassesContactJobs
Help the Stat Consulting Group by giving a gift             
Loading

SAS FAQ
Where can I find more functions to create new variables?

When creating new variables, it is often handy to use a function.  While there are many functions that can be used when creating a new variable in a data step, you might not find the function that you need.  Another place to look for more functions is in proc expand.  This procedure is one of SAS's time series procedures, but you don't need time series data to use this procedure.  In the examples below, we illustrate the use of proc expand to create new variables.  With proc expand, it is easy to operate across observations; this is much more difficult to do in a data step.  In later examples on this page, we will illustrate how easy it is to operate across observations with proc expand.  We will use the hsb2 dataset in all of the examples on this page.

In our first example, we will take the negative of the variable socst.  If you were creating this variable in a data step, you would multiply socst by -1.  We will also create a variable called number_write, which will be the variable write with 10 added to each value.  In a data step, you would simply add 10 to the variable write.

proc expand data = "d:\data\hsb2" out = test1;
convert socst = neg_socst / transformout = (neg);
convert write = number_write / transformout = (+ 10);
run;
proc print data = test1 (obs = 15);
var socst neg_socst write number_write;
run;
                 neg_             number_
Obs    socst    socst    write     write

  1      41      -41       44        54
  2      56      -56       65        75
  3      51      -51       50        60
  4      31      -31       40        50
  5      41      -41       41        51
  6      48      -48       44        54
  7      36      -36       31        41
  8      51      -51       46        56
  9      71      -71       65        75
 10      41      -41       44        54
 11      56      -56       65        75
 12      51      -51       54        64
 13      51      -51       49        59
 14      51      -51       47        57
 15      56      -56       57        67

In the examples above, a function wasn't really necessary to create the new variables; they could have easily been created in a data step.  Let's create some other variables that might not be so easy to create without a function.  First, we will create med_read, which is the centered moving median of the variable read.  Next, the variable sum_science is the cumulative sum of the variable science.  For both of these, our window will be 5.  The variable ratio_sci2 is ratio of the current value to the lagged value.  In our example, we will use a lag of 2.  Finally, pct_sci is the percent difference between the current value and the lagged value.  Because we did not specify a value in the parentheses with the pctdif option, the lag will be 1.

proc expand data = hsb2 out = test2;
convert read = med_read  / transformout = (cmovmed 5);
convert science = sum_science / transformout = (cusum 5);
convert science = ratio_sci2 / transformout = (ratio 2);
convert science = pct_sci / transformout = (pctdif);
run;
proc print data = test2 (obs = 15);
var read med_read science sum_science ratio_sci2 pct_sci;
run;
                                        sum_      ratio_
Obs    read    med_read    science    science      sci2      pct_sci

  1     34       44.0         39         39       .            .
  2     63       45.5         63         63       .          61.5385
  3     44       47.0         39         39      1.00000    -38.0952
  4     47       47.0         45         45      0.71429     15.3846
  5     47       47.0         40         40      1.02564    -11.1111
  6     39       47.0         44         83      0.97778     10.0000
  7     47       47.0         36         99      0.90000    -18.1818
  8     28       47.0         44         83      1.00000     22.2222
  9     65       52.0         58        103      1.61111     31.8182
 10     52       57.0         55         95      1.25000     -5.1724
 11     57       57.0         54        137      0.93103     -1.8182
 12     60       52.0         50        149      0.90909     -7.4074
 13     44       45.0         35        118      0.64815    -30.0000
 14     41       44.0         39        142      0.78000     11.4286
 15     45       44.0         31        126      0.88571    -20.5128

Now, let's use a few functions within the levels of a categorical variable.  We have a variable called race in our dataset that has four levels.  First, we will create the variable socst_cusum, which will be the cumulative sum of the variable socst over five observations.  As you can see in the proc print output, the first five observations for socst and socst_cusum have the same values.  The new variable socst_sum is simply the sum of the values of the variable socst.  The new variable rev_math is the values of the variable math in reverse order within each level of the variable race. 

proc sort data = hsb2 out = hsb2_sorted;
by race;
run;

proc expand data = hsb2_sorted out = test3;
by race;
convert socst = socst_cusum / transformout = (cusum 5);
convert socst = socst_sum / transformout = (sum);
convert math = rev_math /transformout = (reverse);
run;

proc print data = test3 (obs = 40);
var race socst socst_cusum socst_sum math rev_math;
run;
                        socst_    socst_
Obs    race    socst     cusum      sum     math    rev_math

  1      1       56        56        56      48        44
  2      1       61        61       117      57        44
  3      1       46        46       163      39        45
  4      1       66        66       229      57        45
  5      1       61        61       290      49        41
  6      1       51       107       341      50        40
  7      1       46       107       387      61        46
  8      1       51        97       438      59        54
  9      1       61       127       499      39        33
 10      1       31        92       530      43        49
 11      1       48       155       578      52        48
 12      1       51       158       629      43        52
 13      1       51       148       680      52        43
 14      1       41       168       721      48        52
 15      1       36       128       757      49        43
 16      1       41       196       798      33        39
 17      1       56       214       854      54        59
 18      1       41       189       895      46        61
 19      1       41       209       936      40        50
 20      1       51       179       987      41        49
 21      1       36       232      1023      45        57
 22      1       46       260      1069      45        39
 23      1       36       225      1105      44        57
 24      1       42       251      1147      44        48
 25      2       56        56        56      66        42
 26      2       51        51       107      62        52
 27      2       71        71       178      64        42
 28      2       56        56       234      61        66
 29      2       41        41       275      49        54
 30      2       56       112       331      72        72
 31      2       41        92       372      54        49
 32      2       46       117       418      66        61
 33      2       36        92       454      42        64
 34      2       56        97       510      52        62
 35      2       51       163       561      42        66
 36      3       66        66        66      67        41
 37      3       56        56       122      45        42
 38      3       66        66       188      53        50
 39      3       61        61       249      42        49
 40      3       41        41       290      43        44

Let's look at a few more useful functions, including lag, lead and reciprocal.

proc expand data = hsb2 out = test4;
convert math = lag_math / transformout = (lag);
convert math = lead_math /transformout = (lead);
convert science = rec_science / transformout = (reciprocal);
run;
proc print data = test4 (obs = 15);
var math lag_math lead_math science rec_science;
run;
                           lead_                 rec_
Obs    math    lag_math     math    science     science

  1     40         .         48        39      0.025641
  2     48        40         41        63      0.015873
  3     41        48         43        39      0.025641
  4     43        41         46        45      0.022222
  5     46        43         52        40      0.025000
  6     52        46         44        44      0.022727
  7     44        52         43        36      0.027778
  8     43        44         64        44      0.022727
  9     64        43         49        58      0.017241
 10     49        64         72        55      0.018182
 11     72        49         50        54      0.018519
 12     50        72         44        50      0.020000
 13     44        50         40        35      0.028571
 14     40        44         50        39      0.025641
 15     50        40         43        31      0.032258

Another useful function in proc expand is the scale function.  This function allows the user to create a new variable that is rescaled to the user-specified minimum and maximum.  To create the new variable read_write, we set the scale of the variable write to the minimum and maximum of the variable read.  To create read_25_80 and write_25_80, we set the minimum and maximum of both read and write to be 25 and 80, respectively.

proc expand data = hsb2_sorted out = test5;
convert write = read_write / transformout = (scale (28 76));
convert read = read_25_80 / transformout = (scale (25 80));
convert write = write_25_80 / transformout = (scale (25 80));
run;

proc print data = test5 (obs = 15);
var read write read_write read_25_80 write_25_80;
run;
                         read_      read_      write_
Obs    read    write     write      25_80      25_80

  1     63       65     73.3333    65.1042    76.9444
  2     60       52     56.0000    61.6667    57.0833
  3     42       39     38.6667    41.0417    37.2222
  4     73       61     68.0000    76.5625    70.8333
  5     47       54     58.6667    46.7708    60.1389
  6     60       54     58.6667    61.6667    60.1389
  7     44       44     45.3333    43.3333    44.8611
  8     57       54     58.6667    58.2292    60.1389
  9     47       46     48.0000    46.7708    47.9167
 10     47       40     40.0000    46.7708    38.7500
 11     39       44     45.3333    37.6042    44.8611
 12     28       46     48.0000    25.0000    47.9167
 13     48       49     52.0000    47.9167    52.5000
 14     47       57     62.6667    46.7708    64.7222
 15     50       33     30.6667    50.2083    28.0556
proc corr data = test5;
var read write read_write read_25_80 write_25_80;
run;
                                       Simple Statistics

Variable            N        Mean     Std Dev         Sum     Minimum     Maximum  Label

read              200    52.23000    10.25294       10446    28.00000    76.00000  reading score
write             200    52.77500     9.47859       10555    31.00000    67.00000  writing score
read_write        200    57.03333    12.63811       11407    28.00000    76.00000  writing score
read_25_80        200    52.76354    11.74816       10553    25.00000    80.00000  reading score
write_25_80       200    58.26736    14.48117       11653    25.00000    80.00000  writing score


                    Pearson Correlation Coefficients, N = 200
                            Prob > |r| under H0: Rho=0

                                                  read_         read_        write_
                       read         write         write         25_80         25_80

read                1.00000       0.59678       0.59678       1.00000       0.59678
reading score                      <.0001        <.0001        <.0001        <.0001

write               0.59678       1.00000       1.00000       0.59678       1.00000
writing score        <.0001                      <.0001        <.0001        <.0001

read_write          0.59678       1.00000       1.00000       0.59678       1.00000
writing score        <.0001        <.0001                      <.0001        <.0001

read_25_80          1.00000       0.59678       0.59678       1.00000       0.59678
reading score        <.0001        <.0001        <.0001                      <.0001

write_25_80         0.59678       1.00000       1.00000       0.59678       1.00000
writing score        <.0001        <.0001        <.0001        <.0001

For a complete list of the functions available for creating variables on the convert statement of proc expand, please see the SAS documentation .


How to cite this page

Report an error on this page or leave a comment

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.