### Statistical Computing Seminars Arrays in SAS

This seminar is designed to help you improve your SAS data management skills via the use of arrays. According to SAS, "an array is a temporary grouping of SAS variables that are arranged in a particular order and identified by an array-name." In SAS, an array is created with the array statement within a data step. Arrays can be useful in very explicit ways, like creating or modifying a series of variables, and in more subtle ways, like reshaping a dataset or comparing across observations. The application of arrays will be the focus of this seminar.

In this seminar we will cover the following topics:

Recoding variables
Applying the same computation to many variables simultaneously
Computing new variables
Revisiting recoding variables - two variations of the example on recoding
Identify patterns across variables using arrays
Reshaping wide to long
Understanding the functions first., last. and the retain statement
Reshaping long to wide using arrays
Comparisons across observations using arrays

Here is the program on which the seminar is based.

After completing this seminar, we hope that you will be able to recognize opportunities to use arrays in your data analysis and execute arrays properly. Before we start, let's run some SAS options so that we can get rid of the date, page number, centering and page break in the output.

options nodate nonumber nocenter formdlim="-";

#### Recoding variables

Let's say we have variables on family monthly income for twelve month and the missing values have been coded as (-999).

data faminc;
input famid faminc1-faminc12 ;
cards;
1 3281 3413 3114 2500 2700 3500 3114 -999 3514 1282 2434 2818
2 4042 3084 3108 3150 -999 3100 1531 2914 3819 4124 4274 4471
3 6015 6123 6113 -999 6100 6200 6186 6132 -999 4231 6039 6215
;
run;

We need to recode the twelve family income variables so that missing will be represented by a dot. Let's see how we would do this in a data step one variable at a time.

data recode_missing;
set faminc;
if faminc1 = -999 then faminc1 = .;
if faminc2 = -999 then faminc2 = .;
if faminc3 = -999 then faminc3 = .;
if faminc4 = -999 then faminc4 = .;
if faminc5 = -999 then faminc5 = .;
if faminc6 = -999 then faminc6 = .;
if faminc7 = -999 then faminc7 = .;
if faminc8 = -999 then faminc8 = .;
if faminc9 = -999 then faminc9 = .;
if faminc10 = -999 then faminc10 = .;
if faminc11 = -999 then faminc11 = .;
if faminc12 = -999 then faminc12 = .;
run;
proc print data = recode_missing heading= h noobs;
run;
famid    faminc1    faminc2    faminc3    faminc4    faminc5    faminc6
  1        3281       3413       3114       2500       2700       3500
2        4042       3084       3108       3150          .       3100
3        6015       6123       6113          .       6100       6200
faminc7    faminc8    faminc9    faminc10    faminc11    faminc12
  3114          .       3514       1282        2434        2818
1531       2914       3819       4124        4274        4471
6186       6132          .       4231        6039        6215

We are using many repetitive steps here, making this is a perfect situation for using arrays as shown below. Note: In the code we use the square brackets around the subscript variable i. The choice between square brackets, curly brackets or parenthesis is completely arbitrary. We have decided to use the square brackets as a visual reminder that i is a subscript and not a part of a mathematical computation.

data recode_missing;
set faminc;
array inc[12] faminc1 - faminc12;
do i = 1 to 12;
if inc[i]=-999 then inc[i]=.;
end;
drop i;
run;

Within the array statement, we first named the array, inc, and indicated in brackets how many variables are in the array, then listed which dataset variables our array variables represent, faminc1-faminc12. The dataset variables can be existing variables or can be new variables added to the dataset with this step. After the array statement, we used a do loop to cycle through our array variables. Upon completing this data step, our dataset has the same set of variables it started with. The only new variable created in the data step was i, and i was dropped explicitly. The array variables, inc1-inc12, simply act as stand-ins for the variables they represent and are then discarded.

Other examples of recoding situations where arrays may be useful:

• If measures for some observations are expressed as proportions and others as percents, recoding all such variables into one consistent measure.
• If all negative values are considered invalid, replacing values less than zero with missing.

#### Applying the same math computation to many variables at a time

Now let's say we have survey data with responses coded on a -3 to 3 scale and we wish to reverse code several items. We can do this using an array, cycling through the items to be reversed and multiplying the response values by (-1).

data score;
input item1 item2 item3 item4;
cards;
-2   1   -3   0
-1   2   -2   1
0  -1   -3  -1
;
run;
data score_array1;
set score;
array item(4) item1-item4;
do i=1 to 4;
item(i) = -1*item(i);
end;
drop i;
run;
proc print data=score_array1;
run;

Obs    item1    item2    item3    item4
 1       2        -1       3         0
2       1        -2       2        -1
3       0         1       3         1

In this example, we are only reversing four items. Completing this task without arrays would have been perfectly doable. However, if we wished to reverse 40 items, the non-array code would become much longer, going from 4 to 40 lines, and would be annoying to write. The array coding, on the other hand, would remain the same length and would require only minor changes.

Other examples of applying the same math computation to many variables:

• If variables are measured in absolute time and you are interested in relative time, subtracting the same starting time from all variables.
• If  you wish to add random noise to a set of variables, adding in random numbers from standard normal distribution.

#### Computing new variables

Let's come back to our family income data set. We are interested in creating a new variable for each month containing the amount of income tax based on that month's income. We can compute the income tax variables manually.

data tax_manual;
set recode_missing;
taxinc1 = faminc1 * .10 ;
taxinc2 = faminc2 * .10 ;
taxinc3 = faminc3 * .10 ;
taxinc4 = faminc4 * .10 ;
taxinc5 = faminc5 * .10 ;
taxinc6 = faminc6 * .10 ;
taxinc7 = faminc7 * .10 ;
taxinc8 = faminc8 * .10 ;
taxinc9 = faminc9 * .10 ;
taxinc10= faminc10 * .10 ;
taxinc11= faminc11 * .10 ;
taxinc12= faminc12 * .10 ;
run;
var famid faminc6-faminc12 taxinc6-taxinc12;
run;
famid    faminc6    faminc7    faminc8    faminc9    faminc10    faminc11    faminc12
  1        3500       3114          .       3514       1282        2434        2818
2        3100       1531       2914       3819       4124        4274        4471
3        6200       6186       6132          .       4231        6039        6215
taxinc6    taxinc7    taxinc8    taxinc9    taxinc10    taxinc11    taxinc12
  350       311.4         .       351.4       128.2       243.4       281.8
310       153.1      291.4      381.9       412.4       427.4       447.1
620       618.6      613.2         .        423.1       603.9       621.5

Now we will compute the same income tax variables using array technique. We have to use two arrays because the first array, inc, is the array for the existing variables (faminc1-faminc12); the second array, tax, is created as a "placeholder" where we will store the new variables (taxinc1-taxinc12).

data tax_array;
set recode_missing;
array inc(12) faminc1-faminc12; /* existing variables */
array tax(12) taxinc1-taxinc12; /* new variables */
do month = 1 to 12;
tax[month] = inc[month]*0.1;
end;
run;
proc print data=tax_array noobs;
var famid faminc1-faminc3 taxinc1-taxinc3;
run;
famid    faminc1    faminc2    faminc3    taxinc1    taxinc2    taxinc3

1        3281       3413       3114      328.1      341.3      311.4
2        4042       3084       3108      404.2      308.4      310.8
3        6015       6123       6113      601.5      612.3      611.3

Other examples of computing new variables:

• If you wish to standardize a set of variables and retain both the standardized and unstandardized values.
• If you wish to add variables containing projected values of variables in the dataset.

#### Revisiting recoding variables - two variations of the example on recoding

SAS has some nice flexibility in terms of defining arrays. Here is the first variation of our recoding variable example:

data test;
set faminc;
array inc(*) faminc:;
do i =1 to dim(inc);
if inc(i) = -999 then inc(i)=.;
end;
drop i;
run;

What are the differences between this segment of code and the code for our first example?

• Using the prefix convention "faminc:" instead of faminc1 - faminc12;
• Using dim(inc) to dynamically determine the number of variables in the array;

Here is another variation.

data test;
set faminc;
array a(*) _numeric_;
do i =1 to dim(a);
if a(i) = -999 then a(i)=.;
end;
drop i;
run;
Now the only difference between this data step and the previous one is the use of the SAS key word "_numeric_". This works nicely if we need to make the same change to all numerical variables.

#### Identify patterns across variables using arrays

In this section the objective is to identify the number of missing values for each row. We are going to create a new variable named nmiss, which will be the number of missing values across variables faminc1 - faminc12

data mspatterns;
set recode_missing;
array inc(12) faminc1-faminc12; /* existing vars */
nmiss = 0;
do i = 1 to 12;
if inc(i) = . then nmiss = nmiss + 1;
end;
run;
var famid faminc1-faminc12 nmiss;
run;
famid    faminc1    faminc2    faminc3    faminc4    faminc5    faminc6    faminc7

1        3281       3413       3114       2500       2700       3500       3114
2        4042       3084       3108       3150          .       3100       1531
3        6015       6123       6113          .       6100       6200       6186

faminc8    faminc9    faminc10    faminc11    faminc12    nmiss

.       3514       1282        2434        2818        1
2914       3819       4124        4274        4471        1
6132          .       4231        6039        6215        2

Other examples of computing new variables:

• If you have sales data for each month and you wish to know how many months had sales over a given amount.
• If you have survey data and you are interested in how many positive responses each respondent gave.

#### Some notes on the above uses of arrays

From the uses of arrays already presented, you have perhaps recognized times when you could have used arrays.  Anytime you find yourself

• Copying and pasting lines of code in a data step
• Performing the same operation (like recoding missing values or reverse scaling responses) on multiple variables
• Creating a set of new variables from a set of existing variables using a consistent formulation
• Calculating a single variable based on conditions of a set of existing variables
you could likely be using arrays. Compared to repetitive statements, arrays are efficient in their coding and less subject to typos and syntactical errors. Admittedly, arrays can detract from the readability of SAS code, but careful commenting can fix this.

#### Reshaping wide to long

Reshaping wide to long creating only one variable--manually.
In the problem data set we show what happens when we forget to include the appropriate output statements in the data step.

data wide;
input famid faminc96 faminc97 faminc98 ;
cards;
1 40000 40500 41000
2 45000 45400 45800
3 75000 76000 77000
;
run;
data long_manual;
set wide;
year=96;
faminc=faminc96;
output;
year=97;
faminc=faminc97;
output;
year=98;
faminc=faminc98;
output;
run;
proc print data=long_manual;
var famid year faminc;
run;

Obs    famid    year    faminc
1       1       96      40000
2       1       97      40500
3       1       98      41000
4       2       96      45000
5       2       97      45400
6       2       98      45800
7       3       96      75000
8       3       97      76000
9       3       98      77000

data problem;
set wide;
year=96;
faminc=faminc96;
*output;
year=97;
faminc=faminc97;
*output;
year=98;
faminc=faminc98;
output;
run;
proc print data=problem;
var famid year faminc;
run;

Obs    famid    year    faminc
1       1       98      41000
2       2       98      45800
3       3       98      77000

Reshaping wide to long creating only one variable using arrays.

data long_array;
set wide;
array Afaminc(96:98) faminc96 - faminc98;
do year = 96 to 98;
faminc = Afaminc[year];
output;
end;
drop faminc96-faminc98;
run;
proc print data=long_array;
run;

Obs    famid    year    faminc
1       1       96      40000
2       1       97      40500
3       1       98      41000
4       2       96      45000
5       2       97      45400
6       2       98      45800
7       3       96      75000
8       3       97      76000
9       3       98      77000

Reshaping wide to long creating multiple variables (including string variables) using arrays.

data multi_wide;
input famid faminc96 faminc97 faminc98 spend96 spend97 spend98
debt96 $debt97$ debt98 $; cards; 1 40000 40500 41000 38000 39000 40000 yes yes no 2 45000 45400 45800 42000 43000 44000 yes no no 3 75000 76000 77000 70000 71000 72000 no no no ; run; data multi_long; set multi_wide; length debt$ 3;
array Afaminc(96:98) faminc96-faminc98;
array Aspend(96:98) spend96-spend98;
do year = 96 to 98;
faminc = Afaminc[year];
spend = Aspend[year];
output;
end;
drop faminc96-faminc98 spend96-spend98;
run;
proc print data=multi_long;
var famid year faminc spend debt;
run;

Obs    famid    year    faminc    spend    debt
1       1       96      40000    38000    yes
2       1       97      40500    39000    yes
3       1       98      41000    40000    no
4       2       96      45000    42000    yes
5       2       97      45400    43000    no
6       2       98      45800    44000    no
7       3       96      75000    70000    no
8       3       97      76000    71000    no
9       3       98      77000    72000    no

Reshaping wide to long in presence of character suffixes. In the above example we had numeric suffixes (96, 97 and 98). We can reshape even if we have character suffixes such as old, now and future.

data character;
length name_old name_now name_future $24; input id name_old$ name_now $name_future$ inc_old inc_now inc_future;
cards;
1  Ramon  Martin  Martin_Sheen  23000  50000  700000
2  John  Johnnie  J_boy         10000  20000  600000
3  Mary_Cathleen  Bo  Bo_Derek  15000  40000  250000
;
run;
proc print data=character;
run;
Obs    name_old         name_now    name_future     id    inc_old    inc_now    future

1     Ramon            Martin      Martin_Sheen     1     23000      50000     700000
2     John             Johnnie     J_boy            2     10000      20000     600000
3     Mary_Cathleen    Bo          Bo_Derek         3     15000      40000     250000

data character_array;
set character;
length name $24; array Aname(3)$ name_old name_now name_future;
array Aincome(3) inc_old inc_now inc_future;
do time = 1 to 3;
name = Aname[time];
income = Aincome[time];
output;
end;
run;
proc format;
value t_format 1='old' 2='now' 3='future';
run;
proc print data=character_array ;
format time t_format.;
var id time name income;
run;

Obs    id    time      name             income
1      1    old       Ramon             23000
2      1    now       Martin            50000
3      1    future    Martin_Sheen     700000
4      2    old       John              10000
5      2    now       Johnnie           20000
6      2    future    J_boy            600000
7      3    old       Mary_Cathleen     15000
8      3    now       Bo                40000
9      3    future    Bo_Derek         250000

#### Understanding the functions first., last. and the retain statement

The previous section demonstrated how to reshape data sets from wide to long. Unfortunately, reshaping data sets from long to wide is more complex. In order to better understand how to use arrays to reshape from long to wide we will need to understand how the first. and last. functions work as well as understand how the retain statement works. The following are examples of the retain statement.

We would like to create a new variable called new_meas which contains the same values as measurement but with the missing values filled in. The new_meas variable should have a starting value of 0 and then change values every time measurement has a non-missing value.

data missings;
input id measurement;
cards;
1  .
1  2
3  .
2  3
3  4
2  .
3  .
1  .
3  5
3  6
;
run;
data ex_retain;
set missings;
retain new_meas 0;
if measurement ne . then new_meas = measurement;
run;
proc print data=ex_retain;
run;

Obs    id    measurement    new_meas
1     1         .             0
2     1         2             2
3     3         .             2
4     2         3             3
5     3         4             4
6     2         .             4
7     3         .             4
8     1         .             4
9     3         5             5
10     3         6             6

Omitting the retain statement gives us the wrong new_meas, now it is just a copy of measurement.

data ex_retain;
set missings;
*retain new_meas 0;
if measurement ne . then new_meas = measurement;
run;
proc print data=ex_retain;
run;

Obs    id    measurement    new_meas
1     1         .             .
2     1         2             2
3     3         .             .
4     2         3             3
5     3         4             4
6     2         .             .
7     3         .             .
8     1         .             .
9     3         5             5
10     3         6             6

In the next example we want to create a variable called new1 which contains the cumulative sum of the values in the variable measurement. Note that when measurement is missing the sum should remain unchanged.

data ex2_retain;
set missings;
retain new1 0;
if measurement ne . then new1 = new1 + measurement;
run;
proc print data=ex2_retain;
run;

Obs    id    measurement    new1
1     1         .           0
2     1         2           2
3     3         .           2
4     2         3           5
5     3         4           9
6     2         .           9
7     3         .           9
8     1         .           9
9     3         5          14
10     3         6          20

Omitting the retain statement gives us the wrong new1.

data ex2_retain;
set missings;
*retain new1 0;
if measurement ne . then new1 = new1 + measurement;
run;
proc print data=ex2_retain;
run;

Obs    id    measurement    new1
1     1         .           .
2     1         2           .
3     3         .           .
4     2         3           .
5     3         4           .
6     2         .           .
7     3         .           .
8     1         .           .
9     3         5           .
10     3         6           .

Let's look at the first. and last. variables. In the first example we create indicator variables, first and last. The variable first indicates the first observation for each person as indicated by id; the variable last indicates the last observation for each person. Note: When using first.var_name or last.var_name we must first sort the data set on the variable var_name. Moreover, in the data step we must always precede first.var_name or last.var_name with a by var_name statement.

proc sort data=missings out=sort_miss;
by id;
run;
data ex1;
set sort_miss;
by id;
if first.id then first=1;
else first=0;
if last.id then last=1;
else last=0;
run;
proc print data=ex1;
run;

Obs    id    measurement    first    last
1     1         .           1        0
2     1         2           0        0
3     1         .           0        1
4     2         3           1        0
5     2         .           0        1
6     3         .           1        0
7     3         4           0        0
8     3         .           0        0
9     3         5           0        0
10     3         6           0        1

We can combine the first. function with a retain statement to get a cumulative sum and count.

data kids;
length kidname \$ 4;
input famid kidname birth_order wt;
cards;
1 Beth 1  60
1 Barb 3  20
4 Sam  1 100
4 Stu  2  90
1 Bob  2  40
3 Pete 1  60
3 Phil 3  20
2 Andy 1  80
3 Pam  2  40
2 Al   2  50
2 Ann  3  20
;
run;

We will be using first.famid so we must sort the data set on famid.

proc sort data=kids out=sort_kids;
by famid;
run;
data retain1;
set sort_kids;
retain sumwt count; /*carry over the value from previous obs to next obs*/
by famid;
if first.famid then do; /*at 1st obs of each family set sumwt and count = 0*/
sumwt=0;
count=0;
end;
sumwt = sumwt + wt;
count = count + 1;
meanwt = sumwt/count;
run;
proc print data=retain1;
var famid kidname wt sumwt count meanwt;
run;

famid    kidname     wt    sumwt    count    meanwt
1       Beth       60      60       1         60
1       Barb       20      80       2         40
1       Bob        40     120       3         40
2       Andy       80      80       1         80
2       Al         50     130       2         65
2       Ann        20     150       3         50
3       Pete       60      60       1         60
3       Phil       20      80       2         40
3       Pam        40     120       3         40
4       Sam       100     100       1        100
4       Stu        90     190       2         95

By adding an if last.famid statement to the program we output only the last observation per family which shows the final sumwt, count and meanwt for each family.
Note: We do not need to resort the data since it is already sorted on famid.

data retain2;
set retain1;
by famid;
if last.famid then output; /*output only the last obs for each family*/
run;
proc print data=retain2;
var famid sumwt meanwt;
run;

famid    sumwt    count    meanwt
1       120       3        40
2       150       3        50
3       120       3        40
4       190       2        95

#### Reshaping long to wide using arrays

We will use the long_array data set created from the wide data set and we will reshape it back to the original wide format.

proc print data=long_array;
run;

Obs    famid    year    faminc
1       1       96      40000
2       1       97      40500
3       1       98      41000
4       2       96      45000
5       2       97      45400
6       2       98      45800
7       3       96      75000
8       3       97      76000
9       3       98      77000

We will be using first.famid so we must sort the data set on famid.

proc sort data=long_array out=long_sort;
by famid;
run;
data wide_array;
set long_sort;
by famid;
retain faminc96-faminc98;
array Afaminc(96:98) faminc96-faminc98;
if first.famid then do;
do i = 96 to 98;
Afaminc[i] = .; /*initializing to missing*/
end;
end;
Afaminc(year) = faminc; /*looping across values in the variable year*/
*if last.famid then output; /* outputs only the last obs in a family*/
drop year faminc i;
run;
proc print data=wide_array noobs;
run;

famid    faminc96    faminc97    faminc98
1        40000           .           .
1        40000       40500           .
1        40000       40500       41000
2        45000           .           .
2        45000       45400           .
2        45000       45400       45800
3        75000           .           .
3        75000       76000           .
3        75000       76000       77000

data wide_array;
set long_sort;
by famid;
retain faminc96-faminc98;
array Afaminc(96:98) faminc96-faminc98;
if first.famid then do;
do i = 96 to 98;
Afaminc[i] = .;
end;
end;
Afaminc(year) = faminc; /*looping across values in the variable year*/
if last.famid then output; /* outputs only the last obs in a family*/
drop year faminc i;
run;
proc print data=wide_array noobs;
run;

famid    faminc96    faminc97    faminc98
1        40000       40500       41000
2        45000       45400       45800
3        75000       76000       77000

#### Comparisons across observations using arrays

One difficulty of data management in SAS is that we cannot easily compare across observations. However, with our knowledge of arrays, we can transpose the data from long to wide; then we can use the array to do the comparisons very easily. This is a rather subtle use of arrays that can be extremely helpful.

We have up to six observations for four people. Our goal is to compare, for each person, a given observation with the previous and the next observations. If a person has three consecutive observations that are the same, we wish to flag the person.

data real_life;
input person topicA;
cards;
1   0
1   1
3  -1
1   0
2   0
1   1
2  -1
2  -1
3   0
3   1
4   0
1   1
4   1
4   0
2  -1
4   0
4   0
1  -1
;
run;

We need to number the observations within each person. We will be using first.person in the process of doing this, so we must first sort the data on person. Then we will create the count variable which will enumerates the observations within each person.

proc sort data=real_life out=sort_real;
by person;
run;
data count_real;
set sort_real;
retain count;
by person;
if first.person then count = 0;
count = count + 1;
run;
proc print data=count_real noobs;
run;
topic
person      A      count
1         0       1
1         1       2
1         0       3
1         1       4
1         1       5
1        -1       6
2         0       1
2        -1       2
2        -1       3
2        -1       4
3        -1       1
3         0       2
3         1       3
4         0       1
4         1       2
4         0       3
4         0       4
4         0       5

We now convert the data set from long to wide.
Note: We are using first.person and last.person but we do not need to resort the data since it is already sorted on person.

data wide_real;
set count_real;
array AtopicA(6) topicA_1-topicA_6;
retain topicA_1-topicA_6;
by person;
if first.person then do;
do i = 1 to 6;
AtopicA[i] = .;
end;
end;
AtopicA(count) = topicA; /*looping across values in the variable count*/
if last.person then output; /* outputs only the last obs per person */
run;
proc print data=wide_real noobs;
var person topicA_1-topicA_6;
run;
topic    topic    topic    topic    topic    topic    flag
person     A_1      A_2      A_3      A_4      A_5      A_6       A
1         0        1        0        1       1        -1       0
2         0       -1       -1       -1       .         .       1
3        -1        0        1        .       .         .       0
4         0        1        0        0       0         .       1

Now, let's find the people who have the same value for 3 observations in a row.

data three;
set wide_real;
array topic(6) topicA_1-topicA_6;
do i = 2 to 5;
if topic[i-1] ne . & topic[i] ne . & topic[i+1] ne . &
topic[i]=topic[i-1] & topic[i]=topic[i+1] then flagA=1;
end;
if flagA=. then flagA=0;
run;
proc print data=three noobs;
var person topicA_1-topicA_6 flagA;
run;

topic    topic    topic    topic    topic    topic    flag
person     A_1      A_2      A_3      A_4      A_5      A_6       A

1         0        1        0        1       1        -1       0
2         0       -1       -1       -1       .         .       1
3        -1        0        1        .       .         .       0
4         0        1        0        0       0         .       1

This seminar is based on examples from the following SAS Learning Modules shown below.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.