Using Multidimensional Arrays in SAS

Example 1:  Inputting data

Multidimensional arrays are useful when you want to arrange values in a table format (i.e., rows and columns). You can use a multidimensional array to input data or to perform operations on the data set. For example, let's say you had 10 families that each had one son and one daughter, and your data consist of the weights of the kids. You could list every variable name in the input statement, but that would be a lot of typing. Instead, we will use an array statement.

We name the array wt and then list the dimensions. If we listed only one number, it would be a one-dimensional array, and if we listed three numbers, it would be a three-dimensional array. The important thing to remember is that the last number listed is the number of columns. The second-to-last number is the number of rows, and in a three-dimensional array, the number listed first is the number of levels. SAS assigns columns, rows, etc., starting from the last number given, so you can create n-dimensional arrays if you want. After listing the dimensions, we name the variables that make up the array. In our example, the first set of 10 variables is named k1f1 to k1f10, and the second set of 10 variables is named k2f1 to k2f10.

We use the input statement to read the variables in the array. The slash in the input statement tells SAS to move to the next data line, so each observation is read from two lines of data, and the four data lines yield two observations. We then use do-loops to have SAS cycle through the multidimensional array and process the values. In this example we use the round function to round each value to the nearest whole number, but you can do any processing that you need at this point. Remember that each do-loop needs its own end statement. A two-dimensional array usually calls for two do-loops, and therefore two end statements. In the code for this example, the rows are processed in the outer loop and the columns in the inner loop.

data kids;
  * 2 rows by 10 columns;
  array wt{2,10} k1f1-k1f10 k2f1-k2f10;
  * each observation spans two data lines;
  input k1f1-k1f10 / k2f1-k2f10;
  do i=1 to 2;
    do j=1 to 10;
      wt{i,j}=round(wt{i,j});
    end;
  end;
  drop i j;
cards;
56.7 58.7 68.3 70.3 61.9 87.6 99.3 65.3 78.3 91.5
88.5 63.9 66.8 77.6 55.9 35.9 54.8 98.6 77.3 68.5
51.6 88.7 79.6 65.9 94.1 91.6 83.6 85.7 66.8 77.9
98.5 79.9 99.9 55.6 68.9 46.8 79.8 88.6 88.8 97.5
;
run;

proc print data = kids;
run;
                                        k                                        k
    k   k   k   k   k   k   k   k   k   1   k   k   k    k   k   k   k   k   k   2
O   1   1   1   1   1   1   1   1   1   f   2   2   2    2   2   2   2   2   2   f
b   f   f   f   f   f   f   f   f   f   1   f   f   f    f   f   f   f   f   f   1
s   1   2   3   4   5   6   7   8   9   0   1   2   3    4   5   6   7   8   9   0

1  57  59  68  70  62  88  99  65  78  92  89  64   67  78  56  36  55  99  77  69
2  52  89  80  66  94  92  84  86  67  78  99  80  100  56  69  47  80  89  89  98
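
The loop bounds in the data step above are typed in as 2 and 10. As a minimal sketch, the same rounding step can take its bounds from the array itself with the dim and dim2 functions, so the loops stay correct if the array statement changes. The data set name kids_dim is made up for this illustration, and only the first observation of the data is repeated here.

```sas
data kids_dim;
  array wt{2,10} k1f1-k1f10 k2f1-k2f10;
  input k1f1-k1f10 / k2f1-k2f10;
  /* dim(wt) is the size of the first dimension (rows),   */
  /* dim2(wt) is the size of the second dimension (cols). */
  do i = 1 to dim(wt);
    do j = 1 to dim2(wt);
      wt{i,j} = round(wt{i,j});
    end;
  end;
  drop i j;
  cards;
56.7 58.7 68.3 70.3 61.9 87.6 99.3 65.3 78.3 91.5
88.5 63.9 66.8 77.6 55.9 35.9 54.8 98.6 77.3 68.5
;
run;
```

With this input, the single observation should match the first row of the proc print output above.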

Now let's suppose that we wanted to create a two-dimensional array called myarray with two rows and three columns.  The array statement would be:

array myarray(2, 3) r1c1-r1c3 r2c1-r2c3;

The data would be organized as follows:

         column 1   column 2   column 3
row 1    r1c1       r1c2       r1c3
row 2    r2c1       r2c2       r2c3
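
One way to check this layout yourself is to ask SAS which variable sits at each position. The sketch below is only an illustration: it uses the vname function to look up the variable name behind each array element and writes it to the log. The variable name is a helper introduced here, not part of the original example.

```sas
data _null_;
  array myarray(2, 3) r1c1-r1c3 r2c1-r2c3;
  length name $ 32;
  do i = 1 to 2;        /* rows */
    do j = 1 to 3;      /* columns */
      /* vname returns the name of the variable that the */
      /* array reference points to                       */
      name = vname(myarray(i, j));
      put i= j= name=;
    end;
  end;
run;
```

The log should list r1c1, r1c2 and r1c3 for row 1, followed by r2c1, r2c2 and r2c3 for row 2, because the last subscript changes fastest.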

In summary, listing the dimensions of arrays works like the following:
One-dimensional array:  array x(cols)
Two-dimensional array:  array y(rows, cols)
Three-dimensional array:  array z(levels, rows, cols).
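
As a rough sketch of the three-dimensional case, the data step below builds an array with 2 levels, 2 rows and 3 columns. The data set name levels and the variable names (l for level, r for row, c for column) are hypothetical and chosen only so that each name shows its position. Each element is filled with a number that encodes its own subscripts.

```sas
data levels;
  /* dimensions are listed as levels, rows, columns */
  array z{2, 2, 3} l1r1c1-l1r1c3 l1r2c1-l1r2c3
                   l2r1c1-l2r1c3 l2r2c1-l2r2c3;
  do lev = 1 to 2;
    do row = 1 to 2;
      do col = 1 to 3;
        /* e.g., level 1, row 2, column 3 gets the value 123 */
        z{lev, row, col} = 100*lev + 10*row + col;
      end;
    end;
  end;
  drop lev row col;
run;

proc print data = levels;
run;
```

Three dimensions need three do-loops and three end statements. If the variables are listed in this order, l1r2c3 should hold 123 and l2r1c2 should hold 212.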

Note that if you are going to do the same thing to every variable in a single group, a one-dimensional array is all you need.  Let's say that you find out later that the scale was inaccurate when the kids in the k1 group were weighed, so you need to add five to their values.

data kids1;
  set kids;
  array k(10) k1f1 - k1f10;
  do i = 1 to 10;
    k(i) = k(i) + 5;
  end;
  drop i;  * keep the loop index out of the output data set;
run;

proc print data = kids1;
run;
                                         k                                        k
    k   k   k   k   k   k   k    k   k   1   k   k   k    k   k   k   k   k   k   2
O   1   1   1   1   1   1   1    1   1   f   2   2   2    2   2   2   2   2   2   f
b   f   f   f   f   f   f   f    f   f   1   f   f   f    f   f   f   f   f   f   1
s   1   2   3   4   5   6   7    8   9   0   1   2   3    4   5   6   7   8   9   0

1  62  64  73  75  67  93  104  70  83  97  89  64   67  78  56  36  55  99  77  69
2  57  94  85  71  99  97   89  91  72  83  99  80  100  56  69  47  80  89  89  98
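
The same correction can also be written against the two-dimensional array by holding the row subscript fixed at 1 and looping over the columns only. This is just a sketch of an alternative, assuming the kids data set from the first data step already exists; the data set name kids2 is made up for the illustration.

```sas
data kids2;
  set kids;
  array wt{2,10} k1f1-k1f10 k2f1-k2f10;
  /* row 1 holds k1f1-k1f10, so only the column index varies */
  do j = 1 to dim2(wt);
    wt{1, j} = wt{1, j} + 5;
  end;
  drop j;
run;
```

Both versions change only the k1 variables; which one to use is mostly a matter of whether the two-dimensional array is needed elsewhere in the same data step.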

Example 2: Finding averages

For our next example, let's suppose that you had a data set containing the grades that six students obtained during high school.  Each student took six classes for each of the four years that they were in high school, and you want to calculate their grade point average (gpa) for each year for each student, as well as the cumulative gpa for each student.  The multidimensional array will have two dimensions, one for years with four elements and one for classes with six elements.

data grades;
  array g(4,6) g9c1-g9c6 g10c1-g10c6 g11c1-g11c6 g12c1-g12c6;
  input g9c1-g9c6 / g10c1-g10c6 / g11c1-g11c6 / g12c1-g12c6;
cards;
1 3 2 4 2 3
4 2 3 1 2 4
3 2 4 1 3 2
3 3 4 3 2 2
3 3 2 3 4 4
4 3 1 1 1 2
3 3 2 1 1 2
3 2 4 3 3 2
1 2 3 4 3 2
2 3 4 1 1 2
4 4 4 3 2 3
4 3 3 3 2 2
2 3 4 4 4 4
2 2 1 1 2 3
3 2 2 3 4 4
4 4 3 3 4 2
1 3 4 2 3 2
1 2 3 2 1 1
1 1 1 1 1 2
3 3 2 3 3 4
4 3 2 2 3 2
2 3 4 4 3 3
2 1 2 4 3 2
4 3 4 3 3 2
;
run;

Now that we have input the data set, we need to do the calculations.  First, we will create an array called g and populate it with the variables g9c1 to g12c6.  Next, we will create an array called yr to hold the annual gpa values; because no variable names are listed, SAS creates the variables yr1 to yr4.  Inside the loop over years, we set the initial value of yr(i) to zero.  Then we loop over classes to add up the grades and, once the inner loop has ended, divide by six to get the average.  After both loops have finished, we calculate cumgpa as the mean of the four annual values.  We drop the indexes i and j because they are not needed and use proc print to see our results.

data grades1;
  set grades;
  array g(4,6) g9c1-g9c6 g10c1-g10c6 g11c1-g11c6 g12c1-g12c6;
  array yr(4);  * creates yr1-yr4, one annual gpa per year;
  do i = 1 to 4;
    yr(i)=0;
    do j = 1 to 6;
      yr(i) = yr(i)+g(i,j);
    end;
    yr(i)=yr(i)/6;
  end;
  cumgpa = mean(of yr1 - yr4);
  drop i j;
run;

proc print data = grades1;
run;
                                                                                     c
              g g g g g g g g g g g g g g g g g g                                    u
  g g g g g g 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1                                    m
O 9 9 9 9 9 9 0 0 0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2    y       y       y       y       g
b c c c c c c c c c c c c c c c c c c c c c c c c    r       r       r       r       p
s 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6    1       2       3       4       a

1 1 3 2 4 2 3 4 2 3 1 2 4 3 2 4 1 3 2 3 3 4 3 2 2 2.50000 2.66667 2.50000 2.83333 2.62500
2 3 3 2 3 4 4 4 3 1 1 1 2 3 3 2 1 1 2 3 2 4 3 3 2 3.16667 2.00000 2.00000 2.83333 2.50000
3 1 2 3 4 3 2 2 3 4 1 1 2 4 4 4 3 2 3 4 3 3 3 2 2 2.50000 2.16667 3.33333 2.83333 2.70833
4 2 3 4 4 4 4 2 2 1 1 2 3 3 2 2 3 4 4 4 4 3 3 4 2 3.50000 1.83333 3.00000 3.33333 2.91667
5 1 3 4 2 3 2 1 2 3 2 1 1 1 1 1 1 1 2 3 3 2 3 3 4 2.50000 1.66667 1.16667 3.00000 2.08333
6 4 3 2 2 3 2 2 3 4 4 3 3 2 1 2 4 3 2 4 3 4 3 3 2 2.66667 3.16667 2.33333 3.16667 2.83333
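
The calculation above divides by six, which assumes that every grade is present; a single missing grade would make that year's total, and so its gpa, missing. The following is a sketch of one possible variation, not part of the original example, that averages over the grades actually recorded. The data set name grades2 and the helper variables total and n are introduced here for illustration.

```sas
data grades2;
  set grades;
  array g(4,6) g9c1-g9c6 g10c1-g10c6 g11c1-g11c6 g12c1-g12c6;
  array yr(4);
  do i = 1 to 4;
    total = 0;   /* running sum of the non-missing grades */
    n = 0;       /* number of non-missing grades          */
    do j = 1 to 6;
      if not missing(g(i,j)) then do;
        total = total + g(i,j);
        n = n + 1;
      end;
    end;
    /* leave the annual gpa missing if no grades were recorded */
    if n > 0 then yr(i) = total / n;
    else yr(i) = .;
  end;
  cumgpa = mean(of yr1-yr4);
  drop i j total n;
run;
```

With the complete data used in this example, grades2 should give the same results as grades1. Note that cumgpa remains the mean of the annual values, so with missing grades it would no longer equal the mean of all individual grades.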

Example 3:  Counting the number of missing values

In this next example, we are using a data set that is doubly multivariate.  In other words, there are two dependent variables that were measured across time.  Our goal is to count up the number of missing data points per subject, and based on that total, determine if there is so much missing data for that subject that it will be problematic for the analysis.  To do this, we will first create a multidimensional array called d and input the data as we have in the previous examples.

data double;
  array d(2, 4) dv1t1 - dv1t4 dv2t1 - dv2t4;
  input dv1t1 - dv1t4 / dv2t1 - dv2t4;
cards;
23 46 .  46
12 15 69 83
.  98 74 85
36 45 96 32
86 36 45 23
14 87 56 98
65 67 76 73
71 96 36 54
55 .  33 11
47 64 31 39
85 68 15 36
95 94 .  56
48 65 54 55
89 96 41 25
.  96 36 15
.  97 32 . 
;
run;

Next, we will create two new arrays: a multidimensional array called x, into which we will put our two sets of dependent variables (dv1t1, etc.), and a one-dimensional array called m, which will hold the number of missing data points for each dependent variable.  Because no variable names are listed for m, SAS creates the variables m1 and m2.  After we start the do-loop that processes the two groups of dependent variables, we set m(i) to zero (so that the count starts at zero for each group), and then start the loop that goes through the four time points in each group.  We use an if/then statement to do the counting, and, after ending both do-loops, we create a new variable called ttlmiss, which is the sum of the missing counts for both groups of dependent variables.  Finally, we create a new 0/1 variable called problematic, which is initially set to zero and is switched to one if ttlmiss is greater than two.  (We arbitrarily chose two; there is no statistical reason behind this selection.)

data double1;
  set double;
  array x(2, 4) dv1t1 - dv1t4 dv2t1 - dv2t4;
  array m(2); * creates m1 and m2 to hold the number of missings per dv;
  do i = 1 to 2;
    m(i) = 0;
    do j = 1 to 4;
      if x(i, j) = . then m(i) = m(i) + 1;
    end;
  end;
  ttlmiss = m1 + m2;
  problematic = 0;
  if ttlmiss > 2 then problematic = 1;
run;

proc print data = double1;
run;
Obs  dv1t1  dv1t2  dv1t3  dv1t4  dv2t1  dv2t2  dv2t3  dv2t4  i  j  m1  m2  ttlmiss  problematic

 1     23     46      .     46     12     15     69     83   3  5   1   0     1          0
 2      .     98     74     85     36     45     96     32   3  5   1   0     1          0
 3     86     36     45     23     14     87     56     98   3  5   0   0     0          0
 4     65     67     76     73     71     96     36     54   3  5   0   0     0          0
 5     55      .     33     11     47     64     31     39   3  5   1   0     1          0
 6     85     68     15     36     95     94      .     56   3  5   0   1     1          0
 7     48     65     54     55     89     96     41     25   3  5   0   0     0          0
 8      .     96     36     15      .     97     32      .   3  5   1   2     3          1
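
For comparison, the same counts can be obtained without arrays or loops by using the nmiss function, which returns the number of missing values among its arguments. This is offered only as a sketch of an alternative, assuming the data set double from above; the data set name double2 is made up for the illustration.

```sas
data double2;
  set double;
  /* count the missing values within each group of variables */
  m1 = nmiss(of dv1t1-dv1t4);
  m2 = nmiss(of dv2t1-dv2t4);
  ttlmiss = m1 + m2;
  /* a comparison evaluates to 1 when true and 0 when false */
  problematic = (ttlmiss > 2);
run;
```

The array version is still the more general pattern, since the loops give you a place to do other processing on each element while you count.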

Example 4:  Multiple arrays to input and process data

In this example, we illustrate the use of more than one array to input and process data, as well as a slightly more advanced use of do-loops.  We will create four groups of variables:  age, dv, dvn and dvm, each containing five variables.  The groups dv, dvn and dvm are groups of dependent variables, and age is considered to be an independent variable.  Unlike the previous two examples, we will do the manipulations that we want in the same data step that we use to input the data.

First, if age is less than 10, we would like to add 100 to the first group of dependent variables (dv), 200 to the second group (dvn), and 300 to the third group (dvm).  Because the amount that we want to add increases by 100 with the index variable i, we multiply i by 100 and add the result to the score for each dependent variable.  Also, if age is greater than or equal to 12, we would like to divide the scores in the first group (dv) by 50 and the scores in the second group (dvn) by two.  To process just one group, we give the number of that group as the first subscript when we reference the array, as in dv(1, k).

Notice also that dividing is written differently than adding.  The addition uses a sum statement (variable + expression;), which has no equals sign, whereas the division needs an ordinary assignment statement with the = sign.  Finally, notice that the first do-loop is not closed until all processing has been finished, because the index variable k is used throughout the processing.

data ages;
  array age(5) age1 - age5;
  array dv(3, 5) dv1 - dv5 dvn1 - dvn5 dvm1 - dvm5;
  * only the first five values on each age line are read;
  input age1 - age5 / dv1 - dv5 / dvn1 - dvn5 / dvm1 - dvm5;
  do k = 1 to 5;
    * sum statement adds 100, 200 or 300 depending on the group;
    if (age(k) < 10) then
      do i = 1 to 3;
        dv(i, k) + i*100;
      end;
    if (age(k) >= 12) then do;
      dv(1, k) = dv(1, k) / 50;
      dv(2, k) = dv(2, k) / 2;
    end;
  end;
  drop i k;
cards;
9 15 8 21 2 3
65 98 32 64 64
65 67 74 85 96
50 21 63 41 32
10 5 6 8 12 5
45 78 78 98 65
20 23 30 36 35
34 32 11 12 88
1 9 6 5 12 7
85 74 85 25 45
98 65 31 21 32
65 98 32 64 64
;
run;

proc print data = ages;
run;
   a  a a  a  a                              d    d    d    d     d    d   d   d   d   d
O  g  g g  g  g  d     d    d     d       d  v    v    v    v     v    v   v   v   v   v
b  e  e e  e  e  v     v    v     v       v  n    n    n    n     n    m   m   m   m   m
s  1  2 3  4  5  1     2    3     4       5  1    2    3    4     5    1   2   3   4   5

1  9 15 8 21  2 165   1.96 132   1.28 164.0 265  33.5 274  42.5 296.0 350  21 363  41 332
2 10  5 6  8 12  45 178.00 178 198.00   1.3  20 223.0 230 236.0  17.5  34 332 311 312  88
3  1  9 6  5 12 185 174.00 185 125.00   0.9 298 265.0 231 221.0  16.0 365 398 332 364  64
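
The sum statement and an assignment statement give the same answer in this example because none of the dependent variables are missing. They differ when a value is missing, which the small sketch below tries to show with made-up data. The data set sumdemo and the variables a and b are hypothetical and are not part of the example above.

```sas
data sumdemo;
  input a b;
  a + 100;       /* sum statement: a missing a is treated as zero   */
  b = b + 100;   /* assignment: a missing b produces a missing b    */
  cards;
1 1
. .
;
run;

proc print data = sumdemo;
run;
```

In the first observation both a and b should be 101. In the second observation a should be 100, while b stays missing and SAS writes a note to the log about missing values being generated.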

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.