SAS FAQ
How can I create different kinds of centered variables in SAS?

Centering a variable means subtracting a constant from every value of that variable.  There are several ways that you can center variables.  For example, you could center the variable around a constant that has intrinsic meaning for the variable, such as centering a continuous variable age around 18 to represent when Americans come of voting age.  You could also center a variable around its mean, or you could use a categorical variable to group your continuous variable and center each value around the mean of its group.  Each of these techniques is shown below.

We will use the test data set presented below for all of our examples.  We understand that for most purposes such a data set is unrealistically small, but its size makes it easier to see what is happening in each step.

data test;
input studentid class score1 score2;
cards;
1 1 34 24
2 1 39 25
3 1 34 26
4 1 38 20
5 1 32 21
1 2 45 36
2 2 43 30
3 2 48 39
4 2 41 37
5 2 40 31
1 3 50 46
2 3 51 49
3 3 57 48
4 3 50 40
5 3 57 46
;
run;

1. Centering a variable around a constant

Suppose that we wanted to center all of the values in the variable score1 around 45.

data center45;
set test;
c45 = score1 - 45;
run;

proc print data = center45;
run;
Obs    studentid    class    score1    score2    c45

  1        1          1        34        24      -11
  2        2          1        39        25       -6
  3        3          1        34        26      -11
  4        4          1        38        20       -7
  5        5          1        32        21      -13
  6        1          2        45        36        0
  7        2          2        43        30       -2
  8        3          2        48        39        3
  9        4          2        41        37       -4
 10        5          2        40        31       -5
 11        1          3        50        46        5
 12        2          3        51        49        6
 13        3          3        57        48       12
 14        4          3        50        40        5
 15        5          3        57        46       12

Now let's center the scores for each class around a different constant.  Let's suppose that score1 should be centered around 30 for class 1, around 40 for class 2, and around 50 for class 3.  The proc sort was added only to make the output easier to read; it is not necessary for the program to work.

data centerdiff;
set test;
if class = 1 then c1 = score1 - 30;
if class = 2 then c1 = score1 - 40;
if class = 3 then c1 = score1 - 50;
run;

proc sort data = centerdiff;
by class studentid;
run;

proc print data = centerdiff;
run;
Obs    studentid    class    score1    score2    c1

  1        1          1        34        24       4
  2        2          1        39        25       9
  3        3          1        34        26       4
  4        4          1        38        20       8
  5        5          1        32        21       2
  6        1          2        45        36       5
  7        2          2        43        30       3
  8        3          2        48        39       8
  9        4          2        41        37       1
 10        5          2        40        31       0
 11        1          3        50        46       0
 12        2          3        51        49       1
 13        3          3        57        48       7
 14        4          3        50        40       0
 15        5          3        57        46       7
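
When there are many groups, writing one if statement per group becomes tedious.  The sketch below shows one possible alternative, assuming that class takes only the integer values 1, 2 and 3 so that it can be used directly as an array subscript.  The data set name centerdiff2 and the array name center are ours, chosen only for illustration.

```sas
data centerdiff2;
set test;
* hypothetical lookup table of centering constants, one per class;
array center{3} _temporary_ (30 40 50);
* class is used as the subscript, so it must be 1, 2 or 3 for every case;
c1 = score1 - center{class};
run;

proc print data = centerdiff2;
run;
```

If class were missing or outside the range 1 to 3 for any case, the data step would stop with an array subscript error, so the if statements shown above are the safer choice when you are unsure of the values.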

2.  Grand mean centering

Instead of centering a variable around a value that you select, you may want to center it around its mean.  This is known as grand mean centering.  There are at least three ways that you can do this.  Perhaps the most straightforward way is to get the mean of each variable that you want to center and subtract that value from the variable in a data step.  This is simple if you only need to center a few variables.

proc means data = test mean;
var score1 score2;
run;
Variable            Mean
------------------------
score1        43.9333333
score2        34.5333333
------------------------

data grand;
set test;
grmscore1 = score1 - 43.93;
grmscore2 = score2 - 34.53;
run;

proc print data = grand;
run;
Obs    studentid    class    score1    score2    grmscore1    grmscore2

  1        1          1        34        24         -9.93       -10.53
  2        2          1        39        25         -4.93        -9.53
  3        3          1        34        26         -9.93        -8.53
  4        4          1        38        20         -5.93       -14.53
  5        5          1        32        21        -11.93       -13.53
  6        1          2        45        36          1.07         1.47
  7        2          2        43        30         -0.93        -4.53
  8        3          2        48        39          4.07         4.47
  9        4          2        41        37         -2.93         2.47
 10        5          2        40        31         -3.93        -3.53
 11        1          3        50        46          6.07        11.47
 12        2          3        51        49          7.07        14.47
 13        3          3        57        48         13.07        13.47
 14        4          3        50        40          6.07         5.47
 15        5          3        57        46         13.07        11.47
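
Because the means typed into the data step above were rounded to two decimal places (43.93 and 34.53), the centered variables will have means that are close to, but not exactly, zero.  A quick check along the following lines shows the size of the discrepancy, which should be roughly 0.0033 for each variable here.  The maxdec option is used only to shorten the printed values.

```sas
* check how far the means of the centered variables are from zero;
proc means data = grand mean maxdec = 4;
var grmscore1 grmscore2;
run;
```

If this discrepancy matters for your analysis, the two methods that follow avoid typing in the means by hand.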

A second way to create a grand mean centered variable is to use proc means, output the means to a data set, and then merge that data set with your original data set.  The data set output by proc means is shown below.  As you can see, it has only one observation.  It also has no variables in common with the original data set, which makes merging it with the original data set somewhat more difficult.  The steps needed to overcome this problem are explained just above the data step that performs the merge.

proc means data = test mean;
var score1 score2;
output out = grand1 mean=m1 m2;
run;

proc print data = grand1;
run;
Obs    _TYPE_    _FREQ_       m1         m2

 1        0        15      43.9333    34.5333

proc sort data = test;
by studentid class;
run;

If you try to merge the grand1 data set and the original test data set as you normally would, you will find that you have the values of m1 and m2 only for the first case, and missing values for the remaining 14 cases.  Hence, we use an if-then do group to copy the values of m1 and m2 into new variables, which we have called mean1 and mean2, on the first iteration of the data step (when _n_ = 1).  We also use the retain statement so that the values of mean1 and mean2 are carried forward rather than being reset to missing on the second and later iterations.  Retaining m1 and m2 themselves would not help: in a merge without a by statement, SAS sets the variables that come from grand1 to missing once that data set has run out of observations.  We use the drop statement to drop the variables m1 and m2, as well as the _type_ and _freq_ variables that were in the grand1 data set.  Finally, we calculate the grand mean centered variables that we want, grmscore1 and grmscore2.

data grand1merged;
merge test grand1;
retain mean1 mean2;
if _n_ = 1 then do;
mean1 = m1; 
mean2 = m2; 
end;
drop _freq_ _type_ m1 m2;
grmscore1 = score1 - mean1;
grmscore2 = score2 - mean2;
run;

proc print data = grand1merged;
run;
Obs    studentid    class    score1    score2     mean1      mean2     grmscore1    grmscore2

  1        1          1        34        24      43.9333    34.5333      -9.9333     -10.5333
  2        1          2        45        36      43.9333    34.5333       1.0667       1.4667
  3        1          3        50        46      43.9333    34.5333       6.0667      11.4667
  4        2          1        39        25      43.9333    34.5333      -4.9333      -9.5333
  5        2          2        43        30      43.9333    34.5333      -0.9333      -4.5333
  6        2          3        51        49      43.9333    34.5333       7.0667      14.4667
  7        3          1        34        26      43.9333    34.5333      -9.9333      -8.5333
  8        3          2        48        39      43.9333    34.5333       4.0667       4.4667
  9        3          3        57        48      43.9333    34.5333      13.0667      13.4667
 10        4          1        38        20      43.9333    34.5333      -5.9333     -14.5333
 11        4          2        41        37      43.9333    34.5333      -2.9333       2.4667
 12        4          3        50        40      43.9333    34.5333       6.0667       5.4667
 13        5          1        32        21      43.9333    34.5333     -11.9333     -13.5333
 14        5          2        40        31      43.9333    34.5333      -3.9333      -3.5333
 15        5          3        57        46      43.9333    34.5333      13.0667      11.4667
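
A common alternative to the merge above is to read the one-observation data set with its own set statement, executed only on the first iteration of the data step.  Variables read with a set statement keep their values across iterations, so no retain statement or extra variables are needed.  This is only a sketch; the data set name grand1set is ours.

```sas
data grand1set;
* read the single row of means once, on the first iteration only;
if _n_ = 1 then set grand1 (keep = m1 m2);
set test;
* m1 and m2 keep their values for every case read from test;
grmscore1 = score1 - m1;
grmscore2 = score2 - m2;
run;

proc print data = grand1set;
run;
```

The resulting data set still contains m1 and m2; add a drop statement if you do not want to keep them.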

A third way to create a grand mean centered variable is to use proc sql.  In the code below, four new variables are created:  mean1 is the mean of score1, mean2 is the mean of score2, grandmc1 is the grand mean centered variable for score1 and grandmc2 is the grand mean centered variable for score2.

* grand mean centering using proc sql;
proc sql; 
create table grndmc as
select *, mean(score1) as mean1, mean(score2) as mean2,
score1 - mean(score1) as grandmc1, score2 - mean(score2) as grandmc2
from test;
quit;

proc print data = grndmc;
run;
Obs    studentid    class    score1    score2     mean1      mean2     grandmc1    grandmc2

  1        1          1        34        24      43.9333    34.5333     -9.9333    -10.5333
  2        1          2        45        36      43.9333    34.5333      1.0667      1.4667
  3        1          3        50        46      43.9333    34.5333      6.0667     11.4667
  4        2          1        39        25      43.9333    34.5333     -4.9333     -9.5333
  5        2          2        43        30      43.9333    34.5333     -0.9333     -4.5333
  6        2          3        51        49      43.9333    34.5333      7.0667     14.4667
  7        3          1        34        26      43.9333    34.5333     -9.9333     -8.5333
  8        3          2        48        39      43.9333    34.5333      4.0667      4.4667
  9        3          3        57        48      43.9333    34.5333     13.0667     13.4667
 10        4          1        38        20      43.9333    34.5333     -5.9333    -14.5333
 11        4          2        41        37      43.9333    34.5333     -2.9333      2.4667
 12        4          3        50        40      43.9333    34.5333      6.0667      5.4667
 13        5          1        32        21      43.9333    34.5333    -11.9333    -13.5333
 14        5          2        40        31      43.9333    34.5333     -3.9333     -3.5333
 15        5          3        57        46      43.9333    34.5333     13.0667     11.4667

3.  Creating an aggregate variable

There may be times when you want to create an aggregate variable.  An aggregate variable is one that aggregates data from a "lower level" to a "higher level".  In this example, the students' test scores (which can be thought of as a level 1 variable) are aggregated to the classroom level (which can be thought of as a level 2 variable).  Hence, a new variable is created that is the mean of the test scores for each class.

In the code below, the output statement is used to output the means for each variable (in this case, score1 and score2) to a new data set called aggtest.  The means for score1 are put into a variable called m1 and the means for score2 are put into a variable called m2.  Because proc means is run with a by statement, the test data set is first sorted by class; the same sort order is needed for the merge that follows.

proc sort data = test;
by class;
run;

proc means data = test mean ;
var score1 score2;
by class;
output out = aggtest mean=m1 m2;
run;

proc print data = aggtest;
run;
Obs    class    _TYPE_    _FREQ_     m1      m2

 1       1         0         5      35.4    23.2
 2       2         0         5      43.4    34.6
 3       3         0         5      53.0    45.8

data merged;
merge test aggtest;
by class;
drop _TYPE_ _FREQ_;
run;

proc print data = merged;
run;
Obs    studentid    class    score1    score2     m1      m2

  1        1          1        34        24      35.4    23.2
  2        2          1        39        25      35.4    23.2
  3        3          1        34        26      35.4    23.2
  4        4          1        38        20      35.4    23.2
  5        5          1        32        21      35.4    23.2
  6        1          2        45        36      43.4    34.6
  7        2          2        43        30      43.4    34.6
  8        3          2        48        39      43.4    34.6
  9        4          2        41        37      43.4    34.6
 10        5          2        40        31      43.4    34.6
 11        1          3        50        46      53.0    45.8
 12        2          3        51        49      53.0    45.8
 13        3          3        57        48      53.0    45.8
 14        4          3        50        40      53.0    45.8
 15        5          3        57        46      53.0    45.8
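
If you would rather not depend on the sort order when computing the class means, one option is to use a class statement in proc means instead of a by statement.  The sketch below assumes that only the per-class means are wanted, which is why the nway option is included; the data set name aggtest2 is ours.

```sas
* the class statement does not require test to be sorted;
* nway keeps only the rows for the individual classes and omits the overall row;
proc means data = test noprint nway;
class class;
var score1 score2;
output out = aggtest2 (drop = _type_ _freq_) mean = m1 m2;
run;

proc print data = aggtest2;
run;
```

Merging aggtest2 back onto test with a by statement still requires test to be sorted by class.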

You can do the same thing using proc sql.  In the code below, a data set called aggtestsql is created.  In the third line, the mean of score1 is computed and stored in a variable called mean1, and the mean of score2 is computed and stored in a variable called mean2.  The group by clause is needed so that the means are computed by group, in this case for each value of the variable class.  If this clause were omitted, the means created would be grand means (in other words, means over all observations, not broken out by class).

proc sql; 
create table aggtestsql as
select *, mean(score1) as mean1, mean(score2) as mean2 
from test
group by class;
quit;

proc print data = aggtestsql;
run;
Obs    studentid    class    score1    score2    mean1    mean2

  1        1          1        34        24       35.4     23.2
  2        2          1        39        25       35.4     23.2
  3        3          1        34        26       35.4     23.2
  4        4          1        38        20       35.4     23.2
  5        5          1        32        21       35.4     23.2
  6        1          2        45        36       43.4     34.6
  7        2          2        43        30       43.4     34.6
  8        3          2        48        39       43.4     34.6
  9        4          2        41        37       43.4     34.6
 10        5          2        40        31       43.4     34.6
 11        1          3        50        46       53.0     45.8
 12        2          3        51        49       53.0     45.8
 13        3          3        57        48       53.0     45.8
 14        4          3        50        40       53.0     45.8
 15        5          3        57        46       53.0     45.8
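
The data set above has one row per student because select * asks for every original column, so proc sql attaches the class means to each student's row.  If you want a data set with just one row per class instead, a sketch like the following selects only the grouping variable and the means.  The data set name classmeans is ours.

```sas
* one row per class, containing only the class identifier and the two means;
proc sql;
create table classmeans as
select class, mean(score1) as mean1, mean(score2) as mean2
from test
group by class;
quit;

proc print data = classmeans;
run;
```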

4. Group mean centering

Just as there are at least three ways to create a grand mean centered variable, there are at least three different ways to create a group mean centered variable.  The first way illustrated below is very straightforward, but it may be impractical if you have lots of groups (or classes).  To save space, we have group mean centered only one variable, score1.

proc means data = test mean;
by class;
var score1;
run;
class=1

The MEANS Procedure

Analysis Variable : score1

        Mean
------------
  35.4000000
------------

class=2

Analysis Variable : score1

        Mean
------------
  43.4000000
------------

class=3

Analysis Variable : score1

        Mean
------------
  53.0000000
------------

data group;
set test;
if class = 1 then grpmscore1 = score1 - 35.4;
if class = 2 then grpmscore1 = score1 - 43.4;
if class = 3 then grpmscore1 = score1 - 53.0;
run;

proc print data = group;
run;
Obs    studentid    class    score1    score2    grpmscore1

  1        1          1        34        24         -1.4
  2        2          1        39        25          3.6
  3        3          1        34        26         -1.4
  4        4          1        38        20          2.6
  5        5          1        32        21         -3.4
  6        1          2        45        36          1.6
  7        2          2        43        30         -0.4
  8        3          2        48        39          4.6
  9        4          2        41        37         -2.4
 10        5          2        40        31         -3.4
 11        1          3        50        46         -3.0
 12        2          3        51        49         -2.0
 13        3          3        57        48          4.0
 14        4          3        50        40         -3.0
 15        5          3        57        46          4.0

A second way to create a group mean centered variable is to use proc means, output the means to a data set, and then merge that data set with your original data set.  This is shown below.

proc means data = test mean;
var score1 score2;
by class;
output out = grpmeanctr mean=m1 m2;
run;

proc sort data = test;
by class studentid;
run;

data merged2;
merge test grpmeanctr;
by class;
drop _TYPE_ _FREQ_;
groupmc1 = score1 - m1;
groupmc2 = score2 - m2;
run;

proc print data = merged2;
run;
Obs    studentid    class    score1    score2     m1      m2     groupmc1    groupmc2

  1        1          1        34        24      35.4    23.2      -1.4         0.8
  2        2          1        39        25      35.4    23.2       3.6         1.8
  3        3          1        34        26      35.4    23.2      -1.4         2.8
  4        4          1        38        20      35.4    23.2       2.6        -3.2
  5        5          1        32        21      35.4    23.2      -3.4        -2.2
  6        1          2        45        36      43.4    34.6       1.6         1.4
  7        2          2        43        30      43.4    34.6      -0.4        -4.6
  8        3          2        48        39      43.4    34.6       4.6         4.4
  9        4          2        41        37      43.4    34.6      -2.4         2.4
 10        5          2        40        31      43.4    34.6      -3.4        -3.6
 11        1          3        50        46      53.0    45.8      -3.0         0.2
 12        2          3        51        49      53.0    45.8      -2.0         3.2
 13        3          3        57        48      53.0    45.8       4.0         2.2
 14        4          3        50        40      53.0    45.8      -3.0        -5.8
 15        5          3        57        46      53.0    45.8       4.0         0.2

A third way to accomplish the same thing is to use proc sql.  As before, four new variables are being created.  You do not have to create the mean1 and mean2 variables; we have included them only for the sake of completeness and to show how this would be done.

proc sql; 
create table grpmeanctrsql as
select *, mean(score1) as mean1, mean(score2) as mean2,
score1 - mean(score1) as groupmc1, score2 - mean(score2) as groupmc2
from test
group by class;
quit;

proc print data = grpmeanctrsql;
run;
Obs    studentid    class    score1    score2    mean1    mean2    groupmc1    groupmc2

  1        1          1        34        24       35.4     23.2      -1.4         0.8
  2        2          1        39        25       35.4     23.2       3.6         1.8
  3        3          1        34        26       35.4     23.2      -1.4         2.8
  4        4          1        38        20       35.4     23.2       2.6        -3.2
  5        5          1        32        21       35.4     23.2      -3.4        -2.2
  6        1          2        45        36       43.4     34.6       1.6         1.4
  7        2          2        43        30       43.4     34.6      -0.4        -4.6
  8        3          2        48        39       43.4     34.6       4.6         4.4
  9        4          2        41        37       43.4     34.6      -2.4         2.4
 10        5          2        40        31       43.4     34.6      -3.4        -3.6
 11        1          3        50        46       53.0     45.8      -3.0         0.2
 12        2          3        51        49       53.0     45.8      -2.0         3.2
 13        3          3        57        48       53.0     45.8       4.0         2.2
 14        4          3        50        40       53.0     45.8      -3.0        -5.8
 15        5          3        57        46       53.0     45.8       4.0         0.2
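
Another possibility, not covered above, is proc standard, which can shift variables to a chosen mean.  The sketch below is offered as an assumption-laden alternative rather than a replacement for the methods shown: it copies the scores into new variables first, because proc standard overwrites the variables it processes, and it specifies only mean = 0 so that the spread of the variables is left unchanged.  The data set names tocenter and grpstd are ours.

```sas
* copy the scores so that the original variables are preserved;
data tocenter;
set test;
groupmc1 = score1;
groupmc2 = score2;
run;

* the by statement in proc standard requires sorted data;
proc sort data = tocenter;
by class studentid;
run;

* mean = 0 centers each variable within each class and leaves the scale alone;
proc standard data = tocenter mean = 0 out = grpstd;
by class;
var groupmc1 groupmc2;
run;

proc print data = grpstd;
run;
```

Leaving out the by statement in proc standard would center the variables around their grand means instead.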

These pages are Copyrighted (c) by UCLA Academic Technology Services

