UCLA Academic Technology Services

SAS FAQ
How do I make unique anonymous ID variables for my data?

Suppose you have a file with 25 observations. A variable called id identifies each observation, and the file holds some information about each observation (here, just age).
DATA orig;
INPUT id age;
CARDS;
 1   3
 2  32
 3  13
 4  16
 5   4
 6   9
 7  43
 8  29
 9  43
10  47
11  13
12   6
13  43
14  48
15  34
16  13
17  47
18   6
19  34
20  42
21  47
22  49
23  28
24  25
25  39
;
RUN; 
Suppose you want to make a new id variable called newid that is unique for every observation but conceals the identity of the observation. The strategy is as follows.

1. Create a new data file with IDs in it (we will call this newids). Make more IDs than necessary because there may be duplicate IDs.

2. Eliminate any records with duplicate newid in the newids data file.

3. Scramble the order of the newids file (so the order of newid does not give away the person's identity).
 
4. Merge newids with the original data file (orig), and get rid of the old id variable.

5. During the merge in step 4, make a file called crossref that shows the correspondence between id and newid.

6. Store crossref in a safe place, since that file can be used with orig2 (the anonymized data file) to determine the identity of the observations.

1. Here we make newid, which is the new random ID, and ranord, a random number that will be used for scrambling the order of the data file.

data NEWIDS;
  do NOBS = 1 to 40 ; /* make 40 candidate IDs, more than the 25 needed, in case of duplicates */
    newid = "     " ;  /* newid will be 5 characters wide */
    do i = 1 to 5;    /* create each character of newid, positions 1 - 5 */
      * make a random integer 0-35, to be mapped onto 0-9 and A-Z ;
      rannum = int(uniform(0)*36) ;
      * if it is 0-9, convert it into the digits 0-9, which are byte(48) - byte(57) ;
      if (0  <= rannum <= 9) then ranch = byte(rannum + 48) ;
      * if it is 10-35, convert it into the letters A-Z, which are byte(65) - byte(90) ;
      if (10 <= rannum <= 35) then ranch = byte(rannum + 55);
      * place the character into position i of "newid"  ;
      substr(newid,i,1) = ranch ;
    end;
    * make ranord, a random number used later to scramble the order ;
    ranord = uniform(0) ;
    output ;
  end;
  * just keep "newid" and "ranord" ;
  keep newid ranord ;
run; 
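Step 1 makes 40 candidate IDs for 25 observations. The following sketch is not part of the original program; it estimates how likely it is that 40 random IDs contain at least one duplicate, assuming each 5-character ID is drawn uniformly from the 36**5 possible values. The variable names n_ids, p_unique and p_dup are made up for illustration.

```sas
data _null_;
  n_ids = 36**5 ;          /* number of possible 5-character IDs: 60,466,176 */
  p_unique = 1 ;           /* running probability that all draws so far are distinct */
  do k = 0 to 39 ;         /* 40 draws; draw k+1 must avoid the k IDs already used */
    p_unique = p_unique * (n_ids - k) / n_ids ;
  end;
  p_dup = 1 - p_unique ;   /* probability of at least one duplicate among the 40 */
  put n_ids= p_dup= ;      /* write both values to the SAS log */
run;
```

Under that assumption the probability is roughly 0.00001, so duplicates should be rare and 40 candidates leave ample margin for 25 observations.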
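2. Get rid of any duplicates in newids. We use NODUPKEY, which compares only the BY variable newid. NODUPLICATES compares entire records, so it would keep two records with the same newid because their ranord values differ.
PROC SORT DATA=newids NODUPKEY;
  BY newid ;
RUN; 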
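As a quick check on step 2 (a sketch that is not part of the original program), you can compare the number of records in newids with the number of distinct newid values. The column names n_rows and n_unique are made up for illustration.

```sas
proc sql;
  /* if n_rows and n_unique differ, duplicate newid values remain */
  select count(*)              as n_rows,
         count(distinct newid) as n_unique
  from newids ;
quit;
```

Both counts should be equal, and at least 25, before you go on to the merge.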
3. Scramble the order of newids, so that the sorted order of newid does not give away the identity of the observations.
PROC SORT DATA=newids ;
  BY ranord ;
RUN; 
4. Now, merge orig with newids. Because there is no BY statement, this is a one-to-one merge: the first observation of orig is paired with the first observation of newids, and so on. If id is missing, we have already matched all of the orig observations and the record is a leftover newid without an orig observation, so we delete it. For orig2, drop id and ranord so that the observations are now anonymous.

5. For crossref, keep id and newid so you can look up the identity of an observation if you need to. Keep crossref in a safe, secret place.
DATA orig2(DROP=id ranord) crossref(KEEP=id newid);
  MERGE orig newids ;
  IF (id = .) THEN DELETE ;
run; 
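If newids ever holds fewer records than orig (for example, after many duplicates were removed), the one-to-one merge leaves newid blank for the unmatched orig observations. The following sketch, which is not part of the original program, checks crossref for that case and writes a message to the log.

```sas
data _null_;
  set crossref ;
  /* a blank newid means this orig observation did not receive a new ID */
  if missing(newid) then put "WARNING: no newid assigned for " id= ;
run;
```

If any warnings appear, increase the number of candidate IDs made in step 1 and rerun the steps.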
Show the new version of the original data file, with newid. Because the IDs are random, your values will differ from those shown.
PROC PRINT DATA=orig2(obs=10);
RUN; 
OBS    AGE    NEWID
  1      3    QMB02
  2     32    1QXCR
  3     13    VO5FC
  4     16    4C63M
  5      4    2QQR8
  6      9    VT4O5
  7     43    W9IFN
  8     29    BHPJW
  9     43    B0LJQ
 10     47    QN0CC
Show cross reference file, with id and newid.
PROC PRINT DATA=crossref(obs=10);
RUN; 
OBS    ID    NEWID
  1     1    QMB02
  2     2    1QXCR
  3     3    VO5FC
  4     4    4C63M
  5     5    2QQR8
  6     6    VT4O5
  7     7    W9IFN
  8     8    BHPJW
  9     9    B0LJQ
 10    10    QN0CC
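If you later need to recover the original id for the anonymized records, crossref can be matched back to orig2 by newid. This is a minimal sketch and not part of the original program; the data set names orig2_sorted, crossref_sorted and reidentified are made up for illustration.

```sas
/* a match-merge with a BY statement needs both files sorted by the BY variable */
proc sort data=orig2 out=orig2_sorted ;
  by newid ;
run;

proc sort data=crossref out=crossref_sorted ;
  by newid ;
run;

/* attach the original id to each anonymized record */
data reidentified ;
  merge orig2_sorted crossref_sorted ;
  by newid ;
run;
```

This is why crossref must be kept secret: anyone who has both files can undo the anonymization.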

These pages are Copyrighted (c) by UCLA Academic Technology Services

