SAS Learning Module
Match merging data files in SAS

1. Introduction

When you have two data files, you can combine them by merging them side by side, matching up observations based on an identifier. For example, below we have a data file containing information on dads and we have a file containing information on family income called faminc. We would like to match merge the files together so we have the dads observation on the same line with the faminc observation based on the key variable famid.

dads 

famid name inc 
2     Art  22000 
1     Bill 30000 
3     Paul 25000 
faminc 

famid faminc96 faminc97 faminc98 
3     75000    76000    77000 
1     40000    40500    41000 
2     45000    45400    45800 

After match merging the files, they would look like this.

famid name    inc faminc96 faminc97 faminc98 
  1   Bill  30000    40000    40500    41000 
  2   Art   22000    45000    45400    45800 
  3   Paul  25000    75000    76000    77000 

2. One-to-one merge

There are three steps to match merge the dads file with the faminc file (this is called a one-to-one merge because there is a one to one correspondence between the dads and faminc records). These three steps are illustrated in the SAS program merge1.sas below.

  1. Use proc sort to sort dads on famid and save that file (we will call it dads2)
  2. Use proc sort to sort faminc on famid and save that file (we will call it faminc2)
  3. merge the dads2 and faminc2 files based on famid

These three steps are illustrated in the program below.

* We first created the dads and faminc data files below ; 

DATA dads; 
  INPUT famid name $ inc ; 
CARDS; 
2 Art  22000 
1 Bill 30000 
3 Paul 25000 
; 
RUN; 
DATA faminc; 
  INPUT famid faminc96 faminc97 faminc98 ; 
CARDS; 
3 75000 76000 77000 
1 40000 40500 41000 
2 45000 45400 45800 
* 1. Sort the dads file by "famid" & save sorted file as dads2 ; 
PROC SORT DATA=dads OUT=dads2; 
  BY famid; 
RUN; 
* 2. Sort faminc by "famid" & save sorted file as faminc2 ; 
PROC SORT DATA=faminc OUT=faminc2; 
  BY famid; 
RUN; 
* 3. Merge dads2 and faminc2 by famid in a data step ; 
DATA dadfam ; 
  MERGE dads2 faminc2; 
  BY famid; 
RUN; 
* Let's do a proc print and look at the results. ; 
PROC PRINT DATA=dadfam; 
RUN; 

The output of the program is shown below.

OBS    FAMID    NAME     INC     FAMINC96    FAMINC97    FAMINC98 

 1       1      Bill    30000      40000       40500       41000 
 2       2      Art     22000      45000       45400       45800 
 3       3      Paul    25000      75000       76000       77000 

The output from shows that the match merge worked properly. The dad and faminc are merged side by side. The next example considers a one-to-many merge where one observation in one file may have multiple matching records in another file. We will see that kind of merge is really no different from the one-to-one merge we saw here.

3. One-to-many merge

Imagine that we had a file with dads like we saw in the previous example, and we had a file with kids where a dad could have more than one kid. Matching up the "dads" with the "kids" is called a "one-to-many" merge since you are matching one dad observation to possibly many kids records. The dads and kids records are shown below.

dads 

famid name inc 
2     Art  22000 
1     Bill 30000 
3     Paul 25000 
kids 

famid kidname birth age wt sex 
1     Beth     1    9   60 f 
1     Bob      2    6   40 m 
1     Barb     3    3   20 f 
2     Andy     1    8   80 m 
2     Al       2    6   50 m 
2     Ann      3    2   20 f 
3     Pete     1    6   60 m 
3     Pam      2    4   40 f 
3     Phil     3    2   20 m 

After matching the dads with the kids you get a file that looks like the one below. Bill is matched up with his kids Beth, Bob and Barb; Art is matched up with Andy Al, and Ann; and Paul is matched up with Pete, Pam and Phil.

dadkid 

FAMID NAME   INC   KIDNAME  BIRTH  AGE  WT  SEX 

  1   Bill  30000   Beth      1     9   60   f 
  1   Bill  30000   Bob       2     6   40   m 
  1   Bill  30000   Barb      3     3   20   f 
  2   Art   22000   Andy      1     8   80   m 
  2   Art   22000   Al        2     6   50   m 
  2   Art   22000   Ann       3     2   20   f 
  3   Paul  25000   Pete      1     6   60   m 
  3   Paul  25000   Pam       2     4   40   f 
  3   Paul  25000   Phil      3     2   20   m 

Just like the "one-to-one" merge, we follow the same three steps for a "one-to-many" merge. These three steps are illustrated in the SAS program merge2.sas below.

  1. Use proc sort to sort dads on famid and save that file (we will call it dads2)
  2. Use proc sort to sort kids on famid and save that file (we will call it kids2)
  3. merge the dads2 and kids2 files based on famid

The program below illustrates these steps.

* first we make the "dads" data file ;
DATA dads; 
  INPUT famid name $ inc ; 
CARDS; 
2 Art  22000 
1 Bill   30000 
3 Paul  25000 
; 
RUN; 
* Next we make the "kids" data file ;
DATA kids; 
  INPUT famid kidname $ birth age wt sex $ ; 
CARDS; 
1 Beth 1 9 60 f 
1 Bob  2 6 40 m 
1 Barb 3 3 20 f 
2 Andy 1 8 80 m 
2 Al   2 6 50 m 
2 Ann  3 2 20 f 
3 Pete 1 6 60 m 
3 Pam  2 4 40 f 
3 Phil 3 2 20 m 
; 
RUN; 
* 1. sort "dads" on famid and save the sorted file as "dads2" ; 
PROC SORT DATA=dads OUT=dads2; 
  BY famid; 
RUN; 
* 2. sort "kids" on famid and save the sorted file as "kids2" ; 
PROC SORT DATA=kids OUT=kids2; 
  BY famid; 
RUN; 
* 3. merge "dads2" and "kids2" based on famid, creating "dadkid" ; 
DATA dadkid; 
  MERGE dads2 kids2; 
  BY famid; 
RUN; 
* Let's do a PROC PRINT of "dadkid" to see if the merge worked ; 
PROC PRINT DATA=dadkid; 
RUN; 

The output of the program is shown below.

OBS FAMID NAME   INC   KIDNAME BIRTH AGE WT SEX 

 1    1   Bill  30000   Beth     1    9  60  f
 2    1   Bill  30000   Bob      2    6  40  m 
 3    1   Bill  30000   Barb     3    3  20  f 
 4    2   Art   22000   Andy     1    8  80  m 
 5    2   Art   22000   Al       2    6  50  m 
 6    2   Art   22000   Ann      3    2  20  f 
 7    3   Paul  25000   Pete     1    6  60  m 
 8    3   Paul  25000   Pam      2    4  40  f 
 9    3   Paul  25000   Phil     3    2  20  m 

The output shows just what we hoped to see, the dads merged along side of their kids. You might have wondered what would have happened if the merge statement had reversed the order of the files, had we changed step 3 to look like below.

* 3. merge "dads2" and "kids2" based on famid, creating "dadkid" ; 
DATA dadkid; 
  MERGE kids2 dads2; 
  BY famid; 
RUN; 
* Let's do a PROC PRINT of "dadkid" see what happens ; 
PROC PRINT DATA=dadkid; 
RUN; 

The output with the modified step 3 is shown below.

OBS FAMID KIDNAME BIRTH AGE WT SEX NAME  INC   
 
 1    1    Beth     1    9  60  f  Bill 30000   
 2    1    Bob      2    6  40  m  Bill 30000   
 3    1    Barb     3    3  20  f  Bill 30000   
 4    2    Andy     1    8  80  m  Art  22000   
 5    2    Al       2    6  50  m  Art  22000  
 6    2    Ann      3    2  20  f  Art  22000   
 7    3    Pete     1    6  60  m  Paul 25000   
 8    3    Pam      2    4  40  f  Paul 25000  
 9    3    Phil     3    2  20  m  Paul 25000   

This output shows what happened when we switched the order of kids2 and dads2 in the merge statement. The merge results are basically the same, except that the order of the variables is modified -- the kids variables are on the left and the dads variables are at the right. Other than that, the results are the same.

4. Problems to look out for

These examples cover situations where there are no complications. We show some examples of complications that can arise and how you can solve them below.

4.1 Mismatching records in one-to-one merge

The two data files have may have records that do not match. Below we illustrate this by including an extra dad (Karl in famid 4) that does not have a corresponding family, and there are two extra families (5 and 6) in the family file that do not have a corresponding dad.
DATA dads;
 INPUT famid name $ inc;
DATALINES;
2 Art  22000
1 Bill 30000
3 Paul 25000
4 Karl 95000
;
RUN;

DATA faminc;
 INPUT famid faminc96 faminc97 faminc98;
DATALINES;
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
5 55000 65000 70000
6 22000 24000 28000
;
RUN;

PROC SORT DATA=dads;
 BY famid;
RUN;

PROC SORT DATA=faminc;
 BY famid;
RUN;

DATA merge121;
  MERGE dads(IN=fromdadx) faminc(IN=fromfamx);
  BY famid;
  fromdad = fromdadx;
  fromfam = fromfamx;
RUN;
As you see above, we use the in option to create a 0/1 variable fromdadx that indicates whether the resulting file contains a record with data from the dads file. Likewise, we use IN option to create a 0/1 variable fromfamx that indicates if the observation came from the faminc file. The fromdadx and fromfamx variables are temporary, so we make copies of them in fromdad and fromfam so we have copies of these variables that stay with the file. We can then use proc print and proc freq to identify the mismatching records.
PROC PRINT DATA=merge121;
RUN;

PROC FREQ DATA=merge121;
TABLES fromdad*fromfam;
RUN;

The output below illustrates that there were mismatching records. For famid 4, the value of fromdad is 1 and fromfam is 0, as we would expect since there was data from dads for famid 4, but no data from faminc. Also, as we expect, this record has valid data for the variables from the dads file (name and inc) and missing data for the variables from faminc (faminc96 faminc97 and faminc98). We see the reverse pattern for famid's5 and 6.
OBS   FAMID   NAME    INC    FAMINC96   FAMINC97   FAMINC98   FROMDAD   FROMFAM

 1      1     Bill   30000     40000      40500      41000       1         1   
 2      2     Art    22000     45000      45400      45800       1         1   
 3      3     Paul   25000     75000      76000      77000       1         1   
 4      4     Karl   95000         .          .          .       1         0   
 5      5                .     55000      65000      70000       0         1   
 6      6                .     22000      24000      28000       0         1   
A closer look at the fromdad and fromfam variables reveals that there are three records that have matching data: one that has data from the dads only, and two records that have data from the faminc file only. The crosstab table below confirms this.
TABLE OF FROMDAD BY FROMFAM

FROMDAD     FROMFAM

Frequency|
Percent  |
Row Pct  |
Col Pct  |       0|       1|  Total
---------+--------+--------+
       0 |      0 |      2 |      2
         |   0.00 |  33.33 |  33.33
         |   0.00 | 100.00 |
         |   0.00 |  40.00 |
---------+--------+--------+
       1 |      1 |      3 |      4
         |  16.67 |  50.00 |  66.67
         |  25.00 |  75.00 |
         | 100.00 |  60.00 |
---------+--------+--------+
Total           1        5        6
            16.67    83.33   100.00
You may want to use this strategy to check the matching of the two files. If there are unexpected mismatched records, then you should investigate to understand the cause of the mismatched records.

Use the where statement in a proc print to eliminate some of the non-matching records.

4.2 Variables with the same name, but different information

Below we have the files with the information about the dads and family, but look more closely at the names of the variables. In the dads file, there is a variable called inc98, and in the family file there are variables inc96, inc97 and inc98. Let's attempt to merge these files and see what happens.
DATA dads;
 INPUT famid name $ inc98;
DATALINES;
2 Art  22000
1 Bill 30000
3 Paul 25000
;
RUN;

DATA faminc;
 INPUT famid inc96 inc97 inc98;
DATALINES;
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
;
RUN;

PROC SORT DATA=dads;
 BY famid;
RUN;

PROC SORT DATA=faminc;
 BY famid;
RUN;

DATA merge121;
 MERGE faminc dads;
 BY famid;
RUN;
PROC PRINT DATA=merge121;
RUN;
The results are shown below. As you see, the variable inc98 has the data from the dads file, the file that appears last on the merge statement. When you merge files that have the same variable, SAS will use the values from the file that appears last on the merge statement.
OBS    FAMID    INC96    INC97    INC98    NAME

 1       1      40000    40500    30000    Bill
 2       2      45000    45400    22000    Art 
 3       3      75000    76000    25000    Paul
There are a couple of ways you can solve this problem.

Solution #1. The most obvious solution is to choose variable names in the original files that will not conflict with each other. However, you may have files where the names have already been chosen.

Solution #2. You can rename the variables in a data step using the rename option (which renames the variables before doing the merging). This allows you to select variable names that do not conflict with each other, as illustrated below.

DATA merge121;
MERGE faminc(RENAME=(inc96=faminc96 inc97=faminc97 inc98=faminc98))
dads(RENAME=(inc98=dadinc98));
BY famid;
RUN;

PROC PRINT DATA=merge121;
RUN;

As you can see below, the variables were renamed as specified.
OBS    FAMID    FAMINC96    FAMINC97    FAMINC98    NAME    DADINC98

 1       1        40000       40500       41000     Bill      30000 
 2       2        45000       45400       45800     Art       22000 
 3       3        75000       76000       77000     Paul      25000 

5. For more information

How to cite this page

Report an error on this page or leave a comment

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.