### SPSS Learning Module Match merging data files

#### 1. Introduction

When you have two data files, you can combine them by merging them side by side, matching up observations based on an identifier.  For example, below we have a file containing dads and we have a file containing faminc.  We would like to match merge the files together so we have the dads observation on the same line with the faminc observation based on the key variable famid.

dads

famid name inc
2     Art  22000
1     Bill 30000
3     Paul 25000

faminc

famid faminc96 faminc97 faminc98
3     75000    76000    77000
1     40000    40500    41000
2     45000    45400    45800 

After match merging the dads and faminc, the data would look like this.

famid name    inc faminc96 faminc97 faminc98
1   Bill  30000    40000    40500    41000
2   Art   22000    45000    45400    45800
3   Paul  25000    75000    76000    77000 

#### 2. One-to-one merge

Let's start by creating the files that we will be merging.  Below we create the files dads.sav and faminc.sav.

DATA LIST LIST / famid * name (A4) inc.
BEGIN DATA.
2 Art  22000
1 Bill 30000
3 Paul 25000
END DATA.

LIST.

DATA LIST LIST / famid faminc96 faminc97 faminc98.
BEGIN DATA.
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
END DATA.

LIST.
SAVE OUTFILE = "faminc.sav". 

The output of these statements is shown below, confirming that we have read the data properly.

FAMID NAME      INC
2.00 Art  22000.00
1.00 Bill 30000.00
3.00 Paul 25000.00

FAMID FAMINC96 FAMINC97 FAMINC98
3.00 75000.00 76000.00 77000.00
1.00 40000.00 40500.00 41000.00
2.00 45000.00 45400.00 45800.00

There are three steps to match merge dads.sav with faminc.sav.  (Note that this is a one to one merge because there is a one to one correspondence between the dads and faminc records.)  These three steps are illustrated below.

1. Use SORT CASES to sort dads on famid and save that file (we will call it dads2.sav)
2. Use SORT CASES to sort faminc on famid and save that file (we will call it faminc2.sav)
3. Use MATCH FILES to merge the dads2.sav and faminc2.sav files based on famid

Below we show the commands for performing the merge.

GET FILE="dads.sav".
SORT CASES BY famid.

GET FILE="faminc.sav".
SORT CASES BY famid.
SAVE OUTFILE="faminc2.sav".

/FILE="faminc2.sav"
/BY famid.

LIST.

The output below shows that the match merge worked properly.

FAMID NAME      INC FAMINC96 FAMINC97 FAMINC98
1.00 Bill 30000.00 40000.00 40500.00 41000.00
2.00 Art  22000.00 45000.00 45400.00 45800.00
3.00 Paul 25000.00 75000.00 76000.00 77000.00

#### 3. One-to-many merge

The next example considers a one to many merge where one observation in one file may have multiple matching records in another file.  Imagine that we had a file with dads like we saw in the previous example, and we had a file with kids where a dad could have more than one kid.  You see why this is called a one to many merge since you are matching one dad observation to one or more (many) kids observations.  Remember that the dads file is the file with one observation, and the kids file is the one with many observations.  Below, we create the data file for the dads and for the kids.

DATA LIST LIST / famid * name (A4) inc .
BEGIN DATA.
2 Art  22000
1 Bill 30000
3 Paul 25000
END DATA.

LIST.

DATA LIST LIST / famid * kidname (A4) birth age wt * sex (A1).
BEGIN DATA.
1 Beth 1 9 60 f
1 Bob  2 6 40 m
1 Barb 3 3 20 f
2 Andy 1 8 80 m
2 Al   2 6 50 m
2 Ann  3 2 20 f
3 Pete 1 6 60 m
3 Pam  2 4 40 f
3 Phil 3 2 20 m
END DATA.

LIST.
SAVE OUTFILE="kids.sav".

As you see below, the steps for doing a one to many merge is similar to the one to one merge that we saw above.

1. Use SORT CASES BY to sort dads on famid and save that file (we will call it dads2)
2. Use SORT CASES BY to sort kids on famid and save that file (we will call it kids2)
3. Use MATCH FILES to merge the dads2 and kids2 files.  However, since the dads file is the file with one observation, use /TABLE="dads2.sav", not /FILE="dads2.sav" to specify the dads file.
GET FILE="dads.sav".
SORT CASES BY famid.

GET FILE="kids.sav".
SORT CASES BY famid.
SAVE OUTFILE="kids2.sav".

MATCH FILES FILE="kids2.sav" /TABLE="dads2.sav" /BY famid.
LIST.

The output below shows that this merge worked as we hoped.

FAMID KIDNAME    BIRTH      AGE       WT SEX NAME      INC
1.00 Beth        1.00     9.00    60.00 f   Bill 30000.00
1.00 Bob         2.00     6.00    40.00 m   Bill 30000.00
1.00 Barb        3.00     3.00    20.00 f   Bill 30000.00
2.00 Andy        1.00     8.00    80.00 m   Art  22000.00
2.00 Al          2.00     6.00    50.00 m   Art  22000.00
2.00 Ann         3.00     2.00    20.00 f   Art  22000.00
3.00 Pete        1.00     6.00    60.00 m   Paul 25000.00
3.00 Pam         2.00     4.00    40.00 f   Paul 25000.00
3.00 Phil        3.00     2.00    20.00 m   Paul 25000.00

The key difference between a one to one merge and a one to many merge is that you need to use /TABLE="dads2.sav" instead of /FILE="dads2.sav".  For your data, when you do a one to many merge, ask yourself which file plays the role of one (in one to many).  For that file, use /TABLE= instead of /FILE=.

Let's intentionally make an error and use /FILE="dads2.sav"and see what SPSS does.

MATCH FILES
/FILE="kids2.sav"
/BY famid.
LIST.

The first thing we notice is that SPSS gives us the warning shown below.  This is telling us that there are multiple kids for a given dad.

Warning # 5132
Duplicate key in a file.  The BY variables do not uniquely identify each
case on the indicated file.  Please check the results carefully.

As SPSS advises, we will inspect the results carefully.  Indeed, we see the results are not what we desired.  When there were multiple kids per dad, it only merged the dad with the first kid, and then the following kids with the same dads were assigned missing values for the dads information (name and inc).  When we used the /TABLE= subcommand in the previous example, SPSS carried the dads information across all of the kids.

FAMID KIDNAME    BIRTH      AGE       WT SEX NAME      INC
1.00 Beth        1.00     9.00    60.00 f   Bill 30000.00
1.00 Bob         2.00     6.00    40.00 m             .
1.00 Barb        3.00     3.00    20.00 f             .
2.00 Andy        1.00     8.00    80.00 m   Art  22000.00
2.00 Al          2.00     6.00    50.00 m             .
2.00 Ann         3.00     2.00    20.00 f             .
3.00 Pete        1.00     6.00    60.00 m   Paul 25000.00
3.00 Pam         2.00     4.00    40.00 f             .
3.00 Phil        3.00     2.00    20.00 m             .

#### 4. Ordering the variables in the new file

You can use the /MAP subcommand with the ADD FILES command to see the order of the variables in the new file, as illustrated below. If you would like to rearrange the order of the variables in the new file, you can also add the /KEEP subcommand to the ADD FILES command. The variables will be ordered in the new file in the order that you list them on the /KEEP subcommand. If you do not list all of the variables on the /KEEP subcommand, the variables not listed will not be present in the new file.  Also note that you can list the first few variables if they are the only ones that need to be reordered, and then use the keyword ALL to have the rest of the variables included in the new file. The variables not specified on the /KEEP subcommand will remain the order in which they are in the original files.

DATA LIST LIST / famid * name (A4) inc.
BEGIN DATA.
2 Art  22000
1 Bill 30000
3 Paul 25000
END DATA.

DATA LIST LIST / famid * name (A4) inc.
BEGIN DATA.
1 Bess 15000
3 Pat  50000
2 Amy  18000
END DATA.

SAVE OUTFILE = "moms.sav".
ADD FILES FILE="dads.sav"
/FILE="moms.sav"
/KEEP = name ALL
/MAP.
EXECUTE.
Map of the result file
Result Input1 Input2
------ ------ ------
NAME NAME NAME
FAMID FAMID FAMID
INC INC INC

As you can see, the variables in the new file are now in the order name, famid inc.

#### 5.1 Mismatching records in one-to-one merge

The two data files have may have records that do not match.  Below we illustrate this by including an extra dad (Karl in famid 4) who does not have a corresponding family, and there are two extra families (5 and 6) in the family file that do not have a corresponding dad.

DATA LIST LIST / famid * name (A4) inc.
BEGIN DATA.
2 Art  22000
1 Bill 30000
3 Paul 25000
4 Karl 95000
END DATA.
SORT CASES BY famid.

DATA LIST LIST / famid faminc96 faminc97 faminc98.
BEGIN DATA.
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
5 55000 65000 70000
6 22000 24000 28000
END DATA.
SORT CASES BY famid.
SAVE OUTFILE = "faminc.sav".

/FILE="faminc.sav" /IN=fromfam
/BY famid.
LIST.

CROSSTABS
/TABLES= fromdad BY fromfam.

As you see above, we use /IN=fromdad to create a 0/1 variable that indicates whether the resulting file contains a record with data from the dads file.  Likewise, we use /IN=fromfam to indicate if the resulting file has a record from the faminc file.  The LIST and CROSSTABS then show us about the mismatching records.

The output from the LIST command shows us that when there were mismatching records.  For famid 4, the value of  fromdad is 1 and fromfam is 0, as we would expect since there was data from dads for famid 4, but no data from faminc.  Also, as we expect, this record has valid data for the variables from the dads file (name and inc) and missing data for the variables from faminc (faminc96 faminc97 and faminc98).  We see the reverse pattern for famid 5 and 6.

FAMID NAME      INC FAMINC96 FAMINC97 FAMINC98 FROMDAD FROMFAM
1.00 Bill 30000.00 40000.00 40500.00 41000.00    1       1
2.00 Art  22000.00 45000.00 45400.00 45800.00    1       1
3.00 Paul 25000.00 75000.00 76000.00 77000.00    1       1
4.00 Karl 95000.00      .        .        .      1       0
5.00           .   55000.00 65000.00 70000.00    0       1
6.00           .   22000.00 24000.00 28000.00    0       1

If we look at the fromdad and fromfam variables, we can see that there are three records that have matching data, one that has data from the dads only, and two records that have data from the faminc file only.  The crosstab below shows us the same results, and is an easier way of tallying the matching than manually tallying the matching.

FROMDAD  by  FROMFAM

FROMFAM     Page 1 of 1
Count  |
|
|                Row
|     0|     1| Total
0  |      |     2|     2
|      |      |  33.3
+------+------+
1  |     1|     3|     4
|      |      |  66.7
+------+------+
Column       1      5      6
Total    16.7   83.3  100.0

When matching files, we suggest that you use this strategy to check the matching of the two files.  If there are unexpected mismatched records, then you should investigate to understand the cause of the mismatched records.

You can use SELECT IF to eliminate some of the non-matching records.  For example, if you wanted to keep just the records where the dads matched with the family information, you could type

SELECT IF fromdad AND fromfam.
LIST.

The results are shown below, including just the three matching records.

FAMID NAME      INC FAMINC96 FAMINC97 FAMINC98 FROMDAD FROMFAM
1.00 Bill 30000.00 40000.00 40500.00 41000.00    1       1
2.00 Art  22000.00 45000.00 45400.00 45800.00    1       1
3.00 Paul 25000.00 75000.00 76000.00 77000.00    1       1

#### 5.2 Mismatching records in one-to-many merge

SPSS handles the inclusion of mismatched records in a one to-many merge differently than a one-to-one merge.  Remember that in a one-to-many merge, there is a file that has one observation that matches to many observations in the other file; let us refer to these as the one file and the many file.  If there are observations in the one file that do not match to the many file, then these observations will not appear in the merged file at all.  If there are observations in the many file that do not match the one file, those records will appear in the merged file.  If this is what you desire, then you can merge the files as illustrated in Section 3, and use the /IN= as illustrated in the prior section to track the matching.  However, if you would like mismatched records from the one and many file to both appear in the merged file, then you can use the matching strategy outlined below.

Below we use our example to merge dads with kids, and in this example we have mismatched records in both files.  Below we match the files to include all mismatched records in the merged file.  The parts that are different are indicated in red.

DATA LIST LIST / famid * name (A4) inc.
BEGIN DATA.
2 Art  22000
1 Bill 30000
3 Paul 25000
4 Karl 95000
END DATA.

DATA LIST LIST / famid * kidname (A4) birth age wt * sex (A1).
BEGIN DATA.
1 Beth 1 9 60 f
1 Bob  2 6 40 m
1 Barb 3 3 20 f
2 Andy 1 8 80 m
2 Al   2 6 50 m
2 Ann  3 2 20 f
3 Pete 1 6 60 m
3 Pam  2 4 40 f
3 Phil 3 2 20 m
5 Ted  1 4 35 m
5 Tess 2 2 18 f
END DATA.
SAVE OUTFILE="kids.sav".

GET FILE="kids.sav".
SORT CASES BY famid.
SAVE OUTFILE="kids2.sav".

SORT CASES BY famid.

MATCH FILES FILE="kids2.sav" /IN=inkid
/BY famid .
LIST.
SAVE OUTFILE="temp.sav".

MATCH FILES FILE="temp.sav"
/BY famid.

LIST.

The section in red adds an extra step to the matching.  The purpose of this step is to add any values of famid that are only in the dads file to the kids file.  It does by doing a one-to-one merge between dadid and the kids and saves that file as temp.  Since dadid just the famid of all of the dads, this merge basically adds observations for any famid that is in the dads file but not in the kids file, and saves this as temp. Then, we can then merge temp with dads2 and temp will have a famid for every observation in the dads2 file. This assures that the resulting file will include all observations from the dads file, even if they do not have a matching record in the kids file.  The result is shown below.  Indeed, the file contains the observation for the dad Karl who does not have any matching kids.  If we omitted the extra code in this step, that record would not have been included in this file.

FAMID KIDNAME    BIRTH      AGE       WT SEX INKID NAME      INC DADS
1.00 Beth        1.00     9.00    60.00 f     1   Bill 30000.00   1
1.00 Bob         2.00     6.00    40.00 m     1   Bill 30000.00   1
1.00 Barb        3.00     3.00    20.00 f     1   Bill 30000.00   1
2.00 Andy        1.00     8.00    80.00 m     1   Art  22000.00   1
2.00 Al          2.00     6.00    50.00 m     1   Art  22000.00   1
2.00 Ann         3.00     2.00    20.00 f     1   Art  22000.00   1
3.00 Pete        1.00     6.00    60.00 m     1   Paul 25000.00   1
3.00 Pam         2.00     4.00    40.00 f     1   Paul 25000.00   1
3.00 Phil        3.00     2.00    20.00 m     1   Paul 25000.00   1
4.00              .        .        .         0   Karl 95000.00   1
5.00 Ted         1.00     4.00    35.00 m     1             .     0
5.00 Tess        2.00     2.00    18.00 f     1             .     0

#### 5.3 Variables with the same name, but different information

Below we have the files with the information about the dads and family, but look more closely at the names of the variables.  In the dads file, there is a variable called inc98, and in the family file there are variables inc96, inc97 and inc98.  Let's go ahead and merge these files and see what SPSS does.

DATA LIST LIST / famid * name (A4) inc98.
BEGIN DATA.
2 Art  22000
1 Bill 30000
3 Paul 25000
END DATA.

DATA LIST LIST / famid inc96 inc97 inc98.
BEGIN DATA.
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
END DATA.
SAVE OUTFILE = "faminc.sav".

SORT CASES BY famid.

GET FILE="faminc.sav".
SORT CASES BY famid.
SAVE OUTFILE="faminc2.sav".

/FILE="faminc2.sav"
/BY famid.
LIST.

The results are shown below. As you see, the variable inc98 has the data from the dads file, the file that appeared first in the MATCH FILES command.  When you match files that have the same variable, SPSS will use the values from the file that appears earliest in the MATCH FILES command.

FAMID NAME    INC98    INC96    INC97
1.00 Bill 30000.00 40000.00 40500.00
2.00 Art  22000.00 45000.00 45400.00
3.00 Paul 25000.00 75000.00 76000.00

There are a couple of ways you can solve this problem.

Solution #1. The most obvious solution is to choose variable names in the original files that will not conflict with each other.  However, you may receive files where the names have already been chosen.

Solution #2. You can rename the variables in the MATCH FILES command (which renames the variables before doing the matching).  This allows you to select variable names that do not conflict with each other, as illustrated below.

GET FILE="dads.sav".
SORT CASES BY famid.

GET FILE="faminc.sav".
SORT CASES BY famid.
SAVE OUTFILE="faminc2.sav".

/FILE="faminc2.sav" /RENAME=(inc96 inc97 inc98 = faminc96 faminc97 faminc98)
/BY famid.
LIST.

As you can see below, the variables were renamed as we specified.

FAMID NAME DADINC98 FAMINC96 FAMINC97 FAMINC98
1.00 Bill 30000.00 40000.00 40500.00 41000.00
2.00 Art  22000.00 45000.00 45400.00 45800.00
3.00 Paul 25000.00 75000.00 76000.00 77000.00

#### 5.4 The same variables with different dictionary information

This problem is similar to the one outlined above. In this example, we have two variables with the same name and the same information, but with different dictionary information associated with them. This dictionary information could include value labels and/or variable labels. As with the example above, SPSS will take the information from the file listed first in the MATCH FILES command. No error or warning message will be issued to let you know that the information from the variable in the later file has been lost. The solution to this problem is to list first in the MATCH FILES command the file with the dictionary information that you want in the resulting file.

#### 5.5 You have run the ADD FILES command, and nothing happened

If you run just the ADD FILES command, as shown below, SPSS will not do anything. However,  you will see a note in the lower right corner of the data editor saying "transformation pending".

ADD FILES FILE="dads.sav"
/FILE="moms.sav"

Solution: The solution is to add either the execute command or a procedure command that will force the execution of the transformation, such as the list command or the crosstab command.

ADD FILES FILE="dads.sav"
/FILE="moms.sav"
execute.