### Stata Learning Module Combining data

This module will illustrate how you can combine files in Stata. Examples will include appending files, one to one match merging, and one to many match merging.

#### Appending data files

When you have two data files, you may want to combine them by stacking them one on top of the other. For example, we have a file containing dads and a file containing moms as shown below.

input famid str4 name inc
2 "Art" 22000
1 "Bill" 30000
3 "Paul" 25000
end
list 
         famid       name        inc
1.         2        Art      22000
2.         1       Bill      30000
3.         3       Paul      25000   
clear
input famid str4 name inc
1 "Bess" 15000
3 "Pat" 50000
2 "Amy" 18000
end
save moms, replace
list 
         famid       name        inc
1.         1       Bess      15000
2.         3        Pat      50000
3.         2        Amy      18000   

If we wanted to combine these files by stacking them one atop the other, we can use the append command as shown below.

use dads, clear
append using moms

We can use the list command to see if this worked correctly.

list
         famid       name        inc
1.         2        Art      22000
2.         1       Bill      30000
3.         3       Paul      25000
4.         1       Bess      15000
5.         3        Pat      50000
6.         2        Amy      18000   

The append worked properly... the dads and moms are stacked together in one file. But, there is a little problem. We can't tell the dads from the moms. Let's try doing this again, but first we will create a variable called momdad in the dads and moms data file which will contain dad for the dads data file and mom for the moms data file. When we combine the two files together, the momdad variable will tell us who the moms and dads are.

Here we make momdad variable for the dads data file. We save the file calling it dads1.

use dads, clear
generate str3 momdad = "dad"
save dads1
 file dads1.dta saved

Here we make momdad variable for the moms data file. We save the file calling it moms1.

use moms, clear
generate str3 momdad = "mom"
save moms1
 file moms1.dta saved

Now, let's append dads1 and moms1 together.

use dads1, clear
append using moms1

Now, when we list the data the momdad variable shows who the moms and dads are.

list
         famid       name        inc     momdad
4.         1       Bess      15000        mom
5.         3        Pat      50000        mom
6.         2        Amy      18000        mom   

#### Match merging

Another way of combining data files is match merging. Say that we wanted to combine the dads with the faminc data file, having the dads information and the family information side by side. We can do this with a match merge.

Let's have a look at the dads and faminc file.

use dads, clear
list
         famid       name        inc
1.         2        Art      22000
2.         1       Bill      30000
3.         3       Paul      25000   
clear
input famid faminc96 faminc97 faminc98
3 75000 76000 77000
1 40000 40500 41000
2 45000 45400 45800
end
save faminc, replace
list 
         famid   faminc96   faminc97   faminc98
1.         3      75000      76000      77000
2.         1      40000      40500      41000
3.         2      45000      45400      45800   

We want to combine the data files so they look like this.

famid name  inc    faminc96 faminc97 faminc98
1     Bill  30000  40000    40500    41000
2     Art   22000  45000    45400    45800
3     Paul  25000  75000    76000    77000

Notice that the famid variable is used to associate the observation from the dads file with the appropriate observation from the faminc file.  The strategy for merging the files goes like this.
1. sort dads on famid and save that file (calling it dads2).
2. sort faminc on famid and save that file (calling it faminc2).
4. merge the dads2 file with the faminc2 file using famid to match them.

Here are those four steps.

1. Sort the dads file by famid and save it as dads2

use dads, clear
sort famid
save dads2
 file dads2.dta saved

2. Sort the faminc file by famid and save it as faminc2.

use faminc, clear
sort famid
save faminc2
 file faminc2.dta saved

use dads2, clear

4. Merge with the faminc2 file using famid as the key variable.

merge famid using faminc2

It seems like this worked just fine, but what is that _merge variable?

list, nodisplay noobs
     famid       name        inc   faminc96   faminc97   faminc98    _merge
1       Bill      30000      40000      40500      41000         3
2        Art      22000      45000      45400      45800         3
3       Paul      25000      75000      76000      77000         3   

The _merge variable indicates, for each observation, how the merge went. This is useful for identifying mismatched records. _merge can have one of three values
1 - The record contains information from file1 only (e.g., a dad2 record with no corresponding faminc2 record.
2 - The record contains information from file2 only (e.g., a faminc2 record with no corresponding dad2 record.
3 - The record contains information from both files (e.g., the dad2 and faminc2 records matched up).

When you have many records, tabulating _merge is very useful to summarize how many mismatched you have. In our case, all of the records match so the value for _merge was always 3.

tabulate _merge
     _merge |      Freq.     Percent        Cum.
------------+-----------------------------------
3 |          3      100.00      100.00
------------+-----------------------------------
Total |          3      100.00 

#### One-to-many match merging

Another kind of merge is called a one to many merge. Our one to one merge matched up dads and faminc and there was a one to one matching of the files. If we merge dads with kids, there can be multiple kids per dad and hence this is a one to many merge.

As you see below, the strategy for the one to many merge is really the same as the one to one merge.

2. sort kids on famid and save that file as kids3
4. merge the dads3 file with the kids3 file using famid to match them.

The 4 steps are shown below.

1. Sort the dads data file on famid and save that file as dads3.

use dads, clear
sort famid
save dads3
 file dads3.dta saved
list
         famid       name        inc
1.         1       Bill      30000
2.         2        Art      22000
3.         3       Paul      25000   

2. Sort the kids data file on famid and save that file as kids3.

clear
input famid str4 kidname birth age wt str1 sex
1 "Beth" 1 9 60 "f"
2 "Andy" 1 8 40 "m"
3 "Pete" 1 6 20 "f"
1 "Bob" 2 6 80 "m"
1 "Barb" 3 3 50 "m"
2 "Al" 2 6 20 "f"
2 "Ann" 3 2 60 "m"
3 "Pam" 2 4 40 "f"
3 "Phil" 3 2 20 "m"
end

sort famid

save kids3
file kids3.dta saved

list
famid    kidname      birth        age         wt        sex
1.         1       Beth          1          9         60          f
2.         1        Bob          2          6         40          m
3.         1       Barb          3          3         20          f
4.         2       Andy          1          8         80          m
5.         2         Al          2          6         50          m
6.         2        Ann          3          2         20          f
7.         3       Pete          1          6         60          m
8.         3        Pam          2          4         40          f
9.         3       Phil          3          2         20          m   

use dads3, clear

4. Merge the dads3 file with the kids3 file using famid to match them.

merge famid using kids3

Let's list out the results.

list famid name kidname birth age _merge
       famid   name   kidname   birth   age   _merge
1.       1   Bill      Barb       3     3        3
2.       2    Art        Al       2     6        3
3.       3   Paul       Pam       2     4        3
4.       1   Bill       Bob       2     6        3
5.       1   Bill      Beth       1     9        3
6.       2    Art      Andy       1     8        3
7.       2    Art       Ann       3     2        3
8.       3   Paul      Phil       3     2        3
9.       3   Paul      Pete       1     6        3  

The results are a bit easier to read if we sort the data on famid and birth.

sort famid birth
list famid name kidname birth age _merge
         famid       name    kidname      birth        age    _merge
1.         1       Bill       Beth          1          9         3
2.         1       Bill        Bob          2          6         3
3.         1       Bill       Barb          3          3         3
4.         2        Art       Andy          1          8         3
5.         2        Art         Al          2          6         3
6.         2        Art        Ann          3          2         3
7.         3       Paul       Pete          1          6         3
8.         3       Paul        Pam          2          4         3
9.         3       Paul       Phil          3          2         3   

As you see, this is basically the same as a one to one merge. You may wonder if the order of the files on the merge statement is relevant. Here, we switch the order of the files and the results are the same. The only difference is the order of the records after the merge.

use kids3, clear
merge famid using dads3
list famid name kidname birth age
         famid       name    kidname      birth        age
1.         1       Bill       Beth          1          9
2.         1       Bill        Bob          2          6
3.         1       Bill       Barb          3          3
4.         2        Art       Andy          1          8
5.         2        Art         Al          2          6
6.         2        Art        Ann          3          2
7.         3       Paul       Pete          1          6
8.         3       Paul        Pam          2          4
9.         3       Paul       Phil          3          2   

#### Summary

Appending data example

use dads, clear
append using moms

Match merge example steps (one-to-one and one-to-many)

1. sort dads on famid and save that file
2. sort kids on famid and save that file
4. merge the dads file with the kids file using  famid to match them.

Match merge example program

use dads, clear
sort famid

use faminc, clear
sort famid
save faminc2

merge famid using faminc2