UCLA Academic Technology Services HomeServicesClassesContactJobs
Search

SAS Learning Module
Concatenating data files in SAS

1. Introduction

When you have two data files, you may want to combine them by stacking them one on top of the other (referred to as concatenating files). Below we have a file called dads and a file containing moms.

dads 

famid name inc 
2     Art  22000 
1     Bill 30000 
3     Paul 25000  

moms 

famid name inc 
1     Bess 15000 
3     Pat  50000 
2     Amy  18000  

Below we have stacked (concatenated) these files creating a file we called momdad. These examples will show how to concatenate files in SAS.

momdad 

famid name inc 
2     Art  22000 
1     Bill 30000 
3     Paul 25000 
1     Bess 15000 
3     Pat  50000 
2     Amy  18000  

2. Concatenating the moms and dads

The SAS program below creates a SAS data file called dads and a file called moms. It then combines them (concatenates them) creating a file called dadmom.

* Here is a file with information about dads with their family id name and income ; 

DATA dads; 
  INPUT famid name $ inc ; 
CARDS; 
2 Art  22000 
1 Bill 30000 
3 Paul 25000 
; 
RUN; 

* Here is a file with information about moms with their family id name and income ; 

DATA moms; 
  INPUT famid name $ inc ; 
CARDS; 
1 Bess 15000 
3 Pat  50000 
2 Amy  18000 
; 
RUN; 

* We can combine these files by stacking them one on top the other ; 
* by setting them both together in the same data step as shown below ; 

DATA dadmom; 
  SET dads moms; 
RUN; 

 * Let's use PROC PRINT to look at the result ; 

PROC PRINT DATA=dadmom; 
RUN;  

The output of this program is shown below.

OBS    FAMID    NAME     INC 

1       2      Art     22000 
2       1      Bill    30000 
3       3      Paul    25000 
4       1      Bess    15000 
5       3      Pat     50000 
6       2      Amy     18000

The output from this program shows that the files were combined properly. The dads and moms are stacked together in one file. But, there is a little problem. We can't tell the dads from the moms. Let's try doing this again but in such a way that we can tell which observations are the moms and which are the dads.

3. Concatenating the moms and dads, a better example

In order to tell the dads from the moms, let's create a variable called momdad in the dads and moms data files that will contain dad for the dads data file and mom for the moms data file. When we combine the two files together the momdad variable will tell us who the moms and dads are.

DATA dads; 
  INPUT famid name $ inc ; 
  momdad = "dad";  
CARDS; 
2 Art  22000 
1 Bill   30000 
3 Paul  25000 
; 
RUN;  
DATA moms; 
  INPUT famid name $ inc ; 
  momdad = "mom";  
CARDS; 
1 Bess  15000 
3 Pat   50000 
2 Amy  18000 
; 
RUN;  
DATA dadmom; 
  SET dads moms; 
RUN;  
* Now when we do the proc print you can see the dads from the moms ;  
PROC PRINT DATA=dadmom; 
RUN;  

The output of this program is shown below.

OBS    FAMID    NAME     INC     MOMDAD

 1       2      Art     22000     dad 
 2       1      Bill    30000     dad 
 3       3      Paul    25000     dad 
 4       1      Bess    15000     mom 
 5       3      Pat     50000     mom 
 6       2      Amy     18000     mom  

Here we get a more desirable result, because we can tell the dads from the moms by looking at the variable momdad. This required some thinking ahead because we had to put momdad in both the dads data file and the moms data file before we merged the data files.

4. Problems to look out for

These above examples cover situations where there are no complications. However, look out for the following problems.

4.1. The two data files have different variable names for the same thing

For example, income is called dadinc and in the dads file and called mominc in the moms file, as shown below.
DATA dads;
 INPUT famid name $  dadinc ;
DATALINES;
2 Art  22000
1 Bill 30000
3 Paul 25000
;
RUN;

DATA moms;
 INPUT famid name $  mominc ;
DATALINES;
1 Bess 15000
3 Pat  50000
2 Amy  18000
;
RUN;

DATA momdad;
 SET dads(IN=dad) moms(IN=mom);
 IF dad=1 THEN momdad="dad";
 IF mom=1 THEN momdad="mom";
run;
PROC PRINT DATA=momdad;
RUN;
You can see the problem illustrated below.
 OBS    FAMID    NAME    DADINC    MOMINC    DAD    MOM    MOMDAD

 1       2      Art      22000         .     1      0      dad
 2       1      Bill     30000         .     1      0      dad
 3       3      Paul     25000         .     1      0      dad
 4       1      Bess         .     15000     0      1      mom
 5       3      Pat          .     50000     0      1      mom
 6       2      Amy          .     18000     0      1      mom 
Solution #1. The most obvious solution is to choose appropriate variable names for the original files (i.e., name the variable inc in both the moms and dads file). This solution is not always possible since you might be concatenating files that you did not originally create. To save space, we omit illustrating this solution.

Solution #2. If solution #1 is not possible, then this problem can be addressed using an if statement in a data step.

DATA momdad;
  SET dads(IN=dad) moms(IN=mom);
  IF dad=1 THEN
  DO;
    momdad="dad";
    inc=dadinc;
  END;
  IF mom=1 THEN
  DO;
    momdad="mom";
    inc=mominc;
  END;
RUN;

PROC PRINT DATA=momdad;
RUN;
The results are shown below, where inc now has the income for both the moms and dads.
OBS    FAMID    NAME    DADINC    MOMINC    DAD    MOM    MOMDAD     INC

 1       2      Art      22000         .     1      0      dad      22000
 2       1      Bill     30000         .     1      0      dad      30000
 3       3      Paul     25000         .     1      0      dad      25000
 4       1      Bess         .     15000     0      1      mom      15000
 5       3      Pat          .     50000     0      1      mom      50000
 6       2      Amy          .     18000     0      1      mom      18000
Solution 3. Another way you can fix this is by using the rename option on the set statement of a data step to rename the variables just before the files are combined.
DATA momdad;
  SET dads(RENAME=(dadinc=inc)) moms(RENAME=(mominc=inc));
RUN;

PROC PRINT DATA=momdad;
RUN;
The output for Solution 3 is below.
OBS    FAMID    NAME     INC

 1       2      Art     22000
 2       1      Bill    30000
 3       3      Paul    25000
 4       1      Bess    15000
 5       3      Pat     50000
 6       2      Amy     18000

4.2 The two data files have different lengths for variables of the same name

In all of the examples above, the variable name was input with the format $ indicating name is an alphabetic (string) variable with a default length of 8. What would happen if name in the dads file was input using $3. and name in the moms file was input using $4. ? This is illustrated below.
DATA dads; 
  INPUT famid name $3. inc;
DATALINES;  
 2 Art  22000 
 1 Bob  30000 
 3 Tom  25000 
 RUN;  

DATA moms;  
  INPUT famid name $4. inc; 
DATALINES;  
 1 Bess 15000  
 3 Rory 50000  
 2 Jane 18000  
 RUN;  

DATA momdad; 
  SET dads moms;
RUN; 
PROC PRINT DATA=momdad; 
RUN; 
The output is below.
OBS    FAMID   NAME     INC 
 1      2      Art     22000 
 2      1      Bob     30000
 3      3      Tom     25000 
 4      1      Bes     15000
 5      3      Ror     50000 
 6      2      Jan     18000 
Note that the names for the moms are truncated to be length 3. This is because the length for names in the dads file is 3. To fix this, use the length statement in the data step that merges the two files.
DATA momdad; 
  LENGTH name $ 4;
  SET dads moms;
RUN; 
PROC PRINT DATA=momdad; 
RUN; 
The output is below.
OBS    NAME   FAMID     INC
 1     Art       2      22000
 2     Bob       1      30000
 3     Tom       3      25000
 4     Bess      1      15000
 5     Rory      3      50000
 6     Jane      2      18000

4.3 The two data files have variables with the same name but different codes

This problem is similar to the problem above, except that it has an additional wrinkle, illustrated below. In the dads file there is a variable called fulltime that is coded 1 if the dad is working full time, 0 if he is not. The moms file also has a variable called fulltime that is coded Y is she is working full time, and N if she is not. Not only are these variables of different types (numeric and character), but they are coded differently as well.
DATA dads;
 INPUT famid name $ inc fulltime;
DATALINES;
2 Art  22000 0
1 Bill 30000 1
3 Paul 25000 1
;
RUN;

DATA moms;
 INPUT famid name $ inc fulltime $1.;
DATALINES;
1 Bess 15000 N
3 Pat  50000 Y
2 Amy  18000 N
;
RUN;
Solution #1. Code the variables in the two files in the same way. For example, code fulltime using 0/1 for both files with 1 indicating working fulltime. This is the simplest solution if you are creating the files yourself. We will omit illustrating this solution to save space.

Solution #2. You may not have created the original raw data files, so solution #1 may not be possible for you. In that case, you can create a new variable in each file that has the same coding and will be compatible when you merge the files. Below we illustrate this strategy.

For the dads file, we make a variable called full that is the same as fulltime, and save the file as dads2, dropping fulltime. For the moms, we create full by recoding fulltime, and save the file as moms2, also dropping fulltime. The files dads2 and moms2 both have the variable full coded the same way (0/1 where 1=works full time) so we can combine those files together.

DATA dads;
 SET dads;
 full=fulltime;
 DROP fulltime;
RUN;

DATA moms;
 SET moms;
 IF fulltime="Y" THEN full=1;
 IF fulltime="N" THEN full=0;
 DROP fulltime;
RUN;

DATA momdad;
  SET dads moms;
RUN;
PROC PRINT DATA=momdad;
RUN;
The results are shown below.
OBS    FAMID    NAME     INC     FULL

 1       2      Art     22000      0
 2       1      Bill    30000      1
 3       3      Paul    25000      1
 4       1      Bess    15000      0
 5       3      Pat     50000      1
 6       2      Amy     18000      0

5. For more information


How to cite this page

Report an error on this page

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California