|
|
|
||||
|
|
|||||
Datasets sometimes come as a series of smaller datasets each containing information from a subset of cases. For example, one dataset for each U.S. State, or one dataset per year for data over a series of years. The Stata command append is simple enough to use, but if one has a large number of datasets, it can be time consuming to write code to append each dataset to a master dataset. In addition to being time consuming, it is also very easy to make errors when performing this sort of task. The process can be streamlined and made less error prone using a few Stata commands. We start by creating a list of the file names in a text file. Then the file read command is used to process every item on the list. Our example uses a list of file names to read in and merge a series of text files, one for each U.S. state.
The first step is to create a list of files that need to be read in. To do this we need to start with all of the datasets in a single directory. Next you can use dos commands to create a list of all of the files with a certain extension in that directory. You can use Stata to relay commands to your operating system. You can execute any (valid) command to your operating system in Stata by beginning the line with either "!" (without the quotation marks) or the word "shell" (without the quotation marks). For example, if I wanted to execute the command "del myfile.txt" (which would delete the file myfile.txt in the current directory), I could type "! del myfile.txt" or "shell del myfile.txt" (without the quotation marks). The first line of syntax below moves us to the directory where all of the files we want to merge are stored. The next line of syntax uses the ability to access the dos shell to create a text file called filelist.txt. The second line of code uses the ! to tell Stata that we want to execute a command in dos, then it uses the dir command with options to create a list of all files with the extension .dta in the current directory, and saves that list to d:\filelist.txt. A more detailed explanation of how to create a list of files in a directory (on a PC) can be found on our page How can I get a list of files in a directory?
cd d:\mydir ! dir *.dta /a-d /b >d:\filelist.txt
On a Mac the command is:
cd d:\mydir ! ls *.dta >filelist.txt
Assuming the directory d:\mydir contained a Stata data file (note Stata data files have the extension dta) for each of the 50 U.S. States, if we open the file filelist.txt we should see a list of 50 Stata data files, one for each state. The file looks like this (with states omitted to save space):
dataset_ak.dta dataset_al.dta dataset_ar.dta <lines omitted> dataset_wi.dta dataset_wv.dta dataset_wy.dta
Now that we have a list of files to read from, we can go about combining the files. The first line below instructs Stata to open a file (file open), and tells Stata that within Stata we will refer to this file as myfile. The word using followed by a file path and name tells Stata where the text file we are reading is saved. As is typical in Stata, a comma (,) separates the main part of the command from the options. In the options we specify read which tells Stata that we are going to be reading (i.e. getting information from, but not adding information to) the file. Although it appears in the options section of the command, it is necessary to specify read or some other action, in order to open a file. All commands for dealing with external files have a similar structure. They start with the word file, followed by the type of operation we want to perform (in this case of the first line of syntax below, open the file), then we see whatever name that we have given the file within Stata (also known as a handle, in this case myfile), followed by something else, what this something else is will depend on what we are doing with the file. You have to specify the name of the file (in our example it is called myfile), because it is possible to have more than one file open at a time, meaning that we need some way to tell Stata which file we are referring to with any given command. The second line of syntax reads the first line of the file we have named myfile. When Stata reads a text file, unless we tell it to do otherwise, it reads a single line and stops. Next time we tell Stata to read from the file, it will read the next line, and so forth working its way through the file (note, this is the default behavior, we can instruct Stata to do otherwise). When Stata "reads" a line from a file, it copies the information in that line, and places it in a local macro, which the user can then access. We tell Stata to read a line from the file we have called myfile by starting with the command (file), then the action we want to perform on the file, in this case read, followed by the name of the file (myfile) and finally, by giving Stata the name of the local macro into which we want to place the information Stata reads from the file, in this case, line.
file open myfile using "d:\filelist.txt", read file read myfile line
Now we can start combining our datasets. We will open each of our state datasets, and append them to a dataset called master_data.dta . The syntax to do this uses the information from the local macro line. A macro is a type of placeholder, we can place a piece of information (e.g. a numeric value or a string of text) into it, and, whenever we want to use it, we can put in the name of the macro, rather than having to type in the actual piece of information contained in the macro. This may not seem that useful at first, but it can be very powerful. Stata uses a ` (it is on the same key as the ~ character, at the top of your keyboard) followed by the name of the macro, and then a ' (a normal apostrophe or single quote) to represent whatever you want to fill in, since we assigned the name line to the information from myfile, we type `line'. When it interprets the command Stata replaces the `line' with whatever is stored in the local macro line, so if the macro line contains the first line of information from myfile (i.e. "dataset_ak.dta", without the quotes), when Stata interprets the first line below, what it actually interprets is "use dataset_ak.dta" (without the quotes). Since this is the first dataset we have processed we need to create the file master_data, and the second line of syntax below does this by saving the dataset as master_data. The replace option after the comma tells Stata that if it encounters another file called master_data in the current directory, that it should replace that file with the current file. Be careful with this option, since it will allow you to overwrite another file, which you may or may not want to do. Finally, the third line of syntax below reads the next line from the file myfile.
use `line' save master_data, replace file read myfile line
Now we are ready to go on and read in the rest of our files. However, there are a few things you need to know as background in order for this process to make sense. We know that when we use the file read command, Stata reads lines one a time, and that each time we ask it to read a line, it will read the next line in the file. What does Stata do when it reaches the end of the file? Each time Stata reads a line from a file, it returns that line in the local macro we specify (in this example line), but it also returns something else, a returned result called "r(eof)". In general, a returned result is a result of a command that is not necessarily displayed for the user, but which is saved in working memory until another command is run. The r( ) denotes that it is a specific type of returned result, the name inside the parentheses tells us what the result is, in this case, eof stands for end of file. (Some commands return results of the form e( ) and s( ) but that isn't important here.) As long as we are not at the end of the file, r(eof) is equal to zero, when the end of the file is reached, Stata sets r(eof) equal to one, and the local macro specified in the command will be empty. The code below takes advantage of this behavior. In the lines above, we have just read the second line of the file, so now we have the information in the second line of myfile in the local macro line, and r(eof) is equal to zero (since we are not at the end of the file). In the first line of syntax below, the command while tells Stata to repeat whatever is inside the curly braces for as long as the statement following it is true. In this case the statement is r(eof)==0, so, as long as r(eof) is equal to zero, that is, as long as we are not at the end of the file, Stata should keep repeating whatever is in the curly brackets. Note that the first curly bracket ( { ) occurs after the statement we want Stata to test (in this case r(eof)==0), and that the rest of the line is blank, this is necessary for the while command to work properly. The last line below contains a close curly bracket ( } ) and nothing else. This tells Stata that this is the end of the commands it should repeat, the close curly bracket must be on a line alone. Everything on the lines between the open and close brackets is what Stata will repeat while the statement r(eof)==0 is true. In the first line of syntax between brackets, the command append command tells Stata that we want to add observations from another dataset to the working dataset. The using `line' tells Stata that we want to add the observations from the dataset named in the local macro line (i.e. the using command followed by a file name tells Stata where the observations should come from). The next line saves master_data, overwriting the old file (which we want to do, since we have changed the dataset and want to save those changes). The final (third) line of syntax in the curly brackets reads the next line from the file myfile and places it in the local macro line just as we did before. Stata will continue reading lines and appending datasets until it reaches the end of myfile, then it will move on to whatever is next in the .do file. When it is done, you should have one dataset, master_data.dta, that contains the cases from each of your datafiles.
while r(eof)==0 {
append using `line'
save master_data, replace
file read myfile line
}
To make it easier for you to cut and paste into a .do file, here is all the syntax (except for the commands to your operating system) in a single block:
file open myfile using d:\filelist.txt, read
file read myfile line
use `line'
save master_data, replace
file read myfile line
while r(eof)==0 { /* while you're not at the end of the file */
append using `line'
save master_data, replace
file read myfile line
}
In the previous example we read in Stata data files. However, datasets often come as ascii (text) files. In this example we add the steps to convert the files from comma separated text files to a Stata data file and then add them to the master dataset. As we do this, we will also add a variable which identifies which dataset the file came from. As we did above, we first create a list of files in the directory. This time, instead of getting a list of all files with the extension .dta, we will get all files with the extension .txt, by using *.txt in the second line of syntax below.
cd d:\mydir ! dir *.txt /a-d /b >d:\filelist.txt
Assuming the directory d:\mydir contained a comma separated data file for each of the 50 U.S. States, if we open the file filelist.txt we should see a list of 50 text files, one for each state. The file looks like this (with states omitted to save space):
dataset_ak.txt dataset_al.txt dataset_ar.txt <lines omitted> dataset_wi.txt dataset_wv.txt dataset_wy.txt
As in the example above, the first line below opens the file using the file open command, giving it the name myfile. While the second line of syntax reads the first line from the file we have named myfile.
file open myfile using "d:\filelist.txt", read file read myfile line
Now we want to read in the first of our datasets. In this example, we assume that the datasets are comma separated and that the first line of each of the text files contains the names of the variables. Remember that Stata uses a ` (on the same key as the ~ character at the top of your keyboard) followed by the name of the macro, and then a ' (a normal apostrophe) to represent whatever you want to fill in, since we assigned the name line to the information from myfile, we type `line'. Stata replaces the `line' with whatever is stored in the local macro line, so that when Stata interprets the first line of syntax below, what it actually sees is "insheet using dataset_ak.txt, comma names" (without the quotes). In the first line of syntax below, the insheet command tells Stata that we want to read in an ascii dataset. The insheet command requires that there be one case per line and that the data be either comma or tab separated. This will cover many cases, but if you have data in some other format, for example fixed format data, you will need to modify the syntax to accommodate that. The second line of syntax below saves the dataset we have just read in as a Stata data file, this file will have the same name as the ascii datasets plus the file extension .dta, so, for the first dataset on our list the file name for the Stata dataset will be dataset_ak.txt.dta this is a slightly odd looking file name, but, it should not cause any problems. The next (third) line of syntax generates a new variable in our dataset called state_id. We use the string function substr( ) to set the value of this variable to the two letter state id. The function substr( ), extracts a portion of a string (a substring). substr( ) takes three arguments, each separated by a comma, the first argument is the string we want to use enclosed in quotation marks, in this case, the string we want to take part of is contained in the local macro line. The string we want to edit is followed by a comma (,), and the other two arguments. The first of these arguments is the place in the string where we want to start extracting information, in this case, we want to start with the 9th character in the string. The third argument is the number of characters we want to use, in this case, the two letter state id, so the third argument is 2. As an example, if we entered substring("dataset_ak.txt",9,2) we can count in 9 characters, and see that the 9th character in the string is an "a", and we tell the function to take 2 characters, so it will take the 9th and 10th characters, which are "ak".
insheet using `line', comma names
save `line'.dta, replace
generate state_id = substr("`line'",9,2)
We don't yet have a master dataset, so we will create one using the first line of syntax below. The second line of syntax below reads the next line in myfile.
save master_data.dta, replace file read myfile line
The syntax below works similarly to the syntax for the previous example. The first line tells Stata to perform the commands within the curly brackets for as long as r(eof)==0, that is, for as long as we are not at the end of the file. The second through fourth lines repeat what we did just above. The second line of syntax below uses the command insheet to read in a comma separated data file. The third line saves the new dataset as a Stata data file, and the forth line creates a variable called state_id that will contain the two letter state abbreviation. The next line of syntax, append using master_data.dta appends the dataset master_data.dta to our current dataset, in other words, it adds the cases from the current dataset to the dataset master_data. The following line saves the dataset in memory as master_data.dta, replacing the old version of this file. Moving to the next line of syntax, drop _all drops all observations from the current dataset, effectively clearing the dataset, but without some of the other effects of the using the command clear. Finally, as above, we read in the next line of myfile using the command file read. The next, and final, line contains only the close curly brace for the while command.
while r(eof)==0 {
insheet using `line', comma names
save `line'.dta, replace
generate state_id = substr("`line'",9,2)
append using master_data.dta
save master_data.dta, replace
drop _all
file read myfile line
}
To make it easier for you to cut and paste into a .do file, here is all the syntax (except for the commands to your operating system) in a single block:
file open myfile using d:\filelist_text.txt, read
file read myfile line
insheet using `line', comma names
gen state_id = substr("`line'",9,2)
save `line'.dta, replace
save master_data.dta, replace
drop _all
file read myfile line
while r(eof)==0 {
insheet using `line', comma names
gen state_id = substr("`line'",9,2)
save `line'.dta, replace
append using master_data.dta
save master_data.dta, replace
drop _all
file read myfile line
}
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services