|
|
|
||||
|
|
|||||
This module illustrates the power (and simplicity) of Stata in its ability to reshape data files. These examples take wide data files and reshape them into long form. These show common examples of reshaping data, but do not exhaustively demonstrate the different kinds of data reshaping that you could encounter.
Consider the family income data file below.
use http://www.ats.ucla.edu/stat/stata/modules/faminc, clearlistfamid faminc96 faminc97 faminc98 1. 3 75000 76000 77000 2. 1 40000 40500 41000 3. 2 45000 45400 45800
This is called a wide format since the years of data are wide. We may want the data to be long, where each year of data is in a separate observation. The reshape command can accomplish this, as shown below.
reshape long faminc, i(famid) j(year)(note: j = 96 97 98) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 3 -> 9 Number of variables 4 -> 3 j variable (3 values) -> year xij variables: faminc96 faminc97 faminc98 -> faminc -----------------------------------------------------------------------------
The list command shows that the data are now in long form, where each year is represented as its own observation.
listfamid year faminc 1. 1 96 40000 2. 1 97 40500 3. 1 98 41000 4. 2 96 45000 5. 2 97 45400 6. 2 98 45800 7. 3 96 75000 8. 3 97 76000 9. 3 98 77000
Let's look at the wide format and contrast it with the long format.
The reshape wide command puts the data back into wide format. We then list out the wide file.
reshape wide(note: j = 96 97 98) Data long -> wide ----------------------------------------------------------------------------- Number of obs. 9 -> 3 Number of variables 3 -> 4 j variable (3 values) year -> (dropped) xij variables: faminc -> faminc96 faminc97 faminc98 -----------------------------------------------------------------------------listfamid faminc96 faminc97 faminc98 1. 1 40000 40500 41000 2. 2 45000 45400 45800 3. 3 75000 76000 77000
The reshape long command puts the data back into long format. We then list out the long file.
reshape long(note: j = 96 97 98) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 3 -> 9 Number of variables 4 -> 3 j variable (3 values) -> year xij variables: faminc96 faminc97 faminc98 -> faminc -----------------------------------------------------------------------------listfamid year faminc 1. 1 96 40000 2. 1 97 40500 3. 1 98 41000 4. 2 96 45000 5. 2 97 45400 6. 2 98 45800 7. 3 96 75000 8. 3 97 76000 9. 3 98 77000
Now let's look at the pieces of the original reshape command.
long tells reshape that we want to go from wide to longreshape long faminc, i(famid) j(year)
Consider the file containing the kids and their heights at 1 year of age (ht1) and at 2 years of age (ht2).
use http://www.ats.ucla.edu/stat/stata/modules/kidshtwt, clearlist famid birth ht1 ht2famid birth ht1 ht2 1. 1 1 2.8 3.4 2. 1 2 2.9 3.8 3. 1 3 2.2 2.9 4. 2 1 2 3.2 5. 2 2 1.8 2.8 6. 2 3 1.9 2.4 7. 3 1 2.2 3.3 8. 3 2 2.3 3.4 9. 3 3 2.1 2.9
Lets reshape this data into a long format. The critical questions
are:
Q: What is the stem of the variable going from wide to long.
A: The stem is
ht
Q: What variable uniquely identifies an observation when it is in the
wide form.
A: famid and birth together uniquely identify the
wide observations.
Q: What do we want to call the variable which contains the suffix of
ht, i.e., 1
and 2.
A: Lets call the suffix age.
With the answers to these questions, the reshape command will look like this.
reshape long ht, i(famid birth) j(age)
Let's look at the wide data, and then the data reshaped to be long.
list famid birth ht1 ht2famid birth ht1 ht2 1. 1 1 2.8 3.4 2. 1 2 2.9 3.8 3. 1 3 2.2 2.9 4. 2 1 2 3.2 5. 2 2 1.8 2.8 6. 2 3 1.9 2.4 7. 3 1 2.2 3.3 8. 3 2 2.3 3.4 9. 3 3 2.1 2.9reshape long ht, i(famid birth) j(age)(note: j = 1 2) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 9 -> 18 Number of variables 7 -> 7 j variable (2 values) -> age xij variables: ht1 ht2 -> ht -----------------------------------------------------------------------------list famid birth age htfamid birth age ht 1. 1 1 1 2.8 2. 1 1 2 3.4 3. 1 2 1 2.9 4. 1 2 2 3.8 5. 1 3 1 2.2 6. 1 3 2 2.9 7. 2 1 1 2 8. 2 1 2 3.2 9. 2 2 1 1.8 10. 2 2 2 2.8 11. 2 3 1 1.9 12. 2 3 2 2.4 13. 3 1 1 2.2 14. 3 1 2 3.3 15. 3 2 1 2.3 16. 3 2 2 3.4 17. 3 3 1 2.1 18. 3 3 2 2.9
The file with the kids heights at age 1 and age 2 also contains their weights at age 1 and age 2 (called wt1 and wt2).
use http://www.ats.ucla.edu/stat/stata/modules/kidshtwt, clearlist famid birth ht1 ht2 wt1 wt2famid birth ht1 ht2 wt1 wt2 1. 1 1 2.8 3.4 19 28 2. 1 2 2.9 3.8 21 28 3. 1 3 2.2 2.9 20 23 4. 2 1 2 3.2 25 30 5. 2 2 1.8 2.8 20 33 6. 2 3 1.9 2.4 22 33 7. 3 1 2.2 3.3 22 28 8. 3 2 2.3 3.4 20 30 9. 3 3 2.1 2.9 22 31
Let's reshape this data into a long format. This is basically the same as the previous command except that ht is replaced with ht wt.
reshape long ht wt, i(famid birth) j(age)
Let's look at the wide data, and then the data reshaped to be long.
list famid birth ht1 ht2 wt1 wt2famid birth ht1 ht2 wt1 wt2 1. 1 1 2.8 3.4 19 28 2. 1 2 2.9 3.8 21 28 3. 1 3 2.2 2.9 20 23 4. 2 1 2 3.2 25 30 5. 2 2 1.8 2.8 20 33 6. 2 3 1.9 2.4 22 33 7. 3 1 2.2 3.3 22 28 8. 3 2 2.3 3.4 20 30 9. 3 3 2.1 2.9 22 31reshape long ht wt, i(famid birth) j(age)(note: j = 1 2) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 9 -> 18 Number of variables 7 -> 6 j variable (2 values) -> age xij variables: ht1 ht2 -> ht wt1 wt2 -> wt -----------------------------------------------------------------------------list famid birth age ht wtfamid birth age ht wt 1. 1 1 1 2.8 19 2. 1 1 2 3.4 28 3. 1 2 1 2.9 21 4. 1 2 2 3.8 28 5. 1 3 1 2.2 20 6. 1 3 2 2.9 23 7. 2 1 1 2 25 8. 2 1 2 3.2 30 9. 2 2 1 1.8 20 10. 2 2 2 2.8 33 11. 2 3 1 1.9 22 12. 2 3 2 2.4 33 13. 3 1 1 2.2 22 14. 3 1 2 3.3 28 15. 3 2 1 2.3 20 16. 3 2 2 3.4 30 17. 3 3 1 2.1 22 18. 3 3 2 2.9 31
It also is possible to reshape a wide data file to be long when there are character suffixes. Look at the dadmomw file below.
use http://www.ats.ucla.edu/stat/stata/modules/dadmomw, clearlistfamid named incd namem incm 1. 1 Bill 30000 Bess 15000 2. 2 Art 22000 Amy 18000 3. 3 Paul 25000 Pat 50000
We would like to make name and inc into long formats but their suffixes are characters (d & m) instead of numbers. Stata can handle that as long as you use string in the command to indicate that the suffix is a character.
reshape long name inc, i(famid) j(dadmom) string
Let's look at the data before and after reshaping.
listfamid named incd namem incm 1. 1 Bill 30000 Bess 15000 2. 2 Art 22000 Amy 18000 3. 3 Paul 25000 Pat 50000reshape long name inc, i(famid) j(dadmom) string(note: j = d m) Data wide -> long ----------------------------------------------------------------------------- Number of obs. 3 -> 6 Number of variables 5 -> 4 j variable (2 values) -> dadmom xij variables: named namem -> name incd incm -> inc -----------------------------------------------------------------------------listfamid dadmom name inc 1. 1 d Bill 30000 2. 1 m Bess 15000 3. 2 d Art 22000 4. 2 m Amy 18000 5. 3 d Paul 25000 6. 3 m Pat 50000
Wide format
famid faminc96 faminc97 faminc98
1. 1 40000 40500 41000
2. 2 45000 45400 45800
3. 3 75000 76000 77000
reshape long faminc, i(famid) j(year)
Long Format
famid year faminc
1. 1 96 40000
2. 1 97 40500
3. 1 98 41000
4. 2 96 45000
5. 2 97 45400
6. 2 98 45800
7. 3 96 75000
8. 3 97 76000
9. 3 98 77000
The general syntax of reshape long can be expressed as...
wherereshape long stem-of-wide-vars, i(wide-id-var) j(var-for-suffix)
stem-of-wide-vars is the stem of the wide variables, e.g., faminc
wide-id-var is the variable that uniquely identifies wide
observations, e.g., famid
var-for-suffix is the variable that will contain the suffix of
the wide variables, e.g., year
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services