Stata Learning Module Reshaping data wide to long

This module illustrates the power (and simplicity) of Stata in its ability to reshape data files. These examples take wide data files and reshape them into long form. These show common examples of reshaping data, but do not exhaustively demonstrate the different kinds of data reshaping that you could encounter.

Example #1: Reshaping data wide to long

Consider the family income data file below.

use http://www.ats.ucla.edu/stat/stata/modules/faminc, clear
list
         famid   faminc96   faminc97   faminc98
1.         3      75000      76000      77000
2.         1      40000      40500      41000
3.         2      45000      45400      45800


This is called a wide format since the years of data are wide. We may want the data to be long, where each year of data is in a separate observation. The reshape command can accomplish this, as shown below.

reshape long faminc, i(famid) j(year)
(note:  j = 96 97 98)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       9
Number of variables                   4   ->       3
j variable (3 values)                     ->   year
xij variables:
faminc96 faminc97 faminc98   ->   faminc
----------------------------------------------------------------------------- 

The list command shows that the data are now in long form, where each year is represented as its own observation.

list
         famid       year     faminc
1.         1         96      40000
2.         1         97      40500
3.         1         98      41000
4.         2         96      45000
5.         2         97      45400
6.         2         98      45800
7.         3         96      75000
8.         3         97      76000
9.         3         98      77000   

Let's look at the wide format and contrast it with the long format.

The reshape wide command puts the data back into wide format. We then list out the wide file.

reshape wide
(note:  j = 96 97 98)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                        9   ->       3
Number of variables                   3   ->       4
j variable (3 values)              year   ->   (dropped)
xij variables:
faminc   ->   faminc96 faminc97 faminc98
----------------------------------------------------------------------------- 
list
         famid   faminc96   faminc97   faminc98
1.         1      40000      40500      41000
2.         2      45000      45400      45800
3.         3      75000      76000      77000   

The reshape long command puts the data back into long format. We then list out the long file.

reshape long
(note:  j = 96 97 98)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       9
Number of variables                   4   ->       3
j variable (3 values)                     ->   year
xij variables:
faminc96 faminc97 faminc98   ->   faminc
----------------------------------------------------------------------------- 
list
         famid       year     faminc
1.         1         96      40000
2.         1         97      40500
3.         1         98      41000
4.         2         96      45000
5.         2         97      45400
6.         2         98      45800
7.         3         96      75000
8.         3         97      76000
9.         3         98      77000   

Now let's look at the pieces of the original reshape command.

reshape long faminc, i(famid) j(year)
long tells reshape that we want to go from wide to long
faminc tells Stata that the stem of the variable to be converted from wide to long is faminc
i(famid) option tells reshape that famid is the unique identifier for records in their wide format
j(year) tells reshape that the suffix of faminc (i.e., 96 97 98) should be placed in a variable called year

Example #2: Reshaping data wide to long

Consider the file containing the kids and their heights at 1 year of age (ht1) and at 2 years of age (ht2).

use http://www.ats.ucla.edu/stat/stata/modules/kidshtwt, clear
list famid birth ht1 ht2
          famid      birth        ht1        ht2
1.         1          1        2.8        3.4
2.         1          2        2.9        3.8
3.         1          3        2.2        2.9
4.         2          1          2        3.2
5.         2          2        1.8        2.8
6.         2          3        1.9        2.4
7.         3          1        2.2        3.3
8.         3          2        2.3        3.4
9.         3          3        2.1        2.9   

Lets reshape this data into a long format. The critical questions are:
Q: What is the stem of the variable going from wide to long.
A: The stem is ht
Q: What variable uniquely identifies an observation when it is in the wide form.
A: famid and birth together uniquely identify the wide observations.
Q: What do we want to call the variable which contains the suffix of ht, i.e., 1 and 2.
A: Lets call the suffix age.

With the answers to these questions, the reshape command will look like this.

reshape long ht, i(famid birth) j(age)

Let's look at the wide data, and then the data reshaped to be long.

list famid birth ht1 ht2
         famid      birth        ht1        ht2
1.         1          1        2.8        3.4
2.         1          2        2.9        3.8
3.         1          3        2.2        2.9
4.         2          1          2        3.2
5.         2          2        1.8        2.8
6.         2          3        1.9        2.4
7.         3          1        2.2        3.3
8.         3          2        2.3        3.4
9.         3          3        2.1        2.9   
reshape long ht, i(famid birth) j(age)
(note:  j = 1 2)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        9   ->      18
Number of variables                   7   ->       7
j variable (2 values)                     ->   age
xij variables:
ht1 ht2   ->   ht
----------------------------------------------------------------------------- 
list famid birth age ht
         famid      birth        age         ht
1.         1          1          1        2.8
2.         1          1          2        3.4
3.         1          2          1        2.9
4.         1          2          2        3.8
5.         1          3          1        2.2
6.         1          3          2        2.9
7.         2          1          1          2
8.         2          1          2        3.2
9.         2          2          1        1.8
10.         2          2          2        2.8
11.         2          3          1        1.9
12.         2          3          2        2.4
13.         3          1          1        2.2
14.         3          1          2        3.3
15.         3          2          1        2.3
16.         3          2          2        3.4
17.         3          3          1        2.1
18.         3          3          2        2.9   

Example #3: Reshaping data wide to long

The file with the kids heights at age 1 and age 2 also contains their weights at age 1 and age 2 (called wt1 and wt2).

use http://www.ats.ucla.edu/stat/stata/modules/kidshtwt, clear
list famid birth ht1 ht2 wt1 wt2
         famid      birth        ht1        ht2        wt1        wt2
1.         1          1        2.8        3.4         19         28
2.         1          2        2.9        3.8         21         28
3.         1          3        2.2        2.9         20         23
4.         2          1          2        3.2         25         30
5.         2          2        1.8        2.8         20         33
6.         2          3        1.9        2.4         22         33
7.         3          1        2.2        3.3         22         28
8.         3          2        2.3        3.4         20         30
9.         3          3        2.1        2.9         22         31   

Let's reshape this data into a long format. This is basically the same as the previous command except that ht is replaced with ht wt.

reshape long ht wt, i(famid birth) j(age)

Let's look at the wide data, and then the data reshaped to be long.

list famid birth ht1 ht2 wt1 wt2
         famid      birth        ht1        ht2        wt1        wt2
1.         1          1        2.8        3.4         19         28
2.         1          2        2.9        3.8         21         28
3.         1          3        2.2        2.9         20         23
4.         2          1          2        3.2         25         30
5.         2          2        1.8        2.8         20         33
6.         2          3        1.9        2.4         22         33
7.         3          1        2.2        3.3         22         28
8.         3          2        2.3        3.4         20         30
9.         3          3        2.1        2.9         22         31   
reshape long ht wt, i(famid birth) j(age)
(note:  j = 1 2)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        9   ->      18
Number of variables                   7   ->       6
j variable (2 values)                     ->   age
xij variables:
ht1 ht2   ->   ht
wt1 wt2   ->   wt
----------------------------------------------------------------------------- 
list famid birth age ht wt
          famid      birth        age         ht         wt
1.         1          1          1        2.8         19
2.         1          1          2        3.4         28
3.         1          2          1        2.9         21
4.         1          2          2        3.8         28
5.         1          3          1        2.2         20
6.         1          3          2        2.9         23
7.         2          1          1          2         25
8.         2          1          2        3.2         30
9.         2          2          1        1.8         20
10.         2          2          2        2.8         33
11.         2          3          1        1.9         22
12.         2          3          2        2.4         33
13.         3          1          1        2.2         22
14.         3          1          2        3.3         28
15.         3          2          1        2.3         20
16.         3          2          2        3.4         30
17.         3          3          1        2.1         22
18.         3          3          2        2.9         31   

Example #4: Reshaping data wide to long with character suffixes

It also is possible to reshape a wide data file to be long when there are character suffixes. Look at the dadmomw file below.

use http://www.ats.ucla.edu/stat/stata/modules/dadmomw, clear
list
         famid      named       incd      namem       incm
1.         1       Bill      30000       Bess      15000
2.         2        Art      22000        Amy      18000
3.         3       Paul      25000        Pat      50000   

We would like to make name and inc into long formats but their suffixes are characters (d & m) instead of numbers. Stata can handle that as long as you use string in the command to indicate that the suffix is a character.

reshape long name  inc, i(famid) j(dadmom) string

Let's look at the data before and after reshaping.

list
         famid      named       incd      namem       incm
1.         1       Bill      30000       Bess      15000
2.         2        Art      22000        Amy      18000
3.         3       Paul      25000        Pat      50000   
reshape long name inc, i(famid) j(dadmom) string
(note:  j = d m)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        3   ->       6
Number of variables                   5   ->       4
j variable (2 values)                     ->   dadmom
xij variables:
named namem   ->   name
incd incm   ->   inc
----------------------------------------------------------------------------- 
list
         famid     dadmom       name        inc
1.         1          d       Bill      30000
2.         1          m       Bess      15000
3.         2          d        Art      22000
4.         2          m        Amy      18000
5.         3          d       Paul      25000
6.         3          m        Pat      50000   

Summary reshaping data wide to long

 Wide format
famid   faminc96   faminc97   faminc98
1.         1      40000      40500      41000
2.         2      45000      45400      45800
3.         3      75000      76000      77000

reshape long faminc, i(famid) j(year)

Long Format
famid       year     faminc
1.         1         96      40000
2.         1         97      40500
3.         1         98      41000
4.         2         96      45000
5.         2         97      45400
6.         2         98      45800
7.         3         96      75000
8.         3         97      76000
9.         3         98      77000


The general syntax of reshape long can be expressed as...

reshape long stem-of-wide-vars, i(wide-id-var)  j(var-for-suffix)
where
stem-of-wide-vars  is the stem of the wide variables, e.g., faminc
wide-id-var        is the variable that uniquely identifies wide
observations, e.g., famid
var-for-suffix     is the variable that will contain the suffix of
the wide variables, e.g., year 

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.