Stata FAQ:  How can I perform multiple imputation on longitudinal data using ICE?

Imputing longitudinal or panel data poses special problems. If the data are in long form, each case has multiple rows in the dataset, and estimates need to account for this. At the same time, values from other time points can be important predictors of missing values, so we want to take advantage of them. The following example shows how to impute longitudinal data while accommodating this structure.

The example dataset contains students' reading and math scores at three time points (read and math, respectively), as well as the time-invariant covariates female, private, and ses. The data are in long form, so there are 3 rows for each of the 200 students. The data also contain an id variable, which allows us to match cases across the three waves of data collection, and a variable time, which tells us when the data were collected. There are missing data on four of the five substantive variables (all but female). How does one create multiply imputed datasets that account for the clustering in the data (multiple observations per student) and take advantage of the fact that scores at the other two time points are likely to be good predictors of missing values of the time-varying variables?

First we want to look at our data to confirm that there are missing data. We can do this using the summarize command (which can be abbreviated to sum). We can also use the user-written command misschk to look at the patterns of missingness within our data. You can download misschk from within Stata by typing findit misschk (see the FAQ "How can I use the findit command to search for programs and get additional help?" for more information about using findit).

use "http://www.ats.ucla.edu/stat/stata/faq/mi_longi.dta"

sum female ses private read math

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      female |       600        .545    .4983864          0          1
         ses |       531    2.011299    .7136568          1          3
     private |       570    .1526316    .3599479          0          1
        read |       538    52.66543    10.29398         26         76
        math |       554    52.16106    9.301105         26         75

misschk female private ses read math

Variables examined for missing values

 #  Variable        # Missing   % Missing
--------------------------------------------
 1  female                0         0.0
 2  private              30         5.0
 3  ses                  69        11.5
 4  read                 62        10.3
 5  math                 46         7.7

Missing for |
      which |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
      _234_ |          2        0.33        0.33
      _23__ |          1        0.17        0.50
      _2_4_ |          1        0.17        0.67
      _2__5 |          2        0.33        1.00
      _2___ |         24        4.00        5.00
      __345 |          1        0.17        5.17
      __34_ |          9        1.50        6.67
      __3_5 |          2        0.33        7.00
      __3__ |         54        9.00       16.00
      ___45 |          3        0.50       16.50
      ___4_ |         46        7.67       24.17
      ____5 |         38        6.33       30.50
      _____ |        417       69.50      100.00
------------+-----------------------------------
      Total |        600      100.00

Missing for |
   how many |
 variables? |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        417       69.50       69.50
          1 |        162       27.00       96.50
          2 |         18        3.00       99.50
          3 |          3        0.50      100.00
------------+-----------------------------------
      Total |        600      100.00
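
If misschk is not installed, a similar overview can be sketched with built-in commands. The block below assumes Stata 11 or later, where misstable is available. It is an alternative way to get the same kind of counts and patterns, not the commands that produced the output above.

```stata
* Sketch using built-in commands (assumes Stata 11 or later).
* Count missing values for each variable.
misstable summarize female private ses read math

* Show which combinations of variables are missing together;
* the frequency option reports counts instead of percentages.
misstable patterns female private ses read math, frequency
```

Neither command changes the data, so they can be run before the reshape without side effects.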

Once we are familiar with our data, the first step in the imputation process is to reshape the data from long to wide. Having the data in wide form both takes care of the nesting issue (there is now only one row of data per student) and allows us to easily use variables from the other time periods as predictors of missing values, since in wide form they are just other variables in the dataset (rather than part of another row). We do this using the reshape command, and then check the output from reshape to make sure everything went as it should, which it has. Note that the variable time is dropped, and that there are now three read variables and three math variables.

reshape wide read math, i(id) j(time)
(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      600   ->     200
Number of variables                   7   ->      10
j variable (3 values)              time   ->   (dropped)
xij variables:
                                   read   ->   read1 read2 read3
                                   math   ->   math1 math2 math3
-----------------------------------------------------------------------------
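
Beyond reading the reshape output, the wide structure can be checked directly. This is a minimal sketch, assuming the reshape wide command above has just been run on the example data.

```stata
* Sketch: sanity checks after reshaping to wide form.
* isid exits with an error if id does not uniquely identify the rows.
isid id

* There should be exactly one row per student.
assert _N == 200

* This should list read1-read3 and math1-math3.
describe read* math*
```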


After reshaping the data and checking that the reshape command worked as intended, we can take whatever steps are necessary to impute the missing values. Since our dataset is small and simple, we issue the ice command with only two options; in general, the ice command will be much more complex. The important point is that since our data are in wide (rather than long) format, the fact that the data are longitudinal does not create any additional complications. Note that we have used the command set seed to set the seed for the random number generator; this will enable you to reproduce the results of our imputation. Note also that ice is a user-written command. You can download ice from within Stata by typing findit ice (see the FAQ "How can I use the findit command to search for programs and get additional help?" for more information about using findit).

set seed 091107
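
The ice call itself does not appear in this listing. The sketch below is an assumed reconstruction based only on what the text states: two options, four imputations (m(4)), and a saved file named imputed_dataset.dta. The variable list and the saving() option name are assumptions; check help ice for the syntax of your installed version.

```stata
* Assumed reconstruction, not the verbatim original command.
* m(4) requests four imputations; saving() names the output file.
ice female private ses read1 read2 read3 math1 math2 math3, saving(imputed_dataset) m(4)
```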

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          1 |         85       42.50       87.50
          2 |         21       10.50       98.00
          3 |          2        1.00       99.00
          4 |          2        1.00      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female |         | [No missing data in estimation sample]
------------------------------------------------------------------------------

Imputing 1..2..3..4..file imputed_dataset.dta saved

After imputation, ice saves a copy of the new dataset in the current working directory. Let's start by looking at the file that is saved. The new dataset contains all of our variables, plus two new variables: _mi, which contains an identifier for the observation (we don't really need it, since we already have the variable id), and _mj, which tells us which imputation each row of the data belongs to. We asked for four imputed datasets (m(4)), but when we tabulate _mj we see that it actually takes on five values, and that there are 1000 cases in our dataset when we were expecting only 800 (200*4). Summarizing the variables in our dataset helps us see where this extra "imputation" comes from: some cases still have missing values. It turns out that the cases for which _mj = 0 are the cases that made up our original dataset. Hence, we want to drop those cases. If we were to summarize the data again after dropping them, we would see that there are 800 cases, none of which contain missing values.

use imputed_dataset, clear
(a heavily modified version of highschool and beyond (200 cases))

tab _mj

 imputation |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        200       20.00       20.00
          1 |        200       20.00       40.00
          2 |        200       20.00       60.00
          3 |        200       20.00       80.00
          4 |        200       20.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00

sum female private ses read1 read2 read3 math1 math2 math3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      female |      1000        .545    .4982201          0          1
     private |       990    .1545455    .3616535          0          1
         ses |       977    2.021494    .7089508          1          3
       read1 |       994    52.44085    10.06883         28         76
       read2 |       968    52.68957    9.655967   29.89433   79.52792
-------------+--------------------------------------------------------
       read3 |       976    52.34583    10.98606   11.84808    76.0173
       math1 |       995     52.6196    9.325515         33         75
       math2 |       980     51.8959    9.767504         26    74.0519
       math3 |       979    52.15079    8.782948   35.04641   71.48292

drop if _mj==0
(200 observations deleted)
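
The claims about the remaining data can be checked with a few assertions. This is a sketch, assuming the wide variable names read1-read3 and math1-math3 created by the earlier reshape.

```stata
* Sketch: confirm the structure of the stacked imputed data.
* Four imputations of 200 students each.
assert _N == 800

* id alone is no longer unique, but id and _mj together should be.
isid id _mj

* missing() returns 1 if any of its arguments is missing.
assert !missing(private, ses, read1, read2, read3, math1, math2, math3)
```

Each assert stops with an error if its condition fails, so a silent run means every check passed.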

So now we have our multiply imputed data, but they are still in wide format, and we will probably want them in long form to run the analyses. Returning the data to long format has an added complication: we now have four rows of data for each student, one for each of the imputations. As a result, the variable id no longer uniquely identifies an observation. However, id and _mj together do uniquely identify each row, so we use both as identifiers below. The resulting dataset will have 12 (3*4) observations for each student (a total of 4*3*200 = 2400 observations), because for each of the four imputations there will be three time points.

reshape long read math, i(id _mj)

(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      800   ->    2400
Number of variables                  12   ->       9
j variable (3 values)                     ->   _j
xij variables:
                      read1 read2 read3   ->   read
                      math1 math2 math3   ->   math
-----------------------------------------------------------------------------
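
Because no j() option was given, reshape named the new index variable _j. As a sketch of an alternative (to be run instead of the reshape long command above, not in addition to it), j() can name the variable directly.

```stata
* Alternative to the reshape above: run one or the other, not both.
* j(time) names the new index variable, so no later rename is needed.
reshape long read math, i(id _mj) j(time)

* Each row should now be identified by student, imputation, and time point.
isid id _mj time
```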

After reshaping the data, we will want to explore our imputations. It is important to make sure that the imputed values make sense, for example that they are not far outside the range of the original values. We can start by summarizing our data; we may also want to use by to look at the values generated by each imputation separately. It might also be useful to generate boxplots or histograms of our variables, to see that the distributions look reasonable after imputation. In the summary below, for example, read now ranges from about 11.85 to 79.53, whereas the observed scores ranged from 26 to 76, so some imputed values fall outside the observed range. We will also rename the variable _j to time, since it contains the time at which each observation was collected. This is not necessary, but it makes the variable easier to remember.

sum female private ses read math

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      female |      2400        .545    .4980747          0          1
     private |      2400        .155    .3619801          0          1
         ses |      2400     2.02375    .7077391          1          3
        read |      2400    52.45219    10.24189   11.84808   79.52792
        math |      2400     52.2387    9.304953         26         75

rename _j time
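
The per-imputation checks mentioned earlier (using by, and plotting the distributions) might look like the following. This is a sketch using built-in commands; the bounds 26 and 76 are the minimum and maximum of read in the original data.

```stata
* Summarize the time-varying variables separately for each imputation.
bysort _mj: summarize read math

* Box plots of read, one box per imputation.
graph box read, over(_mj)

* Count reading scores outside the observed range of the original data.
count if read < 26 | read > 76
```

A nonzero count is not necessarily an error, but it flags imputed values worth inspecting before the analysis.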

Once we have carefully checked our data to make sure there were no problems in the imputation, we can run an analysis on our data using mim. You can download mim from within Stata by typing findit mim (see the FAQ "How can I use the findit command to search for programs and get additional help?" for more information about using findit). Below we use the mim: prefix with the command xtreg to predict reading test scores (read) from time and math test scores (math), accounting for the fact that there are multiple observations per student. The command syntax is the same as for xtreg; all that needs to be added is the mim: prefix. Note that using the mim: prefix reduces the amount of output.

mim: xtreg read math time, i(id)

Multiple-imputation estimates (xtreg)                    Imputations =       4
                                                         Minimum obs =     600
                                                         Minimum dof =    85.7

------------------------------------------------------------------------------
        read |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
        math |   .535503   .047671   11.23   0.000    .440733  .630274    85.7
        time |    .10996   .348473    0.32   0.752   -.573873  .793794   987.6
       _cons |   24.2584   2.62314    9.25   0.000    19.0651  29.4517   120.8
------------------------------------------------------------------------------

To demonstrate the difference in how much output is generated, we have run the same xtreg analysis on only the first imputation (if _mj==1), without the mim: prefix. The output without mim: contains much more information. In general, the parts of the output that are omitted when mim: is used are estimates for which it doesn't make sense to average across imputations.

xtreg read math time if _mj==1, i(id)

Random-effects GLS regression                   Number of obs      =       600
Group variable: id                              Number of groups   =       200

R-sq:  within  = 0.0016                         Obs per group: min =         3
       between = 0.5426                                        avg =       3.0
       overall = 0.3407                                        max =         3

Random effects u_i ~ Gaussian                   Wald chi2(2)       =    135.99
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .5119683   .0439027    11.66   0.000     .4259207     .598016
        time |   .1384974   .3373886     0.41   0.681    -.5227721     .799767
       _cons |   25.40161   2.453613    10.35   0.000     20.59261     30.2106
-------------+----------------------------------------------------------------
     sigma_u |  4.7500306
     sigma_e |  6.3321664
         rho |  .36008789   (fraction of variance due to u_i)
------------------------------------------------------------------------------
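
As a rough check on what mim: does with the coefficients, the combined point estimate under Rubin's rules is the average of the estimates from the individual imputations. The sketch below computes that average by hand for the coefficient on math; the scalar name bsum is hypothetical, and the loop assumes the four imputations are numbered 1 to 4 in _mj.

```stata
* Sketch: average the math coefficient over the four imputations by hand.
scalar bsum = 0
forvalues j = 1/4 {
    * Fit the same model on one imputation at a time.
    quietly xtreg read math time if _mj == `j', i(id)
    scalar bsum = bsum + _b[math]
}
display "mean coefficient on math: " bsum/4
```

Standard errors cannot be obtained by simple averaging, because the combined variance also includes the variability between imputations; that is the part mim: handles for you.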

