Imputing longitudinal or panel data poses special problems. If the data are in long form, each case has multiple rows in the dataset, and the imputation must account for this clustering. At the same time, values from other time points can be important predictors of missing values, so we want to take advantage of them. The following example shows how to impute longitudinal data in a way that accommodates this structure. The example dataset contains students' reading and math scores at three time points (read and math, respectively), as well as the time-invariant covariates female, private, and ses. The data are in long form, so there are 3 rows for each of the 200 students. The data also contain an id variable, which allows us to match cases across the three waves of data collection, and a variable time, which tells us when the data were collected. There are missing data on four of the five substantive variables (all but female). How does one create multiply imputed datasets that account for the clustering in the data (multiple observations per student) and take advantage of the fact that scores at the other two time points are likely to be good predictors of any missing values of the time-varying variables?
First we want to look at our data to confirm that there are missing data. We can do this using the summarize command (which can be abbreviated to sum). We can also use the user-written command misschk to look at the patterns of missingness within our data. You can download misschk from within Stata by typing findit misschk (see How can I use the findit command to search for programs and get additional help? for more information about using findit).
use "http://www.ats.ucla.edu/stat/stata/faq/mi_longi.dta"
sum female ses private read math
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      female |       600        .545    .4983864          0          1
         ses |       531    2.011299    .7136568          1          3
     private |       570    .1526316    .3599479          0          1
        read |       538    52.66543    10.29398         26         76
        math |       554    52.16106    9.301105         26         75
misschk female private ses read math
Variables examined for missing values
# Variable # Missing % Missing
--------------------------------------------
1 female 0 0.0
2 private 30 5.0
3 ses 69 11.5
4 read 62 10.3
5 math 46 7.7
Missing for |
which |
variables? | Freq. Percent Cum.
------------+-----------------------------------
_234_ | 2 0.33 0.33
_23__ | 1 0.17 0.50
_2_4_ | 1 0.17 0.67
_2__5 | 2 0.33 1.00
_2___ | 24 4.00 5.00
__345 | 1 0.17 5.17
__34_ | 9 1.50 6.67
__3_5 | 2 0.33 7.00
__3__ | 54 9.00 16.00
___45 | 3 0.50 16.50
___4_ | 46 7.67 24.17
____5 | 38 6.33 30.50
_____ | 417 69.50 100.00
------------+-----------------------------------
Total | 600 100.00
Missing for |
how many |
variables? | Freq. Percent Cum.
------------+-----------------------------------
0 | 417 69.50 69.50
1 | 162 27.00 96.50
2 | 18 3.00 99.50
3 | 3 0.50 100.00
------------+-----------------------------------
Total | 600 100.00
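If installing misschk is not an option, Stata's built-in misstable command gives a comparable view of the missingness. The following is a minimal sketch rather than part of the original analysis: it assumes Stata 11 or later, and the generated variable name nmiss is hypothetical.

```stata
* Sketch: built-in alternative to misschk (assumes Stata 11 or later)
* Counts of missing values per variable
misstable summarize female private ses read math
* Patterns of missingness across variables, as frequencies rather than percentages
misstable patterns female private ses read math, frequency
* Number of variables missing in each row (nmiss is a hypothetical name)
egen nmiss = rowmiss(female private ses read math)
tabulate nmiss
drop nmiss
```

The final tabulate should correspond to the "Missing for how many variables?" table that misschk prints.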
Once we are familiar with our data, the first step in the imputation process is to reshape the data from long to wide. Having the data in wide form both takes care of the nesting issue (there is now only one row of data per student) and allows us to easily use variables from the other time periods as predictors of missing values, since in wide form they are just other variables in the dataset (rather than part of another row). We do this using the reshape command and then check its output to make sure everything went as intended, which it did: the number of observations drops from 600 to 200, the variable time is dropped, and there are now three read variables and three math variables.
reshape wide read math, i(id) j(time)
(note: j = 1 2 3)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 600 -> 200
Number of variables 7 -> 10
j variable (3 values) time -> (dropped)
xij variables:
read -> read1 read2 read3
math -> math1 math2 math3
-----------------------------------------------------------------------------
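A few quick checks can confirm that the wide dataset looks as expected and show why the other waves are useful predictors. This is only a sketch, assuming the reshape above has just been run. Note that pwcorr uses pairwise deletion, so the correlations are based on observed values only.

```stata
* Sketch: sanity checks after reshape wide
* id should now uniquely identify each row (isid exits with an error otherwise)
isid id
* There should be exactly 200 rows, one per student
assert _N == 200
* Correlations among scores across waves, with the number of observations used
pwcorr read1 read2 read3 math1 math2 math3, obs
```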
After reshaping the data and confirming that the reshape command worked as intended, we can take whatever steps are necessary to impute the missing values. Since our dataset is small and simple, we will issue the ice command with only two options (saving() and m()), but in general the ice command will be much more complex. The important point is that, because our data are in wide (rather than long) format, the fact that the data are longitudinal does not create any additional complications. Note that we have used the command set seed to set the seed for the random number generator; this will enable you to reproduce the results of our imputation. Note also that ice is a user-written command. You can download ice from within Stata by typing findit ice (see How can I use the findit command to search for programs and get additional help? for more information about using findit).
set seed 091107
ice female private ses read1 read2 read3 math1 math2 math3, saving(imputed_dataset) m(4)
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 90 45.00 45.00
1 | 85 42.50 87.50
2 | 21 10.50 98.00
3 | 2 1.00 99.00
4 | 2 1.00 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | | [No missing data in estimation sample]
private | logit | female ses read1 read2 read3 math1 math2 math3
ses | mlogit | female private read1 read2 read3 math1 math2 math3
read1 | regress | female private ses read2 read3 math1 math2 math3
read2 | regress | female private ses read1 read3 math1 math2 math3
read3 | regress | female private ses read1 read2 math1 math2 math3
math1 | regress | female private ses read1 read2 read3 math2 math3
math2 | regress | female private ses read1 read2 read3 math1 math3
math3 | regress | female private ses read1 read2 read3 math1 math2
------------------------------------------------------------------------------
Imputing 1..2..3..4..file imputed_dataset.dta saved
After imputation, ice saves a copy of the new dataset in the current working directory. Let's start by looking at that file. The new dataset contains all of our variables, plus two new ones: _mi, which contains an identifier for the observation (which we don't really need, since we already have the variable id), and _mj, which tells us which imputation each row of the data belongs to. We asked for four imputed datasets (m(4)), but when we tabulate _mj we see that it actually takes on five values and that there are 1000 cases in our dataset, when we were expecting only 800 (200*4). Summarizing the variables helps us see where this extra "imputation" comes from: some cases still have missing values. It turns out that the cases for which _mj = 0 make up our original dataset, so we want to drop them. If we were to summarize the data again after dropping those cases, we would see that there are 800 cases, none of which contain missing values.
use imputed_dataset, clear
(a heavily modified version of highschool and beyond (200 cases))
tab _mj
imputation |
number | Freq. Percent Cum.
------------+-----------------------------------
0 | 200 20.00 20.00
1 | 200 20.00 40.00
2 | 200 20.00 60.00
3 | 200 20.00 80.00
4 | 200 20.00 100.00
------------+-----------------------------------
Total | 1,000 100.00
sum female private ses read1 read2 read3 math1 math2 math3
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      female |      1000        .545    .4982201          0          1
     private |       990    .1545455    .3616535          0          1
         ses |       977    2.021494    .7089508          1          3
       read1 |       994    52.44085    10.06883         28         76
       read2 |       968    52.68957    9.655967   29.89433   79.52792
-------------+--------------------------------------------------------
       read3 |       976    52.34583    10.98606   11.84808    76.0173
       math1 |       995     52.6196    9.325515         33         75
       math2 |       980     51.8959    9.767504         26    74.0519
       math3 |       979    52.15079    8.782948   35.04641   71.48292
drop if _mj==0
(200 observations deleted)
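Rather than summarizing again and reading the table by eye, the claim that 800 complete rows remain can be checked directly. This is a minimal sketch. The helper variable nmiss is a hypothetical name, and each assert stops with an error if its condition fails.

```stata
* Sketch: confirm that only the four completed datasets remain
assert _N == 800
assert inrange(_mj, 1, 4)
* nmiss is a hypothetical helper variable counting missing values per row
egen nmiss = rowmiss(female private ses read1 read2 read3 math1 math2 math3)
assert nmiss == 0
drop nmiss
```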
So now we have our multiply imputed data, but they are still in wide format, and we will probably want them in long form to run the analyses. Returning the data to long format has an added complication: we now have four rows of data for each student, one for each imputation, so the variable id no longer uniquely identifies an observation. However, id and _mj together do uniquely identify each row, so we list both in the i() option below. The resulting dataset will have 12 (3*4) rows for each student (a total of 4*3*200=2400 rows), because each of the four imputations has three time points. Because we do not specify the j() option, reshape stores the time index in a new variable named _j.
reshape long read math, i(id _mj)
(note: j = 1 2 3)
Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 800 -> 2400
Number of variables 12 -> 9
j variable (3 values) -> _j
xij variables:
read1 read2 read3 -> read
math1 math2 math3 -> math
-----------------------------------------------------------------------------
After reshaping the data, we will want to explore our imputations. It is important to make sure that the imputed values make sense, that they are not far out of range of the original values, etc. We can start by summarizing our data; we may also want to use by to look at the values generated by each imputation separately. In the summary below, for example, the values of read now range from about 11.8 to 79.5, somewhat beyond the observed range of 26 to 76. It might also be useful to generate boxplots or histograms of our variables to see that the distributions look reasonable after imputation. We will also rename the variable _j to time, since it contains the time at which each observation was collected. This is not necessary, but it makes the variable easier to remember.
sum female private ses read math
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      female |      2400        .545    .4980747          0          1
     private |      2400        .155    .3619801          0          1
         ses |      2400     2.02375    .7077391          1          3
        read |      2400    52.45219    10.24189   11.84808   79.52792
        math |      2400     52.2387    9.304953         26         75
rename _j time
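The checks described above (summaries by imputation, range checks, and plots) might look like the following sketch. It assumes _j has already been renamed to time, and it takes the range of 26 to 76 from the summary of the original read variable.

```stata
* Sketch: inspect the imputed values (run after renaming _j to time)
* id, _mj, and time together should uniquely identify each row
isid id _mj time
* Summary statistics separately for each imputation
bysort _mj: summarize read math
* Rows whose read value falls outside the originally observed range of 26 to 76
count if read < 26 | read > 76
* Distributions by imputation
graph box read math, over(_mj)
histogram read, by(_mj)
```

A few values outside the observed range are not necessarily a problem when continuous variables are imputed with regress, but a large count would be worth investigating.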
Once we have carefully checked our data to make sure there were no problems in the imputation, we can run an analysis using mim. You can download mim from within Stata by typing findit mim (see How can I use the findit command to search for programs and get additional help? for more information about using findit). Below we have used the mim: prefix with the command xtreg to predict reading test scores (read) using time and math test scores (math), accounting for the fact that there are multiple observations per student. The command syntax is the same as for xtreg; all that needs to be added is the mim: prefix. Note that using the mim: prefix reduces the amount of output.
mim: xtreg read math time, i(id)
Multiple-imputation estimates (xtreg)                    Imputations =       4
                                                         Minimum obs =     600
                                                         Minimum dof =    85.7
------------------------------------------------------------------------------
        read |     Coef. Std. Err.       t   P>|t|     [95% Conf. Int.]  MI.df
-------------+----------------------------------------------------------------
        math |   .535503   .047671   11.23   0.000    .440733   .630274   85.7
        time |    .10996   .348473    0.32   0.752   -.573873   .793794  987.6
       _cons |   24.2584   2.62314    9.25   0.000    19.0651   29.4517  120.8
------------------------------------------------------------------------------
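The combined coefficients and standard errors in output like this are conventionally obtained with Rubin's rules for multiply imputed estimates. As a brief sketch, in notation that does not come from the original page: let \hat{Q}_j be a coefficient estimate and U_j its squared standard error in imputation j, with m = 4 here.

```latex
\[
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, &
W &= \frac{1}{m}\sum_{j=1}^{m} U_j, &
B &= \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^{2}, \\
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B, &
\mathrm{SE}(\bar{Q}) &= \sqrt{T}, &
\nu &= (m-1)\Bigl(1+\frac{W}{(1+1/m)\,B}\Bigr)^{2}.
\end{aligned}
\]
```

Here W is the average within-imputation variance and B the between-imputation variance. Because B differs from coefficient to coefficient, so do the degrees of freedom \nu, which is presumably why the MI.df column varies by row.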
To demonstrate the difference in how much output is generated, we have run the same xtreg analysis on only the first imputation (if _mj==1) and without the mim: prefix. The output without mim: contains much more information. In general, the parts of the output that are omitted when mim: is used are quantities for which it doesn't make sense to average across imputations.
xtreg read math time if _mj==1, i(id)
Random-effects GLS regression                   Number of obs      =       600
Group variable: id                              Number of groups   =       200
R-sq:  within  = 0.0016                         Obs per group: min =         3
       between = 0.5426                                        avg =       3.0
       overall = 0.3407                                        max =         3
Random effects u_i ~ Gaussian                   Wald chi2(2)       =    135.99
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000
------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .5119683   .0439027    11.66   0.000     .4259207     .598016
        time |   .1384974   .3373886     0.41   0.681    -.5227721     .799767
       _cons |   25.40161   2.453613    10.35   0.000     20.59261     30.2106
-------------+----------------------------------------------------------------
     sigma_u |  4.7500306
     sigma_e |  6.3321664
         rho |  .36008789   (fraction of variance due to u_i)
------------------------------------------------------------------------------
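For readers on Stata 12 or later, the same long-to-wide, impute, wide-to-long workflow can be sketched with the built-in mi suite instead of the user-written ice and mim commands. This is an alternative under stated assumptions, not the method used above. The model choices simply mirror the prediction equations that ice reported. The dataset URL is the one given earlier and may no longer resolve. The imputed values, and therefore the estimates, will not match the output shown on this page.

```stata
* Sketch: analogous workflow with Stata's built-in mi commands (Stata 12+ assumed)
* Run from a do-file; the /// line continuations do not work interactively
use "http://www.ats.ucla.edu/stat/stata/faq/mi_longi.dta", clear
* One row per student, so other waves can serve as predictors
reshape wide read math, i(id) j(time)
* Declare the mi storage style and the roles of the variables
mi set mlong
mi register imputed private ses read1 read2 read3 math1 math2 math3
mi register regular female
* Chained equations: logit for private, mlogit for ses, regress for the scores
mi impute chained (logit) private (mlogit) ses ///
    (regress) read1 read2 read3 math1 math2 math3 = female, ///
    add(4) rseed(91107)
* Back to long form; mi reshape keeps the imputations organized
mi reshape long read math, i(id) j(time)
* Random-effects model combined across the four imputations
mi xtset id
mi estimate: xtreg read math time, re
```

With this approach the original data are kept alongside the imputations and handled by mi itself, so there is no step equivalent to drop if _mj==0.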
Allison, Paul (2001) Missing Data. Sage University Paperback 136. Sage Publications: Thousand Oaks, CA. pg 73-75.
These pages are Copyrighted (c) by UCLA Academic Technology Services