The idea of multiple imputation is to create multiple imputed data sets for a data set with missing values. The analysis of a statistical model is then done on each of the multiple data sets. The multiple analyses are then combined to yield a set of results. In general, multiple imputation techniques require that missing observations are missing at random (MAR).
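The combining step is usually done with Rubin's rules. As a sketch, write $\hat{Q}_i$ for the estimate of a parameter from the i-th of m imputed data sets and $U_i$ for its estimated variance (these symbols are introduced here for illustration only):

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \\
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m} U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^{2}, \\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B, \qquad
\nu = (m-1)\Bigl(1+\frac{\bar{U}}{(1+1/m)\,B}\Bigr)^{2}.
\end{aligned}
```

The combined estimate is $\bar{Q}$, its standard error is $\sqrt{T}$, and tests and confidence intervals use a t distribution with $\nu$ degrees of freedom.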
There are two major approaches to multiple imputation. The first is based on the joint distribution of all the variables considered, whether a variable is to be imputed or is used only as a predictor in the imputation. This is the approach that SAS proc mi takes; because the joint model is typically multivariate normal, it works well with multivariate normal data. The other approach is based on the conditional density of each variable given all the other variables. This is the approach taken by Stata's user-written program ice, which stands for Imputation by Chained Equations. The drawback is that the conditional densities can be incompatible, that is, they need not correspond to any joint distribution. However, simulation studies have shown that the approach performs well in practice. ice was written by Patrick Royston, who has published several articles introducing the ice suite of programs and the improvements made to it. There are some obvious advantages to using Stata's ice instead of SAS's proc mi or any other program with the same approach:
One major disadvantage, besides ice being less theoretically sound than proc mi, is that it can be more computationally intensive. Given the computing power now available, this is less of an issue.
Let's discuss in some detail how ice works. Conceptually there are two major steps. The first step is the imputation of a single variable given a set of predictor variables, done by the program uvis, which comes as part of ice. The second step is the so-called "regression switching", a scheme for cycling through all the variables to be imputed using uvis. In other words, internally, ice calls uvis many times.
uvis imputes a single variable from a set of predictor variables using an appropriate regression model. The model can be OLS if the imputed variable is continuous, or a logit model if it is binary; other types of models are available as well. Given the fitted model, uvis creates imputed values for the missing observations either by drawing from the posterior predictive distribution or by predictive matching. Two distributions are involved here: the distribution of the regression coefficients and the distribution of the residual standard deviation. A random disturbance is added to both. By default the coefficients are drawn from a multivariate normal distribution; with the boot option, they are instead estimated from a bootstrap sample, which relaxes that normality assumption and makes the imputation more robust.
For example, the Stata code below shows what uvis does internally for the method of random drawing. Variable w1 is created by uvis, and variable w2 is created by code that mirrors what uvis does. The two are identical. The point of showing the code is to see what is involved.
uvis regress write math science, seed(12457) gen(w1)

set seed 12457
regress write math science
tempname b e V chol bstar
tempvar xb u
matrix `b'=e(b)
matrix `e'=e(b)
matrix `V'=e(V)
matrix `chol'=cholesky(`V')
local colsofb=colsof(`b')
local rmse=e(rmse)
local df=e(df_r)
local chi2=2*invgammap(`df'/2,uniform())
local rmsestar=`rmse'*sqrt(`df'/`chi2')
matrix `chol'=`chol'*sqrt(`df'/`chi2')
forvalues i=1/`colsofb' {
    matrix `e'[1,`i']=invnorm(uniform())
}
matrix `bstar'=`b'+`e'*`chol''   /*disturbance here*/
gen `u'=uniform()
matrix score `xb'=`bstar'   /*score the data with the new coefficient*/
gen w2 = write
replace w2=`xb'+`rmsestar'*invnorm(`u') if write==.   /*disturbance here again*/
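In formulas, the code above can be read as follows. The notation is introduced here only for illustration: $\hat{\beta}$ and $\hat{\sigma}$ are the OLS estimates (e(b) and e(rmse)), $\hat{V}$ is e(V), $L$ is its lower Cholesky factor, and $df$ is the residual degrees of freedom e(df_r).

```latex
\begin{aligned}
g &\sim \chi^{2}_{df}, \qquad \sigma^{*} = \hat{\sigma}\,\sqrt{df/g}, \\
\beta^{*} &= \hat{\beta} + \frac{\sigma^{*}}{\hat{\sigma}}\,L\,z,
  \qquad z \sim N(0, I), \quad L L^{\top} = \hat{V}, \\
y_i^{*} &= x_i^{\top}\beta^{*} + \sigma^{*}\varepsilon_i,
  \qquad \varepsilon_i \sim N(0,1) \ \text{for each observation with } y_i \text{ missing}.
\end{aligned}
```

The first line is the disturbance on the residual standard deviation, the second is the disturbance on the coefficients, and the third adds individual-level noise to each imputed value.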
To create one imputed data set for multiple variables x_1, x_2, ..., x_k, with missing observations, ice does the following:
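The general idea of cycling through the variables can be sketched with official Stata commands. This is a deliberately simplified illustration and not what ice actually runs: it uses the auto data shipped with Stata, fills starting values with the observed mean, and adds only residual noise, so it ignores the uncertainty in the regression coefficients that uvis accounts for. The variable names y1, y2, m1, m2, the seed, and the choice of 10 cycles are assumptions made for the example.

```stata
* Simplified regression switching on two continuous variables.
* Assumption: auto.dta (shipped with Stata) with artificial missing values.
sysuse auto, clear
set seed 2005
gen double y1 = mpg
gen double y2 = weight
replace y1 = . if uniform() > .8
replace y2 = . if uniform() > .8
gen byte m1 = missing(y1)    // remember which values were missing
gen byte m2 = missing(y2)

* Starting values: fill each variable with its observed mean.
foreach v in y1 y2 {
    quietly summarize `v'
    quietly replace `v' = r(mean) if missing(`v')
}

* Cycle through the variables, re-imputing each one from the others.
forvalues cycle = 1/10 {
    quietly regress y1 y2 length if !m1
    quietly predict double xb1, xb
    quietly replace y1 = xb1 + e(rmse)*invnorm(uniform()) if m1
    drop xb1

    quietly regress y2 y1 length if !m2
    quietly predict double xb2, xb
    quietly replace y2 = xb2 + e(rmse)*invnorm(uniform()) if m2
    drop xb2
}
count if missing(y1) | missing(y2)
```

Each regression is fit only on the observations where the response was originally observed, while the predictors use the current imputed values of the other variable. That is the "switching" part of the scheme.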
Now let's look at some examples to see how ice works.
Please note that you may have to download some of the programs used in the examples on this page. These include ice, mvpatterns and mim. You can do this with the findit command. For example, to download the mvpatterns command, you can type findit mvpatterns (see How can I use the findit command to search for programs and get additional help? for more information about using findit). Both ice and mim (the prefix used to combine the results from the different imputed data sets) have been updated several times since they were originally released. If you have an older version of either of these programs, you may need to download the current version. To determine which version of ice you have, for example, you can type which ice. To ensure that you have the current version of ice, you can type
which ice
ssc describe ice
ssc install ice, replace
To ensure that you have the most current version of mim, you can type
which mim
ssc describe mim
ssc install mim, replace
To demonstrate, we have created a data set with missing values. Below is the code for creating the data set used in these examples. Notice that variables fxw and fxr are interaction terms. We also use the mvpatterns command to see the pattern of the missing data.
clear
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
set seed 1368
gen r = uniform()
replace female =. if r >.8
set seed 4689
gen r1 = uniform()
replace ses = . if r1>.8
replace write = . if math >60
gen fxw=female*write
gen fxr=female*read
drop r r1

mvpatterns ses female write fxw fxr

Variable     | type     obs   mv   variable label
-------------+-----------------------------------
ses          | float    154   46
female       | float    155   45
write        | float    156   44   writing score
fxw          | float    118   82
fxr          | float    155   45
-------------------------------------------------

Patterns of missing values

  +------------------------+
  | _pattern   _mv   _freq |
  |------------------------|
  |    +++++     0      90 |
  |    ++..+     2      30 |
  |    .++++     1      28 |
  |    +.+..     3      28 |
  |    ..+..     4      10 |
  |------------------------|
  |    .+..+     3       7 |
  |    +....     4       6 |
  |    .....     5       1 |
  +------------------------+

label data "hsb2 with missing data"
save hsb2_mice, replace
A dry run is a run where no imputed data set is created and no imputation model is actually run. This is a very useful first step to view the default settings and to pinpoint the changes needed in order to build up a sensible imputation model. For example, in the run below, we can see the regression model for each variable with missing values. Some of them might be appropriate already and some of them might need some changes. For example, variable ses is a categorical variable, but it is being used as a continuous variable in the prediction model for the variable write. Variables fxw and fxr are interaction terms, but they are treated as if they are not.
Please note that we change the imputation model as we move through the examples. Our goal is not to show how to develop a theoretically plausible imputation model, but rather to illustrate the various options that may be useful in different situations.
use hsb2_mice, clear
ice female ses read math write socst science race fxw fxr, dryrun

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          1 |         28       14.00       59.00
          2 |         30       15.00       74.00
          3 |         35       17.50       91.50
          4 |         16        8.00       99.50
          5 |          1        0.50      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race fxw fxr
        ses | mlogit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxw fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw | regress | female ses read math write socst science race fxr
        fxr | regress | female ses read math write socst science race fxw

End of dry run. No imputations were done, no files were created.
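The first table of the dry run is a count of missing values per observation. As a quick cross-check, the same kind of table can be built with official commands. This is a sketch that assumes hsb2_mice.dta was created as shown earlier; nmiss is a hypothetical variable name, and egen's rowmiss() function is assumed to be available (older Stata releases called it rmiss()).

```stata
* Count missing values per observation across the imputation variables.
* Assumption: hsb2_mice.dta was saved by the data-creation code on this page.
use hsb2_mice, clear
egen nmiss = rowmiss(female ses read math write socst science race fxw fxr)
tabulate nmiss
```

Observations with nmiss equal to 0 are the complete cases, which are all that a complete-case analysis of these variables would keep.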
We have seen in the previous example that variables fxw and fxr are not yet being treated as interaction terms. They would be imputed as if they were unrelated to the variables female, write and read. We can correct this by using the passive option. This means the imputed values for fxw will simply be the products of female and write. The same is done for fxr. Notice that multiple terms are separated by backslashes "\" in the passive option.
ice female ses read math write socst science race fxw fxr , ///
passive(fxw:female*write \fxr: female*read) dryrun

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race
        ses | mlogit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]

End of dry run. No imputations were done, no files were created.
The variable ses has three categories, and by default ice will treat it as an unordered categorical variable. Therefore, mlogit is used in the prediction model. Now, we might say that ses is actually ordered and we want to use ordinal logistic regression for prediction instead. This can be done by using the option cmd. Multiple changes can be made and are separated by commas.
ice female ses read math write socst science race fxw fxr , ///
passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) dryrun

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race
        ses | ologit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]

End of dry run. No imputations were done, no files were created.
We can also change the set of predictors in the prediction models. In the example below, we use only the variables read and math in the prediction model for ses and the variable write for the prediction model of the variable female. Obviously, there is no good reason for simplifying the prediction models this way; this is only for the purpose of demonstration.
ice female ses read math write socst science race fxw fxr , ///
passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) ///
eq(ses: read math, female: write) dryrun

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write
        ses | ologit  | read math
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
In the previous example, at least one obvious problem remains: the prediction equation for the variable write is not set up correctly, because ses and race are both categorical variables but are entered as if they were continuous. Since race has no missing values, we can simply create dummy variables for it. For ses it is not as simple, because we want to keep ses itself in the imputation process so that it is imputed as a single categorical variable. We therefore also create dummy variables s1 and s2 for ses and define them passively from ses. Since the dummy variables s1 and s2 are not directly in the imputation model, we need to substitute them for ses in the prediction model for write. This is done with the option sub (short for substitute), which replaces ses with s1 and s2 wherever ses appears as a predictor.
tab ses, gen(s)
tab race, gen(r)

ice female ses read math write socst science r1 r2 r3 fxw fxr s1 s2, ///
passive(fxw:female*write \fxr: female*read \s1:(ses==1) \s2:(ses==2)) ///
sub(ses: s1 s2) cmd(ses:ologit) ///
eq(ses: read math, female: write) dryrun

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          2 |         30       15.00       60.00
          3 |         56       28.00       88.00
          4 |          6        3.00       91.00
          5 |          7        3.50       94.50
          6 |         10        5.00       99.50
          7 |          1        0.50      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write
        ses | ologit  | read math
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female read math socst science r1 r2 r3 fxr s1 s2
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
         r1 |         | [No missing data in estimation sample]
         r2 |         | [No missing data in estimation sample]
         r3 |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
         s1 |         | [Passively imputed from (ses==1)]
         s2 |         | [Passively imputed from (ses==2)]

End of dry run. No imputations were done, no files were created.
By now, it seems that at least mechanically we have a correct imputation model. There are, however, other issues to consider. For example, as mentioned briefly in the introduction, the normality assumption on the posterior distribution of the regression coefficients may not be valid, so we may want to use the boot option. We may also want to use the predictive matching method for some of the variables, and the seed option so that our results are reproducible. Here is an example that does all of these: boot(write) uses the bootstrap for write, match(female) uses predictive matching for female, and seed() fixes the random number seed.
ice female ses read math write socst science r1 r2 r3 fxw fxr s1 s2, ///
passive(fxw:female*write \fxr: female*read \s1:(ses==1) \s2:(ses==2)) ///
substitute(ses: s1 s2) cmd(ses:ologit) ///
eq(ses: read math write socst science, female: write read) boot(write) match(female) seed(1285964) dryrun
Now we are all set to create our imputed data set. We choose m = 5 for five copies of imputed data sets. This is quite arbitrary. Royston has a detailed discussion on the choice of m in his article. The choice of m still remains an issue for more research.
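One commonly cited guide to the choice of m is Rubin's relative efficiency of an estimate based on m imputations compared with one based on infinitely many. The symbol $\lambda$ (the fraction of missing information) is introduced here for illustration and is not estimated on this page:

```latex
\mathrm{RE} = \Bigl(1 + \frac{\lambda}{m}\Bigr)^{-1},
\qquad \text{e.g. } \lambda = 0.3,\ m = 5:\quad
\mathrm{RE} = \frac{1}{1.06} \approx 0.943 .
```

Under this measure, a small m already gives high efficiency unless the fraction of missing information is large, although efficiency of the point estimate is only one of several considerations.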
*get the final imputation model
ice female ses read math write socst science r1 r2 r3 fxw fxr s1 s2 using imp, m(5) ///
passive(fxw:female*write \fxr: female*read \s1:(ses==1) \s2:(ses==2)) ///
substitute(ses: s1 s2) cmd(ses:ologit) ///
eq(ses: read math write socst science, female: write read) boot(write) match(female) seed(1285964) replace

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write read
        ses | ologit  | read math write socst science
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female read math socst science r1 r2 r3 fxr s1 s2
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
         r1 |         | [No missing data in estimation sample]
         r2 |         | [No missing data in estimation sample]
         r3 |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
         s1 |         | [Passively imputed from (ses==1)]
         s2 |         | [Passively imputed from (ses==2)]

Imputing 1..2..3..4..5..
file imp.dta saved
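After the imputation has run, it can be worth confirming that the passively imputed variables really are functions of the imputed values. The following is a minimal sketch, assuming imp.dta was created by the ice command above and that the imputation number is stored in _mj (older versions of ice call it _j). If any assert fails, Stata stops with an error.

```stata
* Check the passively imputed variables in the imputed data.
* Assumptions: imp.dta exists; imputed copies have _mj = 1, 2, ..., 5.
use imp, clear
assert fxw == female*write if _mj >= 1
assert fxr == female*read  if _mj >= 1
assert s1 == (ses==1)      if _mj >= 1
assert s2 == (ses==2)      if _mj >= 1

* No missing values should remain in the imputed variables.
count if (missing(female) | missing(ses) | missing(write)) & _mj >= 1
```

The condition _mj >= 1 restricts the checks to the imputed copies, so the sketch behaves the same whether or not the file also holds the original data.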
There are many options that ice can take. The options discussed on this page, which are essential for building up an imputation model, are dryrun, passive(), cmd(), eq(), substitute() (abbreviated sub()), boot(), match(), seed(), m() and replace.
Once we have created our imputed data set, we are ready to do our data analysis. Here are some examples. We use the mim prefix to combine the results from the different imputed data sets into a single output.
use imp, clear
(hsb2 with missing data)

mim: reg write math female

Multiple-imputation estimates (regress)                  Imputations =       5
Linear regression                                        Minimum obs =     200
                                                         Minimum dof =    12.5

------------------------------------------------------------------------------
       write |     Coef. Std. Err.       t   P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
        math |   .660069   .071499    9.23   0.000   .513096   .807042    26.0
      female |   4.76661   1.57098    3.03   0.010   1.35986   8.17335    12.5
       _cons |   15.5316     3.699    4.20   0.000   8.05143   23.0118    39.3
------------------------------------------------------------------------------

mim: logit female write math

Multiple-imputation estimates (logit)                    Imputations =       5
Logistic regression                                      Minimum obs =     200
                                                         Minimum dof =    12.1

------------------------------------------------------------------------------
      female |     Coef. Std. Err.       t   P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
       write |   .083813   .031366    2.67   0.020    .01556   .152065    12.1
        math |  -.062983   .027661   -2.28   0.032  -.120149  -.005817    23.4
       _cons |  -.938246   1.19821   -0.78   0.441  -3.41531   1.53882    23.3
------------------------------------------------------------------------------

mim: logit female write math, or

Multiple-imputation estimates (logit)                    Imputations =       5
Logistic regression                                      Minimum obs =     200
                                                         Minimum dof =    12.1

------------------------------------------------------------------------------
      female | Odds Rat. Std. Err.       t   P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
       write |   1.08743   .034109    2.67   0.020   1.01568   1.16424    12.1
        math |   .938959   .025973   -2.28   0.032   .886789     .9942    23.4
------------------------------------------------------------------------------

mim: mlogit ses write math

Multiple-imputation estimates (mlogit)                   Imputations =       5
Multinomial logistic regression                          Minimum obs =     200
                                                         Minimum dof =    19.0

------------------------------------------------------------------------------
         ses |     Coef. Std. Err.       t   P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
low          |
       write |   .033503   .031968    1.05   0.308  -.033413   .100419    19.0
        math |  -.081796   .034454   -2.37   0.024  -.152199  -.011394    29.6
       _cons |   1.69122   1.51042    1.12   0.272  -1.40459   4.78704    27.6
-------------+----------------------------------------------------------------
high         |
       write |   .027159   .024009    1.13   0.260  -.020349   .074667   127.6
        math |   .031625    .02691    1.18   0.244  -.022111   .085361    65.4
       _cons |  -3.65612   1.42128   -2.57   0.015  -6.56174  -.750497    29.3
------------------------------------------------------------------------------

mim: mlogit ses write math, rrr

Multiple-imputation estimates (mlogit)                   Imputations =       5
Multinomial logistic regression                          Minimum obs =     200
                                                         Minimum dof =    19.0

------------------------------------------------------------------------------
         ses |       RRR Std. Err.       t   P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
low          |
       write |   1.03407   .033057    1.05   0.308   .967139   1.10563    19.0
        math |    .92146   .031748   -2.37   0.024   .858818   .988671    29.6
-------------+----------------------------------------------------------------
high         |
       write |   1.02753    .02467    1.13   0.260   .979857   1.07753   127.6
        math |   1.03213   .027774    1.18   0.244   .978132   1.08911    65.4
------------------------------------------------------------------------------
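To see roughly what happens when results are combined, the point estimate and standard error for math in the first model can be rebuilt by hand with Rubin's rules: average the m coefficients, average their squared standard errors (within-imputation variance), compute the variance of the coefficients across imputations (between-imputation variance), and add the two with the factor (1 + 1/m). This is a sketch, assuming imp.dta from above with the imputation number stored in _mj; the local macro names are hypothetical.

```stata
* Combine regress results across imputations by hand (Rubin's rules).
* Assumptions: imp.dta holds 5 imputations indexed by _mj = 1, ..., 5.
use imp, clear
local m 5
local sumq 0
local sumu 0
forvalues j = 1/`m' {
    quietly regress write math female if _mj == `j'
    local q`j' = _b[math]                  // estimate in imputation j
    local sumq = `sumq' + _b[math]
    local sumu = `sumu' + _se[math]^2      // its estimated variance
}
local qbar = `sumq'/`m'                    // combined point estimate
local w = `sumu'/`m'                       // within-imputation variance
local b 0
forvalues j = 1/`m' {
    local b = `b' + (`q`j'' - `qbar')^2/(`m' - 1)   // between-imputation variance
}
local t = `w' + (1 + 1/`m')*`b'            // total variance
display "combined coefficient on math = " `qbar'
display "combined standard error      = " sqrt(`t')
```

The two displayed numbers should be close to the Coef. and Std. Err. entries for math in the mim: reg output. The degrees of freedom that mim reports may use a different formula and are not reproduced here.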
The ice program creates a single data set containing all of the imputed copies of the data. It also creates two new variables, _mi and _mj: _mi indicates the observation number, and _mj indicates the imputation number. If you have used another imputation program, you might instead have multiple data sets, each holding a single imputation. In order to use mim to perform analyses, you can combine the multiple data sets into one using mijoin.
Please note that if you are using older versions of ice or mim, you have the variables _i and _j instead of _mi and _mj. They are the same variables, so you can simply rename _i and _j to be _mi and _mj.
ls imp*
   2.8k   6/15/05 11:00  imp1.dta
   2.8k   6/15/05 11:00  imp2.dta
   2.8k   6/15/05 11:00  imp3.dta
   2.8k   6/15/05 11:00  imp4.dta
   2.8k   6/15/05 11:00  imp5.dta
   2.8k   6/15/05 10:11  imp6.dta
   2.8k   6/15/05 10:11  imp7.dta
   2.8k   6/15/05 10:11  imp8.dta
   2.8k   6/15/05 10:11  imp9.dta

mijoin imp, m(5) clear
tab _mj

 imputation |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        200       20.00       20.00
          2 |        200       20.00       40.00
          3 |        200       20.00       60.00
          4 |        200       20.00       80.00
          5 |        200       20.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00
Conversely, you may occasionally want to analyze your imputed data with another program, which may require breaking the single data set created by ice into multiple data sets. You can use misplit to do that.
misplit, clear m(5)

Data for 5 imputations have been copied to _mitemp1.dta to _mitemp5.dta
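If misplit is not available, a similar split can be done with official commands. This is a sketch under the assumption that imp.dta holds 5 imputations indexed by _mj; the output file names imp_split1.dta to imp_split5.dta are hypothetical and chosen so that they do not overwrite the files shown above.

```stata
* Split the stacked file into one file per imputation.
* Assumptions: imp.dta exists and _mj takes the values 1, ..., 5.
use imp, clear
forvalues j = 1/5 {
    preserve
    keep if _mj == `j'
    save imp_split`j', replace
    restore
}
```

Because of preserve and restore, the full stacked data set is back in memory after each pass through the loop.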
S. van Buuren, J.P.L. Brand, C.G.M. Groothuis-Oudshoorn and D.B. Rubin, Full Conditional Specification in Multivariate Imputation
Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227–241.
Royston, P. 2005. Multiple imputation of missing values: update. Stata Journal 5(2): 188–201.
Royston, P. 2005. Multiple imputation of missing values: Update of ice. Stata Journal 5(4): 527–536.
These pages are Copyrighted (c) by UCLA Academic Technology Services