
Note 2: Even though there are examples on this page using passive imputation, we do not usually recommend the use of it.

Multiple Imputation Using ICE

Introduction

The idea of multiple imputation is that instead of filling in missing values to create a single imputed dataset, several imputed datasets are created, each of which contains different imputed values. The statistical model of interest is then fit to each of the imputed datasets, and the multiple analyses are combined to yield a single set of results. The major advantage of multiple imputation over single imputation is that it produces standard errors that reflect the uncertainty due to imputing the missing values. In general, multiple imputation techniques require that missing observations are missing at random (MAR).
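The combining step is usually done with what are commonly called Rubin's rules. As a brief sketch in our own notation (these symbols are not used elsewhere on this page), let $\hat{Q}_j$ be the estimate of a parameter from imputed dataset $j$ and $U_j$ its squared standard error, for $j = 1, \dots, m$:

$$
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j - \bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr)B
$$

The pooled estimate is $\bar{Q}$ and its standard error is $\sqrt{T}$. The between-imputation variance $B$ is the part that carries the uncertainty due to the missing values; a single imputed dataset has no $B$ term, which is why single imputation tends to understate standard errors.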

There are two major approaches to creating multiply imputed datasets. The first is based on the joint distribution of all the variables in the imputation model, including variables to be imputed and variables used only to impute other variables. In this approach, the joint distribution of all variables in the imputation model is assumed to be multivariate normal. This is the approach taken by Stata's **mi impute mvn** command (introduced in Stata 11). The other approach is based on the conditional density of each variable given the other variables. This is the approach taken by the user-written Stata program **ice**, which stands for Imputation by Chained Equations. **ice** was written by Patrick Royston, who has published a number of articles in the Stata Journal introducing **ice** and documenting improvements made to it. One disadvantage of the chained equations approach is that it is not as theoretically sound as the multivariate normal approach. An additional drawback is that the conditional densities can be incompatible, that is, there may be no joint distribution that has all of them as its conditionals. However, simulation studies have shown that in practice it performs well. There are some advantages to using **ice** instead of **mi impute mvn** or other programs that take the same approach:

- no multivariate joint distribution assumption;
- allows different kinds of weights, as long as the regression models allow them;
- easy to understand;
- may have lower sample size requirements than the multivariate normal approach.

Let's discuss in some detail how **ice** works. Conceptually there are two major steps. The first step is the imputation of a single variable given a set of predictor variables, done by the program **uvis**, which is part of **ice**. The second step is the so-called "regression switching", a scheme for cycling through all the variables to be imputed using **uvis**. In other words, internally, **ice** calls **uvis** many times.

**uvis** imputes a single variable from a set of predictor variables using an appropriate regression model. The regression model can be OLS if the variable being imputed is continuous, or logistic regression if it is binary. Other types of regression models can also be used. With the regression model, **uvis** can create the imputed values for the missing observations either by drawing from the posterior predictive distribution or by prediction matching. Drawing from the posterior predictive distribution involves two distributions: the distribution of the regression coefficients and the distribution of the residual standard deviation. Random disturbances are drawn from both, so the imputed values reflect uncertainty in the estimated model as well as residual variation. By default the regression coefficients are assumed to be multivariate normal. With the **boot** option, the coefficients are instead estimated in a bootstrap sample, which relaxes that assumption and makes the imputations more robust.
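In symbols, the posterior-draw method for a continuous variable can be sketched as follows. The notation is ours, chosen to mirror the steps in the **uvis** code shown on this page: $\hat{\beta}$ is the vector of estimated coefficients, $\hat{V}$ its estimated covariance matrix, $\hat{\sigma}$ the root mean squared error, and $\nu$ the residual degrees of freedom.

$$
g \sim \chi^2_{\nu}, \qquad \sigma^{*} = \hat{\sigma}\sqrt{\nu / g}
$$

$$
\beta^{*} = \hat{\beta} + \sqrt{\nu / g}\; L z, \qquad z \sim N(0, I), \qquad L L^{\top} = \hat{V}
$$

$$
y_i^{*} = x_i \beta^{*} + \sigma^{*}\varepsilon_i, \qquad \varepsilon_i \sim N(0, 1) \quad \text{for each missing } y_i
$$

The first line disturbs the residual standard deviation, the second disturbs the coefficients (with $L$ the Cholesky factor of $\hat{V}$), and the third adds residual noise to the prediction, so each imputed value $y_i^{*}$ reflects all three sources of variation.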

The Stata code below shows what **uvis** does internally to create the random draws (i.e. the imputed values). Variable **w1** is created using **uvis**, and variable **w2** is created using the code that **uvis** uses. They are of course identical. The point of showing the Stata code here is to see what is involved.

    uvis regress write math science, seed(12457) gen(w1)

    set seed 12457
    regress write math science
    tempname b e V chol bstar
    tempvar xb u
    matrix `b'=e(b)
    matrix `e'=e(b)
    matrix `V'=e(V)
    matrix `chol'=cholesky(`V')
    local colsofb=colsof(`b')
    local rmse=e(rmse)
    local df=e(df_r)
    local chi2=2*invgammap(`df'/2,uniform())
    local rmsestar=`rmse'*sqrt(`df'/`chi2')
    matrix `chol'=`chol'*sqrt(`df'/`chi2')
    forvalues i=1/`colsofb' {
        matrix `e'[1,`i']=invnorm(uniform())
    }
    matrix `bstar'=`b'+`e'*`chol''   /*disturbance here*/
    gen `u'=uniform()
    matrix score `xb'=`bstar'   /*score the data with the new coefficient*/
    gen w2 = write
    replace w2=`xb'+`rmsestar'*invnorm(`u') if write==.   /*disturbance here again*/

To create one imputed dataset for multiple variables x_1, x_2, ..., x_k, with missing
observations, **ice** does the following:

- Ignore observations for which every member of x_1, x_2, ..., x_k has a missing value. This step will eliminate the observations that are impossible to impute;
- For each variable with any missing data in x_1, x_2, ..., x_k, randomly order that variable and replicate its observed values across the missing cases. This step initializes the iterative procedure by filling in missing data at random;
- For each of x_1, x_2, ..., x_k, in turn, impute missing values by applying **uvis** with the remaining variables as covariates.
- Repeat the step above the number of times specified by the **cycles(#)** option. The default is 10.
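To make the cycling concrete, here is a minimal sketch of regression switching for two continuous variables, written with official Stata commands only. It is not what **ice** actually runs: the simulated data, the variable names (x1, x2, m1, m2) and the mean-based starting values are assumptions made for illustration, and the sketch treats the estimated coefficients as fixed within each regression instead of drawing them from their posterior as **uvis** does.

```stata
* Toy data (hypothetical): two correlated variables, about 20% missing in each
clear
set seed 2024
set obs 200
generate x1 = rnormal()
generate x2 = 0.5*x1 + rnormal()
* m1 and m2 flag the values that will be set to missing
generate byte m1 = runiform() < 0.2
generate byte m2 = runiform() < 0.2
replace x1 = . if m1
replace x2 = . if m2

* Step 1: drop observations in which every variable is missing
drop if m1 & m2

* Step 2: starting values (simplification: the observed mean is used here,
* whereas ice fills in randomly chosen observed values)
foreach v of varlist x1 x2 {
    quietly summarize `v'
    quietly replace `v' = r(mean) if missing(`v')
}

* Steps 3 and 4: impute each variable in turn, and repeat for 10 cycles
forvalues c = 1/10 {
    quietly regress x1 x2 if !m1
    quietly predict double xb1, xb
    quietly replace x1 = xb1 + e(rmse)*rnormal() if m1
    drop xb1

    quietly regress x2 x1 if !m2
    quietly predict double xb2, xb
    quietly replace x2 = xb2 + e(rmse)*rnormal() if m2
    drop xb2
}

* Observed values are untouched; only the flagged cases were filled in
summarize x1 x2
count if m1 | m2
```

Each regression is fit only to the cases in which the outcome was actually observed, while the predictors may contain values imputed in the previous step. That is the sense in which the equations are "chained".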

Now let's look at some examples to see how **ice** works.

Please note that you may have to download some of the programs used in the examples on this page. These include **ice** and **mvpatterns**. You can do this with the **findit** command. For example, to download the **mvpatterns** command, you can type **findit mvpatterns** in the Stata command window (see How can I use the findit command to search for programs and get additional help? for more information about using **findit**). Note that **ice** has been updated several times since it was originally released. If you have an older version of **ice**, you may need to download the current version. To determine which version of **ice** you have, you can type **which ice**. To ensure that you have the current version of **ice**, you can type:

    which ice
    ssc describe ice
    ssc install ice, replace

To demonstrate, we have created a data set with missing values. Below is
the code for creating the data set used in these examples. Notice that
variables **fxw** and **fxr** are interaction terms. We also use the
**mvpatterns** command to see the pattern of the missing data.

    clear
    use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
    set seed 1368
    gen r = uniform()
    replace female =. if r >.8
    set seed 4689
    gen r1 = uniform()
    replace ses = . if r1>.8
    replace write = . if math >60
    gen fxw=female*write
    gen fxr=female*read
    drop r r1

    mvpatterns ses female write fxw fxr

    Variable     | type     obs   mv   variable label
    -------------+-----------------------------------
    ses          | float    154   46
    female       | float    155   45
    write        | float    156   44   writing score
    fxw          | float    118   82
    fxr          | float    155   45
    -------------------------------------------------

    Patterns of missing values

      +------------------------+
      | _pattern   _mv   _freq |
      |------------------------|
      |    +++++     0      90 |
      |    ++..+     2      30 |
      |    .++++     1      28 |
      |    +.+..     3      28 |
      |    ..+..     4      10 |
      |------------------------|
      |    .+..+     3       7 |
      |    +....     4       6 |
      |    .....     5       1 |
      +------------------------+

    label data "hsb2 with missing data"
    save hsb2_mice, replace
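If you prefer not to install **mvpatterns**, a similar summary can be obtained with Stata's official **misstable** command. The sketch below assumes Stata 11 or later and that hsb2_mice.dta was saved in the current working directory by the code above.

```stata
* Assumes hsb2_mice.dta exists in the current working directory
use hsb2_mice, clear

* Number of missing values for each variable
misstable summarize ses female write fxw fxr

* Patterns of missingness across the same variables, shown as frequencies
misstable patterns ses female write fxw fxr, frequency
```

The layout differs from the **mvpatterns** output, but the counts of missing values per variable and the frequency of each pattern should agree with it.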

When the **dryrun** option is used no imputed data set is created and no imputation
model is actually run. This is a very useful first step to view the default
settings and to pinpoint the changes needed in order to build up a sensible
imputation model. In the run below, we can see the regression
model for each variable with missing values. Some of them might be
appropriate already and some of them might need some changes. For example,
the variable **ses** is a categorical variable, but it is being treated as a
continuous variable when it is used to predict other variables with missing
values. Variables **fxw** and **fxr** are interaction terms, but they are
imputed in the same manner as other variables in the model.

Please note that we change the imputation model as we move through the examples. Our goal is not to show how to develop a theoretically plausible imputation model, but rather to illustrate the various options that may be useful in different situations.

    use hsb2_mice, clear
    ice female ses read math write socst science race fxw fxr, dryrun

       #missing |
         values |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         90       45.00       45.00
              1 |         28       14.00       59.00
              2 |         30       15.00       74.00
              3 |         35       17.50       91.50
              4 |         16        8.00       99.50
              5 |          1        0.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

       Variable | Command | Prediction equation
    ------------+---------+-------------------------------------------------------
         female | logit   | ses read math write socst science race fxw fxr
            ses | mlogit  | female read math write socst science race fxw fxr
           read |         | [No missing data in estimation sample]
           math |         | [No missing data in estimation sample]
          write | regress | female ses read math socst science race fxw fxr
          socst |         | [No missing data in estimation sample]
        science |         | [No missing data in estimation sample]
           race |         | [No missing data in estimation sample]
            fxw | regress | female ses read math write socst science race fxr
            fxr | regress | female ses read math write socst science race fxw

    End of dry run. No imputations were done, no files were created.

A variety of methods have been developed for imputing interactions and other non-linear terms (e.g. squared terms). One method is to create the non-linear terms before the imputation and impute them as though they were just another variable to be imputed; this is what was done in the example above. One alternative method is known as passive imputation. In passive imputation, the non-linear term is included in the imputation model as a predictor, but instead of imputing values of the non-linear term directly, its values are recalculated at each iteration based on the imputed values of the variables from which it is formed. This second method can be implemented in **ice** using the **passive(**...**)** option. In the example below, this means the imputed values for **fxw** will be the product of **female** and **write**, and the imputed values for **fxr** will be the product of **female** and **read**. Notice that multiple terms are separated by backslashes "\" in the **passive** option. Note that there is no universally agreed upon method for imputing non-linear terms; see von Hippel (2009) for a discussion of these and other methods.

    ice female ses read math write socst science race fxw fxr , ///
        passive(fxw:female*write \fxr: female*read) dryrun

       Variable | Command | Prediction equation
    ------------+---------+-------------------------------------------------------
         female | logit   | ses read math write socst science race
            ses | mlogit  | female read math write socst science race fxw fxr
           read |         | [No missing data in estimation sample]
           math |         | [No missing data in estimation sample]
          write | regress | female ses read math socst science race fxr
          socst |         | [No missing data in estimation sample]
        science |         | [No missing data in estimation sample]
           race |         | [No missing data in estimation sample]
            fxw |         | [Passively imputed from female*write]
            fxr |         | [Passively imputed from female*read]

    End of dry run. No imputations were done, no files were created.

The variable **ses** has three categories and by default, when imputing
values of **ses**, **ice** will treat
it as an unordered categorical variable. Therefore, **mlogit** is used in the
prediction model. Now, we might say that **ses** is actually ordered and
we want to use ordinal logistic regression for prediction instead. This can
be done by using the option **cmd(...)**. Multiple changes can be made and are separated by commas.

    ice female ses read math write socst science race fxw fxr , ///
        passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) dryrun

       Variable | Command | Prediction equation
    ------------+---------+-------------------------------------------------------
         female | logit   | ses read math write socst science race
            ses | ologit  | female read math write socst science race fxw fxr
           read |         | [No missing data in estimation sample]
           math |         | [No missing data in estimation sample]
          write | regress | female ses read math socst science race fxr
          socst |         | [No missing data in estimation sample]
        science |         | [No missing data in estimation sample]
           race |         | [No missing data in estimation sample]
            fxw |         | [Passively imputed from female*write]
            fxr |         | [Passively imputed from female*read]

    End of dry run. No imputations were done, no files were created.

By default, ice will impute each variable using all other variables in the
imputation model. It is possible to select a subset of variables for each
prediction model using the **eq(...)** option. In the example below, we use only
the variables **read** and **math**
in the prediction model for **ses** and the variable **write** for
the prediction model of the variable **female**. In this case, there is no good reason for simplifying
the prediction models this way; this is only for the purpose of
demonstration. In practice, if there are more than a few variables in the
imputation model, the regression models used to impute each variable can become
too large to estimate. In these cases, specifying custom equations becomes
necessary.

    ice female ses read math write socst science race fxw fxr , ///
        passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) ///
        eq(ses: read math, female: write) dryrun

       Variable | Command | Prediction equation
    ------------+---------+-------------------------------------------------------
         female | logit   | write
            ses | ologit  | read math
           read |         | [No missing data in estimation sample]
           math |         | [No missing data in estimation sample]
          write | regress | female ses read math socst science race fxr
          socst |         | [No missing data in estimation sample]
        science |         | [No missing data in estimation sample]
           race |         | [No missing data in estimation sample]
            fxw |         | [Passively imputed from female*write]
            fxr |         | [Passively imputed from female*read]

In the previous example, at least one obvious problem still remains. The prediction equation for the variable **write** is not really set up correctly, since the variables **ses** and **race** are categorical but, when used to predict other variables, are treated as though they were continuous. The variable **race** does not have any missing values, so we can add **i.** before the variable name (i.e. **i.race**) to instruct **ice** to replace all instances of **race** with a series of dummy variables. For the variable **ses**, we cannot use **i.** because we also need to tell **ice** whether it should use **ologit** or **mlogit** to predict **ses**. In the code below, we use **o.** before the variable name (i.e. **o.ses**) to indicate that we want to use **ologit** to predict **ses**. If **ses** were nominal, we could use **m.** to indicate that **mlogit** should be used. Note that the **cmd(**...**)** option specifying that **ologit** be used to predict **ses** is no longer necessary; the **o.** prefix handles this. Also note that when we use **i.**, **o.**, or **m.**, the **ice** output includes an interpretation of the command at the top of the output. This interpreted command includes both the **cmd(...)** and **substitute(...)** options, as well as the **xi:** prefix, which is used to create the dummy variables used in the imputation model.

    ice female o.ses read math write socst science i.race fxw fxr , ///
        passive(fxw:female*write \fxr: female*read) eq(ses: read math, female: write) dryrun

    => xi: ice female ses i.ses read math write socst science i.race fxw fxr, cmd(ses:ologit)
    > substitute(ses:i.ses) eq(ses: read math, female: write) passive(fxw:female*write \
    > fxr: female*read) dryrun

    i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)
    i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

       #missing |
         values |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         90       45.00       45.00
              1 |         28       14.00       59.00
              2 |         30       15.00       74.00
              3 |         35       17.50       91.50
              4 |         16        8.00       99.50
              5 |          1        0.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

       Variable | Command | Prediction equation
    ------------+---------+-------------------------------------------------------
         female | logit   | write
            ses | ologit  | read math
        _Ises_2 |         | [Passively imputed from (ses==2)]
        _Ises_3 |         | [Passively imputed from (ses==3)]
           read |         | [No missing data in estimation sample]
           math |         | [No missing data in estimation sample]
          write | regress | female _Ises_2 _Ises_3 read math socst science
                |         | _Irace_2 _Irace_3 _Irace_4 fxr
          socst |         | [No missing data in estimation sample]
        science |         | [No missing data in estimation sample]
       _Irace_2 |         | [No missing data in estimation sample]
       _Irace_3 |         | [No missing data in estimation sample]
       _Irace_4 |         | [No missing data in estimation sample]
            fxw |         | [Passively imputed from female*write]
            fxr |         | [Passively imputed from female*read]
    ------------------------------------------------------------------------------

    End of dry run. No imputations were done, no files were created.

By now, it seems that at least mechanically we have a correct imputation model. It turns out that there are actually many issues that we need to consider. For example, as we mentioned briefly in the Introduction, the normality assumption on the posterior distribution may not be valid. Therefore we might want to use the **boot(**...**)** (for bootstrap) option. We can also use the **match(**...**)** option to use prediction matching rather than the default imputation method. Below we impute **female** using bootstrapping and **write** using matching. If we wanted to use the **boot(**...**)** or **match(**...**)** option for all of the variables to be imputed, we could omit the parentheses and variable list, that is, simply include the word **boot** or **match** in the options. We might also want to use the **seed** option so that our results are reproducible. Here is an example that does all of these.

    ice female o.ses read math write socst science i.race fxw fxr, ///
        passive(fxw:female*write \fxr: female*read) ///
        eq(ses: read math write socst science, female: write read) ///
        boot(female) match(write) seed(1285964) dryrun

Now we are all set to create our imputed datasets. We choose m = 5 to create five imputed datasets. Five is sufficient for this example; however, more imputed datasets may be necessary for actual research applications. One article that discusses the choice of m is Graham, Olchowski, and Gilreath (2007). It seems unlikely that a single correct value for m will be established because, like sample size, the number of imputations needed depends on features of the individual dataset and analysis model. The choice of m remains a topic of ongoing research.

    *run the final imputation model
    ice female o.ses read math write socst science i.race fxw fxr , ///
        passive(fxw:female*write \fxr: female*read) ///
        eq(ses: read math, female: write) ///
        boot(female) match(write) seed(1285964) m(5) saving(imp, replace)

    => xi: ice female ses i.ses read math write socst science i.race fxw fxr, cmd(ses:ologi
    > t) substitute(ses:i.ses) eq(ses: read math, female: write) passive(fxw:female*write \
    > fxr: female*read) boot(female) match(write) seed(1285964) m(5) saving(d:\temp\ice_upd
    > ating, replace)

    i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)
    i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

       #missing |
         values |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         90       45.00       45.00
              1 |         28       14.00       59.00
              2 |         30       15.00       74.00
              3 |         35       17.50       91.50
              4 |         16        8.00       99.50
              5 |          1        0.50      100.00
    ------------+-----------------------------------
          Total |        200      100.00

       Variable | Command | Prediction equation
    ------------+---------+-------------------------------------------------------
         female | logit   | write
            ses | ologit  | read math
        _Ises_2 |         | [Passively imputed from (ses==2)]
        _Ises_3 |         | [Passively imputed from (ses==3)]
           read |         | [No missing data in estimation sample]
           math |         | [No missing data in estimation sample]
          write | regress | female _Ises_2 _Ises_3 read math socst science
                |         | _Irace_2 _Irace_3 _Irace_4 fxr
          socst |         | [No missing data in estimation sample]
        science |         | [No missing data in estimation sample]
       _Irace_2 |         | [No missing data in estimation sample]
       _Irace_3 |         | [No missing data in estimation sample]
       _Irace_4 |         | [No missing data in estimation sample]
            fxw |         | [Passively imputed from female*write]
            fxr |         | [Passively imputed from female*read]
    ------------------------------------------------------------------------------

    Imputing ..........1..........2..........3..........4..........5
    file d:\temp\ice_updating.dta saved
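Once the imputed datasets exist, the analysis model is fit to each one and the results are combined. The sketch below does this by hand for a single coefficient, using the standard combining rules (often called Rubin's rules). It rests on several assumptions: that the stacked file was saved as imp.dta, as in the command above; that **ice** stored the imputation number in the variable **_mj** (0 for the original data, 1 to 5 for the imputed copies); and that the analysis model is a regression of **read** on **write** and **female**. The analysis model and the scalar names are ours, chosen only for illustration.

```stata
* Assumes imp.dta (created by ice with m(5)) is in the working directory
use imp, clear
local m = 5

* Fit the analysis model in each imputed copy and accumulate the results
scalar q_bar = 0
scalar w_var = 0
forvalues j = 1/`m' {
    quietly regress read write female if _mj == `j'
    scalar q_`j' = _b[write]
    scalar q_bar = scalar(q_bar) + _b[write]/`m'
    scalar w_var = scalar(w_var) + _se[write]^2/`m'
}

* Between-imputation variance of the coefficient
scalar b_var = 0
forvalues j = 1/`m' {
    scalar b_var = scalar(b_var) + (scalar(q_`j') - scalar(q_bar))^2/(`m' - 1)
}

* Total variance = within + (1 + 1/m) * between
scalar t_var = scalar(w_var) + (1 + 1/`m')*scalar(b_var)

display "pooled coefficient for write: " scalar(q_bar)
display "pooled standard error:        " sqrt(scalar(t_var))
```

In practice this step is usually automated, for example with the user-written **mim** command described in the references, which also computes the appropriate degrees of freedom for tests and confidence intervals.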

There are many options that **ice** can take. Here is a partial list of
options which we will discuss. These options can be important for building
up an imputation model.

**m**

- This is used to specify the number of imputations to carry out. Further research is needed regarding the choice of m. Royston has a discussion in his 2004 article (another source is Graham, Olchowski, and Gilreath 2007). In general, it is not as simple as just choosing m = 5. If **m(...)** is not specified, **ice** will generate a single imputation.

**cmd**

- **ice** will decide the variable type for each imputed variable automatically based on the number of values the variable takes on. For example, if the variable **x** takes only the values 0, 1 and 2, **ice** will assume **x** is a categorical variable and will use multinomial logistic regression for **x** in the imputation process. In other words, **mlogit** will be the default model for **x**. But we can force **ice** to use **ologit** instead for **x** if we think that ordinal logistic regression is more appropriate. For example, **cmd(x1 x2:logit, x3:regress)** specifies that logistic regression will be used for variables **x1** and **x2**, while OLS regression will be used for variable **x3**.

**substitute**

- This makes substitutions of the variables in the prediction models. For example, a categorical variable might be both imputed and used as a predictor in the prediction models for other imputed variables. This option allows us to treat the variable as categorical when it is used to predict other variables. This option has largely been replaced by the **o.** and **m.** prefixes.

**o. and m.**

- These "options" are prefixes added to variable names (e.g. **m.varname**) to specify that either **ologit** (**o.**) or **mlogit** (**m.**) should be used to impute the variable. These prefixes also create dummy variables and include them in the imputation model when the variable is used as a predictor. Using either **o.** or **m.** is equivalent to using **substitute(...)** and **cmd(...)** together for those variables (i.e. it is a shortcut).

**eq**

- Each variable being imputed has a regression model associated with it. The **eq** option can be used to specify the predictors on the right-hand side of the regression model. By default, **ice** takes every variable in the main list as a predictor variable. This can work well in small models, but often becomes problematic when one is imputing many variables. Specifying **eq(ses: read math, female: write)** means that the predictor variables for the variable **ses** will be **read** and **math**, and that the only predictor variable for the variable **female** is **write**.

**passive**

- This means that a variable will not be independently imputed. For example, if the variable **x12** is the interaction term of variables **x1** and **x2**, its imputed values will be determined by the imputed values of **x1** and **x2**, rather than by directly imputing values of **x12**. This option can be used with interaction terms or transformed variables. With this option, we can put all of the variables in the imputation model. For example, **passive(fxw:female*write \fxr: female*read)** means that **fxw** is the interaction term of the variables **female** and **write**, and **fxr** is the interaction term of the variables **female** and **read**. Values of both **fxw** and **fxr** will be determined by the values of the variables **female**, **write** and **read**. While **ice** has the capacity to passively impute variables, whether this is a desirable way of handling interactions and transformations is still a subject of research.

**boot**

- This option specifies that the regression parameters should be estimated in a bootstrap sample. It has the advantage of robustness, since the normality assumption is not required.

**match**

- This option switches to the prediction matching imputation method, overriding the default imputation method, which is a random draw from the posterior distribution. **match** can only be used with continuous variables.

**seed**

- This sets the random number seed for the purpose of replicating the imputed data sets.

**cycles**

- This determines the number of cycles of regression switching to be carried out between imputations.

**on**

- This changes the entire imputation model. Giving a variable list in the **on** option fixes a single prediction model for all the variables to be imputed. For example, **on(write math)** means that whatever variable is being imputed, the predictors in the prediction model are always **write** and **math**.

**genmiss**

- This option produces a series of indicator variables, one for each variable in the imputation model. Each is equal to 1 when the associated variable was imputed, and 0 otherwise. These variables are named using a prefix specified by the user in **genmiss(...)** followed by the name of the associated variable. For example, if we had specified **genmiss(m_)**, **ice** would have generated a variable **m_write**, equal to 1 when **write** was imputed, and 0 otherwise.
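As an illustration of how the **genmiss(...)** indicators might be used, the sketch below reruns a simplified imputation model with **genmiss(m_)** and then compares observed and imputed values of **write** in the first imputed copy. The output file name imp_flags is hypothetical, the simplified variable list is ours, and the code assumes that **ice** stores the imputation number in the variable **_mj**.

```stata
* Assumes hsb2_mice.dta is in the working directory and ice is installed
use hsb2_mice, clear
ice female o.ses read math write socst science i.race, ///
    seed(1285964) m(5) genmiss(m_) saving(imp_flags, replace)

use imp_flags, clear

* How many values of write were imputed in the first imputed copy
tabulate m_write if _mj == 1

* Compare observed (m_write == 0) and imputed (m_write == 1) values of write
bysort m_write: summarize write if _mj == 1
```

A comparison like this is a quick plausibility check. Imputed values need not have the same distribution as observed values under MAR (here **write** was set to missing for students with high **math** scores), but values far outside the observed range would suggest a problem with the imputation model.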

For more information on working with missing data in Stata, see our seminars Multiple Imputation in Stata, Part 1 and Part 2.

Carlin, J. B., Galati, J. C. and Royston, P. 2008. A new framework for managing and analyzing multiply imputed data in Stata. *Stata Journal* 8(1): 49-67.

Graham, J. W., Olchowski, A. E. and Gilreath, T. D. 2007. How many imputations are really needed? Some practical clarifications of multiple imputation theory. *Prev Sci* 8: 206.

van Buuren, S, Brand, J.P.L., Groothuis-Oudshoorn, C.G.M. and Rubin, D.B. Full Conditional Specification in Multivariate Imputation

von Hippel, P. T. 2009. How to impute interactions, squares, and other transformed variables. *Sociological Methodology*.

Royston, P. 2004. Multiple imputation of missing values. *Stata Journal* 4(3): 227-241.

Royston, P. 2005. Multiple imputation of missing values: Update. *Stata Journal* 5(2): 188-201.

Royston, P. 2005. Multiple imputation of missing values: Update of ice. *Stata Journal* 5(4): 527-536.

Royston, P. 2007. Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring. *Stata Journal* 7(4): 445-464.

Royston, P. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. *Stata Journal* 9(3): 466-477.

Royston, P., Carlin, J. B. and White, I. R. 2009. Multiple imputation of missing values: New features for mim. *Stata Journal* 9(2): 252-264.
