Note 1: This page was deprecated on September 20, 2011, because multiple imputation using ICE is now available as part of mi in Stata 12.
Note 2: Although some examples on this page use passive imputation, we do not usually recommend it.

Stata Library
Multiple Imputation Using ICE

Introduction

The idea of multiple imputation is that, instead of filling in missing values to create a single imputed dataset, several imputed datasets are created, each containing different imputed values. The statistical model of interest is then fit to each imputed dataset, and the separate analyses are combined to yield a single set of results. The major advantage of multiple imputation over single imputation is that it produces standard errors that reflect the uncertainty due to the imputation of missing values. In general, multiple imputation techniques require that missing observations be missing at random (MAR).
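The combining step is commonly done with Rubin's rules. As a sketch in notation of our own (not taken from the ice documentation), let \hat{Q}_j be the estimate of a parameter from the j-th of m imputed datasets and U_j its estimated variance:

```latex
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
```

The pooled estimate is \bar{Q} and its standard error is \sqrt{T}. The between-imputation variance B is the part of the uncertainty that a single imputed dataset cannot show.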

There are two major approaches to creating multiply imputed datasets. The first is based on the joint distribution of all the variables in the imputation model, including both the variables to be imputed and the variables used only to impute other variables. In this approach, the joint distribution is assumed to be multivariate normal. This is the approach taken by Stata's mi impute mvn command (introduced in Stata 11). The second approach is based on the conditional density of each variable given the other variables. This is the approach taken by the user-written Stata program ice, which stands for Imputation by Chained Equations. ice was written by Patrick Royston, who has published a number of articles in the Stata Journal introducing ice and documenting improvements to it. One disadvantage of imputation by chained equations is that it does not have as firm a theoretical justification as the multivariate normal approach. An additional drawback is that the conditional densities can be incompatible, that is, there may be no joint distribution that has all of them as its conditionals. However, simulation studies have shown that it performs well in practice. There are some advantages to using Stata's ice instead of mi impute mvn or other programs that take the same approach:

Let's discuss in some detail how ice works. Conceptually there are two major steps. The first step is the imputation of a single variable given a set of predictor variables, done by the program uvis, which is part of ice. The second step is the so-called "regression switching", a scheme for cycling through all the variables to be imputed, applying uvis to each in turn. In other words, internally, ice calls uvis many times.
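The cycling scheme can be sketched with built-in commands only. This is a simplified illustration, not the code ice runs: it uses the auto dataset shipped with Stata, deletes some values at random, starts from observed means (an assumption of this sketch), and at each step imputes from a regression prediction plus normal noise without drawing new coefficients. The names tofill, miss_mpg, miss_weight, and xb_tmp are made up for the example.

```stata
* Simplified regression switching with two incomplete continuous variables.
* Illustration only: not the code ice runs.
sysuse auto, clear
set seed 2011
recast double mpg weight            /* allow non-integer imputed values */
replace mpg    = . if runiform() < 0.2
replace weight = . if runiform() < 0.2

local tofill mpg weight
foreach v of local tofill {
    * remember which rows were missing, then fill in crude starting values
    generate byte miss_`v' = missing(`v')
    summarize `v', meanonly
    replace `v' = r(mean) if miss_`v'
}

forvalues cycle = 1/10 {
    foreach v of local tofill {
        local others : list tofill - v
        * fit on rows where `v' was observed, using current values of the others
        quietly regress `v' `others' length if !miss_`v'
        quietly predict double xb_tmp, xb
        * overwrite only the originally missing rows with prediction plus noise
        quietly replace `v' = xb_tmp + e(rmse)*rnormal() if miss_`v'
        drop xb_tmp
    }
}
summarize mpg weight
```

Each pass through the inner loop updates one variable using the most recent values of the other, which is the "switching" that ice performs with uvis in place of the crude regress step used here.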

uvis imputes a single variable from a set of predictor variables using an appropriate regression model. The model can be OLS if the variable being imputed is continuous, or logistic regression if it is binary; other types of regression models can also be used. Given the fitted model, uvis creates the imputed values for the missing observations either by drawing from the posterior predictive distribution or by prediction matching (predictive mean matching). Two distributions are involved here: the distribution of the regression coefficients and the distribution of the residual standard deviation. Random draws are taken from both, so the imputed values reflect the uncertainty in the estimated parameters. By default, the regression coefficients are assumed to follow a multivariate normal distribution; the boot option relaxes this assumption by estimating the coefficients from a bootstrap sample, which makes the imputations more robust to departures from normality.

The Stata code below shows what uvis does internally to create the random draws (i.e. imputed values). Variable w1 is created using uvis and variable w2 is created using the code that uvis uses. They are of course identical. The point of showing the Stata code here is to see what is involved.

* imputation done by uvis
uvis regress write math science, seed(12457) gen(w1)

* the same draws, step by step
set seed 12457
regress write math science
tempname b e V chol bstar
tempvar xb u
matrix `b' = e(b)
matrix `e' = e(b)   /*placeholder with the same dimensions as b*/
matrix `V' = e(V)
matrix `chol' = cholesky(`V')
local colsofb = colsof(`b')
local rmse = e(rmse)
local df = e(df_r)
local chi2 = 2*invgammap(`df'/2,uniform())   /*chi-squared draw with df degrees of freedom*/
local rmsestar = `rmse'*sqrt(`df'/`chi2')   /*draw of the residual standard deviation*/
matrix `chol' = `chol'*sqrt(`df'/`chi2')
forvalues i = 1/`colsofb' {
    matrix `e'[1,`i'] = invnorm(uniform())
}
matrix `bstar' = `b'+`e'*`chol'' /*disturbance here*/
gen `u' = uniform()
matrix score `xb' = `bstar' /*score the data with the new coefficient*/
gen w2 = write
replace w2 = `xb'+`rmsestar'*invnorm(`u') if write==. /*disturbance here again*/
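Read as formulas, the code above makes three draws. The notation is ours: \hat\beta is e(b) written as a column vector, \hat\sigma is e(rmse), \nu is e(df_r), and L is the lower-triangular Cholesky factor of e(V).

```latex
g \sim \chi^2_{\nu}, \qquad
\sigma^{*} = \hat\sigma\,\sqrt{\nu / g}, \qquad
\beta^{*} = \hat\beta + \frac{\sigma^{*}}{\hat\sigma}\,L\,z, \quad z \sim N(0, I), \qquad
y_i^{*} = x_i'\beta^{*} + \sigma^{*}\varepsilon_i, \quad \varepsilon_i \sim N(0, 1)
```

The first two expressions correspond to the locals chi2 and rmsestar, the third to the matrix bstar, and the last to the replace statement that fills in w2 where write is missing.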

To create one imputed dataset for multiple variables x_1, x_2, ..., x_k, with missing observations, ice does the following:

Now let's look at some examples to see how ice works.

Downloading the programs

Please note that you may have to download some of the programs used in the examples on this page. These include ice and mvpatterns. You can do this with the findit command. For example, to download the mvpatterns command, type findit mvpatterns in the Stata command window (see How can I use the findit command to search for programs and get additional help? for more information about using findit). Note that ice has been updated several times since it was originally released, so if you have an older version, you may need to download the current one. To determine which version of ice you have, type which ice. To ensure that you have the current version of ice, you can type:

which ice
ssc describe ice
ssc install ice, replace
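If you would rather install ice only when it is not already present, a minimal sketch is shown below. It assumes an internet connection and that ice is available from SSC, as the commands above imply.

```stata
* install ice from SSC only if Stata cannot already find it
capture which ice
if _rc {
    ssc install ice
}
* report the location and version of the copy now in use
which ice
```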

Usage of ICE with examples

To demonstrate, we have created a data set with missing values. Below is the code for creating the data set used in these examples. Notice that variables fxw and fxr are interaction terms. We also use the mvpatterns command to see the pattern of the missing data.

clear
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
set seed 1368
gen r = uniform()
replace female =. if r >.8
set seed 4689
gen r1 = uniform()
replace ses = . if r1>.8 
replace write = . if math >60
gen fxw=female*write 
gen fxr=female*read

drop r r1
mvpatterns ses female write fxw fxr
Variable     | type     obs   mv   variable label
-------------+-----------------------------------
ses          | float    154   46   
female       | float    155   45   
write        | float    156   44   writing score
fxw          | float    118   82   
fxr          | float    155   45   
-------------------------------------------------
Patterns of missing values
  +------------------------+
  | _pattern   _mv   _freq |
  |------------------------|
  |    +++++     0      90 |
  |    ++..+     2      30 |
  |    .++++     1      28 |
  |    +.+..     3      28 |
  |    ..+..     4      10 |
  |------------------------|
  |    .+..+     3       7 |
  |    +....     4       6 |
  |    .....     5       1 |
  +------------------------+
label data "hsb2 with missing data"
save hsb2_mice, replace
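In Stata 11 and later, the built-in misstable command gives similar summaries without a user-written program. The sketch below assumes that hsb2_mice.dta was saved in the current working directory by the code above.

```stata
use hsb2_mice, clear
* counts of missing values for each variable
misstable summarize ses female write fxw fxr
* patterns of missing values, shown as frequencies rather than percentages
misstable patterns ses female write fxw fxr, frequency
```

Note that misstable patterns marks observed values with 1 and missing values with 0, whereas mvpatterns uses + and a period.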

Example 1: Let's begin with a necessary dry run

When the dryrun option is used, no imputed data set is created and no imputation model is actually run. This is a very useful first step for viewing the default settings and pinpointing the changes needed to build a sensible imputation model. In the run below, we can see the regression model for each variable with missing values. Some of them might be appropriate already and some of them might need changes. For example, the variable ses is a categorical variable, but it is treated as a continuous variable when it is used to predict other variables with missing values. The variables fxw and fxr are interaction terms, but they are imputed in the same manner as the other variables in the model.

Please note that we change the imputation model as we move through the examples. Our goal is not to show how to develop a theoretically plausible imputation model, but rather to illustrate the various options that may be useful in different situations.

use hsb2_mice, clear
ice female ses read math write socst science race fxw fxr, dryrun

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          1 |         28       14.00       59.00
          2 |         30       15.00       74.00
          3 |         35       17.50       91.50
          4 |         16        8.00       99.50
          5 |          1        0.50      100.00
------------+-----------------------------------
      Total |        200      100.00
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race fxw fxr
        ses | mlogit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxw fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw | regress | female ses read math write socst science race fxr
        fxr | regress | female ses read math write socst science race fxw
End of dry run. No imputations were done, no files were created.

Example 2: Specifying the interaction terms using the passive option

A variety of methods have been developed for imputing interactions and other non-linear terms (e.g. squared terms). One method is to create the non-linear terms before the imputation and impute them as though they were just another variable; this is what was done in Example 1. An alternative method is known as passive imputation. In passive imputation, the non-linear term is included in the imputation model as a predictor, but instead of imputing its values directly, they are recalculated at each iteration from the imputed values of the variables from which it is formed. This second method can be implemented in ice using the passive(...) option. In the example below, this means that the imputed values for fxw will be the product of female and write, and the imputed values for fxr will be the product of female and read. Notice that multiple terms are separated by backslashes "\" in the passive option. Note that there is no universally agreed upon method for imputing non-linear terms; see von Hippel (2009) for a discussion of these and other methods.

ice female ses read math write socst science race fxw fxr , ///
    passive(fxw:female*write \fxr: female*read) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race
        ses | mlogit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
End of dry run. No imputations were done, no files were created.

Example 3: Specifying the types of prediction models using the cmd option

The variable ses has three categories and by default, when imputing values of ses, ice will treat it as an unordered categorical variable. Therefore, mlogit is used in the prediction model. Now, we might say that ses is actually ordered and we want to use ordinal logistic regression for prediction instead. This can be done by using the option cmd(...). Multiple changes can be made and are separated by commas.

ice female ses read math write socst science race fxw fxr , ///
    passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race
        ses | ologit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
End of dry run. No imputations were done, no files were created.

Example 4: Specifying the predictors in the prediction models using the eq option

By default, ice will impute each variable using all other variables in the imputation model. It is possible to select a subset of variables for each prediction model using the eq(...) option. In the example below, we use only the variables read and math in the prediction model for ses, and only the variable write in the prediction model for female. In this case, there is no good reason for simplifying the prediction models this way; it is only for the purpose of demonstration. In practice, if there are more than a few variables in the imputation model, the regression models used to impute each variable can become too large to estimate. In these cases, specifying custom equations becomes necessary.

ice female ses read math write socst science race fxw fxr , ///
    passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) ///
    eq(ses: read math, female: write) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write
        ses | ologit  | read math
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]

Example 5: Specifying categorical variables using i., o., and m.

In the previous example, at least one obvious problem still remains: the prediction equation for the variable write is not set up correctly, because ses and race are categorical variables but are treated as continuous when used to predict other variables. The variable race does not have any missing values, so we can add i. before the variable name (i.e. i.race) to instruct ice to replace all instances of race with a series of dummy variables. For the variable ses, we cannot use i. because we also need to tell ice whether it should use ologit or mlogit to predict ses. In the code below, we use o. before the variable name (i.e. o.ses) to indicate that we want to use ologit to predict ses. If ses were nominal, we could use m. to indicate that mlogit should be used. Note that the cmd(...) option specifying that ologit be used to predict ses is no longer necessary, because the o. prefix handles this. Also note that when we use i., o., or m., the ice output begins with an interpretation of the command. This interpreted command includes both the cmd(...) and substitute(...) options, as well as the xi: prefix, which is used to create the dummy variables used in the imputation model.

ice female o.ses read math write socst science i.race fxw fxr , ///
passive(fxw:female*write \fxr: female*read) eq(ses: read math, female: write) dryrun

=> xi: ice female ses i.ses read math write socst science i.race fxw fxr, cmd(ses:ologit)
> substitute(ses:i.ses) eq(ses: read math, female: write) passive(fxw:female*write \
> fxr: female*read) dryrun

i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          1 |         28       14.00       59.00
          2 |         30       15.00       74.00
          3 |         35       17.50       91.50
          4 |         16        8.00       99.50
          5 |          1        0.50      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write
        ses | ologit  | read math
    _Ises_2 |         | [Passively imputed from (ses==2)]
    _Ises_3 |         | [Passively imputed from (ses==3)]
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female _Ises_2 _Ises_3 read math socst science
            |         | _Irace_2 _Irace_3 _Irace_4 fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
   _Irace_2 |         | [No missing data in estimation sample]
   _Irace_3 |         | [No missing data in estimation sample]
   _Irace_4 |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
------------------------------------------------------------------------------

End of dry run. No imputations were done, no files were created.

Example 6: More things to be considered

By now, it seems that at least mechanically we have a correct imputation model. It turns out that there are actually many more issues that we need to consider. For example, as we mentioned briefly in the Introduction, the normality assumption on the posterior distribution may not be valid. Therefore we might want to use the boot(...) (for bootstrap) option. We can also use the match(...) option to use prediction matching rather than the default imputation method. Below we impute female using bootstrapping and write using matching. If we wanted to use the boot(...) or match(...) option for all of the variables to be imputed, we could omit the parentheses and variable list, that is, include just the word boot or match in the options. We might also want to use the seed option so that our results are reproducible. Here is an example that does all of these.

ice female o.ses read math write socst science i.race fxw fxr, ///
    passive(fxw:female*write \fxr: female*read) ///
    eq(ses: read math write socst science, female: write read) ///
    boot(female) match(write) seed(1285964) dryrun

Now we are all set to create our imputed data set. We choose m = 5, which gives five imputed copies of the data. Five datasets are sufficient for this example; however, more imputed datasets may be necessary for actual research applications. One article that discusses the choice of m is Graham, Olchowski, and Gilreath (2007). It seems unlikely that a single correct value for m will be established because, like sample size, the number of imputations that are necessary depends on features of the individual dataset and analysis model. The choice of m remains an issue for further research.
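One commonly cited guide to the choice of m, given here as a standard result attributed to Rubin rather than something taken from this page's sources, is the efficiency of an estimate based on m imputations relative to one based on infinitely many, where \gamma denotes the fraction of missing information (our notation):

```latex
\mathrm{RE} = \Bigl(1 + \frac{\gamma}{m}\Bigr)^{-1},
\qquad \text{e.g. } \gamma = 0.3,\ m = 5:\quad
\mathrm{RE} = \frac{1}{1.06} \approx 0.94
```

Relative efficiency is only one criterion, and a value close to 1 does not by itself show that m is large enough for a given analysis.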

*run the final imputation model
ice female o.ses read math write socst science i.race fxw fxr , ///
	passive(fxw:female*write \fxr: female*read) ///
	eq(ses: read math, female: write) ///
	boot(female) match(write) seed(1285964) m(5) saving(imp, replace)

   
=> xi: ice female ses i.ses read math write socst science i.race fxw fxr, cmd(ses:ologi
> t) substitute(ses:i.ses) eq(ses: read math, female: write) passive(fxw:female*write \
> fxr: female*read) boot(female) match(write) seed(1285964) m(5) saving(d:\temp\ice_upd
> ating, replace)

i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          1 |         28       14.00       59.00
          2 |         30       15.00       74.00
          3 |         35       17.50       91.50
          4 |         16        8.00       99.50
          5 |          1        0.50      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write
        ses | ologit  | read math
    _Ises_2 |         | [Passively imputed from (ses==2)]
    _Ises_3 |         | [Passively imputed from (ses==3)]
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female _Ises_2 _Ises_3 read math socst science
            |         | _Irace_2 _Irace_3 _Irace_4 fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
   _Irace_2 |         | [No missing data in estimation sample]
   _Irace_3 |         | [No missing data in estimation sample]
   _Irace_4 |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
------------------------------------------------------------------------------

Imputing ..........1..........2..........3..........4..........5
file d:\temp\ice_updating.dta saved
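Once the imputations are saved, a natural next step is to inspect the file and fit a model across the imputed datasets. The sketch below assumes the imputations were saved as imp.dta in the working directory, as in the command above, that the file contains the original data as _mj == 0 alongside the imputations, and that Stata 11 or later is available for mi import ice. The analysis model itself is purely illustrative.

```stata
use imp, clear
* _mj is the imputation number; _mj == 0 holds the original data
tabulate _mj
* passively imputed interaction terms should equal the product of their parts
assert fxw == float(female*write) if _mj > 0
assert fxr == float(female*read)  if _mj > 0

* convert to Stata's mi format and combine results across imputations
mi import ice, automatic
mi describe
mi estimate: regress read write female
```

mi estimate fits the regression in each imputed dataset and combines the results, so the reported standard errors include the between-imputation variability.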

Options for ice

There are many options that ice can take. Here is a partial list of options which we will discuss.  These options can be important for building up an imputation model.

See Also

For more information on working with missing data in Stata, see our seminars Multiple Imputation in Stata, Part 1 and Part 2.

References

Carlin, J. B., Galati, J. C., and Royston, P. 2008. A new framework for managing and analyzing multiply imputed data in Stata. Stata Journal 8(1): 49-67.

Graham, J. W., Olchowski, A. E., and Gilreath, T. D. 2007. How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science 8: 206-213.

van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., and Rubin, D. B. Fully conditional specification in multivariate imputation.

von Hippel, P. T. 2009. How to impute interactions, squares, and other transformed variables. Sociological Methodology.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227-241.

Royston, P. 2005. Multiple imputation of missing values: Update. Stata Journal 5(2): 188-201.

Royston, P. 2005. Multiple imputation of missing values: Update of ice. Stata Journal 5(4): 527-536.

Royston, P. 2007. Multiple imputation of missing values: Further update of ice, with an emphasis on interval censoring. Stata Journal 7(4): 445-464.

Royston, P. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9(3): 466-477.

Royston, P., Carlin, J. B., and White, I. R. 2009. Multiple imputation of missing values: New features for mim. Stata Journal 9(2): 252-264.
