UCLA Academic Technology Services

Stata Library
Multiple Imputation Using ICE

Introduction

The idea of multiple imputation is to create several imputed copies of a data set that has missing values. The statistical model of interest is then fitted to each imputed data set, and the separate analyses are combined to yield a single set of results. In general, multiple imputation techniques assume that the missing observations are missing at random (MAR).
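
The combining step is usually done with Rubin's rules. As a brief sketch in our own notation (not taken from the ice or mim documentation): let \hat{Q}_j be the estimate of a parameter from imputed data set j, let U_j be its estimated variance (the squared standard error), and let m be the number of imputations. The pooled estimate \bar{Q} and its total variance T are then

```latex
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m} \hat{Q}_j, \qquad
W = \frac{1}{m}\sum_{j=1}^{m} U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m} \left(\hat{Q}_j - \bar{Q}\right)^2, \qquad
T = W + \left(1 + \frac{1}{m}\right) B
```

Here W is the average within-imputation variance and B is the between-imputation variance. The pooled standard error is the square root of T.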

There are two major approaches to multiple imputation. The first is based on the joint distribution of all the variables considered, whether a variable is to be imputed or is used only for the purpose of imputation. This is the approach that SAS proc mi takes, so in general it works well with multivariate normal data. The other approach is based on the conditional density of each variable given all the other variables. This is the approach taken by Stata's user-written program ice, which stands for Imputation by Chained Equations. Its drawback is that the conditional densities can be incompatible, that is, they need not correspond to any single joint distribution. However, simulation studies have shown that it performs well in practice. ice was written by Patrick Royston, who has published several articles introducing the ice suite of programs and the improvements made to it (see References). There are some obvious advantages to using Stata's ice instead of SAS's proc mi or any other program that takes the same approach:

One major disadvantage, besides being less theoretically sound than proc mi, could be that ice is more computationally intensive. Given the computing power now available, however, this is less of an issue.

Let's discuss in some detail how ice works. Conceptually there are two major steps. The first step is the imputation of a single variable given a set of predictor variables. This is done by the program uvis, which comes as part of ice. The second step is the so-called "regression switching", a scheme for cycling through all the variables to be imputed, using uvis on each in turn. In other words, ice calls uvis internally many times.
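
To make the cycling idea concrete, here is a deliberately simplified sketch that uses only built-in Stata commands. It is not the ice or uvis code. The assumptions are ours: auto.dta stands in for real data, the missingness is artificial, the starting values are observed means, and the imputation step adds residual noise only, without drawing new regression parameters as uvis does. All variable names created here are hypothetical, and runiform() and rnormal() require a reasonably recent Stata.

```stata
* Simplified regression switching with built-in commands (illustration only)
sysuse auto, clear
set seed 2024

* Make about 20% of price and of weight missing, never both at once
generate byte miss_p = runiform() < .2
generate byte miss_w = runiform() < .2 & !miss_p
generate p = price  if !miss_p
generate w = weight if !miss_w

* Starting values: fill each missing value with the observed mean
quietly summarize p
generate double p_imp = cond(miss_p, r(mean), p)
quietly summarize w
generate double w_imp = cond(miss_w, r(mean), w)

* Cycle through the variables, re-imputing each one given the current
* values of the other
forvalues cycle = 1/10 {
    quietly regress p w_imp mpg          // fitted where p is observed
    quietly predict double xb_p, xb
    quietly replace p_imp = xb_p + e(rmse)*rnormal() if miss_p
    drop xb_p

    quietly regress w p_imp mpg          // fitted where w is observed
    quietly predict double xb_w, xb
    quietly replace w_imp = xb_w + e(rmse)*rnormal() if miss_w
    drop xb_w
}

* Observed values are untouched; only the gaps have been filled
summarize p p_imp w w_imp
```

Each pass through the loop plays the role of one cycle of regression switching: every variable with missing values is re-imputed in turn from a model that uses the most recent imputations of the other variables.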

uvis imputes a single variable from a set of predictor variables by fitting an appropriate regression model. The model can be OLS if the imputed variable is continuous, or a logit model if it is binary; other types of models are available as well. Given the fitted model, uvis creates the imputed values for the missing observations either by drawing from the posterior predictive distribution or by predictive matching. Two distributions are involved here: the distribution of the regression coefficients and the distribution of the residual standard deviation. A random disturbance is drawn from each, so that the imputed values reflect the uncertainty in both. By default, the regression coefficients (beta) are assumed to follow a multivariate normal distribution. The boot option relaxes this assumption by using a bootstrap approximation to the distribution of beta, which makes the imputations more robust when that distribution is not multivariate normal.

For example, the Stata code below shows what uvis does internally when imputing by random draws. Variable w1 is created by uvis, and variable w2 is created with the same code that uvis uses, so the two are identical. The point of showing the code is to see what is involved.

* w1: imputed version of write created by uvis
uvis regress write math science, seed(12457) gen(w1)

* w2: the same steps carried out by hand
set seed 12457
regress write math science
tempname b e V chol bstar
tempvar xb u
matrix `b' = e(b)
matrix `e' = e(b)                    /* placeholder with the same dimensions as b */
matrix `V' = e(V)
matrix `chol' = cholesky(`V')
local colsofb = colsof(`b')
local rmse = e(rmse)
local df = e(df_r)
* draw the residual standard deviation
local chi2 = 2*invgammap(`df'/2, uniform())
local rmsestar = `rmse'*sqrt(`df'/`chi2')
matrix `chol' = `chol'*sqrt(`df'/`chi2')
* draw the regression coefficients
forvalues i = 1/`colsofb' {
    matrix `e'[1,`i'] = invnorm(uniform())
}
matrix `bstar' = `b' + `e'*`chol''   /* disturbance here */
gen `u' = uniform()
matrix score `xb' = `bstar'          /* score the data with the new coefficients */
gen w2 = write
replace w2 = `xb' + `rmsestar'*invnorm(`u') if write==.   /* disturbance here again */
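
In symbols, the hand-written code above amounts to the following draws. The notation is ours and is meant only as a reading aid for the code: \hat{\beta} is e(b), \hat{V} is e(V) with lower-triangular Cholesky factor C (so \hat{V} = C C^{\top}), \hat{\sigma} is e(rmse), \nu is e(df_r), z is a row vector of independent standard normal draws, and \varepsilon_i is a standard normal draw for observation i.

```latex
g \sim \chi^2_{\nu}, \qquad
\sigma^{*} = \hat{\sigma}\sqrt{\nu / g}, \qquad
\beta^{*} = \hat{\beta} + \frac{\sigma^{*}}{\hat{\sigma}}\, z\, C^{\top}, \qquad
y_i^{*} = x_i\, \beta^{*\top} + \sigma^{*}\varepsilon_i
\quad \text{for each } i \text{ with } y_i \text{ missing}
```

The first two expressions correspond to the lines that compute chi2 and rmsestar, the third to the line that builds bstar, and the last to the final replace statement.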

To create one imputed data set for multiple variables x_1, x_2, ..., x_k, with missing observations, ice does the following:

Now let's look at some examples to see how ice works.

Downloading the programs

Please note that you may have to download some of the programs used in the examples on this page.  These include ice, mvpatterns, and mim.  You can do this with the findit command.  For example, to download the mvpatterns command, you can type findit mvpatterns (see How can I use the findit command to search for programs and get additional help? for more information about using findit).  Both ice and mim (the prefix used to combine the results from the different imputed data sets) have been updated several times since they were originally released.  If you have an older version of either of these programs, you may need to download the current version.  To determine which version of ice you have, for example, you can type which ice.  To ensure that you have the current version of ice, you can type

which ice
ssc describe ice
ssc install ice, replace

To ensure that you have the most current version of mim, you can type

which mim
ssc describe mim
ssc install mim, replace

Usage of ICE with examples

To demonstrate, we have created a data set with missing values. Below is the code for creating the data set used in these examples. Notice that variables fxw and fxr are interaction terms. We also use the mvpatterns command to see the pattern of the missing data.

clear
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
set seed 1368
gen r = uniform()
replace female =. if r >.8
set seed 4689
gen r1 = uniform()
replace ses = . if r1>.8 
replace write = . if math >60
gen fxw=female*write 
gen fxr=female*read

drop r r1
mvpatterns ses female write fxw fxr
Variable     | type     obs   mv   variable label
-------------+-----------------------------------
ses          | float    154   46   
female       | float    155   45   
write        | float    156   44   writing score
fxw          | float    118   82   
fxr          | float    155   45   
-------------------------------------------------
Patterns of missing values
  +------------------------+
  | _pattern   _mv   _freq |
  |------------------------|
  |    +++++     0      90 |
  |    ++..+     2      30 |
  |    .++++     1      28 |
  |    +.+..     3      28 |
  |    ..+..     4      10 |
  |------------------------|
  |    .+..+     3       7 |
  |    +....     4       6 |
  |    .....     5       1 |
  +------------------------+
label data "hsb2 with missing data"
save hsb2_mice, replace
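
If mvpatterns is not installed, Stata's built-in misstable command gives a similar overview. The sketch below assumes Stata 11 or later (where, to our knowledge, misstable was introduced) and the file hsb2_mice.dta saved above. Its output is laid out differently from the mvpatterns output, so it is not reproduced here.

```stata
* Built-in alternative to mvpatterns
use hsb2_mice, clear
misstable summarize ses female write fxw fxr
misstable patterns ses female write fxw fxr, frequency

* The counts of missing values can also be checked one variable at a time
count if missing(write)
count if missing(fxw)
```

The two counts should match the mv column of the mvpatterns table above (44 for write and 82 for fxw).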

Example 1: Let's begin with a necessary dry run

A dry run is a run where no imputed data set is created and no imputation model is actually run. This is a very useful first step to view the default settings and to pinpoint the changes needed in order to build up a sensible imputation model. For example, in the run below, we can see the regression model for each variable with missing values. Some of them might be appropriate already and some of them might need some changes. For example, variable ses is a categorical variable, but it is being used as a continuous variable in the prediction model for the variable write. Variables fxw and fxr are interaction terms, but they are treated as if they are not.

Please note that we change the imputation model as we move through the examples.  Our goal is not to show how to develop a theoretically plausible imputation model, but rather to illustrate the various options that may be useful in different situations.

use hsb2_mice, clear
ice female ses read math write socst science race fxw fxr, dryrun
   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          1 |         28       14.00       59.00
          2 |         30       15.00       74.00
          3 |         35       17.50       91.50
          4 |         16        8.00       99.50
          5 |          1        0.50      100.00
------------+-----------------------------------
      Total |        200      100.00
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race fxw fxr
        ses | mlogit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxw fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw | regress | female ses read math write socst science race fxr
        fxr | regress | female ses read math write socst science race fxw
End of dry run. No imputations were done, no files were created.

Example 2: Specifying the interaction terms using the passive option

We have seen in the previous example that variables fxw and fxr are not yet being treated as interaction terms. They would be imputed as if they were unrelated to the variables female, write and read. We can correct this by using the passive option. With it, the imputed values for fxw are simply the products of female and write, and likewise the imputed values for fxr are the products of female and read. Notice that multiple terms are separated by backslashes "\" in the passive option.

ice female ses read math write socst science race fxw fxr , ///
    passive(fxw:female*write \fxr: female*read) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race
        ses | mlogit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
End of dry run. No imputations were done, no files were created.

Example 3: Specifying the types of prediction models using the cmd option

The variable ses has three categories, and by default ice treats it as an unordered categorical variable. Therefore, mlogit is used in the prediction model. Now, we might say that ses is actually ordered and that we want to use ordinal logistic regression for prediction instead. This can be done with the cmd option. Several changes can be made at once; they are separated by commas.

ice female ses read math write socst science race fxw fxr , ///
    passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | ses read math write socst science race
        ses | ologit  | female read math write socst science race fxw fxr
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
End of dry run. No imputations were done, no files were created.

Example 4: Specifying the predictors in the prediction models using the eq option

We can also change the set of predictors in the prediction models. In the example below, we use only the variables read and math in the prediction model for ses and the variable write for the prediction model of the variable female. Obviously, there is no good reason for simplifying the prediction models this way; this is only for the purpose of demonstration.

ice female ses read math write socst science race fxw fxr , ///
    passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) ///
    eq(ses: read math, female: write) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write
        ses | ologit  | read math
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female ses read math socst science race fxr
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
       race |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]

Example 5: Specifying categorical variables using the option substitute

In the previous example, at least one obvious problem still remains: the prediction equation for the variable write is not set up correctly, since ses and race are both categorical variables but are being used as continuous variables. Since the variable race does not have missing values, we can simply create dummy variables for it. For the variable ses, it is not as simple, because we want to keep ses itself in the imputation process so that it is imputed as a single categorical variable. We therefore also create dummy variables s1 and s2 for ses and have them passively imputed from ses. Since the dummy variables s1 and s2 are NOT imputed by a model of their own, we need to "substitute" them for ses in the prediction model for the variable write. This is done using the option sub (short for substitute), which replaces the variable ses with s1 and s2 whenever ses appears in a prediction model.

tab ses, gen(s)
tab race, gen(r)
ice female ses read math write socst science r1 r2 r3 fxw fxr s1 s2, ///
    passive(fxw:female*write \fxr: female*read \s1:(ses==1) \s2:(ses==2)) ///
    sub(ses: s1 s2) cmd(ses:ologit) ///
    eq(ses: read math, female: write) dryrun

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         90       45.00       45.00
          2 |         30       15.00       60.00
          3 |         56       28.00       88.00
          4 |          6        3.00       91.00
          5 |          7        3.50       94.50
          6 |         10        5.00       99.50
          7 |          1        0.50      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write
        ses | ologit  | read math
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female read math socst science r1 r2 r3 fxr s1 s2
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
         r1 |         | [No missing data in estimation sample]
         r2 |         | [No missing data in estimation sample]
         r3 |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
         s1 |         | [Passively imputed from (ses==1)]
         s2 |         | [Passively imputed from (ses==2)]

End of dry run. No imputations were done, no files were created.
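
One reason the dummy variables have to be passively imputed is that tabulate with the generate() option leaves each dummy missing wherever the source variable is missing. Here is a small self-contained check of that behavior. It uses auto.dta and the variable rep78 as stand-ins rather than the hsb2 data, and the stub name rep is arbitrary.

```stata
sysuse auto, clear
tabulate rep78, generate(rep)      // creates rep1-rep5
count if missing(rep78)
count if missing(rep1)             // the same observations are missing
list rep78 rep1 rep2 if missing(rep78)
```

In the ice call above, the passive() definitions for s1 and s2 play the role of filling in these gaps once ses itself has been imputed.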

Example 6: More things to be considered

By now, it seems that at least mechanically we have a correct imputation model. It turns out that there are many more issues that we need to consider. For example, as we mentioned briefly in the Introduction, the normality assumption on the distribution of the regression coefficients may not be valid, so we may want to use the boot option. We may also want to use the predictive matching method for some of the variables. Finally, we want to use the seed option so that our results are reproducible. Here is an example that does all of these.

    ice female ses read math write socst science r1 r2 r3 fxw fxr s1 s2, ///
    passive(fxw:female*write \fxr: female*read \s1:(ses==1) \s2:(ses==2)) ///
    substitute(ses: s1 s2) cmd(ses:ologit) ///
    eq(ses: read math write socst science, female: write read) boot(write) match(female) seed(1285964) dryrun

Now we are all set to create our imputed data set. We choose m = 5, that is, five imputed copies of the data. This choice is quite arbitrary. Royston discusses the choice of m in detail in his article, and it remains an issue for further research.

*get the final imputation model
ice female ses read math write socst science r1 r2 r3 fxw fxr s1 s2 using imp, m(5) ///
    passive(fxw:female*write \fxr: female*read \s1:(ses==1) \s2:(ses==2)) ///
    substitute(ses: s1 s2) cmd(ses:ologit) ///
    eq(ses: read math write socst science, female: write read) boot(write) match(female) seed(1285964) replace
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | write read
        ses | ologit  | read math write socst science
       read |         | [No missing data in estimation sample]
       math |         | [No missing data in estimation sample]
      write | regress | female read math socst science r1 r2 r3 fxr s1 s2
      socst |         | [No missing data in estimation sample]
    science |         | [No missing data in estimation sample]
         r1 |         | [No missing data in estimation sample]
         r2 |         | [No missing data in estimation sample]
         r3 |         | [No missing data in estimation sample]
        fxw |         | [Passively imputed from female*write]
        fxr |         | [Passively imputed from female*read]
         s1 |         | [Passively imputed from (ses==1)]
         s2 |         | [Passively imputed from (ses==2)]

Imputing 1..2..3..4..5..file imp.dta saved
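
Before moving on to the analysis, it can be worth checking that the passively imputed variables are consistent with their definitions. The following is only a sketch: it assumes the file imp.dta written by the run above and that the imputation number is stored in _mj (older versions of ice use _j instead, as noted under Other issues). We would expect every count to be zero; a nonzero count would be worth investigating.

```stata
* Sanity checks on the stacked file written by ice
use imp, clear
count if fxw != female*write & _mj >= 1
count if fxr != female*read  & _mj >= 1
count if s1 != (ses == 1)    & _mj >= 1
count if s2 != (ses == 2)    & _mj >= 1

* No missing values should remain in the imputed copies
count if missing(female, ses, write) & _mj >= 1
```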

Options for ice

There are many options that ice can take. Here is a partial list of options which we will discuss.  These options are essential for building up an imputation model.

Analysis on imputed data

Once we have created our imputed data set, we are ready to do our data analysis. Here are some examples.  We use the mim prefix to combine the results from the different imputed data sets into a single output.

use imp, clear
(hsb2 with missing data)
mim: reg write math female

Multiple-imputation estimates (regress)                  Imputations =       5
Linear regression                                        Minimum obs =     200
                                                         Minimum dof =    12.5

------------------------------------------------------------------------------
       write |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
        math |   .660069   .071499    9.23   0.000    .513096  .807042    26.0
      female |   4.76661   1.57098    3.03   0.010    1.35986  8.17335    12.5
       _cons |   15.5316     3.699    4.20   0.000    8.05143  23.0118    39.3
------------------------------------------------------------------------------
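
To see what the mim prefix is doing, the coefficient on math can be pooled by hand with Rubin's rules: average the m estimates, average their squared standard errors, and add the between-imputation variance inflated by (1 + 1/m). The sketch below is an illustration, not the mim code. It assumes imp.dta from above with m = 5 and the imputation number in _mj. The scalar names are arbitrary; they carry the prefix mi_ so that Stata cannot mistake them for abbreviations of variable names such as write.

```stata
* Pool the coefficient on math by hand (illustration only)
use imp, clear
local m = 5
scalar mi_qbar = 0      // average of the m estimates
scalar mi_w    = 0      // average within-imputation variance
forvalues j = 1/`m' {
    quietly regress write math female if _mj == `j'
    scalar mi_q`j' = _b[math]
    scalar mi_qbar = mi_qbar + mi_q`j'/`m'
    scalar mi_w    = mi_w + _se[math]^2/`m'
}
scalar mi_b = 0         // between-imputation variance
forvalues j = 1/`m' {
    scalar mi_b = mi_b + (mi_q`j' - mi_qbar)^2/(`m' - 1)
}
scalar mi_t = mi_w + (1 + 1/`m')*mi_b
display "pooled coefficient = " mi_qbar "   pooled std. err. = " sqrt(mi_t)
```

We would expect the two displayed numbers to agree, up to rounding, with the math row of the mim output above.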
mim: logit female write math

Multiple-imputation estimates (logit)                    Imputations =       5
Logistic regression                                      Minimum obs =     200
                                                         Minimum dof =    12.1

------------------------------------------------------------------------------
      female |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
       write |   .083813   .031366    2.67   0.020     .01556  .152065    12.1
        math |  -.062983   .027661   -2.28   0.032   -.120149 -.005817    23.4
       _cons |  -.938246   1.19821   -0.78   0.441   -3.41531  1.53882    23.3
------------------------------------------------------------------------------

mim: logit female write math, or

Multiple-imputation estimates (logit)                    Imputations =       5
Logistic regression                                      Minimum obs =     200
                                                         Minimum dof =    12.1

------------------------------------------------------------------------------
      female | Odds Rat.  Std. Err.     t    P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
       write |   1.08743   .034109    2.67   0.020    1.01568  1.16424    12.1
        math |   .938959   .025973   -2.28   0.032    .886789    .9942    23.4
------------------------------------------------------------------------------
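
The odds ratios in this table are simply the exponentiated coefficients from the previous logit output, and the confidence limits transform in the same way. A quick check with the display command, using the numbers printed above:

```stata
* Odds ratios are exp() of the logit coefficients reported above
display exp(.083813)     // write: compare with 1.08743
display exp(-.062983)    // math:  compare with .938959

* Confidence limits for write transform the same way
display exp(.01556)      // compare with 1.01568
display exp(.152065)     // compare with 1.16424
```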

mim: mlogit ses write math

Multiple-imputation estimates (mlogit)                   Imputations =       5
Multinomial logistic regression                          Minimum obs =     200
                                                         Minimum dof =    19.0

------------------------------------------------------------------------------
         ses |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
low          |
       write |   .033503   .031968    1.05   0.308   -.033413  .100419    19.0
        math |  -.081796   .034454   -2.37   0.024   -.152199 -.011394    29.6
       _cons |   1.69122   1.51042    1.12   0.272   -1.40459  4.78704    27.6
-------------+----------------------------------------------------------------
high         |
       write |   .027159   .024009    1.13   0.260   -.020349  .074667   127.6
        math |   .031625    .02691    1.18   0.244   -.022111  .085361    65.4
       _cons |  -3.65612   1.42128   -2.57   0.015   -6.56174 -.750497    29.3
------------------------------------------------------------------------------

mim: mlogit ses write math, rrr

Multiple-imputation estimates (mlogit)                   Imputations =       5
Multinomial logistic regression                          Minimum obs =     200
                                                         Minimum dof =    19.0

------------------------------------------------------------------------------
         ses |       RRR  Std. Err.     t    P>|t|    [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
low          |
       write |   1.03407   .033057    1.05   0.308    .967139  1.10563    19.0
        math |    .92146   .031748   -2.37   0.024    .858818  .988671    29.6
-------------+----------------------------------------------------------------
high         |
       write |   1.02753    .02467    1.13   0.260    .979857  1.07753   127.6
        math |   1.03213   .027774    1.18   0.244    .978132  1.08911    65.4
------------------------------------------------------------------------------

Other issues

The ice program creates a single data set that contains all of the imputed copies of the data. It also creates two new variables, _mi and _mj.  The variable _mi indicates the observation number, and _mj indicates the imputation number. If you have used another imputation program, you might instead have multiple data sets, each holding a single imputed copy of the data. In order to use mim to perform analyses, you can combine these data sets into one using mijoin.

Please note that if you are using older versions of ice or mim, you have the variables _i and _j instead of _mi and _mj.  They are the same variables, so you can simply rename _i and _j to be _mi and _mj.
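
The renaming can be done conditionally, so that the same do-file works with files from either version. This is a minimal sketch; it assumes the stacked data set is already in memory.

```stata
* Bring variable names from an older ice/mim file up to date
capture confirm variable _i _j, exact
if _rc == 0 {
    rename _i _mi
    rename _j _mj
}
```

If _i and _j are not both present, the capture prefix suppresses the error and the data are left unchanged.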

ls imp*
   2.8k   6/15/05 11:00  imp1.dta          
   2.8k   6/15/05 11:00  imp2.dta          
   2.8k   6/15/05 11:00  imp3.dta          
   2.8k   6/15/05 11:00  imp4.dta          
   2.8k   6/15/05 11:00  imp5.dta          
   2.8k   6/15/05 10:11  imp6.dta          
   2.8k   6/15/05 10:11  imp7.dta          
   2.8k   6/15/05 10:11  imp8.dta          
   2.8k   6/15/05 10:11  imp9.dta      
  
mijoin imp, m(5) clear
tab _mj
 imputation |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        200       20.00       20.00
          2 |        200       20.00       40.00
          3 |        200       20.00       60.00
          4 |        200       20.00       80.00
          5 |        200       20.00      100.00
------------+-----------------------------------
      Total |      1,000      100.00

Conversely, you might occasionally want to use another program to analyze your imputed data, in which case you may have to break the single data set created by ice into multiple data sets. You can use misplit to accomplish that.

misplit, clear m(5)
Data for 5 imputations have been copied to _mitemp1.dta to _mitemp5.dta

References

van Buuren, S., J. P. L. Brand, C. G. M. Groothuis-Oudshoorn, and D. B. Rubin. Full Conditional Specification in Multivariate Imputation.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227–241.

Royston, P. 2005. Multiple imputation of missing values: update. Stata Journal 5(2): 188–201.

Royston, P. 2005. Multiple imputation of missing values: Update of ice. Stata Journal 5(4): 527–536.


These pages are Copyrighted (c) by UCLA Academic Technology Services

