### Stata Library: Multiple Imputation Using ICE

#### Introduction

The idea of multiple imputation is that, instead of filling in missing values to create a single imputed dataset, several imputed datasets are created, each of which contains different imputed values. The statistical model is then fit to each of the imputed datasets, and the multiple analyses are combined to yield a single set of results. The major advantage of multiple imputation over single imputation is that it produces standard errors that reflect the uncertainty due to the imputation of missing values. In general, multiple imputation techniques require that missing observations be missing at random (MAR).
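
The combining step is usually done with Rubin's rules. As a sketch in notation that is ours rather than the original page's: let \hat{Q}_j be the estimate of a parameter from imputed dataset j, U_j its estimated variance (squared standard error), and m the number of imputed datasets.

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j
  && \text{(combined point estimate)} \\
\bar{U} &= \frac{1}{m}\sum_{j=1}^{m}U_j
  && \text{(within-imputation variance)} \\
B &= \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)} \\
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance; the standard error is } \sqrt{T}\text{)}
\end{aligned}
```

The term B is what a single imputation cannot supply: it measures how much the estimate changes from one set of imputed values to the next.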

There are two major approaches to creating multiply imputed datasets. The first is based on the joint distribution of all the variables in the imputation model, including variables to be imputed and variables used only for the purpose of imputing other variables. In this approach, the joint distribution of all variables in the imputation model is assumed to be multivariate normal. This is the approach taken by Stata's mi impute mvn command (introduced in Stata 11). The other approach is based on the conditional density of each variable given the other variables. This is the approach taken by the user-written Stata program ice, which stands for Imputation by Chained Equations. ice was written by Patrick Royston, who has published a number of articles in the Stata Journal introducing ice and documenting improvements made to it. One disadvantage of the imputation by chained equations approach is that it is not as theoretically sound as the multivariate normal approach. An additional drawback is that the conditional densities can be incompatible, that is, there may be no joint distribution that has all of them as its conditionals. However, simulation studies have shown that in practice it performs well. There are some advantages to using ice instead of mi impute mvn or other programs that take the same approach:

• no multivariate joint distribution assumption;
• allows different kinds of weights, as long as the regression models allow them;
• easy to understand;
• may have lower sample size requirements than the multivariate normal approach.

Let's discuss in some detail how ice works. Conceptually there are two major steps. The first step is the imputation of a single variable given a set of predictor variables, done by the program uvis, which is part of ice. The second step is the so-called "regression switching", a scheme for cycling through all the variables to be imputed using uvis. In other words, internally, ice calls uvis many times.

uvis imputes a single variable from a set of predictor variables using an appropriate regression model. The regression model can be OLS if the variable being imputed is continuous, or logistic regression if it is binary. Other types of regression models can also be used. Given the regression model, uvis can create the imputed values for the missing observations either by drawing from the posterior predictive distribution or by prediction matching. Two distributions are involved in drawing from the posterior predictive distribution: the distribution of the regression coefficients and the distribution of the residual standard deviation. Random disturbances are drawn from both. With the boot option, we can relax the assumption that the distribution of the regression coefficients (beta) is multivariate normal, which makes the imputation more robust.
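
As a minimal sketch of the three variants just described, the calls below run uvis with its default method, with the match option, and with the boot option. It assumes that ice (and therefore uvis) is installed and that the hsb2_mice dataset built in the examples further down is available. The new variable names w_draw, w_match and f_boot are ours, chosen only for illustration.

```stata
* Assumes uvis is installed (it ships with ice) and hsb2_mice.dta exists
use hsb2_mice, clear

* Default: draw imputed values from the posterior predictive distribution
uvis regress write math science, gen(w_draw) seed(12457)

* Prediction matching: imputed values are borrowed from observed cases
uvis regress write math science, gen(w_match) match seed(12457)

* Bootstrap: estimate the coefficients in a bootstrap sample
uvis logit female read math, gen(f_boot) boot seed(12457)

* Compare the observed variable with the two completed versions
summarize write w_draw w_match
```

Each gen() variable holds the observed values where they exist and imputed values elsewhere, so the original variable is left untouched.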

The Stata code below shows what uvis does internally to create the random draws (i.e. imputed values). Variable w1 is created using uvis and variable w2 is created using the code that uvis uses. They are of course identical. The point of showing the Stata code here is to see what is involved.

uvis regress write math science, seed(12457) gen(w1)
set seed 12457
regress write math science
tempname b e V chol bstar
tempvar xb u
matrix `b'=e(b)
matrix `e'=e(b) /*same shape as b; overwritten with normal draws below*/
matrix `V'=e(V)
matrix `chol'=cholesky(`V')
local colsofb=colsof(`b')
local rmse=e(rmse)
local df=e(df_r)
local chi2=2*invgammap(`df'/2,uniform()) /*chi-squared draw with df degrees of freedom*/
local rmsestar=`rmse'*sqrt(`df'/`chi2') /*draw of the residual standard deviation*/
matrix `chol'=`chol'*sqrt(`df'/`chi2')
forvalues i=1/`colsofb' {
  matrix `e'[1,`i']=invnorm(uniform())
}
matrix `bstar'=`b'+`e'*`chol'' /*disturbance here*/
gen `u'=uniform()
matrix score `xb'=`bstar' /*score the data with the new coefficient*/
gen w2 = write
replace w2=`xb'+`rmsestar'*invnorm(`u') if write==. /*disturbance here again*/
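
Written as equations, the code above amounts to the following draws. The symbols are our own notation, not the original page's: \hat{\beta} is the row vector e(b), \hat{V} is e(V), \hat{\sigma} is e(rmse), \nu is e(df_r), k is the number of coefficients, and x_i is the row of predictor values (including the constant) for observation i.

```latex
\begin{aligned}
g &\sim \chi^{2}_{\nu} \\
\sigma^{*} &= \hat{\sigma}\,\sqrt{\nu / g} \\
\beta^{*} &= \hat{\beta} + \frac{\sigma^{*}}{\hat{\sigma}}\; z\,L^{\top},
  \qquad z \sim N(0, I_k),\quad L L^{\top} = \hat{V} \\
y_i^{*} &= x_i\,\beta^{*\top} + \sigma^{*}\,\varepsilon_i,
  \qquad \varepsilon_i \sim N(0,1)
\end{aligned}
```

The first two lines are the disturbance to the residual standard deviation, the third is the disturbance to the coefficients, and the last adds residual noise to the prediction for each observation with a missing value.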

To create one imputed dataset for multiple variables x_1, x_2, ..., x_k, with missing observations, ice does the following:

• Ignore observations for which every member of x_1, x_2, ..., x_k has a missing value. This step will eliminate the observations that are impossible to impute;
• For each variable with any missing data in x_1, x_2, ..., x_k, randomly order that variable and replicate its observed values across the missing cases. This step initializes the iterative procedure by filling in missing data at random;
• For each of x_1, x_2, ..., x_k, in turn, impute missing values by applying uvis with the remaining variables as covariates;
• Repeat the step above the number of times specified by the cycles(#) option. The default is 10.
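
The steps above can be sketched by hand for two variables. This is only an illustration of the idea, not the code that ice runs. It assumes uvis is installed and that the hsb2_mice dataset built in the examples below is available. The names miss_*, u_* and imp_* are ours, and the choice of predictors is arbitrary. The first step is skipped because read and math are complete, so no observation is missing on every variable.

```stata
* Sketch only: hand-rolled regression switching for write and female
use hsb2_mice, clear
set seed 2468                     // arbitrary seed, for reproducibility
local cycles 10                   // same as the ice default

* Initialize: fill each incomplete variable with randomly chosen observed values
foreach v in write female {
    gen byte miss_`v' = missing(`v')
    gen double u_`v' = uniform()
    sort miss_`v' u_`v'           // observed cases first, in random order
    quietly count if !miss_`v'
    local nobs = r(N)
    quietly replace `v' = `v'[1 + mod(_n - 1, `nobs')] if miss_`v'
    drop u_`v'
}

* Cycle: re-impute each variable in turn, given the current values of the other
forvalues c = 1/`cycles' {
    * write is continuous, so use regress
    quietly replace write = . if miss_write
    quietly uvis regress write female read math, gen(imp_write)
    quietly replace write = imp_write if miss_write
    drop imp_write

    * female is binary, so use logit
    quietly replace female = . if miss_female
    quietly uvis logit female write read math, gen(imp_female)
    quietly replace female = imp_female if miss_female
    drop imp_female
}
```

After the loop, write and female hold one completed version of the data. ice repeats the whole procedure once per imputation to produce m datasets.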

Now let's look at some examples to see how ice works.

which ice
ssc describe ice
ssc install ice, replace

#### Usage of ICE with examples

To demonstrate, we have created a data set with missing values. Below is the code for creating the data set used in these examples. Notice that variables fxw and fxr are interaction terms. We also use the mvpatterns command to see the pattern of the missing data.

clear
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
set seed 1368
gen r = uniform()
replace female =. if r >.8
set seed 4689
gen r1 = uniform()
replace ses = . if r1>.8
replace write = . if math >60
gen fxw=female*write
gen fxr=female*read
drop r r1
mvpatterns ses female write fxw fxr
Variable     | type     obs   mv   variable label
-------------+-----------------------------------
ses          | float    154   46
female       | float    155   45
write        | float    156   44   writing score
fxw          | float    118   82
fxr          | float    155   45
-------------------------------------------------
Patterns of missing values
+------------------------+
| _pattern   _mv   _freq |
|------------------------|
|    +++++     0      90 |
|    ++..+     2      30 |
|    .++++     1      28 |
|    +.+..     3      28 |
|    ..+..     4      10 |
|------------------------|
|    .+..+     3       7 |
|    +....     4       6 |
|    .....     5       1 |
+------------------------+
label data "hsb2 with missing data"
save hsb2_mice, replace
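
mvpatterns is a user-written command. As an alternative sketch, assuming Stata 11 or later, the official misstable command gives a similar summary. The last line checks the one deterministic rule used to create the missing values.

```stata
use hsb2_mice, clear

* Counts of missing values per variable
misstable summarize ses female write fxw

* Patterns of missing values, reported as frequencies
misstable patterns ses female write fxw, frequency

* write was set to missing for every case with math above 60
assert missing(write) if math > 60
```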

#### Example 1: Let's begin with a necessary dry run

When the dryrun option is used, no imputed data set is created and no imputation model is actually run. This is a very useful first step to view the default settings and to pinpoint the changes needed in order to build up a sensible imputation model. In the run below, we can see the regression model for each variable with missing values. Some of them might be appropriate already and some of them might need some changes. For example, the variable ses is a categorical variable, but it is being treated as a continuous variable when it is used to predict other variables with missing values. Variables fxw and fxr are interaction terms, but they are imputed in the same manner as other variables in the model.

Please note that we change the imputation model as we move through the examples. Our goal is not to show how to develop a theoretically plausible imputation model, but rather to illustrate the various options that may be useful in different situations.

use hsb2_mice, clear
ice female ses read math write socst science race fxw fxr, dryrun

#missing |
values |      Freq.     Percent        Cum.
------------+-----------------------------------
0 |         90       45.00       45.00
1 |         28       14.00       59.00
2 |         30       15.00       74.00
3 |         35       17.50       91.50
4 |         16        8.00       99.50
5 |          1        0.50      100.00
------------+-----------------------------------
Total |        200      100.00
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit   | ses read math write socst science race fxw fxr
ses | mlogit  | female read math write socst science race fxw fxr
read |         | [No missing data in estimation sample]
math |         | [No missing data in estimation sample]
write | regress | female ses read math socst science race fxw fxr
socst |         | [No missing data in estimation sample]
science |         | [No missing data in estimation sample]
race |         | [No missing data in estimation sample]
fxw | regress | female ses read math write socst science race fxr
fxr | regress | female ses read math write socst science race fxw
End of dry run. No imputations were done, no files were created.

#### Example 2: Specifying the interaction terms using the passive option

A variety of methods have been developed for imputing interactions and other non-linear terms (e.g. squared terms). One method is to create the non-linear terms before the imputation and impute them as though they were just another variable to be imputed; this is what was done in Example 1. An alternative method is known as passive imputation. In passive imputation, the non-linear term is included in the imputation model as a predictor, but instead of imputing values of the non-linear term directly, its values are recalculated at each iteration based on the imputed values of the variables from which it is formed. This second method can be implemented in ice using the passive(...) option. This means the imputed values for fxw will be the product of female and write. The same is done for fxr, which is the product of female and read. Notice that multiple terms are separated by backslashes "\" in the passive option. Note that there is no universally agreed upon method for imputing non-linear terms; see von Hippel (2009) for a discussion of these and other methods.

ice female ses read math write socst science race fxw fxr , ///
passive(fxw:female*write \fxr: female*read) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit   | ses read math write socst science race
ses | mlogit  | female read math write socst science race fxw fxr
read |         | [No missing data in estimation sample]
math |         | [No missing data in estimation sample]
write | regress | female ses read math socst science race fxr
socst |         | [No missing data in estimation sample]
science |         | [No missing data in estimation sample]
race |         | [No missing data in estimation sample]
fxw |         | [Passively imputed from female*write]
fxr |         | [Passively imputed from female*read]
End of dry run. No imputations were done, no files were created.

#### Example 3: Specifying the types of prediction models using the cmd option

The variable ses has three categories and by default, when imputing values of ses, ice will treat it as an unordered categorical variable. Therefore, mlogit is used in the prediction model. Now, we might say that ses is actually ordered and we want to use ordinal logistic regression for prediction instead. This can be done by using the option cmd(...). Multiple changes can be made and are separated by commas.

ice female ses read math write socst science race fxw fxr , ///
passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit   | ses read math write socst science race
ses | ologit  | female read math write socst science race fxw fxr
read |         | [No missing data in estimation sample]
math |         | [No missing data in estimation sample]
write | regress | female ses read math socst science race fxr
socst |         | [No missing data in estimation sample]
science |         | [No missing data in estimation sample]
race |         | [No missing data in estimation sample]
fxw |         | [Passively imputed from female*write]
fxr |         | [Passively imputed from female*read]
End of dry run. No imputations were done, no files were created.
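
As a small illustration of the comma-separated form, the sketch below sets the command for two variables at once. Naming logit for female only restates the default shown in the dry run above, and is included purely to show the syntax.

```stata
use hsb2_mice, clear
ice female ses read math write socst science race fxw fxr , ///
    passive(fxw:female*write \fxr: female*read) ///
    cmd(ses:ologit, female:logit) dryrun
```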

#### Example 4: Specifying the predictors in the prediction models using the eq option

By default, ice will impute each variable using all other variables in the imputation model. It is possible to select a subset of variables for each prediction model using the eq(...) option. In the example below, we use only the variables read and math in the prediction model for ses, and only the variable write in the prediction model for female. In this case, there is no good reason for simplifying the prediction models this way; this is only for the purpose of demonstration. In practice, if there are more than a few variables in the imputation model, the regression models used to impute each variable can become too large to estimate; in these cases, specifying custom equations becomes necessary.

ice female ses read math write socst science race fxw fxr , ///
passive(fxw:female*write \fxr: female*read) cmd(ses:ologit) ///
eq(ses: read math, female: write) dryrun
   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit   | write
ses | ologit  | read math
read |         | [No missing data in estimation sample]
math |         | [No missing data in estimation sample]
write | regress | female ses read math socst science race fxr
socst |         | [No missing data in estimation sample]
science |         | [No missing data in estimation sample]
race |         | [No missing data in estimation sample]
fxw |         | [Passively imputed from female*write]
fxr |         | [Passively imputed from female*read]

#### Example 5: Specifying categorical variables using i., o., and m.

In the previous example, at least one obvious problem still remains. The prediction equation for the variable write is not really set up correctly, since the variables ses and race are categorical but, when used to predict other variables, they are treated as though they were continuous. The variable race does not have any missing values, so we can add i. before the variable name (i.e. i.race) to instruct ice to replace all instances of race with a series of dummy variables. For the variable ses, we cannot use i. because we also need to tell ice whether it should use ologit or mlogit to predict ses. In the code below, we use o. before the variable name (i.e. o.ses) to indicate we want to use ologit to predict ses. If ses were nominal, we could use m. to indicate that mlogit should be used. Note that the cmd(...) option specifying that ologit be used to predict ses is no longer necessary; the o. prefix handles this. Also note that when we use i., o., or m., the ice output includes an interpretation of the command at the top of the output. This interpreted command includes both the cmd(...) and substitute(...) options, as well as the xi: prefix, which is used to create the dummy variables used in the imputation model.

ice female o.ses read math write socst science i.race fxw fxr , ///
eq(ses: read math, female: write) ///
passive(fxw:female*write \fxr: female*read) dryrun

=> xi: ice female ses i.ses read math write socst science i.race fxw fxr, cmd(ses:ologit)
> substitute(ses:i.ses) eq(ses: read math, female: write) passive(fxw:female*write \
> fxr: female*read) dryrun

i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

#missing |
values |      Freq.     Percent        Cum.
------------+-----------------------------------
0 |         90       45.00       45.00
1 |         28       14.00       59.00
2 |         30       15.00       74.00
3 |         35       17.50       91.50
4 |         16        8.00       99.50
5 |          1        0.50      100.00
------------+-----------------------------------
Total |        200      100.00

Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit   | write
ses | ologit  | read math
_Ises_2 |         | [Passively imputed from (ses==2)]
_Ises_3 |         | [Passively imputed from (ses==3)]
read |         | [No missing data in estimation sample]
math |         | [No missing data in estimation sample]
write | regress | female _Ises_2 _Ises_3 read math socst science
|         | _Irace_2 _Irace_3 _Irace_4 fxr
socst |         | [No missing data in estimation sample]
science |         | [No missing data in estimation sample]
_Irace_2 |         | [No missing data in estimation sample]
_Irace_3 |         | [No missing data in estimation sample]
_Irace_4 |         | [No missing data in estimation sample]
fxw |         | [Passively imputed from female*write]
fxr |         | [Passively imputed from female*read]
------------------------------------------------------------------------------

End of dry run. No imputations were done, no files were created.

#### Example 6: More things to be considered

By now, it seems that at least mechanically we have a correct imputation model. It turns out that there are actually many issues that we need to consider. For example, as we mentioned briefly in the Introduction, the normality assumption on the posterior distribution may not be valid. Therefore we might want to use the boot(...) (for bootstrap) option. We can also use the match(...) option to use prediction matching rather than the default imputation method. Below we impute female using bootstrapping and write using matching. If we wanted to use the boot(...) or match(...) option for all of the variables to be imputed, we could omit the parentheses and variable list, that is, include just the word boot or match in the options. We might also want to use the seed option so that our results are reproducible. Here is an example that does all of these.

ice female o.ses read math write socst science i.race fxw fxr , ///
eq(ses: read math, female: write) ///
passive(fxw:female*write \fxr: female*read) ///
boot(female) match(write) seed(1285964) dryrun

Now we are all set to create our imputed data set. We choose m = 5 for five copies of imputed data sets. Five datasets are sufficient for this example; however, more imputed datasets may be necessary for actual research applications. One article that discusses the choice of m is Graham, Olchowski, and Gilreath (2007). It seems unlikely that a single correct value for m will be established because, like sample size, the number of imputations that are necessary depends on features of the individual dataset and analysis model. Nonetheless, the choice of m remains an issue for more research.

*run the final imputation model
ice female o.ses read math write socst science i.race fxw fxr , ///
eq(ses: read math, female: write) ///
passive(fxw:female*write \fxr: female*read) ///
boot(female) match(write) seed(1285964) m(5) saving(imp, replace)

=> xi: ice female ses i.ses read math write socst science i.race fxw fxr, cmd(ses:ologi
> t) substitute(ses:i.ses) eq(ses: read math, female: write) passive(fxw:female*write \
> fxr: female*read) boot(female) match(write) seed(1285964) m(5) saving(d:\temp\ice_upd
> ating, replace)

i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

#missing |
values |      Freq.     Percent        Cum.
------------+-----------------------------------
0 |         90       45.00       45.00
1 |         28       14.00       59.00
2 |         30       15.00       74.00
3 |         35       17.50       91.50
4 |         16        8.00       99.50
5 |          1        0.50      100.00
------------+-----------------------------------
Total |        200      100.00

Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit   | write
ses | ologit  | read math
_Ises_2 |         | [Passively imputed from (ses==2)]
_Ises_3 |         | [Passively imputed from (ses==3)]
read |         | [No missing data in estimation sample]
math |         | [No missing data in estimation sample]
write | regress | female _Ises_2 _Ises_3 read math socst science
|         | _Irace_2 _Irace_3 _Irace_4 fxr
socst |         | [No missing data in estimation sample]
science |         | [No missing data in estimation sample]
_Irace_2 |         | [No missing data in estimation sample]
_Irace_3 |         | [No missing data in estimation sample]
_Irace_4 |         | [No missing data in estimation sample]
fxw |         | [Passively imputed from female*write]
fxr |         | [Passively imputed from female*read]
------------------------------------------------------------------------------

Imputing ..........1..........2..........3..........4..........5
file d:\temp\ice_updating.dta saved
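
Once the imputed file exists, it is worth inspecting before analysis. The sketch below assumes the file was saved as imp, as in the saving(imp, replace) option above, and that it contains the variables _mj (imputation number, with 0 for the original data) and _mi (observation identifier) that ice adds. The final two lines assume Stata 11 or later. The analysis model there is hypothetical, chosen only to show the mechanics.

```stata
use imp, clear

* One block of 200 observations per value of _mj
tabulate _mj

* Passively imputed terms should equal the product of their components;
* both counts should be 0 if passive() was used in the imputation
count if fxw != float(female*write) & _mj > 0
count if fxr != float(female*read)  & _mj > 0

* Count any remaining missing values in the imputed copies
count if missing(female, ses, write) & _mj > 0

* Convert to the official mi format and fit a (hypothetical) analysis model;
* mi estimate combines the five sets of results into one
mi import ice, automatic clear
mi estimate: regress read write female
```

Data saved by ice can also be analyzed with the user-written mim prefix (Carlin, Galati, and Royston 2008), which serves the same purpose.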

#### Options for ice

There are many options that ice can take. Here is a partial list, covering the options that can be important for building up an imputation model.

• m
- This is used to specify the number of imputations to carry out. Further research is needed regarding the choice of m. Royston has a discussion in his 2004 article (another source is Graham, Olchowski, and Gilreath 2007). In general, it is not as simple as just choosing m = 5. If m(...) is not specified, ice will generate a single imputation.
• cmd
- ice will decide the variable type for each imputed variable automatically based on the number of values the variable takes on. For example, if the variable x takes only 0/1/2 values, ice will assume x is a categorical variable and will use multinomial logistic regression for x in the imputation process. In other words, mlogit will be the default model for x. But we can force ice to use ologit instead for x if we think that ordinal logistic regression is more appropriate. For example, cmd(x1 x2:logit, x3:regress) specifies that logistic regression will be used for variable x1 and x2, while OLS regression will be used for variable x3.
• substitute
- This makes substitutions of the variables in the prediction models. For example, a categorical variable might be both imputed and used as a predictor in the prediction models for other imputed variables. This option allows us to treat the variable as categorical when it is used to predict other variables. This option has largely been replaced by the o. and m. prefixes.
• o. and m.
- These "options" are prefixes added to variable names (e.g. m.varname) to specify that either ologit (o.) or mlogit (m.) should be used to impute the variable. They also create dummy variables and include them in the imputation model when the variable is used as a predictor. Using either o. or m. is equivalent to using substitute(...) and cmd(...) together for those variables (i.e. it is a shortcut).
• eq
- Each variable being imputed has a regression model associated with it. The eq option can be used to specify the predictors on the right-hand side of the regression model. By default, ice takes every variable in the main list as a predictor variable. This can work well in small models, but often becomes problematic when one is imputing many variables. Specifying eq(ses: read math, female: write) means that the predictor variables for the variable ses will be read and math, and that the only predictor variable for the variable female is write.
• passive
- This means that a variable will not be independently imputed. For example, if the variable x12 is the interaction term of variable x1 and x2, its imputed values will be determined by the imputed values of x1 and x2, rather than directly imputing values of x12. This option can be used with interaction terms or transformed variables. With this option, we can put all of the variables in the imputation model. For example, passive(fxw:female*write \fxr: female*read) means that fxw is the interaction term of the variable female and write, and fxr is the interaction term of the variable female and read. Values of both fxw and fxr will be determined by the values of variable female, write and read. While ice has the capacity to passively impute variables, whether this is a desirable way of handling interactions and transformations is still a subject of research.
• boot
- This option specifies that the parameters should be estimated by regression within a bootstrap sample. It has the advantage of robustness, since the normality assumption is no longer required.
• match
- This option switches to the prediction matching imputation method, overriding the default method, which is a random draw from the posterior distribution. match can only be used with continuous variables.
• seed
- This sets the random number seed for the purpose of replicating the imputed data sets.
• cycles
- This determines the number of cycles of regression switching to be carried out between imputations.
• on
- This changes the entire imputation model. The variable list given in the on option fixes a single prediction model for all the variables to be imputed. For example, on(write math) means that whatever variable is being imputed, the predictors in the prediction model are always write and math.
• genmiss
- This option produces a series of indicator variables, one for each variable in the imputation model that has imputed values. Each indicator is equal to 1 when the associated variable was imputed, and 0 otherwise. These variables are named using a prefix specified by the user in genmiss(...) followed by the name of the associated variable. For example, if we had specified genmiss(m_), ice would have generated a variable m_write, equal to 1 when write was imputed, and 0 otherwise.
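
The cycles and genmiss options were not used in the examples above. As a sketch, assuming the hsb2_mice dataset from the examples and a hypothetical output file name imp_flags, they could be combined as follows. The interaction terms are left out to keep the command short.

```stata
use hsb2_mice, clear

* 20 cycles of regression switching instead of the default 10,
* plus indicator variables prefixed with m_ for imputed values
ice female o.ses read math write socst science i.race, ///
    cycles(20) genmiss(m_) seed(1285964) m(5) saving(imp_flags, replace)

* In the first imputed copy, m_write should flag the cases
* where write was originally missing
use imp_flags, clear
tabulate m_write if _mj == 1
```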

For more information on working with missing data in Stata, see our seminars Multiple Imputation in Stata, Part 1 and Part 2.

#### References

Carlin, J. B., Galati, J. C., Royston, P. (2008). A new framework for managing and analyzing multiply imputed data in Stata. Stata Journal 8(1):49-67

Graham, J. W., Olchowski, A. E. and Gilreath, T. D. (2007) How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory, Prev Sci 8:206

van Buuren, S, Brand, J.P.L., Groothuis-Oudshoorn, C.G.M. and Rubin, D.B. Full Conditional Specification in Multivariate Imputation

von Hippel, P.T. (2009). How to impute interactions, squares, and other transformed variables. Sociological Methodology.

Royston, P. 2004. Multiple imputation of missing values. Stata Journal 4(3): 227-241.

Royston, P. 2005. Multiple imputation of missing values: update. Stata Journal 5(2): 188-201.

Royston, P. 2005. Multiple imputation of missing values: Update of ice. Stata Journal 5(4): 527-536.

Royston, P. 2007. Multiple imputation of missing values: further update of ice, with an emphasis on interval censoring. Stata Journal 7(4): 445-464.

Royston, P. 2009. Multiple imputation of missing values: Further update of ice, with an emphasis on categorical variables. Stata Journal 9(3): 466-477.

Royston, P., Carlin, J. B., White, I.R. 2009. Multiple imputation of missing values: New features for mim. Stata Journal 9(2): 252-264.
