
Multiple Imputation in Stata, Part 2

Outline of this seminar:

Part 1:

- Introduction
- Multiple imputation
- Missing data patterns and mechanisms
- Building an imputation model
- An example of multivariate normal imputation using **mi impute mvn**

Part 2:

- An example of multiple imputation by chained equations using **ice**
- Managing MI datasets
- Analyzing MI data

Although both have the same goal, and the end product can be very similar, imputation by chained equations (ICE) and the multivariate normal approach to imputation are somewhat different. In the multivariate normal model, the imputations come from a single multivariate distribution; in simple terms, information from all variables is used to impute all other variables based on a single model. In the ICE approach, the imputed values are generated from a series of univariate models, in each of which a single variable is imputed based on a group of variables. As we discussed in the first seminar, each approach has advantages and disadvantages. One advantage of the ICE approach is that it does not assume a multivariate normal distribution, so it can easily be used to impute a variety of different types of variables (e.g., categorical, counts, etc.). This is less of an advantage when imputing predictor variables, but can be useful when imputing outcome variables, or when imputing an entire dataset where one may not know which variables will be used as predictors or outcomes. A second advantage of the ICE approach is that because it estimates a series of univariate models, it can sometimes accommodate larger imputation models than the multivariate normal approach. One disadvantage of the ICE approach is that, in comparison to the multivariate normal model, it lacks strong theoretical underpinnings. An additional disadvantage is that specifying ICE models can be tedious, especially when the imputation model is large.

For this example, we have decided to impute missing values for all variables
in the dataset as though we did not have a specific analysis model in mind.
Because the example dataset contains so few variables, the resulting imputation
model is not substantially different from the imputation model that would result
from a model for a specific analysis that included a few so-called auxiliary
variables. Below we show the syntax to create multiply imputed datasets using **ice**.

The first line of syntax below opens the dataset we wish to impute. The second command (on the second and third lines) runs the imputation model. The command name (**ice**) is followed by the list of variables to be included in the imputation model. The variable list can include variables both with and without missing values; **ice** will automatically determine which variables have missing values and impute them. The variable **prog** is listed with the **m.** prefix (i.e., **m.prog**), which indicates to **ice** that the variable is categorical and that we wish to use multinomial logistic regression (**mlogit**) to model it. If we wished to treat the variable as ordinal, we could use the **o.** prefix. The variable **race** appears with the **i.** prefix (i.e., **i.race**), which indicates to **ice** that the model should include dummy variables to represent the categories of **race**.
The **i.** prefix is used only with variables that are not missing any values. The list of variables is followed by a comma (","), which separates the variable list from the options. The comma is followed by three slashes ("///"), a comment code that tells Stata to ignore anything after the slashes and continue reading on the next line as though there were no line break. This isn't necessary, but it does make the syntax easier to read and organize. Note that there needs to be a space between the comma and the slashes. The options start on the next line. The **gen(...)** option generates imputation indicators for the imputed variables; for example, because we have specified **m_** in the **gen(...)** option, for the variable **female**, **ice** creates the indicator **m_female**, which is equal to 1 when **female** was imputed
and 0 otherwise. The next option, **saving(...)**, gives **ice** the name of the new Stata data file for the imputed dataset; our new dataset will be called **ice_imputation.dta**. Next we have used the **m(...)** option to specify the number of imputations, in this case 5. In many applications we might want to create more imputations, but we may wish to run a model with a small number of imputations and examine those before creating a large number of imputations. Finally, we use the **seed(...)** option to set the random number seed so that we can recreate the results of the imputation model; setting the seed is necessary to obtain the same results because MI contains a random component.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear

ice female read write math science socst m.prog i.race ses schtyp, ///
    gen(m_) saving(ice_imputation) m(5) seed(4324)

The output from **ice** is rather long, so we have split it into two blocks to discuss it. Below we show the first part of the output from the **ice** command above. The output begins by showing what the **ice** command would look like if we had used the alternative syntax, which involves the **xi:** prefix as well as the **cmd(...)** and **substitute(...)** options. We can use this output to confirm that we have specified the model we intended to specify. The **xi:** prefix, among other things, allows the user to create dummy variables to represent categorical variables in a model (see **help xi** for more information on the **xi:** prefix). The **cmd(...)** option specifies the type of model that should be used to impute a given variable (e.g., **mlogit**, **ologit**). Because we specified **m.prog**, **mlogit** will be used to impute **prog**. The **substitute(...)** option specifies which dummy variables should be used to represent variables that are being imputed when they are used to predict other variables. In this case, we are using the dummy variables created by the **xi:** prefix (represented as **i.prog** in the output) in place of **prog** when predicting other variables. The next portion of output shows the names of the variables used to represent **prog** and **race** (i.e., _Iprog_1-3 and _Irace_1-4, respectively). Below that is a table that shows the number of cases missing various numbers of values; we see that 117 cases are missing no values, 73 are missing only one value, etc.

=> xi: ice female read write math science socst prog i.prog i.race ses schtyp, cmd(prog:mlo
> git) substitute(prog:i.prog) gen(m_) saving(ice_imputation, replace) m(5) seed(4324)

i.prog            _Iprog_1-3          (naturally coded; _Iprog_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        117       58.50       58.50
          1 |         73       36.50       95.00
          2 |         10        5.00      100.00
------------+-----------------------------------
      Total |        200      100.00

The second part of the output from **ice** is shown below. It is a table that gives the name of each variable in the variable list, the command used to impute it, and the variables used to impute its values. For example, this table tells us that the variable **female** will be imputed using the **logit** command, which is appropriate since **female** is a binary variable. By default, **ice** will use **logit** to impute variables that take on two values, **mlogit** to impute variables that take on 3-5 values, and **regress** to impute variables that take on more than 5 values. The default can be overridden for some or all variables using the **cmd(...)** option; **logit**, **mlogit**, and **regress** can be specified, along with other types of models (e.g., **ologit**), as the user deems appropriate. Variables with no missing values are listed along with a message that the variable does not contain missing values. In this case, the table shows us that all variables in the imputation model were used to impute missing values on each variable. Because **ice** commands are often somewhat long and complicated, it is important to check this table carefully to be sure the model is what you intended to specify. Finally, the output tells us that it is imputing (with progress dots), and then saves the dataset.

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | read write math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
       read | regress | female write math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
      write | regress | female read math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
       math | regress | female read write science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
    science | regress | female read write math socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
      socst |         | [No missing data in estimation sample]
       prog | mlogit  | female read write math science socst _Irace_2 _Irace_3
            |         | _Irace_4 ses schtyp
   _Iprog_2 |         | [Passively imputed from (prog==2)]
   _Iprog_3 |         | [Passively imputed from (prog==3)]
   _Irace_2 |         | [No missing data in estimation sample]
   _Irace_3 |         | [No missing data in estimation sample]
   _Irace_4 |         | [No missing data in estimation sample]
        ses |         | [No missing data in estimation sample]
     schtyp |         | [No missing data in estimation sample]
------------------------------------------------------------------------------
Imputing ..........1..........2..........3..........4..........5
file ice_imputation.dta saved

The imputation model above contained 10 variables, two of which were represented by dummy variables, meaning that, at most, there were 12 variables available as predictors for any variable with missing values. Because the imputation model was relatively small, we were able to predict each variable with missing values using all other variables in the imputation model. However, in larger imputation models it isn't feasible to use all other variables to impute a given variable; the univariate regression models simply become too large to estimate. For example, in an imputation model with 30 variables, it is unlikely that all 29 other variables could be used in the prediction equation. If you attempt to run such a model, you will encounter error messages similar to the one below, where we have tried to impute missing values for **var1**-**var30**.

ice var1-var30, gen(m_) saving(error_imputation) m(5)

<output omitted>

Error #430 encountered while running -uvis-
I detected a problem with running uvis with command mlogit on response var11
and covariates var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var12 var13 var14 var15 v
> ar16 var17 var18 var19 var20 var21 var22 var23 var24 var25 var26 var27 var28 var29 var30.
The offending command resembled:
uvis mlogit var11 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var12 var13 var14 var1
> 5 var16 var17 var18 var19 var20 var21 var22 var23 var24 var25 var26 var27 var28 var29 var
> 30 , gen([imputed])
With mlogit, try combining categories of var11, or if appropriate, use ologit
you may wish to try the -persist- option to persist beyond this error.
dumping current data to ./_ice_dump.dta
convergence not achieved
r(430);

What this error message tells us is that **ice** encountered a problem when it attempted to impute **var11** using all 29 other variables in the model. More specifically, **ice** was attempting to use **mlogit** to predict **var11** but encountered an error. Note that depending on the specific problem encountered, **ice** may not produce this exact error message, but other error messages caused by an excess of predictors are likely to be similar. The third paragraph of the error message suggests that we try combining categories of **var11** (which would help if **var11** had a small number of cases in some categories), or imputing using **ologit** if **var11** is ordinal. We could also try the **persist** option for **ice**, which would cause **ice** to ignore some errors. These are not bad suggestions, but first we probably want to stop and think about the model we were attempting to run: an **mlogit** model using 29 predictors. We wouldn't generally do this in an analysis model, and, for similar reasons, doing so in an imputation model may not be a good idea. Instead, we may want to select a subset of variables that will be used to predict each variable to be imputed.

Selecting which variables should be used to impute each of the variables with missing values can be a time-consuming process. If you have a specific analysis model in mind, then all variables to be included in that model should be used to predict all other variables in the analysis model, so that the imputations accurately capture the relationships of interest. If you wish to include auxiliary variables (i.e., variables that are not part of the analysis model) in the imputation model, or if you don't have a specific analysis model in mind when creating the imputation model, then you will need to think carefully about which variables should be used to impute each variable with missing values. An example of the process of building such a model is given in a paper by van Buuren, Boshuizen, and Knook (1999). As you might imagine, creating a custom equation for each, or even some, of the variables in the imputation model can be time consuming, but it is important to be thoughtful in creating these models.

Once we have decided which variables should be used to predict the variables we want to impute, we can use the **eq(...)** option with **ice** to specify the equations. The **ice** command below is similar to the first model, except that it includes custom equations for 5 of the variables we wish to impute (**female**, **read**, **math**, **science**, and **prog**). The first two lines of syntax below are the same as before, and we have used three slashes (i.e., "///") to allow the command to run over multiple lines. The **eq(...)** option specifying the custom prediction equations for 5 of our variables is on lines 3 to 7. Following the open parenthesis (**eq(**), the first variable for which we wish to specify an equation (i.e., **female**) is listed, followed by a colon (**:**). Following the colon are the variables we wish to use to impute values of **female** (i.e., **read write math science socst**); this list is followed by a comma, which separates the prediction equations. We have added a line break (using ///) after the comma and begin the next custom equation (i.e., the equation for **read**) on the next line. This continues for each of the variables for which we are specifying an equation; the last equation (i.e., for **prog**) ends with a close parenthesis (**)**) instead of a comma. Because we haven't included a custom equation for the variable **write**, **ice** will use all other variables in the model to predict it.

ice female read write math science socst m.prog i.race ses schtyp, ///
    gen(m_) saving(ice_imputation_custom) m(5) seed(4324) ///
    eq(female: read write math science socst, ///
    read: female socst schtyp _Irace_2 _Irace_3 _Irace_4, ///
    math: science write _Iprog_2 _Iprog_3 ses, ///
    science: math socst read schtyp, ///
    prog: female ses math science socst)

The output generated by the above syntax is similar to the output from the earlier imputation model. The command listed at the very top of the output now includes the **eq(...)** option. Looking at the last table, which displays the command and prediction equation used to impute each variable, we can confirm that the predictors we specified in the syntax were used to impute each variable.

=> xi: ice female read write math science socst prog i.prog i.race ses schtyp, cmd(prog:mlo
> git) substitute(prog:i.prog) eq(female: read write math science socst, read: female socst
>  schtyp _Irace_2 _Irace_3 _Irace_4, math: science write _Iprog_2 _Iprog_3 ses, science: m
> ath socst read schtyp, prog: female ses math science socst) gen(m_) saving(ice_imputation
> _custom, replace) m(5) seed(4324)

i.prog            _Iprog_1-3          (naturally coded; _Iprog_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        117       58.50       58.50
          1 |         73       36.50       95.00
          2 |         10        5.00      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | read write math science socst
       read | regress | female socst schtyp _Irace_2 _Irace_3 _Irace_4
      write | regress | female read math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
       math | regress | science write _Iprog_2 _Iprog_3 ses
    science | regress | math socst read schtyp
      socst |         | [No missing data in estimation sample]
       prog | mlogit  | female ses math science socst
   _Iprog_2 |         | [Passively imputed from (prog==2)]
   _Iprog_3 |         | [Passively imputed from (prog==3)]
   _Irace_2 |         | [No missing data in estimation sample]
   _Irace_3 |         | [No missing data in estimation sample]
   _Irace_4 |         | [No missing data in estimation sample]
        ses |         | [No missing data in estimation sample]
     schtyp |         | [No missing data in estimation sample]
------------------------------------------------------------------------------
Imputing ..........1..........2..........3..........4..........5
file ice_imputation_custom.dta saved

We specified **ice_imputation_custom** in the **saving(...)** option,
and once the imputation is complete we can open that file.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation_custom, clear
(highschool and beyond (200 cases))

As with the data from the multivariate normal imputation model, we will want to
explore the imputed datasets in much the same way we would explore a new dataset.
Recall from the previous lecture that we don't expect the descriptive statistics
for the imputed data to be the same as those in the original data, but examining
these values can help us see where things might have gone wrong in the
imputation process. For example, if a variable with a range of 1-10 (in the
original data), has a mean outside that range (e.g., 15) after imputation, this
suggests there may have been a problem in the imputation model. Once we have
explored the imputed dataset, we can analyze the data using **mim**, or set it up for use
with Stata's **mi** commands (see section below on importing datasets not imputed
using **mi impute**).

As mentioned above, there are two major options for management and analysis of MI datasets in Stata: the **mi** commands, introduced in Stata 11, and the user-written command **mim**, which works with earlier versions of Stata. Throughout the following sections, we will discuss the use of both Stata's **mi** commands and the user-written command **mim**.

Note that one difference between the **mi** commands and **mim** is that when you use the **mi** commands, Stata automatically performs checks to be sure the datasets are consistent, for example, that the values of a variable that is not missing in the original dataset do not vary across imputations. The user-written program **mim** does not do this type of checking. Another difference is that while Stata's **mi** commands allow different data styles (discussed below), **mim** requires that all of the imputations be in a single data file. If the original dataset is large, or there are a lot of imputations, a single data file with all of the imputations may be quite large. If you encounter problems because of the size of an MI data file and do not have access to the **mi** commands, another option is the user-written command **mira**, which analyzes MI data where each imputation is stored in its own dataset.

Before we talk about data management in terms of commands, there are a few things you should know about Stata's **mi** commands. As we mentioned in the first seminar, the original (pre-imputation) data is stored along with the imputed data. The original data is referred to in Stata (and elsewhere) as m=0, that is, imputation 0.

Note that Stata's **mi** commands store the imputation number in the variable **_mi_m**; this is identified as a system variable and should generally not be modified by the user. Stata's **mi** commands expect that the original data is present in the dataset (which will be the case if you've created the data using either **mi impute** or **ice**). If there is no m=0, Stata will treat the lowest valued imputation (for example, m=1 if m takes on values from 1-5) as if it were m=0. If the dataset Stata thinks contains the original data is actually an imputed dataset, this will cause problems. For more information on handling this situation, see our FAQ: How can I use multiply imputed data where original data is not included? Another thing to be aware of is that changing the original data (m=0) can cause Stata's data management for **mi** to change the imputed data as well.

If you are using **mim** rather than Stata's **mi** commands, the above does not apply. Datasets used with **mim** may or may not contain the original dataset (m=0), but if the original data is included, it is important that it be denoted m=0 or as missing (i.e., m=.); otherwise **mim** will treat the original data as an imputed dataset. The **mim** command expects the imputation number to be stored in the variable **_mj**; as with the variable **_mi_m**, this should generally not be modified by the user.

Stata's **mi** commands allow MI data to be stored in several different styles. The four styles used by Stata are **flong**, **flongsep**, **mlong**, and **wide**. In the **flong** style, in addition to the original data, the dataset contains one copy of the data for each imputation. So, if a file contained 5 imputations, each case would occur in the dataset 6 times: once in the original data, and once in each of the imputations. This is a common format for MI data, but it can result in very large datasets, particularly if there are a large number of imputations. Below is a small example of an **flong** dataset. The first variable (i.e., column), labeled **m**, contains the imputation number (0 to 3), the variable **id** contains the case id, and **v1**-**v3** contain data. Note that in the original data (m=0) case 1 (id=1) is missing on **v1**, while case 2 is missing on **v2**. In the imputed datasets (m=1 to 3) the missing values have been imputed.

m   id   v1   v2   v3
0    1    .    3    8
0    2    2    .    7
1    1    5    3    8
1    2    2    4    7
2    1    3    3    8
2    2    2    2    7
3    1    1    3    8
3    2    2    3    7

In the **flongsep** style, each imputation is stored in a separate data file, each of
which contains one row for every case in the dataset. If there are 5
imputations, the **flongsep** style requires 6 datasets, one for the
original data, and one for each of the 5 imputations. Because the data is stored
in multiple files, it takes longer to run commands on MI datasets stored in the
**flongsep** style, making this style somewhat inefficient. However, when the dataset and/or the number of
imputations is large, **flongsep** has the advantage of resulting in less
data in memory at any time.
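As a sketch of how this works in practice (the dataset name **myimp** here is hypothetical), converting to the **flongsep** style requires supplying a name for the set of files:

mi convert flongsep myimp

Stata then stores the original data in the main file and each imputation in its own file alongside it; see **help mi styles** for the exact file-naming convention.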

In the **mlong** format, cases with no imputed
values are included in the dataset only once (in the original data), while cases
with imputed values appear in the dataset multiple times, once in the original
data, and once for each imputation. Depending on the number of cases without
imputed values, the **mlong** style may result in substantially smaller
datasets than the **flong** style.

In the **wide** style, instead of adding additional cases to accommodate the imputations, imputed variables appear in the dataset multiple times: once for the original data, and once for each imputation. So if the variable **age** has been imputed (with 5 imputations), the variable **age**, containing the original data, would exist in the dataset, as well as five copies, **_1_age** to **_5_age**.
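To make the contrast with the **flong** example above concrete, here is a sketch of what that same small dataset (two cases, three imputations, with **v1** imputed for case 1 and **v2** imputed for case 2) might look like in the **wide** style, with the imputed copies named following the **_1_varname** pattern described above:

id   v1   v2   v3   _1_v1   _1_v2   _2_v1   _2_v2   _3_v1   _3_v2
 1    .    3    8       5       3       3       3       1       3
 2    2    .    7       2       4       2       2       2       3

Note that only the imputed variables (**v1** and **v2**) are copied; **v3**, which has no missing values, appears only once.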

For a more thorough explanation of missing data styles, see the documentation by typing **help mi styles** in the Stata command window. If you are working with **mim**, the data is stored in what Stata's **mi** commands call the **flong** style.

Note that if the data is set up for use with the **mi** commands, and you attempt to perform analyses or data management on your own, that is, without using Stata's **mi** commands, it is important to consider how the style will influence that command. If you use the **mi** commands, Stata will keep track of the style and adjust the commands as necessary for you.

Although not typically necessary, it is possible to move between the various **mi** styles using the **mi convert** command. Below we open the dataset **mvn_imputation**, which contains MI data. Then we use the **mi query** command, which tells us the **mi** style and number of imputations. Currently the style is **flong**. To convert it, we use the **mi convert** command, followed by the style we would like to convert to, in this case **wide**. We can then use the **mi query** command again to confirm that the style is now **wide**.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear

mi query
data mi set flong, M = 5

mi convert wide

mi query
data mi set wide, M = 5
last mi update 0 seconds ago

We've already used **mi describe** in a number of places, so although it is an important tool for managing data using Stata's **mi** commands, we won't cover it here; keep in mind, though, that it is probably the first command you will want to run on your imputed data. Instead we will start with the **mi varying** command, which checks for inconsistencies in the data: variables that vary across the imputations but aren't registered as imputed, or variables that are registered as imputed or passive but don't vary across the imputations. These conditions may indicate an improperly registered variable or a problem with the imputation. It is a good idea to run this command after using **mi import**, as well as after more complex data management commands (e.g., **mi merge**), to make sure everything went as expected. Below we open an MI dataset created by **ice**, named **ice_imputation**, and use the **mi import ice** command to convert the dataset from the form created by **ice** to the format used by Stata's **mi** commands. The **imputed(...)** option specifies which variables in the dataset are imputed. The third line of syntax below runs the **mi varying** command.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation, clear

mi import ice , imputed(female read write math science prog)
(83 m=0 obs. now marked as incomplete)

mi varying

              Possible problem   variable names
-----------------------------------------------------------------------------------------
           imputed nonvarying:  (none)
           passive nonvarying:  (none)
         unregistered varying:  _Iprog_2 _Iprog_3
  *unregistered super/varying:  m__Iprog_2 m__Iprog_3 m__Irace_2 m__Irace_3 m__Irace_4
                                m_female m_math m_prog m_read m_schtyp m_science m_ses
                                m_socst m_write
   unregistered super varying:  (none)
-----------------------------------------------------------------------------------------
* super/varying means super varying but would be varying if registered as imputed;
  variables vary only where equal to soft missing in m=0.

The output from the **mi varying** command lists five conditions that are possible problems, as well as any variables that fall into each condition. The criteria for identifying each of these types of potentially problematic variables, as well as the possible causes in the data, are somewhat difficult to describe briefly. The help file for the **mi varying** command contains a description of each of these conditions, along with recommendations for how one might handle variables that are identified in each condition. You can access the help file by typing **help mi varying** in the Stata command window. Looking at the above output, we see that the variables **_Iprog_2** and **_Iprog_3** are identified as unregistered varying, meaning some of their values vary across the imputations. This suggests that these variables may have been imputed, which is correct in this case. We'd probably want to go ahead and register them as imputed; note that the variable **prog**, whose categories **_Iprog_2** and **_Iprog_3** represent, is already registered as imputed.
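A minimal sketch of how that registration might look, using Stata's standard **mi register imputed** command (with **mi update** afterwards to refresh **mi**'s bookkeeping):

mi register imputed _Iprog_2 _Iprog_3
mi update

After registering, running **mi varying** again should no longer flag these two variables as unregistered varying.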

Stata has also identified all of the imputation indicators (i.e., the variables that begin with **m_**) as unregistered super/varying. Registering these variables as imputed would cause them not to show up in the **mi varying** output in the future, but it is probably inappropriate because they aren't imputed variables. Registering them as imputed would also result in all of the cases in the dataset being marked as incomplete, since the imputation indicators are missing in the original data. The best solution in this case may be to do nothing, since we understand why these variables are being flagged by **mi varying**. It is important to remember that when a variable is flagged by **mi varying**, it does not necessarily mean there is a problem in the dataset; it merely indicates a specific condition, which may or may not be indicative of a problem.

Because **mim** does not do the consistency checks that Stata's **mi**
commands do, there is no command for **mim** that is equivalent to **mi varying**.

The **mi xeq:** command is useful for exploring the data. It allows the user to easily run a command separately on each of the imputed datasets. Below we use **mi xeq:** followed by **sum read write** to summarize the variables **read** and **write** in each of the imputations. The output gives the imputation number, followed by the command and its output. The table produced by the **sum** command looks as it normally would. The table for m=0 is followed by the table for m=1, and so on (note that we have omitted the rest of the output).

mi xeq: sum read write

m=0 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       191    52.28796    10.21072         28         76
       write |       183    52.95082    9.257773         31         67

m=1 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200     52.4295    10.21239         28         76
       write |       200    52.68909    9.593023   21.93323   71.58826

<output omitted>

Using Stata's **mi** commands, we may want to sort the data before running a command using **mi xeq:**. For example, we may want to summarize **read** and **write** by school type (i.e., the variable **schtyp**) in each imputation. If we weren't working with MI data, we would simply sort by **schtyp** and then run the **summarize** command, so we might try to do the same here. Below we first sort by **schtyp** and then use **mi xeq:** with the **by** prefix.

sort schtyp
mi xeq: by schtyp: sum read write

m=0 data:
-> by schtyp: sum read write
not sorted
r(5);

However, with **mi xeq:** this results in an error message, because **mi xeq:** sorts the data after we do. Instead, we need to include the **sort** command as part of the **mi xeq:** command. The syntax below does this: the line starts with **mi xeq:** followed by **sort schtyp**; then, where we would normally use a line break to indicate the end of a command, we use a semicolon (";"), and then we issue the command we wish to run, in this case **by schtyp: sum read write**. This results in **mi xeq:** first sorting the data by **schtyp** within each imputation, and then running the **summarize** command by **schtyp**.

mi xeq: sort schtyp; by schtyp: sum read write

m=0 data:
-> sort schtyp
-> by schtyp: sum read write

-------------------------------------------------------------------------------------------
-> schtyp = public

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       161    51.99379    10.39741         28         76
       write |       152    52.32895    9.584602         31         67

-------------------------------------------------------------------------------------------
-> schtyp = private

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |        30    53.86667    9.141543         36         73
       write |        31          56     6.78233         38         67

m=1 data:
-> sort schtyp
-> by schtyp: sum read write

<output omitted>

Stata's **mi xeq** command also allows us to run a command on only a subset of the imputations. For example,
the code below runs **sum read write** on only m=1 and m=2. The number list after
**mi xeq** (i.e., 1 2) but
before the colon (":") indicates which imputations should be used.

mi xeq 1 2: sum read write

m=1 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200    52.30726    10.18691         28         76
       write |       200    52.71049    9.263646   27.90232         67

m=2 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200    52.19646    10.16644         28         76
       write |       200    52.90549    9.420377         31   72.66547

If your data are set up for use with **mim**, you can use Stata's standard
**sort**, **by**, and **if** commands to produce the same output. Below
we first open a dataset with the structure expected by **mim**. Then we sort by imputation number
(i.e., the variable **_mj**), and finally we use the **by** prefix to summarize (**sum**) the
variables **read** and **write** separately for each imputation.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim, clear
sort _mj
by _mj: sum read write

----------------------------------------------------------------------------------------------
-> _mj = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       191    52.28796    10.21072         28         76
       write |       183    52.95082    9.257773         31         67

----------------------------------------------------------------------------------------------
-> _mj = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200    52.30726    10.18691         28         76
       write |       200    52.71049    9.263646   27.90232         67

<output omitted>

If we wish to create output by some other variable, for example **schtyp**, as in
the second example above, we can also do this using the **sort** command and
**by** prefix. Below we first **sort** by **_mj** and **schtyp** (in that order) and then
use the **by** prefix to summarize **read** and **write** by **schtyp**
separately for each imputation.

sort _mj schtyp
by _mj schtyp: sum read write

----------------------------------------------------------------------------------------------
-> _mj = 0, schtyp = public

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       161    51.99379    10.39741         28         76
       write |       152    52.32895    9.584602         31         67

----------------------------------------------------------------------------------------------
-> _mj = 0, schtyp = private

<output omitted>
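The two-level grouping here (imputation, then school type) is just a grouped summary on two keys. A minimal pandas sketch with hypothetical toy data, mirroring the `sort _mj schtyp` / `by _mj schtyp:` pattern:

```python
import pandas as pd

# Hypothetical mim-style data: _mj indexes the imputation, schtyp the school type.
df = pd.DataFrame({
    "_mj":    [0, 0, 0, 0, 1, 1, 1, 1],
    "schtyp": ["public", "public", "private", "private"] * 2,
    "read":   [47.0, 57.0, 60.0, 63.0, 47.0, 57.0, 60.0, 63.0],
})

# Sorting by the grouping keys first mirrors `sort _mj schtyp`,
# then grouping on both keys mirrors `by _mj schtyp: sum read`.
means = df.sort_values(["_mj", "schtyp"]).groupby(["_mj", "schtyp"])["read"].mean()
print(means)
```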

To run a command only on certain imputations, you can use **if**, as shown below.

sum read write if _mj==1
sum read write if _mj==2

<output omitted>

Once you have created the imputations, you may want to create new variables
in the dataset. Depending on the structure of your dataset (what Stata
calls its style) and the nature of the variable being generated, the standard
Stata commands might work; however, in some situations the resulting variables
will not be calculated as expected. This is particularly true if you are using
Stata's **mi** commands, because of the consistency checks Stata
performs.

Stata's **mi** commands provide tools for creating variables with MI
data; when possible, you probably want to use these tools. Stata calls variables that are created from imputed variables after the
imputation process passive variables. (Note that this is somewhat different from the
meaning of passive variables in the context of ICE.) The **mi passive:** command can be used as a prefix for **generate**,
**egen**, and **replace**. The syntax for these commands after the **mi passive:**
prefix is identical to their normal syntax. In the syntax below we start by
opening the dataset **mvn_imputation**, which was produced by **mi impute mvn** in the first
seminar. The second line of syntax begins with **mi passive:** and uses the
**generate** command to create a variable that is
the sum of the imputed variables **read** and **write**. The output shows the result of running this command in each
imputation. In the original data (m=0), 26 missing values were created; this is
not surprising, since m=0 contains missing values. In our case, no missing values
are produced in the imputed datasets; however, depending on the command
specified and the variables used, we might end up with some missing values even
in the imputed datasets. As always when creating new variables, you'll want to
consider whether the number of missing values created seems appropriate.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi passive: generate english = read + write

m=0: (26 missing values generated)
m=1:
m=2:
m=3:
m=4:
m=5:
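Conceptually, a passive variable is simply recomputed in every copy of the data. A hypothetical pandas sketch (toy data, not the seminar's files) of why m=0 keeps its missing values while the imputed copies do not:

```python
import pandas as pd
import numpy as np

# Hypothetical flong-style data; read is missing for one case in m=0
# but filled in by the imputation in m=1.
df = pd.DataFrame({
    "_mi_m": [0, 0, 1, 1],
    "read":  [47.0, np.nan, 47.0, 51.3],
    "write": [52.0, 59.0, 52.0, 59.0],
})

# The passive variable is computed row by row in every copy of the data,
# so missingness in m=0 propagates while the imputed copies are complete.
df["english"] = df["read"] + df["write"]
missing_by_m = df.groupby("_mi_m")["english"].apply(lambda s: s.isna().sum())
print(missing_by_m)
```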

As mentioned above **replace** can also be used with **mi passive:**, here we
change the values of **english** from the sum of **read** and **write** to the average of
**read** and **write**.

mi passive: replace english = (read + write)/2

m=0: (174 real changes made)
m=1: (200 real changes made)
m=2: (200 real changes made)
m=3: (200 real changes made)
m=4: (200 real changes made)
m=5: (200 real changes made)

Below we use **mi passive:** with the **egen** command to create a variable that is the total (sum) of
the variables **read**, **write**, **math**, **science**, and **socst**. Notice
that no missing values were created in m=0; this is because the **rowtotal(**...**)** function treats missing values as though they were equal to 0,
and hence produces values for **total** even when some or all of the variables are
missing (as is sometimes the case in m=0).

mi passive: egen total = rowtotal(read write math science socst)

m=0:
m=1:
m=2:
m=3:
m=4:
m=5:

If you are working with **mim**, the structure of the data allows you to
use standard Stata commands to create variables, as long as the command used to
create the variable only uses information from one row at a time. For example, to
create the variable **english** as in the first example above (i.e., **read** + **write**),
we can use the **generate** command. This creates 26 missing values, because there are
26 cases in m=0 that are missing **read** and/or **write**.

generate english = read + write
(26 missing values generated)

Below we change **english** from the sum of **read** and **write** to their average
using the **replace** command, just as we would if the data had not been
multiply imputed. The next line of syntax creates the variable **total** using the **egen**
command with the **rowtotal** function.

replace english = (read+write)/2
(1174 real changes made)
egen total = rowtotal(read write math science socst)

Regardless of how the data is stored (i.e., what Stata calls its style), the structure of MI data can complicate the creation of variables that
use multiple rows of data, for example, group mean variables and variables that depend on sort order (e.g., lag variables). Suppose we had data on
students at multiple time points, such as test scores for each year of high
school. Such datasets are often stored so that each student has multiple rows in the dataset, one for each year.
In such a dataset
we might want to create a lag variable equal to the value of the variable **read** at the
previous time point (**year**).

If you are using Stata's **mi** commands, you can do this with the **mi
passive:** command. As noted above, **mi passive:**
handles the sort order itself, and because it does so, we cannot rely on sorting the
data beforehand to ensure the proper order within the data.
Instead we use the **by** prefix to perform the action for each student (**id**); the **year**
in parentheses tells Stata to first sort the data by **year** within
**id**, and then perform the operation by **id**. The **generate** command
uses **read[_n-1]** to refer to the
value of **read** in the previous line (within **id**) when the data is sorted by
**id** and **year**; this results in a lagged variable. Note that this
command will not run in the current dataset, because
there is no **year** variable.

mi passive: by id (year): gen lread = read[_n-1]

To create a lag variable in a dataset formatted for use with **mim**, we first sort by imputation number (**_mj**), then **id** and **year**. In the second line of syntax below, we use the **by** prefix to perform the action for each student (**id**) within each imputation (**_mj**); the **year** in parentheses tells Stata to first sort the data by **year** within **_mj** and **id**, and then perform the operation by **_mj** and **id**. The **generate** command
uses **read[_n-1]** to refer to the
value of **read** in the previous line (within **_mj** and **id**) when the data is sorted by **id** and **year**; this results in a lagged variable. Note that this will not run in the current dataset, because
there is no **year** variable.

sort _mj id year
by _mj id (year): gen lread = read[_n-1]
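The same within-imputation, within-student lag can be sketched in pandas (hypothetical toy data; `shift` within a group plays the role of `read[_n-1]`):

```python
import pandas as pd

# Hypothetical long-form mim-style data: one row per student (id) per year,
# repeated for each imputation (_mj).
df = pd.DataFrame({
    "_mj":  [0, 0, 0, 0, 1, 1, 1, 1],
    "id":   [1, 1, 2, 2, 1, 1, 2, 2],
    "year": [1, 2, 1, 2, 1, 2, 1, 2],
    "read": [47.0, 50.0, 60.0, 63.0, 47.0, 50.0, 60.0, 63.0],
})

# Sort so rows are ordered by year within _mj and id, then shift within each
# (_mj, id) group -- the analogue of `by _mj id (year): gen lread = read[_n-1]`.
df = df.sort_values(["_mj", "id", "year"])
df["lread"] = df.groupby(["_mj", "id"])["read"].shift(1)
print(df)
```

Grouping on both `_mj` and `id` is what keeps the lag from leaking across imputations or across students.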

If you are
using Stata's **mi** commands, there may be cases where you want to create variables using commands not
supported by **mi passive:**. In this case you can run the command and then register the resulting variable as the appropriate
type. In the example below, we use **recode** to recode the variable **ses** into **ses2**.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi xeq: recode ses (1/2=1)(3=2), gen(ses2)

m=0 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=1 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=2 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=3 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=4 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=5 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

The resulting output shows the results of running the **recode** command in each
MI dataset. Note that if the data is in either the **flong** or **mlong** style, the **mi xeq:** prefix is unnecessary, and we could have just used the **recode** command; however, **mi xeq:** works in all styles,
so we might want to make a habit of using it even
when it isn't strictly necessary. Below we use the **mi register** command to register **ses2** as a regular variable,
because the variable **ses** was not an imputed or passive variable. If **ses** had been either
imputed or passive, we would have registered **ses2** as passive.

mi register regular ses2

Because of the way **mim** datasets are structured, no special steps are needed to perform this type of action, so we could perform the above recode with the standard Stata commands shown below.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim, clear
recode ses (1/2=1)(3=2), gen(ses2)
(918 differences between ses and ses2)

**Dropping variables or cases, and renaming variables**

Because of the consistency checks performed when you use Stata's **mi** commands, dropping variables or cases and renaming variables are somewhat different from the same operations in datasets that are not **mi set**. If you are using a dataset in the form expected by **mim**, the standard Stata commands can be used. Below we use **drop** to remove the variable **pr_2**, followed by **mi update**.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
drop pr_2
mi update
(imputed variable pr_2 unregistered because not in m=0)

The **drop** command can also be used to delete observations, but as when it is used
to delete variables, it should be followed by the **mi update** command,
which tells Stata to make sure the data is still consistent. In this case, we drop cases where **female** = 0 and then run **mi update**. When **mi update** is run, Stata notices that the number of observations has changed, and it updates
one of the system variables to reflect this. The effect of running **mi update** may vary depending on the style, but in any case, Stata looks at the dataset
in its current form and makes any necessary changes.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
drop if female==0
(486 observations deleted)
mi update
(system variable _mi_id updated due to changed number of obs.)

To rename a variable but retain its status as imputed, passive, or regular,
use the **mi rename** command. Here we rename the variable **female** to **gender**.

mi rename female gender

As mentioned above, because of the format of the data used with **mim**, and because **mim** does not do the sort of
consistency checks that Stata's **mi** commands do, in datasets used with **mim**, variables and cases
can be dropped, and variables renamed, in the usual fashion.

Stata's **mi** commands include a special version of **merge**, **mi merge**.
In this example, we have a dataset named **demo** that contains the
demographic variables (**female, race, ses, schtyp**, and **prog**)
for the 200 students in our sample, along with the subject identifier (**id**). Below we
open the dataset and use the **summarize** (abbreviated **sum**) command to
confirm that the dataset contains the variables we think it does. Note that **female** has only
182 observations, indicating that it contains some missing values. We know that
none of these values have been multiply imputed because the dataset contains
only one observation for each of the 200 cases.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/demo, clear
sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       200       100.5    57.87918          1        200
      female |       182    .5549451    .4983428          0          1
        race |       200        3.43    1.039472          1          4
         ses |       200       2.055    .7242914          1          3
      schtyp |       200        1.16     .367526          1          2
-------------+--------------------------------------------------------
        prog |       182    2.027473    .6927511          1          3

We also have a data file called **scores**, which contains the test score data (i.e., **read, write, math, science**, and **socst**). The variables **read, write, math**, and **science** contain missing values and have been imputed.
Below we open the dataset and confirm its structure; because the data has been **mi set**, we
use **mi describe** to do this. From this output we confirm that **read,
write, math**, and **science** are all registered as imputed, and that there are 5 imputations.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/scores, clear
mi describe

Style:  flong
        last mi update 6 days ago
Obs.:   complete         147
        incomplete        53  (M = 5 imputations)
        ---------------------
        total            200
Vars.:  imputed:  4; read(9) write(17) math(15) science(16)
        passive: 0
        regular: 1; socst
        system:  3; _mi_m _mi_id _mi_miss
       (there is one unregistered variable; id)

Now that we understand the data structure, we can use **mi merge** to combine the files
so that we have both demographic data and test scores in the same file. In order to use **mi merge**, both
datasets must be **mi set**; the test score data (**scores**) is already **mi set**, but the demographic data (**demo**) is not. In the first line of syntax below we open **demo**, and then we **mi set** the data.
The third line of syntax shows the command to merge the datasets. The
command name (**mi merge**) is followed by a description of the type of merge
we wish to perform. Since each case in **demo** should be matched to only one case
(per imputation) in **scores**, we want to do a one-to-one merge (i.e., **1:1**).
This is followed by the name of the index variable (**id**). Finally, **using scores** specifies that we wish to merge the current dataset with the
dataset **scores**.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/demo, clear
mi set flong
mi merge 1:1 id using scores

(M increased from 0 to 5)
(new variables read write math science registered as imputed)
(new variable socst registered as regular)

    Result                      # of obs.
    -----------------------------------------
    not matched                         0
    matched                           200
    -----------------------------------------

The output from **mi merge** first tells us that M (the number of imputations)
has increased from 0 to 5. This is because the file **demo** contained no imputations,
while **scores** contained 5 imputations. The output then tells us how the
new variables (from **scores**) were registered. Finally, the output gives a
table showing the number of observations that were matched and not matched. In
this case, all 200 observations were matched, which is what we would expect.

The above example shows a case where a file containing only original data (no
imputations) was merged with an MI dataset. You can also merge two imputed
datasets, but there are a few things to be aware of. First, when two MI datasets are
merged using **mi merge**, they are matched on both the index variable (e.g.,
**id**) and the imputation number (stored in **_mi_m**). For example, the case
with **id**=1 in **_mi_m**=2 is matched to the case with **id**=1 in **_mi_m**=2 in the other dataset. Second, if you are merging datasets with unequal
numbers of imputations, then the number of imputations will be set to the larger
of the two, and the "missing" imputations from the dataset with fewer
imputations will be filled in using the original data. This results in unequal
numbers of complete cases across the imputations. Neither of the conditions
described here is necessarily problematic, but you probably want to think about whether you will encounter them and how you want to proceed.
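The key point, matching on both the case id and the imputation number, can be sketched with a pandas merge on two keys (hypothetical toy data stored long, with `_mj` as the imputation index):

```python
import pandas as pd

# Two hypothetical MI datasets stored long: rows are matched on both the
# case id and the imputation number, the same pairing mi merge uses.
demo = pd.DataFrame({
    "_mj":    [0, 0, 1, 1],
    "id":     [1, 2, 1, 2],
    "female": [1, 0, 1, 0],
})
scores = pd.DataFrame({
    "_mj":  [0, 0, 1, 1],
    "id":   [1, 2, 1, 2],
    "read": [47.0, None, 47.0, 52.5],
})

# Merging on both keys ensures id=1 in imputation 1 only matches
# id=1 in imputation 1 of the other dataset.
merged = demo.merge(scores, on=["_mj", "id"], validate="1:1")
print(merged)
```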

An unimputed dataset and an MI dataset can also be merged using **mim**. Below is an example; the two starting datasets, **mim_demo** and **mim_scores**, are identical to the datasets **demo** and **scores** from the previous example, except that they are formatted to work with **mim**. We start the merge process by
opening the **mim_demo** dataset. When we merge the data, **mim** will
use both the imputation number (stored in the variable **_mj**) and the id variable we specify (i.e., **id**) to match the rows in the datasets. Currently, **mim_demo** contains only 1 copy of the dataset and no **_mj** variable, while **mim_scores** contains 6 copies of the dataset (with **_mj** taking on
the values 0 to 5). So in order to merge the two files, we need to modify **mim_demo** so that it contains 6 copies of the data, with a variable **_mj** that takes on the values 0 to 5 for each case. The second line of syntax below uses
the **expand** command to modify the dataset so that each case appears 6
times. Next we **sort** the data by **id**. Then for each **id** (**by
id:**) we use the **egen** command with the **seq(**...**)** function to
create a new variable, called **_mj**, that takes on a sequence of values
from 0 to 5 (specified using the **from(...)** and **to(...)** options).
Now we have six copies of the original (unimputed) data in **mim_demo**, with a variable **_mj** that indexes the copies of the dataset. When we merge the data, **mim** will
also expect the dataset to contain a variable **_mi** that indexes the cases.
When it merges the datasets, **mim** will recreate this variable in both datasets
based on the values of **id**, so the values of this variable don't matter at
this point; there just needs to be a variable called **_mi** in the dataset. Hence, the final line of syntax generates the variable **_mi** and sets all its values
equal to missing.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_demo, clear
(highschool and beyond (200 cases))
expand = 6
(1000 observations created)
sort id
by id: egen _mj = seq(), from(0) to(5)
gen _mi = .
(1200 missing values generated)
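The **expand**/**seq()** step above is just "replicate each row M+1 times and number the copies." A hypothetical pandas sketch of the same preparation:

```python
import pandas as pd

# Hypothetical unimputed demographic data: one row per case.
demo = pd.DataFrame({"id": [1, 2, 3], "female": [1, 0, 1]})

# Analogue of `expand = 6` plus `by id: egen _mj = seq(), from(0) to(5)`:
# replicate each row M+1 times, then number the copies 0..M within each id.
M = 5
expanded = demo.loc[demo.index.repeat(M + 1)].reset_index(drop=True)
expanded["_mj"] = expanded.groupby("id").cumcount()

# Placeholder case index; in the mim workflow this is recreated at merge time.
expanded["_mi"] = pd.NA
print(expanded.head(8))
```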

Now that the **mim_demo** dataset is in the correct form and has the correct variables, we can
merge the two datasets. With **mim_demo** still open, the syntax below merges
the two datasets. The command starts with the **mim** prefix, followed by a comma (",") to
designate the beginning of the options. The **sortorder(**...**)** option
identifies the variable that uniquely identifies cases in both datasets (in this
case **id**); this
variable will be used to create **mim**'s own case identification variable (i.e., **_mi**) after the datasets are merged. Following the colon (":") that marks
the end of the **mim** prefix is the **merge** command, followed by the name of the variable that identifies the observations in both datasets (i.e., **id**), and the keyword **using** with the name and location of the dataset to be merged (i.e., **mim_scores**).

mim, sortorder(id): merge id using http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_scores
(highschool and beyond (200 cases))

The above example shows a case where a file containing only original data (no imputations) was merged with an MI dataset. Two imputed datasets can also be merged, but there are a few things to be aware of. First, when two MI datasets are merged using **mim**, they are matched on both the index variable (e.g., **id**) and the imputation number (stored in **_mj**). Second, unlike Stata's **mi merge** command, **mim** will not allow you to merge two datasets with unequal numbers of imputations. If for some reason you have two such datasets, it is possible to merge them by either adding or dropping imputations, but it is left to the user to decide how to treat the extra imputations.

If you are using Stata's **mi** commands, the **mi append** command can be used to append (stack) one dataset onto another. For
example, if we had one MI dataset containing information on students in public schools (named **public**) and another containing the same information on students from private schools
(named **private**), we could use **mi append** to stack them so that all of the cases are in a single
MI dataset. Below we open the dataset **public** and use **mi describe** to provide some
basic information on the dataset. Note that there are currently 168 cases in the dataset.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/public, clear
mi describe

Style:  flong
Obs.:   complete          92
        incomplete        76  (M = 5 imputations)
        ---------------------
        total            168
Vars.:  imputed:  7; female(15) read(7) write(16) math(15) science(15) pr_2(17) pr_3(17)
        passive: 0
        regular: 4; race ses schtyp socst
        system:  3; _mi_m _mi_id _mi_miss
       (there are 3 unregistered variables; id prog pr_1)

In general, we would want to use **mi describe** on both datasets, to be sure
each contains the same variables, etc. For this example, we have skipped this step to save space. Below we use the **mi append** command, followed by the
keyword **using** and the name of the dataset we wish to append (**private**), to add cases from
the dataset **private** to the current dataset (**public**). The **gen(**...**)** option generates a new variable named **dataset** in the combined data that denotes
which dataset each case came from (**public** vs. **private**). Then we use **mi
describe** to provide information about the combined dataset. Note that the
dataset now contains 200 cases.

mi append using http://www.ats.ucla.edu/stat/stata/seminars/missing_data/private, gen(dataset)
mi describe

Style:  flong
Obs.:   complete         117
        incomplete        83  (M = 5 imputations)
        ---------------------
        total            200
Vars.:  imputed:  7; female(18) read(9) write(17) math(15) science(16) pr_2(18) pr_3(18)
        passive: 0
        regular: 4; race ses schtyp socst
        system:  3; _mi_m _mi_id _mi_miss
       (there are 4 unregistered variables; id prog pr_1 dataset)

If you are using **mim**, the storage format allows you to use Stata's standard **append** command. The syntax for the
previous example using **append** is shown below.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_public, clear
append using http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_private, gen(dataset)

**Reshaping Data**

You may sometimes need to reshape MI datasets, that is, move from wide to long, or from long to wide. One reason for this might be that you have imputed longitudinal data. In this case, you might have started with a dataset that had multiple rows per subject ( i.e., long form), but reshaped to wide form, where each case has only one row, and separate variables for each observation in order to impute. Once you have imputed, you may want to move back to long form to analyze the data. Because of their structure, MI datasets require special commands for reshaping data.

If your data is **mi set**, you can use Stata's **mi reshape** command to reshape the data. In the example below we convert from wide to long.
The dataset, named **wide**, contains reading test scores at three time points (**read1,
read2**, and **read3**). The dataset also contains an id variable (i.e., **id**) and three variables that are measured only once per respondent (i.e., **female, ses**, and **schtyp**). First we open the
dataset, then we use **mi describe** to examine it.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/wide, clear
(highschool and beyond (200 cases))
mi describe

Style:  flong
        last mi update 6 seconds ago
Obs.:   complete         145
        incomplete        55  (M = 5 imputations)
        ---------------------
        total            200
Vars.:  imputed:  4; female(18) read1(9) read2(17) read3(15)
        passive: 0
        regular: 2; ses schtyp
        system:  3; _mi_m _mi_id _mi_miss
       (there is one unregistered variable; id)

The syntax for **mi reshape** is very similar to the syntax for Stata's standard **reshape** command. Below we use the command **mi reshape**, followed by the desired form (i.e., we wish
to convert from wide to **long**), followed by the variable name "stub" (i.e., **read**) for the variables that are repeated over time (i.e., **read1**-**read3**). The **i(**...**)** and **j(**...**)** "options" after the comma
are required. The variable name in **i(**...**)** is the id variable, and the variable name in **j(**...**)** is
the name of a new variable (**time**) created to index observations (i.e., time points).

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/wide, clear
mi reshape long read, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      200   ->   600
Number of variables                   7   ->   6
j variable (3 values)                     ->   time
xij variables:
                      read1 read2 read3   ->   read
-----------------------------------------------------------------------------

reshaping m=1 data ...
reshaping m=2 data ...
reshaping m=3 data ...
reshaping m=4 data ...
reshaping m=5 data ...
assembling results ...

To move from long to wide form, we use a similar syntax.
Below, with the data in long form, the keyword following the **mi reshape** command is **wide**, followed by the name of the
variable **read**, for which there are multiple observations per **id**. The **i(**...**)** and **j(**...**)** "options" are required, where **i(**...**)** gives the case id variable (in our case **id**)
and **j(**...**)** gives an existing variable that indexes observations (i.e., **time**).

mi reshape wide read, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      600   ->   200
Number of variables                   6   ->   7
j variable (3 values)              time   ->   (dropped)
xij variables:
                                   read   ->   read1 read2 read3
-----------------------------------------------------------------------------

reshaping m=1 data ...
reshaping m=2 data ...
reshaping m=3 data ...
reshaping m=4 data ...
reshaping m=5 data ...
assembling results ...
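The essential point of reshaping MI data is that the reshape happens within each imputation. A hypothetical pandas sketch of the wide-to-long direction, keeping the imputation index `_mj` as part of the row identifier:

```python
import pandas as pd

# Hypothetical wide-form MI data: read scores at three time points,
# with _mj indexing the imputation (0 = original, with a missing value).
wide = pd.DataFrame({
    "_mj":   [0, 1],
    "id":    [1, 1],
    "read1": [47.0, 47.0],
    "read2": [50.0, 50.0],
    "read3": [None, 53.5],
})

# Reshape wide -> long within each imputation; including _mj in the i() keys
# is the analogue of what `mi reshape long read, i(id) j(time)` manages for us.
long_df = pd.wide_to_long(wide, stubnames="read", i=["_mj", "id"], j="time").reset_index()
print(long_df)
```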

With **mim**, the reshape command is Stata's standard **reshape** command
preceded by the **mim** prefix. For example, the syntax below converts
the same dataset as above from wide to long.

mim: reshape long read, i(id) j(time)

The syntax to convert from long to wide is shown below.

mim: reshape wide read, i(id) j(time)

Because Stata has a very specific way of storing MI data, we need a way to
convert MI datasets not currently stored the way the **mi** commands expect into that format.
These might be datasets imputed by **ice**, but they might also be datasets imputed
by other packages. Before we can make Stata aware of the MI structure of the data, we need three things. First, we need a Stata-format dataset; if the dataset
is in some other format, it must be imported into Stata through one of the usual
methods (i.e., Stat/Transfer, the **insheet** command, etc.). Second, the dataset must contain the original (pre-imputation) data. If the imputed datasets were released without m=0, then the original data must be recreated before **mi import** can be used on the data. For more information on how to do this, see our FAQ: How can I use multiply imputed data where original data is not included?. Finally, we
need to know how the imputed data is structured, for example, whether it is in what Stata would call flong, mlong, etc. We also need to be aware of how the missing
values are indicated. If missing values in the unimputed data (i.e., m=0) are
user missing (e.g., .a), **mi import** will set the corresponding values in the imputed datasets to missing. To avoid this, replace all user missing values with system missing values, which are indicated by a period (".").
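The same recode-before-import idea applies in any tool: sentinel "user missing" codes should become true system missings first. A hypothetical pandas sketch (using -99 as a stand-in for a user-missing code such as Stata's .a):

```python
import pandas as pd
import numpy as np

# Hypothetical data using a user-missing sentinel code (-99),
# standing in for an extended missing value like Stata's .a.
df = pd.DataFrame({"read": [47.0, -99.0, 57.0], "write": [52.0, 59.0, -99.0]})

# Recode the sentinel to a true system missing before any import step,
# so downstream tools see ordinary missing values rather than codes.
df = df.replace(-99.0, np.nan)
print(df)
```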

In the following example, we have a Stata
dataset stored in what Stata would call the flong style, with the original data stored in m=0 and all missing values specified as ".", so we are ready to proceed. The
first line of syntax below opens the dataset. The next line shows the
command to make Stata aware of the MI structure of the dataset. The command is **mi import flong**, which is followed by a comma (",") indicating that whatever
follows is an "option." Both the **m(...)** option and the **id(...)** option are required. The variable listed in **m(...)** (in our
case **_mj**) identifies which imputation each case
belongs to (i.e., m=0, m=1, ...). The variable listed in the **id(...)** option gives the case id variable, in this case **_mi**; this variable
allows Stata to match cases across the imputations for data management purposes.
The **imputed(...)** option allows the user to list the imputed variables so
that Stata can register them; this isn't required, and imputed variables can always be registered later.
The **clear** option allows the command to replace the current data in memory
even if it has changed since the last time the data was saved (like the **clear** option for the **use** command).

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation, clear
mi import flong, m(_mj) id(_mi) imputed(female read write math science prog) clear
(83 m=0 obs. now marked as incomplete)

After the **mi import** command, we use **mi describe** to examine the dataset. Everything looks as we expect it to. Note that Stata has created three new system variables (**_mi_m, _mi_id**, and **_mi_miss**); these variables are used by Stata's **mi** commands to manage the data and should not be changed by the user. We might also want to run the command **mi varying** to make sure that the data imported properly (discussed above in the section headed Data Management with Imputed Datasets).

```
mi describe

Style:  flong
        last mi update 0 seconds ago

Obs.:   complete          117
        incomplete         83  (M = 5 imputations)
        ---------------------
        total             200

Vars.:  imputed:  6; female(18) read(9) write(17) math(15) science(16) prog(18)
        passive:  0
        regular:  0
        system:   3; _mi_m _mi_id _mi_miss

       (there are 26 unregistered variables)
```

The example above uses **mi import** on what Stata calls an flong dataset;
one can also import wide and flongsep datasets using the commands **mi import
wide** and **mi import flongsep** respectively. Stata also has two **mi
import** commands that make it easier to import two common types of imputed datasets. One, **mi import ice**, imports datasets in the form created by **ice** (shown below), and the other, **mi import nhanes1**, imports NHANES datasets. The syntax for these commands is similar, although not identical, to the syntax shown above.

Above, we used **mi import flong** to set up a dataset for use with Stata's **mi** commands. However, as just mentioned, there is an easier
way to import datasets produced by **ice**. The command for this is shown below. We don't need to
specify either **m(...)** or **id(...)** because Stata knows what **ice** names these variables. The **auto** option (after the comma) tells Stata that we want it to automatically determine which variables have been imputed. Use of the **auto** option is recommended by Stata.
After we import the data, we use **mi describe** to examine the dataset.

```
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation, clear

mi import ice, auto
(200 m=0 obs. now marked as incomplete)

mi describe

Style:  flong
        last mi update 0 seconds ago

Obs.:   complete            0
        incomplete        200  (M = 5 imputations)
        ---------------------
        total             200

Vars.:  imputed: 22; female(18) prog(18) read(9) write(17) math(15) science(16)
                 _Iprog_2(18) _Iprog_3(18) m_female(200) m_read(200) m_write(200)
                 m_math(200) m_science(200) m_socst(200) m_prog(200) m__Iprog_2(200)
                 m__Iprog_3(200) m__Irace_2(200) m__Irace_3(200) m__Irace_4(200)
                 m_ses(200) m_schtyp(200)
        passive:  0
        regular:  0
        system:   3; _mi_m _mi_id _mi_miss

       (there are 8 unregistered variables)
```

Looking at the output from **mi describe** we see that Stata has
identified 22 variables as imputed, which isn't correct. Stata has correctly identified
the six imputed variables as such, but it has also registered
a number of other variables as imputed. Two of the variables registered as imputed, **_Iprog_2** and **_Iprog_3**, were passively imputed by **ice** based on **prog**. This happens because Stata cannot distinguish between imputed variables and passive variables created by **ice**.
Statistically, it doesn't matter whether a variable is registered as imputed or
passive, so we can leave them, or, for data management/documentation purposes we can
reregister them as passive variables, which we do below using the command **mi
register passive**. Stata also incorrectly identified the imputation indicators
(i.e., the **m_** variables) as imputed variables. It did so because they are
all missing in m=0 but contain valid values in the imputations. Again, it's not
terribly important to unregister these variables, but we do so below using **mi unregister**. Then we rerun **mi describe** and see output more like what we
were expecting. Note that the number of complete cases has changed from 0 to 117,
because we unregistered the variables that were missing for all cases in m=0 (i.e., the **m_** variables). We could go on to register the regular variables, or not.

```
mi register passive _Iprog_2 _Iprog_3
(variables _Iprog_2 _Iprog_3 were registered as imputed, now registered as passive)

mi unregister m_*
(117 m=0 obs. now marked as complete)

mi describe

Style:  flong
        last mi update 0 seconds ago

Obs.:   complete          117
        incomplete         83  (M = 5 imputations)
        ---------------------
        total             200

Vars.:  imputed:  6; female(18) prog(18) read(9) write(17) math(15) science(16)
        passive:  2; _Iprog_2(18) _Iprog_3(18)
        regular:  0
        system:   3; _mi_m _mi_id _mi_miss

       (there are 22 unregistered variables)
```

If you plan to use **mim** to analyze an MI dataset, the data must first be in the format expected by **mim**. As discussed above, **mim** assumes that the dataset contains m complete copies of the data, one for each of the imputations (or, optionally, m+1 copies, to include the pre-imputation data). The imputation number is stored in the variable **_mj**, while a case id variable is in **_mi**. If you imputed using **ice**, the dataset will already be in the correct format; if not, utilities exist to help you create an appropriate dataset. If the dataset is in the format expected by Stata's **mi** commands, you can reformat it for **mim** using the **mi export ice** command. Below we open a dataset in Stata's **mi** format and export it to **ice**/**mim** format.

```
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear

mi export ice
```
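The stacked layout that **ice** and **mim** use can be pictured with a small sketch (not Stata code; the cases and values below are hypothetical, purely illustrative). Each copy of the data carries the imputation number in **_mj** (0 for the original data) and the case id in **_mi**:

```python
# Sketch of the stacked dataset layout mim/ice use: m+1 copies of the data,
# with the imputation number in _mj (0 = original data) and a case id in _mi.
# Toy data: two cases; read is missing for case 2 in the original data.
original = [{"_mi": 1, "read": 57.0},
            {"_mi": 2, "read": None}]      # None stands in for Stata's "."
imputations = [
    [{"_mi": 1, "read": 57.0}, {"_mi": 2, "read": 52.3}],  # m = 1 (imputed value hypothetical)
    [{"_mi": 1, "read": 57.0}, {"_mi": 2, "read": 49.8}],  # m = 2
]

# Stack the copies, tagging each row with its imputation number _mj
stacked = []
for mj, rows in enumerate([original] + imputations):
    for row in rows:
        stacked.append({"_mj": mj, **row})
```

Observed values are identical across the copies; only the cells that were missing in m=0 differ from one imputation to the next.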

If the MI datasets are stored in separate files, you can use the command **mimstack** (which is installed along with **mim**) to combine the files in
the format expected by **mim**. In order to use **mimstack**, the datasets to be combined must already be in Stata format (i.e., .dta files). In the example below, we have six datasets (5 imputations plus the pre-imputation data), named **imp0.dta** to **imp5.dta**. We first change the working directory to the directory where the imputed datasets are stored (**c:\data**) and then use **mimstack** to combine the datasets. The **m(...)** and **sortorder(...)** options are required; they give the number of imputations (**m(5)**) and the name of the case id variable (**sortorder(id)**). The **istub(...)** option gives the first part of the name of the imputed datasets (i.e., **imp** for **imp0.dta**, **imp1.dta**, etc.). If we did not have the pre-imputation data (i.e., **imp0.dta**), we would need to use the **nomj0** option. (Note this example will not run, because the datasets in question are probably not available in **c:\data** on your computer.)

```
cd c:\data

mimstack, m(5) sortorder(id) istub(imp)
```

As with data management, there are two options for analyzing MI data in Stata: Stata's **mi** commands, introduced in version 11, and the user-written package **mim**. Both can be used to estimate a number of models and to perform post-estimation tests. For a list of estimation commands currently supported by **mi estimate**, see the **mi estimation** help file by typing **help mi estimation** in the Stata command window. For information on the post-estimation commands for use with Stata's **mi** commands, see the **mi postestimation** help file by typing **help mi postestimation** in the Stata command window. For a list of estimation and postestimation commands supported by **mim**, see the package help by typing **help mim** in the Stata command window.

Before we estimate any models, let's briefly review how MI estimates are calculated. The MI estimate of a parameter, for example, a regression coefficient or a mean, is the average of the estimates across the m imputations. The MI estimate of the standard error of a parameter is somewhat more complicated. The variance (i.e., s.e.^2) is composed of two parts, the between variance and the within variance. The within variance is the average of the squared standard errors across the m imputations. The between variance is the variance of the coefficient estimates themselves across the m imputations. The MI estimate of the variance is the sum of the within and between variance estimates, with an adjustment for the number of imputations. The MI estimate of the standard error is the square root of the variance. Including the between imputation variance allows us to account for the added uncertainty that arises because some of the values in the dataset were imputed rather than observed. While none of these calculations is particularly difficult, it would be tedious to run a model in each dataset and then combine the estimates by hand. Fortunately, **mi estimate** and **mim** automate this process by estimating the specified model in each of the MI datasets and then combining the results to produce the MI estimates, which are displayed for the user.
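The combining rules just described (often called Rubin's rules) amount to a few lines of arithmetic. The sketch below uses hypothetical per-imputation coefficients and standard errors, purely to illustrate what **mi estimate** and **mim** do behind the scenes (both also apply degrees-of-freedom adjustments not shown here):

```python
# Sketch of Rubin's rules for pooling MI estimates (illustrative, not Stata's code).
# Suppose a coefficient was estimated in m = 5 imputed datasets:
import math

coefs = [0.21, 0.19, 0.23, 0.20, 0.18]       # hypothetical per-imputation estimates
ses   = [0.075, 0.080, 0.072, 0.078, 0.076]  # hypothetical per-imputation std. errors
m = len(coefs)

# MI point estimate: the average of the m estimates
q_bar = sum(coefs) / m

# Within-imputation variance: average of the squared standard errors
w = sum(se**2 for se in ses) / m

# Between-imputation variance: variance of the estimates across imputations
b = sum((q - q_bar)**2 for q in coefs) / (m - 1)

# Total variance: within + between, with the (1 + 1/m) adjustment
t = w + (1 + 1/m) * b

# MI standard error: square root of the total variance
mi_se = math.sqrt(t)
```

Note that the between component inflates the pooled standard error relative to any single imputation's standard error, reflecting the uncertainty due to the missing values.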

Below we use Stata's **mi estimate** command to estimate a regression
model using **write**, **read**, **math** and **ses** to predict **science** with MI data. The first line of syntax below opens the MI dataset, **mvn_imputation**.
The second line of syntax begins with **mi estimate** indicating that the following
command should be executed on the MI datasets and the coefficients and standard errors
combined. After the **mi estimate** prefix, the basic syntax to run a regression model is identical
to the syntax without MI data. The command name, **regress**, is followed by the outcome (**science**) and then the
list of predictor variables. The **i.** preceding the variable **ses** (forming **i.ses**) in the list of predictors indicates that the
variable **ses** should be included in the model as a series of dummy
variables. (Note that this syntax is new to Stata 11; type **help factor variables** in the
Stata command window for more information.)

```
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear

mi estimate: regress science write read math i.ses

Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.1549
                                                  Complete DF     =        194
DF adjustment:   Small sample                     DF:     min     =      62.21
                                                          avg     =     107.02
                                                          max     =     154.10
Model F test:       Equal FMI                     F(   5,  142.5) =      32.34
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |    .204958   .0760487     2.70   0.008     .0545139    .3554021
        read |   .3258264   .0775502     4.20   0.000     .1708615    .4807913
        math |   .2692184   .0848079     3.17   0.002     .0997013    .4387356
             |
         ses |
          2  |   1.860882   1.313735     1.42   0.159    -.7343733    4.456137
          3  |   1.863389   1.519993     1.23   0.222    -1.142558    4.869336
             |
       _cons |   8.197699   3.570605     2.30   0.024     1.110246    15.28515
------------------------------------------------------------------------------
```

In addition to the information normally included in regression output (e.g.,
the number of observations included in the analysis) the output for **mi
estimate** includes some information specific to MI analysis. The number
of imputations used is given, along with the average RVI. The RVI, or relative
variance increase, is the increase in parameter variance (i.e., s.e.^2) due to missing values. Each parameter estimated has its own RVI; the default regression output gives the average of these estimates. The output also gives the DF adjustment used in calculating the degrees
of freedom for both the model and individual parameters. By default the small
sample adjustment is used. Also by default, the overall F test for the model is
performed using a test that assumes an equal fraction of missing information for
all coefficients. This is indicated in the output as Equal FMI. The regression
table in the output gives the MI estimates of the coefficients and their
standard errors.

It is possible to get the RVI, as well as related values, individually for each coefficient. Below we rerun the regression model, this time using the **vartable** option for **mi estimate**. Note that
when specifying options for **mi estimate**, the command name is followed by a comma (","), the desired option (e.g., **vartable**), and then a colon (":").

```
mi estimate, vartable: regress science write read math i.ses

Multiple-imputation estimates                     Imputations     =          5

Variance information
------------------------------------------------------------------------------
             |        Imputation variance                          Relative
             |     Within    Between      Total       RVI      FMI  efficiency
-------------+----------------------------------------------------------------
       write |    .005275    .000424    .005783   .096409  .091437    .982041
        read |    .004849    .000971    .006014   .240226  .208406    .959987
        math |    .005783    .001174    .007192   .243684   .21094     .95952
             |
         ses |
          2  |       1.62    .088249     1.7259    .06537  .063121    .987533
          3  |    2.12005    .158607    2.31038   .089775  .085478    .983192
             |
       _cons |    11.0585    1.40893    12.7492   .152888  .140141    .972736
------------------------------------------------------------------------------

<output omitted>
```

With the addition of the **vartable** option, the **mi estimate** output now begins with a
table outlining the variance estimation for each coefficient in the model. The
first column lists the predictor
variables and intercept, each of which is associated with a regression coefficient. The second
column gives the within imputation variance ( i.e., the average of the estimated variances across the m imputations). The third
column gives the between imputation variance ( i.e., the variance in
coefficient estimates across the m imputed datasets). The fourth column contains
the total variance, which is the sum of the within and between variance with an
adjustment for the number of imputations. The fifth column gives the relative
variance increase, or RVI; this is the between variance (with an adjustment for
the number of imputations) divided by the within variance. This gives a sense of
how much the variance in coefficient estimates increased due to missing values.
The fraction of missing information, or FMI, is given in the second to last
column. This is a measure of the proportion of information lost due to
non-response for a specific coefficient. The FMI is important because some of
the hypothesis tests commonly used with MI analyses assume that the FMI is
equal across coefficients, which may or may not be a tenable assumption. If this
assumption does not seem appropriate, tests that do not make this assumption are
available, but they require much larger values of m. The final column in the
table gives the relative efficiency; this is a measure that compares the
estimate of the variance with the current value of m to the variance with an
infinite number of imputations. As the number of imputations increases, this
value will approach 1. The table of variance
information is followed by regression output identical to that produced
without the **vartable** option (this output has been omitted).
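As a rough illustration, these diagnostics can be recomputed from the within (W) and between (B) variances; the sketch below uses the values the variance table reports for **read**. Note that it uses the simple large-sample approximation FMI ≈ RVI/(1+RVI), whereas Stata's reported FMI includes a degrees-of-freedom correction, so the last two quantities will differ slightly from the table.

```python
# Recomputing mi estimate's variance diagnostics from the within (W) and
# between (B) variances; W and B are the values reported for read (m = 5).
w, b, m = 0.004849, 0.000971, 5

t = w + (1 + 1/m) * b    # total variance (matches the "Total" column)
rvi = (1 + 1/m) * b / w  # relative variance increase
fmi = rvi / (1 + rvi)    # large-sample approximation to the FMI
re = 1 / (1 + fmi / m)   # relative efficiency with m imputations
```

With more imputations, `fmi / m` shrinks, so the relative efficiency climbs toward 1, which is the sense in which adding imputations buys precision.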

In MI analysis, the standard post-estimation tests, such as Wald tests (e.g., the **test** command) and
likelihood ratio tests, are generally not valid. For some of these tests, similar tests are available for use
with MI data. Stata has implemented some of the applicable tests as part of the **mi** commands.
For example, we can test to see whether multiple coefficients are simultaneously
equal to 0 using the **mi test** command. This test is often used to test for an overall effect of
a categorical variable represented by a series of dummy variables, or more generally to test for differences between nested models. Below
we use **mi test** to test whether the overall effect of **ses** is statistically
significant. The command name, **mi test**, is followed by **2.ses** and **3.ses**, which refer to the coefficients in the model associated with the
second and third level of **ses** respectively. The note at the top of the
output indicates that by default, the F test performed assumes equal fractions
of missing information. Below that the parameters being tested are listed
followed by the results of the F test.

```
mi test 2.ses 3.ses

note: assuming equal fractions of missing information

 ( 1)  2.ses = 0
 ( 2)  3.ses = 0

       F(  2,   64.4) =    1.00
            Prob > F =    0.3748
```

We can also test linear combinations of parameters, the type of test performed using the command **lincom** after some non-MI estimation commands. For example, we might want
to test that the coefficients for **read** and **math** are the same ( i.e., **read** = **math**), which we can do by testing whether the difference between the two coefficients is equal to 0 ( i.e., **read**-**math**=0). In order to perform this
type of test, Stata needs to store information about the results from each
imputed dataset. The **saving(...)** option of the **mi estimate** command
is used to save the necessary information. Below we rerun the regression from above,
this time adding a comma (",") after **mi estimate** and including the **saving(...)** option. The text inside the **saving(...)** option (i.e., **myresults**)
assigns a name to our stored results so that we can recall them later. The
output from this command is identical to the output from the previous regression
so it has been omitted. Next we use the **mi estimate** command to estimate
the difference between the coefficients for **read** and **math**. The command name ( i.e., **mi estimate**) is followed by the expression we want to
test in parentheses ( "**(**" and "**)**" ). As in many other Stata
estimation commands, after **mi estimate**,
the coefficients can be referred to by **_b** followed by the variable name
in brackets, in this case, **_b[read]** and **_b[math]** refer to the coefficients
for **read** and **math** respectively. The expression to be tested is
followed by the keyword **using** and the name of the estimation results to be used (i.e., **myresults**). After the comma, the **nocoef** and **noheader** options suppress the display of the output from the original
model; if we omit these options, the header and coefficient table from the
original **mi estimate** command associated with **myresults** will be
shown.

```
mi estimate, saving(myresults): regress science write read math i.ses
<output omitted>

mi estimate (_b[read] - _b[math]) using myresults, nocoef noheader

command: regress science write read math i.ses

 _mi_1: _b[read] - _b[math]

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _mi_1 |   .0566079   .1401431     0.40   0.688    -.2252566    .3384725
------------------------------------------------------------------------------
```

At the top of the output is the command associated with **myresults** (i.e., the command after the **mi estimate** prefix); this is followed by a listing of
the parameters estimated. In this case we have specified only one expression
within parentheses, but multiple expressions can be included, each within its
own set of parentheses. The table shows the estimate of the difference between
the coefficients for **read** and **math** under "Coef." (i.e., 0.057), followed by
the standard error (0.14), t-value, p-value, and confidence interval. The
results indicate that there is no statistically significant difference between
the coefficients for **read** and **math**. Note that this is a test of a single coefficient (i.e., a one degree of freedom test).

We can also test multiple comparisons. For example, we might want to test that the
coefficients for **read**, **math**, and **write** are all equal (i.e., **read**=**math**=**write**).
This can be done by testing that **read**=**math** (i.e., **read**-**math**=0) and **read**=**write** (i.e., **read**-**write**=0). To test this
hypothesis, we first estimate the differences between **read** and **math**, and between **read**
and **write**, separately using **mi estimate**, and then use another command, **mi testtransform**, to test that the coefficient estimates from **mi estimate** are
simultaneously equal to 0. In the first line of syntax below, the **mi estimate** command contains two expressions in parentheses. We assign each of the estimates
(**read**-**math** and **read**-**write**) a name by typing a label (i.e., **t1** and **t2**); the
label is listed after the open parenthesis ("**(**") and followed by a
colon (":"). The two expressions are listed in the table of output by their
assigned labels. Below that we use the command **mi testtransform**, followed
by the labels of the parameters we want to test simultaneously (i.e., **t1 t2**).

```
mi estimate (t1: _b[read] - _b[math]) (t2: _b[read] - _b[write]) using myresults, nocoef noheader

command: regress science write read math i.ses

    t1: _b[read] - _b[math]
    t2: _b[read] - _b[write]

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0566079   .1401431     0.40   0.688    -.2252566    .3384725
          t2 |   .1208683   .1294069     0.93   0.354     -.137425    .3791617
------------------------------------------------------------------------------

mi testtransform t1 t2

note: assuming equal fractions of missing information

    t1: _b[read] - _b[math]
    t2: _b[read] - _b[write]

 ( 1)  t1 = 0
 ( 2)  t2 = 0

       F(  2,   61.5) =    0.45
            Prob > F =    0.6418
```

The first part of the output notes that this test assumes equal FMI, which is
the default for F tests performed by **mi estimate**. Next, the two estimates being tested are listed, followed by the specific
hypotheses being tested ( i.e., t1=0 and t2=0), and then the F test and p-value
for the test. This two degree of freedom test does not find any significant
differences between the coefficients for **read**, **write**, and **math**.

It is a good idea to test the sensitivity of the results to both the number of imputations
used and the specific imputations used. If the results of an analysis change substantially
depending on the number of imputations used, or the specific subset of imputations used,
this may suggest that there is a problem in the imputation model. If the analysis seems
particularly sensitive to the number of imputations, you may want to increase the number of
imputations used. The **mi estimate** command makes it relatively easy to perform this type of
sensitivity analysis. Rather than using **mi estimate** to reestimate the
entire model on different subsets of the imputations, we can estimate the model
once for all imputations, saving the results in the same manner as above, and
then use the **mi estimate using** command to recombine the saved results in
various ways. Below we use **mi estimate** to run a regression model using **read** and **write** to predict **socst** and save the results
as **myresults2**.

```
mi estimate, saving(myresults2): regress socst read write

Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.0344
                                                  Complete DF     =        197
DF adjustment:   Small sample                     DF:     min     =     153.03
                                                          avg     =     165.98
                                                          max     =     186.00
Model F test:       Equal FMI                     F(   2,  163.8) =      79.36
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .4179087   .0717291     5.83   0.000     .2762435     .559574
       write |    .412377   .0785752     5.25   0.000     .2571447    .5676093
       _cons |   8.773376   3.493966     2.51   0.013     1.880479    15.66627
------------------------------------------------------------------------------
```

The above model uses all 5 of the imputations to estimate the model. Below we use **mi estimate using** to reestimate the MI coefficients using only 3 of the 5 imputations. The **mi estimate** command is followed
by the keyword **using** and the name of the saved results to be used (i.e., **myresults2**). The **nimputations(#)** option
after the comma instructs Stata to reestimate the MI coefficients using only the first # imputations, in this case 3.

```
mi estimate using myresults2, nimputations(3)

Multiple-imputation estimates                     Imputations     =          3
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.0146
                                                  Complete DF     =        197
DF adjustment:   Small sample                     DF:     min     =     173.94
                                                          avg     =     183.85
                                                          max     =     191.31
Model F test:       Equal FMI                     F(   2,  277.7) =      82.77
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .4162061   .0704377     5.91   0.000     .2771834    .5552287
       write |   .4200142   .0761381     5.52   0.000     .2698106    .5702179
       _cons |   8.457841   3.468331     2.44   0.016     1.616759    15.29892
------------------------------------------------------------------------------
```

In this output, the number of imputations used is listed as 3, which is what
we requested. The coefficients in this output are very similar to the MI estimates
using all 5 imputations, suggesting that the model is not particularly sensitive to the number of
imputations. Note that in many applications one may have more than 5 imputations available,
and one might want to try different numbers of imputations when examining the sensitivity of the model to the number of imputations. Also note that 3 is the minimum number
of imputations from which MI estimates can be calculated. We can also examine differences in the MI estimates
depending on which imputations are used. Below we use the **mi estimate using** command with the **imputations(...)**
option, where the **imputations(...)** option is used to specify which imputations should be used.
In this case we calculate the MI estimates using imputations number 2, 3, 4, and 5.

```
mi estimate using myresults2, imputations(2 3 4 5)

Multiple-imputation estimates                     Imputations     =          4
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.0438
                                                  Complete DF     =        197
DF adjustment:   Small sample                     DF:     min     =     122.88
                                                          avg     =     147.12
                                                          max     =     180.48
Model F test:       Equal FMI                     F(   2,  111.8) =      78.46
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .4157905   .0725943     5.73   0.000     .2722496    .5593314
       write |    .412966   .0797544     5.18   0.000     .2550955    .5708366
       _cons |   8.844287   3.494272     2.53   0.012     1.949405    15.73917
------------------------------------------------------------------------------
```

The output indicates that the number of imputations used is 4, which is what we requested. And again the estimates are similar to the previous estimates, suggesting that our model is not particularly sensitive to which imputations are used. Note that in many applications one might want to try different numbers and subsets of imputations, rather than a single subset.
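The recombination that **mi estimate using** performs can be sketched as follows: the model is not re-fit; instead, stored per-imputation estimates are pooled (via Rubin's rules) over whichever subset of imputations is requested. All numeric values below are hypothetical, purely to illustrate the idea.

```python
# Sketch of recombining stored per-imputation results over subsets of the
# imputations, as mi estimate using does (illustrative, not Stata's code).
import math

def pool(coefs, ses):
    """Rubin's rules point estimate and standard error."""
    m = len(coefs)
    q_bar = sum(coefs) / m
    w = sum(se**2 for se in ses) / m                      # within variance
    b = sum((q - q_bar)**2 for q in coefs) / (m - 1)      # between variance
    return q_bar, math.sqrt(w + (1 + 1/m) * b)            # pooled est., s.e.

# Hypothetical stored estimates for one coefficient, imputations m = 1..5
coefs = [0.42, 0.41, 0.43, 0.40, 0.42]
ses   = [0.071, 0.070, 0.072, 0.073, 0.071]

all5   = pool(coefs, ses)          # all 5 imputations
first3 = pool(coefs[:3], ses[:3])  # like nimputations(3)
subset = pool(coefs[1:], ses[1:])  # like imputations(2 3 4 5)
```

If the three pooled estimates differ substantially, that is the warning sign the sensitivity analysis is looking for.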

Alternatively, we can use **mim** to analyze the MI data. As mentioned above, **mim** assumes the data are in the structure produced by **ice**, that is, each of the imputations is stacked with the others in a single dataset, and the imputation number and case id are stored in **_mj** and **_mi** respectively. The
first line of syntax below uses **mi export ice** to convert the dataset in memory (**mvn_imputation**) from the structure used by Stata's **mi** commands to the structure used by **mim**. Next we specify the
second regression model we estimated above, predicting **socst** with **read** and **write**, this time using **mim**. The second line of syntax below
uses the **mim** prefix,
followed by the command name (**regress**), the outcome variable (**socst**)
and then the list of predictor variables.

```
mi export ice

mim: regress socst read write

Multiple-imputation estimates (regress)           Imputations =          5
Linear regression                                 Minimum obs =        200
                                                  Minimum dof =      153.0

------------------------------------------------------------------------------
       socst |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]     FMI
-------------+----------------------------------------------------------------
        read |   .417909    .071729   5.83   0.000    .276244   .559574  0.060
       write |   .412377    .078575   5.25   0.000    .257145   .567609  0.067
       _cons |   8.77338    3.49397   2.51   0.013    1.88048   15.6663  0.023
------------------------------------------------------------------------------
```

The output from **mim** is somewhat less detailed than the output from
the **mi estimate** command, although it does list the number of imputations, number of observations and the degrees of freedom.

If the model contains categorical variables, such as the first model we ran
above,
which used **write**, **read**, **math**, and **ses** (which is categorical) to
predict **science**, we can use the **xi** prefix, along with the **mim** prefix to estimate
the model without creating dummy variables by hand. Below we estimate this
model. Note that the **xi** prefix comes before the **mim** prefix, and
that an **i.** precedes the variable **ses** (i.e., **i.ses**).

```
xi: mim: regress science write read math i.ses

i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)

Multiple-imputation estimates (regress)           Imputations =          5
Linear regression                                 Minimum obs =        200
                                                  Minimum dof =       62.2

------------------------------------------------------------------------------
     science |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]     FMI
-------------+----------------------------------------------------------------
       write |   .204958    .076049   2.70   0.008    .054514   .355402  0.091
        read |   .325826     .07755   4.20   0.000    .170861   .480791  0.208
        math |   .269218    .084808   3.17   0.002    .099701   .438736  0.211
     _Ises_2 |   1.86088    1.31374   1.42   0.159   -.734373   4.45614  0.063
     _Ises_3 |   1.86339    1.51999   1.23   0.222   -1.14256   4.86934  0.085
       _cons |    8.1977    3.57061   2.30   0.024    1.11025   15.2852  0.140
------------------------------------------------------------------------------
```

We can test the hypothesis that both the coefficients for **ses**=2
(denoted **_Ises_2**), and **ses**=3 (**_Ises_3**) are
equal to 0. To do this we start with the **mim** prefix, followed by the
command **testparm** (which is a limited version of the **test** command)
followed by a list of the parameters we wish to test. Similar to both the **test** and **mi test** commands, the output for **mim: testparm** lists
the hypotheses being tested followed by the F test and p-value. While the tests
are similar, the **mim: testparm** and **mi test** commands are not
implemented identically. For users with access to both sets of commands, this
can be an advantage, because you can examine whether the results are sensitive
to the specific implementation of the test. In this case, the difference was very small, a p-value of 0.3728 using **mim** and 0.3748 using **mi test**. For more information about the differences see
the documentation for **mi test** and **mim**.

```
mim: testparm _Ises_2 _Ises_3

 ( 1)  _Ises_2 = 0
 ( 2)  _Ises_3 = 0

       F(  2,  101.7) =    1.00
            Prob > F =    0.3728
```

Linear combinations of parameters can also be examined. Below we once again test whether the coefficients
for **read** and **math** are equal (**read**=**math**) by testing whether their difference is equal to 0
(**read**-**math**=0). Below, the **mim** prefix is followed by the command name (i.e., **lincom**) and the expression we wish to test (i.e., **read-math**).

```
mim: lincom read-math

Multiple-imputation estimates for lincom          Imputations =          5

 ( 1)  read - math = 0

------------------------------------------------------------------------------
     science |    Coeff.  Std. Err.     t    P>|t|    [95% Conf. Int.]     FMI
-------------+----------------------------------------------------------------
         (1) |   .056608    .140143   0.40   0.688   -.223338   .336554  0.314
------------------------------------------------------------------------------
```

The output gives the estimate of the coefficient, its standard error, t-value, p-value, confidence
interval and the fraction of missing information for this parameter. As before, the difference between the coefficients for **read** and **math** is not statistically significant.

For various "special" types of data, Stata allows you to inform it of the
data structure in advance, so that commands utilizing that structure can be
run more easily. For example, when working with survey data, you can **svyset** the data; when working with cross-sectional time-series data, you can **xtset** the data. If you xxxset the data before you **mi set** it,
then Stata will "remember" the setting. If you **mi set** the data before you set the other structure, the usual commands (e.g., **stset**, **tsset**) will no longer work. Instead you need
to use **mi xxxset** to set the data. Other than adding an **mi** before
the set command, the syntax remains identical.

On the first line of syntax below we **mi xtset** data with the level 2
units defined by the variable **group**. On the second line of syntax we use **mi estimate** to estimate a random intercept model, with **y** regressed on **x1** and **x2**. Because **xtreg** is not currently supported by **mi estimate**,
we have used the **cmdok** option to force **mi estimate** to estimate
the model anyway. This may or may not be a good idea, depending on the
estimation command in question. When using the **cmdok** option, it is the
user's responsibility to ensure that the estimation procedure in question is
appropriate for use with **mi estimate**. Additionally, the
output produced by **mi estimate** when the **cmdok** option is used may or may
not be well formatted. Note that an estimation command not currently supported
by **mi estimate** is not necessarily inappropriate for use with MI data,
hence the presence of the **cmdok** option. An interesting discussion of the various reasons commands may not currently be supported by **mi estimate**, written by Yulia Marchenko of Stata, appeared on Statalist; it can be found here.

mi xtset group

mi estimate, cmdok: xtreg y x1 x2

Similarly, the first line of syntax below svysets the MI data. The second line
uses **mi estimate** followed by the **svy:** prefix to estimate the mean
of the variable **y**. Estimation commands that are supported by **mi estimate** more generally (i.e., in non-survey data), for example **mean** and **regress**, are also available
using **mi estimate: svy:**.

mi svyset su [pweight=pw], strata(s)

mi estimate: svy: mean y

If you are using **mim** then you can set the data as one usually would, and
then run the appropriate commands with the **mim** prefix. For example

xtset group

mim: xtreg y x1 x2

and

svyset su [pweight=pw], strata(s)

mim: svy: mean y

In the first seminar, we reviewed some of the basic concepts in multiple imputation, as well as some of the major options available in the imputation process. We also demonstrated tools for examining patterns of missing values and Stata's **mi impute** command, introduced in version 11. In this seminar, we introduced the user-written command **ice** to create MI datasets, and demonstrated the use of both Stata's **mi** commands and the user-written package **mim** to perform data management and analysis with MI datasets. For the most part, these commands are easy to use, so it is easy to forget that the statistical procedures they implement are often complex; when implementing MI in your own research, you will want to proceed carefully. The quality of the analysis model, and hence the final conclusions of the research, is influenced by the quality of the imputation model, so it is important to read the MI literature carefully and consider all of the options available. In the end, you may also want to try several different strategies, to be sure that your results do not depend on the particular imputation procedures, tests, or datasets used.

Graham, John W., Olchowski, Allison E., and Gilreath, Tamika D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science 8:206-213.

Little, Roderick J. A., and Rubin, Donald B. (2002). Statistical Analysis with Missing Data, Second Edition. Hoboken, New Jersey: Wiley-Interscience.

McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya, and Figueredo, Aurelio Jose (2007). Missing Data: A Gentle Introduction. New York, New York: The Guilford Press.

Molenberghs, Geert, and Kenward, Michael G. (2007). Missing Data in Clinical Studies. Chichester, West Sussex: John Wiley & Sons Ltd.

Potthoff, Richard F., Tudor, Gail E., Pieper, Karen S., and Hasselblad, Vic (2006). Can one assess whether missing data are missing at random in medical studies? Statistical Methods in Medical Research 15:213-234.

van Buuren, S., Boshuizen, H. C., and Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694.

von Hippel, Paul T. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology 37.

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.