Statistical Computing Seminars
Multiple Imputation in Stata, Part 2

Imputation by chained equations

Although both have the same goal, and the end product can be very similar, imputation by chained equations (ICE) and the multivariate normal approach to imputation are somewhat different. In the multivariate normal model, the imputations come from a single multivariate distribution; in simple terms, information from all variables is used to impute all other variables based on a single model. In the ICE approach, the imputed values are generated from a series of univariate models, in each of which a single variable is imputed based on a group of other variables. As we discussed in the first seminar, each approach has advantages and disadvantages. One advantage of the ICE approach is that it does not assume a multivariate normal distribution, so it can easily be used to impute a variety of different types of variables (e.g., categorical variables and counts). This is less of an advantage when imputing predictor variables, but can be useful when imputing outcome variables, or when imputing for an entire dataset where one may not know which variables will be used as predictors or outcomes. A second advantage of the ICE approach is that, because it estimates a series of univariate models, it can sometimes accommodate larger imputation models than the multivariate normal approach. One disadvantage of the ICE approach is that, in comparison to the multivariate normal model, it lacks strong theoretical underpinnings. An additional disadvantage is that specifying ICE models can be tedious, especially when the imputation model is large.

For this example, we have decided to impute missing values for all variables in the dataset as though we did not have a specific analysis model in mind. Because the example dataset contains so few variables, the resulting imputation model is not substantially different from the one we would build for a specific analysis model that included a few so-called auxiliary variables. Below we show the syntax to create multiply imputed datasets using ice.

The first line of syntax below opens the dataset we wish to impute. The second command (on the second and third lines) runs the imputation model. The command name (ice) is followed by the list of variables to be included in the imputation model. The variable list can include variables both with and without missing values; ice will automatically determine which variables have missing values and impute them. The variable prog is listed with the m. prefix (i.e., m.prog), which indicates to ice that the variable is categorical, and that we wish to use multinomial logistic regression (mlogit) to model it. If we wished to treat the variable as ordinal, we could use the o. prefix. The variable race appears with the i. prefix (i.e., i.race), which indicates to ice that the model should include dummy variables to represent the categories of race. The i. prefix is used only with variables that are not missing any values.

The list of variables is followed by a comma (","), which separates the variable list from the options. The comma is followed by three slashes ("///"), a comment code that tells Stata to ignore anything after the slashes and continue reading on the next line as though there were no line break. This isn't necessary, but it does make the syntax easier to read and organize. Note that there needs to be a space between the comma and the slash marks.

The options start on the next line. The gen(...) option generates imputation indicators for the imputed variables. For example, because we have specified m_ in the gen(...) option, for the variable female ice creates the indicator m_female, which is equal to 1 when female was imputed and 0 otherwise. The next option, saving(...), gives ice the name of the new Stata data file for the imputed dataset; our new dataset will be called ice_imputation.dta. Next we have used the m(...) option to specify the number of imputations, in this case 5. In many applications we might want to create more imputations, but we may wish to run a model with a small number of imputations and examine those before creating a large number of imputations. Finally, we use the seed(...) option to set the random number seed so that we can recreate the results of the imputation model. Setting the seed is necessary to obtain the same results because MI contains a random component.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
ice female read write math science socst m.prog i.race ses schtyp, ///
	gen(m_) saving(ice_imputation) m(5) seed(4324)

The output from ice is rather long, so we have split it into two blocks to discuss it. Below we show the first part of the output from the ice command above. The output begins by showing what the ice command would look like if we had used the alternative syntax, which involves the use of the xi: prefix, as well as the cmd(...) and substitute(...) options. We can use this output to confirm that we have specified the model we intended to specify. The xi: prefix, among other things, allows the user to create dummy variables to represent categorical variables in a model (see help xi for more information on the xi: prefix). The cmd(...) option specifies the type of model that should be used to impute a given variable (e.g., mlogit, ologit). Because we specified m.prog, mlogit will be used to impute prog. The substitute(...) option is used to specify which dummy variables should be used to represent variables that are being imputed when they are used to predict other variables. In this case, we are using the dummy variables created by the xi: prefix (represented as i.prog in the output) in place of prog when predicting other variables. The next portion of output shows the names of the variables used to represent prog and race (i.e., _Iprog_1-3 and _Irace_1-4 respectively). Below that is a table that shows the number of cases missing various numbers of variables. We see that 117 cases are missing no values, 73 are missing only one value, and 10 are missing two values.

=> xi: ice female read write math science socst prog i.prog i.race ses schtyp, cmd(prog:mlo
> git) substitute(prog:i.prog) gen(m_) saving(ice_imputation, replace) m(5) seed(4324)

i.prog            _Iprog_1-3          (naturally coded; _Iprog_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        117       58.50       58.50
          1 |         73       36.50       95.00
          2 |         10        5.00      100.00
------------+-----------------------------------
      Total |        200      100.00

The second part of the output from ice is shown below. It consists of a table that gives the name of each variable in the variable list, the command used to impute it, and the variables used to impute its values. For example, this table tells us that the variable female will be imputed using the logit command, which is appropriate since female is a binary variable. By default, ice will use logit to impute variables that take on two values, mlogit to impute variables that take on 3-5 values, and regress to impute variables that take on more than 5 values. The default can be overridden for some or all variables using the cmd(...) option. In addition to logit, mlogit, and regress, other types of models (e.g., ologit) can be specified as the user deems appropriate. Variables with no missing values are listed, along with a message that the variable does not contain missing values. In this case, the table shows us that all variables in the imputation model were used to impute missing values on each variable. Because ice commands are often somewhat long and complicated, it is important to check this table carefully to be sure the model is what you intended to specify. Finally, the output tells us that ice is imputing (with progress dots), and then that the imputed datasets have been saved.

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | read write math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
       read | regress | female write math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
      write | regress | female read math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
       math | regress | female read write science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
    science | regress | female read write math socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
      socst |         | [No missing data in estimation sample]
       prog | mlogit  | female read write math science socst _Irace_2 _Irace_3
            |         | _Irace_4 ses schtyp
   _Iprog_2 |         | [Passively imputed from (prog==2)]
   _Iprog_3 |         | [Passively imputed from (prog==3)]
   _Irace_2 |         | [No missing data in estimation sample]
   _Irace_3 |         | [No missing data in estimation sample]
   _Irace_4 |         | [No missing data in estimation sample]
        ses |         | [No missing data in estimation sample]
     schtyp |         | [No missing data in estimation sample]
------------------------------------------------------------------------------

Imputing ..........1..........2..........3..........4..........5
file ice_imputation.dta saved
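
Because ice picks its default command from the number of distinct values a variable takes, it can be useful to check those counts, along with the amount of missing data, before running the imputation. The following is a minimal sketch rather than part of the original seminar. It uses the same example dataset and variable names as the ice command above, and assumes Stata 11 or later, since misstable is not available in earlier versions.

```stata
* Sketch: inspect the data before imputing (not part of the original seminar)
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear

* Observations and number of unique values for each variable;
* this helps anticipate whether ice will default to logit, mlogit, or regress
codebook female read write math science socst prog race ses schtyp, compact

* Number of missing values for each variable (Stata 11 or later)
misstable summarize female read write math science socst prog race ses schtyp

* Which variables tend to be missing together
misstable patterns female read write math science socst prog race ses schtyp
```

If the default command for a variable is not what you want, the m. and o. prefixes or the cmd(...) option described above can be used to change it.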

The imputation model above contained 10 variables, two of which were represented by dummy variables, meaning that at most there were 12 variables available as predictors for any variable with missing values. Because the imputation model was relatively small, we were able to predict each variable with missing values using all other variables in the imputation model. However, in larger imputation models, it often isn't feasible to use all other variables in the imputation model to impute a given variable, because the univariate regression models simply become too large to estimate. For example, in an imputation model with 30 variables, it is unlikely that all 29 other variables could be used in the prediction equation. If you attempt to run such a model, you may encounter error messages similar to the one below, where we have tried to impute missing values for var1-var30.

ice var1-var30, gen(m_) saving(error_imputation) m(5)

<output omitted>

Error #430 encountered while running -uvis-
I detected a problem with running uvis with command mlogit on response var11
and covariates var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var12 var13 var14 var15 v
> ar16 var17 var18 var19 var20 var21 var22 var23 var24 var25 var26 var27 var28 var29 var30.

The offending command resembled:
uvis mlogit var11 var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var12 var13 var14 var1
> 5 var16 var17 var18 var19 var20 var21 var22 var23 var24 var25 var26 var27 var28 var29 var
> 30 , gen([imputed]) 

With mlogit, try combining categories of var11, or if appropriate, use ologit

you may wish to try the -persist- option to persist beyond this error.
dumping current data to ./_ice_dump.dta
convergence not achieved
r(430);

What this error message tells us is that ice encountered a problem when it attempted to impute var11 using all 29 other variables in the model. More specifically, ice was attempting to use mlogit to predict var11 but encountered an error. Note that depending on the specific problem encountered, ice may not produce this exact error message, but other error messages caused by problems related to an excess of predictors are likely to be similar. The third paragraph of the error message suggests that we try combining categories of var11 (which would help if var11 had a small number of cases in some categories), or imputing using ologit, if var11 is ordinal. We could also try using the persist option for ice, which would cause ice to ignore some errors. These are not bad suggestions, but first we probably want to stop and think about the model we were attempting to run: an mlogit model with 29 predictors. We wouldn't generally do this in an analysis model, and, for similar reasons, doing so in an imputation model may not be a good idea. Instead, we may want to select a subset of variables that will be used to predict each variable to be imputed.
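
The suggestion to combine categories can be checked quickly before changing the imputation model. The sketch below is illustrative only. The variable var11 is the one named in the error message, but the category codes (4 and 5) and the new variable name var11_c are hypothetical, chosen just to show the syntax.

```stata
* How many cases fall in each category of var11 (including missing)?
tabulate var11, missing

* Hypothetical: if categories 4 and 5 were sparse, merge them into a
* single category in a new variable, leaving the original var11 untouched
recode var11 (4 5 = 4), generate(var11_c)

* Confirm that the recode did what was intended
tabulate var11 var11_c, missing
```

Whether combining categories is substantively reasonable depends on what the categories mean, so this is a decision to make on the basis of the data rather than the error message alone.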

Selecting which variables should be used to impute each of the variables with missing values can be a time-consuming process. If you have a specific analysis model in mind, then all variables to be included in that model should be used to predict all other variables in the analysis model, so that the imputations accurately capture the relationships of interest. If you wish to include auxiliary variables (i.e., variables that are not part of the analysis model) in the imputation model, or if you don't have a specific analysis model in mind when creating the imputation model, then you will need to think carefully about what variables should be used to impute each variable with missing values. An example of the process of building such a model can be found in a paper by van Buuren, Boshuizen, and Knook (1999). As you might imagine, creating a custom equation for each, or even some, of the variables in the imputation model takes time, but it is important to be thoughtful in creating these models.

Once we have decided which variables should be used to predict the variables we want to impute, we can use the eq(...) option with ice to specify the equations. The ice command below is similar to the first model, except that it includes custom equations for 5 of the variables we wish to impute (female, read, math, science, and prog). The first two lines of syntax below are the same as before, apart from the file name in saving(...), and we have used three slashes (i.e., "///") to allow the command to run over multiple lines. The eq(...) option specifying the custom prediction equations for 5 of our variables is on the third through seventh lines of the command. The opening parenthesis of eq( is followed by the first variable for which we wish to specify an equation (i.e., female), and then by a colon (":"). After the colon come the variables we wish to use to impute values of female (i.e., read write math science socst). This list is followed by a comma, which separates the prediction equations. We have added a line break (using ///) after the comma and begin the next custom equation (i.e., the equation for read) on the next line. This continues for each of the variables for which we are specifying an equation. The last equation (i.e., the one for prog) ends with a closing parenthesis instead of a comma. Because we haven't included a custom equation for the variable write, ice will use all other variables in the model to predict it.

ice female read write math science socst m.prog i.race ses schtyp, ///
	gen(m_) saving(ice_imputation_custom) m(5) seed(4324) ///
	eq(female: read write math science socst, ///
	read: female socst schtyp _Irace_2 _Irace_3 _Irace_4, ///
	math: science write _Iprog_2 _Iprog_3 ses, ///
	science: math socst read schtyp, ///
	prog: female ses math science socst)

The output generated by the above syntax is similar to the output from the earlier imputation model. The command listed at the very top of the output now includes the eq(...) option. Looking at the third table, which displays the command and prediction equation used to impute each variable, we can confirm that the predictors we specified in the syntax were used to impute each variable.

=> xi: ice female read write math science socst prog i.prog i.race ses schtyp, cmd(prog:mlo
> git) substitute(prog:i.prog) eq(female: read write math science socst, read: female socst
>  schtyp _Irace_2 _Irace_3 _Irace_4, math: science write _Iprog_2 _Iprog_3 ses, science: m
> ath socst read schtyp, prog: female ses math science socst) gen(m_) saving(ice_imputation
> _custom, replace) m(5) seed(4324)

i.prog            _Iprog_1-3          (naturally coded; _Iprog_1 omitted)
i.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

   #missing |
     values |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        117       58.50       58.50
          1 |         73       36.50       95.00
          2 |         10        5.00      100.00
------------+-----------------------------------
      Total |        200      100.00

   Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
     female | logit   | read write math science socst
       read | regress | female socst schtyp _Irace_2 _Irace_3 _Irace_4
      write | regress | female read math science socst _Iprog_2 _Iprog_3
            |         | _Irace_2 _Irace_3 _Irace_4 ses schtyp
       math | regress | science write _Iprog_2 _Iprog_3 ses
    science | regress | math socst read schtyp
      socst |         | [No missing data in estimation sample]
       prog | mlogit  | female ses math science socst
   _Iprog_2 |         | [Passively imputed from (prog==2)]
   _Iprog_3 |         | [Passively imputed from (prog==3)]
   _Irace_2 |         | [No missing data in estimation sample]
   _Irace_3 |         | [No missing data in estimation sample]
   _Irace_4 |         | [No missing data in estimation sample]
        ses |         | [No missing data in estimation sample]
     schtyp |         | [No missing data in estimation sample]
------------------------------------------------------------------------------

Imputing ..........1..........2..........3..........4..........5
file ice_imputation_custom.dta saved

We specified ice_imputation_custom in the saving(...) option, and once the imputation is complete we can open that file.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation_custom, clear
(highschool and beyond (200 cases))

As with the data from the multivariate normal imputation model, we will want to explore the imputed datasets in much the same way we would explore a new dataset. Recall from the previous lecture that we don't expect the descriptive statistics for the imputed data to be the same as those in the original data, but examining these values can help us see where things might have gone wrong in the imputation process. For example, if a variable with a range of 1-10 in the original data has a mean outside that range (e.g., 15) after imputation, this suggests there may have been a problem in the imputation model. Once we have explored the imputed dataset, we can analyze the data using mim, or set it up for use with Stata's mi commands (see the section below on importing datasets not imputed using mi impute).
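
One way to carry out this kind of check is to compare descriptive statistics across imputations in a single table. The sketch below is an illustration, not part of the original seminar. It relies on the variable _mj, in which ice stores the imputation number (with 0 denoting the original data), and the choice of variables and statistics is arbitrary.

```stata
* Sketch: compare descriptive statistics across imputations
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation_custom, clear

* Number of rows in the original data (_mj==0) and in each imputation
tabulate _mj

* N, mean, SD, minimum, and maximum of the test scores, by imputation number
tabstat read write math science, by(_mj) statistics(n mean sd min max) columns(statistics)
```

Imputed minimums or maximums that fall well outside the range observed in the original data are the sort of thing this table makes easy to spot.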

A note on commands for management and analysis of MI data

As mentioned above, there are two major options for management and analysis of MI datasets in Stata: the mi commands, introduced in Stata 11, and the user-written command mim, which also works with earlier versions of Stata. Throughout the following sections, we will discuss the use of both Stata's mi commands and the user-written command mim.

Note that one difference between the mi commands and mim is that when you use the mi commands, Stata automatically performs checks to be sure the datasets are consistent, for example, that the values of a variable that is not missing in the original dataset do not vary across imputations. The user-written program mim does not do this type of checking. Another difference is that while Stata's mi commands allow different data styles (discussed below), mim requires that all of the imputations be in a single data file. If the original dataset is large, or there are a lot of imputations, a single data file with all of the imputations may be quite large. If you encounter problems because of the size of an MI data file, and do not have access to the mi commands, another option is the user-written command mira, which analyzes MI data where each imputation is stored in its own dataset.

Data management with imputed datasets

Before we talk about data management in terms of commands, there are a few things you should know about Stata's mi commands. As we mentioned in the first seminar, the original (pre-imputation) data is stored along with the imputed data. The original data is referred to in Stata (and elsewhere) as m=0, that is, imputation 0.

Note that Stata's mi commands store the imputation number in the variable _mi_m. This is a system variable and should generally not be modified by the user. Stata's mi commands expect that the original data is present in the dataset (which will be the case if you've created the data using either mi impute or ice). If there is no m=0, Stata will treat the lowest valued imputation (for example m=1 if m takes on values from 1-5) as if it were m=0. If the dataset Stata thinks contains the original data is actually an imputed dataset, this will cause problems. For more information on handling this situation, see our FAQ: How can I use multiply imputed data where original data is not included? Another thing to be aware of is that changing the original data (m=0) can cause Stata's data management for mi to change the imputed data as well.

If you are using mim rather than Stata's mi commands, the above does not apply. Datasets used with mim may or may not contain the original dataset (m=0). If the original data is included, it is important that it be denoted m=0 or as missing (i.e., m=.); otherwise mim will treat the original data as an imputed dataset. The mim command expects that the imputation number is stored in the variable _mj. As with the variable _mi_m, this variable should not generally be modified by the user.

Stata's mi commands allow MI data to be stored in several different styles. The four styles used by Stata are flong, flongsep, mlong, and wide. In the flong style, in addition to the original data, the dataset contains one copy of the data for each imputation. So, if a file contained 5 imputations, each case would occur in the dataset 6 times, once in the original data, and once in each of the imputations. This is a common format for MI data, but can result in very large datasets, particularly if there are a large number of imputations. Below is a small example of an flong dataset. The first variable (i.e., column), labeled m, contains the imputation number (0 to 3), while the variable id contains the case id, and v1-v3 contain data. Note that in the original data (m=0) case 1 (id=1) is missing on v1, while case 2 is missing on v2. In the imputed datasets (m=1 to 3) the missing values have been imputed.

m id v1 v2 v3
0  1  .  3  8
0  2  2  .  7
1  1  5  3  8
1  2  2  4  7
2  1  3  3  8
2  2  2  2  7
3  1  1  3  8
3  2  2  3  7

In the flongsep style, each imputation is stored in a separate data file, each of which contains one row for every case in the dataset. If there are 5 imputations, the flongsep style requires 6 datasets, one for the original data, and one for each of the 5 imputations. Because the data is stored in multiple files, it takes longer to run commands on MI datasets stored in the flongsep style, making this style somewhat inefficient. However, when the dataset and/or the number of imputations is large, flongsep has the advantage of resulting in less data in memory at any time.

In the mlong format, cases with no imputed values are included in the dataset only once (in the original data), while cases with imputed values appear in the dataset multiple times, once in the original data, and once for each imputation. Depending on the number of cases without imputed values, the mlong style may result in substantially smaller datasets than the flong style.

In the wide style, instead of adding additional cases to accommodate the imputations, imputed variables appear in the dataset multiple times, once for the original data, and once for each imputation. So if the variable age has been imputed (with 5 imputations), the variable age, containing the original data would exist in the dataset, as well as five copies _1_age to _5_age.

For a more thorough explanation of the MI data styles, see the documentation by typing help mi styles in the Stata command window. If you are working in mim, the data is stored in what Stata's mi commands call the flong style.

Note that if the data is set up for use with the mi commands, and you attempt to perform analyses or data management on your own, that is, without using Stata's mi commands, it is important to consider how the style will influence each command you run. If you use the mi commands, Stata will keep track of the style and adjust the commands as necessary for you.

Although not typically necessary, it is possible to move between the various mi styles using the mi convert command. Below we open the dataset mvn_imputation, which contains MI data. Then we use the mi query command to tell us the mi style and number of imputations. Currently the style is flong. To convert it, we use the mi convert command, followed by the style we would like to convert to, in this case wide. We can then use the mi query command again to confirm that the style is now wide.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi query
data mi set flong, M = 5

mi convert wide

mi query
data mi set wide, M = 5
last mi update 0 seconds ago
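
Conversions to the other styles follow the same pattern. The sketch below is an illustration rather than part of the original seminar. The name mvn_sep is hypothetical; the flongsep style needs a name because it writes one file per imputation to the current directory. The clear option tells mi convert that it is acceptable to proceed even though the data in memory have changed since they were last saved.

```stata
* Sketch: converting to the other mi styles
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear

* Convert to mlong, which stores cases with no imputed values only once
mi convert mlong, clear
mi query

* Convert to flongsep; "mvn_sep" is a hypothetical name for the files written
mi convert flongsep mvn_sep, clear
mi query
```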

We've already used mi describe in a number of places, so although it is an important tool for managing data using Stata's mi commands, we won't cover it here. Keep in mind, though, that it is probably the first command you will want to run on your imputed data. Instead we will start with the mi varying command, which checks for inconsistencies in the data, that is, variables that vary across the imputations but aren't registered as imputed, or variables that are registered as imputed or passive but don't vary across the imputations. These conditions may indicate an improperly registered variable or a problem with the imputation. It is a good idea to run this command after using mi import, as well as after more complex data management commands (e.g., mi merge), to make sure everything went as expected. Below we open an MI dataset created by ice, named ice_imputation, and use the mi import ice command to convert the dataset from the form created by ice to the format used by Stata's mi commands. The imputed(...) option specifies which variables in the dataset are imputed. The third line of syntax below runs the mi varying command.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation, clear
mi import ice , imputed(female read write math science prog)
(83 m=0 obs. now marked as incomplete)

mi varying

             Possible problem   variable names
  -----------------------------------------------------------------------------------------
           imputed nonvarying:  (none)
           passive nonvarying:  (none)
         unregistered varying:  _Iprog_2 _Iprog_3
  *unregistered super/varying:  m__Iprog_2 m__Iprog_3 m__Irace_2 m__Irace_3 m__Irace_4
                                m_female m_math m_prog m_read m_schtyp m_science m_ses
                                m_socst m_write
   unregistered super varying:  (none)
  -----------------------------------------------------------------------------------------
  * super/varying means super varying but would be varying if registered as imputed;
    variables vary only where equal to soft missing in m=0.

The output from the mi varying command lists five conditions that are possible problems, as well as any variables that fall under each condition. The criteria for identifying each of these types of potentially problematic variables, as well as the possible causes in the data, are somewhat difficult to describe briefly. The help file for the mi varying command contains a description of each of these conditions along with recommendations for how one might handle variables that are identified in each condition. You can access the help file by typing help mi varying in the Stata command window. Looking at the above output, we see that the variables _Iprog_2 and _Iprog_3 are identified as unregistered varying, meaning some of the values of these variables vary across the imputations. This suggests that these variables may have been imputed, which is correct in this case. We'd probably want to go ahead and register them as imputed. Note that the variable prog, whose categories _Iprog_2 and _Iprog_3 represent, is already registered as imputed.

Stata has also identified all of the imputation indicators (i.e., the variables that begin with m_) as unregistered super/varying. Registering these variables as imputed would cause them not to show up in the mi varying output in the future, but is probably inappropriate because they aren't imputed variables. Registering them as imputed would also result in all of the cases in the dataset being marked as incomplete, since the imputation indicators are missing in the original data. The best solution in this case may be to do nothing, since we understand why these variables are being flagged by mi varying. It is important to remember that when a variable is flagged by mi varying, it does not necessarily mean there is a problem in the dataset. The flag merely indicates a specific condition, which may or may not be indicative of a problem.
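
Registering the two dummy variables for prog, as suggested above, might look like the following. This is a minimal sketch that assumes the dataset imported with mi import ice above is still in memory. The imputation indicators (the m_ variables) are deliberately left unregistered, for the reasons just given.

```stata
* Register the dummy variables for prog as imputed
mi register imputed _Iprog_2 _Iprog_3

* Re-run the check; _Iprog_2 and _Iprog_3 should no longer be listed
* under "unregistered varying"
mi varying

* Review which variables are now registered as imputed, passive, or regular
mi describe
```

Stata also offers mi register passive for variables that are functions of imputed variables. Which registration is more appropriate for dummy variables like these is a judgment call that depends on how you plan to manage the data afterward.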

Because mim does not do the consistency checks that Stata's mi commands do, there is no command for mim that is equivalent to mi varying.

The mi xeq: command is useful for exploring the data. It allows the user to easily run a command separately on each of the imputed datasets. Below we use mi xeq: followed by sum read write, to summarize the variables read and write in each of the imputations. The output gives the imputation number, followed by the command and its output. The table produced by the sum command looks as it normally would. The table for m=0 is followed by the table for m=1, and so on (note that we have omitted the rest of the output).

mi xeq: sum read write

m=0 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       191    52.28796    10.21072         28         76
       write |       183    52.95082    9.257773         31         67

m=1 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200     52.4295    10.21239         28         76
       write |       200    52.68909    9.593023   21.93323   71.58826

<output omitted>

When using Stata's mi commands, we may want to sort the data before running a command with mi xeq:. For example, we may want to summarize read and write by school type (i.e., the variable schtyp) in each imputation. If we weren't working with MI data, we would simply sort by schtyp and then run the summarize command, so we might try to do the same here. Below we first sort by schtyp and then use mi xeq: and the by prefix.

sort schtyp
mi xeq: by schtyp: sum read write

m=0 data:
-> by schtyp: sum read write
not sorted
r(5);

However, when working with mi xeq: this results in an error message, because mi xeq: sorts the data after we do. Instead, we need to include the sort command as part of the mi xeq: command. The syntax below does this. The line starts with mi xeq: followed by sort schtyp. Then, where we would normally use a line break to indicate the end of a command, we use a semicolon (";"), and after it we issue the command we wish to run, in this case by schtyp: sum read write. This results in mi xeq: first sorting the data by schtyp within each imputation, and then running the summarize command by schtyp.

mi xeq: sort schtyp; by schtyp: sum read write

m=0 data:
-> sort schtyp
-> by schtyp: sum read write

-------------------------------------------------------------------------------------------
-> schtyp = public

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       161    51.99379    10.39741         28         76
       write |       152    52.32895    9.584602         31         67

-------------------------------------------------------------------------------------------
-> schtyp = private

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |        30    53.86667    9.141543         36         73
       write |        31          56     6.78233         38         67


m=1 data:
-> sort schtyp
-> by schtyp: sum read write

<output omitted>

Stata's mi xeq command also allows us to run a command on only a subset of the imputations. For example, the code below runs sum read write, on only m=1 and m=2. The number list after mi xeq ( i.e.,  1 2) but before the colon (":") indicates which imputations should be used.

mi xeq 1 2: sum read write

m=1 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200    52.30726    10.18691         28         76
       write |       200    52.71049    9.263646   27.90232         67

m=2 data:
-> sum read write

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200    52.19646    10.16644         28         76
       write |       200    52.90549    9.420377         31   72.66547
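The number list can also select the original data or a range of imputations. The lines below are a minimal sketch, assuming the same mi set dataset with 5 imputations is still in memory. Passing 0 restricts the command to the unimputed data (m=0), and 1/3 is standard Stata numlist shorthand for 1 2 3.

```stata
* Summarize only the original (unimputed) data, m=0
mi xeq 0: sum read write

* A numlist range: run the command on imputations 1 through 3
mi xeq 1/3: sum read write
```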

If your data are set up for use with mim, you can use Stata's standard sort command, by prefix, and if qualifier to produce the same output. Below we first open a dataset with the structure expected by mim. Then we sort by imputation number (i.e., the variable _mj), and finally we use the by prefix to summarize (sum) the variables read and write separately for each imputation.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim, clear
sort _mj
by _mj: sum read write

----------------------------------------------------------------------------------------------
-> _mj = 0

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       191    52.28796    10.21072         28         76
       write |       183    52.95082    9.257773         31         67

----------------------------------------------------------------------------------------------
-> _mj = 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       200    52.30726    10.18691         28         76
       write |       200    52.71049    9.263646   27.90232         67

<output omitted>

If we wish to create output by some other variable, for example schtyp, as in the second example above, we can also do this using the sort command and by prefix. Below we first sort by _mj and schtyp (in that order) and then use the by prefix to summarize the read and write by schtyp separately for each imputation.

sort _mj schtyp
by _mj schtyp: sum read write

----------------------------------------------------------------------------------------------
-> _mj = 0, schtyp = public

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        read |       161    51.99379    10.39741         28         76
       write |       152    52.32895    9.584602         31         67

----------------------------------------------------------------------------------------------
-> _mj = 0, schtyp = private

<output omitted>

To run a command only on certain imputations, you can use if, as shown below.

sum read write if _mj==1
sum read write if _mj==2

<output omitted>

Creating new variables

Once you have created the imputations, you may want to create new variables in the dataset. Depending on the structure of your dataset (what Stata calls style) and the nature of the variable being generated, the standard Stata commands might work. However, in some situations the resulting variables will not be calculated as expected. This is particularly true if you are using Stata's mi commands, because of the consistency checks Stata performs.

Stata's mi commands provide tools for creating variables with MI data, and when possible, you probably want to use these tools. Stata calls variables that are created from imputed variables after the imputation process passive variables. (Note that this is somewhat different from the meaning of passive variables in the context of ICE.) The mi passive: command can be used as a prefix for generate, egen, and replace. The syntax for these commands after the mi passive: prefix is identical to their normal syntax. In the syntax below we start by opening the dataset mvn_imputation, which was produced by mi impute mvn in the first seminar. The second line of syntax begins with mi passive: and uses the generate command to create a variable that is the sum of the imputed variables read and write. The output shows the result of running this command in each imputation. In the original data (m=0), 26 missing values were created. This is not surprising, since m=0 contains missing values. In our case, no missing values are produced in the imputed datasets. However, depending on the command specified and the variables used, we might end up with some missing values even in the imputed datasets. As always when creating new variables, you'll want to consider whether the number of missing values created seems appropriate.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi passive: generate english = read + write
m=0:
(26 missing values generated)
m=1:
m=2:
m=3:
m=4:
m=5:

As mentioned above, replace can also be used with mi passive:. Here we change the values of english from the sum of read and write to the average of read and write.

mi passive: replace english = (read + write)/2
m=0:
(174 real changes made)
m=1:
(200 real changes made)
m=2:
(200 real changes made)
m=3:
(200 real changes made)
m=4:
(200 real changes made)
m=5:
(200 real changes made)

Below we use mi passive: with the egen command to create a variable that is the total (sum) of the variables read, write, math, science, and socst. Notice that no missing values were created in m=0. This is because the function rowtotal(...) treats missing values as though they were equal to 0, so it produces values for total even when some or all of the variables are missing (as is sometimes the case in m=0).

mi passive: egen total = rowtotal(read write math science socst)
m=0:
m=1:
m=2:
m=3:
m=4:
m=5:
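If treating missing scores as 0 is not what you want, there are alternatives. The lines below are a sketch, assuming the mvn_imputation dataset is still in memory. The variable names total_m and total_any are hypothetical, chosen only for illustration. The first line uses the missing option of rowtotal(...), and the second uses plain addition.

```stata
* With the missing option, total_m is missing only when ALL five scores are missing
mi passive: egen total_m = rowtotal(read write math science socst), missing

* With plain addition, total_any is missing whenever ANY of the five scores is missing
mi passive: generate total_any = read + write + math + science + socst
```

With plain addition, we would expect missing values to be reported for m=0 but not for the imputed datasets, since the imputed variables are filled in there.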

If you are working with mim, the structure of the data allows you to use standard Stata commands to create variables, as long as the command used to create the variable only uses information from one row at a time. For example, to create the variable english as in the first example above (i.e., read + write) we can use the generate command. This creates 26 missing values, because there are 26 cases in m=0 that are missing read and/or write.

generate english = read + write
(26 missing values generated)

Below we change the sum of read and write to the average of read and write using the replace command, just as we would if the data had not been multiply imputed. The next line of syntax creates the variable total using the egen command with rowtotal function.

replace english = (read+write)/2
(1174 real changes made)

egen total = rowtotal(read write math science socst)

Regardless of how the data is stored (i.e., what Stata calls style), the structure of MI data can complicate the creation of variables that utilize multiple rows of data, for example, group mean variables and variables that depend on sort order (e.g., lag variables). For example, suppose we had data on students at multiple time points, such as test scores for each year of high school. Such datasets are often stored so that each student has multiple rows in the dataset, one for each year. In such a dataset we might want to create a lag variable equal to the value of the variable read at the previous time point (year).

If you are using Stata's mi commands, you can do this with the mi passive: command. As noted above, we don't need to sort beforehand. Because mi passive: handles the sort order itself, we cannot rely on sorting the data before the command to put the data in the proper order. Instead we use the by prefix to perform the action for each student (id). The year in parentheses tells Stata to first sort the data by year within id, and then perform the operation by id. The generate command uses read[_n-1] to refer to the value of read in the previous row (within id) when the data is sorted by id and year, which results in a lagged variable. Note that this command will not run in the current dataset, because there is no year variable.

mi passive: by id (year): gen lread = read[_n-1]

To create a lag variable for a dataset formatted for use with mim, we first sort by imputation number (_mj), then id and year. In the second line of syntax below, we use the by prefix to perform the action for each student (id) within each imputation (_mj). The year in parentheses tells Stata to first sort the data by year within _mj and id, and then perform the operation by _mj and id. The generate command uses read[_n-1] to refer to the value of read in the previous row (within _mj and id) when the data is sorted by _mj, id, and year, which results in a lagged variable. Note that this will not run in the current dataset, because there is no year variable.

sort _mj id year
by _mj id (year): gen lread = read[_n-1]

If you are using Stata's mi commands, there may be cases where you want to create variables using commands not supported by mi passive:. In this case you can run the command, and then register the variable in the appropriate form. In the example below, we use recode to recode the variable ses into ses2.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi xeq: recode ses (1/2=1)(3=2), gen(ses2)

m=0 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=1 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=2 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=3 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=4 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

m=5 data:
-> recode ses (1/2=1)(3=2), gen(ses2)
(153 differences between ses and ses2)

The resulting output shows the results of running the recode command in each MI dataset. Note that if the data is in either flong or mlong style, using the mi xeq: prefix is unnecessary, and we could have just used the recode command. However, mi xeq: works in all styles, so we might want to make it a habit to use it even when it isn't strictly necessary. Below we use the mi register command to register ses2 as a regular variable, because the variable ses was not an imputed or passive variable. If ses had been either imputed or passive, we would have registered ses2 as passive.

mi register regular ses2
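After recoding and registering a variable, it can be worth checking the result. The lines below are a sketch, assuming the recode and mi register commands above have just been run. The first line cross-tabulates ses and ses2 in m=0 and m=1, and the second lets us see how ses2 is registered.

```stata
* Cross-tabulate the original and recoded variables in m=0 and m=1
mi xeq 0 1: tabulate ses ses2

* Confirm that ses2 is now listed among the regular variables
mi describe
```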

Because of the way mim datasets are structured, no special steps are needed to perform this type of action, so we could perform the above recode with the standard Stata commands shown below.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim, clear
recode ses (1/2=1)(3=2), gen(ses2)
(918 differences between ses and ses2)

Dropping variables or cases, and renaming variables

Because of the consistency checks performed when you use Stata's mi commands, dropping variables or cases, and renaming variables is somewhat different than in datasets that are not mi set. If you are using a dataset in the form expected by mim the standard Stata commands can be used.

If you are using Stata's mi commands, variables can be deleted using the drop command, just as with other datasets. However, this should be followed by the mi update command so that Stata can run the necessary consistency checks. Below, we delete the variable pr_2, and then run mi update. Note that pr_2 was registered as imputed before we deleted it. Because the variable no longer exists, Stata removes pr_2 from the list of imputed variables when we run mi update.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
drop pr_2
mi update
(imputed variable pr_2 unregistered because not in m=0)

The drop command can also be used to delete observations, but as when it is used to delete variables, it should be followed by the mi update command. Using the mi update command tells Stata to make sure the data is still consistent. In this case, we drop cases where female = 0 and then run mi update. When mi update is run, Stata notices that the number of observations has changed, and it updates one of the system variables to reflect this. The effect of running mi update may vary depending on the style, but in any case, Stata looks at the dataset in its current form and makes any necessary changes to the dataset.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
drop if female==0
(486 observations deleted)

mi update
(system variable _mi_id updated due to changed number of obs.)

To rename a variable but retain its status as imputed, passive, or regular, use the mi rename command. Here we rename the variable female to gender.

mi rename female gender

As mentioned above, because of the format of the data used with mim, and because it does not do the sort of consistency checks that Stata's mi commands do, in datasets used with mim, variables and cases can be dropped, and variables renamed in the usual fashion.

Merging datasets

Stata's mi commands include a special version of merge, mi merge. In this example, we have a dataset named demo that contains the demographic variables (female, race, ses, schtyp, and prog) for the 200 students in our sample, along with the subject identifier (id). Below we open the dataset and use the summarize (abbreviated sum) command to confirm that the data has the structure and the variables we think it does. Note that female has only 182 observations, indicating that it contains some missing values. We know that none of these values have been multiply imputed because the dataset contains only one observation for each of the 200 cases.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/demo, clear
sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       200       100.5    57.87918          1        200
      female |       182    .5549451    .4983428          0          1
        race |       200        3.43    1.039472          1          4
         ses |       200       2.055    .7242914          1          3
      schtyp |       200        1.16     .367526          1          2
-------------+--------------------------------------------------------
        prog |       182    2.027473    .6927511          1          3

We also have a data file called scores, which contains the test score data (i.e., read, write, math, science, and socst). The variables read, write, math, and science contain missing values and have been imputed. Below we open the dataset and confirm its structure. Because the data has been mi set, we use mi describe to do this. From this output we confirm that read, write, math, and science are all registered as imputed, and that there are 5 imputations.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/scores, clear
mi describe

  Style:  flong
          last mi update 6 days ago

  Obs.:   complete          147
          incomplete         53  (M = 5 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  4; read(9) write(17) math(15) science(16)

          passive: 0

          regular: 1; socst

          system:  3; _mi_m _mi_id _mi_miss

         (there is one unregistered variable; id)

Now that we understand the data structure, we can use mi merge to combine the files so that we have both demographic data and test scores in the same file. In order to use mi merge, both datasets must be mi set. The test score data (scores) is already mi set, but the demographic data (demo) is not. In the first two lines of syntax below we open demo, and then mi set the data. The third line of syntax shows the command to merge the datasets. The command name (mi merge) is followed by a description of the type of merge we wish to perform. Since each case in demo should be matched to only one case (per imputation) in scores, we want to do a one to one merge (i.e., 1:1). This is followed by the name of the index variable (id). Finally, using scores specifies that we wish to merge the current dataset with the dataset scores.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/demo, clear
mi set flong

mi merge 1:1 id using scores
(M increased from 0 to 5)
(new variables read write math science registered as imputed)
(new variable socst registered as regular)

    Result                           # of obs.
    -----------------------------------------
    not matched                             0
    matched                               200
    -----------------------------------------

The output from mi merge first tells us that M (the number of imputations) has increased from 0 to 5. This is because the file demo contained no imputations, while scores contained 5 imputations. The output then tells us how the new variables (from scores) were registered. Finally, the output gives a table showing the number of observations that were matched and not matched. In this case, all 200 observations were matched, which is what we would expect.

The above example shows a case where a file containing only original data (no imputations) was merged with an MI dataset. You can also merge two imputed datasets, but there are a few things to be aware of. First, when two MI datasets are merged using mi merge, they are matched on both the index variable (e.g., id) and the imputation number (stored in _mi_m). For example, the case with id=1 in _mi_m=2 is matched to the case with id=1 in _mi_m=2 in the other dataset. Second, if you are merging datasets with unequal numbers of imputations, then the number of imputations will be set to the larger of the two, and the "missing" imputations from the dataset with fewer imputations will be filled in using the original data. This results in unequal numbers of complete cases across the imputations. Neither of the conditions described here is necessarily problematic, but you probably want to think about whether you will encounter them and how you want to proceed.

An unimputed dataset and an MI dataset can also be merged using mim. Below is an example. The two starting datasets, mim_demo and mim_scores, are identical to the datasets demo and scores from the previous example, except that they are formatted to work with mim. We start the merge process by opening the mim_demo dataset. When we merge the data, mim will use both the imputation number (stored in the variable _mj) and the id variable we specify (i.e., id) to match the rows in the dataset. Currently, mim_demo contains only 1 copy of the dataset and no _mj variable, while mim_scores contains 6 copies of the dataset (with _mj taking on the values 0 to 5). So in order to merge the two files, we need to modify mim_demo so that it contains 6 copies of the data, with a variable _mj that takes on values 0 to 5 for each case.

The second line of syntax below uses the expand command to modify the dataset so that each case appears 6 times. Next we sort the data by id. Then for each id (by id:) we use the egen command with the seq(...) function to create a new variable, called _mj, that takes on a sequence of values from 0 to 5 (specified using the from(...) and to(...) options). Now we have six copies of the original (unimputed) data in mim_demo, with a variable _mj that indexes the copies of the dataset. When we merge the data, mim will also expect the dataset to contain a variable _mi that indexes the cases. When it merges the datasets, mim will recreate this variable in both datasets based on the values of id, so the values of this variable don't matter at this point. There just needs to be a variable called _mi in the dataset. Hence, the final line of syntax generates the variable _mi and sets all values equal to missing.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_demo, clear
(highschool and beyond (200 cases))

expand = 6
(1000 observations created)

sort id
by id: egen _mj = seq(), from(0) to(5)

gen _mi = .
(1200 missing values generated)
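Before merging, it may be worth confirming that the expanded dataset has the structure mim expects. The lines below are a quick check, assuming the expand, sort, and egen commands above have just been run on mim_demo. They are not part of the original seminar.

```stata
* Each value of _mj (0 through 5) should appear 200 times
tabulate _mj

* isid exits with an error unless _mj and id jointly identify every row
isid _mj id
```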

Now that the mim_demo dataset is in the correct form, and has the correct variables, we can merge the two datasets. With mim_demo still open, the syntax below merges the two datasets. The command starts with the mim prefix followed by a comma (",") to designate the beginning of the options. The sortorder(...) option identifies the variable that uniquely identifies cases in both datasets (in this case id). This variable will be used to create mim's own case identification variable (i.e., _mi) after the datasets are merged. Following the colon (":") that marks the end of the mim prefix is the merge command, followed by the name of the variable that identifies the observations in both datasets (i.e., id), and the keyword using with the name and location of the dataset to be merged (i.e., mim_scores).

mim, sortorder(id): merge id using http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_scores
(highschool and beyond (200 cases))

The above example shows a case where a file containing only original data (no imputations) was merged with an MI dataset. Two imputed datasets can also be merged, but there are a few things to be aware of. First, when two MI datasets are merged using mim, they are matched on both the index variable (e.g., id) and the imputation number (stored in _mj). Second, unlike Stata's mi merge command, mim will not allow you to merge two datasets with unequal numbers of imputations. If for some reason you have two such datasets, it is possible to merge them by either adding or dropping imputations, but it is left to the user to decide how to treat the extra imputations.

Appending datasets

If you are using Stata's mi commands, the mi append command can be used to append (stack) one dataset on to another. For example, if we had one MI dataset containing information on students in public schools (named public) and another containing the same information on students from private schools (named private), we could use mi append to stack them so that all of the cases are in a single MI dataset. Below we open the dataset public, and use mi describe to provide some basic information on the dataset. Note that there are currently 168 cases in the dataset.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/public, clear
mi describe

  Style:  flong

  Obs.:   complete           92
          incomplete         76  (M = 5 imputations)
          ---------------------
          total             168

  Vars.:  imputed:  7; female(15) read(7) write(16) math(15) science(15) pr_2(17) pr_3(17)

          passive: 0

          regular: 4; race ses schtyp socst

          system:  3; _mi_m _mi_id _mi_miss

         (there are 3 unregistered variables; id prog pr_1)

In general, we would want to use mi describe on both datasets, to be sure each contained the same variables, etc. For this example, we have skipped this step to save space. Below we use the mi append command, followed by the keyword using and the name of the dataset we wish to append (private) to add cases from the dataset private to the current dataset (public). The gen(...) option generates a new variable named dataset in the combined data that denotes which dataset each case came from (public vs. private). Then we use mi describe to provide information about the combined dataset. Note that the dataset now contains 200 cases.

mi append using http://www.ats.ucla.edu/stat/stata/seminars/missing_data/private, gen(dataset)
mi describe

  Style:  flong

  Obs.:   complete          117
          incomplete         83  (M = 5 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  7; female(18) read(9) write(17) math(15) science(16) pr_2(18) pr_3(18)

          passive: 0

          regular: 4; race ses schtyp socst

          system:  3; _mi_m _mi_id _mi_miss

         (there are 4 unregistered variables; id prog pr_1 dataset)
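To check where the appended cases came from, we can tabulate the new variable in the original data only. This is a sketch, assuming the mi append command above has just been run. We are also assuming that gen(...) codes the dataset in memory as 0 and the appended dataset as 1, as Stata's standard append command does.

```stata
* Count cases by source dataset using only the original data (m=0)
mi xeq 0: tabulate dataset
```

Given the counts reported by mi describe, we would expect 168 cases from public and 32 from private.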

If you are using mim, the storage format allows you to use Stata's standard append command. The syntax for the previous example using the standard append command is shown below.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_public, clear
append using http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mim_private, gen(dataset)

Reshaping Data

You may sometimes need to reshape MI datasets, that is, move from wide to long, or from long to wide. One reason for this might be that you have imputed longitudinal data. In this case, you might have started with a dataset that had multiple rows per subject ( i.e.,  long form), but reshaped to wide form, where each case has only one row, and separate variables for each observation in order to impute. Once you have imputed, you may want to move back to long form to analyze the data. Because of their structure, MI datasets require special commands for reshaping data.

If your data is mi set, you can use Stata's mi reshape command to reshape the data. In the example below we convert from wide to long. The dataset, named wide, contains reading test scores at three time points (read1, read2, and read3). The dataset also contains an id variable ( i.e.,  id), and three variables that are measured only once per respondent ( i.e.,  female, ses, and schtyp). First we open the dataset, named wide. Then use mi describe to examine the dataset.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/wide, clear
(highschool and beyond (200 cases))

mi describe

  Style:  flong
          last mi update 6 seconds ago

  Obs.:   complete          145
          incomplete         55  (M = 5 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  4; female(18) read1(9) read2(17) read3(15)

          passive: 0

          regular: 2; ses schtyp

          system:  3; _mi_m _mi_id _mi_miss

         (there is one unregistered variable; id)

The syntax for mi reshape is very similar to the syntax for Stata's standard reshape command. Below we use the command mi reshape, followed by the desired form ( i.e.,  we wish to convert from wide to long) followed by the variable name "stub" ( i.e.,  read) for the variables that are repeated over time ( i.e.,  read1-read3). The i(...) and j(...) "options" after the comma are required. The variable name in i(...) is the id variable, and the new variable name in j(...) is the name of a new variable (time) created to index observations ( i.e.,  time points).

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/wide, clear
mi reshape long read, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      200   ->     600
Number of variables                   7   ->       6
j variable (3 values)                     ->   time
xij variables:
                      read1 read2 read3   ->   read
-----------------------------------------------------------------------------

reshaping m=1 data ...

reshaping m=2 data ...

reshaping m=3 data ...

reshaping m=4 data ...

reshaping m=5 data ...

assembling results ...

To move from long to wide form, we use a similar syntax. Below we assume a dataset in long form is open. This time the keyword following the mi reshape command is wide, followed by the name of the variable read, for which there are multiple observations per id. The i(...) and j(...) "options" are required, where i(...) gives the case id variable (in our case id) and j(...) gives an existing variable that indexes observations (i.e., time).

mi reshape wide read, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      600   ->     200
Number of variables                   6   ->       7
j variable (3 values)              time   ->   (dropped)
xij variables:
                                   read   ->   read1 read2 read3
-----------------------------------------------------------------------------

reshaping m=1 data ...

reshaping m=2 data ...

reshaping m=3 data ...

reshaping m=4 data ...

reshaping m=5 data ...

assembling results ...

Using mim the reshape command is Stata's standard reshape command, preceded by the mim prefix. For example, the syntax below shows how to convert the same dataset as above from wide to long.

mim: reshape long read, i(id) j(time)

The syntax to convert from long to wide is shown below.

mim: reshape wide read, i(id) j(time)

Importing datasets not imputed using the mi impute command

Because Stata's mi commands have a very specific way of storing MI data, we need a way to convert MI datasets that are not currently stored in that form to the format the mi commands expect. These might be datasets imputed by ice, but they might also be datasets imputed by other packages. Before we can make Stata aware of the MI structure of the data we need three things. First, we need a Stata format dataset. If the dataset is in some other format, it must be imported into Stata through one of the usual methods (e.g., StatTransfer or the insheet command). Second, the dataset must contain the original (pre-imputation) data. If the imputed datasets were released without m=0, then the original data must be recreated before mi import can be used on the data. For more information on how to do this, see our FAQ: How can I use multiply imputed data where original data is not included?. Finally, we need to know how the imputed data is structured, for example, whether it is in what Stata would call flong, mlong, etc. We also need to be aware of how the missing values are indicated. If missing values in the unimputed data (i.e., m=0) are user missing (e.g., .a), mi import will set the corresponding values in the imputed dataset to missing. To avoid this, replace all user missing values with system missing values, which are indicated by a period (".").

In the following example, we have a Stata dataset stored in what Stata would call the flong style, with the original data stored in m=0, and all missing values specified as ".", so we are ready to proceed. The first line of syntax below opens the dataset. The next line of syntax shows the command to make Stata aware of the MI structure of the dataset. The command is mi import flong, which is followed by a comma (",") indicating that whatever follows is an "option." Both the m(...) option and the id(...) option are required. The variable listed in m(...) (in our case _mj) is the variable that identifies which imputation each case belongs to (i.e., m=0, m=1, ...). The variable listed in the id(...) option gives the case id variable, in this case _mi. This variable allows Stata to match cases across the imputations for data management purposes. The imputed(...) option allows the user to list the imputed variables, so that Stata can register them. This isn't required, and imputed variables can always be registered later. The clear option allows the command to replace the current data in memory even if it has changed since the last time the data was saved (this is like the clear option for the use command).
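One way to replace user missing values with system missing values is to loop over the numeric variables before running mi import. The lines below are a minimal sketch, not part of the original seminar. They assume the dataset to be imported is already in memory and that none of the user missing codes need to be preserved. The local macro name numvars is our own.

```stata
* Collect the names of all numeric variables; ds leaves them in r(varlist)
quietly ds, has(type numeric)
local numvars `r(varlist)'

* Extended missing values (.a through .z) sort above system missing (.),
* so the condition > . selects exactly the user missing values
foreach v of local numvars {
    replace `v' = . if `v' > .
}
```

String variables are skipped because extended missing values apply only to numeric variables.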

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation, clear
mi import flong, m(_mj) id(_mi) imputed(female read write math science prog) clear
(83 m=0 obs. now marked as incomplete)

After the mi import command we use mi describe to examine the dataset. Everything looks as we expect it to. Note that Stata has created three new system variables (_mi_m, _mi_id, and _mi_miss); these variables are used by Stata's mi commands to manage the data and should not be changed by the user. We might also want to run the command mi varying to make sure that the data imported properly (discussed above in the section headed Data Management with Imputed Datasets).

mi describe

  Style:  flong
          last mi update 0 seconds ago

  Obs.:   complete          117
          incomplete         83  (M = 5 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  6; female(18) read(9) write(17) math(15) science(16) prog(18)

          passive: 0

          regular: 0

          system:  3; _mi_m _mi_id _mi_miss

         (there are 26 unregistered variables)

The example above uses mi import on what Stata calls an flong dataset. One can also import wide and flongsep datasets using the commands mi import wide and mi import flongsep, respectively. Stata also has two mi import commands that make it easier to import two common types of imputed datasets. One, mi import ice, imports datasets in the form created by ice (shown below), and the other, mi import nhanes1, imports NHANES datasets. The syntax for these commands is similar, although not identical, to the syntax shown above.

Above, we used mi import flong to set up a dataset for use with Stata's mi commands. However, as just mentioned, there is an easier way to import datasets produced by ice. The command for this is shown below. We don't need to specify either m(...) or id(...) because Stata knows what ice names these variables. The auto option (after the comma), tells Stata that we want it to automatically determine which variables have been imputed. Use of the auto option is recommended by Stata. After we import the data, we use mi describe to examine the dataset.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/ice_imputation, clear
mi import ice, auto
(200 m=0 obs. now marked as incomplete)

mi describe

  Style:  flong
          last mi update 0 seconds ago

  Obs.:   complete            0
          incomplete        200  (M = 5 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  22; female(18) prog(18) read(9) write(17) math(15) science(16)
                    _Iprog_2(18) _Iprog_3(18) m_female(200) m_read(200) m_write(200)
                    m_math(200) m_science(200) m_socst(200) m_prog(200) m__Iprog_2(200)
                    m__Iprog_3(200) m__Irace_2(200) m__Irace_3(200) m__Irace_4(200)
                    m_ses(200) m_schtyp(200)

          passive: 0

          regular: 0

          system:  3; _mi_m _mi_id _mi_miss

         (there are 8 unregistered variables)

Looking at the output from mi describe we see that Stata has identified 22 variables as imputed, which isn't correct. Stata has correctly identified the six imputed variables as such, but it has also registered a number of other variables as imputed. Two of the variables registered as imputed, _Iprog_2 and _Iprog_3 were passively imputed by ice based on prog. This happens because Stata cannot distinguish imputed variables and passive variables created by ice. Statistically, it doesn't matter whether a variable is registered as imputed or passive, so we can leave them, or, for data management/documentation purposes we can reregister them as passive variables, which we do below using the command mi register passive. Stata also incorrectly identified the imputation indicators ( i.e.,  the m_ variables) as imputed variables. It did so because they are all missing in m=0, but contain valid values in the imputations. Again, it's not terribly important to unregister these variables, but we do so below using mi unregister. Then we rerun mi describe and see something more like we were expecting. Note that the number of complete cases has changed from 0 to 117, because we unregistered the variables that were missing for all cases in m=0 ( i.e.,  the m_ variables). We could go on to register the regular variables, or not.

mi register passive _Iprog_2 _Iprog_3
(variables _Iprog_2 _Iprog_3 were registered as imputed, now registered as passive)

mi unregister m_*
(117 m=0 obs. now marked as complete)

mi describe

  Style:  flong
          last mi update 0 seconds ago

  Obs.:   complete          117
          incomplete         83  (M = 5 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  6; female(18) prog(18) read(9) write(17) math(15) science(16)

          passive: 2; _Iprog_2(18) _Iprog_3(18)

          regular: 0

          system:  3; _mi_m _mi_id _mi_miss

         (there are 22 unregistered variables)

Importing datasets not imputed using ice

If you plan to use mim to analyze an MI dataset, the data must first be in the format expected by mim. As discussed above, mim assumes that the dataset contains m complete copies of the data, one for each of the imputations (or, optionally, m+1 copies, to include the pre-imputation data). The imputation number is stored in the variable _mj, while a case id variable is in _mi. If you imputed using ice, the dataset will already be in the correct format. If not, utilities exist to help you create an appropriate dataset. If the dataset is in the format expected by Stata's mi commands, you can reformat it for mim using the mi export ice command. Below we open a dataset in Stata's mi format, and export it to ice/mim format.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi export ice

If the MI datasets are stored in separate files, you can use the command mimstack (which is installed along with mim) to combine the files in the format expected by mim. In order to use mimstack, the datasets to be combined must already be in Stata format (i.e., .dta files). In the example below, we have six datasets (5 imputations plus the pre-imputation data), named imp0.dta to imp5.dta. We first change the working directory to the directory where the imputed datasets are stored (c:\data) and then use mimstack to combine the datasets. The m(...) and sortorder(...) options are required; they give the number of imputations (m(5)) and the name of the case id variable (sortorder(id)). The istub(...) option gives the first part of the name of the imputed datasets (i.e., imp for imp0.dta, imp1.dta, etc.). If we did not have the pre-imputation data (i.e., imp0.dta), we would need to use the nomj0 option. (Note this example will not run, because the datasets in question are probably not available in c:\data on your computer.)

cd c:\data
mimstack , m(5) sortorder(id) istub(imp)

Analyzing imputed data

As with data management, there are two options for analyzing MI data in Stata: Stata's mi commands, introduced in version 11, and the user-written package mim. Both can be used to estimate a number of models and can also be used to perform post-estimation tests. For a list of estimation commands currently supported by mi estimate see the mi estimation help file by typing help mi estimation in the Stata command window. For information on the post-estimation commands for use with Stata's mi commands, see the mi postestimation help file by typing help mi postestimation in the Stata command window. For a list of estimation and postestimation commands supported by mim see the package help by typing help mim in the Stata command window.

Before we estimate any models, let's briefly review how MI estimates are calculated. The MI estimate of a parameter, for example, a regression coefficient or a mean, is the average of the estimates across the m imputations. The MI estimate of the standard error of a parameter is somewhat more complicated. The variance (i.e., s.e.^2) is composed of two parts, the between variance and the within variance. The within variance is the average of the squared standard errors across the m imputations. The between variance is the variance of the coefficient estimates themselves across the m imputations. The MI estimate of the variance is the sum of the within and between variance estimates, with an adjustment for the number of imputations. The MI estimate of the standard error is the square root of the variance. Including the between imputation variance allows us to account for the added uncertainty that arises because some of the values in the dataset were imputed rather than observed. While none of these calculations are particularly difficult, it would be tedious to run a model in each dataset and then combine the estimates by hand. Fortunately, mi estimate and mim automate this process by estimating the specified model in each of the MI datasets and then combining the results to produce the MI estimates, which are displayed for the user.
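
The combining rules described in this paragraph can be written compactly. The symbols are our own shorthand, not notation used elsewhere in this seminar: \hat{Q}_i is the estimate of a parameter from imputation i, SE_i is its standard error, W and B are the within and between variances, and T is the total variance. The factor (1 + 1/m) is the usual form of the adjustment for the number of imputations, and it is consistent with the variance table produced by the vartable option later in this seminar.

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
  && \text{(MI estimate)}\\
W &= \frac{1}{m}\sum_{i=1}^{m} SE_i^{2}
  && \text{(within variance)}\\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^{2}
  && \text{(between variance)}\\
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}\\
SE_{MI} &= \sqrt{T}
  && \text{(MI standard error)}
\end{aligned}
```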

Below we use Stata's mi estimate command to estimate a regression model using write, read, math and ses to predict science with MI data. The first line of syntax below opens the MI dataset, mvn_imputation. The second line of syntax begins with mi estimate, indicating that the following command should be executed on the MI datasets and the coefficients and standard errors combined. After the mi estimate prefix, the basic syntax to run a regression model is identical to the syntax without MI data. The command name, regress, is followed by the outcome (science) and then the list of predictor variables. The i. preceding the variable ses (forming i.ses) in the list of predictors indicates that the variable ses should be included in the model as a series of dummy variables. (Note that this syntax is new to Stata 11; type help factor variables in the Stata command window for more information.)

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi estimate: regress science write read math i.ses

Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.1549
                                                  Complete DF     =        194
DF adjustment:   Small sample                     DF:     min     =      62.21
                                                          avg     =     107.02
                                                          max     =     154.10
Model F test:       Equal FMI                     F(   5,  142.5) =      32.34
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |    .204958   .0760487     2.70   0.008     .0545139    .3554021
        read |   .3258264   .0775502     4.20   0.000     .1708615    .4807913
        math |   .2692184   .0848079     3.17   0.002     .0997013    .4387356
             |
         ses |
          2  |   1.860882   1.313735     1.42   0.159    -.7343733    4.456137
          3  |   1.863389   1.519993     1.23   0.222    -1.142558    4.869336
             |
       _cons |   8.197699   3.570605     2.30   0.024     1.110246    15.28515
------------------------------------------------------------------------------

In addition to the information normally included in regression output (e.g., the number of observations included in the analysis), the output for mi estimate includes some information specific to MI analysis. The number of imputations used is given, along with the average RVI. The RVI, or relative variance increase, is the increase in parameter variance (i.e., s.e.^2) due to missing values. Each parameter estimated has its own RVI; the default regression output gives the average of these estimates. The output also gives the DF adjustment used in calculating the degrees of freedom for both the model and individual parameters. By default the small sample adjustment is used. Also by default, the overall F test for the model is performed using a test that assumes an equal fraction of missing information for all coefficients. This is indicated in the output as Equal FMI. The regression table in the output gives the MI estimates of the coefficients and their standard errors.

It is possible to get the RVI, as well as related values, individually for each coefficient. Below we rerun the regression model, this time using the vartable option for mi estimate. Note that when specifying options for mi estimate, the command is followed by a comma (","), the desired option (e.g., vartable), and then the colon (":").

mi estimate, vartable: regress science write read math i.ses

Multiple-imputation estimates                     Imputations     =          5

Variance information
------------------------------------------------------------------------------
             |        Imputation variance                             Relative
             |    Within   Between     Total       RVI       FMI    efficiency
-------------+----------------------------------------------------------------
       write |   .005275   .000424   .005783   .096409   .091437       .982041
        read |   .004849   .000971   .006014   .240226   .208406       .959987
        math |   .005783   .001174   .007192   .243684    .21094        .95952
             |
         ses |
          2  |      1.62   .088249    1.7259    .06537   .063121       .987533
          3  |   2.12005   .158607   2.31038   .089775   .085478       .983192
             |
       _cons |   11.0585   1.40893   12.7492   .152888   .140141       .972736
------------------------------------------------------------------------------

<output omitted>

With the addition of the vartable option, the mi estimate output now begins with a table outlining the variance estimation for each coefficient in the model. The first column lists the predictor variables and intercept, each of which is associated with a regression coefficient. The second column gives the within imputation variance (i.e., the average of the estimated variances across the m imputations). The third column gives the between imputation variance (i.e., the variance in coefficient estimates across the m imputed datasets). The fourth column contains the total variance, which is the sum of the within and between variance with an adjustment for the number of imputations. The fifth column gives the relative variance increase, or RVI; this is the between variance (with an adjustment for the number of imputations) divided by the within variance. This gives a sense of how much the variance in coefficient estimates increased due to missing values. The fraction of missing information, or FMI, is given in the second to last column. This is a measure of the proportion of information lost due to non-response for a specific coefficient. The FMI is important because some of the hypothesis tests commonly used with MI analyses assume that the FMI is equal across coefficients, which may or may not be a tenable assumption. If this assumption does not seem appropriate, tests that do not make this assumption are available, but they require much larger values of m. The final column in the table gives the relative efficiency; this is a measure that compares the estimate of the variance with the current value of m to the variance with an infinite number of imputations. As the number of imputations increases, this value will approach 1. The table of variance information is followed by regression output identical to that produced without the vartable option (this output has been omitted).
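
As a rough check on how the columns relate to one another, the quantities for write can be recomputed by hand from the rounded values in the table. This is only a sketch: the scalar names are ours, the factor (1 + 1/M) is assumed to be the adjustment for the number of imputations, and the relative efficiency is assumed to be 1/(1 + FMI/M). Because the inputs are rounded, the results should agree with the table only approximately.

```stata
* Sketch: recompute the variance table entries for write from rounded inputs
scalar n_imp     = 5          // number of imputations (M)
scalar v_within  = .005275    // within variance, copied from the table
scalar v_between = .000424    // between variance, copied from the table
scalar fmi       = .091437    // FMI, copied from the table

scalar v_total = v_within + (1 + 1/n_imp)*v_between
scalar rvi     = (1 + 1/n_imp)*v_between/v_within
scalar rel_eff = 1/(1 + fmi/n_imp)

display "Total variance:      " %9.6f v_total
display "RVI:                 " %9.6f rvi
display "Relative efficiency: " %9.6f rel_eff
display "MI standard error:   " %9.6f sqrt(v_total)
```

The last line takes the square root of the total variance, which should be close to the standard error reported for write in the regression table.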

In MI analysis, the standard post-estimation tests, such as Wald tests (e.g., the test command) and likelihood ratio tests, are generally not valid. For some of these tests, similar tests are available for use with MI data. Stata has implemented some of the applicable tests as part of the mi commands. For example, we can test whether multiple coefficients are simultaneously equal to 0 using the mi test command. This test is often used to test for an overall effect of a categorical variable represented by a series of dummy variables, or more generally to test for differences between nested models. Below we use mi test to test whether the overall effect of ses is statistically significant. The command name, mi test, is followed by 2.ses and 3.ses, which refer to the coefficients in the model associated with the second and third level of ses respectively. The note at the top of the output indicates that, by default, the F test performed assumes equal fractions of missing information. Below that, the parameters being tested are listed, followed by the results of the F test.

mi test 2.ses 3.ses
note: assuming equal fractions of missing information

 ( 1)  2.ses = 0
 ( 2)  3.ses = 0

       F(  2,  64.4) =    1.00
            Prob > F =    0.3748

We can also test linear combinations of parameters, the type of test performed using the command lincom after some non-MI estimation commands. For example, we might want to test that the coefficients for read and math are the same (i.e., read = math), which we can do by testing whether the difference between the two coefficients is equal to 0 (i.e., read-math=0). In order to perform this type of test, Stata needs to store information about the results from each imputed dataset. The saving(...) option of the mi estimate command is used to save the necessary information. Below we rerun the regression from above, this time adding a comma (",") after mi estimate and including the saving(...) option. The text inside the saving(...) option (i.e., myresults) assigns a name to our stored results so that we can recall them later. The output from this command is identical to the output from the previous regression, so it has been omitted. Next we use the mi estimate command to estimate the difference between the coefficients for read and math. The command name (i.e., mi estimate) is followed by the expression we want to test in parentheses ( "(" and ")" ). As in many other Stata estimation commands, after mi estimate the coefficients can be referred to by _b followed by the variable name in brackets; in this case, _b[read] and _b[math] refer to the coefficients for read and math respectively. The expression to be tested is followed by the keyword using and the name of the estimation results to be used (i.e., myresults). After the comma, the nocoef and noheader options suppress the display of the output from the original model. If we omit these options, the header and coefficient table from the original mi estimate command associated with myresults will be shown.

mi estimate, saving(myresults): regress science write read math i.ses

<output omitted>

mi estimate (_b[read] - _b[math]) using myresults, nocoef noheader

      command: regress science write read math i.ses
        _mi_1: _b[read] - _b[math]

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _mi_1 |   .0566079   .1401431     0.40   0.688    -.2252566    .3384725

At the top of the output is the command associated with myresults (i.e., the command after the mi estimate prefix). This is followed by a listing of the parameters estimated. In this case we have specified only one expression within parentheses, but multiple expressions can be included, each within its own set of parentheses. The table shows the estimate of the difference between the coefficients for read and math under "Coef." (i.e., 0.057), followed by the standard error (0.14), t-value, p-value, and confidence interval. The results indicate that there is no statistically significant difference between the coefficients for read and math. Note that this is a test of a single coefficient (i.e., a one degree of freedom test).

We can also test multiple comparisons. For example, we might want to test that the coefficients for read, math, and write are all equal (i.e., read=math=write). This can be done by testing that read=math (i.e., read-math=0) and read=write (i.e., read-write=0). To test this hypothesis, we first estimate the differences between read and math, and read and write, separately using mi estimate, and then use another command, mi testtransform, to test that the coefficient estimates from mi estimate are simultaneously equal to 0. In the first line of syntax below, the mi estimate command contains two expressions in parentheses. We assign each of the estimates (read-math and read-write) a name by typing a label (i.e., t1 and t2); the label is listed after the open parenthesis ( "(" ) and followed by a colon (":"). The two expressions are listed in the table of output by their assigned labels. Below that we use the command mi testtransform, followed by the labels of the parameters we want to test simultaneously (i.e., t1 t2).

mi estimate (t1: _b[read] - _b[math]) (t2: _b[read] - _b[write]) using myresults, nocoef noheader

      command: regress science write read math i.ses
           t1: _b[read] - _b[math]
           t2: _b[read] - _b[write]

------------------------------------------------------------------------------
     science |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0566079   .1401431     0.40   0.688    -.2252566    .3384725
          t2 |   .1208683   .1294069     0.93   0.354     -.137425    .3791617
------------------------------------------------------------------------------

mi testtransform t1 t2
note: assuming equal fractions of missing information

           t1: _b[read] - _b[math]
           t2: _b[read] - _b[write]

 ( 1)  t1 = 0
 ( 2)  t2 = 0

       F(  2,  61.5) =    0.45
            Prob > F =    0.6418

The first part of the output notes that this test assumes equal FMI, which is the default for F tests performed by mi estimate. Next, the two estimates being tested are listed, followed by the specific hypotheses being tested ( i.e.,  t1=0 and t2=0), and then the F test and p-value for the test. This two degree of freedom test does not find any significant differences between the coefficients for read, write, and math.

It is a good idea to test the sensitivity of the results to both the number of imputations used and the specific imputations used. If the results of an analysis change substantially depending on the number of imputations used, or the specific subset of imputations used, this may suggest that there is a problem in the imputation model. If the analysis seems particularly sensitive to the number of imputations, you may want to increase the number of imputations used. The mi estimate command makes it relatively easy to perform this type of sensitivity analysis. Rather than using mi estimate to reestimate the entire model on different subsets of the imputations, we can estimate the model once for all imputations, saving the results in the same manner as above, and then use the mi estimate using command to recombine the saved results in various ways. Below we use mi estimate to run a regression model using read and write to predict socst and save the results as myresults2.

mi estimate, saving(myresults2): regress socst read write

Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.0344
                                                  Complete DF     =        197
DF adjustment:   Small sample                     DF:     min     =     153.03
                                                          avg     =     165.98
                                                          max     =     186.00
Model F test:       Equal FMI                     F(   2,  163.8) =      79.36
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .4179087   .0717291     5.83   0.000     .2762435     .559574
       write |    .412377   .0785752     5.25   0.000     .2571447    .5676093
       _cons |   8.773376   3.493966     2.51   0.013     1.880479    15.66627
------------------------------------------------------------------------------

The above model uses all 5 of the imputations to estimate the model. Below we use mi estimate using to reestimate the MI coefficients using only 3 of the 5 imputations. The mi estimate command is followed by the keyword using and the name of the saved results to be used ( i.e.,  myresults2). The nimputations(#) option after the comma instructs Stata to reestimate the MI coefficients using only the first # imputations, in this case 3.

mi estimate using myresults2, nimputations(3)

Multiple-imputation estimates                     Imputations     =          3
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.0146
                                                  Complete DF     =        197
DF adjustment:   Small sample                     DF:     min     =     173.94
                                                          avg     =     183.85
                                                          max     =     191.31
Model F test:       Equal FMI                     F(   2,  277.7) =      82.77
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .4162061   .0704377     5.91   0.000     .2771834    .5552287
       write |   .4200142   .0761381     5.52   0.000     .2698106    .5702179
       _cons |   8.457841   3.468331     2.44   0.016     1.616759    15.29892
------------------------------------------------------------------------------

In this output, the number of imputations used is listed as 3, which is what we requested. The coefficients in this output are very similar to the MI estimates using all 5 imputations, suggesting that the model is not particularly sensitive to the number of imputations. Note that in many applications one may have more than 5 imputations available, and one might want to try several different numbers of imputations when examining the sensitivity of the model to the number of imputations. Also note that at least 2 imputations are needed to calculate MI estimates, because the between imputation variance cannot be computed from a single imputation. We can also examine differences in the MI estimates depending on which imputations are used. Below we use the mi estimate using command with the imputations(...) option, which specifies the imputations to be used. In this case we calculate the MI estimates using imputations number 2, 3, 4, and 5.

mi estimate using myresults2, imputations(2 3 4 5)

Multiple-imputation estimates                     Imputations     =          4
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.0438
                                                  Complete DF     =        197
DF adjustment:   Small sample                     DF:     min     =     122.88
                                                          avg     =     147.12
                                                          max     =     180.48
Model F test:       Equal FMI                     F(   2,  111.8) =      78.46
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        read |   .4157905   .0725943     5.73   0.000     .2722496    .5593314
       write |    .412966   .0797544     5.18   0.000     .2550955    .5708366
       _cons |   8.844287   3.494272     2.53   0.012     1.949405    15.73917
------------------------------------------------------------------------------

The output indicates that the number of imputations used is 4, which is what we requested. And again the estimates are similar to the previous estimates, suggesting that our model is not particularly sensitive to which imputations are used. Note that in many applications one might want to try different numbers and subsets of imputations, rather than a single subset.
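
Trying several numbers and subsets of imputations is easy to automate with loops. The lines below are a sketch rather than part of the original example: they assume myresults2 was saved from 5 imputations as above, and the local macro names k, j, i, and keep are our own. The first loop recombines the results using the first 3, 4, and 5 imputations; the second leaves out one imputation at a time.

```stata
* Sketch: sensitivity to the number of imputations (first k imputations)
forvalues k = 3/5 {
    display _newline "Using the first `k' imputations"
    mi estimate using myresults2, nimputations(`k')
}

* Sketch: sensitivity to which imputations are used (leave one out)
forvalues j = 1/5 {
    local keep
    forvalues i = 1/5 {
        * build the list of imputations to keep, skipping imputation j
        if `i' != `j' local keep `keep' `i'
    }
    display _newline "Omitting imputation `j' (using `keep')"
    mi estimate using myresults2, imputations(`keep')
}
```

Because mi estimate using only recombines the saved results, none of these runs refits the regression.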

Alternatively, we can use mim to analyze the MI data. As mentioned above, mim assumes the data is in the structure produced by ice, that is, each of the imputations is stacked with the others in a single dataset, and that the imputation number and case id are stored in _mj and _mi respectively. The first line of syntax below uses mi export ice to convert the dataset in memory (mvn_imputation) from the structure used by Stata's mi commands to the structure used by mim. Next we specify the second regression model we estimated above, predicting socst with read and write, this time using mim. The second line of syntax below uses the mim prefix, followed by the command name (regress), the outcome variable (socst) and then the list of predictor variables.

mi export ice
mim: regress socst read write

Multiple-imputation estimates (regress)                  Imputations =       5
Linear regression                                        Minimum obs =     200
                                                         Minimum dof =   153.0

------------------------------------------------------------------------------
       socst |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]     FMI
-------------+----------------------------------------------------------------
        read |   .417909   .071729    5.83   0.000    .276244  .559574   0.060
       write |   .412377   .078575    5.25   0.000    .257145  .567609   0.067
       _cons |   8.77338   3.49397    2.51   0.013    1.88048  15.6663   0.023
------------------------------------------------------------------------------

The output from mim is somewhat less detailed than the output from the mi estimate command, although it does list the number of imputations, number of observations and the degrees of freedom.

If the model contains categorical variables, such as the first model we ran above, which used write, read, math, and ses (which is categorical) to predict science, we can use the xi prefix along with the mim prefix to estimate the model without creating dummy variables by hand. Below we estimate this model. Note that the xi prefix comes before the mim prefix, and that an i. precedes the variable ses (i.e., i.ses).

xi: mim: regress science write read math i.ses

i.ses             _Ises_1-3           (naturally coded; _Ises_1 omitted)

Multiple-imputation estimates (regress)                  Imputations =       5
Linear regression                                        Minimum obs =     200
                                                         Minimum dof =    62.2

------------------------------------------------------------------------------
     science |     Coef.  Std. Err.     t    P>|t|    [95% Conf. Int.]     FMI
-------------+----------------------------------------------------------------
       write |   .204958   .076049    2.70   0.008    .054514  .355402   0.091
        read |   .325826    .07755    4.20   0.000    .170861  .480791   0.208
        math |   .269218   .084808    3.17   0.002    .099701  .438736   0.211
     _Ises_2 |   1.86088   1.31374    1.42   0.159   -.734373  4.45614   0.063
     _Ises_3 |   1.86339   1.51999    1.23   0.222   -1.14256  4.86934   0.085
       _cons |    8.1977   3.57061    2.30   0.024    1.11025  15.2852   0.140
------------------------------------------------------------------------------

We can test the hypothesis that the coefficients for ses=2 (denoted _Ises_2) and ses=3 (_Ises_3) are both equal to 0. To do this we start with the mim prefix, followed by the command testparm (which is a limited version of the test command), followed by a list of the parameters we wish to test. As with the test and mi test commands, the output for mim: testparm lists the hypotheses being tested, followed by the F test and p-value. While the tests are similar, the mim: testparm and mi test commands are not implemented identically. For users with access to both sets of commands this can be an advantage, because you can examine whether the results are sensitive to the specific implementation of the test. In this case the difference was very small: a p-value of 0.3728 using mim and 0.3748 using mi test. For more information about the differences, see the documentation for mi test and mim.

mim: testparm _Ises_2 _Ises_3

 ( 1)  _Ises_2 = 0
 ( 2)  _Ises_3 = 0

       F(  2, 101.7) =    1.00
            Prob > F =    0.3728
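
For readers who want to reproduce the comparison with mi test, a rough sketch of the corresponding mi commands follows. It assumes the mi-format version of the data is available on disk as mvn_imputation.dta (a hypothetical file name based on the dataset name used above), since after mi export ice the data in memory are no longer in Stata's mi structure.

```stata
* Reload the data in Stata's mi structure (the file name is an assumption)
use mvn_imputation, clear

* Re-estimate the model with mi estimate, using factor-variable notation for ses
mi estimate: regress science write read math i.ses

* Joint test that the coefficients for ses=2 and ses=3 are both 0
mi test 2.ses 3.ses
```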

Linear combinations of parameters can also be examined. Below we once again test whether the coefficients for read and math are equal (read=math) by testing whether their difference is equal to 0 (read-math=0). The mim prefix is followed by the command name (i.e., lincom) and the expression we wish to test (i.e., read-math).

mim: lincom read-math

Multiple-imputation estimates for lincom                       Imputations = 5

 ( 1)  read - math = 0

------------------------------------------------------------------------------
     science |    Coeff.  Std. Err.     t    P>|t|    [95% Conf. Int.]     FMI
-------------+----------------------------------------------------------------
         (1) |   .056608   .140143    0.40   0.688   -.223338  .336554   0.314
------------------------------------------------------------------------------

The output gives the estimate of the coefficient, its standard error, t-value, p-value, confidence interval and the fraction of missing information for this parameter. As before, the difference between the coefficients for read and math is not statistically significant.

Working with special data structures (xxxset)

For various "special" types of data, Stata allows you to declare the data structure in advance, so that commands that rely on that structure can be run more easily. For example, when working with survey data you can svyset the data, and when working with cross-sectional time-series data you can xtset the data. If you xxxset the data before you mi set it, Stata will "remember" the setting. If you mi set the data first, the usual commands (e.g., stset, tsset) will no longer work; instead, you need to use mi xxxset to set the data. Other than adding mi before the set command, the syntax remains identical.
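
As an illustration of the same pattern with a different data structure, the sketch below declares survival-time structure on data that have already been mi set. The variable names (time, died, x1, x2) are hypothetical and are not part of the example dataset used in this seminar.

```stata
* The data are already mi set, so use mi stset rather than stset
mi stset time, failure(died)

* Fit a Cox model on each imputation and combine the results
mi estimate: stcox x1 x2
```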

On the first line of syntax below we mi xtset the data, with the level 2 units defined by the variable group. On the second line we use mi estimate to estimate a random intercept model, with y regressed on x1 and x2. Because xtreg is not currently supported by mi estimate, we have used the cmdok option to force mi estimate to estimate the model anyway. This may or may not be a good idea, depending on the estimation command in question. When using the cmdok option, it is the user's responsibility to ensure that the estimation procedure is appropriate for use with mi estimate. Additionally, the output produced by mi estimate when the cmdok option is used may or may not be well formatted. Note that an estimation command not currently supported by mi estimate is not necessarily inappropriate for use with MI data, hence the presence of the cmdok option. Yulia Marchenko of Stata wrote an interesting discussion on Statalist of the various reasons a command may not currently be supported by mi estimate.

mi xtset group
mi estimate, cmdok: xtreg y x1 x2

Similarly, the first line of syntax below survey sets the MI data. The second line uses mi estimate followed by the svy: prefix to estimate the mean of the variable y. Estimation commands that are supported by mi estimate more generally (i.e., in non-survey data), for example mean and regress, are also available using mi estimate svy:.

mi svyset su [pweight=pw], strata(s)
mi estimate svy: mean y

If you are using mim, you can set the data as you usually would and then run the appropriate commands with the mim prefix. For example:

xtset group
mim: xtreg y x1 x2

and

svyset su [pweight=pw], strata(s)
mim: svy: mean y

Some concluding remarks

In the first seminar, we reviewed some of the basic concepts in multiple imputation, as well as some of the major options available in the imputation process. We also demonstrated the use of tools for examining patterns of missing values and Stata's mi impute command, introduced in version 11. In this seminar, we introduced the use of the user-written command ice to create MI datasets. We also demonstrated the use of both Stata's mi commands and the user-written package mim to perform data management and analysis with MI datasets. For the most part, the commands we have used were easy to use, so it is easy to forget that the statistical procedures they implement are often complex. Because of this complexity, you will probably want to proceed carefully when implementing MI in your own research. The quality of the analysis model, and hence the final conclusions of the research, is influenced by the quality of the imputation model, so it is important to read the literature on MI carefully and consider all of the options available. In the end, you may also want to try several different strategies, to be sure that your results do not depend on the particular imputation procedures, tests, or datasets used.

References

Graham, John W., Olchowski, Allison E., and Gilreath, Tamika D. (2007). How Many Imputations are Really Needed? Some Practical Clarifications of Multiple Imputation Theory. Prevention Science 8:206-213.

Little, Roderick J.A., and Rubin, Donald B. (2002). Statistical Analysis with Missing Data, Second Edition. Hoboken, New Jersey: Wiley-Interscience.

McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya, and Figueredo, Aurelio Jose (2007). Missing Data: A Gentle Introduction. New York, New York: The Guilford Press.

Molenberghs, Geert, and Kenward, Michael G. (2007). Missing Data in Clinical Studies. Chichester, West Sussex: John Wiley & Sons Ltd.

Potthoff, Richard F., Tudor, Gail E., Pieper, Karen S., and Hasselblad, Vic (2006). Can one assess whether missing data are missing at random in medical studies? Statistical Methods in Medical Research 15:213-234.

van Buuren S., H. C. Boshuizen and D. L. Knook. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694.

von Hippel, Paul T. (2007). Regression with missing y's: an improved strategy for analyzing multiple imputed data, Sociological Methodology, 37.
