Multiple Imputation in Stata, Part 1

Outline of this seminar:

Part 1:

- Introduction
- Multiple imputation
- Missing data patterns and mechanisms
- Building an imputation model
- An example of multivariate normal imputation using **mi impute mvn**

Part 2:

- An example of multiple imputation by chained equations using **ice**
- Managing MI datasets
- Analyzing MI data

This seminar is on multiple imputation using Stata, but imputation is much more than the mechanical process of running commands; it requires creating a model. Building a good imputation model requires knowledge of the data as well as careful consideration of a number of options, a process that can take as long as creating a good analysis model. This seminar includes a brief review of some important concepts in multiple imputation, and in handling missing data more generally, but it is not intended to be a replacement for more thorough treatments of this topic. We have made an effort to point out some of the important steps and decisions in the imputation process. However, handling missing data is a complex and developing topic, so we recommend you read the related literature carefully before implementing multiple imputation or other missing data handling techniques in your own research.

The classic text on handling missing data, now in its second edition, is Statistical Analysis with Missing Data by Little and Rubin (2002). This text is technical, so it may not be the best introduction to the topic, especially for those without a background in mathematics. A more approachable text is Missing Data: A Gentle Introduction by McKnight, McKnight, Sidani, and Figueredo (2007). Another text we like is Missing Data in Clinical Studies by Molenberghs and Kenward (2007). Many other excellent texts on this topic exist; this just happens to be one that we like and have in our library of books for loan. When reading books and articles on MI, we recommend that you be conscious of when they were published because, as mentioned above, MI is a developing field, and what is generally considered good practice may have changed.

Throughout the seminar we will be using a dataset that contains test scores, as well as demographic and school information, for 200 high school students. Below we have summarized this dataset. Note that although the dataset contains 200 cases, six of the variables have fewer than 200 observations. These variables have missing values for between 4.5% (**read**) and 9% (**female** and **prog**) of cases. This doesn't seem like a lot of missing data, so we might be inclined to analyze only the cases with complete data, a strategy sometimes referred to as complete case analysis.

    use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
    (highschool and beyond (200 cases))

    sum

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
              id |       200       100.5    57.87918          1        200
          female |       182    .5549451    .4983428          0          1
            race |       200        3.43    1.039472          1          4
             ses |       200       2.055    .7242914          1          3
          schtyp |       200        1.16     .367526          1          2
    -------------+--------------------------------------------------------
            prog |       182    2.027473    .6927511          1          3
            read |       191    52.28796    10.21072         28         76
           write |       183    52.95082    9.257773         31         67
            math |       185     52.8973    9.360837         33         75
         science |       184    51.30978    9.817833         26         74
    -------------+--------------------------------------------------------
           socst |       200      52.405    10.73579         26         71

Below we use **write**, **read**, **female**, and **math** to predict **socst** in a regression model. The regression model uses just those cases with complete data for all the variables in the model (i.e., no missing values on **socst**, **write**, **read**, **female**, or **math**). This is the default in Stata and many other statistical packages. Looking at the top of the output, we see that only 145 cases were used in the analysis. In other words, more than one quarter of the cases in our dataset (55/200) were excluded from the analysis because of missing data. Below the regression table we use the **estimates store** command to save the results so we can recall them later.

    regress socst write read female math

          Source |       SS       df       MS              Number of obs =     145
    -------------+------------------------------           F(  4,   140) =   28.10
           Model |   6630.7694     4  1657.69235           Prob > F      =  0.0000
        Residual |  8259.47888   140  58.9962777           R-squared     =  0.4453
    -------------+------------------------------           Adj R-squared =  0.4295
           Total |  14890.2483   144  103.404502           Root MSE      =  7.6809

    ------------------------------------------------------------------------------
           socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           write |   .3212789   .1020247     3.15   0.002     .1195706    .5229871
            read |   .3047733   .0899709     3.39   0.001     .1268961    .4826505
          female |   .2233572   1.404163     0.16   0.874    -2.552749    2.999463
            math |   .1988131   .1016747     1.96   0.053    -.0022031    .3998294
           _cons |   9.358279   4.262397     2.20   0.030     .9312916    17.78527
    ------------------------------------------------------------------------------

    estimates store cc

The reduction in sample size alone might be considered a problem, but complete case analysis can also lead to biased parameter estimates. Because this is an example dataset, and we created the missing values for this example, we also have the complete dataset. We can compare the results from the complete case analysis above to the results from the original data (i.e., the dataset with no missing values). Below we open the original dataset and run the same regression model as above. Note that since there is no missing data, all 200 observations were used in this regression. Below the regression output, we store the estimates.

    use http://www.ats.ucla.edu/stat/data/hsb2, clear
    (highschool and beyond (200 cases))

    regress socst write read female math

          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  4,   195) =   44.45
           Model |  10938.9795     4  2734.74487           Prob > F      =  0.0000
        Residual |  11997.2155   195  61.5241822           R-squared     =  0.4769
    -------------+------------------------------           Adj R-squared =  0.4662
           Total |   22936.195   199  115.257261           Root MSE      =  7.8437

    ------------------------------------------------------------------------------
           socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           write |   .3757491   .0852101     4.41   0.000     .2076975    .5438007
            read |   .3696825   .0775725     4.77   0.000     .2166938    .5226712
          female |  -.2340534   1.207995    -0.19   0.847    -2.616465    2.148358
            math |   .1209005   .0861526     1.40   0.162    -.0490101    .2908111
           _cons |   7.029076   3.562453     1.97   0.050      .003192    14.05496
    ------------------------------------------------------------------------------

    estimates store full

Now we use **estimates table** to display the results of the complete case analysis (labeled **cc**) and the analysis of the full dataset (labeled **full**). For each of the variables in our model (as well as the constant), the table lists three values: the coefficient estimate, below that the standard error, and below that the p-value for the coefficient. Comparing the coefficients, as well as their standard errors and p-values, from the complete case and full data analyses, we can see that every coefficient shows some bias (i.e., a difference between the two analyses). Among the predictors, the largest absolute difference is in the coefficient for **female**; however, that coefficient was non-significant in both models. In the case of **math**, the coefficient in the complete case analysis is close to the conventional 0.05 threshold (p = 0.053), while in the model estimated with the full data it is not (p = 0.162), so our inference about this coefficient could differ between the two models. The coefficients for **read** and **write**, as well as their p-values, are similar in the two analyses. The intercepts differ by a relatively small amount (given the scale of the dependent variable). There is no consistent pattern of either larger or smaller coefficient estimates between the two, but the standard errors from the analysis of the full data are all smaller, due in part to the larger sample size.

    estimate table cc full, b se p

    ----------------------------------------
        Variable |     cc          full
    -------------+--------------------------
           write |  .32127885     .3757491
                 |  .10202467    .08521005
                 |     0.0020       0.0000
            read |  .30477331    .36968249
                 |  .08997086    .07757247
                 |     0.0009       0.0000
          female |  .22335724   -.23405342
                 |  1.4041631    1.2079946
                 |     0.8738       0.8466
            math |  .19881314    .12090052
                 |  .10167466    .08615264
                 |     0.0525       0.1621
           _cons |   9.358279    7.0290761
                 |  4.2623968    3.5624529
                 |     0.0298       0.0499
    ----------------------------------------
                              legend: b/se/p

We can also compare these results to those from a multiply imputed dataset. Below we use a multiply imputed dataset to estimate our model.

    use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
    mi estimate, post: reg socst write read female math

    Multiple-imputation estimates                     Imputations     =          5
    Linear regression                                 Number of obs   =        200
                                                      Average RVI     =     0.0820
                                                      Complete DF     =        195
    DF adjustment:   Small sample                     DF:     min     =      59.71
                                                              avg     =     121.37
                                                              max     =     181.12
    Model F test:       Equal FMI                     F(   4,  163.6) =      38.78
    Within VCE type:          OLS                     Prob > F        =     0.0000

    ------------------------------------------------------------------------------
           socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           write |   .3472116   .0956238     3.63   0.000     .1572004    .5372228
            read |   .3673822   .0803328     4.57   0.000     .2086775    .5260869
          female |    .525372   1.375176     0.38   0.704    -2.225667    3.276411
            math |   .1508523   .0908884     1.66   0.099    -.0290372    .3307417
           _cons |    6.59747   3.707945     1.78   0.077    -.7188551     13.9138
    ------------------------------------------------------------------------------

    estimates store mi

Below we use **estimates table** to display the results from all three models next to each other. As in the previous table, the estimate, standard error, and p-value for each coefficient are listed (in that order). Comparing the results across the three analyses, we see that, with the exception of the coefficient for **female** (which is far from significant in all three models), the coefficient estimates from the MI analysis are closer to those from the full dataset than are those from the complete case analysis; that is, the MI coefficients contain less bias (although the magnitude of these differences is often small in this example). In general, if done well, analysis using MI should result in coefficients with less bias than complete case analysis.

    estimates table cc full mi, b se p

    -----------------------------------------------------
        Variable |     cc          full          mi
    -------------+---------------------------------------
           write |  .32127885     .3757491    .34721159
                 |  .10202467    .08521005    .09562376
                 |     0.0020       0.0000       0.0004
            read |  .30477331    .36968249    .36738221
                 |  .08997086    .07757247    .08033285
                 |     0.0009       0.0000       0.0000
          female |  .22335724   -.23405342    .52537204
                 |  1.4041631    1.2079946    1.3751758
                 |     0.8738       0.8466       0.7028
            math |  .19881314    .12090052    .15085228
                 |  .10167466    .08615264    .09088836
                 |     0.0525       0.1621       0.0986
           _cons |   9.358279    7.0290761    6.5974704
                 |  4.2623968    3.5624529    3.7079453
                 |     0.0298       0.0499       0.0768
    -----------------------------------------------------
                                           legend: b/se/p

To impute values generally means to replace missing values with some other value. There are a variety of methods of selecting the imputed value. One very simple approach is to replace missing values with the sample mean. While popular, mean imputation produces distributions that have far too many cases at the mean. More importantly, mean imputation can often produce estimates that are more biased than those from complete case analysis (Little & Rubin 2002, pg. 62). Another possibility is to perform a conditional mean imputation; that is, rather than imputing the sample mean, use the mean from cases that are similar to the case with the missing values in important ways. Replacing missing values with predicted values from a regression analysis of the complete data is a form of conditional mean imputation. What these methods of imputation have in common is that the imputed values are completely determined by a model applied to the observed data; in other words, they contain no error. This tends to reduce variance, and can distort relationships among variables. An alternative approach is to incorporate some error into the imputed values. The values imputed in multiple imputation are draws from a distribution; in other words, they inherently contain some variation. This variation is important for several reasons, not just for creating reasonable distributions.
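The variance-shrinking effect of mean imputation described above can be seen in a few lines of Stata. This is only an illustrative sketch: the five-value variable **x** and the names **x_meanimp**, **mean_obs**, **sd_obs**, and **sd_imp** are made up for this example and are not part of the seminar dataset.

```stata
* Toy data (hypothetical): three observed values and two missing values
clear
input x
10
20
30
.
.
end

* Mean and standard deviation of the observed values only
summarize x
scalar mean_obs = r(mean)
scalar sd_obs   = r(sd)

* Mean imputation: fill every missing value with the observed mean
generate x_meanimp = x
replace x_meanimp = mean_obs if missing(x)

* The mean is unchanged, but the standard deviation is smaller
summarize x_meanimp
scalar sd_imp = r(sd)
display "SD of observed values:   " sd_obs
display "SD after mean imputation: " sd_imp
```

Because every imputed case sits exactly at the mean, the imputed variable has less spread than the observed values, which is the distortion the paragraph above warns about.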

A limitation of single imputation is that it treats imputed values as though they were observed, which is not the case; imputations are only estimates. As a result, standard analyses of a single imputation will tend to overstate our confidence in the parameter estimates; that is, the standard errors are too small. Multiple imputation addresses this problem by introducing an additional form of error based on variation in the parameter estimates across the imputations, so-called between imputation error. An MI analysis involves three steps. First, an imputation model is formulated and a series of imputed datasets are created. Second, the analysis of each imputed dataset is carried out separately; for example, you might calculate a mean or run a regression model. Finally, the estimates from the imputed datasets are combined, or pooled, to generate a single set of estimates. For parameters (e.g., means or regression coefficients), the MI estimate is simply the mean of the parameter estimates across the imputations. The calculation of the standard errors is a little trickier. As mentioned above, the MI estimate of the standard error of a parameter contains two components: the within imputation variance and the between imputation variance. The within imputation variance is the average of the variance (i.e., the standard error squared) across the imputations. The between imputation variance is a function of the variance of the parameter estimate across the imputed datasets and the number of imputations. The MI estimate of the standard error is the square root of the within and between variances added together. This process allows us to see how much our results change when different values are imputed. Combining the results across the imputations allows us to account for the uncertainty in the imputed values.
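The pooling step described above is usually written out using what are known as Rubin's rules. The notation below is our own shorthand rather than anything printed by Stata: $m$ is the number of imputations, and $\hat{Q}_i$ and $SE_i$ are the estimate of a parameter and its standard error from the analysis of imputed dataset $i$.

```latex
% m imputations; \hat{Q}_i and SE_i come from the analysis of imputed dataset i
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i
  && \text{(MI point estimate)}\\
W &= \frac{1}{m}\sum_{i=1}^{m} SE_i^{2}
  && \text{(within imputation variance)}\\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i-\bar{Q}\right)^{2}
  && \text{(variance of the estimates across imputations)}\\
T &= W + \left(1+\frac{1}{m}\right)B
  && \text{(total variance)}\\
SE_{MI} &= \sqrt{T}
  && \text{(MI standard error)}
\end{align*}
```

The factor $(1 + 1/m)$ is where the number of imputations enters the between imputation component: with few imputations, the across-imputation variance is inflated slightly to reflect the fact that $\bar{Q}$ is itself based on a finite number of imputed datasets.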

It is important to note that MI is not the only appropriate method of handling missing data. Another particularly good method is full information maximum likelihood (FIML). Both FIML and MI have advantages and disadvantages, depending on your specific situation. To our knowledge, the FIML approach has not been implemented in Stata. FIML is commonly implemented in structural equation modeling packages such as Mplus, LISREL, and EQS.

As with most, if not all, analyses, the first step in handling missing data is to get to know the data.
This includes the usual exploratory data techniques, such as
examining means and standard deviations, and graphing distributions. In
addition, in the presence of missing data it is important to understand not just how much data is missing,
but the patterns of missing values. Stata provides tools to do this.
Throughout the process of creating and analyzing MI data in Stata, which tools
are available depends on which version of Stata you have access to. Starting
with version 11, Stata has a suite of commands for handling MI data, all of
which start with the **mi** prefix. In earlier versions of Stata, a number of user
written tools are available for working with missing data. We will introduce
both Stata's built in commands and user written commands as we go along (see our FAQ: How do I use the
findit command to search for programs and additional help? for more information about finding and
installing user written commands).

Before we can use the **mi** commands, the data must be **mi set**. Fortunately, we don't need to know anything about the missing data structure in order to use the **mi set** command. We do need to declare the dataset style, but, since the data hasn't been imputed yet, we can use any style. The style can always be changed using the **mi convert** command (for more information on MI data styles see Stata's documentation on them by typing **help mi styles** in the Stata command window). For now we set the MI storage format to be **wide**. Next we use the **mi misstable sum** command to begin to explore the pattern of missing values in our dataset.

    use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
    (highschool and beyond (200 cases))

    mi set wide
    mi misstable sum

                                                                   Obs<.
                                                    +------------------------------
                   |                                | Unique
          Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
      -------------+--------------------------------+------------------------------
            female |        18                 182  |      2          0           1
              prog |        18                 182  |      3          1           3
              read |         9                 191  |     30         28          76
             write |        17                 183  |     29         31          67
              math |        15                 185  |     39         33          75
           science |        16                 184  |     32         26          74
      -----------------------------------------------------------------------------

The output above shows us which variables have missing values. The second, third, and fourth columns tell us about the number and type of missing values. The column labeled Obs=. tells us how many cases have a system missing value (i.e., "."), the column labeled Obs>. gives the number of cases with so-called extended missing values (i.e., ".a", ".b", ".c"...), and the column labeled Obs<. gives the number of cases with non-missing values for each variable. The use of greater than and less than symbols may seem somewhat confusing. The reason for them is this: when Stata stores data, it stores system missing values as a very large number, and when the value is an extended missing value (e.g., .a), Stata stores an even larger value in its place. As a result, any observed value in the dataset is less than the value stored for a system missing value (hence Obs<.), and the value stored for an extended missing value is larger than the value stored for a system missing value (hence Obs>.). The distinction between system and extended missing values is important because Stata's **mi impute** command will not impute extended missing values (the user written program **ice** does not make this distinction). Therefore, if you wish to use **mi impute** to impute values for cases with extended missing values, you will first need to replace the extended missing values with system missing values. The final three columns give information about the cases with observed (i.e., non-missing) values, specifically, the number of unique values (Unique values), as well as the minimum (Min) and maximum (Max) values that each variable takes on.
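As a minimal sketch of the recoding step mentioned above, the example below converts extended missing values to system missing values. The toy variables **id** and **score** are hypothetical and are not part of the seminar dataset. Before doing this on real data, check that the extended codes do not carry information (for example, "not applicable") that should keep those cases out of the imputation.

```stata
* Toy data (hypothetical): one system missing and two extended missing values
clear
input id score
1 52
2 .
3 .a
4 47
5 .b
end

* Before: Obs=. counts system missing, Obs>. counts extended missing
misstable summarize score

* Recode extended missing values (.a-.z) to system missing (.)
* Stata orders values as: observed numbers < . < .a < ... < .z
replace score = . if score > .

* After: all three missing values are reported under Obs=.
misstable summarize score
count if score == .
```

The condition `score > .` is true only for extended missing values, so observed values and existing system missing values are left untouched.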

Next, we want to explore the pattern of missing values in the dataset. It is important to know about the pattern of missing values because the patterns of missingness can sometimes suggest why the values are missing. For example, are there a lot of missing values for certain variables? Below is a sample dataset. Instead of showing observed values, each case (row) contains a 1 if the value was observed, and a 0 if the value for that case is missing. The dataset contains four variables, **v1**-**v4**. Notice that while **v1** is observed for all cases, and **v2** and **v3** are observed for most cases, variable **v4** has a lot of missing values. When you encounter a variable that has many more missing values than others, you probably want to ask yourself why this variable has so many missing values. Sometimes, the question is part of a skip pattern or just doesn't apply to some respondents. However, you may also encounter variables with a large number of missing values without such obvious causes. In these cases you may want to put some thought into why that particular variable would be missing so often. Was it a particularly sensitive survey question? Is this a reading generated by an unreliable machine or process? Was there a data entry or processing error that created missing values when the values were observed?

    v1  v2  v3  v4
     1   1   1   1
     1   1   1   0
     1   0   1   0
     1   1   0   0
     1   1   1   1
     1   0   0   0

Another thing to watch out for is cases that are missing a lot of data. In the example above, the pattern shown in the sixth row is missing more values than the other rows. As with variables that are missing a lot of data, we probably want to ask ourselves if there is anything unusual about these cases that would lead to a large number of missing values. If there are multiple cases with a lot of missing data, do they have anything in common? Carefully exploring the data, and thinking about what we find, can aid us in making better decisions about how to treat the missing values later on.

Another thing to take note of is the pattern of missingness, which influences what kinds of imputation can be used. Missing data patterns are commonly described as either monotone or arbitrary. Below is another example dataset, where 1s indicate observed values and 0s missing values. This dataset is an example of monotone missingness: all values of **v1** are observed, all but the final two values of **v2** are observed, and so on. It may be necessary to reorder variables and/or cases in order to be able to "see" monotone missingness. That is fine; as long as it is possible to do so, the missing data pattern is considered monotone.

    v1  v2  v3  v4
     1   1   1   1
     1   1   1   1
     1   1   1   0
     1   1   0   0
     1   0   0   0
     1   0   0   0

For comparison, below is an example of what an arbitrary missing data pattern looks like (again, a value of 1 represents an observed value, while 0 indicates a missing value). Note that it would not be possible to reorder the variables and/or cases to form a monotone pattern.

    v1  v2  v3  v4
     1   1   0   1
     0   1   1   0
     1   0   1   0
     1   1   0   0
     0   1   1   1
     1   1   0   1

As you might imagine, when the cases and variables aren't ordered nicely, it can be difficult to spot a monotone missing data pattern. Below we use the **mi misstable nested** command to examine the nesting structure of the missing values. The output lists each variable as a separate numbered entry, with no arrows; that is, there is no pair of variables for which a missing value on one always coincides with a missing value on the other. This suggests that the pattern of missingness in the test score dataset is non-monotone.

    mi misstable nested

         1.  read(9)
         2.  math(15)
         3.  science(16)
         4.  write(17)
         5.  prog(18)
         6.  female(18)

If the data was perfectly monotone (as in the example above), the output would look something like that shown below. The 2 in parentheses after **v2** tells us that this variable has two missing values. The arrow (i.e., -> ) pointing towards **v3** tells us that when **v2** is missing, **v3** is always missing as well. The second arrow, which points towards **v4**, tells us that whenever **v3** is missing (and therefore whenever **v2** is missing), **v4** is missing as well.

         1.  v2(2) -> v3(3) -> v4(4)

The command **mi misstable patterns** provides another way to examine the patterns of missing data in our dataset. Below we use **mi misstable patterns** to display the missing data patterns. The first column gives the percent of cases in each pattern. There are six additional columns, one for each variable with missing values. The order of the variables is shown below the table. In the body of the table, a 1 indicates the variable was observed in that pattern, and a 0 indicates that the variable is missing in that pattern. The most common pattern (59% of cases) is no missing values at all. The next most common pattern (8% of cases) is missing on **female** (the variable for the column labeled 5), but all other variables are observed. The **bypatterns** option, used to generate the second set of output below, groups the missing data patterns based on the number of variables missing in that pattern, rather than the frequency of the missing data pattern. Both of these tables can be useful in detecting cases with a large number of missing values.

    mi misstable patterns

       Missing-value patterns
         (1 means complete)

                  |   Pattern
        Percent   |  1  2  3  4  5  6
      ------------+--------------------
           59%    |  1  1  1  1  1  1
                  |
            8     |  1  1  1  1  0  1
            7     |  1  1  1  1  1  0
            7     |  1  1  0  1  1  1
            6     |  1  1  1  0  1  1
            6     |  1  0  1  1  1  1
            4     |  0  1  1  1  1  1
            1     |  1  0  1  0  1  1
           <1     |  0  1  0  1  1  1
           <1     |  1  0  1  1  0  1
           <1     |  1  0  1  1  1  0
           <1     |  1  1  0  0  1  1
           <1     |  1  1  0  1  1  0
           <1     |  1  1  1  0  0  1
           <1     |  1  1  1  0  1  0
           <1     |  1  1  1  1  0  0
      ------------+--------------------
          100%    |

      Variables are  (1) read  (2) math  (3) science  (4) write  (5) female  (6) prog

    mi misstable patterns , bypatterns

       Missing-value patterns
         (1 means complete)

                  |   Pattern
        Percent   |  1  2  3  4  5  6
      ------------+--------------------
           59%    |  1  1  1  1  1  1
                  |
            1:    |
            7     |  1  1  1  1  1  0
            8     |  1  1  1  1  0  1
            6     |  1  1  1  0  1  1
            7     |  1  1  0  1  1  1
            6     |  1  0  1  1  1  1
            4     |  0  1  1  1  1  1
            2:    |
           <1     |  1  1  1  1  0  0
           <1     |  1  1  1  0  1  0
           <1     |  1  1  1  0  0  1
           <1     |  1  1  0  1  1  0
           <1     |  1  1  0  0  1  1
           <1     |  1  0  1  1  1  0
           <1     |  1  0  1  1  0  1
            1     |  1  0  1  0  1  1
           <1     |  0  1  0  1  1  1
      ------------+--------------------
          100%    |

      Variables are  (1) read  (2) math  (3) science  (4) write  (5) female  (6) prog

Below we show what the output would look like if the data was perfectly monotone, using the small dataset from above. Because **v1** is observed for all observations, it is entirely omitted from the table. The striking feature of the table below is that, except for the first row, the upper right portion (on and above the diagonal) is all 0s, while the lower left portion is all 1s, reflecting the monotone missing data pattern.

       Missing-value patterns
         (1 means complete)

                  |   Pattern
        Percent   |  1  2  3
      ------------+-------------
           33%    |  1  1  1
                  |
           33     |  0  0  0
           17     |  1  0  0
           17     |  1  1  0
      ------------+-------------
          100%    |

      Variables are  (1) v2  (2) v3  (3) v4

Similar output can be produced using the user-written program **mvpatterns**. One advantage of **mvpatterns** is that it can be used in Stata 10 and earlier. The command and its output are shown below. The first table in the output lists the variables with missing values, along with their storage type (labeled type), number of observations (obs), number of missing values (mv), and the variable label. The second table produced by **mvpatterns** shows the missing data patterns. Under the heading _pattern is a visual representation of the missing data patterns; the columns under this heading represent the variables with missing values (in the order shown in the first table). A plus sign ("+") indicates that the variable is observed in a given missing data pattern, while a period (".") indicates that the variable is not observed in that missing data pattern. The second table also shows the number of missing values in each missing data pattern (_mv), and the frequency of that pattern (_freq). From this table we can see that the most frequent missing data pattern, shown in the first row, is actually no missing data at all (i.e., "++++++").

    mvpatterns

    variables with no mv's: id race ses schtyp socst _mi_miss

    Variable     | type     obs   mv   variable label
    -------------+------------------------------------
    female       | float    182   18
    prog         | float    182   18   type of program
    read         | float    191    9   reading score
    write        | float    183   17   writing score
    math         | float    185   15   math score
    science      | float    184   16   science score
    --------------------------------------------------

    Patterns of missing values

      +------------------------+
      | _pattern   _mv   _freq |
      |------------------------|
      |   ++++++     0     117 |
      |   .+++++     1      15 |
      |   +.++++     1      14 |
      |   +++++.     1      13 |
      |   +++.++     1      12 |
      |------------------------|
      |   ++++.+     1      11 |
      |   ++.+++     1       8 |
      |   +++..+     2       2 |
      |   +++.+.     2       1 |
      |   ++.++.     2       1 |
      |------------------------|
      |   +.+++.     2       1 |
      |   +.++.+     2       1 |
      |   +.+.++     2       1 |
      |   .+++.+     2       1 |
      |   .++.++     2       1 |
      |------------------------|
      |   ..++++     2       1 |
      +------------------------+

You can also request a description of the missing data in tree form. This may be a particularly useful method of displaying data that is near-monotone, because it allows one to easily see this type of pattern even if the variables aren't arranged in the proper order. **mi misstable tree** becomes less useful, and more difficult to create, when the number of variables with missing data is large. Below is the command and its output; we have used the **frequency** option so that the output displays the number of cases with missing values rather than the percent of the total that are missing. The output can be a little tricky to read at first. The variables are listed starting with the variable missing the most values (on the left) and ending with the variable missing the fewest values (on the right), so we know that **female** is the variable with the most missing values (tied with **prog**), and **read** the fewest (which we also know from the earlier output). If we start reading down the first column, the output tells us that 18 cases are missing values of **female**, while 182 cases are observed. Now reading across the first row, we see that of the 18 cases with missing values on **female**, 1 is also missing on **prog**, and the other 17 are not. Reading further down we see that of the 182 cases with valid values of **female**, 17 have missing values on **prog**, while 165 do not. Returning to the first row, and reading across to the third through sixth columns, we see that the case that is missing on both **female** and **prog** does not have missing values on any other variables. Looking further down the table, we see that of the 17 cases missing on **female**, but not missing on **prog**, 1 has a missing value for **write**, and so on. This confirms what we already suspected about the data: the pattern of missing values is arbitrary rather than monotone.

    mi misstable tree , frequency

    Nested pattern of missing values
       female     prog    write  science     math     read
    ------------------------------------------------------
           18        1        0        0        0        0
                                                         0
                                                0        0
                                                         0
                                       0        0        0
                                                         0
                                                0        0
                                                         0
                              1        0        0        0
                                                         0
                                                0        0
                                                         0
                                       1        0        0
                                                         0
                                                1        0
                                                         1
                    17        1        0        0        0
                                                         0
                                                0        0
                                                         0
                                       1        0        0
                                                         0
                                                1        0
                                                         1
                             16        0        0        0
                                                         0
                                                0        0
                                                         0
                                      16        1        0
                                                         1
                                               15        0
                                                        15
          182       17        1        0        0        0
                                                         0
                                                0        0
                                                         0
                                       1        0        0
                                                         0
                                                1        0
                                                         1
                             16        1        0        0
                                                         0
                                                1        0
                                                         1
                                      15        1        0
                                                         1
                                               14        0
                                                        14
                   165       15        1        0        0
                                                         0
                                                1        0
                                                         1
                                      14        2        0
                                                         2
                                               12        0
                                                        12
                            150       14        0        0
                                                         0
                                               14        1
                                                        13
                                     136       11        0
                                                        11
                                              125        8
                                                       117
    ------------------------------------------------------
     (number missing listed first)

The output from **mi misstable tree , frequency** run on the small monotone dataset from above is shown below. Here we can see the monotone missing structure. The important thing to notice is that, reading from left to right, once a case has an observed value for one variable, it has observed values for all subsequent variables.

    Nested pattern of missing values
           v4       v3       v2
    ---------------------------
            4        3        2
                              1
                     1        0
                              1
            2        0        0
                              0
                     2        0
                              2
    ---------------------------
     (number missing listed first)

The missing data mechanism is the process that generates missing values, that is, what predicts whether a given value is missing or not. Missing data mechanisms generally fall into three categories: missing completely at random, missing at random, and missing not at random. There are precise technical definitions for these terms in the literature; the following explanation necessarily contains simplifications.

When data is missing completely at random (MCAR), neither other variables in the dataset, nor the unobserved value of the variable itself, predict whether a value will be missing. MCAR is a fairly strong assumption, and tends to be relatively rare. One relatively common scenario in which data can be assumed to be missing completely at random is when a subset of cases is randomly selected to undergo additional measurement, for example, health surveys in which some subjects are randomly selected to undergo a more extensive physical examination.

Data is said to be missing at random (MAR) if other variables in the dataset can be used to predict missingness on a given variable. For example, in surveys, men may be more likely to decline to answer some questions than women (i.e., gender predicts missingness on other variables). MAR is a less restrictive assumption than MCAR.

Finally, data is said to be missing not at random (MNAR, sometimes also called not missing at random, or NMAR) if the value of the unobserved variable itself predicts missingness. A classic example of this is income: individuals with very high incomes are more likely to decline to answer questions about their income than individuals with more moderate incomes. It can be difficult to determine whether variables are MNAR, because the information that would confirm that values are MNAR is unobserved. As a result, the decision to treat data as MNAR is often made based on theoretical and/or substantive information, rather than information present in the data itself.

The missing data mechanism is important, because different types of missing data require different treatment. When data is missing completely at random, analyzing only the complete cases will not result in biased parameter estimates (e.g., regression coefficients). It can, however, substantially reduce the sample size for an analysis, leading to larger standard errors. In contrast, analyzing only complete cases for data that is either missing at random or missing not at random can lead to biased parameter estimates. Multiple imputation generally assumes that the data is at least missing at random, meaning that it can also be used on data that is missing completely at random. Note that procedures for imputing data that is missing not at random have been developed, but to our knowledge have not been implemented in Stata.

If we have carefully examined the missing values in our dataset, we may
already have some sense of how the missing values were generated. It is possible to examine whether
values of the variables in the dataset predict missingness on other variables,
suggesting that the data may be missing at random. One method of doing
this is as follows. First, for each variable with missing values, create a binary
variable, equal to 1 if the value is missing, and 0 if it is observed. Next,
examine the relationship between that indicator and the other variables in the
dataset. Because it is not uncommon to have a large number of variables with
missing values, below we use a loop to go through these steps for each of the
variables in our dataset that has missing values. First, we create a local
macro, named **corrvars**, that contains a list of the variables that might
predict missingness in the variables with missing values. Next, we use a
**foreach** loop to repeat a set of commands for each variable in our list. The
loop will repeat the actions inside the braces for each of the variables following
the word **varlist**. The first line inside the loop (the third line of
syntax) creates a binary variable that indicates missingness; for example, for
**female**
it creates the variable **m_female**, which is equal to 1 when **female**
is missing and 0 otherwise. Next, the indicator variable is correlated
with all of the variables listed in the local macro **corrvars**, omitting
the variable whose indicator is being examined (e.g., **female** is removed from
the list of variables to be correlated with **m_female**).

    local corrvars "female ses schtyp read write math science socst"
    foreach var of varlist female science read write math {
        gen m_`var' = missing(`var')
        pwcorr m_`var' `: list corrvars - var'
    }

Below we show the output for correlating the indicators of missingness for **female** and
**math**
(i.e., **m_female** and **m_math** respectively) with other variables from the dataset. The loop above
would also produce correlation tables for all other variables listed after
**varlist** (i.e., **science**, **read**, and **write**), but this output has been omitted. In
the first table, we see the correlations between the indicator variable,
**m_female**, and the other variables in the dataset. Note that we have not
requested significance tests, because whether the correlations between
missingness and the other variables are statistically significant is unimportant.
We are not, after all, making inferences to a larger population; we are simply
looking for patterns within the dataset. The correlations of missingness on **female**
(i.e., **m_female**) with **read**,
**write**, and **math** stand out as the largest values, but overall, we see
small to moderate correlations between missingness on **female** and the
other variables in our dataset.

                 | m_female      ses   schtyp     read    write     math  science    socst
    -------------+------------------------------------------------------------------------
        m_female |   1.0000
             ses |  -0.1207   1.0000
          schtyp |   0.0057   0.1367   1.0000
            read |  -0.1499   0.2865   0.0669   1.0000
           write |  -0.1512   0.2109   0.1492   0.5872   1.0000
            math |  -0.1428   0.2692   0.0908   0.6589   0.6182   1.0000
         science |  -0.0291   0.2708   0.0733   0.6329   0.5498   0.6296   1.0000
           socst |  -0.1228   0.3319   0.0968   0.6160   0.5975   0.5451   0.4512   1.0000

                 |   m_math   female      ses   schtyp     read    write  science    socst
    -------------+------------------------------------------------------------------------
          m_math |   1.0000
          female |  -0.1149   1.0000
             ses |  -0.1268  -0.1140   1.0000
          schtyp |  -0.1243  -0.0028   0.1367   1.0000
            read |  -0.0560  -0.0174   0.2865   0.0669   1.0000
           write |  -0.0285   0.2508   0.2109   0.1492   0.5872   1.0000
         science |   0.0981  -0.0918   0.2708   0.0733   0.6329   0.5498   1.0000
           socst |  -0.1880   0.0889   0.3319   0.0968   0.6160   0.5975   0.4512   1.0000

    <output omitted>

Finding correlations between missingness on a given variable and other variables in the dataset is consistent with the MAR assumption, but does not evaluate the assumption that missing values on a given variable are unrelated to the (unobserved) value of that variable. Potthoff, Tudor, Pieper, and Hasselblad (2006) discuss a technique for assessing the degree to which this assumption is tenable.
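As a complement to the correlation tables, it can be easier to read group summaries of the fully observed variables, split by whether a value is missing. The following is a minimal sketch rather than part of the original analysis. It assumes the example dataset is available at the URL used elsewhere in this seminar, and the choice of **math** and of the three complete variables is ours, purely for illustration.

```stata
* Sketch: summarize complete variables by missingness on math.
* Assumes the seminar's example dataset can be loaded from this URL.
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear

* Indicator: 1 if math is missing, 0 if it is observed
gen byte m_math = missing(math)

* Mean, standard deviation, and count of each complete variable,
* shown separately for cases with and without a value for math
tabstat ses schtyp socst, by(m_math) statistics(mean sd n)
```

Large differences between the two groups point in the same direction as a sizable correlation: the observed variable carries information about which cases are missing.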

The quality of the imputation model will influence the quality of the final results, so it is important to carefully consider the design of the imputation model. In general, one wants to use as much care building an imputation model as one uses in building an analysis model. Hence, it can take as long, or longer, to build a good imputation model as it takes to build a good analysis model. There are a number of important decisions to be made when building an imputation model, including, but not limited to, what method should be used to generate the imputations; whether to impute for a specific analysis/model, or to impute for an entire dataset; what, if any, "auxiliary" variables should be included in the imputation model; and how many imputed datasets should be generated. Below is a brief description of some of the options for each of these decisions. It is not intended to be a thorough discussion of all possible options. We recommend that you read the literature on MI for a thorough discussion of these issues. Because many issues in this area are unresolved, care should be taken to consult recent sources, as what is considered good practice may change. Additionally, while various procedures have advantages and disadvantages, we are unaware of research clearly supporting one technique over another in all circumstances. The best practice may be to repeat the analysis under different imputation models to see if, and how, changes in the imputation model result in changes in the final results.

**If the pattern of missingness is monotone**

You may have wondered why such a big deal was made of checking to see
if the pattern of missingness was monotone, or even near monotone. If the
missingness is monotone the process of imputation becomes much easier from a
statistical standpoint. One of the advantages of imputing monotone data is that
it is relatively easy to impute binary, ordinal, or categorical variables,
something that is trickier with arbitrary missing data patterns. Because
monotone missing data patterns are relatively rare, we won't discuss the process
of imputing them in depth here. The Stata command for imputing monotone missing
data is **mi impute monotone**. Once imputed, MI datasets from data with
monotone missingness are analyzed in the same manner as other MI datasets.
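To give a sense of what the command looks like, here is a small sketch using simulated data. The variables (**x**, **y1**, **y2**, **y3**), the sample size, and the seed are all invented for illustration and do not come from the example dataset, which does not have a monotone pattern.

```stata
* Sketch with simulated data; every name and number here is hypothetical.
clear
set seed 2011
set obs 200
gen x  = rnormal()                      // complete predictor
gen y1 = 1 + x + rnormal()              // continuous
gen byte y2 = (x + rnormal()) > 0       // binary (0/1)
gen y3 = 2 + y1 + rnormal()             // continuous

* Build a monotone pattern: anyone missing y1 is also missing y2 and y3,
* and anyone missing y2 is also missing y3
replace y3 = . in 1/40
replace y2 = . in 1/20
replace y1 = . in 1/10

* Check that the missing-value pattern is nested
misstable nested y1 y2 y3

mi set flong
mi register imputed y1 y2 y3
mi register regular x

* One univariate method per variable, listed from most to least observed
mi impute monotone (regress) y1 (logit) y2 (regress) y3 = x, add(5)
```

Note how the binary variable is given its own method (**logit**), which is what makes categorical variables comparatively easy to handle under a monotone pattern.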

**The multivariate normal model and chained equations approaches**

When the pattern of missingness is arbitrary, two common approaches to creating MI
datasets are imputation using the multivariate normal model, and imputation
using the chained equations approach. The first, introduced by Rubin (see, e.g.,
Little and Rubin 2002), involves
drawing from a multivariate normal distribution of all the variables in the
imputation model. The second approach, imputation by chained
equations, is sometimes referred to as the ICE or MICE (for multiple imputation by
chained equations) approach. The ICE approach generates imputations by
performing a series of *univariate*
regressions. Both the multivariate normal and ICE approaches are available in Stata. The
multivariate normal approach is implemented in the **mi impute mvn** command
introduced in Stata 11. The multivariate normal model has also been implemented
in a user-written plug-in **inorm** (for help locating and installing this package see our
FAQ: How do I use the findit command to search for programs and additional help?).
**inorm** requires Stata version 9 or later, making it available to users who
do not have access to Stata 11. The ICE approach is implemented in a user-written
package called **ice**.
The current version of **ice** requires Stata version 9.2 or later.

The multivariate normal approach has stronger theoretical underpinnings, and some better statistical properties, but the ICE approach seems to work well in practice. An advantage of the ICE approach is that the variables are not assumed to have a multivariate normal distribution. However, the multivariate normal approach tends to be robust to departures from normality. One common concern among users new to the multivariate normal approach is that the normality assumption results in imputed values that do not necessarily resemble the observed values. For example, imputed values of a binary (0,1) variable may take on any value under the multivariate normal approach. Mathematically, this is not, generally speaking, a problem, but users sometimes find it disconcerting. Another advantage of the ICE approach is that because it involves a series of univariate models, rather than a single large model, it can be somewhat easier to estimate. This can be useful if you have a large number of variables in the imputation model or a relatively small number of cases. However, because the imputations are based on a series of univariate models, imputation models using the ICE approach can be tedious to specify. As the above is intended to suggest, neither model is necessarily "better" than the other in all situations. We recommend that you read the literature to learn about both approaches, and then carefully consider which approach is best suited to your situation. You may also want to consider creating MI datasets with both approaches and comparing the results of your final analysis using the two approaches.
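To make the idea of "a series of univariate models" concrete, the sketch below specifies one model per imputed variable for the example dataset. It is an illustration under an important assumption: it uses the official **mi impute chained** command, which requires Stata 12 or later and is not the user-written **ice** package this seminar relies on. The choice of a logit model for **female** and a multinomial logit for **prog** is ours.

```stata
* Sketch only: assumes Stata 12 or later (-mi impute chained-).
* This is NOT the user-written -ice- command used in part two.
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear

mi set flong
mi register imputed female prog read write math science
mi register regular race ses schtyp socst

* Each imputed variable gets its own univariate model:
*   female -> logit, prog -> multinomial logit, test scores -> regress
mi impute chained (logit) female (mlogit) prog ///
    (regress) read write math science          ///
    = i.race ses schtyp socst, add(5) rseed(49230)
```

Because each variable is modeled according to its type, imputed values of **female** stay 0/1 and **prog** stays in its original categories, at the cost of having to choose and check a model for every variable.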

**What variables should go in the model?**

At one extreme, one can run an imputation model that contains only those variables to be used in a specific analysis model. At the other end of the spectrum, one could run an imputation model that includes all of the variables in a dataset. The advantage to imputing for a specific analysis or set of analyses is that one can be sure to include all relevant variables, non-linear terms, interactions, etc., which may not be possible when imputing for an entire dataset. One advantage of imputing the entire dataset is that the imputation model then uses all of the information in the dataset. Further, if one imputes most or all of the variables in the dataset, the resulting MI dataset can be used for most or all future analyses. Creating a single imputed dataset for use in future studies makes it easier for you, and others, to replicate the results (this is a good reason why one might want to use the MI datasets sometimes distributed with large public use datasets).

Even if you intend to impute for a specific analysis, you may want to include some variables in the multiple imputation model that are not in the planned analysis; these are sometimes called auxiliary variables. Auxiliary variables may or may not have missing values. Regardless of which strategy you choose, you may want to include variables that do not have missing values in the imputation model. Note that the number of variables that can be included in an imputation model also depends on the imputation method and sample size. When collecting or downloading data, if you anticipate a lot of missing values on a specific measure, you can sometimes plan for auxiliary variables. If you are still in the research design phase, you can include additional measures, which may be less prone to missingness. For example, in a questionnaire, individuals may decline to report their income, but be more willing to report the type of car they drive or the number of rooms in their house, items that may be useful in imputing income. Similarly, if you are using a large existing dataset, you may want to download (or otherwise obtain) variables that aren't necessary for your analysis model but may be useful in an imputation model.

One common question about imputation is whether the dependent variable should be included in the imputation model. The answer is yes: if the dependent variable is not included in the imputation model, the imputed values will not have the same relationship to the dependent variable that the observed values do. In other words, if the dependent variable is not included in the imputation model, you may be artificially reducing the strength of the relationship between the independent and dependent variables. After the imputations have been created, the issue of how to treat imputed values of the dependent variable becomes more nuanced. If the imputation model contains only those variables in the analysis model, then using the imputed values of the dependent variable does not provide additional information, and actually introduces additional error (von Hippel 2007). As a result, some authors suggest including the dependent variable in the imputation model, which may include imputing values, and then excluding any cases with imputed values for the dependent variable from the final analysis (von Hippel 2007). If the imputation was performed using auxiliary variables or if the dataset was imputed without a specific analysis model in mind, then using the imputed values of the dependent variable may provide additional information. In these cases, it may be useful to include cases with imputed values of the dependent variable in the analysis model. Note that it is relatively easy to test the sensitivity of results to the inclusion of cases with imputed values of the dependent variable by running the analysis model with and without those cases.
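The sensitivity check described above can be sketched as follows. This is an illustration, not the analysis used in this seminar: we assume a hypothetical analysis model with **science** (which has missing values) as the dependent variable, and we flag the cases where it is missing before imputing so that they can be excluded afterwards. The **mi estimate** command is covered in part two.

```stata
* Sketch: run an analysis with and without cases whose dependent
* variable was imputed. The analysis model here is hypothetical.
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear

* Flag cases missing on the dependent variable BEFORE imputing
gen byte m_science = missing(science)

mi set flong
mi register imputed female read write math science
mi register regular m_science race ses schtyp socst

set seed 49230
mi impute mvn female read write math science = i.race ses schtyp socst, add(5)

* (a) all cases, including those with imputed values of science
mi estimate: regress science read write math female

* (b) dropping cases whose value of science was imputed
mi estimate: regress science read write math female if m_science == 0
```

Because **m_science** is created from the original data and registered as regular, the estimation sample in (b) is the same in every imputation.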

**Selecting the number of imputations (m)**

Historically, the recommendation was for 3 to 5 MI datasets. While relatively
low values of m may be appropriate when the proportion of missing values is low and the analysis techniques are relatively simple,
more recently, larger values of m are
often recommended. To some extent, this change in the recommended number of
imputations is based on the radical increase in the computing power available to
the typical researcher, making it more practical to create and analyze MI
datasets with a larger number of imputations. Recommendations for m vary, but, for
example, the Stata manual suggests 5 to 20 imputations for low fractions of missing information, and as many as 50 (or more) imputations when the proportion
of missing data is relatively high. One article that discusses the issue of the
number of imputations is by Graham, Olchowski, and Gilreath (2007). The **mi** commands include
features designed to allow you to test the sensitivity of your model to number
of imputations, and we recommend this type of sensitivity analysis (discussed
in the section on analysis of MI data). A larger number of imputations may also
allow hypothesis tests with less restrictive assumptions (i.e., that do not
assume equal fractions of missing information for all coefficients).
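One way to see why small values of m were long considered adequate is Rubin's relative efficiency, a standard result that is not derived in this seminar. In the sketch below, $\gamma$ is the fraction of missing information and $m$ is the number of imputations; the efficiency is that of an estimate based on $m$ imputations relative to one based on an infinite number. The value $\gamma = 0.3$ is chosen by us purely as an illustration.

```latex
\[
RE = \left(1 + \frac{\gamma}{m}\right)^{-1}
\]
\[
\gamma = 0.3:\qquad
m = 5 \;\Rightarrow\; RE = \frac{1}{1.06} \approx 0.943,
\qquad
m = 20 \;\Rightarrow\; RE = \frac{1}{1.015} \approx 0.985
\]
```

Efficiency of the point estimate is only one criterion. Other considerations, such as statistical power, discussed by Graham, Olchowski, and Gilreath (2007), can call for more imputations than this formula alone would suggest.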

Two common types of "special" data are complex survey designs and longitudinal data. Imputation of complex survey data presents a number of complex issues that are not discussed here. Before deciding whether and how to impute data from a complex survey, you probably want to carefully read the literature in this area, to be sure you understand these issues. An alternative to imputing complex survey designs is to use MI datasets that have been created and released by the data distributors. However, the user should carefully study the documentation of the MI datasets, as well as the literature on imputation of survey data, to be sure they understand how the data was prepared. An advantage of using complex survey data that has been imputed by the data distributor is that the data distributor often has access to sensitive information not included in public use datasets, and this information may have been used in the imputation process. If the sensitive variables provide information useful in the imputation process, then imputations produced by the data distributors may be better than imputations produced by the user.

Longitudinal data in wide form (i.e., all the observations for a single subject
are on the same line) can be imputed using the same procedures as
data from a single time point. The data should be in wide form for two reasons.
First, in long form there is more than one observation per individual, so the
cases aren't independent. Second, if a case has a valid response for one
time point but missing data at others, the individual's valid response is likely
to be a good predictor of the missing value. For example, we might expect
an individual's score on a test of mathematics ability at time 1 to be correlated with
their score on a test of mathematics ability at time 2. For more information on
imputing longitudinal data, you may want to see
our FAQ: How can I perform multiple imputation on
longitudinal data using ICE? (note, the example uses **ice**, but the process
is similar if you are using **mi impute**).
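The reshaping step can be sketched with a small simulated dataset. Everything here (variable names, sample size, the 15% missingness rate, the seed) is invented for illustration; the point is only the order of operations: reshape to wide form first, then set up and impute.

```stata
* Sketch with simulated longitudinal data; all names are hypothetical.
clear
set seed 2011
set obs 100
gen id = _n
gen ability = rnormal()            // stable person-level component
expand 3                           // three time points per student
bysort id: gen time = _n
gen math = 50 + 8*ability + 4*rnormal()
replace math = . if runiform() < 0.15
drop ability

* Long -> wide: one row per student with math1, math2, math3
reshape wide math, i(id) j(time)

mi set flong
mi register imputed math1 math2 math3

* Scores at the other time points now help impute each missing score
mi impute mvn math1 math2 math3, add(5)
```

In wide form each student contributes a single row, so the cases are independent and a student's observed scores serve as predictors of their own missing scores.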

Recall that the example dataset contains demographic information and test
scores from 200 high school students. Below we open the dataset and summarize
the variables (abbreviated **sum**). The dataset contains 11 variables, 6 of
which, **female**, **prog**, **read**, **write**, **math**, and **science**, have missing
values (i.e., fewer than 200 observations).

    use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
    (highschool and beyond (200 cases))

    sum

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
              id |       200       100.5    57.87918          1        200
          female |       182    .5549451    .4983428          0          1
            race |       200        3.43    1.039472          1          4
             ses |       200       2.055    .7242914          1          3
          schtyp |       200        1.16     .367526          1          2
    -------------+--------------------------------------------------------
            prog |       182    2.027473    .6927511          1          3
            read |       191    52.28796    10.21072         28         76
           write |       183    52.95082    9.257773         31         67
            math |       185     52.8973    9.360837         33         75
         science |       184    51.30978    9.817833         26         74
    -------------+--------------------------------------------------------
           socst |       200      52.405    10.73579         26         71

Using what we learned from examination of the data (e.g., descriptive
statistics, patterns of missing values, etc.), we have considered the
alternatives discussed in the previous section. We have decided to impute all 6 variables with
missing values, and to include all of the variables with complete information in
the imputation model. One alternative to this approach would be to impute for a
specific analysis model, for example, the model we ran at the beginning of the
seminar, predicting the variable **socst**
with **write**, **read**, **female**, and **math**. An imputation
model just for that model would impute the variables **write**, **read**,
**female** and **math**, and include the complete variable **socst**.
Because the dataset contains only a few variables, and the analysis model is
relatively simple (i.e., it does not contain any interactions or non-linear
terms), the imputation models based on these two approaches are similar. With a
larger number of variables or a more complex analysis model this might not be
the case. For this example, we will create five imputations (m=5). The example
dataset has relatively few missing values and a simple analysis model, so this
may be sufficient. However, as discussed above, for many applications more than
5 imputations may be desirable.

The data must be **mi set** before any of the **mi** commands can be used.
The first line of syntax shown below uses **mi set** to set the data using
the **flong** style. When the imputed data is stored in the flong
style, the original data, plus the **m** imputed datasets are stored
in a single data file. So, if there are 5 imputations, the file includes 6
copies of the dataset, that is, each case in the dataset occurs 6 times, once
for the original data, and once for each of the 5 imputations. Storing data in the flong
style is inefficient, so you may want to use another style; we've used it here because it's
relatively easy to think about. (For more information on available styles type
**help mi styles** in the Stata command window.) Next we use the **tab**
command with the **gen(...)** option to generate dummy
variables for the variable **prog**, for use in the imputation model. Because
"pr_" was used in the **gen(...)** option, the dummy variables will be named
**pr_1**, **pr_2**, and **pr_3**.

    mi set flong
    tab prog, gen(pr_)

        type of |
        program |      Freq.     Percent        Cum.
    ------------+-----------------------------------
        general |         41       22.53       22.53
       academic |         95       52.20       74.73
       vocation |         46       25.27      100.00
    ------------+-----------------------------------
          Total |        182      100.00
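If the size of the flong file is a concern, the data can be held in a more compact style instead. The sketch below is an alternative that is not used in the rest of this seminar; it assumes the example dataset is loaded from the same URL as above, and it switches to the mlong style, which keeps extra copies only of the cases that have missing values.

```stata
* Sketch: an alternative to flong, not used elsewhere in this seminar.
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
mi set flong

* Convert to mlong: only incomplete cases are repeated per imputation
mi convert mlong, clear

* Confirm that the style is now reported as mlong
mi describe
```

The **mi** commands work the same way regardless of style, so the choice mainly affects file size and how easy the file is to inspect by eye.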

Now we need to register the variables in our dataset. Based on our previous explorations of
the data we know that the variables **prog**, **female**, **read**, **write**, **math**, and **science**
all have missing values, all of which we plan to impute. In the first line of syntax below, we use the **mi register
imputed** command to tell Stata that **female**, **read**, **write**, **math**,
**science**, **pr_2**, and **pr_3**
are all imputed. This isn't really true, yet, but this is how we tell Stata
which variables we intend to impute. Note that in our case, **prog** and **pr_1** are omitted from the
list of imputed variables, because they will not be included in the
imputation model due to collinearity with **pr_2** and **pr_3**. In the
second line of syntax, we register **race**, **ses**, **schtyp**, and **socst** as regular variables
(i.e., not imputed or based on imputed variables). Strictly speaking, we do not
need to register the other variables in the dataset (i.e., those we do not wish
to impute), but registering variables can help in data management. Finally we use **mi describe** to check that we
have properly registered the variables. The output also provides information
on how our dataset is set up, which we want to check carefully. For example, the
output tells us that the style is flong, which is what we intend, and that there
are currently no imputations (M = 0), which is expected because we have yet to
impute any values. The output also tells us that there are 3 unregistered
variables, **id** (the case id variable), **prog**, and **pr_1**. We
haven't registered these, although doing so would be fine.

    mi register imputed female read write math science pr_2 pr_3
    mi register regular race ses schtyp socst
    mi describe

      Style:  flong
              last mi update 0 seconds ago

      Obs.:   complete          117
              incomplete         83  (M = 0 imputations)
              ---------------------
              total             200

      Vars.:  imputed:  7; female(18) read(9) write(17) math(15) science(16) pr_2(18) pr_3(18)

              passive:  0

              regular:  4; race ses schtyp socst

              system:   3; _mi_m _mi_id _mi_miss

             (there are 3 unregistered variables; id prog pr_1)

Once we are sure that everything is in order, we can generate the imputations.
The first line of syntax below sets the seed for the analysis so that we can
reproduce it later. Setting the seed is necessary if we want to reproduce
the results, because of the random component in the imputation process. The
second line
of syntax below runs the imputation model. The command name **mi impute mvn**, tells
Stata that we wish to run an imputation model (**mi impute**) using the multivariate normal model (**mvn**).
The variable list that immediately follows the command name contains the
list of variables we wish to impute in this model followed by an equal sign
( = ) and a second list of variables. The second list contains variables
with complete data that are to be included in the imputation model (i.e., **race**, **ses**,
and **schtyp**). We have used the **i.** prefix with
**race** (**i.race**) to indicate to
Stata that this is a factor (i.e., categorical) variable, and that instead
of including it as a continuous variable, the appropriate dummy variables
should be included in the model. In other words, it's a shortcut so we don't
have to produce the dummy variables ourselves. (For more information on
factor variables type **help factor variables** in the Stata command
window.) Note that for variables we wish to impute, we do need to create the
dummy variables ourselves, and include k-1 (where k = number of categories)
dummy variables in the imputation model. Following the comma is the **add(...)** option, which is
required. The 5 listed in the **add(5)** option indicates that we wish to
add 5 imputations to the existing dataset. In this case, there are currently
no imputations, so the result will be a data file with 5 imputations. As
mentioned above, in
many cases, we would want to generate more than 5 imputations for the
analysis, but it may be useful to start by running a model with a small
number of imputations to make sure that everything runs as expected, and
then create additional imputations.

    set seed 49230
    mi impute mvn female read write math science pr_2 pr_3 = i.race ses schtyp, add(5)

Frequently this command will take some time to run; this is expected because MI is computationally
intensive. The output from the **mi impute mvn**
command is shown below. The first few lines of output are status updates that
Stata issues while the command is still running. The next few lines give the
user information about the imputation that was performed, including the total
number of imputations as well as the number added; following this is more
technical information about how the model was specified. Finally there is
information about the variables that were imputed. For each variable that was
imputed, the table lists the number of complete and incomplete observations in
the original data (m=0), and the number of values that have been imputed. For
example, **female** had 18 missing values in the original data, all 18 of which
have been imputed. If, for some reason, the model we specified imputed only some
of the missing values for a variable, the numbers in the incomplete and imputed
columns would not match each other.

    Performing EM optimization:
      observed log likelihood = -1624.7371 at iteration 12

    Performing MCMC data augmentation ...

    Multivariate imputation                     Imputations =        5
    Multivariate normal regression                    added =        5
    Imputed: m=1 through m=5                        updated =        0

    Prior: uniform                               Iterations =      500
                                                    burn-in =      100
                                                    between =      100

                   |              Observations per m
                   |----------------------------------------------
          Variable |   complete   incomplete   imputed |     total
    ---------------+-----------------------------------+----------
            female |        182           18        18 |       200
              read |        191            9         9 |       200
             write |        183           17        17 |       200
              math |        185           15        15 |       200
           science |        184           16        16 |       200
              pr_2 |        182           18        18 |       200
              pr_3 |        182           18        18 |       200
    --------------------------------------------------------------
    (complete + incomplete = total; imputed is the minimum across m
     of the number of filled in observations.)
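The strategy of starting with a few imputations and then adding more can be sketched as follows. The steps repeat the setup from this seminar so that the sketch stands on its own; the second call, and the choice of 15 additional imputations, is our illustration rather than something run in the original example.

```stata
* Sketch: a small trial run, followed by additional imputations.
use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
mi set flong
tab prog, gen(pr_)
mi register imputed female read write math science pr_2 pr_3
mi register regular race ses schtyp socst

set seed 49230

* Trial run: 5 imputations to check that the model runs as expected
mi impute mvn female read write math science pr_2 pr_3 = i.race ses schtyp, add(5)

* add() appends to the existing imputations, so this brings M to 20
mi impute mvn female read write math science pr_2 pr_3 = i.race ses schtyp, add(15)

* Check the number of imputations now stored
mi describe
```

Because **add(...)** appends rather than replaces, the first 5 imputations are kept and only the additional 15 need to be generated.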

From the user's standpoint, generating the imputations is mechanically
simple, if time consuming; statistically, the process is much more involved. The imputations
are created using a technique called Markov Chain Monte Carlo (MCMC). Unlike many models
you may be used to (e.g., logit), there is no simple rule that can be used to determine
when an MCMC model has converged. One common technique is to visually inspect the parameters
from successive iterations of the model to see if they exhibit any clear trends. If they do,
the model has probably not converged; if they don't, we may take this
as evidence
that the model has reached what is called a stationary distribution. Ideally,
this is done by checking the series for each parameter, but in larger models
this isn't always practical. An alternative is to use the worst linear function
(wlf), a sort of summary measure, which, under certain conditions, should
converge more slowly than any of the individual parameters. You can request that
Stata save the wlf for each iteration to a separate file, and then examine it.
The command below is the same as the command above, except for the addition
of the **savewlf(...)** option, which requests that the wlf be saved. The
name of the new file containing the wlf values is specified in the parentheses
(i.e., **wlf**).

    set seed 49230
    mi impute mvn female read write math science pr_2 pr_3 = i.race ses schtyp, add(5) savewlf(wlf)

The output from the above command is identical to the output from the first **mi impute mvn** command;
the only difference is that Stata has recorded the value of the wlf at
each iteration in the file **wlf.dta**. Below we first open the dataset, then we list the data to see what Stata has saved. The first variable, **iter**,
contains the iteration number. The iteration numbers for the initial burn-in
(the time before any imputations are drawn) count up to 0; by default Stata uses 100
burn-in iterations, so **iter** = -99 in the first row. The variable **m**
gives the "leg" of the process that iteration occurs on. For example, for the
burn-in period (**iter** = -99 to 0) the variable **m** is equal to 1; at
iteration 0 the first imputation is drawn, followed by the iterations between the first and second
imputations (**iter** = 1 to 100), for which **m** = 2, and so on. The variable **wlf**
contains the values of the worst linear function at each iteration.

    use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/wlf, clear
    list in 1/10, noobs clean

        iter   m         wlf
         -99   1           0
         -98   1    .0001751
         -97   1   -.0000706
         -96   1    .0002469
         -95   1    .0000131
         -94   1    -.000044
         -93   1     .000086
         -92   1    .0001633
         -91   1    .0002296
         -90   1   -.0000149

We can use the **tsline** command to plot the value of the wlf
across the iterations, but in order to do this, we must first tell
Stata that the data is time series, and which variable contains the values
of time (**iter**). We do this with the **tsset** command. Next we
graph the time series plot of **wlf**. The **ytitle(...)** and **xtitle(...)** options are
used to put titles on the axes. The graph produced by this command
is shown below. Successive iterations are plotted on the x-axis and the value of
the wlf on the y-axis. There does not appear to be a long term trend (either
increasing or decreasing) in values of wlf, so it seems to have reached
a stationary distribution.

    tsset iter
            time variable:  iter, -99 to 200
                    delta:  1 unit

    tsline wlf, ytitle(Worst Linear Function) xtitle(Iteration)

Because the imputations are drawn from successive iterations
of the same chain, autocorrelation between the imputed
datasets is possible. We want to check that enough iterations were
left between successive draws (i.e., datasets) that the draws are
effectively uncorrelated. We can assess this using an autocorrelation plot.
The command to produce an autocorrelation plot is **ac** followed by the name
of the variable to be plotted. In the graph below, the x-axis shows the lag,
that is, the distance between a given iteration and the iteration it is being
correlated with; the y-axis shows the value of the correlations. The correlation
of the wlf at each iteration with the iteration immediately following it is positive
and reasonably high, about .1. After
that, the correlations at lags 2 through 40 are relatively low,
without much of a pattern, suggesting that any autocorrelation
is relatively short lived. Our imputation allowed 100 iterations (the default) between
successive draws; based on this plot, such a long period may have been
unnecessary.

ac wlf

If we were particularly worried about the convergence of individual parameters in the model,
we could request that the values of individual parameters, rather than the summary measure wlf, be saved. We do not cover this command here; for more information see
**help mi ptrace** and **help mi impute mvn**.
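For orientation only, a rough sketch of that workflow follows. The variable names **y1**, **y2**, and **x1**, the file name **trace**, and the seed are all hypothetical, and the sketch assumes the data have already been **mi set** with **y1** and **y2** registered as imputed. The parameter names in the trace file depend on the model, so check the output of **mi ptrace describe** before plotting.

```stata
* Hypothetical model: y1 and y2 have missing values, x1 is complete.
* Assumes the data are already mi set and y1 y2 are registered as imputed.
mi impute mvn y1 y2 = x1, add(5) saveptrace(trace, replace) rseed(1234)

* List the parameters stored in the trace file
mi ptrace describe using trace

* Load the trace as an ordinary dataset. This replaces the data in memory,
* so save the imputed data first if you want to keep them.
mi ptrace use trace, clear

* Plot one regression coefficient and one variance across iterations.
* b_y1x1 and v_y1y1 follow the naming reported by mi ptrace describe.
* tsset assumes iter does not repeat across values of m.
tsset iter
tsline b_y1x1, ytitle(Coefficient of x1 in the y1 equation) xtitle(Iteration)
tsline v_y1y1, ytitle(Variance of y1) xtitle(Iteration)
```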

It is also a good idea to look carefully at the descriptive statistics for the imputations (tools for doing so will be covered in part two of this seminar). We don't expect the descriptive statistics for the imputed data to be the same as those in the original data, but examining these values can help us see where things might have gone very wrong in the imputation process. For example, if a variable with a range of 1-10 in the original data has a mean outside that range (e.g., 15) after imputation, this suggests there may have been a problem in the imputation model.
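As a quick preview, one way to make this comparison is with **mi xeq**, which runs a command on selected imputations. The sketch below assumes imputed data are in memory; **y1** is a hypothetical imputed variable, and the imputation numbers chosen are arbitrary.

```stata
* y1 is a hypothetical imputed variable.
* m=0 is the original (incomplete) data; 1 and 2 are the first two imputations.
mi xeq 0 1 2: summarize y1

* Omitting the list of numbers runs the command on m=0 and every imputation
mi xeq: summarize y1
```

Means or ranges in the imputed datasets that fall far outside what is plausible for the variable are a signal to revisit the imputation model.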

In this seminar we've shown how you can explore missing values using Stata's
**mi** commands as well as user-written commands. We also discussed some of
the major issues in imputation and showed an example of imputation using Stata's
**mi impute mvn** command. In part two of this seminar we will show an
alternative method of imputation, namely imputation by chained equations. We
will also show some of the tools you can use to explore, manage, and analyze MI data in Stata.

Graham, John W., Olchowski, Allison E., and Gilreath, Tamika D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science 8:206.

Little, Roderick J.A., and Rubin, Donald B. (2002). Statistical Analysis with Missing Data, Second Edition. Hoboken, New Jersey: Wiley-Interscience.

McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya, and Figueredo, Aurelio Jose (2007). Missing Data: A Gentle Introduction. New York, New York: The Guilford Press.

Molenberghs, Geert, and Kenward, Michael G. (2007). Missing Data in Clinical Studies. Chichester, West Sussex: John Wiley & Sons Ltd.

Potthoff, Richard F., Tudor, Gail E., Pieper, Karen S., and Hasselblad, Vic (2006). Can one assess whether missing data are missing at random in medical studies? Statistical Methods in Medical Research 15:213-234.

van Buuren, S., Boshuizen, H. C., and Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694.

von Hippel, Paul T. (2007). Regression with missing Ys: an improved strategy for analyzing multiply imputed data. Sociological Methodology 37.
