Statistical Computing Seminars
Multiple Imputation in Stata, Part 1

This seminar is on multiple imputation using Stata, but imputation is much more than the mechanical process of running commands: it requires creating a model. Building a good imputation model requires knowledge of the data as well as careful consideration of a number of options, a process that can take as long as creating a good analysis model. This seminar includes a brief review of some important concepts in multiple imputation, and in handling missing data more generally, but it is not intended to replace more thorough treatments of the topic. We have made an effort to point out some of the important steps and decisions in the imputation process. However, handling missing data is a complex and developing topic, so we recommend you read the related literature carefully before implementing multiple imputation or other missing data handling techniques in your own research. The classic text on handling missing data, now in its second edition, is Statistical Analysis with Missing Data by Little and Rubin (2002). This text is technical, so it may not be the best introduction to the topic, especially for those without a background in mathematics. A more approachable text is Missing Data: A Gentle Introduction by McKnight, McKnight, Sidani, and Figueredo (2007). Another text we like is Missing Data in Clinical Studies by Molenberghs and Kenward (2007). Many other excellent texts on this topic exist; these just happen to be ones we like and have in our library of books for loan. When reading books and articles on MI, we recommend that you be conscious of when they were published because, as mentioned above, MI is a developing field, and what is generally considered good practice may have changed.

The trouble with missing values

Throughout the seminar we will be using a dataset that contains test scores, as well as demographic and school information, for 200 high school students. Below we have summarized this dataset. Note that although the dataset contains 200 cases, six of the variables have fewer than 200 observations; these variables have missing values for between 4.5% (read) and 9% (science) of cases. This doesn't seem like a lot of missing data, so we might be inclined to analyze the observed data as is, a strategy sometimes referred to as complete case analysis.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
(highschool and beyond (200 cases))

sum


    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       200       100.5    57.87918          1        200
      female |       182    .5549451    .4983428          0          1
        race |       200        3.43    1.039472          1          4
         ses |       200       2.055    .7242914          1          3
      schtyp |       200        1.16     .367526          1          2
-------------+--------------------------------------------------------
        prog |       182    2.027473    .6927511          1          3
        read |       191    52.28796    10.21072         28         76
       write |       183    52.95082    9.257773         31         67
        math |       185     52.8973    9.360837         33         75
     science |       184    51.30978    9.817833         26         74
-------------+--------------------------------------------------------
       socst |       200      52.405    10.73579         26         71

Below we use write, read, female, and math to predict socst in a regression model. The regression model uses just those cases with complete data for all the variables in the model (i.e., no missing values on socst, write, read, female, or math); this is the default in Stata and many other statistical packages. Looking at the top of the output, we see that only 145 cases were used in the analysis; in other words, more than one quarter of the cases in our dataset (55/200) were excluded from the analysis because of missing data. Below the regression table we use the estimates store command to save the results so we can recall them later.

regress socst write read female math


      Source |       SS       df       MS              Number of obs =     145
-------------+------------------------------           F(  4,   140) =   28.10
       Model |   6630.7694     4  1657.69235           Prob > F      =  0.0000
    Residual |  8259.47888   140  58.9962777           R-squared     =  0.4453
-------------+------------------------------           Adj R-squared =  0.4295
       Total |  14890.2483   144  103.404502           Root MSE      =  7.6809

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .3212789   .1020247     3.15   0.002     .1195706    .5229871
        read |   .3047733   .0899709     3.39   0.001     .1268961    .4826505
      female |   .2233572   1.404163     0.16   0.874    -2.552749    2.999463
        math |   .1988131   .1016747     1.96   0.053    -.0022031    .3998294
       _cons |   9.358279   4.262397     2.20   0.030     .9312916    17.78527
------------------------------------------------------------------------------

estimates store cc

The reduction in sample size alone might be considered a problem, but complete case analysis can also lead to biased parameter estimates. Because this is an example dataset and we created the missing values ourselves, we also have the complete dataset, so we can compare the results from the complete case analysis above to the results from the original data (i.e., the dataset with no missing values). Below we open the original dataset and run the same regression model as above; note that since there is no missing data, all 200 observations are used in this regression. Below the regression output, we store the estimates.

use http://www.ats.ucla.edu/stat/data/hsb2, clear
(highschool and beyond (200 cases))

regress socst write read female math

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  4,   195) =   44.45
       Model |  10938.9795     4  2734.74487           Prob > F      =  0.0000
    Residual |  11997.2155   195  61.5241822           R-squared     =  0.4769
-------------+------------------------------           Adj R-squared =  0.4662
       Total |   22936.195   199  115.257261           Root MSE      =  7.8437

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .3757491   .0852101     4.41   0.000     .2076975    .5438007
        read |   .3696825   .0775725     4.77   0.000     .2166938    .5226712
      female |  -.2340534   1.207995    -0.19   0.847    -2.616465    2.148358
        math |   .1209005   .0861526     1.40   0.162    -.0490101    .2908111
       _cons |   7.029076   3.562453     1.97   0.050      .003192    14.05496
------------------------------------------------------------------------------

estimates store full

Now we use estimates table to display the results of the complete case analysis (labeled cc) and the analysis of the full dataset (labeled full). For each of the variables in our model (as well as the constant), the table includes three values: the coefficient estimate, below that the standard error, and below that the p-value for the coefficient. Comparing the coefficients, as well as their standard errors and p-values, from the complete case and full data analyses, we can see that for all of the coefficients there is some bias (i.e., a difference between the two analyses). The largest absolute difference is in the coefficient for female; however, that coefficient is non-significant in both models. In the case of math, the coefficient in the complete case analysis is on the cusp of significance (p = 0.053), while it is clearly non-significant in the model estimated with the full data (p = 0.162), so our inference about this coefficient could differ between the two models. The coefficients for read and write, as well as their p-values, are similar in the two analyses. The intercepts differ by a relatively small amount (given the scale of the dependent variable). There is no consistent pattern of larger or smaller coefficient estimates between the two, but the standard errors from the analysis of the full data are all smaller, due in part to the larger sample size.

estimates table cc full, b se p

----------------------------------------
    Variable |     cc          full     
-------------+--------------------------
       write |  .32127885     .3757491  
             |  .10202467    .08521005  
             |     0.0020       0.0000  
        read |  .30477331    .36968249  
             |  .08997086    .07757247  
             |     0.0009       0.0000  
      female |  .22335724   -.23405342  
             |  1.4041631    1.2079946  
             |     0.8738       0.8466  
        math |  .19881314    .12090052  
             |  .10167466    .08615264  
             |     0.0525       0.1621  
       _cons |   9.358279    7.0290761  
             |  4.2623968    3.5624529  
             |     0.0298       0.0499  
----------------------------------------
                          legend: b/se/p

We can also compare these results to those from multiply imputed data. Below we open a multiply imputed version of the dataset and use mi estimate to fit the same model.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/mvn_imputation, clear
mi estimate, post: reg socst write read female math

Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =        200
                                                  Average RVI     =     0.0820
                                                  Complete DF     =        195
DF adjustment:   Small sample                     DF:     min     =      59.71
                                                          avg     =     121.37
                                                          max     =     181.12
Model F test:       Equal FMI                     F(   4,  163.6) =      38.78
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .3472116   .0956238     3.63   0.000     .1572004    .5372228
        read |   .3673822   .0803328     4.57   0.000     .2086775    .5260869
      female |    .525372   1.375176     0.38   0.704    -2.225667    3.276411
        math |   .1508523   .0908884     1.66   0.099    -.0290372    .3307417
       _cons |    6.59747   3.707945     1.78   0.077    -.7188551     13.9138
------------------------------------------------------------------------------

estimates store mi

Below we use estimates table to display results from all three models next to each other. As in the previous table, the estimate, standard error, and p-value for each coefficient are listed (in that order). Comparing the results across the three analyses, we see that for most of the estimates in the model, the coefficient estimates from the MI analysis are closer to those from the full dataset than those from the complete case analysis (the coefficient for female is an exception); that is, the MI coefficients generally contain less bias (although the magnitude of these differences is often small in this example). In general, if done well, analysis using MI should result in coefficients with less bias than complete case analysis.

estimates table cc full mi, b se p

-----------------------------------------------------
    Variable |     cc          full          mi      
-------------+---------------------------------------
       write |  .32127885     .3757491    .34721159  
             |  .10202467    .08521005    .09562376  
             |     0.0020       0.0000       0.0004  
        read |  .30477331    .36968249    .36738221  
             |  .08997086    .07757247    .08033285  
             |     0.0009       0.0000       0.0000  
      female |  .22335724   -.23405342    .52537204  
             |  1.4041631    1.2079946    1.3751758  
             |     0.8738       0.8466       0.7028  
        math |  .19881314    .12090052    .15085228  
             |  .10167466    .08615264    .09088836  
             |     0.0525       0.1621       0.0986  
       _cons |   9.358279    7.0290761    6.5974704  
             |  4.2623968    3.5624529    3.7079453  
             |     0.0298       0.0499       0.0768  
-----------------------------------------------------
                                       legend: b/se/p

What is multiple imputation?

To impute values generally means to replace missing values with some other value. There are a variety of methods for selecting the imputed value. One very simple approach is to replace missing values with the sample mean. While popular, mean imputation produces distributions that have far too many cases at the mean. More importantly, mean imputation can often produce estimates that are more biased than those from complete case analysis (Little & Rubin 2002, pg. 62). Another possibility is to perform a conditional mean imputation, that is, rather than imputing the sample mean, use the mean from cases that are similar to the case with the missing value in important ways. Replacing missing values with predicted values from a regression analysis of the complete data is a form of conditional mean imputation. What these methods of imputation have in common is that the imputed values are completely determined by a model applied to the observed data; in other words, they contain no error. This tends to reduce variance and can distort relationships among variables. An alternative approach is to incorporate some error into the imputed values. The values imputed in multiple imputation are draws from a distribution; in other words, they inherently contain some variation. This variation is important for several reasons, not just for creating reasonable distributions.

A limitation of single imputation is that it treats imputed values as though they were observed, which is not the case; imputations are only estimates. As a result, standard analyses of a single imputation will tend to overstate our confidence in the parameter estimates, that is, the standard errors are too small. Multiple imputation addresses this problem by introducing an additional form of error based on variation in the parameter estimates across the imputations, the so-called between-imputation error. An MI analysis involves three steps. First, an imputation model is formulated and a series of imputed datasets is created. Second, the analysis of each imputed dataset is carried out separately; for example, you might calculate a mean or run a regression model. Finally, the estimates from the imputed datasets are combined, or pooled, to generate a single set of estimates. For parameters (e.g., means or regression coefficients), the MI estimate is simply the mean of the parameter estimates across the imputations. The calculation of the standard errors is a little trickier. As mentioned above, the MI estimate of the standard error of a parameter contains two components: the within-imputation variance and the between-imputation variance. The within-imputation variance is the average of the variances (i.e., the standard errors squared) across the imputations. The between-imputation variance is a function of the variance of the parameter estimates across the imputed datasets and the number of imputations. The MI estimate of the standard error is the square root of the within and between variances added together. This process allows us to see how much our results change when different values are imputed, and combining the results across the imputations allows us to account for the uncertainty in the imputed values.
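
To make the pooling calculation concrete, below is a minimal sketch of these combining rules coded by hand. The coefficient estimates and variances stored in b_list and v_list are made-up numbers, purely for illustration; in practice the mi estimate command performs these calculations for you.

local m = 5
* hypothetical estimates and variances (squared standard errors) of a
* single coefficient from m = 5 imputed datasets
local b_list "0.34 0.36 0.33 0.35 0.37"
local v_list "0.0090 0.0101 0.0085 0.0096 0.0093"

* MI point estimate: the mean of the m estimates
local bbar = 0
foreach b of local b_list {
	local bbar = `bbar' + `b'/`m'
}

* within-imputation variance: the average of the m variances
local W = 0
foreach v of local v_list {
	local W = `W' + `v'/`m'
}

* between-imputation variance: variance of the estimates across imputations
local B = 0
foreach b of local b_list {
	local B = `B' + (`b' - `bbar')^2/(`m' - 1)
}

* total variance: within plus between, with the between component
* inflated by 1/m because the number of imputations is finite
local T = `W' + (1 + 1/`m')*`B'
display "MI estimate = " `bbar' _n "MI std. err. = " sqrt(`T')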

It is important to note that MI is not the only appropriate method of handling missing data. Another particularly good method is full information maximum likelihood (FIML). Both FIML and MI have advantages and disadvantages, depending on your specific situation. To our knowledge, the FIML approach has not been implemented in Stata. FIML is commonly implemented in structural equation modeling packages such as Mplus, LISREL, and EQS.

Exploring missing data patterns

As with most, if not all, analyses, the first step in handling missing data is to get to know the data. This includes the usual exploratory techniques, such as examining means and standard deviations and graphing distributions. In addition, in the presence of missing data it is important to understand not just how much data is missing, but also the patterns of missing values. Stata provides tools to do this. Throughout the process of creating and analyzing MI data in Stata, the tools available depend on which version of Stata you have access to. Starting with version 11, Stata has a suite of commands for handling MI data, all of which start with the mi prefix. In earlier versions of Stata, a number of user-written tools are available for working with missing data. We will introduce both Stata's built-in commands and user-written commands as we go along (see our FAQ: How do I use the findit command to search for programs and additional help? for more information about finding and installing user-written commands).

Before we can use the mi commands, the data must be mi set. Fortunately, we don't need to know anything about the missing data structure in order to use the mi set command. We do need to declare the dataset style, but since the data hasn't been imputed yet, we can use any style. The style can always be changed later using the mi convert command (for more information on MI data styles, see Stata's documentation by typing help mi styles in the Stata command window). For now we set the MI storage format to wide. Next we use the mi misstable sum command to begin to explore the pattern of missing values in our dataset.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
(highschool and beyond (200 cases))

mi set wide
mi misstable sum

                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        female |        18                 182  |      2          0           1
          prog |        18                 182  |      3          1           3
          read |         9                 191  |     30         28          76
         write |        17                 183  |     29         31          67
          math |        15                 185  |     39         33          75
       science |        16                 184  |     32         26          74
  -----------------------------------------------------------------------------

The output above shows us which variables have missing values. The second, third, and fourth columns tell us about the number and type of missing values. The column labeled Obs=. tells us how many cases have a system missing value (i.e., "."), the column labeled Obs>. gives the number of cases with so-called extended missing values (i.e., ".a", ".b", ".c"...), and the column labeled Obs<. gives the number of cases with non-missing values for each variable. The use of greater than and less than symbols may seem somewhat confusing; the reason for them is that when Stata stores data, it stores system missing values as a very large number, and extended missing values (e.g., .a) as even larger numbers. As a result, any observed value in the dataset is less than the value stored for a system missing value (hence Obs<.), and the value stored for an extended missing value is larger than the value stored for a system missing value (hence Obs>.). The distinction between system and extended missing values is important because Stata's mi impute command will not impute extended missing values (the user-written program ice does not make this distinction). Therefore, if you wish to impute missing values for cases with extended missing values using mi impute, you will first need to replace the extended missing values with system missing values. The final three columns give information about the cases with observed (i.e., non-missing) values: the number of unique values (Unique values), as well as the minimum (Min) and maximum (Max) values each variable takes on.
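
For example, if write contained extended missing values that you wanted mi impute to impute, the recoding might look like the following minimal sketch (purely illustrative; in our dataset all missing values are already system missing).

* extended missing values (.a, .b, ...) are stored as values greater than
* the system missing value ".", so this recodes only the extended ones
replace write = . if write > .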

Next, we want to explore the pattern of missing values in the dataset. This is important because the patterns of missingness can sometimes suggest why the values are missing. For example, are there a lot of missing values for certain variables? Below is a sample dataset; instead of showing observed values, each case (row) contains a 1 if the value was observed and a 0 if the value for that case is missing. The dataset contains four variables, v1-v4. Notice that while v1 is observed for all cases, and v2 and v3 are observed for most cases, variable v4 has a lot of missing values. When you encounter a variable that has many more missing values than others, you probably want to ask yourself why. Sometimes the variable is part of a skip pattern or simply doesn't apply to some respondents. However, you may also encounter variables with a large number of missing values without such obvious causes; in these cases you may want to put some thought into why that particular variable would be missing so often. Was it a particularly sensitive survey question? Is this a reading generated by an unreliable machine or process? Was there a data entry or processing error that created missing values where values were actually observed?

v1  v2  v3  v4
1   1   1   1
1   1   1   0
1   0   1   0
1   1   0   0
1   1   1   1
1   0   0   0

Another thing to watch out for is cases that are missing a lot of data. In the example above, the case shown in the sixth row is missing more values than the other cases. As with variables that are missing a lot of data, we probably want to ask ourselves whether there is anything unusual about these cases that would lead to a large number of missing values. If there are multiple cases with a lot of missing data, do they have anything in common? Carefully exploring the data, and thinking about what we find, can aid us in making better decisions about how to treat the missing values later on.

Another thing to take note of is the pattern of missingness, which influences what kinds of imputation can be used. Missing data patterns are commonly described as either monotone or arbitrary. Below is another example dataset, where 1s indicate observed values and 0s missing values. This dataset is an example of monotone missingness: all values of v1 are observed, all but the final two values of v2 are observed, and so on. It may be necessary to reorder variables and/or cases in order to "see" monotone missingness; that is fine, as long as it is possible to do so, the missing data pattern is considered monotone.

v1  v2  v3  v4
1   1   1   1
1   1   1   1
1   1   1   0
1   1   0   0
1   0   0   0
1   0   0   0

For comparison, below is an example of what an arbitrary missing data pattern looks like (again, a value of 1 represents an observed value, while 0 indicates a missing value). Note that it would not be possible to reorder the variables and/or cases to form a monotone pattern.

v1  v2  v3  v4
1   1   0   1
0   1   1   0
1   0   1   0
1   1   0   0
0   1   1   1
1   1   0   1

As you might imagine, when the cases and variables aren't ordered nicely, it can be difficult to spot a monotone missing data pattern. Below we use the mi misstable nested command to examine the nesting structure of the missing values. The output shows no pattern in which cases missing a value on one variable are always missing on another variable. This suggests that the pattern of missingness in the test score dataset is non-monotone.

mi misstable nested

     1.  read(9)
     2.  math(15)
     3.  science(16)
     4.  write(17)
     5.  prog(18)
     6.  female(18)

If the data was perfectly monotone (as in the example above) the output would look something like that shown below. The 2 in parentheses after v2 tells us that this variable has two missing values. The arrow (i.e., -> ) pointing towards v3 tells us that when v2 is missing, v3 is always missing as well. The second arrow, which points towards v4, tells us that v2 and v3 are always missing when v4 is missing.

     1.  v2(2) -> v3(3) -> v4(4)

The command mi misstable patterns provides another way to examine the patterns of missing data in our dataset. Below we use mi misstable patterns to display the missing data patterns. The first column gives the percent of cases in each pattern; there are six additional columns, one for each variable with missing values. The order of the variables is shown below the table. In the body of the table, a 1 indicates the variable was observed in that pattern, and a 0 indicates that the variable is missing in that pattern. The most common pattern (59% of cases) is no missing values at all. The next most common pattern (8% of cases) is missing on female (the variable for the column labeled 5) but observed on all other variables. The bypatterns option, used to generate the second set of output below, groups the missing data patterns based on the number of variables missing in that pattern, rather than the frequency of the missing data pattern. Both of these tables can be useful in detecting cases with a large number of missing values.

mi misstable patterns


       Missing-value patterns
         (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4    5  6
  ------------+---------------------
       59%    |  1  1  1  1    1  1
              |
        8     |  1  1  1  1    0  1
        7     |  1  1  1  1    1  0
        7     |  1  1  0  1    1  1
        6     |  1  1  1  0    1  1
        6     |  1  0  1  1    1  1
        4     |  0  1  1  1    1  1
        1     |  1  0  1  0    1  1
       <1     |  0  1  0  1    1  1
       <1     |  1  0  1  1    0  1
       <1     |  1  0  1  1    1  0
       <1     |  1  1  0  0    1  1
       <1     |  1  1  0  1    1  0
       <1     |  1  1  1  0    0  1
       <1     |  1  1  1  0    1  0
       <1     |  1  1  1  1    0  0
  ------------+---------------------
      100%    |

  Variables are  (1) read  (2) math  (3) science  (4) write  (5) female  (6) prog


mi misstable patterns , bypatterns


       Missing-value patterns
         (1 means complete)

              |   Pattern
    Percent   |  1  2  3  4    5  6
  ------------+---------------------
       59%    |  1  1  1  1    1  1
              |
  1:          |
        7     |  1  1  1  1    1  0
        8     |  1  1  1  1    0  1
        6     |  1  1  1  0    1  1
        7     |  1  1  0  1    1  1
        6     |  1  0  1  1    1  1
        4     |  0  1  1  1    1  1
  2:          |
       <1     |  1  1  1  1    0  0
       <1     |  1  1  1  0    1  0
       <1     |  1  1  1  0    0  1
       <1     |  1  1  0  1    1  0
       <1     |  1  1  0  0    1  1
       <1     |  1  0  1  1    1  0
       <1     |  1  0  1  1    0  1
        1     |  1  0  1  0    1  1
       <1     |  0  1  0  1    1  1
  ------------+---------------------
      100%    |

  Variables are  (1) read  (2) math  (3) science  (4) write  (5) female  (6) prog

Below we show what the output would look like if the data were perfectly monotone, using the small dataset from above. Because v1 is observed for all observations, it is omitted from the table entirely. The striking feature of the table below is that, except for the first row, the 0s form a triangle in the upper right and the 1s a triangle in the lower left, reflecting the monotone missing data pattern.

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Percent   |  1  2  3
  ------------+-------------
       33%    |  1  1  1
              |
       33     |  0  0  0
       17     |  1  0  0
       17     |  1  1  0
  ------------+-------------
      100%    |

  Variables are  (1) v2  (2) v3  (3) v4

Similar output can be produced using the user-written program mvpatterns. One advantage of mvpatterns is that it can be used in Stata 10 and earlier. The command and its output are shown below. The first table in the output lists the variables with missing values, along with their storage type (labeled type), number of observations (obs), number of missing values (mv), and variable label. The second table produced by mvpatterns shows the missing data patterns. Under the heading _pattern is a visual representation of the missing data patterns; the columns under this heading represent the variables with missing values (in the order shown in the first table). A plus sign ("+") indicates that the variable is observed in a given missing data pattern, while a period (".") indicates that it is not. The second table also shows the number of missing values in each missing data pattern (_mv) and the frequency of that pattern (_freq). From this table we can see that the most frequent missing data pattern, shown in the first row, is actually no missing data at all (i.e., "++++++").

mvpatterns
variables with no mv's: id race ses schtyp socst _mi_miss

Variable     | type     obs   mv   variable label
-------------+------------------------------------
female       | float    182   18   
prog         | float    182   18   type of program
read         | float    191    9   reading score
write        | float    183   17   writing score
math         | float    185   15   math score
science      | float    184   16   science score
--------------------------------------------------

Patterns of missing values

  +------------------------+
  | _pattern   _mv   _freq |
  |------------------------|
  |   ++++++     0     117 |
  |   .+++++     1      15 |
  |   +.++++     1      14 |
  |   +++++.     1      13 |
  |   +++.++     1      12 |
  |------------------------|
  |   ++++.+     1      11 |
  |   ++.+++     1       8 |
  |   +++..+     2       2 |
  |   +++.+.     2       1 |
  |   ++.++.     2       1 |
  |------------------------|
  |   +.+++.     2       1 |
  |   +.++.+     2       1 |
  |   +.+.++     2       1 |
  |   .+++.+     2       1 |
  |   .++.++     2       1 |
  |------------------------|
  |   ..++++     2       1 |
  +------------------------+

You can also request a description of the missing data in tree form. This may be a particularly useful way of displaying data that is near-monotone, because it allows one to easily see this type of pattern even if the variables aren't arranged in the proper order. mi misstable tree becomes less useful, and more difficult to read, when the number of variables with missing data is large. Below is the command and its output; we have used the frequency option so that the output displays the number of cases with missing values rather than the percent of the total that are missing. The output can be a little tricky to read at first. The variables are listed starting with the variable missing the most values (on the left) and ending with the variable missing the fewest values (on the right), so we know that female is the variable with the most missing values and read the fewest (which we also know from the earlier output). If we start reading down the first column, the output tells us that 18 cases are missing values of female, while 182 cases are observed. Now reading across the first row, we see that of those 18 cases with missing values on female, 1 is also missing on prog, and the other 17 are not. Reading further down, we see that of the 182 cases with valid values of female, 17 have missing values on prog, while 165 do not. Returning to the first row and reading across to the third through sixth columns, we see that the case missing on both female and prog does not have missing values on any other variables. Looking further down the table, we see that of the 17 cases missing on female but not on prog, 1 has a missing value for write, and so on. This confirms what we already suspected about the data: the pattern of missing values is arbitrary rather than monotone.

mi misstable tree , frequency

  Nested pattern of missing values
      female       prog      write    science       math       read
  -----------------------------------------------------------------
          18          1          0          0          0          0 
                                                                  0 
                                                       0          0 
                                                                  0 
                                            0          0          0 
                                                                  0 
                                                       0          0 
                                                                  0 
                                 1          0          0          0 
                                                                  0 
                                                       0          0 
                                                                  0 
                                            1          0          0 
                                                                  0 
                                                       1          0 
                                                                  1 
                     17          1          0          0          0 
                                                                  0 
                                                       0          0 
                                                                  0 
                                            1          0          0 
                                                                  0 
                                                       1          0 
                                                                  1 
                                16          0          0          0 
                                                                  0 
                                                       0          0 
                                                                  0 
                                           16          1          0 
                                                                  1 
                                                      15          0 
                                                                 15 
         182         17          1          0          0          0 
                                                                  0 
                                                       0          0 
                                                                  0 
                                            1          0          0 
                                                                  0 
                                                       1          0 
                                                                  1 
                                16          1          0          0 
                                                                  0 
                                                       1          0 
                                                                  1 
                                           15          1          0 
                                                                  1 
                                                      14          0 
                                                                 14 
                    165         15          1          0          0 
                                                                  0 
                                                       1          0 
                                                                  1 
                                           14          2          0 
                                                                  2 
                                                      12          0 
                                                                 12 
                               150         14          0          0 
                                                                  0 
                                                      14          1 
                                                                 13 
                                          136         11          0 
                                                                 11 
                                                     125          8 
                                                                117 
  -----------------------------------------------------------------
 (number missing listed first)

The output from mi misstable tree , frequency run on the small monotone dataset from above is shown below. Here we can see the monotone missing structure; the important thing to notice is that, reading from left to right, once a case has an observed value on one variable, it has observed values on all subsequent variables.

  Nested pattern of missing values
          v4         v3         v2
  --------------------------------
           4          3          2 
                                 1 
                      1          0 
                                 1 
           2          0          0 
                                 0 
                      2          0 
                                 2 
  --------------------------------
 (number missing listed first)

Exploring missing data mechanisms

The missing data mechanism is the process that generates missing values, that is, what predicts whether a given value is missing or not. Missing data mechanisms generally fall into three categories: missing completely at random, missing at random, and missing not at random. There are precise technical definitions for these terms in the literature; the following explanation necessarily contains some simplification.

When data is missing completely at random (MCAR), neither other variables in the dataset nor the unobserved value of the variable itself predicts whether a value will be missing. MCAR is a fairly strong assumption and tends to be relatively rare. One relatively common scenario in which data can be assumed to be missing completely at random is when a subset of cases is randomly selected to undergo additional measurement, for example, in health surveys where some subjects are randomly selected to undergo a more extensive physical examination.

Data is said to be missing at random (MAR) if other variables in the dataset can be used to predict missingness on a given variable. For example, in surveys, men may be more likely to decline to answer some questions than women (i.e., gender predicts missingness on other variables). MAR is a less restrictive assumption than MCAR.

Finally, data is said to be missing not at random (MNAR, sometimes also called not missing at random, or NMAR) if the value of the unobserved variable itself predicts missingness. A classic example of this is income: individuals with very high incomes are more likely to decline to answer questions about their income than individuals with more moderate incomes. It can be difficult to determine whether values are MNAR, because the information that would confirm this is unobserved. As a result, the decision to treat data as MNAR is often made on the basis of theoretical and/or substantive information, rather than information present in the data itself.

The missing data mechanism is important because different types of missing data require different treatment. When data is missing completely at random, analyzing only the complete cases will not result in biased parameter estimates (e.g., regression coefficients); it can, however, substantially reduce the sample size for an analysis, leading to larger standard errors. In contrast, analyzing only complete cases for data that is either missing at random or missing not at random can lead to biased parameter estimates. Multiple imputation generally assumes that the data is at least missing at random, which means it can also be used on data that is missing completely at random. Note that procedures for imputing data that is missing not at random have been developed, but to our knowledge they have not been implemented in Stata.

If we have carefully examined the missing values in our dataset, we may already have some sense of how the missing values were generated. It is possible to examine whether values of the variables in the dataset predict missingness on other variables, suggesting that the data may be missing at random. One method of doing this is as follows. First, for each variable with missing values, create a binary variable equal to 1 if the value is missing and 0 if it is observed. Next, examine the relationship between that indicator and the other variables in the dataset. Because it is not uncommon to have a large number of variables with missing values, below we use a loop to carry out these steps for several of the variables in our dataset that have missing values. First, we create a local macro, named corrvars, that contains a list of the variables that might predict missingness in the variables with missing values. Next, we use a foreach loop to repeat a set of commands for each variable in our list. The loop repeats the actions inside the braces for each of the variables following the word varlist. The first line inside the loop (the third line of syntax) creates a binary variable that indicates missingness; for example, for female it creates the variable m_female, which is equal to 1 when female is missing and 0 otherwise. Next, the indicator variable is correlated with all of the variables listed in the local macro corrvars, omitting the variable whose indicator is being examined (e.g., female is removed from the list of variables to be correlated with m_female).

local corrvars "female ses schtyp read write math science socst"
foreach var of varlist female science read write math {
	gen m_`var' = missing(`var')
	pwcorr m_`var' `: list corrvars - var'
}

Below we show the output correlating the indicators of missingness for female and math (i.e., m_female and m_math, respectively) with other variables from the dataset. The loop above also produces correlation tables for the other variables listed after varlist (i.e., science, read, and write), but this output has been omitted. In the first table, we see the correlations between the indicator variable, m_female, and the other variables in the dataset. Note that we have not requested significance tests because whether the correlations between missingness and the other variables are statistically significant is unimportant; we are not, after all, making inferences to a larger population, we are simply looking for patterns within the dataset. The correlations of missingness on female (i.e., m_female) with read, write, and math stand out as the largest values, but overall, we see small to moderate correlations between missingness on female and the other variables in our dataset.

             | m_female      ses   schtyp     read    write     math  science    socst
-------------+------------------------------------------------------------------------
    m_female |   1.0000 
         ses |  -0.1207   1.0000 
      schtyp |   0.0057   0.1367   1.0000 
        read |  -0.1499   0.2865   0.0669   1.0000 
       write |  -0.1512   0.2109   0.1492   0.5872   1.0000 
        math |  -0.1428   0.2692   0.0908   0.6589   0.6182   1.0000 
     science |  -0.0291   0.2708   0.0733   0.6329   0.5498   0.6296   1.0000 
       socst |  -0.1228   0.3319   0.0968   0.6160   0.5975   0.5451   0.4512   1.0000 


             |   m_math   female      ses   schtyp     read    write  science    socst
-------------+------------------------------------------------------------------------
      m_math |   1.0000 
      female |  -0.1149   1.0000 
         ses |  -0.1268  -0.1140   1.0000 
      schtyp |  -0.1243  -0.0028   0.1367   1.0000 
        read |  -0.0560  -0.0174   0.2865   0.0669   1.0000 
       write |  -0.0285   0.2508   0.2109   0.1492   0.5872   1.0000 
     science |   0.0981  -0.0918   0.2708   0.0733   0.6329   0.5498   1.0000 
       socst |  -0.1880   0.0889   0.3319   0.0968   0.6160   0.5975   0.4512   1.0000

<output omitted>

Finding correlations between missingness on a given variable and other variables in the dataset is consistent with the MAR assumption, but it does not evaluate the assumption that missing values on a given variable are unrelated to the (unobserved) value of that variable itself. Potthoff, Tudor, Pieper, and Hasselblad (2006) discuss a technique for assessing the degree to which this assumption is tenable.

Building an imputation model

The quality of the imputation model influences the quality of the final results, so it is important to carefully consider its design. In general, one wants to use as much care building an imputation model as one uses in building an analysis model; hence, it can take as long, or longer, to build a good imputation model as it takes to build a good analysis model. There are a number of important decisions to be made when building an imputation model, including, but not limited to: what method should be used to generate the imputations; whether to impute for a specific analysis/model or for an entire dataset; what, if any, "auxiliary" variables should be included in the imputation model; and how many imputed datasets should be generated. Below is a brief description of some of the options for each of these decisions. It is not intended to be a thorough discussion of all possible options; we recommend that you read the literature on MI for that. Because many issues in this area are unresolved, care should be taken to consult recent sources, as what is considered good practice may change. Additionally, while various procedures have advantages and disadvantages, we are unaware of research clearly supporting one technique over another in all circumstances. The best practice may be to repeat the analysis under different imputation models to see if, and how, changes in the imputation model result in changes in the final results.

If the pattern of missingness is monotone

You may have wondered why such a big deal was made of checking whether the pattern of missingness was monotone, or even near monotone. If the missingness is monotone, the process of imputation becomes much easier from a statistical standpoint. One of the advantages of imputing monotone data is that it is relatively easy to impute binary, ordinal, or categorical variables, something that is trickier with arbitrary missing data patterns. Because monotone missing data patterns are relatively rare, we won't discuss the process of imputing them in depth here. The Stata command for imputing monotone missing data is mi impute monotone. Once imputed, MI datasets from data with monotone missingness are analyzed in the same manner as other MI datasets.
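
For reference, a hypothetical call might look like the sketch below (the choice of variables and method is purely illustrative, and assumes the variables to be imputed have already been registered with mi register imputed).

* impute read, write, and math, assuming monotone missingness, using a
* linear regression for each, with socst as a complete predictor
mi impute monotone (regress) read write math = socst, add(5)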

The multivariate normal model and chained equations approaches

When the pattern of missingness is arbitrary, two common approaches to creating MI datasets are imputation using the multivariate normal model and imputation using the chained equations approach. The first, introduced by Rubin (see, e.g., Little and Rubin 2002), involves drawing from a multivariate normal distribution of all the variables in the imputation model. The second approach, imputation by chained equations, is sometimes referred to as the ICE or MICE (multiple imputation by chained equations) approach. The ICE approach generates imputations by performing a series of univariate regressions. Both the multivariate normal and ICE approaches are available in Stata. The multivariate normal approach is implemented in the mi impute mvn command introduced in Stata 11. The multivariate normal model has also been implemented in a user-written package, inorm (for help locating and installing this package see our FAQ: How do I use the findit command to search for programs and additional help?). inorm requires Stata version 9 or later, making it available to users who do not have access to Stata 11. The ICE approach is implemented in a user-written package called ice. The current version of ice requires Stata version 9.2 or later.
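
To give a flavor of the syntax, below are hedged sketches of each approach using variables from our dataset. The variable lists are illustrative only; both commands take many options not shown, and mi impute mvn additionally assumes the data has been mi set.

* multivariate normal imputation with Stata 11's built-in command
mi register imputed read write math science
mi impute mvn read write math science = socst ses, add(5)

* imputation by chained equations with the user-written ice command
ice read write math science socst ses, saving(imputed, replace) m(5)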

The multivariate normal approach has stronger theoretical underpinnings and somewhat better statistical properties, but the ICE approach seems to work well in practice. An advantage of the ICE approach is that the variables are not assumed to have a multivariate normal distribution; however, the multivariate normal approach tends to be robust to departures from normality. One common concern among users new to the multivariate normal approach is that the normality assumption results in imputed values that do not necessarily resemble the observed values. For example, imputed values of a binary (0/1) variable may take on any value under the multivariate normal approach. Mathematically, this is not, generally speaking, a problem, but users sometimes find it disconcerting. Another advantage of the ICE approach is that, because it involves a series of univariate models rather than a single large model, it can be somewhat easier to estimate. This can be useful if you have a large number of variables in the imputation model or a relatively small number of cases. However, because the imputations are based on a series of univariate models, imputation models using the ICE approach can be tedious to specify. As the above is intended to suggest, neither approach is necessarily "better" than the other in all situations. We recommend that you read the literature to learn about both approaches and then carefully consider which is best suited to your situation. You may also want to consider creating MI datasets with both approaches and comparing the results of your final analysis under the two.

What variables should go in the model?

At one extreme, one can run an imputation model that contains only those variables to be used in a specific analysis model. At the other end of the spectrum, one could run an imputation model that includes all of the variables in a dataset. The advantage of imputing for a specific analysis or set of analyses is that one can be sure to include all relevant variables, non-linear terms, interactions, etc., which may not be possible when imputing for an entire dataset. One advantage of imputing the entire dataset is that the imputation model then uses all of the information in the dataset. Further, if one imputes most or all of the variables in the dataset, the resulting MI dataset can be used for most or all future analyses. Creating a single imputed dataset for use in future studies also makes it easier for you, and others, to replicate results (which is a good reason why one might want to use the MI datasets sometimes distributed with large public use datasets).

Even if you intend to impute for a specific analysis, you may want to include some variables in the multiple imputation model that are not in the planned analysis; these are sometimes called auxiliary variables. Auxiliary variables may or may not have missing values. Regardless of which strategy you choose, you may want to include variables that do not have missing values in the imputation model. Note that the number of variables that can be included in an imputation model also depends on the imputation method and sample size. When collecting or downloading data, if you anticipate a lot of missing values on a specific measure, you can sometimes plan for auxiliary variables. If you are still in the research design phase, you can include additional measures that may be less prone to missingness. For example, in a questionnaire, individuals may decline to report their income but be more willing to report the type of car they drive or the number of rooms in their house, items that may be useful in imputing income. Similarly, if you are using a large existing dataset, you may want to download (or otherwise obtain) variables that aren't necessary for your analysis model but may be useful in an imputation model.

One common question about imputation is whether the dependent variable should be included in the imputation model. The answer is yes: if the dependent variable is not included in the imputation model, the imputed values will not have the same relationship to the dependent variable that the observed values do. In other words, omitting the dependent variable from the imputation model may artificially weaken the relationship between the independent and dependent variables. After the imputations have been created, the issue of how to treat imputed values of the dependent variable becomes more nuanced. If the imputation model contains only those variables in the analysis model, then using the imputed values of the dependent variable does not provide additional information and actually introduces additional error (von Hippel 2007). As a result, some authors suggest including the dependent variable in the imputation model, which may include imputing its values, and then excluding any cases with imputed values of the dependent variable from the final analysis (von Hippel 2007). If the imputation was performed using auxiliary variables, or if the dataset was imputed without a specific analysis model in mind, then the imputed values of the dependent variable may provide additional information, and it may be useful to include cases with imputed values of the dependent variable in the analysis model. Note that it is relatively easy to test the sensitivity of results to the inclusion of cases with imputed values of the dependent variable by running the analysis model with and without those cases, as sketched below.
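
For instance, below is a hypothetical sketch of the "impute, then delete" strategy, where y, x1, and x2 are placeholder variable names.

* flag cases whose dependent variable was originally missing
gen dv_miss = missing(y)
* ... impute y along with the other variables using mi impute ...
* then exclude the flagged cases when fitting the analysis model
mi estimate: regress y x1 x2 if dv_miss == 0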

Selecting the number of imputations (m)

Historically, the recommendation was for 3 to 5 MI datasets. While relatively low values of m may be appropriate when the proportion of missing values is low and the analysis techniques are relatively simple, larger values of m are now often recommended. To some extent, this change in the recommended number of imputations is based on the radical increase in the computing power available to the typical researcher, making it more practical to create and analyze MI datasets with a larger number of imputations. Recommendations vary, but, for example, the Stata manual suggests 5 to 20 imputations for low fractions of missing information, and as many as 50 (or more) imputations when the proportion of missing data is relatively high. One article that discusses the issue of the number of imputations is Graham, Olchowski, and Gilreath (2007). The mi commands include features designed to allow you to test the sensitivity of your results to the number of imputations, and we recommend this type of sensitivity analysis (discussed in the section on analysis of MI data, and sketched below). A larger number of imputations may also allow hypothesis tests with less restrictive assumptions (i.e., tests that do not assume equal fractions of missing information for all coefficients).
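
For example, after creating a relatively large number of imputations, one way to check sensitivity to m is mi estimate's nimputations() option, which restricts the analysis to the first # imputations; the model below is the one used earlier in this seminar.

* compare results using only the first 5 imputations to results using all of them
mi estimate, nimputations(5): regress socst write read female math
mi estimate: regress socst write read female math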

Imputing with special data structures

Two common types of "special" data are complex survey designs and longitudinal data. Imputation of complex survey data presents a number of issues that are not discussed here; before deciding whether and how to impute data from a complex survey, you probably want to read the literature in this area carefully to be sure you understand these issues. An alternative to imputing complex survey data yourself is to use MI datasets that have been created and released by the data distributors. However, you should carefully study the documentation of such MI datasets, as well as the literature on imputation of survey data, to be sure you understand how the data was prepared. An advantage of using complex survey data that has been imputed by the data distributor is that the distributor often has access to sensitive information not included in public use datasets, and this information may have been used in the imputation process. If the sensitive variables provide information useful in the imputation process, then imputations produced by the data distributor may be better than imputations produced by the user.

Longitudinal data in wide form (i.e., all the observations for a single subject are on the same line) can be imputed using the same procedures as data from a single time point. The data should be in wide form for two reasons. First, in long form there is more than one observation per individual, so the cases aren't independent. Second, if a case has a valid response at one time point but missing data at others, the individual's valid response is likely to be a good predictor of the missing value. For example, we might expect an individual's score on a test of mathematics ability at time 1 to be correlated with their score on a test of mathematics ability at time 2. For more information on imputing longitudinal data, you may want to see our FAQ: How can I perform multiple imputation on longitudinal data using ICE? (note: the example uses ice, but the process is similar if you are using mi impute).
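
As a minimal sketch, assuming hypothetical variables id (subject identifier), time (wave number), and math (a score measured at each wave), long-form data can be moved to wide form with reshape before imputing:

* one row per subject per wave becomes one row per subject
reshape wide math, i(id) j(time)

The resulting variables math1, math2, etc. can then be included in the imputation model like any other variables.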

Imputing using the multivariate normal model

Recall that the example dataset contains demographic information and test scores from 200 high school students. Below we open the dataset and summarize the variables using the summarize command (abbreviated sum). The dataset contains 11 variables, 6 of which, female, prog, read, write, math, and science, have missing values (i.e., fewer than 200 observations).

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
(highschool and beyond (200 cases))

sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       200       100.5    57.87918          1        200
      female |       182    .5549451    .4983428          0          1
        race |       200        3.43    1.039472          1          4
         ses |       200       2.055    .7242914          1          3
      schtyp |       200        1.16     .367526          1          2
-------------+--------------------------------------------------------
        prog |       182    2.027473    .6927511          1          3
        read |       191    52.28796    10.21072         28         76
       write |       183    52.95082    9.257773         31         67
        math |       185     52.8973    9.360837         33         75
     science |       184    51.30978    9.817833         26         74
-------------+--------------------------------------------------------
       socst |       200      52.405    10.73579         26         71

Using what we learned from examination of the data (e.g., descriptive statistics, patterns of missing values, etc.), we have considered the alternatives discussed in the previous section. We have decided to impute all 6 variables with missing values, and to include all of the variables with complete information in the imputation model. One alternative to this approach would be to impute for a specific analysis model, for example, the model we ran at the beginning of the seminar, predicting the variable socst with write, read, female, and math. An imputation model just for that analysis would impute the variables write, read, female, and math, and include the complete variable socst. Because the dataset contains only a few variables, and the analysis model is relatively simple (i.e., it does not contain any interactions or non-linear terms), the imputation models based on these two approaches are similar; with a larger number of variables or a more complex analysis model, this might not be the case. For this example, we will create five imputations (m=5). The example dataset has relatively few missing values and a simple analysis model, so this may be sufficient. However, as discussed above, for many applications more than 5 imputations may be desirable.
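
For illustration, a sketch of the narrower, analysis-specific alternative might look like the following (we do not run it here; it assumes the data have already been mi set and the variables registered, steps described below):

mi impute mvn write read female math = socst, add(5)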

The data must be mi set before any of the mi commands can be used. The first line of syntax shown below uses mi set to set the data using the flong style. When the imputed data is stored in the flong style, the original data plus the m imputed datasets are stored in a single data file. So, if there are 5 imputations, the file includes 6 copies of the dataset, that is, each case in the dataset occurs 6 times, once for the original data, and once for each of the 5 imputations. Storing data in the flong style is inefficient, so you may want to use another style; we've used it here because it's relatively easy to think about. (For more information on available styles type help mi styles in the Stata command window.) Next we use the tab command with the gen(...) option to generate dummy variables for the variable prog, for use in the imputation model. Because "pr_" was used in the gen(...) option, the dummy variables will be named pr_1, pr_2, and pr_3.

mi set flong
tab prog, gen(pr_)

    type of |
    program |      Freq.     Percent        Cum.
------------+-----------------------------------
    general |         41       22.53       22.53
   academic |         95       52.20       74.73
   vocation |         46       25.27      100.00
------------+-----------------------------------
      Total |        182      100.00

Now we need to register the variables in our dataset. Based on our previous explorations of the data we know that the variables prog, female, read, write, math, and science all have missing values, all of which we plan to impute. In the first line of syntax below, we use the mi register imputed command to tell Stata that female, read, write, math, science, pr_2, and pr_3 are all imputed. This isn't true yet, but this is how we tell Stata which variables we intend to impute. Note that in our case, prog and pr_1 are omitted from the list of imputed variables, because they will not be included in the imputation model due to collinearity with pr_2 and pr_3. In the second line of syntax, we register race, ses, schtyp, and socst as regular variables (i.e., not imputed or based on imputed variables). Strictly speaking, we do not need to register the variables we do not wish to impute, but registering variables can help in data management. Finally we use mi describe to check that we have properly registered the variables. The output also provides information on how our dataset is set up, which we want to check carefully. For example, the output tells us that the style is flong, which is what we intend, and that there are currently no imputations (M = 0), which is expected because we have yet to impute any values. The output also tells us that there are 3 unregistered variables: id (the case id variable), prog, and pr_1. We haven't registered these, although doing so would be fine.

mi register imputed female read write math science pr_2 pr_3
mi register regular race ses schtyp socst
mi describe

  Style:  flong
          last mi update 0 seconds ago

  Obs.:   complete          117
          incomplete         83  (M = 0 imputations)
          ---------------------
          total             200

  Vars.:  imputed:  7; female(18) read(9) write(17) math(15) science(16) pr_2(18) pr_3(18)

          passive: 0

          regular: 4; race ses schtyp socst

          system:  3; _mi_m _mi_id _mi_miss

         (there are 3 unregistered variables; id prog pr_1)

Once we are sure that everything is in order, we can generate the imputations. The first line of syntax below sets the seed for the analysis so that we can reproduce it later. Setting the seed is necessary if we want to reproduce the results, because of the random component in the imputation process. The second line of syntax runs the imputation model. The command name mi impute mvn tells Stata that we wish to run an imputation model (mi impute) using the multivariate normal model (mvn). The variable list that immediately follows the command name contains the variables we wish to impute in this model, followed by an equal sign ( = ) and a second list of variables. The second list contains variables with complete data that are to be included in the imputation model (i.e., race, ses, and schtyp). We have used the i. prefix with race (i.race) to indicate to Stata that this is a factor (i.e., categorical) variable, and that instead of including it as a continuous variable, the appropriate dummy variables should be included in the model. In other words, it's a shortcut so we don't have to produce the dummy variables ourselves. (For more information on factor variables type help factor variables in the Stata command window.) Note that for variables we wish to impute, we do need to create the dummy variables ourselves, and include k-1 (where k = number of categories) dummy variables in the imputation model. Following the comma is the add(...) option, which is required. The 5 listed in the add(5) option indicates that we wish to add 5 imputations to the existing dataset. In this case, there are currently no imputations, so the result will be a data file with 5 imputations. As mentioned above, in many cases we would want to generate more than 5 imputations for the analysis, but it may be useful to start by running a model with a small number of imputations to make sure that everything runs as expected, and then create additional imputations.

set seed 49230
mi impute mvn female read write math science pr_2 pr_3 = i.race ses schtyp, add(5)

Frequently this command will take some time to run; this is expected because MI is computationally intensive. The output from the mi impute mvn command is shown below. The first few lines of output are status updates that Stata issues while the command is still running. The next few lines give the user information about the imputation that was performed, including the total number of imputations as well as the number added; following this is more technical information about how the model was specified. Finally there is information about the variables that were imputed. For each variable that was imputed, the table lists the number of complete and incomplete observations in the original data (m=0), and the number of values that have been imputed. For example, female had 18 missing values in the original data, all of which have been imputed. If, for some reason, the model we specified had imputed only some of the missing values for a variable, the numbers in the incomplete and imputed columns would not match.

Performing EM optimization:
  observed log likelihood = -1624.7371 at iteration 12

Performing MCMC data augmentation ... 

Multivariate imputation                 Imputations =        5
Multivariate normal regression                added =        5
Imputed: m=1 through m=5                    updated =        0

Prior: uniform                           Iterations =      500
                                            burn-in =      100
                                            between =      100

               |              Observations per m              
               |----------------------------------------------
      Variable |   complete   incomplete   imputed |     total
---------------+-----------------------------------+----------
        female |        182           18        18 |       200
          read |        191            9         9 |       200
         write |        183           17        17 |       200
          math |        185           15        15 |       200
       science |        184           16        16 |       200
          pr_2 |        182           18        18 |       200
          pr_3 |        182           18        18 |       200
--------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled in observations.)

From the user's standpoint, mechanically, generating the imputations is simple, if time consuming. The command is simple, but statistically, the process is much more involved. The imputations are created using a technique called Markov chain Monte Carlo (MCMC). Unlike a lot of models you may be used to (e.g., logit), there is no simple rule that can be used to determine when an MCMC model has converged. One common technique is to visually inspect the parameters from successive iterations of the model to see if they exhibit any clear trends. If they do, the model has probably not converged; if they don't, we may take this as evidence that the model has reached what is called a stationary distribution. Ideally, this is done by checking the series for each parameter, but in larger models this isn't always practical. An alternative is to use the worst linear function (wlf), a sort of summary measure, which, under certain conditions, should converge more slowly than any of the individual parameters. You can request that Stata save the wlf for each iteration to a separate file, and then examine it. The command below is the same as the command above, except for the addition of the savewlf(...) option, which requests that the wlf be saved; the name of the new file containing the wlf values is specified in the parentheses (i.e., wlf).

set seed 49230
mi impute mvn female read write math science pr_2 pr_3 = i.race ses schtyp, add(5) savewlf(wlf)

The output from the above command is identical to the output from the first mi impute mvn command; the only difference is that Stata has recorded the value of the wlf at each iteration in the file wlf.dta. Below we first open that dataset, then list the data to see what Stata has saved. The first variable, iter, contains the iteration number. The iteration numbers for the initial burn-in period (the iterations before any imputations are drawn) count up to 0; by default Stata uses 100 burn-in iterations, so iter = -99 in the first row. The variable m gives the "leg" of the process that an iteration occurs on; for example, for the burn-in period (iter = -99 to 0) the variable m is equal to 1, at iteration 0 the first imputation is drawn, followed by the iterations between the first and second imputations (iter = 1 to 100), for which m = 2, and so on. The variable wlf contains the value of the worst linear function at each iteration.

use http://www.ats.ucla.edu/stat/stata/seminars/missing_data/wlf, clear
list in 1/10, noobs clean

    iter   m         wlf  
     -99   1           0  
     -98   1    .0001751  
     -97   1   -.0000706  
     -96   1    .0002469  
     -95   1    .0000131  
     -94   1    -.000044  
     -93   1     .000086  
     -92   1    .0001633  
     -91   1    .0002296  
     -90   1   -.0000149

We can use the tsline command to plot the value of the wlf across the iterations, but in order to do this, we must first tell Stata that the data is a time series and which variable contains the time values (iter); we do this with the tsset command. Next we graph the time series plot of wlf. The ytitle(...) and xtitle(...) options are used to put titles on the axes. The graph produced by this command is shown below. Successive iterations are plotted on the x-axis and the value of the wlf on the y-axis. There does not appear to be a long term trend (either increasing or decreasing) in the values of wlf, so the chain seems to have reached a stationary distribution.

tsset iter
        time variable:  iter, -99 to 200
                delta:  1 unit

tsline wlf, ytitle(Worst Linear Function) xtitle(Iteration)

Because the imputations are drawn from successive iterations of the same chain, autocorrelation between the imputed datasets is possible. We want to check that enough iterations were left between successive draws (i.e., datasets) that the draws are not autocorrelated. We can assess this using an autocorrelation plot. The command to produce an autocorrelation plot is ac, followed by the name of the variable to be plotted. In the graph below, the x-axis shows the lag, that is, the distance between a given iteration and the iteration it is being correlated with; the y-axis shows the value of the correlations. The correlation of the wlf at each iteration with the iteration immediately following it is positive and reasonably high, about .1; after that, the correlations at lags of 2 to 40 iterations are relatively low, without much of a pattern, suggesting that any autocorrelation is relatively short lived. Our imputation allowed 100 iterations (the default) between successive draws; based on this plot, such a long period may have been unnecessary.

ac wlf

If we were particularly worried about the convergence of individual parameters in the model, we could request that the values of individual parameters, rather than the summary measure wlf, be saved. We do not cover these commands in detail here; for more information see help mi ptrace and help mi impute mvn.
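
As a minimal sketch (assuming the same imputation model as above), the parameter series can be saved with the saveptrace(...) option of mi impute mvn and then examined with the mi ptrace commands:

mi impute mvn female read write math science pr_2 pr_3 = i.race ses schtyp, add(5) saveptrace(ptr)
mi ptrace describe using ptr
mi ptrace use ptr, clear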

It is also a good idea to carefully look at the descriptive statistics for the imputations (tools for doing so will be covered in part two of this seminar). We don't expect the descriptive statistics for the imputed data to be the same as those in the original data, but examining these values can help us see where things might have gone very wrong in the imputation process. For example, if a variable with a range of 1-10 (in the original data), has a mean outside that range (e.g., 15) after imputation, this suggests there may have been a problem in the imputation model.
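
As a quick check of this kind (one of several possible approaches), mi xeq can be used to run summarize separately in the original data (m=0) and in each imputation:

mi xeq 0: sum female read write math science
mi xeq 1/5: sum female read write math science

Large discrepancies between the statistics for m=0 and those for m=1 through m=5, such as an imputed mean far outside the observed range, would warrant a closer look at the imputation model.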

In this seminar we've shown how you can explore missing values using Stata's mi commands, as well as user-written commands. We also discussed some of the major issues in imputation and showed an example of imputation using Stata's mi impute mvn command. In part two of this seminar we will show an alternative method of imputation, imputation by chained equations. We will also show some of the tools you can use to explore, manage, and analyze MI data in Stata.

References

Graham, John W., Olchowski, Allison E., and Gilreath, Tamika D. (2007). How many imputations are really needed? Some practical clarifications of multiple imputation theory. Prevention Science 8:206.

Little, Roderick J.A., and Rubin, Donald B. (2002). Statistical Analysis with Missing Data, Second Edition. Hoboken, New Jersey: Wiley-Interscience.

McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya, and Figueredo, Aurelio Jose (2007). Missing Data: A Gentle Introduction. New York, New York: The Guilford Press.

Molenberghs, Geert, and Kenward, Michael G. (2007). Missing Data in Clinical Studies. Chichester, West Sussex: John Wiley & Sons Ltd.

Potthoff, Richard F., Tudor, Gail E., Pieper, Karen S., and Hasselblad, Vic (2006). Can one assess whether missing data are missing at random in medical studies? Statistical Methods in Medical Research 15:213-234.

van Buuren, S., Boshuizen, H. C., and Knook, D. L. (1999). Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18:681-694.

von Hippel, Paul T. (2007). Regression with missing Ys: An improved strategy for analyzing multiply imputed data. Sociological Methodology 37.
