This page discusses the use of interactions, particularly interactions with categorical variables, in ice. We assume that you have downloaded and are familiar with ice; if not, please see Multiple Imputation Using ICE. We will begin by modifying our example data set by adding missing values to some variables and creating interactions. Of course, your first step in using ice would not be to add missing values to your data; however, you will need to create any interaction terms that you want to include in the imputation model(s) before you run ice. Please note that the imputation models may be nothing like the models you wish to use for your actual data analyses. Also, many of the same issues that arise in data analysis arise when creating imputation models; for example, if you have many categorical predictor variables and a small data set, you may end up with empty cells, and your model may not converge.
In working with ice (and other imputation procedures), many equations are in play, and it is important not to confuse them. As mentioned above, the equations used to analyze the imputed data are almost always different from the equations used to impute the data. In the imputation process, each variable to be imputed may have its own imputation model. This model may include variables that are themselves to be imputed as well as variables with complete data. Hence, in the examples below, you will notice the use of several variables that have no missing data; ice notes this in its output. Each variable to be imputed is the dependent variable in its own imputation equation, and the type of imputation equation used (e.g., regress, logit, mlogit) depends on the type of variable it is. You will also notice that some, or perhaps all, of the variables to be imputed are themselves used as predictor variables in the imputation equations for other variables. Because the variables are used in different ways at different points in the imputation process, they need to be specified in multiple ways in the ice command. For example, it may seem strange to specify both the original categorical variable and the dummies derived from it in the same ice command, but the original categorical variable is needed as the outcome variable in its own imputation equation, and its dummies are needed as predictor variables in the other imputation equations. Both the original categorical variable and its dummies are listed in the substitute option so that ice knows to use the dummies, rather than the original categorical variable, as predictors in the other imputation equations.
We will use the user-written commands nmissing and mvpatterns to examine the number and pattern of missing values before we begin the imputation. You can download these programs using findit (see How can I use the findit command to search for programs and get additional help? for more information about using findit). The nmissing command gives the number of missing values for each variable listed, and mvpatterns gives the patterns of missing values and their frequencies. We strongly encourage you to "get to know" the amount and pattern of missing data before you begin any imputation procedure. Also, you should consider the mechanism by which the missing values were generated. Most imputation procedures, including ice, assume that the data are missing at random (MAR) or missing completely at random (MCAR). If the data are not missing at random, the imputation may be inappropriate, and the imputed data may lead to erroneous conclusions when analyzed.
* findit mvpatterns
* findit nmissing
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
* creating categorical variables with missing data
gen catmiss = 1
replace catmiss = 2 if _n > 50
replace catmiss = 3 if _n > 100
replace catmiss = 4 if _n > 150
replace catmiss = . in 40/60
replace catmiss = . in 95/105
replace catmiss = . in 140/160
tab catmiss, gen(c)
* adding missing values to female
replace female = . in 3/5
replace math = . in 130/145
tab race, gen(r)
tab ses, gen(s)
sort ses
gen t = ses
replace t = . in 40/45
replace t = . in 60/66
replace t = . in 195/l
tab t, gen(t)
* generating interaction terms
gen fc1 = female*c1
gen fc2 = female*c2
gen fc3 = female*c3
gen fc4 = female*c4
gen fr1 = female*r1
gen fr2 = female*r2
gen fr3 = female*r3
foreach i of numlist 1/4 {
  foreach j of numlist 1/3 {
    gen rc`i'`j' = (race==`i')*(catmiss==`j')
  }
}
gen rs11 = r1*s1
gen rs12 = r1*s2
gen rs21 = r2*s1
gen rs22 = r2*s2
gen rs31 = r3*s1
gen rs32 = r3*s2
gen c1m = c1 * math
gen c2m = c2 * math
gen c3m = c3 * math
gen c4m = c4 * math
gen c1t1 = c1*t1
gen c1t2 = c1*t2
gen c2t1 = c2*t1
gen c2t2 = c2*t2
gen c3t1 = c3*t1
gen c3t2 = c3*t2
gen rm = read*math
save hsb2_ice_miss, replace
***********************************************
***********************************************
nmissing female read math rm catmiss race ses t
female        3
math         16
rm           16
catmiss      53
t            19
mvpatterns female read math rm catmiss race ses t
variables with no mv's: read race ses
Variable     | type     obs   mv   variable label
-------------+-----------------------------------
female       | float    197    3
math         | float    184   16   math score
rm           | float    184   16
catmiss      | float    147   53
t            | float    181   19
-------------------------------------------------
Patterns of missing values
  +------------------------+
  | _pattern   _mv   _freq |
  |------------------------|
  |    +++++     0     122 |
  |    +++.+     1      44 |
  |    ++++.     1      12 |
  |    +..++     2       6 |
  |    +...+     3       6 |
  |------------------------|
  |    +..+.     3       4 |
  |    .++++     1       3 |
  |    +++..     2       3 |
  +------------------------+
Clearly the information provided by nmissing is also given in the mvpatterns output. We include both only to illustrate the use of these commands. In the mvpatterns output, each row represents a different pattern of missing and non-missing values across the variables that have missing values. A + indicates a non-missing value and a . indicates a missing value; the positions follow the order of the variables in the table above the patterns (female, math, rm, catmiss, t). For example, in the first row we have five +'s. Although we listed eight variables on the mvpatterns command, only five had any missing values (read, race and ses do not have missing values, as noted in the very first line of the output). 122 cases have no missing values on the five variables, 44 cases have missing values for catmiss but none of the other variables, four cases have missing values on math, rm and t, etc. You always want to start by examining the number and pattern of the missing values. For a variable like catmiss, with 53 of 200 cases missing, using a good imputation model is critical, as you may find that you get different results when you analyze data with imputed values generated by different imputation models.
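If nmissing and mvpatterns are not installed, much the same information can probably be obtained from Stata's built-in misstable command. The sketch below assumes Stata 11 or later (where misstable is available) and the hsb2_ice_miss file saved above; it is offered as an alternative, not as what was used to produce the output on this page.

```stata
* Sketch: built-in alternatives to nmissing and mvpatterns (assumes Stata 11+)
use hsb2_ice_miss, clear

* number of missing values per variable; variables with none are omitted
misstable summarize female read math rm catmiss race ses t

* patterns of missing values, shown as frequencies rather than percentages
misstable patterns female read math rm catmiss race ses t, frequency
```

In the misstable patterns table, 1 marks a non-missing value and 0 a missing value, so the rows play the same role as the +/. patterns above, although the variables may be listed in a different order.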
Before we begin, please note that both ice and mim (the prefix used to combine the results from the different imputed data sets) have been updated several times since they were originally released. If you have an older version of either of these programs, you may need to download the current version. To determine which version of ice you have, for example, you can type which ice. To ensure that you have the current version of ice, you can type
which ice
ssc describe ice
ssc install ice, replace
To ensure that you have the most current version of mim, you can type
which mim
ssc describe mim
ssc install mim, replace
When using ice, an interaction term formed by two continuous variables is the easiest type of interaction to specify. We will start with a small model with the variables female, read and math and the interaction between read and math, rm. We specify the file into which we want the imputed data to be placed after using. We need to use the passive option so that the interaction term, rm, is not directly imputed. Rather, any missing values on math and read are imputed, and then those variables are multiplied together to get rm. If we did not specify rm in the passive option, ice would impute it like any other variable, but the imputed values would not necessarily be equal to read*math. We will use the seed option to set the seed, so that the results of our imputations can be reproduced. If you don't use the seed option, you will get different imputed values each time you run the code. Finally, we will use the dryrun option. This option allows you to test your ice command for errors without creating any imputed data sets. Please note that not all errors can be detected when this option is used. Hence, in later examples, we will use the m option, which specifies the number of desired imputed data sets, and we will ask for only one data set. This is a handy way to trouble-shoot ice without taking the time to write multiple data sets. Once you have your code exactly as you want it, you can change the number of requested data sets.
ice female read math rm using hsb2_imp1, passive(rm: read*math) seed(356421) dryrun
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 181 90.50 90.50
1 | 3 1.50 92.00
2 | 16 8.00 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit | read math rm
read | | [No missing data in estimation sample]
math | regress | female read
rm | | [Passively imputed from read*math]
End of dry run. No imputations were done, no files were created.
The top of this output tells us that 181 cases have no missing values, that three cases have a missing value on one variable (female), and that 16 cases have missing values on two variables (math and rm). This matches the nmissing output shown above. The bottom part of the output indicates that the logit command was used to impute female, whose missing values were imputed based on read, math and rm. The regress command was used to impute math from female and read, and rm was passively imputed by multiplying read and math after math had been imputed.
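Once the command runs cleanly as a dry run, one way to convince yourself that the passive option did its job is to create a single imputed data set and check that rm really equals read*math. The following is only a sketch: hsb2_check is a hypothetical file name, and it assumes a version of ice that names the imputation-number variable _mj.

```stata
* Sketch: confirm that the passively imputed rm equals read*math
* (hsb2_check is a hypothetical file name used only for this check)
use hsb2_ice_miss, clear
ice female read math rm using hsb2_check, passive(rm: read*math) ///
    seed(356421) m(1) replace

use hsb2_check, clear
* reldif() allows for float rounding; _mj > 0 skips any copy of the original data
assert reldif(rm, read*math) < 1e-6 if _mj > 0
```

If rm had been left out of the passive option, this assert would be expected to fail for the cases in which math was imputed.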
Next let's use some categorical variables. We will use catmiss, which has missing values, and ses, which does not. We would like to include the interaction of catmiss and math in the imputation equation for female. Although there is a problem with the syntax below, let's run it and see what happens. Then we will discuss the problems with it and look at the corrected syntax.
* problematic syntax
ice female catmiss c1 c2 c3 s1 s2 math c1m c2m c3m using hsb2_imp2, ///
passive(c1: catmiss==1\c2: catmiss==2 \c3: catmiss==3 \ ///
c1m: c1*math \ c2m: c2*math \ c3m: c3*math) ///
substitute(catmiss: c1 c2 c3) ///
cmd(catmiss:mlogit) m(1) replace seed(356421)
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 134 67.00 67.00
1 | 3 1.50 68.50
4 | 10 5.00 73.50
7 | 47 23.50 97.00
8 | 6 3.00 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit | c1 c2 c3 s1 s2 math c1m c2m c3m
catmiss | mlogit | female s1 s2 math c1m c2m c3m
c1 | | [Passively imputed from catmiss==1]
c2 | | [Passively imputed from catmiss==2]
c3 | | [Passively imputed from catmiss==3]
s1 | | [No missing data in estimation sample]
s2 | | [No missing data in estimation sample]
math | regress | female c1 c2 c3 s1 s2
c1m | | [Passively imputed from c1*math]
c2m | | [Passively imputed from c2*math]
c3m | | [Passively imputed from c3*math]
Imputing 1..file hsb2_imp2.dta saved
In the syntax above, we have included several categorical variables, but we have handled them differently, depending on whether or not there were any missing data in the variable. In all cases, whether or not a variable has missing data, the dummies for that variable need to be included in the first part of the ice command.
Now let's go through each of the variables that we have used. Because female is already a dummy variable, we can simply include it as it is, regardless of whether or not it has missing data. As you can see, all of the other terms have been included in its imputation equation. The variable catmiss, however, is not a dummy variable (it has four levels), and it has missing data. You need to include both the original catmiss variable and the three dummy variables derived from it (c1-c3). We need the original catmiss variable for use as the dependent variable in the equation to impute its missing values. We need the catmiss dummies for use as independent variables in the imputation equations for female and math, and also to create the interaction terms with math. If you look at the imputation equation for catmiss, however, you can see a problem: the catmiss by math interaction terms have been used to predict catmiss. There are two ways to address this issue, and they will be discussed with the corrected syntax below. We also have the categorical variable ses, represented by its two dummy variables, s1-s2. Because there are no missing data in ses, it won't be used as a dependent variable in an imputation equation and we will not use it in the substitute option; therefore, we don't need ses itself. The continuous variable math needs no special treatment, and neither do the interaction terms of catmiss with math, beyond being defined in the passive option.
In the passive option, we define the catmiss dummies as part of catmiss. In other words, ice imputes the missing values of catmiss, and then those values are copied into the respective dummy variable. If catmiss and its dummies were independently imputed (as they would be without the use of the passive option), the values imputed in catmiss would not be the same as those imputed in its dummies, which would be problematic. We also include the interaction terms in the passive option, for the reasons previously explained.
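A quick way to see the effect of the passive option is to check, in the imputed file, that each dummy agrees with the imputed value of catmiss. This is a sketch under the assumption that hsb2_imp2 has just been written by the command above and that your version of ice names the imputation-number variable _mj.

```stata
* Sketch: the passively imputed dummies should agree with the imputed catmiss
use hsb2_imp2, clear
forvalues k = 1/3 {
    * _mj > 0 skips any copy of the original (unimputed) data
    assert c`k' == (catmiss == `k') if _mj > 0
}
```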
Let's look again at the imputation equation for catmiss. You will notice that the interaction terms of catmiss with math, c1m, c2m and c3m, are used in the imputation equation for catmiss. In other words, the interaction of catmiss and math is used to predict catmiss. This, of course, does not make sense. There are at least two ways of correcting this problem. One possibility is to use the eq option and specify the imputation equation for catmiss. A second possibility is to define the interaction terms differently in the passive option. For example, we could use c1m: (catmiss==1)*math instead of c1m: c1*math. By using catmiss in the definition of c1m, ice knows that catmiss is included in the interaction and will not use the interaction of catmiss and math as a predictor of catmiss.
ice female catmiss c1 c2 c3 c4 s1 s2 math c1m c2m c3m c4m using hsb2_imp2, ///
passive(c1: catmiss==1\c2: catmiss==2 \c3: catmiss==3 \ c4: catmiss==4 \ ///
c1m: (catmiss==1)*math \ c2m: (catmiss==2)*math \ c3m: (catmiss==3)*math \ c4m: (catmiss==4)*math) ///
substitute(catmiss: c1 c2 c3) ///
cmd(catmiss:mlogit) m(1) replace seed(356421)
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 134 67.00 67.00
1 | 3 1.50 68.50
5 | 10 5.00 73.50
9 | 47 23.50 97.00
10 | 6 3.00 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit | c1 c2 c3 c4 s1 s2 math c1m c2m c3m c4m
catmiss | mlogit | female s1 s2 math
c1 | | [Passively imputed from catmiss==1]
c2 | | [Passively imputed from catmiss==2]
c3 | | [Passively imputed from catmiss==3]
c4 | | [Passively imputed from catmiss==4]
s1 | | [No missing data in estimation sample]
s2 | | [No missing data in estimation sample]
math | regress | female c1 c2 c3 c4 s1 s2
c1m | | [Passively imputed from (catmiss==1)*math]
c2m | | [Passively imputed from (catmiss==2)*math]
c3m | | [Passively imputed from (catmiss==3)*math]
c4m | | [Passively imputed from (catmiss==4)*math]
Imputing 1..file hsb2_imp2.dta saved
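The other possibility mentioned above, using the eq option, might look like the following. This is a sketch rather than output we have run: it keeps the interaction terms defined from the dummies (c1m: c1*math) and instead writes out the imputation equation for catmiss by hand, using the same predictors that appear for catmiss in the table above. The dryrun option is used so that no file is written.

```stata
* Sketch of the eq() approach: interaction terms defined from the dummies,
* with the imputation equation for catmiss specified explicitly
use hsb2_ice_miss, clear
ice female catmiss c1 c2 c3 s1 s2 math c1m c2m c3m using hsb2_imp2, ///
    passive(c1: catmiss==1 \ c2: catmiss==2 \ c3: catmiss==3 \ ///
            c1m: c1*math \ c2m: c2*math \ c3m: c3*math) ///
    substitute(catmiss: c1 c2 c3) ///
    cmd(catmiss: mlogit) ///
    eq(catmiss: female s1 s2 math) ///
    dryrun
```

In the dry-run table, the prediction equation for catmiss should then list only female, s1, s2 and math. The eq approach requires you to keep the equation up to date by hand whenever variables are added to the model.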
In this example, we will use a categorical by categorical interaction, catmiss by female. We use K-1 dummies (where K is the number of levels of the categorical variable) for catmiss (c1-c3). We also use the K-1 dummies to construct the interaction terms (fc1-fc3) and include those both in the first part of the ice command and in the passive option. We have also used the eq option, which allows us to specify the exact imputation equation that we want used for a variable. Here we have specified that math should be imputed from female, the catmiss dummies and their interaction terms; that is, from all of the variables except read and the race dummies (r1-r3).
ice female catmiss c1 c2 c3 r1 r2 r3 read math fc1 fc2 fc3 using hsb2_imp3 , ///
passive(c1: catmiss==1 \c2: catmiss==2 \c3: catmiss==3 \ ///
fc1: female*catmiss==1 \fc2: female*catmiss==2 \fc3: female*catmiss==3) ///
substitute(catmiss: c1 c2 c3) ///
cmd(catmiss:mlogit) ///
eq(math: female c1 c2 c3 fc1 fc2 fc3) m(1) replace seed(356421)
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 134 67.00 67.00
1 | 10 5.00 72.00
4 | 3 1.50 73.50
7 | 47 23.50 97.00
8 | 6 3.00 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit | c1 c2 c3 r1 r2 r3 read math
catmiss | mlogit | female r1 r2 r3 read math
c1 | | [Passively imputed from catmiss==1]
c2 | | [Passively imputed from catmiss==2]
c3 | | [Passively imputed from catmiss==3]
r1 | | [No missing data in estimation sample]
r2 | | [No missing data in estimation sample]
r3 | | [No missing data in estimation sample]
read | | [No missing data in estimation sample]
math | regress | female c1 c2 c3 fc1 fc2 fc3
fc1 | | [Passively imputed from female*catmiss==1]
fc2 | | [Passively imputed from female*catmiss==2]
fc3 | | [Passively imputed from female*catmiss==3]
Imputing 1..file hsb2_imp3.dta saved
In this example, we will include two interactions, female by race and race by catmiss. Note that because there are no missing data in the four-level categorical variable race, we only need to represent it by its three dummy variables (r1-r3). We have also included the categorical variable ses (represented by its two dummies, s1 and s2) to show how a categorical variable that has no missing data and is not involved in an interaction is handled. We have also used the original catmiss variable and its three dummies (c1-c3).
ice female catmiss c1 c2 c3 r1 r2 r3 s1 s2 read math fr1 fr2 fr3 ///
rc11 rc12 rc13 rc21 rc22 rc23 rc31 rc32 rc33 using hsb2_imp4, ///
passive(c1: catmiss==1 \ c2: catmiss==2 \ c3: catmiss==3 \ ///
fr1: female*r1 \fr2: female*r2 \ fr3: female*r3 \ ///
rc11: r1*catmiss==1 \ rc12: r1*catmiss==2 \ rc13: r1*catmiss==3 \ ///
rc21: r2*catmiss==1 \ rc22: r2*catmiss==2 \ rc23: r2*catmiss==3 \ ///
rc31: r3*catmiss==1 \ rc32: r3*catmiss==2 \ rc33: r3*catmiss==3 ) ///
substitute(catmiss: c1 c2 c3) ///
m(1) replace seed(356421)
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 134 67.00 67.00
1 | 10 5.00 72.00
4 | 50 25.00 97.00
5 | 6 3.00 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit | c1 c2 c3 r1 r2 r3 s1 s2 read math rc11 rc12 rc13 rc21
| | rc22 rc23 rc31 rc32 rc33
catmiss | mlogit | female r1 r2 r3 s1 s2 read math fr1 fr2 fr3
c1 | | [Passively imputed from catmiss==1]
c2 | | [Passively imputed from catmiss==2]
c3 | | [Passively imputed from catmiss==3]
r1 | | [No missing data in estimation sample]
r2 | | [No missing data in estimation sample]
r3 | | [No missing data in estimation sample]
s1 | | [No missing data in estimation sample]
s2 | | [No missing data in estimation sample]
read | | [No missing data in estimation sample]
math | regress | female c1 c2 c3 r1 r2 r3 s1 s2 read fr1 fr2 fr3 rc11
| | rc12 rc13 rc21 rc22 rc23 rc31 rc32 rc33
fr1 | | [Passively imputed from female*r1]
fr2 | | [Passively imputed from female*r2]
fr3 | | [Passively imputed from female*r3]
rc11 | | [Passively imputed from r1*catmiss==1]
rc12 | | [Passively imputed from r1*catmiss==2]
rc13 | | [Passively imputed from r1*catmiss==3]
rc21 | | [Passively imputed from r2*catmiss==1]
rc22 | | [Passively imputed from r2*catmiss==2]
rc23 | | [Passively imputed from r2*catmiss==3]
rc31 | | [Passively imputed from r3*catmiss==1]
rc32 | | [Passively imputed from r3*catmiss==2]
rc33 | | [Passively imputed from r3*catmiss==3]
Imputing 1..file hsb2_imp4.dta saved
In this example, we use the categorical variables catmiss (and three of its dummies), t (and two of its dummies) and the dummies for race. Both catmiss and t have missing values, and we will use their interaction, as well as the interaction of female and race, and of race and catmiss. Because both catmiss and t have missing values, those variables as well as their dummies are needed: the variables for use in the substitute option and the dummies for use in the prediction equations. We have used the eq option to restrict which variables are used in the imputation of catmiss, math and t.
ice female catmiss c1 c2 c3 t t1 t2 r1 r2 r3 read math fr1 fr2 fr3 ///
rc11 rc12 rc13 rc21 rc22 rc23 rc31 rc32 rc33 c1t1 c1t2 c2t1 c2t2 c3t1 c3t2 using hsb2_mp5, ///
passive(c1: catmiss==1 \ c2: catmiss==2 \ c3: catmiss==3 \ ///
t1: t==1 \ t2: t==2 \ ///
fr1: female*r1 \fr2: female*r2 \ fr3: female*r3 \ ///
rc11: r1*catmiss==1 \ rc12: r1*catmiss==2 \ rc13: r1*catmiss==3 \ ///
rc21: r2*catmiss==1 \ rc22: r2*catmiss==2 \ rc23: r2*catmiss==3 \ ///
c1t1: (catmiss==1)*(t==1) \ c1t2: (catmiss==1)*(t==2) \ c2t1: (catmiss==2)*(t==1) \ c2t2: (catmiss==2)*(t==2) ///
\ c3t1: (catmiss==3)*(t==1) \ c3t2: (catmiss==3)*(t==2) ///
\ rc31: r3*catmiss==1 \ rc32: r3*catmiss==2 \ rc33: r3*catmiss==3) ///
substitute(catmiss: c1 c2 c3, t: t1 t2) ///
cmd(catmiss:mlogit, t:mlogit) ///
eq(catmiss: math read t1 t2 female, math: read t1 t2 c1 c2 c3 c1t1 c1t2 c2t1 c2t2 c3t1 c3t2, ///
t: math female c1 c2 c3 r1 r2 r3 read fr1 fr2 fr3) ///
seed(356421) m(1) replace
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 122 61.00 61.00
1 | 6 3.00 64.00
4 | 3 1.50 65.50
9 | 12 6.00 71.50
10 | 48 24.00 95.50
11 | 6 3.00 98.50
13 | 3 1.50 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit | c1 c2 c3 t1 t2 r1 r2 r3 read math rc11 rc12 rc13 rc21
| | rc22 rc23 rc31 rc32 rc33 c1t1 c1t2 c2t1 c2t2 c3t1 c3t2
catmiss | mlogit | math read t1 t2 female
c1 | | [Passively imputed from catmiss==1]
c2 | | [Passively imputed from catmiss==2]
c3 | | [Passively imputed from catmiss==3]
t | mlogit | math female c1 c2 c3 r1 r2 r3 read fr1 fr2 fr3
t1 | | [Passively imputed from t==1]
t2 | | [Passively imputed from t==2]
r1 | | [No missing data in estimation sample]
r2 | | [No missing data in estimation sample]
r3 | | [No missing data in estimation sample]
read | | [No missing data in estimation sample]
math | regress | read t1 t2 c1 c2 c3 c1t1 c1t2 c2t1 c2t2 c3t1 c3t2
fr1 | | [Passively imputed from female*r1]
fr2 | | [Passively imputed from female*r2]
fr3 | | [Passively imputed from female*r3]
rc11 | | [Passively imputed from r1*catmiss==1]
rc12 | | [Passively imputed from r1*catmiss==2]
rc13 | | [Passively imputed from r1*catmiss==3]
rc21 | | [Passively imputed from r2*catmiss==1]
rc22 | | [Passively imputed from r2*catmiss==2]
rc23 | | [Passively imputed from r2*catmiss==3]
rc31 | | [Passively imputed from r3*catmiss==1]
rc32 | | [Passively imputed from r3*catmiss==2]
rc33 | | [Passively imputed from r3*catmiss==3]
c1t1 | | [Passively imputed from (catmiss==1)*(t==1)]
c1t2 | | [Passively imputed from (catmiss==1)*(t==2)]
c2t1 | | [Passively imputed from (catmiss==2)*(t==1)]
c2t2 | | [Passively imputed from (catmiss==2)*(t==2)]
c3t1 | | [Passively imputed from (catmiss==3)*(t==1)]
c3t2 | | [Passively imputed from (catmiss==3)*(t==2)]
Imputing 1..file hsb2_mp5.dta saved
One thing that should be obvious from the example above is that with only a few categorical variables, you can have lots of interaction terms and some very complicated models. In most cases, you want to be very judicious about the use of interactions with categorical variables, especially if they have lots of levels. As the imputation model gets larger, there is an increased chance that an error will occur. For example, you may have empty cells or too few degrees of freedom to run the model. If you get an error message about one of the imputation equations, you may want to try to run the model outside of ice. For example, you might try the following mlogit command (which is what ice will use if an equation for catmiss is not given in the eq option). From the output, it is clear that there is a problem with this model (e.g., some standard errors are extremely large or missing). Hence, we specified the imputation model for catmiss to be a much smaller model that runs without difficulty when run outside of ice. Running the uvis or mvis command by itself is another way of trying to trouble-shoot problems.
mlogit catmiss math read t1 t2 female r1 r2 r3 fr1 fr2 fr3, nolog
Multinomial logistic regression Number of obs = 122
LR chi2(33) = 171.70
Prob > chi2 = 0.0000
Log likelihood = -81.407005 Pseudo R2 = 0.5133
------------------------------------------------------------------------------
catmiss | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1 |
math | .0692883 .1191486 0.58 0.561 -.1642386 .3028152
read | .0723963 .091542 0.79 0.429 -.1070226 .2518153
t1 | -1.008773 2.185576 -0.46 0.644 -5.292424 3.274877
t2 | .4993844 1.878577 0.27 0.790 -3.182559 4.181328
female | -26.98459 7.410817 -3.64 0.000 -41.50953 -12.45966
r1 | 1.231074 2.257148 0.55 0.585 -3.192855 5.655004
r2 | -26.77656 5.14e+08 -0.00 1.000 -1.01e+09 1.01e+09
r3 | 15.29035 . . . . .
fr1 | -13.51481 3.16e+08 -0.00 1.000 -6.20e+08 6.20e+08
fr2 | -8.609845 5.28e+08 -0.00 1.000 -1.03e+09 1.03e+09
fr3 | -47.76074 1.20e+08 -0.00 1.000 -2.36e+08 2.36e+08
_cons | 15.50512 . . . . .
-------------+----------------------------------------------------------------
2 |
math | .0516269 .1176827 0.44 0.661 -.179027 .2822807
read | .0491975 .0888267 0.55 0.580 -.1248996 .2232946
t1 | -.1006813 2.061735 -0.05 0.961 -4.141608 3.940245
t2 | .2036754 1.880472 0.11 0.914 -3.481982 3.889332
female | -26.94278 7.304014 -3.69 0.000 -41.25839 -12.62718
r1 | .3988195 2.086316 0.19 0.848 -3.690284 4.487923
r2 | 13.62138 . . . . .
r3 | -24.94124 2.48e+08 -0.00 1.000 -4.85e+08 4.85e+08
fr1 | 25.96502 . . . . .
fr2 | -49.25019 1.22e+08 -0.00 1.000 -2.39e+08 2.39e+08
fr3 | -8.895122 2.76e+08 -0.00 1.000 -5.42e+08 5.42e+08
_cons | 17.89011 1.995063 8.97 0.000 13.97985 21.80036
-------------+----------------------------------------------------------------
3 |
math | -.0077738 .0500492 -0.16 0.877 -.1058684 .0903208
read | .0213092 .0448937 0.47 0.635 -.066681 .1092993
t1 | -.8373123 .7999026 -1.05 0.295 -2.405093 .730468
t2 | -.6714624 .7215148 -0.93 0.352 -2.085605 .7426806
female | -.1173197 2.216361 -0.05 0.958 -4.461308 4.226668
r1 | .2695127 . . . . .
r2 | .2419949 1.149276 0.21 0.833 -2.010545 2.494535
r3 | .2598336 1.03789 0.25 0.802 -1.774393 2.29406
fr1 | 23.997 . . . . .
fr2 | -1.547891 . . . . .
fr3 | .6138335 . . . . .
_cons | -.6676294 . . . . .
------------------------------------------------------------------------------
(catmiss==4 is the base outcome)
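Extremely large or missing standard errors like these are often a sign of empty or nearly empty cells. A simple way to look for them, sketched below on the assumption that hsb2_ice_miss is the data set created at the top of this page, is to cross-tabulate the outcome with each categorical predictor, and then within levels of female to mimic the female by race interaction.

```stata
* Sketch: look for empty or sparse cells behind the unstable mlogit estimates
use hsb2_ice_miss, clear

* cell counts for the outcome by each categorical predictor
tabulate catmiss race
tabulate catmiss t

* the female by race interaction splits the cells further
tabulate catmiss race if female == 0
tabulate catmiss race if female == 1
```

Cells with zero or only a handful of cases suggest which terms to drop from the imputation equation for catmiss.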
Continuous versus ordinal categorical: With some variables, such as 5-point Likert scales, you have a choice of whether to consider them to be continuous or categorical. For purposes of the imputation, you may want to consider the variable to be continuous, especially if you want to use it in an interaction. In the analysis, however, you may want to use the same variable as categorical. As far as we know, there is no problem with this approach of using a variable as continuous in the imputation equation and as categorical in the analysis.
Rounding: When imputing variables, you may find that you get decimals when you want whole numbers. For example, in our hsb2 data set, all of the test scores are whole numbers, such as 55, but the imputed values are not, such as 55.26547. As part of your planning process, you should decide how you will do any rounding that is necessary.
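If you decide that rounding is appropriate, one simple approach is to round after the imputation and then rebuild any passively imputed terms that depend on the rounded variable. The sketch below assumes that hsb2_imp1 has been written by an ice run that used the m option rather than dryrun (as in the genmiss() example on this page); hsb2_imp1_rounded is a hypothetical file name.

```stata
* Sketch: round imputed math scores to whole numbers after imputation
use hsb2_imp1, clear

* observed math scores are already whole numbers, so only imputed values change
replace math = round(math)

* rm was passively imputed as read*math, so rebuild it from the rounded math
replace rm = read*math

save hsb2_imp1_rounded, replace   // hypothetical file name
```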
Sample Size: The size of the sample must be carefully considered when doing multiple imputation. Maximum likelihood techniques are being used, and the behavior of these techniques is not well understood (some would say "very unstable") with small sample sizes. When using large survey data sets, sample size is usually not much of an issue, and the prediction equations for the variables can get to be very large and complicated. On the other hand, if you have a small sample, any missing data can be problematic; each case is valuable. Even so, ice (and other forms of multiple imputation) may not be the best way to handle missing data in a small sample, and other techniques should at least be considered.
Imputation Flags: An imputation flag is a dummy variable that is added to the data set to indicate which values of the variable have been imputed. The use of imputation flags is often a good idea when imputing data, particularly if the data will be given to others for analysis. An imputation flag allows you to know how many cases of a particular variable have been imputed. (If you didn't impute the data yourself, you would have no other way of knowing this.) If a large percentage of cases has been imputed, as is common in some public-use data sets, this information is particularly important. For example, if 60% of the cases in a variable have imputed values, the results of the analysis could be heavily determined by the quality of the imputation procedure and equation. If another researcher took the same variable and used a different imputation procedure and/or imputation equation, the imputed values could be very different, and hence, the results of any analysis using this variable could also be very different.
Below we show an example using the genmiss() option, which creates imputation flags. We have modified our first example by adding this option, and by creating three imputation data sets (the m(3) option) instead of using the dryrun option. Once the imputation data set is created (remember that ice puts all of the imputed data sets into one long data set and creates a variable _mj to indicate the different imputations), we use the imputed data set and run the codebook command. The newly created imputation flags are labeled so that you can easily tell what the variable is and what the values mean. Finally, we use the tab1 command to see the number of missing values. You can use the tab2 command as shown at the bottom to see how many missing values there are per data set.
Please note that if you are using older versions of ice or mim, you have the variables _i and _j instead of _mi and _mj. They are the same variables, so you can simply rename _i and _j to be _mi and _mj.
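The renaming described above can be sketched as follows, to be run with the imputed data set in memory. The capture prefix simply makes the block do nothing when _i and _j are not present.

```stata
* Sketch: bring variable names from an older ice/mim up to date
capture confirm variable _i _j
if _rc == 0 {
    rename _i _mi
    rename _j _mj
}
```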
use hsb2_ice_miss, clear
* rm (read*math) is already in hsb2_ice_miss, so it does not need to be generated again
ice female read math rm using hsb2_imp1, passive(rm: read*math) seed(356421) genmiss(miflag) m(3) replace
#missing |
values | Freq. Percent Cum.
------------+-----------------------------------
0 | 181 90.50 90.50
1 | 3 1.50 92.00
2 | 16 8.00 100.00
------------+-----------------------------------
Total | 200 100.00
Variable | Command | Prediction equation
------------+---------+-------------------------------------------------------
female | logit | read math rm
read | | [No missing data in estimation sample]
math | regress | female read
rm | | [Passively imputed from read*math]
Imputing 1..2..3..file hsb2_imp1.dta saved
use hsb2_imp1.dta, clear
codebook _mi miflagfemale _mj
------------------------------------------------------------------------------
_mi                                                                obs. number
------------------------------------------------------------------------------
                  type:  numeric (long)
                 range:  [1,200]                      units:  1
         unique values:  200                      missing .:  0/600
                  mean:     100.5
              std. dev:   57.7825
           percentiles:        10%       25%       50%       75%       90%
                              20.5      50.5     100.5     150.5     180.5
------------------------------------------------------------------------------
miflagfemale                                  1 if female missing, 0 otherwise
------------------------------------------------------------------------------
                  type:  numeric (byte)
                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/600
            tabulation:  Freq.  Value
                           591  0
                             9  1
------------------------------------------------------------------------------
_mj                                                          imputation number
------------------------------------------------------------------------------
                  type:  numeric (int)
                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  0/600
            tabulation:  Freq.  Value
                           200  1
                           200  2
                           200  3
tab1 miflagfemale miflagread miflagmath miflagrm _mj
-> tabulation of miflagfemale
1 if female |
 missing, 0 |
  otherwise |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        591       98.50       98.50
          1 |          9        1.50      100.00
------------+-----------------------------------
      Total |        600      100.00
-> tabulation of miflagread
  1 if read |
 missing, 0 |
  otherwise |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        600      100.00      100.00
------------+-----------------------------------
      Total |        600      100.00
-> tabulation of miflagmath
  1 if math |
 missing, 0 |
  otherwise |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        552       92.00       92.00
          1 |         48        8.00      100.00
------------+-----------------------------------
      Total |        600      100.00
-> tabulation of miflagrm
    1 if rm |
 missing, 0 |
  otherwise |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        552       92.00       92.00
          1 |         48        8.00      100.00
------------+-----------------------------------
      Total |        600      100.00
-> tabulation of _mj
 imputation |
     number |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        200       33.33       33.33
          2 |        200       33.33       66.67
          3 |        200       33.33      100.00
------------+-----------------------------------
      Total |        600      100.00
tab2 miflagfemale _mj
-> tabulation of miflagfemale by _mj
      1 if |
    female |
missing, 0 |        imputation number
 otherwise |         1          2          3 |     Total
-----------+---------------------------------+----------
         0 |       197        197        197 |       591
         1 |         3          3          3 |         9
-----------+---------------------------------+----------
     Total |       200        200        200 |       600
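The flags can also be used to compare imputed and observed values directly, which is one informal check on the imputation model. A minimal sketch, assuming the hsb2_imp1 file and the miflagmath flag created above:

```stata
* Sketch: compare observed (flag = 0) and imputed (flag = 1) math scores
* within each imputed data set
use hsb2_imp1, clear
bysort _mj miflagmath: summarize math
```

Imputed values whose mean or range differs sharply from the observed values are not necessarily wrong, but they are worth a closer look before the data are analyzed or passed on to others.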
These pages are Copyrighted (c) by UCLA Academic Technology Services