
How can I identify cases used by an estimation command using e(sample)?

When performing data analysis, it is very common for a given model (e.g., a
regression model) not to use all of the cases in the dataset. This can occur for a
number of reasons, for example, because **if** was used to tell Stata to perform the analysis on a subset of
cases, or because some cases had missing values on some or all of the variables
in the analysis. To allow you to identify the cases used in an analysis,
most Stata estimation commands return a function, **e(sample)**, that takes on a value of
one if the case was included in the analysis, and zero otherwise (for more
information see our Stata FAQ: How can I access information stored after I run a command in Stata (returned results)?). Below we show how this can be useful in
two common situations. Many more situations exist; once you're aware of this
function and how it works, you'll recognize them. The examples below use
two different versions of the hsb2 dataset. Both versions contain information on
200 high school
students, including their scores on a series of standardized tests and some demographic information.
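Before turning to the hsb2 examples, here is a minimal sketch of the general pattern, using the **auto** dataset that ships with Stata (the variable name **insample** is our own choice; any valid name works):

```stata
* Load a dataset that ships with Stata
sysuse auto, clear

* Run an estimation command on a subset of cases
regress price mpg weight if foreign==1

* e(sample) is 1 for cases used in the model above, 0 otherwise
generate byte insample = e(sample)
tabulate insample
```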

In this example we run a regression model predicting students' **reading** scores based on
their **math** and **science** scores. However, we use **if** to indicate that we want to
run our model on only those cases where the variable **write** is greater
than or equal to 50. Below we see the output for this regression. Note that 128
observations were used in the analysis, rather than the full 200, because we
restricted the sample using **if**.

```
use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear

regress read math science if write>=50

      Source |       SS       df       MS              Number of obs =     128
-------------+------------------------------           F(  2,   125) =   43.61
       Model |  4595.51237     2  2297.75618           Prob > F      =  0.0000
    Residual |  6585.72982   125  52.6858386           R-squared     =  0.4110
-------------+------------------------------           Adj R-squared =  0.4016
       Total |  11181.2422   127  88.0412771           Root MSE      =  7.2585

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        math |   .4952647   .0898744     5.51   0.000     .3173921    .6731373
     science |   .2960287   .0942091     3.14   0.002     .1095772    .4824803
       _cons |    11.7755    4.86182     2.42   0.017     2.153353    21.39764
------------------------------------------------------------------------------
```

Once we have run our model, we can generate predicted values using the **predict**
command. Below we generate a new variable, **p1**, which contains the predicted
values for each case. When we use **summarize** to examine the predicted values, we see that
the variable **p1** has 200 observations, even though the model from which these
predictions were made used only 128 observations. Predicted values were generated
both for the 128 cases where write>=50 and for the 72 cases where write<50 (which were
not used to estimate the model). Generally, we don't want to use a model
estimated on one sample (in our case, observations where write>=50) on a
different sample (observations where write<50). This is particularly true in
cases like this one, where we know there is a systematic difference between the
samples.

```
predict p1
(option xb assumed; fitted values)

summarize p1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          p1 |       200     53.1978    6.875587   40.55244   70.23441
```
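If we do want predictions for all 200 cases but also want to see how the two groups differ, one option (a sketch, not part of the original example) is to summarize **p1** separately for cases in and out of the estimation sample:

```stata
* Compare predicted values for cases used vs. not used in the model
summarize p1 if e(sample)
summarize p1 if !e(sample)
```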

We can use **e(sample)** to generate predicted values only for those cases
used to estimate the model. Below we use **predict** to generate a new
variable, **p2**, that contains predicted
values, but this time we add **if e(sample)==1**, which indicates that
predicted values should only be created for cases used in the last model we ran.
This time Stata tells us that 72 missing values were generated. There are 72
cases where write<50 in the dataset; rather than predicted values, these cases
were given missing values for **p2**. Summarizing the data again, we see that **p2**
contains only the 128 observations that were used to estimate the model.

```
predict p2 if e(sample)==1
(option xb assumed; fitted values)
(72 missing values generated)

sum p2

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          p2 |       128    56.11719    6.015408   42.64159   70.23441
```
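Because **e(sample)** only takes on the values zero and one, the condition **if e(sample)==1** can be shortened to **if e(sample)**. The sketch below is equivalent to the command above (the name **p3** is our own):

```stata
* Equivalent to: predict p2 if e(sample)==1
predict p3 if e(sample)
```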

When we want to compare nested models, the models must be estimated on the
same sample in order for the comparison to be valid. When a dataset contains
missing values, adding additional predictor variables to a model often reduces
the number of cases available for a given model. In this example we fit a model
where **write** predicts **read**, and compare the fit of this model to a model that
contains **math** and **science** as well as **write** as predictors. We will
compare the two models using a likelihood ratio test (i.e., the **lrtest** command). Below we first run a
regression model where the variable **read** is predicted
by the variable **write**, and we store the estimates from that model as **m1** using
the command **estimates store m1**.

```
use http://www.ats.ucla.edu/stat/stata/faq/hsb2_mar, clear

regress read write

      Source |       SS       df       MS              Number of obs =     170
-------------+------------------------------           F(  1,   168) =   94.38
       Model |  6188.25135     1  6188.25135           Prob > F      =  0.0000
    Residual |  11014.7428   168    65.563945          R-squared     =  0.3597
-------------+------------------------------           Adj R-squared =  0.3559
       Total |  17202.9941   169  101.792865           Root MSE      =  8.0972

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .6496086   .0668652     9.72   0.000     .5176042    .7816129
       _cons |   17.65687   3.589724     4.92   0.000      10.5701   24.74365
------------------------------------------------------------------------------

estimates store m1
```

Below we run a second model where **read** is predicted by **write**, **math**,
and **science**. We store the estimates from this model as **m2**.

```
reg read write math science

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   47.27
       Model |    7560.153     3    2520.051           Prob > F      =  0.0000
    Residual |  7304.15906   137  53.3150296           R-squared     =  0.5086
-------------+------------------------------           Adj R-squared =  0.4979
       Total |  14864.3121   140  106.173658           Root MSE      =  7.3017

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2143165   .0915771     2.34   0.021     .0332291    .3954039
        math |   .3973615   .1020276     3.89   0.000     .1956088    .5991141
     science |   .3108218   .0905435     3.43   0.001     .1317781    .4898654
       _cons |   3.851736   4.091921     0.94   0.348    -4.239757    11.94323
------------------------------------------------------------------------------

estimates store m2
```

Now that we have estimated the two models and stored the results, we want to test whether the model that contains
**write**, **math**, and **science** fits significantly better
than the model that contains only **write** as a predictor. One way to do this is with a likelihood ratio
test, which is what we attempt below with the command **lrtest m1 m2**. However, this command generates an error message:
the models were not estimated on the same cases. In order for the test to be valid,
the two models must be run on the same cases, and clearly that is not so here. Looking at the error message and the output from our regressions, we see that the
model using only **write** as a predictor was run on 170 cases, while the model that contained
**write**, **math**, and
**science** as predictors was run on 141 cases. The only difference between the two models is the
addition of the variables
**math** and **science**, indicating that the difference in sample size is due to missing data on
**math** and **science**.

```
lrtest m1 m2
observations differ: 141 vs. 170
r(498);
```

So how do we make sure that the two models are estimated on the same cases? First,
we run the model with **write**, **math**, and **science** as predictors, and store the estimates
as **m2**. Then we use the generate command (**gen**) to create a new variable
called **sample** that is equal to the function **e(sample)**. In other words, the variable **sample**
is equal to one if the case was included in the last analysis (i.e., the model we just
ran) and zero otherwise.

```
regress read write math science

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  3,   137) =   47.27
       Model |    7560.153     3    2520.051           Prob > F      =  0.0000
    Residual |  7304.15906   137  53.3150296           R-squared     =  0.5086
-------------+------------------------------           Adj R-squared =  0.4979
       Total |  14864.3121   140  106.173658           Root MSE      =  7.3017

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .2143165   .0915771     2.34   0.021     .0332291    .3954039
        math |   .3973615   .1020276     3.89   0.000     .1956088    .5991141
     science |   .3108218   .0905435     3.43   0.001     .1317781    .4898654
       _cons |   3.851736   4.091921     0.94   0.348    -4.239757    11.94323
------------------------------------------------------------------------------

estimates store m2
generate sample = e(sample)
```

Now we can use the variable **sample** to run the model with only **write** as
a predictor, restricting the estimation to the 141 cases used by the larger model, and store the
estimates as **m1**.

```
regress read write if sample==1

      Source |       SS       df       MS              Number of obs =     141
-------------+------------------------------           F(  1,   139) =   70.12
       Model |  4984.37291     1  4984.37291           Prob > F      =  0.0000
    Residual |  9879.93915   139  71.0786989           R-squared     =  0.3353
-------------+------------------------------           Adj R-squared =  0.3305
       Total |  14864.3121   140  106.173658           Root MSE      =  8.4308

------------------------------------------------------------------------------
        read |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .6479233   .0773728     8.37   0.000     .4949436     .800903
       _cons |   17.43003     4.1396     4.21   0.000     9.245303    25.61475
------------------------------------------------------------------------------

estimates store m1
```

Now we can use the **lrtest** command again, to test whether the model with **write**, **math** and
**science** as predictors fits significantly better than a model with just **write** as a predictor.

```
lrtest m1 m2

Likelihood-ratio test                                  LR chi2(2)  =     42.59
(Assumption: m1 nested in m2)                          Prob > chi2 =    0.0000
```
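An alternative to generating a **sample** variable is to restrict both models up front to cases that are nonmissing on every variable that appears in either model. A sketch of this approach, using the same variables as above:

```stata
* Fit both models only on cases complete for all four variables;
* missing() returns 1 if any of its arguments is missing
regress read write math science if !missing(read, write, math, science)
estimates store m2
regress read write if !missing(read, write, math, science)
estimates store m1
lrtest m1 m2
```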