Help the Stat Consulting Group by giving a gift

Logistic Regression Analysis

This page shows an example of logistic regression regression analysis with footnotes explaining the
output. These data were collected on 200 high schools students and are
scores on various tests, including science, math, reading and social studies (**socst**).
The variable **female** is a dichotomous variable coded 1 if the student was
female and 0 if male.

Because we do not have a suitable dichotomous
variable to use as our dependent variable, we will create one (which we will
call **honcomp**, for honors composition) based on the continuous variable **
write**. We do not advocate making dichotomous variables out of
continuous variables; rather, we do this here only for purposes of this
illustration.

use http://www.ats.ucla.edu/stat/data/hsb2, clear generate honcomp = (write >=60) logit honcomp female read science

Iteration 0: log likelihood = -115.64441 Iteration 1: log likelihood = -84.558481 Iteration 2: log likelihood = -80.491449 Iteration 3: log likelihood = -80.123052 Iteration 4: log likelihood = -80.118181 Iteration 5: log likelihood = -80.11818 Logit estimates Number of obs = 200 LR chi2(3) = 71.05 Prob > chi2 = 0.0000 Log likelihood = -80.11818 Pseudo R2 = 0.3072 ------------------------------------------------------------------------------ honcomp | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- female | 1.482498 .4473993 3.31 0.001 .6056111 2.359384 read | .1035361 .0257662 4.02 0.000 .0530354 .1540369 science | .0947902 .0304537 3.11 0.002 .035102 .1544784 _cons | -12.7772 1.97586 -6.47 0.000 -16.64982 -8.904589 ------------------------------------------------------------------------------

Iteration 0: log likelihood = -115.64441 Iteration 1: log likelihood = -84.558481 Iteration 2: log likelihood = -80.491449 Iteration 3: log likelihood = -80.123052 Iteration 4: log likelihood = -80.118181 Iteration 5:^{a}log likelihood = -80.11818

a. This is a listing of the log likelihoods at each iteration. (Remember that logistic regression uses maximum likelihood, which is an iterative procedure.) The first iteration (called iteration 0) is the log likelihood of the "null" or "empty" model; that is, a model with no predictors. At the next iteration, the predictor(s) are included in the model. At each iteration, the log likelihood increases because the goal is to maximize the log likelihood. When the difference between successive iterations is very small, the model is said to have "converged", the iterating is stopped and the results are displayed. For more information on this process, see Regression Models for Categorical and Limited Dependent Variables by J. Scott Long.

Logit estimates Number of obs^{c}= 200 LR chi2(3)^{d}= 71.05 Prob > chi2^{e}= 0.0000 Log likelihood = -80.11818^{b}Pseudo R2^{f}= 0.3072

b. **Log likelihood** - This is the log likelihood of the final
model. The value -80.11818 has no meaning in and of itself; rather, this
number can be used to help compare nested models.

c. **Number of obs** - This is the number of observations that were
used in the analysis. This number may be smaller than the total number of
observations in your data set if you have missing values for any of the
variables used in the logistic regression. Stata uses a listwise deletion
by default, which means that if there is a missing value for any variable in the
logistic regression, the entire case will be excluded from the analysis.

d. **LR chi2(3)** - This is the likelihood ratio (LR) chi-square
test. The likelihood chi-square test statistic can be calculated by hand
as 2*(115.64441 - 80.11818) = 71.05. This is minus two (i.e., -2) times
the difference between the starting and ending log likelihood. The number
in the parenthesis indicates the number of degrees of freedom. In this
model, there are three predictors, so there are three degrees of freedom.

e. **Prob > chi2** - This is the probability of obtaining the
chi-square statistic given that the null hypothesis is true. In other
words, this is the probability of obtaining this chi-square statistic (71.05) if
there is in fact no effect of the independent variables, taken together, on the
dependent variable. This is, of course, the p-value, which is compared to
a critical value, perhaps .05 or .01 to determine if the overall model is
statistically significant. In this case, the model is statistically
significant because the p-value is less than .000.

f. **Pseudo R2** - This is the pseudo R-squared. Logistic
regression does not have an equivalent to the R-squared that is found in OLS
regression; however, many people have tried to come up with one. There are
a wide variety of pseudo-R-square statistics. Because this statistic does
not mean what R-square means in OLS regression (the proportion of variance
explained by the predictors), we suggest interpreting this
statistic with great caution.

------------------------------------------------------------------------------ honcomp^{g}| Coef.^{h}Std. Err.^{i}z^{j}P>|z|^{j}[95% Conf. Interval]^{k}-------------+---------------------------------------------------------------- female | 1.482498 .4473993 3.31 0.001 .6056111 2.359384 read | .1035361 .0257662 4.02 0.000 .0530354 .1540369 science | .0947902 .0304537 3.11 0.002 .035102 .1544784 _cons | -12.7772 1.97586 -6.47 0.000 -16.64982 -8.904589 ------------------------------------------------------------------------------

g. **honcomp** - This is the dependent variable in our logistic
regression. The variables listed below it are the independent variables.

h. **Coef**. - These are the values for the logistic regression
equation for predicting the dependent variable from the independent variable.
They are in log-odds units. Similar to OLS regression, the prediction equation is

**log(p/1-p) = b0 + b1*female + b2*read + b3*science**

where p is the probability of being in honors composition. Expressed in terms of the variables used in this example, the logistic regression equation is

**log(p/1-p)**** = -12.7772 + 1.482498*female + .1035361*read +
0947902*science**

These estimates tell you about the relationship between the independent
variables and the dependent variable, where the dependent variable is on the
logit scale. These estimates tell the amount of
increase in the predicted log odds of honcomp = 1 that would be predicted by
a 1 unit increase in the predictor, holding all other predictors constant. Note: For the independent variables which
are not significant, the coefficients are not significantly different from 0,
which should be taken into account when interpreting the coefficients. (See the
columns with the z-values and p-values regarding testing whether the coefficients are
statistically significant). Because these coefficients are in log-odds units, they are often
difficult to interpret, so they are often converted into odds ratios. You
can do this by hand by exponentiating the coefficient, or by using the **or**
option with **logit** command, or by using the **logistic** command.

**female** - The coefficient (or parameter estimate) for the
variable **female** is 1.482498. This means that for a one-unit increase in **
female** (in other words, going from male to female), we expect a 1.482498
increase in the log-odds of the dependent variable **honcomp**, holding all
other independent variables constant.

**read** - For every one-unit increase in reading score (so, for
every additional point on the reading test), we expect a .1035361 increase in
the log-odds of **honcomp**, holding all other independent variables
constant.

**science** - For every one-unit increase in science score, we
expect a .0947902 increase in the log-odds of **honcomp**, holding all other
independent variables constant.

**constant** - This is the expected value of the log-odds of
**honcomp** when all of the predictor variables equal zero. In most cases,
this is not interesting. Also, oftentimes zero is not a realistic value
for a variable to take.

i. **Std. Err.** - These are the standard errors associated with the
coefficients. The standard error is used for testing whether the parameter is
significantly different from 0; by dividing the parameter estimate by the
standard error you obtain a z-value (see the column with z-values and p-values).
The standard errors can also be used to form a confidence interval for the
parameter, as shown in the last two columns of this table.

j. **z** and **P>|z|** - These columns provide the z-value and 2-tailed p-value used in testing the null hypothesis that the
coefficient (parameter) is 0. If you use a 2-tailed test, then you would compare
each p-value to your preselected value of alpha. Coefficients having p-values
less than alpha are statistically significant. For example, if you chose alpha
to be 0.05, coefficients having a p-value of 0.05 or less would be statistically
significant (i.e., you can reject the null hypothesis and say that the
coefficient is significantly different from 0). If you use a 1-tailed test
(i.e., you predict that the parameter will go in a particular direction), then
you can divide the p-value by 2 before comparing it to your preselected alpha
level. With a 2-tailed test and alpha of 0.05, you may reject the null
hypothesis that the coefficient for **female** is equal to 0. The
coefficient of 1.482498 is significantly greater than 0.

The coefficient for **read** is .1035361 significantly different from
0 using alpha of 0.05 because its p-value is 0.000, which is smaller than 0.05.

The coefficient for **science** is .0947902 significantly different
from 0 using alpha of 0.05 because its p-value is 0.000, which is smaller than
0.05.

k. **[95% Conf. Interval]** - This shows a 95% confidence interval for the
coefficient. This is very useful as it helps you understand how high and how
low the actual population value of the parameter might be. The confidence
intervals are related to the p-values such that the coefficient will not be
statistically significant if the confidence interval includes 0.

In this next example, we will illustrate the interpretation of odds ratios.
We will use the **logistic** command so that we see the odds ratios instead
of the coefficients. In this example, we will simplify our model so that
we have only one predictor, the binary variable **female**. Before we
run the logistic regression, we will use the **tab** command to obtain a
crosstab of the two variables.

tab female honcomp

| honcomp female | 0 1 | Total -----------+----------------------+---------- male | 73 18 | 91 female | 74 35 | 109 -----------+----------------------+---------- Total | 147 53 | 200

If we divide the number of males who are in honors composition, 18, by
the number of males who are not in honors composition, 73, we get the odds of
being in honors composition for males, 18/73 = .24657534. If we do the same thing for
females, we get 35/74 = .47297297. To get the odds ratio,
which is the ratio of the two odds that we have just calculated, we get
.47297297/.24657534 = 1.9181682. As we can see in the output below, this is
exactly the odds ratio we obtain from the **logistic** command. The thing
to remember here is that you want the group coded as 1 over the group coded as
0, so honcomp=1/honcomp=0 for both males and females, and then the odds for
females/odds for males, because the females are coded as 1.

With regard to the 95% confidence interval, we do not want this to include the value of 1. When we were considering the coefficients, we did not want the confidence interval to include 0. If we exponentiate 0, we get 1 (exp(0) = 1). Hence, this is two ways of saying the same thing. As you can see, the 95% confidence interval includes 1; hence, the odds ratio is not statistically significant. Because the lower bound of the 95% confidence interval is so close to 1, the p-value is very close to .05.

There are a few other things to note about the output below. The first is that although we have only one predictor variable, the test for the odds ratio does not match with the overall test of the model. This is because the z statistic is actually the result of a Wald chi-square test, while the test of the overall model is a likelihood ratio chi-square. While these two types of chi-square tests are asymptotically equivalent, in small samples they can differ, as they do here. Also, we have the unfortunate situation in which the results of the two tests give different conclusions. This does not happen very often. In a situation like this, it is difficult to know what to conclude. One might consider the power, or one might decide if an odds ratio of this magnitude is important from a clinical or practical standpoint.

logistic honcomp female

Logistic regression Number of obs = 200 LR chi2(1) = 3.94 Prob > chi2 = 0.0473 Log likelihood = -113.6769 Pseudo R2 = 0.0170 ------------------------------------------------------------------------------ honcomp | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- female | 1.918168 .6400451 1.95 0.051 .9973827 3.689024 ------------------------------------------------------------------------------

For more information on interpreting odds ratios, please see How do I interpret odds ratios in logistic regression? .

The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.