|
|
|
||||
|
|
|||||
Note: the code on this page works with R 2.5.0.
Example 2: We wish to study the influence of age, gender and exercise on whether or not someone has a heart attack. Again, we have a binary response variable, whether or not a heart attack occurs.
Example 3: How do variables, such as, gre (Graduate Record Exam scores), gpa (grade point average), and prestige of the undergraduate program effect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/logit.csv"))
In order to make it easier to refer to the variables, we can attach the dataset
attach(mydata) names(mydata) [1] "admit" "gre" "topnotch" "gpa"
This hypothetical data set has a 0/1 variable called admit that we will use as our response (i.e., outcome, dependent) variable. We also have three variables that we will use as predictors: gre, which is the student's Graduate Record Exam score; gpa, which is the student's grade point average; and topnotch, which is a 0/1 variable where 1 indicates that the undergraduate institution was "top notch" and 0 indicates that it is not.
summary(gre) Min. 1st Qu. Median Mean 3rd Qu. Max. 220.0 520.0 580.0 587.7 660.0 800.0 sd(gre) [1] 115.5165 summary(gpa) Min. 1st Qu. Median Mean 3rd Qu. Max. 2.260 3.130 3.395 3.390 3.670 4.000 sd(gpa) [1] 0.3805668 table(topnotch) topnotch 0 1 335 65
Before running logit, check to see if any cells (created by the crosstab of our categorical and response variables) are empty or particularly small. If this occurs, there may be difficulty running the logit model.
xtabs(~ topnotch + admit)
admit
topnotch 0 1
0 238 917
1 35 30
None of the cells are too small or empty (has no cases), so we will run our logit model using the function glm (generalized linear model).
mylogit<- glm(admit~gre+gpa+topnotch, family=binomial(link="logit"), na.action=na.pass)
Since we gave our model a name (mylogit), R will not produce any output from our regression. In order to get the results we use the summary command:
summary(mylogit)
Call:
glm(formula = admit ~ gre + gpa + topnotch, family = binomial(link = "logit"),
data = mydata, na.action = na.pass)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3905 -0.8836 -0.7137 1.2745 1.9572
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.600814 1.096379 -4.196 2.71e-05 ***
gre 0.002477 0.001070 2.314 0.0207 *
gpa 0.667556 0.325259 2.052 0.0401 *
topnotch 0.437224 0.291853 1.498 0.1341
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 478.13 on 396 degrees of freedom
AIC: 486.13
Number of Fisher Scoring iterations: 4
In the output above, the first thing we see is the call, this is R reminding us what the model we ran was, what options we specified, etc.. Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to asses model fit.
The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. (The command to obtain confidence intervals for the coefficient estimates is confint(mylogit).) The table of coefficients shows that both gre and gpa are statistically significant while topnotch is not. The interpretation of logit coefficients can be awkward. For example, for a one unit increase in gpa, the log odds of being admitted to graduate school (vs. not being admitted) increases by .667. For this reason, many researchers prefer to exponentiate the coefficients and interpret them as odds-ratios. R will do this computation for you. To get the exponentiated coefficents, tell R that you want to exponentiate (exp( )), and the object you want to expoentiate is called coefficients and it is part of mylogit (mylogit$coefficients).
exp(mylogit$coefficients) (Intercept) gre gpa topnotch 0.01004366 1.00247990 1.94946627 1.54840220
Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (vs. not being admitted) increased by a factor of 1.94. Since gre scores do not increase by a single unit (they increase only in units of 10), a one unit increase is meaningless. We can take the odds ratio and raise it to the 10th power, e.g. 1.00248 ^ 10 = 1.0250786, and say for a 10 unit increase in gre score, the odds of admission to graduate school increased by a factor of 1.025. The confidence intervals around the odds ratios can be obtained using the command exp(confint(mylogit)). Note that while R will produce it, the odds ratio for the intercept is not interpretable.
Even odds ratios can be hard to interpret. Instead, you can also use predicted probabilities, which are sometimes easier to understand than the coefficients or odds ratios, to interpret your results. We discuss how to obtain predicted probabilities and how to interpret them below.
Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC. We can use the difference in deviance residuals for the current model versus the null model to test for overall model fit. To do this, we test whether the difference between the current model and the null model (a model with just an intercept) is statistically significant using a chi-square test using the difference between the two residuals. The degrees of freedom for this test is equal to the differences in degrees of freedom between the current and the null model (i.e. the number of predictor variables in the model). To find the difference in deviance for the two models we can use the command:
mylogit$null.deviance-mylogit$deviance [1] 21.8469
The degrees of freedom for the difference between the two models is equal to the number of predictor variables in the mode, and can be obtained using:
mylogit$df.null-mylogit$df.residual [1] 3
Finally, the test statistic can be obtained using:
dchisq(mylogit$null.deviance-mylogit$deviance, mylogit$df.null-mylogit$df.residual) [1] 3.362101e-05
The chi-square of 21.85 with a p-value of less than 0.00004 tells us that our model as a whole fits significantly better than an empty model. This is sometimes called a log-likelihood test (the deviance residual is -2*log likelihood). To see the model's log likelihood, we type:
logLik(mylogit) 'log Lik.' -239.0648 (df=4)
The log likelihood (-239.06481) can be used in comparisons other nested models, but we won't show an example of that here.
Predicted probabilities can be computed for both categorical and continuous predictor variables. Although topnotch is not statistically significant, we will use it as an example of a categorical predictor. In order to create predicted probabilities we first need to create a new data frame with the values we want the independent variables to take on to create our predictions. The first line of the code shown below creates an object called topnotch, it is a vector containing the two values that the variable topnotch takes on. The second and third lines of the code below create objects called gre and gpa that only take on one value each, in this case we have set both variables to their means using the command mean(data.frame$varname), but they could be set at any value (e.g. gre=600). These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for gre must be named gre). On the last line of code, we combine the objects we just created (topnotch, gpa, and gre) into a data frame called newdata1.
topnotch <- c(0,1) gre <- c(mean(mydata$gre)) gpa <- c(mean(mydata$gpa)) newdata1 <- data.frame(gre,gpa,topnotch)
Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create the predicted probabilities. While relatively compact, the first line of code below is important and contains a lot of instructions. The newdata1$topnotchP tells R that we want to create a new variable in the dataset (data frame) newdata1 called topnotchP, and that the values of topnotchP should be predictions made using the predict( ) command. The commands within the parentheses tell R that the predictions should be based on the analysis mylogit with values of the predictor variables coming from newdata1 and that the type of prediction is a predicted probability (type="response"). The second line of the code tells R to list the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities. As you can see, the predicted probability of being accepted into the graduate program is 0.29 if the undergraduate institution was not "top notch" (topnotch = 0) and .39 if it was (topnotch = 1), while gre and gpa are held constant at their mean values.
newdata1$topnotchP <-predict(mylogit,newdata=newdata1,type="response")
newdata1
gre gpa topnotch topnotchP
1 587.7 3.3899 0 0.2927149
2 587.7 3.3899 1 0.3905475
We can do something very similar to create a table of predicted probabilities varying the value of gre. Above, we created objects called gre, gpa and topnotch and then bound them into a single data frame. This time, we are going to create newdata2 by creating a data frame and declaring the values within it at the same time. In the first line below, the command gre=seq(220,800,20) tells R that the first variable in the data frame should be called gre and that it should take on a sequence of values from 220 to 800, in increments of 20 (e.g. the first three values of gre in the data frame will be 220, 240, and 260). Next we tell R that we would like the second variable to be called gpa and that for every row in the data frame, gpa should be equal to the mean of gpa in the data frame mydata, and we do the same for topnotch.
The second line of this code is the same as before, except that the name of the variable and data frame have been changed. The third line of this code is a little different than the code for varying topnotch, instead of instructing R to output all of the data in the data frame as we did above, we instruct R to only print the two columns of data that we are interested in (gre and greP), using cbind(newdata2$gre, newdata2$greP). We could instruct R to print the entire data frame newdata2 by entering its name, but the advantage of printing only the two columns that we are interested in is that the variables that are constant (topnotch and gpa) do not clutter up the table, instead, the first column gives the values for gre and the second column the predicted probability based on the value of gre with topnotch and gre held at their means. Reading the table we can see that the predicted probability of getting accepted is only .15 if one's gre score is 220 and increases to .429 if one's gre score is 800 (while gpa and topnotch are held constant at their means).
newdata2<-data.frame(gre=seq(220,800,20),gpa=mean(mydata$gpa),topnotch=mean(mydata$topnotch))
newdata2$greP<-predict(mylogit,newdata=newdata2,type="response")
cbind(newdata2$gre,newdata2$greP)
[,1] [,2]
[1,] 220 0.1516246
[2,] 240 0.1581072
[3,] 260 0.1648132
[4,] 280 0.1717456
[5,] 300 0.1789071
[6,] 320 0.1863001
[7,] 340 0.1939264
[8,] 360 0.2017875
[9,] 380 0.2098843
[10,] 400 0.2182171
[11,] 420 0.2267858
[12,] 440 0.2355896
[13,] 460 0.2446270
[14,] 480 0.2538960
[15,] 500 0.2633937
[16,] 520 0.2731166
[17,] 540 0.2830605
[18,] 560 0.2932204
[19,] 580 0.3035906
[20,] 600 0.3141645
[21,] 620 0.3249348
[22,] 640 0.3358936
[23,] 660 0.3470319
[24,] 680 0.3583403
[25,] 700 0.3698085
[26,] 720 0.3814256
[27,] 740 0.3931799
[28,] 760 0.4050592
[29,] 780 0.4170508
[30,] 800 0.4291413
To compute the predicted probabilities at different levels of gpa holding gre and topnotch constant, we can use a similar command. Note that here we have used the command gpa=1:4 to tell R we want values of gpa from one to four in one unit intervals, if we wanted a different interval (e.g. 0.5), we would use the seq() command seq(from,to,units) (e.g. seq(1,4,0.5)) similar to the command used above for gre. Also, instead of holding topnotch at it's mean, this code holds topnotch at zero (topnotch=0).
newdata3<-data.frame(gre=mean(mydata$gre),gpa=1:4,topnotch=0) newdata3$gpaP<-predict(mylogit,newdata=newdata3,type="response")
It is also possible to get the predicted probability for a given profile, for example, an applicant with a gre score of 700, a gpa of 3.5, whose undergraduate institution is not considered "topnotch." The process is basically the same as above:
newdata3<-data.frame(gre=700,gpa=3.5,topnotch=0) newdata3$predicted <-predict(myprobit,newdata=newdata3,type="response") newdata3
Before we begin the sample write-up we need to get the output into a form more acceptable for publication, such as in LaTeX format. The LaTeX code can be generated easily using the function xtable from a package called xtable.
library(xtable)
l <- xtable(mylogit)
l
% latex table generated in R 2.5.0 by xtable 1.4-6 package
% Fri Jun 01 12:53:03 2007
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrrrr}
\hline
& Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\
\hline
(Intercept) & $-$4.6008 & 1.0964 & $-$4.20 & 0.0000 \\
gre & 0.0025 & 0.0011 & 2.31 & 0.0207 \\
gpa & 0.6676 & 0.3253 & 2.05 & 0.0401 \\
topnotch & 0.4372 & 0.2919 & 1.50 & 0.1341 \\
\hline
\end{tabular}
\end{center}
\end{table}
The LaTeX code above generates a table as shown below:

You will probably also want to use the odds ratios (exponentiated coefficients) and the predicted probabilities calculated above to help you describe your results. Below is one way of describing the results.
A logit regression was used to predict admission to graduate school from gre score, gpa, and whether the student was from a top notch university. gre score and gpa were significant predictors of admission to graduate school, but being from a top notch university was not related to admission to graduate school. For every one unit increase in gpa, the odds of admission (vs. non-admission) increased by a factor of 1.95, while for every ten unit increase in gre score, such odds increased by a factor of 1.025. These findings can also be interpreted using predicted probabilities. With all other variables held constant at their mean, the probability of admission for a gpa of 2.0 was .15, while a gpa of 3.0 resulted in a .26 probability of admission and a gpa of 4.0 was associated with a .40 probability of admission. Likewise, for gre scores of 400, 500, 600 and 700, the probabilities of admission were .22, .26, .31 and .37, respectively, while holding other predictors constant at their mean.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services