|
|
|
||||
|
Help the Stat Consulting Group by
giving a gift
| |||||
|
Loading
|
|||||
Note: the code on this page was created and tested with R 2.8.1.
Logistic regression, also called a logit model, is used to model dichotomous outcome variables. In the logit model the log odds of the outcome is modeled as a linear combination of the predictor variables.Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics and potential follow-up analyses.
Example 1: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively and whether or not the candidate is an incumbent.
Example 2: A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, effect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
For our data analysis below, we are going to expand on Example 2 about getting into graduate school. We have generated hypothetical data, which can be obtained from our website from within R. Note that R requires forward slashes (/) not back slashes (\) when specifying a file location even if the file is on your hard drive.
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/binary.csv"))
In order to make it easier to refer to the variables, we can attach the dataset.
attach(mydata) names(mydata) [1] "admit" "gre" "gpa" "rank"
This dataset has a binary response (outcome, dependent) variable called admit. There are three predictor variables: gre, gpa and rank. We will treat the variables gre and gpa as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
summary(gre)
Min. 1st Qu. Median Mean 3rd Qu. Max.
220.0 520.0 580.0 587.7 660.0 800.0
sd(gre)
[1] 115.5165
summary(gpa)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.260 3.130 3.395 3.390 3.670 4.000
sd(gpa)
[1] 0.3805668
table(rank)
rank
1 2 3 4
61 151 121 67
table(admit)
admit
0 1
273 127
table(rank,admit)
admit
rank 0 1
1 28 33
2 97 54
3 93 28
4 55 12
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.
The code below estimates a logistic regression model using the glm (generalized linear model) function. The as.factor(rank) indicates that rank should be treated as a factor (i.e., categorical) variable.
mylogit<- glm(admit~gre+gpa+as.factor(rank), family=binomial(link="logit"), na.action=na.pass)
Since we gave our model a name (mylogit), R will not produce any output from our regression. In order to get the results we use the summary command:
summary(mylogit)
glm(formula = admit ~ gre + gpa + as.factor(rank), family = binomial(link = "logit"),
na.action = na.pass)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6268 -0.8662 -0.6388 1.1490 2.0790
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979 1.139951 -3.500 0.000465 ***
gre 0.002264 0.001094 2.070 0.038465 *
gpa 0.804038 0.331819 2.423 0.015388 *
as.factor(rank)2 -0.675443 0.316490 -2.134 0.032829 *
as.factor(rank)3 -1.340204 0.345306 -3.881 0.000104 ***
as.factor(rank)4 -1.551464 0.417832 -3.713 0.000205 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 499.98 on 399 degrees of freedom
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
Number of Fisher Scoring iterations: 4
We can use the confit(...) function to obtain confidence intervals for the coefficient estimates.
confint(mylogit)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -6.2716202334 -1.792547080
gre 0.0001375921 0.004435874
gpa 0.1602959439 1.464142727
as.factor(rank)2 -1.3008888002 -0.056745722
as.factor(rank)3 -2.0276713127 -0.670372346
as.factor(rank)4 -2.4000265384 -0.753542605
We can test for an overall effect of rank using the wald.test function of the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. Below we first load the aod library. Then we use the wald.test function. b supplies the coefficients, while Sigma supplies the variance covariance matrix of the error terms, finally Terms tells R which terms in the model are to be tested, in this case, terms 4, 5, and 6, are the three terms for the levels of rank.
library(aod) wald.test(b=coef(mylogit), Sigma=vcov(mylogit), Terms=4:6) Wald test: ---------- Chi-squared test: X2 = 20.9, df = 3, P(> X2) = 0.00011
The chi-squared test statistic of 20.9, with three degrees of freedom is associated with a p-value of 0.00011 indicating that the overall effect of rank is statistically significant.
We can also test additional hypotheses about the differences in the coefficients for the different levels of rank. Below we test that the coefficient for rank=2 is equal to the coefficient for rank=3. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To contrast these two terms, we multiply one of them by 1, and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).
l <- cbind(0,0,0,1,-1,0) wald.test(b=coef(mylogit), Sigma=vcov(mylogit), L=l) Wald test: ---------- Chi-squared test: X2 = 5.5, df = 1, P(> X2) = 0.019
The chi-squared test statistic of 5.5 with 1 degree of freedom is associated with a p-value of 0.019, indicating that the difference between the coefficient for rank=2 and the coefficient for rank=3 is statistically significant.
You can also exponentiate the coefficients and interpret them as odds-ratios. R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp( )), and that the object you want to exponentiate is called coefficients and it is part of mylogit (mylogit$coefficients).
exp(mylogit$coefficients)
(Intercept) gre gpa as.factor(rank)2
0.0185001 1.0022670 2.2345448 0.5089310
as.factor(rank)3 as.factor(rank)4
0.2617923 0.2119375
Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23. For more information on interpreting odds ratios see our FAQ page How do I interpret odds ratios in logistic regression? . The confidence intervals around the odds ratios can be obtained using the command exp(confint(mylogit)). Note that while R produces it, the odds ratio for the intercept is not generally interpreted.
You can also use predicted probabilities to help you understand the model. Predicted probabilities can be computed for both categorical and continuous predictor variables. In order to create predicted probabilities we first need to create a new data frame with the values we want the independent variables to take on to create our predictions.
We will start by calculating the predicted probability of admission at each value of rank, holding gre and gpa at their means. The first line of the code shown below creates an object called rank, it is a vector containing the four values that the variable rank takes on. The second and third lines of the code below create objects called gre and gpa that only take on one value each, in this case we have set both variables to their means using the command mean(data.frame$varname), but they could be set at any value (e.g. gre=600). These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for gre must be named gre). On the second to last line of code, we combine the objects we just created (rank, gpa, and gre) into a data frame called newdata1. The final line of code asks R to display the data frame newdata1 to give you an idea of what it looks like at this point.
rank <- c(1,2,3,4)
gre <- c(mean(mydata$gre))
gpa <- c(mean(mydata$gpa))
newdata1 <- data.frame(gre,gpa,rank)
newdata1
gre gpa rank
1 587.7 3.3899 1
2 587.7 3.3899 2
3 587.7 3.3899 3
4 587.7 3.3899 4
Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create the predicted probabilities. The first line of code below is quite compact, we will break it apart to discuss what various components do. The newdata1$rankP tells R that we want to create a new variable in the dataset (data frame) newdata1 called rankP, the rest of the command tells R that the values of rankP should be predictions made using the predict( ) function. The options within the parentheses tell R that the predictions should be based on the analysis mylogit with values of the predictor variables coming from newdata1 and that the type of prediction is a predicted probability (type="response"). The second line of the code lists the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities.
newdata1$rankP <-predict(mylogit,newdata=newdata1,type="response")
newdata1
newdata1$rankP <-predict(mylogit,newdata=newdata1,type="response")
newdata1
gre gpa rank rankP
1 587.7 3.3899 1 0.5166016
2 587.7 3.3899 2 0.3522846
3 587.7 3.3899 3 0.2186120
4 587.7 3.3899 4 0.1846684
In the above output we see that the predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1), and 0.18 for students from the lowest ranked institutions (rank=4), holding gre and gpa at their means.
We can do something very similar to create a table of predicted probabilities varying the value of gre. Above, we created objects called gre, gpa and rank and then bound them into a single data frame. This time, we are going to create newdata2 by creating a data frame and declaring the values within it at the same time. In the first line below, the command gre=seq(200,800,100) tells R that the first column in the data frame should be called gre and that it should take on a sequence of values from 200 to 800, in increments of 100 (e.g. the first three values of gre in the data frame will be 200, 300, and 400). Next we tell R that we would like the second variable to be called gpa and that for every row in the data frame, gpa should be equal to the mean of gpa in the data frame mydata. For rank, we specify that rank should be held constant at 2, as with the use of the mean for continuous predictors, this value is arbitrary, in this case 2 was selected because it is the modal category.
newdata2<-data.frame(gre=seq(200,800,100),gpa=mean(mydata$gpa),rank=2)
The code to generate the predicted probabilities (the first line below) is the same as before, except that the name of the variable and data frame have been changed. The second line of this code is a little different than the code for varying rank, instead of instructing R to output all of the data in the data frame as we did above, we instruct R to only print the two columns of data that we are interested in (gre and greP), using cbind(newdata2$gre, newdata2$greP). We could instruct R to print the entire data frame newdata2 by entering its name, but the advantage of printing only the two columns that we are interested in is that the variables that are constant (rank and gpa) do not clutter up the table, instead, the first column gives the values for gre and the second column the predicted probability based on the value of gre with rank held at 2 and gre held at its mean. Reading the table we can see that the predicted probability of getting accepted is only .18 if one's gre score is 200 and increases to .47 if one's gre score is 800 (while gpa is held constant at its mean and rank is held constant at 2).
newdata2$greP<-predict(mylogit,newdata=newdata2,type="response")
cbind(newdata2$gre,newdata2$greP)
[,1] [,2]
[1,] 200 0.1843830
[2,] 300 0.2208900
[3,] 400 0.2623007
[4,] 500 0.3084017
[5,] 600 0.3586658
[6,] 700 0.4122390
[7,] 800 0.4679753
It can also be helpful to use graphs of predicted probabilities to understand and/or present the model.
We may also wish to see measures of how well our model fits. This can be particularly useful when comparing competing models. The output produced by summary(mylogit) included indices of fit (shown below the coefficients), including the null and deviance residuals and the AIC. One measure of model fit is the significance of the overall model. This test asks whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The test statistic is the difference between the residual deviance for the model with predictors and the null model. The test statistic is distributed chi-squared with degrees of freedom equal to the differences in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model). To find the difference in deviance for the two models (i.e., the test statistic) we can use the command:
mylogit$null.deviance - mylogit$deviance [1] 41.45903
The degrees of freedom for the difference between the two models is equal to the number of predictor variables in the mode, and can be obtained using:
mylogit$df.null - mylogit$df.residual [1] 5
Finally, the p-value can be obtained using:
1-pchisq(mylogit$null.deviance-mylogit$deviance, mylogit$df.null-mylogit$df.residual) [1] 7.578194e-08
The chi-square of 41.46 with 5 degrees of freedom and an associated p-value of less than 0.001 tells us that our model as a whole fits significantly better than an empty model. This is sometimes called a likelihood ratio test (the deviance residual is -2*log likelihood). To see the model's log likelihood, we type:
logLik(mylogit) 'log Lik.' -229.2587 (df=6)
Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.
Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services