UCLA Academic Technology Services

Comparing OLS and Logistic for Analyzing Binary Outcomes

Please Note: Stata's graph commands changed with version 8. This page was developed before version 8 was released and uses Stata 7 graph commands. Please see "How do I use version 7 graph commands in Stata version 8?" for information on how to either run these Stata 7 graph commands in Stata version 8 or convert them to Stata 8 syntax.

* Is it important to use logistic regression when analyzing a 0/1 
* outcome variable?  Some have pointed out that there may be 
* little practical difference between logistic regression and
* ordinary least squares (OLS) regression and, if so, OLS would
* be simpler to apply and interpret.

* This page will compare OLS and logistic regression using
* a couple of examples to explore this issue.  We start by
* using the "hsb2" data file and we will predict "female"
* (0=male, 1=female) from a set of standardized scores:
* "read", "write", "science", and "socst".  We know that 
* predicting "female" is kind of a nutty thing to do, but
* this is just for illustration purposes.

* We first use the "hsb2" data file.
* use http://www.ats.ucla.edu/stat/stata/notes/hsb2, clear
use hsb2, clear

***** Check #1. Compare results of OLS and Logistic Models

* We run the model using OLS regression.
regress female read write science socst

* We run the model now using logistic regression.
logit   female read write science socst

* You can compare the two models and see that the "p values"
* associated with the two methods are quite similar, except
* for the constant, which is a boring difference.

* To help compare the two models, we will re-run the analyses
* using the "outreg" command (see "findit outreg"), which lets
* us see the results side by side.
quietly regress female read write science socst
outreg using olslog, nolabel replace ctitle(OLS) pvalue
quietly logit   female read write science socst
outreg using olslog, append nolabel ctitle(Logit) pvalue

* We have to add some tabs to this table to make it line up.
type olslog.out

***** Check #2. Show graph of predicted probability of being female
***** by each predictor comparing OLS and Logistic Models

* Let's look at the relationship between one of the predictors, "read",
* and the predicted probability of being female, while holding
* all other predictors constant.  We will do this for both
* OLS and logistic regression.
quietly regress female read write science socst
postgr read, generate(yhatols1)
quietly logit female read write science socst
postgr read, generate(yhatlog1)
graph yhatols1 yhatlog1 read, c(ll) ylabel(0 .2 to 1) sort 

* We can compute the "observed probability" of being a female
* by each level of "read" using the "egen" command and
* include that in the graph as well.
egen probfem_byread = mean(female) , by(read)
graph yhatols1 yhatlog1 probfem_byread read, c(ll.)  ylabel(0 .2 to 1) sort

* This does not work very well because we have small numbers
* in each level of "read", but this will work better in our
* second example.  For now, it suffices that the predicted
* probabilities by read are nearly identical for OLS and logistic.
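
* As a rough sketch (not part of the original analysis), we can check
* how many students there are at each level of "read" to see just how
* small these numbers are.  The variable name "n_byread" is our own
* choice for illustration.
egen n_byread = count(female), by(read)
summarize n_byread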

* We can do the same for the other predictors in our model, just
* to assure us that this is not a fluke.  We will create all of the
* graphs and show them as one graph.
quietly regress female read write science socst
postgr write  , generate(yhatols2)
postgr science, generate(yhatols3)
postgr socst  , generate(yhatols4)
quietly logit female read write science socst
postgr write  , generate(yhatlog2)
postgr science, generate(yhatlog3)
postgr socst  , generate(yhatlog4)

graph yhatols1 yhatlog1 read   , c(ll) sort ylab(0 .2 to 1) saving(compare1a, replace)
graph yhatols2 yhatlog2 write  , c(ll) sort ylab(0 .2 to 1) saving(compare1b, replace)
graph yhatols3 yhatlog3 science, c(ll) sort ylab(0 .2 to 1) saving(compare1c, replace)
graph yhatols4 yhatlog4 socst  , c(ll) sort ylab(0 .2 to 1) saving(compare1d, replace)

graph using compare1a compare1b compare1c compare1d 

***** Check #3. Look at correlation between predicted probability 
***** of being female for OLS and Logistic Models.

* Let's now look at the predicted probabilities for the two models
* and compare them.
quietly regress female read write science socst
predict yhatols
quietly logit   female read write science socst
predict yhatlog


* The correlation between the predicted values is very high!
corr yhatols yhatlog
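
* A high correlation only tells us that the two sets of predictions are
* nearly linearly related, not that they are equal.  As a supplementary
* sketch (not part of the original analysis), we can compute the absolute
* gap between the two predictions for each student and summarize it.
* The variable name "absdiff" is our own choice for illustration.
generate absdiff = abs(yhatols - yhatlog)
summarize absdiff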

***** Check #4. Graph predicted probability of being female
***** for OLS model by predicted probability for Logistic model.

* Let's graph the predicted values against each other.  Even
* though the correlation between the two predicted values is
* nearly 1, we can see some non-linearity between them.
graph yhatols yhatlog, yline(0 1) ylab(0 .2 to 1) 
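
* The reference lines at 0 and 1 matter because OLS does not constrain
* its predictions to that range, while logistic predictions always lie
* between 0 and 1.  As a sketch (not part of the original page), we can
* count any OLS predictions that fall outside the range.  The
* "yhatols < ." condition keeps missing values, which Stata treats as
* larger than any number, from being counted as greater than 1.
count if yhatols < 0 | (yhatols > 1 & yhatols < .)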

***** Check #5. Graph observed probability of being female
***** by predicted probability of being female for OLS model 
***** and Logistic model.

* Now let's compare the observed probability with the predicted
* probability using OLS.  We could simply graph "female" by
* the predicted probability as shown below.
graph female yhatols, ylab(0 .2 to 1) 

* From the graph above, it is hard to see how well the 
* predicted data fit the observed data.  Instead, we can 
* group the students up into groups of 20 and then for each
* group of 20 compute the observed probability of being 
* female by dividing the number of females by 20.  
* For the OLS case, we will sort the data on the predicted
* probability, create the groups of 20, and then compute
* the observed probability.  Using (_n-1) puts observations
* 1-20 in group 0, 21-40 in group 1, and so on.
sort yhatols
generate olsgroup = int( (_n-1) / 20)
egen obsprobols = mean(female), by(olsgroup)
graph obsprobols yhatols, ylab(0 .2 to 1)  saving(obsprob1, replace) title(obs vs fitted, OLS)
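
* As a quick check (a sketch, not part of the original page), we can
* tabulate the grouping variable to see how many students landed in
* each group before trusting the observed probabilities.
tabulate olsgroup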


* We can now do the same for the logistic analysis.
sort yhatlog
generate loggroup = int( (_n-1) / 20)
egen obsproblog = mean(female), by(loggroup)
graph obsproblog yhatlog, ylab(0 .2 to 1) saving(obsprob2, replace) title(obs vs fitted, Logistic)

graph using obsprob1 obsprob2


* This example would appear to suggest that the results from
* OLS and Logistic regression are quite similar.  
* When you look at the relationship between the predictors and
* predicted probability, the two are much the same.  The predicted
* probabilities are very highly correlated, and the relationship between
* the observed probability and predicted probability looks much the same
* for the two techniques.  But, let's consider another example.


* Example 2.

* This example uses the api data file
* use http://www.ats.ucla.edu/stat/stata/webbooks/logistic/apilog, clear
use apilog, clear

***** Check #1. Compare results of OLS and Logistic Models

regress hiqual full avg_ed
outreg using api1, nolabel ctitle(OLS) pvalue replace

logit hiqual  full avg_ed 
outreg using api1, nolabel ctitle(Logit) pvalue append

type api1.out

***** Check #2. Show graph of predicted probability of being hiqual
***** by each predictor comparing OLS and Logistic Models

* get observed probabilities by ivs
egen ofull   = mean(hiqual), by(full)
egen oavg_ed = mean(hiqual), by(avg_ed)

* run logistic and get predicted values
logistic hiqual  full avg_ed 
predict yhatlog
* generate predicted values by each iv
postgr full  , gen(fullhatlog)
postgr avg_ed, gen(avg_edhatlog)

* run OLS and get predicted values
regress hiqual full avg_ed 
predict yhatols
* generate predicted values by each iv
postgr full  , gen(fullhatols)
postgr avg_ed, gen(avg_edhatols)

* make graph of each iv by observed and predicted probability
graph  ofull   fullhatlog   fullhatols   full  , c(.ll) s(o..) sort saving(grfull, replace) 
graph  oavg_ed avg_edhatlog avg_edhatols avg_ed, c(.ll) s(o..) sort saving(gravg_ed, replace) 

graph using grfull gravg_ed

***** Check #3. Look at correlation between predicted probability 
***** of being hiqual for OLS and Logistic Models.

* #3, compare correlations
corr yhatlog yhatols

***** Check #4. Graph predicted probability of being hiqual
***** for OLS model by predicted probability for Logistic model.

graph yhatols yhatlog, yline(0 1) ylab(0 .2 to 1) 

***** Check #5. Graph observed probability of being hiqual
***** by predicted probability of being hiqual for OLS model 
***** and Logistic model.

* break data up into 40 bins (30 observations each) and get observed prob
sort yhatlog
generate n = int((_n-1) / 30)
egen mhiquall = mean(hiqual), by(n)
graph mhiquall yhatlog, saving(g1, replace)

sort yhatols
generate n2 = int((_n-1) / 30)
egen mhiqualo = mean(hiqual), by(n2)

graph mhiqualo yhatols, saving(g2, replace)
graph using g1 g2

***** Check #6. Compare residuals between OLS and logistic model.

gen resols = yhatols - mhiqualo
gen reslog = yhatlog - mhiquall
mdensity  reslog resols, xlab c(ll) s(..) xline(0)
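
* To supplement the density plot with numbers, here is a sketch (not
* part of the original page) that summarizes both sets of residuals.
* Note that these residuals are the predicted probability minus the
* observed probability within each bin, not the usual observation-level
* residuals.  The "detail" option adds percentiles, so the spread and
* the tails of the two distributions can be compared directly.
summarize resols reslog, detail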

* Example 3.

* This example uses the api data file again, now with four predictors
* use http://www.ats.ucla.edu/stat/stata/webbooks/logistic/apilog, clear
use apilog, clear

***** Check #1. Compare results of OLS and Logistic Models

regress hiqual full avg_ed yr_rnd meals
outreg using api2, nolabel ctitle(OLS) pvalue replace

logit hiqual full avg_ed yr_rnd meals
outreg using api2, nolabel ctitle(Logit) pvalue append

type api2.out

***** Check #2. Show graph of predicted probability and 
***** observed probability of being hiqual
***** by each predictor comparing OLS and Logistic Models

* get observed probabilities by ivs
egen ofull   = mean(hiqual), by(full)
egen oavg_ed = mean(hiqual), by(avg_ed)
egen oyr_rnd = mean(hiqual), by(yr_rnd)
egen omeals  = mean(hiqual), by(meals)

* run logistic and get predicted values
logistic hiqual full avg_ed yr_rnd meals
predict yhatlog
* generate predicted values by each iv
postgr full  , gen(fullhatlog)
postgr avg_ed, gen(avg_edhatlog)
postgr yr_rnd, gen(yr_rndhatlog)
postgr meals , gen(mealshatlog)

* run OLS and get predicted values
regress hiqual full avg_ed yr_rnd meals
predict yhatols
* generate predicted values by each iv
postgr full  , gen(fullhatols)
postgr avg_ed, gen(avg_edhatols)
postgr yr_rnd, gen(yr_rndhatols)
postgr meals , gen(mealshatols)

* make graph of each iv by observed and predicted probability
graph  ofull   fullhatlog   fullhatols   full  , c(.ll) s(o..) sort saving(grfull, replace) 
graph  oavg_ed avg_edhatlog avg_edhatols avg_ed, c(.ll) s(o..) sort saving(gravg_ed, replace) 
graph  omeals  mealshatlog  mealshatols  meals , c(.ll) s(o..) sort saving(grmeals, replace) 
graph  oyr_rnd yr_rndhatlog yr_rndhatols yr_rnd, c(.ll) s(o..) sort saving(gryr_rnd, replace) 

graph using grfull gravg_ed grmeals gryr_rnd

***** Check #3. Look at correlation between predicted probability 
***** of being hiqual for OLS and Logistic Models.

* #3, compare correlations
corr yhatlog yhatols

***** Check #4. Graph predicted probability of being hiqual
***** for OLS model by predicted probability for Logistic model.

graph yhatols yhatlog, yline(0 1) ylab(0 .2 to 1) 

***** Check #5. Graph observed probability of being hiqual
***** by predicted probability of being hiqual for OLS model 
***** and Logistic model.

* break data up into 40 bins (30 observations each) and get observed prob
sort yhatlog
generate n = int((_n-1) / 30)
egen mhiquall = mean(hiqual), by(n)
graph mhiquall yhatlog, saving(g1, replace)

sort yhatols
generate n2 = int((_n-1) / 30)
egen mhiqualo = mean(hiqual), by(n2)

graph mhiqualo yhatols, saving(g2, replace)
graph using g1 g2

***** Check #6. Compare residuals between OLS and logistic model.

gen resols = yhatols - mhiqualo
gen reslog = yhatlog - mhiquall
mdensity  reslog resols, xlab c(ll) s(..) xline(0)

These pages are Copyrighted (c) by UCLA Academic Technology Services

