UCLA Academic Technology Services HomeServicesClassesContactJobs

R Data Analysis Examples
Probit Regression

Examples

Example 1:  Suppose that we are interested in factors that influence whether or not a political candidate wins an election.  Our outcome variable has only two possible values:  win or not win.  We believe that factors such as the amount of money spent on the campaign, the amount of time spent campaigning negatively and whether the candidate is an incumbent affect whether the candidate wins the election.  Because our outcome variable is binary (either the candidate wins or does not win), we need to use a model that handles this feature correctly. 

Example 2:  Some people have heart attacks and others don't.  We would like to see if exercise, age and gender influences whether or not someone has a heart attack.  Again, we have a binary outcome:  have heart attack or not. 

Example 3:  Many undergraduates wish to continue their education in graduate school.  In their application to any given graduate program, they include their GRE scores and their GPA from their undergraduate institution.  Some students are graduating from very prestigious institutions, while others are graduating from not-so-prestigious institutions.  Many months after sending in their applications, students receive either a thick or a thin envelope from the graduate program to which they applied:  some were admitted and others were not.

Description of the Data

For our data analysis below, we are going to expand on Example 3 about getting into graduate school.  We have generated hypothetical data, which can be obtained from our website in R. Note that R requires forward slashes (/) not back slashes (\) when specifying a file location even if the file is on your hard drive.
mydata <- read.csv(url("http://www.ats.ucla.edu/stat/r/dae/logit.csv"))

In order to make it easier to refer to the variables, we can attach the dataset

attach(mydata)
names(mydata)
[1] "admit" "gre" "topnotch" "gpa" 

This hypothetical data set has a 0/1 variable called admit that we will use as our response (i.e., outcome, dependent) variable.  We also have three variables that we will use as predictors:  gre, which is the student's Graduate Record Exam score; gpa, which is the student's grade point average; and topnotch, which is a 0/1 variable where 1 indicates that the undergraduate institution was "top notch" and 0 indicates that it is not. 

summary(gre)
Min. 1st Qu. Median Mean 3rd Qu. Max. 
220.0 520.0 580.0 587.7 660.0 800.0 

sd(gre)
[1] 115.5165

summary(gpa)
Min. 1st Qu. Median Mean 3rd Qu. Max. 
2.260 3.130 3.395 3.390 3.670 4.000 

sd(gpa)
[1] 0.3805668

table(topnotch)
topnotch
0 1 
335 65 

Some Strategies You Might Try

Using the Probit Model

Before we run our probit model, we will see if any cells (created by the crosstab of our categorical and response variables) are empty or particularly small.  If any are, we may have difficulty running our model. 

xtabs(~ topnotch + admit)
        admit
topnotch   0   1
       0 238  97
       1  35  30
 

None of the cells is too small or empty (has no cases), so we will run our probit model using the function glm (generalized linear model).

myprobit<- glm(admit ~ gre + gpa + topnotch, family=binomial(link="probit"), data=mydata, na.action=na.pass)
summary(myprobit)

Call:
glm(formula = admit ~ gre + gpa + topnotch, family = binomial(link = "probit"), 
    data = mydata, na.action = na.pass)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3862  -0.8856  -0.7130   1.2715   1.9800  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.7978830  0.6476352  -4.320 1.56e-05 ***
gre          0.0015244  0.0006404   2.380   0.0173 *  
gpa          0.4009847  0.1948032   2.058   0.0396 *  
topnotch     0.2730331  0.1803284   1.514   0.1300    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 477.89  on 396  degrees of freedom
AIC: 485.89

Number of Fisher Scoring iterations: 4
	

Since we gave our model a name (myprobit), R will not produce any output from our regression. In order to get the results we use the summary command. In the output above, the first thing we see is the call, this is R reminding us what the model we ran was, what options we specified, etc.. Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to asses model fit.

The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), associated p-values. (The command to obtain confidence intervals for the coefficient estimates is confint(myprobit).) The table of coefficients shows that both gre and gpa are statistically significant while topnotch is not. A discussion of the interpretation of the coefficients can be found in the sample write up section below.

Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC. We can use the difference in deviance residuals for the current model versus the null model to test for overall model fit. We can test whether the difference between the current model and the null model (a model with just an intercept) is statistically significant using a chi-square test using the difference between the two residuals, the degrees of freedom for this test is equal to the differences in degrees of freedom between the current model and the null model. To find the difference in deviance for the two models we can use the command:

myprobit$null.deviance-myprobit$deviance
[1] 22.08974
	

The degrees of freedom for the difference between the two models is equal to the number of predictor variables in the mode, and can be obtained using:

myprobit$df.null-myprobit$df.residual
[1] 3

Finally, the test statistic can be obtained using:

dchisq(myprobit$null.deviance-myprobit$deviance,myprobit$df.null-myprobit$df.residual)
[1] 2.994192e-05

The chi-square of 22.09 with a p-value of less than 0.00003 tells us that our model as a whole fits significantly better than an empty model. This is sometimes called a log-likelihood test and exactly the same values will be obtained if we use the difference in log-likelihoods to perform this test. To see the model's log likelihood, we type:

logLik(myprobit)
'log Lik.' -238.9434 (df=4)

The log likelihood (-238.9434) can be used in comparisons other nested models, but we won't show an example of that here.

It is sometimes difficult to interpret the raw coefficients (z-scores) from a probit model. One way to make the results more understandable, is toobtain predicted probabilities, which are sometimes easier to understand than the coefficients.  This can be done with either a categorical variable or a continuous variable and shows the predicted probability for each of the values of the variable specified.  Although topnotch is not statistically significant, we will use it as an example with a categorical predictor.

In order to create predicted probabilities we first need to create a new data frame with the values we want the independent variables to take on to create our predictions. The first line of the code shown below creates an object called topnotch, it is a vector containing the two values that the variable topnotch takes on.  The second and third lines of the code below create objects called gre and gpa that only take on one value each, in this case we have set both variables to their means using the command mean(data.frame$varname), but they could be set at any value (e.g. gre=600). These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for gre must be named gre). On the last line of code, we combine the objects we just created (topnotch, gpa, and gre) into a data frame called newdata1.

topnotch <- c(0,1)
gre <- c(mean(mydata$gre))
gpa <- c(mean(mydata$gpa))
newdata1 <- data.frame(gre,gpa,topnotch)

Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create the predicted probabilities. While relatively compact, the first line of code below is important and contains a lot of instructions. The newdata1$topnotchP tells R that we want to create a new variable in the dataset (data frame)  newdata1 called topnotchP, and that the values of topnotchP should be predictions made using the predict( ) command. The commands within the parentheses tell R that the predictions should be based on the analysis myprobit with values of the predictor variables coming from newdata1 and that the type of prediction is a predicted probability (type="response"). The second line of the code tells R to list the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities. As you can see, the predicted probability of being accepted into the graduate program is 0.29 if the undergraduate institution was not "top notch" (topnotch = 0)  and .39 if it was  (topnotch = 1), while gre and gpa are held constant at their mean values.

newdata1$topnotchP <-predict(myprobit,newdata=newdata1,type="response")
newdata1
    gre topnotch    gpa topnotchP
1 587.7        0 3.3899 0.2936703
2 587.7        1 3.3899 0.3937107

We can do something very similar to create a table of predicted probabilities varying the value of gre. Above, we created objects called gre, gpa and topnotch and then bound them into a single data frame. This time, we are going to create newdata2 by creating a data frame, and declaring the values within it at the same time. In the first line of code, the command gre=seq(220,800,20) tells R that the first variable in the data frame should be called gre and that it should take on a sequence of values (seq(220,800,20)) from 220 to 800, in increments of 20 (e.g. the first three values in the data frame will be 220, 240, and 260). Next we tell R that we would like the next variable to be called gpa and that for every row in the data frame, gpa should be equal to the mean of gpa in the data frame mydata, and we do the same for topnotch.

The second line of this code is the same as the line using predict above, except that the name of the variable and data frame have been changed. The third line of this code is a little different than the code for varying topnotch, instead of instructing R to output all of the data in the data frame as we did above, we instruct R to only print the two columns of data that we are interested in, using cbind(newdata2$gre, newdata2$greP). We could instruct R to print the entire data frame newdata2 by entering its name, but the advantage of printing only the two columns we are interested in is that the variables that are constant (topnotch and gpa) do not clutter up the table, instead, the first column gives the values for gre and the second column the predicted probability based on the value of gre with topnotch and gre held at their means. Reading the table we can see that the predicted probability of getting accepted is only .14 if one's gre score is 220 and increases to .43 if one's gre score is 800 (while gpa and topnotch are held constant at their means).

newdata2<-data.frame(gre=seq(220,800,20),gpa=mean(mydata$gpa),topnotch=mean(mydata$topnotch))
newdata2$predicted <-predict(myprobit,newdata=newdata2,type="response")
cbind(newdata2$gre,newdata2$predicted)

      [,1]      [,2]
 [1,]  220 0.1448344
 [2,]  240 0.1518901
 [3,]  260 0.1591706
 [4,]  280 0.1666760
 [5,]  300 0.1744059
 [6,]  320 0.1823599
 [7,]  340 0.1905367
 [8,]  360 0.1989348
 [9,]  380 0.2075521
[10,]  400 0.2163863
[11,]  420 0.2254342
[12,]  440 0.2346925
[13,]  460 0.2441573
[14,]  480 0.2538241
[15,]  500 0.2636882
[16,]  520 0.2737442
[17,]  540 0.2839863
[18,]  560 0.2944083
[19,]  580 0.3050034
[20,]  600 0.3157646
[21,]  620 0.3266842
[22,]  640 0.3377544
[23,]  660 0.3489667
[24,]  680 0.3603125
[25,]  700 0.3717827
[26,]  720 0.3833678
[27,]  740 0.3950582
[28,]  760 0.4068438
[29,]  780 0.4187145
[30,]  800 0.4306595

To compute the predicted probabilities at different levels of gpa holding gre and topnotch constant, we can use a similar command (shown below). Note that here we have used the command gpa=0:4 to tell R we want values of gpa from zero to four in one unit intervals, if we wanted a different interval (e.g. 0.5), we would use the seq() command seq(from,to,units) (e.g. seq(0,4,0.5)) similar to the command used above for gre.

newdata3<-data.frame(gre=mean(mydata$gre),gpa=0:4,topnotch=0)
newdata3$gpaP<-predict(myprobit,newdata=newdata3,type="response")

It is also possible to get the predicted probability for a given profile, for example, an applicant with a gre score of 700, a gpa of 3.5, whose undergraduate institution is not considered "topnotch." The process is basically the same as above:

newdata3<-data.frame(gre=700,gpa=3.5,topnotch=0)
newdata3$predicted <-predict(myprobit,newdata=newdata3,type="response")
newdata3

  gre gpa topnotch predicted
1 700 3.5        0 0.3716997
	

Sample Write-up of the Analysis

Before we begin the sample write-up we need to get the output into a form more acceptable for publication, such as in LaTeX format. The LaTeX code can be generated easily using the function xtable from a package called xtable.

library(xtable)
l <- xtable(myprobit)
l

% latex table generated in R 2.5.0 by xtable 1.4-6 package
% Wed Jun 06 15:47:22 2007
\begin{table}[ht]
\begin{center}
\begin{tabular}{rrrrr}
  \hline
 & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\
  \hline
(Intercept) & $-$2.7979 & 0.6476 & $-$4.32 & 0.0000 \\
  gre & 0.0015 & 0.0006 & 2.38 & 0.0173 \\
  gpa & 0.4010 & 0.1948 & 2.06 & 0.0396 \\
  topnotch & 0.2730 & 0.1803 & 1.51 & 0.1300 \\
   \hline
\end{tabular}
\end{center}
\end{table}

The LaTeX code above generates a table as shown below:

Below is one way of describing the results.  Please note that the coefficients can be discussed in terms of either Z-scores or probit index.  These are equivalent terms.

The Z-score of a person with a zero GRE score and zero GPA at a non-topnotch school is about -2.8.  For each point of increase in GRE score, the Z-score is increased by .0015244; for each point of increase in GPA, the probit index increases by .4.

Describing the results in terms of Z-scores may not be the simplest metric for your audience to understand.  You can use the commands described above to obtain the predicted probabilities.  These are often useful for helping to tell the "story" of your results.

Similarities and differences between logit and probit models

Neither the logit model nor the probit model are linear, which makes things difficult.  To make the model linear, a transformation is done on the dependent variable.  In logit regression, the transformation is the logit function which is the natural log of the odds.  In probit models, the function used is the inverse of the standard normal cumulative distribution (a.k.a. a z-score).  In reality, this difference isn't too important:  both transformations are equally good at linearizing the model; which one you use is a matter of personal preference.  Both models need to have diagnostics done afterwards to check that the assumptions of the model have not been violated.  Both methods use maximum likelihood, and so require more cases than a similar OLS model.  Unlike logit models, you don't get odds ratios with probit models.  In general, the logit coefficients are larger than the probit coefficients by a factor of 1.7.  However, this rule often does not apply when an independent variable has a high standard error (lots of variability).

Cautions, Flies in the Ointment

See Also