
Probit Regression

Probit regression, also called a probit model, is used to model dichotomous or binary outcome variables. In the probit model, the inverse standard normal distribution of the probability is modeled as a linear combination of the predictors.
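As a minimal sketch of the link function described above (it is not part of the SPSS workflow on this page), the probit of a probability is the inverse standard normal CDF, and the standard normal CDF maps a linear predictor back to a probability. The sketch uses `statistics.NormalDist` from the Python standard library; the function names `probit` and `inverse_probit` are illustrative, not from the original page.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def probit(p):
    """Return the z-score whose standard normal CDF equals p (requires 0 < p < 1)."""
    return STD_NORMAL.inv_cdf(p)


def inverse_probit(z):
    """Map a z-score (the linear predictor) back to a probability."""
    return STD_NORMAL.cdf(z)


# The two functions undo each other.
z = probit(0.975)
print(round(z, 2), round(inverse_probit(z), 3))
```

In a probit model, the linear combination of predictors plays the role of `z`, so `inverse_probit` is what turns fitted coefficients into predicted probabilities.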

**Please note:** The purpose of this page is to show how to use various data analysis commands.
It does not cover all aspects of the research process which researchers are expected to do. In
particular, it does not cover data cleaning and checking, verification of assumptions, model
diagnostics and potential follow-up analyses.

Example 1: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome variable is binary (0/1); win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively, and whether the candidate is an incumbent.

Example 2: A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average), and prestige of the undergraduate institution, affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

get file = "c:\data\probit.sav".

This data set has a binary response (outcome, dependent) variable called **admit**.
There are three predictor variables: **gre**, **gpa** and **rank**. We will treat the
variables **gre** and **gpa** as continuous. The variable **rank** is
ordinal; it takes on the values 1 through 4. Institutions with a **rank** of 1 have the highest prestige,
while those with a **rank** of 4 have the lowest. We will treat **rank** as
categorical. Let's start by looking at descriptive statistics.

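descriptives /variables=gre gpa.

Descriptive Statistics

| | N | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|---|
| gre | 400 | 220 | 800 | 587.70 | 115.517 |
| gpa | 400 | 2.26 | 4.00 | 3.3899 | .38057 |
| Valid N (listwise) | 400 | | | | |

frequencies /variables = rank admit.

Statistics

| | | rank | admit |
|---|---|---|---|
| N | Valid | 400 | 400 |
| | Missing | 0 | 0 |

Frequency Table

rank

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 1 | 61 | 15.3 | 15.3 | 15.3 |
| | 2 | 151 | 37.8 | 37.8 | 53.0 |
| | 3 | 121 | 30.3 | 30.3 | 83.3 |
| | 4 | 67 | 16.8 | 16.8 | 100.0 |
| | Total | 400 | 100.0 | 100.0 | |

admit

| | | Frequency | Percent | Valid Percent | Cumulative Percent |
|---|---|---|---|---|---|
| Valid | 0 | 273 | 68.3 | 68.3 | 68.3 |
| | 1 | 127 | 31.8 | 31.8 | 100.0 |
| | Total | 400 | 100.0 | 100.0 | |

crosstabs /tables = admit by rank.

Case Processing Summary

| | Valid N | Valid Percent | Missing N | Missing Percent | Total N | Total Percent |
|---|---|---|---|---|---|---|
| admit * rank | 400 | 100.0% | 0 | .0% | 400 | 100.0% |

admit * rank Crosstabulation (Count)

| | rank 1 | rank 2 | rank 3 | rank 4 | Total |
|---|---|---|---|---|---|
| admit 0 | 28 | 97 | 93 | 55 | 273 |
| admit 1 | 33 | 54 | 28 | 12 | 127 |
| Total | 61 | 151 | 121 | 67 | 400 |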
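To read the crosstabulation more easily, it can help to turn the counts into admission proportions within each level of **rank** and to note the smallest cell. The following is a rough Python sketch using the counts from the output above; the function name `admit_rate_by_rank` is hypothetical and is not an SPSS feature.

```python
# Counts copied from the admit * rank crosstabulation above.
not_admitted = {1: 28, 2: 97, 3: 93, 4: 55}
admitted = {1: 33, 2: 54, 3: 28, 4: 12}


def admit_rate_by_rank(admitted, not_admitted):
    """Return the proportion admitted within each rank."""
    rates = {}
    for rank in sorted(admitted):
        total = admitted[rank] + not_admitted[rank]
        rates[rank] = admitted[rank] / total
    return rates


rates = admit_rate_by_rank(admitted, not_admitted)
smallest_cell = min(min(admitted.values()), min(not_admitted.values()))

for rank, rate in rates.items():
    print(f"rank {rank}: proportion admitted = {rate:.3f}")
print("smallest cell count:", smallest_cell)
```

The proportion admitted falls steadily from rank 1 to rank 4, and no cell is empty.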

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.

- Probit regression, the focus of this page.
- Logistic regression. A logit model will produce results similar to those of probit regression. The choice of probit versus logit depends largely on individual preferences.
- OLS regression. When used with a binary response variable, this model is known as a linear probability model and can be used as a way to describe conditional probabilities. However, the errors (i.e., residuals) from the linear probability model violate the homoskedasticity and normality of errors assumptions of OLS regression, resulting in invalid standard errors and hypothesis tests. For a more thorough discussion of these and other problems with the linear probability model, see Long (1997, p. 38-40).
- Two-group discriminant function analysis. A multivariate method for dichotomous outcome variables.
- Hotelling's T². The 0/1 outcome is turned into the grouping variable, and the former predictors are turned into outcome variables. This will produce an overall test of significance but will not give individual coefficients for each variable, and the extent to which each "predictor" is adjusted for the impact of the other "predictors" is unclear.

Below we use the **plum** command with the subcommand **/link=probit** to run a probit regression model.
After the command name (**plum**), the outcome variable (**admit**) is followed by
**by rank**, which indicates that **rank** is a categorical predictor, and then by
**with gre gpa**, which indicates that the predictors **gre** and **gpa** should be treated as continuous.

plum admit BY rank WITH gre gpa /link=probit /print= parameter summary.

The output from the **plum** command is broken into several sections, each of which is discussed below.

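Case Processing Summary

| | | N | Marginal Percentage |
|---|---|---|---|
| admit | 0 | 273 | 68.3% |
| | 1 | 127 | 31.8% |
| rank | 1 | 61 | 15.3% |
| | 2 | 151 | 37.8% |
| | 3 | 121 | 30.3% |
| | 4 | 67 | 16.8% |
| Valid | | 400 | 100.0% |
| Missing | | 0 | |
| Total | | 400 | |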

- The **plum** output is labeled as an ordinal regression; however, we can confirm below (see the note in the next set of tables) that the probit link function was used. Note that a model with a binary outcome can be viewed as a special case of an ordinal model in which there are only two categories.
- The table above includes frequencies for the two categorical variables, **admit** (the outcome) and **rank** (one of the predictors).
- We can see that all 400 observations have been used. Fewer observations would have been used if any of our variables had missing values.

Model Fitting Information

| Model | -2 Log Likelihood | Chi-Square | df | Sig. |
|---|---|---|---|---|
| Intercept Only | 493.620 | | | |
| Final | 452.057 | 41.563 | 5 | .000 |

Link function: Probit.

Pseudo R-Square

| Measure | Value |
|---|---|
| Cox and Snell | .099 |
| Nagelkerke | .138 |
| McFadden | .083 |

Link function: Probit.

- The table labeled Model Fitting Information includes two rows, one for the model we requested (labeled Final) and one for a so-called null model (Intercept Only). The -2 log likelihoods can be used to compare the fit of the two models. The final -2 log likelihood for our model is 452.057. The intercept-only model has a -2 log likelihood of 493.620.
- The chi-square test statistic of 41.563 is the difference between the two -2 log likelihoods (493.620 - 452.057). This test statistic, with 5 degrees of freedom and an associated p-value of less than 0.001, tells us that the current model fits better than a model with just an intercept.
- Pseudo-R-squared values are another way to assess model fit. Three different pseudo-R-squared are given in the output, but many different measures of pseudo-R-squareds exist. They all attempt to provide information similar to that provided by R-squared in OLS regression; however, none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squareds see Long and Freese (2006) or our FAQ page What are pseudo R-squareds?
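The arithmetic behind the model fit tables can be sketched in a few lines of Python. The likelihood ratio chi-square comes directly from the two -2 log likelihoods in the output. For the pseudo-R-squared values, this sketch makes an assumption that is not stated on this page: that the measures are based on the full binomial log likelihood of the intercept-only model, computed here from the outcome frequencies (273 and 127). That assumption is used only because it reproduces the printed values; all variable names are illustrative.

```python
import math

# Values copied from the output above.
neg2ll_null = 493.620    # -2 log likelihood, intercept only
neg2ll_final = 452.057   # -2 log likelihood, final model
n_admit0, n_admit1 = 273, 127
n = n_admit0 + n_admit1

# Likelihood ratio chi-square: difference of the two -2 log likelihoods.
lr_chi_square = neg2ll_null - neg2ll_final

# Assumption: full log likelihood of the intercept-only model, from the
# observed proportion admitted.
p1 = n_admit1 / n
loglik_null = n_admit1 * math.log(p1) + n_admit0 * math.log(1 - p1)
loglik_final = loglik_null + lr_chi_square / 2

cox_snell = 1 - math.exp(-lr_chi_square / n)
nagelkerke = cox_snell / (1 - math.exp(2 * loglik_null / n))
mcfadden = 1 - loglik_final / loglik_null

print(f"LR chi-square: {lr_chi_square:.3f}")
print(f"Cox and Snell: {cox_snell:.3f}")
print(f"Nagelkerke:    {nagelkerke:.3f}")
print(f"McFadden:      {mcfadden:.3f}")
```

Note that applying the McFadden formula directly to the two printed -2 log likelihoods gives a slightly different number, which is why the sketch recomputes the intercept-only log likelihood from the frequencies.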

Parameter Estimates

| | | Estimate | Std. Error | Wald | df | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| Threshold | [admit = 0] | 3.323 | .663 | 25.090 | 1 | .000 | 2.023 | 4.623 |
| Location | gre | .001 | .001 | 4.478 | 1 | .034 | .000 | .003 |
| | gpa | .478 | .197 | 5.869 | 1 | .015 | .091 | .864 |
| | [rank=1] | .936 | .245 | 14.560 | 1 | .000 | .455 | 1.417 |
| | [rank=2] | .520 | .211 | 6.091 | 1 | .014 | .107 | .934 |
| | [rank=3] | .124 | .224 | .305 | 1 | .581 | -.315 | .563 |
| | [rank=4] | 0 (a) | . | . | 0 | . | . | . |

Link function: Probit.

a. This parameter is set to zero because it is redundant.

- In the table labeled Parameter Estimates, we see the coefficients, their standard errors, the Wald test statistics with associated df and p-values, and the 95% confidence intervals of the coefficients. The variables **gre**, **gpa**, and the terms for **rank**=1 and **rank**=2 are statistically significant. The probit regression coefficients give the change in the z-score (also called the probit index) for a one unit change in the predictor.
  - For a one unit increase in **gre**, the z-score increases by 0.001.
  - For each one unit increase in **gpa**, the z-score increases by 0.478.
  - The terms for **rank** have a slightly different interpretation. For example, having attended an undergraduate institution with a **rank** of 1, versus an institution with a **rank** of 4 (the reference group), increases the z-score by 0.936.
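To make the z-score interpretation concrete, the sketch below turns the displayed estimates into a predicted probability of admission. It rests on two assumptions that should be checked against the SPSS documentation: that **plum** parameterizes the model as P(admit = 0) = Φ(threshold - location terms), so that P(admit = 1) = Φ(location terms - threshold), and that the rounded estimates in the table are adequate for illustration. The **gre** coefficient in particular is displayed to only one significant digit, so the resulting probabilities are rough. The function names are hypothetical.

```python
from statistics import NormalDist

# Estimates copied from the Parameter Estimates table (as displayed, so rounded).
THRESHOLD = 3.323  # [admit = 0]
COEF = {"gre": 0.001, "gpa": 0.478}
RANK_COEF = {1: 0.936, 2: 0.520, 3: 0.124, 4: 0.0}  # rank 4 is the reference


def z_score(gre, gpa, rank):
    """Probit index: location terms minus the threshold (assumed parameterization)."""
    return COEF["gre"] * gre + COEF["gpa"] * gpa + RANK_COEF[rank] - THRESHOLD


def prob_admit(gre, gpa, rank):
    """Approximate predicted probability that admit = 1."""
    return NormalDist().cdf(z_score(gre, gpa, rank))


# Compare rank 1 with rank 4 at the sample means of gre and gpa.
mean_gre, mean_gpa = 587.70, 3.3899
p_rank1 = prob_admit(mean_gre, mean_gpa, 1)
p_rank4 = prob_admit(mean_gre, mean_gpa, 4)
print(f"rank 1: {p_rank1:.2f}  rank 4: {p_rank4:.2f}")
```

The difference in z-scores between the two calls is exactly the **rank**=1 coefficient (0.936), while the difference in probabilities depends on where the other predictors are held.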

We may also want to test the overall effect of **rank**. We can do this using the **test**
subcommand. The test subcommand is followed by the name of the variable we wish
to test (i.e., **rank**), and then one value for each level of that
variable (including the omitted category). The first line of the test subcommand,
**rank 1 0 0 0**, indicates that we want to test that the coefficient for
**rank**=1 is 0. To perform a multiple degree of freedom test, we include
multiple lines in the test subcommand; every line except the last ends with a
semicolon. The second and third lines indicate that we wish to test that the
coefficients for **rank**=2 and **rank**=3 are equal to 0. Note that there is no need to
include a line for the fourth category of **rank**.

plum admit by rank with gre gpa /link=probit /print= parameter summary /test rank 1 0 0 0; rank 0 1 0 0; rank 0 0 1 0.

Because the models are the same, most of the output produced by the above
**plum** command is the same as before. The only difference is the additional output
produced by the **test** subcommand, so only this portion of the output is
shown below.

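Custom Hypothesis Tests 1

Contrast Coefficients

| | | C1 | C2 | C3 |
|---|---|---|---|---|
| Threshold | [admit = 0] | 0 | 0 | 0 |
| Location | gre | 0 | 0 | 0 |
| | gpa | 0 | 0 | 0 |
| | [rank=1] | 1 | 0 | 0 |
| | [rank=2] | 0 | 1 | 0 |
| | [rank=3] | 0 | 0 | 1 |
| | [rank=4] | 0 | 0 | 0 |

Contrast Results

| Contrasts | Estimate | Std. Error | Test value | Wald | df | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| C1 | .936 | .245 | 0 | 14.560 | 1 | .000 | .455 | 1.417 |
| C2 | .520 | .211 | 0 | 6.091 | 1 | .014 | .107 | .934 |
| C3 | .124 | .224 | 0 | .305 | 1 | .581 | -.315 | .563 |

Link function: Probit.

Test Results

| Wald | df | Sig. |
|---|---|---|
| 21.361 | 3 | .000 |

Link function: Probit.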

- The first table above, labeled Contrast Coefficients, shows the hypotheses we are testing.
- The second table gives the contrast results. Because each row in the **test** subcommand tests that a coefficient in the model is equal to 0, these estimates, standard errors, etc. are equal to those from the table labeled Parameter Estimates in the main part of the output. The one difference in this table is the column labeled Test value, which explicitly gives the null hypothesis, in our case, that each of the terms is equal to 0. (Note that other null hypotheses can be specified.)
- The final table produced by the test subcommand, labeled Test Results, gives the multiple degree of freedom test we are interested in. The Wald test statistic of 21.361, with 3 degrees of freedom and an associated p-value of less than 0.001, tells us that the overall effect of **rank** is statistically significant.
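The Sig. column is displayed to three decimals, so very small p-values appear as .000. As a rough check, the upper-tail chi-square probability for a given statistic and integer degrees of freedom can be computed with the Python standard library using the closed-form finite series for the chi-square distribution. The function name `chi2_sf` is illustrative; a statistics package would normally be used instead.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))  # equals 2 * (1 - Phi(sqrt(x)))
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total


# Statistics copied from the output on this page.
p_rank_overall = chi2_sf(21.361, 3)  # Wald test of the overall effect of rank
p_model = chi2_sf(41.563, 5)         # likelihood ratio test of the model
print(f"overall rank test: p = {p_rank_overall:.6f}")
print(f"model chi-square:  p = {p_model:.2e}")
```

Both values are well below 0.001, which is consistent with the .000 entries in the output.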

The table labeled Parameter Estimates gives hypothesis tests for differences
between each level of **rank** and the reference category. We can use the
**test** subcommand to test for differences between the other levels of **rank**. For example, we might
want to test for a difference in coefficients for **rank**=2 and **rank**=3.
In the syntax below we have added a second **test** subcommand. This time,
the values given are **0 1 -1 0**, which indicates that we want to calculate
the difference between the coefficients for **rank**=2 and **rank**=3
(i.e., **rank**=2 - **rank**=3).

plum admit by rank with gre gpa /link=probit /print= parameter summary /test rank 1 0 0 0; rank 0 1 0 0; rank 0 0 1 0 /test rank 0 1 -1 0.

Again, the output from the model, as well as the output associated with the first **test** subcommand,
is identical to that shown above, so it is omitted.

Custom Hypothesis Tests 2

Contrast Coefficients

| | | C1 |
|---|---|---|
| Threshold | [admit = 0] | 0 |
| Location | gre | 0 |
| | gpa | 0 |
| | [rank=1] | 0 |
| | [rank=2] | 1 |
| | [rank=3] | -1 |
| | [rank=4] | 0 |

Contrast Results

| Contrasts | Estimate | Std. Error | Test value | Wald | df | Sig. | 95% CI Lower Bound | 95% CI Upper Bound |
|---|---|---|---|---|---|---|---|---|
| C1 | .397 | .168 | 0 | 5.573 | 1 | .018 | .067 | .726 |

Link function: Probit.

In the table labeled Contrast Results we see the difference in the coefficients (i.e., 0.397).
The Wald test statistic of 5.573, with one degree of freedom and an associated p-value
of 0.018, indicates that
the difference between the coefficients for **rank**=2 and **rank**=3 is
statistically significant. Because only one estimate was specified in the test
subcommand, the multiple degree of freedom test (i.e., the Test Results table) is
not printed.
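The entries in a single-row Contrast Results table are related by simple formulas: the Wald statistic is the squared ratio of the estimate to its standard error, the p-value is the two-sided normal tail for that ratio, and the confidence interval is the estimate plus or minus about 1.96 standard errors. The Python sketch below recomputes them from the displayed (rounded) estimate and standard error, so the results agree with the table only approximately; variable names are illustrative.

```python
from statistics import NormalDist

# C1 row of the Contrast Results table (rank=2 minus rank=3), as displayed.
estimate, std_error = 0.397, 0.168

normal = NormalDist()
z = estimate / std_error
wald = z ** 2                                   # 1 df Wald chi-square
p_value = 2 * (1 - normal.cdf(abs(z)))          # two-sided p-value
half_width = normal.inv_cdf(0.975) * std_error  # 95% interval half-width
lower, upper = estimate - half_width, estimate + half_width

print(f"Wald = {wald:.2f}, p = {p_value:.3f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Note that the standard error of the difference cannot be derived from the two separate standard errors in the Parameter Estimates table, because it also depends on the covariance between the two coefficient estimates.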

- Empty cells or small cells: You should check for empty or small cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases (a small cell), the model may become unstable or it might not run at all.
- Separation or quasi-separation (also called perfect prediction): A condition in which the outcome does not vary at some levels of the independent variables. See our page FAQ: What is complete or quasi-complete separation in logistic/probit regression and how do we deal with them? for information on models with perfect prediction.
- Sample size: Both logit and probit models require more cases than OLS regression because they use maximum likelihood estimation techniques. It is also important to keep in mind that when the outcome is rare, even if the overall dataset is large, it can be difficult to estimate a probit model.
- Pseudo-R-squared: Many different measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression; however, none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squareds see Long and Freese (2006) or our FAQ page What are pseudo R-squareds?
- Diagnostics: The diagnostics for logistic regression are different from those for OLS regression. For a discussion of model diagnostics for logistic regression, see Hosmer and Lemeshow (2000, Chapter 5). Note that diagnostics done for logistic regression are similar to those done for probit regression.

- Annotated Output for Probit Regression
- Stat Books for Loan, Logistic Regression and Limited Dependent Variables

- Hosmer, D. & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York: John Wiley & Sons, Inc.
- Long, J. Scott (1997). Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
