Stata Data Analysis Examples
Discriminant Function Analysis

Version info: Code for this page was tested in Stata 12.

Linear discriminant function analysis (i.e., discriminant analysis) performs a multivariate test of differences between groups.   In addition, discriminant analysis is used to determine the minimum number of dimensions needed to describe these differences.  A distinction is sometimes made between descriptive discriminant analysis and predictive discriminant analysis.  We will be illustrating predictive discriminant analysis on this page.

Please note: The purpose of this page is to show how to use various data analysis commands.  It does not cover all aspects of the research process that researchers are expected to carry out.  In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up analyses.

Examples of discriminant function analysis

Example 1. A large international air carrier has collected data on employees in three different job classifications: 1) customer service personnel, 2) mechanics and 3) dispatchers.  The director of Human Resources wants to know if these three job classifications appeal to different personality types.  Each employee is administered a battery of psychological tests that includes measures of interest in outdoor activity, sociability and conservativeness.

Example 2. There is Fisher's (1936) classic example of discriminant analysis involving three varieties of iris and four predictor variables (petal width, petal length, sepal width, and sepal length).  Fisher not only wanted to determine if the varieties differed significantly on the four continuous variables, but he was also interested in predicting variety classification for unknown individual plants.

Description of the data

Let's pursue Example 1 from above.

We have a data file, discrim.dta, with 244 observations on four variables.  The three psychological variables are outdoor (outdoor interests), social and conservative.  The categorical variable is job type, with three levels: 1) customer service, 2) mechanic and 3) dispatcher.

Let's look at the data.  It is always a good idea to start with descriptive statistics.
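
The commands below are a minimal sketch of this descriptive step.  They assume that discrim.dta is in the current working directory and that the grouping variable is named job; that name is an assumption, since the text above does not give it.  The names outdoor, social and conservative are taken from the output shown later on this page.

```stata
* Load the data (assumes discrim.dta is in the working directory)
use discrim, clear

* Overall summary statistics for the three psychological variables
summarize outdoor social conservative

* The same statistics by job type (job is an assumed variable name)
tabstat outdoor social conservative, by(job) statistics(n mean sd min max) columns(statistics)

* Correlations among the predictors, then the number of employees per job type
correlate outdoor social conservative
tabulate job
```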

Analysis methods you might consider

Below is a list of some analysis methods you may have encountered.  Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.

Discriminant function analysis

We will run the discriminant analysis using the candisc command.  We could also have run the discrim lda command, which fits the same model but displays slightly different output.  There is a great deal of output, so we will comment at various places along the way.
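
A call along the following lines would fit the model.  This is a sketch that assumes the grouping variable is named job, which the text does not state.  The second half shows the discrim lda route, with estat commands requesting the two coefficient tables that appear below.

```stata
* Canonical linear discriminant analysis with job as the grouping variable
* (job is an assumed variable name)
candisc outdoor social conservative, group(job)

* Alternative: fit the same model with discrim lda, then request
* individual pieces of output with estat
discrim lda outdoor social conservative, group(job)

* Standardized canonical discriminant function coefficients
estat loadings

* Canonical structure (correlations between variables and functions)
estat structure
```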

Standardized canonical discriminant function coefficients

                 | function1  function2 
    -------------+----------------------
         outdoor |  .3785725   .9261104 
          social | -.8306986   .2128593 
    conservative |  .5171682  -.2914406 

Canonical structure

                 | function1  function2 
    -------------+----------------------
         outdoor |  .3230982   .9372155 
          social | -.7653907   .2660298 
    conservative |   .467691  -.2587426

The output includes the means on the discriminant functions for each of the three groups and a classification table.  Values in the diagonal of the classification table reflect the correct classification of individuals into groups based on their scores on the discriminant dimensions.
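
If these two pieces of output need to be redisplayed on their own, the postestimation commands below are one way to do it.  The sketch assumes that candisc or discrim lda has already been run in the current session.

```stata
* Group means on the canonical discriminant functions
estat grmeans, canonical

* Classification table: rows are actual groups, columns are predicted groups
estat classtable
```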

By default, Stata assumes a priori that each job type contains an equal share of people.  This is represented by the 0.3333 Priors in the classification table.  If you have different expected proportions in mind, you may specify them with the priors option.
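
As a sketch of the priors option, the first command below uses priors proportional to the observed group sizes, and the second passes a user-defined row vector.  The variable name job, the matrix name p and the three prior values are hypothetical and chosen only for illustration.

```stata
* Priors proportional to the observed group sizes
candisc outdoor social conservative, group(job) priors(proportional)

* User-specified priors from a row vector
* (hypothetical values, one per job type, summing to 1)
matrix p = (0.40, 0.35, 0.25)
candisc outdoor social conservative, group(job) priors(p)
```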

Next, we will plot a graph of individuals on the discriminant dimensions.  Due to the large number of subjects we will shorten the labels for the job groups to make the graph more legible.  As long as we do not save the dataset, these new labels will not be made permanent.
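
One way to do this is sketched below.  The label name job2, the one-letter labels and the variable name job are assumptions made for illustration; the codes 1, 2 and 3 follow the job types listed in the description of the data.  The commands assume the discriminant analysis has already been run.

```stata
* Define short value labels and attach them to the grouping variable
* (job2 is an illustrative label name; job is an assumed variable name)
label define job2 1 "c" 2 "m" 3 "d"
label values job job2

* Plot observations on the first two discriminant dimensions;
* msymbol(i) hides the marker symbols so that only the labels show
scoreplot, msymbol(i)
```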

 

Using the standardized coefficients shown earlier, the discriminant functions are:

discriminant_score_1 = 0.517*conservative + 0.379*outdoor - 0.831*social.

discriminant_score_2 = 0.926*outdoor + 0.213*social - 0.291*conservative.
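
Scores for each employee can also be saved as variables with predict after the model has been fit.  This is a sketch: the new variable names dscore1, dscore2 and predjob are illustrative, and job is an assumed variable name.  The saved scores may be on a different scale from the equations above, because those equations use standardized coefficients.

```stata
* Discriminant function scores, one new variable per function
predict dscore1 dscore2, dscore

* Predicted job type for each employee
predict predjob, classification

* Compare actual and predicted job type
tabulate job predjob
```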

As you can see, the customer service employees tend to be at the more social (negative) end of dimension 1; the dispatchers are at the opposite end; the mechanics are in the middle.  On dimension 2 the results are not as clear; however, the mechanics tend to be higher on the outdoor dimension and customer service employees and dispatchers are lower.

We can also plot the discriminant loadings for the variables onto the discriminant dimensions.
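
A minimal sketch of this step, assuming the discriminant analysis has already been run in the current session:

```stata
* Plot the standardized discriminant function loadings
* for the variables on the first two dimensions
loadingplot
```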

It is no surprise that the variable social is strong on the social dimension (dimension 1), where it has a large negative loading, and that the outdoor variable is high on the outdoor dimension (dimension 2).
