Example 1. A researcher has collected data on three psychological variables, four academic variables (standardized test scores) and gender for 600 college freshmen. She is interested in how the set of psychological variables relates to the academic variables and gender. In particular, the researcher is interested in how many dimensions are necessary to understand the association between the two sets of variables.
Example 2. Fisher's (1936) classic example of discriminant analysis involves three varieties of iris and four predictor variables (petal width, petal length, sepal width, and sepal length). Fisher not only wanted to determine whether the varieties differed significantly on the four continuous variables, but was also interested in predicting variety classification for unknown individual plants.
We have a data file, discrim.sas7bdat, with 244 observations on four variables. The psychological variables are outdoor interests, social and conservative. The categorical variable is job type with three levels: 1) customer service, 2) mechanic, and 3) dispatcher.
Let's look at the data.
options nocenter;
proc means data='d:\data\discrim' n mean std min max;
var outdoor social conservative;
run;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
OUTDOOR 244 15.6393443 4.8399326 0 28.0000000
SOCIAL 244 20.6762295 5.4792621 7.0000000 35.0000000
CONSERVATIVE 244 10.5901639 3.7267890 0 20.0000000
proc means data='d:\data\discrim' n mean std;
class job;
var outdoor social conservative;
run;
The MEANS Procedure
         N
JOB    Obs    Variable          N          Mean       Std Dev
  1     85    OUTDOOR          85    12.5176471     4.6486346
              SOCIAL           85    24.2235294     4.3352829
              CONSERVATIVE     85     9.0235294     3.1433091
  2     93    OUTDOOR          93    18.5376344     3.5648012
              SOCIAL           93    21.1397849     4.5506602
              CONSERVATIVE     93    10.1397849     3.2423535
  3     66    OUTDOOR          66    15.5757576     4.1102521
              SOCIAL           66    15.4545455     3.7669895
              CONSERVATIVE     66    13.2424242     3.6922397
proc corr data='d:\data\discrim';
var outdoor social conservative;
run;
Pearson Correlation Coefficients, N = 244
Prob > |r| under H0: Rho=0
                 OUTDOOR      SOCIAL    CONSERVATIVE
OUTDOOR          1.00000    -0.07130         0.07938
                              0.2672          0.2166
SOCIAL          -0.07130     1.00000        -0.23586
                  0.2672                      0.0002
CONSERVATIVE     0.07938    -0.23586         1.00000
                  0.2166      0.0002
proc freq data='d:\data\discrim';
tables job;
run;
The FREQ Procedure
Cumulative Cumulative
JOB Frequency Percent Frequency Percent
1 85 34.84 85 34.84
2 93 38.11 178 72.95
3 66 27.05 244 100.00
We will run the discriminant analysis using SAS' proc candisc. We could also have used proc discrim with the appropriate options and obtained the same results. Please note that we will not be using all of the output that SAS provides nor will the output be presented in the same order as it appears. There is still a lot of output remaining so we will comment at various places along the way.
proc candisc data='d:\data\discrim' out=discrim_out ;
class job;
var outdoor social conservative;
run;
The CANDISC Procedure
Multivariate Statistics and F Approximations
S=2 M=0 N=118.5
Statistic Value F Value Num DF Den DF Pr > F
Wilks' Lambda 0.36398797 52.38 6 478 <.0001
Pillai's Trace 0.76206574 49.25 6 480 <.0001
Hotelling-Lawley Trace 1.40103067 55.69 6 316.9 <.0001
Roy's Greatest Root 1.08052702 86.44 3 240 <.0001
NOTE: F Statistic for Roy's Greatest Root is an upper bound.
NOTE: F Statistic for Wilks' Lambda is exact.
                   Adjusted    Approximate        Squared
     Canonical    Canonical       Standard      Canonical
   Correlation  Correlation          Error    Correlation
1     0.720661     0.716099       0.030834       0.519353
2     0.492659      .             0.048580       0.242713
Eigenvalues of Inv(E)*H = CanRsq/(1-CanRsq)
   Eigenvalue  Difference  Proportion  Cumulative
1      1.0805      0.7600      0.7712      0.7712
2      0.3205                  0.2288      1.0000
Test of H0: The canonical correlations in the current row and all that follow are zero
   Likelihood  Approximate
        Ratio      F Value  Num DF  Den DF  Pr > F
1  0.36398797        52.38       6     478  <.0001
2  0.75728681        38.46       2     240  <.0001
There are two discriminant dimensions, both of which are statistically significant. The canonical correlations for dimensions one and two are 0.72 and 0.49, respectively.
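The eigenvalues, the proportion of discriminating power, and Wilks' lambda can all be recovered from the two canonical correlations, which may help in reading the tables above. The following is a minimal sketch rather than part of the original analysis: the values are typed in by hand from the output, and the variable names (r1, r2, eig1, and so on) are made up for illustration.

```sas
data _null_;
  /* canonical correlations copied from the proc candisc output above */
  r1 = 0.720661;
  r2 = 0.492659;
  /* eigenvalue = CanRsq / (1 - CanRsq), as stated in the output header */
  eig1 = r1**2 / (1 - r1**2);
  eig2 = r2**2 / (1 - r2**2);
  /* share of the total discriminating power carried by dimension 1 */
  prop1 = eig1 / (eig1 + eig2);
  /* Wilks' lambda is the product of (1 - CanRsq) over all dimensions */
  lambda = (1 - r1**2) * (1 - r2**2);
  put eig1= eig2= prop1= lambda=;
run;
```

The values written to the log should agree, up to rounding, with the eigenvalues (1.0805 and 0.3205), the proportion (0.7712) and the likelihood ratio for the first row (about 0.364) printed above. The likelihood ratio for the second row, 0.757, is simply 1 minus the squared second canonical correlation.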
Standardized canonical discriminant function coefficients
Pooled Within-Class Standardized Canonical Coefficients
Variable Can1 Can2
OUTDOOR -.3785725108 0.9261103825
SOCIAL 0.8306986150 0.2128592590
CONSERVATIVE -.5171682475 -.2914406390
Pooled Within Canonical Structure
Variable Can1 Can2
OUTDOOR -0.323098 0.937215
SOCIAL 0.765391 0.266030
CONSERVATIVE -0.467691 -0.258743
The standardized discriminant coefficients function in a manner analogous to standardized regression coefficients in OLS regression. For example, a one standard deviation increase on the outdoor variable will result in a .38 standard deviation decrease in the predicted values on discriminant function 1. The canonical structure, also known as canonical loadings or discriminant loadings, represents correlations between the observed variables and the unobserved discriminant functions (dimensions). The discriminant functions are a kind of latent variable, and the correlations are loadings analogous to factor loadings.
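Because the out= option saved the discriminant scores, the idea of a loading as a correlation between a variable and a dimension can be checked directly. The sketch below assumes the scores are stored in discrim_out under the default names Can1 and Can2. It computes ordinary total-sample correlations, so the values need not match the pooled within-class structure shown above, which is based on within-group variation.

```sas
/* correlate each observed variable with each saved discriminant score */
proc corr data=discrim_out;
  var Can1 Can2;
  with outdoor social conservative;
run;
```

The signs and the relative sizes of these correlations would be expected to tell the same story as the pooled within canonical structure: social aligned with the first dimension and outdoor with the second.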
Class Means on Canonical Variables
JOB Can1 Can2
1 1.219100186 -0.389003864
2 -0.106724637 0.714570441
3 -1.419668555 -0.505904888
Number of Observations and Percent Classified into JOB
From JOB          1          2          3      Total
       1         69         12          4         85
              81.18      14.12       4.71     100.00
       2         17         64         12         93
              18.28      68.82      12.90     100.00
       3          3         10         53         66
               4.55      15.15      80.30     100.00
   Total         89         86         69        244
              36.48      35.25      28.28     100.00
The output includes the means on the discriminant functions for each of the three groups and a classification table. Values in the diagonal of the classification table reflect the correct classification of individuals into groups based on their scores on the discriminant dimensions.
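Both pieces of output can be approached from the saved results. The sketch below is an illustration under stated assumptions, not the exact code behind the tables above: it assumes discrim_out holds the scores under the default names Can1 and Can2, and it uses proc discrim with its canonical option to produce a classification summary. Depending on the priors and other options in effect, the counts may not match the table above exactly.

```sas
/* class means on the canonical variables, recomputed from the
   scores that proc candisc wrote to discrim_out */
proc means data=discrim_out n mean;
  class job;
  var Can1 Can2;
run;

/* classification summary; the canonical option also requests
   the canonical discriminant analysis */
proc discrim data='d:\data\discrim' canonical;
  class job;
  var outdoor social conservative;
run;
```

In the classification table shown above, the diagonal counts 69, 64 and 53 sum to 186, so 186 of the 244 individuals (about 76%) are classified into their actual job group.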
Next, we will plot a graph of individuals on the discriminant dimensions. Due to the large number of subjects, we will shorten the labels for the job groups to make the graph more legible. As long as we don't save the dataset, these new labels will not be made permanent.
proc format;
value jobname
1='C '
2='M '
3='D ';
run;
data discrimplot;
set discrim_out;
format job jobname.;
run;
symbol1 interpol=none font='Times-Roman' pointlabel=("#job") height=1;
proc gplot data=discrimplot;
plot Can2*Can1=job / haxis=axis1;
run;

As you can see, the customer service employees tend to be at the more social (positive) end of dimension 1 and the dispatchers at the opposite end, with the mechanics in the middle. On dimension 2 the results are not as clear; however, the mechanics tend to be higher on the outdoor dimension and the customer service employees and dispatchers lower.
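A similar plot could also be drawn with proc sgplot, which may be convenient where SAS/GRAPH is not available. This is a minimal sketch that assumes SAS 9.2 or later and the discrimplot dataset created above. It colors the points by job group rather than labeling each point, so it will not look identical to the gplot version.

```sas
proc sgplot data=discrimplot;
  /* one marker per individual, grouped by the formatted job label */
  scatter x=Can1 y=Can2 / group=job;
  /* reference lines at zero make the group separation easier to see */
  refline 0 / axis=x;
  refline 0 / axis=y;
run;
```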
Table 1: Tests of Discriminant Dimensions
Dimension    Canonical Corr.    Mult. F    df1    df2        p
    1              0.72           52.38      6    478    0.000
    2              0.49           38.46      2    240    0.000
Table 2: Standardized Discriminant Coefficients
                  Dimension
                    1        2
outdoor         -0.38     0.93
social           0.83     0.21
conservative    -0.52    -0.29
Tests of dimensionality for the discriminant analysis, as shown in Table 1, indicate that both of the dimensions are statistically significant. The F-tests associated with each dimension are exact. Dimension 1 had a canonical correlation of 0.72 between the response variables and the job classification, while for dimension 2 the canonical correlation was lower at 0.49.
Table 2 presents the standardized canonical coefficients for both dimensions. The first discriminant dimension is negatively weighted by outdoor (-0.38) and conservative (-0.52) and strongly positive on social (0.83). The second discriminant dimension is dominated by the outdoor variable (0.93). These results are interpreted to indicate that the first dimension reflects a bipolar social/non-social dimension, while the second is an outdoor/non-outdoor dimension.
These pages are Copyrighted (c) by UCLA Academic Technology Services