Discriminant Analysis in JMP

The flea beetle genus Chaetocnema contains three species that are difficult to
distinguish from one another and, indeed, were confused for a long time. These data
comprise six characteristics measured on beetles of these three species.
The variables in this data set are:

Species - species of flea beetle (1 = Chaetocnema concinna, 2 = Chaetocnema
heikertingeri, or 3 = Chaetocnema heptapotamica)
Width 1 - a numeric vector giving the width of the first joint of the first tarsus in
microns (the sum of measurements for both tarsi)
Width 2 - a numeric vector giving the width of the second joint of the first tarsus
in microns (the sum of measurements for both tarsi)
Maxwid 1 - a numeric vector giving the maximal width of the head between the
external edges of the eyes in 0.01 mm
Maxwid 2 - a numeric vector giving the maximal width of the aedeagus in the
fore-part in microns
Angle - a numeric vector giving the front angle of the aedeagus (1 unit = 7.5
degrees)
Width 3 - a numeric vector giving the aedeagus width from the side in microns
The main questions of interest are:
Are there significant differences in the six measured characteristics for these three
species of flea beetle?
Can a classification rule based on these six measurements be developed? A
classification rule should allow us to predict the species of flea beetle based on these six
measurements. Statistical methods for developing classification rules include
discriminant analysis, classification trees, logistic regression, nearest-neighbors, naïve
Bayes classifiers, support vector machines, neural networks, etc.
Analysis in JMP
First we use graphical techniques such as histograms, comparative boxplots and
scatterplots with color coding in an attempt to derive a classification rule for the three
species of flea beetle. This can be done as follows:
Color coding - From the Rows menu select Color/Marker by Col... then check box
labeled Marker (color will already be checked by default). Now highlight Species with
the mouse and click OK. In subsequent plots the different species of flea beetle will be
color coded and a different plotting symbol will be used for each species. Look at the
spreadsheet to see the correspondence.
Histograms - Select Distribution of Y from the Analyze menu and add Species and all
six of the measurements to the right hand box, then click OK. Now use the mouse to
click on the bars corresponding to the different species of beetle in the bar graph for
Species, carefully observing what happens in the histograms for the six measured
characteristics. This is an example of linked viewing - data that are highlighted or
selected in one plot becomes highlighted in all other plots.
What can we conclude from this preliminary histogram analysis?
From the above displays we can see that Heptapot flea beetles have width measurements
between 125 and 150. On the angle variable we can see that this species of flea beetle
tends to have much smaller angle measurements than the other two species. At this
point we might conjecture that an unclassified flea beetle with a width value between
125 and 150 and an angle measurement less than 12 would be classified as a
Heptapot flea beetle. Similar observations can be made by clicking on the bars
for the other two species in the bar graph for Species.
Comparative Boxplots - Select Fit Y by X from the Analyze menu and add Species in
the X box and all six measurements in the Y box. Add boxplots and/or means diamonds
to the resulting plots by selecting these options in the Display menu which is located
below the plot.
Comparative Displays for Width 1
The plot at the top left gives comparative boxplots for the three flea beetle species for
the width variable. We clearly see that Heikert beetles have the lowest width
measurements in general, while Concinna beetles generally have the highest.
The Compare density option gives smoothed histograms, which can also be used to show
univariate differences among the three species of flea beetle.
Comparative Displays for Angle
These graphs again show that Heptapot beetles have the smallest angle measurements
in general. Another interesting plot to consider when looking at multivariate data
where there are potentially different populations involved is the parallel coordinate
plot. In a parallel coordinate plot, one axis is drawn for each variable, with the axes
placed parallel to one another. A given row of data is then represented by a line that
connects the value of that row on each axis. When you have predefined populations
represented in your data, these plots can be particularly useful when looking for
distinguishing characteristics of these populations.
To obtain a parallel coordinate plot in JMP select Parallel Plot from the Graph menu
and place the variables of interest in the Y response box. You can place a grouping
variable in the X, Grouping box if you wish to see a parallel coordinate display for each
level of the grouping variable X. The results from both approaches are shown on the
following page.
At this point we can see that our classification rule could be stated as follows: if the
angle is less than 12, classify a flea beetle as species Heptapot. If the angle is greater
than 12, the beetle is most likely from species Concinna or Heikert. To distinguish Concinna
from Heikert, the comparative boxplots for Maxwid 2 suggest that we could classify a
flea beetle with angle greater than 12 as Concinna if the maximal
width exceeds 133 or 134, and otherwise classify it as Heikert.
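The rule just stated can be written as a small function. The Python sketch below is only an illustration: the function name is hypothetical, 134 is assumed as the single width cutoff (the text says "133 or 134"), and an angle of exactly 12, which the rule does not address, is sent to the Concinna/Heikert branch. Angle is in the data's units (1 unit = 7.5 degrees).

```python
def classify_flea_beetle(angle, maxwid2, angle_cut=12, width_cut=134):
    """Eyeballed rule from the plots: split on Angle first, then on Maxwid 2."""
    if angle < angle_cut:
        return "Heptapot"
    # Angle of 12 or more: separate the two remaining species by width.
    if maxwid2 > width_cut:
        return "Concinna"
    return "Heikert"


# Hypothetical measurements, not rows from the data table.
print(classify_flea_beetle(angle=10, maxwid2=138))  # Heptapot
print(classify_flea_beetle(angle=14, maxwid2=146))  # Concinna
print(classify_flea_beetle(angle=14, maxwid2=124))  # Heikert
```

Keeping the two cutoffs as parameters makes it easy to see how sensitive the rule is to the choice of 133 versus 134.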
To formally compare the species on these six characteristics we could perform a
one-way analysis of variance (ANOVA) for each. This can be done by selecting
Means/ANOVA from the Analysis menu below the plot. Pairwise multiple
comparisons can be done using Tukey’s method by selecting Compare All Pairs from
the Oneway Analysis pull-down menu.
ANOVA TABLE for Maximal Width 2
The p-value for the F-test of the null hypothesis that the population means for
width are all equal is less than .0001, indicating that at least two means are significantly
different. To decide which means are significantly different we can use Tukey’s
multiple comparison procedure, examining all pairwise comparisons of the population
means. The results are shown below.
Results of Tukey's Multiple Comparisons
Here we can see that the mean maximal widths differ for all species.
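For readers who want to see what the F-test in the ANOVA table computes, here is a minimal Python sketch of the one-way ANOVA F statistic (between-group mean square divided by within-group mean square). The function name is hypothetical and the widths are made-up values chosen for easy arithmetic, not the flea beetle data; JMP additionally converts F to a p-value, which this sketch does not do.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up widths for three groups (illustration only).
concinna = [146, 150, 144, 148]
heikert = [124, 126, 122, 128]
heptapot = [138, 136, 140, 134]

f_stat, df_b, df_w = one_way_anova_f([concinna, heikert, heptapot])
print(f"F = {f_stat:.1f} on {df_b} and {df_w} degrees of freedom")
# F = 72.8 on 2 and 9 degrees of freedom
```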
To aid in the development of a simple classification rule we could also examine a
scatterplot of Angle vs. Maxwid 2, making sure that the different species are still color
coded (see above). This can be done by choosing Fit Y by X from the Analyze menu and
placing Maxwid 2 in the X box and Angle in the Y box. We can add a density ellipse for
each group to the plot by choosing Grouping Variable from the Analysis menu below
the plot and highlighting Species as the variable to group on. Then select Density
Ellipses with the desired percent confidence from the Analysis menu. The ellipses
should contain roughly the specified percent of the points for each group. Notice the nice
separation of the density ellipses.
Scatterplot of Angle vs. Width
We can use this scatterplot to state a classification rule that can be used to identify
species on the basis of these measurements. Clearly an angle less than 12 would
indicate that the flea beetle is a Heptapot. Concinna and Heikert both appear
to have angle measurements exceeding 12. However, by using the maximal width
measurement we can distinguish between these two species. Concinna flea beetles have
widths exceeding 134, while Heikert beetles have widths less than 134. To
statistically formalize the procedure above we could perform discriminant analysis,
which we will consider later in this tutorial.
Rather than considering each of the six characteristics individually using ANOVA, we can
use MANOVA to determine whether the three species differ on the six characteristics
simultaneously. This is achieved by looking at the multivariate response
consisting of all six measurements rather than at each characteristic individually, as
was done above with one-way ANOVA. To perform MANOVA in JMP
first select the Fit Model option in the Analyze menu. Place all six measurements in
the Y box and place Species in the Effects in Model box, then select MANOVA from the
Personality pull-down menu and select Run. When the analysis is completed, choose
Identity from the menu below where it says Choose Response and click Done. Below
you will see a variety of boxes with many different statistics and statistical tests. To test
the main hypothesis that these measured characteristics differ from species to species
you will want to examine the results of the tests in the Species box. This box contains
four different test statistics which all answer the same question: do the means for these
six measurements differ in any way from species to species? If the p-value is small for
any of them there is evidence that the means of the measurements differ significantly from
species to species. You can examine a profile of the mean values of the measurements for
each species by examining the plot for species in the Least Square Means box in the
output. The table on the following page shows the results of the MANOVA for species.
MANOVA Results for Species Comparison
Here we can see that the p-value associated with each test statistic is less than .0001,
which provides compelling evidence that the three species differ significantly on the six
measured characteristics. To see the nature of these differences select Centroid Plot
from the pull-down menu at the top of the Species box. The centroid plot for these data
is shown on the following page.
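One of the statistics JMP reports in the MANOVA output is Wilks' Lambda, the ratio of the determinant of the within-group sums-of-squares-and-cross-products (SSCP) matrix to that of the total SSCP matrix; values near 0 indicate well-separated group means and values near 1 indicate little separation. The Python sketch below computes it for two variables only, so that a hand-written 2x2 determinant suffices. The function names and the (width, angle) pairs are made up for illustration, and this is not JMP's implementation.

```python
def mean2(rows):
    """Mean vector of a list of (x, y) pairs."""
    n = len(rows)
    return (sum(x for x, _ in rows) / n, sum(y for _, y in rows) / n)


def sscp(rows, center):
    """2x2 SSCP matrix about `center`, stored as (sxx, sxy, syy)."""
    cx, cy = center
    sxx = sum((x - cx) ** 2 for x, _ in rows)
    sxy = sum((x - cx) * (y - cy) for x, y in rows)
    syy = sum((y - cy) ** 2 for _, y in rows)
    return sxx, sxy, syy


def det2(m):
    """Determinant of a symmetric 2x2 matrix stored as (sxx, sxy, syy)."""
    return m[0] * m[2] - m[1] ** 2


def wilks_lambda(groups):
    """Wilks' Lambda = det(within SSCP) / det(total SSCP)."""
    all_rows = [row for g in groups for row in g]
    total = sscp(all_rows, mean2(all_rows))
    within = [0.0, 0.0, 0.0]
    for g in groups:
        for i, value in enumerate(sscp(g, mean2(g))):
            within[i] += value
    return det2(within) / det2(total)


# Made-up (width, angle) pairs for three groups (illustration only).
groups = [
    [(146, 14), (150, 15), (144, 14), (148, 13)],
    [(124, 14), (126, 15), (122, 13), (128, 14)],
    [(138, 10), (136, 9), (140, 11), (134, 10)],
]
print(round(wilks_lambda(groups), 4))
```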
Canonical Centroid Plot
A 2-D canonical centroid plot is a plot of the first two discriminants from Fisher’s
Discriminant Analysis. Fisher’s discriminant analysis is a method where linear
combinations of X1, …, Xp are found that maximally separate the groups. The first linear
combination maximally separates the groups in one dimension; the second linear
combination maximally separates the groups subject to the constraint that it is
orthogonal to the first. Thus when the resulting linear combinations are plotted
for each observation we obtain a scatterplot exhibiting zero correlation and hopefully good
group separation. We can also visualize the results in 3-D by considering a third linear
combination, provided there are more than three groups/populations.
The above plot confirms things we have already seen. Notice that the Concinna and
Heikert centroid circles lie in the direction of the Angle ray, indicating that these two
species have large angle measurements relative to Heptapot. The circle for
Heikert lies in the direction of the Width 1 ray, indicating that these flea beetles
have relatively large width measurements. In total, we see that a nice species
separation is achieved.
The canonical centroid plot displays the results of discriminant analysis. Discriminant
analysis, though related to MANOVA, is really a standalone method. There are two
main approaches to classic discriminant analysis: Fisher’s method which is discussed
above and a Bayesian approach where the posterior probability of group membership is
calculated assuming X’s have an approximate multivariate normal distribution. In
Fisher’s approach to discriminating between g groups, (g – 1) orthogonal linear
combinations are found that maximally separate the groups, with the first linear
combination achieving the largest degree of separation, and so on. Future observations are
classified to the group they are closest to in the lower dimensional space created by
these linear combinations.
In LDA/QDA we classify observations to groups based on their posterior
“probability” of group membership. The probabilities are calculated assuming the
populations have a multivariate normal distribution. In linear discriminant analysis
(LDA) we assume that the populations, while having different mean vectors, have the same
variance-covariance structure. In quadratic discriminant analysis (QDA) we assume
that the variance-covariance structure differs across populations. QDA requires
more observations per group as the variance-covariance matrix is estimated separately
for each of the groups, whereas LDA uses a pooled estimate of the common
variance-covariance structure. Because QDA effectively estimates more “parameters” it
can provide better discrimination between the groups than LDA when the
variance-covariance structures genuinely differ. Regularized discriminant
analysis (RDA) is a balance between the two extremes, essentially taking a weighted
average of the variance-covariance structures of the two approaches. You can think of it as a
shrunken version of QDA, where the shrinkage is towards LDA.
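To classify future observations a Bayesian approach is used where the posterior
probability of group membership for each group is computed as

$$P(\text{Group} = k \mid \boldsymbol{x}) = \frac{\exp\left(-0.5\, D_k^{*2}(\boldsymbol{x})\right)}{\sum_{i=1}^{g} \exp\left(-0.5\, D_i^{*2}(\boldsymbol{x})\right)} \quad \text{for } k = 1, \ldots, g$$

where

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x} - \bar{\boldsymbol{x}}_i)' \boldsymbol{S}_i^{-1} (\boldsymbol{x} - \bar{\boldsymbol{x}}_i) + \ln|\boldsymbol{S}_i| - 2\ln p_i \quad \text{for QDA with unequal priors}$$

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x} - \bar{\boldsymbol{x}}_i)' \boldsymbol{S}_i^{-1} (\boldsymbol{x} - \bar{\boldsymbol{x}}_i) + \ln|\boldsymbol{S}_i| \quad \text{for QDA with equal priors}$$

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x} - \bar{\boldsymbol{x}}_i)' \boldsymbol{S}_p^{-1} (\boldsymbol{x} - \bar{\boldsymbol{x}}_i) - 2\ln p_i \quad \text{for LDA with unequal priors}$$

$$D_i^{*2}(\boldsymbol{x}) = (\boldsymbol{x} - \bar{\boldsymbol{x}}_i)' \boldsymbol{S}_p^{-1} (\boldsymbol{x} - \bar{\boldsymbol{x}}_i) \quad \text{for LDA with equal priors}$$

Here $\bar{\boldsymbol{x}}_i$, $\boldsymbol{S}_i$, $n_i$ and $p_i$ are the sample mean vector, sample
variance-covariance matrix, sample size and prior probability for group $i$, and

$$\boldsymbol{S}_p = \frac{\sum_{i=1}^{g} (n_i - 1)\, \boldsymbol{S}_i}{\sum_{i=1}^{g} (n_i - 1)}$$

is the pooled estimate of the common variance-covariance matrix
in the case of LDA. For RDA we use the same formula as QDA with the sample
variance-covariance matrix replaced by

$$\boldsymbol{S}_i^{*} = \alpha \boldsymbol{S}_i + (1 - \alpha) \boldsymbol{S}_p$$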
The posterior probabilities for each observation in the training data are computed and
group membership is predicted. The performance of the discriminant analysis is
reported as the number/percent misclassified in the training data. Future observations can
be predicted by adding them to the data table.
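The posterior calculation for LDA can be sketched in a few lines of Python. To stay dependency-free the sketch is restricted to two variables, so the inverse of the pooled variance-covariance matrix can be written out explicitly; all function names and the (width, angle) training pairs are made up for illustration, and this is not how JMP implements the method. Passing equal priors gives the same posteriors as the "LDA with equal priors" distance, because a common prior term cancels in the ratio.

```python
import math


def mean2(rows):
    """Mean vector of a list of (x, y) pairs."""
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def pooled_cov(groups):
    """Pooled 2x2 covariance (sxx, sxy, syy): within-group SSCP / (N - g)."""
    sxx = sxy = syy = 0.0
    for g in groups:
        mx, my = mean2(g)
        for x, y in g:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    df = sum(len(g) for g in groups) - len(groups)
    return sxx / df, sxy / df, syy / df


def mahalanobis2(x, mean, cov):
    """(x - mean)' S^{-1} (x - mean), using the explicit 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def lda_posteriors(x, groups, priors):
    """Posterior probability of each group for a new observation x."""
    cov = pooled_cov(groups)
    d2 = [mahalanobis2(x, mean2(g), cov) - 2 * math.log(p)
          for g, p in zip(groups, priors)]
    smallest = min(d2)  # shifting all distances leaves the ratio unchanged
    weights = [math.exp(-0.5 * (d - smallest)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up (width, angle) training pairs for three groups (illustration only).
groups = [
    [(146, 14), (150, 15), (144, 14), (148, 13)],
    [(124, 14), (126, 15), (122, 13), (128, 14)],
    [(138, 10), (136, 9), (140, 11), (134, 10)],
]
print(lda_posteriors((146, 14), groups, priors=[1 / 3, 1 / 3, 1 / 3]))
```

The new observation is assigned to the group with the largest posterior probability. For QDA each group would use its own covariance matrix and the log-determinant term would be added to the distance.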
To perform a discriminant analysis in JMP, choose Discriminant from the Multivariate
Methods option within the Analyze menu as shown below.
Put all six measurements in the Y, Covariates box and Species in the X, Categories box.
The results are shown below:
Here we can see that linear discriminant analysis misclassifies none of the flea beetles in
these training data.
A contingency table showing the classification results is displayed below the table of
posterior species probabilities for each observation. This table is sometimes referred to as
a confusion matrix.
Confusion Matrix for Species Classification
None of the species are misclassified giving us an apparent error rate (APER) of .000 or
0%. Applying either QDA (which actually might be recommended given that the
variability of some of the characteristics differs across species) or RDA results in perfect
classification of the species as well.
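The confusion matrix and the apparent error rate are simple tallies of actual versus predicted labels. The Python sketch below shows the bookkeeping; the function names are hypothetical and the labels are made up, deliberately including one misclassification so that the off-diagonal cell and the error rate are not trivially zero as they are for the flea beetle data.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) with rows = actual and columns = predicted."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def apparent_error_rate(actual, predicted):
    """Proportion of training observations whose prediction is wrong."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Made-up labels (illustration only): one Concinna is predicted as Heikert.
actual = ["Concinna", "Concinna", "Heikert", "Heikert", "Heptapot", "Heptapot"]
predicted = ["Concinna", "Heikert", "Heikert", "Heikert", "Heptapot", "Heptapot"]

labels, matrix = confusion_matrix(actual, predicted)
print(labels)
for label, row in zip(labels, matrix):
    print(label, row)
print(f"APER = {apparent_error_rate(actual, predicted):.3f}")
```

Because the APER is computed on the same data used to build the rule, it tends to be optimistic about performance on new beetles.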
We can save the distances to each group along with the posterior probabilities for each
species to our spreadsheet by selecting Save Formulas from the Score Options pull-out
menu as shown below.
Having saved these formulae to the data spreadsheet we can use the results of our
discriminant analysis to classify the species of new observations. For example, adding 2
rows to our spreadsheet and entering the measurements for two yet-to-be-classified
beetles will give their predicted species based on our model.
The predictions are shown below for the two new beetles: the first is classified as
Heikert and the second as Concinna.
Below is a visualization of the classification of the new flea beetles using the first two
discriminants.
JMP 11 Detail
Show Canonical Details (discriminant loadings)
Save Formulas results