Chapter 3
Clustering and
Classification
3.1 Background
Clustering, or unsupervised classification, involves detecting previously unknown groups in the data. We usually imagine neatly separated clusters of observations, but quite often clustering simply partitions, or carves up, the data space. Discriminant analysis, or supervised classification, requires knowledge of the groups, and seeks to find the view which shows the best separation of the groups.
Whether or not the class information is known, the visual tools for classification are largely the same. When class identity is known, colors and/or symbols can be used to code this information into the plots. It is also relatively easy to approach the analysis of data which contain a mix of known and unknown class identities. This is not typically true of numerical algorithms, where knowledge of group information completely changes the algorithm for finding the separations between groups.
The graphics that are useful for classification tasks include rudimentary density plots, scatterplots, matrices of pairwise scatterplots, and parallel coordinate plots. Motion graphics, such as tours, are used to animate these plots, to provide views of arbitrary combinations of variables. User interaction, such as brushing, is very useful, especially when multiple plots are visible and linked. There are several visual cues that are useful for detecting cluster structure: separation of points in particular views, similar movement patterns in motion graphics, and intersecting lines in parallel coordinate plots.
This chapter will describe the visual methods as we use them to explore data in the case studies. Various numerical methods, such as linear discriminant analysis, principal component analysis, neural networks, hierarchical cluster analysis and k-means clustering, will also be used, and we will discuss the use of graphics in conjunction with these methods.
3.2 Supervised Classification
This section includes two case studies, one using data on Australian crabs and the other using data on Italian olive oils. The exercises use data on flea beetles and Australian kangaroos.
3.2.1 Case Study: Australian Leptograpsus Crabs
Data Description
This data was described in Campbell & Mahon (1974). It contains measurements on rock crabs of the genus Leptograpsus. One species, L. variegatus, had been split into two new species, previously grouped by color form, orange and blue. Preserved specimens lose their color, so it was hoped that morphological differences would enable museum specimens to be classified.
There are 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia. Each specimen has measurements on:
Frontal Lip (FL), in mm
Rear Width (RW), in mm
Length of midline of the carapace (CL), in mm
Maximum width of carapace (CW), in mm
Body Depth (BD), in mm.
The primary question here is "How do we distinguish between species and sex based on these five morphological measurements?" There are also some interesting structural features to note about the data.
Density Plots
Two types of univariate plots are available in XGobi: textured dotplots (Tukey & Tukey 1990), and average shifted histograms (Scott 1992).
Select 1DPlot from the View menu. A scrollbar allows the smoothness of the
histogram to be changed. Two average shifted histograms of the variable body
depth are shown in Figure 3.1. The blue species are slightly smaller on average
as seen by the presence of blue points at the low values, and presence of more
orange points at the high values.
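As a rough illustration of what an average shifted histogram computes, the following Python sketch follows the general construction described by Scott (1992): count the data in fine bins of width h/m, then smooth the counts with triangular weights. It is not XGobi's implementation. The function name `ash_density`, its parameters, and the body depth values are all assumptions made for this example.

```python
import math

def ash_density(values, bin_width, m):
    """Average shifted histogram on a fine grid of bins of width bin_width / m.

    bin_width is the histogram bin width h and m is the number of shifted
    histograms being averaged. Returns (bin centers, density estimates).
    """
    n = len(values)
    delta = bin_width / m
    lo = min(values) - bin_width  # pad by h so the weights are never truncated
    n_bins = int(math.ceil((max(values) + bin_width - lo) / delta))
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / delta), n_bins - 1)
        counts[k] += 1
    centers, dens = [], []
    for k in range(n_bins):
        total = 0.0
        # Triangular weights 1 - |i|/m over the neighbouring fine bins.
        for i in range(1 - m, m):
            if 0 <= k + i < n_bins:
                total += (1 - abs(i) / m) * counts[k + i]
        centers.append(lo + (k + 0.5) * delta)
        dens.append(total / (n * bin_width))
    return centers, dens

# Invented body depth values in mm (not the crabs data).
body_depth = [6.0, 8.5, 10.0, 12.5, 14.0, 21.0]
centers, dens = ash_density(body_depth, bin_width=1.0, m=4)
```

In this sketch a larger `bin_width` gives a smoother estimate, which is presumably the kind of control the smoothness scrollbar provides.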
Scatterplot
Select XYPlot from the View menu. Variables can be changed by interacting with the variable circle control panel.
Scatterplot Matrix
A scatterplot matrix can be used to lay out the pairwise plots, to view them all
in one glance. To get a scatterplot matrix on startup, use the -scatmat option
on the command line. It is also possible to start up XGobi normally and then bring up a second XGobi with a scatterplot matrix by selecting the Scatterplot matrix item on the Options menu. This uses the subset of variables that are active in the tour as the subset displayed in the scatterplot matrix.
Figure 3.1: ASH plots of body depth (BD), two different smoothness parameters.
Based on the scatterplot matrix, these are the observations about the relationships between variables and groups:
There is strong correlation between all the variables.
The variation is smaller for the smaller measurements and gets larger as the values increase. Smaller crabs are harder to distinguish; separations increase as size gets larger.
There is a difference between males and females in the plots of CL vs RW (males have a higher CL:RW ratio), BD vs RW, and CW vs RW.
There is a difference between the two species in the plots of BD vs CW, and CW vs FL.
Parallel Coordinate Plots
Parallel coordinate plots display data using parallel axes, rather than orthogonal
axes. Each case is represented by a series of line segments connecting the observation's values on each variable. Good references for parallel coordinate plots are Inselberg (1985) and Wegman (1990). Figure 3.3 displays a parallel coordinate plot of the crabs
data. Very little can be seen from this plot. Perhaps using principal component
coordinates would reveal more.
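The construction of a parallel coordinate plot can be sketched in a few lines of Python. This is only an illustration of the idea: each variable is rescaled to [0, 1] on its own axis, and each case becomes one polyline through its scaled values. The function name `parallel_coordinates` and the three rows of measurements are invented for the example and are not taken from the crabs data.

```python
def parallel_coordinates(rows):
    """Return one polyline per case as a list of (axis index, scaled value)."""
    n_vars = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_vars)]
    highs = [max(r[j] for r in rows) for j in range(n_vars)]
    lines = []
    for r in rows:
        line = []
        for j in range(n_vars):
            span = highs[j] - lows[j]
            # A constant variable is drawn at the middle of its axis.
            scaled = (r[j] - lows[j]) / span if span > 0 else 0.5
            line.append((j, scaled))
        lines.append(line)
    return lines

# Invented values in the order FL, RW, CL, CW, BD.
rows = [
    [8.1, 6.7, 16.1, 19.0, 7.0],
    [12.0, 10.0, 25.0, 30.0, 11.0],
    [20.0, 15.0, 40.0, 46.0, 18.0],
]
lines = parallel_coordinates(rows)
```

When all variables are strongly correlated, as in the crabs data, the polylines run nearly parallel to one another, which is one way to see why little structure shows up in such a plot.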
Figure 3.2: Scatterplot matrix of the 5 variables (FL, RW, CL, CW, BD). Despite the strong correlation, various differences can be seen between species and sex.
Figure 3.3: Parallel coordinate plot of the 5 crab variables. Very little can be seen from this plot.
Tours
Introduction
While linked brushing provides information on conditional distributions,
tours provide information on joint distributions. They are particularly useful for detecting clusters, outliers, distributional shape, including covariance,
and some non-linear structure.
Grand Tour
A grand tour is a continuous 1-parameter family of $d$-dimensional projections of $p$-dimensional data which is dense in the set of all $d$-dimensional projections in $\mathbb{R}^p$. The parameter is usually thought of as time.
This means that each projection shown can be indexed by a time parameter. As time is allowed to wander off to $\infty$, the grand tour passes arbitrarily close to every possible $d$-dimensional projection of the data, which is the meaning of "dense in the set of all projections".
A grand tour offers a multitude of aspects simultaneously in relationship to one another. If the data is intrinsically 0-, 1-, or 2-dimensional (that is, clusters, curves or surfaces) the human eye can pick up the "gestalt" almost instantly. (We are adept at detecting and recognizing moving objects.)
Three-dimensional rotation can be considered a special case of the tour, where the dimension of the data is $p = 3$.
The grand tour provides the overview of the structure of the data.
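To make the notion of a $d$-dimensional projection of $p$-dimensional data concrete, here is a minimal Python sketch that draws one random orthonormal frame and projects the data onto it. It shows a single view only. A real grand tour interpolates smoothly between many such frames, which is not attempted here. The names `random_frame` and `project` are assumptions made for this illustration.

```python
import math
import random

def random_frame(p, d, rng):
    """Draw d orthonormal directions in p dimensions (Gram-Schmidt)."""
    if not 0 < d <= p:
        raise ValueError("need 0 < d <= p")
    frame = []
    while len(frame) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Remove the components along the directions already chosen.
        for u in frame:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:
            frame.append([a / norm for a in v])
    return frame

def project(rows, frame):
    """Project each p-dimensional row onto the d directions of the frame."""
    return [[sum(x * a for x, a in zip(row, axis)) for axis in frame]
            for row in rows]

rng = random.Random(0)
frame = random_frame(p=5, d=2, rng=rng)          # one 2D view of 5D data
view = project([[1.0, 2.0, 3.0, 4.0, 5.0]], frame)
```

With p = 3 and d = 2 the same code gives one frame of an ordinary three-dimensional rotation.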
Guided Tour
To find more specific types of structure, intelligent search engines can be connected to the tour, which can automatically provide more informative views than the random ones provided by the grand tour.
The guided tour leads the user to rare views.
Manual Tour
Prior knowledge can be incorporated with manually controlled tours. The user can increase or decrease the contribution of a particular variable to a view to examine how that variable contributes to any structure. In addition, manual tools allow us to assess the sensitivity of the structure to a particular variable, or to sharpen or refine structure exposed with the grand or guided tour.
The manual tour refines the views.
Good references for tours are (Asimov 1985, Buja & Asimov 1986, Cook,
Buja & Cabrera 1993, Cook, Buja, Cabrera & Hurley 1995, Buja, Cook, Asimov
& Hurley 1997, Cook & Buja 1997).
Back to this data
Watch the crabs data in a grand tour of all 5 variables. The shape is rather like a 1D pencil or stick rotating in the 5D space. This is due to the strong correlation between all variables. Not much structure can be seen on this raw data scale.
For this data it is more useful to examine the principal component basis, that is, we sphered the data before viewing it in the tour. This is a useful trick for graphics: it removes the distraction due to correlation structure. It is important to note that all principal components were used; no dimension reduction was conducted. This matters because eliminating the lowest principal components often removes the cluster structure as well. We then used projection pursuit with the Holes index to obtain the views in Figure 3.4. In another approach we first standardized the variables before sphering and projection pursuit (Figure 3.5).
Figure 3.4: Projection pursuit solutions found with the Holes index (sphered data).
Figure 3.5: (Left) Projection pursuit solutions found with the Holes index (standardized, sphered data). (Right) Linear discriminant analysis solution.
Figure 3.6: View of the principal components of four variables (FL, CL, CW, BD) which separates species cleanly, and the consequent S plot delineating the discriminant boundary.
To obtain the discriminant boundary (Figure 3.6), we first standardized the four variables frontal lip, carapace length, carapace width and body depth; call these s(FL), s(CL), s(CW), s(BD). Then we computed the principal components:
$$
\begin{aligned}
\mathrm{PC}_1 &= \bigl(0.499\,s(\mathrm{FL}) + 0.502\,s(\mathrm{CL}) + 0.499\,s(\mathrm{CW}) + 0.500\,s(\mathrm{BD})\bigr)/\sqrt{3.939} \\
\mathrm{PC}_2 &= \bigl(0.532\,s(\mathrm{FL}) - 0.315\,s(\mathrm{CL}) - 0.653\,s(\mathrm{CW}) + 0.437\,s(\mathrm{BD})\bigr)/\sqrt{0.047} \\
\mathrm{PC}_3 &= \bigl(-0.684\,s(\mathrm{FL}) + 0.085\,s(\mathrm{CL}) - 0.119\,s(\mathrm{CW}) + 0.715\,s(\mathrm{BD})\bigr)/\sqrt{0.012} \\
\mathrm{PC}_4 &= \bigl(0.031\,s(\mathrm{FL}) - 0.801\,s(\mathrm{CL}) + 0.557\,s(\mathrm{CW}) + 0.218\,s(\mathrm{BD})\bigr)/\sqrt{0.002}
\end{aligned}
$$
Then the discriminant rule is given by:
If 0.094 PC1 + 0.201 PC2 - 0.030 PC4 < -0.05 then the species is blue,
else the species is orange.
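The species rule can be sketched as a small Python function. This is an illustration, not the original S code. The loadings and eigenvalues are read from the principal component equations in the text, assuming the variable order FL, CL, CW, BD within each equation; the function names and the input values in the usage lines are invented.

```python
import math

# (loadings for s(FL), s(CL), s(CW), s(BD)), eigenvalue; as read from the text.
LOADINGS = {
    "PC1": ((0.499, 0.502, 0.499, 0.500), 3.939),
    "PC2": ((0.532, -0.315, -0.653, 0.437), 0.047),
    "PC4": ((0.031, -0.801, 0.557, 0.218), 0.002),
}

def principal_components(s_fl, s_cl, s_cw, s_bd):
    """Scaled principal components of one crab's standardized measurements."""
    s = (s_fl, s_cl, s_cw, s_bd)
    return {
        name: sum(w * x for w, x in zip(weights, s)) / math.sqrt(eigenvalue)
        for name, (weights, eigenvalue) in LOADINGS.items()
    }

def classify_species(s_fl, s_cl, s_cw, s_bd):
    """Apply the discriminant rule: blue if the score is below -0.05."""
    pc = principal_components(s_fl, s_cl, s_cw, s_bd)
    score = 0.094 * pc["PC1"] + 0.201 * pc["PC2"] - 0.030 * pc["PC4"]
    return "blue" if score < -0.05 else "orange"

# Invented standardized measurements, for illustration only.
print(classify_species(0.0, 0.0, 0.0, 0.0))
print(classify_species(-1.0, -1.0, -1.0, -1.0))
```

PC3 does not appear in the rule, so it is left out of the table of loadings.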
To separate the sexes we worked stepwise, hierarchically between the species.
Figure 3.7 illustrates two projections which do well at separating sexes within
each species. Here we relied heavily on rear width in relation to the other
variables to obtain separations. Standardized variables were also used.
We explored restructuring the variables as ratios to rear width as an alternative way of obtaining good separations. It is a reasonable approach but doesn't produce outstanding results.
Based on touring, here are the observations that we can make:
There are strong separations between species and sex.
The axes suggest that body depth, frontal lip, carapace length and carapace width contribute to species separation.
Rear width contributes most to the separation of sexes.
Numerical Methods and Graphics
From studying the data with graphics we can come to a fairly neat understanding of this data. This can help numerical analysis in two ways: interpreting results, and determining how effective particular approaches might be. From the graphics we would expect to be able to get a numerical solution which separates species perfectly, and to be able to separate sexes with a high degree of accuracy.
The linear discriminant analysis solution is very similar to the solution found with the Holes index (see Figure 3.5).
We would expect CART to perform poorly on the raw data, but working with principal components might provide better solutions.
Figure 3.7: Projection pursuit solutions, separately for species, found with the Holes index (standardized, sphered data, axes shown in terms of standardized variables).
Neural networks are likely to be able to classify the species and sex perfectly (this is possible with Ripley's S code for a feed-forward network), but the reliability of the classification for the sexes of small crabs is suspect, because for smaller crabs the sexes are physiologically less distinguishable.
3.2.2 Exercises
1. Generate a scatterplot matrix of the flea beetle data. Which variables would contribute to separating the 3 species?
2. Generate a parallel coordinate plot of the flea beetle data. Characterize the 3 species by the pattern of their traces.
3. Watch the flea beetle data in a grand tour. Stop the tour when you see a separation and describe the variables that contribute to the separation.
4. Using the projection pursuit guided tour, with the Holes index, find a projection which neatly separates all 3 species. Put the axes onto the plot and explain the variables that are contributing to the separation. Using univariate plots, confirm that these variables are important to separate species.
3.2.3 Case Study: Italian Olive Oils
Data Description
This data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. (An analysis of this data is given in Forina, Armanino, Lanteri & Tiscornia (1983).) There are 9 collection areas: 4 from southern Italy (North and South Apulia, Calabria, Sicily), 2 from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria).
The data available are:
Variable          Description
Region            South, North or Sardinia
Area              Sub-regions within the larger regions (North and South Apulia, Calabria, Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria)
Palmitic Acid     Percentage × 100 in sample
Palmitoleic Acid  Percentage × 100 in sample
Stearic Acid      Percentage × 100 in sample
Oleic Acid        Percentage × 100 in sample
Linoleic Acid     Percentage × 100 in sample
Linolenic Acid    Percentage × 100 in sample
Arachidic Acid    Percentage × 100 in sample
Eicosenoic Acid   Percentage × 100 in sample
The primary question is "How do we distinguish the oils from different regions and areas in Italy based on their combinations of the fatty acids?"
Working through the data
1. Check that the colors and glyphs code region and area, by looking at the scatterplot of Region vs Area. Note that the ".colors" and ".glyphs" were originally generated by brushing in the Region vs Area scatterplot and saving the results.
2. The oils from southern Italy can be distinguished from those of all other regions by the presence or absence of one fatty acid. Which fatty acid is it? (Using univariate plots, change between variables until one variable shows a separation.)
3. Because it is so easy to distinguish the southern oils, remove them for now. Erase these points (go into brush and select erase, and move the brush over the red points, which is easier if you have the dotplot of region showing, then pull down the erase menu and select "Rescale ignoring erased points").
4. Now work on separating the northern oils from the Sardinian oils. Work back through the dotplots to find the variables which do the best job of separating the two groups. Then look at pairwise plots, and find which pair of variables does well at separating the groups. (It is not necessary to look at eicosenoic.)
5. I think you will do best with 3 variables, so using 3-D rotation look for three variables which best separate the groups. Pick the three that you see as doing the best job, and find the projection which best separates the groups. Print this out, so that you can construct a classification rule.
At the end of this exercise you should be able to write down a decision rule to classify oil samples into regions of Italy. If you get a new sample with the same measured variables, you can follow the decision rule to classify the oil as coming from the North, South or Sardinia.
Here is my solution, at this point, for separating the Northern Italian oils from those of Sardinia. First, I looked at the three variables Oleic, Linoleic and Linolenic Acid, and found a projection of the three variables where we could discriminate between the two regions with a vertical line (Figure 3.8). I printed out the coefficients using the "I/O" button:
Var        Horizontal  Vertical
oleic      -0.000077   -0.000318
linoleic   -0.000968   -0.000265
linolenic  -0.005010    0.011945
and used these to generate the plot again in S using the following code:
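# Use a square plotting region
par(pty="s")
# Drop region 1 (South) and keep columns 6:8, the three variables in the table above
keep <- d.olive[,1] != 1
x <- d.olive[keep, 6:8]
reg <- d.olive[keep, 1]
# Project onto the horizontal and vertical coefficients printed by XGobi
ds1 <- x[,1]*(-0.000077) + x[,2]*(-0.000968) + x[,3]*(-0.005010)
ds2 <- x[,1]*(-0.000318) + x[,2]*(-0.000265) + x[,3]*0.011945
plot(ds1, ds2, xlab="Discrim 1", ylab="Discrim 2", pch=".")
# Overplot region 2 (Sardinia) and region 3 (North) with different symbols
points(ds1[reg==2], ds2[reg==2], pch=0)
points(ds1[reg==3], ds2[reg==3], pch=1)
# Vertical discrimination boundary
abline(v=-1.68)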
This gives me a discrimination rule as follows:
If oleic(-0.000077) + linoleic(-0.000968) + linolenic(-0.005010) > -1.68 then the oil comes from region 3 (Northern Italy),
else the oil is from region 2 (Sardinia).
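For readers who want to apply the rule to a new sample, here is a hedged Python sketch. The coefficients and the boundary at -1.68 are copied from the S code above (the `ds1` line and the `abline` call); the function names are assumptions, and the two samples are invented values on the percentage × 100 scale rather than observations from the data.

```python
# Coefficients for (oleic, linoleic, linolenic), as in the S code above.
HORIZONTAL = (-0.000077, -0.000968, -0.005010)
VERTICAL = (-0.000318, -0.000265, 0.011945)
BOUNDARY = -1.68  # position of the vertical line drawn with abline()

def discrim(oleic, linoleic, linolenic):
    """Return the plotting coordinates (Discrim 1, Discrim 2) of one sample."""
    x = (oleic, linoleic, linolenic)
    ds1 = sum(c * v for c, v in zip(HORIZONTAL, x))
    ds2 = sum(c * v for c, v in zip(VERTICAL, x))
    return ds1, ds2

def classify_north_or_sardinia(oleic, linoleic, linolenic):
    """Only meaningful once the southern oils have been set aside."""
    ds1, _ = discrim(oleic, linoleic, linolenic)
    return "Northern Italy" if ds1 > BOUNDARY else "Sardinia"

# Invented samples, for illustration only.
print(classify_north_or_sardinia(7800, 800, 30))
print(classify_north_or_sardinia(7300, 1200, 30))
```

Only the horizontal coordinate enters the rule, because the boundary is a vertical line; the vertical coordinate is needed only to reproduce the plot.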
The next step is to find classification rules for Areas within the Regions. Before we do this we will talk more about rotating data plots. An alternative to 3D rotation is to use a grand tour. With 3 variables it is simply rotating the data in 3-D space, but it works for any number of variables, and can rotate data containing 4, 5 or more variables.
Figure 3.8: Olive oil data: projection showing separation of Sardinian oils from those of Northern Italy, both from XGobi and the subsequent S plot (Discrim 1 vs Discrim 2) with the discrimination boundary drawn.
1. Watch the olive oil data in a grand tour (click on the variable circles of all
fatty acids except eicosenoic, and also those of region and area so these
are removed from the plot).
2. Keep the axes on the plot and watch how the variables fade in and out
of view. These specify how a variable contributes or is represented in a
single plot. These axes are replicated in the variable circles, one axis per
circle, which alleviates the mess on screen.
3. Change the speed of the tour by dragging on the top scrollbar.
4. Turn off the axes and watch the shapes that the data forms. See how the points separate into clusters in some projections.
5. Stop the tour at a view where the green group (Sardinia) is separated
from the purple group (North Italy). Turn on the axes again, and look for
the variable(s) which contribute most in the direction of the separation.
Compare the view given in a pairwise plot of these variables and see if the
separation can be seen similarly here.
6. If you have difficulty stopping at a projection which separates Sardinian oils from those of Northern Italy, use projection pursuit (click on projection pursuit and then the optimize button; you may have to turn optimize on and off several times to get to the same maximum as shown in Figure 3.9). (This is called a guided tour, because the code uses a numerical function to find projections that are "interesting".) Turning the axes on you will find that two acids, oleic and linoleic, are the main ones contributing to the difference (Figure 3.9). Look at the XYPlot of these two acids. Remember these were the two main variables that allowed us to separate the regions.
Figure 3.9: Olive oil data: projection showing separation of Sardinian oils from those of Northern Italy shows that the difference is in two fatty acids, oleic and linoleic.
At the end of this exercise you should have learned how the tour (both the grand and guided) works, and how it can help in understanding cluster structure in high-dimensional data. It can also help extract the important variables more automatically.
Scale Affects Structure Detection
Different scales can affect interpretation dramatically. This next exercise illustrates how looking at the data on different scales can help find different features.
1. Standardize each variable to have mean zero and variance 1, using the
transformation controls.
2. Then repeat the last exercise. Tour on all the standardized variables except eicosenoic acid. (Also only use the northern and Sardinian samples.) Turn on the guided tour: "ProjPrst", select the "Holes Index", and click on "Optimz".
3. What you'll notice is that it always gets back to the same projection, but this one is not as good as when the data was not standardized: there are a handful of (green) Sardinian samples which get confused with the northern Italian oils. What is interesting, though, is that just before it stabilizes at this not-so-interesting projection it passes by one that does have a good separation of the two regions. To get back to this, "Pause" the tour, drag the speed scrollbar close to the left hand side to make it go slow, click off "Optimz", and click on "Backtrack". Be ready to click on "Pause" again when it reaches a nice view. Now click "Pause" and watch the tour go back to the interesting view.
4. When it reaches the view where the groups are separated and you have paused the tour there, take a look at the variables that are contributing to the separation. They are oleic and linoleic as before, plus either arachidic or linolenic. (You will each get slight differences here: sometimes it is linolenic that contributes, and sometimes it is arachidic. Figure 3.10 shows my run when arachidic contributes to the separation.) The latter two variables you found earlier as being important for getting a good separation of the samples of the two regions.
Figure 3.10: Olive oil data: Interesting projection when using standardized variables. The variables contributing most to this separation are oleic, linoleic and arachidic.
Figure 3.11: Olive oil data: projection showing separation of Sardinian oils, both from XGobi (linoleic vs oleic) and the subsequent S plot (PC 1 vs PC 2) with the discrimination boundary drawn.
Figure 3.12: Olive oil data: projection showing separation of Northern oils, both from XGobi and the subsequent S plot with the discrimination boundaries drawn. [Panels: Discrim 1 vs Discrim 2 with marked points (3.31, 3.42), (3.65, 2.82) and (4.0, 3.04); Discriminant for Umbria and Discriminant for West Liguria, each plotted against Random Uniform.]
It is important to look at data on different scales: raw coordinates, standardized coordinates or principal component coordinates. From earlier classes you should have learned that it is important to use transformations, such as logs, to normalize or spread out the data values. This is also important in viewing high-dimensional graphics. Although there was no need to use logs in this data, it is a fairly common necessity.
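The two rescalings mentioned here, standardizing to mean zero and variance 1 and taking logs, can be sketched in Python as follows. This is an illustration only: XGobi's transformation controls do this interactively, and whether they use the n - 1 or the n denominator for the variance is not stated in the text, so the n - 1 choice below is an assumption, as are the function names and the example values.

```python
import math

def standardize(values):
    """Rescale to mean 0 and variance 1 (variance with n - 1 denominator)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    sd = math.sqrt(var)
    return [(v - mean) / sd for v in values]

def log_transform(values):
    """Natural log of each value; values must be strictly positive."""
    return [math.log(v) for v in values]

# Invented oleic values on the percentage x 100 scale.
z = standardize([7000.0, 7500.0, 8000.0, 8500.0])
```

After standardizing, every variable contributes on the same scale, which is why the guided tour can settle on a different projection than it does for the raw data.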
Pursuing the Olive Oil data in more depth
In this next exercise we drill down further into the data to explore the subclusters corresponding to areas within the regions. Right now we have the following classification rule:
1. If eicosenoic acid > 4 then the oil comes from region 1 (Southern Italy),
2. If oleic(-0.000077) + linoleic(-0.000968) + linolenic(-0.005010) > -1.68 then the oil comes from region 3 (Northern Italy),
3. Else the oil is from region 2 (Sardinia).
1. Erase-brush regions 1 and 3 out of the data, and select "Rescale ignoring erased points" from the "Erase" menu.
2. Then look at dotplots of variables to find which variables help to discriminate Inland from Coastal Sardinian oils. Both oleic and linoleic seem to be very good.
3. Plot oleic vs linoleic. It is clear that these two variables alone are sufficient to separate the two areas (see Figure 3.11).
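The actual discriminant rule would be found by taking the first principal component of the two variables. This is the S code I used:
# Sphere the data: rotate onto the eigenvectors of the variance matrix and
# scale each component by 1/sqrt(eigenvalue). The data are not centered.
f.sph.data <- function(data)
{
  data <- as.matrix(data)
  eig <- eigen(var(data))
  evc <- eig$vectors
  evl <- eig$values
  sphd.data <- data %*% evc %*% diag(1/sqrt(evl))
  return(sphd.data)
}
# Region 2 (Sardinia), columns 6:7 (oleic, linoleic)
x <- f.sph.data(d.olive[d.olive[,1]==2, 6:7])
are <- d.olive[d.olive[,1]==2, 2]
plot(x[,1], x[,2], pch=".", xlab="PC 1", ylab="PC 2")
# Areas 5 and 6 share a symbol but differ in size
points(x[are==5,1], x[are==5,2], pch=0, cex=1)
points(x[are==6,1], x[are==6,2], pch=0, cex=2)
abline(v=-28.85)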
and the first principal component is given by
$$\mathrm{PC}_1 = \bigl(\mathrm{oleic} \times 0.8024368 + \mathrm{linoleic} \times (-0.5967371)\bigr)/\sqrt{30807.5}$$
giving the discriminant rule to be
if $\bigl(\mathrm{oleic} \times 0.8024368 + \mathrm{linoleic} \times (-0.5967371)\bigr)/\sqrt{30807.5} > 28.85$ then the oil comes from Coastal Sardinia, else from Inland Sardinia.
Separating the different areas in northern Italy is a messy proposition. It is possible, with only one misclassification, using all the variables except eicosenoic acid.
Without going into laborious detail, the discrimination projection is given by:
$$
\begin{bmatrix}
0.000691 & 0.000705 \\
0.000843 & 0.000402 \\
0.001582 & 0.000533 \\
0.000220 & 0.000304 \\
0.001247 & -0.000772 \\
-0.002989 & -0.000216 \\
-0.002220 & 0.006294 \\
0.000000 & 0.000000
\end{bmatrix}
$$
Project the data (the matrix of values of the 8 fatty acids) onto this matrix to get the 2D view shown in Figure 3.12. Call the result $x = (x_1, x_2)$. Then the discriminant boundaries for the 3 groups are as follows:
1. If $x_1 + 0.568\,x_2 > 5.25$ then the oil comes from Umbria,
2. If $x_1 - 1.761\,x_2 < -1.31$ then the oil comes from West Liguria,
3. Else the oil comes from East Liguria.
Figure 3.13: Olive oil data: projections showing combinations of variables where separation of oils from different areas of Southern Italy is visible, from XGobi.
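The projection and the three-way rule for the northern areas can be sketched in Python as below. This is an illustration rather than the code used for the figures. It assumes that the rows of the projection matrix follow the order in which the fatty acids are listed in the data description (palmitic first, eicosenoic last), which the text does not state explicitly; the function names and the sample are invented.

```python
# Rows assumed to be in the order: palmitic, palmitoleic, stearic, oleic,
# linoleic, linolenic, arachidic, eicosenoic (an assumption, see above).
PROJECTION = [
    (0.000691, 0.000705),
    (0.000843, 0.000402),
    (0.001582, 0.000533),
    (0.000220, 0.000304),
    (0.001247, -0.000772),
    (-0.002989, -0.000216),
    (-0.002220, 0.006294),
    (0.000000, 0.000000),
]

def project_sample(acids):
    """Map the 8 fatty acid values of one oil to the 2D view (x1, x2)."""
    x1 = sum(v * row[0] for v, row in zip(acids, PROJECTION))
    x2 = sum(v * row[1] for v, row in zip(acids, PROJECTION))
    return x1, x2

def classify_from_projection(x1, x2):
    """Apply the two linear boundaries in the order given in the text."""
    if x1 + 0.568 * x2 > 5.25:
        return "Umbria"
    if x1 - 1.761 * x2 < -1.31:
        return "West Liguria"
    return "East Liguria"

def classify_northern_area(acids):
    """Only meaningful for oils already assigned to Northern Italy."""
    return classify_from_projection(*project_sample(acids))

# Invented sample on the percentage x 100 scale, for illustration only.
sample = [1100, 100, 230, 7800, 800, 30, 40, 2]
print(classify_northern_area(sample))
```

The order of the tests matters: an oil is checked against the Umbria boundary first, and only the remaining oils are split between West and East Liguria.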
To separate the southern Italian oils, we look at the dotplots of just this region. It appears that palmitoleic and linoleic, and stearic and oleic, might be useful variables for separating the areas North and South Apulia from the other two areas. Eicosenoic, linolenic and arachidic would be needed to separate Calabria from Sicily, although it is clear that it is not easy to distinguish this group cleanly. Figure 3.13 shows some plots to support these statements.
3.2.4 Exercises
1. For the kangaroos data, build a discriminant rule for each species and sex,
using graphical methods.
2. There are three historical skulls (Leiden, Paris and London) which defy classification. Use graphical methods to guess which species and sex each belongs to.
3.3 Unsupervised Classification
This section explains how to approach classification when the class identities are unknown, using visual methods. We describe the brush-and-spin approach to interactive cluster analysis. The approach works reasonably well when there are good separations between groups, even when there are marked differences in variance structure between groups, or non-linear boundaries. It doesn't seem to work very well when there are classes which overlap in space, or when there are no distinct classes and we simply wish to partition the data. In these situations it is better to begin with a numerical solution, and attempt to refine it with visual tools. We will use the Italian olive oil data with the class information removed to demonstrate the brush-and-spin approach, and the refining of numerical solutions.
3.3.1 Case Study: Italian Olive Oils
Density Plots
Work through each variable plot, and brush groups where there are clear separations, for example, eicosenoic acid (Figure 3.14). This is a much harder task than discriminant analysis! It relies much more heavily on careful assessment of each split, and on the flexibility of undoing previous splits. Heavy use of jittering can help in these density plots, especially when there is discreteness in the data, or in some subsets of values.
Figure 3.14: ASH plot of eicosenoic acid revealing two clusters. Jittering is used because there is overplotting of points with values close to zero.
Focus on one group, by erasing all points painted differently, and continue through the next sections. Sequentially continue to focus on one group at a time.
Scatterplots
Work through pairwise plots and brush the points that are clearly separated (Figures 3.15, 3.16).
Figure 3.15: Outliers can be distracting in that they take away from the plot space for detecting clustering. (Axes: palmitic, palmitoleic.)
Tours
While focusing on one group, watch the data rotate in a grand tour, and stop when a separation is seen. Use the projection-pursuit guided tour with the Holes index (or the Central Mass or Skewness indices if there appear to be unequal numbers of points in groups) to find further separations.
Also put the other groups back into the plot, and examine how the clusters fall out in the tour. Refine the boundaries by re-brushing points that appear to be incorrectly classified.
Examining Numerical Solutions
When using hierarchical cluster algorithms it is common to use a dendrogram to examine the results. This can be computed in S, and the results passed to XGobi to study interactively. Figure 3.17 displays a dendrogram (computed using average linkage), with one outlier (East Liguria) highlighted, followed by a focus on one part of the tree, which is painted orange. The brushed points form a strongly cohesive group in several variables, but they don't correspond to one unique region of Italy. Rather they form a very heterogeneous group containing samples from areas in southern Italy, as well as all the northern Italian areas. The simple hierarchical cluster methods work very poorly with this data, as does k-means clustering. Model-based clustering, where unequal variances are assumed, does much better.
Figure 3.16: Scatterplots reveal more clusters. (Panels: linoleic vs oleic; linoleic vs linolenic.)
Figure 3.17: Exploring hierarchical clustering results using a dendrogram linked to scatterplots.