3. Clustering and Classification: Visualizing Cluster Structure

3. Clustering and Classification
Unsupervised classification - clustering
Supervised classification - discriminant analysis
3-1
Visualizing Cluster Structure
Cluster structure can be detected using a grand tour, watching
for two visual cues:
Separation of points in particular views
Different motion paths
In parallel coordinate plots, these two visual cues correspond
to:
Separation of lines
Crossing of lines
3-2
Discriminant Analysis
Add color/glyph information to the plot according to group
information.
3-3
Case Study: Australian Leptograpsus Crabs
There are 50 specimens of each sex of each of two species, collected on site at Fremantle, Western Australia (Campbell
and Mahon, 74). Each specimen has measurements on
Frontal Lip (FL), in mm
Rear Width (RW), in mm
Length of midline of the carapace (CL), in mm
Maximum width of carapace (CW), in mm
Body Depth (BD), in mm.
Preserved specimens lose their color, so it was hoped that
morphological differences would enable museum specimens
to be classified.
3-4
Crabs: Summary Statistics
Variable   Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
FL          7.2   12.9      15.55    15.58   18.05     23.1
RW          6.5   11        12.8     12.74   14.3      20.2
CL         14.7   27.28     32.1     32.11   37.23     47.6
CW         17.1   31.5      36.8     36.41   42        54.6
BD          6.1   11.4      13.9     14.03   16.6      21.6

By group (the figures in parentheses are presumably standard deviations):

Variable   Blue Male      Blue Female    Orange Male    Orange Female
FL         14.84 (3.20)   13.27 (2.63)   16.63 (3.51)   17.59 (2.97)
RW         11.72 (2.11)   12.14 (2.44)   12.26 (2.20)   14.84 (2.35)
CL         32.01 (7.31)   28.1 (5.92)    33.69 (7.61)   34.62 (5.84)
CW         36.81 (8.35)   32.62 (6.80)   37.19 (8.39)   39.04 (6.54)
BD         13.35 (3.20)   11.82 (2.75)   15.32 (3.53)   15.63 (2.75)
3-5
Crabs: Scatterplot Matrix
xgobi -scatmat crabs
It is also possible to start up XGobi without the -scatmat option, then select the Scatterplot matrix item on the Options
menu. This uses the variables that are active in the tour as
the subset displayed in the scatterplot matrix.
3-6
Crabs: Scatterplot Matrix
[Figure: scatterplot matrix of FL, RW, CL, CW and BD.]
3-7
Crabs: Scatterplot Matrix Observations
Strong correlation between variables.
Smaller crabs harder to distinguish, separations increase as
size gets larger.
Males separated from females in plots of CL vs RW (males
have a higher CL:RW ratio), and BD vs RW, and CW vs
RW.
The two species can be separated in the plots of BD vs CW,
and CW vs FL.
It looks like it is almost possible to separate all four groups
using just RW and FL.
3-8
Aside: Projection Pursuit and Holes Index
Projection pursuit is the search for interesting projections of
high-d data by optimizing an index function, e.g.
\[ \max_{\alpha \in S^{p-1}} \mathrm{Var}(X\alpha) \]
gives the first principal component.
In general,
\[ \max_{A \in G_{p,d}} f(XA) \]
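The variance example can be tried out numerically. The sketch below is only an illustration under stated assumptions: it works in 2-d, uses a brute-force grid search over directions rather than a real optimizer, and the data and the names `variance`, `project` and `pursue` are made up for the example.

```python
import math
import random

def variance(values):
    """Sample variance (divisor n - 1) of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def project(points, angle):
    """Project 2-d points onto the unit vector (cos angle, sin angle)."""
    c, s = math.cos(angle), math.sin(angle)
    return [x * c + y * s for x, y in points]

def pursue(points, index=variance, steps=1800):
    """Grid search over directions in the half circle for the largest index value."""
    best_angle = max((math.pi * k / steps for k in range(steps)),
                     key=lambda a: index(project(points, a)))
    return best_angle, index(project(points, best_angle))

# Hypothetical data: points scattered along the line y = x, so the
# direction of greatest variance should be close to 45 degrees.
rng = random.Random(0)
points = []
for _ in range(500):
    t = rng.gauss(0, 3)
    points.append((t + rng.gauss(0, 0.3), t + rng.gauss(0, 0.3)))

angle, value = pursue(points)
print(round(math.degrees(angle), 1), round(value, 2))
```

Passing a different function as `index` turns the same search into pursuit of a different kind of structure.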
3-9
Aside: Projection Pursuit and Holes Index
\[ \hat{I}_{\mathrm{Holes}} = -(2\pi)^{-d/2} \frac{1}{n} \sum_{i=1}^{n} \exp\Big(-\frac{1}{2} \sum_{j=1}^{d} Y_{ij}^2\Big) + (2\sqrt{\pi})^{-d} \]
where Y = XA. The Holes index finds projections where there is
not much data in the center, i.e. holes.
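A minimal sketch of evaluating the Holes index for one projection follows. It assumes the projected data Y has already been centered and scaled, which the slide does not state, and the two small data sets are hypothetical.

```python
import math

def holes_index(Y):
    """Holes index of projected data Y, given as a list of n rows of length d."""
    n, d = len(Y), len(Y[0])
    # Average of exp(-|y|^2 / 2) over the projected points.
    kernel = sum(math.exp(-0.5 * sum(y * y for y in row)) for row in Y) / n
    return -(2 * math.pi) ** (-d / 2) * kernel + (2 * math.sqrt(math.pi)) ** (-d)

# Hypothetical 2-d projections: one piled up at the center, one forming a ring.
centered = [(0.0, 0.0)] * 8
ring = [(3 * math.cos(k * math.pi / 4), 3 * math.sin(k * math.pi / 4))
        for k in range(8)]

print(holes_index(centered))  # smallest value the index can take for d = 2
print(holes_index(ring))      # larger, because the center is empty
```

A projection pursuit search would maximize `holes_index` over projections of the data.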
3 - 10
Crabs: Tour Plots - Raw data
PCA basis used, to alleviate distraction from correlation.
Projection pursuit with the Holes index to obtain views.
[Figure: two tour projections, each with axes BD, FL, RW, CL and CW.]
3 - 11
Crabs: Tour Plots - Standardized data
PCA basis, Holes index.
[Figure: tour projection with axes (RW-m)/s, (CW-m)/s, (CL-m)/s, (FL-m)/s and (BD-m)/s.]
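The axis labels (x-m)/s denote standardized variables. As a small sketch, assuming m is the variable's mean and s its sample standard deviation (the slides do not say which divisor is used), and with made-up measurements:

```python
import statistics

def standardize(column):
    """Return (x - m) / s for each x, with m the mean and s the sample std. dev."""
    m = statistics.mean(column)
    s = statistics.stdev(column)
    return [(x - m) / s for x in column]

# Hypothetical measurements in mm, for illustration only.
rw = [8.0, 10.5, 12.0, 13.5, 16.0]
z = standardize(rw)
print([round(v, 2) for v in z])
```

After this step every variable has mean 0 and standard deviation 1, so no single measurement dominates the projections because of its scale.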
3 - 12
Crabs: Tour Plots - Hierarchical by Species
Standardized data, PCA basis, Holes index.
[Figure: two tour projections; axes (RW-m)/s, (CW-m)/s, (CL-m)/s, (BD-m)/s and (FL-m)/s.]
3 - 13
Crabs: Tour Plot Observations
Strong separations between species and sex.
Axes suggest that body depth, frontal lip, carapace length
and width contribute to species separation.
Rear width contributes most to the separation of sexes.
3 - 14
Crabs: LDA Solution
[Figure: LDA solution, Discrim 2 (-6 to 4) against Discrim 1 (-10 to 5).]
3 - 15
Crabs: Comparison of Methods
LDA is about as good as you can do on this data.
CART does very poorly, due to the strong correlations.
Neural networks (feed-forward, Ripley's S code) can classify this data perfectly, but the results are unreliable and not replicable
for small crabs, where the boundary is less clear.
3 - 16
Crabs: Clustering - For Fun!
Hierarchical average linkage clustering on principal components.
[Figure: dendrogram (merge level against objects, 0 to 200) with a tour projection on BD, CW, CL, FL and RW.]
3 - 17
Crabs: Clustering - For Fun!
Hierarchical average linkage clustering groups points along
the covariance structure.
[Figure: dendrogram (merge level against objects, 0 to 200) with a tour projection on RW, CL, BD, FL and CW.]
3 - 18
Building the Dendrogram
Append two more variables to the data matrix, containing the fuse
heights and horizontal spread, and add "dummy" points at the
locations of fusing.
.lines       Contains two columns giving the observation numbers of the
             points between which lines are drawn (from, to).
.nlinkable   Ignore points after this observation number, so that dummy
             points are ignored during brushing.
S function available.
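The construction can be sketched in a few lines. This is not the S function mentioned above; it is an illustration under assumptions: the merge list format (pairs of cluster ids plus a fuse height, with the k-th merge creating cluster n + k), the name `dendrogram_points`, and the choice to draw a straight line from each child to its fuse point are all hypothetical, and the exact XGobi file layout is not reproduced.

```python
def dendrogram_points(n, merges):
    """Lay out a dendrogram as extra points plus line segments.

    n      -- number of observations (leaves), numbered 0 .. n-1
    merges -- complete list of (a, b, height); the k-th merge creates cluster n + k
    Returns (spread, height, lines, nlinkable); line endpoints are 1-based.
    """
    children = {n + k: (a, b) for k, (a, b, _) in enumerate(merges)}
    total = n + len(merges)
    spread = [0.0] * total
    height = [0.0] * n + [h for _, _, h in merges]

    # Order the leaves left to right by walking down from the root,
    # so that no branches cross.
    order, stack = [], [total - 1]
    while stack:
        node = stack.pop()
        if node < n:
            order.append(node)
        else:
            a, b = children[node]
            stack.extend([b, a])  # a is popped first, so it lies left of b
    for position, leaf in enumerate(order):
        spread[leaf] = float(position)

    # Each dummy (fuse) point sits midway between its two children. Children
    # are created before their parent, so one forward pass is enough.
    for node in range(n, total):
        a, b = children[node]
        spread[node] = (spread[a] + spread[b]) / 2

    # One line from each child to the dummy point where it fuses.
    lines = [(child + 1, node + 1)
             for node in range(n, total) for child in children[node]]
    return spread, height, lines, n

# Hypothetical merge history for four observations.
merges = [(0, 1, 1.0), (2, 3, 2.0), (4, 5, 3.5)]
spread, height, lines, nlinkable = dendrogram_points(4, merges)
print(spread)
print(height)
print(lines)
print(nlinkable)
```

Here `spread` and `height` play the role of the two appended variables, `lines` holds the from,to pairs, and `nlinkable` is the last real observation, so that points after it are the dummy points.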
3 - 19
Case Study: Breast Cancer
From Institut Curie, France, by way of Richard D. De Veaux,
Williams College.
Histologie      Benign (0) or malignant (1)
Typsein         Tissue: light (0), dense (1)
Cote            Left (0) or right (1) breast
Taille          Size in mm
Nombre          Number of microcalcifications
Foyer           Number of suspicious clusters
Forme           Shape of the microcalcifications
Polymorphisme   Are there many types of microcalcification in one cluster? Yes (1), no (0)
Contour         Shape of the cluster: circular (1), angular (2), other (3)
Retro           Is the cluster under the nipple? Yes (1), no (0)
Prof            Are the microcalcifications deep under the skin? Yes (1), no (0)
3 - 20
Breast Cancer: Mosaic Plots vs Jittering
[Figure: "Breast Cancer: Histologie by Foyer" as a mosaic plot (Benign/Malignant by foyer 1 and 2) and as a jittered plot of histologie (0.0 to 1.0) against foyer (1.0 to 2.0).]
3 - 21
Breast Cancer: Jittering with Continuous Variables
[Figure: panels of jittered histologie (0.0 to 1.0) against continuous variables; one recoverable axis is age (20 to 80).]
Jittering the binary response against a continuous explanatory variable.
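Jittering can be sketched as adding a small amount of random noise to the 0/1 response before plotting, so that overplotted points spread apart. The function name, the noise width of 0.1 and the records below are assumptions for illustration, not values from the breast cancer data.

```python
import random

def jitter(values, amount=0.1, seed=0):
    """Add uniform noise in [-amount, amount] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

# Hypothetical records: age in years and a 0/1 response (benign/malignant).
age = [34, 41, 41, 52, 52, 52, 60, 67]
histologie = [0, 0, 1, 0, 1, 1, 1, 1]

jittered = jitter(histologie)
for a, h, j in zip(age, histologie, jittered):
    print(a, h, round(j, 3))
```

Plotting `jittered` against `age` keeps the two response levels clearly apart while showing how many points sit at each age.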
3 - 22
Breast Cancer: Use of Color for the Response
[Figure: plot with axes age (20 to 80) and taille (0 to 100).]
3 - 23
Case Study: Italian Olive Oils
Fatty acid composition in olive oils from 9 sub-regions of Italy
(Forina et al, 83).
Region             South, North or Sardinia
Area               Sub-regions within the larger regions (North and South
                   Apulia, Calabria, Sicily, Inland and Coastal Sardinia,
                   Umbria, East and West Liguria)
Palmitic Acid      % in sample x 100
Palmitoleic Acid   % in sample x 100
Stearic Acid       % in sample x 100
Oleic Acid         % in sample x 100
Linoleic Acid      % in sample x 100
Eicosanoic Acid    % in sample x 100
Linolenic Acid     % in sample x 100
Eicosenoic Acid    % in sample x 100
3 - 24
Oils: LDA vs CART vs Manual Tour
[Figure: oils labelled by region (1, 2, 3), plotted as eicosenoic (10 to 50) against linoleic (600 to 1400), and in a manual tour projection on linoleic, arachidic, oleic and eicosenoic.]
3 - 25
Oils: LDA vs CART vs Manual Tour
CART and LDA are similar and both confuse regions 2 (Sardinia) and 3 (North). Manual tour improves the solution by
adding small amounts of oleic and arachidic to the projection.
3 - 26
Oils: Sardinia
[Figure: Sardinian oils, linoleic (1100 to 1500) against oleic (7000 to 7400), and PC 2 against PC 1.]
3 - 27
Oils: North
[Figure: northern oils, Discrim 2 (2.4 to 3.4) against Discrim 1 (3.4 to 4.2), with labelled points (3.31,3.42), (4.0,3.04) and (3.65,2.82); tour axes linolenic, palmitic, oleic, stearic, palmitoleic, eicosanoic and linoleic.]
3 - 28
Oils: South
[Figure: tour projections of the southern oils; axes include oleic, stearic, palmitoleic, linoleic, eicosanoic, eicosenoic and linolenic.]
3 - 29
Oils: Separating Sub-regions
Sub-regions in the north, and sub-regions in Sardinia, are easy
to separate, but sub-regions in the south are very difficult to
separate.
3 - 30
Oils: Assessing Neural Network Solutions
Add a variable containing the predictions to the data set. Code is
nnet in S (Ripley, 96).
Plotting the classifications against the variables shows the subset of variables
used by the net.
Brush points in the boundary/confusion region and observe these
points in the multivariate plots to see where the net draws its
boundaries.
3 - 31
Oils: Neural Networks
[Figure: NN class (1 to 4) against Region (1.0 to 3.0), with a tour projection on linoleic, arachidic and oleic.]
3 - 32
Case Study: Particle Physics Data
Unsupervised clustering reveals 7 clusters, each a low-dimensional
structure embedded in high-d space (Cook et al, 95).
[Figure: two tour projections with axes X1 to X7.]
3 - 33