3. Clustering and Classification

  - Unsupervised classification: clustering
  - Supervised classification: discriminant analysis

Visualizing Cluster Structure

Cluster structure can be detected using a grand tour, watching for two visual cues:
  - Separation of points in particular views
  - Different motion paths
In parallel coordinate plots, these two visual cues correspond to:
  - Separation of lines
  - Crossing of lines

Discriminant Analysis

Add color/glyph information to the plot according to group information.

Case Study: Australian Leptograpsus Crabs

There are 50 specimens of each sex of each of two species, collected on site at Fremantle, Western Australia (Campbell and Mahon, 74). Each specimen has measurements on:
  - Frontal Lip (FL), in mm
  - Rear Width (RW), in mm
  - Length of midline of the carapace (CL), in mm
  - Maximum width of carapace (CW), in mm
  - Body Depth (BD), in mm
Preserved specimens lose their color, so it was hoped that morphological differences would enable museum specimens to be classified.

Crabs: Summary Statistics

  Variable   Min.   1st Qu.   Median   Mean    3rd Qu.   Max.
  FL          7.2   12.9      15.55    15.58   18.05     23.1
  RW          6.5   11        12.8     12.74   14.3      20.2
  CL         14.7   27.28     32.1     32.11   37.23     47.6
  CW         17.1   31.5      36.8     36.41   42        54.6
  BD          6.1   11.4      13.9     14.03   16.6      21.6

Mean (SD) by species and sex:

  Variable   Blue Male      Blue Female    Orange Male    Orange Female
  FL         14.84 (3.20)   13.27 (2.63)   16.63 (3.51)   17.59 (2.97)
  RW         11.72 (2.11)   12.14 (2.44)   12.26 (2.20)   14.84 (2.35)
  CL         32.01 (7.31)   28.1 (5.92)    33.69 (7.61)   34.62 (5.84)
  CW         36.81 (8.35)   32.62 (6.80)   37.19 (8.39)   39.04 (6.54)
  BD         13.35 (3.20)   11.82 (2.75)   15.32 (3.53)   15.63 (2.75)

Crabs: Scatterplot Matrix

    xgobi -scatmat crabs

It is also possible to start up XGobi without the -scatmat option, then select the Scatterplot matrix item on the Options menu. This uses the variables that are active in the tour as the subset displayed in the scatterplot matrix.

Crabs: Scatterplot Matrix

[Figure: scatterplot matrix of FL, RW, CL, CW, BD]

Crabs: Scatterplot Matrix Observations

  - Strong correlation between variables.
  - Smaller crabs are harder to distinguish; separations increase as size gets larger.
  - Males are separated from females in plots of CL vs RW (males have a higher CL:RW ratio), BD vs RW, and CW vs RW.
  - The two species can be separated in the plots of BD vs CW, and CW vs FL.
  - It looks like it is almost possible to separate all four groups using just RW and FL.

Aside: Projection Pursuit and Holes Index

Projection pursuit is the search for interesting projections of high-d data via the optimization of an index function, e.g.

\[ \max_{\alpha \in S^{p-1}} \mathrm{Var}(X\alpha) \]

gives the first principal component. In general,

\[ \max_{A \in G_{p,d}} f(XA) \]

The Holes index is

\[ \hat{I}_{\mathrm{Holes}} = -(2\pi)^{-d/2} \, \frac{1}{n} \sum_{i=1}^{n} \exp\Big( -\frac{1}{2} \sum_{j=1}^{d} Y_{ij}^{2} \Big) + (2\sqrt{\pi})^{-d} \]

where $Y = X\alpha$ is the projected data. The Holes index finds projections where there is not much data in the center, i.e. holes.

Crabs: Tour Plots - Raw data

PCA basis used, to alleviate distraction from correlation. Projection pursuit with the Holes index to obtain views.

[Figure: two tour projections with axes BD, FL, RW, CL, CW]

Crabs: Tour Plots - Standardized data

PCA basis, Holes index.

[Figure: tour projection with axes (RW-m)/s, (CW-m)/s, (CL-m)/s, (FL-m)/s, (BD-m)/s]

Crabs: Tour Plots - Hierarchical by Species

Standardized data, PCA basis, Holes index.

[Figure: two tour projections with axes (RW-m)/s, (CW-m)/s, (CL-m)/s, (BD-m)/s, (FL-m)/s]

Crabs: Tour Plot Observations

  - Strong separations between species and sex.
  - Axes suggest that body depth, frontal lip, carapace length and width contribute to species separation.
  - Rear width contributes most to the separation of sexes.

Crabs: LDA Solution

[Figure: LDA solution, Discrim 2 against Discrim 1]

Crabs: Comparison of Methods

  - LDA is about as good as you can do on this data.
  - CART does very poorly, due to the strong correlations.
  - Neural networks (feed-forward, Ripley's S code) can perfectly classify this data, but results are unreliable/non-replicable for small crabs, where the boundary is less clear.

Crabs: Clustering - For Fun!
[Figure: two dendrograms, Merge Level against Objects (0 to 200), with variable axes BD, CW, CL, FL, RW]

  - Hierarchical average linkage clustering on principal components.
  - Hierarchical average linkage clustering groups points along the covariance structure.

Building the Dendrogram

Append the data matrix with two more variables, containing fuse heights and horizontal spread, and additional "dummy" points with the locations of fusing.

  .lines       Contains two columns representing the observation numbers of points between which lines are drawn (from, to).
  .nlinkable   Ignore points after this observation number, so that dummy points are ignored during brushing.

S function available.

Case Study: Breast Cancer

From Institut Curie, France, by way of Richard D. De Veaux, Williams College.

  Histologie      Benign (0) or malignant (1)
  Typsein         Tissue: light (0), dense (1)
  Cote            Left (0), right (1) breast
  Taille          Size in mm
  Nombre          Number of microcalcifications
  Foyer           Number of suspicious clusters
  Forme           Shape of the microcalcifications
  Polymorphisme   Are there many types of microcalcification in one cluster? Yes (1), no (0)
  Contour         Shape of the cluster: circular (1), angular (2), other (3)
  Retro           Is the cluster under the nipple? Yes (1), no (0)
  Prof            Are the microcalcifications deep under the skin? Yes (1), no (0)

Breast Cancer: Mosaic Plots vs Jittering

[Figure: mosaic plot "Breast Cancer: Histologie by Foyer" (benign vs malignant, foyer 1 and 2), beside a jittered plot of histologie against foyer]

Breast Cancer: Jittering with Continuous Variables

[Figure: jittered plots of histologie against age]

Jittering the binary response against a continuous explanatory variable.
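The jittering used in these plots can be sketched as follows. This is a minimal illustration only: the function name `jitter`, the noise width of 0.1, and the small 0/1 response vector are assumptions made for the example, not the Institut Curie data or the method used to produce the slides.

```python
import numpy as np

def jitter(values, amount=0.1, rng=None):
    """Return values with uniform noise in [-amount, amount) added to each entry."""
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-amount, amount, size=values.shape)

# Hypothetical binary response (0 = benign, 1 = malignant), for illustration only.
rng = np.random.default_rng(0)
response = np.array([0, 0, 1, 1, 0, 1, 1, 0])

# Jittered copy: overplotted 0s and 1s are spread into two narrow bands.
jittered = jitter(response, amount=0.1, rng=rng)
```

Keeping the noise width well below 0.5 means the two bands never overlap, so the original class can still be read off the plot.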
Breast Cancer: Use of Color for the Response

[Figure: taille against age, with color used for the response]

Case Study: Italian Olive Oils

Fatty acid composition in olive oils from 9 sub-regions of Italy (Forina et al, 83).

  Region             South, North or Sardinia
  Area               Sub-regions within the larger regions (North and South Apulia, Calabria, Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria)
  Palmitic Acid      % in sample × 100
  Palmitoleic Acid   % in sample × 100
  Stearic Acid       % in sample × 100
  Oleic Acid         % in sample × 100
  Linoleic Acid      % in sample × 100
  Eicosanoic Acid    % in sample × 100
  Linolenic Acid     % in sample × 100
  Eicosenoic Acid    % in sample × 100

Oils: LDA vs CART vs Manual Tour

[Figure: eicosenoic against linoleic with points labelled by region (1, 2, 3), and a manual tour projection with axes linoleic, arachidic, oleic, eicosenoic]

  - CART and LDA are similar, and both confuse regions 2 (Sardinia) and 3 (North).
  - The manual tour improves the solution by including small amounts of oleic and arachidic in the projection.

Oils: Sardinia

[Figure: linoleic against oleic, and PC 2 against PC 1]

Oils: North

[Figure: Discrim 2 against Discrim 1 with labelled points (3.31, 3.42), (4.0, 3.04), (3.65, 2.82); axis labels linolenic, palmitic, oleic, stearic, palmitoleic, eicosanoic, linoleic]

Oils: South

[Figure: tour projections; axis labels include oleic, stearic, palmitoleic, linoleic, eicosanoic, eicosenoic, linolenic]

Oils: Separating Sub-regions

Sub-regions in the North and sub-regions in Sardinia are easy to separate, but sub-regions in the South are very difficult to separate.

Oils: Assessing Neural Network Solutions

  - Add a variable containing predictions to the data set. Code is nnet in S (Ripley, 96).
  - Plotting the classifications vs variables => subset of variables used by the net.
  - Brush points in the boundary/confusion region, observe these points in the multivariate plots => where the net draws its boundaries.

Oils: Neural Networks

[Figure: NN class (1 to 4) against Region, and a projection with axes linoleic, arachidic, oleic]

Case Study: Particle Physics Data

Unsupervised clustering reveals 7 clusters, each low-dimensional and embedded in high-d space (Cook et al, 95).

[Figure: two tour projections with axes X1 to X7]
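Returning to the Holes index from the projection pursuit aside: the sketch below shows one way the index could be evaluated for a given projection. It assumes the projected data Y are already sphered (zero mean, identity covariance). The function name `holes_index` and the synthetic ring and Gaussian samples are illustrative assumptions, not data or code from the case studies.

```python
import numpy as np

def holes_index(Y):
    """Holes index of projected data Y (n rows, d columns), assumed sphered."""
    Y = np.asarray(Y, dtype=float)
    if Y.ndim == 1:
        Y = Y[:, None]  # treat a vector as a 1-d projection
    n, d = Y.shape
    # Average Gaussian kernel weight: large when many points sit near the center.
    kernel_mean = np.mean(np.exp(-0.5 * np.sum(Y ** 2, axis=1)))
    return -(2 * np.pi) ** (-d / 2) * kernel_mean + (2 * np.sqrt(np.pi)) ** (-d)

rng = np.random.default_rng(0)

# Synthetic 2-d view with an empty center: points on a circle of radius 2.
theta = rng.uniform(0.0, 2.0 * np.pi, size=500)
ring = 2.0 * np.column_stack([np.cos(theta), np.sin(theta)])

# Synthetic 2-d view with no hole: a standard Gaussian cloud.
gauss = rng.standard_normal(size=(5000, 2))

ring_score = holes_index(ring)    # positive: little data in the center
gauss_score = holes_index(gauss)  # close to zero for Gaussian data
```

With these synthetic samples the ring scores about 0.058 while the Gaussian cloud scores near zero, so maximizing the index steers the tour toward views with an empty center.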