Chapter 3
Clustering and Classification

3.1 Background

Clustering, or unsupervised classification, involves detecting previously unknown groups in the data. We usually imagine neatly separated clusters of observations, but quite often clustering is simply partitioning, or carving up, the data space. Discriminant analysis, or supervised classification, requires knowledge of the groups, and seeks to find the view which shows the best separation of the groups.

Whether or not the class information is known, the visual tools for classification are primarily the same. When class identity is known, colors and/or symbols can be used to code this information into the plots. It is also relatively easy to approach the analysis of data which contains a mix of known and unknown class identities. This is not typically true of numerical algorithms, where knowledge of the group information completely changes the algorithm for finding the separations between groups.

The graphics that are useful for classification tasks include rudimentary density plots, scatterplots, matrices of pairwise scatterplots, and parallel coordinate plots. Motion graphics, such as tours, are used to animate these plots, to provide views of arbitrary combinations of variables. User interaction, such as brushing, is very useful, especially when multiple plots are visible and linked. Several visual cues are useful for detecting cluster structure: separation of points in particular views, similar movement patterns in motion graphics, and intersecting lines in parallel coordinate plots.

This chapter describes the visual methods as we use them to explore data in the case studies. Various numerical methods, such as linear discriminant analysis, principal component analysis, neural networks, hierarchical cluster analysis and k-means clustering, will also be used, and we will discuss the use of graphics in conjunction with these methods.
3.2 Supervised Classification

This section includes two case studies, one using data on Australian crabs and the other using data on Italian olive oils. The exercises use data on flea beetles and Australian kangaroos.

3.2.1 Case Study: Australian Leptograpsus Crabs

Data Description

This data was described in Campbell & Mahon (1974). It contains measurements on rock crabs of the genus Leptograpsus. One species, L. variegatus, had been split into two new species, previously grouped by color form, orange and blue. Preserved specimens lose their color, so it was hoped that morphological differences would enable museum specimens to be classified. There are 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia. Each specimen has measurements on:

  Frontal Lip (FL), in mm
  Rear Width (RW), in mm
  Length of midline of the carapace (CL), in mm
  Maximum width of carapace (CW), in mm
  Body Depth (BD), in mm

The primary question here is "How do we distinguish between species and sex based on these five morphological measurements?" There are also some interesting structural features to note about the data.

Density Plots

Two types of univariate plots are available in XGobi: textured dotplots (Tukey & Tukey 1990) and average shifted histograms, or ASH (Scott 1992). Select 1DPlot from the View menu. A scrollbar allows the smoothness of the histogram to be changed. Two average shifted histograms of the variable body depth are shown in Figure 3.1. The blue species is slightly smaller on average, as seen by the presence of blue points at the low values and of more orange points at the high values.

Scatterplot

Select XYPlot from the View menu. Variables can be changed by interacting with the variable circle control panel.

Scatterplot Matrix

A scatterplot matrix lays out the pairwise plots so that they can all be viewed at one glance. To get a scatterplot matrix on startup, use the -scatmat option on the command line.
It is also possible to start up XGobi normally and then bring up a second XGobi with a scatterplot matrix by selecting the Scatterplot matrix item on the Options menu. This uses the subset of variables that are active in the tour as the subset displayed in the scatterplot matrix.

Figure 3.1: ASH plots of body depth, two different smoothness parameters.

Based on the scatterplot matrix (Figure 3.2), these are the observations about the relationships between variables and groups:

  There is strong correlation between all the variables.
  The variation is smaller for the smaller measurements and gets larger as the values increase.
  Smaller crabs are harder to distinguish; the separations increase as size gets larger.
  There is a difference between males and females in the plots of CL vs RW (males have a higher CL:RW ratio), BD vs RW, and CW vs RW.
  There is a difference between the two species in the plots of BD vs CW, and CW vs FL.

Figure 3.2: Scatterplot matrix of the 5 variables. Despite the strong correlation, various differences can be seen between species and sex.

Parallel Coordinate Plots

Parallel coordinate plots display data using parallel axes, rather than orthogonal axes. Each case is represented by a series of line segments connecting its values on successive variables. Good references for parallel coordinate plots are Inselberg (1985) and Wegman (1990). Figure 3.3 displays a parallel coordinate plot of the crabs data. Very little can be seen from this plot. Perhaps using principal component coordinates would reveal more.

Figure 3.3: Parallel coordinate plot of the 5 crab variables. Very little can be seen from this plot.

Tours

Introduction

While linked brushing provides information on conditional distributions, tours provide information on joint distributions. They are particularly useful for detecting clusters, outliers, distributional shape (including covariance), and some non-linear structure.
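Each view shown by a tour is a low-dimensional projection of the data. As a rough illustration only (this is not XGobi's code), the R sketch below draws one randomly chosen 2-D projection of 5-variable data. The helper name `random_frame` and the simulated data are assumptions made for the example.

```r
# Sketch only: one randomly chosen 2-D projection ("frame") of p-dimensional data.
# random_frame() is a hypothetical helper, not part of XGobi.
random_frame <- function(p, d = 2) {
  # The QR decomposition of a Gaussian matrix yields an orthonormal p x d basis
  qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
}

set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)      # placeholder data: 200 cases, 5 variables
basis <- random_frame(ncol(x))             # 5 x 2 projection basis
proj <- scale(x, scale = FALSE) %*% basis  # centered data projected into 2-D
plot(proj, xlab = "Projection 1", ylab = "Projection 2", pch = ".")
```

A tour animates a smooth sequence of such projections rather than jumping between unrelated random ones, so this shows a single frame, not the motion.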
Grand Tour

A grand tour is a continuous 1-parameter family of d-dimensional projections of p-dimensional data which is dense in the set of all d-dimensional projections in $\mathbb{R}^p$. The parameter is usually thought of as time, which means that each projection shown can be indexed by a time parameter. As time is allowed to wander off to infinity the grand tour will show all possible d-dimensional projections of the data, which is the meaning of "dense in the set of all projections". A grand tour offers a multitude of aspects simultaneously in relationship to one another. If the data is intrinsically 0-, 1-, or 2-dimensional (that is, clusters, curves or surfaces) the human eye can pick up the "gestalt" almost instantly. (We are adept at detecting and recognizing moving objects.) Three-dimensional rotation can be considered a special case of the tour, where the dimension of the data is p = 3. The grand tour provides the overview of the structure of the data.

Guided Tour

To find more specific types of structure, intelligent search engines can be connected to the tour, which can automatically provide more informative views than the random ones provided by the grand tour. The guided tour leads the user to rare views.

Manual Tour

Prior knowledge can be incorporated with manually controlled tours. The user can increase or decrease the contribution of a particular variable to a view, to examine how that variable contributes to any structure. In addition, manual tools allow us to assess the sensitivity of the structure to a particular variable, or to sharpen or refine structure exposed with the grand or guided tour. The manual tour refines the views.

Good references for tours are Asimov (1985), Buja & Asimov (1986), Cook, Buja & Cabrera (1993), Cook, Buja, Cabrera & Hurley (1995), Buja, Cook, Asimov & Hurley (1997), and Cook & Buja (1997).

Back to this data

Watch the crabs data in a grand tour of all 5 variables. The shape is rather like a 1D pencil or stick rotating in the 5D space.
This is due to the strong correlation between all variables. Not much structure can be seen on this raw data scale. For this data it is more useful to examine the principal component basis; that is, we sphered the data before viewing it in the tour. This is a useful trick for graphics, because it removes the distraction due to correlation structure. It is important to note that all principal components were used: no dimension reduction was conducted. This matters because eliminating the lowest principal components often removes the cluster structure as well. We then used projection pursuit with the Holes index to obtain the views in Figure 3.4. In another approach we first standardized the variables before sphering and projection pursuit (Figure 3.5).

Figure 3.4: Projection pursuit solutions found with the Holes index (sphered data).

Figure 3.5: (Left) Projection pursuit solutions found with the Holes index (standardized, sphered data). (Right) Linear discriminant analysis solution.

Figure 3.6: View of the principal components of four variables (FL, CL, CW, BD) which separates species cleanly, and the subsequent S plot delineating the discriminant boundary.

To obtain the discriminant boundary (Figure 3.6), we first standardized the four variables frontal lip, carapace length, carapace width and body depth; call these s(FL), s(CL), s(CW), s(BD).
Then we computed the principal components:

\[
\begin{aligned}
\mathrm{PC}_1 &= \bigl(0.499\,s(\mathrm{FL}) + 0.502\,s(\mathrm{CL}) + 0.499\,s(\mathrm{CW}) + 0.500\,s(\mathrm{BD})\bigr)/\sqrt{3.939}\\
\mathrm{PC}_2 &= \bigl(0.532\,s(\mathrm{FL}) - 0.315\,s(\mathrm{CL}) - 0.653\,s(\mathrm{CW}) + 0.437\,s(\mathrm{BD})\bigr)/\sqrt{0.047}\\
\mathrm{PC}_3 &= \bigl(-0.684\,s(\mathrm{FL}) + 0.085\,s(\mathrm{CL}) - 0.119\,s(\mathrm{CW}) + 0.715\,s(\mathrm{BD})\bigr)/\sqrt{0.012}\\
\mathrm{PC}_4 &= \bigl(0.031\,s(\mathrm{FL}) - 0.801\,s(\mathrm{CL}) + 0.557\,s(\mathrm{CW}) + 0.218\,s(\mathrm{BD})\bigr)/\sqrt{0.002}
\end{aligned}
\]

Then the discriminant rule is given by:

  If $0.094\,\mathrm{PC}_1 + 0.201\,\mathrm{PC}_2 - 0.030\,\mathrm{PC}_4 < -0.05$ then the species is blue,
  else the species is orange.

To separate the sexes we worked stepwise, treating each species separately. Figure 3.7 illustrates two projections which do well at separating the sexes within each species. Here we relied heavily on rear width in relation to the other variables to obtain separations. Standardized variables were also used. As an alternative route to good separations, we explored re-expressing the variables as ratios to rear width. It is a reasonable approach but doesn't produce outstanding results.

Based on touring, here are the observations that we can make:

  There are strong separations between species and sex.
  The axes suggest that body depth, frontal lip, carapace length and carapace width contribute to species separation.
  Rear width contributes most to the separation of the sexes.

Numerical Methods and Graphics

From studying the data with graphics we can come to a fairly neat understanding of this data. This can help numerical analysis in two ways: interpreting results, and determining how effective particular approaches might be. From the graphics we would expect to be able to get a numerical solution which separates species perfectly, and to be able to separate sexes with a high degree of accuracy. The linear discriminant analysis solution is very similar to the solution found with the Holes index (see Figure 3.5).
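The standardize-then-sphere preprocessing used in this case study can be sketched in a few lines of R. This is a minimal illustration under stated assumptions: the helper name `sphere` is hypothetical, and the data are simulated to be strongly correlated rather than taken from the crabs measurements. All components are kept, in line with the point that no dimension reduction is done.

```r
# Sketch only: standardize, then sphere, keeping every principal component.
# sphere() is a hypothetical helper; the data are simulated, not the crabs data.
sphere <- function(x) {
  z <- scale(as.matrix(x))                 # standardize: mean 0, variance 1
  eig <- eigen(var(z), symmetric = TRUE)
  # principal component scores, each divided by sqrt(its eigenvalue)
  pcs <- z %*% eig$vectors %*% diag(1 / sqrt(eig$values))
  colnames(pcs) <- paste0("PC", seq_len(ncol(pcs)))
  pcs
}

set.seed(2)
size <- rnorm(100, mean = 30, sd = 6)      # a shared "size" factor
x <- sapply(1:4, function(j) size * (0.5 + 0.1 * j) + rnorm(100))
pcs <- sphere(x)
round(var(pcs), 6)   # should be numerically the identity matrix
```

Because the sphered variables are uncorrelated with unit variance, any structure seen in a tour of `pcs` is not simply the overall correlation between the original variables.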
We would expect CART to perform poorly on the raw data, but working with principal components might provide better solutions. Neural networks are likely to be able to classify the species and sex perfectly (which is possible with Ripley's S code for a feed-forward network), but the reliability of the classification for the sexes of small crabs is suspect, because for smaller crabs the sexes are physiologically less distinguishable.

Figure 3.7: Projection pursuit solutions, separately for species, found with the Holes index (standardized, sphered data, axes shown in terms of standardized variables).

3.2.2 Exercises

1. Generate a scatterplot matrix of the flea beetle data. Which variables would contribute to separating the 3 species?

2. Generate a parallel coordinate plot of the flea beetle data. Characterize the 3 species by the pattern of their traces.

3. Watch the flea beetle data in a grand tour. Stop the tour when you see a separation and describe the variables that contribute to the separation.

4. Using the projection pursuit guided tour, with the Holes index, find a projection which neatly separates all 3 species. Put the axes onto the plot and explain the variables that are contributing to the separation. Using univariate plots, confirm that these variables are important for separating the species.

3.2.3 Case Study: Italian Olive Oils

Data Description

This data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic) found in the lipid fraction of 572 Italian olive oils. (An analysis of this data is given in Forina, Armanino, Lanteri & Tiscornia 1983.) There are 9 collection areas: 4 from southern Italy (North and South Apulia, Calabria, Sicily), 2 from Sardinia (Inland and Coastal) and 3 from northern Italy (Umbria, East and West Liguria).
The data available are:

  Region            South, North or Sardinia
  Area              Sub-regions within the larger regions (North and South Apulia, Calabria, Sicily, Inland and Coastal Sardinia, Umbria, East and West Liguria)
  Palmitic Acid     Percentage × 100 in sample
  Palmitoleic Acid  Percentage × 100 in sample
  Stearic Acid      Percentage × 100 in sample
  Oleic Acid        Percentage × 100 in sample
  Linoleic Acid     Percentage × 100 in sample
  Linolenic Acid    Percentage × 100 in sample
  Arachidic Acid    Percentage × 100 in sample
  Eicosenoic Acid   Percentage × 100 in sample

The primary question is "How do we distinguish the oils from different regions and areas in Italy based on their combinations of the fatty acids?"

Working through the data

1. Check that the colors and glyphs code region and area, by looking at the scatterplot of Region vs Area. Note that the ".colors" and ".glyphs" files were originally generated by brushing in the Region vs Area scatterplot and saving the results.

2. The oils of southern Italy can be recognized as different from those of all other regions by the presence or absence of one fatty acid. Which fatty acid is it? (Using univariate plots, change between variables until one variable shows a separation.)

3. Because it is so easy to distinguish the southern oils, remove them for now. Erase these points: go into brush and select erase, and move the brush over the red points (this is easier if you have the dotplot of region showing), then pull down the erase menu and select "Rescale ignoring erased points".

4. Now work on separating the northern oils from the Sardinian oils. Work back through the dotplots to find the variables which do the best job of separating the two groups. Then look at pairwise plots, and find which pair of variables does well at separating the groups. (It is not necessary to look at eicosenoic.)

5. I think you will do best with 3 variables, so using 3-D rotation look for three variables which best separate the groups.
Pick the three that you see as doing the best job, and find the projection which best separates the groups. Print this out, so that you can construct a classification rule.

At the end of this exercise you should be able to write down a decision rule to classify oil samples into regions of Italy. If you get a new sample, with the same measured variables, you can follow the decision rule to classify the oil as coming from the North, South or Sardinia.

Here is my solution, at this point, for separating the Northern Italian oils from those of Sardinia. First, I looked at the three variables oleic, linoleic and linolenic acid, and found a projection of the three variables where we could discriminate between the two regions with a vertical line (Figure 3.8). I printed out the coefficients using the "I/O" button:

  Var         Horizontal   Vertical
  oleic       -0.000077    -0.000318
  linoleic    -0.000968    -0.000265
  linolenic   -0.005010     0.011945

and used these to generate the plot again in S using the following code:

  par(pty="s")
  # oils outside region 1 (South); columns 6:8 are oleic, linoleic, linolenic
  x <- d.olive[d.olive[,1]!=1, 6:8]
  reg <- d.olive[d.olive[,1]!=1, 1]
  # horizontal and vertical coordinates of the projection
  ds1 <- x[,1]*(-0.000077) + x[,2]*(-0.000968) + x[,3]*(-0.005010)
  ds2 <- x[,1]*(-0.000318) + x[,2]*(-0.000265) + x[,3]*0.011945
  plot(ds1, ds2, xlab="Discrim 1", ylab="Discrim 2", pch=".")
  points(ds1[reg==2], ds2[reg==2], pch=0)   # region 2: squares
  points(ds1[reg==3], ds2[reg==3], pch=1)   # region 3: circles
  abline(v=-1.68)                           # discrimination boundary

This gives me a discrimination rule as follows:

  If $\mathrm{oleic}\,(-0.000077) + \mathrm{linoleic}\,(-0.000968) + \mathrm{linolenic}\,(-0.005010) > -1.68$ then the oil comes from region 3 (Northern Italy),
  else the oil is from region 2 (Sardinia).

The next step is to find classification rules for Areas within the Regions. Before we do this we will talk more about rotating data plots. An alternative to 3-D rotation is the grand tour. With 3 variables it is simply rotating the data in 3-D space, but it works for any number of variables, and can rotate data containing 4, 5 or more variables.
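Before moving on, the North versus Sardinia rule can be written as a small R function. This is a sketch under stated assumptions: the function names are hypothetical, the coefficients and the boundary at -1.68 are the horizontal coefficients and `abline` value from the S code above, and the two sample oils are made up for illustration (values are percentage × 100).

```r
# Sketch only: the North vs Sardinia rule as a function.
# Function names and sample values are hypothetical; coefficients and the
# boundary (-1.68) are those used in the S code of this section.
discrim1 <- function(oleic, linoleic, linolenic) {
  oleic * (-0.000077) + linoleic * (-0.000968) + linolenic * (-0.005010)
}

north_or_sardinia <- function(oleic, linoleic, linolenic) {
  # region 3 = Northern Italy, region 2 = Sardinia
  ifelse(discrim1(oleic, linoleic, linolenic) > -1.68, 3, 2)
}

# Two made-up samples: the first has low linoleic, the second high linoleic
north_or_sardinia(oleic = c(7800, 7300),
                  linoleic = c(700, 1200),
                  linolenic = c(30, 30))
```

Which side of the boundary belongs to which region should be checked against the plot (Figure 3.8) before the function is relied on.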
Figure 3.8: Olive oil data: projection showing separation of Sardinian oils from those of Northern Italy, both from XGobi and the subsequent S plot with the discrimination boundary drawn.

1. Watch the olive oil data in a grand tour (click on the variable circles so that all fatty acids except eicosenoic are included, and region and area are removed from the plot).

2. Keep the axes on the plot and watch how the variables fade in and out of view. The axes show how much each variable contributes to the current view. They are replicated in the variable circles, one axis per circle, which alleviates the mess on screen.

3. Change the speed of the tour by dragging on the top scrollbar.

4. Turn off the axes and watch the shapes that the data forms. See how the points separate into clusters in some projections.

5. Stop the tour at a view where the green group (Sardinia) is separated from the purple group (North Italy). Turn on the axes again, and look for the variable(s) which contribute most in the direction of the separation. Compare this with a pairwise plot of these variables and see if the separation is similarly visible there.

6. If you have difficulty stopping at a projection which separates Sardinian oils from those of Northern Italy, use projection pursuit (click on projection pursuit and then the optimize button; you may have to turn optimize on and off several times to get to the same maximum as shown in Figure 3.9). (This is called a guided tour, because the code uses a numerical function to find projections that are "interesting".)
Turning the axes on, you will find that two acids, oleic and linoleic, are the main ones contributing to the difference (Figure 3.9). Look at the XYPlot of these two acids. Remember these were the two main variables that allowed us to separate the regions.

Figure 3.9: Olive oil data: projection showing separation of Sardinian oils from those of Northern Italy shows that the difference is in two fatty acids, oleic and linoleic.

At the end of this exercise you should have learned how the tour (both the grand and guided) works, and how it can help in understanding cluster structure in high-dimensional data. It can also help extract the important variables more automatically.

Scale Affects Structure Detection

Different scales can affect interpretation dramatically. This next exercise illustrates how looking at the data on different scales can help find different features.

1. Standardize each variable to have mean zero and variance 1, using the transformation controls.

2. Then repeat the last exercise. Tour on all the standardized variables except eicosenoic acid. (Also use only the northern and Sardinian samples.) Turn on the guided tour: "ProjPrst", select the "Holes Index", and click on "Optimz".

3. What you'll notice is that it always gets back to the same projection, but this one is not as good as when the data was not standardized: there are a handful of (green) Sardinian samples which get confused with the northern Italian oils. What is interesting, though, is that just before it stabilizes at this not-so-interesting projection it passes by one that does have a good separation of the two regions. To get back to this, "Pause" the tour, drag the speed scrollbar close to the left-hand side to make it go slowly, click off "Optimz", and click on "Backtrack". Be ready to click on "Pause" again when it reaches a nice view.

Figure 3.10: Olive oil data: interesting projection when using standardized variables. The variables contributing most to this separation are oleic, linoleic and arachidic.

Figure 3.11: Olive oil data: projection showing separation of Sardinian oils, both from XGobi and the subsequent S plot with the discrimination boundary drawn.

Figure 3.12: Olive oil data: projection showing separation of Northern oils, both from XGobi and the subsequent S plot with the discrimination boundaries drawn.
Now click "Pause" and watch the tour go back to the interesting view.

4. When it reaches the view where the groups are separated and you have paused the tour there, take a look at the variables that are contributing to the separation. They are oleic and linoleic as before, plus either arachidic or linolenic. (You will each get slight differences here: sometimes it is linolenic that contributes, and sometimes it is arachidic. Figure 3.10 shows my run, where arachidic contributes to the separation.) You found the latter two variables earlier as being important for getting a good separation of the samples of the two regions.

It is important to look at data on different scales: raw coordinates, standardized coordinates or principal component coordinates. From earlier classes you should have learned that it is important to use transformations, such as logs, to normalize or spread out the data values. This is also important in viewing high-dimensional graphics. Although there was no need to use logs with this data, it is a fairly common necessity.

Pursuing the Olive Oil data in more depth

In this next exercise we drill down further into the data to explore the subclusters corresponding to areas within the regions. Right now we have the following classification rule:

  1. If eicosenoic acid > 4 then the oil comes from region 1 (Southern Italy),
  2. If $\mathrm{oleic}\,(-0.000077) + \mathrm{linoleic}\,(-0.000968) + \mathrm{linolenic}\,(-0.005010) > -1.68$ then the oil comes from region 3 (Northern Italy),
  3. Else the oil is from region 2 (Sardinia).

To refine this within Sardinia:

1. Erase-brush regions 1 and 3 out of the data, and select "Rescale ignoring erased points" from the "Erase" menu.

2. Then look at dotplots of the variables to find which ones help to discriminate Inland from Coastal Sardinian oils. Both oleic and linoleic seem to be very good.

3. Plot oleic vs linoleic. It is clear that these two variables alone are sufficient to separate the two areas (see Figure 3.11).
The actual discriminant rule would be found by taking the first principal component of the two variables. This is the S code I used:

  # Sphere the data: rotate onto the principal axes and divide each
  # component by sqrt(eigenvalue). Defined first so the call below can use it.
  f.sph.data <- function(data) {
    data <- as.matrix(data)
    vardat <- var(data)
    eig <- eigen(vardat)
    evc <- eig$vectors
    evl <- eig$values
    sphd.data <- data %*% evc %*% diag(1/sqrt(evl))
    return(sphd.data)
  }

  # Sardinian oils only (region 2); columns 6:7 are oleic and linoleic
  x <- f.sph.data(d.olive[d.olive[,1]==2, 6:7])
  are <- d.olive[d.olive[,1]==2, 2]
  plot(x[,1], x[,2], pch=".", xlab="PC 1", ylab="PC 2")
  points(x[are==5,1], x[are==5,2], pch=0, cex=1)
  points(x[are==6,1], x[are==6,2], pch=0, cex=2)
  abline(v=-28.85)

and the first principal component is given by

\[
\mathrm{PC}_1 = \bigl(0.8024368\,\mathrm{oleic} - 0.5967371\,\mathrm{linoleic}\bigr)/\sqrt{30807.5}
\]

giving the discriminant rule:

  If $\bigl(0.8024368\,\mathrm{oleic} - 0.5967371\,\mathrm{linoleic}\bigr)/\sqrt{30807.5} > 28.85$ then the oil comes from Coastal Sardinia,
  else it comes from Inland Sardinia.

Separating the different areas in northern Italy is a messy proposition. It is possible, with only one misclassification, using all the fatty acids except eicosenoic acid. Without going into laborious detail, the discrimination projection is given by:

\[
\left[\begin{array}{rr}
0.000691 & 0.000705\\
0.000843 & 0.000402\\
0.001582 & 0.000533\\
0.000220 & 0.000304\\
0.001247 & -0.000772\\
-0.002989 & -0.000216\\
-0.002220 & 0.006294\\
0.000000 & 0.000000
\end{array}\right]
\]

Project the data (the matrix of values of the 8 fatty acids) through this matrix to get the 2D view shown in Figure 3.12. Call the result $x = (x_1, x_2)$. Then the discriminant boundaries for the 3 groups are as follows:

  1. If $x_1 + 0.568\,x_2 > 5.25$ then the oil comes from Umbria,
  2. If $x_1 - 1.761\,x_2 < -1.31$ then the oil comes from West Liguria,
  3. Else the oil comes from East Liguria.

Figure 3.13: Olive oil data: projections showing combinations of variables where separation of oils from different areas of Southern Italy is visible, from XGobi.
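The area rules above can be collected into a short R sketch, which makes it easy to apply them to a new sample once its region is known. Everything named here is an assumption made for illustration: the function names are hypothetical, the rows of the projection matrix are assumed to follow the order in which the fatty acids are listed in the data description, and the sample oil is made up. The constants are the ones quoted in the text.

```r
# Sketch only: the area rules written as functions.
# Function names are hypothetical; constants are those given in the text.
acids <- c("palmitic", "palmitoleic", "stearic", "oleic",
           "linoleic", "linolenic", "arachidic", "eicosenoic")

# Assumption: rows of the projection follow the order of `acids`.
north_proj <- matrix(c( 0.000691,  0.000705,
                        0.000843,  0.000402,
                        0.001582,  0.000533,
                        0.000220,  0.000304,
                        0.001247, -0.000772,
                       -0.002989, -0.000216,
                       -0.002220,  0.006294,
                        0.000000,  0.000000),
                     ncol = 2, byrow = TRUE,
                     dimnames = list(acids, c("x1", "x2")))

# Sardinia: rule on the first principal component of oleic and linoleic
sardinia_area <- function(oleic, linoleic) {
  pc1 <- (oleic * 0.8024368 + linoleic * (-0.5967371)) / sqrt(30807.5)
  ifelse(pc1 > 28.85, "Coastal Sardinia", "Inland Sardinia")
}

# Northern Italy: two linear boundaries in the projected coordinates
north_area <- function(x1, x2) {
  ifelse(x1 + 0.568 * x2 > 5.25, "Umbria",
         ifelse(x1 - 1.761 * x2 < -1.31, "West Liguria", "East Liguria"))
}

# Made-up northern sample (percentage x 100), for illustration only
oil <- c(palmitic = 1100, palmitoleic = 100, stearic = 230, oleic = 7800,
         linoleic = 730, linolenic = 20, arachidic = 40, eicosenoic = 2)
x <- drop(oil[acids] %*% north_proj)   # projected coordinates x1, x2
north_area(x[["x1"]], x[["x2"]])
```

Keeping the projection and the boundary test as separate steps mirrors the text: the projection gives the view in Figure 3.12, and the rule is then read off that view.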
To separate the southern Italian oils, looking at dotplots of just this region, it appears that palmitoleic, linoleic, stearic and oleic might be useful variables for separating North and South Apulia from the other two areas. Eicosenoic, linolenic and arachidic would be needed to separate Calabria from Sicily, although it is clear that these two areas are not easy to distinguish cleanly. Figure 3.13 shows some plots to support these statements.

3.2.4 Exercises

1. For the kangaroos data, build a discriminant rule for each species and sex, using graphical methods.

2. There are three historical skulls (Leiden, Paris and London) which defy classification. Use graphical methods to guess which species and sex each belongs to.

3.3 Unsupervised Classification

This section explains how to approach classification when the class identities are unknown, using visual methods. We describe the brush-and-spin approach to interactive cluster analysis. The approach works reasonably well when there are good separations between groups, even when there are marked differences in variance structure between groups, or non-linear boundaries. It doesn't seem to work very well when there are classes which overlap in space, or when there are no distinct classes and we simply wish to partition the data. In these situations it is better to begin with a numerical solution, and attempt to refine it with visual tools. We will use the Italian olive oil data with the class labels removed to demonstrate the brush-and-spin approach, and the refining of numerical solutions.

3.3.1 Case Study: Italian Olive Oils

Density Plots

Work through each variable plot, and brush groups where there are clear separations, for example, eicosenoic acid (Figure 3.14). This is a much harder task than discriminant analysis! It relies much more heavily on careful assessment of each split, and on the flexibility of undoing previous splits.
Heavy use of jittering can help in these density plots, especially when there is discreteness in the data, or in some subsets of values.

Figure 3.14: ASH plot of eicosenoic acid revealing two clusters. Jittering is used because there is overplotting of points with values close to zero.

Focus on one group, by erasing all points painted differently, and continue through the next sections. Continue sequentially, focusing on one group at a time.

Scatterplots

Work through pairwise plots and brush the points that are clearly separated (Figures 3.15, 3.16).

Figure 3.15: Outliers can be distracting in that they take away from the plot space for detecting clustering.

Figure 3.16: Scatterplots reveal more clusters.

Tours

While focusing on one group, watch the data rotate in a grand tour, and stop when a separation is seen. Use the projection pursuit guided tour with the Holes index (or the Central Mass or Skewness indices if there appear to be unequal numbers of points in the groups) to find further separations. Also put the other groups back into the plot, examine how the clusters fall out in the tour, and refine the boundaries by re-brushing points that appear to be incorrectly classified.

Examining Numerical Solutions

When using hierarchical cluster algorithms it is common to use a dendrogram to examine the results. This can be computed in S, and the results passed to XGobi to study interactively. Figure 3.17 displays a dendrogram (computed using average linkage) with one outlier (East Liguria) highlighted, followed by a focus on one part of the tree, which is painted orange.
The brushed points form a strongly cohesive group in several variables, but they don't correspond to one unique region of Italy. Rather, they form a very heterogeneous group containing samples from areas in southern Italy, as well as all the northern Italian areas. The simple hierarchical cluster methods work very poorly with this data, as does k-means clustering. Model-based clustering, where unequal variances are assumed, does much better.

Figure 3.17: Exploring hierarchical clustering results using a dendrogram linked to scatterplots.
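For readers who want to compute such numerical solutions before exploring them graphically, the following R sketch runs average-linkage hierarchical clustering and k-means using base R functions. It is an illustration under stated assumptions: the data are simulated groups with unequal spread, not the olive oil data, and the choice of 3 clusters is arbitrary.

```r
# Sketch only: average-linkage hierarchical clustering and k-means,
# run on simulated data (not the olive oil data).
set.seed(3)
# three simulated groups with unequal spread in two variables
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 1.0), ncol = 2),
           matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))
truth <- rep(1:3, each = 50)

hc <- hclust(dist(x), method = "average")   # average linkage
plot(hc, labels = FALSE)                    # the dendrogram
hc_groups <- cutree(hc, k = 3)              # cut the tree into 3 clusters

km_groups <- kmeans(x, centers = 3, nstart = 10)$cluster

# Cross-tabulate each solution against the known groups
table(truth, hc_groups)
table(truth, km_groups)
```

The cluster labels in `hc_groups` or `km_groups` could then be used to color the points in linked plots or a tour, which is the kind of visual checking this section describes. Model-based clustering is not shown because it is not part of base R.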