Visualizing N Dimensional Clustering in 2 Dimensions Dr. Paul Jeull Dept. of Computer Science North Dakota State University Fargo, North Dakota 58105 Paul.Jeull@ndsu.nodak.edu W. Jockheck Dept. of Computer Science North Dakota State University Fargo, North Dakota 58105 William.Jockheck@ndsu.nodak.edu Abstract 1.1 An n-dimensional Wire Frame Data mining has presented users with methods for clustering over a large number of attributes. What is missing is a intuitive way to visualize the results in that n-dimensional space. This paper examines some of the efforts to visualize n dimensional data and then proposes a scheme to provide a representation that can be used view and interpret the analysis of the data. The proposed visualization involves projecting the normalized values onto the sides of an n-1 sided regular polygon. These points are then transformed into a representation of a single point inside the polygon. While this approach has limitations such as the fact that it does not provide a unique mapping to a two dimensional space, it does provide a relatively intuitive way to view the data with some limitations. The initial work was discounted due to problems caused by perturbations in the patterns caused by random or uncorrelated variables. After viewing the Visualizing High Dimensional Datasets and Relations tutorial by Alfred Inselberg* at KDD200, it became apparent that the parallel coordinates systems he used were similar in construction to the work here. His points of interest were simply those that we were pruning out for simplicity. After revisiting this work it became apparent that the similarity would be capitalized upon to gain useful visualization of clustering. 1. Background Increasingly there is a need to represent more and more complex data in a visual format. As the number of independent variables exceeds three or four it becomes awkward to provide graphs in the traditional sense. This usually results in some combination of graphs or the exclusion of some of the variables for the purpose of simplicity. Data mining has extended this problem by often working in tend or hundreds of “dimension.” These matters are further complicated by the fact that the user interface (a computer screen) is actually 2 D and human experience is for the most part limited to experience in three (or four) dimensions. Mathematically, there is no limitation on the extension of dimensionality but for visualization it becomes quite busy and far from intuitive to the user of the image. Using a simple wire frame drawing we can progress from 1-D to n-D. If you can choose an arbitrary vector to visualize the third dimension, there's no reason you can't choose another arbitrary vector to visualize the fourth dimension. The figure below shows an example of this concept extended to five dimensions. Figure 1: A five dimensional cube. Each drag point indicates the angel used to represent the dimension in the figure. Drag points 1 and 2 are clearly the x and y coordinates. Drag point 3 is a traditional projection of a third dimension. Drag points 4 and 5 continue the concept but are arbitrary since the people have no natural experience in viewing four or five dimensions. The result is sometimes referred to as a 5-dimensional super hypercube [1]. 1.2 Other n-dimensional Displays A number of other methods have been proposed to convey dimensional information. Figure 2 is taken from a dissertation by Sami Kaski [2]. It displays a ten-dimensional data item visualized using four different methods. These include (a) a profile of the component values, (b) a ``star'' in which the length of each ray emanating from the center illustrates one component, (c) Andrews' curve [3], (d) a facial caricature. The first is simply a set of grouped bar graphs which show the values associated for each item. There are serious limitations on how many values could be portrayed in this method. The star is related to the technique proposed here. However, since the rays emanate from the center there is a limit to the number of rays that can be shown before they are over lapping. Representing data as a wave allows comparison of multiple data points but again has a limit on how many can be viewed and compared in a meaningful way. In (d) Chernoff's faces [4] use each dimension of the data to determine the size, location, or shape of some component of a facial. This technique is based on the concept that human brains are well adapted for recognizing and remembering faces. Figure 2: Methods for conveying additional information (dimensionality) 1.3 Parallel Coordinate Scheme [5,6,7]. This method from to Inselberg and others attempts plots with all the axes parallel to each other in a plane. This preserves some of geometric structure. It is however projected in such a fashion that most geometric intuition has to be relearned. Inselberg has used it to achieve impressive results but the intuition is not clear to a novice user. Figure 3: A parallel Axis System 2. The Proposed Method of Viewing ndimensional Information The method is straight forward but offers many variation on implementation and tailoring. In the first version, a set of six variable was used, three of which were related and three of which were random. This was simply generated using an Excel spreadsheet. The results were then normalized and projected onto the sides of a regular n-dimensional polygon (in this case a hexagon). The six projected points were then simply averaged to create a "center of mass". Because the outline looks like a cut stone with twinkling facets it is nicknamed a jewel diagram. Figure 4: Jewel Diagram with each six dimensions with attributes for each sample connected. While interesting to look at the results are hard to distinguish due to the numerous colors being used to identify the different sample sets (thirteen used in the example). This means the method will have serious limitations as more and more samples are introduced. However, notice the similarity of the above diagram to the parallel coordinate system of Isenberg[5,6,7]. While Isenberg uses parallel axis, this set of lines follow the same layout but around the polygon. The implications of this have not been fully explored but left for later exploration. They may be useful but pruning out the lines in the next step makes the diagram simpler to view in the hopes of making it useful in much larger data sets. This representation in figure 5 provides z-axis separation of each sample for clarity. The results displayed are for a set of data samples for meteorological data. The close-up shows the sequence. Figure 5: Jewel diagram with connecting lines removed. By removing the lines each side of the polygon represents the distribution of values for one attribute (labeled Series 1 through 6 in the diagram). The cluster in the center shows the “center of mass” for each sample. Now each sample point represents a single point in the center of the polygon. For the purposes of this example an attempt was made to retain the identity of the points by using color and patterns. However, that would not be necessary in larger data sets. To assist with the visibility problem the idea of creating a vrml space in which each sample is separated and can be viewed from different angles was explored. While this solves some of the separation problems and enables exploration of the values on the polygon edges, many of the basic problems remain. Figure 6: Detail of the vrml view, notice the points on the polygon showing the individual values. Since the data is in vrml, the viewer can fly through the space to examine the detail. This lead to the creation of a utility to generate the vrml. Note that a browser with a vrml enable plug in is required. The screen shots show Netscape and Cosmoviewer as the software. Using other software may provide slightly different appearances. Figure 7: Skewed angle view in vrml Or viewed from a distance to gain an overview. Figure 5: Screen shot of the vrml utility http://jockheck.northern.edu/vrml2/demo12.html 3.4 Usefulness. Initial exploration has not consistently produced useful patters. It appears that the pattern is lost due to the perturbations caused by the random (or uncorrelated) values in some variables which then move the points randomly in the two dimensional space. 4. Conclusion and Future Directions This technique is interested in providing a method for viewing n-dimensional data and hence interpreting such items as clustering in data mining. Like most tools it has its drawbacks. However, the method of displaying the information provides a method for human viewers to gain understanding and insight. Figure 8: A distant view of the sequence in vrml. Despite the drawbacks this technique at least provides a method to visualize n-dimensional clustering. As a result the author intends to continue to use the technique to represent data mining results in n-dimensional space. 3. Concerns 5. References There are serious problems in an absolute mathematical sense with this approach. [1] Nick Jackiw, October 1997 from http://www.keypress.com/sketchpad/java_gsp/hyperc ube.html 3.1 Loss of uniqueness: Points mapped to 2-d space are not a unique representation of the n-d space. It may be possible to construct such a mapping but it would probably require the insertion of additional variables which combine the existing ones. It may be necessary to have up to n! sides to the polygon to produce uniqueness. 3.2 Shrinking viewing space. The points mapped into 2-d space will converge to a single point (no distinguishable variation in values) as the number of sides of the polygon approaches a circle. This is due to the fact that the values are averaged. While it is possible to increase the diameter of the near circle to separate the point, when n is very large it is not possible to view the distribution of the individual attributes on the sides of the polygons in the same image as the center of mass points. 3.3 Sequence of Variables. The sequence that the variables are placed on the polygon has significant impact. Note that (in figure 4) series 1 and 4 increase in opposite directions (180 degree offset) hence directly off setting each other while 1 and 2 move in a similar (30 degree offset) direction. Changing the sequence of variable changes the diagram significantly. [2] Kaski, S., Data exploration using self-organizing maps. Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82, Espoo 1997, 57 pp. Published by the Finnish Academy of Technology. ISBN 952-5148-13-0. ISSN 1238-9803. UDC 681.327.12:159.95:519.2 accessible at http://www.cis.hut.fi/~sami/thesis [3] Andrews, D. F. (1972) Plots of high-dimensional data. Biometrics, 28:125-136. [4] Chernoff, H. (1973) The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association, 68:361-368. [5] Visual data mining with parallel coordinates Al Inselberg, COMPUTATION STAT 13: (1) 47-63 1998 [6] MULTIDIMENSIONAL LINES .1. REPRESENTATION. INSELBERG A, DIMSDALE B, SIAM J APPL MATH 54: (2) 559-577 APR 1994 [7] Parallel coordinate graphics using MATLAB, http://www.nbb.cornell.edu/neurobio/land/PROJECT S/Inselberg/