Introduction to Data Analysis by Machine Learning

Overview of Machine Learning and Visualization

Three of the major techniques in machine learning are clustering, classification, and feature reduction. Classification and clustering are also broadly known as supervised and unsupervised learning, respectively. In supervised learning, the objective is to learn predetermined class assignments from other data attributes. For example, given a set of gene expression data for samples with known diseases, a supervised learning algorithm might learn to classify disease states based on patterns of gene expression. In unsupervised learning, there either are no predetermined classes or class assignments are ignored; cluster analysis is the process by which data objects are grouped together based on some relationship defined between the objects. In both classification and clustering, an explicit or implicit model is created from the data that can help to predict future data instances or to understand the physical process behind the data. Creating these models can be a very compute-intensive task, such as training a neural network. Feature reduction (or selection) reduces the number of data attributes used in creating a data model. This process can shorten analysis time and produce simpler and (sometimes) more accurate models. In the three cancer examples presented, all three machine learning techniques are used and will be described; however, one of the primary analysis techniques used is high-dimensional visualization. One particular visualization, RadViz™, incorporates all three machine learning techniques in an intuitive, interactive display. Two other high-dimensional visualizations, Parallel Coordinates and PatchGrid (similar to HeatMap), are also used to analyze and display results.
Classification techniques used:
- RadViz™ – rearranging dimensions based on the t-statistic – a visual classifier
- Naïve Bayes (Weka)
- Support Vector Machines (Weka)
- Instance Based or k-nearest neighbor (Weka)
- Logistic Regression (Weka)
- Neural Net (Weka)
- Neural Net (Clementine)

Validation techniques:
- 10-fold cross-validation
- Leave-one-out
- Training and test datasets

Clustering techniques:
- RadViz™ – arranging dimensions not based on class label – e.g., by Principal Components
- Hierarchical clustering with Pearson correlation

Feature reduction techniques used:
- Pairwise t-statistic (equal variance) as used in RadViz™ (other statistics can also be used)
- F-statistic – select the top dimensions based on the highest F-statistic computed from the class labels
- PURS™ (patent pending) – Principal Uncorrelated Record Selection: initially select some "seed" dimensions, for example those with high t- or F-statistics; repeatedly delete dimensions that correlate highly with the seed dimensions, and add any uncorrelated dimension to the seed set; repeat, slowly reducing the correlation threshold, until the seed dimensions are reduced to the desired number.
- Random – randomly select dimensions and build/test a classifier

The Importance of High-dimensional Data Visualization and its Integration with Analytic Data Mining Techniques

Visualization, data mining, statistics, as well as mathematical modeling and simulation are all methodologies that can be used to enhance the discovery process [15]. There are numerous visualizations and a good number of valuable taxonomies (see [16] for an overview of taxonomies). Most information visualization systems focus on tables of numerical data (rows and columns), such as 2D and 3D scatterplots [17], although many of the techniques also apply to categorical data.
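The PURS™ description above is loose enough to admit more than one reading; the following is a minimal sketch of one plausible interpretation (the function name `purs_sketch` and all parameter defaults are illustrative assumptions, not the patented algorithm):

```python
import numpy as np

def purs_sketch(X, seed_idx, n_keep, corr_start=0.95, corr_step=0.05):
    """Sketch of a PURS-style selection: candidate dimensions that
    correlate highly with any seed dimension are deleted; one
    uncorrelated survivor is promoted to the seed set per pass; the
    correlation threshold is lowered each pass.

    X        : (n_samples, n_features) data matrix
    seed_idx : initial "seed" feature indices (e.g. top t- or F-statistic)
    n_keep   : target number of selected features
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-feature |r|
    seeds = list(seed_idx)
    candidates = [j for j in range(X.shape[1]) if j not in seeds]
    threshold = corr_start
    while len(seeds) < n_keep and candidates and threshold > 0:
        # delete candidates that correlate highly with a current seed
        candidates = [j for j in candidates
                      if corr[j, seeds].max() < threshold]
        # promote one surviving (uncorrelated) candidate to the seed set
        if candidates:
            seeds.append(candidates.pop(0))
        # slowly tighten the correlation criterion
        threshold -= corr_step
    return seeds[:n_keep]
```

For example, starting from one high-significance seed, near-duplicate dimensions are dropped while uncorrelated ones survive into the final set.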
Looking at the taxonomies, the following stand out as high-dimensional visualizations: matrix of scatterplots [17]; heat maps [17]; height maps [17]; table lens [18]; survey plots [19]; iconographic displays [20]; dimensional stacking (general logic diagrams) [21]; parallel coordinates [22]; pixel techniques, circle segments [23]; multidimensional scaling [23]; Sammon plots [24]; polar charts [17]; RadViz™ [25]; principal component analysis [26]; principal curve analysis [27]; grand tours [28]; projection pursuit [29]; Kohonen self-organizing maps [30]. Grinstein et al. [31] have compared the capabilities of most of these visualizations. Historically, static displays include histograms, scatterplots, and large numbers of their extensions. These can be seen in most commercial graphics and statistical packages (Spotfire, S-PLUS, SPSS, SAS, MATLAB, Clementine, Partek, Visual Insight's Advisor, and SGI's MineSet, to name a few). Most software packages provide only limited features for interactive and dynamic querying of data. High-dimensional visualizations (HDVs) have been limited to research applications and have not been incorporated into many commercial products. However, HDVs are extremely useful because they provide insight during the analysis process and guide the user to more targeted queries. Visualizations fall into two main categories: (1) low-dimensional, which includes scatterplots, with 2-9 variables (fields, columns, parameters), and (2) high-dimensional, with 100-1000+ variables. Parallel coordinates, or a spider chart or radar display in Microsoft Excel, can display up to 100 dimensions, but such line-based displays place a limit on the number of records that can be interpreted: when more than 1000 records are displayed, the lines overlap and cannot be distinguished. There are a few visualizations that deal with a large number (>100) of dimensions quite well: heat maps, height maps, iconographic displays, pixel displays, parallel coordinates, survey plots, and RadViz™.
Of these, only RadViz™ is capable of dealing with ultra-high-dimensional (>10,000 dimensions) datasets, and we discuss it in detail below. RadViz™ is a visualization and classification/clustering tool that uses a spring analogy for the placement of data points and incorporates machine learning feature reduction techniques as selectable algorithms [13-15]. The "force" that any feature exerts on a sample point is determined by Hooke's law: f = kd. The spring constant k, ranging from 0.0 to 1.0, is the (scaled) value of the feature for that sample, and d is the distance between the sample point and the perimeter point on the RadViz™ circle assigned to that feature (see Figure A). The placement of a sample point, as described in Figure A, is determined by the point where the total force, summed vectorially over all features, is zero. The RadViz™ display combines the n data dimensions into a single point for the purpose of clustering, but it also integrates embedded analytic algorithms in order to intelligently select and radially arrange the dimensional axes. This arrangement is performed through Autolayout, a unique, proprietary set of algorithmic features, based upon the dimensions' significance statistics, that optimizes clustering by maximizing the distance separating clusters of points. The default arrangement is to have all features equally spaced around the perimeter of the circle, but the feature reduction and class discrimination algorithms arrange the features unevenly in order to increase the separation of different classes of sample points. The feature reduction technique used in all figures in the present work is based on the t statistic with the Bonferroni correction for multiple tests. The circle is divided into n equal sectors, or "pie slices," one for each of the n classes.
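Setting the vector sum of the Hooke's-law forces to zero gives a closed form for the sample position: the k-weighted mean of the anchor points. A minimal sketch of this placement (the function name `radviz_point` and the evenly spaced default anchors are assumptions; the class-discrimination layouts described here space anchors unevenly):

```python
import numpy as np

def radviz_point(values):
    """Place one sample inside the RadViz circle.

    Solving sum_i k_i * (a_i - x) = 0 for x gives
    x = sum_i (k_i * a_i) / sum_i k_i, the k-weighted mean of the
    anchor points a_i on the unit circle.

    values : feature values already scaled to [0, 1] (the spring
             constants k_i), one per anchor.
    """
    k = np.asarray(values, dtype=float)
    m = len(k)
    angles = 2 * np.pi * np.arange(m) / m          # evenly spaced anchors
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    if k.sum() == 0:
        return np.zeros(2)                         # no net force: center
    return (k[:, None] * anchors).sum(axis=0) / k.sum()
```

With all k_i equal, the weighted mean of a full ring of evenly spaced anchors is the origin, which is why uniform samples fall at the center of the display.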
Features assigned to each class are spaced evenly within the sector for that class, counterclockwise in order of significance (as determined by the t statistic comparing samples in the class with all other samples). As an example, for a three-class problem, features are assigned to class 1 based on the t-statistic comparing class 1 samples with class 2 and class 3 samples combined. Class 2 features are assigned based on the t-statistic comparing class 2 values with class 1 and 3 combined, and class 3 features are assigned based on the t-statistic comparing class 3 values with class 1 and 2 combined. Occasionally, when large portions of the perimeter of the circle have no features assigned to them, the data points would all cluster on one side of the circle, pulled by the unbalanced force of the features present in the other sectors. In this case, a variation of the spring force calculation is used, in which the features present are effectively divided into qualitatively different forces comprising high and low k value classes. This is done by allowing k to range from -1.0 to 1.0. The net effect is to make some of the features "pull" (high, or +k, values) and others "push" (low, or -k, values) the points, spreading them across the display space while maintaining the relative point separations. It should be noted that one can perform feature reduction simply by choosing the top features by t-statistic significance and then applying those features in a standard classification algorithm; the t-statistic is a standard method for feature reduction in machine learning, independently of RadViz™. The top features selected with the t-statistic are the same as those selected by RadViz™, which has this machine learning step embedded in it and is responsible for the selections carried out here. The advantage of RadViz™ is that one immediately sees a "visual" clustering of the results of the t-statistic selection.
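The one-vs-rest t-statistic ranking described above can be sketched with SciPy's equal-variance two-sample t-test (the function name and the top-n interface are illustrative assumptions; the Bonferroni correction applied to the significance threshold in the figures is omitted here):

```python
import numpy as np
from scipy.stats import ttest_ind

def rank_features_one_vs_rest(X, y, cls, n_top=5):
    """Rank features for one class by the equal-variance t-statistic,
    comparing samples in `cls` against all other samples combined
    (the one-vs-rest scheme used for sector assignment).

    Returns the indices of the n_top most significant features and
    their t values, most significant first.
    """
    in_cls = X[y == cls]              # samples in the class
    rest = X[y != cls]                # all other samples combined
    t, _ = ttest_ind(in_cls, rest, axis=0, equal_var=True)
    order = np.argsort(-np.abs(t))    # largest |t| first
    return order[:n_top], t[order[:n_top]]
```

For a three-class problem, calling the function once per class reproduces the per-sector feature assignment: each sector's features are ranked by that class's one-vs-rest t values.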
Generally, the amount of visual class separation correlates with the accuracy of any classifier built from the reduced features. An additional advantage of this visualization is that subclusters, outliers, and misclassified points can quickly be seen in the graphical layout. One of the standard techniques to visualize clusters or class labels is to perform a Principal Component Analysis (PCA) and show the points in a 2D or 3D scatterplot using the first few principal components as axes. Often this display shows clear class separation, but the features contributing most to the PCA are not easily seen. RadViz™ is a "visual" classifier that can help one identify important features and understand how the features are related.

The RadViz™ Layout: An example of the RadViz™ layout is illustrated in Figure A. There are 16 variables, or dimensions, associated with the one point plotted. Sixteen imaginary springs are anchored to points on the circumference and attached to the one data point. The data point is plotted where the sum of the forces is zero, according to Hooke's law (f = kd), where the force is proportional to the distance d to the anchor point. The value k for each spring is the value of the corresponding variable for the data point. In this example the spring constants (or dimensional values) are higher for the lighter springs and lower for the darker springs. Normally, many points are plotted without showing the spring lines. Generally, the dimensions (variables) are normalized to have values between 0 and 1 so that all dimensions have "equal" weights. This spring paradigm layout has some interesting features. For example, if all dimensions have the same normalized value, the data point will lie exactly in the center of the circle. If the point is a unit vector, it will lie exactly at the fixed point on the edge of the circle where the spring for that dimension is anchored. Many points can map to the same position.
This represents a non-linear transformation of the data that preserves certain symmetries and produces an intuitive display. Some features of this visualization include:
- It is intuitive: higher dimension values "pull" the data points closer to that dimension's point on the circumference.
- Points with approximately equal dimension values lie close to the center.
- Points with similar values whose dimensions lie opposite each other on the circle lie near the center.
- Points with one or two dimension values greater than the others lie closer to those dimensions.
- The relative locations of the dimension anchor points can drastically affect the layout (the idea behind the "class discrimination layout" algorithm).
- An n-dimensional line maps to a line (or a single point) in RadViz™.
- Convex sets in n-space map into convex sets in RadViz™.
- Computation time is very fast.
- Thousands of dimensions can be displayed in one visualization.
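Several of the listed properties can be checked numerically with a minimal spring-layout sketch (the helper `radviz_map` is a hypothetical implementation; the zero-net-force position reduces to the k-weighted mean of the anchor points):

```python
import numpy as np

def radviz_map(x):
    """Minimal RadViz spring layout: the zero-net-force position is
    the k-weighted mean of evenly spaced unit-circle anchors."""
    x = np.asarray(x, dtype=float)
    m = len(x)
    ang = 2 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(ang), np.sin(ang)])
    s = x.sum()
    return np.zeros(2) if s == 0 else (x[:, None] * anchors).sum(axis=0) / s

# Equal dimension values lie exactly at the center of the circle.
assert np.allclose(radviz_map([0.7] * 6), [0.0, 0.0], atol=1e-12)

# Three collinear points in 4-space map to collinear points in the plane
# (the "n-dimensional line maps to a line" property).
u = np.array([0.2, 0.9, 0.4, 0.1])
v = np.array([0.8, 0.3, 0.6, 0.5])
p = [radviz_map((1 - t) * u + t * v) for t in (0.0, 0.5, 1.0)]
d1, d2 = p[1] - p[0], p[2] - p[0]
cross = d1[0] * d2[1] - d1[1] * d2[0]   # 2D cross product
assert abs(cross) < 1e-10
```

Note that collinearity is preserved as a point set only; because the denominator varies along the line, evenly spaced points in n-space do not map to evenly spaced points in the plane.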