Introduction to Data Analysis by Machine Learning

Overview of Machine Learning and Visualization
Three of the major techniques in machine learning are clustering, classification and
feature reduction. Classification and clustering are also broadly known as supervised
and unsupervised learning, respectively. In supervised learning, the objective is to learn predetermined class
assignments from other data attributes. For example, given a set of gene expression data
for samples with known diseases, a supervised learning algorithm might learn to classify
disease states based on patterns of gene expression. In unsupervised learning, there either
are no predetermined classes or class assignments are ignored. Cluster analysis is the
process by which data objects are grouped together based on some relationship defined
between objects. In both classification and clustering an explicit or implicit model is
created from the data which can help to predict future data instances or understand the
physical process behind the data. Creating these models, for example training a neural
network, can be a very compute-intensive task. Feature reduction or selection reduces the data
attributes used in creating a data model. This process can reduce analysis time and create
simpler and (sometimes) more accurate models.
All three machine learning techniques are used in the three cancer examples presented,
and each will be described. However, one of the primary analysis techniques used is
high-dimensional visualization.
One particular visualization, RadViz™, incorporates all
three machine learning techniques in an intuitive, interactive display. Two other high
dimensional visualizations, Parallel Coordinates and PatchGrid (similar to HeatMap) are
also used to analyze and display results.
Classification techniques used:
RadViz™ – rearranging dimensions based on T-statistic – a visual classifier
Naïve Bayes (Weka)
Support Vector Machines (Weka)
Instance-based or k-nearest neighbor (Weka)
Logistic Regression (Weka)
Neural Net (Weka)
Neural Net (Clementine)
Validation technique
Hold 1 out
Training and Test datasets
Clustering techniques:
RadViz™ – arranging dimensions not based on class label – ex. Principal
Hierarchical with Pearson correlation
Feature Reduction techniques used:
Pairwise t-statistic – equal variance used in RadViz™ (other statistics can also be
F-statistic – select top dimensions based on the highest F-statistic computed from
class labels
PURS™ (patent pending) - Principal Uncorrelated Record Selection
Initially select some “seed” dimensions, say based on a high t or F statistic.
Repeatedly delete dimensions that correlate highly with the seed dimensions; if a
dimension is not correlated, add it to the “seed” dimension set. Repeat, slowly
reducing the correlation threshold, until the “seed” dimensions are reduced to the
desired number.
Random – randomly selected dimensions and build/test classifier
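The correlation-based pruning described above for PURS can be sketched roughly as follows. This is only an illustration of the prose description, not the patented algorithm itself: the function names `pearson` and `prune_correlated`, the starting threshold of 0.95, the step of 0.05, and the assumption that dimensions arrive already ranked by significance (for example by t or F statistic) are all hypothetical choices made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def prune_correlated(columns, ranked, target, start=0.95, step=0.05):
    """Greedy pruning sketch (assumed parameters, see lead-in).

    columns: dict mapping dimension name -> list of values
    ranked:  dimension names, most significant first
    target:  desired number of dimensions
    """
    kept = list(ranked)
    threshold = start
    while len(kept) > target and threshold > 0:
        seeds = []
        for name in kept:
            # Keep a dimension only if it is not highly correlated
            # with any dimension already in the seed set.
            if all(abs(pearson(columns[name], columns[s])) < threshold for s in seeds):
                seeds.append(name)
        kept = seeds
        threshold -= step  # slowly tighten the threshold and repeat
    return kept

# Toy data: "b" is almost a copy of "a"; "c" is only weakly correlated with "a".
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.1],
    "c": [5, 1, 4, 2, 3],
}
print(prune_correlated(columns, ["a", "b", "c"], target=2))  # -> ['a', 'c']
```

Because the most significant dimension is always examined first, it is always retained, and the result may contain fewer dimensions than `target` if a single pass removes several at once.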
The Importance of High-dimensional Data Visualization and its Integration with
Analytic Data Mining Techniques. Visualization, data mining, statistics, as well as
mathematical modeling and simulation are all methodologies that can be used to enhance
the discovery process [15]. There are numerous visualizations and a good number of
valuable taxonomies (See [16] for an overview of taxonomies). Most information
visualization systems focus on tables of numerical data (rows and columns), such as 2D
and 3D scatterplots [17], although many of the techniques apply to categorical data.
Looking at the taxonomies, the following stand out as high-dimensional visualizations:
Matrix of scatterplots [17]; Heat maps [17]; Height maps [17]; Table lens [18]; Survey
plots [19]; Iconographic displays [20]; Dimensional stacking (general logic diagrams)
[21]; parallel coordinates [22]; Pixel techniques, circle segments [23]; Multidimensional
scaling [23]; Sammon plots [24]; Polar charts [17]; RadViz™ [25]; Principal component
analysis [26]; Principal curve analysis [27]; Grand Tours [28]; Projection pursuit [29];
Kohonen self-organizing maps [30]. Grinstein et al. [31] have compared the capabilities
of most of these visualizations. Historically, static displays include histograms,
scatterplots, and large numbers of their extensions. These can be seen in most
commercial graphics and statistical packages (Spotfire, S-PLUS, SPSS, SAS, MATLAB,
Clementine, Partek, Visual Insight’s Advisor, and SGI’s Mineset, to name a few). Most
software packages provide limited features that allow interactive and dynamic querying
of data.
High-dimensional visualizations (HDVs) have been limited to research applications and
have not been incorporated into many commercial products. However, HDVs are extremely
useful because they provide insight during the analysis process and guide the user to
more targeted queries.
Visualizations fall into two main categories: (1) low-dimensional, which includes
scatterplots, with 2-9 variables (fields, columns, parameters), and (2) high-dimensional,
with 100-1000+ variables. Parallel coordinates, or a spider chart or radar display in
Microsoft Excel, can display up to 100 dimensions, but place a limit on the number of
records that can be interpreted: when more than 1000 records are displayed, the lines
overlap and cannot be distinguished. A few visualizations deal with a large number
(>100) of dimensions quite well: Heatmaps, Heightmaps, Iconographic Displays, Pixel
Displays, Parallel Coordinates, Survey Plots, and RadViz™. Of these, only RadViz is
capable of dealing with ultra-high-dimensional (>10,000 dimensions) datasets, and we
discuss it in detail below.
RadViz™ is a visualization and classification/clustering tool that uses a spring
analogy for placement of data points and incorporates machine learning feature reduction
techniques as selectable algorithms.
The “force” that any feature exerts on a sample
point is determined by Hooke’s law: f = kd. The spring constant, k, ranging from 0.0
to 1.0, is the (scaled) value of the feature for that sample, and d is the distance between the
sample point and the perimeter point on the RadViz™ circle assigned to that feature (see
Figure A). The placement of a sample point, as described in Figure A, is
the point where the total force, summed vectorially over all features, is 0. The RadViz
display combines the n data dimensions into a single point for the purpose of clustering,
but it also integrates analytic embedded algorithms in order to intelligently select and
radially arrange the dimensional axes. This arrangement is performed through
Autolayout, a unique, proprietary set of algorithmic features based upon the dimensions’
significance statistics that optimizes clustering by optimizing the distance separating
clusters of points. The default arrangement is to have all features equally spaced around
the perimeter of the circle, but the feature reduction and class discrimination algorithms
arrange the features unevenly in order to increase the separation of different classes of
sample points. The feature reduction technique used in all figures in the present work is
based on the t statistic with Bonferroni correction for multiple tests. The circle is divided
into equal sectors or “pie slices,” one for each class. Features assigned to each class are
spaced evenly within the sector for that class, counterclockwise in order of significance
(as determined by the t statistic, comparing samples in the class with all other samples).
As an example, for a 3-class problem, features are assigned to class 1 based on the
t-statistic comparing class 1 samples with class 2 and 3 samples combined.
Class 2 features are assigned based on the t-statistic comparing class 2 values with class 1
and 3 combined values, and Class 3 features are assigned based on the t-statistic
comparing class 3 values with class 1 and class 2 combined. Occasionally, when large
portions of the perimeter of the circle have no features assigned to them, the data points
would all cluster on one side of the circle, pulled by the unbalanced force of the features
present in other sectors. In this case, a variation of the spring force calculation is used,
where the features present are effectively divided into qualitatively different forces
comprising high and low k value classes. This is done by requiring k to range from
-1.0 to 1.0. The net effect is to make some of the features ‘pull’ (high or +k values) and
others ‘push’ (low or -k values) the points, spreading them across the display
space while maintaining the relative point separations. It should be stated that one can
simply do feature reduction by choosing the top features by t-statistic significance and
then apply those features to a standard classification algorithm. The t-statistic
significance is a standard method for feature reduction in machine learning approaches,
independently of RadViz. The most significant chemicals selected with the t-statistic are
the same as those selected by RadViz, since RadViz has this machine learning feature
embedded in it and is responsible for the selections carried out here. The advantage of
RadViz is that one immediately sees a “visual” clustering of the results of the t-statistic
selection. Generally, the amount of visual class separation correlates to the accuracy of
any classifier built from the reduced features.
The additional advantage to this
visualization is that sub clusters, outliers and misclassified points can quickly be seen in
the graphical layout. One of the standard techniques to visualize clusters or class labels
is to perform a Principal Component Analysis and show the points in a 2D or 3D scatter
plot using the first few Principal Components as axes. Often this display shows clear
class separation, but the most important features contributing to the PCA are not easily
seen. RadViz is a “visual” classifier that can help one understand important features and
how many features are related.
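The one-versus-rest ranking and sector layout described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Autolayout algorithm: the names `pooled_t`, `assign_features`, and `anchor_angles` are hypothetical; each feature is assumed to go to the class with the largest positive t value; features are centered within equal sub-slices of their sector; and the Bonferroni correction is omitted because it requires p-values from the t distribution.

```python
import math

def pooled_t(a, b):
    """Two-sample t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    if pooled_var == 0:
        # Degenerate case: no within-group spread at all.
        return 0.0 if ma == mb else math.copysign(math.inf, ma - mb)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def assign_features(rows, labels, feature_names):
    """Assign each feature to the class with the largest one-vs-rest t
    (an assumption), then order each class's features by decreasing t."""
    classes = sorted(set(labels))
    sectors = {c: [] for c in classes}
    for j, name in enumerate(feature_names):
        best_class, best_t = None, -math.inf
        for c in classes:
            inside = [r[j] for r, lab in zip(rows, labels) if lab == c]
            outside = [r[j] for r, lab in zip(rows, labels) if lab != c]
            t = pooled_t(inside, outside)
            if t > best_t:
                best_class, best_t = c, t
        sectors[best_class].append((best_t, name))
    return {c: [n for _, n in sorted(fs, reverse=True)] for c, fs in sectors.items()}

def anchor_angles(sectors):
    """One equal sector per class; features evenly spaced inside it,
    counterclockwise in order of significance."""
    angles = {}
    width = 2 * math.pi / len(sectors)
    for i, names in enumerate(sectors.values()):
        for k, name in enumerate(names):
            angles[name] = i * width + (k + 0.5) * width / len(names)
    return angles

# Toy 3-class data: each feature is elevated in exactly one class.
rows = [[5, 1, 1], [6, 1, 2], [1, 5, 1], [2, 6, 1], [1, 1, 5], [1, 2, 6]]
labels = ["A", "A", "B", "B", "C", "C"]
sectors = assign_features(rows, labels, ["g1", "g2", "g3"])
angles = anchor_angles(sectors)
print(sectors)  # -> {'A': ['g1'], 'B': ['g2'], 'C': ['g3']}
```

In practice one would keep only the top few features per class before computing the angles, which is where the feature reduction takes place.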
The RadViz Layout:
An example of the RadViz layout is illustrated in Figure A. There are 16 variables or
dimensions associated with the 1 point plotted. Sixteen imaginary springs are anchored
to the points on the circumference and attached to one data point. The data point is
plotted where the sum of the forces is zero according to Hooke’s law (F = Kx), where
the force is proportional to the distance x to the anchor point. The value K for each
spring is the value of the variable for the data point. In this example the spring constants
(or dimensional values) are higher for the lighter springs and lower for the darker springs.
Normally, many points are plotted without showing the spring lines. Generally, the
dimensions (variables) are normalized to have values between 0 and 1 so that all
dimensions have “equal” weights. This spring paradigm layout has some interesting
properties. For example, if all dimensions have the same normalized value, the data point will lie
exactly in the center of the circle. If the point is a unit vector then that point will lie
exactly at the fixed point on the edge of the circle (where the spring for that dimension is
fixed). Many points can map to the same position. This represents a non-linear
transformation of the data which preserves certain symmetries and which produces an
intuitive display. Some features of this visualization include:
it is intuitive, higher dimension values “pull” the data points closer to the dimension
on the circumference
points with approximately equal dimension values will lie close to the center
points with similar values whose dimensions are opposite each other on the circle will
lie near the center
points which have one or two dimension values greater than the others lie closer to
those dimensions
the relative locations of the dimension anchor points can drastically affect the
layout (the idea behind the “Class discrimination layout” algorithm)
an n-dimensional line gets mapped to a line (or a single point) in RadViz
Convex sets in n-space map into convex sets in RadViz
Computation time is very fast
Thousands of dimensions can be displayed in one visualization
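The spring layout can be sketched in a few lines. Setting the vector sum of the spring forces K_i (a_i - p) to zero, where a_i is the anchor position of dimension i and p is the plotted point, gives p = sum(K_i a_i) / sum(K_i), that is, a weighted average of the anchor positions. The code below is an illustrative sketch of that formula only: the function names `normalize_columns` and `radviz_point` are hypothetical and are not the RadViz™ product's interface, anchors are assumed to be equally spaced on a unit circle unless angles are supplied, and the push/pull variant with K in [-1, 1] is not covered.

```python
import math

def normalize_columns(rows):
    """Min-max scale each column to [0, 1]; constant columns map to 0.5."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.5 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

def radviz_point(values, angles=None):
    """Return the (x, y) position where the spring forces balance.

    values: normalized dimension values (the spring constants K)
    angles: anchor angles in radians; equally spaced if omitted
    """
    n = len(values)
    if angles is None:
        angles = [2 * math.pi * i / n for i in range(n)]
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls; the center is a convention
    x = sum(k * math.cos(a) for k, a in zip(values, angles)) / total
    y = sum(k * math.sin(a) for k, a in zip(values, angles)) / total
    return (x, y)

# Equal values in every dimension land at the center (up to rounding).
print(radviz_point([0.5, 0.5, 0.5, 0.5]))
# A unit vector lands on its own anchor, here the one at 90 degrees.
print(radviz_point([0.0, 1.0, 0.0, 0.0]))
# Many points share a position: scaling all values leaves the point unchanged.
print(radviz_point([0.1, 0.2, 0.3]), radviz_point([0.2, 0.4, 0.6]))
```

Each point costs one pass over its dimensions, which is consistent with the fast computation time noted in the list above.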