Introduction to Data Analysis by Machine Learning

Overview of Machine Learning and Visualization
Machine learning is the application of statistical techniques to derive general
knowledge from specific data sets by searching through possible hypotheses exemplified
in the data. The goal is typically to build predictive or descriptive models that distinguish
the useful attributes of a dataset to allow the use of those attributes to draw conclusions
from other similar datasets. .[Data mining - Witten, Frank] In cancer diagnosis and
detection, machine learning helps identify significant factors in high dimensional data
sets of genomic, proteomic, or clinical data that can be used to understand the disease
state in patients. Machine learning techniques serve as tools for finding the needle in the
haystack of possible hypotheses formulated by studying the correlation of protein or
genomic expression with the presence or absence of disease.
In the process of analyzing the efficacy and correctness of the generalization of
concepts from a data set, high dimensional visualizations give the researcher time-saving
tools for analyzing the significance, biases, and strength of a possible hypothesis. In
dealing with a potentially discontinuous high dimensional concept space, the researcher’s
intuition benefits greatly from a visual validation of statistical correctness of the result.
The visualizations can also reveal sensitivity to variance, non-obvious correlations, and
unusual higher-order effects that are scientifically important, but would require time-consuming mathematical analysis to discover without a mechanism to picture the application of the discovered hypothesis to the dataset.
Cancer diagnosis and detection involves a group of techniques in machine learning called
classification techniques. Classification can be done by supervised learning, where the classes of the objects are already known and are used to train the system to learn the attributes that most effectively disambiguate and describe the members of the classes.
For example, given a set of gene expression data for samples with known diseases, a
supervised learning algorithm might learn to classify disease states based on patterns of
gene expression. In unsupervised learning, there either are no predetermined classes, or
the class assignments are ignored and data objects are grouped together by cluster
analysis based on some relationship between the objects.
In both supervised and
unsupervised classification (also known simply as clustering), an explicit or implicit model is created from the data to help predict future data instances or to understand the physical process behind the data. Creating these models can be a very compute-intensive task, such as training a neural network, and the models are prone to the flaw of generalizing knowledge from a particular data set in a way that "overfits" the model to that data, making it less generally valid when applied to other data sets of the same type. Feature selection and
reduction techniques help with both compute time and overfitting problems, by reducing
the data attributes used in creating a data model to those that are most important in
characterizing the hypothesis. This process can reduce analysis time and create simpler
and (sometimes) more accurate models.
In the three cancer examples presented, both supervised and unsupervised
classification and feature reduction are used and will be described. In addition we will
discuss the use of high dimensional visualization in conjunction with these analytical
techniques. One particular visualization, RadViz™, incorporates machine learning
techniques in an intuitive, interactive display.
Two other high dimensional visualizations, Parallel Coordinates and PatchGrid (similar to HeatMaps), are also used to
analyze and display results.
Below we summarize the classification, feature reduction, validation, and visualization
techniques we use in the examples, with a particular emphasis on explaining the
techniques of RadViz and Principal Uncorrelated Record Selection (PURS), which have been developed by the authors.
Machine Learning Techniques:
Classification techniques vary from the statistically simple, for example testing dimensions for the statistical significance of their contribution to the classification, to sophisticated probabilistic modeling. The supervised learning techniques used in the examples include Naïve Bayes, Support Vector Machines, Instance-based learning or K-nearest neighbor, Logistic regression, and Neural Networks. Much of the work in the
following examples is supervised learning, but it also includes some unsupervised
hierarchical clustering using Pearson correlations. There are many texts giving detailed
descriptions of the implementations and use of these techniques; the authors particularly like [Data Mining, Witten and Frank].
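As a concrete illustration of two of the techniques named above, the following sketch trains a Naïve Bayes and a K-nearest neighbor classifier, and runs a Pearson-correlation hierarchical clustering, on a synthetic expression-like matrix using scikit-learn and SciPy; the data, labels, and parameter choices are placeholders, not the datasets from the examples.

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 200))     # 60 samples x 200 expression-like features (synthetic)
    y = rng.integers(0, 2, size=60)    # hypothetical disease / no-disease labels

    # Supervised learning: fit two of the classifiers named above.
    for clf in (GaussianNB(), KNeighborsClassifier(n_neighbors=3)):
        clf.fit(X, y)
        print(type(clf).__name__, "training accuracy:", clf.score(X, y))

    # Unsupervised learning: hierarchical clustering with 1 - Pearson r as the distance.
    Z = linkage(X, method="average", metric="correlation")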
Feature Reduction Techniques:
There are a number of statistical approaches to feature reduction that are quite useful.
These include the application of pairwise t-statistics and the use of F-statistics computed from the class labels to select the most important dimensions.
A more sophisticated approach is one we call Principal Uncorrelated Record Selection, or PURS. PURS involves selecting some initial seed attributes (or dimensions), based, for example, on a high t- or F-statistic. We then repeatedly delete attributes that correlate highly with the seed attributes. If an attribute does not correlate highly with any of the seed attribute set, it is added to the seed set. We repeat this process, reducing the correlation threshold, until the seed dimensions are reduced to an optimal or desired number.
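The following is a minimal sketch of the PURS idea as described above; the ranking statistic, threshold schedule, and all names are illustrative assumptions rather than the authors' exact implementation.

    import numpy as np

    def purs(X, scores, target_count, threshold=0.95, step=0.05):
        # X:      samples x features data matrix
        # scores: per-feature significance scores (e.g. |t| or F statistics)
        order = np.argsort(scores)[::-1]          # features, most significant first
        while True:
            seeds = [order[0]]                    # seed with the top-ranked feature
            for j in order[1:]:
                # correlation of candidate feature j with every current seed
                r = [abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in seeds]
                if max(r) < threshold:            # not highly correlated with any seed
                    seeds.append(j)               # ...so it becomes a new seed
            if len(seeds) <= target_count or threshold <= step:
                return seeds
            threshold -= step                     # tighten the threshold and repeat

Lowering the threshold treats more attributes as "highly correlated" with the seeds, so each pass discards more of them and the seed set shrinks toward the desired size.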
We also use random feature selection to train and test classifiers. This technique is used
to validate the predictive power of a more carefully selected feature set culled from a
sparse data set.
Validation Techniques:
Perhaps the most significant challenge in the application of machine learning to
biological data is the problem of validation, or the task of determining the expected error
rate from a classifier when applied to a new dataset. The data used to create a model
cannot be used to predict the performance of that model on other datasets. The attributes,
or dimensions, selected as important for classification must be tested against a set of data
that was not used in any way in the creation of the classifier. The easy solution to this is
to divide the data into a training set and a test set, or even a training set, a tuning set to
tune the classifier, and a test set to determine the error rate. The problem, of course, is that biological data is usually expensive to acquire, and sufficiently large sets to allow this subdivision and still have the statistical power to generalize knowledge from the training set are hard to find. So, in addition to the training set/test set approach, we use a common
machine learning technique called 10-fold cross-validation. This approach divides the
data into 10 groups, creates the model using 9 of the groups, and tests it on the remaining
group. We then repeat this using each group as the test group once. The ten error
estimates are then averaged to give an overall sense of the predictive power of the classification technique on that data set.
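As an illustration, here is a minimal sketch of 10-fold cross-validation with scikit-learn; the classifier and the synthetic data are stand-ins for the models and datasets used in the examples.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 50))      # 100 samples x 50 features (synthetic)
    y = rng.integers(0, 2, size=100)    # two hypothetical classes

    # Ten stratified folds: build the model on 9 groups, test on the held-out
    # group, and rotate so each group serves as the test group once.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(GaussianNB(), X, y, cv=cv)
    print("mean accuracy over 10 folds:", scores.mean())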
Another technique we use to help predict performance in limited data sets is an extension
of the 10-fold validation idea called leave-one-out validation. In this technique one data
point is left out of each iteration of model creation, and is used to test the model. This is
repeated with every data point in the set being used once as the test data. This approach is
nicely deterministic, as compared to 10-fold cross-validation, which requires the careful
random stratification of the ten groups, but does not give as useful a characterization of
the accuracy of the model for some distributions of classes within the datasets.
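The same sketch adapts directly to leave-one-out validation; again the classifier and data are placeholders.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 50))      # synthetic data, as in the sketch above
    y = rng.integers(0, 2, size=100)

    # Each sample is held out exactly once and used to test a model built on
    # the rest; no random stratification is needed, so the result is deterministic.
    loo_scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())
    print("leave-one-out accuracy:", loo_scores.mean())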
High Dimensional Visualization
Although there are a number of conventional visualizations that can help in the
understanding of the correlation of a small number of dimensions to an attribute, high
dimensional visualizations have been difficult to understand and use because of the
potential loss of information when projecting high-dimensional data down to two or
three-dimensional representations.
There are numerous visualizations and a good number of valuable taxonomies of visual
techniques (See [16] for an overview of taxonomies). As a group we make use of many
of the techniques in the analysis of biological data, especially: matrices of scatterplots [17]; heat maps [17]; parallel coordinates [22]; RadViz™ [25]; and Principal Component Analysis [26]. Of these, we find RadViz uniquely capable of dealing with ultra-high-dimensional (>10,000 dimensions) datasets, and very useful when used interactively in
conjunction with specific machine learning and statistical techniques to explore critical
attributes for classification.
RadViz™ is a visualization, classification, and clustering tool that uses a spring
analogy for placement of data points and incorporates machine learning feature reduction
techniques as selectable algorithms [13-15]. The "force" that any feature exerts on a sample point is determined by Hooke's law: f = kd. The spring constant, k, ranging from 0.0 to 1.0, is the (scaled) value of the feature for that sample, and d is the distance between the sample point and the perimeter point on the RadViz™ circle assigned to that feature (see Figure A). The placement of a sample point, as described in Figure A, is determined by the point where the total force, determined vectorially from all features, is zero. The RadViz
display combines the n data dimensions into a single point for the purpose of clustering,
but it also integrates analytic embedded algorithms in order to intelligently select and
radially arrange the dimensional axes. This arrangement is performed through
Autolayout, a set of algorithmic features based upon the dimensions' significance
statistics that optimizes clustering by optimizing the distance separating clusters of
points. The default arrangement is to have all features equally spaced around the
perimeter of the circle, but the feature reduction and class discrimination algorithms
arrange the features unevenly in order to increase the separation of different classes of
sample points. The feature reduction technique used in all figures in the present work is
based on the t-statistic with Bonferroni correction for multiple tests. The circle is divided
into n equal sectors or “pie slices,” one for each class. Features assigned to each class are
spaced evenly within the sector for that class, counterclockwise in order of significance
(as determined by the t-statistic, comparing samples in the class with all other samples).
As an example, for a 3-class problem, features are assigned to class 1 based on the t-statistic comparing class 1 samples with class 2 and 3 samples combined.
Class 2 features are assigned based on the t-statistic comparing class 2 values with class 1
and 3 combined values, and Class 3 features are assigned based on the t-statistic
comparing class 3 values with class 1 and class 2 combined. Occasionally, when large
portions of the perimeter of the circle have no features assigned to them, the data points
would all cluster on one side of the circle, pulled by the unbalanced force of the features
present in other sectors. In this case, a variation of the spring force calculation is used,
where the features present are effectively divided into qualitatively different forces comprised of high and low k-value classes. This is done by requiring k to range from –1.0 to 1.0. The net effect is to make some of the features pull (high or +k values) and others push (low or –k values) the points, spreading them more fully across the display space while maintaining the relative point separations. It should be stated that one can
simply do feature reduction by choosing the top features by t-statistic significance and
then apply those features to a standard classification algorithm; t-statistic significance is a standard feature reduction method in machine learning, independent of RadViz. RadViz has this machine learning feature embedded in it, and it is responsible for the feature selections carried out here. The advantage of RadViz is that one
immediately sees a “visual” clustering of the results of the t-statistic selection. Generally,
the amount of visual class separation correlates to the accuracy of any classifier built
from the reduced features. An additional advantage of this visualization is that subclusters, outliers, and misclassified points can quickly be seen in the graphical layout.
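The per-class feature ranking described above can be sketched roughly as follows; the one-vs-rest t-test with a Bonferroni cutoff captures the general idea, though the exact statistics and ordering used inside RadViz™ may differ, and all names here are illustrative.

    import numpy as np
    from scipy import stats

    def rank_features_for_class(X, y, cls, alpha=0.05):
        # One-vs-rest t-test for each feature: samples in class `cls` vs all others.
        in_cls, rest = X[y == cls], X[y != cls]
        t, p = stats.ttest_ind(in_cls, rest, axis=0, equal_var=False)
        p_adj = np.minimum(p * X.shape[1], 1.0)    # Bonferroni correction
        keep = np.where(p_adj < alpha)[0]          # features passing the cutoff
        return keep[np.argsort(p_adj[keep])]       # most significant first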
One of the standard techniques to visualize clusters or class labels is to perform a
Principal Component Analysis and show the points in a 2D or 3D scatter plot using the first few principal components as axes. Often this display shows clear class separation, but the most important features contributing to the PCA are not easily seen. RadViz is a "visual" classifier that can help one understand the important features and how those features are related.
The RadViz Layout:
An example of the RadViz layout is illustrated in Figure A. There are 16 variables or
dimensions associated with the one point plotted. Sixteen imaginary springs are anchored to points on the circumference and attached to the one data point. The data point is plotted where the sum of the forces is zero according to Hooke's law (F = Kx), where the force is proportional to the distance x to the anchor point. The value K for each
spring is the value of the variable for the data point. In this example the spring constants
(or dimensional values) are higher for the lighter springs and lower for the darker springs.
Normally, many points are plotted without showing the spring lines. Generally, the
dimensions (variables) are normalized to have values between 0 and 1 so that all
dimensions have “equal” weights. This spring paradigm layout has some interesting
features.
For example, if all dimensions have the same normalized value, the data point will lie
exactly in the center of the circle. If the point is a unit vector then that point will lie
exactly at the fixed point on the edge of the circle (where the spring for that dimension is
fixed). Many points can map to the same position. This represents a non-linear
transformation of the data which preserves certain symmetries and which produces an
intuitive display. Some features of this visualization include:

- it is intuitive: higher dimension values "pull" the data points closer to the dimension on the circumference
- points with approximately equal dimension values will lie close to the center
- points with similar values whose dimensions are opposite each other on the circle will lie near the center
- points which have one or two dimension values greater than the others lie closer to those dimensions
- the relative locations of the dimension anchor points can drastically affect the layout (the idea behind the "Class discrimination layout" algorithm)
- an n-dimensional line gets mapped to a line (or a single point) in RadViz
- convex sets in n-space map into convex sets in RadViz
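To make the spring paradigm concrete, the sketch below computes the point placement it implies: setting the net Hooke's-law force to zero gives a weighted average of the anchor positions, with the scaled feature values as weights. The function and variable names are illustrative, not the RadViz™ implementation; the push/pull variant described earlier corresponds to allowing k to range over -1.0 to 1.0 in the same formula.

    import numpy as np

    def radviz_position(values):
        # values: one sample's feature values, each scaled to [0, 1] (the spring
        # constants k). Equilibrium of the spring forces gives
        #   p = sum(k_i * anchor_i) / sum(k_i).
        k = np.asarray(values, dtype=float)
        n = len(k)
        angles = 2 * np.pi * np.arange(n) / n          # evenly spaced anchor points
        anchors = np.column_stack([np.cos(angles), np.sin(angles)])
        total = k.sum()
        if total == 0:
            return np.zeros(2)                         # all-zero sample sits at the center
        return (k[:, None] * anchors).sum(axis=0) / total

    print(radviz_position([0.5] * 16))   # equal values -> approximately the center
    print(radviz_position([0, 0, 1, 0])) # unit vector -> that dimension's anchor point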