
Evaluation of Supervised Learning
Algorithms on Gene Expression Data
CSCI 6505 – Machine Learning
Adan Cosgaya
acosgaya@dal.ca
Winter 2006
Dalhousie University
Outline

Introduction

Definition of the Problem

Related Work

Algorithms

Description of the Data

Methodology of Experiments

Results

Relevance of Results

Conclusions & Future Work
Introduction






ML has gained attention in the biomedical field.
Need to turn biomedical data into meaningful
information.
Microarray technology is used to generate gene
expression data.
Gene expression data involves a huge number of
numeric attributes (gene expression measurements).
This kind of data is also characterized by a small number of
instances.
This work investigates the classification problem on such
data.
Definition of the Problem
Classifying Gene Expression Data


Number of features (n) is much greater than the number
of sample instances (m). (n >> m)

Typical data: n > 5000, and m < 100

High risk of overfitting the data due to the abundance of
attributes and the shortage of available samples.

The datasets produced by microarray experiments are
high-dimensional and often noisy due to the process
involved in the experiments.
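The overfitting risk when n >> m can be illustrated with a small sketch. It uses synthetic noise rather than real expression data, and the sizes (m = 20 samples, n = 2000 features) and the helper name `best_stump_accuracy` are assumptions chosen only for illustration. Even though no feature carries information about the class, some single feature usually separates the training samples well by chance.

```python
import random

def best_stump_accuracy(column, labels):
    """Best training accuracy of a one-feature threshold rule (either direction)."""
    best = 0.0
    for threshold in column:
        predicted = [1 if value >= threshold else 0 for value in column]
        correct = sum(p == y for p, y in zip(predicted, labels))
        accuracy = correct / len(labels)
        # Flipping the direction of the rule turns hits into misses and vice versa.
        best = max(best, accuracy, 1 - accuracy)
    return best

random.seed(0)
m, n = 20, 2000  # hypothetical sizes: few samples, many features (n >> m)
labels = [0] * (m // 2) + [1] * (m // 2)
random.shuffle(labels)

# Pure noise: each "gene" is drawn independently of the class label.
features = [[random.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]

best_train_accuracy = max(best_stump_accuracy(col, labels) for col in features)
print(f"best single-feature training accuracy on noise: {best_train_accuracy:.2f}")
```

A high training accuracy here says nothing about unseen samples, which is why evaluation on held-out data matters for this kind of dataset.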
Related Work

Using gene expression data for classification has recently
gained attention in the biomedical community.

Golub et al. describe an approach to cancer classification
based on gene expression, applied to human acute
leukemia (ALL vs. AML).

A. Rosenwald et al. developed a predictive model of patient
survival after chemotherapy (Alive vs. Dead).

Furey et al. present a method to analyze microarray
expression data using SVM.

Guyon et al. experiment with reducing the dimensionality
of gene expression data.
Algorithms

K-Nearest Neighbor (KNN)


Naive Bayes (NB)


It assumes that the effect of a feature value on a given
class is independent of the values of other features.
Decision Trees (DT)


It is one of the simplest and most widely used algorithms for data
classification.
Internal nodes represent tests on one or more attributes
and leaf nodes indicate decision outcomes.
Support Vector Machines (SVM)

It works well on high-dimensional data.
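The independence assumption behind Naive Bayes can be sketched as follows. This is a minimal Gaussian variant on toy values, not the implementation used in the experiments; the function names, the two-gene samples, and the small variance floor are assumptions made for the example. Because features are treated as independent given the class, the per-feature log-likelihoods are simply added.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small floor (an assumption) keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[y] = (len(rows) / len(samples), stats)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for sample x."""
    def log_posterior(prior, stats):
        total = math.log(prior)
        # Independence assumption: per-feature log-likelihoods add up.
        for value, (mean, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=lambda y: log_posterior(*model[y]))

# Toy expression profiles with two hypothetical genes per sample.
samples = [(0.1, 0.2), (0.0, 0.4), (0.2, 0.1), (2.1, 1.9), (1.8, 2.2), (2.0, 2.0)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

model = fit_gaussian_nb(samples, labels)
print(predict_gaussian_nb(model, (0.3, 0.3)))
print(predict_gaussian_nb(model, (1.9, 2.1)))
```

With thousands of genes, working in log space as above avoids numerical underflow from multiplying many small probabilities.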
Description of the Data

Leukemia dataset.


A collection of 72 expression measurements. The samples are divided
into two variants of leukemia: 25 samples of acute myeloid leukemia
(AML) and 47 samples of acute lymphoblastic leukemia (ALL).
Diffuse Large-B-Cell Lymphoma (DLBCL) dataset

Biopsy samples that were examined for gene expression using
DNA microarrays. Each sample is labeled with the patient's survival
outcome after chemotherapy for diffuse large-B-cell lymphoma (Alive, Dead).
Dataset    # Instances  # Classes  # Features  # Features after feature selection
Leukemia   72           2          7129        1026
DLBCL      240          2          7399        68
Methodology of Experiments
[Figure: pipeline diagram. All features -> feature selection (gene subset) -> algorithm]
Feature Selection
- Remove irrelevant features (although they may still have
  biological meaning).
- Rank features using GainRatio.
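As a rough illustration of the gain ratio criterion, the sketch below scores a discrete feature as its information gain divided by its split information. It assumes expression values have already been discretized into bins such as "low" and "high"; the gene names and values are made up, and this is not the implementation used in the experiments.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())

def gain_ratio(feature_values, labels):
    """Information gain of a discrete feature divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical discretized genes (assumption: binned into "low"/"high").
labels = ["ALL", "ALL", "AML", "AML"]
genes = {
    "gene_a": ["low", "low", "high", "high"],  # separates the classes perfectly
    "gene_b": ["low", "high", "low", "high"],  # unrelated to the classes
}

ranking = sorted(genes, key=lambda g: gain_ratio(genes[g], labels), reverse=True)
for name in ranking:
    print(name, round(gain_ratio(genes[name], labels), 3))
```

Keeping only the top-ranked genes yields the reduced gene subset that is then passed to the learning algorithm.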
Selecting a Supervised Learning Method
- KNN, NB, DT, SVM
Testing Methodology
- Evaluation over an independent test set (train/test split)
- Ratios: 66/34, 80/20, 90/10
- 10-fold cross-validation
- Compare both methods and check whether their results agree
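The two testing schemes can be sketched as index generators. This is a minimal, unstratified version with assumed function names and a fixed shuffle seed; it only shows how the samples are partitioned, not how any particular tool implements it.

```python
import random

def train_test_split_indices(n_samples, train_fraction=0.66, seed=0):
    """Shuffle sample indices and cut them into a train part and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 72 samples, as in the Leukemia dataset.
for fraction in (0.66, 0.80, 0.90):
    train, test = train_test_split_indices(72, fraction)
    print(f"split {fraction:.2f}: {len(train)} train / {len(test)} test")

for fold_number, (train, test) in enumerate(k_fold_indices(72, k=10), start=1):
    print(f"fold {fold_number}: {len(train)} train / {len(test)} test")
```

With so few samples, a single split leaves only a handful of test cases, whereas cross-validation averages over all of them.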
Methodology of Experiments (cont…)

Measuring Performance

$$\text{Accuracy} = \frac{\text{number of correct classifications}}{\text{total number of test cases}}$$



Precision (p)
Recall (r)
F-Measure



It is hard to compare two classifiers using two measures. F-Measure combines precision and recall into one measure.
F-Measure is the harmonic mean of precision and recall.
For F to be large, both p and r must be large.
F  Measure 
2 pr
pr
Results
Without Feature Selection

[Figure: two bar charts, Leukemia and DLBCL, both without feature selection. Each plots accuracy (%) per algorithm (KNN, NB, DT, SVM) for the train/test split and for cross-validation.]
Leukemia: Naive Bayes and SVM perform better.
DLBCL: KNN and SVM perform better.
Cross-validation results are lower; because it uses nearly all the data for
both training and testing, it gives a more realistic estimate.
Results (cont…)
With Feature Selection

[Figure: two bar charts, Leukemia and DLBCL, both with feature selection. Each plots accuracy (%) per algorithm (KNN, NB, DT, SVM) for the train/test split and for cross-validation.]
Leukemia: KNN and SVM perform better.
DLBCL: NB and SVM perform better.
There is an increase in the overall accuracy, more noticeable in DLBCL.
Results (cont…)
Summary of classification accuracies with cross-validation
Algorithm  Leukemia dataset                 DLBCL dataset
           All features  Feature selection  All features  Feature selection
KNN        87.500        98.611             62.917        62.250
NB         98.611        97.222             59.167        70.833
DT         86.111        84.722             56.250        64.167
SVM        98.611        98.611             57.917        71.250
F-Measures for both datasets with and without feature selection
[Figure: bar chart of F-Measure (0 to 1) per algorithm (KNN, NB, DT, SVM). Series: Leukemia All, Leukemia F.S., DLBCL All, DLBCL F.S.]
Relevance of Results

Performance depends on the characteristics of the problem,
the quality of the measurements in the data, and the
capabilities of the classifier in finding regularities in the data.

Feature selection helps to minimize the use of redundant
and/or noisy features.

SVMs gave the best results; they perform well with
high-dimensional data and also benefit from feature selection.

Decision trees had the worst overall performance; however,
they still work at a competitive level.
Relevance of Results (cont…)

Surprisingly, KNN behaves relatively well despite its simplicity;
this simplicity allows it to scale well to large feature
spaces.

For the Leukemia dataset, very high accuracies
were achieved by all the algorithms. Perfect accuracy
was achieved in many cases.

The DLBCL dataset shows lower accuracies, although using
feature selection improved them.

Overall, the accuracy results are consistent with
the F-measure results, giving us
confidence in the relevance of the results obtained.
Conclusions & Future Work

Supervised learning algorithms can be used to classify
gene expression data from DNA microarrays
with high accuracy.

SVMs, by their very nature, deal well with high-dimensional
gene expression data.

We have verified that there are subsets of features (genes)
that are more relevant than others and better separate the
classes.

The choice of one algorithm over the others should be
evaluated on a case-by-case basis.
Conclusions & Future Work (cont…)

Feature selection proved beneficial for improving the
overall performance of the algorithms. This idea
can be extended to other feature selection methods
or to data transformations such as PCA.

Future work: analyze the effect of noisy gene expression data on the
reliability of the classifier.

While the scope of our experimental results is confined to
two datasets, the analysis can be used as a baseline for
future use of supervised learning algorithms on gene
expression data.
References








T. R. Golub et al. Molecular classification of cancer: class discovery and
class prediction by gene-expression monitoring. Science, Vol. 286, 531–537,
1999.

A. Rosenwald, G. Wright, W. C. Chan, et al. The use of molecular profiling
to predict survival after chemotherapy for diffuse large B-cell lymphoma.
New England Journal of Medicine, Vol. 346, 1937–1947, 2002.

Terrence S. Furey, Nello Cristianini, et al. Support vector machine
classification and validation of cancer tissue samples using microarray
expression data. Bioinformatics, Vol. 16, 906–914, 2000.

I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer
classification using support vector machines. BIOWulf Technical Report,
2000.

Ethem Alpaydin. Introduction to Machine Learning. The MIT Press, 2004.

Ian H. Witten, Eibe Frank. Data Mining: Practical Machine Learning Tools
and Techniques. Second Edition. Morgan Kaufmann Publishers, 2005.

Wikipedia: www.wikipedia.org

Alvis Brazma, Helen Parkinson, Thomas Schlitt, Mohammadreza Shojatalab.
A quick introduction to elements of biology: cells, molecules, genes,
functional genomics, microarrays. European Bioinformatics Institute.
Thank You!