Cancer Classification with Data-dependent Kernels

Anne Ya Zhang
(with Xue-wen Chen & Huilin Xiong)
EECS & ITTC, University of Kansas

DIMACS Workshop on Machine Learning Techniques in Bioinformatics
Outline

- Introduction
- Data-dependent Kernel
- Results
- Conclusion
Cancer facts

- Cancer is a group of many related diseases
  - Cells continue to grow and divide and do not die when they should
  - Changes in the genes that control normal cell growth and death
- Cancer is the second leading cause of death in the United States
  - Cancer causes 1 of every 4 deaths
  - NIH estimated overall costs for cancer in 2004 at $189.8 billion ($64.9 billion for direct medical costs)
- Cancer types
  - Breast cancer, lung cancer, colon cancer, ...
  - Death rates vary greatly by cancer type and stage at diagnosis
Motivation

- Why do we need to classify cancers?
  - The general way of treating cancer is to:
    - Categorize the cancers into different classes
    - Use a specific treatment for each of the classes
- Traditional ways to classify cancers
  - Morphological appearance
    - Not accurate!
  - Enzyme-based histochemical analyses
  - Immunophenotyping
  - Cytogenetic analysis
    - Complicated & need highly specialized laboratories
Motivation

- Why are traditional ways not enough?
  - Some tumors in the same class have completely different clinical courses
    - Maybe a more accurate classification is needed
  - Assigning new tumors to known cancer classes is not easy
    - e.g. assigning an acute leukemia tumor to one of:
      - AML (acute myeloid leukemia)
      - ALL (acute lymphoblastic leukemia)
DNA Microarray-based Cancer Diagnosis

- Cancer is caused by changes in the genes that control normal cell growth and death.
- Molecular diagnostics offer the promise of precise, objective, and systematic cancer classification
  - These tests are not widely applied because characteristic molecular markers for most solid tumors have yet to be identified.
- Recently, microarray tumor gene expression profiles have been used for cancer diagnosis.
Microarray

- A microarray experiment monitors the expression levels of thousands of genes simultaneously.
- Microarray techniques will lead to a more complete understanding of the molecular variations among tumors, and hence to a more reliable classification.

[Figure: gene expression heatmap with genes G1-G7 as rows and conditions C1-C7 as columns; color indicates low, zero, or high expression]
Microarray

- Microarray analysis allows the monitoring of the activities of thousands of genes over many different conditions.
- From a machine learning point of view, the data form a gene-by-experiment matrix:

  [Table: expression matrix with genes g-1 ... g-n as rows and experiments ex-1 ... ex-m as columns]

- The large volume of data requires computational aid in analyzing the expression profiles.
Machine learning tasks in cancer classification

- There are three main types of machine learning problems associated with cancer classification:
  - The identification of new cancer classes using gene expression profiles
  - The classification of cancers into known classes
  - The identification of "marker" genes that characterize the different cancer classes
- In this presentation, we focus on the second type of problem.
Project Goals

- To develop a more systematic machine learning approach to cancer classification using microarray gene expression profiles.
- Use an initial collection of samples belonging to known classes of cancer to create a "class predictor" for new, unknown samples.
Challenges in cancer classification

- Gene expression data are typically characterized by
  - high dimensionality (i.e. a large number of genes)
  - small sample size
- Curse of dimensionality!
- Methods
  - Kernel techniques
  - Data resampling
  - Gene selection
Data-dependent kernel model

- The data-dependent kernel has the form

  $k(x, y) = q(x)\, q(y)\, k_0(x, y)$

  where $k_0(x, y)$ is a basic kernel function and

  $q(x) = \alpha_0 + \sum_{i=1}^{m} \alpha_i\, k_1(x, x_i), \qquad k_1(x, x_i) = e^{-\gamma_1 \|x - x_i\|^2}$

- Optimizing the data-dependent kernel amounts to choosing the coefficient vector $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_m)^T$
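To make the construction concrete, below is a minimal NumPy sketch of building the data-dependent kernel matrix on a set of training samples. The function names (`gaussian_kernel`, `data_dependent_kernel`) and the choice of a Gaussian basic kernel $k_0$ with width `gamma0` are illustrative assumptions; the slides only require $k_0$ to be some basic kernel.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma):
    """Gaussian kernel matrix: K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.clip(sq, 0.0, None))

def data_dependent_kernel(X, alpha, gamma0, gamma1):
    """Build k(x, y) = q(x) q(y) k0(x, y) on the rows of X.

    q(x) = alpha[0] + sum_i alpha[i] * k1(x, x_i), with k1 Gaussian (gamma1)
    and k0 assumed Gaussian (gamma0). alpha has length m + 1 for m samples.
    """
    K0 = gaussian_kernel(X, X, gamma0)   # basic kernel k0
    K1 = gaussian_kernel(X, X, gamma1)   # k1(x, x_i) terms
    q = alpha[0] + K1 @ alpha[1:]        # q(x) evaluated at each sample
    return np.outer(q, q) * K0           # equals Q K0 Q with Q = diag(q)
```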
Optimizing the kernel

- Criterion for kernel optimization: maximize the class separability of the training data in the kernel-induced feature space

  $\max J = \dfrac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)}$

  where $S_b$ is the between-class scatter matrix and $S_w$ is the within-class scatter matrix.
The Kernel Optimization

- In terms of the coefficient vector, the criterion becomes

  $\max J = \dfrac{\mathrm{tr}(S_b)}{\mathrm{tr}(S_w)} = \dfrac{\alpha^T M_0\, \alpha}{\alpha^T N_0\, \alpha}$

  where $M_0$ and $N_0$ are functions of $K_0$ and $K_1$.
- If $N_0$ is nonsingular, $\alpha$ solves the generalized eigenvalue problem $M_0\, \alpha = \lambda N_0\, \alpha$.
- In reality, the matrix $N_0$ is usually singular, so the regularized problem $M_0\, \alpha = \lambda (N_0 + \mu I)\, \alpha$, $\mu > 0$, is solved instead.
- $\alpha$: eigenvector corresponding to the largest eigenvalue
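A minimal sketch of this optimization step, assuming $M_0$ and $N_0$ (both symmetric) have already been built from $K_0$, $K_1$, and the class labels; their construction follows Xiong et al. (2005) and is not shown here. The function name `optimize_alpha` and the default regularizer `mu=1e-3` are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def optimize_alpha(M0, N0, mu=1e-3):
    """Return alpha maximizing J(alpha) = (alpha^T M0 alpha) / (alpha^T N0 alpha).

    N0 is regularized as N0 + mu * I because it is usually singular; alpha is
    the eigenvector of M0 alpha = lambda (N0 + mu I) alpha with the largest
    eigenvalue. M0 and N0 are assumed symmetric.
    """
    N_reg = N0 + mu * np.eye(N0.shape[0])
    eigvals, eigvecs = eigh(M0, N_reg)      # generalized symmetric eigenproblem
    return eigvecs[:, np.argmax(eigvals)]   # eigenvector of the largest eigenvalue
```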
Kernel optimization

[Figure: scatter plots of training data and test data before and after kernel optimization]
Distributed resampling

- Original training data: $\{x_i, y_i\}$, $i = 1, 2, \ldots, m$
- Training data with resampling: $\{a_i, b_i\}$, $i = 1, 2, \ldots, 3m$, where

  $a_i = \begin{cases} x_i & 1 \le i \le m \\ x_r + \varepsilon & i > m \end{cases} \qquad b_i = \begin{cases} y_i & 1 \le i \le m \\ y_r & i > m \end{cases} \qquad \varepsilon \sim N(0, \sigma^2)$

  $x_r$: random sample of $\{x_i\}$ drawn with replacement, with $y_r$ its label
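A minimal sketch of the resampling step as described above. The function name and the `n_copies=2` default (which yields $3m$ samples, matching the slide) are illustrative assumptions; the noise level sigma is left to the caller.

```python
import numpy as np

def distributed_resampling(X, y, sigma, n_copies=2, rng=None):
    """Augment (X, y) with noisy bootstrap copies.

    The original m samples are kept; n_copies * m additional samples are drawn
    from X with replacement and perturbed by Gaussian noise eps ~ N(0, sigma^2),
    each keeping the label of the sample it was drawn from.
    """
    rng = np.random.default_rng(rng)
    m, d = X.shape
    idx = rng.integers(0, m, size=n_copies * m)            # sample with replacement
    noise = rng.normal(0.0, sigma, size=(n_copies * m, d))  # eps ~ N(0, sigma^2)
    X_aug = np.vstack([X, X[idx] + noise])
    y_aug = np.concatenate([y, y[idx]])
    return X_aug, y_aug
```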
Gene selection

- A filter method based on class separability:

  $g(j) = \dfrac{\sum_{k=1}^{2} m_k \left(\bar{x}_k(j) - \bar{x}(j)\right)^2}{\sum_{k=1}^{2} \sum_{i \in C_k} \left(x_i(j) - \bar{x}_k(j)\right)^2}$

  $C_k$: index set of the $k$-th class
  $m_k$: the number of samples in $C_k$
  $\bar{x}_k(j)$: average expression of gene $j$ across the $k$-th class
  $\bar{x}(j)$: average expression of gene $j$ over all training samples
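A minimal sketch of this filter score for the two-class case (the same loop also covers more classes). The function name and the example cutoff of 100 genes in the usage comment are illustrative, not taken from the slides.

```python
import numpy as np

def class_separability_scores(X, y):
    """Score each gene j by
    g(j) = sum_k m_k (mean_k(j) - mean(j))^2
           / sum_k sum_{i in C_k} (x_i(j) - mean_k(j))^2,
    where C_k are the class index sets. Higher g(j) means better separability.
    X: (samples, genes), y: class labels.
    """
    overall = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        num += Xc.shape[0] * (mean_c - overall) ** 2   # between-class part
        den += ((Xc - mean_c) ** 2).sum(axis=0)        # within-class part
    return num / den

# Keep, e.g., the 100 highest-scoring genes (an arbitrary illustrative cutoff):
# top_genes = np.argsort(class_separability_scores(X_train, y_train))[::-1][:100]
```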
Comparison with other methods

- k-Nearest Neighbor (kNN)
- Diagonal linear discriminant analysis (DLDA)
- Uncorrelated linear discriminant analysis (ULDA)
- Support vector machines (SVM)
Data sets

  Data set   Classification task
  ALL-AML    Subtypes: ALL vs. AML
  BreastER   Status of estrogen receptor
  BreastLN   Status of lymph nodes
  CNS        Outcome of treatment
  Colon      Tumor vs. healthy tissue
  Lung       Subtypes: MPM vs. ADCA
  Lymphoma   Different lymphoma cells
  Ovarian    Cancer vs. non-cancer
  Prostate   Tumor vs. healthy tissue
Experimental setup

- Data normalization
  - Zero mean and unit variance in the gene direction
- Randomly partition the data into two disjoint subsets of equal size: training data + test data
- Repeat each experiment 100 times
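A minimal sketch of this protocol. Whether the normalization statistics come from the training half only or from all samples is not stated on the slide; the sketch assumes the training half. The function names are illustrative.

```python
import numpy as np

def normalize_genes(X_train, X_test):
    """Standardize each gene to zero mean and unit variance, using statistics
    estimated on the training half (an assumption, see lead-in)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-12   # guard against constant genes
    return (X_train - mean) / std, (X_test - mean) / std

def random_half_split(n_samples, rng=None):
    """Randomly split sample indices into two disjoint halves (train / test)."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(n_samples)
    half = n_samples // 2
    return idx[:half], idx[half:]

# Repeat the experiment 100 times with different random partitions:
# for seed in range(100):
#     tr, te = random_half_split(len(y), rng=seed)
```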
Parameters

- DLDA: no parameter
- KNN: Euclidean distance, K = 3
- ULDA: K = 3
- SVM: Gaussian kernel; leave-one-out on the training data is used to tune parameters
- KerNN: Gaussian kernel for the basic kernel k0; γ0 and σ are set empirically, leave-one-out on the training data is used to tune the remaining parameters, and KNN is used for classification
Effect of data resampling

[Figure: effect of data resampling on the Lung (181 samples) and Prostate (102 samples) data sets]
Effect of gene selection

[Figure: effect of gene selection on the ALL-AML data set]
Effect of gene selection

[Figure: effect of gene selection on the Colon data set]
Effect of gene selection

[Figure: effect of gene selection on the Prostate data set]
Comparison results

[Figure: comparison of classification results on the ALL-AML, BreastLN, BreastER, and Colon data sets]
Comparison results

[Figure: comparison of classification results on the CNS, Ovarian, Lung, and Prostate data sets]
Conclusion

- By maximizing the class separability of the training data, the data-dependent kernel also increases the separability of the test data.
- The kernel method is robust to high-dimensional microarray data.
- The distributed resampling strategy helps alleviate overfitting.
Conclusion

- The classifier assigns samples to classes more accurately than the other approaches, which enables better-targeted treatment.
- The method can be used to clarify unusual cases
  - e.g. a patient diagnosed with AML but showing atypical morphology.
- The method can be applied to distinctions relating to future clinical outcomes.
Future work

- How to estimate the parameters
- Study the genes selected
References

- H. Xiong, M.N.S. Swamy, and M.O. Ahmad. Optimizing the data-dependent kernel in the empirical feature space. IEEE Trans. on Neural Networks 2005, 16:460-474.
- H. Xiong, Y. Zhang, and X. Chen. Data-dependent kernels for cancer classification. Under review.
- A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. J. Computational Biology 2000, 7:559-584.
- S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Statistical Assoc. 2002, 97:77-87.
- T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16:906-914.
- J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. on Computational Biology and Bioinformatics 2004, 1:181-190.
Thanks!
Questions?