
Class 19: Bioinformatics
S. Mukherjee, R. Rifkin, G. Yeo, and T. Poggio
What is bioinformatics?
Pre 1995
Application of computing technology to providing statistical and
database solutions to problems in molecular biology.
Post 1995
Defining and addressing problems in molecular biology using
methodologies from statistics and computer science.
The genome project, genome wide analysis/screening of disease,
genetic regulatory networks, analysis of expression data.
Central dogma
Central dogma of biology:
DNA → RNA (pre-mRNA → mRNA) → Protein
Basic molecular biology
DNA:
CGAACAAACCTCGAACCTGCT
Transcription
mRNA: GCU UGU UUA CGA
Translation
Polypeptide: Ala Cys Leu Arg
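This mapping is easy to sketch in code. A minimal Python sketch (not from the slides): the 12-base template below is a hypothetical strand chosen so that translation reproduces the Ala-Cys-Leu-Arg example, and the codon table is truncated to just these four codons.

```python
# Minimal central-dogma sketch: template DNA -> mRNA -> polypeptide.
# Codon table truncated to the four codons in the example above.
CODON_TABLE = {"GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg"}
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}  # DNA base -> mRNA base

def transcribe(template: str) -> str:
    """mRNA is complementary to the DNA template strand."""
    return "".join(COMPLEMENT[base] for base in template)

def translate(mrna: str) -> list:
    """Read the mRNA three bases (one codon) at a time."""
    return [CODON_TABLE[mrna[i:i + 3]] for i in range(0, len(mrna) - 2, 3)]

mrna = transcribe("CGAACAAATGCT")   # hypothetical 12-base template
print(mrna)                         # GCUUGUUUACGA
print(translate(mrna))              # ['Ala', 'Cys', 'Leu', 'Arg']
```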
Less basic molecular biology
Transcription
End modification
Splicing
Transport
Translation
Biological information
Sequence information:
DNA → (transcription) → pre-mRNA → (splicing) → mRNA → (translation, stability) → Protein → (transport, localization)
Quantitative information:
mRNA levels: microarray, RT-PCR
Protein levels: protein arrays, yeast two-hybrid, chemical screens
Sequence problems
•Splice sites and branch points in eukaryotic pre-mRNA
•Gene finding in prokaryotes and eukaryotes
•Promoter recognition (transcription and termination)
•Protein structure prediction
•Protein function prediction
•Protein family classification
Gene expression problems
•Predict tissue morphology
•Predict treatment/drug effect
•Infer metabolic pathway
•Infer disease pathways
•Infer developmental pathway
Microarray technology
Basic idea:
The state of the cell is determined by proteins.
A gene codes for a protein, which is assembled via mRNA.
Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein.
The number of mRNA copies is the expression level of a gene.
Microarray technology allows us to measure the expression of thousands of genes at once (compare the Northern blot, which measures one gene at a time).
Measure the expression of thousands of genes under different experimental conditions and ask what is different and why.
Microarray technology
[Diagram (Ramaswamy and Golub, JCO): RNA from a biological sample is labeled and hybridized to an array. Oligonucleotide arrays: a test sample labeled with PE, on an array made by oligonucleotide synthesis. cDNA arrays: test and reference samples labeled with Cy3/Cy5, on an array spotted with PCR products from a cDNA clone library.]
Microarray technology
[Images: oligonucleotide and cDNA arrays (Lockhart and Winzeler 2000).]
Microarray experiment
Yeast experiment
Analytic challenge
When the science is not well understood, resort to statistics:
Infer cancer genetics by analyzing microarray data from tumors.
Ultimate goal: discover the genetic pathways of cancers.
Immediate goal: build models that discriminate tumor types or treatment outcomes, and determine the genes used in the model.
Basic difficulty: few examples (20-100) and high dimensionality (7,000-16,000 genes measured for each sample) — an ill-posed problem.
Curse of dimensionality: far too few examples for so many dimensions to predict accurately.
Cancer Classification
38 examples of myeloid (AML) and lymphoblastic (ALL) leukemias
Affymetrix human 6800 array (7,128 genes including control genes)
34 examples to test the classifier
Results: 33/34 correct
[Plot: test data plotted by d, the perpendicular distance from the hyperplane.]
Coregulation and kernels
Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression levels.
Size of feature space: if there are 7,000 genes, the pairwise feature space has about 24 million features, so the fact that the feature space is never computed explicitly is important.
Two gene example: two genes measuring Sonic Hedgehog and TrkC
$x = (e_{sh}, e_{Trk})$
$\varphi(x) = (e_{sh}^2,\ e_{Trk}^2,\ \sqrt{2}\, e_{Trk} e_{sh},\ \sqrt{2}\, e_{sh},\ \sqrt{2}\, e_{Trk},\ 1)$
$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j) = (x_i \cdot x_j + 1)^2$
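A quick numerical check of this identity (a sketch, not part of the slides): the explicit degree-2 feature map and the kernel shortcut agree, which is why the ~24-million-dimensional feature space never has to be built.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for x = (x1, x2)."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([x1**2, x2**2, r2*x1*x2, r2*x1, r2*x2, 1.0])

def k_poly2(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y + 1)^2."""
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([0.3, -1.2])   # e.g. (e_sh, e_Trk) expression values
y = np.array([2.0, 0.5])
assert np.isclose(k_poly2(x, y), phi(x) @ phi(y))  # same value either way
```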
Gene coregulation
A nonlinear SVM helps when the most informative genes are removed; informativeness is ranked using the signal-to-noise statistic of Golub et al. (see the sketch after the table).
Genes removed   1st order errors   2nd order errors   3rd order errors
0               1                  1                  1
10              2                  1                  1
20              3                  2                  1
30              3                  3                  2
40              3                  3                  2
50              3                  2                  2
100             3                  3                  2
200             3                  3                  3
1500            7                  7                  8
(errors for polynomial kernels of degree 1, 2, and 3)
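The signal-to-noise ranking used above is easy to reproduce. A minimal sketch (synthetic placeholder data; the per-gene statistic is the one defined by Golub et al.):

```python
import numpy as np

def signal_to_noise(X, y):
    """Golub et al. score per gene: (mu_1 - mu_0) / (sd_1 + sd_0),
    with means and standard deviations taken within each class."""
    X1, X0 = X[y == 1], X[y == 0]
    return (X1.mean(0) - X0.mean(0)) / (X1.std(0) + X0.std(0))

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7128))          # samples x genes (synthetic)
y = (np.arange(38) < 27).astype(int)     # e.g. 27 ALL vs 11 AML labels
ranked = np.argsort(-np.abs(signal_to_noise(X, y)))  # most informative first
print(ranked[:50])                       # genes to keep (or remove, as above)
```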
Rejecting samples
Golub et al. classified 29 test points correctly and rejected 5 (of which 2 were errors) using 50 genes.
We need to introduce the concept of rejection into the SVM.
[Plot in the (g1, g2) gene-expression plane: a cancer region and a normal region separated by a reject band.]
Rejecting samples
Estimating a CDF
The regularized solution
Rejections for SVMs
[Plot: P(c = 1 | d) as a function of the distance d from the hyperplane; requiring 95% confidence (p = .05, i.e. P(c = 1 | d) = .95) corresponds to a threshold d = .107.]
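A sketch of how such a reject option can be bolted onto a trained SVM (the 0.107 threshold is the one read off the plot above; note scikit-learn's `decision_function` returns f(x), which is only proportional to the geometric distance, so in practice one would rescale by 1/||w||):

```python
import numpy as np
from sklearn.svm import SVC

def predict_with_reject(clf, X, d_min=0.107):
    """Label +1/-1 when |d| >= d_min, otherwise reject (label 0)."""
    d = clf.decision_function(X)      # signed distance to the hyperplane
    labels = np.where(d >= 0, 1, -1)
    labels[np.abs(d) < d_min] = 0     # low-confidence points are rejected
    return labels

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(38, 10))
y_tr = np.sign(X_tr[:, 0])            # synthetic two-class labels
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print(predict_with_reject(clf, rng.normal(size=(5, 10))))
```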
Results with rejections
Results: 31 correct, 3 rejected of which 1 is an error
[Plot: test data plotted by distance d from the hyperplane.]
Gene selection
SVMs as stated use all genes/features.
Molecular biologists/oncologists seem convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes that are most important in discriminating.
Practical reasons: a clinical device measuring thousands of genes is not financially practical.
Possible performance improvement.
Wrapper methods for gene/feature selection.
Results with gene selection
AML vs ALL: 40 genes, 34/34 correct, 0 rejects.
5 genes, 31/31 correct, 3 rejects of which 1 is an error.
B vs T cells for ALL: 10 genes, 33/33 correct, 0 rejects.
[Plots: test data plotted by distance d from the hyperplane.]
Two feature selection algorithms
Recursive feature elimination (RFE): based on perturbation analysis; eliminate the genes that perturb the margin the least.
Optimize leave-one-out (LOO): based on optimizing a leave-one-out error bound for the SVM; the leave-one-out error is unbiased.
Recursive feature elimination
1. Solve the SVM problem for the weight vector w.
2. Rank the elements of w by absolute value.
3. Discard the input features/genes corresponding to the vector elements with the smallest absolute magnitude (e.g., the smallest 10%).
4. Retrain the SVM on the reduced gene set and go to step (2).
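A sketch of these steps with a linear SVM (scikit-learn assumed; the 10% drop fraction follows step 3):

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe(X, y, n_keep, drop_frac=0.10):
    """Recursive feature elimination: train, rank |w|, drop the smallest
    10% of genes, retrain on the survivors, repeat."""
    active = np.arange(X.shape[1])                # surviving gene indices
    while active.size > n_keep:
        w = LinearSVC(dual=False).fit(X[:, active], y).coef_.ravel()
        n_drop = max(1, int(drop_frac * active.size))
        order = np.argsort(np.abs(w))             # smallest |w_j| first
        active = active[np.sort(order[n_drop:])]  # keep the rest, retrain
    return active
```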
Optimizing the LOO
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features, searching over all possible subsets of n features for the one that minimizes the bound.
When such a search is impossible because of combinatorial explosion, scale each feature by a real-valued variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
The rescaling can be done in the input space or in a "Principal Components" space.
Pictorial demonstration
Rescale features to minimize the LOO bound R²/M².
[Two plots in the (x1, x2) plane: before rescaling, R²/M² > 1; after rescaling, M = R so R²/M² = 1.]
Three LOO bounds
Radius margin bound: simple to compute, continuous; very loose but often tracks LOO well.
Jaakkola-Haussler bound: somewhat tighter, simple to compute; discontinuous, so it needs smoothing; valid only for SVMs with no b term.
Span bound: tight as a Britney Spears outfit; complicated to compute; discontinuous, so it needs smoothing.
Classification function with scaling
We add a scaling parameter σ to the SVM, which scales genes; genes corresponding to small σ_j are removed.
The SVM function has the form:
$f(x) = \sum_{i \in SV} y_i \alpha_i K_\sigma(x, x_i) + b$
where $K_\sigma(x, y) = K(x * \sigma, y * \sigma)$, $\sigma \in \mathbb{R}^n$, and $u * v$ denotes element-by-element multiplication.
SVM and other functionals
The α's are computed by maximizing the following quadratic form:
$W^2(\alpha, \sigma) = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \left( K_\sigma(x_i, x_j) + \tfrac{1}{C} \delta_{ij} \right)$, subject to $\alpha_i \ge 0$.
For computing the radius around the data, maximize
$R^2(\beta, \sigma) = \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j)$, subject to $\beta_i \ge 0$ and $\sum_i \beta_i = 1$.
For the variance around the data:
$V^2(\sigma) = \tfrac{1}{\ell} \sum_i K_\sigma(x_i, x_i) - \tfrac{1}{\ell^2} \sum_{i,j} K_\sigma(x_i, x_j)$.
Remember $L(D) \le T = \tfrac{1}{\ell} W^2(\alpha, \sigma) R^2(\beta, \sigma)$, or $T = \tfrac{1}{\ell} W^2(\alpha, \sigma) V^2(\sigma)$.
Algorithm
The following steps are used to compute α and σ:
1. Initialize σ = (1, ..., 1).
2. Solve the standard SVM problem: α(σ) = arg max_α W²(α, σ).
3. Minimize the estimate of error T with respect to σ with a gradient step.
4. If a local minimum of T is not reached, go to step 3.
5. Discard the dimensions corresponding to small elements of σ and return to step 2.
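A compact sketch of this loop, under stated assumptions: scikit-learn's SVC with a precomputed kernel handles step 2; the variance V² stands in for R² (as the slides allow); and the analytic gradients on the next slide are approximated by finite differences.

```python
import numpy as np
from sklearn.svm import SVC

def K_mat(X, sigma):
    """Gram matrix of the scaled degree-2 polynomial kernel K_sigma."""
    Xs = X * sigma                        # element-wise feature scaling
    return (Xs @ Xs.T + 1.0) ** 2

def W2(X, y, sigma, C=1.0):
    """Dual objective W^2 at the SVM optimum (step 2)."""
    K = K_mat(X, sigma)
    clf = SVC(kernel="precomputed", C=C).fit(K, y)
    ay = np.zeros(len(y))
    ay[clf.support_] = clf.dual_coef_.ravel()    # entries are alpha_i * y_i
    return np.abs(ay).sum() - 0.5 * ay @ K @ ay

def V2(X, sigma):
    """Variance of the data in feature space (proxy for R^2)."""
    K = K_mat(X, sigma)
    n = len(K)
    return np.trace(K) / n - K.sum() / n ** 2

def T(X, y, sigma):
    """Error estimate T = W^2 * V^2 / n, minimized over sigma (steps 3-4)."""
    return W2(X, y, sigma) * V2(X, sigma) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=30))

sigma, lr, eps = np.ones(5), 1e-3, 1e-4
for _ in range(20):                       # gradient descent on T over sigma
    g = np.array([(T(X, y, sigma + eps*e) - T(X, y, sigma - eps*e)) / (2*eps)
                  for e in np.eye(5)])
    sigma = np.clip(sigma - lr * g, 0.0, None)
print(sigma)                              # step 5: drop genes with small sigma
```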
Computing gradients
The gradient with respect to the margin:
$\frac{\partial W^2}{\partial \sigma_f} = -\tfrac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}$
The gradient with respect to the radius:
$\frac{\partial R^2}{\partial \sigma_f} = \sum_i \beta_i \frac{\partial K_\sigma(x_i, x_i)}{\partial \sigma_f} - \sum_{i,j} \beta_i \beta_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}$
The gradient with respect to the span:
$\frac{\partial \tilde{S}_i^2}{\partial \sigma_f} = \tilde{S}_i^4 \left( \tilde{K}_{SV}^{-1} \frac{\partial \tilde{K}_{SV}}{\partial \sigma_f} \tilde{K}_{SV}^{-1} \right)_{ii}$, where $\tilde{K}_{SV} = \begin{pmatrix} K & 1 \\ 1^T & 0 \end{pmatrix}$.
Toy data
[Plots of error rate vs. number of samples: a nonlinear problem with 2 relevant dimensions out of 52, and a linear problem with 6 relevant dimensions out of 202.]
Molecular classification of cancer
Dataset                       Total Samples   Class 0        Class 1
Leukemia Morphology (train)   38              27 ALL         11 AML
Leukemia Morphology (test)    34              20 ALL         14 AML
Leukemia Lineage (ALL)        23              15 B-Cell       8 T-Cell
Leukemia Outcome (AML)        15               8 Low risk     7 High risk
Lymphoma Morphology           77              19 FSC         58 DLCL
Lymphoma Outcome              58              22 Low risk    36 High risk
Brain Morphology              41              14 Glioma      27 MD
Brain Outcome                 50              38 Low risk    12 High risk

Hierarchy of difficulty:
1. Histological differences: normal vs. malignant, skin vs. brain
2. Morphologies: different leukemia types, ALL vs. AML
3. Lineage: B-Cell vs. T-Cell, follicular vs. large B-cell lymphoma
4. Outcome: treatment outcome, relapse, or drug sensitivity
Morphology classification
Dataset                      Algorithm   Total Samples   Total Errors   Class 1 Errors   Class 0 Errors   Number of Genes
Leukemia Morphology (test)   SVM         35              0/35           0/21             0/14             40
(AML vs ALL)                 WV          35              2/35           1/21             1/14             50
                             k-NN        35              3/35           1/21             2/14             10
Leukemia Lineage (ALL)       SVM         23              0/23           0/15             0/8              10
(B vs T)                     WV          23              0/23           0/15             0/8              9
                             k-NN        23              0/23           0/15             0/8              10
Lymphoma                     SVM         77              4/77           2/32             2/35             200
(FS vs DLCL)                 WV          77              6/77           1/32             5/35             30
                             k-NN        77              3/77           1/32             2/35             250
Brain                        SVM         41              1/41           1/27             0/14             100
(MD vs Glioma)               WV          41              1/41           1/27             0/14             3
                             k-NN        41              0/41           0/27             0/14             5
Outcome classification
Dataset                   Algorithm   Total Samples   Total Errors   Class 1 Errors   Class 0 Errors   Number of Genes
Lymphoma                  SVM         58              13/58          3/32             10/26            100
(LBC treatment outcome)   WV          58              15/58          5/32             10/26            12
                          k-NN        58              15/58          8/32             7/26             15
Brain                     SVM         50              7/50           6/12             1/38             50
(MD treatment outcome)    WV          50              13/50          6/12             7/38             6
                          k-NN        50              10/50          6/12             4/38             5
Outcome classification
Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.
[Kaplan-Meier plots for the Lymphoma and Medulloblastoma outcome predictions; p-values 0.0015 and 0.00039.]
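A sketch of this kind of analysis with the lifelines package (assumed available; the survival times below are synthetic placeholders, not the study data):

```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
# Synthetic follow-up times (months) and event indicators for the two
# predicted outcome groups; real values would come from clinical records.
t_lo, t_hi = rng.exponential(90, 30), rng.exponential(30, 28)
e_lo, e_hi = rng.random(30) < 0.6, rng.random(28) < 0.8

km = KaplanMeierFitter()
km.fit(t_lo, event_observed=e_lo, label="predicted low risk")
ax = km.plot_survival_function()
km.fit(t_hi, event_observed=e_hi, label="predicted high risk")
km.plot_survival_function(ax=ax)

res = logrank_test(t_lo, t_hi, event_observed_A=e_lo, event_observed_B=e_hi)
print(f"log-rank p-value: {res.p_value:.5f}")
```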
Multi tumor classification
Tumor (Abbrev.)     Total   Train   Test
Breast (B)          11      8       3
Prostate (P)        10      8       2
Lung (L)            11      8       3
Colorectal (CR)     13      8       5
Lymphoma (Ly)       22      16      6
Bladder (Bl)        11      8       3
Melanoma (M)        10      8       2
Uterus (U)          10      8       3
Leukemia (Le)       30      24      6
Renal (R)           11      8       3
Pancreas (PA)       11      8       3
Ovary (Ov)          11      8       3
Mesothelioma (MS)   11      8       3
Brain (C)           20      16      4

Note that most of these tumors came from secondary sources and were not taken at the tissue of origin.
Clustering is not accurate
CNS, Lymphoma, Leukemia tumors separate
Adenocarcinomas do not separate
Multi tumor classification
Combination approaches: all pairs, one versus all (OVA).
[Diagram: four classes (R, B, G, Y) and the ±1 labelings induced by pairwise classifiers such as B+R and G+R.]
Supervised methodology
[Figure 2: OVA classification. A test sample's gene expression profile is passed to each OVA classifier (Breast OVA, CNS OVA, ..., Prostate OVA), each separating one tumor type (e.g., breast tumors) from all other tumors by a hyperplane with an associated confidence. The final multiclass call is the class with the highest OVA prediction strength, e.g., breast (high confidence).]
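A sketch of the OVA scheme in the figure (scikit-learn assumed; class codes are illustrative and follow the abbreviations above):

```python
import numpy as np
from sklearn.svm import SVC

def train_ova(X, y, classes):
    """One SVM per tumor type: that type (+1) vs. all other tumors (-1)."""
    return {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1))
            for c in classes}

def predict_ova(ova, X):
    """Final call = class whose OVA classifier has the highest strength."""
    names = list(ova)
    strength = np.column_stack([ova[c].decision_function(X) for c in names])
    calls = np.array(names)[strength.argmax(axis=1)]
    confidence = strength.max(axis=1)   # low values flag low-confidence calls
    return calls, confidence

rng = np.random.default_rng(0)
X = rng.normal(size=(144, 50))                     # synthetic expression data
y = rng.choice(["B", "P", "Le"], size=144)         # synthetic tumor labels
calls, conf = predict_ova(train_ova(X, y, ["B", "P", "Le"]), X[:5])
print(calls, conf)
```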
Well differentiated tumors
[Plots for train/cross-validation and Test 1: accuracy vs. confidence (low to high), fraction of calls at each confidence level, and accuracy of the first, top-2, and top-3 prediction calls.]
Dataset   Sample Type           Validation Method   Sample Number   Total Accuracy   High Conf. Fraction / Accuracy   Low Conf. Fraction / Accuracy
Train     Well Differentiated   Cross-val.          144             78%              80% / 90%                        20% / 28%
Test 1    Well Differentiated   Train/Test          54              78%              78% / 83%                        22% / 58%
Feature selection hurts performance
Poorly differentiated tumors
[Plots for the poorly differentiated test set: accuracy vs. confidence (low to high), fraction of calls at each confidence level, and accuracy of the first, top-2, and top-3 prediction calls.]
Dataset   Sample Type             Validation Method   Sample Number   Total Accuracy   High Conf. Fraction / Accuracy   Low Conf. Fraction / Accuracy
Test      Poorly Differentiated   Train/test          20              30%              50% / 50%                        50% / 10%
Predicting sample sizes
For a given number of existing samples, how significant is the performance of a classifier? Does the classifier work at all? Are the results better than what one would expect by chance?
Assuming we know the answers to the previous questions for a number of sample sizes: what would be the performance of the classifier when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?
Predicting sample sizes
Subsampling procedure: from the ℓ samples in the input dataset, draw training sets of sizes n1, n2, ..., ℓ; for each size, train and test classifiers over T1 random train/test realizations (i = 1, 2, ..., T1) and average the test errors to obtain error rates e_n1, e_n2, ..., e_ℓ.
A significance test (comparison with random predictors) determines which error rates are significant and which are not.
Fit a learning curve through the significant error rates:
$e(n) = a n^{-\alpha} + b$
This model can be used to estimate sample size requirements.
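A sketch of fitting the inverse power-law learning curve with SciPy (the error rates below are illustrative stand-ins for the averaged subsampling errors e_n):

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, alpha, b):
    """Inverse power law: e(n) = a * n**(-alpha) + b."""
    return a * n ** (-alpha) + b

# Average subsampled error rates at each training-set size (illustrative).
n_obs = np.array([10.0, 15, 20, 25, 30, 38])
e_obs = np.array([0.32, 0.27, 0.24, 0.22, 0.21, 0.19])

(a, alpha, b), _ = curve_fit(learning_curve, n_obs, e_obs, p0=(1.0, 0.5, 0.1))
# Extrapolate: is collecting more samples worthwhile?
print(f"predicted error with 100 samples: {learning_curve(100, a, alpha, b):.3f}")
```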
Statistical significance
[Plots: the statistical significance of the tumor vs. nontumor classification for (a) 15 samples and (b) 30 samples.]
Learning curves
Tumor vs. normal
Learning curves
Tumor vs. normal
Learning curves
AML vs. ALL
Learning curves
Colon cancer
Learning curves
Brain treatment outcome