A statistical framework for
genomic data fusion
William Stafford Noble
Department of Genome Sciences
Department of Computer Science and
Engineering
University of Washington
Outline
• Recognizing correctly identified peptides
• The support vector machine algorithm
• Experimental results
• Yeast protein classification
• SVM learning from heterogeneous data
• Results
Recognizing correctly
identified peptides
[Diagram of database search: a protein sample is analyzed by a tandem mass spectrometer to produce observed spectra; a search algorithm compares the observed spectra against a sequence database to produce predicted peptides]
The learning task
[Figure: an observed spectrum paired with a theoretical spectrum]
• We are given paired observed and
theoretical spectra.
• Question: Is the pairing correct?
Properties of the observed
spectrum
1. Total peptide mass. Too small yields little
information; too large (>25 amino acids) yields
uneven fragmentation.
2. Charge (+1, +2 or +3). Provides some
evidence about amino acid composition.
3. Total ion current. Proportional to the amount of
peptide present.
4. Peak count. Small indicates poor
fragmentation; large indicates noise.
Observed vs. theoretical spectra
5. Mass difference.
6. Percent of ions matched. Number of matched ions / total number of ions.
7. Percent of peaks matched. Number of matched peaks / total number of peaks.
8. Percent of peptide fragment ion current matched. Total intensity of matched peaks / total intensity of all peaks.
9. Preliminary SEQUEST score (Sp).
10. Preliminary score rank.
11. SEQUEST cross-correlation (XCorr).
Top-ranked vs. second-ranked
peptides
12. Change in cross-correlation. Compute the difference in XCorr between the top-ranked and second-ranked peptides.
13. Percent sequence identity. Usually anticorrelated with the change in cross-correlation.
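Taken together, items 1-13 form the parameter vector used to represent each candidate identification. A minimal sketch of how such a vector might be assembled (the function name and dictionary keys are illustrative, not from the original software):

```python
import numpy as np

def peptide_features(m):
    """Assemble the 13 parameters for one observed/theoretical spectrum pair.
    `m` is assumed to be a dict of precomputed match statistics."""
    return np.array([
        m["peptide_mass"],           # 1. total peptide mass
        m["charge"],                 # 2. precursor charge (+1, +2 or +3)
        m["total_ion_current"],      # 3. total ion current
        m["peak_count"],             # 4. number of peaks
        m["mass_difference"],        # 5. observed vs. theoretical mass
        m["frac_ions_matched"],      # 6. matched ions / total ions
        m["frac_peaks_matched"],     # 7. matched peaks / total peaks
        m["frac_intensity_matched"], # 8. matched intensity / total intensity
        m["sp"],                     # 9. preliminary SEQUEST score (Sp)
        m["sp_rank"],                # 10. preliminary score rank
        m["xcorr"],                  # 11. SEQUEST cross-correlation (XCorr)
        m["delta_xcorr"],            # 12. XCorr(top) - XCorr(second)
        m["seq_identity"],           # 13. % identity, top vs. second peptide
    ])
```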
[Figure: positive and negative training examples]
The support vector machine
algorithm
SVMs in computational biology
• Splice site recognition
• Protein sequence similarity detection
• Protein functional classification
• Regulatory module search
• Protein-protein interaction prediction
• Gene functional classification from microarray data
• Cancer classification from microarray data
Support vector machine
[Figure: positive (+) and negative (−) examples in the plane]
• Locate a plane that separates positive from negative examples.
• Focus on the examples closest to the boundary.
Kernel matrix
K(X, Y) = (X \cdot Y + 1)^3   (polynomial kernel, degree 3)

K(X, Y) = \exp\left( -\frac{\|X - Y\|^2}{2\sigma^2} \right)   (Gaussian kernel)
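A minimal sketch of these two kernels in NumPy (function names are mine, not from the original work):

```python
import numpy as np

def poly_kernel(X, Y, degree=3):
    """K(x, y) = (x . y + 1)^degree for all pairs of rows of X and Y."""
    return (X @ Y.T + 1.0) ** degree

def gaussian_kernel(X, Y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = (
        (X * X).sum(axis=1)[:, None]
        + (Y * Y).sum(axis=1)[None, :]
        - 2.0 * (X @ Y.T)
    )
    return np.exp(-sq_dists / (2.0 * sigma ** 2))
```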
Kernel functions
• Let X be a finite input space.
• A kernel is a function K such that, for all x, z \in X, K(x, z) = \Phi(x) \cdot \Phi(z), where \Phi is a mapping from X to an (inner product) feature space F.
• Let K(x, z) be a symmetric function on X. Then K(x, z) is a kernel function if and only if the matrix K = (K(x_i, x_j))_{i,j=1}^{n} is positive semi-definite.
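On a finite data set, this Mercer condition can be checked numerically by verifying that the Gram matrix has no (significantly) negative eigenvalues. A sketch:

```python
import numpy as np

def is_valid_gram_matrix(K, tol=1e-8):
    """Return True if the symmetric matrix K is positive semi-definite,
    i.e. all eigenvalues are >= -tol (allowing for round-off error)."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)
```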
Peptide ID kernel function
• Let p(x,y) be the function that computes a
13-element vector of parameters for a pair
of spectra, x and y.
• The kernel function K operates on pairs of
observed and theoretical spectra:
K S oA : StA , S oB : StB   K  p S oA , StA , p S oB , StB 
  p S , S  p S , S
A
o
A
t
B
o
B
t
  1
2
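In other words, the peptide-identification kernel is a second-degree polynomial kernel applied to the 13-element parameter vectors. A sketch, reusing the hypothetical feature-extraction function from above:

```python
def peptide_id_kernel(pair_a, pair_b, p):
    """K(S_o^A:S_t^A, S_o^B:S_t^B) = (p(A) . p(B) + 1)^2, where p maps an
    observed/theoretical spectrum pair to its 13-element parameter vector."""
    va = p(*pair_a)   # pair_a = (observed_A, theoretical_A)
    vb = p(*pair_b)
    return (float(va @ vb) + 1.0) ** 2
```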
Experimental results
Experimental design
• Data consists of one 13-element vector per
predicted peptide.
• Each feature is normalized to sum to 1.0 across
all examples.
• The SVM is tested using leave-one-out cross-validation.
• The SVM uses a second-degree polynomial,
normalized kernel with a 2-norm asymmetric soft
margin.
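A rough sketch of this protocol using scikit-learn; note that SVC implements the standard 1-norm soft margin rather than the 2-norm asymmetric soft margin used here, and the synthetic data merely stand in for the real feature vectors:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

def normalized_poly_kernel(X, Y, degree=2):
    """Second-degree polynomial kernel, normalized so that K(x, x) = 1."""
    K = (X @ Y.T + 1.0) ** degree
    kx = ((X * X).sum(axis=1) + 1.0) ** degree   # K(x, x) for rows of X
    ky = ((Y * Y).sum(axis=1) + 1.0) ** degree   # K(y, y) for rows of Y
    return K / np.sqrt(np.outer(kx, ky))

# Placeholder data: one 13-element vector per predicted peptide,
# labels +1 (correct) / -1 (incorrect).
rng = np.random.default_rng(0)
X = rng.random((60, 13))
y = rng.choice([-1, 1], size=60)

X = X / X.sum(axis=0)                    # each feature sums to 1.0 over all examples
K = normalized_poly_kernel(X, X)
clf = SVC(kernel="precomputed", C=1.0)   # 1-norm soft margin (differs from the slide)
scores = cross_val_score(clf, K, y, cv=LeaveOneOut())
print("Leave-one-out accuracy:", scores.mean())
```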
Three data sets
• Set 1: Ion trap mass spectrometer.
Sequest search on the full non-redundant
database.
• Set 2: Ion trap mass spectrometer.
Sequest search on human NRDB.
• Set 3: Quadrupole time-of-flight mass spectrometer. Sequest search on human NRDB.
Data set sizes

                  Positive   Negative   Total
Ion-trap NRDB        497        479       976
Ion-trap HNRDB       696        465      1161
QTOF HNRDB          1017        523      1540
Conversion to probabilities
• Hold out a subset of the training
examples.
• Use the hold-out set to fit a sigmoid.
  Pr(y = 1 \mid f) = \frac{1}{1 + e^{Af + B}}

  where y is the label and f is the SVM discriminant value.
• This is equivalent to assuming that the SVM output is proportional to the log-odds of a positive example.
Pr x  
1
1 e
1.15 x 0.114
Yeast protein classification
Membrane proteins
• Membrane proteins anchor in a cellular membrane (plasma, ER, Golgi, mitochondrial).
• Communicate across the membrane.
• Pass through the membrane several times.
Heterogeneous data
• sequence data
• mRNA expression data
• protein-protein interaction data
Vector representation
• Each matrix entry is an mRNA expression measurement.
• Each column is an experiment.
• Each row corresponds to a gene.

K(X, Y) = \exp\left( -\frac{\|X - Y\|^2}{2\sigma^2} \right)
Sequence kernels
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKY
DGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVN
LVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVD
NVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDA
QSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKI
DALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALE
KFDKALKALPMHIRLSFNPTQLEEQCHI
• We cannot compute a scalar product on a
pair of variable-length, discrete strings.
Pairwise comparison kernel
Pairwise kernel variants
• Smith-Waterman all-vs-all
• BLAST all-vs-all
• Smith-Waterman w.r.t. SCOP database
• E-values from Pfam database
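One standard way to build such a pairwise comparison kernel is the empirical kernel map: represent each protein by its vector of similarity scores against every training protein, then take inner products of those vectors. A sketch, assuming a precomputed all-vs-all score matrix:

```python
import numpy as np

def pairwise_comparison_kernel(S):
    """S is an n x n matrix of pairwise similarity scores
    (e.g. Smith-Waterman or BLAST, all-vs-all). Protein i is
    represented by the score vector S[i]; the kernel is the
    matrix of inner products between these vectors."""
    S = np.asarray(S, dtype=float)
    return S @ S.T
```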
Protein-protein interactions
• Pairwise interactions can be represented as a graph or a matrix.
[Figure: interaction graph and its binary adjacency matrix, rows and columns indexed by protein]
Linear interaction kernel
[Figure: the dot product of two rows of the interaction matrix, e.g. 10010101 · 10101101 = 3]
• The simplest kernel counts the number of interaction partners shared by each pair of proteins.
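Equivalently, the linear interaction kernel is the inner product of two proteins' binary interaction vectors. A sketch:

```python
import numpy as np

def linear_interaction_kernel(A):
    """A is an n x n binary matrix with A[i, j] = 1 if proteins i and j
    interact. K[i, j] then counts the interaction partners shared by i and j."""
    A = np.asarray(A, dtype=float)
    return A @ A.T

# The two rows 10010101 and 10101101 from the figure share 3 interactions.
a = np.array([[1, 0, 0, 1, 0, 1, 0, 1],
              [1, 0, 1, 0, 1, 1, 0, 1]])
print(int(a[0] @ a[1]))  # 3
```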
Diffusion kernel
• A general method for establishing similarities
between nodes of a graph.
• Based upon a random walk.
• Efficiently accounts for all paths connecting two
nodes, weighted by path lengths.
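A sketch of the Kondor-Lafferty diffusion kernel, computed as a matrix exponential of the graph's negative Laplacian; the parameter beta controls how far the random walk diffuses:

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(A, beta=1.0):
    """Diffusion kernel on a graph with adjacency matrix A:
    K = expm(beta * (A - D)), with D the diagonal degree matrix.
    This sums contributions from all paths between two nodes,
    down-weighting longer paths."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))
    return expm(beta * (A - D))
```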
Hydrophobicity profile
[Figure: hydrophobicity profiles of a membrane protein and a non-membrane protein]
• Transmembrane regions are typically
hydrophobic, and vice versa.
• The hydrophobicity profile of a membrane
protein is evolutionarily conserved.
Hydrophobicity kernel
• Generate hydropathy profile from amino
acid sequence using Kyte-Doolittle index.
• Prefilter the profiles.
• Compare two profiles by
– Computing fast Fourier transform (FFT), and
– Applying Gaussian kernel function.
• This kernel detects periodicities in the
hydrophobicity profile.
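A rough sketch of this idea; it omits the prefiltering step and fixes the FFT length, so it is a simplification rather than the published kernel:

```python
import numpy as np

# Kyte-Doolittle hydropathy index.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def fft_profile(seq, n_points=512):
    """Hydropathy profile -> magnitude spectrum of fixed length."""
    profile = np.array([KD[aa] for aa in seq if aa in KD])
    return np.abs(np.fft.rfft(profile, n=n_points))

def hydrophobicity_kernel(seq_x, seq_y, sigma=10.0):
    """Gaussian kernel on the FFTs of two hydropathy profiles."""
    fx, fy = fft_profile(seq_x), fft_profile(seq_y)
    return float(np.exp(-np.sum((fx - fy) ** 2) / (2.0 * sigma ** 2)))
```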
SVM learning from
heterogeneous data
Combining kernels
[Figure: two representations A and B can be combined either by concatenating the data (A:B) and computing K(A:B), or by summing the individual kernel matrices K(A) + K(B); the two results are identical]
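For a linear kernel the identity in the figure is easy to verify numerically: the kernel of the concatenated vectors equals the sum of the individual kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))   # feature vectors from data source A
B = rng.random((5, 4))   # feature vectors from data source B

K_A = A @ A.T                                        # linear kernel on A
K_B = B @ B.T                                        # linear kernel on B
K_concat = np.hstack([A, B]) @ np.hstack([A, B]).T   # kernel on concatenation A:B

print(np.allclose(K_concat, K_A + K_B))              # True
```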
Semidefinite programming
• Define a convex cost function to assess
the quality of a kernel matrix.
• Semidefinite programming (SDP)
optimizes convex cost functions over the
convex cone of positive semidefinite
matrices.
Semidefinite programming
Learn K from the convex cone of positive-semidefinite matrices, or a convex subset of it, according to a convex quality measure:
• Integrate constructed kernels: learn a linear mix K = \sum_i \mu_i K_i
• Large margin classifier (SVM): maximize the margin
Both steps are solved jointly as an SDP.
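The SDP learns the weights μ_i and the classifier jointly; a much simpler stand-in (a sketch, not the authors' optimization) fixes nonnegative weights, forms the combined kernel, and trains an SVM on it as a precomputed kernel:

```python
from sklearn.svm import SVC

def combine_kernels(kernels, weights):
    """Weighted sum K = sum_i mu_i * K_i of precomputed Gram matrices."""
    return sum(w * K for w, K in zip(weights, kernels))

def train_on_combined(kernels, weights, y):
    """Train an SVM on a fixed weighted combination of kernels.
    (The SDP approach instead learns the weights by maximizing the margin.)"""
    K = combine_kernels(kernels, weights)
    clf = SVC(kernel="precomputed")
    clf.fit(K, y)
    return clf
```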
Experimental results
Seven yeast kernels

Kernel   Data                   Similarity measure
K_SW     protein sequence       Smith-Waterman
K_B      protein sequence       BLAST
K_HMM    protein sequence       Pfam HMM
K_FFT    hydropathy profile     FFT
K_LI     protein interactions   linear kernel
K_D      protein interactions   diffusion kernel
K_E      gene expression        radial basis kernel
Membrane proteins
Comparison of performance
[Figure: performance compared with simple rules from the hydrophobicity profile and with TMHMM]
Cytoplasmic ribosomal proteins
False negative predictions
False negative expression
profiles
Markov Random Field
• General Bayesian method, applied by
Deng et al. to yeast functional
classification.
• Used five different types of data.
• For their model, the input data must be
binary.
• Reported improved accuracy compared to
using any single data type.
Yeast functional classes

Category                       Size
Metabolism                     1048
Energy                          242
Cell cycle & DNA processing     600
Transcription                   753
Protein synthesis               335
Protein fate                    578
Cellular transport              479
Cell rescue, defense            264
Interaction w/ environment      193
Cell fate                       411
Cellular organization           192
Transport facilitation          306
Other classes                    81
Six types of data
• Presence of Pfam domains.
• Genetic interactions from CYGD.
• Physical interactions from CYGD.
• Protein-protein interaction by TAP.
• mRNA expression profiles.
• (Smith-Waterman scores).
Results
[Figure: performance of the MRF compared with SDP/SVM (binary) and SDP/SVM (enriched)]
Many yeast kernels
• protein sequence
• phylogenetic profiles
• separate gene expression kernels
• time series expression kernel
• promoter regions using seven aligned species
• protein localization
• ChIP
• protein-protein interactions
• yeast knockout growth data
• more ...
Future work
• New kernel functions that incorporate
domain knowledge.
• Better understanding of the semantics of
kernel weights.
• Further investigation of yeast biology.
• Improved scalability of the algorithm.
• Prediction of protein-protein interactions.
Acknowledgments
• Dave Anderson, University of Oregon
• Wei Wu, Genome Sciences, UW & FHCRC
• Michael Jordan, Statistics & EECS, UC Berkeley
• Laurent El Ghaoui, EECS, UC Berkeley
• Gert Lanckriet, EECS, UC Berkeley
• Nello Cristianini, Statistics, UC Davis
Fisher criterion score
F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}

[Figure: feature distributions yielding a low score and a high score]
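A sketch of the per-feature computation behind the rankings below:

```python
import numpy as np

def fisher_score(x, y):
    """Fisher criterion for one feature x given labels y in {-1, +1}:
    (mean_pos - mean_neg)^2 / (var_pos + var_neg)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    pos, neg = x[y == 1], x[y == -1]
    return (pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var())
```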
Feature ranking

Feature                      Score
delta Cn                     2.861
% match total ion current    2.804
Cn                           2.444
% match peaks                2.314
Sp                           1.158
mass                         0.704
charge                       0.488
rank Sp                      0.313
peak count                   0.209
sequence similarity          0.115
% ion match                  0.079
total ion current            0.026
delta mass                   0.024
Pairwise feature ranking

Feature pair                       Score
% match TIC-delta Cn               4.741
% match peaks-delta Cn             4.233
% match TIC-Cn                     3.819
delta Cn-Cn                        3.597
delta Cn-charge                    3.563
% match peaks-Cn                   3.377
delta Cn-mass                      3.119
% match TIC-% match peaks          2.823
% ion match-delta Cn               2.812
Sp-delta Cn                        2.799
% match TIC-Sp                     2.579
% match peaks-Sp                   2.383
% match TIC-mass                   2.097
% ion match-mass                   2.091
Cn-charge                          1.943
Sp-mass                            1.922
% match TIC-charge                 1.898
Cn-mass                            1.884
Sp-Cn                              1.881
% ion match-Cn                     1.827
Sp-charge                          1.770
% match peaks-mass                 1.668
% match peaks-charge               1.528
% match TIC-% ion match            1.473
Download