Neural Networks and Kernels for Learning Discrete Data Structures

Paolo Frasconi
MLNN Group, Università di Firenze
http://www.dsi.unifi.it/neural/
FLF 2005, Burnontige

Outline
Neural networks for data structures
Graphical models and neural nets
Preferences on syntactic structures
Link prediction in protein structures
Kernels for data structures
Explorations between “all-substructure kernels” and probability product kernels

Supervised learning on graphs: problems of increasing complexity
Scalar output
  The common setting. Examples: classification of molecules, QSAR, protein subcellular localization, ranking parse trees
I/O isomorphic (input and output graphs share V and E)
  Classical sequential supervised learning problems (protein secondary structure, POS tagging, named-entity recognition), Web page classification
Link prediction
  Normally requires a search in graph space. Examples: prediction of protein contact maps, localization of disulfide bridges
Arbitrary

Supervised learning on graphs
[Figure: the four settings, each mapping an input graph x to an output ƒ(x): scalar output, I/O isomorphic, link prediction, arbitrary.]

Graphical modeling
Local model (i.e. for a generic node):
  may include a “hidden” variable s_v (as in HMMs)
  may be seen as a template that is repeated over the entire graph
  hidden states are linked according to the relation expressed by the graph
[Figure: template with input x_v, hidden state s_v, output y_v, and links to the states of the neighbors.]
Several possible variations: sequence (HMM), tree (HTMM), MaxEnt, conditional random field, I/O.

What is in a link?
In a probabilistic network we would have
$$ \Pr(y_v \mid s_v), \qquad \Pr(s_v \mid x_v, s_{pa[v]}) $$
where pa[v] are v’s parents in the graph.
One can also replace probabilistic dependencies by functional dependencies with unknown parameters θ:
$$ y_v = f_{out}(s_v; \theta_{out}), \qquad s_v = f_{transition}(x_v, s_{pa[v]}; \theta_{transition}) $$
(Frasconi, Gori & Sperduti, 1998)

Neural networks (propagation algorithm)
Given a graph x = (V, E) with a label x_v on each v ∈ V:
  for v = 1, …, |V| in topological order do
    s_v = f_transition(x_v, s_pa[v]; θ_transition)
    ŷ_v = f_out(s_v; θ_out)
Remarks:
  parameters are shared across replicas of the template
  learning by backpropagation following the reverse topological order (Goller & Küchler, 1996)
  graph classification: predict ƒ(x) = f_out(s_root; θ_out)
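A minimal sketch of this propagation scheme in Python/NumPy; the tanh transition, the fixed state dimension, the zero state for parentless nodes, and the summation over parent states are illustrative assumptions, not details fixed by the slides.

```python
import numpy as np

def forward(labels, parents, order, Wx, Ws, b, Wo, bo, state_dim):
    """labels: dict v -> input vector x_v; parents: dict v -> list pa[v];
    order: the vertices in topological order."""
    s, y = {}, {}
    for v in order:
        # Aggregate parent states (zero vector if v has no parents).
        s_pa = sum((s[u] for u in parents[v]), np.zeros(state_dim))
        # s_v = f_transition(x_v, s_pa[v]; theta_transition)
        s[v] = np.tanh(Wx @ labels[v] + Ws @ s_pa + b)
        # y_v = f_out(s_v; theta_out)
        y[v] = Wo @ s[v] + bo
    return s, y
```

Because parameters are shared across all replicas of the template, gradients from every node accumulate into the same (Wx, Ws, b, Wo, bo) during backpropagation through structure.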

Neural vs. probabilistic networks
NN pros:
  fast (exact) inference
  trained discriminatively, so potentially more accurate
  universal approximation (Hammer 1999)
NN cons:
  collective inference is unilateral: outputs do not affect each other (but are all affected by the same inputs)
  needs DAGs; otherwise use spanning DAGs, or contraction maps and relaxation (Scarselli et al.)
  weak training procedure: vanishing gradients (Bengio et al. 1994) and local minima
  background knowledge incorporated only in an ad-hoc way

Structural ambiguity in NLP
“Someone shot the servant of the actress who was on the balcony” (Cuetos & Mitchell 1988)
Tuning hypothesis (Mitchell et al. 1995): “…early parsing choices can be determined by high-level statistical regularities of the language…”
Relative frequencies of trees matter
Tree generalization

Incremental trees
Given an incremental tree T_{i−1} and a word w_i, the task is to find a proper connection path (CP) to build T_i.
[Figure: T_{i−1} spans w_1 w_2 … w_{i−1}; attaching a connection path at an anchor node yields T_i, which also covers w_i.]

Dynamic grammar
A collection of CPs can be seen as a dynamic grammar where:
  states are incremental trees
  transitions occur when a CP is attached
The grammar can be extracted (learned) from a treebank

Structural ambiguity
The dynamic grammar allows multiple transitions:
“Someone shot the servant of the actress who was on the balcony”
[Figure: the partial parse of “the servant of the actress who …” with several possible attachment sites (NP, PP, S′, WHNP) for the relative clause.]

Structural ambiguity
“The hunter shot the leopard with the gun”
[Figure: two parses attaching the PP “with the gun” either to the VP (instrument reading) or to the NP “the leopard”.]

Disambiguation is a preference task
[Figure: a forest of alternatives x_1, …, x_k, encoded by the network into states s_1, …, s_k; x_1 is the correct alternative.]
Each alternative x_j is scored by a softmax over the forest:
$$ y_j = \frac{e^{\langle w, s_j \rangle}}{\sum_{i=1}^{k} e^{\langle w, s_i \rangle}} $$
Training maximizes, summed over all training forests,
$$ \log y_1 + \sum_{j=2}^{k} \log(1 - y_j) $$
One forest for every word in every sentence.
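A minimal sketch of this forest objective; the scoring vector w and the stacked state matrix S are illustrative stand-ins for the trained RNN parameters and encodings.

```python
import numpy as np

def forest_log_likelihood(S, w):
    """S: (k, d) array of state vectors, row 0 = correct alternative."""
    scores = S @ w                    # <w, s_j> for each alternative
    scores -= scores.max()            # shift for numerical stability
    y = np.exp(scores) / np.exp(scores).sum()   # softmax over the forest
    # log y_1 + sum_{j >= 2} log(1 - y_j)
    return np.log(y[0]) + np.log1p(-y[1:]).sum()

# Training would maximize the sum of this quantity over all forests
# (one forest per word in every sentence), e.g. by gradient ascent.
```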

Data set
Training set: sections 2-21 of the WSJ portion of the Penn treebank
  ~40k sentences, about 1 million words
Forest size: average > 60 alternatives; skewed distribution, max size > 600
[Figure: log-log histogram of forest sizes, well fitted by the power law f(x) = 6·10⁶ · x^(−1.89).]
Validation on section 22 (early stopping)
Test on section 23 (2,416 sentences)

Prediction accuracy
[Figure: bar chart comparing the RNN against the Late Closure and Minimal Attachment heuristics; for the RNN, the correct alternative is ranked 1st in 83% of the cases, and within the top 2, 3 and 4 in 91%, 95% and 96% of the cases.]
Costa, Frasconi, Lombardo & Soda, Applied Intelligence (2003)

Reduced incremental trees
The left context can be reduced to its linguistically relevant part, keeping only:
  the right frontier
  the right frontier + c-commanding nodes
  the right frontier + c-commanding nodes + the connection path
[Figures: the incremental tree for “a friend of Jim saw the thief”, with connection path PP → P (“with”), shown under each successive reduction; the final reduced tree keeps only the nodes S, NP, VP, NP, V, D, N, PP, P.]

Results
[Figure: accuracy of full vs. reduced incremental trees at ranking the correct alternative; reduced trees improve over full trees (roughly 86/94/97/98% vs. 83/91/95/96% for the correct tree within the top 1-4 alternatives).]
Costa et al., IEEE Trans. Neural Networks (to appear)

What features are extracted by the RNN?
[Figure: 2D PCA projection of the RNN states.]
Each dot is the PCA-projected RNN state in a forest of alternatives
The cross is the correct alternative

Link prediction with NNs
Application context: protein sequences
We are given the protein sequence (possibly enriched by multiple alignments)
We want to predict a relation defined on important constituents of the protein (e.g. amino acids)
We model the protein as a graph where vertices are constituents and initial arcs represent serial order
We want to complete the graph with additional edges representing the sought relation

Method
x: sequence; y: candidate relation
A NN trained in scalar-output mode scores the merged graph: ƒ(x ∪ y)
Predicted relation: y* = arg max_y ƒ(x ∪ y)

Details on the scoring function ƒ
Desiderata: if y is “closer” than y′ to the target graph y*, then we should have ƒ(x ∪ y) > ƒ(x ∪ y′)
The target function ƒ should be amenable to a greedy graph-search algorithm:
  e is a “safe” edge for y if adding it keeps y on a path to y* (i.e. y ∪ {e} ⊆ y*)
  e* is a “locally best” edge for y if e* = arg max_e ƒ(y ∪ {e})
  ƒ is an admissible score if every locally best edge is safe
The algorithm builds the target graph by repeatedly adding locally best edges

An admissible scoring function
Choose
$$ f(x \cup y) = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$
where
$$ \mathrm{Precision} = \frac{|y \cap y^*|}{|y|}, \qquad \mathrm{Recall} = \frac{|y \cap y^*|}{|y^*|} $$
One can show that this scoring function is admissible.
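A minimal sketch of the greedy construction, scoring against a known target y* with the F1 measure (as available during training); at prediction time a trained network would replace the score function. Representing edges as hashable vertex pairs is an implementation assumption.

```python
def f1_score(y, y_star):
    """F1 between a partial edge set y and the target edge set y_star."""
    if not y or not y_star:
        return 0.0
    tp = len(y & y_star)
    if tp == 0:
        return 0.0
    precision = tp / len(y)
    recall = tp / len(y_star)
    return 2 * precision * recall / (precision + recall)

def greedy_build(candidate_edges, y_star, score=None):
    """Grow y by repeatedly adding the locally best edge."""
    score = score or (lambda y: f1_score(y, y_star))
    y = set()
    remaining = set(candidate_edges)
    while remaining:
        best = max(remaining, key=lambda e: score(y | {e}))
        if score(y | {best}) <= score(y):
            break                      # no edge improves the score: stop
        y.add(best)
        remaining.remove(best)
    return y
```

With an admissible score, every locally best edge the loop adds is safe, so the search never leaves the set of subgraphs of y*.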

Application: coarse contact maps
Coarse-grained map: are two secondary structure elements of a protein close in space?
[Figure: a 10×10 coarse-grained contact map over secondary structure elements.]
Vullo & Frasconi, IEEE CS Bioinformatics ’02
Pollastri, Baldi, Vullo & Frasconi, NIPS ’02

Searching the y space
The space of graphs y on x grows exponentially with |x|.
What (x, y) pairs should we use during training?
  Pure exploration: choose the next y randomly
  Pure exploitation: add to the current y the edge maximizing ƒ as predicted by the current parameters
  Semi-uniform (ε-greedy) policy: explore with probability ε and exploit with probability 1 − ε
  Hybrid: fully expand the open search state but only follow the best successor (as in pure exploitation), while the network is trained by randomly sub-sampling the successors of the current state

Eight numerical features encode the input label of each node: one-hot encoding of the secondary structure type; normalized linear distances from the N and C terminus; and the average, maximum and minimum hydrophobic character of the segment (based on the Kyte-Doolittle scale on a 7-residue moving window centered at all positions in the segment).

For each strategy, performance is measured by several indices: micro- and macro-averaged precision (mP, MP), recall (mR, MR) and F1 measure (mF1, MF1). Micro-averages refer to the flattened set of secondary structure segment pairs, whereas macro-averages are obtained by first computing precision and recall for each protein and then averaging over the set of proteins. Specificity, i.e. the percentage of correct predictions for non-contacts, is also reported, averaged over the set of proteins (MP(nc)) and over the whole set of segment pairs (mP(nc)).

Results (5-fold cross-validation) for training bi-recursive neural networks with dynamic sampling:

Strategy             mP    mP(nc)  mR    mF1   MP    MP(nc)  MR    MF1
Random exploration   .715  .769    .418  .518  .767  .709    .469  .574
Semi-uniform         .454  .787    .631  .526  .507  .767    .702  .588
Pure exploitation    .431  .806    .726  .539  .481  .793    .787  .596
Hybrid               .417  .834    .790  .546  .474  .821    .843  .607

Disulfide bridges
Covalent bond formed by cysteines
Important role in stabilizing the native conformation of proteins
Important structural feature, e.g. a constraint in the conformation space
Prediction of disulfide bridges from sequence:
  helps towards folding
  may help other prediction algorithms
Example (1IMT):
AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRKMHHTCPCAPNLACVQTSPKKFKCLSK

Modeling disulfide connectivity
A connectivity pattern is an undirected graph where:
  nodes are bonded cysteines
  arcs are disulfide bridges
Each node has degree exactly 1
For B bridges there are (2B − 1)!! connectivity patterns

Complexity
In general (2B − 1)!! is big, but for useful values of B it is not too bad
E.g. B = 5 yields 9·7·5·3 = 945 alternatives and covers ~85% of known proteins
Brute force is OK!
[Figure: histogram of the number of sequences per number of disulfide bonds (1-10) in SwissProt 39 and 40.28 (A. Vullo and P. Frasconi).]
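A brute-force enumerator for this search space, as a minimal sketch; representing cysteines by their integer sequence positions is an illustrative choice.

```python
# Enumerate all (2B-1)!! connectivity patterns (perfect matchings)
# over 2B bonded cysteines: pair the first cysteine with each possible
# partner, then recurse on the rest.

def connectivity_patterns(cysteines):
    """Yield each pattern as a list of (cys_i, cys_j) bridges."""
    if not cysteines:
        yield []
        return
    first, rest = cysteines[0], cysteines[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for sub in connectivity_patterns(remaining):
            yield [(first, partner)] + sub

patterns = list(connectivity_patterns(list(range(10))))  # B = 5
assert len(patterns) == 9 * 7 * 5 * 3 * 1                # (2B-1)!! = 945
```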

Results on SWISS-PROT
Vullo & Frasconi, Bioinformatics 2004
Comparison among different prediction algorithms (Qp and Qc as defined in the paper; bold in the original marks statistically significant differences in performance):

Method               B=2         B=3         B=4         B=5         B={2…5}
                     Qp   Qc     Qp   Qc     Qp   Qc     Qp   Qc     Qp   Qc
Frequency            0.58 0.58   0.29 0.37   0.01 0.10   0.00 0.23   0.29 0.32
MC graph-matching    0.56 0.56   0.21 0.36   0.17 0.37   0.02 0.21   0.29 0.38
NN graph-matching    0.68 0.68   0.22 0.37   0.20 0.37   0.02 0.26   0.34 0.42
BiRNN-1 (sequence)   0.59 0.59   0.17 0.30   0.10 0.22   0.04 0.18   0.28 0.32
BiRNN-1 (profile)    0.65 0.65   0.46 0.56   0.24 0.32   0.08 0.27   0.42 0.46
BiRNN-2 (sequence)   0.59 0.59   0.22 0.34   0.18 0.30   0.08 0.24   0.31 0.37
BiRNN-2 (profile)    0.73 0.73   0.41 0.51   0.24 0.37   0.13 0.30   0.44 0.49

Training details: early stopping, weight decay
Online prediction service (DISULFIND): http://cassandra.dsi.unifi.it/cysteines/

Decomposition kernels
Very general framework for kernels on structured data types (Haussler, 1999)
Objects are decomposed into their parts
  e.g. strings decomposed as prefix ∘ suffix
Parts are matched in a “logical and” fashion by multiplying suitable kernels defined on each part
  e.g. k_pref(x_pref, z_pref) · k_suff(x_suff, z_suff)
Since there are multiple ways of dividing an object into parts, all such matches are summed
  e.g. sum the above over all possible prefix ∘ suffix splits

Decomposition kernels
An R-decomposition structure on a set X is a triple R = ⟨X⃗, R, k⃗⟩ where:
  X⃗ = (X_1, …, X_D) is a D-tuple of non-empty sets;
  R is a finite relation on X_1 × … × X_D × X;
  k⃗ = (k_1, …, k_D) is a D-tuple of positive definite kernel functions k_d : X_d × X_d → ℝ.
For all x ∈ X, let R⁻¹(x) = { x⃗ ∈ X⃗ : R(x⃗, x) }.
Tensor product:
$$ K_{R,\otimes}(x, z) = \sum_{\vec{x} \in R^{-1}(x)} \sum_{\vec{z} \in R^{-1}(z)} \prod_{d=1}^{D} k_d(x_d, z_d) $$
Direct sum:
$$ K_{R,\oplus}(x, z) = \sum_{\vec{x} \in R^{-1}(x)} \sum_{\vec{z} \in R^{-1}(z)} \sum_{d=1}^{D} k_d(x_d, z_d) $$
Theorem (Haussler 1999): the above kernels are positive definite.
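A minimal sketch instantiating the tensor-product kernel for the prefix ∘ suffix decomposition of strings (D = 2); using exact-match (delta) kernels on both parts is an illustrative choice, and the function names are hypothetical.

```python
# Tensor-product decomposition kernel for strings under the
# prefix-suffix relation: R^{-1}(x) = all splits x = prefix + suffix.

def delta(a, b):
    return 1.0 if a == b else 0.0

def prefix_suffix_kernel(x, z, k1=delta, k2=delta):
    total = 0.0
    for i in range(len(x) + 1):          # all splits of x
        for j in range(len(z) + 1):      # all splits of z
            total += k1(x[:i], z[:j]) * k2(x[i:], z[j:])
    return total

# With delta kernels this counts pairs of identical (prefix, suffix)
# splits; e.g. prefix_suffix_kernel("ab", "ab") == 3.
```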

Examples
Several “bag-of” kernels:
Strings:
  k-spectrum: all substrings of length k
  all subsequences, all substrings, all prefixes, all suffixes, …
Trees:
  all subtrees
  all co-rooted subtrees
Graphs:
  all walks
[Figure: example trees and a graph illustrating these substructures.]
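As a concrete instance, a minimal sketch of the k-spectrum kernel, whose feature map counts all length-k substrings; the example strings are illustrative.

```python
from collections import Counter

def spectrum(x, k):
    """Feature map: counts of all length-k substrings of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def k_spectrum_kernel(x, z, k=3):
    """Dot product of the two k-spectrum count vectors."""
    sx, sz = spectrum(x, k), spectrum(z, k)
    return sum(count * sz[s] for s, count in sx.items())

print(k_spectrum_kernel("GATTACA", "ATTACCA", k=2))  # shared bigrams
```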

Feature space explosion
Embedding: each substructure is mapped to one component of the feature space
  but the number of distinct substructures grows with their size…
Is this a problem?
  two large structures may look “dissimilar” simply because they do not share many large parts
  nearly diagonal Gram matrices
  analogy: a Gaussian kernel with too small a width
  large-margin classifiers might not help
Remedies: diagonal deflation (e.g. a sub-polynomial kernel), downweighting, a priori reduction

Learning curves
[Figure: learning curves (error vs. number of training examples, from 100 to 40,000) comparing the co-rooted subtrees kernel with the RNN, broken down by anchor category: nouns, verbs, adjectives, prepositions, punctuation, and overall.]
Menchetti, Costa, Frasconi & Pontil, Pattern Recognition Letters (to appear)

Opposite extreme
Flatten structures into value multisets for attributes
Example: a protein sequence flattened into its amino acid composition (Hua & Sun 2001)
  in principle this could be of some use in other applications too, but of course structural information is lost!
Interpretation as a linear kernel on the composition frequencies:
$$ \kappa(x, x') = \sum_{j=1}^{20} p(j)\, p'(j) $$

Product probability kernels (Jebara et al. 2004)
Basic ideas:
  a simple generative model is fitted to each example
  the kernel between two examples is evaluated by integrating the product of the two corresponding distributions
Given p(λ) fitted on x and p′(λ) fitted on x′:
$$ \kappa(x, x') = \kappa_{prob}(p, p') = \int_{\Lambda} p(\lambda)^{\rho}\, p'(\lambda)^{\rho}\, d\lambda $$
ρ = 1/2 gives the Bhattacharyya kernel
ρ = 1 in the discrete case gives a linear kernel on the frequencies from value multisets:
$$ \kappa(x, x') = \sum_{j=1}^{20} p(j)\, p'(j) $$

Histogram intersection kernel (Odone et al. 2005)
Ideas developed for defining image similarity notions in multimedia retrieval:
$$ \kappa(x, x') = \sum_{j=1}^{m} \min\{p(j), p'(j)\} $$
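A minimal sketch contrasting the three histogram kernels just discussed, on normalized histograms p and q (e.g. amino acid compositions); the toy vectors are illustrative.

```python
import numpy as np

def linear_kernel(p, q):            # product probability kernel, rho = 1
    return float(np.dot(p, q))

def bhattacharyya_kernel(p, q):     # product probability kernel, rho = 1/2
    return float(np.sum(np.sqrt(p * q)))

def histogram_intersection(p, q):   # Odone et al. 2005
    return float(np.sum(np.minimum(p, q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(linear_kernel(p, q), bhattacharyya_kernel(p, q),
      histogram_intersection(p, q))
```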

Probability distribution kernels on graphs
No restrictions on graph topology (e.g. cycles OK)
Attributes attached to vertices and edges, e.g.
  AtomType(3) = C
  BondType(3,4) = Aromatic
  AminoAcid(133) = Cys
Value multiset of attribute ξ_i: ξ_i(x) = { ξ_i(v) : v ∈ vertices(x) }
Histogram entries: p_i(j) = n_i(j) / |ξ_i(x)|
Kernel (e.g. HIK):
$$ \kappa(x, x') = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \min\{p_i(j), p'_i(j)\} $$

All-substructures decomposition kernels
$$ K(x, x') = \sum_{s} \sum_{s'} \delta(s, s'), \qquad \delta(s, s') = \begin{cases} 1 & \text{if } s = s' \\ 0 & \text{otherwise} \end{cases} $$
where the sums range over the substructures s of x and s′ of x′.

Weighted decomposition kernels
Each part is a pair (s, c) of a selector s and a context c:
$$ K(x, x') = \sum_{(s,c)} \sum_{(s',c')} \delta(s, s')\, \kappa(c, c') $$
where κ(c, c′) is a kernel between distributions.

WDK on protein sequences
Given r ≥ 0 (selector radius) and l ≥ 0 (context radius), the WDK is simply
$$ K(x, x') = \sum_{t=1}^{|x|} \sum_{\tau=1}^{|x'|} \delta\big(x(t,r), x'(\tau,r)\big)\, \kappa\big(x(t,l), x'(\tau,l)\big) $$
where x(t, r) is the substring of x spanning positions t − r to t + r; δ is an exact match on the selectors, while κ is a soft match on the contexts using the HIK.
[Figure: two sequences with a matching selector window (r = 1) and its surrounding context window (l = 7) highlighted.]
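A minimal sketch of the sequence WDK with these ingredients: exact match on (2r+1)-length selectors and histogram intersection on the amino acid composition of (2l+1)-length contexts. Clipping the windows at the sequence ends is an implementation assumption.

```python
from collections import Counter

def window(x, t, radius):
    """Substring of x around position t, clipped at the ends."""
    return x[max(0, t - radius): t + radius + 1]

def hik(c1, c2):
    """Histogram intersection of two symbol-composition histograms."""
    h1, h2 = Counter(c1), Counter(c2)
    n1, n2 = len(c1), len(c2)
    return sum(min(h1[a] / n1, h2[a] / n2) for a in h1)

def sequence_wdk(x, y, r=1, l=7):
    k = 0.0
    for t in range(len(x)):
        for tau in range(len(y)):
            if window(x, t, r) == window(y, tau, r):        # delta: exact match
                k += hik(window(x, t, l), window(y, tau, l))  # soft match
    return k

print(sequence_wdk("RINTVRGPITII", "ITVRGGSLRPIT"))
```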

WDK on molecules
$$ K(x, x') = \sum_{v \in V(x)} \sum_{w \in V(x')} \delta\big(x(v), x'(w)\big) \cdot \kappa\big(x(v,l), x'(w,l)\big) $$
x(v, l): subgraph of x induced by the vertices reachable from v by a path of length ≤ l
[Figure: a molecule with the context of radius l = 3 around a selected atom highlighted.]

Algorithms: kernel calculation
Does the equality predicate on selectors require solving a subgraph isomorphism problem?
  No: selectors do not need to be “large”, so they can be matched in O(1)
Does the double summation over selectors imply quadratic complexity?
  Indexing (hashing selectors into buckets) reduces the complexity down to linear
[Figure: selectors hashed into buckets; each bucket collects the contexts whose selectors match.]
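A minimal sketch of the indexing trick, assuming hashable selectors: grouping parts by selector means only pairs with δ = 1 are ever compared, instead of the full quadratic double summation.

```python
from collections import defaultdict

def wdk_indexed(parts_x, parts_y, context_kernel):
    """parts_*: iterables of (selector, context) pairs."""
    buckets = defaultdict(list)
    for selector, context in parts_y:
        buckets[selector].append(context)
    k = 0.0
    for selector, context in parts_x:
        for other in buckets.get(selector, ()):   # matching selectors only
            k += context_kernel(context, other)
    return k
```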

Algorithms: computing all histograms
Time is linear in the size of the data set; however, the size of the individual data structures also matters
Sequences:
  relatively easy: histograms can be calculated with a “moving average” trick, O(|V|) even for large contexts
Trees and DAGs:
  again we can add the contributions of “new” vertices and subtract the contributions of “old” vertices
General graphs:
  worst case is the cost of |V| breadth-first searches: O(|V|²) for sparse graphs
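A minimal sketch of the moving-window trick for sequences: as the context window advances one position, add the entering symbol and remove the leaving one, so all |x| context histograms cost O(|x|) overall. Clipping at the sequence ends is an implementation assumption.

```python
from collections import Counter

def all_context_histograms(x, l):
    """Histogram of the window [t - l, t + l] for each position t of x."""
    hist = Counter(x[:l + 1])            # window around position 0
    out = []
    for t in range(len(x)):
        out.append(dict(hist))
        entering = t + l + 1             # symbol entering the next window
        if entering < len(x):
            hist[x[entering]] += 1
        leaving = t - l                  # symbol leaving the next window
        if leaving >= 0:
            hist[x[leaving]] -= 1
            if hist[x[leaving]] == 0:
                del hist[x[leaving]]
    return out
```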

Subcellular localization
Newly made proteins are sorted into different cell compartments (e.g. nucleus, cytoplasm, organelles, etc.)
Prediction from sequence may help towards understanding protein function
Reference work:
  SubLoc (Hua & Sun 2001): SVM on amino acid composition
  LOCnet (Nair & Rost 2003): sophisticated system based on neural networks and rich inputs:
    sequence and profiles
    predicted structure (secondary structure and solvent accessibility)
    surface composition

Results
SubLoc data set: 2,427 eukaryotic sequences (Hua & Sun 2001); leave-one-out error. WDK parameters: r = 1, l = 7.
Table 1. Leave-one-out performance on the SubLoc data set. The spectrum kernel is based on 3-mers and C = 10; for the WDK, the context width was 15 residues and C = 10.

                 Cytoplasmic            Extra-Cellular         Mitochondrial          Nuclear
Method     Acc   Pre  Rec  gAv  MCC     Pre  Rec  gAv  MCC     Pre  Rec  gAv  MCC     Pre  Rec  gAv  MCC
SubLoc     79.4  72.6 76.6 .74  .64     81.2 79.7 .80  .77     70.8 57.3 .63  .58     85.2 87.4 .86  .74
Spectrum3  84.9  80.4 83.3 .81  .74     90.6 85.5 .88  .86     75.8 61.4 .68  .63     88.3 92.6 .90  .82
WDK        87.9  82.6 87.9 .85  .79     96.9 87.7 .92  .91     89.7 62.3 .74  .71     88.7 95.5 .92  .85

LOCnet data set: 1,973 eukaryotic sequences from SwissProt, with the same train (1,461) / test (512) split as in Nair & Rost (2003). WDK parameters: r = 1, l = 7.
Table 2. Test-set performance on the SwissProt data set defined by Nair & Rost (2003). The spectrum kernel is based on 4-mers and C = 5; for the WDK, the context width was 15 residues and C = 5.

                 Cytoplasmic            Extra-Cellular         Mitochondrial          Nuclear
Method     Acc   Pre  Rec  gAv  MCC     Pre  Rec  gAv  MCC     Pre  Rec  gAv  MCC     Pre  Rec  gAv  MCC
LOCnet     64.2  54.0 56.0 .54  -       76.0 86.0 .81  -       45.0 53.0 .49  -       71.0 73.0 .72  -
Spectrum4  75.8  73.3 66.7 .69  .58     83.6 82.3 .82  .77     89.7 43.3 .62  .59     71.3 89.8 .80  .67
WDK        78.0  71.4 72.9 .72  .60     85.7 87.1 .86  .81     78.9 50.0 .62  .59     77.8 85.3 .81  .70

Protein SCOP classification
Proteins hierarchically grouped into families and subfamilies
Experimental setup by Jaakkola et al. (2000), also used by Leslie et al. (2002): 33 families
[Figure: SCOP hierarchy example. Class: all alpha; fold: 4-helical cytokines; superfamily: 4-helical cytokines; families: interferons/interleukin-10 (1ilk), short-chain cytokines (1bgc), long-chain cytokines (1hmc). The diagram marks the positive train/test and negative train/test splits.]

Results
WDK parameters: r = 1, l = 7
[Figure 1. Remote protein homologies: family-by-family comparison of the WDK and the spectrum kernel; each dot is one family. Panels: RFP at 100% coverage (a), RFP at 50% coverage (b), and ROC50 (c), with the spectrum kernel on the x axis and the WDK on the y axis. The WDK outperforms the spectrum kernel where dots fall below the diagonal in (a) and (b), and above it in (c).]
Marginal improvement over the spectrum kernel

Classification of molecules (1)
Predictive toxicology challenge (Helma et al. 2001)
417 compounds tested for toxicity on 4 rodent targets: male rat (MR), female rat (FR), male mouse (MM), female mouse (FM)
Categories:
  positive: clear evidence (CE), positive (P), some evidence (SE)
  negative: negative (N), no evidence (NE)
  equivocal and inadequate studies ignored
Attributes:
  vertex attributes: atom type, charge, functional group membership
  edge attributes: bond type
No structure (3D atom coordinates) used

Predictive toxicology experiment
Selector: a single atom (along with its attributes)
Context: two dimensions explored:
  radius (l = 1, l = 2, l = 3)
  neighborhood only (D = 1) vs. neighborhood + complementary portion of the molecule (D = 2)
[Figure: a molecule with the D = 1 neighborhood and the D = 2 complementary part highlighted.]

Results (area under ROC)

       WDK, D=1             WDK, D=2             Deshpande et al. 2003*
       l=1   l=2   l=3      l=1   l=2   l=3      Topo   Geom
MM     68.0  68.2  66.8     65.6  66.9  66.0     65.5   66.7
FM     64.7  67.1  68.6     66.2  66.8  67.0     67.3   69.9
MR     61.5  67.5  67.4     68.5  68.5  66.5     62.6   64.8
FR     62.9  62.4  63.0     62.7  62.1  65.3     65.2   66.1

*Mining fragments, best reported result, varying minimum support and positive vs. negative loss.

Classification of molecules (2)
National Cancer Institute’s HIV data set
42,687 compounds tested for protection of human CEM cells against HIV infection
Compounds grouped into 3 categories:
  confirmed active (CA): 100% protection
  confirmed moderately active (CM): > 50% protection
  confirmed inactive (CI)
2 binary classification problems:
  CA vs. CM (1,503 molecules)
  CA vs. CI (41,184 molecules)
Relatively “large” molecules (46 atoms and 48 bonds on average)

Results (area under ROC)

           WDK, D=1             WDK, D=2             Deshpande et al. 2003*
           l=1   l=2   l=3      l=1   l=2   l=3      Topo   Geom
CA vs. CM  78.3  79.7  79.8     81.1  81.7  82.3     81.0   82.1
CA vs. CI  -     -     -        -     -     93.8     90.8   94.0

*Mining fragments, best reported result, varying minimum support and positive vs. negative loss.
Results obtained by 5-fold cross-validation with the same class distribution on each fold, using all the features previously described, without structure normalization and without balancing positive and negative examples.

Conclusions (WDK)
Soft subgraph matching, classification efficiency
Generality (works well on sequences and graphs, no restrictions on graph types)
Good performance on the studied problems (sequence and graph classification)
Potential improvements:
  subgraph mining to make selectors (fragments)
  context around mined fragments

Acknowledgments
MLNN group, Univ. Firenze: Alessio Ceroni, Fabrizio Costa, Sauro Menchetti, Andrea Passerini, Giovanni Soda
Univ. Glasgow: Patrick Sturt
Univ. Torino: Vincenzo Lombardo
Univ. College Dublin: Gianluca Pollastri, Alessandro Vullo
UC Irvine: Pierre Baldi
Univ. College London: Massimiliano Pontil