Neural Networks and Kernels for Learning Discrete Data Structures

Paolo Frasconi
MLNN Group, Università di Firenze
http://www.dsi.unifi.it/neural/
FLF 2005, Burnontige

Outline
- Neural networks for data structures
- Graphical models and neural nets
- Preferences on syntactic structures
- Link prediction in protein structures
- Kernels for data structures
- Explorations between "all-substructure kernels" and probability product kernels

Supervised learning on graphs: problems of increasing complexity
- Scalar output: the common setting. Examples: classification of molecules, QSAR, protein subcellular localization, ranking parse trees.
- I/O isomorphic (input and output graphs share V and E): classical sequential supervised learning problems (protein secondary structure, POS tagging, named-entity recognition), Web page classification.
- Link prediction: normally requires a search in graph space. Examples: prediction of protein contact maps, localization of disulfide bridges.
- Arbitrary output graphs.
[Figure: the four settings, from scalar output f(x) through I/O isomorphic and link prediction to arbitrary output.]

Graphical modeling
- Local model (i.e. for a generic node v); may include a "hidden" variable s_v (as in HMMs).
- The local model may be seen as a template that is repeated over the entire graph.
- Hidden states are linked according to the relation expressed by the graph: sequences (HMM), trees (HTMM).

Several possible variations...
- The same template idea with different dependencies among x_v, s_v and y_v: MaxEnt models, conditional random fields, input/output variants.

What is in a link?
- In a probabilistic network we would have Pr(y_v | s_v) and Pr(s_v | x_v, s_pa[v]), where pa[v] are v's parents in the graph.
- We can also replace probabilistic dependencies by functional dependencies with unknown parameters θ:
    y_v = f_out(s_v; θ_out)
    s_v = f_transition(x_v, s_pa[v]; θ_transition)
  (Frasconi, Gori & Sperduti, 1998)

Neural networks (propagation algorithm)
- Given a graph x = (V, E) with a label x_v on each v ∈ V:
    for v = 1, ..., |V| in topological order do
      s_v = f_transition(x_v, s_pa[v]; θ_transition)
      ŷ_v = f_out(s_v; θ_out)
- Remarks:
  - Parameters are shared across replicas of the template (sketched in code below).
  - Learning by backpropagation following the reverse topological sort (Goller & Küchler, 1996).
  - Graph classification: predict f(x) = f_out(s_root; θ_out).

Neural vs. probabilistic networks
- NN pros:
  - Fast (exact) inference.
  - Trained discriminatively, so potentially more accurate.
  - Universal approximation (Hammer, 1999).
- NN cons:
  - Collective inference is unilateral: outputs do not affect each other (but are all affected by the same inputs).
  - DAGs are needed; otherwise one can use spanning DAGs, or contraction maps and relaxation (Scarselli et al.).
  - Weak training procedure: vanishing gradients (Bengio et al., 1994) and local minima.
  - Background knowledge is incorporated ad hoc.
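To make the template idea concrete, here is a minimal sketch of the propagation algorithm above, assuming single-layer tanh transition and output functions; parent states are summed for simplicity, whereas the original model handles a fixed maximum in-degree (e.g. by concatenation). All names are illustrative.

```python
import numpy as np

def propagate(order, labels, parents, A, B, C):
    """Forward pass of a recursive network on a DAG.

    order   : vertices in topological order (parents first)
    labels  : dict v -> input label vector x_v
    parents : dict v -> list of parents pa[v]
    A, B    : weights of f_transition (input part and state part)
    C       : weights of f_out
    """
    n = B.shape[0]                                 # state dimension
    s, y = {}, {}
    for v in order:
        # s_v = f_transition(x_v, s_pa[v]); roots get an all-zero parent state
        s_pa = sum((s[p] for p in parents[v]), np.zeros(n))
        s[v] = np.tanh(A @ labels[v] + B @ s_pa)
        y[v] = np.tanh(C @ s[v])                   # y_v = f_out(s_v)
    return s, y
```

Since the same A, B and C are used at every vertex, backpropagation through structure simply follows the reverse of `order`, accumulating gradients into the shared parameters.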
Structural ambiguity in NLP
- "Someone shot the servant of the actress who was on the balcony" (Cuetos & Mitchell, 1988).
- Tuning hypothesis (Mitchell et al., 1995): early parsing choices can be determined by high-level statistical regularities of the language; relative frequencies of trees matter.

Tree generalization
- Given an incremental tree T_{i-1} and a word w_i, the task is to find a proper connection path (CP) that builds T_i.
[Figure: T_{i-1} over w_1 ... w_{i-1}, the connection path, its anchor, and the new word w_i.]

Dynamic grammar
- A collection of CPs can be seen as a dynamic grammar where:
  - states are incremental trees;
  - transitions occur when a CP is attached;
  - the grammar can be extracted (learned) from a treebank.
- The dynamic grammar allows multiple transitions, hence structural ambiguity:
  - "the servant of the actress who ...": the relative clause can attach to either NP;
  - "The hunter shot the leopard with the gun": the PP "with the gun" can attach to the VP or to the NP.
[Figure: alternative parse trees for the two example sentences.]

Disambiguation is a preference task
- Given the states s_1, ..., s_k of the k alternatives (s_1 being the correct one),

    y_j = exp(⟨w, s_j⟩) / Σ_{i=1}^{k} exp(⟨w, s_i⟩)

- Train by maximizing over w, summing over all forests:

    Σ_forests [ log y_1 + Σ_{j=2}^{k} log(1 − y_j) ]

- One forest for every word in every sentence.

Data set
- Training set: sections 2-21 of the WSJ portion of the Penn treebank (~40k sentences, about 1 million words).
- Forest size: average > 60 alternatives; skewed distribution, maximum size > 600.
[Figure: connection-path frequencies follow a power law, roughly f(x) = 10^6 x^{−1.89}.]
- Test on section 23 (2,416 sentences); validation on section 22 (early stopping).

Prediction accuracy
[Figure: accuracy within the first k candidates (k = 1..4) for the RNN vs. the Late Closure and Minimal Attachment heuristics: the correct tree is the RNN's first choice 83% of the time, and is among its top four 96% of the time.]
Costa, Frasconi, Lombardo & Soda, Applied Intelligence (2003).

Reduced incremental trees
- Full incremental trees can be reduced before being fed to the RNN, keeping only:
  - the right frontier;
  - the right frontier + c-commanding nodes;
  - the right frontier + c-commanding nodes + the connection path.
[Figure: successive reductions of the incremental tree for "a friend of Jim saw the thief with ...".]

Results
[Figure: accuracy within the first k candidates for full vs. reduced trees; reduced trees win at every rank, e.g. 86% vs. 83% on the first choice.]
Costa et al., IEEE Trans. Neural Networks (to appear).

What features are extracted by the RNN?
[Figure: each dot is the PCA-projected RNN state of one alternative in a forest; the cross marks the correct alternative.]

Link prediction with NNs
- Application context: protein sequences. We are given the protein sequence (possibly enriched by multiple alignments).
- We want to predict a relation defined on important constituents of the protein (e.g. amino acids).
- We model the protein as a graph whose vertices are the constituents and whose initial arcs represent serial order.
- We want to complete the graph with additional edges representing the sought relation.
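As a concrete reading of the preference-task objective a few slides back, here is a minimal NumPy sketch, assuming the alternative state vectors are already computed (in the talk they come from the recursive network); `forests` is a hypothetical list of k×d matrices whose row 0 holds the correct alternative s_1.

```python
import numpy as np

def preference_nll(w, forests):
    """Negative of the training objective: for each forest,
    log y_1 + sum_{j>=2} log(1 - y_j), with y = softmax(S @ w)."""
    loss = 0.0
    for S in forests:                    # S: (k, d) states of one forest
        scores = S @ w
        y = np.exp(scores - scores.max())
        y /= y.sum()                     # y_j = exp(<w,s_j>) / sum_i exp(<w,s_i>)
        loss -= np.log(y[0])             # log y_1: correct alternative, row 0
        loss -= np.log(1.0 - y[1:]).sum()
    return loss
```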
Method
- x: the sequence; y: a candidate relation. An NN trained in scalar-output mode computes f(x ∪ y).
- Predicted relation: y* = argmax_y f(x ∪ y).
- Desiderata: if y is "closer" than y′ to the target graph y*, then we should have f(x ∪ y) > f(x ∪ y′).

Details on the scoring function f
- The target function f should be amenable to a greedy graph-search algorithm:
  - e is a "safe" edge for y ⊆ y* if (y ∪ {e}) ⊆ y*;
  - e* is a "locally best" edge for y if e* = argmax_e f(y ∪ {e});
  - f is an admissible score if a locally best edge is always safe.
- The algorithm builds the target graph by adding locally best edges.

An admissible scoring function
- Choose f(x ∪ y) = 2 · Precision · Recall / (Precision + Recall), where
  Precision = |y ∩ y*| / |y| and Recall = |y ∩ y*| / |y*|.
- One can show that this scoring function (the F1 measure) is admissible; a code sketch follows below.

Application: coarse contact maps
- Coarse-grained map: are two secondary-structure elements of a protein close in space?
[Figure: a 10 × 10 coarse contact map over secondary-structure elements.]
Vullo & Frasconi, IEEE CS Bioinformatics '02; Pollastri, Baldi, Vullo & Frasconi, NIPS '02.

Training bi-recursive neural networks with dynamic sampling
- Eight numerical features encode the input label of each node: one-hot encoding of the secondary-structure type; normalized linear distances from the N and C termini; average, maximum and minimum hydrophobic character of the segment (Kyte-Doolittle scale on a 7-residue moving window centered at each position of the segment).
- The y space is searched greedily (the best successor is chosen at each step, as in pure exploitation), but the space of graphs y grows exponentially with |x|, so which (x, y) pairs should we use during training? The network is trained by sub-sampling the successors of the current state:
  - Pure exploration: choose the next y at random.
  - Pure exploitation: add to the current y the edge maximizing f(x, y_next) as predicted by the current parameters.
  - Semi-uniform (ε-greedy) policy: explore with probability ε and exploit with probability 1 − ε.
  - Hybrid: fully expand the open search state, but only follow the best successor (as in pure exploitation).
- Results (5-fold cross-validation) are shown in the table below. For each strategy we measured micro- and macro-averaged precision (mP, MP), recall (mR, MR) and F1 (mF1, MF1). Micro-averages refer to the flattened set of secondary-structure segment pairs, whereas macro-averages are obtained by first computing precision and recall for each protein and then averaging over the set of proteins. We also report specificity, i.e. the percentage of correct predictions for non-contacts, averaged over proteins (MP(nc)) and over all segment pairs (mP(nc)).

Strategy            mP    mP(nc)  mR    mF1   MP    MP(nc)  MR    MF1
Random exploration  .715  .769    .418  .518  .767  .709    .469  .574
Semi-uniform        .454  .787    .631  .526  .507  .767    .702  .588
Pure exploitation   .431  .806    .726  .539  .481  .793    .787  .596
Hybrid              .417  .834    .790  .546  .474  .821    .843  .607
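A minimal sketch of the greedy construction with the F1 score, assuming edges are hashable pairs and that y* is known (as it is during training; at prediction time `score` would be the trained network). The stopping rule, stop when no candidate improves the score, is my own choice.

```python
def f1_score_graph(y, y_star):
    """F1 of the edge set y against the target edge set y_star."""
    if not y:
        return 0.0
    tp = len(y & y_star)
    prec, rec = tp / len(y), tp / len(y_star)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def greedy_link_prediction(candidate_edges, score):
    """Repeatedly add the locally best edge; with an admissible score
    every locally best edge is safe, so the search never backtracks."""
    y = set()
    while True:
        rest = candidate_edges - y
        if not rest:
            break
        best = max(rest, key=lambda e: score(y | {e}))
        if score(y | {best}) <= score(y):
            break
        y.add(best)
    return y
```

For instance, `greedy_link_prediction(E, lambda y: f1_score_graph(y, y_star))` reconstructs y* whenever E contains it: every correct edge strictly increases F1 and every wrong edge decreases it, which is what makes this score admissible.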
Disulfide bridges
- Covalent bond formed by cysteines.
- Important role in stabilizing the native conformation of proteins.
- Important structural feature, e.g. a constraint in the conformation space.
- Prediction of disulfide bridges from sequence: helps towards folding; may help other prediction algorithms.
- Example (1IMT): AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRKMHHTCPCAPNLACVQTSPKKFKCLSK

Modeling disulfide connectivity
- A connectivity pattern is an undirected graph where nodes are bonded cysteines and arcs are disulfide bridges.
- Each node has degree exactly 1 (a perfect matching).
- For B bridges there are (2B − 1)!! connectivity patterns.

Complexity
- In general (2B − 1)!! is big, but for useful values of B this is not too bad: B = 5 yields 9 · 7 · 5 · 3 = 945 alternatives and covers ~85% of known proteins.
- Brute force is OK!
[Figure: distribution of the number of disulfide bonds per sequence in SwissProt 39 and 40.28.]

Results on SWISS-PROT (Vullo & Frasconi, Bioinformatics 2004)

Comparison among different prediction algorithms; prediction indices Q_p (whole-pattern level) and Q_c (single-bridge level). Bold entries in the original table marked statistically significant performance differences.

Method             B=2: Qp  Qc   B=3: Qp  Qc   B=4: Qp  Qc   B=5: Qp  Qc   B=2..5: Qp  Qc
Frequency               .58 .58       .29 .37       .01 .10       .00 .23          .29 .32
MC graph-matching       .56 .56       .21 .36       .17 .37       .02 .21          .29 .38
NN graph-matching       .68 .68       .22 .37       .20 .37       .02 .26          .34 .42
BiRNN-1 sequence        .59 .59       .17 .30       .10 .22       .04 .18          .28 .32
BiRNN-1 profile         .65 .65       .46 .56       .24 .32       .08 .27          .42 .46
BiRNN-2 sequence        .59 .59       .22 .34       .18 .30       .08 .24          .31 .37
BiRNN-2 profile         .73 .73       .41 .51       .24 .37       .13 .30          .44 .49

- Training details: early stopping, weight decay.
- Online prediction service (DISULFIND): http://cassandra.dsi.unifi.it/cysteines/

Decomposition kernels
- Very general framework for kernels on structured data types (Haussler, 1999).
- Objects are decomposed into their parts, e.g. strings decomposed as prefix ⊎ suffix.
- Parts are matched in a "logical AND" fashion by multiplying suitable kernels defined on each part, e.g. k_pref(x_pref, z_pref) · k_suff(x_suff, z_suff).
- Since there are multiple ways of dividing an object into parts, all possible matches are summed up, e.g. over all prefix ⊎ suffix splits.

Formally, an R-decomposition structure on a set X is a triple R = ⟨X⃗, R, k⃗⟩ where
- X⃗ = (X_1, ..., X_D) is a D-tuple of non-empty sets;
- R is a finite relation on X_1 × ... × X_D × X;
- k⃗ = (k_1, ..., k_D) is a D-tuple of positive definite kernels k_d : X_d × X_d → ℝ.
For all x ∈ X, let R⁻¹(x) = {x⃗ ∈ X⃗ : R(x⃗, x)}. Then, for all x, z ∈ X:

Tensor product:  K_{R,⊗}(x, z) = Σ_{x⃗ ∈ R⁻¹(x)} Σ_{z⃗ ∈ R⁻¹(z)} Π_{d=1}^{D} k_d(x_d, z_d)
Direct sum:      K_{R,⊕}(x, z) = Σ_{x⃗ ∈ R⁻¹(x)} Σ_{z⃗ ∈ R⁻¹(z)} Σ_{d=1}^{D} k_d(x_d, z_d)

Theorem (Haussler, 1999). The above kernels are positive definite.
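A toy instantiation of the tensor-product kernel for the prefix ⊎ suffix decomposition of strings, assuming exact-match part kernels (`k_part` is an illustrative default, not part of the original): every split of each string contributes one product term.

```python
def k_prefix_suffix(x, z, k_part=lambda a, b: float(a == b)):
    """Tensor-product decomposition kernel: sum over all splits
    x = x_pref + x_suff and z = z_pref + z_suff of
    k_pref(x_pref, z_pref) * k_suff(x_suff, z_suff)."""
    total = 0.0
    for i in range(len(x) + 1):          # split point in x
        for j in range(len(z) + 1):      # split point in z
            total += k_part(x[:i], z[:j]) * k_part(x[i:], z[j:])
    return total
```

With exact-match part kernels this counts pairs of identical (prefix, suffix) splits; richer part kernels (e.g. soft matches) drop in without changing the structure.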
Examples
- Several "bag-of-substructure" kernels:
  - Strings: k-spectrum (all substrings of length k), all subsequences, all substrings, all prefixes, all suffixes, ...
  - Trees: all subtrees, all co-rooted subtrees.
  - Graphs: all walks.

Feature space explosion
- Embedding: each substructure is mapped to one component of the feature space, but the number of distinct substructures grows with their size...
- Is this a problem?
  - Two large structures may look "dissimilar" simply because they do not share many large parts.
  - Nearly diagonal Gram matrices; analogy: a Gaussian kernel with too small a width.
  - Large-margin classifiers might not help.
- Remedies: diagonal deflation (e.g. sub-polynomial kernel), downweighting, a priori reduction.

Learning curves
[Figure: learning curves (100 to 40,000 training examples) comparing the RNN with the co-rooted subtree kernel, broken down into nouns, verbs, adjectives, prepositions, punctuation, and overall.]
Menchetti, Costa, Frasconi & Pontil, Pattern Recognition Letters (to appear).

Opposite extreme
- Flatten structures into value multisets for the attributes.
- Example: a protein sequence flattened into its amino acid composition (Hua & Sun, 2001).
- In principle this could be of some use in other applications, but of course structural information is lost!
- Interpretation as a linear kernel: κ(x, x′) = Σ_{j=1}^{20} p(j) p′(j).

Product probability kernels (Jebara et al., 2004)
- Basic idea: a simple generative model is fitted to each example; the kernel between two examples is evaluated by integrating the product of the two corresponding distributions.
- Given p(λ) fitted on x and p′(λ) fitted on x′:

    κ(x, x′) = κ_prob(p, p′) = ∫_Λ p(λ)^ρ p′(λ)^ρ dλ

- ρ = 1/2 gives the Bhattacharyya kernel.
- ρ = 1 in the discrete case gives the linear kernel on frequencies from value multisets: κ(x, x′) = Σ_{j=1}^{20} p(j) p′(j).

Histogram intersection kernel (Odone et al., 2005)
- Developed for defining image-similarity notions in multimedia retrieval:

    κ(x, x′) = Σ_{j=1}^{m} min{p(j), p′(j)}

Probability distribution kernels on graphs
- No restrictions on graph topology (e.g. cycles are OK).
- Attributes are attached to vertices and edges, e.g. AtomType(3) = C, BondType(3,4) = Aromatic, AminoAcid(133) = Cys.
- Value multiset of attribute ξ_i: ξ_i(x) = {ξ_i(v) : v ∈ vertices(x)}; histogram entries p_i(j) = n_i(j) / |ξ_i(x)|.
- Kernel (e.g. HIK): κ(x, x′) = Σ_{i=1}^{n} Σ_{j=1}^{m_i} min{p_i(j), p′_i(j)}.

All-substructures decomposition kernels
- K(x, x′) = Σ_s Σ_{s′} δ(s, s′), where s and s′ range over the substructures of x and x′, and δ(s, s′) = 1 if s = s′ and 0 otherwise.

Weighted decomposition kernels
- K(x, x′) = Σ_{(s,c)} Σ_{(s′,c′)} δ(s, s′) κ(c, c′), where s is a small "selector" substructure, c is its context, and κ(c, c′) is a kernel between distributions.

WDK on protein sequences
- Given r ≥ 0 (selector radius) and l ≥ 0 (context radius), the WDK is simply

    K(x, x′) = Σ_{t=1}^{|x|} Σ_{τ=1}^{|x′|} δ(x(t, r), x′(τ, r)) κ(x(t, l), x′(τ, l))

  where x(t, r) is the substring of x spanning positions t − r to t + r: an exact match on the selector, a soft match (HIK) on the context (a code sketch follows below).
- Example (r = 1, l = 7): RINTVRGPITIICGSSAGSEAGFTLTHEHFLRAWPEF vs. ISEAGFTLTHVNITVRGGSLRPITSSAIECHGGFEFFAWP.
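A minimal Python sketch of the sequence WDK above; the boundary handling (windows truncated at the string ends) is my own assumption, since the slide does not specify it, and `hik`/`histogram` are small illustrative helpers.

```python
from collections import Counter

def histogram(s):
    """Normalized symbol histogram of a string, as a dict."""
    n = len(s)
    return {a: c / n for a, c in Counter(s).items()} if n else {}

def hik(p, q):
    """Histogram intersection kernel between two normalized histograms."""
    return sum(min(p[j], q[j]) for j in p if j in q)

def wdk_sequence(x, z, r=1, l=7):
    """Weighted decomposition kernel on strings: exact match of the
    (2r+1)-mer selector, soft HIK match of the (2l+1)-mer context.
    Direct O(|x| * |z|) evaluation of the double sum."""
    total = 0.0
    for t in range(len(x)):
        for u in range(len(z)):
            if x[max(0, t - r):t + r + 1] == z[max(0, u - r):u + r + 1]:
                total += hik(histogram(x[max(0, t - l):t + l + 1]),
                             histogram(z[max(0, u - l):u + l + 1]))
    return total
```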
WDK on molecules

    K(x, x′) = Σ_{v ∈ V(x)} Σ_{w ∈ V(x′)} δ(x(v), x′(w)) · κ(x(v, l), x′(w, l))

where x(v, l) is the subgraph of x induced by the vertices reachable from v by a path of length ≤ l.
[Figure: a molecule with the context of radius l = 3 around a selected atom.]

Algorithms: kernel calculation
- Does the equality predicate on selectors require solving a subgraph-isomorphism problem? No: selectors do not need to be "large", so they can be matched in O(1).
- Quadratic complexity for the double summation over selectors? Indexing (bucketing selector/context pairs by selector) reduces the complexity, up to linear.

Algorithms: computing all histograms
- Time is linear in the size of the data set; however, the size of the data structures is also relevant.
- Sequences: relatively easy; histograms can be computed with a "moving average" trick, O(|V|) even for large contexts (see the sketch below).
- Trees and DAGs: again we can add the contributions of "new" vertices and subtract the contributions of "old" vertices.
- General graphs: the worst case costs |V| breadth-first searches, i.e. O(|V|²) for sparse graphs.

Subcellular localization
- Newly made proteins are sorted into different cell compartments (e.g. nucleus, cytoplasm, organelles).
- Prediction from sequence may help towards understanding protein function.
- Reference work:
  - SubLoc (Hua & Sun, 2001): SVM on amino acid composition.
  - LOCnet (Nair & Rost, 2003): sophisticated system based on neural networks and rich inputs, namely sequence and profiles, predicted structure (secondary structure and solvent accessibility), and surface composition.

Results

SubLoc data set: 2,427 eukaryotic sequences (Hua & Sun, 2001); leave-one-out error. The spectrum kernel uses 3-mers and C = 10; the WDK uses r = 1, l = 7 (context width 15 residues) and C = 10.

                   Cytoplasmic            Extra-cellular         Mitochondrial          Nuclear
Method     Acc   Pre   Rec   gAv  MCC   Pre   Rec   gAv  MCC   Pre   Rec   gAv  MCC   Pre   Rec   gAv  MCC
SubLoc     79.4  72.6  76.6  .74  .64   81.2  79.7  .80  .77   70.8  57.3  .63  .58   85.2  87.4  .86  .74
Spectrum3  84.9  80.4  83.3  .81  .74   90.6  85.5  .88  .86   75.8  61.4  .68  .63   88.3  92.6  .90  .82
WDK        87.9  82.6  87.9  .85  .79   96.9  87.7  .92  .91   89.7  62.3  .74  .71   88.7  95.5  .92  .85

LOCnet data set: 1,973 eukaryotic sequences from SwissProt, with the train (1,461) / test (512) split of Nair & Rost (2003). The spectrum kernel uses 4-mers and C = 5; the WDK uses r = 1, l = 7 (context width 15 residues) and C = 5.

                   Cytoplasmic            Extra-cellular         Mitochondrial          Nuclear
Method     Acc   Pre   Rec   gAv  MCC   Pre   Rec   gAv  MCC   Pre   Rec   gAv  MCC   Pre   Rec   gAv  MCC
LOCnet     64.2  54.0  56.0  .54  –     76.0  86.0  .81  –     45.0  53.0  .49  –     71.0  73.0  .72  –
Spectrum4  75.8  73.3  66.7  .69  .58   83.6  82.3  .82  .77   89.7  43.3  .62  .59   71.3  89.8  .80  .67
WDK        78.0  71.4  72.9  .72  .60   85.7  87.1  .86  .81   78.9  50.0  .62  .59   77.8  85.3  .81  .70
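Returning to the "moving average" trick mentioned above, here is one way to read it (a sketch under that assumption): keep a single running histogram and update it by adding the symbol that enters the window and removing the one that leaves it, so the whole pass is O(|x|) regardless of the context radius l.

```python
from collections import Counter

def context_histograms(x, l):
    """Symbol counts in the window [t - l, t + l] for every position t,
    updated incrementally; copying the counter per position costs at
    most the alphabet size (20 for amino acids), not O(l)."""
    counts = Counter(x[:l + 1])            # window around position 0
    out = []
    for t in range(len(x)):
        if t > 0:
            if t + l < len(x):             # symbol entering on the right
                counts[x[t + l]] += 1
            if t - l - 1 >= 0:             # symbol leaving on the left
                counts[x[t - l - 1]] -= 1
                if counts[x[t - l - 1]] == 0:
                    del counts[x[t - l - 1]]
        out.append(dict(counts))
    return out
```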
Protein SCOP classification
- Proteins are hierarchically grouped (class → fold → superfamily → family); 33 families are used.
- Experimental setup by Jaakkola et al. (2000), also used by Leslie et al. (2002): for each target family, positive test examples come from that family and positive training examples from the other families of the same superfamily, with negatives drawn from outside the fold.
[Figure: the SCOP hierarchy for the all-alpha class and the 4-helical-cytokines fold, with the interferons/interleukin-10, short-chain and long-chain cytokine families (e.g. 1ilk, 1bgc, 1hmc) and the positive/negative train/test splits.]

Results (remote homology)
- WDK parameters: r = 1, l = 7.
[Figure: family-by-family comparison of the WDK and the spectrum kernel; RFP at 100% coverage (a), RFP at 50% coverage (b), and ROC50 (c). Each dot is a family; the WDK outperforms the spectrum kernel where dots fall below the diagonal.]
- Marginal improvement over the spectrum kernel.

Classification of molecules (1)
- Predictive Toxicology Challenge (Helma et al., 2001): 417 compounds tested for toxicity on 4 types of rodents: male rat (MR), female rat (FR), male mouse (MM), female mouse (FM).
- Categories: positive = clear evidence (CE), positive (P), some evidence (SE); negative = negative (N), no evidence (NE). Equivocal and inadequate studies are ignored.
- Attributes: vertex attributes are atom type, charge and functional-group membership; edge attributes are bond type. No 3D structure (atom coordinates) is used.
- We use all the features above, without structure normalization, and do not balance positive and negative examples. Results are obtained by 5-fold cross-validation with the same class distribution on each fold.

Predictive toxicology experiment
- Selector: a single atom (along with its attributes).
- Context: two dimensions explored:
  - radius l = 1, 2, 3;
  - neighborhood only (D = 1) vs. neighborhood plus the complementary portion of the molecule (D = 2).
[Figure: D = 1 and D = 2 contexts around a selected atom.]

Results (area under ROC)

      WDK, D=1             WDK, D=2             Mining fragments*
      l=1   l=2   l=3      l=1   l=2   l=3      Topo  Geom
MM    68.0  68.2  66.8     65.6  66.9  66.0     65.5  66.7
FM    64.7  67.1  68.6     66.2  66.8  67.0     67.3  69.9
MR    61.5  67.5  67.4     68.5  68.5  66.5     62.6  64.8
FR    62.9  62.4  63.0     62.7  62.1  65.3     65.2  66.1

*Deshpande et al. 2003, best reported result varying minimum support and positive vs. negative loss.
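A minimal sketch of the graph WDK used in these molecule experiments (the D = 1 variant), assuming vertex labels only; the real experiments also use charge, functional groups and bond types, and the adjacency-list and label-dict representation here is illustrative.

```python
from collections import Counter, deque

def context_histogram(adj, labels, v, l):
    """Normalized label histogram of the subgraph induced by vertices
    reachable from v within l edges (BFS): the WDK context x(v, l)."""
    seen, frontier = {v}, deque([(v, 0)])
    while frontier:
        u, d = frontier.popleft()
        if d == l:
            continue
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                frontier.append((w, d + 1))
    counts = Counter(labels[w] for w in seen)
    return {a: c / len(seen) for a, c in counts.items()}

def wdk_graph(adj1, lab1, adj2, lab2, l=3):
    """WDK on molecules: selector = single atom label (exact match),
    context = HIK between radius-l neighborhood histograms."""
    h1 = {v: context_histogram(adj1, lab1, v, l) for v in adj1}
    h2 = {w: context_histogram(adj2, lab2, w, l) for w in adj2}
    return sum(
        sum(min(p[j], q[j]) for j in p if j in q)   # kappa(c, c') = HIK
        for v, p in h1.items()
        for w, q in h2.items()
        if lab1[v] == lab2[w]                        # delta(s, s')
    )
```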
Classification of molecules (2)
- National Cancer Institute's HIV data set: 42,687 compounds tested for protection of human CEM cells against HIV infection.
- Compounds grouped into 3 categories:
  - confirmed active (CA): 100% protection;
  - confirmed moderately active (CM): > 50% protection;
  - confirmed inactive (CI).
- 2 binary classification problems: CA vs. CM (1,503 molecules) and CA vs. CI (41,184 molecules).
- Relatively "large" molecules (on average 46 atoms and 48 bonds).

Results (area under ROC)

            WDK, D=1             WDK, D=2             Mining fragments*
            l=1   l=2   l=3      l=1   l=2   l=3      Topo  Geom
CA vs. CM   78.3  79.7  79.8     81.1  81.7  82.3     81.0  82.1
CA vs. CI   –     –     –        –     –     93.8     90.8  94.0

*Deshpande et al. 2003, best reported result varying minimum support and positive vs. negative loss.

Conclusions (WDK)
- Soft subgraph matching with good classification efficiency.
- Generality: works well on sequences and graphs, with no restrictions on graph types.
- Good performance on the problems studied (sequence and graph classification).
- Potential improvements: subgraph mining to build selectors (fragments); contexts around mined fragments.

Acknowledgments
- MLNN group, Univ. Firenze: Alessio Ceroni, Fabrizio Costa, Sauro Menchetti, Andrea Passerini, Giovanni Soda
- Univ. Glasgow: Patrick Sturt
- Univ. Torino: Vincenzo Lombardo
- Univ. College Dublin: Gianluca Pollastri, Alessandro Vullo
- UC Irvine: Pierre Baldi
- Univ. College London: Massimiliano Pontil