ACL-04 Tutorial 1 — Kernel Methods in Natural Language Processing. Jean-Michel Renders, Xerox Research Center Europe (France). ACL'04 Tutorial.

ACL-04 Tutorial 2 — Warnings. This presentation contains extra slides (examples, more detailed views, further explanations, …) that are not present in the official notes. If needed, the complete presentation can be downloaded from the KERMIT web site, www.euro-kermit.org (feedback welcome).

ACL-04 Tutorial 3 — Agenda. What is the philosophy of kernel methods? How to use kernel methods in learning tasks? Kernels for text (BOW, latent concept, string, word sequence, tree and Fisher kernels). Applications to NLP tasks.

ACL-04 Tutorial 4 — Plan. What is the philosophy of kernel methods? How to use kernel methods in learning tasks? Kernels for text (BOW, latent concept, string, word sequence, tree and Fisher kernels). Applications to NLP tasks.

ACL-04 Tutorial 5 — Kernel methods: intuitive idea. Find a mapping φ such that, in the new space, problem solving is easier (e.g. linear). The kernel represents the similarity between two objects (documents, terms, …), defined as the dot product in this new vector space — but the mapping is left implicit. This gives an easy generalization of many dot-product (or distance) based pattern-recognition algorithms.

ACL-04 Tutorial 6 — Kernel methods: the mapping φ. [Figure: the mapping φ from the original space to the feature (vector) space.]

ACL-04 Tutorial 7 — Kernel: a more formal definition. A kernel k(x,y) is a similarity measure defined by an implicit mapping φ from the original space to a vector space (the feature space) such that k(x,y) = φ(x)•φ(y). This similarity measure and the mapping can encode: invariance or other a priori knowledge; a simpler structure (a linear representation of the data); the class of functions the solution is taken from (the hypothesis space for learning), possibly of infinite dimension; … while still allowing efficient computation of k(x,y).

ACL-04 Tutorial 8 — Benefits from kernels. Kernels generalize (nonlinearly) pattern-recognition algorithms for clustering, classification, density estimation, …
- When these algorithms are dot-product based, by replacing the dot product x•y by k(x,y) = φ(x)•φ(y); e.g. linear discriminant analysis, logistic regression, perceptron, SOM, PCA, ICA, … NB: this often means working with the "dual" form of the algorithm.
- When these algorithms are distance-based, by replacing d²(x,y) by k(x,x) + k(y,y) − 2·k(x,y).
The freedom in choosing φ yields a large variety of learning algorithms.

ACL-04 Tutorial 9 — Valid kernels. The function k(x,y) is a valid kernel if there exists a mapping φ into a vector space (with a dot product) such that k can be expressed as k(x,y) = φ(x)•φ(y). Theorem: k(x,y) is a valid kernel if k is positive definite and symmetric (a Mercer kernel). A function is positive definite if ∫∫ k(x,y)·f(x)·f(y) dx dy ≥ 0 for all f ∈ L2. In other words, the Gram matrix K (whose elements are k(xi,xj)) must be positive (semi-)definite for all xi, xj of the input space. One possible choice of φ(x) is k(•,x): a point x is mapped to the function k(•,x), a feature space of infinite dimension!

ACL-04 Tutorial 10 — Examples of kernels (I). Polynomial kernels: k(x,y) = (x•y)^d. Assume we know that most of the information is contained in monomials (e.g. multi-word terms) of degree d (e.g. for d = 2: x1², x2², x1·x2). Theorem: the (implicit) feature space contains all possible monomials of degree d (example: n = 250, d = 5 gives dim F ≈ 10^10), but the kernel computation is only marginally more complex than a standard dot product! For k(x,y) = (x•y + 1)^d, the (implicit) feature space contains all possible monomials up to degree d!
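To make the "implicit feature space" point concrete, here is a minimal sketch (plain Python/NumPy; function names and test vectors chosen here purely for illustration) checking numerically that the inhomogeneous polynomial kernel of degree 2, for n = 2, coincides with an explicit dot product over all monomials up to degree 2:

```python
import numpy as np

def poly_kernel(x, y, d=2):
    # k(x, y) = (x . y + 1)^d : all monomials up to degree d, implicitly
    return (np.dot(x, y) + 1.0) ** d

def phi_degree2(x):
    # Explicit feature map for d = 2, n = 2:
    # (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y), np.dot(phi_degree2(x), phi_degree2(y)))
```

The same kernel value is obtained without ever materializing the monomial feature vector, which is the whole point when n and d make that space huge.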
ACL-04 Tutorial 11 — Examples of kernels (III). [Figure: the feature mapping φ illustrated for a polynomial kernel (n = 2) and an RBF kernel (n = 2).]

ACL-04 Tutorial 12 — The kernel Gram matrix. With kernel-based learning, the sole information used from the training data set is the kernel Gram matrix:
K_training =
[ k(x1,x1)  k(x1,x2)  …  k(x1,xm) ]
[ k(x2,x1)  k(x2,x2)  …  k(x2,xm) ]
[    …          …      …      …    ]
[ k(xm,x1)  k(xm,x2)  …  k(xm,xm) ]
If the kernel is valid, K is symmetric and positive (semi-)definite.

ACL-04 Tutorial 13 — The right kernel for the right task. Assume a categorization task; the ideal kernel matrix is k(xi,xj) = +1 if xi and xj belong to the same class, and k(xi,xj) = −1 if they belong to different classes. Hence the concept of target alignment (adapt the kernel to the labels), where alignment is the similarity between the current Gram matrix and the ideal, "two clusters" one. A certainly bad kernel is the diagonal kernel: k(xi,xj) = +1 if xi = xj, and 0 elsewhere. All points are then orthogonal: no more clusters, no more structure.

ACL-04 Tutorial 14 — How to choose kernels? There are no absolute rules for choosing the right kernel, adapted to a particular problem. Kernel design can start from the desired feature space, from combinations of existing kernels, or from data. Some considerations are important: use the kernel to introduce a priori (domain) knowledge; be sure to keep some information structure in the feature space. Experimentally, there is some "robustness" in the choice, as long as the chosen kernel provides an acceptable trade-off between: a simpler and more efficient structure (e.g. linear separability), which requires some "explosion" of the representation; and preservation of the information structure, which requires that this "explosion" not be too strong.

ACL-04 Tutorial 15 — How to build new kernels. Kernel combinations preserving validity:
- K(x,y) = λ·K1(x,y) + (1−λ)·K2(x,y), with 0 ≤ λ ≤ 1
- K(x,y) = a·K1(x,y), with a > 0
- K(x,y) = K1(x,y)·K2(x,y)
- K(x,y) = f(x)·f(y), with f a real-valued function
- K(x,y) = K3(φ(x), φ(y))
- K(x,y) = xᵀ·P·y, with P symmetric positive definite
- K(x,y) = K1(x,y) / √( K1(x,x)·K1(y,y) )

ACL-04 Tutorial 16 — Kernels built from data (I). In general, this mode of kernel design can use both the labeled and the unlabeled data of the training set — very useful for semi-supervised learning. Intuitively, kernels define clusters in the feature space, and we want to find interesting clusters, i.e. cluster components that can be associated with labels. More generally, the goal is to extract (non-linear) relationships between features that will catalyze the learning results.

ACL-04 Tutorial 17–20 — Examples. [Figure-only slides.]

ACL-04 Tutorial 21 — Kernels built from data (II). Basic ideas:
- Convex linear combination of kernels in a given family: find the best coefficients of the eigen-components of the (complete) kernel matrix by maximizing the alignment on the labeled training data.
- Find a linear transformation of the feature space such that, in the new space, pre-specified similarity or dissimilarity constraints are respected (as well as possible), in a kernelizable way.
- Build a generative model of the data, then use the Fisher kernel or marginalized kernels (see later).

ACL-04 Tutorial 22 — Kernels for texts: increased use of syntactic and semantic information. Similarity between documents?
Seen as ‘bag of words’ : dot product or polynomial kernels (multi-words) Seen as set of concepts : GVSM kernels, Kernel LSI (or Kernel PCA), Kernel ICA, …possibly multilingual Seen as string of characters: string kernels Seen as string of terms/concepts: word sequence kernels Seen as trees (dependency or parsing trees): tree kernels Etc. ACL-04 Tutorial 23 Agenda What’s the philosophy of Kernel Methods? How to use Kernels Methods in Learning tasks? Kernels for text (BOW, latent concept, string, word sequence, tree and Fisher Kernels) Applications to NLP tasks ACL-04 Tutorial 24 Kernels and Learning In Kernel-based learning algorithms, problem solving is now decoupled into: A general purpose learning algorithm (e.g. SVM, PCA, …) – Often linear algorithm (well-funded, robustness, …) A problem specific kernel Simple (linear) learning algorithm Complex Pattern Recognition Task Specific Kernel function ACL-04 Tutorial 25 Learning in the feature space: Issues High dimensionality allows to render flat complex patterns by “explosion” Computational issue, solved by designing kernels (efficiency in space and time) Statistical issue (generalization), solved by the learning algorithm and also by the kernel e.g. SVM, solving this complexity problem by maximizing the margin and the dual formulation E.g. RBF-kernel, playing with the s parameter With adequate learning algorithms and kernels, high dimensionality is no longer an issue ACL-04 Tutorial 26 Current Synthesis Modularity and re-usability Same kernel ,different learning algorithms Different kernels, same learning algorithms This tutorial is allowed to focus only on designing kernels for textual data Data 1 (Text) Kernel 1 Gram Matrix Learning Algo 1 (not necessarily stored) Data 2 (Image) Kernel 2 Gram Matrix Learning Algo 2 ACL-04 Tutorial 27 Agenda What’s the philosophy of Kernel Methods? How to use Kernels Methods in Learning tasks? Kernels for text (BOW, latent concept, string, word sequence, tree and Fisher Kernels) Applications to NLP tasks ACL-04 Tutorial 28 Kernels for texts Similarity between documents? Seen as ‘bag of words’ : dot product or polynomial kernels (multi-words) Seen as set of concepts : GVSM kernels, Kernel LSI (or Kernel PCA), Kernel ICA, …possibly multilingual Seen as string of characters: string kernels Seen as string of terms/concepts: word sequence kernels Seen as trees (dependency or parsing trees): tree kernels Seen as the realization of probability distribution (generative model) ACL-04 Tutorial 29 Strategies of Design Kernel as a way to encode prior information Invariance: synonymy, document length, … Linguistic processing: word normalisation, semantics, stopwords, weighting scheme, … Convolution Kernels: text is a recursively-defined data structure. How to build “global” kernels form local (atomic level) kernels? Generative model-based kernels: the “topology” of the problem will be translated into a kernel function ACL-04 Tutorial 30 Strategies of Design Kernel as a way to encode prior information Invariance: synonymy, document length, … Linguistic processing: word normalisation, semantics, stopwords, weighting scheme, … Convolution Kernels: text is a recursively-defined data structure. How to build “global” kernels form local (atomic level) kernels? Generative model-based kernels: the “topology” of the problem will be translated into a kernel function ACL-04 Tutorial 31 ‘Bag of words’ kernels (I) Document seen as a vector d, indexed by all the elements of a (controlled) dictionary. 
Each entry is equal to the number of occurrences of the corresponding term. A training corpus is therefore represented by a term-document matrix, noted D = [d1 d2 … dm−1 dm]. The "nature" of a word will be discussed later. From this basic representation, we will apply a sequence of successive embeddings, resulting in a global (valid) kernel with all the desired properties.

ACL-04 Tutorial 32 — BOW kernels (II). Properties: all order information is lost (syntactic relationships, local context, …); the feature space has dimension N (the size of the dictionary). Similarity is basically defined by k(d1,d2) = d1•d2 = d1ᵀ·d2, or, normalized (cosine similarity), k̂(d1,d2) = k(d1,d2) / √( k(d1,d1)·k(d2,d2) ). Efficiency is provided by sparsity (and a sparse dot-product algorithm): O(|d1|+|d2|).

ACL-04 Tutorial 33 — 'Bag of words' kernels: enhancements. The choice of indexing terms can exploit linguistic enhancements: lemma / morpheme & stem; disambiguated lemma (lemma + POS); noun phrases (or useful collocations, n-grams); named entities (with type); grammatical dependencies (represented as feature-vector components). Example: "The human resource director of NavyCorp communicated important reports on ship reliability." It can also exploit IR lessons: stopword removal, feature selection based on frequency, weighting schemes (e.g. idf). NB: using polynomial kernels up to degree p is a natural and efficient way of considering all (up-to-)p-grams (with different weights, actually), but order is not taken into account ("sinking ships" is the same as "shipping sinks").

ACL-04 Tutorial 34 — 'Bag of words' kernels: enhancements. Weighting scheme: the traditional idf weighting scheme, tfi → tfi·log(N/ni), is a linear transformation (scaling) φ(d) → W·φ(d), where W is diagonal: k(d1,d2) = φ(d1)ᵀ·(WᵀW)·φ(d2), which can still be computed efficiently (O(|d1|+|d2|)). Semantic expansion (e.g. synonyms): assume some term-term similarity matrix Q (positive definite); then k(d1,d2) = φ(d1)ᵀ·Q·φ(d2). In general sparsity is lost (Q propagates weight across terms). How to choose Q (some kernel matrix on terms)?

ACL-04 Tutorial 35 — Semantic smoothing kernels. Synonymy and other term relationships: the GVSM kernel uses the term-term co-occurrence matrix D·Dᵀ in the kernel: k(d1,d2) = d1ᵀ·(D·Dᵀ)·d2. The completely kernelized version of GVSM is: the training kernel matrix K (= Dᵀ·D) → K² (m×m); the kernel vector of a new document d vs the training documents, t → K·t (m×1). The initial K could be a polynomial kernel (GVSM on multi-word terms). Variants: one can use a shorter context than the whole document to compute the term-term similarity (a term-context matrix), or another measure than the number of co-occurrences (e.g. mutual information, …). This can be generalized to Kⁿ (or a weighted combination of K¹, K², …, Kⁿ — cf. diffusion kernels later), but Kⁿ becomes less and less sparse! Interpretation: a sum over paths of length 2n.

ACL-04 Tutorial 36 — Semantic smoothing kernels. One can use a term-term similarity matrix other than D·Dᵀ, e.g. a similarity matrix derived from the WordNet thesaurus, where the similarity between two terms is defined as the inverse of the length of the path connecting the two terms in the hierarchical hyper/hyponymy tree; or a similarity measure for nodes in a tree (feature space indexed by each node n of the tree, with φn(term x) = 1 if term x is the class represented by n or lies "under" n), so that the similarity is the number of common ancestors (including the node of the class itself). With semantic smoothing, two documents can be similar even if they don't share common words.
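As a concrete sketch of the GVSM / semantic-smoothing idea above (pure NumPy; the term-document counts below are made up for illustration):

```python
import numpy as np

# Toy term-document count matrix D (rows = terms, columns = documents).
D = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 2, 0],
              [0, 0, 3]], dtype=float)

def gvsm_kernel(d1, d2, D):
    # k(d1, d2) = d1^T (D D^T) d2 : documents are compared through the
    # term-term co-occurrence matrix, so two documents sharing no term
    # can still be similar via co-occurring terms.
    return d1 @ (D @ D.T) @ d2

# Fully kernelized version: only the BOW Gram matrix K = D^T D is needed.
K = D.T @ D
K_gvsm_train = K @ K                       # training kernel matrix = K^2
d_new = np.array([0., 1., 0., 1.])         # a new document (term counts)
t = D.T @ d_new                            # BOW kernel vector vs training docs
k_gvsm_new = K @ t                         # GVSM kernel vector = K t

# Same values either way, without leaving "kernel space":
assert np.allclose(K_gvsm_train[:, 0],
                   [gvsm_kernel(D[:, i], D[:, 0], D) for i in range(3)])
```

The K² and K·t identities are exactly the "completely kernelized GVSM" of the slide: once K is available, D·Dᵀ never has to be formed explicitly.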
ACL-04 Tutorial 37 — Latent concept kernels. Basic idea: [figure] documents are mapped, via φ1 and φ2, from the term space (size t) into a latent concept space of size k << t, and K(d1,d2) is computed there.

ACL-04 Tutorial 38 — Latent concept kernels. k(d1,d2) = φ(d1)ᵀ·Pᵀ·P·φ(d2), where P is a (linear) projection operator from the term space to the concept space. Working with (latent) concepts provides robustness to polysemy, synonymy, style, …, a cross-lingual bridge, and a natural dimension reduction. But how to choose P, and how to define (extract) the latent concept space? Example: use PCA — the concepts are nothing else than the principal components.

ACL-04 Tutorial 39 — Polysemy and synonymy. [Figure: in the (t1, t2) term space, the projections of doc1 and doc2 onto a concept axis illustrate the effect of polysemy (left) and synonymy (right).]

ACL-04 Tutorial 40 — More formally… SVD decomposition of D ≈ U(t×k)·S(k×k)·Vᵀ(k×d), where U and V are projection matrices (from terms to concepts and from concepts to documents). Kernel Latent Semantic Indexing (SVD decomposition in the feature space): U is formed by the eigenvectors corresponding to the k largest eigenvalues of D·Dᵀ (each column defines a concept as a linear combination of terms); V is formed by the eigenvectors corresponding to the k largest eigenvalues of K = Dᵀ·D; S = diag(si), where si² (i = 1,…,k) is the i-th largest eigenvalue of K. Cf. semantic smoothing with D·Dᵀ replaced by U·Uᵀ (a new term-term similarity matrix): k(d1,d2) = d1ᵀ·(U·Uᵀ)·d2. As in kernel GVSM, the completely kernelized version of LSI is: K → V·S²·Vᵀ (= the rank-k approximation of K) and t → V·Vᵀ·t (the vector of similarities of a new document), with no computation in the feature space. If k = n (no dimension reduction), the latent semantic kernel is identical to the initial kernel. (A small code sketch follows at the end of this part.)

ACL-04 Tutorial 41 — Complementary remarks. Composition: polynomial kernel + kernel LSI (disjunctive normal form), or kernel LSI + polynomial kernel (tuples of concepts, i.e. conjunctive normal form). GVSM is a particular case with one document = one concept. Other decompositions: random mapping and randomized kernels (Monte-Carlo sampling following some non-uniform distribution; bounds exist to probabilistically ensure that the estimated Gram matrix is ε-accurate); non-negative matrix factorization, D ≈ A(t×k)·S(k×d) with A ≥ 0; ICA factorization, D ≈ A(t×k)·S(k×d) (kernel ICA) — cf. semantic smoothing with D·Dᵀ replaced by A·Aᵀ: k(d1,d2) = d1ᵀ·(A·Aᵀ)·d2; decompositions coming from multilingual parallel corpora (cross-lingual GVSM, cross-lingual LSI, CCA).

ACL-04 Tutorial 42 — Why multilingualism helps. [Figure: terms in L1 and terms in L2 linked through parallel contexts.] Concatenating both representations forces language-independent concepts: each language imposes constraints on the other. Searching for maximally correlated projections of paired observations (CCA) therefore makes sense, semantically speaking.
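As announced above, a minimal kernel-LSI sketch (made-up counts, pure NumPy): the rank-k latent semantic kernel is obtained from the eigen-decomposition of the BOW Gram matrix alone, with no explicit work in term space.

```python
import numpy as np

# Toy term-document matrix (4 terms x 4 documents, illustrative counts only)
D = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.]])
K = D.T @ D                                 # BOW Gram matrix on training docs

k = 2                                       # number of latent concepts
eigvals, eigvecs = np.linalg.eigh(K)        # K is symmetric PSD
idx = np.argsort(eigvals)[::-1][:k]         # k largest eigenvalues
V = eigvecs[:, idx]                         # document-side concept directions
S2 = np.diag(eigvals[idx])

K_lsi = V @ S2 @ V.T                        # rank-k latent semantic kernel (training)

d_new = np.array([1., 0., 1., 0.])          # new document (term counts)
t = D.T @ d_new                             # its BOW kernel vector vs training docs
t_lsi = V @ (V.T @ t)                       # latent semantic kernel vector

# With k equal to the rank of K, K_lsi falls back to K itself.
```

This mirrors the slide's K → V·S²·Vᵀ and t → V·Vᵀ·t: everything is computed from kernel values, which is what makes the "kernel" version of LSI applicable to any valid base kernel, not just raw counts.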
ACL-04 Tutorial 43 — Diffusion kernels. Recursive dual definition of semantic smoothing: K = Dᵀ·(I + u·Q)·D and Q = D·(I + v·K)·Dᵀ. NB: u = v = 0 gives the standard BOW kernel; v = 0 gives GVSM. Let B = DᵀD (the standard BOW kernel) and G = DDᵀ. If u = v, the solution is the "von Neumann diffusion kernel": K = B·(I + uB + u²B² + …) = B·(I − uB)⁻¹ and Q = G·(I − uG)⁻¹ [only if u < ‖B‖⁻¹]. This can be extended, with a faster decay, to the exponential diffusion kernel: K = B·exp(uB) and Q = exp(uG).

ACL-04 Tutorial 44 — Graphical interpretation. These diffusion kernels correspond to defining similarities between nodes in a graph, specifying only the myopic (one-step) view. Either the (weighted) adjacency matrix is the document-term matrix (a bipartite documents/terms graph), or, by aggregation, the (weighted) adjacency matrix is the term-term similarity matrix G. Diffusion kernels correspond to considering all paths of length 1, 2, 3, 4, … linking two nodes and summing the products of the local similarities, with different decay strategies. This is in some way similar to KPCA: it just "rescales" the eigenvalues of the basic kernel matrix (decreasing the lowest ones).

ACL-04 Tutorial 45 — Strategies of design. Kernels as a way to encode prior information: invariance (synonymy, document length, …); linguistic processing (word normalisation, semantics, stopwords, weighting schemes, …). Convolution kernels: text is a recursively-defined data structure; how to build "global" kernels from local (atomic-level) kernels? Generative model-based kernels: the "topology" of the problem is translated into a kernel function.

ACL-04 Tutorial 46 — Sequence kernels. Consider a document as: a sequence of characters (a string); a sequence of tokens (or stems, or lemmas); a paired sequence (POS + lemma); a sequence of concepts; a tree (parse tree — later); a dependency graph. With sequence kernels, order matters. Kernels on strings/sequences count the subsequences two objects have in common, but there are various ways of counting: contiguity is necessary (p-spectrum kernels); contiguity is not necessary (subsequence kernels); contiguity is penalised (gap-weighted subsequence kernels).

ACL-04 Tutorial 47 — String and sequence. Just a matter of convention: string matching implies contiguity, while sequence matching only implies order.

ACL-04 Tutorial 48 — p-spectrum kernel. The features of s are its p-spectrum, i.e. the histogram of all (contiguous) substrings of length p. The feature space is indexed by all elements of Σ^p, with φu(s) = number of occurrences of u in s. Example (words as symbols, p = 2): s = "John loves Mary Smith" has 2-grams {John loves, loves Mary, Mary Smith}; t = "Mary Smith loves John" has {Mary Smith, Smith loves, loves John}; only "Mary Smith" is shared, so K(s,t) = 1.

ACL-04 Tutorial 49 — p-spectrum kernels (II). Naïve implementation: for all p-grams of s, compare for equality with the p-grams of t — O(p·|s|·|t|). Later, an implementation in O(p·(|s|+|t|)). (A small code sketch follows after the next slide.)

ACL-04 Tutorial 50 — All-subsequences kernels. The feature space is indexed by all elements of Σ* = {ε} ∪ Σ ∪ Σ² ∪ Σ³ ∪ …, with φu(s) = number of occurrences of u as a (non-contiguous) subsequence of s. Explicit computation rapidly becomes infeasible (exponential in |s|, even with a sparse representation). Example (initials as symbols): s = J L M S and t = M S L J share the subsequences ε, J, L, M, S and MS, each occurring once in both, so K(s,t) = 6.
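As announced, a minimal p-spectrum sketch (plain Python; hashing the p-grams gives roughly the O(p·(|s|+|t|)) behaviour mentioned above), checked on the slide's own example:

```python
from collections import Counter

def p_spectrum_kernel(s, t, p):
    """p-spectrum kernel: dot product of the histograms of all
    contiguous length-p substrings of the two sequences."""
    grams_s = Counter(tuple(s[i:i + p]) for i in range(len(s) - p + 1))
    grams_t = Counter(tuple(t[i:i + p]) for i in range(len(t) - p + 1))
    return sum(c * grams_t[g] for g, c in grams_s.items())

# The slide's example, with words as symbols and p = 2:
s = "John loves Mary Smith".split()
t = "Mary Smith loves John".split()
assert p_spectrum_kernel(s, t, 2) == 1     # only "Mary Smith" is shared
```

The same function works unchanged over characters, lemmas or concept identifiers: only the tokenisation of s and t changes.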
ACL-04 Tutorial 51 — Recursive implementation. Consider the addition of one extra symbol a to s: the common subsequences of (sa, t) are either common subsequences of (s, t), or end with the symbol a (in both sa and t). Mathematically:
k(s, ε) = 1
k(sa, t) = k(s, t) + Σ_{j: t_j = a} k(s, t[1:j−1]) = k(s, t) + k'(sa, t)
k'(sa, tv) = k'(sa, t) + k(s, t) if v = a (and k'(sa, tv) = k'(sa, t) otherwise)
This gives a complexity of O(|s|·|t|).

ACL-04 Tutorial 52 — Practical implementation (DP table), for s = "John loves Mary" and t = "John admires Mary Ann Smith":
s\t         ε   John  admires  Mary  Ann  Smith
ε           1    1      1       1     1     1
K'(John)    0    1      1       1     1     1
John        1    2      2       2     2     2
K'(loves)   0    0      0       0     0     0
loves       1    2      2       2     2     2
K'(Mary)    0    0      0       2     2     2
Mary        1    2      2       4     4     4
NB: as a by-product, we get all k(a,b) for prefixes a of s and b of t.

ACL-04 Tutorial 53 — Fixed-length subsequence kernels. The feature space is indexed by all elements of Σ^p, with φu(s) = number of occurrences of the p-gram u as a (non-contiguous) subsequence of s. Recursive implementation (creating a series of p tables):
k_0(s, t) = 1 ;  k_p(s, ε) = 0 for p > 0
k_p(sa, t) = k_p(s, t) + Σ_{j: t_j = a} k_{p−1}(s, t[1:j−1]) = k_p(s, t) + k'_p(sa, t)
k'_p(sa, tv) = k'_p(sa, t) + k_{p−1}(s, t) if v = a
Complexity: O(p·|s|·|t|), but we get all l-length subsequence kernels (l ≤ p) for free, so k(s,t) = Σ_l a_l·k_l(s,t) is easy to compute.

ACL-04 Tutorial 54 — Gap-weighted subsequence kernels. The feature space is indexed by all elements of Σ^p, with φu(s) = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, the weight being length-penalizing: λ^length(u), where the length includes both matching symbols and gaps. Example: D1 = ATCGTAGACTGTC, D2 = GACTATGC; φ_CAT(D1) = 2λ⁸ + 2λ¹⁰ and φ_CAT(D2) = λ⁴, so the CAT coordinate contributes 2λ¹² + 2λ¹⁴ to k(D1,D2). This is naturally built as a dot product, hence a valid kernel. For an alphabet of size 80 there are 512,000 trigrams; for an alphabet of size 26 there are about 12·10⁶ 5-grams.

ACL-04 Tutorial 55 — Gap-weighted subsequence kernels. Explicit expansion and dot product are hard to perform! There is an efficient recursive formulation (dynamic-programming-like) whose complexity is O(k·|D1|·|D2|). Normalization (document-length independence): k̂(d1,d2) = k(d1,d2) / √( k(d1,d1)·k(d2,d2) ).

ACL-04 Tutorial 56 — Recursive implementation. Define K'_i(s,t) like K_i(s,t), but with the occurrences weighted by the length to the end of the string(s):
K_i(sa, t) = K_i(s, t) + λ² Σ_{j: t_j = a} K'_{i−1}(s, t[1:j−1])
K'_i(sa, t) = λ·K'_i(s, t) + Σ_{j: t_j = a} λ^{|t|−j+2}·K'_{i−1}(s, t[1:j−1]) = λ·K'_i(s, t) + K''_i(sa, t)
K''_i(sa, tv) = λ·K''_i(sa, t) + [v = a]·λ²·K'_{i−1}(s, t)
K'_0(s, t) = 1 ;  K'_i(s, t) = 0 if min(|s|, |t|) < i
3p DP tables must be built and maintained. As before, we get as a by-product all gap-weighted l-gram kernels with l ≤ p, so that any linear combination k(s,t) = Σ_l a_l·k_l(s,t) is easy to compute. (A code sketch of this recursion is given below.)

ACL-04 Tutorial 57 — Word sequence kernels (I). Here "words" are considered as symbols. Meaningful symbols give more relevant matching, linguistic preprocessing can be applied to improve performance, and shorter sequence sizes improve computation time — but sparsity increases (documents become more "orthogonal"). An intermediate step is a syllable kernel (which indirectly realizes some low-level stemming and morphological decomposition). Motivation: the noisy stemming hypothesis (important N-grams approximate stems), confirmed experimentally in a categorization task.

ACL-04 Tutorial 58 — Word sequence kernels (II). Links between word sequence kernels and other methods: for k = 1, WSK is equivalent to the basic "bag of words" approach; for λ = 1, there is a close relation to the polynomial kernel of degree k, but WSK takes order into account. Extensions of WSK: symbol-dependent decay factors (a way to introduce the IDF concept, dependence on the POS, stop words); different decay factors for gaps and matches (e.g. noun < adj when in a gap, noun > adj when matched); soft matching of symbols (e.g. based on a thesaurus, or on a dictionary if we want cross-lingual kernels).
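As announced on the previous slide, here is a minimal sketch of the gap-weighted subsequence kernel (plain Python, function names chosen here for illustration). For simplicity it uses the direct form of K' (the explicit sum over matching positions), which costs O(p·|s|·|t|²); the K'' table above brings this down to O(p·|s|·|t|).

```python
from functools import lru_cache

def gap_weighted_subsequence_kernel(s, t, p, lam):
    """Features: all length-p subsequences u, each occurrence weighted by
    lam ** span(u), where the span counts matches and gaps."""

    @lru_cache(maxsize=None)
    def k_prime(l, n, m):
        # K'_l on prefixes s[:n], t[:m]: occurrences weighted to the END of both prefixes
        if l == 0:
            return 1.0
        if min(n, m) < l:
            return 0.0
        a = s[n - 1]
        total = lam * k_prime(l, n - 1, m)          # last symbol of s[:n] not used
        for j in range(1, m + 1):                   # last symbol matched at t[j-1]
            if t[j - 1] == a:
                total += k_prime(l - 1, n - 1, j - 1) * lam ** (m - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(l, n, m):
        # K_l on prefixes s[:n], t[:m]: occurrences weighted by their spans
        if min(n, m) < l:
            return 0.0
        a = s[n - 1]
        total = k(l, n - 1, m)
        for j in range(1, m + 1):
            if t[j - 1] == a:
                total += k_prime(l - 1, n - 1, j - 1) * lam ** 2
        return total

    return k(p, len(s), len(t))

# Sanity check: the only common 2-subsequence of "ca" and "ca" is "ca" itself,
# with span 2 in each string, so the kernel value is lam**4.
lam = 0.5
assert abs(gap_weighted_subsequence_kernel("ca", "ca", 2, lam) - lam ** 4) < 1e-12

# Gap penalisation at work on the slide's DNA example (p = 3):
print(gap_weighted_subsequence_kernel("ATCGTAGACTGTC", "GACTATGC", 3, lam))
```

Dividing by the square roots of the two diagonal values, as in the normalisation formula above, removes the dependence on document length.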
ACL-04 Tutorial 59 — Recursive equations for the variants. It is straightforward to adapt the recursive equations without increasing the complexity. With symbol-dependent decay factors:
K_i(sa, t) = K_i(s, t) + λ²_{a,match} Σ_{j: t_j = a} K'_{i−1}(s, t[1:j−1])
K'_i(sa, t) = λ_{a,gap}·K'_i(s, t) + K''_i(sa, t)
K''_i(sa, tv) = λ_{v,gap}·K''_i(sa, t) + [v = a]·λ²_{v,match}·K'_{i−1}(s, t)
Or, for soft matching (b_{a,b}: elementary symbol kernel):
K_i(sa, t) = K_i(s, t) + λ² Σ_{j=1..|t|} b_{a,t_j}·K'_{i−1}(s, t[1:j−1])
K'_i(sa, t) = λ·K'_i(s, t) + K''_i(sa, t)
K''_i(sa, tv) = λ·K''_i(sa, t) + λ²·b_{a,v}·K'_{i−1}(s, t)

ACL-04 Tutorial 60 — Trie-based kernels. An alternative to DP, based on string-matching techniques. A trie (retrieval tree, cf. prefix tree) is a tree whose internal nodes have their children indexed by Σ. Suppose the feature space is F = Σ^p: the leaves of a complete p-trie are exactly the indices of the feature space. Basic algorithm: (1) generate all substrings s(i:j) satisfying the initial criteria, and likewise for t; (2) distribute the s-associated list down from the root to the leaves (depth-first); (3) distribute the t-associated list down from the root to the leaves, taking into account the distribution of the s-list (pruning); (4) compute the product at each leaf and sum over the leaves. Key points: in steps (2) and (3), not all leaves will be populated (otherwise the complexity would be O(|Σ^p|)) … and you need not build the trie explicitly!

ACL-04 Tutorial 61 — A digression on the learning method. Kernel-based learning algorithms (e.g. SVM, kernel perceptron, …) result in models of the form f(Σ_i α_i y_i k(x, x_i)) or f((Σ_i α_i y_i φ(x_i))•φ(x)). This suggests efficient computation by pre-storing ("pre-compiling") the weighted trie Σ_i α_i y_i φ(x_i) (the numbers of occurrences are weighted by α_i y_i and summed up). This "pre-compilation" trick is often possible with many convolution kernels.

ACL-04 Tutorial 62 — Example 1: p-spectrum, p = 2. s = aaba gives {aa, ab, ba}; t = bababb gives {ba, ab, ba, ab, bb}. Distributing these lists down the trie, the populated leaves are aa (s: 1, t: 0), ab (s: 1, t: 2), ba (s: 1, t: 2) and bb (s: 0, t: 1), so k = 2·1 + 2·1 = 4. Complexity: O(p·(|s|+|t|)).

ACL-04 Tutorial 63 — Example 2: (p,m)-mismatch kernels. The feature space is indexed by all elements of Σ^p, with φu(s) = number of p-grams (contiguous substrings) of s that differ from u by at most m symbols (see the example on the next slide, p = 2, m = 1). Complexity: O(p^{m+1}·|Σ|^m·(|s|+|t|)). This can easily be extended, using a semantic (local) dissimilarity matrix b_{a,b}, to φu(s) = number of p-grams of s that differ from u by a total dissimilarity (sum) not larger than some threshold.

ACL-04 Tutorial 64 — Example 2: illustration (p = 2, m = 1). For s = aaba and t = bababb, distributing the 2-grams with their mismatch counts down the trie gives, at the leaves, φ(s) = (aa: 3, ab: 2, ba: 2, bb: 2) and φ(t) = (aa: 4, ab: 3, ba: 3, bb: 5), so k = 3·4 + 2·3 + 2·3 + 2·5 = 34.

ACL-04 Tutorial 65 — Example 3: restricted gap-weighted kernel. The feature space is indexed by all elements of Σ^p, with φu(s) = sum of the weights of the occurrences of the p-gram u as a (non-contiguous) subsequence of s, provided that u occurs with at most m gaps, the weight being gap-penalizing: λ^{gaps(u)}. For small λ, restricting m to 2 or even 1 is a reasonable approximation of the full gap-weighted subsequence kernel. Cf. the previous algorithms, but generate all substrings of length up to p+m in the initial phase.
Complexity O((p+m)m(|s|+|t|)) If m is too large, DP is more efficient ACL-04 Tutorial 66 Example 3 : illustration p=2 ; m=1 S=a a b a {aab,aba} T=b a b a a b {bab,aba,baa,aab} S: {aab:0, aba:0} T: {aba:0, aab:0} a b a b a S:{aab:0; aba:1} S:{aab:1; aba:0} T:{aab:0; aba:1} T:{aab:1; aba:0} k=(1+) (1+)+(1+) (1+) b ACL-04 Tutorial 67 Mini Dynamic Programming Imagine the processing of substring aaa in the previous example In the depth-first traversal, to assign a unique value at the leave (when more than 1 way), Dynamic Programming must be used Build a small DP table [ size (p+m)xp ] to find the least penalised way. ACL-04 Tutorial 68 Tree Kernels Application: categorization [one doc=one tree], parsing (desambiguation) [one doc = multiple trees] Tree kernels constitute a particular case of more general kernels defined on discrete structure (convolution kernels). Intuitively, the philosophy is to split the structured objects in parts, to define a kernel on the “atoms” and a way to recursively combine kernel over parts to get the kernel over the whole. ACL-04 Tutorial 69 Tree seen as String One could use our string kernels by re-encoding the tree as a string, using extra characters (cfr. LISP representation of trees) Ex: VP V loves Encoded as: [VP [V [loves]] [N [Mary]]] N Mary Restrict substrings to subtrees by imposing constraints on the number and position of ‘[‘ and ‘]’ ACL-04 Tutorial 70 Fundaments of Tree kernels Feature space definition: one feature for each possible proper subtree in the training data; feature value = number of occurences A subtree is defined as any part of the tree which includes more than one node, with the restriction there is no “partial” rule production allowed. ACL-04 Tutorial 71 Tree Kernels : example S Example : NP S VP Mary VP VP NP John N VP V N V V N N loves loves loves A Parse Tree Mary … a few among the many subtrees of this tree! Mary VP V N ACL-04 Tutorial 72 Tree Kernels : algorithm Kernel = dot product in this high dimensional feature space Once again, there is an efficient recursive algorithm (in polynomial time, not exponential!) 
Basically, it compares the production of all possible pairs of nodes (n1,n2) (n1T1, n2 T2); if the production is the same, the number of common subtrees routed at both n1 and n2 is computed recursively, considering the number of common subtrees routed at the common children Formally, let kco-rooted(n1,n2)=number of common subtrees rooted at both n1 and n2 k (T1 , T2 ) k n1T1 n2 T2 co rooted (n1 , n2 ) ACL-04 Tutorial 73 All sub-tree kernel Kco-rooted(n1,n2)=0 if n1 or n2 is a leave Kco-rooted(n1,n2)=0 if n1 and n2 have different production or, if labeled, different label Else Kco-rooted(n1,n2)= (1 kco rooted (ch(n1 , i),ch(n2 , i ))) children i “Production” is left intentionally ambiguous to both include unlabelled tree and labeled tree Complexity s O(|T1|.|T2|) ACL-04 Tutorial 74 Illustration a b i c j d e f g h k Kcoroot i j k a 1 0 0 b 0 0 0 c 1 0 0 d 0 0 0 e 0 0 0 f 0 0 0 g 0 0 0 h 0 0 0 K=2 ACL-04 Tutorial 75 Tree kernels : remarks Even with “cosine” normalisation, the kernel remains “to peaked” (influence of larger structure feature … which grow exponentially) Either, restrict to subtrees up to depth p Either, downweight larger structure feature by a decay factor ACL-04 Tutorial 76 d-restricted Tree Kernel Stores d DP tables Kco-rooted(n1,n2,p) p=1,…,d kup_to_ p(T1,T2)kcorooted(n1,n2, p) n1T1n2T2 Kco-rooted(n1,n2,p)=P(1+ Kco-rooted(ci(n1),ci(n2),p-1)) With initial Kco-rooted(n1,n2,1)=1 if n1,n2 same production Complexity is O(p.|T1|.|T2|) ACL-04 Tutorial 77 Depth-penalised Tree Kernel Else Kco-rooted(n1,n2)= 2 (1 kco rooted (ch(n1 , i),ch(n2 , i ))) children i Corresponds to weight each (implicit) feature by size(subtree) where size=number of (nonterminating) nodes ACL-04 Tutorial 78 Variant for labeled ordered tree Example: dealing with html/xml documents Extension to deal with: Partially equal production Children with same labels … but order is important The subtree n1 A A n2 B B A A B A B C is common 4 times ACL-04 Tutorial 79 Labeled Order Tree Kernel : algo Actually, two-dimensional dynamic programming (vertical and horizontal) k (T1 , T2 ) k n1T1 n2 T2 co rooted (n1 , n2 ) Kco-rooted(n1,n2)=0 if n1,n2 different labels; else Kco-rooted(n1,n2)=S(nc(n1),nc(n2)) [number of subtrees up to …] S(i,j)=S(i-1,j)+S(i,j-1)-S(i-1,j-1)+S(i-1,j-1)* Kco-rooted(ch(n1,i),ch(n2,j)) • Complexity: O(|T1|.|T2|.nc(T1).nc(T2)) • Easy extension to allow label mutation by introducing sim(label(a),label(b)) PD matrix ACL-04 Tutorial 80 Dependency Graph Kernel A sub-graph is a connected part with at least two word (and the labeled edge) * with PP-obj telescope det the saw sub I PP obj man det the with PP-obj telescope saw obj man det the det the ACL-04 Tutorial 81 Dependency Graph Kernel k ( D1 , D2 ) k n1D1 n2 D2 co rooted (n1 , n2 ) Kco-rooted(n1,n2)=0 if n1 or n2 has no child Kco-rooted(n1,n2)=0 if n1 and n2 have different label Else Kco-rooted(n1,n2)= (2 k co rooted x , ycommon dependencies ( x, y )) 1 ACL-04 Tutorial 82 Paired sequence kernel A subsequence is a subsequence of states, with or without the associated word States (TAG) words Det Noun Verb The man saw … Det Noun Det Noun The man Verb ACL-04 Tutorial 83 Paired sequence kernel k (S1 , S2 ) k co rooted n1states( S1 ) n2 states( D2 ) (n1 , n2 ) If tag(n1)=tag(n2) AND tag(next(n1))=tag(next(n2)) Kco-rooted(n1,n2)=(1+x)*(1+y+ Kco-rooted(next(n1),next(n2))) where x=1 if word(n1)=word(n2), =0 else y=1 if word(next(n1))=word(next(n2)), =0 else Else Kco-rooted(n1,n2)=0 A c A B d = B c d A + B A + BA d + c B ACL-04 Tutorial 84 Graph Kernels General 
case: Directed Labels on both vertex and edges Loops and cycles are allowed (not in all algorithms) Particular cases easily derived from the general one: Non-directed No label on edge, no label on vertex ACL-04 Tutorial 85 Theoretical issues To design a kernel taking the whole graph structure into account amounts to build a complete graph kernel that distinguishes between 2 graphs only if they are not isomorphic Complete graph kernel design is theoretically possible … but practically infeasible (NPcomplex) Approximations are therefore necessary: Common local subtrees kernels Common (label) walk kernels (most popular) ACL-04 Tutorial 86 Graph kernels based on Common Walks Walk = (possibly infinite) sequence of labels obtained by following edges on the graph Path = walk with no vertex visited twice Important concept: direct product of two graphs G1xG2 V(G1xG2)={(v1,v2), v1 and v2: same labels) E(G1xG2)={(e1,e2): e1, e2: same labels, p(e1) and p(e2) same labels, n(e1) and n(e2) same labels} e p(e) n(e) ACL-04 Tutorial 87 Direct Product of Graphs Examples A A X C B B A A C A B B A A B X B A A ACL-04 Tutorial 88 Kernel Computation Feature space: indexed by all possible walks (of length 1, 2, 3, …n, possibly ) Feature value: sum of number of occurrences, weighted by function (size of the walk) Theorem: common walks on G1 and G2 are the walks of G1xG2 (this is not true for path!) So, K(G1,G2)= S(over all walks g) (size of g)) (PD kernel) Choice of the function will ensure convergence of the kernel : typically bs or bs/s! where s=size of g. Walks of (G1xG2) are obtained by the power series of the adjacency matrix (closed form if the series converges) – cfr. Our discussion about diffusion kernels Complexity: O(|V1|.|V2|.|A(V1)|.|A(V2)|) ACL-04 Tutorial 89 Remark This kind of kernel is unable to make the distinction between B A B A B A B Common subtree kernels allow to overcome this limitation (locally unfold the graph as a tree) ACL-04 Tutorial 90 Variant: Random Walk Kernel Doesn’t (explicitly) make use of the direct product of the graphs Directly deal with walk up to length Easy extension to weighted graph by the « random walk » formalism Actualy, nearly equivalent to the previous approach (… and same complexity) Some reminiscence of PageRank, …approaches ACL-04 Tutorial 91 Random Walk Graph Kernels k(G1,G2)kcorooted(n1,n2) n1G1 n2G2 Where kco-root(n1,n2)= sum of probabilities of common walks starting from both n1 and n2 Very local view: kco-root(n1,n2)= I(n1,n2), wtih I(n1,n2)=1 only if n1 and n2 same labels Recursively defined (cfr. Diffusion kernels) as I(e1,e2) kcoroot(n1,n2)I(n1,n2).(1) k(next(n1,e1),next(n2,e2)) A ( n 1 ) A ( n 2 ) e1A(n1),e2A(n2) ACL-04 Tutorial 92 Random Walk Kernels (end) This corresponds to random walks with a probability of stopping =(1-) at each stage and a uniform (or non-uniform) distribution when choosing an edge Matricially, the vector of kco-root(n1,n2), noted as k, can be written k=(1-)k0+B.k (where B directly derives from the adjacency matrices and has size |V1|.|V2| x |V1|.|V2|) So, kernel computation amount to … matrix inversion: k=(1-)(I-B)-1ko …often solved iteratively (back to the recursice formulation) ACL-04 Tutorial 93 Strategies of Design Kernel as a way to encode prior information Invariance: synonymy, document length, … Linguistic processing: word normalisation, semantics, stopwords, weighting scheme, … Convolution Kernels: text is a recursively-defined data structure. How to build “global” kernels form local (atomic level) kernels? 
Generative model-based kernels: the "topology" of the problem will be translated into a kernel function.

ACL-04 Tutorial 94 — Marginalised (conditional independence) kernels. Assume a family of models M (finite or countably infinite), with a prior P0(m) on each model; each model m gives P(x|m). The feature space is indexed by the models: x → P(x|m). Then, assuming conditional independence, the joint probability is
P_M(x, z) = Σ_{m∈M} P(x, z | m)·P0(m) = Σ_{m∈M} P(x | m)·P(z | m)·P0(m).
This defines a valid probability kernel (conditional independence implies a PD kernel), by marginalising over m. Indeed, the Gram matrix is K = P·diag(P0)·Pᵀ (reminiscent of the latent concept kernels). (A small numerical sketch is given at the end of this part.)

ACL-04 Tutorial 95 — Reminder. This family of strategies brings the additional advantage of using all your unlabeled training data to design more problem-adapted kernels. It constitutes a natural and elegant way of solving semi-supervised problems (a mix of labelled and unlabelled data).

ACL-04 Tutorial 96 — Example 1: PLSA kernel (somewhat artificial). Probabilistic Latent Semantic Analysis provides a generative model of both documents and words in a corpus (d → c → w): P(d,w) = Σ_c P(c)·P(d|c)·P(w|c). Taking the topics c as the models, you can identify the models P(d|c) and the prior P(c), then use the marginalised kernel P_M(d1, d2) = Σ_c P(d1|c)·P(d2|c)·P0(c).

ACL-04 Tutorial 97 — Example 2: HMM generating fixed-length strings. The generative model of a string s (of length n) is given by an HMM with state set A: P(s|h) = Π_{i=1..n} P(s_i | h_i). Then
k(s, t) = Σ_{paths h∈Aⁿ} Π_{i=1..n} P(s_i | h_i)·P(t_i | h_i)·P(h_i | h_{i−1}),
with an efficient recursive (DP) implementation in O(n·|A|²).

ACL-04 Tutorial 98 — One step further: marginalised kernels with latent variable models. Assume you know both the visible (x) and the latent (h) variables and want to impose a joint kernel kz((x,h),(x',h')). This is more flexible than the previous approach, for which kz((x,h),(x',h')) is automatically 0 if h ≠ h'. The kernel is then obtained by marginalising (averaging) over the latent variables:
k(x, x') = Σ_h Σ_{h'} p(h|x)·p(h'|x')·kz((x,h),(x',h')).

ACL-04 Tutorial 99 — Marginalised latent kernels. The posteriors p(h|x) are given by the generative model (using Bayes' rule). This is an elegant way of coupling generative models and kernel methods; the joint kernel kz allows domain (user) knowledge to be introduced. Intuition: learning with such kernels will perform well if the class variable is contained as a latent variable of the probability model (as good as the corresponding MAP decision). The basic key feature: when computing the similarity between two documents, the same word (x) can be weighted differently depending on the context (h) [title, development, conclusion].

ACL-04 Tutorial 100 — Examples of marginalised latent kernels. Gaussian mixture: p(x) = Σ_h p(h)·N(x | m_h, A_h). Choosing kz((x,h),(x',h')) = xᵀ·A_h·x' if h = h' (0 otherwise) corresponds to the local Mahalanobis intra-cluster distance; then k(x,x') = Σ_h p(x|h)·p(x'|h)·p²(h)·xᵀ·A_h·x' / (p(x)·p(x')). Contextual BOW kernel: let x, x' be two symbol sequences corresponding to sequences h, h' of hidden states. We can decide to count common symbol occurrences only if they appear in the same context (given the h and h' sequences, this is the standard BOW kernel restricted to common states); the results are then summed and weighted by the posteriors.
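As announced above, a minimal numerical sketch of the marginalised (conditional-independence) kernel, with made-up probabilities (NumPy):

```python
import numpy as np

# Illustrative (made-up) conditional probabilities P(x | m):
# rows = data items x1..x4, columns = models/topics m1..m3.
P = np.array([[0.5, 0.1, 0.0],
              [0.3, 0.2, 0.1],
              [0.1, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
P0 = np.array([0.5, 0.3, 0.2])        # prior over the models

# Marginalised kernel:  k(x, z) = sum_m P(x|m) P(z|m) P0(m)
# i.e. the Gram matrix is  K = P diag(P0) P^T
K = P @ np.diag(P0) @ P.T

# Validity check: the Gram matrix is symmetric positive semi-definite
eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals >= -1e-12)
print(np.round(K, 4))
```

The non-negative eigenvalues confirm that marginalising over models yields a valid kernel, exactly as the factorisation K = P·diag(P0)·Pᵀ suggests.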
ACL-04 Tutorial 101 — [Figure-only slide.]

ACL-04 Tutorial 102 — Fisher kernels. Assume you have only one model. The marginalised kernel then gives you little information: only one feature, P(x|m). To exploit more, the model must be "flexible", so that we can measure how it adapts to individual items — we require a "smoothly" parametrised model. Link with the previous approach: locally perturbed models constitute our family of models, but dim F = number of parameters. More formally, let P(x|θ0) be the generative model (θ0 is typically found by maximum likelihood); the gradient ∇_θ log P(x|θ) evaluated at θ = θ0 reflects how the model would be changed to accommodate the new point x (NB: in practice the log-likelihood is used).

ACL-04 Tutorial 103 — Fisher kernel: formally. Two objects are similar if they require a similar adaptation of the parameters or, in other words, if they stretch the model in the same direction:
K(x,y) = (∇_θ log P(x|θ)|_{θ=θ0})ᵀ · I_M⁻¹ · (∇_θ log P(y|θ)|_{θ=θ0}),
where I_M is the Fisher information matrix, I_M = E[ (∇_θ log P(x|θ)|_{θ=θ0})·(∇_θ log P(x|θ)|_{θ=θ0})ᵀ ].

ACL-04 Tutorial 104 — On the Fisher information matrix. The FI matrix gives a non-Euclidean, topology-dependent dot product. It also provides invariance to any smooth invertible reparametrisation. It can be approximated by the empirical covariance on the training data points; but it can be shown that this increases the risk of amplifying noise if some parameters are not relevant. In practice, I_M is often simply taken to be the identity I.

ACL-04 Tutorial 105 — Example 1: language model. Language models can improve k-spectrum kernels. A language model is a generative model whose parameters are the p(w_n | w_{n−k} … w_{n−1}). The likelihood of s (of length n) is Π_{j=1}^{n−k} p(s_{j+k} | s_j … s_{j+k−1}). The gradient of the log-likelihood with respect to the parameter associated with uv is the number of occurrences of uv in s divided by p(uv). (A minimal code sketch is given at the end of this part.)

ACL-04 Tutorial 106 — String kernels and Fisher kernels. The standard p-spectrum kernel corresponds to the Fisher kernel of a p-stage Markov process with a uniform distribution for p(uv) (= 1/|Σ|) … which is the least informative parameter setting. Similarly, the gap-weighted subsequence kernel is the Fisher kernel of a generalized k-stage Markov process with decay factor λ and uniform p(uv) (any subsequence of length k−1 from the beginning of the document contributes to explaining the next symbol, with a gap-penalizing weight).

ACL-04 Tutorial 107 — Example 2: PLSA-Fisher kernels. The Fisher kernel for PLSA improves the standard BOW kernel:
K(d1,d2) = Σ_c P(c|d1)·P(c|d2) / P(c)  +  Σ_w t̃f(w,d1)·t̃f(w,d2)·Σ_c P(c|d1,w)·P(c|d2,w) / P(w|c),
i.e. K = k1 + k2, where k1(d1,d2) is a measure of how much d1 and d2 share the same latent concepts (synonymy is taken into account), and k2(d1,d2) is the traditional inner product of common term frequencies (t̃f: normalised term frequency), but weighted by the degree to which these terms belong to the same latent concepts (polysemy is taken into account).

ACL-04 Tutorial 108 — Link between Fisher kernels and marginalised latent variable kernels. Assume a generative model of the form P(x) = Σ_h p(x, h | θ) (a latent variable model). Then the Fisher kernel can be rewritten as k(x,x') = Σ_h Σ_{h'} p(h|x)·p(h'|x')·kz((x,h),(x',h')), with a particular form of kz. Some will argue that Fisher kernels are better, as their kz is theoretically founded; others that MLV kernels are better, because of the flexibility in choosing kz.

ACL-04 Tutorial 109 — Applications of kernel methods in NLP. Document categorization and filtering; event detection and tracking; chunking and segmentation; dependency parsing; POS tagging; named entity recognition; information extraction; others: word sense disambiguation, Japanese word segmentation, …
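Before moving on to the applications, here is the minimal Fisher-kernel sketch announced above, for the simplest possible language model (a unigram model, i.e. the k = 0 case). As on the slides, the gradient of the log-likelihood with respect to p(w) is count(w, d)/p(w), the normalisation constraint is ignored, and I_M is taken as the identity; the vocabulary and probabilities below are made up for illustration.

```python
from collections import Counter
import numpy as np

# Unigram probabilities p(w) of the generative model (illustrative values;
# in practice they would be estimated on a corpus).
vocab = ["the", "kernel", "method", "rare"]
p = np.array([0.6, 0.2, 0.15, 0.05])

def fisher_score(doc):
    # d/dp(w) log P(doc | p) = count(w, doc) / p(w)
    counts = Counter(doc.split())
    return np.array([counts[w] for w in vocab]) / p

def fisher_kernel(d1, d2):
    # I_M approximated by the identity, as suggested on the slides
    return float(np.dot(fisher_score(d1), fisher_score(d2)))

d1 = "the kernel method"
d2 = "the rare kernel"
print(fisher_kernel(d1, d2))   # shared rare words contribute far more than frequent ones
```

Even in this tiny case the effect described on the slides is visible: the Fisher features re-weight a plain BOW representation, so that matching on "kernel" counts much more than matching on "the".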
ACL-04 Tutorial 110 — General ideas (1). Consider generic NLP tasks such as tagging (POS, NE, chunking, …) or parsing (syntactic parsing, …). Kernels are defined for structures such as paired sequences (tagging) and trees (parsing); they can easily be extended to weighted (stochastic) structures (probabilities given by an HMM, a PCFG, …). The goal: instead of finding the most plausible analysis by building a generative model, define kernels and use a learning method (a classifier) to discover the correct analysis — which requires training examples. Advantages: avoid relying on restrictive assumptions (independence, restriction to low-order information, …) and take larger substructures into account through efficient kernels.

ACL-04 Tutorial 111 — General ideas (2). The methodology often involves transforming the original problem into a classification problem or a ranking problem. Example: the parsing problem transformed into ranking. A sentence s gives {x1, x2, x3, …, xn}, the candidate parse trees, obtained by a CFG or the top n of a PCFG. Find a ranking model which outputs a plausibility score W•φ(xi); ideally, for each sentence, W•φ(x1) > W•φ(xi) for i = 2,…,n. This is the primal formulation (not practical).

ACL-04 Tutorial 112 — Ranking problem. [Figure: in feature space (f1, f2), the correct trees for all training sentences are separated from the incorrect trees by an optimal W; a non-optimal W is also shown.]

ACL-04 Tutorial 113 — Ranking problem: dual algorithm. This problem is very close to classification; the only difference is that the origin has no importance: we can work with the relative values φ(xi) − φ(x1) instead. Dual formulation: W = Σ αi,j·[φ(x1,j) − φ(xi,j)] (αi,j: dual parameters). The decision output is now Σ αi,j·[k(x, x1,j) − k(x, xi,j)]. The dual parameters are obtained by margin maximisation (SVM) or by a simple updating rule such as αi,j ← αi,j + 1 if W•φ(x1,j) < W•φ(xi,j) (this is the dual formulation of the perceptron algorithm).

ACL-04 Tutorial 114 — Typical results. Based on the coupling (efficient kernel + margin-based learning), which is known to have good generalisation properties, both theoretically and experimentally, and to overcome the "curse of dimensionality" problem in high-dimensional feature spaces. Margin-based kernel method vs PCFG for parsing the Penn Treebank ATIS corpus: a 22% increase in accuracy. Other learning frameworks such as boosting or MRFs are primal: having the same features as in kernel methods is prohibitive, and the choice of good features is critical.

ACL-04 Tutorial 115 — Slowness remains the main issue! Slowness during learning (typically for SVMs): circumvented by heuristics in SVM solvers (caching, linear expansion, …), or by using low-cost learning methods such as the voted perceptron (the same as the perceptron, with storage of the intermediate "models" and a weighted vote). Slowness during the classification step (applying the learned model to new examples to take a decision): use of efficient representations (inverted index, a posteriori expansion of the main features to get a linear model, so that the decision function is computed in linear time and is no longer quadratic in document size); some "pre-compilation" of the support-vector solution is often possible. Revision learning: efficient co-working between a standard method (e.g. an HMM for tagging) and an SVM (used only to correct the errors of the HMM: a binary problem instead of a complex one-vs-n problem).
ACL-04 Tutorial 116 — Document categorization & filtering. A classification task: classes = topics, or class = relevant / not relevant (filtering). Typical corpus: 30,000 features, 10,000 training documents. Break-even point (%):
Method        Reuters   WebKB   Ohsumed
Naïve Bayes     72.3     82.0     62.4
Rocchio         79.9     74.1     61.5
C4.5            79.4     79.1     56.7
k-NN            82.6     80.5     63.4
SVM             87.5     90.3     71.6

ACL-04 Tutorial 117 — SVM for POS tagging. POS tagging is a multiclass classification problem. Typical feature vector (for unknown words): surrounding context (words on both sides); morphological information (prefixes and suffixes, presence of capitals, numerals, …); POS tags of the preceding words (previous decisions). Results on the Penn Treebank WSJ corpus: TnT (second-order HMM): 96.62% (F1 measure); SVM (one-vs-rest): 97.11%; revision learning: 96.60%.

ACL-04 Tutorial 118 — SVM for chunk identification. Each word has to be tagged with a chunk label (an IOB tag / chunk type combination), e.g. I-NP, B-NP, … This can be seen as a classification problem with typically 20 categories (a multiclass problem, solved by one-vs-rest or by pairwise classification with max or majority voting — K·(K−1)/2 classifiers). Typical feature vector: surrounding context (words and POS tags) and the (estimated) chunk labels of the previous words. Results on the WSJ corpus (sections 15–19 as training, section 20 as test): SVM: 93.84%; a combination (weighted vote) of SVMs with different input representations and directions: 94.22%.

ACL-04 Tutorial 119 — Named entity recognition. Each word has to be tagged with a combination of an entity label (8 categories) and 4 sub-tags (B, I, E, S), e.g. Company-B, Person-S, … This can be seen as a classification problem with typically 33 categories (a multiclass problem, solved by one-vs-rest or pairwise classification with max or majority voting). Typical feature vector: surrounding context (words and POS tags), character type (in Japanese). Consistency among word classes is ensured by a Viterbi search (the SVM scores are transformed into probabilities). Results on the IREX data set ('CRL' as training, 'General' as test): RG+DT (rule-based): 86% F1; SVM: 90% F1.

ACL-04 Tutorial 120 — SVM for word sense disambiguation. Can be considered as a classification task (choose between some predefined senses). NB.
One (multiclass) classifier for each ambiguous word Typical Feature Vector: Surrounding context (words and POS tags) Presence/absence of focus-specific keywords in a wider context As usual in NLP problems, few training examples and many features On the 1998 Senseval competition Corpus, SVM has an average rank of 2.3 (in competition with 8 other learning algorithms on about 30 ambiguous words) ACL-04 Tutorial 121 Recent Perspectives Rational Kernels Weighted Finite SateTransducer representation (and computation) of kernels Can be applied to compute kernels on variable-length sequences and on weighted automata HDAG Kernels (tomorrow presentation of Suzuki and co-workers) Kernels defined on Hierarchical Directed Acyclic Graph (general structure encompassing numerous structures found in NLP) ACL-04 Tutorial 122 Rational Kernels (I) Reminder: FSA accepts a set a strings x A string can be directly represented by an automaton WFST associates to each pair of strings (x,y) a weight [ T ](x,y) given by the « sum » over all « succesful paths » (accepts x in input, emits y in output, starts from an initial state and ends in a final state) of the weights of the path (the « product » of the transition weights) A string kernel is rational if there exists a WFST T and a function y such that k(x,y)=y([T](x,y)) ACL-04 Tutorial 123 Rational Kernels (II) A rational kernel will define a PD (valid) kernel iif T can be decomposed as UoU-1 (U-1 is obtained by swapping input/output labels … some kind of transposition; « o » is the composition operator) Indeed, by definition of composition, neglecting y, K(x,y)=Sz[U](x,z)*[U](y,z) … corresponding to A*At (sum and product are defined over a semi-ring) Kernel computation y([T](x,y) involves Transducer composition ((XoU)o(U-1oY)) Shortest distance algo to find y([T](x,y) Using « failure functions », complexity is O(|x|+|y|) P-spectrum kernels and the gap-weighted string kernels can be expressed as rational transducer (the elementary « U » transducer is some « counting transducer », automatically outputting weighted substrings of x ACL-04 Tutorial 124 HDAG Kernels Convolution kernels applied to HDAG, ie. a mix of tree and DAG (nodes in a DAG can themselves be a DAG … but edges connecting « nonbrother » nodes are not taken into account Can handle hierarchical chunk structure, dependency relations, attributes (labels such as POS, type of chunk, type of entity, …) associated with a node at whatever level Presented in details tomorrow ACL-04 Tutorial 125 Conclusions If you can only remember 2 principles after this session, these should be: Kernel methods are modular The KERNEL, unique interface with your data, incorporating the structure, the underlying semantics, and other prior knowledge, with a strong emphasis on efficient computation while working with an (implicit) very rich representation The General learning algorithm with robustness properties based on both the dual formulation and margin properties Successes in NLP come from both origines Kernel exploiting the particular structures of NLP entities NLP tasks reformulated as (typically) classification or ranking tasks, enabling general robust kernel-based learning algorithms to be used. ACL-04 Tutorial 126 Bibliography (1) Books on Kernel Methods (general): J. Shawe-Taylor and N. Cristianini , Kernel Methods for Pattern Analysis, Cambridge University Press, 2004 N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000 B. Schölkopf, C. Burges, and A.J. 
Smola, Advances in Kernel Methods – Support Vector Learning, MIT Press, 1999 B. Schölkopf and A.J. Smola, Learning with Kernels, MIT Press, 2001 V. N. Vapnik, Statiscal Learning Theory, J. Wiley & Sons, 1998 Web Sites: www.kernel-machines.org www.support-vector.net www.kernel-methods.net www.euro-kermit.org ACL-04 Tutorial 127 Bibliography (2) General Principles governing kernel design [Sha01,Tak03] Kernels built from data [Cri02a,Kwo03] Kernels for texts – Prior information encoding BOW kernels and linguistic enhancements: [Can02, Joa98, Joa99, Leo02] Semantic Smoothing Kernels: [Can02, Sio00] Latent Concept Kernels: [Can02, Cri02b, Kol00, Sch98] Multilingual Kernels: [Can02, Dum97, Vin02] Diffusion Kernels: [Kan02a, Kan02b, Kon02, Smo03] ACL-04 Tutorial 128 Bibliography (3) Kernels for Text – Convolution Kernels [Hau99, Wat99] String and sequence kernels: [Can03, Les02, Les03a, Les03b, Lod01, Lod02, Vis02] Tree Kernels: [Col01, Col02] Graph Kernels: [Gar02a, Gar02b, Kas02a, Kas02b, Kas03, Ram03, Suz03a] Kernels defined on other NLP structures: [Col01, Col02, Cor03, Gar02a, Gar03, Kas02b, Suz03b] Kernels for Text – Generative-model Based Fisher Kernels: [Hof01, Jaa99a, Jaa99b, Sau02, Sio02] Other approaches: [Alt03, Tsu02a, Tsu02b] ACL-04 Tutorial 129 Bibliography (4) Particular Applications in NLP Categorisation:[Dru99, Joa01, Man01, Tak01, Ton01,Yan99] WSD: [Zav00] Chunking: [Kud00, Kud01] POS-tagging: [Nak01, Nak02, Zav00] Entity Extraction: [Iso02] Relation Extraction: [Cum03, Zel02] [Alt03] Y. Altsun, I. Tsochantaridis and T. Hofmann. Hidden Markov Support Vector Machines. ICML 2003. [Can02] N. Cancedda et al., Cross-Language and Semantic Information Analysis and its Impact on Kernel Design, Deliverable D3 of the KERMIT Project, February 2002. [Can03] N. Cancedda, E. Gaussier, C. Goutte, and J.-M. Renders. Word-sequence kernels. Journal of Machine Learning Research 3:1059-1082, 2003. [Col01] M. Collins and N. Duffy. Convolution kernels for natural languages. NIPS’2001. [Col02] M. Collins and N. Duffy, Convolution Kernels for Natural Language Processing. In Advances in Neural Information Processing Systems, 14, 2002 [Cor03] C. Cortes, P. Haffner and M. Mohri. Positive Definite Rational Kernels. COLT 2003 [Cri02a] N. Cristianini, A. Eliseef, J. Shawe-Taylor and J. Kandola, On Kernel-Target Alignment, In Advances in Neural Information Processing Systems 14, MIT Press, 2002 [Cri02b] N. Cristianini, J. Shawe-Taylor and Huma Lodhi, Latent Semantic Kernels. Journal of Intelligent Information Systems, 18 (2-3):127-152, 2002 [Cum03] C. Cumby and D. Roth. On kernel methods for relational learning. ICML’2003. [Dru99] H. Drucker, D. Wu and V. Vapnik, Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks 10 (5), 1048-1054, 1999 [Dum97] S. T. Dumais, T.A. Letsche, M.L. Littmann and T.K. Landauer, Automatic Cross-Language Retrieval Using Latent Semantic Indexing. In AAAI Spring Symposium on Cross-Language ext and Speech Retrieval, 1997 [Gar02a] T. Gartner, J. Lloyd and P. Flach. Kernels for Structured Data. Proc. Of 12th Conf. On Inductive Logic Programming, 2002. [Gar02b] T. Gartner, Exponential and Geometric kernels for graphs. NIPS,- Workshop on unreal data – 2002 [Gar03] T. Gartner. A survey of kernels for structured data. SIGKDD explorations 2003. [Hau99] D. Haussler, Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, University of California in Santa Cruz, Computer Science Department, July 1999 [Hof01] T. 
Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42 (12):177-196, 2001 [Iso02] H. Isozaki and H. Kazawa. Efficient Support Vector Classifiers for Named Entity Recognition. COLING 2002 [Jaa99a] T. S. Jaakkola and D. Haussler, Probabilistic Kernel Regression Models. Proceedings on the Conference on AI and Statistics, 1999 [Jaa99b] T.S. Jaakkola and D. Haussler, Exploiting Generative Models in Discriminative Classifiers. In Advances in Neural Information Processing Systems 11, 487:493, MIT Press, 1999 [Joa01] T. Joachims, N. Cristianini and J. Shawe-Taylor, Composite Kernels for Hypertext Categorization. Proceedings 18th International Conference on Machine Learning (ICML-01), Morgan Kaufmann Publishers, 2001 [Joa98] T. Joachims, Text Categorization with Support Vector Machines : Learning with many Relevant Features. In Proceedings of the European Conference on Machine Learning, Berlin, 137-142, Springer Verlag, 1998 [Joa99] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning, 1999 [Kan02a] J. Kandola, J. Shawe-Taylor, and N. Cristianini. Learning semantic similarity. NIPS’2002. [Kan02b] J. Kandola, J. Shawe-Taylor, and N. Cristianini. On the Applications of Diffusion Kernels to Text Data. NeuroCOLT’2002. [Kas02a] H. Kashima and A. Inokuchi. Kernels for Graph Classification. ICDM Workshop on Active Mining 2002. [Kas02b] H. Kashima and T. Koyanagi. Kernels for Semi-Structured Data. ICML 2002. [Kas03] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. ICML’2003. [Kol00] T. Kolenda, L.K. Hansen and S. Sigurdsson, Independent Components in Text. In Advances in Independent Component Analysis (M. Girolami Editor), Springer Verlag, 2000 [Kon02] R.I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. ICML’2002. [Kud00] T. Kudo and Y. Matsumoto, Use of Support Vector Learning for Chunk Identification. Proceedings of CoNNL-2000 and LLL-2000, Lisbon, Portugal, 2000 [Kud01] T. Kudo and Y. Matsumoto. Chunking with Support Vector Machines. NAACL 2001 [Kud03] T. Kudo and Y. Matsumoto. Fast Methods for Kernel-based Text Analysis. ACL 2003 [Kwo03] J. Kwok and I. Tsang. Learning with Idealized Kernels. ICML 2003 [Leo02] E. Leopold and J. Kindermann, Text Categorization with Support Vector Machines : how to represent Texts in Input Space?, Machine Learning 46, 423-444, 2002 [Les02] C. Leslie, E. Eskin and W. Noble. The Spectrum Kernel : a string kernel for SVM Protein Classification. Proc. Of the Pacific Symposium on Biocomputing. 2002 [Les03a] C. Leslie and R. Kuang. Fast Kernels for Intexact String Matching. COLT 2003. [Les03b] C. Leslie, E. Eskin, J. Weston and W. Noble. Mismatch String Kernels for SVM Protein Classification. NIPS 2002. [Lod01] H. Lodhi, N. Cristianini, J. Shawe-Taylor and C. Watkins, Text Classification using String Kernel. In Advances in Neural Information Processing Systems 13, MIT Press, 2001 [Lod02] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini and C. Watkins, Text Classification using String Kernels. Journal of Machine Learning Research 2, 419-444, 2002 [Man01] L. Manevitz and M. Yousef, One-Class SVMs for Document Classification. Journal of Machine Learning Research 2, December 2001 [Nak01] T. Nakagawa, T. Kudoh and Y. Matsumoto, Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines. 
Proceedings of the 6th Natural Language Processing Pacific Rim Symposium (NLPRS2001), 2001 [Nak02] T. Nakagawa, T. Kudo and Y. Matsumoto. Revision Learning and its Application to Part-of-Speech Tagging. ACL 2002. [Ram03] J. Ramon and T. Gartner. Expressivity vs Efficiency of Graph Kernels. MGTS 2003 [Sau02] C. Saunders, J. Shawe-Taylor, and A. Vinokourov. String kernels, Fisher kernels and finite state automata.NIPS’2002. [Sch98] B. Schölkopf, A.J. Smola and K. Müller, Kernel Principal Component Analysis. In Advances in Kernel Methods – Support Vector Learning, MIT Press, 1998 [Sha01] J. Shawe-Taylor et al., Report on Techniques for Kernels, Deliverable 2 of the KERMIT Project, August 2001. [Sio00] G. Siolas and F. d’Alche Buc, Support Vector Machines based on a Semantic Kernel for Text Categorization. In Proceedings of the International Joint Conference on Neural Networks 2000, Vol.5, 205-209, IEEE Press, 2000 [Sio02] G. Siolas and F. d’Alche-Buc. Mixtures of probabilistic PCAs and Fisher kernels for word and document Modeling. ICANN’2002. [Smo03] A. Smola and R. Kondor. Kernels and Regularization on Graphs. COLT 2003 [Suz03a] J. Suzuki, T. Hirao, Y. Saski and E. Maeda. Hierarchical Directed Acyclic Graph Kernel: Methods for Structured Natural Language Data. ACL 2003 [Suz03b] J. Suzuki, Y. Sasaki, and E. Maeda. Kernels for structured natural language data. NIPS’2003. [Tak01] H. Takamura and Y. Matsumoto, Feature Space Restructuring for SVMs with Application to Text Categorization. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2001 [Tak03] E. Takimoto and M. Warmut. Path Kernels and Multiplicative Updates. Journal of Machine Learning Research 4, pp. 773-818. 2003 [Ton01] S. Tong and D. Koller, Support Vector Machine Active Learning with Applications to Text Classification. Journal of Machine Learning Research 2, December 2001 [Tsu02a] K. Tsuda, M. Kawanabe, G. Ratsch, S. Sonnenburg, and K.-R. Muller. A new discriminative kernel from probabilistic models. Neural Computation 14:2397-2414, 2002. [Tsu02b] K. Tsuda, T. Kin and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics, 1 (1), pp. 18, 2002 [Vin02] A. Vinokourov, J. Shawe-Taylor and N. Cristianini, Finding Language-Independent Semantic Representation of Text using Kernel Canonical Correlation Analysis. NeuroCOLT Technical Report, NC-TR-02119, 2002 [Vis02] S. Vishwanathan and A. Smola. Fast Kernels for String and Tree Matching. NIPS 2002. [Wat99] C. Watkins, Dynamic Alignment Kernels. Technical Report CSD-TR-98-11, Royal Holloway, University of London, Computer Science Department, January 1999 [Yan99] Y. Yang and X. Liu, A Re-examination of Text Categorization Methods. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, 1999 [Zav00] J. Zavrel, S Degroeve, A. Kool, W. Daelemans and K Jokinen, Diverse Classifiers for NLP Disambiguation Tasks : Comparison, Optimization, Combination and Evolution. Proceedings of the 2nd CevoLE Workshop, 2000 [Zel02] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relational extraction. Journal of Machine Learning Research 3