Constraint satisfaction inference for discrete sequence processing in NLP
Antal van den Bosch
ILK / CL and AI, Tilburg University
DCU Dublin, April 19, 2006
(work with Sander Canisius and Walter Daelemans)

Talk overview
• How to map sequences to sequences, not output tokens?
• Case studies: syntactic and semantic chunking
• Discrete versus probabilistic classifiers
• Constraint satisfaction inference
• Discussion

How to map sequences to sequences?
• Machine learning's pet solution:
  – Local-context windowing (NETtalk)
  – One-shot prediction of single output tokens
  – Concatenation of predicted tokens

The near-sightedness problem
• A local window never captures long-distance information.
• No coordination of individual output tokens.
• Long-distance information does exist; holistic coordination is needed.

Holistic information
• "Counting" constraints:
  – Certain entities occur only once in a clause/sentence.
• "Syntactic validity" constraints:
  – On discontinuity and overlap; chunks have a beginning and an end.
• "Cooccurrence" constraints:
  – Some entities must occur with others, or cannot co-exist with others.

Solution 1: Feedback
• Recurrent networks in ANN (Elman, 1991; Sun & Giles, 2001), e.g. word prediction.
• Memory-based tagger (Daelemans, Zavrel, Berck, and Gillis, 1996).
• Maximum-entropy tagger (Ratnaparkhi, 1996).

Feedback disadvantage
• Label bias problem (Lafferty, McCallum, and Pereira, 2001):
  – Previous prediction is an important source of information.
  – Classifier is compelled to take its own prediction as correct.
  – Cascading errors result.

Label bias problem (worked example over four figure slides; figures not reproduced)

Solution 2: Stacking
• Wolpert (1992) for ANNs.
• Veenstra (1998) for NP chunking:
  – Stage-1 classifier, near-sighted, predicts sequences.
  – Stage-2 classifier learns to correct stage-1 errors by taking stage-1 output as windowed input.

Windowing and stacking (figure slide)

Stacking disadvantages
• Practical issues:
  – Ideally, train stage-2 on cross-validated output of stage-1, not "perfect" output.
  – Costly procedure.
  – Total architecture: two full classifiers.
• Local, not global error correction.

What exactly is the problem with mapping to sequences?
• Born in Made , The Netherlands → O_O_B-LOC_O_B-LOC_I-LOC
• Multi-class classification with 100s or 1000s of classes?
  – Lack of generalization
• Some ML algorithms cannot cope very well:
  – SVMs
  – Rule learners, decision trees
• However, others can:
  – Naïve Bayes, Maximum-entropy
  – Memory-based learning

Solution 3: n-gram subsequences
• Retain windowing approach, but
• Predict overlapping n-grams of output tokens.

Resolving overlapping n-grams
• Probabilities available: Viterbi
• Other options: voting

N-gram + voting disadvantages
• Classifier predicts syntactically valid trigrams, but
• After resolving overlap, only local error correction.
• End result is still a concatenation of local uncoordinated decisions.
• Number of classes increases (problematic for some ML).
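As a concrete illustration of Solution 3, here is a minimal sketch (Python; illustrative only, not the TiMBL/MBT software used in the talk, and all function and variable names are my own) of how a labeled sentence is turned into windowed instances whose classes are trigrams of output labels, and how overlapping trigram predictions can then be resolved by simple per-position voting.

```python
# Minimal sketch: 3-1-3 windowed instances with trigram output classes, plus
# per-position voting over overlapping trigram predictions. Illustrative only.

from collections import Counter

PAD = "_"

def make_instances(tokens, labels, left=3, right=3):
    """One instance per token: a window of words around the focus position,
    classed with the trigram (label[i-1], label[i], label[i+1])."""
    padded_t = [PAD] * left + list(tokens) + [PAD] * right
    padded_l = [PAD] + list(labels) + [PAD]
    instances = []
    for i in range(len(tokens)):
        window = tuple(padded_t[i:i + left + 1 + right])           # word features
        trigram = (padded_l[i], padded_l[i + 1], padded_l[i + 2])  # labels i-1, i, i+1
        instances.append((window, trigram))
    return instances

def resolve_by_voting(trigram_predictions):
    """Each position receives one vote from every predicted trigram covering it;
    the majority label wins."""
    votes = [Counter() for _ in trigram_predictions]
    for i, (prev_lab, focus_lab, next_lab) in enumerate(trigram_predictions):
        if i > 0:
            votes[i - 1][prev_lab] += 1
        votes[i][focus_lab] += 1
        if i + 1 < len(votes):
            votes[i + 1][next_lab] += 1
    return [v.most_common(1)[0][0] for v in votes]

# Example from the slides: "Born in Made , The Netherlands"
tokens = ["Born", "in", "Made", ",", "The", "Netherlands"]
labels = ["O", "O", "B-LOC", "O", "B-LOC", "I-LOC"]
for window, trigram in make_instances(tokens, labels):
    print(window, "->", "_".join(trigram))
```

Each position receives up to three votes, one from each trigram that covers it; when class probabilities are available, Viterbi decoding would replace the voting step.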
Learning linguistic sequences

Talk overview
• How to map sequences to sequences, not output tokens?
• Case studies: syntactic and semantic chunking
• Discrete versus probabilistic classifiers
• Constraint satisfaction inference
• Discussion

Four "chunking" tasks
• English base-phrase chunking
  – CoNLL-2000, WSJ
• English named-entity recognition
  – CoNLL-2003, Reuters
• Dutch medical concept chunking
  – IMIX/Rolaquad, medical encyclopedia
• English protein-related entity chunking
  – Genia, Medline abstracts

Treated the same way
• IOB-tagging.
• Windowing:
  – 3-1-3 words
  – 3-1-3 predicted PoS tags (WSJ / Wotan)
• No seed lists, suffix/prefix, capitalization, …
• Memory-based learning and maximum-entropy modeling.
• MBL: automatic parameter optimization (paramsearch, Van den Bosch, 2004).

IOB-codes for chunks: step 1, PTB-II WSJ
((S (ADVP-TMP Once) (NP-SBJ-1 he) (VP was (VP held (NP *-1) (PP-TMP for (NP three months)) (PP without (S-NOM (NP-SBJ *-1) (VP being (VP charged) ))))) .))

IOB codes for chunks: flatten the tree
[Once]ADVP [he]NP [was held]VP [for]PP [three months]NP [without]PP [being charged]VP

Example: Instances
     feature 1   feature 2   feature 3
     (word -1)   (word 0)    (word +1)   class
 1.  _           Once        he          I-ADVP
 2.  Once        he          was         I-NP
 3.  he          was         held        I-VP
 4.  was         held        for         I-VP
 5.  held        for         three       I-PP
 6.  for         three       months      I-NP
 7.  three       months      without     I-NP
 8.  months      without     being       I-PP
 9.  without     being       charged     I-VP
10.  being       charged     .           I-VP
11.  charged     .           _           O

MBL
• Memory-based learning:
  – k-NN classifier (Fix and Hodges, 1951; Cover and Hart, 1967; Aha et al., 1991; Daelemans et al.)
  – Discrete, point-wise classifier
  – Implementation used: TiMBL (Tilburg Memory-Based Learner)

Memory-based learning and classification
• Learning:
  – Store instances in memory.
• Classification:
  – Given new test instance X,
  – Compare it to all memory instances:
    • Compute a distance between X and memory instance Y.
    • Update the top k of closest instances (nearest neighbors).
  – When done, take the majority class of the k nearest neighbors as the class of X.

Similarity / distance
• A nearest neighbor has the smallest distance, or the largest similarity.
• Computed with a distance function.
• TiMBL offers two basic distance functions:
  – Overlap
  – MVDM (Stanfill & Waltz, 1986; Cost & Salzberg, 1993)
• Feature weighting
• Exemplar weighting
• Distance-weighted class voting

The Overlap distance function
• "Count the number of mismatching features":
  Δ(X, Y) = Σ_{i=1..n} δ(x_i, y_i), with
  δ(x_i, y_i) = |x_i − y_i| / (max_i − min_i)  if feature i is numeric,
              = 0  if x_i = y_i,
              = 1  if x_i ≠ y_i.

The MVDM distance function
• Estimate a numeric "distance" between pairs of values:
  – "e" is more like "i" than like "p" in a phonetic task
  – "book" is more like "document" than like "the" in a parsing task
  δ(x_i, y_i) = Σ_{j=1..n} |P(C_j | x_i) − P(C_j | y_i)|
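To make the two distance functions and the k-NN classification step concrete, here is a small sketch (illustrative only, not TiMBL's implementation). The `cond_class_probs` table is an assumed pre-computed estimate of P(C | value) from training counts, and `knn_classify` simplifies TiMBL's k, which counts nearest distances rather than nearest instances.

```python
# Sketch of TiMBL-style Overlap and MVDM distances and a simple k-NN vote;
# illustrative only. cond_class_probs[(i, v)][c] is assumed to hold the
# training-data estimate of P(c | value v of feature i).

from collections import Counter

def overlap_distance(x, y):
    """Overlap: count the number of mismatching (symbolic) feature values."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

def mvdm_distance(x, y, cond_class_probs, classes):
    """MVDM: delta(v, w) = sum over classes C of |P(C | v) - P(C | w)|,
    summed over all feature positions."""
    dist = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        p_x = cond_class_probs.get((i, xi), {})
        p_y = cond_class_probs.get((i, yi), {})
        dist += sum(abs(p_x.get(c, 0.0) - p_y.get(c, 0.0)) for c in classes)
    return dist

def knn_classify(x, memory, k=1, distance=overlap_distance):
    """Memory-based classification: majority class among the k nearest stored
    (features, class) pairs. (TiMBL's k counts nearest distances, not nearest
    instances; this is a simplification.)"""
    neighbours = sorted(memory, key=lambda pair: distance(x, pair[0]))[:k]
    return Counter(cls for _, cls in neighbours).most_common(1)[0][0]
```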
Feature weighting
• Some features are more important than others.
• TiMBL metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance.
• Example, IG:
  – Compute the database entropy.
  – For each feature:
    • Partition the database on all values of that feature.
    • For all values, compute the sub-database entropy.
    • Take the size-weighted average entropy over all partitioned sub-databases.
  – The difference between the "partitioned" entropy and the overall entropy is the feature's Information Gain.

Feature weighting in the distance function
• Mismatching on a more important feature gives a larger distance.
• Factor in the distance function:
  Δ(X, Y) = Σ_{i=1..n} IG_i · δ(x_i, y_i)
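A hedged sketch of the Information Gain recipe above and of the IG-weighted Overlap distance (illustrative only, not TiMBL's internals; `instances` is assumed to be a list of (feature-tuple, class) pairs as in the earlier windowing sketch).

```python
# Sketch of Information Gain feature weighting and the IG-weighted Overlap
# distance; illustrative only.

import math
from collections import Counter, defaultdict

def entropy(class_labels):
    """Entropy of a list of class labels."""
    counts = Counter(class_labels)
    total = len(class_labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(instances, feature_index):
    """Database entropy minus the size-weighted average entropy of the
    sub-databases obtained by partitioning on this feature's values."""
    all_classes = [cls for _, cls in instances]
    partitions = defaultdict(list)
    for features, cls in instances:
        partitions[features[feature_index]].append(cls)
    weighted = sum(len(part) / len(instances) * entropy(part)
                   for part in partitions.values())
    return entropy(all_classes) - weighted

def weighted_overlap_distance(x, y, ig_weights):
    """Delta(X, Y) = sum_i IG_i * delta(x_i, y_i), with delta the Overlap mismatch."""
    return sum(w for w, xi, yi in zip(ig_weights, x, y) if xi != yi)
```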
Distance weighting
• Relation between larger k and smoothing.
• Subtle extension: making more distant neighbors count less in the class vote:
  – Linear inverse of distance (w.r.t. the maximum)
  – Inverse of distance
  – Exponential decay

Current practice
• Default TiMBL settings:
  – k=1, Overlap, GR, no distance weighting
  – Work well for some morpho-phonological tasks
• Rules of thumb:
  – Combine MVDM with bigger k
  – Combine distance weighting with bigger k
  – Very good bet: higher k, MVDM, GR, distance weighting
  – Especially for sentence- and text-level tasks

Base phrase chunking
• 211,727 training, 47,377 test examples
• 22 classes
• [He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only $ 1.8 billion]NP [in]PP [September]NP .

Named entity recognition
• 203,621 training, 46,435 test examples
• 8 classes
• [U.N.]organization official [Ekeus]person heads for [Baghdad]location

Medical concept chunking
• 428,502 training, 47,430 test examples
• 24 classes
• Bij [infantiel botulisme]disease kunnen in extreme gevallen [ademhalingsproblemen]symptom en [algehele lusteloosheid]symptom optreden.
  (In extreme cases, [infantile botulism] can cause [breathing problems] and [general listlessness].)

Protein-related concept chunking
• 458,593 training, 50,916 test examples
• 51 classes
• Most hybrids express both [KBF1]protein and [NF-kappa B]protein in their nuclei , but one hybrid expresses only [KBF1]protein .

Results: feedback in MBT
Task                   Baseline   With feedback   Error red.
Base-phrase chunking   91.9       93.0            14%
Named-entity recog.    77.2       78.1            4%
Medical chunking       64.7       67.0            7%
Protein chunking       55.8       62.3            15%

Results: stacking
Task                   Baseline   With stacking   Error red.
Base-phrase chunking   91.9       92.6            9%
Named-entity recog.    77.2       78.9            7%
Medical chunking       64.7       67.0            7%
Protein chunking       55.8       57.2            3%

Results: trigram classes
Task                   Baseline   With trigrams   Error red.
Base-phrase chunking   91.9       92.7            10%
Named-entity recog.    77.2       80.2            13%
Medical chunking       64.7       67.5            8%
Protein chunking       55.8       60.1            10%

Numbers of trigram classes
Task                   unigrams   trigrams
Base-phrase chunking   22         846
Named-entity recog.    8          138
Medical chunking       24         578
Protein chunking       51         1471

Error reductions
Task                   Feedback   Stacking   Trigrams   Stacking + trigrams
Base-phrase chunking   14%        9%         10%        15%
Named-entity recog.    4%         7%         13%        15%
Medical chunking       7%         7%         8%         11%
Protein chunking       15%        3%         10%        5%

Learning linguistic sequences

Talk overview
• How to map sequences to sequences, not output tokens?
• Case studies: syntactic and semantic chunking
• Discrete versus probabilistic classifiers
• Constraint satisfaction inference
• Discussion

Classification + inference (figure slides)

Comparative study
• Base discrete classifier: maximum-entropy model (Zhang Le's maxent)
  – Extended with feedback, stacking, trigrams, and combinations
• Compared against:
  – Conditional Markov Models (Ratnaparkhi, 1996)
  – Maximum-entropy Markov Models (McCallum, Freitag, and Pereira, 2000)
  – Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001)
• On Medical & Protein chunking

Maximum entropy
• Probabilistic model: conditional distribution p(C|x) (= probability matrix between classes and values) with maximal entropy H(p).
• Given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible.
• Maximize entropy in the matrix through an iterative process:
  – IIS, GIS (Improved/Generalized Iterative Scaling)
  – L-BFGS
• Discretized! (used here as a discrete classifier: only the single most likely class is output)

Results: discrete Maxent variants
Task               Baseline   Feedback   Stacking   Trigram
Medical chunking   61.5       63.9       62.0       63.1
Protein chunking   54.5       62.1       56.5       58.8

Conditional Markov Models
• Probabilistic analogue of Feedback.
• Processes from left to right.
• Produces conditional probabilities, conditioned in part on the previous classification; search is limited by a beam.
• With beam = 1, equal to Feedback.
• Can be trained with maximum entropy:
  – E.g. MXPOST, Ratnaparkhi (1996)

Feedback vs. CMM
Task               Baseline   Feedback   CMM
Medical chunking   61.5       63.9       63.9
Protein chunking   54.5       62.1       62.4

Maximum-entropy Markov Models
• Probabilistic state machines:
  – Given the previous label and the current input vector, produce conditional distributions for the current output token.
  – Separate conditional distributions for each output token (state).
• Again directional, so suffers from the label bias problem.
• Specialized Viterbi search.

Conditional Random Fields
• Aimed to repair the weakness of MEMMs.
• Instead of a separate model for each state,
• A single model for the likelihood of the sequence (e.g. class bigrams).
• Viterbi search.

Discrete classifiers vs. MEMM and CRF
Task               Best discrete MBL   Best discrete Maxent   CMM    MEMM   CRF
Medical chunking   67.5 (1)            63.9 (3)               63.9   63.7   63.4
Protein chunking   62.3 (2)            62.1 (4)               62.4   62.1   62.8
(1) MBL with trigrams  (2) MBL with feedback  (3) Maxent with feedback  (4) Maxent with feedback

Learning linguistic sequences

Talk overview
• How to map sequences to sequences, not output tokens?
• Discrete versus probabilistic classifiers
• Constraint satisfaction inference
• Discussion

Classification + inference (figure slides; one notes that many classes are no problem for MBL)

Constraint satisfaction inference (figure slides)
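The constraint satisfaction inference slides above are figures; as a rough intuition only, the sketch below treats every predicted trigram as a constraint on three adjacent output labels and beam-searches for a label sequence that satisfies as many constraints as possible. This is a much-simplified, unweighted stand-in for the actual CSI formulation, which solves a weighted constraint satisfaction problem over the classifier's trigram predictions; the function names and the beam-search strategy are my own.

```python
# Much-simplified, unweighted sketch of the constraint-satisfaction idea over
# overlapping trigram predictions; illustrative only, not the actual CSI method.

def csi_decode(trigram_predictions, beam_width=10):
    """trigram_predictions[i] = (label[i-1], label[i], label[i+1]) as predicted
    for focus position i, with '_' marking out-of-sequence positions."""
    n = len(trigram_predictions)

    # Candidate labels per position: every label any trigram proposes for it.
    candidates = [set() for _ in range(n)]
    for i, (prev_lab, focus_lab, next_lab) in enumerate(trigram_predictions):
        if i > 0:
            candidates[i - 1].add(prev_lab)
        candidates[i].add(focus_lab)
        if i + 1 < n:
            candidates[i + 1].add(next_lab)

    def satisfied(seq):
        # Count trigram constraints fully matched by the labels assigned so far;
        # the (partial) sequence is padded with '_' at both ends.
        padded = ["_"] + list(seq) + ["_"] * (n - len(seq) + 1)
        return sum(
            1
            for i in range(len(seq))
            if trigram_predictions[i] == (padded[i], padded[i + 1], padded[i + 2])
        )

    # Left-to-right beam search over per-position candidate labels.
    beam = [[]]
    for i in range(n):
        expanded = [seq + [lab] for seq in beam for lab in candidates[i]]
        expanded.sort(key=satisfied, reverse=True)
        beam = expanded[:beam_width]
    return max(beam, key=satisfied)
```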
Results: Shallow parsing and IE
Task                   Base classifier   Voting   CSI    Oracle
CoNLL Chunking         91.9              92.7     93.1   95.8
CoNLL NER              77.2              80.2     81.8   86.5
Genia (bio-NER)        55.8              60.1     61.8   69.8
ROLAQUAD (med-NER)     64.7              67.5     68.9   74.9

Results: Morpho-phonology
Task                                 Base classifier   CSI
Letter-phoneme English               79.0 ± 0.82       84.5 ± 0.82
Letter-phoneme Dutch                 92.8 ± 0.25       94.4 ± 0.25
Morphological segmentation English   80.0 ± 0.75       85.4 ± 0.71
Morphological segmentation Dutch     41.3 ± 0.48       51.9 ± 0.48

Discussion
• The classification + inference paradigm fits both probabilistic and discrete classifiers.
  – Necessary component: a search space in which to look for globally likely solutions:
    • Viterbi search in class distributions
    • Constraint satisfaction inference in overlapping trigram space
• Discrete vs. probabilistic?
  – CMM beam search hardly matters.
  – Best discrete Maxent ≈ MEMM! (but CRF is better)
  – Discrete classifiers: lightning-fast training vs. convergence training of MEMM / CRF.
  – Don't write off discrete classifiers.

Software
• TiMBL, Tilburg Memory-Based Learner (5.1)
• MBT, Memory-Based Tagger (2.0)
• Paramsearch (1.0)
• CMM, MEMM
  http://ilk.uvt.nl
• Maxent (Zhang Le, Edinburgh, 20041229)
  http://homepages.inf.ed.ac.uk/s0450736/maxent_toolkit.html
• Mallet (McCallum et al., UMass)
  http://mallet.cs.umass.edu

Paramsearch
• (Van den Bosch, 2004, Proc. of BNAIC)
• Machine learning meta-problem:
  – Algorithmic parameters change bias:
    • Description length and noise bias
    • Eagerness bias
  – Can make a huge difference (Daelemans & Hoste, ECML 2003).
  – Different parameter settings = functionally different system.
  – But good settings are not predictable.

Known solution
• Classifier wrapping (Kohavi, 1997):
  – Split the training set into train & validate sets.
  – Test different setting combinations.
  – Pick the best-performing one.
• Danger of overfitting.
• Costly.

Optimized wrapping
• Worst case: exhaustive testing of "all" combinations of parameter settings (pseudo-exhaustive).
• Optimizations:
  – Do not test all settings.
  – Test all settings in less time.
  – Test with less data.

Progressive sampling
• Provost, Jensen, & Oates (1999).
• Setting:
  – 1 algorithm (parameters already set)
  – Growing samples of the data set
• Find the point in the learning curve at which no additional learning is needed.

Wrapped progressive sampling
• Use increasing amounts of data,
• While validating decreasing numbers of setting combinations.
• E.g.:
  – Test "all" setting combinations on a small but sufficient subset.
  – Increase the amount of data stepwise.
  – At each step, discard lower-performing setting combinations.

Procedure (1)
• Given a training set of labeled examples:
  – Split internally into an 80% training and a 20% held-out set.
  – Create a clipped parabolic sequence of sample sizes:
    • n steps; multiplication factor = the n-th root of the 80% set size
    • Fixed start at 500 train / 100 test examples
    • E.g. {500, 698, 1343, 2584, 4973, 9572, 18423, 35459, 68247, 131353, 252812, 486582}
• The test sample is always 20% of the train sample.
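A small sketch of the sample-size sequence from Procedure (1) (illustrative; the slide does not state the number of steps n, but assuming n = 20 and an 80% training portion of 486,582 examples, the clipped sequence below reproduces the example list up to rounding).

```python
# Sketch of the clipped sequence of sample sizes: n steps, multiplication
# factor = the n-th root of the 80% training-portion size, sizes below the
# fixed start of 500 clipped up to 500, duplicates removed. The values of
# n = 20 and train_size = 486,582 are assumptions chosen to match the slide.

def sample_sizes(train_size, n_steps=20, fixed_start=500):
    factor = train_size ** (1.0 / n_steps)
    sizes = []
    for step in range(1, n_steps + 1):
        size = max(fixed_start, int(round(factor ** step)))
        if not sizes or size > sizes[-1]:   # clip and de-duplicate
            sizes.append(size)
    return sizes

print(sample_sizes(486582))
# -> [500, 698, 1343, ..., 252812, 486582] (matches the slide's example,
#    up to rounding); the test sample at each step would be 20% of each size.
```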
Procedure (2)
• Create a pseudo-exhaustive pool of all parameter-setting combinations.
• Loop:
  – Apply the current pool to the current train/test sample pair.
  – Separate the good from the bad part of the pool.
  – Current pool := good part of the pool.
  – Increase the step.
• Until one best setting combination is left, or all steps are performed (then pick randomly among the remaining).

Procedure (3)
• Separate the good from the bad (illustrated on a min-max performance scale over six figure slides; figures not reproduced).

"Mountaineering competition" (figure slides)

Customizations
Algorithm                      # parameters   Total # setting combinations
Ripper (Cohen, 1995)           6              648
C4.5 (Quinlan, 1993)           3              360
Maxent (Guiasu et al., 1985)   2              11
Winnow (Littlestone, 1988)     5              1200
IB1 (Aha et al., 1991)         5              925

Experiments: datasets
Task          # Examples   # Features   # Classes   Class entropy
audiology     228          69           24          3.41
bridges       110          7            8           2.50
soybean       685          35           19          3.84
tic-tac-toe   960          9            2           0.93
votes         437          16           2           0.96
car           1730         6            4           1.21
connect-4     67559        42           3           1.22
kr-vs-kp      3197         36           2           1.00
splice        3192         60           3           1.48
nursery       12961        8            5           1.72

Experiments: results
                normal wrapping                     WPS
Algorithm       Error reduct.   Reduct./combin.     Error reduct.   Reduct./combin.
Ripper          16.4            0.025               27.9            0.043
C4.5            7.4             0.021               7.7             0.021
Maxent          5.9             0.536               0.4             0.036
IB1             30.8            0.033               31.2            0.034
Winnow          17.4            0.015               32.2            0.027

Paramsearch roundup
• Large improvements with algorithms that have many parameters.
• "Guaranteed" gain of 0.02% per added combination.
• Still to do: interaction with feature selection.

Thank you