• ML: Classical methods from AI
  – Decision-Tree induction
  – Exemplar-based Learning
  – Rule Induction
  – TBEDL

EMNLP’02, 11/11/2002

Decision Trees
• Decision trees are a way to represent the rules underlying training data: hierarchical, sequential structures that recursively partition the data.
• They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration, with purposes such as description, classification, and generalization.
• From a machine-learning perspective: decision trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.
• Acquisition: Top-Down Induction of Decision Trees (TDIDT).
• Systems: CART (Breiman et al. 84); ID3, C4.5, C5.0 (Quinlan 86, 93, 98); ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95).

An Example
[Figure: an example decision tree, shown both abstractly (attribute nodes A1–A5, branch values v1–v7, leaf classes C1–C3) and concretely, with nodes testing SIZE, COLOR, and SHAPE: e.g. SIZE=small & SHAPE=circle → pos; SIZE=big → pos; COLOR=red & SHAPE=triangle → neg.]

Learning Decision Trees
• Training: Training Set + TDIDT = DT
• Classification: Test Example + DT = Class

General Induction Algorithm (TDIDT)

  function TDIDT (X: set-of-examples; A: set-of-features)
  var
    tree1, tree2: decision-tree;
    X’: set-of-examples;
    A’: set-of-features
  end-var
  if (stopping_criterion (X)) then
    tree1 := create_leaf_tree (X)
  else
    amax := feature_selection (X, A);
    tree1 := create_tree (X, amax);
    for-all val in values (amax) do
      X’ := select_examples (X, amax, val);
      A’ := A \ {amax};
      tree2 := TDIDT (X’, A’);
      tree1 := add_branch (tree1, tree2, val)
    end-for
  end-if
  return (tree1)
  end-function

Feature Selection Criteria
• Functions derived from Information Theory:
  – Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from distance measures:
  – Gini Diversity Index (Breiman et al. 84)
  – RLM (López de Mántaras 91)
• Statistically based:
  – Chi-square test (Sestito & Dillon 94)
  – Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: a variant of RELIEFF (Kononenko 94)

Information Gain (Quinlan 79)
  Gain(X, a) = H(X) − Σ_{v ∈ values(a)} (|X_v| / |X|) · H(X_v),
  where H(X) = −Σ_c p(c) log2 p(c) is the class entropy of the example set X
  and X_v is the subset of X with a = v.

Gain Ratio (Quinlan 86)
  GainRatio(X, a) = Gain(X, a) / SplitInfo(X, a),
  with SplitInfo(X, a) = −Σ_v (|X_v| / |X|) log2 (|X_v| / |X|),
  which normalizes the gain and corrects its bias towards attributes with many values.

RELIEF (Kira & Rendell, 1992)
  Estimates the relevance of each attribute a by repeatedly sampling an example x and updating
  W[a] := W[a] − diff(a, x, nearest-hit)/m + diff(a, x, nearest-miss)/m:
  good attributes separate x from its nearest neighbour of a different class (miss)
  but not from its nearest neighbour of the same class (hit).

RELIEFF (Kononenko, 1994)
  Extension of RELIEF to multi-class, noisy, and incomplete data, averaging over the k nearest hits and misses.

RELIEFF-IG (Màrquez, 1999)
• RELIEFF, except that the distance measure used for finding the nearest hits/misses does not treat all attributes equally: it weights the attributes according to the IG measure.

Extensions of DTs (Murthy 95)
• (Pre-/post-)pruning
• Minimizing the effect of the greedy approach: lookahead
• Non-linear splits
• Combination of multiple models
• etc.

Decision Trees and NLP
• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
• POS tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95, 97; Màrquez et al. 00)
• Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96)
• Parsing (Magerman 95, 96; Haruno et al. 98, 99)
• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
• Text summarization (Mani & Bloedorn 98)
• Dialogue act tagging (Samuel et al. 98)
• Noun phrase coreference (Aone & Benett 95; McCarthy & Lehnert 95)
• Discourse analysis in information extraction (Soderland & Lehnert 94)
• Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94)
• Verb classification in Machine Translation (Tanaka 96; Siegel 97)
• More recent applications of DTs to NLP combine them in a boosting framework (covered in the following sessions).

Example: POS Tagging
[Figure: the sentence “He was shot in the hand as he chased the robbers in the back street” (The Wall Street Journal Corpus), with candidate tags (NN, VB, JJ) shown for the ambiguous words.]

POS Tagging using Decision Trees (Màrquez, PhD 1999)
[Slide sequence: architecture of the tagger —
  Raw text → Morphological analysis → Disambiguation Algorithm → Tagged text,
where the disambiguation step is driven by the Language Model; in this work the language model is a set of decision trees, and the disambiguation algorithms are RTT, STT, and RELAX.]

DT-based Language Modelling
The “preposition–adverb” tree (for the IN/RB ambiguity of “as”):

  root: P(IN)=0.81, P(RB)=0.19
  └─ Word Form = “As”/“as”: P(IN)=0.83, P(RB)=0.17
     └─ tag(+1) = RB: P(IN)=0.13, P(RB)=0.87
        └─ tag(+2) = IN: P(IN)=0.013, P(RB)=0.987  (leaf)
  (remaining branches: Word Form = others, tag(+1) = others, …)

Statistical interpretation:
  P^(RB | word=“A/as” & tag(+1)=RB & tag(+2)=IN) = 0.987
  P^(IN | word=“A/as” & tag(+1)=RB & tag(+2)=IN) = 0.013

Collocations captured by this path: “as_RB much_RB as_IN”, “as_RB soon_RB as_IN”, “as_RB well_RB as_IN”.

Language Modelling using DTs
• Granularity? Ambiguity-class level: adjective-noun, adjective-noun-verb, etc.
• Algorithm: Top-Down Induction of Decision Trees (TDIDT); supervised learning.
  – CART (Breiman et al. 84), C4.5 (Quinlan 95), etc.
• Attributes: local context, the (−3, +2) window of tokens.
• Particular implementation:
  – Branch merging
  – CART post-pruning (minimizing the effect of over-fitting)
  – Smoothing (against data fragmentation & sparseness)
  – Attributes with many values
  – Several functions for attribute selection

Model Evaluation
The Wall Street Journal (WSJ) annotated corpus:
• 1,170,000 words
• Tagset size: 45 tags
• Noise: 2–3% mistagged words
• 49,000-entry word-form frequency lexicon
  – Manual filtering of the 200 most frequent entries
  – 36.4% ambiguous words
  – 2.44 (1.52) average tags per word
• 243 ambiguity classes

Number of ambiguity classes that cover x% of the training corpus:

  coverage:   50%  60%  70%  80%  90%  95%  99%  100%
  # classes:   8    11   14   19   37   58  113   243

Arity of the classification problems:

  arity:      2-tags  3-tags  4-tags  5-tags  6-tags
  # classes:   103     90      35      12      3

12 Ambiguity Classes
• They cover 57.90% of the ambiguous occurrences!
• Experimental setting: 10-fold cross-validation.

N-fold Cross-Validation

  Divide the training set S into a partition of N equal-size disjoint subsets: s1, s2, …, sN
  for i := 1 to N do
    learn and test a classifier using:
      training_set := ∪ sj, for all j ≠ i
      validation_set := si
  end-for
  return: the average accuracy over the N experiments

Which is a good value for N? (2, 10, …)
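The cross-validation loop above can be sketched in Python. The learner plugged in here is a hypothetical placeholder (a majority-class baseline), used only to make the sketch runnable; any train/evaluate pair fits the same interface:

```python
import random

def cross_validation(examples, train, evaluate, n=10, seed=0):
    """Generic N-fold cross-validation: split into N disjoint folds,
    train on N-1 of them, test on the remaining one, average accuracy."""
    examples = examples[:]
    random.Random(seed).shuffle(examples)
    folds = [examples[i::n] for i in range(n)]   # N disjoint, near-equal folds
    accuracies = []
    for i in range(n):
        validation_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training_set)
        accuracies.append(evaluate(model, validation_set))
    return sum(accuracies) / n                   # average over the N experiments

# Placeholder learner: always predict the most frequent class of the training set.
def train_majority(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)

def evaluate_majority(model, data):
    return sum(1 for _, y in data if y == model) / len(data)

data = [(i, "pos" if i % 3 else "neg") for i in range(60)]
print(cross_validation(data, train_majority, evaluate_majority, n=10))
```

The `i::n` slicing after the shuffle gives the equal-size disjoint subsets required by the pseudocode without materializing index sets.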
Extreme case (N = training-set size): leave-one-out.

Size: Number of Nodes
• Basic algorithm: 22,095 nodes
• + Merging: 10,674 nodes (average size reduction: 51.7%)
• + Pruning: 5,715 nodes (a further 46.5%; 74.1% in total)

Accuracy
[Figure: error rate (%) — lower bound 28.83; basic algorithm 8.49; merging 8.36; pruning 8.30. Merging and pruning cause (at least) no loss in accuracy.]

Feature Selection Criteria
[Figure: average error rate (%) per feature-selection criterion (Random, RLM, IG, GR, χ², Gini, Tau, RELIEFF-IG, …). Random selection is clearly worst (17.24%; next worst 11.63%), while all informed criteria lie in the 8.2–8.9% band and are statistically equivalent.]

DT-based POS Taggers
• Tree Base = Statistical Component:
  – RTT: Reductionistic Tree-based Tagger (Màrquez & Rodríguez 97)
  – STT: Statistical Tree-based Tagger (Màrquez & Rodríguez 99)
• Tree Base = Compatibility Constraints:
  – RELAX: Relaxation-Labelling-based tagger (Màrquez & Padró 97)

RTT (Màrquez & Rodríguez 97)
[Slide: architecture — Raw text → Morphological analysis → iterative disambiguation (Classify → Update → Filter, repeated until a stopping condition holds) → Tagged text, driven by the tree-based language model.]

STT (Màrquez & Rodríguez 99)
• An HMM-style tagger over N-grams (trigrams), decoded with the Viterbi algorithm.
• The contextual probabilities P(tk | Ck) are estimated (P̃(tk | Ck)) using the decision tree TAC_k of the corresponding ambiguity class.
[Slide: architecture — Raw text → Morphological analysis → Viterbi algorithm → Tagged text; language model = lexical probabilities + contextual probabilities.]

STT+ (Màrquez & Rodríguez 99)
[Slide: same architecture, with the language model extended to lexical probabilities + contextual probabilities + N-grams.]

RELAX (Màrquez & Padró 97)
[Slide: architecture — Raw text → Morphological analysis → Relaxation Labelling (Padró 96) → Tagged text; the language model is a set of constraints: N-grams + linguistic rules + constraints translated from the trees.]

Translating Trees into Constraints
Each root-to-leaf path of a tree is translated into a context constraint. For the leaf of the “preposition–adverb” tree seen above:

  Positive constraint:      Negative constraint:
    2.37 (RB)                 -5.81 (IN)
      (0 “as” “As”)             (0 “as” “As”)
      (1 RB)                    (1 RB)
      (2 IN)                    (2 IN)

Compatibility values are estimated using Mutual Information.

Experimental Evaluation
Using the WSJ annotated corpus:
• Training set: 1,121,776 words
• Test set: 51,990 words
• Closed-vocabulary assumption
• Base of 194 trees
  – Covering 99.5% of the ambiguous occurrences
  – Storage requirement: 565 KB
  – Acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation)

RTT results
• 67.52% error reduction with respect to MFT
• Accuracy = 94.45% (ambiguous words), 97.29% (overall)
• Comparable to the best state-of-the-art automatic POS taggers
• Recall = 98.22%, Precision = 95.73% (1.08 tags/word)
• RTT allows a trade-off to be set between precision and recall

STT results
• Comparable to those of RTT
• STT allows the incorporation of N-gram information, alleviating some problems of sparseness and of coherence of the resulting tag sequence

STT+ results
• Better than those of RTT and STT
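The statistical reading of a tree path — each root-to-leaf path encodes a conditional class distribution, which is what RELAX turns into a constraint — can be made concrete with a small sketch (not the thesis implementation). The tree is the “preposition–adverb” IN/RB tree from the slides, written as a nested structure; the probabilities are the ones shown there, and branches not shown on the slide back off to the parent distribution:

```python
# The IN/RB tree for "as" as a nested structure: each node tests one attribute
# and stores the class distribution estimated at that node.
AS_TREE = {
    "dist": {"IN": 0.81, "RB": 0.19},
    "test": "word",
    "branches": {
        ("as", "As"): {
            "dist": {"IN": 0.83, "RB": 0.17},
            "test": "tag+1",
            "branches": {
                ("RB",): {
                    "dist": {"IN": 0.13, "RB": 0.87},
                    "test": "tag+2",
                    "branches": {
                        ("IN",): {"dist": {"IN": 0.013, "RB": 0.987}},  # leaf
                    },
                },
            },
        },
    },
}

def classify(tree, context):
    """Follow the tree as far as the context allows and return the class
    distribution at the deepest node reached (soft classification)."""
    node = tree
    while "test" in node:
        value = context.get(node["test"])
        child = next((c for vals, c in node["branches"].items() if value in vals), None)
        if child is None:
            break                # unseen value: back off to the current node
        node = child
    return node["dist"]

# P^(RB | word="as" & tag(+1)=RB & tag(+2)=IN) = 0.987, as on the slide:
print(classify(AS_TREE, {"word": "as", "tag+1": "RB", "tag+2": "IN"}))
```

A positive/negative RELAX constraint then corresponds to one such root-to-leaf path (word form at position 0, tags at +1 and +2), with its compatibility value derived from the leaf distribution via mutual information.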
Experimental Evaluation
Including trees into RELAX:
• Translation of 44 representative trees (covering 84% of the examples) = 8,473 constraints
• Addition of:
  – bigrams (2,808 binary constraints)
  – trigrams (52,161 ternary constraints)
  – linguistically motivated manual constraints (20)

Accuracy of RELAX

           MFT    B      T      BT     C      BC     TC     BTC
  Ambig.   85.31  91.35  91.35  91.82  91.92  92.72  92.82  92.55
  Overall  94.66  96.86  97.03  97.08  97.06  97.36  97.39  97.29

  MFT = baseline, B = bigrams, T = trigrams, C = “tree constraints”

           H      BH     TH     BTH    CH     BCH    TCH    BTCH
  Ambig.   86.41  91.88  92.04  92.32  91.97  92.76  92.98  92.71
  Overall  95.06  97.05  97.11  97.21  97.08  97.37  97.45  97.35

  H = set of 20 hand-written linguistic rules

Decision Trees: Summary
• Advantages:
  – Acquire symbolic knowledge in an understandable way
  – Very well studied ML algorithms and variants
  – Can easily be translated into rules
  – Available software: C4.5, C5.0, etc.
  – Can easily be integrated into an ensemble
• Drawbacks:
  – Computationally expensive when scaling to large natural-language domains (many training examples, features, etc.)
  – Data sparseness and data fragmentation: the problem of small disjuncts ⇒ probability estimation
  – DTs are a model with high variance (unstable)
  – Tendency to overfit the training data: pruning is necessary
  – Require quite a big effort in tuning the model
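To close, the generic TDIDT procedure from the beginning of the session can be sketched in Python. This is a minimal illustrative version (plain ID3: information gain on categorical attributes, majority-class leaves), not any of the systems discussed above, and the toy data only echoes the SIZE/COLOR/SHAPE example slide:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy H(X) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, a):
    """Information gain of splitting examples [(features, label)] on attribute a."""
    labels = [y for _, y in examples]
    counts = Counter(x[a] for x, _ in examples)
    rest = sum((cnt / len(examples)) *
               entropy([y for x, y in examples if x[a] == v])
               for v, cnt in counts.items())
    return entropy(labels) - rest

def tdidt(examples, attributes):
    """Top-Down Induction of Decision Trees: nested-dict tree, or a leaf label."""
    labels = [y for _, y in examples]
    # stopping criterion: pure node, or no attributes left -> majority-class leaf
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    amax = max(attributes, key=lambda a: info_gain(examples, a))  # feature selection
    tree = {"test": amax, "branches": {}}
    for val in {x[amax] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[amax] == val]
        tree["branches"][val] = tdidt(subset, attributes - {amax})
    return tree

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["test"]]]
    return tree

# Toy data in the spirit of the example slide (SIZE / COLOR / SHAPE -> pos/neg):
data = [
    ({"size": "small", "color": "blue", "shape": "circle"},   "pos"),
    ({"size": "small", "color": "red",  "shape": "circle"},   "pos"),
    ({"size": "big",   "color": "red",  "shape": "circle"},   "pos"),
    ({"size": "big",   "color": "blue", "shape": "triangle"}, "pos"),
    ({"size": "small", "color": "red",  "shape": "triangle"}, "neg"),
    ({"size": "small", "color": "blue", "shape": "triangle"}, "neg"),
]
tree = tdidt(data, {"size", "color", "shape"})
print(predict(tree, {"size": "small", "color": "red", "shape": "triangle"}))  # prints: neg
```

On this data the root split chosen is SHAPE (highest gain): circles are purely positive, and triangles are then resolved by SIZE — the same kind of recursive partitioning the pseudocode describes.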