DT2.2.1

• ML: Classical methods from AI
–Decision-Tree induction
–Exemplar-based Learning
–Rule Induction
–TBEDL (Transformation-Based Error-Driven Learning)
EMNLP’02
11/11/2002
DecisionTrees
Decision Trees
• Decision trees are a way to represent rules underlying training data, with hierarchical sequential structures that recursively partition the data.
• They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization.
• From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes.
• Acquisition: Top-Down Induction of Decision Trees (TDIDT)
• Systems: CART (Breiman et al. 84); ID3, C4.5, C5.0 (Quinlan 86, 93, 98); ASSISTANT, ASSISTANT-R (Cestnik et al. 87; Kononenko et al. 95)
An Example

[Figure: a decision tree. A generic tree with internal nodes A1…A5, branch values v1…v7, and leaf classes C1–C3 is shown next to a concrete tree over the attributes SIZE, COLOR, and SHAPE, e.g. SIZE=small & SHAPE=circle → pos, SIZE=big → pos, SHAPE=triang & COLOR=red → neg.]
Learning Decision Trees
Training:  Training Set + TDIDT = DT
Test:      Example + DT = Class
General Induction Algorithm
DTs
function TDIDT (X: set-of-examples; A: set-of-features)
var
  tree1, tree2: decision-tree;
  X’: set-of-examples;
  A’: set-of-features
end-var
if (stopping_criterion (X)) then
  tree1 := create_leaf_tree (X)
else
  amax := feature_selection (X, A);
  tree1 := create_tree (X, amax);
  for-all val in values (amax) do
    X’ := select_examples (X, amax, val);
    A’ := A \ {amax};
    tree2 := TDIDT (X’, A’);
    tree1 := add_branch (tree1, tree2, val)
  end-for
end-if
return (tree1)
end-function
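The pseudocode above maps directly onto a short implementation. A minimal runnable sketch, assuming categorical features stored in dicts and using information gain as the feature_selection function (all names are illustrative, not the code of CART/ID3/C4.5):

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, a):
    # H(X) minus the weighted entropy of the subsets induced by feature a
    n = len(labels)
    split = {}
    for x, y in zip(examples, labels):
        split.setdefault(x[a], []).append(y)
    rest = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - rest

def tdidt(examples, labels, features):
    # stopping criterion: pure node, or no features left to split on
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # leaf = majority class
    amax = max(features, key=lambda a: info_gain(examples, labels, a))
    tree = {"feature": amax, "branches": {}}
    for val in {x[amax] for x in examples}:          # one branch per value
        sub = [(x, y) for x, y in zip(examples, labels) if x[amax] == val]
        X2, y2 = [x for x, _ in sub], [y for _, y in sub]
        tree["branches"][val] = tdidt(X2, y2, features - {amax})
    return tree

def classify(tree, x):
    # descend until a leaf (a plain class label) is reached
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["feature"]]]
    return tree
```

Unseen branch values at classification time raise a KeyError here; real systems handle them with default classes or smoothing.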
Feature Selection Criteria
• Functions derived from Information Theory:
  – Information Gain, Gain Ratio (Quinlan 86)
• Functions derived from Distance Measures:
  – Gini Diversity Index (Breiman et al. 84)
  – RLM (López de Mántaras 91)
• Statistically-based:
  – Chi-square test (Sestito & Dillon 94)
  – Symmetrical Tau (Zhou & Dillon 91)
• RELIEFF-IG: variant of RELIEFF (Kononenko 94)
Information Gain (Quinlan 79)
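The equation on this slide did not survive extraction; the standard definitions consistent with Quinlan's formulation are:

```latex
H(X) = -\sum_{c \in C} \frac{|X_c|}{|X|} \log_2 \frac{|X_c|}{|X|}
\qquad
\mathrm{Gain}(X, A) = H(X) \;-\; \sum_{v \in \mathrm{values}(A)} \frac{|X_v|}{|X|}\, H(X_v)
```

where $X_c$ are the examples of class $c$ and $X_v$ the examples with $A = v$.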
Information Gain (2) (Quinlan 79)
Gain Ratio (Quinlan86)
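The formula itself was lost in extraction; the standard definition normalises the gain by the split information, penalising attributes with many values:

```latex
\mathrm{SplitInfo}(X, A) = -\sum_{v \in \mathrm{values}(A)} \frac{|X_v|}{|X|} \log_2 \frac{|X_v|}{|X|}
\qquad
\mathrm{GainRatio}(X, A) = \frac{\mathrm{Gain}(X, A)}{\mathrm{SplitInfo}(X, A)}
```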
RELIEF (Kira & Rendell, 1992)
RELIEFF (Kononenko, 1994)
RELIEFF-IG
(Màrquez, 1999)
• A variant of RELIEFF in which the distance measure used for finding the nearest hits/misses does not treat all attributes equally: it weights the attributes according to the IG measure.
Extensions of DTs
(Murthy 95)
• (pre/post) Pruning
• Minimize the effect of the greedy approach:
lookahead
• Non-linear splits
• Combination of multiple models
• etc.
Decision Trees and NLP
• Speech processing (Bahl et al. 89; Bakiri & Dietterich 99)
• POS Tagging (Cardie 93; Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00)
• Word sense disambiguation (Brown et al. 91; Cardie 93;
Mooney 96)
• Parsing (Magerman 95,96; Haruno et al. 98,99)
• Text categorization (Lewis & Ringuette 94; Weiss et al. 99)
• Text summarization (Mani & Bloedorn 98)
• Dialogue act tagging (Samuel et al. 98)
Decision Trees and NLP
• Noun phrase coreference
(Aone & Bennett 95; McCarthy & Lehnert 95)
• Discourse analysis in information extraction
(Soderland & Lehnert 94)
• Cue phrase identification in text and speech
(Litman 94; Siegel & McKeown 94)
• Verb classification in Machine Translation
(Tanaka 96; Siegel 97)
• More recent applications of DTs to NLP combine them in a boosting framework (covered in the following sessions)
Example: POS Tagging using DT
[Example: the sentence "He was shot in the hand as he chased the robbers in the back street", in which several words (shot, hand, chased, back) are ambiguous between tags such as NN, VB, and JJ.]
(The Wall Street Journal Corpus)
POS Tagging using Decision Trees
(Màrquez, PhD 1999)
Raw text → Morphological analysis → Disambiguation Algorithm → Tagged text
(POS tagging; the Disambiguation Algorithm consults a Language Model)
POS Tagging using Decision Trees
(Màrquez, PhD 1999)
Same pipeline, with the Language Model instantiated as a base of Decision Trees.
POS Tagging using Decision Trees
(Màrquez, PhD 1999)
Same pipeline, with the Disambiguation Algorithm instantiated as RTT, STT, or RELAX.
DT-based Language Modelling
“preposition-adverb” tree
[Figure: the tree for the preposition-adverb (IN/RB) ambiguity class.
  root (P(IN)=0.81, P(RB)=0.19): split on Word Form
    = “As”/“as” (P(IN)=0.83, P(RB)=0.17): split on tag(+1)
      = RB (P(IN)=0.13, P(RB)=0.87): split on tag(+2)
        = IN: leaf (P(IN)=0.013, P(RB)=0.987)
  Other branches are elided.]

Statistical interpretation:
P̂( RB | word=“A/as” & tag(+1)=RB & tag(+2)=IN ) = 0.987
P̂( IN | word=“A/as” & tag(+1)=RB & tag(+2)=IN ) = 0.013
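Read operationally, the tree is a function from a disambiguation context to a tag probability distribution. A small sketch with the preposition-adverb tree hand-coded from the slide (the dict representation and attribute names are illustrative; elided branches fall back to the parent node's own distribution):

```python
# Each internal node tests one contextual attribute; leaves hold P(tag).
TREE = {
    "test": "word",
    "branches": {("As", "as"): {
        "test": "tag+1",
        "branches": {("RB",): {
            "test": "tag+2",
            "branches": {("IN",): {"IN": 0.013, "RB": 0.987}},  # leaf
            "default": {"IN": 0.13, "RB": 0.87},
        }},
        "default": {"IN": 0.83, "RB": 0.17},
    }},
    "default": {"IN": 0.81, "RB": 0.19},  # root distribution ("others")
}

def tag_distribution(node, context):
    # descend the tree until a leaf distribution is reached
    while "test" in node:
        val = context[node["test"]]
        for values, child in node["branches"].items():
            if val in values:
                node = child
                break
        else:
            node = node["default"]  # branch not represented on the slide
    return node
```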
DT-based Language Modelling
“preposition-adverb” tree
[Figure: the same tree. The path Word Form=“As”/“as” → tag(+1)=RB → tag(+2)=IN (leaf: P(IN)=0.013, P(RB)=0.987) captures collocations such as:
  “as_RB much_RB as_IN”
  “as_RB soon_RB as_IN”
  “as_RB well_RB as_IN”]
Language Modelling using DTs
• Granularity? Ambiguity-class level
  – adjective-noun, adjective-noun-verb, etc.
• Algorithm: Top-Down Induction of Decision Trees (TDIDT). Supervised learning
  – CART (Breiman et al. 84), C4.5 (Quinlan 95), etc.
• Attributes: local context, a (-3,+2) window of tokens
• Particular implementation:
  – Branch-merging (attributes with many values)
  – CART post-pruning (minimizing the effect of over-fitting)
  – Smoothing (data fragmentation & sparseness)
  – Several functions for attribute selection
Model Evaluation
The Wall Street Journal (WSJ) annotated corpus
• 1,170,000 words
• Tagset size: 45 tags
• Noise: 2-3% of mistagged words
• 49,000 word-form frequency lexicon
– Manual filtering of 200 most frequent entries
– 36.4% ambiguous words
– 2.44 (1.52) average tags per word
• 243 ambiguity classes
Model Evaluation
The Wall Street Journal (WSJ) annotated corpus
Number of ambiguity classes that cover x% of the training corpus:

  coverage              50%  60%  70%  80%  90%  95%  99%  100%
  # ambiguity classes     8   11   14   19   37   58  113   243

Arity of the classification problems:

  arity                 2-tags  3-tags  4-tags  5-tags  6-tags
  # ambiguity classes      103      90      35      12       3
12 Ambiguity Classes
They cover 57.90% of the ambiguous occurrences!
Experimental setting: 10-fold cross validation
N-fold Cross Validation
Divide the training set S into a partition of N equal-size disjoint subsets: s1, s2, …, sN

for i := 1 to N do
  learn and test a classifier using:
    training_set := ∪ sj, for all j ≠ i
    validation_set := si
end-for
return: the average accuracy over the N experiments

What is a good value for N? (2, 10, …)
Extreme case (N = training set size): leave-one-out
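The procedure above as a runnable sketch; `learn` and `evaluate` are caller-supplied stand-ins for the induction algorithm and the accuracy measure:

```python
def cross_validate(examples, n, learn, evaluate):
    """N-fold cross validation: partition the data into n disjoint folds,
    train on n-1 folds, test on the held-out one, and average the scores."""
    folds = [examples[i::n] for i in range(n)]  # n roughly equal-size folds
    scores = []
    for i in range(n):
        validation_set = folds[i]
        training_set = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(evaluate(learn(training_set), validation_set))
    return sum(scores) / n
```

Leave-one-out is the special case n = len(examples).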
Size: Number of Nodes

  Basic algorithm:  22,095 nodes
  After merging:    10,674 nodes   (average size reduction: 51.7%)
  After pruning:     5,715 nodes   (further 46.5%; 74.1% in total)
Accuracy

  % error rate:
    Lower bound:      28.83
    Basic algorithm:   8.49
    After merging:     8.36
    After pruning:     8.30

(at least) No loss in accuracy
Feature Selection Criteria

[Bar chart: average error rate (%) obtained with the different feature selection criteria (RLM, IG, GR, Gini, χ², Symmetrical Tau, RELIEFF-IG, and a random baseline). The informed criteria are statistically equivalent, with average error rates in the 8.24-8.9 range; the two weakest bars stand at 11.63 and 17.24.]
DT-based POS Taggers
• Tree Base = Statistical Component
  – RTT: Reductionistic Tree-based tagger (Màrquez & Rodríguez 97)
  – STT: Statistical Tree-based tagger (Màrquez & Rodríguez 99)
• Tree Base = Compatibility Constraints
  – RELAX: Relaxation-Labelling based tagger (Màrquez & Padró 97)
RTT
(Màrquez & Rodríguez 97)
Raw text → Morphological analysis → Disambiguation → Tagged text

Disambiguation (consulting the Language Model) iterates Classify → Filter → Update → stop?, looping until the stopping condition holds and then emitting the tagged text.
STT
(Màrquez & Rodríguez 99)
N-grams (trigrams)
STT
(Màrquez & Rodríguez 99)
Contextual probabilities P(tk | Ck) are approximated by estimates P̃(tk | Ck) ≈ T_ACk(tk; Ck), i.e. the Decision Tree acquired for the ambiguity class ACk, evaluated on tag tk and its context Ck.
STT
(Màrquez & Rodríguez 99)
Language Model: Lexical probs. + Contextual probs.

Raw text → Morphological analysis → Viterbi algorithm → Tagged text
(Disambiguation)
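STT's disambiguation step is a Viterbi search over the sentence, scoring tag sequences by lexical times contextual probabilities. A generic first-order sketch (not the thesis implementation; `lex_prob` and `ctx_prob` are stand-ins for the two model components):

```python
import math

def viterbi(words, tags, lex_prob, ctx_prob):
    """Most probable tag sequence under a first-order model.
    lex_prob(word, tag) -> P(word | tag)   (lexical probabilities)
    ctx_prob(prev, tag) -> P(tag | prev)   (contextual probabilities; in STT
    estimated with decision trees, in STT+ combined with N-grams)."""
    # delta[t] = (log-prob of the best path ending in tag t, that path)
    delta = {t: (math.log(ctx_prob(None, t) * lex_prob(words[0], t)), [t])
             for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # best predecessor for tag t
            prev = max(delta,
                       key=lambda p: delta[p][0] + math.log(ctx_prob(p, t)))
            score, path = delta[prev]
            new[t] = (score + math.log(ctx_prob(prev, t) * lex_prob(w, t)),
                      path + [t])
        delta = new
    return max(delta.values(), key=lambda sp: sp[0])[1]
```

All probabilities must be non-zero here; smoothing (as on the implementation slide) is what guarantees that in practice.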
STT+
(Màrquez & Rodríguez 99)
Language Model: Lexical probs. + Contextual probs. + N-grams

Raw text → Morphological analysis → Viterbi algorithm → Tagged text
(Disambiguation)
RELAX (Màrquez & Padró 97)
Language Model: N-grams + Linguistic rules → Set of constraints

Raw text → Morphological analysis → Relaxation Labelling (Padró 96) → Tagged text
(Disambiguation)
RELAX (Màrquez & Padró 97)
Translating Trees into Constraints

[Figure: the IN/RB tree again. The path Word Form=“As”/“as” → tag(+1)=RB → tag(+2)=IN (leaf: P(IN)=0.013, P(RB)=0.987) is translated into one positive and one negative constraint:

  Positive constraint        Negative constraint
   2.37 (RB)                  -5.81 (IN)
        (0 “as” “As”)               (0 “as” “As”)
        (1 RB)                      (1 RB)
        (2 IN)                      (2 IN)]

Compatibility values: estimated using Mutual Information
Experimental Evaluation
Using the WSJ annotated corpus
• Training set: 1,121,776 words
• Test set: 51,990 words
• Closed vocabulary assumption
• Base of 194 trees
  – Covering 99.5% of the ambiguous occurrences
  – Storage requirement: 565 Kb
  – Acquisition time: 12 CPU-hours (Common LISP / Sparc10 workstation)
Experimental Evaluation
RTT results
• 67.52% error reduction with respect to MFT
• Accuracy = 94.45% (ambiguous), 97.29% (overall)
• Comparable to the best state-of-the-art automatic POS taggers
• Recall = 98.22%, Precision = 95.73% (1.08 tags/word)
+ RTT allows trading off precision against recall
Experimental Evaluation
STT results
• Comparable to those of RTT
+ STT allows the incorporation of N-gram information, which alleviates some problems of sparseness and coherence of the resulting tag sequence

STT+ results
• Better than those of RTT and STT
Experimental Evaluation
Including trees into RELAX
• Translation of 44 representative trees covering
84% of the examples = 8,473 constraints
• Addition of:
– bigrams (2,808 binary constraints)
– trigrams (52,161 ternary constraints)
– linguistically-motivated manual constraints (20)
Accuracy of RELAX

            MFT     B      T     BT      C     BC     TC    BTC
  Ambig.   85.31  91.35  91.35  91.92  91.96  92.72  92.82  92.55
  Overall  94.66  96.86  97.06  97.03  97.08  97.36  97.39  97.29

MFT = baseline, B = bigrams, T = trigrams, C = “tree constraints”
             H     BH     TH    BTH     CH    BCH    TCH   BTCH
  Ambig.   86.41  91.88  92.04  92.32  91.97  92.76  92.98  92.71
  Overall  95.06  97.05  97.11  97.21  97.08  97.37  97.45  97.35

H = set of 20 hand-written linguistic rules
Decision Trees: Summary
• Advantages
  – Acquire symbolic knowledge in an understandable way
  – Very well studied ML algorithms and variants
  – Can be easily translated into rules
  – Available software exists: C4.5, C5.0, etc.
  – Can be easily integrated into an ensemble
Decision Trees: Summary
• Drawbacks
  – Computationally expensive when scaling to large natural language domains: many training examples, features, etc.
  – Data sparseness and data fragmentation: the problem of small disjuncts => probability estimation
  – DTs are a high-variance (unstable) model
  – Tendency to overfit the training data: pruning is necessary
  – Require quite a big effort in tuning the model