Categorization

Web Mining
Giuseppe Attardi
[includes slides borrowed from C. Manning]
Types of Mining

Web Usage Mining
– Access logs, query logs, etc.
– For improving performance (caching)
– For understanding user needs or
preferences

Web Content Mining
– For extracting information or knowledge
Content Mining
Web Search
 Link Analysis
 Categorization
 Clustering
 Knowledge Extraction

– Ontology
– Semantic Web

Question Answering
Text Categorization
Categorization


Problem: Given a universe of objects and a predefined set of classes, or categories, assign
each object to its correct class.
Examples:
Problem                       Objects             Categories
Tagging                       words in context    POS tag
Word Sense Disambiguation     words in context    word sense
PP attachment                 sentences           parse trees
Language identification       text                language
Text categorization           text                topic
Overview

Definition of Text Categorization

Techniques
– Naïve Bayes
– Decision Trees
– Maximum Entropy Modeling
– k-Nearest Neighbor Classification
– SVM
Text categorization

Classification (= Categorization)
– Task of assigning objects to classes or
categories

Text categorization
– Task of classifying the topic or theme of
a document
Statistical classification
 Training
set of objects
 Data representation model
 Model class
 Training procedure
 Evaluation
Training set of objects


A set of objects, each labeled by one or more classes
Example from Reuters
<REUTERS TOPICS="YES" NEWID="2005">
<DATE> 5-MAR-1987 09:22:57.75</DATE>
<TOPICS><D>earn</D></TOPICS>
<PLACES><D>usa</D></PLACES>
<TEXT>
<TITLE>NORD RESOURCES CORP <NRD> 4TH QTR NET</TITLE>
<DATELINE> DAYTON, Ohio, March 5 - </DATELINE>
<BODY>Shr 19 cts vs 13 cts
Net 2,656,000 vs 1,712,000
Revs 15.4 mln vs 9,443,000
Avg shrs 14.1 mln vs 12.6 mln
Shr 98 cts vs 77 cts
Net 13.8 mln vs 8,928,000
Revs 58.8 mln vs 48.5 mln
Avg shrs 14.0 mln vs 11.6 mln
NOTE: Shr figures adjusted for 3-for-2 split paid Feb 6, 1987.
Reuter </BODY></TEXT>
</REUTERS>
Data Representation Model
The training set is encoded via a data
representation model
 Typically, each object in the training
set is represented by a pair (x, c),
where:

– x: a vector of measurements
– c: class label
Data Representation

For text categorization:
– use words that are frequent in “earnings”
documents
– the 20 most representative words are:
vs, min, cts, loss, &, 000, profit, dlrs, pct, etc.

Each document j is represented as a vector

  x_j = (s_1j, ..., s_kj)

where

  s_ij = round( 10 · (1 + log tf_ij) / (1 + log l_j) )

and tf_ij is the number of occurrences of word i in
document j and l_j is the length of document j.
word     s_ij
vs       5
mln      5
cts      3
;        3
&        3
000      4
loss     0
'        0
"        0
3        4
profit   0
dlrs     3
1        2
pct      0
is       0
s        0
that     0
net      3
lt       2
at       0
Model Class and Training Procedure

Model class
– A parameterized family of classifiers
– e.g. a model class for binary
classification: g(x) = w∙x + w0
• if g(x) > 0, choose class c1, else c2

Training procedure
– Algorithm to select one classifier from
this family
– i.e., to select proper parameters values
(e.g. w, w0)
Evaluation



Borrowing from IR, NLP systems are evaluated
by precision, recall, etc.
Example: for text categorization, given a set of
documents of which a subset is in a particular
category (say, “earnings”), the system
classifies some other subset of the documents
as belonging to the “earnings” category.
The results of the system are compared with
the actual results as follows:
                Decision correct        Decision incorrect
Assigned        tp (true positive)      fp (false positive)
Not assigned    tn (true negative)      fn (false negative)
Evaluation measures
Precision = tp / (tp + fp)

Recall = tp / (tp + fn)

Accuracy = (tp + tn) / (tp + tn + fp + fn)

Error = (fp + fn) / (tp + tn + fp + fn)


Evaluation of text categorization

macro-averaging
– Compute an evaluation measure for each contingency
table separately and average over categories
– gives equal weight to each category
(with a_i = true positives and b_i = false positives for category i)

– macro-averaged precision =
  ( a1/(a1+b1) + a2/(a2+b2) + a3/(a3+b3) ) / 3

micro-averaging
– Make a single contingency table for all categories by
summing the scores in each cell, then compute the
evaluation measure for the whole table
– gives equal weight to each object
– micro-averaged precision =
  (a1 + a2 + a3) / ( (a1 + a2 + a3) + (b1 + b2 + b3) )
Classification Techniques
Naïve Bayes
 Decision Trees
 Maximum Entropy Modeling
 Support Vector Machines
 k-Nearest Neighbor

Bayesian Classifiers
Bayesian Methods





Learning and classification methods
based on probability theory
Bayes theorem plays a critical role in
probabilistic learning and classification
Build a generative model that
approximates how data is produced
Uses prior probability of each category
given no information about an item
Categorization produces a posterior
probability distribution over the possible
categories given a description of an item
Notation
P(A)       probability of an event A
P(A, B)    probability of both events A and B
P(A | B)   conditional probability of event A given that an event B has occurred
Bayes’ Rule
P(C, X) = P(C | X) P(X) = P(X | C) P(C)

P(C | X) = P(X | C) P(C) / P(X)

where P(C) is the prior probability and P(C | X) the posterior probability.
Maximum a posteriori Hypothesis
hMAP  argmax P ( h | D)
hH
hMAP
P ( D | h ) P ( h)
 argmax
P ( D)
h H
hMAP  argmax P ( D | h) P ( h)
hH
Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely,
need only to consider the P(D | h) term:
hML  argmax P ( D | h)
hH
Naïve Bayes Classifiers
Task: Classify a new instance based on a
tuple of attribute values
x1 , x2 ,, xn
c MAP  argmax P (c j | x1 , x2 ,, xn )
c j C
c MAP  argmax
c j C
P ( x1 , x2 ,, xn | c j ) P (c j )
P (c1 , c2 ,, cn )
c MAP  argmax P ( x1 , x2 ,, xn | c j ) P (c j )
c j C
Naïve Bayes Classifier: Assumptions

P(cj)
– Can be estimated from the frequency of
classes in the training examples.

P(x1, x2, …, xn | cj)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large
number of training examples was available.
Conditional Independence Assumption:
 Assume that the probability of observing the
conjunction of attributes is equal to the product
of the individual probabilities.
The Naïve Bayes Classifier
Flu → X1 (runny nose), X2 (sinus), X3 (cough), X4 (fever), X5 (muscle-ache)

Conditional Independence Assumption:
features are independent of each other
given the class:

  P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)
Learning the Model
C → X1, X2, X3, X4, X5, X6

Common practice: maximum likelihood
– simply use the frequencies in the data

  P̂(cj) = N(C = cj) / N

  P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
Problem with Max Likelihood
Flu → X1 (runny nose), X2 (sinus), X3 (cough), X4 (fever), X5 (muscle-ache)

  P(X1, …, X5 | C) = P(X1 | C) · P(X2 | C) · … · P(X5 | C)

What if we have seen no training cases where the patient had
no flu and muscle aches?

  P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0

Zero probabilities cannot be conditioned away, no matter
the other evidence!

  c = argmax_c P̂(c) ∏_i P̂(xi | c)
Smoothing to Avoid Overfitting
  P̂(xi | cj) = ( N(Xi = xi, C = cj) + 1 ) / ( N(C = cj) + k )

where k is the number of values of Xi.

Somewhat more subtle version:

  P̂(x_i,k | cj) = ( N(Xi = x_i,k, C = cj) + m·p_i,k ) / ( N(C = cj) + m )

where p_i,k is the overall fraction in the data where Xi = x_i,k,
and m controls the extent of “smoothing”.
Text Classification Using Naïve Bayes:
Basic method

Attributes are text positions, values are
words.
c NB  argmax P (c j ) P ( xi | c j )
c jC
i
 argmax P (c j ) P ( x1 " our" | c j ) P ( xn " text"| c j )
c jC

Naive Bayes assumption is clearly violated.
– Example?


Still too many possibilities
Assume that classification is independent of the positions of
the words
– Use same parameters for each position
Text Classification Algorithms: Learning

From training corpus, extract Vocabulary
Calculate required P(cj) and P(xk | cj) terms
– For each cj in C do
• docsj ← subset of documents for which the target class is cj
• P(cj) = |docsj| / |total # documents|
• Textj ← single document containing all docsj
• for each word xk in Vocabulary
– nk ← number of occurrences of xk in Textj
– P(xk | cj) = (nk + 1) / (n + |Vocabulary|),
  where n is the total number of word positions in Textj
Text Classification Algorithms: Classifying

positions  all word positions in current
document which contain tokens found in
Vocabulary

Return cNB, where
c NB  argmax P (c j )
c jC
 P( x
i positions
i
| cj)
Naive Bayes Time Complexity

Training Time: O(|D| Ld + |C||V|)
where Ld is the average length of a document in D
– Assumes V and all Di, ni, and nij pre-computed in O(|D| Ld)
time during one pass through all of the data.
– Generally just O(|D| Ld) since usually |C||V| < |D| Ld


Test Time: O(|C| Lt)
where
Lt is the average length of a test document
Very efficient overall, linearly proportional to the
time needed to just read in all the data
Underflow Prevention

Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result
in floating-point underflow
 Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing
logs of probabilities rather than
multiplying probabilities
 Class with highest final un-normalized log
probability score is still the most probable
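A companion classification sketch that applies the log-sum trick described above to the model produced by the training sketch earlier; skipping out-of-vocabulary words is one possible design choice, assumed here.

import math

def classify_naive_bayes(tokens, vocab, prior, cond):
    # argmax_c  log P(c) + sum_i log P(x_i|c); summing logs avoids underflow
    def score(c):
        s = math.log(prior[c])
        for w in tokens:
            if w in vocab:            # positions with unknown words are ignored
                s += math.log(cond[c][w])
        return s
    return max(prior, key=score)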
Naïve Bayes Posterior Probabilities

Classification results of naïve Bayes (the
class with maximum posterior probability)
are usually fairly accurate
 However, due to the inadequacy of the
conditional independence assumption, the
actual posterior-probability numerical
estimates are not
– Output probabilities are generally very close to
0 or 1
Two Models

Model 1: Multivariate binomial
– One feature Xw for each word in
dictionary
– Xw = true in document d if w appears in d
– Naive Bayes assumption:
• Given the document’s topic, appearance of
one word in document tells us nothing
about chances that another word appears
Two Models

Model 2: Multinomial
– One feature Xi for each word pos in document
• feature’s values are all words in dictionary
– Value of Xi is the word in position i
– Naïve Bayes assumption:
• Given the document’s topic, word in one position in
document tells us nothing about value of words in
other positions
– Second assumption:
• word appearance does not depend on position
P(Xi = w | c ) = P(Xj = w | c)
for all positions i,j, word w, and class c
Parameter estimation

Binomial model:
  P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears

Multinomial model:
  P̂(Xi = w | cj) = fraction of times word w appears across all documents of topic cj
– create a mega-document for topic j by concatenating all documents in this topic
– use the frequency of w in the mega-document
Feature selection via Mutual Information

We might not want to use all words, but
just reliable, good discriminators
 In training set, choose k words which best
discriminate the categories.
One way is in terms of Mutual Information:

  I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} p(e_w, e_c) log [ p(e_w, e_c) / ( p(e_w) p(e_c) ) ]

– computed for each word w and each category c
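A possible Python sketch of the mutual-information score above, written in terms of document counts; the argument names n11, n10, n01, n00 are my own convention, not notation from the slides.

import math

def mutual_information(n11, n10, n01, n00):
    # n11: docs of class c containing w, n10: docs not in c containing w,
    # n01: docs in c without w, n00: docs in neither
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for n_joint, n_w, n_c in [(n11, n11 + n10, n11 + n01),
                              (n10, n11 + n10, n10 + n00),
                              (n01, n01 + n00, n11 + n01),
                              (n00, n01 + n00, n10 + n00)]:
        if n_joint > 0:
            mi += (n_joint / n) * math.log((n * n_joint) / (n_w * n_c))
    return mi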
Feature selection via MI (2)

For each category we build a list of k most
discriminating terms
 For example (on 20 Newsgroups):
– sci.electronics: circuit, voltage, amp, ground, copy,
battery, electronics, cooling, …
– rec.autos: car, cars, engine, ford, dealer, mustang,
oil, collision, autos, tires, toyota, …

Greedy: does not account for correlations
between terms
 In general feature selection is necessary for
binomial NB, but not for multinomial NB
Evaluating Categorization

Evaluation must be done on test data that
are independent of the training data
(usually a disjoint set of instances)
 Classification accuracy: c/n where n is the
total number of test instances and c is the
number of test instances correctly
classified by the system
 Results can vary based on sampling error
due to different training and test sets
 Average results over multiple training and
test sets (splits of the overall data) for the
best results
Example: AutoYahoo!

Classify 13,589 Yahoo! webpages in “Science”
subtree into 95 different topics (hierarchy depth 2)
Sample Learning Curve
(Yahoo Science Data)
Importance of Conditional Independence
Assume a domain with 20 binary (true/false) attributes A1,…, A20, and
two classes c1 and c2.
Goal: for any case A=A1,…,A20 estimate P(A,ci).
A) No independence assumptions:
Computation of 2^21 parameters (one for each combination of values)!



The training database will not be so large!
Huge Memory requirements / Processing time.
Error Prone (small sample error).
B) Strongest conditional independence assumptions (all attributes
independent given the class) = Naive Bayes:
P(A, ci) = P(ci) P(A1|ci) P(A2|ci) … P(A20|ci)
Computation of 20 × 2 × 2 = 80 parameters.



Space and time efficient.
Robust estimations.
What if the conditional independence assumptions do not hold??
C) More relaxed independence assumptions
Tradeoff between A) and B)
Conditions for Optimality of Naïve Bayes
Fact
Sometimes NB performs well even if the Conditional
Independence assumptions are badly violated.

Questions
WHY? And WHEN?

Answer (example)
Assume two classes c1 and c2. A new case A arrives.
NB will classify A to c1 if: P(A, c1) > P(A, c2)

                               P(A, c1)   P(A, c2)   Class of A
Actual probability             0.1        0.01       c1
Probability estimated by NB    0.08       0.07       c1

Despite the big error in estimating the
probabilities the classification is still correct.

Hint
Classification is about predicting the correct
class label and NOT about accurately
estimating probabilities.

Correct estimation ⇒ accurate prediction,
but NOT: accurate prediction ⇒ correct estimation
Naïve Bayes is Not So Naïve

Naïve Bayes: First and Second place in KDD-CUP 97 competition,
among 16 (then) state of the art algorithms
Goal: Financial services industry direct mail response prediction model.
Predict if the recipient of mail will actually respond to the advertisement – 750,000 records.

Robust to Irrelevant Features
Irrelevant Features cancel each other without affecting results
Instead Decision Trees & Nearest-Neighbor methods can heavily suffer from this.

Very good in Domains with many equally important features
Decision Trees suffer from fragmentation in such cases – especially if little data


A good dependable baseline for text classification (but not the
best)!
Optimal if the Independence Assumptions hold:
– If assumed independence is correct, then it is the Bayes Optimal Classifier for
problem

Very Fast:
– Learning with one pass over the data; testing linear in the number of attributes,
and document collection size


Low Storage requirements
Handles Missing Values
Naïve Bayes Drawbacks


Doesn’t do higher order interactions
Typical example: Chess end games
– Each move completely changes the context for
the next move
– C4.5  99.5% accuracy; NB  87% accuracy.


What if you have BOTH high-order
interactions AND little training data?
Doesn’t model features that contribute unequally
to distinguishing the classes
– If only a few features mostly determine the class,
additional features usually decrease the accuracy
– Because NB gives the same weight to all features
Decision Trees
Decision Trees
Example: decision whether to assign documents to the category "earnings"

node 1: 7681 articles, P(c|n1) = 0.300, split on 'cts' at value 2
  cts < 2 → node 2: 5977 articles, P(c|n2) = 0.116, split on 'net' at value 1
    net < 1 → node 3: 5436 articles, P(c|n3) = 0.050
    net ≥ 1 → node 4: 541 articles, P(c|n4) = 0.649
  cts ≥ 2 → node 5: 1704 articles, P(c|n5) = 0.943, split on 'vs' at value 2
    vs < 2 → node 6: 301 articles, P(c|n6) = 0.694
    vs ≥ 2 → node 7: 1403 articles, P(c|n7) = 0.996
Decision Trees - Training procedure (1)

Growing a tree with training data
– splitting criterion
• for finding the feature and its value on which to split
• e.g. maximum information gain
– stopping criterion
• determines when to stop splitting
• e.g. all elements at node have same category

Pruning it back to reasonable size
– to avoid overfitting the training set
e.g. ‘dlrs’ and ‘pct’ in just one document
– to optimize performance
Maximum Information Gain

Information gain
H(t) – H(t|a) = H(t) – ( p_L H(t_L) + p_R H(t_R) )
where:
a is the attribute we split on
t is the distribution of the node we split
p_L and p_R are the fractions of items passed on to the left and right nodes
t_L and t_R are the distributions of the left and right nodes

Choose attribute which maximizes IG
Example:
H(n1) = - 0.3 log(0.3) - 0.7 log(0.7) = 0.881
H(n2) = 0.518
H(n5) = 0.315
H(n1) – H(n1| ‘cts’) =
0.881 – (5977/7681) · 0.518 – (1704/7681) · 0.315 = 0.408
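The worked example can be reproduced with a short sketch; log base 2 is assumed, which matches H(0.3) = 0.881 above, and the counts are those of the tree on the earlier slide.

import math

def entropy(p):
    # binary entropy in bits; entropy(0.3) = 0.881 as in the example
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(p_parent, n_left, p_left, n_right, p_right):
    # IG = H(t) - (p_L * H(t_L) + p_R * H(t_R))
    n = n_left + n_right
    return entropy(p_parent) - (n_left / n) * entropy(p_left) - (n_right / n) * entropy(p_right)

# Split of node 1 on 'cts': 0.881 - (5977/7681)*0.518 - (1704/7681)*0.315
print(round(information_gain(0.300, 5977, 0.116, 1704, 0.943), 3))   # ~0.408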
Decision Trees – Pruning (1)


At each step, drop a node considered least
helpful
Find best tree using validation on validation set
– validation set: portion of training data held out from
training

Find best tree using cross-validation
– I. Determine the optimal tree size
• 1. Divide training data into N partitions
• 2. Grow using N-1 partitions, and prune using held-out
partition
• 3. Repeat 2. N times
• 4. Determine average pruned tree size as optimal tree size
– II. Training using total training data, and pruning back to
optimal size
Decision Trees - Pruning (2)


Effect of pruning on accuracy:
optimal performance on the test set is obtained by pruning back to 951 nodes
Decision Trees Summary


Useful for non-trivial classification tasks (for
simple problems, use simpler methods)
Tend to split the training set into smaller and
smaller subsets:
– may lead to poor generalizations
– not enough data for reliable prediction
– accidental regularities


Volatile: very different model from slightly
different data
Can be interpreted easily
– easy to trace the path
– easy to debug one’s code
– easy to understand a new domain
Maximum Entropy Modeling
Maximum Entropy Modeling

Maximum Entropy Modeling
– The model with maximum entropy of all
the models that satisfy the constraints
– desire to preserve as much uncertainty
as possible
Model class: log linear model
 Training procedure: generalized
iterative scaling

MaxEntropy: example data
Features                 Outcome
Sunny, Happy             Outdoor
Sunny, Happy, Dry        Outdoor
Sunny, Happy, Humid      Outdoor
Sunny, Sad, Dry          Outdoor
Sunny, Sad, Humid        Outdoor
Cloudy, Happy, Humid     Outdoor
Cloudy, Happy, Humid     Outdoor
Cloudy, Sad, Humid       Outdoor
Cloudy, Sad, Humid       Outdoor
Rainy, Happy, Humid      Indoor
Rainy, Happy, Dry        Indoor
Rainy, Sad, Dry          Indoor
Rainy, Sad, Humid        Indoor
Cloudy, Sad, Humid       Indoor
Cloudy, Sad, Humid       Indoor
MaxEnt: example predictions
Context                  Outdoor   Indoor
Cloudy, Happy, Humid     0.771     0.228
Rainy, Sad, Humid        0.001     0.998
Maximum Entropy Modeling (2)
Model class: loglinear model

  p(x, c) = (1/Z) ∏_{i=1}^{K+1} α_i^{f_i(x, c)}

  f_i(x_j, c) = 1 if s_ij > 0 and c = 1, 0 otherwise

– α_i: weight for the i-th feature
– Z: normalizing constant

Class of new document
– compute p(x, 0), p(x, 1)
– choose the class label with the greater probability

word      log(α_i)
vs        0.613
mln       -0.110
cts       1.298
;         -0.432
&         -0.429
000       -0.413
loss      -0.332
'         -0.085
"         0.202
3         -0.463
profit    0.360
dlrs      -0.202
1         -0.211
pct       -0.260
is        -0.546
s         -0.490
that      -0.285
net       -0.300
lt        1.016
at        -0.465
f_{K+1}   0.009
Maximum Entropy Modeling (3)

Training procedure: generalized iterative scaling
– Expected value of f_i under p:

  E_p f_i = Σ_{x,c} p(x, c) f_i(x, c)

– the maximum entropy distribution p* satisfies E_{p*} f_i = E_{p̃} f_i

Algorithm
1. initialize α^(1); compute E_{p̃} f_i; n = 1
2. compute p^(n)(x, c) for each training example
3. compute E_{p^(n)} f_i
4. update α^(n+1)
5. if converged, stop; otherwise n = n+1, go to 2
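A minimal, illustrative sketch of Generalized Iterative Scaling for a conditional loglinear model, following the outline above. The data layout (a list of (x, c) pairs plus a features(x, c) function whose K+1 values already include the correction feature and always sum to the constant C) is an assumption of this sketch, not the exact code behind the slides.

import math

def gis(data, classes, features, C, iterations=100):
    K1 = len(features(*data[0]))                       # number of features, incl. f_{K+1}
    n = len(data)
    # empirical expectations E_p~ f_i = (1/N) sum_j f_i(x_j, c_j)
    emp = [sum(features(x, c)[i] for x, c in data) / n for i in range(K1)]
    log_alpha = [0.0] * K1                             # initialize alpha_i^(1) = 1
    def p_cond(x):
        scores = {c: math.exp(sum(la * f for la, f in zip(log_alpha, features(x, c))))
                  for c in classes}
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}
    for _ in range(iterations):
        # model expectations E_p f_i ~= (1/N) sum_j sum_c p(c|x_j) f_i(x_j, c)
        model = [0.0] * K1
        for x, _ in data:
            p = p_cond(x)
            for c in classes:
                for i, f in enumerate(features(x, c)):
                    model[i] += p[c] * f / n
        # GIS update: alpha_i <- alpha_i * (E_p~ f_i / E_p f_i)^(1/C)
        for i in range(K1):
            if emp[i] > 0 and model[i] > 0:
                log_alpha[i] += (math.log(emp[i]) - math.log(model[i])) / C
    return log_alpha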
Maximum Entropy Modeling (4)

Define a (K+1)-th feature, for the constraint that the sum of the f_i equals C:

  f_{K+1}(x, c) = C − Σ_{i=1}^{K} f_i(x, c),   where  C = max_{x,c} Σ_{i=1}^{K} f_i(x, c)

The expected value under p is defined as

  E_p f_i = Σ_{x,c} p(x, c) f_i(x, c)

The expected value under the empirical distribution is computed as

  E_{p̃} f_i = Σ_{x,c} p̃(x, c) f_i(x, c) = (1/N) Σ_{j=1}^{N} f_i(x_j, c_j)

The expected value under p is approximately computed as

  E_p f_i ≈ (1/N) Σ_{j=1}^{N} Σ_c p(c | x_j) f_i(x_j, c)
Maximum Entropy Modeling (6)
Application to text categorization
– trained on 9603 articles, 500 iterations
– test result: 88.6% accuracy
– the resulting feature weights log(α_i) are those shown in the table above
log(i)
vs
0.613
mln
-0.110
cts
1.298
;
-0.432
&
-0.429
000
-0413
loss
-0.332
‘
-0.085
“
0.202
3
-0.463
profit
0.360
dlrs
-0.202
1
-0.211
pct
-0.260
is
-0.546
s
-0.490
that
-0.285
net
-0.300
lt
1.016
at
-0.465
fK+1
0.009
Maximum Entropy Modeling (7)

Shortcomings of MEM
– restricted to binary features
• low performance in some situations
– computationally expensive: slow convergence
– the lack of smoothing can cause problems

Strengths of MEM
– can specify all possible relevant information
• complex features can be defined
– can use heterogeneous and weighted features
– an integrated framework for feature selection &
classification
• a very large number of features can be reduced to a
manageable size during the training procedure
Vector Space Classifiers
Vector Space Representation
Each document is a vector, one
component for each term (= word)
 Normalize to unit length
 Properties of vector space

– terms are axes
– n docs live in this space
– even with stemming, may have 10,000+
dimensions, or even 1,000,000+
Classification Using Vector Spaces
Each training doc a point (vector)
labeled by its class
 Similarity hypothesis: docs of the
same class form a contiguous region
of space. Or: Similar documents are
usually in the same class.
 Define surfaces to delineate classes
in space

Classes in a Vector Space
[Figure: documents of the Government, Science and Arts classes form separate regions of the vector space. Is the similarity hypothesis true in general?]
Given a Test Document
Figure out which region it lies in
 Assign corresponding class

Test Document = Government
[Figure: the test document falls in the Government region, among the Government, Science and Arts classes.]
k-Nearest Neighbor
k-Nearest Neighbor Classification
To classify document d into class c:
– define the k-neighborhood N as the k nearest neighbors of d
– count the number l of documents in N that belong to c
– estimate P(c|d) as l/k

Example: k = 6 (6NN)
[Figure: estimate P(science | d) from the classes of the 6 nearest neighbors among the Government, Science and Arts regions.]
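A small sketch of the kNN estimate P(c|d) = l/k; it assumes unit-length vectors stored as {term: weight} dicts so that cosine similarity reduces to a dot product, and the data layout is my own choice.

from collections import Counter

def knn_posterior(query_vec, training, k=6):
    # training: list of (vector, label) pairs; vectors are {term: weight} dicts
    def cosine(u, v):
        return sum(u[t] * v.get(t, 0.0) for t in u)   # dot product of unit vectors
    neighbors = sorted(training, key=lambda tv: cosine(query_vec, tv[0]), reverse=True)[:k]
    counts = Counter(label for _, label in neighbors)
    return {c: l / k for c, l in counts.items()}      # P(c|d) = l/k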
kNN Learning Algorithm


Learning is just storing the representations of the
training examples in D
Testing instance x:
– Compute similarity between x and all examples in D
– Assign x the category of the most similar example in D


Does not explicitly compute a generalization or
category prototypes
Also called:
– Case-based learning
– Memory-based learning
– Lazy learning
kNN Is Close to Optimal

Cover and Hart 1967
Asymptotically, the error rate of 1-nearest-neighbor
classification is less than twice
the Bayes rate [the error rate of a classifier that knows the model
that generated the data]

In particular, asymptotic error rate is 0 if
Bayes rate is 0
 Assume: query point coincides with a
training point
 Both query point and training point
contribute error → 2 times Bayes rate
kNN with Inverted Index

Naively finding nearest neighbors requires
a linear search through |D| documents in
collection
 But determining k nearest neighbors is the
same as determining the k best retrievals
using the test document as a query to a
database of training documents
 Use standard vector space inverted index
methods to find the k nearest neighbors
 Testing Time: O(B|Vt|)
where B is the
average number of training documents in
which a test-document word appears
– Typically B << |D|
kNN: Discussion

No feature selection necessary
 Scales well with large number of classes
– Don’t need to train n classifiers for n classes

Classes can influence each other
– Small changes to one class can have ripple
effect

Scores can be hard to convert to
probabilities
 No training necessary
– Actually: perhaps not true. (Data editing, etc.)
kNN vs. Naive Bayes

Bias/Variance tradeoff
– Variance ≈ Capacity

kNN has high variance and low bias.
– Infinite memory

NB has low variance and high bias.
– Decision surface has to be linear (hyperplane – see
later)

Consider: Is an object a tree? (Burges)
– Too much capacity/variance, low bias
• Botanist who memorizes
• Will always say “no” to new object (e.g., # leaves)
– Not enough capacity/variance, high bias
• Lazy botanist
• Says “yes” if the object is green
– Want the middle ground
kNN vs. Naïve Bayes






Bias/Variance tradeoff
Variance ≈ Capacity
kNN has high variance and low bias
Regression has low variance and high bias
Consider: Is an object a tree? (Burges)
Too much capacity/variance, low bias
– Botanist who memorizes
– Will always say “no” to new object (e.g., #leaves)

Not enough capacity/variance, high bias
– Lazy botanist
– Says “yes” if the object is green
kNN: Discussion

Classification time linear in training set
 No feature selection necessary
 Scales well with large number of classes
– Don’t need to train n classifiers for n classes

Classes can influence each other
– Small changes to one class can have ripple effect

Scores can be hard to convert to
probabilities
 No training necessary
– Actually: not true. Why?
Binary Classification
Consider 2 class problems
 How do we define (and find) the
separating surface?
 How do we test which region a test
doc is in?

Separation by Hyperplanes

Assume linear separability for now:
– in 2 dimensions, can separate by a line
– in higher dimensions, need hyperplanes

Can find separating hyperplane by
linear programming (e.g. perceptron):
– separator can be expressed as
ax + by = c
Linear Programming / Perceptron
Find a, b, c, such that
ax + by ≥ c for red points
ax + by ≤ c for blue points
Relationship to Naïve Bayes?
Find a, b, c, such that
ax + by ≥ c for red points
ax + by ≤ c for blue points
Linear Classifiers
Many common text classifiers are
linear classifiers
 Despite this similarity, large
performance differences

– For separable problems, there is an
infinite number of separating
hyperplanes. Which one do you
choose?
– What to do for non-separable
problems?
Which Hyperplane?
In general, lots of possible
solutions for a, b, c
Which Hyperplane?




Lots of possible solutions for a, b, c.
Some methods find a separating
hyperplane, but not the optimal one (e.g.,
perceptron)
Most methods find an optimal separating
hyperplane
Which points should influence optimality?
– All points
• Linear regression
• Naïve Bayes
– Only “difficult points” close to decision
boundary
• Support vector machines
• Logistic regression (kind of)
Hyperplane: Example


Class: “interest” (as in interest rate)
Example features of a linear classifier
(SVM)
wi     ti              wi      ti
0.70   prime           -0.71   dlrs
0.67   rate            -0.35   world
0.63   interest        -0.33   sees
0.60   rates           -0.25   year
0.46   discount        -0.24   group
0.43   bundesbank      -0.24   dlr
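To make the scoring concrete, a tiny sketch that applies the linear decision rule g(x) = w·x + b with the "interest" weights from the table above; the document term counts below are invented for illustration.

def linear_score(doc_term_weights, w, b=0.0):
    # g(x) = w.x + b; positive score -> assign the document to the class
    return sum(w.get(t, 0.0) * x for t, x in doc_term_weights.items()) + b

w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43,
     "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
     "group": -0.24, "dlr": -0.24}

doc = {"rate": 2, "interest": 1, "dlrs": 1}       # hypothetical term counts
print(linear_score(doc, w))                       # 1.26 > 0 -> assign to "interest"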
More Than Two Classes

One-of classification: each document
belongs to exactly one class
– How do we compose separating surfaces into
regions?

Any-of or multiclass classification
– For n classes, decompose into n binary
problems

Vector space classifiers for one-of
classification
– Use a set of binary classifiers
– Centroid classification
– K Nearest Neighbor classification
Composing Surfaces: Issues
Set of Binary Classifiers: Any of
Build a separator between each class
and its complementary set (docs
from all other classes)
 Given test doc, evaluate it for
membership in each class
 Apply decision criterion of classifiers
independently
 Done

Set of Binary Classifiers: One of

Build a separator between each class and
its complementary set (docs from all other
classes)
 Given test doc, evaluate it for membership
in each class
 Assign document to class with:
– maximum score
– maximum confidence
– maximum probability

Why different from multiclass/any of
classification?
Negative Examples

Formulate as above, except negative
examples for a class are added to its
complementary set.
Positive examples
Negative examples
Centroid Classification
Given training docs for a class,
compute their centroid
 Now have a centroid for each class
 Given query doc, assign to class
whose centroid is nearest

Example
[Figure: class centroids for Government, Science and Arts; a query document is assigned to the class with the nearest centroid.]
Support Vector Machines
Which Hyperplane?


In general, lots of possible
solutions for a, b, c
Support Vector Machine
(SVM) finds an optimal
solution
Support Vector Machine (SVM)

SVMs maximize the margin
around the separating
hyperplane:
Support vectors
– a.k.a. large margin classifiers



The decision function is fully
specified by a subset of
training samples, the support
vectors
Quadratic programming
problem
State of the art text
classification method
Maximize
margin
Geometric Margin
wT x  b
r
w

Distance from example to the separator is

Examples closest to the hyperplane are support vectors

Margin ρ of the separator is the width of separation between
support vectors of classes
ρ
r
Maximum Margin: Formalization

w: hyperplane normal
 xi: data point i
 yi: class of data point i (+1 or -1)

Constraint optimization formalization:
(1) xi · w + b ≥ +1 for yi = +1
(2) xi · w + b ≤ −1 for yi = −1
(3) maximize the margin: 2/‖w‖
Quadratic Programming

One can show that the hyperplane w with maximum
margin is:

  w = Σ_i α_i y_i x_i

α_i: Lagrange multipliers
x_i: data point i
y_i: class of data point i (+1 or −1)

where the α_i are the solution to maximizing:

  L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i · x_j)

Most α_i will be zero
Building an SVM Classifier
Now we know how to build a
separator for two linearly separable
classes
 What about classes whose
exemplary docs are not linearly
separable?

Not Linearly Separable
Find a line that penalizes
points on “the wrong side”
Penalizing Bad Points
Define the distance for each point with
respect to the separator ax + by = c:
(ax + by) − c for red points
c − (ax + by) for blue points
(negative for points on the wrong side)
Solve Quadratic Program
Solution gives “separator” between
two classes: choice of a,b
 Given a new point (x,y), can score its
proximity to each class:

– evaluate ax+by
– Set confidence threshold
Non-linear SVMs

Datasets that are linearly separable (with some noise) work
out great

But what are we going to do if the dataset is just too hard?

How about … mapping data to a higher-dimensional space?
[Figure: 1-D data on the x axis that cannot be separated by a threshold becomes separable in the (x, x²) plane.]
Non-linear SVMs: Feature spaces

General idea: the original feature space can
always be mapped to some higher-dimensional
feature space where the training set is separable:
Φ: x → φ(x)
The “Kernel Trick”

The linear classifier relies on an inner product between vectors
K(xi,xj) = xiTxj
 If every datapoint is mapped into high-dimensional space via some
transformation Φ: x → φ(x), the inner product becomes:
K(xi, xj) = φ(xi)Tφ(xj)
 A kernel function is some function that corresponds to an inner
product in some expanded feature space.
 Example:
2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xiᵀxj)²
Need to show that K(xi, xj) = φ(xi)ᵀφ(xj):
K(xi, xj) = (1 + xiᵀxj)²
  = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
  = [1  xi1²  √2 xi1xi2  xi2²  √2 xi1  √2 xi2]ᵀ [1  xj1²  √2 xj1xj2  xj2²  √2 xj1  √2 xj2]
  = φ(xi)ᵀφ(xj)
where φ(x) = [1  x1²  √2 x1x2  x2²  √2 x1  √2 x2]
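The identity can be checked numerically with a few lines of Python; the vectors x and z below are arbitrary examples.

import math

def poly_kernel(x, z):
    # K(x, z) = (1 + x.z)^2 for 2-dimensional x, z
    return (1 + x[0]*z[0] + x[1]*z[1]) ** 2

def phi(x):
    # explicit feature map phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    r2 = math.sqrt(2)
    return [1, x[0]**2, r2*x[0]*x[1], x[1]**2, r2*x[0], r2*x[1]]

x, z = (1.0, 2.0), (3.0, -1.0)
dot = sum(a*b for a, b in zip(phi(x), phi(z)))
print(poly_kernel(x, z), dot)     # both print 4.0: (1 + 3 - 2)^2 = 4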
Kernels

Why use kernels?
– Make non-separable problem separable
– Map data into better representational space

Common kernels
– Linear
– Polynomial K(x,z) = (1+xTz)d
– Radial basis function (infinite dimensional
space)
Evaluation: Classic Reuters Data Set




Most (over)used data set
21578 documents
9603 training, 3299 test articles (ModApte split)
118 categories
– An article can be in more than one category
– Learn 118 binary category distinctions


Average document: about 90 types, 200 tokens
Average number of classes assigned
– 1.24 for docs with at least one category

Only about 10 out of 118 categories are large
Common categories (#train, #test)
• Earn (2877, 1087)
• Acquisitions (1650, 179)
• Money-fx (538, 179)
• Grain (433, 149)
• Crude (389, 189)
• Trade (369, 119)
• Interest (347, 131)
• Ship (197, 89)
• Wheat (212, 71)
• Corn (182, 56)
Performance of SVM





SVM are seen as best-performing method
by many
Statistical significance of most results not
clear
There are many methods that perform
about as well as SVM
Example: regularized regression
(Zhang&Oles)
Example of a comparison study: Yang&Liu
Dumais et al. 1998: Reuters - Accuracy

              Rocchio   NBayes   Trees   LinearSVM
earn          92.9%     95.9%    97.8%   98.2%
acq           64.7%     87.8%    89.7%   92.8%
money-fx      46.7%     56.6%    66.2%   74.0%
grain         67.5%     78.8%    85.0%   92.4%
crude         70.1%     79.5%    85.0%   88.3%
trade         65.1%     63.9%    72.5%   73.5%
interest      63.4%     64.9%    67.1%   76.3%
ship          49.2%     85.4%    74.2%   78.0%
wheat         68.9%     69.7%    92.5%   89.7%
corn          48.2%     65.3%    91.8%   91.1%
Avg Top 10    64.6%     81.5%    88.4%   91.4%
Avg All Cat   61.7%     75.2%    na      86.4%
Yang&Liu: SVM vs Other Methods
Results for Kernels (Joachims)
Confusion matrix
Entry (i, j) means that 53 of the docs actually in
class i were put in class j by the classifier
(rows: actual class; columns: class assigned by the classifier)
In a perfect classification, only the
diagonal has non-zero entries
SVM Summary

Choose hyperplane based on support vectors
– Support vector = “critical” point close to decision
boundary



(Degree-1) SVMs are linear classifiers
Kernels: powerful and elegant way to define
similarity metric
Perhaps best performing text classifier
– But there are other methods that perform about as well as
SVM, such as regularized logistic regression (Zhang &
Oles 2001)

Partly popular due to availability of SVMlight
– SVMlight is accurate and fast – and free (for research)



Now lots of software: libsvm, TinySVM, ….
Comparative evaluation of methods
Real world: exploit domain specific structure!
Resources









Manning and Schütze. Foundations of Statistical Natural Language
Processing. Chapter 16. MIT Press.
Christopher J. C. Burges. A Tutorial on Support Vector Machines for
Pattern Recognition, 1998.
S. T. Dumais, Using SVMs for text categorization, IEEE Intelligent Systems,
13(4), Jul/Aug 1998.
S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. 1998. Inductive
learning algorithms and representations for text categorization.
Proceedings of CIKM ’98, pp. 148-155.
A re-examination of text categorization methods (1999) Yiming Yang, Xin
Liu 22nd Annual International SIGIR.
Tong Zhang, Frank J. Oles: Text Categorization Based on Regularized
Linear Classification Methods. Information Retrieval 4(1): 5-31 (2001)
Trevor Hastie, Robert Tibshirani and Jerome Friedman, "Elements of
Statistical Learning: Data Mining, Inference and Prediction", Springer-Verlag, New York.
T. Joachims, Learning to Classify Text using Support Vector Machines.
Kluwer, 2002
Question Answering
Question Answering from Open-Domain Text
 Search
Engines (IR) return list of
(possibly) relevant documents
 Users still to have to dig through
returned list to find answer
 QA: give the user a (short) answer
to their question, perhaps
supported by evidence
People do ask questions

Examples from AltaVista query log
–
–
–
–
who invented surf music?
how to make stink bombs
where are the snowdens of yesteryear?
which english translation of the bible is used in official catholic
liturgies?
– how to do clayart
– how to copy psx
– how tall is the sears tower?

Examples from Excite query log (12/1999)
–
–
–
–
–

how can i find someone in texas
where can i find information on puritan religion?
what are the 7 wonders of the world
how can i eliminate stress
What vacuum cleaner does Consumers Guide recommend
Around 12–15% of query logs
The Google answer #1
Include question words (why, who,
etc.) in stop-list
 Do standard IR
 Sometimes this (sort of) works:

– Question: Who was the prime minister
of Australia during the Great
Depression?
– Answer: James Scullin (Labor) 1929–31
[Retrieved pages: a page about Curtin (WWII Labor Prime Minister) from which the answer can be deduced; another page about Curtin that lacks the answer; a page about Chifley (Labor Prime Minister) from which the answer can be deduced.]
But often it doesn’t…
Question: How much money did IBM
spend on advertising in 2002?
 Answer: I dunno, but I’d like to … 

The Google answer #2






Take the question and try to find it as a string
on the web
Return the next sentence on that web page as
the answer
Works brilliantly if this exact question
appears as a FAQ question, etc.
Works lousily most of the time
Reminiscent of the line about monkeys and
typewriters producing Shakespeare
But a slightly more sophisticated version of
this approach has been revived in recent
years with considerable success…
AskJeeves

AskJeeves was the most hyped example of
“Question answering”
– Have basically given up now: just web search except
when there are factoid answers of the sort MSN also
does

It largely did pattern matching to match your
question to their own knowledge base of
questions
 If that works, you get the human-curated answers
to that known question
 If that fails, it falls back to regular web search
 A potentially interesting middle ground, but a
fairly weak shadow of real QA
Online QA Examples

LCC:
http://www.languagecomputer.com/demos/question_answering/index.html
AnswerBus is an open-domain
question answering system:
www.answerbus.com
 Ionaut: http://www.ionaut.com:8400/
 EasyAsk, AnswerLogic,
AnswerFriend, Start, Quasm, Mulder,
Webclopedia, etc.

Question Answering at TREC



Question answering competition at TREC
Until 2004, consisted of answering a set of 500
fact-based questions, e.g. “When was Mozart
born?”.
For the first three years systems were allowed to
return 5 ranked answer snippets (50/250 bytes) to
each question.
– IR think
– Mean Reciprocal Rank (MRR) scoring:
• 1, 0.5, 0.33, 0.25, 0.2, 0 for 1, 2, 3, 4, 5, 6+ doc
– Mainly Named Entity answers (person, place, date, …)

From 2002 the systems are only allowed to return
a single exact answer and the notion of
confidence has been introduced
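For reference, a tiny sketch of the MRR scoring rule described above; the example ranks are invented.

def mean_reciprocal_rank(ranks):
    # each question scores 1/rank of its first correct answer, 0 if none (rank=None)
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

# Hypothetical run: correct answers at ranks 1, 3, none, 2
print(mean_reciprocal_rank([1, 3, None, 2]))    # (1 + 0.333 + 0 + 0.5) / 4 ~= 0.458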
The TREC Document Collection

The retrieval collection uses news articles
from the following sources:
• AP newswire, 1998-2000
• New York Times newswire, 1998-2000
• Xinhua News Agency newswire, 1996-2000

In total there are 1,033,461 documents in
the collection: 3GB of text
 Too much to handle entirely using NLP
techniques so the systems usually consist
of an initial information retrieval phase
followed by more advanced processing
 Many supplement this text with use of the
web, and other knowledge bases
Sample TREC questions
1. Who is the author of the book, “The Iron Lady: A
Biography of Margaret Thatcher”?
2. What was the monetary value of the Nobel Peace
Prize in 1989?
3. What does the Peugeot company manufacture?
4. How much did Mercury spend on advertising in 1993?
5. What is the name of the managing director of Apricot
Computer?
6. Why did David Koresh ask the FBI for a word processor?
7. What debts did Qintex group leave?
8. What is the name of the rare neurological disease with
symptoms such as: involuntary movements (tics), swearing,
and incoherent vocalizations (grunts, shouts, etc.)?
Top Performing Systems


Currently the best performing systems at TREC
can answer approximately 60-80% of the
questions
Approaches and successes have varied a fair
deal
– Knowledge-rich approaches, using a vast array
of NLP techniques
• Notably Harabagiu, Moldovan et al. – SMU/UTD/LCC
– AskMSR system stressed how much could be
achieved by very simple methods with enough
text
– Middle ground is to use a large collection of
surface matching patterns (Insight Software)
TREC 2000 Q&A track

693 fact-based, short answer questions
– either short (50 B) or long (250 B) answer

~3 GB newspaper/newswire text (AP, WSJ,
SJMN, FT, LAT, FBIS)
 Score: MRR (penalizes second answer)
 Questions: 186 (Encarta), 314 (seeds from
Excite logs), 193 (syntactic variants of 54
originals)
Sample Questions
Q: Who shot President Abraham Lincoln?
A: John Wilkes Booth
Q: How many lives were lost in the Pan Am crash in Lockerbie?
A: 270
Q: How long does it take to travel from London to Paris through the Channel?
A: three hours 45 minutes
Q: Which Atlantic hurricane had the highest recorded wind speed?
A: Gilbert (200 mph)
Q: Which country has the largest part of the rain forest?
A: Brazil (60%)
Question Types
Class 1
Answer: single datum or list of items
C: who, when, where, how (old, much, large)
Class 2
A: multi-sentence
C: extract from multiple sentences
Class 3
A: across several texts
C: comparative/contrastive
Class 4
A: an analysis of retrieved information
C: synthesized coherently from several retrieved
fragments
Class 5
A: result of reasoning
C: word/domain knowledge and common sense
reasoning
Question subtypes
Class 1.A
About subjects, objects, manner, time or location
Class 1.B
About properties or attributes
Class 1.C
Taxonomic nature
TREC 2000 Results
Knowledge-rich approach: Falcon
[Bar chart of MRR (scale 0–0.8) by participant, including SMU, Queens, Waterloo, IBM and others; Falcon (SMU), a knowledge-rich approach, obtained the best MRR.]
Falcon: Architecture
[Architecture diagram:
Question Processing: the question is parsed (Collins parser + NE extraction) into a question semantic form; a question taxonomy determines the expected answer type; question expansion uses WordNet; a question logical form is produced.
Paragraph Processing: a paragraph index over the document collection retrieves candidate paragraphs, which are filtered into answer paragraphs.
Answer Processing: answer paragraphs are parsed (Collins parser + NE extraction) into an answer semantic form; coreference resolution is applied; the answer logical form passes through an abduction filter, which yields the answer.]
Question parse
[Constituent parse tree of “Who was the first Russian astronaut to walk in space”, with S, VP, NP and PP nodes and POS tags WP, VBD, DT, JJ, NNP, TO, VB, IN, NN.]
Question semantic form
[Semantic form: first, Russian, astronaut, walk, space; answer type: PERSON.]

Question logic form:
first(x) ∧ astronaut(x) ∧ Russian(x) ∧ space(z) ∧ walk(y, z, x) ∧ PERSON(x)
Expected Answer Type
Question: What is the size of Argentina?
[The expected answer type is derived via WordNet: size → dimension → QUANTITY.]
Questions about definitions

Special patterns:
– What {is|are} …?
– What is the definition of …?
– Who {is|was|are|were} …?

Answer patterns:
– … {is|are} …
– …, {a|an|the} …
– … –
Question Taxonomy
Question
– Location: Country, City, Province, Continent
– Number: Speed, Degree, Dimension, Rate, Duration, Percentage, Count
– Reason
– Product
– Nationality
– Manner
– Currency
– Language
– Mammal
– Reptile
– Game
– Organization
Question expansion

Morphological variants
– invented → inventor

Lexical variants
– killer → assassin
– far → distance

Semantic variants
– like → prefer
Indexing for Q/A

Alternatives:
– IR techniques
– Parse texts and derive conceptual
indexes

Falcon uses paragraph indexing:
– Vector-Space plus proximity
– Returns weights used for abduction
Abduction to justify answers


Backchaining proofs from questions
Axioms:
– Logical form of answer
– World knowledge (WordNet)
– Coreference resolution in answer text

Example
– Q: When was the internal combustion engine invented?
– A: The first internal-combustion engine was built in 1867
– invent → create_mentally → create → build

Effectiveness:
– 14% improvement
– Filters 121 erroneous answers (of 692)
– Takes 60% of question processing time
TREC 2001 Results
Surface approach: Insight Software
[Bar chart of MRR (scale 0–0.7) by participant, including Insight, LCC, Oracle, ISI, Waterloo, IBM, MSR, Pisa and others; Insight Software’s surface-pattern approach obtained the best MRR.]
TREC 2001: no NLP
Best system from Insight Software
using surface patterns
 AskMSR uses a Web Mining
approach, by retrieving suggestions
from Web searches

Insight Sofware: Surface patterns approach



Best at TREC 2001: 0.68 MRR
Use of Characteristic Phrases
“When was <person> born”
– Typical answers
• “Mozart was born in 1756.”
• “Gandhi (1869-1948)...”
– Suggests phrases (regular expressions) like
• “<NAME> was born in <BIRTHDATE>”
• “<NAME> ( <BIRTHDATE>-”
– Use of Regular Expressions can help locate
correct answer
Use Pattern Learning

Example:
• “The great composer Mozart (1756-1791) achieved
fame at a young age”
• “Mozart (1756-1791) was a genius”
• “The whole world would always be indebted to the
great music of Mozart (1756-1791)”
– Longest matching substring for all 3 sentences
is "Mozart (1756-1791)”
– Suffix tree would extract "Mozart (1756-1791)"
as an output, with score of 3

Reminiscent of Information Extraction
pattern learning
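A rough sketch of the longest-matching-substring idea using Python's difflib; this illustrates the principle, not the system's actual suffix-tree implementation.

from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # longest matching substring of two sentences
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

sents = ["The great composer Mozart (1756-1791) achieved fame at a young age",
         "Mozart (1756-1791) was a genius",
         "The whole world would always be indebted to the great music of Mozart (1756-1791)"]
pattern = sents[0]
for s in sents[1:]:
    pattern = longest_common_substring(pattern, s)
print(pattern)    # "Mozart (1756-1791)" (roughly; surrounding whitespace may differ)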
Pattern Learning (cont.)

Repeat with different examples of
same question type
– “Gandhi 1869”, “Newton 1642”, etc.

Some patterns learned for
BIRTHDATE
– a. born in <ANSWER>, <NAME>
– b. <NAME> was born on <ANSWER>,
– c. <NAME> ( <ANSWER> –
– d. <NAME> ( <ANSWER> - )
Experiments

6 different Q types
– from Webclopedia QA Typology (Hovy
et al., 2002a)
•
•
•
•
•
•
BIRTHDATE
LOCATION
INVENTOR
DISCOVERER
DEFINITION
WHY-FAMOUS
Experiments: pattern precision

BIRTHDATE table:
• 1.0   <NAME> ( <ANSWER> - )
• 0.85  <NAME> was born on <ANSWER>,
• 0.6   <NAME> was born in <ANSWER>
• 0.59  <NAME> was born <ANSWER>
• 0.53  <ANSWER> <NAME> was born
• 0.50  - <NAME> ( <ANSWER>
• 0.36  <NAME> ( <ANSWER> -

DEFINITION
• 1.0   <NAME> and related <ANSWER>
• 1.0   form of <ANSWER>, <NAME>
• 0.94  as <NAME>, <ANSWER> and
Shortcomings & Extensions

Need for POS &/or semantic types
• "Where are the Rocky Mountains?”
• "Denver's new airport, topped with white
fiberglass cones in imitation of the Rocky
Mountains in the background , continues to
lie empty”
• <NAME> in <ANSWER>

Named Entity tagger &/or ontology
could enable system to determine
“background” is not a location name
Shortcomings ... (cont.)

Long distance dependencies
– “Where is London?”
– “London, which has one of the most busiest
airports in the world, lies on the banks of the
river Thames”
– would require pattern like:
<QUESTION>, (<any_word>)*, lies on <ANSWER>

Abundance & variety of Web data helps
system to find an instance of patterns w/o
losing answers to long distance
dependencies
Shortcomings... (cont.)

System currently has only one anchor word
– Doesn't work for Q types requiring multiple words from
question to be in answer
• “In which county does the city of Long Beach lie?”
• “Long Beach is situated in Los Angeles County”
• required pattern:
<Q_TERM_1> is situated in <ANSWER> <Q_TERM_2>

Did not use case
• “What is a micron?”
• “...a spokesman for Micron, a maker of semiconductors,
said SIMMs are...”

If Micron had been capitalized in question, would
be a perfect answer
AskMSR: Web Mining

Web Question Answering: Is More Always
Better?
– Dumais, Banko, Brill, Lin, Ng (Microsoft, MIT,
Berkeley)
Q: “Where is the Louvre located?”
 Want “Paris” or “France” or “75058 Paris
Cedex 01” or a map
 Don’t just want URLs

AskMSR: Details
[Architecture diagram showing the five steps described next: (1) rewrite queries, (2) query the search engine, (3) mine N-grams, (4) filter N-grams, (5) tile the answers.]
Step 1: Rewrite queries

Intuition: The user’s question is often
syntactically quite close to
sentences that contain the answer
– Where is the Louvre Museum located?
– The Louvre Museum is located in Paris
– Who created the character of Scrooge?
– Charles Dickens created the character of
Scrooge.
Query rewriting

Classify the question into seven categories
– Who is/was/are/were…?
– When is/did/will/are/were …?
– Where is/are/were …?

a. Category-specific transformation rules,
e.g. “For Where questions, move ‘is’ to all possible locations”:
“Where is the Louvre Museum located”
→ “is the Louvre Museum located”
→ “the is Louvre Museum located”
→ “the Louvre is Museum located”
→ “the Louvre Museum is located”
→ “the Louvre Museum located is”
(Nonsense, but who cares? It’s only a few more queries to Google.)

b. Expected answer “Datatype” (e.g., Date, Person, Location, …)
When was the French Revolution? → DATE

Hand-crafted classification/rewrite/datatype rules
Step 2: Query search engine
Send all rewrites to a Web search
engine
 Retrieve top N answers
 For speed, rely just on search
engine’s “snippets”, not the full text
of the actual document

Step 3: Mining N-Grams

Simple: Enumerate all N-grams (N=1,2,3 say) in all
retrieved snippets
– Use hash table and other fancy footwork to make this
efficient


Weight of an N-gram: occurrence count, each
weighted by “reliability” (weight) of rewrite that
fetched the document
Example: “Who created the character of
Scrooge?”
–
–
–
–
–
–
–
–
Dickens - 117
Christmas Carol - 78
Charles Dickens - 75
Disney - 72
Carl Banks - 54
A Christmas - 41
Christmas Carol - 45
Uncle - 31
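A simplified sketch of the n-gram mining step, where each snippet carries the id of the rewrite that retrieved it and the rewrites have hand-assigned reliability weights; the snippets and weights below are invented.

from collections import Counter

def mine_ngrams(snippets, rewrite_weights, max_n=3):
    # snippets: list of (text, rewrite_id); each n-gram occurrence is weighted
    # by the reliability of the rewrite that fetched the snippet
    scores = Counter()
    for text, rid in snippets:
        tokens = text.lower().split()
        w = rewrite_weights.get(rid, 1.0)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                scores[" ".join(tokens[i:i + n])] += w
    return scores

# Hypothetical snippets for "Who created the character of Scrooge?"
snips = [("Charles Dickens created the character of Scrooge", "exact"),
         ("Scrooge , the miser created by Charles Dickens", "bag")]
print(mine_ngrams(snips, {"exact": 5.0, "bag": 1.0}).most_common(3))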
Step 4: Filtering N-Grams
Each question type is associated with one or
more “data-type filters” = regular expressions:
– When … → Date
– Where … → Location
– Who … → Person
– What … → …

Boost the score of N-grams that do match the regexp
Lower the score of N-grams that don’t match the regexp

Details omitted from paper….
Step 5: Tiling the Answers
[Example: the n-grams “Charles Dickens” (score 20), “Dickens” (15) and “Mr Charles” (10) are tiled: the highest-scoring n-gram is taken first, overlapping n-grams are merged into “Mr Charles Dickens” (score 45) and the old n-grams are discarded; repeat until no more overlap.]
AskMSR Results

Standard TREC contest test-bed:
~1M documents; 900 questions
 Technique doesn’t do too well (though
would have placed in top 9 of ~30
participants!)
– MRR = 0.262 (i.e., right answer ranked about #4–#5 on average)
– Why? Because it relies on the enormity of the
Web!

Using the Web as a whole, not just TREC’s
1M documents… MRR = 0.42 (i.e., on
average, right answer is ranked about #2-#3)
AskMSR Issues
In many scenarios (e.g., monitoring
an individual’s email…) we only have
a small set of documents
Works best/only for “Trivial Pursuit”-style fact-based questions
 Limited/brittle repertoire of

– question categories
– answer data types/filters
– query rewriting rules
PiQASso: Pisa Question Answering
System
“Computers are useless, they can only
give answers”
Pablo Picasso
PiQASso Architecture
[Architecture diagram:
Question analysis: the question is parsed with MiniPar and classified (question classification, WNSense/WordNet) to determine the expected answer type; query formulation/expansion produces a query for the indexer built over the document collection (sentence splitter + indexer).
Answer analysis: retrieved paragraphs are parsed with MiniPar; relation matching, type matching, answer scoring and popularity ranking select the answer; if no answer is found, the query is reformulated and the loop repeats.]
Question Analysis
What metal has the highest melting point?

1. Parsing: the NL question is parsed into dependency relations
   (subj, obj, lex-mod, mod)
2. Keyword extraction: POS tags are used to select the search keywords:
   metal, highest, melting, point
3. Answer type detection: the expected answer type is determined by applying
   heuristic rules to the dependency tree: SUBSTANCE
4. Relation extraction: additional relations are inferred and the answer entity is identified:
   <SUBSTANCE, has, subj>, <point, has, obj>, <melting, point, lex-mod>, <highest, point, mod>
Answer Analysis
Tungsten is a very dense material and has the highest melting point of any metal.

1. Parsing: retrieved paragraphs are parsed
2. Answer type check: paragraphs not containing an entity of the expected
   type (SUBSTANCE) are discarded
3. Relation extraction: dependency relations are extracted from the MiniPar output:
   <tungsten, material, pred>, <tungsten, has, subj>, <point, has, obj>, …
4. Matching distance: the matching distance between word relations in the
   question and the answer is computed
5. Distance filtering: paragraphs that are too distant are filtered out
6. Popularity ranking: a popularity rank is used to weight distances
ANSWER: Tungsten
Match Distance between Question and Answer

Analyze relations between
corresponding words considering:
– number of matching words in question and in
answer
– distance between words. Ex: moon - satellite
– relation types. Ex: words in the question related
by subj while the matching words in the answer
related by pred
http://medialab.di.unipi.it/askpiqasso.html
TREC 2004

Questions: factoid (230) and list (56)
in series (65), definition
– Where was Franz Kafka born?
– When was he born?
– What books did he author?

Question series provide context
TREC 2004 Results (factoid)
Knowledge-rich approach: PowerAnswer2
[Bar chart of factoid accuracy (scale 0–0.8) by participant, including LCC, IBM, MIT, IRST, Fudan, Singapore, Saarland and others; LCC’s PowerAnswer2, a knowledge-rich approach, obtained the best accuracy.]
PowerAnswer2 Architecture
[Architecture diagram: the question Q goes through strategy selection, context insertion and keyword selection with alternations/expansions; passage retrieval over the text collection is followed by answer detection using lexical matching and semantic matching (parser + NER) together with the COGEX prover; candidate answers are ranked and formulated into the final answer A.]
COGEX Theorem Prover
[Diagram: the question and candidate answer text are converted to logic forms by a semantic parser; axiom building combines WordNet axioms, world-knowledge axioms, linguistic axioms and a semantic calculus; the logic prover attempts a proof, and its output drives answer ranking.]

Linguistic axioms improve overall accuracy by 12%
QA: Linguistic + Web mining
Saarland: linguistic + Web
Match question with syntactic
pattern  surface structure of
possible answer
 Search Web, parse, tag and match
results
 Use NER to identify answer

Example
Question: When did Amtrak begin
operations?
 pattern: When,did,NP,Verb,NP|PP
 target: ref(3), ref(4)->past, ref(5), in, NP
 target: in,NP,[,],ref(3),ref(4)->past,ref(5)
 answerType:
NP|PP

– Weights:
Date = 4, year = 3, other = 1
Web search

Queries:
– “Amtrak began operations in”
– “In” “Amtrak began operations”

Results:
– “Amtrak began operations in 1971”
– “… through 1971, when Amtrak began
operations in the state”

Matches:
– “1971”: 4
– “the state”: 1
Effectiveness
Algorithm           Accuracy
Rephrasing          0.20
Google fallback     0.395
QA without linguistic knowledge
Lexiclone: Approach
Find answer in a smaller and specific
database
 Extract a few predicative definitions
from the text summary
 Find in AQUAINT sentence with same
predicatives as those found
 Accuracy: 0.63 (second best score in
2003)

Lexiclone: Summary
“Most don’t live in the city, …”
 3 nouns, 3 verbs, 4 adjectives
 Build all triads:

– City live most
– Most live city
– City live in
– Most live in
–…
QA: role of Named Entities
Lexical Terms Extraction as input to IR

Questions approximated by sets of
unrelated words (lexical terms)
 Similar to bag-of-word IR models: but
choose nominal non-stop words and
verbs
Question (from TREC QA track)                                 Lexical terms
Q002: What was the monetary value of the Nobel Peace          monetary, value, Nobel, Peace, Prize
Prize in 1989?
Q003: What does the Peugeot company manufacture?              Peugeot, company, manufacture
Q004: How much did Mercury spend on advertising in 1993?      Mercury, spend, advertising, 1993
Extracting Answers for Factoid Questions: NER

In TREC 2003 the LCC QA system extracted 289
correct answers for factoid questions
 A Name Entity Recognizer was responsible for 234
of them
 Names are classified into classes matched to
questions
QUANTITY          55     ORGANIZATION    15     PRICE          3
NUMBER            45     AUTHORED WORK   11     SCIENCE NAME   2
DATE              35     PRODUCT         11     ACRONYM        1
PERSON            31     CONTINENT        5     ADDRESS        1
COUNTRY           21     PROVINCE         5     ALPHABET       1
OTHER LOCATIONS   19     QUOTE            5     URI            1
CITY              19     UNIVERSITY       3
NE-driven QA

The results of the past 5 TREC evaluations of
QA systems indicate that current state-of-the-art QA is largely based on the high-accuracy
recognition of Named Entities:
–
–
–
–
Precision of recognition
Coverage of name classes
Mapping into concept hierarchies
Participation into semantic relations (e.g.
predicate-argument structures or frame
semantics)
Not all problems are solved yet!

Where do lobsters like to live?
– on a Canadian airline

Where are zebras most likely found?
– near dumps
– in the dictionary

Why can't ostriches fly?
– Because of American economic sanctions

What’s the population of Mexico?
– Three

What can trigger an allergic reaction?
– ..something that can trigger an allergic reaction
References

AskMSR: Question Answering Using the Worldwide Web
– Michele Banko, Eric Brill, Susan Dumais, Jimmy Lin
– http://www.ai.mit.edu/people/jimmylin/publications/Banko-etal-AAAI02.pdf
– In Proceedings of 2002 AAAI SYMPOSIUM on Mining
Answers from Text and Knowledge Bases, March 2002

Web Question Answering: Is More Always Better?
– Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and
Andrew Ng
– http://research.microsoft.com/~sdumais/SIGIR2002-QASubmit-Conf.pdf

D. Ravichandran and E.H. Hovy. 2002.
Learning Surface Patterns for a Question Answering
System. ACL conference, July 2002.
References
S. Harabagiu, D. Moldovan, M. Paşca, R. Mihalcea, M. Surdeanu, R.
Bunescu, R. Gîrju, V.Rus and P. Morărescu. FALCON: Boosting
Knowledge for Answer Engines. The Ninth Text REtrieval
Conference (TREC 9), 2000.
Marius Pasca and Sanda Harabagiu, High Performance
Question/Answering, in Proceedings of the 24th Annual International
ACL SIGIR Conference on Research and Development in
Information Retrieval (SIGIR-2001), September 2001, New Orleans
LA, pages 366-374.
L. Hirschman, M. Light, E. Breck and J. Burger. Deep Read: A
Reading Comprehension System. In Proceedings of the 37th
Annual Meeting of the Association for Computational
Linguistics, 1999.
C. Kwok, O. Etzioni and D. Weld. Scaling Question Answering to
the Web. ACM Transactions in Information Systems, Vol 19, No.
3, July 2001, pages 242-262.
M. Light, G. Mann, E. Riloff and E. Breck. Analyses for Elucidating
Current Question Answering Technology. Journal of Natural
Language Engineering, Vol. 7, No. 4 (2001).
Dependency Parsing
Parsing in QA
Top systems in TREC 2005 perform
parsing of queries and answer
paragraphs
 Some use specially built parser
 Parsers are slow: ~ 1min/sentence

Constituent Parsing

Requires Grammar
– CFG, PCFG, Unification Grammar

Produces phrase structure parse tree
[Phrase-structure parse tree with S, VP, NP, PP and ADJP constituents for:]
Rolls-Royce Inc. said it expects its sales to remain steady
Dependency Tree
Word-word dependency relations
 Far easier to understand and to
annotate

Rolls-Royce Inc. said it expects its sales to remain steady
Parsing as classification
Inductive dependency parsing
 Parsing based on Shift/Reduce
actions
 Learn from annotated corpus which
action to perform

[At each step the classifier chooses among the Shift, Left and Right actions.]
Parser Actions
[Example configuration: the stack top and the next input token are marked in the sentence
“Ho/VER:aux visto/VER:pper una/DET ragazza/NOM con/PRE gli/DET occhiali/NOM ./POS”
(“I saw a girl with glasses.”)]
Parser State
The parser state is a quadruple
⟨S, I, T, A⟩, where
S is a stack of partially processed tokens
I is a list of (remaining) input tokens
T is a stack of temporary tokens
A is the arc relation for the dependency graph

(w, r, h) ∈ A represents an arc w → h,
tagged with dependency r
Parser Actions
Shift    ⟨S, n|I, T, A⟩       ⇒  ⟨n|S, I, T, A⟩
Right    ⟨s|S, n|I, T, A⟩     ⇒  ⟨S, n|I, T, A ∪ {(s, r, n)}⟩
Left     ⟨s|S, n|I, T, A⟩     ⇒  ⟨S, s|I, T, A ∪ {(n, r, s)}⟩
Learning Features
feature   value
W         word
L         lemma
P         part of speech (POS) tag
M         morphology: e.g. singular/plural
W<        word of the leftmost child node
L<        lemma of the leftmost child node
P<        POS tag of the leftmost child node, if present
M<        whether the leftmost child node is singular/plural
W>        word of the rightmost child node
L>        lemma of the rightmost child node
P>        POS tag of the rightmost child node, if present
M>        whether the rightmost child node is singular/plural
Learning Event
left context: Sosteneva/VER, che/PRO
target nodes: leggi/NOM (with child le/DET), Serbia/NOM; context: anti/ADV
right context: che/PRO, ,/PON, erano/VER, discusse/ADJ

Features extracted for this configuration:
(-3, W, che), (-3, P, PRO),
(-2, W, leggi), (-2, P, NOM), (-2, M, P), (-2, W<, le), (-2, P<, DET), (-2, M<, P),
(-1, W, anti), (-1, P, ADV),
(0, W, Serbia), (0, P, NOM), (0, M, S),
(+1, W, che), (+1, P, PRO), (+1, W>, erano), (+1, P>, VER), (+1, M>, P),
(+2, W, ,), (+2, P, PON)
Projectivity
A dependency tree is projective iff
every arc is projective
An arc w_i → w_k is projective iff
∀j with i < j < k (or i > j > k),
w_i →* w_j
Non Projective
Většinu těchto přístrojů lze take používat nejen jako fax , ale
Actions for non-projective arcs
Right2    ⟨s1|s2|S, n|I, T, A⟩       ⇒  ⟨s1|S, n|I, T, A ∪ {(s2, r, n)}⟩
Left2     ⟨s1|s2|S, n|I, T, A⟩       ⇒  ⟨s2|S, s1|I, T, A ∪ {(n, r, s2)}⟩
Right3    ⟨s1|s2|s3|S, n|I, T, A⟩    ⇒  ⟨s1|s2|S, n|I, T, A ∪ {(s3, r, n)}⟩
Left3     ⟨s1|s2|s3|S, n|I, T, A⟩    ⇒  ⟨s2|s3|S, s1|I, T, A ∪ {(n, r, s3)}⟩
Extract   ⟨s1|s2|S, n|I, T, A⟩       ⇒  ⟨n|s1|S, I, s2|T, A⟩
Insert    ⟨S, I, s1|T, A⟩            ⇒  ⟨s1|S, I, T, A⟩
CoNLL-X Shared Task Results
                   Maximum Entropy                             MBL
Language      LAS%   UAS%   LA%    Train s  Parse s   LAS%   UAS%   LA%    Train s  Parse s
Arabic        54.15  69.50  72.97    181      2.6     59.70  74.69  75.49     24      950
Bulgarian     72.90  85.24  77.68    452      1.5     79.17  85.92  83.22     88      353
Chinese       70.00  81.33  58.75   1156      1.8     72.17  83.08  75.55    540      478
Czech         62.10  73.44  69.84  13800     12.8     69.20  80.22  77.72    496    13500
Danish        71.72  78.84  74.65    386      3.2     76.13  83.65  82.06     52      627
Dutch         63.71  68.93  66.47    679      3.3     68.97  74.73  75.93    132      923
German        75.88  80.25  78.39   9315      4.3     79.79  84.31  86.88   1399     3756
Japanese      78.01  82.05  73.68    129      0.8     83.39  86.73  89.95     44       97
Portuguese    79.40  85.03  80.79   1044      4.9     80.97  86.78  85.27    160      670
Slovene       60.63  72.14  69.36     98      3.0     62.67  76.60  72.72     16      547
Spanish       70.33  74.25  82.19    204      2.4     74.37  79.70  85.23     54      769
Swedish       75.20  83.03  72.42   1424      2.9     74.85  83.73  77.81     96     1177
Turkish       48.83  65.25  49.81    177      2.3     47.58  65.25  59.65     43      727
Future Directions

Opinion Extraction
– Finding opinions (positive/negative)
– Blog track in TREC2006

Intent Analysis
– Determine author intent, such as:
problem (description, solution),
agreement (assent, dissent), preference
(likes, dislikes), statement (claim,
denial)
References
S. Chakrabarti, Mining the Web,
Morgan-Kaufmann, 2004.
 G. Attardi. Experiments with a
Multilanguage Non-projective
Dependency Parser, CoNLL-X, 2006.
 H. Yamada, Y. Matsumoto. Statistical
Dependency Analysis with Support
Vector Machines. In Proc. IWPT, 2003.
