Answering List Questions using Co-occurrence and Clustering
Majid Razmara and Leila Kosseim
Concordia University
m_razma@cs.concordia.ca
Introduction
• Question Answering
• TREC QA track
  – Question Series
  – Corpora
Target: American Girl dolls
FACTOID: In what year were American Girl dolls first introduced?
LIST: Name the historical dolls.
LIST: Which American Girl dolls have had TV movies made about them?
FACTOID: How much does an American Girl doll cost?
FACTOID: How many American Girl dolls have been sold?
FACTOID: What is the name of the American Girl store in New York?
FACTOID: What corporation owns the American Girl company?
OTHER: Other
Hypothesis
• Answer Instances:
  1. Have the same semantic entity class
  2. Co-occur within sentences, or
  3. Occur in different sentences sharing similar contexts

Based on the Distributional Hypothesis: "Words occurring in the same contexts tend to have similar meanings" [Harris, 1954].
Target 232: "Dulles Airport"
Question 232.6: "Which airlines use Dulles"
Ltw_Eng_20050712.0032 (AQUAINT-2)
United, which operates a hub at Dulles, has six luggage screening machines in its
basement and several upstairs in the ticket counter area.
Delta, Northwest, American, British Airways and KLM share four screening machines
in the basement.
Ltw_Eng_20060102.0106 (AQUAINT-2)
Independence said its last flight Thursday will leave White Plains, N.Y., bound for
Dulles Airport.
Flyi suffered from rising jet fuel costs and the aggressive response of competitors, led
by United and US Airways.
New York Times (Web)
Continental Airlines sued United Airlines and the committee that oversees operations
at Washington Dulles International Airport yesterday, contending that recently installed
baggage-sizing templates inhibited competition.
Wikipedia (Web)
At its peak of 600 flights daily, Independence, combined with service from JetBlue and
AirTran, briefly made Dulles the largest low-cost hub in the United States.
Our Approach
1. Create an initial candidate list
   – Answer Type Recognition
   – Document Retrieval
   – Candidate Answer Extraction
   – The list may also be imported from an external source (e.g. a Factoid QA system)
2. Extract co-occurrence information
3. Cluster candidates based on their co-occurrence
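A minimal end-to-end sketch of this three-step pipeline in Python. Every helper name below is an illustrative placeholder (the slides do not define an API); the individual stages are sketched on the following slides.

    # Skeleton of the three-step approach.  All helper names are illustrative
    # placeholders, not the system's actual components.

    def recognize_answer_type(question): ...            # Answer Type Recognition
    def retrieve_documents(target, question): ...       # Document Retrieval (source + domain)
    def extract_candidates(source_docs, answer_type): ...
    def extract_cooccurrence(candidates, domain_docs): ...
    def cluster_candidates(candidates, cooccurrence): ...
    def pinpoint_answer_cluster(clusters, target, question): ...

    def answer_list_question(target, question, imported=None):
        # 1. Create an initial candidate list.
        answer_type = recognize_answer_type(question)
        source_docs, domain_docs = retrieve_documents(target, question)
        candidates = extract_candidates(source_docs, answer_type)
        if imported:                 # optionally merge candidates from a factoid QA engine
            candidates += imported
        # 2. Extract co-occurrence information from the larger domain collection.
        cooccurrence = extract_cooccurrence(candidates, domain_docs)
        # 3. Cluster the candidates and return the cluster that answers the question.
        clusters = cluster_candidates(candidates, cooccurrence)
        return pinpoint_answer_cluster(clusters, target, question)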
Answer Type Recognition
• 9 Types:
  – Person, Country, Organization, Job, Movie, Nationality, City, State, and Other
• Lexical Patterns
  – ^ (Name | List | What | Which) (persons | people | men | women | players | contestants | artists | opponents | students) → PERSON
  – ^ (Name | List | What | Which) (countries | nations) → COUNTRY
• Syntagmatic Patterns for Other types
  – ^ (WDT | WP | VB | NN) (DT | JJ)* (NNS | NNP | NN | JJ | …)* (NNS | NNP | NN | NNPS) (VBN | VBD | VBZ | WP | $)
  – ^ (WDT | WP | VB | NN) (VBD | VBP) (DT | JJ | JJR | PRP$ | IN)* (NNS | NNP | NN | …)* (NNS | NNP | NN)
• Type Resolution
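A rough illustration of how the lexical patterns above could be applied with regular expressions. The two patterns are simplified transcriptions of the slide; the full pattern set and the POS-based syntagmatic patterns are not reproduced here.

    import re

    # Simplified transcriptions of two lexical patterns from the slide.
    LEXICAL_PATTERNS = [
        (re.compile(r"^(Name|List|What|Which)\s+(persons|people|men|women|players|"
                    r"contestants|artists|opponents|students)\b", re.IGNORECASE), "PERSON"),
        (re.compile(r"^(Name|List|What|Which)\s+(countries|nations)\b", re.IGNORECASE), "COUNTRY"),
    ]

    def lexical_answer_type(question):
        for pattern, answer_type in LEXICAL_PATTERNS:
            if pattern.match(question):
                return answer_type
        return None   # fall through to the syntagmatic patterns / type resolution

    print(lexical_answer_type("Which countries were affected by this earthquake?"))  # COUNTRY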
Type Resolution
• Resolves the answer subtype to one of the main types
  – "List previous conductors of the Boston Pops."
    Type: OTHER, Sub-type: Conductor → PERSON
• WordNet's Hypernym Hierarchy
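One possible realization of the type resolution step, using NLTK's WordNet interface (an assumption: the slides do not say which WordNet API was used). The subtype's hypernym chain is searched for a lemma that names one of the main types; the mapping table here is deliberately partial.

    # Requires NLTK with the WordNet corpus: pip install nltk; nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    # Partial mapping from WordNet lemmas to main types (OTHER is the fallback).
    MAIN_TYPE_LEMMAS = {
        "person": "PERSON",
        "organization": "ORGANIZATION",
        "occupation": "JOB",
        "movie": "MOVIE",
        "city": "CITY",
        "country": "COUNTRY",
    }

    def resolve_subtype(subtype):
        """Resolve an answer subtype (e.g. 'conductor') to a main type via hypernyms."""
        for synset in wn.synsets(subtype, pos=wn.NOUN):
            for hypernym in synset.closure(lambda s: s.hypernyms()):
                for lemma in hypernym.lemma_names():
                    if lemma in MAIN_TYPE_LEMMAS:
                        return MAIN_TYPE_LEMMAS[lemma]
        return "OTHER"

    print(resolve_subtype("conductor"))   # PERSON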
Document Retrieval
• Document Collection
  – Source Document Collection
    – few documents
    – used to extract candidates
  – Domain Document Collection
    – many documents
    – used to extract co-occurrence information
• Query Generation
  – Google query on the Web
  – Lucene query on the corpora
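The slides do not give the query templates, so the sketch below only illustrates the idea: build a Lucene-style boolean query in which the target terms are required and the question keywords are optional. The stopword list and the "+" (required-term) convention are my assumptions.

    import re

    STOPWORDS = {"the", "a", "an", "of", "which", "what", "who", "name", "list",
                 "is", "are", "was", "were", "this", "that", "by", "in", "to", "use"}

    def keywords(text):
        """Very rough content-word extraction from the target or question."""
        return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]

    def lucene_query(target, question):
        required = " ".join("+" + w for w in keywords(target))   # target terms must appear
        optional = " ".join(keywords(question))                  # question keywords boost ranking
        return (required + " " + optional).strip()

    print(lucene_query("Dulles Airport", "Which airlines use Dulles"))
    # +dulles +airport airlines dulles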
Candidate Answer Extraction
• Term Extraction
  – Extract all terms that conform to the expected answer type
  – Person, Organization, Job
    – intersection of several NE taggers (LingPipe, Stanford tagger & GATE NE), for better precision
  – Country, State, City, Nationality
    – gazetteer, for better precision
  – Movie, Other
    – capitalized and quoted terms
    – Verification of Movie: numHits(GoogleQuery intitle:Term site:www.imdb.com)
    – Verification of Other: numHits("SubType Term" OR "Term SubType") / numHits("Term")
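Two of the filters above, sketched with hypothetical helpers: the NE-tagger intersection assumes each tagger's output has already been normalized to a set of (entity, type) pairs, and num_hits(query) stands in for a search-engine hit-count lookup that the slide does not specify. The 0.01 threshold is purely illustrative.

    def intersect_named_entities(tagger_outputs):
        """Keep only entities proposed by every NE tagger (better precision).

        `tagger_outputs` is a list of sets of (entity, type) pairs, one per tagger
        (e.g. LingPipe, Stanford NER, GATE); running those tools is not shown here.
        """
        return set.intersection(*tagger_outputs)

    def verify_other_candidate(term, subtype, num_hits, threshold=0.01):
        """Web-count verification for OTHER candidates, as in the ratio above.

        `num_hits(query)` is a hypothetical wrapper around a search engine's
        hit-count API; `threshold` is an illustrative cut-off.
        """
        together = num_hits(f'"{subtype} {term}" OR "{term} {subtype}"')
        alone = num_hits(f'"{term}"')
        return alone > 0 and together / alone >= threshold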
Co-occurrence Information Extraction
• Domain collection documents are split into sentences
• Each sentence is checked as to whether it contains candidate answers
[Figure: candidate-by-sentence occurrence matrix built from the domain collection.]
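A sketch of the co-occurrence extraction: split every domain-collection document into sentences and count, for each pair of candidates, the number of sentences in which both appear. The regex sentence splitter and the substring match are simplifications.

    import re
    from collections import defaultdict
    from itertools import combinations

    def cooccurrence_counts(domain_documents, candidates):
        """Return pairwise co-occurrence counts, per-candidate counts, and #sentences."""
        pair_counts = defaultdict(int)       # (cand_a, cand_b) -> sentences containing both
        term_counts = defaultdict(int)       # cand -> sentences containing it
        n_sentences = 0
        for doc in domain_documents:
            for sentence in re.split(r"(?<=[.!?])\s+", doc):   # naive sentence splitter
                n_sentences += 1
                present = sorted(c for c in candidates if c.lower() in sentence.lower())
                for c in present:
                    term_counts[c] += 1
                for a, b in combinations(present, 2):
                    pair_counts[(a, b)] += 1
        return pair_counts, term_counts, n_sentences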
Hierarchical Agglomerative Clustering
• Steps:
  1. Put each candidate term t_i in a separate cluster C_i
  2. Compute the similarity between each pair of clusters using Average Linkage:
     \[ \mathrm{similarity}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{t_m \in C_i} \sum_{t_n \in C_j} \mathrm{similarity}(t_m, t_n) \]
  3. Merge the two clusters with the highest inter-cluster similarity
  4. Update all relations between this new cluster and the other clusters
  5. Go to step 3 until
     – there are only N clusters, or
     – the similarity is less than a threshold
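A compact sketch of the clustering loop above, parameterized by a pairwise term-similarity function sim (e.g. the χ² measure on the next slide) and by the two stopping criteria; function and parameter names are my own.

    from itertools import combinations

    def average_linkage(ci, cj, sim):
        """Average pairwise similarity between two clusters (lists of terms)."""
        return sum(sim(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

    def hac(terms, sim, min_clusters=1, threshold=0.0):
        clusters = [[t] for t in terms]                      # 1. one cluster per term
        while len(clusters) > min_clusters:                  # 5a. stop at min_clusters
            # 2. average-linkage similarity between every pair of clusters
            i, j = max(combinations(range(len(clusters)), 2),
                       key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], sim))
            if average_linkage(clusters[i], clusters[j], sim) < threshold:
                break                                        # 5b. similarity below threshold
            clusters[i].extend(clusters[j])                  # 3. merge the most similar pair
            del clusters[j]                                  # 4. similarities recomputed next pass
        return clusters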
The Similarity Measure
• Similarity between each pair of candidates
• Based on co-occurrence within sentences

             term_j        ¬term_j       Total
  term_i     O11           O12           O11 + O12
  ¬term_i    O21           O22           O21 + O22
  Total      O11 + O21     O12 + O22     N

• Using chi-square (χ²):
  \[ \chi^2 = \frac{N\,(O_{11}O_{22} - O_{12}O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})} \]
• Shortcoming: χ² does not work well with sparse data (see Future Work)
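A direct transcription of the χ² similarity over the 2×2 table above, plus (as my own assumption about how the counts are assembled) the mapping from the co-occurrence statistics of the earlier sketch to the four observed cells.

    def chi_square_similarity(o11, o12, o21, o22):
        """Chi-square over the 2x2 contingency table shown above.

        o11: sentences containing both term_i and term_j
        o12: sentences containing term_i but not term_j
        o21: sentences containing term_j but not term_i
        o22: sentences containing neither term
        """
        n = o11 + o12 + o21 + o22
        denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        if denominator == 0:
            return 0.0          # degenerate table (e.g. a term in every or in no sentence)
        return n * (o11 * o22 - o12 * o21) ** 2 / denominator

With the counts from the co-occurrence sketch, the cells would be o11 = pair_counts[(a, b)], o12 = term_counts[a] − o11, o21 = term_counts[b] − o11, and o22 = n_sentences − o11 − o12 − o21.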
Pinpointing the Right Cluster
• Question and target keywords are used as "spies"
• Spies are:
  – inserted into the list of candidate answers
  – treated as candidate answers, hence
    – their similarity to one another and to the candidate answers is computed
    – they are clustered along with the candidate answers
• The cluster containing the largest number of spies is returned
  – the spies are then removed
• Other approaches
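A sketch of the spy mechanism: the target and question keywords are clustered together with the candidates, the cluster attracting the most spies is selected, and the spies are then stripped out. cluster_fn stands for the clustering routine (e.g. the HAC sketch with the χ² similarity); the names are illustrative.

    def pinpoint_with_spies(candidates, spies, cluster_fn):
        """Return the members of the cluster that attracts the most spies."""
        clusters = cluster_fn(list(candidates) + list(spies))   # spies treated as candidates
        spy_set = set(spies)
        best = max(clusters, key=lambda c: sum(t in spy_set for t in c))
        return [t for t in best if t not in spy_set]            # remove the spies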
Target 269: Pakistan earthquakes of October 2005
Question 269.2: What countries were affected by this earthquake?
[Figure: example clustering output. The candidate terms (pakistan, 2005, oman, afghanistan, octob, u.s, india, affect, earthquak, spain, bangladesh, japan, germany, haiti, nepal, china, sweden, iran, mexico, vietnam, belgium, lebanon, iraq, russia, turkey) are grouped into Cluster-2, Cluster-9, and Cluster-31; the returned cluster scores Recall = 2/3, Precision = 2/3, F-score = 2/3.]
Results in TREC 2007
[Figure: bar chart of TREC 2007 list-question results (F-measure) for all participating runs.]
– Best: 0.479
– Median: 0.085
– Worst: 0.000
– Our run: F = 0.145
Evaluation of Clustering
• Baseline
  – List of candidate answers prior to clustering
• Our Approach
  – List of candidate answers filtered by the clustering
• Theoretical Maximum
  – The best possible output of clustering based on the initial list

  Corpus            Questions   System            Precision   Recall   F-score
  TREC 2004-2006    237         Baseline          0.064       0.407    0.098
                                Our Approach      0.141       0.287    0.154
                                Theoretical Max   1.000       0.407    0.472
  TREC 2007         85          Baseline          0.075       0.388    0.106
                                Our Approach      0.165       0.248    0.163
                                Theoretical Max   1.000       0.388    0.485
Evaluation of each Question Type
[Figure: pie chart of the percentage of each question type in the training set (Other, Organization, Job, Person, Movie, State, Country, City, Nationality; shares range from 1% to 36%).]
[Figure: bar chart of the F-score of each question type in the training and test sets (Person, Other, Country, State, Organization, Job, Movie, Nationality, City; F-scores between 0.000 and 0.700).]
Future Work
• Developing a module that verifies whether each candidate is a member of the answer type
  – in the case of the Movie and Other types
• Using co-occurrence at the paragraph level rather than the sentence level
  – anaphora resolution can be used
• Another method for the similarity measure
  – χ² does not work well with sparse data
  – for example, using Yates' correction for continuity (Yates' χ²), given below for reference
• Using different clustering approaches
• Using different similarity measures
  – Mutual Information
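For reference, Yates' continuity correction on the same 2×2 table (standard textbook form, not taken from the slides) replaces the χ² statistic with:

\[ \chi^2_{\text{Yates}} = \frac{N\left(\left|O_{11}O_{22} - O_{12}O_{21}\right| - \tfrac{N}{2}\right)^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})} \]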
Questions?