Answering Questions Requiring Cross-passage Evidence

Kisuh Ahn
Dept. of Linguistics
Hankuk University of Foreign Studies
San 89 Wangsan-li, Mohyeon-myeon
Yongin-si, Gyeonggi-do, Korea
kisuhahn@gmail.com

Hee-Rahk Chae
Dept. of Linguistics
Hankuk University of Foreign Studies
San 89 Wangsan-li, Mohyeon-myeon
Yongin-si, Gyeonggi-do, Korea
hrchae@hufs.ac.kr
Abstract
This paper presents methods for answering what we call Cross-passage Evidence Questions. These questions require multiple scattered passages, each bearing different and partial evidence for the answer. This poses special challenges to textual QA systems that employ information retrieval in the “conventional” way, because the ensuing Answer Extraction operation assumes that one of the passages retrieved would, by itself, contain sufficient evidence to recognize and extract the answer. One method that may overcome this problem is factoring a Cross-passage Evidence Question into constituent sub-questions and joining the respective answers. The first goal of this paper is to develop this method and put it to the test to see how effective it can be. We then introduce another method, Direct Answer Retrieval, which relies on extensive pre-processing to collect the different pieces of evidence for a possible answer off-line. We conclude that the latter method is superior both in the correctness of the answers and in the overall efficiency with which it deals with Cross-passage Evidence Questions.
1 Distinguishing Questions Based on Evidence Locality
Textual factoid Question Answering depends on the
existence of at least one passage or text span in the
corpus that can serve as sufficient evidence for the
question. A single piece of evidence may suffice to
answer a question, or more than a single piece may
be needed. By “a piece of evidence”, we mean a
snippet of continuous text, or passage, that supports
or justifies an answer to the question posed. More
practically, in factoid QA, a piece of evidence is a
text span with two properties: (1) An Information
Retrieval (IR) procedure can recognise it as relevant
to the question and (2) an automated Answer Extraction (AE) procedure can extract from it an answer-bearing expression (aka an answer candidate).
With respect to a given corpus, we call questions
with the following property Single Passage Evidence
Questions or SEQs:
A question Q is a SEQ if evidence E sufficient to select A as an answer to Q can
be found in the same text snippet as A.
In contrast, we call a question that requires multiple
different pieces of evidence (in multiple text spans
with respect to a corpus) a Cross-passage Evidence
Question or CEQ:
A question Q is a CEQ if the set of evidence E1, ..., En needed to justify A as
an answer to Q cannot be found in a single text snippet containing A, but only in
a set of such snippets.
For example, consider the following question:
Which Sub-Saharan country had hosted
the World Cup?
If the evidence for the country being located south of the Sahara Desert and the evidence for this same country having hosted the World Cup are not contained in the same passage/sentence, but are found
in two distinct passages, the question would be a
Cross-passage Evidence Question. This distinction
between SEQs and CEQs lies only in the locality of
evidence within a corpus. It does not imply that the
corpus contains only one piece of text sufficient for
a SEQ: Often there are multiple text snippets, each
with sufficient evidence for the answer. Such redundancy is exploited by many question answering systems to rank the confidence of an answer candidate
(e.g. (Brill et al., 2002)), but the evidence is redundant rather than qualitatively different.
Now, as opposed to Single-passage Evidence Questions, which have been the usual TREC-type questions (White and Sutcliffe, 2004), Cross-passage Evidence Questions pose special challenges to textual QA systems that employ information retrieval in the “conventional” way. Most textual QA systems use Information Retrieval as a document/passage pre-fetch. The ensuing Answer Extraction operation assumes that one of the passages
retrieved would, by itself, contain sufficient evidence to recognize and extract the answer. Thus the
reliance on a particular passage to answer a question renders the task of question answering essentially a local operation with respect to the corpus as
a whole. Whatever else is expressed in the corpus
about an entity being questioned is ignored, or used
(in the case of repeated evidence instances of the
same answer candidate) only to increase confidence
in particular answer candidates. This means that factoid questions whose correct answer depends jointly
on textual evidence located in different places in the
corpus cannot be answered. We call this the locality
constraint of factoid QA. Thus special methods are
needed to overcome this locality constraint in order
to handle CEQs successfully. In the following sections, we explore two methods for answering CEQs: the first based on Question Factoring for conventional IR-based QA systems, and the second based on what we call the Direct Answer Retrieval method, which replaces conventional IR.
2 Solving CEQs by Question Factoring

While whether a question is a CEQ or not depends entirely on the corpus, it can be guessed that the more syntactically complex a question is, the more likely it is to be a CEQ, given that a complex question will have more terms and relations that need to be satisfied. For example, the above question is of the form “What/Which <NBAR> <VP>?”, as in Which [NBAR Sub-Saharan country] [VP had hosted the World Cup]?, and has at least two predicates/constraints that must be established: the one or more conveyed by the NBAR, and the one or more conveyed by the VP. These multiple restrictions might call for different pieces of evidence, depending on the particular corpus from which the answer is to be found.

In database QA, CEQs correspond to queries that involve joins (usually along with selection and projection operations). The database equivalent of the aforementioned question about the World Cup-hosting country might involve joining one relation linking country names with the requisite location, and another linking the names of countries with their World Cup hosting history. Note that this involves breaking up the original query into a set of sub-queries, each answerable through a single relation. Answering a CEQ through sub-queries therefore involves the fusion of answers to different questions.

Analogously, we apply this method of joining sub-queries for databases to the task of textual QA to deal with CEQs. The solution we explore here can be adopted by any existing system with the conventional Information (Passage) Retrieval and Answer Extraction (IR+AE) pipeline architecture. It involves:

1. Dividing a CEQ into sub-questions SQ1, ..., SQn, each of which is a simple question about the same question variable.¹

2. Finding the answer candidate(s) for each sub-question,

3. Selecting the best overall answer from the answer candidates for the sub-questions.

¹ We have only considered sub-questions that are naturally joined on the question variable. Extending to the equivalent of joins on several variables would require an even more complex strategy for handling the answer candidate sets than the one we describe in Section 2.2.

2.1 Dividing a CEQ into sub-questions

Decomposing a CEQ into simpler sub-questions about the question variable involves:
• Resolving all co-referring expressions within
the question;
• If the WH-phrase of the question is a complex
phrase, making a sub-question asking the identity of its head noun phrase (e.g. “Which northern country is ...” → “What is a northern country?”);
• Breaking up the question at clause boundaries
(including relative clauses);
• Within a single clause, if there is a conjoined
set of restrictors (e.g. “ ... German physician,
theologian, missionary, musician and philosopher ..”), copying the clause as many times as
the number of restrictors, so that each clause
now contains only one restrictor;
• Finally, reformulating all the individual clauses
into questions.
Some examples of CEQs which have been factored into sub-questions are as follows (all evaluation questions come from the pub-quiz site http://www.funtrivia.com):
• Which French-American prolific writer was
a prisoner and survivor of the infamous
Auschwitz German concentration camp, Chairman of the U.S. President’s Commission on
the Holocaust, a powerful advocate for human
rights and a recipient of the Nobel Peace Prize?
1. Who was a French-American prolific
writer?
2. Who was a prisoner and survivor of the infamous Auschwitz German concentration
camp?
3. Who was a Chairman of the U.S. President’s Commission on the Holocaust?
4. Who was a powerful advocate for human
rights?
5. Who was a recipient of the Nobel Peace
Prize?
It should be clear that factoring a CEQ into a set of sub-questions is often tricky but doable. The intra-sentential anaphora resolution can be less than straightforward, and it is often a matter of choice how to break a question into sub-questions, into how many, and at what boundaries. In the evaluation reported in Section 2.3, the CEQs were manually broken into sub-questions, based on the procedure outlined above, since our focus was on evaluating the viability of the overall method rather than individual components. Our results may thus serve as an upper bound on what a fully automated procedure would produce. (For work related to the automatic decomposition of complex questions in order to deal with temporal questions, see (Saquete et al., 2004).)
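For illustration only (the evaluation below relies on manual factoring, not on this code), a deliberately minimal sketch that handles just the conjoined-restrictor step of the procedure above might look like this; the question shown is a simplified variant of one of the evaluation questions:

```python
import re

def split_restrictors(question):
    """Toy factoring for one pattern only: a copular question whose body is
    a comma/'and'-conjoined list of restrictors. Everything harder (relative
    clauses, coreference, head-noun sub-questions) is out of scope here."""
    match = re.match(r"(Who|What|Which \w+) (was|is) (.+)\?$", question.strip())
    if not match:
        return [question]          # pattern not recognised; leave the question intact
    wh, copula, body = match.groups()
    restrictors = re.split(r",\s*|\s+and\s+", body)
    return [f"{wh} {copula} {r.strip()}?" for r in restrictors if r.strip()]

print(split_restrictors(
    "Who was a German physician, theologian, missionary, musician and philosopher?"))
# -> ['Who was a German physician?', 'Who was theologian?',
#     'Who was missionary?', 'Who was musician?', 'Who was philosopher?']
```

Even here the output loses the article (“Who was theologian?”), and a restrictor such as “a prisoner and survivor of ...” would be split incorrectly on “and”, which illustrates why a reliable automatic factoring module is not trivial to build.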
2.2 Selecting an Answer to the CEQ from Sub-question Answer Candidates
From the answer candidates to the sub-questions, an
answer must be derived for the CEQ as a whole. The
most obvious way would be to pick the answer candidate that is the answer to the largest number of
sub-questions. So if a CEQ had three sub-questions,
in the ideal case there would be one answer candidate that would be the answer to all of these sub-questions at the same time and thus be the most
likely answer to the CEQ. This is like a simple voting scheme, where each sub-question votes for an
answer candidate, and the candidate with the most
votes wins.
Complications from this ideal situation arise in
two ways: First, a sub-question can be very general because it results from separating away other
restrictions from the question. Hence, the answer to
a sub-question must be regarded instead as a set of
answers as in list questions, several of which may be
correct rather than being one correct answer to that
sub-question. In other words, a sub-question can vote
for multiple candidates rather than just one.
Second, simply maximising overlap (i.e. the number of sub-questions to which an answer candidate
is the answer, which we call simply votes from now
on) ignores the possibility that multiple answers can
tie for the same maximum votes, especially if this
number is small compared to the number of sub-questions. Thus, there is a need for a method to rank answers with the same number of votes.
In summary, a sub-question can vote for multiple candidates, and more than one candidate can receive the same largest number of votes from the sub-questions. This means we need an additional way to
break the ties among the candidates by ranking the
candidates according to some other criteria. The answer selection method we have chosen is described
below.
Let’s assume that a question has been factored
into N sub-questions, and each sub-question is answered with a (possibly empty) set of answer candidates. So for the set of N sub-questions for the original question, there are N answer candidate sets. The
most likely answer would be the answer candidate
shared by all the N sub-questions (i.e. the answer
candidate present in all the N sets of answer candidates). To see if there is such a common answer
candidate, these N sets of answer candidates are first
intersected (via generalized intersection).
If the intersection of these N sets is empty (i.e., there is no single answer candidate that all the sub-questions share), then it must be investigated whether there is a common answer candidate for N-1 sets of answer candidates (i.e. an answer shared by N-1 sub-questions). There will be N cases to consider, since there are N ways to choose N-1 of the sub-question
sets to intersect. If all these are empty, then all subsets of size N-2 are considered, and so on, until a
non-empty intersection is obtained. This means considering the power set of the original set of answer
candidate sets.
This process may result in one answer candidate
or several with the same maximum number of votes.
If there is only one, this is chosen as the answer.
Otherwise, there is a need for a further way to rank
these answer candidates to produce the most likely one as the answer. Specifically, if there is more than one non-empty intersection with the same maximum votes, we do the expected thing of taking into account the original ranking of answers to the sub-questions (in most QA systems, the answer candidates are ranked according to their plausibility). If
an answer is found to be the second ranked answer to
one sub-question and the sixth ranked answer to another, its overall rank is taken to be the simple mean
of the ranks. So in this case, the answer would have
the rank four overall. This algorithm can be stated more formally as follows (it will be easier to understand if considered together with the example that follows):

• Step 1: Let S be the set of the sets of answers, A1, ..., An, to the sub-questions Q1, ..., Qn respectively.

• Step 2: Produce the power set of S, i.e. P = POW(S).

• Step 3: Produce a set of ordered pairs V such that V = {⟨o, R⟩ | R ⊂ P ∧ ∀x ∈ R. |x| = o, for every distinct o = |y| of every y ∈ P}.

• Step 4: Pick the ordered pair L = ⟨o, R⟩ ∈ V with the largest o among the members of V, and do:

1. Produce T from R such that T = ⋃ {⋂ t | t ∈ R}. (Note that each t ∈ R is itself a set of answer candidate sets.)

2. If T is an empty set, repeat Step 4 for the ordered pair with the next largest o.

3. Else if T has a unique member, then pick that unique member as the answer and terminate.

4. Otherwise, go to the next step.

• Step 5: For each member x ∈ T, get its ranks r1, ..., rn in the sub-questions Q1, ..., Qn in which it occurs, and compute the mean M of those ranks.

• Step 6: Pick the member of T with the lowest M score as the answer.
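For illustration, the selection procedure can be sketched as follows; the sketch assumes each sub-question returns a ranked candidate list (rank 1 first) and enumerates size-k subsets of sub-questions directly instead of materialising the full power set POW(S):

```python
from itertools import combinations

def select_answer(ranked_candidate_lists):
    """Select an overall answer from per-sub-question candidate lists.

    ranked_candidate_lists: one list per sub-question, ordered from the
    most to the least plausible candidate (rank 1 first).
    Returns the chosen answer, or None if every candidate list is empty.
    """
    n = len(ranked_candidate_lists)
    answer_sets = [set(lst) for lst in ranked_candidate_lists]

    def mean_rank(candidate):
        # Mean of the candidate's 1-based ranks over the sub-questions
        # that actually proposed it (lower is better).
        ranks = [lst.index(candidate) + 1
                 for lst in ranked_candidate_lists if candidate in lst]
        return sum(ranks) / len(ranks)

    # Largest subsets of sub-questions first: size n, then n-1, and so on.
    for size in range(n, 0, -1):
        tied = set()
        for subset in combinations(answer_sets, size):
            # T is the union of the intersections of all size-`size` subsets.
            tied |= set.intersection(*subset)
        if tied:
            # Whether T has one member or several, the lowest mean rank wins.
            return min(tied, key=mean_rank)
    return None
```

Walking from size n downwards mirrors Steps 2–4, and the mean-rank key implements Steps 5–6.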
To illustrate the steps described above, consider a CEQ that can be split into three sub-questions Q1, Q2 and Q3. Then:

• Step 1: Assume that the sets A1, A2 and A3 are the answer candidate sets for sub-questions Q1, Q2 and Q3 respectively:
A1 = {CLINTON, BUSH, REAGAN}
A2 = {MAJOR, REAGAN, THATCHER}
A3 = {FORD, THATCHER, NIXON}
Let S = {A1, A2, A3}.

• Step 2: P = POW(S) = {{A1,A2,A3}, {A1,A2}, {A2,A3}, {A1,A3}, {A1}, {A2}, {A3}, {}}

• Step 3: V = {⟨3, {{A1,A2,A3}}⟩, ⟨2, {{A1,A2}, {A2,A3}, {A1,A3}}⟩, ⟨1, {{A1}, {A2}, {A3}}⟩, ⟨0, {{}}⟩}
• Step 4: First pick L = ⟨o, R⟩ = ⟨3, {{A1, A2, A3}}⟩, based on the largest o in consideration (i.e. 3). Then get T such that T = ⋃ {⋂ t | t ∈ R} = A1 ∩ A2 ∩ A3 = {}. Since T is an empty set, no answer can be picked, so repeat this step for the ordered pair with the next largest o.

• Step 4 (again): The second pick is L = ⟨o, R⟩ = ⟨2, {{A1, A2}, {A2, A3}, {A1, A3}}⟩. Now T = (A1 ∩ A2) ∪ (A2 ∩ A3) ∪ (A1 ∩ A3) = {REAGAN, THATCHER}. Since T is non-empty but does not have a unique member, go to the next step.

• Step 5: REAGAN is the 4th candidate for Q1 and the 10th for Q2 – mean rank 7. THATCHER is the 1st candidate for Q2 and the 5th for Q3 – mean rank 3.

• Step 6: Take the candidate with the best (lowest) mean rank as the answer – here, THATCHER.
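Feeding the sketch from Section 2.2 ranked lists constructed to reproduce the ranks quoted in Step 5 (the filler names and list lengths below are invented for illustration only) yields the same selection:

```python
# Hypothetical ranked lists, padded with filler names only so that REAGAN is
# 4th in Q1 and 10th in Q2, and THATCHER is 1st in Q2 and 5th in Q3.
q1 = ["BUSH", "CLINTON", "FORD", "REAGAN"]                         # REAGAN 4th
q2 = ["THATCHER"] + [f"FILLER{i}" for i in range(8)] + ["REAGAN"]  # 1st and 10th
q3 = ["MAJOR", "NIXON", "BLAIR", "HEATH", "THATCHER"]              # THATCHER 5th

print(select_answer([q1, q2, q3]))  # -> THATCHER (mean rank 3 beats REAGAN's 7)
```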
2.3 Evaluation of the Sub-question Method
The evaluation has two purposes. The first is to see
how well the sub-question method works with respect to CEQs. The second is to provide baseline results for the performance of the Direct Answer Retrieval method introduced in the next section.
For the evaluation of this strategy, we chose 41 “quiz” questions (i.e. ones designed to challenge and amuse people rather than to meet some real information need, much as the IBM Watson system does for Jeopardy questions (Ferrucci et al., 2012)) as likely Cross-passage Evidence Questions (CEQs) with respect to the knowledge-base corpus, the AQUAINT corpus in this evaluation. They were filtered based on the following criteria:
• Did the answer appear in the AQUAINT corpus?
• Was the answer a proper name?
• Was it a difficult question? (Questions from the original site have been manually marked for difficulty.)

• Did the question contain multiple restrictions of the entity in question?

The questions, varying in length from 20 to 80 words (long compared to TREC questions, whose mean length is less than 10 words), include:
What was the name of the German physician, theologian, missionary, musician
and philosopher who was awarded the
Nobel Peace Prize in 1952?
Which French-American prolific writer
was a prisoner and survivor of the infamous Auschwitz German concentration
camp, Chairman of the U.S. President’s
Commission on the Holocaust, a powerful
advocate for human rights and a recipient
of the Nobel Peace Prize?
He was born in 1950 in California. He
dropped out of the University of California at Berkeley and quit his job with
Hewlett-Packard to co-found a company.
He once built a “Blue Box” phone attachment that allowed him to make long-distance phone calls for free. Who is he?
The QA system we used to test this method was developed previously and had shown good performance on TREC QA questions (Ahn et al., 2005). In order to accommodate this method, a question factoring module must be added. But as we have previously mentioned, question factoring has been done manually offline, as we have not yet implemented a fully automatic module for it. The joining of answers to sub-questions is done automatically, however, using a specially added post-processing module. Thus, this QA system is a simulation of a fully automatic CEQ-handling QA system.
The evaluation procedure ran as follows. First, a CEQ was factored into a set of sub-questions manually. Then each sub-question was fed into the QA engine to produce a set of 100 answer candidates. The resulting sets of answers were assessed by an Answer Selection/Ranking module that uses the algorithm described in Section 2.2 to produce a final set of ranked answer candidates.
QID   SQ   MaxV   ACI   Rank
1     2    1      0     0
3     4    1      0     0
6     3    1      0     0
7     4    1      0     0
8     3    1      0     0
9     5    2      6     3
10    4    1      0     0
12    3    1      0     0
13    2    2      2     1
16    3    1      0     0
17    2    2      1     1
18    4    1      0     0
20    2    2      4     1
21    3    2      7     1
22    2    1      0     0
23    3    3      1     1
24    2    1      0     0
25    3    3      1     1
26    5    3      1     1
27    2    2      23    1
28    2    1      0     0
29    2    2      86    1
30    3    3      3     1
31    2    2      9     5
32    2    2      1     1
36    9    1      0     0
37    2    1      0     0
38    8    1      0     0
40    9    3      1     1

Table 1: Questions with answer candidates identified for ≥1 sub-questions.
2.4 Results and Observations
Among the 41 questions, 12 were found to have no correct answer candidates by the QA engine, and so these questions are ignored in the rest of the analysis. For the remaining 29 questions, Table 1 shows the value of ranking (i.e., Step 5 above) when more than one answer candidate shares the
same largest number of votes. Such answer candidates need to be ranked in order to pick the best answer. Here, QID is the ID of a question that had at least one sub-question with answers, and SQ tells how many sub-questions that question was factored into.
In Table 1, column MaxV indicates the largest
number of votes received by an answer candidate
across the set of sub-questions. Column ACI indicates the number of answer candidates with this number of votes. For example, CEQ 9 has 5 sub-questions. Its MaxV of 2 indicates that only the intersection of the answer candidate sets for two sub-questions (out of a maximum of 5) produced a non-empty set (of 6 members, according to ACI). These 6 members are the final answer candidates that will be ranked.
The column labelled Rank indicates the ranking
of final answer candidates by mean ranking with respect to the sub-questions they answer and identifies
where the correct answer lies within that ranking.
So for Question 9, the correct answer was third in
the final ranking. Fifteen questions had no answer
candidates common to any set of sub-questions (i.e.,
ACI=0). Of the 14 remaining questions, six had
only a single answer candidate, so ranking was not
relevant. (That answer was correct in all 6 cases.) Of
the final eight questions with more than one final answer candidate, Table 1 shows that ranking them according to
the mean of their original rankings led to the correct answer being ranked first in six of them. In the
most extreme case, starting with 86 ties between answers for two sub-questions (each of which has a set of 100 answers), ranking reduces this to the correct
answer being ranked first. Table 1 also shows the
importance of the proportion of sub-questions that
contain the correct overall answer. For CEQs with
2 sub-questions, in every case, both sub-questions
needed to have contained the correct overall answer
in order for that CEQ to have been answered correctly. (If only one sub-question contains the correct overall answer, our algorithm has not selected
the correct overall answer.) For CEQs with 3 subquestions, at least two of the 3 sub-questions need to
have contained the correct overall answers in order
for the CEQ to be answered correctly by this strategy. This seems to be less the case as the number of
sub-questions increases. However, the strategy does
seem to require at least two sub-questions to agree
on a correct overall answer in order for a CEQ to be
correctly answered. It is possible that another sub-question ranking strategy could do better.
In sum, starting with 41 questions, the sub-question method found 14 with at least one “intersective” answer candidate. Of those 14, the top-ranked candidate (or, in 6 cases, the only “intersective” candidate) was the correct answer in 12 cases. Hence the sub-question method succeeded in 12 of the 41 cases that would be totally lost without it. Table 2 shows the final scores with respect to A@N, accuracy, MRR and ARC.

A@N   SQ-Combined
1     0.317:13
2     0.317:13
3     0.341:14
4     0.366:15
5     0.366:15
6     0.366:15
7     0.366:15
8     0.366:15
9     0.366:15
10    0.366:15
15    0.366:15
20    0.366:15
ACC   0.317
MRR   0.341
ARC   1.333

Table 2: Results for the Sub-question Method
2.5 Discussion
The evaluation shows that the method based on sub-question decomposition and ranking can be used for answering Cross-passage Evidence Questions, but we also need to consider the cost of adopting it as a practical method for doing real-time Question Answering; clearly, multiplying the number of questions to be answered by decomposing a question into sub-questions can make the overall task even more resource-intensive than it already is. Pre-caching answers to simple, frequent questions, as in (Chu-Carroll et al., 2002), which reduces them to database look-up, may help in some cases. Another compatible strategy would be to weight the sub-
questions, so as to place more emphasis on more
important ones or to first consider ones that can be
answered more quickly (as in database query optimisation). This would avoid, or at least postpone,
very general sub-questions such as “Who was born
in 1950?”, which are like list questions, with a large
number of candidate answers and thus expensive to
process. The QA system would do well to process more specific sub-questions first, learning the appropriate weights as in (Chali and Joty, 2008). The second issue is the need for a method that can reliably factor a CEQ into simple sub-questions automatically, which would not be trivial to develop. Direct Answer Retrieval for QA offers an alternative that does not require factoring a question in the first place; it is the subject of the next section.
3 Direct Answer Retrieval for QA
The answers to many factoid questions are named
entities. For example, “Who is the president of
France?” has as its answer a name referring to a
certain individual. The basic idea of Direct Answer
Retrieval for Question Answering is to extract these
kinds of expressions off-line from a textual corpus
as potential answers and process them in such a
way that they can be directly retrieved as answers
to questions. Thus, the primary goal of Direct Answer Retrieval is to turn factoid Question Answering
into fine-grained Information Retrieval, where answer candidates are directly retrieved instead of documents/passages. For simple named entity answers,
we have previously shown that this can make for fast
and accurate retrieval (Ahn and Webber, 2007).
In addition, as the process gathers and collates all
the relevant textual evidence for a possible answer
from all over the corpus, it becomes possible to answer a question based on all the evidence available
in the corpus regardless of its locality. Whether this is indeed so is what we put to the test here.
3.1 The Off-line Processing
The supporting information for a potential answer (named entity), for the present research, is the set of all text snippets (sentences) that mention it. With the set of all sentences that mention a particular potential answer put into one file, this can literally be regarded as a document on its own, with the answer name as its title. The collection of such documents can be regarded as an index of answers, and the retrieval of documents as answer retrieval (hence the name Direct Answer Retrieval).

The process of generating the collection of answer documents for all the potential answers runs as follows. First, the whole base corpus is run through POS tagging, chunking and named-entity recognition. Then, each sentence in each document is examined as to whether it contains at least one named entity. If so, it is checked whether this named entity represents an answer candidate that has already been identified and stored in the repository. If so, the sentence is appended to the corresponding answer document in the collection. If not, a new answer entity is instantiated and added to the repository, and a new corresponding answer document is created with this sentence. If more than one answer candidate is identified in a sentence, the same process is applied to every one of them.
In order to facilitate the retrieval of answer documents, this answer document collection is itself indexed using a standard document indexing technique. At the same time, for each answer entity, its corresponding named entity type is stored in a separate answer repository database (using external resources such as YAGO (Suchanek et al., 2007), it is possible to look up very fine-grained entity type information, which we did in our evaluation system). The answer index together with this repository database makes up the knowledge base for the Direct Answer Retrieval QA system.
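For illustration, the grouping at the heart of this off-line step can be sketched as follows, assuming the tagging pipeline, the document indexer and the type repository exist elsewhere:

```python
from collections import defaultdict

def build_answer_documents(tagged_sentences):
    """tagged_sentences: iterable of (sentence, entity_names) pairs, assumed
    to come from an upstream POS-tagging / chunking / NER pipeline.
    Returns one 'answer document' per potential answer: a mapping from an
    entity name to every corpus sentence that mentions it."""
    answer_docs = defaultdict(list)
    for sentence, entities in tagged_sentences:
        for entity in entities:
            answer_docs[entity].append(sentence)
    return answer_docs

# Each answer_docs[entity] can then be written out as a document titled with
# the entity name, indexed with a standard IR engine, and paired with the
# entity's (fine-grained) type in the separate answer repository.
```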
3.2 The On-line Processing
With the knowledge base built off-line, the actual on-line QA process runs through several steps. The first operation is answer type identification. For this, in our evaluation system, the question is parsed and a simple rule-based algorithm is used that looks at the WH-word (e.g. “Where” means location), the head noun of a WH-phrase with “Which” or “What” (e.g. “Which president” means the answer type is president), and, if the main verb is a copula, the head of the post-copula noun phrase (e.g. for “Who is the president ...”, again “president” is the answer type). The identified answer type is resolved to one of the base named entity types (PERSON, LOCATION, ORGANIZATION and OTHER) using an ontology database such as WordNet when the type is derived from the head noun of the WH-phrase or the noun after the copula and has no corresponding entity type in the answer repository. The next operation is the retrieval of answers as answer candidates for the given question. This involves formulating a query and retrieving answer documents as answer candidates. If the index has been partitioned into sub-indices based on NER type, as we have done, an appropriate sub-index can be chosen based on the identified answer type, thereby making the search more efficient.
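For illustration only, a crude, parse-free stand-in for these answer-type rules might look as follows; the evaluated system itself works on a syntactic parse to find head nouns and resolves fine-grained types against the repository:

```python
import re

def identify_answer_type(question):
    """Rough rule-of-thumb version of the answer-type rules described above.
    It inspects only the surface string: the WH-word, the word right after
    "Which"/"What" (a crude stand-in for the NBAR head), or the noun after
    a copula. A real implementation would work on a parse tree."""
    q = question.strip()
    if q.lower().startswith("where"):
        return "location"
    m = re.match(r"(?:which|what)\s+([a-z][\w-]*)", q, re.IGNORECASE)
    if m:
        return m.group(1).lower()   # e.g. "Which president ..." -> "president"
    m = re.match(r"who\s+(?:is|was)\s+the\s+([a-z][\w-]*)", q, re.IGNORECASE)
    if m:
        return m.group(1).lower()   # e.g. "Who is the president ..." -> "president"
    if q.lower().startswith("who"):
        return "person"
    return "other"
```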
In the actual retrieval operation, there can be different ways to score an answer candidate with respect to a query depending on the model of retrieval
used. The model of retrieval implemented for the
evaluation system is an adaptation of the document
inference network model for information retrieval
(Turtle, 1991). For a more thorough description of
the answer retrieval model, please refer to our previous work (Ahn and Webber, 2008).
When the search is performed and a ranked list
of answers is retrieved, this ranked list is then run
through the following procedures:
1. Filter the retrieved list of answers to remove any named entity that was mentioned in the question itself, if any.

2. Re-rank with respect to answer type, preferring answers that match the answer type precisely.

3. Pick the highest-ranking answer as the answer to the question.

The first procedure is heuristically necessary because, for example, “Germany” in “Which country has a common border with Germany?” can pop up as an answer candidate from the retrieval. From the remaining items in the list, each item is looked up with respect to its answer type using the answer-type table in the answer repository. The re-ranking is performed according to the following rules:
• Answers whose type precisely matches the answer type are ranked higher than any other answers whose types do not precisely match the
answer type.
• Answers whose type does not precisely match the answer type, but does match the base type traced from the answer type, are ranked higher than any other answers whose types do not match the answer type at all.
Now using these rules, the answer which is ranked
the highest is picked as the number one answer candidate.
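For illustration, this re-ranking can be sketched as follows, assuming the entity-to-type and type-to-base-type lookups come from the answer repository and the ontology step described above:

```python
def rerank_by_type(candidates, answer_type, base_type, type_of, base_of):
    """candidates: entity names in their original retrieval order.
    type_of: answer-repository lookup, entity -> fine-grained NE type.
    base_of: fine-grained type -> base type (PERSON, LOCATION, ...).
    Tier 0: exact match of the answer type; tier 1: base-type match only;
    tier 2: everything else. sorted() is stable, so the original retrieval
    order is preserved within each tier."""
    def tier(entity):
        etype = type_of.get(entity)
        if etype == answer_type:
            return 0
        if etype is not None and base_of.get(etype) == base_type:
            return 1
        return 2
    return sorted(candidates, key=tier)
```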
This method, by itself, looks more like an answer candidate retrieval system than a full-fledged QA system with the sophisticated answer extraction algorithms found in most QA systems. However, the structured retrieval algorithm that we employ makes up for this answer extraction operation. Again, for a full exposition of this structured retrieval operation, refer to our previous work.
3.3 Answering CEQs by the Direct Answer Retrieval Method
Direct Answer Retrieval enables a direct approach
to Cross-passage Evidence Questions: There is no
need for any special method for question factoring.
With respect to an answer document, a possible answer is already associated with all the distributed
pieces of evidence about it in the corpus: In other
words, CEQs are exactly the same as SEQs.
Here, for example, to the question, “Which US
senator is a former astronaut?”, the conventional approach requires different sets of textual evidence (passages) to be assembled based on the decomposition of the question, as we have presented in the
previous section, whereas in the Direct Answer Retrieval approach, only one answer document needs
to be located.
Thus there are three advantages to using the Direct Answer Retrieval method over conventional IR with the special sub-question method:
• No multiplication of questions.
• No need for question factoring.
• No need to combine the answers to the sub-questions.

Whether, despite these advantages, the performance holds up is evaluated next.
3.4 Evaluation and Comparison to IR+AE
The purpose of the evaluation is to see how well this method can deal with CEQs, particularly as compared to the simulated IR+AE system with the sub-question method presented in the previous section. The implemented QA system had been previously evaluated on the simpler TREC QA questions and was found to perform well. The particular configuration used for our test here naturally uses the same questions and the same textual corpus as source data as in the sub-question method.
For each question, the system simply retrieves answer candidates, and the top 20 answers are taken for the score assessment.
A@N   Sub-QA     Ans-Ret
1     0.317:13   0.317:13
2     0.317:13   0.512:21
3     0.341:14   0.585:24
4     0.366:15   0.634:26
5     0.366:15   0.659:27
6     0.366:15   0.659:27
7     0.366:15   0.659:27
8     0.366:15   0.659:27
9     0.366:15   0.659:27
10    0.366:15   0.659:27
15    0.366:15   0.659:27
20    0.366:15   0.659:27
ACC   0.317      0.317
MRR   0.341      0.456
ARC   1.333      1.889

Table 3: Comparison of the scores for the sub-question method (Sub-QA) and Direct Answer Retrieval (Ans-Ret)
Table 3 summarises the results of the evaluation
for the Direct Answer Retrieval system (Ans-Ret)
compared to the sub-question-method-based system (Sub-QA).
The Ans-Ret system returned more correct answers than the Sub-QA system at all cut-off points (the two are tied at rank 1). Also, all the correct answers that were found by the Sub-QA system were found by the Ans-Ret system as well, irrespective of rank. Twelve correct answers found by Ans-Ret (at A@5) were missed by the Sub-QA system. Among the answers found by both systems, Sub-QA produced a better rank in 5 cases, whereas Ans-Ret produced a better rank in only one case. However, Table 3 shows that Ans-Ret in general found more correct answers in total (27 vs 15), and more answers at higher ranks (e.g. in the top 2), than the Sub-QA system.
In order to verify the performance difference, we used a statistical test, the Wilcoxon matched-pairs signed-rank test (Wilcoxon, 1945), which is suitable for small samples and assumes no underlying distribution. According to this test, the difference is statistically significant (W+ = 17.50, W− = 172.50, N = 19, p = 0.0007896). Hence, it is possible to conclude that the Direct Answer Retrieval method is superior to the simulated IR+AE method with the special sub-question method for this type of question. The fact that the Direct Answer Retrieval method performs better is no surprise: the more evidence the question provides for an answer, the more information is available to the method for matching the relevant answer document of a candidate answer to the question. This is in contrast to the IR+AE-based systems, which, without a special strategy such as the sub-question method discussed here, require that whatever evidence exists in a question be found within one sentence in the corpus, due to the locality constraint.
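For illustration, such a paired test can be run as follows; the per-question scores below are invented placeholders, not the evaluation data reported above:

```python
from scipy.stats import wilcoxon

# Hypothetical paired per-question scores (e.g. reciprocal ranks) for the
# two systems -- placeholders only, not the figures from Table 3.
sub_qa  = [1.0, 0.0, 0.50, 0.0, 1.0, 0.20, 0.33, 0.0]
ans_ret = [1.0, 0.5, 0.25, 0.0, 0.1, 1.00, 1.00, 0.2]

# Pairs with a zero difference are discarded by default, which is why the
# effective N can be smaller than the number of evaluation questions.
statistic, p_value = wilcoxon(sub_qa, ans_ret)
print(statistic, p_value)
```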
4 Conclusion
Cross-passage Evidence Questions are the kind of questions for which the locality constraint bears out its constraining effect. A special method was devised to overcome this constraint within the conventional IR+AE QA architecture. It involves partitioning a question into a set of simpler questions. The results show that the strategy is successful to a degree, in that some questions are indeed answered correctly, but this comes at the cost of multiplying the number of questions. The Direct Answer Retrieval method, on the other hand, is more efficient process-wise, has better performance in terms of accuracy, and does not require tricky question factoring for this type of question. The Direct Answer Retrieval method thus shows a clear advantage for CEQs.
References

Ahn, K., Bos, J., Clark, S., Curran, J., Kor, D., Nissim, M., and Webber, B. (2005). Question answering with QED at TREC-2005. In Proceedings of TREC 2005.

Ahn, K. and Webber, B. (2007). NEXUS: A real-time QA system. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference.

Ahn, K. and Webber, B. (2008). Topic indexing and retrieval for QA. In COLING '08: Proceedings of the COLING 2008 IR4QA Workshop.

Brill, E., Dumais, S., and Banko, M. (2002). An analysis of the AskMSR question-answering system. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 257–264, Philadelphia, PA.

Chali, Y. and Joty, S. R. (2008). Selecting sentences for answering complex questions. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 304–313.

Chu-Carroll, J., Prager, J., Welty, C., Czuba, K., and Ferrucci, D. (2002). A multi-strategy and multi-source approach to question answering. In Proceedings of the 11th Text REtrieval Conference (TREC 2002), National Institute of Standards and Technology.

Ferrucci, D., Brown, E., et al. (2012). Building Watson: An overview of the DeepQA project. AI Magazine, 31(3).

Saquete, E., Martínez-Barco, P., Muñoz, R., and González, J. L. V. (2004). Splitting complex temporal questions for question answering systems. In Proceedings of ACL 2004, pages 566–573.

Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In Williamson, C. L., Zurko, M. E., Patel-Schneider, P. F., and Shenoy, P. J., editors, Proceedings of the 16th International World Wide Web Conference (WWW 2007), pages 697–706, Banff, Canada. ACM.

Turtle, H. R. (1991). Inference Networks for Document Retrieval. PhD thesis, University of Massachusetts.

White, K. and Sutcliffe, R. F. E. (2004). Seeking an upper bound to sentence level retrieval in question answering. In Proceedings of the ACM SIGIR 2004 Workshop on Question Answering.

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.