Performance Analysis of Effective Japanese and Chinese Question Answering System

International Journal of Engineering Trends and Technology (IJETT) - Volume 4, Issue 4 - April 2013
Jaspreet Kaur #1, Vishal Gupta *2
# ME Student, Dept. of Computer Science and Engineering, Panjab University, Chandigarh, India
* Assistant Professor, Dept. of Computer Science and Engineering, Panjab University, Chandigarh, India
Abstract — In today's world of technological advancement, Question Answering has emerged as a key area for researchers. In Question Answering, the user is provided with specific answers instead of a large number of documents or passages. Question Answering has been carried out in many languages. This paper compares some existing Question Answering systems for the Chinese and Japanese languages. Different Chinese and Japanese Question Answering systems are compared on the basis of their performance in the workshops held for evaluating Question Answering systems. We also discuss which approach performs best, the reasons it is considered good, and some methods for further improving these techniques.
Keywords — Question Answering, Performance, Evaluation, Algorithms.
I. INTRODUCTION
In this paper, we review previous work on Japanese and Chinese question answering systems, evaluate their performance, and suggest measures for further improvement. Compared with other languages, word segmentation is a key problem in Chinese and Japanese question answering. We review studies on different techniques used for the Japanese and Chinese languages and discuss important issues that are helpful for building QA systems. Machine learning approaches currently represent the mainstream for many QA research issues; we believe that, by efficiently utilizing the resources surveyed here, the performance of machine learning approaches for Chinese question answering can be improved further.
II. JAPANESE QUESTION ANSWERING SYSTEM USING A* SEARCH AND ITS IMPROVEMENT
A. Method
Table 1 describes an improvement on an existing Japanese Question Answering system and shows its performance. A method is proposed that introduces A* search control into the sentential matching mechanism of a Japanese QA system so that turnaround time can be reduced without affecting accuracy. Several measures of the degree of sentence matching and a pseudo voting method are also proposed to improve accuracy. The effectiveness of the newly introduced techniques was examined in an evaluation workshop for question answering systems.
TABLE I. CHARACTERISTICS OF JAPANESE QUESTION ANSWERING SYSTEM

| Method | Purpose | Limitation | MRR (Mean Reciprocal Rank) |
|---|---|---|---|
| A* search control in sentential matching | Reduce turnaround time while maintaining accuracy | Accuracy not sufficiently high | 0.3 |
| Several measures of the degree of sentence matching and a variant of the voting method integrated with the A* search control method | To improve the accuracy of the A* search control method; higher accuracy as compared to the previous approach | - | 0.5 |
B. Details of Experimental System
Experiments were conducted by the author under the different conditions shown in Table 2, using the test collection of QAC2 Subtask 1.
C. Improving Accuracy of Japanese Sentential Matcher
Several off-the-shelf NLP techniques are adopted, each of which treats a different aspect of the sentence [6].
TABLE II. EXPERIMENTAL SYSTEM DETAILS

| Component | Detail |
|---|---|
| Morphological Analyzer | JUMAN 3.61 [1] |
| Dependency Analyzer | KNP 2.0b6 [2] |
| NE recognizer | SVM-based NE recognizer [3] using SVMlight [4] |
| Numerical expression extractor | System by Fujihata et al. [5] |
| Document database (knowledge resource), size: 774 MByte | Mainichi Shimbun newspaper articles in 1998 and 1999; Yomiuri Shimbun newspaper articles in 1998 and 1999 |
| Computer | CPU: Xeon (2.2 GHz) × 2, main memory: 4 GByte (for QA server); CPU: UltraSPARC III Cu (900 MHz) × 2, main memory: 8 GByte (for search engine) |
| Language of Implementation | JPerl 5.005_03 |
The composite matching score is constructed as shown in Eq. (1). It is a linear combination of the following sub-scores for an answer candidate AC in the i-th retrieved sentence L_i with respect to a question sentence L_q having an interrogative Q:

$$S(AC, L_i, L_q) = S_b(AC, L_i, L_q) + S_k(AC, L_i, L_q) + S_d(AC, L_i, L_q) + S_t(AC, L_i, L_q) \qquad (1)$$
Sb(AC, Li , Lq), Sk(AC, Li , Lq), and Sd(AC, Li , Lq) are scores
for measuring the similarity between the context of an
interrogative Q in a question sentence Lq and that of an answer
candidate AC in a retrieved sentence Li . On the other hand,
St(AC, Li , Lq) is a score for representing the consistency
between the answer candidate AC and a question type
expressed by the interrogative Q.
1. Sentence Chaining
A method is proposed in which multiple sentences are treated
as a single sentence [6]. In this method, each parse tree for
sentences L0 . . . Ln in an extracted passage is connected to the
parse tree of the succeeding sentence, if the following
condition with respect to the question sentence Lq is satisfied:
$$\bigcup_{l=0}^{n-1}\bigl(\mathrm{KW}(L_l)\cap \mathrm{KW}(L_q)\bigr) \not\supseteq \mathrm{KW}(L_n)\cap \mathrm{KW}(L_q) \;\;\wedge\;\; \bigcup_{l=0}^{n-1}\bigl(\mathrm{KW}(L_l)\cap \mathrm{KW}(L_q)\bigr) \not\subseteq \mathrm{KW}(L_n)\cap \mathrm{KW}(L_q) \qquad (2)$$
where KW(L_i) is the set of keywords appearing in L_i. The condition holds when L_n has new keywords that do not appear in the sequence of sentences L_0 ... L_{n-1}, and vice versa. Intuitively, when the keywords of the question sentence are scattered over a series of sentences, those sentences are connected. This method is known as sentence chaining.
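As an illustration, the following Python sketch applies the chaining condition to already-segmented sentences. The keyword extractor below is a stand-in assumption (the actual system derives keywords from JUMAN/KNP output), so this is only a sketch of the condition in Eq. (2).

```python
def keywords(sentence, question_keywords):
    """Stand-in for KW(L): keywords of a sentence, restricted to those
    shared with the question (the only ones the condition looks at)."""
    return set(sentence.split()) & question_keywords

def should_chain(previous_sentences, new_sentence, question_keywords):
    """Condition (2): chain L_n to L_0..L_{n-1} when the question keywords
    covered by L_n and by the preceding sentences differ in both directions."""
    covered_so_far = set()
    for s in previous_sentences:
        covered_so_far |= keywords(s, question_keywords)
    covered_new = keywords(new_sentence, question_keywords)
    return (not covered_new <= covered_so_far) and (not covered_so_far <= covered_new)
```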
2. Matching in Terms of Keywords
The matching score in terms of keywords is calculated based on the number of keywords shared by a question and a retrieved sentence. Equation (3) defines the score S_k(AC, L_i, L_q) for an answer candidate AC in a retrieved sentence L_i with respect to a question L_q [6]:

$$S_k(AC, L_i, L_q) = C_k \sum_{k \in SKW(AC, L_i, L_q)} w(k) + C_c \sum_{\langle k, c\rangle \in SKWC(AC, L_i, L_q)} w(k) \qquad (3)$$
$$SKW(AC, L_i, L_q) = (KW(L_i)\setminus\{AC\}) \cap KW(L_q)$$
$$SKWC(AC, L_i, L_q) = (KWC(L_i)\setminus\{\langle AC, c_{AC}\rangle\}) \cap KWC(L_q)$$

where the function KWC(L) returns the set of (keyword, case marker) pairs in the sentence L. The constants C_k and C_c are mixing factors in the composite score of Eq. (1), and c_AC is the case marker of AC. The functions SKW(AC, L_i, L_q) and SKWC(AC, L_i, L_q) return, respectively, the set of keywords and the set of (keyword, case marker) pairs shared by the question L_q and the retrieved sentence L_i. The function w(k) is a weighting function for keyword k, according to a certain global weighting method.
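A minimal Python sketch of Eq. (3) is given below. The keyword and case-marker extraction and the weighting function w are assumptions standing in for the morphological and dependency analysis that the system relies on.

```python
def keyword_match_score(ac, sent_kw, sent_kwc, q_kw, q_kwc, w, ck=1.0, cc=1.0):
    """S_k of Eq. (3).

    sent_kw / q_kw   : sets of keywords in the retrieved sentence / question.
    sent_kwc / q_kwc : sets of (keyword, case_marker) pairs.
    ac               : (answer_candidate, case_marker) tuple.
    w                : global keyword-weighting function, e.g. IDF-like.
    """
    ac_word, ac_case = ac
    skw = (sent_kw - {ac_word}) & q_kw                 # shared keywords
    skwc = (sent_kwc - {(ac_word, ac_case)}) & q_kwc   # shared (keyword, case) pairs
    return ck * sum(w(k) for k in skw) + cc * sum(w(k) for k, _ in skwc)
```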
3. Matching in Terms of 2-Grams
According to the number of character 2-grams shared by a retrieved sentence and a question, the matching score in terms of 2-grams is calculated. The score S_b(AC, L_i, L_q) is defined through an auxiliary function Sb1(s, e, L_i, L_q) built from 2-gram frequencies:

$$S_b(AC, L_i, L_q) = C_b \, Sb1(s, e, L_i, L_q)$$
$$Sb1(s, e, L_i, L_q) = \frac{1}{len(AC)} \sum_{l}\sum_{j} bfreq(L_i, L_q, l, j)$$
$$bfreq(L_i, L_q, l, j) = (j \neq l)\cdot freq(\mathrm{substr}(L_i, j, j+1), L_q)$$

where s and e are the character positions of the start and end of AC in L_i, respectively, substr(L, j, j+1) is the character 2-gram of L starting at position j, freq(x, L) counts the occurrences of x in L, and the constant C_b is a mixing factor in Eq. (1).
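Since the exact weighting is specific to [6], the following Python sketch only illustrates the general idea of character 2-gram matching: count the 2-grams shared by the question and the retrieved sentence, normalized by candidate length. It is not claimed to reproduce the precise formula above.

```python
def char_bigrams(text):
    """Set of character 2-grams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def bigram_overlap_score(answer_candidate, retrieved_sentence, question, cb=1.0):
    """Generic character 2-gram matching score: question bigrams that also
    occur in the retrieved sentence, normalized by the candidate length."""
    shared = char_bigrams(retrieved_sentence) & char_bigrams(question)
    return cb * len(shared) / max(len(answer_candidate), 1)
```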
4. Matching in Terms of Dependency Structure
The first element of the dependency vector DV (AC, Ki ) is the
distance between AC and LCA (the lowest common ancestor
of AC and Ki ) . The second element is the distance between Ki
and LCA. The distance is defined as the number of edges
between two nodes. Similarly, dependency vector DV (Q, Ki )
for a question sentence is obtained, where Q is an
interrogative in the sentence. Similarity between dependency
structures of two sentences is calculated using dependency
vectors. First, the similarity between the dependency relation
of a Q to a keyword Ki in the question sentence Lq and that of
the AC to Ki in the retrieved sentence Lj can be regarded as
measures of the appropriateness of AC. The appropriateness
can be represented as a function of the distance between two
dependency vectors DV(Q, Ki ) and DV(AC, Ki ).
$$Sd_1(AC, Q, K_i) = \frac{1}{|DV(Q, K_i) - DV(AC, K_i)|} \qquad (4)$$

Second, the nearness between an answer candidate and the keywords can also be regarded as a measure of the appropriateness of AC; it can be calculated from the length of the dependency vector:

$$Sd_2(AC, Q, K_i) = \frac{1}{|DV(AC, K_i)|} \qquad (5)$$
Considering these two measures of structural similarity, a measure of the degree of matching between the dependency structure of AC in the retrieved sentence and that of Q in the question sentence L_q is defined by Eqs. (6) and (7):

$$Sd(AC, L_i, L_q) = C_d \sum_{K_i} Sd_1'(AC, Q, K_i)\, w(K_i) \qquad (6)$$

$$Sd_1'(AC, Q, K_i) = Sd_1(AC, Q, K_i) + Sd_2(AC, Q, K_i) \qquad (7)$$

where C_d is a mixing factor in Eq. (1) and w(K_i) is the keyword weighting function of Eq. (3).
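The dependency vectors above are built from distances to the lowest common ancestor in the parse tree. The following Python sketch shows how DV(x, k) can be computed when the parse tree is given as a child-to-parent map; this tree representation is an illustrative assumption, not the KNP output format.

```python
def path_to_root(node, parent):
    """Nodes from `node` up to the root, following a child->parent map."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def dependency_vector(x, k, parent):
    """DV(x, k): edge distances from x and from k to their lowest common ancestor."""
    px, pk = path_to_root(x, parent), path_to_root(k, parent)
    ancestors_of_k = set(pk)
    lca_index_x = next(i for i, n in enumerate(px) if n in ancestors_of_k)
    lca = px[lca_index_x]
    return (lca_index_x, pk.index(lca))

# Example: a tiny tree  root <- A <- B, root <- C
parent = {"A": "root", "B": "A", "C": "root"}
print(dependency_vector("B", "C", parent))   # (2, 1): B is 2 edges, C is 1 edge from the LCA
```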
5. Matching in Terms of Question Type
The matching score S_t(AC, L_i, L_q) in terms of the question type is calculated as follows. First, a question type is estimated from the type of the interrogative and other clue expressions. Second, when the question type is supported by the numerical expression extractor, a set of patterns is used to filter answer candidates. Each retrieved sentence L_i with a suitable numerical expression is passed to the extractor to obtain a triplet (Obj_i, Attr_i, AC_i), where AC_i is a numeric value plus a unit. The matching score for an answer candidate is then defined in Eq. (8), based on the similarity of surface expression between this triplet and the corresponding triplet derived from the question sentence:

$$S_t(AC, L_i, L_q) = C_{num}\Bigl(\alpha_{num}\, R_s(Obj_i, Attr_i, Obj_q, Attr_q) + (1 - \alpha_{num}) \sum_{K \in SKW(AC, L_i, L_q)} w(K)\Bigr) \qquad (8)$$

$$R_s(Obj_i, Attr_i, Obj_q, Attr_q) = \frac{shared\_char(Obj_i, Obj_q)}{2}\Bigl(\frac{1}{len(Obj_i)} + \frac{1}{len(Obj_q)}\Bigr) + \frac{shared\_char(Attr_i, Attr_q)}{2}\Bigl(\frac{1}{len(Attr_i)} + \frac{1}{len(Attr_q)}\Bigr) \qquad (9)$$

where the function shared_char(s1, s2) returns the number of characters shared by the two strings s1 and s2. The function R_s therefore represents the similarity of the objects and the attributes of the two triplets in terms of their surface expression. The parameter α_num (0 ≤ α_num ≤ 1) controls the extent to which the result of the numerical expression extractor is considered, and the constant C_num is a mixing factor in Eq. (1). Third, if the question type belongs to any other simple numeric expression type such as DATE, a set of patterns is applied to the retrieved sentences to filter out answer candidates that are not numeric or that do not have suitable unit expressions.
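The following Python sketch shows one way to realize shared_char and the surface similarity R_s under the reconstruction above; treating shared_char as the size of the character-set intersection is an assumption made for illustration.

```python
def shared_char(s1, s2):
    """Number of characters the two strings have in common
    (set intersection, an assumption about how 'shared' is counted)."""
    return len(set(s1) & set(s2))

def surface_similarity(obj_i, attr_i, obj_q, attr_q):
    """R_s of Eq. (9): surface similarity of the object and attribute parts
    of the sentence triplet and the question triplet."""
    def part(a, b):
        if not a or not b:
            return 0.0
        return shared_char(a, b) * (1.0 / len(a) + 1.0 / len(b)) / 2.0
    return part(obj_i, obj_q) + part(attr_i, attr_q)
```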
D. Pseudo Voting Method in Search Scheme
This method continues searching until n different answer candidates are found when n-best answers are requested [8]; in other words, n different answer candidates reach the goal state, where all of the sub-scores are actually calculated. As a side effect, the system may find further answer candidates that have the same surface expression as one of the candidates that has already reached the goal state [9]. Consequently, frequency information about answer candidates can be partially used by recording every candidate that reaches the goal state during the search [10]. The pseudo voting score S_v(AC, L_q) for an answer candidate AC is defined as

$$S_v(AC, L_q) = \bigl(\log(freq(AC, AnsList)) + 1\bigr)\cdot \max_{L_i} S(AC, L_i, L_q) \qquad (10)$$

where AnsList is the list of answer candidates that have reached the goal state and freq(AC, AnsList) is the number of times AC appears in that list.
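A small Python sketch of the pseudo voting score follows; the list of goal-state candidates and the best matching score are assumed inputs.

```python
import math
from collections import Counter

def pseudo_voting_score(candidate, answer_list, best_match_score):
    """S_v of Eq. (10): boost a candidate's best matching score by the log
    of how often its surface form reached the goal state during the search."""
    frequency = Counter(answer_list)[candidate]
    if frequency == 0:
        return 0.0
    return (math.log(frequency) + 1.0) * best_match_score

# Example: "1998" reached the goal state twice during the search.
print(pseudo_voting_score("1998", ["1998", "Tokyo", "1998"], best_match_score=0.42))
```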
E. Experimental Result
The performance of the systems was evaluated using the MRR (Mean Reciprocal Rank). The following experiments were conducted [12]:

1. Experiment 1: Evaluation of Performance with Respect to System Parameters
In this experiment, the MRR and the turnaround time are examined while varying the values of the system parameters described in Table 3. The result is shown in Figure 2 [6]. Two curves were plotted for the turnaround time: the first includes the processing time of the external search engine, while the second excludes it.

TABLE III. DESCRIPTION OF SYSTEM PARAMETERS
a: Number of answers to be searched
d: Number of documents to be retrieved
ppd: Maximum number of passages retrieved from one document
p: Number of passages to be considered in the retrieved documents
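Since every system in this survey is compared by MRR, a short Python sketch of how the Mean Reciprocal Rank over a question set can be computed is given below; the ranked-answer input format is an assumption for illustration.

```python
def mean_reciprocal_rank(results):
    """MRR over a list of (ranked_answers, correct_answers) pairs.

    For each question the score is 1/rank of the first correct answer
    (0 if none of the returned answers is correct); MRR is the average.
    """
    total = 0.0
    for ranked_answers, correct_answers in results:
        for rank, answer in enumerate(ranked_answers, start=1):
            if answer in correct_answers:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0

# Example with two questions: first answered at rank 2, second at rank 1.
print(mean_reciprocal_rank([
    (["Osaka", "Kyoto"], {"Kyoto"}),
    (["1998"], {"1998"}),
]))   # (0.5 + 1.0) / 2 = 0.75
```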
2. Experiment 2: Evaluation of Effectiveness of Proposed Matching Techniques
In order to evaluate the effectiveness of the matching techniques, systems in which one matching function is suppressed are prepared and examined with respect to (1) accuracy in terms of MRR, (2) the average precision of the first answer candidate, and (3) the ratio of questions whose answers are found within the top five answer candidates. The baseline adopts the scoring function Sd'(AC, L_i, L_q) defined in Eq. (11) instead of the matching score Sd(AC, L_i, L_q) for the dependency structure, and does not use the pseudo voting method:
$$Sd'(AC, L_i, L_q) = C_d' \sum_{k \in SKW(AC, L_i, L_q)} \frac{w(k)}{Dist(AC, k)} \qquad (11)$$

where the function Dist(AC, k) returns the distance between AC and k, that is, the number of morphemes between AC and k plus one. The result of the comparison is shown in Fig. 1, where the parameter setting is a = 5, d = 250, ppd = 5, p = 50, which achieves the best performance in the experiment. Figure 1 shows that each proposed matching technique has an effect on improving accuracy.

Fig. 1. Evaluation of effectiveness of proposed matching techniques.

3. Experiment 3: Evaluation of Effectiveness of Proposed Search Control
The following systems are compared in terms of the MRR value and the average processing time needed to find one or five (different) answer candidates. The result is shown in Fig. 2, where the parameter setting is a = 1 or 5, d = 250, ppd = 5, p = 50.

Fig. 2. Comparison of search control methods in terms of the average processing time of matching score and MRR.

Figure 2 shows that the proposed system [A* (Approx. + Max.)] is 12.0 times faster (for one candidate) and 5.8 times faster (for five different candidates) than a system with no search control.
III. AN ANALYSIS OF A HIGH-PERFORMANCE JAPANESE QUESTION ANSWERING SYSTEM
A. Method
This question answering system, SAIQA-QAC2 (System for Advanced Interactive Question Answering), achieved the best performance of MRR = 0.607 in the subtask. SAIQA-QAC2 is an improvement of the previous system SAIQA-Ii, which achieved MRR = 0.46 for QAC1 Subtask 1 at NTCIR QAC [13]. Table 4 describes the improvement on the existing Japanese Question Answering system and shows its performance.
TABLE IV. FEATURES OF A HIGH-PERFORMANCE JAPANESE QUESTION ANSWERING SYSTEM

| Method | Purpose | Limitation | MRR (Mean Reciprocal Rank) | Document Database |
|---|---|---|---|---|
| SAIQA-QAC2 (System for Advanced Interactive Question Answering) | A new proximity-based ranking method, DIDF, is introduced; DIDF is compared with other document retrieval methods, namely IDF, BM25 and MultiText | The main part of the experimental system is implemented in Perl, which is a script language; reimplementation in another programming language that generates native code would be effective. The NE recognizer used is very time consuming and needs improvement | 0.516 | Mainichi Shimbun newspaper articles in 1998 and 1999; Yomiuri Shimbun newspaper articles in 1998 and 1999 (size = 774 MByte) |
The system is composed of four modules: Question Analysis, Document Retrieval, Answer Extraction, and Answer Evaluation.
IREXNE90 is an efficient SVM-based named entity recognizer that achieved F = 90% [15] for the IREX general task [16]. The same recognizer trained only on a publicly available corpus (the CRL NE data) achieved F = 85%; this version is called IREXNE85. IREXNE90 detects named entities and classifies them into eight classes: ORGANIZATION, PERSON, LOCATION, ARTIFACT, DATE, TIME, MONEY, and PERCENT. In general, a fine-grained answer taxonomy improves MRR [17].
B. Description of Each Module
1. Question Analysis
The Question-Analysis Module normalizes questions and determines expected answer types. Question normalization is used to simplify the "answer-type determination rules." After normalization, ALTJAWS, a morphological analyzer based on the Japanese lexicon "Nihongo Goi-taikei" [18], is applied to segment a given question into words, because inter-word spaces are not used in Japanese.
2. Document Retrieval
A new proximity-based document-ranking method is developed that considers all passages of different lengths. With this method, document D's score is defined as the best score over all passages in D:

$$DS^{(\beta)}(D) = \max_{p \subseteq D} PS^{(\beta)}(p)$$

Here, p1 ⊆ p2 means that passage p1 is contained in passage p2; the document D itself is also regarded as a passage. A passage p is represented by a pair of integers [l, r], where l is the location of p's first word and r is the location of p's last word, and the location of a word equals the number of words before it. The passage score is given by the following definition:

$$PS^{(\beta)}([l, r]) = \exp\bigl(-\beta (r - l)\bigr) \sum_{q \in Q([l, r])} idf[q]$$

where idf[q] = log(N / df[q]) is the IDF (Inverse Document Frequency) of query term q, df[q] is the number of documents that contain q, and N is the total number of documents. Q([l, r]) is the set of query terms that appear in the passage [l, r]. β ≥ 0 is a decay factor: if β is large, the scoring function becomes nearsighted and long passages do not produce good scores; if β is small, it becomes farsighted. This method is therefore called DIDF (Decayed IDF), and an efficient algorithm is used for the document ranking. DIDF(0) is equivalent to an IDF-based document scoring function:

$$DS^{(0)}(D) = \sum_{q \in Q(D)} idf[q]$$

Using 2000 in-house questions and about 1000 QAC1 questions, it was found that β's optimal value lies somewhere between 0.005 and 0.0001, but there was not enough time to tune the value before the formal run of QAC2; β = 0.005 was used for the formal run.
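A straightforward (not efficiency-optimized) Python sketch of the DIDF scoring follows; the query-term positions and IDF values are assumed inputs, and the quadratic scan over passage boundaries merely stands in for the efficient algorithm mentioned in the text.

```python
import math

def didf_document_score(query_positions, idf, beta):
    """DIDF document score: the best exp(-beta*(r-l)) * sum(idf) over all
    passages [l, r] whose endpoints are occurrences of query terms
    (the optimum is always attained on such a passage).

    query_positions: dict mapping query term -> sorted word positions in D.
    idf:             dict mapping query term -> IDF weight.
    """
    occurrences = sorted((pos, term) for term, positions in query_positions.items()
                         for pos in positions)
    best = 0.0
    for i, (l, _) in enumerate(occurrences):
        seen = set()
        score = 0.0
        for r, term in occurrences[i:]:
            if term not in seen:          # each query term counts once per passage
                seen.add(term)
                score += idf[term]
            best = max(best, math.exp(-beta * (r - l)) * score)
    return best

# Example: two query terms, beta = 0.005.
print(didf_document_score({"nobel": [3, 40], "prize": [5]},
                          {"nobel": 2.0, "prize": 1.5}, beta=0.005))
```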
3. Answer Extraction
Here, the sub modules of the Answer-Extraction Module are described. As mentioned above, IREXNE90 generates answer candidates, and the REJECT sub module removes inappropriate candidates of a certain fine-grained answer type. For example, school names and hospital names are recognized by IREXNE90 as LOCATION or ORGANIZATION, depending on the context.

4. Answer Evaluation
A simple scoring function is used [21] [22]. A candidate c in a document D is evaluated by a weighted Hanning window function [23] defined as follows:

$$score(c, D) = DS(D) \times \sum_{q \in Q} w[q]\, H\bigl(d(c, q)\bigr)$$

where DS(D) is D's document score, d(c, q) is the distance between c and the nearest position of a query term q, w[q] is the weight of q, and

$$H(d) = \begin{cases} \tfrac{1}{2}\bigl(\cos\tfrac{\pi d}{W} + 1\bigr) & \text{if } 0 \le d \le W \\ 0 & \text{otherwise} \end{cases}$$
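The following Python sketch illustrates the Hanning-window candidate scoring; the per-term weights w[q] and the use of word offsets as the distance measure are assumptions made for illustration.

```python
import math

def hanning(d, window):
    """H(d): weight decaying smoothly from 1 at distance 0 to 0 at the window edge."""
    return 0.5 * (math.cos(math.pi * d / window) + 1.0) if 0 <= d <= window else 0.0

def candidate_score(candidate_pos, query_positions, weights, doc_score, window=30):
    """score(c, D) = DS(D) * sum over query terms of w[q] * H(nearest distance)."""
    total = 0.0
    for term, positions in query_positions.items():
        if positions:
            nearest = min(abs(candidate_pos - p) for p in positions)
            total += weights.get(term, 1.0) * hanning(nearest, window)
    return doc_score * total

# Example: candidate at word 12, query terms at words 10 and 50.
print(candidate_score(12, {"nobel": [10], "prize": [50]},
                      {"nobel": 2.0, "prize": 1.5}, doc_score=3.4, window=30))
```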
C. Contribution of Each Component
The contribution of each component to the performance is analyzed.

1. Contribution of Answer-Extraction sub modules
Table 5 shows the degree to which each sub module of the Answer-Extraction Module contributed to the performance of the entire QA system. According to this table, IREXNE90, NUMEXP, and WSENSE were the main contributors to the system performance [14].

TABLE V. CONTRIBUTION OF SUB MODULES IN THE ANSWER-EXTRACTION MODULE (window sizes: W = 30 Bunsetsus and W = 60 Words)

| Sub module | Top5 | Diff | MRR | Top5 | Diff | MRR | Top5 | Diff | MRR |
|---|---|---|---|---|---|---|---|---|---|
| IREXNE85 | 91 | +91 | 0.37 | 94 | +94 | 0.38 | 91 | +91 | 0.38 |
| IREXNE90 | 100 | +9 | 0.40 | 103 | +9 | 0.41 | 100 | +9 | 0.43 |
| +NUMEXP | 114 | +14 | 0.47 | 117 | +14 | 0.48 | 114 | +14 | 0.49 |
| +SUFFIX | 120 | +6 | 0.50 | 123 | +6 | 0.51 | 121 | +7 | 0.52 |
| +WSENSE | 135 | +15 | 0.57 | 132 | +9 | 0.55 | 134 | +13 | 0.58 |
| +DEFINE | 138 | +3 | 0.58 | 135 | +3 | 0.56 | 137 | +3 | 0.59 |
| +REJECT (Coarse-Grained) | 138 | +0 | 0.61 | 134 | -1 | 0.57 | 139 | +2 | 0.62 |
2. Contribution of the Proximity-Based Document Retrieval Module
Table 6 compares the precision of DIDF with that of baseline methods. The precision at rank R is the ratio of relevant documents among the top R documents. A relevant document is assumed to be a document that contains a correct answer; that is, lenient evaluation is used.
TABLE VI. COMPARISON OF PRECISION OF DIFFERENT DOCUMENT RETRIEVAL METHODS

| Rank | DIDF (0.005) | DIDF (0.001) | MultiText | IDF | BM25 (k1 = 1.2) | BM25 (k1 = 0.1) | BM25 (k1 = 0.0) |
|---|---|---|---|---|---|---|---|
| 5 | 0.609 | 0.613 | 0.604 | 0.579 | 0.579 | 0.626 | 0.575 |
| 10 | 0.524 | 0.543 | 0.526 | 0.515 | 0.503 | 0.541 | 0.507 |
| 20 | 0.448 | 0.475 | 0.444 | 0.448 | 0.428 | 0.462 | 0.442 |
| 30 | 0.407 | 0.429 | 0.398 | 0.413 | 0.390 | 0.419 | 0.406 |
| 40 | 0.379 | 0.396 | 0.369 | 0.383 | 0.364 | 0.389 | 0.378 |
| 50 | 0.358 | 0.371 | 0.347 | 0.365 | 0.344 | 0.369 | 0.359 |
Table 7 compares the MRRs given by different retrieval systems; the top 20 documents were used. According to this table, DIDF gives better scores than MultiText, BM25, or IDF. However, even the difference between BM25 (k1 = 1.2) and DIDF (0.001) was not statistically significant when the RRs were compared.
TABLE VII. COMPARISON OF MRR FOR DIFFERENT RETRIEVAL SYSTEMS

| | BM25 (k1 = 1.2) | BM25 (k1 = 0.1) | MultiText | DIDF (0.005) | DIDF (0.001) | IDF |
|---|---|---|---|---|---|---|
| MRR | 0.580 | 0.601 | 0.626 | 0.622 | 0.628 | 0.617 |
| Top5 | 133 | 139 | 139 | 139 | 141 | 139 |
IV. BOOSTING CHINESE QUESTION ANSWERING WITH TWO LIGHTWEIGHT METHODS: ABSPS AND SCO-QAT
A. Method
The experiments in this article were conducted on a host QA system, ASQA ("Academia Sinica Question Answering system"), which was developed to deal with Chinese-related QA tasks. The system participated in the CLQA C-C (Chinese-to-Chinese) subtasks at NTCIR-5 and NTCIR-6, and achieved state-of-the-art performance in both cases. Questions are first analyzed by the question processing module to obtain keywords, named entities (NEs), and the question type. Then, queries are constructed for passage retrieval according to the question processing results. In the next phase, answer extraction is performed on the retrieved passages to obtain candidate answers, which are then filtered and ranked by the answer filtering module and the answer ranking module, respectively.
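As a rough illustration of the pipeline just described, the following Python skeleton wires the stages together; every stage body here is a placeholder, since the actual ASQA components (segmentation, NER, passage retrieval, SCO-QAT ranking) are not reproduced.

```python
# Skeleton of a factoid QA pipeline in the style described for ASQA.
# All stage implementations are placeholders, not the ASQA code.

def process_question(question):
    """Question processing: keywords, named entities and the question type."""
    keywords = question.split()              # placeholder segmentation
    return {"keywords": keywords, "entities": [], "qtype": "UNKNOWN"}

def retrieve_passages(analysis):
    """Passage retrieval from queries built out of the question analysis."""
    return []                                # placeholder: would query an IR engine

def extract_candidates(passages, analysis):
    """Answer extraction over the retrieved passages."""
    return []                                # placeholder: NE-based extraction

def filter_candidates(candidates, analysis):
    """Answer filtering, e.g. an ABSP-based filter."""
    return candidates

def rank_candidates(candidates, passages, analysis):
    """Answer ranking, e.g. by SCO-QAT scores."""
    return candidates

def answer(question):
    analysis = process_question(question)
    passages = retrieve_passages(analysis)
    candidates = extract_candidates(passages, analysis)
    candidates = filter_candidates(candidates, analysis)
    return rank_candidates(candidates, passages, analysis)
```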
B. Proposed Methods
1. ABSPs: Alignment-Based Surface Patterns
In ASQA, ABSPs are used in an answer filter to identify correct answers with high confidence [25] [26].
1.1 The Alignment Algorithm
Pair-wise sequence alignment (PSA) algorithms that generate templates and match them against new text have been researched extensively [28]. Because the surface patterns extracted from sentences must capture certain morphological similarities, local alignment techniques are employed to generate the surface patterns [27].
To apply the alignment algorithm, word segmentation is performed first; in the following discussion each unit is a word. Templates contain named entity (NE) tags as semantic tags and POS tags as syntactic tags. Consider two sequences X = (x1, x2, ..., xn) and Y = (y1, y2, ..., ym) defined over the alphabet Σ that consists of four kinds of tags: NE tags, POS tags, a raw-word tag for every single word, and a tag "-" for a gap. We assign a scoring function F to measure the similarity of X and Y. F(i, j) is defined as the score of the optimal alignment between the initial segment x1...xi of X and the initial segment y1...yj of Y. F(i, j) is calculated recursively as follows [24]:
$$F(i, 0) = 0,\qquad F(0, j) = 0,\qquad x_i, y_j \in \Sigma \qquad (12)$$

$$F(i, j) = \max \begin{cases} 0 \\ F(i-1, j-1) + d(x_i, y_j) \\ F(i-1, j) + d(x_i, \text{'-'}) \\ F(i, j-1) + d(\text{'-'}, y_j) \end{cases} \qquad (13)$$
where d(a, b) is the function that determines the degree of similarity between two alphabet letters a and b. It is defined in Eq. (14), where NE(a) denotes the Named Entity (NE) tag of a and POS(a) denotes the POS tag of a:

$$d(a, b) = \max \begin{cases} 1, & a = b \\ 1, & NE(a) = NE(b) \\ 1 - penalty, & POS(a) \approx POS(b) \\ 0, & a \neq b \end{cases} \qquad (14)$$
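The recursion of Eqs. (12)-(14) is the familiar local-alignment dynamic program. Below is a compact Python sketch that fills the F matrix for two tagged token sequences; the token representation (word, POS, NE), the similarity values, and the fixed gap cost are illustrative assumptions rather than the exact parameters used in ASQA.

```python
def similarity(a, b, penalty=0.2):
    """d(a, b) over tagged tokens; each token is a (word, pos, ne) triple."""
    if a[0] == b[0]:                      # identical raw word
        return 1.0
    if a[2] and a[2] == b[2]:             # same named-entity tag
        return 1.0
    if a[1] == b[1]:                      # same (or similar) POS tag
        return 1.0 - penalty
    return 0.0

def local_alignment_matrix(xs, ys, gap=-0.5):
    """F(i, j) of Eq. (13), with a constant gap cost standing in for d(x, '-')."""
    f = [[0.0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            f[i][j] = max(0.0,
                          f[i - 1][j - 1] + similarity(xs[i - 1], ys[j - 1]),
                          f[i - 1][j] + gap,
                          f[i][j - 1] + gap)
    return f
```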
1.2 ABSP Generation
An ABSP is composed of ordered slots. ABSPs can be generated from a set of sentences by applying the alignment algorithm. Before alignment, the sentences are segmented and tagged with POS by a Chinese segmentation tool, AutoTag. In addition, the sentences are tagged with semantic tags: an NER engine is used to label PERSON, ORGANIZATION, LOCATION, and TIME tags, and a word list is used for "occupation" tags. After this step, the remaining words without any semantic tag are tagged "O". Thus, every segment of a sentence contains a word, a POS tag, and a semantic tag in the format "word/POS tag/semantic tag".
The complete ABSP generation algorithm is detailed in Algorithm 1 [24].
1.2.1 Algorithm 1. ABSP Generation
Input: Question set S = {s1, ..., sn}
Output: A set of uncategorized ABSPs T = {t1, ..., tk}
Comment: Perform pairwise alignment for every two questions
1. T = {};
2. for each question si from s1 to sn-1 do
3.   for each question sj from si to sn do
4.     perform alignment on si and sj, then
5.     pair segments according to the similarity matrix F;
6.     generate a common ABSP t from the aligned pairs with the maximum similarity;
7.     T ← T ∪ {t};
8.   end;
9. end;
10. return T;
1.3 ABSPs Selection
The selection process chooses patterns that can connect the question keywords and the answer. Each generated ABSP is applied to its source passages. When a matched source passage is found, the corresponding terms are extracted from the important slots. If the extracted terms do not contain both the answer and at least one of the important terms of the source question, the ABSP is removed. The details are described in Algorithm 2 [24]. The selected ABSPs are applied as a filter to choose highly confident answers. It is assumed that the words matched by an ABSP have certain relations between them. When a pattern matches the words, a relation is identified and a Related-Terms-Set (RTS) containing the related terms is constructed. If an ABSP matches a passage, an RTS comprising the matched important terms is extracted; more RTSs are constructed if more than one ABSP matches different terms in a passage. If two RTSs contain common elements (i.e., the same term is matched by at least two ABSPs), the idf values of those elements are checked, and if an idf value is higher than a threshold, the two RTSs are merged.
1.3.1 Algorithm 2. ABSPs Selection
Input: A set of ABSPs T = {t1, ..., tk} for selection, the source question Q, the answer A, the source passages S = {s1, ..., sn}
Output: Selected set of ABSPs T' = {t1, ..., ti}
1. T' = {};
2. QTs ← extract important terms from Q;
3. for each sentence si in S do
4.   for each ABSP tj in T do
5.     perform pattern matching on si with tj; if match then
6.       PTs ← extract terms that match the important slots of tj from si;
7.       if PTs contains A and any term in QTs then
8.         T' ← T' ∪ {tj};
9.       end if;
10.    end if;
11.  end;
12. end;
13. return T';
1.4 Relation Extraction and Score Calculation
After all the RTSs for a given question have been constructed, the question's important terms are used to calculate an RTS score. The score is calculated as the ratio of the matched important terms to the question's important terms. Candidate answers contained in RTSs that do not include any of the question's important terms are discarded. If none of the RTSs contains a question's important terms, the question is said not to be covered; since no useful relations for filtering answers could be found, all the answers are retained. After processing all the sentences selected for a question, the candidate answers are ranked by the sum of their RTS scores over the sentences in which they appear, and the top-ranked answer(s) are retained [24].
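A hedged Python sketch of this filtering-and-ranking step follows; the RTS data structure and the interpretation of the score as the fraction of the question's important terms matched are assumptions consistent with the description above.

```python
def rts_score(rts_terms, question_terms):
    """Fraction of the question's important terms that appear in the RTS
    (0 means the RTS gives no evidence and its candidates are dropped)."""
    if not question_terms:
        return 0.0
    return len(rts_terms & question_terms) / len(question_terms)

def rank_by_rts(rts_list, question_terms):
    """rts_list: list of (related_terms_set, candidate_answer) pairs.
    Candidates are ranked by the sum of the scores of the RTSs they occur in."""
    scores = {}
    covered = False
    for terms, candidate in rts_list:
        s = rts_score(terms, question_terms)
        if s > 0:
            covered = True
            scores[candidate] = scores.get(candidate, 0.0) + s
    if not covered:                      # question not covered: keep every answer
        return [c for _, c in rts_list]
    return sorted(scores, key=scores.get, reverse=True)
```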
C. SCO-QAT: Sum of Co-occurrences of Question and Answer Terms
Passages are chosen instead of documents because it is assumed that the co-occurrence information provided by passages is more reliable. The SCO-QAT formula is derived from an expected-confidence-score concept. Let the given answer be A and the given question be Q, where Q consists of a set QT of question terms {qt1, qt2, qt3, ..., qtn}. Based on QT, QC is defined as the set of question term combinations, or more precisely QC = {qc_i | qc_i is a non-empty subset of QT}. The co-occurrence confidence score of an answer A with a question term combination qc_i is calculated as follows [24]:

$$Conf(qc_i, A) = \begin{cases} \dfrac{freq(qc_i, A)}{freq(qc_i)} & \text{if } freq(qc_i) \neq 0 \\ 0 & \text{if } freq(qc_i) = 0 \end{cases} \qquad (15)$$
where freq(X) is the number of retrieved passages in which all the elements of X co-occur. The expected confidence score is defined as

$$\frac{1}{|QC|} \sum_{i=1}^{|QC|} Conf(qc_i, A) \qquad (16)$$

Because |QC| is the same for every answer, it can be removed. As a result, the SCO-QAT formula is

$$SCO\text{-}QAT(A) = \sum_{i=1}^{|QC|} Conf(qc_i, A) \qquad (17)$$

Candidate answers are ranked according to their SCO-QAT scores.
For example, for a question with three question terms qt1, qt2, and qt3 and a candidate answer c1, Eq. (17) expands to the following SCO-QAT score:

$$SCO\text{-}QAT(c1) = \frac{freq(qt1, c1)}{freq(qt1)} + \frac{freq(qt2, c1)}{freq(qt2)} + \frac{freq(qt3, c1)}{freq(qt3)} + \frac{freq(qt1, qt2, c1)}{freq(qt1, qt2)} + \frac{freq(qt1, qt3, c1)}{freq(qt1, qt3)} + \frac{freq(qt2, qt3, c1)}{freq(qt2, qt3)} + \frac{freq(qt1, qt2, qt3, c1)}{freq(qt1, qt2, qt3)}$$
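The following Python sketch computes a SCO-QAT score directly from retrieved passages by enumerating all non-empty question-term combinations, as in Eqs. (15)-(17); the whitespace/substring containment test is a simplification standing in for real Chinese passage matching.

```python
from itertools import combinations

def freq(terms, passages):
    """Number of retrieved passages in which all the given terms co-occur."""
    return sum(all(t in p for t in terms) for p in passages)

def sco_qat(answer, question_terms, passages):
    """Sum over every non-empty combination qc of question terms of
    freq(qc + answer) / freq(qc), skipping combinations that never occur."""
    score = 0.0
    for size in range(1, len(question_terms) + 1):
        for qc in combinations(question_terms, size):
            denom = freq(qc, passages)
            if denom:
                score += freq(qc + (answer,), passages) / denom
    return score

passages = ["the nobel prize 1901 stockholm", "nobel founded the prize", "1901 sweden"]
print(sco_qat("1901", ("nobel", "prize"), passages))   # 0.5 + 0.5 + 0.5 = 1.5
```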
D. Enhancing SCO-QAT with Distance Information
SCO-QAT is improved by integrating distance information, to capture the term density in passages when the number of question terms is small. The extended formulas are:

$$Conf_{dist}(qc_i, A) = \begin{cases} \dfrac{1}{freq(qc_i)} \displaystyle\sum_{j=1}^{n} \dfrac{1}{avgdist(p_j, qc_i, A)} & \text{if } freq(qc_i) \neq 0 \\ 0 & \text{if } freq(qc_i) = 0 \end{cases} \qquad (18)$$

$$SCO\text{-}QAT_{dist}(A) = \begin{cases} \displaystyle\sum_{i=1}^{|QC|} Conf(qc_i, A) & \text{if } |QT| > threshold \\ \displaystyle\sum_{i=1}^{|QC|} Conf_{dist}(qc_i, A) & \text{if } |QT| < threshold \end{cases} \qquad (19)$$

where n denotes the number of retrieved passages; if a passage does not contain qc_i, its contribution to the confidence value is set to 0. As shown in the modified SCO-QAT function in Eq. (19), when the number of question terms is smaller than the threshold, the score switches to Conf_dist. The avgdist function is the average number of characters between the question term combination qc_i and the answer A in passage p_j, calculated as

$$avgdist(p_j, qc_i, A) = \frac{1}{|qc_i|} \sum_{qt \in qc_i} dist(p_j, qt, A)$$

where dist(p_j, qt, A) is the number of characters between qt and A in p_j.
E. Experiments
Three experiments were conducted.

1. Comparing SCO-QAT with Other Ranking Features
Instead of combined features, only single ranking features are examined, since they are more consistent and can be applied to other systems more easily. In addition to SCO-QAT, the following widely used shallow features were tested: density, keyword overlap (KO), IR score, mutual information (MI) score, and answer frequency. The keyword overlap is the ratio of question keywords found in a paragraph. The IR score is calculated by the Lucene information retrieval engine. There are several methods of calculating density [29]. The mutual information score is computed by the PMI method [30] [31]. The experimental results are shown in Table 8.
TABLE VIII. THE PERFORMANCE OF SINGLE FEATURES: "ACCURACY" IS THE RU-ACCURACY, "MRR" IS THE TOP5 RU-MEAN-RECIPROCAL-RANK, AND "EAA" IS THE EXPECTED ANSWER ACCURACY

Data: NTCR5-CC-D200e
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.545 | 0.522 | 0.621 |
| KO | 0.515 | 0.254 | 0.601 |
| Density | 0.375 | 0.368 | 0.501 |
| Frequency | 0.445 | 0.431 | 0.560 |
| IR | 0.515 | 0.425 | 0.598 |
| MI | 0.210 | 0.210 | 0.342 |

Data: IASL-CC-Q465
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.578 | 0.546 | 0.628 |
| KO | 0.568 | 0.247 | 0.618 |
| Density | 0.432 | 0.369 | 0.519 |
| Frequency | 0.413 | 0.406 | 0.486 |
| IR | 0.518 | 0.406 | 0.587 |
| MI | 0.138 | 0.124 | 0.280 |

Data: NTCIR5-CC-T200e
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.515 | 0.515 | 0.586 |
| KO | 0.495 | 0.245 | 0.569 |
| Density | 0.390 | 0.380 | 0.479 |
| Frequency | 0.395 | 0.366 | 0.499 |
| IR | 0.495 | 0.420 | 0.569 |
| MI | 0.155 | 0.290 | 0.138 |

Data: NTCIR6-C-T150
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.413 | 0.406 | 0.495 |
| KO | 0.367 | 0.130 | 0.476 |
| Density | 0.340 | 0.314 | 0.420 |
| Frequency | 0.340 | 0.343 | 0.431 |
| IR | 0.367 | 0.283 | 0.460 |
| MI | 0.167 | 0.142 | 0.281 |
2. Enhancing SCO-QAT with Distance Information
Experiments were carried out on the extended version of SCO-QAT, with the question-term-number threshold in Eq. (19) set to 5. The results are shown in Table 9. According to a paired t-test, SCO-QAT with distance information was significantly more accurate than SCO-QAT at the 0.01 level [24].
TABLE IX. THE PERFORMANCE OF SCO-QAT AND SCO-QAT WITH DISTANCE INFORMATION: "ACCURACY" IS THE RU-ACCURACY, "MRR" IS THE TOP5 RU-MEAN-RECIPROCAL-RANK, AND "EAA" IS THE EXPECTED ANSWER ACCURACY

Data: NTCR5-CC-D200e
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.545 | 0.522 | 0.621 |
| SCO-QAT_Dist | 0.570 | 0.568 | 0.643 |

Data: IASL-CC-Q465
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.578 | 0.546 | 0.628 |
| SCO-QAT_Dist | 0.589 | 0.565 | 0.637 |

Data: NTCIR5-CC-T200e
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.515 | 0.515 | 0.586 |
| SCO-QAT_Dist | 0.535 | 0.538 | 0.597 |

Data: NTCIR6-C-T150
| Feature | Accuracy | EAA | MRR |
|---|---|---|---|
| SCO-QAT | 0.413 | 0.406 | 0.495 |
| SCO-QAT_Dist | 0.453 | 0.449 | 0.565 |

3. ABSP-Based Answer Filter
When the ABSP-based answer filter was used in ASQA on the NTCIR-6 dataset, the RU-accuracy increased from 0.453 to 0.5.
V. PERFORMANCE ANALYSIS OF DESCRIBED QUESTION ANSWERING SYSTEMS
Figure 3 summarizes the performance of the different Question Answering systems described above.

Fig. 3. Performance of Different Question Answering Systems.

As we can see, among the question answering systems described above, SAIQA-QAC2 (System for Advanced Interactive Question Answering) achieves the highest MRR (Mean Reciprocal Rank), i.e. 0.607. SAIQA-QAC2 is an improvement on the previous system SAIQA-Ii, which achieved MRR = 0.46 for QAC1; the improvement was made in the answer-type determination module and the retrieval module. A new proximity-based document retrieval module, DIDF, was developed that performs better than other document retrieval methods. In the experiments, only newspaper articles were used, so additional experiments are required to show the generality of the scoring function; the language and the number of documents may influence the performance.
The Japanese Question Answering System Using A* Search and its Improvement achieved MRR = 0.516. From the viewpoint of computational cost, the absolute average processing time should be reduced further, although the controlled search successfully manages the trade-off between computational cost and accuracy. One of the reasons is that the main part of the experimental system is implemented in Perl, which is a script language; reimplementation in a programming language that generates native code would be effective. Introducing a caching mechanism to reuse the results of sub-processes would also achieve good results, because some sub-processes, such as the NE recognizer used in the study, are very time consuming. These challenges need to be explored.
Boosting Chinese Question Answering with Two Lightweight Methods: ABSPs (Alignment-based Surface Patterns) and SCO-QAT (Sum of Co-occurrences of Question and Answer Terms) achieved RU-Accuracy = 0.535. The two lightweight methods, SCO-QAT and ABSPs, were proposed for use in a state-of-the-art Chinese factoid QA system (ASQA). The methods require fewer resources than heavy methods, such as the parsers and logic provers used in state-of-the-art QA systems for other languages. The ABSP method is a variation of surface pattern methods. It tries to increase question coverage and maintain accuracy by targeting surface patterns at all question types instead of specific question types, combining relations extracted by multiple surface patterns from multiple passages, and incorporating richer semantic tags. By using this strategy, ABSPs achieve 37.33% coverage and 0.911 RU-Accuracy on the questions covered.
The SCO-QAT method utilizes co-occurrence information in retrieved passages. Since it calculates all the co-occurrence combinations without extra access to the corpus or the Web, it is suitable for bandwidth-limited situations. Moreover, SCO-QAT does not require word-ignoring rules to handle missing counts, and it can be combined with other answer-ranking features.
SCO-QAT and ABSPs can be improved in several ways. In both methods, applying rules with taxonomy or ontology resources would solve most canonicalization problems. For SCO-QAT, a better term-weighting scheme would be helpful, and using more syntactic information, such as incorporating surface patterns, would result in more reliable co-occurrence calculations. For ABSPs, more accurate semantic tags, which are usually finer grained, would improve accuracy while maintaining question coverage. Also, to increase question coverage, in addition to the strategies adopted for ABSPs, partial matching could be used, because it allows portions of a surface pattern to be unmatched. Allowing overlapping tags is also a possibility, because some errors are caused by tagging, such as wrong word segmentation.
VI. CONCLUSION
We analysed the question answering systems described above and found that SAIQA-QAC2 (System for Advanced Interactive Question Answering) achieves the highest MRR (Mean Reciprocal Rank), i.e. 0.607. SAIQA-QAC2 is an improvement on the previous system SAIQA-Ii, which achieved MRR = 0.46 for QAC1; the improvement was made in the answer-type determination module and the retrieval module, and a new proximity-based document retrieval module, DIDF, was developed that performs better than other document retrieval modules. The Japanese Question Answering System Using A* Search and its Improvement achieved MRR = 0.516; reimplementation in a programming language that generates native code would make it more efficient. Boosting Chinese Question Answering with Two Lightweight Methods: ABSPs (Alignment-based Surface Patterns) and SCO-QAT (Sum of Co-occurrences of Question and Answer Terms) achieved RU-Accuracy = 0.535; for SCO-QAT, a better term-weighting scheme would be helpful, and using more syntactic information, such as incorporating surface patterns, would result in more reliable co-occurrence calculations.
ACKNOWLEDGMENT
I would like to express my thanks to Mr. Vishal Gupta, Assistant Professor, Department of Computer Science and Engineering, UIET, Panjab University, Chandigarh, for his guidance in accomplishing this task.

REFERENCES
[1] T. Kurohashi and M. Nagao, Japanese Morphological Analysis System JUMAN Version 3.6 Manual, Kyoto University, Japan, 1998.
[2] S. Kurohashi, Japanese Syntactic Parsing System KNP Version 2.0b6 Instruction Manual, Japan, 1998.
[3] H. Yamada, T. Kudo, and Y. Matsumoto, "Japanese Named Entity Extraction Using Support Vector Machine", IPSJ Journal 43, 1 (Jan.), 44-53, Japan, 2002.
[4] T. Joachims, SVMlight - Support Vector Machine. [Online]. Available: http://svmlight.joachims.org/, 2002.
[5] K. Fujihata, M. Shiga, and T. Mori, Extraction of Numerical Expressions by Constraints and Default Rules of Dependency Structure, SIG Notes 2001-NL-145, Information Processing Society of Japan, (Sep.), Japan, 2001.
[6] T. Mori, "Japanese Question-Answering System Using A* Search and Its Improvement", ACM Transactions on Asian Language Information Processing, Vol. 4, No. 3, pp. 280-304, 2005.
[7] M. Murata, M. Utiyama, and H. Isahara, Question Answering System Using Similarity-Guided Reasoning, SIG Notes 2000-NL-135, Information Processing Society of Japan, 2000.
[8] C. L. Clarke, G. V. Cormack, and T. R. Lynam, "Exploiting Redundancy in Question Answering", in Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 358-365, 2001.
[9] J. Xu, A. Licuanan, and R. Weischedel, "TREC QA at BBN: Answering Definitional Questions", in Proceedings of the Twelfth Text REtrieval Conference, 2003.
[10] B. Magnini, M. Negri, and R. P. H. Tanev, "Is It the Right Answer? Exploiting Web Redundancy for Answer Validation", in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 425-432, 2003.
[11] J. Fukumoto, T. Kato, and F. Masui, Question Answering Challenge for Five Ranked and List Answers - Overview of NTCIR-4 QAC2 Subtasks 1 and 2, in Working Notes of the Fourth NTCIR Workshop Meeting, 283-290, 2004.
[12] J. Fukumoto, T. Kato, and F. Masui, Question Answering Challenge (QAC-1) - Question Answering Evaluation at NTCIR Workshop 3, in Working Notes of the Third NTCIR Workshop Meeting - Part IV: Question Answering Challenge (QAC1), 1-6, 2002.
[13] Y. Sasaki, H. Isozaki, T. Hirao, K. Kokuryou, and E. Maeda, NTT's QA Systems for NTCIR QAC-1, in Working Notes of the Third NTCIR Workshop Meeting, Part IV: Question Answering Challenge (QAC1), 63-70, 2002.
[14] H. Isozaki, "An Analysis of a High-Performance Japanese Question Answering System", ACM Transactions on Asian Language Information Processing, Vol. 4, No. 3, pp. 263-279, 2005.
[15] H. Isozaki and H. Kazawa, "Efficient Support Vector Classifiers for Named Entity Recognition", in Proceedings of the 19th International Conference on Computational Linguistics, 390-396, 2002.
[16] S. Sekine, and Y. Eriguchi, “Japanese Named Entity Extraction
Evaluation—Analysis of Results”, in Proceedings of the 18th
International Conference on Computational Linguistics, 1106–
1110, 2000.
[17] Y. Ichimura, Y. Saito, T. Sakai, T. Kokubu, and M. Koyama, “A
study of the relations among question answering, Japanese named
entity extraction, and named entity taxonomy (in Japanese)”, in
IPSJ SIG Technical Report NL-161, 17–24, 2004.
[18] S. Ikehara, M. Miyazaki, S. Shirai, A. Yokoo, H. Nakaiwa, K.
Ogura, Y. Ooyama, and Y. Hayashi, Goi-Taikei—A Japanese
Lexicon (in Japanese), Iwanami Shoten, 1997.
[19] K. S. Jones, S. Walker, and S. E. Robertson, A Probabilistic Model
of Information Retrieval: Development and Comparative
Experiments, Information Processing and Management 36, 779–
840, 2000.
[20] C. L. A. Clarke, G. V. Cormack, and T. R. Lynam, “Exploiting
Redundancy in Question Answering”, in Proceedings of the 24th
Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, 358–365, 2001.
[21] M. Murata, M. Utiyama, and H. Isahara, Japanese Question-Answering System Using Decreased Adding with Multiple Answers,
in Working Notes of NTCIR-4, 353–360, 2004.
[22] J. Suzuki, Y. Sasaki, and E. Maeda, “SVM Answer Selection for
Open-Domain Question Answering”, in Proceedings of the 19th
International Conference on Computational Linguistics, 974–980,
2002.
[23] T. Hirao, Y. Sasaki, and H. Isozaki, “An Extrinsic Evaluation for
Question-Biased Text Summarization on QA Tasks”, in
Proceedings of the Workshop on Automatic Summarization, The
Second Meeting of the North American Chapter of the Association
for Computational Linguistics, 61–68, 2001.
[24] C. W. Lee, M. Y. Day, C. L. Sung, Y. H. Lee, T. J. Jiang, C. W.
Wu, D. W. Shih, Y. R. Chen, and W. L. Hsu, “Boosting Chinese
Question Answering with Two Lightweight Methods: ABSPs and
SCO-QAT”, ACM Transactions on Asian Language Information
Processing, Vol. 7, No. 4, Article 12, 2008.
[25] W. L. Hsu, S. H. Wu, and Y. S. Chen, “Event identification based
on the information map-INFOMAP”, in Proceedings of the IEEE
International Conference on Systems, Man, and Cybernetics
(SMC’01), Tucson, AZ, 1661–1666, 2001.
[26] V. N. Vapnik, The nature of statistical learning theory, Springer,
1995.
[27] C. W. Wu, S. Y. Jan, R. T. H. Tsai, and W. L. Hsu, “On Using
Ensemble Methods for Chinese Named Entity recognition”, in
Proceedings of the 5th SIGHAN Workshop on Chinese Language
Processing, 2006.
[28] M. Huang, X. Zhu, Y. Hao, D. G. Payan, K. Qu, and M. Li,
Discovering patterns to extract protein - protein interactions from
full texts, Bioinformatics 20, 3604–3612, 2004.
[29] D. Molla, and M. Gardiner, “Answerfinder Question Answering by
Combining Lexical, Syntactic and Semantic Information”, in
Australasian Language Technology workshop, 2005.
[30] Y. Zhao, Z. M. Xu, Y. Guan, and P. Li, “Insun05QA on QA Track
Of Trec’05”, in Proceedings Of The 14th Text Retrieval
Conference, 2005.
[31] Z. Zheng, Answer Bus question answering system, Human
Language Technology Conference, 24–27, 2002.