Knowledge-Based Systems 201–202 (2020) 106077
Contents lists available at ScienceDirect
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
DAM: Transformer-based relation detection for Question Answering
over Knowledge Base
Yongrui Chen, Huiying Li∗
School of Computer Science and Engineering, Southeast University, Nanjing 210096, PR China
Article info
Article history:
Received 3 November 2019
Received in revised form 22 May 2020
Accepted 23 May 2020
Available online 1 June 2020
Keywords:
Knowledge Base Question Answering
Transformer
Relation detection
Abstract
Relation Detection is a core component of Knowledge Base Question Answering (KBQA). In this paper,
we propose a Transformer-based deep attentive semantic matching model (DAM), to identify the KB
relations corresponding to the questions. The DAM is completely based on the attention mechanism
and applies the fine-grained word-level attention to enhance the matching of questions and relations.
On the basis of the DAM, we build a three-stage KBQA pipeline system. The experimental results on
multiple benchmarks demonstrate that our DAM model outperforms previous methods on relation
detection. In addition, our DAM-based KBQA system also achieves state-of-the-art results on multiple
datasets.
© 2020 Elsevier B.V. All rights reserved.
1. Introduction
Since the birth of the Knowledge Base (KB), it has been widely
applied in various communities [1–3]. Knowledge Base Question
Answering (KBQA) is one of the active research areas where the
goal is to provide crisp answers to natural language questions by
employing the KB. An important direction in KBQA is answering
via semantic parsing (SP): translating natural language questions
posed by humans into structured queries (e.g., SPARQL). As illustrated in Fig. 1, SP-based KBQA approaches [4–7] are typically
pipelines that consist of three key stages: (1) entity linking,
which links n-grams to knowledge base (KB) entities; (2) relation
detection, which identifies the KB relation(s) to which a question
refers; and (3) constraint detection, which adds constraints to
construct structured queries. In this paper, we focus on relation
detection, which is a core phase of the pipeline.
Relation detection aims to build a correct mapping from questions to KB relations. Currently, it is a common practice to employ
the neural network model to learn the semantic representations
of relations and questions to bridge the lexical gap between
them. To make the model focus on the parts of the question
which are relevant to the relation, the attention mechanism is
applied in recent work [7,8]. In order to reduce the number of
unseen relations [5], recent research breaks the relation name
into a sequence of words as the input of the model. For example, the relation place_of_birth is converted into the sequence
{"place", "of", "birth"}.
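The word-level splitting described above can be illustrated with a minimal sketch. The helper name `relation_to_words` and the choice of splitting on dots and underscores are assumptions made for illustration; the paper does not specify its tokenizer.

```python
import re
from typing import List


def relation_to_words(relation: str) -> List[str]:
    """Split a KB relation identifier into lowercase words.

    Assumption: dots separate the type prefix from the relation name and
    underscores separate the words inside each part.
    """
    return [w for w in re.split(r"[._]", relation.lower()) if w]


# A bare relation name and a Freebase-style relation with its type prefix.
print(relation_to_words("place_of_birth"))
print(relation_to_words("people.person.place_of_birth"))
```

The second call shows how a type prefix lengthens the word sequence that the model has to encode.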
∗ Corresponding author.
E-mail address: huiyingli@seu.edu.cn (H. Li).
https://doi.org/10.1016/j.knosys.2020.106077
0950-7051/© 2020 Elsevier B.V. All rights reserved.
However, this word-level representation often makes the input sequences of relations too long, for two
reasons. First, in some knowledge bases such as Freebase, each relation (e.g., people.person.place_of_birth) consists of two parts:
the type of the subject (people.person) and the genuine relation name
(place_of_birth).1 Second, in order to answer complex questions,
2-hop chains of relations in the KB are often required as the
core relations. The length of the sequence makes it difficult for existing
models to capture the complete semantic information of
the relation. Further, it limits the performance of the existing
attention mechanism, which calculates the attention weight by
the global vector representation of the relation. In fact, different
words in one relation are typically relevant to different parts
of the question. For example in Fig. 1, the word ‘‘education’’ in
the relation corresponds to the words ‘‘university’’ and ‘‘graduated’’ in the question, and the word ‘‘institution’’ corresponds
to the phrase ‘‘what university’’. Therefore, it is unreasonable to
apply the identical attention distribution (from the global vector
representation) to the question for each word in the relation.
This paper overcomes the problem above by proposing a
relation detector based on Transformer [9], which is a completely attention-based neural network. First, we employ the
self-attention mechanism to encode the input question and candidate relations in order to capture the semantic information
of each word from a global perspective. Then, we propose a
fine-grained attention mechanism with multi-head attention to
compute an attention distribution on the question for each word
1 According to our observation, there are a lot of cases where two relations
have the same genuine relation names but different prefixes, so the prefix type
cannot be ignored.
Fig. 1. Entire process of the SP-based KBQA approach. The example question includes only type constraint, which is one kind of common constraints.
in the relation independently. In order to explore whether the
proposed relation detector can help KBQA, we construct a three-stage pipeline system. For entity linking, the system directly uses
the results of the existing methods. For constraint detection, the
system handles three kinds of constraints, including multi-entity
constraint, temporal constraint and type constraint. The main
contributions of this paper are:
• We propose a word-level attention mechanism based on Transformer for relation detection. To the best of our knowledge,
it is the first attempt to perform the attention mechanism
independently for each word of the relation instead of using
the entire relation.
• We build a pipeline system for the KBQA task based on the
proposed model and propose to leverage the same model to
handle the type constraint in the question.
• We perform comprehensive experiments on multiple KBQA
datasets. Our proposed model outperforms previous approaches for relation detection and our pipeline system
produces state-of-the-art results.
2. Related works
Understanding natural language questions is the most important part of KBQA. At present, mainstream KBQA methods can be
roughly classified into three categories: template-based methods,
semantic parsing-based methods, and end-to-end methods.
2.1. Template-based methods
The template-based methods aim to obtain a logical formal
query from the question using predefined templates. Traditional
approaches [10–12] create templates manually, which is costly and
time-consuming. To overcome this problem, Abujabal et al. [13]
propose to generate templates automatically from user questions
paired with their answers. Ding et al. [14] leverage the frequent
query substructure from the training queries and build templates
by combining these substructures. Nonetheless, the template-based methods are still limited by the coverage of the templates
and are extremely vulnerable to unconventional
expressions.
2.2. End-to-end methods
An important line of research for KBQA constructs an end-to-end system by deep learning powered similarity matching. These
methods [15–17] aim to design an end-to-end neural network
model to rank candidate answers. To enhance the KB context of
candidate answers, Hao et al. [18] apply a cross-attention mechanism to integrate the information into the vector representation
of the answers. Zhao et al. [19] focus on the simple questions
and propose a joint-score for training both entities and relations.
These end-to-end methods typically have only a single process,
which avoids complex NLP pipelines and error propagation. However,
their results are not explainable.
2.3. Semantic parsing-based methods
SP-based methods aim to translate questions into logical formal queries. Traditional SP-based approaches [20,21]
require complex NLP pipelines, including dependency parsing and
vocabulary mapping. With the maturity of deep learning, recent work focuses on building semantic parsing frameworks with
neural networks to improve the performance. Yih et al. [22]
propose a query graph that can be mapped to a logical form. Then,
the traditional semantic parsing is simplified into a pipeline for
staged query graph generation. This framework is widely applied
in recent research [4–7], including our work. Yu et al. [5] propose
a hierarchical recursive model enhanced by residual learning
for relation detection. Maheshwari et al. [7] propose a relation
detection model with a slot-matching attention mechanism that
treats each relation name as a slot and focuses on different parts
of the question for each slot. In fact, recent work [23,24] focuses
on capturing the semantic information of entire sentences at the
word level. Inspired by them, we propose a fine-grained attention mechanism at the word level to enhance the matching
of questions and relations. In addition to the staged generation
methods, Luo et al. [6] propose to rank entire candidate query
graphs by a single representation model with dependency parsing. Sorokin et al. [25] construct a state transition-based query
graph generation method by using the gated graph neural network to represent the query graph. To further introduce external
knowledge, Deng et al. [8] leverage the multi-task learning to
combine the KBQA task (relation detection) and Answer Selection
(AS) task. In comparison with the end-to-end methods, SP-based
methods can provide a more explainable result by the logical
form of the question.
3. Transformer-based relation detector
This section describes our transformer-based relation detection model.
3.1. Task definition
Initially, we provide a unified description of relation detection.
Definition 1. The Core Relation of the question is a relation (or
chain of relations2) in the KB that connects the topic entity in the
question with the answers.
For an input question q, assume that its topic entity (e.g.,
"m.0271_s" in Fig. 1) has been linked and is denoted by $e_q$. Then, the
core relation $r_q$ can be selected by
$$ r_q = \arg\max_{r \in R_e} p_{rel}(r; q) \tag{1} $$
where $p_{rel}(r; q)$ is the matching score of r and q. $R_e$ denotes the
candidate set of relations, which consists of all relation chains
(within 2-hop) connected with $e_q$. In this section, we detail
three models for calculating $p_{rel}(r; q)$, the first two of which are
baselines, and the last one is our proposed model.
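The selection rule of Eq. (1) can be sketched as a plain arg max over candidates. The scoring function `toy_score` below is a hypothetical word-overlap stand-in for $p_{rel}$, used only so the example runs; it is not one of the models discussed in this section.

```python
from typing import Callable, Iterable


def toy_score(relation: str, question: str) -> float:
    """Hypothetical stand-in for p_rel: share of relation words found in the question."""
    rel_words = relation.replace(".", "_").split("_")
    q_words = set(question.lower().rstrip("?").split())
    return sum(w in q_words for w in rel_words) / len(rel_words)


def select_core_relation(
    question: str,
    candidates: Iterable[str],
    score: Callable[[str, str], float] = toy_score,
) -> str:
    """Eq. (1): return the candidate relation with the highest matching score."""
    return max(candidates, key=lambda r: score(r, question))


candidates = ["place_of_birth", "date_of_birth", "place_of_death"]
best = select_core_relation("what is the place of birth of obama?", candidates)
print(best)
```

Any of the three models in this section could be plugged in through the `score` argument.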
3.2. Baseline sequence matching model
Following the common practice, each relation r and question q are first broken into sequences of words. Then, they are
converted into word embeddings $R = [r_1, r_2, \ldots, r_n]$ and $Q = [q_1, q_2, \ldots, q_m]$, respectively. Here, n (m) is the number of words in
the relation (question), and $r_i \in \mathbb{R}^d$ and $q_i \in \mathbb{R}^d$ are word vectors.
Thereafter, the matching score is calculated by
$$ p_{rel}(r; q) = h(g(f(R)), g(f(Q))) \tag{2} $$
Here, $f: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ denotes a neural network (e.g., BiLSTM)
which transforms the word embeddings into d-dimensional semantic vectors. $g: \mathbb{R}^{n \times d} \to \mathbb{R}^d$ denotes an aggregation operation
(e.g., pooling), which transforms the semantic vectors into the
global vector representation of the input sequence. $h: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$
denotes a distance function (e.g., cosine distance) used
to measure the similarity of two vector representations.
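The composition h(g(f(R)), g(f(Q))) of Eq. (2) can be sketched with toy components. In this sketch the encoder f is assumed to be the identity (a real system would use, e.g., a BiLSTM), g is max-pooling, and h is the cosine of the two pooled vectors; all function names are illustrative.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def encode(X: Matrix) -> Matrix:
    """f: placeholder encoder (identity), assumed here in place of a BiLSTM."""
    return X


def max_pool(X: Matrix) -> Vector:
    """g: column-wise maximum over the sequence dimension."""
    return [max(col) for col in zip(*X)]


def cosine(u: Vector, v: Vector) -> float:
    """h: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def p_rel(R: Matrix, Q: Matrix) -> float:
    """Eq. (2): match the pooled relation vector against the pooled question vector."""
    return cosine(max_pool(encode(R)), max_pool(encode(Q)))


R = [[1.0, 0.0], [0.0, 1.0]]              # n = 2 relation words, d = 2
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # m = 3 question words
print(p_rel(R, Q))
```

Both sequences pool to the same vector in this toy input, so the score is (up to rounding) 1.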
3.3. Coarse-grained attention-based sequence matching model
In order to enhance the matching of q and r, the attention
mechanism is applied in recent work [7,8]. It aims to make the
model focus on the parts of q that are relevant to r. By employing
it, the matching score is calculated by
$$ p_{rel}(r; q) = h(g(f(R)), a(X, y)) \tag{3} $$
where $X = f(Q) \in \mathbb{R}^{m \times d}$, $y = g(f(R)) \in \mathbb{R}^d$, and $a: \mathbb{R}^{m \times d} \times \mathbb{R}^d \to \mathbb{R}^d$
denotes the attention mechanism. Specifically, $a(X, y) \in \mathbb{R}^d$ is
calculated by
$$ a(X, y) = \sum_{i=1}^{m} \alpha_i x_i \tag{4} $$
$$ \alpha_i = \frac{\exp(u(x_i, y))}{\sum_{j=1}^{m} \exp(u(x_j, y))} \tag{5} $$
where $\alpha_i$ denotes the attention weight of $x_i$ in X. $u: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$
is an attention function (e.g., feed-forward network) used
to calculate the weights by the semantic relevance between $x_i$
and y.
2 In this paper, we focus on the chain within 2-hop, which meets the
requirements of almost all existing KBQA datasets.
In this framework, the model first obtains the global vector
representation of r by the aggregation operation g, and then
computes the attention weights of q by the global vector representation. By this mechanism, the parts of q that are relevant
to r receive more attention. However, if r is a long sequence, its words are typically relevant to different parts
of q. Utilizing only the global representation of r
is too coarse-grained to integrate the relevance of different
word pairs.
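A small sketch of the coarse-grained attention of Eqs. (4) and (5) makes the limitation concrete: a single weight vector over the question is derived from one global relation vector y. The attention function u is assumed to be a dot product here purely for brevity; the text allows other choices such as a feed-forward network.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax (Eq. (5))."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_attention(X: Matrix, y: Vector) -> Tuple[Vector, Vector]:
    """Eq. (4): weighted sum of question vectors under one global relation vector."""
    # u(x_i, y) is assumed to be a dot product in this sketch.
    weights = softmax([sum(a * b for a, b in zip(x, y)) for x in X])
    pooled = [sum(w * x[k] for w, x in zip(weights, X)) for k in range(len(y))]
    return pooled, weights


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 question vectors
y = [2.0, 0.0]                            # global relation vector
pooled, weights = coarse_attention(X, y)
print(weights)
```

Whatever the length of the relation, only this one distribution over the question is produced.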
3.4. Deep attentive sequence matching model
To build a fine-grained attention mechanism, we apply the
scaled dot-product attention used in Transformer to independently
calculate the attention weights of q for each word in r. Specifically, let $Q \in \mathbb{R}^{m \times d}$ and $R \in \mathbb{R}^{n \times d}$ denote the embedding vectors
of q and r, respectively, which are obtained by function f. Then,
the mechanism, denoted by $a^*$, can be described by
$$ a^*(Q, R) = W^a Q W^v \tag{6} $$
$$ W^a = \mathrm{softmax}\left( \frac{(R W^q)(Q W^k)^T}{\sqrt{d}} \right) \tag{7} $$
Here, $W^q \in \mathbb{R}^{d \times d}$, $W^k \in \mathbb{R}^{d \times d}$, and $W^v \in \mathbb{R}^{d \times d}$ are the trainable
parameter matrices for affine transformation. $W^a \in \mathbb{R}^{n \times m}$ is the
attention matrix of q and r. The ith row of $W^a$ is the attention
distribution on q from the ith word of r. The output of $a^*(Q, R)$
is denoted by $\bar{Q} = [\bar{q}_1, \bar{q}_2, \ldots, \bar{q}_n] \in \mathbb{R}^{n \times d}$, where $\bar{q}_i$ denotes the
weighted sum of the vectors in Q under the attention from the
ith word of r.
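Eqs. (6) and (7) can be traced on a toy input with the following sketch, written in plain Python so that the matrix shapes stay visible. The projection matrices are set to the identity as an assumption for readability; in the model they are trained.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Matrix product of A (p x q) and B (q x r)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(A: Matrix) -> Matrix:
    """Numerically stable softmax applied to every row."""
    out = []
    for row in A:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def fine_grained_attention(Q: Matrix, R: Matrix, Wq: Matrix, Wk: Matrix,
                           Wv: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (Q_bar, W_a): one attention row over the question per relation word."""
    d = len(Q[0])
    keys_t = [list(col) for col in zip(*matmul(Q, Wk))]                      # d x m
    scores = matmul(matmul(R, Wq), keys_t)                                   # n x m
    Wa = softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])   # Eq. (7)
    return matmul(matmul(Wa, Q), Wv), Wa                                     # Eq. (6)


I2 = [[1.0, 0.0], [0.0, 1.0]]              # identity projections (assumption)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 question words
R = [[4.0, 0.0], [0.0, 4.0]]               # n = 2 relation words
Q_bar, Wa = fine_grained_attention(Q, R, I2, I2, I2)
for row in Wa:
    print([round(w, 3) for w in row])
```

Each relation word obtains its own row of weights, in contrast to the single distribution of the coarse-grained model.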
Following Transformer, we apply the multi-head mechanism to
calculate $a^*(Q, R)$ in different vector spaces, which is formulated
by
$$ m(Q, R) = \mathrm{concat}(head_1, \ldots, head_\eta) W^o, \quad \text{where } head_i = a^*_i(Q, R) \tag{8} $$
Here, $\eta$ is the number of heads, each $head_i$ has its own independent trainable parameters, and $W^o \in \mathbb{R}^{\eta d \times d}$ is the affine transformation to
combine the output from different vector spaces.
In addition to using the multi-head attention to perform a
fine-grained question-relation (Q-R) attention, we also employ it
as f to encode q and r, i.e., utilizing self-attention. For example,
for the input question matrix $Q \in \mathbb{R}^{m \times d}$, the output of the self-attention is denoted by $s(Q) \in \mathbb{R}^{m \times d}$, which is calculated by
$$ s(Q) = m(Q, Q) \tag{9} $$
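The multi-head combination and its reuse as self-attention can be sketched as follows. The two toy heads (identity and coordinate-swap projections) and the averaging output matrix are assumptions chosen only to keep the example small; they are not trained parameters.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
HeadParams = Tuple[Matrix, Matrix, Matrix]  # (W^q, W^k, W^v) of one head


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Matrix product of A (p x q) and B (q x r)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(A: Matrix) -> Matrix:
    """Numerically stable softmax applied to every row."""
    out = []
    for row in A:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(Q: Matrix, R: Matrix, Wq: Matrix, Wk: Matrix, Wv: Matrix) -> Matrix:
    """One head: rows of R attend over rows of Q (scaled dot-product attention)."""
    d = len(Q[0])
    keys_t = [list(col) for col in zip(*matmul(Q, Wk))]
    scores = matmul(matmul(R, Wq), keys_t)
    Wa = softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])
    return matmul(matmul(Wa, Q), Wv)


def multi_head(Q: Matrix, R: Matrix, heads: Sequence[HeadParams], Wo: Matrix) -> Matrix:
    """Concatenate the outputs of all heads and project them with W^o."""
    outputs = [attention_head(Q, R, *params) for params in heads]
    concat = [[x for out in outputs for x in out[i]] for i in range(len(R))]
    return matmul(concat, Wo)


I2 = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]
heads = [(I2, I2, I2), (SWAP, SWAP, I2)]                 # eta = 2 toy heads
Wo = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]    # (eta * d) x d

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                 # m = 3
R = [[4.0, 0.0], [0.0, 4.0]]                             # n = 2
q_r_attention = multi_head(Q, R, heads, Wo)              # Q-R attention, n x d
self_attention = multi_head(Q, Q, heads, Wo)             # self-attention on Q, m x d
print(len(q_r_attention), len(self_attention))
```

Passing the same matrix as both arguments turns the Q-R attention into self-attention, which is why one mechanism can serve both roles.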
In this mechanism, each head obtains a self-attention matrix $W^s \in \mathbb{R}^{m \times m}$, which is similar to $W^a$ in (6). Through $W^s$, each
vector of the input sequence is directly affected by all the
vectors (including itself) in the same sequence. Compared with
sequence models (e.g., BiLSTM), self-attention removes
the limitation imposed by the distance between two words, so it
performs well on long sequences. On the basis
of the mechanisms above, we propose the Deep Attentive Matching
(DAM) model for relation detection. It consists of an encoder and
a decoder, as illustrated in Fig. 2.
Fig. 2. Architecture of DAM.
The encoder aims to obtain the semantic vectors of q. It consists of l identical layers, where each layer mainly contains a self-attention sub-layer and a feed-forward network sub-layer. There
is a residual connection [26] around each of the two sub-layers,
followed by layer normalization. That is, for an arbitrary matrix
Z, the output of each sub-layer is LayerNorm(Z + Sublayer(Z)),
where Sublayer(Z) is the function implemented by the sub-layer
itself. Formally, by ignoring residual connection and normalization, the ith encoder layer $e_i$ can be formulated by
$$ e_i(Q) = \begin{cases} f_{e_1}(s_{e_1}(Q)), & i = 1 \\ f_{e_i}(s_{e_i}(e_{i-1}(Q))), & i \geq 2 \end{cases} \tag{10} $$
where $s_{e_i}$ and $f_{e_i}$ denote the self-attention sub-layer and feed-forward network sub-layer of $e_i$, respectively. The output of the last layer, $e_l(Q)$,
can be regarded as the semantic matrix of q, denoted by $\hat{Q} \in \mathbb{R}^{m \times d}$.
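The sub-layer wrapper LayerNorm(Z + Sublayer(Z)) and the stacking of encoder layers can be sketched as below. The layer normalization omits the learned gain and bias, and the two sub-layers are toy functions (identity and doubling); both simplifications are assumptions made to keep the sketch self-contained.

```python
import math
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]
SubLayer = Callable[[Matrix], Matrix]


def layer_norm(Z: Matrix, eps: float = 1e-6) -> Matrix:
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    out = []
    for row in Z:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / math.sqrt(var + eps) for x in row])
    return out


def residual_sublayer(Z: Matrix, sublayer: SubLayer) -> Matrix:
    """LayerNorm(Z + Sublayer(Z))."""
    S = sublayer(Z)
    summed = [[z + s for z, s in zip(z_row, s_row)] for z_row, s_row in zip(Z, S)]
    return layer_norm(summed)


def encoder(Q: Matrix, layers: Sequence[Tuple[SubLayer, SubLayer]]) -> Matrix:
    """Apply the layers in order: self-attention sub-layer, then feed-forward sub-layer."""
    for self_attention, feed_forward in layers:
        Q = residual_sublayer(Q, self_attention)
        Q = residual_sublayer(Q, feed_forward)
    return Q


def identity(Z: Matrix) -> Matrix:
    """Toy stand-in for the self-attention sub-layer."""
    return Z


def double(Z: Matrix) -> Matrix:
    """Toy stand-in for the feed-forward sub-layer."""
    return [[2.0 * x for x in row] for row in Z]


Q = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
Q_hat = encoder(Q, [(identity, double), (identity, double)])  # two toy layers
print([[round(x, 3) for x in row] for row in Q_hat])
```

The output keeps the shape of the input, which is what allows identical layers to be stacked.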
The decoder has two functions: one is to obtain the semantic
vectors of r, and the other is to perform the Q-R attention.
Similar to the encoder, it consists of l identical layers, where each
layer mainly contains a self-attention sub-layer, a multi-head
Q-R attention sub-layer, and a feed-forward network sub-layer.
The residual connection and the layer normalization are also
employed. Formally, the ith decoder layer $d_i$ can be formulated
by
$$ d_i(R, Q) = \begin{cases} f_{d_1}(m_{d_1}(\hat{Q}, s_{d_1}(R))), & i = 1 \\ f_{d_i}(m_{d_i}(\hat{Q}, s_{d_i}(d_{i-1}(R, Q)))), & i \geq 2 \end{cases} \tag{11} $$
where $s_{d_i}$, $m_{d_i}$, and $f_{d_i}$ denote the self-attention sub-layer, multi-head Q-R attention sub-layer, and feed-forward network sub-layer of $d_i$, respectively. We take the output of $s_{d_1}(R)$ as the
semantic vectors of r, denoted by $\hat{R} \in \mathbb{R}^{n \times d}$, and take $s_{d_l}(R, Q)$
as the attentive vectors of q, denoted by $\tilde{Q} = [\tilde{q}_1, \ldots, \tilde{q}_n]$. Here,
$\tilde{q}_i \in \mathbb{R}^d$ is the weighted sum of $\hat{Q}$ under the attention of the ith word
in r.
For the inputs to the encoder and decoder, we follow [9] to
adopt the positional encoding for recording the position information of each word. Finally, the matching score is calculated by
$$ p_{rel}(r; q) = h(g(\hat{R}), g(\tilde{Q})) \tag{12} $$
where g is a max-pooling, and h is the cosine distance.
In comparison with the coarse-grained attention-based model,
our DAM has three differences: (1) DAM employs a fine-grained
Q-R attention mechanism to make the model focus on different
parts of the question for each word in the relation. (2) DAM
utilizes the self-attention to encode the question and relation
instead of the traditional sequence model (e.g., BiLSTM), so it
is beneficial for handling long sequences. (3) DAM is a deep
structure, so it has a stronger capability of representation
than other shallow models.
3.5. Training
During training, we adopt the hinge loss to maximize the
margin between the gold relations $r^+$ and the negative relations
$r^-$:
$$ l_r(q) = \sum_{r^- \in R^-} \max(0, \gamma + p_{rel}(r^-; q) - p_{rel}(r^+; q)) \tag{13} $$
where $R^-$ is the set of negative relations sampled from the
candidate set $R_e$.
4. KBQA Pipeline system
This section describes our KBQA pipeline system based on the
proposed relation detector. The system consists of three stages,
namely (1) entity linking, (2) relation detection, and (3) constraint
detection.
4.1. Entity linking
For each question q, we follow previous work and use the results
of existing entity linking tools.3 The top-t linked entities
$E_t$ ranked by the score $p_{ent}(e; q)$ are passed to the
subsequent stages.
4.2. Relation detection
First, we construct the relation candidate set $R_e$ for each candidate
entity $e \in E_t$. Note that $r \in R_e$ can be a single relation or a
2-hop relation chain. Then, the entity-relation pair $(e_q, r_q)$ for
the question is selected by
$$ (e_q, r_q) = \arg\max_{e \in E_t} \max_{r \in R_e} p_{rel}(r; q) \tag{14} $$
Here, if multiple pairs have the identical $p_{rel}$, we select
the one with the highest $p_{ent}$. After that, the base formal query is
built, denoted by $\lambda_b = \{\langle e_q, r_q, ?x \rangle\}$,4 where ?x denotes the answer
variable that needs to be grounded to the KB.
3 For LC-QuAD, we directly use the gold entity for a fair comparison with
previous work. For WebQSP and CompQ, we utilize the results of S-MART [27].
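The pair selection of Eq. (14), including the tie-break on the entity linking score, can be sketched as follows. The entity identifiers, relation names, and scores are hypothetical values chosen so that a tie occurs.

```python
from typing import Dict, List, Tuple


def select_entity_relation(
    p_ent: Dict[str, float],
    p_rel: Dict[str, Dict[str, float]],
) -> Tuple[str, str]:
    """Pick the (entity, relation) pair with the highest p_rel; break ties by p_ent."""
    pairs = [(e, r) for e, relations in p_rel.items() for r in relations]
    return max(pairs, key=lambda pair: (p_rel[pair[0]][pair[1]], p_ent[pair[0]]))


# Hypothetical linking scores for the top-t entities and matching scores
# for the candidate relations of each entity.
p_ent = {"m.01": 0.9, "m.02": 0.6}
p_rel = {
    "m.01": {"r_a": 0.7, "r_b": 0.8},
    "m.02": {"r_c": 0.8},
}

e_q, r_q = select_entity_relation(p_ent, p_rel)
base_query: List[Tuple[str, str, str]] = [(e_q, r_q, "?x")]  # base query as one triple
print(base_query)
```

Sorting on the tuple (relation score, entity score) implements the tie-break without a second pass.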
4 If r is a 2-hop chain, $\lambda_b = \{\langle e_q, r_q^0, ?y \rangle, \langle ?y, r_q^1, ?x \rangle\}$, where ?y is a hidden
variable (e.g., CVT node in Freebase). Note that the direction of the relation is
not represented in the examples.
4.3. Constraint detection
To deal with complex questions, it is necessary to attach
constraints to $\lambda_b$ to represent the complete intentions of the
question. Our KBQA system deals with the following three kinds
of constraints: multi-entity constraint, temporal constraint, and
type constraint. After all three constraints are handled, the
complete query is obtained, denoted by $\lambda$. Finally, the answers
are retrieved by executing $\lambda$.
4.3.1. Multi-entity constraint
A multi-entity constraint arises when there is more than one
entity in the question. For example, the question "Who plays
ken barlow in coronation street?" has two entities ("ken barlow" and "coronation street"). Following [5], we handle this constraint by text matching. Specifically,
given a question q and its corresponding query $\lambda_b$, we first retrieve the answers A (including the grounded results of ?x and
the hidden variable ?y) by executing $\lambda_b$ in the KB. Then, we collect all
the neighbor entities of each answer $a \in A$, whose literal name is
denoted by e. Thereafter, we compute the matching score $\phi(e; q)$
between e and q by
$$ \phi(e; q) = \max_{n(q)} \left( \frac{lcs(n(q), e)}{|q|} + \frac{lcs(n(q), e)}{|e|} \right) \tag{15} $$
where n(x) denotes the n-gram of string x, |x| denotes the length
of string x, and lcs(x, y) denotes the length of the Longest Consecutive Common Subsequence between the strings x and y. Finally, if
the score of e is greater than the threshold $\theta_e$, it is added
to $\lambda_b$ by attaching it to ?y. The new query can be described by
$\{\langle e_q, r_q^0, ?y \rangle, \langle ?y, r_q^1, ?x \rangle, \langle ?y, r_c, e \rangle\}$, where $r_c$ is the relation that
connects a with e.
4.3.2. Temporal constraint
A temporal constraint arises when there is temporal information
in the question. For example, the question "Who is the governor of ohio 2011?" has a constraint "2011". Since the temporal
constraint usually takes the form of a year in the datasets we use,5
we adopt a simple but efficient strategy. First, the year
in the question is detected by rules (regular expressions), denoted
by $y_c$. Then, we attach two variables $?d_s$ (start date) and $?d_e$ (end
date) with the relations from and to to $\lambda_b$. The new query can
be described by $\{\langle e_q, r_q^0, ?y \rangle, \langle ?y, r_q^1, ?x \rangle, \langle ?y, from, ?d_s \rangle, \langle ?d_s, \leq, y_c \rangle, \langle ?y, to, ?d_e \rangle, \langle ?d_e, \geq, y_c \rangle\}$. Note that the type
prefixes of from and to are omitted here.
5 The questions in SimpQ and LC-QuAD (DBpedia) do not have temporal
constraints.
4.3.3. Type constraint
A type constraint arises when there is some information in the
question that indicates the types of the answers. For example, in Fig. 1, only the entity "m.0l2tk" with the type "College/University" is the correct answer. To handle this case, we
employ the DAM again to predict the types of answers (see Fig. 3).
Fig. 3. Architecture of DAM for type detection.
For question q, we first collect the types of the answers retrieved by $\lambda_b$, denoted by $\mathcal{T}$. Then, for each candidate type $t \in \mathcal{T}$,
we calculate its matching score by the DAM, denoted by $\psi(t; q)$.
Finally, if the score of t is greater than the threshold $\theta_t$, it is
added to $\lambda_b$ by attaching it to ?x. The new query can be described
by $\{\langle e_q, r_q, ?x \rangle, \langle ?x, rdf{:}type, t \rangle\}$. Specifically, $\psi(t; q)$ is calculated
by
$$ \psi(t; q) = \mathrm{sigmoid}(\mathrm{concat}(\hat{T}, \tilde{Q}) W) \tag{16} $$
$$ \hat{T} = g(s_{d_1}(T)) \tag{17} $$
$$ \tilde{Q} = g(d_l(T; e_l(Q))) \tag{18} $$
where $T \in \mathbb{R}^{z \times d}$ and $Q \in \mathbb{R}^{m \times d}$ are the word embeddings of t and
q with positional encoding, respectively. $W \in \mathbb{R}^{d \times 1}$ is a trainable
parameter vector.
During training, we minimize the cross entropy loss, which is
calculated by
$$ l_t(q) = -\sum_{t \in \mathcal{T}} \left[ \rho \log \psi(t; q) + (1 - \rho) \log (1 - \psi(t; q)) \right] \tag{19} $$
where $\rho \in \{0, 1\}$ is the ground truth for training. If t is the type
of the correct answers to q, $\rho$ is 1, otherwise $\rho$ is 0.
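The type constraint can be illustrated with a short sketch of the training loss and the threshold test. The loss is written in the usual negative log-likelihood form of binary cross entropy so that minimizing it is meaningful; the candidate types and their scores are hypothetical, and the threshold of 0.5 is the value reported in the experimental settings.

```python
import math
from typing import Dict, List, Set, Tuple


def type_loss(scores: Dict[str, float], gold: Set[str], eps: float = 1e-12) -> float:
    """Binary cross entropy summed over the candidate types of one question."""
    loss = 0.0
    for t, psi in scores.items():
        rho = 1.0 if t in gold else 0.0  # ground-truth label of type t
        loss -= rho * math.log(psi + eps) + (1.0 - rho) * math.log(1.0 - psi + eps)
    return loss


def select_types(scores: Dict[str, float], theta_t: float = 0.5) -> List[str]:
    """Keep every candidate type whose score exceeds the threshold."""
    return [t for t, psi in scores.items() if psi > theta_t]


# Hypothetical DAM scores psi(t; q) for two candidate answer types.
scores = {"College/University": 0.9, "Location": 0.2}
gold = {"College/University"}

selected = select_types(scores)
# Each selected type becomes one extra triple attached to the answer variable.
type_triples: List[Tuple[str, str, str]] = [("?x", "rdf:type", t) for t in selected]
print(type_triples, round(type_loss(scores, gold), 4))
```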
5. Experiments
This section describes the experiments to evaluate our relation
detector and KBQA system.
5.1. Datasets and evaluation metrics
Our models are trained and evaluated over the following four
KBQA datasets:
SimpleQuestions (SimpQ) is a single-relation (1-hop) KBQA task.
It consists of more than 100K questions and the fact of each
question can be described by a triple in Freebase.
WebQuestionsSP (WebQSP) is a multi-relation (1-hop and 2-hop)
KBQA task over Freebase, which contains full semantic parses for
a subset of the questions from the WebQuestions [20] dataset. It
is split into 3098 training and 1639 test questions.
LC-QuAD is a multi-relation (1-hop and 2-hop) complex KBQA
task over DBpedia, having 5000 question-SPARQL pairs. It is split
into 4000 training and 1000 testing questions.
ComplexQuestions (CompQ) is a multi-relation (1-hop and 2-hop) complex KBQA task over Freebase, which contains 2100
complex questions collected from Bing search query log. It is split
into 1300 training and 800 testing questions.
For relation detection, we follow previous work [5,8] to adopt
Accuracy as the evaluation metric. That is, a result is considered
to be correct if it matches one of the ground truth relations. For
KBQA task, we follow the common practice to adopt Precision,
Recall and F1-score as the evaluation metrics.
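The evaluation metrics can be sketched as follows: relation detection accuracy counts a prediction as correct when it matches any gold relation, and precision, recall, and F1 compare a predicted answer set with the gold answer set of one question. How the per-question values are aggregated over a dataset is not stated here, so the sketch stops at a single question; all inputs are hypothetical.

```python
from typing import List, Set, Tuple


def relation_accuracy(predicted: List[str], gold: List[Set[str]]) -> float:
    """Share of questions whose predicted relation is among the gold relations."""
    hits = sum(1 for p, g in zip(predicted, gold) if p in g)
    return hits / len(predicted)


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of a predicted answer set for one question."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


acc = relation_accuracy(["r1", "r2"], [{"r1"}, {"r3", "r4"}])
p, r, f1 = precision_recall_f1({"a", "b"}, {"a", "c", "d"})
print(acc, round(p, 3), round(r, 3), round(f1, 3))
```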
Table 1
Accuracy for relation detection on the four datasets.

Model                                     SimpQ   WebQSP   LC-QuAD   CompQ
CNN                                       89.8    78.1     44.0      52.5
BiCNN [22]                                90.0    77.7     –         –
BiLSTM                                    91.2    79.3     61.0      56.8
HR-BiLSTM [5]                             93.3    82.5     62.0      57.4
MVA-MTQA-net [8]                          93.1    82.3     –         –
Slot-Matching [7]                         93.7    82.8     63.0      57.8

BERT(base) [29]                           90.9    84.0     62.3      57.8
BERT(large) [29]                          91.2    83.8     62.5      57.9

DAM                                       93.3    84.1     63.2      58.5
  w/o Q-R attention                       92.4    82.3     61.9      55.2
  coarse-grained Q-R attention            93.1    82.7     62.0      57.5
  replacing self-attention with BiLSTM    92.4    81.5     61.5      56.8
5.2. Experimental settings
During the training of the DAM, all word embeddings are
initialized using the pre-trained GloVe6 with 300 dimensions. The
number of layers l is set to 6 and the head number η is set to 8.
The size of hidden states d is set to 512. Dropout is adopted in
each sub-layer, with a probability of 0.1. The loss margin γ
in (13) is set to 0.1. All parameters are trained using the Adam [28]
optimizer with a learning rate of $5 \times 10^{-5}$ and a mini-batch
size of 16. The training process lasts 30 epochs. The threshold $\theta_e$ for
the multi-entity constraint is set to 0.1 and $\theta_t$ for the type constraint is
set to 0.5. The number of negative samples is set to 50.
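To make the role of the margin concrete, the hinge loss of (13) can be sketched with the margin of 0.1 used in these settings. The scores below are hypothetical, and only three negatives are shown instead of the 50 sampled during training.

```python
from typing import List


def hinge_loss(pos_score: float, neg_scores: List[float], margin: float = 0.1) -> float:
    """Sum of margin violations between the gold relation and each negative relation."""
    return sum(max(0.0, margin + neg - pos_score) for neg in neg_scores)


# Hypothetical matching scores: one gold relation and three sampled negatives.
pos = 0.8
negs = [0.75, 0.5, 0.9]
loss = hinge_loss(pos, negs)
print(round(loss, 4))
```

Only negatives that score within the margin of the gold relation, or above it, contribute to the loss; the second negative here contributes nothing.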
5.3. Relation detection results and analysis
We first evaluate the performance of the proposed DAM model
for the relation detection tasks on the four datasets.
5.3.1. Comparison with baselines
We compared the DAM model with several baselines. CNN
and BiLSTM are the most basic models for the semantic matching
of sequences. BiCNN is proposed by [22], where both questions
and relations are represented with a word hashing of character
tri-grams. HR-BiLSTM [5] considers both word- and name-level
representations of the relation and utilizes the Residual Connection with two layers of the BiLSTM for the hierarchical matching
of the two levels. MVA-MTQA-net [8] leverages the multi-task
learning of relation detection and answer selection to introduce
external information. To focus on the performance of models, we
only compare with their single-task results of relation detection.
Slot-Matching is proposed by [7], which applies a slot-matching
mechanism to make each relation name pay different attention
to the question.
The experimental results are reported in Table 1. Our relation
detector outperforms previous models on WebQSP, LC-QuAD, and
CompQ, and ranks second only on SimpQ. Overall, it can be
considered that our model achieves state-of-the-art results.
The performance of CNN and BiCNN is limited by the size
of their local convolutional windows, so they fail to obtain the
global semantic information of the sequence when encoding each
token. BiLSTM benefits from its recurrent structure, which utilizes
the forward and backward context, but the same structure makes it
struggle with long sequences. Both HR-BiLSTM and MVA-MTQA-net
utilize the double-level input of the relation, namely word-level
and relation(knowledge)-level, to enhance the representation of
the relation, so they achieve better results. Slot-Matching considers each relation name in the chain as a slot and makes each slot
focus on different parts of the question. Therefore, it outperforms
the previous models. However, its attention mechanism, which operates at the
relation-name level, is too coarse-grained for long relations. Our
model employs the self-attention mechanism instead of convolutional or recurrent networks, so it is able to encode each word
from a global perspective. More importantly, the fine-grained Q-R attention mechanism helps our model to perform semantic
matching more precisely. The main reason for ranking second on
the SimpQ dataset is that all the questions in SimpQ correspond to
single relations, so the input sequence is too short to highlight the
advantages of our model.
6 http://nlp.stanford.edu/projects/glove.
5.3.2. Comparison with pre-trained model
We also compared our proposed DAM with BERT [29], which
is a pre-trained Transformer network. Specifically, we fine-tune
BERT by using it as f to encode questions and relations. In the
experiments, we evaluated the base and large versions of BERT
separately.
The experimental results are shown in Table 1. On relation
detection, neither the base nor the large version of BERT performs
as strongly as it does in other areas. Possible reasons
include: (1) The questions in the datasets are too short for the
input of BERT (512 tokens). Note the worse performance on the
dataset with shorter questions (SimpQ). (2) BERT is pre-trained
on unstructured text, but the KB consists of structured data.
Typically, relations in the KB have specific naming rules, which
contain the structured information of the KB. Therefore, the prior
knowledge obtained by BERT from unstructured text is less
effective for encoding relations.
5.3.3. Ablation tests of DAM
To explore the contribution brought by each component of
DAM, we evaluate the performance of DAM in the following
settings.
• w/o Q-R attention We removed the attention mechanism
between the relation and the question. It can be formalized
by (2), where, f denotes a transformer encoder, g denotes a
max-pooling, and h denotes a cosine distance.
• coarse-grained Q-R attention We calculate the attention
weights on the question by utilizing the global vector representation of the relation. It can be formalized by (3),
(4) and (5), where f denotes a transformer encoder, g denotes a max-pooling, h denotes a cosine distance and u is a
two-layer feed-forward network.
• replacing self-attention with BiLSTM We apply BiLSTM to
encode questions and relations instead of the transformer
encoder.
The experimental results are shown at the bottom of Table 1.
Removing the Q-R attention lowers the accuracy on all the datasets, by 0.9 to 3.3 points. The coarse-grained (name-level) attention makes
the performance drop by about 1% on WebQSP and CompQ. The main
reason for the smaller drops on SimpQ and LC-QuAD is that the
relations of these two datasets are typically shorter.7 For a similar
reason, the BiLSTM encoder brings a less significant drop
on SimpQ and LC-QuAD.
To intuitively reflect the effectiveness of the fine-grained Q-R attention mechanism in DAM, we visualize some rows in the
attention matrix $W^a$ in (6). The heat maps are shown in Fig. 4.
The results show that the words "education" and "institution" in
the relation learn different attention distributions over the words
of the question, which is approximately consistent with our expectations.
7 SimpQ only has single relations. LC-QuAD relies on DBpedia, whose
relations do not have type prefixes.
Table 2
Precision, recall, and F1-score on WebQSP, LC-QuAD, and CompQ for KBQA task.

                     WebQSP              LC-QuAD             CompQ
Model                P     R     F1      P     R     F1      P     R     F1
STAGG [22]           0.79  0.72  0.67    0.63  0.75  0.69    0.40  0.51  0.37
HR-BiLSTM [5]        0.79  0.73  0.68    0.64  0.77  0.70    0.39  0.52  0.37
GGNN [25]            0.82  0.73  0.71    0.66  0.78  0.71    0.41  0.55  0.42
CompQA [6]           0.79  0.75  0.69    0.64  0.75  0.69    0.41  0.56  0.43
Slot-Matching [7]    0.80  0.75  0.70    0.65  0.78  0.71    0.40  0.55  0.41

Our system           0.81  0.74  0.70    0.65  0.80  0.72    0.40  0.57  0.43
  w/o multi-ent      0.77  0.75  0.67    0.63  0.80  0.69    0.37  0.57  0.40
  w/o temporal       0.78  0.75  0.68    –     –     –       0.38  0.57  0.41
  w/o type           0.78  0.76  0.69    0.62  0.81  0.68    0.38  0.58  0.41
6. Conclusion
In this paper, we propose a deep attentive relation detection
model, DAM, based on the Transformer. It applies a fine-grained
attention mechanism that calculates attention weights over
the question for each word of the relation. Our proposed DAM
achieves state-of-the-art results for relation detection on multiple
datasets. The experimental results also demonstrate that the DAM
can help the SP-based KBQA system to be more competitive.
CRediT authorship contribution statement
Fig. 4. Two visualized attention distributions on the question for the words
‘‘education’’ (Left) and ‘‘institution’’ (Right) of the relation. The size of each heat
map is t × m, where the ith row denotes the weights on the question from the ith
attention head. Within a row, a darker color denotes a higher weight.
5.4. KBQA Results and analysis
We compared our DAM-based KBQA pipeline system with several state-of-the-art SP-based approaches. We do not use SimpQ
because it consists of only simple questions without constraints,
which cannot reflect the performance of constraint detection in
our system. The first block of Table 2 shows their precision, recall,
and F1-score. From the perspective of F1-score, our DAM-based
system outperformed all the baselines on LC-QuAD and CompQ
and ranked second on WebQSP.
STAGG [22] and HR-BiLSTM [5] are standard semantic parsing-based pipeline systems, which are similar to ours. However, their
performance is mainly limited by their relation detection model.
CompQA [6] and Slot-Matching [7] are based on query graph
ranking. Although they avoid the error propagation of pipelines,
they suffer because a single model lacks the capacity to represent entire query graphs. GGNN [25] is a state
transition-based query graph generation method. It leverages the
gated graph neural network to capture the structure information
of the query graph so that it achieves better results. However, because it encodes the relation only as an unweighted sum of its
bag-of-words embeddings, it cannot
precisely capture the semantics of a relation with a
long sequence of words.
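The difference between an unweighted bag-of-words sum and a weighted encoding can be illustrated with a toy sketch. It is not the GGNN or DAM implementation; the helper names `sum_pool` and `attention_pool`, the two-dimensional embeddings, and the query vectors are hypothetical, and the weighting shown is plain dot-product softmax attention.

```python
import math


def sum_pool(word_vectors):
    """Unweighted bag-of-words encoding: element-wise sum of embeddings."""
    return [sum(column) for column in zip(*word_vectors)]


def attention_pool(word_vectors, query):
    """Weighted encoding: softmax of dot products with a query vector."""
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(wt * vec[i] for wt, vec in zip(weights, word_vectors))
        for i in range(len(query))
    ]
    return pooled, weights


# Toy embeddings for the words of a three-word relation.
relation = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# The sum ignores word order and treats every word as equally important.
same_either_way = sum_pool(relation) == sum_pool(relation[::-1])

# A weighted encoding changes with the query (e.g. the question).
_, weights_a = attention_pool(relation, [4.0, 0.0])
_, weights_b = attention_pool(relation, [0.0, 4.0])
```

With the first query the first word receives the largest weight, and with the second query the second word does, whereas the summed encoding is identical in both cases. The longer the relation, the more words share that single undifferentiated sum.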
The second block of Table 2 shows our ablation tests, which
evaluate the improvement contributed by each constraint. The multi-entity
constraint contributes the most on the three
datasets. Note that recall drops slightly when the constraints are added,
because some constraint detection errors remove correct answers. However, the
overall performance still improves owing to the higher precision
of the results.
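The trade-off between recall and precision described here can be sketched on toy answer sets. The sketch assumes that precision, recall, and F1 are computed per question over answer sets and then averaged over questions; this section does not state the averaging scheme, so that is an assumption. The function names and the answer sets are hypothetical.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 of one predicted answer set against gold."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_average(predictions, golds):
    """Average per-question (P, R, F1) triples over all questions."""
    scores = [prf1(p, g) for p, g in zip(predictions, golds)]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


gold = [{"a", "b"}, {"x"}]

# Without constraints: every gold answer is found, plus spurious ones.
unconstrained = [{"a", "b", "c", "d", "e", "f"}, {"x", "y"}]

# With constraints: spurious answers are filtered out, but a detection
# error on the first question also removes the correct answer "b".
constrained = [{"a"}, {"x"}]

before = macro_average(unconstrained, gold)
after = macro_average(constrained, gold)
```

In this toy case the averaged recall falls from 1.0 to 0.75 while the averaged precision rises from about 0.42 to 1.0, so the averaged F1 still increases (about 0.58 to about 0.83).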
Yongrui Chen: Methodology, Software, Writing - original draft.
Huiying Li: Conceptualization, Investigation, Writing - review &
editing, Supervision, Funding acquisition.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Acknowledgments
The work is supported by the National Natural Science Foundation of China under grant No. 61502095 and the Natural Science Foundation of Jiangsu Province, PR China under Grant
BK20140643.
References
[1] D.Q. Nguyen, T.D. Nguyen, D.Q. Nguyen, D.Q. Phung, A novel embedding
model for knowledge base completion based on convolutional neural
network, in: M.A. Walker, H. Ji, A. Stent (Eds.), Proceedings of the
2018 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, NAACL-HLT,
New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers),
Association for Computational Linguistics, 2018, pp. 327–333,
http://dx.doi.org/10.18653/v1/n18-2053.
[2] L. He, B. Liu, G. Li, Y. Sheng, Y. Wang, Z. Xu, Knowledge base completion
by variational Bayesian neural tensor decomposition, Cogn. Comput. 10 (6)
(2018) 1075–1084, http://dx.doi.org/10.1007/s12559-018-9565-x.
[3] Y. Wang, M. Wang, H. Fujita, Word sense disambiguation: A comprehensive
knowledge exploitation framework, Knowl. Based Syst. 190 (2020) 105030,
http://dx.doi.org/10.1016/j.knosys.2019.105030.
[4] J. Bao, N. Duan, Z. Yan, M. Zhou, T. Zhao, Constraint-based question
answering with knowledge graph, in: Proceedings of 26th International
Conference on Computational Linguistics, Proceedings of the Conference,
2016, pp. 2503–2514, https://www.aclweb.org/anthology/C16-1236/.
[5] M. Yu, W. Yin, K.S. Hasan, C.N. dos Santos, B. Xiang, B. Zhou, Improved
neural relation detection for knowledge base question answering, in: Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics, 2017, pp. 571–581, http://dx.doi.org/10.18653/v1/P17-1053.
[6] K. Luo, F. Lin, X. Luo, K.Q. Zhu, Knowledge base question answering
via encoding of complex query graphs, in: Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, 2018,
pp. 2185–2194, https://www.aclweb.org/anthology/D18-1242/.
[7] G. Maheshwari, P. Trivedi, D. Lukovnikov, N. Chakraborty, A. Fischer, J.
Lehmann, Learning to rank query graphs for complex question answering
over knowledge graphs, in: Proceedings of 18th International Semantic
Web Conference, 2019, pp. 487–504, http://dx.doi.org/10.1007/978-3-030-30793-6_28.
[8] Y. Deng, Y. Xie, Y. Li, M. Yang, N. Du, W. Fan, K. Lei, Y. Shen, Multi-task
learning with multi-view attention for answer selection and knowledge
base question answering, in: The Thirty-Third AAAI Conference on Artificial
Intelligence, AAAI 2019, the Thirty-First Innovative Applications of Artificial
Intelligence Conference, IAAI 2019, the Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii,
USA, January 27 - February 1, 2019, AAAI Press, 2019, pp. 6318–6325,
http://dx.doi.org/10.1609/aaai.v33i01.33016318.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L.
Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of Advances
in Neural Information Processing Systems 30: Annual Conference on Neural
Information Processing Systems, 2017, pp. 5998–6008,
http://papers.nips.cc/paper/7181-attention-is-all-you-need.
[10] C. Unger, L. Bühmann, J. Lehmann, A.N. Ngomo, D. Gerber, P. Cimiano,
Template-based question answering over RDF data, in: A. Mille, F.L.
Gandon, J. Misselis, M. Rabinovich, S. Staab (Eds.), Proceedings of the
21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April
16-20, 2012, ACM, 2012, pp. 639–648,
http://dx.doi.org/10.1145/2187836.2187923.
[11] A. Fader, L.S. Zettlemoyer, O. Etzioni, Paraphrase-driven learning for open
question answering, in: Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics, ACL 2013, 4-9 August 2013,
Sofia, Bulgaria, Volume 1: Long Papers, The Association for Computer
Linguistics, 2013, pp. 1608–1618, https://www.aclweb.org/anthology/P13-1158/.
[12] W. Zheng, L. Zou, X. Lian, J.X. Yu, S. Song, D. Zhao, How to build templates
for RDF question/answering: An uncertain graph similarity join approach,
in: T.K. Sellis, S.B. Davidson, Z.G. Ives (Eds.), Proceedings of the 2015 ACM
SIGMOD International Conference on Management of Data, Melbourne,
Victoria, Australia, May 31 - June 4, 2015, ACM, 2015, pp. 1809–1824,
http://dx.doi.org/10.1145/2723372.2747648.
[13] A. Abujabal, M. Yahya, M. Riedewald, G. Weikum, Automated template
generation for question answering over knowledge graphs, in: Proceedings
of the 26th International Conference on World Wide Web, 2017, pp.
1191–1200, http://dx.doi.org/10.1145/3038912.3052583.
[14] J. Ding, W. Hu, Q. Xu, Y. Qu, Leveraging frequent query substructures to
generate formal queries for complex question answering, in: K. Inui, J.
Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019,
Hong Kong, China, November 3-7, 2019, Association for Computational
Linguistics, 2019, pp. 2614–2622, http://dx.doi.org/10.18653/v1/D19-1263.
[15] L. Dong, F. Wei, M. Zhou, K. Xu, Question answering over freebase with
multi-column convolutional neural networks, in: Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics and the
7th International Joint Conference on Natural Language Processing of the
Asian Federation of Natural Language Processing, ACL 2015, July 26-31,
2015, Beijing, China, Volume 1: Long Papers, The Association for Computer
Linguistics, 2015, pp. 260–269, http://dx.doi.org/10.3115/v1/p15-1026.
[16] K. Xu, S. Reddy, Y. Feng, S. Huang, D. Zhao, Question answering on freebase
via relation extraction and textual evidence, in: Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics, ACL
2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, The
Association for Computer Linguistics, 2016,
http://dx.doi.org/10.18653/v1/p16-1220.
[17] D. Lukovnikov, A. Fischer, J. Lehmann, S. Auer, Neural network-based
question answering over knowledge graphs on word and character level,
in: R. Barrett, R. Cummings, E. Agichtein, E. Gabrilovich (Eds.), Proceedings
of the 26th International Conference on World Wide Web, WWW 2017,
Perth, Australia, April 3-7, 2017, ACM, 2017, pp. 1211–1220,
http://dx.doi.org/10.1145/3038912.3052675.
[18] Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, J. Zhao, An end-to-end
model for question answering over knowledge base with cross-attention
combining global knowledge, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1:
Long Papers, Association for Computational Linguistics, 2017, pp. 221–231,
http://dx.doi.org/10.18653/v1/P17-1021.
[19] W. Zhao, T. Chung, A.K. Goyal, A. Metallinou, Simple question answering
with subgraph ranking and joint-scoring, in: J. Burstein, C. Doran, T. Solorio
(Eds.), Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume
1 (Long and Short Papers), Association for Computational Linguistics, 2019,
pp. 324–334, http://dx.doi.org/10.18653/v1/n19-1029.
[20] J. Berant, A. Chou, R. Frostig, P. Liang, Semantic parsing on freebase
from question-answer Pairs, in: Proceedings of the 2013 Conference on
Empirical Methods in Natural Language Processing, 2013, pp. 1533–1544,
https://www.aclweb.org/anthology/D13-1160/.
[21] J. Berant, P. Liang, Semantic parsing via paraphrasing, in: Proceedings of
the 52nd Annual Meeting of the Association for Computational Linguistics,
2014, pp. 1415–1425, https://www.aclweb.org/anthology/P14-1133/.
[22] W. Yih, M. Chang, X. He, J. Gao, Semantic parsing via staged query graph
generation: question answering with knowledge base, in: Proceedings of
the 53rd Annual Meeting of the Association for Computational Linguistics
and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, 2015, pp.
1321–1331, https://www.aclweb.org/anthology/P15-1128/.
[23] M. Esposito, E. Damiano, A. Minutolo, G.D. Pietro, H. Fujita, Hybrid query
expansion using lexical resources and word embeddings for sentence
retrieval in question answering, Inform. Sci. 514 (2020) 88–105,
http://dx.doi.org/10.1016/j.ins.2019.12.002.
[24] T. Hayashi, H. Fujita, Word embeddings-based sentence-level sentiment
analysis considering word importance, Acta Polytech. Hungarica 16 (7)
(2019) 7–24, http://www.uni-obuda.hu/journal/Hayashi_Fujita_94.pdf.
[25] D. Sorokin, I. Gurevych, Modeling semantics with gated graph neural
networks for knowledge base question answering, in: E.M. Bender, L. Derczynski, P. Isabelle (Eds.), Proceedings of the 27th International Conference
on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA,
August 20-26, 2018, Association for Computational Linguistics, 2018, pp.
3306–3317, https://www.aclweb.org/anthology/C18-1280/.
[26] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition,
in: 2016 IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society,
2016, pp. 770–778, http://dx.doi.org/10.1109/CVPR.2016.90.
[27] Y. Yang, M. Chang, S-MART: novel tree-based structured learning algorithms applied to tweet entity linking, in: Proceedings of the 53rd
Annual Meeting of the Association for Computational Linguistics, 2015,
pp. 504–513, https://www.aclweb.org/anthology/P15-1049/.
[28] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y.
Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings, 2015, http://arxiv.org/abs/1412.6980.
[29] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep
bidirectional transformers for language understanding, in: Proceedings of
the 2019 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, 2019, pp.
4171–4186, https://www.aclweb.org/anthology/N19-1423/.