Knowledge-Based Systems 201–202 (2020) 106077
journal homepage: www.elsevier.com/locate/knosys

DAM: Transformer-based relation detection for Question Answering over Knowledge Base

Yongrui Chen, Huiying Li ∗
School of Computer Science and Engineering, Southeast University, Nanjing 210096, PR China

∗ Corresponding author. E-mail address: huiyingli@seu.edu.cn (H. Li).
https://doi.org/10.1016/j.knosys.2020.106077
0950-7051/© 2020 Elsevier B.V. All rights reserved.

Article history: Received 3 November 2019. Received in revised form 22 May 2020. Accepted 23 May 2020. Available online 1 June 2020.

Keywords: Knowledge Base; Question Answering; Transformer; Relation detection

Abstract

Relation detection is a core component of Knowledge Base Question Answering (KBQA). In this paper, we propose a Transformer-based deep attentive semantic matching model (DAM) to identify the KB relations that correspond to a question. DAM is based entirely on the attention mechanism and applies fine-grained word-level attention to enhance the matching of questions and relations. On the basis of DAM, we build a three-stage KBQA pipeline system. The experimental results on multiple benchmarks demonstrate that DAM outperforms previous methods on relation detection. In addition, our DAM-based KBQA system achieves state-of-the-art results on multiple datasets.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Since its birth, the Knowledge Base (KB) has been widely applied in various communities [1–3]. Knowledge Base Question Answering (KBQA) is an active research area whose goal is to provide crisp answers to natural language questions by employing the KB. An important direction in KBQA is answering via semantic parsing (SP): translating natural language questions posed by humans into structured queries (e.g., SPARQL). As illustrated in Fig. 1, SP-based KBQA approaches [4–7] are typically pipelines that consist of three key stages: (1) entity linking, which links n-grams to KB entities; (2) relation detection, which identifies the KB relation(s) to which a question refers; and (3) constraint detection, which adds constraints to construct structured queries. In this paper, we focus on relation detection, which is a core phase of the pipeline.

Relation detection aims to build a correct mapping from questions to KB relations. Currently, it is common practice to employ a neural network model to learn semantic representations of relations and questions in order to bridge the lexical gap between them. To make the model focus on the parts of the question that are relevant to the relation, the attention mechanism is applied in recent work [7,8]. To reduce the number of unseen relations [5], recent research breaks the relation name into a sequence of words as the input of the model. For example, the relation place_of_birth is converted into the sequence {"place", "of", "birth"}.

However, this word-level representation often makes the input sequences of relations too long, for two main reasons. First, in some knowledge bases such as Freebase, each relation (e.g., people.person.place_of_birth) consists of two parts: the type of the subject (people.person) and the genuine relation name (place_of_birth).¹ Second, in order to answer complex questions, 2-hop chains of relations in the KB are often required as the core relations. The length of the sequence makes it difficult for existing models to capture the complete semantic information of the relation. Further, it limits the performance of the existing attention mechanism, which calculates the attention weights from the global vector representation of the relation.
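To make the length problem concrete, the following is a minimal sketch of how a relation (or a 2-hop chain) might be flattened into a word sequence. The splitting rule (on "." and "_") and the function name `relation_to_words` are assumptions made for illustration, not the authors' preprocessing code, and the 2-hop chain is only an illustrative example.

```python
def relation_to_words(relation_chain):
    """Flatten a chain of dotted relation names into one word sequence.

    Assumption: the type prefix is kept and split like the relation name,
    since the paper states that the prefix cannot be ignored.
    """
    words = []
    for relation in relation_chain:
        # "people.person.place_of_birth" -> "people", "person", "place_of_birth"
        for segment in relation.split("."):
            # "place_of_birth" -> "place", "of", "birth"
            words.extend(token for token in segment.split("_") if token)
    return words


one_hop = relation_to_words(["people.person.place_of_birth"])
# Illustrative 2-hop chain; the concrete relation names are an assumption.
two_hop = relation_to_words(
    ["people.person.education", "education.education.institution"]
)
print(one_hop)
print(two_hop)
```

Under this splitting rule the bare name place_of_birth contributes three words, while the prefixed relation and the 2-hop chain produce noticeably longer inputs for the model.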
In fact, different words in one relation are typically relevant to different parts of the question. For example, in Fig. 1, the word "education" in the relation corresponds to the words "university" and "graduated" in the question, and the word "institution" corresponds to the phrase "what university". Therefore, it is unreasonable to apply one identical attention distribution (derived from the global vector representation) to the question for every word in the relation.

This paper overcomes the problem above by proposing a relation detector based on Transformer [9], which is a completely attention-based neural network. First, we employ the self-attention mechanism to encode the input question and candidate relations in order to capture the semantic information of each word from a global perspective. Then, we propose a fine-grained attention mechanism with multi-head attention to compute an attention distribution on the question for each word in the relation independently. In order to explore whether the proposed relation detector can help KBQA, we construct a three-stage pipeline system. For entity linking, the system directly uses the results of existing methods. For constraint detection, the system handles three kinds of constraints: multi-entity constraint, temporal constraint, and type constraint.

¹ According to our observation, there are many cases where two relations have the same genuine relation name but different prefixes, so the prefix type cannot be ignored.

Fig. 1. Entire process of the SP-based KBQA approach. The example question includes only a type constraint, which is one kind of common constraint.

The main contributions of this paper are:

• We propose a word-level attention mechanism based on Transformer for relation detection. To the best of our knowledge, it is the first attempt to perform the attention mechanism independently for each word of the relation instead of using the entire relation.
• We build a pipeline system for the KBQA task based on the proposed model and propose to leverage the same model to handle the type constraint in the question.
• We perform comprehensive experiments on multiple KBQA datasets. Our proposed model outperforms previous approaches for relation detection and our pipeline system produces state-of-the-art results.

2. Related works

Understanding natural language questions is the most important part of KBQA. At present, mainstream KBQA methods can be roughly classified into three categories: template-based methods, semantic parsing-based methods, and end-to-end methods.

2.1. Template-based methods

Template-based methods aim to obtain a logical formal query from the question using predefined templates. Traditional approaches [10–12] create templates manually, which is costly and time-consuming. To overcome this problem, Abujabal et al. [13] propose to generate templates automatically from user questions paired with their answers. Ding et al. [14] leverage the frequent query substructures in the training queries and build templates by combining these substructures. Nonetheless, template-based methods are still limited by the coverage of the templates and are extremely vulnerable to unconventional expressions.

2.2. End-to-end methods

An important line of research for KBQA constructs an end-to-end system by deep learning powered similarity matching. These methods [15–17] aim to design an end-to-end neural network model to rank candidate answers. To enhance the KB context of candidate answers, Hao et al. [18] apply a cross-attention mechanism to integrate the information into the vector representation of the answers. Zhao et al. [19] focus on simple questions and propose a joint score for training both entities and relations.
These end-to-end methods typically have only a single process, which avoids complex NLP pipelines and error propagation; however, their results are not explainable.

2.3. Semantic parsing-based methods

SP-based methods aim to translate questions into logical formal queries. Traditional SP-based approaches [20,21] require complex NLP pipelines, including dependency parsing and vocabulary mapping. With the maturity of deep learning, recent work focuses on building semantic parsing frameworks with neural networks to improve performance. Yih et al. [22] propose a query graph that can be mapped to a logical form. Traditional semantic parsing is thereby simplified into a pipeline for staged query graph generation. This framework is widely applied in recent research [4–7], including our work. Yu et al. [5] propose a hierarchical recursive model enhanced by residual learning for relation detection. Maheshwari et al. [7] propose a relation detection model with a slot-matching attention mechanism that treats each relation name as a slot and focuses on different parts of the question for each slot. In fact, recent work [23,24] focuses on capturing the semantic information of entire sentences at the word level. Inspired by this, we propose a fine-grained word-level attention mechanism to enhance the matching of questions and relations.

In addition to the staged generation methods, Luo et al. [6] propose to rank entire candidate query graphs by a single representation model with dependency parsing. Sorokin et al. [25] construct a state transition-based query graph generation method that uses the gated graph neural network to represent the query graph. To further introduce external knowledge, Deng et al. [8] leverage multi-task learning to combine the KBQA task (relation detection) and the Answer Selection (AS) task. In comparison with the end-to-end methods, SP-based methods can provide a more explainable result through the logical form of the question.

3. Transformer-based relation detector

This section describes our Transformer-based relation detection model.

3.1. Task definition

We first provide a unified description of relation detection.

Definition 1. The core relation of the question is a relation (or chain of relations²) in the KB that connects the topic entity in the question with the answers.

For an input question $q$, assume that its topic entity $e_q$ (e.g., "m.0271_s" in Fig. 1) has been linked. Then, the core relation $r_q$ can be selected by

$$r_q = \arg\max_{r \in R_e} p_{rel}(r; q) \tag{1}$$

where $p_{rel}(r; q)$ is the matching score of $r$ and $q$, and $R_e$ denotes the candidate set of relations, which consists of all relation chains (within 2 hops) connected with $e_q$. In this section, we detail three models for calculating $p_{rel}(r; q)$: the first two are baselines, and the last one is our proposed model.

3.2. Baseline sequence matching model

Following common practice, each relation $r$ and question $q$ are first broken into sequences of words. Then, they are converted into word embeddings $R = [r_1, r_2, \ldots, r_n]$ and $Q = [q_1, q_2, \ldots, q_m]$, respectively. Here, $n$ ($m$) is the number of words in the relation (question), and $r_i \in \mathbb{R}^{d}$ and $q_i \in \mathbb{R}^{d}$ are word vectors. Thereafter, the matching score is calculated by

$$p_{rel}(r; q) = h(g(f(R)), g(f(Q))) \tag{2}$$

Here, $f : \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ denotes a neural network (e.g., BiLSTM) that transforms the word embeddings into $d$-dimensional semantic vectors. $g : \mathbb{R}^{n \times d} \to \mathbb{R}^{d}$ denotes an aggregation operation (e.g., pooling) that transforms the semantic vectors into the global vector representation of the input sequence. $h : \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ denotes a distance function (e.g., cosine distance) used to measure the similarity of two vector representations.

3.3. Coarse-grained attention-based sequence matching model

In order to enhance the matching of $q$ and $r$, the attention mechanism is applied in recent work [7,8].
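Before attention is added, the baseline of Eqs. (1) and (2) can be sketched in a few lines. This is only an illustration under stated assumptions: the encoder f is replaced by an identity placeholder (the paper uses a trained network such as a BiLSTM), g is max-pooling, h is cosine similarity, and the toy 2-dimensional embeddings and relation names "r1"/"r2" are made up.

```python
import math


def encode(embeddings):
    # Placeholder for f: a real system would run a trained encoder here.
    return embeddings


def max_pool(vectors):
    # g: column-wise maximum over the sequence, giving one global vector.
    return [max(column) for column in zip(*vectors)]


def cosine(u, v):
    # h: cosine similarity between two global vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matching_score(R, Q):
    # Eq. (2): p_rel(r; q) = h(g(f(R)), g(f(Q)))
    return cosine(max_pool(encode(R)), max_pool(encode(Q)))


def detect_relation(candidates, Q):
    # Eq. (1): pick the candidate relation with the highest matching score.
    return max(candidates, key=lambda name: matching_score(candidates[name], Q))


Q = [[1.0, 0.0], [0.0, 1.0]]                      # toy question embeddings
candidates = {"r1": [[1.0, 1.0]], "r2": [[1.0, 0.0]]}  # toy relation embeddings
print(detect_relation(candidates, Q))
```

Because each sequence is pooled into a single vector before comparison, every word of the relation sees the question in the same way; this is the limitation that the attention-based variants address.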
This attention mechanism makes the model focus on the parts of $q$ that are relevant to $r$. With it, the matching score is calculated by

$$p_{rel}(r; q) = h(g(f(R)), a(X, y)) \tag{3}$$

where $X = f(Q) \in \mathbb{R}^{m \times d}$, $y = g(f(R)) \in \mathbb{R}^{d}$, and $a : \mathbb{R}^{m \times d} \times \mathbb{R}^{d} \to \mathbb{R}^{d}$ denotes the attention mechanism. Specifically, $a(X, y) \in \mathbb{R}^{d}$ is calculated by

$$a(X, y) = \sum_{i=1}^{m} \alpha_i x_i \tag{4}$$

$$\alpha_i = \frac{\exp(u(x_i, y))}{\sum_{j=1}^{m} \exp(u(x_j, y))} \tag{5}$$

where $\alpha_i$ denotes the attention weight of $x_i$ in $X$, and $u : \mathbb{R}^{d} \times \mathbb{R}^{d} \to \mathbb{R}$ is an attention function (e.g., a feed-forward network) that calculates the weights from the semantic relevance between $x_i$ and $y$.

In this framework, the model first obtains the global vector representation of $r$ by the aggregation operation $g$, and then computes the attention weights of $q$ from this global representation. In this way, the parts of $q$ that are relevant to $r$ receive more attention. However, if $r$ is a long sequence, its words are typically relevant to different parts of $q$. Using the global representation of $r$ is too coarse-grained to integrate the relevance of different word pairs.

² In this paper, we focus on chains within 2 hops, which meets the requirements of almost all existing KBQA datasets.

3.4. Deep attentive sequence matching model

To build a fine-grained attention mechanism, we apply the scaled dot-product attention used in Transformer to independently calculate the attention weights of $q$ for each word in $r$. Specifically, let $Q \in \mathbb{R}^{m \times d}$ and $R \in \mathbb{R}^{n \times d}$ denote the embedding vectors of $q$ and $r$, respectively, which are obtained by function $f$. Then, the mechanism, denoted by $a^{*}$, can be described by

$$a^{*}(Q, R) = W^{a} Q W^{v} \tag{6}$$

$$W^{a} = \mathrm{softmax}\left(\frac{(R W^{q})(Q W^{k})^{T}}{\sqrt{d}}\right) \tag{7}$$

Here, $W^{q} \in \mathbb{R}^{d \times d}$, $W^{k} \in \mathbb{R}^{d \times d}$, and $W^{v} \in \mathbb{R}^{d \times d}$ are the trainable parameter matrices for affine transformation. $W^{a} \in \mathbb{R}^{n \times m}$ is the attention matrix of $q$ and $r$. The $i$th row of $W^{a}$ is the attention distribution on $q$ from the $i$th word of $r$. The output of $a^{*}(Q, R)$ is denoted by $\bar{Q} = [\bar{q}_1, \bar{q}_2, \ldots, \bar{q}_n] \in \mathbb{R}^{n \times d}$, where $\bar{q}_i$ denotes the weighted sum of the vectors in $Q$ under the attention from the $i$th word of $r$.

Following Transformer, we apply the multi-head mechanism to calculate $a^{*}(Q, R)$ in different vector spaces, which is formulated by

$$m(Q, R) = \mathrm{concat}(head_1, \ldots, head_{\eta}) W^{o}, \quad \text{where } head_i = a^{*}_{i}(Q, R) \tag{8}$$

Here, $\eta$ is the number of heads, each $head_i$ has independent trainable parameters, and $W^{o} \in \mathbb{R}^{\eta d \times d}$ is the affine transformation that combines the outputs from different vector spaces.

In addition to using the multi-head attention to perform a fine-grained question-relation (Q-R) attention, we also employ it as $f$ to encode $q$ and $r$, i.e., as self-attention. For example, for the input question matrix $Q \in \mathbb{R}^{m \times d}$, the output of the self-attention is denoted by $s(Q) \in \mathbb{R}^{m \times d}$, which is calculated by

$$s(Q) = m(Q, Q) \tag{9}$$

In this mechanism, each head obtains a self-attention matrix $W^{s} \in \mathbb{R}^{m \times m}$, which is similar to $W^{a}$ in (6). Through $W^{s}$, each vector of the input sequence is directly affected by all other vectors (including itself) in the same sequence. Compared with sequence models such as BiLSTM, self-attention removes the limitation imposed by the distance between two words, so it performs well on long sequences.

On the basis of the mechanisms above, we propose the Deep Attentive Matching (DAM) model for relation detection. It consists of an encoder and a decoder, as illustrated in Fig. 2.

Fig. 2. Architecture of DAM.

The encoder aims to obtain the semantic vectors of $q$. It consists of $l$ identical layers, where each layer mainly contains a self-attention sub-layer and a feed-forward network sub-layer. There is a residual connection [26] around each of the two sub-layers, followed by layer normalization. That is, for an arbitrary matrix $Z$, the output of each sub-layer is $\mathrm{LayerNorm}(Z + \mathrm{Sublayer}(Z))$, where $\mathrm{Sublayer}(Z)$ is the function implemented by the sub-layer itself. Formally, ignoring the residual connection and normalization, the $i$th encoder layer $e_i$ can be formulated by

$$e_i(Q) = \begin{cases} f_{e_1}(s_{e_1}(Q)), & i = 1 \\ f_{e_i}(s_{e_i}(e_{i-1}(Q))), & i \ge 2 \end{cases} \tag{10}$$

where $s_{e_i}$ and $f_{e_i}$ denote the self-attention sub-layer and feed-forward network sub-layer of $e_i$, respectively. The output of the last layer, $e_l(Q)$, can be regarded as the semantic matrix of $q$, denoted by $\hat{Q} \in \mathbb{R}^{m \times d}$.

The decoder has two functions: one is to obtain the semantic vectors of $r$, and the other is to perform the Q-R attention. Similar to the encoder, it consists of $l$ identical layers, where each layer mainly contains a self-attention sub-layer, a multi-head Q-R attention sub-layer, and a feed-forward network sub-layer. The residual connection and the layer normalization are also employed. Formally, the $i$th decoder layer $d_i$ can be formulated by

$$d_i(R, Q) = \begin{cases} f_{d_1}(m_{d_1}(\hat{Q}, s_{d_1}(R))), & i = 1 \\ f_{d_i}(m_{d_i}(\hat{Q}, s_{d_i}(d_{i-1}(R, Q)))), & i \ge 2 \end{cases} \tag{11}$$

where $s_{d_i}$, $m_{d_i}$, and $f_{d_i}$ denote the self-attention sub-layer, multi-head Q-R attention sub-layer, and feed-forward network sub-layer of $d_i$, respectively. We take the output of $s_{d_1}(R)$ as the semantic vectors of $r$, denoted by $\hat{R} \in \mathbb{R}^{n \times d}$, and take $d_l(R, Q)$ as the attentive vectors of $q$, denoted by $\tilde{Q} = [\tilde{q}_1, \ldots, \tilde{q}_n]$. Here, $\tilde{q}_i \in \mathbb{R}^{d}$ is the weighted sum of $\hat{Q}$ under the attention of the $i$th word in $r$. For the inputs to the encoder and decoder, we follow [9] and adopt positional encoding to record the position information of each word. Finally, the matching score is calculated by

$$p_{rel}(r; q) = h(g(\hat{R}), g(\tilde{Q})) \tag{12}$$

where $g$ is max-pooling and $h$ is the cosine distance.

In comparison with the coarse-grained attention-based model, DAM has three differences: (1) DAM employs a fine-grained Q-R attention mechanism to make the model focus on different parts of the question for each word in the relation. (2) DAM uses self-attention to encode the question and relation instead of a traditional sequence model (e.g., BiLSTM), which is beneficial for handling long sequences. (3) DAM is a deep structure, so it has a stronger representation capability than shallow models.

3.5. Training

During training, we adopt the hinge loss to maximize the margin between the gold relations $r^{+}$ and the negative relations $r^{-}$:

$$l_r(q) = \sum_{r^{-} \in R^{-}} \max(0, \gamma + p_{rel}(r^{-}; q) - p_{rel}(r^{+}; q)) \tag{13}$$

where $R^{-}$ is the set of negative relations sampled from the candidate set $R_e$.

4. KBQA pipeline system

This section describes our KBQA pipeline system based on the proposed relation detector. The system consists of three stages, namely (1) entity linking, (2) relation detection, and (3) constraint detection.

4.1. Entity linking

For each question $q$, we follow previous work and use the results of an existing entity linking tool.³ The top-$t$ linked entities $E_t$, ranked by the score $p_{ent}(e; q)$, are passed to the subsequent stages.

³ For LC-QuAD, we directly use the gold entity for a fair comparison with previous work. For WebQSP and CompQ, we utilize the results of S-MART [27].

4.2. Relation detection

First, we construct the relation candidate set $R_e$ for each candidate entity $e \in E_t$. Note that $r \in R_e$ can be a single relation or a 2-hop relation chain. Then, the entity-relation pair $(e_q, r_q)$ for the question is selected by

$$(e_q, r_q) = \arg\max_{e \in E_t} \max_{r \in R_e} p_{rel}(r; q) \tag{14}$$

Here, if there are multiple pairs with the identical $p_{rel}$, we select the one with the highest $p_{ent}$. After that, the base formal query is built, denoted by $\lambda_b = \{\langle e_q, r_q, ?x \rangle\}$,⁴ where $?x$ denotes the answer variable that needs to be grounded to the KB.

Fig. 3. Architecture of DAM for type detection.

4.3. Constraint detection

To deal with complex questions, it is necessary to attach constraints to $\lambda_b$ to represent the complete intentions of the question.
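The base query $\lambda_b$ to which constraints are attached comes from the selection rule of Eq. (14). The following minimal sketch shows that rule with its tie-break on $p_{ent}$ and the resulting triples. The tuple layout, the function names, the entity id "m.hypothetical", the relation names "r_a" to "r_d", and all scores are hypothetical values chosen for illustration; only "m.0271_s" appears in the paper.

```python
def select_entity_relation(candidates):
    """Pick the (entity, relation chain) pair as in Eq. (14).

    Each candidate is (entity, p_ent, relation_chain, p_rel). The key sorts
    by p_rel first and uses p_ent only to break ties between equal p_rel.
    """
    best = max(candidates, key=lambda c: (c[3], c[1]))
    return best[0], best[2]


def base_query(entity, relation_chain):
    """Build the base query as a list of triples, with ?x as the answer."""
    if len(relation_chain) == 1:
        return [(entity, relation_chain[0], "?x")]
    first_hop, second_hop = relation_chain
    # For a 2-hop chain, ?y is the hidden intermediate variable.
    return [(entity, first_hop, "?y"), ("?y", second_hop, "?x")]


# Toy candidates with made-up scores (assumption, not data from the paper).
candidates = [
    ("m.0271_s", 0.9, ("r_a",), 0.62),
    ("m.0271_s", 0.9, ("r_b", "r_c"), 0.81),
    ("m.hypothetical", 0.4, ("r_d",), 0.81),
]
entity, chain = select_entity_relation(candidates)
query = base_query(entity, chain)
print(query)
```

Representing the query as a plain list of triples keeps later steps simple: attaching a constraint amounts to appending one more triple to the list.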
Our KBQA system deals with the following three kinds of constraints: multi-entity constraint, temporal constraint, and type constraint. After all three constraints are handled, the complete query is obtained, denoted by $\lambda$. Finally, the answers are retrieved by executing $\lambda$.

4.3.1. Multi-entity constraint

A multi-entity constraint arises when there is more than one entity in the question. For example, the question "Who plays ken barlow in coronation street?" has two entities ("ken barlow" and "coronation street"). Following [5], we handle this constraint by text matching. Specifically, given a question $q$ and its corresponding query $\lambda_b$, we first retrieve the answers $A$ (including the grounded results of $?x$ and the hidden variable $?y$) by executing $\lambda_b$ in the KB. Then, we collect all the neighbor entities of each answer $a \in A$, whose literal name is denoted by $e$. Thereafter, we compute the matching score $\phi(e; q)$ between $e$ and $q$ by

$$\phi(e; q) = \max_{n(q)} \left( \frac{lcs(n(q), e)}{|q|} + \frac{lcs(n(q), e)}{|e|} \right) \tag{15}$$

where $n(x)$ denotes an n-gram of string $x$, $|x|$ denotes the length of string $x$, and $lcs(x, y)$ denotes the length of the Longest Consecutive Common Subsequence between the strings $x$ and $y$. Finally, if the score of $e$ is greater than the threshold $\theta_e$, it is added to $\lambda_b$ by attaching it to $?y$. The new query can be described by $\{\langle e_q, r_q^{0}, ?y \rangle, \langle ?y, r_q^{1}, ?x \rangle, \langle ?y, r_c, e \rangle\}$, where $r_c$ is the relation that connects $a$ with $e$.

4.3.2. Temporal constraint

A temporal constraint arises when there is temporal information in the question. For example, the question "Who is the governor of ohio 2011?" has the constraint "2011". Since the temporal constraint is usually given as a year in the datasets we use,⁵ we adopt a simple but efficient strategy. First, the year in the question is detected by rules (regular expressions), denoted by $y_c$. Then, we attach two variables $?d_s$ (start date) and $?d_e$ (end date) to $\lambda_b$ with the relations from and to. The new query can be described by $\{\langle e_q, r_q^{0}, ?y \rangle, \langle ?y, r_q^{1}, ?x \rangle, \langle ?y, \text{from}, ?d_s \rangle, \langle ?d_s, \le, y_c \rangle, \langle ?y, \text{to}, ?d_e \rangle, \langle ?d_e, \ge, y_c \rangle\}$. Note that the type prefixes of from and to are omitted here.

4.3.3. Type constraint

A type constraint arises when the question contains information that indicates the types of the answers. For example, in Fig. 1, only the entity "m.0l2tk" with the type "College/University" is the correct answer. To handle this case, we employ DAM again to predict the types of answers (see Fig. 3).

For question $q$, we first collect the types of the answers retrieved by $\lambda_b$, denoted by $T$. Then, for each candidate type $t \in T$, we calculate its matching score with DAM, denoted by $\psi(t; q)$. Finally, if the score of $t$ is greater than the threshold $\theta_t$, it is added to $\lambda_b$ by attaching it to $?x$. The new query can be described by $\{\langle e_q, r_q, ?x \rangle, \langle ?x, \text{rdf:type}, t \rangle\}$. Specifically, $\psi(t; q)$ is calculated by

$$\psi(t; q) = \mathrm{sigmoid}(\mathrm{concat}(\hat{T}, \tilde{Q}) W) \tag{16}$$

$$\hat{T} = g(s_{d_1}(T)) \tag{17}$$

$$\tilde{Q} = g(d_l(T, e_l(Q))) \tag{18}$$

where $T \in \mathbb{R}^{z \times d}$ and $Q \in \mathbb{R}^{m \times d}$ are the word embeddings of $t$ and $q$ with positional encoding, respectively, and $W \in \mathbb{R}^{d \times 1}$ is a trainable parameter vector. During training, we minimize the cross-entropy loss, which is calculated by

$$l_t(q) = -\sum_{t \in T} \left[ \rho \log \psi(t; q) + (1 - \rho) \log(1 - \psi(t; q)) \right] \tag{19}$$

where $\rho \in \{0, 1\}$ is the ground truth for training. If $t$ is the type of the correct answers to $q$, $\rho$ is 1; otherwise $\rho$ is 0.

⁴ If $r_q$ is a 2-hop chain, $\lambda_b = \{\langle e_q, r_q^{0}, ?y \rangle, \langle ?y, r_q^{1}, ?x \rangle\}$, where $?y$ is a hidden variable (e.g., a CVT node in Freebase). Note that the direction of the relation is not represented in the examples.

⁵ The questions in SimpQ and LC-QuAD (DBpedia) do not have temporal constraints.

5. Experiments

This section describes the experiments to evaluate our relation detector and KBQA system.

5.1. Datasets and evaluation metrics

Our models are trained and evaluated over the following four KBQA datasets:

SimpleQuestions (SimpQ) is a single-relation (1-hop) KBQA task.
It consists of more than 100K questions, and the fact behind each question can be described by a triple in Freebase.

WebQuestionsSP (WebQSP) is a multi-relation (1-hop and 2-hop) KBQA task over Freebase, which contains full semantic parses for a subset of the questions from the WebQuestions [20] dataset. It is split into 3098 training and 1639 test questions.

LC-QuAD is a multi-relation (1-hop and 2-hop) complex KBQA task over DBpedia with 5000 question-SPARQL pairs. It is split into 4000 training and 1000 test questions.

ComplexQuestions (CompQ) is a multi-relation (1-hop and 2-hop) complex KBQA task over Freebase, which contains 2100 complex questions collected from the Bing search query log. It is split into 1300 training and 800 test questions.

For relation detection, we follow previous work [5,8] and adopt Accuracy as the evaluation metric. That is, a result is considered correct if it matches one of the ground truth relations. For the KBQA task, we follow common practice and adopt Precision, Recall, and F1-score as the evaluation metrics.

Table 1
Accuracy for relation detection on the four datasets.

Model                                   SimpQ   WebQSP   LC-QuAD   CompQ
CNN                                     89.8    78.1     44.0      52.5
BiCNN [22]                              90.0    77.7     –         –
BiLSTM                                  91.2    79.3     61.0      56.8
HR-BiLSTM [5]                           93.3    82.5     62.0      57.4
MVA-MTQA-net [8]                        93.1    82.3     –         –
Slot-Matching [7]                       93.7    82.8     63.0      57.8
BERT(base) [29]                         90.9    84.0     62.3      57.8
BERT(large) [29]                        91.2    83.8     62.5      57.9
DAM                                     93.3    84.1     63.2      58.5
  w/o Q-R attention                     92.4    82.3     61.9      55.2
  coarse-grained Q-R attention          93.1    82.7     62.0      57.5
  replacing self-attention with BiLSTM  92.4    81.5     61.5      56.8

5.2. Experimental settings

During the training of DAM, all word embeddings are initialized using the pre-trained GloVe⁶ vectors with 300 dimensions. The number of layers (denoted by $l$ in Section 3.4) is set to 6 and the number of heads $\eta$ is set to 8. The size of the hidden states $d$ is set to 512. Dropout with a probability of 0.1 is applied in each sub-layer.
The loss margin $\gamma$ in (13) is set to 0.1. All parameters are trained using the Adam [28] optimizer with a learning rate of $5 \times 10^{-5}$ and a mini-batch size of 16. Training lasts 30 epochs. The threshold $\theta_e$ for the multi-entity constraint is set to 0.1 and $\theta_t$ for the type constraint is set to 0.5. The number of negative samples is set to 50.

5.3. Relation detection results and analysis

We first evaluate the performance of the proposed DAM model on the relation detection task for the four datasets.

5.3.1. Comparison with baselines

We compared the DAM model with several baselines. CNN and BiLSTM are the most basic models for the semantic matching of sequences. BiCNN is proposed by [22], where both questions and relations are represented with a word hashing of character tri-grams. HR-BiLSTM [5] considers both word-level and name-level representations of the relation and uses residual connections with two BiLSTM layers for the hierarchical matching of the two levels. MVA-MTQA-net [8] leverages multi-task learning of relation detection and answer selection to introduce external information. To focus on the performance of the models themselves, we only compare with its single-task results for relation detection. Slot-Matching is proposed by [7] and applies a slot-matching mechanism so that each relation name pays different attention to the question.

The experimental results are reported in Table 1. Our relation detector outperforms previous models on WebQSP, LC-QuAD, and CompQ, and ranks second only on SimpQ. Overall, our model can be considered to achieve state-of-the-art results. The performance of CNN and BiCNN is limited by the size of their local convolutional windows, so they fail to obtain the global semantic information of the sequence when encoding each token. BiLSTM benefits from its recurrent structure, which exploits the forward and backward context, but the same structure makes it struggle with long sequences.
Both HR-BiLSTM and MVA-MTQA-net utilize a double-level input of the relation, namely word-level and relation(knowledge)-level, to enhance the representation of the relation, so they achieve better results. Slot-Matching considers each relation name in the chain as a slot and makes each slot focus on different parts of the question. Therefore, it outperforms the previous models. However, its attention mechanism, which works at the relation name level, is too coarse-grained for long relations. Our model employs the self-attention mechanism instead of convolutional or recurrent networks, so it is able to encode each word from a global perspective. More importantly, the fine-grained Q-R attention mechanism helps our model perform semantic matching more precisely. The main reason for ranking second on SimpQ is that all the questions in SimpQ correspond to single relations. The input sequence is too short to highlight the advantages of our model.

⁶ http://nlp.stanford.edu/projects/glove.

5.3.2. Comparison with pre-trained model

We also compared our proposed DAM with BERT [29], which is a pre-trained Transformer network. Specifically, we fine-tune BERT by using it as $f$ to encode questions and relations. In the experiments, we evaluated the base and large versions of BERT separately. The experimental results are shown in Table 1. On relation detection, neither the base nor the large version of BERT shows the excellent performance it achieves in other areas. Possible reasons include: (1) The questions in the datasets are too short for the input of BERT (512 tokens). Note the worse performance on the dataset with shorter questions (SimpQ). (2) BERT is pre-trained on unstructured text, but the KB consists of structured data. Typically, relations in the KB follow specific naming rules, which carry the structured information of the KB. Therefore, the prior knowledge obtained by BERT from unstructured text is less effective for encoding relations.

5.3.3. Ablation tests of DAM

To explore the contribution of each component of DAM, we evaluate the performance of DAM in the following settings.

• w/o Q-R attention: We removed the attention mechanism between the relation and the question. It can be formalized by (2), where $f$ denotes a Transformer encoder, $g$ denotes max-pooling, and $h$ denotes a cosine distance.
• coarse-grained Q-R attention: We calculate the attention weights on the question from the global vector representation of the relation. It can be formalized by (3), (4), and (5), where $f$ denotes a Transformer encoder, $g$ denotes max-pooling, $h$ denotes a cosine distance, and $u$ is a two-layer feed-forward network.
• replacing self-attention with BiLSTM: We apply BiLSTM to encode questions and relations instead of the Transformer encoder.

The experimental results are shown at the bottom of Table 1. Removing the Q-R attention lowers the performance by approximately 1.5% on all the datasets. The coarse-grained (name-level) attention lowers the performance by about 1% on WebQSP and CompQ. The main reason for the smaller drops on SimpQ and LC-QuAD is that the relations of these two datasets are typically shorter.⁷ For a similar reason, the BiLSTM encoder causes a less significant drop on SimpQ and LC-QuAD.

To show the effectiveness of the fine-grained Q-R attention mechanism in DAM intuitively, we visualize some rows of the attention matrix $W^{a}$ in (6). The heat maps are shown in Fig. 4. The results show that the words "education" and "institution" in the relation learn different attention distributions over the words of the question, which is approximately consistent with our expectations.

⁷ SimpQ only has single relations. LC-QuAD relies on DBpedia, whose relations do not have type prefixes.

Table 2
Precision, recall, and F1-score on WebQSP, LC-QuAD, and CompQ for KBQA task.
                     WebQSP              LC-QuAD             CompQ
                     P     R     F1      P     R     F1      P     R     F1
STAGG [22]           0.79  0.72  0.67    0.63  0.75  0.69    0.40  0.51  0.37
HR-BiLSTM [5]        0.79  0.73  0.68    0.64  0.77  0.70    0.39  0.52  0.37
GGNN [25]            0.82  0.73  0.71    0.66  0.78  0.71    0.41  0.55  0.42
CompQA [6]           0.79  0.75  0.69    0.64  0.75  0.69    0.41  0.56  0.43
Slot-Matching [7]    0.80  0.75  0.70    0.65  0.78  0.71    0.40  0.55  0.41

Our system           0.81  0.74  0.70    0.65  0.80  0.72    0.40  0.57  0.43
w/o multi-ent        0.77  0.75  0.67    0.63  0.80  0.69    0.37  0.57  0.40
w/o temporal         0.78  0.75  0.68    –     –     –       0.38  0.57  0.41
w/o type             0.78  0.76  0.69    0.62  0.81  0.68    0.38  0.58  0.41

Fig. 4. Two visualized attention distributions on the question for the words “education” (left) and “institution” (right) of the relation. The size of each heat map is t × m, where the ith row denotes the weights on the question from the ith attention head. Within a row, a darker color denotes a higher weight.

5.4. KBQA Results and analysis

We compared our DAM-based KBQA pipeline system with several state-of-the-art SP-based approaches. We do not use SimpQ because it consists only of simple questions without constraints and therefore cannot reflect the performance of constraint detection in our system. The first block of Table 2 shows their precision, recall, and F1-score. In terms of F1-score, our DAM-based system outperformed all the baselines on LC-QuAD and CompQ and ranked second on WebQSP.

STAGG [22] and HR-BiLSTM [5] are standard semantic parsing-based pipeline systems, which are similar to ours. However, their performance is mainly limited by their relation detection models. CompQA [6] and Slot-Matching [7] are based on query graph ranking. Although they avoid the error propagation of pipelines, they suffer because a single model lacks sufficient capacity to represent entire query graphs. GGNN [25] is a state transition-based query graph generation method. It leverages a gated graph neural network to capture the structural information of the query graph and thereby achieves better results. However, because it encodes the relation only as the unweighted sum of its bag-of-words embeddings, it cannot precisely capture the semantic information of a relation with a long word sequence.

The second block of Table 2 shows our ablation tests, which evaluate the improvement brought by each constraint. The multi-entity constraint makes the most important contribution on the three datasets. Note that recall drops slightly when the constraints are added, because some constraint detection errors remove correct answers. However, the overall performance still improves owing to the improved precision.

6. Conclusion

In this paper, we propose a deep attentive relation detection model, DAM, based on Transformer. It applies a fine-grained attention mechanism that calculates the attention weights over the question for each word of the relation. Our proposed DAM achieves state-of-the-art results for relation detection on multiple datasets. The experimental results also demonstrate that DAM can help the SP-based KBQA system to be more competitive.

CRediT authorship contribution statement

Yongrui Chen: Methodology, Software, Writing - original draft. Huiying Li: Conceptualization, Investigation, Writing - review & editing, Supervision, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The work is supported by the National Natural Science Foundation of China under Grant No. 61502095 and the Natural Science Foundation of Jiangsu Province, PR China under Grant BK20140643.

References

[1] D.Q. Nguyen, T.D. Nguyen, D.Q. Nguyen, D.Q. Phung, A novel embedding model for knowledge base completion based on convolutional neural network, in: M.A. Walker, H. Ji, A.
Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 327–333, http://dx.doi.org/10.18653/v1/n18-2053.
[2] L. He, B. Liu, G. Li, Y. Sheng, Y. Wang, Z. Xu, Knowledge base completion by variational Bayesian neural tensor decomposition, Cogn. Comput. 10 (6) (2018) 1075–1084, http://dx.doi.org/10.1007/s12559-018-9565-x.
[3] Y. Wang, M. Wang, H. Fujita, Word sense disambiguation: A comprehensive knowledge exploitation framework, Knowl. Based Syst. 190 (2020) 105030, http://dx.doi.org/10.1016/j.knosys.2019.105030.
[4] J. Bao, N. Duan, Z. Yan, M. Zhou, T. Zhao, Constraint-based question answering with knowledge graph, in: Proceedings of 26th International Conference on Computational Linguistics, Proceedings of the Conference, 2016, pp. 2503–2514, https://www.aclweb.org/anthology/C16-1236/.
[5] M. Yu, W. Yin, K.S. Hasan, C.N. dos Santos, B. Xiang, B. Zhou, Improved neural relation detection for knowledge base question answering, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 571–581, http://dx.doi.org/10.18653/v1/P17-1053.
[6] K. Luo, F. Lin, X. Luo, K.Q. Zhu, Knowledge base question answering via encoding of complex query graphs, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2185–2194, https://www.aclweb.org/anthology/D18-1242/.
[7] G. Maheshwari, P. Trivedi, D. Lukovnikov, N. Chakraborty, A. Fischer, J. Lehmann, Learning to rank query graphs for complex question answering over knowledge graphs, in: Proceedings of 18th International Semantic Web Conference, 2019, pp. 487–504, http://dx.doi.org/10.1007/978-3-030-30793-6_28.
[8] Y. Deng, Y. Xie, Y. Li, M. Yang, N. Du, W. Fan, K. Lei, Y. Shen, Multi-task learning with multi-view attention for answer selection and knowledge base question answering, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, the Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, the Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, AAAI Press, 2019, pp. 6318–6325, http://dx.doi.org/10.1609/aaai.v33i01.33016318.
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, 2017, pp. 5998–6008, http://papers.nips.cc/paper/7181-attention-is-all-you-need.
[10] C. Unger, L. Bühmann, J. Lehmann, A.N. Ngomo, D. Gerber, P. Cimiano, Template-based question answering over RDF data, in: A. Mille, F.L. Gandon, J. Misselis, M. Rabinovich, S. Staab (Eds.), Proceedings of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16-20, 2012, ACM, 2012, pp. 639–648, http://dx.doi.org/10.1145/2187836.2187923.
[11] A. Fader, L.S. Zettlemoyer, O. Etzioni, Paraphrase-driven learning for open question answering, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, The Association for Computer Linguistics, 2013, pp. 1608–1618, https://www.aclweb.org/anthology/P13-1158/.
[12] W. Zheng, L. Zou, X. Lian, J.X. Yu, S. Song, D. Zhao, How to build templates for RDF question/answering: An uncertain graph similarity join approach, in: T.K. Sellis, S.B. Davidson, Z.G. Ives (Eds.), Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, ACM, 2015, pp. 1809–1824, http://dx.doi.org/10.1145/2723372.2747648.
[13] A. Abujabal, M. Yahya, M. Riedewald, G. Weikum, Automated template generation for question answering over knowledge graphs, in: Proceedings of the 26th International Conference on World Wide Web, 2017, pp. 1191–1200, http://dx.doi.org/10.1145/3038912.3052583.
[14] J. Ding, W. Hu, Q. Xu, Y. Qu, Leveraging frequent query substructures to generate formal queries for complex question answering, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 2614–2622, http://dx.doi.org/10.18653/v1/D19-1263.
[15] L. Dong, F. Wei, M. Zhou, K. Xu, Question answering over freebase with multi-column convolutional neural networks, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, The Association for Computer Linguistics, 2015, pp. 260–269, http://dx.doi.org/10.3115/v1/p15-1026.
[16] K. Xu, S. Reddy, Y. Feng, S. Huang, D. Zhao, Question answering on freebase via relation extraction and textual evidence, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, The Association for Computer Linguistics, 2016, http://dx.doi.org/10.18653/v1/p16-1220.
[17] D. Lukovnikov, A. Fischer, J. Lehmann, S. Auer, Neural network-based question answering over knowledge graphs on word and character level, in: R. Barrett, R. Cummings, E. Agichtein, E. Gabrilovich (Eds.), Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, ACM, 2017, pp. 1211–1220, http://dx.doi.org/10.1145/3038912.3052675.
[18] Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, J. Zhao, An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, Association for Computational Linguistics, 2017, pp. 221–231, http://dx.doi.org/10.18653/v1/P17-1021.
[19] W. Zhao, T. Chung, A.K. Goyal, A. Metallinou, Simple question answering with subgraph ranking and joint-scoring, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 324–334, http://dx.doi.org/10.18653/v1/n19-1029.
[20] J. Berant, A. Chou, R. Frostig, P. Liang, Semantic parsing on freebase from question-answer pairs, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013, pp. 1533–1544, https://www.aclweb.org/anthology/D13-1160/.
[21] J. Berant, P. Liang, Semantic parsing via paraphrasing, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, pp. 1415–1425, https://www.aclweb.org/anthology/P14-1133/.
[22] W. Yih, M. Chang, X. He, J. Gao, Semantic parsing via staged query graph generation: question answering with knowledge base, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, 2015, pp. 1321–1331, https://www.aclweb.org/anthology/P15-1128/.
[23] M. Esposito, E. Damiano, A. Minutolo, G.D. Pietro, H. Fujita, Hybrid query expansion using lexical resources and word embeddings for sentence retrieval in question answering, Inform. Sci. 514 (2020) 88–105, http://dx.doi.org/10.1016/j.ins.2019.12.002.
[24] T. Hayashi, H. Fujita, Word embeddings-based sentence-level sentiment analysis considering word importance, Acta Polytech. Hungarica 16 (7) (2019) 7–24, http://www.uni-obuda.hu/journal/Hayashi_Fujita_94.pdf.
[25] D. Sorokin, I. Gurevych, Modeling semantics with gated graph neural networks for knowledge base question answering, in: E.M. Bender, L. Derczynski, P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, Association for Computational Linguistics, 2018, pp. 3306–3317, https://www.aclweb.org/anthology/C18-1280/.
[26] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 770–778, http://dx.doi.org/10.1109/CVPR.2016.90.
[27] Y. Yang, M. Chang, S-MART: novel tree-based structured learning algorithms applied to tweet entity linking, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, 2015, pp. 504–513, https://www.aclweb.org/anthology/P15-1049/.
[28] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015, http://arxiv.org/abs/1412.6980.
[29] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186, https://www.aclweb.org/anthology/N19-1423/.