Deep Neural Architectures for Joint Named Entity Recognition and Disambiguation

Qianwen Wang1 and Mizuho Iwaihara2
Graduate School of Information, Production and Systems, Waseda University, Kitakyushu, Japan
qianwen@akane.waseda.jp1, iwaihara@waseda.jp2

Abstract—Current entity linking methods typically first apply a named entity recognition (NER) model to extract a named entity and classify it into a predefined category, and then apply an entity disambiguation model to link the named entity to a corresponding entity in the reference knowledge base. However, these methods ignore the inter-relations between the two tasks. Our work jointly optimizes deep neural models of both NER and entity disambiguation. In the entity disambiguation task, our deep neural model combines a recursive neural network and a convolutional neural network with an attention mechanism. The model compares similarities between a mention and its candidate entities by simultaneously considering semantic and background information, including Wikipedia description pages, the contexts where the mention occurs, and entity typing information. Experiments show that our model effectively leverages the semantic information of the context and performs competitively with conventional approaches.

Keywords—named entity recognition, entity disambiguation, recursive neural network, convolutional neural network, attention

I. INTRODUCTION

Named Entity Recognition (NER) is the combined task of locating named entities in text and classifying them into predefined categories such as persons, organizations, and locations. After NER, Entity Disambiguation (ED) maps each mention to an entity in a reference knowledge base such as Wikipedia. Previous studies first generate a set of candidate entities in the knowledge base for each mention in the document, and then solve the disambiguation task as a ranking problem. The ranking model computes the relevance of each candidate entity to the corresponding mention using the context information available in both the document and the knowledge base.

For example, in the sentence "Apple should shrink its finance arm before it goes bananas.", the NER step must recognize "Apple" as a named entity and classify it as an organization. In the ED step, we generate several candidates for "Apple", such as "Apple Inc." and the fruit apple. After ranking each candidate against the context, we link "Apple" to "Apple Inc." rather than to the fruit.

In most ED works, the common approach is to use an off-the-shelf named entity recognizer and to run the NER and ED models separately, so the dependency between the two tasks is ignored. The NER system matters to ED because errors made by NER propagate to the linking step and cannot be recovered. In the example sentence above, an NER system may fail to mark "Apple" as a named entity, predicting it to be the fruit because another fruit, banana, is mentioned nearby. However, the mention "Apple" should be recognized as a named entity and linked to "Apple Inc." for this disambiguation task. If NER results are improved and more mentions are recognized, the precision of ED can increase. Moreover, if NER assigns a mention an accurate category, it becomes easier for ED to link the mention to the correct entity. Therefore, our work optimizes NER and ED simultaneously. Although joint optimization increases the problem complexity and requires careful consideration of both tasks, we believe that it can improve performance, since it is consistent with how humans process this kind of information.
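To make the conventional pipeline concrete, the sketch below illustrates the two-step approach just described. It is only an illustration under assumed interfaces: ner_tagger, candidate_dict, and score are hypothetical placeholder components, not the system proposed in this paper.

```python
# Minimal sketch of the conventional two-step entity linking pipeline
# (illustrative only; ner_tagger, candidate_dict, and score are hypothetical
# stand-ins for an off-the-shelf NER tagger, a candidate dictionary, and a
# context-based relevance scorer).

def two_step_entity_linking(sentence, ner_tagger, candidate_dict, score):
    """Run NER first, then disambiguate each recognized mention independently."""
    links = {}
    for mention, category in ner_tagger(sentence):             # e.g. ("Apple", "ORG")
        candidates = candidate_dict.get(mention.lower(), [])   # e.g. ["Apple Inc.", "Apple (fruit)"]
        if not candidates:
            continue  # a mention missed or mistyped by NER can never be recovered here
        # Rank candidates by relevance to the surrounding context; the category
        # predicted by NER could further filter the candidate set.
        links[mention] = max(candidates, key=lambda e: score(mention, e, sentence))
    return links
```

The joint model proposed in this paper replaces this hard hand-off with shared training of the two components.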
Our main contributions are as follows. In our ranking system for entity disambiguation, we introduce a deep neural network approach that combines a tree recursive neural network (TNN) [1] and a convolutional neural network (CNN) with an attention mechanism. Our model compares the similarities between a mention and its candidate entities by simultaneously considering semantic and background information, including Wikipedia description pages, the contexts the mention occurs in, and entity typing information. To the best of our knowledge, our work is the first to introduce tree recursive neural networks to the entity disambiguation task. Furthermore, we analyze the effect of the NER system on ED results. Our approach simultaneously optimizes the deep neural networks of NER and ED.

II. RELATED WORK

A. Entity Disambiguation

Recent works leverage deep neural networks to perform ED. Compared with traditional learning methods, deep learning methods do not rely on manually designed features. He et al. [2] are the first to use deep neural networks for the ED task; their model learns entity representations via stacked denoising autoencoders to measure context-entity similarity. Francis-Landau et al. [3] use CNNs to represent mentions' contexts and Wikipedia entity pages, and then combine the networks with sparse features to capture semantic similarity. Gupta et al. [4] model mentions' contexts with bidirectional long short-term memory (LSTM) encoders and model entity documents with CNNs, then concatenate this document information with fine-grained types. Ganea and Hofmann [5] apply a neural attention mechanism over the local context and pass the results to a global disambiguation step.

B. Named Entity Recognition

Recently, various NER studies incorporate deep neural networks. Chiu and Nichols [6] use a sequential bidirectional LSTM-CNN. Lample et al. [7] introduce two neural architectures for NER: one combines a bidirectional LSTM with conditional random fields, and the other is a transition-based approach.

C. Joint Named Entity Recognition and Disambiguation

Recent studies note the heavy drawbacks of performing NER and ED in two separate steps. Luo et al. [8] are the first to jointly optimize NER and ED. Nguyen et al. [9] propose a joint NER and ED model based on conditional random fields.

D. Tree Recursive Neural Network

TNNs [1] were introduced to compose sentence representations by exploiting the linguistic structure of texts. Given the representations of two candidate children, the network computes the representation of their parent, assuming the two nodes are merged, together with a score of how plausible the new node would be. By recursively repeating this process, we obtain a semantic representation of the phrase. LSTM units can also be combined with TNNs to preserve sequence information over time. TNNs can be regarded as a generalization of recurrent neural networks (RNNs), which have a linear structure. Unlike TNNs, RNNs need the preceding context to capture phrases and often over-represent the last words in the final vector. TNNs, in contrast, build larger structures from small pieces, which is closer to how language is structured.

TNNs are widely used in natural language processing tasks. They were first introduced by Pollack [10] in 1990. Socher et al. [11] apply TNNs to sentiment analysis on the Stanford Sentiment Treebank, deriving sentiment over the phrase structure in a bottom-up way. Tai et al. [1] add LSTM units to TNNs and apply the model to sentiment classification and sentence semantic relatedness. Li et al. [12] perform NER with bidirectional TNNs.
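The parent-composition step described at the start of this subsection can be written compactly. The following is a small illustrative sketch of the basic (non-LSTM) recursive composition; the parameter shapes, initialization, and scoring vector are assumptions made only for this example.

```python
# Illustrative sketch of the basic tree recursive composition step: two child
# vectors are merged into a parent vector and the merge is scored.
import numpy as np

rng = np.random.default_rng(0)
d = 50                                        # dimensionality of node representations
W = rng.normal(scale=0.01, size=(d, 2 * d))   # composition matrix
b = np.zeros(d)
w_score = rng.normal(scale=0.01, size=d)      # scoring vector for merge plausibility

def compose(left, right):
    """Merge two child vectors into a parent vector and score the merge."""
    parent = np.tanh(W @ np.concatenate([left, right]) + b)
    plausibility = float(w_score @ parent)
    return parent, plausibility

# Repeating compose() bottom-up over a parse tree yields a vector for every
# phrase and, at the root, for the whole sentence.
```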
III. PROPOSED METHOD

In the entity disambiguation task, we compute semantic similarities between the mention parts (context, document, type) and the candidate entity parts (context, document, type). Fig. 1 depicts an overview of our model. Specifically, we model the context parts using TNNs with LSTM units (Tree-LSTM), as Tree-LSTMs are effective at representing the semantic meaning of long sentences. The document part is encoded with a CNN equipped with an attention mechanism (ACNN); since the document part is relatively noisy, the attention mechanism helps capture the important components of the input. For the type part, we use a fine-grained entity type system to provide a set of types for each mention and entity to reduce ambiguities.

Fig. 1. Overview of the proposed method for entity disambiguation.

A. Context Representation

We choose Tree-LSTMs to encode context information so that we can obtain a better representation of text units by considering the semantic composition of their smaller elements. The goal of our Tree-LSTM model is to strengthen the high-level representations of tree nodes by recursively propagating information over the different branches of the tree. For example, as shown in Fig. 2, in the sentence "Apple is a technology company headquartered in California", the important words "Apple" and "company" are strengthened in the final context vector.

Fig. 2. An example of Tree-LSTM. The darker a word is shown, the more it contributes to the context vector.

In the context representation, we first embed the context parse tree and the context word vectors. We use GloVe vectors [13] as word vectors and the Stanford Parser [14] to obtain parse trees. After the embedding, we use the constituency Tree-LSTM [1] to generate hidden features for each node in the tree. For each node j, x_j is its input vector, N is the number of children, and h_{jk} and c_{jk} are the hidden state and memory cell of its k-th child, respectively. The gates and states of node j are computed as:

i_j = \sigma\left( W^{(i)} x_j + \sum_{k=1}^{N} U^{(i)}_k h_{jk} + b^{(i)} \right)

f_{jk} = \sigma\left( W^{(f)} x_j + \sum_{l=1}^{N} U^{(f)}_{kl} h_{jl} + b^{(f)} \right)

o_j = \sigma\left( W^{(o)} x_j + \sum_{k=1}^{N} U^{(o)}_k h_{jk} + b^{(o)} \right)

u_j = \tanh\left( W^{(u)} x_j + \sum_{k=1}^{N} U^{(u)}_k h_{jk} + b^{(u)} \right)

c_j = i_j \odot u_j + \sum_{k=1}^{N} f_{jk} \odot c_{jk}

h_j = o_j \odot \tanh(c_j)

After the constituency Tree-LSTM is applied, the context representation of each non-leaf node is recursively computed from the representations of its child nodes.
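For concreteness, below is a minimal numpy sketch of this node update. It follows the equations above (the binary constituency Tree-LSTM of Tai et al. [1]), but the class layout, parameter shapes, and initialization are assumptions made only for illustration, not the configuration used in our experiments.

```python
# Sketch of the constituency Tree-LSTM node update defined by the equations above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConstituencyTreeLSTM:
    def __init__(self, d_in, d_hid, n_children=2, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.01, size=shape)
        self.W = {g: init(d_hid, d_in) for g in "ifou"}              # W^(i), W^(f), W^(o), W^(u)
        self.U = {g: init(n_children, d_hid, d_hid) for g in "iou"}  # U_k^(g): one matrix per child
        self.Uf = init(n_children, n_children, d_hid, d_hid)         # U_kl^(f)
        self.b = {g: np.zeros(d_hid) for g in "ifou"}

    def node_forward(self, x, child_h, child_c):
        """x: input vector at the node (e.g. zeros at internal nodes);
        child_h, child_c: hidden states and memory cells of the children
        (empty lists at leaf nodes)."""
        hsum = lambda g: sum(self.U[g][k] @ h for k, h in enumerate(child_h))
        i = sigmoid(self.W["i"] @ x + hsum("i") + self.b["i"])
        o = sigmoid(self.W["o"] @ x + hsum("o") + self.b["o"])
        u = np.tanh(self.W["u"] @ x + hsum("u") + self.b["u"])
        # One forget gate per child; each looks at all children's hidden states.
        f = [sigmoid(self.W["f"] @ x
                     + sum(self.Uf[k, l] @ h for l, h in enumerate(child_h))
                     + self.b["f"])
             for k in range(len(child_h))]
        c = i * u + sum(fk * ck for fk, ck in zip(f, child_c))
        h = o * np.tanh(c)
        return h, c
```

Applying this update bottom-up over the binarized parse tree yields the hidden state of the root, which serves as the context representation.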
B. Document and Description Representation

Fig. 3. Overview of the document and description representation.

To represent the document information of a mention and the description page of an entity, we use CNNs with an attention mechanism to encode this background information. CNNs are good at extracting robust and abstract features from the input. After the word embedding, we use CNNs to produce a fixed-length vector for the document, using the rectified linear unit (ReLU) as the activation function and combining the results with max pooling. However, CNNs alone cannot capture the important components of the document. To reduce the effect of this noise, we add an attention mechanism that keeps the model focused on the parts that are important with respect to the mention to be disambiguated. Inspired by Zeng et al. [15], for the candidate description part, we use the entity representation as the attention weight on the output of the convolution layer to strengthen the key parts of the description page. The entity representation is computed as the average of the word vectors of the candidate entity words. For the mention document part, we use the candidate entity description vector as the attention weight to strengthen the important components in the mention document.

C. Type Representation

Fine-grained types provide structured information about an entity. For typing candidate entities, we use FIGER [16] by Ling and Weld, a publicly available package that contains 112 fine-grained entity types, to encode type information. FIGER returns a set of types for each candidate entity, and we compute the probability of each type being relevant to the entity. For mentions, context information is needed to infer fine-grained types, so we use the system of Shimaoka et al. [17], which uses an attentive neural encoder to predict fine-grained types for mentions.

D. Entity Disambiguation

Before running the ED system on mentions, we need to generate a number of candidate entities. These candidate entities have prior scores derived from a pre-computed frequency dictionary [18]. Similar to Francis-Landau et al. [3], we compute semantic similarities between mentions and candidate entities at multiple granularities. Let m_c, m_d, m_t, e_c, e_d, e_t be the distributed representations of the mention's context, the mention's document, the mention's type, the candidate entity's context, the candidate entity's document, and the candidate entity's type, respectively. The feature f(m, e) below collects the semantic similarities at all granularities:

f(m, e) = [\cos(m_c, e_c), \cos(m_c, e_d), \cos(m_c, e_t), \cos(m_d, e_c), \cos(m_d, e_d), \cos(m_d, e_t), \cos(m_t, e_c), \cos(m_t, e_d), \cos(m_t, e_t)]

The final score of our ED system for each candidate is a combination of its prior score and its semantic similarity scores. We choose the candidate entity with the highest final score as the result of our disambiguation system.
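The following sketch shows how this feature vector and a final candidate score can be computed. The nine cosine similarities follow the definition above; how the prior and similarity scores are combined is not spelled out here, so the weighted sum (and the weights w and alpha) in the sketch is an assumption for illustration only.

```python
# Sketch of the multi-granularity similarity feature f(m, e) and a possible
# final candidate score combining it with the prior score.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def similarity_feature(m, e):
    """m, e: dicts holding the 'context', 'document', and 'type' vectors of a
    mention and a candidate entity; returns the 9 pairwise cosine similarities."""
    parts = ("context", "document", "type")
    return np.array([cos(m[a], e[b]) for a in parts for b in parts])

def final_score(m, e, prior, w=None, alpha=0.5):
    """Combine the prior score with the semantic similarity scores
    (the weighted sum is an assumed combination, not taken from the paper)."""
    f = similarity_feature(m, e)
    sim = float(w @ f) if w is not None else float(f.mean())
    return alpha * prior + (1.0 - alpha) * sim

# The candidate entity with the highest final score is chosen as the link target.
```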
E. Joint Model

Since TNNs leverage linguistic information to represent sentences, we choose the NER system BRNN-CNN by Li et al. [12], which also includes TNNs, to construct the joint model. Since both the NER and ED models are deep neural networks, we perform joint training by combining the loss functions of the NER and ED models and minimizing the resulting function:

\theta^{*} = \arg\min_{\theta} \left( L_{NER} + L_{ED} \right)

Here, \theta is the set of parameters for both the NER and ED tasks.

IV. EXPERIMENT AND EVALUATION

A. Training Data

We use the CoNLL-YAGO [19] dataset as our training data. CoNLL-YAGO is the largest public ED dataset and consists of training, development and test sets. The training set contains 18,448 mentions in 946 documents, the development set contains 4,791 mentions in 216 documents, and the test set contains 4,485 mentions in 231 documents.

B. Test Datasets

AQUAINT [20]: This dataset consists of newswire text. It has 50 articles with over 700 mentions.

ACE04 [21]: This dataset is a subset of the ACE2004 coreference documents, annotated via Amazon Mechanical Turk. It has 35 articles, with 7 mentions per article on average.

CoNLL-YAGO [19]: We evaluate our model on the test split, which has 231 news articles and contains several rarer entities.

C. Hyperparameters

The pre-trained word embeddings have 300 dimensions. The mention and entity context window sizes are 20 words; the mention document and entity description window sizes are 100 words. We use Adam [22] for optimization with a learning rate of 0.001 and a batch size of 256, and train for 30 epochs.

D. Results

We compare our Tree-ACNN ED model with four popular systems: AGDISTIS [23], AIDA [19], Babelfy [24], and DBpedia Spotlight [25]. These systems are integrated within the GERBIL testing platform [26]. The results are measured by micro F1 scores. Our model outperforms all four systems.

TABLE I. RESULTS FOR ENTITY DISAMBIGUATION (MICRO F1)

                        AQUAINT    ACE04    CoNLL-YAGO
AGDISTIS                  48.69    68.30         60.11
AIDA                      55.28    71.74         69.12
Babelfy                   68.15    56.15         65.70
DBpedia Spotlight         49.95    47.53         46.67
Tree-ACNN (proposed)      76.78    76.92         81.66

We also compare our ED model using the Stanford NER [27] with our joint model using the BRNN-CNN NER [12] on the AQUAINT dataset. The results improve when the joint model is used.

TABLE II. RESULTS FOR JOINT NER AND ED (MICRO F1, AQUAINT)

ED model with Stanford NER      75.86
Joint NER and ED                79.39

V. CONCLUSION

We present a joint model that simultaneously learns NER and ED. In the ED system, we leverage the linguistic strengths of TNNs to represent the rich semantics of sentences, use CNNs with an attention mechanism to extract the informative content needed to resolve ambiguity, and encode fine-grained types to provide background information about entities. Our future work is to improve the joint model by learning the relative weighting between NER and ED. More sophisticated models, such as stacking NER and ED layers, can also be explored.
REFERENCES

[1] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," in Proc. 53rd ACL, 2015, pp. 1556-1566.
[2] Z. He et al., "Learning entity representation for entity disambiguation," in Proc. 51st ACL, 2013, pp. 30-34.
[3] M. Francis-Landau, G. Durrett, and D. Klein, "Capturing semantic similarity for entity linking with convolutional neural networks," in Proc. NAACL-HLT, 2016, pp. 1256-1261.
[4] N. Gupta, S. Singh, and D. Roth, "Entity linking via joint encoding of types, descriptions, and context," in Proc. EMNLP, 2017, pp. 2681-2690.
[5] O. E. Ganea and T. Hofmann, "Deep joint entity disambiguation with local neural attention," in Proc. EMNLP, 2017, pp. 2619-2629.
[6] J. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," in Proc. ACL, 2016, pp. 357-370.
[7] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," in Proc. NAACL-HLT, 2016, pp. 260-270.
[8] G. Luo, X. Huang, C. Y. Lin, and Z. Nie, "Joint named entity recognition and disambiguation," in Proc. EMNLP, 2015, pp. 879-888.
[9] D. B. Nguyen, M. Theobald, and G. Weikum, "J-NERD: joint named entity recognition and disambiguation with rich linguistic features," in Proc. ACL, 2016, pp. 215-229.
[10] J. B. Pollack, "Recursive distributed representations," Artificial Intelligence, vol. 46, pp. 77-105, November 1990.
[11] R. Socher et al., "Recursive deep models for semantic compositionality over a sentiment treebank," in Proc. EMNLP, 2013, pp. 1631-1642.
[12] P. Li, R. Dong, Y. Wang, J. Chou, and W. Ma, "Leveraging linguistic structures for named entity recognition with bidirectional recursive neural networks," in Proc. EMNLP, 2017, pp. 2664-2669.
[13] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proc. EMNLP, 2014, pp. 1532-1543.
[14] Stanford Parser. [Online]. Available: https://nlp.stanford.edu/software/srparser.html
[15] W. Zeng, J. Tang, and X. Zhao, "Entity linking on Chinese microblogs via deep neural network," IEEE Access, vol. 6, pp. 25908-25920, May 2018.
[16] X. Ling and D. S. Weld, "Fine-grained entity recognition," in Proc. AAAI, 2012, pp. 94-100.
[17] S. Shimaoka, P. Stenetorp, K. Inui, and S. Riedel, "Neural architectures for fine-grained entity type classification," in Proc. EACL, 2017, pp. 1271-1280.
[18] V. Spitkovsky and A. Chang, "A cross-lingual dictionary for English Wikipedia concepts," in Proc. LREC, 2012, pp. 3168-3175.
[19] J. Hoffart et al., "Robust disambiguation of named entities in text," in Proc. EMNLP, 2011, pp. 782-792.
[20] D. Milne and I. H. Witten, "Learning to link with Wikipedia," in Proc. CIKM, 2008, pp. 509-518.
[21] L. Ratinov, D. Roth, D. Downey, and M. Anderson, "Local and global algorithms for disambiguation to Wikipedia," in Proc. 49th ACL, 2011, pp. 1375-1384.
[22] D. P. Kingma and J. L. Ba, "Adam: a method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] R. Usbeck et al., "AGDISTIS - graph-based disambiguation of named entities using linked data," in Proc. ISWC, 2014, pp. 457-471.
[24] A. Moro, F. Cecconi, and R. Navigli, "Multilingual word sense disambiguation and entity linking for everybody," in Proc. ISWC, 2014, pp. 25-28.
[25] P. N. Mendes, M. Jakob, A. Garcia-Silva, and C. Bizer, "DBpedia Spotlight: shedding light on the web of documents," in Proc. 7th SEMANTiCS, 2011, pp. 1-8.
[26] R. Usbeck, M. Roder, and A. Ngonga, "GERBIL: general entity annotator benchmarking framework," in Proc. 24th IW3C2 (WWW), 2015, pp. 1133-1143.
[27] J. R. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in Proc. 43rd ACL, 2005, pp. 363-370.