Deep Neural Architectures for Joint Named Entity Recognition and Disambiguation
Qianwen Wang and Mizuho Iwaihara
Graduate School of Information, Production and Systems
Waseda University
Kitakyushu, Japan
qianwen@akane.waseda.jp, iwaihara@waseda.jp
Abstract—Current entity linking methods typically first apply
a named entity recognition (NER) model to extract a named
entity and classify it into a predefined category, then apply an
entity disambiguation model to link the named entity to a
corresponding entity in the reference knowledge base. However,
these methods ignore the inter-relations between the two tasks.
Our work jointly optimizes deep neural models of both NER and
entity disambiguation. In the entity disambiguation task, our
deep neural model includes a recursive neural network and
convolutional neural network with attention mechanism. Our
model compares similarities between a mention and candidate
entities by simultaneously considering semantic and background
information, including Wikipedia description pages, contexts
where the mention occurs, and entity typing information. The
experiments show that our model can effectively leverage the
semantic information of context and performs competitively with
conventional approaches.
Keywords— named entity recognition, entity disambiguation,
recursive neural network, convolutional neural network, attention
I. INTRODUCTION
Named Entity Recognition (NER) is a combined task of
locating named entities in texts and classifying them into predefined categories such as persons, organizations, locations, etc.
After NER, Entity Disambiguation (ED) maps each mention
to an entity in a reference knowledge base such as Wikipedia.
Previous studies first generate a set of candidate entities in the
knowledge bases for each mention in the document, and then
solve the disambiguation task as a ranking problem. The
ranking model computes the relevance of each entity candidate
to the corresponding entity mention by using available context
information in both documents and knowledge bases. For
example, in the sentence ‘Apple should shrink its finance arm
before it goes bananas.', in the NER step we need to
recognize 'Apple' as a named entity and classify it as an
organization. In the ED step, we generate several candidates
for 'Apple', such as 'Apple Inc.' and the fruit apple. After
ranking each candidate against the context, we choose
'Apple Inc.' instead of the fruit apple.
In most ED works, the common approach is to use an off-the-shelf
named entity recognizer and to run the NER and ED models
separately. Therefore, the dependency between the two
tasks is ignored. NER quality matters to ED because errors
made by NER propagate to the linking step and cannot be
recovered. For example, in the sentence 'Apple should shrink
its finance arm before it goes bananas', an NER system may
fail to mark 'Apple' as a named entity, interpreting it as the
fruit because another fruit, 'banana', is mentioned nearby.
However, for this disambiguation task the mention should be
recognized as a named entity and linked to 'Apple Inc'. If NER
results are improved and more mentions are recognized, the
precision of ED can be increased. Also, if NER classifies a
mention into an accurate category, it is easier for ED to link the
correct entity with the mention. Therefore, our work optimizes
NER and ED simultaneously. Although this approach increases
the problem complexity and requires careful consideration of
both tasks, we believe that it can improve performance, since
it is consistent with how humans process this kind of
information.
Our main contributions are as follows. In our ranking
system for entity disambiguation, we introduce a deep neural
network approach including the tree recursive neural network
(TNN) [1] and convolutional neural network (CNN) with
attention mechanism. Our model compares the similarities
between a mention and candidate entities by simultaneously
considering the semantic and background information,
including Wikipedia description pages, contexts the mention
occurs in, and entity typing information. To the best of our
knowledge, our work is the first to introduce tree recursive
neural networks to the entity disambiguation task. Furthermore,
we analyze the effects of the NER system on ED results. Our
approach simultaneously optimizes the results of deep neural
networks of NER and ED.
II. RELATED WORK
A. Entity Disambiguation
Recent works leverage deep neural networks to perform ED.
Compared with traditional learning methods, deep learning
methods do not rely on manually designed features. He et al. [2]
are the first to use deep neural networks in the ED task. Their
model learns entity representations via stacked denoising
autoencoders to measure context-entity similarity. Francis-Landau et al. [3]
use CNNs to represent mentions’ contexts and Wikipedia
entity pages. Then they combine the networks with sparse
features to capture semantic similarity. Gupta et al. [4] model
mentions’ contexts with bidirectional long short-term memory
(LSTM) encoders and model entity documents with CNNs.
Then they concatenate this document information with
fine-grained types. Ganea and Hofmann [5] use a neural
attention mechanism over local context and pass the results to
global disambiguation.
B. Named Entity Recognition
Recently, various NER studies incorporate deep neural
networks. Chiu and Nichols [6] use a sequential bidirectional
LSTM-CNN. Lample et al. [7] introduce two neural architectures
for NER: one combines a bi-LSTM with conditional random
fields, and the other is a transition-based approach.
C. Joint Named Entity Recognition and Disambiguation
Recent studies note the significant drawbacks of performing
NER and ED in two separate steps. Luo et al. [8] are the first to jointly
optimize NER and ED. Nguyen et al. [9] propose a joint NER
and ED model based on conditional random fields.
D. Tree Recursive Neural Network
TNNs [1] are introduced to compose sentence representations
by utilizing the linguistic structure of texts. Given the
representations of two candidate child nodes, the network
computes the representation of their parent node if the two
are merged, together with a score of how plausible the new
node would be. By recursively repeating this process, we
obtain a semantic representation of the whole phrase. LSTMs
can also be combined with TNNs to preserve sequence
information over time. TNNs can be regarded as a superset of
RNNs, which have a linear structure. Unlike TNNs, RNNs need
the preceding context to capture phrases and often capture too
much of the last words in the final vector. TNNs, in contrast,
build larger pieces of structure from smaller ones, which is
closer to the structure of language.
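As a rough illustration of this bottom-up composition (a NumPy sketch with randomly initialized toy parameters of our own, not the networks evaluated later in this paper), two child vectors are merged into a parent vector together with a plausibility score:

```python
import numpy as np

d = 4                                   # toy embedding dimension
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d))     # composition matrix
b = np.zeros(d)
v = rng.standard_normal(d)              # scoring vector

def compose(c1, c2):
    """Compute a parent representation from two children and a plausibility score."""
    parent = np.tanh(W @ np.concatenate([c1, c2]) + b)
    score = float(v @ parent)           # how plausible the merged node is
    return parent, score

left, right = rng.standard_normal(d), rng.standard_normal(d)
parent, score = compose(left, right)    # repeat bottom-up to cover the whole phrase
```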
TNNs are widely used in natural language processing tasks.
They were first introduced by Pollack [10] in 1990. Socher et
al. [11] apply TNNs to sentiment analysis on the Stanford
Sentiment Treebank, deriving sentiment over phrase structures
in a bottom-up way. Tai et al. [1] add LSTMs to TNNs and apply
them to sentiment classification and sentence semantic
relatedness. Li et al. [12] perform NER with bidirectional TNNs.
III. PROPOSED METHOD
In the Entity Disambiguation task, we compute semantic
similarities between mention parts (context, document, type)
and candidate entity parts (context, document, type). Fig. 1
depicts an overview of our model.
Specifically, we model the context parts using TNNs with
LSTM (Tree-LSTM), as Tree-LSTM is effective at representing
the semantic meaning of long sentences. The document part is
encoded with CNNs attached with an attention mechanism
(ACNN). Since the document part is relatively noisy, we employ
the attention mechanism to capture the important components of
the input. For the type part, we use a fine-grained entity type
system to provide a series of types for each mention and entity
to reduce ambiguities.
Fig. 1. Overview of the proposed method for entity disambiguation
A. Context Representation
We choose Tree-LSTMs to encode context information so
that we can obtain a better representation of text units by
considering the semantic composition of small elements. The
goal of our Tree-LSTM model is to strengthen the high-level
representations of tree nodes by recursively propagating
information over the branches of the tree. For example, as
shown in Fig. 2, in the sentence ‘Apple is a technology
company headquartered in California’, important words ‘apple’
and ‘company’ are strengthened in the final context vector.
Fig. 2. An example of Tree-LSTM. The darker a word, the more strongly it
contributes to the final context vector.
In the context representation, we first embed the context
words and obtain the context parse tree. We use GloVe vectors
[13] as word vectors and the Stanford Parser [14] to produce
parse trees. After the embedding, we use constituency
Tree-LSTMs [1] to generate hidden features for each node in
the tree. For each node $j$, $h_{jk}$ and $c_{jk}$ denote the
hidden state and memory cell of its $k$-th child, respectively.
$$i_j = \sigma\left(W^{(i)} x_j + \sum_{k=1}^{N} U^{(i)}_{k} h_{jk} + b^{(i)}\right)$$

$$f_{jk} = \sigma\left(W^{(f)} x_j + \sum_{l=1}^{N} U^{(f)}_{kl} h_{jl} + b^{(f)}\right)$$

$$o_j = \sigma\left(W^{(o)} x_j + \sum_{k=1}^{N} U^{(o)}_{k} h_{jk} + b^{(o)}\right)$$

$$u_j = \tanh\left(W^{(u)} x_j + \sum_{k=1}^{N} U^{(u)}_{k} h_{jk} + b^{(u)}\right)$$

$$c_j = i_j \odot u_j + \sum_{k=1}^{N} f_{jk} \odot c_{jk}$$

$$h_j = o_j \odot \tanh(c_j)$$

where $x_j$ is the input vector at node $j$, $N$ is the number of children, $\sigma$ is the logistic sigmoid, and $\odot$ denotes element-wise multiplication.
After the constituency Tree-LSTM is applied, the context
representations of the non-leaf nodes are recursively computed
from the representations of their child nodes.
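The update above translates directly into code. The following NumPy sketch (toy dimensions and random parameters of ours, shown for a binary constituency node; a paraphrase of the equations rather than the paper's implementation) computes one Tree-LSTM node from its two children:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, N = 8, 2                               # hidden size, number of children
rng = np.random.default_rng(1)
P = {g: (rng.standard_normal((d, d)) * 0.1,          # W^(g)
         rng.standard_normal((N, d, d)) * 0.1,       # U^(g)_k
         np.zeros(d))                                 # b^(g)
     for g in "ifou"}
Uf = rng.standard_normal((N, N, d, d)) * 0.1          # U^(f)_{kl}

def node_update(x, h_children, c_children):
    """One Tree-LSTM step: gate over the children, then merge their memory cells."""
    def gate(g, fn):
        W, U, b = P[g]
        return fn(W @ x + sum(U[k] @ h_children[k] for k in range(N)) + b)
    i = gate("i", sigmoid)
    o = gate("o", sigmoid)
    u = gate("u", np.tanh)
    # One forget gate per child, each looking at all children's hidden states.
    f = [sigmoid(P["f"][0] @ x
                 + sum(Uf[k][l] @ h_children[l] for l in range(N))
                 + P["f"][2])
         for k in range(N)]
    c = i * u + sum(f[k] * c_children[k] for k in range(N))
    return o * np.tanh(c), c                          # (h_j, c_j)

h, c = node_update(rng.standard_normal(d),
                   [rng.standard_normal(d)] * N, [np.zeros(d)] * N)
```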
B. Document and Description Representation
Fig. 3. The overview of document and description representation.
To represent the document information of a mention and the
description page of an entity, we use CNNs attached with an
attention mechanism to encode this background information.
CNNs are good at extracting robust and abstract features of the
input. After the word embedding, we use CNNs to produce a
fixed-length vector for the document, using the rectified
linear unit (ReLU) as the activation function and combining the
results with max pooling. However, CNNs alone cannot capture
the important components of the document. To decrease the
effect of noise, we add an attention mechanism that keeps the
model focused on the parts that are important with respect to
the mention we need to disambiguate.
Inspired by Zeng et al. [15], for the candidate description
part, we use the entity representation as the attention weight
on the output of the convolution layer to strengthen the key
parts of the description page. The entity representation is
computed as the average of the word vectors of the candidate
entity's words. For the mention document part, we use the
candidate entity description vectors as attention weights to
strengthen the important components in the mention document.
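A minimal sketch of this attention step may help; the dot-product attention form and all shapes below are our assumptions, since the paper does not spell out the exact weighting function:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
L, d = 6, 8                              # number of convolved positions, feature size
conv_out = rng.standard_normal((L, d))   # output of the convolution layer

# Entity representation: average of the candidate entity's word vectors.
entity_words = rng.standard_normal((3, d))
entity_rep = entity_words.mean(axis=0)

# Dot-product attention: positions similar to the entity get larger weights,
# strengthening the key parts of the description before pooling.
weights = softmax(conv_out @ entity_rep)
attended = weights[:, None] * conv_out
doc_vec = attended.max(axis=0)           # max pooling over positions
```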
C. Type Representation
Fine-grained types provide structured information for an
entity. For typing candidate entities, we use FIGER [16] by
Ling and Weld, a publicly available package that contains 112
fine-grained entity types. FIGER returns a set of types for
each candidate entity, and we compute the probability of each
type being relevant to the entity.
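As an illustration, the resulting type information can be flattened into a fixed-length vector over the 112 FIGER types; the index mapping and probabilities below are invented for the example:

```python
import numpy as np

NUM_TYPES = 112                      # size of the FIGER type inventory
TYPE_INDEX = {"/organization": 0, "/organization/company": 1}  # hypothetical slice

def type_vector(type_probs: dict) -> np.ndarray:
    """Encode {FIGER type: relevance probability} as a fixed-length vector."""
    v = np.zeros(NUM_TYPES)
    for t, p in type_probs.items():
        v[TYPE_INDEX[t]] = p
    return v

e_t = type_vector({"/organization": 0.9, "/organization/company": 0.8})
```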
We need context information to determine the fine-grained
types of each mention. We use the system of Shimaoka et al.
[17], which uses an attentive encoder neural model to predict
fine-grained types for mentions.
D. Entity Disambiguation
Before we run the ED system on mentions, we need to generate
a number of candidate entities. These candidate entities have
prior scores derived from a pre-computed frequency dictionary
[18].
Similar to Francis-Landau et al. [3], we compute semantic
similarities between mentions and candidate entities at
multiple granularities. Let $m_c$, $m_d$, $m_t$, $e_c$, $e_d$,
$e_t$ be the distributed representations of the mention's
context, the mention's document, the mention's type, the
candidate entity's context, the candidate entity's document,
and the candidate entity's type, respectively. The feature
vector $f(m, e)$ below collects the semantic similarities at
all granularities:

$$f(m, e) = [\cos(m_c, e_c), \cos(m_c, e_d), \cos(m_c, e_t), \cos(m_d, e_c), \cos(m_d, e_d), \cos(m_d, e_t), \cos(m_t, e_c), \cos(m_t, e_d), \cos(m_t, e_t)]$$

The final scores of our ED system combine the prior scores and
the semantic similarity scores. We choose the candidate entity
with the highest final score as the result of our
disambiguation system.
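Concretely, the feature vector and the final score can be sketched as follows; summing the prior with the unweighted similarities is our simplification, as the paper does not specify the exact combination:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_features(m, e):
    """All nine cosine similarities between mention and entity granularities."""
    return [cos(m[i], e[j]) for i in "cdt" for j in "cdt"]

def final_score(prior, m, e):
    # Simplified combination: prior plus the summed similarities.
    return prior + sum(similarity_features(m, e))

rng = np.random.default_rng(3)
m = {k: rng.standard_normal(8) for k in "cdt"}   # mention context/document/type
e = {k: rng.standard_normal(8) for k in "cdt"}   # candidate entity counterparts
score = final_score(0.6, m, e)
```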
E. Joint Model
Since TNNs leverage linguistic information to represent
sentences, we choose the BRNN-CNN NER system by Li et al.
[12], which also includes TNNs, to construct the joint model. Since
both the NER and ED models use deep neural networks, we
perform their joint training by combining the loss functions of
the NER and ED models and minimizing the resulting function.
$$\theta^{*} = \arg\min_{\theta} \left( L_{NER} + L_{ED} \right)$$

Here, $\theta$ is the set of parameters for both the NER and ED
tasks.
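In a framework such as PyTorch, this joint objective amounts to summing the two losses before backpropagating through the shared parameter set; the placeholder linear modules below merely stand in for the BRNN-CNN NER model and the Tree-ACNN ED model:

```python
import torch

# Placeholder modules standing in for the BRNN-CNN NER model and
# the Tree-ACNN ED model described above.
ner_model = torch.nn.Linear(10, 5)
ed_model = torch.nn.Linear(10, 2)
params = list(ner_model.parameters()) + list(ed_model.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

def joint_step(x, ner_target, ed_target):
    """One joint update: minimize L_NER + L_ED over the shared parameter set."""
    loss_ner = torch.nn.functional.cross_entropy(ner_model(x), ner_target)
    loss_ed = torch.nn.functional.cross_entropy(ed_model(x), ed_target)
    loss = loss_ner + loss_ed            # theta* = argmin (L_NER + L_ED)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

x = torch.randn(4, 10)
loss = joint_step(x, torch.randint(0, 5, (4,)), torch.randint(0, 2, (4,)))
```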
IV. EXPERIMENT AND EVALUATION
A. Training Data
We use the CoNLL-YAGO [19] dataset as our training data.
CoNLL-YAGO is the largest public ED dataset, consisting of
training, development, and test sets. The training set contains
18,448 mentions in 946 documents; the development set contains
4,791 mentions in 216 documents; and the test set contains
4,485 mentions in 231 documents.
B. Test Datasets
AQUAINT [20]: This dataset consists of newswire text
data. It has 50 articles with over 700 mentions.
ACE04 [21]: This dataset is a subset of the ACE2004
coreference documents, annotated by Amazon Mechanical Turk.
It has 35 articles, and each article has 7 mentions on average.
CoNLL-YAGO [19]: We evaluated our model on the test set,
which has 231 news articles and includes several rare entities.
C. Hyperparameters
The vectors for the pre-trained word embeddings have 300
dimensions. The mention and entity context window sizes are
20. The mention document and entity description window sizes
are 100. We use Adam [22] for optimization. The learning rate
is 0.001 and batch size is 256. The number of epochs for
training is 30.
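For reference, these settings map directly onto a small training configuration; the key names are ours:

```python
CONFIG = {
    "word_dim": 300,         # pre-trained GloVe embedding size
    "context_window": 20,    # mention and entity context window
    "document_window": 100,  # mention document / entity description window
    "optimizer": "adam",
    "learning_rate": 0.001,
    "batch_size": 256,
    "epochs": 30,
}
```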
D. Results
We compare our Tree-ACNN ED model with four popular
systems: AGDISTIS [23], AIDA [19], Babelfy [24], and
DBpedia Spotlight [25]. These systems are integrated within
the Gerbil testing platform [26]. The results are measured by
micro F1 scores. Our model outperforms all four systems.
TABLE I. RESULTS FOR ENTITY DISAMBIGUATION (MICRO F1)

| System | AQUAINT | ACE04 | CoNLL-YAGO |
|---|---|---|---|
| AGDISTIS | 48.69 | 68.30 | 60.11 |
| AIDA | 55.28 | 71.74 | 69.12 |
| Babelfy | 68.15 | 56.15 | 65.70 |
| DBpedia Spotlight | 49.95 | 47.53 | 46.67 |
| Tree-ACNN (proposed) | 76.78 | 76.92 | 81.66 |
We also compare our ED model using Stanford NER [27] with
our joint model using the BRNN-CNN NER [12] on the AQUAINT
dataset. The results improve when we apply the joint model.
TABLE II. RESULTS FOR JOINT NER AND ED (AQUAINT)

| Model | Micro F1 |
|---|---|
| ED model with Stanford NER | 75.86 |
| Joint NER and ED | 79.39 |
V. CONCLUSION
We present a joint model to simultaneously learn NER and ED.
In the ED system, we leverage the linguistic strengths of TNNs
to represent rich semantic knowledge of sentences. We use CNNs
attached with an attention mechanism to extract informative
content for resolving ambiguity. We also encode fine-grained
types to provide background information about entities.
Our future work is to improve the joint model by learning the
relative weighting between NER and ED. More sophisticated
models, such as stacking NER and ED layers, can also be
explored.
REFERENCES
[1] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," in Proc. 53rd ACL, 2015, pp. 1556-1566.
[2] Z. He et al., "Learning entity representation for entity disambiguation," in Proc. 51st ACL, 2013, pp. 30-34.
[3] M. Francis-Landau, G. Durrett, and D. Klein, "Capturing semantic similarity for entity linking with convolutional neural networks," in Proc. NAACL-HLT, 2016, pp. 1256-1261.
[4] N. Gupta, S. Singh, and D. Roth, "Entity linking via joint encoding of types, description, and context," in Proc. EMNLP, 2017, pp. 2681-2690.
[5] O. E. Ganea and T. Hofmann, "Deep joint entity disambiguation with local neural attention," in Proc. EMNLP, 2017, pp. 2619-2629.
[6] J. Chiu and E. Nichols, "Named entity recognition with bidirectional LSTM-CNNs," in Proc. ACL, 2016, pp. 357-370.
[7] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, "Neural architectures for named entity recognition," in Proc. NAACL-HLT, 2016, pp. 260-270.
[8] G. Luo, X. Huang, C. Y. Lin, and Z. Nie, "Joint named entity recognition and disambiguation," in Proc. EMNLP, 2015, pp. 879-888.
[9] D. B. Nguyen, M. Theobald, and G. Weikum, "J-NERD: joint named entity recognition and disambiguation with rich linguistic features," in Proc. ACL, 2016, pp. 215-229.
[10] J. B. Pollack, "Recursive distributed representations," Artificial Intelligence, vol. 46, pp. 77-105, November 1990.
[11] R. Socher et al., "Recursive deep models for semantic compositionality over a sentiment treebank," in Proc. EMNLP, 2013, pp. 1631-1642.
[12] P. Li, R. Dong, Y. Wang, J. Chou, and W. Ma, "Leveraging linguistic structures for named entity recognition with bidirectional recursive neural networks," in Proc. EMNLP, 2017, pp. 2664-2669.
[13] J. Pennington, R. Socher, and C. Manning, "GloVe: global vectors for word representation," in Proc. EMNLP, 2014, pp. 1532-1543.
[14] Stanford Parser. [Online]. Available: https://nlp.stanford.edu/software/srparser.html
[15] W. Zeng, J. Tang, and X. Zhao, "Entity linking on Chinese microblogs via deep neural network," IEEE Access, vol. 6, pp. 25908-25920, May 2018.
[16] X. Ling and D. S. Weld, "Fine-grained entity recognition," in Proc. AAAI, 2012, pp. 94-100.
[17] S. Shimaoka, P. Stenetorp, K. Inui, and S. Riedel, "Neural architectures for fine-grained entity type classification," in Proc. EACL, 2017, pp. 1271-1280.
[18] V. Spitkovsky and A. Chang, "A cross-lingual dictionary for English Wikipedia concepts," in Proc. LREC, 2012, pp. 3168-3175.
[19] J. Hoffart et al., "Robust disambiguation of named entities in text," in Proc. EMNLP, 2011, pp. 782-792.
[20] D. Milne and I. H. Witten, "Learning to link with Wikipedia," in Proc. CIKM, 2008, pp. 509-518.
[21] L. Ratinov, D. Roth, D. Downey, and M. Anderson, "Local and global algorithms for disambiguation to Wikipedia," in Proc. 49th ACL, 2011, pp. 1375-1384.
[22] D. P. Kingma and J. L. Ba, "Adam: a method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[23] R. Usbeck et al., "AGDISTIS - graph-based disambiguation of named entities using linked data," in Proc. ISWC, 2014, pp. 457-471.
[24] A. Moro, F. Cecconi, and R. Navigli, "Multilingual word sense disambiguation and entity linking for everybody," in Proc. ISWC, 2014, pp. 25-28.
[25] P. N. Mendes, M. Jakob, A. Garcia-Silva, and C. Bizer, "DBpedia Spotlight: shedding light on the web of documents," in Proc. 7th SEMANTiCS, 2011, pp. 1-8.
[26] R. Usbeck, M. Röder, and A. Ngonga, "GERBIL: general entity annotator benchmarking framework," in Proc. 24th WWW, 2015, pp. 1133-1143.
[27] J. R. Finkel, T. Grenager, and C. Manning, "Incorporating non-local information into information extraction systems by Gibbs sampling," in Proc. 43rd ACL, 2005, pp. 363-370.