國立雲林科技大學
National Yunlin University of Science and Technology
Chinese Named Entity Recognition using Lexicalized HMMs
Advisor: Dr. Hsu
Presenter: Zih-Hui Lin
Authors: Guohong Fu, Kang-Kwong Luke
Intelligent Database Systems Lab
Motivation
Objective
Introduction
NER as Known Word Tagging
Known Word Segmentation
Lexicalized HMM Tagger
Experiments
Conclusions
Future Research
Motivation
An HMM-based tagger does not provide contextual word information, which sometimes gives strong evidence for NER.
Some learning methods, such as ME and SVMs, usually require much more time for training and tagging.
Objective
Our system is able to integrate both the internal formation patterns and the surrounding contextual clues for NER under the framework of HMMs.
As a result, the performance of the system can be improved without losing its efficiency in training and tagging.
Introduction
We develop a two-stage NER system for Chinese.
─ Given a sentence, a known word bigram model is first applied to segment it into a meaningful sequence of known words.
─ Then, a lexicalized HMM tagger is used to assign each known word a proper hybrid tag that indicates its pattern in forming an entity and the category of the formed entity.
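For concreteness, a minimal sketch of this two-stage pipeline in Python; the `segmenter` and `tagger` callables are hypothetical stand-ins for the known word bigram segmenter and the lexicalized HMM tagger described on the following slides:

```python
def recognize_entities(sentence, segmenter, tagger):
    """Two-stage Chinese NER: known word segmentation, then hybrid tagging.

    segmenter -- hypothetical stand-in for the known word bigram model
    tagger    -- hypothetical stand-in for the lexicalized HMM tagger
    """
    known_words = segmenter(sentence)       # stage 1: segment into known words
    hybrid_tags = tagger(known_words)       # stage 2: one hybrid tag per known word
    return list(zip(known_words, hybrid_tags))
```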
Introduction (cont.)
Our system is able to explore three types of features:
─ entity-internal formation patterns
─ contextual word evidence
─ contextual category information
As a consequence, the system’s performance can be improved without losing its efficiency in training and processing.
NER as Known Word Tagging
We use the same named entity tag set as defined in the IEER-99 Mandarin named entity task; these entity categories are further encoded using twelve different abbreviated SGML tags.
Our system will also assign each common word in the input sentence a proper POS tag (from the Peking University POS tag set).
NER as Known Word Tagging
A known word w may take one of the following four patterns to present itself during NER:
─ ISE: w is an independent named entity
─ BOE: w is the beginning component of a named entity
─ MOE: w is in the middle of a named entity
─ EOE: w is at the end of a named entity
Example: "温家宝 / 总理 /" is tagged as
"<BOE>温</BOE><MOE>家</MOE><EOE>宝</EOE><ISE>总理</ISE>".
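A small illustrative sketch of this pattern tagging scheme, assuming each entity is already available as its list of known words (the function name and input format are assumptions made only for this example):

```python
def pattern_tags(entity_words):
    """Assign ISE/BOE/MOE/EOE pattern tags to the known words of one entity."""
    if len(entity_words) == 1:
        return ["ISE"]                            # a single known word is the whole entity
    return (["BOE"]                               # beginning component
            + ["MOE"] * (len(entity_words) - 2)   # middle components
            + ["EOE"])                            # ending component

print(pattern_tags(["温", "家", "宝"]))  # ['BOE', 'MOE', 'EOE']
print(pattern_tags(["总理"]))            # ['ISE']
```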
NER as Known Word Tagging
We define a hybrid tag set by merging the category tags and the pattern tags.
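As an illustration, such a hybrid tag set can be enumerated as the cross product of category tags and pattern tags; apart from CPN (which appears in the inconsistency example later in the deck) and the four pattern tags, the category names below are placeholders, not the paper's actual twelve abbreviations:

```python
# Hypothetical subset of category tags combined with the four pattern tags.
CATEGORY_TAGS = ["CPN", "LOC", "ORG"]            # CPN is from the slides; the rest are placeholders
PATTERN_TAGS = ["ISE", "BOE", "MOE", "EOE"]

HYBRID_TAGS = [f"{c}-{p}" for c in CATEGORY_TAGS for p in PATTERN_TAGS]
# e.g. 'CPN-ISE', 'CPN-BOE', 'CPN-MOE', 'CPN-EOE', 'LOC-ISE', ...
```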
Known Word Segmentation
The goal is to find the most appropriate segmentation W of the input sentence C that maximizes the conditional probability P(W | C), i.e.
─ W* = argmax_W P(W | C) ≈ argmax_W ∏_t P(w_t | w_{t-1})
─ P(w_t | w_{t-1}) denotes the known word bigram probability, which can be estimated from a segmented corpus using maximum likelihood estimation (MLE).
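A hedged sketch of the MLE estimation of these bigram probabilities from a segmented corpus (no smoothing is applied here; the `<BOS>` sentence-start marker is an assumption for illustration):

```python
from collections import Counter

def bigram_mle(segmented_corpus):
    """Estimate the known word bigram probability P(w_t | w_{t-1}) by MLE.

    segmented_corpus -- an iterable of sentences, each a list of known words
    """
    context_counts, bigram_counts = Counter(), Counter()
    for words in segmented_corpus:
        padded = ["<BOS>"] + list(words)                 # sentence-initial context marker
        context_counts.update(padded[:-1])               # count each preceding-word context
        bigram_counts.update(zip(padded[:-1], padded[1:]))
    return {pair: count / context_counts[pair[0]]
            for pair, count in bigram_counts.items()}
```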
Lexicalized HMM Tagger
We employ uniformly lexicalized models to perform the tagging of known words for Chinese NER.
The goal is to find an appropriate sequence of hybrid tags T that maximizes the conditional probability P(T | W), namely
─ T* = argmax_T P(T | W)
Lexicalized HMM Tagger
Two types of approximations are employed to simplify the general model:
─ The first approximation is based on the independence hypothesis used in standard HMMs.
─ The second approximation follows the notion of the lexicalization technique, bringing contextual words into the model.
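As a rough illustration only (a generic first-order lexicalized HMM, not necessarily the exact equations used in the paper), the two approximations can be written as:

```latex
% First approximation: standard HMM independence assumptions.
P(T \mid W) \;\propto\; \prod_{i} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})

% Second approximation: lexicalization, i.e. the transition is also
% conditioned on the previous word, adding contextual word evidence.
P(T \mid W) \;\propto\; \prod_{i} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}, w_{i-1})
```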
Lexicalized HMM Tagger
MLE will yield zero probabilities for any cases that are not observed in the training data. To solve this problem, we employ the linear interpolation smoothing technique.
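A minimal sketch of linear interpolation smoothing, assuming three estimates of decreasing specificity are blended; the lambda weights are illustrative assumptions and would be tuned in practice:

```python
def interpolated_prob(p_lexicalized, p_unlexicalized, p_uniform,
                      lambdas=(0.7, 0.2, 0.1)):
    """Blend a sparse lexicalized MLE estimate with less specific estimates
    so that events unseen in training never receive zero probability.
    The weights must sum to 1; typical practice tunes them on held-out data."""
    l1, l2, l3 = lambdas
    return l1 * p_lexicalized + l2 * p_unlexicalized + l3 * p_uniform
```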
Lexicalized HMM Tagger
In our implementation, we employ the classical Viterbi algorithm to find the most probable sequence of hybrid tags for a given sequence of known words, which works in three major steps as follows:
─ The generation of candidate tags
─ The decoding of the best tag sequence
─ The conversion of the results
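A compact sketch of the decoding step (the second step above), assuming per-word candidate tag lists and smoothed, strictly positive transition and emission probabilities; candidate generation and result conversion are omitted here:

```python
import math

def viterbi(words, candidates, trans_prob, emit_prob):
    """Find the most probable hybrid-tag sequence for a sequence of known words.

    words        -- the known words of one sentence
    candidates   -- candidates[i] is the list of candidate tags for words[i]
    trans_prob   -- trans_prob(t_prev, t): smoothed transition probability
    emit_prob    -- emit_prob(w, t): smoothed emission probability
    """
    # best[i][t] = (log-probability of the best path ending in tag t, back pointer)
    best = [{t: (math.log(emit_prob(words[0], t)), None) for t in candidates[0]}]
    for i in range(1, len(words)):
        layer = {}
        for t in candidates[i]:
            score, prev = max(
                (best[i - 1][tp][0]
                 + math.log(trans_prob(tp, t))
                 + math.log(emit_prob(words[i], t)), tp)
                for tp in candidates[i - 1]
            )
            layer[t] = (score, prev)
        best.append(layer)
    # Backtrace from the highest-scoring final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))
```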
Lexicalized HMM Tagger
Our system may yield two types of inconsistent tagging:
─ pattern inconsistency: when two adjacent known words are assigned inconsistent pattern tags such as "ISE:MOE" or "ISE:EOE".
─ class inconsistency: when two adjacent known words are labeled with different category tags while their pattern tags indicate that they belong to the same entity, e.g.
"<CPN-BOE>张</CPN-BOE><Vg-MOE>晓</Vg-MOE><CPN-EOE>华</CPN-EOE>".
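A small sketch of how both kinds of inconsistency could be detected over adjacent hybrid tags, assuming the CATEGORY-PATTERN naming seen in the example above and the transition constraints implied by the four pattern tags:

```python
# Pattern-tag pairs that may legally occur on adjacent known words.
VALID_PATTERN_PAIRS = {
    ("ISE", "ISE"), ("ISE", "BOE"), ("EOE", "ISE"), ("EOE", "BOE"),
    ("BOE", "MOE"), ("BOE", "EOE"), ("MOE", "MOE"), ("MOE", "EOE"),
}

def find_inconsistencies(hybrid_tags):
    """Report pattern and class inconsistencies between adjacent hybrid tags."""
    issues = []
    for a, b in zip(hybrid_tags, hybrid_tags[1:]):
        cat_a, pat_a = a.rsplit("-", 1)
        cat_b, pat_b = b.rsplit("-", 1)
        if (pat_a, pat_b) not in VALID_PATTERN_PAIRS:
            issues.append(("pattern", a, b))   # e.g. ISE followed by MOE
        elif pat_b in ("MOE", "EOE") and cat_a != cat_b:
            issues.append(("class", a, b))     # same entity, different categories
    return issues

print(find_inconsistencies(["CPN-BOE", "Vg-MOE", "CPN-EOE"]))
# [('class', 'CPN-BOE', 'Vg-MOE'), ('class', 'Vg-MOE', 'CPN-EOE')]
```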
Experiments
F-measure is a weighted harmonic mean of precision and recall:
─ R = (correctly recognized NEs) / (NEs in the manually annotated corpus)
─ P = (correctly recognized NEs) / (NEs recognized by the system)
─ F = (β² + 1) × P × R / (β² × P + R), where β is the weighting coefficient
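A minimal sketch of these metrics as code; beta = 1 gives the balanced F-measure commonly reported for NER:

```python
def ner_scores(correct, recognized, gold, beta=1.0):
    """Precision, recall and F-measure from entity counts.

    correct    -- number of correctly recognized NEs
    recognized -- number of NEs recognized by the system
    gold       -- number of NEs in the manually annotated corpus
    """
    p = correct / recognized
    r = correct / gold
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f
```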
Experimental Data
Experiments
Our first aim is to examine how the use of the lexicalization technique affects the performance of our system.
Experiments
The second aim is to investigate whether the word-level model or the character-level model is more effective for Chinese NER.
Experiments
The final aim is to compare our system with other public systems for Chinese NER.
Conclusions
The experimental results on different public corpora indicate that
─ the NER performance can be significantly enhanced using lexicalization techniques.
─ character-level tagging is comparable to, and may even outperform, known-word-based tagging when a lexicalized method is applied.
Future Directions
Our current tagger is a purely statistical system; it will inevitably suffer from the problem of data sparseness, particularly in open-domain applications.
Our system usually fails to yield correct results for some complicated NEs such as nested organization names.