Research Journal of Applied Sciences, Engineering and Technology 4(20): 3973-3980, 2012
ISSN: 2040-7467
© Maxwell Scientific Organization, 2012
Submitted: December 20, 2011
Accepted: April 20, 2012
Published: October 15, 2012
Named Entity Recognition Based on a Machine Learning Model
Jing Wang, Zhijing Liu and Hui Zhao
School of Computer Science and Technology, Xidian University, Xi’an, 710071 China
Abstract: For the recruitment information in Web pages, a novel unified model for named entity recognition
is proposed in this study. The model provides a simple statistical framework that incorporates a wide variety of
linguistic knowledge and statistical models in a unified way. In our approach, Multi-Rules are first built for
a better representation of the named entity, in order to emphasize the specific semantics and term space of the
named entity. Then an optimal algorithm for the hierarchically structured DSTCRFs is applied, in order to
pick out the structure attributes of the named entity from the recruitment knowledge and to optimize the efficiency
of training. The experimental results show that the accuracy rate is significantly improved and the
complexity of sample training is decreased.
Keywords: Entity identification, Hidden Markov Model (HMM), named entity
INTRODUCTION
In general, the task of Named Entity Recognition (NER) is to recognize phrases in a document, such as the Person Name (PN), Organization Name (ON), Location Name (LN) and so on. As a basic subtask of Natural Language Processing (NLP), it has been applied to Information Extraction (IE), Question Answering (QA), parsing and metadata tagging in the Semantic Web. In this study, NER is used to extract a special kind of Web information, recruitment information, in which we mainly focus on Position (PSN), ON, LN and Date.
In this study, we pay particular attention to PSN recognition, which can help us obtain more knowledge about the recruitment information. In fact, compared with the ON and LN, the PSN is a special and complex kind of NE. Firstly, a PSN always contains some special words which denote a certain job, such as "engineer" and "teacher". Secondly, the length of a PSN is variable: some contain dozens of words and some only two, which makes it difficult to decide the borders of PSNs. Finally, a PSN often contains a location name, such as "Shanghai sub-company manager". The reasons mentioned above seriously influence the performance of NER.
There have been many studies on PN and LN recognition, but very little research on PSN. In this study, a novel unified model for NER based on improved Dynamic hidden Semi-Markov Tree Conditional Random Fields (DSTCRFs) is proposed for PSN recognition. This article is organized as follows. We briefly review the related work on NER, then present the NE features and the definition of the DSTCRFs model for NER in the methodology. The experimental results follow and, finally, a summary and future work are given in the conclusion.
In this way, we reduce the state transitions and the total number of input symbols in the sequence to improve the operating efficiency of NER. The model provides a simple statistical framework that incorporates a wide variety of linguistic knowledge and statistical models in a unified way. The experimental results show that the accuracy rate is significantly improved and the complexity of sample training is decreased.
LITERATURE REVIEW
In the Seventh Message Understanding Conference (MUC-7) (Eason et al., 1955), 95% recall and 92% precision were reached at the best level for English NER systems. However, compared with English NER, Chinese NER is still at an initial stage. In the Multilingual Entity Task (MET-2) (Chinchor, 1988), the best Chinese NER system in MUC-7 achieved precisions of 66, 89 and 89% and recalls of 92, 91 and 88%.
At present, the models for Chinese NER can mainly be divided into 2 categories: rule-based methods and statistical methods. Rule-based methods commonly use features of the word to trigger the NER (Toutanova et al., 2005). For instance, the Chinese family name is adopted to trigger PN recognition (Kou, 2008) and the keywords at the end of the organization name trigger ON recognition (Sun et al., 2002). The statistical methods build models for NER based on statistical analysis of a large corpus of NEs and semantic analysis of their context features (Cohen and Sarawagi, 2004; Fu and Luke, 2005).
Statistical machine learning has been introduced for Chinese NER, such as N-gram grammars, the Hidden Markov Model
(HMM) (Cohen and Sarawagi, 2004; Fu and Luke, 2005),
Maximum Entropy Model (MEM) (Fresko et al., 2005;
Uchimoto et al., 2000; Lim et al., 2004), Support Vector
Machine (SVM) (Li et al., 2005), Conditional Random
Fields (CRF) (Cohn and Blunsom, 2005; Roark et al.,
2004; Zou et al., 2005) and so on. Later, an integrated model was proposed, which combined different ML models into a unified statistical framework to improve the system's performance.
CRFs have been widely adopted for text processing applications, such as Part-of-Speech (POS) tagging, chunking and semantic role labeling (Sarawagi and Cohen, 2004; Chen and Goodman, 1998; Gao et al., 2006). Recently, the application of CRFs has been expanded to word alignment, IE and document summarization (Lafferty et al., 2001; Peng et al., 2004; Peng and McCallum, 2004). Thereby, more and more improved CRF methods have been proposed to address this challenge (Li and McCallum, 2003).
2D Conditional Random Fields (2D CRFs) (Zhu et al., 2005) were proposed to extract objects (a kind of NE) from two-dimensionally laid-out Web pages and to better incorporate two-dimensional neighborhood dependencies. However, they did not take into account cross-distance dependencies. For simple objects the IE results are very good, but for complex objects the results are not satisfactory. A possible explanation is that there are many noise elements between the different blocks.
DCRFs (Zhu et al., 2008) combine the best of both conditional random fields and the widely successful Dynamic Bayesian Networks (DBNs). Often, however, we wish to represent more complex interactions between labels, for example when longer-range dependencies exist between labels, when the state can be naturally represented as a vector of variables, or when performing multiple cascaded labeling tasks on the same input sequence. These models assume predefined structures and are therefore not flexible enough to adapt to many real-world datasets.
Subsequently, two integrated models, Semi-CRFs and HSCRFs, were proposed. Semi-Markov Conditional Random Fields (Semi-CRFs) (Zhu et al., 2007) are a tractable extension of CRFs which offers much of the power of higher-order models and allows features that measure properties of segments rather than of individual elements. The Hierarchical Semi-Markov Conditional Random Field (HSCRF), a generalization of embedded undirected Markov chains to model complex hierarchical, nested Markov processes, is described in Truyen et al. (2008). These 2 models put more emphasis on the semantic features for entity recognition, making good use of entity characteristics instead of the single word. Moreover, the models are divided into different levels for entity recognition. Their shortcoming is that they are more suitable for recognition of a single record and its attributes.
In a word, the motivation for our improved method comes from the following two aspects. On the one hand, the available NER models require too many training examples in sample training, which is why we improve the traditional probabilistic statistical model to enhance the efficiency of training. On the other hand, for the complex physical structure of NEs, we optimize the model to improve the accuracy of NER. We propose a novel unified framework, DSTCRFs, for PSN identification. Under this framework, we first adopt hidden semi-Markov models to identify entities with their word feature table. Then, based on the labels, the DTCRFs is used to recognize nested PSNs for the target entity type.
METHODOLOGY
The feature definition of NE: As NEs are usually emphasized with certain features in Web pages, we can take these characteristics into account for NER. We expand on the analysis of these characteristics in this section.
Structure features: In Web pages, the font of a PSN is often displayed in a large size and in red, different from that of other text. This display of the NE is employed to emphasize important information and to attract the user's attention.
As a result, CSS attributes are introduced to represent the structure features. However, different from traditional approaches, we use a simple Boolean model instead of the detailed values of the CSS attributes to express the features.
Content features: The characteristics of an NE also depend on its context features. Here is an example to illustrate. Generally, the PSN mostly appears in forms such as "recruit market manager". Therefore, we can adopt context features for the identification: if the preceding word is "recruit", the probability that the next word is a PSN will be very high.
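As a rough illustration of how both feature families above could be encoded, the following sketch maps a candidate token to Boolean structure features (derived from CSS attributes) and a context trigger feature. The font-size threshold, the trigger lexicon and the helper name extract_features are illustrative assumptions, not part of the original system.

# Illustrative sketch: Boolean CSS structure features plus a context trigger
# feature for a candidate token. Thresholds and word lists are assumed values.
LARGE_FONT_PX = 16                                  # assumed "large size" cutoff
TRIGGER_WORDS = {"recruit", "recruited", "hire"}    # assumed trigger lexicon

def extract_features(token, prev_token, css):
    """Return a dict of Boolean features for one candidate token.

    token, prev_token: surface strings; css: dict of CSS attributes of the
    element containing the token, e.g. {"font-size": 18, "color": "#ff0000"}.
    """
    return {
        # Structure features: Boolean view of CSS attributes
        "large_font": css.get("font-size", 0) >= LARGE_FONT_PX,
        "red_color": css.get("color", "").lower() in ("red", "#ff0000"),
        "bold": css.get("font-weight", "") in ("bold", "700"),
        # Content features: context trigger on the previous word
        "prev_is_trigger": prev_token.lower() in TRIGGER_WORDS,
        "contains_job_word": token.lower().endswith(("engineer", "manager", "teacher")),
    }

if __name__ == "__main__":
    print(extract_features("manager", "recruit",
                           {"font-size": 18, "color": "#FF0000"}))

In this sketch each feature is a simple Boolean flag, matching the Boolean model described above rather than the detailed CSS values.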
The DSTCRFs model for NER: We adopt DSTCRFs to identify various types of NE based on feature word tagging. The basic idea is that we build feature word tag sets according to the composition and characteristics of the various NEs respectively.
The HSCRFs model divides the Conditional Random Fields model into two layers, POS and NP, and each state embeds a semi-Markov model. However, that model cannot effectively deal with included nested NER, only with parallel nested NER, which are defined as follows:
Definition 2: For the external characteristics of a PSN, let fa1 and fb1 be the following trigger feature and the previous trigger feature. For the internal characteristics, fa2 is the feature word and the others are fb2. Then fa1 fa2 fa1 fb1 fb2 fb1 stands for the included nested NE and fa1 fa2 fb2 fb2 fb1 for the parallel nested NE.
Therefore, the HSCRFs cannot handle such complex NEs, i.e., included nested NEs. To address this issue, we propose the improved model DSTCRFs to resolve the nested NER problem. HSCRFs models are generally divided into three fixed layers to complete the identification task. The DSTCRFs model, however, is not a fixed model but a dynamically generated one: the levels of DSTCRFs depend on the component relationships of the NE, which is why we call it "dynamic."
DSTCRFs:
DSTCRFs definition 3: The joint model is a pair <yT, ySM>, where yT is the tree-structured hierarchical model DTCRFs and ySM is the label model of the HSMM; together they are known as DSTCRFs. A valid assignment <yT, ySM> must satisfy the condition that the two assignments match at the leaf variables.
Then, the joint probability distribution of our model
DSTCRFs has the following factorization form:
p(y_T, y_{SM} | x) = p(y_T | x) p(y_{SM} | x, y_T) = p(y_T | x) \prod_{0 \le i \le F,\, j=L} p(y_{SM}^i | x, y_T^i)     (2)

where y_T^j denotes the nodes of the j-th level, so j = L refers to the leaf nodes of the DTCRFs. The calculation of the different parts is as follows:

p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_{e \in \{E_{pc}, E_{cp}, E_{ns}, E_{ss}\},\, j} \lambda_j t_j(e, y_e, x) + \sum_{v \in V,\, k} \mu_k s_k(v, y_v, x) \Big)     (3)

Z(x) = \sum_{y} \exp\Big( \sum_{e \in \{E_{pc}, E_{cp}, E_{ns}, E_{ss}\},\, j} \lambda_j t_j(e, y_e, x) + \sum_{v \in V,\, k} \mu_k s_k(v, y_v, x) \Big)     (4)

where y_e and y_v are the label sequences denoting the sets of components of y associated with edge e and vertex v in the linear chain, respectively; t_j and s_k are feature functions; the parameters \lambda_j and \mu_k are the weights of the feature functions t_j and s_k, respectively, estimated from the training data; Z(x) is the normalization factor, also known as the partition function. In addition, E_{ns} represents the state transitions between brother nodes and E_{ss} those between the brothers' child nodes. Likewise, E_{pc} stands for the state transitions from parent node to child node and E_{cp} for those from child nodes to parent nodes.

For p(s_i | x, y_T^j), we introduce the GHSMM to compute its value. Formally, a GHSMM consists of the following five parts:

• N is the number of states in the model. These N states are S = {S_1, S_2, ..., S_N}; supposing the state at time t is y_t, we have y_t \in S.

• M is the number of observable symbols, V = {V_1, V_2, ..., V_M}, and the observable symbol at time t is x_t, where x_t \in V. Let G = {g_1, g_2, ...} denote a segmentation of x, where segment g_j = (x, st, ed) consists of a start position st and an end position ed; in this model, g is used instead of y.

• A is the hidden state transition probability distribution, A = (a_{ij})_{N \times N}, where

a_{ij} = P(y_{t+1} = S_j | y_t = S_i), 1 \le i, j \le N     (5)

• \pi is the initial state probability distribution, \pi = {\pi_1, \pi_2, ..., \pi_N}, where \pi_i = P(y_1 = S_i), 1 \le i \le N.

• B is the emission probability distribution:

B = {B^1, B^2, ..., B^Z}, B^s = (b_j(k_s))_{N \times M}, b_j(k_s) = P(x_{t,s} = V_k | y_t = S_j), 1 \le j \le N, 1 \le k \le M     (6)

Given the model \lambda, our objective is to find the state transition sequence Y which maximizes P(X, Y | \lambda):

P(X, Y | \lambda) = P(Y | \lambda) P(X | Y, \lambda) = \pi_{y_1} b_{y_1}(x_{1,s}) a_{y_1 y_2} b_{y_1 y_2}(x_{2,s}) \prod_{t=3}^{T} a_{y_{t-2} y_{t-1} y_t} b_{y_{t-1} y_t}(x_{t,s})     (7)

In the GHSMM, an observation symbol k is extended from its mere term attribute to a set of CSS attributes k_1, k_2, ..., k_z. We assume the attributes are independent of each other and consider a linear combination of these attributes, b_{ij}(k) = \sum_{s=1}^{Z} \alpha_s b_{ij}^s(k_s), where \alpha_s is the weight factor for the s-th attribute and \sum_{s=1}^{Z} \alpha_s = 1, 0 \le \alpha_s \le 1. So (7) now becomes:

P(X, Y | \lambda) = \pi_{y_1}(x_{1,s}) a_{y_1 y_2} \Big[ \sum_{s=1}^{Z} \alpha_s b_{y_1 y_2}^s(x_{2,s}) \Big] \prod_{t=3}^{T} \Big( a_{y_{t-2} y_{t-1} y_t} \Big[ \sum_{s=1}^{Z} \alpha_s b_{y_{t-1} y_t}^s(x_{t,s}) \Big] \Big)     (8)
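As a rough sketch of how the attribute-combined emission and the second-order scoring in (7)-(8) could be evaluated, the code below mixes per-attribute emission tables with weights alpha_s and accumulates the log score of a label sequence. The dictionary-based parameter tables, the default smoothing constant and the function names are hypothetical placeholders, not the paper's trained parameters.

import math

def combined_emission(b_tables, alphas, prev_state, state, obs_attrs):
    """b_ij(k) = sum_s alpha_s * b^s_ij(k_s): weighted mix of per-attribute
    emission tables, as in the linear combination preceding Eq. (8)."""
    return sum(alpha * table.get((prev_state, state, value), 1e-9)
               for alpha, table, value in zip(alphas, b_tables, obs_attrs))

def sequence_log_prob(states, observations, pi, a1, a2, b_tables, alphas):
    """Log score of a label sequence, loosely following Eq. (8).

    states: label sequence y_1..y_T; observations[t]: tuple of attribute
    values (term, CSS flags, ...) at position t; pi: initial distribution;
    a1[(i, j)]: first-order transition; a2[(i, j, k)]: second-order transition.
    """
    y = states
    logp = math.log(pi.get(y[0], 1e-9))
    if len(y) > 1:
        logp += math.log(a1.get((y[0], y[1]), 1e-9))
        logp += math.log(combined_emission(b_tables, alphas, y[0], y[1],
                                           observations[1]))
    for t in range(2, len(y)):
        logp += math.log(a2.get((y[t-2], y[t-1], y[t]), 1e-9))
        logp += math.log(combined_emission(b_tables, alphas, y[t-1], y[t],
                                           observations[t]))
    return logp

Each emission here looks up one table per CSS attribute and blends them with the weights alpha_s, which is the point of the extension from a single term attribute to the attribute set in (8).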
DSTCRFs training: Generally, for the data training of DSTCRFs, there are two inference problems to be solved for an unlabeled sequence x:

• Compute the parameter estimation p(y_T, y_{SM}^i | x_i, \Theta, \Psi) over all cliques.

• Decode y^* = \arg\max p(y_T, y_{SM} | x) with the Viterbi algorithm, which is used to decide the label sequence with the maximum probability for the sequence x.

The parameter estimation problem is to calculate the parameters from the training data D = (\langle y_T, y_{SM} \rangle_i, x_i). More specifically, we optimize the log-likelihood objective function with the conditional model p(y_T^i, y_{SM}^i | x_i, \Theta, \Psi):

L(\Theta, \Psi) = \sum_{0 \le i \le N} \log p(y_T^i, y_{SM}^i | x_i, \Theta, \Psi)     (9)

where \Theta = (\lambda_1, \lambda_2, ...; \mu_1, \mu_2, ...) is the parameter vector of the TCRFs and \Psi = (a_{ij}, ...; b_{ij}, ...) that of the GHSMM:

L(\Theta, \Psi) = \sum_{0 \le i \le N} \log p(y_T^i | x_i, \Theta, \Psi) + \sum_{0 \le i \le N} \sum_{j=L} \log p(y_{SM}^i | x_i, y_T^j, \Theta, \Psi)     (10)

If we assume that the training data consist of a set of data points D = (\langle y_T, y_{SM} \rangle_i, x_i), each of which has been generated independently and identically from the joint empirical distribution \tilde{p}(x, y), the log-likelihood of the training data to be maximized is:

L(\Theta) = \sum_{0 \le i \le N} \log p(y^i | x^i, \Theta) = \sum_{x,y} \tilde{p}(x, y) \Big[ \sum_{1 \le i \le N} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x) + \sum_{1 \le i \le N} \sum_{k} \lambda_k f_k(y_i, x) \Big] - \sum_{x} \tilde{p}(x) \log Z(x)     (11)

\frac{\partial L(\Theta)}{\partial \lambda_k} = \sum_{x,y} \tilde{p}(x, y) \sum_{i=1}^{n} f_k(y_{i-1}, y_i, x) - \sum_{x,y} \tilde{p}(x) p(y | x, \Theta) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x) = E_{\tilde{p}(x,y)}[f_k] - E_{p(y|x,\Theta)}[f_k]     (12)

When the state S_t transits to the state S_{t+1}, the state transition probability is related to the state at time t, but not to any earlier state. However, the emission probability is related not only to the current state but also to the state before it. So, in this study, we suppose that the state transition sequence is based on a second-order Markov chain; that is, when the state S_t transits to the state S_{t+1}, the state transition probability is related not only to the state at time t but also to the state at time t-1. In this case, (5) becomes:

a_{ijk} = P(y_{t+1} = S_k | y_t = S_j, y_{t-1} = S_i), 1 \le i, j, k \le N     (13)

where \sum_{k=1}^{N} a_{ijk} = 1 and a_{ijk} \ge 0. In the same way, (6) can be rewritten as:

b_{ij}(k_s) = P(x_{t,s} = V_k | y_t = S_j, y_{t-1} = S_i), 1 \le i, j \le N, 1 \le k \le M     (14)

The Viterbi decoding problem employs y^* = \arg\max p(y_T, y_{SM} | x) to calculate the label sequence with the maximum probability for the sequence x:

y^* = \arg\max p(y_T, y_{SM} | x) = \arg\max \{ p(y_T | x) \cdot p(y_{SM} | x, y_T) \} = \arg\max p(y_T | x) \cdot \arg\max p(y_{SM} | x, y_T) = P_{vt} \cdot P_{vsm}     (15)

Let P_{vt}(t, y_s) indicate the maximum score of a sequence of length t ending with state s. Then we have the following calculations:

P_{vt}(t, y_s) = \max \{ P_{vsm}(t-1, y_{s-1}) \cdot A_t(y_{s-1}, y_s | x) \}, \quad A_t(y_{s-1}, y_s | x) = \exp\Big( \sum_{e \in \{E_{pc}, E_{cp}, E_{ns}, E_{ss}\},\, l} \lambda_l t_l(y_{s-1}, y_s, x) + \sum_{v \in V,\, k} \mu_k s_k(y_s, x) \Big)     (16)

where A_t(y_{s-1}, y_s | x) represents the potential function between the state s-1 and the state s, used to record the optimal previous state of each state.

According to Bayes' law, P(y | x) = P(y, x)/P(x), so we have:

\log P(y | x) = \log P(y) + \log P(x | y)     (17)

P_{vsm} = \arg\max \log P(y | x) = \arg\max (\log P(y) + \log P(x | y))     (18)

Then we assume conditional probability independence, P(x | y) = \prod_{i=1}^{n} P(x_i | y_i). Applying it to (18), we have:

P_{vsm} = \arg\max P(y | x) = \arg\max \Big( \sum_{i=1}^{n} \log P(x_i | y_i) + \log P(y) \Big)     (19)
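A minimal sketch of the Viterbi recursion in (16) is given below. It assumes the potentials A_t(y_{s-1}, y_s | x) are supplied as a precomputed log-potential function; the function name and interface are illustrative, since the actual feature functions and weights come from the trained DSTCRFs.

def viterbi(num_positions, labels, log_potential):
    """Viterbi decoding in the spirit of Eq. (16): the best score for a prefix
    ending in a label is the max over predecessors of (previous best score +
    log potential). log_potential(t, prev_label, label) returns
    log A_t(prev_label, label | x); prev_label is None at t = 0.
    Returns the highest-scoring label sequence."""
    # score[label] = best log score of a prefix ending in `label`
    score = {lab: log_potential(0, None, lab) for lab in labels}
    back = []                                     # back-pointers per position
    for t in range(1, num_positions):
        new_score, pointers = {}, {}
        for lab in labels:
            best_prev = max(labels,
                            key=lambda p: score[p] + log_potential(t, p, lab))
            new_score[lab] = score[best_prev] + log_potential(t, best_prev, lab)
            pointers[lab] = best_prev
        score, back = new_score, back + [pointers]
    # Recover the best path by following back-pointers from the best end label
    last = max(labels, key=lambda lab: score[lab])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))

The same dynamic-programming pattern applies whether the potentials come from the tree part P_vt or the semi-Markov part P_vsm of the decomposition in (15).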
Table 1: The sample datasets
Category                   Zhaopin.com   Job086.com   Chinahr.com
Computers                  109           84           91
Biomedical                 86            52           107
Architecture               95            68           80
Environmental protection   87            91           84
Mechanization              78            83           74
Secretary                  89            67           72
Training number            436           329          371
Test number                108           116          137
Total number               544           445          508
Now, the mutual information between y and x is used instead of the conditional probability. We assume mutual information independence (Chen and Goodman, 1998):

MI(y, x) = \sum_{i=1}^{n} MI(y_i, x)     (20)

\log \frac{P(y, x)}{P(y) \cdot P(x)} = \sum_{i=1}^{n} \log \frac{P(y_i, x)}{P(y_i) \cdot P(x)}     (21)

Or, equivalently:

\log P(y | x) - \log P(y) = \sum_{i=1}^{n} \log P(y_i | x) - \sum_{i=1}^{n} \log P(y_i)     (22)

We can obtain (23) from (22) by assuming the above-mentioned mutual information independence:

\log P(y | x) = \log P(y) - \sum_{i=1}^{n} \log P(y_i) + \sum_{i=1}^{n} \log P(y_i | x)     (23)

So the final tagging result is as follows:

P_{vsm} = \arg\max \log P(y | x) = \arg\max \Big( \log P(y) - \sum_{i=1}^{n} \log P(y_i) + \sum_{i=1}^{n} \log P(y_i | x) \Big)     (24)

For the P(y) in (24), we introduce the n-gram language model, based on statistical probability in NLP, to calculate the probability of a sentence y = (y_1, y_2, ..., y_m) according to the chain rule:

P(y) = P(y_1) \prod_{i=2}^{m} P(y_i | y_1, y_2, ..., y_{i-1})     (25)

As a matter of fact, we cannot calculate P(y) directly by this formula owing to the data-sparseness problem. One viable solution is to suppose that each tag probability depends only on the previous N tags. There are mainly four models: the context-free unigram (N = 0), bigram (N = 1), trigram (N = 2) and fourgram (N = 3). The trigram model is the most widely used and we choose it in this study.

In order to compute \sum_{i=1}^{n} \log P(y_i | x), an improved back-off model is proposed, as follows:

P_{bo}(h | h', g) = P_{GT}(h | h', g) if C(h, h', g) > 0; otherwise P_{bo}(h | h', g) = \alpha(h', g) P_{bo}(h | h')     (26)

where P_{bo}(h | h') is the probability of the traditional trigram model, h = t_{i-n+1} ... t_{i-1}, h' = t_{i-n+2} ... t_{i-1}, and P_{GT}(h | h', g) is the probability of the bigram model smoothed by the Good-Turing method (Chen and Goodman, 1998).

Many of the possible sequences of words may not be collected in the training corpus; namely, if P_{bo}(h | h', g) = 0, the probability of the whole sentence is zero. Therefore, smoothing is used to address this problem:

P_{GT}(h | h', g) = \frac{C_{GT}(h, h', g)}{C(h', g)}     (27)

C_{GT}(h, h', g) = (C(h, h', g) + 1) \times \frac{N(C(h, h', g) + 1)}{N(C(h, h', g))}     (28)

N(C) is the number of the bigram items that occur C times in the training corpus and \alpha(h', g) is the back-off weight:

\beta(h', g) = 1 - \sum_{s: C(h', h, g) > 0} P_{GT}(h | h', g)     (29)

\alpha(h', g) = \frac{\beta(h', g)}{1 - \sum_{s: C(h', h, g) > 0} P(h | h')}     (30)
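The following sketch shows one way the back-off computation in (26)-(28) might be organized over n-gram count tables; the counts are toy placeholders and the back-off weight alpha is passed in directly rather than derived from (29)-(30).

from collections import Counter

def good_turing_count(c, count_of_counts):
    """C_GT = (C + 1) * N(C + 1) / N(C), as in Eq. (28)."""
    n_c, n_c1 = count_of_counts.get(c, 0), count_of_counts.get(c + 1, 0)
    if n_c == 0 or n_c1 == 0:
        return float(c)            # no adjustment possible, keep the raw count
    return (c + 1) * n_c1 / n_c

def p_backoff(tag, history, feature, trigram_counts, bigram_counts, alpha):
    """Back-off in the spirit of Eq. (26): use the Good-Turing smoothed
    estimate when the (history, tag, feature) count is positive, otherwise
    back off to a plain lower-order estimate weighted by alpha."""
    c_full = trigram_counts[(history, tag, feature)]
    c_hist = sum(v for k, v in trigram_counts.items()
                 if k[0] == history and k[2] == feature)
    if c_full > 0 and c_hist > 0:
        count_of_counts = Counter(trigram_counts.values())
        return good_turing_count(c_full, count_of_counts) / c_hist  # Eq. (27)
    c_bi = bigram_counts[(history[-1:], tag)]
    c_bi_hist = sum(v for k, v in bigram_counts.items() if k[0] == history[-1:])
    backoff = c_bi / c_bi_hist if c_bi_hist else 0.0
    return alpha * backoff

# Toy usage with hypothetical counts
trigram_counts = Counter({(("recruit",), "PSN", "large_font"): 3})
bigram_counts = Counter({(("recruit",), "PSN"): 5, (("recruit",), "O"): 2})
print(p_backoff("PSN", ("recruit",), "large_font",
                trigram_counts, bigram_counts, alpha=0.4))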
EXPERIMENTAL RESULTS
Data set: Based on the above discussion, we verify the NER based on the DSTCRFs model. We adopted test datasets from three Websites, covering six classes of Web pages: Computers, Biomedical, Architecture, Environmental Protection, Mechanization and Secretary. We randomly selected 1497 Web pages as datasets, 1136 as the training datasets and the remaining 361 for testing. The datasets are presented in Table 1.
When we compare the identification results for different NEs, we employ the Recall (R), Precision (P) and F measure, which combines recall and precision. They are defined as follows (Sun et al., 2002):
R = (number of correctly identified entities)/(number of all entities)
P = (number of correctly identified entities)/(number of entities identified by the system)
F = 2 × R × P / (R + P)

Table 2: The results for NER based on DSTCRFs (%)
Category                  Measure   Pos     Org     Loc     Date
Computer                  R         82.9    86.5    92.4    94.4
                          P         87.3    89.3    94.2    96.8
                          F         85.2    88.7    93.0    95.5
Biomedicine               R         81.1    81.2    93.7    96.9
                          P         85.6    87.2    95.7    97.6
                          F         83.4    84.1    94.8    97.4
Architecture              R         84.7    79.2    92.5    95.5
                          P         88.2    89.0    96.9    98.0
                          F         86.0    84.8    94.1    97.2
Environment protection    R         86.5    85.5    95.6    96.1
                          P         88.8    87.1    96.3    98.4
                          F         87.6    86.2    95.9    97.5
Mechanization             R         87.4    83.4    94.9    97.3
                          P         89.3    86.2    96.9    97.8
                          F         88.3    84.8    95.1    97.5
Secretary                 R         82.0    83.8    94.8    93.8
                          P         87.2    86.8    98.4    95.1
                          F         85.8    85.4    96.3    94.3

Table 3: The results of the four methods for the NER (%)
Method                    Measure   Pos     Org     Loc     Date
CRFs                      P         71.6    69.8    95.7    94.8
                          R         77.9    77.6    96.7    96.9
                          F         74.8    78.4    95.9    96.1
HCRFs                     P         76.7    76.4    96.5    96.7
                          R         81.4    79.5    97.3    97.8
                          F         79.2    79.9    96.9    97.2
HSCRFs                    P         79.3    80.5    97.2    96.9
                          R         82.0    81.6    98.1    98.4
                          F         81.7    80.9    97.8    97.3
DSTCRFs                   P         84.2    86.3    98.2    97.2
                          R         88.5    89.0    98.7    99.7
                          F         86.3    87.1    98.4    98.5

Fig. 1: The average of NER based on DSTCRFs (accuracy, %, against the number of datasets; curves for CRFs, HCRFs, HSCRFs and DSTCRFs)
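For completeness, here is a small sketch of how the evaluation measures defined above could be computed from gold and predicted entity sets; the (start, end, type) entity representation is an assumption made for illustration.

def precision_recall_f(gold_entities, predicted_entities):
    """Compute P, R and F = 2RP/(R+P) over sets of (start, end, type) tuples."""
    gold, pred = set(gold_entities), set(predicted_entities)
    correct = len(gold & pred)                  # correctly identified entities
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return precision, recall, f_measure

# Toy usage with hypothetical spans
gold = [(0, 3, "PSN"), (5, 7, "ORG"), (9, 10, "LOC")]
pred = [(0, 3, "PSN"), (5, 7, "LOC")]
print(precision_recall_f(gold, pred))   # (0.5, 0.333..., 0.4)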
The result of web IE: We adopt the improved DSTCRFs method to identify the Position (Pos), Organization (Org), Location (Loc) and Date respectively. The averages of the experimental results for the NER are shown in Table 2.
The results in Table 2 reveal that the DSTCRFs model achieves higher accuracies for Loc and Date. Their feature words are numerals or fixed nouns; thereby, the F measures are almost all over 94%. However, due to the complexity of Pos and Org, their identification accuracies are not as high as those of Loc and Date, but they have improved greatly. Because professional glossaries in Mechanization recruitment pages are more abundant than in other recruitment pages, the F measure of Pos there is as high as 88.3%. The F measure of Org in Computer recruitment pages is 88.7%, higher than in the other kinds of pages. In the Computer recruitment pages, most of the Org end with "Co." or "Ltd.", so the feature words are simple to identify.
For the different datasets, we performed an experiment to test the proposed model. The P, R and F measures for the NER based on the improved DSTCRFs, CRFs, HCRFs and HSCRFs are shown in Table 3.
From the results obtained in the experiment, it can be seen that the performance of DSTCRFs is excellent compared with the two mainstream algorithms, CRFs and HCRFs. For Pos and Org, the F measure of DSTCRFs is significantly higher than those of the previous three models, increased by 11, 7 and 5%, for Loc by 3, 2 and 1% and for Date by 2, 1 and 1%, respectively, which means that our model is more applicable to Web NEs, especially nested NEs. Because the Loc is a fixed noun in the feature table, the accuracy of its extraction is relatively high. Some of the Pos and Org are named in shortened form or are nested NEs, which leads to a relatively low F measure for their recognition.
Obviously, the HSCRFs model achieves a lower F measure than DSTCRFs but a higher one than the other methods. HSCRFs puts more emphasis on the entity characteristics instead of the single word. Moreover, the models are divided into different levels for the entity recognition. Hence, the F measure of the HSCRFs model is higher than those of the CRFs and HCRFs methods. However, it is not effective on Web pages which contain multiple records. Consequently, the DSTCRFs performs better than HSCRFs.
For different numbers of datasets, the average accuracy of the 4 methods for NER is indicated in Fig. 1.
The curves indicate that the accuracy of DSTCRFs is improved significantly compared to the other three methods. In particular, once the number of datasets exceeds 500, the identification accuracy based on DSTCRFs is more than 76.4%. The results show that the accuracy of NER improves markedly with the method proposed in this article.
After about 35% of the training sample documents, the F measure remains basically unchanged, so the number of training examples can be reduced in this study.
CONCLUSION
NER remains a difficult and challenging problem. A practical approach combining the DSTCRFs technique with multi-rules is proposed in this study, which takes the nested structure and content features of NEs into consideration for better recognition. The experimental results reveal that the novel approach can improve the NER accuracy significantly with only a small amount of training data. The results of this study could have a considerable impact on NER. It is demonstrated that the DSTCRFs method is adaptable to Chinese NER.
As we all know, current NER technologies are all based on machine learning. How to optimize the learning model well is therefore an urgent problem; in this way, we can reduce the state transitions and the total number of input symbols in the sequence to improve the operating efficiency of NER. This will be our future work.
ACKNOWLEDGMENT
This research project was supported by the National Science and Technology Pillar Program No. 2007BAH08B02.
REFERENCES
Chen, S.F. and J. Goodman, 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts, Technical Report TR-10-98.
Cohen, W.W. and S. Sarawagi, 2004. Exploiting
Dictionaries in Named Entity Extraction: Combining
Semi-Markov Extraction Processes and Data
Integration Methods. KDD’04, Seattle, Washington,
USA.
Chinchor, N., 1988. MUC-7 Named Entity Task
Definition (Version 3.5). The 7th Message
Understanding Conference, Fairfax, Virginia.
Cohn, T. and P. Blunsom, 2005. Semantic role labeling
with tree conditional random fields. Proceedings of
the Ninth Conference on Computational Natural
Language Learning, (CoNLL'2005), USA, pp: 169-172.
Eason, G., B. Noble and I.N. Sneddon, 1955. On certain
integrals of Lipschitz-Hankel type involving products
of Bessel functions. Phil. Trans. Roy. Soc. London,
247: 529-551.
Fu, G.H. and K.K. Luke, 2005. Chinese named entity
recognition using lexicalized HMMs. ACM SIGKDD
Explorat. Newslett., 7 (1): 19-25.
Fresko, M., B. Rosenfeld and R. Feldman, 2005. A
Hybrid Approach to NER by MEMM and Manual
Rules. CIKM’05, Bremen, Germany.
Gao, J.F., A. Wu, M. Li and C.N. Huang, 2006. Chinese
word segmentation and named entity recognition: A
pragmatic approach. Assoc. Computational
Linguistics, 31(4).
Kou, Y., 2008. Improving the accuracy of entity
identification through refinement. Proceedings of the
2008 EDBT Ph.D. Workshop, March 2008.
Lim, J.H., Y.S. Hwang, S.Y. Park and H.C. Rim, 2004.
Semantic Role Labeling using Maximum Entropy
Model. (CoNLL'2004), Retrieved from:
http://acl.ldc.upenn.edu/W/W04/W04-2419.pdf.
Li, W. and A. McCallum, 2003. Rapid development of
Hindi named entity recognition using conditional
random fields and feature induction. ACM Trans.
Asian Lang. Inf. Proc., 2003: 290-294.
Li, Y., K. Bontcheva and H. Cunningham, 2005. Using
uneven-margins svm and perceptron for information
extraction. Proceedings of the 9th Conference on
Computational Natural Language Learning,
(CoNLL'05), USA, pp: 72-79.
Lafferty, J., A. Mccallum and F. Pereira, 2001.
Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. ICML '01
Proceedings of the 18th International Conference on
Machine Learning, USA, pp: 282-289.
Peng, F.C. and A. McCallum, 2004. Accurate Information
Extraction from Research Papers using Conditional
Random Fields. (HLT-NAACL'2004), pp: 329-336.
Retrieved from: http://people.cs.umass.edu/~mccallum/papers/hlt2004.pdf.
Peng, F., F.F. Feng and A. McCallum, 2004. Chinese
segmentation and new word detection using
conditional random fields. Proceedings of the 20th
International Conference on Computational
Linguistics, COLING 2004.
Roark, B., M. Saraclar, M. Collins and M. Johnson, 2004.
Discriminative Language Modeling with Conditional
Random Fields and the Perceptron Algorithm. In
ACL 2004, Retrieved from: http://acl.ldc.upenn.edu/
P/P04/P04-1007.pdf.
Sarawagi, S. and W.W. Cohen, 2004. Semi-Markov
Conditional Random Fields for Information
Extraction. (NIPS'2004), Retrieved from:
http://www.cs.cmu.edu/~wcohen/postscript/semiCRF.pdf.
Sun J., J.F. Gao, L. Zhang, M. Zhou and C.N. Huang,
2002. Chinese named entity identification using
class-based language model. Presented at the 19th
International Conference on Computational
Linguistics.
Toutanova, K., A. Haghighi and C.D. Manning, 2005. Joint
learning improves semantic role labeling.
Proceedings of the 43rd Annual Meeting of the ACL,
pp: 589-596.
Truyen, T.T., D.Q. Phung, H.H. Bui and S. Venkatesh,
2008. Hierarchical semi-Markov conditional random
fields for recursive sequential data. Proceeding of
Twenty-Second Annual Conference on Neural
Information Processing Systems, Dec 2008,
Vancouver, Canada.
Uchimoto, K., Q. Ma, M. Murata, H. Ozaku and H.
Isahara, 2000. Named entity extraction based on a
maximum entropy model and transformation rules.
Proceedings of 33rd Annual Meeting of the
Association of the Computational Linguistics,
(ACL'2000).
Zhu, J., Z.Q. Nie, J.R. Wen, B. Zhang and W.Y. Ma,
2005. 2D Conditional Random Fields for Web
information extraction. Proceedings of the 22nd
International Conference on Machine Learning, pp:
1044-1051.
Zhu, J., Z.Q. Nie, J.R. Wen, B. Zhang and H.W. Hon,
2007. Webpage understanding: An integrated
approach. Proceedings of the 13th ACM SIGKDD
international conference on Knowledge discovery
and data mining, August 12-15, 2007, San Jose,
California, USA, pp: 903-912.
Zhu, J., Z.Q. Nie, B. Zhang and J.R. Wen, 2008. Dynamic
hierarchical markov random fields for integrated web
data extraction. J. Mach. Learn. Res., 9: 1583-1614.
Zou, J.Q., G.L. Chen and W.Z. Guo, 2005. Chinese web
page classification using noise-tolerant support
vector machines. Proceedings of NLP-KE'05, pp: 785-790.