A Chemical Compound and Drug Named Recognizer

advertisement
A Chemical Compound and Drug Named Recognizer for
BioCreative IV
1
*Ching-Yao Shu , Po-Ting Lai2, Chi-Yang Wu3, Hong-Jie Dai4 and Richard Tzong-Han Tsai5
1
Department of Computer Science & Engineering, Yuan Ze University, Taoyuan, Taiwan,
R.O.C., 2Department of Computer Science, National Tsing-Hua University, HsinChu, Taiwan,
R.O.C., 3Institute of Information Science, Acdemia Sinica, Taipei, Taiwan, R.O.C., 4Graduate
Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical
University, Taipei, Taiwan, R.O.C., 5Department of Computer Science and Information
Engineering, National Central University, Taoyuan, Taiwan, R.O.C.
*Corresponding author: E-mail: s66273070@gmail.com
Abstract
The function of chemical compound and drugs involved in biological processes and their relative
effect on the onset and treatment of diseases have gained increasing interest with the
advancement of life science researches. To extract knowledge of interest from massive amount
of related literatures, the organizers of BioCreative IV hosted a chemical compound and drug
name recognition task. This paper briefly introduces the approach of our CHEMDNER system
with preliminary evaluation results on the provided development set.
Introduction
Studies regarding the effect of chemical and drugs under various circumstances have become an
important task due to their impact on organismal growth and development. However, in contrast
to the gene mention recognition and normalization task proposed previously, the recognition of
chemical compound and drugs has yet to make progress with limited resources and available
tools. To propel the research of CHEMical compound and Drug Name Entity Recognition
(NER), the CHEMDNER of BioCreative IV was held in hope to improve the efficiency and
accuracy of chemical and drug recognition, which can benefit both the academic and industrial
area.
Material and Methods
We formulate chemical compound and drug named entity recognition as a traditional sequence
labeling problem, and would like to examine the effect of the utilized features within the
chemical domain. Following, we describe the features used in our system.
Word features
Word features are used in many machine learning (ML)-based NER systems. Take the sentence
"Mercury induces the expression of cyclooxygenase-1 and inducible nitric oxide synthase" as an
example. If the target word is "oxide", the following word "synthase" will help the conditional
random fields (CRF) model distinguish the oxide synthase from oxide layer. In this task, we set
Proceedings of the fourth BioCreative challenge evaluation workshop, vol. 2
the number of the preceding/following words as 2. Furthermore, we also apply bigram and
trigram words feature as our parts of conjunction features.
The affix of a word is a morpheme that is attached to a base morpheme, such as a root or a stem,
to form a word. The type of an affix depends on its position relative to the root. Prefixes
(attached before another morpheme) and suffixes (attached after another morpheme) are two of
these types. Some prefixes and suffixes provide good clues for classifying named entities. For
example, words that start with "1,1" are usually chemicals. Nevertheless, short prefixes or
suffixes are too common to be of any help in classification. For instance, it would be difficult to
guess to which category a word ending in "~es" belongs. For this task, we set the prefixes and
suffixes as 3–5 characters.
Orthographical features
The orthographical features are the regular expressions for the named entities, and they are
widely used in other general NER [1] or biomedical NER systems[2]. Empirically, it has been
shown that these features can detect NE patterns. Take the pattern “NUMBER” as an example.
The digits following a sequence of a comma and a number, e.g., '2' in the SYSTEMATIC 2,4dinitrophenyl, usually denote the position of the attached functional groups. Sometimes, named
entities possess similar patterns, such as As(V) and DMA(V), and they belong to the same
category. Hence, we use a simple process to normalize the words: 1) all capitalized characters
are all replaced by 'A'; 2) non-capitalized characters are replaced by 'a'; 3) all digits are all
replaced by '0'; 4) non-English characters are replaced by '_'. We also reduce consecutive strings
of identical characters to a single character. For example, "Aaaaa_A" is normalized to "Aa_A".
Consider two named entities "Na(2)CO(3)" and "As(2)O(3)". After applying the normalization
process, both of these entities will be normalized to the same term "Aa_1_A_1_".
External tool features
Besides the word and orthographical features, we also use chunk and part-of-speech (POS)
information, which are obtained by the GENIA tagger [3]. In chunking, also called shallow
parsing, a sentence is divided into a series of chunks that include nouns, verbs, and prepositional
phrases. Generally speaking, NEs are usually located in noun phrases. In most cases, either the
left or right boundary of an NE is aligned with either edge of a noun phrase. For instance, in the
noun phrase "the polyhedral oligomeric silsesquioxane", the chemical name "polyhedral
oligomeric silsesquioxane" aligns with the right boundary of the noun phrase. NEs rarely exceed
phrase boundaries. Part-of-speech (POS) information is useful for identifying named entities.
Verbs and prepositions usually indicate an NE's boundary. Nouns not found in the dictionary are
usually proper nouns, which are good candidates for named entities. Setting a context window
length for POS features is similar to setting the window length for word features. We found that
five is also a suitable window size for POS features.
Preliminary Results
A total of three different configurations are used for the CHEMDNER task. Configuration 1
consists of complete features with complete templates. Configuration 2 is generated by removing
templates without the symbol “\” from configuration 1, and configuration 3 is created by
removing sentence with “O”s from configuration 2. The performance of our system on the
BioCreative IV - CHEMDNER track
development set is shown in Table 2 and 3. Although the difference of the three configurations
is insubstantial, configuration 3 unexpectedly achieved the highest performance of all. We
speculate that the features and templates derived from previous gene mention recognition
experiences may be insufficient, or even adverse for chemical and drug recognition. This
indicates that the nature of gene and chemical named entity recognition are somewhat different,
which requires further error analysis to substantiate our hypothesis.
Table 1. Performance on the chemical document indexing development set
Configuration
Macro Macro Micro
Micro
F-score FAP-s. F-score. FAP-s.
CDI_all_feature_all_template 69.94
60.23
70.74
59.86
CDI_all_feature_no_conj
70.02
60.32
70.86
59.93
CDI_remove_o_sent_no_conj 70.83
60.65
71.53
59.93
Table 2. Performance on the chemical entity mention recognition development set
Configuration
Macro Macro Micro
Micro
F-score FAP-s. F-score. FAP-s.
CEM_all_feature_all_template 65.22
54.96
67.23
55.59
CEM_all_feature_no_conj
65.40
55.13
67.37
55.73
CEM_remove_o_sent_no_conj 66.69
56.38
68.70
56.80
References
"# $#!%&'()*+,!-#!.//0123()*2,!4#!5)+6,!*+7!8#!92*+6,!:;*<37!3+/)/0!(31'6+)/)'+!/2('=62!1&*>>)?)3(
1'<@)+*/)'+,:!A(3>3+/37!*/!/23!B('1337)+6>!'?!/23!>3C3+/2!1'+?3(3+13!'+!;*/=(*&!&*+6=*63!&3*(+)+6!*/!
4D8E;--FD!GHHI!E!J'&=<3!K,!L7<'+/'+,!F*+*7*,!GHHI#!
G# 9#!M='N'+6!*+7!O#!5)*+,!:LPA&'()+6!733A!Q+'R&3763!(3>'=(13>!)+!@)'<37)1*&!+*<3!(31'6+)/)'+,:
A(3>3+/37!*/!/23!B('1337)+6>!'?!/23!.+/3(+*/)'+*&!5')+/!S'(Q>2'A!'+!;*/=(*&!D*+6=*63!B('13>>)+6!)+
T)'<37)1)+3!*+7!)/>!-AA&)1*/)'+>,!M3+3C*,!OR)/U3(&*+7,!GHHK#!
I#!V#!8>=(='Q*!*+7!5#!)#!8>=W)),!:T)7)(31/)'+*&!)+?3(3+13!R)/2!/23!3*>)3>/E?)(>/!>/(*/360!?'(!/*66)+6!>3X=3+13
7*/*,:!A(3>3+/37!*/!/23!B('1337)+6>!'?!/23!1'+?3(3+13!'+!4=<*+!D*+6=*63!8312+'&'60!*+7!L<A)()1*&!
Y3/2'7>!)+!;*/=(*&!D*+6=*63!B('13>>)+6,!J*+1'=C3(,!T()/)>2!F'&=<@)*,!F*+*7*,!GHHZ#!
Download