A Chemical Compound and Drug Named Recognizer for BioCreative IV 1 *Ching-Yao Shu , Po-Ting Lai2, Chi-Yang Wu3, Hong-Jie Dai4 and Richard Tzong-Han Tsai5 1 Department of Computer Science & Engineering, Yuan Ze University, Taoyuan, Taiwan, R.O.C., 2Department of Computer Science, National Tsing-Hua University, HsinChu, Taiwan, R.O.C., 3Institute of Information Science, Acdemia Sinica, Taipei, Taiwan, R.O.C., 4Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan, R.O.C., 5Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan, R.O.C. *Corresponding author: E-mail: s66273070@gmail.com Abstract The function of chemical compound and drugs involved in biological processes and their relative effect on the onset and treatment of diseases have gained increasing interest with the advancement of life science researches. To extract knowledge of interest from massive amount of related literatures, the organizers of BioCreative IV hosted a chemical compound and drug name recognition task. This paper briefly introduces the approach of our CHEMDNER system with preliminary evaluation results on the provided development set. Introduction Studies regarding the effect of chemical and drugs under various circumstances have become an important task due to their impact on organismal growth and development. However, in contrast to the gene mention recognition and normalization task proposed previously, the recognition of chemical compound and drugs has yet to make progress with limited resources and available tools. To propel the research of CHEMical compound and Drug Name Entity Recognition (NER), the CHEMDNER of BioCreative IV was held in hope to improve the efficiency and accuracy of chemical and drug recognition, which can benefit both the academic and industrial area. Material and Methods We formulate chemical compound and drug named entity recognition as a traditional sequence labeling problem, and would like to examine the effect of the utilized features within the chemical domain. Following, we describe the features used in our system. Word features Word features are used in many machine learning (ML)-based NER systems. Take the sentence "Mercury induces the expression of cyclooxygenase-1 and inducible nitric oxide synthase" as an example. If the target word is "oxide", the following word "synthase" will help the conditional random fields (CRF) model distinguish the oxide synthase from oxide layer. In this task, we set Proceedings of the fourth BioCreative challenge evaluation workshop, vol. 2 the number of the preceding/following words as 2. Furthermore, we also apply bigram and trigram words feature as our parts of conjunction features. The affix of a word is a morpheme that is attached to a base morpheme, such as a root or a stem, to form a word. The type of an affix depends on its position relative to the root. Prefixes (attached before another morpheme) and suffixes (attached after another morpheme) are two of these types. Some prefixes and suffixes provide good clues for classifying named entities. For example, words that start with "1,1" are usually chemicals. Nevertheless, short prefixes or suffixes are too common to be of any help in classification. For instance, it would be difficult to guess to which category a word ending in "~es" belongs. For this task, we set the prefixes and suffixes as 3–5 characters. Orthographical features The orthographical features are the regular expressions for the named entities, and they are widely used in other general NER [1] or biomedical NER systems[2]. Empirically, it has been shown that these features can detect NE patterns. Take the pattern “NUMBER” as an example. The digits following a sequence of a comma and a number, e.g., '2' in the SYSTEMATIC 2,4dinitrophenyl, usually denote the position of the attached functional groups. Sometimes, named entities possess similar patterns, such as As(V) and DMA(V), and they belong to the same category. Hence, we use a simple process to normalize the words: 1) all capitalized characters are all replaced by 'A'; 2) non-capitalized characters are replaced by 'a'; 3) all digits are all replaced by '0'; 4) non-English characters are replaced by '_'. We also reduce consecutive strings of identical characters to a single character. For example, "Aaaaa_A" is normalized to "Aa_A". Consider two named entities "Na(2)CO(3)" and "As(2)O(3)". After applying the normalization process, both of these entities will be normalized to the same term "Aa_1_A_1_". External tool features Besides the word and orthographical features, we also use chunk and part-of-speech (POS) information, which are obtained by the GENIA tagger [3]. In chunking, also called shallow parsing, a sentence is divided into a series of chunks that include nouns, verbs, and prepositional phrases. Generally speaking, NEs are usually located in noun phrases. In most cases, either the left or right boundary of an NE is aligned with either edge of a noun phrase. For instance, in the noun phrase "the polyhedral oligomeric silsesquioxane", the chemical name "polyhedral oligomeric silsesquioxane" aligns with the right boundary of the noun phrase. NEs rarely exceed phrase boundaries. Part-of-speech (POS) information is useful for identifying named entities. Verbs and prepositions usually indicate an NE's boundary. Nouns not found in the dictionary are usually proper nouns, which are good candidates for named entities. Setting a context window length for POS features is similar to setting the window length for word features. We found that five is also a suitable window size for POS features. Preliminary Results A total of three different configurations are used for the CHEMDNER task. Configuration 1 consists of complete features with complete templates. Configuration 2 is generated by removing templates without the symbol “\” from configuration 1, and configuration 3 is created by removing sentence with “O”s from configuration 2. The performance of our system on the BioCreative IV - CHEMDNER track development set is shown in Table 2 and 3. Although the difference of the three configurations is insubstantial, configuration 3 unexpectedly achieved the highest performance of all. We speculate that the features and templates derived from previous gene mention recognition experiences may be insufficient, or even adverse for chemical and drug recognition. This indicates that the nature of gene and chemical named entity recognition are somewhat different, which requires further error analysis to substantiate our hypothesis. Table 1. Performance on the chemical document indexing development set Configuration Macro Macro Micro Micro F-score FAP-s. F-score. FAP-s. CDI_all_feature_all_template 69.94 60.23 70.74 59.86 CDI_all_feature_no_conj 70.02 60.32 70.86 59.93 CDI_remove_o_sent_no_conj 70.83 60.65 71.53 59.93 Table 2. Performance on the chemical entity mention recognition development set Configuration Macro Macro Micro Micro F-score FAP-s. F-score. FAP-s. CEM_all_feature_all_template 65.22 54.96 67.23 55.59 CEM_all_feature_no_conj 65.40 55.13 67.37 55.73 CEM_remove_o_sent_no_conj 66.69 56.38 68.70 56.80 References "# $#!%&'()*+,!-#!.//0123()*2,!4#!5)+6,!*+7!8#!92*+6,!:;*<37!3+/)/0!(31'6+)/)'+!/2('=62!1&*>>)?)3( 1'<@)+*/)'+,:!A(3>3+/37!*/!/23!B('1337)+6>!'?!/23!>3C3+/2!1'+?3(3+13!'+!;*/=(*&!&*+6=*63!&3*(+)+6!*/! 4D8E;--FD!GHHI!E!J'&=<3!K,!L7<'+/'+,!F*+*7*,!GHHI#! G# 9#!M='N'+6!*+7!O#!5)*+,!:LPA&'()+6!733A!Q+'R&3763!(3>'=(13>!)+!@)'<37)1*&!+*<3!(31'6+)/)'+,: A(3>3+/37!*/!/23!B('1337)+6>!'?!/23!.+/3(+*/)'+*&!5')+/!S'(Q>2'A!'+!;*/=(*&!D*+6=*63!B('13>>)+6!)+ T)'<37)1)+3!*+7!)/>!-AA&)1*/)'+>,!M3+3C*,!OR)/U3(&*+7,!GHHK#! I#!V#!8>=(='Q*!*+7!5#!)#!8>=W)),!:T)7)(31/)'+*&!)+?3(3+13!R)/2!/23!3*>)3>/E?)(>/!>/(*/360!?'(!/*66)+6!>3X=3+13 7*/*,:!A(3>3+/37!*/!/23!B('1337)+6>!'?!/23!1'+?3(3+13!'+!4=<*+!D*+6=*63!8312+'&'60!*+7!L<A)()1*&! Y3/2'7>!)+!;*/=(*&!D*+6=*63!B('13>>)+6,!J*+1'=C3(,!T()/)>2!F'&=<@)*,!F*+*7*,!GHHZ#!