The BIOCREATIVE Task in SEER

Outline
• Background for biomedical information extraction and BIOCREATIVE
• BIOCREATIVE NER Task
• Stanford-Edinburgh System
• Problems

Terms and Resources
Gene: An ordered sequence of nucleotides that encodes a product such as a protein.
Protein: Gene products; composed of chains of amino acids; have sophisticated structures; kinases, enzymes, etc. are types of proteins.
Nucleotide: Thousands of nucleotides link to form a DNA/RNA molecule.
Molecular Biology: Branch of biology studying all of the above.
MEDLINE: The primary research database of the biomedical community, from nursing to drugs to genetics.
Gene Databases: FlyBase, MGI (mouse), Saccharomyces Gen. Database (yeast).
Other Databases: Swiss-Prot (amino acid sequences of proteins), GenBank (nucleotide sequences of genes).

Biotechnology Information Explosion
David Landsman NCBI Presentation

NER in the Biomedical Domain
• Many types of entities can be studied in the biomedical domain (drug names, chemicals)
• Much research has focused on molecular biological entities, particularly genes and proteins

Gene Names
• Genes and gene products are constantly being discovered and new names invented
• Nomenclatures exist but vary from organism to organism
• Diverse:
  – ‘bride of frizzled disco’, ‘cheap date’, ‘broken heart’
  – ‘REP2’, ‘RFM’
• Ambiguous:
  – With other genes
  – Acronyms
  – With proteins, where genes and their products are often referred to by the same name.
    (1st gene in LocusLink is officially alpha-1-B glycoprotein)

Varying Tasks, Results and Evaluation Methods
F-Score | Evaluation Corpus | Publication
0.92/Gene | Corpus consisting of 750 sentences from FlyBase where each gene is referred to by its official name and each name is a single word; kept only sentences containing at least 2 gene mentions, where those mentions appear in the dictionary; all the articles concern Drosophila melanogaster | Proux et al. 1998
0.97/Protein | 30 abstracts on SH3 protein | Fukuda et al. 1998 (KeX)
0.92/Protein | SWISSPROT annotations on Transpath database | Hanisch et al. 2003
0.15/DNA, 0.72/Protein | 100 MEDLINE abstracts | Nobata et al. 1999
0.64/Protein | 99 MEDLINE abstracts | Eriksson et al. 2002 (Yapex)
0.76/Protein, 0.03/RNA | 100 MEDLINE abstracts | Collier et al. 2000
0.56 – 24 classes | GENIA corpus | Kazama et al. 2002
0.70/Protein Molecule | GENIA corpus | Yamamoto et al. 2003

BIOCREATIVE Motivations
• Seeking to be the MUC of the biomedical information extraction field

The BIOCREATIVE NER Task
• Given a single sentence from an abstract, identify all mentions of genes
• “(or proteins where there is ambiguity)”
• In November, the task was changed to identifying all mentions of genes and proteins (without distinguishing between them)

The BIOCREATIVE NER Data
Data consisted of MEDLINE abstracts annotated for the single NE “GENE”
Data Set | Sentences | Words | Genes
Training | 7500 | 200,000 | 9000
Development | 2500 | 70,000 | 3000
Evaluation | 5000 | 130,000 | 6000

The BIOCREATIVE NER Evaluation Method
• Only exact matches to the gold standard (which includes alternate correct boundaries for several cases) are counted as correct.
• Genes detected with incorrect boundaries are doubly penalized as false negatives and false positives.
    chloramphenicol acetyl transferase reporter gene (FN)
    transferase reporter gene (FP)

Outline
• Background for BIOCREATIVE and biomedical information extraction
• BIOCREATIVE NER Task
• Stanford-Edinburgh System
• Problems

Baseline System
• Maximum Entropy Tagger in Java
• Based on Klein et al. (2003) CoNLL submission
• Baseline Performance: Precision 0.79, Recall 0.74, F-Score 0.76
• Efforts were mostly in trying different features, including different POS taggers, NP-chunking, Parsing, Gazetteers, Web, Abbreviations, Word Shapes, Tokenization…

Feature Set
Word Features (all time words, e.g. Monday, April, are mapped to lower case): w_i; w_{i-1}; w_{i+1}; Last “real” word; Next “real” word; Any of the 4 previous words; Any of the 4 next words
Bigrams: w_i + w_{i-1}; w_i + w_{i+1}
TnT POS (trained on GENIA POS): POS_i; POS_{i-1}; POS_{i+1}
Character Substrings: Up to a length of 6
Abbreviations: abbr_i; abbr_{i-1} + abbr_i; abbr_i + abbr_{i+1}; abbr_{i-1} + abbr_i + abbr_{i+1}
Word + POS: w_i + POS_i; w_{i-1} + POS_i; w_{i+1} + POS_i
Word Shape: shape_i; shape_{i-1}; shape_{i+1}; shape_{i-1} + shape_i; shape_i + shape_{i+1}; shape_{i-1} + shape_i + shape_{i+1}
Word Shape + Word: w_{i-1} + shape_i; w_{i+1} + shape_i
Previous NE: NE_{i-1}; NE_{i-2} + NE_{i-1}; NE_{i-1} + w_i
Previous NE + POS: NE_{i-1} + POS_{i-1} + POS_i; NE_{i-2} + NE_{i-1} + POS_{i-2} + POS_{i-1} + POS_i
Previous NE + Word Shape: NE_{i-1} + shape_i; NE_{i-1} + shape_{i+1}; NE_{i-1} + shape_{i-1} + shape_i; NE_{i-2} + NE_{i-1} + shape_{i-2} + shape_{i-1} + shape_i
Parentheses: Paren-Matching – a feature that signals when one parenthesis in a pair has been assigned a different tag than the other in a window of 4 words

Features – External
Gazetteers: 1,731,581 entries; adapted from Locus Link, Gene Ontology and BIOCREATIVE data
ABGENE: A transformation-based NE tagger based on gazetteers and pattern matching
GENIA: Biomedical corpus using a different tag set consisting of 37 Named Entities
Web: Test initial tagger output by submitting it to the Web in patterns such as “X gene”

Postprocessing
• Discarded results with mismatched parentheses
• Different boundaries were detected when searching the sentence forwards versus
backwards
• Unioned the results of both; in cases where boundary disagreements meant that one detected gene was contained in the other, we kept the shorter gene

Final System and Results
• Trained on training+development data (10,000 sentences)
• 1,247,775 features
System | Precision | Recall | F-Score
Closed | 0.791 | 0.854 | 0.821
Open | 0.828 | 0.836 | 0.832
Preliminary Best-Closed | 0.855 | 0.854 | 0.825
Preliminary Best-Open | 0.863 | 0.836 | 0.832

Outline
• Background for BIOCREATIVE and biomedical information extraction
• BIOCREATIVE NER Task
• Stanford-Edinburgh System
• Problems

Performance Discrepancy
Klein et al. | Precision | Recall | F-Score
CoNLL-2003 | 86.1 | 86.5 | 86.3
BIOCREATIVE | 78.8 | 73.5 | 76.1

C&C | Precision | Recall | F-Score
CoNLL-2003 | 84.3 | 85.5 | 84.9
BIOCREATIVE | 77.6 | 75.9 | 76.8

Gene Entity Pitfalls
• Language is complex
    Stably transfected human kidney 293 cells expressing the wild type rat LH / CG receptor ( rLHR ) or receptors with C-terminal tails truncated at residues 653 , 631 , or 628 ( designated rLHR-t653 , rLHR-t631 , and rLHR-t628 ) were used to probe the importance of this region on the regulation of hormonal responsiveness .
• Gene names are frequently uncapitalized
    The chick axon-associated surface glycoprotein neurofascin is implicated in axonal growth and fasciculation as revealed by antibody perturbation experiments .
• “Looks weird” is not indicative
    A newly synthesized anti-inflammatory agent , Y-8004 demonstrated a greater inhibition than did indomethacin ( IM ) on inflammatory response such as ultraviolet erythema in guinea pigs , carrageenin edema , evans blue and carrageenin-induced pleuritis and acetic acid-induced peritonitis in rats .
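
Returning to the postprocessing step described earlier (discard mismatched parentheses, union the forward and backward results, keep the shorter gene when one contains the other), the rule can be sketched roughly as below. This is only an illustration under assumptions: the slides do not say how mentions are represented, so the sketch uses (start, end) token offsets with an exclusive end, and the names `balanced_parens`, `contains` and `merge_passes` as well as the example sentence are hypothetical.

```python
def balanced_parens(text):
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def contains(outer, inner):
    """True if span `outer` covers span `inner` and the two differ."""
    return outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]


def merge_passes(tokens, forward, backward):
    """Union the spans from two tagging passes (assumed representation:
    (start, end) token offsets, end exclusive)."""
    candidates = set(forward) | set(backward)
    # Discard mentions whose text has mismatched parentheses.
    candidates = {
        s for s in candidates
        if balanced_parens(" ".join(tokens[s[0]:s[1]]))
    }
    # Where one mention contains another, keep only the shorter one.
    kept = [
        s for s in candidates
        if not any(contains(s, other) for other in candidates)
    ]
    return sorted(kept)


# Made-up sentence and tagger outputs, for illustration only.
tokens = ["the", "estrogen", "receptor", "ligand", "activates", "rLHR", "(", "LH"]
forward = [(1, 4), (5, 6)]           # "estrogen receptor ligand", "rLHR"
backward = [(1, 3), (5, 6), (6, 8)]  # "estrogen receptor", "rLHR", "( LH"
print(merge_passes(tokens, forward, backward))
```

In this toy case the span "( LH" is dropped for its unmatched parenthesis, and "estrogen receptor ligand" is dropped in favour of the shorter "estrogen receptor".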

Boundary Problems
• Gene names can be long and complex
• 37% of our false positives and 39% of false negatives were boundary problems
• Gold: chloramphenicol acetyl transferase reporter gene
        chloramphenicol acetyl transferase reporter gene deletion
  Gold: estrogen receptor
        estrogen receptor ligand

Interannotator Agreement
• MUC-7 interannotator agreement was measured at 97 F-Score
• Demetriou and Gaizauskas: interannotator agreement for biomedical terms at 89% F-Score
• Hirschman measured interannotator agreement for gene names at 87% F-Score
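
To make the cost of boundary errors under the exact-match evaluation concrete, here is a minimal scoring sketch. It is not the official BIOCREATIVE scorer: it assumes mentions are (start, end) token offsets within one sentence, it ignores the alternate correct boundaries that the real gold standard allows, and the function names and offsets are hypothetical.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match_counts(gold, predicted):
    """Count true positives, false positives and false negatives,
    treating a prediction as correct only if it equals a gold span."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)  # includes spans with a wrong boundary
    fn = len(gold - predicted)  # the gold spans those predictions missed
    return tp, fp, fn


def score(gold, predicted):
    """Return (precision, recall, F-score) under exact matching."""
    tp, fp, fn = exact_match_counts(gold, predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical offsets: (3, 8) stands for "chloramphenicol acetyl
# transferase reporter gene", (5, 8) for "transferase reporter gene".
gold = {(3, 8), (12, 14)}
predicted = {(5, 8), (12, 14)}
print(exact_match_counts(gold, predicted))  # one boundary error -> 1 FP and 1 FN
print(score(gold, predicted))

# Sanity check against the baseline figures quoted in the slides.
print(round(f_score(0.79, 0.74), 2))
```

A single boundary slip lowers both precision and recall at once, which is why boundary problems weigh so heavily on the final F-score.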