The BIOCREATIVE Task in SEER

Outline
• Background for biomedical information
extraction and BIOCREATIVE
• BIOCREATIVE NER Task
• Stanford-Edinburgh System
• Problems
Terms and Resources
Gene
An ordered sequence of nucleotides that encodes a product
such as a protein.
Protein
Gene products, composed of chains of amino acids,
with sophisticated structures;
kinases, enzymes, etc. are types of proteins
Nucleotide
Thousands of nucleotides link to form a DNA/RNA molecule
Molecular Biology
Branch of biology studying all of the above
MEDLINE
The primary research database of the biomedical community,
from nursing to drugs to genetics
Gene Databases
FlyBase, MGI (mouse), Saccharomyces Genome Database (yeast)
Other Databases
Swiss-Prot (amino acid sequences of proteins)
GenBank (nucleotide sequences of genes)
Biotechnology Information
Explosion
David Landsman NCBI Presentation
NER in the Biomedical Domain
• Many types of entities can be studied in the
biomedical domain (drug names, chemicals)
• Much research has focused on molecular
biological entities, particularly genes and
proteins
Gene Names
• Genes and gene products are constantly being discovered and new
names invented
• Nomenclatures exist but vary from organism to organism
• Diverse:
– ‘bride of frizzled disco’, ‘cheap date’, ‘broken heart’
– ‘REP2’, ‘RFM’
• Ambiguous:
– With other genes
– Acronyms
– With proteins, where genes and their products are often referred to
by the same name. (1st gene in LocusLink is officially alpha-1-B glycoprotein)
Varying Tasks, Results and Evaluation Methods
F-Score | Evaluation Corpus | Publication
0.92/Gene | 750 sentences from FlyBase, all from articles concerning Drosophila melanogaster; each gene is referred to by its official, single-word name; only sentences containing at least 2 gene mentions, all appearing in the dictionary, were kept | Proux et al 1998
0.97/Protein | 30 abstracts on SH3 protein | Fukuda et al 1998 (KeX)
0.92/Protein | SWISSPROT annotations on Transpath database | Hanisch et al 2003
0.15/DNA, 0.72/Protein | 100 MEDLINE abstracts | Nobata et al 1999
0.64/Protein | 99 MEDLINE abstracts | Eriksson et al 2002 (Yapex)
0.76/Protein, 0.03/RNA | 100 MEDLINE abstracts | Collier et al 2000
0.56 (24 classes) | GENIA corpus | Kazama et al 2002
0.70/Protein Molecule | GENIA corpus | Yamamoto et al 2003
BIOCREATIVE Motivations
• Seeking to be the MUC of the biomedical
information extraction field
The BIOCREATIVE NER Task
• Given a single sentence from an abstract, to
identify all mentions of genes
• “(or proteins where there is ambiguity)”
• In November the task was changed to identifying
all mentions of genes and proteins (without
distinguishing between them)
The BIOCREATIVE NER Data
Data consisted of MEDLINE abstracts
annotated for the single NE “GENE”
Data Set | Sentences | Words | Genes
Training | 7500 | 200,000 | 9000
Development | 2500 | 70,000 | 3000
Evaluation | 5000 | 130,000 | 6000
The BIOCREATIVE NER Evaluation Method
• Only exact matches to the gold standard
(which includes alternate correct boundaries
for several cases) are counted as correct.
• Genes detected with incorrect boundaries
are doubly penalized as false negatives and
false positives.
chloramphenicol acetyl transferase reporter gene (FN)
transferase reporter gene (FP)
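The exact-match rule above can be sketched in a few lines of Python. This is only an illustration, not the official scorer: the function names, the (start, end) token-span representation, and the `alternates` dictionary for alternate correct boundaries are all assumptions made for the example.

```python
def score(gold, predicted, alternates=None):
    """Count exact-match TP/FP/FN for the gene mentions of one sentence.

    gold, predicted: sets of (start, end) token spans, end exclusive.
    alternates: hypothetical dict mapping a gold span to a set of other
    boundaries that the gold standard also accepts for that mention.
    """
    alternates = alternates or {}
    matched_gold = set()
    tp = fp = 0
    for span in predicted:
        hit = None
        for g in gold:
            if g in matched_gold:
                continue
            if span == g or span in alternates.get(g, ()):
                hit = g
                break
        if hit is None:
            fp += 1          # no exact (or alternate) match: false positive
        else:
            matched_gold.add(hit)
            tp += 1
    fn = len(gold) - len(matched_gold)   # gold mentions never matched
    return tp, fp, fn


def prf(tp, fp, fn):
    """Precision, recall and balanced F-score from the counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Tokens: chloramphenicol(0) acetyl(1) transferase(2) reporter(3) gene(4)
gold = {(0, 5)}        # chloramphenicol acetyl transferase reporter gene
predicted = {(2, 5)}   # transferase reporter gene
tp, fp, fn = score(gold, predicted)
```

With the slide's example, the single boundary error yields no true positive, one false positive and one false negative, which is the double penalty described above.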
Baseline System
• Maximum Entropy Tagger in Java
• Based on Klein et al (2003) CoNLL submission
• Baseline Performance:
Precision 0.79 Recall 0.74 F-Score 0.76
• Efforts were mostly in trying different features,
including different POS taggers, NP-chunking,
Parsing, Gazetteers, Web, Abbreviations, Word
Shapes, Tokenization…
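As a quick consistency check, assuming the F-Score here is the usual balanced harmonic mean of precision (P) and recall (R), the rounded baseline figures reproduce the reported value:

```latex
\[
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.79 \times 0.74}{0.79 + 0.74}
  = \frac{1.1692}{1.53}
  \approx 0.76
\]
```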
Feature Set
• Word Features: w_i; w_{i-1}; w_{i+1}; Last “real” word; Next “real” word; Any of the 4 previous words; Any of the 4 next words
(All time words, e.g. Monday, April, are mapped to lower case)
• Bigrams: w_i + w_{i-1}; w_i + w_{i+1}
• TnT POS (trained on GENIA POS): POS_i; POS_{i-1}; POS_{i+1}
• Character Substrings: Up to a length of 6
• Abbreviations: abbr_i; abbr_{i-1} + abbr_i; abbr_i + abbr_{i+1}; abbr_{i-1} + abbr_i + abbr_{i+1}
• Word + POS: w_i + POS_i; w_{i-1} + POS_i; w_{i+1} + POS_i
• Word Shape: shape_i; shape_{i-1}; shape_{i+1}; shape_{i-1} + shape_i; shape_i + shape_{i+1}; shape_{i-1} + shape_i + shape_{i+1}
• Word Shape + Word: w_{i-1} + shape_i; w_{i+1} + shape_i
• Previous NE: NE_{i-1}; NE_{i-2} + NE_{i-1}; NE_{i-1} + w_i
• Previous NE + POS: NE_{i-1} + POS_{i-1} + POS_i; NE_{i-2} + NE_{i-1} + POS_{i-2} + POS_{i-1} + POS_i
• Previous NE + Word Shape: NE_{i-1} + shape_i; NE_{i-1} + shape_{i+1}; NE_{i-1} + shape_{i-1} + shape_i; NE_{i-2} + NE_{i-1} + shape_{i-2} + shape_{i-1} + shape_i
• Parentheses: Paren-Matching – a feature that signals when one parenthesis in a pair has been assigned a different tag than the other in a window of 4 words
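To make the feature templates above concrete, here is a minimal Python sketch that generates a subset of them for one token position. It is not the system's Java code: the feature string format, the `<PAD>` boundary symbol, the example POS tags, and in particular the `word_shape` mapping are assumptions chosen for illustration.

```python
import re


def word_shape(token):
    """Map a token to a coarse shape, e.g. 'rLHR-t653' -> 'xX-xd'.

    This particular mapping is an assumption, not the system's definition.
    """
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs of the same symbol so long tokens share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def features(words, pos, i, prev_ne, prev_prev_ne):
    """Generate string features for position i (a subset of the templates)."""
    def at(seq, j):
        return seq[j] if 0 <= j < len(seq) else "<PAD>"

    shapes = [word_shape(t) for t in words]
    w, w_prev, w_next = at(words, i), at(words, i - 1), at(words, i + 1)
    s, s_prev, s_next = at(shapes, i), at(shapes, i - 1), at(shapes, i + 1)
    p = at(pos, i)
    feats = [
        f"w={w}", f"w-1={w_prev}", f"w+1={w_next}",             # word features
        f"w+w-1={w}|{w_prev}", f"w+w+1={w}|{w_next}",           # bigrams
        f"pos={p}", f"w+pos={w}|{p}",                           # POS, word + POS
        f"shape={s}", f"shape-1+shape={s_prev}|{s}",            # word shape
        f"shape+shape+1={s}|{s_next}",
        f"ne-1={prev_ne}", f"ne-2+ne-1={prev_prev_ne}|{prev_ne}",  # previous NE
        f"ne-1+w={prev_ne}|{w}", f"ne-1+shape={prev_ne}|{s}",
    ]
    # Character substrings of the current word, up to a length of 6.
    for n in range(1, 7):
        for k in range(len(w) - n + 1):
            feats.append(f"sub={w[k:k + n]}")
    # Any of the 4 previous / 4 next words.
    for j in range(max(0, i - 4), i):
        feats.append(f"prev4={words[j]}")
    for j in range(i + 1, min(len(words), i + 5)):
        feats.append(f"next4={words[j]}")
    return feats


words = ["the", "rat", "LH", "/", "CG", "receptor"]
pos = ["DT", "NN", "NN", "SYM", "NN", "NN"]   # illustrative tags only
feats = features(words, pos, 2, "O", "O")
```

Each feature is a plain string, so the list can be handed to any classifier that accepts sparse binary features; conjunctions such as w_i + w_{i-1} are simply encoded as one joined string.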
Features – External
• Gazetteers: 1,731,581 entries, adapted from LocusLink, Gene Ontology and BIOCREATIVE data
• ABGENE: A transformation-based NE tagger based on gazetteers and pattern matching
• GENIA: Biomedical corpus using a different tag set consisting of 37 Named Entities
• Web Test: Initial tagger output submitted to the Web in patterns such as “X gene”
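The slides do not say how gazetteer entries were turned into features. One common approach, shown here purely as an assumed illustration, is a greedy longest-match lookup that flags every token covered by a gazetteer entry; the function name, the lowercasing, and the `max_len` cutoff are all hypothetical.

```python
def gazetteer_flags(tokens, gazetteer, max_len=5):
    """Return one boolean per token: True if the token lies inside a
    gazetteer entry found by greedy longest-match lookup.

    gazetteer: set of lowercased, space-joined entries (an assumed format).
    """
    flags = [False] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for i in range(len(tokens)):
        # Try the longest candidate starting at i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in gazetteer:
                for j in range(i, i + n):
                    flags[j] = True
                break
    return flags


gazetteer = {"estrogen receptor", "neurofascin"}
flags = gazetteer_flags("the estrogen receptor ligand".split(), gazetteer)
```

The resulting flags could then be added to a token's feature list alongside the word and shape features.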
Postprocessing
• Discarded results with mismatched parentheses
• Different boundaries were detected when
searching the sentence forwards versus backwards
• Unioned the results of both; in cases where
boundary disagreements meant that one detected
gene was contained in the other, we kept the
shorter gene
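The postprocessing steps above can be sketched as follows. This is a minimal illustration under assumptions: mentions are (start, end) token spans with an exclusive end, the parenthesis check is applied before the containment rule, and overlapping mentions that are not nested are all kept. None of these details are stated on the slide.

```python
def balanced(text):
    """True if every parenthesis in text has a matching partner."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def postprocess(tokens, forward_spans, backward_spans):
    """Union forward and backward results, drop mentions with mismatched
    parentheses, and keep the shorter mention when one contains another."""
    candidates = set(forward_spans) | set(backward_spans)
    candidates = {
        s for s in candidates if balanced(" ".join(tokens[s[0]:s[1]]))
    }
    kept = set()
    for s in candidates:
        contains_other = any(
            o != s and s[0] <= o[0] and o[1] <= s[1] for o in candidates
        )
        if not contains_other:
            kept.add(s)
    return sorted(kept)


tokens = "the LH / CG receptor ( rLHR ) gene".split()
forward = [(1, 5), (6, 7)]    # "LH / CG receptor", "rLHR"
backward = [(3, 5), (5, 7)]   # "CG receptor", "( rLHR"
result = postprocess(tokens, forward, backward)
```

Here "( rLHR" is discarded for its unmatched parenthesis, and "LH / CG receptor" is dropped in favour of the shorter "CG receptor" that it contains.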
Final System and Results
• Trained on training+development data
(10,000 sentences)
• 1,247,775 features
System | Precision | Recall | F-Score
Closed | 0.791 | 0.854 | 0.821
Open | 0.828 | 0.836 | 0.832
Preliminary Best-Closed | 0.855 | 0.854 | 0.825
Preliminary Best-Open | 0.863 | 0.836 | 0.832
Performance Discrepancy
Klein et al
Data | Precision | Recall | F-Score
CoNLL-2003 | 86.1 | 86.5 | 86.3
BIOCREATIVE | 78.8 | 73.5 | 76.1
C&C
Data | Precision | Recall | F-Score
CoNLL-2003 | 84.3 | 85.5 | 84.9
BIOCREATIVE | 77.6 | 75.9 | 76.8
Gene Entity Pitfalls
• Language is complex
Stably transfected human kidney 293 cells expressing the wild type rat LH /
CG receptor ( rLHR ) or receptors with C-terminal tails truncated at residues
653 , 631 , or 628 (designated rLHR-t653 , rLHR-t631 , and rLHR-t628 )
were used to probe the importance of this region on the regulation of hormonal
responsiveness.
• Gene names are frequently uncapitalized
The chick axon-associated surface glycoprotein neurofascin is implicated in
axonal growth and fasciculation as revealed by antibody perturbation
experiments .
• An odd-looking token is not necessarily a gene name
A newly synthesized anti-inflammatory agent , Y-8004 demonstrated a
greater inhibition than did indomethacin ( IM ) . on inflammatory
response such as ultraviolet erythema in guinea pigs , carrageenin
edema , evans blue and carrageenin-induced pleuritis and acetic acid-induced peritonitis in rats .
Boundary Problems
• Gene names can be long and complex
• 37% of our false positives and 39% of false
negatives were boundary problems
• Gold: chloramphenicol acetyl transferase reporter gene
chloramphenicol acetyl transferase reporter gene deletion
Gold: estrogen receptor
estrogen receptor ligand
Interannotator Agreement
• MUC-7 interannotator agreement was
measured at 97% F-Score
• Demetriou and Gaizauskas:
Interannotator agreement for biomedical terms
at 89% F-Score
• Hirschman measured interannotator agreement
for gene names at 87% F-Score