Text Mining for Biomedicine:
Techniques & tools
Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki,
Yoshimasa Tsuruoka
School of Computer Science
National Centre for Text Mining
www.nactem.ac.uk
Sophia.Ananiadou@manchester.ac.uk
Outline
• Challenges / objectives of TM in biomedicine
• Terminology processing
– Term extraction, term variation, named entity recognition
• Resources for TM in biomedicine
• Document classification
• Information Extraction approaches
• Levels of Text Mining Processing
• Biomedical text mining services and systems @ NaCTeM
– TerMine, AcroMine, Smart dictionary look up, Phenetica
– Medie, InfoPubMed, KLEIO
2
Material
• Further background on TM for Biology
Ananiadou, S. & McNaught, J. (eds) (2006) Text
Mining for Biology and Biomedicine. Boston, MA:
Artech House
• Numerous papers on line from bibliography
• See BLIMP http://blimp.cs.queensu.ca/
– Biomedical Literature (and text) mining publications
3
Text Mining in biomedicine
• Why biomedicine?
– Consider just MEDLINE: 16,000,000 references,
40,000 added per month
– Dynamic nature of the domain: new terms (genes,
proteins, chemical compounds, drugs) constantly
created
– Impossible to manage such an information overload
4
From Text to Knowledge:
tackling the data deluge through text mining
Unstructured Text
(implicit knowledge)
Structured content
(explicit knowledge)
Information deluge
• Bio-databases, controlled vocabularies and bio-ontologies encode only a small fraction of
the information
• Linking text to databases and ontologies
– Curators struggling to process scientific literature
– Discovery of facts and events crucial for gaining
insights in biosciences: need for text mining
6
Medline searches over time
[Figure: x-axis: month/year, Jan-97 to Oct-05; y-axis: searches (millions), 0 to 90]
7
The solution: The UK National
Centre for Text Mining
www.nactem.ac.uk
• Location: Manchester Interdisciplinary
Biocentre (MIB) www.mib.ac.uk
• First publicly funded text mining centre in the
world
• Focus: biology, medicine, social sciences…
8
We don’t just press a button…
• TM involves
– Many components (converters, analysers, miners,
visualisers, ...)
– Many resources (grammars, ontologies, lexicons,
terminologies, thesauri, CVs)
– Many combinations of components and resources
for different applications
– Many different user requirements and scenarios,
training needs
• The best solutions are customised
9
People behind NaCTeM
• Text Mining Team: 14 members
• Close collaboration with University of Tokyo,
Tsujii Lab http://www-tsujii.is.s.u-tokyo.ac.jp/
10
What NaCTeM is building:
• Resources: ontologies, lexicons, terminologies,
thesauri, grammars, annotated corpora
– BOOTStrep project http://www.nactem.ac.uk/bootstrep.php
• Tools: tokenisers, taggers, chunkers, parsers, NE
recognisers, semantic analysers
• NaCTeM is also providing services
• Our related bio-text mining projects
– REFINE http://dbkgroup.org/refine/
– Representing Evidence For Interacting Network Elements
– ONDEX (data integration, workflows, text mining)
11
Individual tools for user data
• Splitters, taggers, chunkers, parsers, NER, term
extractors
• Modes of use
– Demonstrators: for small-scale online use
– Batch mode: upload data, get email with link to download site when job done
– Web Services
– Integration into Workflows (Taverna)
• Some services are compositions of tools
12
Aims
• Text mining: discover & extract unstructured
knowledge hidden in text
– Hearst (1999)
• Text mining helps construct hypotheses from
associations derived from text
– protein-protein interactions
– associations between genes and phenotypes
– functional relationships among genes
13
Impact of text mining
• Extraction of named entities (genes, proteins,
metabolites, etc)
• Discovery of concepts allows semantic
annotation of documents
– Improves information access by going beyond index
terms, enabling semantic querying
• Construction of concept networks from text
– Allows clustering, classification of documents
– Visualisation of concept maps
14
Impact of TM
• Extraction of relationships (events and facts) for
knowledge discovery
– Information extraction, more sophisticated annotation
of texts (event annotation)
– Beyond named entities: facts, events
– Enables even more advanced semantic querying
15
Hypothesis generation from literature
• Swanson experiments (1986) influenced conceptual
biology
– rapid ‘mining’ of candidate hypotheses from the literature
– migraine and magnesium deficiency (Swanson, 1988)
– indomethacin and Alzheimer’s disease (Swanson and
Smalheiser 1994),
– Curcuma longa and retinal diseases, Crohn's disease and
disorders related to the spinal cord (Srinivasan and Libbus
2004).
– thalidomide for treating a series of diseases such as acute pancreatitis and
chronic hepatitis C (Weeber et al. 2003).
16
Text mining steps
• Information Retrieval yields all relevant texts
– Gathers, selects, filters documents that may prove useful
– Finds what is known
• Information Extraction extracts facts & events of
interest to user
– Finds relevant concepts, facts about concepts
– Finds only what we are looking for
• Data Mining discovers unsuspected associations
– Combines & links facts and events
– Discovers new knowledge, finds new associations
17
From Text to Knowledge:
NLP and Knowledge Extraction
Text
Annotation Tools
Lexicons and
ontologies
Structured Knowledge
Knowledge
Extraction
Tools
18
Challenge: the resource bottleneck
• Lack of large-scale, richly annotated corpora
– Support training of ML algorithms
– Development of computational grammars
– Evaluation of text mining components
• Lack of knowledge resources: lexica,
terminologies, ontologies.
19
Annotation & Information Extraction
Biomedical Knowledge
Annotation
IE system
Biomedical Literature
• Semantic annotation simulates the ideal performance of an IE system.
– IE systems can be developed by reference to an annotated corpus.
– The performance of IE systems can be evaluated by comparison with
the annotated corpus.
(Kim & Tsujii, Text Mining Workshop, Manchester, 2006)
20
Text Annotation
• Task-oriented Annotation
– Application annotated text
– User system development
– Defined by specific tasks
    • Specific curation tasks in specific environments
    • Mapping of Protein names to database IDs in specific text types
    • Specific event types such as Protein-Protein Interaction
    • Disease-Gene Association of specific diseases
• Task-neutral Annotation
– GENIA Corpus [U-Tokyo, NaCTeM]
– Development of generic tools
– Interoperable Tools
– Defined by theories
    • Linguistics: Tokens, POS, Phrase Structure, Dependency Structure, Deep Syntax (PAS)
    • Biology: Named Entities of various semantic types, Events
    • Linguistics + Biology: Co-references
21
Annotation of GENIA corpus – Term&POS
– Part-of-speech annotation: 2,000 abstracts
– Term (entity) annotation: 2,000 + 400 abstracts
22
Text semantic annotation
• annotation of events and involved named
entities
– Example: “Regulation of Transcription events”
– BOOTStrep project
http://www.nactem.ac.uk/bootstrep.php
• two types of annotation level
– linguistic annotation levels
– biological annotation level, which marks
the biological knowledge contained in the text
• Linking text with biological knowledge
23
Events and variables
• Biological events can be centred on:
– verbs, e.g. activate,
– nouns with verb-like meanings (nominalised verbs), e.g.
transcription
• Different parts of sentence correspond to different types
of variables in the event e.g.
– What caused event
• The narL gene product activates the nitrate reductase operon
– What was affected by event
• Analysis of mutants …
– Where event took place
• These fusions were formed on plasmid cloning vectors
Verb Frame Example
– Verb: activate
– Agent characteristics: protein
– Theme characteristics: operon
“The narL gene product activates the nitrate reductase operon”
25
Semantic roles: role name, description, phrase type(s), clues, example
AGENT
– Description: drives or instigates event
– Phrase type(s): entity or event
– Clues: typically subject of verb; follows "by" in passives
– Example: The narL gene product activates the nitrate reductase operon
THEME
– Description: affected by or results from event
– Phrase type(s): entity or event
– Clues: typically object of verb; subject in passives
– Example: recA protein was induced by UV radiation
MANNER
– Description: method or way in which event is carried out
– Phrase type(s): event (process), adverb, direction, in vitro, in vivo etc
– Clues: by, through, via, using
– Example: cpxA gene increases the levels of csgA transcription by dephosphorylation of CpxR
INSTRUMENT
– Description: used to carry out event
– Phrase type(s): entity
– Clues: with, with the aid of, via, by, through, using
– Example: EnvZ functions through OmpR to control porin gene expression in Escherichia coli K-12
LOCATION
– Description: location of event
– Phrase type(s): entity
– Clues: in, on, near, etc
– Example: Phosphorylation of OmpR by the osmosensor EnvZ modulates expression of the ompF and ompC genes in Escherichia coli
SOURCE
– Description: start point of event
– Phrase type(s): entity
– Clues: from
– Example: A transducing lambda phage carrying glpD''lacZ, glpR, and malT was isolated from a strain harbouring a glpD''lacZ fusion
DESTINATION
– Description: end point of event
– Phrase type(s): entity
– Clues: to, into
– Example: Transcription of gntT is activated by binding of the cyclic AMP (cAMP)-cAMP receptor protein (CRP) complex to a CRP binding site
Example 1
– Agent: The narL gene product (protein)
– Verb: activates
– Theme (what is acted upon): the nitrate reductase operon (operon)
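The clue words listed for each role can be read as a simple lookup from a preposition to the roles it may signal. The sketch below is only an illustration of that idea: the clue lists are copied from the role descriptions, while the function name `candidate_roles` and the lookup itself are assumptions, not part of any NaCTeM tool. It also shows why clues alone are not enough, since "by" is shared by three roles.

```python
# Clue words copied from the role descriptions; the lookup is a hypothetical helper.
ROLE_CLUES = {
    "AGENT": ["by"],  # "by" marks the agent in passives
    "MANNER": ["by", "through", "via", "using"],
    "INSTRUMENT": ["with", "with the aid of", "via", "by", "through", "using"],
    "LOCATION": ["in", "on", "near"],
    "SOURCE": ["from"],
    "DESTINATION": ["to", "into"],
}


def candidate_roles(clue):
    """Return, in alphabetical order, the roles whose clue list contains `clue`."""
    clue = clue.lower().strip()
    return sorted(role for role, clues in ROLE_CLUES.items() if clue in clues)


print(candidate_roles("from"))  # unambiguous clue
print(candidate_roles("by"))    # ambiguous clue: several roles remain possible
```

An ambiguous clue such as "by" would still need the phrase type (entity or event) and the verb frame to pick a single role.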
28
Linguistically Annotated Corpora
• GENIA
– Domain
• MeSH terms: Human, Blood Cells, Transcription Factors
– Annotation: POS, named entity, parse tree
• Penn BioIE
– Domain
• the molecular genetics of oncology
• the inhibition of enzymes of the CYP450 class.
– Annotation: POS, named entity, parse tree
• Yapex
• GENETAG a corpus of 20K MEDLINE® sentences for gene/protein NER
29
The GENIA annotation
• Linguistic annotation: reveals linguistic structures behind the text
– Part-of-speech annotation: annotates the syntactic category of each word
– Syntactic tree annotation: annotates the syntactic structure of sentences
• Semantic annotation (ontology-driven): reveals knowledge pieces delivered by the text
– Term annotation: annotates domain-specific terms
– Event annotation: annotates events on biological entities
30
Annotation Tool
• WordFreak http://wordfreak.sourceforge.net/
• Java-based linguistic annotation tool developed at
University of Pennsylvania
• Extensible to new tasks and domains
• Customised visualisation and annotation specification
– Allows annotation process to be made as simple as possible
31
Resources
32
What about existing resources?
• Ontologies important for knowledge discovery
– They form the link between terms in texts and
biological databases
– Can be used to add meaning, semantic annotation of
texts
33
Link between text and ontologies
[Figure: text and ontological resources (UMLS, KEGG, GO, GENIA) connected by two arrows labelled "Adding new knowledge" and "Supporting semantics"]
34
Bridging the Gap– Integrating data, text and
knowledge
[Figure: databases, text, ontological resources (UMLS, GO, KEGG, GENIA) and mathematical models; link labels: "Semantic interpretation of data", "Adding new knowledge", "Supporting semantics", "Semantic interpretation of models in Systems Biology"]
Resources for Bio-Text Mining
• Lexical / terminological resources
– SPECIALIST lexicon, Metathesaurus (UMLS)
– Lists of terms / lexical entries (hierarchical relations)
• Ontological resources
– Metathesaurus, Semantic Network, GO, SNOMED
CT, etc
– Encode relations among entities
Bodenreider, O. “Lexical, Terminological, and Ontological Resources for
Biological Text Mining”, Chapter 3, Text Mining for Biology and Biomedicine,
pp.43-66
36
SPECIALIST lexicon
– UMLS specialist lexicon
http://SPECIALIST.nlm.nih.gov
• Each lexical entry contains morphological (e.g. cauterize,
cauterizes, cauterized, cauterizing), syntactic (e.g.
complementation patterns for verbs, nouns, adjectives),
orthographic information (e.g. esophagus – oesophagus)
• General language lexicon with many biomedical terms (over
180,000 records)
• Lexical programs include variation (spelling), base form,
inflection, acronyms
37
Lexicon record
{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg}
Kaposi’s sarcoma
Kaposi’s sarcomas
Kaposi’s sarcomata
Kaposi sarcoma
Kaposi sarcomas
Kaposi sarcomata
The SPECIALIST Lexicon and Lexical Tools
Allen C. Browne, Guy Divita, and Chris Lu PhD
2002 NLM Associates Presentation, 12/03/2002, Bethesda, MD
38
Normalisation (lexical tools)
Hodgkin Disease
HODGKIN DISEASE
Hodgkin’s Disease
Hodgkin’s disease
Disease, Hodgkin ...
disease hodgkin
normalise
39
Steps of Norm
Remove genitive
Hodgkin’s Diseases
Replace punctuation with spaces
Hodgkin Diseases
Remove stop words
Hodgkin Diseases
Lowercase
hodgkin diseases
Uninflect each word
hodgkin disease
Word order sort
disease hodgkin
Lexical tools of the UMLS
http://lexsrv3.nlm.nih.gov/SPECIALIST/index.html
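The six Norm steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the NLM lexical tools: the stop-word list is invented for the example, and `uninflect` is a crude plural stripper standing in for the lexicon-based uninflection that the real tool performs.

```python
import re

# Assumed stop-word list for illustration only; the NLM tool has its own list.
STOP_WORDS = {"of", "and", "with", "for", "the", "in", "to", "by", "on"}


def uninflect(word):
    """Crude stand-in for lexicon-based uninflection: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def norm(term):
    term = re.sub(r"['’]s\b", "", term)       # 1. remove genitive
    term = re.sub(r"[^\w\s]", " ", term)      # 2. replace punctuation with spaces
    words = [w for w in term.split() if w.lower() not in STOP_WORDS]  # 3. stop words
    words = [w.lower() for w in words]        # 4. lowercase
    words = [uninflect(w) for w in words]     # 5. uninflect each word
    return " ".join(sorted(words))            # 6. word order sort


for variant in ["Hodgkin’s Diseases", "HODGKIN DISEASE", "Disease, Hodgkin"]:
    print(variant, "->", norm(variant))
```

All three variants map to the same key, `disease hodgkin`, which is what makes the normalised form usable as a dictionary index.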
40
The Gene Ontology (GO)
•
Controlled vocabulary for the annotation of
gene products
http://www.geneontology.org/
19,468 terms. 95.3% with definitions
10391 biological_process
1681 cellular_component
7396 molecular_function
41
Gene Ontology
• GOA database (http://www.ebi.ac.uk/GOA/)
assigns gene products to the Gene Ontology
• GO terms follow certain conventions of creation,
have synonyms such as:
– ornithine cycle is an exact synonym of urea cycle
– cell division is a broad synonym of cytokinesis
– cytochrome bc1 complex is a related synonym of
ubiquinol-cytochrome-c reductase activity
42
GO terms, definitions and ontologies in OBO
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
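Because an OBO term stanza is a list of `tag: value` lines, it can be read with a few lines of code. The sketch below handles only the tags shown in this stanza and is not a full OBO parser; `parse_obo_term` is a hypothetical helper name.

```python
def parse_obo_term(stanza):
    """Parse one OBO stanza into a dict mapping each tag to a list of values."""
    term = {}
    for line in stanza.strip().splitlines():
        if not line.strip():
            continue
        tag, _, value = line.partition(":")     # split at the first colon only
        value = value.split(" ! ")[0].strip()   # drop a trailing "! comment"
        term.setdefault(tag.strip(), []).append(value)
    return term


stanza = """
id: GO:0000002
name: mitochondrial genome maintenance
namespace: biological_process
def: "The maintenance of the structure and integrity of the mitochondrial genome." [GOC:ai]
is_a: GO:0007005 ! mitochondrion organization and biogenesis
"""

term = parse_obo_term(stanza)
print(term["id"], term["name"], term["is_a"])
```

Values are kept as lists because tags such as `is_a` and `synonym` may occur more than once in a stanza.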
43
Metathesaurus
• organised by concept
– 5M names, 1M concepts, 16M relations
• built from 134 electronic versions of many
different thesauri, classifications, code sets, and
lists of controlled terms
• "source vocabularies"
• common representation
44
Are the existing knowledge resources
sufficient for TM?
No!
Why?
 Limited lexical & terminological coverage of biological
sub-domains
 Resources focused on human specialists
GO, UMLS, UniProt ontology concept names
frequently confused with terms
45
Naming conventions
3. Update and curation of resources
– FlyBase gene name coverage 31% (abstracts) to 84% (full texts)
4. Naming conventions and representation in heterogeneous resources
– Term formation guidelines from formal bodies, e.g. HUGO, IPI, not uniformly used
– Problems with integration of resources
– dystrophin used for 18 gene products
“Dystrophin (muscular dystrophy, Duchenne and Becker types), included DXS143, DXS164, DXS206, …” (HUGO)
46
Term variation
5. Terminological variation and complexity of names
– High correlation between degree of term variation and dynamic nature of biomedicine
– Variation occurs in controlled vocabularies and texts, but there is a discrepancy between the two
– Exact match methods fail to associate term occurrences in texts with databases
47
What’s in a name?
Terms, named entities in biology
48
What’s in a name?
• Breast cancer 1 (BRCA1)
• p53
• Ribosomal protein S27
• Heat shock protein 110
• Mitogen activated protein kinase 15
• Mitogen activated protein kinase kinase kinase 5
From K. Cohen, NAACL 2007
49
Worst gene names
• sema domain, seven thrombospondin repeats (type 1
and type 1-like), transmembrane domain (TM) and short
cytoplasmic domain, (semaphorin) 5A
• SEMA5A
• Tyrosine kinase with immunoglobulin and epidermal
growth factor homology domains
• tie
K. Cohen NAACL 2007
53
Term ambiguity
NF2 can denote:
– Neurofibromatosis 2 [disease]
– Neurofibromin 2 [protein]
– Neurofibromatosis 2 gene [gene]
O. Bodenreider, MIE 2005 tutorial
http://www.nactem.ac.uk/
54
Term ambiguity
– Gene terms may be also common English words
• BAD: a human gene encoding a BCL-2 family protein (cf. bad
news, bad prediction)
– Gene names are often used to denote gene products
(proteins)
• suppressor of sable is used ambiguously to refer to either
genes or proteins
– Existing resources lack information that can support
term disambiguation
– Difficult to establish equivalences between term forms
and concepts
55
Homologues
• Cyclin-dependent kinase inhibitor was first
introduced to represent a protein family (p27)
– But it is used interchangeably with p27 or p27kip1, as
the name of the individual protein and not as the
name of the protein family (Morgan 2003).
• NFKB2 denotes a family of 2
individual proteins with separate IDs in Swiss-Prot.
– These proteins are homologues belonging to different
species, Homo sapiens and chicken.
56
Terms
– Term: linguistic realisation of specialised concepts,
e.g. genes, proteins, diseases
– Terminology: collection of terms structured (hierarchy)
denoting relationships among concepts, part-whole,
is-a, specific, generic, etc.
– Terms link text and ontologies
– Mapping is not trivial (main challenge)
57
Term variation and ambiguity
[Figure: terms in TEXT (Term1, Term2, Term3) linked to concepts in an ONTOLOGY (Concept1, Concept2, Concept3); links are labelled "Term variation" and "Term ambiguity"]
58
Term mining steps
– Term recognition: Tp53
– Term classification: Gene
– Term mapping: Genome Database, IARC TP53 Mutation Database
59
Term recognition techniques
• ATR extracts terms (variants) from a collection of
documents
• Distinguishes terms vs non-terms
• In NER the steps of recognition and
classification are merged, a classified
terminological instance is a named entity
• The tasks of ATR and NER share techniques but
their ultimate goals are different
– ATR for resource building, lexica & ontologies
– NER first step of IE, text mining
60
Overview papers
1. S. Ananiadou & G. Nenadic (2006) Automatic Terminology Management in Biomedicine, Text Mining for Biology and Biomedicine, pp. 67-97.
2. M. Krauthammer & G. Nenadic (2004) Term identification in the biomedical literature, JBI 37 (2004) 512-526
3. J.C. Park & J. Kim (2006) Named Entity Recognition, Text Mining for Biology and Biomedicine, pp. 121-142
Detailed bibliography in Bio-Text Mining
1. BLIMP http://blimp.cs.queensu.ca/
2. http://www.ccs.neu.edu/home/futrelle/bionlp/
Book on BioText Mining
1. S. Ananiadou & J. McNaught (eds) (2006) Text Mining for Biology and Biomedicine, Artech House.
Other Bio-Text Mining tutorials
Kevin Cohen (NAACL 2007 tutorial) U. Colorado
61
Main ATR approaches
ATR
Dictionary based
Rule based
Machine learning
62
Dictionary NER (1)
• Use terminological resources to locate term
occurrences in text
– NCBI http://www.ncbi.nlm.nih.gov/
– EBI http://www.ebi.ac.uk/
– neologisms, variations, ambiguity problematic for
simple dictionary look-up
– Ambiguous words e.g. an, for, can …
– spelling variants, punctuation, word order variations
• estrogen / oestrogen
• NF kappa B / NF kB
63
Dictionary NER (2)
– Hirschman (2002) used FlyBase for gene name
recognition, results disappointing due to homonymy,
spelling variations
• Precision: 7% (abstracts), 2% (full papers)
• Recall: 31% (abstracts), 84% (full papers)
– Tuason (2004) reports term variation as the main cause
of mismatch
• bmp-4 / bmp4
• syt4 / syt iv
• integrin alpha 4 / alpha4 integrin
64
Dictionary NER (3)
– Tsuruoka & Tsujii (2003) suggest a
probabilistic generator of spelling variants,
edit distance operations (delete, substitute,
insert)
• Terms with ED ≤ 1 considered spelling
variants
• Used a dictionary of protein terms
– Support query expansion
– Augment dictionaries with variation
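The "ED ≤ 1" criterion can be sketched with a plain edit-distance check. This is not the probabilistic variant generator of Tsuruoka & Tsujii; it only illustrates the threshold itself, and the function names and the tiny dictionary are assumptions made for the example.

```python
def within_one_edit(a, b):
    """True if b differs from a by at most one insert, delete or substitute."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one substitution at position i
    return a[i:] == b[i + 1:]            # one extra character in b at position i


def spelling_variants(mention, dictionary):
    """Dictionary entries within edit distance 1 of the mention (case-insensitive)."""
    m = mention.lower()
    return [entry for entry in dictionary if within_one_edit(m, entry.lower())]


protein_terms = ["bmp-4", "syt iv", "oestrogen"]   # toy dictionary
print(spelling_variants("bmp4", protein_terms))
print(spelling_variants("estrogen", protein_terms))
print(spelling_variants("syt4", protein_terms))
```

The last call returns nothing: "syt4" and "syt iv" are more than one edit apart, which is one reason a learned, probabilistic model of variation is more useful than a fixed threshold.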
65
Rule NER (2)
Rule based
– 4-level morphology, neoclassical elements: Ananiadou (1994)
– EMPATHIE, PASTA: Gaizauskas (2000)
– PROPER: Fukuda (1998)
– Yapex: Franzen (2002)
66
Rule based (1)
• Use orthographic, morpho-syntactic features of
terms
– Rules that make use of internal term formation
patterns (tagging, morphological analysers) e.g.
affixes, combining forms
– Do not take into account contextual features
– Dictionaries of constituents e.g. affixes, neoclassical
forms included
• Portability to different domains?
67
Rule based (2)
• Ananiadou, S. (1994) recognised single-word terms
based on morphological analysis of term formation
patterns (internal term make up)
• based on analysis of neoclassical and hybrid elements
‘alphafetoprotein’ ‘immunoosmoelectrophoresis’
‘radioimmunoassay’
• some elements are used for creating terms
term → word + term_suffix
term → term + word_suffix
• neoclassical combining forms (electro- adeno-),
• prefixes (auto-, hypo-)
• suffixes ( -osis, -itis)
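A toy version of such affix-based clues is sketched below. It uses only the elements listed on this slide and is not the 4-level morphological analysis described by Ananiadou (1994); `term_clues` is a hypothetical helper.

```python
# Elements taken from the slide; a real system would use much larger inventories.
COMBINING_FORMS = ("electro", "adeno")
PREFIXES = ("auto", "hypo")
TERM_SUFFIXES = ("osis", "itis")


def term_clues(word):
    """List the term-formation clues found in a single word."""
    w = word.lower()
    clues = []
    if w.startswith(COMBINING_FORMS):
        clues.append("combining_form")
    if w.startswith(PREFIXES):
        clues.append("prefix")
    if w.endswith(TERM_SUFFIXES):
        clues.append("term_suffix")
    return clues


for w in ["electrophoresis", "arthritis", "autoimmune", "protein"]:
    print(w, term_clues(w))
```

A word with no clue, such as "protein", shows the limitation noted for rule-based approaches: coverage depends entirely on the inventory of elements.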
68
Rule-based (3)
• Fukuda (1998) used lexical, orthographic features for
protein name recognition e.g. upper case character,
numerals etc.
• PROPER: core and feature elements
– Core: meaning bearing elements
– Feature: function elements
– Example: SAP (core) kinase (feature)
– Core elements are extended to feature elements based on
concatenation rules (based on POS tags)
69
Rule-based (4)
• Gaizauskas (2000) CFG for protein name recognition
(PASTA, EMPATHIE)
• Based on morphological and lexical characteristics of
terms
• biochemical suffixes (-ase enzyme name)
• dictionary look-up (protein names, chemical compounds, etc)
• deduction of term grammar rules from Protein Data Bank
Protein -> protein_modifier, protein_head, numeral
70
Rule-based (5)
• Inspired by PROPER, Yapex uses Swiss-Prot to add
core term elements
http://www.sics.se/humle/projects/prothalt/yapex.cgi
• Hou (2003) used Yapex with context information (collocations)
appearing with protein names
• Rule-based approaches construct rules and patterns manually or
automatically
• Difficult to tune to different domains
71
Machine learning systems
• Learn features from training data for term
recognition and classification
• Most ML systems combine recognition and
classification
Challenges
– Feature selection and optimisation
– Availability of training data
– detection of term boundaries
72
Overview of ML-based NER
• Training phase: manually tagged texts
– Detecting features
– Learning model
– Output: learned model
• Testing phase: raw texts
– Tag annotator with model
– Output: tagged texts
73
ML (1)
• Nobata et al.(1999) used Decision Tree for NER
• Decision tree: one of the methods to classify a case
using training data
– Node: specifies some condition with a subtree
– Leaf: indicates a class
• Features:
– Part-of-speech information
– Orthographic information
– Term lists
74
Example of a decision tree
• Each node has one condition, for example:
– Is the current word in the Protein term list? (Yes / No)
– Does the previous word have figures? (Yes / No)
– What is the next word’s POS? (Noun, Verb, …)
• Each leaf has one class: Unknown, PROTEIN, DNA, RNA, …
75
ML (2)
• Collier (2000) used HMM, orthographic features
for term recognition
– HMM looks for most likely sequence of classes
corresponding to a word sequence e.g. interleukin-2
protein/DNA
– To find similarities between known words (training
set) and unknown words, use character features
Feature: examples
– DigitNumber: [2]protein, [3]DNA
– GreekLetter: [alpha]protein
– TwoCaps: [RelB]protein, [TAR]RNA
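A character-feature function of this kind might look like the sketch below. Only the three feature names come from the slide; the exact tests (for example, reading TwoCaps as "at least two capital letters"), the Greek letter list and the fallback value "Other" are assumptions made for illustration.

```python
# Assumed subset of Greek letter names; extend as needed.
GREEK_LETTERS = {"alpha", "beta", "gamma", "delta", "kappa"}


def char_feature(token):
    """Map a token to one coarse orthographic feature."""
    if token.isdigit():
        return "DigitNumber"
    if token.lower() in GREEK_LETTERS:
        return "GreekLetter"
    if sum(1 for ch in token if ch.isupper()) >= 2:
        return "TwoCaps"          # assumption: two or more capital letters
    return "Other"


for tok in ["2", "alpha", "RelB", "TAR", "protein"]:
    print(tok, char_feature(tok))
```

Such features let a model generalise from known words in the training set to unseen words that share the same surface shape.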
76
ML (2)
• Use of GENIA resources as training data
– Results depend on training data
• Morgan (2004) used FlyBase to automatically construct
a training corpus
– Pattern matching for gene name recognition produced a noisy
annotated corpus
– HMM was trained on that corpus for gene name
recognition
77
Support Vector Machines (1)
• Kazama trained multi-class SVMs on Genia
corpus
• Corpus annotated with B-I-O tags
– B tags denote words at the beginning of a term
– I tags denote words inside a term
– O tags denote words outside a term
– B-protein tag: word at the beginning of a protein name
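The B-I-O encoding can be illustrated by converting term spans into one tag per token. This is a generic sketch, not the preprocessing used by Kazama; the span format (token indices with an exclusive end) is an assumption.

```python
def bio_tags(tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, into B-I-O tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"            # first word of the term
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"            # remaining words inside the term
    return tags


tokens = "The narL gene product activates the nitrate reductase operon".split()
spans = [(1, 4, "protein")]                   # "narL gene product"
for token, tag in zip(tokens, bio_tags(tokens, spans)):
    print(f"{token}\t{tag}")
```

With this encoding, term recognition becomes a per-token classification problem, which is what allows a multi-class SVM to be applied.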
78
SVMs for NER (2)
• Yamamoto used a combination of features for
protein name recognition:
– Morphological, lexical, boundary, syntactic (head
noun), domain specific (if term exists in biomedical
database).
• Lee used different features for recognition and
classification.
• orthographic, prefix, suffix
• Contextual information
79
Hybrid approaches
• Combine rules, statistics, resources
Hybrid ATR / NER
ABGene (Tanabe & Wilbur)
ARBITER (Rindflesch)
C/NC-value (Frantzi & Ananiadou)
80
Hybrid (1)
• ABGene: protein and gene name tagger
– Combines ML, transformation rules, dictionaries with
statistics
– Protein tagger trained on MEDLINE abstracts by
adapting Brill’s tagger
– Transformation rules for recognition of gene, protein
names
– Used GO, LocusLink list of genes, proteins for false
negative tags
81
Hybrid (2)
– ARBITER (Access and Retrieve Binding
Terms) uses
• UMLS Metathesaurus and GenBank to
map NPs (binding terms)
• morphological features
• lexical information (head noun)
– EDGAR recognises gene, cell, drug names
using co-occurrences of cell, clone,
expression
82
Hybrid (3)
• C/NC value (Frantzi & Ananiadou, 1999)
• C-value
• Linguistic filters
• total frequency of occurrence of string in corpus
• frequency of string as part of longer candidate
terms (nested terms)
• number of these longer candidate terms
• length of string
– Output: automatically ranked terms (TerMine)
83
C-value
• The C-value measure extracts multi-word, nested
terms
[adenoid [cystic [basal [cell carcinoma]]]]
cystic basal cell carcinoma
ulcerated basal cell carcinoma
recurrent basal cell carcinoma
basal cell carcinoma
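The four ingredients of C-value combine as in the published definition (Frantzi & Ananiadou): for a candidate a with |a| words and frequency f(a), C-value(a) = log2|a| · f(a) if a is not nested, and log2|a| · (f(a) − (1/P(T_a)) · Σ f(b)) otherwise, where T_a is the set of longer candidates containing a and P(T_a) is their number. The sketch below follows that formula; it is not TerMine's code, it omits the linguistic filter, and the frequencies are made-up toy counts.

```python
import math


def contains(longer, shorter):
    """True if word list `shorter` occurs contiguously inside the longer list."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq):
    """freq maps each candidate term to its total corpus frequency."""
    words = {term: term.split() for term in freq}
    scores = {}
    for a in freq:
        longer = [b for b in freq if contains(words[b], words[a])]
        # Average frequency of the longer candidates that contain `a`.
        penalty = sum(freq[b] for b in longer) / len(longer) if longer else 0.0
        scores[a] = math.log2(len(words[a])) * (freq[a] - penalty)
    return scores


# Toy counts, invented for illustration only.
toy_freq = {"basal cell carcinoma": 10, "cystic basal cell carcinoma": 4}
for term, score in sorted(c_values(toy_freq).items(), key=lambda kv: -kv[1]):
    print(f"{score:6.2f}  {term}")
```

Note that log2|a| is zero for single-word candidates, which is consistent with the slide's point that C-value targets multi-word terms.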
84
Term variation
• variation recognition as part of ATR (Nenadic,
Ananiadou)
• recognise term forms and link them into
equivalence classes
• important if ATR is based on statistics
(e.g. frequency of occurrence)
– corpus-based measures are distributed across
different variants
– conflation of various surface representations of a
given term should improve ATR
85
Simple variation
• orthographic
– hyphens, slashes (amino acid and amino-acid)
– lower/upper cases (NF-KB and NF-kb)
– spelling variations (tumour and tumor)
– transliterations (oestrogen and estrogen)
• morphological
– inflectional phenomena (plural, possessives)
• lexical
– genuine synonyms (carcinoma and cancer)
86
Complex variation
• Structural
– Possessive usage of nouns using prepositions
(clones of human and human clones)
– Prepositional variants
(cell in blood, cell from blood)
– Term coordinations
(adrenal glands and gonads)
87
Coordinated term variants
• Structure is ambiguous
– Head coordination or term conjunction?
– Example: adrenal glands and gonads
– Head coordination: [adrenal [glands and gonads]]
– Term conjunction: [adrenal glands] and [gonads]
• Head or argument coordination?
(N|A)+ CC (N|A)* N+
• cell differentiation and proliferation
• chicken and mouse receptors
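The pattern (N|A)+ CC (N|A)* N+ can be applied directly to a sequence of part-of-speech tags. In the sketch below the tags are supplied by hand (no tagger is run), the tag names N, A and CC follow the pattern on the slide, and the single-character encoding is an implementation convenience.

```python
import re

# One character per tag: N = noun, A = adjective, C = coordinating conjunction.
COORDINATION = re.compile(r"[NA]+C[NA]*N+")
TAG_CODES = {"N": "N", "A": "A", "CC": "C"}


def is_coordination_candidate(pos_tags):
    """True if the whole tag sequence matches (N|A)+ CC (N|A)* N+."""
    code = "".join(TAG_CODES.get(tag, "x") for tag in pos_tags)
    return COORDINATION.fullmatch(code) is not None


print(is_coordination_candidate(["N", "N", "CC", "N"]))   # cell differentiation and proliferation
print(is_coordination_candidate(["N", "CC", "N", "N"]))   # chicken and mouse receptors
print(is_coordination_candidate(["N", "IN", "N"]))        # cell in blood
```

The pattern only finds candidates; deciding between head and argument coordination still requires further evidence, as the two examples show.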
88
TerMine: a term management system
Demo
89
http://www.nactem.ac.uk/software/termine/
Marrying IR and terminology
• IR engine plus TerMine
• Discover associated terms ranked according to
relevance
• Allow user to link term with IR for document
discovery
• NB compound terms
• NB technical terms, not classic index terms
• NB terms familiar to user, found in documents
91
http://www.nactem.ac.uk/software/ctermine/
Biomedical IE/IR Systems
• iHOP
– http://www.ihop-net.org/UniPub/iHOP/
• EBIMed
– http://www.ebi.ac.uk/Rebholz-srv/ebimed/index.jsp
• GoPubMed
– http://www.gopubmed.org/
• PubFinder
– http://www.glycosciences.de/tools/PubFinder
• Textpresso
– http://www.textpresso.org/
93
Acronyms
• Very productive type of term variation
• Acronym variation (synonymy)
– NF kappa B/ NF kB / nuclear factor kappa B
• Acronym ambiguity (polysemy) even in
controlled vocabularies
– GR: glucocorticoid receptor, glutathione reductase
94
Acronym recognition
• Schwartz, A. & Hearst, M. (2003) A simple algorithm for identifying
abbreviation definitions in biomedical text, PSB 2003, 8, 451-462
• Adar, E. (2004) SaRAD: a simple and robust abbreviation
dictionary, Bioinformatics, 20(4) 527-533
• Chang, J.T. & Schütze, H. (2006) Abbreviations in biomedical
text, Text Mining for Biology and Biomedicine, pp.99-119,
Artech
• Tsuruoka, Y., Ananiadou, S. & Tsujii, J. (2005) A Machine
learning approach to automatic acronym generation, ISMB,
BioLink SIG, 25-31
• Okazaki, N. & S.Ananiadou (2006) Acronym recognition based
on term identification, Bioinformatics
95
The importance of
acronym recognition
• Acronyms are among the most productive type of term
variation
– 64,242 new acronyms were introduced in 2004 [Chang and
Schütze 06]
• Acronyms are used more frequently than full terms
– 5,477 documents could be retrieved by using the acronym JNK
while only 3,773 documents could be retrieved by using its full
term, c-jun N-terminal kinase [Wren et al. 05]
• No rules or exact patterns for the creation of acronyms
from their full form
96
Recognition
• Extracting pairs of short and long forms
<acronym, long form>
– Distinguishing acronyms from parenthetical
expressions
– Search for parentheses in text; single or more words;
e.g. Ab (antibody)
– Limit context around ( ); limit number of words
according to number of letters in acronym
97
Recognition (heuristics)
– Heuristics: match letters of acronym with letters of
long form using rules, patterns
• letters from beginning of words
• combining forms, e.g. carboxyfluorescein diacetate (CFDA)
• Acronym normalisation to allow orthographic, structural and
lexical variations
• morphological information, positional info
• Penalise words in long form that do not match acronym
• Accidental matching, e.g. argininosuccinate synthetase (AS)
98
Letter matching
– Alignment: find all matches between letters of
acronyms and their long forms and calculate
likelihood (Chang & Schütze)
• Solves problem of acronyms containing letters not
occurring in LF
• Choose best alignment based on features, e.g.
position of letter etc.
• Finding the optimal weight for each feature is a challenge
http://abbreviation.stanford.edu/
99
Acronym Recognition
Okazaki, N., Ananiadou, S. (2006) Building an abbreviation
dictionary using a term recognition approach. Bioinformatics.
S.Ananiadou
NaCTeM
100
A simple algorithm –
Schwartz and Hearst (2003)
• Uses parenthetical expressions as a marker of a
short form
… long-form ‘(‘short-form ‘)’ …
• All letters and digits in a short form must appear
in the corresponding long form in the same order
– We used hidden markov model (HMM) to …
– Early repolarization (ER) is an enigma.
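The letter-matching rule can be sketched as a right-to-left scan, in the spirit of the published Schwartz and Hearst algorithm. This version is a simplified port: it omits the limit on how many words before the parenthesis are considered, and the function name `find_long_form` is an assumption.

```python
def find_long_form(short_form, text_before):
    """Return the shortest suffix of text_before that contains the letters and
    digits of short_form in order, or None if there is no such match.
    The first character of the short form must start a word."""
    si = len(short_form) - 1
    li = len(text_before) - 1
    while si >= 0:
        c = short_form[si].lower()
        if c.isalnum():                      # skip hyphens etc. in the short form
            while (li >= 0 and text_before[li].lower() != c) or (
                si == 0 and li > 0 and text_before[li - 1].isalnum()
            ):
                li -= 1                      # keep scanning leftwards
            if li < 0:
                return None                  # ran out of text: no match
            li -= 1
        si -= 1
    start = text_before.rfind(" ", 0, li + 1) + 1   # back up to the word start
    return text_before[start:]


print(find_long_form("HMM", "We used hidden markov model"))
print(find_long_form("ER", "Early repolarization"))
print(find_long_form("ADRB2", "beta 2 adrenergic receptor"))
```

The third call finds no long form, because the letters of ADRB2 do not occur in that order in the text before the parenthesis.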
101
Problems of letter-matching approach
• Highly dependent on the expressions in the target text
– o acquired immuno deficiency syndrome (AIDS)
– x acquired syndrome (AIDS)
– x a patient with human immunodeficiency syndrome (AIDS)
– ? magnetic resonance imaging unit (MRI)
– ! beta 2 adrenergic receptor (ADRB2)
– ! gamma interferon (IFN-GAMMA)
(These examples are obtained from actual MEDLINE abstracts)
• Naive with respect to term variations
102
AcroMine’s approach
• Extract a word or word sequence:
– Co-occurring frequently with an acronym (e.g., TTF-1)
• 1, factor 1, transcription factor 1, thyroid transcription
factor 1
– Does not co-occur with other surrounding words
• thyroid transcription factor 1
• Not necessarily based on letter-matching
– Note that this is a difficult case for the letter-matching algorithm
• Prune unlikely candidates
– Nested candidates: transcription factor 1
– Expansions: expression of thyroid transcription factor 1
– Insertions: thyroid specific transcription factor 1
103
Short-form mining
• Enumerate all short forms in a target text
– Using parentheses as a clue: … ‘(‘short-form ‘)’ …
– Validation rules for identifying acronyms [Schwartz and Hearst
03]
• It consists of at most two words
• Its length is between two and ten characters
• It contains at least one alphabetic letter
• The first character is alphanumeric
The contextual sentence
of HMM and ASR.
The present system consists of a hidden Markov model (HMM) based automatic
speech recognizer (ASR), with a keyword spotting system to capture the machine
sensitive words (registered in a dictionary) from the running utterances.
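The enumeration and validation rules above can be sketched in Python. This is a minimal illustration, not AcroMine's implementation: the names `is_valid_short_form` and `enumerate_short_forms` are hypothetical, and nested parentheses are not handled.

```python
import re


def is_valid_short_form(s):
    """Apply the four validation rules listed on the slide."""
    s = s.strip()
    return (
        len(s.split()) <= 2                 # at most two words
        and 2 <= len(s) <= 10               # two to ten characters
        and any(c.isalpha() for c in s)     # at least one alphabetic letter
        and s[0].isalnum()                  # first character is alphanumeric
    )


def enumerate_short_forms(text):
    """Collect parenthesised expressions that pass the validation rules."""
    found = []
    for m in re.finditer(r"\(([^()]+)\)", text):
        if is_valid_short_form(m.group(1)):
            found.append(m.group(1).strip())
    return found


sentence = (
    "The present system consists of a hidden Markov model (HMM) based "
    "automatic speech recognizer (ASR), with a keyword spotting system to "
    "capture the machine sensitive words (registered in a dictionary) from "
    "the running utterances."
)
print(enumerate_short_forms(sentence))
```

The parenthetical "registered in a dictionary" is rejected because it has more than two words.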
104
Enumerating long-form candidates for
an acronym
• Tokenize a contextual sentence by non-alphanumeric characters
(e.g., space, hyphen, etc.)
• Apply Porter’s stemming algorithm [Porter 80]
• Extract terms that match the following pattern
[:WORD:].*$
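Example: We studied the expression of thyroid transcription factor-1 (TTF-1).
Candidates (after tokenization and stemming); each starts at a word and is
followed by an empty string or words of any length, up to the end of the span:
– 1
– factor 1
– transcript factor 1
– thyroid transcript factor 1
– of thyroid transcript factor 1
– expression of thyroid transcript factor 1
– the expression of thyroid transcript factor 1
– studi the expression of thyroid transcript factor 1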
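A minimal Python sketch of this enumeration is given below. The function name `long_form_candidates` is hypothetical. The `stem` argument stands in for Porter's stemmer; the default only lower-cases tokens, so the output shows unstemmed words.

```python
import re


def long_form_candidates(sentence, short_form, stem=lambda w: w.lower()):
    """List every word sequence that ends immediately before '(short_form)'."""
    idx = sentence.find("(" + short_form + ")")
    if idx < 0:
        return []
    # Tokenize the text before the parenthesis on non-alphanumeric characters.
    tokens = [stem(t) for t in re.split(r"[^0-9A-Za-z]+", sentence[:idx]) if t]
    # Grow the candidate leftwards one token at a time.
    return [" ".join(tokens[i:]) for i in range(len(tokens) - 1, -1, -1)]


sentence = "We studied the expression of thyroid transcription factor-1 (TTF-1)."
for candidate in long_form_candidates(sentence, "TTF-1"):
    print(candidate)
```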
105
Expansions for TTF-1
106
Top 20 acronyms in MEDLINE
107
Long-form candidates for acronym ADM
Candidate                           Length  Frequency  Score  Validity
adriamycin                          1       727        721.4  o
adrenomedullin                      1       247        241.7  o
abductor digiti minimi              3       78         74.9   o
doxorubicin                         1       56         54.6   x
effect of adriamycin                3       25         23.6   Expansion
adrenodemedullated                  1       19         17.7   o
acellular dermal matrix             3       17         15.9   o
peptide adrenomedullin              2       17         15.1   Expansion
effects of adrenomedullin           3       15         13.2   Expansion
resistance to adriamycin            3       15         13.2   Expansion
amyopathic dermatomyositis          2       14         12.8   o
brevis and abductor digiti minimi   5       11         9.8    Expansion
minimi                              1       83         5.8    Nested
digiti minimi                       2       80         3.9    Nested
automated digital microscopy        3       1          0.0    match
adrenomedullin concentration        2       1          0.0    Nested
108
Long-form extraction
• Long-form candidates are sorted by score in descending order
• A long-form candidate is considered valid if:
– It has a score greater than 2.0
– The words in the long form can be rearranged so that
all alphanumeric letters of the short form appear in the
same order in the long form
– It is neither nested in nor an expansion of a previously
chosen long form
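The three validity conditions can be sketched in Python as below. This is an illustrative reading of the rules, not AcroMine's code. The function names are hypothetical, the scores are assumed to be given, and trying every word permutation is only practical for short candidates.

```python
from itertools import permutations


def is_subsequence(short, text):
    """True if the characters of short occur in text in the same order."""
    it = iter(text)
    return all(ch in it for ch in short)


def letters_match(short_form, long_form):
    """Try every word order of the long form against the short form letters."""
    short = [c.lower() for c in short_form if c.isalnum()]
    words = long_form.lower().split()
    return any(is_subsequence(short, "".join(p)) for p in permutations(words))


def contains_words(outer, inner):
    """True if inner occurs in outer as a whole-word sequence."""
    return f" {inner} " in f" {outer} "


def select_long_forms(short_form, scored, threshold=2.0):
    chosen = []
    for cand, score in sorted(scored, key=lambda pair: -pair[1]):
        if score <= threshold or not letters_match(short_form, cand):
            continue
        # Reject candidates nested in, or expanding, an already chosen form.
        if any(contains_words(c, cand) or contains_words(cand, c) for c in chosen):
            continue
        chosen.append(cand)
    return chosen


adm = [("adriamycin", 721.4), ("abductor digiti minimi", 74.9),
       ("doxorubicin", 54.6), ("effect of adriamycin", 23.6),
       ("minimi", 5.8), ("automated digital microscopy", 0.0)]
print(select_long_forms("ADM", adm))
```

With these ADM candidates and their slide scores, only adriamycin and abductor digiti minimi are kept.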
109
http://www.nactem.ac.uk/software/acromine/
Acronym disambiguation
• Local acronyms
– Accompany their expanded forms in documents
• Global acronyms
– Appear in documents without the expanded forms stated
– Need their correct expanded forms to be identified
• Immunomodulatory effects of CT were investigated in a rat model,
and the effects of CT on rat renal allograft (from Lewis rat to WKAH
rat) were also examined.
• Immunomodulatory effects of cholera toxin (CT) were investigated
in a rat model, and the effects of cholera toxin (CT) on rat renal
allograft (from Lewis rat to Wistar-King-Aptekman-Hokudai (WKAH)
rat) were also examined.
111
Acronym disambiguation
Sample text: Considerations in the identification of functional RNA structural
elements in genomic alignments (Tomas Babak et al)
http://www.biomedcentral.com/1471-2105/8/33
Term structuring
113
Term structuring
• term clustering (linking semantically similar terms) and
term classification (assigning terms to classes from a
pre-defined classification scheme)
• Hypothesis: similar terms tend to appear in similar
contexts (patterns)
• combining various sources of similarity:
– lexical
– syntactic
– contextual
– ontological (using external resources)
114
Term structuring
• Based on term similarities
– choice of features:
– domain specific: ontology
– linguistic: text
• ontology-based similarity
• textual similarity
– internal features
– contextual features
115
Using ontologies
• two terms should match if they are:
– identified as variants
– siblings in the is-a hierarchy
– in the is-a or part-whole relation
• the distance between the corresponding nodes
in the ontology should be transformed into the
matching score
► I. Spasic presentation MIE Tutorial http://www.nactem.ac.uk/
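One way to turn ontology distance into a matching score is sketched below in Python. Everything here is an assumption for illustration: the toy is-a hierarchy, the function names, and the choice of 1 / (1 + distance) as the transformation. The slides do not specify a formula.

```python
# Toy is-a hierarchy (child -> parent); hypothetical, for illustration only.
IS_A = {
    "progesterone receptor": "nuclear receptor",
    "oestrogen receptor": "nuclear receptor",
    "nuclear receptor": "receptor",
    "receptor": "protein",
}


def ancestors(term, parent):
    """Return the path from term up to the root, starting with term itself."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def distance(a, b, parent):
    """Number of is-a edges between a and b via their lowest common ancestor."""
    depth_b = {t: i for i, t in enumerate(ancestors(b, parent))}
    for i, t in enumerate(ancestors(a, parent)):
        if t in depth_b:
            return i + depth_b[t]
    return None  # no common ancestor


def matching_score(a, b, parent):
    d = distance(a, b, parent)
    return 0.0 if d is None else 1.0 / (1.0 + d)


print(matching_score("progesterone receptor", "oestrogen receptor", IS_A))
print(matching_score("progesterone receptor", "nuclear receptor", IS_A))
```

Under this assumed scoring, a term and its direct parent score higher than two siblings, and identical terms score 1.0.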
116
Using text
• number of neologisms: terms are not in the ontologies
• Use of text based techniques to calculate similarities
• edit distance (ED) – the minimal number (or cost) of changes
needed to transform one string into the other
• edit operations:
– insertion: ...a-c... → ...abc...
– deletion: ...abc... → ...a-c...
– replacement: ...abc... → ...adc...
– transposition: ...abc... → ...acb...
• use of dynamic programming
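A minimal dynamic-programming sketch in Python follows. It assumes unit cost for each of the four operations and uses the restricted (optimal string alignment) form of transposition; the slides do not say which variant or costs are used.

```python
def edit_distance(s, t):
    """Edit distance with insertion, deletion, replacement and transposition."""
    m, n = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and first j chars of t.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


for a, b in [("ac", "abc"), ("abc", "ac"), ("abc", "adc"), ("abc", "acb")]:
    print(a, b, edit_distance(a, b))
```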
117
Term similarities
– lexical similarity: based on sharing the term head and/or
modifier(s)
– Hyponymy: nuclear receptor, orphan nuclear receptor
– Sharing heads: progesterone receptor, oestrogen receptor
• Specific types of associations
– mainly general is_a and part_of
– some domain-specific, e.g. binding: CREB binding protein
118
Contextual similarities
• Features from context
– syntactic category
– terminological status
– position relative to the term
– syntactic relation between a context element and the term
– semantic properties
– semantic relation between a context element and the term
– …
119
Lexical & syntactic patterns
• a lexico-syntactic pattern:
. . . Term (, Term)* [,] and other Term . . .
• the leading Terms are hyponyms of the head Term
... antiandrogens, hydroxyflutamide, bicalutamide,
cyproterone acetate, RU58841, and other compounds ...
• candidate instances of the hyponymy relation:
hyponym( antiandrogens, compound )
hyponym( hydroxyflutamide, compound )
hyponym( bicalutamide, compound )
hyponym( cyproterone acetate, compound )
hyponym( RU58841, compound )
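The pattern can be approximated with a regular expression, as in the Python sketch below. It is a simplification under stated assumptions: terms are taken to be the comma-separated chunks rather than the output of a term recogniser, the head is singularised by naively dropping a final "s", and the names `PATTERN` and `extract_hyponyms` are hypothetical.

```python
import re

# Term (, Term)* [,] and other Term
PATTERN = re.compile(r"(?P<terms>[^.;:]+?),?\s+and other\s+(?P<head>[\w-]+)")


def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs for each 'X, Y, and other Z' match."""
    relations = []
    for m in PATTERN.finditer(text):
        head = m.group("head")
        if head.endswith("s"):
            head = head[:-1]  # naive singularisation (assumption)
        for term in m.group("terms").split(","):
            term = term.strip()
            if term:
                relations.append((term, head))
    return relations


text = ("... antiandrogens, hydroxyflutamide, bicalutamide, "
        "cyproterone acetate, RU58841, and other compounds ...")
for hyponym, hypernym in extract_hyponyms(text):
    print(f"hyponym( {hyponym}, {hypernym} )")
```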
120
Contextual information
• automatic pattern mining for most important context patterns
– find most important contexts in which a term appears
… receptor is bound to these DNA sequences …
… proteins bound to the DNA …
… estrogen receptor bound to DNA …
… steroid receptor coactivator-1 when bound to DNA …
… progesterone receptor complexes bound to DNA …
… RXRs bound to respective DNA elements in vitro …
… glucocorticoid receptor to bind DNA …
pattern:
<TERM> V:bind <TERM:DNA>
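A rough Python check for this pattern is sketched below. It is only an approximation: a small hand-listed set of surface forms stands in for lemmatisation of the verb, any preceding token stands in for the left-hand term, and the `window` parameter and function name are assumptions.

```python
import re

# Surface forms treated as the verb "bind" (a stand-in for lemmatisation).
BIND_FORMS = {"bind", "binds", "binding", "bound"}


def matches_bind_dna(context, window=4):
    """True if a form of 'bind' is preceded by some token and followed by
    'DNA' within the next `window` tokens."""
    tokens = re.findall(r"[A-Za-z0-9-]+", context)
    for i, tok in enumerate(tokens):
        if i > 0 and tok.lower() in BIND_FORMS:
            if "DNA" in tokens[i + 1:i + 1 + window]:
                return True
    return False


contexts = [
    "receptor is bound to these DNA sequences",
    "proteins bound to the DNA",
    "RXRs bound to respective DNA elements in vitro",
    "glucocorticoid receptor to bind DNA",
]
print([matches_bind_dna(c) for c in contexts])
```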
121
Stumbling blocks
• Lexical similarities affected by many neologisms and ad
hoc names
– only 5% of most frequent terms in GENIA belonging to same
biomedical class have some lexical links
• how much context to use? (sentence, phrase, abstract,
…)
• Attempts at using co-occurrence: many report up to 40%
of co-occurrence based relationships biologically
meaningless
122
Term similarities
• SOLD = Syntactic, Ontology-driven & Lexical Distance
(Spasic, I. & Ananiadou, S. 2005, Bioinformatics)
• hybrid approach to comparing term contexts, which
relies on:
– linguistic information (acquired through tagging and parsing)
– domain-specific knowledge (obtained from the ontology)
• based on the approximate pattern matching
• combines ontology-based similarity with corpus-based
similarity using both internal and contextual features
123
Challenges of biomedical terminology
• Linking term forms in text with existing resources
• Term clustering, classification and linking to databases,
ontologies
• Selection of most representative terms (concepts) in
documents (important for improved IR, database
curation, annotation tasks)
• Efficient term management important for updating
terminological and ontological resources, text mining
applications e.g. IE, Q/A, summarisation, linking
heterogeneous resources, IR etc
124
Information Extraction in Biology
• Results appear depressed compared to general
language
– Dependent on earlier stages of processing
(tokenisers, taggers, results from NER, etc)
– MUC data: 80% F-score for template relations, 60% for
events
– Challenge for bio-text mining is to achieve similar
results
• Evaluation: see Hirschman, L. (Text mining book),
BioCreAtIvE 2004
125
Information Extraction
126
IE in Biology
• Pattern-matching
• Context-free grammar approaches
• Full parsing approaches
• Sublanguage driven IE
• Ontology-driven IE
McNaught, J. & Black, W. (2006) Information Extraction, Text
Mining for Biology & Biomedicine, Artech House, pp.143-177
127
Pattern-matching IE
– Usual limitations from the non-inclusion of semantic
processing
– Large number of surface grammatical structures = too
many patterns (Zipf’s law)
– Cannot exploit syntactic generalisations (active,
passive voice)
– Systems extract phrases or entire sentences with
matched patterns; restricted usefulness for
subsequent mining
128
Pattern-matching systems (1)
• BioIE uses patterns to extract sentences, protein
families, structures, functions..
– Presents user with relevant information, improvement
over classic IR
• BioRAT uses “deeper” analysis: tagging, applying
regular expressions (RE) over POS tags, stemming,
gazetteer categories etc
– Templates are applied to extract matching phrases,
primitive filters (verbs are not proteins, etc)
129
Pattern matching systems (2)
• RLIMS-P (Hu) extracts protein phosphorylation information by looking
for enzymes, substrates and sites, assigned to the agent, theme and site
roles of phosphorylation relations
– POS tagger (trained on newswire), chunking, semantic typing of
chunks, identification of relations using pattern-matching rules
– Semantic typing of NPs: using combination of clue words, suffixes,
acronyms etc
– Semantically typed sentences matched with rules
– Patterns target sentences containing phosphorylate
130
Full parsing approaches
• Link Grammar applied for protein-protein interactions; general
English grammar adapted to bio-text
• Link Grammar finds all possible linkages according to its grammar
• Number of analyses reduced by random sampling, heuristics,
processing constraints relaxed
– 10,000 results permitted per sentence
– 60% of protein interactions extracted
– Problems: missing possessive markers & determiners, coordination of
compound noun modifiers
131
Full parsing IE (2)
• Not all parsing strategies suitable for bio-text mining
• Text type, abstracts, “ungrammaticality” related to sublanguage
characteristics?
• Ambiguity and full parsing; fragmentary phrases (titles, headings,
text in table cells, etc)
• CADERIGE project used Link Grammar, but in shallow parsing
mode
• Kim & Park (BioIE) use combinatory categorial grammar, annotated
with GO concepts, to extract general biological interactions
• 1,300 patterns applied to find instances of patterns with keywords
132
Full parsing (3)
• Keywords indicate basic biological interactions
• Patterns find potential arguments of the interaction
keywords (verbs or nominalisations)
– Validated arguments mapped into GO concepts
– Difficult to generalise interaction keyword patterns
• BioIE’s syntactic parsing performance improved after
adding subcategorisation frames on verbal interaction
keywords
133
Full parsing (4)
– Daraselia (2004) use full parsing and a domain-specific filter to
extract protein interactions
1. All syntactic analyses discovered using CFG and variant of
LFG
2. Each alternative parse mapped to its corresponding semantic
representation
3. Output = set of semantic trees, lexemes linked by relations
indicating thematic or attributive roles
4. Apply custom-built, frame-based ontology to filter
representations of each sentence
5. Preference mechanism controls construction of frame tree;
high precision, low recall (21%)
134
Sublanguage-driven IE (1)
• Language of a special community (e.g. biology)
• Particular set of constraints with respect to general language (GL)
• Constraints operate at all linguistic levels
– Special vocabulary (terms)
– Specialised term formation rules
– Sublanguage syntactic patterns
– Sublanguage semantics
• These constraints give rise to the informational structure of the
domain (Z. Harris)
• See JBI 35(4) Special Issue on Sublanguage
135
GENIES system
• Employs SL approach to extract biomolecular interactions
• Uses hybrid syntactic-semantic rules
– Syntactic and semantic constraints referred to in one rule
• Able to cope with complex sentences
• Frame-based representation
– Embedded frames
• Domain specific ontology covers both entities and events
136
GENIES system
• Default strategy: full parsing
– Robust due to sublanguage constraints
– Much ambiguity excluded
• If full parse fails, partial parsing invoked
– Maintains good level of recall
• Precision: 96%, Recall: 63%
137
Ontology-driven IE
• Until recently most rule-based IE systems have used neither linguistic
lexica nor ontologies
– Reliance on gazetteers
– Small number of semantic categories
• Gazetteer approach not well suited to bioIE
• Ontology based vs ontology driven
– Passive use of ontologies, map discovered entity to concept
– Active use, ontology guides and constrains analysis, fewer rules
• Examples: PASTA, GenIE (not SL)
• GENIES: SL and ontology driven
138
Summary: simple pattern matching
• Over text strings
– Many patterns required, no generalisation possible
• Over POS
– Some generalisation but ignore sentence structure
• POS tagging, chunking, semantic pattern matching, typing
– Limited generalisation, some account taken of structure, limited
consideration of SL patterns
139
Summary: full parsing
• Full parsing on its own, or parsing done in combination with
other techniques (chunking, partial parsing, heuristics) to reduce
ambiguity and filter out implausible readings
– GL theories not appropriate
– Difficult to specialise for biotext
– Many analyses per sentence
– Missing information due to sublanguage meaning
140
Summary: sublanguage approach
• Exploits a rich SL lexicon
• Describes SL verbs in detail
• Syntactic-semantic grammar
• Current systems would benefit from adopting an ontology-driven approach
141
Ontology-driven
• Uses event concept frames to guide processing
• Integration of extracted information
• Current systems would benefit from also adopting
an SL approach
142
Applications
143
How do we apply TM to Systems Biology?
REFINE project
• Adapting TM tools to evaluate the basis in the
literature for the structure of biochemical and
signalling models in systems biology
• Integrating TM with visualisation for better
understanding of the evidence for biochemical
and signalling pathways
• Enriching models encoded in SBML with
information derived from TM
Kell, Ananiadou, Tsujii
144
Applications
• Semantic annotation not only based on concepts
but also on facts, events extracted by IE
• Enables semantic querying
• Facilitates curation
• Hypothesis generation for scientific discovery
145
Applications
• Other text mining applications
– Summarisation
– Question answering
• Integration of IR with TM
– Terms / concepts as index terms
– Topic detection
– Document clustering and classification
146