Real-time Text Mining for the Biomedical Literature

a collaboration between Discovery Net & myGrid
Rob Gaizauskas
Department of Computer Science
University of Sheffield
Moustafa M. Ghanem
Department of Computing
Imperial College London
April 21, 2005
EPSRC E-Science Meeting, NeSC
Outline
• Context
– Workflows, Services and Text Mining
– Discovery Net & myGrid
• Aims and Objectives of New Project
• Architecture of New System
– Integration of Existing Components
• Approach to Text Mining
– Data Resources & Evaluation
– Techniques for GO Tagging
• Interface and Results Presentation
• Lessons Learnt So Far, Conclusions and Broader Applicability of Work
Workflows, Web Services and
Text Mining for Bioinformatics
Workflows
– useful computational models for processes that require
repeated execution of a series of complex analytical tasks
– e.g. biologist researching genetic basis of a disease repeatedly
• maps reactive spot in microarray data to gene sequence
• uses a sequence alignment tool to find proteins/DNA of similar
structure
• mines info about these homologues from remote DBs
• annotates unknown gene sequence with this discovered info
Workflows, Web Services and
Text Mining for Bioinformatics
Web services
– Processing resources that are
• available via the Internet
• use standardised messaging formats, such as XML
• enable communication between applications without being tied to a
particular operating system/programming language
– Useful for bioinformatics where data used in research is
• heterogeneous in nature – DB records, numerical results, NL texts
• distributed across the internet in research institutions around the world
• available on a variety of platforms and via non-uniform interfaces
Workflows, Web Services and
Text Mining for Bioinformatics
Text mining
– any process of revealing information – regularities, patterns or
trends – in textual data
– includes more established research areas such as information
extraction (IE), information retrieval (IR), natural language
processing (NLP), knowledge discovery from databases (KDD)
and traditional data mining (DM)
– relevant to bioinformatics because of
• explosive growth of biomedical literature
• availability of some information in textual form only, e.g. clinical records
Workflows, Web Services and
Text Mining for Bioinformatics
Workflows
Web services
Text mining
Bioinformatics
Discovery Net & myGrid
• Discovery Net: An e-Science testbed for High Throughput Informatics
– £2.2M EPSRC Pilot Project
– Started Oct 01, Ended in March 05
– Service-based infrastructure/workflow model for Life Sciences, Environmental Modelling and Geo-hazard Modelling
– Infrastructure for mixed data mining / text mining
– Machine learning methods for text mining
• myGrid: Directly Supporting the e-Scientist
– £3.5M EPSRC Pilot Project
– Started Oct 01, Ends June 05
– Service-based infrastructure/workflow model for Life Sciences
– Infrastructure for Text Collection Server, Text Services Workflow Server and Interface/Browsing Client
– Service-based Terminology Servers
myGrid
• Overall aim: develop an e-biologist’s workbench – a
platform allowing biologists to execute, analyze, repeat
multi-stage in silico experiments involving distributed
data, code and processing resources
– Workflow model for composing/executing processing components
– Web services for distribution
• Problem: how to integrate text mining into a biological
workflow?
– Most text mining runs off-line and supports interactive browsing of
results
– Most workflows run end to end with no user intervention
– What are the inputs to text mining to be?
• Solution: tap off result of a workflow step and treat as
implicit query
A myGrid example studying
the Genetic Basis of Disease
Graves’ Disease
– an autoimmune condition affecting tissues in the thyroid and orbit
– being investigated using micro-array methods
• micro-array shows which genes are differentially expressed in normal patients
vs patients with the disease = candidate genes
• sequence alignment search (e.g. BLAST) finds genes/proteins with similar
structure
• function of these “homologues” may suggest function of candidate gene
– key step for text mining follows BLAST search
• for homologous proteins BLAST report contains references to proteins in
Swissprot protein database
• Swissprot records contain ids of abstracts describing the protein in Medline
abstract database
• abstracts can be mined directly or used as “seed” documents to assemble a set
of related abstracts
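The chain from BLAST report to seed abstracts can be sketched as below. This is a minimal illustration, not the myGrid implementation: the `sp|ACCESSION|NAME` hit-line format, the accession numbers, the PubMed ids and the in-memory lookup tables are all assumptions standing in for the remote Swissprot and Medline services.

```python
import re

# Toy stand-ins for remote resources; a real workflow would call web services.
# Every identifier and record below is made up for illustration.
BLAST_REPORT = """\
sp|TOY001|HOMA_HUMAN  Hypothetical homologue A   score=212
sp|TOY002|HOMB_MOUSE  Hypothetical homologue B   score=187
"""
SWISSPROT_TO_PMIDS = {"TOY001": ["1111", "2222"], "TOY002": ["2222", "3333"]}
MEDLINE = {
    "1111": "Abstract text about homologue A.",
    "2222": "Abstract text about both homologues.",
    "3333": "Abstract text about homologue B.",
}

def swissprot_ids(blast_report):
    """Pull accession numbers out of 'sp|ACCESSION|NAME' hit lines."""
    return re.findall(r"^sp\|([A-Z0-9]+)\|", blast_report, flags=re.MULTILINE)

def seed_abstracts(blast_report):
    """BLAST hits -> Swissprot records -> PubMed ids -> Medline abstracts."""
    pmids = []
    for accession in swissprot_ids(blast_report):
        for pmid in SWISSPROT_TO_PMIDS.get(accession, []):
            if pmid not in pmids:  # keep first-seen order, drop duplicates
                pmids.append(pmid)
    return {pmid: MEDLINE[pmid] for pmid in pmids if pmid in MEDLINE}

seeds = seed_abstracts(BLAST_REPORT)
print(list(seeds))
```

The returned abstracts play the role of the implicit query: they can be mined directly or passed on to a related-abstracts step.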
myGrid Text Services Architecture
[Architecture diagram; labels recovered from the figure]
• Components: User Client, Workflow Server (workflow enactment), Medline Server
• Workflows: Initial Workflow, Cluster Abstracts Workflow
• Workflow steps: Extract PubMed Id, Get Medline Abstract, Get Related Abstracts
• Data exchanged: workflow definition + parameters; Swissprot/Blast record; PubMed Ids; term-annotated Medline abstracts; clustered PubMed Ids + titles
• Medline: pre-processed offline to extract biomedical terms + indexed
myGrid Text Services
Architecture
3-way division of labour sensible way to deliver distributed
text mining services
– Providers of e-archives, such as Medline, will make archives
available via web-services interface
• Cannot offer tailored services for every application
• Will provide core, common services
– Specialist workflow designers will add value to basic services from
archive to meet their organization’s needs
– Users will prefer to execute predefined workflows via standard
light clients such as a browser
Architecture appropriate for many research areas, not just
bioinformatics
myGrid Interface/Browsing
Client
[Screenshot; labelled regions]
• MeSH Tree
• Abstract Titles
• Abstract body
• Search scope restrictors
• Linked terms
• Get Related Abstracts
• Free text search
Discovery Net: Adding text mining to eScience workflows
DNet Workflow server executes DPML workflow and uses Discovery
Net’s InfoGrid data access and integration wrappers and web services
[Workflow screenshot; labelled stages]
• Gene Expression Analysis
• Find Relevant Genes from Online Databases
• Find Associations between Frequent Terms
Text Mining in e-Science workflows
Problem: how to develop new distributed text mining applications
using a workflow?
– Most text mining applications require the integration of a mixture of
components (Services) for text processing tasks (e.g. parsing and
cleaning), natural language processing (e.g. named entity recognition),
statistics and data mining (e.g. classification, clustering, etc).
– There are many design alternatives and end users may want to prototype
and compare alternative implementations.
– Once an application is developed, most workflows run end to end with no user
intervention
Solution: Extend service infrastructure to allow composition of text
mining services.
Building text mining applications from
workflows
Using workflow technologies to build text mining applications and
services using finer grain components/services
Text Mining Pipelines
1. Retrieval/Storage: retrieve and organize relevant documents
– Indexing, Access Drivers, Storage
– Output: text documents
2. Text Processing: pre-process documents to enhance the ease of feature extraction
– Stemming, Stop-word filters, Pattern filters, Lexicon matching, Ontologies, NLP parsing, etc.
– Output: text documents
3. Feature Extraction: features are summarized into vector forms which are suitable for data mining
– Statistical: Word Counts, Pattern Extraction & Counts, etc.
– Domain-specific: Gene Name counts, etc.
– NLP-specific: Phrase counts, etc.
– Output: numerical feature vectors
4. Data Mining: results can be document characterization or hidden relationship extraction
– Classification, Clustering, Association, Statistical Analysis, Visual Analysis, etc.
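A minimal sketch of the text processing and feature extraction stages, producing the numerical feature vectors that a data mining step would consume. The stop-word list, the suffix-stripping `crude_stem` helper and the example sentences are assumptions made for illustration; Discovery Net's own components are not shown here.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "and", "in", "is", "of", "the", "to"}

def crude_stem(token):
    """Strip a few common suffixes; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def text_processing(doc):
    """Tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def feature_extraction(token_lists):
    """Turn each token list into a word-count vector over a shared vocabulary."""
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["The gene is expressed in yeast.", "Expression of the gene in mutants."]
processed = [text_processing(d) for d in docs]
vocabulary, vectors = feature_extraction(processed)
print(vocabulary)
print(vectors)
```

Keeping each stage as a separate function mirrors the idea of composing finer grain services: any stage can be swapped for an alternative implementation without touching the others.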
Simplified Document Classification
Workflow
Predictive accuracy of relevance prediction, using Support Vector Machine classification
– Overall accuracy: 84.5%
– Precision: 78.11%
– Recall: 73.40%
Examples of Pattern Definitions
GENE_NAME protein
GENE_NAME express
express GENE_NAME
GENE_NAME mutant
GENE_NAME activity
activity GENE_NAME
GENE_NAME drosophila
Examples of Extracted Patterns
delet\s([a-z]*(\s)+)*genenam+\s
depend\s([a-z]*(\s)+)*genenam+\s
describ\s([a-z]*(\s)+)*genenam+\s
detect\s([a-z]*(\s)+)*genenam+\s
determin\s([a-z]*(\s)+)*genenam+\s
differ\s([a-z]*(\s)+)*genenam+\s
disc\s([a-z]*(\s)+)*genenam+\s
dna\s([a-z]*(\s)+)*genenam+\s
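A hedged sketch of how regular-expression patterns like those on this slide could be turned into classifier features. It assumes the patterns run over stemmed, lower-cased text in which gene names have already been replaced by the token `genenam`; that reading, the feature names and the example sentence are assumptions, not details confirmed by the slides.

```python
import re

# Two of the regular expressions from the slide, compiled unchanged.
PATTERNS = {
    "detect": re.compile(r"detect\s([a-z]*(\s)+)*genenam+\s"),
    "describ": re.compile(r"describ\s([a-z]*(\s)+)*genenam+\s"),
}

def pattern_features(stemmed_text):
    """Count non-overlapping matches of each pattern in one document."""
    return {
        name: sum(1 for _ in pattern.finditer(stemmed_text))
        for name, pattern in PATTERNS.items()
    }

# Hypothetical pre-processed sentence (stemmed, gene name masked).
doc = "we detect high level genenam in mutant cell "
features = pattern_features(doc)
print(features)
```

The resulting counts form one slice of a document's feature vector, which a classifier such as a Support Vector Machine could then use for relevance prediction.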
Text Meta Data Model
Build Classifier training phase using workflow co-ordinating distributed services
Build Prediction phase using workflow co-ordinating distributed services
Metadata Model: Service interfaces only tell you how to invoke a remote service, but it is up to you to decide what information flows between services!
Text                Start  End  Annot. Type  Attributes
Insulin             1      7    token        pos:noun, stem:insulin
resistance          9      18   token        pos:noun, stem:resist
Insulin resistance  1      18   compound     disease:insulin resistance
plays               20     24   token        pos:verb, stem:plai
major               26     30   token        pos:adj, stem:major
role                32     35   token        pos:noun, stem:role
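The annotation table can be represented as stand-off annotations over the raw text. The sketch below assumes 1-based, inclusive character offsets (which is what the Start/End values imply); the `Annotation` class and its field names are hypothetical, not the project's actual metadata schema.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    start: int        # 1-based offset of the first character
    end: int          # 1-based offset of the last character (inclusive)
    annot_type: str   # e.g. "token" or "compound"
    attributes: dict = field(default_factory=dict)

    def covered_text(self, text):
        """Return the span of `text` that this annotation refers to."""
        return text[self.start - 1 : self.end]

TEXT = "Insulin resistance plays major role"
ANNOTATIONS = [
    Annotation(1, 7, "token", {"pos": "noun", "stem": "insulin"}),
    Annotation(9, 18, "token", {"pos": "noun", "stem": "resist"}),
    Annotation(1, 18, "compound", {"disease": "insulin resistance"}),
    Annotation(20, 24, "token", {"pos": "verb", "stem": "plai"}),
    Annotation(26, 30, "token", {"pos": "adj", "stem": "major"}),
    Annotation(32, 35, "token", {"pos": "noun", "stem": "role"}),
]

for annotation in ANNOTATIONS:
    print(annotation.annot_type, repr(annotation.covered_text(TEXT)))
```

Because annotations point into the text rather than modifying it, overlapping layers (tokens and the compound that spans them) can coexist and be exchanged between services.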
Aims & Objectives of New
Project
• Aim: to develop a unified real-time e-Science text-mining infrastructure
that leverages the technologies and methods developed by both Discovery
Net and myGrid
– Software engineering challenge: integrate complementary service-based text
mining capabilities with different metadata models into a single framework
– Application challenge: annotate biomedical abstracts with semantic categories
from the Gene Ontology
• Deliverables:
– D1: A GO Annotation Service
– D2: A Generic Shared Infrastructure for Grid-enabled Biomedical Document
Categorization
– D3: Infrastructure for Semantic Document Annotation
– D4: A Detailed Case Study (analysing/evaluating the GO annotator)
– D5: Developing a common framework for representing + exchanging
information about:
1. Data: biomedical documents/doc collections + metadata, biomedical dictionaries
2. Intermediate data: Document indexes and Document feature vectors
3. Text Analysis Results
GO TAG: A Novel Application
• The GO TAG Application: Automatic Assignment of GO (Gene
Ontology) Codes to Medline Documents
A Machine Learning
Approach
Overview of Training Phase
Run-time System
Overview of Run-time System
GO Annotator – Version 1
• Version 1a:
– Direct search for GO Annotation descriptions and synonyms in document
text
– If description is found, document is labelled with this GO Annotation
– Description is also marked-up in document
• Version 1b:
– 1a + search for gene names extracted from yeast genome DB
– If gene name found, document labelled with GO annotation(s) associated
with gene in DB
– Gene name also marked up in document
• Termino web-service, hosted at Sheffield, provides lookup
capability
• This is wrapped in a Discovery Net workflow to include PubMed
query, results visualization and performance calculations
• Workflow is deployed as a web application for end users which
includes applet to interactively browse results
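A minimal sketch of the Version 1 lookup idea: label a document when a GO description, a synonym or (in 1b) a known gene name appears in its text, and mark up the matched string. The lexicon entries, the placeholder gene `YFG1` with its annotation, and the bracket mark-up are assumptions for illustration; the Termino web service itself is not shown.

```python
import re

# Toy lexicon: GO descriptions and synonyms -> GO id (illustrative entries).
GO_LEXICON = {
    "apoptotic process": "GO:0006915",
    "apoptosis": "GO:0006915",
    "translation": "GO:0006412",
}
# Toy gene -> GO table standing in for the yeast genome DB.
# "YFG1" is a placeholder gene name and its annotation is made up.
GENE_TO_GO = {"YFG1": ["GO:0006412"]}

def annotate(text, use_gene_names=False):
    """Return (sorted GO labels, marked-up text) for one document."""
    lexicon = {term: [go_id] for term, go_id in GO_LEXICON.items()}
    if use_gene_names:  # Version 1b adds gene-name lookup to Version 1a
        lexicon.update(GENE_TO_GO)
    labels = set()
    marked = text
    for term, go_ids in lexicon.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        if pattern.search(text):
            labels.update(go_ids)
            marked = pattern.sub(lambda m: "[" + m.group(0) + "]", marked)
    return sorted(labels), marked

doc = "Apoptosis is reduced when YFG1 is deleted."
print(annotate(doc))                       # Version 1a behaviour
print(annotate(doc, use_gene_names=True))  # Version 1b behaviour
```

Exact string lookup of this kind is simple to deploy as a service, but it only finds terms that appear literally in the text.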
GO Annotator – Version 1
Underlying Discovery Net Workflow
1. Enter query and retrieve abstracts from PubMed.
2. Use Termino to mark up abstracts with GO Annotations when a match for a GO Annotation description is found.
3. Tabulate GO Annotations by PMID.
4. Join PMIDs and matching GO Annotations with abstracts and titles.
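The tabulate and join steps can be sketched with plain dictionaries. The PMIDs, titles and matches below are made up, and the choice to keep abstracts that received no annotation (a left join from the abstracts) is an assumption; the slides do not say which join the workflow uses.

```python
from collections import defaultdict

# Toy inputs; PMIDs, titles and annotations are made up for illustration.
abstracts = {
    "1001": {"title": "Title A", "abstract": "Text A"},
    "1002": {"title": "Title B", "abstract": "Text B"},
}
# One (PMID, GO id) pair per match reported by the mark-up step.
matches = [("1001", "GO:0006915"), ("1001", "GO:0006412"), ("1001", "GO:0006915")]

def tabulate(matches):
    """Group GO ids by PMID, dropping repeated matches."""
    table = defaultdict(list)
    for pmid, go_id in matches:
        if go_id not in table[pmid]:
            table[pmid].append(go_id)
    return dict(table)

def join(table, abstracts):
    """Attach the GO ids to each abstract record; unmatched abstracts get []."""
    return [
        {"pmid": pmid, "go": table.get(pmid, []), **record}
        for pmid, record in abstracts.items()
    ]

rows = join(tabulate(matches), abstracts)
for row in rows:
    print(row)
```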
Workflow Deployment
GO Annotator – Version 2
• Use Saccharomyces (Yeast) Genome Database as source of
papers expertly curated with GO Annotations
• Train classifier using these papers
• Hierarchical classification
• Training data sufficient to classify over 2000 GO
Annotations
• Classifier is then applied to assign GO Annotations to unseen papers
• Main Issues:
– Choice of features to be extracted from the training documents
– Choice of feature reduction methods to produce accurate
classification
– Choice of classification algorithm to be used
GO Annotator – Version 2
Underlying Discovery Net Workflow
1. Papers expertly curated with GO Annotations from SGD database.
2. Generate vector of features (frequent phrases) for each paper. This is used to train classifier.
3. Generate a Naïve Bayesian classification model.
4. Generate vector of features (frequent phrases) for each paper in test data set. This is used to test the classifier.
5. Apply classification model to test data to evaluate classification accuracy.
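A minimal sketch of the train/test loop with a multinomial Naïve Bayes classifier and add-one smoothing. It is a simplification under stated assumptions: single tokens stand in for frequent phrases, each paper gets one flat label rather than several hierarchical GO annotations, and the tiny training and test sets are made up.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label). Returns a multinomial NB model."""
    label_counts = Counter(label for _, label in examples)
    term_counts = defaultdict(Counter)
    for tokens, label in examples:
        term_counts[label].update(tokens)
    vocabulary = {t for tokens, _ in examples for t in tokens}
    return label_counts, term_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log posterior for `tokens`."""
    label_counts, term_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocabulary)
        for t in tokens:
            if t in vocabulary:  # ignore terms never seen in training
                score += math.log((term_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up, pre-processed training and test papers.
training = [
    (["ribosom", "translat", "protein"], "translation"),
    (["translat", "trna"], "translation"),
    (["cell", "death", "caspas"], "apoptosis"),
    (["death", "signal"], "apoptosis"),
]
test_set = [
    (["translat", "protein"], "translation"),
    (["caspas", "death"], "apoptosis"),
]

model = train(training)
correct = sum(predict(model, tokens) == label for tokens, label in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

Swapping `train` and `predict` for another classifier leaves the rest of the loop unchanged, which is what makes comparing classification algorithms inside a workflow practical.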
Interface + Results
Presentation
[Screenshot; labelled regions]
• GO Hierarchy
• Abstract Titles
• Abstract Bodies
• GO Labels/Gene Names
Achievements to date
• Infrastructure Interoperability
– More than just remote web service invocation: interoperable metadata
models
• Mark 1 System Implemented
– Annotation based on terminology lookups
– 15% Recall & 5% Precision (Exact matches for 18,000 GO terms)
• Measures inadequate due to incompleteness of gold standard
• In process of Finalising Training Data Sets and Evaluation Metrics
– 4,922 papers referencing 2,455 GO Terms
• Mark 2 Systems in Progress
– Naïve Bayesian Approach
– 41% Recall and 27% Precision
• User Interfaces
• Mark 3, 4, … Systems and Evaluation
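For reference, recall and precision over exact GO term matches can be computed as sketched below. Micro-averaging across documents is an assumption (the slides do not state how the figures were aggregated), and the document ids and GO labels are placeholders.

```python
def precision_recall(predicted, gold):
    """Micro-averaged precision and recall over exact GO id matches.

    predicted, gold: dicts mapping a document id to a set of GO ids.
    """
    true_pos = sum(
        len(predicted.get(doc, set()) & labels) for doc, labels in gold.items()
    )
    n_predicted = sum(len(labels) for labels in predicted.values())
    n_gold = sum(len(labels) for labels in gold.values())
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall

# Placeholder gold standard and system output.
gold = {"d1": {"GO:1", "GO:2"}, "d2": {"GO:3"}}
predicted = {"d1": {"GO:1", "GO:4"}, "d2": {"GO:3", "GO:5"}}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Under this scheme a correct annotation that is simply missing from the gold standard counts as a false positive, which is one way an incomplete gold standard makes the measures inadequate.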
Implementation Options
• Feature Vector Options
– Bag of words
– Frequent Phrases
– Key Phrases (Gene Names, Protein Names, MeSH
terms, etc).
• Classifier Options
– Bayesian Classifiers
– Support Vector Machines
– Drag Push (a novel centroid based method)
Lessons Learnt and
Challenges to Face
• Infrastructure
– Interoperability Issues
– Performance Issues:
• Communication vs Persistence of remote server
• Off-line vs on-line feature extraction
• Text Mining
– Usability Issues
– Evaluation Issues