Are you ready for the golden age of text mining?

John McNaught
Deputy Director, National Centre for Text Mining
University of Manchester
John.McNaught@manchester.ac.uk
Overview
• Text mining in a nutshell
• Enriching content, enhancing search, enabling
discovery, reducing costs
• Interoperability and evaluation
• The C change
McNaught
London Info International
How do we (humans) discover?
• Find, read, learn, analyse a lot
• Ask “What if…?”
• Construct hypotheses, test them
– Explore many avenues, associations
• Work collaboratively
• Share results and data with others
– Reproducibility → validation
• Integrate heterogeneous
data/information/knowledge
• (vs. Serendipity: by lucky accident)
Barriers to discovery
• Find: document oriented, too many hits
• Read: too much to read, even if we find relevant hits
• Learn: growth too fast to keep up or to know most things
• Analyse: duplication of efforts, many new results to document
• Construct hypotheses: hard, can’t tell which are most promising, or whether any have been missed
• Share: primary vehicles are documents and curated databases (massive curation backlog)
• Integrate: document often the key, hard to link in to different worlds of data, information, knowledge
How does TM aid discovery?
• Find: more precise, relevant information, within and across
documents
• Read: much faster than human
• Learn: extracts, packages, links, synthesises, summarises,
reduces burden
• Analyse: recognises duplication; clusters, classifies, drives
semantic author aids
• Construct hypotheses: rapidly finds and ranks unknown
associations for testing
• Share: reduces curation effort, complements and validates
databases
• Integrate: links documents deeply into worlds of data,
information and knowledge
Text mining in a nutshell
[Figure labels: Other data; Applications; Semantic search; Data mining]
Increased sophistication? Increased customisation!
[Figure: ladder of text mining capabilities, listed here from least to most sophisticated]
Units of analysis: Words, Terms, Entities, Relations, Events, Metaknowledge, Associations
Techniques:
• Wordform co-occurrence, pattern matching, …
• Term recognition and normalisation
• Named entity recognition
• Relation extraction
• Event extraction
• Metaknowledge extraction
• Data mining, Clustering
Questions supported:
• Keyword search
• What is this paper about?
• What is known about this disease, protein, person?
• What is linked with X?
• {Who, what} Xed {whom, what} where, when and how?
• Is X possible, certain, probable, suggested, past, to come?
• What if…?
A complex space
• Text Types: Scientific articles (full papers/abstracts), Social media, Patents, Clinical records, EMR, Books, theses, reports, Newswire, …
• Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
• Technology: Tokenizers, Sentence Splitters, Paragraph Splitters, NP Chunkers, Syntactic parsers, Semantic parsers, NE recognizers, Relation extractors, Event extractors, …
• Tasks: Translation, Information extraction, Semantic search, Question answering, Sentiment analysis, Summarization, Knowledge discovery, Database curation, Systematic reviewing, Pathway reconstruction, …
• Resources (mono- and multilingual): Gazetteers, Annotated corpora, Lexicons, Terminologies, Wordnets, Thesauri, Ontologies, Grammars, …
• Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Arabic, Urdu, Japanese, Korean, …
Three dimensions of diversity:
– Diversity of Contexts
– Diversity of Applications
– Diversity of Languages and Language Resources, including temporal diversity
Europe’s Languages and
Language Technology support
http://www.meta-net.eu
Support through Language Technology, ordered from good support (first) to weak or no support (last); no language has ‘excellent’ support:
• English, Dutch, French, German, Italian, Spanish, Catalan, Czech, Finnish, Hungarian, Polish, Portuguese, Swedish
• Basque, Bulgarian, Danish, Galician, Greek, Norwegian, Romanian, Slovak, Slovene
• Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian
Enhancing historical collections
• If you have a domain collection going back
centuries
– How easy is it for users to find answers to research
questions?
• Language evolves, terms come and go,
concepts drift, …
• TM can enhance collections in many ways
– Handling temporal aspects of language is key
– Enabling event-based semantic search
Looking into the past
• Semantic search for historians of medicine
– Treatment and prevention of diseases over time
– Medical and public health perspectives
• British Medical Journal archive (from 1840)
– Around 350K articles
• London Medical Officer of Health reports
(1848-1972) (Wellcome Library)
– Around 5,000 reports from different boroughs
• In historical collections, same concept expressed by different terms across different time periods
• Users miss information due to unfamiliar terminology
• TM to extract/link diachronic synonyms, organize in thesaurus
• Use diachronic thesaurus for time-sensitive search
(A mock-up for user feedback)
[Figure: search interface mock-up, with callouts]
– Traditional search: user searches for “pulmonary tuberculosis” but doesn’t know historical synonym “pulmonary phthisis”
– System automatically suggests related terms
– User expands query
– Distribution of “pulmonary tuberculosis” and “pulmonary phthisis” across time
– Narrow down results according to faceted search (facets derived both from document metadata and from text mining)
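The time-sensitive search idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TermVariant` class, the `expand_query` function and the year ranges in the toy thesaurus are all hypothetical placeholders, not data or code from the project described on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermVariant:
    """One surface form of a concept plus the period in which it was in use."""
    term: str
    first_year: int  # hypothetical value, for illustration only
    last_year: int   # hypothetical value, for illustration only


# Toy diachronic thesaurus. The year ranges are invented placeholders.
THESAURUS = {
    "pulmonary tuberculosis": [
        TermVariant("pulmonary tuberculosis", 1900, 2014),
        TermVariant("pulmonary phthisis", 1840, 1920),
    ],
}


def expand_query(query, start_year, end_year, thesaurus=THESAURUS):
    """Return the query plus every variant whose usage period overlaps the search period."""
    key = query.lower().strip()
    expanded = [key]
    for variant in thesaurus.get(key, []):
        overlaps = variant.first_year <= end_year and variant.last_year >= start_year
        if overlaps and variant.term not in expanded:
            expanded.append(variant.term)
    return expanded


if __name__ == "__main__":
    print(expand_query("pulmonary tuberculosis", 1850, 1880))
```

A search restricted to 1850-1880 picks up the historical synonym, while a search restricted to 1950-2000 keeps only the modern term, because the period of use is stored alongside each variant.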
Analysing events of interest to historians
Type: Affect
– Description: An entity or event is affected, infected, changed or transformed, possibly by another entity or event
– Participants:
  Cause: of the affection
  Target: Entity or event affected
  Subject: Medical subject affected
Type: Cause
– Description: An entity or event results in manifestation of another entity or event
– Participants:
  Cause: of the event
  Result: Resulting entity or event
  Subject: Medical subject affected
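One way to picture how the two event types could be stored and queried is the sketch below. The role names come from the table above; everything else (the `Event` class, the `search` function and the two example events) is hypothetical and invented for illustration.

```python
from dataclasses import dataclass, field

# Roles allowed for each event type, taken from the table above.
ROLES = {
    "Affect": {"Cause", "Target", "Subject"},
    "Cause": {"Cause", "Result", "Subject"},
}


@dataclass
class Event:
    """A typed event whose participants are keyed by role name."""
    event_type: str
    participants: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.event_type not in ROLES:
            raise ValueError(f"unknown event type: {self.event_type}")
        unknown = set(self.participants) - ROLES[self.event_type]
        if unknown:
            raise ValueError(f"roles not allowed for {self.event_type}: {sorted(unknown)}")


def search(events, event_type, **roles):
    """Return events of the given type whose participants match every requested role."""
    hits = []
    for event in events:
        if event.event_type != event_type:
            continue
        if all(event.participants.get(role, "").lower() == value.lower()
               for role, value in roles.items()):
            hits.append(event)
    return hits


# Invented example events, not extracted from any real document.
EVENTS = [
    Event("Cause", {"Cause": "overcrowding", "Result": "tuberculosis"}),
    Event("Affect", {"Cause": "tuberculosis", "Target": "lungs", "Subject": "children"}),
]

if __name__ == "__main__":
    print(search(EVENTS, "Cause", Result="tuberculosis"))
```

Querying by role ("what results in tuberculosis?") rather than by keyword is what distinguishes event-based search from plain term matching.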
Classic case of working together
• End user (typically) not a text miner
• Text miner (typically) not a domain expert
• Requirements and evaluation: challenge for both
• Need to work together to understand
– How TM can help, what it can and cannot do
– What questions are of interest
– What role human has
– What outcomes are desirable
– What existing resources can be exploited
Mining Biodiversity
http://miningbiodiversity.org
Aim
Transform Biodiversity Heritage Library into a next-generation social digital library
130,000 volumes of digitised legacy literature
A multi-disciplinary approach
1. Text Mining
2. Machine learning
3. Data visualisation
4. History of Science
5. Environmental History & Studies
6. Library and Information Science
7. Social Media
Semantic metadata extraction to support search
[Figure labels: Observation, Habitation, Nutrition]
Finding evidence
• Event extraction can drive semantic search as
we’ve seen. We can go a step further…
• Example: application for Europe PubMed Central
• Deeply analyse documents
• Index relationships
• Key off search term, to dynamically generate
from indexed relationships questions that have
known answers
– Not auto-completion
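The idea of keying off a search term to generate questions that have known answers can be sketched as follows. This is a toy illustration, not a description of how EvidenceFinder is implemented: the triple format, the function names and the simplified relationships are all assumptions made for the example.

```python
from collections import defaultdict

# Toy indexed relationships: (subject, verb in base form, object, source).
# Simplified by hand for illustration; a real system would extract these from text.
TRIPLES = [
    ("Mad1", "bind", "Mad2", "PMID:18981471"),
    ("Mad1", "regulate", "Mad2", "PMID:18318601"),
    ("Mad2", "bind", "Cdc20", "PMID:18318601"),
]


def build_index(triples):
    """Index relationships by lower-cased subject, then by verb."""
    index = defaultdict(lambda: defaultdict(list))
    names = {}
    for subj, verb, obj, source in triples:
        names[subj.lower()] = subj
        index[subj.lower()][verb].append((obj, source))
    return index, names


def suggest_questions(term, index, names):
    """Generate only questions that the index can answer for this search term."""
    key = term.lower()
    if key not in index:
        return []
    questions = []
    for verb in sorted(index[key]):
        answers = [obj for obj, _source in index[key][verb]]
        questions.append((f"What does {names[key]} {verb}?", answers))
    return questions


if __name__ == "__main__":
    index, names = build_index(TRIPLES)
    for question, answers in suggest_questions("mad1", index, names):
        print(question, "->", answers)
```

Because every suggestion is built from an indexed relationship, each question is guaranteed at least one answer, which is the difference from auto-completion of the typed string.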
EvidenceFinder: a new way to discover
83,717,24 sentences about genes, proteins, diseases & metabolites
2,550,328 documents
How can you tell if an article is relevant to you in your listed search results? Are t
Europe PMC’s EvidenceFinder enriches your literature exploration by
suggesting questions alongside your search results, providing a way to find information
buried in full text articles that is directly relevant to you.
This helps you identify articles and research that you might have overlooked through
direct keyword searching.
http://europepmc.org/
Finding unknown associations
• Need massive amounts of text to find
unknown associations, generate hypotheses
• Must go across collections: silos irrelevant to
researcher
• Must go across disciplines: cognate and
distant – all can shed light
• Information often available in literature many
years before, but unsuspected as not explicitly
written down
Reproducing a finding (reported 11/2011 in Nature Medicine) with FACTA+,
using only MEDLINE records from before that date
http://www.nactem.ac.uk/facta-visualizer/
Info=degree of surprise
SGK1 gene, enzyme and symptom:
high level of enzyme = infertile
low level = miscarriage
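A common way to look for associations that are "not explicitly written down" is to connect two concepts that never co-occur directly but share intermediate concepts. The sketch below shows that general idea on toy data. It is not a description of how FACTA+ works, and the concept names and documents are placeholders invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(docs):
    """Map each concept to the set of concepts it shares at least one document with."""
    neighbours = defaultdict(set)
    for concepts in docs:
        for a, b in combinations(sorted(set(concepts)), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def indirect_associations(start, docs):
    """Rank concepts never seen with `start` by how many intermediates link them to it."""
    neighbours = cooccurrence(docs)
    direct = neighbours.get(start, set())
    candidates = defaultdict(set)
    for bridge in direct:
        for target in neighbours[bridge]:
            if target != start and target not in direct:
                candidates[target].add(bridge)
    ranked = [(target, sorted(bridges)) for target, bridges in candidates.items()]
    return sorted(ranked, key=lambda item: (-len(item[1]), item[0]))


# Each toy "document" is the set of concepts recognised in it (placeholders only).
DOCS = [
    {"gene_X", "enzyme_Y"},
    {"enzyme_Y", "symptom_Z"},
    {"gene_X", "process_P"},
    {"process_P", "symptom_Z"},
    {"process_P", "symptom_W"},
]

if __name__ == "__main__":
    for target, bridges in indirect_associations("gene_X", DOCS):
        print(target, "via", bridges)
```

Here `symptom_Z` ranks first because two separate intermediates connect it to `gene_X`, even though no single document mentions both. Ranking by the number of bridges is only one possible scoring choice.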
Building models
• In many domains, build models to understand
relationships and processes
• Rely on literature to provide evidence
• Slow, laborious work
• Example: reconstruction of biological
pathways
600 papers were read to construct the pathway:
• Nodes: 652
• Links: 444
• “inevitable gaps” due to manual methods
Oda & Kitano (2006) in Mol Syst Biol
Mapping reactions and text: PathText
Link to text mining results (green icon)
www.nactem.ac.uk
Building models based on textual evidence
1. The mitotic arrest-deficient protein Mad1 forms a complex with Mad2, which is required for imposing mitotic arrest on cells in which the spindle assembly is perturbed. PMID: 18981471
2. Mad1, an upstream regulator of Mad2, forms a tight core complex with Mad2 and facilitates Mad2 binding to Cdc20. PMID: 18318601
Systematic reviews, etc.
• Systematic reviews, evidence-based public health
reviews
– Balanced reviews to aid policy, guideline, best practice
development
• Trade-offs: cost, time available, number of hits to
screen/retain, number of full texts to read
– May miss relevant items
• EBPH reviews: complex questions, exploration of scope
required
• Even basic TM can save 75% of manual effort (EPPI-Centre, IoE)
• Use of TM to identify, rank, cluster most relevant items
• NaCTeM & Univ Liverpool currently working with NICE
on supporting EBPH reviewers
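Ranking candidate items so that reviewers screen the most promising ones first can be illustrated with a plain TF-IDF and cosine-similarity sketch. This is a generic baseline chosen for the example, not the method used by any of the groups named above; the function names and the three example titles are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector per document, plus the IDF table."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts], idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for _score, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Invented titles, for illustration only.
TITLES = [
    "School meals and child obesity prevention",
    "Road traffic noise and sleep",
    "Obesity prevention programmes in schools",
]

if __name__ == "__main__":
    print(rank("obesity prevention", TITLES))
```

Items that share no vocabulary with the review question fall to the bottom of the list, which is where the saving in screening effort would come from.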
Interoperability and evaluation
• TM involves many processes and resources
• May be no need to customise, just to select from
repositories of available tools and resources
• But tools and resources often incompatible at
linguistic/semantic levels
• Difficult to mix and match, to find best
combination for task at hand
• Hence drive towards interoperability to enable
users to get best out of TM
Importance of evaluating tools
Training data / Test data   AIMed   GENETAG   GENIA GGP   PennBioIE   PIR
AIMed                       89.5    38.5      63.3        40.8        54.7
GENETAG                     58.4    75.2      43.1        31.3        56.0
GENIA GGP                   66.3    31.0      90.7        34.1        42.6
PennBioIE                   65.9    41.2      55.4        84.1        54.0
PIR                         54.3    42.0      49.0        37.0        83.6
A tool can show different results when trained on
one corpus and tested on another, compared to
training and testing on same corpus
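The gap between same-corpus and cross-corpus results can be quantified directly from the table. The sketch below copies the scores as listed and compares the mean of the diagonal (same corpus for training and testing) with the mean of all other cells; the function name is an illustrative choice.

```python
CORPORA = ["AIMed", "GENETAG", "GENIA GGP", "PennBioIE", "PIR"]

# Scores copied from the table above, one row per corpus in CORPORA order.
SCORES = [
    [89.5, 38.5, 63.3, 40.8, 54.7],
    [58.4, 75.2, 43.1, 31.3, 56.0],
    [66.3, 31.0, 90.7, 34.1, 42.6],
    [65.9, 41.2, 55.4, 84.1, 54.0],
    [54.3, 42.0, 49.0, 37.0, 83.6],
]


def same_vs_cross(scores):
    """Return (mean of diagonal cells, mean of off-diagonal cells)."""
    n = len(scores)
    same = [scores[i][i] for i in range(n)]
    cross = [scores[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(same) / len(same), sum(cross) / len(cross)


if __name__ == "__main__":
    same_mean, cross_mean = same_vs_cross(SCORES)
    print(f"same corpus:  {same_mean:.2f}")
    print(f"cross corpus: {cross_mean:.2f}")
```

On these figures the same-corpus mean is about 84.6 and the cross-corpus mean about 47.9. The comparison does not depend on which axis holds training data, since only the diagonal is treated specially.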
Text mining workflows:
Rapid TM development, interoperability, common data representation, sharable type system, evaluation
• Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.
• Rak, R., Rowley, A., Black, W. J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
U-Compare: Evaluate and Compare TM
Workflows
[Figure: components from a library (Sentence Splitter A/B, POS tagger A/B, NER) are combined into Workflows A, B and C, each evaluated with its own F-score]
Components shown:
– Sentence splitters: UIMA SS, OpenNLP SS, GENIA SS
– Tokenizers: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
– POS taggers: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
– NER: ABNER, MedT-NER, GENIA Tagger as NER
• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA and sharable type system
http://argo.nactem.ac.uk
• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing
Workflow Editor
Open AIRE-COAR Conference
Evaluation of Chemical NER workflows
[Figure: workflow diagram, with callouts]
• Supplies gold standard corpus
• Removes gold annotations so that they can be created automatically
• Combinations of syntactic and semantic components create annotations
• Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
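The comparison step can be sketched as follows: each workflow branch produces a set of annotations, and each set is scored against the gold standard by precision, recall and F1. The sketch assumes exact-match scoring on (start, end, label) spans, which is one common convention. The annotation spans and branch names are invented, and nothing here reflects the actual Argo or U-Compare APIs.

```python
def prf1(gold, predicted):
    """Exact-match precision, recall and F1 between two sets of annotations."""
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented annotations as (start offset, end offset, label).
GOLD = {(0, 7, "CHEMICAL"), (15, 22, "CHEMICAL"), (30, 38, "CHEMICAL"), (40, 45, "CHEMICAL")}

BRANCHES = {
    # Three correct spans plus one spurious span.
    "branch_a": {(0, 7, "CHEMICAL"), (15, 22, "CHEMICAL"), (30, 38, "CHEMICAL"), (50, 55, "CHEMICAL")},
    # Two correct spans and nothing else.
    "branch_b": {(0, 7, "CHEMICAL"), (15, 22, "CHEMICAL")},
}


def compare(gold, branches):
    """Score every branch against the same gold standard."""
    return {name: prf1(gold, predicted) for name, predicted in branches.items()}


if __name__ == "__main__":
    for name, (p, r, f) in compare(GOLD, BRANCHES).items():
        print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Scoring every branch against one shared gold standard is what makes the branches comparable: `branch_b` is more precise, but `branch_a` finds more of the gold annotations and ends up with the higher F1.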
The C change in TM in the UK
• 1/7/2014: Copyright exception for text and data mining
for non-commercial purposes
• 1/10/2014: Copyright exception for quotation
• If you have lawful access to any text, you can now
– Copy it for non-commercial text mining purposes
– Display/communicate results (e.g., annotations,
associations) of TM to others
– Illustrate results with snippets from text (quotations)
• None of this can be overridden by contract (licence,
Ts&Cs)
• https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375954/Research.pdf
Current state in the EU
• Copyright and licensing in relation to TM is a
hot topic
• “The right to read is the right to mine” (Open
Knowledge Foundation)
• Hope on the horizon:
– EC President Jean-Claude Juncker to take steps
within his first 6 months to modernise copyright
rules “in light of digital revolution and changed
consumer behaviour”
Take home messages
• Text mining can be applied in any domain and
for many tasks
• In text mining, no one size fits all
– Text miners and users must work closely together
• Content (at least in UK) can be mined on a
massive scale for non-commercial purposes
– but even a modest collection can benefit from text
mining
• Who is your text mining champion?
Contact and Acknowledgements
• www.nactem.ac.uk
• Funders and sponsors: MRC, AHRC, JISC,
BBSRC, ESRC, NIH, DARPA, Europe PubMed
Central funders (Wellcome Trust + 25 funders),
NHS, European Commission
• Previous funding from: AstraZeneca, Pfizer,
Elsevier, Nature Publishing Group, BBC