Are you ready for the golden age of text mining?

John McNaught
Deputy Director, National Centre for Text Mining
University of Manchester
John.McNaught@manchester.ac.uk

Overview
• Text mining in a nutshell
• Enriching content, enhancing search, enabling discovery, reducing costs
• Interoperability and evaluation
• The C change

McNaught London Info International

How do we (humans) discover?
• Find, read, learn, analyse a lot
• Ask “What if…?”
• Construct hypotheses, test them
  – Explore many avenues, associations
• Work collaboratively
• Share results and data with others
  – Reproducibility, validation
• Integrate heterogeneous data/information/knowledge
• (vs. serendipity: discovery by lucky accident)

Barriers to discovery
• Find: document oriented, too many hits
• Read: too much to read, even if we find relevant hits
• Learn: growth is too fast to keep up with, or to know most things
• Analyse: duplication of effort, many new results to document
• Construct hypotheses: hard; can’t tell which are most promising, or whether any have been missed
• Share: primary vehicles are documents and curated databases (massive curation backlog)
• Integrate: the document is often the key, but it is hard to link into different worlds of data, information, knowledge

How does TM aid discovery?
• Find: more precise, relevant information, within and across documents
• Read: much faster than a human
• Learn: extracts, packages, links, synthesises, summarises, reduces burden
• Analyse: recognises duplication; clusters, classifies, drives semantic author aids
• Construct hypotheses: rapidly finds and ranks unknown associations for testing
• Share: reduces curation effort, complements and validates databases
• Integrate: links documents deeply into worlds of data, information and knowledge

Text mining in a nutshell
[Figure labels: Other data; Applications; Semantic search; Data mining]

Increased sophistication?
Increased customisation!
[Figure labels]
• Units of analysis: Words, Terms, Entities, Relations, Events, Associations
• Techniques: Wordform co-occurrence, pattern matching, …; Term recognition and normalisation; Named entity recognition; Relation extraction; Event extraction; Metaknowledge extraction; Data mining, clustering
• Questions supported:
  – Keyword search
  – What is this paper about?
  – What is known about this disease, protein, person?
  – What is linked with X?
  – {Who, what} Xed {whom, what} where, when and how?
  – Is X possible, certain, probable, suggested, past, to come?
  – What if…?

A complex space
• Text types: scientific articles (full papers/abstracts), social media, patents, clinical records, EMR, books, theses, reports, newswire, …
• Technology: tokenizers, sentence splitters, paragraph splitters, NP chunkers, syntactic parsers, semantic parsers, NE recognizers, relation extractors, event extractors, …
• Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
• Tasks: translation, information extraction, semantic search, question answering, sentiment analysis, summarization, knowledge discovery, database curation, systematic reviewing, pathway reconstruction, …
Diversity of Contexts
• Resources (mono- and multilingual): gazetteers, annotated corpora, lexicons, terminologies, wordnets, thesauri, ontologies, grammars, …
• Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Arabic, Urdu, Japanese, Korean, …
Diversity of Languages and Language Resources (including temporal diversity)
Diversity of Applications

Europe’s Languages and Language Technology support
http://www.meta-net.eu
[Figure: languages arranged between “good support through Language Technology” and “weak or no support” (no ‘excellent’ support). Groups as they appear in the figure:]
• English
• Dutch, French, German, Italian, Spanish
• Catalan, Czech, Finnish, Hungarian, Polish, Portuguese, Swedish
• Basque, Bulgarian, Danish, Galician, Greek, Norwegian, Romanian, Slovak, Slovene
• Croatian, Estonian, Icelandic, Irish, Latvian, Lithuanian, Maltese, Serbian

Enhancing historical collections
• If you have a domain collection going back centuries
  – How easy is it for users to find answers to research questions?
• Language evolves, terms come and go, concepts drift, …
• TM can enhance collections in many ways
  – Handling temporal aspects of language is key
  – Enabling event-based semantic search

Looking into the past
• Semantic search for historians of medicine
  – Treatment and prevention of diseases over time
  – Medical and public health perspectives
• British Medical Journal archive (from 1840)
  – Around 350K articles
• London Medical Officer of Health reports (1848-1972) (Wellcome Library)
  – Around 5,000 reports from different boroughs

• In historical collections, the same concept is expressed by different terms across different time periods
• Users miss information due to unfamiliar terminology
• TM to extract/link diachronic synonyms and organize them in a thesaurus
• Use the diachronic thesaurus for time-sensitive search

(A mock-up for user feedback)
• Traditional search: user searches for “pulmonary tuberculosis” but doesn’t know the historical synonym “pulmonary phthisis”
• System automatically suggests related terms
• User expands query
• Narrow down results according to faceted search (facets derived both from document metadata and from text mining)
• Distribution of “pulmonary tuberculosis” and “pulmonary phthisis” across time

Analysing events of interest to historians
Type: Affect
  Description: An entity or event is affected, infected, changed or transformed, possibly by another entity or event
  Participants:
    Cause: of the affection
    Target: Entity or event affected
    Subject: Medical subject affected
Type: Cause
  Description: An entity or event results in manifestation of another entity or event
  Participants:
    Cause: of the event
    Result: Resulting entity or event
    Subject: Medical subject affected

Classic case of working together
• End user (typically) not a text miner
• Text miner (typically) not a domain expert
• Requirements and evaluation: a challenge for both
• Need to work together to understand
  – How TM can help, what it can and cannot do
  – What questions are of interest
  – What role the human has
  – What outcomes are desirable
  – What existing resources can be exploited

Mining Biodiversity
http://miningbiodiversity.org
Aim
• Transform the Biodiversity Heritage Library into a next-generation social digital library
• 130,000 volumes of digitised legacy literature
A multi-disciplinary approach
1. Text Mining
2. Machine learning
3. Data visualisation
4. History of Science
5. Environmental History & Studies
6. Library and Information Science
7. Social Media

Mining Biodiversity: semantic metadata extraction to support search
[Figure labels: Observation; Habitation; Nutrition]

Finding evidence
• Event extraction can drive semantic search, as we’ve seen. We can go a step further…
• Example: application for Europe PubMed Central
• Deeply analyse documents
• Index relationships
• Key off the search term to dynamically generate, from the indexed relationships, questions that have known answers
  – Not auto-completion

EvidenceFinder: a new way to discover
• 83,717,24 sentences about genes, proteins, diseases & metabolites
• 2,550,328 documents
How can you tell if an article is relevant to you in your listed search results?
Are t
Europe PMC’s EvidenceFinder enriches your literature exploration by suggesting questions alongside your search results, providing a way to find information buried in full text articles that is directly relevant to you. This helps you identify articles and research that you might have overlooked through direct key word searching.
http://europepmc.org/

Finding unknown associations
• Need massive amounts of text to find unknown associations, generate hypotheses
• Must go across collections: silos irrelevant to researcher
• Must go across disciplines: cognate and distant – all can shed light
• Information often available in literature many years before, but unsuspected as not explicitly written down

Reproducing a finding (reported 11/2011 in Nature Medicine) with FACTA+, using MEDLINE prior to date
http://www.nactem.ac.uk/facta-visualizer/
• Info = degree of surprise
• SGK1 gene, enzyme and symptom: high level of enzyme = infertile; low level = miscarriage

Building models
• In many domains, build models to understand relationships and processes
• Rely on literature to provide evidence
• Slow, laborious work
• Example: reconstruction of biological pathways

• 600 papers were read to construct the pathway
• Nodes: 652
• Links: 444
• “inevitable gaps” due to manual methods
Oda & Kitano (2006) in Mol Syst Biol

Mapping reactions and text: PathText
• Link to text mining results (green icon)
www.nactem.ac.uk

Building models based on textual evidence
1. The mitotic arrest-deficient protein Mad1 forms a complex with Mad2, which is required for imposing mitotic arrest on cells in which the spindle assembly is perturbed. PMID: 18981471
2. Mad1, an upstream regulator of Mad2, forms a tight core complex with Mad2 and facilitates Mad2 binding to Cdc20. PMID: 18318601

Systematic reviews, etc.
• Systematic reviews, evidence-based public health (EBPH) reviews
  – Balanced reviews to aid policy, guideline, best practice development
• Trade-offs: cost, time available, number of hits to screen/retain, number of full texts to read
  – May miss relevant items
• EBPH reviews: complex questions, exploration of scope required
• Even basic TM can save 75% of manual effort (EPPICentre, IoE)
• Use of TM to identify, rank, cluster most relevant items
• NaCTeM & Univ Liverpool currently working with NICE on supporting EBPH reviewers

Interoperability and evaluation
• TM involves many processes and resources
• There may be no need to customise, just to select from repositories of available tools and resources
• But tools and resources are often incompatible at linguistic/semantic levels
• Difficult to mix and match, to find the best combination for the task at hand
• Hence the drive towards interoperability, to enable users to get the best out of TM

Importance of evaluating tools
Training data \ Test data   AIMed   GENETAG   GENIA GGP   PennBioIE   PIR
AIMed                       89.5    38.5      63.3        40.8        54.7
GENETAG                     58.4    75.2      43.1        31.3        56.0
GENIA GGP                   66.3    31.0      90.7        34.1        42.6
PennBioIE                   65.9    41.2      55.4        84.1        54.0
PIR                         54.3    42.0      49.0        37.0        83.6
A tool can show different results when trained on one corpus and tested on another, compared to training and testing on the same corpus

Text mining workflows: rapid TM development, interoperability, common data representation, sharable type system, evaluation
• Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011) U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.
• Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012) Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.

U-Compare: Evaluate and Compare TM Workflows
[Figure: a library of components (Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER) is combined into Workflow A, Workflow B and Workflow C, which yield F-Score A, F-Score B and F-Score C. Components shown:]
• Sentence splitters: UIMA SS, OpenNLP SS, GENIA SS
• Tokenizers: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
• Taggers: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
• NER: ABNER, MedT-NER, GENIA Tagger as NER

• Integrated TM/NLP processing system
• GUI for workflow creation
• Library of ready-to-use processing components
• Statistics, visualizations, developer APIs
• Supports UIMA and sharable type system

http://argo.nactem.ac.uk
• Web-based application
• Interactive creation of workflows
• Cloud and high-performance computing

Workflow Editor
Open AIRE-COAR Conference

Evaluation of Chemical NER workflows
• Supplies gold standard corpus
• Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
• Removes gold annotations so that they can be created automatically
• Combinations of syntactic and semantic components create annotations

The C change in TM in the UK
• 1/7/2014: Copyright exception for text and data mining for non-commercial purposes
• 1/10/2014: Copyright exception for quotation
• If you have lawful access to any text, you can now
  – Copy it for non-commercial text mining purposes
  – Display/communicate results (e.g., annotations, associations) of TM to others
  – Illustrate results with snippets from text (quotations)
• None of this can be overridden by contract (licence, Ts&Cs)
• https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375954/Research.pdf

Current state in the EU
• Copyright and licensing in relation to TM is a hot topic
• “The right to read is the right to mine” (Open Knowledge Foundation)
• Hope on the horizon:
  – EC President Jean-Claude Juncker to take steps within his first 6 months to modernise copyright rules “in light of digital revolution and changed consumer behaviour”

Take home messages
• Text mining can be applied in any domain and for many tasks
• In text mining, no one size fits all
  – Text miners and users must work closely together
• Content (at least in the UK) can be mined on a massive scale for non-commercial purposes
  – but even a modest collection can benefit from text mining
• Who is your text mining champion?

Contact and Acknowledgements
• www.nactem.ac.uk
• Funders and sponsors: MRC, AHRC, JISC, BBSRC, ESRC, NIH, DARPA, Europe PubMed Central funders (Wellcome Trust + 25 funders), NHS, European Commission
• Previous funding from: AstraZeneca, Pfizer, Elsevier, Nature Publishing Group, BBC