Developing a Semantic Search Application: A Pharma Case Study
Tom Reamy, Chief Knowledge Architect, KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Taxonomy Boot Camp: Washington DC, 2013

KAPS Group: General
Knowledge Architecture Professional Services – Network of Consultants
Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics
Strategy – IM & KM: Text Analytics, Social Media, Integration
Services:
  – Taxonomy/Text Analytics development, consulting, customization
  – Text Analytics Fast Start – Audit, Evaluation, Pilot
  – Social Media: text-based applications – design & development
Clients:
  – Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc.
Applied Theory – faceted taxonomies, complexity theory, natural categories, emotion taxonomies
Presentations, Articles, White Papers – http://www.kapsgroup.com

Project Agile Methodology
Goal – evaluate semantic technologies' ability to:
  – Replace manual annotation of scientific documents – automated or semi-automated
  – Discover new entities and relationships
  – Provide users with self-service capabilities
Goal – gauge feasibility and effort level

Components – Technology, Resources
Cambridge Semantics, Linguamatics, SAS Enterprise Content Categorization
  – Initial integration – passing results as XML
Content – scientific journal articles
Taxonomy – MeSH – a small, selected subset
Access to a "customer" – critical for success

Three Rounds – Iterations
Visualization – faceted search; sort by date, author, journal – Cambridge Semantics
Round 1 – PDF from their database
  – Needed to create additional structure and metadata
  – There is no such thing as unstructured content
Rounds 2 & 3 – XML with full metadata from PubMed
Entity Recognition – Species, Document Type, Study Type, Drug Names, Disease Names, Adverse Events

Components & Approach
Rules or sample documents?
  – Need more precision and granularity than sample documents can deliver
  – Training sets – not as easy as they sound
First rules – text indicators to define sections of the document
  – Objectives, Abstract, Purpose, Aim – all the "same" section
Separate the logic of the rules from the text – stable rules, changing text (see the sketch after this slide)
Scores – relevancy with thresholds
  – Not just frequency of words
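To make the last two points concrete, here is a minimal sketch in Python (not the SAS Enterprise Content Categorization rule syntax the project actually used) of separating rule logic from text and of scoring with thresholds: the scoring function is the stable logic, the term lists are the changing text, and the category fires only when the weighted evidence clears a threshold. The term lists, weights, and threshold are illustrative assumptions, not the project's rules.

# A sketch only: stable rule logic on top, swappable term lists below.
# Lists, weights, and threshold are illustrative assumptions, not the project's rules.
SECTION_INDICATORS = {"objectives", "abstract", "purpose", "aim"}  # all treated as the "same" section
TRIGGER_TERMS = {"clinical trial", "humans"}

def relevancy_score(text, section_indicators, trigger_terms,
                    section_weight=2.0, trigger_weight=1.0):
    """Weighted evidence from two kinds of indicators – not just raw word frequency."""
    text_lower = text.lower()
    score = section_weight * sum(1 for s in section_indicators if s in text_lower)
    score += trigger_weight * sum(1 for t in trigger_terms if t in text_lower)
    return score

def categorize(text, threshold=3.0):
    """The category fires only when the relevancy score clears the threshold."""
    return relevancy_score(text, SECTION_INDICATORS, TRIGGER_TERMS) >= threshold

# Example: enough section and trigger evidence pushes the score over the threshold.
print(categorize("Abstract ... Objectives ... a randomized clinical trial in humans"))  # True

When the source text changes, only the term lists need editing; the scoring logic stays put, which is what keeps maintenance effort reasonable.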
Document Type Rules
Clinical Trial rule (SAS Enterprise Content Categorization syntax, excerpt):
(START_2000,
  (AND,
    (OR, _/article:"[Abstract]", _/article:"[Methods]", _/article:"[Objective]",
         _/article:"[Results]", _/article:"[Discussion]"),
    (OR, _/article:"clinical trial*", _/article:"humans",
      (NOT, (DIST_5,
        (OR, _/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"),
In plain language: if the article has sections like Abstract or Methods, AND has phrases around "clinical trial" / "humans", AND does not have words like "animals" within 5 words of the "clinical trial" words – count the match and add it to a relevancy score.

Rules for Drug Names and Diseases
Primary issue – major mentions, not every mention
  – Combination of noun phrase extraction and categorization
  – Results – virtually 100%
Taxonomy of drug names and diseases
Capture general diseases like thrombosis as well as specific types like deep vein, cerebral, and cardiac
Combine text about arthritis and its synonyms with text like "Journal of Rheumatology"

Rules for Drug Names and Diseases
(OR,
  _/article/title:"[clonidine]",
  (AND, _/article/mesh:"[clonidine]", _/article/abstract:"[clonidine]"),
  (MINOC_2, _/article/abstract:"[clonidine]"),
  (START_500, (MINOC_2, "[clonidine]")))
Means:
  – Any variation of the drug name in the title – high score
  – Any variation in the MeSH keywords AND in the abstract – high score
  – Any variation in the abstract at least 2x – good score
  – Any variation in the first 500 words at least 2x – suspect

Rules for Drug Names and Diseases
Results:
  – Wide range by entity type – 70-100% recall and precision
  – Focus mostly on precision – recall is difficult to test
  – One deep-dive area indicated that 90%+ scores for both precision and recall could be reached with a moderate level of effort
  – Effort is not linear – 30% accuracy does not mean one-third done

Iteration 3
Complete treatment of the disease state:
  – Indication (the disease you want to treat)
  – Concomitant diseases
  – Adverse or side effects
Use XML metadata – some variant of "adverse"
Flag any combination of words associated with a disease (e.g., depression) with any of the words that indicate an adverse event or effect (see the sketch in the appendix at the end of the deck)

Conclusion
Project was a success!
  – Useful results – as defined by the customer
  – Reasonable, doable effort level – both for initial development and for maintenance
Essential success factors:
  – Rules, not documents/training sets (training sets are only a starting point)
  – A full platform – disambiguation of noun phrase extraction, major vs. minor mentions
  – Separation of logic and text
Semantic search works – if you do it smart!

Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group: Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com – March 17-19, San Francisco
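Appendix – Sketch: Combining Disease and Adverse-Event Terms
A minimal sketch in Python (again, not the project's SAS rule syntax) of the Iteration 3 idea: flag a passage when any disease term co-occurs with any adverse-event indicator. The term lists and the sentence-level window are illustrative assumptions, not the project's taxonomies or rules.

import re

# Illustrative term lists – assumptions for the sketch, not the project's vocabularies.
DISEASE_TERMS = {"depression", "thrombosis", "arthritis"}
ADVERSE_TERMS = {"adverse", "side effect", "side effects", "toxicity"}

def sentences(text):
    """Crude sentence splitter, good enough for a sketch."""
    return re.split(r"(?<=[.!?])\s+", text)

def adverse_event_candidates(text):
    """Return sentences where a disease term co-occurs with an adverse-event indicator."""
    hits = []
    for sent in sentences(text):
        s = sent.lower()
        if any(d in s for d in DISEASE_TERMS) and any(a in s for a in ADVERSE_TERMS):
            hits.append(sent)
    return hits

# Example
print(adverse_event_candidates(
    "Depression was reported as an adverse event in 4% of patients. The drug was well tolerated."))

In the project itself, the disease vocabulary came from the MeSH subset and the adverse-event cue was "some variant of adverse" in the article XML metadata, rather than the ad hoc lists above.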