Developing a Semantic Search Application

A Pharma Case Study
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Taxonomy Boot Camp: Washington DC, 2013
KAPS Group: General
 Knowledge Architecture Professional Services – Network of Consultants
 Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching
– Attensity, Clarabridge, Lexalytics
 Strategy – IM & KM - Text Analytics, Social Media, Integration
 Services:
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Fast Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
 Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times,
Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, etc.
 Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
Presentations, Articles, White Papers – http://www.kapsgroup.com
Project
 Agile Methodology
 Goal – evaluate semantic technologies' ability to:
– Replace manual annotation of scientific documents –
automated or semi-automated
– Discover new entities and relationships
– Provide users with self-service capabilities
 Goal – feasibility and effort level
Components – Technology, Resources
 Cambridge Semantics, Linguamatics, SAS Enterprise Content
Categorization
– Initial integration – passing results as XML
 Content – scientific journal articles
 Taxonomy – MeSH – a small, selected subset
 Access to a “customer” – critical for success
Three rounds - Iterations
 Visualization – faceted search, sort by date, author, journal
– Cambridge Semantics
 Round 1 – PDF from their database
– Needed to create additional structure and metadata
– No such thing as unstructured content
 Round 2 & 3 – XML with full metadata from PubMed
 Entity Recognition – Species, Document Type, Study Type, Drug
Names, Disease Names, Adverse Events
Components & Approach
 Rules or sample documents?
– Need more precision and granularity than sample documents can provide
– Training sets – not as easy as expected
 First Rules – text indicators to define sections of the document
– Objectives, Abstract, Purpose, Aim – all the “same” section
 Separate logic of the rules from the text
– Stable rules, changing text
 Scores – relevancy with thresholds
– Not just frequency of words
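The idea of text indicators that define sections, with the rule logic kept separate from the text it matches, can be sketched in Python. This is only an illustration under stated assumptions: the indicator lists, the function names, and the "one heading per line" layout are hypothetical and are not the project's actual rules or the vendor's rule language.

```python
import re

# The changing "text": heading variants mapped to one canonical section.
# These lists are illustrative assumptions, not the project's vocabulary.
SECTION_INDICATORS = {
    "objective": ["objective", "objectives", "abstract", "purpose", "aim", "aims"],
    "methods": ["methods", "materials and methods"],
    "results": ["results", "findings"],
}

def build_heading_pattern(indicators):
    """Stable logic: compile one heading pattern from whatever lists are supplied."""
    lookup = {variant: canon
              for canon, variants in indicators.items()
              for variant in variants}
    alternation = "|".join(re.escape(v) for v in sorted(lookup, key=len, reverse=True))
    pattern = re.compile(rf"^[ \t]*({alternation})[ \t]*:?[ \t]*$",
                         re.IGNORECASE | re.MULTILINE)
    return pattern, lookup

def split_sections(text, indicators=SECTION_INDICATORS):
    """Return {canonical_section: text}, merging headings that mean the same thing."""
    pattern, lookup = build_heading_pattern(indicators)
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        canon = lookup[match.group(1).lower()]
        body = text[match.end():end].strip()
        sections[canon] = (sections.get(canon, "") + " " + body).strip()
    return sections

sample = "Aim\nTo test drug X.\nMethods\nRandomized trial.\nResults:\nIt worked.\n"
print(split_sections(sample))
```

Because the indicator lists are passed in as data, new heading variants can be added without touching `split_sections`, which is the point of "stable rules, changing text".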
Document Type Rules
 (START_2000, (AND, (OR, _/article:"[Abstract]",
_/article:"[Methods]", _/article:"[Objective]",
_/article:"[Results]", _/article:"[Discussion]", (OR,
_/article:"clinical trial*", _/article:"humans",
(NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe",
_/article:"use", _/article:"animals"),
 Clinical Trial Rule:
 If the article has sections like Abstract or Methods
 AND has phrases around “clinical trials / Humans” and not words
like “animals” within 5 words of “clinical trial” words – count it and
add up a relevancy score
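The plain-language reading of the Clinical Trial rule can be approximated in Python. This is a rough sketch, not the categorization engine's behavior: it assumes START_2000 restricts matching to the first 2000 words and DIST_5 means "within five words", and the term lists, tokenizer, and function names are hypothetical.

```python
import re

# Term lists taken from the rule on the slide; the matching logic is an assumption.
SECTION_MARKERS = ("abstract", "methods", "objective", "results", "discussion")
TRIAL_TERMS = ("clinical trial", "clinical trials", "humans")
EXCLUDE_TERMS = ("approved", "safe", "use", "animals")

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def find_phrase(tokens, phrase):
    """Return the token indices where a (possibly multi-word) phrase starts."""
    words = phrase.split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def clinical_trial_score(text, window=2000, distance=5):
    """Count trial terms that have no excluded word within `distance` tokens."""
    tokens = tokenize(text)[:window]  # rough analogue of START_2000
    if not any(find_phrase(tokens, marker) for marker in SECTION_MARKERS):
        return 0  # no recognizable section, so the rule does not fire
    excluded = [i for term in EXCLUDE_TERMS for i in find_phrase(tokens, term)]
    score = 0
    for term in TRIAL_TERMS:
        for pos in find_phrase(tokens, term):
            if not any(abs(pos - e) <= distance for e in excluded):
                score += 1  # each clean hit adds to the relevancy score
    return score

human_study = ("Abstract This clinical trial enrolled humans with chronic pain. "
               "Methods Patients were randomized.")
animal_study = "Abstract The compound was tested in animals before any clinical trial."
print(clinical_trial_score(human_study), clinical_trial_score(animal_study))
```

A threshold on the returned score would then decide whether the document is tagged as a clinical trial.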
Rules for Drug Names and Diseases
 Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
 Taxonomy of drug names and diseases
 Capture general diseases like thrombosis and specific types like
deep vein, cerebral, and cardiac
 Combine text about arthritis and synonyms with text like “Journal
of Rheumatology”
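Capturing both a general disease and its specific types can be sketched as a two-level lookup. This is a minimal sketch with a hypothetical taxonomy structure and function name; the real project used a subset of MeSH and the vendor's categorization rules.

```python
import re

# Illustrative two-level taxonomy: general disease -> specific types.
DISEASE_TAXONOMY = {
    "thrombosis": ["deep vein thrombosis", "cerebral thrombosis", "cardiac thrombosis"],
}

def tag_diseases(text, taxonomy=DISEASE_TAXONOMY):
    """Return sorted (general, specific) pairs; specific is None for a bare general mention."""
    lowered = text.lower()
    tags = set()
    for general, specifics in taxonomy.items():
        remaining = lowered
        for specific in specifics:
            if re.search(r"\b" + re.escape(specific) + r"\b", remaining):
                tags.add((general, specific))
                # Blank out the specific phrase so it is not also counted as general.
                remaining = remaining.replace(specific, " ")
        if re.search(r"\b" + re.escape(general) + r"\b", remaining):
            tags.add((general, None))
    return sorted(tags, key=lambda tag: (tag[0], tag[1] or ""))

print(tag_diseases("Deep vein thrombosis was common; thrombosis risk rose."))
```

Contextual evidence such as a journal name could be added as a further signal in the same style, for example by raising the score of an arthritis tag when the source journal matches a list of rheumatology titles.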
Rules for Drug Names and Diseases
(OR, _/article/title:"[clonidine]",
(AND, _/article/mesh:"[clonidine]", _/article/abstract:"[clonidine]"),
(MINOC_2, _/article/abstract:"[clonidine]"),
(START_500, (MINOC_2, "[clonidine]")))
 Meaning:
– Any variation of the drug name in the title – high score
– Any variation in the MeSH keywords AND in the abstract – high score
– Any variation in the abstract at least twice – good score
– Any variation in the first 500 words at least twice – suspect
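The four tiers described above can be expressed as a small Python function. This is a sketch under assumptions: the article is a plain dict with hypothetical field names (`title`, `mesh`, `abstract`, `body`), the variant list is illustrative, and the tier labels simply echo the slide's wording rather than any numeric score the tool produces.

```python
import re

def count_mentions(text, variants):
    """Count whole-word, case-insensitive occurrences of any name variant."""
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def first_words(text, n):
    """Return the first n whitespace-separated words of the text."""
    return " ".join(text.split()[:n])

def major_mention(article, variants):
    """Classify how strongly an article is 'about' a drug, most reliable evidence first."""
    if count_mentions(article.get("title", ""), variants) >= 1:
        return "high"     # any variant in the title
    in_mesh = count_mentions(article.get("mesh", ""), variants) >= 1
    abstract_hits = count_mentions(article.get("abstract", ""), variants)
    if in_mesh and abstract_hits >= 1:
        return "high"     # variant in the MeSH keywords AND in the abstract
    if abstract_hits >= 2:
        return "good"     # at least two mentions in the abstract
    if count_mentions(first_words(article.get("body", ""), 500), variants) >= 2:
        return "suspect"  # at least two mentions in the first 500 words
    return "none"

variants = ["clonidine", "Catapres"]  # illustrative variant list
print(major_mention({"title": "Clonidine in hypertension"}, variants))
```

Ordering the checks from strongest to weakest evidence means each article receives the highest tier it qualifies for.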
Rules for Drug Names and Diseases
 Results:
– Wide range by type – 70–100% recall and precision
 Focus mostly on precision – difficult to test recall
 One deep-dive area indicated that 90%+ scores for both precision
and recall could be reached with a moderate level of effort
 Not linear effort – 30% accuracy does not mean 1/3 done
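For reference, precision and recall for one entity type can be computed as follows. The document IDs are made up for illustration. The sketch also shows why recall is the harder number to test: it needs a complete gold set of every document that should have been tagged, whereas precision only needs the tagged documents to be reviewed.

```python
def precision_recall(predicted, gold):
    """Precision = correct tags / all tags made; recall = correct tags / all tags expected."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical document IDs: four tagged by the rules, five in the reviewed gold set.
tagged = {"d1", "d2", "d3", "d4"}
gold = {"d1", "d2", "d3", "d5", "d6"}
print(precision_recall(tagged, gold))
```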
Iteration 3
 Complete treatment of disease state:
– Indication (disease you want to treat)
– Concomitant disease
– Adverse or side effects
 Use XML metadata – some variant of “adverse”
 Any combination of words associated with a disease (depression)
and any of the words that indicated an adverse event or effect
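The combination described above, disease words together with adverse-event words, can be sketched as a co-occurrence check. The slide does not say how close the two must be, so this sketch assumes they must appear in the same sentence; the term lists and function names are hypothetical.

```python
import re

# Illustrative term lists; a real rule set would draw on the taxonomy and XML metadata.
DISEASE_TERMS = {"depression": ["depression", "depressive", "depressed"]}
ADVERSE_TERMS = ["adverse event", "adverse effect", "adverse reaction", "side effect"]

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def adverse_disease_mentions(text, diseases=DISEASE_TERMS, adverse=ADVERSE_TERMS):
    """Return (disease, sentence) pairs where disease and adverse-event terms co-occur."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if not any(term in lowered for term in adverse):
            continue
        for disease, words in diseases.items():
            if any(word in lowered for word in words):
                hits.append((disease, sentence))
    return hits

report = ("Depression was reported as an adverse event in 4% of patients. "
          "The drug was studied in patients with depression.")
print(adverse_disease_mentions(report))
```

Only the first sentence is returned: the second mentions depression as the indication, not as an adverse event, which is the distinction this iteration was after.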
Conclusion
 Project was a success!
 Useful results – as defined by the customer
 Reasonable and doable effort level – both for initial development
and maintenance
 Essential Success Factors
– Rules not documents, training sets (starting point)
– Full platform for disambiguation of noun phrase extraction,
major-minor mention
– Separation of logic and text
 Semantic Search works!
– If you do it smart!
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com March 17-19, San Francisco