Taxonomy Development Workshop

Applying Semantics to Search
Text Analytics
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Enterprise Search Summit
New York
Agenda
 Introduction – Search, Semantics, Text Analytics
– How do you mean?
 Getting (Re)Started with Text Analytics – 3 ½ steps
 Preliminary: Strategic Vision
– What is text analytics and what can it do?
 Step 1: Self Knowledge – TA Audit
 Step 2: Text Analytics Software Evaluation
 Step 3: POC / Quick Start – Pilot to Development
 Rest of your Life: Refinement, Feedback, Learning
 Conclusions
KAPS Group: General
 Knowledge Architecture Professional Services – Network of Consultants
 Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching
– Attensity, Clarabridge, Lexalytics,
 Strategy – IM & KM - Text Analytics, Social Media, Integration
 Services:
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Fast Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
 Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times,
Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, etc.
 Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
Presentations, Articles, White Papers – http://www.kapsgroup.com
Introduction: Search, Semantics, Text Analytics
What do you mean?
 All Search is (should be) semantic
– Humans search for concepts, not chicken scratches
 Is this semantics?
– NLP, Concept Search, Semantic Web (ontologies)
 Meaning in Text
– Text Analytics – categorization
– Extraction – noun phrase, facts-triples
 Meaning from Search Results
– A conversation, not a list of (poorly) ranked documents
What is Text Analytics?
Text Analytics Features
 Noun Phrase Extraction
– Catalogs with variants, rule-based dynamic
– Multiple types, custom classes – entities, concepts, events
– Feeds facets
 Summarization
– Customizable rules, map to different content
 Fact Extraction
– Relationships of entities – people-organizations-activities
– Ontologies – triples, RDF, etc.
 Sentiment Analysis
– Rules & statistical – Objects, products, companies, and phrases
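A minimal sketch of catalog-based extraction feeding facets, to make "catalogs with variants" concrete. The catalog entries, entity classes, and function names are hypothetical and do not come from any vendor's product; a real catalog would be far larger and usually combined with rules.

```python
import re
from collections import Counter

# Hypothetical catalog: canonical name -> (entity class, surface variants).
CATALOG = {
    "International Business Machines": (
        "company", ["IBM", "I.B.M.", "International Business Machines"]),
    "Food and Drug Administration": (
        "agency", ["FDA", "Food and Drug Administration"]),
}


def build_patterns(catalog):
    """Compile one regex per catalog entry that matches any of its variants."""
    patterns = []
    for canonical, (entity_class, variants) in catalog.items():
        # Try longer variants first so a full name is preferred over an abbreviation.
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        # Lookarounds keep "FDA" from matching inside a longer word.
        regex = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
        patterns.append((entity_class, canonical, regex))
    return patterns


def extract_entities(text, patterns):
    """Return the set of (entity class, canonical name) pairs found in text."""
    return {(cls, name) for cls, name, regex in patterns if regex.search(text)}


def facet_counts(documents, patterns):
    """Count how many documents mention each entity (the numbers a facet shows)."""
    counts = Counter()
    for text in documents:
        counts.update(extract_entities(text, patterns))
    return counts


if __name__ == "__main__":
    docs = ["IBM and the FDA met.", "I.B.M. announced results.", "Nothing here."]
    for (cls, name), n in facet_counts(docs, build_patterns(CATALOG)).items():
        print(cls, name, n)
```

Mapping every variant to one canonical name is what lets a facet show a single entry per entity instead of one per spelling.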
What is Text Analytics?
Text Analytics Features
 Auto-categorization
– Training sets – Bayesian, Vector space
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (Title, body, url)
– Semantic Network – Predefined relationships, sets of rules
– Boolean – Full search syntax – AND, OR, NOT
– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
 This is the most difficult to develop
 Build on a Taxonomy
 Combine with Extraction
– If any of list of entities and other words
– Disambiguation - Ford
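A rough sketch of how Boolean rules with proximity operators can disambiguate a term such as "Ford". The `dist` and `orddist` helpers are simplified stand-ins for the DIST and ORDDIST operators named above; their exact semantics in any given product are an assumption here, and the context word lists and category names are hypothetical.

```python
import re

# Hypothetical context vocabularies for two senses of "ford".
AUTO_WORDS = {"motor", "truck", "dealer", "mustang"}
PERSON_WORDS = {"president", "gerald", "harrison", "actor"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens, term):
    """Return every index at which term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def dist(tokens, term_a, term_b, max_gap):
    """True if the two terms occur within max_gap tokens, in either order."""
    return any(abs(i - j) <= max_gap
               for i in positions(tokens, term_a)
               for j in positions(tokens, term_b))


def orddist(tokens, term_a, term_b, max_gap):
    """Ordered variant: term_a must appear before term_b, within max_gap tokens."""
    return any(0 < j - i <= max_gap
               for i in positions(tokens, term_a)
               for j in positions(tokens, term_b))


def categorize_ford(text, max_gap=5):
    """Assign a category to a mention of 'ford' from nearby context words."""
    tokens = tokenize(text)
    if "ford" not in tokens:
        return None
    near_auto = any(dist(tokens, "ford", w, max_gap) for w in AUTO_WORDS)
    near_person = any(dist(tokens, "ford", w, max_gap) for w in PERSON_WORDS)
    # Boolean combination: one sense present AND NOT the other.
    if near_auto and not near_person:
        return "Automotive"
    if near_person and not near_auto:
        return "People"
    return "Ambiguous"


if __name__ == "__main__":
    print(categorize_ford("Ford recalled the truck after dealer complaints."))
    print(categorize_ford("President Gerald Ford signed the bill."))
```

Even this toy rule needs several rounds of tuning of the gap and the word lists, which is one reason this style of rule is the hardest to develop.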



Case Study – Categorization & Sentiment
Preliminary: Text Analytics Vision
What can Text Analytics Do?
 Strategic Questions – why, what value from the text analytics,
how are you going to use it
– Platform or Applications?
 What are the basic capabilities of Text Analytics?
 What can Text Analytics do for Search?
– After 10 years of failure – get search to work?
 What can you do with smart search based applications?
– RM, PII, Social
 ROI for effective search – difficulty of believing
– Problems with metadata, taxonomy
Preliminary: Text Analytics Vision
Adding Structure to Unstructured Content
 How do you bridge the gap – taxonomy to documents?
 Tagging documents with taxonomy nodes is tough
– And expensive – central or distributed
 Library staff – experts in categorization, not subject matter
– Too limited, narrow bottleneck
– Often don’t understand business processes and business uses
 Authors – Experts in the subject matter, terrible at categorization
– Intra and Inter inconsistency, “intertwingleness”
– Choosing tags from taxonomy – complex task
– Folksonomy – almost as complex, wildly inconsistent
– Resistance – not their job, cognitively difficult = non-compliance
 Text Analytics is the answer(s)!
Preliminary: Text Analytics Vision
Adding Structure to Unstructured Content
 Text Analytics and Taxonomy Together – Platform
– Text Analytics provides the power to apply the taxonomy
– And metadata of all kinds
– Consistent in every dimension, powerful and economic
 Hybrid Model
– Publish Document -> Text Analytics analysis -> suggestions for
categorization, entities, metadata -> present to author
– Cognitive task is simple -> react to a suggestion instead of selecting
from head or a complex taxonomy
– Feedback – if author overrides -> suggestion for new category
– Facets – Requires a lot of Metadata - Entity Extraction feeds facets
 Hybrid – Automatic is really a spectrum – depends on context
– Automatic – adding structure at search results
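The hybrid model above (publish, analyze, suggest, author reacts, overrides feed back) can be sketched roughly as follows. The term-overlap scoring is a deliberately crude stand-in for real categorization rules, and every name here (`Suggestion`, `FeedbackLog`, `suggest_categories`, `author_review`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    category: str
    score: float


@dataclass
class FeedbackLog:
    # Each entry: (document id, categories that were suggested, author's choice).
    overrides: list = field(default_factory=list)


def suggest_categories(text, rules, top_n=3):
    """Score each category by the share of its trigger terms present in the text."""
    words = set(text.lower().split())
    scored = []
    for category, terms in rules.items():
        hits = len(words & terms)
        if hits:
            scored.append(Suggestion(category, hits / len(terms)))
    # Highest score first; category name breaks ties so output is stable.
    scored.sort(key=lambda s: (-s.score, s.category))
    return scored[:top_n]


def author_review(doc_id, suggestions, author_choice, log):
    """Record the author's decision; an override becomes a candidate for review."""
    suggested = [s.category for s in suggestions]
    if author_choice not in suggested:
        log.overrides.append((doc_id, suggested, author_choice))
    return author_choice


if __name__ == "__main__":
    rules = {"Search": {"search", "query", "ranking"},
             "Taxonomy": {"taxonomy", "facet"}}
    log = FeedbackLog()
    picks = suggest_categories("Improving search ranking with a taxonomy", rules)
    author_review("doc-1", picks, "Search", log)      # accepted suggestion
    author_review("doc-2", picks, "Governance", log)  # override, logged
    print(log.overrides)
```

The author only accepts or rejects a short list; the override log is where candidates for new categories or rule fixes accumulate.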
Step 1: TA Information Audit
Start with Self Knowledge
 Info Problems – what, how severe
 Formal Process - KA audit – content, users, technology, business
and information behaviors, applications - or informal for smaller
organizations
 Contextual interviews, content analysis, surveys, focus groups,
ethnographic studies, Text Mining
 Category modeling – Cognitive Science – how people think
 Natural level categories mapped to communities, activities
• Novices prefer higher levels
• Balance of informativeness and distinctiveness
 Text Analytics Strategy/Model – forms, technology, people
Step 1: TA Information Audit
Start with Self Knowledge
 Ideas – Content and Content Structure
– Map of Content – Tribal language silos
– Structure – articulate and integrate
– Taxonomic resources
 People – Producers & Consumers
– Communities, Users, Central Team
 Activities – Business processes and procedures
– Semantics, information needs and behaviors
– Information Governance Policy
 Technology
– CMS, Search, portals, text analytics
– Applications – BI, CI, Semantic Web, Text Mining
Step 2: TA Evaluation
Varieties of Taxonomy / Text Analytics Software
 Taxonomy Management - extraction
 Full Platform
– SAS, SAP, Smart Logic, Concept Searching, Expert System, IBM,
Linguamatics, GATE
 Embedded – Search or Content Management
– FAST, Autonomy, Endeca, Vivisimo, NLP, etc.
– Interwoven, Documentum, etc.
 Specialty / Ontology (other semantic)
– Sentiment Analysis – Attensity, Lexalytics, Clarabridge, lots more
– Ontology – extraction, plus ontology
Step 2: Text Analytics Evaluation
Different Kind of software evaluation
 Traditional Software Evaluation - Start
– Filter One – Ask Experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
• Feature scorecard – minimum, must have, filter to top 6
– Filter Two – Technology Filter – match to your overall scope and
capabilities – Filter not a focus
– Filter Three – In-Depth Demo – 3-6 vendors
 Reduce to 1-3 vendors
 Vendors have different strengths in multiple environments
– Millions of short, badly typed documents, Build application
– Library of 200-page PDFs, enterprise & public search
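One possible shape for the feature scorecard in Filter One: drop vendors that miss a must-have feature, then rank the rest by weighted optional features and keep the top six. The vendor names, feature labels, and weights below are made up for illustration; the real scorecard would come from your own requirements.

```python
def shortlist(vendors, must_have, weights, limit=6):
    """Filter by must-have features, then rank by weighted optional features."""
    qualified = [v for v in vendors if must_have <= v["features"]]

    def score(vendor):
        # Sum the weights of the optional features this vendor offers.
        return sum(w for feature, w in weights.items()
                   if feature in vendor["features"])

    # Highest score first; vendor name breaks ties so the order is stable.
    ranked = sorted(qualified, key=lambda v: (-score(v), v["name"]))
    return ranked[:limit]


if __name__ == "__main__":
    # Hypothetical vendors and features.
    vendors = [
        {"name": "Vendor A", "features": {"categorization", "extraction", "sentiment"}},
        {"name": "Vendor B", "features": {"categorization", "extraction"}},
        {"name": "Vendor C", "features": {"extraction", "sentiment"}},
    ]
    must_have = {"categorization", "extraction"}
    weights = {"sentiment": 2, "summarization": 1}
    for vendor in shortlist(vendors, must_have, weights):
        print(vendor["name"])
```

The scorecard only narrows the field; as the slide says, it is a filter, not the focus of the evaluation.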
Design of the Text Analytics Selection Team
Traditional Candidates – IT, Business, Library
 IT - Experience with software purchases, needs assessment, budget
– Search/Categorization is unlike other software, deeper look
 Business - understand business, focus on business value
 They can get executive sponsorship, support, and budget
– But don’t understand information behavior, semantic focus
 Library, KM - Understand information structure
 Experts in search experience and categorization
– But don’t understand business or technology
Design of the Text Analytics Selection Team
 Interdisciplinary Team, headed by Information Professionals
 Relative Contributions
– IT – Set necessary conditions, support tests
– Business – provide input into requirements, support project
– Library – provide input into requirements, add understanding of
search semantics and functionality
 Much more likely to make a good decision
 Create the foundation for implementation
Step 3: Proof of Concept / Pilot Project
 4 weeks POC – bake off / or short pilot
 Real life scenarios, categorization with your content
 2 rounds of development, test, refine / Not OOB
 Need SMEs as test evaluators – also to do an initial categorization of
content
 Measurable Quality of results is the essential factor
 Majority of time is on auto-categorization
 Need to balance uniformity of results with vendor unique capabilities –
have to determine at POC time
 Taxonomy Developers – expert consultants plus internal taxonomists
Step 3: Proof of Concept
POC Design: Evaluation Criteria & Issues
 Basic Test Design – categorize test set
– Score – by file name, human testers
 Categorization & Sentiment – Accuracy 80-90%
– Effort Level per accuracy level
 Quantify development time – main elements
 Comparison of two vendors – how to score?
– Combination of scores and report
 Quality of content & initial human categorization
– Normalize among different test evaluators
 Quality of taxonomists – experience with text analytics software and/or
experience with content and information needs and behaviors
 Quality of taxonomy – structure, overlapping categories
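A sketch of scoring a POC test set by file name against human categorization. Majority voting is used here as one simple way to normalize among evaluators; that choice is an assumption, not something the slides prescribe, and the file names and category labels in the example are hypothetical.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority of evaluators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def build_gold(evaluations):
    """Map file name -> agreed label, skipping files with no majority."""
    gold = {}
    for name, labels in evaluations.items():
        label = majority_label(labels)
        if label is not None:
            gold[name] = label
    return gold


def accuracy(predictions, gold):
    """Share of gold-labelled files for which the software picked the same label."""
    scored = [name for name in gold if name in predictions]
    if not scored:
        return 0.0
    correct = sum(predictions[name] == gold[name] for name in scored)
    return correct / len(scored)


if __name__ == "__main__":
    # One list of labels per file, one label per human evaluator.
    evaluations = {
        "a.txt": ["HR", "HR", "Legal"],
        "b.txt": ["Legal", "Legal", "Legal"],
        "c.txt": ["HR", "Legal", "Finance"],   # no agreement, excluded
        "d.txt": ["Finance", "Finance", "HR"],
    }
    gold = build_gold(evaluations)
    vendor_a = {"a.txt": "HR", "b.txt": "Legal", "c.txt": "HR", "d.txt": "HR"}
    print(round(accuracy(vendor_a, gold), 2))
```

Files on which the evaluators cannot agree say more about the taxonomy (overlapping categories) than about either vendor, so they are worth reviewing separately rather than scoring.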
Step 3: Proof of Concept
POC and Early Development: Risks and Issues
 CTO Problem – This is not a regular software process
 Semantics is messy, not just complex
– 30% accuracy isn’t 30% done – could be 90%
 Variability of human categorization
 Categorization is iterative, not “the program works”
– Need realistic budget and flexible project plan
 Anyone can do categorization
– Librarians often overdo, SMEs often get lost (keywords)
 Meta-language issues – understanding the results
– Need to educate IT and business in their language
Step 3: Proof of Concept / Quick Start
Outcomes
 POC – understand how text analytics can work in your
environment
 Learn the software – internal resources trained by doing
 Learn the language – syntax (Advanced Boolean)
 Learn categorization and extraction
 Good categorization rules
– Balance of general and specific
– Balance of recall and precision
 Develop or refine taxonomies for categorization
 POC – can be the Quick Start or the Start of the Quick Start
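To illustrate the balance of recall and precision, here is a small sketch comparing a general rule with a more specific one on a toy document set. The documents, the two rules, and the set of relevant documents are invented for the example.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for a set of predicted document ids."""
    true_pos = len(predicted & relevant)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    return precision, recall


def apply_rule(docs, required_terms):
    """Return ids of documents containing every required term."""
    return {doc_id for doc_id, text in docs.items()
            if required_terms <= set(text.lower().split())}


if __name__ == "__main__":
    # Hypothetical test set for a "Banking" category.
    docs = {
        1: "the bank approved the loan",
        2: "the bank raised interest rates",
        3: "we walked along the river bank",
        4: "loan officers at the bank met",
    }
    relevant = {1, 2, 4}

    general = apply_rule(docs, {"bank"})           # broad: catches everything
    specific = apply_rule(docs, {"bank", "loan"})  # narrow: misses document 2

    print("general ", precision_recall(general, relevant))
    print("specific", precision_recall(specific, relevant))
```

The general rule finds every relevant document but also the river bank; the specific rule is always right but misses one. Tuning rules is largely a matter of choosing where between those two to sit.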
Development, Implementation
Quick Start – First Application: Search and TA
 Simple Subject Taxonomy structure
– Easy to develop and maintain
 Combined with categorization capabilities
– Added power and intelligence
 Combined with people tagging, refining tags
 Combined with Faceted Metadata
– Dynamic selection of simple categories
– Allow multiple user perspectives
• Can’t predict all the ways people think
• Monkey, Banana, Panda
 Combined with ontologies and semantic data
– Multiple applications – Text mining to Search
– Combine search and browse
3. Roles and Responsibilities
 Sample roles matrix:
3. Roles and Responsibilities
 Common Roles and SharePoint Permissions:
Recommended Roles      SharePoint Permissions
Site Administrator     Site Administrator
SharePoint Owner       Full Control
SharePoint Member      Contribute
SharePoint Visitor     Read
Rest of Your Life:
Maintenance, Refinement, Application, Learning
 This is easy – if you did the TA Audit and POC/Quick Start
 Content – new content – calls for flexible, new methods
 People – Have a trained team and extended team
 Technology – integrate into a variety of applications – SBA
 Processes, workflow – how to semi-automate, part of normal
 Maintenance – Refinement – in world of rapid change
– Mechanisms for feedback, learning – of text analysts and software
 Future Directions - Advanced Applications
– Embedded Applications, Semantic Web + Unstructured Content
– Integration of Enterprise and External - Social Media
– Expertise Analysis, Behavior Prediction (Predictive Analytics)
– Voice of the Customer, Big Data
– Turning unstructured content into data – new worlds
Conclusion
 Text Analytics can fulfill the promise of taxonomy and metadata
– Economic and consistent structure for unstructured content
 Search and Text Analytics
– Search that works – finally!
– Platform for Search-Based Applications
 Text Analytics is a different kind of software / solution
– Infrastructure – Hybrid CM to Search and feedback
 How to Get Started with Text Analytics
– Strategic Vision of Text Analytics
– Three steps – TA Audit, TA evaluation, POC/Quick Start
 Text Analytics opens up new worlds of applications
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
www.TextAnalyticsWorld.com Oct 3-4, Boston