Enterprise Search / Text Analytics Evaluation

Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda – Part II
 Introduction – Text Analytics Can Rescue Search
  – Elements of Text Analytics
 Evaluating Text Analytics Software
  – Varieties of Software – platform and embedded
 Three Phase Process / Two Examples
  – Initial Evaluation – 4 vendors
  – Demos – from 4 to 2
  – 4-6 week POC – 2 vendors
 Conclusions
KAPS Group: General
 Knowledge Architecture Professional Services
 Virtual Company: Network of consultants – 8-10
 Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc.
 Consulting, Strategy, Knowledge architecture audit
 Services:
  – Taxonomy/Text Analytics development, consulting, customization
  – Technology Consulting – Search, CMS, Portals, etc.
  – Evaluation of Enterprise Search, Text Analytics
  – Metadata standards and implementation
  – Knowledge Management: Collaboration, Expertise, e-learning
  – Applied Theory – Faceted taxonomies, complexity theory, natural categories
Introduction: Text Analytics to the Rescue
 Enterprise Search is Dead!
 Taxonomy is Dead!
 Long Live Text Analytics!
 ECM and enterprise search failed because search is about semantics / meaning, not technology
 Taxonomy failed because it is too rigid and too dumb
 Metadata failed because it is too hard for authors to do (not really)
 They all failed because search is a semantic infrastructure element, not a project
Introduction to Text Analytics
Text Analytics Features
 Noun Phrase Extraction
  – Catalogs with variants, rule-based dynamic
  – Multiple types, custom classes – entities, concepts, events
  – Feeds facets
 Summarization
  – Customizable rules, map to different content
 Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
 Sentiment Analysis
  – Rules – objects and phrases
Introduction to Text Analytics
Text Analytics Features
 Auto-categorization
  – Training sets – Bayesian, vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (title, body, URL)
  – Semantic Network – predefined relationships, sets of rules
  – Boolean – full search syntax – AND, OR, NOT
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE
 This is the most difficult to develop
 Build on a taxonomy
 Combine with extraction
  – If any of a list of entities and other words
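As a concrete illustration of the Boolean rule syntax above (AND, OR, NOT, plus proximity operators like NEAR), here is a minimal sketch in Python; the rule, category, terms, and window size are invented for illustration and do not reflect any particular vendor's actual syntax:

```python
import re

def tokenize(text):
    """Lowercase word tokens, in document order."""
    return re.findall(r"[a-z0-9']+", text.lower())

def near(tokens, term_a, term_b, window):
    """NEAR operator: both terms occur within `window` tokens of each other."""
    pos_a = [i for i, tok in enumerate(tokens) if tok == term_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == term_b]
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

def matches_search_category(text):
    """Hypothetical rule: (search AND semantics) OR NEAR/5(taxonomy, rules),
    NOT marketing -- the kind of rule a taxonomist refines over a POC."""
    tokens = tokenize(text)
    if "marketing" in tokens:          # NOT clause
        return False
    return ("search" in tokens and "semantics" in tokens) \
        or near(tokens, "taxonomy", "rules", 5)
```

Real platforms add operators such as PARAGRAPH and SENTENCE scoping and positional rules (title vs. body), but the loop is the same: run the rule over test content, inspect misses, and tighten or loosen the clauses.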
Varieties of Taxonomy / Text Analytics Software
 Taxonomy Management
  – Synaptica, SchemaLogic
 Full Platform
  – SAS-Teragram, SAP-Inxight, ClearForest, Smart Logic, Data Harmony, Concept Searching, Expert System, IBM, GATE
 Content Management
  – Nstein, Interwoven, Documentum, etc.
 Embedded – Search
  – FAST, Autonomy, Endeca, Exalead, etc.
 Specialty
  – Sentiment Analysis – Lexalytics
Evaluation Process & Methodology
 Start with self-knowledge
  – Think Big, Start Small, Scale Fast
 Eliminate the unfit
  – Filter One – Ask experts – reputation, research – Gartner, etc.
    • Market strength of vendor, platforms, etc.
    • Feature scorecard – minimum, must-have; filter to top 3-6
  – Filter Two – Technology filter – match to your overall scope and capabilities – a filter, not a focus
  – Filter Three – In-depth demo – 3-6 vendors
    • Beyond "Yes, we have that feature."
 Deep POC with 2 finalists – advanced features, integration, semantics
 Focus on the working relationship with the vendor.
Evaluation Process & Methodology
Initial Evaluation – Basic Requirements
 Platform – range of capabilities
  – Categorization, sentiment analysis, etc.
 Technical – search evaluation + integration
  – APIs, Java-based, Linux runtime
  – Scalability – millions of documents a day
  – Import-Export – XML, RDF
 Usability
 Multiple Language Support
Evaluating Text Analytics Software
Initial Evaluation: Usability
 Ease of use – copy, paste, rename, merge, etc.
 User documentation – user manuals, online help, training and tutorials
 Visualization
  – File structure, tree, hierarchical and alphabetical
 Automatic Taxonomy / Node & Rule Generation
  – Nonsense for whole-taxonomy generation
  – Node level – suggestions for sub-categories, rules
 Variety of node relationships – child-parent, related
Initial Evaluation Example Outcomes
 Filter One:
  – Company A, B – sentiment analysis focus, weak categorization
  – Company C – lack of a full suite of text analytics
  – Company D – business concerns, support
  – Open Source – license issues
  – Ontology Vendors – missing categorization capabilities
 4 Demos
  – Saw a variety of different approaches, but:
  – Company X – lacking sentiment analysis, would require 2 vendors
  – Company Y – lack of language support, development cost
Evaluating Taxonomy Software
Proof of Concept – POC
 Quality of results is the essential factor
 4-6 week POC – bake-off or short pilot
 Real-life scenarios, categorization with your content
 Preparation:
  – Preliminary analysis of content and users' information needs
  – Set up software in lab – relatively easy
  – Train taxonomist(s) on the software
  – Develop a taxonomy if none is available
 4-6 week POC – 2-3 rounds of development, test, refine / not out of the box
 Need SMEs as test evaluators – also to do an initial categorization of content
Evaluating Taxonomy Software
POC
 Majority of time is spent on auto-categorization
 Need to balance uniformity of results with vendor-unique capabilities – have to determine at POC time
 Risks – getting software installed and working, getting the right content, initial categorization of content
 Elements:
  – Content
  – Search terms / search scenarios
  – Training sets
  – Test sets of content
 Taxonomy developers – expert consultants plus internal taxonomists
Evaluating Taxonomy Software
POC Feature Test Cases
 Auto-categorization to existing taxonomy – variety of content
 Clustering – automatic node generation
 Summarization
 Entity extraction – build a number of catalogs – people, products, etc.
 Sentiment Analysis – products and categorization rules
 Evaluate usability in action by taxonomists
 Question – integration with ontologies?
 Technical / integration
  – Output in XML – APIs
 Map the above to client use cases
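The entity-extraction test case above (building catalogs of people, products, etc.) can be approximated with a simple gazetteer lookup; the catalog entries and type names below are hypothetical stand-ins for the real catalogs built during a POC:

```python
import re

# Hypothetical catalogs: entity type -> known surface forms, including variants
CATALOGS = {
    "person":  {"tom reamy"},
    "company": {"kaps group", "smart logic", "smartlogic", "sas"},
}

def extract_entities(text):
    """Return sorted (entity_type, matched_form) pairs found in the text.

    Word-boundary matching avoids firing on substrings
    (e.g. 'sas' inside 'disaster').
    """
    lowered = text.lower()
    found = set()
    for etype, forms in CATALOGS.items():
        for form in forms:
            if re.search(r"\b" + re.escape(form) + r"\b", lowered):
                found.add((etype, form))
    return sorted(found)
```

In a POC the interesting question is how much effort it takes to extend such catalogs with variants (abbreviations, misspellings) versus letting the tool's rule-based dynamic extraction find unlisted entities.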
POC Design Discussion: Evaluation Criteria
 Basic test design – categorize a test set
  – Score – by file name, human testers
  – Accuracy level – 80-90%
  – Effort level per accuracy level
 Sentiment Analysis
  – Accuracy level – 80-90%
  – Effort level per accuracy level
 Quantify development time – main elements
 Comparison of two vendors – how to score?
  – Combination of scores and report
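The accuracy scoring above, comparing each tool's categorization of the test set against the SMEs' initial human categorization, could be computed along these lines (the document names and category labels are made up; the scoring function itself is a sketch, not any vendor's metric):

```python
def accuracy(human_labels, tool_labels):
    """Fraction of test documents where the tool's category matches the SMEs'."""
    scored = [doc for doc in human_labels if doc in tool_labels]
    if not scored:
        return 0.0
    correct = sum(1 for doc in scored if human_labels[doc] == tool_labels[doc])
    return correct / len(scored)

# Toy comparison of two vendors against the same human baseline
human   = {"doc1.pdf": "Budget", "doc2.pdf": "Health", "doc3.pdf": "Defense"}
vendor1 = {"doc1.pdf": "Budget", "doc2.pdf": "Health", "doc3.pdf": "Energy"}
vendor2 = {"doc1.pdf": "Budget", "doc2.pdf": "Health", "doc3.pdf": "Defense"}
```

Raw accuracy is only half the criterion: the slides pair it with effort level per accuracy level, so the POC report should record how much rule-refinement work each vendor needed to reach its score, not just the final percentage.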
Evaluating Taxonomy Software
POC – Issues
 Quality of content
 Quality of initial human categorization
 Normalize among different test evaluators
 Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
 Quality of taxonomy
  – General issues – structure (too flat or too deep)
  – Overlapping categories
  – Differences in use – browse, index, categorize
 Foundation for development
Text Analytics Evaluation
Context Dependent – A Tale of Two POCs
 Taxonomy
  – GAO – flat, not very orthogonal, political concerns
    • Department-centric, subject-based
  – Company A – flat, two taxonomies, technical – open to change?
    • Action-oriented, activities and events
 Content
  – GAO – giant 200-page formal PDF documents, variety
    • Start_200, Title
  – Company A – short cryptic customer support notes, social media
    • Creative spelling, combination of formal and individual
 Culture / Applications
  – GAO – formal Federal – big applications, infrastructure
  – Company A – informal, technical, lots of small apps
Text Analytics Evaluation – Conclusions
 Text analytics with search and ECM can finally deliver the promise
 Search / text analytics is not like other software – it is about meaning
 The only way to deal with meaning is in context – your context
  – Technical issues are filters, not features
 Enterprise Text Analytics (a platform) is best
  – If you have to start with one application, plan for the platform
 A POC seems expensive, but it is cheaper than making a dead-end choice
  – It is also the foundation of and a head start on development
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com