Text Analytics Software Evaluation

SemTech
Text Analytics Evaluation
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
 Text Analytics Features, Varieties, Vendors
 Evaluation Process
– Start with Self-Knowledge
– Text Analytics Team
– Features and Capabilities – Filter
 Proof of Concept/Pilot
– Themes and Issues
– Case Study
 Conclusion
KAPS Group: General
 Knowledge Architecture Professional Services
 Virtual company: network of 8-10 consultants
 Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc.
 Consulting, strategy, knowledge architecture audit
 Services:
– Taxonomy/Text Analytics development, consulting, customization
– Technology Consulting – Search, CMS, Portals, etc.
– Evaluation of Enterprise Search, Text Analytics
– Metadata standards and implementation
– Knowledge Management: Collaboration, Expertise, e-learning
– Applied Theory – Faceted taxonomies, complexity theory, natural categories
Introduction to Text Analytics
Text Analytics Features
 Noun Phrase Extraction (entities, concepts, events, etc.)
– Catalogs with variants, rule-based dynamic extraction (see the sketch below)
– Multiple types, custom classes – entities, concepts, events
– Feeds facets
 Summarization
– Customizable rules, map to different content
 Fact Extraction
– Relationships of entities – people-organizations-activities
– Ontologies – triples, RDF, etc.
 Sentiment Analysis
– Statistical, rules – full categorization set of operators
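As an illustration of the catalog-with-variants approach to entity extraction, here is a minimal Python sketch. The catalog contents and function names are hypothetical, not any vendor's API.

# Minimal sketch: catalog-based entity extraction with variants.
CATALOG = {
    "International Business Machines": "IBM",  # variant -> canonical name
    "IBM Corp.": "IBM",
    "IBM": "IBM",
    "Big Blue": "IBM",
}

def extract_entities(text):
    """Return (canonical name, position) for each catalog variant found."""
    hits = []
    for variant, canonical in CATALOG.items():
        start = text.find(variant)
        while start != -1:
            hits.append((canonical, start))
            start = text.find(variant, start + 1)
    return hits

print(extract_entities("Big Blue, a.k.a. IBM, posted results."))
# [('IBM', 17), ('IBM', 0)] – both variants resolve to the same entity

A real engine adds tokenization, case handling, and rules for dynamic (non-catalog) entities on top of this basic lookup.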
Introduction to Text Analytics
Text Analytics Features
 Auto-categorization
– Training sets – Bayesian, vector space
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (title, body, URL)
– Semantic Network – predefined relationships, sets of rules
– Boolean – full search syntax – AND, OR, NOT
– Advanced – NEAR (#), PARAGRAPH, SENTENCE (see the sketch below)
 This is the most difficult feature to develop
 Build on a taxonomy
 Combine with extraction
– If any of a list of entities co-occurs with other words
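To make the Boolean and proximity operators concrete, here is a sketch of a categorization rule in Python. The operator semantics (NEAR within n tokens) and the sample rule are illustrative, not any product's rule syntax.

# Sketch of a Boolean categorization rule with AND, NOT, and NEAR.
def near(tokens, a, b, distance):
    """True if words a and b occur within `distance` tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

def billing_dispute_rule(text):
    """Hypothetical rule: ("bill" NEAR/5 "dispute") AND NOT "advertisement"."""
    tokens = text.lower().split()
    return near(tokens, "bill", "dispute", 5) and "advertisement" not in tokens

print(billing_dispute_rule("customer wants to dispute the bill again"))  # True

Vendor rule languages add PARAGRAPH and SENTENCE scoping and position tests (title vs. body) on the same foundation.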
Varieties of Taxonomy/Text Analytics Software
 Taxonomy Management
– Synaptica, SchemaLogic
 Full Platform
– SAS-Teragram, SAP-Inxight, Smart Logic, Data Harmony, Concept Searching, Expert System, IBM, GATE
 Content Management – embedded
 Embedded – Search
– FAST, Autonomy, Endeca, Exalead, etc.
 Specialty
– Sentiment Analysis, VOC – Lexalytics, Attensity / reports
– Ontology – extraction, plus ontology
Evaluating Taxonomy/Text Analytics Software
Start with Self-Knowledge
 Strategic and business context
 Info problems – what they are, how severe
 Strategic questions – why, what value from the taxonomy/text analytics, how are you going to use it
 Formal process – KA audit: content, users, technology, business and information behaviors, applications – or an informal, application-specific initiative for a smaller organization
 Text analytics strategy/model – forms, technology, people
– Existing taxonomic resources, software
 You need this foundation both to evaluate and to develop
Evaluating Taxonomy/Text Analytics Software
Start with Self-Knowledge
 Do you need it – and if so, what blend?
 Taxonomy management: full functionality
– Multiple taxonomies, languages, authors-editors
 Technology environment – text mining, ECM, enterprise search
– Where is it embedded? Integration issues
 Publishing process – where and how is metadata being added, now and in the projected future?
– Can it utilize auto-categorization, entity extraction, summarization?
 Applications – text mining, BI, CI, social media, mobile?
Design of the Text Analytics Selection Team
 Traditional candidates: IT
 Experience with large software purchases
– But search/categorization is unlike other software
 Experience with needs assessments
– Need more – knowing what questions to ask, a knowledge audit
 Objective criteria
– Looking where there is light? (the streetlight effect)
 Asking IT to select text analytics software is like asking a construction company to select the design of your house.
 They have the budget
– OK, they can play.
Design of the Text Analytics Selection Team
 Traditional candidates: business owners
 Understand the business
– But don't understand information behavior
 Focus on business value, not technology
– But a focus on semantics is needed
 Asking business owners to select text analytics software is like asking a restaurant owner to do the cooking.
 They can get executive sponsorship, support, and budget.
– OK, they can play.
Design of the Text Analytics Selection Team
 Traditional candidates: library
 Understand information structure
– But not how it is used in the business
 Experts in search experience and categorization
– But suited to expert users, not regular users
 Asking librarians to select text analytics software is like asking an accountant to establish your financial strategy.
 Experience with a variety of search engines, taxonomy software, integration issues
– OK, they can play.
Design of the Text Analytics Selection Team
 Interdisciplinary team, headed by information professionals
 Relative contributions:
– IT – set necessary conditions, support tests
– Business – provide input into requirements, support the project
– Library – provide input into requirements, add understanding of search semantics and functionality
 Much more likely to make a good decision
 Creates the foundation for implementation
Evaluating Text Analytics Software – Process
 Start with self-knowledge
 Eliminate the unfit:
– Filter One – ask experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
– Filter Two – feature scorecard – minimum, must-have, filter (see the sketch below)
– Filter Three – technology filter – match to your overall scope and capabilities – a filter, not a focus
– Filter Four – focus group, one-day visit – 3-4 vendors
 Deep pilot (2 vendors) / POC – advanced features, integration, semantics
 Focus on the working relationship with the vendor.
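A feature scorecard (Filter Two) can be as simple as a weighted checklist where must-have features act as hard filters. A minimal sketch, with hypothetical features and weights:

# Hypothetical feature scorecard: must-haves filter, the rest score.
MUST_HAVE = {"auto_categorization", "entity_extraction"}
WEIGHTS = {"sentiment_analysis": 3, "summarization": 2, "multilingual": 3}

def score_vendor(features):
    """Return None if a must-have is missing, else a weighted score."""
    if not MUST_HAVE <= features:
        return None  # fails the filter outright
    return sum(w for f, w in WEIGHTS.items() if f in features)

vendors = {
    "Vendor A": {"auto_categorization", "entity_extraction", "sentiment_analysis"},
    "Vendor B": {"entity_extraction", "summarization"},  # no categorization
}
for name, feats in vendors.items():
    print(name, score_vendor(feats))  # Vendor A 3, Vendor B None

As the slides stress, these scores are a filter for shortlisting, not the final decision criterion.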
Initial Evaluation Example Outcomes
 Filter One:
– Company A, B – sentiment analysis focus, weak categorization
– Company C – lack of a full suite of text analytics
– Company D – business concerns, support
– Open source – license issues
– Ontology vendors – missing categorization capabilities
 Four demos:
– Saw a variety of different approaches, but
– Company X – lacking sentiment analysis, would require 2 vendors
– Company Y – lack of language support, development cost
Evaluating Taxonomy Software
POC - Approach
 Quality of results is the essential factor
 6-week POC – bake-off / or short pilot
 Real-life scenarios, categorization with your content
 Preparation:
– Preliminary analysis of content and users' information needs
– Set up software in lab – relatively easy
– Train taxonomist(s) on the software
– Develop a taxonomy if none is available
 Six-week POC – 3 rounds of development, test, refine / not out-of-the-box
 Need SMEs as test evaluators – also to do an initial categorization of content
Evaluating Taxonomy Software
POC – Initial Design
 Majority of time is spent on auto-categorization
 Need to balance uniformity of results with vendors' unique capabilities – have to determine this at POC time
 Risks – getting software installed and working, getting the right content, initial categorization of content
 Elements:
– Content
– Search terms / search scenarios
– Training sets
– Test sets of content
 Development team – expert consultants plus internal taxonomists, technical staff
Evaluating Taxonomy Software
POC – Range of Evaluations
 Basic – can this stuff work at all?
 Auto-categorization to an existing taxonomy – variety of content
 Clustering – automatic node generation
 Summarization
 Entity extraction – build a number of catalogs – design which ones based on projected needs – for example, privacy info (SSN, phone, etc.; see the sketch below)
 Entity examples – people, organizations, methods, etc.
 Evaluate usability in action by taxonomists
 Integration – with ontologies
 Output – XML, APIs
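A privacy-information catalog of the kind mentioned above can start from simple regular expressions. This sketch covers US-style SSNs and phone numbers only; the patterns are deliberately simplified for illustration.

import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # e.g. 078-05-1120
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),  # e.g. 415-555-0199
}

def extract_privacy(text):
    """Return every match for each privacy-info pattern."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}

print(extract_privacy("Call 415-555-0199; SSN on file: 078-05-1120."))
# {'ssn': ['078-05-1120'], 'phone': ['415-555-0199']}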
Evaluating Text Analytics Software
POC - Issues
 Quality of content – a range of issues, from spelling to size and beyond
 Quality of the initial human categorization
 Normalize among different test evaluators (see the sketch below)
 Quality of taxonomists – experience with text analytics software and/or experience with the content and information needs and behaviors
 Quality of the taxonomy
– General issues – structure (too flat or too deep)
– Overlapping categories
– Differences in use – browse, index, categorize
 For categorization, the essential issue is the complexity of language
 For entity extraction, the essential issues are scale and disambiguation
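One common way to normalize among test evaluators (not named in the slides, but standard practice) is to measure inter-annotator agreement, for example with Cohen's kappa:

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two evaluators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)  # chance agreement
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neg"]
b = ["pos", "neg", "neg", "neg", "pos", "pos"]
print(round(cohens_kappa(a, b), 2))  # 0.33 – modest agreement

Low kappa between evaluators signals that the category definitions, not the software, need work before POC scores can be trusted.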
Evaluating Text Analytics Software
Risks
 CIO/CTO problem – this is not a regular software process
 Language is messy, not just complex
– 30% accuracy isn't 30% done – it could be 90% done
 Variability of human categorization / expression
– Even professional writers – journalist examples
 Categorization is iterative, not "the program works"
– Need a realistic budget and a flexible project plan
 "Anyone can do categorization"
– Librarians often overdo it; SMEs often get lost (keywords)
 Meta-language issues – understanding the results
– Need to educate IT and business in their language
Case Study: Telecom Service
 Criteria:
– Company history, reputation
– Full platform – categorization, extraction, sentiment
– Integration – Java, API/SDK, Linux
– Multiple languages
– Scale – millions of docs a day
– Total cost of ownership
– Ease of development – new
– Vendor relationship – OEM, etc.
 Vendors evaluated:
– Expert System
– IBM
– SAS
– Smart Logic
 Option – multiple vendors – sentiment & platform
POC Design Discussion: Evaluation Criteria
 Basic test design – categorize a test set
– Score – by file name, human testers (see the sketch below)
 Categorization
– Accuracy level – 80-90%
– Effort level per accuracy level
 Sentiment analysis
– Accuracy level – 80-90%
– Effort level per accuracy level
 Quantify development time – main elements
 Comparison of two vendors – how to score?
– Combination of scores and a report
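The scoring itself can be automated once the human "gold" categorization of the test set exists. A minimal sketch of per-category precision and recall; the category names echo the Motivation/Actions rows on the next slide, but the sample data is made up.

def precision_recall(gold, predicted, category):
    """Per-category precision and recall against human gold labels."""
    tp = sum(g == category and p == category for g, p in zip(gold, predicted))
    fp = sum(g != category and p == category for g, p in zip(gold, predicted))
    fn = sum(g == category and p != category for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold      = ["motivation", "motivation", "action", "action", "none"]
predicted = ["motivation", "action",     "action", "action", "motivation"]
print(precision_recall(gold, predicted, "motivation"))  # (0.5, 0.5)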
Text Analytics POC Outcomes
Categorization Results
                      SAS     IBM
Recall – Motivation   92.6    90.7
Recall – Actions      93.8    88.3
Precision – Mot.      84.3    –
Precision – Act.      100     –
Uncategorized         87.5    –
Raw Precision         73      46
Text Analytics POC Outcomes
Vendor Comparisons
 Categorization results – both good, edge to SAS on precision
– Use of relevancy scores to set thresholds (see the sketch below)
 Development environment
– IBM as a toolkit provides more flexibility, but it also increases development effort
 Methodology – IBM enforces good method, but takes more time
– SAS can be used in exactly the same way
 SAS has a much more complete set of operators – NOT, DIST, START
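The relevancy-threshold point above can be made concrete with a small sketch: a document is assigned its top-scoring category only if the relevancy score clears a cutoff, otherwise it is left uncategorized. The threshold value here is made up.

THRESHOLD = 0.75  # illustrative; tuned during the POC

def assign(scores):
    """scores: {category: relevancy}. Return the best category or None."""
    category, relevancy = max(scores.items(), key=lambda kv: kv[1])
    return category if relevancy >= THRESHOLD else None

print(assign({"motivation": 0.82, "action": 0.40}))  # 'motivation'
print(assign({"motivation": 0.55, "action": 0.40}))  # None – uncategorized

Raising the threshold trades recall for precision and increases the share of uncategorized documents.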
Text Analytics POC Outcomes
Vendor Comparisons - Functionality
 Sentiment analysis – SAS has a workbench; IBM would require more development
– SAS also has statistical modeling capabilities
 Entity and fact extraction – seems basically the same
– SAS can use operators for improved disambiguation
 Summarization – SAS has it built in
– IBM could develop it using categorization rules, but it is not clear that would be as effective without operators
 Conclusion: both can do the job, edge to SAS
Conclusion
 Start with self-knowledge – what will you use it for?
– Current environment – technology, information
 Basic features are only filters, not scores
 Integration – need an integrated team (IT, business, KA)
– For evaluation and development
 POC – your content, real-world scenarios – not just scores
 Foundation for development, experience with the software
– Development is better, faster, cheaper
 Categorization is essential, and time consuming
 Next: Text Analytics + Semantic Web + Ontology
– Integration of data and text mining
– Mutual enrichment – smarter data, richer analytics
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com