Text Analytics Evaluation
A Case Study: Amdocs
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics Evaluation Case Study
 Agenda
Introduction – Text Analytics Basics
Evaluation Process & Methodology
Two Stages – Initial Filters & POC
Initial Evaluation Results
Proof of Concept
Methodology
Results
Final Recommendation
Sentiment Analysis and Beyond
Conclusions
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
KAPS Group: General

Knowledge Architecture Professional Services

Virtual Company: Network of consultants – 8-10

Partners – SAS – 2 Whitepapers (Semantic infrastructure)
 GAO, FDA, Amdocs – Sales & Development Projects

Other Partners: Smart Logic, FAST, Concept Searching, etc.

Consulting, Strategy, Knowledge architecture audit

Services:
 Text Analytics evaluation, development, consulting, customization
 Knowledge Representation – taxonomy, ontology, Prototype
 Knowledge Management: Collaboration, Expertise, e-learning
 Applied Theory – Faceted taxonomies, complexity theory, natural
categories
3
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics Evaluation Case Study
Text Analytics Features
 Noun Phrase Extraction
 Catalogs with variants, rule based dynamic
 Multiple types, custom classes – entities, concepts, events
 Feeds facets
 Summarization
 Customizable rules, map to different content
 Fact Extraction
 Relationships of entities – people-organizations-activities
 Ontologies – triples, RDF, etc.
 Sentiment Analysis
 Rules – Objects and phrases
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Introduction to Text Analytics
Text Analytics Features
 Auto-categorization






Training sets – Bayesian, Vector space
Terms – literal strings, stemming, dictionary of related terms
Rules – simple – position in text (Title, body, url)
Semantic Network – Predefined relationships, sets of rules
Boolean– Full search syntax – AND, OR, NOT
Advanced – DIST (#), PARAGRAPH, SENTENCE

This is the most difficult to develop

Build on a Taxonomy

Combine with Extraction
 If any of list of entities and other words
5
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
6
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
7
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
8
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
9
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
10
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
11
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Evaluating Text Analytics Software
Start with Self Knowledge
 Strategic and Business Context
 Strategic Questions – why, what value from the taxonomy/text
analytics, how are you going to use it
 Info Problems – what, how severe
 Formal Process - KA audit – content, users, technology, business
and information behaviors, applications - Or informal for smaller
organization, application specific initiatives
 Text Analytics Strategy/Model – forms, technology, people
 Existing taxonomic resources, software
 Need this foundation to evaluate and to develop
12
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Evaluating Text Analytics Software
Start with Self Knowledge
 Do you need it – and what blend if so?
 Taxonomy Management Full Functionality
 Multiple taxonomies, languages, authors-editors
 Technology Environment – Text Mining, ECM, Enterprise Search
 Where is it embedded, integration issues
 Publishing Process – where and how is metadata being added –
now and projected future
 Can it utilize auto-categorization, entity extraction, summarization
 Applications – text mining, BI, CI, Social Media, Mobile?
13
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Evaluation Process & Methodology
Team - Interdisciplinary
 IT – Large software purchase, needs assessment
» Text Analytics is different – semantics
» Construction company designing your house
 Business – Understand the business needs
» Don’t understand information
» Restaurant owner doing the cooking
 Library - know information, search
» Don’t understand the business, non-information experts
» Accountant doing financial strategy
 Team – 3 KAPS - Information
 5-8 Amdocs – SME - business, Technical.
14
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Evaluation Process & Methodology
Amdocs Requirements / Initial Filters
 Platform – range of capabilities
 Categorization, Sentiment analysis, etc.
 Technical
 API’s, Java based, Linux run time
 Scalability – millions of documents a day
 Import-Export – XML, RDF
 Total Cost of Ownership
 Vendor Relationship - OEM
 Usability, Multiple Language Support
15
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Evaluation Process & Methodology
Two Phases
 Phase I – Traditional Software Evaluation
 Filter One- Ask Experts - reputation, research – Gartner, etc.
» Market strength of vendor, platforms, etc.
 Filter Two - Feature scorecard – minimum, must have, filter to
top 3
 Filter Three – Technology Filter – match to your overall scope
and capabilities – Filter not a focus
 Filter Four – In-Depth Demo – 3-6 vendors
 Phase II - Deep POC (2) – advanced, integration,
semantics
16
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase I – Case Study
 Attensity
 Lexalytics
 SAP – Inxight
 Multi-Tes
 Clarabridge
 Nstein
 ClearForest
 SAS
 Concept Searching
 SchemaLogic
 Data Harmony / Access
Innovations
 Smart Logic
 Expert Systems
 Enterprise Search
 GATE (Open Source)
 IBM
 Content Management
 Sentiment Analysis Specialty
 Ontology Platforms
17
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase I - 4 Demos
 SmartLogic
 Taxonomy Management, good interface
 20 types of entities, API’s, XML-Http
 Full Platform – no Sentiment Analysis
 Expert Systems
 Different Approach – Semantic Network – 400,000 words / 3,500
rules, 65 types of relationships
 Strong out of the box – 80%, no training sets
 Language concerns – no Spanish, high cost to develop new ones
 Customization – add terms and relationships, develop rules –
uncertain how much effort, use their professional linguists
18
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase I - 4 Demos
 SAS- Content Categorization & Sentiment
 Full Platform – categorization, entity, sentiment – integrated
 API’s, XML, Java – ease of integration
 Strong history of company, range of experience
 IBM – Classification, Concept Analytics – Two products
 Classification Module – statistical emphasis
» Once trained, it could “learn” new words
» Rapid development / depends on training sets
 Content Analytics, Languageware Workbench
» Full Platform
19
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase I – Findings
 SAS & IBM – Full Platform, OEM Experience, multilingual
 Proven ability to scale, customizable components, mature tool sets
 SAS was the strongest offering
 Capabilities, experience, integrated tool sets
 IBM good second choice
 Capabilities, experience - multiple products – strength and weakness
 Single Vendor POC - Demonstrate it can be done
 Ability to dive more deeply into capabilities, issues
 Stronger foundation for future development, Learn the software better
 Danger of missing better choice
 Two Vendor POC
 Balance of depth and full testing
20
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase II - Proof Of Concept - POC

4-6 weeks POC – bake off / or short pilot

Measurable Quality of results is the essential factor

Real life scenarios, categorization with your content

2-3 rounds of development, test, refine / Not OOB

Need SME’s as test evaluators – also to do an initial categorization of
content

Majority of time is on auto-categorization

Need to balance uniformity of results with vendor unique capabilities – have
to determine at POC time

Taxonomy Developers – expert consultants plus internal taxonomists
21
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase II – POC: Range of Evaluations
 Basic Question – Can this stuff work at all?
 Auto-categorization to existing taxonomy – variety of content
 Essential Issue is complexity of language
 Clustering – automatic node generation
 Summarization
 Entity extraction – build a number of catalogs – design which ones
based on projected needs – example privacy info (SS#, phone, etc.)
 Entity example –people, organization, methods, etc.
 Essential issue is scale and disambiguation
 Evaluate usability in action by taxonomists
22
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase II – POC: Evaluation Criteria & Issues

Basic Test Design – categorize test set
 Score – by file name, human testers

Categorization & Sentiment – Accuracy 80-90%
 Effort Level per accuracy level

Quantify development time – main elements

Comparison of two vendors – how score?
 Combination of scores and report

Quality of content & initial human categorization
 Normalize among different test evaluators

Quality of taxonomists – experience with text analytics software and/or
experience with content and information needs and behaviors

Quality of taxonomy – structure, overlapping categories
23
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Phase II – POC: Risks
 CIO/CTO Problem –This is not a regular software process
 Language is messy not just complex
 30% accuracy isn’t 30% done – could be 90%
 Variability of human categorization / expression
 Even professional writers – journalists examples
 Categorization is iterative, not “the program works”
 Need realistic budget and flexible project plan
 Anyone can do categorization
 Librarians often overdo, SME’s often get lost (keywords)
 Meta-language issues – understanding the results
 Need to educate IT and business in their language
24
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics POC Outcomes
Categorization of CSR Notes
 Content –2,000 CSR notes categorized by humans
 Variation among human categorization
 Recall (finding all the correct documents)
 Precision (not categorizing documents from other categories)
 Precision is harder than recall
 Two scores – raw and corrected – only raw for IBM precision
 First score was very low, with an extra round got it up
 Uncategorized documents – 50,000 – look at top 10 in each
category
25
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics POC Outcomes
Categorization Results
SAS
IBM
Recall-Motivation
92.6
90.7
Recall-Actions
93.8
88.3
Precision – Mot.
84.3
Precision-Act
100
Uncategorized
87.5
Raw Precision
73
46
26
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics POC Outcomes
Vendor Comparisons
 SAS has a much more complete set of operators – NOT, DIST,
ORDDIST, START
 IBM team was able to develop work arounds for some – more
development effort
 Operators impact most other features – Sentiment analysis, Entity and
Fact Extraction, Summarization, etc.
 SAS has relevancy – can be used for precision, applications
 Sentiment Analysis – SAS has workbench, IBM would require more
development
 SAS also has statistical modeling capabilities
 Development Environment & Methodology
 IBM as toolkit provides more flexibility but it also increases development
effort, enforces good method
27
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics POC Outcomes
Vendor Comparisons - Conclusions
 Both can do the job
 Product vs. Tool Kit (SAS has toolkit capabilities also)
 IBM will require more development effort
 Boolean Operators – NOT, DIST, ORDDIST, START, etc.
» In rules, entity and fact extraction
 Sentiment Analysis – rules, statistical
 Summarization
 Rule building more programming than taxonomy
 IBM harder to learn – POC had 2X effort for IBM
 Conclusion: Buy SAS ECC and Sentiment Workbench
28
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Sentiment Analysis
Development Process
 Combination of Statistical and categorization rules
 Start with Training sets – examples of positive,
negative, neutral documents
 Develop a Statistical Model
 Generate domain positive and negative words and
phrases
 Develop a taxonomy of Products & Features
 Develop rules for positive and negative statements
 Test and Refine
 Test and Refine again
29
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
30
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
31
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
32
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
33
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
34
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Beyond Sentiment: Behavior Prediction
Case Study – Telecom Customer Service

Problem – distinguish customers likely to cancel from mere threats

Analyze customer support notes

General issues – creative spelling, second hand reports

Develop categorization rules
 First – distinguish cancellation calls – not simple
 Second - distinguish cancel what – one line or all
 Third – distinguish real threats
35
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Beyond Sentiment
Behavior Prediction – Case Study
 Basic Rule
 (START_20, (AND,
 (DIST_7,"[cancel]", "[cancel-what-cust]"),
 (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", “[if]”)))))

Examples:
 customer called to say he will cancell his account if the does not stop receiving
a call from the ad agency.
 cci and is upset that he has the asl charge and wants it off or her is going to
cancel his act
 ask about the contract expiration date as she wanted to cxl teh acct
Combine sophisticated rules with sentiment statistical training and
Predictive Analytics
36
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Beyond Sentiment - Wisdom of Crowds
Crowd Sourcing Technical Support

Example – Android User Forum

Develop a taxonomy of products, features, problem areas

Develop Categorization Rules:
 “I use the SDK method and it isn't to bad a all. I'll get some pics up
later, I am still trying to get the time to update from fresh 1.0 to 1.1.”
 Find product & feature – forum structure
 Find problem areas in response, nearby text for solution

Automatic – simply expose lists of “solutions”
 Search Based application

Human mediated – experts scan and clean up solutions
37
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Beyond Sentiment: Expertise Analysis
 Apply Sentiment Analysis techniques to Expertise
 Expertise Characterization for individuals, communities,
documents, and sets of documents
 Experts prefer lower, subordinate levels
 Novice prefer higher, superordinate levels
 General Populace prefers basic level
 Experts language structure is different
 Focus on procedures over content
 Develop expertise rules – sentiment and categorization
38
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Expertise Analysis
Expertise – application areas

Taxonomy / Ontology development /design – audience focus
 Card sorting – non-experts use superficial similarities

Business & Customer intelligence – add expertise to sentiment
 Deeper research into communities, customers

Text Mining - Expertise characterization of writer, corpus

eCommerce – Organization/Presentation of information – expert, novice

Expertise location- Generate automatic expertise characterization based on
documents

Experiments - Pronoun Analysis – personality types
 Essay Evaluation Software - Apply to expertise characterization
» Model levels of chunking, procedure words over content
39
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics Evaluation
Conclusions
 Start with Self Knowledge – text analytics not an end in itself
 Initial Evaluation – filters, not scorecards
 Weights change output – need self knowledge for good weights
 Proof of Concept – essential
 OOB doesn’t tell you how it will work in real world
 Content and Scenarios is your real world
 Good idea even if you know SAS is the answer
 Importance of operators, relevance for a platform
 Sentiment needs full platform capabilities
 Everyone has room for improvement
40
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Text Analytics
Future Directions
 Start with the 80% of significant content that is not data
 Enterprise search, content management, Search based applications
 Text Analytics and Text Mining
 Text Analytics turns text into data – Build better TM Apps
 Better extraction and add Subject / Concepts
 Sentiment and Beyond – Behavior, Expertise
 Text Mining and Text Analytics
 TM enriching TA
 Taxonomy development
 New Content Structures, ensemble models
 Text Analytics and Predictive Analytics
 More content, New content – social, interactive – CSR
 New sources of content/data = new & better apps
 Add Learning & Cognitive Science and the future is ?
41
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Copyright © 2011, SAS Institute Inc. All rights reserved.
#analytics2011