Quick Start for Text Analytics

Quick Start for Text Analytics
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Agenda
 Introduction – State of Text Analytics
 What is Text Analytics?
 Why need a Quick Start Process?
 Quick Start Process - Foundation
 Knowledge Audit
 Text Analytics Software Evaluation
 Proof of Concept / Pilot
 Building on the Foundation
 From POC to Development
 Conclusions
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
KAPS Group: General

Knowledge Architecture Professional Services – Network of Consultants
 Program Chair – Text Analytics World – April 17-18, San Francisco

Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching,
Attensity, Clarabridge, Lexalytics,

Services:




Strategy – IM & KM - Text Analytics, Social Media, Integration
Taxonomy/Text Analytics development, consulting, customization
Text Analytics Quick Start – Audit, Evaluation, Pilot
Social Media: Text based applications – design & development
 Clients:
 Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt,
Home Depot, Harvard Business Library, British Parliament, Battelle,
Amdocs, FDA, GAO, World Bank, etc.

Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Current State of Text Analytics
 Big Data
 Big Text is bigger than Big Data / 90% of information
 Text Analytics as pre-processing – text into data, new variables and
structure for predictive analytics
 Text Mining as pre-processing for text analytics – discover patterns
 New models, Watson – ensemble methods, modules - puns
 Social Media / Sentiment Analysis
 Next stage – beyond simple sentiment, business value
 Importance of context
 New emotion taxonomies
 Analysis of conversations, speaker relationships, strength of social ties
 Expertise Analysis, crowd sourcing, political
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Current State of Text Analytics
 Enterprise Text Analytics (ETA)
 ETA is the platform for unstructured text applications
 Enterprise Content Management and Search – failed
 Taxonomies and metadata – failed (Mind the Gap)
 Wide Range of InfoApps – BI,CI, Fraud detection, social media
 Has Text Analytics Arrived?
 Survey – 28% just getting started, 11% not yet, 17.5% ETA
 What is holding it back?
 Lack of clarity about business value, what it is – 55%
 Lack of strategic vision, real examples
 Gartner – new report on text analytics
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Introduction: Future Directions
What is Text Analytics Good For?
6
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
What is Text Analytics?
 Text Mining – NLP, statistical, predictive, machine learning
 Semantic Technology – ontology, fact extraction
 Extraction – entities – known and unknown, concepts, events
 Catalogs with variants, rule based
 Sentiment Analysis
 Objects and phrases – statistics & rules – Positive and Negative
 Auto-categorization





Training sets, Terms, Semantic Networks
Rules: Boolean - AND, OR, NOT
Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
Disambiguation - Identification of objects, events, context
Build rules based, not simply Bag of Individual Words
7
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Case Study – Categorization & Sentiment
8
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Case Study – Categorization & Sentiment
9
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Case Study – Taxonomy Development
10
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Case Study – Taxonomy Development
11
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Need for a Quick Start
 Text Analytics is weird, a bit academic, and not very practical
» It involves language and thinking and really messy stuff
 On the other hand, it is really difficult to do right (Rocket Science)
 Organizations don’t know what text analytics is and what it is for
 Survey shows - need two things:
» Strategic vision of text analytics in the enterprise
» Business value, problems solved, information overload
» Text Analytics as platform for information access
» Real life functioning program showing value and
demonstrating an understanding of what it is and does
 Text Analytics is more than Text Mining
 Enterprise TA or One Application – same process, different scale
12
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
The start and foundation: Knowledge Architecture Audit
 Knowledge Map - Understand what you have, what you are,
what you want
 The foundation of the foundation
 Contextual interviews, content analysis, surveys, focus
groups, ethnographic studies, Text Mining
 Category modeling – “Intertwingledness” -learning new
categories influenced by other, related categories
 Monkey, Panda, Banana
 Natural level categories mapped to communities, activities
 Novice prefer higher levels
 Balance of informative and distinctiveness
 4 Dimensions – Content, People, Technology, Activities
13
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Knowledge Audit: Contextual Interviews
 Organizational Context – Free Form
 Management, enterprise wide function
 What is the size and makeup of the organizational units that will
be impacted by this project?
 Are there special constituencies that have to be taken into
account?
 What is the level of political support for this project? Any
opposition?
 What are your major information or knowledge access issues?
 These determine approach and effort for each area
» Content Map often the most complex and time-consuming
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Knowledge Audit: Information Interviews
 Structured, feed survey – list options
 Could you describe the kinds of information activities that you
and your group engage in? (types of content, search, write
proposals, research?) How often?
 How do they carry out these activities?
 Qualitative Research
 What are your major information or knowledge access issues -examples?
 In an ideal world, how would information access work at your
organization?
 What is right and what’s wrong with today’s methods
 Output = map of information communities, activities
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Knowledge Audit: Map of Information Technology
 Content Management – ability to integrate text analytics
 Search – Integration of text analytics – Beyond XML
 Metadata – facets
 Existing Text Analytics – Underutilization?
 Text Mining – often separate silo, how integrate?
 Taxonomy Management, Databases, portals
 Semantic Technologies, Wiki’s
 Visualization software
 Applications – business intelligence, customer support, etc.
 Map- often reveals multiple redundancies, technology silos
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Knowledge Audit: Content Analysis
 Content Map – size, format, audience, purpose, priority,
special features, data and text, etc.
 Content Creation – content management workflow and real
life workflow, publishing process – policy
 Integrate external content – little control, massive scale
 Content Structure –taxonomies, vocabularies, metadata
standards
 Drill Down, theme discovery
 Search log analysis
 Folksonomy if available
 Text Mining, categorization exploration, clustering
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Knowledge Audit- Output
 Strategic Vision and Change Management
» Format – reports, enterprise ontology
» Political/ People and technology requirements
 Business Benefits and ROI
» Enterprise Text Analytics- information overload – IDC study:
» Per 1,000 people = $ 22.5 million a year
» 30% improvement = $6.75 million a year
» Add own stories – especially cost of bad information
 Strategic Project Plan and Road Map
» Text Analytics support requirements –taxonomies, resources
» Map of Initial Projects – and selection criteria
 Software Evaluation, Proof of Concept or Initial Project?
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Evaluation Process & Methodology
 Build on Knowledge Audit
 Deep well articulated evaluation
 Standard Software Evaluation – if needed
 Filter One- Ask Experts - reputation, research – Gartner, etc.
» Market strength of vendor, platforms, etc.
» Feature scorecard – minimum, must have, filter to top 3-6
 Filter Two – Technology Filter – match to your overall scope and
capabilities – Filter not a focus
 Filter Three – In-Depth Demo – 3-6 vendors
 Goal – Eliminate the unfit
 Selection of best 1-3 – preparation for POC
 Input into design of POC
19
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Software Evaluation / Sole Source
 Amdocs
 Customer Support Notes – short, badly written, millions of
documents
 Total Cost, multiple languages, Integration with their application
 Distributed expertise – SME, not classification
 Platform – resell full range of services, Sentiment Analysis
 Twenty to Four to POC IBM vs. SAS to SAS
 GAO
 Library of 200 page PDF formal documents, public web site
 People – library staff – 3-4 taxonomists – centralized expertise
 Enterprise search, general public
 Twenty Vendors to POC with SAS
20
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Proof of Concept (POC) or Pilot
Test Cases from Knowledge Audit – Identify Critical Variables
 Measurable Quality of results is the essential factor
 Design - Real life scenarios, categorization with your content
 Preparation:
 Preliminary analysis of content and users information needs
» Training & test sets of content, search terms & scenarios
 Train taxonomist(s) on software(s)
 Develop taxonomy if none available
 Four week POC – 2 rounds of develop, test, refine / Not OOB
 Need SME’s as test evaluators – also to do an initial
categorization of content
 Majority of time is on auto-categorization
 POC use cases – basic features needed for initial projects
21
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Proof of Concept: Design of the Team
Traditional Candidates – IT&, Business, Library
 IT - Experience with software purchases, needs assess, budget
 Search/Categorization is unlike other software, deeper look
 Business -understand business, focus on business value
 They can get executive sponsorship, support, and budget
 But don’t understand information behavior, semantic focus
 Library, KM - Understand information structure
 Experts in search experience and categorization
 But don’t understand business or technology
 Interdisciplinary team headed up by info professionals
 Enterprise Architecture (CAO?)
 Make a better decision, foundation for development
22
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Proof of Concept -- Value of POC
 Selection of best product(s)
 Identification and development of infrastructure elements –
taxonomies, metadata – standards and publishing process
 Training by doing –SME’s learning categorization,
Library/taxonomist learning business language
 Understand effort level for categorization, application
 Test suitability of existing taxonomies for range of
applications
 Explore application issues – example – how accurate does
categorization need to be for that application – 80-90%
 Develop resources – categorization taxonomies, entity
extraction catalogs/rules
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
POC and Early Development: Risks and Issues
 CTO Problem –This is not a regular software process
 Semantics is messy not just complex
 30% accuracy isn’t 30% done – could be 90%
 Variability of human categorization
 Categorization is iterative, not “the program works”
 Need realistic budget and flexible project plan
 Anyone can do categorization
 Librarians often overdo, SME’s often get lost (keywords)
 Meta-language issues – understanding the results
 Need to educate IT and business in their language
 Need for a quick start win – deep understanding
24
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Building on the Foundation
 Initial Projects – Processes to apply Text Analytics
 New Electronic Publishing Process
» Use text analytics to tag, new hybrid workflow
 New Enterprise Search
» Build faceted navigation on metadata, extraction
 Applications – Business Intelligence – Behavior Prediction
» Combine with big data – free text in surveys, social media
» Internet application – spider and categorize, extract
 Interdisciplinary Processes – Integrating the pieces
 Content management, search, InfoApps
 Social Media – adding intelligence to simple sentiment
 In depth project based on your real needs, not buzz
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Beyond Simple Sentiment
 Beyond Good and Evil (positive and negative)
 Social Media is approaching next stage (growing up)
 Where is the value? How get better results?
 Sentiment Analysis is easy to do - wrong
 Importance of Context – around positive and negative words
 Rhetorical reversals – “I was expecting to love it”
 Issues of sarcasm, (“Really Great Product”), slanguage
 Limited value of Positive and Negative
 Early Categorization – Politics or Sports
 Degrees of intensity, complexity of emotions and documents
 Addition of focus on behaviors – why someone calls a support
center – and likely outcomes
26
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
27
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Behavior Prediction – Telecom Customer Service
 Problem – distinguish customers likely to cancel from mere
threats
 Analyze customer support notes
 General issues – creative spelling, second hand reports
 Develop categorization rules
 First – distinguish cancellation calls – not simple
 Second - distinguish cancel what – one line or all
 Third – distinguish real threats
28
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick Start for Text Analytics
Behavior Prediction – Telecom Customer Service
 Basic Rule
 (START_20, (AND,
 (DIST_7,"[cancel]", "[cancel-what-cust]"),
 (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", “[if]”)))))
 Examples:
 customer called to say he will cancell his account if the does not stop
receiving a call from the ad agency.
 cci and is upset that he has the asl charge and wants it off or her is
going to cancel his act
 ask about the contract expiration date as she wanted to cxl teh acct
 Combine sophisticated rules with sentiment statistical training
and Predictive Analytics and behavior monitoring
29
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Quick and Smart Start for Text Analytics
Conclusion
 Start with self-knowledge – what will you use it for?
 Knowledge audit – content, people, activities, technology
 Create Information Strategy / Vision – Text Analytics Focus
 Focus on business value and solutions not the technology
 POC/ Pilot – your content, real world scenarios
 Foundation for development, experience with software, process
 Integration – need an integrated team on integrated platform
 Initial Development Projects – within strategic context, POC+
 Quick Win – real value, learn, and can be used to sell the vision
 Text Analytics is all about context – in enterprise and in text
 Think Big, Start Small, Scale Fast
30
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012
Questions
Tom Reamy
Chief Knowledge Architect
KAPS Group
http://www.kapsgroup.com
Program Chair – Text Analytics World
Copyright © 2012, SAS Institute Inc. All rights reserved.
#analytics2012