Quick Start for Text Analytics Tom Reamy Chief Knowledge Architect KAPS Group http://www.kapsgroup.com Program Chair – Text Analytics World Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Agenda Introduction – State of Text Analytics What is Text Analytics? Why need a Quick Start Process? Quick Start Process - Foundation Knowledge Audit Text Analytics Software Evaluation Proof of Concept / Pilot Building on the Foundation From POC to Development Conclusions Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics KAPS Group: General Knowledge Architecture Professional Services – Network of Consultants Program Chair – Text Analytics World – April 17-18, San Francisco Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics, Services: Strategy – IM & KM - Text Analytics, Social Media, Integration Taxonomy/Text Analytics development, consulting, customization Text Analytics Quick Start – Audit, Evaluation, Pilot Social Media: Text based applications – design & development Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc. Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Current State of Text Analytics Big Data Big Text is bigger than Big Data / 90% of information Text Analytics as pre-processing – text into data, new variables and structure for predictive analytics Text Mining as pre-processing for text analytics – discover patterns New models, Watson – ensemble methods, modules - puns Social Media / Sentiment Analysis Next stage – beyond simple sentiment, business value Importance of context New emotion taxonomies Analysis of conversations, speaker relationships, strength of social ties Expertise Analysis, crowd sourcing, political Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Current State of Text Analytics Enterprise Text Analytics (ETA) ETA is the platform for unstructured text applications Enterprise Content Management and Search – failed Taxonomies and metadata – failed (Mind the Gap) Wide Range of InfoApps – BI,CI, Fraud detection, social media Has Text Analytics Arrived? Survey – 28% just getting started, 11% not yet, 17.5% ETA What is holding it back? Lack of clarity about business value, what it is – 55% Lack of strategic vision, real examples Gartner – new report on text analytics Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Introduction: Future Directions What is Text Analytics Good For? 6 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics What is Text Analytics? Text Mining – NLP, statistical, predictive, machine learning Semantic Technology – ontology, fact extraction Extraction – entities – known and unknown, concepts, events Catalogs with variants, rule based Sentiment Analysis Objects and phrases – statistics & rules – Positive and Negative Auto-categorization Training sets, Terms, Semantic Networks Rules: Boolean - AND, OR, NOT Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE Disambiguation - Identification of objects, events, context Build rules based, not simply Bag of Individual Words 7 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Case Study – Categorization & Sentiment 8 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Case Study – Categorization & Sentiment 9 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Case Study – Taxonomy Development 10 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Case Study – Taxonomy Development 11 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Need for a Quick Start Text Analytics is weird, a bit academic, and not very practical » It involves language and thinking and really messy stuff On the other hand, it is really difficult to do right (Rocket Science) Organizations don’t know what text analytics is and what it is for Survey shows - need two things: » Strategic vision of text analytics in the enterprise » Business value, problems solved, information overload » Text Analytics as platform for information access » Real life functioning program showing value and demonstrating an understanding of what it is and does Text Analytics is more than Text Mining Enterprise TA or One Application – same process, different scale 12 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics The start and foundation: Knowledge Architecture Audit Knowledge Map - Understand what you have, what you are, what you want The foundation of the foundation Contextual interviews, content analysis, surveys, focus groups, ethnographic studies, Text Mining Category modeling – “Intertwingledness” -learning new categories influenced by other, related categories Monkey, Panda, Banana Natural level categories mapped to communities, activities Novice prefer higher levels Balance of informative and distinctiveness 4 Dimensions – Content, People, Technology, Activities 13 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Knowledge Audit: Contextual Interviews Organizational Context – Free Form Management, enterprise wide function What is the size and makeup of the organizational units that will be impacted by this project? Are there special constituencies that have to be taken into account? What is the level of political support for this project? Any opposition? What are your major information or knowledge access issues? These determine approach and effort for each area » Content Map often the most complex and time-consuming Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Knowledge Audit: Information Interviews Structured, feed survey – list options Could you describe the kinds of information activities that you and your group engage in? (types of content, search, write proposals, research?) How often? How do they carry out these activities? Qualitative Research What are your major information or knowledge access issues -examples? In an ideal world, how would information access work at your organization? What is right and what’s wrong with today’s methods Output = map of information communities, activities Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Knowledge Audit: Map of Information Technology Content Management – ability to integrate text analytics Search – Integration of text analytics – Beyond XML Metadata – facets Existing Text Analytics – Underutilization? Text Mining – often separate silo, how integrate? Taxonomy Management, Databases, portals Semantic Technologies, Wiki’s Visualization software Applications – business intelligence, customer support, etc. Map- often reveals multiple redundancies, technology silos Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Knowledge Audit: Content Analysis Content Map – size, format, audience, purpose, priority, special features, data and text, etc. Content Creation – content management workflow and real life workflow, publishing process – policy Integrate external content – little control, massive scale Content Structure –taxonomies, vocabularies, metadata standards Drill Down, theme discovery Search log analysis Folksonomy if available Text Mining, categorization exploration, clustering Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Knowledge Audit- Output Strategic Vision and Change Management » Format – reports, enterprise ontology » Political/ People and technology requirements Business Benefits and ROI » Enterprise Text Analytics- information overload – IDC study: » Per 1,000 people = $ 22.5 million a year » 30% improvement = $6.75 million a year » Add own stories – especially cost of bad information Strategic Project Plan and Road Map » Text Analytics support requirements –taxonomies, resources » Map of Initial Projects – and selection criteria Software Evaluation, Proof of Concept or Initial Project? Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Evaluation Process & Methodology Build on Knowledge Audit Deep well articulated evaluation Standard Software Evaluation – if needed Filter One- Ask Experts - reputation, research – Gartner, etc. » Market strength of vendor, platforms, etc. » Feature scorecard – minimum, must have, filter to top 3-6 Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus Filter Three – In-Depth Demo – 3-6 vendors Goal – Eliminate the unfit Selection of best 1-3 – preparation for POC Input into design of POC 19 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Software Evaluation / Sole Source Amdocs Customer Support Notes – short, badly written, millions of documents Total Cost, multiple languages, Integration with their application Distributed expertise – SME, not classification Platform – resell full range of services, Sentiment Analysis Twenty to Four to POC IBM vs. SAS to SAS GAO Library of 200 page PDF formal documents, public web site People – library staff – 3-4 taxonomists – centralized expertise Enterprise search, general public Twenty Vendors to POC with SAS 20 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Proof of Concept (POC) or Pilot Test Cases from Knowledge Audit – Identify Critical Variables Measurable Quality of results is the essential factor Design - Real life scenarios, categorization with your content Preparation: Preliminary analysis of content and users information needs » Training & test sets of content, search terms & scenarios Train taxonomist(s) on software(s) Develop taxonomy if none available Four week POC – 2 rounds of develop, test, refine / Not OOB Need SME’s as test evaluators – also to do an initial categorization of content Majority of time is on auto-categorization POC use cases – basic features needed for initial projects 21 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Proof of Concept: Design of the Team Traditional Candidates – IT&, Business, Library IT - Experience with software purchases, needs assess, budget Search/Categorization is unlike other software, deeper look Business -understand business, focus on business value They can get executive sponsorship, support, and budget But don’t understand information behavior, semantic focus Library, KM - Understand information structure Experts in search experience and categorization But don’t understand business or technology Interdisciplinary team headed up by info professionals Enterprise Architecture (CAO?) Make a better decision, foundation for development 22 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Proof of Concept -- Value of POC Selection of best product(s) Identification and development of infrastructure elements – taxonomies, metadata – standards and publishing process Training by doing –SME’s learning categorization, Library/taxonomist learning business language Understand effort level for categorization, application Test suitability of existing taxonomies for range of applications Explore application issues – example – how accurate does categorization need to be for that application – 80-90% Develop resources – categorization taxonomies, entity extraction catalogs/rules Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics POC and Early Development: Risks and Issues CTO Problem –This is not a regular software process Semantics is messy not just complex 30% accuracy isn’t 30% done – could be 90% Variability of human categorization Categorization is iterative, not “the program works” Need realistic budget and flexible project plan Anyone can do categorization Librarians often overdo, SME’s often get lost (keywords) Meta-language issues – understanding the results Need to educate IT and business in their language Need for a quick start win – deep understanding 24 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Building on the Foundation Initial Projects – Processes to apply Text Analytics New Electronic Publishing Process » Use text analytics to tag, new hybrid workflow New Enterprise Search » Build faceted navigation on metadata, extraction Applications – Business Intelligence – Behavior Prediction » Combine with big data – free text in surveys, social media » Internet application – spider and categorize, extract Interdisciplinary Processes – Integrating the pieces Content management, search, InfoApps Social Media – adding intelligence to simple sentiment In depth project based on your real needs, not buzz Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Beyond Simple Sentiment Beyond Good and Evil (positive and negative) Social Media is approaching next stage (growing up) Where is the value? How get better results? Sentiment Analysis is easy to do - wrong Importance of Context – around positive and negative words Rhetorical reversals – “I was expecting to love it” Issues of sarcasm, (“Really Great Product”), slanguage Limited value of Positive and Negative Early Categorization – Politics or Sports Degrees of intensity, complexity of emotions and documents Addition of focus on behaviors – why someone calls a support center – and likely outcomes 26 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 27 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Behavior Prediction – Telecom Customer Service Problem – distinguish customers likely to cancel from mere threats Analyze customer support notes General issues – creative spelling, second hand reports Develop categorization rules First – distinguish cancellation calls – not simple Second - distinguish cancel what – one line or all Third – distinguish real threats 28 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick Start for Text Analytics Behavior Prediction – Telecom Customer Service Basic Rule (START_20, (AND, (DIST_7,"[cancel]", "[cancel-what-cust]"), (NOT,(DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", “[if]”))))) Examples: customer called to say he will cancell his account if the does not stop receiving a call from the ad agency. cci and is upset that he has the asl charge and wants it off or her is going to cancel his act ask about the contract expiration date as she wanted to cxl teh acct Combine sophisticated rules with sentiment statistical training and Predictive Analytics and behavior monitoring 29 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Quick and Smart Start for Text Analytics Conclusion Start with self-knowledge – what will you use it for? Knowledge audit – content, people, activities, technology Create Information Strategy / Vision – Text Analytics Focus Focus on business value and solutions not the technology POC/ Pilot – your content, real world scenarios Foundation for development, experience with software, process Integration – need an integrated team on integrated platform Initial Development Projects – within strategic context, POC+ Quick Win – real value, learn, and can be used to sell the vision Text Analytics is all about context – in enterprise and in text Think Big, Start Small, Scale Fast 30 Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012 Questions Tom Reamy Chief Knowledge Architect KAPS Group http://www.kapsgroup.com Program Chair – Text Analytics World Copyright © 2012, SAS Institute Inc. All rights reserved. #analytics2012