Text Analytics World Future Directions of Text Analytics Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com Agenda Introduction: – Current State of Text Analytics – Survey Roadblocks for Text Analytics – Complexity and Customization Fast and Slow (Thinking) Text Analytics – Building Text Analytics Brains New Methods for Text Analytics – Lessons from Watson – Some Wild New Ideas and Approaches Questions 2 Introduction: KAPS Group Knowledge Architecture Professional Services – Network of Consultants Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies Services: – Strategy – IM & KM - Text Analytics, Social Media, Integration – Taxonomy/Text Analytics development, consulting, customization – Text Analytics Quick Start – Audit, Evaluation, Pilot – Social Media: Text based applications – design & development Partners – SAS, Smart Logic, Expert Systems, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics Projects – Portals, taxonomy, Text analytics – news, expertise location, information strategy, text analytics evaluation, Quick Start in Text A. Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc. 3 Presentations, Articles, White Papers – www.kapsgroup.com Introduction: What is Text Analytics? Text Mining – NLP, statistical, predictive, machine learning Semantic Technology – ontology, fact extraction Extraction – entities – known and unknown, concepts, events – Catalogs with variants, rule based Sentiment Analysis – Objects and phrases – statistics & rules – Positive and Negative Auto-categorization – – – – – Training sets, Terms, Semantic Networks Rules: Boolean - AND, OR, NOT Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE Disambiguation - Identification of objects, events, context Build rules based, not simply Bag of Individual Words 4 Text Analytics World Current State of Text Analytics History – academic research, focus on NLP Inxight –out of Zerox Parc – Moved TA from academic and NLP to auto-categorization, entity extraction, and Search-Meta Data Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends – Half from 2008 are gone - Lucky ones got bought Early applications – News aggregation and Enterprise Search – Second Wave = shift to sentiment analysis Enterprise search down, taxonomy up –need for metadata – not great results from either – 10 years of effort for what? Text Analytics is growing – But 5 Text Analytics World Current State of Text Analytics Current Market: 2012 – exceed $1 Bil for text analytics (10% of total Analytics) Growing 20% a year Search is 33% of total market Other major areas: – Sentiment and Social Media Analysis, Customer Intelligence – Business Intelligence, Range of text based applications Fragmented market place – full platform, low level, specialty – Embedded in content management, search, No clear leader. 6 Text Analytics World Current State of Text Analytics: Vendor Space Taxonomy Management – SchemaLogic, Pool Party From Taxonomy to Text Analytics – Data Harmony, Multi-Tes Extraction and Analytics – Linguamatics (Pharma), Temis, whole range of companies Business Intelligence – Clear Forest, Inxight Sentiment Analysis – Attensity, Lexalytics, Clarabridge Open Source – GATE Stand alone text analytics platforms – IBM, SAS, SAP, Smart Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching Embedded in Content Management, Search – Autonomy, FAST, Endeca, Exalead, etc. 7 Future Directions: Survey Results 28% just getting started, 11% not yet What factors are holding back adoption of TA? – – – – Lack of clarity about value of TA – 23.4% Lack of knowledge about TA – 17.0% Lack of senior management buy-in - 8.5% Don’t believe TA has enough business value -6.4% Other factors Financial Constraints – 14.9% – Other priorities more important – 12.8% – Lack of articulated strategic vision – by vendors, consultants, advocates, etc. 8 Text Analytics World Primary Obstacle: Complexity Usability of software is one element More important is difficulty of models: – Conceptual and document models General need – more structure but also more flexible kinds of structure and interactions More modules and more ways of combining or interacting – IBM – select best answer but others – – Competitive – learn and evolve – Feedback! Cooperative – join together to form higher level structures 9 Text Analytics World Primary Obstacle: Complexity: Partial Solutions Build complex semantic networks – basic concepts – good for demo, gets a start, but very complex to build on Library of taxonomies – but all need major customization and often are not a good starting point – different types of taxonomies – index vs. categorization Customization – Text Analytics– heavily context dependent – Content, Questions, Taxonomy-Ontology – Level of specificity – Telecommunications – Specialized vocabularies, acronyms – Specialized relationships – conceptual and organizational – How overcome? 10 Text Analytics World Thinking Fast and Slow – Daniel Kahneman System 1 and System 2 – Daniel Kahneman System 1 – fast and automatic – little conscious control Represents categories as prototypes – stereotypes – Norms for immediate detection of anomalies – distinguish the surprising from the normal – fast detection of simple differences, detect hostility in a voice, find best chess move (if a master) – Priming / Anchoring – susceptible to systemic errors • Temperature Example – Biased to believe and confirm – Focuses on existing evidence (ignores missing – WYSIATI) . 11 Text Analytics World Thinking Fast and Slow System 2 – Complex, effortful judgments and calculations – System 2 is the only one that can follow rules, compare objects on several attributes, and make deliberate choices – Understand complex sentences – Check the validity of a complex logical argument – Focus attention – can make people blind to all else – Invisible Gorilla Similar to traditional dichotomies – Tacit – Explicit, etc Basic Design – System 1 is basic to most experiences, and System 2 takes over when things get difficult – conscious control Text Analysis and Text Mining / Auto-Cat and TA Cat 12 Text Analytics World System 1 & 2 – and Text Analytics Approaches “Automatic Categorization” – System 1 prototypes – Limited value -- only works in simple environments – Shallow categories with large differences – Not open to conscious control System 2 – categories – complex, minute differences, deep categories Together: – Choose one or other for some contexts – Combine both – need to develop new kinds of categories and/or new ways to combine? 13 Text Analytics World Text Mining and Text Analytics Text Analytics and Big Data enrich each other – Data tells you what people did, TA tells you why Text Analytics – pre-processing for TM – Discover additional structure in unstructured text – Behavior Prediction – adding depth in individual documents – New variables for Predictive Analytics, Social Media Analytics – New dimensions – 90% of information, 50% using Twitter analysis Text Mining for TA– Semi-automated taxonomy development – Apply data methods, predictive analytics to unstructured text – New Models – Watson ensemble methods, reasoning apps Extraction – smarter extraction – sections of documents, Boolean, advanced rules – drug names, adverse events – major mention 14 Text Analytics World Integration of Text and Data Analytics Expertise Location: Case Study: Data and Text Data Sources: – HR Information: Geography, Title-Grade, years of experience, education, projects worked on, hours logged, etc. Text Sources: Document authored (major and minor authors) – data and/or text – Documents associated (teams, themes) – categorized to a taxonomy – Experience description – extract concepts, entities – Self-reported expertise – requires normalization, quality control Complex judgments: – – Faceted application Ensemble methods – combine evaluations 15 Text Analytics World : Building on the Platform Expertise Analysis Expertise Characterization for individuals, communities, documents, and sets of documents Experts prefer lower, subordinate levels – Novice & General – high and basic level Experts language structure is different – Focus on procedures over content Applications: – Business & Customer intelligence – add expertise to sentiment – Deeper research into communities, customers – Expertise location- Generate automatic expertise characterization based on documents 16 Text Analytics World New Approaches – Applied Watson Key concept is that multiple approaches are required – and a way to combine them – confidence score Aim = 85% accuracy of 50% of questions (Ken Jennings – 92% of 62% Used a combination of structure and text search Massive parallelism, many experts, pervasive confidence estimation, integration of shallow and deep knowledge Key step – fast filtering to get to top 100 (System 1) Then – intense analysis to evaluate (System 2) – multiple scoring 17 Text Analytics World New Approaches – Applied Watson Multiple sources – taxonomies, ontologies, etc. Special modules – temporal and spatial reasoning – anomalies Taxonomic, Geospatial, Temporal, Source Reliability, Gender, Name Consistency, Relational, Passage Support, Theory Consistency, etc. Merge answer scores before ranking 3 Years, 20 researchers of all types Got to 70% of 70% - in two hours More difficult answers / more complete questions 18 Text Analytics World New Approaches: Adding Structure to Content Contexts – whole range of types of context – Document types-purpose, Textual complexity, formats Categorization by page, sections (text markers) or even sentence or phrase – Key – remember what the last page was – [Key– documents are not unstructured – they have a variety of structures] Use generic components – like the level of generality of terms or concepts (general and context specific) 19 Text Analytics World New Approaches Idea – build a higher level language – like tutoring systems – More complex primitives IDEA – Crowd sourcing – to evolve better structures – how design to avoid design by committee – other side of wisdom of crowds Design TA Game – 1,000’s to play and evolve Partner with MOOC - example – better essay evaluation – avoid gaming the system – lots of multi-syllabic words – nonsense – Also to enhance software / modules 20 New Directions in Text Analytics Conclusions Text Analytics is growing – but Big obstacles remain – – – Strategic Vision of text analytics in the enterprise, applications Concrete and quick application to drive acceptance Software still too complex, un-integrated New models are being developed Cognitive science – System 1 and 2, AI – brains that learn Watson like integrated approaches Overcome complexity – modules (System 1/ Standard) with new ways of integrating (System 2 / Customized) – smarter and easier 21 Questions? Tom Reamy tomr@kapsgroup.com KAPS Group http://www.kapsgroup.com Upcoming: Taxonomy Boot Camp – KMWorld -DC, Nov 3-6 Workshop on Text Analytics Text Analytics World – San Francisco, March 17-19 Future Directions for Text Analytics Social Media: Beyond Simple Sentiment Analysis of Conversations- Higher level context – Techniques: self-revelation, humor, sharing of secrets, establishment of informal agreements, private language – Detect relationships among speakers and changes over time – Strength of social ties, informal hierarchies Combination with other techniques Expertise Analysis – plus Influencers – Quality of communication (strength of social ties, extent of private language, amount and nature of epistemic emotions – confusion+) – Experiments - Pronoun Analysis – personality types – Analysis of phrases, multiple contexts – conditionals, oblique – 23 Introduction: Personal Deep Background: History of Ideas – dissertation – Models of Historical Knowledge Artificial Intelligence research at Stanford AI Lab Programming – designed two computer games, educational software Started an Education Software company, CTO – Height of California recession Information Architect – Chiron/Novartis, Schwab Intranet – Importance of metadata, taxonomy, search – Verity From technology to semantics, usability From library science to cognitive science 2002 – started consulting company 24