Future Directions of Text Analytics

Text Analytics World
Future Directions of Text Analytics
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
 Introduction:
– Current State of Text Analytics
– Survey
 Roadblocks for Text Analytics
– Complexity and Customization
 Fast and Slow (Thinking) Text Analytics
– Building Text Analytics Brains
 New Methods for Text Analytics
– Lessons from Watson
– Some Wild New Ideas and Approaches
 Questions
Introduction: KAPS Group
 Knowledge Architecture Professional Services – Network of Consultants
 Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
 Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Quick Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
 Partners – SAS, Smart Logic, Expert System, SAP, IBM, FAST,
Concept Searching, Attensity, Clarabridge, Lexalytics
 Projects – Portals, taxonomy, Text analytics – news, expertise location,
information strategy, text analytics evaluation, Quick Start in Text A.
 Clients: Genentech, Novartis, Northwestern Mutual Life, Financial
Times, Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, World Bank, etc.
 Presentations, Articles, White Papers – www.kapsgroup.com
Introduction:
What is Text Analytics?
 Text Mining – NLP, statistical, predictive, machine learning
 Semantic Technology – ontology, fact extraction
 Extraction – entities – known and unknown, concepts, events
– Catalogs with variants, rule based
 Sentiment Analysis
– Objects and phrases – statistics & rules – Positive and Negative
 Auto-categorization
– Training sets, Terms, Semantic Networks
– Rules: Boolean – AND, OR, NOT
– Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
– Disambiguation – Identification of objects, events, context
– Build rules based, not simply Bag of Individual Words
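The rule operators listed above can be illustrated with a small sketch. It is not any vendor's rule syntax: `dist`, `orddist`, and the sample rule `is_bank_finance` are hypothetical stand-ins for operators such as DIST(#) and ORDDIST#, and the tokenizer is deliberately simple.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())

def positions(tokens, term):
    """Indexes at which a single-word term occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def dist(tokens, a, b, n):
    """True if a and b occur within n tokens of each other, in any order."""
    return any(abs(i - j) <= n
               for i in positions(tokens, a)
               for j in positions(tokens, b))

def orddist(tokens, a, b, n):
    """True if b follows a within n tokens (order matters)."""
    return any(0 < j - i <= n
               for i in positions(tokens, a)
               for j in positions(tokens, b))

def is_bank_finance(text):
    """Hypothetical rule: 'bank' near a finance term AND NOT a river context."""
    tokens = tokenize(text)
    near_finance = (dist(tokens, "bank", "loan", 5)
                    or dist(tokens, "bank", "account", 5))
    river_context = "river" in tokens
    return near_finance and not river_context
```

A bag-of-words approach would treat "river bank" and "bank loan" alike because both contain "bank"; the proximity and NOT conditions are what let the rule disambiguate.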
Current State of Text Analytics
 History – academic research, focus on NLP
 Inxight – out of Xerox PARC
– Moved TA from academic and NLP to auto-categorization, entity
extraction, and Search-Meta Data
 Explosion of companies – many based on Inxight extraction with
some analytical-visualization front ends
– Half from 2008 are gone – Lucky ones got bought
 Early applications – News aggregation and Enterprise Search
 Second Wave = shift to sentiment analysis
 Enterprise search down, taxonomy up – need for metadata – not
great results from either – 10 years of effort for what?
 Text Analytics is growing – But
Current State of Text Analytics
 Current Market: 2012 – exceeds $1 billion for text analytics (10% of total
Analytics)
 Growing 20% a year
 Search is 33% of total market
 Other major areas:
– Sentiment and Social Media Analysis, Customer Intelligence
– Business Intelligence, Range of text based applications
 Fragmented marketplace – full platform, low level, specialty
– Embedded in content management, search. No clear leader.
Current State of Text Analytics: Vendor Space
 Taxonomy Management – SchemaLogic, PoolParty
 From Taxonomy to Text Analytics
– Data Harmony, MultiTes
 Extraction and Analytics
– Linguamatics (Pharma), Temis, whole range of companies
 Business Intelligence – ClearForest, Inxight
 Sentiment Analysis – Attensity, Lexalytics, Clarabridge
 Open Source – GATE
 Stand alone text analytics platforms – IBM, SAS, SAP, Smart
Logic, Expert System, Basis, Open Text, Megaputer, Temis,
Concept Searching
 Embedded in Content Management, Search
– Autonomy, FAST, Endeca, Exalead, etc.
Future Directions: Survey Results
 28% just getting started, 11% not yet
 What factors are holding back adoption of TA?
– Lack of clarity about value of TA – 23.4%
– Lack of knowledge about TA – 17.0%
– Lack of senior management buy-in – 8.5%
– Don’t believe TA has enough business value – 6.4%
 Other factors
– Financial Constraints – 14.9%
– Other priorities more important – 12.8%
 Lack of articulated strategic vision – by vendors, consultants,
advocates, etc.
Primary Obstacle: Complexity
 Usability of software is one element
 More important is difficulty of models:
– Conceptual and document models
 General need – more structure but also more flexible kinds
of structure and interactions
 More modules and more ways of combining or interacting –
IBM – select best answer, but there are other ways:
– Competitive – learn and evolve – Feedback!
– Cooperative – join together to form higher level structures
Primary Obstacle: Complexity: Partial Solutions
 Build complex semantic networks – basic concepts – good for
demo, gets a start, but very complex to build on
 Library of taxonomies – but all need major customization and
often are not a good starting point – different types of taxonomies
– index vs. categorization
 Customization – Text Analytics – heavily context dependent
– Content, Questions, Taxonomy-Ontology
– Level of specificity – Telecommunications
– Specialized vocabularies, acronyms
– Specialized relationships – conceptual and organizational
– How to overcome?
Thinking Fast and Slow – Daniel Kahneman
 System 1 and System 2 – Daniel Kahneman
 System 1 – fast and automatic – little conscious control
 Represents categories as prototypes – stereotypes
– Norms for immediate detection of anomalies – distinguish the
surprising from the normal
– fast detection of simple differences, detect hostility in a voice,
find best chess move (if a master)
– Priming / Anchoring – susceptible to systematic errors
• Temperature Example
– Biased to believe and confirm
– Focuses on existing evidence (ignores missing – WYSIATI)
Thinking Fast and Slow
 System 2 – Complex, effortful judgments and calculations
– System 2 is the only one that can follow rules, compare objects on
several attributes, and make deliberate choices
– Understand complex sentences
– Check the validity of a complex logical argument
– Focus attention – can make people blind to all else – Invisible Gorilla
 Similar to traditional dichotomies – Tacit – Explicit, etc
 Basic Design – System 1 is basic to most experiences, and
System 2 takes over when things get difficult – conscious
control
 Text Analysis and Text Mining / Auto-Cat and TA Cat
System 1 & 2 – and Text Analytics Approaches
 “Automatic Categorization” – System 1 prototypes
– Limited value -- only works in simple environments
– Shallow categories with large differences
– Not open to conscious control
 System 2 – categories – complex, minute differences, deep
categories
 Together:
– Choose one or other for some contexts
– Combine both – need to develop new kinds of categories
and/or new ways to combine?
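One possible reading of "combine both" is sketched below: a fast prototype match (System 1) handles clear cases, and explicit rules (System 2) take over when the fast match is ambiguous. The prototypes, the rule, the category names, and the `min_margin` threshold are all hypothetical, chosen only to make the control flow concrete.

```python
from collections import Counter
import math

def vectorize(text):
    """Bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical category prototypes (System 1 "stereotypes").
PROTOTYPES = {
    "billing": vectorize("invoice charge payment bill refund"),
    "network": vectorize("outage signal coverage dropped call"),
}

def system1(text):
    """Fast pass: best prototype plus its margin over the runner-up."""
    doc = vectorize(text)
    scored = sorted(((cosine(doc, p), c) for c, p in PROTOTYPES.items()),
                    reverse=True)
    (best, cat), (second, _) = scored[0], scored[1]
    return cat, best - second

def system2(text):
    """Slow pass: an explicit, hand-written rule (hypothetical)."""
    words = set(text.lower().split())
    if "refund" in words and "outage" in words:
        return "billing"  # refund requests caused by an outage go to billing
    return None

def categorize(text, min_margin=0.2):
    """Use System 1 when it is confident; otherwise fall back to rules."""
    cat, margin = system1(text)
    if margin >= min_margin:
        return cat, "system1"
    ruled = system2(text)
    return (ruled, "system2") if ruled else (cat, "system1-low-confidence")
```

For example, `categorize("payment for invoice")` is settled by the prototype match alone, while `categorize("refund after outage")` matches both prototypes equally and is decided by the rule.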
Text Mining and Text Analytics
 Text Analytics and Big Data enrich each other
– Data tells you what people did, TA tells you why
 Text Analytics – pre-processing for TM
– Discover additional structure in unstructured text
– Behavior Prediction – adding depth in individual documents
– New variables for Predictive Analytics, Social Media Analytics
– New dimensions – 90% of information, 50% using Twitter analysis
 Text Mining for TA – Semi-automated taxonomy development
– Apply data methods, predictive analytics to unstructured text
– New Models – Watson ensemble methods, reasoning apps
 Extraction – smarter extraction – sections of documents, Boolean,
advanced rules – drug names, adverse events – major mention
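The "smarter extraction" point (sections of documents, drug names, major mention) might look like the following sketch. The drug lexicon, the `TITLE:`/`ABSTRACT:`/`BODY:` section markers, the section weights, and the `major_threshold` value are assumptions made for illustration, not part of any product.

```python
import re

DRUG_LEXICON = {"aspirin", "warfarin"}                     # hypothetical catalog
SECTION_WEIGHTS = {"title": 3, "abstract": 2, "body": 1}   # assumed weights

def split_sections(doc):
    """Split on lines such as 'TITLE:' or 'BODY:'; returns {section: text}."""
    sections, current = {}, "body"
    for line in doc.splitlines():
        marker = re.match(r"^(TITLE|ABSTRACT|BODY):\s*(.*)$", line)
        if marker:
            current = marker.group(1).lower()
            line = marker.group(2)
        sections[current] = sections.get(current, "") + " " + line
    return sections

def drug_mentions(doc, major_threshold=3):
    """Score each drug by section-weighted counts and flag major mentions."""
    scores = {}
    for section, text in split_sections(doc).items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in DRUG_LEXICON:
                scores[token] = scores.get(token, 0) + SECTION_WEIGHTS[section]
    return {drug: ("major" if score >= major_threshold else "minor")
            for drug, score in scores.items()}
```

A drug named in the title counts for more than one mentioned in passing in the body, which is the distinction a plain term count would miss.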
Integration of Text and Data Analytics
 Expertise Location: Case Study: Data and Text
 Data Sources:
– HR Information: Geography, Title-Grade, years of experience,
education, projects worked on, hours logged, etc.
 Text Sources:
– Documents authored (major and minor authors) – data and/or text
– Documents associated (teams, themes) – categorized to a taxonomy
– Experience description – extract concepts, entities
 Self-reported expertise – requires normalization, quality control
 Complex judgments:
– Faceted application
– Ensemble methods – combine evaluations
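A minimal sketch of "combine evaluations" for expertise location follows, assuming each data or text source has already been reduced to a score between 0 and 1 per person. The source names, weights, and candidates are hypothetical, and a weighted average is only one of many possible ensemble rules.

```python
def combine_evaluations(candidates, weights):
    """Weighted average of per-source scores; a missing source counts as 0."""
    total = sum(weights.values())
    ranked = []
    for name, scores in candidates.items():
        combined = sum(w * scores.get(src, 0.0)
                       for src, w in weights.items()) / total
        ranked.append((round(combined, 3), name))
    return sorted(ranked, reverse=True)

# Hypothetical weights: authored documents count most, self-reports least.
WEIGHTS = {"hr_data": 1.0, "authored_docs": 2.0, "self_reported": 0.5}

CANDIDATES = {
    "alice": {"hr_data": 0.6, "authored_docs": 0.9, "self_reported": 0.5},
    "bob": {"hr_data": 0.8, "self_reported": 1.0},
}

ranking = combine_evaluations(CANDIDATES, WEIGHTS)
```

Here the low weight on `self_reported` reflects the slide's caution that self-reported expertise needs normalization and quality control.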
Text Analytics World : Building on the Platform
Expertise Analysis
 Expertise Characterization for individuals, communities,
documents, and sets of documents
 Experts prefer lower, subordinate levels
– Novice & General – high and basic level
 Experts' language structure is different
– Focus on procedures over content
 Applications:
– Business & Customer intelligence – add expertise to
sentiment
– Deeper research into communities, customers
– Expertise location – Generate automatic expertise
characterization based on documents
New Approaches – Applied Watson
 Key concept is that multiple approaches are required – and
a way to combine them – confidence score
 Aim = 85% accuracy of 50% of questions (Ken Jennings –
92% of 62%)
 Used a combination of structure and text search
 Massive parallelism, many experts, pervasive confidence
estimation, integration of shallow and deep knowledge
 Key step – fast filtering to get to top 100 (System 1)
 Then – intense analysis to evaluate (System 2) – multiple
scoring
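The two-stage flow described above (cheap filtering, then intensive scoring, then answering only when confident) can be sketched as follows. This is an illustration of the idea, not Watson's implementation: the keyword-overlap filter, the averaging of scorers, the candidate fields, and the threshold are all assumptions.

```python
def fast_filter(question, candidates, k=100):
    """System 1 stand-in: cheap keyword overlap keeps the top k candidates."""
    q_words = set(question.lower().split())

    def overlap(candidate):
        return len(q_words & set(candidate["evidence"].lower().split()))

    return sorted(candidates, key=overlap, reverse=True)[:k]

def deep_score(question, candidate, scorers):
    """System 2 stand-in: average of several independent scorers in 0..1."""
    return sum(s(question, candidate) for s in scorers) / len(scorers)

def answer(question, candidates, scorers, k=100, threshold=0.5):
    """Return (text, confidence), or None when confidence is too low."""
    shortlist = fast_filter(question, candidates, k)
    if not shortlist:
        return None
    best = max(shortlist, key=lambda c: deep_score(question, c, scorers))
    confidence = deep_score(question, best, scorers)
    return (best["text"], confidence) if confidence >= threshold else None

def overall_correct(accuracy, attempted):
    """Share of all questions answered correctly."""
    return accuracy * attempted
```

The threshold is what trades coverage for accuracy. With the figures on the slide, the aim of 85% accuracy on 50% of questions works out to `overall_correct(0.85, 0.50)`, about 42.5% of all questions, against about 57% for 92% of 62%.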
New Approaches – Applied Watson
 Multiple sources – taxonomies, ontologies, etc.
 Special modules – temporal and spatial reasoning –
anomalies
 Taxonomic, Geospatial, Temporal, Source Reliability,
Gender, Name Consistency, Relational, Passage Support,
Theory Consistency, etc.
 Merge answer scores before ranking
 3 Years, 20 researchers of all types
 Got to 70% of 70% – in two hours
 More difficult answers / more complete questions
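"Merge answer scores before ranking" can be read as pooling the evidence for equivalent answers so that variants of the same answer do not compete with each other. The sketch below assumes a hand-made alias table and simple score addition; the slides do not say how the merging was actually done.

```python
from collections import defaultdict

# Hypothetical alias table mapping answer variants to one canonical form.
ALIASES = {
    "jfk": "john f. kennedy",
    "john kennedy": "john f. kennedy",
}

def canonical(answer):
    """Normalize case and whitespace, then map known variants."""
    key = answer.strip().lower()
    return ALIASES.get(key, key)

def merge_and_rank(scored_answers):
    """Sum the scores of equivalent answers, then rank by merged score."""
    merged = defaultdict(float)
    for text, score in scored_answers:
        merged[canonical(text)] += score
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

Without merging, `("Richard Nixon", 0.5)` would outrank both `("JFK", 0.3)` and `("John Kennedy", 0.3)`; after merging, the two Kennedy variants pool their scores and rank first.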
New Approaches: Adding Structure to Content
 Contexts – whole range of types of context
– Document types-purpose, Textual complexity, formats
 Categorization by page, sections (text markers) or even
sentence or phrase – Key – remember what the last page
was
– [Key – documents are not unstructured – they have a variety of
structures]
 Use generic components – like the level of generality of
terms or concepts (general and context specific)
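Categorizing page by page while "remembering what the last page was" might be sketched as below. The section names, the header markers, and the keyword sets are hypothetical; the point is only that a page with no signal of its own inherits the label of the page before it.

```python
import re

# Hypothetical keyword evidence for each section type.
KEYWORDS = {
    "methods": {"protocol", "sample", "procedure"},
    "results": {"increase", "decrease", "significant"},
}

def categorize_sections(pages):
    """Label each page; a page with no signal keeps the previous label."""
    labels, last = [], "unknown"
    for page in pages:
        header = re.match(r"\s*(methods|results)\b", page, re.IGNORECASE)
        if header:
            # An explicit text marker at the top of the page wins.
            last = header.group(1).lower()
        else:
            words = set(re.findall(r"[a-z]+", page.lower()))
            hits = {cat: len(words & kw) for cat, kw in KEYWORDS.items()}
            best = max(hits, key=hits.get)
            if hits[best] > 0:
                last = best
        labels.append(last)
    return labels
```

A page such as "Further details on equipment." carries no marker and no keywords, so it is treated as a continuation of the section before it instead of being categorized in isolation.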
New Approaches
 Idea – build a higher level language – like tutoring systems
– More complex primitives
 Idea – Crowd sourcing – to evolve better structures – how to
design to avoid design by committee – other side of
wisdom of crowds
 Design TA Game – 1,000s to play and evolve
 Partner with MOOC – example – better essay evaluation –
avoid gaming the system – lots of multi-syllabic words –
nonsense
– Also to enhance software / modules
New Directions in Text Analytics
Conclusions
 Text Analytics is growing – but
 Big obstacles remain
– Strategic Vision of text analytics in the enterprise, applications
– Concrete and quick application to drive acceptance
– Software still too complex, un-integrated
 New models are being developed
– Cognitive science – System 1 and 2, AI – brains that learn
– Watson like integrated approaches
 Overcome complexity – modules (System 1/ Standard) with new
ways of integrating (System 2 / Customized) – smarter and easier
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
http://www.kapsgroup.com
Upcoming: Taxonomy Boot Camp – KMWorld -DC, Nov 3-6
Workshop on Text Analytics
Text Analytics World – San Francisco, March 17-19
Future Directions for Text Analytics
Social Media: Beyond Simple Sentiment
 Analysis of Conversations- Higher level context
– Techniques: self-revelation, humor, sharing of secrets,
establishment of informal agreements, private language
– Detect relationships among speakers and changes over time
– Strength of social ties, informal hierarchies
 Combination with other techniques
– Expertise Analysis – plus Influencers
– Quality of communication (strength of social ties, extent of private
language, amount and nature of epistemic emotions – confusion+)
– Experiments – Pronoun Analysis – personality types
– Analysis of phrases, multiple contexts – conditionals, oblique
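As a starting point for the pronoun analysis experiment, the sketch below computes how often each pronoun group appears per 100 words. The grouping is an assumption, and the slides do not say how pronoun rates map to personality types, so the code stops at the rates.

```python
import re
from collections import Counter

# Assumed pronoun groups; extend as needed.
PRONOUN_GROUPS = {
    "first_singular": {"i", "me", "my", "mine", "myself"},
    "first_plural": {"we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself"},
}

def pronoun_profile(text):
    """Rate of each pronoun group per 100 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for group, members in PRONOUN_GROUPS.items():
            if word in members:
                counts[group] += 1
    total = len(words) or 1  # avoid division by zero on empty text
    return {group: 100.0 * counts[group] / total for group in PRONOUN_GROUPS}
```

Comparing profiles across speakers, or for one speaker over time, is one way to look for the changes in relationships mentioned above.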
Introduction: Personal
 Deep Background: History of Ideas – dissertation – Models of
Historical Knowledge
 Artificial Intelligence research at Stanford AI Lab
 Programming – designed two computer games, educational
software
 Started an Education Software company, CTO
– Height of California recession
 Information Architect – Chiron/Novartis, Schwab Intranet
– Importance of metadata, taxonomy, search – Verity
 From technology to semantics, usability
 From library science to cognitive science
 2002 – started consulting company