Taxonomy and Text Analytics

Best of All Worlds
Text Analytics and Text Mining
and Taxonomy
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
 Text Analytics Introduction
– Text Analytics
– Text Mining
 Case Study – Taxonomy Development
 Text Analytics, Text Mining, and Taxonomy
 Text Analytics Applications – New Directions
– Search & Info Apps
– Expertise Analysis, Behavior Prediction, More
 Conclusions
KAPS Group: General
 Knowledge Architecture Professional Services – Network of Consultants
 Partners – SAS, SAP, IBM, FAST, Smart Logic, Concept Searching
– Attensity, Clarabridge, Lexalytics
 Strategy – IM & KM - Text Analytics, Social Media, Integration
 Services:
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Quick Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
 Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times,
Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, etc.
 Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
 Presentations, Articles, White Papers – http://www.kapsgroup.com
Taxonomy, Text Mining, and Text Analytics
Text Analytics Features
 Noun Phrase Extraction
– Catalogs with variants, rule based dynamic
– Multiple types, custom classes – entities, concepts, events
– Feeds facets
 Summarization
– Customizable rules, map to different content
 Fact Extraction
– Relationships of entities – people-organizations-activities
– Ontologies – triples, RDF, etc.
 Sentiment Analysis
– Rules – Objects and phrases – positive and negative
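A minimal sketch of catalog-based extraction with variants, and of how the extracted entities could feed facets. The catalog entries and the function names `extract_entities` and `to_facets` are illustrative assumptions, not taken from any particular product.

```python
import re

# Hypothetical catalog: canonical name -> (entity class, surface variants).
CATALOG = {
    "IBM": ("organization", ["IBM", "International Business Machines"]),
    "SAS": ("organization", ["SAS", "SAS Institute"]),
    "RDF": ("concept", ["RDF", "Resource Description Framework"]),
}

def extract_entities(text):
    """Return sorted (canonical, class) pairs whose variants occur in text."""
    found = set()
    for canonical, (entity_class, variants) in CATALOG.items():
        for variant in variants:
            # Whole-word, case-insensitive match on each known variant.
            if re.search(r"\b" + re.escape(variant) + r"\b", text, re.IGNORECASE):
                found.add((canonical, entity_class))
                break
    return sorted(found)

def to_facets(entities):
    """Group canonical names by entity class, one facet per class."""
    facets = {}
    for canonical, entity_class in entities:
        facets.setdefault(entity_class, []).append(canonical)
    return facets
```

Calling `to_facets(extract_entities(text))` on a document yields one facet per entity class, which is the sense in which extraction "feeds facets" here.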
Taxonomy, Text Mining, and Text Analytics
Text Analytics Features
 Auto-categorization
– Training sets – Bayesian, Vector space
– Terms – literal strings, stemming, dictionary of related terms
– Rules – simple – position in text (Title, body, url)
– Semantic Network – Predefined relationships, sets of rules
– Boolean – Full search syntax – AND, OR, NOT
– Advanced – DIST (#), PARAGRAPH, SENTENCE
 This is the most difficult to develop
 Build on a Taxonomy
 Combine with Extraction
– If any of a list of entities and other words
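The training-set approach listed above can be sketched with a small vector-space categorizer: each category is represented by the summed term counts of its training documents, and a new document goes to the category with the highest cosine similarity. The tokenization, the function names, and the training data are assumptions for illustration only.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(training_sets):
    """Build one summed term vector per category from its training documents."""
    centroids = {}
    for category, docs in training_sets.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        centroids[category] = total
    return centroids

def categorize(text, centroids):
    """Return the category whose vector is most similar to the text."""
    doc = vectorize(text)
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```

A rule-based categorizer replaces the similarity score with explicit conditions (terms, Boolean operators, proximity), which is more work to develop but easier to inspect and tune.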
Case Study – Categorization & Sentiment
Taxonomy and Text Analytics
Taxonomy, Text Mining, and Text Analytics
Case Study – Taxonomy Development
 Problem – 200,000 new uncategorized documents
 Old taxonomy – need one that reflects change in corpus
 Text mining, entity extraction, categorization
 Content – 250,000 large documents, search logs, etc.
 Bottom Up – terms in documents – frequency, date
 Clustering – suggested categories
 Clustering – chunking for editors
 Entity Extraction – people, organizations, programming languages
 Time savings – only feasible way to scan documents
 Quality – important terms, co-occurring terms
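The bottom-up step above (term frequency and co-occurring terms) can be sketched as follows. This is a simplified illustration, not the tooling used in the case study; the stopword list, the function name `term_stats`, and the choice of document-level co-occurrence are assumptions.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "of", "the", "to"}

def term_stats(documents):
    """Return (document frequency per term, co-occurrence count per term pair)."""
    term_freq = Counter()
    pair_freq = Counter()
    for doc in documents:
        # Unique non-stopword terms in this document.
        terms = {w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOPWORDS}
        term_freq.update(terms)
        # Every unordered pair of terms appearing in the same document.
        pair_freq.update(combinations(sorted(terms), 2))
    return term_freq, pair_freq
```

Frequent terms become candidate taxonomy nodes, and frequently co-occurring pairs hint at which candidates belong under the same category. Editors still decide what goes into the taxonomy.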
Text Analytics Development
New Directions in Social Media
Text Analytics, Text Mining, and Predictive Analytics
 Two Systems of the Brain
– Fast, System 1, Immediate patterns (TM)
– Slow, System 2, Conceptual, reasoning (TA)
 Text Analytics – pre-processing for TM
– Discover additional structure in unstructured text
– Behavior Prediction – adding depth in individual documents
– New variables for Predictive Analytics, Social Media Analytics
– New dimensions – 90% of information
 Text Mining for TA – Semi-automated taxonomy development
– Bottom Up – terms in documents – frequency, date, clustering
– Improve speed and quality – semi-automatic
Text Analytics and Taxonomy
Complementary Information Platform
 Taxonomy provides a consistent and common vocabulary
– Enterprise resource – integrated not centralized
 Text Analytics provides consistent tagging
– Human indexing is subject to inter- and intra-individual variation
 Taxonomy provides the basic structure for categorization
– And candidate terms
 Text Analytics provides the power to apply the taxonomy
– And metadata of all kinds
 Text Analytics and Taxonomy Together – Platform
– Consistent in every dimension
– Powerful and economic
Taxonomy, Text Mining, and Text Analytics
Metadata – Tagging – the Problem
 How do you bridge the gap – taxonomy to documents?
 Tagging documents with taxonomy nodes is tough
– And expensive – central or distributed
 Library staff – experts in categorization, not subject matter
– Too limited, narrow bottleneck
– Often don’t understand business processes and business uses
 Authors – Experts in the subject matter, terrible at categorization
– Intra and Inter inconsistency, “intertwingleness”
– Choosing tags from taxonomy – complex task
– Folksonomy – almost as complex, wildly inconsistent
– Resistance – not their job, cognitively difficult = non-compliance
 Text Analytics is the answer(s)!
Taxonomy, Text Mining, and Text Analytics
Metadata Tagging – the Solution
 Mind the Gap – Manual, Automatic, Hybrid
 All require human effort – issue of where and how effective
 Manual – human effort is tagging (difficult, inconsistent)
 Automatic and Hybrid – human effort is prior to tagging
– Build on expertise – librarians on categorization, SMEs on subject
terms
 Hybrid Model
– Publish Document -> Text Analytics analysis -> suggestions for
categorization, entities, metadata -> present to author
– Cognitive task is simple -> react to a suggestion instead of select
from head or a complex taxonomy
– Feedback – if author overrides -> suggestion for new category
– Facets – Requires a lot of Metadata – Entity Extraction feeds facets
 Hybrid – Automatic is really a spectrum – depends on context
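The Hybrid Model flow above can be sketched in a few lines: the system suggests categories, the author accepts or overrides them, and any override outside the taxonomy is queued as a candidate new category. The trigger-term matching and the names `suggest_tags` and `review` are hypothetical simplifications of what a text analytics engine would do.

```python
def suggest_tags(text, taxonomy_terms):
    """Suggest every category with at least one trigger term in the text.

    taxonomy_terms maps category name -> list of lowercase trigger terms
    (a stand-in for real categorization rules).
    """
    lowered = text.lower()
    return [cat for cat, terms in taxonomy_terms.items()
            if any(term in lowered for term in terms)]

def review(suggestions, author_choice, taxonomy_terms, candidates):
    """Apply the author's reaction to the suggestions.

    author_choice is None when the author accepts the suggestions as-is;
    otherwise it is the list of tags the author prefers. Tags not found in
    the taxonomy are appended to candidates for later taxonomy review.
    """
    final = list(suggestions) if author_choice is None else list(author_choice)
    for tag in final:
        if tag not in taxonomy_terms:
            candidates.append(tag)
    return final
```

The author only reacts to a short list of suggestions, and the `candidates` queue gives taxonomy editors the feedback loop described in the slide.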
Taxonomy, Text Mining, and Text Analytics
Applications: Search
 Multiple Knowledge Structures
– Facet – orthogonal dimension of metadata
– Taxonomy – Subject matter / aboutness
– Ontology – Relationships / Facts
• Subject – Verb – Object
 Software - Search, ECM, auto-categorization, entity
extraction, Text Analytics and Text Mining
 People – tagging, evaluating tags, fine tune rules and
taxonomy
 People – Users, social tagging, suggestions
 Rich Search Results – context and conversation
Taxonomy, Text Mining, and Text Analytics
Applications: Search-Based Applications
 Platform for Information Applications
– Content Aggregation
– Duplicate Documents – save millions!
– Text Mining – BI, CI – sentiment analysis
– Combine with Data Mining – disease symptoms, new
• Predictive Analytics
– Social – Hybrid folksonomy / taxonomy / auto-metadata
– Social – expertise, categorize tweets and blogs, reputation
– Ontology – travel assistant – SIRI
 Use your Imagination!
Taxonomy, Text Mining, and Text Analytics
Applications: Expertise Analysis
 Sentiment Analysis to Expertise Analysis (KnowHow)
– Know How, skills, “tacit” knowledge
 Experts write and think differently
 Basic level is lower, more specific
– Levels: Superordinate – Basic – Subordinate
• Mammal – Dog – Golden Retriever
– Furniture – chair – kitchen chair
 Experts organize information around processes, not
subjects
 Build expertise categorization rules
Taxonomy, Text Mining, and Text Analytics
Expertise – application areas
 Taxonomy / Ontology development /design – audience focus
– Card sorting – non-experts use superficial similarities
 Business & Customer intelligence – add expertise to sentiment
– Deeper research into communities, customers
 Text Mining – Expertise characterization of writer, corpus
 eCommerce – Organization/Presentation of information – expert, novice
 Expertise location – Generate automatic expertise characterization based
on documents
 Experiments – Pronoun Analysis – personality types
– Essay Evaluation Software – Apply to expertise characterization
• Model levels of chunking, procedure words over content
Beyond Sentiment: Behavior Prediction
Case Study – Telecom Customer Service
 Problem – distinguish customers likely to cancel from mere threats
 Analyze customer support notes
 General issues – creative spelling, second hand reports
 Develop categorization rules
– First – distinguish cancellation calls – not simple
– Second – distinguish cancel what – one line or all
– Third – distinguish real threats
Beyond Sentiment
Behavior Prediction – Case Study
 Basic Rule
– (START_20, (AND,
(DIST_7, "[cancel]", "[cancel-what-cust]"),
(NOT, (DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
 Examples:
– customer called to say he will cancell his account if the does not stop receiving
a call from the ad agency.
– cci and is upset that he has the asl charge and wants it off or her is going to
cancel his act
– ask about the contract expiration date as she wanted to cxl teh acct
 Combine sophisticated rules with sentiment statistical training and
Predictive Analytics
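One possible reading of the basic rule above, sketched in Python. The operator semantics are assumptions, since the slides do not define them: START_20 is taken to mean that only the first 20 tokens are examined, and DIST_n that two concepts occur within n tokens of each other. The term lists behind each bracketed concept are hypothetical, as is the function name `likely_cancel`.

```python
import re

# Hypothetical concept lists; the slides do not give the real definitions.
CONCEPTS = {
    "cancel": {"cancel", "cancell", "cxl"},
    "cancel-what-cust": {"account", "acct", "act"},
    "one-line": {"line"},
    "restore": {"restore"},
    "if": {"if"},
}

def positions(tokens, concept):
    """Token indexes at which any term of the concept occurs."""
    return [i for i, tok in enumerate(tokens) if tok in CONCEPTS[concept]]

def dist(tokens, concept_a, concept_b, n):
    """Assumed DIST_n: the two concepts occur within n tokens of each other."""
    return any(abs(i - j) <= n
               for i in positions(tokens, concept_a)
               for j in positions(tokens, concept_b))

def likely_cancel(note):
    """Evaluate the rule: cancel near an account term, with no nearby hedge."""
    # Assumed START_20: restrict matching to the first 20 tokens.
    tokens = re.findall(r"[a-z]+", note.lower())[:20]
    near_target = dist(tokens, "cancel", "cancel-what-cust", 7)
    # NOT (DIST_10 "[cancel]" (OR "[one-line]" "[restore]" "[if]"))
    hedged = any(dist(tokens, "cancel", c, 10)
                 for c in ("one-line", "restore", "if"))
    return near_target and not hedged
```

Under these assumptions the third example note ("... she wanted to cxl teh acct") satisfies the rule, while the first does not because "if" appears within 10 tokens of "cancell". The concept lists absorb creative spelling such as "cxl" and "acct".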
Beyond Sentiment - Wisdom of Crowds
Crowd Sourcing Technical Support
 Example – Android User Forum
 Develop a taxonomy of products, features, problem areas
 Develop Categorization Rules:
– “I use the SDK method and it isn't to bad a all. I'll get some pics up
later, I am still trying to get the time to update from fresh 1.0 to 1.1.”
– Find product & feature – forum structure
– Find problem areas in response, nearby text for solution
 Automatic – simply expose lists of “solutions”
– Search Based application
 Human mediated – experts scan and clean up solutions
Taxonomy, Text Mining, and Text Analytics
Conclusions
 Text Analytics is an essential platform for multiple applications
 Text Analytics and Text Mining and Taxonomy are mutually
enriching approaches
 Sentiment Analysis, Beyond Positive & Negative
 New emotion taxonomies, context around terms
 New applications – Expertise, behavior prediction, etc.
 Future – new kinds of applications:
– Enterprise Search – Hybrid ECM model with text analytics
– Expertise Analysis, Behavior Prediction, and more
– Social Media and Big Data built from TM & TA
– NeuroAnalytics – cognitive science meets taxonomy and more
• Watson is just the start
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Resources
 Books
– Women, Fire, and Dangerous Things
• George Lakoff
– Knowledge, Concepts, and Categories
• Koen Lamberts and David Shanks
– Formal Approaches in Categorization
• Ed. Emmanuel Pothos and Andy Wills
– The Mind
• Ed. John Brockman
• Good introduction to a variety of cognitive science theories,
issues, and new ideas
– Any cognitive science book written after 2009
Resources
 Conferences – Web Sites
– Text Analytics World – http://www.textanalyticsworld.com
– Text Analytics Summit – http://www.textanalyticsnews.com
– Semtech – http://www.semanticweb.com
Resources
 Blogs
– SAS – http://blogs.sas.com/text-mining/
 LinkedIn Groups:
– Text Analytics World
– Text Analytics Group
– Data and Text Professionals
– Sentiment Analysis
– Metadata Management
– Semantic Technologies
Resources
 Web Sites
– Taxonomy Community of Practice:
http://finance.groups.yahoo.com/group/TaxoCoP/
– Whitepaper – CM and Text Analytics –
http://www.textanalyticsnews.com/usa/contentmanagementmeetstextanalytics.pdf
– Whitepaper – Enterprise Content Categorization strategy and
development – http://www.kapsgroup.com
Resources
 Articles
– Malt, B. C. 1995. Category coherence in cross-cultural
perspective. Cognitive Psychology 29, 85-148
– Rifkin, A. 1985. Evidence for a basic level in event
taxonomies. Memory & Cognition 13, 538-56
– Shaver, P., J. Schwartz, D. Kirson, C. O’Connor 1987.
Emotion knowledge: further exploration of a prototype
approach. Journal of Personality and Social Psychology 52,
1061-1086
– Tanaka, J. W. & M. E. Taylor 1991. Object categories and
expertise: is the basic level in the eye of the beholder?
Cognitive Psychology 23, 457-82