Text Analytics Workshop

Tom Reamy
Chief Knowledge Architect
KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
 Introduction – State of Text Analytics
– Text Analytics Features
– Information / Knowledge Environment – Taxonomy, Metadata, Information Technology
– Value of Text Analytics
– Quick Start for Text Analytics
 Development – Taxonomy, Categorization, Faceted Metadata
 Text Analytics Applications
– Integration with Search and ECM
– Platform for Information Applications
 Questions / Discussions
2
Introduction: KAPS Group
 Knowledge Architecture Professional Services – Network of Consultants
 Applied Theory – Faceted taxonomies, complexity theory, natural
categories, emotion taxonomies
 Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Quick Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
 Partners – Smart Logic, Expert Systems, SAS, SAP, IBM, FAST,
Concept Searching, Attensity, Clarabridge, Lexalytics
 Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times,
Hyatt, Home Depot, Harvard Business Library, British Parliament,
Battelle, Amdocs, FDA, GAO, World Bank, etc.
 Presentations, Articles, White Papers – www.kapsgroup.com
3
Text Analytics Workshop
Introduction: Text Analytics
 History – academic research, focus on NLP
 Inxight – out of Xerox PARC
– Moved TA from academic research and NLP to auto-categorization, entity extraction, and search metadata
 Explosion of companies – many based on Inxight extraction with
some analytical-visualization front ends
– Half of the companies from 2008 are gone – the lucky ones got bought
 Focus on enterprise text analytics – shift to sentiment analysis: easier to do, obvious payoff (customers, not employees)
– Backlash – real business value?
 Enterprise search down – 10 years of effort for what?
– Need Text Analytics to work
 Text Analytics is slowly growing – time for a jump?
4
Text Analytics Workshop
Current State of Text Analytics
 Current Market: 2012 – exceeds $1 billion for text analytics (10% of total analytics)
 Growing 20% a year
 Search is 33% of total market
 Other major areas:
– Sentiment and Social Media Analysis, Customer Intelligence
– Business Intelligence, range of text-based applications
 Fragmented marketplace – full platform, low level, specialty
– Embedded in content management and search; no clear leader
 Big Data – Big Text is bigger; text into data, data for text
– Watson – ensemble methods, pun module
5
Text Analytics Workshop
Current State of Text Analytics: Vendor Space
 Taxonomy Management – SchemaLogic, Pool Party
 From Taxonomy to Text Analytics
– Data Harmony, MultiTes
 Extraction and Analytics
– Linguamatics (Pharma), Temis, whole range of companies
 Business Intelligence – Clear Forest, Inxight
 Sentiment Analysis – Attensity, Lexalytics, Clarabridge
 Open Source – GATE
 Stand-alone text analytics platforms – IBM, SAS, SAP, Smart
Logic, Expert System, Basis, Open Text, Megaputer, Temis,
Concept Searching
 Embedded in Content Management, Search
– Autonomy, FAST, Endeca, Exalead, etc.
6
Future Directions: Survey Results
 Important Areas:
– Predictive Analytics & text mining – 90%
– Search & Search-based Apps – 86%
– Business Intelligence – 84%
– Voice of the Customer – 82%, Social Media – 75%
– Decision Support, KM – 81%
– Big Data – other – 70%, Finance – 61%
– Call Center, Tech Support – 63%
– Risk, Compliance, Governance – 61%
– Security, Fraud Detection – 54%
7
Future Directions: Survey Results
 28% just getting started, 11% not yet
 What factors are holding back adoption of TA?
– Lack of clarity about value of TA – 23.4%
– Lack of knowledge about TA – 17.0%
– Lack of senior management buy-in – 8.5%
– Don't believe TA has enough business value – 6.4%
 Other factors
– Financial Constraints – 14.9%
– Other priorities more important – 12.8%
 Lack of articulated strategic vision – by vendors, consultants,
advocates, etc.
8
Introduction: Future Directions
What is Text Analytics Good For?
9
Text Analytics Workshop
What is Text Analytics?
 Text Mining – NLP, statistical, predictive, machine learning
 Semantic Technology – ontology, fact extraction
 Extraction – entities – known and unknown, concepts, events
– Catalogs with variants, rule-based
 Sentiment Analysis – Objects / Products and phrases
– Statistics, catalogs, rules – positive and negative
 Auto-categorization
– Training sets, terms, semantic networks
– Rules: Boolean – AND, OR, NOT
– Advanced – DIST(#), ORDDIST(#), PARAGRAPH, SENTENCE
– Disambiguation – identification of objects, events, context
– Build rule-based categorization, not simply a bag of individual words (see the sketch below)
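
To make the rule operators above concrete, here is a minimal Python sketch of a Boolean/proximity categorization rule. The DIST-style proximity check follows the operators listed on this slide, but the tokenizer, the terms, and the "marine toxins" example rule are illustrative assumptions, not any vendor's syntax.

    import re

    def tokens(text):
        # crude tokenizer for illustration only
        return re.findall(r"[a-z']+", text.lower())

    def dist(text, term_a, term_b, max_gap):
        # DIST-style operator: True if the two terms occur within max_gap tokens
        toks = tokens(text)
        pos_a = [i for i, t in enumerate(toks) if t == term_a]
        pos_b = [i for i, t in enumerate(toks) if t == term_b]
        return any(abs(a - b) <= max_gap for a in pos_a for b in pos_b)

    def is_marine_toxins(text):
        # hypothetical rule: (AND, (OR, DIST_7(toxin, marine), DIST_7(toxin, ocean)), (NOT, advertisement))
        near = dist(text, "toxin", "marine", 7) or dist(text, "toxin", "ocean", 7)
        return near and "advertisement" not in tokens(text)

    print(is_marine_toxins("New study of marine algal toxin levels in shellfish"))  # True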
10
Case Study – Categorization & Sentiment (demo screenshots)
Case Study – Taxonomy Development (demo screenshots)
Text Analytics Workshop: Information Environment
Building an Infrastructure
 Semantic Layer = Taxonomies, Metadata, Vocabularies + Text
Analytics – adding cognitive science, structure to unstructured
 Modeling users/audiences
 Technology Layer
– Search, Content Management, SharePoint, Intranets
 Publishing process, multiple users & info needs
– SharePoint – supports taxonomies, but
• Folksonomies – still a bad idea
 Infrastructure – Not an Application
– Business / Library / KM / EA – not IT
 Building on the Foundation
– Info Apps (Search-based Applications)
 Foundation of foundation – Text Analytics
21
Text Analytics Workshop: Information Environment
TA & Taxonomy: Complementary Information Platform
 Taxonomy provides a consistent and common vocabulary
– Enterprise resource – integrated not centralized
 Text Analytics provides consistent tagging
– Human indexing is subject to inter- and intra-individual variation
 Taxonomy provides the basic structure for categorization
– And candidate terms
 Text Analytics provides the power to apply the taxonomy
– And metadata of all kinds
 Text Analytics and Taxonomy Together – Platform
– Consistent in every dimension
– Powerful and economic
22
Text Analytics Workshop: Information Environment
Metadata - Tagging
 How do you bridge the gap – taxonomy to documents?
 Tagging documents with taxonomy nodes is tough
– And expensive – central or distributed
 Library staff – experts in categorization, not subject matter
– Too limited, narrow bottleneck
– Often don't understand business processes and business uses
 Authors – experts in the subject matter, terrible at categorization
– Intra- and inter-individual inconsistency, "intertwingleness"
– Choosing tags from taxonomy – complex task
– Folksonomy – almost as complex, wildly inconsistent
– Resistance – not their job, cognitively difficult = non-compliance
 Text Analytics is the answer(s)!
23
Text Analytics Workshop: Information Environment
Mind the Gap – Manual-Automatic-Hybrid
 All require human effort – issue of where and how effective
 Manual – human effort is tagging (difficult, inconsistent)
– Small, high-value document collections, trained taggers
 Automatic – human effort is prior to tagging – auto-categorization rules and/or NLP algorithm effort
 Hybrid Model – before (like automatic) and after
– Build on expertise – librarians on categorization, SMEs on subject terms
 Facets – require a lot of metadata – entity extraction feeds facets – more automatic, feedback by design
 Manual – Hybrid – Automatic is a spectrum – depends on context
24
Text Analytics Workshop
Benefits of Text Analytics
 Why Text Analytics?
– Enterprise search has failed to live up to its potential
– Enterprise content management has failed to live up to its potential
– Taxonomy has failed to live up to its potential
– Adding metadata, especially keywords, has not worked
 What is missing?
– Intelligence – human-level categorization, conceptualization
– Infrastructure – integrated solutions, not just technology and software
 Text Analytics can be the foundation that (finally) drives success
– search, content management, and much more
25
Text Analytics Workshop
Costs and Benefits
 IDC study – quantify cost of bad search
 Three areas:
– Time spent searching
– Recreation of documents
– Bad decisions / poor quality work
 Costs
– 50% of search time is bad search = $2,500 per person per year
– Recreation of documents = $5,000 per person per year
– Bad quality (harder to measure) = $15,000 per person per year
 Per 1,000 people = $22.5 million a year
– 30% improvement = $6.75 million a year (see the arithmetic below)
– Add your own stories – especially the cost of bad information
– Human measures – number of FTEs, savings passed on to customers, etc.
26
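
The figures above reduce to simple arithmetic; here is a quick back-of-the-envelope version in Python, using only the numbers from the slide (illustration only).

    PER_PERSON = {
        "bad search time": 2_500,                    # $/person/year
        "recreating documents": 5_000,
        "poor quality work / bad decisions": 15_000,
    }
    PEOPLE = 1_000

    total = sum(PER_PERSON.values()) * PEOPLE        # $22,500,000 a year
    savings = 0.30 * total                           # $6,750,000 a year at 30% improvement
    print(f"Total cost: ${total:,.0f}  Savings at 30%: ${savings:,.0f}")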
Text Analytics Workshop
Need for a Quick Start
 Text Analytics is weird, a bit academic, and not very practical
• It involves language and thinking and really messy stuff
 On the other hand, it is really difficult to do right (Rocket Science)
 Organizations don’t know what text analytics is and what it is for
 TAW Survey shows the need for two things:
• Strategic vision of text analytics in the enterprise
• Business value, problems solved, information overload
• Text Analytics as platform for information access
• Real life functioning program showing value and demonstrating
an understanding of what it is and does
 Quick Start – Strategic Vision – Software Evaluation – POC / Pilot
27
Text Analytics Workshop
Text Analytics Vision & Strategy
 Strategic Questions – why, what value from text analytics, how are you going to use it?
– Platform or Applications?
 What are the basic capabilities of Text Analytics?
 What can Text Analytics do for Search?
– After 10 years of failure – get search to work?
 What can you do with smart search-based applications?
– RM, PII, Social
 ROI for effective search – difficulty of believing
– Problems with metadata, taxonomy
28
Text Analytics Workshop
Quick Start Step One- Knowledge Audit
 Ideas – Content and Content Structure
– Map of Content – tribal language silos
– Structure – articulate and integrate
– Taxonomic resources
 People – Producers & Consumers
– Communities, Users, Central Team
 Activities – Business processes and procedures
– Semantics, information needs and behaviors
– Information Governance Policy
 Technology
– CMS, Search, portals, text analytics
– Applications – BI, CI, Semantic Web, Text Mining
29
Text Analytics Workshop
Quick Start Step One- Knowledge Audit
 Info Problems – what, how severe
 Formal Process – Knowledge Audit
– Contextual & information interviews, content analysis, surveys, focus groups, ethnographic studies, text mining
 Informal for smaller organizations, specific application
 Category modeling – Cognitive Science – how people think
– Panda, Monkey, Banana
 Natural-level categories mapped to communities, activities
• Novices prefer higher levels
• Balance of informativeness and distinctiveness
 Strategic Vision – Text Analytics and Information/Knowledge
Environment
30
Quick Start Step Two - Software Evaluation
Varieties of Taxonomy/ Text Analytics Software
 Software is more important to text analytics
– No spreadsheets for semantics
 Taxonomy Management – extraction
 Full Platform
– SAS, SAP, Smart Logic, Concept Searching, Expert System, IBM, Linguamatics, GATE
 Embedded – Search or Content Management
– FAST, Autonomy, Endeca, Vivisimo, NLP, etc.
– Interwoven, Documentum, etc.
 Specialty / Ontology (other semantic)
– Sentiment Analysis – Attensity, Lexalytics, Clarabridge, and lots of others
– Ontology – extraction, plus ontology
31
Quick Start Step Two - Software Evaluation
A Different Kind of Software Evaluation
 Traditional Software Evaluation – Start
– Filter One – Ask Experts – reputation, research – Gartner, etc.
• Market strength of vendor, platforms, etc.
• Feature scorecard – minimum, must have, filter to top 6
– Filter Two – Technology Filter – match to your overall scope and capabilities – a filter, not a focus
– Filter Three – In-Depth Demo – 3-6 vendors
 Reduce to 1-3 vendors
 Vendors have different strengths in multiple environments
– Millions of short, badly typed documents, build application
– Library of 200-page PDFs, enterprise & public search
32
Quick Start Step Two - Software Evaluation
Design of the Text Analytics Selection Team
 IT – experience with software purchases, needs assessment, budget
– Search/categorization is unlike other software – needs a deeper look
 Business – understands the business, focus on business value
– They can get executive sponsorship, support, and budget
– But don't understand information behavior, semantic focus
 Library, KM – understand information structure
– Experts in search experience and categorization
– But don't understand business or technology
 Interdisciplinary Team, headed by Information Professionals
 Much more likely to make a good decision
 Create the foundation for implementation
33
Quick Start Step Three –
Proof of Concept / Pilot Project
 POC use cases – basic features needed for initial projects
 Design - Real life scenarios, categorization with your content
 Preparation:
– Preliminary analysis of content and users information needs
• Training & test sets of content, search terms & scenarios
– Train taxonomist(s) on software(s)
– Develop taxonomy if none available
 Four-week POC – 2 rounds of develop, test, refine / not out-of-the-box
 Need SMEs as test evaluators – also to do an initial categorization of content
 Majority of time is on auto-categorization
34
Text Analytics Workshop
POC Design: Evaluation Criteria & Issues
 Basic Test Design – categorize test set
– Score – by file name, human testers
 Categorization & Sentiment – Accuracy 80-90%
– Effort Level per accuracy level
 Combination of scores and report
 Operators (DIST, etc.), relevancy scores, markup
 Development Environment – Usability, Integration
 Issues:
– Quality of content & initial human categorization
– Normalize among different test evaluators
– Quality of taxonomy – structure, overlapping categories
35
Quick Start for Text Analytics
Proof of Concept -- Value of POC
 Selection of best product(s)
 Identification and development of infrastructure elements –
taxonomies, metadata – standards and publishing process
 Training by doing – SMEs learning categorization, library/taxonomists learning business language
 Understand effort level for categorization, application
 Test suitability of existing taxonomies for range of applications
 Explore application issues – example – how accurate does
categorization need to be for that application – 80-90%
 Develop resources – categorization taxonomies, entity extraction
catalogs/rules
36
Text Analytics Workshop
POC and Early Development: Risks and Issues
 CTO Problem – this is not a regular software process
 Semantics is messy, not just complex
– 30% accuracy isn't 30% done – it could be 90%
 Variability of human categorization
 Categorization is iterative, not "the program works"
– Need realistic budget and flexible project plan
 Anyone can do categorization
– Librarians often overdo it, SMEs often get lost (keywords)
 Meta-language issues – understanding the results
– Need to educate IT and business in their language
37
Development
38
Text Analytics Development: Categorization Process
Start with Taxonomy and Content
 Starter Taxonomy
– If no taxonomy, develop (steal) an initial high-level one
• Textbooks, glossaries, Intranet structure
• Organization Structure – facets, not taxonomy
 Analysis of taxonomy – suitable for categorization?
– Structure – not too flat, not too large
– Orthogonal categories
 Content Selection
– Map of all anticipated content
– Selection of training sets – if possible
– Automated selection of training sets – taxonomy nodes as first categorization rules – apply and get content
39
Text Analytics Workshop
Text Analytics Development: Categorization Process
 First Round of Categorization Rules
 Term building – from content – a basic set of terms that appear often / are important to the content
 Add terms to a rule, apply to a broader set of content
 Repeat for more terms – get recall-precision "scores" (see the scoring sketch below)
 Repeat, refine, repeat, refine, repeat
 Get SME feedback – formal process – scoring
 Get SME feedback – human judgments
 Test against more, new content
 Repeat until "done" – 90%?
40
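
A minimal sketch of the recall-precision scoring step, comparing a rule's hits against SME judgments for one category; the document IDs and sets are invented for illustration. Each develop-test-refine round re-runs this kind of scoring on new content.

    def precision_recall(predicted, actual):
        # predicted: docs the categorization rule tagged; actual: docs SMEs say belong
        true_positives = len(predicted & actual)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(actual) if actual else 0.0
        return precision, recall

    predicted = {"doc01", "doc02", "doc05", "doc09"}
    actual = {"doc01", "doc02", "doc03", "doc09"}
    p, r = precision_recall(predicted, actual)
    print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.75 recall=0.75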
Text Analytics Workshop
Text Analytics Development: Entity Extraction Process
 Facet Design – from Knowledge Audit, K Map
 Find and Convert catalogs (see the extraction sketch below):
– Organization – internal resources
– People – corporate yellow pages, HR
– Include variants
– Scripts to convert catalogs – programming resource
 Build initial rules – follow categorization process
– Differences – scale, threshold – application dependent
– Recall – Precision – balance set by application
– Issue – disambiguation – Ford the company, the person, the car
41
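
A minimal sketch of catalog-driven entity extraction with variants, assuming a tiny invented catalog; real catalogs would come from the internal sources above, and disambiguation (Ford the company vs. the person) would lean on the surrounding categorization.

    import re

    CATALOG = {
        "organization": {"world bank": "World Bank", "kaps group": "KAPS Group"},
        "person": {"henry ford": "Henry Ford", "mr. ford": "Henry Ford"},  # variants -> canonical form
    }

    def extract_entities(text):
        low = text.lower()
        found = []
        for facet, variants in CATALOG.items():
            for variant, canonical in variants.items():
                if re.search(r"\b" + re.escape(variant) + r"\b", low):
                    found.append((facet, canonical))
        return found

    print(extract_entities("Henry Ford met analysts from the World Bank."))
    # [('organization', 'World Bank'), ('person', 'Henry Ford')]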
Text Analytics Workshop
Text Analytics Development: Demo
 BioPharma – scientific vocabulary / articles
42
Text Analytics Workshop
Case Study - Background
 Inxight Smart Discovery
 Multiple Taxonomies
– Healthcare – first target
– Travel, Media, Education, Business, Consumer Goods
 Content – 800+ Internet news sources
– 5,000 stories a day
 Application – Newsletters
– Editors using categorized results
– Easier than full automation
43
Text Analytics Workshop
Case Study - Approach
 Initial High Level Taxonomy
– Auto-generation – very strange – not usable
– Editors' high level – sections of newsletters
– Editors & taxonomy pros – broad categories & refine
 Develop Categorization Rules
– Multiple test collections
– Good stories, bad stories – close misses – terms
 Recall and Precision Cycles
– Refine and test – taxonomists – many rounds
– Review – editors – 2-3 rounds
 Repeat – about 4 weeks
44
Text Analytics Workshop
Case Study – Issues & Lessons
 Taxonomy Structure: Aggregate vs. independent nodes
– Child nodes as a subset – rare
 Trade-off of depth of taxonomy and complexity of rules
 No best answer – taxonomy structure, format of rules
– Need custom development
– Recall more important than precision – editors' role
 Combination of SME and taxonomy pros
– Combination of features – entity extraction, terms, Boolean, filters, facts
 Training sets and "find similar" are the weakest features
 Plan for ongoing refinement
48
Text Analytics Workshop
Enterprise Environment – Case Studies
 A Tale of Two Taxonomies
– It was the best of times, it was the worst of times
 Basic Approach
– Initial meetings – project planning
– High-level K map – content, people, technology
– Contextual and Information Interviews
– Content Analysis
– Draft Taxonomy – validation interviews, refine
– Integration and Governance Plans
49
Text Analytics Workshop
Enterprise Environment – Case One – Taxonomy, 7 facets
 Taxonomy of Subjects / Disciplines:
– Science > Marine Science > Marine microbiology > Marine toxins
 Facets:
– Organization > Division > Group
– Clients > Federal > EPA
– Facilities > Division > Location > Building X
– Content Type – Knowledge Asset > Proposals
– Instruments > Environmental Testing > Ocean Analysis > Vehicle
– Methods > Social > Population Study
– Materials > Compounds > Chemicals
50
Text Analytics Workshop
Enterprise Environment – Case One – Taxonomy, 7 facets
 Project Owner – KM department – included RM, business
process
 Involvement of library - critical
 Realistic budget, flexible project plan
 Successful interviews – build on context
– Overall information strategy – where taxonomy fits
 Good draft taxonomy and extended refinement
– Software, process, team – train library staff
– Good selection and number of facets
 Developed broad categorization and one deep area – Chemistry
 Final plans and hand off to client
51
Text Analytics Workshop
Enterprise Environment – Case Two – Taxonomy, 4 facets
 Taxonomy of Subjects / Disciplines:
– Geology > Petrology
 Facets:
– Organization > Division > Group
– Process > Drill a Well > File Test Plan
– Assets > Platforms > Platform A
– Content Type > Communication > Presentations
52
Enterprise Environment – Case Two – Taxonomy, 4 facets
Environment & Project Issues
 Value of taxonomy understood, but not the complexity and scope
– Under-budgeted, under-staffed
 Location – not KM – tied to RM and software
– Solution looking for the right problem
 Importance of an internal library staff
– Difficulty of merging internal expertise and taxonomy
 Project mind set – not infrastructure
– Rushing to meet deadlines doesn’t work with semantics
 Importance of integration – with team, company
– Project plan more important than results
53
Enterprise Environment – Case Two – Taxonomy, 4 facets
Research and Design Issues
 Research Issues
– Not enough research – and wrong people
– Misunderstanding of research – wanted tinker toy connections
• Interview 1 leads to taxonomy node 2
 Design Issues
– Not enough facets
– Wrong set of facets – business not information
– Ill-defined facets – too complex internal structure
54
Enterprise Environment – Case Two – Taxonomy, 4 facets
Conclusion: Risk Factors
 Political-Cultural-Semantic Environment
– Not simple resistance – more subtle
• Re-interpretation of specific conclusions and the sequence of conclusions / relative importance of specific recommendations
 Access to content and people
– Enthusiastic access
 Importance of a unified project team
– Working communication as well as weekly meetings
55
Applications
56
Text Analytics Workshop
Building on the Foundation
 Text Analytics: Create the Platform – CM & Search
– New Electronic Publishing Process
• Use text analytics to tag, new hybrid workflow
– New Enterprise Search
• Build faceted navigation on metadata, extraction
 Enhance Information Access in the Enterprise – InfoApps
– Governance, Records Management, Doc duplication, Compliance
– Applications – Business Intelligence, CI, Behavior Prediction
– eDiscovery, litigation support, Fraud detection
– Productivity / Portals – spider and categorize, extract
57
Text Analytics Workshop
Information Platform: Content Management
 Hybrid Model – Internal Content Management (see the workflow sketch below)
– Publish document -> text analytics analysis -> suggestions for categorization, entities, metadata -> present to author
– Cognitive task is simple -> react to a suggestion instead of selecting from your head or a complex taxonomy
– Feedback – if the author overrides -> suggestion for a new category
 External Information - human effort is prior to tagging
– More automated, human input as specialized process –
periodic evaluations
– Precision usually more important
– Target usually more general
58
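
A minimal sketch of the hybrid publish workflow just described: the engine suggests tags, the author reacts, and overrides are logged as feedback for refining rules. The function and parameter names here are hypothetical placeholders, not a product API.

    def publish(document_text, suggest_tags, ask_author, log_override):
        # suggest_tags: auto-categorization step; ask_author: simple accept/reject prompt
        suggestions = suggest_tags(document_text)
        final_tags = []
        for tag in suggestions:
            if ask_author(tag):
                final_tags.append(tag)
            else:
                log_override(document_text, tag)   # feeds review of rules / new categories
        return final_tags

    tags = publish(
        "Quarterly marine toxin study for the EPA ...",
        suggest_tags=lambda text: ["Marine toxins", "Proposals"],
        ask_author=lambda tag: tag != "Proposals",
        log_override=lambda text, tag: print("author overrode:", tag),
    )
    print(tags)   # ['Marine toxins']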
Text Analytics and Search
Multi-dimensional and Smart
 Faceted Navigation has become the basic norm (see the facet sketch below)
– Facets require huge amounts of metadata
– Entity / noun phrase extraction is fundamental
– Automated with disambiguation (through categorization)
 Taxonomy – two roles – subject/topics and facet structure
– Complex facets and faceted taxonomies
 Clusters and Tag Clouds – discovery & exploration
 Auto-categorization – aboutness, subject facets
– This is still fundamental to the search experience
– InfoApps are only as good as the fundamentals of search
 People – tagging, evaluating tags, fine-tuning rules and taxonomy
59
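
A minimal sketch of faceted filtering over extracted metadata, with invented documents and facet values; the point is only the mechanics of counts and filters that the extracted entities feed.

    DOCS = [
        {"id": 1, "content_type": "Proposal", "organization": "EPA", "topic": "Marine toxins"},
        {"id": 2, "content_type": "Presentation", "organization": "EPA", "topic": "Petrology"},
        {"id": 3, "content_type": "Proposal", "organization": "FDA", "topic": "Marine toxins"},
    ]

    def facet_counts(docs, facet):
        # counts per facet value, as shown alongside each facet in the UI
        counts = {}
        for doc in docs:
            counts[doc[facet]] = counts.get(doc[facet], 0) + 1
        return counts

    def filter_docs(docs, **selections):
        # keep documents matching every selected facet value
        return [d for d in docs if all(d.get(k) == v for k, v in selections.items())]

    print(facet_counts(DOCS, "content_type"))                            # {'Proposal': 2, 'Presentation': 1}
    print(filter_docs(DOCS, topic="Marine toxins", organization="EPA"))  # doc 1 only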
Integrated Facet Application
Design Issues - General
 What is the right combination of elements?
– Dominant dimension or equal facets
– Browse topics and filter by facet, search box
– How many facets do you need?
 Scale requires more automated solutions
– More sophisticated rules
 Issue of disambiguation:
– Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford
– Same word, different entity – Ford and Ford
 Number of entities and thresholds per results set / document
– Usability, audience needs
 Relevance Ranking – number of entities, rank of facets
62
Text Analytics Workshop : Applications
Text and Data: Two Way Street
 New types of applications
– New ways to make sense of data, enrich data
 Harvard – Analyzing Text as Data
– Detecting deception, Frame Analysis
 Narrative Science – take data (baseball statistics, financial data)
and turn into a story
 Political campaigns using Big Data, social media, and text
analytics
 Watson for healthcare – help doctors keep up with massive
information overload
63
Text Analytics Workshop : Applications
Social Media: Beyond Simple Sentiment
 Beyond Good and Evil (positive and negative)
– Social Media analysis is approaching the next stage (growing up)
– Where is the value? How to get better results?
 Importance of Context – around positive and negative words
– Rhetorical reversals – "I was expecting to love it"
– Issues of sarcasm ("Really Great Product"), slanguage
 Granularity of Application
– Early categorization – Politics or Sports
– Limited value of Positive and Negative
– Degrees of intensity, complexity of emotions and documents
 Addition of focus on behaviors – why someone calls a support center – and likely outcomes
64
Text Analytics Workshop : Applications
Social Media: Beyond Simple Sentiment
 Two basic approaches [limited accuracy, depth] – contrasted in the sketch below
– Statistical signature of a bag of words
– Dictionary of positive & negative words
 Essential – need full categorization and concept extraction
 New Taxonomies – Appraisal Groups – adjectives and modifiers – "not very good"
– Supports more subtle distinctions than positive or negative
 Emotion taxonomies – Joy, Sadness, Fear, Anger, Surprise, Disgust
– New complex emotions – pride, shame, confusion, skepticism
65
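
A minimal sketch contrasting a bare positive/negative word list with a simple check for the rhetorical reversals mentioned above; the lexicons and reversal phrases are tiny illustrative samples, not production resources.

    POSITIVE = {"love", "great", "good"}
    NEGATIVE = {"hate", "terrible", "bad"}
    REVERSALS = ("was expecting to", "i hoped to", "was supposed to")

    def naive_sentiment(text):
        # dictionary-of-words approach: count positive minus negative words
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    def sentiment_with_reversal(text):
        # same score, but flip it negative when an expectation phrase appears
        score = naive_sentiment(text)
        if any(phrase in text.lower() for phrase in REVERSALS):
            score = -abs(score)   # "I was expecting to love it" reads as negative
        return score

    print(naive_sentiment("I was expecting to love it"))          #  1 (wrong direction)
    print(sentiment_with_reversal("I was expecting to love it"))  # -1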
Text Analytics Workshop: Applications
Expertise Analysis
 Expertise Analysis
– Experts think & write differently – process, chunks
 Expertise Characterization for individuals, communities, documents, and
sets of documents
 Applications:
– Business & Customer intelligence, Voice of the Customer
– Deeper understanding of communities, customers – better models
– Security, threat detection – behavior prediction, Are they experts?
– Expertise location – generate automatic expertise characterizations
 Crowd Sourcing – technical support to wikis
 Political – conservative and liberal minds/texts
– Disgust, shame, cooperation, openness
66
Text Analytics Workshop: Applications
Behavior Prediction – Telecom Customer Service
 Problem – distinguish customers likely to cancel from mere threats
 Basic Rule (re-expressed in the sketch following this slide)
– (START_20, (AND, (DIST_7, "[cancel]", "[cancel-what-cust]"),
(NOT, (DIST_10, "[cancel]", (OR, "[one-line]", "[restore]", "[if]")))))
 Examples:
– customer called to say he will cancell his account if the does not stop receiving
a call from the ad agency.
– cci and is upset that he has the asl charge and wants it off or her is going to
cancel his act
 More sophisticated analysis of text and context in text
 Combine text analytics with Predictive Analytics and traditional behavior
monitoring for new applications
67
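
An illustrative re-expression of the cancellation rule above in plain Python; the term lists standing in for [cancel], [cancel-what-cust], and the softening concepts are invented, and START_20 is approximated by looking only at the first 20 tokens.

    import re

    CANCEL = {"cancel", "cancell", "cancelling", "disconnect"}
    CANCEL_WHAT = {"account", "act", "service", "line"}
    SOFTENERS = {"if", "unless", "restore"}    # stands in for [if], [restore], [one-line]

    def likely_to_cancel(note, window=20):
        toks = re.findall(r"[a-z']+", note.lower())[:window]   # rough START_20
        cancel_pos = [i for i, t in enumerate(toks) if t in CANCEL]
        what_pos = [i for i, t in enumerate(toks) if t in CANCEL_WHAT]
        soft_pos = [i for i, t in enumerate(toks) if t in SOFTENERS]
        near_what = any(abs(c - w) <= 7 for c in cancel_pos for w in what_pos)    # DIST_7
        softened = any(abs(c - s) <= 10 for c in cancel_pos for s in soft_pos)    # NOT DIST_10
        return near_what and not softened

    print(likely_to_cancel("customer very upset, says he will cancel his account today"))  # True
    print(likely_to_cancel("will cancel his account if the charge is not removed"))        # False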
Text Analytics Workshop: Applications
Variety of New Applications
 Essay Evaluation Software – apply to expertise characterization
– Avoid gaming the system – multi-syllabic nonsense
• Model levels of chunking, procedure words over content
 Legal Review
– Significant trend – computer-assisted review (manual review = too many documents)
– TA – categorize and filter to a smaller, more relevant set
– Payoff is big – one firm with 1.6M docs saved $2M
 Financial Services
– Trend – using text analytics with predictive analytics – risk and fraud
– Combine unstructured text (why) and structured transaction data (what)
– Customer Relationship Management, Fraud Detection
– Stock Market Prediction – Twitter, impact articles
68
Text Analytics Workshop: Applications
Pronoun Analysis: Fraud Detection; Enron Emails
 Patterns of “Function” words reveal wide range of insights
 Function words = pronouns, articles, prepositions, conjunctions, etc.
– Used at a high rate, short and hard to detect, very social, processed
in the brain differently than content words
 Areas: sex, age, power-status, personality – individuals and groups
 Lying / Fraud detection: Documents with lies have
– Fewer and shorter words, fewer conjunctions, more positive emotion
words
– More use of “if, any, those, he, she, they, you”, less “I”
– More social and causal words, more discrepancy words
 Current research – 76% accuracy in some contexts
 Text Analytics can improve accuracy and utilize new sources
 Data analytics (standard AML) can improve accuracy
69
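
A minimal sketch of measuring function-word rate, the kind of signal the deception research above relies on; the word list is a small illustrative sample, not a validated lexicon.

    import re

    FUNCTION_WORDS = {"i", "you", "he", "she", "they", "if", "any", "those",
                      "the", "a", "an", "and", "but", "or", "of", "to", "in"}

    def function_word_rate(text):
        # share of tokens that are function words (pronouns, articles, etc.)
        toks = re.findall(r"[a-z']+", text.lower())
        if not toks:
            return 0.0
        return sum(t in FUNCTION_WORDS for t in toks) / len(toks)

    print(round(function_word_rate("If any of those reports are late, they should call you."), 2))  # ~0.55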
Text Analytics Workshop
Conclusions
 Text Analytics and Taxonomy are partners – enrich each other
 Text Analytics can mind the gap – between taxonomies and
documents
 Text Analytics needs a strategic vision and a quick start
– Need to approach it as a platform – deep context – understand the information environment
 Text Analytics is a platform for a huge range of applications:
– Search and Content Management and basic productivity apps
– New kinds of applications – social, data, Info Apps of all kinds
 Want to learn more – come to Text Analytics World in Boston in the fall!
– Early Bird Registration – www.textanalyticsworld.com
70
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Resources
 Books
– Women, Fire, and Dangerous Things
• George Lakoff
– Knowledge, Concepts, and Categories
• Koen Lamberts and David Shanks
– Formal Approaches in Categorization
• Ed. Emmanuel Pothos and Andy Wills
– The Mind
• Ed. John Brockman
• Good introduction to a variety of cognitive science theories, issues, and new ideas
– Any cognitive science book written after 2009
72
Resources
 Conferences – Web Sites
– Text Analytics World – all aspects of text analytics
• Oct 2-3, Boston
• http://www.textanalyticsworld.com
– Text Analytics Summit
• http://www.textanalyticsnews.com
– Semtech
• http://www.semanticweb.com
73
Resources
 Blogs
– SAS – http://blogs.sas.com/text-mining/
 Web Sites
– Taxonomy Community of Practice: http://finance.groups.yahoo.com/group/TaxoCoP/
– LinkedIn – Text Analytics Summit Group: http://www.LinkedIn.com
– Whitepaper – CM and Text Analytics: http://www.textanalyticsnews.com/usa/contentmanagementmeetstextanalytics.pdf
– Whitepaper – Enterprise Content Categorization strategy and development – http://www.kapsgroup.com
74
Resources
 Articles
– Malt, B. C. 1995. Category coherence in cross-cultural perspective. Cognitive Psychology 29, 85-148.
– Rifkin, A. 1985. Evidence for a basic level in event taxonomies. Memory & Cognition 13, 538-56.
– Shaver, P., J. Schwartz, D. Kirson, C. O'Connor 1987. Emotion knowledge: further explorations of a prototype approach. Journal of Personality and Social Psychology 52, 1061-1086.
– Tanaka, J. W. & M. E. Taylor 1991. Object categories and expertise: is the basic level in the eye of the beholder? Cognitive Psychology 23, 457-82.
75
Resources
 LinkedIn Groups:
– Text Analytics World
– Text Analytics Group
– Data and Text Professionals
– Sentiment Analysis
– Metadata Management
– Semantic Technologies
 Journals
– Academic – Cognitive Science, Linguistics, NLP
– Applied – Scientific American Mind, New Scientist
76