Text Analytics

Text Analytics Workshop
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
 Introduction – Elements & Infrastructure Platform
– Semantics not technology
– Infrastructure not project
– Value of Text Analytics
 Evaluating Software
– Two Phase Process
– Designing the Team and Content Structures
 Development – Taxonomy, Categorization, Faceted Metadata
 Text Analytics Applications
– Integration with Search and ECM
– Platform for Information Applications
KAPS Group: General
 Knowledge Architecture Professional Services
 Virtual Company: Network of consultants – 8-10
 Partners – SAS, SAP, Microsoft-FAST, Concept Searching, etc.
 Consulting, Strategy, Knowledge architecture audit
 Services:
– Taxonomy/Text Analytics development, consulting, customization
– Technology Consulting – Search, CMS, Portals, etc.
– Evaluation of Enterprise Search, Text Analytics
– Metadata standards and implementation
– Knowledge Management: Collaboration, Expertise, e-learning
– Applied Theory – Faceted taxonomies, complexity theory, natural categories
Introduction to Text Analytics
Semantic Infrastructure - Elements
 Taxonomy – Thesauri, Controlled Vocabulary
 Metadata – Standard (Dublin Core) and Facets
 Basic Text Analytics
– Categorization – Document Topics – Aboutness
– Entity Extraction – noun phrases, feed facets
– Summarization – beyond snippets
 Advanced Text Analytics
– Fact extraction – ontologies
– Sentiment Analysis – good, bad, and ugly
 What is in a Name – text analytics or ?
Introduction to Text Analytics
Taxonomy
 Thesauri, Controlled Vocabulary
– Resources to build on
– Indexing not categorization
 Taxonomy
– Foundation for Categorization
– Browse – classification scheme
– Formal – Is-Child-Of, Is-Part-Of
– Large taxonomies - MeSH – indexing all topics
– Small is better – for categorization and faceted navigation
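The formal Is-Child-Of relation on this slide can be sketched as a small data structure. This is only an illustration, not the model of any particular taxonomy tool: the `Taxonomy` class and the node names are hypothetical, and Is-Part-Of could be stored the same way in a second mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Taxonomy:
    """A tiny taxonomy that stores only the Is-Child-Of relation."""
    parent_of: dict = field(default_factory=dict)

    def add(self, child, parent):
        # Record that `child` Is-Child-Of `parent`.
        self.parent_of[child] = parent

    def ancestors(self, node):
        # Walk upward from `node` to the root, nearest parent first.
        path = []
        while node in self.parent_of:
            node = self.parent_of[node]
            path.append(node)
        return path

    def children(self, node):
        # Direct children only, sorted for a stable browse order.
        return sorted(c for c, p in self.parent_of.items() if p == node)


# Hypothetical nodes, invented for illustration.
tax = Taxonomy()
tax.add("Diseases", "Health")
tax.add("Treatments", "Health")
tax.add("Diabetes", "Diseases")

print(tax.ancestors("Diabetes"))
print(tax.children("Health"))
```

`children` supports the browse use, and `ancestors` lets a categorizer roll a narrow topic up to a broader one. Both stay cheap when the taxonomy is small.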
Introduction to Text Analytics
Metadata
 Metadata standards – Dublin Core - Mostly syntactic not semantic
– Description – static or dynamic (summarization)
– Semantic – keywords – very poor performance
 Best Bets – high level categorization-search
– Human judgments
 Audience – mixed results
– Role, function, expertise, information behaviors
 Facets – classes of metadata
– Standard - People, Organization, Document type-purpose
– Specialized – methods, materials, products
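To make the distinction between descriptive metadata and facets concrete, here is a minimal sketch of records that carry a few Dublin Core-style fields (`title`, `creator`, `type`) alongside facet values, plus faceted filtering and counting. The records, facet names, and helper functions are all hypothetical and invented for illustration.

```python
from collections import Counter

# Hypothetical records: Dublin Core-style fields plus facet values.
DOCS = [
    {
        "title": "Pump maintenance procedure",
        "creator": "J. Rivera",
        "type": "Procedure",
        "facets": {"organization": {"Operations"}, "materials": {"steel"}},
    },
    {
        "title": "Coating test report",
        "creator": "A. Chen",
        "type": "Report",
        "facets": {"organization": {"Research"}, "materials": {"steel", "epoxy"}},
    },
    {
        "title": "Adhesive supplier review",
        "creator": "A. Chen",
        "type": "Report",
        "facets": {"organization": {"Purchasing"}, "materials": {"epoxy"}},
    },
]


def filter_by_facets(docs, **selected):
    """Keep documents that carry every selected facet value."""
    return [
        d for d in docs
        if all(value in d["facets"].get(facet, set())
               for facet, value in selected.items())
    ]


def facet_counts(docs, facet):
    """Count how many documents carry each value of one facet."""
    counts = Counter()
    for d in docs:
        counts.update(d["facets"].get(facet, set()))
    return dict(counts)


steel_docs = filter_by_facets(DOCS, materials="steel")
print([d["title"] for d in steel_docs])
print(facet_counts(steel_docs, "organization"))
```

The counts are what a faceted navigation panel would show next to each value after the user narrows the result set.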
Introduction to Text Analytics
Text Analytics
 Categorization
– Multiple techniques – examples, terms, Boolean
– Built on a taxonomy
 Entity Extraction
– Catalogs with variants, rule-based dynamic
 Summarization
– Rules – find sentences in a document
 Fact Extraction
– Relationships of entities – people-organizations-activities
 Sentiment Analysis
– Rules – adjectives & adverbs not nouns
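Two of the techniques listed above, Boolean categorization rules and catalog-based entity extraction with variants, can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any commercial text analytics product works: the rule format, category names, company names, and function names are all hypothetical.

```python
import re

# Hypothetical Boolean rules. A category fires when every "all" term and at
# least one "any" term occur in the text, and no "not" term occurs.
RULES = {
    "Mergers": {"all": {"acquire"}, "any": {"shares", "stock"}, "not": {"rumor"}},
    "Earnings": {"all": {"quarterly"}, "any": {"profit", "revenue"}, "not": set()},
}

# Hypothetical entity catalog: canonical name -> known surface variants.
CATALOG = {
    "Acme Corporation": ["Acme Corporation", "Acme Corp", "Acme"],
    "Globex Inc": ["Globex Inc", "Globex"],
}


def categorize(text):
    """Return the categories whose Boolean rule matches the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        name for name, rule in RULES.items()
        if rule["all"] <= words
        and rule["any"] & words
        and not (rule["not"] & words)
    )


def extract_entities(text):
    """Map each canonical entity to the variants found in the text."""
    found = {}
    for canonical, variants in CATALOG.items():
        # Try longer variants first so "Acme Corp" wins over "Acme".
        ordered = sorted(variants, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in ordered) + r")\b"
        matches = re.findall(pattern, text)
        if matches:
            found[canonical] = matches
    return found


sample = "Acme Corp. will acquire shares of Globex, according to Acme."
print(categorize(sample))
print(extract_entities(sample))
```

Mapping every variant back to one canonical name is what lets extracted entities feed a facet such as Organization.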
Introduction to Text Analytics
Text Analytics
 Why Text Analytics?
– Enterprise search has failed to live up to its potential
– Enterprise content management has failed to live up to its potential
– Taxonomy has failed to live up to its potential
– Adding metadata, especially keywords, has not worked
 What is missing?
– Intelligence – human level categorization, conceptualization
– Infrastructure – Integrated solutions not technology, software
 Text Analytics can be the foundation that (finally) drives success
– search, content management, and much more
Text Analytics Platform
4 Basic Contexts
 Ideas – Content Structure
– Language and Mind of your organization
– Applications - exchange meaning, not data
 People – Company Structure
– Communities, Users
– Central team - establish standards, facilitate
 Activities – Business processes and procedures
 Technology
– CMS, Search, portals, taxonomy tools
– Applications – BI, CI, Text Mining
Text Analytics Platform: The start and foundation
Knowledge Architecture Audit
 Knowledge Map - Understand what you have, what you are, what you want
– The foundation of the foundation
 Contextual interviews, content analysis, surveys, focus groups, ethnographic studies
 Category modeling – “Intertwingledness” - learning new categories influenced by other, related categories
 Natural level categories mapped to communities, activities
• Novices prefer higher levels
• Balance of informativeness and distinctiveness
 Living, breathing, evolving foundation is the goal
Text Analytics Platform – Benefits
IDC White Paper
 Time Wasted
– Reformatting information - $5.7 million per 1,000 per year
– Not finding information - $5.3 million per 1,000
– Recreating content - $4.5 million per 1,000
 Small Percent Gain = large savings
– 1% - $10 million
– 5% - $50 million
– 10% - $100 million
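The arithmetic behind "small percent gain = large savings" can be worked through as below. The three cost figures come from the slide; the unit "per 1,000" is taken as given. The slide does not say what organization size its $10/$50/$100 million figures assume, so the headcount used here is a hypothetical input and the function name is an assumption for illustration.

```python
# Annual cost figures from the slide, in dollars per 1,000 (unit as given).
COST_PER_1000 = {
    "reformatting information": 5_700_000,
    "not finding information": 5_300_000,
    "recreating content": 4_500_000,
}

# Combined annual cost of the three categories per 1,000.
total_per_1000 = sum(COST_PER_1000.values())


def annual_savings(headcount, percent_gain):
    """Savings if `percent_gain` percent of the wasted cost is recovered."""
    total_cost = total_per_1000 * headcount / 1000
    return total_cost * percent_gain / 100


# Hypothetical organization of 10,000 people (not a figure from the slide).
for gain in (1, 5, 10):
    print(f"{gain}% gain: ${annual_savings(10_000, gain):,.0f} per year")
```

Because the savings scale linearly with both headcount and percent gain, the slide's tenfold step from 1% to 10% holds at any organization size.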
Text Analytics Platform – Benefits
 Findability within and outside the enterprise
– Savings per year - $millions
 Rescue enterprise search and ECM projects
– Add semantics to search
 Clean up enterprise content
– Duplication and accurate categorization
 Improve the quality of information access
– Finding the right information can save millions
 Build smarter applications
– Social networking, locate expertise within the enterprise
Text Analytics Platform – Benefits
 Understand your customers
– What they are talking about and how they feel about it
 Empower your employees
– Not only more time, but they work smarter
 Understand your competitors
– What they are working on, talking about
– Combine unstructured content and rich data sources – more intelligent analysis
Text Analytics Platform – Dangers
 Text Analytics as a software project
 Not enough resources – to develop, to maintain-refine
 Wrong resources – SMEs, IT, Library
– Need all of the above and taxonomists+
 Bad Design:
– Start with bad taxonomy
– Wrong taxonomy – too big or too flat
 Bad Categorization / Entity Extraction
– Right kind of experience
Resources
 Books
– Women, Fire, and Dangerous Things
• George Lakoff
– Knowledge, Concepts, and Categories
• Koen Lamberts and David Shanks
– The Stuff of Thought
• Steven Pinker
 Web Sites
– Text Analytics News - http://social.textanalyticsnews.com/index.php
– Text Analytics Wiki - http://textanalytics.wikidot.com/
Resources
 Blogs
– SAS - Manya Mayes – Chief Strategist - http://blogs.sas.com/text-mining/
 Web Sites
– Taxonomy Community of Practice: http://finance.groups.yahoo.com/group/TaxoCoP/
– Whitepaper – CM and Text Analytics - http://www.textanalyticsnews.com/usa/contentmanagementmeetstextanalytics.pdf
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com