Taxonomy Development Workshop

Text Analytics Workshop
Evaluation of Software
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
• Features, Varieties, Vendors
• Enterprise Context
  – Start with Self-Knowledge
  – Text Analytics Team
• Evaluation Process
  – Features and Capabilities – Filter
  – Proof of Concept / Pilot
Text Analytics Software – Features
• Entity Extraction
  – Multiple types, custom classes – entities, concepts, events
• Auto-categorization – Taxonomy Structure
  – Training sets – Bayesian, vector space (see the sketch below)
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (title, body, URL)
  – Boolean – full search syntax – AND, OR, NOT
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE
• Advanced Features
  – Facts / ontologies / Semantic Web – RDF+
  – Sentiment Analysis
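As a concrete illustration of the training-set approach, here is a minimal Python sketch of a Bayesian categorizer built with scikit-learn; the documents and taxonomy nodes are hypothetical, and real products wrap far richer engines.

    # Minimal sketch: training-set categorization with a Bayesian classifier.
    # The training documents and taxonomy labels below are hypothetical.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    training_docs = ["quarterly revenue and earnings report",
                     "new hire onboarding and benefits enrollment"]
    training_labels = ["Finance", "Human Resources"]  # taxonomy nodes

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(training_docs, training_labels)
    print(model.predict(["benefits enrollment deadline"]))  # ['Human Resources']

A vector-space approach would swap the classifier for cosine similarity against per-node centroid vectors.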
Varieties of Taxonomy/Text Analytics Software
• Taxonomy Management
  – Synaptica, SchemaLogic
• Full Platform
  – SAP-Inxight, ClearForest, SAS-Teragram, Data Harmony, Concept Searching, IBM
• Content Management
  – Nstein, Interwoven, Documentum, etc.
• Embedded – Search
  – FAST, Autonomy, Endeca, Exalead, etc.
• Specialty
  – Sentiment Analysis – Lexalytics
Vendors of Taxonomy/Text Analytics Software
• Attensity
• Business Objects – Inxight
• Clarabridge
• ClearForest
• Data Harmony / Access Innovations
• GATE (Open Source)
• IBM Content Analyst
• Lexalytics
• Multi-Tes
• Nstein
• SAS – Teragram
• SchemaLogic
• Smart Logic
• Synaptica
• Wikionomy
• Wordmap
• Lots more
Evaluating Taxonomy/Text Analytics Software
Start with Self-Knowledge
• Strategic and Business Context
• Info Problems – what they are, how severe
• Strategic Questions – why, what value you expect from the taxonomy/text analytics, how you are going to use it
• Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or an informal process for a smaller organization
• Text Analytics Strategy/Model – forms, technology, people
  – Existing taxonomic resources, software
• You need this foundation both to evaluate and to develop
Evaluating Taxonomy/Text Analytics Software
Start with Self-Knowledge
• Do you need it – and if so, what blend?
• Taxonomy Management Only
  – Multiple taxonomies, languages, authors/editors
• Technology Environment – ECM, enterprise search – where is it embedded?
• Publishing Process – where and how is metadata being added – now and in the projected future?
  – Can it utilize auto-categorization, entity extraction, summarization?
• Is the current search adequate – can it utilize text analytics?
• Applications – text mining, BI, CI, alerts?
Design of the Text Analytics Selection Team
• Traditional Candidates – IT
• Experience with large software purchases
  – But search/categorization is unlike other software
• Experience with needs assessments
  – Need more – knowing what questions to ask, a knowledge audit
• Objective criteria
  – Looking where there is light?
  – Asking IT to select taxonomy software is like asking a construction company to select the design of your house.
• They have the budget
  – OK, they can play.
Design of the Text Analytics Selection Team
• Traditional Candidates – Business Owners
• Understand the business
  – But don't understand information behavior
• Focus on business value, not technology
  – Focus on semantics is needed
• They can get executive sponsorship, support, and budget
  – OK, they can play.
Design of the Text Analytics Selection Team
• Traditional Candidates – Library
• Understand information structure
  – But not how it is used in the business
• Experts in search experience and categorization
  – Suitable for experts, not regular users
• Experience with a variety of search engines, taxonomy software, and integration issues
  – OK, they can play.
Design of the Text Analytics Selection Team
• Interdisciplinary Team, headed by information professionals
• Relative Contributions
  – IT – set necessary conditions, support tests
  – Business – provide input into requirements, support the project
  – Library – provide input into requirements, add understanding of search semantics and functionality
• Much more likely to make a good decision
• Creates the foundation for implementation
Evaluating Text Analytics Software – Process
• Start with self-knowledge
• Eliminate the unfit
  – Filter One – ask experts – reputation, research – Gartner, etc.
    • Market strength of vendor, platforms, etc.
    • Feature scorecard – minimum and must-have features – filter to the top 3 (see the sketch below)
  – Filter Two – technology filter – match to your overall scope and capabilities – a filter, not a focus
  – Filter Three – focus group, one-day visit – 3-4 vendors
• Deep pilot (2) / POC – advanced features, integration, semantics
• Focus on the working relationship with the vendor.
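A minimal sketch of how a feature scorecard can act as a filter, assuming hypothetical feature names and 0-2 vendor scores; any scheme that enforces must-haves before ranking works the same way.

    # Minimal sketch: must-have features filter first, totals rank second.
    # Vendor names, features, and scores are hypothetical.
    MUST_HAVE = {"auto_categorization", "entity_extraction"}

    vendors = {
        "Vendor A": {"auto_categorization": 2, "entity_extraction": 2, "skos_export": 1},
        "Vendor B": {"auto_categorization": 0, "entity_extraction": 2, "skos_export": 2},
    }

    def passes_filter(scores):
        return all(scores.get(f, 0) > 0 for f in MUST_HAVE)

    shortlist = sorted((v for v in vendors if passes_filter(vendors[v])),
                       key=lambda v: -sum(vendors[v].values()))[:3]
    print(shortlist)  # ['Vendor A'] -- Vendor B fails a must-have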
Evaluating Text Analytics Software
Feature Checklist and Score: Basic Features, Admin
• New, copy, rename, delete, merge
  – Branches, not just nodes
• Scope notes
• Spell check
• Search – all parts or selected parts (e.g., only taxonomy nodes)
• Names and identifiers for terms and nodes
• Check for duplicates (see the sketch below)
• Versioning, multiple authors
• Analytical reports – structure, application to documents
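To make the duplicate check concrete, a minimal sketch assuming a simple (node ID, label) representation; real tools run this across branches and synonym lists.

    # Minimal sketch: flag taxonomy nodes whose labels collide
    # after case/whitespace normalization. Data is hypothetical.
    from collections import defaultdict

    nodes = [("n1", "Mergers"), ("n2", "Acquisitions"), ("n3", "mergers ")]

    seen = defaultdict(list)
    for node_id, label in nodes:
        seen[label.strip().lower()].append(node_id)

    duplicates = {label: ids for label, ids in seen.items() if len(ids) > 1}
    print(duplicates)  # {'mergers': ['n1', 'n3']}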
Evaluating Text Analytics Software
Feature Checklist and Score: Usability
• Ease of use – copy, paste, rename, merge, etc.
• User documentation – user manuals, online help, training, and tutorials
• Visualization
  – File structure, tree, hierarchical and alphabetical views
• Automatic Taxonomy/Node & Rule Generation
  – Nonsense for a whole taxonomy
  – At the node level – suggestions for sub-categories and rules
• Variety of node relationships – child-parent, related
Evaluating Text Analytics Software
Feature Checklist and Score: Additional Features
• Language support – international – if you have a need for it
• Scalability – size of taxonomy is rarely important
  – More important for auto-categorization
• Import-Export – XML and SKOS (see the sketch below)
• Support for standards – NISO, etc. – mapping between taxonomies
• API / SDK
• Security, access rights, roles
• Advanced Features – future growth
  – Facts / ontologies / Semantic Web – RDF+
  – Sentiment Analysis
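As an illustration of what SKOS export looks like in practice, a minimal sketch using the rdflib Python library with a hypothetical two-node taxonomy; the tools under evaluation should round-trip far richer structures.

    # Minimal sketch: serialize a parent/child taxonomy pair as SKOS.
    # The namespace and concepts are hypothetical.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, SKOS

    EX = Namespace("http://example.org/taxonomy/")
    g = Graph()
    g.bind("skos", SKOS)

    parent, child = EX["finance"], EX["accounting"]
    for node, label in [(parent, "Finance"), (child, "Accounting")]:
        g.add((node, RDF.type, SKOS.Concept))
        g.add((node, SKOS.prefLabel, Literal(label, lang="en")))
    g.add((child, SKOS.broader, parent))

    print(g.serialize(format="xml"))  # SKOS as RDF/XML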
Evaluating Text Analytics Software
Advanced Features – Text Analytics as Platform
• Entity Extraction
  – Multiple types, custom classes
• Summarization
  – Customizable rules, mapped to different content
• Auto-categorization
  – Training sets
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (title, body, URL)
  – Advanced – saved search queries (full search syntax)
  – NEAR, SENTENCE, PARAGRAPH
  – Boolean – X NEAR Y AND NOT Z (see the sketch below)
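A minimal sketch of how an "X NEAR Y AND NOT Z" rule can be evaluated, assuming a simple word-window definition of NEAR; every vendor has its own syntax and distance semantics.

    # Minimal sketch: evaluate a categorization rule "X NEAR Y AND NOT Z".
    # The window size and rule terms are hypothetical.
    import re

    def near(text, x, y, window=10):
        # True if x and y occur within `window` words of each other.
        words = re.findall(r"\w+", text.lower())
        xs = [i for i, w in enumerate(words) if w == x]
        ys = [i for i, w in enumerate(words) if w == y]
        return any(abs(i - j) <= window for i in xs for j in ys)

    def matches_rule(text, x, y, z):
        return near(text, x, y) and z not in text.lower()

    doc = "The merger announcement boosted the stock price sharply."
    print(matches_rule(doc, "merger", "stock", "rumor"))  # True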
Evaluating Taxonomy Software
POC
• Quality of results is the essential factor
• Six-week POC – bake-off – or a short pilot
• Real-life scenarios, categorization with your content
• Preparation:
  – Preliminary analysis of content and users' information needs
  – Set up software in a lab – relatively easy
  – Train taxonomist(s) on the software
  – Develop a taxonomy if none is available
• Six-week POC – 3 rounds of develop, test, refine – not out-of-the-box
• Need SMEs as test evaluators – also to do an initial categorization of content (scored as in the sketch below)
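One way to score each POC round against the SMEs' initial categorization is standard precision/recall, as in this minimal sketch; the labels are hypothetical and assume one taxonomy node per document.

    # Minimal sketch: compare tool output with the SME gold standard.
    from sklearn.metrics import precision_recall_fscore_support

    sme_labels  = ["Finance", "HR", "Finance", "Legal"]  # SME categorization
    tool_labels = ["Finance", "HR", "Legal", "Legal"]    # vendor tool output

    p, r, f, _ = precision_recall_fscore_support(sme_labels, tool_labels,
                                                 average="micro")
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")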
Evaluating Taxonomy Software
POC
• The majority of time is spent on auto-categorization
• Need to balance uniformity of results with vendor-unique capabilities – have to determine this at POC time
• Risks – getting the software installed and working, getting the right content, initial categorization of content
• Elements:
  – Content
  – Search terms / search scenarios
  – Training sets
  – Test sets of content
• Taxonomy developers – expert consultants plus internal taxonomists
Evaluating Taxonomy Software
POC
• Test Cases:
  – Auto-categorization to the existing taxonomy – variety of content
  – Clustering – automatic node generation
  – Summarization
  – Entity extraction – build a number of catalogs – design which ones based on projected needs – for example, privacy info (SS#, phone, etc.; see the sketch below)
  – Entity examples – people, organizations, methods, etc.
  – Evaluate usability in action by taxonomists
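For the privacy-info test case, a minimal sketch of a regex-based entity catalog covering the SS# and phone examples above; production catalogs combine patterns with dictionaries and context rules.

    # Minimal sketch: a privacy-information entity catalog.
    # Patterns are deliberately simplified.
    import re

    CATALOG = {
        "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }

    def extract_entities(text):
        return {name: pat.findall(text) for name, pat in CATALOG.items()}

    print(extract_entities("Call 415-555-0199 about SSN 078-05-1120."))
    # {'ssn': ['078-05-1120'], 'phone': ['415-555-0199']}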
Evaluating Taxonomy Software
POC – Issues
• Quality of content
• Quality of initial human categorization
• Normalize among different test evaluators (see the sketch below)
• Quality of taxonomists – experience with text analytics software and/or experience with the content and information needs and behaviors
• Quality of taxonomy
  – General issues – structure (too flat or too deep)
  – Overlapping categories
  – Differences in use – browse, index, categorize
  – IMPORTANT!!!
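Normalizing among evaluators starts with measuring how much they actually agree; a minimal sketch using Cohen's kappa (scikit-learn), with hypothetical labels.

    # Minimal sketch: inter-evaluator agreement on the same documents.
    from sklearn.metrics import cohen_kappa_score

    evaluator_a = ["Finance", "HR", "Legal", "Finance"]
    evaluator_b = ["Finance", "HR", "Finance", "Finance"]

    print(f"kappa={cohen_kappa_score(evaluator_a, evaluator_b):.2f}")
    # ~0.56 with this data; low values mean recalibrating evaluators
    # before trusting POC scores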
Conclusion
• Start with self-knowledge – what will you use it for?
  – Current environment – technology, information
• Basic features are only filters, not scores
• Integration – you need an integrated team (IT, Business, KA)
  – For both evaluation and development
• POC – your content, real-world scenarios – not scores
• The POC is the foundation for development and builds experience with the software
  – Development is better, faster, cheaper
• Categorization is essential and time-consuming
• The essential issue for categorization is the complexity of language
• The essential issue for entity extraction is scale
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com