Text Analytics Workshop
Evaluation of Software

Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com

Agenda
Features, Varieties, Vendors
Enterprise Context
  – Start with Self-Knowledge
  – Text Analytics Team
Evaluation Process
  – Features and Capabilities – Filter
  – Proof of Concept / Pilot

Text Analytics Software – Features
Entity Extraction
  – Multiple types, custom classes – entities, concepts, events
Auto-categorization
  – Taxonomy Structure
  – Training sets – Bayesian, Vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (Title, body, url)
  – Boolean – Full search syntax – AND, OR, NOT
  – Advanced – NEAR (#), PARAGRAPH, SENTENCE
Advanced Features
  – Facts / ontologies / Semantic Web – RDF +
  – Sentiment Analysis

Varieties of Taxonomy / Text Analytics Software
Taxonomy Management – Synaptica, SchemaLogic
Full Platform – SAP-Inxight, ClearForest, SAS-Teragram, Data Harmony, Concept Searching, IBM
Content Management – Nstein, Interwoven, Documentum, etc.
Embedded – Search – FAST, Autonomy, Endeca, Exalead, etc.
Specialty – Sentiment Analysis – Lexalytics

Vendors of Taxonomy / Text Analytics Software
Attensity
Business Objects – Inxight
Clarabridge
ClearForest
Data Harmony / Access Innovations
GATE (Open Source)
IBM Content Analyst
Lexalytics
Multi-Tes
Nstein
SAS – Teragram
SchemaLogic
Smart Logic
Synaptica
Wikionomy
Wordmap
Lots More

Evaluating Taxonomy / Text Analytics Software
Start with Self Knowledge
Strategic and Business Context
Info Problems – what, how severe
Strategic Questions – why, what value from the taxonomy / text analytics, how are you going to use it
Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or informal for smaller organizations
Text Analytics Strategy/Model – forms, technology, people
  – Existing taxonomic resources, software
Need this foundation to evaluate and to develop

Evaluating Taxonomy / Text Analytics Software
Start with Self Knowledge
Do you need it – and what blend if so?
Taxonomy Management Only
  – Multiple taxonomies, languages, authors-editors
Technology Environment – ECM, Enterprise Search – where is it embedded
Publishing Process – where and how is metadata being added – now and projected future
  – Can it utilize auto-categorization, entity extraction, summarization
Is the current search adequate – can it utilize text analytics?
Applications – text mining, BI, CI, Alerts?

Design of the Text Analytics Selection Team
Traditional Candidates – IT
Experience with large software purchases
  – Search/Categorization is unlike other software
Experience with needs assessments
  – Need more – know what questions to ask, knowledge audit
Objective criteria
  – Looking where there is light?
  – Asking IT to select taxonomy software is like asking a construction company to select the design of your house.
They have the budget
  – OK, they can play.

Design of the Text Analytics Selection Team
Traditional Candidates – Business Owners
Understand the business
  – But don’t understand information behavior
Focus on business value, not technology
  – Focus on semantics is needed
They can get executive sponsorship, support, and budget.
  – OK, they can play

Design of the Text Analytics Selection Team
Traditional Candidates – Library
Understand information structure
  – But not how it is used in the business
Experts in search experience and categorization
  – Suitable for experts, not regular users
Experience with variety of search engines, taxonomy software, integration issues
  – OK, they can play

Design of the Text Analytics Selection Team
Interdisciplinary Team, headed by Information Professionals
Relative Contributions
  – IT – Set necessary conditions, support tests
  – Business – provide input into requirements, support project
  – Library – provide input into requirements, add understanding of search semantics and functionality
Much more likely to make a good decision
Create the foundation for implementation

Evaluating Text Analytics Software – Process
Start with Self Knowledge
Eliminate the unfit
  – Filter One – Ask Experts – reputation, research – Gartner, etc.
      • Market strength of vendor, platforms, etc.
      • Feature scorecard – minimum, must have, filter to top 3
  – Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus
  – Filter Three – Focus Group one day visit – 3-4 vendors
Deep pilot (2) / POC – advanced, integration, semantics
Focus on working relationship with vendor.
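
The Filter One step above (a feature scorecard with must-have minimums, filtered down to a top 3) could be sketched roughly as follows. This is only an illustration: the vendor labels, feature names, and the choice to rank the survivors by a count of nice-to-have features are all assumptions made for the example, not part of the workshop material.

```python
from dataclasses import dataclass


@dataclass
class Vendor:
    name: str            # hypothetical vendor label
    features: frozenset  # features the vendor is known to support


# Hypothetical checklists; a real scorecard would come from your own requirements.
MUST_HAVE = frozenset({"auto-categorization", "entity extraction", "api"})
NICE_TO_HAVE = frozenset({"skos export", "sentiment analysis", "summarization"})


def shortlist(vendors, must_have=MUST_HAVE, nice_to_have=NICE_TO_HAVE, top_n=3):
    """Drop vendors missing any must-have feature, then keep the top_n."""
    # Step 1: eliminate the unfit (must-have features act as a filter).
    fit = [v for v in vendors if must_have <= v.features]
    # Step 2: order the survivors by how many nice-to-have features they cover.
    ranked = sorted(fit, key=lambda v: len(v.features & nice_to_have), reverse=True)
    return ranked[:top_n]


vendors = [
    Vendor("Vendor A", MUST_HAVE | {"skos export", "summarization"}),
    Vendor("Vendor B", MUST_HAVE | {"sentiment analysis"}),
    Vendor("Vendor C", frozenset({"auto-categorization", "api"})),  # lacks entity extraction
    Vendor("Vendor D", MUST_HAVE | NICE_TO_HAVE),
    Vendor("Vendor E", MUST_HAVE),
]

for v in shortlist(vendors):
    print(v.name)
```

The point of the design is that the checklist removes vendors rather than picking a winner; the remaining three still go through the technology filter, the focus group visit, and the POC.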

Evaluating Text Analytics Software
Feature Checklist and Score: Basic Features, Admin
New, copy, rename, delete, merge
  – Branches not just nodes
Scope Notes
Spell check
Search – all parts and selected (only taxonomy nodes)
Names and Identifiers for terms and nodes
Check for duplicates
Versioning, multiple authors
Analytical reports – structure, application to documents

Evaluating Text Analytics Software
Feature Checklist and Score: Usability
Ease of use – copy, paste, rename, merge, etc.
User Documentation, user manuals, on-line help, training and tutorials
Visualization – file structure, tree, hierarchy and alphabetical
Automatic Taxonomy/Node & Rule Generation
  – Nonsense for Taxonomy
  – Node – suggestions for sub-categories, rules
Variety of node relationships – child-parent, related

Evaluating Text Analytics Software
Feature Checklist and Score: Additional Features
Language support – international – if you have need for it
Scalability
  – Size of taxonomy rarely important
  – More important for auto-categorization
Import-Export – XML and SKOS
Support standards – NISO, etc.
Mapping between taxonomies
API / SDK
Security, Access Rights, Roles
Advanced Features – future growth
  – Facts / ontologies / Semantic Web – RDF +
  – Sentiment Analysis

Evaluating Text Analytics Software
Advanced Features – Text Analytics as Platform
Entity Extraction – Multiple types, custom classes
Summarization – Customizable rules, map to different content
Auto-categorization
  – Training sets
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (Title, body, url)
  – Advanced – saved search queries (full search syntax)
  – NEAR, SENTENCE, PARAGRAPH
  – Boolean – X NEAR Y and Not-Z

Evaluating Taxonomy Software – POC
Quality of results is the essential factor
6 week POC – bake-off or short pilot
Real life scenarios, categorization with your content
Preparation:
  – Preliminary analysis of content and users' information needs
  – Set up software in lab – relatively easy
  – Train taxonomist(s) on software(s)
  – Develop taxonomy if none available
Six week POC – 3 rounds of development, test, refine / Not OOB
Need SMEs as test evaluators – also to do an initial categorization of content

Evaluating Taxonomy Software – POC
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
Risks – getting software installed and working, getting the right content, initial categorization of content
Elements:
  – Content
  – Search terms / search scenarios
  – Training sets
  – Test sets of content
Taxonomy Developers – expert consultants plus internal taxonomists

Evaluating Taxonomy Software – POC
Test Cases:
Auto-categorization to existing taxonomy – variety of content
Clustering – automatic node generation
Summarization
Entity extraction – build a number of catalogs – design which ones based on projected needs – example: privacy info (SS#, phone, etc.)
Entity example – people, organization, methods, etc.
Evaluate usability in action by taxonomists

Evaluating Taxonomy Software – POC – Issues
Quality of content
Quality of initial human categorization
Normalize among different test evaluators
Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
Quality of taxonomy
  – General issues – structure (too flat or too deep)
  – Overlapping categories
  – Differences in use – browse, index, categorize – IMPORTANT!!!

Conclusion
Start with self-knowledge – what will you use it for?
  – Current Environment – technology, information
Basic Features are only filters, not scores
Integration – need an integrated team (IT, Business, KA)
  – For evaluation and development
POC – your content, real world scenarios – not scores
Foundation for development, experience with software
  – Development is better, faster, cheaper
Categorization is essential, time consuming
Categorization – essential issue is complexity of language
Entity Extraction – essential issue is scale

Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
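
As a supplementary illustration of the auto-categorization rule types listed under Features (Boolean AND/OR/NOT plus NEAR), here is a minimal sketch of how a rule of the form "X NEAR Y and Not-Z" might be evaluated. The tokenizer, the word-distance definition of NEAR, and the example terms are assumptions made for illustration; commercial products define these operators in their own ways.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(tokens, a, b, distance):
    """True if terms a and b occur within `distance` tokens of each other."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def matches_rule(text):
    """Hypothetical category rule: (merger NEAR/5 acquisition) AND NOT rumor."""
    tokens = tokenize(text)
    return near(tokens, "merger", "acquisition", 5) and "rumor" not in tokens


docs = [
    "The merger followed an earlier acquisition of the smaller firm.",
    "Rumor of a merger and acquisition spread.",
    "The merger was announced in spring. Many months later, after long "
    "talks, a separate acquisition closed.",
]

for doc in docs:
    print(matches_rule(doc), "-", doc)
```

A rule like this matches literal tokens only; the stemming and related-term dictionaries mentioned in the feature list would have to be applied before matching, which is part of why categorization is described as time consuming.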