Text Analytics in Action: Using Text Analytics as a Toolset
TBC 4:15 p.m. - 5:00 p.m.
Marjorie Hlava
Semantic enrichment / Semantic Fingerprinting

Abstract
• Big data inferences are increasingly used to mine huge heaps of data.
• The applications are endless.
• However, those inferences do not work well when many lines go to a single bubble. The lines and relationships must be drawn between concepts, not simply between words.
• Text analytics is a powerful tool, but it is a means to an end, not the end itself.
• The important work is in the interpretation of the data.
• This session outlines a highly accurate and efficient approach and provides a case study of the application.

Outline of the talk
• Using text analytics in term extraction – 3 examples
  – Pattern recognition
  – String tagging
  – Taxonomy control
• Achieving Synonymy
• Now what do I do with it?

Term clouds
• Good place to start
• Show concept landscape
• Basis =
  – Levenshtein distances
  – N-grams
• Redundant concepts, separately shown
• No disambiguation
• Not direct XML tagging

Sample article
Normal text extraction
Near conceptual synonyms
Nonsensical suggestions
Small Taxonomy
Near synonym, conceptual duplicate
Refined presentation
Dependent concepts
Ontological dependencies

Achieving Synonymy
• Find like concepts
• Merge the terms
• Choose a preferred form
• Build term record
  – Hierarchy
  – Equivalence
  – Associative

Overview, Upload 7K documents, search for text string, add a tag
“Columbia” “Colombian” – no stemming
Same document – different terms
Colombiana – record overlap
“FARC” – No Synonymy
“People’s Armed Forces of Colombia”, i.e., FARC, lacks synonymy, some doc overlap
Tag suite, no hierarchy, no equivalence, no combining tags for synonymy

Disambiguation
Bridge Structure
Bridge Dentistry
Bridge Game
Bridge Concept

Now what do I do with it?
• Tag documents
  – Consistently
  – Even depth of treatment
  – Full breadth of conceptual area
• Insert concepts in full text or as linked data
• Implement in search
• Use for internal statistics and analysis
• Track industry trends
• Create semantic fingerprints

The AIP Thesaurus
Hierarchy
Term Record

The AIP Thesaurus: Rulebase
This article is about (among other things) degenerate stars. The text string “degenerate stars” occurs zero times in the text of the article. But because the rulebase is tuned to recognize certain other words appearing near the text “star” or “stars”, the article was correctly indexed.

The AIP Thesaurus: Rulebase
If the word “star” or “stars” appears in the same sentence as “degenerate” or “compact”, MAI applies the term “Degenerate stars” instead of just using “Stars”.

The AIP Thesaurus: Applications
Listing of the AIP Thesaurus terms in JATS. Includes the term, keyword-ID, weight, code.
Inline tagged terms (denoted by the highlighting). The keyword ID (kwd1.4) corresponds with the name in the previous screenshot.
HTML Header
Copyright © 2013 Access Innovations, Inc.

Content Recommender
Selected Article
Search “thin film sputtering”
Grants available
Upcoming conferences on this topic
More Articles on the same topic
Authors working in this space

Taxonomy Driven Search Presentation
Auto-completion using the taxonomy
Guide the user
Navigate the full taxonomy “tree”
BROWSE
Thesaurus Term Record view
Taxonomy view
Copyright © 2005 - Access Innovations, Inc.
Suggested taxonomy descriptors

Visualization Strategies
Visualization Software Matrix
Pattern Analysis
Domain Associations
Pattern Analysis
Gap Analyses

Summary
• Taxonomy tool box
• Text extraction / mining for terms
• Gather synonyms
• Disambiguate terms
• Look for gaps and over coverage
• Map all conceptual groupings
  – Hierarchical, Associative, Equivalence
• Apply to content
• Leverage knowledge of the collection

Thank you
Marjorie M.K. Hlava, President
Access Innovations
505-998-0800
mhlava@accessinn.com
The Semantic Enrichment Company

About Access Innovations
Access Innovations are experts in content creation, enrichment, and conversion services. We provide services to semantically enrich and tag raw text into highly structured data. We deliver clean, well-formed, metadata-enriched content so our clients can reuse, repurpose, store, and find their knowledge assets. We go beyond the standards to build taxonomies and other data control structures as a solid foundation for your information. Our services and software allow organizations to use and present their information to both internal and external constituents by leveraging search, presentation, and e-commerce. We change search to found!

Quick Facts
• Founded in 1978
• Headquartered in Albuquerque, NM
• Privately held
• Delivered more than 2000 engagements
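
Appendix: sketch of the rulebase example
The rulebase slide describes a same-sentence co-occurrence rule: “star” or “stars” near “degenerate” or “compact” yields “Degenerate stars” rather than plain “Stars”. The Python below is only a minimal sketch of that idea. It is not MAI's rule syntax or implementation; the function names, the regular expressions, and the naive sentence splitter are all assumptions made for illustration.

```python
import re

# Hypothetical rule modeled on the "Degenerate stars" slide.
# This is NOT MAI's actual rule format, only an illustration.
TRIGGER = re.compile(r"\bstars?\b", re.IGNORECASE)
NEAR = re.compile(r"\b(degenerate|compact)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def index_terms(text):
    """Return the set of thesaurus terms suggested for the text.

    The rule is evaluated sentence by sentence: a sentence containing
    "star"/"stars" suggests "Degenerate stars" when "degenerate" or
    "compact" occurs in the same sentence, and "Stars" otherwise.
    """
    terms = set()
    for sentence in split_sentences(text):
        if TRIGGER.search(sentence):
            if NEAR.search(sentence):
                terms.add("Degenerate stars")
            else:
                terms.add("Stars")
    return terms


print(index_terms("The compact star cooled slowly."))  # {'Degenerate stars'}
```

Note that the sample sentence never contains the literal string “degenerate stars”, which is the point the slide makes: the term is assigned from context, not from a string match. A production rulebase would need real sentence segmentation and many more conditions per term.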