Unstructured Data and Text Mining
D. Silver

Unstructured Data
• Definition: Information that either does not have a predefined data model or is not organized in a predefined manner
• The term is imprecise for several reasons:
  – The structure of the data may be easily implied, but not explicit
  – The data may have explicit structure, but not for the task at hand
  – The data may have some underlying structure that is not understood

80% of Data is Unstructured
• Much of it is text based:
  – Business data:
    • Call center transcripts
    • Other CRM
  – Email
  – Open-ended survey responses
  – Web pages
  – Newsgroups
  – Organizational documents
  – Regulatory information
Copyright 2003-4, SPSS Inc.

Growth of Unstructured Data

Unstructured Information Management Architecture (UIMA)
• Architecture for the development, discovery, composition, and deployment of analytics on unstructured data
• Provides a common framework for processing unstructured data to extract meaning and create structured data and information
• IBM’s Watson uses UIMA for real-time content analytics

References:
• Text to attributes, p. 328-329
• Text mining, Section 9.5
• Web mining and beyond, Sections 9.6-9.8
• String conversion, p. 439
• http://en.wikipedia.org/wiki/Unstructured_data
• http://en.wikipedia.org/wiki/UIMA
• http://bigdataintegration.blogspot.ca/2012/02/unstructured-data-is-myth.html

Text Mining
• Text is:
  – Unstructured, amorphous and challenging to parse
  – The most common form of information exchange
  – Motivation to extract information is compelling
• Text Mining differs from Data Mining
  – Most authors strive to clearly inform the reader
  – But humans do not have time to read/interpret everything
  – TM focuses on extracting information ready for rapid machine or human consumption

Text Mining
• Two broad approaches:
  – Natural Language Processing (Comp. Linguistics)
    • Extracts concepts based on semantics
    • Relies heavily on language morphology, syntax, and semantics
  – Information Retrieval
    • Exploits the bag-of-words approach
    • Term weighting and text similarity measures

Text Mining is a Variant of DM

NLP Approach
[Diagram; recoverable labels: Surveys, Web, Channel, Operational Systems, Customer Data, Data Collection, Text, NLP, Concepts, Attributes, Attitudes, Concept Maps, Text Clustering, Categorization, Trending, Information Extraction, Prediction, Outcomes, Actions, Attract, Grow, Retain, Fraud, Expert UI, Business UI, Business User]

NLP Relies on the Building Blocks of Language
• Morphology
• Syntax
• Semantics
• Objective is to go from a syntactic phrase:
  – “Using a tool like Text Mining is a great idea for any organization that is interested in maintaining information on competitive intelligence.”
• To a semantic concept:
  – Competitive Intelligence

Morphology
• Understanding words
  – Stems
  – Affixes
    • Prefix
    • Suffix
  – Inflectional elements
• Reduces complexity of analysis
• Reduces complexity of representation
• Supports text mining
• Example: Prefix (in) - Noun stem (dispute) - Suffix (able)

Syntax
• The Bank of Canada will curb inflation with higher interest rates
• Parse, as labeled in the slide's tree diagram:
  – Sentence
    • Noun phrase: Adjective (The), Noun (Bank of Canada)
    • Verb phrase: Aux (will), Verb (curb), Noun (inflation)
    • Prepositional phrase: with, followed by a Noun phrase: Adjective (higher), Noun (interest rates)

Semantics
• The meaning of it all
• Approaches to meaning
  – Semantic networks
  – Deductive logic
  – Rule-based systems
• Useful for classification of documents
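
As a rough illustration of the prefix-stem-suffix decomposition on the Morphology slide, the following Python sketch strips at most one known prefix and one known suffix from a word. The affix lists, the minimum stem length, and the function name `split_affixes` are assumptions made for this example and do not come from the slides; real morphological analysers rely on much richer rules and dictionaries.

```python
# Hypothetical, deliberately tiny affix lists (assumptions for illustration).
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "ed", "s")

MIN_STEM = 3  # do not strip an affix if fewer than 3 letters would remain


def split_affixes(word):
    """Return (prefix, stem, suffix); a missing affix is an empty string."""
    word = word.lower()
    prefix = suffix = ""
    # Strip the first matching prefix, if enough of the word remains.
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= MIN_STEM:
            prefix, word = p, word[len(p):]
            break
    # Strip the first matching suffix, if enough of the word remains.
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= MIN_STEM:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix


if __name__ == "__main__":
    for w in ("indisputable", "rates", "curb"):
        print(w, "->", split_affixes(w))
```

Here "indisputable" splits into "in", "disput" and "able"; the stem surfaces as "disput" rather than "dispute" because the final "e" is dropped in the written form. Naive stripping of this kind also over-strips: "interest" would lose its leading "in", which is one reason practical stemmers add conditions on when a rule may fire.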

Problems with NLP
• Limitations of Natural Language Processing
  – Correctly identifying the role of noun phrases
  – Representing abstract concepts
  – Classifying synonyms
  – Representing the number of concepts
• Limitations of technology
  – Language-specific designs are required
  – Classification speed
  – Classifying hybrid words and sentences

IR Approach
• Statistics applied to syntax yields pretty good results for:
  – Information Filtering
  – Text Categorization
  – Document/Term Clustering
  – Text Summarization

Generality of Basic Techniques
• Raw text, after stemming and stop-word removal, becomes tokenized text
• Tokenized text, after term weighting, becomes a document-term weight matrix:

         t1    t2    …    tn
    d1   w11   w12   …    w1n
    d2   w21   w22   …    w2n
    …    …     …     …    …
    dm   wm1   wm2   …    wmn

• Term similarity and doc similarity: CLUSTERING
• Sentence selection: SUMMARIZATION
• Vector centroid: CATEGORIZATION
• META-DATA/ANNOTATION

Stemming
• General:
  – http://en.wikipedia.org/wiki/Stemming
  – http://www.comp.lancs.ac.uk/computing/research/stemming/general/
  – http://snowball.tartarus.org/texts/introduction.html *READ*
• Julie B. Lovins (1968)
  – http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
  – http://snowball.tartarus.org/algorithms/lovins/stemmer.html
• Martin Porter (1979)
  – http://www.comp.lancs.ac.uk/computing/research/stemming/general/porter.htm
• Snowball (~2000)
  – Framework for writing stemming algorithms
  – Language and compiler for stemming algorithms
  – http://snowball.tartarus.org

Information Filtering
• Stable, long-term user interest; dynamic information source
• System must make a delivery decision immediately as a document “arrives”
• Two methods: Content-based vs. Collaborative
[Diagram: “my interest: …” feeding a Filtering System]

Examples of Information Filtering
• News filtering
• Email filtering
• Recommender systems
• Literature alerts
• And many others

Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering (next)
• Text Summarization

The Clustering Problem
• Discover “natural structure”
• Group similar objects together
• Objects can be documents, terms, or passages

Similarity-induced Structure

Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define a “concept” or “theme”
• Automatic construction of hyperlinks
• In general, very useful for text mining

Sample Applications
• Information Filtering
• Text Categorization
• Document/Term Clustering
• Text Summarization (next)

“Retrieval-based” Summarization
• Observation: can a term vector serve as a summary?
• Basic approach
  – Rank “sentences”, and select the top N as a summary
• Methods for ranking sentences
  – Based on term weights
  – Based on position of sentences
  – Based on the similarity of sentence and document vector
  – NOTE: Similarity can be measured by the inner product of vectors of term frequencies

Examples of Summarization
• News summary
• Summarize retrieval results
  – Single doc summary
  – Multi-doc summary
• Summarize a cluster of documents (automatic label creation for clusters)

Sample Applications
• Information Filtering
• Text Categorization (next)
• Document/Term Clustering
• Text Summarization

Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy)
• Classify new documents
• A standard supervised learning problem
[Diagram: a Categorization System assigning documents to categories such as Sports, Business, Education, Science, …]

Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic email sorting
• Web page classification

References
• http://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf
• http://disi.unitn.it/~bernardi/Courses/CL/Slides/ir.pdf
• Multinomial Distribution
  – http://onlinestatbook.com/2/probability/multinomial.html
  – http://onlinestatbook.com/2/probability/binomial.html

WEKA Tutorials
• https://moodle.umons.ac.be/pluginfile.php/43703/mod_resource/content/2/WekaTutorial.pdf
• http://www.unal.edu.co/diracad/einternacional/Weka.pdf
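
Returning to the “Retrieval-based” Summarization slide, the basic approach (rank sentences, select the top N) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of any particular tool: the stop-word list, the regular-expression tokenizer and sentence splitter, and the names `tokenize`, `tf_vector`, `inner_product` and `summarize` are all introduced here for illustration. Sentences are scored by the inner product of their term-frequency vector with the document's term-frequency vector, as the slide's NOTE suggests.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "with", "will"}


def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_vector(text):
    """Bag-of-words vector mapping each term to its raw frequency."""
    return Counter(tokenize(text))


def inner_product(u, v):
    """Inner product of two sparse term-frequency vectors."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def summarize(text, n=1):
    """Rank sentences by similarity to the document vector; keep the top n."""
    # Naive sentence splitter: break after '.', '!' or '?' followed by spaces.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    doc_vec = tf_vector(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: inner_product(tf_vector(sentences[i]), doc_vec),
        reverse=True,
    )
    top = sorted(ranked[:n])  # present the selected sentences in document order
    return [sentences[i] for i in top]


if __name__ == "__main__":
    doc = (
        "The Bank of Canada will curb inflation with higher interest rates. "
        "Inflation worries the bank. Weather was mild."
    )
    print(summarize(doc, n=1))
```

Because the raw inner product grows with sentence length, long sentences tend to win; dividing by the vector norms (cosine similarity) or replacing raw frequencies with other term weights are common adjustments and fit the same structure.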