An Introduction to Text Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada
Copyright 2003-4, SPSS Inc.

Agenda
  Introductions
  An Overview of Document Warehousing
  Understanding Unstructured Text
  Concept Extraction
  Text Mining
  Data Mining
  Demonstration

Tim Daciuk
  Background
    Social research
    Survey research
  SPSS
    25 years working with the product
    12 years working with the company
    5 years working with text analysis
  Prior history
    Consulting
    Education

Predictive Analytics: Defined
  "Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events."
  Gareth Herschel, Research Director, Gartner Group

SPSS At A Glance
  Leadership
  Stability
  Founded in 1968
  30+ year heritage in analytic technologies
  Proven track record
  Market leader in Predictive Analytics
  Focus on online & offline customer data acquisition and analysis
  250,000+ customers worldwide
  NASDAQ: SPSS
  Analytics standard
  80% of Fortune 500 are SPSS customers
  80% plus market share in Survey & Market Research sector
  Ranked #1 Data Mining solution by KD Nuggets

Some of Our Brands

Unstructured Data Management
  Text Mining is a subset of Unstructured Data Management.
  UDM can be broken down into:
    Content and Document Management
    Search and Retrieval
    XML database and tools
    Categorization, Classification, and Visualization

80% of Data is Unstructured
  Database notes:
    Call center transcripts
    Other CRM
  Email
  Open-ended survey responses
  Web pages
  NewsGroups
  Documents themselves
  Competitive information

Applications for Text Analysis
  Surveys
  'Reading' email
  Call centre data
  Comment data
  Abstracts
  Document management
  Corporate history
  Thematic understanding of website

Data Warehouse vs.
Document Warehouse
  Data warehouse
    Who, what, when, where, how much
    Internally focused
    Operational information
    Rarely include external information
  Document warehouse
    Why
    May not be internally focused
    May contain a range of information
    Often integrate external information

Document Warehouse Features
  There is no single document structure or document type
  Documents are drawn from multiple sources
  Essential features of documents are automatically extracted and explicitly stored in the document warehouse
  Document warehouses are designed to integrate semantically related documents

Building the Document Warehouse
  Identify Sources -> Retrieve Document -> Pre-process Document -> Text Analysis -> Compile Metadata

Predict, Impact, Deploy
  [Diagram labels: Concept Maps, Attract, Text, Attitudes, Categorization, Surveys, Text, Actions, NLP, Concepts, Trending, Web, Channel, Grow, Retain, Outcomes, Information Extraction, Operational Systems, Attributes, Business UI, Clustering, Fraud, Prediction, Customer Data, Data Collection, Expert UI, Business User]

The Building Blocks of Language
  Morphology
  Syntax
  Semantics
  Phonology
  Pragmatics

Morphology
  Understanding words
    Noun stems
    Affixes
      Prefix
      Suffix
    Inflectional elements
  Reducing complexity of analysis
    Reduces complexity of representation
    Supports text mining
  Example: Prefix - Noun Stem - Suffix: in - dispute - able

Syntax
  The Bank of Canada will curb inflation with higher interest rates
  Sentence
    Noun phrase: The (adjective), Bank of Canada (noun)
    Verb phrase: will (aux), curb (verb), inflation (noun)
    Prepositional phrase: with, followed by a noun phrase: higher (adjective), interest rates (noun)

Semantics
  The meaning of it all
  Approaches to meaning
    Semantic networks
    Deductive logic
    Rule-based systems
  Useful for classification
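As a rough illustration of the prefix / noun stem / suffix split shown on the Morphology slide, the Python sketch below strips at most one prefix and one suffix from a word. The affix lists and the function name split_affixes are assumptions made for this example only; they are not taken from any SPSS product, and a real morphological analyser relies on far larger lexicons and rules.

```python
# Hypothetical, deliberately tiny affix lists (assumptions for illustration).
PREFIXES = ("in", "un", "re", "dis")
SUFFIXES = ("able", "ness", "ing", "ed", "s")


def split_affixes(word):
    """Return (prefix, stem, suffix); an empty string means no affix matched."""
    word = word.lower()
    # Take the first listed prefix that leaves at least three characters behind.
    prefix = next(
        (p for p in PREFIXES if word.startswith(p) and len(word) > len(p) + 2),
        "",
    )
    rest = word[len(prefix):]
    # Take the first listed suffix that leaves at least three characters of stem.
    suffix = next(
        (s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s) + 2),
        "",
    )
    stem = rest[: len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix


print(split_affixes("indisputable"))  # ('in', 'disput', 'able')
print(split_affixes("rates"))         # ('', 'rate', 's')
```

Note that the recovered stem is "disput" rather than the slide's "dispute": plain string stripping cannot restore a dropped final "e", and it will also over-segment words such as "interest". That is one reason linguistic tools use dictionaries rather than string matching alone.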
Problems with NLP
  Limitations of Natural Language Processing
    Correctly identifying the role of noun phrases
    Representing abstract concepts
    Classifying synonyms
    Representing the number of concepts

Problems with NLP
  Limitations of technology
    Language-specific designs are required
    Classification speed
    Classifying hybrid words and sentences

Underlying Technology is Based on Linguistics
  Text is unstructured, ambiguous, and language dependent.
  The Linguistic Approach:
    Does not treat a document as a bag of words
    Removes ambiguity by extracting structured concepts
  Concepts are the DNA of text.

From Text to Concepts
  [Diagram: Linguistic Terminology Extractor at the centre, surrounded by Morphology, Syntax, Semantics, and Statistics]
  Accurate
    Compound words
    Proper nouns
    Figures
    Named entities
    Domain specifics
    Examples shown: Inserm; merck & co…; tnp-470; glut-4…; factor receptor; inhibitory effect; D. John Paganoni, ..; positive/negative opinion…; London, Paris…
  Scalable
    Speed: 1GB/hour
    Multiple formats: PDF, MS Office, text…
    Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese
  Customizable
    SPSS dictionaries: names, orgs…; MeSH, genes...
    User dictionaries: synonyms, stop words..
    Extraction rules: predicates
    Extraction patterns
  Discovery-Oriented
    Trends
    Known terms
    Unknown terms
    New terms

From Concepts to Predictive Analytics
  Components (built on the Linguistic Terminology Extractor)
    LexiQuest Mine
      Discover concepts, relationships and trends
    LexiQuest Categorize
      Understand documents and assign them to pre-defined categories
    Text Mining for Clementine
      Add text fields to data mining for better prediction

Concept Extraction Engine
  The extractor turns unstructured text into concepts:
  [Diagram labels: Visualization, LexiQuest Mine, Probabilities, Clementine, LexiQuest Categorize, LexiQuest Extractor Engine, Linguistic Processor]
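To make the idea of turning text into concepts more concrete, here is a minimal Python sketch of pattern-based extraction: tag each word with a part of speech, then keep word sequences whose tags match a known noun-based pattern. Everything in it is an assumption for illustration. The lexicon is hand-built for the deck's own example sentence, the tag letters follow the deck's part-of-speech legend (a adjective, c preposition, d determiner, n noun, v verb, s stop word), and the three patterns are hypothetical, since the deck mentions 32 known patterns for English without listing them. It is not the LexiQuest implementation.

```python
# Hand-built toy lexicon: word -> tag (assumption; a real tagger is statistical
# or dictionary driven).
LEXICON = {
    "using": "v", "a": "d", "tool": "n", "like": "c", "lexiquest": "n",
    "mine": "n", "is": "v", "great": "a", "idea": "n", "for": "c",
    "any": "d", "organization": "n", "that": "s", "interested": "a",
    "in": "c", "maintaining": "v", "information": "n", "on": "c",
    "competitive": "a", "intelligence": "n",
}

# Hypothetical noun-based tag patterns, longest first.
PATTERNS = ("ann", "an", "nn")


def tag(sentence):
    """Step 1: return a list of (word, tag); unknown words count as stop words."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [(w, LEXICON.get(w, "s")) for w in words if w]


def extract_concepts(sentence):
    """Step 2: scan left to right and keep spans whose tags match a pattern."""
    tagged = tag(sentence)
    tags = "".join(t for _, t in tagged)
    concepts, i = [], 0
    while i < len(tagged):
        for pattern in PATTERNS:
            if tags.startswith(pattern, i):
                span = tagged[i:i + len(pattern)]
                concepts.append(" ".join(w for w, _ in span))
                i += len(pattern)
                break
        else:
            i += 1  # no pattern starts here; move to the next word
    return concepts


SENTENCE = ("Using a tool like LexiQuest Mine is a great idea for any "
            "organization that is interested in maintaining information "
            "on competitive intelligence.")
print(extract_concepts(SENTENCE))
# ['lexiquest mine', 'great idea', 'competitive intelligence']
```

The toy patterns surface "competitive intelligence" along with other candidates. How the real extractor filters or ranks its candidates is not described in the slides.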
Part-of-Speech Tagging
  a: adjective
  b: adverb
  c: preposition
  d: determiner
  n: noun
  v: verb
  o: coordination
  p: participle
  s: stop word

How is a Concept Extracted?
  Step 1: Part-of-Speech Tagging
    Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A
    idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V
    information/N on/P competitive/N intelligence/N

How is a Concept Extracted?
  Step 2: Matching to Known Patterns
    This: V P N A N N V P A N P A N P V V P V N P N N
    Looks Most Like: NCDNN
    (32 known patterns for English)

How is the Concept Extracted?
  The extractor looks at this sentence:
    Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.
  And extracts the concept:
    Competitive Intelligence
  Concepts are:
    Noun based
    Can be longer than one word

Example: Categorization

The Issue of Language
  NLP requires separate language understanding
  Clementine text mining
    French
    English
    English/French
    German
    Spanish
    Dutch
    Japanese
    Italian
    MeSH (Medical Subject Headings)
    http://www.nlm.nih.gov/mesh/meshhome.html

Data Mining Defined
  "The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques."
  - The Gartner Group
  Why data mining?
    Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data
    As opposed to classical linear models (e.g., linear regression) that aren't as capable
    A related issue is 'noise' in the data: where, for example, two seemingly similar sets of inputs yield a different output
A Data Mining Methodology
  Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)
  Based on real-world lessons:
    Focus on business issues
    User-centric & interactive
    Full process
    Results are used

Data Mining is not…
  Keep in mind that data mining is not…
    "Blind" application of analysis/modeling algorithms
    Brute-force crunching of bulk data
    Black box technology
    Magic

Back to the Process
  Text Mining

Understanding
  Business Understanding
    Determine objective
    Assess situation
    Determine data mining goals
    Produce project plan
  Data Understanding
    Collect initial data
    Describe data
    Explore data
    Verify data quality

Data Preparation
  Data
    Data set
    Data set description
    Select data
    Clean data
    Construct data set / Integrate data
    Format data
  Text
    Concept extraction
    Concept combination
    Concept assessment

Modeling
  Select modeling technique
    Universe of techniques
    Appropriate techniques
    Data
    Text
    Requirements
    Constraints
    Selected tools
  Generate test design
  Run model(s)
  Assess model(s)

Evaluation
  Results = Models + Findings
  Evaluate results
  Review process
  Determine next steps

Deployment
  Plan deployment
  Plan monitoring and maintenance
  Final report
  Project review

Data Mining Approaches
  Unsupervised methods:
    Group patients by drugs and demographic information and try to find unusual patients
  Supervised methods:
    Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount

What Does Data Mining Do?
  Data mining uses existing data to:
  Predict
    Category membership
    Numeric value
    e.g. credit risk
  Group
    Cluster (group) things together based on their characteristics
    e.g. different types of TV viewers
  Associate
    Find events that occur together, or in a sequence
    e.g.
    Beer and diapers
  Find outliers
    Identify cases that don't follow expected behavior
    e.g. fraudulent behaviour

Benefits of Document Warehousing
  Richer operational business intelligence
  Knowing your customers
  Macroenvironmental monitoring
  Technology assessment

Conclusions
  Text mining is
    More than word counts
    Linguistically based
    Concept extraction
  Data mining is
    Advanced analytics applied to datasets
    A family of techniques
    Supervised or unsupervised

Conclusions
  Text and data mining
    Add dimensionality to the data
    Allow for automation of the text analysis event
    Create 360 degree view
  Applications
    Websites
    Surveys
    Email
    Call centre
    Documentation

Questions?

So How Do I Get Started?
  Document Warehousing and Text Mining
    Dan Sullivan, Wiley, 2001
  Survey of Text Mining: Clustering, Classification and Retrieval
    Michael W. Berry (ed.), Springer, 2003
  Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
    P. Jackson and I. Moulinier, John Benjamins, 2002

SPSS Canada
  Tim Daciuk
    Services Manager, Canada
    416-410-7921
    800-543-6607 ext. 5156
    tdaciuk@spss.com
  Hugh Rooney
    SPSS Sales Canada
    416-410-7921
    905-886-4322
    hrooney@spss.com
  www.spss.com