Data Mining Ryan Benton Center for Advanced Computer Studies University of Louisiana at Lafayette Lafayette, La., USA January 13, 2011 Important Note This presentation was obtained from Dr. Vijay Raghavan Obtained on January 12, 2011. Dr. Raghavan is a member of the Center for Advanced Computer Studies, University of Louisiana at Lafayette, Lafayette, La., USA They are being used with his permission. Some modifications have been made. 2 CONTENTS The Motivation Knowledge Discovery in Databases (KDD) Data Mining Related Fields Research Issues Tasks Association Mining Problem Classification Mining Problem Conclusions 3 THE MOTIVATION “We are drowning in information, but starving for knowledge.” John Naisbett 4 KNOWLEDGE DISCOVERY IN DATABASES- Definition A hot buzzword for a class of database applications that look for patterns or relationships in data that are: Hidden, Previously unknown and Potentially useful 5 KDD: Definition Extract (discover): interesting and previously unknown knowledge from very large real world databases. 6 KDD: Definition More formally: Valid, Novel, Potentially useful or Desired Ultimately understandable. 7 KDD- PROCESS Note, order of these steps And what is include in Each step may vary depending On who you talk to. And maybe even this 8 KDD vs. DATA MINING Synonyms (?) KDD More than just finding pattern Mining, dredging and fishing 9 KDD- Related Fields Data Warehousing On-Line Analytical Processing (OLAP) Database Marketing Exploratory Data Analysis (EDA) 10 Data Warehousing A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management’s decision making process. 11 OLAP and Data Warehousing Relational Data Marts WEB External Source Data Warehouse MDD Data Marts Query Reporting Tools OLAP Tools User GIS Tools GeoReference Data File Server Meta Data 12 Data Mining: Related Areas Database Management Systems Other Areas: 1. Neural Networks 2. Evolutionary Methods 3. Information Retrieval 4. Etc. The ‘Parent’ of Data Mining Visualization Sometimes considered an subarea of AI Machine Learning Data Mining Artificial Intelligence Statistics Sometimes considered Expert an subarea of AI Systems Pattern Recognition 13 Database versus Data Mining Query DB: Well Defined & SQL DM: Poorly Defined & Various Languages Data DB: Operational (and generally relational) DM: Not Operational. Output DB: Precise, subset of the database. DM: Varies. 14 Examples Database Find all people with last name Raghavan. Identify all customers who have bought more than 10,000 dollars Data Mining Find those who have poor credit Find all those who like the same cars Find all items that are often (frequently) purchased with milk. Predict the value of the housing market. 15 Statistics Simple descriptive models Traditionally: A model created from a sample of the data to the entire dataset. Exploratory Data Analysis: Data can actually drive the creation of the model Opposite of traditional statistical view. Presupposes a distribution 16 Machine Learning Machine Learning: area of AI that examines how to write programs that can learn. Types of models Classification Prediction (Regression) Clustering Types of Learning: Supervised Unsupervised Traditionally Small Datasets ‘Complete’ Data Changed since about mid-1990’s Examples on non-complete from earlier Field isn’t static (ideas flow between) 17 Data Mining: Research Issues Ultra large data Noisy data Null values Incomplete data Note, somewhat related to NULL values Redundant data Dynamic aspects of data 18 Data Mining: Tasks Association Classification Generally, implication that you are seeking Clustering Relationships and/or can manipulate the data interactively. Estimation Data Visualization Deviation Analysis etc 19 Data Mining Models and Tasks 20 ASSOCIATION MINING PROBLEM Deriving association rules from data: Given a set of items I = {i1,i2, . . . , in} and a set of transactions S = {s1, s2, . . ., sm}, each transaction si S, such that si I, an association rule is defined as X Y, where X I, Y I and X Y = , describes the existence of a relationship between the two itemsets X and Y. 21 Measurements Measures to define the strength of the relationship between two itemsets X and Y 22 Measure of Confidence P( X , Y ) Confidence(X Y ) P( X ) The percentage of transactions that contain Y among those transaction containing X. 23 Applications of Associations I = Products, S = Baskets I = Cited Articles, S = Technical Articles I = Incoming Links, S = Web pages I = Keywords, S = Documents I = Term papers, S = Sentences 24 Classification Mining Problem Pattern Recognition and Machine Learning communities Generally aimed at models of the data. Often includes both Categorization Prediction (Regression) Supervised. 25 Clustering Mining Problem Assumption: Data, naturally, falls into groups. Overlapping or Non-Overlapping What are the groups? And what data falls within each group. Unsupervised. 26 Measures Error Categorization Number Bad Assignments/Total Assignments Prediction Mean Squared Error In truth, a number of measures have been proposed. 27 Note about ‘Data’ Various types: Text Strings Numeric Sound Image Relations Etc. 28 CONCLUSIONS KDD has interesting problems It is an inter-disciplinary field No matter your expertise, you can find an interesting niche Many high-demand applications Customer Relationship Management Suggestive Sales Search Amazon Ebay Stock Prediction Well, would be nice. 29