IE 483/583 Knowledge Discovery and Data Mining
Dr. Siggi Olafsson
Fall 2003

Fall 2004 Data Mining 1

What is Data Mining? (… and should I be here?)

Dilbert Replies ...

Some Definitions
“Data mining is the extraction of implicit, previously unknown, and potentially useful information from data.”
“Data mining is the process of exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules.”

What can Data Mining Do?
• Supervised
  – Classification
  – Prediction
• Unsupervised
  – Association discovery
  – Clustering

Applications of Data Mining
• Manufacturing Process Improvement
• Sales and Marketing
• Mapping the Human Genome
• Diagnosing Breast Cancer
• Financial Crime Identification
• Portfolio Management

Technical Background
• Machine Learning
  – Data mining: business-oriented use of AI
• Statistics
  – Regression, sampling, DOE, etc.
• Decision Support
  – Data warehousing, data marts, OLAP, etc.
• Interdisciplinary tools put together to form the process of knowledge discovery in databases …

Historical Perspective
Decades shown on the timeline: < 40, 40s, 50s, 60s, 70s, 80s, 90s
Contributions by field, in timeline order:
• Stat: Bayes theorem, regression, etc.
• AI: Neural networks
• AI: Nearest neighbor, single link, perceptron
• Stat: Resampling, bias reduction, jackknife
• Stat: Linear models for classification, exploratory data analysis (EDA)
• IR: Similarity measures, clustering
• DB: Relational data model
• IR: Smart IR systems
• AI: Genetic algorithms
• Stat: EM algorithm, k-means clustering
• AI: Kohonen maps, decision trees
• DB: Association rule algorithms, web & search engines, data warehousing, OLAP

What Changed?
• Very large databases
• Increased computational power as enabler
• Business perspective

Knowledge Discovery in Databases
[Figure: pipeline with the stages Databases, Data warehouse, Prepared Data, Model/Structures, and Knowledge; the diagram labels are "Data Warehouse Systems Engineering" and "Knowledge Discovery and Data Mining"]

Course Information
• We assume data is ready for mining
• Thus, we focus on:
  – models and structures, and
  – algorithms
• More information on course homepage: http://www.public.iastate.edu/~olafsson/mining.html

Course Outline
• Introduction
• Exploratory Data Mining
• Supervised Learning
• Unsupervised Learning
• Optimization Methods in Learning
• Selected Advanced Topics
  – Mining the Web
  – Customer Relationship Management (CRM)
• Course Review

Questions?

Data Mining
• Discover patterns in data
  – automatic or semi-automatic process
  – meaningful or useful pattern
  – large amounts of data
• What does such a pattern look like?
  – Black box
  – Transparent box

Describing Structural Patterns
• Some ways of representing knowledge:
  – Decision tables
  – Decision trees
  – Classification rules
  – Association rules
  – Regression trees
  – Clusters

The Weather Problem
Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No

A Decision List
If outlook = sunny and humidity = high   then play = no
If outlook = rainy and windy = true      then play = no
If outlook = overcast                    then play = yes
If humidity = normal                     then play = yes
If none of the above                     then play = yes
• These are classification rules

Association Rules
• Many association rules can be inferred:
  if temperature = cool then humidity = normal
  if humidity = normal and windy = false then play = yes
  if outlook = sunny and play = no then humidity = high

Three Layers of the Process
Inputs → Algorithms → Outputs

Inputs
• Three forms
  – Concepts
    • concept description: what you want to learn
  – Instances
    • examples: what you learn from
  – Attributes
    • features of instances: variables you have values for

Concepts: Styles of Learning
• Classification (supervised) learning
• Association learning
• Clustering
• Numeric prediction

Instances: Learn from Examples
• Set of instances to be classified, or associated, or clustered
• Example of concept to be learned
• Data set: flat file (single relation)
  – denormalization
• Family tree example
  – concept: sister
  – example: family tree

Family Tree
Peter (M) = Peggy (F)
  children: Steven (M), Graham (M), Pam (F)
Grace (F) = Ray (M)
  children: Ian (M), Pippa (F), Brian (M)
Pam (F) = Ian (M)
  children: Anna (F), Nikki (F)

Denormalizing Relational Data
First person                      Second person                     Sister of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Steven  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki   Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna    Female  Pam      Ian      Yes
All others                                                          No

Denormalization Problems
• Computational and storage costs
• Trivial regularities
  [Diagram labels: customers, product, supplier; products, supplier; supplier, supplier address]
• Infinite relations

Content of Instances: Attributes
• Instance characterized by values of its (predefined) set of attributes
  – Numeric (“continuous”)
  – Nominal (categorical)
  – Ordinal (rank)
  – Interval
  – Ratio
[Slide annotation: "Focus in this class"]

Data Preparation
• Data …
  – assembly
    • set of instances/denormalizing relational data
  – integration
    • enterprise-wide database/data warehouse
  – cleaning
    • missing data
  – aggregation
    • good information

ARFF Format
• Used by the Java package Weka
• Independent, unordered instances
• No relationship between instances

Weather Data
  % ARFF file for the weather data with some numeric features
  %
  @relation weather

  @attribute outlook { sunny, overcast, rainy }
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy { true, false }
  @attribute play? { yes, no }

  @data
  %
  % 14 instances
  %
  sunny, 85, 85, false, no
  sunny, 80, 90, true, no
  overcast, 83, 86, false, yes
  rainy, 70, 96, false, yes
  rainy, 68, 80, false, yes
  rainy, 65, 70, true, no
  overcast, 64, 65, true, yes
  sunny, 72, 95, false, no
  sunny, 69, 70, false, yes
  rainy, 75, 80, false, yes
  sunny, 75, 70, true, yes
  overcast, 72, 90, true, yes
  overcast, 81, 75, false, yes
  rainy, 71, 91, true, no

Features
• % = comments
• @relation <name>
• @attribute <name> <type>
  – Attribute types: Nominal and numeric
• @data
  – List of instances
  – Missing values represented by ?
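The ARFF features listed above can be illustrated with a small reader. This is only a sketch under stated assumptions: `parse_arff` is a hypothetical helper (not part of Weka), and it handles just the subset named on the slide (`%` comments, `@relation`, `@attribute` with nominal or numeric types, `@data`, and `?` for missing values). Real ARFF files can contain more than this.

```python
def parse_arff(text):
    """Parse the small ARFF subset described on the slide (hypothetical helper)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and % comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row[name] = None          # missing value
                elif kind == "numeric":
                    row[name] = float(value)  # numeric attribute
                else:
                    row[name] = value         # nominal attribute
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # nominal: keep the list of allowed values
                kind = [v.strip() for v in spec.strip("{} ").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# First three weather instances from the slide.
SAMPLE = """% ARFF file for the weather data
@relation weather
@attribute outlook { sunny, overcast, rainy }
@attribute temperature numeric
@attribute humidity numeric
@attribute windy { true, false }
@attribute play? { yes, no }
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), rows[0])
```

Each instance comes back as an independent dictionary, which mirrors the slide's point that ARFF instances are unordered and unrelated to one another.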
Other Issues
• Missing data
• Inaccurate values
• Look at the data!!!

Recall the Three Layers of the Data Mining Process
Inputs (done) → Algorithms → Outputs (structural patterns) (next)

Describing Structural Patterns
• Ways of representing knowledge:
  – Decision tables
  – Decision trees
  – Classification rules
  – Association rules
  – Regression trees
  – Clusters

The Weather Problem
Outlook   Temp.  Humidity  Windy  Play
Sunny     Hot    High      FALSE  No
Sunny     Hot    High      TRUE   No
Overcast  Hot    High      FALSE  Yes
Rainy     Mild   High      FALSE  Yes
Rainy     Cool   Normal    FALSE  Yes
Rainy     Cool   Normal    TRUE   No
Overcast  Cool   Normal    TRUE   Yes
Sunny     Mild   High      FALSE  No
Sunny     Cool   Normal    FALSE  Yes
Rainy     Mild   Normal    FALSE  Yes
Sunny     Mild   Normal    TRUE   Yes
Overcast  Mild   High      TRUE   Yes
Overcast  Hot    Normal    FALSE  Yes
Rainy     Mild   High      TRUE   No

A Decision List
If outlook = sunny and humidity = high   then play = no
If outlook = rainy and windy = true      then play = no
If outlook = overcast                    then play = yes
If humidity = normal                     then play = yes
If none of the above                     then play = yes

A Decision Tree
Outlook
  Sunny → Humidity
    High → Play=No
  Overcast → Play=Yes
  Rainy → Windy
    TRUE → Play=No

Concepts: Styles of Learning
• Classification (supervised) learning
• Association learning
• Clustering
• Numeric prediction

Classification Rules
• Classification easily read off decision trees
• How?
• Other direction possible, but not as straightforward
  If a and b then x
  If c and d then x

Corresponding Decision Tree
[Figure: decision tree for the two rules above, with yes/no tests on a, b, c, and d and leaves labeled x; the tests on c and d appear more than once]

Replicated Subtree Problem
X=1?
  n → Y=1?
    n → b
    y → a
  y → Y=1?
    n → a
    y → b

If x=1 and y=0 then a
If x=0 and y=1 then a
If x=0 and y=0 then b
If x=1 and y=1 then b

Replicated Subtree Problem
If x=1 and y=1 then a
If z=1 and w=1 then a
Otherwise b
x, y, z, w take values 1, 2, 3

Rules with exceptions
If x and y then a EXCEPT if z then b
• Account for new instances
• Exceptions from exceptions, etc.

Association Rules
• Coverage (support): number of instances it predicts correctly
• Accuracy (confidence): coverage divided by number of instances it applies to
If temperature = cool then humidity = normal
• Coverage = 4
• Accuracy = 100%

Interpretation
If windy = false and play = no then outlook = sunny and humidity = high
If windy = false and play = no then outlook = sunny
If windy = false and play = no then humidity = high
If humidity = high and windy = false and play = no then outlook = sunny

The Shapes Problem
[Figure: shapes, where shaded = standing and unshaded = lying]

Instances
Width  Height  Sides  Class
2      4       4      standing
3      6       4      standing
4      3       4      lying
7      8       3      standing
7      6       3      lying
2      9       4      standing
9      1       4      lying
10     2       3      lying

Classification Rules
If width ≥ 3.5 and height < 7.0 then lying
If height ≥ 3.5 then standing
• Work well to classify these instances
• Problems?

Relational Rules
If width > height then lying
If height > width then standing
• Rules comparing attributes to constants are called propositional rules
• Structural patterns?
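One way to see the difference between the propositional and the relational rules is to run both on the shapes instances. The sketch below is an illustration, not the course's code. It assumes the comparison operators in the propositional rules are `>=` (they are not legible in the extracted slide), and it treats those rules as an ordered decision list. The function names and the extra test shape are hypothetical.

```python
# Shapes instances from the slide: (width, height, sides, class)
INSTANCES = [
    (2, 4, 4, "standing"), (3, 6, 4, "standing"), (4, 3, 4, "lying"),
    (7, 8, 3, "standing"), (7, 6, 3, "lying"), (2, 9, 4, "standing"),
    (9, 1, 4, "lying"), (10, 2, 3, "lying"),
]


def propositional(width, height):
    """Rules comparing attributes to constants, applied in order.

    Assumption: the operators are >= as written here.
    """
    if width >= 3.5 and height < 7.0:
        return "lying"
    if height >= 3.5:
        return "standing"
    return None  # no rule fires


def relational(width, height):
    """Rules comparing attributes to each other."""
    if width > height:
        return "lying"
    if height > width:
        return "standing"
    return None  # width == height is not covered


def correct(rule):
    """Count how many of the slide's instances a rule set classifies correctly."""
    return sum(rule(w, h) == cls for w, h, _sides, cls in INSTANCES)


print(correct(propositional), correct(relational))

# A hypothetical larger shape that lies on its side (wider than tall):
print(propositional(30, 20), relational(30, 20))
```

Both rule sets fit the eight given instances, but the constant thresholds are tied to the scale of this particular sample, while the relational rules still apply when the shape is scaled up. That is one possible answer to the slide's "Problems?" question.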
CPU Performance Example
       Cycle      Main memory (KB)   Cache   Channels        Performance
       time (ns)  min      max               min     max
       MYCT       MMIN     MMAX      CACH    CHMIN   CHMAX   PRP
1      125        256      6000      256     16      128     198
2      29         8000     32000     32      8       32      269
3      29         8000     32000     32      8       32      220
4      29         8000     32000     32      8       32      172
5      29         8000     16000     32      8       16      132
…
207    125        2000     8000      0       2       14      52
208    480        512      8000      32      0       0       67
209    480        1000     4000      0       0       0       45

Numerical Prediction: regression equation
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

Regression Tree
[Figure: regression tree with splits on CHMIN (at 7.5), CACH (at 8.5 and 28), and MMAX; one leaf value shown is 64.6]
- Accuracy?
- Large and possibly awkward

Model Trees
[Figure: model tree with splits on CHMIN (at 7.5), CACH (at 8.5), and MMAX (at 28000); leaves are linear models, including LM4, LM5, and LM6]
LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = …

Instance-Based Representation
• Store actual instances
• New instance: algorithm finds “most similar” stored instance
• Features
  – What is a similar instance?
  – Need to store (all?) instances
  – Really a black box method

Clusters
[Figure: two diagrams showing instances a to k grouped into clusters]

Next: Algorithms
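As a sketch of the instance-based representation described earlier, the following stores the shapes instances as they are and classifies a new instance by its most similar stored one. The slide leaves "What is a similar instance?" open. Euclidean distance on (width, height) is an assumption made here for illustration, and `nearest_neighbor` is a hypothetical name.

```python
import math

# Stored instances from the shapes slide: ((width, height), class)
STORED = [
    ((2, 4), "standing"), ((3, 6), "standing"), ((4, 3), "lying"),
    ((7, 8), "standing"), ((7, 6), "lying"), ((2, 9), "standing"),
    ((9, 1), "lying"), ((10, 2), "lying"),
]


def nearest_neighbor(query, stored=STORED):
    """Return the class of the stored instance closest to `query`.

    Assumption: similarity is measured by Euclidean distance.
    """
    _features, label = min(stored, key=lambda item: math.dist(item[0], query))
    return label


print(nearest_neighbor((8, 2)))
print(nearest_neighbor((2, 7)))
```

No explicit pattern is produced: the "model" is the stored data plus the distance function, which is why the slide calls this a black box method.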