CS/EngMt/CpEng 404 Data Mining & Knowledge Discovery
Dan St. Clair
Lect 1 – Intro. to Data Mining & Data Warehouses

Information Age Produces Large Amounts of Data
• Data collected on almost everything
• WWW rich data resource
• Data warehouses required to hold data

2002 by D. C. St. Clair CS 404 Data Mining & Knowledge Discovery

The problem: How do we turn information into useful knowledge?
Solution: Data mining & knowledge discovery

Data Mining & Knowledge Discovery
This class provides
• Tools & techniques for producing useful knowledge from information
• Experience in using these tools

Data Mining & Knowledge Discovery in CS 404
• We will study
  – Data warehouses
  – Classification & Association rule miners (C4.5)
  – Neural networks (BP, SOM)
  – Classical tools
    • Correlation
    • Regression
    • Clustering
• We will do several projects requiring mining knowledge from “real” data

CS 404 Class Information
Prerequisites: CS 347 (Artificial Intelligence) or CS 304 (Database Systems) and Stat 215
Texts:
• Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
• Quinlan, J., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1988.

CS 404 Class Information
Reference: (This or a similar Matlab reference is recommended.)
Hanselman, D. and Littlefield, B., Mastering Matlab 6: A Comprehensive Tutorial and Reference, Prentice Hall, 2001.
Software:
• C4.5 – provided to class w/o charge
• Matlab – can purchase from Mathworks or can login to UMR
• Microsoft Excel (provided on UMR CLC computers)

CS 404 Class Information (Cont'd)
Instructor: D.C. St. Clair, Ph.D.
325 Computer Science
Phone: (573) 341-6352
Fax: (573) 341-4501
e-mail: stclair@umr.edu
Class web page: www.umr.edu/~stclair or http://web.umr.edu/~stclair/class/classfiles/cs404_fs02/
Things you will find on the class web page:
• Syllabus
• Schedule
• Homework assignments
• Lecture notes

Who am I?
• Professor and Chair, UMR Computer Science Dept.
• Research area -- Data mining, machine intelligence, neural networks
  – diagnostics
  – intelligent graphics
  – data mining
  – pattern recognition & analysis
  – system monitoring & assessment
• “Applied” experience
  – Union Pacific Technologies Intelligent Systems Advisor
  – Visiting Principal Scientist, McDonnell Douglas Research Laboratories
  – NASA’s Johnson Space Center
  – Defense: Navy, Army, and Air Force
  – Co-founder & former Chief Scientist of intelligent software systems company

Even More CS 404 Class Information
Han, one of the authors of the data mining text, has a web page at www.cs.sfu.ca/~han/DM_Book.html which contains several interesting things, including:
1. A list of errata for the data mining book
2. A set of slides he uses in the data mining course he teaches. [I will be using some of these slides in my lectures.]
You may want to check these out.

Topics to Be Covered in Lecture 1
Intro. to Data Mining & Knowledge Discovery
• Intro. to CS 404 (We just finished this.)
• What is Data Mining & KD?
• Data sources
• Data mining tasks
• Data warehousing (Ch. 2)
• Multidimensional data models & schema

Data -- Information -- Knowledge
The set of values:
  12345  67890  1000.00  2846.92  SA  CK
has no meaning. It is data but it is NOT information.
Information: Information is the result of organizing data into meaningful quantities. The following relational table helps turn the data into information since it associates meaning with the data:

  Account Number   Balance   Type
  12345            1000.00   SA
  67890            2846.92   CK

A database is a “structured” collection of data stored and operated on within a management environment known as a Database Management System (DBMS) or database system. The DBMS helps to transform data into information.
Knowledge can be created from information.

What Is Data Mining? How Does It Differ From Existing Database Technologies?
Data Sources: Databases, data warehouses, Internet
Decision Support Systems: Tools for asking questions & doing analyses when you know what you want to ask and where you are going. (Ex. OLAP tools)
Data Mining: Process of discovering knowledge (meaningful new correlations, patterns, and trends) in data by sifting through large amounts of data (100M-10G) using pattern recognition as well as statistical and mathematical techniques.

Other Names Used in Conjunction With Data Mining
• Knowledge discovery (mining) in databases (KDD)
• Knowledge extraction
• Data/pattern analysis
• Data archeology
• Data dredging
• Information harvesting
• What is not data mining
  – (Deductive) query processing
  – Expert systems or small ML/statistical programs
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Data Mining Example

Knowledge Within A Relation

  Potential-Customer*
  Person        Age   Sex   Income      Customer
  Ann Smith     32    F     10,000      yes
  Joan Gray     53    F     1,000,000   yes
  Mary Blythe   27    F     20,000      no
  Jane Brown    55    F     20,000      yes
  Bob Smith     30    M     100,000     yes
  Jack Brown    50    M     200,000     yes

  Married-To
  Husband       Wife
  Bob Smith     Ann Smith
  Jack Brown    Jane Brown

  IF Income(Person) ≥ 100,000 THEN Potential-Customer(Person)
  IF Sex(Person) = F AND Age(Person) ≥ 32 THEN Potential-Customer(Person)

Knowledge From Multiple Relations

  IF Married-To(Person,Spouse) AND Income(Person) ≥ 100,000 THEN Potential-Customer(Spouse)
  IF Married-To(Person,Spouse) AND Potential-Customer(Person) THEN Potential-Customer(Spouse)

* Dzeroski, Saso, Inductive Logic Programming and Knowledge Discovery in Databases, Advances in Knowledge Discovery and Data Mining, Ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy, AAAI Press, 1996, pp. 117-152.

Simple Concept Learning -- Example
“Routine”, “well-understood” chemistry experiment performed numerous times.
• Expected result occurred about half the time
• Unexpected result occurred remainder of the time
Numerous repetitions of experiment produced similar results
Careful analysis determined:
• One result produced when setup was in sunlight
• Second result produced when setup was in shade
Careful investigation showed: Experiment sensitive to ultraviolet radiation
Result: Patented method for determining presence of ultraviolet radiation
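
The rules in the Data Mining Example can be checked mechanically against the Potential-Customer and Married-To relations. The following is a minimal sketch, not part of the original lecture. It assumes the comparisons in the rules are "greater than or equal to" (the reading that reproduces the Customer column of the table) and that Married-To may be read in both directions; the function and variable names are made up for illustration.

```python
# Rows transcribed from the Potential-Customer relation on the slide.
potential_customer = [
    {"person": "Ann Smith", "age": 32, "sex": "F", "income": 10_000, "customer": True},
    {"person": "Joan Gray", "age": 53, "sex": "F", "income": 1_000_000, "customer": True},
    {"person": "Mary Blythe", "age": 27, "sex": "F", "income": 20_000, "customer": False},
    {"person": "Jane Brown", "age": 55, "sex": "F", "income": 20_000, "customer": True},
    {"person": "Bob Smith", "age": 30, "sex": "M", "income": 100_000, "customer": True},
    {"person": "Jack Brown", "age": 50, "sex": "M", "income": 200_000, "customer": True},
]

# Married-To relation as (husband, wife) pairs.
married_to = [("Bob Smith", "Ann Smith"), ("Jack Brown", "Jane Brown")]


def single_relation_rules(p):
    """Apply the two rules that use only the Potential-Customer relation."""
    by_income = p["income"] >= 100_000                   # assumed operator: >=
    by_sex_and_age = p["sex"] == "F" and p["age"] >= 32  # assumed operator: >=
    return by_income or by_sex_and_age


def spouses_by_income(people, marriages):
    """Multi-relation rule: the spouse of a high-income person is a potential customer."""
    income = {p["person"]: p["income"] for p in people}
    predicted = set()
    for husband, wife in marriages:
        # Assumption: Married-To is symmetric, so try both directions.
        for person, spouse in ((husband, wife), (wife, husband)):
            if income[person] >= 100_000:
                predicted.add(spouse)
    return predicted


# People for whom the single-relation rules disagree with the Customer column.
mismatches = [p["person"] for p in potential_customer
              if single_relation_rules(p) != p["customer"]]
predicted_spouses = spouses_by_income(potential_customer, married_to)

print("mismatches:", mismatches)
print("spouses predicted by the income rule:", sorted(predicted_spouses))
```

Under these assumptions the two single-relation rules agree with the Customer column for all six people, and the multi-relation rule reaches the same conclusion for Ann Smith and Jane Brown using the income of their spouses rather than their own attributes.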

The Knowledge Discovery Process
[Figure: Data Sources → (Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Transformation) → Transformed Data → (Data Mining) → Patterns / Models → (Interpretation/Evaluation) → Knowledge]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Data Sources
• Relational Databases
• Data Warehouses
• WWW
• Audio
• Video
• Printed Materials
  :

Relational Databases
[Figure]

Multidimensional Data Cube
[Figure]
Source: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Data Mining Tasks
• Predictive
  – Perform inference on current data
• Descriptive (KDD)
  – Characterize general properties of data
Notes:
  – A measure of certainty or “belief” must be associated with each pattern
  – “Interesting” patterns must be identified

Kinds of Data Patterns to Be “Mined”
• Concept/class description
• Association analyses
• Classification & prediction
• Cluster analysis
• Outlier analysis

Concept/class Descriptions
Example 1: Produce a description summarizing characteristics of customers who purchase diapers
• Objective: produce a description of those in the target class
• Characterizes class/concept
Example 2: What properties distinguish diaper buyers from other store customers?
• Discriminates class/concept
• Leads to other questions
  – What else do they buy?
  – When do they purchase these items?

Association Analysis
Assoc. Anal. -- discovery of association relationships between attribute-value conditions.
Such relationships may be expressed in many ways. One common way is through association rules.

  X => Y, that is, A1 ^ ... ^ Am => B1 ^ ... ^ Bn

Association Rules Example

  age(X, “20..29”) ^ income(X, “20K..29K”) => buys(X, “CD changer”)   [support = 2%, confidence = 60%]

• support: % of data instances satisfying all three components of the rule
• confidence: % of the data instances satisfying the hypothesis (left-hand side) for which the conclusion also holds

Classification & Prediction
[Figure: scatter plot of Debt vs. Income with two classes of points, o and x]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Classification (nonlinear)
[Figure: Debt vs. Income scatter plot of o and x points, divided into Loan and No Loan regions]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.

Cluster Analysis
[Figure: Debt vs. Income scatter plot of unlabeled points, +]
Source: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining To Knowledge Discovery In Databases, AI Magazine, Fall 1996.
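
The support and confidence figures attached to the association rule example can be computed by simple counting. The sketch below is illustrative only: the ten customer records are invented so the arithmetic is easy to follow (they do not reproduce the 2% / 60% on the slide), and the helper name `support_confidence` is hypothetical.

```python
def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    = fraction of all records satisfying antecedent AND consequent
    confidence = fraction of records satisfying the antecedent that also
                 satisfy the consequent
    """
    n_total = len(records)
    n_antecedent = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical data: 10 customers, 5 of whom match the left-hand side.
customers = (
    [{"age": 25, "income": 24_000, "buys_cd_changer": True}] * 3
    + [{"age": 22, "income": 28_000, "buys_cd_changer": False}] * 2
    + [{"age": 45, "income": 60_000, "buys_cd_changer": True}] * 1
    + [{"age": 38, "income": 25_000, "buys_cd_changer": False}] * 4
)


def in_age_and_income_range(r):
    # age(X, "20..29") ^ income(X, "20K..29K")
    return 20 <= r["age"] <= 29 and 20_000 <= r["income"] <= 29_000


def buys_cd_changer(r):
    # buys(X, "CD changer")
    return r["buys_cd_changer"]


support, confidence = support_confidence(
    customers, in_age_and_income_range, buys_cd_changer
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

With this toy data, 3 of the 10 records satisfy the whole rule and 5 satisfy the left-hand side, so support is 3/10 and confidence is 3/5. Note that the customer who buys a CD changer but falls outside the age and income ranges affects neither number.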

Some Major Data Mining Issues
• Mining methodologies
• User interaction
• Performance (accuracy, robustness)
• Heterogeneous databases
• Interestingness

Chapter 2: Data Warehousing and OLAP Technology for Data Mining
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
• From data warehousing to data mining

What Is a Data Warehouse?
DWs provide architectures and tools to support the systematic
  – organization,
  – understanding, and
  – use
of data.
Note: DWs may consist of data from numerous sources including business, scientific, as well as engineering data.

Features of a Data Warehouse
• Subject-oriented -- organized around major subjects
• Integrated -- integrates multiple heterogeneous data sources
  – Relational databases
  – Flat files
  – On-line transaction records
• Consistency is enforced
• Time-variant -- data stored to provide historical data
• Nonvolatile
  – Physically separate from operational environment
  – Operations on data: initial loading & retrieval

OLTP vs. OLAP

                       OLTP                              OLAP
  users                clerk, IT professional            knowledge worker
  function             day to day operations             decision support
  DB design            application-oriented              subject-oriented
  data                 current, up-to-date,              historical, summarized,
                       detailed, flat relational,        multidimensional,
                       isolated                          integrated, consolidated
  usage                repetitive                        ad-hoc
  access               read/write,                       lots of scans
                       index/hash on prim. key
  unit of work         short, simple transaction         complex query
  # records accessed   tens                              millions
  # users              thousands                         hundreds
  DB size              100MB-GB                          100GB-TB
  metric               transaction throughput            query throughput, response

Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Multidimensional Data Models
[Figure 2.1: 3-D data cube of AllElectronics sales data]
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.
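
A data cube such as the one in Figure 2.1 can be thought of as sales totals indexed by a combination of dimension values. The sketch below is one way to illustrate that idea in Python and is not taken from the text: the dimension names, the five sales records, and the helper `cuboid` are all made up for illustration. Summing out a dimension corresponds to rolling the cube up to a coarser view.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions and sales records: (quarter, item, city, dollars_sold).
DIMENSIONS = ("quarter", "item", "city")
sales = [
    ("Q1", "TV", "Vancouver", 400.0),
    ("Q1", "TV", "Toronto", 300.0),
    ("Q1", "phone", "Vancouver", 100.0),
    ("Q2", "TV", "Vancouver", 500.0),
    ("Q2", "phone", "Toronto", 200.0),
]


def cuboid(records, keep):
    """Sum dollars_sold grouped by the dimensions named in `keep`.

    Dimensions not listed in `keep` are aggregated away (rolled up).
    """
    positions = [DIMENSIONS.index(name) for name in keep]
    cells = defaultdict(float)
    for record in records:
        key = tuple(record[i] for i in positions)
        cells[key] += record[-1]  # last field is the dollars_sold measure
    return dict(cells)


base = cuboid(sales, DIMENSIONS)   # full 3-D view: one cell per record here
by_item = cuboid(sales, ("item",)) # roll up quarter and city
apex = cuboid(sales, ())           # roll up everything: a single grand total

# Every subset of the dimensions gives one possible view of the same data.
all_views = [subset
             for k in range(len(DIMENSIONS) + 1)
             for subset in combinations(DIMENSIONS, k)]

print(by_item)
print(apex)
print(len(all_views), "views for", len(DIMENSIONS), "dimensions")
```

The design choice here is to keep the raw records and aggregate on demand. A warehouse would typically precompute some of these views, but the grouping logic is the same.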

4-D Data Cube of AllElectronics Sales Data
[Figure 2.2: 4-D data cube of AllElectronics sales data]
All figure references in this lecture are to the text: Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.3 A Lattice of Cuboids
  0-D (apex) cuboid:  all
  1-D cuboids:        time; item; location; supplier
  2-D cuboids:        time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
  3-D cuboids:        time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
  4-D (base) cuboid:  time, item, location, supplier

Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
  – Star schema: A fact table in the middle connected to a set of dimension tables
  – Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  – Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.4 Example of Star Schema
  Sales Fact Table
    keys:      time_key, item_key, branch_key, location_key
    measures:  units_sold, dollars_sold, avg_sales
  Dimension tables
    time:      time_key, day, day_of_the_week, month, quarter, year
    item:      item_key, item_name, brand, type, supplier_type
    branch:    branch_key, branch_name, branch_type
    location:  location_key, street, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.5 Example of Snowflake Schema
  Sales Fact Table
    keys:      time_key, item_key, branch_key, location_key
    measures:  units_sold, dollars_sold, avg_sales
  Dimension tables
    time:      time_key, day, day_of_the_week, month, quarter, year
    item:      item_key, item_name, brand, type, supplier_key
    supplier:  supplier_key, supplier_type
    branch:    branch_key, branch_name, branch_type
    location:  location_key, street, city_key
    city:      city_key, city, province_or_state, country
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

Fig. 2.6 Example of Fact Constellation
  Sales Fact Table
    keys:      time_key, item_key, branch_key, location_key
    measures:  units_sold, dollars_sold, avg_sales
  Shipping Fact Table
    keys:      time_key, item_key, shipper_key, from_location, to_location
    measures:  dollars_cost, units_shipped
  Dimension tables
    time:      time_key, day, day_of_the_week, month, quarter, year
    item:      item_key, item_name, brand, type, supplier_type
    branch:    branch_key, branch_name, branch_type
    location:  location_key, street, city, province_or_state, country
    shipper:   shipper_key, shipper_name, location_key, shipper_type
Slide is modified from slides provided by Han, J. & Kamber, M., Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000.

A Data Mining Query Language, DMQL: Language Primitives
• Cube Definition (Fact Table)
    define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
    define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
  – First time as “cube definition”
  – define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
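
To make the star schema idea concrete, the sketch below joins a small fact table to two of its dimension tables and totals the measures by quarter and item type. It is a hedged illustration rather than anything from the text or from a real warehouse product: all rows are invented, only the time and item dimensions are populated (branch and location keys are carried but not looked up), and the function name `sales_by_quarter_and_type` is hypothetical.

```python
from collections import defaultdict

# Dimension tables, keyed by their surrogate keys (hypothetical rows).
time_dim = {
    1: {"month": "Jan", "quarter": "Q1", "year": 2002},
    2: {"month": "Feb", "quarter": "Q1", "year": 2002},
    3: {"month": "Apr", "quarter": "Q2", "year": 2002},
}
item_dim = {
    10: {"item_name": "TV-100", "brand": "BrandA", "type": "TV"},
    11: {"item_name": "Phone-7", "brand": "BrandB", "type": "phone"},
}

# Fact table: foreign keys into the dimension tables plus numeric measures.
sales_fact = [
    {"time_key": 1, "item_key": 10, "branch_key": 1, "location_key": 1,
     "units_sold": 2, "dollars_sold": 800.0},
    {"time_key": 2, "item_key": 10, "branch_key": 1, "location_key": 1,
     "units_sold": 1, "dollars_sold": 400.0},
    {"time_key": 2, "item_key": 11, "branch_key": 2, "location_key": 2,
     "units_sold": 3, "dollars_sold": 300.0},
    {"time_key": 3, "item_key": 11, "branch_key": 2, "location_key": 2,
     "units_sold": 5, "dollars_sold": 500.0},
]


def sales_by_quarter_and_type(facts):
    """Join each fact row to its time and item rows, then total the measures."""
    totals = defaultdict(lambda: {"units_sold": 0, "dollars_sold": 0.0})
    for row in facts:
        quarter = time_dim[row["time_key"]]["quarter"]   # fact -> time dimension
        item_type = item_dim[row["item_key"]]["type"]    # fact -> item dimension
        cell = totals[(quarter, item_type)]
        cell["units_sold"] += row["units_sold"]
        cell["dollars_sold"] += row["dollars_sold"]
    return dict(totals)


summary = sales_by_quarter_and_type(sales_fact)
for key in sorted(summary):
    print(key, summary[key])
```

In a snowflake schema the same query would need one more lookup per normalized dimension (for example item to supplier), which is the trade-off between the two layouts.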

Defining a Star Schema in DMQL
    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

Program Completed
University of Missouri-Rolla
Copyright 2001 Curators of University of Missouri