ITCS 6162 KDD Class, Fall 2007
Transparencies made by Ho Tu Bao [JAIST]

Outline of the presentation
• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion
This presentation summarizes the content and organization of lectures in the module “Knowledge Discovery and Data Mining”.

Objectives
This course provides:
• fundamental techniques of knowledge discovery and data mining (KDD)
• issues in the practical use of KDD and its tools
• case studies of KDD applications

Prerequisite for the course
Nothing special, but the following are expected:
• experience with computer use
• basics of databases and statistics
• programming skills for advanced levels

Content of the course
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge

Brief introduction to lectures

Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD process
3. KDD applications
4. Data mining methods
5. Challenges for KDD

KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
• 10^6 to 10^12 bytes: one can never see the whole data set or hold it in computer memory.
• Data mining algorithms?
• What knowledge? How to represent and use it?

Data, Information, Knowledge
• Data: we often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.
• Information: data stripped of redundancy and reduced to the minimum necessary to characterize the data.
• Knowledge: integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”. Knowledge can be considered data at a high level of abstraction and generalization.

From Data to Knowledge
Medical data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, ,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ? ,negative, ?, n, n, ABSCESS, VIRUS
...
Annotations on the records: numerical attribute, categorical attribute, missing values, class labels.
Discovered rule:
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%]
[confidence, predictive accuracy]

Data Rich Knowledge Poor
How to acquire knowledge for knowledge-based systems remains the main difficult and crucial problem.
[Diagram: “?”, knowledge base, inference engine]
People have gathered and stored so much data because they think some valuable assets are implicitly coded within it. Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
• Tradition: via knowledge engineers
• Impractical manual data analysis
• New trend: via automatic programs

Benefits of Knowledge Discovery
[Figure: value versus volume, with the stages EDP, MIS, and DSS and the labels Rapid Response, Generate, and Disseminate]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems

The KDD process
“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
• non-trivial process: multiple process
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine

The Knowledge Discovery Process
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results into practical use
Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
KDD is inherently interactive and iterative.

The KDD Process
[Figure: detailed activities across steps 1 to 5 of the KDD process, listed in the order they appear on the slide]
• Data organized by function
• Create/select target database
• Data warehousing
• Select sampling technique and sample data
• Supply missing values
• Eliminate noisy data
• Normalize values
• Transform values
• Create derived attributes
• Find important attributes & value ranges
• Select DM task(s)
• Transform to different representation
• Select DM method(s)
• Extract knowledge
• Test knowledge
• Refine knowledge
• Query & report generation
• Aggregation & sequences
• Advanced methods

Main Contributing Areas of KDD
• Databases: store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing]
• Statistics: infer info from data (deduction & induction, mainly numeric data)
• Machine Learning: computer algorithms that improve automatically
through experience (mainly induction, symbolic data)

Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence databases
- Geosciences: Quakefinder
- etc.
Personal information

KDD: Opportunity and Challenges
• Competitive pressure
• Data rich, knowledge poor (the resource)
• Data mining technology mature
• Enabling technology (interactive MIS, OLAP, parallel computing, Web, etc.)

KDD: A New and Fast Growing Area
• KDD workshops: since 1989.
• International conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997. ECML’04/PKDD’04 (in Pisa, Italy).
• Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
• About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems.
• JAPAN: FGCS Project (logic programming and reasoning).
“Knowledge Discovery is the most desirable end-product of computing”. Wiederhold, Stanford Univ.

Primary Tasks of Data Mining
• Classification: finding the description of several predefined classes and classifying a data item into one of them.
• Regression: mapping a data item to a real-valued prediction variable.
• Deviation and change detection: discovering the most significant changes in the data.
• Clustering: identifying a finite set of categories or clusters to describe the data.
• Dependency modeling: finding a model which describes significant dependencies between variables.
• Summarization: finding a compact description for a subset of data.

Classification
“What factors determine cancerous cells?”
[Figure: examples (cancerous cell data) are fed to a data mining algorithm (classification algorithm), which produces general patterns]
Classification algorithms:
- Rule induction
- Decision tree
- Neural network

Classification: Rule Induction
“What factors determine whether a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)

Classification: Decision Trees
Color = dark
  #nuclei = 1
    #tails = 1: healthy
    #tails = 2: cancerous
  #nuclei = 2: cancerous
Color = light
  #nuclei = 1: healthy
  #nuclei = 2
    #tails = 1: healthy
    #tails = 2: cancerous

Classification: Neural Networks
“What factors determine whether a cell is cancerous?”
[Figure: a neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]
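
To make the rule induction and decision tree slides concrete, here is a minimal Python sketch of how such discovered patterns could be applied to a new cell. It is only an illustration under stated assumptions: the `Cell` class, the `RULES` table, and both function names are hypothetical and not part of the lecture; the certainties are the two values shown on the rule slide; and the tree follows one possible reading of the flattened tree figure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Cell:
    """Hypothetical record holding the three attributes used on the slides."""
    color: str   # "light" or "dark"
    tails: int   # number of tails
    nuclei: int  # number of nuclei


# Each rule: (conditions, predicted class, certainty taken from the rule slide).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def classify_by_rules(cell: Cell) -> Optional[Tuple[str, float]]:
    """Return (label, certainty) of the first matching rule, or None if no rule fires."""
    for conditions, label, certainty in RULES:
        if all(getattr(cell, name) == value for name, value in conditions.items()):
            return label, certainty
    return None


def classify_by_tree(cell: Cell) -> str:
    """Walk the tree: split on color, then number of nuclei, then number of tails.

    Assumption: this branch layout is one reading of the flattened slide figure.
    """
    if cell.color == "dark":
        if cell.nuclei == 1:
            return "healthy" if cell.tails == 1 else "cancerous"
        return "cancerous"
    # Remaining case on the slide: color is light.
    if cell.nuclei == 1:
        return "healthy"
    return "healthy" if cell.tails == 1 else "cancerous"


if __name__ == "__main__":
    sample = Cell(color="dark", tails=2, nuclei=2)
    print(classify_by_rules(sample))
    print(classify_by_tree(sample))
```

Unlike the tree, the rule set does not cover every combination of attribute values, so `classify_by_rules` returns `None` when no rule applies. Under the reading assumed here, the tree agrees with both rules on the cases they cover.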