Knowledge discovery & data mining
Tools, methods, and experiences
Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000

Contributors and acknowledgements
  The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
  The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-)
  In particular:
    Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial
    Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
    Daniel A. KEIM, University of Halle
    Daniel Silver, CogNova Technologies
  The EDBT2000 board who accepted our tutorial proposal
EDBT2000 tutorial - Intro, Konstanz, 27-28.3.2000

Tutorial goals
  Introduce you to major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology
  Provide a systematization of the many concepts around this area, along the following lines:
    the process
    the methods, applied to paradigmatic cases
    the support environment
    the research challenges
  Important issues that will not be covered in this tutorial:
    methods: time series, exception detection, neural nets
    systems: parallel implementations

Tutorial Outline
  1. Introduction and basic concepts
    1. Motivations, applications, the KDD process, the techniques
  2. Deeper into DM technology
    1. Decision Trees and Fraud Detection
    2. Association Rules and Market Basket Analysis
    3. Clustering and Customer Segmentation
  3. Trends in technology
    1. Knowledge Discovery Support Environment
    2. Tools, Languages and Systems
  4. 
Research challenges

Introduction - module outline
  Motivations
  Application Areas
  KDD Decisional Context
  KDD Process
  Architecture of a KDD system
  The KDD steps in short

Evolution of Database Technology: from data management to data analysis
  1960s: Data collection, database creation, IMS and network DBMS.
  1970s: Relational data model, relational DBMS implementation.
  1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
  1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Motivations
  “Necessity is the Mother of Invention”
  Data explosion problem:
    Automated data collection tools, mature database technology and the internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
    We are drowning in information, but starving for knowledge! (John Naisbitt)
  Data warehousing and data mining:
    On-line analytical processing
    Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

A rapidly emerging field
  Also referred to as: Data dredging, Data harvesting, Data archeology
  A multidisciplinary field:
    Database
    Statistics
    Artificial intelligence
      Machine learning, Expert systems and Knowledge Acquisition
    Visualization methods

Motivations for DM
  Abundance of business and industry data
  Competitive focus - Knowledge Management
  Inexpensive, powerful computing engines
  Strong theoretical/mathematical foundations
    machine learning & logic
    statistics
    database management systems

What is DM useful for?
  Increase the knowledge to base decisions upon.
  E.g., impact on marketing
  [Figure labels: Marketing, Database Marketing, Data Warehousing, KDD & Data Mining]

The Value Chain
  Decision
    • Promote product A in region Z
    • Mail ads to families of profile P
    • Cross-sell service B to clients C
  Knowledge
    • A quantity Y of product A is used in region Z
    • Customers of class Y use x% of C during period D
  Information
    • X lives in Z
    • S is Y years old
    • X and S moved
    • W has money in Z
  Data
    • Customer data
    • Store data
    • Demographical data
    • Geographical data

Application Areas and Opportunities
  Marketing: segmentation, customer targeting, ...
  Finance: investment support, portfolio management
  Banking & Insurance: credit and policy approval
  Security: fraud detection
  Science and medicine: hypothesis discovery, prediction, classification, diagnosis
  Manufacturing: process modeling, quality control, resource allocation
  Engineering: simulation and analysis, pattern recognition, signal processing
  Internet: smart search engines, web marketing

Classes of applications
  Market analysis
    target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
  Risk analysis
    forecasting, customer retention, improved underwriting, quality control, competitive analysis.
  Fraud detection
  Text (news group, email, documents) and Web analysis.

Market Analysis
  Where are the data sources for analysis?
    Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
  Target marketing
    Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  Determine customer purchasing patterns over time
    Conversion of a single to a joint bank account: marriage, etc.
  Cross-market analysis
    Associations/correlations between product sales
    Prediction based on the association information.

Market Analysis and Management (2)
  Customer profiling
    data mining can tell you what types of customers buy what products (clustering or classification).
  Identifying customer requirements
    identifying the best products for different customers
    use prediction to find what factors will attract new customers
  Provides summary information
    various multidimensional summary reports;
    statistical summary information (data central tendency and variation)

Risk Analysis
  Finance planning and asset evaluation:
    cash flow analysis and prediction
    contingent claim analysis to evaluate assets
    cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
  Resource planning:
    summarize and compare the resources and spending
  Competition:
    monitor competitors and market directions (CI: competitive intelligence).
    group customers into classes and class-based pricing procedures
    set pricing strategy in a highly competitive market

Fraud Detection
  Applications:
    widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
  Approach:
    use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
  Examples:
    auto insurance: detect a group of people who stage accidents to collect on insurance
    money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    medical insurance: detect professional patients, rings of doctors and rings of references

Fraud Detection (2)
  More examples:
    Detecting inappropriate medical treatment:
      The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
    Detecting telephone fraud:
      Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
      British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
    Retail:
      Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other applications
  Sports
    IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
  Astronomy
    JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
  Internet Web Surf-Aid
    IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
  Watch for the PRIVACY pitfall!

What is KDD? A process!
  The selection and processing of data for:
    the identification of novel, accurate, and useful patterns, and
    the modeling of real-world phenomena.
  Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

The KDD process
  [Figure: the KDD process. Data Sources -> Data Consolidation -> Consolidated Data (Warehouse) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g. p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]

The KDD Process: Core Problems & Approaches
  Problems:
    identification of relevant data
    representation of data
    search for valid pattern or model
  Approaches:
    top-down deduction by expert
    interactive visualization of data/models *
    bottom-up induction from data *
  [Slide labels: OLAP, Data Mining]

The steps of the KDD process
  Learning the application domain:
    relevant prior knowledge and goals of application
  Data consolidation: creating a target data set
  Selection and Preprocessing
    Data cleaning: (may take 60% of effort!)
    Data reduction and projection:
      find useful features, dimensionality/variable reduction, invariant representation.
  Choosing functions of data mining
    summarization, classification, regression, association, clustering.
  Choosing the mining algorithm(s)
  Data mining: search for patterns of interest
  Interpretation and evaluation: analysis of results.
    visualization, transformation, removing redundant patterns, …
  Use of discovered knowledge

The virtuous cycle
  [Figure (CogNova Technologies): the KDD process embedded in a cycle. Labels: Problem, Knowledge, Identify Problem or Opportunity, Strategy, Act on Knowledge, Measure effect of Action, Results]

Applications, operations, techniques
  [Figure]

Roles in the KDD process
  [Figure]

Data mining and business intelligence
  [Figure: layers with increasing potential to support business decisions, listed here from top to bottom]
    Making Decisions
    Data Presentation: Visualization Techniques
    Data Mining: Information Discovery
    Data Exploration: Statistical Analysis, Querying and Reporting
    Data Warehouses / Data Marts: OLAP, MDA
    Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
  Roles shown beside the layers: End User, Business Analyst, Data Analyst, DBA

Architecture of a KDD system
  [Figure: a Graphical User Interface over the chain Data Sources -> Data Consolidation -> Warehouse -> Selection and Preprocessing -> Data Mining -> Interpretation and Evaluation -> Knowledge]

A business intelligence environment
  [Figure]

The KDD process [figure repeated]

Data consolidation and preparation
  Garbage in, garbage out
  The quality of results relates directly to the quality of the data
  50%-70% of KDD process effort is spent on data consolidation and preparation
  Major justification for a corporate data warehouse

Data consolidation
  From data sources to consolidated data repository
  [Figure: sources (RDBMS, Legacy DBMS, Flat Files, External) -> Data Consolidation and Cleansing -> Warehouse (Object/Relational DBMS, Multidimensional DBMS, Deductive Database, Flat files)]

Data consolidation
  Determine preliminary list of attributes
  Consolidate data into working database
    Internal and External sources
  Eliminate or estimate missing values
  Remove outliers (obvious exceptions)
  Determine prior probabilities of categories and deal with volume bias

The KDD process [figure repeated]

Data selection and preprocessing
  Generate a set of examples
    choose sampling method
    consider sample complexity
    deal with volume bias issues
  Reduce attribute dimensionality
    remove redundant and/or correlating attributes
    combine attributes (sum, multiply, difference)
  Reduce attribute value ranges
    group symbolic discrete values
    quantize continuous numeric values
  Transform data
    de-correlate and normalize values
    map time-series data to static representation
  OLAP and visualization tools play a key role

The KDD process [figure repeated]

Data mining tasks and methods
  Automated Exploration/Discovery
    e.g. discovering new market segments
    clustering analysis
    [plot: x1 vs. x2]
  Prediction/Classification
    e.g. forecasting gross sales given current factors
    regression, neural networks, genetic algorithms, decision trees
    [plot: f(x) vs. x]
  Explanation/Description
    e.g. characterizing customers by demographics and purchase history
    decision trees, association rules
    if age > 35 and income < $35k then ...
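
The "Data selection and preprocessing" slide above lists quantizing continuous numeric values and normalizing values among the preparation steps. A minimal sketch of what these two steps might look like follows. It is illustrative only: the function names, the choice of min-max scaling and equal-width binning, and the sample incomes are assumptions, not techniques prescribed by the tutorial.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: every value maps to 0.0, avoiding division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Quantize continuous values into n_bins equal-width intervals (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical yearly incomes, used only to exercise the two helpers.
incomes = [12_000, 20_000, 35_000, 50_000, 92_000]
normalized = min_max_normalize(incomes)
binned = equal_width_bins(incomes, 4)
print(normalized)
print(binned)
```

Equal-width binning is only one possible quantization; equal-frequency bins or domain-driven thresholds (such as the $35k income cut in the rule on the slide) are equally plausible choices.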

Automated exploration and discovery
  Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
  Distance-based numerical clustering
    metric grouping of examples (K-NN)
    graphical visualization can be used
  Bayesian clustering
    search for the number of classes which results in the best fit of a probability distribution to the data
    AutoClass (NASA) is one of the best examples

Prediction and classification
  Learning a predictive model
  Classification of a new case/sample
  Many methods:
    Artificial neural networks
    Inductive decision tree and rule systems
    Genetic algorithms
    Nearest neighbor clustering algorithms
    Statistical (parametric, and non-parametric)

Generalization and regression
  The objective of learning is to achieve good generalization to new unseen cases.
  Generalization can be defined as a mathematical interpolation or regression over a set of training points
  Models can be validated with a previously unseen test set or using cross-validation methods
  [plot: f(x) vs. x]

Classification and prediction
  Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
  Use the obtained model to predict some unknown or missing attribute values based on other information.
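
The "Generalization and regression" slide above notes that models can be validated with cross-validation methods. The sketch below shows one way this could look for the simplest case, a least-squares line fit. The helper names `fit_line` and `k_fold_mse`, the interleaved fold assignment, and the noise-free sample data are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns the pair (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0   # all x equal: fall back to a flat line
    return a, mean_y - a * mean_x


def k_fold_mse(xs, ys, k):
    """Average mean-squared error over k held-out folds (requires k <= len(xs))."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Interleaved split: point i belongs to fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # Error is measured only on points the model has not seen.
        mse = sum((ys[i] - (a * xs[i] + b)) ** 2 for i in test_idx) / len(test_idx)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Hypothetical noise-free data on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
cv_error = k_fold_mse(xs, ys, k=3)
print(cv_error)
```

On real, noisy data the held-out error is normally larger than the training error, and the gap between the two gives a rough indication of how well the model generalizes.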

Summarizing: inductive modeling = learning
  Objective: develop a general model or hypothesis from specific examples
  Function approximation (curve fitting) [plot: f(x) vs. x]
  Classification (concept learning, pattern recognition) [plot: classes A and B in the (x1, x2) plane]

Explanation and description
  Learn a generalized hypothesis (model) from selected data
  Description/Interpretation of model provides new knowledge
  Methods:
    Inductive decision tree and rule systems
    Association rule systems
    Link Analysis
    …

Exception/deviation detection
  Generate a model of normal activity
  Deviation from model causes alert
  Methods:
    Artificial neural networks
    Inductive decision tree and rule systems
    Statistical methods
    Visualization tools

Outlier and exception data analysis
  Time-series analysis (trend and deviation):
    Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.
    Similarity-based pattern-directed analysis
    Full vs. partial periodicity analysis
  Other pattern-directed or statistical analysis

The KDD process [figure repeated]

Are all the discovered patterns interesting?
  A data mining system/query may generate thousands of patterns; not all of them are interesting.
  Interestingness measures: a pattern is interesting if it is
    easily understood by humans
    valid on new or test data with some degree of certainty
    potentially useful
    novel, or validates some hypothesis that a user seeks to confirm
  Objective vs. subjective interestingness measures
    Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
    Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.

Completeness vs. optimization
  Find all the interesting patterns: Completeness.
    Can a data mining system find all the interesting patterns?
  Search for only interesting patterns: Optimization.
    Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones.
    Generate only the interesting patterns - mining query optimization.

Interpretation and evaluation
  Evaluation
    Statistical validation and significance testing
    Qualitative review by experts in the field
    Pilot surveys to evaluate model accuracy
  Interpretation
    Inductive tree and rule models can be read directly
    Clustering results can be graphed and tabled
    Code can be automatically generated by some systems (IDTs, Regression models)

Interpretation and evaluation
  Visualization tools can be very helpful
    sensitivity analysis (I/O relationship)
    histograms of value distribution
    time-series plots and animation
    requires training and practice
  [plot: Response vs. Temp and Velocity]

Important dates of data mining
  1989 IJCAI Workshop on KDD
    Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
  1991-1994 Workshops on KDD
    Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
  1995-1998 AAAI Int. Conf. on KDD and DM (KDD’95-98)
    Journal of Data Mining and Knowledge Discovery (1997)
  1998 ACM SIGKDD
  1999 SIGKDD’99 Conf.

References - general
  P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
  M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans.
    Knowledge and Data Engineering, 8:866-883, 1996.
  U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
  J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000. To appear.
  T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
  G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
  G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
  M. Berry and G. Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, 1997.
  S. M. Weiss and N. Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1997.
  W. H. Inmon, J. D. Welch, and K. L. Glassey. Managing the Data Warehouse. Wiley, 1997.
  T. Mitchell. Machine Learning. McGraw-Hill, 1997.

Main Web resources
  http://www.kdnuggets.com
    KDD Newsletter and comprehensive website
  http://www.acm.org/sigkdd
    ACM SIGKDD
  http://www.research.microsoft.com/datamine/
    Journal of Data Mining and Knowledge Discovery

Tutorial Outline
  Introduction and basic concepts
    Motivations, applications, the KDD process, the techniques
  Deeper into DM technology
    Decision Trees and Fraud Detection
    Association Rules and Market Basket Analysis
    Clustering and Customer Segmentation
  Trends in technology
    Knowledge Discovery Support Environment
    Tools, Languages and Systems
  Research challenges