Analytical and Visual Data Mining
Michael Welge, welge@ncsa.uiuc.edu
Automated Learning Group, NCSA
www.ncsa.uiuc.edu/STI/ALG
October 14, 1998
University of Illinois at Urbana-Champaign

Why Data Mining? -- Potential Applications
• Database analysis, decision support, and automation
  – Market and Sales Analysis
  – Fraud Detection
  – Manufacturing Process Analysis
  – Risk Analysis and Management
  – Experimental Results Analysis
  – Scientific Data Analysis
  – Text Document Analysis

Data Mining: Confluence of Multiple Disciplines
• Database Systems, Data Warehouses, and OLAP
• Machine Learning
• Statistics
• Mathematical Programming
• Visualization
• High Performance Computing

Data Mining: On What Kind of Data?
• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
  – Object-Relational
  – Spatial
  – Temporal
  – Text
  – Heterogeneous, Legacy, and Distributed
  – WWW

Why Do We Need Data Mining?
• Leverage the organization’s data assets
  – Only a small portion (typically 5%-10%) of the collected data is ever analyzed
  – Data that may never be analyzed continues to be collected, at great expense, out of fear that something that may prove important in the future will be missed
  – The growth rate of data precludes the traditional “manually intensive” approach

Why Do We Need Data Mining?
• As databases grow, supporting the decision-support process with traditional query languages becomes infeasible
  – Many queries of interest are difficult to state in a query language (the query formulation problem)
  – “find all cases of fraud”
  – “find all individuals likely to buy a FORD Expedition”
  – “find all documents that are similar to this customer’s problem”

Knowledge Discovery Process
• Data Mining: a step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
• Knowledge Discovery Process: the process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to specified measures and thresholds, using a database along with any necessary preprocessing or transformations.

Data Mining: A KDD Process

Knowledge Discovery - Application Domain
First and foremost, you must understand your data and your business. Suppose you wish to increase the response from a direct mail campaign. Do you want to build a model to:
– increase the response rate, or
– increase the value of the response?
Depending on your specific goal, the model you choose may differ.

Knowledge Discovery - Selecting Data
The task of selecting data begins with deciding what data is needed to solve the problem.
Issues:
– Database incompatibility
– Data may be in an obscure form
– Data is incomplete

Knowledge Discovery - Preparing the Data
Data may have to be loaded from legacy systems or external sources, then stored, cleaned, and validated.
Issues:
– Data may be in a format incompatible with its end use
– Data may have many missing, incomplete, or erroneous values
– Field descriptions may be unclear, confusing, or carry different meanings depending on the source
– Data may be stale

Knowledge Discovery - Transforming Data
Considerable planning and knowledge of your data should go into the transformation decision. Data transformations are at the heart of developing a sound model.

Knowledge Discovery - Types of Transformations
• Feature construction – applying a mathematical formula to existing data features
• Feature subset selection – removing columns that are redundant, not pertinent, or contain uninteresting predictors
• Aggregating data – grouping features together and finding sums, maximums, minimums, or averages
• Binning the data – breaking continuous ranges into discrete segments

Knowledge Discovery - Data Mining
The process of building models differs among:
– Supervised learning (classification, regression, time series problems)
– Unsupervised learning (database segmentation)
– Pattern identification and description (link analysis)
Once you have decided on the model type, you must choose a method for building the model (decision tree, neural net, K-nearest neighbor), then the algorithm (e.g., backpropagation).

Knowledge Discovery - Analyze and Deploy
Once the model is built, its implications must be understood. Graphical representations of relationships between independent and dependent variables may be necessary. Attention should also be focused on important aspects of the model, such as outliers or value. Model deployment may mean writing a new application, embedding the model into an existing system, or applying it to an existing data set. Model monitoring should be established.
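The binning transformation listed under Types of Transformations could look like the following Python sketch (a hypothetical illustration; the deck itself shows no code, and the function name and sample data are invented for this example):

```python
def bin_feature(values, n_bins):
    """Break a continuous range into n_bins equal-width discrete segments,
    returning the bin index for each value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        # Clamp the top edge so the maximum value falls in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins.append(idx)
    return bins

# Hypothetical customer ages binned into four discrete segments.
ages = [18, 22, 25, 31, 40, 47, 52, 66]
print(bin_feature(ages, 4))  # -> [0, 0, 0, 1, 1, 2, 2, 3]
```

Equal-width binning is only one choice; equal-frequency bins or domain-driven cut points are common alternatives, and the right choice depends on the planning and data knowledge the slide emphasizes.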
Required Effort for Each KDD Step
[Bar chart: Effort (%) by step — Business Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]

What Data Mining Will Not Do
• Automatically find answers to questions you do not ask
• Constantly monitor your database for new and interesting relationships
• Eliminate the need to understand your business and your data
• Remove the need for good data analysis skills

Data Mining Models and Methods
• Predictive Modeling – Classification, Value prediction
• Database Segmentation – Demographic clustering, Neural clustering
• Link Analysis – Associations discovery, Sequential pattern discovery, Similar time sequence discovery
• Deviation Detection – Visualization, Statistics

Deviation Detection
• Identify outliers in a dataset
• Typical techniques: probability distribution contrasts, supervised/unsupervised learning
• Hypothetical example: point-of-sale fraud detection

Fraud and Inappropriate Practice Prevention
Background: Through regular review, HR has developed a collaborative relationship with its Sales Associates (SAs). Semi-annual meetings allow review of each SA’s practices against those of similar SAs across the country.
Goal: The approach is aimed at modifying SAs’ behavior to promote better service rather than at investigating and prosecuting SAs, although both strategies are employed.

Fraud and Inappropriate Practice Prevention
Business Objective: The focus of this project was the recent and steady 12% annual rise in overrides. The overall business objective was to find a way to ensure that the overrides were appropriate without negatively affecting the service provided by the SAs.
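One simple deviation-detection contrast for overrides like those described above is a standard-score test: flag values more than k standard deviations from the mean. This is an illustrative sketch only; the deck does not specify the method actually used, and the override amounts are invented:

```python
def zscore_outliers(data, k=3.0):
    """Flag values more than k standard deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n  # population variance
    std = var ** 0.5
    return [x for x in data if abs(x - mean) > k * std]

# Hypothetical weekly override amounts with one suspicious spike.
overrides = [10, 12, 9, 11, 10, 13, 11, 250]
print(zscore_outliers(overrides, k=2.0))  # -> [250]
```

In practice a single extreme value inflates the standard deviation itself, so robust variants (median and MAD) or the distribution contrasts the slide mentions are preferred when outliers are common.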
Fraud and Inappropriate Practice Prevention
Approach:
• Identify potentially fraudulent overrides or overrides arising from inappropriate practices.
• Develop general profiles of the SAs’ practices in order to compare the practice behavior of individual SAs.

Fraud and Inappropriate Practice Prevention

Database Segmentation
• Regroup datasets into clusters that share common characteristics
• Typical technique: unsupervised learning (SOMs, K-Means)
• Hypothetical example: cluster all similar regimes (financial, free-form text)

Self Organizing Maps Example - Text Clustering
This data is considered to be confidential and proprietary to Caterpillar and may only be used with prior written consent from Caterpillar.

Predictive Modeling
• Past data predicts future response
• Typical technique: supervised learning (Artificial Neural Networks, Decision Trees, Naïve Bayes)
• Hypothetical example (classification): Who is most likely to respond to a direct mailing?
• Hypothetical example (prediction): How will the German Stock Price Index perform in the next 3, 5, or 7 days?

Predictive Modeling - Prior Probabilities

Predictive Modeling - Posterior Probabilities

Link Analysis
• Relationships between records/attributes in datasets
• Typical techniques: rule association, sequence discovery
• Hypothetical example (rule association): When people buy a hammer, they also buy nails 50% of the time
• Hypothetical example (sequence discovery): When people buy a hammer, they also buy nails within the next 3 months 18% of the time, and within the subsequent 3 months 12% of the time
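The K-Means technique named on the Database Segmentation slide can be sketched in a few lines of Python (Lloyd's algorithm on 2-D points; a toy illustration with invented data, not the deck's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Assign each 2-D point to its nearest centroid, then recompute
    centroids as cluster means; repeat for a fixed number of iterations."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster goes empty
                centroids[i] = (sum(x for x, _ in cl) / len(cl),
                                sum(y for _, y in cl) / len(cl))
    return centroids, clusters

# Two well-separated hypothetical groups of customers.
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
cents, cls = kmeans(pts, 2)
```

K-Means needs k chosen in advance and is sensitive to initialization; SOMs, the other technique the slide names, trade that for a fixed topological grid of prototypes.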
Link Analysis (Rule Association)
• Given a database, find all associations of the form: IF <LHS> THEN <RHS>
• Prevalence = frequency of the LHS and RHS occurring together
• Predictability = fraction of records containing the LHS that also contain the RHS

Rule Association - Basket Analysis

Association Rules - Basket Analysis
• <Dairy-Milk-Refrigerated> implies <Soft Drinks - Carbonated>
  – prevalence = 4.99%, predictability = 22.89%
• <Dry Dinners - Pasta> implies <Soup - Canned>
  – prevalence = 0.94%, predictability = 28.14%
• <Paper Towels - Jumbo> implies <Toilet Tissue>
  – prevalence = 2.11%, predictability = 38.22%
• <Dry Dinners - Pasta> implies <Cereal - Ready to Eat>
  – prevalence = 1.36%, predictability = 41.02%
• <Cheese - Processed Slices - American> implies <Cereal - Ready to Eat>
  – prevalence = 1.16%, predictability = 38.01%

Requirements For Successful Data Mining
• There is a sponsor for the application.
• The business case for the application is clearly understood and measurable, and the objectives are likely to be achievable given the resources being applied.
• The application has a high likelihood of having a significant impact on the business.
• Business domain knowledge is available.
• Good-quality, relevant data in sufficient quantities is available.

Requirements For Successful Data Mining
• The right people – business domain, data management, and data mining experts; people who have “been there and done that.”
For a first-time project, the following criteria could be added:
• The scope of the application is limited. Try to show results within 3-6 months.
• The data sources should be limited to those that are well known, relatively clean, and freely accessible.
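The prevalence and predictability measures defined on the rule-association slide correspond to what the association-rule literature calls support and confidence. A sketch of computing them over a toy transaction list (hypothetical data and function name, added for illustration):

```python
def rule_stats(transactions, lhs, rhs):
    """Prevalence (support) and predictability (confidence)
    of the rule IF lhs THEN rhs over a list of item sets."""
    n = len(transactions)
    both = sum(1 for t in transactions if lhs in t and rhs in t)
    lhs_count = sum(1 for t in transactions if lhs in t)
    prevalence = both / n                                  # LHS and RHS together
    predictability = both / lhs_count if lhs_count else 0  # RHS given LHS
    return prevalence, predictability

# Hypothetical market baskets echoing the hammer-and-nails example.
baskets = [
    {"hammer", "nails"},
    {"hammer", "tape"},
    {"nails", "glue"},
    {"hammer", "nails", "glue"},
]
prev, pred = rule_stats(baskets, "hammer", "nails")
# hammer and nails appear together in 2 of 4 baskets;
# 2 of the 3 hammer baskets also contain nails.
```

Real basket-analysis systems mine all rules above prevalence/predictability thresholds (e.g., with the Apriori algorithm) rather than scoring one rule at a time as done here.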
Rapid KDD Development Environment

Why Information Visualization?
• Gain insight into the contents and complexity of the database being analyzed
• Vast amounts of underutilized data
• Time-critical decisions are hampered
• Key information is difficult to find
• Results presentation
• Reduced perceptual, interpretive, and cognitive burden

Typical Data
• Abstract corporate data
• Mostly discrete, not continuous
• Often multi-dimensional
• Quantitative
• Text
• Historical or real-time

Typical Applications
• Historical Data Analysis
  – Marketing Data Mining Analysis
  – Portfolio Performance Attribution
  – Fraud/Surveillance Analysis
• Decision Support
  – Financial Risk Management
  – Operations Planning
  – Military Strategic Planning

Typical Applications (cont)
• Monitoring Real-Time Status
  – Industrial Process Control
  – Capital Markets Trading Management
  – Network Monitoring
• Management Reporting
  – Financial Reporting
  – Sales and Marketing Reporting

Marketing Data Mining Analysis
Risk Management

Capital Markets Trading Management

Network Monitoring

Industrial Process Control

Crisis Monitoring
Ground (Student) View and Aerial/Oracular (Instructor) View
Color code for compartment status: Normal, Ignited, Engulfed, Extinguished, Destroyed, Fire Alarm, Flooding

3D Financial Reporting

Statistics Visualizer

Scatter Visualizer

Splat Visualizer

Tree Visualizer

Map Visualizer

Decision Tree