Data Mining Tools Overview & Tutorial Ahmed Sameh Prince Sultan University Department of Computer Science & Info Sys May 2010 (Some slides belong to IBM) 1 Introduction Outline Goal: Provide an overview of data mining. Define data mining Data mining vs. databases Basic data mining tasks Data mining development Data mining issues 2 Introduction Data is growing at a phenomenal rate Users expect more sophisticated information How? UNCOVER HIDDEN INFORMATION DATA MINING 3 Data Mining Definition Finding hidden information in a database Fit data to a model Similar terms Exploratory data analysis Data driven discovery Deductive learning 4 Data Mining Algorithm Objective: Fit Data to a Model Descriptive Predictive Preference – Technique to choose the best model Search – Technique to search the data “Query” 5 Database Processing vs. Data Mining Processing Query Well defined SQL Data – Operational data Output – Precise – Subset of database Query Poorly defined No precise query language Data – Not operational data Output – Fuzzy – Not a subset of database 6 Query Examples Database – Find all credit applicants with last name of Smith. – Identify customers who have purchased more than $10,000 in the last month. – Find all customers who have purchased milk Data Mining – Find all credit applicants who are poor credit risks. (classification) – Identify customers with similar buying habits. (Clustering) – Find all items which are frequently purchased with milk. (association rules) 7 Related Fields Machine Learning Visualization Data Mining and Knowledge Discovery Statistics Databases 8 Statistics, Machine Learning and Data Mining Statistics: more theory-based more focused on testing hypotheses Machine learning more heuristic focused on improving performance of a learning agent also looks at real-time learning and robotics – areas not part of data mining Data Mining and Knowledge Discovery integrates theory and heuristics focus on the entire process of knowledge discovery, including data cleaning, learning, and integration and visualization of results Distinctions are fuzzy 9 Definition A class of database application that analyze data in a database using tools which look for trends or anomalies. Data mining was invented by IBM. Purpose To look for hidden patterns or previously unknown relationships among the data in a group of data that can be used to predict future behavior. Ex: Data mining software can help retail companies find customers with common interests. Background Information Many of the techniques used by today's data mining tools have been around for many years, having originated in the artificial intelligence research of the 1980s and early 1990s. Data Mining tools are only now being applied to large-scale database systems. The Need for Data Mining The amount of raw data stored in corporate data warehouses is growing rapidly. There is too much data and complexity that might be relevant to a specific problem. Data mining promises to bridge the analytical gap by giving knowledgeworkers the tools to navigate this complex analytical space. The Need for Data Mining, cont’ The need for information has resulted in the proliferation of data warehouses that integrate information multiple sources to support decision making. Often include data from external sources, such as customer demographics and household information. Definition (Cont.) Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns. Of “laws”, Monsters, and Giants… Moore’s law: processing “capacity” doubles every 18 months : CPU, cache, memory It’s more aggressive cousin: Disk storage “capacity” doubles every 9 months Disk TB Shipped per Year 1E+7 What do the two “laws” combined produce? A rapidly growing gap between our ability to generate data, and our ability 1998 Disk Trend (Jim Porter) http://www.disktrend.com/pdf/portrpkg.pdf. ExaByte 1E+6 1E+5 disk TB growth: 112%/y Moore's Law: 58.7%/y 1E+4 1E+3 1988 1991 1994 1997 2000 What is Data Mining? Finding interesting structure in data Structure: refers to statistical patterns, predictive models, hidden relationships Examples of tasks addressed by Data Mining Predictive Modeling (classification, regression) Segmentation (Data Clustering ) Summarization Major Application Areas for Data Mining Solutions Advertising Bioinformatics Customer Relationship Management (CRM) Database Marketing Fraud Detection eCommerce Health Care Investment/Securities Manufacturing, Process Control Sports and Entertainment Telecommunications Web 19 Data Mining The non-trivial extraction of novel, implicit, and actionable knowledge from large datasets. Extremely large datasets Discovery of the non-obvious Useful knowledge that can improve processes Can not be done manually Technology to enable data exploration, data analysis, and data visualization of very large databases at a high level of abstraction, without a specific hypothesis in mind. Sophisticated data search capability that uses statistical algorithms to discover patterns and correlations in data. 20 Data Mining (cont.) 21 Data Mining (cont.) Data Mining is a step of Knowledge Discovery in Databases (KDD) Process Data Warehousing Data Selection Data Preprocessing Data Transformation Data Mining Interpretation/Evaluation Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms 22 Data Mining Evaluation 23 Data Mining is Not … Data warehousing SQL / Ad Hoc Queries / Reporting Software Agents Online Analytical Processing (OLAP) Data Visualization 24 Data Mining Motivation Changes in the Business Environment Customers becoming more demanding Markets are saturated Databases today are huge: More than 1,000,000 entities/records/rows From 10 to 10,000 fields/attributes/variables Gigabytes and terabytes Databases a growing at an unprecedented rate Decisions must be made rapidly Decisions must be made with maximum knowledge 25 Why Use Data Mining Today? Human analysis skills are inadequate: Volume and dimensionality of the data High data growth rate Availability of: Data Storage Computational power Off-the-shelf software Expertise An Abundance of Data Supermarket scanners, POS data Preferred customer cards Credit card transactions Direct mail response Call center records ATM machines Demographic data Sensor networks Cameras Web server logs Customer web site trails Evolution of Database Technology 1960s: IMS, network model 1970s: The relational data model, first relational DBMS implementations 1980s: Maturing RDBMS, application-specific DBMS, (spatial data, scientific data, image data, etc.), OODBMS 1990s: Mature, high-performance RDBMS technology, parallel DBMS, terabyte data warehouses, object-relational DBMS, middleware and web technology 2000s: High availability, zero-administration, seamless integration into business processes 2010: Sensor database systems, databases on embedded systems, P2P database systems, large-scale pub/sub systems, ??? Much Commercial Support Many data mining tools http://www.kdnuggets.com/software Database systems with data mining support Visualization tools Data mining process support Consultants Why Use Data Mining Today? Competitive pressure! “The secret of success is to know something that nobody else knows.” Aristotle Onassis Competition on service, not only on price (Banks, phone companies, hotel chains, rental car companies) Personalization, CRM The real-time enterprise “Systemic listening” Security, homeland defense The Knowledge Discovery Process Steps: 1. Identify business problem 2. Data mining 3. Action 4. Evaluation and measurement 5. Deployment and integration into businesses processes Data Mining Step in Detail 2.1 Data preprocessing Data selection: Identify target datasets and relevant fields Data cleaning Remove noise and outliers Data transformation Create common units Generate new fields 2.2 Data mining model construction 2.3 Model evaluation Preprocessing and Mining Knowledge Patterns Preprocessed Data Target Data Interpretation Model Construction Original Data Preprocessing Data Integration and Selection Data Mining Techniques Data Mining Techniques Descriptive Predictive Clustering Classification Association Decision Tree Sequential Analysis Rule Induction Neural Networks Nearest Neighbor Classification Regression 34 Data Mining Models and Tasks 35 Basic Data Mining Tasks Classification maps data into predefined groups or classes Supervised learning Pattern recognition Prediction Regression is used to map a data item to a real valued prediction variable. Clustering groups similar data together into clusters. Unsupervised learning Segmentation Partitioning 36 Basic Data Mining Tasks (cont’d) Summarization maps data into subsets with associated simple descriptions. Characterization Generalization Link Analysis uncovers relationships among data. Affinity Analysis Association Rules Sequential Analysis determines sequential patterns. 37 Ex: Time Series Analysis Example: Stock Market Predict future values Determine similar patterns over time Classify behavior 38 Data Mining vs. KDD Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data. Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process. 39 Data Mining Development •Similarity Measures •Relational Data Model •SQL •Association Rule Algorithms •Data Warehousing •Scalability Techniques •Hierarchical Clustering •IR Systems •Imprecise Queries •Textual Data •Web Search Engines •Bayes Theorem •Regression Analysis •EM Algorithm •K-Means Clustering •Time Series Analysis •Algorithm Design Techniques •Algorithm Analysis •Data Structures •Neural Networks •Decision Tree Algorithms 40 KDD Issues Human Interaction Overfitting Outliers Interpretation Visualization Large Datasets High Dimensionality 41 KDD Issues (cont’d) Multimedia Data Missing Data Irrelevant Data Noisy Data Changing Data Integration Application 42 Visualization Techniques Graphical Geometric Icon-based Pixel-based Hierarchical Hybrid 43 Data Mining Applications 44 Data Mining Applications: Retail Performing basket analysis Which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions. Sales forecasting Examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item? Database marketing Retailers can develop profiles of customers with certain behaviors, for example, those who purchase designer labels clothing or those who attend sales. This information can be used to focus cost–effective promotions. Merchandise planning and allocation When retailers add new stores, they can improve merchandise planning and allocation by examining 45 patterns in stores with similar demographic Data Mining Applications: Banking Card marketing By identifying customer segments, card issuers and acquirers can improve profitability with more effective acquisition and retention programs, targeted product development, and customized pricing. Cardholder pricing and profitability Card issuers can take advantage of data mining technology to price their products so as to maximize profit and minimize loss of customers. Includes riskbased pricing. Fraud detection Fraud is enormously costly. By analyzing past transactions that were later determined to be fraudulent, banks can identify patterns. Predictive life-cycle management DM helps banks predict each customer’s lifetime value and to service each segment appropriately (for example, offering special deals and discounts). 46 Data Mining Applications: Telecommunication Call detail record analysis Telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions. Customer loyalty Some customers repeatedly switch providers, or “churn”, to take advantage of attractive incentives by competing companies. The companies can use DM to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on customers who will produce the most profit. 47 Data Mining Applications: Other Applications Customer segmentation All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis. Manufacturing Through choice boards, manufacturers are beginning to customize products for customers; therefore they must be able to predict which features should be bundled to meet customer demand. Warranties Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims. Frequent flier incentives Airlines can identify groups of customers that can be 48 given incentives to fly more. A producer wants to know…. Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? What is the most effective distribution channel? What product prom-otions have the biggest impact on revenue? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? 49 Data, Data everywhere yet ... I can’t find the data I need data is scattered over the network many versions, subtle differences I can’t get the data I need need an expert to get the data I can’t understand the data I found available data poorly documented I can’t use the data I found results are unexpected data needs to be transformed from one form to other 50 What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin] 51 What are the users saying... Data should be integrated across the enterprise Summary data has a real value to the organization Historical data holds the key to understanding data over time What-if capabilities are required 52 What is Data Warehousing? Information Data A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research, April 1996] 53 Very Large Data Bases Terabytes -- 10^12 bytes:Walmart -- 24 Terabytes Petabytes -- 10^15 bytes:Geographic Information Systems Exabytes -- 10^18 bytes: National Medical Records Zettabytes -- 10^21 bytes: Zottabytes -- 10^24 bytes: Weather images Intelligence Agency Videos 54 Data Warehousing -It is a process Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible A decision support database maintained separately from the organization’s operational database 55 Data Warehouse A data warehouse is a subject-oriented integrated time-varying non-volatile collection of data that is used primarily in organizational decision making. -- Bill Inmon, Building the Data Warehouse 1996 56 Data Warehousing Concepts Decision support is key for companies wanting to turn their organizational data into an information asset Traditional database is transaction-oriented while data warehouse is data-retrieval optimized for decision-support Data Warehouse "A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process" OLAP (on-line analytical processing), Decision Support Systems (DSS), Executive Information Systems (EIS), and data mining applications 57 What does data warehouse do? integrate diverse information from various systems which enable users to quickly produce powerful ad-hoc queries and perform complex analysis create an infrastructure for reusing the data in numerous ways create an open systems environment to make useful information easily accessible to authorized users help managers make informed decisions 58 Benefits of Data Warehousing Potential high returns on investment Competitive advantage Increased productivity of corporate decision-makers 59 Comparison of OLTP and Data Warehousing OLTP systems systems Holds current data Stores detailed data Data is dynamic Repetitive processing heuristic High level of transaction throughput throughput Predictable pattern of usage Transaction driven Application oriented Supports day-to-day decisions Serves large number of clerical / operational users Data warehousing Holds historic data Stores detailed, lightly, and summarized data Data is largely static Ad hoc, unstructured, and processing Medium to low transaction Unpredictable pattern of usage Analysis driven Subject oriented Supports strategic decisions Serves relatively lower number of managerial users 60 Data Warehouse Architecture Operational Data Load Manager Warehouse Manager Query Manager Detailed Data Lightly and Highly Summarized Data Archive / Backup Data Meta-Data End-user Access Tools 61 End-user Access Tools Reporting and query tools Application development tools Executive Information System (EIS) tools Online Analytical Processing (OLAP) tools Data mining tools 62 Data Warehousing Tools and Technologies Extraction, Cleansing, and Transformation Tools Data Warehouse DBMS Load performance Load processing Data quality management Query performance Terabyte scalability Networked data warehouse Warehouse administration Integrated dimensional tools Advanced query functionality 63 Data Marts A subset of data warehouse that supports the requirements of a particular department or business function 64 Online Analytical Processing (OLAP) OLAP The dynamic synthesis, analysis, and consolidation of large volume of multity i dimensional data C Multi-dimensional OLAP Product type Cubes of data Time 65 Problems of Data Warehousing Underestimation of resources for data loading Hidden problem with source systems Required data not captured Increased end-user demands Data homogenization High demand for resources Data ownership High maintenance Long duration projects Complexity of integration 66 Codd's Rules for OLAP Multi-dimensional conceptual view Transparency Accessibility Consistent reporting performance Client-server architecture Generic dimensionality Dynamic sparse matrix handling Multi-user support Unrestricted cross-dimensional operations Intuitive data manipulation Flexible reporting Unlimited dimensions and aggregation levels 67 OLAP Tools Multi-dimensional OLAP (MOLAP) Multi-dimensional DBMS (MDDBMS) Relational OLAP (ROLAP) Creation of multiple multi-dimensional views of the two-dimensional relations Managed Query Environment (MQE) Deliver selected data directly from the DBMS to the desktop in the form of a data cube, where it is stored, analyzed, and manipulated locally 68 Data Mining Definition The process of extracting valid, previously unknown, comprehensible, and actionable information from large database and using it to make crucial business decisions Knowledge discovery Association rules Sequential patterns Classification trees Goals Prediction Identification Classification Optimization 69 Data Mining Techniques Predictive Modeling Supervised training with two phases Training phase : building a model using large sample of historical data called the training set Testing phase : trying the model on new data Database Segmentation Link Analysis Deviation Detection 70 What are Data Mining Tasks? Classification Regression Clustering Summarization Dependency modeling Change and Deviation Detection 71 What are Data Mining Discoveries? New Purchase Trends Plan Investment Strategies Detect Unauthorized Expenditure Fraudulent Activities Crime Trends Smugglers-border crossing 72 Data Warehouse Architecture Relational Databases Optimized Loader ERP Systems Extraction Cleansing Data Warehouse Engine Purchased Data Legacy Data Analyze Query Metadata Repository 73 Data Warehouse for Decision Support & OLAP Putting Information technology to help the knowledge worker make faster and better decisions Which of my customers are most likely to go to the competition? What product promotions have the biggest impact on revenue? How did the share price of software companies correlate with profits over last 10 years? 74 Decision Support Used to manage and control business Data is historical or point-in-time Optimized for inquiry rather than update Use of the system is loosely defined and can be ad-hoc Used by managers and end-users to understand the business and make judgements 75 Data Mining works with Warehouse Data Data Warehousing provides the Enterprise with a memory Data Mining provides the Enterprise with intelligence 76 We want to know ... Given a database of 100,000 names, which persons are the least likely to default on their credit cards? Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer? If I raise the price of my product by Rs. 2, what is the effect on my ROI? If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result? If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues? Which of my customers are likely to be the most loyal? Data Mining helps extract such information 77 Application Areas Industry Finance Insurance Telecommunication Transport Consumer goods Data Service providers Utilities Application Credit Card Analysis Claims, Fraud Analysis Call record analysis Logistics management promotion analysis Value added data Power usage analysis 78 Data Mining in Use The US Government uses Data Mining to track fraud A Supermarket becomes an information broker Basketball teams use it to track game strategy Cross Selling Warranty Claims Routing Holding on to Good Customers Weeding out Bad Customers 79 What makes data mining possible? Advances in the following areas are making data mining deployable: data warehousing better and more data (i.e., operational, behavioral, and demographic) the emergence of easily deployed data mining tools and the advent of new data mining techniques. • -- Gartner Group 80 Why Separate Data Warehouse? Performance Op dbs designed & tuned for known txs & workloads. Complex OLAP queries would degrade perf. for op txs. Special data organization, access & implementation methods needed for multidimensional views & queries. Function Missing data: Decision support requires historical data, which op dbs do not typically maintain. Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources. Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be 81 reconciled. What are Operational Systems? They are OLTP systems Run mission critical applications Need to work with stringent performance requirements for routine tasks Used to run a business! 82 RDBMS used for OLTP Database Systems have been used traditionally for OLTP clerical data processing tasks detailed, up to date data structured repetitive tasks read/update a few records isolation, recovery and integrity are critical 83 Operational Systems Run the business in real time Based on up-to-the-second data Optimized to handle large numbers of simple read/write transactions Optimized for fast response to predefined transactions Used by people who deal with customers, products -- clerks, salespeople etc. They are increasingly used by customers 84 Examples of Operational Data Data Industry Usage Technology Customer File All Legacy application, flat Small-medium files, main frames Account Balance Point-ofSale data Call Record Track Customer Details Finance Control account activities Retail Generate bills, manage stock Telecomm- Billing unications Production ManufactRecord uring Control Production Volumes Legacy applications, Large hierarchical databases, mainframe ERP, Client/Server, Very Large relational databases Legacy application, Very Large hierarchical database, mainframe ERP, Medium relational databases, AS/400 85 Application-Orientation vs. Subject-Orientation Application-Orientation Subject-Orientation Operational Database Loans Credit Card Data Warehouse Customer Vendor Trust Savings Product Activity 86 OLTP vs. Data Warehouse OLTP systems are tuned for known transactions and workloads while workload is not known a priori in a data warehouse Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries) e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December 87 OLTP vs Data Warehouse OLTP Application Oriented Used to run business Detailed data Current up to date Isolated Data Repetitive access Clerical User Warehouse (DSS) Subject Oriented Used to analyze business Summarized and refined Snapshot data Integrated Data Ad-hoc access Knowledge User (Manager) 88 OLTP vs Data Warehouse OLTP Performance Sensitive Few Records accessed at a time (tens) Read/Update Access No data redundancy Database Size 100MB -100 GB Data Warehouse Performance relaxed Large volumes accessed at a time(millions) Mostly Read (Batch Update) Redundancy present Database Size 100 GB - few terabytes 89 OLTP vs Data Warehouse OLTP Transaction throughput is the performance metric Thousands of users Managed in entirety Data Warehouse Query throughput is the performance metric Hundreds of users Managed by subsets 90 To summarize ... OLTP Systems are used to “run” a business The Data Warehouse helps to “optimize” the business 91 Why Now? Data is being produced ERP provides clean data The computing power is available The computing power is affordable The competitive pressures are strong Commercial products are available 92 Myths surrounding OLAP Servers and Data Marts Data marts and OLAP servers are departmental solutions supporting a handful of users Million dollar massively parallel hardware is needed to deliver fast time for complex queries OLAP servers require massive and unwieldy indices Complex OLAP queries clog the network with data Data warehouses must be at least 100 GB to be effective – Source -- Arbor Software Home Page 93 II. On-Line Analytical Processing (OLAP) Making Decision Support Possible Typical OLAP Queries Write a multi-table join to compare sales for each product line YTD this year vs. last year. Repeat the above process to find the top 5 product contributors to margin. Repeat the above process to find the sales of a product line to new vs. existing customers. Repeat the above process to find the customers that have had negative sales growth. 95 What Is OLAP? Online Analytical Processing - coined by EF Codd in 1994 paper contracted by Arbor Software* Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Information System OLAP = Multidimensional Database MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express) ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent) * Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html 96 The OLAP Market Rapid growth in the enterprise market 1995: $700 Million 1997: $2.1 Billion Significant consolidation activity among major DBMS vendors 10/94: Sybase acquires ExpressWay 7/95: Oracle acquires Express 11/95: Informix acquires Metacube 1/97: Arbor partners up with IBM 10/96: Microsoft acquires Panorama Result: OLAP shifted from small vertical niche to mainstream DBMS category 97 Strengths of OLAP It is a powerful visualization paradigm It provides fast, interactive response times It is good for analyzing time series It can be useful to find some clusters and outliers Many vendors offer OLAP tools 98 OLAP Is FASMI Fast Analysis Shared Multidimensional Information Nigel Pendse, Richard Creath - The OLAP Report 99 Multi-dimensional Data “Hey…I sold $100M worth of goods” Dimensions: Product, Region, Time Hierarchical summarization paths Product W S N Juice Cola Milk Cream Toothpaste Soap 1 2 34 5 6 7 Product Industry Region Country Time Year Category Region Quarter Product City Month Week Month Office Day100 A Visual Operation: Pivot (Rotate) Juice 10 Cola 47 Milk 30 Cream 12 Product 3/1 3/2 3/3 3/4 Date 101 “Slicing and Dicing” The Telecomm Slice Product Household Telecomm Video Audio Europe Far East India Retail Direct Special Sales Channel 102 Roll-up and Drill Down Higher Level of Aggregation Sales Channel Region Country State Location Address Sales Representative Low-level Details 103 Results of Data Mining Include: Forecasting what may happen in the future Classifying people or things into groups by recognizing patterns Clustering people or things into groups based on their attributes Associating what events are likely to occur together Sequencing what events are likely to lead to later events Data mining is not Brute-force crunching of bulk data “Blind” application of algorithms Going to find relationships where none exist Presenting data in different ways A database intensive task A difficult to understand technology requiring an advanced degree in computer science Data Mining versus OLAP OLAP - On-line Analytical Processing Provides you with a very good view of what is happening, but can not predict what will happen in the future or why it is happening Data Mining Versus Statistical Analysis Data Analysis Data Mining for statistical Originally developed to act Tests correctness of models as expert systems to solve Are statistical problems assumptions of models Less interested in the correct? mechanics of the Eg Is the R-Square technique good? If it makes sense then Hypothesis testing let’s use it Is the relationship Does not require significant? assumptions to be made about data Use a t-test to validate significance Can find patterns in very Tends to rely on sampling large amounts of data Techniques are not Requires understanding optimised for large of data and business amounts of data problem Requires strong statistical skills Examples of What People are Doing with Data Mining: Fraud/Non-Compliance Anomaly detection Recruiting/Attracting customers Isolate the factors that Maximizing lead to fraud, waste and profitability (cross selling, identifying abuse profitable customers) Target auditing and Service Delivery and investigative efforts Customer Retention more effectively Credit/Risk Scoring Intrusion detection Parts failure prediction Build profiles of customers likely to use which services Web Mining What data mining has done for... The US Internal Revenue Service needed to improve customer service and... Scheduled its workforce to provide faster, more accurate answers to questions. What data mining has done for... The US Drug Enforcement Agency needed to be more effective in their drug “busts” and analyzed suspects’ cell phone usage to focus investigations. What data mining has done for... HSBC need to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and... Reduced direct mail costs by 30% while garnering 95% of the campaign’s revenue. Suggestion:Predicting Washington C-Span has lunched a digital archieve of 500,000 hours of audio debates. Text Mining or Audio Mining of these talks to reveal cwetrain questions such as…. Example Application: Sports IBM Advanced Scout analyzes NBA game statistics Shots blocked Assists Fouls Google: “IBM Advanced Scout” Advanced Scout Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots." Pattern is interesting: The average shooting percentage for the Charlotte Hornets during that game was 54%. Data Mining: Types of Data Relational data and transactional data Spatial and temporal data, spatiotemporal observations Time-series data Text Images, video Mixtures of data Sequence data Features from processing other data sources Data Mining Techniques Supervised learning Classification and regression Unsupervised learning Clustering Dependency modeling Associations, summarization, causality Outlier and deviation detection Trend analysis and change detection Different Types of Classifiers Linear discriminant analysis (LDA) Quadratic discriminant analysis (QDA) Density estimation methods Nearest neighbor methods Logistic regression Neural networks Fuzzy set theory Decision Trees Test Sample Estimate Divide D into D1 and D2 Use D1 to construct the classifier d Then use resubstitution estimate R(d,D2) to calculate the estimated misclassification error of d Unbiased and efficient, but removes D2 from training dataset D V-fold Cross Validation Procedure: Construct classifier d from D Partition D into V datasets D1, …, DV Construct classifier di using D \ Di Calculate the estimated misclassification error R(di,Di) of di using test sample Di Final misclassification estimate: Weighted combination of individual misclassification errors: R(d,D) = 1/V Σ R(di,Di) Cross-Validation: Example d d1 d2 d3 Cross-Validation Misclassification estimate obtained through cross-validation is usually nearly unbiased Costly computation (we need to compute d, and d1, …, dV); computation of di is nearly as expensive as computation of d Preferred method to estimate quality of learning algorithms in the machine learning literature Decision Tree Construction Three Split algorithmic components: selection (CART, C4.5, QUEST, CHAID, CRUISE, …) Pruning (direct stopping rule, test dataset pruning, cost-complexity pruning, statistical tests, bootstrapping) Data access (CLOUDS, SLIQ, SPRINT, RainForest, BOAT, UnPivot operator) Goodness of a Split Consider node t with impurity phi(t) The reduction in impurity through splitting predicate s (t splits into children nodes tL with impurity phi(tL) and tR with impurity phi(tR)) is: Δphi(s,t) = phi(t) – pL phi(tL) – pR phi(tR) Pruning Methods Test dataset pruning Direct stopping rule Cost-complexity pruning MDL pruning Pruning by randomization testing Stopping Policies A stopping policy indicates when further growth of the tree at a node t is counterproductive. All records are of the same class The attribute values of all records are identical All records have missing values At most one class has a number of records larger than a user-specified number All records go to the same child node if t is split (only possible with some split Test Dataset Pruning Use an independent test sample D’ to estimate the misclassification cost using the resubstitution estimate R(T,D’) at each node Select the subtree T’ of T with the smallest expected cost Missing Values What is the problem? During computation of the splitting predicate, we can selectively ignore records with missing values (note that this has some problems) But if a record r misses the value of the variable in the splitting attribute, r can not participate further in tree construction Algorithms for missing values address this problem. Mean and Mode Imputation Assume record r has missing value r.X, and splitting variable is X. Simplest algorithm: If X is numerical (categorical), impute the overall mean (mode) Improved algorithm: If X is numerical (categorical), impute the mean(X|t.C) (the mode(X|t.C)) Decision Trees: Summary Many application of decision trees There are many algorithms available for: Split selection Pruning Handling Missing Values Data Access Decision tree construction still active research area (after 20+ years!) Challenges: Performance, scalability, evolving datasets, new applications Supervised vs. Unsupervised Learning Supervised y=F(x): true function D: labeled training set D: {xi,F(xi)} Learn: G(x): model trained to predict labels D Goal: E[(F(x)-G(x))2] ≈ 0 Well defined criteria: Accuracy, RMSE, ... Unsupervised Generator: true model D: unlabeled data sample D: {xi} Learn ?????????? Goal: ?????????? Well defined criteria: ?????????? Clustering: Unsupervised Learning Given: Data Set D (training set) Similarity/distance metric/information Find: Partitioning of data Groups of similar/close items Similarity? Groups of similar customers Similar demographics Similar buying behavior Similar health Similar products Similar cost Similar function Similar store … Similarity usually is domain/problem specific Clustering: Informal Problem Definition Input: A data set of N records each given as a ddimensional data feature vector. Output: Determine a natural, useful “partitioning” of the data set into a number of (k) clusters and noise such that we have: High similarity of records within each cluster (intra-cluster similarity) Low similarity of records between clusters (inter-cluster similarity) Types of Clustering Hard Clustering: Each object is in one and only one cluster Soft Clustering: Each object has a probability of being in each cluster Clustering Algorithms Partitioning-based clustering K-means clustering K-medoids clustering EM (expectation maximization) clustering Hierarchical clustering Divisive clustering (top down) Agglomerative clustering (bottom up) Density-Based Methods Regions of dense points separated by sparser regions of relatively low density K-Means Clustering Algorithm Initialize k cluster centers Do Assignment step: Assign each data point to its closest cluster center Re-estimation step: Re-compute cluster centers While (there are still changes in the cluster centers) Visualization at: http://www.delftcluster.nl/textminer/theory/kmeans/kmeans.html Issues Why is K-Means working: How does it find the cluster centers? Does it find an optimal clustering What are good starting points for the algorithm? What is the right number of cluster centers? How do we know it will terminate? Agglomerative Clustering Algorithm: Put each item in its own cluster (all singletons) Find all pairwise distances between clusters Merge the two closest clusters Repeat until everything is in one cluster Observations: Results in a hierarchical clustering Yields a clustering for each possible number of clusters Greedy clustering: Result is not “optimal” for any cluster size Density-Based Clustering A cluster is defined as a connected dense component. Density is defined in terms of number of neighbors of a point. We can find clusters of arbitrary shape Market Basket Analysis Consider shopping cart filled with several items Market basket analysis tries to answer the following questions: Who makes purchases? What do customers buy together? In what order do customers purchase items? Market Basket Analysis Given: A database of customer transactions Each transaction is a set of items Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice} TID 111 111 111 111 112 112 112 113 113 114 114 114 CID 201 201 201 201 105 105 105 106 106 201 201 201 Date 5/1/99 5/1/99 5/1/99 5/1/99 6/3/99 6/3/99 6/3/99 6/5/99 6/5/99 7/1/99 7/1/99 7/1/99 Item Pen Ink Milk Juice Pen Ink Milk Pen Milk Pen Ink Juice Qty 2 1 3 6 1 1 1 1 1 2 2 4 Market Basket Analysis (Contd.) Coocurrences 80% of all customers purchase items X, Y and Z together. Association rules 60% of all customers who purchase X and Y also buy Z. Sequential patterns 60% of customers who first buy X also purchase Y within three weeks. Confidence and Support We prune the set of all possible association rules using two interestingness measures: Confidence of a rule: X Y has confidence c if P(Y|X) = c Support of a rule: X Y has support s if P(XY) = s We can also define Support of an itemset (a coocurrence) XY: Market Basket Analysis: Applications Sample Applications Direct marketing Fraud detection for medical insurance Floor/shelf planning Web site layout Cross-selling Applications of Frequent Itemsets Market Basket Analysis Association Rules Classification (especially: text, rare classes) Seeds for construction of Bayesian Networks Web log analysis Collaborative filtering Association Rule Algorithms More abstract problem redux Breadth-first search Depth-first search Problem Redux Abstract: A set of items {1,2,…,k} A dabase of transactions (itemsets) D={T1, T2, …, Tn}, Tj subset {1,2,…,k} GOAL: Find all itemsets that appear in at least x transactions (“appear in” == “are subsets of”) I subset T: T supports I For an itemset I, the number of transactions it appears in is called the support of I. Concrete: I = {milk, bread, cheese, …} D={ {milk,bread,cheese}, {bread,cheese,juice}, …} GOAL: Find all itemsets that appear in at least 1000 transactions {milk,bread,cheese} supports {milk,bread} Problem Redux (Contd.) Definitions: An itemset is frequent if it is a subset of at least x transactions. (FI.) An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.) GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)). Obvious relationship: MFI subset FI Example: D={ {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} } Minimum support x = 3 {1,2} is frequent {1,2,3} is maximal frequent Support({1,2}) = 4 All maximal frequent itemsets: {1,2,3} Applications Spatial association rules Web mining Market basket analysis User/customer profiling ExtenSuggestionssions: Sequential Patterns In the “Market Itemset Analysis” replace Milk, Pen, etc with names of medications and use the idea in Hospital Data mining new proposal The idea of swaem intelligence – add to it the extra analysis pf the inducyion rules in this set of slides. Kraft Foods: Direct Marketing Company maintains a large database of purchases by customers. Data mining 1. Analysts identified associations among groups of products bought by particular segments of customers. 2. Sent out 3 sets of coupons to various households. • Better response rates: 50 % increase in sales for one its products • Continue to use of this approach Health Insurance Commission of Australia: Insurance Fraud Commission maintains a database of insurance claims,including laboratory tests ordered during the diagnosis of patients. Data mining 1. Identified the practice of "up coding" to reflect more expensive tests than are necessary. 2. Now monitors orders for lab tests. • Commission expects to save US$1,000,000 / year by eliminating the practice of "up coding”. HNC Software: Credit Card Fraud Payment Fraud Large issuers of cards may lose $10 million / year due to fraud Difficult to identify the few transactions among thousands which reflect potential fraud Falcon software Mines data through neural networks Introduced in September 1992 Models each cardholder's requested transaction against the customer's past spending history. processes several hundred requests per second compares current transaction with customer's history identifies the transactions most likely to be frauds enables bank to stop high-risk transactions before they are authorized Used by many retail banks: currently monitors 160 million card accounts for fraud New Account Fraud New Account Fraud Fraudulent applications for credit cards are growing at 50 % per year Falcon Sentry software Mines data through neural networks and a rule base Introduced in September 1992 Checks information on applications against data from credit bureaus Allows card issuers to simultaneously: increase the proportion of applications received reduce the proportion of fraudulent applications authorized Quality Control IBM Microelectronics: Quality Control Analyzed manufacturing data on Dynamic Random Access Memory (DRAM) chips. Data mining 1. Built predictive models of manufacturing yield (% non-defective) effects of production parameters on chip performance. 2. Discovered critical factors behind production yield & product performance. 3. Created a new design for the chip increased yield saved millions of dollars in direct manufacturing costs enhanced product performance by substantially lowering the memory cycle time Retail Sales B & L Stores Belk and Leggett Stores = one of largest retail chains 280 stores in southeast U.S. data warehouse contains 100s of gigabytes (billion characters) of data data mining to: increase sales reduce costs Selected DSS Agent from MicroStrategy, Inc. analyize merchandizing (patterns of sales) manage inventory Market Basket Analysis DSS Agent uses intelligent agents data mining provides multiple functions recognizes sales patterns among stores discovers sales patterns by time of day day of year category of product etc. swiftly identifies trends & shifts in customer tastes performs Market Basket Analysis (MBA) analyzes Point-of-Sale or -Service (POS) data identifies relationships among products and/or services purchased E.g. A customer who buys Brand X slacks has a 35% chance of buying Brand Y shirts. Agent tool is also used by other Fortune 1000 firms average ROI > 300 % Case Based Reasoning (CBR) case A case B target General scheme for a case based reasoning (CBR) model. The target case matched against similar precedents in the historical database, such as case Case Based Reasoning (CBR) Learning through the accumulation of experience Key issues Indexing: storing cases for quick, effective access of precedents Retrieval: accessing the appropriate precedent cases Advantages Explicit knowledge form recognizable to humans No need to re-code knowledge for computer processing Limitations Retrieving precedents based on superficial features E.g. Matching Indonesia with U.S. because both have similar population size Traditional approach ignores the issue of generalizing knowledge Genetic Algorithm Generation of candidate solutions using the procedures of biological evolution. Procedure 0. Initialize. Create a population of potential solutions ("organisms"). 1. Evaluate. Determine the level of "fitness" for each solution. 2. Cull. Discard the poor solutions. 3. Breed. a. Select 2 "fit" solutions to serve as parents. b. From the 2 parents, generate offspring. * Crossover: Cut the parents at random and switch the 2 halves. * Mutation: Randomly change the value in a parent solution. 4. Repeat. Go back to Step 1 above. Genetic Algorithm (Cont.) Advantages Applicable to a wide range of problem domains. Robustness: can obtain solutions even when the performance function is highly irregular or input data are noisy. Implicit parallelism: can search in many directions concurrently. Limitations Slow, like neural networks. But: computation can be distributed over multiple processors (unlike neural networks) Source: www.pathology.washington.edu Multistrategy Learning Every technique has advantages & limitations Multistrategy approach Take advantage of the strengths of diverse techniques Circumvent the limitations of each methodology Types of Models Prediction Models for Descriptive Models for Predicting and Classifying Grouping and Finding Regression algorithms Associations (predict numeric outcome): neural Clustering/Grouping networks, rule induction, algorithms: K-means, CART (OLS regression, Kohonen GLM) Association algorithms: Classification algorithm predict symbolic apriori, GRI outcome): CHAID, C5.0 (discriminant analysis, logistic regression) Neural Networks Description Difficult interpretation Tends to ‘overfit’ the data Extensive amount of training time A lot of data preparation Works with all data types Rule Induction Description Intuitive output Handles all forms of numeric data, as well as non-numeric (symbolic) data C5 Algorithm a special case of rule induction Apriori Description Seeks association rules in dataset ‘Market basket’ analysis Sequence discovery Data Mining Is The automated process of finding relationships and patterns in stored data It is different from the use of SQL queries and other business intelligence tools Data Mining Is Motivated by business need, large amounts of available data, and humans’ limited cognitive processing abilities Enabled by data warehousing, parallel processing, and data mining algorithms Common Types of Information from Data Mining Associations -- identifies occurrences that are linked to a single event Sequences -- identifies events that are linked over time Classification -- recognizes patterns that describe the group to which an item belongs Common Types of Information from Data Mining Clustering -- discovers different groupings within the data Forecasting -- estimates future values Commonly Used Data Mining Techniques Artificial neural networks Decision trees Genetic algorithms Nearest neighbor method Rule induction The Current State of Data Mining Tools Many of the vendors are small companies IBM and SAS have been in the market for some time, and more “biggies” are moving into this market BI tools and RDMS products are increasingly including basic data mining capabilities Packaged data mining applications are becoming common The Data Mining Process Requires personnel with domain, data warehousing, and data mining expertise Requires data selection, data extraction, data cleansing, and data transformation Most data mining tools work with highly granular flat files Is an iterative and interactive process Why Data Mining Credit ratings/targeted marketing: Given a database of 100,000 names, which persons are the least likely to default on their credit cards? Identify likely responders to sales promotions Fraud detection Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer? Customer relationship management: Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor? : Data Mining helps extract such information Applications Banking: loan/credit card approval predict good customers based on old customers Customer relationship management: identify those who are likely to leave for a competitor. Targeted marketing: identify likely responders to promotions Fraud detection: telecommunications, financial transactions from an online stream of event identify fraudulent events Manufacturing and production: automatically adjust knobs when process parameter changes Applications (continued) Medicine: disease outcome, effectiveness of treatments analyze patient disease history: find relationship between diseases Molecular/Pharmaceutical: identify new drugs Scientific data analysis: identify new galaxies by searching for sub clusters Web site/store design and promotion: find affinity of visitor to pages and modify The KDD process Problem fomulation Data collection subset data: sampling might hurt if highly skewed data feature selection: principal component analysis, heuristic search Pre-processing: cleaning name/address cleaning, different meanings (annual, yearly), duplicate removal, supplying missing values Transformation: map complex objects e.g. time series data to features e.g. frequency Choosing mining task and mining method: Result evaluation and Visualization: Knowledge discovery is an iterative process Relationship with other fields Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress on scalability of number of features and instances stress on algorithms and architectures whereas foundations of methods and formulations provided by statistics and machine learning. automation for handling large, heterogeneous data Some basic operations Predictive: Regression Classification Collaborative Filtering Descriptive: Clustering / similarity matching Association rules and variants Deviation detection Classification Given old data about customers and payments, predict new applicant’s loan eligibility. Previous customers Classifier Decision rules Salary > 5 L Age Salary Profession Location Customer type Prof. = Exec New applicant’s data Good/ bad Classification methods Goal: Predict class Ci = f(x1, x2, .. Xn) Regression: (linear or any other polynomial) a*x1 + b*x2 + c = Ci. Nearest neighour Decision tree classifier: divide decision space into piecewise constant regions. Probabilistic/generative models Neural networks: partition by non- Nearest neighbor Define proximity between instances, find neighbors of new instance and assign majority class Case based reasoning: when attributes are more complicated than • Cons • Pros real-valued. + Fast training – Slow during application. – No feature selection. – Notion of proximity vague Clustering Unsupervised learning when old data with class labels not available e.g. when introducing a new product. Group/cluster existing customers based on time series of payment history such that similar customers in same cluster. Key requirement: Need a good measure of similarity between instances. Identify micro-markets and develop policies for each Applications Customer segmentation e.g. for targeted marketing Group/cluster existing customers based on time series of payment history such that similar customers in same cluster. Identify micro-markets and develop policies for each Collaborative filtering: group based on common items purchased Text clustering Compression Distance functions Numeric data: euclidean, manhattan distances Categorical data: 0/1 to indicate presence/absence followed by Hamming distance (# dissimilarity) Jaccard coefficients: #similarity in 1s/(# of 1s) data dependent measures: similarity of A and B depends on co-occurance with C. Combined numeric and categorical data: weighted normalized distance: Clustering methods Hierarchical clustering agglomerative Vs divisive single link Vs complete link Partitional clustering distance-based: K-means model-based: EM density-based: Partitional methods: K-means Criteria: minimize sum of square of distance Between each point and centroid of the cluster. Between each pair of points in the cluster Algorithm: Select initial partition with K clusters: random, first K, K separated points Repeat until stabilization: Assign each point to closest cluster center Collaborative Filtering Given database of user preferences, predict preference of new user Example: predict what new movies you will like based on your past preferences others with similar past preferences their preferences for the new movies Example: predict what books/CDs a person may want to buy (and suggest it, or give discounts to tempt customer) Association rules Given set T of groups of items Example: set of item sets purchased Goal: find all rules on itemsets of the form a-->b such that T Milk, cereal Tea, milk Tea, rice, bread support of a and b > user threshold s conditional probability (confidence) of b given a > user threshold c cereal Example: Milk --> bread Purchase of product A --> Prevalent Interesting Analysts already know about prevalent rules Interesting rules are those that deviate from prior expectation Mining’s payoff is in finding surprising phenomena Zzzz... 1995 Milk and cereal sell together! 1998 Milk and cereal sell together! Applications of fast itemset counting Find correlated events: Applications in medicine: find redundant tests Cross selling in retail, banking Improve predictive capability of classifiers that assume attribute independence New similarity measures of categorical attributes [Mannila et al, Application Areas Industry Finance Insurance Telecommunication Transport Consumer goods Data Service providers Utilities Application Credit Card Analysis Claims, Fraud Analysis Call record analysis Logistics management promotion analysis Value added data Power usage analysis Usage scenarios Data warehouse mining: assimilate data from operational sources mine static data Mining log data Continuous mining: example in process control Stages in mining: data selection pre-processing: cleaning transformation mining result evaluation visualization Mining market Around 20 to 30 mining tool vendors Major tool players: Clementine, IBM’s Intelligent Miner, SGI’s MineSet, SAS’s Enterprise Miner. All pretty much the same set of tools Many embedded products: fraud detection: electronic commerce applications, health care, customer relationship management: Epiphany Vertical integration: Mining on the web Web log analysis for site design: what are popular pages, what links are hard to find. Electronic stores sales enhancements: recommendations, advertisement: Collaborative filtering: Net perception, Wisewire Inventory control: what was a shopper looking for and could not find.. State of art in mining OLAP integration Decision trees [Information discovery, Cognos] find factors influencing high profits Clustering [Pilot software] segment customers to define hierarchy on that dimension Time series analysis: [Seagate’s Holos] Query for various shapes along time: eg. spikes, outliers Multi-level Associations [Han et al.] find association between members of dimensions Data Mining in Use The US Government uses Data Mining to track fraud A Supermarket becomes an information broker Basketball teams use it to track game strategy Cross Selling Target Marketing Holding on to Good Customers Weeding out Bad Customers Some success stories Network intrusion detection using a combination of sequential rule discovery and classification tree on 4 GB DARPA data Won over (manual) knowledge engineering approach http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides good detailed description of the entire process Major US bank: customer attrition prediction First segment customers based on financial behavior: found 3 segments Build attrition models for each of the 3 segments 40-50% of attritions were predicted == factor of 18 increase Targeted credit marketing: major US banks find customer segments based on 13 months credit balances What is KnowledgeSeeker? Produced by ANGOSS Software Corporation, who focus “solely” on data mining software. Offer training and consulting services Produce data mining add-ins which accepts data from all major databases Works with popular query and reporting, spreadsheet, statistical and OLAP & ROLAP tools. Data Mining 19 9 Major Competitors Company Software Clementine 6.0 Enterprise Miner 3.0 Intelligent Miner Data Mining 20 0 Major Competitors Company Software Mineset 3.1 Darwin Scenario Data Mining 20 1 Current Applications Manufacturing Used by the R.R. Donnelly & Sons commercial printing company to improve process control, cut costs and increase productivity. Used extensively by Hewlett Packard in their United States manufacturing plants as a process control tool both to analyze factors impacting product quality as well as to generate rules for production control systems. Data Mining 20 2 Current Applications Auditing Used by the IRS to combat fraud, reduce risk, and increase collection rates. Finance Used by the Canadian Imperial Bank of Commerce (CIBC) to create models for fraud detection and risk management. Data Mining 20 3 Current Applications CRM Telephony Used by US West to reduce churning and increase customer loyalty for a new voice messaging technology. Data Mining 20 4 Current Applications Marketing Used by the Washington Post to improve their direct mail targeting and to conduct survey analysis. Health Care Used by the Oxford Transplant Center to discover factors affecting transplant survival rates. Used by the University of Rochester Cancer Center to study the effect of anxiety on chemotherapy-related nausea. Data Mining 20 5 More Customers Data Mining 20 6 Questions 1. What percentage of people in the test group have high blood pressure with these characteristics: 66-year-old male regular smoker that has low to moderate salt consumption? 2. Do the risk levels change for a male with the same characteristics who quit smoking? What are the percentages? 3. If you are a 2% milk drinker, how many factors are still interesting? 4. Knowing that salt consumption and smoking habits are interesting factors, which one has a stronger correlation to blood pressure levels? 5. Grow an automatic tree. Look to see if gender is an interesting factor for 55-year-old regular smoker who does not each cheese? Data Mining 20 7 Association Classic market-basket analysis, which treats the purchase of a number of items (for example, the contents of a shopping basket) as a single transaction. This information can be used to adjust inventories, modify floor or shelf layouts, or introduce targeted promotional activities to increase overall sales or move specific products. Example : 80 percent of all transactions in which beer was purchased also included potato chips. Sequence-based analysis Traditional market-basket analysis deals with a collection of items as part of a point-in-time transaction. to identify a typical set of purchases that might predict the subsequent purchase of a specific item. Clustering Clustering approach address segmentation problems. These approaches assign records with a large number of attributes into a relatively small set of groups or "segments." Example : Buying habits of multiple population segments might be compared to determine which segments to target for a new sales campaign. Classification Most commonly applied data mining technique Algorithm uses preclassified examples to determine the set of parameters required for proper discrimination. Example : A classifier derived from the Classification approach is capable of identifying risky loans, could be used to aid in the decision of whether to grant a loan to an individual. Issues of Data Mining Present-day tools are strong but require significant expertise to implement effectively. Issues of Data Mining Susceptibility to "dirty" or irrelevant data. Inability to "explain" results in human terms. Issues susceptibility to "dirty" or irrelevant data Data mining tools of today simply take everything they are given as factual and draw the resulting conclusions. Users must take the necessary precautions to ensure that the data being analyzed is "clean." Issues, cont’ inability to "explain" results in human terms Many of the tools employed in data mining analysis use complex mathematical algorithms that are not easily mapped into human terms. what good does the information do if you don’t understand it? Comparison with reporting, BI and OLAP Data Mining Reporting Complex Simple relationships relationships Automatically find Choose the the relevant factors relevant factors Show only relevant Examine all details details Prediction… (Also applies to visualisation & simple statistics) Comparison with Statistics Statistical analysis Mainly about hypothesis testing Focussed on precision Data mining Mainly about hypothesis generation Focussed on deployment Example: data mining and customer processes Insight: Who are my customers and why do they behave the way they do? Prediction: Who is a good prospect, for what product, who is at risk, what is the next thing to offer? Uses: Targeted marketing, mailshots, call-centres, adaptive websites Example: data mining and fraud detection Insight: How can (specific method of) fraud be recognised? What constitute normal, abnormal and suspicious events? Prediction: Recognise similarity to previous frauds – how similar? Spot abnormal events – how suspicious? Example: data mining and diagnosing cancer Complex data from genetics Challenging data mining problem Find patterns of gene activation indicating different diseases / stages “Changed the way I think about cancer” Oncologist from Chicago Children’s Memorial Hospital Example: data mining and policing Knowing the patterns helps plan effective crime prevention Crime hot-spots understood better Sift through mountains of crime reports Identify crime series “Other people save money using data mining – we save lives.” Police force homicide specialist and data miner Data mining tools: Clementine and its philosophy How to do data mining Lots of data mining operations How do you glue them together to solve a problem? How do we actually do data mining? Methodology Not just the right way, but any way… Myths about Data Mining (1) Data, Process and Tech Data mining is all about massive data It can be, but some important datasets are very small, and sampling is often appropriate Data mining is a technical process Business analysts perform data mining every day It is a business process Data mining is all about algorithms Algorithms are a key tool But data mining is done by people, not by algorithms Data mining is all about predictive accuracy It's about usefulness Accuracy is only a small component Myths about Data Mining (2) Data Quality Data mining only works with clean data Cleaning the data is part of the data mining process Need not be clean initially Data mining only works with complete data Data mining works with whatever data you have. Complete is good, incomplete is also ok. Data mining only works with correct data Errors in data are inevitable. Data mining helps you deal with them. One last exploding myth Neural Networks are not useful when you need to understand the patterns that you find (which nearly always Relatedis to over-simplistic views of in datadata mining mining) Data mining techniques form a toolkit We often use techniques in surprising ways E.g. Neural nets for field selection Neural nets for pattern confirmation Neural nets combined with other techniques for cross-checking What use is a pair of pliers? Related Concepts Outline Goal: Examine some areas which are related to data mining. Database/OLTP Systems Fuzzy Sets and Logic Information Retrieval(Web Search Engines) Dimensional Modeling Data Warehousing OLAP/DSS Statistics Machine Learning Pattern Matching 226 Fuzzy Sets and Logic Fuzzy Set: Set membership function is a real valued function with output in the range [0,1]. f(x): Probability x is in F. 1-f(x): Probability x is not in F. EX: T = {x | x is a person and x is tall} Let f(x) be the probability that x is tall Here f is the membership function DM: Prediction and classification are fuzzy. 227 Information Retrieval Information Retrieval (IR): retrieving desired information from textual data. Library Science Digital Libraries Web Search Engines Traditionally keyword based Sample query: Find all documents about “data mining”. DM: Similarity measures; Mine text/Web data. 228 Dimensional Modeling View data in a hierarchical manner more as business executives might Useful in decision support systems and mining Dimension: collection of logically related attributes; axis for modeling data. Facts: data stored Ex: Dimensions – products, locations, date Facts – quantity, unit price DM: May view data as dimensinoal. © Prentice Hall 229 Dimensional Modeling Queries Roll Up: more general dimension Drill Down: more specific dimension Dimension (Aggregation) Hierarchy SQL uses aggregation Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems. 230 Cube view of Data 231 Data Warehousing “Subject-oriented, integrated, time-variant, nonvolatile” William Inmon Operational Data: Data used in day to day needs of company. Informational Data: Supports other functions such as planning and forecasting. Data mining tools often access data warehouses rather than operational data. DM: May access data in warehouse. 232 OLAP Online Analytic Processing (OLAP): provides more complex queries than OLTP. OnLine Transaction Processing (OLTP): traditional database/transaction processing. Dimensional data; cube view Visualization of operations: Slice: examine sub-cube. Dice: rotate cube to look at another dimension. Roll Up/Drill Down DM: May use OLAP queries. 233 OLAP Operations Roll Up Drill Down Single Cell Multiple Cells Slice Dice 234 Statistics Simple descriptive models Statistical inference: generalizing a model created from a sample of the data to the entire dataset. Exploratory Data Analysis: Data can actually drive the creation of the model Opposite of traditional statistical view. Data mining targeted to business user DM: Many data mining methods come from statistical techniques. 235 Machine Learning Machine Learning: area of AI that examines how to write programs that can learn. Often used in classification and prediction Supervised Learning: learns by example. Unsupervised Learning: learns without knowledge of correct answers. Machine learning often deals with small static datasets. DM: Uses many machine learning techniques. 236 Pattern Matching (Recognition) Pattern Matching: finds occurrences of a predefined pattern in the data. Applications include speech recognition, information retrieval, time series analysis. DM: Type of classification. © Prentice Hall 237 DM vs. Related Topics Area Query Data DB/OLTP Precise Database IR OLAP DM Results Output Precise DB Objects or Aggregation Precise Documents Vague Documents Analysis Multidimensional Precise DB Objects or Aggregation Vague Preprocessed Vague KDD Objects 238 Data Mining Techniques Outline Goal: Provide an overview of basic data mining techniques Statistical Point Estimation Models Based on Summarization Bayes Theorem Hypothesis Testing Regression and Correlation Similarity Measures Decision Trees Neural Networks Activation Functions Genetic Algorithms © Prentice Hall 239 Point Estimation Point Estimate: estimate a population parameter. May be made by calculating the parameter for a sample. May be used to predict value for missing data. Ex: R contains 100 employees 99 have salary information Mean salary of these is $50,000 Use $50,000 as value of remaining employee’s salary. Is this a good idea? 240 Estimation Error Bias: Difference between expected value and actual value. Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: Why square? Root Mean Square Error (RMSE) 241 Expectation-Maximization (EM) Solves estimation with incomplete data. Obtain initial estimates for parameters. Iteratively use estimates for missing data and continue until convergence. 242 Models Based on Summarization Visualization: Frequency distribution, mean, variance, median, mode, etc. Box Plot: 243 Bayes Theorem Posterior Probability: P(h1|xi) Prior Probability: P(h1) Bayes Theorem: Assign probabilities of hypotheses given a data value. 244 Hypothesis Testing Find model to explain behavior by creating and then testing a hypothesis about the data. Exact opposite of usual DM approach. H0 – Null hypothesis; Hypothesis to be tested. H1 – Alternative hypothesis 245 Regression Predict future values based on past values Linear Regression assumes linear relationship exists. y = c0 + c1 x1 + … + cn xn Find values to best fit the data 246 Correlation Examine the degree to which the values for two variables behave similarly. Correlation coefficient r: • 1 = perfect correlation • -1 = perfect but opposite correlation • 0 = no correlation 247 Similarity Measures Determine similarity between two objects. Similarity characteristics: Alternatively, distance measure measure how unlike or dissimilar objects are. © Prentice Hall 248 Distance Measures Measure dissimilarity between objects 249 Decision Trees Decision Tree (DT): Tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem. Popular technique for classification; Leaf node indicates class to which the corresponding tuple belongs. 250 Decision Trees A Decision Tree Model is a computational model consisting of three parts: Decision Tree Algorithm to create the tree Algorithm that applies the tree to data Creation of the tree is the most difficult part. Processing is basically a search similar to that in a binary search tree (although DT may not be© binary). Prentice Hall 251 Neural Networks Based on observed functioning of human brain. (Artificial Neural Networks (ANN) Our view of neural networks is very simplistic. We view a neural network (NN) from a graphical viewpoint. Alternatively, a NN may be viewed from the perspective of matrices. Used in pattern recognition, speech recognition, computer vision, and classification.© Prentice Hall 252 Generating Rules Decision tree can be converted into a rule set Straightforward conversion: each path to the leaf becomes a rule – makes an overly complex rule set More effective conversions are not trivial (e.g. C4.8 tests each node in root-leaf path to see if it can be eliminated without loss in accuracy) 253 Covering algorithms Strategy for generating a rule set directly: for each class in turn find rule set that covers all instances in it (excluding instances not in the class) This approach is called a covering approach because at each stage a rule is identified that covers some of the instances 254 Rules vs. trees Corresponding decision tree: (produces exactly the same predictions) But: rule sets can be more clear when decision trees suffer from replicated subtrees Also: in multi-class situations, covering algorithm concentrates on one class at a time whereas decision 255 A simple covering algorithm Generates a rule by adding tests that maximize rule’s accuracy Similar to situation in decision trees: problem of selecting an attribute to split on space of But: decision tree inducer maximizes examples overall purity rule s o far Each new test reduces rule’s coverage: 256 witten&eibe rule after adding new term Algorithm Components 1. The task the algorithm is used to address (e.g. classification, clustering, etc.) 2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model) 3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.) 4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.) 5. The data management technique used for storing, indexing, and retrieving data (critical when data too large to reside in memory) Models and Patterns Models Prediction •Linear regression •Piecewise linear Probability Distributions Structured Data Models Prediction •Linear regression •Piecewise linear •Nonparamatric regression Probability Distributions Structured Data Models Prediction •Linear regression Probability Distributions Structured Data logistic regression •Piecewise linear naïve bayes/TAN/bayesian networks •Nonparametric regression NN support vector machines •Classification Trees etc. Models Prediction •Linear regression •Piecewise linear •Nonparametric regression •Classification Probability Distributions •Parametric models •Mixtures of parametric models •Graphical Markov models (categorical, continuous, mixed) Structured Data Models Prediction •Linear regression •Piecewise linear •Nonparametric regression •Classification Probability Distributions •Parametric models •Mixtures of parametric models •Graphical Markov models (categorical, continuous, mixed) Structured Data •Time series •Markov models •Mixture Transition Distribution models •Hidden Markov models •Spatial models Bias-Variance Tradeoff High Bias - Low Variance Score function should embody the compromise Low Bias - High Variance “overfitting” - modeling the random component Patterns Local Global •Clustering via partitioning •Outlier detection •Hierarchical Clustering •Changepoint detection •Mixture Models •Bump hunting •Scan statistics •Association rules Scan Statistics via Permutation Tests xx x xx x x xx x x xx xx x x x x x xxxx x x x xxxx The curve represents a road Each “x” marks an accident Red “x” denotes an injury accident Black “x” means no injury Is there a stretch of road where there is an unually large fraction of injury accidents? xxx x x Scan with Fixed Window If we know the length of the “stretch of road” that we seek, e.g., we could slide this window long the road and find the most “unusual” window location xxx x x x xx x xx x xx x x xx xx x x x x x xxxx x x x xxxx Spatial-Temporal Scan Statistics Spatial-temporal scan statistic use cylinders where the height of the cylinder represents a time window Major Data Mining Tasks Classification: predicting an item class Clustering: finding clusters in data Associations: e.g. A & B & C occur frequently Visualization: to facilitate human discovery Summarization: describing a group Deviation Detection: finding changes Estimation: predicting a continuous value 270 Link Analysis: finding relationships Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ... 271 Clustering Find “natural” grouping of instances given un-labeled data 272 Association Rules & Frequent Itemsets Transactions Frequent Itemsets: TID Produce 1 MILK, BREAD, EGGS 2 BREAD, SUGAR 3 BREAD, CEREAL 4 MILK, BREAD, SUGAR 5 MILK, CEREAL 6 BREAD, CEREAL 7 MILK, CEREAL 8 MILK, BREAD, CEREAL, EGGS 9 MILK, BREAD, CEREAL Milk, Bread (4) Bread, Cereal (3) Milk, Bread, Cereal (2) … Rules: Milk => Bread (66%) 273 Visualization & Data Mining Visualizing the data to facilitate human discovery Presenting the discovered results in a visually "nice" way 274 Summarization Describe features of the selected group Use natural language and graphics Usually in Combination with Deviation detection or other methods Average length of stay in this study area rose 45.7 percent, from 4.3 days to 6.2 days, because ... 275 Data Mining Central Quest Find true patterns and avoid overfitting (finding seemingly signifcant but really random patterns due to searching too many possibilites) 276 Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Regression, Decision Trees, Bayesian, Neural Networks, ... Given a set of points from classes what is the class of new point ? 277 Classification: Linear Regression Linear Regression w0 + w1 x + w2 y >= 0 Regression computes wi from data to minimize squared error to ‘fit’ the data Not flexible enough 278 Classification: Decision Trees if X > 5 then blue else if Y > 3 then blue else if X > 2 then green else blue Y 3 2 X 5 279 DECISION TREE An internal node is a test on an attribute. A branch represents an outcome of the test, e.g., Color=red. A leaf node represents a class label or class label distribution. At each node, one attribute is chosen to split training examples into distinct classes as much as possible A new instance is280classified by Classification: Neural Nets Can select more complex regions Can be more accurate Also can overfit the data – find patterns in random noise 281 Evaluating which method works the best for classification No model is uniformly the best Dimensions for Comparison speed of training speed of model application noise tolerance explanation ability Best Results: Hybrid, Integrated models 282 Comparison of Major Classification Approaches Train Run Noise Can Use time Time Toler Prior ance Knowledge Decision fast fast poor no Trees Rules med fast poor no Accuracy Underon Customer standable Modelling Neural slow Networks Bayesian slow medium medium medium good fast good no good poor fast good yes good good A hybrid method will have higher accuracy 283 Evaluation of Classification Models How predictive is the model we learned? Error on the training data is not a good indicator of performance on future data The new data will probably not be exactly the same as the training data! Overfitting – fitting the training data too precisely - usually leads to poor 284 results on new data Classification: Train, Validation, Test split Results Known + + + Data Model Builder Training set Evaluate Model Builder Y N Validation set Final Test Set Final Model 285 Predictions + + + - Final Evaluation + - Cross-validation Cross-validation avoids overlapping test sets First step: data is split into k subsets of equal size Second step: each subset in turn is used for testing and the remainder for training This is called k-fold cross-validation Often the subsets are stratified before the cross-validation is 286 Cross-validation example: —Break up data into groups of the same size — — —Hold aside one group for testing and use the rest to build model Test — —Repeat 287 287 More on cross-validation Standard method for evaluation: stratified ten-fold cross-validation Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate Stratification reduces the estimate’s variance Even better: repeated stratified cross-validation 288 E.g. ten-fold cross-validation is Clustering Methods Many different method and algorithms: For numeric and/or symbolic data Deterministic vs. probabilistic Exclusive vs. overlapping Hierarchical vs. flat Top-down vs. bottom-up 289 Clustering Evaluation Manual inspection Benchmarking on existing labels Cluster quality measures distance measures high similarity within a cluster, low across clusters 290 The distance function Simplest case: one numeric attribute A Distance(X,Y) = A(X) – A(Y) Several numeric attributes: Distance(X,Y) = Euclidean distance between X,Y Nominal attributes: distance is set to 1 if values are different, 0 if they are equal 291 Are all attributes equally important? Simple Clustering: K-means Works with numeric data only 1) Pick a number (K) of cluster centers (at random) 2) Assign every item to its nearest cluster center (e.g. using Euclidean distance) 3) Move each cluster center to the mean of its assigned items 4) Repeat steps 2,3 until convergence (change in cluster assignments less 292 than a threshold) Data Mining in CRM: Customer Life Cycle Customer Life Cycle The stages in the relationship between a customer and a business Key stages in the customer lifecycle Prospects: people who are not yet customers but are in the target market Responders: prospects who show an interest in a product or service Active Customers: people who are currently using the product or service Former Customers: may be “bad” customers who did not pay their bills or who incurred high costs It’s important to know life cycle events (e.g. retirement) 293 Data Mining in CRM: Customer Life Cycle What marketers want: Increasing customer revenue and customer profitability Up-sell Cross-sell Keeping the customers for a longer period of time Solution: Applying data mining 294 Data Mining in CRM DM helps to Determine the behavior surrounding a particular lifecycle event Find other people in similar life stages and determine which customers are following similar behavior patterns 295 Data Mining in CRM (cont.) Data Warehouse Customer Profile Data Mining Customer Life Cycle Info. Campaign Management 296 CRISP-DM: Benefits of a standard methodology Communication A common language Repeatability Rational structure Education How do I start? www.crisp-dm.org CRISP-DM Overview An industry-standard process model for data mining. Not sector-specific CRISP-DM Phases: Business Understanding Data Understanding Data Preparation Modeling Evaluation Deployment Not strictly ordered respects iterative aspect of data mining Non-proprietary www.crisp-dm.org Rules vs. decision lists PRISM with outer loop removed generates a decision list for one class Subsequent rules are designed for rules that are not covered by previous rules But: order doesn’t matter because all rules predict the same class Outer loop considers all classes separately No order dependence implied Problems: overlapping rules, default 299 Process Standardization CRISP-DM: CRoss Industry Standard Process for Data Mining Initiative launched Sept.1996 SPSS/ISL, NCR, Daimler-Benz, OHRA Funding from European commission Over 200 members of the CRISP-DM SIG worldwide DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, .. System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche, … End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ... CRISP-DM Non-proprietary Application/Industry neutral Tool neutral Focus on business issues As well as technical analysis Framework for guidance Experience base Templates for Analysis Why CRISP-DM? •The data mining process must be reliable and repeatable by people with little data mining skills •CRISP-DM provides a uniform framework for –guidelines –experience documentation •CRISP-DM is flexible to account for differences –Different business/agency problems –Different data