Chapter I: Introduction
MIS 463, Fall 2011

Chapter 1. Introduction
  Motivation: Why data mining?
  Methodology of Knowledge Discovery in Databases
  Data mining functionalities
  Are all the patterns interesting?
  Business applications of data mining

Motivation: “Necessity is the Mother of Invention”
  Data explosion problem
    Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
    Need to convert such data into knowledge and information
  Applications
    Business management
    Production control
    Market analysis
    Engineering design
    Science exploration

Evolution of Database Technology (1)
  Data collection, database creation
  Data management
    data storage and retrieval
    database transaction processing
  Data analysis and understanding
    Data mining and data warehousing

Evolution of Database Technology (2) (see Fig. 1.1 in Han)
  1960s:
    Data collection, database creation, primitive file processing, hierarchical and network DBMS
  1970s:
    Relational data model, relational DBMS implementation
    Query languages like SQL (structured query language)
    Online transaction processing
  1980s:
    Advanced DBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
  1990s-2000s:
    Data warehousing, data mining, OLAP, multimedia databases, and Web databases
    Web-based database systems: XML-based database systems, web mining

Developments in computer hardware
  Powerful and affordable computers
  Data collection equipment
  Storage media
  Communication and networking

Data Warehouse
  Repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making.
  Data warehouse technology includes:
    Data cleaning
    Data integration
    On-Line Analytical Processing (OLAP): techniques that support multidimensional analysis and decision making with the following functionalities
      summarization
      consolidation
      aggregation
      view information from different angles
  but additional data analysis tools are needed for
    classification
    clustering
    characterization of data changing over time

Data-rich, information-poor state
  Abundance of data AND need for powerful data analysis tools
  “Data tombs”: data archives that are seldom visited
  Important decisions are made not on the information-rich data stored in databases but on a decision maker’s intuition
    No tool to extract knowledge embedded in vast amounts of data
  Current expert system technology
    Users or domain experts manually input knowledge, which is time consuming, costly, and prone to biases and errors

What Is Data Mining?
  Data mining (knowledge discovery in databases):
    Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
  Alternative names and their “inside stories”:
    Gold mining vs. sand mining. Data mining: a misnomer?
    Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  What is not data mining?
    Query processing
    Expert systems or small ML/statistical programs

Data Mining vs. Data Query
  Data query, e.g.
    A list of all customers who use a credit card to buy a PC
    A list of all MIS students who have a GPA of 3.5 or higher and have studied 4 or fewer semesters
  Data mining problems, e.g.
    What is the likelihood of a customer purchasing a PC with a credit card?
    Given the characteristics of an MIS student, predict her SPA in the coming term
    What are the characteristics of MIS undergrad students?

Chapter 1. Introduction
  Motivation: Why data mining?
  Methodology of Knowledge Discovery in Databases
  Data mining functionalities
  Are all the patterns interesting?
  Business applications of data mining

Why Data Mining? Four questions to be answered
  Can the problem be clearly defined?
  Does potentially meaningful data exist?
  Does the data contain hidden knowledge, or is it useful only for reporting purposes?
  Will the cost of processing the data be less than the likely increase in profit from the knowledge gained by applying a data mining project?

Steps of a KDD Process (1)
  1. Goal identification:
     Define the problem, relevant prior knowledge and goals of the application
  2. Creating a target data set: data selection
  3. Data preprocessing (may take 60%-80% of the effort!):
     removal of noise or outliers
     strategies for handling missing data fields
     accounting for time sequence information
  4. Data reduction and transformation:
     Find useful features, dimensionality/variable reduction, invariant representation.

Steps of a KDD Process (2)
  5. Data mining:
     Choosing functions of data mining: summarization, classification, regression, association, clustering.
     Choosing the mining algorithm(s): which models or parameters
     Search for patterns of interest
  6. Presentation and evaluation:
     visualization, transformation, removing redundant patterns, etc.
  7. Taking action:
     incorporating into the performance system
     documenting
     reporting to interested parties

An example: Customer Segmentation
  1. The marketing department wants to perform a segmentation study on the customers of AE Company
  2. Decide on relevant variables from a data warehouse on customers, sales, promotions
     Customers: name, ID, income, age, education, ...
     Sales: history of sales
     Promotion: promotion types, durations, ...
  3. Handle missing income, addresses, ...; determine outliers, if any
  4. Generate new index variables representing the wealth of customers
     Wealth = a*income + b*#houses + c*#cars...
     Make necessary transformations (z scores) so that some data mining algorithms work more efficiently

Example: Customer Segmentation (cont.)
  5.a: Choose clustering as the data mining functionality, as it is the natural one for a segmentation study: it finds groups of customers with similar characteristics
  5.b: Choose a clustering algorithm
       k-means, k-medoids or any one suitable for the problem
  5.c: Apply the algorithm
       Find clusters or segments
  6. Make reverse transformations, visualize the customer segments
  7. Present the results in the form of a report to the marketing department
     Implement the segmentation as part of a DSS so that it can be applied repeatedly at certain intervals as new customers arrive
     Develop marketing strategies for each segment

Data Mining: A KDD Process
  [Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Data Selection and Data Transformation -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
  Data mining: the core of the knowledge discovery process.

Architecture of a Typical Data Mining System
  [Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine, with Knowledge-base; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse]

Architecture of a Typical Data Mining System
  Database, data warehouse
  Database or data warehouse server
  Knowledge base
    concept hierarchies
    user beliefs
    other thresholds
  Data mining engine
    functional modules: characterization, association, classification, cluster analysis, evolution and deviation analysis
  Pattern evaluation module
    assesses a pattern’s interestingness
  Graphical user interface

Data Mining: Confluence of Multiple Disciplines
  [Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Information Science, Visualization and Other Disciplines]

Efficient and Scalable Techniques
  For an algorithm to be efficient and scalable, its running time should be predictable and acceptable
  How?
    Parallel and distributed algorithms
    Sampling from databases

Chapter 1.
Introduction
  Motivation: Why data mining?
  Methodology of Knowledge Discovery in Databases
  Data mining functionalities
  Are all the patterns interesting?
  Business applications of data mining

Two Styles of Data Mining
  Descriptive data mining
    characterizes the general properties of the data in the database
    finds patterns in data, and the user determines which ones are important
  Predictive data mining
    performs inference on the current data to make predictions
    we know what to predict
  Not mutually exclusive
    used together: descriptive, then predictive
    E.g. customer segmentation (descriptive, by clustering) followed by a risk assignment model (predictive, by ANN)

Supervised vs. Unsupervised Learning
  Supervised learning (classification, prediction)
    Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    New data is classified based on the training set
  Unsupervised learning (summarization, association, clustering)
    The class labels of the training data are unknown
    Given a set of measurements, observations, etc.,
    with the aim of establishing the existence of classes or clusters in the data

Descriptive Data Mining (1)
  Discovering new patterns inside the data
  Used during the data exploration steps
  Typical questions answered by descriptive data mining
    what is in the data?
    what does it look like?
    are there any unusual patterns?
    what does the data suggest for customer segmentation?
  Users may have no idea which kinds of patterns may be interesting

Descriptive Data Mining (2)
  Patterns at various granularities
    geography: country - city - region - street
    student: university - faculty - department - minor
  Functionalities of descriptive data mining
    Clustering, e.g. customer segmentation
    Summarization
    Visualization
    Association, e.g. market basket analysis

A model is a black box
  X: vector of independent variables or inputs
  Y = f(X): an unknown function
  Y: dependent variables or output; a single variable or a vector
  [Figure: inputs X1, X2 -> Model -> output Y]
  The user does not care what the model is doing: it is a black box
  The user is interested in the accuracy of its predictions

Predictive Data Mining (1)
  Using known examples the model is trained
    the unknown function is learned from data
    the more data with known outcomes is available, the better the predictive power of the model
  Used to predict outcomes whose inputs are known but whose output values are not realized yet
  Never 100% accurate

Predictive Data Mining (2)
  The performance of a model on past data, i.e. predicting the known outcomes, is not important
  Its performance on unknown data is much more important

Typical questions answered by predictive models
  Who is likely to respond to our next offer?
    based on the history of previous marketing campaigns
  Which customers are likely to leave in the next six months?
  Which transactions are likely to be fraudulent?
    based on known examples of fraud
  What is the total amount of spending of a customer in the next month?

Data Mining Functionalities (1)
  Concept description: characterization and discrimination
    Generalize, summarize, and
    contrast data characteristics, e.g., big spenders vs. budget spenders
  Association (correlation and causality)
    Multi-dimensional vs. single-dimensional association
    age(X, “20..29”) ^ income(X, “20..29K”) => buys(X, “PC”) [support = 2%, confidence = 60%]
    contains(T, “computer”) => contains(T, “software”) [1%, 75%]

Data Mining Functionalities (2)
  Classification and prediction
    Finding models (functions) that describe and distinguish classes or concepts for future prediction
    E.g., classify people as healthy or sick, or classify transactions as fraudulent or not
    Methods: decision tree, classification rules, neural network
    Prediction: predict some unknown or missing numerical values
  Cluster analysis
    Class label is unknown: group data to form new classes, e.g., cluster customers of a retail company to learn about characteristics of different segments
    Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Data Mining Functionalities (3)
  Outlier analysis
    Outlier: a data object that does not comply with the general behavior of the data
    It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis
  Trend and evolution analysis
    Trend and deviation: regression analysis
    Sequential pattern mining: click stream analysis
    Similarity-based analysis
  Other pattern-directed or statistical analyses

Concept Description
  Characterization
  Discrimination
  Data classes or concepts
    classes of items for sale: computers, printers
    concepts of customers: bigSpenders, budgetSpenders

Data Characterization
  Summarizing the data of the class under study (target class)
  Methods
    SQL queries
    OLAP roll-up operation: user-controlled data summarization along a specified dimension
    attribute-oriented induction: without step-by-step user interaction
  The output of characterization
    pie charts, bar charts, curves, multidimensional data cubes, or cross tabs
    in rule form, as characteristic rules

Characterization
example
  Description summarizing the characteristics of customers who spend more than $1000 a year at AllElectronics
    age, employment, income
  Drill down on any dimension
    e.g. on occupation: view these according to their type of employment

Data Discrimination
  Comparing the target class with one or a set of comparative classes (contrasting classes)
  These classes can be specified by the user through database queries
  Methods and output are similar to those used for characterization
    but include comparative measures to distinguish between the target and contrasting classes

Discrimination examples
  Example 1: Compare the general features of software products
    whose sales increased by 10% in the last year (target class)
    with those whose sales decreased by at least 30% during the same period (contrasting class)
  Example 2: Compare two groups of AE customers
    I) who shop for computer products regularly, more than two times a month (target class)
    II) who rarely shop for such products, less than three times a year (contrasting class)
  The resulting description:
    80% of group I customers have a university education and are aged 20-40
    60% of group II customers are seniors or young and have no university degree

Multidimensional Data
  Sales according to region, month and product type
  Dimensions: Product, Location, Time
  Hierarchical summarization paths
    Product: Industry > Product category > Product
    Location: Region > Country > City > Office
    Time: Year > Quarter > Month or Week > Day

Association Analysis
  Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data
  Widely used for market basket and transaction data analysis
  More formally: X => Y, that is, A1 ^ A2 ^ ... ^ Ak => B1 ^ B2 ^ ... ^ Bl
    Ai, Bj are attribute-value pairs or predicates

Example: association analysis
  From the AllElectronics (AE) database
    age(X, “20..29”) ^ income(X, “1,000...2,000”) => buy(X, “CD player”) (support = 2%, confidence = 60%)
  X is a variable representing a customer
  2% of the AE customers are between 20 and 29 years of age, have incomes ranging from 1 to 2 billion TL, and buy a CD player
  There is a 60% probability that customers in those age and income groups will buy a CD player
  A multidimensional association rule: contains more than one attribute or predicate

Market basket analysis
  Customers' buying behaviour is investigated
  Based only on the transaction data
    no information about customer properties such as age or income
  Managers are interested in which products or product groups are sold together

Transactional Database
  Transaction ID   Item list
  10001            Computer, CD, printer
  10002            Plotter, monitor, mouse
  10003            Computer, DVD player
  10004            Printer
  10005            Plotter, UPS, modem

Example: basket analysis rule
  buy(computer) => buy(printer) (support = 1%, confidence = 60%)
  1% of all transactions contain both computer and printer
  If a transaction contains computer, there is a 60% chance that it contains printer as well
  A single-dimensional association rule: contains a single predicate
  An association rule is interesting if its support exceeds a minimum threshold and its confidence exceeds a minimum threshold
    These minimum values are set by specialists

Classification
  Learning is supervised
  Dependent variable is categorical
  Build a model able to assign new instances to one of a set of well-defined classes

Typical Classification Problems
  Given characteristics of individuals, differentiate those who have suffered a heart attack from those who have not
  Determine if a credit card purchase is fraudulent
  Classify a car loan applicant as a good or a poor credit risk

Methods of Classification
  Decision trees
  Artificial neural networks
  Bayesian classification
    naive Bayes
    belief networks
  k-nearest neighbor
  Regression
    logistic (logit), probit
    predicts the probability of each class when
    the dependent variable is categorical, e.g. good customer / bad customer, or employed / unemployed

Steps of classification process
  (1) Train the model
      using a training set: data objects whose class labels are known
  (2) Test the model
      on a test sample whose class labels are known but not used for training the model
  (3) Use the model for classification
      on new data whose class labels are unknown

An example: classification
  Historical data used for training the model (each customer's type is known; each customer has a label)
    ID   age  income  education  Type
    1    35   800     undergrad  risky
    2    26   600     HighSch    risky
    3    48   1200    grad       normal
    8    52   2500    undergrad  good
    44   29   1700    HighSch    good
  Testing set, whose labels are also known but not used in training the model
    CustID  age  income  education  Type
    17      43   550     Ph.D.      risky
    27      68   1650    grad       normal
  New customers, whose type has to be estimated
    CustID  age  income  education  Type
    11      36   850     undergrad  ?
    27      28   1650    grad       ?
  Each new customer has to be classified as risky, normal or good

An example: classification (cont.)
  Based on historical data, develop a classification model
    Decision tree, neural network, regression, ...
  Test the performance of the model on a portion of the historical data
  If the accuracy of the model is satisfactory
    Use the model on the new customers 11 and 27 to assign a type to these new customers

Example
  [Figure: scatter plot of AE customers, age against yearly income, with good and risky customers marked]

Example
  [Figure: the same scatter plot of AE customers with a new customer marked "?"]
  Assign the new customer whose type is unknown to either * or +

Solution
  [Figure: the scatter plot split at x1 (yearly income) = 1000 and x2 (age) = 35 into good and risky regions]
  Rule: IF yearly income > 1000 AND age > 35 THEN good ELSE risky

Credit Card Promotion Policy
  Credit card companies
    Promotional offerings with their monthly credit card billing
    Offers provide the opportunity to purchase items such as magazines, …
  A data mining study
    Predict individual behaviour
    What is the likelihood of an individual taking advantage of promotions, based on individual characteristics, credit history, ...
    Expected reduction in postage, paper and processing costs for the credit card company

Credit Card Promotion Database
  Income   Magazine   Watch      Life Ins.  Gender  Age  Credit Card
  Range    Promotion  Promotion  Promotion               Insurance
  40-50K   Yes        No         No         Male    45   No
  30-40K   Yes        Yes        Yes        Female  40   No
  40-50K   No         No         No         Male    42   No
  30-40K   Yes        Yes        Yes        Male    43   Yes
  50-60K   Yes        No         Yes        Female  38   No
  20-30K   No         No         No         Female  55   No
  30-40K   Yes        No         Yes        Male    35   Yes
  20-30K   No         Yes        No         Male    27   No
  30-40K   Yes        No         No         Male    43   No
  30-40K   Yes        Yes        Yes        Female  41   No
  40-50K   No         Yes        Yes        Female  43   No
  20-30K   No         Yes        Yes        Male    29   No
  50-60K   Yes        Yes        Yes        Female  39   No
  40-50K   No         Yes        No         Male    55   No
  20-30K   No         No         Yes        Female  19   Yes

Decision Trees for Credit Card Insurance Database
  Dependent variable: Life Insurance Promotion
  The critical value of 43 is determined by the algorithm
  A production rule from the tree:
    IF (age <= 43) & (Sex = Male) & (Credit Card Ins = No) THEN Life Insurance Pr = No
  The tree:
    age > 43: N 3, Y 0. Decision: No
    age <= 43: split on Gender
      Female: N 0, Y 6. Decision: Yes
      Male: split on Credit Card Insurance
        No: N 4, Y 1. Decision: No
        Yes: Yes 2, No 0. Decision?
Yes

Artificial Neural Networks
  Set of interconnected nodes designed to imitate the functioning of the human brain
  Feed-forward network
  Supervised learner model

For the promotion example
  Encode all variables
    Assign a numerical value even for qualitative variables such as sex
    Say X1 represents gender: X1 = 1 when male, X1 = 0 when female

[Figure: feed-forward network with an input layer (X1 = +1, X2 = 0, X3 = 0.5, X4 = -1), a hidden layer and an output layer; example weights W1,5 = 0.014 and W5,9 = -0.17]
  For a particular data object, 1 is the actual value of O9 and 0.78 is the predicted value
  (1 - 0.78)^2 is the squared error

Weights updating
  Weights between nodes are adjusted so as to reduce the error
  Details of the training process for neural networks are not important for the time being

Estimation-Prediction
  Similar to classification
  Output is a continuous variable
  Estimation: current value
  Prediction: future outcome rather than current behavior

Typical Estimation-Prediction Problems
  Estimate the salary of an individual who owns a sports car
  Predict next week’s closing price for the IMKB100 index
  Forecast the next day’s temperature

Prediction methods
  Artificial neural networks
  Linear regression
    Y_i = a_0 + a_1 X_{1,i} + a_2 X_{2,i} + ... + a_k X_{k,i} + u_i
  Non-linear regression
    Y_i = f(X_{1,i}, X_{2,i}, ..., X_{k,i}; a_1, a_2, ..., a_k, u_i)
  Generalized linear regression
    logistic: logit, probit
    Poisson regression: for count variables
  Regression trees

Example: Prediction and Classification
  Classification is used to classify customers applying for credit cards
    known class labels: risky, reliable
    when a new customer applies, the customer class is predicted by looking at her characteristics (independent variables)
      income, age, education, wealth, region, ...
  Prediction: the monthly expense of a new customer (a real continuous variable) is predicted based on personal information
    income, education, wealth, profession, ...
    Some are numeric, some categorical

Cluster Analysis
  Class label is unknown: group data to form new classes, assign class labels to each data object
    the unknown class labels are generated by the clustering model
    e.g., cluster customers to find customer segments
  Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
    Objects within a cluster have high similarity in comparison to one another but are very dissimilar to objects in other clusters
  There may be a hierarchy of classes

Example: Clustering
  Can be performed on AE customer data to identify homogeneous subpopulations of customers
  These may represent individual target groups for marketing

[Figure: scatter plot of customers by income and distance to store, with points marked Type 1, Type 2 and Type 3]
  Clustering according to income and distance to store: three clusters of data points are evident

Outlier Analysis
  Outlier: a data object that does not comply with the general behavior of the data
  It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis
  Detected using
    statistical tests
    distance measures
    visual inspection of the data

Reasons for outliers
  Measurement errors, coding errors
    age is entered as 999
  Nature of data
    the salary of the general manager is much higher than that of the other employees
    in a crisis the interest rate was in the order of 1000s

Evolution Analysis
  Describes and models regularities or trends for objects whose behavior changes over time
  Distinct features include
    Trend and deviation: time-series data analysis
    Sequential pattern mining, periodicity analysis
    Similarity-based analysis
  Example: stock market predictions, i.e. future stock prices for overall stocks (indexes) or individual company stocks

Sequential Pattern Analysis
  Determine sequential patterns in data
  Based on time sequence of actions
  Similar to associations, but the relationship is based on time
  Example 1: buy CD player today => buy CD within one week
  Example 2: in what sequence are the web pages of an e-business company accessed?
    70 percent
    of visitors follow A -> B -> C or A -> D -> B -> C or A -> E -> B -> C
    The company then decides to add a link directly from page A to page C

Chapter 1. Introduction
  Motivation: Why data mining?
  Methodology of Knowledge Discovery in Databases
  Data mining functionalities
  Are all the patterns interesting?
  Business applications of data mining

Are All the “Discovered” Patterns Interesting?
  A data mining system/query may generate thousands of patterns; not all of them are interesting.
  Are all patterns interesting? Typically not: only a small fraction of patterns are interesting to any given user
  Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., for a rule X => Y
    support: P(X ^ Y), the probability that a transaction contains both X and Y
    confidence: the degree of certainty of the detected association, P(Y | X), the conditional probability that a transaction containing X also contains Y
    thresholds are controlled by the user, e.g. rules that do not satisfy a confidence threshold of 50% are uninteresting
  Subjective: based on the user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Chapter 1. Introduction
  Motivation: Why data mining?
  Methodology of Knowledge Discovery in Databases
  Data mining functionalities
  Are all the patterns interesting?
  Business applications of data mining

Potential Business Applications
  Market analysis and management
    target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    Banks assume a financial risk when they grant loans
      risk models attempt to predict the probability of default, i.e. failure to pay back the borrowed amount
    Credit cards
    Insurance companies
  Fraud detection and management
  Other applications
    Text mining (news groups, email, documents) and Web analysis
    Intelligent query answering

Market Analysis and Management (1)
  Where are the data sources for analysis?
    Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies, clickstreams
  Customer profiling and segmentation
    data mining can tell you what types of customers buy what products (clustering or classification)
  Target marketing
    Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Market Analysis and Management (2)
  Effectiveness of sales campaigns
    Advertisements, coupons, discounts and bonuses promote products and attract customers, and can help improve profits
    Compare the amount of sales and the number of transactions during the sales period versus before or after the sales campaign
    Association analysis: which items are likely to be purchased together with the items on sale

Market Analysis and Management (3)
  Customer retention: analysis of customer loyalty
    sequences of purchases of particular customers: goods purchased at different periods by the same customers can be grouped into sequences
    changes in customer consumption or loyalty suggest adjustments to the pricing and variety of goods, to retain old customers and attract new customers
  Cross-selling and up-selling
    associations from sales records: a customer who buys a PC is likely to buy a printer
    purchase recommendations

Fraud Detection and Management
  Applications
    widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
  Approach
    use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
  Examples
    Credit card transactions: the FALCON fraud assessment system by HNC Inc. signals possibly fraudulent credit card transactions
    Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    Detecting telephone fraud: ASPECT European Research Gr.
      Unsupervised clustering to detect fraud in mobile phone networks
      Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
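The telephone fraud example rests on comparing new calls with an expected norm. The following is only a minimal sketch of that idea, not the method used by any system named on the slide: it assumes the norm is the mean of a subscriber's past call durations and flags new calls whose z-score exceeds a threshold. The function name `flag_deviations`, the threshold of 3 and all call durations are hypothetical.

```python
from statistics import mean, stdev


def flag_deviations(history, new_values, threshold=3.0):
    """Return the new values whose z-score against the history exceeds threshold.

    The 'expected norm' is assumed here to be the historical mean; the
    threshold of 3 standard deviations is an illustrative choice.
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of past behaviour
    if sigma == 0:
        # No variation in the history: anything different is a deviation.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes for one subscriber (made-up numbers).
past_calls = [3.0, 4.5, 2.0, 5.0, 3.5, 4.0, 2.5, 3.0]
todays_calls = [4.0, 95.0, 3.0]

suspicious = flag_deviations(past_calls, todays_calls)
print(suspicious)
```

A real call model would combine several attributes (destination, duration, time of day or week) rather than duration alone; a single attribute is used here only to keep the sketch short.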
Health Care
  Storing patients’ records in electronic format, developments in medical information systems
  Large amount of clinical data
  Regularities, trends and surprising events extracted by data mining methods
    ANN, temporal reasoning
  Assist clinicians in making informed decisions and improving health services
  MERCK-MEDCO Managed Care, a pharmaceutical insurance … company
    Uncover less expensive but equally effective drug treatments

Financial Data Analysis
  Financial data: complete, reliable, high quality
  Loan payment prediction and customer credit policy analysis

Loan payment prediction and customer credit policy analysis
  Factors influencing loan payment performance
    loan-to-value ratio
    term of the loan
    debt ratio (total monthly debt / total monthly income)
    payment-to-income ratio
    income level
    education level
    residence region
    credit history
  Analysis may find that the payment-to-income ratio is a dominant factor while education level and debt ratio are not

Risk Management and Insurance
  determine insurance rates
  manage investment portfolios
  differentiate between companies and/or individuals who are good and poor credit risks
  Farmer’s Group discovered a scenario: someone who owns a sports car is not a higher accident risk
    Conditions: the sports car is a second car and the family car is a station wagon or a sedan

Data Mining for the Telecommunication Industry
  Telecommunication data are multidimensional
    calling time, duration, location of caller, location of callee, type of call
  used to identify and compare
    data traffic, system workload, resource usage, user group behavior, profit
  fraudulent pattern analysis and identification of unusual patterns
  to achieve customer loyalty
    characteristics of customers affecting line usage

Other Applications
  Sports and gaming
    Predicting the outcome of football games
  Text mining
    Spam detection
  Internet: Web mining
    Web usage mining: improve link structure, recommender systems
    Web structure mining: mining the link structure of the Web

Other Applications
  Educational data mining
    Clustering students
    Design entrance exams, selection policies
  Human resources
    How to select applicants
  Online dating
    Recommendations to visitors

Summary
  Data mining: discovering interesting patterns from large amounts of data
  A natural evolution of database technology, in great demand, with wide applications
  A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
  Mining can be performed in a variety of information repositories
  Data mining functionalities: characterization, discrimination, association, classification, clustering, ...
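As a closing illustration of one functionality from the summary, association, the sketch below computes the two objective interestingness measures, support and confidence, for the rule buy(computer) => buy(printer) on the five transactions of the Transactional Database slide. The helper names `support` and `confidence` are illustrative, and item names are lower-cased with the slide's spelling corrected. On this toy table the values differ from the 1% and 60% quoted for the rule in the slides, which are illustrative figures rather than counts from the five rows.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs): of the transactions containing lhs, the share that also contain rhs."""
    lhs, rhs = set(lhs), set(rhs)
    lhs_count = sum(1 for t in transactions if lhs <= t)
    if lhs_count == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both_count = sum(1 for t in transactions if (lhs | rhs) <= t)
    return both_count / lhs_count


# The five transactions from the Transactional Database slide (IDs 10001-10005).
transactions = [
    {"computer", "cd", "printer"},
    {"plotter", "monitor", "mouse"},
    {"computer", "dvd player"},
    {"printer"},
    {"plotter", "ups", "modem"},
]

rule_support = support(transactions, {"computer", "printer"})
rule_confidence = confidence(transactions, {"computer"}, {"printer"})
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

One of the five transactions contains both items, and one of the two transactions containing a computer also contains a printer, so the rule would pass a minimum confidence threshold of 50% only if the threshold is inclusive.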