1BK40 Business Analytics & Decision Support
Lecture 1, 2017 – 2018: Introduction
Dr. M. Firat, Pav. D06, m.firat@tue.nl

Outline
• Motivation
• Course organization
• Data-driven decision making: data mining and analytics (DMA)
• Multi-attribute decisions

Why Business Analytics and Decision Support?

Business decisions
• Almost all activities in running a business involve decision making: recognizing the state of the market, selecting the right course of action, planning a strategy
• Decision making is a main task of managers
• Decisions need actionable information
• Decision analysis helps in dealing with decision problems in a structured way
• However, there is more to decision making, e.g. organizational support (getting people behind decisions)

An example business problem (image: www.flickr.com/photos/yourdon)
• TelCo, a major telecommunications firm, wants to investigate its problem with customer attrition, or "churn"
• Let's consider this for now as a marketing problem only
• How would you go about targeting some customers with a special offer, prior to contract expiration? Think about what data should be available for your use.

Another example (image: flickr.com/photos/alaig)
• Company, a major producer of semi-conductors, wants to hire a new sales manager
• How would you select your new recruit? How does this decision differ from the previous one? Think about the data available, but also about the decision goals.

Differences between the two examples
• Amount of available data
• Type, source, and quality of data
• Amount and type of uncertainty
• Number of stakeholders
• Number of goals
• Number of decision moments
• Etc.

Overall objective
• How to support structured decision making in a business setting, given
  • a data-rich environment?
  • a data-poor environment?
(Data-rich environment: data science; data-poor environment: decision science)

Course Organization

Course goals
• Discuss the properties of modern analytics and decision support systems for businesses
• List several analytics and decision making methods
• Distinguish between different analytics functions
• Analyze data by using data science methods
• Apply data science techniques for improved decision support
• Analyze and solve discrete choice business problems

Lecturers
• Prof. dr. ir. Uzay Kaymak (responsible lecturer), Pav. D.02, u.kaymak@tue.nl
• Dr. M. Firat (lecturer), Pav. D.06, m.firat@tue.nl
• Information from secretariat IEIS, Information Systems: is@tue.nl

Meetings
• 16 sessions (2 x 2 hrs./week, for 8 weeks)
  • Wednesday 15:45-17:30, Auditorium 8
  • Friday 10:45-12:30, LUNA 1.050
• Lectures introduce and explain the main subjects, with embedded practice sessions and instructions (Matlab)
• Two guest lectures (from industry or academia); the content of the guest lectures is part of the mandatory material
• Q&A sessions at the end of the quartile
• Further questions can be asked by e-mail or during separate meetings upon appointment

Planning – 1 (lecture | week | date | topic)
1 | 1 | 06 Sep.'17 | Introduction to course materials; Introduction: Data-Analytic Thinking
2 | 1 | 08 Sep.'17 | Matlab Session 1
3 | 2 | 13 Sep.'17 | Business Problems and Data Science Solutions; Data mining; Introduction to predictive modeling
4–5 | 2–3 | 15 & 20 Sep.'17 | Introduction to Predictive Modeling; Visualizing Segmentations; Fitting a Model to Data; Classification via Mathematical Functions; Regression via Mathematical Functions; Overfitting and Its Avoidance; Evaluating Classifiers; Cross-Validation; Expected Value Analytical Framework
6 | 3 | 22 Sep.'17 | Matlab Session 2
7 | 4 | 27 Sep.'17 | Similarity, Neighbors; Clustering
8 | 4 | 29 Sep.'17 | Visualizing Model Performance; Ranking, Profit Curves; ROC Graphs and AUC; Cumulative Response and Lift Curves; Evidence and Probabilities; Combining Evidence Probabilistically; Applying Bayes' Rule to Data Science
Lecture times: Wed 15:45-17:30, Fri 10:45-12:30. *Some sessions might shift if needed.

Planning – 2 (class | week | date | topic)
9 | 5 | 4 Oct.'17 | Guest Lecture
10 | 5 | 6 Oct.'17 | Matlab Session 3
11 | 6 | 11 Oct.'17 | Introduction to fuzzy systems
12 | 6 | 13 Oct.'17 | Introduction to Decision Support; Decision heuristics; SMART
13 | 7 | 18 Oct.'17 | Fuzzy decision making; Multicriteria decisions
14 | 7 | 20 Oct.'17 | Bayesian decision theory; Analytic hierarchy process
15 | 8 | 25 Oct.'17 | Matlab Session 4
16 | 8 | 27 Oct.'17 | Guest Lecture; Q&A session; Preparation for exam

Course Material (literature and tools)
• One book (mandatory): Data Science for Business (O'Reilly Media)
• Slides and handouts (mandatory), distributed through CANVAS
• Scientific papers, books and web pages (mandatory), announced through CANVAS
• Exercises for the instructions
• Software tools (mandatory): Matlab

MATLAB 2016b
• Download and install Matlab 2016b before the next session (time consuming):
  • Check your Windows version (x86-32 bits vs. x64-64 bits): http://windows.microsoft.com/en-us/windows/which-operating-system
  • Install the Matlab version corresponding to your OS (x86-32 bits vs. x64-64 bits):
https://intranet.tue.nl/en/university/services/ict-services/help-and-support/software-tuedevice/matlab/
• Do not install the Notebook installation or the Notebook installation for Electrical Engineering.
• Recommended: personal installation - Matlab plus the Fuzzy Logic, Statistics, Global Optimization, and Optimization toolboxes (more can be added later).
• Not required: Simulink (& blocksets), builder, coder, compiler, ...
• Alternative: full installation (more HD space; many functions not needed, ...)

Book
• Title: Data Science for Business: What you need to know about data mining and data-analytic thinking
• Authors: Foster Provost and Tom Fawcett
• Publisher: O'Reilly Media
• Edition: 1st edition (August 19, 2013)
• ISBN-10: 1449361323
• ISBN-13: 978-1449361327
• Available as e-book, in print, or as electronic copy (pdf)

Assessment
• Components:
  • Assignment 1 – 25%
    − Deadline Assignment 1a: 29 Sep.'17
    − Deadline Assignment 1b: TBA Oct.'17
  • Assignment 2 – 25%
    − Deadline Assignment 2a: TBA Oct.'17
    − Deadline Assignment 2b: TBA Oct.'17
  • Written exam – 50%
• Assignments will be made in groups of 3
• It is not possible to re-sit assignments
• Assignments are valid only in the current academic year

Exam
• Type: written, open questions, closed book
• Date: 9 Nov.'17
• Time: 09:00 – 12:00
• Re-sit: TBA

Relation to Information Systems Research
• The IS group has four research clusters, among them Business Process Engineering, Smart Mobility, Health Care, and Business Process Intelligence (BPI), in the area of Business Process Management
• Most of the topics of this course are covered by the BPI cluster: http://is.ieis.tue.nl/research/bpi/

TU/e Data Science Center (slide by DSC/e)

Data-driven decision making

Business drivers
Nowadays, many decisions must be automated due to
• Large volumes of data
• Availability of online data, which requires real-time processing and decision making
• Developments in m- and e-business: decisions anywhere, anytime
• Competitive advantage through fast processing
• Optimization of business processes (B.
Gates, Business@thespeedofthought)

A historical note by Bill Gates: Business @ the speed of thought (year 1998 vs. year 2017)

The production of data is staggering
• People produce 400 million tweets daily …
• … and send 3.2 billion likes daily
• They also upload 300 million pictures daily
• Google Voice processes 10 years of spoken text daily
• The UK has 2 million surveillance cameras
• Facebook has 1 billion users
• 800 million users watch 4 billion movies daily
• Medical data doubles every five years
• In 2020 there will be 24 billion internet-connected devices
Source: NRC 08.02.2013 (slide by DSC/e)

The Always-On Society (diagram, slide by DSC/e: connected products, connected apps, connected solutions; sensing and monitoring data, meaningful information, enhanced user experience; Philips Value Platform)

Everywhere Analytics (from Deloitte)

Real-world examples (slide by DSC/e)
• Smart business solutions / process improvement: monitors and analyses events in an organization and proposes business improvement actions.
• Smart power grids: measures, monitors, and manages energy production, transport, and consumption in heterogeneous distributed grids.
• Clinical decision support: provides instant clinical decision support by correlating information from different, uncorrelated sources.

Be part of the customer experience: Customer Analytics

Data as an asset (image: flickr.com/photos/elizabeth_donoghue)
It is not about what you have, but about what you know about what you have.

Data-analytic thinking
• Data, and the capability to extract useful knowledge from data, is a (strategic) asset
• Invest in data: quality, collection, storage (can be costly!)
• Invest in models, skills, and methods to process data
• This combination creates value (e.g. Google, Facebook, Amazon)

Data science
Data science seeks to use all relevant, often complex and hybrid data to effectively tell a story that can be easily understood by non-experts. It does this by integrating techniques and theories from many fields, including statistics, computational intelligence, pattern recognition, machine learning, online algorithms, visualization, security, uncertainty modeling, and high-performance computing, with the goal of developing the fundamental principles that guide the extraction of knowledge from data. (Slide by DSC/e)

Data-driven decisions
• Data science involves principles, processes and techniques for understanding phenomena via the (automated) analysis of data, in order to improve decision making

In today's news (6 Sep.'17)

Where is the added value?
HURRICANE FRANCES was on its way, barreling across the Caribbean, threatening a direct hit on Florida's Atlantic coast. Residents made for higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of their newest data-driven weapons, something that the company calls predictive technology. A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief information officer, pressed her staff to come up with forecasts based on what had happened when Hurricane Charley struck several weeks earlier. Backed by the trillions of bytes' worth of shopper history that is stored in Wal-Mart's data warehouse, she felt that the company could "start predicting what's going to happen, instead of waiting for it to happen," as she put it.
From NY Times '04

Automated text analysis

Key questions to answer (source: SAS)
• "What happened?"
• "Where exactly is the problem?"
• "What if these trends continue?"
• "How many, how often, where?"
• "What's the best that can happen?"
• "What will happen next?"
• "What actions are needed?"
• "Why is this happening?"

Levels of analytics capability (figure, source: SAS): the same questions arranged by increasing analytics capability, from "What happened?" and "How many, how often, where?" up to "What will happen next?" and "What's the best that can happen?"

Tag Cloud (by J. Reed, http://diginomica.com/2013/12/06/data-science-business-book-pull-off/)

Fundamental concepts of data science
• Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages
• From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest
• If you look too hard at a set of data, you will find something, but it might not generalize beyond the data you're looking at
• Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used

Gartner Hype Cycle, 2012 (figure)
Gartner Hype Cycle, 2016 (figure)

Multiple facets of data science (slide by DSC/e): Data Mining; Stochastic Networks; Probability and Statistics; Data-Driven Innovation and Business; Data-Driven Operations Management; Process Mining; Visualization; Internet of Things; Privacy, Security, Ethics, and Governance; Human and Social Analytics; Intelligent Algorithms; Large-Scale Distributed Systems

Multi-attribute decision making
What if there is no information?
Causes for lack of information
• The problem is very new; not much data is available
• Relevant data is not available
• Uncertainty may be too large
• Multiple objectives
• Complex environment
These are typical characteristics of many management decisions.

Management decisions are often complex because they involve:
1. Risk and uncertainty
2. Multiple objectives
3. A complex structure
4. Multiple stakeholders
In this course, we will consider mainly discrete choice problems with a small number of alternatives.

Coping with complexity
Difficult for humans, because
• The human mind has limited information processing capacity and memory (Miller's 7 ± 2)
• To cope with complexity, we tend to simplify problems
• This can lead to inconsistency and biases

The Role of Decision Analysis
• Analysis: 'divide and conquer'
• Defensible rationale: 'audit trail'
• Raised consciousness about issues
• Allows participation: commitment
• Insights: creative thinking
• Guidance on information needs

Two approaches
• Normative (prescriptive) decision making
  • decision as a rational act of choice
  • formal models for rational decision making
  • closely related to optimization theory
• Descriptive decision making
  • decision as a specific information-processing process
  • studies the cognitive processes that lead to decisions
  • focus on how information is processed

Example: decision making in the oil industry
The company owns a concession. Perhaps there is oil, perhaps there is not. Do you sell the land and the exploration rights, or do you drill and develop the field?

Characterized by large uncertainty
• How much oil is there?
• What is recoverable?
  • given current technology
  • depending on technological developments
• Price of oil in the future
• Tax developments
• One-off decision
• Attitude towards risk

Example: site selection
Where to open your next store in the chain?
Small number of alternatives, possibly multiple criteria. (image: flickr.com/photos/stadsarchiefbreda/)

Characterized by multiple criteria
• Many stakeholders with diverging goals
• Multiple criteria (many attributes)
• Different importance of attributes
• Trade-offs amongst attributes known only partially
• Large uncertainty from the environment

Decision making methods
• Bayes decision making
• Decision heuristics
• Simple multi-attribute rating technique (SMART)
• Fuzzy decision making

Matlab basics: variables and matrices

x=3             %declaring a variable - this is a comment!
y=5;            %another variable; note the semicolon suppresses output
%h=x+y          %commented out: nothing happens
z=x+y           %calculation

a_1=1           %1x1 matrix
b_12=[2 3]      %1x2 matrix
c_21=[2; 3]     %2x1 matrix
d=3i            %complex number (i and j denote the imaginary unit)
F=[1 2; 3 4]    %2x2 matrix
c_21'           %transpose
transpose(c_21) %same as above

Operators: +, -, *, /, ^, .*, ./, .^

a_1*b_12        %this is fine
a_1*c_21        %also fine
b_12*c_21       %works (1x2 times 2x1)
b_12.*c_21'     %different from above; note the .* element-by-element multiplication
c_21*b_12       %works (2x1 times 1x2)
b_12*F          %works
c_21*F          %does not work (inner dimensions disagree)
F^2
b_12.^2         %element-wise calculation
v=0:0.5:10      %range from 0 to 10 with step 0.5

Built-in functions, e.g. max:
A = [1 3 5];
max(A)

Getting help: help <command>, doc <command>
help max
doc max
help log
doc log         %or click the hyperlink at the bottom

Number display format:
help format
log(10)
format long
log(10)
format          %reverts back to default
log(10)

Indexing:
v = [16 5 9 4 2 11 7 14];
v(3)            %extract the 3rd element
v([1 5 6])      %extract the 1st, 5th, 6th elements
v(3:7)          %extract the 3rd through 7th elements
v(5:end)        %extract from the 5th till the last element
A = magic(4)    %see help magic
A(2,4)          %row 2, column 4
A(2:4,1:2)      %rows 2 to 4, columns 1 to 2
A(3,:)          %extract the third row

%Logical indexing
v(logical([1 0 0 0 1 1]))  %same as v([1 5 6]); note the cast to logical
v<10
v(v<10)

Plotting:
help plot       %as usual
p4plots=0:0.05:1
p4p=p4plots.^2; %example function of p(+)
plot(p4p)       %simple plot (element index on the x-axis)
figure,plot(p4plots,p4p,'r')  %x,y
hold on
plot(p4plots,p4plots.^3,'--kx','MarkerSize',10)
hold off
xlabel('p(+)')
ylabel('function values')
title('Example of plotting as a function of p(+)')

Further Matlab tutorials:
http://www.cyclismo.org/tutorial/matlab/
https://www.mccormick.northwestern.edu/documents/students/undergraduate/introduction-to-matlab.pdf

1BK40 Business Analytics & Decision Support
Session 3: Business Problems and Data Science Solutions. Introduction to Predictive Modeling.
Dr. M. Firat, Pav. D06, m.firat@tue.nl
September 13, 2017

Notifications
• Keep up to date with the lecture material on Canvas. Digital age: lecture handouts may be changed (suggestions, corrections, additions, ...).
• Problems, comments, suggestions: email m.firat@tue.nl with the subject "[1BK40] <subject>"; correctly addressed emails will be replied to ASAP.
• Assignment 1a will be posted on Canvas after Session 3:
  • a simple problem to be solved by hand (practice for the exam);
  • a real-world problem using Matlab, to be solved as a group.
• Slides serve as reference (hence a bit verbose).

Outline Today
• Introduction
• From Business Problems to Data Mining Tasks: common types of data mining tasks; supervised vs. unsupervised methods
• Data Mining Related Analytics Techniques and Technologies
• Introduction to Predictive Modeling: attribute selection
• Closing remarks
Fundamental concepts: a set of canonical data mining (DM) tasks; the DM process; supervised vs. unsupervised DM; identifying informative attributes.

Data Science and Data Mining
Data science principle: Data mining is a process with well-understood stages and well-defined subtasks.
• Data mining involves
  • information technology: discovering and evaluating patterns in data;
  • the data analyst: creativity, business knowledge, and common sense.
• Structured data mining projects are
  • conducted by systematic analysis,
  • not driven by chance and individual good judgment.

Answering Business Questions with DM & Related Techniques
In data analysis, common questions are
• "Who are the most profitable customers?"
• "Is there really a difference between the profitable customers and the average ones?"
• "Can we characterize the profitable customers, to have an idea of who they really are?"
• "Will a given new customer be profitable? How much revenue should I expect this customer to generate?"

From Business Problems to Data Mining Tasks
• Every (data-driven) business (decision-making) problem is unique, with its own goals, desires, and constraints.
  • Problems have their specifics even if they belong to the same case (churn example: company MegaTelCo of Lecture 1 vs. another, similar company);
  • However, there are common subtasks that underlie the business problems (example: estimating a given probability from historical data).
• Data science aims to decompose a data analytics problem into sub-problems,
  • each of which is a known task with available tools,
  • which prevents wasting time and resources, i.e. reinventing the wheel,
  • and which allows people to focus on the parts requiring human involvement.
• For every data mining task, there are usually a number of proposed algorithms, so
  • we shall clearly define these tasks to state several fundamental concepts of data science, e.g. classification and regression.
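Since classification and regression recur throughout what follows, the distinction can be previewed in a few lines of code: the two tasks differ only in the type of target variable attached to each individual. A minimal sketch with hypothetical customer records invented for illustration (the course instructions use Matlab; Python is used here only to keep the example self-contained):

```python
# Hypothetical customer records: the same individuals can feed two tasks,
# depending on which target variable we attach to them.
customers = [
    {"usage_hours": 2.0, "churned": True,  "monthly_spend": 15.0},
    {"usage_hours": 9.5, "churned": False, "monthly_spend": 42.0},
    {"usage_hours": 1.2, "churned": True,  "monthly_spend": 12.5},
    {"usage_hours": 7.8, "churned": False, "monthly_spend": 38.0},
]

# Classification target: a categorical (here binary) label per individual.
class_target = [c["churned"] for c in customers]

# Regression target: a numeric value per individual.
reg_target = [c["monthly_spend"] for c in customers]

# A scoring model would instead output, for a new individual, a probability
# per class; here only the churn base rate, as a trivial placeholder.
p_churn = sum(class_target) / len(class_target)
print(class_target)  # [True, False, True, False]
print(p_churn)       # 0.5
```

A classification model would predict the label, a regression model the amount; both are built from the same historical records.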
Data mining task: Classification
• Classification (& class probability estimation) attempts to predict to which of a set of classes a given individual belongs (classes are usually binary and mutually exclusive).
  • "Among all the customers of a cellphone company, which are likely to respond to a given offer?"
• The classification procedure develops a model that determines the class of a new individual.
• A related task is scoring, or class probability estimation.
  • A scoring model outputs, for a new individual, the probability (i.e. score) that s/he belongs to each class.

Data mining task: Regression
• Regression (value estimation) attempts to predict the numerical value of some variable for a given individual.
  • "How much will a given customer use the service?" (predicted variable: service usage)
• A regression model is generated by looking at other individuals in the population.
• (Informal) difference between regression and classification:
  • classification predicts whether something will happen; regression predicts how much.

Data mining task: Similarity matching
• Similarity matching attempts to identify similar individuals based on the data known about them.
  • Finding similar entities - "What companies are similar to our best business customers?"
  • Making product recommendations - "What persons are similar to you in terms of the products they have liked or purchased?"

Data mining task: Clustering
• Clustering attempts to group individuals in a population together by their similarity, but without regard to any specific purpose.
  • Directly - "Do customers form natural groups or segments?": groupings of the individuals of a population.
  • As input to decision-making - "What products should we offer or develop?", "How should our customer care teams (or sales teams) be structured?"
• Useful in preliminary domain exploration.

Data mining task: Co-occurrence grouping
• Co-occurrence grouping attempts to find associations between entities based on transactions involving them.
  • "What items are commonly purchased together?"
• Note: clustering looks at similarity between objects based on the objects' attributes; co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
• Included in recommendation systems (people who bought X also bought Y).
• Result: a description of items that occur together, including statistics on the frequency of the co-occurrence and an estimate of how surprising it is.

Data mining task: Profiling
• Profiling (or behavior description) attempts to characterize the typical behavior of a group or population.
  • "What is the typical cellphone usage of this customer segment?"
• Often used to establish behavioral norms for anomaly detection (e.g. fraud detection).

Other data mining tasks
• Link prediction attempts to predict connections between data items.
  • Social network systems: "Since you and Karen share ten friends, maybe you'd like to be Karen's friend?"
• Data reduction attempts to take a large set of data and replace it with a smaller one containing much of the important information.
  • Trade-off: easier processing vs. loss of information.
• Causal modeling attempts to help us understand what events or actions actually influence others.
  • Targeting advertisements to consumers: "Was the higher purchase rate of targeted consumers because the advertisements influenced them?"
  • There are sophisticated methods for drawing causal conclusions from observational data.

Data mining tasks vs. data analytics problems
• Note that the data analytics problem "recommendation" is used as an example for:
  • similarity matching;
  • co-occurrence grouping;
  • link prediction.
• Recognize the differences and match the correct data mining task to the data analytics problem under study.

Supervised vs. Unsupervised methods
• Consider the following questions:
  1. Q1: "Do our customers naturally fall into different groups?"
  2. Q2: "Can we find customer groups having particularly high likelihoods of canceling their service soon after their contracts expire?"
• Difference between Q1 and Q2?
  • In Q1 there is no specific target, hence unsupervised data mining.
  • In Q2 there exists a specific target, hence supervised data mining.

Models, Induction, and Prediction (figure)

Supervised vs. Unsupervised methods
• Supervised and unsupervised tasks require different techniques.
• Supervised tasks:
  • require (actual) data on the target;
  • involve classification, regression, and causal modeling.
• Unsupervised tasks:
  • cannot provide a guarantee of meaningful or useful results for any particular purpose;
  • involve clustering, co-occurrence grouping, and profiling.
• Problem: it might be useful to know whether a given customer will stay for at least six months, but there is only data for two months.
• Tasks that can be either supervised or unsupervised: similarity matching, link prediction, and data reduction.

Note on classification and regression
• Classification and regression are distinguished based on the type of target:
  • regression involves a numeric target;
  • classification involves a categorical (often binary) target.
• Consider the following questions:
  • "Will this customer purchase service S1 if given incentive I?" - classification
  • "Which service package (S1, S2, or none) will a customer purchase if given incentive I?" - classification
  • "How much will this customer use the service?" - regression
  • "What is the probability that the customer will continue?" - classification (class probability estimation) with a categorical target

Data Mining Process

Data mining and KDD
• Goal of "data mining": mining of patterns and knowledge from data.
• Data mining is often set in the broader context of Knowledge Discovery in Databases (KDD).
• The precise boundaries of the data mining part within the KDD process are not easy to state (fuzzy).
https://nocodewebscraping.com/difference-data-mining-kdd/

CRISP-DM
• An alternative, more industry-driven view of KDD: CRISP-DM (Cross Industry Standard Process for Data Mining).

CRISP: Business Understanding
• Understand the problem to be solved:
  • business projects are rarely clear and unambiguous data mining problems.
• The analyst's creativity plays an important role.
• Designing the solution is an iterative process of discovery:
  • going through the process once without having solved the problem is, generally speaking, not a failure.
• Design team task: thinking carefully about the problem and the use scenario (more on this in future lectures).
Decompose the problem into sub-problems each involving building models for classification, regression, and so on. CRISP: Data Understanding I Data: available raw materials. 40/91 CRISP: Data Understanding I I 41/91 Data: available raw materials. Historical data: often collected for purposes unrelated to the current business problem. CRISP: Data Understanding I I 42/91 Data: available raw materials. Historical data: often collected for purposes unrelated to the current business problem. • Understand the strengths and limitations of the data. There is rarely an exact match with the problem. CRISP: Data Understanding I I Data: available raw materials. Historical data: often collected for purposes unrelated to the current business problem. • I 43/91 Understand the strengths and limitations of the data. There is rarely an exact match with the problem. Estimate costs and benefits of each data source. • Data may be available virtually for free or may require effort to obtain. Is further investment merited? CRISP: Data Understanding I I Data: available raw materials. Historical data: often collected for purposes unrelated to the current business problem. • I Understand the strengths and limitations of the data. There is rarely an exact match with the problem. Estimate costs and benefits of each data source. • I 44/91 Data may be available virtually for free or may require effort to obtain. Is further investment merited? Necessary to uncover the structure of the business problem and the data that are available: • • Credit card fraud: Nearly all fraud is identified and reliably labeled (by the bank or customer). Medicare fraud: Medical providers (legitimate service providers) use the billing system, so submit claims (false?). What exactly should the “correct”’ charges be? No answer, hence no labels. CRISP: Data Understanding I I Data: available raw materials. Historical data: often collected for purposes unrelated to the current business problem. 
- Match the business problem to one or several DM tasks.

CRISP: Data Preparation
- Data must often be manipulated and converted into other forms for better results (time consuming):
  • Convert data to tabular format.
  • Remove or infer missing values.
  • Convert data to different types.
- Match the data to the requirements of the DM techniques.
- Select the relevant variables.
- Normalize or scale numerical variables.
CRISP: Modeling & Evaluation
- Modeling: the primary place where DM techniques are applied to the data (the core part of this course!).
- Evaluation: assess the DM results rigorously.
  • Gain confidence that the results are valid and reliable.
  • Ensure that the model satisfies the original business goals and supports decision making.
  • Includes both quantitative and qualitative assessments.

CRISP: Deployment
- Models are put into real use in order to realize some return on investment:
  • Implement a predictive model in some business process.
  • Example: predict the likelihood of churn in order to send special offers to the customers who are predicted to be particularly at risk.
- Trend: the DM techniques themselves are deployed.
  • Systems automatically build and test models in production.
- Rule discovery: simply use the discovered rules.
- Involve data scientists in the final deployment.
- The process of mining data produces a great deal of insight into the business problem and the difficulties of its solution.

Note on data mining and its use
- Mining the data to find patterns and build models is different from using the results of data mining.

Related Analytics Techniques and Technologies
- Software development:
  • CRISP looks similar to a software development cycle.
  • DM is closer to research (explorative analysis) than to engineering; DM requires skills that may not be common among programmers.
- Statistics:
  • Understand different data distributions.
  • How to use data to test hypotheses.
  • Many DM techniques have their roots in statistics.
- Database querying, data warehousing and OLAP:
  • No discovery of patterns or models; no modeling or automatic pattern finding.
  • Extract the data you need for DM.
  • May be seen as a facilitating technology of DM.

Answering Business Questions with DM & Related Techniques
- Who are the most profitable customers?
  • A straightforward database query, if "profitable" can be defined clearly from the existing data; a standard query tool suffices.
- Is there really a difference between the profitable customers and the average customer?
  • A question about a hypothesis: statistical hypothesis testing is required.
- But who really are these customers? Can I characterize them?
  • Individual customers: database query.
  • Summary statistics.
  • Deeper analysis: determine the characteristics that differentiate the profitable customers from the rest (DM).
- Will a given new customer be profitable? How much revenue should I expect this customer to generate?
  • DM techniques that examine historical customer records and produce predictive models of profitability.

Introduction to Predictive Modeling
- Predictive modeling as supervised segmentation:
  • How to segment the population w.r.t. something that we would like to predict?
  • Which customers are likely to leave the company when their contracts expire?
  • Which potential customers are likely not to pay off their account balances?
- Find important, informative variables (attributes) of the entities w.r.t. a target:
  • Do some variables reduce our uncertainty about the target value?
- Select informative subsets of the attributes in large databases (also for data reduction).
- Tree induction: based on finding informative attributes.
- Information: a quantity that reduces uncertainty.

Models, Induction, and Prediction
- Model: an abstraction of a real-life process or case.
  • It preserves, and sometimes further simplifies, the relevant information.
  • It takes the form of mathematical models or logical rules.
- Predictive model: a formula for estimating the value of the target variable.
  • Examples are classification and regression models.
- Prediction: estimating an unknown value.
  • Examples: credit scoring, spam filtering, fraud detection.
- Descriptive modeling: presenting the main features of the data.
Models, Induction, and Prediction (continued)
- Supervised learning:
  • Create a model describing the relationship between a set of selected variables and a target variable.
  • The model estimates the value of the target variable as a function of the features.
- Induction: generalizing from specific cases to general rules.
- Deduction: combining general rules with specific facts to derive other facts.
- An important question in data mining: how to select the attributes that best divide the sample w.r.t. our target variable?

Supervised Segmentation
- Intuitive segmentation: finding subgroups of the population with different values of the target variable.
- Segmentation
  • is used to predict the target variable;
  • also provides 'human understandable' patterns in the data:
  • "Middle-aged professionals who reside in New York City on average have a churn rate of 5%."
- Important: identify which variables are useful in explaining the target variable.

A simple segmentation problem
- Target variable: whether a person becomes a loan write-off.
- Several attributes in the data:
  • head-shape: square, circular
  • body-shape: rectangular, oval
  • body-color: black, white
- Which attributes are best to segment people into groups of 'write-offs' and 'non-write-offs'?
- Aim for the resulting segments to be as 'pure' as possible.
- Purity: homogeneity of the segments w.r.t. the target variable.

Supervised Segmentation: Purity
- Body-color "black" would create a pure group, were it not for person 2.
- Trade-off: purity of the subsets vs. equal-size subsets.
- How to split the target variable into more than two groups?
- How to create a supervised segmentation using numerical attributes?
- Purity is related to 'entropy' and 'information gain'.

Supervised Segmentation: Entropy
- Entropy: a measure of disorder, i.e. of 'how impure' a segment is.
- Let pi be the relative proportion of property i within the set, where each pi ranges from 0 (none) to 1 (all):

    entropy = − p1 log2(p1) − p2 log2(p2) − ...

- The logarithm in the entropy calculation is generally taken base 2 (always indicate the base clearly in your calculations!).
- Example: consider a set S of 10 people, seven of the non-write-off class and three of the write-off class.
  • We have p_non-write-off = 0.7 and p_write-off = 0.3.
  • Entropy of the whole set:
    entropy(S) = − p_non-write-off log2(p_non-write-off) − p_write-off log2(p_write-off)
               = − 0.7 log2(0.7) − 0.3 log2(0.3)
               ≈ −(0.7 × (−0.51)) − (0.3 × (−1.74))
               ≈ 0.88

Supervised Segmentation: Information Gain
- Using the entropy formula, we want to know
  • how informative an attribute is w.r.t. our target;
  • how much gain in information an attribute brings us.
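As a numerical check of the entropy example above, a minimal sketch (Python for illustration; the course exercises use Matlab):

```python
from math import log2

def entropy(proportions):
    """Entropy (base 2) of a class distribution given as proportions."""
    return -sum(p * log2(p) for p in proportions if p > 0)

# Set S: seven non-write-offs and three write-offs out of 10 people.
h = entropy([0.7, 0.3])
print(round(h, 2))  # -> 0.88
```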
- Information gain
  • measures how much an attribute improves (i.e. decreases) the entropy;
  • shows the change in entropy due to new information;
  • is calculated by splitting the set on all values of a single attribute;
  • compares the purity of the children C = {ci} to that of their parent P:

    IG(P, C) = entropy(P) − [ p(c1) entropy(c1) + p(c2) entropy(c2) + ... ]

  where the entropy of each child ci is weighted by the proportion of instances belonging to that child, p(ci).

Supervised Segmentation: Information Gain
Example: splitting the 'write-off' sample into two segments, based on splitting the Balance attribute (account balance) at 50K.

    Entropy(parent)      = − p(●) log2 p(●) − p(★) log2 p(★) ≈ −0.53 × (−0.9) − 0.47 × (−1.1) ≈ 0.99
    Entropy(left child)  = − p(●) log2 p(●) − p(★) log2 p(★) ≈ −0.92 × (−0.12) − 0.08 × (−3.7) ≈ 0.39
    Entropy(right child) = − p(●) log2 p(●) − p(★) log2 p(★) ≈ −0.24 × (−2.1) − 0.76 × (−0.39) ≈ 0.79

    IG(parent, children) = entropy(parent)
                           − [ p(left child) entropy(left child) + p(right child) entropy(right child) ]
                         ≈ 0.99 − (0.43 × 0.39 + 0.57 × 0.79) ≈ 0.37

- Same example, but a different candidate split: residence.
  • The residence variable does have a positive information gain, but it is lower than that of balance.
  • Homework: check/perform the calculations in the book.

Information gain for numeric attributes
- "Discretize" numeric attributes by split points.
  • How to choose the split points that provide the highest information gain?
- Segmentation for regression problems:
  • Information gain is not the right measure; we need a measure of purity for numeric values.
  • Look at the reduction of VARIANCE (zero variance = 'pure').
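The balance-at-50K calculation above can be reproduced with a short sketch (Python for illustration; the course exercises use Matlab):

```python
from math import log2

def entropy(proportions):
    """Entropy (base 2) of a class distribution given as proportions."""
    return -sum(p * log2(p) for p in proportions if p > 0)

def info_gain(parent, children):
    """Information gain; children is a list of (weight, distribution) pairs."""
    weighted = sum(w * entropy(dist) for w, dist in children)
    return entropy(parent) - weighted

# Balance split at 50K: parent is 53%/47%; the left child holds 43% of the
# instances (92%/8%), the right child 57% (24%/76%).
ig = info_gain([0.53, 0.47],
               [(0.43, [0.92, 0.08]), (0.57, [0.24, 0.76])])
print(round(ig, 2))  # -> 0.37
```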
- To create the best segmentation given a numeric target, choose the split that produces the best weighted average variance reduction.

Questions?

Next Session
- Lecture on 15 Sep. '17, Chapters 3 & 4:
  • Introduction to Predictive Modeling.
  • Visualizing Segmentation.
  • Fitting a Model to Data.
  • Classification via Mathematical Functions.
  • Regression via Mathematical Functions.

1BK40 Business Analytics & Decision Support
Session 4: Introduction to Predictive Modeling. Visualizing Segmentation.
Dr. M. Firat, Pav. D06, m.firat@tue.nl
September 18, 2017

Notifications
- Problems, comments, suggestions:
  • Email m.firat@tue.nl with the subject "[1BK40] <subject>".
  • Correctly addressed emails will be replied to ASAP.
- Assignment 1a will be posted on Canvas after this session (4):
  • a simple problem to be solved by hand (practice for the exam);
  • a real-world problem using Matlab, to be solved as a group.
- The slides serve as reference material (hence a bit verbose).

Outline Today
- Introduction to Predictive Modeling
- Example: Attribute Selection with Information Gain (Session 3)
- Supervised Segmentation with Tree-Structured Models (Session 4)
- Visualizing Segmentations
- Probability Estimation
- Classification via Mathematical Functions
- Regression via Mathematical Functions
- Class Probability Estimation and Logistic "Regression"
- Logistic Regression versus Tree Induction
- Closing remarks

Fundamental concepts: identifying informative attributes; segmenting data by progressive attribute selection.

Example: Entropy (Session 3)

Supervised Segmentation: Information Gain
Example: splitting the 'write-off' sample into two segments, based on splitting the Balance attribute (account balance) at 50K.
    Entropy(parent)      = − p(●) log2 p(●) − p(★) log2 p(★) ≈ −0.53 × (−0.9) − 0.47 × (−1.1) ≈ 0.99
    Entropy(left child)  = − p(●) log2 p(●) − p(★) log2 p(★) ≈ −0.92 × (−0.12) − 0.08 × (−3.7) ≈ 0.39
    Entropy(right child) = − p(●) log2 p(●) − p(★) log2 p(★) ≈ −0.24 × (−2.1) − 0.76 × (−0.39) ≈ 0.79

    IG(parent, children) = entropy(parent)
                           − (p(left child) entropy(left child) + p(right child) entropy(right child))
                         ≈ 0.99 − (0.43 × 0.39 + 0.57 × 0.79) ≈ 0.37

Example: Attribute Selection with Information Gain (Session 3)

Example: edible and poisonous mushrooms
The example data are taken from The Audubon Society Field Guide to North American Mushrooms (https://archive.ics.uci.edu/ml/datasets/Mushroom).
- The data contain 5,644 edible and poisonous mushroom examples (instances).
- Every instance has 22 categorical (non-target) attributes.
- There are 2,156 poisonous (p_ps ≈ 0.38) and 3,488 edible (p_ed ≈ 0.62) mushrooms.
- entropy(parent) = − p_ps log2(p_ps) − p_ed log2(p_ed) ≈ 0.96.

We would like to answer: "Which single attribute is the most useful for distinguishing edible mushrooms from poisonous ones?" Using the concept of information gain, we rephrase the question as: "Which attribute is the most informative?"

(Figures: default entropy for the mushroom data set; entropy vs. the values of GILL-COLOR, SPORE-PRINT-COLOR, and ODOR.)

Supervised Segmentation with Tree-Structured Models (Session 4)
- Select the single variable that gives the most information gain.
  • This yields a very simple segmentation: two segments.
- Single attribute selection alone is not sufficient,
  • so we need multi-attribute selection: a decision tree.
- In a decision tree:
  • the topmost node is the root node;
  • an internal node denotes a test on an attribute;
  • each branch denotes the outcome of a test;
  • at a leaf node no attribute test is conducted;
  • each leaf node holds a segment label.

Decision trees
- Main purpose of creating homogeneous regions:
  • to predict the target variable of a new, unseen instance by determining which segment it falls into.
- Predict: Claudio, with Balance = 115K, Employed = No, and Age = 40.
- Building a tree manually, based on expert knowledge, is
  • time consuming;
  • hard to keep free of redundancy, contradictions, inefficiency, ...
- Building a tree automatically, using induction:
  • recursively partition the instances based on their attributes;
  • easy to understand & relatively efficient.

Building Decision trees
- Recursively find the best attribute to partition the current data set.
- The goal is to partition the current group into subgroups that are as pure as possible w.r.t. the target variable.

(Figures: building decision trees; the final decision tree.)

Visualizing Segmentations
- Each internal (decision) node corresponds to a split of the instance space.
- Each leaf node corresponds to an unsplit region of the space (a segment of the population).

Trees as Sets of Rules
- Classification trees can be interpreted as logical statements.
- If we trace a single path from the root node down to a leaf, collecting the conditions as we go, we generate a rule. Each rule consists of the attribute tests along the path, connected with AND.
- For the previous example, the classification tree is equivalent to this rule set:
  • IF (Balance < 50K) AND (Age < 50) THEN Class = Write-off
  • IF (Balance < 50K) AND (Age ≥ 50) THEN Class = No Write-off
  • IF (Balance ≥ 50K) AND (Age < 45) THEN Class = Write-off
  • IF (Balance ≥ 50K) AND (Age ≥ 45) THEN Class = No Write-off

Visualizing Segmentations - Matlab

% From IrisTreeLogistic.m
yline  = 26.5;  % parallel to y axis
xline1 = 13.5;  % parallel to x axis
xline2 = 17.5;  % parallel to x axis
% Figure + lines
figure, gscatter(x(:,1), x(:,2), f_x)
hold on
% parallel to y - change xline1/xline2 to the desired start/end of the line
plot([yline yline], [xline1 xline2], 'k--')
% parallel to x - change the min/max to the desired start/end of the line
plot([min(x(:,1)) max(x(:,1))], [xline1 xline1], 'k--')
% parallel to x - change the min/max to the desired start/end of the line
plot([min(x(:,1)) max(x(:,1))], [xline2 xline2], 'k--')
hold off
xlabel('pedal width - x1')
ylabel('sepal width - x2')

Visualizing Segmentations - Matlab
- Each internal (decision) node corresponds to a split of the instance space.
- Each leaf node corresponds to an unsplit region of the space (a segment of the population).

Probability estimation tree
- In a decision tree, it is easy to produce a probability estimate instead of a simple classification:
  • a frequency-based estimate of class membership is calculated;
  • e.g. if a leaf node contains n '+' and m '−' instances, then p(+) = n / (n + m).
- This approach may be too optimistic for segments with a very small number of instances (overfitting; next lectures).
  • The Laplace correction moderates the influence of leaves with only a few instances: p(+) = (n + 1) / (n + m + 2).
- Two cases: a leaf with 2 '+' instances and no '−' instances, and another leaf with 20 '+' and no '−' instances.
- The Laplace correction smooths the estimate of the former down to p(+) = 0.75 to reflect this uncertainty, but it has much less effect on the leaf with 20 instances (p(+) ≈ 0.95).
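The two Laplace-correction cases above are easy to verify with a one-line sketch (Python for illustration; the course exercises use Matlab):

```python
def laplace_estimate(n, m):
    """Laplace-corrected p(+) for a leaf with n '+' and m '-' instances."""
    return (n + 1) / (n + m + 2)

print(laplace_estimate(2, 0))             # -> 0.75
print(round(laplace_estimate(20, 0), 2))  # -> 0.95
```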
Example - The Churn Problem
- Predicting which new customers are going to churn, by tree induction:
  • historical data of 20,000 customers who either stayed or left;
  • customers have 10 attributes;
  • calculate the information gain of each attribute.
- The feature with the highest information gain (HOUSE) is at the root of the tree.
- When to stop building the tree? (Next lectures.)
- How do we know that this is a good model? (Next lectures.)

Classification via Mathematical Functions
(Figure: segmentation of the instance space by a classification tree.)
- Plot of the raw data: are there other ways to partition the space?
- Consider separating the classes by a straight line: Age = (−1.5) × Balance + 60.

Linear discriminant functions
- Goal: find a linear model that helps our classification task.
- General linear model: f(x) = w0 + w1 x1 + w2 x2 + ...
- To use this model as a linear discriminant, for a given instance represented by a feature vector x, we check whether f(x) is positive or negative.
- The line in the previous slide, Age = (−1.5) × Balance + 60:
  • The line equation gives the boundary of the segmentation.
  • Classification function using the line:
      class(x) = '+'  if −1.0 × Age − 1.5 × Balance + 60 ≤ 0
      class(x) = '●'  if −1.0 × Age − 1.5 × Balance + 60 > 0

Best linear model?
- Which one to pick?

Support vector machine classifier
- Objective: maximize the margin.

Logistic regression classifier
- Objective: maximize the "likelihood" that all labeled examples belong to the correct class.
- In the iris example, the target variable has two categories of species:
  • filled dots: Iris Setosa; circles: Iris Versicolor.

Regression via Mathematical Functions
- In regression, we fit a linear function: f(x) = w0 + w1 x1 + w2 x2 + ...
- We must decide on the objective function that measures the model's fit to the data.
- The notion of fit:
  • How far away are the estimated values from the true values on the training data?
  • Rephrasing the question: how big is the error of the fitted model?
- Two ways to quantify this objective:
  • the sum of absolute errors;
  • the sum of squared errors.

(Figures: raw data; regression estimator; errors.)

Class Probability Estimation and Logistic "Regression"

Logistic "Regression" - Main ideas
- For probability estimation, logistic regression uses a linear model, just as linear regression does for estimating numerical target values.
- The log-odds is defined as a function of the probability of class membership.
- The output of the logistic regression model is the probability of class membership.
- So logistic regression is often used as a predictive model for estimating the probability of class membership.

Logistic "Regression": Technical details
- For many applications we want to estimate the probability that a new instance belongs to the class of interest.
- Logistic regression is a model giving accurate estimates of the class probability p+(x). We fit a linear model f(x) = w0 + w1 x1 + w2 x2 + ...
  • The log-odds for example x is defined as ln( p+(x) / (1 − p+(x)) ).
  • Equate the log-odds and the linear function: ln( p+(x) / (1 − p+(x)) ) = f(x).
  • Solve for p+(x) to obtain p+(x) = 1 / (1 + exp(−f(x))).

(Figure: plot of the logistic function.)

Logistic "Regression": Objective function
- Ideally, a positive example x+ would have p+(x+) = 1, and a negative example x● would have p+(x●) = 0.
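The log-odds inversion above can be checked numerically: applying the logistic function to a score f and then taking the log-odds recovers f (Python sketch; the score value is arbitrary):

```python
from math import exp, log

def p_plus(f):
    """Class-membership probability from the linear score f(x)."""
    return 1.0 / (1.0 + exp(-f))

f = 1.7                      # an arbitrary linear-model output
p = p_plus(f)
log_odds = log(p / (1 - p))  # should recover f
print(round(log_odds, 6))    # -> 1.7
```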
- Compute the likelihood of a particular labeled example x, given a set of parameters w that produces class-probability estimates p+(x):

      g(x, w) = p+(x)       if x is a '+'
      g(x, w) = 1 − p+(x)   if x is a '●'

- The g function gives the model's estimated probability of seeing x's actual class, given x's features.
- For a particular parameter set w′, the objective value is the sum of the g values across all examples in a labeled data set.
- Maximum likelihood gives the highest probabilities to the positive examples and the lowest probabilities to the negative ones.

Logistic "Regression": Notes
- Logistic regression is a class-probability estimation model, not a regression model.
- Distinguish between the target variable and the probability of class membership.
  • One may be tempted to think that the target variable is a representation of the probability of class membership.
  • This is not consistent with how logistic regression models are used.
  • Example: the probability of responding is p(c responds) = 0.02. Customer c actually responded, but the probability is not 1.0! The customer just happened to respond this time.
- Training data are statistical "draws" from the underlying probabilities, rather than representations of the underlying probabilities themselves.
  • Logistic regression tries to estimate the probabilities with a linear-log-odds model, based on the observed data.

Example: Logistic Regression versus Tree Induction
- Important differences between trees and linear classifiers:
  • A classification tree uses decision boundaries that are perpendicular to the instance-space axes.
  • A linear classifier can use decision boundaries of any direction or orientation.
  • A classification tree is a "piecewise" classifier that segments the instance space recursively, into arbitrarily small regions if necessary.
  • A linear classifier places a single decision surface through the entire space.
- Which of these characteristics is a better match to a given data set?
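The likelihood objective g above can be illustrated with made-up class-probability estimates (Python sketch; the example numbers are hypothetical, and 'o' stands in for the ● class):

```python
def g(p_plus_x, actual_class):
    """Model's estimated probability of the example's actual class."""
    return p_plus_x if actual_class == '+' else 1 - p_plus_x

# Toy labeled set: (estimated p+, actual class).
examples = [(0.9, '+'), (0.8, '+'), (0.3, 'o')]
objective = sum(g(p, c) for p, c in examples)
print(round(objective, 2))  # -> 2.4
```

Maximizing this objective over the weights w pushes p+ toward 1 on the positives and toward 0 on the negatives.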
- Consider the background of the stakeholders:
  • A decision tree may be considerably more understandable to someone without a strong background in statistics.
  • The data mining team does not have the ultimate say in how models are used or implemented!

A simple but realistic example: Wisconsin Breast Cancer Dataset
- Entity in the data set: a cell nuclei image.
- The target variable is the diagnosis, with two categories: benign and malignant (cancerous).
- 10 non-target variables of the cell images are considered.
- From each of these basic variables, three values were computed: the mean, the standard error, and the mean of the three largest values. This resulted in 30 measured attributes in the dataset.
- There are 357 benign images and 212 malignant images.
- Linear equation learned by logistic regression:
  • non-zero weights were found for all 30 measured attributes;
  • performance: six mistakes on the entire dataset, accuracy 98.9%.
- A classification tree was learned from the same dataset:
  • it has 25 nodes, of which 13 are leaf nodes;
  • accuracy: 99.1%.
- Which one is the better model?
- Try changing the Matlab example from Iris to WBC:
  • dataset link in the book; the diagnostic dataset has no missing values: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
  • Difficulties? Next session.

Iris Dataset - Matlab
- Prepare the data:

load iris.dat
% PREPARE DATA
% choose data for versicolor and virginica
data = [iris(51:100,:); iris(101:150,:)];
% choose pedal width and sepal width to explain class
x = [data(:,2) data(:,4)];
f_x = data(:,5);

- Classification tree analysis:

% CLASSIFICATION TREE ANALYSIS
% estimate classification tree
t_iris = fitctree(x, f_x);
% display tree graphically
view(t_iris, 'mode', 'graph')
% display tree as a set of rules
view(t_iris)
% obtain fitted or predicted values of f_x using the tree
pred_tree = predict(t_iris, x);
% fraction of correctly predicted or fitted values
pc_correct_tree = mean(pred_tree == f_x);

- Logistic regression analysis:

% LOGISTIC REGRESSION
% the response has to be categorical for mnrfit: define a new response variable
y1 = categorical(f_x);
% estimate the logistic regression
[b1, dev1, stats1] = mnrfit(x, y1);
% obtain the fitted class probabilities of the logistic regression
probs1 = mnrval(b1, x);
% predicted class: the category with the highest fitted probability
[~, idx] = max(probs1, [], 2);
cats = str2double(categories(y1));
pred_logit1 = cats(idx);
% fraction of correctly predicted values
% (note: compare with f_x, not y1, due to the variable types, double vs. categorical)
pc_correct_logit = mean(pred_logit1 == f_x);

- Visualization of the decision tree boundaries:

% define 3 cut points taken from t_iris.CutPoint, also visible in the tree
yline  = 26.5;  % parallel to y axis
xline1 = 13.5;  % parallel to x axis
xline2 = 17.5;  % parallel to x axis
% figure + lines
figure, gscatter(x(:,1), x(:,2), f_x)
hold on
% parallel to y - change xline1/xline2 to the desired start/end of the line
plot([yline yline], [xline1 xline2], 'k--')
% parallel to x - change the min/max to the desired start/end of the line
plot([min(x(:,1)) max(x(:,1))], [xline1 xline1], 'k--')
plot([min(x(:,1)) max(x(:,1))], [xline2 xline2], 'k--')
hold off
xlabel('pedal width - x1')
ylabel('sepal width - x2')

Questions?

Today
- Chapters 3 & 4.
- Matlab Tutorial 1: http://www.cyclismo.org/tutorial/matlab/
- Matlab Tutorial 2 (comprehensive): https://www.mccormick.northwestern.edu/documents/students/undergraduate/introduction-to-matlab.pdf

Next
- Session 5:
  • Overfitting and its Avoidance.
  • Evaluating Classifiers.
  • Cross-Validation.
  • Expected Value Analytical Framework.
- Session 6:
  • Matlab Session 2.

1BK40 Business Analytics & Decision Support
Session 5: Generalization. Overfitting. Model Evaluation.
Uzay Kaymak, Pav. D02, u.kaymak@tue.nl
September 26, 2017

Outline Today
- Generalization and Overfitting
- Problems in Overfitting
- Overfitting Avoidance and Complexity Control
- Model Evaluation
- Expected Value
- Closing remarks

Fundamental concepts: generalization; fitting and overfitting.

Generalization and Overfitting

Introduction
- We are interested in general patterns, not data-specific ones (that hold by chance).
- Q: Why general patterns?
  • A: They predict well the instances not seen yet.
• Overfitting issue: involving data-specific chance occurrences in the prediction model.
• Example: churn data set; consider an extreme prediction model:
  - A (table) look-up type prediction model.
  - Using historical data, it is 100% accurate (for seen instances!).
  - No ability to predict any unseen instances, hence no generalization.

Generalization and Overfitting
• DM needs to create models that generalize beyond training data.
• Generalization is the property of a model or modeling process whereby the model applies to data that were not used to build the model.
  - If models do not generalize at all, they fit perfectly to the training data → they overfit.
• Overfitting is the tendency of DM procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
  - "If you torture the data long enough, it will confess." (Ronald Coase)
• Note: All DM procedures tend to overfit.
  - There is a trade-off between model complexity and the possibility of overfitting.
  - You should recognize overfitting and manage complexity in a principled way.

Holdout data
• Evaluation on training data provides no assessment of how well the model generalizes to unseen cases.
• Idea: "Hold out" some data for which we know the value of the target variable, but which will not be used to build the model - a "lab test".
• Predict the values of the "holdout data" (aka "test set") with the model and compare them with the hidden true target values → generalization performance.
  - There is likely to be a difference between the model's accuracy ("in-sample") and the model's generalization accuracy.

Fitting Graph
• A fitting graph shows the accuracy of a model as a function of complexity.
• Generally, there will be more overfitting as one allows the model to be more complex.
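The extreme look-up "table model" described above can be made concrete in a few lines. The data here are invented for illustration (a 10×10 feature grid with a noisy one-feature rule), not the churn set from the course:

```python
import random

random.seed(1)

def label(x):
    # True pattern: class 1 when the first feature is large, plus 20% label noise.
    y = int(x[0] >= 5)
    return 1 - y if random.random() < 0.2 else y

# 100 distinct instances on a 10x10 grid, split half/half into train and test.
grid = [(i // 10, i % 10) for i in range(100)]
random.shuffle(grid)
train = [(x, label(x)) for x in grid[:50]]
test = [(x, label(x)) for x in grid[50:]]

# Extreme "table model": memorize every training instance.
table = {x: y for x, y in train}

def table_model(x):
    return table.get(x, 0)  # default class for instances never seen

def rule_model(x):
    return int(x[0] >= 5)   # the simple general pattern

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print("table model, train accuracy:", accuracy(table_model, train))  # always 1.0
print("table model, test accuracy :", accuracy(table_model, test))   # poor
print("rule model, test accuracy  :", accuracy(rule_model, test))    # roughly the noise ceiling
```

The memorizing model is perfect in-sample but falls back to a default for every unseen instance, while the simple rule generalizes to about the 80% ceiling set by the label noise.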
Fitting Graph - Churn example

Overfitting - Tree Induction
• Recall tree induction: recursively find important, predictive individual attributes and split into smaller and smaller data subsets.
  - Eventually, the subsets will be pure - we have found the leaves of our decision tree.
  - The accuracy of this tree will be perfect! This is the same as the table model, i.e. an extreme example of overfitting!
  - Still, this tree should be slightly better than the lookup table, because every previously unseen instance also will arrive at some classification rather than just failing to match.
• Generally: a procedure that grows trees until the leaves are pure tends to overfit.
• If allowed to grow without bound, decision trees can fit any data to arbitrary precision.

Overfitting - Tree Induction
• A fitting graph shows the accuracy of a model as a function of complexity.

Overfitting - Mathematical Functions
• There are different ways to allow more or less complexity in mathematical functions:
  - Add more variables (attributes, features):
    f(x) = w0 + w1 x1 + w2 x2 + w3 x3
    f(x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5
  - Add non-linear variables:
    f(x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x1^2 + w5 x2 x3
• As you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points:
  - Modelers carefully prune the attributes in order to avoid overfitting → manual selection;
  - Automatic feature selection.

Overfitting - Linear functions

Problem in overfitting
• Why does overfitting cause a model to become worse?
  - As a model gets more complex, it is allowed to pick up harmful "spurious" correlations.
  - These correlations do not represent characteristics of the population in general.
  - They become harmful when they produce incorrect generalizations in the model.

Problem in overfitting
• A simple two-class problem:
  - Classes c1 and c2, attributes x and y.
  - An evenly balanced population of examples.
  - x has two values, p and q, and y has two values, r and s.
  - General population: x = p occurs 75% of the time in class c1 examples and in 25% of c2 examples → x provides some prediction of the class.
  - Both of y's values occur in both classes equally → y has no predictive value at all.
• The instances in the domain are difficult to separate, with only x providing some predictive leverage (75% accuracy).

Problem in overfitting
• Let us examine a very small training set of examples from this domain:

  Instance | x | y | Class
  1        | p | r | c1
  2        | p | r | c1
  3        | p | r | c1
  4        | q | s | c1
  5        | p | s | c2
  6        | q | r | c2
  7        | q | s | c2
  8        | q | r | c2

• Note that in this particular dataset y's values of r and s are not evenly split between the classes, so y does seem to provide some predictiveness.
• What would a classification tree do?

Problem in overfitting
• Small training set of examples, assume:
  - A tree learner would split on x and produce a tree (a) with error 25%.
  - In this particular dataset, y's values of r and s are not evenly split between the classes, so y seems to provide some predictiveness.
  - Tree induction would achieve information gain by splitting on y's values and create tree (b).
• Tree (b) fits this training sample better than (a), but generalizes worse:
  - y = r correlates with class c1 in this data sample purely by chance.
  - The extra branch in (b) is not merely extraneous, it is harmful!
  - The spurious y = s branch predicts c2, which is often wrong in the general population (error rate: 30%).

Problem in overfitting
• This phenomenon is not particular to decision trees: overfitting can arise from spurious correlations or because of atypical training data.
• There is no general analytic way to avoid overfitting.

Holdout training and testing
• Cross-validation is a more sophisticated training and testing procedure:
  - Not only a simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, ...).
  - How does the performance vary across data sets?
  - Assessing confidence in the performance estimate.
• Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing:
  - Split a data set into k partitions called folds (k = 5 or 10).
  - Iterate training and testing k times: in each iteration, a different fold is chosen as the test data, and the other k − 1 folds are combined to form the training data.
  - Every example will have been used exactly once for testing and k − 1 times for training.
  - Compute the average and standard deviation over the k folds.

Holdout training and testing

Cross-Validation in the Churn Dataset
• Logistic regression vs classification tree

Learning Curves
• A learning curve is a plot of the generalization performance against the amount of training data.
• Learning curves vs fitting graphs:
  - A learning curve shows the generalization performance - the performance only on testing data - plotted against the amount of training data used.
  - A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity.
  - Fitting graphs generally are shown for a fixed amount of training data.

Learning Curves - Churn Dataset
• The flexibility of tree induction can be an advantage with larger training sets:
  - The tree can represent substantially nonlinear relationships between the features and the target.

Avoiding Overfitting with Tree Induction
• The main problem with tree induction is that it will keep growing the tree to fit the training data until it creates pure leaf nodes.
• Tree induction commonly uses two techniques to avoid overfitting:
  - Stop growing the tree before it gets too complex;
  - Grow the tree until it is too large, then "prune" it back, reducing its size (and thereby its complexity).
• There are various methods for accomplishing both.
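The k-fold procedure described above can be written out directly. This is an illustrative Python sketch (the course materials use Matlab's cvpartition instead), with a trivial majority classifier standing in for the model and invented toy data:

```python
import random
from statistics import mean, stdev

random.seed(0)

# Hypothetical 1-D toy data: class 1 when the feature exceeds 0.5, with 10% label noise.
data = [(random.random(),) for _ in range(100)]
labels = [int(x[0] > 0.5) ^ int(random.random() < 0.1) for x in data]

def kfold_indices(n, k):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.shuffle(idx)
    return [idx[i::k] for i in range(k)]

k = 10
folds = kfold_indices(len(data), k)

scores = []
for f in range(k):
    test_idx = folds[f]                                    # one fold held out for testing
    train_idx = [i for g in range(k) if g != f for i in folds[g]]
    # Placeholder "model": predict the majority class of the training part.
    majority = int(sum(labels[i] for i in train_idx) * 2 >= len(train_idx))
    scores.append(mean(labels[i] == majority for i in test_idx))

# Every example is used exactly once for testing across the k folds.
assert sorted(i for fold in folds for i in fold) == list(range(len(data)))
print("mean accuracy over folds:", round(mean(scores), 3),
      "std over folds:", round(stdev(scores), 3))
```

The point is the mechanics: k disjoint test folds, k train/test iterations, and a mean plus standard deviation over the fold scores.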
• A simple idea for the first technique: limit tree size by specifying a minimum number of instances that must be present in a leaf (or, more generally, use the data at the leaf to make a statistical estimate of the value of the target variable for future cases that would fall to that leaf).
  - Key concern: what threshold should be used? Use a "hypothesis test"!
  - Recall: roughly, a hypothesis test tries to assess whether a difference in some statistic is not due simply to chance.
  - Pay attention to multiple comparisons for the 'best' model - it is a trap!

A General Method for Avoiding Overfitting
• Use tools from your arsenal: cross-validation & fitting graph.
• Note: You can also use nested cross-validation to check different model complexities.

Model Evaluation

What is a good model?
• Case: Wisconsin Breast Cancer Dataset (Session 4).
  - Logistic regression: accuracy 98.9%.
  - Classification tree: accuracy 99.1%.
• Q: Which one is a better model?

What is a good model?
• Data scientists and stakeholders should ask what they want to achieve by mining data:
  - Connect the results of mining data back to the goal of the undertaking.
  - Have a clear understanding of basic concepts.
• The goal: often impossible to measure perfectly (e.g. inadequate systems, too costly data...).
• We consider binary classification problems.

Accuracy: a metric that is very easy to measure
• Accuracy = (Number of correct decisions) / (Total number of decisions made) = (TP + TN) / T = 1 − error rate

Evaluate Classifiers
• Problems: unbalanced classes.
• Confusion matrix: true classes p(ositive) and n(egative), and predicted classes Y(es) and N(o), respectively.
• Do not confuse Bad Positives and Harmless Negatives.

Problems with Unbalanced Classes
• Consider prediction of churn:
  - Training data including 1000 customers, of which 100 churned.
  - What is the base rate accuracy?
  - Majority classifier: always output 'no churn' → accuracy 90% (!)
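The base-rate arithmetic of the majority classifier is easy to check against the accuracy formula above; a minimal sketch (Python for illustration):

```python
def confusion_counts(y_true, y_pred):
    """Counts for a binary confusion matrix (positive class = 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# 1000 customers, 100 churners (class 1); the majority classifier
# always predicts "no churn" (class 0).
y_true = [1] * 100 + [0] * 900
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy)  # 0.9, without identifying a single churner
```

Accuracy = (TP + TN) / T = (0 + 900) / 1000 = 90%, even though every churner is missed.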
Problems with Unbalanced Classes
• Models A and B generate accuracies of 80% and 64%:
  - Model A: evaluated on a balanced data set.
  - Model B: evaluated on a representative data set (1:9 ratio).
  - However, the accuracy of B on a balanced data set is 80%.
• Which one is better?
• Figure: Confusion matrices a) Model A, b) Model B
  - Model A: correctly identifies 60% of negative examples.
  - Model B: correctly identifies 60% of positive examples.

Performance visualization
• Accuracy can be misleading.
• Train on balanced data, but evaluate with representative data.

Unequal costs and benefits
• Key question: how much do we care about the different errors and correct decisions?
  - Classification accuracy makes no distinction between false positive and false negative errors.
  - In real-world applications, different kinds of errors lead to different consequences!
• Example: medical diagnosis
  - Patient is told she has cancer although she does not - false positive: expensive, but not life threatening.
  - Patient has cancer, but she is told that she has not - false negative: more serious.
• Errors should be counted separately - estimate the cost or benefit of each decision.

Generalizing beyond classification
• Measure quality for a regression model:

  E = (1/N) Σ_{i=1}^{N} (yi − ŷi)²

• It is important to consider whether this is a meaningful metric for the problem at hand (the metric must match the problem).
• Many other metrics can be thought of...

A key concept in statistics: Expected Value
• Need to know: all possible outcomes, and their probabilities.
• Weighted average of outcome values w.r.t. their probabilities:

  Expected Value (EV) = p(o1)v(o1) + p(o2)v(o2) + p(o3)v(o3) + ...

  where oi is the i-th outcome, with probability p(oi) and value v(oi).

Expected Value to Frame Classifier Use
• Targeted marketing: 'likely responder' and 'not likely responder'.
• Define the 'value of response':
  - Product price $200 and cost $100; targeting cost $1.
  - Values: vR = $99 and vNR = −$1.
• Given a feature vector x of a customer as input, let pR(x) be the 'estimated' probability of response; then
  - Expected profit = pR(x) vR + (1 − pR(x)) vNR
• Q: Shall we target the customer? Check:
  - pR(x) · $99 + (1 − pR(x)) · (−$1) > 0 ⇒ pR(x) > 0.01

Expected Value to Frame Classifier Evaluation
• Shift our focus: from 'entities' to 'sets of entities'.
• Q: What is the expected benefit per customer using a model?
• To compare one model to another 'in aggregate'.
• Using the confusion matrix:

  exp. profit = p(Y, p) b(Y, p) + p(N, p) b(N, p) + p(N, n) b(N, n) + p(Y, n) b(Y, n)

• Costs and benefits come from business understanding and knowledge.
• This makes it possible to compare two models on different data sets.
• There is an alternative formulation using conditional probabilities.

Example - Matlab

% Cost-benefit matrix
CB = [99 -1; 0 0];
% Confusion matrix
CM = [56 7; 5 42];
% Calculate priors
priors = sum(CM) / sum(CM(:))
% Calculate rates (conditional probabilities)
helpvar = ones(2,1) * sum(CM,1)
rates = CM ./ helpvar
% Calculate expected profit
% EP = p(p) [p(Y|p).b(Y,p) + p(N|p).b(N,p)] + ...
%      p(n) [p(N|n).b(N,n) + p(Y|n).b(Y,n)]
EB = rates .* CB      % multiply rates with benefits (expected benefits)
EP = sum(EB * priors') % multiply with priors: notice the transpose

Other evaluation metrics

Other evaluation metrics
Source: Wikipedia.

Evaluation and Baseline Performance
• Think of a reasonable baseline to compare the performance of our model against.
• Convince people by showing the added value of data mining.
• Q: How to find the appropriate baseline?
  - It depends on the actual application.
• In weather forecasting:
  - Tomorrow (and the next day) will be the same as today.
  - The long-term historical average.
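The expected-profit calculation in the Matlab example above can be cross-checked with the same arithmetic; a Python sketch of the identical computation:

```python
# Cost-benefit and confusion matrices from the Matlab example:
# rows = predicted Y/N, columns = actual p/n.
CB = [[99, -1],
      [ 0,  0]]
CM = [[56,  7],
      [ 5, 42]]

total = sum(sum(row) for row in CM)
col_sums = [CM[0][j] + CM[1][j] for j in range(2)]    # actual-class totals
priors = [c / total for c in col_sums]                # p(p), p(n)
rates = [[CM[i][j] / col_sums[j] for j in range(2)]   # p(Y|p), p(Y|n), p(N|p), p(N|n)
         for i in range(2)]

# expected profit = sum over cells of p(class) * p(prediction | class) * benefit
EP = sum(priors[j] * rates[i][j] * CB[i][j]
         for i in range(2) for j in range(2))
print(round(EP, 2))  # 50.34
```

With these cost-benefit values and this confusion matrix, the expected profit per customer comes out to about $50.34.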
• In classification:
  - Maximizing simple prediction accuracy is not the ultimate goal.
  - Majority classifier: output the majority class of the training data set.

Questions?

Today
• Chapter 5 & 7.

Next week
• Session 7:
  - Learning curves.
  - Similarity, Neighbors, and Clusters.
  - Clustering.
  - Decision Analytic Thinking.

1BK40 Business Analytics & Decision Support
Session 6
Matlab session 2
Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
September 21, 2017
Where innovation starts

Announcements
• Assignment 1b - Release: 27 Sept. / Deadline: 6 Oct.
• Assignment 2a - Release: 6 Oct. / Deadline: 18 Oct. (tentative)
• Assignment 2b - Release: 18 Oct. / Deadline: 27 Oct. (tentative)

Introduction
• Previous lecture:
  - Introduced basic issues of model evaluation and explored the question of what makes for a good model.
• Today:
  - Implementation in Matlab for the Iris data.
  - No help/questions in solving Assignment 1b.
• This lecture contains slides from other lectures.

Cross Validation - Matlab

Note on data mining and its use
• Mining the data to find patterns and build models is different from using the results of data mining.

Holdout training and testing

Load data

load iris.dat
% column information for IRIS data
% 1). sepal length in cm - 2). sepal width in cm
% 3). petal length in cm - 4). petal width in cm
% 5). class: - Iris Setosa, - Iris Versicolour, - Iris Virginica
% PREPARE DATA
% choose data for versicolor and virginica
data = [iris(iris(:,5)==2,:); iris(iris(:,5)==3,:)]
% choose petal width and sepal width to explain 'class'
% x = [data(:,2) data(:,4)];
x = data(:,1:4);
f_x = data(:,5)-2; % to make it zeros and ones
% Set the random seed for reproducibility
% (always obtain the same values).
rng('default')

HOLD-OUT Validation - TREE

cvoCV = cvpartition(f_x,'Holdout',0.2);
% To make it clearer, let's indicate the train and validation set
xTrain = x(cvoCV.training,:);
yTrain = f_x(cvoCV.training);
xVal = x(cvoCV.test,:);
yVal = f_x(cvoCV.test);
% estimate classification tree
t_HO = fitctree(xTrain,yTrain);
% Predict in-sample values - Training
pred_treeTrain = predict(t_HO,xTrain);
% Accuracy of in-sample - Training
tree_HO_AccTrain = mean(pred_treeTrain == yTrain);
% Predict out-of-sample values - Validation (or testing)
pred_treeVal = predict(t_HO,xVal);
% Accuracy of out-of-sample - Validation (or testing)
tree_HO_AccVal = mean(pred_treeVal == yVal);

Cross Validation - TREE

cvo = cvpartition(f_x,'Kfold',10);
% We need to use a for loop to automatically repeat the
% building/checking for each fold...
% or we use the created function cvDecisionTree
[trees_CV,tree_CV_AccTrain,tree_CV_AccVal] = cvDecisionTree(x,f_x);

Logistic "Regression" - Technical details
• How to translate log-odds into the probability of class membership?
  - p+(x) represents the model's estimate of the probability of class membership of a data item described by feature vector x.
  - + is the class for the binary event we are modeling.
  - 1 − p+(x) is the estimated probability of the event not occurring.

  ln( p+(x) / (1 − p+(x)) ) = f(x) = w0 + w1 x1 + w2 x2 + ...

  - Solving for p+(x), we obtain:

  p+(x) = 1 / (1 + exp(−f(x)))

Logistic "Regression" - Technical details

HOLD-OUT Validation - LOGISTIC REGRESSION

cvoCV = cvpartition(f_x,'Holdout',0.2);
% To make it clearer, let's indicate the train and validation set
xTrain = x(cvoCV.training,:);
yTrain = f_x(cvoCV.training);
xVal = x(cvoCV.test,:);
yVal = f_x(cvoCV.test);
[b1,dev1,stats1] = mnrfit(xTrain,categorical(yTrain));
% Predict in-sample values - Training
pr_yA = mnrval(b1,xTrain);
[C,It] = max(pr_yA,[],2);
% It is still 1 or 2; for our example we have classes 0 and 1 and
% want categorical variables to compare with yTrain.
pred_lrTrain = categorical(It-1);
% Accuracy of in-sample - Training
lr_AccTrain = mean(pred_lrTrain == categorical(yTrain));
% Predict out-of-sample values - Validation (or testing)
% TODO

Problems with Unbalanced Classes
• Consider prediction of churn (again):
  - Training population of 1000 customers.
  - Baseline churn rate is 10% (100 customers in 1000 are expected to churn).
  - What is the base rate accuracy?
• Model A: generates an accuracy of 80%.
• Model B: generates an accuracy of 64%.
• Why is there a difference? Which one is better?

Problems with Unbalanced Classes
• Difference:
  - Model A: evaluated on a balanced data set.
  - Model B: evaluated on a representative data set (1:9 ratio).
• Accuracy of both on a balanced data set: 80%. ??
• Confusion matrices:
  - Model A: 40% of the negative class wrong.
  - Model B: 40% of the positive class wrong.

Performance visualization
• Accuracy can be misleading.
• Train on balanced data, but evaluate with representative data.

Visualization - Matlab

Learning Curves - Churn Dataset

Learning curve - TREE

% Note that you obtain the percentages, not the number of data points
[valuesPerc,Acc] = learningCurve(x,f_x,'tree');
% Make the actual plot with x and y axis labels and also a legend.
% TODO - also no solution in Matlab tutorial 2.

Fitting Graph
• A fitting graph shows the accuracy of a model as a function of complexity.
• Generally, there will be more overfitting as one allows the model to be more complex.

Avoiding Overfitting with Tree Induction
• The main problem with tree induction is that it will keep growing the tree to fit the training data until it creates pure leaf nodes.
• Tree induction commonly uses two techniques to avoid overfitting:
  - Stop growing the tree before it gets too complex;
  - Grow the tree until it is too large, then "prune" it back, reducing its size (and thereby its complexity).
• There are various methods for accomplishing both.
• A simple idea for the first technique: limit tree size by specifying a minimum number of instances that must be present in a leaf (or, more generally, use the data at the leaf to make a statistical estimate of the value of the target variable for future cases that would fall to that leaf).
  - Key concern: what threshold should be used? Use a "hypothesis test"!
  - Recall: roughly, a hypothesis test tries to assess whether a difference in some statistic is not due simply to chance.
  - Pay attention to multiple comparisons for the 'best' model - it is a trap!

Example - The Churn Problem

Fitting graph - TREE

[modelC,Acc] = fittingGraph(x,f_x,'tree');
% Make the actual plot with x and y axis labels and also labels.
% TODO - also no solution in Matlab tutorial 2.

Overfitting - Mathematical Functions
• There are different ways to allow more or less complexity in mathematical functions:
  - Add more variables (attributes, features):
    f(x) = w0 + w1 x1 + w2 x2 + w3 x3
    f(x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5
  - Add non-linear variables:
    f(x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x1^2 + w5 x2 x3
• As you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points:
  - Modelers carefully prune the attributes in order to avoid overfitting → manual selection;
  - Automatic feature selection.

Fitting graph - LOGISTIC REGRESSION

% A simple way to increase complexity is to check the correlation
% between each variable and the output. Please note that p-values
% and correlation are related (see:
% http://www.eecs.qmul.ac.uk/~norman/blog_articles/p_values.pdf)
% this is a simplification.
[modelC,Acc] = fittingGraph(x,f_x,'logistic')

Questions?

Today
• Session 6 - Matlab examples.
• Solve Matlab tutorial 2.
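The holdout workflow used throughout this Matlab session (cvpartition with 'Holdout', 0.2, then comparing training and validation accuracy) reduces to a shuffled index split; a Python sketch with invented toy data standing in for the Iris columns:

```python
import random

random.seed(0)

def holdout_split(x, y, test_frac=0.2):
    """Shuffle indices and hold out a fraction of the data for validation."""
    idx = list(range(len(x)))
    random.shuffle(idx)
    n_test = int(round(test_frac * len(x)))
    test, train = idx[:n_test], idx[n_test:]
    return ([x[i] for i in train], [y[i] for i in train],
            [x[i] for i in test], [y[i] for i in test])

# Hypothetical toy features and labels (not the Iris data itself).
x = [(i, i % 7) for i in range(100)]
y = [int(i >= 50) for i in range(100)]

x_train, y_train, x_val, y_val = holdout_split(x, y, 0.2)
print(len(x_train), len(x_val))  # 80 20
```

A model is then fit on the training part only, and its accuracy on the held-out part estimates the generalization performance.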
1BK40 Business Analytics & Decision Support
Session 7
Similarity & Clustering
Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
September 29, 2017
Where innovation starts

Announcements
• Healthy way of reading the book:
  - Not for searching for the content of the assignment question,
  - but for understanding the flow of explanations.
• The session will start with a quick scan of previous concepts.
• Assignment 1b will be released on Friday, after Session 8.

Previous session
• Generalization and overfitting:
  - Holdout training and testing
  - Fitting graph: error vs. model complexity
  - Cross validation: split data into k partitions
  - Learning curve: performance vs. size of training data
• Model evaluation:
  - Confusion matrix
  - Expected value

Outline
Today
Similarity importance in business tasks
Similarity and distance
Nearest neighbor reasoning
Clustering
Hierarchical clustering
k-means clustering
Closing remarks
Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.

Similarity importance in business tasks
• Two things that are similar in some way often share other characteristics as well.
  - Data mining procedures often are based on grouping things by similarity or searching for the "right" sort of similarity.
• Different sorts of business tasks involve reasoning from similar examples:
  - Retrieve similar things directly. Find companies similar to best customers.
  - Classification and regression.
  - Clustering: group similar items together. Customer segmentation.
  - Similarity-based recommendations (people who like X also like Y). Amazon & Netflix.
  - Reasoning from similar cases: case-based reasoning. Law, medicine and AI.
Supervised segmentation
• Can be viewed as grouping the data into groups with a similar property.
• What are the measures of similarity in the following visualization?

Similarity and distance
• Mathematically, similarity is a relation that satisfies a number of conditions (reflexive, symmetric, transitive).
• In a vector space, this is linked to the notion of distance.
• Two objects are more similar the smaller the distance between them is.

Similarity and distance

Similarity and distance
• Consider two instances from our simplified credit application domain.
• Using the general Euclidean distance

  d(A, B) = √( (d1,A − d1,B)² + (d2,A − d2,B)² + ... + (dn,A − dn,B)² )

  we obtain (is this correct?)

  d(A, B) = √( (23 − 40)² + (2 − 10)² + (2 − 1)² ) ≈ 18.8

• Distance is just a number:
  - It has no units and no meaningful interpretation;
  - It is only really useful for comparing the similarity of one pair of instances to that of another pair.

Nearest neighbor reasoning
• Start with an example (understand the fundamental notion).
• Find the whiskey most similar to my favorite according to the attributes:
  - Color; nose; body; palate; finish.
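The credit-application distance worked out above is easy to verify; a minimal sketch (Python for illustration):

```python
from math import sqrt

def euclidean(a, b):
    """General Euclidean distance between two equally long feature vectors."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# The two credit-application instances from the slide.
A = (23, 2, 2)
B = (40, 10, 1)
print(round(euclidean(A, B), 1))  # 18.8
```

sqrt(17² + 8² + 1²) = sqrt(354) ≈ 18.8, matching the slide.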
Nearest neighbors for predictive modeling
• Similarity for predictive modeling:
  - Given a new example (whose target variable we want to predict),
  - scan all the training examples and choose several that are the most similar to the new example;
  - predict the new example's target value based on the nearest neighbors' (known) target values.

Nearest neighbors for predictive modeling
• Defined by:
  - Dataset;
  - Distance function;
  - Number of neighbors (size of neighborhood);
  - Combining function (prediction).

Distance function
• Euclidean distance:

  dEuclidean(X, Y) = ‖X − Y‖₂ = √( (x1 − y1)² + (x2 − y2)² + ... )

• Manhattan distance:

  dManhattan(X, Y) = ‖X − Y‖₁ = |x1 − y1| + |x2 − y2| + ...
dJaccard (X , Y ) = 1 − dcosine (X , Y ) = 1 − �X ∩ Y � �X ∪ Y � X .Y ��X ��2 .��Y ��2 Distance function � 14/52 Manhattan distance (L1-norm) and Euclidean distance (L2-norm) are special cases of Minkovski distance: dMinkovski (X , Y ) = ��X −Y ��q = (�x1 − y1 �q + �x2 − y2 �q + . . . + �xp − yp �q )1�q By User:Psychonaut - Created by User:Psychonaut with XFig, Public Domain, https://commons.wikimedia.org/w/index.php?curid=731390 Nearest neighbor classifier � � Example: credit card marketing problem How many neighbors? How much influence of each neighbor? 15/52 Nearest neighbor classifier � � Example: credit card marketing problem How many neighbors? How much influence of each neighbor? 15/52 Nearest neighbor classifier � � Example: credit card marketing problem How many neighbors? How much influence of each neighbor? 15/52 How Many Neighbors and How Much Influence? � k is a complexity parameter: • • • � The larger the smoother (less complex) the model. Influence of distant observations? 1-NN model? Classification - odd number (to break ties in voting). 16/52 Combining functions � 17/52 Majority scoring Score(c, N) = � [class(y) = c] y∈N � Similarity-moderated classification: Score(c, N) = � w (x, y) × [class(y) = c] w (x, y) = (dist2 (x, y))−1 y∈N � Similarity-moderated scoring: � Similarity moderated regression: � ... p(c�x) = ∑y∈N w (x, y) × [class(y) = c] ∑y∈N w (x, y) p(c�x) = ∑y∈N w (x, y) × t(y ) ∑y∈N w (x, y) Nearest neighbors for predictive modeling � Probability estimation: • • • • • � It is important not just to classify a new example but to estimate its probability (score), because a score gives more information than just a Yes/No decision. Consider again the classification task of deciding whether David will be a responder or not (previous slides). Nearest neighbors (Rachael, John, and Norah) have classes of (No, Yes, Yes). If we score for the Yes class, so that Yes=1 and No=0, we can average these into a score of 2/3 for David. 
  • Recall the discussion of estimating probabilities from small samples for decision trees.
• Regression:
  • Retrieve information from the nearest neighbors, analogous to the majority vote over a target for classification.
  • The nearest neighbors (Rachael, John, and Norah) have incomes of (50, 35, 40).
  • Use the average (≈ 42) or the median (40) of these values as the prediction.
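The combining functions above can be sketched end-to-end (Python rather than the course's Matlab; the neighbors' classes and incomes are taken from the David example, while the distances are made-up numbers, since the slides do not give them):

```python
# k-NN prediction from the three nearest neighbors in the David example.
# Classes (No=0, Yes=1) and incomes come from the slides; the distances are
# hypothetical illustration values.
neighbors = [
    {"name": "Rachael", "cls": 0, "income": 50, "dist": 2.0},  # No
    {"name": "John",    "cls": 1, "income": 35, "dist": 3.0},  # Yes
    {"name": "Norah",   "cls": 1, "income": 40, "dist": 4.0},  # Yes
]

# Plain averaging: the 2/3 score for the Yes class.
score_yes = sum(n["cls"] for n in neighbors) / len(neighbors)
print(round(score_yes, 3))  # -> 0.667

# Similarity-moderated scoring with w(x, y) = 1 / dist(x, y)^2.
w = [1 / n["dist"] ** 2 for n in neighbors]
p_yes = sum(wi * n["cls"] for wi, n in zip(w, neighbors)) / sum(w)

# Similarity-moderated regression on income.
income_hat = sum(wi * n["income"] for wi, n in zip(w, neighbors)) / sum(w)

# Unweighted regression: average (~42) or median (40) of (50, 35, 40).
avg_income = sum(n["income"] for n in neighbors) / len(neighbors)
print(round(avg_income, 1))  # -> 41.7
```

Note how the weighted estimates shift toward Rachael, the closest neighbor, which is exactly the point of similarity moderation.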
Lazy learners vs. non-lazy learners
• Nearest neighbor classifiers are also known as lazy learners:
  • no model is built before evaluation (the learner is "lazy");
  • all computation happens at evaluation time, i.e. when classifying new instances.
• A decision tree is a non-lazy learner, since a model is built before evaluation time. What about logistic regression?

Issues with NN classification
• Intelligibility:
  • Decisions can be explained in terms of the contributions of the neighbors: "The movie Avatar was recommended based on your interest in Spider-Man and The Hobbit."
• Dimensionality:
  • Too many or irrelevant attributes may confuse distance calculations.
  • Curse of dimensionality: feature selection or domain knowledge is needed.
• Computational efficiency:
  • Querying the database for every prediction is expensive.

k-NN classifier - Matlab

% Same data as before, but loaded differently.
% Once again use only 2 features to make it easier to plot.
load fisheriris
X = meas(:,3:4);
Y = species;
figure, gscatter(X(:,1),X(:,2),species); grid
set(legend,'location','best');
xlabel('petal length'); ylabel('petal width')

% Construct a kNN classifier for 5 nearest neighbors.
mdl = fitcknn(X,Y,'NumNeighbors',5)

% Examine the resubstitution loss, which, by default, is the fraction of
% misclassifications from the predictions of mdl.
rloss = resubLoss(mdl)   % about 4% incorrect classification

% Predict the class of a new flower.
flwr = [3.8 1.2];
flwrClass = predict(mdl,flwr)

% Plot the new flower.
line(flwr(1),flwr(2),'marker','x',...
    'color','k','markersize',10,'linewidth',2);

% Find its 5 nearest neighbors and mark them.
[n,d] = knnsearch(X,flwr,'k',5);
line(X(n,1),X(n,2),'color',[.5 .5 .5],'marker','o',...
    'linestyle','none','markersize',10);

Clustering

Supervised vs. unsupervised learning
• Supervised modeling predicts a target feature.
• Unsupervised modeling
  • has no notion of a target variable;
  • searches for some regularity in a dataset.

Clustering
• Clustering uses similarity as the criterion.
• Clustering finds groups of objects such that the objects within a group are similar.
• Purposes of clustering:
  • discovery of overall distribution patterns and structure;
  • data categorization and data compression (reduction).

Clustering approaches
• Hierarchical clustering groups data based on similarity/distance.
• Density-based clustering uses the distribution density of data points.
• Grid-based clustering divides the search space into a finite number of grid elements.

Clustering algorithms
• k-means, k-modes, k-medoids algorithms
• DBSCAN algorithm
• Fuzzy k-means algorithm
• Single and complete linkage algorithms
• STING algorithm
• Possibilistic clustering
• ...
• What is the basic idea behind all of these algorithms?

Hierarchical clustering
• Groups points by their similarity.
• Clusters are merged iteratively until only a single cluster remains.
• Lowest level: the data points themselves.
• Hierarchical clustering uses a distance (linkage) function between clusters.
• Clusters only overlap when one contains (or is contained by) another.
• No pre-determined number of clusters.
• A dendrogram is used to show the hierarchy of the clusters explicitly.

Dendrogram representation
• A convenient graphic to represent the hierarchical sequence of clusterings.
• Basically, it is a tree where
  • each node represents a cluster;
  • each leaf represents a data point;
  • the root node is the cluster of the whole data set;
  • each internal node has two children: the clusters that were merged to form it.
Dendrogram representation (figure)

Example: Tree of life (figure)

Linkage algorithms
Given the data x_k = [x_1k, x_2k, ..., x_nk]^T, k = 1, ..., N:
1. Start with N clusters and compute the N × N matrix of similarities, S(x_i, x_j) = (1 + d(x_i, x_j))^(-1).
2. Step check: determine the most similar clusters i*, j*.
3. Merge these clusters to form a new cluster i' = i* ∪ j*.
4. Delete from the similarity matrix the rows and columns corresponding to i* and j*.
5. Determine the similarity between i' and all other remaining clusters.
6. If the number of clusters is greater than one, go to 'Step check'. Otherwise STOP.
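The six steps above can be sketched directly on a similarity matrix (Python rather than Matlab). The matrix below is the 5-object example from the linkage slides, and the similarity between a merged cluster and another cluster is taken here as the maximum pairwise similarity, which is the rule that reproduces the matrices shown in that worked example:

```python
import itertools

# Pairwise similarities S(i, j) for the 5 objects of the worked example.
S = {
    (1, 2): 0.8,  (1, 3): 0.4,  (1, 4): 0.7,  (1, 5): 0.83,
    (2, 3): 0.3,  (2, 4): 0.85, (2, 5): 0.1,
    (3, 4): 0.2,  (3, 5): 0.9,
    (4, 5): 0.25,
}

def sim(a, b):
    """Similarity between two clusters (sets of object ids): the maximum
    pairwise similarity, matching the example's matrices."""
    return max(S[tuple(sorted((i, j)))] for i in a for j in b)

clusters = [frozenset([i]) for i in range(1, 6)]   # step 1: N singleton clusters
merges = []
while len(clusters) > 1:
    # Step check: find the most similar pair of clusters.
    a, b = max(itertools.combinations(clusters, 2), key=lambda p: sim(*p))
    merges.append((sorted(a | b), sim(a, b)))      # merge i' = i* U j*
    clusters = [c for c in clusters if c not in (a, b)] + [a | b]

for members, s in merges:
    print(members, s)
# -> [3, 5] 0.9, then [2, 4] 0.85, then [1, 3, 5] 0.83, then [1, 2, 3, 4, 5] 0.8
```

This produces the same merge sequence as the slides: {3, 5} first, then {2, 4}, then {1, 3, 5}, then the root.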
Linkage variations
• Single linkage:    d(A, B) = min_{x_i ∈ A, x_j ∈ B} d(x_i, x_j)
• Complete linkage:  d(A, B) = max_{x_i ∈ A, x_j ∈ B} d(x_i, x_j)
• Average linkage:   d(A, B) = (1 / (|A| |B|)) Σ_{x_i ∈ A} Σ_{x_j ∈ B} d(x_i, x_j)

Linkage example: complete linkage

Initial similarity matrix:

            1      2      3      4      5
   1        1
   2        0.8    1
   3        0.4    0.3    1
   4        0.7    0.85   0.2    1
   5        0.83   0.1    0.9    0.25   1

Merge the most similar pair, {3, 5} (similarity 0.9):

            1      2      {3,5}  4
   1        1
   2        0.8    1
   {3,5}    0.83   0.3    1
   4        0.7    0.85   0.25   1

Merge {2, 4} (similarity 0.85):

            1      {2,4}  {3,5}
   1        1
   {2,4}    0.8    1
   {3,5}    0.83   0.3    1

Merge 1 with {3, 5} to get {1, 3, 5} (similarity 0.83):

            {2,4}  {1,3,5}
   {2,4}    1
   {1,3,5}  0.8    1

The final merge joins {2, 4} and {1, 3, 5} at similarity 0.8.

Hierarchical clustering - Matlab

% Same data as before, but loaded differently.
% Once again use only 2 features to make it easier to plot.
load fisheriris
X = meas(:,3:4);
Y = species;
figure, gscatter(X(:,1),X(:,2),species)
set(legend,'location','best')
xlabel('petal length')
ylabel('petal width')

% Hierarchical clustering.
% The pdist function returns the distance information in a vector,
% where each element contains the distance between a pair of objects.
eucD = pdist(X,'euclidean');
% To see the distances as a matrix:
squareform(eucD)

% Once the proximity between objects in the data set has been computed,
% you can determine how objects should be grouped into clusters,
% using the linkage function.
clustTreeEuc = linkage(eucD,'average');

% To visualize the hierarchy of clusters, plot a dendrogram.
figure;
[heucD,nodeseucD] = dendrogram(clustTreeEuc,0);

% Try the cosine distance.
cosD = pdist(X,'cosine');
clustTreeCos = linkage(cosD,'average');
figure;
[hCos,nodesCos] = dendrogram(clustTreeCos,0);

% Determine the quality of each hierarchical clustering using
% the cophenetic correlation.
qEuc = cophenet(clustTreeEuc,eucD)
qCos = cophenet(clustTreeCos,cosD)

k-means clustering
• The most popular centroid-based clustering algorithm.
• The centroids are the arithmetic means of the instances in the clusters.
• Given an initial set of k means, the algorithm alternates between two steps:
  • Assignment step: assign each instance to the cluster whose mean is "nearest".
  • Update step: recompute each mean as the centroid of the observations in its cluster.
• The algorithm has converged when the centroids no longer change.

k-means steps (figures: assignment step, update step)

Cluster prototype (center) evolution (figures: distribution of objects, evolution of cluster centers)

Common distance metrics for k-means (figure)

Cluster validity
• What is a good clustering result?
• Understanding clusters depends on the sort of data and the domain of application.
• The whole point is to understand whether something was discovered. In the whiskey case, Group A:
  • Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa.
  • The best of its class: Laphroaig (Islay), 10 years, 86 points.
  • Average characteristics: full gold; fruity, salty; medium; oily, salty, sherry; dry.

Evaluating cluster validity
• Expert-based:
  • domain knowledge;
  • data exploration;
  • semantics (meaning) of the clusters.
• Cluster validity indices:
  • consider cluster compactness;
  • consider cluster separation;
  • sometimes, also cluster homogeneity.

Compactness and separation (figure)

Example validity index
• Cluster dispersion:

  S_i = (1 / |G_i|) Σ_{x_k ∈ G_i} ||x_k - v_i||_2

• Intercluster distance:

  d_ij = (S_i + S_j) / ||v_i - v_j||_2

• Davies-Bouldin (DB) index:

  DB = (1 / C) Σ_{i=1}^{C} max_{j ≠ i} d_ij

  where C is the number of clusters and v_i is the centroid of cluster G_i.

Determination of the number of clusters with the DB index
• The number of clusters C is a complexity parameter.
• You can plot a "fitting graph" for cluster validity (e.g. DB score versus the number of clusters).
• The elbow in the graph is an indication of the "natural" number of clusters.

Issues in clustering
• Clustering is more exploratory: expert knowledge is needed for interpretation.
• Parameter settings (e.g. a similarity threshold or the number of clusters) are application dependent; a good understanding of the clustering goals is needed to set the parameters correctly.
• Dimensionality is a problem, as distance loses its meaning for very large numbers of variables; feature selection may be imperative.

Questions?
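As a closing worked illustration, the k-means loop and the Davies-Bouldin index defined above can be combined in one short sketch (Python rather than Matlab; the six 2-D points, the initial centers, and k = 2 are made-up illustration data):

```python
import math

# Hypothetical 2-D data: two visually separated blobs of three points each.
points = [(1.0, 1.0), (1.5, 1.8), (1.2, 0.8),
          (8.0, 8.0), (8.5, 8.8), (7.8, 8.2)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(points, k, centers):
    """Plain k-means: alternate assignment and update until the centroids stop moving."""
    while True:
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[i].append(p)
        # Update step: recompute each mean as the centroid of its cluster.
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            for c in clusters
        ]
        if new_centers == centers:        # converged: centroids no longer change
            return clusters, centers
        centers = new_centers

clusters, centers = kmeans(points, 2, centers=[(0.0, 0.0), (9.0, 9.0)])

# Davies-Bouldin index: dispersion S_i, pairwise ratio d_ij, average worst ratio.
S = [sum(dist(p, v) for p in c) / len(c) for c, v in zip(clusters, centers)]
d = lambda i, j: (S[i] + S[j]) / dist(centers[i], centers[j])
DB = sum(max(d(i, j) for j in range(2) if j != i) for i in range(2)) / 2
print(sorted(len(c) for c in clusters))  # -> [3, 3]
print(DB < 0.5)                          # compact, well-separated blobs -> small DB
```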
Session 8
• Chapters 8 and 9:
  • Visualizing Model Performance
  • Ranking, Profit Curves
  • ROC Graphs and AUC
  • Cumulative Response and Lift Curves
  • Evidence and Probabilities
  • Combining Evidence Probabilistically
  • Applying Bayes' Rule to Data Science

1BK40 Business Analytics & Decision Support
Session 8: Visualization; Evidence and Probabilities
Murat Firat, Pav. D06, m.firat@tue.nl
September 29, 2017
Where innovation starts

Announcements
• Completion of the last session: clustering will be discussed.
• Assignment 1b will be released today.
  • First, check your Matlab references when you get stuck.
  • Read the related section of the book before asking anything.
  • Do not try to use me as a Matlab debugger.

Outline
• Visualizing model performance: profit curves, ROC curves, gain charts, lift curves.
• Evidence and probabilities: combining evidence probabilistically; applying Bayes' rule to data science (Bayes' rule, example).
• Fundamental concepts: visualization of model performance; explicit evidence combination with Bayes' rule.

Visualizing Model Performance

Introduction
• Previous lecture: introduced the basic issues of model evaluation and explored the question of what makes a good model.
• Still, stakeholders and even data scientists often want a higher-level, more intuitive view of model performance.
• It is useful to present visualizations rather than just calculations or single numbers.

Disadvantages of expected value
• Good estimates of costs and benefits must be available, but
  • they may be difficult to estimate accurately;
  • the numbers may not be available;
  • the expected value ignores preferences;
  • costs and benefits may be intangible.
• Good estimates of probabilities are needed:
  • estimated probabilities may be biased;
  • sensitivity analysis would be useful.
• The expected value is valid for only a single operating condition (often under stringent assumptions).
• Oftentimes, ranking objects may be sufficient!
Models that rank cases
• Rankers produce continuous output (e.g., in [0, 1]) rather than just + vs. −.
  • Recall from previous lectures: decision trees, logistic regression, etc. can rank cases.
  • Don't "brain-damage" your model by just using some threshold chosen you-don't-know-how.
• Combine the ranking with a threshold to form a classifier.
• Two issues for evaluation:
  • one ranker can define many classifiers, so choose the ranking model;
  • choose a proper threshold (if necessary).

Thresholding a ranking classifier
• Each threshold gives a new classifier.

How to evaluate a ranking classifier?
• Key questions:
  • How do we compare different rankings?
  • How do we choose a proper threshold?
• If we have accurate probability estimates and a well-specified cost-benefit matrix (the basis of the expected value), we can determine the threshold where our expected profit is above a desired level.
• Evaluation methods:
  • profit curves;
  • Receiver Operating Characteristics (ROC) curves;
  • area under the ROC curve (AUC);
  • cumulative response curves (gain charts);
  • lift curves.

Numeric evaluation measures
• Wilcoxon-Mann-Whitney statistic (WMW):
  • the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case;
  • over the entire ranking, a higher WMW score is better.
• Lift = p(+|Y) / p(+): how much better do we do with the model than without?
  • measured for a specific cutoff, e.g. a percentile;
  • key question: what threshold/cutoff is appropriate for your problem?
  • business understanding tells what a good threshold/cutoff is;
  • it is also useful to visualize model performance as the threshold changes.

Profit curves
• 'Profit' shows the expected cumulative profit as more of the ranked population is targeted.
• Critical conditions underlying the profit calculation:
  • Class priors: the proportion of positive and negative instances in the target population, also known as the base rate.
  • The costs and benefits: the expected profit is specifically sensitive to the relative levels of costs and benefits in the different cells of the cost-benefit matrix.
• Profit curves are a good choice for visualizing model performance if both the class priors and the cost-benefit estimates are known and are expected to be stable.
• Otherwise, use a method that can accommodate uncertainty by showing the entire space of performance possibilities: Receiver Operating Characteristics (ROC) curves.
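A profit curve can be computed directly from a ranked test set and a cost-benefit matrix. The following sketch (Python rather than Matlab) uses entirely made-up numbers: a benefit of 99 per true positive and a cost of 1 per targeted negative (false positive):

```python
# Profit curve sketch: expected cumulative profit as the threshold moves
# down a test set ranked by model score. Labels and cost/benefit values
# are hypothetical; 1 = positive (e.g. responder), 0 = negative.
ranked_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # best-scored instance first
b_tp, c_fp = 99, -1                              # benefit per TP, cost per FP

profits = []
tp = fp = 0
for label in ranked_labels:                      # target one more instance each step
    if label == 1:
        tp += 1
    else:
        fp += 1
    profits.append(tp * b_tp + fp * c_fp)

# Targeting the top 30% of the ranking (top 3 of 10) yields profits[2].
print(profits[2])   # -> 197  (2 TPs and 1 FP among the top 3)
best_cut = max(range(len(profits)), key=lambda i: profits[i]) + 1
print(best_cut)     # -> 7    (the top 7 capture all 4 positives)
```

Note that both the curve's shape and the best cutoff change if the class priors or the cost-benefit numbers change, which is exactly the sensitivity the slide warns about.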
ROC space
• A graph of the true positive rate against the false positive rate.
• Depicts the relative trade-offs that a classifier makes between benefits and costs.
• Confusion matrix (columns: actual p / n; rows: predicted Yes / No):

                 actual p           actual n
  Yes            true positives     false positives
  No             false negatives    true negatives
  column total   P                  N

  TPR = TP / P        FPR = FP / N

ROC curve
• Each specific point in ROC space corresponds to a specific confusion matrix.
• Not sensitive to different class distributions (% + and % −).
• The diagonal (dashed) line is the "random classifier".

Constructing a ROC curve
• Walk down the ranking:
  • pass a positive instance, step upward;
  • pass a negative instance, step rightward.
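The step-up/step-right construction, and the AUC that summarizes it, can be sketched as follows (Python; the ranked labels are the same hypothetical test set as in the profit-curve illustration):

```python
# ROC curve from a ranked test set: start at (0, 0); for each instance in
# rank order, step up by 1/P for a positive and right by 1/N for a negative.
ranked_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]  # best-scored instance first
P = sum(ranked_labels)                           # number of positives
N = len(ranked_labels) - P                       # number of negatives

points = [(0.0, 0.0)]                            # (FPR, TPR) pairs
fpr = tpr = 0.0
auc = 0.0
for label in ranked_labels:
    if label == 1:
        tpr += 1 / P                             # step upward
    else:
        fpr += 1 / N                             # step rightward...
        auc += tpr * (1 / N)                     # ...and add the column's area
    points.append((fpr, tpr))

print(round(points[-1][0], 6), round(points[-1][1], 6))  # -> 1.0 1.0
print(round(auc, 3))                                     # -> 0.833
```

The resulting AUC of 5/6 equals the WMW statistic for this ranking: 20 of the 24 positive-negative pairs are ordered correctly, illustrating the equivalence stated on the AUC slide.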
18/47 ROC curves 19/47 Similar to other model performance visualizations � � e.g., isomorphic to lift curves, but... Separates classifier performance from costs, benefits and target class distributions. Area under ROC curve (AUC) � 20/47 Good measure of overall model performance, if single-number metric is needed and target conditions are completely unknown. • • Measures the ranking quality of a model; A fair measure of the quality of probability estimates. � Gives probability that model will rank a positive case higher than a negative case. � AUC is equivalent to Wilcoxon (WMW) statistic (see earlier slide) and Gini coefficient Cumulative response curves (gain charts) Plots TP rate against percentage of population targeted. 21/47 Lift curve � � Plots lift values against percentage of population targeted . It is essentially the value of gain chart divided by the value of the random model in the gain chart. 22/47 Lift & cumulative response curves � � 23/47 More intuitive for certain business apps. (e.g., targeted marketing). Caveat: assume target class priors (relative %) same as in test set (assignment 1b). When to use which model? � The cut-off point (based on business understanding) determines which model to use. � Question: Could you use model A for the top 40% of your population, model B for next 20% and model C for the remainder? � Read Chapter 8 Example: Performance Analytics for Churn Modeling (useful for assignment 2a). 24/47 25/47 Evidence and Probabilities Introduction � Previously: � Now: • • Using data to draw conclusions about some unknown quantity of a data instance. Analyse data instances as evidence for or against different values of the target. 26/47 Targeting Online Consumers With Advertisements � Target online displays to consumers based on webpages they have visited in the past: • • � Run a targeted campaign for, e.g. a luxury hotel. not randomly - obtain more bookings. Define our ad targeting problem more precisely. 
• What will be an instance?
• What will be the target variable?
• What will be the features?
• How will we get the training data?

Targeting Online Consumers With Advertisements (28/47)
• Target variable: will the consumer book a hotel room within one week after having seen the advertisement?
• Cookies allow for observing which consumers book rooms.
• A consumer is characterized by the set of websites we have observed her to have visited previously (cookies!).
• We assume that some of these websites are more likely to be visited by good prospects for the luxury hotel.
• Problem: we do not have the resources to estimate the evidence potential for each site manually.
• Problem: humans are notoriously bad at estimating the precise strength of the evidence (but quite good at using our knowledge and common sense to recognize whether evidence is likely to be "for" or "against").

Targeting Online Consumers With Advertisements (29/47)
• Idea: use historical data to estimate both the direction and the strength of the evidence.
• Combine the evidence to estimate the resulting likelihood of class membership.

Combining evidence probabilistically - Notation (30/47)
• Interest: quantities such as the probability of a consumer booking a room after being shown an ad.
  • We actually need to be a little more specific:
  • some particular consumer?
  • any consumer?
• Let's call this quantity C.
• Represent the probability of the event C as p(C).
• p(C) = 0.0001 "means" that if we were to show ads randomly to consumers, we would expect about 1 in 10,000 to book rooms.
  • Recall the expected value framework: purchase rates attributable to online advertisements generally seem very small to those outside the industry, but the cost of placing one ad often is quite small as well.
• p(C|E): the probability of C given some evidence E (such as the set of websites visited by a particular consumer).
  • the probability of C given E;
  • the probability of C conditioned on E.

Combining evidence probabilistically (31/47)
• Want to use some labeled data to associate different collections of evidence E with different probabilities.
• Problem (not a small one):
  • For any particular collection of evidence E, there are (probably) not enough cases with exactly that same collection of evidence.
  • Usually you do not observe a particular collection of evidence at all!
  • What is the chance that in our training data we have seen a consumer with exactly the same visiting patterns as a consumer we will see in the future?
  • Infinitesimal (maybe not for Google or Facebook?).
• Solution: consider the different pieces of evidence separately, and then combine the evidence.

Statistical independence (32/47)
• If the events A and B are statistically independent, then we can compute the probability that both A and B occur as p(AB) = p(A)p(B).
• Example: rolling a fair die twice.
  • Event A is "roll #1 shows a six" and event B is "roll #2 shows a six".
  • p(A) = 1/6, p(B) = 1/6 (even if we know that roll #1 shows a six).
  • The events are independent: p(AB) = p(A)p(B) = 1/36.
• The general formula for combining probabilities, which takes care of dependencies between events, is p(AB) = p(A)p(B|A).
  • Given that you know A, what is the probability of B?

Bayes' rule (33/47)
• Note that p(AB) = p(A)p(B|A) = p(B)p(A|B).
• Dividing by p(A):

  p(B|A) = p(A|B)p(B) / p(A)

• Consider B to be some hypothesis of interest and A some observed evidence. Renaming H for hypothesis and E for evidence, we obtain Bayes' rule:

  p(H|E) = p(E|H)p(H) / p(E)

• We can compute the probability of our hypothesis H given some evidence E by instead looking at the probability of the evidence given the hypothesis (the likelihood), as well as the unconditional probabilities of the hypothesis and the evidence.
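Both identities (independence for the two-dice example, and Bayes' rule itself) can be verified numerically by enumerating the 36 equally likely outcomes. A small check in Python:

```python
# Numeric check of the identities above using the two-dice example:
# A = "roll #1 shows a six", B = "roll #2 shows a six". Exact fractions
# avoid any floating-point doubt.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all (roll1, roll2) pairs

def p(event):
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

A = lambda o: o[0] == 6
B = lambda o: o[1] == 6
AB = lambda o: A(o) and B(o)

assert p(AB) == p(A) * p(B) == Fraction(1, 36)    # independence holds
p_B_given_A = p(AB) / p(A)                        # conditional probability
assert p_B_given_A == (p(AB) / p(B)) * p(B) / p(A)  # Bayes' rule rearranged
print(p(AB), p_B_given_A)
```

Here p(B|A) comes out as 1/6, confirming that knowing roll #1 tells us nothing about roll #2.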
Bayes’ rule - Example � Medical diagnosis: Assume you are a doctor and a patient arrives with red spots. • � � 34/47 Hypothesized diagnosis (H = measles), evidence (E = red spots). In order to directly estimate p(measles�red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles. Solution (simpler): • • • p(E �H) is the probability that one has red spots given that one has measles. p(H) is simply the probability that someone has measles, without considering any evidence; that’s just the prevalence of measles in the population. p(E ) is the probability of the evidence: what’s the probability that someone has red spots-again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting. Bayes’ rule - Example � Medical diagnosis: Assume you are a doctor and a patient arrives with red spots. • � � 34/47 Hypothesized diagnosis (H = measles), evidence (E = red spots). In order to directly estimate p(measles�red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles. Solution (simpler): • • • p(E �H) is the probability that one has red spots given that one has measles. p(H) is simply the probability that someone has measles, without considering any evidence; that’s just the prevalence of measles in the population. p(E ) is the probability of the evidence: what’s the probability that someone has red spots-again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting. Bayes’ rule - Example � Medical diagnosis: Assume you are a doctor and a patient arrives with red spots. • � � 34/47 Hypothesized diagnosis (H = measles), evidence (E = red spots). 
In order to directly estimate p(measles�red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles. Solution (simpler): • • • p(E �H) is the probability that one has red spots given that one has measles. p(H) is simply the probability that someone has measles, without considering any evidence; that’s just the prevalence of measles in the population. p(E ) is the probability of the evidence: what’s the probability that someone has red spots-again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting. Bayes’ rule - Example � Medical diagnosis: Assume you are a doctor and a patient arrives with red spots. • � � 34/47 Hypothesized diagnosis (H = measles), evidence (E = red spots). In order to directly estimate p(measles�red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles. Solution (simpler): • • • p(E �H) is the probability that one has red spots given that one has measles. p(H) is simply the probability that someone has measles, without considering any evidence; that’s just the prevalence of measles in the population. p(E ) is the probability of the evidence: what’s the probability that someone has red spots-again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting. Bayes’ rule - Example � Medical diagnosis: Assume you are a doctor and a patient arrives with red spots. • � � 34/47 Hypothesized diagnosis (H = measles), evidence (E = red spots). In order to directly estimate p(measles�red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles. 
Solution (simpler): • • • p(E �H) is the probability that one has red spots given that one has measles. p(H) is simply the probability that someone has measles, without considering any evidence; that’s just the prevalence of measles in the population. p(E ) is the probability of the evidence: what’s the probability that someone has red spots-again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting. Bayes’ rule - Example � Medical diagnosis: Assume you are a doctor and a patient arrives with red spots. • � � 34/47 Hypothesized diagnosis (H = measles), evidence (E = red spots). In order to directly estimate p(measles�red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles. Solution (simpler): • • • p(E �H) is the probability that one has red spots given that one has measles. p(H) is simply the probability that someone has measles, without considering any evidence; that’s just the prevalence of measles in the population. p(E ) is the probability of the evidence: what’s the probability that someone has red spots-again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting. Bayes’ rule - Example � Medical diagnosis: Assume you are a doctor and a patient arrives with red spots. • � � 34/47 Hypothesized diagnosis (H = measles), evidence (E = red spots). In order to directly estimate p(measles�red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles. Solution (simpler): • • • p(E �H) is the probability that one has red spots given that one has measles. 
  • p(H) is simply the probability that someone has measles, without considering any evidence; that's just the prevalence of measles in the population.
  • p(E) is the probability of the evidence: what's the probability that someone has red spots - again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting.

Applying Bayes' Rule to Data Science (35/47)
• Bayes' rule is the basis of "Bayesian" methods.
• Bayes' rule for classification C = c:

  p(C = c|E) = p(E|C = c) p(C = c) / p(E)

• p(C = c|E) is the probability that the target variable C takes on the class of interest c after taking the evidence E (the vector of feature values) into account - the posterior probability.
• p(C = c) is the prior; it can be:
  • a "subjective" prior (the belief of a particular decision maker based on all her knowledge, experience, and opinions);
  • a "prior" belief based on some previous application(s) of Bayes' rule with other evidence;
  • an unconditional probability inferred from data (e.g. the prevalence of c in the population ≈ the percentage of all examples that are of class c).

Applying Bayes' Rule to Data Science (36/47)
• p(E|C = c) is the likelihood of seeing the evidence E when C = c (the percentage of examples of class c that have E).
• p(E) is the likelihood of the evidence (the rate of occurrence of E).
• Estimating these values, we can use p(C = c|E) as an estimate of the class probability. Alternatively, we can use the values as a score to rank instances.

Applying Bayes' Rule to Data Science (37/47)
• Drawback:
  • E is usually a vector of attribute values, so we would require knowledge of the full joint probability p(E|c) = p(e1 ∧ e2 ∧ ... ∧ ek|c) - difficult to measure.
  • We may never see a specific example in the training data that matches a given E in our test data.
  • Make a particular assumption of independence (which may or may not hold)!

Naive Bayes - Conditional Independence (38/47)
• Recall the notion of independence: two events are independent if knowing one does not give you information on the probability of the other.
• Conditional independence is the same notion, using conditional probabilities:

  p(AB|C) = p(A|C) p(B|AC)

  Since A and B are conditionally independent given C:

  p(AB|C) = p(A|C) p(B|C)

• This simplifies the computation of probabilities from data for p(E|c) if the attributes are conditionally independent given the class C = c:

  p(E|c) = p(e1 ∧ e2 ∧ ... ∧ ek|c) = p(e1|c) p(e2|c) ... p(ek|c)

Naive Bayes - Conditional Independence (39/47)
• Naive Bayes classifies a new example by estimating the probability that the example belongs to each class, and reports the class with the highest probability.
• In practice you do not compute p(E), since:

  p(c|E) = p(E|c) / p(E) = p(e1|c) p(e2|c) ... p(ek|c) / p(E)

  • In classification we are interested in: of the different possible classes c, for which one is p(C|E) the greatest (E is the same for all);
  • classes often are mutually exclusive and exhaustive, meaning that every instance will belong to one and only one class. For two classes c0 and c1:

  p(c0|E) = p(e1|c0) p(e2|c0) ... p(ek|c0) / [ p(e1|c0) p(e2|c0) ... p(ek|c0) + p(e1|c1) p(e2|c1) ... p(ek|c1) ]

Advantages and Disadvantages of Naive Bayes (40/47)
• Naive Bayes:
  • is a simple classifier, although it takes all the feature evidence into account;
  • is very efficient in terms of storage space and computational time;
  • performs surprisingly well for classification;
  • is an "incremental learner";
  • is 'naturally' biased.
• Note that the independence assumption does not hurt classification performance very much:
  • To some extent, we double the evidence.
  • It tends to make the probability estimates more extreme in the correct direction.
• Do not use the probability estimates themselves! Ranking is ok!

Example: Naive Bayes classifier (worked-example figures, slides 41-45/47; source: J.F. Ehmke)

Example: Bayes' rule as a decision tree (46/47)
• Bayes' rule: p(A ∩ B) = p(A) ⋅ p(B|A) for two events A and B.
• Denote 'setosa' in the Iris data with i = 1, and 'versicolor' with i = 2; denote 'petal width > 30' with p = 1, and 'petal width ≤ 30' with p = 0.
• Branch probabilities: P(p = 0) = 0.45, P(p = 1) = 0.55; P(i = 1|p = 0) = 0.6182, P(i = 2|p = 0) = 0.3818; P(i = 1|p = 1) = 0.3556, P(i = 2|p = 1) = 0.6444.
• Leaf (joint) probabilities:
  P(i = 1 ∩ p = 0) = 0.45 ⋅ 0.6182
  P(i = 2 ∩ p = 0) = 0.45 ⋅ 0.3818
  P(i = 1 ∩ p = 1) = 0.55 ⋅ 0.3556
  P(i = 2 ∩ p = 1) = 0.55 ⋅ 0.6444

Questions? (47/47)

1BK40 Business Analytics & Decision Support
Session 9: Visualization; Evidence and Probabilities
Murat Firat, Pav. D06, m.firat@tue.nl
October 4, 2017

Announcements (2/47)
• Completion of last session: today (planned) we finish the Session 8 topics.
• Questions about Assignment 1b.
• Some feedback from Assignment 1a grading.
• Guest lecture at DSC/e by Prof. James M. Keller:
  • Place: Filmzaal de Zwarte Doos
  • 15:30-16:00 Welcome and coffee
  • 16:00-17:00 Lecture by Prof. James M. Keller
  • 17:00-18:00 Network drinks

Outline (3/47)
• Today:
  • Visualizing Model Performance: ROC curves, gain charts, lift curve, profit curves.
  • Evidence and Probabilities: combining evidence probabilistically; applying Bayes' rule to data science; Bayes' rule; example.
• Fundamental concepts: visualization of model performance; explicit evidence combination with Bayes' rule.

Visualizing Model Performance (4/47)

Introduction (5/47)
• Previous session:
  • Introduced distance metrics: Manhattan, Euclidean, Minkowski, ...
  • Similarity: nearest neighbors, majority voting.
  • Classification: majority voting (of neighbors).
  • Regression: averaging, median (of neighbors).
  • Clustering: two basic clustering algorithms, hierarchical and k-means.

Disadvantages of expected value (6/47)
• Good estimates of costs and benefits must be available, but they:
  • may be difficult to estimate accurately;
  • are sometimes not available;
  • may be intangible.
• Good estimates of probabilities are needed; estimated probabilities may be biased.
• Valid for only a single operating condition, which raises the need for sensitivity analysis.
• One solution: rank objects!

Models that rank cases (7/47)
• Rankers produce numerical output (e.g., in [0, 1]) rather than classes like + vs. −.
• Decision trees and logistic regression can rank cases.
• Using a threshold generates a number of classifiers.
• Two issues for evaluation: choose the ranking model, and choose a proper threshold.

Thresholding a ranking classifier (8/47)
• Each threshold → a new classifier.

How to evaluate a ranking classifier? (9/47)
• If we have accurate probability estimates and a well-specified cost-benefit matrix, choose the threshold value that keeps expected profit above a desired level.
• Evaluation methods:
  • Receiver Operating Characteristics (ROC) curves
  • Area under the ROC curve (AUC)
  • Cumulative response curves (gain charts)
  • Lift curves
  • Profit curves

ROC space (10/47)
• Graph of tp rate vs. fp rate.
• Depicts: relative trade-offs between benefits and costs.
• Confusion matrix behind each point:

                Predicted
                Yes    No
    Actual p    TP     FN    (P positives)
    Actual n    FP     TN    (N negatives)

  TPR = TP/P,  FPR = FP/N

ROC curve (13/47)
• Each specific point in the ROC space corresponds to a specific confusion matrix.
• Not sensitive to different class distributions (% + and % −).
• The diagonal/dashed line is the "random classifier".

Constructing a ROC curve (14/47)
• Pass a positive instance: step upward.
• Pass a negative instance: step rightward.

ROC curves (15/47)
• Similar to other model performance visualizations, e.g. isomorphic to lift curves, but...
• ROC curves separate classifier performance from costs, benefits and target class distributions.

Area under the ROC curve (AUC) (16/47)
• A good measure of overall model performance if a single-number metric is needed and target conditions are completely unknown:
  • measures the ranking quality of a model;
  • a fair measure of the quality of the probability estimates.
• Gives the probability that the model will rank a positive case higher than a negative case.
• AUC is equivalent to the Wilcoxon-Mann-Whitney (WMW) statistic (see earlier slide) and the Gini coefficient.

Numeric evaluation measures (17/47)
• Wilcoxon-Mann-Whitney statistic (WMW):
  • the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case;
  • over the entire ranking, a higher WMW score is better.
• Lift = p(+|Y)/p(+) → how much better are we with the model than without?
  • Measured for a specific cutoff, e.g. a percentile.
  • Key question: what threshold/cutoff is appropriate for your problem? Business understanding tells what a good threshold/cutoff is.
  • It is also useful to visualize model performance as the threshold changes.

Cumulative response curves (gain charts) (18/47)
• Plot the tp rate against the percentage of the population targeted.

Lift curve (19/47)
• Plots lift values against the percentage of the population targeted.
• It is essentially the value of the gain chart divided by the value of the random model in the gain chart.

Lift & cumulative response curves (20/47)
• More intuitive for certain business applications (e.g., targeted marketing).
• Caveat: they assume target class priors (relative %) are the same as in the test set (assignment 1b).

Profit curves (21/47)
• 'Profit' shows the expected cumulative profit.

Profit curves (22/47)
• Critical conditions underlying the profit calculation:
  • Class priors: the proportion of positive and negative instances in the target population, also known as the base rate.
  • The costs and benefits: the expected profit is specifically sensitive to the relative levels of costs and benefits for the different cells of the cost-benefit matrix.
• Profit curves are a good choice for visualizing model performance if both class priors and cost-benefit estimates are known and are expected to be stable.
• Otherwise, use a method that can accommodate uncertainty by showing the entire space of performance possibilities - Receiver Operating Characteristics (ROC) curves.

When to use which model? (24/47)
• The cut-off point (based on business understanding) determines which model to use.
• Question: could you use model A for the top 40% of your population, model B for the next 20%, and model C for the remainder?
• Read Chapter 8, Example: Performance Analytics for Churn Modeling (useful for assignment 2a).
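The ROC construction (step up on a positive, step right on a negative) and the cumulative-profit idea above can both be sketched directly from a ranked test set. This is a minimal illustration in Python, not the course's Matlab material; the scores and the two-cell cost-benefit values are made up, and score ties are ignored for brevity.

```python
def roc_points(scored):
    """ROC curve from (score, true_label) pairs: walking the ranking
    from highest to lowest score, a positive steps the curve up by
    1/P, a negative steps it right by 1/N."""
    pos = sum(1 for _, y in scored if y == 1)
    neg = len(scored) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(scored, key=lambda s: -s[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def profit_curve(scored, benefit_tp, cost_fp):
    """Cumulative expected profit when targeting the top-k instances:
    each true positive contributes benefit_tp, each false positive
    costs cost_fp (a simplified two-cell cost-benefit matrix)."""
    profit, curve = 0.0, []
    for _, y in sorted(scored, key=lambda s: -s[0]):
        profit += benefit_tp if y == 1 else -cost_fp
        curve.append(profit)
    return curve

ranked = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0)]
print(roc_points(ranked)[-1])       # ends at (1.0, 1.0)
print(profit_curve(ranked, 10, 4))  # [10.0, 20.0, 16.0, 26.0, 22.0]
```

Note how the profit curve peaks before the full population is targeted, which is exactly why the cut-off point matters.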
Evidence and Probabilities (25/47)

Introduction (26/47)
• Main idea: analyze data instances as evidence for or against different values of the target.

Case: targeting online consumers with ads (27/47)
• Target online display ads to consumers based on the webpages they have visited in the past:
  • Run a targeted campaign for, e.g., a luxury hotel.
  • Not randomly - obtain more bookings.
• Define our ad targeting problem more precisely:
  • What will be an instance?
  • What will be the target variable?
  • What will be the features?
  • How will we get the training data?

Targeting online consumers with advertisements (28/47)
• Target variable: will the consumer book a hotel room within one week after having seen the advertisement?
• Cookies allow for observing which consumers book rooms.
• A consumer is characterized by the set of websites we have observed her to have visited previously (cookies!).
• We assume that some of these websites are more likely to be visited by good prospects for the luxury hotel.
• Problem: we do not have the resources to estimate the evidence potential for each site manually.
• Problem: humans are notoriously bad at estimating the precise strength of evidence (but quite good at using our knowledge and common sense to recognize whether evidence is likely to be "for" or "against").

Targeting online consumers with advertisements (29/47)
• Idea: use historical data to estimate both the direction and the strength of the evidence.
• Combine the evidence to estimate the resulting likelihood of class membership.
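A minimal naive Bayes-style sketch of this evidence-combining idea, treating each visited website as a binary feature. Everything here (the site names, the tiny training set) is a hypothetical illustration rather than real campaign data; the classifier itself is the count-based naive Bayes method from the earlier slides.

```python
from collections import defaultdict

def train(examples):
    """Count-based estimates of the prior p(c) and likelihoods p(e_i|c).
    `examples` is a list of (feature_dict, class_label) pairs."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, label in examples:
        class_counts[label] += 1
        for item in features.items():
            feature_counts[label][item] += 1
    return class_counts, feature_counts

def classify(features, class_counts, feature_counts):
    """Score each class with p(c) * prod_i p(e_i|c); report the argmax."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior p(c)
        for item in features.items():
            score *= feature_counts[label][item] / count  # p(e_i|c)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical consumers: which (made-up) sites they visited, and the outcome.
data = 3 * [({"luxhotels.example": 1, "deals.example": 0}, "books")] + \
       3 * [({"luxhotels.example": 0, "deals.example": 1}, "no booking")]
model = train(data)
print(classify({"luxhotels.example": 1, "deals.example": 0}, *model))
```

In practice one would smooth the counts (e.g. Laplace smoothing) so that an unseen feature value does not zero out an entire class score.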
Combining evidence probabilistically - Notation (30/47)
• Interest: quantities such as the probability of a consumer booking a room after being shown an ad.
  • We actually need to be a little more specific: some particular consumer? Any consumer?
• Let's call this quantity C, and represent the probability of the event C as p(C).
• p(C) = 0.0001 'means' that if we were to show ads randomly to consumers, we would expect about 1 in 10,000 to book rooms.
  • Recall the expected value framework: purchase rates attributable to online advertisements generally seem very small to those outside the industry, but the cost of placing one ad often is quite small as well.
• p(C|E) is the probability of C given some evidence E (such as the set of websites visited by a particular consumer):
  • the probability of C given E;
  • the probability of C conditioned on E.
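As a tiny numeric illustration of this notation, Bayes' rule gives p(C|E) from p(E|C), p(C), and p(E). Only the 1-in-10,000 base rate comes from the slides; the other two numbers are invented for illustration.

```python
p_C = 0.0001        # prior p(C): about 1 in 10,000 consumers books a room
p_E_given_C = 0.30  # assumed p(E|C): share of bookers showing evidence E
p_E = 0.002         # assumed p(E): overall rate of this visiting pattern

# Bayes' rule: p(C|E) = p(E|C) * p(C) / p(E)
p_C_given_E = p_E_given_C * p_C / p_E
print(round(p_C_given_E, 4))  # 0.015
```

Even though 0.015 is small in absolute terms, the evidence lifts the booking probability by a factor of 150 over the base rate, which is what makes targeting worthwhile.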
Combining evidence probabilistically (31/47)
• Want to use some labeled data to associate different collections of evidence E with different probabilities.
• Problem (not small):
  • For any particular collection of evidence E, there are (probably) not enough cases with exactly that same collection of evidence.
  • Usually you do not observe a particular collection of evidence at all! What is the chance that our training data contains a consumer with exactly the same visiting pattern as a consumer we will see in the future? Infinitesimal (maybe not for Google or Facebook?).
• Solution: consider the different pieces of evidence separately, and then combine the evidence.

Statistical independence (32/47)

• If the events A and B are statistically independent, then we can compute the probability that both A and B occur as p(AB) = p(A)p(B).
• Example: rolling a fair die twice.
  • A is "roll #1 shows a six" and B is "roll #2 shows a six".
  • p(A) = 1/6, p(B) = 1/6 (even if we know that roll #1 shows a six).
  • The events are independent, so p(AB) = p(A)p(B) = 1/36.
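The independence claim above can be checked by enumerating all 36 equally likely outcomes of two die rolls. This is a minimal sketch in Python (the course's practice sessions use Matlab, but the counting argument is the same):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 equally likely rolls

def prob(event):
    """Exact probability of an event over the 36 outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 6      # roll #1 shows a six
B = lambda o: o[1] == 6      # roll #2 shows a six

assert prob(A) == Fraction(1, 6)
assert prob(B) == Fraction(1, 6)
# Independence: the joint probability factors into the product
assert prob(lambda o: A(o) and B(o)) == prob(A) * prob(B) == Fraction(1, 36)
```

Using `Fraction` keeps the arithmetic exact, so the factorization p(AB) = p(A)p(B) holds as an identity rather than up to rounding.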
• The general formula for combining probabilities, which takes care of dependencies between events, is p(AB) = p(A)p(B|A).
  • p(B|A): given that you know A, what is the probability of B?
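The general formula p(AB) = p(A)p(B|A) matters when events are dependent. The two-ace example below is not from the slides; it is a standard illustration (drawing two cards without replacement, so the second draw depends on the first), checked by enumerating all ordered draws:

```python
from fractions import Fraction
from itertools import permutations

# 52-card deck; 4 cards of rank 0 are the aces. Draw two without replacement.
deck = [(rank, suit) for rank in range(13) for suit in range(4)]
draws = list(permutations(deck, 2))   # all ordered two-card draws (52 * 51)
ACE = 0

n = len(draws)
pA  = Fraction(sum(1 for a, b in draws if a[0] == ACE), n)
pAB = Fraction(sum(1 for a, b in draws if a[0] == ACE and b[0] == ACE), n)
pB_given_A = pAB / pA                 # definition of conditional probability

assert pA == Fraction(4, 52)
assert pB_given_A == Fraction(3, 51)  # one ace gone, 51 cards left
assert pAB == pA * pB_given_A == Fraction(1, 221)
```

Note that pAB ≠ pA · pB here: knowing the first card was an ace lowers the chance for the second, which is exactly what p(B|A) captures.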
Bayes' rule (33/47)

• Note that p(AB) = p(A)p(B|A) = p(B)p(A|B).
• Dividing by p(A): p(B|A) = p(A|B)p(B) / p(A).
• Consider B to be some hypothesis of interest and A some evidence observed.
• Renaming B to H (hypothesis) and A to E (evidence), we obtain Bayes' rule:

  p(H|E) = p(E|H) p(H) / p(E)

• This lets us compute the probability of our hypothesis H given some evidence E by instead looking at the probability of the evidence given the hypothesis, together with the unconditional probabilities of the hypothesis and the evidence.
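Bayes' rule can be verified by direct counting on a small joint distribution. The events below (two die rolls, with a "sum is at least 10" hypothesis) are chosen only for illustration:

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # two fair die rolls

def prob(event, given=None):
    """p(event) or p(event | given) by direct counting over outcomes."""
    pool = outcomes if given is None else [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))

H = lambda o: o[0] + o[1] >= 10   # hypothesis: the sum is at least 10
E = lambda o: o[0] == 6           # evidence: roll #1 shows a six

lhs = prob(H, given=E)                       # posterior, computed directly
rhs = prob(E, given=H) * prob(H) / prob(E)   # posterior via Bayes' rule
assert lhs == rhs == Fraction(1, 2)
```

Both sides count the same outcomes, which is why the identity holds exactly: Bayes' rule is just the chain rule p(AB) = p(A)p(B|A) = p(B)p(A|B) rearranged.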
Bayes' rule - Example (34/47)

• Medical diagnosis: assume you are a doctor and a patient arrives with red spots.
  • Hypothesized diagnosis: H = measles; evidence: E = red spots.
• To estimate p(measles|red spots) directly, we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles.
• Solution (simpler):
  • p(E|H) is the probability that one has red spots given that one has measles.
  • p(H) is simply the probability that someone has measles, without considering any evidence; that is just the prevalence of measles in the population.
  • p(E) is the probability of the evidence: the probability that someone has red spots - again, simply the prevalence of red spots in the population, which requires no complicated reasoning about the different underlying causes, just observation and counting.

Applying Bayes' Rule to Data Science (35/47)

• Bayes' rule is the basis of "Bayesian" methods.
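The measles calculation above can be run with numbers. The prevalence values below are invented for illustration only; they are not medical data:

```python
# Illustrative only: these values are invented, not real prevalence figures.
p_spots_given_measles = 0.90   # p(E|H): most measles cases show red spots
p_measles             = 0.001  # p(H): prevalence of measles in the population
p_spots               = 0.05   # p(E): prevalence of red spots (all causes)

# Bayes' rule: p(H|E) = p(E|H) * p(H) / p(E)
p_measles_given_spots = p_spots_given_measles * p_measles / p_spots
print(f"p(measles | red spots) = {p_measles_given_spots:.3f}")
```

Even with a high p(E|H), the posterior stays small here because the prior p(H) is tiny relative to p(E): the three easily countable quantities on the right-hand side do all the work.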
• Bayes' rule for classification, for a class of interest C = c:

  p(C = c|E) = p(E|C = c) p(C = c) / p(E)

• p(C = c|E) is the probability that the target variable C takes on the class of interest c after taking the evidence E (the vector of feature values) into account - the posterior probability.
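Anticipating the naive independence assumption introduced on the later slides, here is a minimal sketch of Bayes' rule used for classification. The toy website-visit data, the feature names, and the use of Laplace smoothing are all invented for this example:

```python
from collections import defaultdict

# Invented training set: (sites a consumer visited, did they book a room?)
train = [
    ({"travel", "news"},   True),
    ({"travel"},           True),
    ({"news"},             False),
    ({"sports"},           False),
    ({"travel", "sports"}, False),
]
sites = {"travel", "news", "sports"}

def fit(train, sites):
    """Estimate the prior p(c) and the per-feature terms p(site | c)."""
    prior, cond = {}, defaultdict(dict)
    for c in (True, False):
        rows = [s for s, label in train if label == c]
        prior[c] = len(rows) / len(train)
        for site in sites:
            # Laplace smoothing so unseen combinations never get probability 0
            cond[c][site] = (sum(site in s for s in rows) + 1) / (len(rows) + 2)
    return prior, cond

def posterior(evidence, prior, cond, sites):
    """Combine the pieces of evidence separately: multiply per-site terms."""
    score = {}
    for c in (True, False):
        p = prior[c]
        for site in sites:
            p *= cond[c][site] if site in evidence else (1 - cond[c][site])
        score[c] = p
    z = sum(score.values())        # normalize instead of computing p(E)
    return {c: p / z for c, p in score.items()}

prior, cond = fit(train, sites)
post = posterior({"travel"}, prior, cond, sites)
print(post)   # posterior over booking / not booking
```

Normalizing over the class scores sidesteps estimating p(E) directly, which foreshadows the "you do not need to compute p(E)" point made for Naive Bayes below.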
• p(C = c) is the prior probability of the class, which can be:
  • a "subjective" prior (the belief of a particular decision maker based on all her knowledge, experience, and opinions);
  • a "prior" belief based on some previous application(s) of Bayes' rule with other evidence;
  • an unconditional probability inferred from data (e.g., the prevalence of c in the population ≈ the percentage of all examples that are of class c).

Applying Bayes' Rule to Data Science (36/47)

• Bayes' rule for classification:

  p(C = c|E) = p(E|C = c) p(C = c) / p(E)

  • p(C = c|E) is the posterior probability that the target variable C takes on the class of interest c given the evidence E (the vector of feature values).
  • p(E|C = c) is the likelihood of seeing the evidence E when the class is c (the percentage of examples of class c that have E).
  • p(E) is the likelihood of the evidence (the probability of occurrence of E).
• Estimating these values, we can use p(C = c|E) as an estimate of the class probability; alternatively, we can use the values as scores to rank instances.

Applying Bayes' Rule to Data Science (37/47)

• Drawback:
  • E is usually a vector of attribute values, so we would need the full joint probability p(E|c) = p(e1 ∧ e2 ∧ ... ∧ ek|c) - difficult to measure.
  • We may never see a specific example in the training data that matches a given E in our test data.
  • Way out: make a particular assumption of independence (which may or may not hold)!

Naive Bayes - Conditional Independence (38/47)

• Recall the notion of independence: two events are independent if knowing one does not give you information on the probability of the other.
• Conditional independence is the same notion, using conditional probabilities. In general, p(AB|C) = p(A|C)p(B|AC); if A and B are conditionally independent given C, then p(AB|C) = p(A|C)p(B|C).
• This simplifies the computation of p(E|c) from data if the attributes are conditionally independent given the class (writing c for C = c):

  p(E|c) = p(e1 ∧ e2 ∧ ... ∧ ek|c) = p(e1|c) p(e2|c) ... p(ek|c)

Naive Bayes - Conditional Independence (39/47)

• Naive Bayes classifies a new example by estimating the probability that the example belongs to each class and reports the class with the highest probability.
• In practice you do not need to compute p(E), since

  p(c|E) = p(E|c) p(c) / p(E) = p(e1|c) p(e2|c) ... p(ek|c) p(c) / p(E)

  • In classification we only need to know, among the different possible classes c, for which one p(c|E) is greatest, and E (hence p(E)) is the same for all of them.
  • Classes are often mutually exclusive and exhaustive, meaning that every instance belongs to one and only one class, so for two classes c0 and c1:

  p(c0|E) = p(e1|c0) ... p(ek|c0) p(c0) / [p(e1|c0) ... p(ek|c0) p(c0) + p(e1|c1) ... p(ek|c1) p(c1)]

Advantages and Disadvantages of Naive Bayes (40/47)

• Naive Bayes:
  • is a simple classifier, although it takes all the feature evidence into account;
  • is very efficient in terms of storage space and computation time;
  • performs surprisingly well for classification;
  • is an "incremental learner";
  • is 'naturally' biased.
• The independence assumption does not hurt classification performance very much:
  • to some extent, we double-count the evidence;
  • this tends to make the probability estimates more extreme in the correct direction;
  • so do not use the probability estimates themselves - ranking is OK!

Example: Naive Bayes classifier (41/47-45/47)

[Five worked-example slides (Source: J.F. Ehmke); the figures are not reproduced in this extraction.]

Example: Bayes' rule as a decision tree (46/47)

• Bayes' rule: p(A ∩ B) = p(A) · p(B|A) for two events A and B.
• Denote 'setosa' in the Iris data by i = 1 and 'versicolor' by i = 2; denote 'petalwidth > 30' by p = 1 and 'petalwidth ≤ 30' by p = 0.
• The tree first branches on p, then on i given p:
  • P(p = 0) = 0.45, P(p = 1) = 0.55
  • P(i = 1|p = 0) = 0.6182, P(i = 2|p = 0) = 0.3818
  • P(i = 1|p = 1) = 0.3556, P(i = 2|p = 1) = 0.6444
• Leaf (joint) probabilities:
  • P(i = 1 ∩ p = 0) = 0.45 · 0.6182
  • P(i = 2 ∩ p = 0) = 0.45 · 0.3818
  • P(i = 1 ∩ p = 1) = 0.55 · 0.3556
  • P(i = 2 ∩ p = 1) = 0.55 · 0.6444

Questions? (47/47)

1BK40 Business Analytics & Decision Support
Session 11: Introduction to Fuzzy Sets
Uzay Kaymak
Pav. D02, u.kaymak@tue.nl
October 11, 2017
Where innovation starts

Choice of active holidays (2/46)

• Big mountains, stunning views
• Nice trails
• Gnarly sections
• Good chair lifts
• Good emergency service

Announcements (3/46)

• New office hour: Wednesdays 13:00-15:00.
• Trial exam is available in Canvas.
• Assignment 1b: Matlab functions fittingGraph and learningCurve are in Canvas.
• Assignment 1b: submissions should be in the form of m and pdf files; extras should be put in a zip file.
• Reader "Topics in Decision Analysis" is available in Canvas.
• Assignment 2a: release on 12 October at 09:00.

Choice of active holidays (4/46)

• Next series of lectures: more on how to choose a location - decision making (a data-poor environment).
• Today: how can we model linguistic terms (Big, Nice, Gnarly, Good, ...)?
Outline (5/46)

• Introduction
• Fuzzy sets: definition, interpretation, properties, operations
• Closing remarks

Fundamental concepts: fuzzy sets, properties of fuzzy sets, fuzzy logic operations.

Introduction (6/46)

• Consider the following questions:
  • "Among all the customers of a cellphone company, which have a large income?"
  • "Will this customer purchase service S1 if his plan includes a large selection of I?"
  • "How much will this customer use the service?"
  • "What is the typical cellphone usage of this customer segment?"
• What kind of information would you like to have?

Paradoxes (7/46)

• A barber shaves any man who does not shave himself. Who shaves the barber?
• A person is sentenced to death and allowed to make one last statement. If the statement is true, the person will be hanged; if it is false, the person will be shot. The person says "I will be shot." What now?

Sorites (8/46)

• If you remove sand grains from a sand dune one by one, when does the sand dune turn into a sand hill, into a sand pile?
• When is the sky cloudy? How many clouds does it take to make a clear sky not clear?
• c.f. mathematical induction.
• Source: dune, clouds 1, clouds 2, clouds 3.

Graduality (9/46)

• Humans conceptualize the world based on concepts of similarity, gradualness, fuzziness:

"However, it is safe to say that the rapid expansion of electronic transactions constitutes a major opportunity for trade and development: it can be the source of a significant number of success stories by which developing countries and their enterprises can reach new levels of international competitiveness and participate more actively in the emerging global information economy." (From the United Nations Conference on Trade and Development)

Graduality (10/46)

... the high levels of taxation on petroleum products in most consuming countries are greatly amplifying the effects of rises in the price of crude, to the detriment of the consumer.
OPEC expresses the hope, once again, that the governments of these countries will reduce their high taxes on a barrel of oil - which is much more than producers themselves receive - in the interests of market stability. In addition, speculation in the oil market has become a key factor that has distorted realities and has artificially influenced prices far beyond what the fundamentals indicate. (From the Opening Address to the 111th Meeting of the OPEC Conference)

Precision and information content (11/46)

Incompatibility principle (12/46)

"As the complexity of a system increases, our ability to make precise and yet relevant (significant) statements about the system diminishes, until a threshold is reached beyond which precision and relevance (significance) become mutually exclusive characteristics." (Zadeh, 1973)

• What is the perimeter of a table?
• How long are the Dutch borders?
• What is the creditworthiness of a company?
• What is the circumference of a circle?
• Can you describe the motion of a pendulum?

Introduction - Revisited (13/46)

• Consider the following questions:
  • "Among all the customers of a cellphone company, which have a large income?"
  • "Will this customer purchase service S1 if his plan includes a large selection of I?"
  • "How much will this customer use the service?" (little, medium, a lot)
  • "What is the typical cellphone usage of this customer segment?" (little, medium, a lot)
• What kind of information would you like to have?
• In this lecture: a mathematical formulation for these types of imprecise or vague statements.

Fuzzy Sets (14/46)

Crisp sets (15/46)

• A collection of definite, well-definable objects (elements) that form a whole.
Representation of sets: � characteristic function � list of all elements fA ∶ X → {0, 1}, A = {x1 , ..., xn }, xj ∈ X fA (x ) = 1, ⇔ x ∈ A � Elements with property P fA (x ) = 0, ⇔ x ∉ A A = {x �x satisfies P}, x ∈ X � Venn diagram Fuzzy sets � � � � 16/46 Sets with fuzzy, gradual boundaries (Zadeh 1965) A fuzzy set A in X is characterized by its membership function µA : X → [0, 1] A fuzzy set A is completely determined by the set of ordered pairs A = (x , µA (x ))�x ∈ X X is called the domain or universe of discourse Crisp vs. fuzzy sets � � � Integers larger than 3. Families without children. People with job description “manager”. 17/46 � Tall people. � Bold men. � Comfortable car. � Fast cars. � Tall and blond Dutch. Definition � � As a mathematical notion, a fuzzy set F on a finite universe U is unambiguously defined by a membership function uF ∶ U → [0, 1]. The mathematical object representing the fuzzy set is the membership function uF (x ) indicating the grade of membership of element of x ∈ U in F . 18/46 Interpretation � Fuzzy sets are usually related to vagueness. � Fuzzy sets are used to represent three different concepts: • • • • � This vagueness is not defined as uncertainty of meaning but instead as the standard definition of vagueness with the possession of borderline cases. gradualness (original idea of Zadeh-1965) epistemic uncertainty (not discussed in detail) bipolarity (not discussed in detail) Gradualness refers to the idea that many categories in natural language are a matter of degree, including truth. • • The fuzzy set is used as representing some precise gradual entity consisting of a collection of items (sets). The gradualness is indicated through membership. The transition between membership and non-membership is “gradual rather than abrupt” 19/46 Interpretation � Fuzzy sets are usually related to vagueness. 
Gradualness 20/46
• The gradualness can be linked to different situations:
  • Example: a forest zone in a grey-level image. Inherently, the boundary of this zone is gradual (zoom in on the picture). The boundaries of the set are precisely known, but it is not possible to measure (or indicate) them precisely.

Gradualness 21/46
• The gradualness can be linked to different situations:
  • Example: define the boundaries of a forest when the density of trees slowly decreases in the peripheral zones. It is possible to measure each element of the set precisely (e.g. the position of the trees), and the boundaries of the set are known, but a (crisp) definition of its boundaries is not precise.

Gradualness 22/46
• The gradualness can be linked to different situations:
  • Example: define a dense forest zone. The uncertainty is linked to a fuzzy predicate referring to a gradual concept (e.g. a "dense" forest zone). In this case the boundaries are known and the measure of each element is precise, but the fuzzy predicate indicates gradualness.

Degree of membership 23/46
• The degree of membership µF(x) of an element x in a fuzzy set F can be used to express:
  • degree of similarity - related to gradualness;
  • degree of preference (in utility functions);
  • degree of uncertainty (not discussed).

Degree of membership 24/46
• Degree of similarity: the membership degree µF(x) represents the degree of proximity of x to prototype elements of F.
This view is used in clustering analysis and regression analysis, where the problem is to represent a set of data by the proximity between pieces of information.
• Example: classification of cars of known dimensions into the categories F = {big cars, regular cars, small cars}. If the prototype of the category "big cars" is a Mercedes Class S, then we can construct a measure of distance between any car and this prototype, where the distance is a measure of similarity.

Note: Fuzzy sets vs probabilities 25/46
• Probabilities are related to randomness: uncertainty described by the tendency or frequency of a random variable to take on a value in a specific region.
  • Interpretations: symmetry, frequency, subjective probability (exchangeable betting rates) - Bayes' rule.
• Fuzzy sets are related to gradualness.
• An example:
  • Predicting that the next person to walk in is tall (can be a probability).
  • The person is in front of you; how can you define him as tall? (gradualness)

Note: Fuzzy sets vs probabilities - Illustration 26/46
• Suppose I have 2 cartons of 10 bottles filled with water and/or poison. You are very thirsty and need to pick a bottle from one of the boxes. The trick is to decide which box to choose it from.
  • The first box has bottles that have a fuzzy membership of 0.9 in the set that describes water.
  • The second box has a 0.9 probability that you will pick a bottle of water if you choose one bottle.
• Which one to choose?
  • Box 1: each of the bottles contains 0.1 poison and 0.9 water (tastes funky).
  • Box 2: 90% of the bottles are 100% poison and the remaining 10% of the bottles are 100% water.
Fuzzy sets on discrete universes 27/46
• Fuzzy set C = "desirable city to live in"
  X = {SF, Boston, LA} (discrete and non-ordered)
  C = {(SF, 0.9), (Boston, 0.8), (LA, 0.6)}
• Fuzzy set A = "sensible number of children"
  X = {0, 1, 2, 3, 4, 5, 6} (discrete universe)
  A = {(0, .1), (1, .3), (2, .7), (3, 1), (4, .6), (5, .2), (6, .1)}

Fuzzy sets on continuous universes 28/46
• Fuzzy set B = "about 50 years old"
  X = set of positive real numbers (continuous)
  B = {(x, µB(x)) | x ∈ X}, with µB(x) = 1 / (1 + ((x − 50)/10)²)

About membership functions 29/46
• Subjective measures.
• Context dependent.
• Not probability functions.

Fuzzy partition 30/46
Fuzzy partition formed by the linguistic values "young", "middle aged", and "old".

Support, core, singleton 31/46
• The support of a fuzzy set A in X is the crisp subset of X whose elements have non-zero membership in A: supp(A) = {x ∈ X | µA(x) > 0}.
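The two fuzzy sets above can be evaluated directly; a minimal Python sketch (the variable names are mine, not from the slides):

```python
# Fuzzy set A = "sensible number of children" as explicit (element, membership) pairs
A = {0: 0.1, 1: 0.3, 2: 0.7, 3: 1.0, 4: 0.6, 5: 0.2, 6: 0.1}

# Fuzzy set B = "about 50 years old" via its membership function on a continuous universe
def mu_B(x):
    return 1.0 / (1.0 + ((x - 50.0) / 10.0) ** 2)

print(A[3])        # full membership: 1.0
print(mu_B(50))    # 1.0 at the prototype age
print(mu_B(60))    # 0.5 ten years away
```

Note how the discrete set is a finite list of pairs, while the continuous set is determined entirely by its membership function.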
The core of a fuzzy set A in X is the crisp subset of X whose elements have membership 1 in A: core(A) = {x ∈ X | µA(x) = 1}. 31/46

α-cut of a fuzzy set (level set) 32/46
An α-level set of a fuzzy set A of X is a crisp set denoted by Aα and defined by
  Aα = {x ∈ X | µA(x) ≥ α} for α > 0, and A0 = cl(supp(A)) for α = 0.

Normal fuzzy sets 33/46
• The height of a fuzzy set A is the maximum value of µA(x).
• A fuzzy set is called normal if its height is 1; otherwise it is called sub-normal.

Convexity of fuzzy sets 34/46
A fuzzy set A is convex if for any λ ∈ [0, 1],
  µA(λx1 + (1 − λ)x2) ≥ min(µA(x1), µA(x2)).
Alternatively, A is convex if all its α-cuts are convex.

Set theoretic operations 35/46
• Subset: A ⊆ B ⇔ µA ≤ µB
• Complement: Ā = X − A ⇔ µĀ(x) = 1 − µA(x)
• Union: C = A ∪ B ⇔ µC(x) = max(µA(x), µB(x)) = µA(x) ∨ µB(x)
• Intersection: C = A ∩ B ⇔ µC(x) = min(µA(x), µB(x)) = µA(x) ∧ µB(x)

Set theoretic operations 36/46
A is contained in B: A ⊆ B ⇔ µA(x) ≤ µB(x)

Average 37/46
The average of fuzzy sets A and B in X is defined by
  µ_(A+B)/2(x) = (µA(x) + µB(x)) / 2
Note that classical set theory does not have averaging as a set operation; this is an extension provided by the fuzzy set approach.

Combinations with negation 38/46
Note: the De Morgan laws do hold in fuzzy set theory!

MF formulation 39/46
• Triangular MF: trimf(x; a, b, c) = max(min((x − a)/(b − a), (c − x)/(c − b)), 0)
• Trapezoidal MF: trapf(x; a, b, c, d) = max(min((x − a)/(b − a), 1, (d − x)/(d − c)), 0)
• Gaussian MF: gaussmf(x; c, s) = exp(−(1/2)((x − c)/s)²)
• Generalized bell MF: gbellmf(x; a, b, c) = 1 / (1 + |(x − c)/a|^(2b))

MF formulation 40/46

Cartesian product 41/46
• The Cartesian product of fuzzy sets A and B is a fuzzy set in the product space X × Y with membership µA×B(x, y) = min(µA(x), µB(y)).
• The Cartesian co-product of fuzzy sets A and B is a fuzzy set in the product space X × Y with membership µA+B(x, y) = max(µA(x), µB(y)).
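The parameterized membership functions and the max/min set operations above can be sketched in Python (function names follow the slides' trimf/trapf naming; the example parameter values for "young" and "old" are my own illustration):

```python
def trimf(x, a, b, c):
    # Triangular MF: rises on [a, b], falls on [b, c], zero elsewhere
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapf(x, a, b, c, d):
    # Trapezoidal MF: flat top of height 1 on [b, c]
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def union(mu_a, mu_b):          # fuzzy union: pointwise maximum
    return lambda x: max(mu_a(x), mu_b(x))

def intersection(mu_a, mu_b):   # fuzzy intersection: pointwise minimum
    return lambda x: min(mu_a(x), mu_b(x))

# Illustrative linguistic values on an age universe (parameters assumed)
young = lambda x: trapf(x, -1, 0, 20, 35)
old   = lambda x: trapf(x, 50, 65, 120, 121)

print(trimf(5, 0, 5, 10))            # 1.0 at the peak
print(intersection(young, old)(30))  # 0.0: "young and old" has empty overlap at 30
```

The same pointwise construction gives the complement as `lambda x: 1 - mu_a(x)`, matching the definitions on the slide.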
Linguistic variable 42/46
• A numerical variable takes numerical values: Age = 65.
• A linguistic variable takes linguistic values: Age is old.
• A linguistic value is a fuzzy set.
• All linguistic values form a term set: T(age) = {young, not young, very young, middle aged, not middle aged, old, not old, very old, more or less old, not very young and not very old, ...}

Linguistic values (terms) 43/46

Questions? 44/46

Today: reader. 45/46
Now: read material/slides. 46/46

1BK40 Business Analytics & Decision Support Session 12, 2017 – 2018
Introduction to decision support; Decision heuristics; SMART
Prof. dr. ir. Uzay Kaymak, Pav.D02, u.kaymak@tue.nl

Announcements
• Decision making literature is available on Canvas:
  • Introduction to decision making methods (J. Fülöp)
  • Introduction to decision analysis (D.F. Groebner et al.)
  • Topics in decision analysis (U. Kaymak)
  • Biases and heuristics

Today's agenda
Concepts: discrete choice problem; alternatives and attributes; decision heuristics; Simple Multi-Attribute Rating Technique (SMART).
Techniques: lexicographic strategy; recognition heuristic; elimination by aspects; SMART.

Decisions with multiple attributes - Examples
• Choosing a holiday: liveliest nightlife; least crowded beaches; most sunshine; most modern hotels; lowest cost.
• Choosing a company to supply goods: best after-sales service; fastest delivery time; lowest prices; best reputation for reliability.

https://www.linkedin.com/pulse/20140827164419-92141785-effective-decision-making

Characteristics of the decision environment
• The choice set for
consideration is relatively small; typically, a subset of all possibilities is considered.
• Information is available, but large quantities of data are not involved.
• There may be considerable uncertainty (also related to the lack of data): consequences of decisions are not known accurately, and goals and consequences are ambiguous.
• Preferences of a human decision maker are important.
• Emphasis is on structured analysis (instead of the solution).

Bounded rationality
• The limitations of the human mind mean that people use 'approximate methods' to deal with most decision problems (cf. Miller's 7 ± 2 categories).
• As a result they seek to identify satisfactory, rather than optimal, courses of action.
• These approximate methods, or rules of thumb, are often referred to as 'heuristics'.

Discrete choice problems
• A finite number of alternatives is considered.
  • How to determine eligible alternatives?
  • How to collect information about these alternatives?
• The goal is to select one of the alternatives.
• Alternatives are compared on a number of aspects (also known as attributes).

Heuristics
• These heuristics are often well adapted to the structure of people's knowledge of the environment.
• Quick ways of making decisions, which people use especially when time is limited, have been referred to as 'fast and frugal heuristics'.

Compensation or not
• Compensatory strategy: poor performance on some attributes is compensated by good performance on others.
• This is not the case in a non-compensatory strategy.
• Compensatory strategies involve more cognitive effort.

The recognition heuristic
• Used where people have to choose between two options.
• If one is recognized and the other is not, the recognized option is chosen.
• Works well in environments where quality is associated with ease of recognition.

The minimalist strategy
• First apply the recognition heuristic.
• If neither option is recognized, simply guess which is the best option.
• If both options are recognized, pick one of the attributes of the two options at random and choose the best performer on this attribute.
• If both perform equally well on this attribute, pick a 2nd attribute at random, and so on.

Take the last
• Same as the minimalist heuristic, except that people use the attribute that enabled them to choose last time they had a similar choice.
• If both options are equally good on this attribute, choose the attribute that worked the time before, and so on.
• If none of the previously used attributes works, a random attribute will be tried.

The lexicographic strategy
Used where attributes can be ranked in order of importance.
• Involves identifying the most important attribute and selecting the option which is best on that attribute (e.g. choose the cheapest option).
• If there is a 'tie' on the most important attribute, choose the option which performs best on the 2nd most important attribute, and so on.

Semi-lexicographic strategy
• Like the lexicographic strategy, except that if options have similar performance on an attribute they are considered to be tied.
• It can lead to violation of the transitivity axiom (if A is preferred to B and B to C, transitivity requires that A is also preferred to C).

Example...
• 'If the price difference between brands is less than 50 cents, choose the higher quality product; otherwise choose the cheaper brand.'

  Brand   Price   Quality
  A       $3.00   Low
  B       $3.60   High
  C       $3.40   Medium

Elimination by aspects (EBA)
• The most important attribute is identified and a performance cut-off point is established.
• Any alternative falling below this point is eliminated.
• The process continues with the 2nd most important attribute, and so on.

Strengths & limitations of EBA
• Easy to apply.
• Involves no complicated computations.
• Easy to explain and justify to others.
• Fails to ensure that the alternatives retained are superior to those which are eliminated; this arises because the strategy is non-compensatory.

Sequential decision making: satisficing
• Used where alternatives become available sequentially.
• The search process stops when an alternative is found which is satisfactory, in that its attributes' performances all exceed aspiration levels.
• These aspiration levels themselves adjust gradually in the light of
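The semi-lexicographic rule in the brand example can be run pairwise to exhibit the intransitivity the slide warns about; a small Python sketch (the `prefer` helper is my own naming):

```python
def prefer(x, y):
    # Semi-lexicographic rule from the slide: if the price difference is
    # less than 50 cents, choose the higher-quality brand; otherwise the cheaper one.
    quality = {"Low": 0, "Medium": 1, "High": 2}
    if abs(x["price"] - y["price"]) < 0.50:
        return x if quality[x["quality"]] > quality[y["quality"]] else y
    return x if x["price"] < y["price"] else y

A = {"name": "A", "price": 3.00, "quality": "Low"}
B = {"name": "B", "price": 3.60, "quality": "High"}
C = {"name": "C", "price": 3.40, "quality": "Medium"}

print(prefer(A, B)["name"])  # A: gap is $0.60, so the cheaper brand wins
print(prefer(B, C)["name"])  # B: gap is $0.20, so the higher quality wins
print(prefer(C, A)["name"])  # C: gap is $0.40, so the higher quality wins
```

The pairwise results A ≻ B, B ≻ C, C ≻ A form a cycle: no brand is "the best", which is exactly the transitivity violation.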
alternatives already examined.

Reason-based choice
• Shafir et al.: 'when faced with the need to choose, decision makers often seek and construct reasons in order to resolve the conflict and justify their choice to themselves and to others'.

Some consequences
• Decisions framed as 'choose which to select...' can lead to different choices than those framed as 'choose which to reject'.
• Irrelevant alternatives can influence choice.
• Alternatives can be rejected if they have weakly favourable or irrelevant attributes.

Example of reason-based choice
Candidate A:
• Average written communication skills
• Satisfactory absenteeism record
• Average computer skills
• Reasonable interpersonal skills
• Average level of numeracy
• Average telephone skills
Candidate B:
• Excellent written communication skills
• Very good absenteeism record
• Excellent computer skills
• Awkward when dealing with others
• Poor level of numeracy
• Poor telephone skills

Factors affecting choices
• Time available to make the decision
• Effort that a given strategy will involve
• Decision maker's knowledge about the environment
• Importance of making an accurate decision
• Whether or not the choice has to be justified to others
• Desire to minimize conflict (e.g. conflicts between the pros and cons of the alternatives)
Many human aspects!

Decisions Involving Multiple Aspects: SMART
Simple Multi-Attribute Rating Technique

Objectives and Attributes
• An objective = an indication of the preferred direction of movement, i.e.
'minimize' or 'maximize'.
• An attribute is used to measure performance in relation to an objective.

An office location problem
  Location of office   Annual rent ($)
  Addison Square       30 000
  Bilton Village       15 000
  Carlisle Walk         5 000
  Denver Street        12 000
  Elton Street         30 000
  Filton Village       15 000
  Gorton Square        10 000

Main stages of SMART
1. Identify the decision maker(s)
2. Identify alternative courses of action
3. Identify the relevant attributes
4. Assess the performance of the alternatives on each attribute
5. Determine a weight for each attribute
6. For each alternative, take a weighted average of the values assigned to that alternative
7. Make a provisional decision
8. Perform sensitivity analysis

Value tree
Benefits: turnover (closeness to customers, visibility, image, size) and working conditions (comfort, car parking). Costs: rent, electricity, cleaning.

Issues
Is the value tree an accurate and useful representation of the decision maker's concerns?
1. Completeness
2. Operationality
3. Decomposability
4. Absence of redundancy
5. Minimum size

Costs associated with the seven offices
  Office          Annual rent ($)  Annual cleaning costs ($)  Annual electricity costs ($)  Total costs ($)
  Addison Square  30 000           3000                       2000                          35 000
  Bilton Village  15 000           2000                        800                          17 800
  Carlisle Walk    5 000           1000                        700                           6 700
  Denver Street   12 000           1000                       1100                          14 100
  Elton Street    30 000           2500                       2300                          34 800
  Filton Village  15 000           1000                       2600                          18 600
  Gorton Square   10 000           1100                        900                          12 000

Direct rating for 'Office Image'
Ranking from most preferred to least preferred:
1. Addison Square
2. Elton Street
3. Filton Village
4. Denver Street
5. Gorton Square
6. Bilton Village
7. Carlisle Walk

Direct rating - Assigning values

Value function to assign values

Values for the office location problem
  Attribute    A    B    C    D    E    F    G
  Closeness    100  20   80   70   40   0    60
  Visibility   60   80   70   50   60   0    100
  Image        100  10   0    30   90   70   20
  Size         75   30   0    55   100  0    50
  Comfort      0    100  10   30   60   80   50
  Car parking  90   30   100  90   70   0    80
One way of aggregating is to sum the values, but often the criteria have different importance.

Determining swing weights
• Rank the criteria.
• Judge the importance of a swing from the worst to the best on each attribute, compared to a swing from the worst to the best on the most important attribute (closeness to customers: 100; visibility: 80; image: 70; and so on).

For example...
A swing from the worst 'image' to the best 'image' is considered to be 70% as important as a swing from the worst to the best location for 'closeness to customers', so 'image' is assigned a weight of 70.

Normalizing weights
  Attribute                Original (swing) weights  Normalized weights (rounded)
  Closeness to customers   100                       32
  Visibility               80                        26
  Image                    70                        23
  Size                     30                        10
  Comfort                  20                        6
  Car-parking facilities   10                        3
  Total                    310                       100

Calculating aggregate benefits - Addison Square
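The normalization step above is a simple proportional rescaling of the swing weights; a minimal Python sketch with the data from the slide:

```python
# Swing weights elicited from the decision maker (slide data)
swing = {"Closeness to customers": 100, "Visibility": 80, "Image": 70,
         "Size": 30, "Comfort": 20, "Car-parking facilities": 10}

total = sum(swing.values())  # 310
normalized = {k: round(100 * v / total) for k, v in swing.items()}

print(total)                      # 310
print(sum(normalized.values()))   # the rounded weights happen to sum to 100 here
```

With these numbers the rounded weights come out as 32, 26, 23, 10, 6 and 3, matching the table; in general, rounding can make the total drift slightly from 100.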
  Attribute               Value  Weight  Value × weight
  Closeness to cust.      100    32      3200
  Visibility              60     26      1560
  Image                   100    23      2300
  Size                    75     10      750
  Comfort                 0      6       0
  Car-parking facilities  90     3       270
  Sum                                    8080
so aggregate benefits (weighted average) = 8080/100 = 80.8

Aggregate benefits computation
Table 3.2 - Values and weights for the office location problem
  Attribute    Weight  A     B     C     D     E     F     G
  Closeness    32      100   20    80    70    40    0     60
  Visibility   26      60    80    70    50    60    0     100
  Image        23      100   10    0     30    90    70    20
  Size         10      75    30    0     55    100   0     50
  Comfort      6       0     100   10    30    60    80    50
  Car parking  3       90    30    100   90    70    0     80
  Aggregate benefits   80.8  39.4  47.4  52.3  64.8  20.9  60.2

Trading benefits against costs
• Solutions on the efficient frontier can improve a benefit only by increasing costs, and vice versa (trade-off).
• Dominated solutions have worse benefit and cost than some solution on the efficient frontier.

Sensitivity analysis
• Force some weights to zero (e.g. turnover), re-compute the normalized weights, and perform the analysis again.

Summary
This lecture:
• Definition of discrete choice problems
• Alternatives and attributes
• Decision heuristics: minimalist strategy, lexicographic strategies, elimination by aspects, sequential decision making
• Simple multi-attribute rating technique (SMART)
Next topics:
• Introduction to fuzzy sets (already covered)
• Fuzzy decision making

1BK40 Business Analytics & Decision Support Session 13, 2017 – 2018
Fuzzy decision making; Multicriteria decisions
Prof. dr. ir.
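The weighted-average aggregation in the table above can be reproduced directly; a Python sketch using the slide's values and normalized weights (two offices shown, the rest follow the same pattern):

```python
weights = {"Closeness": 32, "Visibility": 26, "Image": 23,
           "Size": 10, "Comfort": 6, "Car parking": 3}

# Value scores of two offices on each attribute (from Table 3.2)
values = {
    "Addison Square": {"Closeness": 100, "Visibility": 60, "Image": 100,
                       "Size": 75, "Comfort": 0, "Car parking": 90},
    "Gorton Square":  {"Closeness": 60, "Visibility": 100, "Image": 20,
                       "Size": 50, "Comfort": 50, "Car parking": 80},
}

def aggregate_benefit(office):
    # SMART step 6: weighted average of the attribute values
    return sum(weights[a] * v for a, v in values[office].items()) / sum(weights.values())

print(aggregate_benefit("Addison Square"))  # 80.8
print(aggregate_benefit("Gorton Square"))   # 60.2
```

The results match the "aggregate benefits" row of the slide table, since the weights already sum to 100.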
Uzay Kaymak, Pav.D02, u.kaymak@tue.nl

Today's agenda
Concepts: multicriteria decisions; fuzzy decision making; fuzzy set aggregation functions.
Techniques: Bellman and Zadeh's model; Yager's model; weighted criteria.

General Formulation of DM
A decision is a quintuple (A, Q, X, k, D):
• A is the set of decision alternatives
• Q is the set of "states of environment"
• X is the set of consequences
• k is a mapping A × Q → X which relates decision alternatives to consequences
• D is the decision function on X: alternative ai is preferred to aj when D(xi) ≥ D(xj)

Modeling benefits of office location
(Value tree: benefits split into turnover (closeness to customers, visibility, image, size) and working conditions (comfort, car parking); costs into rent, electricity, and cleaning.)

What are the elements of the quintuple?
Alternatives: 1. Addison Square, 2. Elton Street, 3. Filton Village, 4. Denver Street, 5. Gorton Square, 6. Bilton Village, 7. Carlisle Walk.
Identify the criteria, the consequences, the mapping k, and the decision function, e.g. when considering SMART.

Fuzzy Goals and Constraints
• A fuzzy goal is a restriction on the set of alternatives A: µG : A → [0, 1].
• Fuzzy goals are often specified indirectly, on the set of objective function values.
• A fuzzy constraint is also a restriction on the set of alternatives: µC : A → [0, 1].
• Fuzzy constraints are defined on the set of alternatives directly, or indirectly on the domain of various indicators.
• Fuzzy goals and constraints are generalisations of crisp goals and constraints.

Bellman and Zadeh's model
• The fuzzy decision F is a confluence of the (fuzzy) decision goals and the (fuzzy) decision constraints.
• Both the decision goals and the decision constraints should be satisfied: F = G ∩ C, i.e. µF(a) = µG(a) ∧ µC(a), a ∈ A.
• Maximising decision (optimal decision a*): the decision with the largest membership value, a* = arg max_{a ∈ A} µG(a) ∧ µC(a).
• The alternative corresponding to the largest membership value is denoted as the best alternative (solution).

BZ model: example
(Figure: membership functions for a small dosage (fuzzy constraint) and a large dosage (fuzzy goal) over the interferon dosage [mg]; the fuzzy decision is their intersection, and the maximizing decision is its peak.)
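Bellman and Zadeh's maximising decision can be sketched over a discrete set of dosages; the membership values below are illustrative assumptions, not taken from the slides:

```python
# Illustrative fuzzy goal "large dosage" and fuzzy constraint "small dosage"
dosages = [10, 20, 30, 40, 50]
mu_goal       = {10: 0.0, 20: 0.3, 30: 0.6, 40: 0.9, 50: 1.0}
mu_constraint = {10: 1.0, 20: 0.9, 30: 0.7, 40: 0.3, 50: 0.0}

# Fuzzy decision F = G ∩ C: pointwise minimum of goal and constraint
mu_decision = {d: min(mu_goal[d], mu_constraint[d]) for d in dosages}

# Maximising decision a*: the alternative with the largest membership in F
best = max(dosages, key=lambda d: mu_decision[d])
print(best, mu_decision[best])  # 30 0.6
```

With these numbers the decision membership peaks in the middle of the range, mirroring the dosage figure: the best alternative balances the conflicting goal and constraint.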
Yager's model
• A special case of Bellman and Zadeh's model.
• Discrete set of alternatives; multiple decision criteria.
• Evaluation of the alternatives for each criterion using a fuzzy set, leading to judgements (ratings, membership values).
• Use of fuzzy aggregation operators for combining the judgements (decision function).
• Decision criteria can be weighted.
• Alternatives are ordered by the decision function.

Discrete Choice Problem
• Set of alternatives A = {a1, ..., an}; set of criteria C = {c1, ..., cm}. The structure is determined by the selection of criteria.
• Judgements µij from the evaluation of each alternative for each criterion form the evaluation matrix, with rows c1, ..., cm, columns a1, ..., an, and entry µij the judgement of alternative aj on criterion ci.
• Evaluations are either made using membership functions that represent fuzzy criteria, or by direct evaluation of the alternatives (i.e. filled in by the decision maker).

Discrete Choice Problem
• Weight factors denote the importance of the criteria.
• An aggregation function (decision function) combines the weight factors and the judgements for the criteria: D_w(µ1j, ..., µmj), j ∈ {1, ..., n}.
• The decision function orders the alternatives according to preference: a higher aggregated value corresponds to a more preferred alternative, D(ak) ≥ D(al) ⇔ ak ≽ al.

Types of Fuzzy Aggregation
• Conjunctive aggregation of criteria: models simultaneous satisfaction of criteria (t-norms).
• Disjunctive aggregation of criteria: models full compensation amongst criteria (t-conorms).
• Compensatory aggregation of criteria: models trade-off and exchange amongst the criteria (compensatory operators, averaging operators, fuzzy integrals).
• Aggregation with a mixed behaviour: models some types of complicated interactions amongst criteria (associative compensatory operators, rule-based mappings, hierarchies of operators).

T-(co)norms & Averaging Operators
• T-norms: simultaneous satisfaction of criteria, e.g.
T-(co)norms and averaging operators
• T-norms model simultaneous satisfaction of criteria, e.g.
  D(a, b) = a ∧ b (minimum)
  D(a, b) = a · b (product)
  D(a, b) = max(0, a + b − 1) (bounded difference)
• T-conorms model full compensation amongst criteria, e.g.
  D(a, b) = a ∨ b (maximum)
  D(a, b) = a + b − ab (algebraic sum)
  D(a, b) = min(a + b, 1) (bounded sum)
• Averaging operators model trade-off amongst criteria, e.g. the generalised averaging operator
  D(b1, …, bm) = ((1/m) Σ_{i=1}^{m} bi^s)^{1/s}, s ∈ ℝ

Generalized intersection (t-norm)
• Basic requirements:
  – Boundary: T(0, a) = 0, T(a, 1) = T(1, a) = a
  – Monotonicity: T(a, b) ≤ T(c, d) if a ≤ c and b ≤ d
  – Commutativity: T(a, b) = T(b, a)
  – Associativity: T(a, T(b, c)) = T(T(a, b), c)
• Examples:
  – Minimum: T(a, b) = a ∧ b
  – Algebraic product: T(a, b) = a · b
  – Bounded difference: T(a, b) = 0 ∨ (a + b − 1)

T-norm operator
[Figure: surface plots of four t-norms: (a) minimum Tm(a, b), (b) algebraic product Ta(a, b), (c) bounded product Tb(a, b), (d) drastic product Td(a, b)]

Generalized union (t-conorm)
• Basic requirements:
  – Boundary: S(1, a) = 1, S(a, 0) = S(0, a) = a
  – Monotonicity: S(a, b) ≤ S(c, d) if a ≤ c and b ≤ d
  – Commutativity: S(a, b) = S(b, a)
  – Associativity: S(a, S(b, c)) = S(S(a, b), c)
• Examples:
  – Maximum: S(a, b) = a ∨ b
  – Algebraic sum: S(a, b) = a + b − ab
  – Bounded sum: S(a, b) = 1 ∧ (a + b)

T-conorm operator
[Figure: surface plots of four t-conorms: (a) maximum Sm(a, b), (b) algebraic sum Sa(a, b), (c) bounded sum Sb(a, b), (d) drastic sum Sd(a, b)]
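The t-norm axioms can be verified numerically on a grid; the Python sketch below does so for the algebraic product / algebraic sum pair under Zadeh's complement N(a) = 1 − a (any dual pair could be substituted):

```python
import itertools

# Numerical check of the t-norm axioms and of duality for the
# algebraic product / algebraic sum pair under Zadeh's complement.
T = lambda a, b: a * b               # algebraic product (t-norm)
S = lambda a, b: a + b - a * b       # algebraic sum (t-conorm)
N = lambda a: 1 - a                  # Zadeh's complement

grid = [i / 10 for i in range(11)]
for a, b, c in itertools.product(grid, repeat=3):
    assert abs(T(a, b) - T(b, a)) < 1e-9               # commutativity
    assert abs(T(a, T(b, c)) - T(T(a, b), c)) < 1e-9   # associativity
    assert abs(S(a, b) - N(T(N(a), N(b)))) < 1e-9      # S is the dual of T
assert T(0.0, 0.7) == 0.0 and T(1.0, 0.7) == 0.7       # boundary conditions
print("t-norm axioms and duality hold on the grid")
```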
Generalised averaging operator
Has, as special cases, well-known averaging operators:
  D(b1, …, bm) = ((1/m) Σ_{i=1}^{m} bi^s)^{1/s}, s ∈ ℝ
• Minimum operator (s → −∞): D(b1, …, bm) = ∧_{i=1}^{m} bi
• Harmonic mean (s = −1): D(b1, …, bm) = m / Σ_{i=1}^{m} (1/bi)
• Geometric mean (s = 0): D(b1, …, bm) = (Π_{i=1}^{m} bi)^{1/m}
• Arithmetic mean (s = 1): D(b1, …, bm) = (1/m) Σ_{i=1}^{m} bi
• Quadratic mean (s = 2): D(b1, …, bm) = ((1/m) Σ_{i=1}^{m} bi²)^{1/2}
• Maximum operator (s → ∞): D(b1, …, bm) = ∨_{i=1}^{m} bi

Index of optimism
• The generalized averaging operator is monotonic in the parameter s
• s can be interpreted as an index of optimism of the decision maker
[Figure: aggregate value as a function of the index of optimism s, increasing from the minimum through the harmonic, geometric, arithmetic and quadratic means to the maximum]

Range of operators

Generalized negation
• General requirements:
  – Boundary: N(0) = 1 and N(1) = 0
  – Monotonicity: N(a) > N(b) if a < b
  – Involution: N(N(a)) = a
• Two types of fuzzy complements:
  – Sugeno's complement: Ns(a) = (1 − a) / (1 + sa)
  – Yager's complement: Nw(a) = (1 − a^w)^{1/w}

Sugeno's and Yager's complements
[Figure: (a) Sugeno's complements for s = −0.95, −0.7, 0, 2, 20 and (b) Yager's complements for w = 3, 1.5, 1, 0.7, 0.4; s = 0 and w = 1 both give Zadeh's complement N(a) = 1 − a]

Generalized De Morgan's law
• T-norms and t-conorms are duals which support the generalization of De Morgan's law:
  T(a, b) = N(S(N(a), N(b)))
  S(a, b) = N(T(N(a), N(b)))
• Dual pairs: Tm(a, b) ↔ Sm(a, b), Ta(a, b) ↔ Sa(a, b), Tb(a, b) ↔ Sb(a, b), Td(a, b) ↔ Sd(a, b)

Compensatory operators
• Model trade-off amongst criteria
• Average of a t-norm and a t-conorm: D(a, b) = M(T(a, b), S(a, b))
• Zimmermann operator: the weighted geometric mean of the algebraic product and the algebraic sum,
  D(a, b) = (ab)^{1−γ} (a + b − ab)^γ, γ ∈ [0, 1]
• Hurwicz operator: the weighted arithmetic mean of the minimum and the maximum,
  D(a, b) = (1 − α)(a ∧ b) + α(a ∨ b), α ∈ [0, 1]

Hierarchies of operators
• Model interaction amongst criteria
• Organise a complex problem into sub-problems
• Group logically related criteria
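A quick numerical check of the generalised averaging operator and of its monotonicity in s (a Python sketch with invented membership values; s = 0 is implemented as its limit, the geometric mean):

```python
from math import prod

def gen_average(b, s):
    # Generalised averaging operator ((1/m) * sum(bi**s)) ** (1/s);
    # s == 0 is handled as its limit, the geometric mean.
    m = len(b)
    if s == 0:
        return prod(b) ** (1 / m)
    return (sum(x ** s for x in b) / m) ** (1 / s)

b = [0.4, 0.6, 0.9]
values = [gen_average(b, s) for s in (-1, 0, 1, 2)]  # HM, GM, AM, QM
print([round(v, 4) for v in values])

# Monotonic in s: a larger s yields a more optimistic aggregate,
# always between the minimum and the maximum of the inputs.
assert values == sorted(values)
assert min(b) <= values[0] and values[-1] <= max(b)
```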
Parameterized t-norms and t-conorms
• Parameterized t-norms and dual t-conorms have been proposed by several researchers:
  • Yager
  • Schweizer and Sklar
  • Dubois and Prade
  • Hamacher
  • Frank
  • Sugeno
  • Dombi

Parametric t-norms and t-conorms
Parametric operators generalise a wide range of operators, often by using a single parameter.
T-norms:
• Yager t-norm: D(a, b) = max(0, 1 − [(1 − a)^p + (1 − b)^p]^{1/p}), p > 0
• Hamacher t-norm: D(a, b) = ab / (γ + (1 − γ)(a + b − ab)), γ ≥ 0
T-conorms:
• Yager t-conorm: D(a, b) = min(1, (a^p + b^p)^{1/p}), p > 0
• Hamacher t-conorm: D(a, b) = (a + b + (γ − 2)ab) / (1 + (γ − 1)ab), γ ≥ 0

Schweizer and Sklar
  T_SS(a, b, p) = [max{0, a^{−p} + b^{−p} − 1}]^{−1/p}
  S_SS(a, b, p) = 1 − [max{0, (1 − a)^{−p} + (1 − b)^{−p} − 1}]^{−1/p}
• lim_{p→0} T_SS(a, b, p) = ab
• lim_{p→∞} T_SS(a, b, p) = min(a, b)
[Figure: (a) two fuzzy sets A and B, (b) their Schweizer-Sklar t-norm, (c) their Schweizer-Sklar t-conorm (s-norm)]

Weighted aggregation
• Weights represent the relative importance of the objective functions and the constraints
• The problem is described by
  D(x, w) = T(w; μG0(a0ᵀx), μG1(a1ᵀx), …, μGm(amᵀx)),
  where the fuzzy sets μGi represent each criterion
• The solution is given by D(x*, w) = sup_x D(x, w)
• For general fuzzy optimization with simultaneous satisfaction of the constraints, t-norms must be extended to their weighted counterparts

Weight factors
• Weights represent the relative importance of the various constraints and the goal within the preference structure of the decision maker
• The higher the weight of a particular criterion, the larger its importance in the aggregation result
• The importance of criteria can also be expressed directly in the membership functions.
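One way to obtain a weighted counterpart of the product t-norm is to transform the operands by raising each membership value to the power of its weight, as worked out on the slides that follow. Reproducing that numeric example (a = 0.5, b = 0.4, wa = 1, wb = 0.25) in Python:

```python
# Weighted product t-norm by operand transformation (power raising),
# with the values used on the slides: a = 0.5, b = 0.4, wa = 1, wb = 0.25.
a, b = 0.5, 0.4
wa, wb = 1.0, 0.25

non_weighted = a * b              # plain product: 0.2
weighted = (a ** wa) * (b ** wb)  # power raising per operand

print(round(non_weighted, 4))  # 0.2
print(round(weighted, 4))      # ~0.4: the low weight wb reduces b's influence
assert weighted > non_weighted
```

Raising a membership value to a small power pushes it towards 1, so a criterion with a low weight can no longer drag the conjunctive aggregate down.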
• Normalization of the weights for t-norms and t-conorms: Σ_{i=0}^{m} wi = 1

Two approaches to weighted aggregation
• Transforming the decision function
  • Incorporate the weight factors into the decision function in a uniform way
  • The weights become parameters of the new aggregation function
• Transforming the operands
  • Transform the membership values into new values by using the weight factors
  • Use the original non-weighted aggregation function on the transformed values

Transforming the decision function (1)
[Figure: contour lines of a non-weighted vs. a weighted decision function; the contour lines of the decision function change]

Transforming the decision function (2)
[Figure: contour lines of the non-weighted vs. the weighted Hamacher operator]

Transforming the operands
• Product t-norm with a = 0.5, b = 0.4 and weights wa = 1, wb = 0.25
• Non-weighted evaluation: ab = 0.2
• With power raising: a^{wa} = 0.5 and b^{wb} = 0.4^{0.25} ≈ 0.8, so the weighted evaluation is ≈ 0.4
[Figure: contour lines of the non-weighted and the weighted product over criterion 1 and criterion 2]

Weighted conjunction (examples)
• Minimum operator: D(x, w) = ∧_{i=0}^{m} [μGi(x)]^{wi}
• Product operator: D(x, w) = Π_{i=0}^{m} [μGi(x)]^{wi}
• Hamacher t-norm: D(x, w) = 0 if μGi(x) = 0 for some i, and otherwise
  D(x, w) = 1 / (1 + Σ_{i=0}^{m} wi (1 − μGi(x)) / μGi(x))
• Yager t-norm: D(x, w) = max(0, 1 − (Σ_{i=0}^{m} wi (1 − μGi(x))²)^{1/2})

Weighted averaging
• Generalized averaging operator:
  D(x, w) = (Σ_{i=0}^{m} wi [μGi(x)]^s)^{1/s}, s ∈ ℝ, with Σ_{i=0}^{m} wi = 1
• Ordered weighted averages (OWA):
  D(x, w) = Σ_{i=0}^{m} wi μG'i(x), with Σ_{i=0}^{m} wi = 1,
  where the prime indicates that the membership values are ordered from smaller to larger

Weighted disjunction (examples)
Most easily obtained from the weighted t-norms by using the De Morgan laws: S(x, w) = 1 − T(1 − x, w)
• Maximum operator: D(x, w) = 1 − ∧_{i=0}^{m} [1 − μGi(x)]^{wi}
• Algebraic sum: D(x, w) = 1 − Π_{i=0}^{m} [1 − μGi(x)]^{wi}
• Yager s-norm: D(x, w) = min(1, (Σ_{i=0}^{m} wi [μGi(x)]²)^{1/2})

Summary
This lecture:
• General definition of decision making (discrete choice)
• Fuzzy decision making
• Fuzzy set aggregation
• Bellman and Zadeh's model
• Yager's model
• Weighted aggregation
Next topics:
• Bayesian decision model
• Analytic Hierarchy Process (AHP)

1BK40 Business Analytics
& Decision Support
Session 14, 2017 – 2018
Bayes decision analysis
Analytic Hierarchy Process
Prof. dr. ir. Uzay Kaymak
Pav. D.02, u.kaymak@tue.nl

Today's agenda
Concepts:
• Bayes decision formulation
• Payoff table
• Perfect information and experimentation
• Pairwise comparisons
Techniques:
• Payoff table
• Expected value of perfect information
• Expected value of experimentation
• Analytic Hierarchy Process (AHP)

Scope
• For situations where there is a significant degree of uncertainty
• The uncertainty may be about:
  • key decision parameters
  • the actual outcome
  • the problem definition …
• Often considers a small and finite set of alternatives

Example: oil exploration
Perhaps there is oil, perhaps there is not. Do you sell the land and the exploration rights, or do you drill and develop the field? It is costly to obtain better or more information!

Key concepts
• Feasible alternatives: e.g. an action from a set of possible actions
• State of nature: a possible situation that may be the reality
• Payoff table: a quantification of each combination of an alternative and a state of nature

Decision making
• The states of nature have different prior probabilities
• The optimal decision is found by aggregation in the payoff table
• Different decision criteria for the aggregation model different types of behavior:
  • a pessimistic viewpoint
  • an optimistic viewpoint
  • a probabilistic viewpoint, etc.

Example problem
• An oil company owns a piece of land
• It can make money by selling the land
• It can drill for oil (which brings money if oil is found)
• It is uncertain whether oil will be found

Payoff table:

                     State of Nature
  Alternative        Oil     Dry
  Drill              700    −100
  Sell                90      90
  Prior probability   0.25    0.75

Maximin and maximax
• Maximin payoff criterion: for each possible action, find the minimum payoff over all possible states of nature. Next, find the maximum of these minimum payoffs. Choose the action whose minimum payoff gives this maximum.
• Maximax payoff criterion: for each possible action, find the maximum payoff over all possible states of nature. Next, find the maximum of these maximum payoffs. Choose the action whose maximum payoff gives this maximum.

Maximin example (only the worst state of nature is considered):

  Alternative   Oil     Dry    Minimum in row
  Drill         700    −100    −100
  Sell           90      90      90  ← maximin

Maximax example (only the best state of nature is considered):

  Alternative   Oil     Dry    Maximum in row
  Drill         700    −100     700  ← maximax
  Sell           90      90      90

Maximum likelihood & Bayes
• Maximum likelihood criterion: identify the most likely state of nature (the one with the largest prior probability). For this state of nature, find the action with the maximum payoff. Choose this action.
• Bayes' decision rule: using the best available estimates of the probabilities of the respective states of nature (the prior probabilities), calculate the expected value of the payoff for each of the possible actions. Choose the action with the maximum expected payoff.

Maximum likelihood example (only the most probable state of nature, Dry, is considered):

  Alternative         Oil     Dry
  Drill               700    −100
  Sell                 90      90  ← maximum
  Prior probability     0.25    0.75

Bayes example (expected payoff = 0.25 × oil + 0.75 × dry):

  Alternative         Oil     Dry    Expected payoff
  Drill               700    −100    100  ← maximum
  Sell                 90      90     90
  Prior probability     0.25    0.75

This criterion is also called the Expected Monetary Value (EMV) criterion.

Bayes sensitivity
[Figure: the expected payoffs of Drill and Sell as a function of the prior probability of oil; the optimal action changes where the two lines cross]

The maximin criterion: another example
A decision table for the food manufacturer (daily profits):

  Course of action     Demand: 1 batch   Demand: 2 batches
  Produce 1 batch          $200              $200
  Produce 2 batches       −$600              $400
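The four criteria can be applied to the oil-drilling payoff table in a few lines (a Python sketch; the payoffs and priors are the ones from the slides):

```python
# Decision criteria applied to the oil-drilling payoff table.
payoff = {"Drill": {"Oil": 700, "Dry": -100},
          "Sell":  {"Oil": 90,  "Dry": 90}}
prior = {"Oil": 0.25, "Dry": 0.75}

maximin = max(payoff, key=lambda a: min(payoff[a].values()))   # pessimistic
maximax = max(payoff, key=lambda a: max(payoff[a].values()))   # optimistic

likely = max(prior, key=prior.get)                 # most likely state: Dry
max_likelihood = max(payoff, key=lambda a: payoff[a][likely])

emv = {a: sum(prior[s] * payoff[a][s] for s in prior) for a in payoff}
bayes = max(emv, key=emv.get)                      # Bayes / EMV criterion

print(maximin, maximax, max_likelihood)  # Sell Drill Sell
print(emv, bayes)                        # Drill has EMV 100 > 90, so Drill
```

The criteria disagree, which is exactly the point: they encode different attitudes towards risk, not different data.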
EMV criterion
Another decision table for the food manufacturer (daily profits):

  Course of action     Demand: 1 batch   Demand: 2 batches
  Probability               0.3               0.7
  Produce 1 batch          $200              $200
  Produce 2 batches       −$600              $400

Calculating expected profits
• Produce one batch: expected daily profit = (0.3 × $200) + (0.7 × $200) = $200
• Produce two batches: expected daily profit = (0.3 × −$600) + (0.7 × $400) = $100

Sensitivity analysis

Limitations of the EMV criterion
• It assumes that the decision maker is neutral to risk
• It assumes a linear value function for money
• It considers only one attribute: money

Revising judgments
• How can we improve decisions when new information is available?
• How much value can we hope to obtain from new information?
• How much is new information worth?

Bayes' theorem: prior probability + new information → posterior probability

DM with experimentation
• Obtain additional information, e.g. a seismic study for k$ 30
• This lets you estimate the probabilities of the states in a better way (posterior probabilities)
• Bayes' theorem, with fj the finding from the experimentation (a random variable) and n the number of possible states of nature:
  P(θi | fj) = P(fj | θi) P(θi) / Σ_{k=1}^{n} P(fj | θk) P(θk)   (posterior probability)

Probability tree:

  Prior            Conditional        Joint                   Posterior
  P(Oil) = 0.25    P(FSS|Oil) = 0.6   P(Oil and FSS) = 0.15   P(Oil|FSS) = 0.5
                   P(USS|Oil) = 0.4   P(Oil and USS) = 0.10   P(Oil|USS) = 0.14
  P(Dry) = 0.75    P(FSS|Dry) = 0.2   P(Dry and FSS) = 0.15   P(Dry|FSS) = 0.5
                   P(USS|Dry) = 0.8   P(Dry and USS) = 0.60   P(Dry|USS) = 0.86
  with P(FSS) = 0.3 and P(USS) = 0.7

EV of perfect information
• Gives an upper bound on the expected value of experimentation
• Expected payoff with perfect information (EPPI): the average payoff, assuming you can take the best decision in any state of nature
• EVPI = EPPI − EPWE, where EPWE is the expected payoff without experimentation

Value of experimentation
• Identifies the potential value of experimentation
• Expected payoff with experimentation: EPE = Σ_j P(fj) E[payoff | fj]
• EVE = EPE − EPWE (expected value of experimentation)

The components problem
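The probability tree, EVPI and EVE for the oil example can be reproduced numerically (a Python sketch using the priors, conditionals and payoffs from the slides):

```python
# Posterior probabilities, EVPI and EVE for the oil example.
prior  = {"Oil": 0.25, "Dry": 0.75}
cond   = {"Oil": {"FSS": 0.6, "USS": 0.4},     # P(finding | state)
          "Dry": {"FSS": 0.2, "USS": 0.8}}
payoff = {"Drill": {"Oil": 700, "Dry": -100},
          "Sell":  {"Oil": 90,  "Dry": 90}}
findings = ("FSS", "USS")

# Bayes' theorem: P(state | finding) = P(f|s) P(s) / sum_k P(f|k) P(k)
p_f = {f: sum(cond[s][f] * prior[s] for s in prior) for f in findings}
post = {f: {s: cond[s][f] * prior[s] / p_f[f] for s in prior} for f in findings}

# Expected payoff without experimentation (Bayes / EMV criterion).
epwe = max(sum(prior[s] * payoff[a][s] for s in prior) for a in payoff)

# EPPI: the best action is taken in every state of nature.
eppi = sum(prior[s] * max(payoff[a][s] for a in payoff) for s in prior)

# EPE: decide optimally after observing each finding, average over findings.
epe = sum(p_f[f] * max(sum(post[f][s] * payoff[a][s] for s in prior)
                       for a in payoff)
          for f in findings)

print({f: round(post[f]["Oil"], 2) for f in findings})  # {'FSS': 0.5, 'USS': 0.14}
print(eppi - epwe)           # EVPI = 142.5
print(round(epe - epwe, 1))  # EVE = 53.0, more than the k$ 30 study cost
```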
Applying Bayes' theorem
Steps for Bayes' theorem:
1. Construct a tree with all possible events
2. Extend the tree by attaching new branches that represent the new information (conditional probabilities)
3. Obtain the joint probabilities (multiply the probabilities from the root to the leaves)
4. Sum the joint probabilities
5. Obtain the posterior probabilities by dividing the "appropriate" joint probability by the sum of the joint probabilities

DM with Bayes theorem

Retailer's decision problem
• New information: the uncertainty about the outcome is reduced (cf. entropy)

Applying posterior probabilities
• Decision nodes and probability nodes

Determining EVPI
Example: calculating the EVPI

Buy imperfect information?
• If the test indicates the virus is present
• If the test indicates the virus is absent

Determining the EVE (or EVII)
• Expected profit with imperfect information = $62 155
• Expected profit without the information = $57 000
• Expected value of imperfect information (EVII) = $5 155

Oil exploration example
[Probability tree as before: priors 0.25/0.75 for Oil/Dry; conditionals P(FSS|Oil) = 0.6, P(USS|Oil) = 0.4, P(FSS|Dry) = 0.2, P(USS|Dry) = 0.8; joints 0.15, 0.10, 0.15, 0.60; P(FSS) = 0.3, P(USS) = 0.7; posteriors P(Oil|FSS) = 0.5, P(Dry|FSS) = 0.5, P(Oil|USS) = 0.14, P(Dry|USS) = 0.86]

Constructing a decision tree
Refined decision tree

Rolling back the tree
• The timing of the decisions is not considered

Pearson-Tukey approximation
• Used for approximating a continuous probability distribution with a set of three discrete probability events
• Estimate the value with a 95% probability of being exceeded (set the probability of this event to 0.185)
• Estimate the value with a 50% probability of being exceeded (set the probability of this event to 0.63)
• Estimate the value with a 5% probability of being exceeded (set the probability of this event to 0.185)
• Dependent probabilities are estimated as conditional probabilities

Example

Eliciting decision structure
Towards a better representation?

Definitions in influence diagrams
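The Pearson-Tukey three-point approximation can be sanity-checked against a distribution with a known mean; the sketch below (an invented illustration, not an example from the slides) approximates the mean of a normal distribution from its 5%, 50% and 95% fractiles:

```python
from statistics import NormalDist

def pearson_tukey_mean(inv_cdf):
    # Three-point approximation: the values exceeded with probability
    # 95%, 50% and 5% get the weights 0.185, 0.63 and 0.185.
    x95 = inv_cdf(0.05)   # exceeded with 95% probability
    x50 = inv_cdf(0.50)   # exceeded with 50% probability
    x05 = inv_cdf(0.95)   # exceeded with 5% probability
    return 0.185 * x95 + 0.63 * x50 + 0.185 * x05

d = NormalDist(mu=100, sigma=20)
est = pearson_tukey_mean(d.inv_cdf)
print(round(est, 6))   # ~100: exact for a symmetric distribution
```

For skewed distributions the three-point estimate is only an approximation, which is precisely why the weights 0.185/0.63/0.185 were calibrated empirically.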
Deriving the decision tree
1. Identify a node with no arrows pointing to it
2. If there is a choice in step 1 between a decision node and an event node, choose the decision node
3. Place the node at the beginning of the decision tree and remove it from the influence diagram
4. Repeat from step 1 with the new, reduced influence diagram

Tree derived from the influence diagram

Analytic Hierarchy Process (AHP)

Pairwise comparisons
• Based on psychological studies
• Human judgment is error prone when measuring quantities on relative or absolute scales
• Various biases: end effects, avoiding extreme scores, the inability to compare a large number of objects simultaneously
• Humans are good at relative comparisons, e.g. comparing two alternatives
• One method built on pairwise comparisons is the analytic hierarchy process (AHP)

Overview of the AHP
1. Set up the decision hierarchy
2. Make pairwise comparisons of attributes and alternatives
3. Transform the comparisons into weights and check consistency
4. Use the weights to obtain scores for the options
5. Carry out sensitivity analysis

Packaging machine problem

Scale for pairwise comparisons
• equally important (1): the alternatives are equally important
• weakly more important (3): experience or judgment slightly favours one alternative over the other
• strongly more important (5): experience or judgment strongly favours one alternative over the other
• very strongly more important (7): one alternative is strongly preferred and its dominance is demonstrated in practice
• extremely more important (9): the evidence definitely favours one alternative

Reciprocal preference matrix
• An n × n matrix whose diagonal elements are all 1
• The numbers above the main diagonal are the reciprocals of the numbers below it: pij = 1/pji
• Note that the sum of the eigenvalues is n, i.e. Σ_{j=1}^{n} λj = n
• Example:

   1     3     5
  1/3    1    1/7
  1/5    7     1
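A reciprocal preference matrix only needs its above-diagonal judgments; the rest follows from pii = 1 and pij = 1/pji. A small Python sketch using the example matrix from the slide:

```python
# Build a reciprocal preference matrix from its above-diagonal judgments
# (the example from the slide) and verify the defining properties.
upper = {(0, 1): 3, (0, 2): 5, (1, 2): 1 / 7}
n = 3
P = [[1.0] * n for _ in range(n)]
for (i, j), v in upper.items():
    P[i][j] = v
    P[j][i] = 1 / v

for i in range(n):
    assert P[i][i] == 1.0                         # unit diagonal
    for j in range(n):
        assert abs(P[i][j] * P[j][i] - 1) < 1e-9  # p_ij = 1/p_ji
# The trace is n, so the eigenvalues of P indeed sum to n.
assert sum(P[i][i] for i in range(n)) == n
print(P)
```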
Judgment scales
• Saaty scale (9-point scale)
  • 1, 3, 5, 7, 9 denote importance levels
  • 2, 4, 6, 8 denote intermediate (mixed) levels
  • i.e. the scale 1/9, 1/7, 1/5, 1/3, 1, 3, 5, 7, 9
• Geometric scale: a^{−q}, …, a^{−1}, a^0, a^1, …, a^q
  • With q = 4 and a = 2: 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16

Comparing criteria
• Importance of the quality attributes

Comparing alternatives

Solution methods
• Eigenvector solution (most widely used)
• Normalized row sums
• Normalized column sums
• Geometric mean method
• Logarithmic regression
• Computational intelligence approaches

Eigenvector solution (Saaty)
• A reciprocal preference matrix has a unique positive maximal eigenvalue λmax (Perron-Frobenius theorem)
• Find the eigenvector corresponding to this eigenvalue (the principal eigenvector); see the function eig in Matlab
• Normalize the vector elements to obtain the weights (usually such that the sum of the vector elements equals 1)

Normalized row or column sums
• Sum up each row and normalize the sums, e.g.

   1    1/3    2     row sum 10/3
   3     1     6     row sum 10
  1/2   1/6    1     row sum 5/3

  total 15, so w = [2/9, 2/3, 1/9]ᵀ

• Alternatively, sum up the reciprocals of each column and normalize; for this matrix the reciprocal column sums are again 10/3, 10 and 5/3, giving w = [2/9, 2/3, 1/9]ᵀ

Geometric mean method
• Calculate the geometric mean of each row: mi = (Π_{j=1}^{n} pij)^{1/n}
• Normalize the resulting judgment vector, e.g.

   1    1/3    2     m1 = 0.8736 → 2/9
   3     1     6     m2 = 2.6207 → 2/3
  1/2   1/6    1     m3 = 0.4368 → 1/9

Logarithmic regression
• Solve an optimization problem over the above-diagonal elements pij (known), with the weights wi to be determined:
  minimize Σ_{i=1}^{n} Σ_{j=i+1}^{n} (ln pij − ln wi + ln wj)²
• Can also deal with missing evaluations
• Can also be extended to the case with multiple decision makers

Normalization and aggregation
• Usually, the sum of the judgments is normalized to 1:
  Σ_{j=1}^{n} wj = 1
• However, other normalizations could also be used (leading to different decision behavior!), e.g.
  (Σ_{j=1}^{n} wj^p)^{1/p} = 1, p ≥ 1
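The eigenvector solution can be computed without Matlab's eig by power iteration, which converges to the principal eigenvector of a positive matrix. A Python sketch for the consistent 3 × 3 example above:

```python
# Principal-eigenvector weights by power iteration (instead of Matlab's
# eig), for the 3x3 example matrix, which happens to be consistent.
P = [[1, 1/3, 2],
     [3, 1, 6],
     [1/2, 1/6, 1]]
n = len(P)

w = [1.0] * n
for _ in range(100):
    w = [sum(P[i][j] * w[j] for j in range(n)) for i in range(n)]
    total = sum(w)
    w = [x / total for x in w]        # normalize: the weights sum to 1

# At convergence P w = lam_max * w and sum(w) == 1, so lam_max = sum(P w).
lam_max = sum(sum(P[i][j] * w[j] for j in range(n)) for i in range(n))
ci = (lam_max - n) / (n - 1)          # Saaty's inconsistency index

print([round(x, 4) for x in w])  # [0.2222, 0.6667, 0.1111] = [2/9, 2/3, 1/9]
print(round(lam_max, 4))         # 3.0, so ci is ~0: a consistent matrix
```

Because the matrix is consistent, all solution methods (row sums, geometric means, eigenvector) agree on [2/9, 2/3, 1/9].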
Final weights
Scores for the three machines:
  Aztec     0.255 = 0.833 × 0.875 × 0.222 + 0.833 × 0.125 × 0.558 + 0.167 × 0.569 × 0.167
                    + 0.167 × 0.148 × 0.286 + 0.167 × 0.074 × 0.625 + 0.167 × 0.209 × 0.127
  Barton    0.541
  Congress  0.204

Consistency of comparisons
• A reciprocal preference matrix is said to be cardinally consistent when all triads of elements satisfy pik = pij · pjk
• The above condition is also called a transitivity relation
• Since the comparisons are pairwise, consistency is usually not guaranteed

Sources of inconsistency
• The limited range of the judgment scale
• The integer-valued judgment scale
• Judgment errors of the decision maker

Inconsistency index
• Saaty proposed the inconsistency index CI = (λmax − n) / (n − 1)
• The inconsistency index is equal to zero when the reciprocal preference matrix is cardinally consistent
• Since the eigenvalues sum to n, CI equals minus the average value of the eigenvalues smaller than λmax
• Saaty advises the inconsistency index to be smaller than 0.1

Rank reversal
• The addition of a new alternative can change the mutual ranking of the other alternatives, e.g.

   1     5     4
  1/5    1     3     gives a1 ≻ a2 ≻ a3
  1/4   1/3    1

   1     5     4    1/3
  1/5    1     3     1
  1/4   1/3    1     5    gives a1 ≻ a3 ≻ a4 ≻ a2
   3     1    1/5    1

Strengths of the AHP
• Formal structuring of the decision problem
• Simplicity of the pairwise comparisons
• Redundancy allows consistency to be checked
• Versatility

Criticisms of the AHP
• The conversion from the verbal to the numeric scale
• Problems of the 1 to 9 scale
• The meaningfulness of responses to questions
• New alternatives can reverse the rank of existing alternatives
• The number of comparisons required may be large
• The axioms of the method

Summary
This lecture:
• Bayes decisions
  • Payoff matrix
  • EVPI, EVE
  • Construction of decision trees
• Analytic Hierarchy Process
  • Pairwise comparisons
  • Hierarchy construction
  • Solution methods
Next topics:
• Preparation for the exam
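Saaty's inconsistency index CI = (λmax − n)/(n − 1) can be checked numerically; the Python sketch below uses power iteration to estimate λmax for the earlier 3 × 3 example matrix, whose judgments conflict (p12 · p23 = 3/7 while p13 = 5):

```python
# Estimate lam_max by power iteration and compute Saaty's CI for the
# inconsistent example matrix (p12 * p23 = 3/7, far from p13 = 5).
P = [[1, 3, 5],
     [1/3, 1, 1/7],
     [1/5, 7, 1]]
n = len(P)

w = [1.0] * n
for _ in range(1000):
    w = [sum(P[i][j] * w[j] for j in range(n)) for i in range(n)]
    total = sum(w)
    w = [x / total for x in w]

lam_max = sum(sum(P[i][j] * w[j] for j in range(n)) for i in range(n))
ci = (lam_max - n) / (n - 1)

print(round(lam_max, 3), round(ci, 3))
assert ci > 0.1   # well above Saaty's 0.1 threshold: the judgments conflict
```

In practice such a matrix would be sent back to the decision maker for re-elicitation rather than used to derive weights.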