1BK40
Business Analytics &
Decision Support
Lecture 1, 2017 – 2018
Introduction
dr. M.Firat
Pav.D06, m.firat@tue.nl
Outline
• Motivation
• Course organization
• Data-driven decision making
  • Data mining and analytics (DMA)
• Multi-attribute decisions
Why Business Analytics and Decision Support?
Business decisions
• Almost all activities in running businesses involve decision making
  • Recognize the state of the market
  • Select the right course of action
  • Plan a strategy
  • Main task of managers
• Decisions need actionable information
• Decision analysis helps in dealing with decision problems in a structured way
• However, there is more to decision making: e.g. organizational support (getting people behind decisions)
An example business problem
• TelCo, a major telecommunications firm, wants to investigate its problem with customer attrition, or "churn"
• Let's consider this for now as a marketing problem only
www.flickr.com/photos/yourdon

How would you go about targeting some customers with a special offer, prior to contract expiration? Think about what data should be available for your use.
Another example
• Company, a major producer of semi-conductors, wants to hire a new sales manager
• How would you select your new recruit?

How does this decision differ from the previous one? Think about the data available, but also the decision goals.
flickr.com/photos/alaig
Differences between two examples
• Amount of available data
• Type, source, quality of data
• Amount and type of uncertainty
• Number of stakeholders
• Number of goals
• Number of decision moments
• Etc.
Overall objective
• How to support structured decision making in a business setting, given
  • A data-rich environment? → Data Science
  • A data-poor environment? → Decision Science
Course Organization
Course goals
• discuss the properties of modern analytics and
decision support systems for businesses
• list several analytics and decision making methods
• distinguish between different analytics functions
• analyze data by using data science methods
• apply data science techniques for improved decision
support
• analyze & solve discrete choice business problems
Lecturers
• Prof. dr. ir. Uzay Kaymak (responsible
lecturer)
Pav. D.02
u.kaymak@tue.nl
• Dr. M.Firat (lecturer)
Pav. D.06
m.firat@tue.nl
• Information from secretariat
IEIS, Information Systems
is@tue.nl
Meetings
• 16 sessions (2 x 2 hrs./week, 8 weeks long)
• Wednesday 15:45-17:30, Auditorium 8
• Friday 10:45-12:30, LUNA 1.050
Lectures: introduce and explain main subjects
Embedded practice sessions, instructions (Matlab)
• Two guest lectures (from industry or academia)
• Content of guest lectures is part of mandatory material
• Q&A sessions at the end of the quartile
• Further questions can be asked by e-mail or during
separate meetings upon appointment
Planning – 1

lecture | week | date        | topic
1       | 1    | 06 Sep.'17  | Introduction to course materials; Introduction: Data-Analytic Thinking
2       | 1    | 08 Sep.'17  | Matlab Session 1
3       | 2    | 13 Sep.'17  | Business Problems and Data Science Solutions; Data mining; Introduction to predictive modeling
4       | 2    | 15 Sep.'17  | Introduction to Predictive Modeling; Visualizing Segmentations; Fitting a Model to Data; Classification via Mathematical Functions; Regression via Mathematical Functions
5       | 3    | 20 Sep.'17  | Overfitting and Its Avoidance; Evaluating Classifiers; Cross-Validation; Expected Value Analytical Framework
6       | 3    | 22 Sep.'17  | Matlab Session 2
7       | 4    | 27 Sep.'17  | Similarity, Neighbors; Clustering
8       | 4    | 29 Sep.'17  | Visualizing Model Performance; Ranking, Profit Curves; ROC Graphs and AUC; Cumulative Response and Lift Curves; Evidence and Probabilities; Combining Evidence Probabilistically; Applying Bayes' Rule to Data Science

Wednesday lecture: 15:45-17:30; Friday lecture: 10:45-12:30
*Some sessions might shift if needed
Planning – 2

class | week | date       | topic
9     | 5    | 4 Oct.'17  | Guest Lecture
10    | 5    | 6 Oct.'17  | Matlab Session 3
11    | 6    | 11 Oct.'17 | Introduction to fuzzy systems
12    | 6    | 13 Oct.'17 | Introduction to Decision Support; Decision heuristics; SMART
13    | 7    | 18 Oct.'17 | Fuzzy decision making; Multicriteria decisions
14    | 7    | 20 Oct.'17 | Bayesian decision theory; Analytic hierarchy process
15    | 8    | 25 Oct.'17 | Matlab Session 4
16    | 8    | 27 Oct.'17 | Guest Lecture; Q&A session; Preparation for exam

Wednesday lecture: 15:45-17:30; Friday lecture: 10:45-12:30
*Some sessions might shift if needed
Course Material (literature and tools)
• One book (mandatory):
  • Data Science for Business (O'Reilly Media)
• Slides and handouts (mandatory)
  • Distributed through CANVAS
• Scientific papers, books and web pages (mandatory)
  • Announced through CANVAS
• Exercises for the instructions
• Software tools (mandatory)
  • Matlab
MATLAB 2016b
• Download and install Matlab 2016b before the next session (time consuming):
• Check your Windows version (x86 32-bit vs. x64 64-bit):
  http://windows.microsoft.com/en-us/windows/which-operating-system
• Install the Matlab version corresponding to your OS (x86 32-bit vs. x64 64-bit):
  https://intranet.tue.nl/en/university/services/ict-services/help-and-support/software-tuedevice/matlab/
• Do not choose the Notebook installation or the Notebook installation for Electrical Engineering.
• Recommended: Personal installation – Matlab & the Fuzzy Logic, Statistics, Global Optimization, and Optimization toolboxes (more can be added later).
  • Not required: Simulink (& blocksets), builder, coder, compiler, ...
• Alternative: Full installation (more disk space, many functions not needed, ...)
Book
• Title: Data Science for Business: What you need to know about data mining and data-analytic thinking
• Authors: Foster Provost and Tom Fawcett
• Publisher: O'Reilly Media
• Edition: 1st edition (August 19, 2013)
• ISBN-10: 1449361323
• ISBN-13: 978-1449361327
• Available in print or as an electronic copy (e-book, pdf)
Assessment
• Components:
  • Assignment 1 – 25%
    − Deadline Assignment 1a: 29 Sep.'17
    − Deadline Assignment 1b: TBA Oct.'17
  • Assignment 2 – 25%
    − Deadline Assignment 2a: TBA Oct.'17
    − Deadline Assignment 2b: TBA Oct.'17
  • Written exam – 50%
• Assignments will be made in groups of 3
• It is not possible to re-sit assignments
• Assignments are valid only in the current academic year
Exam
• Type:
written, open questions, closed book
• Date: 9 Nov.’17.
• Time: 09:00 – 12:00
• Re-sit: TBA
Relation to Information Systems Research
• IS group has four research clusters
• Most of the topics are covered by the BPI cluster
  http://is.ieis.tue.nl/research/bpi/
[Diagram: research clusters and domains – Business Process Engineering, Business Process Intelligence, Business Process Management, Smart Mobility, Health Care]
TU/e Data Science Center
Slide by DSC/e
Data-driven decision making
Business drivers
Nowadays, many decisions must be automated due to
• Large volumes of data
• Availability of online data, which requires real-time
processing and decision making
• Developments in m- and e-business: decisions
anywhere, anytime
• Competitive advantage through fast processing
• Optimization of business processes
(B. Gates, Business @ the Speed of Thought)
A historical note by Bill Gates
Business @ the Speed of Thought: year 1998 vs. year 2017
The production of data is staggering
• People produce 400 million tweets daily
• … and send 3.2 billion likes daily
• They also upload 300 million pictures daily
• Google Voice processes 10 years of spoken text daily
• The UK has 2 million surveillance cameras
• Facebook has 1 billion users
• 800 million users watch 4 billion movies daily
• Medical data doubles every five years
• In 2020 there will be 24 billion internet-connected devices
Source: NRC 08.02.2013
Slide by DSC/e
The Always-On Society
[Diagram: Philips Value Platform – connected products and connected apps deliver sensing and monitoring data, which is turned into meaningful information, an enhanced user experience, and connected solutions]
Slide by DSC/e
Everywhere Analytics
From Deloitte
Real-world examples
Smart business solutions
• Process improvement: monitors and analyses events in an organization and proposes business improvement actions.
• Smart power grids: measures, monitors, and manages energy production, transport, and consumption in heterogeneous distributed grids.
• Clinical decision support: provides instant clinical decision support by correlating information from different, otherwise uncorrelated sources.
Slide by DSC/e
Be part of the customer experience
Customer Analytics

Data as an asset
It is not about what you have, but about what you know about what you have
flickr.com/photos/elizabeth_donoghue
Data-analytic thinking
• Data, and the capability to extract useful knowledge from data, are a (strategic) asset
• Invest in data: quality, collection, storage (can be costly!)
• Invest in models, skills, and methods to process data
• This combination creates value
  • Google
  • Facebook
  • Amazon
Data science
Data science seeks to use all relevant, often complex and hybrid data to effectively tell a story that can be easily understood by non-experts.

It does this by integrating techniques and theories from many fields, including statistics, computational intelligence, pattern recognition, machine learning, online algorithms, visualization, security, uncertainty modeling, and high-performance computing, with the goal of developing the fundamental principles that guide the extraction of knowledge from data.
Slide by DSC/e
Data-driven decisions
• Data science involves principles, processes and techniques for understanding phenomena via the (automated) analysis of data, in order to improve decision making
In today's news (6 Sep.'17)
Where is the added value?
HURRICANE FRANCES was on its way, barreling across the Caribbean,
threatening a direct hit on Florida's Atlantic coast. Residents made for
higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart
Stores decided that the situation offered a great opportunity for one of
their newest data-driven weapons, something that the company calls
predictive technology.
A week ahead of the storm's landfall, Linda M. Dillman, Wal-Mart's chief
information officer, pressed her staff to come up with forecasts based on
what had happened when Hurricane Charley struck several weeks
earlier. Backed by the trillions of bytes' worth of shopper history that is
stored in Wal-Mart's data warehouse, she felt that the company could
"start predicting what's going to happen, instead of waiting for it to
happen," as she put it.
From NY Times’04
Automated text analysis
Key questions to answer
“What happened?”
“Where exactly is the problem?”
“What if these trends continue?”
“How many, how often, where?”
“What’s the best that can happen?”
“What will happen next?”
“What actions are needed?”
“Why is this happening?”
Source: SAS
Levels of analytics capability
What’s the best that can happen?
What will happen next?
What if these trends continue?
Why is this happening?
What actions are needed?
Where exactly is the problem?
How many, how often, where?
What happened?
Source: SAS
Tag Cloud
By J. Reed, http://diginomica.com/2013/12/06/data-science-business-book-pull-off/
Fundamental concepts of data science
• Extracting useful knowledge from data to solve business
problems can be treated systematically by following a
process with reasonably well-defined stages
• From a large mass of data, information technology can
be used to find informative descriptive attributes of
entities of interest
• If you look too hard at a set of data, you will find
something—but it might not generalize beyond the data
you’re looking at
• Formulating data mining solutions and evaluating the
results involves thinking carefully about the context in
which they will be used
Gartner Hype Cycle, 2012
Gartner Hype Cycle, 2016
Multiple facets of data science
DATA SCIENCE connects many fields:
• Data Mining
• Stochastic Networks
• Probability and Statistics
• Data-Driven Innovation and Business
• Data-Driven Operations Management
• Process Mining
• Visualization
• Internet of Things
• Privacy, Security, Ethics, and Governance
• Human and Social Analytics
• Intelligent Algorithms
• Large-Scale Distributed Systems
Slide by DSC/e
Multi-attribute decision making
What if there is no information?
Causes for lack of information
• Problem is very new, not much data available
• Relevant data is not available
• Uncertainty may be too large
• Multiple objectives
• Complex environment
These are typical characteristics of many management decisions
Management decisions
Management decisions are often complex because they involve:
1. Risk and uncertainty
2. Multiple objectives
3. A complex structure
4. Multiple stakeholders

In this course, we will consider mainly discrete choice problems with a small number of alternatives.
Coping with complexity
Difficult for humans, because
• The human mind has limited information-processing capacity and memory (Miller's 7 ± 2)
• To cope with complexity we tend to simplify problems
• This can lead to inconsistencies and biases
The Role of Decision Analysis
• Analysis: ‘divide and conquer’
• Defensible rationale: ‘audit trail’
• Raised consciousness about issues
• Allows participation: commitment
• Insights: creative thinking
• Guidance on information needs
Two approaches
• Normative (prescriptive) decision making
• decision as a rational act of choice
• formal models for rational decision making
• closely related to optimization theory
• Descriptive decision making
decision as a specific information-processing activity
• studies the cognitive processes that lead to decisions
• focus on how information is processed
Example: decision making in the oil industry
The company owns a concession. Perhaps there is oil, perhaps there is not. Do you sell the land and the exploration rights, or do you drill and develop the field?
Characterized by large uncertainty
• How much oil?
• What is recoverable?
  • Given current technology
  • Depending on technological developments
• Price of oil in the future
• Tax developments
• One-off decision
• Attitude towards risk
Example: site selection
Where to open your next store in the chain?
• Small number of alternatives
• Possibly multiple criteria
flickr.com/photos/stadsarchiefbreda/
Characterized by multiple criteria
• Many stakeholders with diverging goals
• Multiple criteria (many attributes)
• Different importance of attributes
• Trade-off amongst attributes known partially
• Large uncertainty from the environment
Decision making methods
• Bayes decision making
• Decision heuristics
• Simple multi-attribute rating technique
(SMART)
• Fuzzy decision making
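SMART, one of the methods listed above, ranks alternatives by a weighted sum of their attribute scores. A minimal sketch, in Python rather than the course's Matlab; the sites, attributes, scores, and weights are all hypothetical:

```python
# Minimal SMART (Simple Multi-Attribute Rating Technique) sketch.
# Alternatives, attributes, scores, and weights are hypothetical.

def smart_score(scores, weights):
    """Weighted sum of attribute scores (weights normalized to sum to 1)."""
    total_w = sum(weights.values())
    return sum(scores[a] * w / total_w for a, w in weights.items())

# Site-selection toy data: 0-100 ratings on each attribute.
weights = {"rent": 0.5, "footfall": 0.3, "accessibility": 0.2}
sites = {
    "Site A": {"rent": 60, "footfall": 80, "accessibility": 70},
    "Site B": {"rent": 90, "footfall": 50, "accessibility": 60},
}

# Rank alternatives by their aggregate score, best first.
ranked = sorted(sites, key=lambda s: smart_score(sites[s], weights), reverse=True)
for s in ranked:
    print(s, round(smart_score(sites[s], weights), 1))
```

Changing the weights changes the ranking, which is exactly the trade-off discussion SMART is meant to make explicit.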
Matlab basics (from the practice session):

x=3 %declaring a variable - this is a comment!
y=5; %another variable; the semicolon suppresses output
%h=x+y %nothing happens: the whole line is a comment
z=x+y %calculation
%Variable names: letters, digits, and underscores; i and j are the
%imaginary unit by default.
a_1=1 %1x1 matrix
b_12=[2 3] %1x2 matrix
c_21=[2; 3] %2x1 matrix
d=3i %complex number
F=[1 2; 3 4]
c_21' %transpose
transpose(c_21) %same as above
%Operators: + - * / ^ and the element-wise variants .* ./ .^
a_1*b_12 %this is fine
a_1*c_21 %also fine
b_12*c_21 %works: 1x2 times 2x1 gives a scalar
b_12.*c_21' %note the .*: element-by-element multiplication, different from above
c_21*b_12 %works: 2x1 times 1x2 gives a 2x2 matrix
b_12*F %works
c_21*F %does not work: inner dimensions disagree
F^2
b_12.^2 %element-wise calculation
v=0:0.5:10 %row vector from 0 to 10 in steps of 0.5
%The max function:
A = [1 3 5];
max(A)
%Getting help: help <command> or doc <command>
help max
doc max
help log
doc log %or click the hyperlink at the bottom
%Output format:
help format
log(10)
format long
log(10)
format %reverts back to the default
log(10)
v = [16 5 9 4 2 11 7 14];
v(3) %extract the 3rd element
v([1 5 6]) %extract the 1st, 5th, and 6th elements
v(3:7) %extract the 3rd through 7th elements
v(5:end) %extract from the 5th to the last element
A = magic(4) %help magic
A(2,4) %row 2, column 4
A(2:4,1:2) %rows 2 to 4, columns 1 to 2
A(3,:) %extract the third row
%Logical indexing
v(logical([1 0 0 0 1 1])) %same as v([1 5 6])
v<10
v(v<10)
help plot %as usual
p4plots=0:0.05:1;
plot(p4plots) %simple plot
p4p=p4plots.^2;
figure,plot(p4plots,p4p,'r') %x,y
hold on
plot(p4plots,p4plots.^3,'--kx','MarkerSize',10)
hold off
xlabel('p(+)')
ylabel('function values')
title('Example of plotting as a function of p(+)')
Further Matlab tutorials:
• http://www.cyclismo.org/tutorial/matlab/
• https://www.mccormick.northwestern.edu/documents/students/undergraduate/introduction-to-matlab.pdf
1BK40 Business Analytics
& Decision Support
Session 3
Business Problems and Data Science Solutions.
Introduction to Predictive Modeling.
Dr. M.Firat
Pav.D06, m.firat@tue.nl
September 13, 2017
Where innovation starts
Notifications
• Keep up to date with the lecture material on Canvas.
• Digital age: lecture handouts may be changed (suggestions, corrections, additions, ...).
• Problems, comments, suggestions:
  • Email m.firat@tue.nl with the subject "[1BK40] <subject>";
  • Correctly addressed emails will be replied to ASAP.
• Assignment 1a will be posted on Canvas after Session 3.
  • Simple problem to be solved by hand (practice for the exam).
  • Real-world problem using Matlab.
  • To be solved as a group.
• Slides serve as reference (hence a bit verbose).
Outline
Today
• Introduction
• From Business Problems to Data Mining Tasks
  • Common types of data mining tasks
  • Supervised vs. unsupervised methods
• Data Mining
• Related Analytics Techniques and Technologies
• Introduction to Predictive Modeling
  • Attribute selection
• Closing remarks

Fundamental concepts: a set of canonical data mining (DM) tasks; the DM process; supervised vs. unsupervised DM; identifying informative attributes.
Data Science and Data Mining
Data science principle:
Data mining is a process with well-understood stages and well-defined subtasks.
• Data mining involves
  • Information technology: discovering and evaluating patterns in data.
  • Data analyst: creativity, business knowledge, and common sense.
• Structured data mining projects are
  • conducted by systematic analysis,
  • not driven by chance and individual good judgment.
Answering Business Questions with DM & Related Techniques
In data analysis, common questions are
• "Who are the most profitable customers?"
• "Is there really a difference between the profitable customers and the average one?"
• "Can we characterize the profitable customers, to get an idea of who they really are?"
• "Will a given new customer be profitable? How much revenue should I expect this customer to generate?"
From Business Problems to Data Mining Tasks
• Every (data-driven) business (decision-making) problem is unique in its goals, desires, and constraints.
  • Problems have their specifics even when they belong to the same case (churn example: Lecture 1's MegTelCo vs. other similar companies);
  • However, there are common subtasks underlying the business problems (example: estimate a given probability from historical data).
From Business Problems to Data Mining Tasks
• Data science aims to decompose a data analytics problem into subproblems,
  • each of which is a known task with available tools,
  • which prevents wasting time and resources, i.e. reinventing the wheel,
  • and which allows people to focus on the parts requiring human involvement.
• For every data mining task there are usually a number of proposed algorithms, so
  • we shall clearly define these tasks to state several fundamental concepts of data science, e.g. classification and regression.
Data mining task: Classification
• Classification (& class probability estimation) attempts to predict to which of a set of classes a given individual belongs.
  • "Among all the customers of a cellphone company, which are likely to respond to a given offer?" (classes are usually binary and mutually exclusive)
• The classification procedure develops a model that determines the class of a new individual.
• A related task is scoring, or probability estimation.
  • A scoring model outputs, for a new individual, the probability (i.e. score) that s/he belongs to each class.
Data mining task: Regression
• Regression (value estimation) attempts to predict the numerical value of some variable for a given individual.
  • "How much will a given customer use the service?" (predicted variable: service usage)
• A regression model is generated by looking at other individuals in the population.
• (Informal) difference between regression and classification:
  • Classification predicts whether something will happen; regression predicts how much of it will happen.
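As a toy illustration of value estimation, the sketch below fits a least-squares line to hypothetical customer data (in Python for brevity; the course labs use Matlab, and the data points are invented):

```python
# Toy value-estimation sketch: fit "service usage" to one attribute
# with ordinary least squares (closed form for a single predictor).
# The customer data points are hypothetical.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the squared error."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical customers: income (k EUR) vs. monthly usage (hours).
income = [20, 30, 40, 50]
usage = [2.0, 3.1, 3.9, 5.0]

a, b = fit_line(income, usage)
print(f"predicted usage at 45 k EUR: {a * 45 + b:.2f} hours")
```

The fitted line is the regression model; predicting for a new customer is just evaluating it at that customer's attribute value.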
Data mining task: Similarity matching
• Similarity matching attempts to identify similar individuals based on the data known about them.
  • Finding similar entities: "What companies are similar to our best business customers?"
  • Making product recommendations: "What people are similar to you in terms of the products they have liked or purchased?"
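One common way to operationalize "similar individuals" is cosine similarity between their attribute vectors. An illustrative Python sketch; the firms and attribute values are hypothetical:

```python
# Similarity-matching sketch: rank candidates by cosine similarity of
# their attribute vectors to a reference entity. Values are hypothetical.
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Each vector: (revenue index, employees index, growth index).
best_customer = (0.9, 0.8, 0.7)
candidates = {
    "Firm X": (0.85, 0.75, 0.6),
    "Firm Y": (0.1, 0.9, 0.2),
}

# Most similar candidate first.
ranked = sorted(candidates,
                key=lambda c: cosine(best_customer, candidates[c]),
                reverse=True)
print(ranked[0])
```

Other distance measures (Euclidean, Jaccard, ...) fit the same pattern; the choice depends on the attributes at hand.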
Data mining task: Clustering
• Clustering attempts to group the individuals in a population together by their similarity, but without regard to any specific purpose.
  • Directly: "Do customers form natural groups or segments?" (groupings of the individuals of a population)
  • As input to decision making: "What products should we offer or develop?", "How should our customer care teams (or sales teams) be structured?"
• Useful in preliminary domain exploration.
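A standard clustering method is k-means, which alternates between assigning points to the nearest center and recomputing each center as the mean of its group. A tiny Python sketch on hypothetical customer points (the course labs use Matlab):

```python
# Clustering sketch: a tiny k-means (Lloyd's algorithm) on 2-D customer
# points, e.g. (age, monthly spend). Data and starting centers are
# hypothetical.
import math

def kmeans(points, centers, iters=10):
    """Assign each point to its nearest center, then recompute means."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda j: math.dist(p, centers[j]))
            groups[i].append(p)
        # New center = coordinate-wise mean of its group (keep old if empty).
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else ctr
                   for g, ctr in zip(groups, centers)]
    return centers, groups

pts = [(25, 20), (27, 22), (26, 25), (60, 80), (62, 78), (58, 82)]
centers, groups = kmeans(pts, centers=[(20, 20), (70, 70)])
print(centers)  # one centroid per discovered segment
```

Note there is no target variable anywhere: the grouping is driven only by similarity, which is what makes this an unsupervised task.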
Data mining task: Co-occurrence grouping
• Co-occurrence grouping attempts to find associations between entities based on transactions involving them.
  • "What items are commonly purchased together?"
• Note: clustering looks at similarity between objects based on the objects' attributes; co-occurrence grouping considers similarity of objects based on their appearing together in transactions.
• Used in recommendation systems (people who bought X also bought Y).
• Result: a description of items that occur together, including statistics on the frequency of the co-occurrence and an estimate of how surprising it is.
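Counting how often item pairs appear together in transactions is the simplest form of co-occurrence grouping. An illustrative Python sketch with hypothetical shopping baskets:

```python
# Co-occurrence sketch: count how often item pairs appear together in
# transactions ("people who bought X also bought Y"). Baskets are
# hypothetical.
from collections import Counter
from itertools import combinations

transactions = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"bread", "butter"},
    {"beer", "salsa"},
]

pair_counts = Counter()
for basket in transactions:
    # Sorting fixes a canonical order, so (a, b) and (b, a) count as one pair.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Real association-rule mining adds measures such as support, confidence, and lift on top of these raw counts to estimate how surprising a co-occurrence is.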
Data mining task: Profiling
• Profiling (or behavior description) attempts to characterize the typical behavior of a group or population.
  • "What is the typical cellphone usage of this customer segment?"
• Often used to establish behavioral norms for anomaly detection (e.g. fraud detection).
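A minimal profiling sketch: learn a behavioral norm from history and flag values far outside it (Python; the usage numbers are hypothetical):

```python
# Profiling sketch: learn a norm (mean and spread of monthly call
# minutes) and flag observations far outside it, as in fraud detection.
# The usage history is hypothetical.
import statistics

history = [310, 295, 320, 305, 290, 315, 300]  # typical monthly minutes
mu = statistics.mean(history)
sigma = statistics.stdev(history)

def is_anomalous(value, k=3):
    """Flag values more than k standard deviations from the norm."""
    return abs(value - mu) > k * sigma

print(is_anomalous(305), is_anomalous(900))
```

The profile itself (here just mean and standard deviation) is the model; richer profiles use per-segment distributions over many behavioral attributes.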
Other data mining tasks
• Link prediction attempts to predict connections between data items.
  • Social network systems: "Since you and Karen share ten friends, maybe you'd like to be Karen's friend?"
• Data reduction attempts to take a large set of data and replace it with a smaller one containing much of the important information.
  • Trade-off: easier processing vs. loss of information.
• Causal modeling attempts to help us understand what events or actions actually influence others.
  • Targeting advertisements to consumers: "Was the higher purchase rate of targeted consumers because the advertisements influenced them?"
  • Sophisticated methods exist for drawing causal conclusions from observational data.
Data mining tasks vs. data analytics problems
• Note that the data analytics problem "recommendation" is used as an example for:
  • Similarity matching;
  • Co-occurrence grouping;
  • Link prediction.
• Recognize the differences and match the correct data mining task to the data analytics problem under study.

Supervised vs. Unsupervised methods
Supervised vs. Unsupervised methods
• Consider the following questions:
  1. Q1: "Do our customers naturally fall into different groups?"
  2. Q2: "Can we find customer groups having particularly high likelihoods of canceling their service soon after their contracts expire?"
• Difference between Q1 and Q2?
  • In Q1 there is no specific target, hence unsupervised data mining.
  • In Q2 there exists a specific target, hence supervised data mining.
Models, Induction, and Prediction
Supervised vs. Unsupervised methods
• Supervised and unsupervised tasks require different techniques.
• Supervised tasks:
  • require (actual) data on the target.
  • involve classification, regression, and causal modeling.
• Unsupervised tasks:
  • cannot guarantee meaningful or useful results for any particular purpose.
  • involve clustering, co-occurrence grouping, and profiling.
• Similarity matching, link prediction, and data reduction can be either supervised or unsupervised.
• Problem: it might be useful to know whether a given customer will stay for at least six months, but there is only data for two months.
Note on classification and regression
• Classification and regression are distinguished based on the type of target:
  • Regression involves a numeric target;
  • Classification involves a categorical (often binary) target.
• Consider the following questions:
  • "Will this customer purchase service S1 if given incentive I?" – classification
  • "Which service package (S1, S2, or none) will a customer purchase if given incentive I?" – classification
  • "How much will this customer use the service?" – regression
  • "What is the probability that the customer will continue?" – classification, with a categorical target
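The distinction can be made concrete with a nearest-neighbor predictor: the same neighbor lookup does classification when the target is categorical (majority vote) and regression when it is numeric (averaging). An illustrative Python sketch; the customers and targets are hypothetical:

```python
# Sketch of the classification/regression distinction via k-NN: one
# lookup, two target types. Training data are hypothetical.
from collections import Counter

def knn_predict(train, x, k=3):
    """train: list of (feature, target) pairs; 1-D feature for simplicity."""
    neighbors = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    targets = [t[1] for t in neighbors]
    if all(isinstance(t, str) for t in targets):
        # Categorical target: classification by majority vote.
        return Counter(targets).most_common(1)[0][0]
    # Numeric target: regression by averaging.
    return sum(targets) / len(targets)

# Feature: monthly usage; targets: churn label vs. predicted spend.
churn = [(20, "stay"), (25, "stay"), (60, "churn"), (65, "churn"), (70, "churn")]
spend = [(20, 2.0), (25, 2.5), (60, 6.0), (65, 6.5), (70, 7.0)]

print(knn_predict(churn, 63))  # a class label
print(knn_predict(spend, 63))  # a numeric estimate
```

The algorithm is the same; only the target type, and hence the way neighbor targets are combined, differs.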
Data Mining Process

Data mining and KDD
• Goal of "data mining": mining of patterns and knowledge from data.
• Data mining is often set in the broader context of Knowledge Discovery in Databases (KDD).
• The precise boundaries of the data mining part within the KDD process are not easy to state (fuzzy).
https://nocodewebscraping.com/difference-data-mining-kdd/
CRISP-DM
• Alternative, more industry-driven view of KDD: CRISP-DM (Cross Industry Standard Process for Data Mining)
CRISP: Business Understanding
• Understand the problem to be solved:
  • Business projects rarely come as clear and unambiguous data mining problems.
  • Going through the process once without having solved the problem is, generally speaking, not a failure.
• The analyst's creativity plays an important role.
  • Design team task: thinking carefully about the problem and the use scenario (more on this in future lectures).
• Designing the solution is an iterative process of discovery.
  • Decompose the problem into sub-problems, each involving building models for classification, regression, and so on.
CRISP: Data Understanding
• Data: the available raw materials.
• Historical data: often collected for purposes unrelated to the current business problem.
  • Understand the strengths and limitations of the data; there is rarely an exact match with the problem.
• Estimate the costs and benefits of each data source.
  • Data may be available virtually for free or may require effort to obtain. Is further investment merited?
• It is necessary to uncover the structure of the business problem and of the data that are available:
  • Credit card fraud: nearly all fraud is identified and reliably labeled (by the bank or the customer).
  • Medicare fraud: medical providers (legitimate service providers) themselves use the billing system to submit (possibly false) claims. What exactly should the "correct" charges be? No answer, hence no labels.
• Match the business problem to one or several DM tasks.
CRISP: Data Preparation
• Data: (often) to be manipulated and converted into other forms for
  better results (time consuming).
  • Convert data to tabular format.
  • Remove or infer missing values.
  • Convert data to different types.
• Match data to the requirements of DM techniques.
• Select the relevant variables.
• Normalize or scale numerical variables.
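The preparation steps above (inferring missing values, scaling numerical variables) can be sketched in a few lines. This is an illustrative Python sketch — the course labs use Matlab — and the attribute names are made up, not from any course data set:

```python
import statistics

# Toy tabular records with one missing balance value (None).
records = [
    {"age": 40, "balance": 115.0},
    {"age": 52, "balance": None},
    {"age": 35, "balance": 30.0},
]

# Infer missing values with the mean of the observed ones.
observed = [r["balance"] for r in records if r["balance"] is not None]
mean_balance = statistics.mean(observed)
for r in records:
    if r["balance"] is None:
        r["balance"] = mean_balance

# Min-max scale a numeric attribute to [0, 1].
def min_max_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scaled_balance = min_max_scale([r["balance"] for r in records])
print(scaled_balance)  # [1.0, 0.5, 0.0]
```

Mean imputation and min-max scaling are only two of many options; which preparation steps apply depends on the DM technique chosen later.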
CRISP: Modeling & Evaluation
• Modeling: this is the primary place where DM techniques are
  applied to the data (core part of this course!)
• Evaluation: assess the DM results rigorously.
  • Gain confidence that results are valid and reliable;
  • Ensure that the model satisfies the original business goals and
    supports decision making.
  • Includes both quantitative and qualitative assessments.
CRISP: Deployment
• Models are put into real use in order to realize some return on
  investment:
  • Implement a predictive model in some business process;
  • Example: predict the likelihood of churn in order to send special
    offers to customers who are predicted to be particularly at risk.
• Trend: the DM techniques themselves are deployed.
  • Systems automatically build and test models in production.
• Rule discovery: simply use the discovered rules.
• Involve data scientists in the final deployment.
• The process of mining data produces a great deal of insight into the
  business problem and the difficulties of its solution.
Note on data mining and its use
• Mining the data to find patterns and build models is different from
  using the results of data mining.
Related Analytics Techniques and Technologies
• Software development:
  • CRISP looks similar to a software development cycle.
  • DM is closer to research (explorative analysis) than it is to
    engineering.
  • DM requires skills that may not be common among programmers.
• Statistics:
  • Understand different data distributions.
  • How to use data to test hypotheses.
  • Many of the DM techniques have their roots in statistics.
• Database querying, Data Warehousing and OLAP:
  • No discovery of patterns or models.
  • No modeling or automatic pattern finding.
  • Extract the data you need for DM.
  • May be seen as a facilitating technology of DM.
Answering Business Questions with DM & Related Techniques
• Who are the most profitable customers?
  • Straightforward database query if “profitable” can be defined clearly
    based on existing data.
  • Standard query tool.
• Is there really a difference between the profitable customers and the
  average customer?
  • Question about a hypothesis;
  • Statistical hypothesis testing required.
• But who really are these customers? Can I characterize them?
  • Individual customers: database query.
  • Summary statistics.
  • Deeper analysis: determine characteristics that differentiate profitable
    customers from the rest (DM).
• Will a given new customer be profitable? How much revenue should
  I expect this customer to generate?
  • DM techniques that examine historical customer records and produce
    predictive models of profitability.
Introduction to Predictive Modeling
• Predictive modeling as supervised segmentation:
  • How to segment the population w.r.t. something that we would like
    to predict?
  • Which customers are likely to leave the company when their contracts
    expire?
  • Which potential customers are likely not to pay off their account
    balances?
• Find important, informative variables (attributes) of the entities
  w.r.t. a target.
  • Do some variables reduce our uncertainty about the target value?
• Select informative subsets in large databases (also for
  data reduction).
• Tree induction: based on finding informative attributes.
• Information: a quantity that reduces uncertainty.
Models, Induction, and Prediction
• Model: an abstraction of a real-life process/case.
  • Preserves, and sometimes further simplifies, the relevant information.
  • In the form of mathematical models or logical rules.
• Predictive model: a formula for estimating the value of the target
  variable.
  • Examples are classification and regression models.
  • Credit scoring, spam filtering, fraud detection.
• Prediction: estimate an unknown value.
• Descriptive modeling: presenting the main features of the data.
Models, Induction, and Prediction
• Supervised learning:
  • Create a model describing a relationship between a set of selected
    variables and a target variable.
  • The model estimates the value of the target variable as a function of
    the features.
• Induction: generalizing from specific cases to general rules.
• Deduction: from general rules and specific facts to create other
  facts.
• An important question in data mining:
  • How to select some attributes that will best divide the sample w.r.t.
    our target variable?
Supervised Segmentation
• Intuitive segmentation: finding subgroups of the population with
  different values of the target variable.
• Segmentation
  • is used to predict the target variable.
  • also provides ‘human understandable’ patterns in the data.
  • “Middle-aged professionals who reside in New York City on average
    have a churn rate of 5%”.
• Important: identify which variables are useful in explaining the
  target variable.
A simple segmentation problem
• Target variable: whether a person becomes a loan write-off.
• Several attributes in the data:
  • head-shape: square, circular;
  • body-shape: rectangular, oval; body-color: black, white.
• Which attributes are best to segment people into groups of
  ‘write-offs’ and ‘non-write-offs’?
• Aim for the resulting segments to be as ‘pure’ as possible.
• Purity: homogeneity of segments w.r.t. the target variable.
Supervised Segmentation: Purity
• Body-color “black” would create a pure group, if person 2 were not
  there.
• Trade-off: purity of subsets vs. equal-size subsets.
• How to split the target variable into more groups?
• How to create a supervised segmentation using numerical attributes?
• Purity: related to ‘entropy’ and ‘information gain’.
Supervised Segmentation: A complete example
Supervised Segmentation: Entropy
• Entropy: a measure of disorder, i.e. ‘how impure’ the segment is.
• Let pi be the relative percentage of property i within the set:

  entropy = − p1 log(p1) − p2 log(p2) − . . .

  where each pi ranges from 0 (none) to 1 (all).
• The logarithm in the entropy calculation is generally taken as base 2
  (always indicate it clearly in your calculations!).
Supervised Segmentation: Entropy
• Example: consider a set S of 10 people, with seven of the
  non-write-off class and three of the write-off class.
  • We have p_non-write-off = 0.7 and p_write-off = 0.3.
• Entropy for the whole set:

  entropy(S) = − p_non-write-off log2(p_non-write-off) − p_write-off log2(p_write-off)
             = − 0.7 log2(0.7) − 0.3 log2(0.3)
             ≈ −(0.7 × (−0.51)) − (0.3 × (−1.74))
             ≈ 0.88
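The entropy of a segment is a one-line computation. A Python sketch (illustrative — the course labs use Matlab) that reproduces the 0.88 of the example set S:

```python
import math

def entropy(probabilities):
    """Entropy of a segment: -sum(p_i * log2(p_i)), skipping p = 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Set S from the slide: 7 non-write-offs and 3 write-offs out of 10 people.
entropy_S = entropy([0.7, 0.3])
print(round(entropy_S, 2))  # 0.88
```

A 50/50 segment gives the maximum entropy of 1.0 and a pure segment gives 0, matching the definition of purity above.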
Supervised Segmentation: Information Gain
• Using the entropy formula, we want to know
  • how informative an attribute is w.r.t. our target;
  • how much gain in information an attribute brings us.
• Information gain
  • measures how much an attribute improves, i.e. decreases, the entropy.
  • shows the change in entropy due to any amount of new information.
  • is calculated by splitting the set on all values of a single attribute.
  • compares the purity of the children (C = {ci}) to that of their parent (P):

  IG(P, C) = entropy(P) − [ p(c1) entropy(c1) + p(c2) entropy(c2) + . . . ]

  where the entropy of each child ci is weighted by the proportion of
  instances belonging to that child, p(ci).
Supervised Segmentation: Information Gain
Example: splitting the ‘write-off’ sample into two segments, based on
splitting the Balance attribute (account balance) at 50K.

  Entropy(parent)      = − p(●) log2 p(●) − p(★) log2 p(★)
                       ≈ −0.53 × (−0.9) − 0.47 × (−1.1) ≈ 0.99
  Entropy(left child)  = − p(●) log2 p(●) − p(★) log2 p(★)
                       ≈ −0.92 × (−0.12) − 0.08 × (−3.7) ≈ 0.39
  Entropy(right child) = − p(●) log2 p(●) − p(★) log2 p(★)
                       ≈ −0.24 × (−2.1) − 0.76 × (−0.39) ≈ 0.79
Supervised Segmentation: Information Gain
Example: splitting the ‘write-off’ sample into two segments, based on
splitting the Balance attribute (account balance) at 50K.

  IG(parent, children) = entropy(parent)
                         − ( p(left child) entropy(left child)
                           + p(right child) entropy(right child) )
                       ≈ 0.99 − (0.43 × 0.39 + 0.57 × 0.79) ≈ 0.37
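The information gain above can be checked end-to-end. A Python sketch (illustrative, not the course's Matlab) using the proportions from the Balance-at-50K example:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_probs, children):
    """children: list of (weight, class-probability list) pairs,
    where weight is the proportion of parent instances in that child."""
    weighted = sum(w * entropy(p) for w, p in children)
    return entropy(parent_probs) - weighted

# Balance split at 50K from the slide: the parent is 53% dots / 47% stars;
# the left child holds 43% of the instances, the right child 57%.
ig = information_gain([0.53, 0.47],
                      [(0.43, [0.92, 0.08]),
                       (0.57, [0.24, 0.76])])
print(round(ig, 2))  # 0.37
```

Note that the slide rounds intermediate entropies (0.39, 0.79); computing with full precision still gives IG ≈ 0.37.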
Supervised Segmentation: Information Gain
• Same example, but a different candidate split: residence.
  • The residence variable does have a positive information gain, but it is
    lower than that of balance.
  • Homework: check/perform the calculations in the book.
Information gain for numeric attributes
• “Discretize” numeric attributes by split points.
  • How to choose the split points that provide the highest information
    gain?
• Segmentation for regression problems:
  • Information gain is not the right measure.
  • We need a measure of purity for numeric values.
  • Look at the reduction of VARIANCE (zero = ‘pure’).
• To create the best segmentation given a numeric target, (possibly)
  choose the one that produces the best weighted average variance
  reduction.
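Weighted variance reduction plays the same role for a numeric target that information gain plays for a categorical one. A Python sketch (illustrative — the course uses Matlab) on a made-up numeric target:

```python
import statistics

def weighted_variance_reduction(parent, children):
    """Purity gain for a numeric target: parent variance minus the
    weighted average of the child segment variances."""
    n = len(parent)
    weighted = sum(len(c) / n * statistics.pvariance(c) for c in children)
    return statistics.pvariance(parent) - weighted

# A clean split separates low and high target values into pure segments,
# so the weighted child variance drops to zero.
parent = [1, 1, 1, 9, 9, 9]
reduction = weighted_variance_reduction(parent, [[1, 1, 1], [9, 9, 9]])
print(reduction)  # 16.0
```

A split that mixes low and high values in both children would leave the child variances high and the reduction close to zero.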
Questions?
Next Session
• Lecture on 15 Sep. ’17, Chapters 3–4:
  • Introduction to Predictive Modeling.
  • Visualizing Segmentation.
  • Fitting a Model to Data.
  • Classification via Mathematical Functions.
  • Regression via Mathematical Functions.
1BK40 Business Analytics
& Decision Support
Session 4
Introduction to Predictive Modeling.
Visualizing Segmentation.
Dr. M.Firat
Pav.D06, m.firat@tue.nl
September 18, 2017
Where innovation starts
Notifications
• Problems, comments, suggestions:
  • Email m.firat@tue.nl with the subject “[1BK40] <subject>”;
  • Correctly addressed emails will be replied to ASAP.
• Assignment 1a will be posted on Canvas after this Session (4).
  • Simple problem to be solved by hand (practice for the exam).
  • Real-world problem using Matlab.
  • To be solved as a group.
• Slides serve as reference (hence a bit verbose).
Outline
Today
Introduction to Predictive Modeling
Example: Attribute Selection with Information Gain - Session 3
Supervised Segmentation with Tree-Structured Models - Session 4
Visualizing Segmentations
Probability Estimation
Classification via Mathematical Functions
Regression via Mathematical Functions
Class Probability Estimation and Logistic “Regression”
Logistic Regression versus Tree Induction
Closing remarks
Fundamental concepts: Identifying informative attributes; Segmenting
data by progressive attribute selection.
Example: Entropy - Session 3
Supervised Segmentation: Information Gain
Example: splitting the ‘write-off’ sample into two segments, based on
splitting the Balance attribute (account balance) at 50K.

  Entropy(parent)      = − p(●) log2 p(●) − p(★) log2 p(★)
                       ≈ −0.53 × (−0.9) − 0.47 × (−1.1) ≈ 0.99
  Entropy(left child)  = − p(●) log2 p(●) − p(★) log2 p(★)
                       ≈ −0.92 × (−0.12) − 0.08 × (−3.7) ≈ 0.39
  Entropy(right child) = − p(●) log2 p(●) − p(★) log2 p(★)
                       ≈ −0.24 × (−2.1) − 0.76 × (−0.39) ≈ 0.79
Supervised Segmentation: Information Gain
Example: Splitting the ‘write-off’ sample into two segments, based on
splitting the Balance attribute (account balance) at 50K.
IG (parent,children) = entropy (parent) −
(p (leftchild) entropy (leftchild)
+p (rightchild) entropy (rightchild))
≈ 0.99 − (0.43 × 0.39 + 0.57 × 0.79) ≈ 0.37.
Example: Attribute Selection with
Information Gain - Session 3
Example: edible and poisonous mushrooms
Example data is taken from The Audubon Society Field Guide to North
American Mushrooms¹.
• The data contains 5,644 edible and poisonous mushroom examples
  (instances).
• Every instance has 22 categorical (non-target) attributes.
• There are 2,156 poisonous (p_ps ≈ 0.38) and 3,488 edible (p_ed ≈ 0.62)
  mushrooms.
• entropy(parent) = −p_ps log(p_ps) − p_ed log(p_ed) ≈ 0.96.
We would like to answer:
“Which single attribute is the most useful for distinguishing edible
mushrooms from poisonous ones?”
Using the concept of information gain, rephrase the question as
“Which attribute is the most informative?”
¹ https://archive.ics.uci.edu/ml/datasets/Mushroom
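"Which attribute is the most informative?" is answered by computing the information gain of every candidate attribute and taking the maximum. A Python sketch on a tiny made-up sample in the spirit of the mushroom question (not the actual UCI data; attribute names are invented), where 'odor' separates the classes perfectly and 'color' does not:

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)) if c)

def information_gain(rows, labels, attribute):
    """IG of splitting (rows, labels) on one categorical attribute."""
    n = len(rows)
    child_entropy = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attribute] == value]
        child_entropy += len(subset) / n * entropy(subset)
    return entropy(labels) - child_entropy

rows = [{"odor": "foul", "color": "brown"},
        {"odor": "foul", "color": "white"},
        {"odor": "none", "color": "brown"},
        {"odor": "none", "color": "white"}]
labels = ["poisonous", "poisonous", "edible", "edible"]

best = max(["odor", "color"], key=lambda a: information_gain(rows, labels, a))
print(best)  # odor
```

On the real mushroom data the same scan over all 22 attributes is what produces the entropy charts on the following slides.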
Default entropy for mushroom
Entropy vs. values GILL-COLOR
Entropy vs. values SPORE-PRINT-COLOR
Entropy vs. values ODOR
Supervised Segmentation with
Tree-Structured Models - Session 4
Supervised Segmentation with Tree-Structured Models
• Select the single variable that gives the most information gain.
  • Very simple segmentation: two segments.
• Single attribute selection alone is not sufficient.
  • So we need multi-attribute selection: a decision tree.
• In a decision tree
  • the topmost node in the tree is the root node.
  • a node of the tree denotes a test on an attribute.
  • each branch denotes the outcome of a test.
  • at a leaf node no attribute test is conducted.
  • each leaf node holds a segment label.
Decision trees
• Main purpose of creating homogeneous regions:
  • to predict the target variable of a new, unseen instance by
    determining which segment it falls into.
• Predict: Claudio, with Balance=115K, Employed=No, and Age=40.
Decision trees
• Building manually based on expert knowledge:
  • time consuming.
  • hard to avoid redundancy, contradictions, inefficiency, ...
• Building automatically using induction:
  • recursively partition the instances based on their attributes.
  • easy to understand & relatively efficient.
Building Decision trees
• Recursively find the best attribute to partition the current data set.
• The goal is partitioning the current group into subgroups that are as
  pure as possible w.r.t. the target variable.
Building Decision trees
Final decision tree
Visualizing Segmentations
• Each internal (decision) node corresponds to a split of the instance space.
• Each leaf node corresponds to an unsplit region of the space (a segment of
  the population).
Trees as Sets of Rules
• Interpretation of classification trees as logical statements.
• If we trace down a single path from the root node to a leaf,
  collecting the conditions as we go, we generate a rule. Each rule
  consists of the attribute tests along the path connected with AND.
• For the previous example the classification tree is equivalent to this
  rule set:
  • IF (Balance < 50K) AND (Age < 50) THEN Class=Write-off
  • IF (Balance < 50K) AND (Age ≥ 50) THEN Class=No Write-off
  • IF (Balance ≥ 50K) AND (Age < 45) THEN Class=Write-off
  • IF (Balance ≥ 50K) AND (Age ≥ 45) THEN Class=No Write-off
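The rule set above is directly executable as nested conditionals. A Python sketch (illustrative, alongside the course's Matlab examples):

```python
def classify(balance_k, age):
    """Rule set equivalent to the example classification tree.
    balance_k is the account balance in thousands."""
    if balance_k < 50:
        return "Write-off" if age < 50 else "No Write-off"
    return "Write-off" if age < 45 else "No Write-off"

# Claudio from the earlier slide: Balance=115K, Age=40.
print(classify(115, 40))  # Write-off
print(classify(30, 60))   # No Write-off
```

Tracing Claudio down the tree (Balance ≥ 50K, Age < 45) lands in the Write-off leaf, exactly as the third rule states.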
Visualizing Segmentations - Matlab

  % From IrisTreeLogistic.m
  yline = 26.5;   % parallel to y axis
  xline1 = 13.5;  % parallel to x axis
  xline2 = 17.5;  % parallel to x axis
  % Figure + lines
  figure,
  gscatter(x(:,1), x(:,2), f_x)
  hold on
  plot([yline yline], [xline1 xline2], 'k--')
  % parallel to y - change the xline1/xline2 to the desired start/end of the line
  plot([min(x(:,1)) max(x(:,1))], [xline1 xline1], 'k--')
  % parallel to x - change the min/max to the desired start/end of the line
  plot([min(x(:,1)) max(x(:,1))], [xline2 xline2], 'k--')
  % parallel to x - change the min/max to the desired start/end of the line
  hold off
  xlabel('pedal width - x1')
  ylabel('sepal width - x2')
Visualizing Segmentations - Matlab
• Each internal (decision) node corresponds to a split of the instance space.
• Each leaf node corresponds to an unsplit region of the space (a segment of
  the population).
Probability estimation tree
• In a decision tree, it is easy to produce probability estimates instead
  of simple classifications:
  • a frequency-based estimate of class membership is calculated.
  • e.g. if a leaf node contains n ‘+’ and m ‘-’ instances, then
    p(+) = n/(n+m).
• This approach may be too optimistic for segments with a very small
  number of instances (overfitting - next lectures).
  • The Laplace correction moderates the influence of leaves with only a few
    instances: p(+) = (n+1)/(n+m+2).
• Two cases: a leaf with 2 ‘+’ instances and no ‘-’ instances, and
  another leaf with 20 ‘+’ and no ‘-’ instances.
• The Laplace correction smooths the estimate of the former down to
  p(+) = 0.75 to reflect this uncertainty, but it has much less effect on
  the leaf with 20 instances (p(+) ≈ 0.95).
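The two leaf cases above can be checked directly. A Python sketch (illustrative, not the course's Matlab) of the frequency-based and Laplace-corrected estimates:

```python
def frequency_estimate(n_pos, n_neg):
    """Raw leaf estimate: p(+) = n / (n + m)."""
    return n_pos / (n_pos + n_neg)

def laplace_estimate(n_pos, n_neg):
    """Laplace-corrected estimate: p(+) = (n + 1) / (n + m + 2)."""
    return (n_pos + 1) / (n_pos + n_neg + 2)

# Leaf with 2 '+' and no '-' vs. leaf with 20 '+' and no '-':
print(frequency_estimate(2, 0))           # 1.0 (same for both leaves)
print(laplace_estimate(2, 0))             # 0.75
print(round(laplace_estimate(20, 0), 2))  # 0.95
```

Both leaves get the same raw estimate of 1.0, while the correction pulls the small leaf much further toward 0.5, reflecting its greater uncertainty.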
Example - The Churn Problem
• Predicting which new customers are going to churn by tree
  induction:
  • Historical data of 20,000 customers, who either stayed or left.
  • Customers have 10 attributes.
  • Calculate the information gain of each attribute.
Example - The Churn Problem
• The highest-information-gain feature (HOUSE) is at the root of the tree.
• When to stop building the tree? (Next lectures)
• How do we know that this is a good model? (Next lectures)
Classification via Mathematical Functions
Classification tree visualization
• Segmentation by a classification tree is as follows.
Plot of the raw data
• Are there other ways to partition the space?
Straight line
• Consider separating by a line: Age = (−1.5) × Balance + 60
Linear discriminant functions
• The goal is to find a linear model that will help our classification task.
• General linear model:

  f(x) = w0 + w1 x1 + w2 x2 + . . .

• To use this model as a linear discriminant, for a given instance
  represented by a feature vector x, we check whether f(x) is positive
  or negative.
• Line in the previous slide: Age = (−1.5) × Balance + 60
  • The line equation gives us the boundary of the segmentation.
  • Classification function using the line:

    class(x) = +  if −1.0 × Age − 1.5 × Balance + 60 ≤ 0
               ●  if −1.0 × Age − 1.5 × Balance + 60 > 0
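The sign check that turns the line into a classifier is a two-line function. A Python sketch (illustrative, alongside the course's Matlab) using exactly the weights from the slide:

```python
def f(age, balance_k):
    """Discriminant from the slide: the boundary Age = -1.5*Balance + 60
    rewritten as f(x) = -1.0*Age - 1.5*Balance + 60 (Balance in K)."""
    return -1.0 * age - 1.5 * balance_k + 60

def classify(age, balance_k):
    """'+' on or below the boundary line, 'dot' above it."""
    return "+" if f(age, balance_k) <= 0 else "dot"

print(classify(40, 115))  # +
print(classify(10, 10))   # dot
```

Changing the weights w0, w1, w2 moves and rotates the boundary; the next slides ask which of the many possible lines to pick.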
Best linear model?
• Which one to pick?
Support vector machine classifier
• Objective: maximize the margin.
Logistic regression classifier
• Objective: maximize the “likelihood” that all labeled examples
  belong to the correct class.
• In the iris example, the target variable has two categories of species:
  • filled dots: Iris Setosa; circles: Iris Versicolor.
Regression via Mathematical Functions
• In regression, we fit a linear function:

  f(x) = w0 + w1 x1 + w2 x2 + . . .

• Decide on the objective function to optimize the model’s fit to the
  data.
• The notion of the fit:
  • how far away are the estimated values from the true values on the
    training data?
  • rephrasing the question: how big is the error of the fitted model?
• Two ways to quantify the above objective:
  • sum of absolute errors.
  • sum of squared errors.
Regression via Mathematical Functions: Raw data
Regression via Mathematical Functions: Regression estimator
Regression via Mathematical Functions: Errors
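For a single feature, minimizing the sum of squared errors has a well-known closed-form solution. A Python sketch (illustrative — the course labs use Matlab) on made-up data that lies exactly on a line:

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = w0 + w1*x (minimizes squared errors)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # exactly y = 2x + 1
w0, w1 = fit_line(xs, ys)
print(w0, w1)  # 1.0 2.0
```

The sum-of-absolute-errors objective has no such closed form and is typically solved iteratively, which is one practical reason squared error is the common default.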
Class Probability Estimation and Logistic
“Regression”
Logistic “Regression” - Main ideas
• For probability estimation, logistic regression uses the same kind of
  linear model that linear regression uses for estimating numerical
  target values.
• The log-odds is defined as a function of the probability of class
  membership.
• The output of the logistic regression model is the probability of
  class membership.
• So, logistic regression is often used as a predictive model for
  estimating the probability of class membership.
Logistic “Regression”: Technical details
• For many applications: estimate the probability that a new instance
  belongs to the class of interest.
• Logistic regression is a model giving accurate estimates of the class
  probability p+(x).
• We will find a linear model f(x) = w0 + w1 x1 + w2 x2 + . . .
  • The log-odds for example x is defined as ln( p+(x) / (1 − p+(x)) ).
  • Equate the log-odds and the linear function: ln( p+(x) / (1 − p+(x)) ) = f(x).
  • Solve for p+(x) and obtain p+(x) = 1 / (1 + exp(−f(x))).
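The last solving step gives the logistic (sigmoid) function, and the round trip back to the log-odds recovers f(x). A Python sketch (illustrative, not the course's Matlab):

```python
import math

def p_plus(f_x):
    """Solve ln(p/(1-p)) = f(x) for p: the logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-f_x))

def log_odds(p):
    """Inverse direction: map a probability back to the linear scale."""
    return math.log(p / (1.0 - p))

print(p_plus(0.0))                      # 0.5
print(round(log_odds(p_plus(2.0)), 6))  # 2.0
```

An instance exactly on the decision boundary (f(x) = 0) gets probability 0.5, and large positive or negative f(x) push the probability toward 1 or 0.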
Logistic “Regression”: Plot
Logistic “Regression”: Objective function
• Ideally, a positive example x+ would have p+(x+) = 1 (and a negative
  example x● would have p+(x●) = 0).
• Compute the likelihood of a particular labeled example x given a set
  of parameters w that produces class probability estimates p+(x):

  g(x, w) = p+(x)      if x is a +
            1 − p+(x)  if x is a ●

• The g function gives the model’s estimated probability of seeing x’s
  actual class given x’s features.
• For a particular parameter set w′, the objective value is the sum of
  the g values across all examples in a labeled data set.
• Maximum likelihood gives the highest probabilities to the positive
  examples and the lowest probabilities to the negative ones.
Logistic “Regression”: Notes
• Logistic regression is a class probability estimation model, not a
  regression model.
• Distinguish between the target variable and the probability of class
  membership.
  • One may be tempted to think that the target variable is a
    representation of the probability of class membership.
  • This is not consistent with how logistic regression models are used.
  • Example: probability of responding p(c responds) = 0.02. Customer c
    actually responded, but the probability is not 1.0! The customer just
    happened to respond this time.
• Training data are statistical “draws” from the underlying
  probabilities rather than representing the underlying probabilities
  themselves.
  • Logistic regression tries to estimate the probabilities with a
    linear-log-odds model based on the observed data.
Example: Logistic Regression versus Tree
Induction
Logistic Regression versus Tree Induction
• Important differences between trees and linear classifiers:
  • A classification tree uses decision boundaries that are perpendicular
    to the instance-space axes.
  • The linear classifier can use decision boundaries of any direction or
    orientation.
  • A classification tree is a “piecewise” classifier that segments the
    instance space recursively, possibly into arbitrarily small regions.
  • The linear classifier places a single decision surface through the entire
    space.
• Which of these characteristics is a better match to a given data
  set?
• Consider the background of the stakeholders:
  • A decision tree may be considerably more understandable to someone
    without a strong background in statistics.
  • The Data Mining team does not have the ultimate say in how models are
    used or implemented!
A simple but realistic example: Wisconsin Breast Cancer Dataset
• Entity in the data set: cell nuclei image.
• Target variable is the diagnosis: two categories, benign and malignant
  (cancerous).
• 10 non-target variables of the cell images are considered.
• From each of these basic variables, three values were computed: the
  mean, standard error, and mean of the three largest values. This
  resulted in 30 measured attributes in the dataset.
• There are 357 benign images and 212 malignant images.
A simple but realistic example: Wisconsin Breast Cancer Dataset
� Linear equation learned by logistic regression:
• Non-zero weights are found for all 30 measured attributes.
• Performance: six mistakes on the entire dataset, accuracy 98.9%.
� A classification tree was learned from the same dataset:
• It has 25 nodes with 13 leaf nodes.
• Accuracy: 99.1%.
� Which one is a better model?
51/58
� Try changing the Matlab example from Iris to WBC.
• Dataset link in the book.
• The Diagnostic dataset has no missing values - http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).
• Difficulties? Next session.
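The exercise can also be sketched in Python (an aside - the course itself works in Matlab; this assumes scikit-learn, whose bundled copy of the WBC Diagnostic data stands in for the UCI download):

```python
# Python sketch of the Iris-to-WBC exercise; scikit-learn is assumed.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # 569 images, 30 attributes

logit = LogisticRegression(max_iter=5000).fit(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Accuracy on the full dataset, as on the slide (training accuracy,
# not generalization performance - that distinction is the next session).
print("logistic:", logit.score(X, y))
print("tree:    ", tree.score(X, y))
```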
Iris Dataset - Matlab
�
52/58
Prepare data
load iris.dat
% PREPARE DATA
% choose data for versicolor and virginica
data = [iris(51:100,:); iris(101:150,:)];
% choose pedal width and sepal width to explain 'class'
x = [data(:,2) data(:,4)];
f_x = data(:,5);
Iris Dataset - Matlab
�
Classification tree analysis
% CLASSIFICATION TREE ANALYSIS
% estimate classification tree
t_iris = fitctree(x, f_x);
% Display tree graphically
view(t_iris, 'Mode', 'graph')
% Display tree as a set of rules
view(t_iris)
% obtain fitted or predicted values of f_x using the tree
pred_tree = predict(t_iris, x);
% count correct number of predicted or fitted values
pc_correct_tree = mean(pred_tree == f_x);
53/58
Iris Dataset - Matlab
�
54/58
Logistic regression analysis
% LOGISTIC REGRESSION
% response has to be categorical for this regression: define new
% response variable y1
y1 = categorical(f_x);
% estimate logistic regression
[b1, dev1, stats1] = mnrfit(x, y1);
% obtain fitted values of the logistic regression
pr_y = mnrval(b1, x);
[~, I] = max(pr_y, [], 2);
pred_logit1 = I + 1;  % map category index {1,2} back to class labels {2,3}
% count correct number of predicted or fitted values
% Note we have to use f_x, not y1, due to the variable types (double vs categorical).
pc_correct_logit = mean(pred_logit1 == f_x);
Iris Dataset - Matlab
�
Visualization of the decision tree boundaries
% Define 3 points taken from t_iris.CutPoint, also visible in the tree
yline = 26.5;   % parallel to y axis
xline1 = 13.5;  % parallel to x axis
xline2 = 17.5;  % parallel to x axis
% Figure + lines
figure,
gscatter(x(:,1), x(:,2), f_x)
hold on
% parallel to y - change the xline1/xline2 to the desired start/end of the line
plot([yline yline], [xline1 xline2], 'k--')
% parallel to x - change the min/max to the desired start/end of the line
plot([min(x(:,1)) max(x(:,1))], [xline1 xline1], 'k--')
% parallel to x - change the min/max to the desired start/end of the line
plot([min(x(:,1)) max(x(:,1))], [xline2 xline2], 'k--')
hold off
xlabel('pedal width - x1')
ylabel('sepal width - x2')
55/58
56/58
Questions?
Today
57/58
� Chapter 3 & 4.
� Matlab Tutorial 1 - http://www.cyclismo.org/tutorial/matlab/
� Matlab Tutorial 2 (comprehensive) - https://www.mccormick.northwestern.edu/documents/students/undergraduate/introduction-to-matlab.pdf
Next
58/58
� Session 5:
• Overfitting and its Avoidance.
• Evaluating Classifiers.
• Cross-Validation.
• Expected Value Analytical Framework.
� Session 6:
• Matlab Session 2.
1BK40 Business Analytics
& Decision Support
Session 5
Generalization. Overfitting
Model Evaluation
Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
September 26, 2017
Where innovation starts
Outline
Today
Generalization and Overfitting
Problem in overfitting
Overfitting Avoidance and Complexity Control
Model Evaluation
Expected Value
Closing remarks
Fundamental concepts: Generalization; Fitting and overfitting.
2/43
3/43
Generalization and Overfitting
Introduction
� Interested in general patterns, not data-specific ones (by chance).
� Q: Why general patterns?
• Answer: They predict well the instances not seen yet.
� Overfitting issue: involving data-specific chance occurrences in the prediction model.
� Example: churn data set, consider an extreme prediction model
• A (table) look-up type prediction model.
• Using historical data, it is 100% accurate (for seen instances!)
• No ability to predict any unseen instances, hence no generalization.
4/43
Generalization and Overfitting
� DM needs to create models that generalize beyond training data.
� Generalization is the property of a model or modeling process whereby the model applies to data that were not used to build the model.
• Models that fit the training data perfectly but do not generalize at all → they overfit.
� Overfitting is the tendency of DM procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.
• "If you torture the data long enough, it will confess." - (Ronald Coase)
� Note: All DM procedures tend to overfit.
• Trade-off between model complexity and the possibility of overfitting.
• You should recognize overfitting and manage complexity in a principled way.
5/43
Holdout data
6/43
�
Evaluation on training data provides no assessment of how well the
model generalizes to unseen cases
�
Idea: “Hold out” some data for which we know the value of the
target variable, but which will not be used to build the model - “lab
test”.
Predict the values of the “holdout data” (aka “test set”) with the
model and compare them with the hidden true target values →
generalization performance.
�
•
There is likely to be a difference between the model’s accuracy
(“in-sample”) and the model’s generalization accuracy.
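The holdout idea fits in a few lines (a Python sketch for illustration - the course's own tooling is Matlab, and `holdout_split` is a hypothetical helper, not a library function):

```python
import random

def holdout_split(data, test_frac=0.3, seed=0):
    """Randomly hold out a fraction of the data as the "lab test" set."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(data) * test_frac)
    test = [data[i] for i in idx[:n_test]]
    train = [data[i] for i in idx[n_test:]]
    return train, test

data = list(range(100))
train, test = holdout_split(data)
print(len(train), len(test))        # 70 30
assert not set(train) & set(test)   # no example appears in both sets
```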
Fitting Graph
�
A fitting graph shows the accuracy of a model as a function of
complexity.
�
Generally, there will be more overfitting as one allows the model to
be more complex.
7/43
Fitting Graph - Churn example
8/43
Overfitting - Tree Induction
�
9/43
Recall tree induction: find important, predictive individual attributes recursively, on smaller and smaller data subsets.
• Eventually, the subsets will be pure - we have found the leaves of our decision tree;
• The accuracy of this tree will be perfect!
• This is the same as the table model, i.e. an extreme example of overfitting!
• This tree should be slightly better than the lookup table, because every previously unseen instance will also arrive at some classification rather than just failing to match.
�
Generally: A procedure that grows trees until the leaves are pure
tends to overfit.
�
If allowed to grow without bound, decision trees can fit any data to
arbitrary precision.
Overfitting - Tree Induction
�
A fitting graph shows the accuracy of a model as a function of
complexity.
10/43
Overfitting - Mathematical Functions
�
11/43
There are different ways to allow more or less complexity in
mathematical functions:
•
Add more variables (attributes, features)
f (x) = w0 + w1 x1 + w2 x2 + w3 x3
•
�
f (x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5
Add non-linear variables
f (x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x1^2 + w5 x2 x3
As you increase the dimensionality, you can perfectly fit larger and
larger sets of arbitrary points:
•
•
Modelers carefully prune the attributes in order to avoid overfitting manual selection;
Automatic feature selection.
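The claim that more parameters can perfectly fit arbitrary points is easy to check directly: a polynomial with N coefficients (degree N−1) interpolates any N points with distinct x values. A Python sketch (`lagrange_fit` is an illustrative helper, not course code):

```python
# With N coefficients, a polynomial passes exactly through any N points
# with distinct x values - more terms means more capacity to "memorize".
def lagrange_fit(xs, ys):
    """Return the interpolating polynomial f(x) in Lagrange form."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

xs = [0.0, 1.0, 2.0, 3.0]
ys = [5.0, -2.0, 7.0, 0.5]       # arbitrary targets
f = lagrange_fit(xs, ys)
# zero "training error": the curve hits every point exactly
assert all(abs(f(x) - y) < 1e-9 for x, y in zip(xs, ys))
```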
Overfitting - Linear functions
12/43
Problem in overfitting
�
Why is overfitting causing a model to become worse?
•
•
•
As a model gets more complex, it is allowed to pick up harmful
“spurious” correlations.
These correlations do not represent characteristics of the population
in general.
They may become harmful when they produce incorrect
generalizations in the model.
13/43
Problem in overfitting
�
A simple two-class problem:
•
•
•
•
�
�
14/43
Classes c1 and c2 , attributes x and y .
An evenly balanced population of examples.
x has two values, p and q, and y has two values, r and s.
General population: x = p occurs 75% of the time in class c1 examples
and in 25% of c2 examples → x provides some prediction of the class.
Both of y ’s values occur in both classes equally → y has no
predictive value at all.
The instances in the domain are difficult to separate, with only x
providing some predictive leverage (75% accuracy).
Problem in overfitting
�
15/43
Let us examine a very small training set of examples from this
domain:
Instance   x   y   Class
1          p   r   c1
2          p   r   c1
3          p   r   c1
4          q   s   c1
5          p   s   c2
6          q   r   c2
7          q   s   c2
8          q   r   c2
�
Note that in this particular dataset y ’s values of r and s are not
evenly split between the classes, so y does seem to provide some
predictiveness.
�
What would a classification tree do?
Problem in overfitting
� Small training set of examples, assume:
• A tree learner would split on x and produce a tree (a) with error 25%.
• In this particular dataset, y's values of r and s are not evenly split between the classes, so y seems to provide some predictiveness.
• Tree induction would achieve information gain by splitting on y's values and create tree (b).
� Tree (b) fits this training sample better than (a), but generalizes worse:
• Because y = r purely by chance correlates with class c1 in this data sample.
• The extra branch in (b) is not just extraneous, it is harmful!
• The spurious y = s branch predicts c2, which is wrong (error rate: 30%).
16/43
Problem in overfitting
� This phenomenon is not particular to decision trees;
� Spurious patterns may arise purely by chance or because of atypical training data;
� There is no general analytic way to avoid overfitting.
17/43
Holdout training and testing
� Cross-validation is a more sophisticated training and testing procedure.
• Not only a simple estimate of the generalization performance, but also some statistics on the estimated performance (mean, variance, ...)
• How does the performance vary across data sets? → assessing confidence in the performance estimate.
18/43
� Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing:
• Split a data set into k partitions called folds (k = 5 or 10).
• Iterate training and testing k times.
• In each iteration, a different fold is chosen as the test data. The other k − 1 folds are combined to form the training data.
• Every example will have been used only once for testing but k − 1 times for training.
• Compute average and standard deviation from the k folds.
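The fold bookkeeping above can be sketched as follows (Python for illustration; `kfold_indices` is a hypothetical helper - Matlab's `cvpartition`, used later in these slides, does the same job and also shuffles):

```python
from collections import Counter

def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.
    (Illustrative helper - no shuffling; real splitters randomize first.)"""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# every example is tested exactly once and trained on k-1 times
tested, trained = Counter(), Counter()
for train, test in kfold_indices(10, 5):
    tested.update(test)
    trained.update(train)
assert all(c == 1 for c in tested.values())
assert all(c == 4 for c in trained.values())
```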
Holdout training and testing
19/43
Cross-Validation in the Churn Dataset
�
Logistic regression vs classification tree
20/43
Learning Curves
21/43
� A learning curve is a plot of the generalization performance against the amount of training data.
� Learning curves vs fitting graphs:
• A learning curve shows the generalization performance - the performance only on testing data, plotted against the amount of training data used.
• A fitting graph shows the generalization performance as well as the performance on the training data, but plotted against model complexity.
• Fitting graphs generally are shown for a fixed amount of training data.
Learning Curves - Churn Dataset
�
The flexibility of tree induction can be an advantage with larger
training sets:
•
the tree can represent substantially nonlinear relationships between
the features and the target.
22/43
Avoiding Overfitting with Tree Induction
� The main problem with tree induction is that it will keep growing the tree to fit the training data until it creates pure leaf nodes.
� Tree induction commonly uses two techniques to avoid overfitting:
• Stop growing the tree before it gets too complex;
• Grow the tree until it is too large, then "prune" it back, reducing its size (and thereby its complexity).
� There are various methods for accomplishing both.
� A simple idea for the first: limit tree size by specifying a minimum number of instances that must be present in a leaf (or, more generally, use the data at the leaf to make a statistical estimate of the value of the target variable for future cases that would fall to that leaf).
• Key concern: What threshold should be used?
• Use a "hypothesis test"! Recall: roughly, a hypothesis test tries to assess whether a difference in some statistic is not due simply to chance.
• Pay attention to multiple comparisons for the 'best' model - It is a trap!
23/43
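The first technique - stop growing once a split would produce too-small leaves - can be sketched with a toy one-dimensional splitter (Python; `grow` is an illustrative stand-in: real learners choose splits by information gain, not by the median):

```python
# Toy sketch of "stop early" pre-pruning: refuse any split that would
# create a child node with fewer than min_leaf training instances.
def grow(rows, min_leaf):
    """rows: list of (x, label) pairs; returns a nested dict tree."""
    labels = [lab for _, lab in rows]
    if len(set(labels)) == 1:                      # pure node -> leaf
        return {"leaf": labels[0]}
    xs = sorted(x for x, _ in rows)
    thr = xs[len(xs) // 2]                         # candidate split point
    left = [r for r in rows if r[0] < thr]
    right = [r for r in rows if r[0] >= thr]
    if len(left) < min_leaf or len(right) < min_leaf:
        # stop growing: predict the majority class at this node
        return {"leaf": max(set(labels), key=labels.count)}
    return {"thr": thr, "l": grow(left, min_leaf), "r": grow(right, min_leaf)}

def count_leaves(t):
    return 1 if "leaf" in t else count_leaves(t["l"]) + count_leaves(t["r"])

# one mislabeled point at x = 3: a tiny min_leaf memorizes it (more leaves),
# a larger min_leaf stops early and absorbs it into a majority-class leaf
rows = [(float(i), "B" if i == 3 else "A") for i in range(10)] \
     + [(float(i), "B") for i in range(10, 20)]
print(count_leaves(grow(rows, 1)), count_leaves(grow(rows, 3)))  # 6 3
```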
A General Method for Avoiding Overfitting
�
Use tools from your arsenal: cross-validation & fitting graph
�
Note: You can also use nested cross-validation to check different
model complexities.
24/43
25/43
Model Evaluation
What is a good model?
�
Case: Wisconsin Breast Cancer Dataset (Session 4).
• Logistic regression: Accuracy 98.9%.
• Classification tree: Accuracy 99.1%.
� Q: Which one is a better model?
26/43
What is a good model?
�
Data scientists and stakeholders should ask what to achieve by
mining data
•
•
Connect the results of mining data back to the goal of the
undertaking.
Have a clear understanding of basic concepts.
�
The goal: Often impossible to measure perfectly (e.g. inadequate
systems, too costly data...).
�
We consider binary classification problems
� Accuracy: a metric that is very easy to measure:

Accuracy = (Number of correct decisions) / (Total number of decisions made) = (TP + TN) / T = 1 − error rate
27/43
Evaluate Classifiers
28/43
�
Problems: Unbalanced classes
�
Confusion matrix: true classes p(ositive) and n(egative) and classes
predicted Y(es) and N(o), respectively.
�
Do not confuse Bad Positives and Harmless Negatives.
Problems with Unbalanced Classes
�
Consider prediction of churn
• Training data including 1000 customers, with 100 churned.
• What is the base rate accuracy?
• Majority classifier: Always output 'no churn', accuracy 90% (!)
29/43
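The base rate computation is trivial but worth writing out (a Python sketch of the slide's numbers):

```python
# Base rate sketch: 1000 customers, 100 of whom churned.
labels = ["churn"] * 100 + ["no churn"] * 900

# the majority ("base rate") classifier always outputs the majority class
majority = max(set(labels), key=labels.count)
accuracy = labels.count(majority) / len(labels)
print(majority, accuracy)  # no churn 0.9
```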
Problems with Unbalanced Classes
�
Models A and B generate accuracies of 80% and 64%
• Model A: evaluated on a balanced data set.
• Model B: evaluated on a representative data set (1:9 ratio)
• However, accuracy of B in a balanced data set: 80%
� Which one is better?
Figure: Confusion matrices a) Model A, b) Model B
• Model A: correctly identifying 60% of negative examples.
• Model B: correctly identifying 60% of positive examples.
30/43
Performance visualization
�
�
Accuracy can be misleading.
Train on balanced data but evaluate with representative data.
31/43
Unequal costs and benefits
� Key question: how much do we care about the different errors and correct decisions?
• Classification accuracy makes no distinction between false positive and false negative errors.
• In real-world applications, different kinds of errors lead to different consequences!
� Example: medical diagnosis
• Patient is told she has cancer (although she does not) - false positive: expensive, but not life threatening.
• Patient has cancer, but she is told that she has not - false negative: more serious.
� Errors should be counted separately - Estimate cost or benefit of each decision
32/43
Generalizing beyond classification
�
Measure quality for a regression model, e.g. the mean squared error:

E = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2
�
Important to consider whether this is a meaningful metric for the
problem at hand (must match the problem).
�
Many other metrics can be thought of...
33/43
A key concept in statistics: Expected Value
�
�
�
Need to know: All possible outcomes, and their probabilities.
Weighted average of outcome values w.r.t. their probabilities:
Expected Value (EV) = p(o1 )v (o1 ) + p(o2 )v (o2 ) + p(o3 )v (o3 ) + ...
where oi is i th outcome, with probability p(oi ) and value v (oi ).
34/43
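A direct transcription of the definition (a Python sketch; the outcome probabilities and values are made-up numbers for illustration):

```python
def expected_value(outcomes):
    """EV = p(o1)v(o1) + p(o2)v(o2) + ... ; outcomes is a list of
    (probability, value) pairs whose probabilities sum to 1."""
    assert abs(sum(p for p, _ in outcomes) - 1.0) < 1e-9
    return sum(p * v for p, v in outcomes)

# made-up three-outcome example: EV = 0.5*10 + 0.3*(-5) + 0.2*0 = 3.5
ev = expected_value([(0.5, 10.0), (0.3, -5.0), (0.2, 0.0)])
assert abs(ev - 3.5) < 1e-9
```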
Expected Value to Frame Classifier Use
� Targeted marketing: 'likely responder' and 'not likely responder'.
� Define the 'value of response':
• Product price $200 and cost $100, targeting cost $1,
• Values: vR = $99, and vNR = -$1
� Given a feature vector x of a customer as input.
� Let pR(x) be the 'estimated' probability of response, then
• Expected profit = pR(x) vR + (1 − pR(x)) vNR
� Q: Shall we target the customer? Check
• pR(x) $99 + (1 − pR(x)) (−$1) > 0 ⇒ pR(x) > 0.01
35/43
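The threshold derivation can be verified mechanically (a Python sketch of the slide's numbers):

```python
# Targeting rule from the slide: target when expected profit > 0,
# with vR = $99 (profit on a response) and vNR = -$1 (wasted targeting cost).
def expected_profit(p_r, v_r=99.0, v_nr=-1.0):
    return p_r * v_r + (1.0 - p_r) * v_nr

assert expected_profit(0.02) > 0          # worth targeting
assert expected_profit(0.005) < 0         # not worth targeting
assert abs(expected_profit(0.01)) < 1e-9  # break-even exactly at pR(x) = 0.01
```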
Expected Value to Frame Classifier Evaluation
�
Shift our focus: from ‘entities’ to ‘sets of entities’.
�
Q: What is the expected benefit per customer using a model?
�
To compare one model to another ‘in aggregate’.
�
Using the confusion matrix,
exp. profit = p(Y, p)b(Y, p) + p(N, p)b(N, p) + p(N, n)b(N, n) + p(Y, n)b(Y, n)
� Costs and benefits come from business understanding and knowledge.
� Possible to compare two models on different data sets.
� There is an alternative formulation using conditional probabilities.
36/43
Example - Matlab
% Cost - benefit matrix
CB = [99 , -1;0 0];
% Confusion matrix
CM = [56 7;5 42];
% % Calculate priors
priors = sum ( CM ) / sum ( CM (:) )
% % Calculate rates ( conditional probabilities )
helpvar = ones (2 ,1) * sum ( CM ,1)
rates = CM ./ helpvar
% % Calculate expected profit
% EP = p ( p ) [ p ( Y | p ) . b (Y , p ) + p ( N | p ) . b (N , p ) ] + ...
%
... + p ( n ) [ p ( N | n ) . b (N , n ) + p ( Y | n ) . b (Y , n ) ]
EB = rates .* CB
% multiply rates with benefits ( expected benefits )
EP = sum ( EB * priors ) % multiply with priors : notice the transpose
37/43
Other evaluation metrics
38/43
Other evaluation metrics
Source Wikipedia.
39/43
Evaluation, Baseline Performance
� Think of a reasonable baseline to compare the performance of our model.
• Convincing people by showing the added value of data mining.
� Q: How to find the appropriate baseline?
• Depends on the actual application.
40/43
� In weather forecasting:
• Tomorrow (and the next day) will be the same as today.
• The long-term historical average.
� In classification:
• Majority classifier: output the majority class of the training data set.
� Maximizing simple prediction accuracy is not the ultimate goal.
41/43
Questions?
Today
�
Chapter 5 & 7.
42/43
Next week
�
Session 7:
•
•
•
•
Learning curves.
Similarity, Neighbors, and Clusters.
Clustering.
Decision Analytic Thinking.
43/43
1BK40 Business Analytics
& Decision Support
Session 6
Matlab session 2
Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
September 21, 2017
Where innovation starts
Announcements
�
Assignment 1b - Release: 27 Sept. / Deadline: 6 Oct.
� Assignment 2a - Release: 6 Oct. / Deadline: 18 Oct. (tentative)
� Assignment 2b - Release: 18 Oct. / Deadline: 27 Oct. (tentative)
2/26
Introduction
� Previous lecture:
• Introduced basic issues of model evaluation and explored the question of what makes for a good model.
� Today:
• Implementation in Matlab for the Iris data.
• No help/questions in solving Assignment 1b.
� This lecture contains slides from other lectures.
3/26
4/26
Cross Validation - Matlab
Note on data mining and its use
�
Mining the data to find patterns and build models is different than
using the results of data mining.
5/26
Holdout training and testing
6/26
Load data
load iris.dat
% column information for IRIS data
%1). sepal length in cm - %2). sepal width in cm
%3). petal length in cm - %4). petal width in cm
%5). class:
%- Iris Setosa, %- Iris Versicolour, %- Iris Virginica
% PREPARE DATA
% choose data for versicolor and virginica
data=[iris(iris(:,5)==2,:); iris(iris(:,5)==3,:)]
% choose pedal width and sepal width to explain 'class'
% x = [data(:,2) data(:,4)];
x = data(:,1:4);
f_x = data(:,5)-2; %to make it zeros and ones
% Set the random seed for reproducibility
% (always obtain the same values).
rng('default')
7/26
HOLD-OUT Validation - TREE
8/26
cvoCV=cvpartition(f_x,'Holdout',0.2);
% To make it clearer, lets indicate the train and validation set
xTrain=x(cvoCV.training,:);
yTrain=f_x(cvoCV.training);
xVal=x(cvoCV.test,:);
yVal=f_x(cvoCV.test);
% estimate regression tree
t_HO = fitctree(xTrain,yTrain);
% Predict in-sample values - Training
pred_treeTrain = predict(t_HO,xTrain);
% Accuracy of in-sample - Training
tree_HO_AccTrain = mean(pred_treeTrain == yTrain);
% Predict out-of-sample values - Validation (or testing)
pred_treeVal = predict(t_HO,xVal);
% Accuracy of out-of-sample - Validation (or testing)
tree_HO_AccVal = mean(pred_treeVal == yVal);
Cross Validation - TREE
9/26
cvo=cvpartition(f_x,'Kfold',10);
% We need to use a for loop to automatically repeat the
% building/checking for each fold...
% or we use the created function cvDecisionTree
[trees_CV,tree_CV_AccTrain,tree_CV_AccVal] = cvDecisionTree(x,f_x);
Logistic “Regression” - Technical details
�
10/26
How to translate log-odds into the probability of class membership?
• p+(x) represents the model's estimate of the probability of class membership of a data item represented by feature vector x.
• + is the class for the binary event we are modeling.
• 1 − p+(x) is the estimated probability of the event not occurring.
• ln( p+(x) / (1 − p+(x)) ) = f(x) = w0 + w1 x1 + w2 x2 + . . .
• Solving for p+(x) we obtain: p+(x) = 1 / (1 + exp(−f(x)))
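The two formulas are inverses of each other, which a few lines confirm (a Python sketch; the weights are arbitrary illustrative values, not fitted ones):

```python
import math

def p_plus(x, w):
    """p+(x) = 1 / (1 + exp(-f(x))) with f(x) = w[0] + w[1]*x1 + w[2]*x2 + ..."""
    f = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-f))

# log-odds 0 -> probability 0.5; large positive log-odds -> near 1
assert p_plus([0.0, 0.0], [0.0, 1.0, 1.0]) == 0.5
assert p_plus([10.0, 0.0], [0.0, 1.0, 1.0]) > 0.99
# the inverse relation: ln(p / (1 - p)) recovers the linear f(x)
p = p_plus([1.0, 2.0], [0.5, 1.0, -0.25])
f = 0.5 + 1.0 * 1.0 + (-0.25) * 2.0
assert abs(math.log(p / (1.0 - p)) - f) < 1e-9
```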
Logistic “Regression” - Technical details
11/26
HOLD-OUT Validation - LOGISTIC
REGRESSION
12/26
cvoCV=cvpartition(f_x,'Holdout',0.2);
% To make it clearer, lets indicate the train and validation set
xTrain=x(cvoCV.training,:);
yTrain=f_x(cvoCV.training);
xVal=x(cvoCV.test,:);
yVal=f_x(cvoCV.test);
[b1,dev1,stats1] = mnrfit(xTrain,categorical(yTrain));
% Predict in-sample values - Training
pr_yA = mnrval(b1,xTrain);
[C,It]=max(pr_yA,[],2);
% I is still 1 or 2, for our example we have classes 0 and 1 and
% want categorical variables to compare with yTrain.
pred_lrTrain=categorical(It-1);
% Accuracy of in-sample - Training
lr_AccTrain = mean(pred_lrTrain == categorical(yTrain));
% Predict out-of-sample values - Validation (or testing)
% TODO
Problems with Unbalanced Classes
�
Consider prediction of churn (again):
•
•
•
Training population of 1000 customers.
Baseline churn rate is 10% (100 customers in 1000 are expected to
churn).
What is the base rate accuracy?
�
Model A: generates an accuracy of 80%
�
Why is there a difference?
�
Model B: generates an accuracy of 64%
�
Which one is better?
13/26
Problems with Unbalanced Classes
�
Difference:
•
•
�
�
Model A: evaluated on a balanced data set.
Model B: evaluated on a representative data set (1:9 ratio)
Accuracy of both in a balanced data set: 80%. ??
Confusion matrices:
•
•
Model A: 40% of negative class wrong.
Model B: 40% of positive class wrong.
14/26
Performance visualization
�
�
Accuracy can be misleading.
Train on balanced data but evaluate with representative data.
15/26
16/26
Visualization - Matlab
Learning Curves - Churn Dataset
17/26
Learning curve - TREE
18/26
% Note that you obtain the percentages not the number of data points
[valuesPerc,Acc]=learningCurve(x,f_x,'tree');
% Make the actual plot with x and y axis labels and also legend.
% TODO - also no solution in Matlab tutorial 2.
Fitting Graph
19/26
�
A fitting graph shows the accuracy of a model as a function of
complexity.
�
Generally, there will be more overfitting as one allows the model to
be more complex.
Avoiding Overfitting with Tree Induction
�
�
The main problem with tree induction is that it will keep growing
the tree to fit the training data until it creates pure leaf nodes.
Tree induction commonly uses two techniques to avoid overfitting:
•
•
�
�
20/26
Stop growing the tree before it gets too complex;
grow the tree until it is too large, then “prune” it back, reducing its
size (and thereby its complexity).
There are various methods for accomplishing both.
Simple idea for first: limit tree size is to specify a minimum number
of instances that must be present in a leaf (or generally use the data
at the leaf to make a statistical estimate of the value of the target
variable for future cases that would fall to that leaf).
•
•
•
Key concern: What threshold should be used?
Use “hypothesis test”! Recall: roughly, a hypothesis test tries to
assess whether a difference in some statistic is not due simply to
chance.
Pay attention to multiple comparisons for the ‘best’ model - It is a
trap!
Example - The Churn Problem
21/26
Fitting graph - TREE
22/26
[modelC,Acc]=fittingGraph(x,f_x,'tree');
% Make the actual plot with x and y axis labels and also labels.
% TODO - also no solution in Matlab tutorial 2.
Overfitting - Mathematical Functions
�
23/26
There are different ways to allow more or less complexity in
mathematical functions:
•
Add more variables (attributes, features)
f (x) = w0 + w1 x1 + w2 x2 + w3 x3
•
�
f (x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x4 + w5 x5
Add non-linear variables
f (x) = w0 + w1 x1 + w2 x2 + w3 x3 + w4 x1^2 + w5 x2 x3
As you increase the dimensionality, you can perfectly fit larger and
larger sets of arbitrary points:
•
•
Modelers carefully prune the attributes in order to avoid overfitting manual selection;
Automatic feature selection.
Fitting graph - LOGISTIC REGRESSION
24/26
% A simple way to increase complexity is to check the correlation
% between each variable and the output. Please note that p-values
% correlation are related (see:
% http://www.eecs.qmul.ac.uk/~norman/blog_articles/p_values.pdf)
% this is a simplification.
[modelC,Acc]=fittingGraph(x,f_x,'logistic')
25/26
Questions?
Today
�
�
Session 6 - Matlab examples.
Solve Matlab tutorial 2.
26/26
1BK40 Business Analytics
& Decision Support
Session 7
Similarity & Clustering
Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
September 29, 2017
Where innovation starts
Announcements
� Healthy way of reading the book:
• Not for searching the content of the assignment question.
• But for understanding the flow of explanations.
� Session will start with a quick scan of previous concepts.
� Assignment 1b will be released on Friday, after Session 8.
2/52
Previous session
�
Generalization and overfitting
•
•
•
•
�
Holdout training and testing
Fitting graph: Error vs. model complexity
Cross validation: Split data into k partitions
Learning curve: Performance vs. size of training data
Model evaluation
•
•
Confusion matrix
Expected value
3/52
Outline
Today
Similarity importance in business tasks
Similarity and distance
Nearest neighbor reasoning
Clustering
Hierarchical clustering
k-means clustering
Closing remarks
Fundamental concepts: Calculating similarity of objects described by
data; Using similarity for prediction; Clustering as similarity-based
segmentation.
4/52
Similarity importance in business tasks
� Two things that are similar in some way often share other characteristics as well.
• Data mining procedures often are based on grouping things by similarity or searching for the "right" sort of similarity.
� Different sorts of business tasks involve reasoning from similar examples:
• Retrieve similar things directly. Find companies similar to best customers.
• Classification and regression.
• Clustering: group similar items together. Customer segmentation.
• Similarity-based recommendations. (People who like X also like Y). Amazon & Netflix.
• Reasoning from similar cases: case-based reasoning. Law, medicine and AI.
5/52
Supervised segmentation
�
Can be viewed as grouping data into groups with a similar
property.
�
What are the measures of similarity in the following visualization?
6/52
Similarity and distance
�
Mathematically, similarity is a relation that satisfies a number of
conditions (reflexive, symmetric, transitive).
�
In a vector space, this is linked to the notion of distance.
�
7/52
Two objects are more similar the smaller the distance between them
is.
8/52
Similarity and distance
Similarity and distance
9/52
• Consider two instances from our simplified credit application domain:
  A = (23, 2, 2) and B = (40, 10, 1).
• Using the general Euclidean distance

  d(A, B) = √((d1,A − d1,B)² + (d2,A − d2,B)² + . . . + (dn,A − dn,B)²)

  we obtain

  d(A, B) = √((23 − 40)² + (2 − 10)² + (2 − 1)²) = √354 ≈ 18.8
• Distance is just a number:
  • It has no units and no meaningful interpretation;
  • It is only really useful for comparing the similarity of one pair of
    instances to that of another pair.
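The worked example above can be reproduced in a few lines. A minimal Python sketch (the course assignments use Matlab; Python is used here only for illustration), with the instance values A = (23, 2, 2) and B = (40, 10, 1) taken from the computation on the slide:

```python
import math

def euclidean(a, b):
    """General Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

A = (23, 2, 2)    # applicant A's attribute values from the slide
B = (40, 10, 1)   # applicant B's attribute values from the slide
d = euclidean(A, B)  # sqrt(17^2 + 8^2 + 1^2) = sqrt(354), about 18.8
```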
Nearest neighbor reasoning
• Start with an example (to understand the fundamental notion):
• Find the whiskey most similar to my favorite according to attributes:
  • Color; nose; body; palate; finish.
10/52
Nearest neighbors for predictive modeling
• Similarity for predictive modeling:
  • Given a new example (whose target variable we want to predict),
  • Scan all the training examples and choose several that are the most
    similar to the new example.
  • Predict the new example’s target value based on the nearest
    neighbors’ (known) target values.
11/52
Nearest neighbors for predictive modeling
• A nearest-neighbor method is defined by:
  • Dataset;
  • Distance function;
  • Number of neighbors (size of neighborhood);
  • Combining function (prediction).
12/52
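These four ingredients can be made explicit in code. A minimal Python sketch (illustrative only; the course uses Matlab), where the dataset, distance, k, and combining function are all passed in as parameters:

```python
import math
from collections import Counter

def knn_predict(dataset, distance, k, combine, query):
    """k-NN prediction from its four ingredients:
    dataset  -- list of (features, target) pairs
    distance -- function on two feature vectors
    k        -- neighborhood size
    combine  -- function mapping the k nearest targets to a prediction
    """
    neighbors = sorted(dataset, key=lambda xy: distance(xy[0], query))[:k]
    return combine([t for _, t in neighbors])

euclid = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
majority = lambda targets: Counter(targets).most_common(1)[0][0]

# tiny illustrative dataset: two "Yes" points near the origin, two "No" far away
data = [((1, 1), "Yes"), ((1, 2), "Yes"), ((8, 9), "No"), ((9, 8), "No")]
label = knn_predict(data, euclid, 3, majority, (2, 2))
```

Swapping `majority` for an averaging function turns the same skeleton into a nearest-neighbor regressor.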
Distance function
• Euclidean distance:
  dEuclidean(X, Y) = ‖X − Y‖₂ = √((x1 − y1)² + (x2 − y2)² + . . .)
• Manhattan distance:
  dManhattan(X, Y) = ‖X − Y‖₁ = |x1 − y1| + |x2 − y2| + . . .
• Jaccard distance:
  dJaccard(X, Y) = 1 − |X ∩ Y| / |X ∪ Y|
• Cosine distance:
  dcosine(X, Y) = 1 − (X · Y) / (‖X‖₂ ‖Y‖₂)
13/52
Distance function
• Manhattan distance (L1-norm) and Euclidean distance (L2-norm)
  are special cases of the Minkowski distance:
  dMinkowski(X, Y) = ‖X − Y‖q = (|x1 − y1|^q + |x2 − y2|^q + . . . + |xp − yp|^q)^(1/q)
14/52
By User:Psychonaut - Created by User:Psychonaut with XFig, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=731390
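The distance functions above are short enough to implement directly. A Python sketch (illustrative; not the course's Matlab code), where Jaccard distance is shown on sets of items and the other distances on numeric vectors:

```python
import math

def minkowski(x, y, q):
    """L_q norm of the difference; q=1 gives Manhattan, q=2 Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

def cosine_distance(x, y):
    """1 minus the cosine of the angle between vectors x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (nx * ny)

def jaccard_distance(x, y):
    """x and y are sets of items (e.g. binary attributes that are present)."""
    return 1 - len(x & y) / len(x | y)
```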
Nearest neighbor classifier
• Example: credit card marketing problem.
• How many neighbors? How much influence for each neighbor?
15/52
How Many Neighbors and How Much Influence?
• k is a complexity parameter:
  • The larger k, the smoother (less complex) the model.
  • What does a 1-NN model look like?
• How much influence should distant observations have?
• For classification, use an odd number of neighbors (to break ties in voting).
16/52
Combining functions
• Majority scoring:
  Score(c, N) = Σ_{y∈N} [class(y) = c]
• Similarity-moderated classification:
  Score(c, N) = Σ_{y∈N} w(x, y) × [class(y) = c],  where w(x, y) = 1 / dist²(x, y)
• Similarity-moderated scoring:
  p(c|x) = Σ_{y∈N} w(x, y) × [class(y) = c] / Σ_{y∈N} w(x, y)
• Similarity-moderated regression:
  t̂(x) = Σ_{y∈N} w(x, y) × t(y) / Σ_{y∈N} w(x, y)
• ...
17/52
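The similarity-moderated combining functions can be sketched as follows. A minimal Python version (illustrative; the neighbor distances and classes below are made up), using the inverse-squared-distance weight from the slide:

```python
def weight(dist):
    """Similarity weight: inverse squared distance, w = 1 / dist^2."""
    return 1.0 / dist ** 2

def weighted_class_score(neighbors, c):
    """p(c|x): similarity-moderated vote over (distance, class) neighbors."""
    num = sum(weight(d) for d, cls in neighbors if cls == c)
    den = sum(weight(d) for d, _ in neighbors)
    return num / den

def weighted_regression(neighbors):
    """Similarity-moderated average over (distance, target value) neighbors."""
    num = sum(weight(d) * t for d, t in neighbors)
    den = sum(weight(d) for d, _ in neighbors)
    return num / den

# hypothetical neighbors of a query point: (distance to query, class)
nbrs = [(1.0, "Yes"), (2.0, "Yes"), (4.0, "No")]
p_yes = weighted_class_score(nbrs, "Yes")  # closer "Yes" votes count more
```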
Nearest neighbors for predictive modeling
• Probability estimation:
  • It is important not just to classify a new example but to estimate its
    probability (score), because a score gives more information than a
    plain Yes/No decision.
  • Consider again the classification task of deciding whether David will
    be a responder or not (previous slides).
  • The nearest neighbors (Rachael, John, and Norah) have classes
    (No, Yes, Yes).
  • If we score for the Yes class, so that Yes = 1 and No = 0, we can
    average these into a score of 2/3 for David.
  • Recall the discussion of estimating probabilities from small samples
    for decision trees.
• Regression:
  • Retrieve information from the nearest neighbors, similar to the
    majority vote of a target for classification.
  • The nearest neighbors (Rachael, John, and Norah) have incomes
    (50, 35, 40).
  • Combine these values as an average (≈ 42) or a median (40).
18/52
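The David example can be checked directly with unweighted combining functions. A short Python sketch (illustrative; neighbor classes and incomes are the values quoted on the slide):

```python
from statistics import mean, median

# classes and incomes of David's three nearest neighbors (from the slide)
classes = {"Rachael": "No", "John": "Yes", "Norah": "Yes"}
incomes = {"Rachael": 50, "John": 35, "Norah": 40}

# score for the Yes class: Yes = 1, No = 0, averaged over the neighborhood
score_yes = sum(c == "Yes" for c in classes.values()) / len(classes)  # 2/3

avg_income = mean(incomes.values())    # about 41.7, rounds to 42
med_income = median(incomes.values())  # 40
```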
Lazy learners vs. eager learners
• Nearest neighbor classifiers are also known as lazy learners:
  • No model is built before evaluation (the learner is “lazy”).
  • All computations are made at evaluation time (i.e. when classifying
    new instances).
• A decision tree is an eager (non-lazy) learner, since a model is built
  before evaluation time. What about logistic regression?
19/52
Issues with NN classification
• Intelligibility:
  • Decisions can be explained through the contributions of the neighbors:
    “The movie Avatar was recommended based on your interest in
    Spider Man and The Hobbit”.
• Dimensionality:
  • Too many or irrelevant attributes may confuse distance calculations.
  • Curse of dimensionality: may need feature selection or domain
    knowledge.
• Computational efficiency:
  • Querying the database for each prediction is expensive.
20/52
k-NN classifier - Matlab
% Same data as before, but loaded differently.
% Once again use only 2 features (the petal measurements) to make it
% easier to plot.
load fisheriris
X = meas(:,3:4); Y = species;
figure,gscatter(X(:,1),X(:,2),species); grid
set(legend,'location','best');
xlabel('petal length'); ylabel('petal width')
% Construct a kNN classifier for 5 nearest neighbors.
mdl = fitcknn(X,Y,'NumNeighbors',5)
% Examine the resubstitution loss, which, by default,
% is the fraction of misclassifications from the
% predictions of mdl.
rloss = resubLoss(mdl) % about 4% incorrect classifications
% Predict the class of a new flower.
flwr = [3.8 1.2]; flwrClass = predict(mdl,flwr)
21/52
k-NN classifier - Matlab
% Plot the new flower.
line(flwr(1),flwr(2),'marker','x',...
    'color','k','markersize',10,'linewidth',2);
% Search for the 5 nearest neighbors.
[n,d] = knnsearch(X,flwr,'k',5);
% Highlight the neighbors on the plot.
line(X(n,1),X(n,2),'color',[.5 .5 .5],'marker','o',...
    'linestyle','none','markersize',10);
22/52
23/52
Clustering
Supervised vs. unsupervised learning
• Supervised modeling
  • predicts a target feature.
• Unsupervised modeling
  • has no notion of a target variable.
  • searches for some regularity in a dataset.
24/52
Clustering
• Clustering uses similarity as its grouping criterion.
• Clustering finds groups of objects such that the objects within a group
  are similar to each other.
• Clustering purposes:
  • Discovery of overall distribution patterns and structure.
  • Data categorization and data compression (reduction).
25/52
Clustering approaches
• Hierarchical clustering groups data based on similarity/distance.
• Density-based clustering uses the distribution density of data points.
• Grid-based clustering divides the search space into a finite number
  of grid elements.
26/52
Clustering algorithms
• k-means, k-modes, k-medoids algorithms
• DBSCAN algorithm
• Fuzzy k-means (fuzzy c-means) algorithm
• Single and complete linkage algorithms
• STING algorithm
• Possibilistic clustering
• ...
• What is the basic idea behind all of these algorithms?
27/52
Hierarchical clustering
• Groups points by their similarity.
• Clusters are merged iteratively until only a single cluster remains.
• Lowest level: all data points are clusters themselves.
• Hierarchical clustering uses a distance (linkage) function between
  clusters.
• Clusters only overlap when one contains (is contained by) the other.
• No pre-determined number of clusters.
• A dendrogram is used to show the hierarchy of the clusters explicitly.
28/52
Dendrogram representation
• A convenient graphic to represent the hierarchical sequence of
  clusterings.
• Basically, it is a tree where
  • each node represents a cluster,
  • each leaf represents a data point,
  • the root node is the cluster of the whole data set,
  • each internal node has two children: the clusters that were merged
    to form it.
29/52
Dendrogram representation
30/52
Example: Tree of life
31/52
Example: Tree of life
32/52
Linkage algorithms
Given the data
  xk = [x1k, x2k, . . . , xnk]ᵀ, k = 1, . . . , N
1. Start with N clusters and compute the N × N matrix of similarities
   S(xi, xj) = (1 + d(xi, xj))⁻¹.
2. Step check: Determine the most similar clusters i*, j*.
3. Merge the clusters to form a new cluster i′ = i* ∪ j*.
4. Delete the rows and columns of the similarity matrix corresponding to
   i* and j*.
5. Determine the similarity between i′ and all other remaining clusters.
6. If the number of clusters is greater than one, go to ‘Step check’;
   otherwise STOP.
33/52
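The merge loop above can be sketched compactly. A minimal Python version (illustrative only; the course uses Matlab), using the similarity transform S = (1 + d)⁻¹ from step 1 together with complete linkage; the sample points and the `stop_at` parameter are made up for the example:

```python
import math

def cluster_similarity(A, B, points):
    """Complete-linkage similarity: S = (1 + d)^-1 of the farthest pair."""
    d = max(math.dist(points[i], points[j]) for i in A for j in B)
    return 1.0 / (1.0 + d)

def agglomerate(points, stop_at=1):
    """Repeatedly merge the two most similar clusters until stop_at remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > stop_at:
        a, b = max(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_similarity(clusters[ab[0]],
                                                     clusters[ab[1]], points))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]   # two obvious groups
two = agglomerate(pts, stop_at=2)            # recovers those two groups
```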
Linkage variations
• Single linkage:
  d(A, B) = min_{xi∈A, xj∈B} d(xi, xj)
• Complete linkage:
  d(A, B) = max_{xi∈A, xj∈B} d(xi, xj)
• Average linkage:
  d(A, B) = (1 / (|A||B|)) Σ_{xi∈A} Σ_{xj∈B} d(xi, xj)
34/52
Linkage example
35/52
Example: Complete linkage
• Initial similarity matrix:

          1     2     3     4     5
  1       1
  2       0.8   1
  3       0.4   0.3   1
  4       0.7   0.85  0.2   1
  5       0.83  0.1   0.9   0.25  1

• The most similar pair is (3, 5) with s3,5 = 0.9, so merge them:

          1     2     {3,5}  4
  1       1
  2       0.8   1
  {3,5}   0.83  0.3   1
  4       0.7   0.85  0.25   1

• Next, the most similar pair is (2, 4) with s2,4 = 0.85:

          1     {2,4}  {3,5}
  1       1
  {2,4}   0.8   1
  {3,5}   0.83  0.3    1

• Finally, merge 1 with {3, 5} (s = 0.83), leaving {2, 4} and {1, 3, 5}:

            {2,4}  {1,3,5}
  {2,4}     1
  {1,3,5}   0.8    1

36/52
Hierarchical clustering - Matlab
37/52
% Same data as before, but loaded differently.
% Once again use only 2 features (the petal measurements) to make it
% easier to plot.
load fisheriris
X = meas(:,3:4);
Y = species;
figure,gscatter(X(:,1),X(:,2),species)
set(legend,'location','best')
xlabel('petal length')
ylabel('petal width')
% Hierarchical clustering
% The pdist function returns the distance information in a
% vector, where each element contains the distance between
% a pair of objects.
eucD = pdist(X,'euclidean');
% To see the distances as a matrix:
squareform(eucD)
Hierarchical clustering - Matlab
38/52
% Once the proximity between objects in the data set has been
% computed, you can determine how objects in the data set should
% be grouped into clusters, using the linkage function.
clustTreeEuc = linkage(eucD,'average');
% To visualize the hierarchy of clusters, you can plot
% a dendrogram.
figure;
[heucD,nodeseucD] = dendrogram(clustTreeEuc,0);
% Try the cosine distance.
cosD = pdist(X,'cosine');
clustTreeCos = linkage(cosD,'average');
figure; [hCos,nodesCos] = dendrogram(clustTreeCos,0);
% Determine the quality of each hierarchical clustering using
% the cophenetic correlation (each tree against its own distances).
qEuc = cophenet(clustTreeEuc,eucD)
qCos = cophenet(clustTreeCos,cosD)
k-means clustering
• The most popular centroid-based clustering algorithm.
• The centroids are the arithmetic means of the instances in the clusters.
• Given an initial set of k means, the algorithm proceeds by alternating
  between two steps:
  • Assignment step: assign each instance to the cluster whose mean is
    “nearest”.
  • Update step: recalculate each mean as the centroid of the
    observations assigned to that cluster.
• The algorithm converges when the centroids no longer change.
39/52
k-means steps
• Assignment step: Si = {x : ‖x − mi‖² ≤ ‖x − mj‖² for all j}
• Update step: mi = (1 / |Si|) Σ_{x∈Si} x
40/52
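The two alternating steps can be sketched directly. A plain-Python version (illustrative; the course's assignments use Matlab), with made-up sample points and initial centroids:

```python
import math

def kmeans(points, centroids, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its assigned points."""
    for _ in range(iters):
        # assignment step: nearest centroid wins
        groups = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            groups[i].append(p)
        # update step: centroid = coordinate-wise mean (kept if group empty)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else m
                     for g, m in zip(groups, centroids)]
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = kmeans(pts, centroids=[(0, 0), (10, 10)])  # one center per group
```

In practice the initial centroids are chosen randomly (or by a smarter seeding scheme), and the loop stops as soon as no assignment changes.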
Cluster prototype (center) evolution
�
Distribution of objects
�
Evolution of cluster centers
41/52
Common distance metrics for k-means
42/52
Cluster validity
• What is a good clustering result?
• Understanding clusters depends on the sort of data and the domain
  of application.
• The whole point is to understand whether something was discovered.
• In the whiskey case, Group A:
  • Scotches: Aberfeldy, Glenugie, Laphroaig, Scapa
  • The best of its class: Laphroaig (Islay), 10 years, 86 points
  • Average characteristics: full gold; fruity, salty; medium; oily, salty,
    sherry; dry
43/52
Evaluating cluster validity
• Expert-based:
  • Domain knowledge;
  • Data exploration;
  • Semantics (meaning) of clusters.
• Cluster validity indices:
  • Consider cluster compactness;
  • Consider cluster separation;
  • Sometimes also cluster homogeneity.
44/52
Compactness and separation
45/52
Example validity index
• Cluster dispersion:
  Si = (1 / |Gi|) Σ_{k, xk∈Gi} ‖xk − vi‖₂
• Intercluster distance:
  dij = ‖vi − vj‖₂
• Davies-Bouldin (DB) index:
  DB = (1 / C) Σ_{i=1}^{C} max_{j≠i} (Si + Sj) / dij
  where C is the number of clusters and vi is the centroid of cluster Gi.
46/52
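The DB index above is easy to compute once clusters and centroids are available. A small Python sketch (illustrative; the two tight, well-separated example clusters are made up), where a lower score indicates compact, well-separated clusters:

```python
import math

def db_index(clusters, centroids):
    """Davies-Bouldin index over clusters (lists of points) and centroids."""
    def dispersion(G, v):
        # average distance of the cluster's points to its centroid
        return sum(math.dist(x, v) for x in G) / len(G)

    S = [dispersion(G, v) for G, v in zip(clusters, centroids)]
    C = len(clusters)
    total = sum(max((S[i] + S[j]) / math.dist(centroids[i], centroids[j])
                    for j in range(C) if j != i)
                for i in range(C))
    return total / C

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]  # compact, far apart
cents = [(0, 0.5), (10, 10.5)]
score = db_index(tight, cents)  # small value: a good clustering
```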
Determining the number of clusters with the DB index
• The number of clusters (C) is a complexity parameter.
• You can plot a “fitting graph” for cluster validity (e.g. DB score versus
  the number of clusters).
• The elbow in the graph is an indication of the “natural” number of
  clusters.
47/52
Issues in clustering
• Clustering is more exploratory: expert knowledge is needed for
  interpretation.
• Parameter settings (e.g. the similarity threshold or the number of
  clusters) are application dependent; a good understanding of the
  clustering goals is needed to set the parameters correctly.
• Dimensionality is a problem, as distance loses its meaning for a very
  large number of variables; feature selection may be imperative.
48/52
49/52
Questions?
Session 8
• Chapters 8 and 9:
  • Visualizing Model Performance
  • Ranking, Profit Curves
  • ROC Graphs and AUC
  • Cumulative Response and Lift Curves
  • Evidence and Probabilities
  • Combining Evidence Probabilistically
  • Applying Bayes’ Rule to Data Science
50/52
1BK40 Business Analytics
& Decision Support
Session 8
Visualization
Evidence and Probabilities
Murat Firat
Pav.D06, m.firat@tue.nl
September 29, 2017
Where innovation starts
Announcements
• Completion of last session: clustering will be discussed.
• Assignment 1b will be released today:
  • First, check your Matlab references when you get stuck.
  • Read the related section of the book before asking anything.
  • Do not try to use me as a Matlab debugger.
2/47
Outline
Today
• Visualizing Model Performance
  • Profit curves
  • ROC curves
  • Gain charts
  • Lift curves
• Evidence and Probabilities
  • Combining evidence probabilistically
• Applying Bayes’ Rule to Data Science
  • Bayes’ rule
  • Example
Fundamental concepts: Visualization of model performance; explicit
evidence combination with Bayes’ Rule.
3/47
4/47
Visualizing Model Performance
Introduction
5/47
• Previous lecture:
  • Introduced basic issues of model evaluation and explored the question
    of what makes a good model.
• Still, stakeholders and even data scientists often want a higher-level,
  more intuitive view of model performance.
• It is useful to present visualizations rather than just calculations or
  single numbers.
Disadvantages of expected value
• Good estimates of costs and benefits must be available, but they
  • may be difficult to estimate accurately;
  • may simply not be available;
  • ignore preferences;
  • may be intangible.
• Good estimates of probabilities are needed:
  • Estimated probabilities may be biased.
  • Sensitivity analysis would be useful.
• Valid for only a single operating condition
  (often with stringent assumptions).
• Oftentimes, ranking objects may be sufficient!
6/47
Models that rank cases
• Rankers produce continuous output (e.g., in [0, 1])
  • rather than + vs. −.
  • Recall from previous lectures: decision trees, logistic regression, etc.
    can rank cases.
• Don’t “brain-damage” your model
  • by just using some threshold chosen you-don’t-know-how.
• Combine with a threshold to form a classifier:
  • One ranker can define many classifiers.
• Two issues for evaluation:
  • Choose the ranking model;
  • Choose a proper threshold (if necessary).
7/47
Thresholding a ranking classifier
Each threshold → a new classifier
8/47
How to evaluate a ranking classifier?
• Key questions:
  • How to compare different rankings?
  • How to choose a proper threshold?
• If we have accurate probability estimates and a well-specified
  cost-benefit matrix (the basis of expected value),
  • we determine the threshold where our expected profit is above a
    desired level.
• Evaluation methods:
  • Profit curves
  • Receiver Operating Characteristics (ROC) curves
  • Area under the ROC curve (AUC)
  • Cumulative response curves (gain charts)
  • Lift curves
9/47
Numeric evaluation measures
• Wilcoxon-Mann-Whitney statistic (WMW):
  • The probability that the model will rank a randomly chosen positive
    case higher than a randomly chosen negative case.
  • Computed over the entire ranking; a higher WMW score is better.
• Lift = p(+|Y) / p(+)
  → How much better are we with the model than without it?
  • Measured for a specific cutoff, e.g., a percentile.
• Key question: what threshold / cutoff(s) is appropriate for your problem?
  • Business understanding tells what a good threshold/cutoff is.
  • It is also useful to visualize model performance as the threshold changes.
10/47
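Both measures can be computed directly from a scored test set. A Python sketch (illustrative; the scores and labels below are made up), with WMW as the fraction of positive-negative pairs the model orders correctly, and lift as precision in the top fraction over the base rate:

```python
def wmw(scores, labels):
    """Wilcoxon-Mann-Whitney: P(random positive scored above random negative).
    Ties count for half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def lift_at(scores, labels, frac):
    """Lift = precision in the top `frac` of the ranking over the base rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    top = ranked[:max(1, int(len(ranked) * frac))]
    return (sum(top) / len(top)) / (sum(labels) / len(labels))

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # model scores, hypothetical
labels = [1, 1, 0, 1, 0, 0]               # true classes, hypothetical
auc_like = wmw(scores, labels)            # 8 of 9 pairs ordered correctly
top_lift = lift_at(scores, labels, 1 / 3) # top third is all positives
```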
Profit curves
‘Profit’ shows expected cumulative profit.
11/47
Profit Curves
• Critical conditions underlying the profit calculation:
  • Class priors: the proportion of positive and negative instances in the
    target population, also known as the base rate.
  • The costs and benefits: the expected profit is especially sensitive to
    the relative levels of costs and benefits in the different cells of the
    cost-benefit matrix.
• Profit curves are a good choice for visualizing model performance if
  both class priors and cost-benefit estimates are known and are
  expected to be stable.
• Otherwise, use a method that can accommodate uncertainty by showing
  the entire space of performance possibilities: Receiver Operating
  Characteristics (ROC) curves.
12/47
ROC space
• Graph of true positive rate against false positive rate.
• Depicts relative trade-offs that a classifier makes between benefits and costs.

                    Actual: p           Actual: n
  Predicted Yes     True positives      False positives
  Predicted No      False negatives     True negatives
  Totals            P                   N

  TPR = TP/P
  FPR = FP/N
14/47
ROC curve
• Each specific point in the ROC space corresponds to a specific confusion matrix.
• Not sensitive to different class distributions (% + and % −).
• Diagonal/dashed line is the “random classifier”.
17/47
Constructing a ROC curve
• Pass a positive instance: step upward.
• Pass a negative instance: step rightward.
18/47
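The stepping rule above can be sketched in a few lines: rank instances by model score, then walk down the ranking, stepping up by 1/P for each positive and right by 1/N for each negative. The scores and labels below are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points for a list of scored instances."""
    P = sum(labels)            # number of positive instances
    N = len(labels) - P        # number of negative instances
    # Rank instances from highest to lowest model score
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]      # curve starts at the origin
    for _, label in ranked:
        if label == 1:
            tp += 1            # positive instance: step upward
        else:
            fp += 1            # negative instance: step rightward
        points.append((fp / N, tp / P))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0  ]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve always runs from (0, 0), where nothing is classified positive, to (1, 1), where everything is.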
ROC curves
• Similar to other model performance visualizations; e.g., isomorphic to lift curves, but...
• Separates classifier performance from costs, benefits and target class distributions.
19/47
Area under ROC curve (AUC)
• Good measure of overall model performance, if a single-number metric is needed and target conditions are completely unknown.
  • Measures the ranking quality of a model;
  • A fair measure of the quality of probability estimates.
• Gives the probability that the model will rank a positive case higher than a negative case.
• AUC is equivalent to the Wilcoxon-Mann-Whitney (WMW) statistic (see earlier slide) and the Gini coefficient.
20/47
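The AUC/WMW equivalence can be checked numerically: computing the area under the step ROC curve gives the same number as counting the fraction of (positive, negative) pairs the model ranks correctly. Scores and labels are made up for illustration.

```python
def auc_trapezoid(scores, labels):
    """AUC as the area under the step ROC curve."""
    P = sum(labels)
    N = len(labels) - P
    tp = fp = 0
    area = prev_fpr = prev_tpr = 0.0
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / N, tp / P
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid slice
        prev_fpr, prev_tpr = fpr, tpr
    return area

def auc_wmw(scores, labels):
    """AUC as the WMW statistic: fraction of correctly ranked pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)   # ties count half
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0  ]
print(auc_trapezoid(scores, labels), auc_wmw(scores, labels))  # identical
```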
Cumulative response curves (gain charts)
Plots TP rate against percentage of population targeted.
21/47
Lift curve
• Plots lift values against percentage of population targeted.
• It is essentially the value of the gain chart divided by the value of the random model in the gain chart.
22/47
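The division described above can be sketched directly: the gain at the top x% of the ranked list is the fraction of all positives captured there, and the lift is that gain divided by x, which is what a random model would capture. The data below is made up for illustration.

```python
def gain_and_lift(scores, labels):
    """Return (fraction targeted, gain, lift) per ranking position."""
    P = sum(labels)
    n = len(labels)
    ranked = [y for _, y in sorted(zip(scores, labels), reverse=True)]
    rows = []
    tp = 0
    for i, y in enumerate(ranked, start=1):
        tp += y
        targeted = i / n                 # fraction of population targeted
        gain = tp / P                    # cumulative response (gain chart)
        rows.append((targeted, gain, gain / targeted))  # lift = gain / random
    return rows

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   1,   0,   1,   0,   0  ]
for targeted, gain, lift in gain_and_lift(scores, labels):
    print(f"top {targeted:.0%}: gain={gain:.2f}, lift={lift:.2f}")
```

At 100% of the population targeted, the lift is always exactly 1: targeting everyone is no better than random.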
Lift & cumulative response curves
• More intuitive for certain business applications (e.g., targeted marketing).
• Caveat: assumes the target class priors (relative %) are the same as in the test set (assignment 1b).
23/47
When to use which model?
• The cut-off point (based on business understanding) determines which model to use.
• Question: could you use model A for the top 40% of your population, model B for the next 20% and model C for the remainder?
• Read Chapter 8 Example: Performance Analytics for Churn Modeling (useful for assignment 2a).
24/47
Evidence and Probabilities
Introduction
• Previously: using data to draw conclusions about some unknown quantity of a data instance.
• Now: analyse data instances as evidence for or against different values of the target.
26/47
Targeting Online Consumers With Advertisements
• Target online displays to consumers based on webpages they have visited in the past:
  • Run a targeted campaign for, e.g., a luxury hotel.
  • Not randomly: obtain more bookings.
• Define our ad targeting problem more precisely:
  • What will be an instance?
  • What will be the target variable?
  • What will be the features?
  • How will we get the training data?
27/47
Targeting Online Consumers With Advertisements
• Target online displays to consumers based on webpages they have visited in the past:
  • Target variable: will the consumer book a hotel room within one week after having seen the advertisement?
  • Cookies allow for observing which consumers book rooms.
  • A consumer is characterized by the set of websites we have observed her to have visited previously (cookies!).
  • We assume that some of these websites are more likely to be visited by good prospects for the luxury hotel.
  • Problem: we do not have the resources to estimate the evidence potential for each site manually.
  • Problem: humans are notoriously bad at estimating the precise strength of the evidence (but quite good at using our knowledge and common sense to recognize whether evidence is likely to be “for” or “against”).
28/47
Targeting Online Consumers With Advertisements
• Idea: use historical data to estimate both the direction and the strength of the evidence.
• Combine the evidence to estimate the resulting likelihood of class membership.
29/47
Combining evidence probabilistically - Notation
• Interest: quantities such as the probability of a consumer booking a room after being shown an ad.
  • We actually need to be a little more specific:
    • some particular consumer?
    • any consumer?
• Let’s call this quantity C.
• Represent the probability of the event C as p(C).
• p(C) = 0.0001 ‘means’ that if we were to show ads randomly to consumers, we would expect about 1 in 10,000 to book rooms.
  • Recall the expected value framework: purchase rates attributable to online advertisements generally seem very small to those outside the industry, but the cost of placing one ad is often quite small as well.
• p(C|E): the probability of C given some evidence E (such as the set of websites visited by a particular consumer), read as
  • the probability of C given E;
  • the probability of C conditioned on E.
30/47
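In practice, both p(C) and p(C|E) are just counts over observed data: the unconditional booking rate, and the booking rate restricted to records where the evidence holds. A minimal sketch, using a made-up log of (visited luxury-related site, booked) records:

```python
# Hypothetical consumer log: (visited_luxury_site, booked) per consumer.
records = [
    (True, True), (True, False), (True, True),
    (False, False), (False, False), (False, True),
    (False, False), (False, False),
]

def p(event):
    """Unconditional probability of an event, estimated by counting."""
    return sum(1 for r in records if event(r)) / len(records)

def p_given(event, evidence):
    """Conditional probability p(event | evidence), counting in the subset."""
    subset = [r for r in records if evidence(r)]
    return sum(1 for r in subset if event(r)) / len(subset)

def booked(r):
    return r[1]

def visited(r):
    return r[0]

print("p(C)   =", p(booked))                  # overall booking rate
print("p(C|E) =", p_given(booked, visited))   # rate among site visitors
```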
Combining evidence probabilistically
• Want to use some labeled data to associate different collections of evidence E with different probabilities.
• Problem (not small):
  • For any particular collection of evidence E, there are (probably) not enough cases with exactly that same collection of evidence.
  • Usually you do not obtain a particular collection of evidence at all!
  • What is the chance that in our training data we have seen a consumer with exactly the same visiting patterns as a consumer we will see in the future? Infinitesimal (maybe not for Google or Facebook?).
• Solution: consider the different pieces of evidence separately, and then combine the evidence.
31/47
Statistical independence
• If the events A and B are statistically independent, then we can compute the probability that both A and B occur as p(AB) = p(A)p(B).
• Example: rolling a fair die
  • Event A is “roll #1 shows a six” and event B is “roll #2 shows a six”;
  • p(A) = 1/6, p(B) = 1/6 (even if we know that roll #1 shows a six);
  • The events are independent: p(AB) = p(A)p(B) = 1/36.
• The general formula for combining probabilities, which takes care of dependencies between events, is
  p(AB) = p(A)p(B|A)
  • p(B|A): given that you know A, what is the probability of B?
32/47
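Both identities on this slide can be verified by brute-force enumeration of the 36 equally likely outcomes of two die rolls, a quick sanity check rather than a proof:

```python
from fractions import Fraction

# All 36 equally likely outcomes of rolling a fair die twice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def p(event):
    """Probability of an event by counting outcomes, as an exact fraction."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):
    return o[0] == 6        # roll #1 shows a six

def B(o):
    return o[1] == 6        # roll #2 shows a six

def AB(o):
    return A(o) and B(o)    # both rolls show a six

p_B_given_A = p(AB) / p(A)  # conditional probability from the definition

assert p(AB) == p(A) * p(B)          # independence: 1/6 * 1/6 = 1/36
assert p(AB) == p(A) * p_B_given_A   # general rule p(AB) = p(A)p(B|A)
print(p(A), p(AB), p_B_given_A)
```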
Bayes’ rule
• Note:
  p(AB) = p(A)p(B|A) = p(B)p(A|B)
• Dividing by p(A):
  p(B|A) = p(A|B)p(B) / p(A)
• Consider B to be some hypothesis of interest and A some evidence observed.
• Renaming H for hypothesis and E for evidence, we obtain Bayes’ rule:
  p(H|E) = p(E|H)p(H) / p(E)
  (p(E|H) is called the likelihood of the evidence.)
• Compute the probability of our hypothesis H given some evidence E by instead looking at the probability of the evidence given the hypothesis, as well as the unconditional probabilities of the hypothesis and the evidence.
33/47
Bayes’ rule - Example
• Medical diagnosis: assume you are a doctor and a patient arrives with red spots.
  • Hypothesized diagnosis (H = measles), evidence (E = red spots).
• In order to directly estimate p(measles|red spots), we would need to think through all the different reasons a person might exhibit red spots and what proportion of them would be measles.
• Solution (simpler):
  • p(E|H) is the probability that one has red spots given that one has measles.
  • p(H) is simply the probability that someone has measles, without considering any evidence; that’s just the prevalence of measles in the population.
  • p(E) is the probability of the evidence: what’s the probability that someone has red spots? Again, simply the prevalence of red spots in the population, which does not require complicated reasoning about the different underlying causes, just observation and counting.
34/47
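Plugging the three quantities into Bayes’ rule is one line of arithmetic. The prevalence numbers below are invented for illustration; the slide gives none.

```python
def bayes(p_e_given_h, p_h, p_e):
    """Bayes' rule: p(H|E) = p(E|H) * p(H) / p(E)."""
    return p_e_given_h * p_h / p_e

# Assumed, illustrative numbers (not from the slides):
p_e_given_h = 0.90   # p(red spots | measles)
p_h = 0.01           # prevalence of measles in the population
p_e = 0.05           # prevalence of red spots in the population

posterior = bayes(p_e_given_h, p_h, p_e)
print(f"p(measles | red spots) = {posterior:.2f}")
```

Note how a disease that is very likely to cause the symptom can still be an unlikely diagnosis when its base rate p(H) is low; this is exactly why the class priors matter.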
Bayes’ rule - Example
�
Medical diagnosis: Assume you are a doctor and a patient arrives
with red spots.
•
�
�
34/47
Hypothesized diagnosis (H = measles), evidence (E = red spots).
In order to directly estimate p(measles�red spots), we would need to
think through all the different reasons a person might exhibit red
spots and what proportion of them would be measles.
Solution (simpler):
•
•
•
p(E �H) is the probability that one has red spots given that one has
measles.
p(H) is simply the probability that someone has measles, without
considering any evidence; that’s just the prevalence of measles in the
population.
p(E ) is the probability of the evidence: what’s the probability that
someone has red spots-again, simply the prevalence of red spots in
the population, which does not require complicated reasoning about
the different underlying causes, just observation and counting.
Bayes’ rule - Example
�
Medical diagnosis: Assume you are a doctor and a patient arrives
with red spots.
•
�
�
34/47
Hypothesized diagnosis (H = measles), evidence (E = red spots).
In order to directly estimate p(measles�red spots), we would need to
think through all the different reasons a person might exhibit red
spots and what proportion of them would be measles.
Solution (simpler):
•
•
•
p(E �H) is the probability that one has red spots given that one has
measles.
p(H) is simply the probability that someone has measles, without
considering any evidence; that’s just the prevalence of measles in the
population.
p(E ) is the probability of the evidence: what’s the probability that
someone has red spots-again, simply the prevalence of red spots in
the population, which does not require complicated reasoning about
the different underlying causes, just observation and counting.
Applying Bayes’ Rule to Data Science
• Bayes’ rule is the basis of “Bayesian” methods.
• Bayes’ rule for classification C = c:
  p(C = c | E) = p(E | C = c) p(C = c) / p(E)
• p(C = c | E) is the probability that the target variable C takes on the class of interest c after taking the evidence E (the vector of feature values) into account: the posterior probability.
• p(C = c) can be:
  • a “subjective” prior (the belief of a particular decision maker based on all her knowledge, experience, and opinions);
  • a “prior” belief based on some previous application(s) of Bayes’ rule with other evidence;
  • an unconditional probability inferred from data (e.g. the prevalence of c in the population ≈ the percentage of all examples that are of class c).
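As a quick numeric sketch of the rule above; all the prevalences below are hypothetical illustration values, not real medical data:

```python
# Hypothetical prevalences, chosen only to illustrate Bayes' rule
p_E_given_H = 0.90   # p(red spots | measles): most measles cases show red spots
p_H = 0.01           # prior p(measles): prevalence of measles in the population
p_E = 0.05           # p(red spots): prevalence of red spots in the population

# Bayes' rule: posterior = likelihood * prior / evidence
p_H_given_E = p_E_given_H * p_H / p_E
print(round(p_H_given_E, 3))  # 0.18
```

Note how a symptom present in 90% of measles cases still yields a modest posterior, because the prior prevalence of measles is low.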
Applying Bayes’ Rule to Data Science
• Bayes’ rule for classification C = c:
  p(C = c | E) = p(E | C = c) p(C = c) / p(E)
• p(C = c | E) is the probability that the target variable C takes on the class of interest c after taking the evidence E (the vector of feature values) into account: the posterior probability.
• p(E | C = c) is the likelihood of seeing the evidence E (the percentage of examples of class c that have E).
• p(E) is the likelihood of the evidence (the probability of occurrence of E).
• Estimating these values, we can use p(C = c | E) as an estimate of class probability.
• Alternatively, we can use the values as scores to rank instances.
Applying Bayes’ Rule to Data Science
• Drawback:
  • E is usually a vector of attribute values, so we would require knowledge of the full joint probability p(E | c) = p(e1 ∧ e2 ∧ . . . ∧ ek | c), which is difficult to measure.
  • We may never see a specific example in the training data that matches a given E in our test data.
  • Solution: make a particular assumption of independence (which may or may not hold)!
Naive Bayes - Conditional Independence
• Recall the notion of independence: two events are independent if knowing one does not give you information on the probability of the other.
• Conditional independence is the same notion, using conditional probabilities.
• In general, p(AB | C) = p(A | C) p(B | AC). If A and B are conditionally independent given C:
  p(AB | C) = p(A | C) p(B | C)
• This simplifies the computation of probabilities from data for p(E | c) if the attributes are conditionally independent given the class (writing c for C = c):
  p(E | c) = p(e1 ∧ e2 ∧ . . . ∧ ek | c) = p(e1 | c) p(e2 | c) . . . p(ek | c)
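Under the conditional-independence assumption, p(E | c) collapses into a product of per-attribute terms. A minimal sketch; the per-attribute likelihood values are hypothetical:

```python
import math

# Hypothetical per-attribute likelihoods p(e_i | c) for one class c
per_attribute = [0.8, 0.6, 0.9]

# Naive Bayes assumption: p(e1 ∧ e2 ∧ ... ∧ ek | c) = product of p(e_i | c)
p_E_given_c = math.prod(per_attribute)
print(round(p_E_given_c, 3))  # 0.432
```

Each factor can be estimated from data by simple counting within class c, which is exactly what makes the assumption so convenient.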
Naive Bayes - Conditional Independence
• Naive Bayes classifies a new example by estimating the probability that the example belongs to each class, and reports the class with the highest probability.
• In practice you do not need to compute p(E), since:
  p(c | E) = p(E | c) p(c) / p(E) = p(e1 | c) p(e2 | c) . . . p(ek | c) p(c) / p(E)
  • In classification we are interested in: of the different possible classes c, for which one is p(c | E) the greatest? (E is the same for all.)
  • Classes often are mutually exclusive and exhaustive, meaning that every instance will belong to one and only one class; then for two classes c0 and c1:
  p(c0 | E) = p(e1 | c0) . . . p(ek | c0) p(c0) / (p(e1 | c0) . . . p(ek | c0) p(c0) + p(e1 | c1) . . . p(ek | c1) p(c1))
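The two-class normalization can be sketched directly; the attribute likelihoods and priors below are hypothetical values for illustration:

```python
import math

def nb_posterior(lik_c0, lik_c1, prior_c0, prior_c1):
    """Posterior p(c0 | E) for two mutually exclusive, exhaustive classes."""
    s0 = math.prod(lik_c0) * prior_c0   # proportional to p(E | c0) p(c0)
    s1 = math.prod(lik_c1) * prior_c1   # proportional to p(E | c1) p(c1)
    return s0 / (s0 + s1)               # normalizing over classes avoids computing p(E)

# Hypothetical evidence over two attributes, equal priors
print(round(nb_posterior([0.8, 0.6], [0.2, 0.4], 0.5, 0.5), 3))  # 0.857
```

Because the denominator sums the same numerator over all classes, p(E) cancels out, which is why it never needs to be estimated.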
Advantages and Disadvantages of Naive Bayes
• Naive Bayes:
  • is a simple classifier, although it takes all the feature evidence into account;
  • is very efficient in terms of storage space and computational time;
  • performs surprisingly well for classification;
  • is an “incremental learner”;
  • is ‘naturally’ biased.
• Note that the independence assumption does not hurt classification performance very much:
  • To some extent, we double-count the evidence.
  • This tends to make the probability estimates more extreme in the correct direction.
• Do not use the probability estimates themselves! Ranking is OK!
Example: Naive Bayes classifier
(Worked example developed over five slides; figures not reproduced here. Source: J.F. Ehmke)
Example: Bayes’ rule as a decision tree
• Product rule: p(A ∩ B) = p(A) ⋅ p(B | A) for two events A and B.
• Denote ‘setosa’ in the Iris data by i = 1 and ‘versicolor’ by i = 2.
• Denote ‘petalwidth > 30’ by p = 1 and ‘petalwidth ≤ 30’ by p = 0.
• Branch probabilities of the tree:
  P(p = 0) = 0.45    P(p = 1) = 0.55
  P(i = 1 | p = 0) = 0.6182    P(i = 2 | p = 0) = 0.3818
  P(i = 1 | p = 1) = 0.3556    P(i = 2 | p = 1) = 0.6444
• Joint probabilities at the leaves:
  P(i = 1 ∩ p = 0) = 0.45 ⋅ 0.6182
  P(i = 2 ∩ p = 0) = 0.45 ⋅ 0.3818
  P(i = 1 ∩ p = 1) = 0.55 ⋅ 0.3556
  P(i = 2 ∩ p = 1) = 0.55 ⋅ 0.6444
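The tree’s numbers can be checked in a few lines; the probabilities are the ones read off the slide:

```python
# Probabilities from the decision tree above
P_p = {0: 0.45, 1: 0.55}                        # petalwidth split
P_i_given_p = {(1, 0): 0.6182, (2, 0): 0.3818,  # species given split
               (1, 1): 0.3556, (2, 1): 0.6444}

# Product rule: P(i ∩ p) = P(p) * P(i | p)
joint = {(i, p): P_p[p] * P_i_given_p[(i, p)] for (i, p) in P_i_given_p}

total = sum(joint.values())
print(round(total, 10))  # 1.0: the four leaves partition the sample space
```

The total summing to 1 confirms that the conditional probabilities at each branch are consistent.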
Questions?
1BK40 Business Analytics & Decision Support
Session 9: Visualization; Evidence and Probabilities
Murat Firat
Pav.D06, m.firat@tue.nl
October 4, 2017
Where innovation starts
Announcements
• Completion of last session: today we cover the (planned) Session 8 topics.
• Questions about Assignment 1b.
• Some feedback from Assignment 1a grading.
• DCS/e lecture by Prof. James M. Keller; program:
  • Place: Filmzaal de Zwarte Doos
  • 15:30-16:00 Welcome and coffee
  • 16:00-17:00 Lecture by Prof. James M. Keller
  • 17:00-18:00 Network drinks
Outline
Today
• Visualizing Model Performance
  • ROC curves
  • Gain charts
  • Lift curves
  • Profit curves
• Evidence and Probabilities
  • Combining evidence probabilistically
  • Applying Bayes’ Rule to Data Science
  • Bayes’ rule
  • Example
Fundamental concepts: Visualization of model performance; Explicit evidence combination with Bayes’ Rule.
Visualizing Model Performance
Introduction
• Previous session:
  • Introduced distance metrics: Manhattan, Euclidean, Minkowski, ...
  • Similarity: nearest neighbors, majority voting
  • Classification: majority voting (of neighbors)
  • Regression: averaging, median (of neighbors)
  • Clustering: two basic clustering algorithms, hierarchical and k-means.
Disadvantages of expected value
• Good estimates of costs and benefits must be available, but they:
  • may be difficult to estimate accurately;
  • are sometimes not available;
  • may be intangible.
• Good estimates of probabilities are needed:
  • Estimated probabilities may be biased.
• Valid for only a single operating condition:
  • This raises the need for sensitivity analysis.
• One solution: rank objects!
Models that rank cases
• Rankers produce numerical output (e.g., in [0, 1]):
  • rather than classes like + vs. −;
  • decision trees and logistic regression can rank cases.
• Using a threshold generates a number of classifiers.
• Two issues for evaluation:
  • choosing the ranking model;
  • choosing the proper threshold.
Thresholding a ranking classifier
Each threshold → a new classifier.
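How one ranker yields a different classifier per threshold can be sketched in a few lines; the scores below are hypothetical:

```python
# Hypothetical ranking scores for four cases, sorted in decreasing order
scores = [0.9, 0.7, 0.4, 0.2]

def classify(scores, threshold):
    """Turn a ranker into a classifier: label '+' at or above the threshold."""
    return ['+' if s >= threshold else '-' for s in scores]

print(classify(scores, 0.5))  # ['+', '+', '-', '-']
print(classify(scores, 0.3))  # ['+', '+', '+', '-']
```

Lowering the threshold moves cases from the negative to the positive side, trading false negatives for false positives.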
How to evaluate a ranking classifier?
• If we have accurate probability estimates and a well-specified cost-benefit matrix:
  • choose the threshold value so that expected profit stays above a desired level.
• Evaluation methods:
  • Receiver Operating Characteristics (ROC) curves
  • Area under the ROC curve (AUC)
  • Cumulative response curves (gain charts)
  • Lift curves
  • Profit curves
ROC space
• Graph of tp rate vs. fp rate.
• Depicts relative trade-offs between benefits and costs.
• Confusion matrix and rates:

            actual p    actual n
  Yes          TP          FP
  No           FN          TN
  total          P           N

  TPR = TP / P
  FPR = FP / N
ROC curve
• Each specific point in the ROC space corresponds to a specific confusion matrix.
• Not sensitive to different class distributions (% + and % −).
• The diagonal/dashed line is the “random classifier”.
Constructing a ROC curve
• Pass a positive instance: step upward.
• Pass a negative instance: step rightward.
ROC curves
Similar to other model performance visualizations:
• e.g., isomorphic to lift curves, but...
• separates classifier performance from costs, benefits, and target class distributions.
Area under ROC curve (AUC)
• A good measure of overall model performance if a single-number metric is needed and target conditions are completely unknown:
  • measures the ranking quality of a model;
  • a fair measure of the quality of probability estimates.
• Gives the probability that the model will rank a positive case higher than a negative case.
• AUC is equivalent to the Wilcoxon-Mann-Whitney (WMW) statistic (see earlier slide) and the Gini coefficient.
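The WMW interpretation gives a direct, if quadratic-time, way to compute AUC: compare every positive-negative pair. The scores below are hypothetical:

```python
def auc_wmw(pos_scores, neg_scores):
    """AUC as the WMW statistic: P(random positive ranked above random negative)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_wmw([0.9, 0.8], [0.7, 0.1]))  # 1.0: every positive outscores every negative
print(auc_wmw([0.5], [0.5]))            # 0.5: ties count half, like random guessing
```

Counting ties as half a win matches the convention that a random tiebreak ranks the positive first half the time.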
Numeric evaluation measures
• Wilcoxon-Mann-Whitney statistic (WMW):
  • the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case;
  • over the entire ranking, a higher WMW score is better.
• Lift = p(+ | Y) / p(+)
  → How much better are we with the model than without?
  • Measured for a specific cutoff, e.g., a percentile.
  • Key question: what threshold / cutoff(s) is appropriate for your problem?
  • Business understanding tells what a good threshold/cutoff is.
It is also useful to visualize model performance as the threshold changes.
Cumulative response curves (gain charts)
Plots the TP rate against the percentage of the population targeted.
Lift curve
• Plots lift values against the percentage of the population targeted.
• It is essentially the value of the gain chart divided by the value of the random model in the gain chart.
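A point on the lift curve can be sketched as the hit rate in the targeted top fraction divided by the base rate; the labels below are hypothetical, sorted by decreasing score:

```python
# Hypothetical true labels, sorted by decreasing model score
ranked = ['+', '+', '-', '+', '-', '-', '-', '-']

def lift_at(labels, fraction):
    """Lift = p(+ | targeted top fraction) / p(+) overall."""
    k = max(1, int(len(labels) * fraction))
    base_rate = labels.count('+') / len(labels)   # p(+)
    hit_rate = labels[:k].count('+') / k          # p(+ | Y)
    return hit_rate / base_rate

print(round(lift_at(ranked, 0.25), 2))  # 2.67: top 25% is 2.67x better than random
```

At 100% of the population, the targeted group is the whole population, so lift is always 1.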
Lift & cumulative response curves
• More intuitive for certain business applications (e.g., targeted marketing).
• Caveat: they assume the target class priors (relative %) are the same as in the test set (assignment 1b).
Profit curves
‘Profit’ shows expected cumulative profit.
Profit Curves
• Critical conditions underlying the profit calculation:
  • Class priors: the proportion of positive and negative instances in the target population, also known as the base rate.
  • The costs and benefits: the expected profit is specifically sensitive to the relative levels of costs and benefits for the different cells of the cost-benefit matrix.
• Profit curves are a good choice for visualizing model performance if both class priors and cost-benefit estimates are known and are expected to be stable.
• Otherwise, use a method that can accommodate uncertainty by showing the entire space of performance possibilities: Receiver Operating Characteristics (ROC) curves.
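A sketch of the profit calculation behind a single point on the curve; the confusion-matrix counts and cost-benefit cell values below are hypothetical:

```python
def expected_profit(counts, values):
    """Expected profit per case: sum over cells of cell rate * cell cost/benefit."""
    total = sum(counts.values())
    return sum(counts[cell] / total * values[cell] for cell in counts)

# Hypothetical confusion-matrix counts at one threshold, plus a cost-benefit matrix
counts = {'TP': 10, 'FP': 5, 'FN': 2, 'TN': 83}
values = {'TP': 99, 'FP': -1, 'FN': 0, 'TN': 0}  # e.g. benefit 99 per hit, cost 1 per wasted offer

print(round(expected_profit(counts, values), 2))  # 9.85
```

Sweeping the threshold changes the counts while the cost-benefit values stay fixed, which is exactly what traces out the profit curve.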
When to use which model?
• The cut-off point (based on business understanding) determines which model to use.
• Question: could you use model A for the top 40% of your population, model B for the next 20%, and model C for the remainder?
• Read the Chapter 8 example: Performance Analytics for Churn Modeling (useful for assignment 2a).
Evidence and Probabilities
Introduction
• Main idea: analyze data instances as evidence for or against different values of the target.
Case: Targeting online consumers with ads
• Target online display ads to consumers based on the webpages they have visited in the past:
  • Run a targeted campaign for, e.g., a luxury hotel.
  • Do not target randomly; obtain more bookings.
• Define our ad targeting problem more precisely:
  • What will be an instance?
  • What will be the target variable?
  • What will be the features?
  • How will we get the training data?
Targeting Online Consumers With Advertisements
• Target online display ads to consumers based on the webpages they have visited in the past:
  • Target variable: will the consumer book a hotel room within one week after having seen the advertisement?
  • Cookies allow for observing which consumers book rooms.
  • A consumer is characterized by the set of websites we have observed her to have visited previously (cookies!).
  • We assume that some of these websites are more likely to be visited by good prospects for the luxury hotel.
  • Problem: we do not have the resources to estimate the evidence potential for each site manually.
  • Problem: humans are notoriously bad at estimating the precise strength of the evidence (but quite good at using our knowledge and common sense to recognize whether evidence is likely to be “for” or “against”).
Targeting Online Consumers With Advertisements
• Idea: use historical data to estimate both the direction and the strength of the evidence.
• Combine the evidence to estimate the resulting likelihood of class membership.
Combining evidence probabilistically - Notation
• Interest: quantities such as the probability of a consumer booking a room after being shown an ad.
• We actually need to be a little more specific:
  • some particular consumer?
  • any consumer?
• Let’s call this quantity C, and represent the probability of the event C as p(C).
• p(C) = 0.0001 ‘means’ that if we were to show ads randomly to consumers, we would expect about 1 in 10,000 to book rooms.
  • Recall the expected value framework: purchase rates attributable to online advertisements generally seem very small to those outside the industry, but the cost of placing one ad often is quite small as well.
• p(C | E) is the probability of C given some evidence E (such as the set of websites visited by a particular consumer), also read as:
  • the probability of C given E;
  • the probability of C conditioned on E.
Combining evidence probabilistically - Notation
�
Interest: quantities such as the probability of a consumer booking a
room after being shown an ad.
•
•
•
�
�
�
We actually need to be a little more specific
some particular consumer?
any consumer?
Let’s call this quantity C .
Represent the probability of an event C as p(C ).
p(C ) = 0.0001 ‘means’ that if we were to show ads randomly to
consumers, we would expect about 1 in 10,000 to book rooms.
•
�
30/47
Recall: Expected value framework - Purchase rates attributable to
online advertisements generally seem very small to those outside the
industry cost of placing one ad often is quite small as well.
p(C �E ) - the probability of C given some evidence E (such as the
set of websites visited by a particular consumer).
•
•
the probability of C given E ;
the probability of C conditioned on E .
Combining evidence probabilistically - Notation
�
Interest: quantities such as the probability of a consumer booking a
room after being shown an ad.
•
•
•
�
�
�
We actually need to be a little more specific
some particular consumer?
any consumer?
Let’s call this quantity C .
Represent the probability of an event C as p(C ).
p(C ) = 0.0001 ‘means’ that if we were to show ads randomly to
consumers, we would expect about 1 in 10,000 to book rooms.
•
�
30/47
Recall: Expected value framework - Purchase rates attributable to
online advertisements generally seem very small to those outside the
industry cost of placing one ad often is quite small as well.
p(C �E ) - the probability of C given some evidence E (such as the
set of websites visited by a particular consumer).
•
•
the probability of C given E ;
the probability of C conditioned on E .
Combining evidence probabilistically - Notation
�
Interest: quantities such as the probability of a consumer booking a
room after being shown an ad.
•
•
•
�
�
�
We actually need to be a little more specific
some particular consumer?
any consumer?
Let’s call this quantity C .
Represent the probability of an event C as p(C ).
p(C ) = 0.0001 ‘means’ that if we were to show ads randomly to
consumers, we would expect about 1 in 10,000 to book rooms.
•
�
30/47
Recall: Expected value framework - Purchase rates attributable to
online advertisements generally seem very small to those outside the
industry cost of placing one ad often is quite small as well.
p(C �E ) - the probability of C given some evidence E (such as the
set of websites visited by a particular consumer).
•
•
the probability of C given E ;
the probability of C conditioned on E .
Combining evidence probabilistically - Notation
�
Interest: quantities such as the probability of a consumer booking a
room after being shown an ad.
•
•
•
�
�
�
We actually need to be a little more specific
some particular consumer?
any consumer?
Let’s call this quantity C .
Represent the probability of an event C as p(C ).
p(C ) = 0.0001 ‘means’ that if we were to show ads randomly to
consumers, we would expect about 1 in 10,000 to book rooms.
•
�
30/47
Recall: Expected value framework - Purchase rates attributable to
online advertisements generally seem very small to those outside the
industry cost of placing one ad often is quite small as well.
p(C �E ) - the probability of C given some evidence E (such as the
set of websites visited by a particular consumer).
•
•
the probability of C given E ;
the probability of C conditioned on E .
Combining evidence probabilistically - Notation
Combining evidence probabilistically (31/47)

• Want to use some labeled data to associate different collections of
  evidence E with different probabilities.
• Problem (not small):
  • For any particular collection of evidence E, there are (probably) not
    enough cases with exactly that same collection of evidence.
  • Usually you do not obtain a particular collection of evidence at all!
  • What is the chance that in our training data we have seen a consumer
    with exactly the same visiting patterns as a consumer we will see in
    the future? Infinitesimal (maybe not for Google or Facebook?).
• Solution:
  • Consider the different pieces of evidence separately, and then
    combine evidence.
Statistical independence (32/47)

• If the events A and B are statistically independent, then we can
  compute the probability that both A and B occur as p(AB) = p(A)p(B).
• Example: rolling a fair die twice
  • Event A is "roll #1 shows a six" and event B is "roll #2 shows a six".
  • p(A) = 1/6, p(B) = 1/6 (even if we know that roll #1 shows a six).
  • The events are independent, so p(AB) = p(A)p(B) = 1/36.
• The general formula for combining probabilities, which takes care of
  dependencies between events, is p(AB) = p(A)p(B|A).
  • p(B|A): given that you know A, what is the probability of B?
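The die example can be verified by enumerating all 36 equally likely outcomes, for instance:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of two fair die rolls.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, len(outcomes))  # each outcome has probability 1/36

# A = "roll #1 shows a six", B = "roll #2 shows a six".
p_A = sum(p for r1, r2 in outcomes if r1 == 6)               # 1/6
p_B = sum(p for r1, r2 in outcomes if r2 == 6)               # 1/6
p_AB = sum(p for r1, r2 in outcomes if r1 == 6 and r2 == 6)  # 1/36

assert p_AB == p_A * p_B  # the product rule holds exactly: independence
print(p_A, p_B, p_AB)     # 1/6 1/6 1/36
```

Using `Fraction` keeps the probabilities exact, so the independence check needs no floating-point tolerance.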
Bayes' rule (33/47)

• Note: p(AB) = p(A)p(B|A) = p(B)p(A|B).
• Dividing by p(A):
  p(B|A) = p(A|B)p(B) / p(A)
• Consider B to be some hypothesis of interest and A some evidence
  observed. Renaming H for hypothesis and E for evidence, we obtain
  Bayes' rule:
  p(H|E) = p(E|H)p(H) / p(E)
• We compute the probability of our hypothesis H given some evidence E
  by instead looking at the probability of the evidence given the
  hypothesis (the likelihood), as well as the unconditional probabilities
  of the hypothesis and the evidence.
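The rule is a one-line computation. A small sketch as a function (the input probabilities are illustrative numbers, not from the lecture):

```python
def bayes(p_E_given_H, p_H, p_E):
    """Bayes' rule: p(H | E) = p(E | H) * p(H) / p(E)."""
    return p_E_given_H * p_H / p_E

# Sanity check against the product rule p(AB) = p(A)p(B|A) = p(B)p(A|B),
# with illustrative numbers.
p_B, p_A_given_B = 0.2, 0.5   # so the joint p(AB) = 0.10
p_A = 0.25                    # assumed marginal probability of the evidence

p_B_given_A = bayes(p_A_given_B, p_B, p_A)
print(p_B_given_A)            # 0.5 * 0.2 / 0.25 = 0.4

# Both factorizations of the joint probability agree.
assert abs(p_A * p_B_given_A - p_B * p_A_given_B) < 1e-12
```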
Bayes' rule - Example (34/47)

• Medical diagnosis: assume you are a doctor and a patient arrives with
  red spots.
  • Hypothesized diagnosis (H = measles), evidence (E = red spots).
• In order to directly estimate p(measles|red spots), we would need to
  think through all the different reasons a person might exhibit red
  spots and what proportion of them would be measles.
• Solution (simpler):
  • p(E|H) is the probability that one has red spots given that one has
    measles.
  • p(H) is simply the probability that someone has measles, without
    considering any evidence; that's just the prevalence of measles in
    the population.
  • p(E) is the probability of the evidence: what's the probability that
    someone has red spots? Again, simply the prevalence of red spots in
    the population, which does not require complicated reasoning about
    the different underlying causes, just observation and counting.
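Putting the three ingredients together makes the arithmetic concrete. The prevalence numbers below are hypothetical, chosen only for illustration, not real epidemiological figures:

```python
# Hypothetical prevalences -- NOT real epidemiological figures.
p_spots_given_measles = 0.90  # p(E|H): measles patients showing red spots
p_measles = 0.001             # p(H): prevalence of measles in the population
p_spots = 0.05                # p(E): prevalence of red spots (any cause)

# Bayes' rule: p(H|E) = p(E|H) p(H) / p(E)
p_measles_given_spots = p_spots_given_measles * p_measles / p_spots
print(p_measles_given_spots)  # 0.018: still unlikely, despite the symptom
```

Note how the small prior p(H) keeps the posterior low even though the likelihood p(E|H) is high.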
Applying Bayes' Rule to Data Science (35/47)

• Bayes' rule is the basis of "Bayesian" methods.
• Bayes' rule for classification C = c:
  p(C = c|E) = p(E|C = c)p(C = c) / p(E)
• p(C = c|E) is the probability that the target variable C takes on the
  class of interest c after taking the evidence E (the vector of feature
  values) into account - the posterior probability.
• p(C = c) can be:
  • a "subjective" prior (the belief of a particular decision maker based
    on all her knowledge, experience, and opinions);
  • a "prior" belief based on some previous application(s) of Bayes' rule
    with other evidence;
  • an unconditional probability inferred from data (e.g. the prevalence
    of c in the population ≈ percentage of all examples that are of
    class c).
Applying Bayes' Rule to Data Science (36/47)

• Bayes' rule for classification C = c:
  p(C = c|E) = p(E|C = c)p(C = c) / p(E)
  • p(C = c|E) is the probability that the target variable C takes on
    the class of interest c after taking the evidence E (the vector of
    feature values) into account - the posterior probability.
  • p(E|C = c) is the likelihood of seeing the evidence E (the
    percentage of examples of class c that have E).
  • p(E) is the likelihood of the evidence (occurrence of E).
  • Estimating these values, we could use p(C = c|E) as an estimate of
    the class probability.
  • Alternatively, we could use the values as a score to rank instances.
Applying Bayes' Rule to Data Science (37/47)

• Drawback:
  • E is usually a vector of attribute values, so we would require
    knowledge of the full joint probability of the example,
    p(E|c) = p(e1 ∧ e2 ∧ . . . ∧ ek|c) - difficult to measure.
  • We may never see a specific example in the training data that
    matches a given E in our test data.
  • Make a particular assumption of independence (which may or may not
    hold)!
Naive Bayes - Conditional Independence (38/47)

• Recall the notion of independence: two events are independent if
  knowing one does not give you information on the probability of the
  other.
• Conditional independence is the same notion - using conditional
  probabilities:
  p(AB|C) = p(A|C)p(B|AC)
• If A and B are conditionally independent given C, this becomes:
  p(AB|C) = p(A|C)p(B|C)
• This simplifies the computation of probabilities from data for p(E|c)
  if the attributes are conditionally independent given the class
  (writing C = c simply as c):
  p(E|c) = p(e1 ∧ e2 ∧ . . . ∧ ek|c) = p(e1|c)p(e2|c) . . . p(ek|c)
Naive Bayes - Conditional Independence (39/47)

• Naive Bayes classifies a new example by estimating the probability
  that the example belongs to each class and reports the class with the
  highest probability:
  p(c|E) = p(E|c)p(c) / p(E) = p(e1|c)p(e2|c) . . . p(ek|c)p(c) / p(E)
• In practice you do not compute p(E), since:
  • in classification, we are interested in which of the different
    possible classes c makes p(c|E) the greatest (and E is the same for
    all of them);
  • classes often are mutually exclusive and exhaustive, meaning that
    every instance will belong to one and only one class, so p(E) can be
    replaced by a sum over the classes. For two classes c0 and c1:
  p(c0|E) = p(e1|c0) . . . p(ek|c0)p(c0) /
            (p(e1|c0) . . . p(ek|c0)p(c0) + p(e1|c1) . . . p(ek|c1)p(c1))
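The per-class products above can be turned into a tiny Naive Bayes classifier built from counts. A minimal sketch without smoothing (the toy data set is invented for illustration):

```python
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (feature_tuple, class_label)."""
    class_counts = Counter(c for _, c in examples)
    # feature_counts[c][(i, v)] = how often feature i takes value v in class c
    feature_counts = defaultdict(Counter)
    for features, c in examples:
        for i, v in enumerate(features):
            feature_counts[c][(i, v)] += 1
    return class_counts, feature_counts

def predict(features, class_counts, feature_counts):
    n = sum(class_counts.values())
    scores = {}
    for c, cc in class_counts.items():
        score = cc / n  # prior p(c)
        for i, v in enumerate(features):
            score *= feature_counts[c][(i, v)] / cc  # p(e_i|c), naive assumption
        scores[c] = score
    # Normalizing by the sum implements the two-class formula on the slide.
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()} if total else scores

# Toy data (invented): (visited_travel_site, clicked_ad) -> booked?
data = [(("yes", "yes"), "book"), (("yes", "no"), "book"),
        (("yes", "yes"), "no_book"), (("no", "no"), "no_book"),
        (("no", "no"), "no_book"), (("no", "yes"), "no_book")]
cc, fc = train(data)
print(predict(("yes", "yes"), cc, fc))  # 'book' ≈ 0.67, 'no_book' ≈ 0.33
```

A real implementation would add Laplace smoothing so that an unseen feature value does not zero out a whole class product.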
Advantages and Disadvantages of Naive Bayes (40/47)

• Naive Bayes:
  • is a simple classifier, although it takes all the feature evidence
    into account;
  • is very efficient in terms of storage space and computational time;
  • performs surprisingly well for classification;
  • is an "incremental learner";
  • is 'naturally' biased.
• Note that the independence assumption does not hurt classification
  performance very much:
  • to some extent, we double-count the evidence;
  • this tends to make the probability estimates more extreme in the
    correct direction.
• Do not use the probability estimates themselves! Ranking is ok!
Example: Naive Bayes classifier (slides 41-45/47)
[Figures: worked Naive Bayes classification example. Source: J.F. Ehmke]
Example: Bayes's rule as a decision tree (46/47)

• Product rule (used in Bayes' rule): p(A ∩ B) = p(A) ⋅ p(B|A) for two
  events A and B.
• Denote 'setosa' in the Iris data with i = 1, and 'versicolor' with
  i = 2.
• Denote 'petal width > 30' with p = 1, and 'petal width ≤ 30' with
  p = 0.

Decision tree (root splits on petal width, leaves on species):
  P(p = 0) = 0.45            P(p = 1) = 0.55
  P(i = 1|p = 0) = 0.6182    P(i = 2|p = 0) = 0.3818
  P(i = 1|p = 1) = 0.3556    P(i = 2|p = 1) = 0.6444

Joint probabilities:
  P(i = 1 ∩ p = 0) = 0.45 ⋅ 0.6182
  P(i = 2 ∩ p = 0) = 0.45 ⋅ 0.3818
  P(i = 1 ∩ p = 1) = 0.55 ⋅ 0.3556
  P(i = 2 ∩ p = 1) = 0.55 ⋅ 0.6444
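The joint probabilities in the tree can be computed mechanically and checked to sum to 1, since the four leaves partition the sample space:

```python
# Joint probabilities from the tree, via p(A ∩ B) = p(A) * p(B|A),
# using the numbers on the slide.
p_p = {0: 0.45, 1: 0.55}                   # petal-width branch
p_i_given_p = {0: {1: 0.6182, 2: 0.3818},  # species given branch
               1: {1: 0.3556, 2: 0.6444}}

joint = {(i, p): p_p[p] * p_i_given_p[p][i]
         for p in p_p for i in (1, 2)}
for (i, p), pr in sorted(joint.items()):
    print(f"P(i={i}, p={p}) = {pr:.4f}")

# The four joints partition the sample space, so they sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-9
```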
Questions? (47/47)
1BK40 Business Analytics
& Decision Support
Session 11
Introduction to Fuzzy Sets
Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
October 11, 2017
Where innovation starts
Choice of active holidays (2/46)

Big mountains, stunning views
Nice trails
Gnarly sections
Good chair lifts
Good emergency service
Announcements (3/46)

• New office hour determined: Wednesdays, 13:00-15:00.
• Trial exam is available in Canvas.
• Assignment 1b: Matlab functions fittingGraph and learningCurve are in
  Canvas.
• Assignment 1b: Submissions should be in the form of .m and .pdf
  files. Extras should be put in a zip file.
• Reader "Topics in Decision Analysis" is available in Canvas.
• Assignment 2a: Release on 12 October at 09:00.
Choice of active holidays
• Next series of lectures: more on how to choose a location - decision making.
  • Data-poor environment.
• Today: How can we model linguistic terms (Big, Nice, Gnarly, Good, ...)?
4/46
Outline
Introduction
Fuzzy sets
Definition
Interpretation
Properties
Operations
Closing remarks
Fundamental concepts: fuzzy sets, properties of fuzzy sets, fuzzy logic operations
5/46
Introduction
• Consider the following questions:
  • “Among all the customers of a cellphone company, which have a large income?”
  • “Will this customer purchase service S1 if his plan includes large selection of I?”
  • “How much will this customer use the service?”
  • “What is the typical cellphone usage of this customer segment?”
• What kind of information would you like to have?
6/46
Paradoxes
A barber shaves any man who does
not shave himself.
Who shaves the barber?
7/46
A person is sentenced to death and allowed to make one last statement. If the statement is true, the person will be hanged; if it is false, the person will be shot. The person says ‘I will be shot.’ What now?
Sorites
• If you remove sand grains from a sand dune one by one, when does the sand dune turn into a sand hill, and then into a sand pile?
• When is the sky cloudy? How many clouds does it take to make a clear sky not clear?
• cf. mathematical induction.
Source: dune, clouds 1, clouds 2, clouds 3.
8/46
Graduality
• Humans conceptualize the world based on concepts of similarity, gradualness, fuzziness
However, it is safe to say that the rapid expansion of electronic
transactions constitutes a major opportunity for trade and
development: it can be the source of a significant number of
success stories by which developing countries and their enterprises
can reach new levels of international competitiveness and
participate more actively in the emerging global information
economy.
(From United Nations Conference on Trade and Development).
9/46
Graduality
10/46
... the high levels of taxation on petroleum products in most consuming
countries are greatly amplifying the effects of rises in the price of crude,
to the detriment of the consumer. OPEC expresses the hope, once again,
that the governments of these countries will reduce their high taxes on a
barrel of oil - which is much more than producers themselves receive - in
the interests of market stability. In addition, speculation in the oil market
has become a key factor that has distorted realities and has artificially
influenced prices far beyond what the fundamentals indicate.
(From Opening Address to the 111th Meeting of the OPEC Conference).
Precision and information content
11/46
Incompatibility principle
12/46
“As the complexity of a system increases, our ability to make precise and
yet relevant (significant) statements about the system diminishes, until a
threshold is reached beyond which precision and relevance (significance)
become mutually exclusive characteristics” (Zadeh 1973)
• What is the perimeter of a table?
• How long are the Dutch borders?
• What is the creditworthiness of a company?
• What is the circumference of a circle?
• Can you describe the motion of a pendulum?
Introduction - Revisited
• Consider the following questions:
  • “Among all the customers of a cellphone company, which have a large income?”
  • “Will this customer purchase service S1 if his plan includes large selection of I?”
  • “How much will this customer use the service?” (little, medium, a lot)
  • “What is the typical cellphone usage of this customer segment?” (little, medium, a lot)
• What kind of information would you like to have?
• In this lecture - Mathematical formulation for these types of imprecise or vague statements.
13/46
14/46
Fuzzy Sets
Crisp sets
15/46
Collection of definite, well-definable objects (elements) to form a whole.
Representation of sets:
• List of all elements: A = {x1, ..., xn}, xj ∈ X
• Elements with property P: A = {x | x satisfies P}, x ∈ X
• Characteristic function fA : X → {0, 1}, with fA(x) = 1 ⇔ x ∈ A and fA(x) = 0 ⇔ x ∉ A
• Venn diagram
Fuzzy sets
• Sets with fuzzy, gradual boundaries (Zadeh 1965)
• A fuzzy set A in X is characterized by its membership function µA : X → [0, 1]
• A fuzzy set A is completely determined by the set of ordered pairs A = {(x, µA(x)) | x ∈ X}
• X is called the domain or universe of discourse
16/46
Crisp vs. fuzzy sets
Crisp:
• Integers larger than 3.
• Families without children.
• People with job description “manager”.
Fuzzy:
• Tall people.
• Bald men.
• Comfortable cars.
• Fast cars.
• Tall and blond Dutch.
17/46
Definition
• As a mathematical notion, a fuzzy set F on a finite universe U is unambiguously defined by a membership function µF : U → [0, 1].
• The mathematical object representing the fuzzy set is the membership function µF(x), indicating the grade of membership of element x ∈ U in F.
18/46
Interpretation
• Fuzzy sets are usually related to vagueness.
  • This vagueness is not defined as uncertainty of meaning but instead as the standard definition of vagueness with the possession of borderline cases.
• Fuzzy sets are used to represent three different concepts:
  • gradualness (original idea of Zadeh, 1965)
  • epistemic uncertainty (not discussed in detail)
  • bipolarity (not discussed in detail)
• Gradualness refers to the idea that many categories in natural language are a matter of degree, including truth.
  • The fuzzy set is used as representing some precise gradual entity consisting of a collection of items (sets).
  • The gradualness is indicated through membership. The transition between membership and non-membership is “gradual rather than abrupt”.
19/46
Gradualness
• The gradualness can be linked to different situations:
  • Example: forest zone in a grey-level image. Inherently, the boundary of this zone is gradual (zoom in on a picture). The boundaries of the set are precisely known, but it is not possible to measure (or indicate) them precisely.
  • Example: define the boundaries of a forest when the density of trees is slowly decreasing in peripheral zones. It is possible to measure each element of the set precisely (e.g. position of the trees), the boundaries of the set are known, but a (crisp) definition of its boundaries is not precise.
  • Example: define a dense forest zone. The uncertainty is linked to a fuzzy predicate referring to a gradual concept (e.g. “dense” forest zone). In this case the boundaries are known, the measure of each element is precise, but the fuzzy predicate indicates gradualness.
22/46
Degree of membership
• The degree of membership µF(x) of an element x in a fuzzy set F can be used to express:
  • Degree of similarity - related to gradualness.
  • Degree of preference (in utility functions).
  • Degree of uncertainty (not discussed).
23/46
Degree of membership
• Degree of similarity - The membership degree µF(x) represents the degree of proximity of x to prototype elements of F.
• This view is used in clustering analysis and regression analysis, where the problem is representing a set of data by the proximity between pieces of information.
  • Example: classification of cars of known dimensions in categories of F = {big cars, regular cars, small cars}. If the prototype of the category big cars is a Mercedes Class S, then we can construct a measure of distance between any car and this prototype, where the distance is a measure of similarity.
24/46
Note: Fuzzy sets vs probabilities
• Probabilities are related to randomness: uncertainty described by the tendency or frequency of a random variable to take on a value in a specific region.
  • Interpretations: symmetry, frequency, subjective probability (exchangeable betting rates) - Bayes rule.
• Fuzzy sets are related to gradualness.
• An example:
  • Predicting that the next person to walk in will be tall (can be a probability).
  • The person is in front of you; how do you decide whether they are tall?
25/46
Note: Fuzzy sets vs probabilities - Illustration
• Suppose I have 2 cartons of 10 bottles filled with water and/or poison. You are very thirsty and need to pick a bottle from one of the boxes. The trick is to decide which box to choose it from.
  • The first box has bottles that have a fuzzy membership of 0.9 in the set that describes water.
  • The second box has a 0.9 probability that you will pick a bottle of water if you choose one bottle.
• Which one to choose?
  • Box 1: Each of the bottles contains 0.1 poison and 0.9 water (tastes funky).
  • Box 2: 90% of the bottles are 100% poison and the remaining 10% are 100% water.
26/46
Fuzzy sets on discrete universes
• Fuzzy set C = “desirable city to live in”
  X = {SF, Boston, LA} (discrete and non-ordered)
  C = {(SF, 0.9), (Boston, 0.8), (LA, 0.6)}
• Fuzzy set A = “sensible number of children”
  X = {0, 1, 2, 3, 4, 5, 6} (discrete universe)
  A = {(0, .1), (1, .3), (2, .7), (3, 1), (4, .6), (5, .2), (6, .1)}
27/46
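A discrete fuzzy set like the ones above can be represented directly as a mapping from elements to membership degrees. This is a sketch of one possible representation (a plain dict; the helper name `mu` is my own, not from the slides).

```python
# Sketch: a discrete fuzzy set as a dict mapping elements to membership degrees.
A = {0: 0.1, 1: 0.3, 2: 0.7, 3: 1.0, 4: 0.6, 5: 0.2, 6: 0.1}  # "sensible number of children"

def mu(fuzzy_set, x):
    """Membership degree of x; elements outside the listed pairs get degree 0."""
    return fuzzy_set.get(x, 0.0)

print(mu(A, 3))   # 1.0: 3 children is fully "sensible"
print(mu(A, 10))  # 0.0: not in the listed pairs
```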
Fuzzy sets on continuous universes
• Fuzzy set B = “about 50 years old”
  X = set of positive real numbers (continuous)
  B = {(x, µB(x)) | x ∈ X}
  µB(x) = 1 / (1 + ((x − 50)/10)²)
28/46
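The membership function of B can be evaluated directly; a minimal sketch:

```python
# Sketch of the membership function for B = "about 50 years old" defined above.
def mu_B(x):
    return 1.0 / (1.0 + ((x - 50.0) / 10.0) ** 2)

print(mu_B(50))  # 1.0: full membership at the centre
print(mu_B(60))  # 0.5: membership halves ten years away from 50
```

Note the symmetry: ages 40 and 60 get the same degree, and membership decays smoothly rather than dropping to zero at a crisp boundary.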
About membership functions
• Subjective measures.
• Context dependent.
• Not probability functions.
29/46
Fuzzy partition
Fuzzy partition formed by the linguistic values
“young”, “middle aged”, and “old”:
30/46
Support, core, singleton
• The support of a fuzzy set A in X is the crisp subset of X whose elements have non-zero membership in A:
  supp(A) = {x ∈ X | µA(x) > 0}.
• The core of a fuzzy set A in X is the crisp subset of X whose elements have membership 1 in A:
  core(A) = {x ∈ X | µA(x) = 1}.
31/46
α-cut of a fuzzy set (level set)
• An α-level set of a fuzzy set A of X is a crisp set denoted by Aα and defined by
  Aα = {x ∈ X | µA(x) ≥ α} for α > 0, and A0 = cl(supp(A)) for α = 0.
32/46
Normal fuzzy sets
• The height of a fuzzy set A is the maximum value of µA(x).
• A fuzzy set is called normal if its height is 1; otherwise it is called sub-normal.
33/46
Convexity of fuzzy sets
A fuzzy set A is convex if for any λ ∈ [0, 1] and any x1, x2 ∈ X,
µA(λx1 + (1 − λ)x2) ≥ min(µA(x1), µA(x2)).
Alternatively, A is convex if all its α-cuts are convex.
34/46
Set theoretic operations
• Subset: A ⊆ B ⇔ µA ≤ µB
• Complement: Ā = X − A ⇔ µĀ(x) = 1 − µA(x)
• Union: C = A ∪ B ⇔ µC(x) = max(µA(x), µB(x)) = µA(x) ∨ µB(x)
• Intersection: C = A ∩ B ⇔ µC(x) = min(µA(x), µB(x)) = µA(x) ∧ µB(x)
35/46
Set theoretic operations
A ⊆ B ⇔ µA (x ) ≤ µB (x )
A is contained in B:
36/46
Average
37/46
The average of fuzzy sets A and B in X is defined by
µ(A+B)/2(x) = (µA(x) + µB(x)) / 2.
Note that classical set theory does not have averaging as a set operation. This is an extension provided by the fuzzy set approach.
Combinations with negation
Note: De Morgan laws do hold in fuzzy set theory!
38/46
MF formulation
• Triangular MF:
  trimf(x; a, b, c) = max(min((x − a)/(b − a), (c − x)/(c − b)), 0)
• Trapezoidal MF:
  trapmf(x; a, b, c, d) = max(min((x − a)/(b − a), 1, (d − x)/(d − c)), 0)
• Gaussian MF:
  gaussmf(x; c, s) = exp(−(1/2)((x − c)/s)²)
• Generalized bell MF:
  gbellmf(x; a, b, c) = 1 / (1 + |(x − c)/a|^(2b))
39/46
MF formulation
40/46
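The parametric membership functions above translate directly to code. A sketch in Python (the function names mirror the Matlab-style names on the slide; the parameter values in the example calls are my own):

```python
import math

# Sketch: the four parametric membership functions from the slide.
def trimf(x, a, b, c):
    # Triangle with feet at a and c, peak at b.
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapmf(x, a, b, c, d):
    # Trapezoid with feet at a and d, plateau between b and c.
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussmf(x, c, s):
    # Gaussian bell with centre c and width s.
    return math.exp(-0.5 * ((x - c) / s) ** 2)

def gbellmf(x, a, b, c):
    # Generalized bell with width a, slope parameter b, centre c.
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

print(trimf(2, 1, 2, 3))        # 1.0: peak at x = b
print(trapmf(2.5, 1, 2, 3, 4))  # 1.0: on the plateau
print(gaussmf(5, 5, 2))         # 1.0: at the centre
print(gbellmf(5, 2, 4, 5))      # 1.0: at the centre
```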
Cartesian product
• The Cartesian product of fuzzy sets A and B is a fuzzy set in the product space X × Y with membership µA×B(x, y) = min(µA(x), µB(y)).
• The Cartesian co-product of fuzzy sets A and B is a fuzzy set in the product space X × Y with membership µA+B(x, y) = max(µA(x), µB(y)).
41/46
Linguistic variable
42/46
• A numerical variable takes numerical values: Age = 65
• A linguistic variable takes linguistic values: Age is old
• A linguistic value is a fuzzy set.
• All linguistic values form a term set:
  T(age) = {young, not young, very young, middle aged, not middle aged, old, not old, very old, more or less old, not very young and not very old, ...}
Linguistic values (terms)
43/46
44/46
Questions?
Today
• Reader.
45/46
Now
• Read material/slides.
46/46
1BK40 - Business Analytics & Decision Making
Session 12
1BK40
Business Analytics &
Decision Support
Session 12, 2017 – 2018
Introduction to decision support
Decision heuristics
SMART
Prof. dr. ir. Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
U, Kaymak
1
Announcements
• Decision making literature is available on Canvas:
  • Introduction to decision making methods (J. Fülöp)
  • Introduction to decision analysis (D.F. Groebner et al.)
  • Topics in decision analysis (U. Kaymak)
  • Biases and Heuristics
Today’s agenda
Concepts:
• Discrete choice problem
• Alternatives and attributes
• Decision heuristics
• Simple Multi-Attribute Rating Technique (SMART)
Techniques:
• Lexicographic strategy
• Recognition heuristic
• Elimination by aspects
• SMART
Decisions with multiple attributes
Examples
• Choosing a holiday
  • liveliest nightlife
  • least crowded beaches
  • most sunshine
  • most modern hotels
  • lowest cost
• Choosing a company to supply goods
  • best after-sales service
  • fastest delivery time
  • lowest prices
  • best reputation for reliability
https://www.linkedin.com/pulse/20140827164419-92141785-effective-decision-making
Characteristics of the decision environment
• Choice set for consideration is relatively small
• Typically, a sub-set of all possibilities is considered
• Information is available, but large quantities of data are not involved
• There may be considerable uncertainty (also related to the lack of data)
• Consequences of decisions are not known accurately
• Ambiguity of goals and consequences
• Preferences of a human decision maker are important
• Emphasis on structured analysis (instead of the solution)
Bounded rationality
• The limitations of the human mind mean that people
use ‘approximate methods’ to deal with most
decision problems
cf. Miller’s 7 ± 2 categories
• As a result they seek to identify satisfactory, rather
than optimal, courses of action
• These approximate methods, or rules of thumb, are
often referred to as ‘heuristics’
Discrete choice problems
• A finite number of alternatives are considered
• How to determine eligible alternatives?
• How to collect information about these alternatives?
• Goal is to select one of the alternatives
• Alternatives are compared based on a number of
aspects (also known as attributes)
Heuristics
• These heuristics are often well adapted to the
structure of people’s knowledge of the environment
• Quick ways of making decisions, which people use,
especially when time is limited, have been referred to
as ‘fast and frugal heuristics’
Compensation or not
• Compensatory strategy - poor performance on some
attributes is compensated by good performance on
others
• Not the case in a non-compensatory strategy
• Compensatory strategies involve more cognitive
effort
The recognition heuristic
• Used where people have to choose between two
options
• If one is recognized and the other is not, the
recognized option is chosen
• Works well in environments where quality is
associated with ease of recognition
The minimalist strategy
• First apply recognition heuristic
• If neither option is recognized, simply guess which is
the best option
• If both options are recognized, pick at random one of
the attributes of the two options and choose best
performer on this attribute
• If both perform equally well on this attribute, pick a
2nd attribute at random, and so on
Take the last
• Same as minimalist heuristic except that people use
the attribute that enabled them to choose last time
when they had a similar choice
• If both options are equally good on this attribute,
choose the attribute that worked the time before, and
so on
• If none of the previously used attributes works, a
random attribute will be tried
The lexicographic strategy
• Used where attributes can be ranked in order of
importance
• Involves identifying most important attribute and
selecting the option which is best on that attribute
(e.g. choose cheapest option)
• In there’s a ‘tie’ on the most important attribute,
choose the option which performs best on the 2nd
most important attribute, and so on
Semi-lexicographic strategy
• Like the lexicographic strategy - except if options
have similar performance on an attribute they are
considered to be tied
• It can lead to violation of transitivity axiom…
(if A is preferred to B and B to C, transitivity requires
that A is also preferred to C)
Example…
• ‘If the price difference between brands is less than 50 cents choose the higher quality product, otherwise choose the cheaper brand.’

Brand   Price   Quality
A       $3.00   Low
B       $3.60   High
C       $3.40   Medium
Elimination by aspects (EBA)
• Most important attribute is identified and a
performance cut-off point is established
• Any alternative falling below this point is eliminated
• The process continues with 2nd most important
attribute, and so on
Strengths & limitations of EBA
• Easy to apply
• Involves no complicated computations
• Easy to explain and justify to others
• Fails to ensure that the alternatives retained are
superior to those which are eliminated - this arises
because the strategy is non-compensatory
Sequential decision making
Satisficing
• Used where alternatives become available
sequentially
• Search process stops when an alternative is found
which is satisfactory in that its attributes’
performances all exceed aspiration levels
• These aspiration levels themselves adjust gradually
in the light of alternatives already examined
Reason-based choice
• Shafir et al.:
‘when faced with the need to choose, decision
makers often seek and construct reasons in order to
resolve the conflict and justify their choice to
themselves and to others’.
Some consequences
• Decisions framed as ‘choose which to select…’ can
lead to different choices to those framed as ‘choose
which to reject’
• Irrelevant alternatives can influence choice
• Alternatives can be rejected if they have weakly
favourable or irrelevant attributes
Example of reason-based choice
Candidate A
• Average written communication skills
• Satisfactory absenteeism record
• Average computer skills
• Reasonable interpersonal skills
• Average level of numeracy
• Average telephone skills
Candidate B
• Excellent written communication skills
• Very good absenteeism record
• Excellent computer skills
• Awkward when dealing with others
• Poor level of numeracy
• Poor telephone skills
Factors affecting choices
• Time available to make decision
• Effort that a given strategy will involve
• Decision maker’s knowledge about the environment
• Importance of making an accurate decision
• Whether or not the choice has to be justified to others
• Desire to minimize conflict (e.g. conflicts between the pros and cons of the alternatives)
Many human aspects
Decisions Involving Multiple Aspects: SMART
Simple Multi-Attribute Rating Technique
Objectives and Attributes
• An objective = an indication of preferred direction of
movement, i.e. ‘minimize’ or ‘maximize’
• An attribute is used to measure performance in
relation to an objective
An office location problem

Location of office   Annual rent ($)
Addison Square       30 000
Bilton Village       15 000
Carlisle Walk         5 000
Denver Street        12 000
Elton Street         30 000
Filton Village       15 000
Gorton Square        10 000
Main stages of SMART
1. Identify decision maker(s)
2. Identify alternative courses of action
3. Identify the relevant attributes
4. Assess the performance of the alternatives on each attribute
5. Determine a weight for each attribute
6. For each alternative, take a weighted average of the values assigned to that alternative
7. Make a provisional decision
8. Perform sensitivity analysis
Value tree
• Costs
  • Rent
  • Electricity
  • Cleaning
• Benefits
  • Turnover
    • Closeness to customers
    • Visibility
    • Image
  • Working conditions
    • Size
    • Comfort
    • Car parking
Issues
Is the value tree an accurate and useful representation
of the decision maker’s concerns?
1. Completeness
2. Operationality
3. Decomposability
4. Absence of redundancy
5. Minimum size
Costs associated with the seven offices

Office           Annual rent ($)   Annual cleaning costs ($)   Annual electricity costs ($)   Total costs ($)
Addison Square   30 000            3000                        2000                           35 000
Bilton Village   15 000            2000                         800                           17 800
Carlisle Walk     5 000            1000                         700                            6 700
Denver Street    12 000            1000                        1100                           14 100
Elton Street     30 000            2500                        2300                           34 800
Filton Village   15 000            1000                        2600                           18 600
Gorton Square    10 000            1100                         900                           12 000
Direct rating for ‘Office Image’
• Ranking from most preferred to least preferred:
1. Addison Square
2. Elton Street
3. Filton Village
4. Denver Street
5. Gorton Square
6. Bilton Village
7. Carlisle Walk
Direct rating - Assigning values

Value function to assign values
Values for the office location problem

Attribute      A    B    C    D    E    F    G
Closeness    100   20   80   70   40    0   60
Visibility    60   80   70   50   60    0  100
Image        100   10    0   30   90   70   20
Size          75   30    0   55  100    0   50
Comfort        0  100   10   30   60   80   50
Car parking   90   30  100   90   70    0   80

One way of aggregating is summing them up, but often the criteria have different importance.
Determining swing weights
• Rank the criteria
• Judge the importance of a swing from the worst to the best compared to a swing from the worst to the best on the most important attribute
(Figure: worst-to-best swings per attribute, with closeness to customers scaled to 100, visibility to 80, and image to 70.)
For example...
A swing from the worst ‘image’ to the best ‘image’ is
considered to be 70% as important as a swing from
the worst to the best location for ‘closeness to
customers’
...so ‘image’ is assigned a weight of 70.
Normalizing weights

Attribute                Original (swing) weights   Normalized weights (rounded)
Closeness to customers   100                        32
Visibility                80                        26
Image                     70                        23
Size                      30                        10
Comfort                   20                         6
Car-parking facilities    10                         3
Total                    310                        100
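The normalization above is just a rescaling so the weights sum to 100. A sketch:

```python
# Sketch: normalizing the swing weights from the table so they sum to 100.
swing = {"Closeness to customers": 100, "Visibility": 80, "Image": 70,
         "Size": 30, "Comfort": 20, "Car-parking facilities": 10}
total = sum(swing.values())  # 310
normalized = {attr: round(100 * w / total) for attr, w in swing.items()}

print(normalized["Closeness to customers"])  # 32
print(sum(normalized.values()))              # 100
```

With these particular weights the rounded values happen to sum to exactly 100; in general rounding can leave the total slightly off, which is why the slide marks the normalized column as "(rounded)".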
1BK40 - Business Analytics & Decision Making
Session 12
Calculating aggregate benefits

Addison Square
Attribute                 Value   Weight   Value × weight
Closeness to cust.          100       32             3200
Visibility                   60       26             1560
Image                       100       23             2300
Size                         75       10              750
Comfort                       0        6                0
Car-parking facilities       90        3              270
                                          Sum:       8080
so aggregate benefits (weighted average) = 8080/100 = 80.8
Aggregate benefits computation

Table 3.2 - Values and weights for the office location problem

Attribute     Weight     A     B     C     D     E     F     G
Closeness       32     100    20    80    70    40     0    60
Visibility      26      60    80    70    50    60     0   100
Image           23     100    10     0    30    90    70    20
Size            10      75    30     0    55   100     0    50
Comfort          6       0   100    10    30    60    80    50
Car parking      3      90    30   100    90    70     0    80
Aggregate
benefits              80.8  39.4  47.4  52.3  64.8  20.9  60.2
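The aggregate-benefits row can be reproduced in a few lines; a minimal sketch using the values and weights from Table 3.2:

```python
# SMART aggregation: weighted average of values; the weights sum to 100.
weights = [32, 26, 23, 10, 6, 3]  # closeness, visibility, image, size, comfort, parking
values = {                        # per office, in the same attribute order
    "A": [100, 60, 100, 75, 0, 90],
    "B": [20, 80, 10, 30, 100, 30],
    "C": [80, 70, 0, 0, 10, 100],
    "D": [70, 50, 30, 55, 30, 90],
    "E": [40, 60, 90, 100, 60, 70],
    "F": [0, 0, 70, 0, 80, 0],
    "G": [60, 100, 20, 50, 50, 80],
}

benefits = {
    office: sum(w * v for w, v in zip(weights, vals)) / sum(weights)
    for office, vals in values.items()
}
print(benefits)  # A: 80.8, B: 39.4, C: 47.4, D: 52.3, E: 64.8, F: 20.9, G: 60.2
```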
Trading benefits against costs
• Solutions on the efficient frontier can improve a benefit only by increasing costs, and vice versa (trade-off)
• Dominated solutions have worse benefit and cost than some solution on the efficient frontier
Sensitivity analysis
• Force some weights to zero (e.g. turnover), re-compute the normalized weights, and repeat the analysis
[Figure: aggregate benefits recomputed with the turnover weights forced to zero]
Summary
This lecture:
• Definition of discrete choice problems
• Alternatives and attributes
• Decision heuristics
  • Minimalist strategy
  • Lexicographic strategies
  • Elimination by aspects
  • Sequential decision making
• Simple multi-attribute rating technique (SMART)
Next topics:
• Introduction to fuzzy sets (already covered)
• Fuzzy decision making
1BK40
Business Analytics &
Decision Support
Session 13, 2017 – 2018
Fuzzy decision making
Multicriteria decisions
Prof. dr. ir. Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
Today’s agenda
Concepts:
• Multicriteria decisions
• Fuzzy decision making
• Fuzzy set aggregation functions
Techniques:
• Bellman and Zadeh’s model
• Yager’s model
• Weighted criteria
General Formulation of DM
Decision is a quintuple (A, Q, X, k, D)
• A is the set of decision alternatives
• Q is the set of "states of environment"
• X is the set of consequences
• k is a mapping k: A × Q → X which relates decision alternatives to consequences
• D is the decision function D: X → ℝ, which orders the alternatives by preference:
  D(x_i) ≥ D(x_j)  ⟺  a_i ≽ a_j
Modeling benefits of office location
[Value tree]
Benefits
  • Turnover: Closeness to customers, Visibility, Image, Size
  • Working conditions: Comfort, Car parking
Costs
  • Rent, Electricity, Cleaning
What are the elements of the quintuple?
Alternatives:
1. Addison Square
2. Elton Street
3. Filton Village
4. Denver Street
5. Gorton Square
6. Bilton Village
7. Carlisle Walk
• Criteria
• Consequences
• Mapping k
• Decision function (e.g. when considering SMART)
Fuzzy Goals and Constraints
• A fuzzy goal is a restriction on the set of alternatives A:
  μ_G: A → [0, 1]
• Fuzzy goals are often specified indirectly, on the set of objective function values
• A fuzzy constraint is also a restriction on the set of alternatives A:
  μ_C: A → [0, 1]
• Fuzzy constraints are defined on the set of alternatives directly, or on the domain of various indicators, indirectly
• Fuzzy goals and constraints are generalisations of crisp goals and constraints
Bellman and Zadeh's model
• The fuzzy decision F is a confluence of the (fuzzy) decision goals and the (fuzzy) decision constraints
• Both the decision goals and the decision constraints should be satisfied:
  F = G ∩ C,  μ_F(a) = μ_G(a) ∧ μ_C(a),  a ∈ A
• Maximising decision (optimal decision a*): the decision with the largest membership value
  a* = arg max_{a ∈ A} μ_G(a) ∧ μ_C(a)
• The alternative corresponding to the largest membership value is denoted as the best alternative (solution)
BZ model: example
[Figure: interferon dosage (mg); "small dosage" is a fuzzy constraint and "large dosage" a fuzzy goal; the fuzzy decision is their intersection, and the maximizing decision is the dosage with the highest membership in it]
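Bellman and Zadeh's maximising decision can be sketched in a few lines; the triangular membership functions below are illustrative, not the ones from the figure:

```python
# Bellman-Zadeh: fuzzy decision = min(goal, constraint); pick the argmax.
def mu_goal(d):        # "large dosage": grows with dosage (illustrative shape)
    return min(1.0, max(0.0, d / 10.0))

def mu_constraint(d):  # "small dosage": shrinks with dosage (illustrative shape)
    return min(1.0, max(0.0, (8.0 - d) / 8.0))

dosages = [i / 10 for i in range(0, 101)]            # 0.0 .. 10.0 mg grid
mu_decision = {d: min(mu_goal(d), mu_constraint(d)) for d in dosages}
best = max(mu_decision, key=mu_decision.get)          # maximising decision

print(best, mu_decision[best])  # dosage with the largest membership value
```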
Yager's model
• A special case of Bellman and Zadeh's model
• Discrete set of alternatives
• Multiple decision criteria
• Evaluation of alternatives for each criterion by using a fuzzy set, leading to judgements (ratings, membership values)
• Use of fuzzy aggregation operators for combining the judgements (decision function)
• Decision criteria can be weighted
• Alternatives ordered by the decision function
Discrete Choice Problem
• Set of alternatives A = {a_1, …, a_n}
• Set of criteria C = {c_1, …, c_m}
  • Structure determined by the selection of criteria
• Judgements μ_ij from evaluation of each alternative for each criterion
Evaluation matrix:
          a_1   …   a_n
  c_1    μ_11   …   μ_1n
   ⋮       ⋮          ⋮
  c_m    μ_m1   …   μ_mn
Evaluations are either made using membership functions that represent fuzzy criteria, or by direct evaluation of the alternatives (i.e. filled in by the decision maker)
Discrete Choice Problem
• Weight factors denote the importance of criteria
• An aggregation function (decision function) combines weight factors and judgements for the criteria:
  D(a_j) = w(μ_1j, …, μ_mj),  j ∈ {1, …, n}
• The decision function orders the alternatives according to preference
• A higher aggregated value corresponds to a more preferred alternative:
  D(a_k) ≥ D(a_l)  ⟺  a_k ≽ a_l
Types of Fuzzy Aggregation
• Conjunctive aggregation of criteria
Models simultaneous satisfaction of criteria
T-norms
• Disjunctive aggregation of criteria
Models full compensation amongst criteria
T-conorms
• Compensatory aggregation of criteria
Models trade-off and exchange amongst the criteria
Compensatory operators, averaging operators, fuzzy
integrals
• Aggregation with a mixed behaviour
Models some types of complicated interactions
amongst criteria
Associative compensatory operators, rule-based
mappings, hierarchies of operators
T-(co)norms & Averaging Operators
• T-norms: simultaneous satisfaction of criteria
  e.g. D(a, b) = a ∧ b (minimum)
       D(a, b) = a · b (product)
       D(a, b) = max(0, a + b − 1) (bounded difference)
• T-conorms: full compensation amongst criteria
  e.g. D(a, b) = a ∨ b (maximum)
       D(a, b) = a + b − ab (algebraic sum)
       D(a, b) = min(a + b, 1) (bounded sum)
• Averaging operators: trade-off amongst criteria
  e.g. generalised averaging operator
       D(b_1, …, b_m) = ( (1/m) Σ_{i=1}^{m} b_i^s )^{1/s},  s ∈ ℝ
Generalized intersection (t-norm)
• Basic requirements:
  – Boundary: T(0, a) = 0, T(a, 1) = T(1, a) = a
  – Monotonicity: T(a, b) ≤ T(c, d) if a ≤ c and b ≤ d
  – Commutativity: T(a, b) = T(b, a)
  – Associativity: T(a, T(b, c)) = T(T(a, b), c)
• Examples:
  – Minimum: T(a, b) = a ∧ b
  – Algebraic product: T(a, b) = a · b
  – Bounded difference: T(a, b) = 0 ∨ (a + b − 1)
T-norm operator
[Figure: surface plots of four t-norms over the unit square: (a) minimum Tm(a, b), (b) algebraic product Ta(a, b), (c) bounded product Tb(a, b), (d) drastic product Td(a, b)]
Generalized union (t-conorm)
• Basic requirements:
  – Boundary: S(1, a) = 1, S(a, 0) = S(0, a) = a
  – Monotonicity: S(a, b) ≤ S(c, d) if a ≤ c and b ≤ d
  – Commutativity: S(a, b) = S(b, a)
  – Associativity: S(a, S(b, c)) = S(S(a, b), c)
• Examples:
  – Maximum: S(a, b) = a ∨ b
  – Algebraic sum: S(a, b) = a + b − ab
  – Bounded sum: S(a, b) = 1 ∧ (a + b)
T-conorm operator
[Figure: surface plots of four t-conorms over the unit square: (a) maximum Sm(a, b), (b) algebraic sum Sa(a, b), (c) bounded sum Sb(a, b), (d) drastic sum Sd(a, b)]
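The basic t-norms, their dual t-conorms, and the De Morgan relation between them are easy to check numerically; a minimal sketch using Zadeh's complement N(a) = 1 − a:

```python
# T-norms, their dual t-conorms, and the generalized De Morgan law
# S(a, b) = N(T(N(a), N(b))) with N(a) = 1 - a.
t_min  = lambda a, b: min(a, b)              # minimum
t_prod = lambda a, b: a * b                  # algebraic product
t_bnd  = lambda a, b: max(0.0, a + b - 1.0)  # bounded difference

s_max  = lambda a, b: max(a, b)              # maximum
s_sum  = lambda a, b: a + b - a * b          # algebraic sum
s_bnd  = lambda a, b: min(1.0, a + b)        # bounded sum

N = lambda a: 1.0 - a                        # Zadeh complement

def demorgan_holds(T, S, a, b, eps=1e-12):
    return abs(S(a, b) - N(T(N(a), N(b)))) < eps

pairs = [(t_min, s_max), (t_prod, s_sum), (t_bnd, s_bnd)]
grid = [i / 10 for i in range(11)]
ok = all(demorgan_holds(T, S, a, b) for T, S in pairs for a in grid for b in grid)
print(ok)  # every listed t-conorm is the De Morgan dual of its t-norm
```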
Generalised averaging operator
Has, as special cases, well-known averaging operators:
  D(b_1, …, b_m) = ( (1/m) Σ_{i=1}^{m} b_i^s )^{1/s},  s ∈ ℝ
• Minimum operator (s → −∞): D(b_1, …, b_m) = min_i b_i
• Harmonic mean (s = −1): D(b_1, …, b_m) = m / Σ_{i=1}^{m} (1/b_i)
• Geometric mean (s = 0): D(b_1, …, b_m) = ( Π_{i=1}^{m} b_i )^{1/m}
• Arithmetic mean (s = 1): D(b_1, …, b_m) = (1/m) Σ_{i=1}^{m} b_i
• Quadratic mean (s = 2): D(b_1, …, b_m) = ( (1/m) Σ_{i=1}^{m} b_i² )^{1/2}
• Maximum operator (s → ∞): D(b_1, …, b_m) = max_i b_i
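The special cases can be verified numerically; a small sketch of the generalised averaging operator, with s = 0 handled as the geometric-mean limit:

```python
import math

# Generalised averaging operator D(b; s) = ((1/m) * sum(b_i^s))^(1/s).
# s -> -inf gives min, s = -1 harmonic, s = 0 geometric (limit case),
# s = 1 arithmetic, s = 2 quadratic, s -> +inf gives max.
def gen_mean(b, s):
    m = len(b)
    if s == 0:                         # geometric mean as the s -> 0 limit
        return math.prod(b) ** (1.0 / m)
    return (sum(x ** s for x in b) / m) ** (1.0 / s)

b = [0.2, 0.5, 0.8]
print(gen_mean(b, -1))  # harmonic mean
print(gen_mean(b, 0))   # geometric mean
print(gen_mean(b, 1))   # arithmetic mean = 0.5
print(gen_mean(b, 2))   # quadratic mean
```

The printed values increase with s, illustrating that the operator is monotonic in the "index of optimism" discussed on the next slides.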
Index of optimism
• Generalized averaging operator is monotonic
in the parameter s
• s can be interpreted as an index of optimism
of the decision maker
[Figure: aggregate value as a function of the index of optimism s (from −20 to 20); the curve increases monotonically, passing through min, HM, GM, AM, QM and max]
Range of Operators
Generalized negation
• General requirements:
  – Boundary: N(0) = 1 and N(1) = 0
  – Monotonicity: N(a) ≥ N(b) if a ≤ b
  – Involution: N(N(a)) = a
• Two types of fuzzy complements:
  – Sugeno's complement: N_s(a) = (1 − a) / (1 + s·a),  s > −1
  – Yager's complement: N_w(a) = (1 − a^w)^{1/w},  w > 0
Sugeno's and Yager's complements
[Figure: (a) Sugeno's complements N_s(a) = (1 − a)/(1 + s·a) for s = −0.95, −0.7, 0, 2, 20; (b) Yager's complements N_w(a) = (1 − a^w)^{1/w} for w = 3, 1.5, 1, 0.7, 0.4; s = 0 and w = 1 both give Zadeh's complement N(a) = 1 − a]
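Both parametric complements are easy to sanity-check against the requirements above; a minimal sketch:

```python
# Sugeno and Yager fuzzy complements; both satisfy the boundary conditions
# and involution N(N(a)) = a, and reduce to Zadeh's 1 - a for s = 0 / w = 1.
def sugeno(a, s):
    return (1.0 - a) / (1.0 + s * a)

def yager(a, w):
    return (1.0 - a ** w) ** (1.0 / w)

print(sugeno(0.3, 2.0))  # a non-linear complement of 0.3
print(yager(0.3, 1.0))   # Zadeh's complement: 0.7
```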
Generalized De Morgan's Law
• T-norms and t-conorms are duals which support the generalization of De Morgan's laws:
  T(a, b) = N(S(N(a), N(b)))
  S(a, b) = N(T(N(a), N(b)))
• Dual pairs: (Tm, Sm), (Ta, Sa), (Tb, Sb), (Td, Sd)
Compensatory Operators
• Trade-off amongst criteria
• Average of a t-norm and a t-conorm:
  D(a, b) = M(T(a, b), S(a, b))
• Zimmermann operator: weighted geometric mean of the algebraic product and the algebraic sum
  D(a, b) = (ab)^{1−γ} · (a + b − ab)^{γ},  γ ∈ [0, 1]
• Hurwicz operator: weighted arithmetic mean of minimum and maximum
  D(a, b) = (1 − γ)(a ∧ b) + γ(a ∨ b),  γ ∈ [0, 1]
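Both compensatory operators interpolate between an "and"-like and an "or"-like attitude via γ; a minimal sketch:

```python
# Compensatory operators: gamma in [0, 1] shifts behaviour from
# conjunctive (gamma = 0) towards disjunctive (gamma = 1).
def zimmermann(a, b, gamma):
    # weighted geometric mean of algebraic product and algebraic sum
    return (a * b) ** (1.0 - gamma) * (a + b - a * b) ** gamma

def hurwicz(a, b, gamma):
    # weighted arithmetic mean of minimum and maximum
    return (1.0 - gamma) * min(a, b) + gamma * max(a, b)

a, b = 0.6, 0.9
print(zimmermann(a, b, 0.0))  # pure product: 0.54
print(zimmermann(a, b, 1.0))  # pure algebraic sum: 0.96
print(hurwicz(a, b, 0.5))     # halfway between min and max: 0.75
```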
Hierarchies of Operators
• Model interaction amongst criteria
• Organise a complex problem into sub-problems
• Group logically related criteria
Parameterized T-norm and conorm
• Parameterized t-norms and dual t-conorms have been proposed by several researchers:
  • Yager
  • Schweizer and Sklar
  • Dubois and Prade
  • Hamacher
  • Frank
  • Sugeno
  • Dombi
Parametric T-norms & conorms
Parametric operators generalise a wide range of operators, often by using a single parameter
T-norms:
• Yager t-norm: D(a, b) = max(0, 1 − [(1 − a)^p + (1 − b)^p]^{1/p}),  p > 0
• Hamacher t-norm: D(a, b) = ab / (γ + (1 − γ)(a + b − ab)),  γ ≥ 0
T-conorms:
• Yager t-conorm: D(a, b) = min(1, (a^p + b^p)^{1/p}),  p > 0
• Hamacher t-conorm: D(a, b) = (a + b + (γ − 2)ab) / (1 + (γ − 1)ab),  γ ≥ 0
Schweizer and Sklar
  T_SS(a, b; p) = [max{0, a^p + b^p − 1}]^{1/p}
  S_SS(a, b; p) = 1 − [max{0, (1 − a)^p + (1 − b)^p − 1}]^{1/p}
• lim_{p→0} T_SS(a, b; p) = ab (algebraic product)
• lim_{p→−∞} T_SS(a, b; p) = min(a, b)
[Figure: (a) two fuzzy sets A and B; (b) their Schweizer-Sklar t-norm; (c) their t-conorm (S-norm)]
Weighted aggregation
• Weights w represent the relative importance of the objective function and the constraints
• The problem is described by:
  D(x, w) = T( w; μ_G0(a_0ᵀx), μ_G1(a_1ᵀx), …, μ_Gm(a_mᵀx) )
  where the fuzzy sets μ_Gi represent each criterion
• The solution is given by:
  D(x*, w) = sup_x D(x, w)
• For general fuzzy optimization with simultaneous satisfaction of constraints, t-norms must be extended to their weighted counterparts
Weight factors
• Weights represent the relative importance of the various constraints and the goal within the preference structure of the decision maker
• The higher the weight of a particular criterion, the larger its influence on the aggregation result
• Importance of criteria can also be expressed directly in the membership functions
• Normalization of weights for t-norms and t-conorms:
  Σ_{i=0}^{m} w_i = 1
Two approaches to weighted aggregation
• Transforming the decision function
  • Incorporate the weight factors into the decision function in a uniform way
  • Weights become parameters of the new aggregation function
• Transforming the operands
  • Transform the membership values into new values by using the weight factors
  • Use the original non-weighted aggregation function on the transformed values
Transforming the decision function (1)
[Figure: contour lines of the non-weighted vs. the weighted decision function; the contour lines change shape when weights are introduced]
Transforming the decision function (2)
[Figure: contour lines of the non-weighted vs. the weighted Hamacher operator]
Transforming the operands
• Product t-norm
• a = 0.5, b = 0.4
• w_a = 1, w_b = 0.25
• Non-weighted evaluation: D = a · b = 0.2
• With power raising (operand a ↦ a^{w_a}):
  a^{w_a} = 0.5,  b^{w_b} = 0.4^{0.25} ≈ 0.8,  so D ≈ 0.5 × 0.8 = 0.4
[Figure: contour lines of the non-weighted and weighted product operator over the (criterion 1, criterion 2) unit square]
Weighted conjunction (examples)
• Minimum operator:
  D(x, w) = min_{i=0,…,m} [μ_Gi(x)]^{w_i}
• Product operator:
  D(x, w) = Π_{i=0}^{m} [μ_Gi(x)]^{w_i}
• Hamacher t-norm:
  D(x, w) = 0 if μ_Gi(x) = 0 for some i, and otherwise
  D(x, w) = 1 / ( 1 + Σ_{i=0}^{m} w_i (1 − μ_Gi(x)) / μ_Gi(x) )
• Yager t-norm (p = 2):
  D(x, w) = max{ 0, 1 − ( Σ_{i=0}^{m} w_i (1 − μ_Gi(x))² )^{1/2} }
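Transforming the operands with power raising (for the product t-norm) is a one-liner; a sketch using the memberships and weights from the earlier slide:

```python
# Weighted product t-norm: raise each membership to its weight, then multiply.
# A weight of 1 leaves a criterion untouched; smaller weights soften its veto power.
def weighted_product(mus, weights):
    result = 1.0
    for mu, w in zip(mus, weights):
        result *= mu ** w
    return result

mus = [0.5, 0.4]        # memberships for two criteria
weights = [1.0, 0.25]

print(weighted_product(mus, [1.0, 1.0]))  # non-weighted: 0.2
print(weighted_product(mus, weights))     # ~0.4 - criterion 2 counts less
```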
Weighted averaging
• Generalized averaging operator:
  D(x, w) = ( Σ_{i=0}^{m} w_i [μ_Gi(x)]^s )^{1/s},  s ∈ ℝ,  Σ_{i=0}^{m} w_i = 1
• Ordered weighted averages (OWA):
  D(x, w) = Σ_{i=0}^{m} w_i μ'_Gi(x),  Σ_{i=0}^{m} w_i = 1
  where the prime indicates that the membership values are ordered from smaller to larger
Weighted disjunction (examples)
Most easily obtained from weighted t-norms by using the De Morgan laws: S(x, w) = 1 − T(1 − x, w)
• Maximum operator:
  D(x, w) = 1 − min_{i=0,…,m} [1 − μ_Gi(x)]^{w_i}
• Algebraic sum:
  D(x, w) = 1 − Π_{i=0}^{m} [1 − μ_Gi(x)]^{w_i}
• Yager s-norm (p = 2):
  D(x, w) = min{ 1, ( Σ_{i=0}^{m} w_i (μ_Gi(x))² )^{1/2} }
Summary
This lecture:
• General definition of decision making (discrete choice)
• Fuzzy decision making
• Fuzzy set aggregation
• Bellman and Zadeh's model
• Yager's model
• Weighted aggregation
Next topics:
• Bayesian decision model
• Analytic Hierarchy Process (AHP)
1BK40
Business Analytics &
Decision Support
Session 14, 2017 – 2018
Bayes decision analysis
Analytic Hierarchy Process
Prof. dr. ir. Uzay Kaymak
Pav.D02, u.kaymak@tue.nl
Today’s agenda
Concepts:
• Bayes decision formulation
• Payoff table
• Perfect information and experimentation
• Pairwise comparisons
Techniques:
• Payoff table
• Expected value of perfect information
• Expected value of experimentation
• Analytic Hierarchy Process (AHP)
Scope
• For situations where there is a significant degree of
uncertainty
• Uncertainty may be about:
• key decision parameters
• the actual outcome
• problem definition …
• Often considers a small and finite set of alternatives
Example: oil exploration
Perhaps there is oil, perhaps there is not. Do you sell the land and the exploration rights, or do you drill and develop the field? It is costly to obtain better / more information!
Key concepts
• Feasible alternatives
e.g. an action from a set of possible actions
• State of the nature
a possible situation that may be the reality
• Payoff table
Quantification of each combination of alternatives
and the state of the nature
Decision making
• States of the nature have different prior probability
• Optimal decision is found by aggregation in the
payoff table
• Different decision criteria for aggregation model different types of behavior
• pessimistic viewpoint
• optimistic viewpoint
• probabilistic viewpoint, etc.
Example problem
• Oil company owns a piece of land
• Can make money by selling it
• Can drill for oil (brings money if oil is found)
• Uncertain whether oil will be found
Payoff table

                     State of Nature
Alternative           Oil      Dry
Drill                 700     -100
Sell                   90       90
Prior probability    0.25     0.75
Maximin and maximax
• Maximin payoff criterion
  For each possible action, find the minimum payoff over all possible states of nature. Next, find the maximum of these minimum payoffs. Choose the action whose minimum payoff gives this maximum.
• Maximax payoff criterion
  For each possible action, find the maximum payoff over all possible states of nature. Next, find the maximum of these maximum payoffs. Choose the action whose maximum payoff gives this maximum.
Maximin example

                     State of Nature    Minimum
Alternative           Oil      Dry      in row
Drill                 700     -100       -100
Sell                   90       90         90  ← Maximin
Only the worst state of nature is considered
Maximax example

                     State of Nature    Maximum
Alternative           Oil      Dry      in row
Drill                 700     -100        700  ← Maximax
Sell                   90       90         90
Only the best state of nature is considered
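Both criteria are one-liners over the payoff table; a minimal sketch using the drilling example:

```python
# Maximin and maximax on the drilling payoff table.
payoff = {"Drill": {"Oil": 700, "Dry": -100},
          "Sell":  {"Oil": 90,  "Dry": 90}}

maximin = max(payoff, key=lambda a: min(payoff[a].values()))
maximax = max(payoff, key=lambda a: max(payoff[a].values()))

print(maximin)  # Sell  (best worst-case payoff: 90)
print(maximax)  # Drill (best best-case payoff: 700)
```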
Maximum likelihood & Bayes
• Maximum likelihood criterion
Identify the most likely state of nature (the
one with the largest prior probability). For
this state of nature, find the action with the
maximum payoff. Choose this action.
• Bayes' decision rule
Using the best available estimates of the
probabilities of the respective states of
nature (the prior probabilities), calculate the
expected value of the payoff for each of the
possible actions. Choose the action with the
maximum expected payoff.
Maximum likelihood example

                     State of Nature
Alternative           Oil      Dry
Drill                 700     -100
Sell                   90       90  ← Maximum payoff for the most likely state
Prior probability    0.25     0.75  ← Maximum
Only the most probable state of nature (Dry) is considered
Bayes example

Expected payoff = 0.25 × oil payoff + 0.75 × dry payoff

                     State of Nature    Expected
Alternative           Oil      Dry      payoff
Drill                 700     -100        100  ← Maximum
Sell                   90       90         90
Prior probability    0.25     0.75

This criterion is also called the Expected Monetary Value (EMV) criterion
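Bayes' decision rule is a probability-weighted sum over the payoff table; a minimal sketch:

```python
# Bayes' decision rule (EMV): expected payoff under the prior probabilities.
priors = {"Oil": 0.25, "Dry": 0.75}
payoff = {"Drill": {"Oil": 700, "Dry": -100},
          "Sell":  {"Oil": 90,  "Dry": 90}}

emv = {a: sum(priors[s] * v for s, v in row.items()) for a, row in payoff.items()}
best = max(emv, key=emv.get)

print(emv)   # Drill: 100, Sell: 90
print(best)  # Drill
```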
Bayes sensitivity
[Figure: expected payoff of Drill and Sell as a function of the prior probability of oil; Drill rises linearly from −100 to 700, Sell is constant at 90, and the lines cross near p(oil) ≈ 0.24]
The maximin criterion
Another example: a decision table for the food manufacturer (daily profits)

                       Demand (no. of batches)
Course of action          1         2
Produce 1 batch         $200      $200
Produce 2 batches      –$600      $400
EMV criterion
Another decision table for the food manufacturer (daily profits)

                       Demand (no. of batches)
Course of action          1         2
Probability              0.3       0.7
Produce 1 batch         $200      $200
Produce 2 batches      –$600      $400
Calculating expected profits
Produce one batch: expected daily profit = (0.3 × $200) + (0.7 × $200) = $200
Produce two batches: expected daily profit = (0.3 × –$600) + (0.7 × $400) = $100
Sensitivity analysis
Limitations of the EMV criterion
• It assumes that the decision maker is neutral to risk
• It assumes a linear value function for money
• It considers only one attribute – money
Revising judgments
• How to improve
decisions when new
information is
available
• How much value can
we hope to obtain
from new
information?
• How much is new
information worth?
Bayes’ Theorem
Prior probability + new information → posterior probability
DM with experimentation
• Obtain additional information, e.g. a seismic study for k$ 30
• Lets you estimate the probability of the states in a better way (posterior probabilities)
Bayes' theorem (f_j: finding from experimentation, a random variable; n: number of possible states of nature):
  P(θ_i | f_j) = P(f_j | θ_i) · P(θ_i) / Σ_{k=1}^{n} P(f_j | θ_k) · P(θ_k)
The left-hand side is the posterior probability.
Probability tree
Prior probabilities: P(oil) = 0.25, P(dry) = 0.75
Conditional probabilities: P(FSS | oil) = 0.6, P(USS | oil) = 0.4, P(FSS | dry) = 0.2, P(USS | dry) = 0.8
Joint probabilities: P(oil and FSS) = 0.15, P(oil and USS) = 0.1, P(dry and FSS) = 0.15, P(dry and USS) = 0.6
Findings: P(FSS) = 0.3, P(USS) = 0.7
Posterior probabilities: P(oil | FSS) = 0.5, P(oil | USS) ≈ 0.14, P(dry | FSS) = 0.5, P(dry | USS) ≈ 0.86
(FSS / USS: favourable / unfavourable seismic sounding)
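The posterior column of the tree follows directly from Bayes' theorem; a minimal sketch using the priors and conditionals above:

```python
# Posterior probabilities from the probability tree via Bayes' theorem.
prior = {"oil": 0.25, "dry": 0.75}
cond = {("FSS", "oil"): 0.6, ("USS", "oil"): 0.4,   # P(finding | state)
        ("FSS", "dry"): 0.2, ("USS", "dry"): 0.8}

def posterior(state, finding):
    joint = cond[(finding, state)] * prior[state]          # root-to-leaf product
    total = sum(cond[(finding, s)] * prior[s] for s in prior)  # P(finding)
    return joint / total

print(posterior("oil", "FSS"))  # 0.5
print(posterior("oil", "USS"))  # ~0.14
```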
EV of perfect information
• Gives an upper bound on the expected value of experimentation
• Expected payoff with perfect information (EPPI): the average payoff assuming you can take the best decision in any state of the nature
• EVPI = EPPI − EPWE
  (EPPI: expected payoff with perfect information; EPWE: expected payoff without experimentation)
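For the drilling payoff table this works out as follows; a sketch, with the payoffs and priors taken from the earlier example:

```python
# EVPI = EPPI - EPWE for the drilling example.
priors = {"Oil": 0.25, "Dry": 0.75}
payoff = {"Drill": {"Oil": 700, "Dry": -100},
          "Sell":  {"Oil": 90,  "Dry": 90}}

# Expected payoff with perfect information: best action in every state.
eppi = sum(p * max(payoff[a][s] for a in payoff) for s, p in priors.items())

# Expected payoff without experimentation: best EMV over the actions.
epwe = max(sum(priors[s] * v for s, v in row.items()) for row in payoff.values())

evpi = eppi - epwe
print(eppi, epwe, evpi)  # 242.5, 100.0, 142.5
```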
Value of experimentation
• Identifies the potential value of experimentation
• Expected payoff with experimentation:
  EPE = Σ_j P(f_j) · E[payoff | f_j]
• EVE = EPE − EPWE
  (EPE: expected payoff with experimentation; EPWE: expected payoff without experimentation)
The components problem
Applying Bayes’ theorem
Steps for Bayes' theorem
1. Construct tree with all possible events
2. Extend tree by attaching new branches that represent new information (conditional probability)
3. Obtain joint probabilities (multiply probabilities from root to leaves)
4. Sum joint probabilities
5. Obtain posterior probability by dividing the "appropriate" joint probability by the sum of joint probabilities
DM with Bayes theorem
Retailer’s decision problem
New information
Uncertainty about outcome is
reduced (cf. entropy)
Applying posterior probabilities
Decision node
Probability node
Determining EVPI
Example
Calculating the EVPI
Buy imperfect information?
If test indicates virus is present
If test indicates virus is absent
Determining the EVE (or EVII)
Expected profit with imperfect
information = $62 155
Expected profit without the
information = $57 000
Expected value of imperfect
information (EVII) = $5 155
Oil exploration example
(Probability tree as before: priors P(oil) = 0.25, P(dry) = 0.75; conditionals P(FSS | oil) = 0.6, P(FSS | dry) = 0.2; posteriors P(oil | FSS) = 0.5, P(oil | USS) ≈ 0.14)
Constructing a decision tree
Refined decision tree
Rolling back the tree
Timing of decisions is not considered
Pearson-Tukey approximation
• Used for approximating a continuous
probability distribution with a set of three
discrete probability events
• Estimate value with 95% probability of being
exceeded (set probability of this event to 0.185)
• Estimate value with 50% probability of being
exceeded (set probability of this event to 0.63)
• Estimate value with 5% probability of being
exceeded (set probability of this event to 0.185)
• Dependent probabilities are estimated as
conditional probability
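The three-point approximation is easy to sanity-check against a distribution with known moments; a sketch for the standard normal (NormalDist is in Python's standard library):

```python
from statistics import NormalDist

# Pearson-Tukey: replace a continuous distribution by three discrete points.
# The value exceeded with probability 0.95 is the 5th percentile, etc.
dist = NormalDist(mu=0.0, sigma=1.0)
points = [dist.inv_cdf(0.05), dist.inv_cdf(0.50), dist.inv_cdf(0.95)]
probs = [0.185, 0.63, 0.185]

approx_mean = sum(p * x for p, x in zip(probs, points))
approx_var = sum(p * (x - approx_mean) ** 2 for p, x in zip(probs, points))

print(points)                   # roughly [-1.645, 0.0, 1.645]
print(approx_mean, approx_var)  # close to the true mean 0 and variance 1
```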
Example
Eliciting decision structure
Towards a better
representation?
Definitions in influence diagrams
Deriving the decision tree
1. Identify a node with no arrows pointing to it
2. If there is a choice in 1 between a decision node and an event node, choose the decision node
3. Place the node at the beginning of the decision tree and remove it from the influence diagram
4. Repeat from 1 with the new, reduced influence diagram
Tree derived from influence diagram
Analytic Hierarchy Process (AHP)
Pairwise comparisons
• Based on psychological studies
• Human judgment is error-prone when measuring quantities on relative or absolute scales
• Various biases: end effects, avoiding extreme scores, inability to compare a large number of objects simultaneously
• Humans are good at relative comparisons
  • e.g. comparing two alternatives
• One method is analytic hierarchy process (AHP)
Overview of the AHP
1. Set up decision hierarchy
2. Make pairwise comparisons of attributes and alternatives
3. Transform comparisons into weights and check consistency
4. Use weights to obtain scores for options
5. Carry out sensitivity analysis
Packaging machine problem
Scale for pairwise comparisons
• Equally important (1): alternatives are equally important
• Weakly more important (3): experience or judgment slightly favours one alternative over the other
• Strongly more important (5): experience or judgment strongly favours one alternative over the other
• Very strongly more important (7): one alternative is strongly preferred and its dominance is demonstrated in practice
• Extremely more important (9): evidence definitely favours one alternative
Reciprocal preference matrix
• n × n matrix
• Diagonal elements consist of 1's
• Numbers above the main diagonal are reciprocals of the numbers below the main diagonal:
  p_ij = 1 / p_ji
• Note that the sum of the eigenvalues is n, i.e. Σ_{j=1}^{n} λ_j = n
Example:
  |  1    3    5  |
  | 1/3   1   1/7 |
  | 1/5   7    1  |
Judgment scales
• Saaty scale (9-point scale)
  • 1, 3, 5, 7, 9 to denote importance levels
  • 2, 4, 6, 8 to denote intermediate (mixed) levels
  • i.e. 1/9, 1/7, 1/5, 1/3, 1, 3, 5, 7, 9
• Geometric scale
  • a^{−q}, …, a^{−1}, a^0, a^1, …, a^q
  • With q = 4 and a = 2: 1/16, 1/8, 1/4, 1/2, 1, 2, 4, 8, 16
Comparing criteria
Importance of quality attributes
Comparing alternatives
Solution methods
• Eigenvector solution (most widely used)
• Normalized row sums
• Normalized column sums
• Geometric mean method
• Logarithmic regression
• Computational intelligence approach
Eigenvector solution (Saaty)
• A reciprocal preference matrix has a unique positive maximal eigenvalue λ_max (Perron-Frobenius theorem)
• Find the eigenvector corresponding to this eigenvalue (the principal eigenvector); see function eig in Matlab
• Normalize the vector elements to obtain weights (usually, the sum of the vector elements equals 1)
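The principal eigenvector can also be found without a linear-algebra library via power iteration, which converges to λ_max for a positive matrix by the Perron-Frobenius theorem; a sketch using a consistent 3 × 3 example:

```python
# Power iteration: repeatedly multiply a start vector by the matrix and
# normalize; it converges to the principal eigenvector of a positive matrix.
def principal_eigenvector(matrix, iterations=100):
    n = len(matrix)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        total = sum(v)
        v = [x / total for x in v]  # normalize so weights sum to 1
    return v

# Consistent example matrix: the weights should come out as 2/9, 2/3, 1/9.
P = [[1, 1/3, 2],
     [3, 1, 6],
     [1/2, 1/6, 1]]
print(principal_eigenvector(P))
```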
Normalized row or column sums
• Sum up each row, then normalize:
  M = |  1   1/3   2  |    row sums: 10/3, 10, 5/3   (total 15)
      |  3    1    6  |
      | 1/2  1/6   1  |
  w = [10/3, 10, 5/3] / 15 = [2/9  2/3  1/9]ᵀ
• Or sum up the reciprocals of each column, then normalize; the element-wise reciprocal of M is
      |  1    3   1/2 |    column sums: 10/3, 10, 5/3   (total 15)
      | 1/3   1   1/6 |
      |  2    6    1  |
  w = [2/9  2/3  1/9]ᵀ
Geometric mean method
• Calculate the geometric mean of each row:
  m_i = ( Π_{j=1}^{n} p_ij )^{1/n}
• Normalize the judgment vector
  |  1   1/3   2  |    m_i: 0.8736, 2.6207, 0.4368
  |  3    1    6  |    weights after normalization: 2/9, 2/3, 1/9
  | 1/2  1/6   1  |
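The geometric mean method is even shorter to implement; a sketch reproducing the numbers above:

```python
import math

# Geometric mean method: geometric mean of each row, then normalize.
def gm_weights(matrix):
    n = len(matrix)
    means = [math.prod(row) ** (1.0 / n) for row in matrix]  # 0.8736, 2.6207, 0.4368
    total = sum(means)
    return [m / total for m in means]

P = [[1, 1/3, 2],
     [3, 1, 6],
     [1/2, 1/6, 1]]
print(gm_weights(P))  # ~[0.222, 0.667, 0.111], i.e. 2/9, 2/3, 1/9
```

For a consistent matrix this agrees with the eigenvector solution; for inconsistent matrices the two methods can differ slightly.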
Logarithmic regression
• Solve an optimization problem: minimize over the weights w_i
  Σ_{i=1}^{n} Σ_{j=i+1}^{n} ( ln p_ij − ln w_i + ln w_j )²
  (the above-diagonal elements p_ij are known)
• Can also deal with missing evaluations
• Can also be extended to the case with multiple decision makers
Normalization and aggregation
• Usually, the sum of the judgments is normalized to 1:
  Σ_{j=1}^{n} w_j = 1
• However, other normalizations could also be used (leading to different decision behavior!):
  ( Σ_{j=1}^{n} w_j^p )^{1/p} = 1,  p ≥ 1
Final weights
Scores for the three machines:
  Aztec     0.255
  Barton    0.541
  Congress  0.204
e.g. for Aztec:
  0.255 = 0.833 × 0.875 × 0.222
        + 0.833 × 0.125 × 0.558
        + 0.167 × 0.569 × 0.167
        + 0.167 × 0.148 × 0.286
        + 0.167 × 0.074 × 0.625
        + 0.167 × 0.209 × 0.127
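The score of an option is the sum over all leaf criteria of (criterion weight × sub-criterion weight × option weight); a sketch reproducing the 0.255 figure, with the weight values read off the slide:

```python
# AHP aggregation: multiply the weights down each branch of the hierarchy
# and sum over the leaf criteria. Each triple is (criterion weight,
# sub-criterion weight, option's weight on that sub-criterion).
branches = [
    (0.833, 0.875, 0.222),
    (0.833, 0.125, 0.558),
    (0.167, 0.569, 0.167),
    (0.167, 0.148, 0.286),
    (0.167, 0.074, 0.625),
    (0.167, 0.209, 0.127),
]

score = sum(top * sub * option for top, sub, option in branches)
print(round(score, 3))  # 0.255
```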
Consistency of comparisons
• A reciprocal preference matrix is said to be cardinally consistent when all triads of elements satisfy
  p_ik = p_ij · p_jk
• The above condition is also called a transitivity relation
• Since comparisons are pairwise, consistency is usually not guaranteed
Sources of inconsistency
• Limited range of the judgment scale
• Integer valued judgment scale
• Judgment errors of the decision maker
Inconsistency index
• Saaty proposed the following inconsistency index:
  CI = (λ_max − n) / (n − 1)
• The inconsistency index is equal to zero when the reciprocal preference matrix is cardinally consistent
• CI equals minus the average of the eigenvalues smaller than λ_max (since the eigenvalues sum to n)
• Saaty advises the inconsistency index to be smaller than 0.1
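CI needs λ_max, which power iteration also provides via the growth factor of the iterate; a sketch, checked on a perfectly consistent matrix (for which CI must be 0) and on the earlier inconsistent example:

```python
# Inconsistency index CI = (lambda_max - n) / (n - 1).
def lambda_max(matrix, iterations=200):
    n = len(matrix)
    v = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w) / sum(v)        # growth factor approximates lambda_max
        v = [x / sum(w) for x in w]
    return lam

def consistency_index(matrix):
    n = len(matrix)
    return (lambda_max(matrix) - n) / (n - 1)

# Perfectly consistent matrix (p_ik = p_ij * p_jk): CI is 0.
P = [[1, 1/3, 2], [3, 1, 6], [1/2, 1/6, 1]]
# Inconsistent example from the earlier slide: CI is clearly positive.
Q = [[1, 3, 5], [1/3, 1, 1/7], [1/5, 7, 1]]

print(consistency_index(P))  # ~0
print(consistency_index(Q))  # > 0: the comparisons are inconsistent
```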
Rank reversal
• Addition of a new alternative can change the mutual ranking of the other alternatives
• E.g. the 3 × 3 matrix on the left ranks a1 ≻ a2 ≻ a3; after adding alternative a4 (right), the ranking becomes a1 ≻ a3 ≻ a4 ≻ a2, so a2 and a3 have swapped
  |  1    5    4  |      |  1    5    4   1/3 |
  | 1/5   1    3  |      | 1/5   1    3    1  |
  | 1/4  1/3   1  |      | 1/4  1/3   1    5  |
                         |  3    1   1/5   1  |
Strengths of the AHP
• Formal structuring of the decision problem
• Simplicity of pairwise comparisons
• Redundancy allows consistency to be checked
• Versatility
Criticisms of AHP
• Conversion from verbal to numeric scale
• Problems of 1 to 9 scale
• Meaningfulness of responses to questions
• New alternatives can reverse the rank of existing
alternatives
• No. of comparisons required may be large
• Axioms of the method
Summary
This lecture:
• Bayes decisions
  • Payoff matrix
  • EVPI, EVE
  • Construction of decision trees
• Analytic Hierarchy Process
  • Pairwise comparisons
  • Hierarchy construction
  • Solution methods
Next topics:
• Preparation for exam