Business System Analysis & Decision Making

Lecture Notes 16: Bayes' Theorem and Data Mining
Zhangxi Lin
ISQS 6347
Modeling Uncertainty
 Probability Review
 Conditional Probability and Bayes' Theorem
 Bayes Classifier
 Value of Information
 Expected Value of Perfect Information
 Expected Value of Imperfect Information
Probability Review
 P(A|B) = P(A and B) / P(B)
 "Probability of A given B"
 Example: there are 40 female students in a class of 100. 10 of them are foreign students, and 20 male students are also foreign students.
 Event A: the student is a foreign student
 Event B: the student is female
 If we randomly choose a female student to present in class, the probability that she is a foreign student is:
P(A|B) = 10 / 40 = 0.25, or P(A|B) = P(A & B) / P(B) = (10 / 100) / (40 / 100) = 0.1 / 0.4 = 0.25
 That is, P(A|B) = # of A&B / # of B = (# of A&B / Total) / (# of B / Total) = P(A & B) / P(B)
Venn Diagrams
[Venn diagram of the class of 100 students]
 Female circle: 30 non-foreign + 10 foreign = 40
 Foreign-student circle: 10 female + 20 male = 30
 Overlap (female foreign students): 10
 Outside both circles (male non-foreign students): 40
Probability Review
 Complement
P(not A) = 1 - P(A)
P(not B) = 1 - P(B)
[Venn diagram: Female vs. Non-Female; Foreign Student vs. Non-Foreign Student]
Bayes Classifier
Bayes’ Theorem (From Wikipedia)
 In probability theory, Bayes' theorem (often called Bayes' Law) relates
the conditional and marginal probabilities of two random events. It is
often used to compute posterior probabilities given observations. For
example, a patient may be observed to have certain symptoms. Bayes'
theorem can be used to compute the probability that a proposed
diagnosis is correct, given that observation.
 As a formal theorem, Bayes' theorem is valid in all interpretations of
probability. However, it plays a central role in the debate around the
foundations of statistics: frequentist and Bayesian interpretations
disagree about the ways in which probabilities should be assigned in
applications. Frequentists assign probabilities to random events
according to their frequencies of occurrence or to subsets of
populations as proportions of the whole, while Bayesians describe
probabilities in terms of beliefs and degrees of uncertainty. The articles
on Bayesian probability and frequentist probability discuss these
debates at greater length.
Bayes' Theorem
P(A|B) = P(A & B) / P(B)   =>   P(A|B) P(B) = P(A & B)
P(B|A) = P(A & B) / P(A)   =>   P(B|A) P(A) = P(A & B)
So:
P(A|B) P(B) = P(B|A) P(A)
P(B|A) = P(A|B) P(B) / P(A)
       = P(A|B) P(B) / [P(A|B) P(B) + P(A|not B) P(not B)]
The above formula is referred to as Bayes' theorem. It is extremely useful in decision analysis when using information.
Example of Bayes' Theorem
 Given:
 A doctor knows that meningitis (M) causes stiff neck (S) 50% of the time
 The prior probability of any patient having meningitis is 1/50,000
 The prior probability of any patient having stiff neck is 1/20
 If a patient has a stiff neck, what is the probability he/she has meningitis?
P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
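A quick numerical check of this calculation (a minimal Python sketch; the variable names are just for illustration):

```python
# Minimal sketch of the meningitis example, using the figures on the slide.
p_s_given_m = 0.5          # P(S|M): meningitis causes stiff neck 50% of the time
p_m = 1 / 50_000           # P(M): prior probability of meningitis
p_s = 1 / 20               # P(S): prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(f"P(M|S) = {p_m_given_s:.4f}")   # 0.0002
```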
Bayes Classifiers
 Consider each attribute and class label as random
variables
 Given a record with attributes (A1, A2,…,An)
 Goal is to predict the class C, which takes one of the values c1, c2, …, cm
 Specifically, we want to find the value of C that
maximizes P(C| A1, A2,…,An )
 Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayes Classifiers
 Approach:
 Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes' theorem:
P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
 Choose the value of C that maximizes P(C | A1, A2, …, An)
 Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), since the denominator does not depend on C
 How to estimate P(A1, A2, …, An | C)?
Example
 C: Evade (Yes, No) – the class
 A1: Refund (Yes, No) – categorical
 A2: Marital Status (Single, Married, Divorced) – categorical
 A3: Taxable Income (60K – 220K) – continuous
 Training data:

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

 We can obtain P(A1, A2, A3 | C), P(A1, A2, A3), and P(C) from the data set
 Then calculate P(C | A1, A2, A3) for predictions given A1, A2, and A3, while C is unknown
Naïve Bayes Classifier
 Assume independence among the attributes Ai when the class is given:
P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
 Can estimate P(Ai | Cj) for all Ai and Cj
 A new point is classified to Cj if P(Cj) Π P(Ai | Cj) is maximal
 Note: this is the same as maximizing the posterior P(Cj | A1, …, An), since the denominator P(A1, …, An) is identical for every class; when the priors P(Cj) are also equal, it reduces to maximizing Π P(Ai | Cj)
How to Estimate Probabilities from Data?
(Using the training data table from the previous slide.)
 Class prior: P(C) = Nc / N
 e.g., P(No) = 7/10, P(Yes) = 3/10
 For discrete attributes:
P(Ai | Ck) = |Aik| / Nc
where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
 Examples:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0
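The counting above can be reproduced with a few lines of code. The sketch below is illustrative only: the record list is typed in from the table and the helper name p_attr_given_class is my own.

```python
from collections import Counter

# Training records from the table: (Refund, Marital Status, Taxable Income, Evade).
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

class_counts = Counter(r[-1] for r in records)
priors = {c: cnt / len(records) for c, cnt in class_counts.items()}
print(priors)                                   # {'No': 0.7, 'Yes': 0.3}

def p_attr_given_class(attr_index, value, cls):
    """P(Ai = value | C = cls) = |Aik| / Nc, estimated by counting."""
    matches = sum(1 for r in records if r[attr_index] == value and r[-1] == cls)
    return matches / class_counts[cls]

print(p_attr_given_class(1, "Married", "No"))   # 4/7 ≈ 0.571
print(p_attr_given_class(0, "Yes", "Yes"))      # 0.0
```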
*How to Estimate Probabilities from Data?
 For continuous attributes:
 Discretize the range into bins
 one ordinal attribute per bin
 violates independence assumption
 Two-way split: (A < v) or (A > v)
 choose only one of the two splits as new attribute
 Probability density estimation:
 Assume attribute follows a normal distribution
 Use data to estimate parameters of distribution
(e.g., mean and standard deviation)
 Once probability distribution is known, can use it to
estimate the conditional probability P(Ai|c)
*How to Estimate Probabilities from Data?
(Using the same training data table as before.)
 Normal distribution:
P(Ai | cj) = 1 / sqrt(2π σij²) × exp( -(Ai - μij)² / (2 σij²) )
 One distribution for each (Ai, cj) pair
 For (Income, Class=No):
 sample mean = 110
 sample variance = 2975
P(Income=120 | No) = 1 / (sqrt(2π) × 54.54) × exp( -(120 - 110)² / (2 × 2975) ) = 0.0072
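A small sketch of the density calculation above (the function name gaussian_density is illustrative, not from any particular library):

```python
import math

def gaussian_density(x, mean, variance):
    """Normal density used as the class-conditional likelihood P(Ai = x | c)."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# For (Income, Class=No): sample mean = 110, sample variance = 2975
print(gaussian_density(120, 110, 2975))   # ≈ 0.0072
```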
Example of Naïve Bayes Classifier
Given a test record:
X = (Refund = No, Marital Status = Married, Income = 120K)

Naïve Bayes estimates:
P(Refund=Yes|No) = 3/7          P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0           P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7      P(Marital Status=Divorced|No) = 1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7     P(Marital Status=Divorced|Yes) = 1/7
P(Marital Status=Married|Yes) = 0
For Taxable Income:
If Class=No: sample mean = 110, sample variance = 2975
If Class=Yes: sample mean = 90, sample variance = 25

 P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
= 4/7 × 4/7 × 0.0072 = 0.0024
 P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
= 1 × 0 × 1.2 × 10^-9 = 0
Since P(X|No)P(No) > P(X|Yes)P(Yes), P(No|X) > P(Yes|X)
=> Class = No
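The whole worked example can be verified with a short script. The numbers below are taken directly from the slide, and the helper gaussian is just an inline version of the normal density:

```python
import math

def gaussian(x, mean, var):
    # Normal density used for the continuous Income attribute.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Test record X = (Refund=No, Married, Income=120K); estimates taken from the slide.
p_no, p_yes = 0.7, 0.3
likelihood_no  = (4 / 7) * (4 / 7) * gaussian(120, 110, 2975)   # ≈ 0.0024
likelihood_yes = 1.0 * 0.0 * gaussian(120, 90, 25)              # = 0

print("Class =", "No" if likelihood_no * p_no > likelihood_yes * p_yes else "Yes")
# Class = No
```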
Naïve Bayes Classifier
 If one of the conditional probabilities is zero, the entire expression becomes zero
 Probability estimation:
Original:    P(Ai | C) = Nic / Nc
Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c),     c: number of classes
m-estimate:  P(Ai | C) = (Nic + m p) / (Nc + m),   p: prior probability, m: parameter
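A sketch of the smoothed estimators; the helper names and the example parameter values (p = 0.3, m = 3) are arbitrary choices for illustration:

```python
def laplace(n_ic, n_c, c):
    """Laplace-smoothed estimate of P(Ai|C), with c added to the denominator as on the slide."""
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, p, m):
    """m-estimate of P(Ai|C) with prior p and equivalent sample size m."""
    return (n_ic + m * p) / (n_c + m)

# e.g. P(Refund=Yes | Evade=Yes) was 0/3 without smoothing:
print(laplace(0, 3, 2))              # 0.2 instead of 0
print(m_estimate(0, 3, p=0.3, m=3))  # 0.15 instead of 0
```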
*Example of Naïve Bayes Classifier
Training data:

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

Test record A: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no, Class = ?

A: attributes, M: mammals, N: non-mammals

P(A|M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A|N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A|M) P(M) = 0.06 × 7/20 = 0.021
P(A|N) P(N) = 0.0042 × 13/20 = 0.0027
P(A|M) P(M) > P(A|N) P(N)  =>  Mammals
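The counts behind P(A|M) and P(A|N) can be reproduced from the table. The sketch below retypes the rows in table order and scores the test record by counting (function and variable names are my own):

```python
# Each row is (give_birth, can_fly, live_in_water, have_legs, class), copied from the table.
rows = [
    ("yes","no","no","yes","mammals"),            ("no","no","no","no","non-mammals"),
    ("no","no","yes","no","non-mammals"),         ("yes","no","yes","no","mammals"),
    ("no","no","sometimes","yes","non-mammals"),  ("no","no","no","yes","non-mammals"),
    ("yes","yes","no","yes","mammals"),           ("no","yes","no","yes","non-mammals"),
    ("yes","no","no","yes","mammals"),            ("yes","no","yes","no","non-mammals"),
    ("no","no","sometimes","yes","non-mammals"),  ("no","no","sometimes","yes","non-mammals"),
    ("yes","no","no","yes","mammals"),            ("no","no","yes","no","non-mammals"),
    ("no","no","sometimes","yes","non-mammals"),  ("no","no","no","yes","non-mammals"),
    ("no","no","no","yes","mammals"),             ("no","yes","no","yes","non-mammals"),
    ("yes","no","yes","no","mammals"),            ("no","yes","no","yes","non-mammals"),
]
test = ("yes", "no", "yes", "no")   # gives birth, cannot fly, lives in water, no legs

def score(cls):
    """Naive Bayes score P(A|cls) * P(cls), estimated by counting."""
    subset = [r for r in rows if r[-1] == cls]
    prior = len(subset) / len(rows)
    likelihood = 1.0
    for i, value in enumerate(test):
        likelihood *= sum(1 for r in subset if r[i] == value) / len(subset)
    return prior * likelihood

print(score("mammals"), score("non-mammals"))      # ≈ 0.021 vs ≈ 0.0027
print(max(("mammals", "non-mammals"), key=score))  # mammals
```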
Naïve Bayes (Summary)
 Robust to isolated noise points
 Handle missing values by ignoring the instance during
probability estimate calculations
 Robust to irrelevant attributes
 Independence assumption may not hold for some
attributes

Use other techniques such as Bayesian Belief
Networks (BBN)
Value of Information
When facing uncertain prospects we need information in order
to reduce uncertainty
Information gathering includes consulting experts, conducting
surveys, performing mathematical or statistical analyses, etc.
Expected Value of Perfect Information (EVPI)
Problem: a buyer is to buy something online, paying $100, and may optionally buy $2 of insurance.
Seller type probabilities: Good 0.99, Bad 0.01.

 Do not use insurance (pay $100):
 Good (0.99): net gain $20
 Bad (0.01): net gain -$100
 EMV = 0.99 × $20 + 0.01 × (-$100) = $18.8
 Use insurance (pay $100 + $2 = $102):
 Good (0.99): net gain $18
 Bad (0.01): net gain -$2
 EMV = 0.99 × $18 + 0.01 × (-$2) = $17.8
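A minimal sketch of the EMV comparison; the EVPI value at the end is computed in the standard way (expected value with perfect information minus the best EMV) and is not shown on the slide itself:

```python
# EMV comparison for the base decision, using the payoffs and probabilities on the slide.
p_good, p_bad = 0.99, 0.01

emv_no_insurance = p_good * 20 + p_bad * (-100)   # $18.80
emv_insurance    = p_good * 18 + p_bad * (-2)     # $17.80

# With perfect information the buyer would skip insurance for a good seller
# and buy it for a bad one.
ev_perfect = p_good * 20 + p_bad * (-2)           # $19.78
evpi = ev_perfect - max(emv_no_insurance, emv_insurance)
print(emv_no_insurance, emv_insurance, evpi)      # 18.8  17.8  0.98
```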
Expected Value of Imperfect Information (EVII)
 We rarely have access to perfect information, so we must extend the analysis to deal with imperfect information.
 Now suppose we can access the seller's online reputation to estimate the risk of trading with that seller.
 Someone provides suggestions to you according to their experience. Their predictions are not 100% correct:
 If the product is actually good, the prediction is correct ("good") 90% of the time and wrong ("bad") the remaining 10%.
 If the product is actually bad, the prediction is correct ("bad") 80% of the time and wrong ("good") the remaining 20%.
 Although the estimate is not perfectly accurate, it can be used to improve our decision making:
 If we predict that the risk of buying the product online is high, we purchase insurance.
Decision Tree
Extended from the previous online trading problem; the branch probabilities are not yet known (marked "?").

 Predicted Good
 No insurance: Good (?) $20, Bad (?) -$100
 Insurance: Good (?) $18, Bad (?) -$2
 Predicted Bad
 No insurance: Good (?) $20, Bad (?) -$100
 Insurance: Good (?) $18, Bad (?) -$2

Questions:
1. Given the suggestion, what is your decision?
2. What is the probability with respect to the decision you made?
3. How do you estimate the accuracy of a prediction?
Applying Bayes' Theorem
 Let "Good" be event A
 Let "Bad" be event B
 Let "Predicted Good" be event G
 Let "Predicted Bad" be event W
 From the previous information, for example by data mining the historical data, we know:
 P(G|A) = 0.9, P(W|A) = 0.1
 P(W|B) = 0.8, P(G|B) = 0.2
 P(A) = 0.99, P(B) = 0.01
 We want to learn the probability that the outcome is good, provided the prediction is "good", i.e.
 P(A|G) = ?
 We want to learn the probability that the outcome is bad, provided the prediction is "bad", i.e.
 P(B|W) = ?
 We can apply Bayes' theorem to solve this problem with imperfect information
Calculate P(G) and P(W)
 P(G) = P(G|A)P(A) + P(G|B)P(B)
= 0.9 * 0.99 + 0.2 * 0.01
= 0.893
 P(W) = P(W|B)P(B) + P(W|A)P(A)
= 0.8 * 0.01 + 0.1 * 0.99
= 0.107
= 1 - P(G)
Applying Bayes’ Theorem
 We have
P(A|G) = P(G|A)P(A) / P(G)
= P(G|A)P(A) / [P(G|A)P(A) + P(G|B)P(B)]
= P(G|A)P(A) / [P(G|A)P(A) + P(G|B)(1 - P(A))]
= 0.9 * 0.99 / [0.9 * 0.99 + 0.2 * 0.01]
= 0.9978 > 0.99
 P(B|W) = P(W|B)P(B) / P(W)
= P(W|B)P(B) / [P(W|B)P(B) + P(W|A)P(A)]
= P(W|B)P(B) / [P(W|B)P(B) + P(W|A)(1 - P(B))]
= 0.8 * 0.01 / [0.8 * 0.01 + 0.1 * 0.99]
= 0.0748 > 0.01
 The prediction clearly carries useful information: it revises the probability of a good outcome from 0.99 to 0.9978, and the probability of a bad outcome from 0.01 to 0.0748.
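The same posterior calculation in code (a minimal sketch using the slide's numbers; variable names are mine):

```python
# Bayes update for the online-trading example.
p_a, p_b = 0.99, 0.01                 # priors: good / bad seller
p_g_given_a, p_w_given_a = 0.9, 0.1   # prediction accuracy when actually good
p_w_given_b, p_g_given_b = 0.8, 0.2   # prediction accuracy when actually bad

p_g = p_g_given_a * p_a + p_g_given_b * p_b   # 0.893
p_w = p_w_given_a * p_a + p_w_given_b * p_b   # 0.107

p_a_given_g = p_g_given_a * p_a / p_g         # ≈ 0.9978
p_b_given_w = p_w_given_b * p_b / p_w         # ≈ 0.0748
print(p_g, p_w, p_a_given_g, p_b_given_w)
```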
Decision Tree
Prior: P(A) = 0.99, P(B) = 0.01

 Predicted Good, P(G) = 0.893
 No insurance: Good (0.9978) $20, Bad (0.0022) -$100; EMV = $19.74  <- your choice
 Insurance: Good (0.9978) $18, Bad (0.0022) -$2; EMV = $17.96
 Predicted Bad, P(W) = 0.107
 No insurance: Good (0.9252) $20, Bad (0.0748) -$100; EMV = $11.03
 Insurance: Good (0.9252) $18, Bad (0.0748) -$2; EMV = $16.50  <- your choice

Data mining can significantly improve your decision-making accuracy!
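The four branch EMVs can be recomputed from the posteriors above; the results agree with the tree up to rounding:

```python
# EMVs for the four branches of the decision tree, using the rounded posteriors.
p_a_given_g, p_b_given_g = 0.9978, 0.0022
p_a_given_w, p_b_given_w = 0.9252, 0.0748

branches = {
    ("Predicted Good", "No insurance"): p_a_given_g * 20 + p_b_given_g * (-100),
    ("Predicted Good", "Insurance"):    p_a_given_g * 18 + p_b_given_g * (-2),
    ("Predicted Bad",  "No insurance"): p_a_given_w * 20 + p_b_given_w * (-100),
    ("Predicted Bad",  "Insurance"):    p_a_given_w * 18 + p_b_given_w * (-2),
}
for branch, emv in branches.items():
    print(branch, round(emv, 2))
# Best policy: skip insurance when the prediction is "good", buy it when the prediction is "bad".
```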
Consequences of a Decision
Counts a, b, c, d of the four possible outcomes:

                     Predicted Good (G)      Predicted Bad (W)
                     (no insurance)          (buy $2 insurance)
Actual Good (A)      a: gain $20             b: net gain $18
Actual Bad (B)       c: lose $100            d: cost $2

P(A) = (a + b) / (a + b + c + d) = 0.99
P(B) = (c + d) / (a + b + c + d) = 0.01
P(G) = (a + c) / (a + b + c + d) = 0.893
P(W) = (b + d) / (a + b + c + d) = 0.107
P(G|A) = a / (a + b) = 0.9,  P(W|A) = b / (a + b) = 0.1
P(G|B) = c / (c + d) = 0.2,  P(W|B) = d / (c + d) = 0.8
German Bank Credit Decision
Payoffs are shown as (Action A, Action B):

                 Computed Good         Computed Bad          Total
Actual Good      True Positive         False Negative        700
                 600 ($6, $0)          100 ($0, -$1)
Actual Bad       False Positive        True Negative         300
                 80 (-$2, -$1)         220 (-$20, $0)
Total            680                   320

This is a modified version of the German Bank credit decision problem.
1. Assume that, because of anti-discrimination regulation, there could be a cost for a false negative, depending on the action taken.
2. The bank has two choices of action, A and B. Each will have different results.
3. Question 1: When the classification model suggests that a specific loan applicant has a probability of 0.8 of being GOOD, which action should be taken?
4. Question 2: When the classification model suggests that a specific loan applicant has a probability of 0.6 of being GOOD, which action should be taken?
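One possible way to attack Questions 1 and 2 is sketched below. The payoff mapping is an assumption for illustration (action A pays $6/-$2 and action B pays $0/-$1 for an applicant scored GOOD), not the intended answer key:

```python
# Hedged sketch: expected payoff of each action for an applicant scored GOOD,
# under the assumed payoff mapping described above.
def expected_payoff(p_good, payoff_good, payoff_bad):
    return p_good * payoff_good + (1 - p_good) * payoff_bad

for p in (0.8, 0.6):
    ev_a = expected_payoff(p, 6, -2)   # assumed payoffs for Action A
    ev_b = expected_payoff(p, 0, -1)   # assumed payoffs for Action B
    print(p, round(ev_a, 2), round(ev_b, 2), "-> take", "A" if ev_a > ev_b else "B")
```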
The Payoffs from Two Actions
Action A:
                 Computed Good         Computed Bad          Total
Actual Good      True Positive         False Negative        700
                 600 ($6)              100 ($0)
Actual Bad       False Positive        True Negative         300
                 100 (-$2)             200 (-$20)
Total            700                   300

Action B:
                 Computed Good         Computed Bad          Total
Actual Good      True Positive         False Negative        700
                 600 ($0)              100 (-$1)
Actual Bad       False Positive        True Negative         300
                 100 (-$1)             200 ($0)
Total            700                   300
Summary
 There are two decision scenarios:
 In the previous classification problems, when the predicted target is 1 we take an action, otherwise we do nothing. Only the action makes a difference.
 There is a cutoff value for this kind of decision. A risk-averse person may set a higher cutoff value when the utility function is not linear with respect to the monetary result.
 A risk-averse person may opt to earn less without the emotional worry of the risk.
 In the current Bayesian decision problem, when the predicted target is 1 we take action A, otherwise we take action B. Both actions result in some outcome.
Web Page Browsing
Problem: when a browsing user entered P5 from P2, what is the probability that he will proceed to P3?
[Site graph: pages P0 (home), P1, P2, P3, P4, P5, with link probabilities 0.7 and 0.3 shown on two of the outgoing edges.]
How to solve the problem in general?
1. Assume this is a first-order Markov chain.
2. Construct a transition probability matrix.
We notice that:
1. P(P2|P4P0) may not equal P(P2|P4P1).
2. There is only one entrance to the web site, at P0.
3. There are no links from P3 to other pages.
Transition Probabilities
P(K, L) = probability of traveling FROM page K TO page L

        P0/H     P1       P2       P3       P4       P5       Exit
P0/H    P(H,H)   P(H,1)   P(H,2)   P(H,3)   P(H,4)   P(H,5)   P(H,E)
P1      P(1,H)   P(1,1)   P(1,2)   P(1,3)   P(1,4)   P(1,5)   P(1,E)
P2      P(2,H)   P(2,1)   P(2,2)   P(2,3)   P(2,4)   P(2,5)   P(2,E)
P3      P(3,H)   P(3,1)   P(3,2)   P(3,3)   P(3,4)   P(3,5)   P(3,E)
P4      P(4,H)   P(4,1)   P(4,2)   P(4,3)   P(4,4)   P(4,5)   P(4,E)
P5      P(5,H)   P(5,1)   P(5,2)   P(5,3)   P(5,4)   P(5,5)   P(5,E)
Exit    0        0        0        0        0        0        0
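Under the first-order Markov assumption, each row of this matrix can be estimated from logged click sequences by counting page-to-page transitions. The sketch below uses made-up sessions, not the actual Commrex log data:

```python
from collections import Counter, defaultdict

# Made-up click sessions for illustration only; real sessions come from the web log.
sessions = [
    ["P0", "P2", "P5", "P4", "Exit"],
    ["P0", "P1", "P5", "P3", "Exit"],
    ["P0", "P2", "P5", "P3", "Exit"],
]

counts = defaultdict(Counter)
for s in sessions:
    for src, dst in zip(s, s[1:]):
        counts[src][dst] += 1       # count each observed K -> L transition

transition = {
    src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
    for src, dsts in counts.items()
}
print(transition["P5"])   # e.g. {'P4': 0.33..., 'P3': 0.66...} for these made-up sessions
```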
Demonstration
 Dataset: Commrex web log data
 Data Exploration
 Link analysis
 The links among nodes
 Calculate the transition matrix
 The Bayesian network model for the web log data
 Reference:
 David Heckerman, "A Tutorial on Learning With Bayesian Networks," March 1995 (revised November 1996), Technical Report MSR-TR-95-06 (\\BASRV1\ISQS6347\tr-95-06.pdf)
Readings
 SPB, Chapter 3
 RG, Chapter 10