Machine Learning and the Axioms of Probability
Brigham S. Anderson
www.cs.cmu.edu/~brigham
brigham@cmu.edu
School of Computer Science
Carnegie Mellon University
Copyright © 2006, Brigham S. Anderson
Probability
• The world is a very uncertain place
• 30 years of Artificial Intelligence and Database
research danced around this fact
• And then a few AI researchers decided to use
some ideas from the eighteenth century
What we’re going to do
• We will review the fundamentals of probability.
• It’s really going to be worth it
• You’ll see examples of probabilistic analytics in
action:
• Inference,
• Anomaly detection, and
• Bayes Classifiers
Discrete Random Variables
• A is a Boolean-valued random variable if A denotes
an event, and there is some degree of uncertainty
as to whether A occurs.
• Examples
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have influenza
Probabilities
• We write P(A) as “the probability that A is true”
• We could at this point spend 2 hours on the
philosophy of this.
• We’ll spend slightly less...
Sample Space
Definition 1.
The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment.
The elements of the sample space are called outcomes.
Sample Spaces
Sample space of a coin flip:
S = {H, T}
Sample Spaces
Sample space of a die roll:
S = {1, 2, 3, 4, 5, 6}
Sample Spaces
Sample space of three die rolls?
S = {111, 112, 113, …, 664, 665, 666}
Sample Spaces
Sample space of a single draw from a
deck of cards:
S = {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}
So Far…
Definition                                              Example
The sample space is the set of all possible worlds.     {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}
An outcome is an element of the sample space.           2c
Events
Definition 2.
An event is any subset of S (including S itself)
Events
Sample Space of card draw
• The Sample Space is the set of all
outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
Event: “Jack”
Events
Sample Space of card draw
• The Sample Space is the set of all
outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
Event: “Hearts”
Events
Sample Space of card draw
• The Sample Space is the set of all
outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes
Event: “Red and Face”
Definitions
Definition                                              Example
The sample space is the set of all possible worlds.     {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}
An outcome is a single point in the sample space.       2c
An event is a set of outcomes from the sample space.    {2h, 2c, 2s, 2d}
Events
Definition 3.
Two events A and B are mutually exclusive if A ∩ B = Ø.
Definition 4.
If A1, A2, … are mutually exclusive and A1 ∪ A2 ∪ … = S, then the collection A1, A2, … forms a partition of S.
(Example: “hearts”, “clubs”, “spades”, and “diamonds” are mutually exclusive and partition the card-draw sample space.)
Probability
Definition 5.
Given a sample space S, a probability function is a function that maps each event in S to a real number, and satisfies
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, …, we have
  P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …

* This definition of the domain of this function is not 100% sufficient, but it's close enough for our purposes… (I'm sparing you Borel Fields)
Definitions
Definition                                                      Example
The sample space is the set of all possible worlds.             {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}
An outcome is a single point in the sample space.               4c
An event is a set of one or more outcomes.                      Card is “Red”
P(E) maps event E to a real number and satisfies                P(Red) = 0.50
the axioms of probability.                                      P(Black) = 0.50
Misconception
• The relative area of the events determines their probability
  …in a Venn diagram it does, but not in general.
  However, the “area equals probability” rule is guaranteed to result in axiom-satisfying probability functions.
• We often assume, for example, that the probability of “heads” is equal to that of “tails” in the absence of other information…
  But this is totally outside the axioms!
Creating a Valid P()
• One convenient way to create an axiom-satisfying probability function (see the sketch below):
  1. Assign a probability to each outcome in S
  2. Make sure they sum to one
  3. Declare that P(A) equals the sum of the probabilities of the outcomes in event A
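To make the recipe concrete, here is a minimal Python sketch of steps 1–3; the uniform card-deck assignment is just an assumed example, not part of the recipe itself:

```python
from fractions import Fraction

# Step 1: assign a probability to each outcome in S (here: a uniform card deck).
ranks = "A23456789TJQK"
suits = "shdc"
S = [r + s for r in ranks for s in suits]            # 52 outcomes
p_outcome = {o: Fraction(1, len(S)) for o in S}

# Step 2: make sure they sum to one.
assert sum(p_outcome.values()) == 1

# Step 3: declare P(A) to be the sum over the outcomes in event A.
def P(event):
    return sum(p_outcome[o] for o in event)

jack = {o for o in S if o[0] == "J"}
print(P(jack))    # 4/52 = 1/13
```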
Everyday Example
Assume you are a doctor.
This is the sample space of
“patients you might see on
any given day”.
Outcomes:
• Non-smoker, female, diabetic, headache, good insurance, etc…
• Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc…
Everyday Example
Number of elements in the
“patient space”:
100 jillion
Are these patients equally likely to occur?
Again, generally not. Let’s assume for the
moment that they are, though.
…which roughly means “area equals probability”
Everyday Example
Event: Patient has Flu
F
Size of set “F”: 2 jillion
(Exactly 2 jillion of the points in the sample space have flu.)
Size of “patient space”: 100 jillion

P_patientSpace(F) = 2 jillion / 100 jillion = 0.02
Everyday Example
F
P_patientSpace(F) = 2 jillion / 100 jillion = 0.02
From now on, the subscript on P() will
be omitted…
These Axioms are Not to be
Trifled With
• There have been attempts to do different methodologies for uncertainty:
  • Fuzzy Logic
  • Three-valued logic
  • Dempster-Shafer
  • Non-monotonic reasoning
• But the axioms of probability are the only system with this property:
  If you gamble using them you can't be unfairly exploited by an opponent using some other system [de Finetti 1931]
Theorems from the Axioms
Axioms
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, …, we have
  P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …

Theorem.
If P is a probability function and A is an event in S, then P(~A) = 1 – P(A)

Proof:
(1) Since A and ~A partition S, P(A ∪ ~A) = P(S) = 1
(2) Since A and ~A are disjoint, P(A ∪ ~A) = P(A) + P(~A)
Combining (1) and (2) gives the result.
Multivalued Random
Variables
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on exactly one value out of {A1, A2, …, Ak}, and
• The events {A1, A2, …, Ak} partition S, so
  P(Ai, Aj) = 0  if i ≠ j
  P(A1 ∨ A2 ∨ … ∨ Ak) = 1
Elementary Probability in
Pictures
P(~A) + P(A) = 1
Elementary Probability in
Pictures
P(B) = P(B, A) + P(B, ~A)
Elementary Probability in
Pictures
∑_{j=1}^{k} P(Aj) = 1
Elementary Probability in
Pictures
P(B) = ∑_{j=1}^{k} P(B, Aj)
Useful!
Conditional Probability
Assume once more that you
are a doctor.
Again, this is the sample
space of “patients you might
see on any given day”.
Conditional Probability
Event: Flu
P(F) = 0.02
Conditional Probability
Event: Headache
P(H) = 0.10
Conditional Probability
P(F) = 0.02
P(H) = 0.10
…we still need to specify the interaction between flu and headache…
Define
P(H|F) = Fraction of F’s outcomes which are also in H
Conditional Probability
P(F) = 0.02
P(H) = 0.10
P(H|F) = 0.50
H = “headache”, F = “flu”
(The four Venn regions: P(H,F) = 0.01, P(~H,F) = 0.01, P(H,~F) = 0.09, P(~H,~F) = 0.89)
Conditional Probability
P(H|F) = Fraction of flu worlds in which the patient has a headache
       = (# worlds with flu and headache) / (# worlds with flu)
       = (size of “H and F” region) / (size of “F” region)
       = P(H, F) / P(F)

H = “headache”, F = “flu”
Conditional Probability
Definition.
If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is

  P(A|B) = P(A, B) / P(B)

The Chain Rule
A simple rearrangement of the above equation yields

  P(A, B) = P(A|B) P(B)

(Main Bayes Net concept!)
Probabilistic Inference
H = “Have a headache”
F = “Coming down with flu”
P(H) = 0.10
P(F) = 0.02
P(H|F) = 0.50
One day you wake up with a headache. You think: “Drat!
50% of flus are associated with headaches so I must have a
50-50 chance of coming down with flu”
Is this reasoning good?
Probabilistic Inference
H = “Have a headache”
F = “Coming down with flu”
P(H) = 0.10
P(F) = 0.02
P(H|F) = 0.50

P(F|H) = P(F, H) / P(H) = P(H|F) P(F) / P(H) = (0.50)(0.02) / 0.1 = 0.10
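The same one-line calculation as a small Python sketch, using the numbers from this slide:

```python
p_h = 0.10           # P(H): probability of a headache
p_f = 0.02           # P(F): probability of flu
p_h_given_f = 0.50   # P(H|F)

# Bayes rule: P(F|H) = P(H|F) P(F) / P(H)
p_f_given_h = p_h_given_f * p_f / p_h
print(p_f_given_h)   # 0.1, far from the naive "50-50" guess
```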
What we just did…
P(B|A) = P(A, B) / P(A) = P(A|B) P(B) / P(A)

This is Bayes Rule

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418.
More General Forms of Bayes
Rule
P(A|B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

P(A|B, C) = P(B|A, C) P(A, C) / P(B, C)
More General Forms of Bayes
Rule
P(Ai|B) = P(B|Ai) P(Ai) / ∑_{k=1}^{nA} P(B|Ak) P(Ak)
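A small sketch of this normalized form, assuming we are handed priors P(Ak) and likelihoods P(B|Ak) for a partition; the example numbers are derived from the earlier flu/headache figures:

```python
def bayes(prior, likelihood, i):
    """P(A_i | B) for a partition A_1..A_n, given priors P(A_k)
    and likelihoods P(B | A_k)."""
    evidence = sum(p * l for p, l in zip(prior, likelihood))  # P(B)
    return likelihood[i] * prior[i] / evidence

# Two-event partition {F, ~F} from the flu example:
# P(F)=0.02, P(~F)=0.98, P(H|F)=0.50, P(H|~F)=0.09/0.98 (from the Venn regions).
print(bayes(prior=[0.02, 0.98], likelihood=[0.50, 0.09 / 0.98], i=0))  # ≈ 0.10
```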
Independence
Definition.
Two events, A and B, are statistically independent if
  P(A, B) = P(A) P(B)
which is equivalent to
  P(A|B) = P(A)
(Important for Bayes Nets)
Representing P(A,B,C)
• How can we represent the function P(A)?
• P(A,B)?
• P(A,B,C)?
The Joint
Probability Table
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: P(A, B, C)

  A  B  C  Prob
  0  0  0  0.30
  0  0  1  0.05
  0  1  0  0.10
  0  1  1  0.05
  1  0  0  0.05
  1  0  1  0.10
  1  1  0  0.25
  1  1  1  0.10
(The same distribution can also be drawn as a three-circle Venn diagram over A, B, and C, with each region labeled by its probability.)
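As a sketch, the example JPT above fits naturally in a Python dictionary keyed by value combinations:

```python
# Joint probability table over three boolean variables, keyed by (A, B, C).
jpt = {
    (0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10,
}
assert abs(sum(jpt.values()) - 1.0) < 1e-9   # axiom: the rows sum to 1

def P(event):
    """P(E) = sum of P(row) over the rows matching E.
    `event` is a predicate on an (A, B, C) tuple."""
    return sum(p for row, p in jpt.items() if event(row))

print(P(lambda r: r[0] == 1))                 # P(A)     = 0.50
print(P(lambda r: r[0] == 1 and r[2] == 0))   # P(A, ~C) = 0.30
```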
Using the
Joint
Once you have the JPT you can ask for the probability of any logical expression E:

  P(E) = ∑_{rows matching E} P(row)

…what is P(Poor, Male)?
Using the
Joint
P(Poor, Male) = 0.4654

  P(E) = ∑_{rows matching E} P(row)

…what is P(Poor)?
Using the
Joint
P(Poor) = 0.7604

  P(E) = ∑_{rows matching E} P(row)

…what is P(Poor|Male)?
Inference
with the
Joint
P(E1|E2) = P(E1, E2) / P(E2) = [ ∑_{rows matching E1 and E2} P(row) ] / [ ∑_{rows matching E2} P(row) ]
Inference
with the
Joint
P(E1|E2) = P(E1, E2) / P(E2) = [ ∑_{rows matching E1 and E2} P(row) ] / [ ∑_{rows matching E2} P(row) ]
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
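A sketch of the same inference recipe in code; it reuses the toy A, B, C table from earlier, since the Poor/Male table itself appears only as a figure:

```python
jpt = {(0, 0, 0): 0.30, (0, 0, 1): 0.05, (0, 1, 0): 0.10, (0, 1, 1): 0.05,
       (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.25, (1, 1, 1): 0.10}

def P(event):
    return sum(p for row, p in jpt.items() if event(row))

def P_given(e1, e2):
    """P(E1 | E2): sum of rows matching both, divided by sum of rows matching E2."""
    return P(lambda r: e1(r) and e2(r)) / P(e2)

# e.g. P(A | C) with this table:
print(P_given(lambda r: r[0] == 1, lambda r: r[2] == 1))   # 0.20 / 0.30 ≈ 0.67
```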
Inference is a big deal
• I’ve got this evidence. What’s the chance that this
conclusion is true?
• I’ve got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it’s 9pm. What’s the chance my spouse
is already asleep?
• There’s a thriving set of industries growing based around
Bayesian Inference. Highlights are: Medicine, Pharma,
Help Desk Support, Engine Fault Diagnosis
Where do Joint Distributions
come from?
• Idea One: Expert Humans
• Idea Two: Simpler probabilistic facts and some
algebra
Example: Suppose you knew
P(A) = 0.5
P(B|A) = 0.2
P(B|~A) = 0.1
P(C|A,B) = 0.1
P(C|A,~B) = 0.8
P(C|~A,B) = 0.3
P(C|~A,~B) = 0.1
Then you can automatically
compute the JPT using the
chain rule
P(A,B,C) = P(A) P(B|A) P(C|A,B)
Bayes Nets are a systematic way to do this.
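A sketch of that computation, using exactly the numbers listed above:

```python
p_a = {1: 0.5, 0: 0.5}                        # P(A)
p_b_given_a = {1: 0.2, 0: 0.1}                # P(B=1 | A=a), keyed by a
p_c_given_ab = {(1, 1): 0.1, (1, 0): 0.8,     # P(C=1 | A=a, B=b)
                (0, 1): 0.3, (0, 0): 0.1}

jpt = {}
for a in (0, 1):
    for b in (0, 1):
        pb = p_b_given_a[a] if b else 1 - p_b_given_a[a]
        for c in (0, 1):
            pc = p_c_given_ab[(a, b)] if c else 1 - p_c_given_ab[(a, b)]
            # Chain rule: P(A,B,C) = P(A) P(B|A) P(C|A,B)
            jpt[(a, b, c)] = p_a[a] * pb * pc

assert abs(sum(jpt.values()) - 1.0) < 1e-9
print(jpt[(1, 1, 1)])   # P(A,B,C) = 0.5 * 0.2 * 0.1 = 0.01
```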
Where do Joint Distributions
come from?
• Idea Three: Learn them from data!
Prepare to witness an impressive learning algorithm….
Learning a JPT
Build a Joint Probability Table for your attributes in which the probabilities are unspecified.
Then fill in each row with

  P̂(row) = (# records matching row) / (total number of records)

  A  B  C  Prob             A  B  C  Prob
  0  0  0  ?                0  0  0  0.30
  0  0  1  ?                0  0  1  0.05
  0  1  0  ?                0  1  0  0.10
  0  1  1  ?       →        0  1  1  0.05
  1  0  0  ?                1  0  0  0.05
  1  0  1  ?                1  0  1  0.10
  1  1  0  ?                1  1  0  0.25
  1  1  1  ?                1  1  1  0.10

(e.g., the 0.25 in row (1, 1, 0) is the fraction of all records in which A and B are True but C is False)
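A sketch of the counting step; the record list here is made up purely for illustration:

```python
from collections import Counter

# Each record is an (A, B, C) tuple of 0/1 values; these are illustrative only.
records = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (0, 1, 0), (1, 1, 1), (0, 0, 0)]

counts = Counter(records)
# P̂(row) = (# records matching row) / (total number of records)
jpt_hat = {row: counts[row] / len(records) for row in counts}
# Note: rows that never occur get probability 0 -- the overfitting problem discussed later.

print(jpt_hat[(0, 0, 0)])   # 3/6 = 0.5
```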
Example of Learning a JPT
• This JPT was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]
Where are we?
• We have recalled the fundamentals of probability
• We have become content with what JPTs are and
how to use them
• And we even know how to learn JPTs from data.
Density Estimation
• Our Joint Probability Table (JPT) learner is our first
example of something called Density Estimation
• A Density Estimator learns a mapping from a set of
attributes to a Probability
  Input Attributes  →  Density Estimator  →  Probability
Evaluating a density
estimator
• Given a record x, a density estimator M can tell you how likely the record is:

  P̂(x | M)

• Given a dataset with R records, a density estimator can tell you how likely the dataset is
  (under the assumption that all records were independently generated from the probability function):

  P̂(dataset | M) = P̂(x1, x2, …, xR | M) = ∏_{k=1}^{R} P̂(xk | M)
A small dataset: Miles Per
Gallon
192 Training Set Records (sample shown):

  mpg   modelyear  maker
  good  75to78     asia
  bad   70to74     america
  bad   75to78     europe
  bad   70to74     america
  bad   70to74     america
  bad   70to74     asia
  bad   70to74     asia
  bad   75to78     america
  :     :          :
  bad   70to74     america
  good  79to83     america
  bad   75to78     america
  good  79to83     america
  bad   75to78     america
  good  79to83     america
  good  79to83     america
  bad   70to74     america
  good  75to78     europe
  bad   75to78     europe

From the UCI repository (thanks to Ross Quinlan)
A small dataset: Miles Per Gallon
(the same 192 training records as above)

  P̂(dataset | M) = P̂(x1, x2, …, xR | M) = ∏_{k=1}^{R} P̂(xk | M) ≈ 3.4 × 10^-203
Log Probabilities
Since probabilities of datasets get so
small we usually use log probabilities
  log P̂(dataset | M) = log ∏_{k=1}^{R} P̂(xk | M) = ∑_{k=1}^{R} log P̂(xk | M)
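A sketch of the log-likelihood computation; `p_hat` and the toy one-attribute model are assumptions for the example:

```python
import math

def log_likelihood(dataset, p_hat):
    """log P̂(dataset | M) = sum of log P̂(x_k | M), assuming i.i.d. records.
    `p_hat` maps a record to its estimated probability."""
    return sum(math.log(p_hat(x)) for x in dataset)

# Toy example: a JPT model over one boolean attribute.
model = {0: 0.7, 1: 0.3}
data = [0, 0, 1, 0, 1]
print(log_likelihood(data, lambda x: model[x]))   # 3*log(0.7) + 2*log(0.3) ≈ -3.48
```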
A small dataset: Miles Per Gallon
(the same 192 training records as above)

  log P̂(dataset | M) = ∑_{k=1}^{R} log P̂(xk | M) ≈ −466.19
Summary: The Good News
The JPT allows us to learn P(X) from data.
• Can do inference: P(E1|E2)
Automatic Doctor, Recommender, etc
• Can do anomaly detection
spot suspicious / incorrect records
(e.g., credit card fraud)
• Can do Bayesian classification
Predict the class of a record
(e.g., predict cancerous/not-cancerous)
Summary: The Bad News
• Density estimation with JPTs is trivial, mindless
and dangerous
Using a test set
An independent test set with 196 cars has a much worse log
likelihood than it had on the training set
(actually it’s a billion quintillion quintillion quintillion quintillion
times less likely)
….Density estimators can overfit. And the JPT estimator is the
overfittiest of them all!
Overfitting Density
Estimators
If this ever happens (a test record falls in a row whose learned probability is zero, so the test log-likelihood is −∞), it means there are certain combinations that we learn are “impossible”
Using a test set
The only reason that our test set didn't score −infinity is that Andrew's code is hard-wired to always predict a probability of at least one in 10^20
We need Density Estimators that are less prone
to overfitting
Is there a better way?
The problem with the JPT is that it just mirrors
the training data.
In fact, it is just another way of storing the data: we could
reconstruct the original dataset perfectly from it!
We need to represent the probability function
with fewer parameters…
Aside:
Bayes Nets
Bayes Nets
• What are they?
• Bayesian nets are a framework for representing and analyzing
models involving uncertainty
• What are they used for?
• Intelligent decision aids, data fusion, 3-E feature recognition,
intelligent diagnostic aids, automated free text understanding,
data mining
• How are they different from other knowledge
representation and probabilistic analysis
tools?
• Uncertainty is handled in a mathematically rigorous yet efficient
and simple way
Bayes Net Concepts
1. Chain Rule
   P(A,B) = P(A) P(B|A)
2. Conditional Independence
   P(A|B,C) = P(A|B)
A Simple Bayes Net
• Let's assume that we already have P(Mpg, Horse):

  P(Mpg, Horse):
    P(good, low)  = 0.36
    P(good, high) = 0.04
    P( bad, low)  = 0.12
    P( bad, high) = 0.48
How would you rewrite this using the Chain rule?
Review: Chain Rule

P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg)

  P(Mpg, Horse):             P(Mpg):             P(Horse|Mpg):
    P(good, low)  = 0.36       P(good) = 0.4       P( low|good) = 0.89
    P(good, high) = 0.04       P( bad) = 0.6       P( low| bad) = 0.21
    P( bad, low)  = 0.12                           P(high|good) = 0.11
    P( bad, high) = 0.48                           P(high| bad) = 0.79
Review: Chain Rule

P(Mpg, Horse) = P(Mpg) * P(Horse|Mpg)

  P(good, low)  = P(good) * P( low|good) = 0.4 * 0.89 ≈ 0.36
  P(good, high) = P(good) * P(high|good) = 0.4 * 0.11 ≈ 0.04
  P( bad, low)  = P( bad) * P( low| bad) = 0.6 * 0.21 ≈ 0.12
  P( bad, high) = P( bad) * P(high| bad) = 0.6 * 0.79 ≈ 0.48

  P(Mpg):               P(Horse|Mpg):
    P(good) = 0.4         P( low|good) = 0.89
    P( bad) = 0.6         P( low| bad) = 0.21
                          P(high|good) = 0.11
                          P(high| bad) = 0.79
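Going the other way (recovering P(Mpg) and P(Horse|Mpg) from the joint) is just a marginal and a division; a sketch, noting that the exact conditionals come out slightly different from the rounded CPT values on the slides:

```python
joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad",  "low"): 0.12, ("bad",  "high"): 0.48}

# P(Mpg) is the marginal of the joint ...
p_mpg = {m: sum(p for (mm, h), p in joint.items() if mm == m)
         for m in ("good", "bad")}

# ... and P(Horse | Mpg) is joint / marginal (the chain rule, rearranged).
p_horse_given_mpg = {(h, m): joint[(m, h)] / p_mpg[m] for (m, h) in joint}

print(p_mpg)                               # {'good': 0.4, 'bad': 0.6}
print(p_horse_given_mpg[("low", "good")])  # 0.36 / 0.4 = 0.9
```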
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
  (net: Mpg → Horse)
How to Make a Bayes Net
P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
  (net: Mpg → Horse)

  P(Mpg):               P(Horse|Mpg):
    P(good) = 0.4         P( low|good) = 0.90
    P( bad) = 0.6         P( low| bad) = 0.21
                          P(high|good) = 0.10
                          P(high| bad) = 0.79
How to Make a Bayes Net
• Each node is a probability function
• Each arc denotes conditional dependence
  (net: Mpg → Horse)

  P(Mpg):               P(Horse|Mpg):
    P(good) = 0.4         P( low|good) = 0.90
    P( bad) = 0.6         P( low| bad) = 0.21
                          P(high|good) = 0.10
                          P(high| bad) = 0.79
How to Make a Bayes Net
So, what have we
accomplished thus far?
Nothing;
we’ve just “Bayes Net-ified” the
P(Mpg, Horse) JPT using the
Chain rule.
…the real excitement starts
when we wield conditional
independence
  (net: Mpg → Horse, with CPTs P(Mpg) and P(Horse|Mpg))
How to Make a Bayes Net
Before we continue, we need a worthier opponent than puny P(Mpg, Horse)…
We'll use P(Mpg, Horse, Accel):   (* Note: I made these up…)

  P(Mpg, Horse, Accel):
    P(good, low, slow)  = 0.37
    P(good, low, fast)  = 0.01
    P(good, high, slow) = 0.02
    P(good, high, fast) = 0.00
    P( bad, low, slow)  = 0.10
    P( bad, low, fast)  = 0.12
    P( bad, high, slow) = 0.16
    P( bad, high, fast) = 0.22
How to Make a Bayes Net
Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
Note:
Obviously, we could have written this 3!=6 different ways…
P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
= P(M) * P(A|M) * P(H|M,A)
= P(H) * P(M|H) * P(A|H,M)
= P(H) * P(A|H) * P(M|H,A)
=…
=…
How to Make a Bayes Net
Step 1: Rewrite joint using the Chain rule.
P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
  (net: Mpg → Horse, Mpg → Accel, Horse → Accel)
How to Make a Bayes Net
  Mpg:   P(Mpg)
  Horse: P(Horse|Mpg)
  Accel: P(Accel|Mpg,Horse)

  (arcs: Mpg → Horse, Mpg → Accel, Horse → Accel)
How to Make a Bayes Net
  P(Mpg):
    P(good) = 0.4
    P( bad) = 0.6

  P(Horse|Mpg):
    P( low|good) = 0.90
    P( low| bad) = 0.21
    P(high|good) = 0.10
    P(high| bad) = 0.79

  P(Accel|Mpg,Horse):
    P(slow|good, low) = 0.97
    P(slow|good,high) = 0.15
    P(slow| bad, low) = 0.90
    P(slow| bad,high) = 0.05
    P(fast|good, low) = 0.03
    P(fast|good,high) = 0.85
    P(fast| bad, low) = 0.10
    P(fast| bad,high) = 0.95

  * Note: I made these up too…
How to Make a Bayes Net
  P(Mpg):
    P(good) = 0.4
    P( bad) = 0.6

  P(Horse|Mpg):
    P( low|good) = 0.89
    P( low| bad) = 0.21
    P(high|good) = 0.11
    P(high| bad) = 0.79

  P(Accel|Mpg,Horse):
    P(slow|good, low) = 0.97
    P(slow|good,high) = 0.15
    P(slow| bad, low) = 0.90
    P(slow| bad,high) = 0.05
    P(fast|good, low) = 0.03
    P(fast|good,high) = 0.85
    P(fast| bad, low) = 0.10
    P(fast| bad,high) = 0.95

A Miracle Occurs!
You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!
i.e., P(Accel | Mpg, Horse) = P(Accel | Horse)
How to Make a Bayes Net
  (net: Mpg → Horse → Accel)

  P(Mpg):
    P(good) = 0.4
    P( bad) = 0.6

  P(Horse|Mpg):
    P( low|good) = 0.89
    P( low| bad) = 0.21
    P(high|good) = 0.11
    P(high| bad) = 0.79

  P(Accel|Horse):
    P(slow| low) = 0.22
    P(slow|high) = 0.64
    P(fast| low) = 0.78
    P(fast|high) = 0.36
How to Make a Bayes Net
Thank you, domain expert!
Now I only need to learn 5 parameters instead of 7 from my data!
My parameter estimates will be more accurate as a result!

  (net: Mpg → Horse → Accel)

  P(Mpg):
    P(good) = 0.4
    P( bad) = 0.6

  P(Horse|Mpg):
    P( low|good) = 0.89
    P( low| bad) = 0.21
    P(high|good) = 0.11
    P(high| bad) = 0.79

  P(Accel|Horse):
    P(slow| low) = 0.22
    P(slow|high) = 0.64
    P(fast| low) = 0.78
    P(fast|high) = 0.36
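A sketch of what the assumption buys: the three small CPTs above (5 free parameters) are enough to rebuild the full eight-entry joint, which on its own would need 7 free parameters:

```python
# CPTs of the net Mpg -> Horse -> Accel (values taken from the slides).
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}    # P(Horse | Mpg)
p_accel = {("slow", "low"): 0.22, ("slow", "high"): 0.64,
           ("fast", "low"): 0.78, ("fast", "high"): 0.36}    # P(Accel | Horse)

# P(Mpg, Horse, Accel) = P(Mpg) P(Horse|Mpg) P(Accel|Horse)
joint = {(m, h, a): p_mpg[m] * p_horse[(h, m)] * p_accel[(a, h)]
         for m in p_mpg
         for h in ("low", "high")
         for a in ("slow", "fast")}

assert abs(sum(joint.values()) - 1.0) < 1e-9
print(joint[("good", "low", "slow")])   # 0.4 * 0.89 * 0.22 ≈ 0.078
```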
Independence
“The Acceleration does not depend on the Mpg once I know the
Horsepower.”
This can be specified very simply:
P(Accel Mpg, Horse) = P(Accel | Horse)
This is a powerful statement!
It required extra domain knowledge. A different kind of knowledge than
numerical probabilities. It needed an understanding of causation.
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented
directed acyclic graph, represented by the pair (V, E) where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops
of any length are allowed.
Each vertex in V contains the following information:
• A Conditional Probability Table (CPT) indicating how
this variable’s probabilities depend on all possible
combinations of parental values.
Bayes Nets Summary
• Bayes nets are a factorization of the full JPT which
uses the chain rule and conditional independence.
• They can do everything a JPT can do (like quick,
cheap lookups of probabilities)
The good news
We can do inference.
We can compute any conditional probability:
P(Some variable | Some other variable values)

  P(E1 | E2) = P(E1 ∧ E2) / P(E2) = [ ∑_{joint entries matching E1 and E2} P(joint entry) ] / [ ∑_{joint entries matching E2} P(joint entry) ]
The good news
We can do inference.
We can compute any conditional probability:
P(Some variable | Some other variable values)

  P(E1 | E2) = P(E1 ∧ E2) / P(E2) = [ ∑_{joint entries matching E1 and E2} P(joint entry) ] / [ ∑_{joint entries matching E2} P(joint entry) ]
Suppose you have m binary-valued variables in your Bayes
Net and expression E2 mentions k variables.
How much work is the above computation?
The sad, bad news
Doing inference “JPT-style” by enumerating all matching entries in the joint is expensive:
exponential in the number of variables.
But perhaps there are faster ways of querying Bayes nets?
• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find
there are often many tricks to save you time.
• So we’ve just got to program our computer to do those tricks too, right?
Sadder and worse news:
General querying of Bayes nets is NP-complete.
Case Study I
Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT
Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test-results.
• 14,000 probabilities
• Expert consulted to make net.
  • 8 hours to determine variables.
  • 35 hours for net topology.
  • 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.
Bayes Net Info
GUI Packages:
• Genie -- Free
• Netica -- $$
• Hugin -- $$
Non-GUI Packages:
• All of the above have APIs
• BNT for MATLAB
• AUTON code (learning extremely large networks of tens
of thousands of nodes)
Bayes Nets and
Machine Learning
Machine Learning Tasks
  Evidence e1, Missing Variables e2   →   Inference Engine   →   P(e2 | e1)
  Data point x                        →   Anomaly Detector   →   P(x)
  Data point x                        →   Classifier         →   P(C | x)
What is an Anomaly?
• An irregularity that cannot be explained by
simple domain models and knowledge
• Anomaly detection only needs to learn from
examples of “normal” system behavior.
• Classification, on the other hand, would need examples
labeled “normal” and “not-normal”
Anomaly Detection in
Practice
• Monitoring computer networks for attacks.
• Monitoring population-wide health data for outbreaks or
attacks.
• Looking for suspicious activity in bank transactions
• Detecting unusual eBay selling/buying behavior.
JPT
Anomaly Detector
• Suppose we have the following model:

  P(Mpg, Horse):
    P(good, low)  = 0.36
    P(good, high) = 0.04
    P( bad, low)  = 0.12
    P( bad, high) = 0.48

• We're trying to detect anomalous cars.
• If the next example we see is <good, high>, how anomalous is it?
JPT
Anomaly Detector
How likely is <good, high>?

  P(Mpg, Horse):
    P(good, low)  = 0.36
    P(good, high) = 0.04
    P( bad, low)  = 0.12
    P( bad, high) = 0.48

  likelihood(good, high) = P(good, high) = 0.04

Could not be easier! Just look up the entry in the JPT!
Smaller numbers are more anomalous in that the model is more surprised to see them.
Bayes Net
Anomaly Detector
How likely is <good, high>?

  (net: Mpg → Horse)

  P(Mpg):               P(Horse|Mpg):
    P(good) = 0.4         P( low|good) = 0.90
    P( bad) = 0.6         P( low| bad) = 0.21
                          P(high|good) = 0.10
                          P(high| bad) = 0.79

  likelihood(good, high) = P(good, high)
                         = P(good) P(high|good)
                         = 0.04
Bayes Net
Anomaly Detector
How likely is <good, high>?

  P(Mpg):               P(Horse|Mpg):
    P(good) = 0.4         P( low|good) = 0.90
    P( bad) = 0.6         P( low| bad) = 0.21
                          P(high|good) = 0.10
                          P(high| bad) = 0.79

  likelihood(good, high) = P(good, high) = P(good) P(high|good) = 0.04

Again, trivial! We need to do one tiny lookup for each variable in the network!
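In code, the anomaly score really is just one lookup per variable; a sketch using the CPT values shown above:

```python
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse = {("high", "good"): 0.10, ("low", "good"): 0.90,
           ("high", "bad"): 0.79, ("low", "bad"): 0.21}   # P(Horse | Mpg)

def likelihood(mpg, horse):
    # One tiny lookup per variable in the network.
    return p_mpg[mpg] * p_horse[(horse, mpg)]

print(likelihood("good", "high"))   # 0.4 * 0.10 = 0.04
```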
Machine Learning Tasks
  Evidence e1, Missing Variables E2   →   Inference Engine   →   P(E2 | e1)
  Data point x                        →   Anomaly Detector   →   P(x)
  Data point x                        →   Classifier         →   P(C | x)
Bayes Classifiers
• A formidable and sworn enemy of decision trees
Bayes Classifiers in 1 Slide
Bayes classifiers just do inference.
That’s it.
The “algorithm”
1. Learn P(class,X)
2. For a given input x, infer P(class|x)
3. Choose the class with the highest probability
JPT
Bayes Classifier
• Suppose we have the following model:

  P(Mpg, Horse):
    P(good, low)  = 0.36
    P(good, high) = 0.04
    P( bad, low)  = 0.12
    P( bad, high) = 0.48

• We're trying to classify cars as Mpg = “good” or “bad”
• If the next example we see is Horse = “low”, how do we classify it?
JPT
Bayes Classifier
How do we classify <Horse=low>?

  P(Mpg, Horse):
    P(good, low)  = 0.36
    P(good, high) = 0.04
    P( bad, low)  = 0.12
    P( bad, high) = 0.48

  P(good | low) = P(good, low) / P(low)
                = P(good, low) / [ P(good, low) + P( bad, low) ]
                = 0.36 / (0.36 + 0.12)
                = 0.75

The P(good | low) = 0.75, so we classify the example as “good”.
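The same classification written as “inference plus arg-max” over the JPT, as a sketch:

```python
joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad",  "low"): 0.12, ("bad",  "high"): 0.48}

def classify(horse):
    """Pick the Mpg class with the highest P(Mpg | Horse=horse)."""
    p_evidence = sum(p for (m, h), p in joint.items() if h == horse)
    posterior = {m: joint[(m, horse)] / p_evidence for m in ("good", "bad")}
    return max(posterior, key=posterior.get), posterior

print(classify("low"))   # ('good', {'good': 0.75, 'bad': 0.25})
```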
Bayes Net Classifier
  (net: Mpg → Horse → Accel)

  P(Mpg):
    P(good) = 0.4
    P( bad) = 0.6

  P(Horse|Mpg):
    P( low|good) = 0.89
    P( low| bad) = 0.21
    P(high|good) = 0.11
    P(high| bad) = 0.79

  P(Accel|Horse):
    P(slow| low) = 0.95
    P(slow|high) = 0.11
    P(fast| low) = 0.05
    P(fast|high) = 0.89

• We're trying to classify cars as Mpg = “good” or “bad”
• If the next example we see is <Horse=low, Accel=fast>, how do we classify it?
Bayes Net Bayes Classifier

Suppose we get a <Horse=low, Accel=fast> example?

  P(good | low, fast) = P(good, low, fast) / P(low, fast)
                      = P(good) P(low|good) P(fast|low) / P(low, fast)
                      = (0.4)(0.89)(0.05) / P(low, fast)
                      = 0.0178 / [ P(good, low, fast) + P( bad, low, fast) ]
                      ≈ 0.75

  (same CPTs as on the previous slide)

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…
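A sketch of the same Bayes net classification; as the note says, the normalized answer comes out near 0.74 rather than exactly 0.75 because of the earlier rounding:

```python
p_mpg = {"good": 0.4, "bad": 0.6}
p_horse_low = {"good": 0.89, "bad": 0.21}   # P(Horse=low | Mpg)
p_accel_fast_low = 0.05                     # P(Accel=fast | Horse=low)

# Unnormalized P(Mpg, low, fast) = P(Mpg) P(low|Mpg) P(fast|low)
scores = {m: p_mpg[m] * p_horse_low[m] * p_accel_fast_low for m in p_mpg}
total = sum(scores.values())                # = P(low, fast)

print({m: s / total for m, s in scores.items()})   # good ≈ 0.74
print(max(scores, key=scores.get))                 # 'good'
```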
Bayes Net Bayes Classifier

The P(good | low, fast) = 0.75, so we classify the example as “good”.

…but that seems somehow familiar…
Wasn't that the same answer as P(Mpg=good | Horse=low)?

  (same CPTs as on the previous slides)
Bayes Classifiers
• OK, so classification can be posed as inference
• In fact, virtually all machine learning tasks are a
form of inference
• Anomaly detection:   P(x)
• Classification:      P(Class | x)
• Regression:          P(Y | x)
• Model Learning:      P(Model | dataset)
• Feature Selection:   P(Model | dataset)
The Naïve Bayes Classifier
ASSUMPTION:
all the attributes are conditionally independent
given the class variable
The Naïve Bayes Advantage
(The slide compares two network structures over the same attributes.)
One structure: at least 256 parameters! You better have the data to support them…
Naïve Bayes: a mere 25 parameters!
(the CPTs are tiny because the attribute nodes only have one parent)
What is the Probability Function
of the Naïve Bayes?
P(Mpg,Cylinders,Weight,Maker,…)
=
P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) …
What is the Probability Function
of the Naïve Bayes?
P(class, x) = P(class) ∏_i P(xi | class)
This is another great feature of Bayes
Nets; you can graphically see your
model assumptions
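A minimal Naïve Bayes sketch along these lines; the attribute names and CPT numbers below are illustrative assumptions, not the learned MPG model shown on the next slides:

```python
def naive_bayes_classify(x, prior, cpts):
    """P(class, x) = P(class) * prod_i P(x_i | class); pick the arg-max class.
    cpts[i][(value, c)] holds P(attribute_i = value | class = c)."""
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for i, value in enumerate(x):
            score *= cpts[i][(value, c)]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Illustrative two-attribute example (numbers made up for the sketch):
prior = {"good": 0.4, "bad": 0.6}
cpts = [
    {("low", "good"): 0.89, ("high", "good"): 0.11,
     ("low", "bad"): 0.21, ("high", "bad"): 0.79},     # P(Horse | Mpg)
    {("4cyl", "good"): 0.7, ("8cyl", "good"): 0.3,
     ("4cyl", "bad"): 0.2, ("8cyl", "bad"): 0.8},      # P(Cylinders | Mpg)
]
print(naive_bayes_classify(("low", "4cyl"), prior, cpts))   # picks 'good'
```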
Bayes Classifier Results: “MPG”, 392 records

The Classifier learned by “Naive BC”
Bayes Classifier Results: “MPG”, 40 records
More Facts About Bayes
Classifiers
• Many other density estimators can be slotted in
• Density estimation can be performed with real-valued inputs
• Bayes Classifiers can be built with real-valued inputs
• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally
discriminative---they merely try to honestly model what’s going on
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with
the magic words “Dirichlet Prior”) can help.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes
cheerfully!
Summary
• Axioms of Probability
• Bayes nets are created by
• chain rule
• conditional independence
• Bayes Nets can do
• Inference
• Anomaly Detection
• Classification