William W. Cohen
Machine Learning 10-601
Jan 2008
Slide 1
[Some material pilfered from http://www.cs.cmu.edu/~awm/tutorials ]
Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials .
Comments and corrections gratefully received.
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University www.cs.cmu.edu/~awm awm@cs.cmu.edu
412-268-7599
Copyright © Andrew W. Moore Slide 2
• Recitation this week:
• Matlab tutorial
• Probability q/a and tutorial
• HW1 is due on Monday
Slide 3
1+0.1+0.01+0.001+0.0001+… = ?
• Lance Armstrong and the tortoise have a race
• Lance is 10x faster
0 1
• Tortoise has a 1m head start at time 0
• So, when Lance gets to 1m the tortoise is at 1.1m
• So, when Lance gets to 1.1m the tortoise is at 1.11m …
• So, when Lance gets to 1.11m the tortoise is at 1.111m … and
Lance will never catch up -?
unresolved until calculus was invented
Slide 4
• David Hume (1711-
1776): pointed out
1. Empirically, induction seems to work
2. Statement (1) is an application of induction.
• This stumped people for about 200 years
1.
Of the Different Species of Philosophy.
2.
Of the Origin of Ideas
3.
Of the Association of Ideas
4.
Sceptical Doubts Concerning the Operations of the
Understanding
5.
Sceptical Solution of These Doubts
6.
Of Probability 9
7.
Of the Idea of Necessary Connexion
8.
Of Liberty and Necessity
9.
Of the Reason of Animals
10.
Of Miracles
11.
Of A Particular Providence and of A Future State
12.
Of the Academical Or Sceptical Philosophy
Slide 5
• A black crow seems to support the hypothesis
“all crows are black”.
• A pink highlighter supports the hypothesis
“all non-black things are non-crows”
• Thus, a pink highlighter supports the hypothesis
“all crows are black”.
x CROW or
( x )
BLACK equivalent ly
( x )
x
BLACK ( x )
CROW ( x )
Slide 6
• You have much less than 200 years to figure it all out.
Slide 7
• Events
• discrete random variables, boolean random variables, compound events
• Axioms of probability
• What defines a reasonable theory of uncertainty
• Compound events
• Independent events
• Conditional probabilities
• Bayes rule and beliefs
• Joint probability distribution
Slide 8
• A is a Boolean-valued random variable if
• A denotes an event,
• there is uncertainty as to whether A occurs.
• Examples
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have Ebola
• A = the 1,000,000,000,000 th digit of π is 7
• Define P(A) as “the fraction of possible worlds in which A is true”
• We’re assuming all possible worlds are equally probable
Slide 9
• A is a Boolean-valued random variable if
• A denotes an event, a possible outcome of an “experiment ”
• Define P(A) as “the fraction of experiments in which A is true”
• We’re assuming all possible outcomes are equiprobable
• Examples
• You roll two 6-sided die (the experiment) and get doubles (A=doubles, the outcome)
• I pick two students in the class (the experiment) and they have the same birthday (A=same birthday, the outcome)
Slide 10
Event space of all possible worlds
Its area is 1
Worlds in which
A is true
P(A) = Area of reddish oval
Worlds in which A is False
Slide 11
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
Slide 12
(This is Andrew’s joke)
Slide 13
• There have been many many other approaches to understanding “uncertainty”:
• Fuzzy Logic, three-valued logic, Dempster-Shafer, nonmonotonic reasoning, …
• 25 years ago people in AI argued about these; now they mostly don’t
• Any scheme for combining uncertain information, uncertain
“beliefs”, etc,… really should obey these axioms
• If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] [your uncertainty formalism violates the axioms] - di Finetti 1931 (the “Dutch book argument”)
Slide 14
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any smaller than 0
And a zero area would mean no world could ever have A true
Slide 15
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any bigger than 1
And an area of 1 would mean all worlds will have
A true
Slide 16
A
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P( A or B ) = P( A ) + P( B ) - P( A and B )
B
Slide 17
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P( A or B ) = P( A ) + P( B ) - P( A and B )
A
B
P(A or B)
P(A and B)
B
Simple addition and subtraction
Slide 18
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P( A or B ) = P( A ) + P( B ) - P( A and B )
P(not A) = P(~A) = 1-P(A)
P(A or ~A) = 1 P(A and ~A) = 0
P(A or ~A) = P(A) + P(~A) - P(A and ~A)
1 = P(A) + P(~A) 0
Slide 19
• P(~A) + P(A) = 1
A
~A
Slide 20
• I am inflicting these proofs on you for two reasons:
1. These kind of manipulations will need to be second nature to you if you use probabilistic analytics in depth
2. Suffering is good for you
(This is also Andrew’s joke)
Slide 21
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0
• P( A or B ) = P( A ) + P( B ) - P( A and B )
P(A) = P(A ^ B) + P(A ^ ~B)
A = A and (B or ~B) = (A and B) or (A and ~B)
P(A) = P(A and B) + P(A and ~B) – P((A and B) and (A and ~B))
P(A) = P(A and B) + P(A and ~B) – P(A and A and B and ~B)
Slide 22
• P(A) = P(A ^ B) + P(A ^ ~B)
A ^ B
B
A ^ ~B
~B
Slide 23
• You’re in a game show.
Behind one door is a prize.
Behind the others, goats.
• You pick one of three doors, say #1
• The host, Monty Hall, opens one door, revealing…a goat!
You now can either
• stick with your guess
3
• always change doors
• flip a coin and pick a new door randomly according to the coin
Slide 24
• Case 1: you don’t swap.
• W = you win.
• Pre-goat: P(W)=1/3
• Post-goat: P(W)=1/3
• Case 2: you swap
• W1=you picked the cash initially.
• W2=you win.
• Pre-goat: P(W1)=1/3.
• Post-goat:
• W2 = ~W1
• Pr(W2) = 1-P(W1)=2/3.
Moral: ?
Slide 25
The Extreme Monty Hall/Survivor Problem
• You’re in a game show. There are 10,000 doors. Only one of them has a prize.
• You pick a door.
• Over the remaining 13 weeks, the host eliminates 9,998 of the remaining doors.
• For the season finale:
• Do you switch, or not?
…
Slide 26
Some practical problems
• You’re the DM in a D&D game.
• Joe brings his own d20 and throws 4 critical hits in a row to start off
• DM=dungeon master
• D20 = 20-sided die
• “Critical hit” = 19 or 20
• Is Joe cheating?
• What is P(A), A=four critical hits?
• A is a compound event: A = C1 and C2 and C3 and C4
Slide 27
• Definition: two events A and B are independent if Pr(A and B)=Pr(A)*Pr(B).
• Intuition: outcome of A has no effect on the outcome of B (and vice versa).
• We need to assume the different rolls are independent to solve the problem.
• You almost always need to assume independence of learning problem.
something to solve any
Slide 28
Some practical problems
• You’re the DM in a D&D game.
• Joe brings his own d20 and throws 4 critical hits in a row to start off
• DM=dungeon master
• D20 = 20-sided die
• “Critical hit” = 19 or 20
• What are the odds of that happening with a fair die?
• Ci=critical hit on trial i, i=1,2,3,4
• P(C1 and C2 … and C4) = P(C1)*…*P(C4) = (1/10)^4
Followup: D=pick an ace or king out of deck three times in a row: D=D1 ^ D2 ^ D3
Slide 29
Some practical problems
The specs for the loaded d20 say that it has 20 outcomes, X where
• P(X=20) = 0.25
• P(X=19) = 0.25
• for i=1,…,18, P(X=i)= Z * 1/18
• What is Z?
Slide 30
• Suppose A can take on more than 2 values
• A is a random variable with arity k if it can take on exactly one value out of {v
1
,v
2
, .. v k
}
• Thus…
P ( A
v i
A
v j
)
0 if i
j
P ( A
v
1
A
v
2
A
v k
)
1
Slide 31
More about Multivalued Random Variables
• Using the axioms of probability…
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)
• And assuming that A obeys…
P
P (
( A
A
v v i
1
A
A
v v
2 j
)
j
1
• It’s easy to prove that
P ( A
v
1
A
v
2
A
v i
)
j i
1
P ( A
v j
)
Slide 32
More about Multivalued Random Variables
• Using the axioms of probabilityand assuming that A obeys…
P
P
(
(
A
A
v v i
1
A
A
v v
2 j
)
j
1
• It’s easy to prove that
P ( A
v
1
A
v
2
A
v i
)
j i
1
P ( A
v j
)
• And thus we can prove j k
1
P ( A
v j
)
1
Slide 33
j k
1
P ( A
v j
)
1
A=2
A=3
A=5
A=4
A=1
Slide 34
Some practical problems
The specs for the loaded d20 say that it has 20 outcomes, X
• P(X=20) = P(X=19) = 0.25
• for i=1,…,18, P(X=i)= Z * 1/18 and what is Z?
j k
1
P ( A
v j
)
1
1
j
18
1
P ( A
v j
)
0 .
25
0 .
25
18
Z
1
18
0 .
5
Slide 35
• You (probably) have 8 neighbors and 5 close neighbors.
• What is Pr(A), A=one or more of your neighbors has the same sign as you?
• What’s the experiment?
• What is Pr(B), B=you and your close neighbors all have different signs?
• What about neighbors?
n c n c * c n c n
Moral: ?
Slide 36
Some practical problems
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
Frequency
6
5
4
3
P(X=20) = P(X=19) = 0.25
for i=1,…,18, P(X=i)= 0.5 * 1/18
2
1
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Face Shown
Slide 37
Some practical problems
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A =d20 picked is fair and B =roll 19 or 20 with that die. What is P( B )?
P( B ) = P( B and A ) + P( B and ~A ) = 0.1*0.75 + 0.5*0.25 = 0.2
using Andrew’s “important theorem” P(A) = P(A ^ B) + P(A ^ ~B)
Slide 38
• P(A) = P(A ^ B) + P(A ^ ~B)
A ^ B
B
A ^ ~B
Followup:
What if I change the ratio of fair to loaded die in the experiment?
~B
Slide 39
Some practical problems
• I have lots of standard d20 die, lots of loaded die, all identical.
• Experiment is the same: (1) pick a d20 uniformly at random then (2) roll it. Can I mix the dice together so that P(B)=0.137 ?
P(B) = P(B and A ) + P(B and ~A ) = 0.1* λ + 0.5* (1- λ) = 0.137
“mixture model” λ = (0.5 - 0.137)/0.4 = 0.9075
Slide 40
It’s more convenient to say
• “if you’ve picked a fair die then …” i.e. Pr(critical hit|fair die)=0.1
• “if you’ve picked the loaded die then….” Pr(critical hit|loaded die)=0.5
A (fair die) ~A (loaded)
Conditional probability:
Pr(B|A) = P(B^A)/P(A)
A and B
~A and B
P(B|A) P(B|~A)
Slide 41
Definition of Conditional Probability
P(A ^ B)
P(A|B) = -----------
P(B)
Corollary: The Chain Rule
P(A ^ B) = P(A|B) P(B)
Slide 42
Some practical problems
“marginalizing out” A
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A =d20 picked is fair and B =roll 19 or 20 with that die. What is P( B )?
P( B ) = P( B | A ) P(A) + P( B | ~A ) P(~A) = 0.1
* 0.75
+ 0.5
* 0.25
= 0.2
Slide 43
P(B) = P(B|A)P(A) + P(B|~A)P(~A)
A (fair die) ~A (loaded)
P(A) P(~A)
A and B
~A and B
P(B|A) P(B|~A)
Slide 44
Some practical problems
• I have 3 standard d20 dice, 1 loaded die.
• Experiment: (1) pick a d20 uniformly at random then (2) roll it. Let A=d20 picked is fair and B=roll 19 or 20 with that die.
• Suppose B happens (e.g., I roll a 20). What is the chance the die I rolled is fair? i.e. what is P(A|B) ?
Slide 45
P(A|B) = ?
P(A and B) = P(A|B) * P(B)
P(A and B) = P(B|A) * P(A)
P(A|B) * P(B) = P(B|A) * P(A)
A (fair die) ~A (loaded)
P(A|B) =
P(B|A) * P(A)
P(B)
P(A)
A (fair die)
A and B
~A (loaded)
P(B)
~A and B
P(~A)
A and B
~A and B
P(B|A) P(B|~A)
Slide 46
posterior
P(A|B) =
P(B|A) * P(A) prior
Bayes’ rule
P(B)
P(B|A) =
P(A|B) * P(B)
P(A) Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical
Transactions of the Royal Society of
London, 53:370-418
…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
Slide 48
More General Forms of Bayes Rule
P ( A | B )
P ( B |
P ( B |
A ) P ( A )
A ) P ( A )
P ( B |~ A ) P (~ A )
P ( A | B
X )
P ( B | A
X ) P ( A
P ( B
X )
X )
Slide 49
More General Forms of Bayes Rule
P ( A
v i
| B )
P ( B | k n A
1
P ( B |
A
v i
) P ( A
v i
)
A
v k
) P ( A
v k
)
Slide 50
P ( A | B )
P (
A | B )
1 k n A
1
P ( A
v k
| B )
1
Slide 51
• An Intuitive Explanation of Bayesian Reasoning : Bayes'
Theorem for the curious and bewildered; an excruciatingly gentle introduction Eliezer Yudkowsky
• Problem: Suppose that a barrel contains many small plastic eggs. Some eggs are painted red and some are painted blue. 40% of the eggs in the bin contain pearls, and 60% contain nothing. 30% of eggs containing pearls are painted blue, and 10% of eggs containing nothing are painted blue. What is the probability that a blue egg contains a pearl?
Slide 52
Some practical problems
• Joe throws 4 critical hits in a row, is Joe cheating?
• A = Joe using cheater’s die
• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
• B = C1 and C2 and C3 and C4
• Pr(B|A) = 0.0625 P(B|~A)=0.0001
P ( A | B )
P ( B |
P ( B |
A ) P ( A )
A ) P ( A )
P ( B |~ A ) P (~ A )
P ( A | B )
0 .
0625 * P ( A )
0 .
0625 * P ( A )
0 .
0001 * ( 1
P ( A ))
Slide 53
What’s the experiment and outcome here?
• Outcome A: Joe is cheating
• Experiment:
• Joe picked a die uniformly at random from a bag containing 10,000 fair die and one bad one.
• Joe is a D&D player picked uniformly at random from set of 1,000,000 people and cheat with probability p>0.
n of them
• I have no idea, but I don’t like his looks. Call it
P(A)=0.1
Slide 54
• A subjective belief can be treated, mathematically, like a probability
• Use those axioms!
• There have been many many other approaches to understanding “uncertainty”:
• Fuzzy Logic, three-valued logic, Dempster-Shafer, nonmonotonic reasoning, …
• 25 years ago people in AI argued about these; now they mostly don’t
• Any scheme for combining uncertain information, uncertain
“beliefs”, etc,… really should obey these axioms
• If you gamble based on “uncertain beliefs”, then [you can be exploited by an opponent] [your uncertainty formalism violates the axioms] - di Finetti 1931 (the “Dutch book argument”)
Slide 55
Some practical problems
P ( A | B )
P ( B | A ) P ( A )
P ( B )
P ( A | B )
P (
A | B )
P ( B | A ) P ( A ) / P ( B )
P ( B |
A ) P (
A ) / P ( B )
0 .
0625
0 .
0001
P ( A )
P (
A )
6 , 250
P
P ( B
( B |
| A )
A )
P ( A )
P (
A )
P ( A )
P (
A )
• Joe throws 4 critical hits in a row, is Joe cheating?
• A = Joe using cheater’s die
• C = roll 19 or 20; P(C|A)=0.5, P(C|~A)=0.1
• B = C1 and C2 and C3 and C4
• Pr(B|A) = 0.0625 P(B|~A)=0.0001
Moral: with enough evidence the prior P(A) doesn’t really matter.
Slide 56
Some practical problems
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
Frequency
6
5
4
3
2
1
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Face Shown
Slide 57
MLE = maximum likelihood estimate
One solution
0
1
2
3
4
5
6
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
Frequency P(1)=0
P(2)=0
P(3)=0
P(4)=0.1
…
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Face Shown
P(19)=0.25
P(20)=0.2
Slide 58
MAP = maximum a posteriori estimate
A better solution
I bought a loaded d20 on EBay…but it didn’t come with any specs. How can I find out how it behaves?
Frequency
6
5 P(A=i|B) = P(B|A=i)*P(A=i) / P(B)
4
B = observed counts
3
2
P(A=i) = prior probability (subjective)
1
0
… and we’ll need some tricks to make
1 2 3 4 5 6 7 8 this simple to compute
Face Shown
Slide 59
Some practical problems
• I have 1 standard d6 die, 2 loaded d6 die.
• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50
• Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
P(D) = P(D ^ A=HL) + P(D ^ A=HF) + P(D ^ A=FL)
= P(D | A=HL)*P(A=HL) + P(D|A=HF)*P(A=HF) + P(A|A=FL)*P(A=FL)
Slide 65
Some practical problems
• I have 1 standard d6 die, 2 loaded d6 die.
• Loaded high: P(X=6)=0.50 Loaded low: P(X=1)=0.50
• Experiment: pick one d6 uniformly at random (A) and roll it. What is more likely – rolling a seven or rolling doubles?
Three combinations: HL, HF, FL
Roll 1
1 2 3 4 5 6
7
3
4
1 D
2
5
6 7
D
7
D 7
7 D
7
D
D
Slide 66
A
FL
…
HF
…
FL
HL
HL
FL
FL
…
FL
FL
…
A brute-force solution
Roll 1
1
Roll 2
1
P
1/3 * 1/6 * ½
Comment doubles joint probability table shows P(X1=x1 and … and
… …
6 seven
2 1
…
…
6
1
2
… …
• P(total is higher than 5)
1 1
• ….
doubles doubles
Slide 67
Example: Boolean variables A, B, C
Recipe for making a joint distribution of M variables:
Slide 68
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have
2 M rows).
1
1
1
0
1
0
0
A
0
0
1
1
1
0
0
1
0
B
Example: Boolean variables A, B, C
C
0
1
0
1
1
0
1
0
Slide 69
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have
2 M rows).
2. For each combination of values, say how probable it is.
1
1
1
0
1
0
0
A
0
0
1
1
1
0
0
1
0
B
Example: Boolean variables A, B, C
C Prob
0 0.30
1
0
1
1
0
1
0
0.05
0.10
0.05
0.05
0.10
0.25
0.10
Slide 70
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have
2 M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those numbers must sum to 1.
1
1
1
0
1
0
0
A
0
A
0
1
1
1
0
0
1
0
B
Example: Boolean variables A, B, C
C Prob
0 0.30
1
0
1
1
0
1
0
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05
0.25
0.10
0.10
0.05
0.05
C
0.10
0.30
B
Slide 71
One you have the JD you can ask for the probability of any logical expression involving your attribute
P ( E )
P ( row rows matching E
)
Slide 72
P(Poor Male) = 0.4654
P ( E )
P ( row rows matching E
)
Slide 73
P(Poor) = 0.7604
P ( E )
P ( row rows matching E
)
Slide 74
P ( E
1
| E
2
)
P ( E
1
E
2
)
P ( E
2
)
P ( row rows matching
E
1 and E
2
P ( row ) rows matching E
2
)
Slide 75
P ( E
1
| E
2
)
P ( E
1
E
2
)
P ( E
2
)
P ( row rows matching
E
1 and E
2
P ( row ) rows matching E
2
)
P( Male | Poor ) = 0.4654 / 0.7604 = 0.612
Slide 76
• I’ve got this evidence. What’s the chance that this conclusion is true?
• I’ve got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep?
Slide 77