Basic Probability and Bayes

Review of Probability
Axioms of Probability Theory
Pr(A) denotes the probability that proposition A is true (A is also called an event, or a random variable).
1.  0 ≤ Pr(A) ≤ 1
2.  Pr(True) = 1, Pr(False) = 0
3.  Pr(A v B) = Pr(A) + Pr(B) − Pr(A ^ B)
A Closer Look at Axiom 3
Pr(A v B) = Pr(A) + Pr(B) − Pr(A ^ B)
[Venn diagram: events A and B inside the universe True, overlapping in A ^ B]
Using the Axioms to prove new properties
Pr(A v ~A) = Pr(True) = 1
Pr(A v ~A) = Pr(A) + Pr(~A) − Pr(A ^ ~A)
           = Pr(A) + Pr(~A) − Pr(False)
           = Pr(A) + Pr(~A) − 0
Therefore Pr(~A) = 1 − Pr(A). We proved this from the axioms alone.
Probability of Events
• Sample space and events
  – Sample space S (e.g., all people in an area)
  – Events E1 ⊆ S (e.g., all people having cough), E2 ⊆ S (e.g., all people having cold)
• Prior (marginal) probabilities of events
  – P(E) = |E| / |S|  (frequency interpretation; see the sketch after this slide)
  – P(E) = 0.1  (subjective probability)
  – 0 ≤ P(E) ≤ 1 for all events
  – Two special events, ∅ and S: P(∅) = 0 and P(S) = 1.0
• Boolean operators between events (to form compound events)
  – Conjunctive (intersection): E1 ^ E2  (E1 ∩ E2)
  – Disjunctive (union): E1 v E2  (E1 ∪ E2)
  – Negation (complement): ~E  (E^c = S − E)
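A minimal Python sketch of these definitions, with a made-up sample space of people and made-up cough/cold events (all sizes are illustrative):

```python
from fractions import Fraction

# Hypothetical sample space S: person IDs in an area (made-up size of 10).
S = set(range(10))
cough = {0, 1, 2, 3}        # E1: people having cough (illustrative)
cold  = {2, 3, 4}           # E2: people having cold  (illustrative)

def P(E):
    """Frequency interpretation: P(E) = |E| / |S|."""
    return Fraction(len(E), len(S))

print(P(cough))                       # 2/5
print(P(S - cough))                   # complement: 1 - P(cough) = 3/5
print(P(cough | cold))                # union: P(E1) + P(E2) - P(E1 ^ E2) = 1/2
assert P(cough | cold) == P(cough) + P(cold) - P(cough & cold)
assert P(set()) == 0 and P(S) == 1    # the two special events
```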
• Probabilities of compound events
  – P(~E) = 1 − P(E), because P(~E) + P(E) = 1
  – P(E1 v E2) = P(E1) + P(E2) − P(E1 ^ E2)
  – But how to compute the joint probability P(E1 ^ E2)?
[Venn diagrams: E and its complement ~E; overlapping events E1 and E2 with intersection E1 ^ E2]
• Conditional probability (of E1, given E2)
  – How likely E1 occurs within the subspace of E2:
    P(E1 | E2) = |E1 ∩ E2| / |E2| = (|E1 ∩ E2| / |S|) / (|E2| / |S|) = P(E1 ∩ E2) / P(E2)
    P(E1 ∩ E2) = P(E1 | E2) P(E2)
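Continuing the same made-up cough/cold events, a sketch of conditional probability computed both from counts and as a ratio of probabilities:

```python
from fractions import Fraction

S = set(range(10))
cough = {0, 1, 2, 3}        # illustrative event E1
cold  = {2, 3, 4}           # illustrative event E2

def P(E):
    return Fraction(len(E), len(S))

# P(E1 | E2): how likely E1 is inside the subspace E2.
p_cough_given_cold = Fraction(len(cough & cold), len(cold))   # |E1 ^ E2| / |E2|
assert p_cough_given_cold == P(cough & cold) / P(cold)        # = P(E1 ^ E2) / P(E2)

# Product rule: P(E1 ^ E2) = P(E1 | E2) * P(E2)
assert P(cough & cold) == p_cough_given_cold * P(cold)
print(p_cough_given_cold)   # 2/3
```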
Using Venn diagrams and decision trees is very useful in proofs and reasoning.
Independence, Mutual Exclusion and Exhaustive sets of events
• Independence assumption
  – Two events E1 and E2 are said to be independent of each other if
    P(E1 | E2) = P(E1)  (given E2 does not change the likelihood of E1)
  – It can simplify the computation (illustrated in the sketch below):
    P(E1 ∩ E2) = P(E1 | E2) P(E2) = P(E1) P(E2)
    P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)
               = P(E1) + P(E2) − P(E1) P(E2)
               = 1 − (1 − P(E1))(1 − P(E2))
• Mutually exclusive (ME) and exhaustive (EXH) set of events
  – ME:  Ei ∩ Ej = ∅  (P(Ei ∩ Ej) = 0),  i, j = 1, ..., n,  i ≠ j
  – EXH: E1 ∪ ... ∪ En = S  (P(E1 ∪ ... ∪ En) = 1)
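A short sketch (made-up numbers) of what the independence assumption buys, and of the ME/EXH conditions:

```python
# Independence: knowing E2 does not change the likelihood of E1.
p_e1, p_e2 = 0.3, 0.5                       # illustrative marginals
p_and = p_e1 * p_e2                         # P(E1 ^ E2) = P(E1) P(E2) under independence
p_or  = p_e1 + p_e2 - p_and                 # P(E1 v E2)
assert abs(p_or - (1 - (1 - p_e1) * (1 - p_e2))) < 1e-12

# ME & EXH set of events: pairwise disjoint and covering the sample space.
S = set(range(10))
partition = [{0, 1, 2}, {3, 4, 5, 6}, {7, 8, 9}]     # illustrative events E1..E3
assert all(partition[i].isdisjoint(partition[j])
           for i in range(3) for j in range(3) if i != j)    # ME
assert set().union(*partition) == S                          # EXH
```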
Random Variables
Discrete Random Variables
• X denotes a random variable.
• X can take on a finite number of values in the set {x1, x2, …, xn}.
• P(X = xi), or P(xi), is the probability that the random variable X takes on the value xi.
• P(·) is called the probability mass function.
• E.g.  P(Room) = ⟨0.7, 0.2, 0.08, 0.02⟩
  These are the probabilities of the four possible values of X; they must sum to 1.0.
Discrete Random Variables
• Finite set of possible outcomes: X ∈ {x1, x2, x3, ..., xn}
  P(xi) ≥ 0,   Σ_{i=1}^{n} P(xi) = 1
• X binary:  P(x) + P(~x) = 1
[Bar chart: an example probability mass function over four values X1–X4]
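A sketch of a probability mass function as a plain Python dictionary; the value names are made up, only the four probabilities come from the Room example above:

```python
# Illustrative PMF over four hypothetical values of the random variable Room.
P_room = {"living_room": 0.7, "kitchen": 0.2, "bedroom": 0.08, "bathroom": 0.02}

assert all(p >= 0 for p in P_room.values())          # P(x_i) >= 0
assert abs(sum(P_room.values()) - 1.0) < 1e-12       # sum_i P(x_i) = 1

# Binary special case: P(x) + P(~x) = 1
p_x = 0.7
p_not_x = 1 - p_x
```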
Continuous Random Variable
• Probability distribution (density function) over continuous values
  X ∈ [0, 10],   P(x) ≥ 0,   ∫_0^10 P(x) dx = 1
  P(5 ≤ x ≤ 7) = ∫_5^7 P(x) dx
[Plot: density P(x) over x, with the area between x = 5 and x = 7 shaded]
Continuous Random Variables
• X takes on values in the continuum.
• p(X = x), or p(x), is a probability density function (PDF):
  Pr(x ∈ [a, b]) = ∫_a^b p(x) dx
• E.g.  [plot of a density p(x) over x]
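A numeric sketch of these properties for an illustrative (uniform) density on [0, 10], using simple midpoint integration:

```python
# Illustrative density on [0, 10]: uniform, p(x) = 1/10.
def p(x):
    return 0.1 if 0 <= x <= 10 else 0.0

def integrate(f, a, b, n=100000):
    """Simple midpoint-rule integration (a sketch, not a library routine)."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

assert abs(integrate(p, 0, 10) - 1.0) < 1e-6    # total probability is 1
print(integrate(p, 5, 7))                       # P(5 <= x <= 7) = 0.2
```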
Probability Distribution
• Probability distribution P(X | x)
  – X is a random variable (discrete or continuous)
  – x is the background state of information
Joint and Conditional Probabilities
• Joint:  P(x, y) = P(X = x and Y = y)
  – Probability that both X = x and Y = y
• Conditional:  P(x | y) = P(X = x | Y = y)
  – Probability that X = x, given that we know Y = y
Joint and Conditional Probability
• P(X = x and Y = y) = P(x, y)
• If X and Y are independent, then P(x, y) = P(x) P(y)
• P(x | y) is the probability of x given y:
  P(x | y) = P(x, y) / P(y)
  P(x, y) = P(x | y) P(y)
• If X and Y are independent, then P(x | y) = P(x)
Law of Total Probability
Discrete case:
  Σ_x P(x) = 1
  P(x) = Σ_y P(x, y)
  P(x) = Σ_y P(x | y) P(y)
Continuous case:
  ∫ p(x) dx = 1
  p(x) = ∫ p(x, y) dy
  p(x) = ∫ p(x | y) p(y) dy
Rules of Probability: Marginalization
• Product rule:  P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)
• Marginalization:  P(Y) = Σ_{i=1}^{n} P(Y, xi)
• X binary:  P(Y) = P(Y, x) + P(Y, ~x)
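A sketch of the product rule, marginalization, and the law of total probability on a small made-up joint table for binary X and Y:

```python
# Illustrative joint distribution P(X, Y) over binary X and Y.
P_xy = {("x", "y"): 0.2, ("x", "not_y"): 0.3,
        ("not_x", "y"): 0.1, ("not_x", "not_y"): 0.4}

# Marginalization: P(X) = sum over y of P(X, y); likewise for P(Y).
P_x = {x: sum(p for (xi, _), p in P_xy.items() if xi == x) for x in ("x", "not_x")}
P_y = {y: sum(p for (_, yi), p in P_xy.items() if yi == y) for y in ("y", "not_y")}

# Conditional probability: P(x | y) = P(x, y) / P(y)
P_x_given = {(x, y): P_xy[(x, y)] / P_y[y] for (x, y) in P_xy}

# Product rule: P(x, y) = P(x | y) P(y)
assert all(abs(P_xy[(x, y)] - P_x_given[(x, y)] * P_y[y]) < 1e-12 for (x, y) in P_xy)

# Law of total probability: P(x) = sum over y of P(x | y) P(y)
total = sum(P_x_given[("x", y)] * P_y[y] for y in ("y", "not_y"))
assert abs(total - P_x["x"]) < 1e-12
```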
Gaussian, Mean and Variance
Gaussian (normal) distributions N(m, s), with mean m and variance s:
  P(x) = 1 / √(2π s) · exp( −(x − m)² / (2s) )
[Plots: N(m, s) curves with different means and with different variances]
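A sketch of the density above (reading N(m, s) with m the mean and s the variance); shifting m moves the peak, increasing s flattens it:

```python
import math

def gaussian_pdf(x, m, s):
    """Density of N(m, s), where m is the mean and s the variance."""
    return math.exp(-(x - m) ** 2 / (2 * s)) / math.sqrt(2 * math.pi * s)

# Different mean shifts the peak; different variance widens/narrows it.
print(gaussian_pdf(0.0, m=0.0, s=1.0))    # ~0.3989, peak of the standard normal
print(gaussian_pdf(0.0, m=2.0, s=1.0))    # smaller: peak has moved to x = 2
print(gaussian_pdf(0.0, m=0.0, s=4.0))    # ~0.1995: larger variance, flatter peak
```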
Gaussian networks
Each variable is a linear function of its parents, with Gaussian noise.
Joint probability density functions:
[Figures: two-node networks over X and Y with their joint densities]
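A minimal sampling sketch of the linear-Gaussian idea, with made-up coefficients and noise variance:

```python
import random

# Parent: X ~ N(0, 1).  Child: Y = a*X + b + noise, noise ~ N(0, s_y).
a, b, s_y = 2.0, 1.0, 0.25          # illustrative parameters

def sample_xy():
    x = random.gauss(0.0, 1.0)                      # gauss takes (mean, std dev)
    y = a * x + b + random.gauss(0.0, s_y ** 0.5)
    return x, y

samples = [sample_xy() for _ in range(5)]
print(samples)   # each (x, y) pair is drawn from the joint density p(x, y) = p(x) p(y | x)
```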
Reverend Thomas Bayes
(1702-1761)
Clergyman and mathematician who first used probability inductively. His research established a mathematical basis for probabilistic inference.
Bayes Rule
  P(H, E) = P(H | E) P(E) = P(E | H) P(H)
  P(H | E) = P(E | H) P(H) / P(E)
Example: E = smoke, H = cancer (all people = 1000)
  – 100 people smoke; 1000 − 100 = 900 people do not smoke
  – 40 people have cancer; 1000 − 40 = 960 people do not have cancer
  – 10 people both smoke and have cancer
  P(smoke) = 100/1000,  P(cancer) = 40/1000
  P(smoke | cancer) = 10/40 = 25%  (probability that you smoke if you have cancer)
  P(cancer | smoke) = P(smoke | cancer) · P(cancer) / P(smoke)
                    = (10/40) · (40/1000) / (100/1000) = (10/1000) / (100/1000) = 10/100 = 10%
  (probability that you have cancer if you smoke)
[Venn diagram: the 1000 people, with the 100 smokers and the 40 cancer patients overlapping in the 10 who smoke and have cancer]
Similarly, for non-smokers:
  P(cancer | not smoke) = P(not smoke | cancer) · P(cancer) / P(not smoke)
                        = (30/40) · (40/1000) / (900/1000) = (30/1000) / (900/1000) = 30/900 ≈ 3.3%
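The same smoke/cancer numbers checked directly in Python:

```python
total = 1000
smokers, cancer, both = 100, 40, 10             # counts from the example

p_smoke = smokers / total                       # 0.10
p_cancer = cancer / total                       # 0.04
p_smoke_given_cancer = both / cancer            # 10/40 = 0.25

# Bayes rule: P(cancer | smoke) = P(smoke | cancer) P(cancer) / P(smoke)
p_cancer_given_smoke = p_smoke_given_cancer * p_cancer / p_smoke
print(p_cancer_given_smoke)                     # 0.10  (i.e. 10/100)

# P(cancer | not smoke) = P(not smoke | cancer) P(cancer) / P(not smoke)
p_not_smoke_given_cancer = (cancer - both) / cancer        # 30/40
p_cancer_given_not_smoke = p_not_smoke_given_cancer * p_cancer / (1 - p_smoke)
print(p_cancer_given_not_smoke)                 # 30/900, about 0.033
```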
Bayes’ Theorem with relative likelihood
• In the setting of diagnostic/evidential reasoning:
  hypotheses H_1, ..., H_n with priors P(H_i); evidence/manifestations E_1, ..., E_j, ..., E_m with conditional probabilities P(E_j | H_i)
  – Known: the prior probability P(H_i) and the conditional probability P(E_j | H_i)
  – Wanted: the posterior probability P(H_i | E_j)
• Bayes’ theorem (formula 1):  P(H_i | E_j) = P(H_i) P(E_j | H_i) / P(E_j)
• If the purpose is only to find which of the n hypotheses H_1, ..., H_n is more plausible given E_j, then we can ignore the denominator and rank them by relative likelihood:
  rel(H_i | E_j) = P(E_j | H_i) P(H_i)
Relative likelihood
• P(E_j) can be computed from P(E_j | H_i) and P(H_i) if we assume all hypotheses H_1, ..., H_n are ME and EXH:
  P(E_j) = P(E_j ∩ (H_1 ∪ ... ∪ H_n))          (by EXH)
         = Σ_{i=1}^{n} P(E_j ∩ H_i)             (by ME)
         = Σ_{i=1}^{n} P(E_j | H_i) P(H_i)
• Then we have another version of Bayes’ theorem (see the sketch below):
  P(H_i | E_j) = P(E_j | H_i) P(H_i) / Σ_{k=1}^{n} P(E_j | H_k) P(H_k) = rel(H_i | E_j) / Σ_{k=1}^{n} rel(H_k | E_j)
  where Σ_{k=1}^{n} P(E_j | H_k) P(H_k), the sum of the relative likelihoods of all n hypotheses, is a normalization factor.
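A sketch of ranking hypotheses by relative likelihood and recovering the posteriors by normalization; the three hypotheses, priors, and likelihoods below are made up:

```python
# Illustrative priors P(H_i) and likelihoods P(E | H_i) for three ME & EXH hypotheses.
prior = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihood = {"H1": 0.1, "H2": 0.6, "H3": 0.3}      # P(E_j | H_i)

# Relative likelihood rel(H_i | E_j) = P(E_j | H_i) P(H_i) is enough for ranking.
rel = {h: likelihood[h] * prior[h] for h in prior}
best = max(rel, key=rel.get)                        # most plausible hypothesis

# Formula 2: divide by the normalization factor sum_k P(E_j | H_k) P(H_k) = P(E_j).
p_e = sum(rel.values())
posterior = {h: rel[h] / p_e for h in rel}
print(best, posterior)                              # posteriors sum to 1
```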
Naïve Bayesian Approach
• Knowledge base:
  – E_1, ..., E_m: evidence/manifestations
  – H_1, ..., H_n: hypotheses/disorders
  – E_j and H_i are binary, and the hypotheses form a ME & EXH set
  – P(E_j | H_i), i = 1, ..., n, j = 1, ..., m: conditional probabilities
• Case input: E_1, ..., E_l
• Find the hypothesis H_i with the highest posterior probability P(H_i | E_1, ..., E_l)
• By Bayes’ theorem:
  P(H_i | E_1, ..., E_l) = P(E_1, ..., E_l | H_i) P(H_i) / P(E_1, ..., E_l)
• Assume all pieces of evidence are conditionally independent, given any hypothesis:
  P(E_1, ..., E_l | H_i) = Π_{j=1}^{l} P(E_j | H_i)
Absolute Posterior Probability
• The relative likelihood:
  rel(H_i | E_1, ..., E_l) = P(E_1, ..., E_l | H_i) P(H_i) = P(H_i) Π_{j=1}^{l} P(E_j | H_i)
• The absolute posterior probability:
  P(H_i | E_1, ..., E_l) = rel(H_i | E_1, ..., E_l) / Σ_{k=1}^{n} rel(H_k | E_1, ..., E_l)
                         = P(H_i) Π_{j=1}^{l} P(E_j | H_i) / Σ_{k=1}^{n} P(H_k) Π_{j=1}^{l} P(E_j | H_k)
• Evidence accumulation (when new evidence is discovered; see the sketch below):
  rel(H_i | E_1, ..., E_l, E_{l+1}) = P(E_{l+1} | H_i) · rel(H_i | E_1, ..., E_l)
  rel(H_i | E_1, ..., E_l, ~E_{l+1}) = (1 − P(E_{l+1} | H_i)) · rel(H_i | E_1, ..., E_l)
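A sketch of the naïve Bayesian computation with conditionally independent evidence and incremental evidence accumulation; all numbers below are made up:

```python
# Two ME & EXH hypotheses with priors, and per-evidence conditional probabilities.
prior = {"H1": 0.6, "H2": 0.4}
p_e_given_h = {                     # P(E_j | H_i), illustrative values
    "E1": {"H1": 0.8, "H2": 0.3},
    "E2": {"H1": 0.5, "H2": 0.9},
}

def posteriors(present, absent=()):
    """rel(H_i | evidence) = P(H_i) * prod_j P(E_j | H_i), then normalize."""
    rel = dict(prior)
    for e in present:                       # observed evidence E_j
        for h in rel:
            rel[h] *= p_e_given_h[e][h]
    for e in absent:                        # observed absence ~E_j
        for h in rel:
            rel[h] *= 1 - p_e_given_h[e][h]
    z = sum(rel.values())                   # normalization factor
    return {h: r / z for h, r in rel.items()}

print(posteriors(present=["E1"]))                   # after observing E1
print(posteriors(present=["E1", "E2"]))             # accumulate E2
print(posteriors(present=["E1"], absent=["E2"]))    # accumulate ~E2 instead
```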
Bayesian Networks and Markov Models – applications in robotics
• Bayesian AI
• Bayesian Filters
• Kalman Filters
• Particle Filters
• Bayesian networks
• Decision networks
• Reasoning about changes over time
  – Dynamic Bayesian Networks
  – Markov models