Lecture 2: Probabilistic reasoning with hypotheses
In an inferential setup we may work with propositions or hypotheses.

A hypothesis is a central component in the building of science, and forensic science is no exception. Successive falsification of hypotheses (cf. Popper) is an important component of crime investigation.

The "standard situation" would be that we have two hypotheses:

H0: The forwarded hypothesis
H1: The alternative hypothesis

These must be mutually exclusive.
Classical statistical hypothesis testing (Neyman J. and Pearson E.S., 1933)

The two hypotheses are different explanations of the Data. Each hypothesis provides model(s) for Data. The purpose is to use Data to try to falsify H0.

Type-I-error: Falsifying a true H0
Type-II-error: Not falsifying a false H0

Size or significance level: α = Pr(Type-I-error)
Power: 1 − Pr(Type-II-error) = 1 − β

If each hypothesis provides one and only one model for Data, the hypotheses are referred to as simple.

Most powerful test for simple hypotheses (Neyman-Pearson lemma): Reject (falsify) H0 when

\[
\frac{L(H_1 \mid \mathrm{Data})}{L(H_0 \mid \mathrm{Data})} \ge A
\]

where A > 0 is chosen so that

\[
P\!\left( \frac{L(H_1 \mid \mathrm{Data})}{L(H_0 \mid \mathrm{Data})} \ge A \;\middle|\; H_0 \right) = \alpha
\]

This test minimises β for fixed α. Note that the probability is taken with respect to Data, i.e. with respect to the probability model each hypothesis provides for Data.

Extension to composite hypotheses: uniformly most powerful test (UMP).
Example: A seizure of pills, suspected to be Ecstasy, is sampled for the purpose of investigating whether the proportion of Ecstasy pills is "around" 80% or "around" 50%. In a sample of 50 pills, 39 proved to be Ecstasy pills.

As the forwarded hypothesis we can formulate

H0: Around 80% of the pills in the seizure are Ecstasy

and as the alternative hypothesis

H1: Around 50% of the pills in the seizure are Ecstasy

The likelihoods of the two hypotheses are

L(H0 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 80%.
L(H1 | Data) = Probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 50%.

Assuming a large seizure, these probabilities can be calculated using a binomial model Bin(50, p), where H0 states that p = p0 = 0.8 and H1 states that p = p1 = 0.5.

In generic form, if we have obtained x Ecstasy pills out of n sampled:

\[
L(H_0 \mid \mathrm{Data}) = L(H_0 \mid x, (n)) = \binom{n}{x} \, p_0^{x} \, (1 - p_0)^{n-x}
\]
\[
L(H_1 \mid \mathrm{Data}) = L(H_1 \mid x, (n)) = \binom{n}{x} \, p_1^{x} \, (1 - p_1)^{n-x}
\]
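As a minimal sketch (in R, which is also used later in the lecture), these two binomial likelihoods can be evaluated directly with dbinom:

# Binomial likelihoods for the two simple hypotheses
# (x = 39 Ecstasy pills out of n = 50 sampled, as in the example).
x <- 39; n <- 50
p0 <- 0.8                             # proportion stated by H0
p1 <- 0.5                             # proportion stated by H1
L0 <- dbinom(x, size = n, prob = p0)  # L(H0 | Data)
L1 <- dbinom(x, size = n, prob = p1)  # L(H1 | Data)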
The Neyman-Pearson lemma now states that the most powerful test is of the form

\[
\frac{L(H_1 \mid \mathrm{Data})}{L(H_0 \mid \mathrm{Data})} \ge A
\;\Rightarrow\;
\frac{p_1^{x} (1 - p_1)^{n-x}}{p_0^{x} (1 - p_0)^{n-x}}
= \left(\frac{p_1}{p_0}\right)^{x} \left(\frac{1 - p_1}{1 - p_0}\right)^{n-x} \ge A
\]

\[
\Leftrightarrow\;
x \ln\!\left(\frac{p_1}{p_0}\right) + (n - x) \ln\!\left(\frac{1 - p_1}{1 - p_0}\right) \ge \ln A
\]

\[
\Leftrightarrow\;
x \le \frac{\ln A - n \ln\!\left(\frac{1 - p_1}{1 - p_0}\right)}{\ln\!\left(\frac{p_1}{p_0}\right) - \ln\!\left(\frac{1 - p_1}{1 - p_0}\right)} = C(n)
\]

since

\[
p_1 < p_0 \;\Rightarrow\; \ln\!\left(\frac{p_1}{p_0}\right) - \ln\!\left(\frac{1 - p_1}{1 - p_0}\right) < 0
\]

Hence, H0 should be rejected in favour of H1 as soon as x ≤ C.

How to choose C? Normally, we would set the significance level α and then find C so that

\[
P(X \le C \mid H_0) \le \alpha
\]

If α is chosen to be 0.05 we can search the binomial distribution valid under H0 for a value C such that

\[
\sum_{k=0}^{C} P(X = k \mid H_0) \le 0.05
\;\Rightarrow\;
\sum_{k=0}^{C} \binom{50}{k} \cdot 0.8^{k} \cdot 0.2^{50-k} \le 0.05
\]

MS Excel:
BINOM.INV(50;0.8;0.05) returns the lowest value k for which the cumulative sum is at least 0.05: 35
BINOM.DIST(35;50;0.8;TRUE) = 0.06072208
BINOM.DIST(34;50;0.8;TRUE) = 0.030803423

Choose C = 34. Since x = 39 we cannot reject H0.
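The same search can be sketched in R instead of MS Excel (qbinom is the counterpart of BINOM.INV: it returns the smallest k whose cumulative probability under H0 is at least 0.05):

alpha <- 0.05
k <- qbinom(alpha, size = 50, prob = 0.8)  # smallest k with P(X <= k | H0) >= 0.05, here 35
pbinom(k, 50, 0.8)                         # about 0.0607, exceeds alpha
pbinom(k - 1, 50, 0.8)                     # about 0.0308, below alpha
C <- k - 1                                 # critical value C = 34
pbinom(C, 50, 0.5)                         # power of the test: P(X <= C | H1 true)

Since the observed x = 39 exceeds C, H0 is not rejected, in agreement with the Excel calculation.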
Drawbacks with the classical approach

• Data alone "decides". Small amounts of data → low power.
• Difficulties in interpretation. When H0 is rejected, it means: "If we repeat the collection of data under (in principle) identical circumstances, then
\[
\frac{L(H_1 \mid \mathrm{Data})}{L(H_0 \mid \mathrm{Data})} \ge A
\]
in (at most) 100α % of all cases when H0 is true." Can we (always) repeat the collection of data?
• "Falling off the cliff" – What is the difference between "just rejecting" and "almost rejecting"?
• "Isolated" falsification (or no falsification) – Tests using other data but with the same hypotheses cannot be easily combined.

The Bayesian Approach

There is always a process that leads to the formulation of the hypotheses. There exists a prior probability for each of them:

\[
p_0 = P(H_0 \mid I) = P(H_0), \qquad p_1 = P(H_1 \mid I) = P(H_1), \qquad p_0 + p_1 = 1
\]

Expressed more simply as prior odds for the hypothesis H0:

\[
\mathrm{Odds}(H_0 \mid I) = \frac{p_0}{p_1} = \frac{P(H_0 \mid I)}{P(H_1 \mid I)}
\]

Non-informative priors: p0 = p1 = 0.5 gives prior odds = 1.

How can we obtain the posterior odds? Data should help us calculate the posterior odds

\[
\mathrm{Odds}(H_0 \mid \mathrm{Data}, I) = \frac{P(H_0 \mid \mathrm{Data}, I)}{P(H_1 \mid \mathrm{Data}, I)} = \frac{q_0}{q_1}
\;\Rightarrow\;
q_0 = P(H_0 \mid \mathrm{Data}, I) = \frac{\mathrm{Odds}(H_0 \mid \mathrm{Data}, I)}{\mathrm{Odds}(H_0 \mid \mathrm{Data}, I) + 1}
\]

The odds ratio (posterior odds/prior odds) is known as the Bayes factor:

\[
B = \frac{\mathrm{Odds}(H_0 \mid \mathrm{Data}, I)}{\mathrm{Odds}(H_0 \mid I)}
= \frac{P(H_0 \mid \mathrm{Data}, I) \, / \, P(H_1 \mid \mathrm{Data}, I)}{P(H_0 \mid I) \, / \, P(H_1 \mid I)}
\]

Hence, if we know the Bayes factor, we can calculate the posterior odds (since we can always set the prior odds):

\[
\mathrm{Odds}(H_0 \mid \mathrm{Data}, I) = B \cdot \mathrm{Odds}(H_0 \mid I)
\]

The "hypothesis testing" is then a judgement upon whether q0 is
• small enough to make us believe in H1, or
• large enough to make us believe in H0,
i.e. no pre-setting of the decision direction is made.
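As a minimal sketch of this odds arithmetic in R (the Bayes factor value used here is an illustrative assumption, not a number from the lecture):

prior_odds <- 1                     # non-informative priors p0 = p1 = 0.5
B <- 10                             # illustrative Bayes factor (assumed value)
post_odds <- B * prior_odds         # Odds(H0 | Data, I) = B * Odds(H0 | I)
q0 <- post_odds / (post_odds + 1)   # posterior probability of H0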
1. Both hypotheses are simple, i.e. give one and only one model each for Data

a) Distinct probabilities can be assigned to Data

Bayes' theorem on odds form then gives

\[
\frac{P(H_0 \mid \mathrm{Data}, I)}{P(H_1 \mid \mathrm{Data}, I)}
= \frac{P(\mathrm{Data} \mid H_0, I)}{P(\mathrm{Data} \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)}
\]

Hence, the Bayes factor is

\[
B = \frac{P(\mathrm{Data} \mid H_0, I)}{P(\mathrm{Data} \mid H_1, I)}
\]

The probabilities in the numerator and denominator can be calculated (estimated) using the model provided by the respective hypothesis.

b) Data is the observed value x of a continuous (possibly multidimensional) random variable

It can be shown that

\[
\frac{P(H_0 \mid \mathrm{Data}, I)}{P(H_1 \mid \mathrm{Data}, I)}
= \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)}
\]

where f(x | H0, I) and f(x | H1, I) are the probability density functions given by the models specified by H0 and H1. Hence, the Bayes factor is

\[
B = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)}
\]

Known (or estimated) density functions under each model can then be used to calculate the Bayes factor.

In both cases we can see that the Bayes factor is a likelihood ratio, since the numerator and denominator are likelihoods for the respective hypotheses:

\[
B = \frac{L(H_0 \mid \mathrm{Data}, I)}{L(H_1 \mid \mathrm{Data}, I)}
\]

Example: Ecstasy pills revisited

H0: Around 80% of the pills in the seizure are Ecstasy
H1: Around 50% of the pills in the seizure are Ecstasy

The likelihoods for the hypotheses are

\[
L(H_0 \mid \mathrm{Data}) = \binom{50}{39} \cdot 0.8^{39} \cdot 0.2^{11} \approx 0.1271082
\]
\[
L(H_1 \mid \mathrm{Data}) = \binom{50}{39} \cdot 0.5^{39} \cdot 0.5^{11} \approx 3.317678 \times 10^{-5}
\]

\[
\Rightarrow B \approx \frac{0.1271082}{3.317678 \times 10^{-5}} \approx 3831
\]

Hence, Data are 3831 times more probable if H0 is true compared to if H1 is true.

Assume we have no particular belief in either of the two hypotheses prior to obtaining the data:

\[
\Rightarrow \mathrm{Odds}(H_0) = 1
\;\Rightarrow\; \mathrm{Odds}(H_0 \mid \mathrm{Data}) \approx 3831 \cdot 1
\;\Rightarrow\; P(H_0 \mid \mathrm{Data}) = \frac{3831}{3831 + 1} \approx 0.9997
\]

Hence, upon the analysis of data we can be 99.97% certain that H0 is true.

Note however that it may be unrealistic to assume only two possible proportions of Ecstasy pills in the seizure!
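The whole calculation can be reproduced with a short R sketch (a check of the figures above, not part of the original slides):

x <- 39; n <- 50
B <- dbinom(x, n, 0.8) / dbinom(x, n, 0.5)  # Bayes factor, about 3831
post_odds <- B * 1                          # prior odds = 1
P_H0 <- post_odds / (post_odds + 1)         # about 0.9997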
2. The hypothesis H0 is simple but the hypothesis H1 is composite, i.e. it provides several models for Data (several explanations)

The various models of H1 would (in general) provide different likelihoods for the different explanations → we cannot come up with one unique likelihood for H1. If, in addition, the different explanations have different prior probabilities, we have to weigh the different likelihoods with these.

If the composition in H1 is in the form of a set of discrete alternatives, the Bayes factor can be written

\[
B = \frac{L(H_0 \mid \mathrm{Data})}{\sum_i L(H_{1i} \mid \mathrm{Data}) \cdot P(H_{1i} \mid H_1)}
\]

where P(H1i | H1) is the conditional prior probability that H1i is true given that H1 is true (the relative prior), and the sum is over all alternatives H11, H12, …

If the relative priors are (fairly) equal, the denominator reduces to the average likelihood of the alternatives. If the likelihoods of the alternatives are equal, the denominator reduces to that likelihood, since the relative priors sum to one.

If the composition is defined by a continuously valued parameter θ, we must use the conditional prior density of θ given that H1 is true, p(θ | H1), and integrate the likelihood with respect to that density. The Bayes factor can then be written

\[
B = \frac{L(H_0 \mid \mathrm{Data})}{\int_{"\theta \in H_1"} L(\theta \mid \mathrm{Data}) \cdot p(\theta \mid H_1) \, d\theta}
\]
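A minimal sketch of the discrete version of this Bayes factor in R (the likelihood values under each alternative and the relative priors here are illustrative assumptions):

L_H0      <- 0.12                    # likelihood under the simple H0 (assumed)
L_H1_alt  <- c(0.03, 0.001, 0.0005)  # likelihoods under alternatives H11, H12, H13 (assumed)
rel_prior <- c(0.5, 0.3, 0.2)        # relative priors P(H1i | H1), summing to one
B <- L_H0 / sum(L_H1_alt * rel_prior)  # weighted-average likelihood in the denominator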
3. Both hypotheses are composite, i.e. each provides several models for Data (several explanations)

This gives different sub-cases, depending on whether the compositions in the hypotheses are discrete or according to a continuously valued parameter.

The "discrete–discrete" case gives the Bayes factor

\[
B = \frac{\sum_j L(H_{0j} \mid \mathrm{Data}) \cdot P(H_{0j} \mid H_0)}{\sum_i L(H_{1i} \mid \mathrm{Data}) \cdot P(H_{1i} \mid H_1)}
\]

and the "continuous–continuous" case gives the Bayes factor

\[
B = \frac{\int_{"\theta \in H_0"} L(\theta \mid \mathrm{Data}) \cdot p(\theta \mid H_0) \, d\theta}{\int_{"\theta \in H_1"} L(\theta \mid \mathrm{Data}) \cdot p(\theta \mid H_1) \, d\theta}
\]

where p(θ | H0) is the conditional prior density of θ given that H0 is true.

Example: Ecstasy pills revisited again

Assume a more realistic case where, from a sample of the seizure, we shall investigate whether the proportion of Ecstasy pills is higher than 80%.

H0: Proportion θ > 0.8
H1: Proportion θ ≤ 0.8

i.e. both hypotheses are composite.

We further assume, like before, that we have no particular belief in either of the two hypotheses. The prior density for θ can thus be defined as

\[
p(\theta) =
\begin{cases}
0.5 / 0.8 = 0.625 & 0 \le \theta \le 0.8 \\
0.5 / 0.2 = 2.5 & 0.8 < \theta \le 1
\end{cases}
\]
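A quick sketch in R confirming that this piecewise prior puts probability 0.5 on each hypothesis:

p_theta <- function(theta) ifelse(theta <= 0.8, 0.5 / 0.8, 0.5 / 0.2)
P_H1 <- integrate(p_theta, 0, 0.8)$value   # = 0.5, prior probability of H1
P_H0 <- integrate(p_theta, 0.8, 1)$value   # = 0.5, prior probability of H0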
The likelihood function is (irrespective of the hypotheses)

\[
L(\theta \mid \mathrm{Data}) = \binom{50}{39} \cdot \theta^{39} \cdot (1 - \theta)^{11}
\]

The conditional prior densities under each hypothesis become uniform over each interval of potential values of θ ( (0.8, 1] and [0, 0.8] respectively).

The Bayes factor is

\[
B = \frac{\int_{"\theta \in H_0"} L(\theta \mid \mathrm{Data}) \cdot p(\theta \mid H_0) \, d\theta}{\int_{"\theta \in H_1"} L(\theta \mid \mathrm{Data}) \cdot p(\theta \mid H_1) \, d\theta}
= \frac{\int_{0.8}^{1} \binom{50}{39} \theta^{39} (1 - \theta)^{11} \cdot 1 \, d\theta}{\int_{0}^{0.8} \binom{50}{39} \theta^{39} (1 - \theta)^{11} \cdot 1 \, d\theta}
= \frac{\int_{0.8}^{1} \theta^{39} (1 - \theta)^{11} \, d\theta}{\int_{0}^{0.8} \theta^{39} (1 - \theta)^{11} \, d\theta}
= \frac{\int_{0.8}^{1} C \cdot \theta^{40-1} (1 - \theta)^{12-1} \, d\theta}{\int_{0}^{0.8} C \cdot \theta^{40-1} (1 - \theta)^{12-1} \, d\theta}
\]

How do we solve these integrals?

The Beta distribution: A random variable is said to have a Beta distribution with parameters a and b if its probability density function is

\[
f(x) = C \cdot x^{a-1} \cdot (1 - x)^{b-1}, \qquad 0 \le x \le 1,
\qquad \text{with } C = \left( \int_{0}^{1} x^{a-1} (1 - x)^{b-1} \, dx \right)^{-1} = \frac{1}{B(a, b)}
\]

Hence, we can identify the integrals of the Bayes factor as proportional to different probabilities of the same beta distribution, namely a beta distribution with parameters a = 40 and b = 12.
> num <- 1 - pbeta(0.8, 40, 12)   # P(theta > 0.8) under Beta(40, 12)
> den <- pbeta(0.8, 40, 12)       # P(theta <= 0.8) under Beta(40, 12)
> num
[1] 0.314754
> den
[1] 0.685246
> B <- num/den
> B
[1] 0.45933
Hence, the Bayes factor is 0.45933. With even prior odds (Odds(H0) = 1) we get posterior odds equal to the Bayes factor, and the posterior probability of H0 is

\[
P(H_0 \mid \mathrm{Data}) = \frac{0.45933}{0.45933 + 1} \approx 0.31
\]

Data does not provide us with evidence clearly against either of the hypotheses.
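The integrals can also be evaluated directly, without recognising the beta distribution; a minimal sketch in R using numerical integration (the binomial coefficient cancels in the ratio):

kern <- function(theta) theta^39 * (1 - theta)^11
num <- integrate(kern, 0.8, 1)$value
den <- integrate(kern, 0, 0.8)$value
num / den   # about 0.459, matching the pbeta calculation above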