Lecture 2: Probabilistic reasoning with hypotheses

In an inferential setup we may work with propositions or hypotheses. The hypothesis is a central component in the building of science, and forensic science is no exception. Successive falsification of hypotheses (cf. Popper) is an important component of crime investigation.

The "standard situation" is that we have two hypotheses:

H0: the forwarded hypothesis
H1: the alternative hypothesis

These must be mutually exclusive.

Classical statistical hypothesis testing (Neyman, J. and Pearson, E.S., 1933)

The two hypotheses are different explanations of the data, and each hypothesis provides model(s) for Data. If each hypothesis provides one and only one model for Data, the hypotheses are referred to as simple. The purpose is to use Data to try to falsify H0.

Type-I error: falsifying a true H0. Size or significance level: α = Pr(Type-I error).
Type-II error: not falsifying a false H0. Power: 1 − Pr(Type-II error) = 1 − β.

Most powerful test for simple hypotheses (Neyman-Pearson lemma): reject (falsify) H0 when

$$\frac{L(H_1 \mid \text{Data})}{L(H_0 \mid \text{Data})} \ge A$$

where $A > 0$ is chosen so that

$$P\!\left(\frac{L(H_1 \mid \text{Data})}{L(H_0 \mid \text{Data})} \ge A \,\middle|\, H_0\right) = \alpha .$$

This test minimises β for fixed α. Note that the probability is taken with respect to Data, i.e. with respect to the probability model each hypothesis provides for Data. The extension to composite hypotheses is the uniformly most powerful (UMP) test.

Example: A seizure of pills, suspected to be Ecstasy, is sampled for the purpose of investigating whether the proportion of Ecstasy pills is "around" 80% or "around" 50%. As the forwarded hypothesis we can formulate

H0: Around 80% of the pills in the seizure are Ecstasy

and as the alternative hypothesis

H1: Around 50% of the pills in the seizure are Ecstasy

In a sample of 50 pills, 39 proved to be Ecstasy pills. The likelihoods of the two hypotheses are

L(H0 | Data) = the probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 80%
L(H1 | Data) = the probability of obtaining 39 Ecstasy pills out of 50 sampled when the seizure proportion of Ecstasy pills is 50%

Assuming a large seizure, these probabilities can be calculated using a binomial model Bin(50, p), where H0 states that p = p0 = 0.8 and H1 states that p = p1 = 0.5. In generic form, if we have obtained x Ecstasy pills out of n sampled:

$$L(H_0 \mid \text{Data}) = L(H_0 \mid x, n) = \binom{n}{x} p_0^{\,x} (1-p_0)^{\,n-x}$$

$$L(H_1 \mid \text{Data}) = L(H_1 \mid x, n) = \binom{n}{x} p_1^{\,x} (1-p_1)^{\,n-x}$$

The Neyman-Pearson lemma now states that the most powerful test is of the form

$$\frac{L(H_1 \mid \text{Data})}{L(H_0 \mid \text{Data})} = \frac{p_1^{\,x} (1-p_1)^{\,n-x}}{p_0^{\,x} (1-p_0)^{\,n-x}} \ge A$$

Taking logarithms gives

$$x \ln\frac{p_1}{p_0} + (n-x) \ln\frac{1-p_1}{1-p_0} \ge \ln A \;\Leftrightarrow\; x \le \frac{\ln A - n \ln\frac{1-p_1}{1-p_0}}{\ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0}} = C(n)$$

where the inequality is reversed in the last step because $p_1 < p_0$ implies $\ln\frac{p_1}{p_0} - \ln\frac{1-p_1}{1-p_0} < 0$. Hence, H0 should be rejected in favour of H1 as soon as x ≤ C.

How to choose C? Normally we set the significance level α and then find the largest C so that P(X ≤ C | H0) ≤ α, i.e. as close to α as the discreteness allows. If α is chosen to be 0.05 we search the binomial distribution valid under H0 for the value C such that

$$\sum_{k=0}^{C} P(X = k \mid H_0) = \sum_{k=0}^{C} \binom{50}{k} 0.8^{k}\, 0.2^{50-k} \le 0.05$$

In MS Excel:

BINOM.INV(50;0.8;0.05) returns 35, the lowest value for which the cumulative probability is at least 0.05.
BINOM.DIST(35;50;0.8;TRUE) returns 0.06072208.
BINOM.DIST(34;50;0.8;TRUE) returns 0.030803423.

We therefore choose C = 34. Since x = 39 we cannot reject H0.
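The same critical value can be found in R, whose qbinom and pbinom functions mirror the Excel calls above (a minimal sketch of the calculation just described):

# Critical value C for the one-sided test at significance level alpha = 0.05,
# under H0: X ~ Bin(50, 0.8). We want the largest C with P(X <= C | H0) <= alpha.
n <- 50; p0 <- 0.8; alpha <- 0.05
C <- qbinom(alpha, n, p0) - 1   # qbinom gives the smallest k with P(X <= k) >= alpha (here 35)
pbinom(C, n, p0)                # 0.03080342, below alpha as required
pbinom(C + 1, n, p0)            # 0.06072208, already exceeds alpha
x <- 39
x <= C                          # FALSE: we cannot reject H0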
Drawbacks with the classical approach

• Data alone "decides".
• Small amounts of data give low power.
• Difficulties in interpretation. When H0 is rejected, it means: "if we repeat the collection of data under (in principle) identical circumstances, then rejection will occur in at most 100α% of all cases in which H0 is true." Can we (always) repeat the collection of data?
• "Falling off the cliff": what is the difference between "just rejecting" and "almost rejecting"?
• "Isolated" falsification (or no falsification): tests using other data but with the same hypotheses cannot be easily combined.

The Bayesian approach

There is always a process that leads to the formulation of the hypotheses, so there exists a prior probability for each of them:

$$p_0 = P(H_0 \mid I), \quad p_1 = P(H_1 \mid I), \quad p_0 + p_1 = 1$$

where I denotes the background information. This is more simply expressed as prior odds for the hypothesis H0:

$$\text{Odds}(H_0 \mid I) = \frac{P(H_0 \mid I)}{P(H_1 \mid I)} = \frac{p_0}{p_1}$$

Non-informative priors, p0 = p1 = 0.5, give prior odds = 1.

How can we obtain the posterior odds? Data should help us calculate

$$\text{Odds}(H_0 \mid \text{Data}, I) = \frac{P(H_0 \mid \text{Data}, I)}{P(H_1 \mid \text{Data}, I)} = \frac{q_0}{q_1}$$

The odds ratio (posterior odds divided by prior odds) is known as the Bayes factor:

$$B = \frac{\text{Odds}(H_0 \mid \text{Data}, I)}{\text{Odds}(H_0 \mid I)} \quad\Leftrightarrow\quad \text{Odds}(H_0 \mid \text{Data}, I) = B \cdot \text{Odds}(H_0 \mid I)$$

Hence, if we know the Bayes factor, we can calculate the posterior odds (since we can always set the prior odds). The posterior probability of H0 is then

$$q_0 = P(H_0 \mid \text{Data}, I) = \frac{\text{Odds}(H_0 \mid \text{Data}, I)}{\text{Odds}(H_0 \mid \text{Data}, I) + 1}$$

The "hypothesis testing" is then a judgement upon whether q0 is
• small enough to make us believe in H1, or
• large enough to make us believe in H0,

i.e. no pre-setting of the decision direction is made.

1. Both hypotheses are simple, i.e. each gives one and only one model for Data

a) Distinct probabilities can be assigned to Data. It can be shown that Bayes' theorem on odds form then gives

$$\frac{P(H_0 \mid \text{Data}, I)}{P(H_1 \mid \text{Data}, I)} = \frac{P(\text{Data} \mid H_0, I)}{P(\text{Data} \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)}$$

Hence, the Bayes factor is

$$B = \frac{P(\text{Data} \mid H_0, I)}{P(\text{Data} \mid H_1, I)}$$

The probabilities of the numerator and denominator can be calculated (estimated) using the model provided by the respective hypothesis.

b) Data is the observed value x of a continuous (possibly multidimensional) random variable. Then

$$\frac{P(H_0 \mid \text{Data}, I)}{P(H_1 \mid \text{Data}, I)} = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)} \cdot \frac{P(H_0 \mid I)}{P(H_1 \mid I)}$$

where f(x | H0, I) and f(x | H1, I) are the probability density functions given by the models specified by H0 and H1. Hence, the Bayes factor is

$$B = \frac{f(x \mid H_0, I)}{f(x \mid H_1, I)}$$

Known (or estimated) density functions under each model can then be used to calculate the Bayes factor. In both cases the Bayes factor is a likelihood ratio, since the numerator and denominator are the likelihoods of the respective hypotheses:

$$B = \frac{L(H_0 \mid \text{Data}, I)}{L(H_1 \mid \text{Data}, I)}$$

Example: Ecstasy pills revisited

H0: Around 80% of the pills in the seizure are Ecstasy
H1: Around 50% of the pills in the seizure are Ecstasy

The likelihoods of the hypotheses are

$$L(H_0 \mid \text{Data}) = \binom{50}{39} 0.8^{39} \cdot 0.2^{11} \approx 0.1271082$$

$$L(H_1 \mid \text{Data}) = \binom{50}{39} 0.5^{39} \cdot 0.5^{11} \approx 3.317678 \times 10^{-5}$$

$$\Rightarrow\; B \approx \frac{0.1271082}{3.317678 \times 10^{-5}} \approx 3831$$

Hence, Data are 3831 times more probable if H0 is true than if H1 is true. Assume we have no particular belief in either of the two hypotheses prior to obtaining the data:

$$\text{Odds}(H_0) = 1 \;\Rightarrow\; \text{Odds}(H_0 \mid \text{Data}) \approx 3831 \cdot 1 \;\Rightarrow\; P(H_0 \mid \text{Data}) = \frac{3831}{3831 + 1} \approx 0.9997$$

Hence, upon the analysis of data we can be 99.97% certain that H0 is true. Note, however, that it may be unrealistic to assume only two possible proportions of Ecstasy pills in the seizure!
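The whole calculation above can be reproduced in R (a minimal sketch using the built-in binomial density dbinom):

# Bayes factor for two simple hypotheses: p0 = 0.8 versus p1 = 0.5,
# having observed x = 39 Ecstasy pills among n = 50 sampled.
x <- 39; n <- 50
L0 <- dbinom(x, n, 0.8)   # 0.1271082
L1 <- dbinom(x, n, 0.5)   # 3.317678e-05
B  <- L0 / L1             # approx. 3831
B / (B + 1)               # posterior probability of H0 with even prior odds: approx. 0.9997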
2. The hypothesis H0 is simple but the hypothesis H1 is composite, i.e. it provides several models for Data (several explanations)

The various models of H1 will in general provide different likelihoods for the different explanations, so we cannot come up with one unique likelihood for H1. If, in addition, the different explanations have different prior probabilities, we have to weigh the different likelihoods with these.

If the composition of H1 is in the form of a set of discrete alternatives H11, H12, ..., the Bayes factor can be written

$$B = \frac{L(H_0 \mid \text{Data})}{\sum_i L(H_{1i} \mid \text{Data}) \cdot P(H_{1i} \mid H_1)}$$

where P(H1i | H1) is the conditional prior probability that H1i is true given that H1 is true (the relative prior), and the sum is over all alternatives H11, H12, ... Two special cases are worth noting: if the relative priors are (fairly) equal, the denominator reduces to the average likelihood of the alternatives, and if the likelihoods of the alternatives are equal, the denominator reduces to that common likelihood, since the relative priors sum to one.

If the composition is defined by a continuously valued parameter θ, we must use the conditional prior density of θ given that H1 is true, p(θ | H1), and integrate the likelihood with respect to that density:

$$B = \frac{L(H_0 \mid \text{Data})}{\int_{\theta \in H_1} L(\theta \mid \text{Data}) \cdot p(\theta \mid H_1)\, d\theta}$$

3. Both hypotheses are composite, i.e. each provides several models for Data (several explanations)

This gives different sub-cases, depending on whether the compositions of the hypotheses are discrete or defined by a continuously valued parameter. The "discrete-discrete" case gives the Bayes factor

$$B = \frac{\sum_j L(H_{0j} \mid \text{Data}) \cdot P(H_{0j} \mid H_0)}{\sum_i L(H_{1i} \mid \text{Data}) \cdot P(H_{1i} \mid H_1)}$$

and the "continuous-continuous" case gives the Bayes factor

$$B = \frac{\int_{\theta \in H_0} L(\theta \mid \text{Data}) \cdot p(\theta \mid H_0)\, d\theta}{\int_{\theta \in H_1} L(\theta \mid \text{Data}) \cdot p(\theta \mid H_1)\, d\theta}$$

where p(θ | H0) is the conditional prior density of θ given that H0 is true. A small numerical sketch of the discrete-composite case follows below, before we return to the pills example.
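To illustrate case 2, suppose, purely hypothetically (these alternatives are not part of the pills example), that under H1 the seizure proportion could be 0.4, 0.5 or 0.6, each with relative prior 1/3. The denominator of the Bayes factor is then the average of the three likelihoods. A minimal R sketch:

# Case 2 sketch: H0 simple (p = 0.8) versus a composite H1 with three
# hypothetical discrete alternatives p in {0.4, 0.5, 0.6}.
x <- 39; n <- 50
L0  <- dbinom(x, n, 0.8)            # likelihood of the simple H0
p1i <- c(0.4, 0.5, 0.6)             # hypothetical alternatives under H1
w   <- rep(1/3, length(p1i))        # relative priors P(H1i | H1), summing to one
den <- sum(dbinom(x, n, p1i) * w)   # weighted (here: averaged) likelihood of H1
B   <- L0 / den                     # Bayes factor for H0 versus the composite H1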
Example: Ecstasy pills revisited again

Assume a more realistic case where, from a sample of the seizure, we shall investigate whether the proportion of Ecstasy pills is higher than 80%:

H0: Proportion θ > 0.8
H1: Proportion θ ≤ 0.8

i.e. both hypotheses are composite. We further assume, like before, that we have no particular belief in either of the two hypotheses. The prior density for θ can thus be defined as

$$p(\theta) = \begin{cases} 0.5/0.8 = 0.625 & 0 \le \theta \le 0.8 \\ 0.5/0.2 = 2.5 & 0.8 < \theta \le 1 \end{cases}$$

so that P(H0) = P(H1) = 0.5 and the prior odds equal 1. The likelihood function is (irrespective of the hypotheses)

$$L(\theta \mid \text{Data}) = \binom{50}{39} \theta^{39} (1-\theta)^{11}$$

The conditional prior densities under each hypothesis become uniform over the respective interval of potential values of θ: p(θ | H0) = 1/0.2 = 5 on (0.8, 1] and p(θ | H1) = 1/0.8 = 1.25 on [0, 0.8]. The Bayes factor is therefore

$$B = \frac{\int_{0.8}^{1} \binom{50}{39} \theta^{39} (1-\theta)^{11} \cdot 5\, d\theta}{\int_{0}^{0.8} \binom{50}{39} \theta^{39} (1-\theta)^{11} \cdot 1.25\, d\theta} = 4 \cdot \frac{\int_{0.8}^{1} \theta^{40-1} (1-\theta)^{12-1}\, d\theta}{\int_{0}^{0.8} \theta^{40-1} (1-\theta)^{12-1}\, d\theta}$$

How do we solve these integrals?

The Beta distribution: a random variable is said to have a Beta distribution with parameters a and b if its probability density function is

$$f(x) = \frac{x^{a-1} (1-x)^{b-1}}{B(a,b)}; \quad 0 \le x \le 1, \qquad \text{where } B(a,b) = \int_0^1 x^{a-1} (1-x)^{b-1}\, dx$$

Hence, we can identify the integrals of the Bayes factor as proportional to different probabilities of the same Beta distribution, namely one with parameters a = 40 and b = 12:

$$B = 4 \cdot \frac{P(\theta > 0.8)}{P(\theta \le 0.8)}, \qquad \theta \sim \text{Beta}(40, 12)$$

In R:

> num <- 1 - pbeta(0.8, 40, 12)
> den <- pbeta(0.8, 40, 12)
> num
[1] 0.314754
> den
[1] 0.685246
> B <- (5 * num) / (1.25 * den)
> B
[1] 1.83732

Note that the constants 5 and 1.25 coming from the conditional prior densities must be carried along; the Beta probabilities alone give only the ratio of the integrals. Hence, the Bayes factor is about 1.84. With even prior odds (Odds(H0) = 1) the posterior odds equal the Bayes factor, and the posterior probability of H0 is

$$P(H_0 \mid \text{Data}) = \frac{1.83732}{1.83732 + 1} \approx 0.65$$

Data does not provide us with evidence clearly in favour of either of the hypotheses.
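As a cross-check of the Beta identification, the Bayes factor can also be obtained by numerical integration of the binomial likelihood directly (a minimal R sketch; integrate is R's standard one-dimensional quadrature routine):

# Cross-check: integrate the likelihood against the conditional prior
# densities, 5 on (0.8, 1] under H0 and 1.25 on [0, 0.8] under H1.
lik <- function(theta) dbinom(39, 50, theta)   # L(theta | Data), vectorised in theta
num <- 5    * integrate(lik, 0.8, 1)$value     # numerator of the Bayes factor
den <- 1.25 * integrate(lik, 0, 0.8)$value     # denominator of the Bayes factor
B   <- num / den                               # approx. 1.84, agreeing with pbeta above
B / (B + 1)                                    # posterior probability of H0, approx. 0.65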