Introduction to Likelihood Meaningful Modeling of Epidemiologic Data, 2012 AIMS, Muizenberg, South Africa Steve Bellan, MPH, PhD Department of Environmental Science, Policy & Management University of California at Berkeley In a population of 1,000,000 people with a true prevalence of 30%, the probability distribution of number of positive individuals if 100 are sampled: æ100ö f (x) = ç ÷(0.3) x (0.7)100-x è x ø barplot(dbinom(x = 0:100, size = 100, prob = .3), names.arg = 0:size) In a population of 1,000,000 people with a true prevalence of 30%, the probability distribution of number of positive individuals if 100 are sampled: æ100ö f (x) = ç ÷(0.3) x (0.7)100-x è x ø We sample 100 people once and 28 are positive: > rbinom(n = 1, size = 100, prob = .3) [1] 28 We don’t know the true prevalence! But we can calculate the probability of 28 or a more extreme value occurring for a given prevalence. We sample 100 people once and 28 are positive: > rbinom(n = 1, size = 100, prob = .3) [1] 28 0.08 Cumulative Probability & P Values for 30% prevalence: 0.04 > 2*pbinom(28,100,.3)) 0.02 [1] 0.7535564 p-value = 0.74 0.00 probability 0.06 p(28 or a more extreme value occurring) = 0 2 4 6 8 11 14 17 20 23 26 29 32 35 38 41 44 47 50 53 56 59 number HIV+ We sample 100 people once and 28 are positive. 62 65 68 71 74 77 80 83 86 89 92 95 98 0.14 If true prevalence were 15%, then p(28 or more extreme) is 2*pbinom(28−1, 100, 0.15, lower.tail = FALSE) 0.08 0.06 0.04 0.02 x2 0.00 probability 0.10 0.12 p = 0.00123 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 number HIV+ 0.14 If true prevalence were 20%, then p(28 or more extreme) is 2*pbinom(28−1, 100, 0.2, lower.tail = FALSE) 0.08 0.06 0.04 0.02 x2 0.00 probability 0.10 0.12 p = 0.0683 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 number HIV+ 0.14 If true prevalence were 25%, then p(28 or more extreme) is 2*pbinom(28−1, 100, 0.25, lower.tail = FALSE) 0.08 0.06 0.04 0.02 x2 0.00 probability 0.10 0.12 p = 0.555 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 number HIV+ 0.14 If true prevalence were 30%, then p(28 or more extreme) is 2*pbinom(28, 100, 0.3, lower.tail = TRUE) 0.08 0.06 0.04 0.02 x2 0.00 probability 0.10 0.12 p = 0.754 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 number HIV+ 0.14 If true prevalence were 35%, then p(28 or more extreme) is 2*pbinom(28, 100, 0.35, lower.tail = TRUE) 0.08 0.06 0.04 0.02 x2 0.00 probability 0.10 0.12 p = 0.17 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 number HIV+ 0.14 If true prevalence were 40%, then p(28 or more extreme) is 2*pbinom(28, 100, 0.4, lower.tail = TRUE) 0.08 0.06 0.04 0.02 x2 0.00 probability 0.10 0.12 p = 0.0169 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 number HIV+ p = 0.00123 p = 0.0683 p = 0.555 11 18 25 32 39 53 60 67 74 81 88 95 0.00 0.00 46 0 5 11 18 25 32 39 46 53 60 67 74 81 88 95 0 5 11 18 25 32 39 46 53 60 67 74 81 88 95 hypothetical prevalence: 30 % hypothetical prevalence: 35 % hypothetical prevalence: 40 % p = 0.754 p = 0.17 p = 0.0169 x2 11 18 25 32 39 46 53 60 number HIV+ 67 74 81 88 95 0.00 x2 0.00 5 0.08 probability 0.04 0.08 probability 0.04 0.08 0.04 0.00 0.12 number HIV+ 0.12 number HIV+ 0.12 number HIV+ x2 0 0.08 probability 0.04 0.08 probability 0.04 0.08 probability 0.04 0.00 5 x2 x2 x2 0 probability 0.12 hypothetical prevalence: 25 % 0.12 hypothetical prevalence: 20 % 0.12 hypothetical prevalence: 15 % 0 5 11 18 25 32 39 46 53 60 number HIV+ 67 74 81 88 95 0 5 11 18 25 32 39 46 53 60 number HIV+ 67 74 81 88 95 Which hypotheses do we reject? IF GIVEN THE HYPOTHESIS p value < cutoff THEN REJECT HYPOTHESIS Cutoff usually chosen as α = 0.05 Which hypotheses do we reject? p = 0.00123 p = 0.0683 p = 0.555 11 18 25 32 39 46 53 60 67 74 81 88 95 0 5 11 18 25 32 39 46 53 60 67 74 81 88 95 0 5 11 18 25 32 39 46 53 60 67 74 81 88 95 number HIV+ hypothetical prevalence: 30 % hypothetical prevalence: 35 % hypothetical prevalence: 40 % p = 0.754 p = 0.17 p = 0.0169 5 11 18 25 32 39 46 53 60 number HIV+ 67 74 81 88 95 0.08 probability 0.00 0.04 0.08 probability 0.00 0.04 0.08 0.04 0.12 number HIV+ 0.12 number HIV+ 0.00 0 0.08 probability 0.00 0.04 0.08 probability 0.00 0.04 0.08 probability 0.04 0.00 5 0.12 0 probability 0.12 hypothetical prevalence: 25 % 0.12 hypothetical prevalence: 20 % 0.12 hypothetical prevalence: 15 % 0 5 11 18 25 32 39 46 53 60 number HIV+ 67 74 81 88 95 0 5 11 18 25 32 39 46 53 60 number HIV+ 67 74 81 88 95 0.6 95% CI includes HIV prevalences of 19.5% to 37.8% 0.378 0.2 0.4 0.195 0.0 p−value 0.8 1.0 Which hypotheses do we NOT reject: CONFIDENCE INTERVAL 0.0 0.2 0.4 0.6 hypothetical prevalence 0.8 1.0 Let’s take another approach We don’t know the true prevalence, but the probability that we had exactly 28/100 with 30% prevalence is: > dbinom(x = 28, size = 100, prob = .3) [1] 0.08041202 We sample 100 people once and 28 are positive: > rbinom(n = 1, size = 100, prob = .3) [1] 28 Which prevalence gives the greatest probability of observing exactly 28/100? Which of these prevalence values is most likely given our data? Maximum Likelihood Estimate parameter value giving greatest probability of the data having occurred. What do you think is the MLE here? MLE = 28/100 = 0.28 true unknown value = 0.30 different null hypotheses Defining Likelihood • L(parameter | data) = p(data | parameter) • Not a probability distribution. • Probabilities taken from many different distributions. function of x PDF: æ nö x f (x | p) = ç ÷ p (1- p) n-x è xø LIKELIHOOD: æ nö x L( p | x) = ç ÷ p (1- p) n-x è xø function of p Deriving the Maximum Likelihood Estimate maximize æ nö x L( p) = ç ÷ p (1- p) n-x è xø maximize éæ n ö x ù n-x log(L( p) = logêç ÷ p (1- p) ú ëè x ø û minimize éæ n ö x ù n-x l( p) = -logêç ÷ p (1- p) ú ëè x ø û Deriving the Maximum Likelihood Estimate éæ nö x ù n-x l( p) = -log(L( p)) = -logêç ÷ p (1- p) ú ëè x ø û æ nö l( p) = -logç ÷ - log( px ) - log((1- p) n-x ) è xø æ nö l( p) = -logç ÷ - x log( p) - (n - x)log(1- p) è xø Deriving the Maximum Likelihood Estimate æ nö l( p) = -logç ÷ - x log( p) - (n - x)log(1- p) è xø dl( p) x -(n - x) =0- dp p 1- p 0 = -x + pˆ n x n-x 0=- + pˆ 1- pˆ -x(1- pˆ ) + pˆ (n - x) 0= pˆ (1- pˆ ) ˆx 0 = -x + pˆXx + pˆ n - pX x pˆ = n The proportion of positives! Maximum Likelihood Estimate x 28 pˆ = = = 0.28 n 100 Maximum Likelihood Estimate x 28 pˆ = = = 0.28 n 100 Building Confidence Intervals Likelihood Ratio Test If the null hypothesis were true then L(null hypothesis) -2log( ~ c df2 =1 L(alternative hypothesis) 2lalternative - 2lnull ~ c df2 =1 So if our α = .05, then we reject any null hypothesis for which 2lMLE - 2lnull > c df2 =1, a = 0.05 = 3.84 > qchisq(p = .95, df = 1) [1] 3.841459 2lMLE - 2lnull > 3.84 lMLE - lnull >1.92 When lMLE - lnull > 1.92, we reject that null hypothesis. Building Confidence Intervals Likelihood Ratio Test Maximum Likelihood Estimate x 28 pˆ = = = 0.28 n 100 Let’s zoom in… Building Confidence Intervals Likelihood Ratio Test Maximum Likelihood Estimate x 28 pˆ = = = 0.28 n 100 Building Confidence Intervals Likelihood Ratio Test 2lMLE - 2lnull > c df2 =1, a = 0.05 = 3.84 lMLE - lnull >1.92 Building Confidence Intervals Likelihood Ratio Test 0.8 0.4 0.195 0.378 95% CI includes HIV prevalences of 19.5% to 37.8% 0.0 p−value Comparing Confidence Intervals 0.0 0.2 0.4 0.6 0.8 1.0 7 5 6 0.195 0.378 4 95% CI includes HIV prevalences of 19.5% to 37.8% 2 3 −log(likelihood) hypothetical prevalence (null hypothesis) 0.0 0.2 0.4 0.6 potential prevalences (our models) 0.8 1.0 Advantages of Likelihood • Practical method for estimating parameters estimating variance of our estimates • Easily adaptable to different probability distributions & dynamic models This presentation is made available through a Creative Commons Attribution-Noncommercial license. Details of the license and permitted uses are available at http://creativecommons.org/licenses/by-nc/3.0/ © 2010 Steve Bellan and the Meaningful Modeling of Epidemiological Data Clinic Title: Introduction to Likelihood Attribution: Steve Bellan, Clinic on the Meaningful Modeling of Epidemiological Data Source URL: http://lalashan.mcmaster.ca/theobio/mmed/index.php/ For further information please contact Steve Bellan (sbellan@berkeley.edu).