Evaluation of Probability Forecasts

Shulamith T. Gross, City University of New York, Baruch College
Joint work with Tze Leung Lai and David Bo Shen (Stanford University) and Catherine Huber (Université René Descartes)
Rutgers, 4.30.2012 / 6/30/2016

Quality of Prediction

Repeated probability predictions are commonly made in meteorology, banking and finance, medicine, and epidemiology.
Define:
  p_i  = true event probability
  p'_i = (model) predicted future event probability
  Y_i  = observed event indicator
Prediction quality:
  Calibration (accuracy)
  Discrimination (resolution, precision)

OUTLINE

Evaluation of probability forecasts in meteorology and banking (accuracy):
  Scores (Brier, Good), proper scores, Winkler's skill scores (with loss or utility):
    L'_n = n^{-1} Σ_{1≤i≤n} (Y_i − p'_i)²
  Reliability diagrams: bin the data according to predicted risk p'_i and plot the observed relative frequency of events against each bin center.

OUTLINE 2: Measures of Predictiveness of Models

Medicine: scores are increasingly used.
Epidemiology: curves (ROC, predictiveness) and a plethora of indices of reliability/discrimination:
  AUC = P[p'_1 > p'_2 | Y_1 = 1, Y_2 = 0]   (concordance; one predictor)
  Net Reclassification Index:
    NRI = {P[p'' > p' | Y=1] − P[p'' > p' | Y=0]} − {P[p'' < p' | Y=1] − P[p'' < p' | Y=0]}
  Improved Discrimination Index:
    IDI = E[p'' − p' | Y=1] − E[p'' − p' | Y=0]   (two predictors)

Evaluation of Probability Forecasts Using Scoring Rules

Reliability, or precision, is measured using "scoring rules" (loss or utility):
  L'_n = n^{-1} Σ_{1≤i≤n} L(Y_i, p'_i)
The first score, Brier (1950), used squared error loss; it remains the most commonly used.
Review: Gneiting & Raftery (2007).
Problem: very few inference tools are available.

Earlier Contributions

Proper scores (using loss functions): E_p[L(Y,p)] ≤ E_p[L(Y,p')].
Inference: tests of H_0: p_i = p'_i for all i = 1, …, n.
Cox (1958), in the model logit(p_i) = b_1 + b_2 logit(p'_i). Too restrictive: we need to estimate the true score, not test for perfect prediction.
Related work: econometrics, evaluation of linear models.
Emphasis there is on predicting Y itself: Giacomini and White (Econometrica, 2006).

Statistical Considerations

L'_n = n^{-1} Σ_{1≤i≤n} L(Y_i, p'_i) attempts to estimate the 'population parameter'
  L_n = n^{-1} Σ_{1≤i≤n} L(p_i, p'_i).
Squared error loss: L(p,p') = (p − p')².
Kullback-Leibler divergence (Good, 1952): L(p,p') = p log(p/p') + (1−p) log((1−p)/(1−p')).
How good is L'_n as an estimate of L_n?

Linear Equivalent Loss

A loss function L~(p,p') is a linear equivalent of the loss function L(p,p') if
  I.  it is a linear function of p, and
  II. L(p,p') − L~(p,p') does not depend on p'.
E.g., L~(p,p') = −2pp' + p'² is a linear equivalent of the squared error loss L(p,p') = (p − p')².
A linear equivalent L~ of the Kullback-Leibler divergence is given by the Good (1952) score
  L~(p,p') = −{p log(p') + (1−p) log(1−p')}.
L(p,p') = |p − p'| has no linear equivalent.

LINEAR EQUIVALENTS 2

We allow each forecast p'_k to depend on an information set F_{k−1} consisting of all event and forecast histories and other covariates observed before Y_k, as well as the true p_1, …, p_k. The conditional distribution of Y_i given F_{i−1} is Bernoulli(p_i), with P(Y_i = 1 | F_{i−1}) = p_i.
Dawid (1982, 1993) used martingales in a hypothesis-testing context.
Suppose L(p,p') is linear in p, as in the case of linear equivalents of general loss functions. Then
  E[L(Y_i, p'_i) | F_{i−1}] = L(p_i, p'_i),
so L(Y_i, p'_i) − L(p_i, p'_i) is a martingale difference sequence with respect to {F_{i−1}}.

Theorem 1. Suppose L(p,p') is linear in p. Let
  σ²_n = n^{-1} Σ_{1≤i≤n} {L(1, p'_i) − L(0, p'_i)}² p_i(1 − p_i).
If σ²_n converges in probability to some non-random positive constant, then √n (L'_n − L_n)/σ_n has a limiting standard normal distribution.

We can set confidence intervals for the 'true' score if we can estimate p_i(1 − p_i).
The upper bound ¼ leads to a conservative CI.
Alternatively: create buckets (bins) of cases with close event-probability values (banking), or use categorical covariates (epidemiology).
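As a concrete illustration of the conservative interval from Theorem 1, here is a minimal sketch in Python. The function name `brier_conservative_ci` and the 1.96 normal quantile are my own choices; the code uses the fact that (Y − p')² = Y(1 − 2p') + p'² is linear in Y (since Y² = Y), so Theorem 1 applies with L(1,p') − L(0,p') = 1 − 2p', and it bounds the unknown p_i(1 − p_i) by 1/4 as suggested above.

```python
import numpy as np

def brier_conservative_ci(y, p_pred, z=1.96):
    """Brier score L'_n with a conservative 95% CI for the true mean score
    L_n = n^{-1} sum_i E[(Y_i - p'_i)^2 | F_{i-1}].

    For Y in {0,1}, (Y - p')^2 = Y(1 - 2p') + p'^2 is linear in Y, so
    Theorem 1 applies with slope L(1,p') - L(0,p') = 1 - 2p'; the unknown
    p_i(1 - p_i) is bounded by 1/4 (conservative).
    """
    y = np.asarray(y, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    n = len(y)
    score = np.mean((y - p_pred) ** 2)                    # L'_n, the Brier score
    sigma2 = np.mean((1.0 - 2.0 * p_pred) ** 2) * 0.25    # conservative sigma_n^2
    half = z * np.sqrt(sigma2 / n)                        # sqrt(n)(L'_n - L_n)/sigma_n ~ N(0,1)
    return score, score - half, score + half
```

With binned data, the 1/4 bound can be replaced by within-bucket sample variances of the Y's, as in Theorem 3 below, giving shorter intervals.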
Difference of Scores to Replace Skill Scores

When the score has a linear equivalent, the score difference
  D'_n = n^{-1} Σ_{1≤i≤n} {L(Y_i, p''_i) − L(Y_i, p'_i)}
is linear in p_i and, after centering, a martingale average with respect to {F_i}; it is unchanged when L is replaced by its linear equivalent, since L − L~ does not depend on the forecast.

Theorem 2. Suppose the score L has a linear equivalent, and let
  D_n = n^{-1} Σ_{1≤i≤n} {L(p_i, p''_i) − L(p_i, p'_i)},
  d_i = L(1, p''_i) − L(0, p''_i) − [L(1, p'_i) − L(0, p'_i)],
  s²_n = n^{-1} Σ_{1≤i≤n} d_i² p_i(1 − p_i).
If s²_n converges in probability to a positive constant, then √n (D'_n − D_n)/s_n converges in law to N(0,1).
If the score has no linear equivalent, the theorem holds with
  D_n = n^{-1} Σ_{1≤i≤n} {d_i p_i + [L(0, p''_i) − L(0, p'_i)]}.

A typical application in climatology is the following, in which p_t is the predicted probability of precipitation t days ahead and BR_k is the Brier score of the k-day-ahead forecast. The confidence intervals below are centered at Δ(k) = BR_k − BR_{k−1}.

               BR1    Δ(2)    Δ(3)    Δ(4)    Δ(5)    Δ(6)    Δ(7)
  Queens,      .125   .021    .012    .020    .010    .015    .007
  NY                  ±.010   ±.011   ±.012   ±.011   ±.011   ±.010
  Jefferson    .159   .005    .005    .007    .024    .000    .008
  City, MO            ±.010   ±.011   ±.011   ±.010   ±.010   ±.008

  Brier scores BR1 and conservative 95% confidence intervals for Δ(k).

Additional Results Useful for Variance Estimation and Binning

If a better estimate of p_i(1 − p_i) is needed, use binning or bucketing of the data.
Notation:
  Predictions are made at times t in 1:T.
  At time t there are J_t buckets of predictions; bucket j has size n_jt.
  p(j) = true probability common to all i in bucket j, for j in 1:J_t.
In statistical applications, buckets would be formed by combinations of categorical variables.

Buckets for One or Two Scores

Theorem 3.
(For constant-probability buckets.) If n_jt ≥ 2 for all j in 1:J_t and p_i = p_jt for all i in I_jt, then:
  Under the conditions of Theorem 1, σ'²_n − σ²_n = o_p(1), where
    σ'²_n = n^{-1} Σ_{t∈1:T, j∈1:Jt, i∈Ijt} {L(1, p'_i) − L(0, p'_i)}² v'²_t(j)
  and v'²_t(j) is the sample variance of the Y's in bucket I_jt.
  Under the conditions of Theorem 2, s'²_n − s²_n = o_p(1), where
    s'²_n = n^{-1} Σ_{t∈1:T, j∈1:Jt, i∈Ijt} d_i² v'²_t(j).

Notation for Quasi-Buckets (Bins)

Bins B_1, B_2, …, B_Jt; usually J_t = J.
A quasi-bucket is a bin defined by grouping cases on their predicted probability p' (not on the true p):
  I_jt = {k : p'_{k,t} ∈ B_j},  i = (k,t),  n_j = Σ_t n_jt.
(Recall, simple buckets: the p_i for all i ∈ I_jt are equal.)
  Ybar_t(j) = Σ_{i∈Ijt} Y_i / n_jt
  Ybar(j) = Σ_{t∈1:T} Σ_{i∈Ijt} Y_i / n_j      (quasi-bucket j)
  pbar(j) = Σ_{t∈1:T} Σ_{i∈Ijt} p_i / n_j
  v'(j) = Σ_{t∈1:T} n_jt v'_t(j) / n_j          (estimated variance in bin j)
  where v'_t(j) = Ybar_t(j) (1 − Ybar_t(j)) n_jt/(n_jt − 1).

Quasi-Buckets and the Reliability Diagram

Application of quasi-buckets: the reliability diagram, a graphical display of calibration that plots the observed % of cases in each bin against the bin center. A bin collects all subjects within the same decile (B_j) of predicted risk.
Note: bin I_jt is measurable w.r.t. F_{t−1}. Define, for one and two predictors:
  σ'²_n = n^{-1} Σ_{t∈1:T} Σ_{j∈1:Jt} Σ_{i∈Ijt} {L(1, p'_i) − L(0, p'_i)}² (Y_i − Ybar_t(j))² n_jt/(n_jt − 1)
  s'²_n = n^{-1} Σ_{t∈1:T} Σ_{j∈1:Jt} Σ_{i∈Ijt} d_i² (Y_i − Ybar_t(j))² n_jt/(n_jt − 1).
Note the difference between this s'²_n and the earlier one: p_i(1 − p_i), for all i in the same bucket, is estimated by a squared distance from a single bucket mean. Since the p_i vary within a quasi-bucket, a wrong centering is used, leading to Theorem 5.

The Simulation

t = 1, 2;  J_t = 5;  Bin(j) = ((j−1)/5, j/5), j = 1, …, 5.
True probabilities: p_jt ~ Unif(Bin(j));  n_j = 30 for j = 1, …, 5.
Simulate:
  1. a table of true and estimated bucket variances, and
  2.
Reliability diagram using confidence intervals from Theorem 5.

Simulated Reliability Diagram

[Figure: simulated reliability diagram.]

Theorem 5 (for quasi-buckets)

Assume that bucket I_jt is F_{t−1}-measurable for all buckets at time t (j in 1:J_t). Then, under the assumptions of Theorem 1,
  σ'²_n ≥ σ²_n + o(1) a.s.,
with equality if all p's in a bucket are equal. Under the assumptions of Theorem 2,
  s'²_n ≥ s²_n + o(1) a.s.,
with equality if all p's in a bucket are equal.

Theorem 5, continued (quasi-buckets)

If n_j/n converges in probability to a positive constant, and
  v(j) = Σ_{t∈1:T} Σ_{i∈Ijt} p_i(1 − p_i) / n_j,
then (n_j/v(j))^{1/2} (Ybar(j) − pbar(j)) converges in distribution to N(0,1), and v'(j) ≥ v(j) + o_p(1).

Simulation Results for a 5-Bucket Example

            Min     Q1      Q2      Q3      Max     Mean    SD
  pbar(2)   0.101   0.300   0.300   0.355   0.515   0.320   0.049
  Ybar(2)   0.050   0.267   0.317   0.378   0.633   0.319   0.089
  v(2)      0.087   0.207   0.207   0.207   0.247   0.209   0.015
  v'(2)     0.048   0.208   0.221   0.239   0.259   0.213   0.034
  pbar(5)   0.769   0.906   0.906   0.906   0.906   0.895   0.026
  Ybar(5)   0.733   0.867   0.900   0.933   1.000   0.892   0.049
  v(5)      0.082   0.082   0.082   0.082   0.164   0.088   0.016
  v'(5)     0.077   0.084   0.093   0.120   0.202   0.096   0.037

Buckets Help Resolve the Bias Problem

Bias for the Brier score with L(p,p') = (p − p')²:
  E[L'_n − L_n | F_{n−1}] = n^{-1} Σ_{1≤i≤n} p_i(1 − p_i).
Using the within-bucket sample variances we
  estimate the bias,
  fix the variance of the resulting estimate, and
  obtain asymptotic normality and estimates of the new asymptotic variance.

Indices of Reliability and Discrimination in Epidemiology

Our work applies to 'in-the-sample prediction', e.g.:
  A 'best' model based on standard covariates for predicting disease development.
  A genetic or other marker becomes available and is tested on a 'training' sample.
  The estimated models are used to predict disease development in a test sample.
  The marker improves model fit significantly. Does it sufficiently improve prediction of disease to warrant the expense?
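Before turning to the epidemiological indices, the quasi-bucket construction behind the reliability diagram can be sketched in a few lines. This is a minimal sketch: the function name `reliability_bins` and the choice of equal-width bins are my own; the variance estimate is the v'_t(j) = Ybar_t(j)(1 − Ybar_t(j)) n_jt/(n_jt − 1) defined above.

```python
import numpy as np

def reliability_bins(y, p_pred, n_bins=5):
    """Group cases into quasi-buckets by predicted risk and return, per bin,
    (bin center, observed frequency Ybar(j), variance estimate v'(j), size n_j),
    with v'(j) = Ybar(j)(1 - Ybar(j)) * n_j/(n_j - 1), which by Theorem 5
    is a conservative estimate when the true p_i vary within a bin."""
    y = np.asarray(y, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    out = []
    for j in range(n_bins):
        # Bin B_j = ((j-1)/n_bins, j/n_bins]; include the left edge for the first bin
        mask = (p_pred > edges[j]) & (p_pred <= edges[j + 1])
        if j == 0:
            mask |= p_pred == 0.0
        nj = int(mask.sum())
        if nj < 2:
            continue  # the bucket results require n_jt >= 2
        ybar = float(y[mask].mean())
        v = ybar * (1.0 - ybar) * nj / (nj - 1.0)
        out.append((0.5 * (edges[j] + edges[j + 1]), ybar, v, nj))
    return out
```

A reliability diagram plots Ybar(j) against the bin centers; perfectly calibrated forecasts fall on the diagonal.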
Goodness of Prediction in Epidemiology

Discrimination: compare the predictive prowess of two models by how well model 2 assigns higher risk to cases and lower risk to controls, compared with model 1. Pencina et al. (2008, Statistics in Medicine) call these indices
  IDI = Improved Discrimination Index
  NRI = Net Reclassification Index
Note: both in-sample and out-of-sample prediction arise here.
Much related work: Gu and Pepe (2009); Uno et al., Biometrics (2011).

Current Work and Simulations

The martingale approach applies to "out-of-sample" evaluation of indices like NRI, IDI, AUC, and R² differences.
For epidemiological applications (in-sample model comparison for prediction), assuming
  (Y, Z) i.i.d. (response indicator, covariate vector),
  neither model is necessarily the true model,
  the models differ by one or more covariates,
  the larger model is significantly better than the smaller model,
  logistic or similar models, with the parameters of both models estimated by MLE,
we proved asymptotic normality, and provided variance estimates, for the IDI and the Brier score difference:
  IDI_{2/1} = E[p'' − p' | Y=1] − E[p'' − p' | Y=0],   BRI = BR(1) − BR(2).

The Three City Study

Cohort study. Purpose: identify variables that help predict dementia in people over 65.
Our sample: n = 4214 individuals; 162 developed dementia within 4 years.
http://www.three-city-study.com/baseline-characteristicsof-the-3c-cohort.php
  Ages:      65-74  55%     75-79  27%     80+  17%
  Education: Primary school 33%   High school 43%   Beyond HS 24%
Original data: 3650 men and 5644 women at the start.

Three City: Occupation and Income

  Occupation        Men (%)   Women (%)
  Senior exec          33        11
  Mid exec             24        18
  Office worker        14        38
  Skilled worker       29        17
  Housewife                      16

  Annual income
  >2300 Euro           45        24
  1000-2300            29        25
  750-1000             19        36
  <750                  2         8

The Predictive Model with the Marker

TABLE V. Logistic Model 1, including the genetic marker APOE4.

               Estimate   Std. Error   Pr(>|z|)
  (Intercept)  -2.944     0.176        < 2e-16
  age.fac.31   -2.089     0.330        2.3e-10
  age.fac.32   -0.984     0.191        2.5e-07
  Education    -0.430     0.180        0.0167
  Cardio        0.616     0.233        0.0081
  Depress       0.786     0.201        9.5e-05
  incap         1.180     0.206        1.1e-08
  APOE4         0.634     0.195        0.0012
  AIC = 1002.  Age group 1: 65 to <71; age group 2: 71 to <78.

The Predictive Model without the Marker

TABLE VI. Model 2: logistic model without the genetic marker APOE4.

               Estimate   Std. Error   p-value
  (Intercept)  -2.797     0.168        < 2e-16
  age.fac.31   -2.060     0.330        4.0e-10
  age.fac.32   -0.963     0.190        4.2e-07
  Education    -0.434     0.179        0.0155
  card          0.677     0.231        0.0034
  depress       0.805     0.201        6.2e-05
  incap         1.124     0.206        4.6e-08
  AIC = 1194.  Age group 1: 65 to <71; age group 2: 71 to <78.

The Bootstrap and Our Estimates

Sample and bootstrap estimates:

        Sample mean   Bootstrap mean   Sample std err   Bootstrap std err
  IDI   -0.00298      -0.00374         0.00303          0.00328
  BRI   -6.09e-05     -9.02e-05        0.000115         0.000129

Asymptotic and Bootstrap Confidence Intervals for IDI and BRI, Three City Study

  95% confidence interval for IDI:
    Asymptotic  (-0.008911, +0.002958)
    Bootstrap   (-0.012012, +0.000389)
  95% confidence interval for BRI:
    Asymptotic  (-2.87e-04, +1.65e-04)
    Bootstrap   (-4.08e-04, +9.03e-05)
Additional QQ normal plots were made for the bootstrap distribution.

Discussion and Summary

We have provided a probabilistic setting for inference on prediction scores, and asymptotic results for the discrimination and accuracy measures prevalent in the life sciences, for within-sample prediction.
In-sample prediction: the chosen models need not coincide with the true model. We obtain asymptotics for IDI and BRI. Convergence is slow for models whose true coefficients are small.
We considered predicting P[event] only. Different problems pop up when predicting a discrete probability distribution; our setting can certainly be used, but the scores have to be chosen carefully.
What about predicting ranks? (Sports,
internet search engines.)
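The IDI and Brier-score-difference (BRI) estimates reported above can be computed along the following lines. This is a minimal sketch, not the authors' exact procedure: the function names, the resample count, and the percentile-bootstrap choice are my own assumptions; only the IDI and BRI formulas are taken from the slides.

```python
import numpy as np

def idi(y, p1, p2):
    """IDI_{2/1} = E[p2 - p1 | Y=1] - E[p2 - p1 | Y=0] (model 2 vs. model 1)."""
    y = np.asarray(y, dtype=bool)
    d = np.asarray(p2, dtype=float) - np.asarray(p1, dtype=float)
    return d[y].mean() - d[~y].mean()

def bri(y, p1, p2):
    """BRI = BR(1) - BR(2): Brier score of model 1 minus that of model 2."""
    y = np.asarray(y, dtype=float)
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return np.mean((y - p1) ** 2) - np.mean((y - p2) ** 2)

def bootstrap_ci(y, p1, p2, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a score-comparison statistic (assumed choice)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    n = len(y)
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases with replacement
        if 0 < y[idx].sum() < n:             # keep resamples with both cases and controls
            reps.append(stat(y[idx], p1[idx], p2[idx]))
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

In the Three City application, `p1` and `p2` would be the fitted risks from the logistic models without and with APOE4; refitting both models on each bootstrap sample, as the slides' bootstrap presumably does, is omitted here for brevity.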