Evaluation of Probability Forecasts

Shulamith T Gross
City University of New York
Baruch College
Joint work with
Tze Leung Lai, David Bo Shen: Stanford University
Catherine Huber, Université René Descartes
Rutgers 4.30.2012
6/30/2016
1
Quality of Prediction
Repeated probability predictions are commonly
made in meteorology, banking and finance, and
medicine & epidemiology.
 Define: pi = true event probability
p’i = (model-)predicted future event probability
Yi = observed event indicator
 Prediction quality: calibration (accuracy);
discrimination (resolution, precision)

OUTLINE
 Evaluation of probability forecasts in
meteorology & banking (accuracy)
 Scores (Brier, Good), proper scores,
Winkler’s skill scores (with loss or utility)
L’n = n-1 Σ1≤i≤n (Yi – p’i)2
 Reliability diagrams: bin the data according to
predicted risk p’i and plot the observed
relative frequency of events in each bin
against the bin center.
OUTLINE 2: Measures of Predictiveness of models
 Medicine: scores are increasingly used.
 Epidemiology: curves (ROC, predictiveness)
 Plethora of indices of reliability/discrimination:
 AUC = P[p’1 > p’2 | Y1=1, Y2=0] (concordance;
one predictor)
 Net Reclassification Index (two predictors):
NRI = {P[p’’>p’|Y=1] - P[p’’>p’|Y=0]} - {P[p’’<p’|Y=1] - P[p’’<p’|Y=0]}
 Improved Discrimination Index (two predictors):
IDI = E[p’’-p’|Y=1] - E[p’’-p’|Y=0]
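All three indices can be computed directly from predicted risks. A minimal sketch with simulated data; the arrays `y`, `p1`, `p2` and the generating model below are hypothetical illustrations, not data from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
y = rng.binomial(1, 0.3, size=n)  # event indicators Y_i
# hypothetical risk predictions from an old (p1) and a new (p2) model
p1 = np.clip(0.3 + 0.2 * (y - 0.3) + rng.normal(0, 0.15, n), 0.01, 0.99)
p2 = np.clip(0.3 + 0.3 * (y - 0.3) + rng.normal(0, 0.15, n), 0.01, 0.99)

# AUC = P[p'_case > p'_control]: compare every case against every control
cases, controls = p2[y == 1], p2[y == 0]
auc = (cases[:, None] > controls[None, :]).mean()

# NRI: net proportion moving up among cases minus that among controls
up1 = np.mean(p2[y == 1] > p1[y == 1])
down1 = np.mean(p2[y == 1] < p1[y == 1])
up0 = np.mean(p2[y == 0] > p1[y == 0])
down0 = np.mean(p2[y == 0] < p1[y == 0])
nri = (up1 - down1) - (up0 - down0)

# IDI: mean improvement in predicted risk for cases minus that for controls
idi = (p2[y == 1] - p1[y == 1]).mean() - (p2[y == 0] - p1[y == 0]).mean()
```

Since `p2` tracks the outcome more closely than `p1` in this toy setup, both NRI and IDI come out positive.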
Evaluation of Probability Forecasts
Using Scoring Rules
Reliability, or precision, is measured
using “scoring rules“ (loss or utility):
L’n = n-1 Σ1≤i≤n L(Yi ,p’i)
 The first score, Brier (1950), used squared-error
loss; it remains the most commonly used.
 Review: Gneiting & Raftery (2007)
 Problem: very few inference tools are
available
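As a concrete illustration, the average Brier score L’n and the Good (1952) logarithmic score are one-liners; the simulated true probabilities and forecasts here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
p_true = rng.uniform(0.05, 0.95, n)  # p_i: true event probabilities
y = rng.binomial(1, p_true)          # Y_i: observed event indicators
# hypothetical slightly-noisy forecasts p'_i, kept away from 0 and 1
p_hat = np.clip(p_true + rng.normal(0, 0.05, n), 0.01, 0.99)

# Brier score: average squared-error loss L'_n
brier = np.mean((y - p_hat) ** 2)

# Good (logarithmic) score: average negative log-likelihood of the forecasts
good = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```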
Earlier Contributions
 Proper scores (using loss functions):
Ep[L(Y,p)] ≤ Ep[L(Y,p’)]
 Inference: tests of H0: pi = p’i for all i=1,…,n
 Cox (1958), in the model logit(pi) = b1 + b2 logit(p’i)
 Too restrictive: we need to estimate the true
score, not test for perfect prediction.
 Related work in econometrics: evaluation of
linear models, with emphasis on predicting Y;
Giacomini and White (Econometrica, 2006)
Statistical Considerations
 L’n = n-1 Σ1≤i≤n L(Yi ,p’i) attempts to estimate
the ‘population parameter’
Ln = n-1 Σ1≤i≤n L(pi ,p’i)
 Squared-error loss:
L(p,p’) = (p – p’)2
 Kullback-Leibler divergence (Good, 1952):
L(p,p’) = p log(p/p’) + (1-p) log( (1-p)/(1-p’))
 How good is L’n as an estimate of Ln?
Linear Equivalent Loss
 A loss function L~(p,p’) is a linear equivalent of
the loss function L(p,p’) if
I. it is a linear function of p, and
II. L(p,p’) - L~(p,p’) does not depend on p’.
 E.g.
L~(p,p’) = - 2p p’ + p’2
is a linear equivalent of the squared-error loss
L(p,p’) = (p – p’)2.
 A linear equivalent L~ of the Kullback-Leibler
divergence is given by the Good (1952) score
L~(p,p’) = - {p log(p’) + (1-p) log(1- p’)}
 L(p,p’) = |p-p’| has no linear equivalent
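Both defining properties are easy to check numerically. The sketch below verifies, for squared-error loss, that L(p,p’) - L~(p,p’) = p2 is free of p’, and that linearity in p makes the plug-in of Y ~ Bernoulli(p) unbiased; the values of `p` and `p_prime` are arbitrary:

```python
# squared-error loss and its linear equivalent
L = lambda p, q: (p - q) ** 2
Lt = lambda p, q: -2 * p * q + q ** 2

p, p_prime = 0.3, 0.45

# (II) L - Lt equals p^2, whatever p' is
for q in (0.1, 0.45, 0.9):
    assert abs((L(p, q) - Lt(p, q)) - p ** 2) < 1e-12

# key consequence of (I): for Y ~ Bernoulli(p), E[Lt(Y, p')] = Lt(p, p'),
# because Lt is linear in its first argument
e_loss = p * Lt(1.0, p_prime) + (1 - p) * Lt(0.0, p_prime)
assert abs(e_loss - Lt(p, p_prime)) < 1e-12
```

This unbiasedness is exactly what makes the martingale arguments on the following slides work.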
LINEAR EQUIVALENTS-2
 We allow each forecast p’k to depend on an
information set Fk-1 consisting of all event and forecast
histories and other covariates observed before Yk, as
well as the true p1,…,pk. The conditional distribution of
Yi given Fi-1 is Bernoulli(pi):
P(Yi = 1|Fi-1) = pi
 Dawid (1982, 1993) used martingales in this
hypothesis-testing context.
 Suppose L(p, p’) is linear in p, as in the case of linear
equivalents of general loss functions. Then
E[ L(Yi, p’i )| Fi-1] = L(pi,p’i),
so L(Yi,p’i) - L(pi,p’i) is a martingale difference
sequence with respect to {Fi-1}.
Theorem 1.
Suppose L(p, p’) is linear in p. Let
σ2n = n-1 Σ1≤ i ≤n {L(1,p’i) - L(0,p’i)}2 pi(1- pi).
 If σ2n converges in probability to some non-random positive
constant, then √n (L’n - Ln)/σn has a limiting standard normal
distribution.
 We can set confidence intervals for the ‘true’ score Ln if we can
estimate pi(1- pi). The upper bound ¼ leads to a conservative CI.
 Alternatively: create buckets (bins) of cases with close
event-probability values (banking) or use categorical
covariates (epidemiology).
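A sketch of the resulting conservative CI, using the linear equivalent of the squared-error loss, L~(p,p’) = -2pp’ + p’2 (so L(1,p’i) - L(0,p’i) = -2p’i) and the bound pi(1-pi) ≤ ¼; the data-generating model is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
p_true = rng.uniform(0.1, 0.9, n)
y = rng.binomial(1, p_true)
p_hat = np.clip(p_true + rng.normal(0, 0.05, n), 0.01, 0.99)

# L'_n under the linear equivalent Lt(p, p') = -2 p p' + p'^2
score = np.mean(-2 * y * p_hat + p_hat ** 2)

# d_i = Lt(1, p'_i) - Lt(0, p'_i) = -2 p'_i
d = -2 * p_hat
# conservative variance: replace p_i(1 - p_i) by its upper bound 1/4
sigma2_cons = np.mean(d ** 2 * 0.25)
half = 1.96 * np.sqrt(sigma2_cons / n)          # CI half-width from Theorem 1
ci = (score - half, score + half)

# the 'true' score L_n, available here because p_true is simulated
true_score = np.mean(-2 * p_true * p_hat + p_hat ** 2)
```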
Difference of Scores to replace Skill Scores
 When the score has a linear equivalent, the score
difference
D’n = n-1Σ1≤ i ≤ n {L(Yi,p’’i) - L(Yi,p’i)}
is linear in pi in conditional expectation, and D’n - Dn is a
martingale average w.r.t. {Fn}; the bias term pi(1- pi) does
not depend on p’i or p’’i and so cancels in the difference.
 Theorem 2. Suppose the score L has a linear equivalent,
and let
Dn = n-1Σ1 ≤ i ≤ n {L(pi,p’’i) - L(pi,p’i)},
di = L(1,p’’i) - L(0,p’’i) – [L(1, p’i) - L(0, p’i)],
s2n = n-1 Σ1≤ i ≤n di2 pi(1- pi).
If s2n converges in probability to a positive
constant, then √n(D’n- Dn)/sn converges in law to
N(0,1). If the score has no linear equivalent, the
theorem holds with
Dn = n-1Σ1≤ i ≤n {di pi + [L(0,p’’i) - L(0,p’i)]}
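For the Brier score, L(1,p) - L(0,p) = 1 - 2p, so di = -2(p’’i - p’i), and a conservative interval for the score difference follows just as in Theorem 1; the two simulated forecasters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1500
p_true = rng.uniform(0.1, 0.9, n)
y = rng.binomial(1, p_true)
# two hypothetical forecasters: p2 is sharper than p1
p1 = np.clip(p_true + rng.normal(0, 0.15, n), 0.01, 0.99)
p2 = np.clip(p_true + rng.normal(0, 0.05, n), 0.01, 0.99)

brier = lambda y, p: (y - p) ** 2
Dn_hat = np.mean(brier(y, p2) - brier(y, p1))   # D'_n (negative favors p2)

# d_i = {L(1,p''_i)-L(0,p''_i)} - {L(1,p'_i)-L(0,p'_i)} = -2 (p''_i - p'_i)
d = -2 * (p2 - p1)
s2_cons = np.mean(d ** 2 * 0.25)   # conservative: p_i(1 - p_i) <= 1/4
half = 1.96 * np.sqrt(s2_cons / n)
ci = (Dn_hat - half, Dn_hat + half)
```

Note that for the Brier score the Yi2 terms cancel in the difference, so D’n is automatically linear in Yi.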
A typical application in climatology is the following, in which pt
is the predicted probability of precipitation t days ahead. The
confidence intervals, centered at BRk – BRk-1, are given in the
table below.

            BR1    Δ(2)    Δ(3)    Δ(4)    Δ(5)    Δ(6)    Δ(7)
Queens,     .125   .021    .012    .020    .010    .015    .007
NY                 ±.010   ±.011   ±.012   ±.011   ±.011   ±.010
Jefferson   .159   .005    .005    .007    .024    .000    .008
City, MO           ±.010   ±.011   ±.011   ±.010   ±.010   ±.008

Brier scores BR1 and conservative 95% confidence intervals
for Δ(k) = BRk – BRk-1.
Additional Results Useful for Variance Estimation
and Binning
 If a better estimate of pi(1-pi) is needed, use binning or
bucketing of the data.
 Notation:
 Predictions are made at times t in 1:T.
 At time t, there are Jt buckets of predictions.
 Bucket size: njt
 p(j) = true probability common to all i in bucket j,
for j in 1:Jt.
 In statistical applications, buckets would be formed
by combinations of categorical variables.
Buckets for One or Two Scores
Theorem 3. (for constant-probability buckets)
Suppose njt ≥ 2 for all j in 1:Jt and pi = pjt for all i in Ijt.
Then:
 Under the conditions of Theorem 1,
σ’2n - σ2n = op(1), where
σ’2n = n-1 Σtє1:T, jє1:Jt, iєIjt {L(1,p’i) - L(0,p’i)}2 v’2t(j)
and v’2t(j) is the sample variance of the Y’s in
bucket Ijt.
 Under the conditions of Theorem 2,
s’2n - s2n = op(1), where
s’2n = n-1 Σtє1:T, jє1:Jt, iєIjt di2 v’2t(j)
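In a constant-probability bucket, the sample variance of the Y’s is the natural estimate of p(j)(1 - p(j)). A small numerical check, with hypothetical bucket probabilities:

```python
import numpy as np

rng = np.random.default_rng(4)
# hypothetical constant-probability buckets: all cases in bucket j share p(j)
p_bucket = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
n_jt = 30   # bucket size (>= 2, as Theorem 3 requires)

var_true, var_hat = [], []
for pj in p_bucket:
    y = rng.binomial(1, pj, size=n_jt)
    var_true.append(pj * (1 - pj))     # p(j)(1 - p(j))
    var_hat.append(y.var(ddof=1))      # sample variance v'^2_t(j)

var_true = np.array(var_true)
var_hat = np.array(var_hat)
```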
Notations for Quasi-Buckets (bins)
Bins B1, B2, … , BJt; usually Jt = J.
 Quasi-buckets: a bin defined by grouping cases
on their predicted probability p’ (not on the true p):
Ijt = {k : p’k,t є Bj}, i = (k,t), nj = Σt njt
 Recall, simple buckets: pi for all i є Ijt are equal.
 Ybart(j) = Σi єIjt Yi /njt
 Ybar(j) = Σtє1:T Σi єIjt Yi /nj (quasi-bucket j)
 pbar(j) = Σtє1:T Σi єIjt pi /nj
 v’(j) = Σt є 1:T njt v’t(j) /nj, the estimated variance in bin
j, where
 v’t(j) = Ybart(j) (1- Ybart(j) ) (njt / (njt - 1))
Quasi Buckets and the Reliability Diagram
Application of quasi-buckets:
 The reliability diagram = graphical display of
calibration: % of cases in a bin vs. the bin center.
 Bin = all subjects within the same decile (Bj) of
predicted risk.
 Note: bin Ijt is measurable w.r.t. Ft-1.
 Define, for one and two predictors:
 σ’2n = n-1Σt є1:T ΣjєJt Σ iєIjt {L(1,p’i) - L(0,p’i)}2 (Yi-Ybart(j))2 njt/(njt-1)
 s’2n = n-1Σt є1:T ΣjєJt Σ iєIjt di2 (Yi-Ybart(j))2 njt/(njt-1)
Note the difference from σ2n and s2n in Theorems 1 and 2:
pi(1-pi) for all i in the same bin is estimated by a squared distance
from a single bin mean. Since the pi’s vary within the bin, a
wrong centering is used, leading to Theorem 5.
The simulation
 t = 1, 2; Jt = 5
 Bin(j) = ((j-1)/5, j/5), j = 1,…,5
 True probabilities: pjt ~ Unif(Bin(j))
 nj = 30 for j = 1,…,5
 Simulate:
1. a table of true and estimated bucket
variances
2. a reliability diagram using confidence
intervals from Theorem 5
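A minimal version of this simulation (for simplicity the two time points are pooled within each bin when estimating the variance; pooling across the two different pjt inflates the estimate, i.e. errs in the conservative direction of Theorem 5):

```python
import numpy as np

rng = np.random.default_rng(5)
J, n_jt, T = 5, 30, 2   # 5 bins, 30 cases per bin and time, t = 1, 2

ybar, pbar, v_true, v_hat = [], [], [], []
for j in range(1, J + 1):
    ys, ps = [], []
    for t in range(T):
        pjt = rng.uniform((j - 1) / J, j / J)   # true probability ~ Unif(Bin(j))
        ys.append(rng.binomial(1, pjt, size=n_jt))
        ps.append(np.full(n_jt, pjt))
    y, p = np.concatenate(ys), np.concatenate(ps)
    ybar.append(y.mean())                # Ybar(j): observed event frequency
    pbar.append(p.mean())                # pbar(j): average true probability
    v_true.append(np.mean(p * (1 - p)))  # v(j)
    v_hat.append(y.var(ddof=1))          # pooled sample variance (>= within-t)

# a reliability diagram plots ybar[j] against the bin centers
centers = (2 * np.arange(1, J + 1) - 1) / (2 * J)
```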
Simulated Reliability Diagram
Theorem 5 for quasi-buckets
 Assume that bucket Ijt at time t is Ft-1
measurable (j in 1:Jt).
 Then, under the assumptions of Theorem 1,
σ’2n ≥ σ2n + o(1) a.s.,
with equality if all p’s in a bucket are equal.
 Under the assumptions of Theorem 2,
s’2n ≥ s2n + o(1) a.s.,
with equality if all p’s in a bucket are equal.
Quasi-buckets Theorem 5 continued
 If nj/n converges in probability to a positive
constant, and
v(j) = Σt є 1:T Σi є Ijt pi(1-pi) /nj,
then (nj/v(j))1/2 (Ybar(j) – pbar(j)) converges
in distribution to N(0,1), and
v’(j) ≥ v(j) + op(1).
Simulation results for a 5 bucket example
           Min     Q1      Q2      Q3      Max     Mean    SD
pbar(2)    0.101   0.300   0.300   0.355   0.515   0.320   0.049
Ybar(2)    0.050   0.267   0.317   0.378   0.633   0.319   0.089
v(2)       0.087   0.207   0.207   0.207   0.247   0.209   0.015
v’(2)      0.048   0.208   0.221   0.239   0.259   0.213   0.034
pbar(5)    0.769   0.906   0.906   0.906   0.906   0.895   0.026
Ybar(5)    0.733   0.867   0.900   0.933   1.000   0.892   0.049
v(5)       0.082   0.082   0.082   0.082   0.164   0.088   0.016
v’(5)      0.077   0.084   0.093   0.120   0.202   0.096   0.037
Buckets help resolve bias problem
 Bias for the Brier score, with L(p, p’) = (p-p’)2:
E[L’n – Ln|Fn-1 ] = n-1Σ1≤ i ≤ n pi(1-pi)
 Using the within-bucket sample variances we
 estimate the bias,
 fix the variance of the resulting estimate,
 obtain asymptotic normality and
estimates of the new asymptotic variance.
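A sketch of the bias correction for the Brier score under hypothetical constant-probability buckets: estimate n-1Σ pi(1-pi) by the bucket-weighted sample variances and subtract it from the raw score.

```python
import numpy as np

rng = np.random.default_rng(6)
bucket_p = [0.2, 0.4, 0.6, 0.8]   # hypothetical constant-probability buckets
m = 50                            # cases per bucket
n = m * len(bucket_p)

y_parts, phat_parts = [], []
bias_hat = 0.0
for pj in bucket_p:
    y = rng.binomial(1, pj, m)
    phat = np.clip(pj + rng.normal(0, 0.05, m), 0.01, 0.99)
    y_parts.append(y)
    phat_parts.append(phat)
    bias_hat += m * y.var(ddof=1) / n   # within-bucket sample variance

y_all = np.concatenate(y_parts)
phat_all = np.concatenate(phat_parts)
p_all = np.repeat(bucket_p, m)

brier_raw = np.mean((y_all - phat_all) ** 2)   # L'_n: overshoots L_n on average
brier_adj = brier_raw - bias_hat               # debiased estimate of L_n
Ln = np.mean((p_all - phat_all) ** 2)          # true score (known by simulation)
```

Here the raw score is dominated by the irreducible Bernoulli noise, and subtracting the estimated bias moves it close to the true Ln.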
6/30/2016
22
Indices of reliability and discrimination in
epidemiology
Our work applies to ‘in the sample prediction’, e.g.:
 A ‘best’ model based on standard covariates for
predicting disease development.
 A genetic or other marker becomes available.
Models are fit on a ‘training’ sample; the estimated models
are used to predict disease development in a test sample.
 The marker improves model fit significantly.
 Does it sufficiently improve prediction of disease to
warrant the expense?
Goodness of prediction in Epidemiology
 Discrimination: compare the predictive
prowess of two models by how well model 2 assigns
higher risk to cases and lower risk to controls, compared
to model 1.
 Pencina et al. (2008, Stat in Med) call these indices
IDI = Improved Discrimination Index
NRI = Net Reclassification Index
 Note: both in-sample and out-of-sample prediction arise here.
 Much related work: Gu & Pepe (2009); Uno et al.
(Biometrics, 2011)
Current work and simulations
The martingale approach applies to “out of the sample”
evaluation of indices like NRI, IDI, AUC, and R2
differences.
 For epidemiological applications (in-sample model
comparison for prediction), assuming:
 (Y, Z) iid (response indicator, covariate vector);
 neither model is necessarily the true model;
 the models differ by one or more covariates;
 the larger model is significantly better than the smaller model;
 logistic or similar models, with the parameters of both
models estimated by MLE;
we proved asymptotic normality, and provided variance
estimates, for the IDI and the Brier score difference:
IDI2/1 = E[p’’-p’|Y=1] - E[p’’-p’|Y=0]
BRI = BR(1) - BR(2)
On the Three City Study
 Cohort study.
 Purpose: identify variables that help predict dementia in
people over 65.
 Our sample: n = 4214 individuals; 162 developed dementia
within 4 years.
 http://www.three-city-study.com/baseline-characteristics-of-the-3c-cohort.php
 Ages: 65-74: 55%; 75-79: 27%; 80+: 17%
 Education: primary school: 33%; high school: 43%; beyond HS: 24%
 Original data, by gender: 3650 men and 5644 women at the start.
Three City Occupation & Income
Occupation        Men (%)   Women (%)
SENIOR EXEC        33        11
MID EXEC           24        18
OFFICE WORKER      14        38
SKILLED WORKER     29        17
HOUSEWIFE          -         16

Annual Income     Men (%)   Women (%)
>2300 Euro         45        24
1000-2300          29        25
750-1000           19        36
<750                2         8
The predictive model w Marker
TABLE V: LOGISTIC MODEL 1, INCLUDING THE GENETIC MARKER APOE4.

             Estimate   Std. Error   Pr(>|z|)
(Intercept)  -2.944     0.176        < 2e-16
age.fac.31   -2.089     0.330        2.3e-10
age.fac.32   -0.984     0.191        2.5e-07
Education    -0.430     0.180        0.0167
Cardio        0.616     0.233        0.0081
Depress       0.786     0.201        9.5e-05
incap         1.180     0.206        1.1e-08
APOE4         0.634     0.195        0.0012

AIC = 1002
Age1: 65 to <71; Age2: 71 to <78
The predictive model w/o Marker
TABLE VI: MODEL 2, LOGISTIC MODEL WITHOUT THE GENETIC MARKER APOE4.

             Estimate   Standard Error   p-value
(Intercept)  -2.797     0.168            < 2e-16
age.fac.31   -2.060     0.330            4.0e-10
age.fac.32   -0.963     0.190            4.2e-07
Education    -0.434     0.179            0.0155
card          0.677     0.231            0.0034
depress       0.805     0.201            6.2e-05
incap         1.124     0.206            4.6e-08

AIC = 1194
Age1: 65 to <71; Age2: 71 to <78
The Bootstrap and Our Estimates
SAMPLE AND BOOTSTRAP ESTIMATES FOR IDI:

Sample Mean   Bootstrap Mean   Sample Std Err   Bootstrap Std Err
-0.00298      -0.00374         0.00303          0.00328

SAMPLE AND BOOTSTRAP ESTIMATES FOR BRI:

Sample Mean   Bootstrap Mean   Sample Std Err   Bootstrap Std Err
-6.09e-05     -9.02e-05        0.000115         0.000129
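A simplified sketch of the nonparametric bootstrap for the IDI: resample cases with replacement and recompute the index. (In the actual analysis both logistic models would be refit on each resample; that step, and the data below, are hypothetical simplifications.)

```python
import numpy as np

rng = np.random.default_rng(7)
n = 800
y = rng.binomial(1, 0.25, n)
# hypothetical risk predictions from the models without / with a marker
p_old = np.clip(0.25 + 0.15 * (y - 0.25) + rng.normal(0, 0.1, n), 0.01, 0.99)
p_new = np.clip(0.25 + 0.20 * (y - 0.25) + rng.normal(0, 0.1, n), 0.01, 0.99)

def idi(y, p1, p2):
    """IDI = E[p''-p'|Y=1] - E[p''-p'|Y=0], estimated by sample means."""
    return (p2[y == 1] - p1[y == 1]).mean() - (p2[y == 0] - p1[y == 0]).mean()

idi_hat = idi(y, p_old, p_new)

B = 500
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)   # resample subjects with replacement
    boot[b] = idi(y[idx], p_old[idx], p_new[idx])

se_boot = boot.std(ddof=1)                     # bootstrap standard error
ci_boot = np.percentile(boot, [2.5, 97.5])     # percentile 95% CI
```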
Asymptotic and Bootstrap
Confidence Intervals for IDI and
BRI – 3 CITY STUDY
95% confidence interval for IDI:
 Asymptotic: (-0.008911, +0.002958)
 Bootstrap: (-0.012012, +0.000389)

95% confidence interval for BRI:
 Asymptotic: (-2.87e-04, +1.65e-04)
 Bootstrap: (-4.08e-04, +9.03e-05)

Additional QQ normal plots for the bootstrap distribution.
Discussion and Summary
 We have provided a probabilistic setting for inference on
prediction scores, and asymptotic results for the
discrimination and accuracy measures prevalent in the life
sciences, for within-sample prediction.
 In-sample prediction: the chosen models need not
coincide with the true model. We obtain asymptotics for IDI and
BRI. Convergence is slow for models whose true coefficients are
small.
 We considered predicting P[event] only. Different problems arise
in predicting a full discrete probability distribution; our setting
can certainly be used, but the scores have to be carefully chosen.
 What about predicting ranks? (Sports; internet search engines.)