 

The "Area Under the Curve" (AUC) Criterion in Classification Contests
Consider a two-class classification model, where $y \in \{0,1\}$ and the conditional distributions for $x$ given $y$ are $F_0$ and $F_1$. One evaluation criterion popular in analytics contests is this. Contestants are required to order test cases from least to most likely to have come from distribution $F_1$. Then conceptually, an empirical "ROC" curve is made, where for $M$ test cases with $M_0$ actual "$y=0$" cases and $M_1 = M - M_0$ actual "$y=1$" cases, one plots the $M_0 + 1$ points

$$\left( \frac{M_0 - j}{M_0},\, \hat{p}_{1j} \right) \quad \text{for } j = M_0, M_0 - 1, \ldots, 1, 0$$
where $\hat{p}_{1j}$ is the fraction of the actual $y = 1$ cases judged "more likely '1'" than the "$j$th least likely" $y = 0$ case. (If the test cases are arranged left to right as judged least to most likely to be "1", $\hat{p}_{1j}$ is the fraction of actual "1" cases to the right of the $j$th rightmost "0" case.) One makes a step function from the plotted points and then computes the area under the associated curve. This is used as a figure of merit to judge classification efficacy.
Another way to represent this AUC figure of merit, one that helps make obvious what ordering of cases is theoretically optimal, is

$$\text{AUC} = \frac{1}{M_0} \sum_{j=1}^{M_0} \hat{p}_{1j} \qquad (*)$$
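To make the definition concrete, here is a small sketch (with made-up scores and labels, not data from any actual contest) that computes the empirical AUC directly from the $\hat{p}_{1j}$, exactly as in (*):

```python
import numpy as np

def empirical_auc(scores, labels):
    """Empirical AUC via (*): the average, over the actual '0' cases, of
    p_hat_{1j} = fraction of actual '1' cases scored strictly higher."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    s0 = scores[labels == 0]  # scores of actual y = 0 cases
    s1 = scores[labels == 1]  # scores of actual y = 1 cases
    p_hat = [(s1 > t).mean() for t in s0]  # one p_hat_{1j} per '0' case
    return float(np.mean(p_hat))

# hypothetical scores: '1' cases tend to score higher than '0' cases
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = rng.normal(loc=labels.astype(float))
auc = empirical_auc(scores, labels)
```

Averaging $\hat{p}_{1j}$ over the $M_0$ "0" cases is the same arithmetic as the pairwise (Mann-Whitney) count $\frac{1}{M_0 M_1}\sum I[s_1 > s_0]$ over all "1"-"0" pairs.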
Natural questions are then 1) to what theoretical figure of merit does the empirical AUC correspond, and 2) what is a theoretically optimal ordering for this theoretical figure of merit? To investigate these questions, consider an ordering built on some statistic $S(x)$. (That is, the ordering applied to the test cases with inputs $x_l$ will be the numerical ordering of the numbers $S(x_l)$.) Let $G_0$ and $G_1$ be respectively the $F_0$ and $F_1$ distributions of $S(x)$. Then, corresponding to the empirical AUC (*) is the (functional of $S$)

$$\text{IP}(S) = \int \left( 1 - G_1(t) \right) dG_0(t) \qquad (**)$$
And (for what it is worth) in the event that $G_0(t)$ is continuous and increasing (and thus has an inverse) this is

$$\text{IP}(S) = \int_0^1 \left( 1 - G_1\!\left( G_0^{-1}(u) \right) \right) du$$
Notice that if for each $t$ one builds from $S$ a classifier of the form

$$a_t(x) = I\left[ S(x) > t \right]$$

the integrand in (**) is the power of the test/classifier as a function of $t$, and $\text{IP}(S)$ is an average (according to the $F_0$ distribution of $S(x)$ on $t$) power (is an "integrated power").
Let $\alpha(t)$ be the Type I error rate of $a_t(x)$, i.e.

$$\alpha(t) = \mathrm{E}_0\, a_t(x) = P_0\left[ S(x) > t \right]$$

and $\beta(t)$ be the Type II error rate of $a_t(x)$,

$$\beta(t) = 1 - \mathrm{E}_1\, a_t(x) = P_1\left[ S(x) \le t \right]$$
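As a numerical sanity check on the "integrated power" reading of (**), here is a Monte Carlo sketch (assuming, purely for illustration, a normal score model $G_0 = N(0,1)$, $G_1 = N(1,1)$, which is not part of the original argument). It estimates $\mathrm{E}_{t \sim G_0}[1 - G_1(t)]$ and compares it with the closed form $P(S_1 > S_0) = \Phi(1/\sqrt{2})$ available for this model:

```python
import numpy as np
from math import erf

rng = np.random.default_rng(1)
n = 20000
s0 = rng.normal(0.0, 1.0, n)  # draws of S(x) under F0, so G0 = N(0,1)
s1 = rng.normal(1.0, 1.0, n)  # draws of S(x) under F1, so G1 = N(1,1)

# IP(S) = integral of (1 - G1(t)) dG0(t) = E_{t ~ G0}[ 1 - G1(t) ],
# estimated with the empirical G1 evaluated at the draws from G0
s1_sorted = np.sort(s1)
ip_mc = float(np.mean(1.0 - np.searchsorted(s1_sorted, s0, side="right") / n))

# For this normal model S1 - S0 ~ N(1, 2), so
# IP(S) = P(S1 > S0) = Phi(1/sqrt(2)) = 0.5 * (1 + erf(1/2))
ip_exact = 0.5 * (1.0 + erf(0.5))
```

The two numbers agree to Monte Carlo accuracy, which is exactly the sense in which the empirical AUC estimates the theoretical integrated power.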
Another representation of $\text{IP}(S)$ (the theoretical figure of merit for $S(x)$) is this. As $t$ runs from $-\infty$ to $\infty$ the points

$$\left( \alpha(t),\, 1 - \beta(t) \right) \qquad (***)$$

trace out a (theoretical Receiver Operating Characteristic) curve in $[0,1]^2$ (the theoretical version of the hypothetical step function made from ordered cases in order to compute the empirical AUC). The ordinary integral over $[0,1]$ of the function defined by that parametric curve is $\text{IP}(S)$, and therefore the "higher" that parametric curve, the larger is $\text{IP}(S)$.
But consider the convex body in $[0,1]^2$ defined by all pairs $(\alpha, 1-\beta)$ corresponding to possible classifiers/tests (we may need to allow randomization here). (This is a reflection of the set of all points $(\alpha, \beta)$ comprising the 0-1 loss risk set of all possible classifiers/tests.) The upper boundary of that convex body (which corresponds to the lower boundary of the risk set) comes from Bayes classifiers/tests. It is guaranteed to lie "above" (at least as high as) the parametric curve defined in display (***). But the form of Bayes (and Neyman-Pearson) tests/classifiers is well-known. For $p \in (0,1)$ and $f_0$ and $f_1$ densities for the respective distributions $F_0$ and $F_1$, a Bayes optimal classifier under 0-1 loss for prior probability $p$ that $y = 1$ is

$$I\left[ p f_1(x) > (1-p) f_0(x) \right] = I\left[ \frac{f_1(x)}{f_0(x)} > \frac{1-p}{p} \right]$$
and this brings us to what is in retrospect a perfectly obvious conclusion.
Any $S$ that is a monotone increasing transformation of the likelihood ratio

$$l(x) = \frac{f_1(x)}{f_0(x)}$$

(is equivalent to the likelihood ratio) will optimize $\text{IP}(S)$. This is all perfectly sensible. After all, it is well-known that $l(x)$ is minimal sufficient in the two-distribution statistical model $\{F_0, F_1\}$.
And Neyman-Pearson theory/Bayes theory says that all good tests/classifiers are defined in terms of the likelihood ratio. No sensible figure of merit would make any other choice of $S(x)$ look better than $l(x)$.
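A quick simulation illustrates the conclusion (a sketch using a hypothetical two-normal model, $f_0 = N(0,1)$ and $f_1 = N(1,1)$, for which $l(x) = e^{x - 1/2}$): every monotone increasing transform of $l$ induces the identical ordering of test cases and hence the identical empirical AUC.

```python
import numpy as np

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=500)
x = rng.normal(loc=labels.astype(float))  # x | y=0 ~ N(0,1), x | y=1 ~ N(1,1)

def auc(scores, labels):
    # pairwise (Mann-Whitney) form of the empirical AUC
    s0, s1 = scores[labels == 0], scores[labels == 1]
    return float((s1[:, None] > s0[None, :]).mean())

l = np.exp(x - 0.5)  # likelihood ratio f1(x)/f0(x) for this model
# monotone increasing transforms of l induce the identical ordering,
# so each one yields exactly the same empirical AUC
transforms = [np.log(l), l**3, np.arctan(l)]
aucs = [auc(S, labels) for S in transforms]
```

Any strictly increasing map preserves every pairwise comparison $s_1 > s_0$, which is all the AUC depends on.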
As one other small point, note that for any $p \in (0,1)$, in the Bayes model the conditional probability that $y = 1$ is

$$\frac{p f_1(x)}{(1-p) f_0(x) + p f_1(x)} = \frac{p\, l(x)}{(1-p) + p\, l(x)}$$

and the conditional probability that $y = 1$ given $x$ produces the same theoretical integrated power and the same empirical ordering of test cases and corresponding empirical AUC as the likelihood ratio statistic.
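This last point is easy to check numerically (a sketch with hypothetical positive likelihood-ratio values, not tied to any particular model): the map $l \mapsto p\,l / ((1-p) + p\,l)$ is increasing in $l$ for every fixed $p \in (0,1)$, so sorting by the posterior probability reproduces the sorting by $l$.

```python
import numpy as np

rng = np.random.default_rng(3)
l = rng.lognormal(size=100)  # hypothetical positive likelihood-ratio values

def posterior(l, p):
    # conditional probability that y = 1 given x, written in terms of l(x)
    return p * l / ((1 - p) + p * l)

# for every prior p, the posterior orders the cases exactly as l does
orders = [np.argsort(posterior(l, p)) for p in (0.1, 0.5, 0.9)]
base = np.argsort(l)
```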