The "Area Under the Curve" (AUC) Criterion in Classification Contests Consider a two-class classification model, where y 0,1 and conditional distributions for x given y are F0 and F1 . One evaluation criterion popular in analytics contests is this. Contestants are required to order test cases from least to most likely to have come from to have come from distribution F1 . Then conceptually, an empirical "ROC" curve is made, where for M test cases with M 0 actual " y 0 " cases and M1 M M 0 actual " y 1 " cases, one plots M 0 1 points M0 j , pˆ1 j for j M 0 , M 0 1, ,1, 0 M0 where pˆ1 j the fraction of the actual y 1 cases judged "more likely '1'" than the "jth least likely" y 0 case (If the test cases are arranged left to right as judged least to most likely to be "1", pˆ1 j is the fraction of actual "1" cases to the right of the jth rightmost "0" case.) One makes a step function from the plotted points and then computes the area under the associated curve. This is used as a figure of merit to judge classification efficacy. Another way to represent this AUC figure of merit that helps make obvious what ordering of cases is theoretically optimal is AUC 1 M0 M0 pˆ j 1 (*) 1j Natural questions are then 1) to what theoretical figure of merit does the empirical AUC correspond and 2) what is a theoretically optimal ordering for this theoretical figure of merit. To investigate these questions, consider an ordering built on some statistic S x . (That is the ordering applied to the test cases with inputs xl will be the numerical ordering of the numbers S xl .) Let G0 and G1 be respectively the F0 and F1 distributions of S x . Then, corresponding to the empirical AUC (*) is the (functional of S ) IP S 1 G1 t dG0 t (**) 1 And (for what it is worth) in the event that G0 t is continuous and increasing (and thus has an inverse) this is IP S 1 G1 G01 u du 1 0 Notice that if for each t one builds from S a classifier of the form at x I S x t the integrand in (**) is the power of the test/classifier as a function of t and IP S is an average (according to the F0 distribution of S x on t ) power (is an "integrated power"). Let t be the Type I error rate of at x , i.e. t E 0 at x P0 S x t and t be the Type II error rate of at x , t 1 E1at x P1 S x t Another representation of IP S (the theoretical figure of merit for S x ) is this. As t runs from to the points t ,1 t (***) trace out a (theoretical Receiver Operating Characteristic) curve in 0,1 (the theoretical version 2 of the hypothetical step function made from ordered cases in order to compute the empirical AUC). The ordinary integral over 0,1 of the function defined by that parametric curve is IP S , and therefore the "higher" that parametric curve, the larger is IP S . But consider the convex body in 0,1 defined by all pairs ,1 corresponding to possible 2 classifiers/tests (we may need to allow randomization here). (This is a reflection of the set of all points , comprising the 0-1 loss risk set of all possible classifiers/tests.) The upper boundary of that convex body (that corresponds to the lower boundary of the risk set) comes 2 from Bayes classifiers/tests. It is guaranteed to lie "above" (at least as high) as the parametric curve defined in display (***). But the form of Bayes (and Neyman-Pearson) tests/classifiers is well-known. 
For $p \in (0,1)$ and $f_0$ and $f_1$ densities for the respective distributions $F_0$ and $F_1$, a Bayes optimal classifier under 0-1 loss for prior probability $p$ that $y = 1$ is

$$f(x) = I\left[ p f_1(x) \ge (1-p) f_0(x) \right] = I\left[ f_1(x) \ge \frac{1-p}{p}\, f_0(x) \right]$$

and this brings us to what is in retrospect a perfectly obvious conclusion. Any $S$ that is a monotone increasing transformation of the likelihood ratio

$$l(x) = \frac{f_1(x)}{f_0(x)}$$

(i.e., is equivalent to the likelihood ratio) will optimize $IP(S)$.

This is all perfectly sensible. After all, it is well known that $l(x)$ is minimal sufficient in the two-distribution statistical model $\{F_0, F_1\}$. And Neyman-Pearson theory/Bayes theory says that all good tests/classifiers are defined in terms of the likelihood ratio. No sensible figure of merit would make any other choice of $S(x)$ look better than $l(x)$.

As one other small point, note that for any $p \in (0,1)$, in the Bayes model the conditional probability that $y = 1$ is

$$P\left[ y = 1 \mid x \right] = \frac{p f_1(x)}{(1-p) f_0(x) + p f_1(x)} = \frac{p\, l(x)}{(1-p) + p\, l(x)}$$

a monotone increasing function of $l(x)$, and so the conditional probability that $y = 1$ given $x$ produces the same theoretical integrated power and the same empirical ordering of test cases and corresponding empirical AUC as the likelihood ratio statistic.
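As a concrete illustration of this conclusion, here is a small Monte Carlo sketch (Python). The particular choices $F_0 = \mathrm{N}(0,1)$, $F_1 = \mathrm{N}(1, 2^2)$, and prior $p = 0.3$ are assumptions made only for the example, not taken from the notes; they are chosen so that $l(x)$ is not monotone in $x$, so that ordering by $x$ and ordering by $l(x)$ genuinely differ. The estimate uses the fact that, for continuous $G_0$ and $G_1$, $IP(S) = P\left( S(X_1) > S(X_0) \right)$ for independent $X_0 \sim F_0$ and $X_1 \sim F_1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Illustrative (assumed) distributions: F0 = N(0, 1), F1 = N(1, 2^2), chosen so
# that the likelihood ratio l(x) = f1(x)/f0(x) is NOT monotone in x.
def f0(x):
    return norm_pdf(x, 0.0, 1.0)

def f1(x):
    return norm_pdf(x, 1.0, 2.0)

def lik_ratio(x):
    return f1(x) / f0(x)

def posterior(x, p=0.3):
    # P(y = 1 | x) under the (assumed) prior p; a monotone increasing function of l(x)
    return p * f1(x) / (p * f1(x) + (1.0 - p) * f0(x))

def integrated_power(S, n=200_000):
    """Monte Carlo estimate of IP(S) = integral of (1 - G1(t)) dG0(t).

    For continuous G0 and G1 this equals P(S(X1) > S(X0)) for independent
    X0 ~ F0 and X1 ~ F1, which is what is averaged here.
    """
    x0 = rng.normal(0.0, 1.0, n)   # draws from F0
    x1 = rng.normal(1.0, 2.0, n)   # draws from F1
    return np.mean(S(x1) > S(x0))

print("IP for S(x) = x        :", integrated_power(lambda x: x))
print("IP for S(x) = l(x)     :", integrated_power(lik_ratio))
print("IP for S(x) = P(y=1|x) :", integrated_power(posterior))
# The last two agree up to Monte Carlo error, and both exceed the first,
# illustrating that any monotone increasing transform of l(x) optimizes IP(S).
```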