online learning with structured experts – a biased survey

Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona

on-line prediction

A repeated game between forecaster and environment. At each round $t$:
the forecaster chooses an action $I_t \in \{1,\dots,N\}$ (actions are often called experts);
the environment chooses losses $\ell_t(1),\dots,\ell_t(N) \in [0,1]$;
the forecaster suffers loss $\ell_t(I_t)$.
The goal is to minimize the regret
$$R_n = \sum_{t=1}^n \ell_t(I_t) - \min_{i \le N} \sum_{t=1}^n \ell_t(i).$$

simplest example

Is it possible to make $(1/n)R_n \to 0$ for all loss assignments? Not with a deterministic forecaster. Let $N=2$ and define, for all $t=1,\dots,n$,
$$\ell_t(1) = \begin{cases} 0 & \text{if } I_t = 2 \\ 1 & \text{if } I_t = 1 \end{cases} \qquad \text{and} \qquad \ell_t(2) = 1 - \ell_t(1).$$
Then $\sum_{t=1}^n \ell_t(I_t) = n$ and $\min_{i=1,2} \sum_{t=1}^n \ell_t(i) \le n/2$, so $(1/n)R_n \ge 1/2$.

randomized prediction

Key to the solution: randomization. At time $t$, the forecaster chooses a probability distribution $p_{t-1} = (p_{1,t-1},\dots,p_{N,t-1})$ and chooses action $i$ with probability $p_{i,t-1}$.
Simplest model: all losses $\ell_s(i)$, $i=1,\dots,N$, $s<t$, are observed: full information.

Hannan and Blackwell

Hannan (1957) and Blackwell (1956) showed that the forecaster has a strategy such that
$$\frac{1}{n}\left(\sum_{t=1}^n \ell_t(I_t) - \min_{i \le N} \sum_{t=1}^n \ell_t(i)\right) \to 0$$
almost surely, for all strategies of the environment.

basic ideas

Expected loss of the forecaster:
$$\ell_t(p_{t-1}) = \sum_{i=1}^N p_{i,t-1}\,\ell_t(i) = \mathbb{E}_t\,\ell_t(I_t).$$
By martingale convergence,
$$\frac{1}{n}\left(\sum_{t=1}^n \ell_t(I_t) - \sum_{t=1}^n \ell_t(p_{t-1})\right) = O_P(n^{-1/2}),$$
so it suffices to study
$$\frac{1}{n}\left(\sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i \le N} \sum_{t=1}^n \ell_t(i)\right).$$

weighted average prediction

Idea: assign a higher probability to better-performing actions. Vovk (1990), Littlestone and Warmuth (1989). A popular choice is
$$p_{i,t-1} = \frac{\exp\left(-\eta\sum_{s=1}^{t-1}\ell_s(i)\right)}{\sum_{k=1}^N \exp\left(-\eta\sum_{s=1}^{t-1}\ell_s(k)\right)}, \qquad i = 1,\dots,N,$$
where $\eta > 0$. Then
$$\frac{1}{n}\left(\sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i \le N}\sum_{t=1}^n \ell_t(i)\right) \le \sqrt{\frac{\ln N}{2n}}$$
with $\eta = \sqrt{8\ln N/n}$. (A sketch implementation appears after the lower bound below.)

proof

Let $L_{i,t} = \sum_{s=1}^t \ell_s(i)$ and
$$W_t = \sum_{i=1}^N w_{i,t} = \sum_{i=1}^N e^{-\eta L_{i,t}}$$
for $t \ge 1$, and $W_0 = N$. First observe that
$$\ln\frac{W_n}{W_0} = \ln\left(\sum_{i=1}^N e^{-\eta L_{i,n}}\right) - \ln N \ge \ln\left(\max_{i=1,\dots,N} e^{-\eta L_{i,n}}\right) - \ln N = -\eta \min_{i=1,\dots,N} L_{i,n} - \ln N.$$

On the other hand, for each $t = 1,\dots,n$,
$$\ln\frac{W_t}{W_{t-1}} = \ln\frac{\sum_{i=1}^N w_{i,t-1}\,e^{-\eta\ell_t(i)}}{\sum_{j=1}^N w_{j,t-1}} \le -\eta\,\frac{\sum_{i=1}^N w_{i,t-1}\,\ell_t(i)}{\sum_{j=1}^N w_{j,t-1}} + \frac{\eta^2}{8} = -\eta\,\ell_t(p_{t-1}) + \frac{\eta^2}{8}$$
by Hoeffding's inequality. Hoeffding (1963): if $X \in [0,1]$, then
$$\ln \mathbb{E}\,e^{-\eta X} \le -\eta\,\mathbb{E}X + \frac{\eta^2}{8}.$$

Summing over $t = 1,\dots,n$,
$$\ln\frac{W_n}{W_0} \le -\eta\sum_{t=1}^n \ell_t(p_{t-1}) + \frac{\eta^2}{8}\,n.$$
Combining these, we get
$$\sum_{t=1}^n \ell_t(p_{t-1}) \le \min_{i=1,\dots,N} L_{i,n} + \frac{\ln N}{\eta} + \frac{\eta}{8}\,n.$$

lower bound

The upper bound is optimal: for all predictors,
$$\sup_{n,N,\ell_t(i)}\; \frac{\sum_{t=1}^n \ell_t(I_t) - \min_{i\le N}\sum_{t=1}^n \ell_t(i)}{\sqrt{(n/2)\ln N}} \ge 1.$$
Idea: choose the $\ell_t(i)$ to be i.i.d. symmetric Bernoulli coin flips. Then
$$\sup_{\ell}\, \mathbb{E}\left[\sum_{t=1}^n \ell_t(I_t) - \min_{i\le N}\sum_{t=1}^n \ell_t(i)\right] \ge \frac{n}{2} - \mathbb{E}\min_{i\le N} B_i,$$
where $B_1,\dots,B_N$ are independent Binomial$(n,1/2)$. Use the central limit theorem.
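To make the weighted average forecaster concrete, here is a minimal Python sketch of the exponentially weighted average forecaster in the full-information model. The function name, the toy loss matrix, and the fixed seeds are illustrative assumptions, not part of the survey.

```python
import numpy as np

def exponentially_weighted_forecaster(losses, eta, seed=0):
    """Exponentially weighted average forecaster (full information).

    losses: (n, N) array, losses[t, i] in [0, 1].
    eta:    learning rate, e.g. sqrt(8 ln N / n) as in the bound above.
    """
    n, N = losses.shape
    rng = np.random.default_rng(seed)
    cum = np.zeros(N)                         # cumulative losses L_{i,t-1}
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum - cum.min()))  # shifted for numerical stability
        p = w / w.sum()                       # the distribution p_{t-1}
        i = rng.choice(N, p=p)                # randomized prediction
        total += losses[t, i]
        cum += losses[t]                      # all losses are observed
    return total

# toy run: two experts with random binary losses
n, N = 1000, 2
losses = np.random.default_rng(1).integers(0, 2, size=(n, N)).astype(float)
total = exponentially_weighted_forecaster(losses, eta=np.sqrt(8 * np.log(N) / n))
print(total - losses.sum(axis=0).min())       # realized regret, O(sqrt(n ln N))
```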
follow the perturbed leader

$$I_t = \arg\min_{i=1,\dots,N}\left(\sum_{s=1}^{t-1}\ell_s(i) + Z_{i,t}\right),$$
where the $Z_{i,t}$ are random noise variables. The original forecaster of Hannan (1957) is based on this idea.

If the $Z_{i,t}$ are i.i.d. uniform on $[0,\sqrt{nN}]$, then
$$\frac{1}{n}R_n \le 2\sqrt{\frac{N}{n}} + O_P(n^{-1/2}).$$
If the $Z_{i,t}$ are i.i.d. with density $(\eta/2)e^{-\eta|z|}$, then for $\eta \approx \sqrt{\log N/n}$,
$$\frac{1}{n}R_n \le c\sqrt{\frac{\log N}{n}} + O_P(n^{-1/2}).$$
Kalai and Vempala (2003).

combinatorial experts

Often the class of experts is very large but has some combinatorial structure. Can the structure be exploited?

path planning

At each time instance, the forecaster chooses a path in a graph between two fixed nodes. Each edge has an associated loss; the loss of a path is the sum of the losses over the edges of the path.
[Map image © Google. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.]
$N$ is huge!!!

assignments: learning permutations

Given a complete bipartite graph $K_{m,m}$, the forecaster chooses a perfect matching. The loss is the sum of the losses over the edges.
[Image removed due to copyright restrictions. Please see http://38.media.tumblr.com/tumblr_m0ol5tggjZ1qir7tc.gif]
Helmbold and Warmuth (2007): full information case.

spanning trees

The forecaster chooses a spanning tree in the complete graph $K_m$. The cost is the sum of the losses over the edges.

combinatorial experts

Formally, the class of experts is a set $S \subset \{0,1\}^d$ of cardinality $|S| = N$. At each time $t$, a loss is assigned to each component: $\ell_t \in \mathbb{R}^d$. The loss of expert $v \in S$ is $\ell_t(v) = \ell_t^\top v$. The forecaster chooses $I_t \in S$. The goal is to control the regret
$$\sum_{t=1}^n \ell_t(I_t) - \min_{k=1,\dots,N}\sum_{t=1}^n \ell_t(k).$$

computing the exponentially weighted average forecaster

One needs to draw a random element of $S$ with distribution proportional to
$$w_t(v) = \exp\left(-\eta L_{t-1}(v)\right) = \exp\left(-\eta\sum_{s=1}^{t-1}\ell_s^\top v\right) = \prod_{j=1}^d \exp\left(-\eta\sum_{s=1}^{t-1}\ell_{s,j}\,v_j\right).$$

path planning: sampling may be done by dynamic programming.
assignments: the sum of the weights (the partition function) is the permanent of a nonnegative matrix; sampling may be done by the FPAS of Jerrum, Sinclair, and Vigoda (2004).
spanning trees: Propp and Wilson (1998) define an exact sampling algorithm; its expected running time is the average hitting time of the Markov chain defined by the edge weights.

computing the follow-the-perturbed-leader forecaster

In general, much easier: one only needs to solve a linear optimization problem over $S$. This may be hard, but it is well understood. In our examples it becomes a shortest path problem, an assignment problem, or a minimum spanning tree problem. A sketch follows.
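The following sketch illustrates follow-the-perturbed-leader in the combinatorial setting. The brute-force oracle over a tiny explicit $S$ stands in for the shortest-path, assignment, or minimum-spanning-tree solvers mentioned above; all names and constants are illustrative assumptions.

```python
import numpy as np

def fpl_action(oracle, cum_loss, eta, rng):
    """One round of follow the perturbed leader: perturb the cumulative
    component losses with i.i.d. double-exponential noise of density
    (eta/2) e^{-eta |z|} and call the linear optimization oracle over S."""
    z = rng.laplace(scale=1.0 / eta, size=cum_loss.shape)
    return oracle(cum_loss + z)

# toy oracle: brute force over a tiny explicit S (three "paths" in {0,1}^4);
# for real problems this would be a shortest-path, assignment, or MST solver
S = np.array([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]])
oracle = lambda c: S[np.argmin(S @ c)]

rng = np.random.default_rng(0)
d, n = S.shape[1], 500
eta = np.sqrt(np.log(len(S)) / n)
cum, total = np.zeros(d), 0.0
for t in range(n):
    ell = rng.random(d)                  # adversary's component losses
    v = fpl_action(oracle, cum, eta, rng)
    total += ell @ v
    cum += ell
```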
follow the leader: random walk perturbation

Suppose there are $N$ experts, with no structure. Define
$$I_t = \arg\min_{i=1,\dots,N}\sum_{s=1}^{t}\left(\ell_{i,s-1} + X_{i,s}\right),$$
where the $X_{i,s}$ are either i.i.d. normal or $\pm 1$ coin flips. This is like follow-the-perturbed-leader but with a random walk perturbation $\sum_{s=1}^t X_{i,s}$.
Advantage: the forecaster rarely changes actions!

If $R_n$ is the regret and $C_n$ is the number of times $I_t \ne I_{t-1}$, then
$$\mathbb{E}R_n \le 2\,\mathbb{E}C_n \le 8\sqrt{2n\log N} + 16\log n + 16.$$
Devroye, Lugosi, and Neu (2015). Key tool: the number of leader changes in $N$ independent random walks with drift.

This also works in the "combinatorial" setting: just add an independent $N(0,d)$ perturbation to every component at each time. Then
$$\mathbb{E}R_n = \widetilde{O}\left(B^{3/2}\sqrt{n\log d}\right) \qquad \text{and} \qquad \mathbb{E}C_n = O\left(B\sqrt{n\log d}\right),$$
where $B = \max_{v\in S}\|v\|_1$.

why exponentially weighted averages?

They may be adapted to many different variants of the problem, including bandits, tracking, etc.

multi-armed bandits

The forecaster only observes $\ell_t(I_t)$ but not $\ell_t(i)$ for $i \ne I_t$. Herbert Robbins (1952).
[Image removed due to copyright restrictions. Please see https://en.wikipedia.org/wiki/Herbert_Robbins#/media/File:1966-HerbertRobbins.jpg]

Trick: estimate $\ell_t(i)$ by
$$\widetilde{\ell}_t(i) = \frac{\ell_t(I_t)}{p_{I_t,t-1}}\,\mathbb{1}_{\{I_t=i\}}.$$
This is an unbiased estimate:
$$\mathbb{E}_t\,\widetilde{\ell}_t(i) = \sum_{j=1}^N p_{j,t-1}\,\frac{\ell_t(j)}{p_{j,t-1}}\,\mathbb{1}_{\{j=i\}} = \ell_t(i).$$
Use the estimated losses to define exponential weights and mix with the uniform distribution (Auer, Cesa-Bianchi, Freund, and Schapire, 2002):
$$p_{i,t-1} = (1-\gamma)\,\underbrace{\frac{\exp\left(-\eta\sum_{s=1}^{t-1}\widetilde{\ell}_s(i)\right)}{\sum_{k=1}^N \exp\left(-\eta\sum_{s=1}^{t-1}\widetilde{\ell}_s(k)\right)}}_{\text{exploitation}} + \underbrace{\frac{\gamma}{N}}_{\text{exploration}}.$$
This forecaster achieves
$$\mathbb{E}\left[\frac{1}{n}\left(\sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i\le N}\sum_{t=1}^n \ell_t(i)\right)\right] = O\left(\sqrt{\frac{N\ln N}{n}}\right).$$

Lower bound:
$$\sup\,\mathbb{E}\left[\frac{1}{n}\left(\sum_{t=1}^n \ell_t(p_{t-1}) - \min_{i\le N}\sum_{t=1}^n \ell_t(i)\right)\right] \ge C\sqrt{\frac{N}{n}}.$$
The dependence on $N$ is not logarithmic anymore! Audibert and Bubeck (2009) constructed a forecaster with
$$\max_{i\le N}\,\mathbb{E}\left[\frac{1}{n}\left(\sum_{t=1}^n \ell_t(p_{t-1}) - \sum_{t=1}^n \ell_t(i)\right)\right] = O\left(\sqrt{\frac{N}{n}}\right).$$
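A minimal sketch of the mixed exponential-weights bandit forecaster just described (Exp3-style). The Bernoulli arms and the parameter choices below are illustrative assumptions.

```python
import numpy as np

def exp3(n, N, pull, eta, gamma, seed=0):
    """Exp3-style forecaster: exponential weights on importance-weighted
    loss estimates, mixed with the uniform distribution for exploration."""
    rng = np.random.default_rng(seed)
    est = np.zeros(N)                             # cumulative estimated losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (est - est.min()))
        p = (1 - gamma) * w / w.sum() + gamma / N
        i = rng.choice(N, p=p)
        loss = pull(t, i)                         # only l_t(I_t) is observed
        total += loss
        est[i] += loss / p[i]                     # unbiased estimate of l_t(i)
    return total

# toy run: Bernoulli arms with different means
n, N = 10000, 5
arm_rng = np.random.default_rng(1)
means = np.linspace(0.3, 0.7, N)
total = exp3(n, N, lambda t, i: float(arm_rng.random() < means[i]),
             eta=np.sqrt(np.log(N) / (n * N)),
             gamma=min(1.0, np.sqrt(N * np.log(N) / n)))
```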
calibration

Sequential probability assignment. A binary sequence $x_1, x_2, \dots$ is revealed one by one. After observing $x_1,\dots,x_{t-1}$, the forecaster issues a prediction $I_t \in \{0,1,\dots,N\}$. Meaning: "the chance of rain is $I_t/N$."
The forecast is calibrated if
$$\left|\frac{\sum_{t=1}^n x_t\,\mathbb{1}_{\{I_t=i\}}}{\sum_{t=1}^n \mathbb{1}_{\{I_t=i\}}} - \frac{i}{N}\right| \le \frac{1}{2N} + o(1)$$
whenever $\limsup_n\,(1/n)\sum_{t=1}^n \mathbb{1}_{\{I_t=i\}} > 0$.
Is there a forecaster that is calibrated for all possible sequences? NO (Dawid, 1985).

randomized calibration

However, if the forecaster is allowed to randomize, then it is possible! (Foster and Vohra, 1997.) This can be achieved by a simple modification of any regret minimization procedure. Set of actions (experts): $\{0,1,\dots,N\}$. At time $t$, assign loss $\ell_t(i) = (x_t - i/N)^2$ to action $i$. One can now define a forecaster, but minimizing regret is not sufficient.

internal regret

Recall that the (expected) regret is
$$\sum_{t=1}^n \ell_t(p_{t-1}) - \min_i\sum_{t=1}^n \ell_t(i) = \max_i\sum_{t=1}^n\sum_j p_{j,t}\left(\ell_t(j) - \ell_t(i)\right).$$
The internal regret is defined by
$$\max_{i,j}\sum_{t=1}^n p_{j,t}\left(\ell_t(j) - \ell_t(i)\right);$$
here $p_{j,t}\left(\ell_t(j) - \ell_t(i)\right) = \mathbb{E}_t\left[\mathbb{1}_{\{I_t=j\}}\left(\ell_t(j) - \ell_t(i)\right)\right]$ is the expected regret of having taken action $j$ instead of action $i$.

internal regret and calibration

By guaranteeing small internal regret, one obtains a calibrated forecaster. This can be achieved by an exponentially weighted average forecaster defined over $N^2$ actions. This can be extended even to calibration with checking rules.

prediction with partial monitoring

For each round $t = 1,\dots,n$:
the environment chooses the next outcome $J_t \in \{1,\dots,M\}$ without revealing it;
the forecaster chooses a probability distribution $p_{t-1}$ and draws an action $I_t \in \{1,\dots,N\}$ according to $p_{t-1}$;
the forecaster incurs loss $\ell(I_t,J_t)$ and each action $i$ incurs loss $\ell(i,J_t)$. None of these values is revealed to the forecaster;
the feedback $h(I_t,J_t)$ is revealed to the forecaster.
$H = [h(i,j)]_{N\times M}$ is the feedback matrix. $L = [\ell(i,j)]_{N\times M}$ is the loss matrix.

examples

Dynamic pricing. Here $M = N$ and $L = [\ell(i,j)]_{N\times N}$, where
$$\ell(i,j) = \frac{j-i}{N}\,\mathbb{1}_{\{i\le j\}} + c\,\mathbb{1}_{\{i>j\}},$$
and $h(i,j) = \mathbb{1}_{\{i>j\}}$ or $h(i,j) = a\,\mathbb{1}_{\{i\le j\}} + b\,\mathbb{1}_{\{i>j\}}$, $i,j = 1,\dots,N$.

Multi-armed bandit problem. The only information the forecaster receives is his own loss: $H = L$.

Apple tasting. $N = M = 2$,
$$L = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad H = \begin{pmatrix} a & a \\ b & c \end{pmatrix}.$$
The predictor only receives feedback when he chooses the second action.

Label efficient prediction. $N = 3$, $M = 2$,
$$L = \begin{pmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad H = \begin{pmatrix} a & b \\ c & c \\ c & c \end{pmatrix}.$$

a general predictor

A forecaster first proposed by Piccolboni and Schindelhauer (2001). Crucial assumption: $H$ can be encoded such that there exists an $N\times N$ matrix $K = [k(i,j)]_{N\times N}$ with
$$L = K\cdot H, \qquad \text{that is,} \qquad \ell(i,j) = \sum_{l=1}^N k(i,l)\,h(l,j).$$
Then we may estimate the losses by
$$\widetilde{\ell}(i,J_t) = \frac{k(i,I_t)\,h(I_t,J_t)}{p_{I_t,t-1}}.$$
Observe that
$$\mathbb{E}_t\,\widetilde{\ell}(i,J_t) = \sum_{k=1}^N p_{k,t-1}\,\frac{k(i,k)\,h(k,J_t)}{p_{k,t-1}} = \sum_{k=1}^N k(i,k)\,h(k,J_t) = \ell(i,J_t),$$
so $\widetilde{\ell}(i,J_t)$ is an unbiased estimate of $\ell(i,J_t)$. Let
$$p_{i,t-1} = (1-\gamma)\,\frac{e^{-\eta\widetilde{L}_{i,t-1}}}{\sum_{k=1}^N e^{-\eta\widetilde{L}_{k,t-1}}} + \frac{\gamma}{N},$$
where $\widetilde{L}_{i,t} = \sum_{s=1}^t \widetilde{\ell}(i,J_s)$.

performance bound

With probability at least $1-\delta$,
$$\frac{1}{n}\sum_{t=1}^n \ell(I_t,J_t) - \min_{i=1,\dots,N}\frac{1}{n}\sum_{t=1}^n \ell(i,J_t) \le C\,n^{-1/3}N^{2/3}\sqrt{\ln(N/\delta)},$$
where $C$ depends on $K$. (Cesa-Bianchi, Lugosi, and Stoltz, 2006.)
Hannan consistency is achieved with rate $O(n^{-1/3})$ whenever $L = K\cdot H$. This solves the dynamic pricing problem.
Bartók, Pál, and Szepesvári (2010): if $M = 2$, the only possible rates are $n^{-1/2}$, $n^{-1/3}$, and $1$.
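A sketch of one round of this estimated-loss forecaster under the assumption $L = K\cdot H$. The toy instance uses the multi-armed bandit special case $H = L$, $K = I$, where the estimate reduces to the Exp3 estimate; everything below is illustrative.

```python
import numpy as np

def partial_monitoring_round(K, H, est, eta, gamma, j_t, rng):
    """One round of the estimated-loss forecaster for partial monitoring.

    Assumes L = K @ H, so that k(., I_t) h(I_t, J_t) / p_{I_t} is an
    unbiased estimate of the loss column l(., J_t).
    est: cumulative estimated losses tilde-L, updated in place.
    """
    N = K.shape[0]
    w = np.exp(-eta * (est - est.min()))
    p = (1 - gamma) * w / w.sum() + gamma / N
    i = rng.choice(N, p=p)
    feedback = H[i, j_t]                    # the only value revealed
    est += K[:, i] * feedback / p[i]        # tilde-l(., J_t)
    return i

# toy instance: the bandit special case H = L and K = identity
rng = np.random.default_rng(0)
L = rng.random((4, 3))
K, H = np.eye(4), L
est = np.zeros(4)
for t in range(1000):
    partial_monitoring_round(K, H, est, eta=0.05, gamma=0.05,
                             j_t=int(rng.integers(3)), rng=rng)
```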
imperfect monitoring: a general framework

$\mathcal{S}$ is a finite set of signals. Feedback matrix: $H : \{1,\dots,N\}\times\{1,\dots,M\}\to\mathcal{P}(\mathcal{S})$. For each round $t = 1,2,\dots,n$:
the environment chooses the next outcome $J_t\in\{1,\dots,M\}$ without revealing it;
the forecaster chooses $p_{t-1}$ and draws an action $I_t\in\{1,\dots,N\}$ according to it;
the forecaster receives loss $\ell(I_t,J_t)$ and each action $i$ suffers loss $\ell(i,J_t)$; none of these values is revealed to the forecaster;
a feedback $s_t$, drawn at random according to $H(I_t,J_t)$, is revealed to the forecaster.

target

Define
$$\ell(p,q) = \sum_{i,j} p_i q_j\,\ell(i,j), \qquad H(\cdot,q) = \left(H(1,q),\dots,H(N,q)\right), \quad \text{where } H(i,q) = \sum_j q_j H(i,j).$$
Denote by $\mathcal{F}$ the set of those $\Phi$ that can be written as $H(\cdot,q)$ for some $q$; $\mathcal{F}$ is the set of "observable" vectors of signal distributions. The key quantity is
$$\rho(p,\Phi) = \max_{q:\,H(\cdot,q)=\Phi}\ell(p,q).$$
$\rho$ is convex in $p$ and concave in $\Phi$.

rustichini's theorem

The value of the base one-shot game is
$$\min_p\max_q \ell(p,q) = \min_p\max_{\Phi\in\mathcal{F}}\rho(p,\Phi).$$
If $\bar{q}_n$ is the empirical distribution of $J_1,\dots,J_n$, then even with the knowledge of $H(\cdot,\bar{q}_n)$ we cannot hope to do better than $\min_p \rho\left(p, H(\cdot,\bar{q}_n)\right)$. Rustichini (1999) proved that there exists a strategy such that, for all strategies of the opponent, almost surely
$$\limsup_{n\to\infty}\left(\frac{1}{n}\sum_{t=1}^n \ell(I_t,J_t) - \min_p \rho\left(p, H(\cdot,\bar{q}_n)\right)\right) \le 0.$$
Rustichini's proof relies on an approachability theorem for a continuum of types (Mertens, Sorin, and Zamir, 1994). It is non-constructive and does not imply any convergence rate. Lugosi, Mannor, and Stoltz (2008) construct efficiently computable strategies that guarantee fast rates of convergence.

combinatorial bandits

The class of actions is a set $S\subset\{0,1\}^d$ of cardinality $|S| = N$. At each time $t$, a loss is assigned to each component: $\ell_t\in\mathbb{R}^d$. The loss of action $v\in S$ is $\ell_t(v) = \ell_t^\top v$. The forecaster chooses $I_t\in S$. The goal is to control the regret
$$\sum_{t=1}^n \ell_t(I_t) - \min_{k=1,\dots,N}\sum_{t=1}^n \ell_t(k).$$

Three models:
(Full information.) All $d$ components of the loss vector are observed.
(Semi-bandit.) Only the components corresponding to the chosen object are observed.
(Bandit.) Only the total loss of the chosen object is observed.
Challenge: is $O(n^{-1/2}\,\mathrm{poly}(d))$ regret achievable for the semi-bandit and bandit problems?

combinatorial prediction game

[A sequence of path-planning illustrations, © Google. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/fairuse.]
The adversary assigns a loss $\ell_i$ to each edge $i = 1,\dots,d$; the player chooses a path and suffers loss $\ell_2 + \ell_7 + \dots + \ell_{d-1}$. The feedback is:
Full info: $\ell_1, \ell_2, \dots, \ell_d$.
Semi-bandit: $\ell_2, \ell_7, \dots, \ell_{d-1}$.
Bandit: $\ell_2 + \ell_7 + \dots + \ell_{d-1}$.

notation

$S\subset\{0,1\}^d$, $\ell_t\in\mathbb{R}_+^d$, $V_t\in S$; loss suffered: $\ell_t^\top V_t$. Regret:
$$R_n = \mathbb{E}\sum_{t=1}^n \ell_t^\top V_t - \min_{u\in S}\,\mathbb{E}\sum_{t=1}^n \ell_t^\top u.$$
Loss assumption: $|\ell_t^\top v|\le 1$ for all $v\in S$ and $t = 1,\dots,n$.

weighted average forecaster

At time $t$, assign a weight $w_{t,i}$ to each component $i = 1,\dots,d$. The weight of each $v_k\in S$ is
$$\overline{w}_t(k) = \prod_{i:\,v_k(i)=1} w_{t,i}.$$
Let $q_{t-1}(k) = \overline{w}_{t-1}(k)/\sum_{k'=1}^N \overline{w}_{t-1}(k')$. At each time $t$, draw $K_t$ from the distribution
$$p_{t-1}(k) = (1-\gamma)\,q_{t-1}(k) + \gamma\,\mu(k),$$
where $\mu$ is a fixed distribution on $S$ and $\gamma > 0$. Here $w_{t,i} = \exp(-\eta\widetilde{L}_{t,i})$, where $\widetilde{L}_{t,i} = \widetilde{\ell}_{1,i} + \dots + \widetilde{\ell}_{t,i}$ and $\widetilde{\ell}_{t,i}$ is an estimated loss.

loss estimates

Dani, Hayes, and Kakade (2008). Define the scaled incidence vector $X_t = \ell_t(K_t)\,V_{K_t}$, where $K_t$ is distributed according to $p_{t-1}$. Let $P_{t-1} = \mathbb{E}\left[V_{K_t}V_{K_t}^\top\right]$ be the $d\times d$ correlation matrix; hence
$$P_{t-1}(i,j) = \sum_{k:\,v_k(i)=v_k(j)=1} p_{t-1}(k).$$
Similarly, let $Q_{t-1}$ and $M$ be the correlation matrices $\mathbb{E}\left[VV^\top\right]$ when $V$ has law $q_{t-1}$ and $\mu$, respectively. Then
$$P_{t-1}(i,j) = (1-\gamma)\,Q_{t-1}(i,j) + \gamma\,M(i,j).$$
The vector of loss estimates is defined by
$$\widetilde{\ell}_t = P_{t-1}^{+}X_t,$$
where $P_{t-1}^{+}$ is the pseudo-inverse of $P_{t-1}$.

key properties

$MM^{+}v = v$ for all $v\in S$. $Q_{t-1}$ and $P_{t-1}$ are positive semidefinite for every $t$. $P_{t-1}^{+}P_{t-1}v = v$ for all $t$ and $v\in S$. By definition, $\mathbb{E}_t X_t = P_{t-1}\ell_t$ and therefore
$$\mathbb{E}_t\,\widetilde{\ell}_t = P_{t-1}^{+}\,\mathbb{E}_t X_t = \ell_t.$$
An unbiased estimate!
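A sketch of this estimation step in numpy: sample from the mixed distribution, observe only the total loss of the chosen action, and recover a loss estimate through the pseudo-inverse of the correlation matrix. The set $S$ and the distributions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)  # toy action set, N = 3, d = 4
N, d = S.shape
q = np.full(N, 1.0 / N)       # exponential-weights part (illustrative values)
mu = np.full(N, 1.0 / N)      # fixed exploration distribution
gamma = 0.1
p = (1 - gamma) * q + gamma * mu

ell = rng.random(d)           # true component losses (hidden in the bandit game)
k = rng.choice(N, p=p)
x = (ell @ S[k]) * S[k]       # scaled incidence vector X_t: only l_t^T V observed
P = (S.T * p) @ S             # correlation matrix sum_k p(k) v_k v_k^T
ell_hat = np.linalg.pinv(P) @ x   # loss estimate, unbiased on the span of S
```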
performance bound

The regret of the forecaster satisfies
$$\frac{1}{n}\left(\mathbb{E}\widehat{L}_n - \min_{k=1,\dots,N} L_n(k)\right) \le 2\sqrt{\left(\frac{2B^2}{d\,\lambda_{\min}(M)} + 1\right)\frac{d\ln N}{n}},$$
where
$$\lambda_{\min}(M) = \min_{x\in\mathrm{span}(S):\,\|x\|=1} x^\top M x > 0$$
is the smallest "relevant" eigenvalue of $M$. (Cesa-Bianchi and Lugosi, 2009.) A large $\lambda_{\min}(M)$ is needed to make sure that no $|\widetilde{\ell}_{t,i}|$ is too large.

Other bounds:
$B\sqrt{d\ln N/n}$ (Dani, Hayes, and Kakade): no condition on $S$; sampling is over a barycentric spanner.
$d\sqrt{(\theta\ln n)/n}$ (Abernethy, Hazan, and Rakhlin): computationally efficient.

eigenvalue bounds

$$\lambda_{\min}(M) = \min_{x\in\mathrm{span}(S):\,\|x\|=1}\mathbb{E}\left[(V^\top x)^2\right],$$
where $V$ has distribution $\mu$ over $S$. In many cases it suffices to take $\mu$ uniform.

multitask bandit problem

The decision maker acts in $m$ games in parallel. In each game, the decision maker selects one of $R$ possible actions. After selecting the $m$ actions, the sum of the losses is observed. Here $\lambda_{\min}(M) = 1/R$ and
$$\max_k\,\mathbb{E}\left[\widehat{L}_n - L_n(k)\right] \le 2m\sqrt{3nR\ln R}.$$
The price of only observing the sum of the losses is a factor of $m$. Generating a random joint action can be done in polynomial time.

assignments

Perfect matchings of $K_{m,m}$. At each time, one of the $N = m!$ perfect matchings of $K_{m,m}$ is selected. Here $\lambda_{\min}(M) = 1/(m-1)$ and
$$\max_k\,\mathbb{E}\left[\widehat{L}_n - L_n(k)\right] \le 2m\sqrt{3n\ln(m!)}.$$
Only a factor of $m$ worse than the naive full-information bound.

spanning trees

In a network of $m$ nodes, the cost of communication between two nodes joined by edge $e$ is $\ell_t(e)$ at time $t$. At each time, a minimal connected subnetwork (a spanning tree) is selected. The goal is to minimize the total cost. $N = m^{m-2}$ and
$$\lambda_{\min}(M) = \frac{1}{m} - O\left(\frac{1}{m^2}\right).$$
The entries of $M$ are
$$\mathbb{P}\{V_i = 1\} = \frac{2}{m}, \qquad \mathbb{P}\{V_i = 1, V_j = 1\} = \begin{cases} 3/m^2 & \text{if } i\sim j \\ 4/m^2 & \text{if } i\not\sim j. \end{cases}$$

stars

At each time, a central node of a network of $m$ nodes is selected. The cost is the total cost of the edges adjacent to the node. Here
$$\lambda_{\min}(M) = 1 - O\left(\frac{1}{m}\right).$$

cut sets

A balanced cut in $K_{2m}$ is the collection of all edges between a set of $m$ vertices and its complement. Each balanced cut has $m^2$ edges and there are $N = \binom{2m}{m}$ balanced cuts. Here
$$\lambda_{\min}(M) = \frac{1}{4} - O\left(\frac{1}{m^2}\right).$$
Choosing from the exponentially weighted average distribution is equivalent to sampling from a ferromagnetic Ising model: FPAS by Randall and Wilson (1999).

hamiltonian cycles

A Hamiltonian cycle in $K_m$ is a cycle that visits each vertex exactly once and returns to the starting vertex. $N = (m-1)!/2$. Efficient computation is hopeless.

sampling paths

In all these examples $\mu$ is uniform over $S$. For path planning this does not always work. What is the optimal choice of $\mu$? What is the optimal way of exploration?

minimax regret

$$\overline{R}_n = \inf_{\text{strategy}}\;\max_{S\subset\{0,1\}^d}\;\sup_{\text{adversary}} R_n$$

Theorem. Let $n \ge d^2$. In the full information and semi-bandit games,
$$0.008\,d\sqrt{n} \le \overline{R}_n \le d\sqrt{2n},$$
and in the bandit game,
$$0.01\,d^{3/2}\sqrt{n} \le \overline{R}_n \le 2\,d^{5/2}\sqrt{2n}.$$

proof

Upper bounds: $D = [0,+\infty)^d$, $F(x) = \frac{1}{\eta}\sum_{i=1}^d x_i\log x_i$ works for full information, but it is only optimal up to a logarithmic factor in the semi-bandit case; in the bandit case it does not work at all! The exponentially weighted average forecaster is used.
Lower bounds: careful construction of a randomly chosen set $S$ in each case.

a general strategy

Let $D$ be a convex subset of $\mathbb{R}^d$ with nonempty interior $\mathrm{int}(D)$. A function $F : D\to\mathbb{R}$ is Legendre if
• $F$ is strictly convex and admits continuous first partial derivatives on $\mathrm{int}(D)$;
• for $u\in\partial D$ and $v\in\mathrm{int}(D)$, we have
$$\lim_{s\to 0,\,s>0}\,(u-v)^\top\nabla F\left((1-s)u + sv\right) = +\infty.$$
The Bregman divergence $D_F : D\times\mathrm{int}(D)\to\mathbb{R}$ associated with a Legendre function $F$ is
$$D_F(u,v) = F(u) - F(v) - (u-v)^\top\nabla F(v).$$

CLEB (Combinatorial LEarning with Bregman divergences)

Parameter: $F$ Legendre on $D$.
(1) $w'_{t+1}\in D$: $\nabla F(w'_{t+1}) = \nabla F(w_t) - \widetilde{\ell}_t$;
(2) $w_{t+1}\in\arg\min_{w\in\mathrm{Conv}(S)} D_F(w, w'_{t+1})$;
(3) $p_{t+1}\in\Delta(S)$: $w_{t+1} = \mathbb{E}_{V\sim p_{t+1}}V$.
A sketch of one step appears below.

general regret bound for CLEB

Theorem. If $F$ admits a Hessian $\nabla^2 F$ that is always invertible, then
$$R_n \lesssim \mathrm{diam}_{D_F}(S) + \mathbb{E}\sum_{t=1}^n \widetilde{\ell}_t^\top\left(\nabla^2 F(w_t)\right)^{-1}\widetilde{\ell}_t.$$
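A sketch of one CLEB step with the entropy function $F(x) = \frac{1}{\eta}\sum_i x_i\log x_i$ (the LinExp instance discussed next): the gradient step is multiplicative, and the Bregman projection onto $\mathrm{Conv}(S)$ is the problem-specific part. Purely for illustration, $\mathrm{Conv}(S)$ is taken to be the simplex, where the projection is a normalization.

```python
import numpy as np

def cleb_entropy_step(w, ell_hat, eta):
    """One CLEB step with F(x) = (1/eta) sum_i x_i log x_i.

    (1) Gradient step: grad F(w') = grad F(w) - ell_hat; for this F the
        solution is the multiplicative update w' = w * exp(-eta * ell_hat).
    (2) Bregman projection onto Conv(S); with Conv(S) the simplex this is
        a plain normalization (a general S needs its own projection).
    """
    w_prime = w * np.exp(-eta * ell_hat)
    return w_prime / w_prime.sum()

# iterating the step on estimated losses recovers exponential weights / Exp3
rng = np.random.default_rng(0)
w = np.full(5, 1.0 / 5)
for t in range(100):
    w = cleb_entropy_step(w, rng.random(5), eta=0.1)
```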
different instances of CLEB: LinExp (entropy function)

$D = [0,+\infty)^d$, $F(x) = \frac{1}{\eta}\sum_{i=1}^d x_i\log x_i$. This recovers:
full info: the exponentially weighted average forecaster; semi-bandit = bandit: Exp3 (Auer et al., 2002);
and also:
full info: Component Hedge (Koolen, Warmuth, and Kivinen, 2010); semi-bandit: MW (Kale, Reyzin, and Schapire, 2010); bandit: a new algorithm.

different instances of CLEB: LinINF (exchangeable Hessian)

$D = [0,+\infty)^d$, $F(x) = \sum_{i=1}^d \int_0^{x_i}\psi^{-1}(s)\,ds$.
INF, Audibert and Bubeck (2009):
$\psi(x) = \exp(\eta x)$: LinExp;
$\psi(x) = (-\eta x)^{-q}$, $q > 1$: LinPoly.

different instances of CLEB: follow the regularized leader

$D = \mathrm{Conv}(S)$; then
$$w_{t+1}\in\arg\min_{w\in D}\left(\sum_{s=1}^t \widetilde{\ell}_s^\top w + F(w)\right).$$
A particularly interesting choice: $F$ a self-concordant barrier function; Abernethy, Hazan, and Rakhlin (2008).

MIT OpenCourseWare
http://ocw.mit.edu

18.657 Mathematics of Machine Learning
Fall 2015

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.