Learning Theory and Algorithms for Auctioning and Adaptation Problems

by Andrés Muñoz Medina

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Mathematics, New York University, September 2015. Advisor: Mehryar Mohri.

© Andrés Muñoz Medina, All Rights Reserved, 2015

Dedication

To William Shaw, the limit does exist.

Acknowledgements

First and foremost, I want to thank my advisor Mehryar Mohri, who I recall once said to me that he considers each student an investment. To him I want to say "thank you for taking a risk with me". I want to thank him for always being honest with me, for respecting my opinion as that of a colleague and, most importantly, for teaching me that research is much more than just proving theorems. Being his student was challenging, yet it is thanks to him that today I can call myself a researcher. I want to thank my thesis committee, not only for their comments and suggestions on my thesis but for their role in my professional development: Corinna Cortes, a reader and collaborator on the subject of domain adaptation, who taught me a wide variety of tricks of the trade in the experimental evaluation of algorithms. In fact, several results in this dissertation would not have been possible without her detailed suggestions and explanations. Yishay Mansour and Claudio Gentile, who indirectly inspired my research in auctioning through their extensive results in this area and who, in spite of the distance, were willing to be part of this committee and provided me with insightful comments on my dissertation. Finally, I want to thank Esteban Tabak for being there for me and advising me at difficult times during my Ph.D., and Robert Kohn for preparing me for my qualifying exams. Making the transition from theoretical mathematics to computer science was a major challenge for me: abstract topology does not prepare you for running large-scale machine learning experiments. It is only due to the patience and teachings of my hosts and friends at Google, Afshin Rostamizadeh, Umar Syed, Keith Hall, Cyril Allauzen, Shankar Kumar, Kishore Papineni and Richard Zens, that I have developed the skills needed to become a research scientist. I look forward to many future collaborations with them. I am also indebted to my lab colleagues and, above all, friends: Giulia DeSalvo, Scott Yang, Marius Kloft and Vitaly Kuznetsov, for countless hours of machine learning discussions, research suggestions and talk preparations. I want to thank my friends at NYU: Juan Calvo, Edgar Costa, Henrique Moyses and, of course, our volleyball team. These five years would not have been the same without our study groups and academic exchanges. But most of all, I am grateful for knowing that I was never alone in this program and that all of us had to deal with the same issues. I also want to thank my friends outside of the program for keeping me sane throughout these years, in particular Miguel Angel, Marinie and Armando, some of my oldest friends, who, in spite of the distance, always offered me support and advice when I most needed it. I also want to thank my parents and my brother, who have supported me since childhood and have always believed in me. All of my achievements are, and will continue to be, dedicated to you. Finally, I want to thank Will Shaw, the love of my life. Over the past four years you have stood by my side in the good times and the bad times. You have shared my successes and my failures.
You have put up with my insane schedule and have always been there for me when I needed you the most; and for that, I cannot thank you enough.

Abstract

A common assumption in machine learning is that training and testing data are i.i.d. realizations from the same distribution. However, this assumption is often violated in practice: for instance, test and training distributions can be related but different, as when a face recognition system is trained on a carefully curated data set but deployed in the real world. As a different example, consider training a spam classifier when the type of emails considered spam changes as a function of time. The first problem described above is known as domain adaptation and the second one is called learning under drifting distributions. The first part of this thesis presents theory and algorithms for these problems. For domain adaptation, we provide tight learning bounds based on the novel concept of generalized discrepancy. These bounds strongly motivate our learning algorithm, and it is shown, both theoretically and empirically, that this algorithm can significantly improve upon the current state of the art. We extend the theoretical results of domain adaptation to the more challenging scenario of learning under drifting distributions. Moreover, we establish a deep connection between on-line learning and this problem. In particular, we provide a novel on-line to batch conversion that motivates a learning algorithm with excellent empirical performance.

The second part of this thesis studies a crucial problem at the intersection of learning and game theory: revenue optimization in repeated auctions. More precisely, we study second-price and generalized second-price auctions with reserve. These auction mechanisms have become extremely popular in recent years due to the advent of online advertisement. Both types of auctions are characterized by a reserve price representing the minimum value at which the seller is willing to part with the object in question. Therefore, selecting an optimal reserve price is crucial to achieving the largest possible revenue. We cast this problem as a learning problem and provide the first theoretical analysis of learning optimal reserve prices from samples for both second-price and generalized second-price auctions. These results, however, assume that buyers do not react strategically to changes in reserve prices. Therefore, in the last chapter of this thesis we analyze the possible strategies of buyers and show that if the seller is more patient than the buyer, it is not in the best interest of the buyer to behave strategically.

Contents

Dedication
Acknowledgements
Abstract
List of Figures
List of Tables
List of Appendices
Introduction
    0.1 Notation

I Domain Adaptation

1 Domain Adaptation
    1.1 Introduction
    1.2 Learning Scenario
    1.3 Discrepancy
    1.4 Learning Guarantees
    1.5 Algorithm
    1.6 Generalized Discrepancy
    1.7 Optimization Solution
    1.8 Experiments
    1.9 Conclusion

2 Drifting
    2.1 Introduction
    2.2 Preliminaries
    2.3 Drifting PAC Scenario
    2.4 Drifting Tracking Scenario
    2.5 On-line to Batch Conversion
    2.6 Algorithm
    2.7 Experiments
    2.8 Conclusion

II Auctions

3 Learning in Auctions
    3.1 Introduction
    3.2 Setup
    3.3 Previous work
    3.4 Learning Guarantees
    3.5 Algorithms
    3.6 Experiments
    3.7 Conclusion

4 Generalized Second-Price Auctions
    4.1 Introduction
    4.2 Model
    4.3 Previous Work
    4.4 Learning Algorithms
    4.5 Convergence of Empirical Equilibria
    4.6 Experiments
    4.7 Conclusion

5 Learning Against Strategic Adversaries
    5.1 Introduction
    5.2 Setup
    5.3 Monotone Algorithms
    5.4 A Nearly Optimal Algorithm
    5.5 Lower Bound
    5.6 Empirical Results
    5.7 Conclusion

6 Conclusion

Appendices

Bibliography

List of Figures

1.1 Example of reweighting to make source and target distributions more similar.
1.2 Examples of favorable (c) and unfavorable (a and b) scenarios for adaptation. (a) Instance distributions are disjoint. (b) Instance distributions match but labeling functions are different; therefore, joint distributions are not similar. (c) Instance distributions can be made similar through reweighting and the labeling function is the same.
1.3 Figure depicting the difference between the L1 distance and the discrepancy. In the left figure, the L1 distance is given by twice the measure of the green rectangle. In the right figure, P(h(x) ≠ h′(x)) is equal to the measure of the blue rectangle and Q(h(x) ≠ h′(x)) is the measure of the purple rectangle. The two measures are equal, thus disc(P, Q) = 0.
1.4 Depiction of the distances η_H(f_P, f_Q) and d_1^{P̂}(f_P, H″).
1.5 Illustration of the sampling process on the set H″.
1.6 (a) Hypotheses obtained by training on source (green circles), target (red triangles) and using the DM (dashed blue) and GDM algorithms (solid blue). (b) Objective functions for the source and target distributions as well as the GDM and DM algorithms. The vertical lines show the minimizer for each algorithm. The set H and the surrogate hypothesis set H″ ⊆ H are shown at the bottom.
1.7 (a) MSE performance for different adaptation algorithms when adapting from kin-8fh to the three other kin-8xy domains. (b) Relative error of DM over GDM as a function of the ratio Λr.
2.1 Barplot of estimated discrepancies for the continuous drifting and alternating drifting scenarios.
2.2 MSE of different algorithms for the continuous drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).
2.3 MSE of different algorithms for the alternating drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).
3.1 (a) Plot of the loss function r ↦ L(r, b) for fixed values of b(1) and b(2); (b) functions l1 on the left and l2 on the right.
3.2 (a) Piecewise linear convex surrogate loss Lp. (b) Comparison of the sum of real losses Σ_{i=1}^m L(·, b_i) for m = 500 with the sum of convex surrogate losses. Note that the minimizers are significantly different.
3.3 (a) Comparison of the true loss L with the surrogate loss Lγ on the left and Lγ on the right, for γ = 0.1. (b) Comparison of the surrogate losses Σ_{i=1}^{500} L(r, b_i) and Σ_{i=1}^{500} Lγ(r, b_i).
3.4 (a) Prototypical v-function. (b) Illustration of the fact that the definition of V_i(r, b_i) does not change on an interval [n_k, n_{k+1}].
3.5 Pseudocode of our DC-programming algorithm.
3.6 Plots of expected revenue against sample size for different algorithms: DC algorithm (DC), convex surrogate (CVX), ridge regression (Reg) and the algorithm that uses no features to set reserve prices (NF). For (a)-(c), bids are generated with different noise standard deviations: (a) 0, (b) 0.25, (c) 0.5. The bids in (d) were generated using a generative model.
3.7 Distribution of reserve prices for each algorithm. The algorithms were trained on 800 samples using noisy bids with standard deviation 0.5.
3.8 Results on the eBay data set. Comparison of our algorithm (DC) against a convex surrogate (CVX), using no features (NF), setting no reserve (NR) and setting the reserve price to the highest bid (HB).
4.1 Depiction of the loss L_{i,s}. Notice that the loss in fact resembles a broken "V".
4.2 (a) Empirical verification of Assumption 2. Values were generated using a uniform distribution over [0, 1] and the parameters of the auction were N = 3, s = 2. The blue line corresponds to the quantity max_i Δβ_i for different values of n. In red we plot the desired upper bound for C = 1/2.
4.3 Approximation of the empirical bidding function β̂ to the true solution β. The true solution is shown in red and the shaded region represents the confidence interval of β̂ when simulating the discrete GSP 10 times with a sample of size 200. Here N = 3, S = 2, c_1 = 1, c_2 = 0.5, and values were sampled uniformly from [0, 1].
4.4 Bidding function for our experiments in blue and the identity function in red. Since β is a strictly increasing function, it follows from (Gomes and Sweeney, 2014) that this GSP admits an equilibrium.
4.5 Comparison of methods for estimating valuations from bids. (a) Histogram of true valuations. (b) Valuations estimated under the SNE assumption. (c) Density estimation algorithm.
5.1 (a) Tree T(3) associated with the algorithm proposed in (Kleinberg and Leighton, 2003a). (b) Modified tree T′(3) with r = 2.
5.2 Comparison of the monotone algorithm and PFSr for different choices of γ and v. The regret of each algorithm is plotted as a function of the number of rounds when γ is not known to the algorithms (first two figures) and when its value is made accessible to the algorithms (last two figures).
C.1 Depiction of the envelope function.
D.1 Regret curves for PFSr and monotone for different values of v and γ. The value of γ is not known to the algorithms.
D.2 Regret curves for PFSr and monotone for different values of v and γ. The value of γ is known to both algorithms.

List of Tables

1.1 Notation table.
1.2 Adaptation from books (B), kitchen (K), electronics (E) and dvd (D) to all other domains. Normalized results: MSE of training on the unweighted source data is equal to 1. Results in bold represent the algorithm with the lowest MSE.
1.3 Adaptation from caltech256 (C), imagenet (I), sun (S) and bing (B). Normalized results: MSE of training on the unweighted source data is equal to 1.
2.1 Results of different algorithms on the financial data set. Bold results are significant at the 0.05 level.
4.1 Mean revenue of both our algorithms.

List of Appendices

Appendix to Chapter 1
Appendix to Chapter 3
Appendix to Chapter 4
Appendix to Chapter 5
Introduction

This thesis presents a complete theoretical and algorithmic analysis of two important problems in machine learning. The first part studies a problem rooted at the very foundation of this field: that of learning under non-standard distribution assumptions. In an ideal learning scenario, a learner uses an i.i.d. training sample from a distribution D to select a hypothesis h which is to be tested on the same distribution D. In practice, however, this assumption is not always satisfied. Consider, for instance, the following examples:

• A hospital in New York City collects patient information in order to predict the probability of a patient developing a particular disease. After training a learning algorithm on this dataset, the resulting system is deployed in all hospitals in the United States.

• A political analyst trying to predict the winner of an election collects social media information over the span of a month. This data is used to train a sentiment classification algorithm that will help analyze people's preferences over candidates.

In the first example, the training sample used by the learning algorithm represents only a biased fraction of the target population. Therefore, we cannot guarantee that the predictions made by the learner will be accurate over the whole U.S. population. This is a prototypical instance of the problem of domain adaptation, where the learner must train on a source distribution different from the intended target. The second example suffers from a similar issue: the population's sentiment towards a candidate can change on a daily basis. Thus, the information collected at the beginning of the month might mislead the learner, making its predictions at the end of the month inaccurate. This is an example of how the distribution of the data drifts over time. For this reason, this learning scenario is commonly known as learning with drifting distributions.

These examples motivate the following question: can we still learn without the standard i.i.d. assumptions of statistical learning theory? It is clear that if the training and target distributions are vastly different, for instance if their supports are disjoint, domain adaptation cannot succeed. Similarly, if the distributions drastically drift over time, there is no hope for learning. At the core of these problems is therefore a notion of distance between distributions. Several measures of divergence between distributions exist, such as the L1 distance and the KL-divergence. In Chapter 1 we discuss the disadvantages of these divergence measures and introduce a novel distance tailored to the problems of domain adaptation and drifting. In Chapters 1 and 2 we derive learning bounds for these problems which help us design algorithms with excellent theoretical properties as well as state-of-the-art results.

Part II of this thesis presents the first learning-based analysis of revenue optimization in auctions. Traditionally studied in economics, auctions have recently become an important object of analysis in computer science, mainly due to the important role auctions play nowadays in electronic markets. Indeed, online advertisement is today one of the fastest growing markets in the world, consisting of billions of dollars' worth of transactions. The mechanisms by which most of the space for these advertisements is sold are the second-price auction with reserve and the generalized second-price auction with reserve.
In a second-price auction, buyers bid for an object and the highest bidder wins. However, the winner is not obligated to pay his own bid; rather, he pays the bid of the second-highest bidder. In order to avoid selling its inventory at a low price when the second bid is too small, the seller sets a reserve price below which the object will not be sold. The winner of the auction therefore must pay the maximum of the second-highest bid and the reserve price. An appropriate selection of the reserve price is therefore crucial: if it is set too low, the seller might not extract all possible revenue; on the other hand, if the reserve price is too high, buyers might not bid above it and the seller would procure no revenue. In Chapter 3 we propose a data-driven algorithm for selecting the optimal reserve price. This approach poses several theoretical challenges, since the loss function associated with this problem does not possess the properties of traditional loss functions used in machine learning, such as convexity or continuity. In spite of this, we provide learning guarantees for this problem and give an efficient algorithm for setting optimal reserve prices as a function of the features of the auction. We further extend these results to the more complicated generalized second-price auction.

Trying to learn optimal reserve prices has a secondary effect inherent to auctions: buyers have the possibility to react to our actions. In particular, if buyers realize that their bids are used to train a learning algorithm, they could modify their bidding behavior in an attempt to misguide the learner and obtain a more beneficial reserve price in the future. The interactions between a seller trying to optimize his revenue and a strategic buyer are analyzed in Chapter 5. We show that, under some natural assumptions, a seller can in fact find the optimal reserve price even in the challenging scenario where the buyer has access to the learning algorithm used by the seller. This is a remarkable result that justifies the use of historical data as a means to optimize revenue.

Let us emphasize that, with the exception of the work of (Cesa-Bianchi et al., 2013), the use of machine learning techniques in the study of auctions presented here is completely novel. Indeed, auction mechanisms have traditionally been studied only from a game-theoretic perspective. Given the large amount of historical data collected by major online advertising companies, we believe that the use of machine learning techniques is crucial for better understanding and optimizing these mechanisms. In fact, our work has inspired novel revenue optimization algorithms in more general settings (Morgenstern and Roughgarden, 2015).

0.1 Notation

We present the notation and basic concepts that will be used throughout this thesis. We will consider an input space X and an output space Y. Unless otherwise stated, we will assume that Y ⊂ R. Let H denote a hypothesis set, that is, a collection of functions h : X → Y′ for Y′ ⊂ R. For the most part we will consider the case Y′ = Y. A loss function L : Y′ × Y → R will be used to measure the quality of an algorithm. For a distribution D over X × Y we denote the expected loss of a hypothesis by
\[
\mathcal{L}_D(h) = \mathbb{E}_{(x,y)\sim D}\big[L(h(x), y)\big].
\]
When the distribution D is understood from context, it will be omitted from the notation. Similarly, for a sample ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, we denote the empirical loss of a hypothesis by
\[
\mathcal{L}_{\widehat{D}}(h) = \frac{1}{m}\sum_{i=1}^m L(h(x_i), y_i).
\]
If there exists a labeling function f : X → Y, we define the expected loss of a hypothesis h with respect to f as
\[
\mathcal{L}_{D_x}(h, f) = \mathbb{E}_{x \sim D_x}\big[L\big(h(x), f(x)\big)\big],
\]
where D_x denotes the marginal distribution of D over the input space X. We use an analogous notation for the empirical loss of a hypothesis h with respect to f. Matrices will be denoted by upper-case bold letters (A, B, ...) and vectors by lower-case bold letters (u, v, ...).

The following capacity concepts will be used repeatedly in this thesis.

Definition 1. Let Z be any set and let G be a family of functions mapping Z to R. Given an i.i.d. sample S = (z_1, ..., z_m) ∈ Z^m from a distribution D, we define the empirical Rademacher complexity of G as
\[
\widehat{\mathfrak{R}}_S(G) = \frac{1}{m}\,\mathbb{E}_{\sigma}\Big[\sup_{g \in G}\sum_{i=1}^m \sigma_i g(z_i)\Big],
\]
where σ = (σ_1, ..., σ_m) and the σ_i are independent uniform random variables over the set {−1, 1}. The random variables σ_i are called Rademacher variables. The Rademacher complexity of the class G is the expectation of the empirical Rademacher complexity:
\[
\mathfrak{R}_m(G) = \mathbb{E}_S\big[\widehat{\mathfrak{R}}_S(G)\big].
\]
The Rademacher complexity measures the correlation of the function class G with noise and is therefore a way of quantifying the size of the function class G. When the functions in G take values in the set {−1, 1}, the Rademacher complexity can be related to the growth function.

Definition 2. Let G be a family of functions mapping Z to {−1, 1}. The growth function Π_G : N → N for the family G is defined by
\[
\Pi_G(m) = \max_{\{z_1, \ldots, z_m\} \subset Z}\big|\big\{\big(g(z_1), \ldots, g(z_m)\big) \,\big|\, g \in G\big\}\big|.
\]
The growth function evaluates the number of ways a sample of size m can be classified using the function class G. Like the Rademacher complexity, the growth function is a measure of the size of a function class. However, it is a purely combinatorial measure, unlike the Rademacher complexity, which takes into account the distribution of the data. The following lemma, due to Massart, relates both quantities.

Lemma 1 (Massart's lemma). Let G be a family of functions taking values in {−1, 1}. Then the following holds for any sample S:
\[
\widehat{\mathfrak{R}}_S(G) \le \sqrt{\frac{2\log\Pi_G(m)}{m}}.
\]
We now introduce the combinatorial concept of VC-dimension.

Definition 3. Let G be a family of functions taking values in {−1, 1}. The VC-dimension of G is the largest value d such that Π_G(d) = 2^d. We denote this value by VCdim(G).

The VC-dimension is thus the size of the largest sample that can be classified in all possible ways by G. The last lemma of this section provides a bound on the growth function in terms of the VC-dimension.

Lemma 2 (Sauer's lemma). Let G be a family of functions with VCdim(G) = d. Then, for all m ≥ d,
\[
\Pi_G(m) \le \sum_{i=0}^{d}\binom{m}{i} \le \Big(\frac{em}{d}\Big)^d.
\]
In particular, if the class G has VCdim(G) = d, then R_m(G) is in O(√(d log(m/d)/m)). For an extensive treatment of these complexity measures we refer the reader to Mohri et al. (2012). We conclude this section by presenting a concept analogous to the VC-dimension for a class of functions G taking values in R.

Definition 4. Let G be a family of functions taking values in R. We define the pseudo-dimension of G, denoted Pdim(G), as
\[
\mathrm{Pdim}(G) = \mathrm{VCdim}(G'), \quad \text{where } G' = \{z \mapsto 1_{g(z) - t > 0} \,|\, g \in G,\, t \in \mathbb{R}\}.
\]
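The quantity in Definition 1 is an expectation over the random signs σ, so for a finite family of functions it can be approximated numerically by direct Monte Carlo sampling. The following sketch is only an illustration and is not part of the dissertation: it assumes a finite class, represented by the matrix of its values on the fixed sample, and uses a simple threshold family in the usage example.

```python
import numpy as np

def empirical_rademacher(G_values, n_trials=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity R_hat_S(G).

    G_values: array of shape (num_functions, m) whose row j holds the values
        (g_j(z_1), ..., g_j(z_m)) of a function g_j from a finite class G on
        the fixed sample S = (z_1, ..., z_m).
    """
    rng = np.random.default_rng(seed)
    _, m = G_values.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        # (1/m) * sum_i sigma_i g(z_i), computed for every g in the class
        correlations = G_values @ sigma / m
        total += correlations.max()               # supremum over the finite class
    return total / n_trials

if __name__ == "__main__":
    # Illustrative class: thresholds z <= t with values in {-1, +1}, on m = 20 points.
    rng = np.random.default_rng(1)
    z = np.sort(rng.uniform(0.0, 1.0, size=20))
    G_values = np.array([np.where(z <= t, 1.0, -1.0) for t in np.linspace(0, 1, 11)])
    print(empirical_rademacher(G_values))
```

For richer classes, the inner maximum would be replaced by an optimization oracle over G; the estimate concentrates around the true empirical Rademacher complexity as the number of trials grows.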
Part I
Domain Adaptation

Chapter 1
Domain Adaptation

In this chapter we study the problem of domain adaptation. In machine learning, domain adaptation refers to the problem of obtaining the most accurate hypothesis on a test distribution different from the training distribution. This challenging problem has received a great deal of attention from the learning community over the past decade. Here, we present an overview of the theory of domain adaptation as well as a description of the current state-of-the-art adaptation algorithm: discrepancy minimization (DM). We discuss the main drawbacks of DM and introduce a new algorithm based on the novel concept of generalized discrepancy. We provide learning guarantees for our proposed algorithm and show empirically that it consistently outperforms DM and several other commonly used algorithms.

1.1 Introduction

A standard assumption in much of learning theory and algorithms is that training and test data are sampled from the same distribution. In practice, however, this assumption often does not hold. The learner then faces the more challenging problem of domain adaptation, where the source and target distributions are distinct. This problem arises in a variety of applications such as natural language processing and computer vision (Dredze et al., 2007; Blitzer et al., 2007; Jiang and Zhai, 2007; Leggetter and Woodland, 1995; Martínez, 2002; Hoffman et al., 2014), and many others.

A common trait among the aforementioned examples is that, albeit different, the source and target distributions are somehow related, as otherwise learning would be impossible. Indeed, as shown by Ben-David et al. (2010), if the supports of the source and target distributions differ too much, adaptation cannot succeed. More surprisingly, Ben-David and Urner (2012) showed that even in the favorable scenario where the source and target distributions admit the same support, a sample of size on the order of the cardinality of the target support is needed to solve the domain adaptation problem. As pointed out by the authors, the problem becomes trivially intractable when the hypothesis set contains no candidate with good performance on the training set. However, the adaptation tasks found in applications often seem to be more favorable than such worst cases, and in fact several algorithms over the past decade have empirically demonstrated that adaptation can indeed succeed.

We can distinguish two broad families of adaptation algorithms. Some consist of finding a new feature representation. The core idea behind these algorithms is to map the source and target data into a new feature space where the difference between source and target distributions is reduced. Transfer Component Analysis (TCA) (Pan et al., 2011) and the work on frustratingly easy domain adaptation (FE) (Daumé III, 2007) both belong to this family of algorithms. While some empirical evidence has been reported in the literature for the effectiveness of these algorithms, we are not aware of any theoretical guarantees in support of these techniques.

Figure 1.1: Example of reweighting to make source and target distributions more similar.

Many other adaptation algorithms can be viewed as reweighting techniques. Originating in the statistics literature on sample bias correction, these techniques attempt to correct the difference between distributions by multiplying every training example by a positive weight. Most of the classical algorithms, such as KMM (Huang et al., 2006), KLIEP (Sugiyama et al., 2007) and uLSIF (Kanamori et al., 2009), fall in this category. The weights for the KMM algorithm are selected in order to minimize the difference between the mean of the reweighted source and target feature vectors under an appropriate feature map.
A different approach is given by the KLIEP algorithm, where weights are selected to minimize the KL-divergence between the source and target distributions. Finally, the uLSIF algorithm chooses these positive weights in order to minimize the squared distance between these distributions.

As the previous description shows, the underlying idea behind common reweighting techniques is that of minimizing the distance between the reweighted empirical source and target distributions. A crucial component of these learning algorithms is thus the choice of a divergence between probability measures. The KLIEP algorithm is based on the minimization of the KL-divergence, while algorithms such as KMM or the algorithm of Zhang et al. (2013) use the maximum mean discrepancy (MMD) distance (Gretton et al., 2007) as the divergence to be minimized. While these are natural divergence measures commonly used in statistics, the aforementioned algorithms do not come with any learning guarantees. Instead, when the source and target distributions admit densities q(x) and p(x) respectively, it can be shown that the weight on the sample point x_i will converge to the importance ratio p(x_i)/q(x_i). The use of this ratio is commonly known as importance weighting, and reweighting instances by this ratio provides an unbiased estimate of the expected loss on the target distribution. While this unbiasedness provides motivation for this kind of reweighting algorithm, it has been shown both empirically and theoretically that importance weighting algorithms can fail in the common case where the importance ratio becomes unbounded, unless its second moment is bounded, an assumption that cannot be tested in general (Cortes et al., 2010).

The discussion above shows some of the problems inherent to general-purpose divergence measures not tailored to domain adaptation. In this chapter, therefore, we present the Y-discrepancy as an alternative divergence to address these issues. The discrepancy is a generalization of the dA-distance, a crucial notion in the development of domain adaptation theory introduced by Ben-David et al. (2006). We provide a survey of discrepancy-based learning guarantees later in this chapter, as well as of the discrepancy minimization (DM) algorithm proposed by Mansour et al. (2009) and later enhanced by Cortes and Mohri (2011) and Cortes and Mohri (2013). In spite of the theoretical guarantees of DM, we will discuss some of the drawbacks of the DM algorithm and propose the novel notion of generalized discrepancy to address these issues. We provide learning guarantees based on this new distance which motivate the design of a generalized discrepancy minimization (GDM) algorithm. Unlike previous algorithms, we do not consider a fixed reweighting of the losses over the training sample. Instead, the weights assigned to training sample losses vary as a function of the hypothesis h. This helps us ensure that, for every hypothesis h, the empirical loss on the source distribution is as close as possible to the empirical loss on the target distribution for that particular h.

The chapter is organized as follows: we describe the learning scenario considered (Section 1.2); we then define the notion of discrepancy distance and compare the discrepancy against other common divergences used in practice (Section 1.3). We also provide learning guarantees for the domain adaptation problem based on the discrepancy and describe in detail the DM algorithm.
Having established the basic results for domain adaptation, we present a description of our GDM algorithm and show that it can be formulated as a convex optimization problem (Section 1.5). Next, we analyze the theoretical properties of our algorithm, which will guide the choice of the parameters defining it (Section 1.6). In Section 1.7, we further analyze our optimization problem and derive an equivalent form that can be handled by a standard convex optimization solver. Finally, in Section 1.8, we report the results of experiments demonstrating that our algorithm improves upon the DM algorithm in several tasks.

Table 1.1: Notation table.

X               Input space                       Y               Output space
P               Target distribution               Q               Source distribution
P̂               Empirical target distribution     Q̂               Empirical source distribution
T               Target unlabeled sample           S               Labeled source sample
T′              Small target labeled sample       S_X             Unlabeled source sample
f_P             Target labeling function          f_Q             Source labeling function
L_P(h, f_P)     Expected target loss              L_Q(h, f_Q)     Expected source loss
L_P̂(h, f_P)     Empirical target loss             L_Q̂(h, f_Q)     Empirical source loss
disc(P, Q)      Discrepancy                       DISC(P̂, U)      Generalized discrepancy
disc_H″(P, Q)   Local discrepancy                 disc_Y(P, Q)    Y-discrepancy
q_min           DM solution                       Q_h             GDM solution

1.2 Learning Scenario

This section defines the learning scenario of domain adaptation we consider, which coincides with that of Blitzer et al. (2007), Mansour et al. (2009), and Cortes and Mohri (2013), and introduces the definitions and concepts needed for the following sections. For the most part, we follow the definitions and notation of Cortes and Mohri (2013).

Let X denote the input space and Y ⊆ R the output space. We define a domain as a pair formed by a distribution over the set X and a labeling function mapping X to Y. Throughout this chapter we consider a source domain (Q, f_Q) and a target domain (P, f_P). We will abuse notation and denote by P and Q also the joint distributions induced by the labeling functions on the space X × Y. In the scenario of domain adaptation we consider, the learner receives two samples: a labeled sample of m points from the source domain, S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, with S_X = (x_1, ..., x_m) drawn i.i.d. according to Q and y_i = f_Q(x_i) for i ∈ [1, m]; and an unlabeled sample T = (x′_1, ..., x′_n) ∈ X^n of size n drawn i.i.d. according to the target distribution P. We denote by Q̂ the empirical distribution corresponding to S_X and by P̂ the empirical distribution corresponding to T. We will in fact be more interested in the scenario commonly encountered in practice where, in addition to these two samples, a small amount of labeled data from the target domain, T′ = ((x″_1, y″_1), ..., (x″_s, y″_s)) ∈ (X × Y)^s, is received by the learner.

We consider a loss function L : Y × Y → R_+ which, unless otherwise stated, we assume to be jointly convex in its two arguments. The L_p losses commonly used in regression, defined by L_p(y, y′) = |y′ − y|^p for p ≥ 1, are special instances of this definition. For any distribution D over X × Y, we denote by L_D(h) the expected loss of h:
\[
\mathcal{L}_D(h) = \mathbb{E}_{(x,y)\sim D}\big[L(h(x), y)\big].
\]
Similarly, for any two functions h, h′ and a distribution D over X, we let
\[
\mathcal{L}_D(h, h') = \mathbb{E}_{x \sim D}\big[L(h(x), h'(x))\big].
\]
The learning problem consists of selecting a hypothesis h out of a hypothesis set H with a small expected loss L_P(h, f_P) with respect to the target domain. We further extend this notation to arbitrary functions q : X → R with finite support as follows:
\[
\mathcal{L}_q(h, h') = \sum_{x \in \mathcal{X}} q(x)\, L(h(x), h'(x)).
\]
The function q is known as a reweighting function. Given a reweighting function q : S_X → R, we will be interested in studying regularized weighted risk minimization. That is, given a positive semi-definite (PSD) kernel K, we analyze algorithms that return a hypothesis solving the following optimization problem:
\[
\min_{h \in H}\; \lambda \|h\|_K^2 + \mathcal{L}_q(h, f_Q), \tag{1.1}
\]
where ‖·‖_K is the norm on the reproducing kernel Hilbert space H induced by the kernel K. This family of algorithms is commonly used in practice and includes algorithms such as support vector machines (SVM), kernel ridge regression (KRR) and support vector regression (SVR), to name a few.
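To make the weighted objective (1.1) concrete, here is a minimal sketch of its kernel ridge regression instance, for which the squared loss makes the reweighted problem solvable in closed form. The Gaussian kernel, the synthetic data and the uniform weights in the example are assumptions made purely for illustration; this is not the dissertation's implementation.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr(X, y, q, lam=0.1, gamma=1.0):
    """Solve min_h lam * ||h||_K^2 + sum_i q_i (h(x_i) - y_i)^2.

    With h = sum_j alpha_j K(x_j, .), the first-order condition gives
    (lam * I + diag(q) K) alpha = diag(q) y.  Returns a predictor x -> h(x).
    """
    K = gaussian_kernel(X, X, gamma)
    m = len(y)
    alpha = np.linalg.solve(lam * np.eye(m) + q[:, None] * K, q * y)
    return lambda Xtest: gaussian_kernel(Xtest, X, gamma) @ alpha

if __name__ == "__main__":
    # Uniform weights q_i = 1/m recover ordinary kernel ridge regression.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(50, 1))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=50)
    h = weighted_krr(X, y, q=np.full(50, 1 / 50))
    print(h(np.array([[0.0], [0.5]])))
```

Note that the weights enter only through the diagonal matrix diag(q); this is one natural way a reweighting such as the q_min of the DM algorithm discussed later in this chapter can be plugged into a standard learner.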
1.3 Discrepancy

A crucial component of domain adaptation is, of course, a notion of divergence between distributions. As already mentioned, if the divergence between the source and target distributions is too large, then adaptation is not possible. In Figure 1.2(a) we show a source and a target distribution over X with disjoint supports. If only data from the source distribution Q is used for training, one should not expect adaptation to succeed. Similarly, Figure 1.2(b) depicts the source and target distributions as well as the labeling functions f_P and f_Q. Even though the distributions over the instance space are close, the labeling functions differ, making the joint distributions P and Q far from each other. On the other hand, Figure 1.2(c) presents a scenario where reweighting the source sample would likely be useful. Indeed, by assigning more weight to the examples close to 0 we can make the source and target distributions similar, thereby helping adaptation. The source and target distributions of the first two examples are far apart, whereas in the third example a reweighting technique could make the source and target distributions close.

Figure 1.2: Examples of favorable (c) and unfavorable (a and b) scenarios for adaptation. (a) Instance distributions are disjoint. (b) Instance distributions match but the labeling functions are different; therefore, the joint distributions are not similar. (c) Instance distributions can be made similar through reweighting and the labeling function is the same.

In order to provide a formal analysis of domain adaptation, an appropriate notion of divergence between distributions is therefore necessary. A natural choice for this divergence is the total variation or L1 distance between two distributions:
\[
\|P - Q\|_1 := 2\sup_{A}|P(A) - Q(A)| = \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}|P(x,y) - Q(x,y)|,
\]
where the supremum is taken over all measurable sets A ⊂ X × Y. Another common measure of divergence between distributions is the KL-divergence or relative entropy:
\[
\mathrm{KL}(P\,\|\,Q) = \sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} P(x,y)\log\frac{P(x,y)}{Q(x,y)}.
\]
Notice, however, that these divergence measures do not take into consideration the loss function or the hypothesis set, both of which are crucial for learning. Therefore, there can exist distributions that are favorable for learning and are nevertheless assigned a large L1 distance.
Consider, for instance, the following toy example. Let P and Q be distributions over R², where P is uniform on the rectangle R1 with vertices (−1, R), (1, R), (1, −1), (−1, −1), and Q is uniform on the rectangle R2 with vertices (−1, −R), (1, −R), (−1, 1), (1, 1). These distributions are depicted in Figure 1.3. Let H be a set of threshold functions on x_1; that is, h_t(x_1, x_2) = 1 if x_1 ≤ t and 0 otherwise. Notice that the value of h_t only changes as a function of the first coordinate and that the marginals of P and Q along the first coordinate agree. In view of this, an algorithm trained on Q should have the same performance as an algorithm trained on P. We therefore expect the divergence between these two distributions to be small.

Figure 1.3: Figure depicting the difference between the L1 distance and the discrepancy. In the left figure, the L1 distance is given by twice the measure of the green rectangle. In the right figure, P(h(x) ≠ h′(x)) is equal to the measure of the blue rectangle and Q(h(x) ≠ h′(x)) is the measure of the purple rectangle. The two measures are equal, thus disc(P, Q) = 0.

However, the L1 distance between these probability measures is given by twice the measure of the green rectangle, that is, ‖P − Q‖_1 = 2(R − 1)/(R + 1), which tends to 2 as R → ∞. Moreover, the KL-divergence KL(P‖Q) faces an even bigger challenge in this case: if the support of P is not included in that of Q, this divergence becomes uninterestingly infinite. One could argue that the L1 distance of the marginals over the first coordinate is in fact 0. However, in practice, finding appropriate subsets of the supports of P and Q on which to compute the L1 distance might not be as trivial a task as in this example. This simple example shows that the problem of domain adaptation requires a finer measure of divergence between distributions. Notice that, for a hypothesis h, we are ultimately interested in the difference |L_P(h) − L_Q(h)|. It is therefore natural to define the divergence as the supremum of this quantity over h ∈ H.

Definition 5. Given distributions P and Q over X × Y, we define the Y-discrepancy between P and Q as
\[
\mathrm{disc}_Y(P, Q) = \sup_{h \in H}|\mathcal{L}_P(h) - \mathcal{L}_Q(h)|.
\]
Equivalently, if the labeling functions f_P and f_Q are deterministic, we have
\[
\mathrm{disc}_Y(P, Q) = \sup_{h \in H}|\mathcal{L}_P(h, f_P) - \mathcal{L}_Q(h, f_Q)|.
\]
It is clear from its definition that the Y-discrepancy takes into account the hypothesis set and the loss function. It is also symmetric and satisfies the triangle inequality. However, in general, the Y-discrepancy is not a metric, as it may be 0 even when P ≠ Q. It is easy to see that the Y-discrepancy is a finer measure than the L1 distance. Indeed, if the loss function L is bounded by M, the following holds:
\[
\mathrm{disc}_Y(P, Q) = \sup_{h \in H}|\mathcal{L}_P(h) - \mathcal{L}_Q(h)| \le \sup_{h \in H}\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}}|P(x,y) - Q(x,y)|\,|L(h(x), y)| \le M\|P - Q\|_1.
\]
By Pinsker's inequality, we also have ‖P − Q‖_1 ≤ √(2 KL(P‖Q)). Therefore, the Y-discrepancy is also finer than the KL-divergence.
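Since the Y-discrepancy is a supremum of differences of average losses, its empirical counterpart is immediate to compute whenever the supremum can be evaluated, for instance over a finite hypothesis class. The short sketch below is only an illustration; the finite class and the squared loss are assumptions of the example.

```python
import numpy as np

def empirical_y_discrepancy(preds_P, y_P, preds_Q, y_Q):
    """Empirical Y-discrepancy sup_h |L_Phat(h) - L_Qhat(h)| over a finite class.

    preds_P: array (num_hypotheses, n) of predictions on the target sample,
        with labels y_P of shape (n,); preds_Q, y_Q are the analogous source
        quantities. The squared loss is used for illustration.
    """
    loss_P = np.mean((preds_P - y_P) ** 2, axis=1)   # L_Phat(h) for each h
    loss_Q = np.mean((preds_Q - y_Q) ** 2, axis=1)   # L_Qhat(h) for each h
    return np.abs(loss_P - loss_Q).max()
```

As the discussion below makes precise, such an estimate requires labeled samples from both domains; when labeled target data is scarce, this is exactly what motivates the label-free notion of discrepancy introduced next.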
Furthermore, as the following theorem shows, the Y-discrepancy can be estimated from a finite set of samples. This is in stark contrast with the L1 distance, which requires a sample of size on the order of the support of the distribution to be estimated. In particular, for distributions with infinite support, the L1 distance cannot be estimated (Valiant, 2011).

Theorem 1. Let H_L = {(x, y) ↦ L(h(x), y) | h ∈ H} and let P, Q be distributions over X × Y. Then, for any δ > 0, with probability at least 1 − δ over the choice of the samples S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m drawn from Q and S′ = ((x′_1, y′_1), ..., (x′_n, y′_n)) ∈ (X × Y)^n drawn according to P, the following holds:
\[
\mathrm{disc}_Y(P, Q) \le \mathrm{disc}_Y(\hat P, \hat Q) + 2\mathfrak{R}_n(H_L) + 2\mathfrak{R}_m(H_L) + M\Big(\sqrt{\tfrac{\log(2/\delta)}{2m}} + \sqrt{\tfrac{\log(2/\delta)}{2n}}\Big),
\]
\[
\mathrm{disc}_Y(\hat P, \hat Q) \le \mathrm{disc}_Y(P, Q) + 2\mathfrak{R}_n(H_L) + 2\mathfrak{R}_m(H_L) + M\Big(\sqrt{\tfrac{\log(2/\delta)}{2m}} + \sqrt{\tfrac{\log(2/\delta)}{2n}}\Big).
\]
Proof. Notice that the Y-discrepancy between the distribution Q and its empirical counterpart Q̂ is simply the maximum generalization error over all hypotheses h ∈ H. Therefore, by a standard learning bound, with probability at least 1 − δ over the choice of S,
\[
\mathrm{disc}_Y(Q, \hat Q) \le 2\mathfrak{R}_m(H_L) + M\sqrt{\frac{\log(1/\delta)}{2m}},
\]
and a similar bound holds for P. Furthermore, by the triangle inequality, we have
\[
\mathrm{disc}_Y(Q, P) \le \mathrm{disc}_Y(Q, \hat Q) + \mathrm{disc}_Y(\hat Q, \hat P) + \mathrm{disc}_Y(\hat P, P).
\]
The result now follows immediately from these inequalities and the union bound.

For µ-Lipschitz continuous loss functions, the Rademacher complexity R_n(H_L) can be bounded by µR_n(H), and for the 0-1 loss we have R_n(H_L) ≤ ½R_n(H). Furthermore, if H has finite VC-dimension, the Rademacher complexity R_n(H) can be shown to be in O(1/√n). Therefore, estimating the discrepancy between distributions has the same statistical complexity as PAC learning. Notice, however, that the accuracy of these estimates depends on the sizes of both the source and target samples. In practice, only a small labeled sample from the target domain might be available, and therefore the Y-discrepancy cannot be accurately estimated. Nevertheless, under certain conditions, we can estimate an upper bound on the Y-discrepancy using only unlabeled samples. In a deterministic scenario with f_P = f_Q = f, the Y-discrepancy reduces to
\[
\sup_{h \in H}|\mathcal{L}_P(h, f) - \mathcal{L}_Q(h, f)|.
\]
Furthermore, when the labeling function f belongs to the hypothesis set H, we may upper bound this quantity by taking a supremum over h′ ∈ H. This leads to the following definition of discrepancy.

Definition 6 (Mansour et al. (2009)). The discrepancy disc between two distributions P and Q over X is defined as
\[
\mathrm{disc}(P, Q) = \sup_{h, h' \in H}|\mathcal{L}_P(h, h') - \mathcal{L}_Q(h, h')|.
\]
The discrepancy admits properties similar to those of the Y-discrepancy: it is a lower bound on the L1 distance and can be accurately estimated from finite samples. Moreover, since the labeling function is not required for its definition, these samples can be unlabeled. In general, the discrepancy and the Y-discrepancy are not directly comparable. However, as pointed out before, when f_P = f_Q = f ∈ H, the discrepancy is an upper bound on the Y-discrepancy. A detailed analysis of the relationship between these two measures is given in Section 1.4.

Let us now evaluate the discrepancy between the two distributions P and Q introduced earlier in our toy example. Let L be the 0-1 loss. Using the fact that the hypotheses h_t ∈ H are threshold functions, we have
\[
\mathrm{disc}(P, Q) = \sup_{h_t, h_s \in H}\big|P(h_t(x_1, x_2) \ne h_s(x_1, x_2)) - Q(h_t(x_1, x_2) \ne h_s(x_1, x_2))\big| = \sup_{s, t \in \mathbb{R}}\big|P([s, t] \times \mathbb{R}) - Q([s, t] \times \mathbb{R})\big| = \sup_{s, t \in \mathbb{R}}\Big|\frac{|t - s|}{2} - \frac{|t - s|}{2}\Big| = 0,
\]
which is what we expected for this scenario. In the next section we show that learning guarantees can be given in which the discrepancy appears only as an additive term. Therefore, for this particular example, we can guarantee that training on the source distribution is equivalent to training on the target.
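For specific losses and hypothesis sets, the empirical discrepancy of Definition 6 can be computed exactly. The sketch below treats the illustrative case of the squared loss and linear hypotheses h(x) = w·x with ‖w‖ ≤ Λ, for which a short computation shows that the discrepancy between two empirical distributions equals 4Λ² times the spectral norm of the difference of their second-moment matrices; reductions of this kind underlie the optimization studied by Cortes and Mohri (2011, 2013). The function names and the toy data are assumptions of the example.

```python
import numpy as np

def empirical_discrepancy_l2(Xs, Xt, q=None, Lambda=1.0):
    """Empirical discrepancy for the squared loss and H = {x -> w.x : ||w|| <= Lambda}.

    For this pair (loss, H), L_q(h, h') - L_Phat(h, h') is a quadratic form in
    (w - w'), so the supremum over the ball equals
    4 * Lambda^2 * || M_q - M_Phat ||_2, where M_q = sum_i q_i x_i x_i^T and
    M_Phat is the uniform second-moment matrix of the target sample.
    """
    m = Xs.shape[0]
    if q is None:
        q = np.full(m, 1.0 / m)          # uniform reweighting by default
    M_source = (q[:, None] * Xs).T @ Xs  # sum_i q_i x_i x_i^T
    M_target = Xt.T @ Xt / Xt.shape[0]
    eigs = np.linalg.eigvalsh(M_source - M_target)
    return 4 * Lambda**2 * np.abs(eigs).max()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    print(empirical_discrepancy_l2(X, X))          # identical samples: 0.0
    print(empirical_discrepancy_l2(X, X + 0.5))    # shifted target: > 0
```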
We conclude this section with some historical remarks on the notion of discrepancy. The first theoretical analysis of domain adaptation for classification was given by Ben-David et al. (2006), where the so-called dA-distance was introduced. The dA-distance is in fact a special case of the discrepancy when the loss function is the 0-1 loss. Later, the discrepancy disc was introduced by Mansour et al. (2009) as a generalization of the dA-distance to arbitrary loss functions. As we will later see, the notion of discrepancy was pivotal in the design of a theoretically founded algorithm for domain adaptation (Cortes and Mohri, 2011, 2013). Finally, the Y-discrepancy was used by Mohri and Muñoz (2012) to analyze the related problem of learning with drifting distributions (see Chapter 2).

1.4 Learning Guarantees

Here, we present two discrepancy-based guarantees: a tight learning bound based on the Rademacher complexity and a pointwise bound derived from a stability analysis. We will assume that the loss function L is µ-admissible.

Definition 7. A loss function L is said to be µ-admissible if there exists µ > 0 such that
\[
|L(h(x), y) - L(h'(x), y)| \le \mu\,|h(x) - h'(x)| \tag{1.2}
\]
holds for all (x, y) ∈ X × Y and h, h′ ∈ H.

Notice that the notion of µ-admissibility is somewhat weaker than requiring µ-Lipschitzness with respect to the first argument. The L_p losses commonly used in regression, p ≥ 1, verify this condition (see Appendix A.3).

Our first generalization bound is given in terms of the Y-discrepancy. Its derivation follows an analysis similar to the one given by Mohri and Muñoz (2012) to derive generalization bounds for learning with drifting distributions.

Proposition 1. Let H_Q and H_P be the families of functions defined by H_Q := {x ↦ L(h(x), f_Q(x)) : h ∈ H} and H_P := {x ↦ L(h(x), f_P(x)) : h ∈ H}, and define M_Q = sup_{x∈X, h∈H} L(h(x), f_Q(x)) and M_P = sup_{x∈X, h∈H} L(h(x), f_P(x)). Then, for any δ > 0,

1. with probability at least 1 − δ over the choice of a labeled sample S of size m, the following inequality holds for all h ∈ H:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_{\hat Q}(h, f_Q) + \mathrm{disc}_Y(P, Q) + 2\mathfrak{R}_m(H_Q) + M_Q\sqrt{\frac{\log(1/\delta)}{2m}}; \tag{1.3}
\]
2. with probability at least 1 − δ over the choice of a sample T of size n, the following inequality holds for all h ∈ H and any distribution q over S_X:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_q(h, f_Q) + \mathrm{disc}_Y(\hat P, q) + 2\mathfrak{R}_n(H_P) + M_P\sqrt{\frac{\log(1/\delta)}{2n}}. \tag{1.4}
\]
Proof. Let Φ(S) denote sup_{h∈H} (L_P(h, f_P) − L_Q̂(h, f_Q)). Changing one point in S changes Φ(S) by at most M_Q/m. Thus, by McDiarmid's inequality, the following holds:
\[
\Pr\big[\Phi(S) - \mathbb{E}[\Phi(S)] > \epsilon\big] \le e^{-2m\epsilon^2/M_Q^2}.
\]
Therefore, for any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ H:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_{\hat Q}(h, f_Q) + \mathbb{E}[\Phi(S)] + M_Q\sqrt{\frac{\log(1/\delta)}{2m}}.
\]
Next, we can bound E[Φ(S)] as follows:
\[
\mathbb{E}[\Phi(S)] = \mathbb{E}\Big[\sup_{h\in H}\mathcal{L}_P(h, f_P) - \mathcal{L}_{\hat Q}(h, f_Q)\Big] \le \sup_{h\in H}\big(\mathcal{L}_P(h, f_P) - \mathcal{L}_Q(h, f_Q)\big) + \mathbb{E}\Big[\sup_{h\in H}\mathcal{L}_Q(h, f_Q) - \mathcal{L}_{\hat Q}(h, f_Q)\Big] \le \mathrm{disc}_Y(P, Q) + 2\mathfrak{R}_m(H_Q),
\]
where the last inequality follows from a standard symmetrization inequality in terms of the Rademacher complexity and the definition of disc_Y(P, Q).

The second learning bound can be shown as follows. Starting with a standard Rademacher complexity bound for H_P, for any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ H:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_{\hat P}(h, f_P) + 2\mathfrak{R}_n(H_P) + M_P\sqrt{\frac{\log(1/\delta)}{2n}} \le \mathcal{L}_q(h, f_Q) + \big(\mathcal{L}_{\hat P}(h, f_P) - \mathcal{L}_q(h, f_Q)\big) + 2\mathfrak{R}_n(H_P) + M_P\sqrt{\frac{\log(1/\delta)}{2n}}
\]
\[
\le \mathcal{L}_q(h, f_Q) + \mathrm{disc}_Y(\hat P, q) + 2\mathfrak{R}_n(H_P) + M_P\sqrt{\frac{\log(1/\delta)}{2n}}, \tag{1.5}
\]
where the last two inequalities hold for any distribution q. This completes the proof.
Observe that these bounds are tight as a function of the divergence measure (discrepancy) we use: in the absence of adaptation, the following tight Rademacher complexity learning bound holds:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_{\hat P}(h, f_P) + 2\mathfrak{R}_n(H_P) + M_P\sqrt{\frac{\log(1/\delta)}{2n}}.
\]
Our second adaptation bound differs from this inequality only in that L_P̂(h, f_P) is replaced with L_q(h, f_Q) + disc_Y(P̂, q). But, by definition of the Y-discrepancy, there exists an h ∈ H such that |L_P̂(h, f_P) − L_q(h, f_Q)| = disc_Y(P̂, q). Therefore, our bound cannot be improved in the worst case. A similar analysis shows that our first bound is also tight.

Given a labeled sample S from the source domain, Proposition 1 suggests choosing a distribution q with support S_X that minimizes the right-hand side of (1.4). However, the quantity disc_Y(P̂, q) depends, by definition, on the unknown labels from the target domain and therefore cannot be minimized. Thus, we will instead upper bound the Y-discrepancy in terms of quantities that can be estimated. The bound based on the discrepancy disc given in the previous section required the labeling functions to be equal and to belong to the hypothesis set H. In order to relax this condition, we introduce the following term quantifying the difference between the source and target labeling functions:
\[
\eta_H(f_P, f_Q) = \min_{h_0 \in H}\Big\{\max_{x \in \mathrm{supp}(\hat P)}|f_P(x) - h_0(x)| + \max_{x \in \mathrm{supp}(\hat Q)}|f_Q(x) - h_0(x)|\Big\}.
\]
Proposition 2. The following inequality holds for all distributions q over S_X:
\[
\mathrm{disc}_Y(\hat P, q) \le \mathrm{disc}(\hat P, q) + \mu\,\eta_H(f_P, f_Q).
\]
Proof. By the triangle inequality and the µ-admissibility of the loss, the following inequality holds for all h_0 ∈ H:
\[
\mathrm{disc}_Y(\hat P, q) = \sup_{h\in H}|\mathcal{L}_q(h, f_Q) - \mathcal{L}_{\hat P}(h, f_P)| \le \sup_{h\in H}\big(|\mathcal{L}_{\hat P}(h, h_0) - \mathcal{L}_{\hat P}(h, f_P)| + |\mathcal{L}_q(h, f_Q) - \mathcal{L}_q(h, h_0)|\big) + \sup_{h\in H}|\mathcal{L}_q(h, h_0) - \mathcal{L}_{\hat P}(h, h_0)|
\]
\[
\le \mu\Big(\max_{x \in \mathrm{supp}(\hat P)}|h_0(x) - f_P(x)| + \max_{x \in \mathrm{supp}(\hat Q)}|f_Q(x) - h_0(x)|\Big) + \mathrm{disc}(\hat P, q).
\]
Minimizing over all h_0 ∈ H gives disc_Y(P̂, q) ≤ µ η_H(f_P, f_Q) + disc(P̂, q) and completes the proof.

The following corollary is an immediate consequence of Propositions 1 and 2.

Corollary 1. Under the notation of Proposition 1, for any distribution q over S_X and for any δ > 0, with probability at least 1 − δ:
\[
\mathcal{L}_P(h, f_P) \le \mathcal{L}_q(h, f_Q) + \mathrm{disc}(\hat P, q) + \mu\,\eta_H(f_P, f_Q) + 2\mathfrak{R}_n(H_P) + M\sqrt{\frac{\log(1/\delta)}{2n}}. \tag{1.6}
\]
Corollary 1 motivates the following algorithm: select a distribution q and a hypothesis h minimizing the right-hand side of (1.6). This optimization problem is, however, not jointly convex in q and h. Instead, Cortes and Mohri (2013) propose a two-step discrepancy minimization algorithm. That is, first a distribution q_min is found satisfying
\[
q_{\min} = \operatorname*{argmin}_{q \in \Delta(S_X)}\mathrm{disc}(\hat P, q),
\]
where Δ(S_X) ⊂ [0, 1]^{S_X} denotes the set of all probability distributions over S_X. Given q_min, the final hypothesis is chosen to minimize the following objective function:
\[
\lambda\|h\|_K^2 + \mathcal{L}_{q_{\min}}(h, f_Q). \tag{1.7}
\]
This is the discrepancy minimization algorithm of Mansour et al. (2009).
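Under the same illustrative assumptions used earlier (squared loss, bounded linear hypotheses), the weight-selection step above amounts to minimizing the spectral norm of a difference of second-moment matrices over the simplex. The sketch below solves this with a simple projected-subgradient loop; it is an illustrative substitute for the smooth and semidefinite-programming formulations studied by Cortes and Mohri (2013), not the dissertation's implementation.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0)

def dm_weights(Xs, Xt, n_iters=500, step=0.01):
    """Projected-subgradient sketch of the weight-selection step: find q on the
    simplex over the source sample minimizing
    || sum_i q_i x_i x_i^T - (1/n) sum_j x'_j x'_j^T ||_2 (spectral norm)."""
    m = Xs.shape[0]
    M_target = Xt.T @ Xt / Xt.shape[0]
    q = np.full(m, 1.0 / m)
    for _ in range(n_iters):
        D = (q[:, None] * Xs).T @ Xs - M_target       # symmetric difference matrix
        eigvals, eigvecs = np.linalg.eigh(D)
        k = np.argmax(np.abs(eigvals))                # eigenvalue of largest magnitude
        u = eigvecs[:, k]
        grad = np.sign(eigvals[k]) * (Xs @ u) ** 2    # subgradient of the spectral norm in q
        q = project_simplex(q - step * grad)
    return q
```

The resulting weights can then be passed to a weighted learner, such as the kernel ridge regression sketch of Section 1.2, to obtain the final hypothesis of (1.7).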
The DM algorithm is further motivated by the following theorem, which provides a bound on the distance between the ideal solution h* obtained by training on the target distribution and the hypothesis obtained by training on a reweighted source sample.

Theorem 2 (Cortes and Mohri (2013)). Let q be an arbitrary distribution over S_X and let h* and h_q be the hypotheses minimizing λ‖h‖_K² + L_P̂(h, f_P) and λ‖h‖_K² + L_q(h, f_Q), respectively. Then, the following inequality holds:
\[
\lambda\|h^* - h_q\|_K^2 \le \mu\,\eta_H(f_P, f_Q) + \mathrm{disc}(\hat P, q). \tag{1.8}
\]
It is immediate that the choice of q_min precisely minimizes the right-hand side of the previous bound. This is the discrepancy minimization (DM) algorithm of Cortes and Mohri (2013). To our knowledge, discrepancy minimization is the only adaptation algorithm that has been derived from learning guarantees. Besides its theoretical motivation, the DM algorithm has been shown to outperform several adaptation algorithms in different tasks (Cortes and Mohri, 2013). In the next section, however, we discuss an important drawback of the DM algorithm. Moreover, we propose a new, robust algorithm to address this limitation while still enjoying sound theoretical guarantees.

1.5 Algorithm

In the previous section we showed that the discrepancy is the correct measure of divergence between distributions in domain adaptation. Moreover, by selecting q_min as a reweighting function, we could ensure that the solution of the discrepancy minimization algorithm and the ideal solution are close. However, by using a fixed reweighting, we implicitly assumed a worst-case scenario. That is, we defined the discrepancy as a maximum over all pairs of hypotheses. But the maximizing pair of hypotheses may not even be among the candidates ever considered by the learning algorithm. Thus, a learning algorithm based on discrepancy minimization tends to be too conservative. More precisely, the bound given by Theorem 2 can, in some favorable cases, become loose. We attempt to address this issue by using a hypothesis-dependent reweighting.

1.5.1 Main idea

Our algorithm is motivated by the following observation: in the absence of a domain adaptation problem, the learner would have access to the labels of the points in T. It would therefore return the hypothesis h*, the solution of the optimization problem min_{h∈H} F(h), where F is the convex function defined for all h ∈ H by
\[
F(h) = \lambda\|h\|_K^2 + \mathcal{L}_{\hat P}(h, f_P). \tag{1.9}
\]
In view of that, we can formulate our objective, in the presence of a domain adaptation problem, as that of finding a hypothesis h whose loss L_P(h, f_P) with respect to the target domain is as close as possible to L_P(h*, f_P). To do so, we will in fact seek a hypothesis h that is as close as possible to h*, which would imply the closeness of the losses with respect to the target domain. We do not have access to f_P, however, and can only access the labels of the training sample S. Thus, in our objective function we must resort to using a reweighted empirical loss over the training sample S instead of L_P̂(h, f_P).
Allowing the weights to be in a richer space than the space of probabilities over SX could raise over-fitting concerns but, we will later see that this in fact does not affect our learning guarantees and leads to good empirical results. Of course, searching for Qh to directly minimize |LQh (h, fQ ) − LPb (h, fP )| is, in general, not possible since we do not have access to fP . But, it is instructive to consider the imaginary case where the average loss LPb (h, fP ) is known to us for any h ∈ H. Qh could then be determined via Qh = argmin |Lq (h, fQ ) − LPb (h, fP )|, (1.11) q∈F (SX ,R) where F(SX , R) is the set of real-valued functions defined over SX . For any h, we can in fact select Qh such that LQh (h, fQ ) = LPb (h, fP ) since Lq (h, fQ ) is a linear function of q and thus the optimization problem (1.11) reduces to solving a simple linear equation. With this choice of Qh , the objective functions F and G coincide and by minimizing G we can recover the ideal solution h∗ . Note that, in general, the DM algorithm could not recover that ideal solution. Even a finer discrepancy minimization algorithm exploiting the knowledge of LPb (h, fP ) for all h and seeking a distribution q0min minimizing maxh∈H |Lq (h, fQ ) − LPb (h, fP )| could not, in general, recover the ideal solution since we could not have Lq0min (h, fQ ) = LPb (h, fP ) for all h ∈ H. Of course, in practice access to LPb (h, fP ) is unfeasible since the sample T is unlabeled. Instead, we will consider a non-empty convex set of candidate hypotheses H 00 ⊆ H that could contain a good approximation of fP . Using H 00 as a set of surrogate labeling functions leads to the following definition of Qh instead of (1.11): Qh = argmin sup |Lq (h, fQ ) − LPb (h, h00 )|. (1.12) q∈F (SX ,R) h00 ∈H 00 The choice of the subset H 00 is of course key. Our choice will be based on the theoretical analysis of Section 1.6. Nevertheless, in the following section, we present the formulation of the optimization problem for an arbitrary choice of the convex subset H 00 . 20 1.5.2 Formulation of optimization problem The following result provides a more explicit expression for LQh (h, fQ ) leading to a simpler formulation of the optimization problem defining our algorithm. Proposition 3. For any h ∈ H, let Qh be defined by (1.12). Then, the following identity holds for any h ∈ H: LQh (h, fQ ) = 1 00 00 max L (h, h ) + min L (h, h ) . b b h00 ∈H 00 P 2 h00 ∈H 00 P Proof. For any h ∈ H, the equation Lq (h, fQ ) = l with l ∈ R admits a solution q ∈ F(SX , R). Thus, we have {Lq (h, fQ ) | q ∈ F(SX , R)} = R and for any h ∈ H, we can write LQh (h, fQ ) = max |l − LPb (h, h00 )| 00 00 argmin l∈{Lq (h,fQ ):q∈F (SX ,R)} h ∈H = argmin max |l − LPb (h, h00 )| 00 00 h ∈H l∈R o n 00 00 (h, h ) (h, h ) − l, l − L = argmin max max L b b P P h00 ∈H 00 l∈R o n 00 00 (h, h ) (h, h ) − l, l − min L = argmin max max L Pb Pb 00 00 00 00 l∈R = 1 2 h ∈H h ∈H 00 00 max L (h, h ) + min L (h, h ) , Pb Pb 00 00 00 00 h ∈H h ∈H LPb (h, h00 ). since the minimizing l is obtained for max LPb (h, h00 ) − l = l − min 00 00 00 00 h ∈H h ∈H In view of this proposition, with our choice of Qh based on (1.12), the objective function G of our algorithm (1.10) can be equivalently written for all h ∈ H as follows: 1 00 00 (h, h ) . (1.13) (h, h ) + min L max L b b h00 ∈H 00 P 2 h00 ∈H 00 P The function h 7→ maxh00 ∈H 00 LPb (h, h00 ) is convex as a pointwise maximum of the convex functions h 7→ LPb (h, h00 ). 
Since the loss function L is jointly convex, so is LPb , therefore, the function derived by partial minimization over a non-empty convex set H 00 for one of the arguments, h 7→ minh00 ∈H 00 LPb (h, h00 ), also defines a convex function (Boyd and Vandenberghe, 2004). Thus, G is a convex function as a sum of convex functions. G(h) = λkhk2K + 21 1.6 Generalized Discrepancy Here we introduce the notion of generalized discrepancy, which will be used to derive learning guarantees for our algorithm. Let A(H) be the set of all functions U : h 7→ Uh mapping H to F(SX , R) such that for all h ∈ H, h 7→ LUh (h, fQ ) is a convex function. A(H) contains all constant functions U such that Uh = q for all h ∈ H, where q is a distribution over SX . We will abuse the notation and denote these functions also by q. By Proposition 3, A(H) also includes the function Q : h → Qh used by our algorithm. Definition 8 (Generalized discrepancy). For any U ∈ A(H), we define the notion of generalized discrepancy between Pb and U as the quantity DISC(Pb, U) given by DISC(Pb, U) = max h∈H,h00 ∈H 00 |LPb (h, h00 ) − LUh (h, fQ )|. (1.14) We also denote by dP1 (fP , H 00 ) the following distance of fP to H 00 : b dP1 (fP , H 00 ) = min00 E |h0 (x) − fP (x)|. b h0 ∈H (Pb (1.15) We now provide an upper bound on the Y-discrepancy in terms of the generalized disb crepancy as well as the term dP1 (fP , H 00 ). This bound can be used with Proposition 1 to provide a generalization bound based on the generalized discrepancy. Proposition 4. For any distribution q over SX and any set H 00 , the following inequality holds: b discY (Pb, q) ≤ DISC(Pb, q) + µ dP1 (fP , H 00 ). Proof. For any h0 ∈ H 00 , by the triangle inequality, we can write discY (Pb, q) = sup |Lq (h, fQ ) − LPb (h, fP )| h∈H ≤ sup |Lq (h, fQ ) − LPb (h, h0 )| + sup |LPb (h, h0 ) − LPb (h, fP )| h∈H h∈H ≤ sup sup |Lq (h, fQ ) − LPb (h, h00 )| + sup |LPb (h, h0 ) − LPb (h, fP )|. h∈H h00 ∈H 00 h∈H By the µ-admissibility of the loss, the last term can be bounded as follows: sup |LPb (h, h0 ) − LPb (h, fP )| ≤ µ E |fP (x) − h0 (x)|. Pb h∈H 22 fQ ⌘H (fP , fQ ) fP b P 00 1 (fP , H ) H 00 H Figure 1.4: Depiction of the distances ηH (fP , fQ ) and δ1P (fP , H 00 ). b Using this inequality and minimizing over h0 ∈ H 00 yields: b discY (Pb, q) ≤ sup sup |Lq (h, fQ ) − LPb (h, h00 )| + µ dP1 (fP , H 00 ) h∈H h00 ∈H 00 b = DISC(Pb, q) + µ dP1 (fP , H 00 ), which completes the proof. Corollary 2. Let H 00 ⊂ H be a convex set and q a distribution over SX . Then, for any δ > 0, with probability at least 1 − δ: r log(1/δ) b . LP (h, fP ) ≤ Lq (h, fQ ) + DISC(Pb, q) + µdP1 (fP , H 00 ) + 2Rn (HLP ) + M 2n (1.16) In general, the generalized discrepancy bound given by (1.16) and the discrepancy bound derived in (1.6) are not comparable. Indeed, as we shall see, the generalized discrepancy is always tighter than the discrepancy. However, as depicted in Figure 1.4, b δ1P (fP , H 00 ) can be larger than ηH (fP , fQ ). Nonetheless, when L is an LP loss for some p ≥ 1 we can show the existence of a set H 00 for which (1.16) is a tighter bound than (1.6). The result is expressed in terms of the local discrepancy defined by: discH 00 (Pb, q) = sup h∈H,h00 ∈H 00 |LPb (h, h00 ) − Lq (h, h00 )|, which is a finer measure than the standard discrepancy for which the supremum is defined over a pair of hypothesis both in H ⊇ H 00 . Theorem 3. Let q be an arbitrary distribution over SX and let L be the Lp loss for some p ≥ 1. 
If H := {B(r) : r ≥ 0} denotes the set of balls defined by B(r) = {h00 ∈ 23 H|Lq (h00 , fQ ) ≤ rp }, then there exists H 00 ∈ H such that the following holds: b DISC(Pb, q) + µ dP1 (fP , H 00 ) ≤ discH 00 (Pb, q) + µ ηH (fP , fQ ). 1 Proof. Fix a distribution q over SX . Let h∗0 be an element of argminh0 ∈H LPb (h0 , fP ) p + 1 Lq (h0 , fQ ) p . Choose H 00 ∈ H as H 00 = {h00 ∈ H|Lq (h00 , fQ ) ≤ rp } with r = 1 Lq (h∗0 , fQ ) p . Then, by definition, h∗0 is in H 00 . Furthermore, for the Lp loss, it is not 1 hard to show that for all h, h00 ∈ H, |Lq (h, h00 ) − Lq (h, fQ )| ≤ µ[Lq (h00 , fQ )] p (see Appendix A.3). In view of this inequality, we can write: DISC(Pb, q) = ≤ sup h∈H,h00 ∈H 00 sup h∈H,h00 ∈H 00 |LPb (h, h00 ) − Lq (h, fQ )| |LPb (h, h00 ) − Lq (h, h00 )| + sup h∈H,h00 ∈H 00 1 ≤ discH 00 (Pb, q) + sup µLq (h00 , fQ ) p |Lq (h, h00 ) − Lq (h, fQ )| h00 ∈H 00 1 ≤ discH 00 (Pb, q) + µr = discH 00 (Pb, q) + µLq (h∗0 , fQ ) p . Using this inequality, Jensen’s inequality, and the fact that h∗0 is in H 00 , we can write b µ dP1 (fP , H 00 ) + DISC(Pb, q) 1 ≤ µ min00 E |fP (x) − h0 (x)| + µLq (h∗0 , fQ ) p + discH 00 (Pb, q) h0 ∈H x∈Pb 1 1 ≤ µ min00 E |fP (x) − h0 (x)|p p + µLq (h∗0 , fQ ) p + discH 00 (Pb, q) h0 ∈H x∈Pb 1 1 ≤ µ LPb (h∗0 , fP ) p + µLq (h∗0 , fQ ) p + discH 00 (Pb, q) 1 1 = µ min LPb (h0 , fP ) p + Lq (h0 , fQ ) p + discH 00 (Pb, q) h0 ∈H ≤ µ min h0 ∈H max |fP (x) − h0 (x)| + x∈supp(Pb) = µ ηH (fP , fQ ) + discH 00 (Pb, q). max |fQ (x) − h0 (x)| + discH 00 (Pb, q) b x∈supp(Q) which concludes the proof. The previous theorem shows that the generalized discrepancy is in fact a finer measure than the discrepancy. Therefore, an algorithm minimizing (1.16) would benefit from a superior learning guarantee than the DM algorithm. However the problem defined by (1.16) is again not jointly convex on q and h. Therefore, we proceed as in the case of discrepancy minimization and show that our proposed algorithm minimizes a bound on the distance of the solution of our algorithm to the ideal hypothesis h∗ . 24 Theorem 4. Let U be an arbitrary element of A(H) and let h∗ and hU be the hypotheses minimizing λkhk2K + LPb (h, fP ) and λkhk2K + LUh (h, fQ ) respectively. Then, the following inequality holds for any convex set H 00 ⊆ H: b λkh∗ − hU k2K ≤ µ dP1 (fP , H 00 ) + DISC(Pb, U). (1.17) Proof. Fix U ∈ A(H) and let GPb denote h 7→ LPb (h, fP ) and GU the function h 7→ LUh (h, fQ ). Since h 7→ λkhk2K + GPb (h) is convex and differentiable and since h∗ is its minimizer, the gradient is zero at h∗ , that is 2λh∗ = −∇GPb (h∗ ). Similarly, since h 7→ λkhk2K + GU (h) is convex, it admits a sub-differential at any h ∈ H. Since hU is a minimizer, its sub-differential at hU must contain 0. Thus, there exists a sub-gradient g0 ∈ ∂GU (hU ) such that 2λhU = −g0 , where ∂GU (hU ) denotes the sub-differential of GU at hU . Using these two equalities we can write 2λkh∗ − hU k2K = hh∗ − hU , g0 − ∇GPb (h∗ )i = hg0 , h∗ − hU i − h∇GPb (h∗ ), h∗ − hU i ≤ GU (h∗ ) − GU (hU ) + GPb (hU ) − GPb (h∗ ) = LPb (hU , fP ) − LUh (hU , fQ ) + LUh (h∗ , fQ ) − LPb (h∗ , fP ) ≤ 2 sup |LPb (h, fP ) − LUh (h, fQ )|, h∈H where, for the first inequality, we used the convexity of GU combined with the subgradient property of g0 ∈ ∂GU (hU ), and the convexity of GPb . 
For any h ∈ H, using the µ-admissibility of the loss, we can upper bound the operand of the max operator as follows: |LPb (h, fP ) − LUh (h, fQ )| ≤ |LPb (h, fP ) − LPb (h, h0 )| + |LPb (h, h0 ) − LUh (h, fQ )| ≤ µ E |fP (x) − h0 (x)| + sup |LPb (h, h00 ) − LUh (h, fQ )|, h00 ∈H 00 x∼Pb where h0 is an arbitrary element of H 00 . Since this bound holds for all h0 ∈ H 00 , it follows immediately that λkh∗ − hU k2K ≤ µ min00 E |fP (x) − h0 (x)| + sup sup |LPb (h, h00 ) − LUh (h, fQ )|, h0 ∈H h∈H h00 ∈H 00 Pb which concludes the proof. It is now clear that our choice of Q : h 7→ Qh minimizes the right hand side of (1.17) 25 among all functions U ∈ A(H). Indeed, for any U we have DISC(Pb, U) = sup sup |LPb (h, h00 ) − LUh (h, fQ )| h∈H h00 ∈H 00 ≥ sup min sup |LPb (h, h00 ) − Lq (h, fQ )| h∈H q∈F (SX ) h00 ∈H 00 = sup sup |LPb (h, h00 ) − LQh (h, fQ )| = DISC(Pb, Q). h∈H h00 ∈H 00 In view of Theorem 3 we have that for any constant function U ∈ A(H) with Uh = q for some fixed distribution q over SX , the right-hand side of the bound of Theorem 2 is lower bounded by the right-hand side of the bound of Theorem 4, since the local discrepancy is a finer quantity than the discrepancy: discH 00 (Pb, q) ≤ disc(Pb, q). Thus, our algorithm benefits from a more favorable guarantee than the DM algorithm for that particular choice of H 00 , especially since our choice of Q is based on the minimization over all elements in A(H) and not just the subset of constant functions mapping to a distribution. The following pointwise guarantee follows directly from Theorem 4. Corollary 3. Let h∗ be a minimizer of λkhk2K + LPb (h, fP ) and hQ a minimizer of λkhk2K + LQh (h, fQ ). Then, the following holds for any convex set H 00 ⊆ H and for all (x, y) ∈ X × Y: s b µ dP1 (fP , H 00 ) + DISC(Pb, Q) , (1.18) |L(hQ (x), y) − L(h∗ (x), y)| ≤ µR λ where R2 = supx∈X K(x, x). Proof. By the µ-admissibility of the loss, the reproducing property of H, and the Cauchy-Schwarz inequality, the following holds for all x ∈ X and y ∈ Y: |L(hQ (x), y) − L(h∗ (x), y)| ≤ µ|hQ (x) − h∗ (x)| = µ|hhQ − h∗ , K(x, ·)iK | p ≤ µkhQ − h∗ kK K(x, x) ≤ µRkhQ − h∗ kK . Upper bounding khQ − h∗ kK using Theorem 4 and using the fact that Q : h → Qh is a minimizer of the bound over all choices of U ∈ A(H) yields the desired result. The pointwise loss guarantee just presented can be directly used to bound the difference of the expected loss of h∗ and hQ in terms of the same upper bounds, e.g., s b µ dP1 (fP , H 00 ) + DISC(Pb, Q) ∗ LP (hQ , fP ) ≤ LP (h , fP ) + µR . (1.19) λ Similarly, Theorem 3 directly implies the following Corollary. 26 Algorithm 1 Generalized Discrepancy Minimization (GDM) for Lp losses Require: Source sample S , target sample T , radius r Set qmin = DM(S, T ); Set H 00 = {h00 ∈ H | Lqmin (h00 , fQ ) ≤ rp }; Let Qh = argminq∈Rm suph00 ∈H 00 |Lq (h, fQ ) − LPb (h, h00 )|; Let hQ = argminh∈H λkhk2 + LQh (h, fQ ); return hQ Corollary 4. Let h∗ be a minimizer of λkhk2K + LPb (h, fP ) and hQ a minimizer of λkhk2K +LQh (h, fQ ). Let supx∈X K(x, x) = R2 . Then, there exists a choice of H 00 ∈ H for which the following inequality holds uniformly over (x, y) ∈ X × Y: s µηH (fP , fQ )+discH 00 (Pb, qmin ) , |L(hQ (x), y) − L(h∗ (x), y)| ≤ µR λ where qmin is the solution of the DM algorithm. The choice of the set H 00 defining our algorithm is strongly motivated by the theoretical results of this section. Indeed, in view of Theorem 3 we restrict our choice of H 00 to the family H , parametrized only by the radius r. 
Notice that DISC(Pb, Q) and b δ1P (fP , H 00 ) then become functions of r. Therefore, we may select this parameter as a minimizer of (1.19). This can be done by using as a validation set a small amount of labeled data from the target domain which is typically available in practice. 1.6.1 Comparison with other learning bounds We now compare the learning bounds just derived for our algorithm with those of some common reweighting techniques. In particular, we compare our bounds with those of Cortes et al. (2008) for the KMM algorithm. A similar comparison however can be derived for other algorithms based on importance weighting such as KLIEP or uLSIF. Assume P and Q admit densities p and q respectively. For every x ∈ X we denote the importance ratio and by β = β S its restriction to SX . We also let by β(x) = p(x) q(x) X βb be the solution to the optimization problem solved by the KMM algorithm. Let hβ denote the solution to min λkhk2 + Lβ (h, fQ ), (1.20) h∈H and hβb be the solution to min λkhk2 + Lβb(h, fQ ). h∈H (1.21) The following proposition due to Cortes et al. (2008) relates the error of these hypotheses. The proposition requires the kernel K to be a strictly positive definite universal 27 kernel, with Gram matrix K given by Kij = K(xi , xj ). Proposition 5. Assume L(h(x), y) ≤ 1 for all (x, y) ∈ X × Y, h ∈ H. For any δ > 0, with probability at least 1 − δ we have: 1 2 µ2 R2 λmax (K) B 0 √ |LP (hβ , fP ) − LP (hβb, fP )| ≤ λ m r r B 02 1 κ1/2 2 + 1 + 2 log , (1.22) + 1/2 n δ λmin (K) m where and B 0 are the hyperparameters defining the KMM algorithm. This bound and the one obtained in (1.19) are of course not comparable since the dependence on µ, R and λ is different. And in some cases this dependency is more favorable in (1.22) whereas for other values of these parameters (1.19) provides a better bound. Moreover, (1.22) depends on the condition number of K which can be really large in practice. However, a crucial difference between these bounds is that (1.19) is given in terms of the ideal hypothesis h∗ and (1.22) is given in terms of hβ , which, in view of the results of (Cortes et al., 2010) is not guaranteed to have a good performance on the target distribution. Therefore (1.22) does not provide an informative bound in general. 1.6.2 Scenario of additional labeled data Here, we consider a rather common scenario in practice where, in addition to the labeled sample S drawn from the source domain and the unlabeled sample T from the target domain, the learner receives a small amount of labeled data from the target domain T 0 = ((x001 , y100 ), . . . , (x00s , ys00 )) ∈ (X × Y)s . This sample is typically too small to be used solely to train an algorithm and achieve a good performance. However, it can be useful in at least two ways that we discuss here. One important benefit of T 0 is to serve as a validation set to determine the parameter r that defines the convex set H 00 used by our algorithm. Indeed, as the size of T 0 increases, we our confidence on the choice of the best value of r increases, thereby reducing our test error.xThe sample T 0 can also be used to enhance the discrepancy minimization algorithm as we now show. Let Pb0 denote the empirical distribution associated to T 0 . To take advantage of T 0 , the DM algorithm can be trained on the sample of size (m + s) obtained by combining S and T 0 , which corresponds to the new empirb0 = m Q b + s Pb0 . 
Note that for a fixed m and large values of s, ical distribution Q m+s m+s b0 essentially ignores the points from the source distribution Q, which corresponds to Q the standard supervised learning scenario in the absence of adaptation. Let q0min denote b0 . Since supp(Q b0 ) ⊇ supp(Q), b the the discrepancy minimization solution when using Q 28 discrepancy using q0min is a lower bound on the discrepancy using qmin : disc(q0min , Pb) = ≤ 1.7 min b0 ) supp(q)⊆supp(Q min b supp(q)⊆supp(Q) disc(Pb, q) disc(Pb, q) = disc(qmin , Pb). Optimization Solution As shown in Section 1.5.2, the function G defining our algorithm is convex and therefore expression (1.13) is a convex optimization problem. Nevertheless, its formulation does not admit a simple algorithmic solution. Indeed, evaluating the term maxh00 ∈H 00LPb (h,h00 ) defining our objective requires solving a non-convex optimization problem, which can be hard. Here, we exploit the structure of this problem to cast it as a semi-definite program (SDP) for the case of the L2 loss. 1.7.1 SDP formulation As discussed in Section 1.6, the choice of H 00 is a key component of our algorithm. In view of Corollary 4, we will consider the set H 00 = {h00 | Lqmin (h00 , fQ ) ≤ r2 }. Equivalently, as a result of the reproducing property representer theorem, H 00 may Pm of H and the Pm 1/2 m be defined as {a ∈ R | j=1 qmin (xj )( i=1 ai qmin (xi ) K(xi , xj ) − yj )2 ≤ r2 }. By the representer theorem, again, we know the solution to (1.13) will be of the form h = Pn 0 −1/2 n i=1 bi K(xi , ·), for bi ∈ R. Therefore, given normalized kernel matrices Kt , Ks , −1 0 0 ij 1/2 Kst defined respectively as Kij qmin (xj )1/2 K(xi , xj ) t = n K(xi , xj ), Ks = qmin (xi ) −1/2 and Kij qmin (xj )1/2 K(x0i , xj ), problem (1.13) is equivalent to st = n 1 2 2 > kKst a−Kt bk , (1.23) kKst a−Kt bk + min min λb Kt b+ max a∈Rm a∈Rm b∈Rn 2 2 2 2 2 kKs a−yk ≤r kKs a−yk ≤r where y = (qmin (x1 )1/2 y1 , . . . , qmin (xm )1/2 ym ) is the vector of normalized labels. Lemma 3. The Lagrangian dual of the problem max m a∈R kKs a−yk2 ≤r2 1 kKst ak2 − b> Kt Kst a, 2 29 g1 (h) = 0 h0 + ⇤b h h0 g2 (h) = 0 Figure 1.5: Illustration of the sampling process on the set H 00 . is given by min γ η≥0,γ Friday, February 7, 2014 s. t. 2 − 21 K> st Kst + ηKs 1 > b Kt Kst 2 − ηy> Ks 1 > K Kb 2 st t 2 − ηKs y η(kyk − r2 ) + γ ! 0. Furthermore, the duality gap for these problems is zero. The proof of the lemma is given in Appendix A.1. The lemma helps us derive the following equivalent SDP formulation for our original optimization problem. The proof the following proposition is given in Appendix A.1. Proposition 6. Optimization problem (1.23) is equivalent to the following SDP: 1 Tr(K> st Kst Z) − β − α α,β,ν,Z,z 2 ! 1e 1e K − K νK y + Kz νK2s + 12 K> st s st 4 4 0 s. t. e νy> Ks + 14 z> K α + ν(kyk2 − r2 ) ! λKt + K2t 21 Kt Kst z 0 1 > > z K K β t st 2 Z z 0 ∧ ν ≥ 0 ∧ Tr(K2s Z) − 2y> Ks z + kyk2 ≤ r2 , z> 1 max e = K> Kt (λKt + K2 )† Kt Kst and A† denotes the pseudo-inverse of the matrix where K st t A. Albeit this problem can be solved in polynomial time with a standard convex optimization solver, in practice solving a moderately sized SDP can be really slow. Therefore, in the next section we propose a more efficient solution to the optimization problem using sampling, which helps reducing the problem to a simple QP. 30 1.7.2 QP formulation The SDP formulation described in the previous section is applicable for a specific choice of H 00 . 
In this section, we present an analysis that holds for an arbitrary convex set H 00 . First, notice that the problem of minimizing G (expression (1.13)) is related to the minimum enclosing ball (MEB) problem. For a set D ⊆ Rd , the MEB problem is defined as follows: min max ku − vk2 . u∈Rd v∈D Omitting the regularization and the min term from (1.13) leads to a problem similar to the MEB. Thus, we could benefit from the extensive literature and algorithmic study available for this problem (Welzl, 1991; Kumar et al., 2003; Schőnherr, 2002; Fischer et al., 2003; Yildirim, 2008). However, to the best of our knowledge, there is currently no solution available to this problem in the case of an infinite set D, as in the case of our problem. Instead, we present a solution for solving an approximation of (1.13) based on sampling. Let {h1 , . . . , hk } be a set of hypotheses on the boundary of H 00 , ∂H 00 , and let C = C(h1 , . . . , hk ) denote their convex hull. The following is the sampling-based approximation of (1.13) that we consider: min λkhk2K + h∈H 1 1 max LPb (h, hi ) + min L b (h, h0 ). 2 i=1,...,k 2 h0 ∈C P (1.24) Proposition 7. Let Y = (Yij ) ∈ Rk×n be the matrix defined by Yij = n−1/2 hi (x0j ) and P n y0 = (y10 , . . . , yk0 )> ∈ Rk the vector defined by yi0 = n−1 j=1 hi (x0j )2 . Then, the dual problem of (1.24) is given by γ > 1 −1 > γ max − Y> α + Kt λI + Kt Y α+ α,γ,β 2 2 2 1 − γ > Kt K†t γ + α> y0 − β 2 1 s.t. 1> α = , 1β ≥ −Yγ, α ≥ 0, 2 (1.25) where 1 is the vector in Rk with all components equal to 1. Furthermore, the solution a solution (α, γ, β) of (1.25) by ∀x, h(x) = Pn h of (1.24) can be recovered from 1 1 −1 > i=1 ai K(xi , x), where a = λI + 2 Kt ) (Y α + 2 γ). The proof of the proposition is given in Appendix A.2. The result shows that, given a finite sample h1 , . . . , hk on the boundary of H 00 , (1.24) is in fact equivalent to a standard QP. Hence, a solution can be found efficiently with one of the many off-the-shelf algorithms for quadratic programming. We now describe the process of sampling from the boundary of the set H 00 , which is a necessary step for defining problem (1.24). We consider compact sets of the form 31 H 00 := {h00 ∈ H | gi (h00 ) ≤ 0}, where the functions gi are continuous and convex. For instance, we could consider the set H 00 defined in the section. More generally, Pprevious m 00 00 we can consider a family of sets Hp = {h ∈ H| | i=1 qmin (xi )|h(xi ) − yi |p ≤ rp }. Assume that there exists h0 satisfying gi (h0 ) < 0. Our sampling process is illustrated by Figure 1.5 and works as follows: pick a random direction b h and define λi to be the minimal solution to the system (λ ≥ 0) ∧ (gi (h0 + λb h) = 0). Set λi = ∞ if no solution is found and define λ∗ = mini λi . By the convexity and compactness of H 00 we can guarantee that λ∗ < ∞. Furthermore, he hypothesis h = h satisfies h ∈ H 00 and gj (h) = 0 for j such that λj = λ∗ . The latter is h0 + λ∗b straightforward, to verify the former, assume that gi (h0 + λ∗b h) > 0 for some i. The 0 continuity of gi would imply the existence of λi with 0 < λ0i < λ∗ ≤ λi such that h) = 0. This would contradict the choice of λi , thus, the inequality gi (h0 + gi (h0 + λ0ib ∗b λ h) ≤ 0 must hold for all i. Since a point h0 with gi (h0 ) < 0 can be obtained by solving a convex program and solving the equations defining λi is, in general, simple, the process described provides an efficient way of sampling points from the convex set H 00 . 
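The sampling procedure just described can be sketched as follows, treating a hypothesis as a finite parameter vector (for instance the dual coefficients of a kernel expansion) and assuming a strictly feasible interior point h_0; with several constraints g_i, one can pass g = max_i g_i, whose zero level set is the boundary of H''. The bisection used to locate the crossing is our own implementation choice, since the text only requires solving g_i(h_0 + λ ĥ) = 0.

```python
import numpy as np

def sample_boundary_point(g, h0, dim, rng, tol=1e-8, max_expand=60):
    """Return a point (approximately) on the boundary {g = 0} of the convex set {g <= 0}.

    g   : convex constraint function of a parameter vector, with g(h0) < 0
    h0  : strictly feasible interior point of H''
    dim : dimension of the parameter vector
    """
    d = rng.standard_normal(dim)
    d /= np.linalg.norm(d)                      # random direction, uniform on the sphere
    lo, hi = 0.0, 1.0
    for _ in range(max_expand):                 # bracket the boundary crossing
        if g(h0 + hi * d) > 0:
            break
        lo, hi = hi, 2.0 * hi
    else:
        raise RuntimeError("no crossing found; H'' may be unbounded in this direction")
    while hi - lo > tol:                        # bisection: g(h0 + lo*d) <= 0 < g(h0 + hi*d)
        mid = 0.5 * (lo + hi)
        if g(h0 + mid * d) > 0:
            hi = mid
        else:
            lo = mid
    return h0 + lo * d
```

For the sets H''_p above, g would map the parameter vector of h to the quantity sum_i q_min(x_i)|h(x_i) - y_i|^p - r^p.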
1.7.3 Implementation for the L2 Loss

We now describe how to fully implement our sampling-based algorithm for the case where L is the L2 loss. In view of the results of Section 1.4, we let H'' = {h'' | ‖h''‖_K ≤ Λ ∧ L_{q_min}(h'', f_Q) ≤ r²}. We first describe the steps needed to find a point h_0 in H''. Let h_Λ be such that ‖h_Λ‖_K = Λ and let λ_r in R_+ be such that the solution h_r of the optimization problem

min_{h in H} λ_r ‖h‖_K² + L_{q_min}(h, f_Q)

satisfies L_{q_min}(h_r, f_Q) = r². It is easy to verify that the existence of λ_r is guaranteed for min_{h in H} L_{q_min}(h, f_Q) ≤ r² ≤ Σ_{i=1}^m q_min(x_i) y_i². Furthermore, the convexity of the norm, as well as of the loss, implies that the point h_0 = ½(h_r + h_Λ) is in the interior of H''. Of course, finding λ_r with the desired properties is not possible. However, since r is chosen via validation, we do not need to find λ_r as a function of r. Instead, we can simply select λ_r, and not r, through cross-validation.

In order to complete the sampling process, we must have an efficient way of selecting a random direction ĥ. If H ⊂ R^d is a set of linear hypotheses, a direction ĥ can be sampled uniformly by letting ĥ = ξ/‖ξ‖, where ξ is a standard Gaussian random variable in R^d. If H is a subset of an RKHS, by the representer theorem, we only need to consider hypotheses of the form h = Σ_{i=1}^m α_i K(x_i, ·). Therefore, we can sample a direction ĥ = Σ_{i=1}^m α'_i K(x_i, ·), where the vector α' = (α'_1, ..., α'_m) is drawn uniformly from the unit sphere in R^m.

A full implementation of our algorithm thus consists of the following steps:
• find the distribution q_min = argmin_{q in Q} disc(q, P̂); this can be done by using the smooth approximation algorithm of Cortes and Mohri (2013);
• sample points from the set H'' using the sampling process described above;
• solve the QP introduced in Section 1.7.2.
Given that our algorithm only requires solving a simple QP, its complexity is similar to that of algorithms such as KMM and DM, which also require solving a QP.

[Figure 1.6: (a) Hypotheses obtained by training on source (green circles), target (red triangles) and using the DM (dashed blue) and GDM (solid blue) algorithms. (b) Objective functions for the source and target distributions as well as for the GDM and DM algorithms; the vertical lines show the minimizer for each algorithm, and the hypothesis set H and surrogate hypothesis set H'' ⊆ H are shown at the bottom.]

1.8 Experiments

Here, we report the results of extensive comparisons between GDM and several other adaptation algorithms, which demonstrate the benefits of our algorithm. We use the implementation described in the previous section. The source code for our algorithm, as well as for all other baselines described in this section, can be found at http://cims.nyu.edu/~munoz.

1.8.1 Synthetic Data Set

To compare the performance of the GDM and DM algorithms, we considered the following synthetic one-dimensional task, similar to the one considered by Huang et al. (2006): the source domain examples were sampled from the uniform distribution over the interval [.2, 1] and target instances were sampled uniformly over [0, .25]. The labels were given by the map x ↦ −x + x³ + ξ, where ξ is a Gaussian random variable with mean 0 and standard deviation 0.1.
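A small sketch generating data according to this description is given below; the sample sizes and the random seed are illustrative and are not those of the reported experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(low, high, n, rng):
    """x ~ Uniform[low, high], y = -x + x**3 + Gaussian noise with standard deviation 0.1."""
    x = rng.uniform(low, high, size=n)
    y = -x + x ** 3 + rng.normal(0.0, 0.1, size=n)
    return x, y

x_src, y_src = sample_domain(0.2, 1.0, 200, rng)   # source domain
x_tgt, y_tgt = sample_domain(0.0, 0.25, 200, rng)  # target domain (labels unseen at training time)
```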
Our hypothesis set was defined by the family of linear functions without an offset. Figure 1.6(a) shows the regression hypotheses obtained by training the DM and GDM algorithms as well as those obtained by training on the source and target distributions. The ideal hypothesis is shown in red. Notice how the GDM solution gives a closer approximation than DM to the ideal solution. In order to better understand the difference between the solutions of these algorithms, Figure 1.6(b) depicts the objective function minimized by each algorithm as a function of the slope w of the linear function, the only variable of the hypothesis. The vertical lines show the value of the minimizing hypothesis for each loss. Keeping in mind that the regularization parameter λ used in ridge regression corresponds to a Lagrange multiplier for the constraint w² ≤ Λ² for some Λ (Cortes and Mohri, 2013, Lemma 1), the hypothesis set H = {w | w² ≤ Λ²} is depicted at the bottom of this plot. The shaded region represents the set H'' = H ∩ {h'' | L_{q_min}(h'', f_Q) ≤ r}. It is clear from this plot that reweighting the sample using q_min helps approximate the target loss function. Nevertheless, it is our hypothesis-dependent reweighting that allows both objective functions to be uniformly close. This should come as no surprise since our algorithm was precisely designed to achieve that.

1.8.2 Adaptation Data Sets

We now present the results of evaluating our algorithm against several other adaptation algorithms. GDM is compared against DM and against training on the uniform distribution. The following baselines were also used:
1. The KMM algorithm (Huang et al., 2006), which reweights examples from the source distribution in order to match the mean of the source and target data in a feature space induced by a universal kernel. The hyper-parameters of this algorithm were set to the recommended values of B = 1000 and ε = √m/(√m − 1).
2. KLIEP (Sugiyama et al., 2007). This algorithm estimates the importance ratio of the source and target distributions by modeling this ratio as a mixture of basis functions and learning the mixture coefficients from the data. Gaussian kernels were used as basis functions for this algorithm and for KMM. The kernel bandwidth was selected from the set {σd : σ = 2⁻⁵, ..., 2⁵} via validation on the test set, where d is the mean distance between points sampled from the source domain.
3. FE (Daumé III, 2007). This algorithm maps source and target data into a common high-dimensional feature space where the difference between the distributions is expected to be reduced.
In addition to these algorithms, we compare GDM to the ideal hypothesis obtained by training on T, which we denote by Tar.

[Figure 1.7: (a) MSE performance of the different adaptation algorithms when adapting from kin-8fh to the three other kin-8xy domains. (b) Relative error of DM over GDM as a function of the ratio r/Λ.]

We selected the set of linear functions as our hypothesis set. The learning algorithm used for all tasks was ridge regression and the performance was evaluated by the mean squared error. We follow the setup of Cortes and Mohri (2011) and, for all adaptation algorithms, we selected the parameter λ via 10-fold cross-validation over the training data by using a grid search over the set of values λ ∈ {2⁻¹⁰, ..., 2¹⁰}.
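A minimal sketch of this model-selection step is shown below, assuming scikit-learn (the dissertation does not specify its tooling) and placeholder source data; here `alpha` plays the role of λ and the intercept is disabled to match the hypothesis set of linear functions without an offset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_src = rng.uniform(0.2, 1.0, size=(200, 1))                        # placeholder source features
y_src = -X_src[:, 0] + X_src[:, 0] ** 3 + rng.normal(0, 0.1, 200)   # placeholder source labels

param_grid = {"alpha": [2.0 ** k for k in range(-10, 11)]}
search = GridSearchCV(Ridge(fit_intercept=False), param_grid,
                      cv=10, scoring="neg_mean_squared_error")
search.fit(X_src, y_src)
best_lambda = search.best_params_["alpha"]
```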
The results of training on the target distribution are presented for a parameter λ tuned via 10-fold cross validation over the target data. We used the QP implementation of our algorithm with the sampling set H 00 and the sampling mechanism defined at the end of Section 1.7.2, where the parameter λr was chosen from the same set as λ via cross-validation on a small amount of data from the target distribution. Whereas there exist other validation techniques such as transfer cross validation (Zhong et al., 2010), these techniques rely on importance weighting and as such suffer from the issues previously mentioned. In order to achieve a fair comparison, all other algorithms were allowed to use the small amount of labeled data too. Since, with the exception of FE, all other baselines do not propose a way of dealing with labeled data from the target distribution, we simply added this data to the training set and ran the algorithms on the extended source data as discussed in Section 1.6.2. The first task we considered is given by the 4 kin-8xy Delve data sets (Rasmussen et al., 1996). These data sets are variations of the same model: a realistic simulation of the forward dynamics of an 8 link all-revolute robot arm. The task in all data sets consists of predicting the distance of the end-effector from a target. The data sets differ by the degree of non-linearity (fairly linear, x=f, or non-linear, x=n) and the amount of noise in the output (moderate, y=m, or high, y=h). The data set defines 4 different domains, that is 12 pairs of different distributions and labeling functions. A sample 35 Table 1.2: Adaptation from books (B), kitchen (K), electronics (E) and dvd (D) to all other domains. Normalized results: MSE of training on the unweighted source data is equal to 1. Results in bold represent the algorithm with the lowest MSE. Task: Sentiment S B K E D T K E D B E D B K D B K E GDM DM Unif Tar KMM KLIEP F 0.763±(0.222) 1.056±(0.289) 1.00 0.517±(0.152) 3.328±(0.845) 3.494±(1.144) 0.942±(0.093) 0.574±(0.211) 1.018±(0.206) 1.00 0.367±(0.124) 3.018±(0.319) 3.022±(0.318) 0.857±(0.135) 0.936±(0.256) 1.215±(0.255) 1.00 0.623±(0.152) 2.842±(0.492) 2.764±(0.446) 0.936±(0.110) 0.854±(0.119) 1.258±(0.117) 1.00 0.665±(0.085) 2.784±(0.244) 2.642±(0.218) 1.047±(0.047) 0.975±(0.131) 1.460±(0.633) 1.00 0.653±(0.201) 2.408±(0.582) 2.157±(0.255) 0.969±(0.131) 0.884±(0.101) 1.174±(0.140) 1.00 0.665±(0.071) 2.771±(0.157) 2.620±(0.210) 1.111±(0.059) 0.723±(0.138) 1.016±(0.187) 1.00 0.551±(0.109) 3.433±(0.694) 3.290±(0.583) 1.035±(0.059) 1.030±(0.312) 1.277±(0.283) 1.00 0.636±(0.176) 2.173±(0.249) 2.223±(0.293) 0.955±(0.199) 0.731±(0.171) 1.005±(0.166) 1.00 0.518±(0.117) 3.363±(0.402) 3.231±(0.483) 0.974±(0.102) 0.992±(0.191) 1.026±(0.090) 1.00 0.740±(0.138) 2.571±(0.616) 2.475±(0.400) 0.986±(0.041) 0.870±(0.212) 1.062±(0.318) 1.00 0.557±(0.137) 2.755±(0.375) 2.741±(0.347) 0.940±(0.087) 0.674±(0.135) 0.994±(0.171) 1.00 0.478±(0.098) 2.939±(0.501) 2.878±(0.418) 0.907±(0.081) of 200 points from each domain was used for training and 10 labeled points from the target distribution were used to select H 00 . The experiment was carried out 10 times and the results of testing on a sample of 400 points from the target domain are reported in Figure 1.7(a). The bars represent the median performance of each algorithm. The error bars are the .25 and .75 quantiles respectively. All results were normalized in such a way that the median performance of training on the source is equal to 1. 
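The normalization used in Figure 1.7(a) can be reproduced with a few lines; the arrays below are placeholders for the per-run test MSEs of a given algorithm and of training on the unweighted source sample.

```python
import numpy as np

def normalized_summary(mse_runs, mse_source_runs):
    """Median and .25/.75 quantiles of the MSE over repeated runs, normalized so that the
    median MSE of training on the unweighted source data equals 1."""
    normalized = np.asarray(mse_runs, dtype=float) / np.median(mse_source_runs)
    return (np.median(normalized),
            np.quantile(normalized, 0.25),
            np.quantile(normalized, 0.75))
```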
Notice that the performance of all algorithms is comparable when adapting to kin8-fm since both labeling functions are fairly linear, yet only GDM is able to reasonably adapt to the two data sets with different labeling functions. In order to better understand the advantages of GDM over DM we plot the relative error of DM against GDM as a function of the ratio r/Λ in Figure 1.7(b), where r is the radius defining H 00 . Notice that when the ratio r/Λ is small then both algorithms behave similarly which is most of the times for the adaptation task fh to fm. On the other hand, a better performance of GDM can be obtained when the ratio is larger. This is due to the fact that r/Λ measures the effective size of the set H 00 . A small ratio means that the size of H 00 is small and therefore the hypothesis returned by GDM will be close to that of DM where as if H 00 is large then GDM has the possibility of finding a better hypothesis. For our next experiment we considered the cross-domain sentiment analysis data set of Blitzer et al. (2007). This data set consists of consumer reviews from 4 different domains: books, kitchen, electronics and dvds. We used the top 1000 uni-grams and bi-grams as the features for this task. For each pair of adaptation tasks we sampled 700 points from the source distribution and 700 unlabeled points from the 36 Table 1.3: Adaptation from caltech256 (C), imagenet (I), sun (S) and bing (B). Normalized results: MSE of training on the unweighted source data is equal to 1. Task: Images S C I S B T I S B C S B C I B C I S GDM DM Unif Tar KMM KLIEP F 0.927±(0.051) 1.005±(0.010) 1.00 0.879±(0.048) 2.752±(3.820) 0.936±(0.016) 0.959±(0.035) 0.938±(0.064) 0.993±(0.018) 1.00 0.840±(0.057) 0.827±(0.017) 0.835±(0.020) 0.947±(0.025) 0.909±(0.040) 1.003±(0.013) 1.00 0.886±(0.052) 0.945±(0.022) 0.942±(0.017) 0.947±(0.019) 1.011±(0.015) 0.951±(0.011) 1.00 0.802±(0.040) 0.989±(0.036) 1.009±(0.042) 0.971±(0.024) 1.006±(0.030) 0.992±(0.016) 1.00 0.871±(0.030) 0.930±(0.018) 0.936±(0.016) 0.973±(0.017) 0.987±(0.022) 1.009±(0.010) 1.00 0.986±(0.028) 1.011±(0.028) 1.011±(0.028) 0.994±(0.018) 1.022±(0.037) 0.982±(0.035) 1.00 0.759±(0.033) 1.172±(0.043) 1.201±(0.038) 0.938±(0.036) 0.924±(0.049) 0.998±(0.030) 1.00 0.831±(0.047) 3.868±(4.231) 1.227±(0.039) 0.947±(0.028) 0.898±(0.072) 1.003±(0.044) 1.00 0.821±(0.053) 1.240±(0.039) 1.248±(0.041) 0.945±(0.021) 1.010±(0.014) 0.956±(0.017) 1.00 0.777±(0.031) 1.028±(0.033) 1.032±(0.031) 0.980±(0.019) 1.012±(0.010) 1.004±(0.007) 1.00 0.966±(0.009) 2.785±(3.803) 0.981±(0.018) 1.000±(0.004) 1.009±(0.018) 0.988±(0.010) 1.00 0.850±(0.035) 0.930±(0.022) 0.934±(0.024) 0.983±(0.013) target. Only 50 labeled points from the target distribution were used to tune the parameter r of our algorithm. The final evaluation is done on a test set of 1000 points. The mean results and standard deviations of this task are shown in Table 1.2 where the MSE values have been normalized in such a way that the performance of training on the source without reweighting is always 1. Finally, we considered a novel domain adaptation task (Tommasi et al., 2014) of paramount importance in the computer vision community. The domains correspond to 4 well known collections of images: bing, caltech256, sun and imagenet. These data sets have been standardized so that they all share the same feature representation and labeling function (Tommasi et al., 2014). 
We sampled 800 labeled points from the source distribution and 800 unlabeled points from the target distribution, as well as 50 labeled target points to be used for the validation of r. The results of testing on 1000 points from the target domain are presented in Table 1.3 where, again, the results were normalized in such a way that the performance of training on the source data is always 1.

After analyzing the results of this section, we can conclude that the GDM algorithm consistently outperforms DM and achieves similar or better performance than all other common adaptation algorithms. It is worth noticing that, in some cases, other algorithms perform even worse than training on the unweighted sample. This deficiency of the KLIEP algorithm had already been pointed out by Sugiyama et al. (2007), but here we observe that this problem can also affect the KMM algorithm. Finally, let us point out that, even though the FE algorithm also achieved a performance similar to that of GDM on the sentiment and image adaptation tasks, its performance was far from optimal when adapting on the kin-8xy task. Since there is a lack of theoretical understanding of this algorithm, it is hard to characterize the scenarios where FE would perform better than GDM.

1.9 Conclusion

We presented a new, theoretically well-founded domain adaptation algorithm seeking to minimize a less conservative quantity than the DM algorithm. We presented an SDP solution for the particular case of the L2 loss which can be solved in polynomial time. Our empirical results show that our new algorithm always performs better than or on par with the otherwise state-of-the-art DM algorithm. We also provided tight generalization bounds for the domain adaptation problem based on the Y-discrepancy. As pointed out in Section 1.4, an algorithm that minimizes the Y-discrepancy would benefit from the best possible guarantees. However, the lack of labeled data from the target distribution makes such an algorithm not viable. In future research, we would like to analyze a richer scenario where the learner is allowed to ask for a limited number of labels from the target distribution. This setup, which is related to active learning, seems in fact to be the closest one to real-life applications and has started to receive attention from the research community (Berlind and Urner, 2015). We believe that the discrepancy measure will again play a central role in the analysis of this scenario.

Chapter 2

Drifting

We extend the results of the previous chapter to the more challenging task of learning with drifting distributions. We prove learning bounds based on the Rademacher complexity of the hypothesis set and the Y-discrepancy of distributions, both for a natural extension of the standard PAC learning scenario and for a tracking scenario that will be described later. Our bounds are always tighter and in some cases substantially improve upon previous ones based on the L1 distance. We also present a generalization of the standard on-line to batch conversion to the drifting scenario in terms of the discrepancy and arbitrary convex combinations of hypotheses. We introduce a new algorithm exploiting these learning guarantees, which we show can be formulated as a simple QP. The chapter concludes with an extensive empirical evaluation of our proposed algorithm.

2.1 Introduction

In Chapter 1 we presented theory and algorithms for solving the problem of domain adaptation.
These results are crucial since adaptation is constantly encountered in fields such as natural language processing and computer vision. Domain adaptation, however, deals only with the problem of training on a fixed source distribution and testing on an also fixed target distribution. Therefore, there are common learning tasks that cannot fit into this framework. For instance in spam detection, political sentiment analysis, financial market prediction under mildly fluctuating economic conditions, or news stories, the learning environment is not stationary and there is a continuous drift of its parameters over time. This means in particular that the training distribution is in fact not fixed. There is a large body of literature devoted to the study of related problems both in the on-line and the batch learning scenarios. In the on-line scenario, the target function is typically assumed to be fixed but no distributional assumption is made, thus input points may be chosen adversarially (Cesa-Bianchi and Lugosi, 2006). Variants of this model where the target is allowed to change a fixed number of times have also been studied (Cesa-Bianchi and Lugosi, 2006; Herbster and Warmuth, 1998, 2001; Cavallanti et al., 39 2007). In the batch scenario, the case of a fixed input distribution with a drifting target was originally studied by Helmbold and Long (1994). A more general scenario was introduced by Bartlett (1992) where the joint distribution over the input and labels could drift over time under the assumption that the L1 distance between the distributions in two consecutive time steps was bounded by ∆. Both generalization bounds and lower bounds have been given for this scenario (Long, 1999; Barve and Long, 1997). In particular, Long (1999) showed that if the L1 distance between two consecutive distributions is at most ∆, then a generalization error of O((d∆)1/3 ) is achievable and Barve and Long (1997) proved this bound to be tight. Further improvements were presented by Freund and Mansour (1997) under the assumption of a constant rate of change for drifting. Other settings allowing arbitrary but infrequent changes of the target have also been studied (Bartlett et al., 2000). An intermediate model of drift based on a near relationship was also recently introduced and analyzed by Crammer et al. (2010) where consecutive distributions may change arbitrarily, modulo the restriction that the region of disagreement between nearby functions would only be assigned limited distribution mass at any time. This chapter deals with the analysis of learning in the presence of drifting distributions in the batch setting. We consider both the general drift model introduced by Bartlett (1992) and a related drifting PAC model that we will later describe. We present new generalization bounds for both models (Sections 2.3 and 2.4). Unlike the L1 distance used by previous authors to measure the distance between distributions in the drifting scenario , our bounds are based on the Y-discrepancy between distributions. As shown in Chapter 1, the Y-discrepancy is a finer measure than the L1 distance. Furthermore it can be estimated from finite samples unlike the L1 distance (see for example lower bounds on the sample complexity of testing closeness by Valiant (2011)). The learning bounds we present in Sections 2.3 and 2.4 are tighter than previous bounds both because they are given in terms of the discrepancy distance, and because they are given in terms of the Rademacher complexity instead of the VC-dimension. 
Additionally, our proofs are often simpler and more concise. We also present a generalization of the standard on-line to batch conversion to the scenario of drifting distributions in terms of the discrepancy measure (Section 2.5). Our guarantees hold for convex combinations of the hypotheses generated by an on-line learning algorithm. These bounds lead to the definition of a natural meta-algorithm which consists of selecting the convex combination of weights in order to minimize the discrepancy-based learning bound (Section 2.6). We show that this optimization problem can be formulated as a simple QP and report the results of several experiments demonstrating its benefits. Finally we discuss the practicality of our algorithm in some natural scenarios. 40 2.2 Preliminaries In this section, we introduce some preliminary notation and key definitions that will be used throughout the Chapter. In addition, we describe the learning scenarios that we will consider. Let X denote the input space, Y the output space and H a hypothesis set. We consider a loss function L : Y × Y → R+ bounded by some constant M > 0. For any two functions h, h0 : X → Y and any distribution D over X × Y, we keep the notation of the previous chapter and denote by LD (h) the expected loss of h and by LD (h, h0 ) the expected loss of h with respect to h0 : LD (h) = E (x,y)∼D [L(h(x), y)] LD (h, h0 ) = E [L(h(x), h0 (x))], and x∼DX (2.1) where DX is the marginal distribution over X derived from D. We adopt the standard definition of the empirical Rademacher complexity, yet we adapt the definition of Rademacher complexity to our drifting scenario. This definition is related to the notion of sequential Rademacher complexity of Rakhlin et al. (2010). Definition 9. Let G be a family of functions mapping from a set Z to R and S = (z1 , . . . , zT ) a fixed sample of size T with elements in Z. The empirical Rademacher complexity of G for the sample S is defined by: # " T X 1 b S (G) = E sup σt g(zt ) , (2.2) R σ g∈G T t=1 where σ = (σ1 , . . . , σT )> , with σt s independent uniform random variables taking values b S (G) over all in {−1, +1}. The Rademacher complexity of G is the expectation of R samples S = (z1 , . . . , zT ) of size T drawn according to the product distribution D = NT t=1 Dt : b S (G)]. RT (G) = E [R (2.3) S∼D Note that this coincides with the standard Rademacher complexity when the distributions Dt , t ∈ [1, T ], all coincide. Similar to domain adaptation, a key question for the analysis of learning with a drifting scenario is a measure of the difference between two distributions D and D0 . The distance used by previous authors is the L1 distance. However, as previously discussed, the L1 distance is not helpful in this context since it can be large even in some rather favorable situations. In view of this, we instead use the Y-discrepancy (Definition 5 in the previous chapter) to measure the distance between two consecutive distributions. We will present our learning guarantees in terms of the Y-discrepancy discY . That is, the most general definition since guarantees in terms of the discrepancy disc can be 41 straightforwardly derived from them. The advantage of the latter bounds is the fact that the discrepancy can be estimated in that case from unlabeled finite samples. We will consider two different scenarios for the analysis of learning with drifting distributions: the drifting PAC scenario and the drifting tracking scenario. 
The drifting PAC scenario is a natural extension of the PAC scenario, where the objective is to select a hypothesis h out of a hypothesis set H with a small expected loss according to the distribution DN T +1 after receiving a sample of T ≥ 1 instances drawn from the product distribution Tt=1 Dt . Thus, the focus in this scenario is the performance of the hypothesis h with respect to the environment distribution after receiving the training sample. The drifting tracking scenario we consider is based on the scenario originally introduced by Bartlett (1992) for the zero-one loss and is used to measure the performance of an algorithm A (as opposed to any hypothesis h). In that learning model, the performance of an algorithm is determined based on its average predictions at each time for a sequence of distributions. We will generalize its definition by using the notion of discrepancy and extending it to other loss functions. The following definitions are the key concepts defining this model. Definition 10. For any sample S = (xt , yt )Tt=1 of size T , we denote by hT −1 ∈ H the hypothesis returned by an algorithm A after receiving the first T − 1 examples and cT its loss or mistake on xT : M cT = L(hT −1 (xT ), yT ). For a product distribution by M NT T D = t=1 Dt on (X × Y) we denote by MT (D) the expected mistake of A: cT ] = E [L(hT −1 (xT ), yT )]. MT (D) = E [M S∼D S∼D fT be the supremum of MT (D) over all distribution Definition 11. Let ∆ > 0 and let M sequences D = (Dt ), with discY (Dt , Dt+1 ) < ∆. Algorithm A is said to (∆, )-track fT < inf h∈H LD (h) + . H if there exists t0 such that for T > t0 we have M T As suggested by its name, the focus in this scenario is tracking the best hypothesis at every time. An analysis of the tracking scenario with the L1 distance used to measure the divergence of distributions instead of the discrepancy was carried out by Long (1999) fT in terms of and Barve and Long (1997), including both upper and lower bounds for M ∆. Their analysis makes use of an algorithm very similar to empirical risk minimization, which we will also use in our theoretical analysis of both scenarios. 2.3 Drifting PAC Scenario In this section, we present guarantees for the drifting PAC scenario in terms of the discrepancies of Dt and DT +1 , t ∈ [1, T ], and the Rademacher complexity of the hypothesis set. 42 Let us emphasize that learning bounds in the drifting scenario should of course not be expected to converge to zero as a function of the sample size but depend instead on the divergence between distributions. Theorem 5. Assume that the loss function L is bounded by M . Let D1 , . . . , DT +1 be a sequence of distributions and let HL = {(x, y) 7→ L(h(x), y) : h ∈ H}. Then, for any δ > 0, with probability at least 1 − δ, the following holds for all h ∈ H: s T T X X log 1δ 1 1 L(h(xt ), yt ) + 2RT (HL ) + discY (Dt , DT +1 ) + M . LDT +1 (h) ≤ T t=1 T t=1 2T N Proof. We denote by D the product distribution Tt=1 Dt . Let Φ be the function defined over any sample S = ((x1 , y1 ), . . . , (xT , yT )) ∈ (X × Y)T by T 1X Φ(S) = sup LDT +1 (h) − L(h(xt ), yt ). T t=1 h∈H Let S and S 0 be two samples differing by one labeled point, say (xt , yt ) in S and (x0t , yt0 ) in S 0 , then: i M 1h Φ(S 0 ) − Φ(S) ≤ sup L(h(x0t ), yt0 ) − L(h(xt ), yt ) ≤ . T h∈H T Thus, by McDiarmid’s inequality, the following holds:1 h i P Φ(S) − E [Φ(S)] > ≤ exp(−2T 2 /M 2 ). 
S∼D S∼D 1 Note that McDiarmid’s inequality does not require points to be drawn according to the same distribution but only that they would be drawn independently. 43 We now bound ES∼D [Φ(S)] by first rewriting it, as follows: T T T i h 1X 1X 1X LDt (h) + LDt (h) − L(h(xt ), yt ) E sup LDT +1 (h) − T t=1 T t=1 T t=1 h∈H T T T h i h i 1X 1X 1X ≤ E sup LDT +1 (h)− LDt (h) +E sup LDt (h)− L(h(xt ), yt ) T t=1 T t=1 h∈H h∈H T t=1 T T h1 X i 1X ≤E sup LDT +1 (h) − LDt (h) +sup LDt (h) − L(h(xt ), yt ) T t=1 h∈H h∈H T t=1 T T h i 1X 1X ≤ discY (Dt , DT +1 ) + E sup LDt (h) − L(h(xt ), yt ) . T t=1 h∈H T t=1 It is not hard to see, using a symmetrization argument as in the non-sequential case, that the second term can be bounded by 2RT (HL ). Observe that the bound of Theorem 5 is tight as a function of the divergence measure (discrepancy) we are using. Consider for example the case where D1 = . . . = DT +1 , then a standard Rademacher complexity generalization bound holds for all h ∈ H: T √ 1X L(h(xt ), yt ) + 2RT (HL ) + O(1/ T ). LDT +1 (h) ≤ T t=1 Now, our generalization bound for LDT +1 (h) includes only the additive term discY (Dt , DT +1 ), but by definition of the discrepancy, for any > 0, there exists h ∈ H such that the inequality |LDT +1 (h) − LDT (h)| < discY (Dt , DT +1 ) + holds. Next, we present PAC learning bounds for empirical risk minimization. Let h∗T be a best-in class hypothesis in H, that is one with the best expected loss. By a similar reasoning as in Theorem 5, we can show that with probability 1 − 2δ we have T 1X T 1X ∗ ∗ L(hT (xt ), yt ) ≤ LDT +1 (hT ) + 2RT (HL ) + discY (Dt , DT +1 ) + 2M T t=1 T t=1 s log 2δ . 2T Let hT be a hypothesis returned by empirical risk minimization (ERM). The last inequality, along with the bound of Theorem 5 and the union bound imply that with probability 1 − δ the following holds: s T X log 2δ 2 discY (Dt , DT +1 ) + 2M LDT +1 (hT ) − LDT +1 (h∗T ) ≤ 4RT (HL ) + , (2.4) T t=1 2T 44 P P where we used the fact that Tt=1 L(hT (xt ), yt ) − Tt=1 L(h∗ (xt ), yt ) < 0 since hT is an empirical minimizer. This learning bound indicates a trade-off: larger values of the sample size T guarantee smaller first and third terms; however, as T increases, the average discrepancy term is likely to grow as well, thereby making learning increasingly challenging. This suggests an algorithm similar to empirical risk minimization but limited to the last m examples instead of the whole sample with m < T . This algorithm was previously used in Barve and Long (1997) for the study of the tracking scenario. We will use it here to prove several theoretical guarantees in the PAC learning model. Proposition 8. Let ∆ ≥ 0. Assume that (Dt )t≥0 is a sequence of distributions such that discY (Dt , Dt+1 ) ≤ ∆ for all t ≥ 0. Fix P m ≥ 1 and let hT denote the hypothesis returned by the algorithm A that minimizes Tt=T −m L(h(xt ), yt ) after receiving T > m examples. Then, for any δ > 0, with probability at least 1 − δ, the following learning bound holds: s log 2δ . (2.5) LDT +1 (hT ) − inf LDT +1 (h) ≤ 4Rm (HL ) + (m + 1)∆ + 2M h∈H 2m Proof. The proof is straightforward. Notice that the algorithm discards the first T − m examples and considers exactly m instances. Thus, as in inequality (2.4), we have: s T X log 2δ 2 disc(Dt , DT +1 ) + 2M . LDT +1 (hT ) − LDT +1 (h∗T ) ≤ 4Rm (HL ) + m t=T −m 2m Finally, we can use the triangle inequality to bound disc(Dt , DT +1 ) by (T + 1 − m)∆. Thus, the sum of the discrepancy terms can be bounded by (m + 1)∆. 
To obtain the best learning guarantee, we can select m to minimize the bound just presented. This requires expressing the Rademacher complexity in terms of m. The following is the result obtained when using a VC-dimension upper bound for the Rademacher complexity. Corollary 5. Fix ∆ > 0. Let q H be a hypothesis set with VC-dimension d such that for all m ≥ 1, Rm (HL ) ≤ C4 md for some constant C > 0. Assume that (Dt )t>0 is a sequence of distributions such that discY (Dt , Dt+1 ) ≤ ∆ for all t ≥ 0. Then, there exists an algorithm A such that for any δ > 0, theq hypothesis hT it returns after h i 23 1 0 log( 2δ ) receiving T > C+C ( ∆d2 ) 3 instances, where C 0 = 2M , satisfies the following 2 2d with probability at least 1 − δ: C + C0 LDT +1 (hT ) − inf LDT +1 (h) ≤ 3 h∈H 2 45 2/3 (d∆)1/3 + ∆. (2.6) Proof. Fix δ > 0. Replacing Rm (HL ) by the upper bound r LDT +1 (hT ) − inf LDT +1 (h) ≤ (C + C 0 ) h∈H 0 2 C 4 q d m in (2.5) yields d + (m + 1)∆. m (2.7) 1 Choosing m = ( C+C ) 3 ( ∆d2 ) 3 to minimize the right-hand side gives exactly (2.6). 2 pWhen H has finite VC-dimension d, it is known that Rm (HL ) can be bounded by C d/m for some constant C > 0, by using a chaining argument (Dudley, 1984; Pollard, 1984; Talagrand, 2005). Thus, the assumption of the previous corollary holds for many loss functions L, when H has finite VC-dimension. 2.4 Drifting Tracking Scenario In this section, we present a simpler proof of the bounds given by Long (1999) for the agnostic case demonstrating that using the discrepancy as a measure of the divergence between distributions leads to tighter and more informative bounds than using the L1 distance. Proposition 9. Let ∆ > 0 and let (Dt )t≥0 be a sequence of distributions such that discY (Dt , Dt+1 ) ≤ ∆ for all t ≥ 0. Let m > 1 and let hT be as in Proposition 8. Then, r cT +1 ] − inf LD (h) ≤ 4Rm (HL ) + 2M π + (m + 1)∆. (2.8) E[M T +1 D h m N N +1 Dt and D0 = Tt=1 Dt . By Fubini’s Theorem we can write: Proof. Let D = Tt=1 h i cT +1 ] − inf LD (h) = E LD (hT ) − inf LD (h) . E[M (2.9) T +1 T +1 T +1 0 D h h D −1 Let φ (δ) = 4Rm (HL ) + (m + 1)∆ + 2M 4Rm (HL ) + (m + 1)∆, the following holds: q log 2δ 2m . By inequality (2.5), for β > P [LDT +1 (hT ) − inf LDT +1 (h) > β] < φ(β). D0 h Thus, the expectation on the right-hand side of (2.9) can be bounded as follows: Z i E0 LDT +1 (hT ) − inf LDT +1 (h) ≤ 4Rm (HL ) + (m + 1)∆ + D h h ∞ φ(β)dβ. 4Rm (HL )+(m+1)∆ Using variable δ = φ(β), we see the last integral is equivalent to R 2 the dδchange of p π √ 2M 0 = 2M m , which concludes the proof. 2 m log δ 46 The following corollary can be shown using the same proof as that of Corollary 5. Corollary 6. Fix ∆ > 0. Let q H be a hypothesis set with VC-dimension d such that for all m > 1, 4Rm (HL ) ≤ C md . Let (Dt )t>0 be a sequence of distributions over X × Y p 0 2/3 such that discY (Dt , Dt+1 ) ≤ ∆. Let C 0 = 2M πd and K = 3 C+C . Then, for 2 C+C 0 23 d 1 T > ( ∆2 ) 3 , the following inequality holds: 2 cT +1 ] − inf LD (h) < K(d∆)1/3 + ∆. E[M T +1 D h In terms of Definition 11, this corollary shows that algorithm A (∆, K(d∆)1/3 +∆)tracks H. This result is similar to a result of Long (1999) which states that given > 0 if ∆ = O(d3 ) then A (∆, )-tracks H. However, in (Long, 1999), ∆ is an upper bound on the L1 distance and not the discrepancy, making our bound tighter. It is worth noting that the results of Long (1999) guarantee that the dependency on ∆ cannot be improved as on the example considered in their lower bound, the L1 distance and the discrepancy agree. 
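The algorithm analyzed in the last two sections, empirical risk minimization restricted to the most recent m examples, is straightforward to implement. A minimal sketch for linear hypotheses and the squared loss is given below, together with the window size suggested by the proofs of Corollaries 5 and 6. The constants C, C' and the drift bound Δ are assumed to be known or estimated, which is an idealization, and a small ridge term is added purely for numerical stability.

```python
import numpy as np

def window_erm(X, y, m, lam=1e-8):
    """Fit a linear predictor on only the last m examples of a time-ordered sample.

    X : (T, d) examples ordered by time, y : (T,) labels, m : window size.
    lam is a small stabilizer; plain ERM corresponds to lam -> 0.
    """
    Xw, yw = X[-m:], y[-m:]
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + Xw.T @ Xw, Xw.T @ yw)

def optimal_window(C, C_prime, d, drift):
    """Window size minimizing (C + C') * sqrt(d / m) + (m + 1) * drift, as in Corollary 5."""
    m = (((C + C_prime) / 2.0) ** 2 * d / drift ** 2) ** (1.0 / 3.0)
    bound = (C + C_prime) * (d / m) ** 0.5 + (m + 1) * drift
    return m, bound
```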
2.5 On-line to Batch Conversion In this section, we present learning guarantees for drifting distributions in terms of the regret of an on-line learning algorithm A. The algorithm processes a sample (xt )t≥1 sequentially by receiving a sample point xt ∈ X , generating a hypothesis ht , and incurring a loss L(ht (xt ), yt ), with yt ∈ Y. We denote by RT the regret of algorithm A after processing T ≥ 1 sample points: RT = T X t=1 L(ht (xt ), yt ) − inf h∈H T X L(h(xt ), yt ). t=1 The standard setting of on-line learning assumes an adversarial scenario with no distributional assumption. Nevertheless, when the data is generated according to some distribution, the hypotheses returned by an on-line algorithm A can be combined to define a hypothesis with √ strong learning guarantees in the distributional setting when the regret RT is in O( T ) (which is attainable by several regret minimization algorithms) (Littlestone, 1989; Cesa-Bianchi et al., 2001). Here, we extend these results to the drifting scenario and the case of a convex combination of the hypotheses generated by the algorithm. The following lemma will be needed for the proof of our main result. N Lemma 4. Let S = (xt , yt )Tt=1 be a sample drawn from the distribution D = Tt=1 Dt and let (ht )Tt=1 be the sequence of hypotheses returned by an on-line algorithm sequentially processing S. Let w = (w1 , . . . , wt )> be a vector of non-negative weights veri47 P fying Tt=1 wt = 1. If the loss function L is bounded by M then, for any δ > 0, with probability at least 1 − δ, each of the following inequalities hold: T X t=1 T X t=1 wt LDT +1 (ht ) ≤ T X t=1 wt L(ht (xt ), yt ) ≤ ¯ wt L(ht (xt ), yt ) + ∆(w, T ) + M kwk2 T X t=1 ¯ wt LDT +1 (ht ) + ∆(w, T ) + M kwk2 r r 2 log 1 δ 1 2 log , δ P ¯ where ∆(w, T ) denotes the average discrepancy Tt=1 wt discY (Dt , DT +1 ). Proof. Consider the random process: Zt = wt L(ht (xt ), yt ) − wt L(ht ) and let Ft denote the filtration associated to the sample process. We have: |Zt | ≤ M wt and E[Zt |Ft−1 ] = E[wt L(ht (xt ), yt )|Ft−1 ] − E [wt L(ht (xt ), yt )] = 0. D D Dt The second equality holds because ht is determined at time t − 1 and xt , yt are independent of Ft−1 . Thus, by Azuma-Hoeffding’s inequality, for any δ > 0, with probability at least 1 − δ the following holds: T X t=1 wt LDt (ht ) ≤ T X t=1 wt L(ht (xt ), yt ) + M kwk2 r 1 2 log . δ (2.10) By definition of the discrepancy, the following inequality holds for any t ∈ [1, T ]: LDT +1 (ht ) ≤ LDt (ht ) + discY (Dt , DT +1 ). P Summing up these inequalities and using (2.10) to bound Tt=1 wt LDt (ht ) proves the first statement. The second statement can be proven in a similar way. The following theorem is the main result of this section. Theorem 6. Assume that L is bounded by M and convex with respect to its first argument. Let h1 , . . . , hT be the hypotheses returned by AP when sequentially processing T (xt , yt )Tt=1 and let h be the hypothesis defined by h = t=1 wt ht , where w1 , . . . , wT PT are arbitrary non-negative weights verifying t=1 wt = 1. Then, for any δ > 0, with 48 probability at least 1 − δ, h satisfies each of the following learning guarantees: LDT +1 (h) ≤ T X t=1 ¯ wt L(ht (xt ), yt ) + ∆(w, T ) + M kwk2 r 2 log 1 δ RT ¯ LDT +1 (h) ≤ inf L(h) + + ∆(w, T ) + M kw − u0 k1 + 2M kwk2 h∈H T r 2 2 log , δ (2.11) P ¯ where w = (w1 , . . . , wT )> , ∆(w, T ) = Tt=1 wt discY (Dt , DT +1 ), and u0 ∈ RT is the vector with all its components equal to 1/T . 
Observe that when all weights are all equal to T1 , the result we obtain is similar to the learning guarantee obtained in Theorem 5 when the Rademacher complexity of HL is O( √1T ). Also, if the learning scenario is i.i.d., then the term involving the discrepancy ¯ vanishes and it can be seen straightforwardly that to minimize the RHS of 2.11 we ∆ need to set wt = T1 , which results in the known i.i.d. guarantees for on-line to batch conversion (Littlestone, 1989; Cesa-Bianchi et al., 2001). Proof. SinceP L is convex with PTrespect to its first argument, by Jensen’s inequality, we T have LDT +1 ( t=1 wt ht ) ≤ t=1 wt LDT +1 (ht ). Thus, by Lemma 4, for any δ > 0, the following holds with probability at least 1 − δ: ! r T T X X 1 ¯ wt L(ht (xt ), yt ) + ∆(w, T ) + M kwk2 2 log . (2.12) wt ht ≤ LDT +1 δ t=1 t=1 This proves the first statement of the theorem. To prove the second claim, we will bound the empirical error in terms ofPthe regret. For any h∗ ∈ H, we can write P using inf h∈H T1 Tt=1 L(h(xt ), yt ) ≤ T1 Tt=1 L(h∗ (xt ), yt ): T X t=1 = wt L(ht (xt ), yt ) − T X t=1 T X wt L(h∗ (xt ), yt ) t=1 T 1 1X ∗ [L(ht (xt ), yt )−L(h∗ (xt ), yt )] [L(ht (xt ), yt )−L(h (xt ), yt )]+ wt − T T t=1 T T 1X 1X ≤ M kw − u0 k1 + L(ht (xt ), yt ) − inf L(h(xt ), yt ) h T T t=1 t=1 ≤ M kw − u0 k1 + RT . T Now, by definition of the infimum, for any > 0, there exists h∗ ∈ H such that LDT +1 (h∗ ) ≤ inf h∈H LDT +1 (h) + . For that choice of h∗ , in view of (2.12), with 49 probability at least 1 − δ/2, the following holds: LDT +1 (h) ≤ T X t=1 RT ¯ + ∆(w, T ) + M kwk2 wt L(h (xt ), yt ) + M kw− u0 k1 + T ∗ r 2 2 log . δ By the second statement of Lemma 4, for any δ > 0, with probability at least 1 − δ/2, T X t=1 ¯ wt L(h (xt ), yt ) ≤ LDT +1 (h ) + ∆(w, T ) + M kwk2 ∗ ∗ r 2 2 log . δ Combining these last two inequalities, by the union bound, with probability at least 1−δ, q the following holds with B(w, δ) = M kw− u0 k1 + RT T + 2M kwk2 2 log 2δ : ¯ LDT +1 (h) ≤ LDT +1 (h∗ ) + 2∆(w, T ) + B(w, δ) ¯ ≤ inf LDT +1 (h) + + 2∆(w, T ) + B(w, δ). h∈H The last inequality holds for all > 0, therefore also for = 0 by taking the limit. The above inequality can be made to hold uniformly over w at the additional cost of a log log2 kw − u0 k−1 1 term. Corollary 7. Under the conditions fo the previous theorem the following inequality is satisfied uniformly for all w such that kw − u0 k1 ≤ 1: T X t=1 wt LDT +1 (ht ) ≤ T X t=1 ¯ wt L(ht (xt ), yt ) + ∆(w, T ) + 4M (kwk2 + kw − u0 k1 ) r 3 p 2 log + log log2 2kw − u0 k−1 . + M (kwk2 + kw − u0 k1 ) δ Proof. Let (w(k))∞ any sequence of positive weights and define k=0√ be PT ¯ k = M kw(k)k2 ( + 2 log k) + ∆(w(k), T ). If h̄k = t=1 w(k)t L(ht (xt ), yt ), then by Lemma 4 and the union bound we have: T T X X w(k)t LDT +1 (ht ) − w(k)t L(ht (xt ), yt ) ≥ k P ∃k s.t t=1 ≤ ∞ X t=1 e ¯ ( −∆(w(k),T ))2 − 21 k 2 M kw(k)k2 2 = k=0 2 − 2 ≤e ∞ X k=0 ∞ X 2 1 − 2 ≤ 3e . k2 k=0 50 1 √ e− 2 (+2 log k)2 Let us select w(k) in such a way that kw(k) − u0 k1 ≤ 1. There exists k such that 1 2k and let w satisfy kw − u0 k1 ≤ kw(k) − u0 k1 ≤ kw − u0 k1 ≤ kw(k − 1) − u0 k1 = 2kw(k) − u0 k1 . (2.13) Therefore, we have T X t=1 ≤ ≤ wt (LDt+1 (h) − L(ht (xt ), yt )) T X t=1 T X t=1 (2.14) w(k)t LDt+1 (ht ) − L(ht (xt ), yt ) + M kw − w(k)k1 w(k)t LDt+1 (ht ) − L(ht (xt ), yt ) + 2M kw − u0 k1 . Where the last inequality follows from the triangle inequality and the fact that kw(k) − ¯ ¯ u0 k1 ≤ kw − u0 k1 . 
Similarly, ∆(w, T ) ≤ ∆(w(k), T ) + 2M kw − u0 k1 and kwk2 ≤ kw(k)k2 + kw(k) − u0 k2 ≤ kw(k)k2 + kw(k) − u0 k1 . (2.15) Finally, by definition of w(k) and inequality (2.13) we have k = log2 kw(k) − u0 k−1 1 ≤ log2 2 . kw − u0 k1 (2.16) p Let w = M (kwk2 + kw − u0 k1 )( + 2 log log2 2kw − u0 k−1 1 ) + 4M (kwk2 + kw − u0 k1 ) + ∆(w, T ). In view of (2.16) and (2.14) we see that: P sup T X w:kw−u0 k1 ≤1 t=1 T X ≤ P ∃k s.t t=1 wt LDT +1 (ht ) − T X t=1 w(k)t LDT +1 (ht ) − wt L(ht (xt ), yt ) − w ≥ 0 T X t=1 2 w(k)t L(ht (xt ), yt ) − k ≥ 0 ≤ 3e− 2 . Setting the right hand side equal to δ and solving for δ we see that with probability at 51 least 1 − δ T X t=1 2.6 wt LDT +1 (ht ) ≤ T X t=1 ¯ wt L(ht (xt ), yt ) + ∆(w, T ) + 4M (kwk2 + kw − u0 k1 ) q r 3 −1 + M (kwk2 + kw − u0 k1 ) 2 log + log log2 2kw − u0 k1 . δ Algorithm The results of the previous section suggest a natural algorithm based on the values of the discrepancy between distributions. Let (ht )Tt=1 be the sequence of hypotheses generated by an on-line algorithm. Theorem 6 provides a learning guarantee for any convex combination of these hypotheses. The convex combination based on the weight vector w minimizing the bound of Theorem 6 benefits from the most favorable guarantee. This leads to an algorithm for determining w based on the following convex optimization problem: min w subject to: λkwk22 T X t=1 + T X wt (discY (Dt , DT +1 ) + L(ht (xt ), yt )) (2.17) t=1 wt = 1 ∧ (∀t ∈ [1, T ], wt ≥ 0), where λ ≥ 0 is a regularization parameter. This is a standard QP problem that can be efficiently solved using a variety of techniques and available software. In practice, the discrepancy values discY (Dt , DT +1 ) are not available since they require labeled samples. But, in the deterministic scenario where the labeling function f is in H, we have discY (Dt , DT +1 ) ≤ disc(Dt , DT +1 ). Thus, the discrepancy values disc(Dt , DT +1 ) can be used instead in our learning bounds and in the optimization (2.17). This also holds approximately when f is not in H but is close to some h ∈ H. As shown in the previous chapter, given two (unlabeled) samples of size√n from Dt and DT +1 , the discrepancy disc(Dt , DT +1 ) can be estimated within O(1/ n), when √ Rn (HL ) = O(1/ n). In many realistic settings, for tasks such as spam filtering, the distribution Dt does not change within a day. This gives us the opportunity to collect an independent unlabeled sample of size n from each distribution Dt . If we choose n T , by the union √ bound, with high probability, all of our estimated discrepancies will be within O(1/ T ) of their exact counterparts disc(Dt , DT +1 ). Additionally, in many cases, the distributions Dt remains unchanged over longer periods (cycles) which may be known to us. This in fact typically holds for some tasks 52 0.2 0.4 Discrepancy 0.6 0.8 1.0 0.8 0.6 Discrepancy 0.4 0.0 0.2 0.0 Cycle Cycle Figure 2.1: Barplot of estimated discrepancies for the continuous drifting and alternating drifting scenarios. such as spam filtering, political sentiment analysis, some financial market prediction problems, and other problems. For example, in the absence of any major political event such as a debate, speech, or a prominent measure, we can expect the political sentiment to remain stable. In such scenarios, it should be even easier to collect an unlabeled sample from each distribution. More crucially, we do not need then to estimate the discrepancy for all t ∈ [1, T ] but only once for each cycle. 
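The optimization problem (2.17) is small enough to be solved essentially in closed form: for λ > 0 the objective is λ‖w‖² plus a linear term c·w with c_t = disc_Y(D_t, D_{T+1}) + L(h_t(x_t), y_t), so its minimizer over the simplex is the Euclidean projection of −c/(2λ) onto the simplex. The sketch below implements this observation; any off-the-shelf QP solver would serve equally well, and the projection shortcut is our own simplification, not part of the algorithm's description.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def drift_weights(discrepancies, losses, lam=1.0):
    """Solve (2.17): min_w lam * ||w||^2 + sum_t w_t * (disc_t + loss_t)
    over the simplex.  `discrepancies` are the estimated values
    disc(D_t, D_{T+1}) and `losses` the observed losses L(h_t(x_t), y_t)."""
    c = np.asarray(discrepancies, dtype=float) + np.asarray(losses, dtype=float)
    return project_simplex(-c / (2.0 * lam))
```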
2.7 Experiments

Here, we report the performance of our algorithm on both synthetic and real-world data sets for the task of regression. Throughout this section, we let $H$ denote the set of linear hypotheses $x \mapsto \mathbf{v} \cdot x$. As our base on-line algorithm $\mathcal{A}$ we use the Widrow-Hoff algorithm (Widrow and Hoff, 1988), which minimizes the square loss using stochastic gradient descent. We set the learning rate of this algorithm to $\eta_t = 0.001/\sqrt{t}$, and the parameter $\lambda$ of our algorithm is chosen through validation over the set $\{2^k \mid k = -20, \ldots, 20\}$. As discussed in the previous section, we consider scenarios where drifting occurs in cycles. In view of this, we compare our algorithm against the following natural baselines: the algorithm that averages all hypotheses returned by $\mathcal{A}$, which we denote by avg, and the algorithm that averages only the hypotheses obtained over the last cycle (last).

2.7.1 Synthetic data sets

We create 8 different data sets in the following way: we generate $T = 9600$ examples in $\mathbb{R}^{50}$. These examples are divided into cycles, each cycle having the same distribution. For each experiment, we select the cycle size $k \in \{100, 200, 400, 600\}$, and instances in cycle $i$ have a Gaussian distribution with mean $\mu_i$ and covariance matrix $0.5I$, where $I$ is the identity matrix. The labeling function for all examples is given by $x \mapsto (0.2 \sum_{j=1}^{50} x_j + 1)^2 - 1$. To introduce drift in the distributions, we let $\mu_1 = (0, \ldots, 0)^\top$ and $\mu_i = \mu_{i-1} + \epsilon_i$, where the drift $\epsilon_i$ is given by one of the following:

1. Continuous drifting. We let $\epsilon_i$ be a uniform random variable in $[-.1, .1]^{50}$.
2. Alternating drifting. We select $\epsilon_0 = (.1, \ldots, .1)^\top$ and define $\epsilon_i = (-1)^i \epsilon_0$.

As suggested by their names, the continuous drifting scenario changes the distribution at every cycle, whereas the alternating scenario maintains one distribution on all even cycles and another on all odd cycles. The discrepancies needed for our algorithm were estimated from 6000 unlabeled samples of each distribution pair. We replicated each experiment 50 times, and the results are presented in Figure 2.2 and Figure 2.3, where we report the MSE of the final hypotheses output by each algorithm.

The first thing to notice is that our proposed algorithm is never significantly worse than the other two algorithms. For the continuous drifting scenario, there is in fact no statistically significant difference between the performance of our algorithm and that of last. This can be explained by the fact that, in the continuous drifting scenario, the last cycle had the smallest discrepancy with respect to the testing distribution and, in most cases, our algorithm therefore considered only hypotheses from this cycle. In the alternating scenario, however, we can see in Figure 2.1 that the discrepancy between the distribution of the last cycle and the testing distribution is large. Therefore, using only hypotheses from the last cycle should be detrimental to learning. This effect can in fact be seen in Figure 2.3, where our algorithm instead learns to use hypotheses from alternating cycles. In fact, as the cycle size increases, the gap in performance between the algorithms increases too. This can be explained by the fact that a larger per-cycle sample allows the last algorithm to learn a better hypothesis for the wrong distribution, thus making its performance suffer.
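A minimal sketch of the synthetic setup and of the base on-line algorithm is given below. The number of cycles, the seed and all other values are illustrative and much smaller than in the actual experiments.

```python
import numpy as np

def make_cycle(mu, k, rng):
    """One cycle of k examples: Gaussian features, fixed labeling function."""
    X = rng.normal(loc=mu, scale=np.sqrt(0.5), size=(k, len(mu)))
    y = (0.2 * X.sum(axis=1) + 1.0) ** 2 - 1.0
    return X, y

def widrow_hoff(X, y, eta0=0.001):
    """Widrow-Hoff (LMS) updates with learning rate eta0 / sqrt(t).

    Returns the sequence of hypotheses (one weight vector per round),
    which the weighted on-line-to-batch conversion then combines."""
    T, d = X.shape
    w = np.zeros(d)
    hypotheses = []
    for t in range(T):
        hypotheses.append(w.copy())       # h_t is fixed before round t
        eta = eta0 / np.sqrt(t + 1)
        w = w - eta * (w @ X[t] - y[t]) * X[t]
    return hypotheses

# Continuous-drift example with cycle size k = 100 (illustrative sizes).
rng = np.random.default_rng(0)
mu = np.zeros(50)
X_all, y_all = [], []
for _ in range(10):                       # 10 cycles instead of the full sample
    Xc, yc = make_cycle(mu, 100, rng)
    X_all.append(Xc)
    y_all.append(yc)
    mu = mu + rng.uniform(-0.1, 0.1, size=50)   # continuous drift
hypotheses = widrow_hoff(np.vstack(X_all), np.concatenate(y_all))
```

The resulting sequence of hypotheses is what the weight-selection step of Section 2.6 combines.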
2.7.2 Real-world data sets

We test our algorithm on a financial data set consisting of the following stock tickers: IBM, GOOG, MSFT, INTC and AAPL. Our goal is to predict the price of GOOG as a function of the other four prices. We consider one day of trading as a cycle and, for each day, we sample the price of these stocks every 2 minutes. This yields 195 points per cycle. These points were used to estimate the discrepancy disc between the cycles, which is required by our algorithm. The data consists of 20 days of trading beginning on February 9th, 2015 and is available at http://cims.nyu.edu/˜munoz. We report the median performance of each algorithm on days 5, 10, 15 and 20. The results are shown in Table 2.1. Notice that, albeit having similar performance to last and avg on the initial rounds, our algorithm seems to have much less variability in its results and, in general, to adapt much better to this difficult task.

[Figure 2.2: MSE of different algorithms for the continuous drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).]

[Figure 2.3: MSE of different algorithms for the alternating drifting scenario. Different plots represent different cycle sizes: k = 100 (top-left), k = 200 (bottom-left), k = 400 (top-right) and k = 600 (bottom-right).]

Table 2.1: Results of different algorithms in the financial data set. Bold results are significant at the 0.05 level.

Day   Weighted   Avg      Last
5     0.558      0.552    0.663
10    38.328     82.001   37.939
15    4.149      60.203   6.981
20    10.345     13.365   251.941

2.8 Conclusion

We presented a theoretical analysis of the problem of learning with drifting distributions in the batch setting. Our learning guarantees improve upon previous ones based on the L1 distance, in some cases substantially, and our proofs are simpler and more concise. These bounds benefit from the notion of discrepancy, which seems to be the natural measure of the divergence between distributions in a drifting scenario. This work motivates a number of related studies, in particular a discrepancy-based analysis of the scenario introduced by Crammer et al. (2010) and further improvements of the algorithm we presented, in particular by exploiting the specific on-line learning algorithm used.

Part II
Auctions

Chapter 3
Learning in Auctions

Second-price auctions with reserve play a critical role in the revenue of modern search engines and popular online sites, since the revenue of these companies often directly depends on the outcome of such auctions. The choice of the reserve price is the main mechanism through which the auction revenue can be influenced in these electronic markets.
We cast the problem of selecting the reserve price to optimize revenue as a learning problem and present a full theoretical analysis dealing with the complex properties of the corresponding loss function. We further give novel algorithms for solving this problem and report the results of several experiments in both synthetic and realworld data demonstrating their effectiveness. 3.1 Introduction Over the past few years, advertisement has gradually moved away from the traditional printed promotion to the more tailored and directed online publicity. The advantages of online advertisement are clear: since most modern search engine and popular online site companies such as Microsoft, Facebook, Google, eBay, or Amazon, may collect information about the users’ behavior, advertisers can better target the population sector their brand is intended for. More recently, a new method for selling advertisements has gained momentum. Unlike the standard contracts between publishers and advertisers where some amount of impressions is required to be fulfilled by the publisher, an Ad Exchange works in a way similar to a financial exchange where advertisers bid and compete between each other for an ad slot. The winner then pays the publisher and his ad is displayed. The design of such auctions and their properties are crucial since they generate a large fraction of the revenue of popular online sites. These questions have motivated extensive research on the topic of auctioning in the last decade or so, particularly in the theoretical computer science and economic theory communities. Much of this work has focused on the analysis of mechanism design, either to prove some useful property of an 58 existing auctioning mechanism, to analyze its computational efficiency, or to search for an optimal revenue maximization truthful mechanism (see Muthukrishnan (2009) for a good discussion of key research problems related to Ad Exchange and references to a fast growing literature therein). One particularly important problem is that of determining an auction mechanism that achieves optimal revenue (Muthukrishnan, 2009). In the ideal scenario where the valuation of the bidders is drawn i.i.d. from a given continuous distribution, this is known to be achievable for a single item, one-shot auction (see for example (Myerson, 1981)). Extensions to different scenarios have been done and have produced a series of interesting results including (Riley and Samuelson, 1981; Milgrom and Weber, 1982; Myerson, 1981; Nisan et al., 2007), all of them based on some assumptions such as buyers having i.i.d. valuations. The results of these publications have set the basis for most Ad Exchanges in practice: the mechanism widely adopted for selling ad slots is that of a Vickrey auction (Vickrey, 1961) or second-price auction with reserve price (Easley and Kleinberg, 2010). In such auctions, the winning bidder (if any) pays the maximum of the second-place bid and the reserve price. The reserve price can be set by the publisher or automatically by the exchange. The popularity of these auctions relies on the fact that they are incentivecompatible, i.e., bidders bid exactly what they are willing to pay. It is clear that the revenue of the publisher depends greatly on how the reserve price is set: if set too low, the winner of the auction might end up paying only a small amount, even if his bid was high; on the other hand, if it is set too high, then bidders may not bid higher than the reserve price and the ad slot will not be sold. 
We propose a learning approach to the problem of determining the reserve price to optimize revenue in such auctions. The general idea is to leverage the information gained from past auctions to predict a beneficial reserve price. Since every transaction on an Exchange is logged, it is natural to seek to exploit that data. This could be used to estimate the probability distribution of the bidders, which can then be used indirectly to come up with the optimal reserve price (Myerson, 1981; Ostrovsky and Schwarz, 2011). Instead, we will seek a discriminative method making use of the loss function related to the problem and taking advantage of existing user features. Learning methods have been used in the past for the related problems of designing incentive-compatible auction mechanisms (Balcan et al., 2008; Blum et al., 2004), for algorithmic bidding (Langford et al., 2010; Amin et al., 2012), and even for predicting bid landscapes (Cui et al., 2011). Another closely related problem for which machine learning solutions have been proposed is that of revenue optimization for sponsored search ads and click-through rate predictions (Zhu et al., 2009; He et al., 2014; Devanur and Kakade, 2009). But, to our knowledge, no prior work has used historical data in combination with user features for the sole purpose of revenue optimization in this context. In fact, the only publications we are aware of that are directly related to our objective are (Ostrovsky and Schwarz, 2011) and (Cesa-Bianchi et al., 2013), which 59 considers a more general case than (Ostrovsky and Schwarz, 2011). The scenario studied by Cesa-Bianchi et al. (2013) is that of censored information, which motivates their use of a regret minimization algorithm to optimize the revenue of the seller. Our analysis assumes instead access to full information. We argue that this is a more realistic scenario since most companies do in fact have access to the full historical data. The learning scenario we consider is also more general since it includes the use of features, as is standard in supervised learning. Since user information is communicated to advertisers and bids are made based on that information, it is only natural to include user features in the formulation of the learning problem. A special case of our analysis coincides with the no-feature scenario considered by Cesa-Bianchi et al. (2013), assuming full information. But, our results further extend those of this paper even in that scenario. In particular, we present an O(m log m) algorithm for solving a key optimization problem used as a subroutine by these authors, for which they do not seem to give an algorithm. We also do not assume that buyers’ bids are sampled i.i.d. from a common distribution. Instead, we only assume that the full outcome of each auction is independent and identically distributed. This subtle distinction makes our scenario closer to reality as it is unlikely for all bidders to follow the same underlying value distribution. Moreover, even though our scenario does not take into account a possible strategic behavior of bidders between rounds, it allows for bidders to be correlated, which is common in practice. This chapter is organized as follows: in Section 3.2, we describe the setup and give a formal description of the learning problem. We discuss the relations between the scenario we consider and previous work on learning in auctions in Section 3.3. 
In particular, we show that, unlike previous work, our problem can be cast as that of minimizing the expected value of a loss function, which is a standard learning problem. Unlike most work in this field, however, the loss function naturally associated to this problem does not admit favorable properties such as convexity or Lipschitz continuity. In fact the loss function is discontinuous. Therefore, the theoretical and algorithmic analysis of this problem raises several non-trivial technical issues. Nevertheless, we use a decomposition of the loss to derive generalization bounds for this problem (see Section 3.4). These bounds suggest the use of structural risk minimization to determine a learning solution benefiting from strong guarantees. This, however, poses a new challenge: solving a highly non-convex optimization problem. Similar algorithmic problems have been of course previously encountered in the learning literature, most notably when seeking to minimize a regularized empirical 0-1 loss in binary classification. A standard method in machine learning for dealing with such issues consists of resorting to a convex surrogate loss (such as the hinge loss commonly used in linear classification). However, we show in Section 3.4.2 that no convex loss function is calibrated for the natural loss function for this problem. That is, minimizing a convex surrogate could in fact be detrimental to learning. This fact is further empirically verified in Section 4.6. The impossibility results of Section 3.4.2 prompt us to search for surrogate loss 60 functions with weaker regularity properties such as Lipschitz continuity. We describe a loss function with precisely that property which we further show to be consistent with the original loss. We also provide finite sample learning guarantees for that loss function, which suggest minimizing its empirical value while controlling the complexity of the hypothesis set. This leads to an optimization problem which, albeit non-convex, admits a favorable decomposition as a difference of two convex functions (DC-programming) . Thus, we suggest using the DC-programming algorithm (DCA) introduced by Tao and An (1998) to solve our optimization problem. This algorithm admits favorable convergence guarantees to a local minimum. To further improve upon DCA, we propose a combinatorial algorithm to cycle through different local minima with the guarantee of reducing the objective function at every iteration. Finally, in Section 4.6, we show that our algorithm outperforms several different baselines in various synthetic and real-world revenue optimization tasks. 3.2 Setup We start with the description of the problem and our formulation of the learning setup. We study second-price auctions with reserve, the type of auctions adopted in many Ad Exchanges. In such auctions, the bidders submit their bids simultaneously and the winner, if any, pays the maximum of the value of the second-place bid and a reserve price r set by the seller. This type of auctions benefits from the same truthfulness property as second-price auctions (or Vickrey auctions) Vickrey (1961): truthful bidding can be shown to be a dominant strategy in such auctions. The choice of the reserve price r is the only mechanism through which the seller can influence the auction revenue. Its choice is thus critical: if set too low, the amount paid by the winner could be too small; if set too high, the ad slot could be lost. How can we select the reserve price to optimize revenue? 
We consider the problem of learning to set the reserve prices to optimize revenue in second-price auctions with reserve. The outcome of an auction can be encoded by the highest and second-highest bids which we denote by a vector b = (b(1) , b(2) ) ∈ B ⊂ R2+ . We will assume that there exists an upper bound M ∈ (0, +∞) for the bids: supb∈B b(1) = M . For a given reserve price r and bid pair b, by definition, the revenue of an auction is given by Revenue(r, b) = b(2) 1r<b(2) + r1b(2) ≤r≤b(1) . We consider the general scenario where a feature vector x ∈ X ⊂ RN is associated with each auction. In the auction theory literature, this feature vector is commonly referred to as public information. In the context of online advertisement, this could be for example information about the user’s location, gender or age. The learning problem can thus be formulated as that of selecting out of a hypothesis set H of functions mapping X to R a 61 hypothesis h with high expected revenue E (x,b)∼D [Revenue(h(x), b)], (3.1) where D is an unknown distribution according to which pairs (x, b) are drawn, where we have made the implicit assumption that bids are independent of the reserve price. Instead of the revenue, we will consider a loss function L defined for all (r, b) by L(r, b) = −Revenue(r, b), and will equivalently seek a hypothesis h with small expected loss L(h) := E(x,b)∼D [L(h(x), b)]. As in standard supervised learning scenarios, we assume access to a training sample S = ((x1 , b1 ), . . . , (xm , bm )) of size m ≥ 1 drawn i.i.d. according to D. We will denote by LbS (h) the empirical loss P LbS (h) = m1 m i=1 L(h(xi , bi ). Notice that we only assume that the auction outcomes are i.i.d. and not that bidders are independent of each other with the same underlying bid distribution, as in some previous work (Cesa-Bianchi et al., 2013; Ostrovsky and Schwarz, 2011). In the next sections, we will present a detailed study of this learning problem, starting with a review of the related literature. 3.3 Previous work Here, we briefly discuss some previous work related to the study of auctions from a learning standpoint. One of the earliest contributions in this literature is that of Blum et al. (2004) where the authors studied a posted-price auction mechanism where a seller offers some good at a certain price and where the buyer decides to either accept that price or reject it. It is not hard to see that this type of auctions is equivalent to second-price auctions with reserve with a single buyer. The authors consider a scenario of repeated interactions with different buyers where the goal is to design an incentive-compatible method of setting prices that is competitive with the best fixed-priced strategy in hindsight. A fixed-price strategy is one that simply offers the same price to all buyers. Using a variant of the EXP3 algorithm of Auer et al. (2002), the authors designed a pricing algorithm achieving a (1 + )-approximation to the best fixed-price strategy. This same scenario was also studied by Kleinberg and Leighton (2003b) who gave an online algorithm whose regret after T rounds is in O(T 2/3 ). A step further in the design of optimal pricing strategies was proposed by Balcan et al. (2008). One of the problems considered by the authors was that of setting prices for n buyers in a posted-price auction as a function of their public information. Unlike the on-line scenario of Blum et al. (2004), Balcan et al. (2008) considered a batch scenario where all buyers are known in advance. 
However, the comparison class considered was no longer that of simple fixed-price strategies but functions mapping public information to prices. This makes the problem more challenging and in fact closer to the scenario we consider. The authors showed that finding a (1 + )-optimal truthful mechanism is 62 equivalent to finding an algorithm to optimize the empirical risk associated to the loss function we consider (in the case b(2) ≡ 0). There are multiple connections between this work and our results. In particular, the authors pointed out that the discontinuity and asymmetry of the loss function presented several challenges to their analysis. We will see that, in fact, the same problems appear in the derivation of our learning guarantees. But, we will present an algorithm for minimizing the empirical risk which was a crucial element missing in their results. A different line of work by Cui et al. (2011) focused on predicting the highest bid of a second-price auction. To estimate the distribution of the highest bid, the authors partitioned the space of advertisers based on their campaign objectives and estimated the distribution for each partition. Within each partition, the distribution of the highest bid was modeled as a mixture of log-normal distributions where the means and standard deviations of the mixtures were estimated as a function of the data features. While it may seem natural to seek to predict the highest bid, we show that this is not necessary and that in fact accurate predictions of the highest bid do not necessarily translate into algorithms achieving large revenue (see Section 4.6). As already mentioned, the closest previous work to ours is that of Cesa-Bianchi et al. (2013), who studied the problem of directly optimizing the revenue under a partial information setting where the learner can only observe the value of the second-highest bid, if it is higher than the reserve price. In particular, the highest bid remains unknown to the learner. This is a natural scenario for auctions such as those of eBay where only the price at which an object is sold is reported. To do so, the authors expressed the expected revenue in terms of the quantity q(t) = P[b(2) > t]. This can be done as follows: (3.2) E[Revenue(r, b)] = E [b(2) 1r<b(2) ] + r P[b(2) ≤ r ≤ b(1) ] b b(2) Z +∞ = P[b(2) 1r<b(2) > t] dt + r P[b(2) ≤ r ≤ b(1) ] Z0 r Z ∞ (2) = P[r < b ] dt + P[b(2) > t]dt + r P[b(2) ≤ r ≤ b(1) ] 0 r Z ∞ = P[b(2) > t] dt + r(P[b(2) > r] + 1−P[b(2) > r]−P[b(1) < r]) Zr ∞ = P[b(2) > t] dt + r P[b(1) ≥ r]. r The main observation of Cesa-Bianchi et al. (2013) was that the quantity q(t) can be estimated from the observed outcomes of previous auctions. Furthermore, if the buyers’ bids are i.i.d., then, one can express P[b(1) ≥ r] as a function of the estimated value of q(r). This implies that the right-hand side of (3.2) can be accurately estimated and therefore an optimal reserve price can be selected. Their algorithm makes calls to a 63 procedure that maximizes the empirical revenue. The authors, however, did not describe an algorithm for that maximization. A by-product of our work is an efficient algorithm for that procedure. The guarantees of Cesa-Bianchi et al. (2013) are similar to those presented in the next section in the special case of learning without features. However, our derivation is different since we consider a batch scenario while Cesa-Bianchi et al. (2013) treated an online setup for which they presented regret guarantees. 
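The identity (3.2) is easy to check numerically: for any sample of bid pairs, averaging the revenue directly and plugging the empirical survival function of b^(2) into the right-hand side give the same value up to discretization error. The sketch below uses synthetic bids purely for illustration.

```python
import numpy as np

def empirical_revenue(r, b1, b2):
    """Average revenue of a second-price auction with reserve r."""
    return np.mean(b2 * (r < b2) + r * ((b2 <= r) & (r <= b1)))

def survival_form(r, b1, b2, grid=2000):
    """Right-hand side of (3.2) estimated from the same sample:
    the integral over [r, M] of P[b2 > t] dt plus r * P[b1 >= r]."""
    M = b1.max()
    dt = (M - r) / grid
    ts = np.linspace(r, M, grid, endpoint=False)
    integral = sum(np.mean(b2 > t) for t in ts) * dt   # left Riemann sum
    return integral + r * np.mean(b1 >= r)

# Illustrative bids: the two estimates of the expected revenue agree.
rng = np.random.default_rng(0)
b1 = rng.uniform(0.5, 1.0, size=5000)
b2 = b1 * rng.uniform(0.0, 1.0, size=5000)   # second bid never exceeds the first
print(empirical_revenue(0.6, b1, b2), survival_form(0.6, b1, b2))
```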
3.4 Learning Guarantees The problem we consider is an instance of the well known family of supervised learning problems. However, the loss function L does not admit any of the properties such as convexity or Lipschitz continuity often assumed in the analysis of the generalization error, as shown by Figure 3.1(a). Furthermore, L is discontinuous and, unlike the 01 loss function whose discontinuity point is independent of the label, its discontinuity depends on the outcome b of the auction. Thus, the problem of learning with the loss function L requires a new analysis. 3.4.1 Generalization bound To analyze the complexity of the family of functions LH mapping X × B to R defined by LH = {(x, b) 7→ L(h(x), b) : h ∈ H}, we decompose L as a sum of two loss functions l1 and l2 with more favorable properties than L. We have L = l1 + l2 with l1 and l2 defined for all (r, b) ∈ R × B by l1 (r, b) = −b(2) 1r<b(2) − r1b(2) ≤r≤b(1) − b(1) 1r>b(1) l2 (r, b) = b(1) 1r>b(1) . These functions are shown in Figure 3.1(b). Note that, for a fixed b, the function r 7→ l1 (r, b) is 1-Lipschitz since the slope of the lines defining the function is at most 1. We will consider the corresponding families of loss functions: l1H = {(x, b) 7→ l1 (h(x), b) : h ∈ H} and l2H = {(x, b) 7→ l2 (h(x), b) : h ∈ H} and use the notion of pseudo-dimension as well as those of empirical and average Rademacher complexity to measure their complexities. The pseudo-dimension is a standard complexity measure (Pollard, 1984) extending the notion of VC-dimension to real-valued functions (see also Mohri et al. (2012)). For a family of functions G and finite sample S = b S (G) = (z1 ,. . . , zm ) ofP size m, the empirical Rademacher complexity is defined by R m 1 > Eσ supg∈G m i=1 σi g(zi ) , where σ = (σ1 , . . . , σm ) , with σi s independent uniform random variables taking values in {−1, +1}. The Rademacher complexity of G is deb S (G)]. fined as Rm (G) = ES∼Dm [R 64 b2 1 b(2) 0 0 0 1 2 3 4 5 6 7 -1 − -2 -3 -4 4 -1 3 b(2) -2 2 -3 1 -4 0 -5 -5 -1 0 b 1 2 1 3 4 b 5 6 7 0 (1) (a) 1 2 3 4 5 6 7 b(1) (b) Figure 3.1: (a) Plot of the loss function r 7→ L(r, b) for fixed values of b(1) and b(2) ; (b) Functions l1 on the left and l2 on the right. To bound the complexity of LH , we will first bound the complexity of the family of loss functions l1H and l2H . Since l1 is 1-Lipschitz, the complexity of the class l1H can be readily bounded by that of H, as shown by the following proposition. Proposition 10. For any sample S=((x1 ,b1 ), . . . , (xm ,bm )), the empirical Rademacher complexity of l1H can be bounded as follows: b S (l1H ) ≤ R b S (H). R Proof. By definition of the empirical Rademacher complexity, we can write " # " # m m X X 1 1 b S (l1H ) = E sup σi l1 (h(xi ), bi ) = E sup σi (ψi ◦ h)(xi ) , R m σ h∈H i=1 m σ h∈H i=1 where, for all i ∈ [1, m], ψi is the function defined by ψi : r 7→ l1 (r, bi ). For any i ∈ [1, m], ψi is 1-Lipschitz, thus, by the contraction lemma of Appendix B.1 (Lemma 13), the following inequality holds: " # m X 1 b S (l1H ) ≤ b S (H), E sup σi h(xi ) = R R m σ h∈H i=1 which completes the proof. As shown by the following proposition, the complexity of l2H can be bounded in terms of the pseudo-dimension of H. Proposition 11. Let d = Pdim(H) denote the pseudo-dimension of H, then, for any sample S = ((x1 , b1 ), . . . , (xm , bm )), the empirical Rademacher complexity of l2H can be bounded as follows: r em b S (l2H ) ≤ M 2d log d . R m 65 Proof. 
By definition of the empirical Rademacher complexity, we can write " # " # m m X X 1 1 (1) b S (l2H ) = E sup σi bi 1h(xi )>b(1) = E sup σi Ψi (1h(xi )>b(1) ) , R i i m σ h∈H i=1 m σ h∈H i=1 (1) where, for all i ∈ [1, m], Ψi is the M -Lipschitz function x 7→ bi x. Thus, by Lemma 13 combined with Massart’s lemma (see for example Mohri et al. (2012)), we can write " # r m X 2d0 log em M d0 b S (l2H ) ≤ E sup σi 1h(xi )>b(1) ≤ M , R i m σ h∈H i=1 m where d0 = VCdim({(x, b) 7→ 1h(x)−b(1) >0 : (x, b) ∈ X × B}). Since the second bid component b(2) plays no role in this definition, d0 coincides with VCdim({(x, b(1) ) 7→ 1h(x)−b(1) >0 : (x, b(1) ) ∈ X × B1 }), where B1 is the projection of B ⊆ R2 onto its first component, and is upper-bounded by VCdim({(x, t) 7→ 1h(x)−t>0 : (x, t) ∈ X × R}), that is, the pseudo-dimension of H. Propositions 10 and 11 can be used to derive the following generalization bound for the learning problem we consider. Theorem 7. For any δ > 0, with probability at least 1 − δ over the draw of an i.i.d. sample S of size m, the following inequality holds for all h ∈ H: s r em 2d log d log 1δ L(h) ≤ LbS (h) + 2Rm (H) + 2M +M . m 2m Proof. By a standard property of the Rademacher complexity, since L = l1 + l2 , the following inequality holds: Rm (LH ) ≤ Rm (l1H ) + Rm (l2H ). Thus, in view of Propositions 10 and 11, the Rademacher complexity of LH can be bounded via r 2d log em d Rm (LH ) ≤ Rm (H) + M . m The result then follows by the application of a standard Rademacher complexity bound (Koltchinskii and Panchenko, 2002). This learning bound invites us to consider an algorithm seeking h ∈ H to minimize the empirical loss LbS (h), while controlling the complexity (Rademacher complexity and pseudo-dimension) of the hypothesis set H. In the following section, we discuss the computational problem of minimizing the empirical loss and suggest the use of a surrogate loss leading to a more tractable problem. 66 b2 1 0 0 1 2 3 4 5 6 7 -1 -2 -3 -4 -5 b1 (a) (b) Figure 3.2: (a) Piecewise linear convex surrogate loss L . (b) Comparison of the sum p Pm of real losses i=1 L(·, bi ) for m = 500 with the sum of convex surrogate losses. Note that the minimizers are significantly different. 3.4.2 Surrogate Loss As pointed out in Section 3.4, the loss function L does not admit most properties of traditional loss functions used in machine learning: for any fixed b, L(·, b) is not differentiable (at two points), it is not convex nor Lipschitz, and in fact it is discontinuous. For any fixed b, L(·, b) is quasi-convex,1 a property that is often desirable since there exist several solutions for quasi-convex optimization Pmproblems. However, in general, a sum of quasi-convex functions, such as the sum i=1 L(·, bi ) appearing in the definition of the empirical loss, is not quasi-convex and a fortiori not convex.2 In fact, in general, such a sum may admit exponentially many local minima. This leads us to seek a surrogate loss function with more favorable optimization properties. A standard method in machine learning consists of replacing the loss function L with a convex upper bound (Bartlett et al., 2006). A natural candidate in our case is the piecewise linear function Lp shown in Figure 3.2(a). While this is a convex loss function, and thus convenient for optimization, it is not calibrated. That is, it is possible for rp ∈ argmin Eb [Lp (r, b)] to have a large expected true loss. Therefore, it does not provide us with a useful surrogate. 
The calibration problem is illustrated by PmFigure 3.2(b) in dimension one, where the true objective function to be minimized i=1 L(r, bi ) is compared with the sum of the surrogate losses. The next theorem shows that this problem in fact affects any non-constant convex surrogate. It is expressed in terms of the e : R × R+ → R defined by L(r, e b) = −r1r≤b , which coincides with L when the loss L second bid is 0. 1 A function f : R → R is said to be quasi-convex if for any α ∈ R the sub-level set {x : f (x) ≤ α} is convex. 2 It is known that, under some separability condition, if a finite sum of quasi-convex functions on an open convex set is quasi-convex, then all but perhaps one of them is convex (Debreu and Koopmans, 1982). 67 e if, Definition 12. We say that a function Lc : [0, M ] × [0, M ] → R is consistent with L ∗ for any distribution D, there exists a minimizer r ∈ argminr Eb∼D [Lc (r, b)] such that e b)]. r∗ ∈ argminr Eb∼D [L(r, Definition 13. We say that a sequence of convex functions (Ln )n∈N mapping [0, M ] × e if there exists a sequence (rn )n∈N in R with [0, M ] to R is weakly consistent with L rn ∈ argminr Eb∼D [Ln (r, b)] for all n ∈ N such that limn→+∞ rn = r∗ with r∗ ∈ e b)]. argmin Eb∼D [L(r, Proposition 12 (Convex surrogates). Let Lc : [0, M ] × [0, M ] → R be a bounded funce then Lc (·, b) is tion, convex with respect to its first argument. If Lc is consistent with L, constant for any b ∈ [0, M ]. Proof. The idea behind the proof is the following: for any two bids b1 < b2 , there e b)] is minimized at both exists a distribution D with support {b1 , b2 } such that Eb∼D [L(r, r = b1 and r = b2 . We show this implies that Eb∼D [Lc (r, b)] must attain a minimum at both points too. By convexity of Lc it follows that Eb∼D [Lc (r, b)] must be constant on the interval [b1 , b2 ]. The main part of the proof will be showing that this implies that the function Lc (·, b1 ) must also be constant on the interval [b1 , b2 ]. Finally, since the value of b2 was chosen arbitrarily, it will follow that Lc (·, b1 ) is constant. Let 0 < b1 < b2 < M and, for any µ ∈ [0, 1], let Dµ denote the probability distribution with support included in {b1 , b2 } defined by Dµ (b1 ) = µ and let Eµ denote the expectation with respect to this distribution. A straightforward calculation shows that e b)] is given by b2 if µ > b2 −b1 and by b1 if µ < b2 −b1 . the unique minimizer of Eµ [L(r, b2 b2 Therefore, if Fµ (r) = Eµ [Lc (r, b)], it must be the case that b2 is a minimizer of Fµ for 1 1 and b1 is a minimizer of Fµ for µ < b2b−b . µ > b2b−b 2 2 For a convex function f : R → R, we denote by f − its left-derivative and by f + its right-derivative, which are guaranteed to exist. We will also denote here, for any b ∈ R, by g − (·, b) and g + (·, b) the left- and right-derivatives of the function g(·, b) and by g 0 (·, b) its derivative, when it exists. Recall that for a convex function f , if x0 is a minimizer, then f − (x0 ) ≤ 0 ≤ f + (x0 ). In view of that and the minimizing properties of b1 and b2 , the following inequalities hold: − 0 ≥ Fµ− (b2 ) = µL− c (b2 , b1 ) + (1 − µ)Lc (b2 , b2 ) 0 ≤ Fµ+ (b1 ) ≤ Fµ− (b2 ) b2 − b1 , b2 b2 − b1 for µ < , b2 for µ > (3.3) (3.4) where the second inequality in (3.4) holds by convexity of Fµ and the fact that b1 < b2 . 1 By setting µ = b2b−b , it follows from inequalities (3.3) and (3.4) that Fµ− (b2 ) = 0 and 2 Fµ+ (b1 ) = 0. By convexity of Fµ , it follows that Fµ is constant on the interval (b1 , b2 ). 
We now show this may only happen if Lc (·, b1 ) is also constant. By rearranging terms 68 in (3.3) and plugging in the expression of µ, we obtain the equivalent condition − (b2 − b1 )L− c (b2 , b1 ) = −b1 Lc (b2 , b2 ). Since Lc is a bounded function, it follows that L− c (b2 , b1 ) is bounded for any b1 , b2 ∈ − (0, M ), therefore as b1 → b2 we must have b2 Lc (b2 , b2 ) = 0, which implies L− c (b2 , b2 )= − 0 for all b2 > 0. In view of this, inequality (3.3) may only be satisfied if Lc (b2 , b1 ) ≤ − 0. However, the convexity of Lc implies L− c (b2 , b1 ) ≥ Lc (b1 , b1 ) = 0. Therefore, L− c (b2 , b1 ) = 0 must hold for all b2 > b1 > 0. Similarly, by definition of Fµ , the first inequality in (3.4) implies + µL+ c (b1 , b1 ) + (1 − µ)Lc (b1 , b2 ) ≥ 0. (3.5) + − Nevertheless, for any b2 > b1 we have 0 = L− c (b1 , b1 ) ≤ Lc (b1 , b1 ) ≤ Lc (b2 , b1 ) = 0. + + Consequently, Lc (b1 , b1 ) = 0 for all b1 > 0. Furthermore, Lc (b1 , b2 ) ≤ L+ c (b2 , b2 ) = 0. + Therefore, for inequality (3.5) to be satisfied, we must have Lc (b1 , b2 ) = 0 for all b1 < b2 . Thus far, we have shown that for any b > 0, if r ≥ b, then L− c (r, b) = 0, while + Lc (r, b) = 0 for r ≤ b. A simple convexity argument shows that Lc (·, b) is then differentiable and L0c (r, b) = 0 for all r ∈ (0, M ), which in turn implies that Lc (·, b) is a constant function. The result of the previous proposition can be considerably strengthened, as shown by the following theorem. As in the proof of the previous proposition, to simplify the notation, for any b ∈ R, we will denote by g 0 (·, b) the derivative of a differentiable function g(·, b). Theorem 8. Let (Ln )n∈N denote a sequence of functions mapping [0, M ] × [0, M ] to R that are convex and differentiable with respect to their first argument and satisfy the following conditions: • supb∈[0,M ],n∈N max(|L0n (0, b)|, |L0n (M, b)| = K < ∞; e • (Ln )n is weakly consistent with L; • Ln (0, b) = 0 for all n ∈ N and for all b. If the sequence (Ln )n converges pointwise to a function Lc , then Ln (·, b) converges uniformly to Lc (·, b) ≡ 0. We defer the proof of this theorem to Appendix B.2 and present here only a sketch of the proof. We first show that the convexity of the functions Ln implies that the convergence to Lc must be uniform and that Lc is convex with respect to its first argument. This fact and the weak consistency of the sequence Ln will then imply that Lc is consistent e and therefore must be constant by Proposition 12. with L The theorem just presented shows that even a weakly consistent sequence of convex losses is uniformly close to a constant function and therefore not helpful to tackle 69 the learning task we consider. This suggests searching for surrogate losses that admit weaker regularity assumptions such as Lipschitz continuity. Perhaps, the most natural surrogate loss function is then Lγ , an upper bound on L defined for all γ > 0 by: Lγ (r, b) = −b(2) 1r≤b(2) − r1 + b(2) <r≤ (1−γ)b(1) ∨b(2) 1 − γ γ b(2) ∨ (1) (r − b(1) )1 (2) b −b (1−γ)b(1) ∨b(2) <r≤b(1) , where c ∨ d = max(c, d). The plot of this function is shown in Figure 3.3(a). The max terms ensure that the function is well defined if (1 − γ)b(1) < b(2) . However, this turns out to be also a poor choice as Lγ is a loose upper bound on L in the most critical region, that is around the minimum of the loss L. 
Thus, instead, we will consider, for any γ > 0, the loss function Lγ defined as follows: 1 Lγ (r, b) = −b(2) 1r≤b(2) − r1b(2) <r≤b(1) + (r − (1 + γ)b(1) )1b(1) <r≤(1+γ)b(1) , γ (3.6) whose plot is shown in Figure 3.3(a).3 A comparison between the sum of L-losses and the sum of Lγ -losses is shown in Figure 3.3(b). Observe that the fit is considerably better than that of the piecewise linear convex surrogate loss shown in Figure 3.2(b). A possible concern associated with the loss function Lγ is that it is a lower bound for L. One might think then that minimizing it would not lead to an informative solution. However, we argue that this problem arises significantly with upper bounding losses such as the convex surrogate, which we showed not to lead to a useful minimizer, or Lγ , which is a poor approximation of L near its minimum. By matching the original loss L in the region of interest, around the minimal value, the loss function Lγ leads to more informative solutions for this problem. In fact, we show that that the expected loss Lγ (h) : = Ex,b [Lγ (h)] admits a minimizer close to the minimizer of L(h). Since Lγ → L as γ → 0, this result may seem trivial. However, this convergence is not uniform and therefore calibration is not guaranteed. Theorem 9. Let H be a closed, convex subset of a linear space of functions containing 0. Then, the following inequality holds for all γ ≥ 0: L(h∗γ ) − Lγ (h∗γ ) ≤ γM. Notice that, since L ≥ Lγ for all γ ≥ 0, the theorem implies that limγ→0 L(h∗γ ) = L(h∗ ). Indeed, let h∗ denote the best-in-class hypothesis for the loss function L. Then, 3 Technically, the theoretical and algorithmic results we present for Lγ could be developed in a somewhat similar way for Lγ . 70 b2 (1 − γ)b1 1 (1 + γ)b1 b2 1 0 0 0 1 2 3 4 5 0 -1 -1 -2 -2 -3 -3 -4 -4 1 2 3 b1 4 5 b1 (a) (b) Figure 3.3: (a) Comparison of the true loss L with surrogate loss L on the left and Pγ 500 surrogate loss L on the right, for γ = 0.1. (b) Comparison of γ i=1 L(r, bi ) and P500 i=1 Lγ (r, bi ) the following straightforward inequalities hold: L(h∗ ) ≤ L(h∗γ ) ≤ Lγ (h∗γ ) + γM ≤ Lγ (h∗ ) + γM ≤ L(h∗ ) + γM. By letting γ → 0 we see that L(h∗γ ) → L(h∗ ). This is a remarkable result as it not only provides a convergence guarantee but it also gives us an explicit rate of convergence. We will later exploit this fact to come up with an optimal choice for γ. The proof of Theorem 9 is based on the following partitioning of X × B in four regions where Lγ is defined as an affine function: I1 = {(x, b)|h∗γ (x) ≤ b(2) } I3 = {(x, b)|h∗γ (x) ∈ (b(1) , (1 + γ)b(1) ]} I2 = {(x, b)|h∗γ (x) ∈ (b(2) , b(1) ]} I4 = {(x, b)|h∗γ (x) > (1 + γ)b(1) }, Notice that Lγ and L differ only on I3 . Therefore, we only need to bound the measure of this set which can be done as in Lemma 14 (see Appendix B.3). Theorem 9. . We can express the difference as E x,b h L(h∗γ (x), b) − i Lγ (h∗γ (x), b) = 4 X k=1 i h E (L(h∗γ (x), b) − Lγ (h∗γ (x), b))1Ik (x, b) x,b i = E (L(h∗γ (x), b) − Lγ (h∗γ (x), b))1I3 (x, b) x,b h1 i (1) ∗ = E (3.7) ((1 + γ)b − hγ (x))1I3 (x, b)) . x,b γ h Furthermore, for (x, b) ∈ I3 , we know that b(1) < h∗γ (x). Thus, we can bound (3.7) 71 by Ex,b [h∗γ (x)1I3 (x, b)], which, by Lemma 14 in Appendix B.3, is upper bounded by h γ Ex,b h∗γ (x)1I2 (x, b) . Thus, the following inequalities hold: h i h i E L(h∗γ (x), b) − E Lγ (h∗γ (x), b) x,b x,b i h i h ≤ γ E h∗γ (x)1I2 (x, b) ≤ γ E b(1) 1I2 (x, b) ≤ γM, x,b x,b using the fact that h∗γ (x) ≤ b(1) for (x, b) ∈ I2 . 
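For reference, the surrogate loss of (3.6) and the original loss can both be implemented in a few lines; the vectorized form below is an illustrative sketch.

```python
import numpy as np

def true_loss(r, b1, b2):
    """The loss L(r, b), i.e. the negated auction revenue."""
    r, b1, b2 = [np.asarray(a, dtype=float) for a in (r, b1, b2)]
    return np.where(r < b2, -b2, np.where(r <= b1, -r, 0.0))

def surrogate_loss(r, b1, b2, gamma):
    """The surrogate loss L_gamma of equation (3.6).

    It coincides with L for r <= b1 and then increases linearly with
    slope 1/gamma, reaching 0 at (1 + gamma) * b1."""
    r, b1, b2 = [np.asarray(a, dtype=float) for a in (r, b1, b2)]
    ramp = (r - (1.0 + gamma) * b1) / gamma
    return np.where(r <= b2, -b2,
           np.where(r <= b1, -r,
           np.where(r <= (1.0 + gamma) * b1, ramp, 0.0)))
```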
The 1/γ-Lipschitzness of Lγ can be used to prove the following generalization bound. Theorem 10. Fix γ ∈ (0, 1] and let S denote a sample of size m. Then, for any δ > 0, with probability at least 1 − δ over the choice of the sample S, for all h ∈ H, the following holds: s 2 Lγ (h) ≤ Lbγ (h) + Rm (H) + M γ log 1δ . 2m (3.8) Proof. Let Lγ,H denote the family of functions {(x, b) → Lγ (h(x), b) : h ∈ H}. The loss function Lγ is γ1 -Lipschitz since the slope of the lines defining it is at most γ1 . Thus, using the contraction lemma (Lemma 13) as in the proof of Proposition 10, gives Rm (Lγ,H ) ≤ γ1 Rm (H). The application of a standard Rademacher complexity bound to the family of functions Lγ,H then shows that for any δ > 0, with probability at least 1 − δ, for any h ∈ H, the following holds: s log 1δ 2 Lγ (h) ≤ Lbγ (h) + Rm (H) + M . γ 2m We conclude this section by showing that Lγ admits a stronger form of consistency. More precisely, we prove that the generalization error of the best-in-class hypothesis L∗ := L(h∗ ) can be lower bounded in terms of that of the empirical minimizer of Lγ , b hγ : = argminh∈H Lbγ (h). Theorem 11. Let M = supb∈B b(1) and let H be a hypothesis set with pseudo-dimension d = Pdim(H). Then, for any δ > 0 and a fixed value of γ > 0, with probability at least 1 − δ over the choice of a sample S of size m, the following inequality holds: s r m 2d log d log 2δ 2γ + 2 ∗ ∗ b L ≤ L(hγ ) ≤ L + Rm (H) + γM + 2M + 2M . γ m 2m 72 Proof. By Theorem 7, with probability at least 1 − δ/2, the following holds: s r m 2d log log 2δ d L(b hγ ) ≤ LbS (b hγ ) + 2Rm (H) + 2M +M . m 2m (3.9) Applying Lemma 14 with the empirical distribution induced by the sample, we can bound LbS (b hγ ) by Lbγ (b hγ ) + γM . The first term of the previous expression is less than ∗ hγ . Moreover, the same analysis used in the proof of Theorem 10 Lbγ (hγ ) by definition of b shows that with probability 1 − δ/2, s log 2δ 2 . Lbγ (h∗γ ) ≤ Lγ (h∗γ ) + Rm (H) + M γ 2m Finally, by definition of h∗γ and using the fact that L is an upper bound on Lγ , we can write Lγ (h∗γ ) ≤ Lγ (h∗ ) ≤ L(h∗ ). Thus, 2 LbS (b hγ ) ≤ L(h∗ ) + Rm (H) + M γ s log 2δ + γM. 2m Replacing this inequality in (3.9) and applying the union bound yields the result. This bound can be extended to hold uniformly over all γ at the price of a term in qlog log 1 2 γ √ O . Thus, for appropriate choices of γ as a function of m (for instance m γ = 1/m1/4 ) we can guarantee the convergence of L(b hγ ) to L∗ , a stronger form of consistency (See Appendix B.3). These results are reminiscent of the standard margin bounds with γ playing the role of a margin. The situation here is however somewhat different. Our learning bounds suggest, for a fixed γ ∈ (0, 1], to seek a hypothesis h minimizing the empirical loss Lbγ (h) while controlling a complexity term upper bounding Rm (H), which in the case of a family of linear hypotheses could be khk2K for some PSD kernel K. Since the bound can hold uniformly for all γ, we can use it to select γ out of a finite set of possible grid search values. Alternatively, γ can be set via cross-validation. In the next section, we present algorithms for solving this regularized empirical risk minimization problem. 3.5 Algorithms In this section, we show how to minimize the empirical risk under two regimes: first we analyze the no-feature scenario considered in Cesa-Bianchi et al. 
(2013) and then we present an algorithm to solve the more general feature-based revenue optimization problem.

[Figure 3.4: (a) Prototypical v-function. (b) Illustration of the fact that the definition of $V_i(r, b_i)$ does not change on an interval $[n_k, n_{k+1}]$.]

3.5.1 No-Feature Case

We now present a general algorithm to optimize sums of functions similar to $L_\gamma$ or $L$ in the one-dimensional case.

Definition 14. We say that a function $V \colon \mathbb{R} \times B \to \mathbb{R}$ is a v-function if it admits the following form:
$$V(r, b) = -a^{(1)}\, 1_{r \le b^{(2)}} - a^{(2)} r\, 1_{b^{(2)} < r \le b^{(1)}} + (a^{(3)} r - a^{(4)})\, 1_{b^{(1)} < r < (1+\eta) b^{(1)}},$$
with $a^{(1)} > 0$ and $\eta > 0$ constants, and $a^{(2)}, a^{(3)}, a^{(4)}$ defined by $a^{(1)} = \eta a^{(3)} b^{(2)}$, $a^{(2)} = \eta a^{(3)}$, and $a^{(4)} = a^{(3)}(1+\eta) b^{(1)}$.

Figure 3.4(a) illustrates this family of loss functions. A v-function is a generalization of $L_\gamma$ and $L$. Indeed, any v-function $V$ satisfies $V(r, b) \le 0$ and attains its minimum at $b^{(1)}$. Finally, as can be seen straightforwardly from Figure 3.3, $L_\gamma$ is a v-function for any $\gamma > 0$. We consider the following general problem of minimizing a sum of v-functions:
$$\min_{r \ge 0} F(r) := \sum_{i=1}^m V_i(r, b_i). \qquad (3.10)$$
Observe that this is not a trivial problem since, for any fixed $b_i$, $V_i(\cdot, b_i)$ is non-convex and, in general, a sum of $m$ such functions may admit many local minima. Of course, we could seek a solution that is $\epsilon$-close to the optimal reserve via a grid search over the points $r_i = i\epsilon$. However, the guarantees for that algorithm would depend on the continuity of the function; in particular, it might fail for the loss $L$. Instead, we exploit the particular structure of v-functions to minimize $F$ exactly. The following proposition, which is proven in Appendix B.4, shows that the minimum is attained at one of the highest bids, which matches the intuition. Notice that for the loss function $L$ this is immediate: if $r$ is not a highest bid, one can raise the reserve price without increasing any of the component losses.

Proposition 13. Problem (3.10) admits a solution $r^*$ that satisfies $r^* = b_i^{(1)}$ for some $i \in [1, m]$.

Problem (3.10) can thus be reduced to examining the value of the function at the $m$ arguments $b_i^{(1)}$, $i \in [1, m]$. This yields a straightforward method for solving the optimization, which consists of computing $F(b_i^{(1)})$ for all $i$ and taking the minimum. But, since the computation of each $F(b_i^{(1)})$ takes $O(m)$, the overall computational cost is in $O(m^2)$, which can be prohibitive for even moderately large values of $m$. Instead, we present a combinatorial algorithm that solves the optimization problem (3.10) in $O(m \log m)$. Let $N = \bigcup_i \{b_i^{(1)}, b_i^{(2)}, (1+\eta) b_i^{(1)}\}$ denote the set of all boundary points associated with the functions $V(\cdot, b_i)$. The algorithm proceeds as follows: first, sort the set $N$ to obtain the ordered sequence $(n_1, \ldots, n_{3m})$, which can be achieved in $O(m \log m)$ using a comparison-based sorting algorithm. Next, evaluate $F(n_1)$ and compute $F(n_{k+1})$ from $F(n_k)$ for all $k$. The main idea of the algorithm is the following: since the definition of $V_i(\cdot, b_i)$ can only change at boundary points (see also Figure 3.4(b)), computing $F(n_{k+1})$ from $F(n_k)$ can be achieved in constant time. Indeed, since between $n_k$ and $n_{k+1}$ there are only two boundary points, we can compute $V(n_{k+1}, b_i)$ from $V(n_k, b_i)$ by recomputing $V$ for only two values of $b_i$, which can be done in constant time.
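Before the formal description, the following sketch conveys the idea in code. Rather than the boundary-point sweep over general v-functions, it specializes to the original loss $L$, that is, to empirical revenue maximization: by Proposition 13 the optimum lies among the highest bids, and sorting combined with binary search and prefix sums evaluates all $m$ candidates in $O(m \log m)$ overall.

```python
import numpy as np

def best_reserve(b1, b2):
    """Empirical revenue maximization without features in O(m log m).

    Specializes the chapter's idea to the original loss L: the total
    revenue at reserve r is sum of b2_i over {i : b2_i > r} plus
    r * #{i : b2_i <= r <= b1_i}, and both quantities can be read off
    sorted arrays with binary search."""
    b1 = np.sort(np.asarray(b1, dtype=float))
    b2 = np.sort(np.asarray(b2, dtype=float))
    b2_prefix = np.concatenate(([0.0], np.cumsum(b2)))

    def total_revenue(r):
        k = np.searchsorted(b2, r, side="right")     # #{i : b2_i <= r}
        sum_b2_above = b2_prefix[-1] - b2_prefix[k]  # sum of b2_i > r
        n_lost = np.searchsorted(b1, r, side="left")  # #{i : b1_i < r}
        return sum_b2_above + r * (k - n_lost)

    revenues = np.array([total_revenue(r) for r in b1])
    best = np.argmax(revenues)
    return b1[best], revenues[best] / len(b1)        # reserve, avg revenue
```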
We now give a more detailed description and proof of correctness of our algorithm. Proposition 14. There exists an algorithm to solve the optimization problem (3.10) in O(m log m). (1) (4) Proof. The pseudocode of the algorithm is given in Algorithm 2, where ai , ..., ai denote the parameters defining the functions Vi (r, bi ). We will prove that, after running Algorithm 2, we can compute F (nj ) in constant time using: (1) (2) (3) (4) F (nj ) = cj + cj nj + cj nj + cj . (2) (3.11) This holds trivially for n1 since by definition n1 ≤ bi for all i and therefore F (n1 ) = P (1) − m i=1 ai . Now, assume that (3.11) holds for j, we prove that it must also hold for (2) (1) (1) j + 1. Suppose nj = bi for some i (the cases nj = bi and nj = (1 + η)bi can be (1) handled in the same way). Then Vi (nj , bi ) = −ai and we can write X (1) (2) (3) (4) (1) Vk (nj , bk ) = F (nj ) − V (nj , bi ) = (cj + cj nj + cj nj + cj ) + ai . k6=i 75 Algorithm 2 Sorting S (1) (2) (1) N := m i=1 {bi , bi , (1 + η)bi }; n1 , ..., n3m ) = Sort(N ); (1) (2) (3) (4) Set ci := (ci , ci , ci , ci ) = 0 for i = 1, ..., 3m; P (1) (1) Set c1 = − m i=1 ai ; for j = 2, ..., 3m do Set cj = cj−1 ; (2) if nj−1 = bi for some i then (1) (1) (1) c j = c j + ai ; (2) (2) (2) cj = cj − ai ; (1) else if nj−1 = bi for some i then (2) (2) (2) c j = c j + ai ; (3) (3) (3) c j = c j + ai ; (4) (4) (4) cj = cj − ai ; else (3) (3) (3) cj = cj − ai ; (4) (4) (4) c j = c j + ai ; end if end for 76 Thus, by construction we would have: (1) (2) (3) (4) cj+1 + cj+1 nj+1 + cj+1 nj+1 + cj+1 (1) (1) (2) (2) (3) (4) = cj + ai + (cj − ai )nj+1 + cj nj+1 + cj (1) (2) (3) (4) (1) (2) = (cj + cj nj+1 + cj nj+1 + cj ) + ai − ai nj+1 X (2) = Vk (nj+1 , bk ) − ai nj+1 , k6=i where the last equality holds since the definition of Vk (r, bk ) does not change for r ∈ [nj , nj+1 ] and k 6= i. Finally, since nj was a boundary point, the definition of Vi (r, bi ) (1) (2) must change from −ai to −ai r, thus the last equation is indeed equal to F (nj+1 ). A (1) (1) similar argument can be given if nj = bi or nj = (1 + η)bi . We proceed to analyze the complexity of the algorithm: sorting the set N can be performed in O(m log m) and each iteration takes only constant time. Thus, the evaluation of all points can be achieved in linear time and, clearly, the minimum can then also be obtained in linear time. Therefore, the overall time complexity of the algorithm is in O(m log m). The algorithm just proposed can be straightforwardly extended to solve the minimization of F over a set of r-values bounded by Λ, that is {r : 0 ≤ r ≤ Λ}. Indeed, we (1) (1) need only compute F (bi ) for i ∈ [1, m] such that bi < Λ and of course also F (Λ), thus the computational complexity in this regularized case remains in O(m log m). 3.5.2 General Case We now present our main algorithm for revenue optimization in the presence of features. This problem presents new challenges characteristic of non-convex optimization problems in higher dimensions. Therefore, our proposed algorithm can only guarantee convergence to a local minimum. Nevertheless, we provide a simple method for cycling through these local minima with the guarantee of reducing the objective function at each time. We consider the case of a hypothesis set H ⊂ RN of linear functions x 7→ w · x with bounded norm, kwk ≤ Λ, for some Λ ≥ 0. This can be immediately generalized to non-linear hypotheses by using a positive definite kernel. 
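For concreteness, the following Python sketch implements the boundary-point sweep just described. It is a minimal rendering of Algorithm 2 in our own encoding: the function name, the per-sample parameter tuples and the event bookkeeping are ours and are not part of the original pseudocode. The sketch maintains the constant and linear coefficients of F on the current interval and evaluates F at every boundary point.

```python
def minimize_sum_of_v_functions(params):
    """Minimize F(r) = sum_i V_i(r) over r >= 0 in O(m log m) by sweeping over the
    sorted boundary points, in the spirit of Algorithm 2.  Each element of `params`
    is a tuple (a1, a2, a3, a4, b2, b1, top) describing one v-function
        V(r) = -a1 * 1[r <= b2] - a2 * r * 1[b2 < r <= b1]
               + (a3 * r - a4) * 1[b1 < r < top],     with top = (1 + eta) * b1.
    """
    const, lin = 0.0, 0.0              # F(r) = const + lin * r on the current interval
    events = []                        # (coordinate, order, d_const, d_lin)
    for a1, a2, a3, a4, b2, b1, top in params:
        const -= a1                                # initial piece: V(r) = -a1 for r <= b2
        events.append((b2, 1, a1, -a2))            # for r > b2:   piece -a2 * r
        events.append((b1, 1, -a4, a2 + a3))       # for r > b1:   piece a3 * r - a4
        events.append((top, 0, a4, -a3))           # for r >= top: piece 0
    best_r, best_val = 0.0, const                  # F(0) = -sum_i a1
    for r, order, d_const, d_lin in sorted(events):
        if order == 0:                             # update takes effect at r itself
            const, lin = const + d_const, lin + d_lin
        val = const + lin * r                      # F evaluated at this boundary point
        if val < best_val:
            best_r, best_val = r, val
        if order == 1:                             # update takes effect only for r' > r
            const, lin = const + d_const, lin + d_lin
    return best_r, best_val
```

For the loss Lγ, each observed bid pair contributes the tuple (b^(2), 1, 1/γ, (1+γ)b^(1)/γ, b^(2), b^(1), (1+γ)b^(1)).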
The results of Theorem 10 suggest seeking, for P a fixed γ ≥ 0, the vector w solution to the following optimization problem: minkwk≤Λ m i=1 Lγ (w · xi , bi ). Replacing the original loss L with Lγ helped us remove the discontinuity of the loss. But, we still face an optimization problem based on a sum of non-convex functions. This problem can be formulated as a DC-programming (difference of convex functions programming) problem which is a well studied problem in non-convex optimization. Indeed, Lγ can 77 be decomposed as follows for all (r, b) ∈ R × B: Lγ (r, b) = u(r, b) − v(r, b), with the convex functions u and v defined by u(r, b) = −r1r<b(1) + r−(1+γ)b(1) 1r≥b(1) γ v(r, b) = (−r + b(2) )1r<b(2) + r−(1+γ)b(1) 1r>(1+γ)b(1) . γ Using the decomposition Lγ = u − v, our optimization problem can be formulated as follows: min U (w) − V (w) subject to kwk ≤ Λ, (3.12) w∈RN P Pm where U (w) = m i=1 u(w · xi , bi ) and V (w) = i=1 v(w · xi , bi ), which shows that it can be formulated as a DC-programming problem. The global minimum of the optimization problem (3.12) can be found using a cutting plane method (Horst and Thoai, 1999), but that method only converges in the limit and does not admit known algorithmic convergence guarantees.4 There exists also a branch-and-bound algorithm with exponential convergence for DC-programming (Horst and Thoai, 1999) for finding the global minimum. Nevertheless, in (Tao and An, 1997), it is pointed out that such combinatorial algorithms fail to solve real-world DC-programs in high dimensions. In fact, our implementation of this algorithm shows that the convergence of the algorithm in practice is extremely slow for even moderately high-dimensional problems. Another attractive solution for finding the global solution of a DC-programming problem over a polyhedral convex set is the combinatorial solution of Tuy (1964). However, this method requires explicitly specifying the slope and offsets for the piecewise linear function corresponding to a sum of Lγ losses and incurs an exponential cost in time and space. An alternative consists of using the DC algorithm (DCA) , a primal-dual subdifferential method of Dinh Tao and Hoai An Tao and An (1998), (see also Tao and An (1997) for a good survey). This algorithm is applicable when u and v are proper lower semi-continuous convex functions as in our case. When v is differentiable, the DC algorithm coincides with the CCCP algorithm of Yuille and Rangarajan (2003), which has been used in several contexts in machine learning and analyzed by Sriperumbudur and Lanckriet (2012). The general proof of convergence of the DC algorithm was given by Tao and An (1998). In some special cases, the DC algorithm can be used to find the global minimum of the problem as in the trust region problem (Tao and An, 1998), but, in general, the DC algorithm or its special case CCCP are only guaranteed to converge to a critical point (Tao and An, 1998; Sriperumbudur and Lanckriet, 2012). Nevertheless, the number of iterations of the DC algorithm is relatively small. Its convergence has been shown to be in fact linear for DC-programming problems such as ours (Yen et al., 2012). The algorithm we are proposing goes one step further than that of Tao and An (1998): we 4 Some claims of Horst and Thoai (1999), e.g., Proposition 4.4 used in support of the cutting plane algorithm, are incorrect (Tuy, 2002). 78 use DCA to find a local minimum but then restart our algorithm with a new seed that is guaranteed to reduce the objective function. 
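As a quick numerical sanity check of this decomposition (with arbitrary values of b^(1), b^(2) and γ of our own choosing), the following sketch evaluates u, v and Lγ on a grid and verifies that Lγ = u − v:

```python
import numpy as np

def u(r, b1, gamma):
    # convex part: -r for r < b1, (r - (1+gamma)*b1)/gamma for r >= b1
    return np.where(r < b1, -r, (r - (1 + gamma) * b1) / gamma)

def v(r, b1, b2, gamma):
    # convex part subtracted: (-r + b2) for r < b2, 0 in between,
    # (r - (1+gamma)*b1)/gamma for r > (1+gamma)*b1
    out = np.zeros_like(r)
    out = np.where(r < b2, -r + b2, out)
    return np.where(r > (1 + gamma) * b1, (r - (1 + gamma) * b1) / gamma, out)

def l_gamma(r, b1, b2, gamma):
    # L_gamma(r, b) = -b2 for r < b2, -r for b2 <= r <= b1,
    # (r - (1+gamma)*b1)/gamma for b1 < r < (1+gamma)*b1, and 0 afterwards
    out = np.where(r < b2, -b2, np.where(r <= b1, -r, 0.0))
    mid = (r > b1) & (r < (1 + gamma) * b1)
    return np.where(mid, (r - (1 + gamma) * b1) / gamma, out)

r = np.linspace(0.0, 3.0, 1001)
b1, b2, gamma = 1.5, 0.7, 0.1
assert np.allclose(l_gamma(r, b1, b2, gamma), u(r, b1, gamma) - v(r, b1, b2, gamma))
```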
Unfortunately, we are not in the same regime as in the trust region problem of Tao and An (1998) where the number of local minima is linear in the size of the input. Indeed, here, the number of local minima can be exponential in the number of dimensions of the feature space and it is not clear to us how the combinatorial structure of the problem could help us rule out some local minima faster and make the optimization more tractable. In the following, we describe more in detail the solution we propose for solving the DC-programming problem (3.12). The functions v and V are not differentiable in our context but they admit a sub-gradient at all points. We will denote by δV (w) an arbitrary element of the sub-gradient ∂V (w), which coincides with ∇V (w) at points w where V is differentiable. The DC algorithm then coincides with CCCP, modulo the replacement of the gradient of V by δV (w). It consists of starting with a weight vector w0 ≤ Λ and of iteratively solving a sequence of convex optimization problems obtained by replacing V with its linear approximation giving wt as a function of wt−1 , for t = 1, . . . , T : wt ∈ argminkwk≤Λ U (w) − δV (wt−1 ) · w. This problem can be rewritten in our context as the following: min kwk2 ≤Λ2 ,s m X i=1 si − δV (wt−1 ) · w (3.13) i h 1 (1) subject to (si ≥ −w · xi )∧ si ≥ w · xi −(1 + γ)bi . γ The problem is equivalent to a QP (quadratic-programming). Indeed, by convex duality, there exists a λ > 0 such that the above problem is equivalent to 2 min λkwk + w∈RN m X i=1 si − δV (wt−1 ) · w i h 1 (1) subject to (si ≥ −w · xi )∧ si ≥ w · xi −(1 + γ)bi γ which is a simple QP that can be tackled by one of many off-the-shelf QP solvers. Of course, the value of λ as a function of Λ does not admit a simple expression. Instead, we select λ through validation which is then equivalent to choosing the optimal value of Λ through validation. We now address the problem of the DC algorithm converging to a local minimum. A common practice is to restart the DC algorithm at a new random point. Instead, we propose an algorithm that iterates along different local minima, with the guarantee of reducing the function at every change of local minimum. The algorithm is simple and is based on the observation that the function Lγ is positive homogeneous. Indeed, for any 79 DC Algorithm w ← w0 . initialization while v 6= w do v ← DCA(w) . DC algorithm v u ← kvk P η ∗ ← min0≤η≤Λ u·xi >0 Lγ (ηu · xi , bi ) w ← η∗v end while Figure 3.5: Pseudocode of our DC-programming algorithm. η > 0 and (r, b), (2) Lγ (ηr, ηb) = −ηb 1ηr<ηb(2) − ηr1ηb(2) ≤ηr≤ηb(1) ηr−(1+γ)ηb(1) + 1ηb(1) <ηr<η(1+γ)b(1) γ = ηLγ (r, b). Minimizing the objective function ofP (3.12) in a fixed direction u, kuk = 1, can be m reformulated as follows: min0≤η≤Λ i=1 Lγ (ηu · xi , bi ). Since for u · xi ≤ 0 the (2) function η 7→ Lγ (ηu · xi , bi ) is constant and equal to −bi , the problem is equivalent to solving X min Lγ (ηu · xi , bi ). 0≤η≤Λ u·xi >0 Furthermore, since Lγ is positive homogeneous, for all i ∈ [1, m] with u·xi > 0, Lγ (ηu· xi , bi ) = (u · xi )Lγ (η, bi /(u · xi )). But η 7→ (u · xi )Lγ (η, bi /(u · xi )) is a v-function and thus the problem can efficiently be optimized using the combinatorial algorithm for the no-feature case (Section 3.5.1). This leads to the optimization algorithm described in Figure 3.5. The last step of each iteration of our algorithm can be viewed as a line search and this is in fact the step that reduces the objective function the most in practice. 
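The following Python sketch summarizes the outer loop of Figure 3.5. The callables `dca` (a CCCP-style solver for the convex subproblem (3.13)) and `line_search_1d` (the O(m log m) minimizer of Section 3.5.1 applied along a fixed direction) are placeholders of our own, not actual implementations:

```python
import numpy as np

def dc_reserve_optimization(dim, dca, line_search_1d, max_iter=20):
    """Structural sketch of the DC algorithm in Figure 3.5.
    `dca(w)` is assumed to return a local minimum of U - V starting from w;
    `line_search_1d(u)` is assumed to return
        argmin_{0 <= eta <= Lambda} sum_{i: u.x_i > 0} L_gamma(eta * u.x_i, b_i),
    computed exactly with the combinatorial algorithm of Section 3.5.1."""
    w = np.zeros(dim)
    for _ in range(max_iter):
        v = dca(w)                              # local minimum found by DCA
        u = v / (np.linalg.norm(v) + 1e-12)     # fixed direction for the line search
        eta_star = line_search_1d(u)            # exact, thanks to positive homogeneity
        w_next = eta_star * u
        if np.allclose(w_next, v):              # line search cannot improve: stop
            return w_next
        w = w_next                              # restart DCA from the improved point
    return w
```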
This is because we are then precisely minimizing the objective function even though this is for some fixed direction. Since in general this line search does not find a local minimum (we are likely to decrease the objective value in other directions that are not the one in which the line search was performed) running DCA helps us find a better direction for the next iteration of the line search. 3.6 Experiments In this section, we report the results of several experiments with synthetic and real-world data demonstrating the benefits of our algorithm. Since the use of features for reserve price optimization has not been previously studied in the literature, we are not aware of any baseline for comparison with our algorithm. Therefore, its performance is measured 80 against three natural strategies that we now describe. As mentioned before, a standard solution for solving this problem would be the use of a convex surrogate loss. In view of that, we compare against the solution of the regularized empirical risk minimization of the convex surrogate loss Lα shown in Figure 3.2(a) parametrized by α ∈ [0, 1] and defined by ( −r if r < b(1) + α(b(2) − b(1) ) Lα (r, b) = (1−α)b(1) +αb(2) (r − b(1) ) otherwise. α(b(1) −b(2) ) A second alternative consists of using ridge regression to estimate the first bid and of using its prediction as the reserve price. A third algorithm consists of minimizing Pn the loss while ignoring the feature vectors xi , i.e., solving the problem minr≤Λ i=1 L(r, bi ). It is worth mentioning that this third approach is very similar to what advertisement exchanges currently use to suggest reserve prices to publishers. By using the empirical version of equation (3.2), we see that this algorithm is equivalent to finding the empirical distribution of bids and optimizing the expected revenue with respect to this empirical distribution as in (Ostrovsky and Schwarz, 2011) and (Cesa-Bianchi et al., 2013). 3.6.1 Artificial Data Sets We generated 4 different synthetic data sets with different correlation levels between features and bids. For all our experiments, the feature vectors x ∈ R21 were generated in as follows: x̃ ∈ R20 was sampled from a standard Gaussian distribution and x = (x̃, 1) was created by adding an offset feature. We now describe the bid generating process for each of the experiments as a function of the feature vector x. For our first three experiments, in Figure 3.6(a)-(c), the highest bidand second highest bid were shown P21 xi P21 P21 xi P21 and min set to max i=1 2 + i=1 2 + 2 i=1 xi + 1 , i=1 xi + 1 , + 2 respectively, where i is a Gaussian random variable with mean 0. The standard + deviation of the Gaussian noise was varied over the set {0, 0.25, 0.5}. For our last artificial experiment, we used a generative model motivated by previous empirical observations (Ostrovsky and Schwarz, 2011; Lahaie and Pennock, 2007): bids were generated by sampling two values from a log-normal distribution with means x · w and x·w and standard deviation 0.5, with w a random vector sampled from a standard 2 Gaussian distribution. For all our experiments, the parameters λ, γ and α were selected respectively from the sets {2i |i ∈ [−5, 5]}, {0.1, 0.01, 0.001}, and {0.1, 0.2, . . . , 0.9} via validation over a set consisting of the same number of examples as the training set. Our algorithm was initialized using the best solution of the convex surrogate optimization problem. The test set consisted of 5,000 examples drawn from the same distribution as the training set. 
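As an illustration of the last generative model, a minimal sketch of the data-generating process is given below; interpreting the stated means as the location parameters of the log-normal distribution is our assumption, as is the function name:

```python
import numpy as np

def generate_lognormal_auction_sample(n, d=20, sigma=0.5, seed=0):
    """Standard Gaussian features with an appended offset feature; two bids drawn
    from log-normal distributions with location parameters x.w and x.w/2."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d + 1)
    X = np.hstack([rng.standard_normal((n, d)), np.ones((n, 1))])   # offset feature
    loc = X @ w
    draws = np.stack([rng.lognormal(mean=loc, sigma=sigma, size=n),
                      rng.lognormal(mean=loc / 2, sigma=sigma, size=n)])
    b1, b2 = draws.max(axis=0), draws.min(axis=0)    # highest and second-highest bids
    return X, b1, b2
```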
Each experiment was repeated 10 times and the mean revenue of each algorithm is 81 shown in Figure 3.6. The plots are normalized in such a way that the revenue obtained by setting no reserve price is equal to 0 and the maximum possible revenue (which can be obtained by setting the reserve price equal to the highest bid) is equal to 1. The performance of the ridge regression algorithm is not included in Figure 3.6(d) as it was too inferior to be comparable with the performance of the other algorithms. By inspecting the results in Figure 3.6(a), we see that, even in the simplest noiseless scenario, our algorithm outperforms all other techniques. The reader could argue that these results are, in fact, not surprising since the bids were generated by a locally linear function of the feature vectors, thereby ensuring the success of our algorithm. Nevertheless, one would expect this to be the case too for algorithms that leverage the use of features such as the convex surrogate and ridge regression. But one can see that this is in fact not true even for low levels of noise. It is also worth noticing that the use of ridge regression is actually worse than setting the reserve price to 0. This fact can be easily understood by noticing that the square loss used in regression is symmetric. Therefore, we can expect several reserve prices to be above the highest bid, making the revenue of these auctions equal to zero. Another notable feature is that as the noise level increases, the performance of feature-based algorithms decreases. This is true for any learning algorithm: if the features are not relevant to the prediction task, the performance of the algorithm will suffer. However, for the convex surrogate algorithm, a more critical issue occurs: the performance of this algorithm actually decreases as the sample size increases, which shows that in general learning with a convex surrogate is not possible. This is an empirical verification of the inconsistency result provided in Section 3.4.2. This lack of calibration can also be seen in Figure 3.6(d), where in fact the performance of this algorithm approaches the use of no reserve price. In order to better understand the reason behind the performance discrepancy between feature-based algorithms, we analyze the reserve prices offered by each algorithm. In Figure 3.7 we see that the convex surrogate algorithm tends to offer lower reserve prices. This should be intuitively clear as high reserve prices are over-penalized by the chosen convex surrogate as shown in Figure 3.2(b). On the other hand, reserve prices suggested by the regression algorithm seem to be concentrated and symmetric around their mean. Therefore we can infer that about 50% of the reserve prices offered will be higher than the highest bid thereby yielding zero revenue. Finally, our algorithm seems to generally offer higher prices. This suggests that the increase in revenue comes from auctions where the highest bid is large but the second bid is small. This bidding phenomenon is in fact commonly observed in practice (Amin et al., 2013). 3.6.2 Real-world Data Sets Due to proprietary data and confidentiality reasons, we cannot present empirical results for AdExchange data. However, we were able to procure an eBay data set consisting of approximately 70,000 second-price auctions of collector sport cards. 
[Figure 3.6: Plots of expected revenue against sample size for different algorithms: DC algorithm (DC), convex surrogate (CVX), ridge regression (Reg) and the algorithm that uses no features to set reserve prices (NF). For (a)-(c), bids are generated with noise standard deviation (a) 0, (b) 0.25 and (c) 0.5. The bids in (d) were generated using a generative model.]

The full data set can be accessed using the following URL: http://cims.nyu.edu/~munoz/data. Some other sources of auction data are accessible (e.g., http://modelingonlineauctions.com/datasets), but features are not available for those data sets. To the best of our knowledge, with the exception of the one used here, there is no publicly available data set for online auctions including features that could be readily used with our algorithm. The features used here include information about the seller, such as positive feedback percent, seller rating and seller country, as well as information about the card, such as whether the player is in the sport's Hall of Fame. The final dimension of the feature vectors is 78. The values of these features are both continuous and categorical. For our experiments, we also included an extra offset feature. Since the highest bid is not reported by eBay, our algorithm cannot be straightforwardly used on this data set. In order to generate highest bids, we calculated the mean price of each object (each card was generally sold more than once) and set the highest bid to be the maximum of this average and the second highest bid.

Figure 3.8 shows the revenue obtained using different algorithms, including our DC algorithm, the convex surrogate, and the algorithm that ignores features. It also shows the results obtained by using no reserve price (NR) and the highest possible revenue obtained by setting the reserve price equal to the highest bid (HB). We randomly sampled 2,000 examples for training, 2,000 examples for validation and 2,000 examples for testing. This experiment was repeated 10 times. Figure 3.8(b) shows the mean revenue for each algorithm and their standard deviations.

[Figure 3.7: Distribution of reserve prices for each algorithm. The algorithms were trained on 800 samples using noisy bids with standard deviation 0.5.]

[Figure 3.8: Results of the eBay data set. Comparison of our algorithm (DC) against a convex surrogate (CVX), using no features (NF), setting no reserve (NR) and setting the reserve price to the highest bid (HB).]

The results of this experiment show that the use of features is crucial for revenue optimization. Indeed, setting a single optimal reserve price for all objects achieves essentially the same revenue as setting no reserve price. Instead, our algorithm achieves a 22% increase over the revenue obtained by not setting a reserve price, whereas the non-calibrated convex surrogate algorithm only obtains a 3% revenue improvement.
Furthermore, our algorithm is able to obtain as much as 70% of the achievable revenue with knowledge of the highest bid. 3.7 Conclusion We presented a comprehensive theoretical and algorithmic analysis of the learning problem of revenue optimization in second-price auctions with reserve. The specific properties of the loss function for this problem required a new analysis and new learning guarantees. The algorithmic solutions we presented are practically applicable to revenue optimization problems for this type of auctions in most realistic settings. Our 84 experimental results further demonstrate their effectiveness. Much of the analysis and algorithms presented, in particular our study of calibration questions, can also be of interest in other learning problems. In particular, as we will see on Chapter 4 they are relevant to the study of learning problems arising in the study of generalized secondprice auctions. 85 Chapter 4 Generalized Second-Price Auctions We present an extensive analysis of the key problem of learning optimal reserve prices for generalized second-price auctions. We describe two algorithms for this task: one based on density estimation, and a novel algorithm benefiting from solid theoretical guarantees and with a very favorable running-time complexity of O(nS log(nS)), where n is the sample size and S the number of slots. Our theoretical guarantees are more favorable than those previously presented in the literature. Additionally, we show that even if bidders do not play at an equilibrium, our second algorithm is still well defined and minimizes a quantity of interest. To our knowledge, this is the first attempt to apply learning algorithms to the problem of reserve price optimization in GSP auctions. Finally, we present the first convergence analysis of empirical equilibrium bidding functions to the unique Bayesian-Nash equilibrium of a GSP. 4.1 Introduction The Generalized Second-Price (GSP) auction is currently the standard mechanism used for selling sponsored search advertisement. As suggested by the name, this mechanism generalizes the standard second-price auction of Vickrey (1961) to multiple items. In the case of sponsored search advertisement, these items correspond to ad slots which have been ranked by their position. Given this ranking, the GSP auction works as follows: first, each advertiser places a bid; next, the seller, based on the bids placed, assigns a score to each bidder. The highest scored advertiser is assigned to the slot in the best position, that is, the one with the highest likelihood of being clicked on. The secondhighest score obtains the second best item and so on, until all slots have been allocated or all advertisers have been assigned to a slot. As with second-price auctions, the bidder’s payment is independent of his bid. Instead, it depends solely on the bid of the advertiser assigned to the position below. In spite of its similarity with second-price auctions, the GSP auction is not an incentive-compatible mechanism, that is, bidders have an incentive to lie about their 86 valuations. This is in stark contrast with second-price auctions where truth revealing is in fact a dominant strategy. It is for this reason that predicting the behavior of bidders in a GSP auction is challenging. This is further worsened by the fact that these auctions are repeated multiple times a day. The study of all possible equilibria of this repeated game is at the very least difficult. 
While incentive compatible generalizations of the secondprice auction exist, namely the Vickrey-Clark-Gloves (VCG) mechanism, the simplicity of the payment rule for GSP auctions as well as the large revenue generated by them has made the adoption of VCG mechanisms unlikely. Since its introduction by Google, GSP auctions have generated billions of dollars across different online advertisement companies. It is therefore not surprising that it has become a topic of great interest for diverse fields such as Economics, Algorithmic Game Theory and more recently Machine Learning. The first analysis of GSP auctions was carried out independently by Edelman et al. (2007) and Varian (2007). Both publications considered a full information scenario, that is one where the advertisers’ valuations are publicly known. This assumption is weakly supported by the fact that repeated interactions allow advertisers to infer their adversaries’ valuations. (Varian, 2007) studied the so-called Symmetric Nash Equilibria (SNE) which is a subset of the Nash equilibria with several favorable properties. In particular, Varian showed that any SNE induces an efficient allocation, that is an allocation where the highest positions are assigned to advertisers with high values. Furthermore, the revenue achieved by the seller when advertisers play an SNE is always at least that of the one obtained by VCG. The authors also presented some empirical results showing that some bidders indeed play by using an SNE. However, no theoretical justification can be given for the choice of this subset of equilibria (Börgers et al., 2013; Edelman and Schwarz, 2010). A finer analysis of the full information scenario was given by Lucier et al. (2012). The authors proved that, excluding the payment of the highest bidder, the revenue achieved at any Nash equilibrium is at least one half that of the VCG auction. Since the assumption of full information can be unrealistic, a more modern line of research has instead considered a Bayesian scenario for this auction. In a Bayesian setting, it is assumed that advertisers’ valuations are i.i.d. samples drawn from a common distribution. Gomes and Sweeney (2014) characterized all symmetric Bayes-Nash equilibria and showed that any symmetric equilibrium must be efficient. This work was later extended by Sun et al. (2014) to account for the quality score of each advertiser. The main contribution of this work was the design of an algorithm for the crucial problem of revenue optimization for the GSP auction. Lahaie and Pennock (2007) studied different squashing ranking rules for advertisers commonly used in practice and showed that none of these rules are necessarily optimal in equilibrium. Lucier et al. (2012) showed that the GSP auction with an optimal reserve price achieves at least 1/6 of the optimal revenue (of any auction) in a Bayesian equilibrium. Most recently, Thompson and Leyton-Brown (2013) compared different allocation rules and showed that an anchoring allocation rule is optimal when valuations are sampled i.i.d. from a uniform distribution. 87 With the exception of (Sun et al., 2014), none of these authors has proposed an algorithm for revenue optimization using historical data. Zhu et al. (2009) introduced a ranking algorithm to learn an optimal allocation rule. The proposed ranking is a convex combination of a quality score based on the features of the advertisement as well as a revenue score which depends on the value of the bids. 
This work was later extended in (He et al., 2014) where, in addition to the ranking function, a behavioral model of the advertisers is learned by the authors. The rest of this chapter is organized as follows. In Section 4.2, we give a learning formulation of the problem of selecting reserve prices in a GSP auction. In Section 4.3, we discuss previous work related to this problem. Next, we present and analyze two learning algorithms for this problem in Section 4.4, one based on density estimation extending to this setting an algorithm of Guerre et al. (2000), and a novel discriminative algorithm taking into account the loss function and benefiting from favorable learning guarantees. Section 4.5 provides a convergence analysis of the empirical equilibrium bidding function to the true equilibrium bidding function in a GSP. On its own, this result is of great interest as it justifies the common assumption of buyers playing a symmetric Bayes-Nash equilibrium. Finally, in Section 4.6, we report the results of experiments comparing our algorithms and demonstrating in particular the benefits of the second algorithm. 4.2 Model For the most part, we will use the model defined by Sun et al. (2014) for GSP auctions with incomplete information. We consider N bidders competing for S slots with N ≥ S. Let vi ∈ [0, 1] and bi ∈ [0, 1] denote the per-click valuation of bidder i and his bid respectively. Let the position factor cs ∈ [0, 1] represent the probability of a user noticing an ad in position s and let ei ∈ [0, 1] denote the expected click-through rate of advertiser i. That is ei is the probability of ad i being clicked on given that it was noticed by the user. We will adopt the common assumption that cs > cs+1 (Gomes and Sweeney, 2014; Lahaie and Pennock, 2007; Sun et al., 2014; Thompson and LeytonBrown, 2013). Define the score of bidder i to be si = ei vi . Following (Sun et al., 2014), we assume that si is an i.i.d. realization of a random variable with distribution F and density function f . Finally, we assume that advertisers bid in an efficient symmetric Bayes-Nash equilibrium. This is motivated by the fact that even though advertisers may not infer what the valuation of their adversaries is from repeated interactions, they can certainly estimate the distribution F . Define π : s 7→ π(s) as the function mapping slots to advertisers, i.e. π(s) = i if advertiser i is allocated to position s. For a vector x = (x1 , . . . , xN ) ∈ RN , we use the notation x(s) := xπ(s) . Finally, denote by ri the reserve price for advertiser i. An advertiser may participate in the auction only if bi ≥ ri . In this chapter we present an 88 analysis of the two most common ranking rules (Qin et al., 2014): 1. Rank-by-bid. Advertisers who bid above their reserve price are ranked in descending order of their bids and the payment of advertiser π(s) is equal to max(r(s), b(s+1) ). 2. Rank-by-revenue. Each advertiser is assigned a quality score qi := qi (bi ) = ei bi 1bi ≥ri and the ranking is done by sorting these scores in descending order. The payment (s+1) of advertiser π(s) is given by max r(s) , q e(s) . In both setups, only advertisers bidding above their reserve price are considered. Notice that rank-by-bid is a particular case of rank-by-revenue where all click-through rates ei are equal to 1. 
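The following sketch illustrates the two ranking rules and their payments; the function name, the array layout and the convention that non-participating advertisers receive a zero score are ours:

```python
import numpy as np

def gsp_allocation_and_payments(b, e, r, S, rank_by_revenue=True):
    """b, e, r: per-advertiser bids, click-through rates and reserve prices (arrays).
    Returns the advertisers assigned to the S slots (best first) and their
    per-click payments under rank-by-bid or rank-by-revenue."""
    e = e if rank_by_revenue else np.ones_like(b)          # rank-by-bid: e_i = 1
    q = e * b * (b >= r)                                   # scores q_i = e_i b_i 1[b_i >= r_i]
    order = [i for i in np.argsort(-q) if q[i] > 0][:S]    # participating advertisers only
    payments = []
    for s, i in enumerate(order):
        q_next = q[order[s + 1]] if s + 1 < len(order) else 0.0
        payments.append(max(r[i], q_next / e[i]))          # max(reserve, next score / own ctr)
    return order, payments
```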
Given a vector of reserve prices r and a bid vector b, we define the revenue function to be S q (s+1) X (s) 1 + r 1 Rev(r, b) = cs (s+1) (s) (s) (s+1) (s) (s) (s) ≥e r q <e r ≤q e(s) q s=1 Using the notation of (Mohri and Medina, 2014), we define the loss function L(r, b) = −Rev(r, b). Given an i.i.d. sample S = (b1 , . . . , bn ) of realizations of an auction, our objective will be to find a reserve price vector r∗ that maximizes the expected revenue. Equivalently, r∗ should be a solution to the following optimization problem: min E[L(r, b)]. r∈[0,1]N b 4.3 (4.1) Previous Work It has been shown, both theoretically and empirically, that reserve prices can increase the revenue of an auction (Myerson, 1981; Ostrovsky and Schwarz, 2011). The choice of an appropriate reserve price therefore becomes crucial. If it is chosen too low, the seller might lose some revenue. On the other hand, if it set too high, then the advertisers may not wish to bid above that value and the seller will not obtain any revenue from the auction. Mohri and Medina (2014) gave a learning algorithm using historical data to estimate the optimal reserve price for a second-price auction in a very general setting. An extension of this work to the GSP auction is not straightforward. Indeed, as we will show later, the optimal reserve price vector depends on the distribution of the advertisers’ valuation. In a second-price auction, these valuations are observed since it is an incentive-compatible mechanism. This does not hold for GSP auctions. Moreover, in (Mohri and Medina, 2014), only one reserve price had to be estimated. In contrast, our model requires the estimation of up to N parameters with intricate dependencies between them. 89 The problem of estimating valuations from observed bids in a non-incentive compatible mechanism has been previously analyzed. Guerre et al. (2000) described a way of estimating valuations from observed bids in a first-price auction. We will show that this method can be extended to the GSP auction. The rate of convergence of this algorithm, 1 √ however, in general will be worse than the standard learning rate of O n . Sun et al. (2014) showed that, for advertisers playing an efficient equilibrium, the optimal reserve price is given by ri = eri where r satisfies r= 1 − F (r) . f (r) The authors suggest learning r via a maximum likelihood technique over some parametric family to estimate f and F , and use these estimates in the above expression. There are two main drawbacks for this algorithm. The first is a standard problem of parametric statistics: there are no guarantees on the convergence of their estimation procedure when the density function f is not part of the parametric family considered. While this problem can be addressed by the use of a non-parametric estimation algorithm such as kernel density estimation, the fact remains that the function f is the density for the unobservable scores si and therefore cannot be properly estimated. The solution proposed by the authors assumes that the bids in fact form a perfect SNE and so advertisers valuations may be recovered using the process described in (Varian, 2007). There is however no justification for this assumption and, in fact, we show in Section 4.6 that bids played in a Bayes-Nash equilibrium do not in general form a SNE. 4.4 Learning Algorithms Here, we present and analyze two algorithms for learning the optimal reserve price for a GSP auction when advertisers play a symmetric equilibrium. 
4.4.1 Density estimation algorithm First, we derive an extension of the algorithm of (Guerre et al., 2000) to GSP auctions. To do so, we first derive a formula for the bidding strategy at equilibrium. Let zs (v) denote the probability of winning position s given that the advertiser’s valuation is v. It is not hard to verify that N −1 zs (v) = (1 − F (v))s−1 F p (v), s−1 where p = N − s. Indeed, in an efficient equilibrium, the bidder with the s-th highest valuation must be assigned to the s-th highest position. Therefore an advertiser with 90 valuation v is assigned to position s if and only if s − 1 bidders have a higher valuation and p have a lower valuation. For a rank-by-bid auction, Gomes and Sweeney (2014) showed the following results. Theorem 12 (Gomes and Sweeney (2014)). A GSP auction has a unique efficient symmetric Bayes-Nash equilibrium with bidding strategy β(v) if and only if β(v) is strictly increasing and satisfies the following integral equation S X s=1 cs Z 0 v Z v S X N −1 dzs (t) s−1 β(t)pF p−1 (t)f (t)dt. (4.2) tdt = cs (1 − F (v)) dt s − 1 0 s=1 Furthermore, the optimal reserve price r∗ satisfies r∗ = 1 − F (r∗ ) . f (r∗ ) (4.3) The authors show that, if the click probabilities cs are sufficiently diverse, then, β is guaranteed to be strictly increasing. When ranking is done by revenue, Sun et al. (2014) gave the following theorem. Theorem 13 (Sun et al. (2014)). Let β be defined by the previous theorem. If advertisers i) . Moreover, the optimal reserve price bid in a Bayes-Nash equilibrium then bi = β(v ei r ∗ ∗ vector r is given by ri = ei where r satisfies equation (4.3). We are now able to present the foundation of our first algorithm. Instead of assuming that the bids constitute an SNE we follow the ideas of (Guerre et al., 2000) and infer the scores si only from observables bi . Our result is presented for the rank-by-bid GSP auction but an extension to the rank-by-revenue mechanism is trivial. Lemma 5. Let v1 , . . . , vn be an i.i.d. sample of valuations with distribution F and let bi = β(vi ) be the bid played at equilibrium. Then the random variables bi are i.i.d. with −1 (b)) . Furthermore, distribution G(b) = F (β −1 (b)) and density g(b) = βf0(β (β −1 (b)) (1 − G(bi ))s−1 bi pG(bi )p−1 g(bi ) vi = β −1 (bi ) = (4.4) PS N−1 dz s=1 cs s−1 db (bi ) R bi PS s−2 c (s − 1)(1 − G(b )) g(b ) pG(u)p−1 ug(u)du s i i 0 , − s=1 PS N−1 dz s=1 cs s−1 db (bi ) PS s=1 cs N−1 s−1 where z s (b) := zs (β −1 (b)) and is given by N−1 s−1 (1 − G(b))s−1 G(b)p−1 . Proof. By definition bi = β(vi ), is a function of only vi . Since β does not depend on the other samples either, it follows that (bi )N i=1 must be an i.i.d. sample. Using the fact that 91 β is a strictly increasing function we also have G(b) = P (bi ≤ b) = P (vi ≤ β −1 (b)) = F (β −1 (b)) and a simple application of the chain rule gives us the expression for the density g(b). To prove the second statement observe that by the change of variable v = β −1 (b), the right-hand side of (4.2) is equal to Z β −1 (b) S X N −1 s−1 (1 − G(b)) pβ(t)F p−1 (t)f (t)dt s−1 0 s=1 Z b S X N −1 s−1 puG(u)p−1 (u)g(u)du. = (1 − G(b)) s−1 0 s=1 The last equality follows by the change of variable t = β(u) and from the fact that −1 (b)) g(b) = βf0(β . The same change of variables applied to the left-hand side of (4.2) (β −1 (b)) yields the following integral equation: Z S X N −1 b dz (u)du s−1 du 0 s=1 Z b S X N −1 s−1 upG(u)p−1 (u)g(u)du. 
= (1 − G(b)) s−1 0 s=1 β −1 (u) Taking the derivative with respect to b of both sides of this equation and rearranging terms lead to the desired expression. The previous Lemma shows that we may recover the valuation of an advertiser from its bid. We therefore propose the following algorithm for estimating the value of r. 1. Assumin a symmetric Bayes-Nash equilibrium, use the sample S to estimate G and g. 2. Plug this estimates in (4.4) to obtain approximate samples from the distribution F . 3. Use the approximate samples to find estimates fb and Fb of the valuations density and cumulative distribution functions respectively. 4. Use Fb and fb to estimate r. In order to avoid the use of parametric methods, a kernel density estimation algorithm can be used to estimate g and f . While this algorithm addresses both drawbacks of the algorithm proposed by Sun et al. (2014), it can be shown (Guerre et al., 2000)[Theorem 2] that if f is R times continuously differentiable, then, after seeing n samples, kf − fbk∞ 1 independently of the algorithm used to estimate f . In particular, note is in Ω nR/(2R+3) 1 . This unfavorable rate of convergence can be that for R = 1 the rate is in Ω n1/4 attributed to the fact that a two-step estimation algorithm is being used (estimation of g and f ). But, even with access to bidder valuations, the rate can only be improved to 1 Ω nR/(2R+1) (Guerre et al., 2000). Furthermore, a small error in the estimation of f 92 affects the denominator of the equation defining r and can result in a large error on the estimate of r. 4.4.2 Discriminative algorithm In view of the problems associated with density estimation, we propose to use empirical risk minimization to find an approximation to the optimal reserve price. In particular, we are interested in solving the following optimization problem: min r∈[0,1]N n X L(r, bi ). (4.5) i=1 We first show that, when bidders play in equilibrium, the optimization problem (4.1) can be considerably simplified. Proposition 15. If advertisers play a symmetric Bayes-Nash equilibrium then e b)], min E[L(r, b)] = min E[L(r, r∈[0,1]N b r∈[0,1] b where qi := qi (bi ) = ei bi and e b) = − L(r, S X cs (s+1) q 1 + r1 q(s+1) ≥r q(s+1) <r≤q(s) . (s) e s=1 Proof. Since advertisers play a symmetric Bayes-Nash equilibrium, the optimal reserve price vector r∗ is of the form ri∗ = eri . Therefore, letting D = {r|ri = eri , r ∈ [0, 1]} we have minr∈[0,1]N Eb [L(r, b)] = minr∈D Eb [L(r, b)]. Furthermore, when restricted to D, the objective function L is given by S X cs (s+1) − q 1q(s+1) ≥r + r1q(s+1) <r≤q(s) . e(s) s=1 Thus, we are left with showing that replacing q (s) with q(s) in this expression does not affect its value. Let r ≥ 0, since qi = qi 1qi ≥r , in general the equality q (s) = q(s) does not hold. Nevertheless, if s0 denotes the largest index less than or equal to S satisfying q (s0 ) > 0, then q(s) ≥ r for all s ≤ s0 and q (s) = q(s) . On the other hand, for S ≥ s > s0 , 93 1q(s) ≥r = 1q(s) ≥r = 0. Thus, S X cs (s+1) q 1q(s+1) ≥r + r1q(s+1) <r≤q(s) e(s) s=1 s0 X cs (s+1) q 1 + r1 (s+1) (s+1) (s) q ≥r q <r≤q e(s) s=1 s0 X cs (s+1) = q 1 + r1 (s+1) (s+1) (s) q ≥r q <r≤q e(s) s=1 = e b), = −L(r, which completes the proof. In view of this proposition, we can replace the challenging problem of solving an optimization problem in RN with solving the following simpler empirical risk minimization problem min r∈[0,1] n X i=1 e bi ) = min L(r, r∈[0,1] n X S X Ls,i (r, q(s) , q(s+1) ), (4.6) i=1 s=1 (s+1) s 1q(s+1) ≥r − r1q(s+1) <r≤q(s) ). 
In order to effi(e qi where Ls,i (r, q(s) ), q(s+1) ) := − ec(s) i i i ciently minimize this highly non-convex function, we draw upon the work of the previous chapter on minimization of sums of v-functions. Definition 15. A function V : R3 → R is a v-function if it admits the following form: (1) (2) V (r, q1 , q2 ) = −a 1r≤q2 − a r1q2 <r≤q1 + 1 η r−a (3) 1q1 <r<(1+η)q1 , with 0 ≤ a(1) , a(2) , a(3) , η ≤ ∞ constants satisfying a(1) = a(2) q2 , −a(2) q1 1η>0 = 1 q − a(3) 1η>0 . Under the convention that 0 · ∞ = 0. η 1 As suggested by their name, these functions admit a characteristic “V shape”. It is (s+1) s s clear from Figure 4.1 that Ls,i is a v-function with a(1) = ec(s) qei , a(2) = ec(s) and η = 0. Thus, we can apply the optimization algorithm given on the previous chapter to minimize (4.6) in O(nS log nS) time. The adaptation of this general algorithm to our problem is presented in Algorithm 3. We conclude this section by presenting learning guarantees for our algorithm. Our bounds are given in terms of the Rademacher complexity and the VC-dimension. Definition 16. Let X be a set and let G := {g : X → R} be a family of functions. Given a sample S = (x1 , . . . , xn ) ∈ X , the empirical Rademacher complexity of G is 94 Algorithm 3 Minimization algorithm (s) Require: Scores (e qi ), i ∈ {1, . . . , n} and s ∈ {1, . . . , S}. (1) (2) (s) (s+1) Define (pis , pis ) = (e qi , qei ); Set m S = nS;S (1) (2) N := ni=1 Ss=1 {pis , pis }; (n1 , ..., n2m ) = Sort(N ); Set di := (d1 , d2 ) = 0 P P (2) Set d1 = − ni=1 Ss=1 cesi pis ; Set r∗ = −1 and L∗ = inf for j = 2, . . . , 2m do (2) if nj−1 = pis then (2) d1 = d1 + cesi pis d2 = d2 − cesi (1) else if nj−1 = pis then d2 = d2 + ecss end if L = d1 − nj ∗ d2 ; if L < L∗ then L∗ = L; r∗ = nj ; end if end for return r∗ ; defined by i 1 h 1X RS (G) = E sup σi g(xi ) , n σ g∈G n i=1 n where σi is a random variable distributed uniformly over the set {−1, 1}. P Proposition 16. Let m = mini ei > 0 and M = Ss=1 cs . Then, for any δ > 0, with probability at least 1 − δ over the draw of a sample S of size n, each of the following inequalities holds for all r ∈ [0, 1]: r r n 1 X M 1 log(en) M log(1/δ) e b)] ≤ e bi ) + √ + L(r, E[L(r, + (4.7) n i=1 m n mn n 95 - cs bi(s+1) bi(s+1) bi(s) Figure 4.1: Depiction of the loss Li,s . Notice that the loss in fact resembles a broken “V” and n 1Xe M 1 e √ + L(r, bi ) ≤ E[L(r, b)] + n i=1 m n r log(en) + n r M log(1/δ) . 2mn (4.8) P e e b)]. Let S i be a sample obtained Proof. Let Ψ : S 7→ supr∈[0,1] n1 ni=1 L(r, bi )−E[L(r, M from S by replacing bi with b0i . It is not hard to verify that |Ψ(S) − Ψ(S i )| ≤ nm . Thus, it follows from a standard learning bound that, with probability at least 1 − δ, r n X 1 e b)] ≤ e bi ) + RS (R) + M log(1/δ) . E[L(r, L(r, n i=1 2mn e b)|r ∈ [0, 1]}. We proceed to bound the empirical where R = {Lr : b 7→ L(r, Rademacher complexity of the class R. For q1 > q2 ≥ 0 let L(r, q1 , q2 ) = q2 1q2 >r + r1q1 ≥r≥q2 . By definition of Rademacher complexity we have RS (R) = n i X 1 h E sup σi Lr (bi ) n σ r∈[0,1] i=1 n S i X X cs 1 h (s) (s+1) L(r, qi , qi = E sup σi ) n σ r∈[0,1] i=1 s=1 es S n i X 1 hX (s) (s+1) ≤ E sup σi ψs (L(r, qi , qi )) , n σ s=1 r∈[0,1] i=1 where ψs is the cs -Lipschitz m function mapping x 7→ 96 cs x. e(s) Therefore, by Talagrand’s contraction lemma (Ledoux and Talagrand, 2011), the last term is bounded by n S S i X X X cs cs h (s) (s+1) e σi L(r, qi , qi E sup ) = RSs (R), σ nm m r∈[0,1] s=1 s=1 i=1 (s) (s+1) (s) (s+1) e := {L(r, ·, ·)|r ∈ [0, 1]}. 
The loss where Ss = (q1 , q1 ), . . . , (qn , qn ) and R (s) (s+1) L(r, q , q ) in fact evaluates to the negative revenue of a second-price auction with highest bid q(s) and second highest bid q(s+1) (Mohri and Medina, 2014). Therefore, by Propositions 9 and 10 of (Mohri and Medina, 2014) we can write n h i r 2 log en X 1 e ≤ E sup rσi + RSs (R) n σ r∈[0,1] i=1 n r 1 2 log en ≤ √ + . n n Corollary 8. Under the hypothesis of Proposition 16, let rb denote the empirical minimizer and r∗ the minimizer of the expected loss. With probability at least 1 − δ, we have r r M log(2/δ) M 1 log(en) e r, b)] − E[L(r e ∗ , b)] ≤ 2 √ + + . E[L(b 2mn m n n Proof. By the union bound, we (4.7) and (4.8) hold simultaneously with probability at least 1 − δ if δ is replaced by δ/2 in these equations. Adding both inequalities we obtain r n M log(2/δ) X 1 e r, b)] − E[L(r e ∗ , b)] ≤ e r, bi ) − L(r e ∗ , bi ) + 2 E[L(b L(b n i=1 2mn r M 1 log(en) √ + + m n n The result now follows by using the fact that rb is an empirical minimizer and therefore that the difference appearing on the right-hand side of this inequality is less than or equal to 0. It is worth noting that our algorithm is well defined whether or not the buyers bid in equilibrium. Indeed, the algorithm consists of the minimization over r of an observable quantity. While we can guarantee convergence to a solution of 4.1 only when buyers 97 play a symmetric BNE, our algorithm will still find an approximate solution to min Eb [L(r, b)], r∈[0,1] which remains a quantity of interest that can be close to 4.1 if buyers are close to the equilibrium. 4.5 Convergence of Empirical Equilibria A crucial assumption in the study of GSP auctions, including this work, is that advertisers bid in a Bayes-Nash equilibrium (Lucier et al., 2012; Sun et al., 2014). This assumption is partially justified by the fact that advertisers can infer the underlying distribution F using as observations the outcomes of the past repeated auctions and can thereby implement an efficient equilibrium. In this section, we provide a stronger theoretical justification in support of this assumption: we quantify the difference between the bidding function calculated using observed empirical distributions and the true symmetric bidding function in equilibria. For the sake of notation simplicity, we will consider only the rank-by-bid GSP auction. Let Sv = (v1 , . . . , vn ) be an i.i.d. sample of values drawn from a continuous distribution F with density function f . Assume without loss of generality that v1 ≤ . . . ≤ vn and let v denote the vector defined by vi = vi . Let Fb denote the empirical distribution function induced by Sv and let F ∈ Rn and G ∈ Rn be defined by Fi = Fb(vi ) = i/n and Gi = 1 − Fi . We consider a discrete GSP auction where the advertiser’s valuations are i.i.d. samples drawn from a distribution Fb. In the event where two or more advertisers admit the same valuation, ties are broken randomly. Denote by βb the bidding function for this auction in equilibrium (when it exists). We are interested in characterizing βb and in providing guarantees on the convergence of βb to β as the sample size increases. We first introduce the notation used throughout this section. Definition 17. Given a vector F ∈ Rn , the backwards difference operator ∆ : Rn → Rn is defined as: ∆Fi = Fi − Fi−1 , for i > 1 and ∆F1 = F1 . We will denote ∆∆Fi by ∆2 Fi . Given any k ∈ N and a vector F, the vector Fk is defined as Fki = (Fi )k . 
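Since each term Ls,i of (4.6) is a v-function of r with η = 0, the empirical minimization underlying Algorithm 3 can be carried out with the same boundary-point sweep sketched in Section 3.5.1. A minimal illustration, with our own array layout and reusing the `minimize_sum_of_v_functions` helper from that earlier sketch, is given below:

```python
def gsp_reserve_sweep(q, c, e):
    """Sketch of the O(nS log(nS)) minimization of (4.6).
    q[i, s]: s-th highest quality score in auction i, with q[i, S] the score below
    the last slot (0 if absent); c[s]: position factors; e[i, s]: click-through rate
    of the advertiser in slot s of auction i.
    Each term  -c_s/e^{(s)} * ( q^{(s+1)} 1[q^{(s+1)} >= r] + r 1[q^{(s+1)} < r <= q^{(s)}] )
    is a v-function of r with eta = 0, encoded below via a3 = a4 = 0 and top = b1."""
    params = []
    n, S = q.shape[0], len(c)
    for i in range(n):
        for s in range(S):
            a2 = c[s] / e[i, s]
            b1, b2 = q[i, s], q[i, s + 1]
            params.append((a2 * b2, a2, 0.0, 0.0, b2, b1, b1))   # a1 = a2 * b2
    # helper defined in the sketch accompanying Algorithm 2 (Section 3.5.1)
    return minimize_sum_of_v_functions(params)
```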
Finally, we will use the following notation for multinomial coefficients as follows: a a! , = b 1 ! . . . bn ! b1 , . . . bn P where a, b1 , . . . , bn ∈ N and ni=1 bi = a. 98 Let us now define the discrete analog of the function zs that quantifies the probability of winning slot s. Proposition 17. In a symmetric efficient equilibrium of the discrete GSP the probability zbs (v) that an advertiser with valuation v is assigned to slot s is given by zbs (v) = N −s X s−1 X j=0 k=0 Fji−1 Gki N −1 . j, k, N −1−j −k (N − j − k)nN −1−j−k if v = vi and by zbs (v) = where p = N − s. N −1 lim− Fb(v 0 )p (1 − Fb(v))s−1 =: zbs− (v), 0 s − 1 v →v In particular, notice that zbs− (vi ) admits the simple expression N − 1 bp − zbs (vi ) = Fi−1 Gs−1 i−1 , s−1 which is the discrete version of the function zs . On the other hand, even though function zbs (vi ) does not admit a closed-form, it is not hard to show that 1 N −1 p . (4.9) zbs (vi ) = Fi−1 Gs−1 + O i n s−1 This can again can be thought of as a discrete version of zs . The proof of this and all other propositions in this section are deferred to Appendix C. Let us now define the lower triangular matrix M(s) by: p N − 1 n∆Fj ∆Gsi , Mij (s) = − s−1 s for i > j and Mii (s) = NX −s−1 X s−1 j=0 k=0 Fji−1 Gki N −1 . j, k, N −1−j −k (N − j − k)nN −1−j−k Proposition 18. If the discrete GSP auction admits a symmetric efficient equilibrium, b i ) = βi , where β is the solution of the following then its bidding function βb satisfies β(v linear equation. Mβ = u, (4.10) 99 with M = PS s=1 cs M(s) and ui = S X s=1 cs zs (vi )vi − i X j=1 zbs− (vj )∆vj . (4.11) b defined by Proposition 18 and To gain some insight on the relationship between β, β, defined in Theorem 12; we compare equations (4.10) and (4.2). An integration by parts of the right-hand side of (4.2) and the change of variable G(v) = 1 − F (v) show that β satisfies S X s=1 cs vzs (v) − Z 0 v Z v S X N −1 dzs (t) s−1 tdt = cs G(v) β(t)dF p . dt s − 1 0 s=1 (4.12) On the other hand, equation (4.10) implies that for all i S i−1 X N − 1 n∆Gsi X p ui = cs Mii (s)βi − ∆Fj βj s − 1 s s=1 j=1 Moreover, by Lemma 15 and Proposition 31 in Appendix C, the equalities − Gs−1 + O n1 and i 1 1 N −1 s−1 Mii (s) = G + O , pFp−1 i−1 i 2n s − 1 n2 (4.13) n∆Gsi s = hold. Thus, equation (4.13) resembles a numerical scheme for solving (4.12) where the integral on the right-hand side is approximated by the trapezoidal rule. Equation (4.12) is in fact a Volterra equation of the first kind with kernel S X N −1 K(t, v) = G(v)s−1 pF p−1 (t). s−1 s=1 Therefore, we could benefit from the extensive literature on the convergence analysis of numerical schemes for this type of equations (Baker, 1977; Kress et al., 1989; Linz, 1985). However, equations of the first kind are in general ill-posed problems (Kress et al., 1989), that is small perturbations on the equation can produce large errors on the solution. When the kernel K satisfies mint∈[0,1] K(t, t) > 0, there exists a standard technique to transform an equation of the first kind to an equation of the second kind, which is a well posed problem. This makes the convergence analysis for this type of problems much simpler. The kernel function appearing in (4.12) does not satisfy this property and therefore we cannot use these results for our problem. To the best of 100 0.1 max(βi - βi-1) C n-1/2 0.08 0.06 0.04 0.02 0 0 100 200 300 400 500 Sample size Figure 4.2: (a) Empirical verification of Assumption 2. 
Values were generated using a uniform distribution over [0, 1] and the parameters of the auction were N = 3, s = 2. The blue line corresponds to the quantity maxi ∆βi for different values of n. In red we plot the desired upper bound for C = 1/2. our knowledge, the problem of using a simple quadrature method to solve a Volterra equation of the first kind with vanishing kernel has not been addressed previously. In addition to dealing with an uncommon integral equation, we need to address the problem that the elements of (4.10) are not exact evaluations of the functions defining (4.12) but rather stochastic approximations of these functions. Finally, the grid points used for the numerical approximation are also random. In order to prove convergence of the function βb to β we will make the following assumptions Assumption 1. There exists a constant c > 0 such that f (x) > c for all x ∈ [0, 1]. This assumption is needed to ensure that the difference between consecutive samples vi − vi−1 goes to 0 as n → ∞, which is a necessary condition for the convergence of any numerical scheme. Assumption 2. The solution β of (4.10) satisfies vi , βi ≥ 0 for all i and maxi ∆βi ≤ C √ , for some universal constant C. n Since βi is a bidding strategy in equilibrium, it is reasonable to expect that vi ≥ βi ≥ 0. On the other hand, the assumption on ∆βi is related to the smoothness of the solution. If the function β is smooth, we should expect the approximation βb to be smooth too. Both assumptions can in practice be verified empirically; for example, by using a density estimation algorithm for Assumption 1. On the other hand, Assumption 2 can be verified by calculating the desired statistics as in Figure 4.2, which depicts the quantity maxi∈1,...,n ∆βi as a function of the sample size n. Assumption 3. The solution β to (4.2) is twice continuously differentiable. This is satisfied if for instance the distribution function F is twice continuously differentiable. We can now present our main result. 101 Theorem 14. If Assumptions 1, 2 and 3 are satisfied, then, with probability at least 1−δ over the draw of a sample of size n, the following bound holds for all i ∈ [1, n]: b i ) − β(vi )| ≤ eC |β(v log(2/δ)N/2 Cq(n, δ/2) √ q(n, δ/2)3 + . n3/2 n where q(n, δ) = 2c log(nc/2δ) with c defined in Assumption 1, and where C is a universal constant. The proof of this theorem is highly technical and we defer it to Appendix C.6 and present here only a sketch of the proof. 1. We take the the discrete derivative of (4.10) to obtain the new system of equations dMβ = dui (4.14) where dMij = Mij − Mi,j−1 and dui = ui − ui−1 . This step is standard in the analysis of numerical methods for Volterra equations of the first kind. N −S 2. Since dMii = Mii and these values go to 0 as ni it follows that (4.14) is illconditioned and therefore a straight forward comparison of the solutions β and β will not work. Instead, we analyze the vector ψ = v − β and show that it satisfies the equation dMψ = p for some vector p defined in Appendix C.6. Furthermore, we show that that ψi ≤ 2 C ni 2 for some universal constant C; and similarly the function ψ(v) = v − β(v) will 2 also satisfy ψ(v) ≤ Cv 2 . Therefore |ψ(vi ) − ψi | ≤ C ni 2 . In particular for i ≤ n3/4 we have |ψ(vi ) − ψi | = |β(vi ) − βi | ≤ √Cn . 3. Using the fact that |F (vi ) − Fi | is in O( √1n ). We show the sequence of errors i = |β(vi ) − βi | satisfy the following recurrence: i−2 1 dMi,i−1 1X i ≤ C √ q(n, δ) + i−1 + j dMii n j=1 n dM ∼ 1i . 
Since convergence of this term to 0 is too It is not hard to prove that dMi,i−1 ii slow we cannot provide bound on i based on this recurrence. Instead, we will we √ ii to bound the difference between use the fact that |dMii ψi − dMi,i−1 ψi−1 | ≤ C dM n ψ and the solution ψ 0 of the equation dM0i ψ 0 = pi Where dM0ii = 2dMii , dM0i,i−1 = 0 and dM0ij = dMij otherwise. More precisely, We show that kψ − ψ 0 k∞ ≤ nC2 . 102 0.7 0.5 b 0.3 0.1 0.0 0.2 0.4 0.6 0.8 v Figure 4.3: Approximation of the empirical bidding function βb to the true solution β. The true solution is shown in red and the shaded region represents the confidence interval of βb when simulating the discrete GSP 10 times with a sample of size 200. Here N = 3, S = 2, c1 = 1, c2 = 0.5 and values were sampled uniformly from [0, 1] 4. We show that 0i = |ψ(vi ) − ψi0 | satisfies the recurrence 1X 0 1 . ≤ C √ q(n, δ) + n j=1 i n i−2 0i Notice that the term decreasing as 1i no longer appears in the recurrence. Therefore, we can conclude that 0i must satisfy the bound given in Theorem 14, which in turn implies that |β(vi ) − βi | also satisfies the bound. 4.6 Experiments Here we present preliminary experiments showing the advantages of our algorithm. We also present empirical evidence showing that the procedure proposed in (Sun et al., 2014) to estimate valuations from bids is incorrect. In contrast, our density estimation algorithm correctly recovers valuations from bids in equilibrium. 4.6.1 Setup Let F1 and F2 denote the distributions of two truncated log-normal random variables with parameters µ1 = log(.5), σ1 = .8 and µ2 = log(2), σ = .1 respectively. Here F1 is truncated to have support in [0, 1.5] and the support of F2 = [0, 2.5]. We consider a GSP with N = 4 advertisers with S = 3 slots and position factors c1 = 1, c2 =, 45 and c3 = 1. Based on the results of Section 4.5 we estimate the bidding function β with a sample of 2000 points and we show its plot in Figure 4.4. We proceed to evaluate the method proposed by Sun et al. (2014) for recovering advertisers’ valuations from bids in equilibrium. The assumption made by the authors is that the advertisers play a SNE in 103 2.5 β(v) v 2 1.5 1 0.5 0 0 0.5 1 1.5 2 2.5 Figure 4.4: Bidding function for our experiments in blue and identity function in red. Since β is a strictly increasing function it follows from (Gomes and Sweeney, 2014) that this GSP admits an equilibrium. which case valuations can be inferred by solving a simple system of inequalities defining the SNE (Varian, 2007). Since the authors do not specify which SNE the advertisers are playing we follow the work of Ostrovsky and Schwarz (2011) and choose the one that solves the SNE conditions with equality. We generated a sample S consisting of n = 300 i.i.d. outcomes of our simulated auction. Since, N = 4 the effective size of this sample is of 1200 points. We generated the outcome bid vectors bi , . . . , bn by using the equilibrium bidding function β. Assuming that the bids constitute a SNE we estimated the valuations and Figure 4.5 shows a histogram of the original sample as well as the histogram of the estimated valuations. It is clear from this figure that this procedure does not accurately recover the distribution of the valuations. By contrast the histogram of the estimated valuations using our density estimation algorithm is shown in Figure 4.5(c). The kernel function used by our algorithm was a triangular kernel given by K(u) = (1 − |u|)1|u|≤1 . 
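A minimal sketch of the kernel density estimation step with this triangular kernel is given below; the bandwidth h is left as a parameter, and the rule used in the experiments is specified in the text that follows:

```python
import numpy as np

def triangular_kde(samples, grid, h):
    """Kernel density estimate with the triangular kernel K(u) = (1 - |u|) 1[|u| <= 1]."""
    u = (grid[:, None] - samples[None, :]) / h
    k = np.clip(1.0 - np.abs(u), 0.0, None)     # zero outside [-1, 1]
    return k.sum(axis=1) / (len(samples) * h)
```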
Following the experimental setup of (Guerre et al., 2000), the bandwidth h was set to h = 1.06 σ̂ / n^{1/5}, where σ̂ denotes the standard deviation of the sample of bids. Finally, we use both our density estimation algorithm and our discriminative learning algorithm to infer the optimal value of r and the associated expected revenue. To test our algorithms we generated a test sample of size n = 500 with the procedure previously described. The results are shown in Table 4.1.

    Density estimation: 1.42 ± 0.02        Discriminative: 1.85 ± 0.02
Table 4.1: Mean revenue of both our algorithms.

Notice that the revenues achieved by these two algorithms do not seem comparable, even though both solve the same problem. The cause of this discrepancy is a subtle difference in the problems that the discriminative algorithm and the density estimation algorithm solve. Indeed, the discriminative algorithm assumes that test data comes from the same distribution as the source data.

Figure 4.5: Comparison of methods for estimating valuations from bids. (a) Histogram of true valuations. (b) Valuations estimated under the SNE assumption. (c) Density estimation algorithm.

However, as shown by Gomes and Sweeney (2014), introducing a reserve price may induce a different equilibrium. More precisely, a reserve price r induces an equilibrium function β(r, v). The reserve price r*_dens obtained using density estimation is such that r*_dens is optimal for the bidding function β(r*_dens, v), whereas the reserve price r*_disc selected by the discriminative algorithm is optimal for the equilibrium function β(0, v). We believe that it is reasonable to assume that bidders do not change their bidding function too quickly in the presence of a new reserve price. Therefore, it is more plausible for bids on the test data to be given by β(0, v).

4.7 Conclusion

We proposed and analyzed two algorithms for learning optimal reserve prices for generalized second-price auctions. Our first algorithm is based on density estimation and therefore suffers from the standard problems associated with this family of algorithms. Furthermore, this algorithm is only well defined when bidders play in equilibrium. Our second algorithm is novel and is based on learning theory guarantees. We show that the algorithm admits an efficient O(nS log(nS)) implementation. Furthermore, our theoretical guarantees are more favorable than those presented for the previous algorithm of (Sun et al., 2014). Moreover, even though it is necessary for advertisers to play in equilibrium for our algorithm to converge to optimality, when bidders do not play an equilibrium, our algorithm is still well defined and minimizes a quantity of interest, albeit over a smaller set. We also presented preliminary experimental results showing the advantages of our algorithm. To our knowledge, this is the first attempt to apply learning algorithms to the problem of reserve price optimization in GSP auctions. We believe that the use of learning algorithms in revenue optimization is crucial and that this work suggests a rich research agenda, including the extension of this work to a general learning setup where auctions and advertisers are represented by features. Additionally, in our analysis, we considered two different ranking rules.
It would be interesting to combine the algorithm of (Zhu et al., 2009) with this work to learn both a ranking rule and an optimal reserve price for GSP auctions. Finally, we provided the first analysis of convergence of bidding functions in an empirical equilibrium to the true bidding function. This result on its own is of great importance as it better justifies the common assumption of advertisers playing in a Bayes-Nash equilibrium. 106 Chapter 5 Learning Against Strategic Adversaries The results of the previous chapter demonstrate the advantages of using machine learning techniques in revenue optimization for auctions. These algorithms however, rely on the assumption that auctions outcomes are independent from each other. In practice, however, buyers may react strategically against a seller trying to optimize his revenue. In particular, they might under-bid in order to obtain a more favorable reserve price in the future. In this chapter we study the interactions between a seller attempting to optimize his revenue and a strategic buyer. More precisely, we analyze the problem of revenue optimization learning algorithms for posted-price auctions with strategic buyers. We analyze a very broad family of monotone regret minimization algorithms for this problem, which includes the previously best known algorithm, and√show that no algorithm in that family admits a strategic regret more favorable than Ω( T ). We then introduce a new algorithm that achieves a strategic regret differing from the lower bound only by a factor in O(log T ), an exponential improvement upon the previous best algorithm. Our new algorithm admits a natural analysis and simpler proofs, and the ideas behind its design are general. We also report the results of empirical evaluations comparing our algorithm with the previous state of the art and show a consistent exponential improvement in several different scenarios. 5.1 Introduction Revenue optimization algorithms such as the one proposed in the previous chapter, critically rely on the assumption that the bids, that is, the outcomes of auctions, are drawn i.i.d. according to some unknown distribution. However, this assumption may not hold in practice. In particular, with the knowledge that a revenue optimization algorithm is being used, an advertiser could seek to mislead the publisher by under-bidding. In fact, consistent empirical evidence of strategic behavior by advertisers has been found by Edelman and Ostrovsky (2007). This motivates the analysis presented in this chapter of the interactions between sellers and strategic buyers, that is, buyers that may act 107 non-truthfully with the goal of maximizing their surplus. The scenario we consider is that of posted-price auctions, which, albeit simpler than other mechanisms, in fact matches a common situation in AdExchanges where many auctions admit a single bidder. In this setting, second-price auctions with reserve are equivalent to posted-price auctions: a seller sets a reserve price for a good and the buyer decides whether or not to accept it (that is to bid higher than the reserve price). In order to capture the buyer’s strategic behavior, we will analyze an online scenario: at each time t, a price pt is offered by the seller and the buyer must decide to either accept it or leave it. 
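Schematically, this interaction protocol can be summarized by the following minimal sketch in Python. Here `seller_algorithm` and `buyer_policy` are placeholder callables standing in for the seller's pricing rule and the buyer's (possibly strategic) behavior; they are not part of any particular implementation.

```python
def run_posted_price_auction(seller_algorithm, buyer_policy, T):
    """Schematic of the repeated posted-price interaction over T rounds."""
    history = []           # list of (price offered, accepted?) pairs
    revenue = 0.0
    for t in range(1, T + 1):
        p_t = seller_algorithm(history)        # seller posts a price
        a_t = buyer_policy(p_t, t, history)    # buyer accepts (1) or rejects (0)
        revenue += a_t * p_t
        history.append((p_t, a_t))
    return history, revenue
```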
This scenario can be modeled as a two-player repeated non-zero sum game with incomplete information, where the seller’s objective is to maximize his revenue, while the advertiser seeks to maximize her surplus as described in more detail in Section 5.2. The literature on non-zero sum games is very rich (Nachbar, 1997, 2001; Morris, 1994), but much of the work in that area has focused on characterizing different types of equilibria, which is not directly relevant to the algorithmic questions arising here. Furthermore, the problem we consider admits a particular structure that can be exploited to design efficient revenue optimization algorithms. From the seller’s perspective, this game can also be viewed as a bandit problem (Kuleshov and Precup, 2010; Robbins, 1985) since only the revenue (or reward) for the prices offered is accessible to the seller. Kleinberg and Leighton (2003a) precisely studied this continuous bandit setting under the assumption of an oblivious buyer, that is, one that does not exploit the seller’s behavior (more precisely, the authors assume that at each round the seller interacts with a different buyer). The authors presented a tight regret bound of Θ(log log T ) for the scenario of a buyer holding a fixed valuation and a 2 regret bound of O(T 3 ) when facing an adversarial buyer by using an elegant reduction to a discrete bandit problem. However, as argued by Amin et al. (2013), when dealing with a strategic buyer, the usual definition of regret is no longer meaningful. Indeed, consider the following example: let the valuation of the buyer be given by v ∈ [0, 1] and assume that an algorithm with sublinear regret such as Exp3 (Auer et al., 2002) or UCB (Auer et al., 2002) is used for T rounds by the seller. A possible strategy for the buyer, knowing the seller’s algorithm, would be to accept prices only if they are smaller than some small value , certain that the seller would eventually learn to offer only prices less than . If v, the buyer would considerably boost her surplus while, in theory, the seller would have not incurred a large regret since in hindsight, the best fixed strategy would have been to offer price for all rounds. This, however is clearly not optimal for the seller. The stronger notion of policy regret introduced by Arora et al. (2012) has been shown to be the appropriate one for the analysis of bandit problems with adaptive adversaries. However, for the example just described, a sublinear policy regret can be similarly achieved. Thus, this notion of regret is also not the pertinent one for the study of our scenario. We will adopt instead the definition of strategic-regret, which was introduced by Amin et al. (2013) precisely for the study of this problem. This notion of regret also 108 matches the concept of learning loss introduced by (Agrawal, 1995) when facing an oblivious adversary. Using this definition, Amin et al. (2013) presented both upper and lower bounds for the regret of a seller facing a strategic buyer and showed that the buyer’s surplus must be discounted over time in order to be able to achieve sublinear regret (see Section√5.2). However, the gap between the upper and lower bounds they presented is in O( T ). In the following, we analyze a very broad family of monotone regret minimization algorithms for this problem (Section 5.3), which includes the algorithm of Amin et al. (2013), and √ show that no algorithm in that family admits a strategic regret more favorable than Ω( T ). 
Next, we introduce a nearly-optimal algorithm that achieves a strategic regret differing from the lower bound at most by a factor in O(log T) (Section 5.4). This represents an exponential improvement upon the existing best algorithm for this setting. Our new algorithm admits a natural analysis and simpler proofs. A key idea behind its design is a method deterring the buyer from lying, that is, from rejecting prices below her valuation.

5.2 Setup

We consider the following game played by a buyer and a seller. A good, such as an advertisement space, is repeatedly offered for sale by the seller to the buyer over T rounds. The buyer holds a private valuation v ∈ [0, 1] for that good. At each round t = 1, . . . , T, a price p_t is offered by the seller and a decision a_t ∈ {0, 1} is made by the buyer. a_t takes value 1 when the buyer accepts to buy at that price, 0 otherwise. We will say that a buyer lies whenever a_t = 0 while p_t < v. At the beginning of the game, the algorithm A used by the seller to set prices is announced to the buyer. Thus, the buyer plays strategically against this algorithm. The knowledge of A is a standard assumption in mechanism design and also matches the practice in AdExchanges. For any γ ∈ (0, 1), define the discounted surplus of the buyer as follows:

    Sur(A, v) = Σ_{t=1}^{T} γ^{t−1} a_t (v − p_t).    (5.1)

The value of the discount factor γ indicates the strength of the preference of the buyer for current surpluses versus future ones. The performance of a seller's algorithm is measured by the notion of strategic-regret (Amin et al., 2013), defined as follows:

    Reg(A, v) = T v − Σ_{t=1}^{T} a_t p_t.    (5.2)

The buyer's objective is to maximize her discounted surplus, while the seller seeks to minimize his regret. Note that, in view of the discounting factor γ, the buyer is not fully adversarial. The problem consists of designing algorithms achieving sublinear strategic regret (that is, a regret in o(T)). The motivation behind the definition of strategic-regret is straightforward: a seller, with access to the buyer's valuation, can set a fixed price for the good close to this value, say v − ε. The buyer, having no control over the prices offered, has no option but to accept this price in order to optimize her utility. The revenue per round of the seller is therefore v − ε. Since there is no scenario where higher revenue can be achieved, this is a natural setting against which to compare the performance of our algorithm.

To gain more intuition about the problem, let us examine some of the complications arising when dealing with a strategic buyer. Suppose the seller attempts to learn the buyer's valuation v by performing a binary search. This would be a natural algorithm when facing a truthful buyer. However, in view of the buyer's knowledge of the algorithm, for γ ≫ 0, it is in her best interest to lie on the initial rounds, thereby quickly, in fact exponentially, decreasing the price offered by the seller. The seller would then incur an Ω(T) regret. A binary search approach is therefore "too aggressive". Indeed, an untruthful buyer can manipulate the seller into offering prices less than v/2 by lying about her value even just once! This discussion suggests following a more conservative approach. In the next section, we discuss a natural family of conservative algorithms for this problem.

5.3 Monotone Algorithms

The following conservative pricing strategy was introduced by Amin et al. (2013). Let p_1 = 1 and β < 1. If price p_t is rejected at round t, the lower price p_{t+1} = βp_t is offered at the next round.
If at any time price pt is accepted, then this price is offered for all the remaining rounds. We will denote this algorithm by monotone. The motivation behind its design is clear: for a suitable choice of β, the seller can slowly decrease the prices offered, thereby pressing the buyer to reject many prices (which is not √ convenient for her) before obtaining a favorable price. The authors present an O(Tγ T ) regret bound for this algorithm, with Tγ = 1/(1 p − γ).√A more careful analysis shows that this bound can be further tightened to O( Tγ T + T ) when the discount factor γ is known to the seller. Despite its sublinear regret, the monotone algorithm remains sub-optimal for certain choices of γ. Indeed, consider a scenario with γ 1. For this setting, the buyer would no longer have an incentive to lie, thus, an algorithm such as binary search would achieve logarithmic regret, √ while the regret achieved by the monotone algorithm is only guaranteed to be in O( T ). One may argue that the monotone algorithm is too specific since it admits a single parameter β and that perhaps a more complex algorithm with the same monotonic idea could achieve a more favorable regret. Let us therefore analyze a generic monotone 110 Algorithm 4 Monotone algorithms. Let p1 = 1 and pt ≤ pt−1 for t = 2, . . . T . t←1 p ← pt Offer price p while (Buyer rejects p) and (t < T ) do t←t+1 p ← pt Offer price p end while while (t < T ) do t←t+1 Offer price p end while Algorithm 5 Definition of Ar . n = the root of T (T ) while Offered prices less than T do Offer price pn if Accepted then n = r(n) else Offer price pn for r rounds n = l(n) end if end while algorithm Am defined by Algorithm 4. Definition 18. For any buyer’s valuation v ∈ [0, 1], define the acceptance time κ∗ = κ∗ (v) as the first time a price offered by the seller using algorithm Am is accepted. Proposition 19. For any decreasing sequence of prices (pt )Tt=1 , there exists a truthful buyer with valuation v0 such that algorithm Am suffers regret of at least q √ 1 T − T. Reg(Am , v0 ) ≥ 4 The proof of this proposition is based on the idea that in order to achieve low regret in rounds where the buyer accepts a price, the distance between pt and pt+1 must be small. However, for this scenario, when facing a buyer with value v 1 the seller will accumulate a large regret from rounds where the buyer rejects a price. This intuition is formalized in the following Lemma. Lemma 6. Let (pt )Tt=1 be a decreasing sequence of prices. Assume that the seller faces a truthful buyer. Then, if v is sampled uniformly at random in the interval [ 12 , 1], the following inequality holds: 1 E[κ∗ ] ≥ . 32 E[v − pκ∗ ] The proof of this Lemma can be found in Appendix D.1. We proceed to prove Proposition 19. Proof. By definition of the regret, √ we have Reg(Am , v) = vκ∗ + (T − κ∗ )(v √ − pκ∗ ). We ∗ can consider two cases: κ (v0 ) > T for some v0 ∈ [1/2, 1]√and κ∗ (v) ≤ T for every √ 1 v ∈ [1/2, 1]. In the former case, we have Reg(Am , v0 ) ≥ v0 T ≥ 2 T , which implies the statement of the proposition. Thus, we can assume the latter condition. 111 Let v be uniformly distributed over [ 12 , 1]. In view of Lemma 6, we have √ 1 E[κ∗ ] + (T − T ) E[(v − pκ∗ )] 2 √ 1 T− T ∗ ≥ E[κ ] + . 2 32 E[κ∗ ] √ √ ∗ The right-hand side is minimized for E[κ ] = T 4− T . Plugging in this value yields √ √ T− T , which implies the existence of v0 with Reg(Am , v0 ) ≥ E[Reg(Am , v)] ≥ 4 √ √ T− T . 
4 E[vκ∗ ] + E[(T − κ∗ )(v − pκ∗ )] ≥ √We have thus shown that any monotone algorithm Am suffers a regret of at least Ω( T ), even when facing a truthful buyer. A tighter lower bound can be given under a mild condition on the prices offered. Definition 19. A sequence (pt )Tt=1 is said to be convex if it verifies pt −pt+1 ≥ pt+1 −pt+2 for t = 1, . . . , T − 2. An instance of a convex sequence is given by the prices offered by the monotone algorithm. A seller offering prices forming a decreasing convex sequence seeks to control the number of lies of the buyer by slowly reducing prices. The following proposition gives a lower bound on the regret of any algorithm in this family. Proposition 20. Let (pt )Tt=1 be a decreasing convex sequence of prices. There exists a valuation v0 forpthe buyer√such that the regret of the monotone algorithm defined by γ . these prices is Ω( T Cγ + T ), where Cγ = 2(1−γ) The full proof of this proposition is given in Appendix D.1.1. The proposition shows that when the discount factor γ is known, the monotone algorithm is in fact asymptotically optimal in its class. The results just presented suggest that the dependency on T cannot be improved by any monotone algorithm. In some sense, this family of algorithms is “too conservative”. Thus, to achieve a more favorable regret guarantee, an entirely different algorithmic idea must be introduced. In the next section, we describe a new algorithm that achieves a substantially more advantageous strategic regret by combining the fast convergence properties of a binary search-type algorithm (in a truthful setting) with a method penalizing untruthful behaviors of the buyer. 5.4 A Nearly Optimal Algorithm Let A be an algorithm for revenue optimization used against a truthful buyer. Denote by T (T ) the tree associated to A after T rounds. That is, T (T ) is a full tree of height 112 1/2 1/16 1/2 1/4 3/4 5/16 9/16 1/4 13/16 (a) 3/4 13/16 (b) Figure 5.1: (a) Tree T (3) associated to the algorithm proposed in (Kleinberg and Leighton, 2003a). (b) Modified tree T 0 (3) with r = 2. T with nodes n ∈ T (T ) labeled with the prices pn offered by A. The right and left children of n are denoted by r(n) and l(n) respectively. The price offered when pn is accepted by the buyer is the label of r(n) while the price offered by A if pn is rejected is the label of l(n). Finally, we will denote the left and right subtrees rooted at node n by L (n) and R(n) respectively. Figure 5.1 depicts the tree generated by an algorithm proposed by Kleinberg and Leighton (2003a), which we will describe later. Since the buyer holds a fixed valuation, we will consider algorithms that increase prices only after a price is accepted and decrease it only after a rejection. This is formalized in the following definition. Definition 20. An algorithm A is said to be consistent if maxn0 ∈L (n) pn0 ≤ pn ≤ minn0 ∈R(n) pn0 for any node n ∈ T (T ). For any consistent algorithm A, we define a modified algorithm Ar , parametrized by an integer r ≥ 1, designed to face strategic buyers. Algorithm Ar offers the same prices as A, but it is defined with the following modification: when a price is rejected by the buyer, the seller offers the same price for r rounds. The pseudocode of Ar is given in Algorithm 5. The motivation behind the modified algorithm is given by the following simple observation: a strategic buyer will lie only if she is certain that rejecting a price will boost her surplus in the future. 
By forcing the buyer to reject a price for several rounds, the seller ensures that the future discounted surplus will be negligible, thereby coercing the buyer to be truthful. We proceed to formally analyze algorithm Ar . In particular, we will quantify the effect of the parameter r on the choice of the buyer’s strategy. To do so, a measure of the spread of the prices offered by Ar is needed. Definition 21. For any node n ∈ T (T ) define the right increment of n as δnr := pr(n) − pn . Similarly, define its left increment to be δnl := maxn0 ∈L (n) pn − pn0 . The prices offered by Ar define a path in T (T ). For each node in this path, we can define time t(n) to be the number of rounds needed for this node to be reached by Ar . 113 Note that, since r may be greater than 1, the path chosen by Ar might not necessarily reach the leaves of T (T ). Finally, let S : n 7→ S(n) be the function representing the surplus obtained by the buyer when playing an optimal strategy against Ar after node n is reached. Lemma 7. The function S satisfies the following recursive relation: S(n) = max(γ t(n)−1 (v − pn ) + S(r(n)), S(l(n))). (5.3) Proof. Define a weighted tree T 0 (T ) ⊂ T (T ) of nodes reachable by algorithm Ar . We assign weights to the edges in the following way: if an edge on T 0 (T ) is of the form (n, r(n)), its weight is set to be γ t(n)−1 (v − pn ), otherwise, it is set to 0. It is easy to see that the function S evaluates the weight of the longest path from node n to the leafs of T 0 (T ). It thus follows from elementary graph algorithms that equation (5.3) holds. The previous lemma immediately gives us necessary conditions for a buyer to reject a price. Proposition 21. For any reachable node n, if price pn is rejected by the buyer, then the following inequality holds: v − pn < γr (δ l + γδnr ). (1 − γ)(1 − γ r ) n Proof. A direct implication of Lemma 7 is that price pn will be rejected by the buyer if and only if γ t(n)−1 (v − pn ) + S(r(n)) < S(l(n)). (5.4) However, by definition, the buyer’s surplus obtained by following any path in R(n) is bounded above by S(r(n)). In particular, this is true for the path which rejects pr(n) and P accepts every price afterwards. The surplus of this path is given by Tt=t(n)+r+1 γ t−1 (v− pbt ) where (b pt )Tt=t(n)+r+1 are the prices the seller would offer if price pr(n) were rejected. Furthermore, since algorithm Ar is consistent, we must have pbt ≤ pr(n) = pn + δnr . Therefore, S(r(n)) can be bounded as follows: S(r(n)) ≥ T X t=t(n)+r+1 γ t−1 (v − pn − δnr ) = γ t(n)+r − γ T (v − pn − δnr ). 1−γ (5.5) We proceed to upper bound S(l(n)). Since pn − p0n ≤ δnl for all n0 ∈ L (n), v − pn0 ≤ v − pn + δnl and S(l(n)) ≤ T X t=tn +r γ t−1 (v − pn + δnl ) = 114 γ t(n)+r−1 − γ T (v − pn + δnl ). 1−γ (5.6) Combining inequalities (5.4), (5.5) and (5.6) we conclude that γ t(n)+r−1 − γ T γ t(n)+r − γ T (v − pn − δnr ) ≤ (v − pn + δnl ) 1−γ 1−γ γ r+1 − γ r γ r δnl + γ r+1 δnr − γ T −t(n)+1 (δnr + δnl ) (v − pn ) 1 + ≤ 1−γ 1−γ r l r γ (δn + γδn ) ⇒ (v − pn )(1 − γ r ) ≤ . 1−γ γ t(n)−1 (v − pn ) + ⇒ Rearranging the terms in the above inequality yields the desired result. Let us consider the following instantiation of algorithm A introduced in (Kleinberg and Leighton, 2003a). The algorithm keeps track of a feasible interval [a, b] initialized to [0, 1] and an increment parameter initialized to 1/2. The algorithm works in phases. Within each phase, it offers prices a+, a+2, . . . until a price is rejected. 
If price a+k is rejected, then a new phase starts with the feasible interval set to [a + (k − 1), a + k] and the increment parameter set to 2 . This process continues until b−a < 1/T at which point the last phase starts and price a is offered for the remaining rounds. It is not hard to see that the number of phases needed by the algorithm is less than dlog2 log2 T e + 1. A more surprising fact is that this algorithm has been shown to achieve regret O(log log T ) when the seller faces a truthful buyer. We will show that the modification Ar of this algorithm admits a particularly favorable regret bound. We will call this algorithm PFSr (penalized fast search algorithm). Proposition 22. For any value of v ∈ [0, 1] and any γ ∈ (0, 1), the regret of algorithm PFSr admits the following upper bound: Reg(PFSr , v) ≤ (vr + 1)(dlog2 log2 T e + 1) + (1 + γ)γ r T . 2(1 − γ)(1 − γ r ) (5.7) Note that for r = 1 and γ → 0 the upper bound coincides with that of (Kleinberg and Leighton, 2003a). Proof. Algorithm PFSr can accumulate regret in two ways: the price offered pn is rejected, in which case the regret is v, or the price is accepted and its regret is v − pn . Let K = dlog2 log2 T e + 1 be the number of phases run by algorithm PFSr . Since at most K different prices are rejected by the buyer (one rejection per phase) and each price must be rejected for r rounds, the cumulative regret of all rejections is upper bounded by vKr. The second type of regret can also be bounded straightforwardly. For any phase i, let i and [ai , bi ] denote the corresponding search parameter and feasible interval respectively. If v ∈ [ai , bi ], the regret accrued in the case where the buyer accepts a price in √ this interval is bounded by bi − ai = i . If, on the other hand v ≥ bi , then it readily 115 √ follows that v − pn < v − bi + i for all prices pn offered in phase i. Therefore, the regret obtained in acceptance rounds is bounded by K X i=1 Ni (v − bi )1v>bi K √ X (v − bi )1v>bi Ni + K, + i ≤ i=1 where Ni ≤ √1i denotes the number of prices offered during the i-th round. Finally, notice that, in view of the algorithm’s definition, every bi corresponds to a rejected price. Thus, by Proposition 21, there exist nodes ni (not necessarily distinct) such that pni = bi and v − bi = v − pni ≤ γr (δ l + γδnr i ). (1 − γ)(1 − γ r ) ni It is immediate that δnr ≤ 1/2 and δnl ≤ 1/2 for any node n, thus, we can write X γ r (1 + γ) γ r (1 + γ) N ≤ T. (v − bi )1v>bi Ni ≤ i 2(1 − γ)(1 − γ r ) i=1 2(1 − γ)(1 − γ r ) i=1 K X K The last inequality holds since at most T prices are offered by our algorithm. Combining the bounds for both regret types yields the result. When an upper bound on the discount factor γ is known to the seller, he can leverage this information and optimize upper bound (5.7) with respect to the parameter r. m l γ0r T ∗ Theorem 15. Let 1/2 < γ < γ0 < 1 and r = argminr≥1 r + (1−γ0 )(1−γ r ) . For any 0 v ∈ [0, 1], if T > 4, the regret of PFSr∗ satisfies Reg(PFSr∗ , v) ≤ (2vγ0 Tγ0 log cT + 1 + v)(log2 log2 T + 1) + 4Tγ0 , where c = 4 log 2. The proof of this theorem is fairly technical and is deferred to the Appendix. The theorem helps us define conditions under which logarithmic regret can be achieved. Indeed, if γ0 = e−1/ log T = O(1 − log1 T ), using the inequality e−x ≤ 1 − x + x2 /2 valid for all x > 0 we obtain 1 log2 T ≤ ≤ log T. 1 − γ0 2 log T − 1 It then follows from Theorem 15 that Reg(PFSr∗ , v) ≤ (2v log T log cT + 1 + v)(log2 log2 T + 1) + 4 log T. 
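Before comparing this bound with previous work, it may help to make the description of PFSr concrete. The following Python sketch renders the phase structure described above: within each phase, prices increase by the current increment until a rejection, the rejected price is then repeated for r rounds, and the feasible interval and increment are updated. This is only a schematic rendering of Algorithm 5 applied to the fast search algorithm; the buyer is modeled by a placeholder callable `buyer_accepts`, and tie-handling details are simplified.

```python
def pfs_r(buyer_accepts, T, r):
    """Sketch of the penalized fast search algorithm PFS_r.

    buyer_accepts(price, t) -> bool stands in for the (possibly strategic) buyer.
    T is the horizon and r the number of times a rejected price is repeated.
    Returns the seller's total revenue.
    """
    a, b = 0.0, 1.0      # feasible interval, initialized to [0, 1]
    eps = 0.5            # increment parameter, initialized to 1/2
    revenue, t = 0.0, 0
    while t < T:
        if b - a < 1.0 / T:
            # last phase: offer the price a for all remaining rounds
            while t < T:
                if buyer_accepts(a, t):
                    revenue += a
                t += 1
            break
        # regular phase: offer a + eps, a + 2*eps, ... until a rejection
        k = 1
        while t < T:
            price = min(a + k * eps, b)   # safeguard: stay within the interval
            if buyer_accepts(price, t):
                revenue += price
                t += 1
                k += 1
            else:
                t += 1                    # the rejected offer consumes a round
                # penalization: repeat the rejected price for r more rounds
                for _ in range(r):
                    if t >= T:
                        break
                    if buyer_accepts(price, t):
                        revenue += price
                    t += 1
                # new phase on the sub-interval located by the rejection
                a, b = a + (k - 1) * eps, min(a + k * eps, b)
                eps = eps * eps
                break
    return revenue
```

As noted after Proposition 22, for r = 1 and γ → 0 the resulting bound coincides with that of the unpenalized fast search algorithm, which is the sense in which the modification interpolates between the truthful and strategic regimes.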
Let us compare the regret bound given by Theorem 15 with the one given by Amin et al. (2013). The above discussion shows that, for certain values of γ, an exponentially better regret can be achieved by our algorithm. It can be argued that the knowledge of an upper bound on γ is required by our algorithm, whereas this is not needed for the monotone algorithm. However, if γ > 1 − 1/√T, the regret bound on monotone is super-linear, and therefore uninformative. Thus, in order to properly compare both algorithms, we may assume that γ < 1 − 1/√T, in which case, by Theorem 15, the regret of our algorithm is O(√T log T) whereas only linear regret can be guaranteed by the monotone algorithm. Even under the more favorable bound of O(√(Tγ T) + √T), for any α < 1 and γ < 1 − 1/T^α, the monotone algorithm will achieve regret O(T^{(α+1)/2}), while a strictly better regret O(T^α log T log log T) is attained by ours.

5.5 Lower Bound

The following lower bounds have been derived in previous work.

Theorem 16 ((Amin et al., 2013)). Let γ > 0 be fixed. For any algorithm A, there exists a valuation v for the buyer such that Reg(A, v) ≥ (1/12) Tγ.

This theorem is in fact given for the stochastic setting where the buyer's valuation is a random variable drawn from some fixed distribution D. However, the proof of the theorem selects D to be a point mass, therefore reducing the scenario to a fixed-price setting.

Theorem 17 ((Kleinberg and Leighton, 2003a)). Given any algorithm A to be played against a truthful buyer, there exists a value v ∈ [0, 1] such that Reg(A, v) ≥ C log log T for some universal constant C.

Combining these results leads immediately to the following.

Corollary 9. Given any algorithm A, there exists a buyer's valuation v ∈ [0, 1] such that Reg(A, v) ≥ max((1/12) Tγ, C log log T), for a universal constant C.

We now compare the upper bounds given in the previous section with the bound of Corollary 9. For γ > 1/2, we have Reg(PFSr, v) = O(Tγ log T log log T). On the other hand, for γ ≤ 1/2, we may choose r = 1, in which case, by Proposition 22, Reg(PFSr, v) = O(log log T). Thus, the upper and lower bounds match up to an O(log T) factor.

5.6 Empirical Results

In this section, we present the results of simulations comparing the monotone algorithm and our algorithm PFSr.

Figure 5.2: Comparison of the monotone algorithm and PFSr for different choices of γ and v. The regret of each algorithm is plotted as a function of the number of rounds when γ is not known to the algorithms (first two figures) and when its value is made accessible to the algorithms (last two figures). The settings shown are (γ, v) = (.95, .75), (.85, .75), (.75, .25), and (.80, .25).

The experiments were carried out as follows: given a buyer's valuation v, a discrete set of false valuations v̂ were selected out of the set {.03, .06, . . . , v}. Both algorithms were run against a buyer making the seller believe her valuation is v̂ instead of v. The value of v̂ achieving the best utility for the buyer was chosen and the regret of both algorithms is reported in Figure 5.2. We considered two sets of experiments. First, the value of parameter γ was left unknown to both algorithms and the value of r was set to log(T).
This choice is motivated by the discussion following Theorem 15 since, for large values of T , we can expect to achieve logarithmic regret. The first two plots (from left to right) in Figure 5.2 depict these results. The apparent stationarity in the regret of PFSr is just a consequence of the scale of the plots as the regret is in fact growing as log(T ). For the second set of experiments, we allowed access to the parameter γ to both algorithms. The value of r was chosen optimallyp based on the results of Theorem √the parameter β of monotone p 15 and was set to 1 − 1/ T Tγ to ensure regret in O( T Tγ + T ). It is worth noting that even though our algorithm was designed under the assumption of some knowledge about the value of γ, the experimental results show that an exponentially better performance over the monotone algorithm is still attainable and in fact the performances of the optimized and unoptimized versions of our algorithm are comparable. A more comprehensive series of experiments is presented in Appendix D.2. 5.7 Conclusion We presented a detailed analysis of revenue optimization algorithms against strategic buyers. In doing so, we reduced the gap between upper and lower bounds on strategic regret to a logarithmic factor. Furthermore, the algorithm we presented is simple to analyze and reduces to the truthful scenario in the limit of γ → 0, an important property that previous algorithms did not admit. We believe that our analysis helps gain a deeper understanding of this problem and that it can serve as a tool for studying more complex scenarios such as that of strategic behavior in repeated second-price auctions, VCG auctions and general market strategies. 118 Chapter 6 Conclusion We have provided an extensive theoretical analysis of two important problems in machine learning. Our results have inspired new and exciting research questions. Indeed, not only have we given tight learning guarantees for the problems of domain adaptation and drifting, but we have introduced two novel divergence metrics: the Y-discrepancy and the generalized discrepancy, which have been shown to be crucial in other related adaptation tasks such as that of Germain et al. (2013). We also believe that the use of Y-discrepancy can be of pivotal importance in the analysis of active learning too. Indeed, as pointed out in (Dasgupta, 2011), a secondary effect of actively selecting points for learning is that of biasing the data used for training, thereby making active learning an instance of the sample bias correction problem. It is thus natural to ask whether the notions of discrepancy can be used to analyze this challenging learning scenario. In this thesis, we also provided an original analysis for learning in auctions when features are used. Indeed, whereas learning tools had been used in the past to study this problem, previous work had never considered the problem of learning in auctions with features, which we have shown can drastically increase the revenue of an auction in practice. Furthermore, we provided a first convergence analysis of empirical BayesNash equilibria which can help better understand the behavior of buyers in generalized second-price auctions. Finally, we have carefully analyzed the effects of dealing with strategic buyers with fixed valuations and recently have extended these results to buyers with random valuations (Mohri and Medina, 2015) . 
It is worth mentioning that our work has in fact has inspired new connections between the auction theory and learning communities (Cole and Roughgarden, 2014; Morgenster and Roughgarden, 2015). Furthermore, the work presented in this thesis represents only the basis for what we believe will be a series of exciting future research challenges. Indeed, several questions can be derived from this thesis such as: can we analyze a more general scenario where the seller does not hold a monopoly and where he competes with other sellers to retain costumers? Similarly, what happens when there 119 exists more than one buyer? Do other participants force strategic buyers to bid truthfully? Finally, our model makes strong assumptions about the knowledge of the buyer, what would happen if both (buyer and seller), must learn the environment at the same time? These and other similar challenging questions could for the basis for a theory of learning in auctioning that we believe can form a rich branch of computer science and mathematics. 120 Appendix A Appendix to Chapter 1 A.1 SDP Formulation Lemma 8. The Lagrangian dual of the problem max a∈Rm kKs a−yk2 ≤r2 1 kKst ak2 − b> Kt Kst a, 2 (A.1) is given by min γ η≥0,γ s. t. 2 − 12 K> st Kst + ηKs 1 > b Kt Kst 2 − ηy> Ks 1 > K Kb 2 st t 2 − ηKs y η(kyk − r2 ) + γ ! 0. Furthermore, the duality gap for these problems is zero. Proof. For η ≥ 0 the Lagrangian of (A.1) is given by 1 L(a, η) = kKst ak2 − b> Kt Kst a − η(kKs a − yk2 − r2 ) 2 1 2 > > 2 2 = a> K> K − ηK st s a + (2ηKs y − Kst Kt b) a − η(kyk − r ). 2 st Since the Lagrangian is a quadratic function of a and that the conjugate function of a quadratic can be expressed in terms of the pseudo-inverse, the dual is given by † 1 1 > > 2 > 2 2 min (2ηKs y − K> K b) ηK − K K st (2ηKs y − Kst Kt b) − η(kyk − r ) st t s η≥0 4 2 st 1 s. t. ηK2s − K> Kst 0. 2 st 121 Introducing the variable γ to replace the objective function yields the equivalent problem min γ η≥0,γ 1 Kst 0 s. t. ηK2s − K> 2 st † 1 > 1 2 > 2 2 > − ηK K b) K γ − (2ηKs y−K> K st (2ηKs y−Kst Kt b)+η(kyk −r ) ≥ 0 s st t 4 2 st Finally, by the properties of the Schur complement (Boyd and Vandenberghe, 2004), the two constraints above are equivalent to ! 1 > − 1 K> K + ηK2s K K b − ηKs y > 2 st t 2 st st 0. 1 > K K b − ηKs y η(kyk2 − r) + γ 2 st t Since duality holds for a general QCQP with only one constraint (Boyd and Vandenberghe, 2004)[Appendix B], the duality gap between these problems is 0. Proposition 23. The optimization problem (1.23) is equivalent to the following SDP: max α,β,ν,Z,z s. t 1 Tr(K> st Kst Z) − β − α 2 e νK2 + 1 K> Kst − 1 K s 2 st > νy Ks + Z z 0 z> 1 4 1 >e z K 4 ∧ e νKs y + 14 Kz α + ν(kyk2 − r2 ) ! λKt + K2t 1 KK z 2 t st 1 > > z Kst Kt 2 2 2 β Tr(K2s Z) − 2y> Ks z + kyk ≤ r ∧ 0 ! 0 ν ≥ 0, e = K> Kt (λKt + K2 )† Kt Kst . where K st t Proof. By Lemma 3, we may rewrite (1.23) as 1 > > min b> (λKt + K2t )b + a> K> st Kst a − a Kst Kt b + γ a,γ,η,b 2 ! 1 > 1 > − 2 Kst Kst + ηK2s K K b − ηK y t s st 2 s. t. ∧ 1 > > b K K − ηy K η(kyk2 − r2 ) + γ t st s 2 (A.2) η≥0 kKs a − yk2 ≤ r2 . Let us apply the change of variables b = 21 (λKt + K2t )† Kt Kst a + v. The following 122 equalities can be easily verified. 1 2 † b> (λKt + K2t )b = a> K> st Kt (λKt + Kt ) Kt Kst a 4 + v> Kt Kst a + v> (λKt + K2t )v 1 > > 2 † > a> K> st Kt b = a Kst Kt (λKt + Kt ) Kt Kst a + v Kt Kst a. 2 Thus, replacing b on (A.2) yields 1 e K> K − K a+γ st 2 st 4 ! 1e 1 > 2 K + ηK Ka + K K v − ηK y − 12 K> st t s st s 4 2 st min v> (λKt + K2t )v + a> a,v,γ,η s. t. 
1 >e a K 4 η≥0 1 + 12 v> Kt Kst − ηy> Ks ∧ kKs a − yk2 ≤ r2 . η(kyk2 − r2 ) + γ 0 Introducing the scalar multipliers µ, ν ≥ 0 and the matrix Z z 0 z> ze, as a multiplier for the matrix constraint, we can form the Lagrangian: 1 1 e K − K a + γ − µη + ν(kKs a − yk2 − r2 ) L := v> (λKt + K2t )v + a> K> st 2 st 4 !! 1 > 1e 1 > 2 − K K + ηK Ka + K K v − ηK y Z z s s 2 st st 4 2 st t − Tr . 1 >e 1 > > 2 2 z ze a K + v K K − ηy K η(kyk − r ) + γ t st s 4 2 The KKT conditions ∂L = ∂L = 0 trivially imply ze = 1 and Tr(K2s Z) − 2y> Ks z + ∂η ∂γ kyk2 − r2 + µ = 0. These constraints on the dual variables guarantee that the primal variables η and γ will vanish from the Lagrangian, thus yielding L= 1 2 2 > 2 > > > Tr(K> st Kst Z) + ν(kyk − r ) + v (λKt + Kt )v − z Kst Kt v 2 1 > 1 e 1 e > > 2 + a νKs + Kst Kst − K a − 2νKs y + Kz a. 2 4 2 This is a quadratic function on the primal variables a and v with minimizing solutions 1 2 1 > 1 e † 1e a= νKs + Kst Kst − K 2νKs y + Kz 2 2 4 2 123 1 and v = (λKt + K2t )† Kt Kst z, 2 and optimal value equal to the objective of the Lagrangian dual: 1 1 >e 2 2 Tr(K> st Kst Z) + ν(kyk − r ) − z Kz 2 4 1 1 e > 2 1 > 1 e † 1e − 2νKs y + Kz νKs + Kst Kst − K 2νKs y + Kz . 4 2 2 4 2 As in Lemma 3, we apply the properties of the Schur complement to show that the dual is given by max α,β,ν,Z,z s. t 1 Tr(K> st Kst Z) − β − α 2 e νK2 + 1 K> Kst − 1 K s st ! e νKs y + 14 Kz 2 4 0 e νy> Ks + 14 z> K α + ν(kyk2 − r2 ) Z z 0 ∧ Tr(K2s Z) − 2y> Ks z + kyk2 ≤ r2 z> 1 1 e ∧ ν≥0 β ≥ z> Kz 4 e and using the Schur complement one more time we Finally, recalling the definition of K arrive to the final SDP formulation: max α,β,ν,Z,z s. t 1 Tr(K> st Kst Z) − β − α 2 e νK2 + 1 K> Kst − 1 K s 2 st 4 e νy> Ks + 14 z> K Z z 0 ∧ z> 1 e νKs y + 14 Kz α + ν(kyk2 − r2 ) λKt + K2t 1 KK z 2 t st 1 > > z Kst Kt 2 2 2 β Tr(K2s Z) − 2y> Ks z + kyk ≤ r A.2 ! ∧ 0 ! 0 ν ≥ 0. QP Formulation Proposition 24. Let Y = (Yij ) ∈ Rn×k be the matrix defined by Yij = n−1/2 hj (x0i ) and P y0 = (y10 , . . . , yk0 )> ∈ Rk the vector defined by yi0 = n−1 nj=1 hi (x0j )2 . Then, the dual 124 problem of (1.24) is given by 1 −1 γ γ > Kt λI + Kt Yα + max − Yα + α,γ,β 2 2 2 1 − γ > Kt K†t γ + α> y0 − β 2 1 > s.t. 1 α = , 1β ≥ −Y> γ, α ≥ 0, 2 (A.3) where 1 is the vector in Rk with all components equal to 1. Furthermore, the solution a solution (α, γ, β) of (A.3) by ∀x, h(x) = Pn h of (1.24) can be recovered from 1 1 −1 i=1 ai K(xi , x), where a = λI + 2 Kt ) (Yα + 2 γ). We will first prove a simplified version of the proposition for the case of linear hypotheses, i.e. we can represent hypotheses in H and elements of X as vectors w, x ∈ Rd respectively. Define X0 = n−1/2 (x01 , . . . , x0n ) to be the matrix whose columns are the normalized sample points from the target distribution. Let also {w1 , . . . , wk } be a sample taken from ∂H 00 and define W := (w1 , . . . , wk ) ∈ Rd×k . Under this notation, problem (1.24) may be rewritten as min λkwk2 + w∈Rd 1 1 max kX0> (w − wi )k2 + min kX0> (w − w0 )k2 0 2 i=1,...,k 2 w ∈C (A.4) Lemma 9. The Lagrange dual of problem (A.4) is given by γ > 0> X0 X0> −1 0 γ max − Yα + X λI + X Yα+ α,γ,β 2 2 2 1 > 0> 0 0> † 0 − γ X (X X ) X γ + α> y0 − β 2 1 s. t. 1> α = 1β ≥ −Y> γ α ≥ 0, 2 where Y = X0> W and yi0 = kX0> wi k2 . Proof. By applying the change of variable u = w0 − w, problem (A.4) is can be made equivalent to min w∈Rd u∈C−w 1 1 1 λkwk2 + kX0> wk2 + kX0> uk2 + max kX0> wi k2 − 2wi> X0 X0> w. 
2 2 2 i=1,...,k By making the constraints on u explicit and replacing the maximization term with the 125 variable r the above problem becomes 1 1 1 λkwk2 + kX0> wk2 + kX0> uk2 + r w,u,r,µ 2 2 2 0 > 0> s. t. 1r ≥ y − 2Y X w min 1> µ = 1 µ≥0 W µ − w = u. For α, δ ≥ 0, the Lagrangian of this problem is defined as 1 1 1 L(w, u, µ, r, α, β, δ, γ 0 ) = λkwk2 + kX0> wk2 + kX0> uk2 + r + β(1> µ − 1) 2 2 2 > 0 0 > > + α (y − 2(X Y) w − 1r) − δ µ + γ 0> (W µ − w − u). Minimizing with respect to the primal variables yields the following KKT conditions: 1> α = 1 2 X0 X0> u = γ 0 1β = δ − W > γ 0 . X0 X0> 2 λI + w = 2(X0 Y )α + γ 0 2 (A.5) (A.6) Condition (A.5) implies that the terms involving r and µ will vanish from the Lagrangian. Furthermore, the first equation in (A.6) implies that any feasible γ 0 must satisfy γ 0 = X0 γ for some γ ∈ Rn . Finally, it is immediate that γ 0> u = u> X0 X0> u 0 0> w = 2α> (X0 Y)> w + γ 0> w. Thus, at the optimal point, the and 2w> λI + X X 2 Lagrangian becomes 1 0 0> 1 − w λI + X X w − u> X0 X0> u + α> y0 − β 2 2 1 > > 0 s. t. 1 α = 1β = δ − W γ α ≥ 0 ∧ δ ≥ 0. 2 > The positivity of δ implies that 1β ≥ −W > γ 0 . Solving for w and u on (A.6) and applying the change of variable X0 γ = γ 0 we obtain the final expression for the dual problem: γ > 0> X0 X0> −1 0 γ max − Yα + X λI + X Yα+ α,γ,β 2 2 2 1 − γ > X0> (X0 X0> )† X0 γ + α> y0 − β 2 1 > s. t. 1 α = 1β ≥ −Y> γ α ≥ 0, 2 where we have used the fact that Y> γ = WX0> γ to simplify the constraints. Notice 126 also that we can recover the solution w of problem (A.4) as w = (λI + 12 X0> X0 )−1 X0 (Yα + 12 γ) Using the matrix identities X0 (λI + X0> X0 )−1 = (λI + X0 X0> )−1 X0 and X X0 (X0> X0 )† = X0> (X0 X0> )† X0 , the proof of Proposition 7 is now immediate. 0> Proposition 7. We can rewrite the dual objective of the previous lemma in terms of the Gram matrix X0> X0 alone as follows: X0> X0 −1 γ γ > 0> 0 X X λI + Yα+ max − Yα + α,γ,β 2 2 2 1 > 0> 0 0> 0 † − γ X X (X X ) γ + α> y0 − β 2 1 1β ≥ −Y> γ α ≥ 0. s. t. 1> α = 2 By replacing X0> X0 by the more general kernel matrix Kt (which corresponds to the Gram matrix in the feature space) we obtain the desired expression for the dual. Additionally, the same matrix identities that the optimal hyPn applied 0to condition (A.6) imply 1 pothesis h is given by h(x) = i=1 ai K(xi , x) where a = (λI+ 2 Kt )−1 (Yα+ γ2 ). A.3 µ-admissibility Lemma 10 (Relaxed triangle inequality). For any p ≥ 1, let Lp be the loss defined over RN by Lp (x, y) = ky − xkp for all x, y ∈ RN . Then, the following inequality holds for all x, y, z ∈ RN : Lp (x, z) ≤ 2q−1 [Lp (x, y) + Lp (y, z)]. Proof. Observe that Lp (x, z) = 2p x−y y−z + 2 2 p . For p ≥ 1, x 7→ xp is convex, thus, i 1h Lp (x, z) ≤ 2p k(x − y)kp + k(y − z)kp = 2p−1 [Lp (x, z) + Lp (y, z)], 2 which concludes the proof. Lemma 11. Assume that Lp (h(x), y) ≤ M for all x ∈ X and y ∈ Y, then Lp is µ-admissible with µ = pM p−1 . 127 Proof. Since x 7→ xp is p-Lipschitz over [0, 1] we can write 0 |L(h(x), y) − L(h (x), y)| = M p |h(x) − y| p |h0 (x) − y| p − M M p−1 0 ≤ pM |h(x) − y + y − h (x)| = pM p−1 |h(x) − h0 (x)|, which concludes the proof. Lemma 12. Let L be the Lp loss for some p ≥ 1 and let h, h0 , h00 be functions satisfying Lp (h(x), h0 (x)) ≤ M and Lp (h00 (x), h0 (x)) ≤ M for all x ∈ X , for some M ≥ 0. Then, for any distribution D over X , the following inequality holds: 1 |LD (h, h0 ) − LD (h00 , h0 )| ≤ pM p−1 [LD (h, h00 )] p . Proof. 
Proceeding as in the proof of Lemma 11, we obtain |LD (h, h0 ) − LD (h00 , h0 )| = | E Lp (h(x), h0 (x)) − Lp (h00 (x), h0 (x) | x∈D ≤ pM p−1 E |h(x) − h00 (x)| . (A.7) x∈D Since p ≥ 1, by Jensen’s inequality, we can write Ex∈D |h(x) − h00 (x)| ≤ 1/p 1 = [LD (h, h00 )] p . Ex∈D |h(x) − h00 (x)|p 128 Appendix B Appendix to Chapter 3 B.1 Contraction Lemma The following is a version of Talagrand’s contraction lemma Ledoux and Talagrand (2011). Since our definition of Rademacher complexity does not use absolute values, we give an explicit proof below. Lemma 13. Let H be a hypothesis set of functions mapping X to R and Ψ1 , . . . , Ψm , µLipschitz functions for some µ > 0. Then, for any sample S of m points x1 , . . . , xm ∈ X , the following inequality holds " # " # m m X X µ 1 E sup σi (Ψi ◦ h)(xi ) ≤ E sup σi h(xi ) m σ h∈H i=1 m σ h∈H i=1 b S (H). = µR Proof. The proof is similar to the case where the functions Ψi are all equal. Fix a sample S = (x1 , . . . , xm ). Then, we can rewrite the empirical Rademacher complexity as follows: m h h ii i X 1 h 1 E sup E E sup um−1 (h) + σm (Ψm ◦ h)(xm ) , σi (Ψi ◦ h)(xi ) = m σ h∈H i=1 m σ1 ,...,σm−1 σm h∈H P where um−1 (h) = m−1 i=1 σi (Ψi ◦ h)(xi ). Assume that the suprema can be attained and let h1 , h2 ∈ H be the hypotheses satisfying um−1 (h1 ) + Ψm (h1 (xm )) = sup um−1 (h) + Ψm (h(xm )) h∈H um−1 (h2 ) − Ψm (h2 (xm )) = sup um−1 (h) − Ψm (h(xm )). h∈H 129 When the suprema are not reached, a similar argument to what follows can be given by considering instead hypotheses that are -close to the suprema for any > 0. By definition of expectation, since σm uniform distributed over {−1, +1}, we can write h i E sup um−1 (h) + σm (Ψm ◦ h)(xm ) σm h∈H 1 1 sup um−1 (h) + (Ψm ◦ h)(xm ) + sup um−1 (h) − (Ψm ◦ h)(xm ) 2 h∈H 2 h∈H 1 1 = [um−1 (h1 ) + (Ψm ◦ h1 )(xm )] + [um−1 (h2 ) − (Ψm ◦ h2 )(xm )]. 2 2 = Let s = sgn(h1 (xm ) − h2 (xm )). Then, the previous equality implies h i E sup um−1 (h) + σm (Ψm ◦ h)(xm ) σm h∈H 1 ≤ [um−1 (h1 ) + um−1 (h2 ) + sµ(h1 (xm ) − h2 (xm ))] 2 1 1 = [um−1 (h1 ) + sµh1 (xm )] + [um−1 (h2 ) − sµh2 (xm )] 2 2 1 1 ≤ sup[um−1 (h) + sµh(xm )] + sup[um−1 (h) − sµh(xm )] 2 h∈H 2 h∈H h i = E sup um−1 (h) + σm µh(xm ) , σm h∈H where we used the µ−Lipschitzness of Ψm in the first inequality and the definition of expectation over σm for the last equality. Proceeding in the same way for all other σi ’s (i 6= m) proves the lemma. B.2 Proof of Theorem 8 Proof. We first show that the functions Ln are uniformly bounded for any b: Z M Z r 0 max L0n (0, b) , L0n (M, b) dr |Ln (r, b)| = Ln (r, b)dr ≤ 0 0 Z M ≤ Kdr = M K, 0 where the first inequality holds since, by convexity, the derivative of Ln with respect to r is an increasing function. 130 Next, we show that the sequence (Ln )n∈N is also equicontinuous. It will follow then by the theorem of Arzela-Ascoli that the sequence Ln (·, b) converges uniformly to Lc (·, b). Let r1 , r2 ∈ [0, M ], for any b ∈ [0, M ] we have |Ln (r1 , b) − Ln (r2 , b)| ≤ sup |L0n (r, b)| |r1 − r2 | r∈[0,M ] = max (|L0n (0, b)| , |L0n (M, b))|) |r1 − r2 | ≤ K|r1 − r2 |, where, again, the convexity of Ln was used for the first equality. Let Fn (r) = Eb∼D [Ln (r, b)] and F (r) = Eb∼D [Lc (r, b)]. Fn is a convex function as the expectation of a convex function. By the theorem of Arzela-Ascoli, the sequence (Fn )n admits a uniformly convergent subsequence. Furthermore, by the dominated convergence theorem, we have (Fn (r))n converges pointwise to F (r). Therefore, the uniform limit of Fn must be F . 
This implies that min F (r) = lim min Fn (r) = lim Fn (rn ) = F (r∗ ), n→+∞ r∈[0,M ] r∈[0,M ] n→+∞ where the first and third equalities follow from the uniform convergence of Fn to F . The e Furthermore, the function Lc (·, b) is last equation implies that Lc is consistent with L. convex since it is the uniform limit of convex functions. It then follows by Proposition 12 that Lc (·, b) ≡ Lc (0, b) = 0. B.3 Consistency of Lγ Lemma 14. Let H be a closed, convex subset of a linear space of functions containing 0 and let h∗γ = argminh∈H Lγ (h). Then, the following inequality holds: E x,b h i h∗γ (x)1I2 (x, b) h i 1 ∗ ≥ E hγ (x)1I3 (x, b) . γ x,b Proof. Let 0 < λ < 1. Since H is a convex set, it follows that λh∗γ ∈ H. Furthermore, by the definition of h∗γ , we must have: h i h i E Lγ (h∗γ (x), b) ≤ E Lγ (λh∗γ (x), b) . x,b x,b (B.1) If h∗γ (x) < 0, then Lγ (h∗γ (x), b) = Lγ (λh∗γ (x)) = −b(2) by definition of Lγ . If on the other hand h∗γ (x) > 0, since λh∗γ (x) < h∗γ (x), we must have that for (x, b) ∈ I1 Lγ (h∗γ (x), b) = Lγ (λh∗γ (x), b) = −b(2) too. Moreover, from the fact that Lγ ≤ 0 and Lγ (h∗γ (x), b) = 0 for (x, b) ∈ I4 it follows that Lγ (h∗γ (x), b) ≥ Lγ (λh∗γ (x), b) for 131 (x, b) ∈ I4 , and therefore the following inequality trivially holds: h i E Lγ (h∗γ (x), b)(1I1 (x, b) + 1I4 (x, b)) x,b h i ≥ E Lγ (λh∗γ (x), b)(1I1 (x, b) + 1I4 (x, b)) . (B.2) x,b Subtracting (B.2) from (B.1) we obtain h i E Lγ (h∗γ (x), b)(1I2 (x, b) + 1I3 (x, b)) x,b h i ≤ E Lγ (λh∗γ (x), b)(1I2 (x, b) + 1I3 (x, b)) . x,b Rearranging terms shows that this inequality is equivalent to h i E (Lγ (λh∗γ (x), b) − Lγ (h∗γ (x), b))1I2 (x, b) x,b h i ≥ E (Lγ (h∗γ (x), b) − Lγ (λh∗γ (x), b))1I3 (x, b) (B.3) x,b Notice that if (x, b) ∈ I2 , then Lγ (h∗γ (x), b) = −h∗γ (x). If λh∗γ (x) > b(2) too then Lγ (λh∗γ (x), b) = −λh∗γ (x). On the other hand if λh∗γ (x) ≤ b(2) then Lγ (λh∗γ (x), b) = −b(2) ≤ −λh∗γ (x). Thus E(Lγ (λh∗γ (x), b) − Lγ (h∗γ (x), b))1I2 (x, b)) ≤ (1 − λ) E(h∗γ (x)1I2 (x, b)) (B.4) This gives an upper bound for the left-hand side of inequality (B.3). We now seek to derive a lower bound on the right-hand side. To do so, we analyze two different cases: 1. λh∗γ (x) ≤ b(1) ; 2. λh∗γ (x) > b(1) . In the first case, we know that Lγ (h∗γ (x), b) = γ1 (h∗γ (x) − (1 + γ)b(1) ) > −b(1) (since h∗γ (x) > b(1) for (x, b) ∈ I3 ). Furthermore, if λh∗γ (x) ≤ b(1) , then, by definition Lγ (λh∗γ (x), b) = min(−b(2) , −λh∗γ (x)) ≤ −λh∗γ (x). Thus, we must have: Lγ (h∗γ (x), b) − Lγ (λh∗γ (x), b) > λh∗γ (x) − b(1) > (λ − 1)b(1) ≥ (λ − 1)M, (B.5) where we used the fact that h∗γ (x) > b(1) for the second inequality and the last inequality holds since λ − 1 < 0. We analyze the second case now. If λh∗γ (x) > b(1) , then for (x, b) ∈ I3 we have Lγ (h∗γ (x), b)−Lγ (λh∗γ (x), b) = γ1 (1−λ)h∗γ (x). Thus, letting ∆(x, b) = Lγ (h∗γ (x), b)− 132 Lγ (λh∗γ (x), b), we can lower bound the right-hand side of (B.3) as: h i E ∆(x, b)1I3 (x, b) x,b h i h i = E ∆(x, b)1I3 (x, b)1{λh∗γ (x)>b(1) } + E ∆(x, b)1I3 (x, b)1{λh∗γ (x)≤b(1) } x,b x,b i h i h 1−λ ∗ (1) ∗ ∗ E h (x)1I3 (x, b)1{λh∗γ (x)>b(1) } + (λ − 1)M P hγ (x) > b ≥ λhγ (x) , ≥ γ x,b γ (B.6) where we have used (B.5) to bound the second summand. Combining inequalities (B.3), (B.4) and (B.6) and dividing by (1 − λ) we obtain the bound h i E h∗γ (x)1I2 (x, b) x,b h i h i 1 ∗ ∗ (1) ∗ ≥ E hγ (x)1I3 (x, b)1{λh∗γ (x)>b(1) } − M P hγ (x) > b ≥ λhγ (x) . γ x,b Finally, taking the limit λ → 1, we obtain E x,b h i h∗γ (x)1I2 (x, b) h i 1 ∗ ≥ E hγ (x)1I3 (x, b) . 
γ x,b Taking the limit inside the expectation is justified by the bounded convergence theorem and P[h∗γ (x) > b(1) ≥ λh∗γ (x)] → 0 holds by the continuity of probability measures. Proposition 25. For any δ > 0, with probability at least 1 − δ over the choice of a sample S of size m, the following holds for all γ ∈ (0, 1] and h ∈ H: s "s # 1 1 log log2 γ log 2 δ Lγ (h) ≤ Lbγ (h) + Rm (H) + M + . γ m 2m Proof. Consider two sequences (γk )k≥1 and (k )k≥1 , with k ∈ (0, 1). By theorem 10, for any fixed k ≥ 1, 2 b P Lγk (h) − Lγk (h) > Rm (H) + M k ≤ exp(−2m2k ). γk 133 Choose k = + q log k , m then, by the union bound, 1 P ∃k : Lγk (h) − Lbγk (h) > Rm (H) + M k γk ≤ ≤ = X k≥1 p exp − 2m( + (log k)/m)2 X k≥1 2 1/k 2 exp(−2m2 ) π exp(−2m2 ) ≤ 2 exp(−2m2 ). 6 For any γ ∈ (0, 1], there exists kp≥ 1 such that γp ∈ (γk , γk−1 ) with γk =p1/2k . For such γ 1 1 a k, γk−1 ≤ γ , γk−1 ≤ 2 , and log(k − 1) = log log2 (1/γk−1 ) ≤ log log2 (1/γ). Since for any h ∈ H, Lγk−1 (h) ≤ Lγ (h), we can write 2 b P L(h) − Lγ (h) > Rm (H) + M K(γ) + ≤ exp(−2m2 ), γ q log log2 γ1 . This concludes the proof. where K(γ) = m Corollary 10. Let H be a hypothesis set with pseudo-dimension d = Pdim(H). Then, for any δ > 0 and any γ > 0, with probability at least 1 − δ over the choice of a sample S of size m, the following inequality holds: s s " r # 1 m 2 log log 2d log log 2γ + 2 2 γ d δ Rm (H) + γM + M 2 +2 + L(b hγ ) ≤ L∗ + . γ m 2m m The proof follows the same steps as Theorem 11 and uses the results of Proposition 25. Notice that, by setting γ = m11/4 , we can guarantee the convergence of L(b hγ ) to L∗ . Indeed, with this choice, the bound can be expressed as follows: 1 L(b hγ ) ≤ L∗ + (2 + m1/4 )Rm (H) + 1/4 M m s " r # r m 2d log d log 2δ log log2 m1/4 +M 2 +2 + . m 2m m Furthermore, when H has finite pseudo-dimension, it is known that <m (H) is in O m11/2 . This shows that L(b hγ ) = L∗ + O m11/4 . 134 B.4 Proof of Proposition 13 Proof. From the definition of v-function, it is immediate that Vi is differentiable every(1) (2) (2) (1) (3) (1) where except at the three points ni = bi , ni = bi and ni = (1 + η)bi . Let r∗ (j) be a minimizer of F . If r∗ 6= ni for every j ∈ {1, 2, 3} and i ∈ {1, . . . , m}, then F (j) (j) must be differentiable at r∗ and F 0 (r∗ ) = 0. Now, let n∗ = max{ni |ni < r∗ }. Since F is a linear function over the interval (n∗ , r∗ ], we must have F 0 (r) = F 0 (r∗ ) = 0 for every r ∈ (n∗ , r∗ ]. Thus, F reduces to a constant over this interval and continuity of F implies that F (n∗ ) = F (r∗ ). (1) We conclude the proof by showing that n∗ is equal to bi for some i. Suppose this (1) is not the case and let U be an open interval around n∗ satisfying bi ∈ / U for all i. It (1) is not hard to verify that Vi is a concave function over every interval not containing bi . In particular Vi is concave over U for any i and, as a sum of concave functions, F is concave too over the interval U . Moreover, by definition, n∗ minimizes F restricted to U . This implies that F is constant over U as a non-constant concave function cannot (1) reach its minimum over an open set. Finally, let b∗ = argmini |n∗ − bi |. Since U was an arbitrary open interval, it follows that there exists r arbitrarily close to b∗ such that F (r) = F (n∗ ). By the continuity of F , we must then have F (b∗ ) = F (n∗ ). 135 Appendix C Appendix to Chapter 4 C.1 The Envelope Theorem The envelope theorem is a well known result in applied mathematics characterizing the maximum of a parametrized family of functions. 
A general version of this theorem is due to Milgrom and Segal (2002) and we include its proof here for completeness. We will let X be an arbitrary space and we will consider a function f : X × [0, 1] → R. We define the envelope function V and the set valued function X ∗ as V (t) = sup f (x, t) and x∈X X ∗ (t) = {x ∈ X|f (x, t) = V (t)}. We show a plot of the envelope function in figure C.1. Theorem 18 (Envelope Theorem). Let f be an absolutely continuous function for every x ∈ X. Suppose also that there exists an integrable function b : [0, 1] → R+ such (x, t) ≤ b(t) almost everywhere in t. Then V is absolutely that for every x ∈ X, df dt continuous. If in addition f (x, ·) is differentiable for all x ∈ X, X ∗ (t) 6= ∅ almost everywhere on [0, 1] and x∗ (t) denotes an arbitrary element in X ∗ (t), then Z t df ∗ V (t) = V (0) + (x (s), s)ds. 0 dt Proof. By definition of V , for any t0 , t00 ∈ [0, 1] we have |V (t00 ) − V (t0 )| ≤ sup |f (x, t00 ) − f (x, t0 )| x∈X = sup x∈X Z t00 t0 df (x, s) ≤ dt Z t00 b(t)dt. t0 This easily implies that V (t) is absolutely continuous. Therefore, V is differentiable 136 0.8 0.6 0.4 f(x3, t) 0.2 V(t) f(x2, t) 0.0 f(x1, t) 0.0 0.2 0.4 0.6 0.8 1.0 Figure C.1: Depiction of the envelope function Rt almost everywhere and V (t) = V (0) + 0 V 0 (s)ds. Finally, if f (x, t) is differentiable in t then we know that V 0 (t) = df (x∗ (t), t) for any x∗ (t) ∈ X ∗ (t) whenever V 0 (t) exists dt and the result follows. C.2 Elementary Calculations We present elementary results of Calculus that will be used throughout the rest of this Appendix. Lemma 15. The following equality holds for any k ∈ N: ∆Fki = and k k−1 ik−2 1 F + O 2 , n i−1 nk−2 n 1 k . ∆Gki = − Gk−1 + O n i−1 n2 Proof. The result follows from a straightforward application of Taylor’s theorem to the function h(x) = xk . Notice that Fki = h(i/n), therefore: i − 1 1 i − 1 ∆Fki = h + −h n n n 1 0 i−1 1 00 =h + h (ζi ) 2 n n 2n k k−1 1 00 = Fi−1 + h (ζi ) 2 , n 2n 137 for some ζi ∈ [(i − 1)/n, i/n]. Since h00 (x) = k(k − 1)xk−2 , it follows that the last term in the previous expression is in (i/n)k−2 O(1/n2 ). The second equality can be similarly proved. Proposition 26. Let a, b ∈ R and N ≥ 1 be an integer, then N j N −j X N ab j=0 j j+1 Proof. The proof relies on the fact that then equal to 1 a Z 0 = aj j+1 (a + b)N +1 − bN +1 a(N + 1) = N aX j=0 1 a Ra 0 (C.1) tj dt. The left hand side of (C.1) is Z N j N −j 1 a tb dt = (t + b)N dt j a 0 = (a + b)N +1 − bN +1 . a(N + 1) Lemma 16. If the sequence ai ≥ 0 satisfies ai ≤ A + B ai ≤ δ ∀i ≤ r aj ∀i > r. i−1 X j=1 Then ai ≤ (A + rδB)(1 + B)i−r−1 ≤ (A + rδB)eB(i−r−1) ∀i > r. This lemma is well known in the numerical analysis community and we include the proof here for completeness. Proof. We proceed by induction on i. The base of our induction is given by i = r + 1 and it can be trivially verified. Indeed, by assumption ar+1 ≤ A + rδB. Let us assume that the proposition holds for values less than i and let us try to show it 138 also holds for i. ai ≤ A + B r X aj + B j=1 i−1 X aj j=r+1 ≤ A + rBδ + B i−1 X (A + rBδ)(1 + B)j−r−1 j=r+1 = A + rBδ + (A + rBδ)B i−r−2 X (1 + B)j j=0 = A + rBδ + (A + rBδ)B (1 + B)i−r−1 − 1 B = (A + rBδ)(1 + B)i−r−1 . Lemma 17. Let W0 : [e, ∞) → R denote the main branch of the Lambert function, i.e. W0 (x)eW0 (x) = x. The following inequality holds for every x ∈ [e, ∞). log(x) ≥ W0 (x). Proof. By definition of W0 we see that W0 (e) = 1. Moreover, W0 is an increasing function. 
Therefore for any x ∈ [e, ∞) W0 (x) ≥ 1 ⇒W0 (x)x ≥ x ⇒W0 (x)x ≥ W0 (x)eW0 (x) ⇒x ≥ eW0 (x) . The result follows by taking logarithms on both sides of the last inequality. C.3 Proof of Proposition 18 b For Here, we derive the linear equation that must be satisfied by the bidding function β. the most part, we adapt the analysis of Gomes and Sweeney (2014) to a discrete setting. Proposition 17. In a symmetric efficient equilibrium of the discrete GSP, the probability zbs (v) that an advertiser with valuation v is assigned to slot s is given by zbs (v) = N −s X s−1 X j=0 k=0 Fji−1 Gki N −1 . j, k, N −1−j −k (N − j − k)nN −1−j−k 139 if v = vi and by N −1 zbs (v) = lim− Fb(v 0 )p (1 − Fb(v))s−1 =: zbs− (v), 0 v →v s−1 where p = N − s. Proof. Since advertisers play an efficient equilibrium, these probabilities depend only on the advertisers’ valuations. Let Aj,k (s, v) denote the event that j buyers have a valuation lower than v, k of them have a valuation higher than v and N − 1 − j − k a valuation exactly equal to v. Then, the probability of assigning s to an advertiser with value v is given by N −s X s−1 X 1 P(Aj,k (s, v)). (C.2) N − i − j j=0 k=0 1 The factor N −i−j appears due to the fact that the slot is randomly assigned in the case of a tie. When v = vi , this probability is easily seen to be: j Fi−1 Gki N −1 . j, k, N −1−j −k nN −1−j−k On the other hand, if v ∈ (vi−1 , vi ) the event Aj,k (s, v) happens with probability zero unless j = N − s and k = s − 1. Therefore, (C.2) simplifies to N − 1 N −1 b p s−1 F (v) (1 − Fb(v)) = lim Fb(v 0 )p (1 − Fb(v))s−1 . s − 1 v0 →v− s−1 Proposition 27. Let E[P P E (v)] denote the expected payoff of an advertiser with value v at equilibrium. Then E[P P E (vi )] = S h i X X cs zbs (vi )vi − zbs− (vi )(vi − vi−1 ) . s=1 j=1 Proof. By the revelation principle (Gibbons, 1992), there exists a truth revealing mechanism with the same expected payoff function as the GSP with bidders playing an equilibrium. For this mechanism, we then must have v ∈ arg max v∈[0,1] S X s=1 cs zbs (v)v − E[P P E (v)]. 140 By the envelope theorem (see Theorem 18), we have S X s=1 cs zbs (vi )vi − E[P PE (vi )] = − E[P PE (0)] + S Z X 0 s=1 vi zbs (t)dt. Since the expected payoff of an advertiser with valuation 0 should be zero too, we see that Z vi PE E[P (vi )] = cs zbs (vi )vi − zbs (t)dt. 0 Using the fact that zbs (t) ≡ zbs− (vi ) for t ∈ (vi−1 , vi ) we obtain the desired expression. Proposition 18. If the discrete GSP auction admits a symmetric efficient equilibrium, b i ) = βi , where β is the solution of the following then its bidding function βb satisfies β(v linear equation: Mβ = u. PS where M = s=1 cs M(s) and ui = S X s=1 cs zs (vi )vi − i X j=1 zbs− (vj )∆vj . Proof. Let E[P β (vi )] denote the expected payoff of an advertiser with value vi when b Let A(s, vi , vj ) denote the event that an all advertisers play the bidding function β. advertiser with value vi gets assigned slot s and the s-th highest valuation among the remaining N −1 advertisers is vj . If the event A(s, vi , vj ) takes place, then the advertiser has a expected payoff of cs β(vj ). Thus, b E[P β (vi )] = S X s=1 cs i X β(vj ) P(A(s, vi , vj )). j=1 In order for event A(s, vi , vj ) to occur for i 6= j, N − s advertisers must have valuations less than or equal to vj with equality holding for at least one advertiser. Also, the valuation of s − 1 advertisers must be greater than or equal to vi . 
Keeping in mind that a slot is assigned randomly in the event of a tie, we see that A(s, vi , vj ) occurs with 141 probability NX −s−1 X s−1 N −1 s−1 N −s Flj−1 Gs−1−k i N −s−l n (k + 1)nk s−1 k l l=0 k=0 N −s−1 s−1 N −1 X N − s Flj−1 X s − 1 Gs−1−k i = N −s−l s−1 l n k (k + 1)nk l=0 k=0 n Gs − Gs N −1 1 N −s i−1 i N −s = Fj−1 + − Fj−1 s−1 n s N − 1 n∆Fj ∆Gi , =− s s−1 where the second equality follows from an application of the binomial theorem and Proposition 26. On the other hand if i = j this probability is given by: NX −s−1 X s−1 j=0 k=0 Fji−1 Gki N −1 j, k, N − 1 − j − k (N − j − k)nN −1−j−k It is now clear that M(s)i,j = P(A(s, vi , vj )) for i ≥ j. Finally, given that in equilibrium b the equality E[P P E (v)] = E[P β (v)] must hold, by Proposition 27, we see that β must satisfy equation (4.10). We conclude this section with a simpler expression for Mii (s). By adding and subtracting the term j = N − s in the expression defining Mii (s) we obtain s−1 X Fpi−1 Gki N −1 Mii (s) = zbs (vi ) − N −s, k, s−1−k (s − k)ns−1−k k=0 s−1 Fpi−1 Gki N −1 X s−1 = zbs (vi ) − s − 1 k=1 k (s − k)ns−1−k N − 1 p n∆Gi = zbs (vi ) + Fi−1 , s s−1 (C.3) where again we used Proposition 26 for the last equality. C.4 High Probability Bounds In order to improve the readability of our proofs we use a fixed variable C to refer to a universal constant even though this constant may be different in different lines of a proof. 142 Theorem 19. (Glivenko-Cantelli Theorem) Let v1 , . . . , vn be an i.i.d. sample drawn from a distribution F . If Fb denotes the empirical distribution function induced by this sample, then with probability at least 1 − δ for all v ∈ R r log(1/δ) . |Fb(v) − F (v)| ≤ C n Proposition 28. Let X1 , . . . , Xn be an i.i.d sample from a distribution F supported in [0, 1]. Suppose F admits a density f and assume f (x) > c for all x ∈ [0, 1]. If X (1) , . . . , X (n) denote the order statistics of a sample of size n and we let X (0) = 0, then 3 P( max X (i) − X (i−1) > ) ≤ e−cn/2 . i∈{1,...,n} In particular, with probability at least 1 − δ: max X (i) − X (i−1) ≤ i∈{1,...,n} where q(n, δ) = 2c log nc 2δ 1 q(n, δ), n (C.4) . Proof. Divide the interval [0, 1] into k ≤ d2/e sub-intervals of size 2 . Denote this subintervals by I1 , . . . , Ik , with Ij = [aj , bj ] . If there exists i such that X (i) − X (i−1) > then at least one of these sub-intervals must not contain any samples. Therefore: P( max X (i) − X (i−1) > ) ≤ P(∃ j s.t Xi ∈ / Ij ∀i) i∈{1,...,n} d2/e ≤ X j=1 P(Xi ∈ / Ij ∀i). Using the fact that the sample is i.i.d. and that F (bk ) − F (ak ) ≥ minx∈[ak ,bk ] f (x)(bk − ak ) ≥ c(bk − ak ), we may bound the last term by 2 + 3 (1 − (F (bk ) − F (ak )))n ≤ (1 − c(bk − ak ))n 3 −cn/2 ≤ e . 2 The equation 3 e−cn/2 = δ implies = nc W0 ( 3nc ), where W0 denotes the main branch 2δ of the Lambert function (the inverse of the function x 7→ xex ). By Lemma 17, for x ∈ [e, ∞) we have log(x) ≥ W0 (x). (C.5) 143 Therefore, with probability at least 1 − δ max X (i) − X (i−1) ≤ i∈{1,...,n} 3cn 2 log . nc 2δ The following estimates will be used in the proof of Theorem 14. √ Lemma 18. Let p ≥ 1 be an integer. If i > n , then for any t ∈ [vi−1 , vi ] the following inequality is satisfied with probability at least 1 − δ: p |F (v) − Fpi−1 | ip−1 log(2/δ) √ ≤ C p−1 n n p−1 2 q(n, 2/δ) Proof. 
The left hand side of the above inequality may be decomposed as |F p (v) − Fpi−1 | ≤ |F p (v) − F p (vi−1 )| + |F p (vi−1 ) − Fpi−1 | p−1 ≤ p|F (ζi )p−1 f (ζi )|(vi − vi−1 ) + pFi−1 (F (vi−1 ) − Fi−1 ) r q(n, 2δ ) ip−1 log(2/δ) F (ζi )p−1 + C p−1 , ≤C n n n for some ζi ∈ (vi−1 , vi ). The second inequality follows from Taylor’s theorem and we have used Glivenko-Cantelli’s q theorem and √ Proposition 28 for the last inequality. √ √ log 2/δ(i+ n) log 2/δ ≤ C . Finally, since i ≥ n it Moreover, we know F (vi ) ≤ Fi + n n follows that ip−1 F (ζi )p−1 ≤ F (vi )p−1 ≤ C p−1 log(2/δ)(p−1)/2 . n Replacing this term in our original bound yields the result. Proposition 29. Let ψ : [0, 1] → R be a twice continuously differentiable function. With √ probability at least 1 − δ the following bound holds for all i > n Z 0 vi p F (t)dt − i−1 X j=1 Fpj−1 ∆vj ≤ C ip log(2/δ)p/2 √ q(n, δ/2)2 . np n and Z 0 vi ψ(t)pF p−1 (t)f (t)dt − i−2 X j=1 ψ(vj )∆Fjp ≤ C 144 ip log(2/δ)p/2 √ q(n, δ/2)2 . np n Proof. By splitting the integral along the intervals [vj−1 , vj ] we obtain Z vi p F (t)dt− 0 i−1 X Fpj−1 ∆vj i−1 Z X ≤ j=1 j=1 vj vj−1 F p (t)−Fpj−1 dt +F p (vi )(vi −vi−1 ) (C.6) By Lemma 18, for t ∈ [vj−1 , vj ] we have: p |F (t) − Fpj−1 | j p−1 log(2/δ) √ ≤ C p−1 n n p−1 2 q(n, δ/2). Using the same argument of Lemma 18 we see that for i ≥ iplog(2/δ) p p F (vi ) ≤ C n √ n Therefore we may bound (C.6) by ip−1 log(2/δ) √ C p−1 n n p−1 2 p i(vi − vi−1 ) log(2/δ) q(n, δ/2) vj + . n j=1 i−1 X We can again use Proposition 28 to bound the sum by ni q(n, δ/2) and the result follows. In order to proof the second bound we first do integration by parts to obtain Z vi Z vi p−1 p ψ 0 (t)F p (t)dt. ψ(t)pF f (t)dt = ψ(vi )F (vi ) − 0 0 Similarly i−2 X ψ(vj )∆Fpj = ψ(vi−2 )Fpi−2 j=1 − i−2 X j=1 Fpj ψ(vj ) − ψ(vj−1 ) . Using the fact that ψ is twice continuously differentiable, we can recover the desired bound by following similar steps as before. Proposition 30. With probability at least 1 − δ the following inequality holds for all i r 2 s ∆ G log(1/δ) i (s − 1)G(vi )s−2 − n2 ≤C . (C.7) s n Proof. By Lemma 15 we know that n2 1 ∆2 Gsi = (s − 1)Gs−2 + O i s n 145 Therefore the left hand side of (C.7) can be bounded by (s − 1)|G(vi )s−2 − Gs−2 |+ i C . n The result now follows from Glivenko-Cantelli’s theorem. Proposition 31. With probability at least 1 − δ the following bound holds for all i p−2 N −1 ip−2 (log(2/δ)) 2 s−1 p √ pG(vi ) F (vi ) − 2nMii (s) ≤ C p−2 q(n, δ/2). n s−1 n Proof. By analyzing the sum defining Mii (s) we see that all terms with exception of p−2 the term given by j = N − s − 1 and k = s − 1 have a factor of ni p−2 n12 . Therefore, ip−2 1 1 N −1 p−1 s−1 pFi−1 Gi + p−2 O 2 . (C.8) Mii (s) = 2n s−1 n n Furthermore, by Theorem 19 we have |Gs−1 − G(vi )s−1 | ≤ C i r log(2/δ) . n (C.9) Similarly, by Lemma 18 |Fp−1 i−1 p−1 − F (vi ) ip−2 (log(2/δ)) √ |C ≤ p−2 n n p−2 2 q(n, δ/2). (C.10) From equation (C.8) and inequalities (C.9) and (C.10) we can thus infer that pG(vi )s−1 F (vi )p − 2nMii (s) ip−2 1 p−1 p−1 s−1 s−1 s−1 p−1 ≤ C pFi−1 |G(vi ) − Gi | + G(vi ) p|F (vi ) − Fi−1 | + C p−2 2 n n r p−2 p−2 i i log(2/δ) (log(2/δ)) 2 1 √ ≤ C p−2 + q(n, δ/2) + 2 n n n n n The desired bound follows trivially from the last inequality. C.5 Solution Properties A standard way to solve a Volterra equation of the first kind is to differentiate the equation and transform it into an equation of the second kind. As mentioned before this may 146 only be done if the kernel defining the equation satisfies K(t, t) ≥ c > 0 for all t. 
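The following minimal sketch (with a hypothetical kernel and solution, chosen only to keep the example self-contained) illustrates this device on a discretized first-kind equation: subtracting consecutive rows is the discrete analogue of differentiation, it leaves the solution unchanged, and the resulting triangular system can be solved by forward substitution because the diagonal entries h K(t_i, t_i) stay bounded away from zero relative to h.

import numpy as np
from scipy.linalg import solve_triangular

def kernel(t, s):
    # Hypothetical kernel with K(t, t) = 1 >= c > 0; illustrative only.
    return 1.0 + t - s

n = 400
h = 1.0 / n
grid = h * np.arange(1, n + 1)

# Rectangle-rule discretization of the first-kind equation
#   sum_{j <= i} h * K(t_i, t_j) x_j = y_i.
K = np.tril(kernel(grid[:, None], grid[None, :])) * h
x_true = np.sin(grid)          # hypothetical solution used to manufacture y
y = K @ x_true

# Discrete differentiation: subtract row i-1 from row i (and the same for y).
# Row operations preserve the solution, and the diagonal of dK is h * K(t_i, t_i).
dK = K.copy()
dK[1:] -= K[:-1]
dy = y.copy()
dy[1:] -= y[:-1]

x_rec = solve_triangular(dK, dy, lower=True)   # forward substitution
print(np.max(np.abs(x_rec - x_true)))          # tiny: y was built from the same discretization

This sketch is only meant to make the subsequent manipulations concrete; the system studied below differs in that its diagonal entries are not bounded below in this way, which is precisely the difficulty addressed in the remainder of the section.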
Here we take the discrete derivative of (4.10) and show that in spite of the fact that the new system remains ill-conditioned the solution of this equation has a particular property that allows us to show the solution β will be close to the solution β of surrogate linear system which, in turn, will also be close to the true bidding function β. Proposition 32. The solution β of equation (4.10) also satisfies the following equation dMβ = du (C.11) where dMij = Mi,j − Mi−1,j and dui = ui − ui−1 . Furthermore, for j ≤ i − 2 S X N −1 n∆Fj ∆2 Gsi dMij = − cs s s−1 s=1 and dui = S X cs vi zbs (vi ) − zbs− (vi ) + vi−1 zbs− (vi ) − zbs (vi−1 ) . i=1 Proof. It is clear that the new equation is obtained from (4.10) by subtracting row i − 1 from row i. Therefore β must also satisfy this equation. The expression for dMij follows directly from the definition of Mij . Finally, zbs (vi )vi − i X j=1 = vi zbs (vi ) − zbs− (vj )(vj zbs− (vi ) − vj−1 ) − zbs (vi−1 )vi−1 − + zbs− (vi )vi i−1 X j=1 zbs− (vj )(vj − vj−1 ) − zbs (vi−1 )vi−1 − zbs− (vi )(vi − vi−1 ). Simplifying terms and summing over s yields the desired expression for dui . A straightforward bound on the difference |βi − β(vi )| can be obtain by bounding the following quantity: difference i X j=1 dMi,j (β(vi ) − βi ) = i X j=1 dMi,j β(vi ) − dui , (C.12) and by then solving the system of inequalities defining i = |β(vi ) − βi |. In order to do this, however, it is always assumed that the diagonal terms of the matrix satisfy mini ndMii ≥ c > 0 for all n, which in view of (C.8) does not hold in our case. We therefore must resort to a different approach. We will first show that for values of i ≤ n3/4 the values of βi are close to vi and similarly β(vi ) will be close to vi . Therefore for i ≤ n3/4 we can show that the difference |β(vi ) − βi | is small. We will see that the 147 analysis for i n3/4 is in fact more complicated; yet, by a clever manipulation of the system (4.10) we are able to obtain the desired bound. Proposition 33. If cS > 0 then there exists a constant C > 0 such that: i N −S−1 1 cs Mii (s) ≥ C n 2n s=1 S X Proof. By definition of Mii (s) it is immediate that cs N − 1 s−1 cs Mii (s) ≥ pFp−1 i−1 (Gi ) 2n s − 1 1 i − 1 p−1 i s−1 = Cs 1− , 2n n n with CS = cs p N−1 . The sum can thus be lower bounded as follows s−1 S X s=1 cs M(s)ii ≥ When C1 i−1 n i − 1 N −2 i − 1 N −S−1 i S−1 1 max C1 , CS 1− 2n n n n N −2 ≥ CS i−1 n N −S−1 1− i n S−1 (C.13) , we have K i−1 ≥ 1 − ni , with n K = (C1 /CS )1/(S−1) . Which holds if and only if i > n+K . In this case the max term of K+1 (C.13) is easily seen to be lower bounded by C1 (K/K + 1)N −2 . On the other hand, if N −S−1 n+K s−1 i i < K+1 then we can lower bound this term by CS (K/K + 1) . The result n follows immediately from these observations. Proposition 34. For all i and s the following inequality holds: |dMii (s) − dMi,i−1 (s)| ≤ C ip−2 1 . np−2 n2 Proof. From equation (C.8) we see that |dMii (s) − dMi,i−1 (s)| = |Mii (s) + Mi−1,i−1 (s) − Mi,i−1 (s)| 1 1 ≤ Mii (s) − Mi,i−1 (s) + Mi−1,i−1 (s) − Mi,i−1 (s) 2 2 p−1 s−1 p s ∆Fi−1 n∆Gi N −1 1 pFi−1 Gi ≤ − 2 n s s−1 p−1 s−1 p n∆Fi−1 ∆Gsi 1 pFi−2 Gi−1 ip−2 1 + − + C p−2 2 , 2 n s n n A repeated application of Lemma 15 yields the desired result. 148 Lemma 19. The following holds for every s and every i N −1 p n∆Gsi − Fi−1 + Gs−1 zbs (vi ) − zbs (vi ) = Mii (s) − i−1 s s−1 and zbs− (vi ) N −1 p ∆2 Gsi − zbs (vi−1 ) = M(s)i,i−1 − M(s)i−1,i−1 − n Fi−2 s−1 s s N −1 p ∆Gi + Fi−1 Gs−1 . i−1 + n s−1 s Proof. 
From (C.3) we know that zbs (vi ) − zbs− (vi ) N −1 ∆Gsi = Mii (s) − − zbs− (vi ). nFpi−1 s−1 s By using the definition of zbs− (vi ) we can verify that the right hand side of the above equation is in fact equal to N −1 p n∆Gsi Mii (s) − Fi−1 + Gs−1 i−1 . s−1 s The second statement can be similarly proved zbs− (vi ) − zbs (vi−1 ) = zbs− (vi ) − M(s)i−1,i−1 N −1 p ∆Gsi−1 + M(s)i,i−1 − M(s)i,i−1 . (C.14) +n Fi−2 s s−1 On the other hand we have N −1 p ∆Gsi−1 n Fi−2 − M(s)i,i−1 s−1 s p p N −1 h p ∆Gsi−1 (Fi−1 − Fi−2 )∆Gsi i Fi−2 + =n s−1 s s h s 2 si N −1 ∆Gi ∆ Gi =n Fpi−1 − Fpi−2 s−1 s s 149 By replacing this expression into (C.14) and by definition of zbs− (vi ). N −1 p ∆2 Gsi − zbs (vi ) − zbs (vi−1 ) = M(s)i,i−1 − M(s)i−1,i−1 − n Fi−2 s−1 s s N −1 p ∆Gi + Fi−1 Gs−1 . i−1 + n s−1 s Corollary 11. The following equality holds for all i and s. dui = vi (b zs (vi ) − zbs− (vi )) + vi−1 (b zs− (vi ) − zbs (vi−1 )) = vi dMii (s) + vi−1 dMi,i−1 (s) + i−2 X dMij (s)vj j=1 i−2 N −1 n∆2 Gsi X p N −1 p s−1 n∆Gsi − Fj−1 ∆vj − (vi − vi−1 ) Fi−1 Gi−1 + s−1 s s−1 s j=1 Proof. From the previous proposition we know vi (b zs (vi ) − zbs− (vi )) + vi−1 (b zs− (vi ) − zbs (vi−1 )) N −1 p ∆2 Gsi = vi Mii (s) + vi−1 (M(s)i,i−1 − M(s)i−1,i−1 ) − vi−1 n Fi−2 s−1 s s n∆Gi N −1 p + (vi−1 − vi ) Fi−1 Gs−1 i−1 + s−1 s N −1 p ∆2 Gsi = vi dMii (s) + vi−1 dMi,i−1 (s) − vi−1 n Fi−2 s−1 s s N −1 p n∆Gi , − (vi − vi−1 ) Fi−1 Gs−1 i−1 + s−1 s where the last equality follows from the definition of dM. Furthermore, by doing sum- 150 mation by parts we see that N −1 p ∆2 Gsi vi−1 n Fi−2 s−1 s N −1 p n∆2 Gsi N −1 p n∆2 Gsi + (vi−1 − vi−2 ) Fi−2 = vi−2 Fi−2 s s−1 s s−1 i−2 i−2 X p N −1 n∆2 Gsi X = vj ∆Fpj + Fj−1 ∆vj s s−1 j=1 j=1 N −1 p n∆2 Gsi + (vi−1 − vi−2 ) Fi−2 s s−1 i−1 i−2 X N −1 n∆2 Gsi X p Fj−1 ∆vj , =− dMij vj + s s−1 j=1 j=1 where again we used the definition of dM in the last equality. By replacing this expression in the previous chain of equalities we obtain the desired result. Corollary 12. Let p denote the vector defined by S i−1 X N −1 n∆2 Gsi X p N −1 p s−1 n∆Gsi pi = cs Fj−1 ∆vj + cs ∆vi . Fi−1 Gi−1 + s s s−1 s−1 s=1 j=1 If ψ = v − β, then ψ solves the following system of equations: dMψ = p. (C.15) Proof. It is immediate by replacing the expression for dui from the previous corollary into (C.11) and rearranging terms. We can now present the main result of this section. Proposition 35. Under Assumption 2, with probability at least 1 − δ, the solution ψ of 2 equation (C.15) satisfies ψi ≤ C ni 2 q(n, δ). Proof. By doing forward substitution on equation (C.15) we have: dMi,i−1 ψi−1 + dMii ψi = pi + i−2 X dMij ψj j=1 i−2 n∆2 Gsi X = pi + cs ∆Fpj ψj . s s=1 j=1 S X 151 (C.16) A repeated application of Lemma 15 shows that i 1 iN −S X pi ≤ C ∆vj , n nN −S j=1 which by Proposition 28 we know it is bounded by pi ≤ C 1 iN −S−1 i2 q(n, δ). n nN −S−1 n2 Similarly for j ≤ i − 2 we have n∆2 Gsi 1 iN −S−1 1 ∆Fpj ≤ C . s n nN −S−1 n Finally, Assumption 2 implies that ψ ≥ 0 for all i and since dMi,i−1 > 0, the following inequality must hold for all i: dMii ψi ≤ dMi,i−1 ψi−1 + dMii ψi i−2 1X 1 iN −S−1 i2 q(n, δ) + ≤C ψj . n nN −S−1 n2 n j=1 N −S−1 In view of Proposition 33 we know that dMii ≥ C n1 ni N −S−1 , therefore after dividing both sides of the inequality by dMii , it follows that 1X i2 ψj . ψi ≤ C 2 q(n, δ) + n n j=1 i−2 2 Applying Lemma 16 with A = C ni 2 , r = 0 and B = Cn we arrive to the following inequality: i i2 i2 ψi ≤ C 2 q(n, δ)eC n ≤ C 0 2 q(n, δ). 
n n We now present an analogous result for the solution β of (4.2). Let CS = cs and define the functions N−1 s−1 Fs (v) = Cs F N −s (v) Gs (v) = G(v)s−1 . It is not hard to verify that zs (v) = Fs (v)Gs (v) and that the integral equation (4.2) is 152 given by S Z X s=1 v 0 t(Fs (t)Gs (t)) dt = 0 S X Gs (v) Z v β(t)F0s (t)dt (C.17) 0 s=1 After differentiating this equation and rearranging terms we obtain 0 = (v − β(v)) = (v − β(v)) S X s=1 S X s=1 Gs (v)F0s (v) Gs (v)F0s (v) + + S X s=1 S X s=1 G0s (v) Z v β(t)F0s (t)dt + vG0s (v)Fs (v) 0 G0s (v) Z 0 v (t − β(t))F0s (t)dt + G0s (v) Z v Fs (t)dt, 0 where the last equality follows from integration by parts. Notice that the above equation is the continuous equivalent of equation (C.15). Letting ψ(v) := v − β(v) we have that Rv Rv PS 0 0 0 s=1 Gs (v) 0 Fs (t)dt + Gs (v) 0 ψ(t)Fs (t)dt ψ(v) = − (C.18) PS 0 (v) G (v)F s s s=1 Since limv→0 Gs (v) = limv→0 G0s (v)/f (v) = 1 and limv→0 Fs (v) = 0, it is not hard to see that Rv Rv PS 0 0 G (v) F (t)dt + G (v) ψ(t)F0s (t)dt s s 0 0 ψ(0) = − lim s=1 s PS 0 v→0 s=1 Gs (v)Fs (v) P R S G0s (v) v G0s (v) R v 0 f (v) s=1 f (v) 0 Fs (t)dt + f (v) 0 ψ(t)Fs (t)dt = − lim PS 0 v→0 s=1 Gs (v)Fs (v) P R R v v S 0 f (v) s=1 0 Fs (t)dt + 0 ψ(t)Fs (t)dt = − lim PS 0 v→0 s=1 Fs (v) Since the smallest power in the definition of Fs is attained at s = S, the previous limit 153 is in fact equal to: − lim v→0 = − lim v→0 = − lim v→0 R Rv v 0 f (v) 0 FS (t)dt + 0 ψ(t)FS (t)dt F0S (v) R Rv v f (v) 0 F N −S (t)dt + 0 (N − S)ψ(t)F N −S−1 (t)f (t)dt Rv 0 (N − S)F N −S−1 (v)f (v) Rv F N −S (t)dt + 0 (N − S)ψ(t)F N −S−1 (t)f (t)dt . (N − S)F N −S−1 (v) Using L’Hopital’s rule and simplifying we arrive to the following: F 2 (v) ψ(v)F (v) + v→0 (N − S)(N − S − 1)f (v) (N − S − 1) ψ(0) = − lim Moreover, since ψ is a continuous function, it must be bounded and therefore, the previous limit is equal to 0. Using the same series of steps we also see that: R Rv v N −S N −S−1 F (t)dt + (N − S)ψ(t)F (t)f (t)dt 0 0 ψ(v) ψ 0 (0) = lim = − lim v→0 v v→0 v(N − S)F N −S−1 (v) By L’Hopital’s rule again we have the previous limit is equal to F N −S (v) + (N − S)ψ(v)F N −S−1 (v)f (v) − lim v→0 (N − S)(N − S − 1)F N −S−2 (v)f (v)v + (N − S)F N −S−1 (v) (C.19) Furthermore, notice that F N −S (v) + (N − S)ψ(v)F N −S−1 (v)f (v) v→0 (N − S)(N − S − 1)F N −S−2 (v)f (v)v ψ(v)F (v) F 2 (v) = lim + = 0. v→0 (N − S)(N − S − 1)f (v)v (N − S − 1)v lim Where for the last equality we used the fact that limv→0 Similarly, we have: F (v) v = f (0) and ψ(0) = 0. F N −S (v) + (N − S)ψ(v)F N −S−1 (v)f (v) F (v) = lim + ψ(v)f (v) = 0 N −S−1 v→0 v→0 N − S (N − S)F (v) lim Since the terms in the denominator of (C.19) are positive, the two previous limits imply that the limit given by (C.19) is in fact 0 and therefore ψ 0 (0) = 0. Thus, by Taylor’s theorem we have |ψ(v)| ≤ Cv 2 for some constant C. 154 Corollary 13. The following inequality holds with probability at least 1 − δ for all 1 i ≤ n3/4 1 |ψi − ψ(vi )| ≤ C √ q(n, δ). n Proof. Follows trivially from the bound on ψ(v), Proposition 35 and the fact that √1 . n i2 n2 ≤ Having bounded the magnitude of the error for small values of i one could use the forward substitution technique used in Proposition 35 to bound the errors i = |ψi − ψ(vi )|. Nevertheless, a crucial assumption used in Proposition 35 was the fact that ψi ≥ 0. This condition, however is not necessarily verified by i . Therefore, a straightforward forward substitution will not work. 
Instead, we leverage the fact that |dMi,i−1 ψi−1 − dMi,i ψi | is in O n12 and show that the solution ψ of a surrogate linear equation is close to both ψ and ψ implying that ψi and ψ(vi ) will be close too. Therefore let dM0 denote the lower triangular matrix with dM0i,j = dMi,j for j ≤ i − 2, dM0i,i−1 = 0 and dM0ii = 2dMii . Thus, we are effectively removing the problematic term dMi,i−1 in the analysis made by forward substitution. The following proposition quantifies the effect of approximating the original system with the new matrix dM0 . Proposition 36. Let ψ be the solution to the system of equations dM0 ψ = p. Then, for all i ∈ {1, . . . , n} it is true that 1 q(n, δ) C |ψi − ψ| ≤ √ + 3/2 e . n n 2 Proof. We can show, in the same way as in Proposition 35, that ψi ≤ C ni 2 q(n, δ) with probability at least 1 − δ for all i. In particular, for i < n3/4 it is true that 1 |ψi − ψ i | ≤ C √ q(n, δ). n On the other hand by forward substitution we have dM0ii ψ i = pi − i−1 X dMii ψi = pi − i−1 X and dM0ij ψ j j=1 155 j=1 dMij ψj . By using the definition of dM0 we see the above equations hold if and only if 2dMii ψ i = pi − i−2 X dMij ψ j j=1 2dMii ψi = dMii ψi + pi − dMi,i−1 − i−2 X dMij ψj . j=1 Taking the difference of these two equations yields a recurrence relation for the quantity ei = ψi − ψ i . 2dMii ei = dMii ψi − dMi,i−1 ψi−1 − i−2 X dMij ej . j=1 Furthermore we can bound dMii ψi − dMi,i−1 ψi−1 as follows: |dMii ψi − dMi,i−1 ψi−1 | ≤ |dMii − dMi,i−1 |ψi−1 + |ψi − ψi−1 |dMii . ip q(n, δ) C + √ dMii . ≤C p 2 n n n Where the last inequality follows from Assumption 2 and Proposition 34 as well as from 2 the fact that ψi ≤ ni 2 q(n, δ). Finally, using the same bound on dMij as in Proposition 35 gives us |ei | ≤ C q(n, δ) ni 1 1X +√ + ei n n j=1 i−2 i−2 CX 1 ei . ≤ C√ + n n j=1 Applying Lemma 16 with A = C √ , n B= C n and r = n3/4 we obtain the final bound 1 q(n, δ) C |ψi − ψ i | ≤ √ + 3/2 e . n n 156 C.6 Proof of Theorem 14 b the vector Proposition 37. Let ψ(v) denote the solution of (C.18) and denote by ψ bi = ψ(vi ). Then, with probability at least 1 − δ defined by ψ N −S i log(2/δ) b √ max √ n|(dM ψ)i − pi | ≤ C N −S n n i> n 0 N/2 q(n, δ/2)3 . (C.20) b i − pi Proof. By definition of dM0 and pi we can decompose the difference n (dM0 ψ) as: S X s=1 cs Is (vi ) + Υ3 (vi ) − Υ1 (s, i) + Υ2 (s, i) ! s N −1 p n∆Gi − n∆vi Fi−1 Gs−1 . (C.21) i−1 + s s−1 where Z i−1 G0s (vi ) vi N −1 n2 ∆2 Gsi X p Fs (t), Υ1 (s, i) = Fj−1 ∆vj − s−1 s f (v ) i 0 j=1 2 2 sX Z i−2 N −1 n ∆ Gi G0s (vi ) vi 0 p F (t)ψ(t)dt, Υ2 (s, i) = ∆Fi ψ(vj ) − s−1 s f (vi ) 0 s j=1 F0 (vi ) Υ3 (s, i) = 2nMii (s) − s Gs (vi ) ψ(vi ) and f (vi ) Z vi Z vi 1 0 0 0 F0s (t)ψ(t)dt . Fs (t) + Gs (vi ) Fs (vi )Gs (vi )ψ(vi ) + Gs (vi ) Is (vi ) = f (vi ) 0 0 P Using the fact that ψ solves equation (C.18) we see that Ss=1 cs Is (vi ) = 0. Furthermore, using Lemma 15 as well as Proposition 28 we have N −1 p s−1 n∆Gsi ip 1 iN −S 1 n∆vi Fi−1 Gi−1 + ≤ p q(n, δ/2) ≤ N −S q(n, δ/2) s−1 s n n n n Therefore we need only to bound Υk for k = 1, 2, 3. After replacing the values of Gs 2 and Fs by its definitions, Proposition 31 and the fact that ψ(vi ) ≤ Cvi2 ≤ C ni 2 q 2 (n, δ) imply that with probability at least 1 − δ ip log(2/δ)p−2 √ Υ3 (s, vi ) ≤ C p q(n, δ/2)3 . n n 157 We proceed to bound the term Υ2 . The bound for Υ1 can be derived in a similar manner. 
(1) (2) By using the definition of Gs and Fs we see that Υ2 = N−1 Υ2 + Υ2 where s−1 (1) Υ2 (s, i) (2) Υ2 (s, i) = = n2 ∆2 Gs i s i−2 X j=1 s−2 − (s − 1)G(vi ) b ∆Fpi ψ − Z vi ψ(t)pF 0 i−2 X j=1 p−1 bi ∆Fpi ψ (t)f (t)dt (s−1)G(vi )s−2 . p p/2 √ It follows from Propositions 29 and 30 that |Υ2 (s, i)| ≤ C ni p log(2/δ) q(n, δ/2)2 . And n the same inequality holds for Υ1 . Replacing these bounds in (C.21) and using the fact N −S ip ≤ ni N −S yields the desired inequality. np Proposition 38. With probability at least 1 − δ max |ψ(vi ) − ψ i | ≤ eC i log(2/δ)N/2 Cq(n, δ/2) √ q(n, δ/2)3 + n3/2 n Proof. With the same argument used in Corollary 13 we see that with probability at 1 least 1 − δ for i ≤ n3/4 we have |ψ(vi ) − ψ i | ≤ √Cn q(n, δ). On the other hand, since dMi = pi the previous Proposition implies that for i > n3/4 b − ψ) n dM0 (ψ i ≤C iN −S log(2/δ)N/2 √ q(n, δ/2)3 . nN −S n Letting i = |ψ(vi ) − ψ i |, we see that the previous equation defines the following recursive inequality. ndM0ii i i−2 X iN −S log(2/δ)N/2 3 √ ≤ C N −S q(n, δ/2) − Cn dM0ij j , n n j=1 N −S−1 where we used the fact that dM0i,i−1 = 0. Since dM0ii = 2Mii ≥ 2C ni N −S−1 n1 , after dividing the above inequality by dM0ii we obtain i−2 log(2/δ)N/2 CX 3 √ i ≤ C q(n, δ/2) − j . n j=1 n Using Lemma 16 again we conclude that i ≤ eC log(2/δ)N/2 Cq(n, δ/2) √ q(n, δ/2)3 + n3/2 n 158 Theorem 14. If Assumptions 1, 2 and 3 are satisfied, then, with probability at least 1−δ over the draw of a sample of size n, the following bound holds for all i ∈ [1, n]: N/2 Cq(n, δ/2) C log(2/δ) 3 b √ |β(vi ) − β(vi )| ≤ e q(n, δ/2) + . n3/2 n where q(n, δ) = 2c log(nc/2δ) with c defined in Assumption 1, and where C is some universal constant. Proof. The proof is a direct consequence of the previous proposition and Proposition 36. 159 Appendix D Appendix to Chapter 5 D.1 Appendix to Chapter 5 Lemma 20. The function g : γ 7→ log 1 γ 1−γ is decreasing over the interval (0, 1). Proof. This can be straightforwardly established: 1−γ 1 1 + log γ log 1 − 1 − − (1 − γ) − (1 − γ) − (1 − γ) γ γ γ = < = 0, g 0 (γ) = 2 2 (1 − γ) γ(1 − γ) γ(1 − γ)2 using the inequality log(1 − x) < −x valid for all x < 0. Lemma 21. Let a ≥ 0 and let g : D ⊂ R → [a, ∞) be a decreasing and differentiable function. Then, the function F : R → R defined by p F (γ) = g(γ) − g(γ)2 − b is increasing for all values of b ∈ [0, a]. Proof. We will show that F 0 (γ) ≥ 0 for all γ ∈ D. Since F 0 = g 0 [1 − g(g 2 p − b)−1/2 ] and g 0 ≤ 0 by hypothesis, the previous statement is equivalent to showing that g 2 − b ≤ g which is trivially verified since b ≥ 0. l m γ0r T ∗ Theorem 15. Let 1/2 < γ < γ0 < 1 and r = argminr≥1 r + (1−γ0 )(1−γ r ) . For any 0 v ∈ [0, 1], if T > 4, the regret of PFSr∗ satisfies Reg(PFSr∗ , v) ≤ (2vγ0 Tγ0 log cT + 1 + v)(log2 log2 T + 1) + 4Tγ0 , where c = 4 log 2. 160 γr T 0 Proof. It is not hard to verify that the function r 7→ r + (1−γ0 )(1−γ is convex and r 0) ∗ approaches infinity as r → ∞. Thus, it admits a minimizer r̄ whose explicit expression can be found by solving the following equation γ0r T log γ0 γ0r T d = 1 + r+ . 0= dr (1 − γ0 )(1 − γ0r ) (1 − γ0 )(1 − γ0r )2 Solving the corresponding second-degree equation yields r ∗ γ0r̄ 2+ = T log(1/γ0 ) 1−γ0 − 2+ T log(1/γ0 ) 1−γ0 2 −4 2 =: F (γ0 ). ∗ By Lemmas 20 and 21, the function F thereby defined is increasing. Therefore, γ0r̄ ≤ limγ0 →1 F (γ0 ) and p 2 + T − (2 + T )2 − 4 4 2 ∗ p γ0r̄ ≤ = (D.1) ≤ . 
2 2 T 2(2 + T + (2 + T ) − 4) ∗ By the same argument, we must have γ0r̄ ≥ F (1/2), that is p 2 + 2T log 2 − (2 + 2T log 2)2 − 4 ∗ γ0r̄ ≥ F (1/2) = 2 4 p = 2(2 + 2T log 2 + (2 + 2T log 2)2 − 4) 2 1 ≥ ≥ . 4 + 4T log 2 4T log 2 Thus, r∗ = dr̄∗ e ≤ log(1/F (1/2)) log(4T log 2) +1≤ + 1. log(1/γ0 ) log 1/γ0 (D.2) Combining inequalities (D.1) and (D.2) with (5.7) gives log(4T log 2) (1 + γ0 )T Reg(PFSr∗ , v) ≤ v + 1 + v (dlog2 log2 T e + 1) + log 1/γ0 (1 − γ0 )(T − 2) ≤ (2vγ0 Tγ0 log(cT ) + 1 + v)(dlog2 log2 T e + 1) + 4Tγ0 , using the inequality log( γ1 ) ≥ 1−γ 2γ valid for all γ ∈ (1/2, 1). 161 D.1.1 Lower bound for monotone algorithms Lemma 6. Let (pt )Tt=1 be a decreasing sequence of prices. Assume that the seller faces a truthful buyer. Then, if v is sampled uniformly at random in the interval [ 12 , 1], the following inequality holds: 1 E[κ∗ ] ≥ . 32 E[v − pκ∗ ] Proof. Since the buyer is truthful, κ∗ (v) = κ if and only if v ∈ [pκ , pκ−1 ]. Thus, we can write κ κ max max Z pκ−1 max h i κX X X (pκ−1 − pκ )2 ∗ E[v − pκ ] = E 1v∈[pκ ,pκ−1 ] (v − pκ ) = (v − pκ ) dv = , 2 κ=2 κ=2 pκ κ=2 where κmax = κ∗ ( 12 ). Thus, by the Cauchy-Schwarz inequality, we can write E " κ∗ X κ=2 pκ−1 − pκ # v u κ∗ u X ≤ E tκ∗ (p κ=2 κ−1 − pκ )2 v u κmax u X ≤ E tκ∗ (pκ−1 − pκ )2 κ=2 hp i =E 2κ∗ E[v − pκ∗ ] p p ≤ E[κ∗ ] 2 E[v − p∗κ ], where the last step holds by Jensen’s inequality. In view of that, since v > pκ∗ , it follows that: # " κ∗ X p p 3 = E[v] ≥ E[pκ∗ ] = E pκ − pκ−1 + p1 ≥ − E[κ∗ ] 2 E[v − pκ∗ ] + 1. 4 κ=2 Solving for E[κ∗ ] concludes the proof. The following lemma characterizes the value of κ∗ when facing a strategic buyer. ∗ ∗ Lemma 22. For any v ∈ [0, 1], κ∗ satisfies v − pκ∗ ≥ Cγκ (pκ∗ − pκ∗ +1 ) with Cγκ = ∗ p ∗ log(2/γ) γ−γ T −κ +1 ∗ . Furthermore, when κ ≤ 1 + Tγ T and T ≥ Tγ + 2log(1/γ) , Cγκ can be 1−γ γ replaced by the universal constant Cγ = 2(1−γ) . Proof. Since an optimal strategy is played by the buyer, the surplus obtained by accepting a price at time κ∗ must be greater than the corresponding surplus obtained when 162 accepting the first price at time κ∗ + 1. It thus follows that: T X γ t−1 t=κ∗ ⇒γ κ∗ −1 (v − p ) ≥ T X κ∗ (v − pκ∗ ) ≥ t=κ∗ +1 T X t=κ∗ +1 γ t−1 (v − pκ∗ +1 ) ∗ γ t−1 (pκ∗ γκ − γT (pκ∗ − pκ∗ +1 ). − pκ∗ +1 ) = 1−γ ∗ Dividing both sides of the inequality by γ κ −1 yields the first statement of the lemma. Let us verify the second A straightforward calculation shows that the condipstatement. log(2/γ) tions on T imply T − T Tγ ≥ log(1/γ) , therefore κ∗ Cγ √ log(2/γ) γ − γ2 γ − γ T − Tγ T γ − γ log(1/γ) γ ≥ ≥ = = . 1−γ 1−γ 1−γ 2(1 − γ) log(2/γ) Proposition 39. For any convex decreasing sequence (pt )Tt=1 , if T ≥ Tγ + 2log(1/γ) , then 1 there exists a valuation v0 ∈ [ 2 , 1] for the buyer such that v u q √ 1u p 1 Reg(Am , v0 ) ≥ max T − T , tCγ T − Tγ T 8 2 √ p = Ω( T + Cγ T ). 1 − 2 r ! Cγ T Proof. In view of Proposition 19, we only need to verify that there exists v0 ∈ [ 21 , 1] such that v ! u 1 rC u p γ − . Reg(Am , v0 ) ≥ tCγ T − Tγ T 2 T p Let κmin = κ∗ (1), and κmax = κ∗ ( 21 ). If κmin > 1 + Tγ T , then Reg(Am , 1) ≥ p 1+ Tγ T , from which the statement of the proposition can be derived straightforwardly. p Thus, in the following we will only consider the case κmin ≤ 1 + Tγ T . Since, by definition, the inequality 21 ≥ pκmax holds, we can write κ max X 1 ≥ pκmax = (pκ − pκ−1 ) + pκmin ≥ κmax (pκmin +1 − pκmin ) + pκmin , 2 κ=κ +1 min where the last inequality holds by the convexity of the sequence and the fact that pκmin − pκmin −1 ≤ 0. 
The inequality is equivalent to pκmin − pκmin +1 ≥ 163 pκmin − 21 . κmax Furthermore, by Lemma 22, we have max Reg(Am , v) ≥ max v∈[ 12 ,1] κ max 2 , (T − κmin )(1 − pκmin ) , Cγ (T − κmin )(pκmin − pκmin +1 ) 2 (T − κmin )(pκmin − 12 ) κmax , Cγ . ≥ max 2 κmax q The right-hand side is minimized for κmax = 2Cγ (T − κmin )(pκmin − 21 ). Thus, there exists a valuation v0 for which the following inequality holds: r r p 1 1 1 Reg(Am , v0 ) ≥ ≥ Cγ T − Tγ T pκmin − . Cγ (T − κmin ) pκmin − 2 2 2 q Furthermore, we can assume that pκmin ≥ 1 − CTγ otherwise Reg(Am , 1) ≥ (T − p 1) Cγ /T , which is easily seen to imply the desired lower bound. Thus, there exists a valuation v0 such that v ! u 1 rC p 1u γ − , Reg(Am , v0 ) ≥ tCγ T − Tγ T 2 2 T ≥ max κ max which concludes the proof. D.2 Simulations Here, we present the results of more extensive simulations for PFSr and the monotone algorithm. Again, we consider two different scenarios. Figure D.1 shows the experimental results for an agnostic scenario where the value of the parameter γ remains unknown to both algorithms and where the parameter r of PFSr is set to log(T ). The results reported in Figure D.2 correspond to the second scenario where the discounting factor γ is known to the p algorithms and where the parameter β for the monotone algorithm is set to 1 − 1/ T Tγ . The scale on the plots is logarithmic in the number of rounds and in the regret. 164 v = 0.20 γ = 0.50 γ = 0.60 PFS 6 monr 5 PFS 6 monr 5 PFS 6 monr 5 4 4 4 4 3 3 3 3 v = 0.40 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 PFSr 6 mon PFSr 6 mon PFSr 6 mon PFSr 6 mon 5 5 5 5 4 4 4 4 3 3 3 3 6.5 7.5 8.5 9.5 10.5 v = 0.60 γ = 0.80 PFS 6 monr 5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 PFSr 6 mon PFSr 6 mon PFSr 6 mon PFSr 6 mon 5 5 5 5 4 4 4 4 3 3 3 3 6.5 7.5 8.5 9.5 10.5 v = 0.80 γ = 0.70 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 PFSr 6 mon PFSr 6 mon PFSr 6 mon PFSr 6 mon 5 5 5 5 4 4 4 4 3 3 3 3 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 Figure D.1: Regret curves for PFSr and monotone for different values of v and γ. The value of γ is not known to the algorithms. 165 v = 0.20 γ = 0.50 γ = 0.60 PFS 6 monr 5 PFS 6 monr 5 PFS 6 monr 5 4 4 4 4 3 3 3 3 v = 0.40 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 PFSr 6 mon PFSr 6 mon PFSr 6 mon PFSr 6 mon 5 5 5 5 4 4 4 4 3 3 3 3 6.5 7.5 8.5 9.5 10.5 v = 0.60 γ = 0.80 PFS 6 monr 5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 PFSr 6 mon PFSr 6 mon PFSr 6 mon PFSr 6 mon 5 5 5 5 4 4 4 4 3 3 3 3 6.5 7.5 8.5 9.5 10.5 v = 0.80 γ = 0.70 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 PFSr 6 mon PFSr 6 mon PFSr 6 mon PFSr 6 mon 5 5 5 5 4 4 4 4 3 3 3 3 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 6.5 7.5 8.5 9.5 10.5 Figure D.2: Regret curves for PFSr and monotone for different values of v and γ. The value of γ is known to both algorithms. 166 Bibliography Agrawal, R. (1995). The continuum-armed bandit problem. SIAM journal on control and optimization 33(6), 1926–1951. Amin, K., M. Kearns, P. Key, and A. Schwaighofer (2012). Budget optimization for sponsored search: Censored learning in MDPs. In UAI, pp. 54–63. Amin, K., A. Rostamizadeh, and U. Syed (2013). Learning prices for repeated auctions with strategic buyers. In Proceedings of NIPS, pp. 1169–1177. Arora, R., O. Dekel, and A. Tewari (2012). 
Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of ICML. Auer, P., N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256. Auer, P., N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002). The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77. Auer, P., N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002). The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1), 48–77. Baker, C. T. (1977). The numerical treatment of integral equations. Clarendon press. Balcan, M.-F., A. Blum, J. D. Hartline, and Y. Mansour (2008). Reducing mechanism design to algorithm design via machine learning. J. Comput. Syst. Sci. 74(8), 1245– 1270. Bartlett, P. L. (1992). Learning with a slowly changing distribution. In Proceedings of the fifth annual workshop on Computational learning theory, Proceedings of COLT, New York, NY, USA, pp. 243–252. ACM. Bartlett, P. L., S. Ben-David, and S. Kulkarni (2000). Learning changing concepts by exploiting the structure of change. Machine Learning 41, 153–174. Bartlett, P. L., M. I. Jordan, and J. D. McAuliffe (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association 101(473), 138–156. 167 Barve, R. D. and P. M. Long (1997). On the complexity of learning from drifting distributions. Information and Computation 138(2), 101–123. Ben-David, S., J. Blitzer, K. Crammer, and F. Pereira (2006). Analysis of representations for domain adaptation. In Proceedings of NIPS, pp. 137–144. Ben-David, S., T. Lu, T. Luu, and D. Pál (2010). Impossibility theorems for domain adaptation. JMLR - Proceedings Track 9, 129–136. Ben-David, S. and R. Urner (2012). On the hardness of domain adaptation and the utility of unlabeled target samples. In Proceedings of ALT, pp. 139–153. Berlind, C. and R. Urner (2015). Active nearest neighbors in changing environments. In Proceedings of ICML, pp. 1870–1879. Blitzer, J., K. Crammer, A. Kulesza, F. Pereira, and J. Wortman (2007). Learning bounds for domain adaptation. In Proceedings of NIPS. Blitzer, J., M. Dredze, and F. Pereira (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL. Blum, A., V. Kumar, A. Rudra, and F. Wu (2004). Online learning in online auctions. Theor. Comput. Sci. 324(2-3), 137–146. Börgers, T., I. Cox, M. Pesendorfer, and V. Petricek (2013). Equilibrium bids in sponsored search auctions: Theory and evidence. American Economic Journal: Microeconomics 5(4), 163–87. Boyd, S. and L. Vandenberghe (2004). Convex optimization. Cambridge: Cambridge University Press. Cavallanti, G., N. Cesa-Bianchi, and C. Gentile (2007). Tracking the best hyperplane with a simple budget perceptron. Machine Learning 69(2/3), 143–167. Cesa-Bianchi, N., A. Conconi, and C. Gentile (2001). On the generalization ability of on-line learning algorithms. In NIPS, pp. 359–366. Cesa-Bianchi, N., C. Gentile, and Y. Mansour (2013). Regret minimization for reserve prices in second-price auctions. In Proceedings of the Twenty-Fourth Annual ACMSIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6-8, 2013, pp. 1190–1204. SIAM. Cesa-Bianchi, N., C. Gentile, and Y. Mansour (2013). Regret minimization for reserve prices in second-price auctions. In SODA, pp. 1190–1204. 168 Cesa-Bianchi, N. and G. Lugosi (2006). Prediction, learning, and games. Cambridge University Press. Cole, R. 
and T. Roughgarden (2014). The sample complexity of revenue maximization. In Proceedings of STOC, pp. 243–252. Cortes, C., Y. Mansour, and M. Mohri (2010). Learning bounds for importance weighting. In Proceedings of NIPS, pp. 442–450. Cortes, C. and M. Mohri (2011). Domain adaptation in regression. In Proceedings of ALT. Cortes, C. and M. Mohri (2013). Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science 9474, 103–126. Cortes, C., M. Mohri, M. Riley, and A. Rostamizadeh (2008). Sample selection bias correction theory. In Proceedings of ALT, pp. 38–53. Crammer, K., E. Even-Dar, Y. Mansour, and J. W. Vaughan (2010). Regret minimization with concept drift. In COLT, pp. 168–180. Cui, Y., R. Zhang, W. Li, and J. Mao (2011). Bid landscape forecasting in online ad exchange marketplace. In KDD, pp. 265–273. Dasgupta, S. (2011). Recent advances in active learning. In 2011 Symposium on Machine Learning in Speech and Language Processing. Daumé III, H. (2007). Frustratingly easy domain adaptation. In Proceedings of ACL, Prague, Czech Republic. Debreu, G. and T. C. Koopmans (1982). Additively decomposed quasiconvex functions. Mathematical Programming 24, 1–38. Devanur, N. R. and S. M. Kakade (2009). The price of truthfulness for pay-per-click auctions. In Proceedings 10th ACM Conference on Electronic Commerce (EC-2009), Stanford, California, USA, July 6–10, 2009, pp. 99–106. Dredze, M., J. Blitzer, P. P. Talukdar, K. Ganchev, J. Graca, and F. Pereira (2007). Frustratingly Hard Domain Adaptation for Parsing. In CoNLL 2007. Dudley, R. M. (1984). A course on empirical processes. Lecture Notes in Math. 1097, 2 – 142. Easley, D. A. and J. M. Kleinberg (2010). Networks, Crowds, and Markets - Reasoning About a Highly Connected World. Cambridge University Press. 169 Edelman, B. and M. Ostrovsky (2007). Strategic bidder behavior in sponsored search auctions. Decision Support Systems 43(1), 192–198. Edelman, B., M. Ostrovsky, and M. Schwarz (2007). Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords. American Economic Review 97(1), 242–259. Edelman, B. and M. Schwarz (2010). Optimal auction design and equilibrium selection in sponsored search auctions. American Economic Review 100(2), 597–602. Fischer, K., B. Gärtner, and M. Kutz (2003). Fast smallest-enclosing-ball computation in high dimensions. In Algorithms-ESA 2003, pp. 630–641. Springer. Freund, Y. and Y. Mansour (1997). Learning under persistent drift. In EuroColt, pp. 109–118. Germain, P., A. Habrard, F. Laviolette, and E. Morvant (2013). A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In Proceedings of ICML. Gibbons, R. (1992). Game theory for applied economists. Princeton University Press. Gomes, R. and K. S. Sweeney (2014). Bayes-Nash equilibria of the generalized secondprice auction. Games and Economic Behavior 86, 421–437. Gretton, A., K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola (2007). A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman (Eds.), Proceedings of NIPS, pp. 513–520. Cambridge, MA: MIT Press. Guerre, E., I. Perrigne, and Q. Vuong (2000). Optimal nonparametric estimation of first-price auctions. Econometrica 68(3), 525–574. He, D., W. Chen, L. Wang, and T. Liu (2014). A game-theoretic machine learning approach for revenue maximization in sponsored search. CoRR abs/1406.0728. Helmbold, D. P. and P. M. Long (1994). 
Tracking drifting concepts by minimizing disagreements. Machine Learning 14(1), 27–46. Herbster, M. and M. Warmuth (1998). Tracking the best expert. Machine Learning 32(2), 151–78. Herbster, M. and M. Warmuth (2001). Tracking the best linear predictor. Journal of Machine Learning Research 1, 281–309. Hoffman, J., T. Darrell, and K. Saenko (2014). Continuous manifold based adaptation for evolving visual domains. In Computer Vision and Pattern Recognition (CVPR). 170 Horst, R. and N. V. Thoai (1999). DC programming: overview. Journal of Optimization Theory and Applications 103(1), 1–43. Huang, J., A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf (2006). Correcting sample selection bias by unlabeled data. In Proceedings of NIPS, Volume 19, pp. 601–608. Jiang, J. and C. Zhai (2007). Instance Weighting for Domain Adaptation in NLP. In Proceedings of ACL, pp. 264–271. Kanamori, T., S. Hido, and M. Sugiyama (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research 10, 1391–1445. Kleinberg, R. D. and F. T. Leighton (2003a). The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of FOCS, pp. 594– 605. Kleinberg, R. D. and F. T. Leighton (2003b). The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In Proceedings of FOCS, pp. 594– 605. Koltchinskii, V. and D. Panchenko (2002). Empirical margin distributions and bounding the generalization error of combined classifiers. Ann. Statist. 30(1), 1–50. Kress, R., V. Maz’ya, and V. Kozlov (1989). Linear integral equations, Volume 82. Springer. Kuleshov, V. and D. Precup (2010). Algorithms for the multi-armed bandit problem. CoRR abs/1402.6028. Kumar, P., J. S. B. Mitchell, and E. A. Yildirim (2003). Computing core-sets and approximate smallest enclosing hyperspheres in high dimensions. In ALENEX, Lecture Notes Comput. Sci, pp. 45–55. Lahaie, S. and D. M. Pennock (2007). Revenue analysis of a family of ranking rules for keyword auctions. In Proceedings of ACM EC, pp. 50–56. Langford, J., L. Li, Y. Vorobeychik, and J. Wortman (2010). Maintaining equilibria during exploration in sponsored search auctions. Algorithmica 58(4), 990–1021. Ledoux, M. and M. Talagrand (2011). Probability in Banach spaces. Classics in Mathematics. Berlin: Springer-Verlag. Isoperimetry and processes, Reprint of the 1991 edition. 171 Leggetter, C. J. and P. C. Woodland (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech & Language 9(2), 171–185. Linz, P. (1985). Analytical and numerical methods for Volterra equations, Volume 7. SIAM. Littlestone, N. (1989). From on-line to batch learning. In Proceedings of COLT, pp. 269–284. Morgan Kaufmann Publishers Inc. Long, P. M. (1999). The complexity of learning according to two models of a drifting environment. Machine Learning 37, 337–354. Lucier, B., R. P. Leme, and É. Tardos (2012). On revenue in the generalized second price auction. In Proceedings of WWW, pp. 361–370. Mansour, Y., M. Mohri, and A. Rostamizadeh (2009). Domain adaptation: Learning bounds and algorithms. In Proceedings of COLT, Montréal, Canada. Omnipress. Martı́nez, A. M. (2002). Recognizing imprecisely localized, partially occluded, and expression variant faces from a single sample per class. IEEE Trans. Pattern Anal. Mach. Intell. 24(6), 748–763. Milgrom, P. and I. Segal (2002). Envelope theorems for aribtrary choice sets. Econometrica (2), 583–601. Milgrom, P. 
and R. Weber (1982). A theory of auctions and competitive bidding. Econometrica: Journal of the Econometric Society 50(5), 1089–1122. Mohri, M. and A. M. Medina (2014). Learning theory and algorithms for revenue optimization in second price auctions with reserve. In Proceedings of ICML, pp. 262–270. JMLR.org. Mohri, M. and A. M. Medina (2015). Revenue optimization against strategic buyers. In Proceedings of NIPS. Mohri, M. and A. Muñoz (2012). New analysis and algorithm for learning with drifting distributions. In Proceedings of ALT. Springer. Mohri, M., A. Rostamizadeh, and A. Talwalkar (2012). Foundations of machine learning. Cambridge, MA: MIT Press. Morgenster, J. and T. Roughgarden (2015). The pseudo-dimension of near-optimal auctions. In Proceedings of NIPS. 172 Morgenstern, J. and T. Roughgarden (2015). The pseudo-dimension of near-optimal auctions. CoRR abs/1506.03684. Morris, P. (1994). Non-zero-sum games. In Introduction to Game Theory, pp. 115–147. Springer. Muthukrishnan, S. (2009). Ad exchanges: Research issues. Internet and network economics 5929, 1–12. Myerson, R. (1981). Optimal auction design. Mathematics of operations research 6(1), 58–73. Nachbar, J. (2001). Bayesian learning in repeated games of incomplete information. Social Choice and Welfare 18(2), 303–326. Nachbar, J. H. (1997). Prediction, optimization, and learning in repeated games. Econometrica: Journal of the Econometric Society 65(2), 275–309. Nisan, N., T. Roughgarden, É. Tardos, and V. V. Vazirani (Eds.) (2007). Algorithmic game theory. Cambridge: Cambridge University Press. Ostrovsky, M. and M. Schwarz (2011). Reserve prices in internet advertising auctions: a field experiment. In Proceedings of ACM EC, pp. 59–60. Pan, S. J., I. W. Tsang, J. T. Kwok, and Q. Yang (2011). Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2), 199–210. Pollard, D. (1984). Convergence of Stochastic Processess. New York: Springer. Qin, T., W. Chen, and T. Liu (2014). Sponsored search auctions: Recent advances and future directions. ACM TIST 5(4), 60. Rakhlin, A., K. Sridharan, and A. Tewari (2010). Online learning: Random averages, combinatorial parameters, and learnability. Rasmussen, C. E., R. M. Neal, G. Hinton, D. van Camp, M. R. Z. Ghahramani, R. Kustra, and R. Tibshirani (1996). The delve project. http://www.cs.toronto. edu/˜delve/data/datasets.html. version 1.0. Riley, J. and W. Samuelson (1981). Optimal auctions. The American Economic Review 71(3), 381–392. Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pp. 169–177. Springer. Schőnherr, S. (2002). Quadratic Programming in Geometric Optimization: Theory, Implementation, and applications. Ph. D. thesis, Swiss Federal Institute of Technology. 173 Sriperumbudur, B. K. and G. R. G. Lanckriet (2012). A proof of convergence of the concave-convex procedure using Zangwill’s theory. Neural Computation 24(6), 1391–1407. Sugiyama, M., S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe (2007). Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of NIPS, pp. 1433–1440. Sun, Y., Y. Zhou, and X. Deng (2014). Optimal reserve prices in weighted GSP auctions. Electronic Commerce Research and Applications 13(3), 178–187. Talagrand, M. (2005). The Generic Chaining. New York: Springer. Tao, P. D. and L. T. H. An (1997). Convex analysis approach to DC programming: theory, algorithms and applications. 
Acta Mathematica Vietnamica 22(1), 289–355. Tao, P. D. and L. T. H. An (1998). A DC optimization algorithm for solving the trustregion subproblem. SIAM Journal on Optimization 8(2), 476–505. Thompson, D. R. M. and K. Leyton-Brown (2013). Revenue optimization in the generalized second-price auction. In Proceedings of ACM EC, pp. 837–852. Tommasi, T., T. Tuytelaars, and B. Caputo (2014). A testbed for cross-dataset analysis. CoRR abs/1402.5923. Tuy, H. (1964). Concave programming under linear constraints. Translated Soviet Mathematics 5, 1437–1440. Tuy, H. (2002). Counter-examples to some results on D.C. optimization. Technical report, Institute of Mathematics, Hanoi, Vietnam. Valiant, P. (2011). Testing symmetric properties of distributions. SIAM J. Comput. 40(6), 1927–1968. Varian, H. R. (2007, December). Position auctions. International Journal of Industrial Organization 25(6), 1163–1178. Vickrey, W. (1961). Counterspeculation, auctions, and competitive sealed tenders. The Journal of finance 16(1), 8–37. Welzl, E. (1991). Smallest enclosing disks (balls and ellipsoids). In New results and new trends in computer science (Graz, 1991), Volume 555 of Lecture Notes in Comput. Sci., pp. 359–370. Berlin: Springer. Widrow, B. and M. E. Hoff (1988). Adaptive switching circuits, pp. 123–134. ACM. 174 Yen, I. E., N. Peng, P.-W. Wang, and S.-D. Lin (2012). On convergence rate of concaveconvex procedure. In Proceedings of the NIPS 2012 Optimization Workshop. Yildirim, E. A. (2008). Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization 19(3), 1368–1391. Yuille, A. L. and A. Rangarajan (2003). The concave-convex procedure. Neural Computation 15(4), 915–936. Zhang, K., B. Schölkopf, K. Muandet, and Z. Wang (2013). Domain adaptation under target and conditional shift. In Proceedings of ICML 2013, pp. 819–827. Zhong, E., W. Fan, Q. Yang, O. Verscheure, and J. Ren (2010). Cross validation framework to choose amongst models and datasets for transfer learning. In Proceedings of ECML PKDD 2010 Part III, pp. 547–562. Zhu, Y., G. Wang, J. Yang, D. Wang, J. Yan, J. Hu, and Z. Chen (2009). Optimizing search engine revenue in sponsored search. In Proceedings of ACM SIGIR on Research and Development in Information Retrieval, pp. 588–595. 175