Learning and Testing Junta Distributions over Hypercubes

by Maryam Aliakbarpour

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, September 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 24, 2015.
Certified by: Ronitt Rubinfeld, Professor of Electrical Engineering and Computer Science, Thesis Supervisor.
Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, EECS Committee for Graduate Students.

Abstract

Many tasks related to the analysis of high-dimensional datasets can be formalized as problems involving learning or testing properties of distributions over a high-dimensional domain. In this work, we initiate the study of the following general question: when many of the dimensions of the distribution correspond to "irrelevant" features in the associated dataset, can we learn the distribution efficiently? We formalize this question with the notion of a junta distribution. The distribution D over {0,1}^n is a k-junta distribution if the probability mass function p of D is a k-junta, i.e., if there is a set J ⊆ [n] of at most k coordinates such that for every x ∈ {0,1}^n, the value of p(x) is completely determined by the value of x on the coordinates in J.

We show that it is possible to learn k-junta distributions with a number of samples that depends only logarithmically on the total number n of dimensions. We give two proofs of this result; one using the cover method and one by developing a Fourier-based learning algorithm inspired by the Low-Degree Algorithm of Linial, Mansour, and Nisan (1993).

We also consider the problem of testing whether an unknown distribution is a k-junta distribution. We introduce an algorithm for this task with sample complexity Õ(2^{n/2}·k) and show that this bound is nearly optimal for constant values of k. As a byproduct of the analysis of the algorithm, we obtain an optimal bound on the number of samples required to test a weighted collection of distributions for uniformity.

Finally, we establish the sample complexity for learning and testing other classes of distributions related to junta distributions. Notably, we show that the task of testing whether a distribution on {0,1}^n contains a coordinate i ∈ [n] such that x_i is drawn independently from the remaining coordinates requires Θ̃(2^{2n/3}) samples. This is in contrast to the task of testing whether all of the coordinates are drawn independently from each other, which was recently shown to have sample complexity Θ̃(2^{n/2}) by Acharya, Daskalakis, and Kamath (2015).

Thesis Supervisor: Ronitt Rubinfeld
Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

The results of this thesis are based on a collaboration with Eric Blais and Ronitt Rubinfeld.
I would like to gratefully and sincerely thank Prof. Ronitt Rubinfeld, my advisor, for her guidance, support, and most importantly, her friendship. Moreover, I would like to thank my parents for providing me with unending love and support.

Contents

1 Introduction
  1.1 Our Results
    1.1.1 Learning junta distributions
    1.1.2 Testing junta distributions
    1.1.3 Learning and testing dictator distributions
    1.1.4 Learning and testing feature-restricted distributions
    1.1.5 Learning and testing feature-separable distributions
  1.2 Related work
  1.3 Preliminaries
2 Learning junta distributions
  2.1 Learning junta distributions using Fourier analysis
    2.1.1 Step 1: The gap between h(J*) and h(J)
    2.1.2 Step 2: Equality of f(J) and h(J)
    2.1.3 Step 3: Estimating f(J)
  2.2 A lower bound for learning juntas
3 Testing junta distributions
  3.1 A test algorithm for junta distributions
    3.1.1 Uniformity Test of a Collection of Distributions
    3.1.2 Uniformity Test within a Bucket
  3.2 A lower bound for testing junta distributions
4 Learning and testing feature-separable distributions
  4.1 Testing feature-separable distributions
  4.2 Identifying separable features
5 Learning and testing 1-junta distributions
  5.1 Learning 1-juntas
  5.2 Testing 1-juntas
6 Learning and testing dictator distributions
  6.1 Learning dictator distributions
  6.2 Testing dictator distributions
  6.3 Learning and testing feature-restricted distributions
    6.3.1 Testing feature-restricted distributions
    6.3.2 Identifying restricted features
A Learning juntas with the cover method
B Proof of Equation (2.1)

Chapter 1
Introduction

One of the central challenges in data analysis today is the study of high-dimensional datasets. Consider for example the general scenario for medical studies: to better understand some medical condition, researchers select a set of patients with this condition and collect some attributes (or features) related to their health and environment (e.g., smoker or not, age, full genome analysis). Until recently, typical medical studies collected only a few features from each patient. For example, a classic dataset obtained from a long-term diabetes study [39] included only 8 features. By contrast, modern genome-wide association studies routinely include over a million different features [16].

Many problems related to understanding high-dimensional datasets can be formalized in the setting of learning distributions and testing whether those distributions have some given properties.
For example, in the medical study scenario, we can view each selected patient as being drawn from the (unknown) underlying distribution over all patients with the medical condition; to understand this condition, we want to learn this distribution or identify some of its prominent characteristics. In order to do so, we need learning and testing algorithms whose sample complexity and time complexity are both reasonable functions of (ideally, polynomial in) the dimension of the distribution's domain. Traditional algorithms developed for low-dimensional distributions typically have sample and time complexity that is exponential in the dimension of the distribution, so in most cases new algorithms are required for the high-dimensional setting. To obtain such algorithms, we must exploit some structural aspects of the high-dimensional distributions under study.

The starting point of the current research is a basic observation: in datasets with many features, most of the features are often irrelevant to the phenomenon under study. (In our medical study example, a full genome analysis includes information about every gene even when only a small number of them are related to the condition.) As we discuss in the related work section below, the same observation has been extremely influential in the theory of learning functions; can it also lead to efficient algorithms for learning and testing distributions?

In this work, we initiate the theoretical study of these questions in the setting of learning distributions [29] and testing properties of distributions [7]. For concreteness, we focus on the case where each feature collected in the dataset is Boolean, i.e., on distributions over the Boolean hypercube {0,1}^n.

Our first task is to formalize the notion of relevant and irrelevant features. This notion has already been studied extensively in the setting of functions over the Boolean hypercube, where f : {0,1}^n → R is a k-junta if there is a set J ⊆ [n] of size at most k such that for every x ∈ {0,1}^n, the value f(x) is completely determined by {x_i}_{i∈J}. In this setting, the coordinates in J correspond to the relevant features, and the remaining coordinates are called irrelevant.

There is a very natural extension of the definition of juntas to the setting of distributions: we say that the distribution D over {0,1}^n is a k-junta distribution if its probability mass function p : {0,1}^n → [0,1] is a k-junta. In other words, D is a k-junta distribution if there is a set J ⊆ [n] of size at most k such that the probability that x ∈ {0,1}^n is drawn from D is completely determined by {x_i}_{i∈J}. This notion, as in the case of Boolean functions, captures a fundamental notion of relevance of features. Again using the medical study example, if the features are distributed uniformly (and independently) at random in the distribution over the whole population (as opposed to the distribution over those with the medical condition), then the features that are relevant to the medical condition will be exactly the ones identified in J. This is of course a somewhat idealized scenario; nonetheless, it does appear to capture much of the inherent complexity of learning distributions in the presence of irrelevant features, and it appears to be a natural starting point for this line of research.

We also examine other classes of distributions that are of particular interest to our understanding of the role of relevant and irrelevant features in high-dimensional distributions.
We study dictator distributions, which capture the simpler setting where one of the features is always set to 1 and the remaining coordinates are irrelevant; feature-restricted distributions, where we keep the requirement that one of the features takes the value 1 for all inputs in the support of the distribution but add no other constraints on the remaining coordinates; and feature-separable distributions, where one of the features is drawn independently from the rest.

We consider the problems of learning and testing each of these classes of distributions. Our results show that these problems exhibit a rich diversity in sample and time complexity. We discuss our results in more detail below. Table 1.1 includes a summary of the sample complexity for each of these problems.

1.1 Our Results

1.1.1 Learning junta distributions

We begin by considering the problem of learning k-juntas. We can obtain a strong upper bound on the sample complexity of this problem using the cover method [24, 22, 23]. By showing that there is a set C of (n choose k)·2^{k2^k/ε} distributions such that every k-junta distribution is ε-close to some distribution in C, we obtain the following result.

Theorem 1.1.1. Fix ε > 0 and 1 ≤ k ≤ n. Define t = (n choose k)·2^{k2^k/ε}. There is an algorithm A with sample complexity O(log t/ε²) = O(k2^k/ε³ + k log n/ε²) and running time O(t log t/ε²) that, given samples from a k-junta distribution D, with probability at least 2/3 outputs a distribution D' such that d_{TV}(D, D') := (1/2)·Σ_{x∈{0,1}^n} |p_D(x) − p_{D'}(x)| ≤ ε.

The algorithm's sample complexity is logarithmic in n, an exponential improvement over the sample complexity of the (folklore) general learning algorithm. The running time of the algorithm from Theorem 1.1.1, however, is Ω(t/ε²), which is exponential in k. Our second main learning result introduces a different algorithm for learning k-juntas with a running time that is only singly-exponential in k while still maintaining the logarithmic dependence on n in the sample complexity.

Theorem 1.1.2. Fix ε > 0 and 1 ≤ k ≤ n. There is an algorithm A with sample complexity O(2^{2k}·k log n/ε⁴) and running time O((n choose k)·2^{2k}·log n/ε⁴) that, given samples from a k-junta distribution D, with probability at least 2/3 outputs a distribution D' such that d_{TV}(D, D') ≤ ε.

The proof of Theorem 1.1.2 is inspired by the Low-Degree Algorithm of Linial, Mansour, and Nisan [31] for learning Boolean functions. As in the Low-Degree Algorithm, we estimate the low-degree Fourier coefficients of the function of interest (in this case, the probability mass function of the unknown distribution). Unlike in the Low-Degree Algorithm, however, we do not use those estimated Fourier coefficients to immediately generate our hypothesis. Instead, we show that these estimates can be used to identify a set J ⊆ [n] of coordinates such that the target function is close to being a junta distribution on J.

Our upper bounds on the sample complexity of the problem of learning k-juntas have a logarithmic dependence on n and an exponential complexity in k. We show that both of these dependencies are necessary.

Theorem 1.1.3. Fix ε > 0 and n ≥ k ≥ 1. Any algorithm that learns k-junta distributions over {0,1}^n has sample complexity Ω(2^k/ε² + k log(n)/ε).
The key element of the proof of Theorem 1.1.3 is the construction of a distribution over k-junta distributions that outputs the uniform distribution with probability 1/2 and such that any deterministic algorithm drawing o(k log(n)/ε) samples minimizes its error by always outputting the uniform distribution as its hypothesis. This result implies the Ω(k log(n)/ε) portion of the lower bound in the theorem, and the remaining Ω(2^k/ε²) term is obtained by a simple reduction to the problem of learning general discrete distributions.

The proofs of Theorems 1.1.2 and 1.1.3 are presented in Chapter 2, and the proof of Theorem 1.1.1 is included in Appendix A.

1.1.2 Testing junta distributions

We next turn our attention to the problem of testing whether an unknown distribution is a k-junta distribution or not. More precisely, a distribution D is ε-far from having some property P (e.g., being a k-junta) if for every distribution D' that does have property P, we have d_{TV}(D, D') > ε. An ε-tester for property P is a randomized algorithm with bounded error that distinguishes distributions with property P from those that are ε-far from having the same property. We show that it is possible to test k-juntas with a number of samples that is sublinear in the size of the domain of the distribution.

Theorem 1.1.4. Fix ε > 0 and n ≥ k ≥ 1. There is an algorithm that draws O(2^{n/2}·k log n/ε²) samples from D and distinguishes with probability at least 2/3 between the case where D is a k-junta and the case where it is ε-far from being a k-junta.

The proof of Theorem 1.1.4 is obtained by reducing the problem of testing juntas to the problem of testing a weighted collection of distributions for uniformity. We then show that this problem can be solved with roughly O(√(mN)) samples when there are m distributions with domains of size N each. See Theorem 3.1.2 for the details.

When k ≪ n, the sample complexity of the junta testing algorithm is much larger than (i.e., doubly-exponential in) the sample complexity of the learning algorithm. We show that this gap is unavoidable and that the bound in Theorem 1.1.4 is nearly optimal.

Theorem 1.1.5. Fix any 0 < k < n and any constant 0 < ε < 1. Every algorithm for ε-testing k-juntas must make Ω(2^{n/2}/ε²) queries.

The lower bound again uses the connection to the problem of testing collections of distributions. In this case, this is done by constructing a distribution over k-junta distributions and distributions that are far from k-juntas such that any algorithm that distinguishes between the two with sample complexity o(2^{n/2}/ε²) would also be able to test collections of distributions for uniformity with a number of samples that violates a lower bound of Levi, Ron, and Rubinfeld [30]. See Chapter 3 for the proofs of Theorems 1.1.4 and 1.1.5.

1.1.3 Learning and testing dictator distributions

Let us examine the special case where k = 1 in more detail. The definition of 1-junta distributions can be expressed in a particularly simple way: a distribution D is a 1-junta if its probability mass function p : {0,1}^n → R is of the form

p(x) = α/2^{n−1} if x_i = 1, and p(x) = (1 − α)/2^{n−1} if x_i = 0,

for some i ∈ [n] and 0 ≤ α ≤ 1. Using this definition, we obtain simpler and more sample-efficient algorithms for learning and testing 1-juntas. (See Chapter 5 for the details.) This representation also suggests another natural class of distributions to study: those that satisfy this definition with α = 1.
This corresponds to the class of distributions with only one relevant feature, where moreover this feature completely determines whether an element is in the support of the distribution or not. This class also corresponds to the set of distributions whose probability mass function is a scalar multiple of a dictator function. We thus call these distributions dictator distributions. We give exact bounds for the number of samples required for learning and testing dictator distributions.

Theorem 1.1.6. Fix ε > 0 and n ≥ 1. The minimal sample complexity to learn dictator distributions up to total variation distance ε is Θ(log n). The minimal sample complexity to ε-test whether a distribution D on {0,1}^n is a dictator distribution is Θ(2^{(n−1)/2}/ε²).

The sample complexity for the task of learning dictator distributions is obtained with a simple argument. The upper and lower bounds for testing dictator distributions are both obtained by exploiting the close connection between dictator distributions and uniform distributions. The proof of Theorem 1.1.6 is included in Chapter 6.

1.1.4 Learning and testing feature-restricted distributions

The definition of dictator distributions in turn suggests a second variant of interest: the distribution D is a dictator distribution if there exists an index i ∈ [n] such that D is the uniform distribution over {x ∈ {0,1}^n : x_i = 1}. What happens when we consider arbitrary distributions whose support satisfies the same property? We call such distributions feature-restricted distributions, and a coordinate i that satisfies the condition that x_i = 1 for every x in the support of D is called a restricted feature. We give exact bounds for the number of samples required to learn and test feature-restricted distributions as well as to identify a restricted feature of feature-restricted distributions.

Theorem 1.1.7. Fix ε > 0 and n ≥ 1. The minimal sample complexity to learn feature-restricted distributions up to total variation distance ε is Θ(2^n/ε²), but we can identify a restricted feature of these distributions with only Θ(log(n)/ε) samples. The minimal sample complexity to ε-test whether a distribution D on {0,1}^n is feature-restricted is Θ(log n/ε).

The learning result follows from a simple reduction to the problem of learning general discrete distributions. The sample complexities of the other two tasks, the identification of restricted features and the testing of feature-restricted distributions, are both established with elementary arguments. For completeness, the details of the proof of Theorem 1.1.7 are included in Section 6.3.

1.1.5 Learning and testing feature-separable distributions

Finally, the last topic we consider in this work is a property of distributions called feature separability. The distribution D over {0,1}^n is feature-separable if there exists an index i ∈ [n] such that the variable x_i is drawn independently from the remaining coordinates of x, or equivalently if there exist probability mass functions q : {0,1} → [0,1] and r : {0,1}^{n−1} → [0,1] such that the probability mass function p of D is given by p(x) = q(x_i)·r(x_1, …, x_{i−1}, x_{i+1}, …, x_n). This property is very close to that of dictatorships: a dictator distribution is one that satisfies the feature separability condition under the extra restriction that r be the uniform distribution.
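As a concrete illustration of the feature-separability condition defined above, the following small Python sketch (ours, not part of the thesis; the function names are hypothetical) builds a feature-separable distribution on {0,1}^n for a small n by multiplying a pmf on the chosen coordinate with a pmf on the remaining coordinates, and then checks the independence condition by brute force.

    # Illustrative sketch (not from the thesis): for small n, build a
    # feature-separable distribution p(x) = q(x_i) * r(x_{-i}) and check
    # numerically that coordinate i is independent of the rest.
    import itertools
    import random

    def make_feature_separable(n, i, alpha, seed=0):
        """p(x) = alpha * r(x_{-i}) if x_i = 1, else (1 - alpha) * r(x_{-i})."""
        rng = random.Random(seed)
        rest = list(itertools.product([0, 1], repeat=n - 1))
        weights = [rng.random() for _ in rest]
        total = sum(weights)
        r = {y: w / total for y, w in zip(rest, weights)}
        p = {}
        for x in itertools.product([0, 1], repeat=n):
            y = x[:i] + x[i + 1:]
            p[x] = (alpha if x[i] == 1 else 1.0 - alpha) * r[y]
        return p

    def is_feature_separable_at(p, i, tol=1e-9):
        """Check Pr[x_i = b, x_{-i} = y] = Pr[x_i = b] * Pr[x_{-i} = y] for all x."""
        alpha = sum(pr for x, pr in p.items() if x[i] == 1)
        marg = {}
        for x, pr in p.items():
            y = x[:i] + x[i + 1:]
            marg[y] = marg.get(y, 0.0) + pr
        for x, pr in p.items():
            y = x[:i] + x[i + 1:]
            expected = (alpha if x[i] == 1 else 1.0 - alpha) * marg[y]
            if abs(pr - expected) > tol:
                return False
        return True

    p = make_feature_separable(n=4, i=2, alpha=0.3)
    print(is_feature_separable_at(p, 2))   # True
    print(is_feature_separable_at(p, 0))   # typically False

The brute-force check returns False for a typical coordinate that was not used in the construction, matching the intuition that feature-separability singles out one special coordinate.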
We show that feature-separability can be tested with a number of samples that is sublinear in the size of the domain of the distribution.

Theorem 1.1.8. Fix ε > 0 and n ≥ 1. The minimal sample complexity to learn feature-separable distributions up to total variation distance ε is Θ(2^n/ε²), but we can identify a separable feature of these distributions with O(poly(1/ε)·2^{2n/3}·n² log n) samples. There is an algorithm that ε-tests whether a distribution D on {0,1}^n is feature-separable with O(poly(1/ε)·2^{2n/3}·n² log n) samples. Furthermore, there is a constant c_0 > 0 such that every c_0-tester for feature separability has sample complexity Ω(2^{2n/3}).

When every index i ∈ [n] satisfies the feature separability condition, the distribution D is called independent. The sample complexity of the independence testing problem has been established very recently: Θ(2^{n/2}/ε²) samples are both necessary and sufficient for this task [1]; our result shows that, interestingly, the feature separability testing problem requires significantly more samples. The proof of Theorem 1.1.8 is presented in Chapter 4.

Distribution class   | Learning                                        | Testing
k-juntas             | O(k2^k/ε³ + k log n/ε²), Ω(2^k/ε² + k log n/ε)  | Õ(2^{n/2}·k⁴/ε³ + 2^k/ε), Ω(2^{n/2}/ε²)
Dictators            | Θ(log n)                                        | Θ(2^{(n−1)/2}/ε²)
Feature-restricted   | Θ(log n/ε) †                                    | Θ(log n/ε)
Feature-separable    | O(poly(1/ε)·2^{2n/3}·n² log n) †                | O(poly(1/ε)·2^{2n/3}·n² log n), Ω(2^{2n/3})

Table 1.1: Summary of our results. The table includes the upper and lower bounds on the sample complexity for learning and testing the classes of distributions described in Section 1.1. The results marked with † describe the sample complexity for identifying the restricted feature and the separable feature, respectively. In both cases, the standard learning task requires Θ(2^n/ε²) samples.

1.2 Related work

Learning and testing Boolean functions. The present work was largely influenced by the seminal work of Blum [13] and Blum and Langley [15], who first proposed the study of junta functions to formalize the problem of learning Boolean functions over domains that contain many irrelevant features. Their work led to a rich line of research on learning juntas [14, 33, 3, 40]. Starting with the work of Fischer et al. [27], there has also been a lot of research on the problem of testing junta functions [21, 10, 11, 4, 38, 2]. This work, along with the testing-by-implicit-learning method, has since led to new results for testing many other properties of Boolean functions as well [26, 19, 18, 12].
See Canonne's survey [171 for an engaging overview of the results in this area. Despite all of this previous work on learning and testing properties of distributions, we are not aware of previous results on the classes of distributions we study in the current paper. Nor are we aware of existing approaches that seek to exploit the notion of relevant or irrelevant features (rather than other assumptions about the 17 shape or family of the distribution) to obtain efficient algorithms for the analysis of distributions. Remark on terminology. Procaccia and Rosenschein [361 introduced a different class of distributions that they also called "junta distributions". They did so in a completely different context (the study of manipulability of distributions by small coalitions of voters) and over completely different types of distributions (over instances of candidate-ranking problems instead of the Boolean hypercube). We use the same name for our notion of junta distributions because of the strong connection to junta functions and because we believe there should be no confusion between the two (very different) settings. We should emphasize, however, that our results do not apply to Procaccia-Rosenschein junta distributions (and vice-versa). 1.3 Preliminaries We denote the set {1, 2,.. ., n} by [n]. We use x to indicate a binary vector of size n unless otherwise specified. The value of the i-th coordinate of x is denoted by xi. In addition, the restriction of x to the coordinates in the set I C [n] is denoted by x('). For example, 101000(1,31) a hypercube. 11. Let P : {, 1}' -+ [0, 1] be a distribution over We sometimes write P(x) to denote the probability of drawing x in P. The uniform distribution is denoted by U. The L1 (or total variation) distance between two distributions is defined as dL 1 (Pl, P2) = E Pi(x) - P2 (x)IA function f {0, 1}r -÷ R is a dictatorfunction if there is an index i E [n] such that f(x) = xi for every x E {0, 1 }". We say that there is a constant c 7 0 such that f(x) f is a weighted dictatorfunction if c - x for every x C {0, 1}". The function f is a k-junta for some value 1 < k < n if there is a set J C [n] of size that f(x) = IJI < k such f(y) whenever () = y(J). The classes of distributions that we consider are defined as follows. Definition 1.3.1 (Junta distributions). The distribution P over {0, 1}" with probability mass function p {0, 1}" - [0,1] is a k-junta distribution if p is a k-junta. 18 Equivalently, P is a k-junta distribution if p is of the form -o if X al if x() ) =0 ...00 ... 01 pnx)if XW= 11...1 for some values a1 ,. . . , a 2 k-1 E [0, 1] that satisfy Zj=0 ai = 1. Definition 1.3.2 (Dictator distributions). The distribution P over {0, 1}" is a dictator distribution if its probability mass function p : {0, 1}' -+ [0, 1] is a weighted dictator function. Equivalently, P is a dictator distribution if its pmf p is of the form P0 if Xi = 1 if Xi = 0 for some index i E [n]. Definition 1.3.3 (Feature-restricted distributions). The distribution P over {0, 1}" is a feature-restricteddistribution if there is an index i E [n] such that its probability mass function p : {0, 1}' -+ [0, 1] satisfies p(x) = 0 for every x with xi = 0. Definition 1.3.4 (Feature-separable distributions). The distribution P over {0, 1}' is a feature-restricteddistribution if there is an index i E [n] such that xi and are independent random variables when x is drawn from P. 
X([t\{i}) Equivalently, P is a feature-restricted distribution if its probability mass function is of the form a - q(X([r\{i})) if xi = 1 (1 if Xi = 0 - a) q(x"l\fi)) for some index i E [n], some parameter a E [0, 1], and some probability mass function q : {0, 1}T-1 -+ [0, 1]. 19 We use the model of learning distributions introduced in [291. A concept class C of discrete distributions is simply a set of distributions (e.g., the set of all dictator or k-junta distributions). All of the learning algorithms introduced in this paper are proper learning algorithms: they always output a hypothesis distribution in the target distribution's concept class. Definition 1.3.5 ((q, e, 6)-Learner). An (q, c, 6)-learner for a class C of distributions is an algorithm that draws q samples of an arbitrary and unknown distribution P E C and outputs a hypothesis P' E C such that with probability at least 1-6, dTv(P, P') < E. Our testing results are obtained within the standard framework for testing properties of distributions introduced in [7]. A property of discrete distributions is again a set of distributions (it is equivalent to the notion of concept class). A distribution P is E-far from having the property C if the total variation between the distribution and any distribution that has the property is at least e. A tester for C is an algorithm that distinguishes distributions in C from those that are E-far from C. Definition 1.3.6 ((c,6)-test). An (E,6)-test for a property C of discrete distributions is an algorithm that draws samples from and unknown distribution P and with probability 1 - 6, accepts if P is in C and rejects if P is E-far from C. We also consider the problem of testing collections of distributions. A (weighted) collection of distributionsis a set of m distributions PI, P2 ,.. . , P.,, on the domain [N] and a set of m weights wi, . . . , E [0, 1] such that 'wi = 1. We denote such collection by {Pilwi}'i 1 . We can also consider the collection as a single distribution P where for any i E [m] and x E [N], P((i, x)) = wiPi(x). When we draw a sample from {Pilwi}T, we obtain a pair (P, j) such that Pi is picked with probability wi and then j is a sample drawn from Pi. Definition 1.3.7 (Weighted distance to uniformity). The weighted distance to uni- formity of the set S c [m] of distributions in the collection {Pilwi}" is E wi- dt,,(Ps, U). The weighted distance to uniformity of the collection itself is the weighted distance to uniformity of the set S = [n]. 20 Definition 1.3.8 ((c, 6)-tester for uniformity). An (E, 6) -test for uniformity (of collections of distributions)is an algorithm that draws samples from an unknown collection of distributions {Pilwi}'&1 and with probability at least 1 - 6, accepts if all the Pi's are uniform and rejects if the weighted distance of the collection is at least e. 21 " - "1," ,1 "l --- s . I 22 . uw a . .7 w. . .o mu . . . K6 1 . " Chapter 2 Learning junta distributions 2.1 Learning junta distributions using Fourier analysis In this section, we introduce an algorithm for learning junta distributions. To do so, we have to figure out the set of junta coordinates P and the biases ai's. We use Fourier analysis over the Boolean hypercube to solve this problem. For a complete introduction of Fourier analysis, see [341. The Fourier basis of the set S is Xs(x) = (-1)EiEsxi = (-1)(EsiS when S is not empty and Xo(x) = 1. 
The distribution P, or, more precisely, its probability mass function, can be written in terms of the Fourier basis as

P(x) = Σ_{S⊆[n]} P̂(S)·χ_S(x),

where the Fourier coefficient associated with S is defined as

P̂(S) = (1/2^n)·Σ_{x∈{0,1}^n} P(x)·χ_S(x) = E[P(x)·χ_S(x)].

Here and throughout, we write E[f(x)] to refer to the expected value of f(x) when x is picked uniformly at random, and E_{x∼P}[f(x)] to refer to the expected value of f(x) when x is drawn from P. The proof of Theorem 1.1.2 is obtained via the following analysis of the learning algorithm in Figure 2-1.

Junta learning algorithm
1. Draw s = 9·2^{2k}·ln(6·2^k·(n choose k))/(2ε⁴) samples.
2. For all J ⊆ [n] of size k:
   2.1 f̂(J) ← 0
   2.2 For all S ⊆ J where S ≠ ∅:
       i. f̂(J) ← f̂(J) + (2·[# samples with χ_S(x) = 1]/s − 1)²
3. Output the J that maximizes f̂(J) (break ties arbitrarily).

Figure 2-1: A learning algorithm for junta distributions

Theorem 2.1.1. Let P be a junta distribution over k coordinates in the set J*, with biases a_0, a_1, …, a_{2^k−1}. The junta learning algorithm, shown in Figure 2-1, outputs a set J such that P is ε-close to a junta distribution on J, using s = 9·2^{2k}·ln(6·2^k·(n choose k))/(2ε⁴) samples, with probability at least 2/3. In addition, the running time of the algorithm is O(n^k·2^{3k}·k² log n/ε⁴).

Proof: To prove the correctness of the algorithm, we need to show that the probability that the junta learning algorithm outputs a set J such that P is ε-far from every junta distribution on the set J is very small. Let us call such sets J invalid sets; we want to show that the algorithm outputs an invalid set only with small probability.

For the distribution P in the theorem statement and any set J ⊆ [n] of size k, we define a distribution P_J by setting

P_J(x) = Pr_{y∼P}[y^{(J)} = x^{(J)}]/2^{n−k}

for every x in the hypercube. The distribution P_J is a junta distribution on the set J. In particular, for any x and x' that agree on the coordinates in J, P_J(x) is equal to P_J(x'). Furthermore, P_{J*} is equal to the original distribution P. Now we define two functions f and h on subsets of [n] of size k by setting

h(J) = 2^{2n}·E[(P_J(x) − 1/2^n)²]   and   f(J) = 2^{2n}·Σ_{S⊆J, S≠∅} P̂(S)².

We complete the proof of correctness of the algorithm via the following three steps:

• Step 1. If J is not a valid output, then h(J*) − h(J) is at least 4ε².
• Step 2. f(J) is always equal to h(J). Therefore, f(J*) − f(J) is at least 4ε² too.
• Step 3. With probability at least 2/3, for every set J of size k, |f̂(J) − f(J)| < 2ε².

We prove each statement in the corresponding section below. Assuming the correctness of each step, the correctness of the algorithm follows from the fact that the estimate f̂(J) will be less than f̂(J*) for every invalid set J, so the algorithm never outputs an invalid set.

Finally, we analyze the running time of the algorithm. Observe that we consider (n choose k) subsets of size k. Each of them has 2^k − 1 non-empty subsets (the S's). For each such S, we need to compute χ_S(x) for every sample x, which takes O(k·s) time. Thus, the time complexity of our algorithm is O(n^k·2^{3k}·k² log n/ε⁴).

2.1.1 Step 1: The gap between h(J*) and h(J)

We begin with the observation that for every set J, the distribution P_J is a junta distribution over the set J ∩ J*. Also, as we show in Appendix B, we have the identity

P_J(x) = Pr_{y∼P}[y^{(J∩J*)} = x^{(J∩J*)}]/2^{n−|J∩J*|}.   (2.1)

Note that if P is ε-far from being a junta distribution on the set J, then it is also ε-far from P_J. Therefore, by the following lemma we can infer that

h(J*) − h(J) ≥ 4ε².   (2.2)

Lemma 2.1.2.
Let P and P_J be the two k-junta distributions defined as above, and suppose they are ε-far from each other. Then we have

E[(P(x) − 1/2^n)²] − E[(P_J(x) − 1/2^n)²] ≥ 2^{−2n}·(2ε)².   (2.3)

Proof: Before we prove the inequality, we establish the following equality:

E[P_J(x)·(P(x) − P_J(x))] = 0.   (2.4)

Partition all x's into sets X_i such that any two vectors x_1 and x_2 in the same X_i satisfy x_1^{(J)} = x_2^{(J)}. We prove that for each X_i, Σ_{x∈X_i} P_J(x)·(P(x) − P_J(x)) is zero, which yields Equation (2.4). By Equation (2.1), P_J is a junta distribution on the set J ∩ J*. Therefore, for any two vectors x_1 and x_2 in X_i we have P_J(x_1) = P_J(x_2). Thus, we just need to prove that Σ_{x∈X_i} P_J(x) = Σ_{x∈X_i} P(x), which follows from Equation (2.1) directly.

Moreover, observe that E[D(x)] = 1/2^n for any distribution D. Therefore, we have

E[(D(x) − 1/2^n)²] = E[D(x)²] + E[1/2^{2n}] − 2·E[D(x)/2^n]
                  = E[D(x)²] + 1/2^{2n} − 2/2^{2n}
                  = E[D(x)²] − 1/2^{2n}
                  = E[D(x)²] − E[D(x)]².   (2.5)

Now, we prove Equation (2.3). Since P and P_J are distributions, E[P(x)] = E[P_J(x)] = 1/2^n. By this fact and linearity of expectation,

E[(P(x) − 1/2^n)²] − E[(P_J(x) − 1/2^n)²]
  = E[P(x)²] − E[P_J(x)²]
  = E[(P(x) − P_J(x) + P_J(x))²] − E[P_J(x)²]
  = E[(P(x) − P_J(x))²] + 2·E[P_J(x)·(P(x) − P_J(x))] + E[P_J(x)²] − E[P_J(x)²]
  = E[(P(x) − P_J(x))²]
  ≥ (E[|P(x) − P_J(x)|])²
  = d_{L_1}(P, P_J)²/2^{2n}
  ≥ 2^{−2n}·(2ε)²,

where the fourth equality comes from Equation (2.4) and the last inequality comes from the fact that the L_1 distance between P and P_J is at least 2ε.

2.1.2 Step 2: Equality of f(J) and h(J)

Below, we first show that the Fourier coefficient P̂(S) of any set S ⊄ J* is zero. This lemma allows us to infer that it is enough to compute the low-degree Fourier coefficients, because the other ones are zero. Intuitively, such a set S contains a coordinate that is zero or one with probability one half, independently of the junta coordinates; therefore, the Fourier coefficient of S is zero. We prove this formally in Lemma 2.1.3. Leveraging this lemma, we prove that the values of h(J) and f(J) are equal in Lemma 2.1.4.

Lemma 2.1.3. For any J ⊆ [n], let D be a junta distribution with J being the set of junta coordinates. For any S ⊄ J, D̂(S) is zero.

Proof: Observe that J might be the empty set, in which case D is the uniform distribution. Since S is not a subset of J, there is a coordinate i such that i is in S but not in J. Thus, the i-th coordinate of a sample x is one or zero, each with probability one half. We pair up all x's based on their agreement on the coordinates in [n]\{i}, and denote a pair by (x_0, x_1). Observe that since i is not a junta coordinate, D(x_0) = D(x_1). However, since i is in S, χ_S(x_0) = −χ_S(x_1). Therefore, we have

D̂(S) = (1/2^n)·Σ_{x∈{0,1}^n} D(x)·χ_S(x)
     = (1/2^n)·Σ_{(x_0,x_1)} (D(x_0)·χ_S(x_0) + D(x_1)·χ_S(x_1))
     = (1/2^n)·Σ_{(x_0,x_1)} (D(x_0)·χ_S(x_0) − D(x_0)·χ_S(x_0))
     = 0.

Now we are ready to prove that f(J) is equal to h(J) for any J ⊆ [n] of size k.

Lemma 2.1.4. Let f and h be the two functions defined above. For any J ⊆ [n] of size k, we have f(J) = h(J).

Proof: Recall that h(J) = 2^{2n}·E[(P_J(x) − 1/2^n)²] and f(J) = 2^{2n}·Σ_{S⊆J, S≠∅} P̂(S)². By Equation (2.5), we have

h(J) = 2^{2n}·E[(P_J(x) − 1/2^n)²]
     = 2^{2n}·(E[P_J(x)²] − E[P_J(x)]²)
     = 2^{2n}·(Σ_S P̂_J(S)² − P̂_J(∅)²),

where the last equality follows from Parseval's Theorem and the fact that P̂_J(∅) = (1/2^n)·Σ_x P_J(x)·χ_∅(x) = E[P_J(x)]. In addition, note that by Equation (2.1), P_J is a junta distribution over the set J ∩ J*. By Lemma 2.1.3, for any S ⊄ (J ∩ J*), P̂_J(S) is zero. Thus, we know

h(J) = 2^{2n}·(Σ_S P̂_J(S)² − P̂_J(∅)²) = 2^{2n}·Σ_{S⊆(J∩J*), S≠∅} P̂_J(S)².

Now it is clear that h(J*) = f(J*). Assume J ≠ J*.
Let S be a non-empty subset of J ∩ J* and let c be a fixed binary vector of size |S|. By the definition of P_J, it is not hard to see that Pr_{x∼P}[x^{(S)} = c] = Pr_{x∼P_J}[x^{(S)} = c]. Thus, by conditioning over all possible c, we can prove that Pr_{x∼P}[χ_S(x) = b] = Pr_{x∼P_J}[χ_S(x) = b] for b ∈ {+1, −1}. Therefore, we have

P̂_J(S) = (1/2^n)·Σ_x P_J(x)·χ_S(x)
       = (1/2^n)·(Pr_{x∼P_J}[χ_S(x) = 1] − Pr_{x∼P_J}[χ_S(x) = −1])
       = (1/2^n)·(Pr_{x∼P}[χ_S(x) = 1] − Pr_{x∼P}[χ_S(x) = −1])
       = (1/2^n)·Σ_x P(x)·χ_S(x)
       = P̂(S).

In this way, for any non-empty subset S of J ∩ J*, P̂_J(S) is equal to P̂(S). By Lemma 2.1.3, for any S ⊆ J which is not a subset of J*, P̂(S) is zero. Thus,

h(J) = 2^{2n}·Σ_{S⊆(J∩J*), S≠∅} P̂_J(S)²
     = 2^{2n}·Σ_{S⊆(J∩J*), S≠∅} P̂(S)²
     = 2^{2n}·Σ_{S⊆(J∩J*), S≠∅} P̂(S)² + 2^{2n}·Σ_{S⊆J, S⊄J*, S≠∅} P̂(S)²
     = 2^{2n}·Σ_{S⊆J, S≠∅} P̂(S)²
     = f(J),

and the proof is complete.

2.1.3 Step 3: Estimating f(J)

In the following lemma, we prove that with probability 2/3, for all J ⊆ [n] of size k, |f̂(J) − f(J)| is less than 2ε². Note that for any invalid output J, f(J*) − f(J) is at least 4ε². Thus, f̂(J*) > f̂(J) with probability 2/3, so the learning algorithm does not output an invalid J.

Lemma 2.1.5. Let P be a junta distribution on a set J* of size k. Suppose we draw s = 9·2^{2k}·ln(6·2^k·(n choose k))/(2ε⁴) samples from P. For any set J of size k, we estimate f(J), as defined before, by

f̂(J) = Σ_{S⊆J, S≠∅} (2·[# samples x with χ_S(x) = 1]/s − 1)².

With probability 2/3, for all of the J we have |f̂(J) − f(J)| < 2ε².

Proof: By the definition of the Fourier coefficients, we have

f(J) = 2^{2n}·Σ_{S⊆J, S≠∅} P̂(S)²
     = Σ_{S⊆J, S≠∅} (Σ_x P(x)·χ_S(x))²
     = Σ_{S⊆J, S≠∅} (Pr_{x∼P}[χ_S(x) = 1] − Pr_{x∼P}[χ_S(x) = −1])²
     = Σ_{S⊆J, S≠∅} (2·Pr_{x∼P}[χ_S(x) = 1] − 1)².

Now, for brevity, let P_S = 2·Pr_{x∼P}[χ_S(x) = 1] − 1 and let P̂_S = 2·[# samples x with χ_S(x) = 1]/s − 1. First, notice that P̂_S is an estimator for P_S whose error is unlikely to be more than ε' = ε²/(10·2^k): by the Hoeffding bound we have

Pr[|P̂_S − P_S| ≥ ε'] = Pr[ |[# samples x with χ_S(x) = 1]/s − Pr_{x∼P}[χ_S(x) = 1]| ≥ ε'/2 ] ≤ 2e^{−sε'²/2}.

Note that in the learning algorithm we estimate this value for the 2^k subsets of each J, and there are (n choose k) such J's. Thus, for s ≥ 2·ln(6·2^k·(n choose k))/ε'², it is not hard to see that the probability of estimating at least one P_S inaccurately is at most 1/3, by the union bound. Now, we can assume that a_S = P̂_S − P_S is in the range [−ε', ε'] for every S, with probability 2/3. We compute the maximum error of f̂(J) as follows:

|f̂(J) − f(J)| = |Σ_{S⊆J, S≠∅} (P̂_S² − P_S²)|
             = |Σ_{S⊆J, S≠∅} (2·P_S·a_S + a_S²)|
             ≤ Σ_{S⊆J, S≠∅} (2·|P_S|·|a_S| + a_S²)
             ≤ 2^k·(2ε' + ε'²)
             < 2ε²,

where the last inequality follows from ε' = ε²/(10·2^k) ≤ 1.

2.2 A lower bound for learning juntas

We now complete the proof of the lower bound on the number of samples required to learn juntas.

Theorem 1.1.3 (Restated). Fix 0 < ε < 1/2 and 1 ≤ k ≤ n. Any (s, ε, 1/4)-learner for k-junta distributions must have sample complexity s = Ω(max{2^k/ε², log(n choose k)/ε}).

Proof: The first part of the lower bound, s = Ω(2^k/ε²), follows from the (folklore) lower bound on the number of samples required to learn a general discrete distribution over a domain of size N: Ω(N/ε²) samples are required for this task. Observing that the set of juntas on the set J = {1, 2, …, k} contains all general discrete distributions on a domain of size N = 2^k, we conclude that any k-junta learning algorithm must draw Ω(2^k/ε²) samples, even if it is given the identity of the junta coordinates.

We now want to show that s = Ω(log(n choose k)/ε). By Yao's minimax principle, it suffices to show that there is a distribution P over k-junta distributions such that any deterministic algorithm that (ε, 1/4)-learns D ∼ P must draw at least s = Ω(log(n choose k)/ε) samples from D. For non-empty sets S ⊆ [n], let D_S be the distribution with the
For non-empty sets S C [n], let Ds be the distribution with the 31 probability mass function + iE)/2T'~1 ps~) =(I (I - c)/2n-1 if (Dis xi =-I if @jes xi = 0. Let Do be the uniform distribution on {0, 1}". We let P be the distribution defined by P(DO) = - and P(Ds) = - 2 for every set of size 1 < < k. KSI Every function <k)1 in the support of P is a k-junta distribution, and they are all E-far from each other. Fix any deterministic learning algorithm A that (f, )-learns the k-junta distribu- tions drawn from P. Let X be a sequence of s samples drawn from D. The success probability of A guarantees that - 3 3 < Pr[A identifies the correct distribution] 4 S ps(X) - 1[A outputs Ds on X] P(Ds) se(n) Xe{0,1}"nS max P(Ds) -ps(X). < XE{O 11}nXs se(k) We can partition the set of s-tuples of samples, {0, 1}"' into S E () () parts xs, -) P(DT) -pT(X) (breaking such that X E Xs iff P(Ds) -ps(X) = max ties arbitrarily). For any set of samples X, we have that P(DO) - ps(X) 2-"t since Do is the uniform distribution. This means that if X E Xs for some S 4 0, then P(Ds) - ps(X) > 2 "'-1 and hence ps(X) > (( " ) - 1)2-'s. Let Ks(X) denote the number of samples x E X such that @iES xi = 1. Then from the above inequality we have < k) 1)2-" << ps(X) - 2 + )s(X)(_ < (1 + 2,) s(X) . 2 Therefore, s > Ks(X) Q(log ()/e), 2 K- e 2 < _ g)ss(X) Ks (X) - as we wanted to show. 32 2 -(-1) 2-"i. Chapter 3 Testing junta distributions 3.1 A test algorithm for junta distributions In this section we consider the testing junta distributions. Considering the Definition 1.3.1, we want to see if there exists a subset of coordinates of size k, namely J, such that condition on any setting of x), we get a uniform distribution. Hence, testing that a distribution is junta relies on a test of a collection of 2k distributions. In Section 3.1.1, we provide uniformity test for a collection of distributions, which is a natural problem by its own too. In Figure 3-1 we show the reduction and prove it formally in the following theorem. Theorem 3.1.1. The algorithm shown in Figure 3-1 is an (e, 1/3)-test for k-junta distributions using O(Sklog'n) - 6(2/2k4/E3 + 2k/c). Proof: First, note that we amplify the confidence parameter of the uniformity test of a collection by repeating the algorithm [2log 3 (3("t))] times and taking the majority of the answers. By Theorem 3.1.2, the uniformity testof a collection uses at most 2S samples and returns the correct answer with probability at least 2/3. It is not hard to see, the majority of the answers is correct with probability 1 - 1/ (3 ()). Therefore, by the union bound, we can assume we test all the J's correctly with probability at 2k and n hard to see that the total number of samples is 33 - O(k 2'-k in Theorem 3.1.2, it is not - log n -S) = O(2 / 2 k 4 /e 3 +2 / least 2/3. In addition, by setting m = ....... ... "I'll, "I'll, "I'll, - I,"'." ........... - I'll" ........ .... , ,"I'll - 1 -1 - - 1- 1 I'll Testing junta distributions {(Input: e, n, oracle access to draw sampies from a distribution P}} 1. Draw s = 2[21og 3 (3(")) -1 2. For every subset of [n] of size k, namely J': 2.1 Convert each sample x into a pair (x(J), x[.l\j). 2.2 Repeat the uniformity test of a collection withe the same e [2log 3 (3('))] times each time using at most 28 samples. 2.3 If the majority of the answers is "Accept", Accept. 3. Reject. Figure 3-1: A testing algorithm junta distributions Now we prove the correctness of the algorithm. 
For an arbitrary set J of size k, let Pi be a conditional distribution over the domain {0, 1}a-k such that P(z) Prx~p[x l\J) = Z / y- ) = Ci]. We need to consider two cases when P is a junta distribution and when it is c-far from being a junta distribution. We show to show that a junta distribution is passed by probability at least 2/3. Assume P is a junta distribution on the set J*. By definition, Pr[x([flI\J*) |x(J) 1 / 2 'nk In other words, the Pis are uniform. Thus, we can assume that the pair (x(.), x('\J*)) is distributed according to a collection of uniform distributions. This means that in the iteration where J = J* the distribution should be accepted by the uniformity test of a collection. Thus, we answer the correct answer with probability at least 2/3. Third, we show that the algorithm does not accept a distribution which is Efar from being a junta distribution with probability at least 2/3. Let P be such a distribution. Let define Pj be a junta distribution on J, as it is defined before, BP(x)d Pr [(J) .Nt s- r Y-P .L Below, we compute the distance of Pj and P. Note that P is c-far from Pj. Let Xi 34 , , 11-.1- . .. . . be the set of all x's such that x(J) = Ci where Ci is binary code of i over k bits. Then, x 2k -1 =EJ E 2 -1 xEXi i=1 1P(x) - PJ(-)1 2Z' 1E Pr Pry ij [y E Xi]- =E i=O xcX, Y-4 Observe that Pry-p[y E Xi] = Pryp[yW() =] P(x) [y c XiJ Pr y~P = Pr _ _____ Pr [y c Xi] y~P p [yW() = ci] by definition of Pj. Thus, 2 dL1(P1 PJ) k-1 = E =O 2 Pr [y EXi-)) Pr [y c Xi] XX y~-a' Y~P Pr [y E Xi] Y~Pi k -1 E E Pr [y c Xi] -Pi(x([\J)) i=O XEXi Y-P 2k-1 E Pr [y E Xi]dL1 (PiU) - = i=0 YP* Note that if we the distribution as a collection of Pi's then Pry p[y E Xi] is basically the weight of Pi namely wi. In other words, we see a sample from Pi with probability wi. Thus, the value of dL, (P, PJ) or (equivalently dt,(P, Pj)) is the weighted distance of the collection. Since P is c-far from any junta distribution, the dLl(P, PJ) is at least 2e. Thus, the collection is e-far from being a collection of uniform distribution and it should be rejected by the uniformity test with high probability. l proof is complete. 3.1.1 Thus, the Uniformity Test of a Collection of Distributions In this section, we propose an approach for uniformity test of a collection namely C = {Pilwi}. Note that when we sample a collection we get a pair (i, x) which means the distribution Pi is selected with probability wi and x is drawn from 'P. Observe that when wu are uniform then the problem is related to uniformity test over a single 35 Uniformity test of a collection ((Input: E, n, m, oracle access to draw samples from the collection {Piwi}"..1 .)) 1. B +- [log(4m/c)] 2. S +- max{[80m log 12(m + 1)/cl, [8B/E[2log 3 (6B)][21 'B/6mn/ 2 11} 3. Draw a sample, namely s, from Poi(S). 4. If s > 28, reject. 5. Draw s samples from the collection. 6. di +- si/s where si's is the number of samples from Pi. 7. For I = 1,. .. , B 7.1 B, +- {ie/4m2i~ 7.2 WI 1 *+ 1 <,&i E/4m2i} z- i iE B1 7.3 S, +- E si 7.4 If WI > E/8B and S > 2 13 B2 log 3 6B /6mn/E 2 i. Run bucket uniformity test with distance parameter E/2B and maximum error probability 1/6B ii. If the test rejects, Reject 8. Accept. Figure 3-2: A test algorithm for uniformity of a collection of distributions distribution over the domain [in] x [n]. Based on this observation, we use bucketing argument in a way that each bucket contains distributions with wi's within a constant factor of each other. 
In this section, we formally prove the reduction to the uniformity test in a bucket and in Section 3.1.2, we show how we can test each bucket. Theorem 3.1.2. The algorithm shown in Figure 3-2, is an (E, !)-test that test the uniformity of a collection of distributions {P1iwij using s < 2S samples where S O(B3 mn./ 3 +Orn/g) where B O(log/e). Proof: In the algorithm instead of drawing a fixed number of samples, we use the 36 "Poissonization method"1 and draw s samples where s is a random variable drawn from a Poisson distribution with mean S. Thus, we can assume the number of samples from each distribution Pi, namely si, is distributed as Poi(wi - s) and is independent from the rest of s.'s. Now, we show in the following concentration lemma that the si's are not far from their mean. Equivalently, we prove that 'i = si/s is close to we. Lemma 3.1.3. Suppose we draw s ~ Poi(S) samples from a collection of distributions { Pi }i such that S > 80 m log 12(m + 1)/c. Let dij = si/s where si is the number of samples from, Pi. With probability of 5/6 all of the following events happen. " s is in the range [2S,S/2]. " For any i if w> e/8m, then zi is in the range [lwi, 2w]. " For any i if w < e/8m, then ibi < e/4m. Proof: Here we need to use concentration inequalities for Poisson distribution. (See Theorem 5.4 in [32]) For a Poisson random variable X with mean ft eE ( Pr(X > (1 + e)t) < and Pr(X < (1 - e)p) < Thus, it is not hard to see 1 - Pr[p/2 < X < 21 < 0.68" + 0.86,4 < 2 -2-p/5 < 1 - 6(m + 1) where the last inequality holds for it > 5 log 12 (m + 1). Thus, s is in the range 1 Observe that when we draw a fix number of samples, the number of appearances of each element depends on others. This would usually convolute the analysis of algorithms. However, the Poisson distribution has a very convenient property that makes the number of appearance of each symbol independent from the others. In the literature like 1321, it has been proved that if a single distribution P sampled Poi(n) times, then the number of samples equals to the a symbol x is a random variable from a Poisson distribution with mean nP(x). This also implies that the number of appearance of each symbol is independent of the others. 37 [S/2, 28] with probability 1 - 1/6(rn + 1). We assume this fact is true for the rest of the proof. Note that by properties of the Poissonization method [321, the si's are distributed as independent draws from Poi(wi - s). For wi > e/8m, since we assume that s is at least S/2, we can conclude that si is in the range [wi - s/2, 2wi - s] or equivalently '&j is in the range [wj/2, 2wi] with probability at least 1 - 1/6(m + 1). Now assume wi is less than e/8m. Clearly, that the expected value of si is less than s e/8m. Consider another random variable X which is drawn from Poi(s - E/8m). Thus, Pr[si > s - E/4m] < Pr[X > s - e/4m] < - 1 1 6(m + 1) Thus, by the union bound over the si's and the s with probability at least 5/6 the E lemma is true. Partitioning into buckets: Based on the idea that uniformity test of a collection of distributions is easier when wi's are uniform, we partition the distributions into buckets such that wi in the same buckets are within a constant factor of each other. Assume we have B = [log(4m/e)l buckets where the I-th buckets contains all the distributions Pi's such that E/4n2- 1 < di < E/4m2l. By Lemma 3.1.3, the wi's are in the range [e/8m2' 1 , e/2m2l]. Observe that each bucket I can be viewed as a (sub-)collection of m. 
= 1B11 distributions with the new weights wi/WI where W is the total weight of the I-th bucket. Reduction to the bucket uniformity test: Here, we want to show that there is a reduction between uniformity test of a collection of distributions and uniformity test of each bucket as a sub-collection of distributions. For uniformity test of a collection, we partition the collection into buckets as explained before. Then for each bucket, we invoke the bucket uniformity test with distance parameter E/2B and with error probability of at most 1/6B. To prove the correctness of the reduction, we consider the two following cases: * {Pijwi}' is a collection of uniform distributions. Since all of the dis- tributions are uniform, all buckets contain only uniform distributions. 38 Then, all the B invocations of bucket uniformity test should accept with probability at least 1 - 1/6B. Thus, none of them rejects with probability 1 - 1/6 by the union bound. is c-far from being a collection of uniform distributions. We o {Pilwi}l prove that at least one bucket should be rejected with high probability. Note that in our bucketing method we ignore the distribution with ib < e/4m: by Lemma 3.1.3, the total weight of these distributions is at most e/2m and since the total variation distance is at most one, they can not contribute to the weighted distance by more than e/2. Thus, B E E wi - dt ('Ps,,U ) > e /2. I=1 iEBi By averaging there is at least one bucket, namely 1, such that EiEB, w -dtv(Pi,U) e/2B. Since the total variation distance is at most one, EiEB, w e/2B. In addition, we would like to consider this bucket as a separate collection. Since W, < 1, if we renormalize the weights, we can also see that E - - 2BW ( WB. E 2B Now if we show the assumptions of Corollary 3.1.6 are satisfied, then the bucket uniformity test rejects the bucket 1 with probability at least 1 - 1/6B. It is not hard to see our estimation of the new weight of the i-th distribution in bucket I is i/Wi which is in the range [w/4Wi, 4wt/Wi] by Lemma 3.1.3. Moreover, Since the wi's are in the range [e/8m2 1 , e/2m2'], wi/W 1 are at most 8/mi. In addition the number of samples in this bucket is a -8 > S E si icB, iEB1 wi >ES > [210g 3 (6B)] [2. 2 B 2 V6mn/C 2 iEB1 Hence, the proof is complete. 39 bucket uniformity test ((Input: E, 6, c, T, n, m, estimated 'i's, [2log 3 1/61 sets of s = [8c3 m 3Tn/C2] samples drawn from C.)) Repeat the following algorithm [2log 3 1/31 and output the majority answer. 1. Draw s = 8c 3 5 m /3Tn/e 2 samples from the collection. 2. Y +- the number of unique elements in these s samples. 3. For each sample (i, x), replace x with x' uniformly chosen from [n]. 4. A*<- C.r 5. Y' +- the number of unique elements in these s samples. 6. If IY - Y'j > A/2, 6.1 Reject. 7. Otherwise, 7.1 Accept. Figure 3-3: A uniformity test for collection of distributions with special constraints. 3.1.2 Uniformity Test within a Bucket In this section, we provide a uniformity test for a collection of distributions when the weights are bounded. In other words, the algorithm distinguishes whether the weighted distance is zero or at least e. Our algorithm is based on counting the number of unique elements which is also negatively related to the number of the coincidences. This idea was proposed before in 135, 7 for uniformity test of a single distribution. 
The high level idea is to estimate the expected value of the number of unique elements when the underlying collection is an unknown collection and compare that value to the case when it is a collection of uniform distributions. If these values are close enough to each other we can infer that the unknown collection is actually a collection of uniform distributions. Otherwise it is not. The algorithm is showed in Figure 3-3 and its correctness is proved in the following theorem. Theorem 3.1.4. Assume we have a collection of distributionsC 40 {PIw.};'1 such . 1- 1- 1- 1---y.......................... .. that wi < T for all i's. We have s = [2log3 1/] [8c3 m /3Tn/es1 samples drawn form. C such that for each distribution Pi we get s ~ Poi(w - s) samples from Pi. Suppose si is in [wi - s/c, cw - s] for a constant c. Then, the test shown in Figure 3-3 distinguishes that C is a collection of uniform distributions or it is c-far from it with probability 2/3. Proof: In the following, we prove the repeating part of the algorithm outputs the correct answer by probability at least 2/3. Hence, we can amplify the confidence parameter by repeating it [210g 3 1/] times and taking the majority of the answers. Thus, the uniformity test Algorithm outputs the correct answer with probability 1-5. Let Y be a random variable that indicates the number of unique elements in a set of samples. Notice that we consider each sample as an ordered pair (i, x) which means x is drawn based on Pi. Thus, (i, x) is not equal to (j, ). Similarly, let YI denote the number of unique elements from distribution P. It is not hard to see Yi To abbreviation, Ef-p }[Y] denotes the expected value of Y when samples are drawn 1. In addition, we denote the expected value from the underlying collection {Pilwi} of Y by E{u} [Y] when the underlying collection is a set of uniform distributions with the same weights as P (i.e. {Ulwi}" 1 ). Now we need to answer this question: Does the number of unique elements indicate that the collection is a set of uniform distribution or not? The answer is Yes. We show E{p, I[Y] is smaller than Eu} [Y] if the collection is far from being a collection of uniform distributions. Therefore, if we see a meaningful difference between Efp,[Y and Eu} [Y], we can conclude {Pi li }I'I is not a collection of uniform distributions. For a single distribution P, in [35], Paninski showed that the difference E-P[Y] and Eu[Y] is related to the distance between P and the uniform distribution. Eu[YI - EPY > S Since we are looking for E{u}[Yj (not EUm] directly over the domain [m] x [n]. separately. (dLI(P, U)) 2 n [Y]), we can not use this inequality However, we use this inequality for each Pi Observed that the way that we convert the samples allows us to get 41 - -- the same number of samples, si from Pi and U over the domain of size n. Thus, we can use the above inequality separately. Hence, by linearity of expectation and Cauchy-Schwarz inequality, we have Trn E{u}[Y]- E{p,1 [Y] = (Eu[Y] - E-pY]) in n 2 i (si - dL1 (Pi, i=1 U))2 (w > c2 ulm i dL i i,- where the first inequality follows from [351, the second follows from that si E [Wi s/c, cwi . s]. Set A to (s 2 E2 )/(c 2 - m - n). Therefore, if C is E-far from being a collection of uniform distributions, then Elul[Y] - E{p.,[Y] 2 C271 -' = A, (3.1) because the weighted L1 distance is at least 2e. However, these two expected values cannot be calculated directly since wi's and Pi's are unknown. estimate them. 
Thus, we need to By definition, the number of unique elements in s samples, Y, is an unbiased estimator for Efp,}[Y]. To estimate E{11 }[YI, we reuse the samples we get from the collection and change each sample (i, x) to (i, x') where x' is chosen respectively, we can assume the sample (i, x') is drawn from the collection {u Iwi }i . uniformly at random from [n]. Since i and x' are picked with probability wi and 1/n Therefore, the number of unique elements in the new sample set, namely Y', is an unbiased estimator for Elul [Y]. Below, we formally prove that the number of unique elements, Y, (and Y') can not be far from their own expected value using Chebyshev's inequality. To do so, we need to bound the varianc e Lemma 3.1.5. We have s samples drawn form, a collection of distributions C {Pi Iwi} such that for each distribution Pi we get si ~ Poi(wi - s) samples from, Pi. 42 ............ a............ -111111.111.: I Suppose the si is in the range [w- s/c, cw s] for a fixed constant c. Also, each 'w is at most T. Then Eu[Y| - EPi[Y] + c s T n Var[Y] Proof: Bounding the variance of the number of unique elements has been studied in [35]. Paninski showed the following inequality Eu[Y] - Ep[Y] + Varp[Y] Here, since we know the si's are independent, we have rn Var[Y] - Z Var[Yi] i=1 (Eu[Yi] - E,[Y]+ C2 S2 <Eu[Y] - E'P[Y]+ c ( rn 2 On the other hand, it is not hard to see that since wi's are less than T, we have 2 <i& ( w T < T. Combining the two above inequalities we get Var[Y] < Eu[Y] - EPi[Y1 + c s T n Now, we are ready to use Chebyshev's inequality to prove that we are able to estimate Y accurately. Below we consider two cases based on the underlying collection. * Case 1: C is a collection of uniform distribution: In this case Ejpj is equal to E{u} [Y], so by Lemma 3.1.5 the variance of Y is at most c2 s 2 T/n. Thus 43 by Chebyshev's inequality we have Pr[IY - Efu}[Y]I < 16Var[Y]/A 2 A/4] 16 c6 T n12 2 4 S E It is not hard to see that for s > 4c3 mv/6Tn/ 2 the above probability is 1/6. Similar to Y, we can prove that the probability that Y' is A/4 far away from its mean is less than 1/6. Therefore, Y' - Y is at most A/2 with probability at least 1 - 1/3. * Case 2: C is c-far from being a collection of uniform distributions:. Therefore by Equation 3.1, E{u} [Y] - Ep, I[Y] is at least A. Similar to the above, we use Chebyshev's inequality. By Lemma 3.1.5 and Equation 3.1 we have Pr[IY - E{-pl [Y] > (E{tji[Y] - E{pl [Y])] < Var[Y]/(E{u}[Y] - ElpI[Y]/4)2 3 2 Elul[Y] - E-ps}[Y + c S T/n (Elul [Y] - ElpI[Y]/4)2 16 < E{u}[Y] - E{pI[Y] 16c 2 8 2 T + n - (E{u}[Y] - Elp [Y])2 16c 2 s2 T 16 A 2 n 16c nm 16c 6 Tnm 2 S 2E2 82 f4 32 c 6 Tnm 2 K 24 Note that T by definition can not be less than 1/rn that's why the last inequality is true. It is straightforward that for s > 8c3 mvN3Tn/c 2 the above probability is at most 1/6. On the other hand, similar to what we had in case one, Y' cannot go far from its mean too. Thus, Pr[IY' - E{u}[Y]I > 1(Elul[Y] - E-p,}[Y])] < Pr[Y' - E{u}[Y]I > A/4] < 1/6. Therefore, Y' - Y is at least (E{u [Y] - Ep, [Y])/2 > A/2 with probability at 44 least 1 - 1/3. In both cases, the uniformity test outputs the correct answer with probability at least 2/3. such Corollary 3.1.6. Assume we have a collection of distributionsC = {RJiwi} that wi < 8/m for all i's. We have s = [2 log 1/61 [210 /6mn/e 2] samples drawn form C such that for each distribution Pi we get si ~ Poi(wi - s) samples from Pi. Suppose si is in [wi - s/4,4wi - s]. 
Then, there exists a uniformity test for C that outputs the correct answer with probability 1 − δ.

Proof: This corollary follows directly from Theorem 3.1.4 by setting T = 8/m and c = 4.

3.2 A lower bound for testing junta distributions

We prove a lower bound for testing junta distributions in the following theorem.

Theorem 3.2.1. There is no (ε, 1/3)-test for k-junta distributions using o(2^{n/2}) samples, for any ε < 1/4.

Proof: First, we construct two families of distributions, F+ and F−, such that the distributions in F+ are junta distributions and the distributions in F− are 1/4-far from being junta distributions. Thus, any (ε, 1/3)-test with ε < 1/4 must distinguish these two families with probability 2/3. Below, we prove that the distributions in these two families are so similar that no algorithm drawing only o(2^{n/2}) samples can distinguish them. Hence, there is no (ε, 1/3)-test for k-junta distributions using o(2^{n/2}) samples when ε < 1/4.

Let C_i be the binary representation of i with k bits, for i = 0, 1, ..., 2^k − 1. Let X_i be the set of all x such that x^([k]) = C_i. In other words, the X_i's partition the domain based on the first k bits. Let F+ be the family that contains only the single distribution P+ constructed as follows.

1. For all x in X_i, if the parity of the bits of C_i is odd, set P+(x) = 0.
2. For all x in X_i, if the parity of the bits of C_i is even, set P+(x) = 1/2^{n−1}.

Note that P+(x) depends only on the first k bits of x. Thus, P+ is a junta distribution on [k].

We construct a distribution P− by the randomized process below, and let F− be the set of all distributions P− that this process can generate.

1. For all x in X_i, if the parity of the bits of C_i is odd, set P−(x) = 0.
2. If the parity of the bits of C_i is even, pick half of the x's in X_i at random and set P−(x) = 1/2^{n−2}. Set the probabilities of the other half to zero.

Now, pick a distribution P− from F− uniformly at random. We show that P− is 1/4-far from being a junta distribution. Let P_J be an arbitrary k-junta distribution with biases a_i as defined in Definition 1.3.1. Define X'_i to be the set of all x such that x^(J) = C_i (in contrast to the X_i's, which partition the domain based on x^([k])). We show that at least half of the elements of each X'_i have probability zero under P−. If J = [k], this follows directly from the construction of P−. Otherwise, since |J| = k, there exists a coordinate l ∈ [k] that is not in J. Consider an element x ∈ X'_i, and let y be x with the l-th bit flipped. Since l is not in J, x^(J) = y^(J), so y is also in X'_i. On the other hand, the parities of the first k bits of x and y differ, because l ∈ [k], so one of the two points has probability zero under P−. Since all the elements of X'_i can be paired up in this way, at least half of the elements of X'_i have probability zero. Hence, P− is 1/4-far from P_J:

d_{L1}(P−, P_J) = Σ_x |P−(x) − P_J(x)|
             = Σ_{i=0}^{2^k−1} Σ_{x ∈ X'_i} |P−(x) − P_J(x)|
             = Σ_{i=0}^{2^k−1} Σ_{x ∈ X'_i} |P−(x) − a_i/2^{n−k}|
             ≥ Σ_{i=0}^{2^k−1} Σ_{x ∈ X'_i : P−(x) = 0} a_i/2^{n−k}
             ≥ Σ_{i=0}^{2^k−1} a_i/2 = 1/2,

where the last inequality follows from the fact that at least half of the 2^{n−k} elements of X'_i have probability zero, together with Σ_i a_i = 1. Thus, the total variation distance is at least 1/4. Since P− is 1/4-far from every k-junta distribution P_J, it is 1/4-far from being a junta distribution.

The sketch below illustrates how samples from P+ and from a random P− are generated. It remains to show that, with high probability, P+ and a P− picked from F− uniformly at random are indistinguishable.
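The following Python sketch (names and structure are ours, purely for illustration) samples from P+ and builds one random P− of the family F−. It materializes the support of P− explicitly, so it is only feasible for small n.

    import itertools
    import random

    def parity(bits):
        return sum(bits) % 2

    def sample_p_plus(n, k):
        # First k bits: a uniformly random even-parity pattern;
        # remaining n - k bits: uniform. This is the single distribution in F+.
        while True:
            head = [random.randint(0, 1) for _ in range(k)]
            if parity(head) == 0:
                break
        tail = [random.randint(0, 1) for _ in range(n - k)]
        return head + tail

    def draw_p_minus(n, k):
        # Build one P- from F-: for every even-parity pattern on the first k
        # bits, keep a random half of the suffixes; the kept points each get
        # mass 1/2^(n-2), so P- is uniform over its support.
        support = []
        for head in itertools.product([0, 1], repeat=k):
            if parity(head) == 1:
                continue
            suffixes = list(itertools.product([0, 1], repeat=n - k))
            random.shuffle(suffixes)
            support.extend(head + suf for suf in suffixes[: len(suffixes) // 2])

        def sample():
            return list(random.choice(support))
        return sample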
Note that we can consider a sample from these distributions to be a sample from a collection of 2 k-1 distributions: For each sample r consider the first k bits, X 1 , ..., Xk, as the index of distribution, and then the last n - k bits xk+1, -- , x,, to be the sample from distribution x 1 , ... Since odd parity patterns do not show up, there are exactly the bits x 1 ,. .- Xk, 2 k-1 possible settings on and we see each of them with uniform probability for both P- and P+. Moreover, conditioned on fixing the first k bits, the distribution over is uniform for , Xk. 'P+. x([n]\1k])'s On the other hand, conditioned on fixing the first k bits, the distribution over x(QH\[k)'s is uniform over exactly half of the settings of In 1301 (Lemma 4.3), it has been proved that if we draw o( / 2 k - Xk+1, .- n- = o(2I/2), no algorithm can distinguishes P+ and P- (more precisely the collections correspond to them) with probability more that 1/2 + o(1). 47 48 Chapter 4 Learning and testing feature-separable distributions 4.1 Testing feature-separable distributions Testing feature-separable distributions is related to testing closeness of two distributions: even if we are given the separated feature i and the parameter a (as it is defined in Definition 1.3.4), testing feature-separable distributions requires that we make sure the distributions over x-0 condition on xi = 1 and xi - 0 are close. The problem of testing closeness of two distributions is considered in [8, 42, 201. Here we show reductions between closeness testing and testing feature-separable distributions in Theorem 4.1.1 and Theorem 4.1.2. In (201, an algorithm is given for (e, 6)-testing the closeness of two distributions p and q using sid( 6 ) O(poly(1/e)N 2 / 3 log N log g) samples where co is a constant. closeness-test(S,), Sq c, 6) where Sp and Sq denote the We refer to this algorithm as sample set of size at least sid(6 ) drawn from p and q respectively. Theorem 4.1.1. Feature-separable test, shown in Figure 4-1, is an (e, 3)-test for feature-separable distributions. Proof: First, we show that the algorithm does not reject feature-separable distributions with high probability. Assume P is a feature-separable distribution with the 49 Feature-separable test (Input: E and oracle access to draw sampIcs from distributionP)} 1. Draw si =[16 In"" samples. 2. For i = 1,...,n 2.1 Let ai = (#samples x with xi = 1)/s Prx-p[xi = 1]. 2.2 If ai < or ai > 1 Accept. -c, 3. Draw S2 = 4sid(0.1/n)/e = 2 be an estimation of O(poly(1/c)2 2 ,/ 3 n log n) samples. 4. Fori= 1, ... ,I n: 4.1 Split samples into two sets, Y' and Y based on i-th coordinate and remove this coordinate. 4.2 If there is enough samples, Run the closeness-test(Yo, y1, E,1) 4.3 If the test accepts, Accept 5. Reject Figure 4-1: A test algorithm for feature-separable distributions separated feature i. The rejection of P means the "if condition" in Lines 2b, 4b, and 4c do not hold, which implies the following: " ai is in [1E, 1 - 3e]. " For the i-th iteration, either there are not enough samples or the closeness- test(Yo, Y,, e, 1) rejects. Here, we show that the probability of these two events happening together is small. Observe that by the Hoeffding bound it is unlikely that ai deviates from its mean, 50 Pr[xi = 1], by more than e/4. Since ai is at least 3e/4, we have Pr[Pr[xi = 1] < c/2] 1] + { < 41 Pr[Pr[xi 1] + i < a] < Pr[Pr[xi K4 < -(2C2s1)/16 =0( 2) Therefore, after drawing S2 samples, we expect to see at least Es 2 /2 samples with xi = 1 with high probability. 
However, this is twice of the number of samples we need: 2 sid(0.1/n). By the Chernoff bound, the probability of not having enough samples is o(1). By similar arguments, one can prove that there are also a sufficient number of samples with xi = 0 with probability 1 - o(1). Additionally, since i is the separated feature coordinate of P, the probability that closeness-test(Y, Y1 , c, -) rejects on the i-th iteration is at most 0.1/n = o(1). By the union bound the probability of rejection in both cases is o(1). Now, assume P is E-far from being a feature-separable distribution. We prove that the algorithm rejects with probability at least 2/3. First, note that any featurerestricted distribution is also a feature-separable distribution, which implies that P is c-far from being feature-restricted too. By Lemma 6.3.1, the probability of xi = 0 is at least e for each coordinate i. It is straightforward to similarly prove that the probability of x= 1 is at least E for each coordinate i. By the Hoeffding bound, the probability that ai (- [c, 1 - je] is at least 1 - 1/n 2 for each coordinate i. By the union bound, the probability of accepting P due to the wrong estimation of ai's is at most 1/n = o(1). In addition, the closeness-test(Y', Y', e, 1) mistakenly accepts P with probability 1/10n. By the union bound, we will not accepts P with probability more than 0.1. Therefore, the total probability of accepting an E-far distribution is at most 0.1 + o(1) < 1/3. Moreover, the Feature-separabletest uses at most sample complexity is O(poly(1/)22 -n/3n log n). Now, we show Q( 2 2n/3) sI 1 + s2 samples. Thus the I samples are required to test feature-separable distributions. 51 Theorem 4.1.2. There is no (e, ' )-test for feature-separabledistributionsusing o( 2 2n/3) samples. Proof: We prove this theorem by showing a reduction between the feature-separable testing problem and the closeness testing problem. The main idea is that if a distribution is feature-separable the two distributions over x's with xi = 1 and x's with xi = 0 have to be equal for some i. We provide two distributions on {0, 1}i.-1 namely p and q such that they seem quite similar if we draw too few samples. These distributions are used to prove lower bounds for the sample complexity in [421. Suppose we have a distribution PI on {0, 1} such that the distribution over x's with xi = 0 is p and the distribution over x's with xi = 1 is q. Since p and q seem similar to each other, P1 is indistinguishable from a feature-separable distribution, although it is not one. Thus, any (f, 3/4)-test has to draw enough number of samples to reveal the dissimilarity of p and q. We construct two distributions p and q as explained below. First, we pick two sets of elements: heavy and small elements. As the term indicates, heavy elements are much more probable than small elements. The distributions p and q have the same heavy elements but their small elements are disjoint. If one draw too few samples, heavy elements may conceal the difference between two distributions. 1. Pick 22(-1)/3 elements randomly and set p(x) = q(x) = 1/22(I-1)/3+1 2. Pick two random disjoint set of 2"-1/4 elements, P and Q yet. 3. Let p(x) = 1/2n--2 for x E P and q(x) = 1/2n-2 for x E Q. Now, we construct two distributions. _P=1/2 1/2 - p(x(')) if xi = 1 1/2. p(x(')) if xi = 0. - ~()if 1/ 2 -p(x(')) 52 Xi = i if xi = 0. which are not picked Clearly, P, is a feature-separable distribution. In addition, separable distribution with probability 1 - o(1). 
Assume distribution with the separated feature equal to i. Let t E {o, 1} j P2 P2 is a not a feature- is a feature-separable and bias parameter a. Clearly, j is not - 2 and bo, b, E {0, 1}. We define x = vec(bo, bi, t) be a vector of size n such that xi = bo, xj = b1 and X(nMfsl - t. Fix a vector t of size n - 2. Let Xbo,bi = vec(bo, bl, t). Since P2 is feature-separable we have 'P2 (X 0 ,1 ) + P2 (Xi,1) 1 - a a (P2 (X 0 ,0 ) + P2 (Xi, 0 )). Or equivalently, Let ti = x1(i = (i and to = x =('). We rewrite the above equation as below ,o 1- a 1 (p(to) + q(to)). -(p(ti) + q(ti)) = 2a 2 Note that there are only three different possible values in the distributions p and q based on our construction. It is not hard to show that the probability of holding above equality for all t is o(1). Thus, distinguishing between P1 and 'P2 with high probability is equivalent to the closeness testing problem. 4.2 Identifying separable features Definition 4.2.1 (Separable feature identifier). The algorithm A is a (s, E, 6) -separable feature identifier if, given sample access to a feature-separable distribution D, with probability at least 1 - 6 it identifies an index i E [nl such that P is c-close to a distribution D' for which i is a separable feature. Theorem 4.2.2. The Separable feature identifier algorithm in Figure (O(22n/3log -n 1 3 4-2 is an ), c, 2)-separable feature identifier. Proof: Assume the underlying distribution P is c-far from any feature-separable distribution with the separated feature i. We show that the probability of outputting such i is very small. There are three possible ways to output i. Here we consider each case and show that the probability of each case is very small. 53 Separable feature identifier ((Input: c and oracle access to draw samples from distribution P)) 1. Draw si I""samples. [16 h 2. For i=1,..., n 2.1 Let ai = (#samples x with xi = 1)/s 2 be an estimation of Prxp[xi = 1]. 3c or ai > 1 - 2.2 If ai < Output i. 3. Draw s2 =4sid(0.1/n)/c 4. For i = 1, . .. o(2 2n/ 3n log nc 1 1 / 3 ) samples. , n: 4.1 Split samples into two sets, Yo and Yi based on i-th coordinate and remove this coordinate. 4.2 If there is enough samples, Run the closeness-test(Yo, Y, cE). 4.3 If the test accepts, Output i. 5. Output a random index in [n] Figure 4-2: A test algorithm for feature-separable distributions * Case 1: The algorithm outputs i, because ai is not in [4e, 1 - ic]. Observe that any distribution such that Pr[xj = 0] is zero or one is a featureseparable distribution. However, we know that the underlying distribution P is E-far from any feature-separable distribution with the separated feature i. This implies that Prtxi = 0] is at least e and not greater 1 - c (the proof is quite similar to Lenma 6.3.1). Therefore, by the Hoeffding bound the probability that ai is in [4e, 1 - 4c] is 0(1/n 2 ) o(1). * Case 2: The algorithm outputs i, because closeness test accepts on i-th iteration.. Let p and q be the two distribution on x's with x = 0 and x's with x= 1 respectively. Clearly, if we replace one distribution with the other we reach a 54 - 11MMIRTTM." II 1 11 feature-separable distribution. It is not hard to show that the distance between p and q is at least twice that of the distance between P and the feature-separable distribution with the separated feature i. This means the probability of acceptance of closeness test is 0(1/10n) = o(1). Case 3: The algorithm outputs i, because i is chosen on last line randomly. 
Note that the Separable feature identifier (Figure 4-2) is very similar to the Feature-separable test shown in Figure 4-1, and the probability of this case equals the probability that a feature-separable distribution is rejected. In the proof of Theorem 4.1.1, we showed that this probability is o(1).

By the union bound, the total probability of outputting an incorrect i is less than 1/3. Moreover, the Separable feature identifier uses at most n·s1 + s2 samples, so its sample complexity is O(poly(1/ε) · 2^{2n/3} n log n). Therefore, the proof is complete.

Chapter 5

Learning and testing 1-junta distributions

In this chapter, we discuss the problems of learning and testing k-junta distributions in the special case where k = 1. The algorithms for both testing and learning are simpler in this setting, and we obtain tighter bounds as well.

5.1 Learning 1-juntas

Here, we consider the problem of learning a 1-junta distribution.

Theorem 5.1.1. Fix ε > 0 and n > 1. Let D be a 1-junta distribution over {0, 1}^n. There is an algorithm that draws O(log n/ε²) samples from D and outputs a distribution D' such that, with probability at least 2/3, d_TV(D, D') ≤ ε. Furthermore, every such learning algorithm must have sample complexity at least Ω(log n/ε).

The quadratic dependence on 1/ε in this result improves on both Theorem 1.1.1 and Theorem 1.1.2 in their special case where k = 1. Let D_{i,a} denote the 1-junta distribution with junta coordinate i and bias parameter a. Recall that a (q, ε, δ)-learner for 1-junta distributions is an algorithm that determines the junta coordinate i and the bias parameter a using q samples, with probability of success at least 1 − δ. The learner is described in Figure 5-1 and its correctness is proved in Theorem 5.1.2.

1-Junta Learner ((Input: oracle access to draw samples from distribution p))

1. Draw s = 2 ln(n/δ)/ε² samples from p: x^(1), x^(2), ..., x^(s).
2. For i = 1, ..., n:
   2.1 Let α̂_i be the fraction of samples with x_i = 1.
3. Find the i that maximizes |α̂_i − 1/2|.
4. Output i as the junta coordinate and α̂_i as the bias parameter.

Figure 5-1: A learner algorithm for 1-junta distributions

Theorem 5.1.2. The algorithm 1-Junta Learner, described in Figure 5-1, is a (2 ln(n/δ)/ε², ε, δ)-learner for the set of all 1-junta distributions.

Proof: Assume the underlying distribution is D = D_{i,a} and the algorithm outputs D̂ = D_{ĵ,α̂_ĵ}. We bound the probability that the algorithm fails, that is, the probability of the event d_tv(D, D̂) > ε, or equivalently d_{L1}(D, D̂) > 2ε. We consider two cases:

* Case 1: ĵ = i. In this case d_{L1}(D, D̂) = 2|a − α̂_i| > 2ε, so |a − α̂_i| > ε. Note that α̂_i is an unbiased estimator of a; in other words, the expected value of α̂_i is a. Therefore, by the Hoeffding bound, the probability of this case is at most 2e^{−2sε²}.

* Case 2: ĵ ≠ i. One can check that d_{L1}(D_{i,a}, D_{ĵ,b}) = 2·max(|a − 1/2|, |b − 1/2|) when ĵ ≠ i, so since the algorithm failed we have either |α̂_ĵ − 1/2| > ε or |a − 1/2| > ε. Note that the expected value of α̂_ĵ is 1/2 (because ĵ is not the junta coordinate) and the expected value of α̂_i is a. If |α̂_ĵ − 1/2| > ε, then again by the Hoeffding bound the probability of this event is at most 2e^{−2sε²}. Otherwise, we may assume |a − 1/2| > ε. Observe that |α̂_i − 1/2| ≤ |α̂_ĵ − 1/2| by the third line of the algorithm. By the triangle inequality,

  |a − α̂_i| + |α̂_ĵ − 1/2| ≥ |a − α̂_i| + |α̂_i − 1/2| ≥ |a − 1/2| > ε,

  so at least one of the two estimators is at least ε/2 away from its mean (α̂_i from a, or α̂_ĵ from 1/2). Therefore, the probability of mistakenly picking the ĵ-th coordinate as the junta coordinate is at most e^{−(sε²)/2}.
Since we have n - 1 58 "da 6wAOLWA coordinates other than i, the total probability of failure is at most (n-l)e( 2 )/2 by the union bound. Hence, in both cases, setting s = 2 n(n})/c 2 make the failure probability less that By Theorem 1.1.3, any algorithm for learning 1-juntas needs Q(log(n)/c) samples. So Theorem 5.1.1 is optimal up to a factor of 1 in terms of sample complexity. 5.2 Testing 1-juntas We now consider the problem of testing 1-junta distributions. We obtain an exact characterization of the minimal sample complexity for this task. Theorem 5.2.1. Fix E > 0 and n > 1. whether a distributionD on {0, 1} The minimal sample complexity to E-test is a 1-junta distribution is 8(2(,-1)/2 log n/E2 ). The algorithm that tests 1-juntas with this sample complexity is described in Figure 5-2. We establish its correctness in Theorem 5.2.2. The matching lower bound is established in Theorem 5.2.3. Theorem 5.2.2. The 1-Junta test, shown in Figure 5-2, is an (E, 6)-test for 1-junta distributions. Proof: We want to show that the 1-Junta test is an (E, 2/3)-test for 1-junta distributions. Let b = Pr[xi = 1] = E p(x). First, we prove that the probability of X s.t. Xi=1 not seeing enough samples for Line 5.1 and Line 5.2 is really small. The probability of having enough samples: Here, we compute the probability of having enough sample for the uniformity tests in Line 5.1. and Line 5.2. Note that computing these probabilities are quite similar. Here, we focus on the first one. Observe that the maximum total variation distance between two arbitrary distribution is at most 1. Therefore, if a < e/2, the test will always accept with no sample. Thus, we can assume a > E/2. To use Paninstki's method for testing uniformity, we need C1a2 2(r- 1 )/ 2 / 2 samples where C1 is a constant. We want to show that with probability 1 - o(1), S1 contains 59 1-Junta test ((Input: E, 6, oracle access to draw sampics from distributionp)) 1. i, a <- 1-Junta Learner (c, 0.1). 2. Draw C2(nf2) samples. 3. Split samples in two sets So and S1 based on their value on i-th coordinate. 4. Remove i-th coordinate of all samples. 5. If there are enough samples, 5.1. Run the uniformity (e/2a, 0.1)-test on n - 1 coordinates using samples in S1. 5.2. Run the uniformity (e/2(1 - a), 0.1)-test on n - 1 coordinates using samples in So. 5.3. If both of the above tests accept, Accept. 6. Reject. Figure 5-2: A test algorithm for 1-junta distributions this many samples. Let 2OC12( It is clear that E[ISi1] = bs. Also, E[a] 1/2 b by our approach in learner algorithm. So Pr[Sij < Oi2 0)] = Pr[S' < (')b] = Pr[sl < (a2)bjI I< (')baL + Pr[ " Pr[- " > 0.1] Pr[! > 0.1] < 0.1] Pr[b > 0.1] + Pr[H < (()b 0.1] a2< 0.1] Pr[a 2 > 2b] + Pr[ I' < 0.1bla 2 < b] " Pr[a > 2b] + Pr[1I < 0.1bla 2 < b] =O( i) =o(1). Accepting 1-junta distribution: Now, we want to show that the algorithm accepts 1-junta distributions. Let p = Dg. the probability of i 4 60 j is at most 0.1. If we guess j correctly (i = J). The distance of distributions on non-junta coordinates 1 is zero. Thus, the tests on Line 5.1. and 5.2. in both case where xi = 0 or xi have to pass with probability 0.9. Thus, the total probability of rejection of p is at most 0.3. Rejecting of non-1-junta distributions: If p is not a 1-junta distribution, we have dt (P, i ,) = dL (p, Di,b) p(x) - Di, b(x)| 1 x x S.t. xi= Ip(x) +1 - 2-_ bp(x)- 2n-1 Ss.t. xs=o =lb |P~c - 22-I| x s.t. Xi +(1 - b) E x S .t. Note that by definition of b, we q junta coordinates after eliminating xi distribution. 
That is, q1(x) = p(x)/b is a distribution over the n − 1 non-junta coordinates (conditioned on x_i = 1), and all samples in S1 are drawn from q1; similarly, all samples in S0 are drawn from q2(x) = p(x)/(1 − b). Since d_tv(p, D_{i,b}) is at least ε, at least one of q1 and q2 is far enough from uniform that the corresponding test in Line 5.1 or Line 5.2 rejects with probability 0.9. Thus, the proof is complete.

Theorem 5.2.3. There is no (ε, 2/3)-test for 1-junta distributions using o(2^{(n−1)/2}/ε²) samples.

Proof: Note that a 1-junta distribution is uniform over the non-junta coordinates. Thus, for any input distribution, the test must in effect check uniformity on those coordinates as well. It is not hard to show that any (ε, δ)-test for 1-junta distributions can be leveraged as a uniformity (ε, δ)-test; the proof is quite similar to the reduction for dictator distributions in Lemma 6.2.1. By Paninski's lower bound for uniformity testing [35], we need at least Ω(2^{(n−1)/2}/ε²) samples to test 1-junta distributions.

Chapter 6

Learning and testing dictator distributions

In this chapter, we consider dictator distributions. By Definition 1.3.2, in a dictator distribution there is some coordinate i whose bit is always one in every sample, and the distribution is uniform over the rest of the coordinates. In the following subsections, we describe our tight upper and lower bounds on the sample complexity of learning and testing dictator distributions.

6.1 Learning dictator distributions

We show that the sample complexity of learning dictator distributions is Θ(log n). The upper bound and the lower bound are given in Theorem 6.1.1 and Theorem 6.1.2, respectively.

A dictator distribution p has only one parameter to learn: the index of the dictator coordinate. Let i denote this index. If we draw a sample x from p, then x_i is always one, and for any j ≠ i, x_j is zero or one, each with probability 1/2. After drawing several samples, we therefore expect to see, for each j ≠ i, a sample x with x_j = 0, assuring us that j is not the dictator coordinate. Based on this fact, we give a simple algorithm for learning dictator distributions, described in Figure 6-1. Note that, as described, the algorithm may output more than one index; however, we show that with O(log n) samples it outputs only the correct index with high probability. In addition, observe that we never reach Line 3, since one of the "if" conditions must be satisfied first. However, since we later use this algorithm without assuming that the underlying distribution is a dictator distribution, it may output ⊥, and we keep this line on purpose.

Dictator Learner ((Input: oracle access to draw samples from distribution p))

1. Draw s samples from p: x^(1), ..., x^(s). (s will be O(log n).)
2. For i = 1, ..., n:
   If all of x_i^(1), ..., x_i^(s) are one, Output i.
3. Output ⊥.

Figure 6-1: A learner algorithm for dictator distributions

Theorem 6.1.1. If x^(1), x^(2), ..., x^(s) are s = ⌈2 log n⌉ samples from a dictator distribution p, the Dictator Learner algorithm outputs only the dictator coordinate with probability 1 − 1/n.

Proof: Let i* be the actual dictator coordinate of p. Since the i*-th coordinate is always one, the algorithm outputs i*. For i ≠ i*, the probability that the algorithm outputs i is at most 2^{−s}. By the union bound over the n − 1 non-dictator coordinates,

Pr[any i ≠ i* is output] ≤ (n − 1) · 2^{−s}.

Thus, by setting s = ⌈2 log n⌉, the probability that any coordinate other than i* is output is less than 1/n. A short code sketch of this learner is given below.
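A minimal Python sketch of the Dictator Learner of Figure 6-1 (our naming; it returns all qualifying coordinates at once, and None in place of ⊥):

    def dictator_learner(samples):
        """samples: list of n-bit tuples drawn from the unknown distribution.
        Returns the list of coordinates that equal 1 in every sample (the
        candidates for the dictator coordinate), or None if no coordinate
        qualifies."""
        n = len(samples[0])
        candidates = [i for i in range(n) if all(x[i] == 1 for x in samples)]
        return candidates or None

With s = ⌈2 log n⌉ samples from a dictator distribution, Theorem 6.1.1 says that, with probability at least 1 − 1/n, the returned list contains exactly the dictator coordinate.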
Theorem 6.1.2. Let A be a randomized algorithm that learns a dictator distribution using at most (log(n − 1))/2 samples. The probability of success of A is O(1/√n).

Proof: By Yao's Lemma, we can assume A is deterministic and that the underlying distribution is chosen randomly: first, we choose a dictator distribution p uniformly from the n possible dictator distributions, and then we draw s samples from p to feed to A.

Let p_i denote the dictator distribution in which the i-th coordinate is the dictator. Also, let x ∈ {0, 1}^n be one of the samples drawn by the above procedure. Note that if x_i and x_j are both one, then the underlying distribution can be either p_i or p_j with equal probability. Now, assume we draw s samples x^(1), ..., x^(s). Let I_i be the indicator variable that is one if all of x_i^(1), ..., x_i^(s) are one, and zero otherwise, and let C be the set of coordinates i with I_i = 1. For any i and j in C, the underlying distribution could be p_i or p_j with the same probability. Thus, these distributions are indistinguishable, and any deterministic algorithm A outputs the correct coordinate with probability at most 1/|C| = 1/Σ_{i=1}^n I_i, i.e.,

Pr[A outputs the correct coordinate | Σ_{i=1}^n I_i = l] ≤ 1/l.

Observe that if i is the dictator coordinate of p, then x_i is always one; otherwise, x_i is zero with probability one half in each sample, so Pr[I_j = 1] = 2^{−s} for every non-dictator coordinate j. Therefore, the success probability of A (i.e., the probability that A outputs the dictator coordinate) satisfies

Pr[success of A] ≤ Σ_{l=1}^{n} Pr[success of A | Σ_i I_i = l] · Pr[Σ_i I_i = l]
              ≤ Σ_{l=1}^{n} (1/l) · Pr[Σ_i I_i = l]
              ≤ Pr[Σ_i I_i ≤ √n/2] + 2/√n.

By the Chernoff bound and the fact that E[Σ_i I_i] ≥ n · 2^{−s} ≥ √n, the first term is o(1), so the success probability of A is O(1/√n).

6.2 Testing dictator distributions

Although dictator distributions can be learned quickly, with only Θ(log n) samples, the testing task is much harder and requires Θ(2^{n/2}) samples. Both the lower bound and the upper bound come from the natural relation between dictator distributions and the uniform distribution. Lemmas 6.2.1 and 6.2.3 describe the reductions to and from uniformity testing, and the formal upper and lower bounds are given in Theorem 6.2.4.

Lemma 6.2.1 (Reduction from testing uniformity to testing dictatorship). If there exists an (ε, δ)-test A for dictator distributions on domain {0, 1}^n using q samples, then there exists an (ε, δ)-test B for uniformity on domain {0, 1}^{n−1} using q samples.

Proof: We show that the existence of A implies the existence of B. Assume we want to test the uniformity of a distribution p. We define another distribution p' corresponding to p. Given samples of p, create samples of p' as follows:

1. Draw a sample x = (x_1, ..., x_{n−1}) from p.
2. Output x' = (x_1, ..., x_{n−1}, 1).

Note that p' is a dictator distribution iff p is uniform. Thus, to test the uniformity of p, B simply simulates A and outputs whatever A outputs; the only difference is that instead of using samples from p directly, B feeds A samples of p' produced by the above procedure. Additionally, the ℓ1-distance between p and the uniform distribution on n − 1 bits equals the ℓ1-distance between p' and the set of dictator distributions on n bits, so the error guarantees of A carry over to B.

The reduction proved above immediately implies that any lower bound for uniformity testing is also a lower bound for testing dictator distributions.

Corollary 6.2.2. If there is no (ε, δ)-test for uniformity on domain {0, 1}^{n−1} using q samples, then any (ε, δ)-test for dictator distributions on domain {0, 1}^n uses more than q samples.
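The padding reduction behind Lemma 6.2.1 and Corollary 6.2.2 is easy to state in code. In the Python sketch below, dictator_test stands for an arbitrary assumed (ε, δ)-test for dictator distributions; the names are ours and the snippet is only a sketch of the reduction, not an implementation from the thesis.

    def uniformity_test_via_dictator(draw_p, dictator_test, q):
        """draw_p        : callable returning one sample of p as a tuple over {0,1}^(n-1)
           dictator_test : assumed dictator tester taking a list of n-bit samples
                           and returning True (accept) or False (reject)
           q             : number of samples the dictator tester needs
        """
        # p' appends a constant 1 coordinate. p' is a dictator distribution iff
        # p is uniform, so the dictator tester's verdict is a uniformity verdict.
        padded_samples = [tuple(draw_p()) + (1,) for _ in range(q)]
        return dictator_test(padded_samples)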
66 Dictator test ((Input: e, oracle access to draw samples from distributionp)) 1. i <- The first output of DictatorLearner() 2. If i # -L, 2.1 Draw q = O(22l /e2) samples: x(1) ... 2.2 If for any j, xi (q). 1 reject 2.3 Else return the result of uniformity test with samples removing i-th ) . (q) coordinate. 3. Else Reject Figure 6-2: A test algorithm for dictator distributions There is a weaker reduction in other direction. In the following Lemma, we describe it more formally. Lemma 6.2.3. [Reductions from testing dictator to testing uniformity] If there exists an (e, 6)-test, A, for uniformity on domain {0, 1}'-' using q samples, then there exists an (2c, max{1(1- ) , })-test B., for dictator distributionson domain {0, 1}' using q + 2 log n samples. Proof: Note that B has to accept the distribution p if it is a dictator distribution and reject if it is 2e-far from being a dictator distribution both with probability 1 - max{(1 - f)2logn+q, 6 + }. The general idea is that first, B learns the index of dictator coordinate, say i, and then performs a uniformity test on the rest of the coordinates. The algorithm is described in Figure 6-2. By Theorem 6.1.1, we can learn i using 2 log n samples with probability 1 - I. f the learner returns I, we know that there is no dictator coordinate, and we reject (Line 3). Otherwise, i is the candidate dictator coordinate. If we see any violation to this assumption (i.e. any sample xW such that P" = 0), then we reject (Line 2.2). If not (i.e. the samples are consistent with i being the dictator coordinate), then the 67 result of the uniformity test on the rest of the coordinates should be returned as the answer. Assume the input distribution, p, is a dictator distribution. By Theorem 6.1.1, i is the dictator coordinate with probability 1 p is uniform on the rest of the coordinates. Hence, none of xU are zero and . Thus, with probability at least 1 - 6, + probability of rejection is at most 6 . the uniformity test accepts p and our algorithm accepts it as well. Thus the total Now, suppose p is 2e-far from being a dictator distribution. Equivalently, the 11 distance is at least e (since dt = 4d1 ). Thus, we have E [px - q, ;> E where q is XC{0,1}" any dictator distribution with dictator coordinate i. Thus, we have: < < |px - qI = x s.t. Xi=O px - q.T|I+ E |Px -q._| x S.t. Xi=1 =Pr[xi = 0] + |px - 2- E x 2 s.t. x'i=I Note that the second term is the 11 distance to the uniform distribution (after removing i-th coordinate). Therefore, at least one the following has to happen: (i) Pr[xi = 0] is at least '. (ii) p is at least e far from being uniform on the rest of the coordinates. However, the only way to accept p mistakenly is having x= 1 for all samples and passing the uniformity test. The probability that B does not detect the case (i) is (1 - L)21ogn+q. Also in case (ii), the uniformity test fails with probability 6. Thus, B may fail to reject p with probability max{(I - )2Iogn+q, 61. Thus, the probability of failure in either cases (p is dictator and p is 2e-far from dic- tator) is at most max{(1 - c)2log n+q, 6+1}. Hence, B is a (2e, max{(I _ E)2 og n+q, 6+ S})-test. Theorem 6.2.4. There is an (e, 6)-test for the dictator distributions with sample complexity 0(2% /e). Also, there is no (e, 6)-test 'with asyrptoticallysmaller sample size. 68 Proof: Paninski in [351 shows the testing uniformity needs Q(N-2/1 N is the size of the domain. Q(2 i2 In our case, N = 2". 
2 ) samples where Thus, by Corollary 6.2.2, /c 2 ) samples are needed for testing dictator distributions. Additionally, Paninski provides an (E, )-test for uniformity using O(N/ 2 /c 2 ) sam- ples. Thus, by Lemma 6.2.3, there exists an (E, -)-test for dictator distributions (for sufficiently large n). Hence, the proof is complete. 6.3 Learning and testing feature-restricted distribution In this section we consider the problem of testing and learning feature-restricted distributions. 6.3.1 Testing feature-restricted distributions In this section, we describe our results for testing feature-restricted distributions: By Definition 1.3.3, a distribution is feature-restricted if for a fixed i, the i-th coordinate of any sample drawn from this distribution is always one. However, if a distribution is E-far from being feature-restricted, the probability of xi = 0 is not close to zero. Using this property, in Theorem 6.3.2 and Theorem 6.3.3, we show 8(log n/c) samples are required to test feature-restricted distributions. We first show that for any distribution far from to being feature-restricted, Pr[xi 0] is noticeably far from zero: Lemma 6.3.1. Let D denote a distribution which is c-far from being a featurerestricted distribution. For any 1 < i n and any sample, x, drawn randomly < from D, the probability that xi equals to zero is at least E. Proof: Pick any coordinate i. Let Xo denote all the domain elements with xi = 0. Clearly, the probability of xi = 0 is EE D(x). Now, we define a feature-restricted 69 Feature-restricted test (Input: e and oracic access to draw samples from distribution'P)) 1. Draw s = [I2-"] samples: x(), ... x(s). 2. for i = 1,1 ... ., n: 2.1 If xj = 1 for all I < j < s: Accept 3. Reject Figure 6-3: Algorithm for testing feature-restricted distributions distribution as follows 'D(x) + (Ex,,V D(x))/2"- 1 if Xi = 1 (6.1) if xi = 0. 0 It is not hard to see that total variation distance between D and D' is Zgy0 D(x). On the other hand, D' is a feature-restricted distribution, and this means the total variation distance of D' and D, which is E-far from being feature-restricted is at least c. Therefore, Pr[xi = 0] is at least E, as is EZx-, D(x). Lemma 6.3.1 gives us the insight for how to test these types of distributions. If a distribution is not feature-restricted, after drawing enough samples, we will see samples with Xj equal to both zero and one for each coordinate i C [n]. We prove this formally in Theorem 6.3.2. Theorem 6.3.2. Feature-restrictedtest (See Figure 6-3) is an (e, 2)-test for featurerestricted distributions using O(log n/c) samples. Proof: Here we show that the probability than the algorithm makes a mistake is at most J. Note that if p is a feature-restricted distribution, we never reject it. Thus, the only mistake that the algorithm can do is to accept a non-feature-restricted distribution. This means there is a coordinate i such that P) is one for all 1 < < s. By Lemma 6.3.1, the probability of this event happening for a particular i is at most (1 - )s. By the union bound over all coordinates, the probability we have such an i 70 is at most n(1 - r)'. By setting s we have [l|gf~] Pr[Algorithm fails] < n(1 - 6)1o911/1 < 1 (6.2) where the last inequality holds for sufficiently large n. Now, we show that s = Q(log n/c) samples is necessary to test feature-restricted distributions. To do so, we introduce a family of feature-restricted distributions {D 1 , . . . 
, D,,} and a non-feature-restricted distribution D', and then show no algorithm can distinguish between these two types of distributions. Below the processes of drawing a sample x from these distribution is described. Drawing sample from V: For each coordinate xi, set xi 1 with probability 1 - E and xi = 0 with probability E. Drawing sample from Di: For each non-restricted feature j (i j), set xi j 1 with probability 1 - c and xj = 0 with probability c. Always set xi = 1. The formal lower bound on number of samples is proved in the following theorem. Theorem 6.3.3. There is no (E, 3) -test for feature-restricteddistributionswith o( I"") samples and c < Proof: The proof is by contradiction. Assume there is an algorithm, namely A that is an (e, 2)-test for feature-restricted distributions using s = o('"gn) samples. By Yao's Lemma, assume A is a deterministic algorithm, and that the input distribution P is chosen as follows: With probability a half, P is D' and with probability 1/2n P is Di for each 1 < i < it. Observe that with probability a half P is a feature-restricted distribution and with probability half it is E-far from being feature-restricted. the coordinates i such that x) E(C(X)) = n(1-E)s if P = x(-). is one for all 1 < '. Otherwise E(C(X)) j Let C(X) be the number of < s. It is straightforward that 1+(n-1)(1-c)s =(n(1-c) 8 by our assumption about s. By the Chernoff bound and the fact that n(1 I, ) Let X denote any set of s samples x,., - e)y > we can show the probability of C(X) < n(1 - c)3/2 or C(X) > 2n(1 - C)' is O(e--G/)) = o(1). Now, we compute the success probability of A and show that it is strictly less 71 than 1. Let X denote set of all X with n(1 - e)'/2 < C(X) < 2n(1 - c)' and X' denote set of all X with C(X) < n(1 -,E)"/2 Pr(A succeeds) or C(X) > 2n(1 - c)". = Pr(A) =E Pr(A X) Pr(X) x =E Pr(AIX) Pr(X) + E Pr(AIX) Pr(X) XeX XEX' o(1) + E Pr(AIX, P is feature-restricted) - Pr(P is feature-restrictediX) - Pr(X) XEx + Z Pr(AIX, P is not feature-restricted) - Pr(P is not feature-restrictedIX) - Pr(X) X CX o(1) + E [A accepts P] - Pr(P is feature-restrictedIX) - Pr(X) XEX + Z [A rejects P] . Pr(P is not feature-restrictedIX) . Pr(X) XEX o(1) + E [A accepts P] -Pr(XIP is feature-restricted) - Pr(P is feature-restricted) xEx + 1 [A rejects P] - Pr(XIP is not feature-restricted) - Pr(P is not feature-restricted) XEX * o(1) + - E max(Pr(XIP is feature-restricted), Pr(XIP is not feature-restricted)) XEX (6.3) For any X let #0(X) denote number of x* o( = 0. Define #1(X) similarly with = 1. Note that we have Pr(XIP is not feature-restricted) = F#O(x)(i _ E)#1(x) and Pr(X 'P is feature-restricted) = Pr(X IP = D) Pr(Di'P is feature-restricted u-~n Additionally, we have 72 C(X) E#0(x)(i )#i(X)-s Restriced feature identifier ((Input: c and oracle access to draw samples from distributionP)) 1. Draw s = | *l] samples: x) ... , x(s). 2. for i =1,...,: 2.1 If x ) = 1 for all 1 < J < s: Output i. Figure 6-4: Algorithm for identifying restricted features. Pr(X) = [Pr(XIP is feature-restricted) + Pr(XIP is not fcature-restricted)] (1 (X)) Pr(XIP is not feature-restricted) or = (1+ " 2 (1 j ) Pr(XIP is feature-restricted) (1) )) max(Pr(X P is feature-restricted), Pr(X P is not feature-restricted)) + min( "(X) By Equation 6.3, Pr(A succeeds) < o(1) + 1 T max(Pr(XIP is feature-restricted), Pr(XIP is not feature-restricted)) XEX o(1) + E 1/(1 + min( " ))Pr(X) XEX < o(1) + E 1/(1 + 1/2) Pr(X) XeX < o(1) + - E Pr(X) XEX <3 Thus, by contradiction no such algorithm exists. 
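The coordinate scan shared by the Feature-restricted test of Figure 6-3 and the Restricted feature identifier of Figure 6-4 is short enough to spell out. The Python sketch below uses our own names, and the sample size is one concrete choice making n(1 − ε)^s ≤ 1/3, matching the union-bound calculation in the proof of Theorem 6.3.2; it is illustrative only.

    import math

    def restricted_feature_scan(samples):
        # Return the first coordinate that equals 1 in every sample, or None.
        n = len(samples[0])
        for i in range(n):
            if all(x[i] == 1 for x in samples):
                return i
        return None

    def feature_restricted_test(draw_sample, n, eps):
        # s = ceil(ln(3n)/eps) guarantees n(1 - eps)^s <= 1/3.
        s = math.ceil(math.log(3 * n) / eps)
        samples = [draw_sample() for _ in range(s)]
        return restricted_feature_scan(samples) is not None   # True = accept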
73 LI 6.3.2 Identifying restricted features Definition 6.3.4 (Restricted feature identifier). The algorithm A is a (s, E, 6)-restricted feature. identifier if, given sample access to a feature-restricted distribution D, with probability at least 1 - 6 it identifies an index i E [n] such that D is c-close to a distribution D' for which i is a restricted feature. Theorem 6.3.5. The Restricted feature identifier algorithm described in Figure 6-4 is a (W2-1,e, 2)-restricted feature identifier. Proof: Assume i is the restricted feature of the underling distribution P. The "If condition" in Line 2.1 in the algorithm holds at least for the coordinate i. Thus, the Feature-restricted Learner always outputs a coordinate. Let j be the output coordinate. We show that the probability of P being c-far from a feature-restricted distribution with restricted feature j is at most 1/3. Assume P is a distribution which is e-far from any feature-restricted distribution with restricted feature j. Similar to what we had in Lemma 6.3.1, we easily can prove Pr. p[xj = 0] ;> E. Thus, the probability of outputting j mistakenly is (1 - f)'. Since we have n - 1 non-restricted feature, by the union bound, the probability of outputing any wrong coordinate is n(1 - c)s, which is less 1/3 for sufficiently large n (See Equation 6.2 for more detail). D Now, we show Q(log n/e) samples are required to learn feature-restricted distributions. Theorem 6.3.6. There is no (o(log n/c), c, 3) -learnerfor feature-restricteddistribu[ions. Proof: The proof is by contradiction. Assume there is such a learner A. By Yao's Lemma, suppose A is deterministic and the underlying distribution is randomly chosen from D1, . . . , DP as explained for Theorem 6.3.3. Let X denote a set of s samples x(M),... ,x(s). Let C(X) be the number of the coordinates i such that xH is one for all 1 < j < s. Clearly, E(C(X)) = I+(n -1)(1c)" > Vr. By the Chernoff bound, the probability of C(X) < xrz/2 is O(e-O(a)) which is o(1). Now, assume we have a set of s samples ,r with C(X) > #n/2 and 74 A outputs I as the restricted feature. Suppose xV is one for all t E {i ... and 1 < j < s. The probability of X coming from any of Dj, ... . (X) ,IC(X)I is equal. Thus, I is the correct answer with probability C(1). Thus, the success probability of algorithm in this case is at most + o(1) = o(1). Thus, there is no such learner and the proof is complete. 75 76 Bibliography [1] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath. Optimal Testing for Properties of Distributions. arXiv.org, July 2015. [21 Andris Ambainis, Aleksandrs Belovs, Oded Regev, and Ronald de Wolf. Efficient quantum algorithms for (gapped) group testing and junta testing. arXiv preprint arXiv:1507.03126, 2015. [31 Alp Atici and Rocco A Servedio. Quantum Algorithms for Learning and Testing Juntas. Quantum Information Processing, 6(5):323-348, October 2007. [41 Maria-Florina Balcan, Eric Blais, Avrim Blum, and Liu Yang. Active Property Testing. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 21-30. IEEE, 2012. [51 Tugkan Batu, Sanjoy Dasgupta, Ravi Kumar, and Ronitt Rubinfeld. The complexity of approximating the entropy. SIAM Journal on Computing, 35(1):132- 150, 2005. [6] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 442-451. IEEE, 2001. 
[71 Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing that distributions are close. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000., 12-14 November 2000, Redondo Beach, California, USA, pages 259-269, 2000. [8] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White. Testing closeness of discrete distributions. CoRR, abs/1009.5397, 2010. [91 Arnab Bhattacharyya, Eldar Fischer, Ronitt Rubinfeld, and Paul Valiant. Testing monotonicity of distributions over general partial orders. In ICS, pages 239- 252, 2011. [101 Eric Blais. Improved Bounds for Testing Juntas. In APPROX '08 / RANDOM '08: Proceedings of the 11th internationalworkshop, APPROX 2008, and 12th internationalworkshop, RANDOM 2008 on Approximation, Randomization and 77 CombinatorialOptimization: Algorithms and Techniques, pages 317-330, Berlin, Heidelberg, August 2008. Springer-Verlag. [11] Eric Blais. Testing juntas nearly optimally. In STOC '09: Proceedings of the forty-first annual ACM symposium on Theory of computing, page 151, New York, New York, USA, May 2009. ACM Request Permissions. [121 Eric Blais, Amit Weinstein, and Yuichi Yoshida. Partially Symmetric Functions Are Efficiently Isomorphism Testable. SIAM J. Comput., 44(2):411-432, 2015. [131 Avrim Blum. Relevant examples and relevant features: Thoughts from computational learning theory. In AAAI Fall Symposium on 'Relevance', volume 5, 1994. [141 Avrim Blum, Lisa Hellerstein, and Nick Littlestone. Learning in the presence of finitely or infinitely many irrelevant attributes. J. Comput. Syst. Sci. (), 50(1):32-40, 1995. [15] Avrim L Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1-2):245-271, December 1997. [16] William S. Bush and Jason H. Moore. Chapter 11: Genome-wide association studies. PLoS Comput Biol, 8(12):e1002822, 12 2012. [17] Cl6ment L. Canonne. A survey on distribution testing: Your data is big. but is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015. [18] Sourav Chakraborty, Eldar Fischer, David Garcfa-Soriano, and Arie Matsliah. Junto-Symmetric Functions, Hypergraph Isomorphism and Crunching. In CCC '12: Proceedings of the 2012 IEEE Conference on Computational Complexity (CCC, pages 148-158. IEEE Computer Society, June 2012. [19] Sourav Chakraborty, David Garefa-Soriano, and Arie Matsliah. Efficient sample extractors for juntas with applications. In ICALP'11: Proceedings of the 38th international colloquim conference on Automata, languages and programming, pages 545-556. Springer-Verlag, July 2011. [201 Siu-on Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the. Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 1193-1203, 2014. 1211 Hana Chockler and Dan Gutfreund. A lower bound for testing juntas. Information Processing Letters, 90(6), June 2004. [22] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning k-modal distributions via testing. Theory of Computing, 10(20):535-570, 2014. 1231 Luc Dcvroye and Gdbor Lugosi. Combinatorial methods in density estimation. Springer, 2001. 124] Ilias Diakonikolas. Learning structured distributions, To appear. 1251 Ilias Diakonikolas, Daniel M. Kane, and Vladimir Nikishkin. Testing identity of structured distributions. CoRR, abs/1410.2266, 2014. 
[261 Ilias Diakonikolas, Homin K Lee, Kevin Matulef, Krzysztof Onak, Ronitt Rubinfeld, Rocco A Servedio, and Andrew Wan. Testing for Concise Representations. In FOCS '07: Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, pages 549-558. IEEE Computer Society, October 2007. 127] Eldar Fischer, Guy Kindler, Dana Ron, Shnuel Safra, and Alex Samorodnitsky. Testing juntas. Journal of Computer and System, Sciences, 68(4):753-787, June 2004. 1281 Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay between Randomness and Computation, pages 68-75. Springer, 2011. [291 Michael J. Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert E. Schapire, and Linda Sellie. On the learnability of discrete distributions. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing, 23-25 May 1994, Montr6al, Quebec, Canada, pages 273-282, 1994. [301 Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing properties of collections of distributions. Theory of Computing, 9(8):295-347, 2013. [311 Nathan Linial, Yishay Mansour, and Noam Nisan. Constant depth circuits, Fourier transform, and learnability. Journalof the A CM (JA CM, 40(3):607-620, July 1993. [321 Michael Mitzeninacher and Eli Upfal. Probability and Computing: Randomized Algorithms and ProbabilisticAnalysis. Cambridge University Press, New York, NY, USA, 2005. [331 Elchanan Mossel, Ryan ODonnell, and Rocco A Servedio. Learning functions of k relevant variables. Journal of Computer and System Sciences, 69(3):421-434, November 2004. [341 Ryan ODonnell. Analysis of Boolean functions. Cambridge University Press, October 2014. [35] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Trans. Inf. Theor., 54(10):4750-4755, October 2008. [36] Ariel D Procaccia and Jeffrey S Rosenschein. Junta distributions and the averagecase complexity of manipulating elections. J. Artif. Intell. Res., pages 157-181, 2007. 79 [371 Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. SL4M J. Comput., 39(3):813-842, 2009. [381 Rocco A Servedio, Li-Yang Tan, and John Wright. Adaptivity Helps for Testing Juntas. Conference on Computational Complexity, 33:264-279, 2015. [39] Jack W Smith, JE Everhart, WC Dickson, WC Knowler, and RS Johannes. Using the adap learning algorithm to forecast the onset of diabetes mellitus. In Proceedingsof the Annual Symposium on Computer Application in Medical Care, page 261. American Medical Informatics Association, 1988. [401 Gregory Valiant. Finding correlations in subquadratic time, with applications to learning parities and juntas. FOCS, pages 11-20, 2012. 141] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new clts. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA., USA, 6-8 June 2011, pages 685-694, 2011. [421 Paul Valiant. Testing symmetric properties of distributions. In Proceedings of the FortiethAnnual ACM Symposium on Theory of Computing, STOC '08, pages 383-392, New York, NY, USA, 2008. ACM. 80 Appendix A Learning juntas with the cover method Fix any class C of distributions over {o, 1}". 
An e-cover of C is a collection C, of distributions on {0, 1}' such that for every distribution D E C, there is a distribution D' C CE such that dTv(D, D') < E. We can obtain a good learning algorithm for C by designing a small E-cover for it and using the following lemma. Lemma A.0.7. Let C be an arbitraryfamily of distributions and e > 0. Let C, C C be an e-cover of C of cardinality N. Then there is an algorithm that draws O(E-2 log N) samples from an unknown distribution D E C and, with probability 9/10, outputs a distribution D' G C, that satisfies dTv(D, D') < 6E. algorithm is 0(N log N/ See 2 The running time of this ). 124, 23, 221 for good introductions to the lemma itself and its application to distribution learning problems. We are now ready to use it to complete the proof of Theorem 1.1.1. Theorem 1.1.1 (Restated). Fixe > 0 and 1 < k < n. Define t = (") 2k2k/. There is 2 3 an algorithm A with sample complexity O(log t/c 2 ) = 0(k2k/E + k log n/E ) and run- ning time O(t log t/c 2 ) that, given samples from a k-junta distribution D, with proba- P-D1QW I K c 81 Grn P(X) - bility at least 2 outputs a distribution D' such that dTv(D, D') := Eo Proof: By Lcmma A.0.7, it suffices to show that the class of all k-junta distributions has a cover of size N = (') 2 2-/ . This, in turn, follows directly from the fact that we can simply let C, be the set of all k-juntas with probability mass function p where p(x) is a multiple of E/2" for each element x C {0, 1}y. There are (") ways to choose the set J C [n] of junta coordinates and at most (2 k)2k/f ways to allocate the probability mass in e/2 k increments among the 2k different restrictions of x on J. 82 n Appendix B Proof of Equation (2.1) We establish some basic properties of Pj as described in Section 2. First, for a fixed set J, define the biases bi, i E {o, 1, . . . , 2e - 1} for J to be the probability of x07) = C where x is drawn from P and CO is the binary encoding of i with k bits. Lemma B.O.8. For each bias bi of the set J, we have bi = a., (B. 1) Proof: First, we introduce the notation we use below. For a subset I of size k, we define a function r : I -4 [k] such that rj(c) indicates the rank of the coordinate c E I (rank the smallest first). Basically, when x) = Cj, the c-th bits of x is equal to r1 (c)-th bit of Ci for all c E I. In addition, we define ri(S) to be the image of subset S under the function rj. In particular, if x') = Cj, we have x(S) = Cr(s)) for any S C I. Let t J \ J*I and it is at least one. Then by Bayes' Rule, we have bi = Pr[(j) = Cd] x(Jnj*) = C[.j(jnj*)] 2k-_jj Prfx('\') =C =0 Pr[5i-(\J*) -C = Pr[x~j\''*) = CT 1=0 A n (JJ)- "(AJKJ AX$/2)=Cd'''|6* 83 j ), (* = CTJ(JnJ*)I*) = Oi . 1 C] , = Pr[(J\J*) = Cr(J\J*) A Pr[x(*) C-1] Observe the restriction of the P on non Junta coordinates (including J\ J*) is uniform. Thus, the probability of each setting for the coordinates in J \ J* is 2- = 2-. -\' Consecutively, it is independent of the junta coordinates. Therefore, we conclude Pr[xG'\*) i=0 ] - Pr[x(* ) C C'"' C1] | . 2k-1 bi 2k _1 S 2- - Pr[x(jnJ*) =CTj(Jnj*)x(J*) = C1] i=0 Note that if x(*) = a, C1, then the values of x on all the coordinates in J n J* is determined. Therefore, both binary encodings Ci and C, should appoint the same value to these coordinates. In other words, if CrIl'nj') # bility of x(jnj*) - J*(Jj*), then the proba- Ca['j(jnj) is zero since we have the condition x(jnj*) =Cj*(jnj). Otherwise, it is one. 
Note that we can partition the C_l's (equivalently, the indices l) based on the values they assign to the coordinates in J ∩ J*. We define L = {L_1, L_2, ..., L_{2^{k−t}}} to be the partition of {0, 1, ..., 2^k − 1} such that for each L ∈ L and any two elements m_1 and m_2 in L, we have C_{m_1}^{(r_{J*}(J∩J*))} = C_{m_2}^{(r_{J*}(J∩J*))}. In addition, L(i) is the part such that for any of its elements m, C_m^{(r_{J*}(J∩J*))} = C_i^{(r_J(J∩J*))}; in other words, although C_i is a binary encoding over the set J, it assigns the same values to J ∩ J* as C_m (or any other member of L(i)) does. As there is exactly one part whose elements agree with C_i on J ∩ J*, L(i) is well defined. We also define L_a^{−1} to be the set of all i such that L(i) = L_a. With this notation, we conclude

b_i = 2^{−t} Σ_{l ∈ L(i)} a_l.

Thus, the proof is complete.

In addition, it is not hard to see that P_J is in fact a junta distribution over the set J ∩ J*. Recall that, based on Equation (B.1), we obtain the desired equality

P_J(x) = Pr_{y∼P}[y^{(J∩J*)} = x^{(J∩J*)}] / 2^{n−|J∩J*|}.   (B.2)
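As a sanity check of the bias identity derived in Lemma B.0.8 (in the form reconstructed above), the following Python sketch builds a small random k-junta distribution with junta set J*, computes the biases b_i on another set J by brute force, and compares them with 2^{−t} times the sum of the a_l over the patterns C_l that agree with C_i on J ∩ J*. The choices of n, k, J* and J are illustrative only and not from the thesis.

    import itertools
    import random

    def check_bias_identity(n=5, k=2, J_star=(0, 1), J=(1, 2), seed=0):
        rng = random.Random(seed)
        patterns = list(itertools.product([0, 1], repeat=k))
        weights = [rng.random() for _ in patterns]
        total = sum(weights)
        a = [w / total for w in weights]              # a_l = Pr[x^(J*) = C_l]

        def p(x):                                     # k-junta pmf on {0,1}^n
            l = patterns.index(tuple(x[j] for j in J_star))
            return a[l] / 2 ** (n - k)                # uniform on non-junta bits

        common = [j for j in J if j in J_star]        # J intersect J*
        t = k - len(common)                           # t = |J \ J*|
        for C_i in patterns:
            # b_i = Pr[x^(J) = C_i], computed by summing the pmf over {0,1}^n.
            b_i = sum(p(x) for x in itertools.product([0, 1], repeat=n)
                      if tuple(x[j] for j in J) == C_i)
            # Right-hand side: 2^{-t} * sum of a_l over C_l agreeing with C_i
            # on the shared coordinates.
            rhs = sum(a[l] for l, C_l in enumerate(patterns)
                      if all(C_l[J_star.index(j)] == C_i[J.index(j)] for j in common))
            rhs /= 2 ** t
            assert abs(b_i - rhs) < 1e-9
        return True

    # check_bias_identity() returns True, confirming the identity on this example.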