Learning from Satisfying Assignments
Rocco A. Servedio (Columbia University), Anindya De (DIMACS/IAS), Ilias Diakonikolas (U. Edinburgh)
Symposium on Learning, Algorithms and Complexity / MSR India Theory Day
Indian Institute of Science, January 2015

Background: Learning Boolean functions from labeled examples
Each data point: a labeled example (x, f(x)) with x in {0,1}^n.
• Studied in TCS for ~three decades
• Lots of interplay with (low-level) complexity theory
• Bumper-sticker summary of this research: simple functions are (computationally) easy to learn, and complicated functions are hard to learn

This talk: Learning probability distributions
Each data point: an unlabeled draw x from an unknown distribution.
• Big topic in statistics ("density estimation") for decades
• Exciting algorithmic work in the last decade+ in TCS, largely on continuous distributions (mixtures of Gaussians & more)
• This talk: distribution learning from a complexity-theoretic perspective
  – What about distributions over the hypercube {0,1}^n?
  – Can we formalize the intuition that "simple distributions are easy to learn"?
  – Insights into classical density estimation questions

What do we mean by "learn a distribution"?
• Unknown target distribution D
• Algorithm gets i.i.d. draws from D
• With probability 9/10, must output (a sampler for a) distribution D̂ such that the statistical distance between D and D̂ is small: dTV(D, D̂) ≤ ε
(Natural analogue of Boolean function learning.)

Previous work: [KMRRSS94]
• Looked at learning distributions over {0,1}^n in terms of n-output circuits that generate distributions: the circuit's input z_1, ..., z_m is uniform over {0,1}^m, and its output x_1, ..., x_n is distributed according to the target distribution.
• [AIK04] showed it's hard to learn even very simple distributions from this perspective: already hard even if each output bit is a 4-junta of the input bits.

This work: A different perspective
Our notion of a "simple" distribution over {0,1}^n: the uniform distribution over satisfying assignments of a "simple" Boolean function.
What kinds of Boolean functions can we learn from their satisfying assignments?
We want algorithms whose runtime and number of samples are polynomial.

What are "simple" functions?
Halfspaces: [figure: a hyperplane separating + points from - points]
DNF formulas: [figure: an OR of ANDs of literals]

Simple functions, cont.
3-CNF formulas: [figure: an AND of ORs of three literals each]
Monotone 2-CNFs: [figure: an AND of ORs of two unnegated variables each]

Yet more simple functions
Low-degree polynomial threshold functions: [figure: a curved boundary separating + from -]
Intersections of k halfspaces: [figure: a region bounded by several hyperplanes]

The model, more precisely
• Let C be a fixed class of Boolean functions over {0,1}^n.
• There is some unknown f in C. The learning algorithm sees samples drawn uniformly from f^{-1}(1). Target distribution: D = U_{f^{-1}(1)}, the uniform distribution over f's satisfying assignments.
• Goal: with probability 9/10, output a sampler for a hypothesis distribution D̂ such that dTV(D, D̂) ≤ ε.
We'll call this a distribution learning algorithm for C.

Relation to function learning
Q: How is this different from learning C (function learning) under the uniform distribution?
A: Here we only get positive examples. Some other differences:
• (not so major) Output a hypothesis distribution rather than a hypothesis function
• (major) Much more demanding guarantee than usual uniform-distribution learning

Example: Halfspaces
Usual uniform-distribution model for learning functions: the hypothesis is allowed to be wrong on an ε fraction of the points in {0,1}^n.
For a highly biased target function -- one that is 1 on only a tiny fraction of {0,1}^n -- the constant-0 function is a fine hypothesis for any constant ε.

A stronger requirement
In our distribution-learning model, the "constant-0 hypothesis" is meaningless!
In this example, for a hypothesis to give a good hypothesis distribution, the region where it disagrees with f must be only an ε fraction of f^{-1}(1).
Essentially, we require a hypothesis function with multiplicative rather than additive ε-accuracy relative to |f^{-1}(1)|.

Usual function-learning setting: given random labeled examples, must output a hypothesis h such that Pr_{x uniform}[h(x) ≠ f(x)] ≤ ε.
Our distribution-learning setting: given draws from U_{f^{-1}(1)}, must output a hypothesis h satisfying (roughly) |h^{-1}(1) Δ f^{-1}(1)| ≤ ε · |f^{-1}(1)|.
[figure: the two disagreement regions between h^{-1}(1) and f^{-1}(1)] If both regions are small relative to f^{-1}(1), this is fine!
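To make the additive-vs-multiplicative distinction concrete, here is a small illustrative sketch (my own, not from the talk): for a made-up sparse target f over {0,1}^n it compares the usual additive error of the constant-0 hypothesis against the multiplicative error that this model cares about.

```python
from itertools import product

n = 10
# Toy sparse target (made up): f(x) = 1 iff the first 7 coordinates are all 1.
f = lambda x: int(all(x[:7]))
# Hypothesis to evaluate: the constant-0 function.
h = lambda x: 0

cube = list(product([0, 1], repeat=n))
sat_f = [x for x in cube if f(x) == 1]
disagree = [x for x in cube if f(x) != h(x)]

additive_err = len(disagree) / len(cube)         # fraction of all of {0,1}^n
multiplicative_err = len(disagree) / len(sat_f)  # fraction of f^{-1}(1)

print(f"|f^-1(1)| = {len(sat_f)} out of {len(cube)} points")
print(f"additive error of constant-0:       {additive_err:.4f}  (tiny, so fine for function learning)")
print(f"multiplicative error of constant-0: {multiplicative_err:.4f}  (equals 1, useless in our model)")
```

For this toy f the constant-0 hypothesis errs on under 1% of the cube but on 100% of f^{-1}(1), which is exactly why the multiplicative requirement rules it out.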
Brief motivational digressions: (1) Real-world language learning
People typically learn new languages by being exposed to correct utterances (positive examples), which are a sparse subset of all possible vocalizations (all examples). The goal is to be able to generate new correct utterances -- that is, to generate draws from a distribution similar to the one the samples came from.

(2) Connection to continuous density estimation questions
A basic question in continuous 1-dimensional density estimation: the target distribution (say over [0,1]) is a "k-bin histogram" -- its pdf is piecewise constant with k pieces. [figure: a 5-bin histogram over [0,1]]
It is easy to learn such a distribution with poly(k, 1/ε) samples and runtime.

Multi-dimensional histograms
The target distribution over [0,1]^d is specified by k hyper-rectangles that cover [0,1]^d; the pdf is constant within each rectangle. [figure: d = 2, k = 5]
Question: Can we learn such distributions without incurring the "curse of dimensionality"? (We don't want the runtime or number of samples to be exponential in d.)

Connection with our problem
Our "learning from satisfying assignments" problem for the class C = {all k-leaf decision trees over d Boolean variables} is a (very) special case of learning k-bin d-dimensional histograms:
• one of the k hyper-rectangles  <-->  the set of inputs reaching one of the k decision tree leaves;
• a rectangle with 0 weight in the distribution  <-->  a decision tree leaf that's labeled 0.
For this special case, we beat the "curse of dimensionality" and achieve runtime d^{O(log k)}.

Results

Positive results
Theorem 1: We give an efficient distribution learning algorithm for C = {halfspaces}. Runtime is poly(n, 1/ε).
Theorem 2: We give a (pretty) efficient distribution learning algorithm for C = {poly(n)-term DNFs}. Runtime is quasi-polynomial in n and 1/ε.
Both results are obtained via a general approach, plus C-specific work.

Negative results
Assuming crypto-hardness (essentially RSA), there are no efficient distribution learning algorithms for:
o intersections of two halfspaces
o degree-2 polynomial threshold functions
o 3-CNFs, or even
o monotone 2-CNFs

Rest of talk
• Positive results
• General approach, illustrated through the specific case of halfspaces
• Touch on DNFs

Learning halfspace distributions
Given positive examples drawn uniformly from f^{-1}(1) for an unknown halfspace f(x) = sign(w · x - θ), with w and θ unknown, we need to (whp) output a sampler for a distribution that's close to U_{f^{-1}(1)}.

Let's fantasize
Suppose somebody gave us f. Even then, we need to output a sampler for a distribution close to uniform over f^{-1}(1). Is this doable? Yes.

Approximate sampling for halfspaces
Theorem: Given a halfspace f over {0,1}^n, one can return a (near-)uniform point of f^{-1}(1) in polynomial time (with failure probability δ).
• [MorrisSinclair99]: sophisticated MCMC analysis
• [Dyer03]: elementary randomized algorithm & analysis using "dart throwing"
Of course, in our setting we are not given f. But we should expect to use (at least) this machinery for our problem.

A potentially easier case…?
For the approximate sampling problem (where we're given f), the problem is much easier if f^{-1}(1) is large: sample uniformly from {0,1}^n and do rejection sampling.
Maybe our problem is easier too in this case? In fact, yes. Let's consider this case first.
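A minimal sketch (mine, not from the slides) of the rejection-sampling idea for the case where f is known and dense: draw uniform points of {0,1}^n and keep the first one f accepts. The particular halfspace below is made up for illustration; when f^{-1}(1) has density p, the expected number of draws is 1/p, which is why this only works in the high-density case.

```python
import random

def uniform_point(n):
    """A uniform random point of {0,1}^n."""
    return tuple(random.randint(0, 1) for _ in range(n))

def rejection_sample(f, n, max_tries=100000):
    """Return a uniform point of f^{-1}(1) by rejection sampling uniform points of {0,1}^n.
    The expected number of tries is 1/p where p = |f^{-1}(1)| / 2^n, so this is only
    efficient when f^{-1}(1) is dense."""
    for _ in range(max_tries):
        x = uniform_point(n)
        if f(x) == 1:
            return x
    raise RuntimeError("f^{-1}(1) appears too sparse for naive rejection sampling")

# Illustrative (made-up) dense halfspace over {0,1}^20: f(x) = 1 iff x has at least 9 ones.
n = 20
f = lambda x: int(sum(x) >= 9)

for _ in range(5):
    print(rejection_sample(f, n))
```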
Halfspaces: the high-density case
• Let p = |f^{-1}(1)| / 2^n.
• We will first consider the case that p is not too small (at least an inverse polynomial).
• We'll solve this case using Statistical Query learning & hypothesis testing for distributions.

First ingredient for the high-density case: SQ
Statistical Query (SQ) learning model:
o SQ oracle: given a poly-time computable query function φ(x, y) and a tolerance τ, the oracle outputs a value v with |v - E_{x~D}[φ(x, f(x))]| ≤ τ.
o An algorithm A is said to be an SQ learner for C (under distribution D) if A can learn any f in C given access to this oracle.

SQ learning for halfspaces
Good news: [BlumFriezeKannanVempala97] gave an efficient SQ learning algorithm for halfspaces. It outputs halfspace hypotheses!
Of course, to run it, we need access to the SQ oracle for the unknown halfspace f. So we need to simulate this oracle given our examples from U_{f^{-1}(1)}.

The high-density case: first step
Lemma: Given access to uniform random samples from f^{-1}(1) and an estimate p̂ such that p̂ ≈ p, queries to the SQ oracle (for f under the uniform distribution) can be simulated up to error τ in time poly(n, 1/τ, 1/p).
Proof sketch: write E_{x uniform}[φ(x, f(x))] = E_{x uniform}[φ(x, 0)] + p · E_{x ~ U_{f^{-1}(1)}}[φ(x, 1) - φ(x, 0)]. Estimate the first term using uniform samples from {0,1}^n (which we can generate ourselves), estimate the second expectation using our samples from U_{f^{-1}(1)}, and combine the two using p̂.
Recall the promise that p is not too small; additionally, assume for now that we have a good estimate p̂ ≈ p.
The target f is a halfspace, so the Lemma lets us run the halfspace SQ-learner and obtain a halfspace hypothesis h such that Pr_{x uniform}[h(x) ≠ f(x)] is small (roughly ε · p).
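A small sketch (my own illustration, not code from the talk) of the simulation identity in the proof sketch above: an SQ query E_{x uniform}[φ(x, f(x))] is estimated from self-generated uniform points plus positive samples from f^{-1}(1), together with a density estimate p̂. The toy function, query, and sample sizes are made up.

```python
import random
from itertools import product

def simulate_sq_query(phi, positive_samples, p_hat, n, num_uniform=20000):
    """Estimate E_{x uniform}[phi(x, f(x))] using only positive samples of f and an
    estimate p_hat of p = |f^{-1}(1)| / 2^n, via the identity
      E[phi(x, f(x))] = E_uniform[phi(x, 0)] + p * E_{x ~ U_{f^{-1}(1)}}[phi(x, 1) - phi(x, 0)]."""
    uniform_pts = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(num_uniform)]
    term0 = sum(phi(x, 0) for x in uniform_pts) / num_uniform
    term1 = sum(phi(x, 1) - phi(x, 0) for x in positive_samples) / len(positive_samples)
    return term0 + p_hat * term1

# Toy target (made up): f(x) = 1 iff x has at least 6 ones, over {0,1}^12.
n = 12
f = lambda x: int(sum(x) >= 6)

# Pretend these positive samples came from the learning problem (generated here by rejection sampling).
positives = []
while len(positives) < 5000:
    x = tuple(random.randint(0, 1) for _ in range(n))
    if f(x):
        positives.append(x)
p_hat = 0.61  # a guess for p; the real algorithm tries a grid of guesses

# Example query: correlation of the label with the first coordinate.
phi = lambda x, y: x[0] * y
estimate = simulate_sq_query(phi, positives, p_hat, n)
exact = sum(phi(x, f(x)) for x in product([0, 1], repeat=n)) / 2**n
print(f"simulated SQ answer: {estimate:.4f}   exact value: {exact:.4f}")
```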
Handling the high-density case
• Since Pr_{x uniform}[h(x) ≠ f(x)] is at most roughly ε · p, we have that
  o |h^{-1}(1) Δ f^{-1}(1)| ≤ ε · |f^{-1}(1)|, and
  o h^{-1}(1) is itself dense, so U_{h^{-1}(1)} is close to U_{f^{-1}(1)}.
• Hence, using rejection sampling, we can easily sample from U_{h^{-1}(1)}.
Caveat: we don't actually have an estimate for p.

Ingredient #2: Hypothesis testing
• Try all possible values of p̂ in a sufficiently fine multiplicative grid.
• We will get a list of candidate distributions such that at least one of them is ε-close to D.
• Run a "distribution hypothesis tester" to return a candidate which is O(ε)-close to D.

Distribution hypothesis testing
Theorem: Given
• a sampler for the target distribution D,
• approximate samplers for distributions D_1, ..., D_N,
• approximate evaluation oracles for D_1, ..., D_N, and
• the promise that some D_i satisfies dTV(D_i, D) ≤ ε,
the hypothesis tester outputs, in time poly(N, 1/ε), some D_j such that dTV(D_j, D) ≤ O(ε).
Having evaluators as well as samplers for the hypotheses is crucial for this.

Distribution hypothesis testing, cont.
We need samplers & evaluators for our hypothesis distributions U_{h^{-1}(1)}.
All our hypotheses are dense, so we can do approximate counting easily (rejection sampling) to estimate |h^{-1}(1)|. Note that U_{h^{-1}(1)} puts weight 1/|h^{-1}(1)| on each point of h^{-1}(1), so we get the required (approximate) evaluators. Similarly, (approximate) samples are easy via rejection sampling.

Recap
So we handled the high-density case using
• SQ learning (for halfspaces), and
• hypothesis testing (generic).
(We also used approximate sampling & counting, but they were trivial because we were in the dense case.)
Now let's consider the low-density case (the interesting case).

Low-density case: A new ingredient
New ingredient for the low-density case: a new kind of algorithm called a densifier.
• Input: ε, δ, an estimate p̂ ≈ p, and samples from U_{f^{-1}(1)}.
• Output: a function g such that (with high probability):
  – g is 1 on almost all of f^{-1}(1), i.e. Pr_{x ~ U_{f^{-1}(1)}}[g(x) = 1] ≥ 1 - ε; and
  – f^{-1}(1) is a non-negligible fraction of g^{-1}(1), i.e. g^{-1}(1) is not too much larger than f^{-1}(1).
For simplicity, assume that g (like f) belongs to C.

Densifier illustration
[figure: g^{-1}(1) is a modestly larger region containing almost all of f^{-1}(1)]
Input: samples from U_{f^{-1}(1)} and a good estimate p̂. Output: g satisfying the two conditions above.

Low-density case (cont.)
To solve the low-density case, we also need approximate sampling and approximate counting algorithms for the class the densifier's output g comes from.
This, plus the ingredients from the high-density case (SQ learning and hypothesis testing), plus the densifier, suffices: given all these ingredients, we get a distribution learning algorithm for C.

How does it work?
The overall algorithm (recall that D = U_{f^{-1}(1)}); it needs a good estimate p̂ of p:
1. Run the densifier to get g.
2. Use the approximate sampling algorithm to get samples from U_{g^{-1}(1)}.
3. Run the SQ-learner for C under the distribution U_{g^{-1}(1)} to get a hypothesis h for f.
4. Sample from U_{g^{-1}(1)} until you get a point x with h(x) = 1; output this x.
Repeat with different guesses for p̂, & use hypothesis testing to choose an output distribution that's close to D.

A picture of one stage
(Note: this all assumed we have a good estimate p̂.)
1. Using samples from U_{f^{-1}(1)}, run the densifier to get g.
2. Run the approximate uniform generation algorithm to get uniform positive examples of g.
3. Run the SQ-learner on the distribution U_{g^{-1}(1)} to get a high-accuracy hypothesis h for f (under U_{g^{-1}(1)}).
4. Sample from U_{g^{-1}(1)} until you get a point where h = 1, and output it.
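Below is a high-level sketch (my rendering, not the authors' code) of one stage of the algorithm just described, for a single guess p̂. The names `run_densifier`, `approx_uniform_sample`, and `sq_learn` are hypothetical placeholders for the densifier, the approximate uniform generation algorithm, and the SQ learner run with simulated queries; only the control flow is meant to mirror the four steps above.

```python
def one_stage(positive_samples, p_hat, eps, run_densifier, approx_uniform_sample, sq_learn):
    """One stage of the distribution-learning algorithm, for a single density guess p_hat.
    Returns a sampler (a zero-argument function) for the hypothesis distribution.
    The three callables are hypothetical stand-ins:
      run_densifier(positive_samples, p_hat, eps) -> a function g (the densifier's output)
      approx_uniform_sample(g)                    -> a (near-)uniform point of g^{-1}(1)
      sq_learn(g, positive_samples, p_hat, eps)   -> a hypothesis h for f under U_{g^{-1}(1)}
    """
    # 1. Densify: find g containing (almost all of) f^{-1}(1) but not too much more.
    g = run_densifier(positive_samples, p_hat, eps)

    # 2.-3. Learn a hypothesis h for f with respect to the distribution U_{g^{-1}(1)}.
    h = sq_learn(g, positive_samples, p_hat, eps)

    # 4. Hypothesis sampler: draw near-uniform points of g^{-1}(1), keep the first one h accepts.
    def sampler(max_tries=10000):
        for _ in range(max_tries):
            x = approx_uniform_sample(g)
            if h(x) == 1:
                return x
        raise RuntimeError("h accepts too little of g^{-1}(1); this guess of p_hat is bad")
    return sampler

# The full algorithm runs one_stage for each p_hat in a fine multiplicative grid of guesses,
# then uses the distribution hypothesis tester to pick the best resulting sampler.
```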
How it works, cont.
Recall that to carry out hypothesis testing, we need samplers & evaluators for our hypothesis distributions. Now some hypotheses may be very sparse…
• Use approximate counting to estimate the support size of each hypothesis distribution; as before, each hypothesis distribution is (approximately) uniform over its support, so we get an (approximate) evaluator.
• Use approximate sampling to get samples from each hypothesis distribution.

Recap: a general method
Theorem: Let C be a class of Boolean functions such that:
(i) C is efficiently SQ-learnable;
(ii) C has a densifier with output in a class C'; and
(iii) C' has efficient approximate counting and sampling algorithms.
Then there is an efficient distribution learning algorithm for C.

Back to halfspaces: what have we got?
• Saw earlier that we have SQ learning [BlumFriezeKannanVempala97].
• [MorrisSinclair99, Dyer03] give approximate counting and sampling.
So we have all the necessary ingredients… except a densifier.
Reminiscent of the [Dyer03] "dart throwing" approach to approximate counting -- but in that setting, we are given f:
• Approximate counting setting: given f, come up with a suitable g.
• Densifier setting: can we come up with a suitable g given only samples from U_{f^{-1}(1)}?

A densifier for halfspaces
Theorem: There is a poly-time algorithm such that for any halfspace f, if the algorithm gets as input an estimate p̂ with p̂ ≈ p and access to U_{f^{-1}(1)}, it outputs a halfspace g with the following properties:
1. Pr_{x ~ U_{f^{-1}(1)}}[g(x) = 1] ≥ 1 - ε, and
2. |g^{-1}(1)| is not too much larger than |f^{-1}(1)|.

Getting a densifier for halfspaces
Key ingredients:
o approximate sampling for halfspaces [MorrisSinclair99, Dyer03]
o the online learner of [MaassTuran90]

Towards a densifier for halfspaces
Recall our goals:
1. g(x) = 1 on almost all of f^{-1}(1);
2. g^{-1}(1) is not too much larger than f^{-1}(1).
Fact: Let S be a set of uniform draws from f^{-1}(1) of size roughly (n^2 + log(1/δ)) / ε. Then, with probability 1 - δ, condition (1) holds for every halfspace g with S ⊆ g^{-1}(1).
Proof: If (1) fails for a halfspace g, then Pr_{x ~ U_{f^{-1}(1)}}[g(x) = 1] < 1 - ε, so Pr[S ⊆ g^{-1}(1)] ≤ (1 - ε)^{|S|}. The Fact follows from a union bound over all (at most 2^{O(n^2)} many) halfspaces over {0,1}^n.
So ensuring (1) is easy -- draw S and ensure g is consistent with S. How do we ensure (2)?

Online learning as a two-player game
Imagine a two-player game in which Arnab has a halfspace f and Larry wants to learn f:
i. Larry initializes the constraint set to the empty set;
ii. Larry runs a (specific, poly-time) algorithm on the current constraint set and returns a halfspace g consistent with it;
iii. Arnab either says "yes, g = f" or else returns a counterexample x such that g(x) ≠ f(x);
iv. Larry adds x (with its correct label) to the constraint set and returns to step (ii).

Guarantee of the game
Theorem [MaassTuran90]: There is a specific algorithm that Larry can run so that the game terminates in at most poly(n) rounds. At the end, either g = f, or Larry can certify that there is no halfspace meeting all the constraints. (The algorithm is essentially the ellipsoid algorithm.)
Q: How is this helpful for us?
A: Larry seems to have a powerful strategy. We will exploit it.

Using the online learner
• Choose S ⊆ f^{-1}(1) as defined earlier. Start the game as above.
• "Larry" simulation, stage i: run Larry's strategy and obtain a halfspace g_i consistent with the constraints so far.
• "Arnab" simulation:
  – If g_i(x) = 0 for some x in S, then return that x as a (positively labeled) counterexample.
  – Else, if |g_i^{-1}(1)| is small enough -- not too much larger than p̂ · 2^n (use approximate counting) -- then we are done and return g = g_i.
  – Else, use approximate sampling to choose a random point x from g_i^{-1}(1) and return it as a (negatively labeled) counterexample.
    Key point: in this case g_i^{-1}(1) is much larger than f^{-1}(1), so x is indeed outside f^{-1}(1) with high probability.

Why is the simulation correct?
• If g_i(x) = 0 for some x in S ⊆ f^{-1}(1), then the counterexample is genuine, so the simulation step is indeed correct.
• The other case in which "Arnab" returns a point is when g_i^{-1}(1) is much larger than f^{-1}(1). A random point of g_i^{-1}(1) then lies outside f^{-1}(1) with high probability, so the simulation step is correct with high probability.
• Since the simulation lasts only poly(n) steps, all the steps are correct with high probability.

Finishing the algorithm
Provided the simulation is correct, the halfspace g that it returns satisfies both conditions:
1. g(x) = 1 on almost all of f^{-1}(1) (g is consistent with S);
2. |g^{-1}(1)| is not too much larger than |f^{-1}(1)| (this is what the termination check certifies).
So we have a densifier -- and hence a distribution learning algorithm -- for halfspaces.
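Here is a schematic sketch (mine, with hypothetical subroutine names) of the densifier loop just described. `larry_step` stands for the [MaassTuran90]/ellipsoid-style online learner, and `approx_count` / `approx_sample` stand for approximate counting and sampling for halfspaces; only the Larry/Arnab simulation logic is meant to be faithful, and the details are a sketch rather than the authors' algorithm.

```python
def densify_halfspace(S, p_hat, slack, larry_step, approx_count, approx_sample, max_rounds=10000):
    """Simulate the online-learning game to build a densifier output g.
    S            : positive examples drawn from U_{f^{-1}(1)}
    p_hat        : estimate of |f^{-1}(1)| / 2^n
    slack        : how much larger than p_hat we allow the density of g^{-1}(1) to be
    larry_step   : hypothetical online learner; labeled constraints -> consistent halfspace g
    approx_count : hypothetical approximate counter; g -> estimate of |g^{-1}(1)| / 2^n
    approx_sample: hypothetical approximate sampler; g -> near-uniform point of g^{-1}(1)
    """
    constraints = []                           # Larry starts with an empty constraint set
    for _ in range(max_rounds):
        g = larry_step(constraints)            # "Larry": a halfspace consistent with constraints
        # "Arnab", case 1: g rejects one of the known positives -> genuine counterexample.
        missed = next((x for x in S if g(x) == 0), None)
        if missed is not None:
            constraints.append((missed, 1))
            continue
        # "Arnab", case 2: g is already not much bigger than f -> done, g is the densifier output.
        if approx_count(g) <= slack * p_hat:
            return g
        # "Arnab", case 3: g is much bigger than f, so a random point of g^{-1}(1) is (whp)
        # outside f^{-1}(1); feed it back as a negative counterexample.
        constraints.append((approx_sample(g), 0))
    raise RuntimeError("simulation did not terminate; in the real analysis poly(n) rounds suffice")
```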
DNFs
Recall the general result:
Theorem: Let C be a class of Boolean functions such that: (i) C is efficiently SQ-learnable; (ii) C has a densifier with output in a class C'; and (iii) C' has efficient approximate counting and sampling algorithms. Then there is an efficient distribution learning algorithm for C.
We get (iii) from [KarpLubyMadras89]. What about the densifier and SQ learning?

Sketch of the densifier for DNFs
• Consider a DNF f = T_1 ∨ … ∨ T_s with s = poly(n) terms.
• Key observation: a term T_i that carries a non-negligible share of f^{-1}(1) is satisfied by a draw from U_{f^{-1}(1)} with non-negligible probability. So the probability that a small batch of consecutive samples from U_{f^{-1}(1)} all satisfy the same term T_i is non-negligible.
• If this happens, whp these samples completely identify T_i (the coordinates on which they all agree pin down its literals).
• The densifier finds candidate terms in this way and outputs the OR of all candidate terms.

SQ learning for DNFs
• Unlike halfspaces, no efficient SQ algorithm for learning DNFs under arbitrary distributions is known; the best known runtime is superpolynomial.
• But: our densifier identifies "candidate terms" such that f is (essentially) an OR of at most poly(n) of them.
• So we can use a noise-tolerant SQ learner for sparse disjunctions, applied over "metavariables" (the candidate terms).
• The running time is poly(# metavariables).

Summary of talk
• New model: learning the distribution of satisfying assignments
• "Multiplicative accuracy" learning
• Positive results: halfspaces, DNFs
• Negative results: 3-CNFs, intersections of 2 halfspaces, monotone 2-CNFs, degree-2 PTFs

Future work
• Beating the "curse of dimensionality" for d-dimensional histogram distributions?
• Extensions to the agnostic / semi-agnostic setting?
• Other formalizations of "simple distributions"?
Thank you!

Hardness results

Secure signature schemes
• Gen: (randomized) key generation algorithm; produces key pairs (public key, secret key).
• Sign: signing algorithm; Sign(sk, m) is a signature for message m using secret key sk.
• Verify: verification algorithm; Verify(pk, m, σ) = 1 if σ is a valid signature for m under public key pk.
Security guarantee: given signed messages (m_1, σ_1), ..., (m_k, σ_k), no poly-time algorithm can produce (m', σ') for a new message m' such that Verify(pk, m', σ') = 1.

Connection with our problem
Intuition: view the target distribution as the uniform distribution over signed messages (m, σ) accepted by Verify. If, given signed messages, you can (approximately) sample from this distribution, then you can generate new signed messages -- contradicting the security guarantee!
We need to work with a refinement of signature schemes -- unique signature schemes [MicaliRabinVadhan99] -- for this intuition to go through. Unique signature schemes are known to exist under various crypto assumptions (RSA', Diffie-Hellman', etc.).

Signature schemes + Cook-Levin
Lemma: For any secure signature scheme, there is a secure signature scheme with the same security in which the verification algorithm is a 3-CNF.
The uniform distribution over satisfying assignments of this 3-CNF corresponds to the uniform distribution over signed messages, so the security of the signature scheme implies there is no efficient distribution learning algorithm for 3-CNFs.

More hardness
The same approach yields hardness for intersections of 2 halfspaces & degree-2 PTFs. (These require parsimonious reductions: efficiently computable and invertible maps between the satisfying assignments of the target function and the satisfying assignments of a 3-CNF.)
For monotone 2-CNFs: use the "blow-up" reduction used in proving hardness of approximate counting for monotone 2-SAT. Roughly, most satisfying assignments of the blown-up monotone 2-CNF correspond to satisfying assignments of the original 3-CNF.