Electronic Colloquium on Computational Complexity, Revision 2 of Report No. 17 (2010)                ISSN 1433-8092

PCPs and the Hardness of Generating Synthetic Data

Jonathan Ullman*    Salil Vadhan†
School of Engineering and Applied Sciences & Center for Research on Computation and Society
Harvard University, Cambridge, MA
{jullman,salil}@seas.harvard.edu

January 6, 2011

Abstract

Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm A that takes a database D ∈ ({0,1}^d)^n and outputs a "synthetic database" D̂ all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x ∈ {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS '07), who gave an algorithm running in time poly(n, 2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC '09) with encodings based on the PCP Theorem.

We also present both negative and positive results for generating "relaxed" synthetic data, where the fraction of rows in D satisfying a predicate c is estimated by applying c to each row of D̂ and aggregating the results in some way.

Keywords: privacy, digital signatures, inapproximability, constraint satisfaction problems, probabilistically checkable proofs

* http://seas.harvard.edu/~jullman. Supported by NSF grant CNS-0831289.
† http://seas.harvard.edu/~salil. Supported by NSF grant CNS-0831289.

1 Introduction

There are many settings in which it is desirable to share information about a database that contains sensitive information about individuals. For example, doctors may want to share information about health records with medical researchers, the federal government may want to release census data for public information, and a company like Netflix may want to provide its movie rental database for a public competition to develop a better recommendation system.
However, it is important to do this in a way that preserves the "privacy" of the individuals whose records are in the database. This privacy problem has been studied by statisticians and the database security community for a number of years (cf. [1, 14, 21]), and recently the theoretical computer science community has developed an appealing new approach to the problem, known as differential privacy. (See the surveys [16, 15].)

Differential Privacy. A randomized algorithm A is defined to be differentially private [17] if for every two databases D = (x_1, ..., x_n) and D′ = (x′_1, ..., x′_n) that differ on exactly one row, the distributions A(D) and A(D′) are "close" to each other. Formally, we require that A(D) and A(D′) assign the same probability mass to every event, up to a multiplicative factor of e^ε ≈ 1 + ε, where ε is typically taken to be a small constant. (In addition to this multiplicative factor, it is often allowed to also let the probabilities differ by a negligible additive term.) This captures the idea that no individual's data has a significant influence on the output of A (provided that data about an individual is confined to one or a few rows of the database). Differential privacy has several nice properties lacking in previous notions, such as being agnostic to the adversary's prior information and degrading smoothly under composition.

With this model of privacy, the goal becomes to design algorithms A that simultaneously meet the above privacy guarantee and give "useful" information about the database. For example, we may have a query function c in which we're interested, and the goal is to design A that is differentially private (with ε as small as possible) and estimates c well (e.g. the error |A(D) − c(D)| is small with high probability).
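To make this concrete for counting queries (the fraction of rows satisfying a predicate), here is a minimal sketch, ours rather than the paper's, of the standard Laplace mechanism: a counting query changes by at most 1/n between neighboring databases, so noise of scale 1/(εn) suffices for ε-differential privacy on a single query.

```python
import math
import random

def counting_query(db, predicate):
    """c(D): the fraction of database rows satisfying `predicate`."""
    return sum(1 for row in db if predicate(row)) / len(db)

def laplace_noise(scale):
    """Sample Laplace(0, scale) by inverse-transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_count(db, predicate, eps):
    """A counting query has sensitivity 1/n between neighboring databases,
    so Laplace noise of scale 1/(eps * n) gives eps-differential privacy
    for a single query."""
    return counting_query(db, predicate) + laplace_noise(1.0 / (eps * len(db)))

# Toy database of d = 2 binary attributes; query: fraction with first bit 1.
random.seed(0)
db = [(i % 2, (i % 3) == 0) for i in range(1000)]
exact = counting_query(db, lambda row: row[0] == 1)        # 0.5
noisy = private_count(db, lambda row: row[0] == 1, eps=0.5)
```

With n = 1000 rows the noise scale is only 0.002, so the noisy answer is very close to the true fraction; answering many queries this way degrades the privacy parameter additively.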
For example, if c(D) is the fraction of database rows that satisfy some property (a counting query), then it is known that we can take A(D) to equal c(D) plus random Laplacian noise with standard deviation O(1/(εn)), where n is the number of rows in the database and ε is the measure of differential privacy [8]. A sequence of works [11, 19, 8, 17] has provided a very good understanding of differential privacy in an interactive model in which real-valued queries c are made and answered one at a time. The amount of noise that one needs when responding to a query c should be based on the sensitivity of c, as well as the total number of queries answered so far.

However, for many applications, it would be more attractive to do a noninteractive data release, where we compute and release a single, differentially private "summary" of the database that enables others to determine accurate answers to a large class of queries. What form should this summary take? The most appealing form would be a synthetic database, which is a new database D̂ = A(D) whose rows are "fake," but come from the same universe as those of D and are guaranteed to share many statistics with those of D (up to some accuracy). Some advantages of synthetic data are that it can be easily understood by humans, and statistical software can be run directly on it without modification. For example, these considerations led the German Institute for Employment Research to adopt synthetic databases for the release of employment statistics [33].

Previous Results on Synthetic Data. The first result on producing differentially private synthetic data came in the work of Barak et al. [5]. Given a database D consisting of n rows from {0,1}^d, they show how to construct a differentially private synthetic database D̂, also of n rows from {0,1}^d, in which the full "contingency table," consisting of all conjunctive counting queries, is approximately preserved.
That is, for every conjunction c(x) = x_{i_1} ∧ x_{i_2} ∧ ⋯ ∧ x_{i_k} with i_1, ..., i_k ∈ [d], the fraction of rows in D̂ that satisfy c equals the fraction of rows in D that satisfy c up to an additive error of 2^{O(d)}/n. The running time of their algorithm is poly(n, 2^d), which is feasible for small values of d. They pose as an open problem whether the running time of their algorithm can be improved for the case where we only want to preserve the k-way marginals for small k (e.g. k = 2). These are the counting queries corresponding to conjunctions of up to k literals. Indeed, there are only O(d^k) such conjunctions, and we can produce differentially private estimates for all the corresponding counting queries in time poly(n, d^k) by just adding noise O(d^k)/n to each one. Moreover, a version of the Barak et al. algorithm [5] can ensure that even these noisy answers are consistent with a real database.¹

A more general and dramatic illustration of the potential expressiveness of synthetic data came in the work of Blum, Ligett, and Roth [9]. They show that for every class C = {c : {0,1}^d → {0,1}} of predicates, there is a differentially private algorithm A that produces a synthetic database D̂ = A(D) such that all counting queries corresponding to predicates in C are preserved to within an accuracy of Õ((d·log(|C|)/n)^{1/3}), with high probability. In particular, with n = poly(d), the synthetic data can provide simultaneous accuracy for an exponential-sized family of queries (e.g. |C| = 2^d). Unfortunately, the running time of the BLR mechanism is also exponential in d.

Dwork et al. [18] gave evidence that the large running time of the BLR mechanism is inherent. Specifically, assuming the existence of one-way functions, they exhibit an efficiently computable family C of predicates (e.g. consisting of circuits of size d²) for which it is infeasible to produce a differentially private synthetic database preserving the counting queries corresponding to C (for databases of any n = poly(d) number of rows).
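For illustration, the 2-way marginal queries discussed above are easy to enumerate directly; the following sketch (ours, not from [5]) computes all of them for a small binary database.

```python
from itertools import combinations, product

def two_way_marginals(db, d):
    """For each pair of columns (j, j2) and values (b, b2), the fraction
    of rows x with x[j] == b and x[j2] == b2.  There are
    4 * (d choose 2) = O(d^2) such counting queries."""
    n = len(db)
    marginals = {}
    for j, j2 in combinations(range(d), 2):
        for b, b2 in product((0, 1), repeat=2):
            marginals[(j, j2, b, b2)] = sum(
                1 for x in db if x[j] == b and x[j2] == b2) / n
    return marginals

db = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]
m = two_way_marginals(db, d=3)
# e.g. the fraction of rows with x[0] = 1 and x[1] = 1 is 2/4 = 0.5
```

Releasing each of these O(d²) values with independent Laplace noise is exactly the cheap non-synthetic summary mentioned in the text; the hard part is matching them with an actual synthetic database.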
For non-synthetic data, they show a close connection between the infeasibility of producing a differentially private summarization and the existence of efficient "traitor-tracing schemes." However, these results leave open the possibility that for natural families of counting queries (e.g. those corresponding to conjunctions), producing a differentially private synthetic database (or non-synthetic summarization) can be done efficiently. Indeed, one may have gained optimism by analogy with the early days of computational learning theory, where one-way functions were used to show hardness of learning arbitrary efficiently computable concepts, but natural subclasses (like conjunctions) were found to be learnable [36].

Our Results. We prove that it is infeasible to produce synthetic databases preserving even very simple counting queries, such as 2-way marginals:

Theorem 1.1. Assuming the existence of one-way functions, there is a constant γ > 0 such that for every polynomial p, there is no polynomial-time, differentially private algorithm A that takes a database D ∈ ({0,1}^d)^{p(d)} and produces a synthetic database D̂ ∈ ({0,1}^d)^* such that |c(D) − c(D̂)| ≤ γ for all 2-way marginals c. (Recall that a 2-way marginal c(D) computes the fraction of database rows satisfying a conjunction of two literals, i.e. the fraction of rows x ∈ {0,1}^d such that x_j = b and x_{j′} = b′ for some columns j, j′ ∈ [d] and values b, b′ ∈ {0,1}.)

In fact, our impossibility result extends from conjunctions of 2 literals to any family of constant-arity predicates that contains a predicate depending on at least two variables, such as parities of 3 literals. As mentioned earlier, all 2-way marginals can be easily summarized with noisy counts, by just adding noise to each of the O(d²) values.

¹ Technically, this "real database" may assign fractional weight to some rows.

Thus, our result shows that requiring a synthetic database may severely constrain what sorts of differentially private data releases are possible. (Dwork et al.
[18] also showed that there exists a poly(d)-sized family of counting queries that are hard to summarize with synthetic data, thereby separating synthetic data from non-synthetic data. Our contribution is to show that such a separation holds for a very simple and natural family of predicates, namely 2-way marginals.)

This separation between synthetic data and non-synthetic data seems analogous to the separations between proper and improper learning in computational learning theory [32, 22], where it is infeasible to learn certain concept classes if the output hypothesis is constrained to come from the same representation class as the concept, but it becomes feasible if we allow the output hypothesis to come from a different representation class. This gives hope for designing efficient, differentially private algorithms that take a database and produce a compact summary of it that is not synthetic data but somehow can be used to accurately answer exponentially many questions about the original database (e.g. all marginals). The negative results of [18] on non-synthetic data (assuming the existence of efficient traitor-tracing schemes) do not say anything about natural classes of counting queries, such as marginals.

To bypass the complexity barrier stated in Theorem 1.1, it may not be necessary to introduce exotic data representations; some mild generalizations of synthetic data may suffice. For example, several recent algorithms [9, 35, 20] produce several synthetic databases, with the guarantee that the median answer over these databases is approximately accurate. More generally, we can consider summarizations of a database D that consist of a collection D̂ of rows from the same universe as the original database, and where we estimate c(D) by applying the predicate c to each row of D̂ and then aggregating the results via some aggregation function f.
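The aggregation step just described can be sketched as follows; standard synthetic data uses the plain average, while the median-of-averages variant assumes the summary is a list of several synthetic databases, as in the multi-database relaxation mentioned above (the function names are ours):

```python
from statistics import median

def average_aggregator(synthetic_db, predicate):
    """Standard synthetic data: estimate c(D) by the fraction of rows of
    a single synthetic database satisfying the predicate."""
    return sum(predicate(x) for x in synthetic_db) / len(synthetic_db)

def median_of_averages(synthetic_dbs, predicate):
    """Relaxed synthetic data: several synthetic databases, with the
    median of the per-database averages used as the estimate."""
    return median(average_aggregator(s, predicate) for s in synthetic_dbs)

# Toy example: rows are single bits and the predicate is the identity.
dbs = [[1, 1, 0], [0, 0, 0], [1, 1, 1]]
est = median_of_averages(dbs, lambda x: x)  # averages 2/3, 0, 1; median 2/3
```

The point of the median is robustness: a few badly wrong sub-databases cannot move the estimate, which is why this aggregator appears in the multi-database algorithms cited above.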
With standard synthetic data, f is simply the average, but we may instead allow f to take a median of averages, or apply an affine shift to the average. For such relaxed synthetic data, we prove the following results:

There is a constant k such that counting queries corresponding to k-juntas (functions depending on at most k variables) cannot be accurately and privately summarized as relaxed synthetic data with a median-of-averages aggregator, or with a symmetric and monotone aggregator (that is independent of the predicate c being queried).

For every constant k, counting queries corresponding to k-juntas can be accurately and privately summarized as relaxed synthetic data with an aggregator that applies an affine shift to the average (where the shift does depend on the predicate being queried).

Techniques. Our proof of Theorem 1.1 and our other negative results are obtained by combining the hard-to-sanitize databases of Dwork et al. [18] with PCP reductions. They construct a database consisting of valid message-signature pairs (m_i, σ_i) under a digital signature scheme, and argue that any differentially private sanitizer that preserves accuracy for the counting query associated with the signature-verification predicate can be used to forge valid signatures. We replace each message-signature pair (m_i, σ_i) with a PCP encoding π_i that proves that (m_i, σ_i) satisfies the signature verification algorithm. We then argue that if accuracy is preserved for a large fraction of the (constant-arity) constraints of the PCP verifier, then we can "decode" the PCP either to violate privacy (by recovering one of the original message-signature pairs) or to forge a signature (by producing a new message-signature pair). We remark that error-correcting codes were already used in [18] for the purpose of producing a fixed polynomial-sized set of counting queries that can be used for all verification keys.
Our observation is that by using PCP encodings, we can reduce not only the number of counting queries in consideration, but also their computational complexity. Our proof has some unusual features among PCP-based hardness results:

As far as we know, this is the first time that PCPs have been used in conjunction with cryptographic assumptions for a hardness result. (They have been used together for positive results regarding computationally sound proof systems [28, 29, 6].) It would be interesting to see if such a combination could be useful in, say, computational learning theory (where PCPs have been used for hardness of "proper" learning [2, 23] and cryptographic assumptions for hardness of representation-independent learning [36, 26]).

While PCP-based inapproximability results are usually stated as Karp reductions, we actually need them to be Levin reductions, capturing that they are reductions between search problems, and not just decision problems. (Previously, this property has been used in the same results on computationally sound proofs mentioned above.)

2 Preliminaries

2.1 Sanitizers

Let a database D ∈ ({0,1}^d)^n be a matrix of n rows, x_1, ..., x_n ∈ {0,1}^d, corresponding to people, each of which contains d binary attributes. A sanitizer A : ({0,1}^d)^n → R takes a database and outputs some data structure in R. In the case where R = ({0,1}^d)^n̂ (an n̂-row database) we say that A outputs a synthetic database. We would like such sanitizers to be both private and accurate. In particular, the notion of privacy we are interested in is as follows.

Definition 2.1 (Differential Privacy [17]). A sanitizer A : ({0,1}^d)^n → R is (ε, δ)-differentially private if for every two databases D_1, D_2 ∈ ({0,1}^d)^n that differ on exactly one row, and every subset S ⊆ R,

    Pr[A(D_1) ∈ S] ≤ e^ε · Pr[A(D_2) ∈ S] + δ.

In the case where δ = 0 we say that A is ε-differentially private.

Since a sanitizer that always outputs 0 satisfies Definition 2.1, we also need to define what it means for a sanitizer to be accurate.
In this paper we consider accuracy with respect to counting queries. Consider a set C consisting of boolean predicates c : {0,1}^d → {0,1}, which we call a concept class. Then each predicate c induces a counting query that on database D = (x_1, ..., x_n) ∈ ({0,1}^d)^n returns

    c(D) = (1/n) Σ_{i=1}^{n} c(x_i).

If the output of A is a synthetic database D̂ ∈ ({0,1}^d)^{n̂}, then c(A(D)) is simply the fraction of rows of D̂ that satisfy the predicate c. If A outputs a data structure that is not a synthetic database, then we require an evaluator function E : R × C → ℝ that estimates c(D) from the output of A. For example, A may output a vector Z = (c(D) + Z_c)_{c∈C}, where (Z_c)_{c∈C} is a random vector, and E(Z, c) is the c-th component of Z ∈ R = ℝ^{|C|}. Abusing notation, we will write c(A(D)) for E(A(D), c). We will say that A is accurate for C if the fractional counts c(A(D)) are close to the fractional counts c(D).

Definition 2.2 (Accuracy). An output Z of sanitizer A(D) is α-accurate for a concept class C if

    ∀ c ∈ C, |c(Z) − c(D)| ≤ α.

A sanitizer A is (α, β)-accurate for a concept class C if for every database D,

    Pr_{A's coins}[∀ c ∈ C, |c(A(D)) − c(D)| ≤ α] ≥ 1 − β.

In this paper we say f(n) = negl(n) if f(n) = o(n^{−c}) for every c > 0, and we say that such an f(n) is negligible. We use |s| to denote the length of a string s, and s_1‖s_2 to denote the concatenation of s_1 and s_2.

2.2 Hardness of Sanitizing

Differential privacy is a very strong notion of privacy, so it is common to look for hardness results that rule out weaker notions of privacy. These hardness results show that every sanitizer must be "blatantly non-private" in some sense. In this paper our notion of blatant non-privacy roughly states that there exists an efficient adversary who can find a row of the original database using only the output from any efficient sanitizer.
Such definitions are also referred to as "row non-privacy." We define hardness-of-sanitization with respect to a particular concept class, and want to exhibit a distribution on databases for which it would be infeasible for any efficient sanitizer to give accurate output without revealing a row of the database. Specifically, following [18], we define the following notions.

Definition 2.3 (Database Distribution Ensemble). Let D = D_d be an ensemble of distributions on d-column databases with n + 1 rows, D_0 ∈ ({0,1}^d)^{n+1}. Let (D, D′, i) ←_R D̃ denote the experiment in which we choose D_0 ←_R D and i ∈ [n] uniformly at random, and set D to be the first n rows of D_0 and D′ to be D with the i-th row replaced by the (n+1)-st row of D_0.

Definition 2.4 (Hard-to-sanitize Distribution). Let C be a concept class, α ∈ [0,1] be a parameter, and D = D_d be a database distribution ensemble. The distribution D is (α, C)-hard-to-sanitize if there exists an efficient adversary T such that for any alleged polynomial-time sanitizer A the following two conditions hold:

1. Whenever A(D) is α-accurate, T(A(D)) outputs a row of D:

    Pr_{(D,D′,i)←_R D̃, A's and T's coins}[(A(D) is α-accurate for C) ∧ (T(A(D)) ∩ D = ∅)] ≤ negl(d).

2. For every efficient sanitizer A, T cannot extract the replaced row from A(D′):

    Pr_{(D,D′,i)←_R D̃, A's and T's coins}[T(A(D′)) = x_i] ≤ negl(d),

where x_i is the i-th row of D.

In [18], it was shown that every distribution that is hard to sanitize in the sense of Definition 2.4 is also hard to sanitize while achieving only a weak form of differential privacy:

Claim 2.5.
([18]) If a distribution ensemble D = D_d on n(d)-row databases is (α, C)-hard-to-sanitize, then for every constant a > 0 and every β = β(d) ≤ 1 − 1/poly(d), no polynomial-time sanitizer A that is (α, β)-accurate with respect to C can achieve (a·log(n), (1 − 8β)/2n^{1+a})-differential privacy. In particular, for all constants ε, β > 0, no polynomial-time sanitizer can achieve (α, β)-accuracy and (ε, negl(n))-differential privacy.

(We could use a weaker definition of hard-to-sanitize distributions, which would still suffice to rule out differential privacy, that only requires that for every efficient A, there exists an adversary T that almost always extracts a row of D from every α-accurate output of A(D). In our definition we require that there exists a fixed adversary T that almost always extracts a row of D from every α-accurate output of any efficient A. Reversing the quantifiers in this fashion only makes our negative results stronger.)

In this paper we are concerned with sanitizers that output synthetic databases, so we will relax Definition 2.4 by restricting the quantification over sanitizers to only those sanitizers that output synthetic data.

Definition 2.6 (Hard-to-sanitize Distribution as Synthetic Data). A database distribution ensemble D is (α, C)-hard-to-sanitize as synthetic data if the conditions of Definition 2.4 hold for every sanitizer A that outputs a synthetic database.

3 Relationship with Hardness of Approximation

The objective of a privacy-preserving sanitizer is to reveal some properties of the underlying database without giving away enough information to reconstruct that database. This requirement has different implications for sanitizers that produce synthetic databases and those with arbitrary output.

The SuLQ framework of [8] is a well-studied, efficient technique for achieving (ε, δ)-differential privacy with non-synthetic output. To get accurate, private output for a family of counting queries with predicates in C, we can release a vector of noisy counts (c(D) + Z_c)_{c∈C}, where the random variables (Z_c)_{c∈C} are drawn independently from a distribution suitable for preserving privacy (e.g.
a Laplace distribution with standard deviation O(|C|/(εn))).

Consider the case of an n-row database D that contains satisfying assignments to a 3-CNF formula φ, and suppose our concept class includes all disjunctions on three literals (or, equivalently, all conjunctions on three literals). Then the technique above releases a set of noisy counts that describes a database in which every clause of φ is satisfied by most of the rows of D. However, sanitizers with synthetic-database output are required to produce a database that consists of rows that satisfy most of the clauses of φ. Because of the noise added to the output, the requirement of a synthetic database does not strictly force the sanitizer to find a satisfying assignment for the given 3-CNF. However, it is known to be NP-hard to find even approximate satisfying assignments for many constraint satisfaction problems. In our main result, Theorem 4.4, we will show that there exists a distribution over databases that is hard-to-sanitize with respect to synthetic data for any concept class that is sufficient to express a hard-to-approximate constraint satisfaction problem.

3.1 Hard-to-Approximate CSPs

We define a constraint satisfaction problem as follows.

Definition 3.1 (Constraint Satisfaction Problem (CSP)). For a function q = q(d) ≤ d, a family of q(d)-CSPs, denoted Γ = (Γ_d)_{d∈ℕ}, is a sequence of sets of boolean predicates on q(d) variables. If q(d) and Γ_d do not depend on d then we refer to Γ as a fixed family of q-CSPs. For every d ≥ q(d), let C_d^Γ be the class consisting of all predicates c : {0,1}^d → {0,1} of the form c(u_1, ..., u_d) = π(u_{i_1}, ..., u_{i_{q(d)}}) for some π ∈ Γ_d and i_1, ..., i_{q(d)} ∈ [d]. We call C^Γ = ∪_{d=1}^∞ C_d^Γ the class of constraints of Γ. Finally, we say a multiset φ ⊆ C_d^Γ is a d-variable instance of Γ, and each φ_i ∈ φ is a constraint of φ.

We say that an assignment u satisfies the constraint φ_i if φ_i(u) = 1.
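To make Definition 3.1 concrete, here is a toy sketch (ours) of a 3-variable instance whose constraints are conjunctions of 2 literals, together with the fraction of constraints an assignment satisfies (the quantity val(φ, u)) and its maximum over all assignments (val(φ)):

```python
from itertools import product

def val(phi, u):
    """Fraction of the constraints of the instance phi satisfied by u."""
    return sum(c(u) for c in phi) / len(phi)

def max_val(phi, d):
    """Brute-force val(phi) over all 2^d assignments (toy sizes only)."""
    return max(val(phi, u) for u in product((0, 1), repeat=d))

# Instance over u = (u0, u1, u2): {u0 AND u1, u1 AND NOT u2, NOT u0 AND u2}.
phi = [lambda u: u[0] & u[1],
       lambda u: u[1] & (1 - u[2]),
       lambda u: (1 - u[0]) & u[2]]
# No assignment satisfies all three (the first and third clash on u0),
# so val(phi) = 2/3, attained e.g. at u = (1, 1, 0).
```

Each constraint here is a predicate on the whole assignment that depends on only q = 2 of the d = 3 variables, matching the form π(u_{i_1}, ..., u_{i_q}) in the definition.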
For φ = {φ_1, ..., φ_m}, define

    val(φ, u) = (1/m) Σ_{i=1}^{m} φ_i(u)   and   val(φ) = max_{u ∈ {0,1}^d} val(φ, u).

Our hardness results will apply to concept classes C^Γ for CSP families with certain additional properties. Specifically, we define:

Definition 3.2 (Nice CSP). A family Γ = (Γ_d)_{d∈ℕ} of q(d)-CSPs is nice if

1. q(d) = d^{o(1)}, and

2. for every d ∈ ℕ, Γ_d contains a non-constant predicate ψ_d : {0,1}^{q(d)} → {0,1}. Moreover, ψ_d and two assignments u^0, u^1 ∈ {0,1}^{q(d)} such that ψ_d(u^0) = 0 and ψ_d(u^1) = 1 can be found in time poly(d).

We note that any fixed family of q-CSPs that contains a non-constant predicate is a nice CSP. Indeed, these CSPs (e.g. conjunctions of 2 literals) are the main application of interest for our results. However, it will sometimes be useful to work with generalizations to nice CSPs with predicates of non-constant arity.

For our hardness result, we will need to consider a strong notion of hard constraint satisfaction problems, which is related to probabilistically checkable proofs. First we recall the standard notion of hardness of approximation under Karp reductions (stated for additive, rather than multiplicative, approximation error).

Definition 3.3 (Inapproximability under Karp reductions). For functions α, β : ℕ → [0,1], a family of CSPs Γ = (Γ_d)_{d∈ℕ} is (α, β)-hard-to-approximate under Karp reductions if there exists a polynomial-time computable function R such that for every circuit C with input size d, if we set φ_C = R(C) ∈ Γ_{d′} for some d′ = poly(d), then

1. if C is satisfiable, then val(φ_C) ≥ β(d′), and

2. if C is unsatisfiable, then val(φ_C) < β(d′) − α(d′).

For our hardness result, we will need a stronger notion of inapproximability, which says that we can efficiently transform satisfying assignments of C into solutions to φ_C of high value, and vice-versa.

Definition 3.4 (Inapproximability under Levin reductions). For functions α, β : ℕ → [0,1].
A family of CSPs Γ is (α, β)-hard-to-approximate under Levin reductions if there exist polynomial-time computable functions R, Enc, Dec such that for every circuit C with input of size d, if we set φ_C = R(C) ∈ Γ_{d′} for some d′ = poly(d), then

1. for every u ∈ {0,1}^d such that C(u) = 1, val(φ_C, Enc(u, C)) ≥ β(d′),

2. for every π ∈ {0,1}^{d′} such that val(φ_C, π) ≥ β(d′) − α(d′), C(Dec(π, C)) = 1, and

3. for every u ∈ {0,1}^d, Dec(Enc(u, C), C) = u.

When we do not wish to specify the value β, we will simply say that the family Γ is α-hard-to-approximate under Levin reductions to indicate that there exists such a β ∈ (α, 1]. If we drop the requirement that R is efficiently computable, then we say that Γ is (α, β)-hard-to-approximate under inefficient Levin reductions.

The notation Enc, Dec reflects the fact that we think of the set of assignments π such that val(φ_C, π) ≥ β as a sort of error-correcting code on the satisfying assignments to C. Any π with value close to β can be decoded to a valid satisfying assignment.

Levin reductions are a stronger notion of reduction than Karp reductions. To see this, let Γ be α-hard-to-approximate under Levin reductions, and let R, Enc, Dec be the functions described in Definition 3.4. We now argue that for every circuit C, the formula φ_C = R(C) satisfies conditions 1 and 2 of Definition 3.3. Specifically, if there exists an assignment u ∈ {0,1}^d that satisfies C, then Enc(u, C) satisfies at least a β fraction of the constraints of φ_C. Conversely, if any assignment π ∈ {0,1}^{d′} satisfies at least a β − α fraction of the constraints of φ_C, then Dec(π, C) is a satisfying assignment of C.

Variants of the PCP Theorem can be used to show that essentially every class of CSPs is hard-to-approximate in this sense. We restrict to CSPs that are closed under complement, as this suffices for our application.

Theorem 3.5 (variant of PCP Theorem). For every fixed family of CSPs Γ that is closed under negation and contains a function that depends on at least two variables, there is a constant α = α(Γ) > 0 such that Γ is α-hard to approximate under Levin reductions.

Proof sketch.
Hardness under Karp reductions follows directly from the classification theorems of Creignou [10] and Khanna et al. [27]. These theorems show that all CSPs are either α-hard under Karp reductions for some constant α > 0 or can be solved optimally in polynomial time. By inspection, the only CSPs that fall into the polynomial-time cases (0-valid, 1-valid, and 2-monotone) and are closed under negation are those containing only dictatorships and constant functions.

The fact that standard PCPs actually yield Levin reductions has been explicitly discussed and formalized by Barak and Goldreich [6] in the terminology of PCPs rather than reductions (the function Enc is called a "relatively efficient oracle-construction" and the function Dec is called "a proof-of-knowledge property"). They verify that these properties hold for the PCP construction of Babai et al. [4], whereas we need them for PCPs of constant query complexity. While the properties probably hold for most (if not all) existing PCP constructions, the existence of the efficient "decoding" function Dec requires some verification. We observe that it follows as a black box from the PCPs of Proximity of [7, 12]. There, a prefix of the PCP (the "implicit input oracle") can be taken to be the encoding of a satisfying assignment of the circuit C in an efficiently decodable error-correcting code. If the PCP verifier accepts with higher probability than the soundness error s, then it is guaranteed that the prefix is close to a valid codeword, which in turn can be decoded to a satisfying assignment. By the correspondence between PCPs and CSPs [3], this yields a CSP (with constraints of constant arity) that is α-hard to approximate under Levin reductions for some constant α > 0 (and β = 1). The sequence of approximation-preserving reductions from arbitrary CSPs to MAX-CUT [31] can be verified to preserve efficiency of decoding (indeed, the correctness of the reductions is proven by specifying how to encode and decode).
Finally, the reductions of [27] from MAX-CUT to any other CSP all involve constant-sized "gadgets" that allow encoding and decoding to be done locally and very efficiently.

It seems likely that optimized PCP/inapproximability results (like [25]) are also Levin reductions, which would yield fairly large values of α for natural CSPs (e.g. α = 1/8 if Γ contains all conjunctions of 3 literals, because then C^Γ contains MAX 3-SAT).

For some of our results we will need CSPs that are very hard to approximate (under possibly inefficient reductions), which we can obtain by "sequential repetition" of constant-error PCPs.

Theorem 3.6. There is a constant C such that for every ε = ε(d) > 0, the constraint family Γ = (Γ_d)_{d∈ℕ} of k(d)-clause 3-CNF formulas is (1 − ε(d), 1)-hard-to-approximate under inefficient Levin reductions, for k(d) = C·log(1/ε(d)).

Proof sketch. As in the proof of Theorem 3.5, disjunctions of 3 literals are (1 − η, 1)-hard-to-approximate under Levin reductions for some constant η > 0. By taking k(d) = log_{1/η}(1/ε(d)) sequential repetitions of this PCP, we get a PCP with completeness 1 and soundness ε(d) whose constraints are 3-CNF formulas with k(d) clauses. We have to check that the resulting PCP preserves the properties of inefficient Levin reductions. The encoder for the k-fold sequential repetition is unchanged. If the initial reduction is R(C) = {φ_1, ..., φ_m} (a set of 3-literal disjunctions), then the reduction R^k(C) for the k-fold sequential repetition will produce m^k k-clause 3-CNF formulae by taking every subcollection of k clauses of R(C). Specifically, for every i_1, i_2, ..., i_k ∈ [m], R^k(C) will contain the k-clause 3-CNF formula φ_{i_1} ∧ φ_{i_2} ∧ ⋯ ∧ φ_{i_k}. The decoder also remains unchanged. If the value of an assignment π is at least ε with respect to R^k(C), then it must have value at least η with respect to R(C), and thus Dec(π, C) will return a satisfying assignment to C, that is, C(Dec(π, C)) = 1.
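The choice k(d) = C·log(1/ε(d)) in Theorem 3.6 comes from the usual soundness calculation: if a single repetition has soundness η < 1, then k independent repetitions have soundness η^k, and it suffices that

```latex
\eta^{k} \le \varepsilon(d)
\quad\Longleftrightarrow\quad
k \ge \frac{\log\bigl(1/\varepsilon(d)\bigr)}{\log(1/\eta)},
```

so taking k(d) = C·log(1/ε(d)) with the constant C = 1/log(1/η) suffices.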
Notice that when k = k(d) = ω(1), the reduction will produce m^{ω(1)} clauses and thus be inefficient. So we will have an inefficient Levin reduction whenever we want to obtain ε(d) = o(1) from this construction.

4 Hard-to-Sanitize Distributions from Hard CSPs

In this section we prove that to efficiently produce a synthetic database that is accurate for the constraints of a CSP that is hard-to-approximate under Levin reductions, we must pay constant error in the worst case. Following [18], we start with a digital signature scheme and a database of valid message-signature pairs. There is a verifying circuit C_vk, and valid message-signature pairs are satisfying assignments to that circuit. Now we encode each row of the database using the function Enc, described in Definition 3.4, which maps satisfying assignments of C_vk to assignments of the CSP instance φ_{C_vk} = R(C_vk) with value at least β. Then, any assignment to the CSP instance that satisfies a β − α fraction of the constraints can be decoded to a valid message-signature pair. The database of encoded message-signature pairs is what we will use as our hard-to-sanitize distribution.

4.1 Super-Secure Digital Signature Schemes

Before proving our main result, we will formally define a super-secure digital signature scheme. These digital signature schemes have the property that it is infeasible under a chosen-message attack to find a new message-signature pair that is different from all those obtained during the attack, even a new signature for an old message. First we formally define digital signature schemes.

Definition 4.1 (Digital signature scheme). A digital signature scheme is a tuple of three probabilistic polynomial-time algorithms Π = (Gen, Sign, Ver) such that

1. Gen takes as input the security parameter 1^κ and outputs a key pair (sk, vk) ←_R Gen(1^κ).

2. Sign takes sk and a message m ∈ {0,1}* as input and outputs a signature σ ←_R Sign_sk(m).

3.
Ver takes vk and a pair (m, σ) and deterministically outputs a bit b ∈ {0,1}, such that for every (sk, vk) in the range of Gen and every message m, we have Ver_vk(m, Sign_sk(m)) = 1.

We define the security of a digital signature scheme with respect to the following game.

Definition 4.2 (Weak forgery game). For any signature scheme Π = (Gen, Sign, Ver) and probabilistic polynomial-time adversary F, WeakForge(F, Π, κ) is the following probabilistic experiment:

1. (sk, vk) ←_R Gen(1^κ).

2. F is given vk and oracle access to Sign_sk. The adversary adaptively queries Sign_sk on a set of messages Q ⊆ {0,1}*, receives a set of message-signature pairs A, and outputs a pair (m*, σ*).

3. The output of the game is 1 if and only if (1) Ver_vk(m*, σ*) = 1, and (2) (m*, σ*) ∉ A.

The weak forgery game is easier for the adversary to win than the standard forgery game: the final condition only requires that the pair output by F be different from all pairs (m, σ) ∈ A, allowing for the possibility that m* ∈ Q. In the standard definition, the final condition would instead be m* ∉ Q. Thus the adversary has more possible outputs that count as a "win" under this definition than under the standard one.

Definition 4.3 (Super-secure digital signature scheme). A digital signature scheme Π = (Gen, Sign, Ver) is super-secure under adaptive chosen-message attack if for every probabilistic polynomial-time adversary F, Pr[WeakForge(F, Π, κ) = 1] ≤ negl(κ).

Although this definition is stronger than the usual definition of existentially unforgeable digital signatures, it is shown in [24] how to modify known constructions [30, 34] to obtain a super-secure digital signature scheme from any one-way function.

4.2 Main Hardness Result

We are now ready to state and prove our hardness result. Let Γ = {Γ_d}_{d=1}^∞ be a family of q(d)-CSPs and let C = ∪_d C_d be the class of constraints of Γ, as constructed in Definition 3.1.

Theorem 4.4.
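The structure of the weak forgery game can be illustrated with a toy stand-in for the signature scheme. Everything here is our own placeholder: the HMAC-based "scheme" is symmetric (so verification is done by re-signing, and vk is a dummy value), and it is not a super-secure signature scheme in the sense required above. The point of the sketch is only the winning condition: the output must be new as a pair, while the message may be old.

```python
import hmac, hashlib, os

def gen(kappa=32):
    # Toy key generation; vk is a placeholder (the toy scheme is
    # symmetric, unlike a real signature scheme).
    sk = os.urandom(kappa)
    return sk, hashlib.sha256(sk).hexdigest()

def sign(sk, m):
    return hmac.new(sk, m, hashlib.sha256).hexdigest()

def weak_forge(adversary, kappa=32):
    # The WeakForge(F, Pi, kappa) experiment of Definition 4.2.
    sk, vk = gen(kappa)
    pairs = set()                     # the set A of pairs handed to F
    def oracle(m):
        s = sign(sk, m)
        pairs.add((m, s))
        return s
    m_star, s_star = adversary(vk, oracle)
    # Win iff the pair is valid and new AS A PAIR; m_star may well
    # repeat an old query (which the standard game would forbid).
    return sign(sk, m_star) == s_star and (m_star, s_star) not in pairs
```

An adversary that merely replays an oracle answer fails condition (2), even though its pair verifies.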
Let Γ = {Γ_d}_{d∈N} be a family of nice (Definition 3.2) q(d)-CSPs such that Γ_d ∪ ¬Γ_d is (α(d), 1)-hard-to-approximate under (possibly inefficient) Levin reductions for α = α(d) ∈ (0, 1/2). Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D = {D_d} on n(d)-row databases that is (α(d), C)-hard-to-sanitize as synthetic data.

Proof. Let Π = (Gen, Sign, Ver) be a super-secure digital signature scheme, and let Γ be a family of CSPs that is α-hard-to-approximate under Levin reductions. Let R, Enc, Dec be the polynomial-time functions and μ = μ(d) ∈ (α, 1] the parameter from Definition 3.4. Let κ = d^λ for a constant λ > 0 to be defined later, and let n = n(d) = poly(d). We define the database distribution ensemble D = {D_d} to generate n + 1 random message-signature pairs and then encode them as PCP witnesses with respect to the signature-verification algorithm. We also encode the verification key for the signature scheme, bit by bit, using a non-constant constraint φ: {0,1}^{q(d)} → {0,1} and assignments u_0, u_1 ∈ {0,1}^{q(d)} such that φ(u_0) = 0 and φ(u_1) = 1, as provided by the definition of nice CSPs (Definition 3.2).

Recall that s_1‖s_2 denotes the concatenation of the strings s_1 and s_2. Note that the length of each row x_i before padding is poly(κ) + q(d)·poly(κ) = poly(d^λ), so we can choose the constant λ > 0 to be small enough that the length of x_i before padding is at most d, making the construction below well defined.
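The row format of the database construction described next (PCP encoding of the message-signature pair, followed by one q-bit block per verification-key bit, zero-padded to length exactly d) can be sketched as follows. Enc, u_0, u_1, and all sizes are placeholders of our own, not the paper's actual objects:

```python
Q = 4                                   # stand-in arity q(d) of phi
U0, U1 = "0" * Q, "1" * Q               # stand-ins for u_0 and u_1

def enc(witness):
    # Placeholder for Enc(., C_vk); any fixed encoding suffices to
    # illustrate the row layout.
    return witness

def make_row(msg_sig_bits, vk_bits, d):
    # One block per verification-key bit: u_1 encodes a 1, u_0 a 0.
    blocks = [enc(msg_sig_bits)] + [U1 if b == "1" else U0 for b in vk_bits]
    row = "".join(blocks)
    assert len(row) <= d, "pick the security parameter small enough"
    return row.ljust(d, "0")            # pad with zeros to length exactly d
```

Evaluating the block constraint φ on the j-th block of every row recovers vk_j, which is how the adversary's key-learning subroutine works below.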
Database Distribution Ensemble D_d:
  (sk, vk) ←_R Gen(1^κ), where ℓ = |vk| = poly(κ); write vk = vk_1 vk_2 ⋯ vk_ℓ
  (m_1, ..., m_{n+1}) ←_R ({0,1}^κ)^{n+1}
  for i = 1 to n + 1 do
    x_i := Enc((m_i ‖ Sign_sk(m_i)), C_vk) ‖ u_{vk_1} ‖ u_{vk_2} ‖ ⋯ ‖ u_{vk_ℓ}, padded with zeros to length exactly d
  end for
  return D := (x_1, ..., x_{n+1})

Every valid pair (m, Sign_sk(m)) is a satisfying assignment of the circuit C_vk, hence every row of D constructed in this way satisfies at least a μ fraction of the clauses of the formula φ_{C_vk} = R(C_vk).

Additionally, for every bit of the verification key there is a block of q(d) bits in each row that contains either a satisfying assignment or a non-satisfying assignment of φ, depending on whether that bit of the key is 1 or 0. Specifically, let L = |Enc((m_i ‖ Sign_sk(m_i)), C_vk)| and, for j = 1, 2, ..., ℓ, let φ^j(x) = φ(x_{L+(j-1)q+1}, ..., x_{L+jq}). Then, by construction, φ^j(D) = vk_j for j = 1, 2, ..., ℓ, and φ^j ∈ C^(d) by our construction of C^(d) (Definition 3.1).

We now prove two lemmas that together establish that D is hard-to-sanitize.

Lemma 4.5. There exists a polynomial-time adversary T such that for every polynomial-time sanitizer A,

  Pr_{(D,D',i) ←_R D̃, A's and T's coins} [ (A(D) is α-accurate for C^(d)) ∧ (T(A(D)) ∩ D = ∅) ] ≤ negl(d).   (1)

Proof. Our privacy adversary tries to find a row of the original database by PCP-decoding each row of the "sanitized" database and then re-encoding it. In order to do so, the adversary needs to know the verification key used in the construction of the database, which it can learn from the answers to the queries φ^j defined above. Formally, we define the privacy adversary by means of a subroutine that tries to learn the verification key and then PCP-decode each row of the input database. Let A be a polynomial-time sanitizer; we will show that Inequality (1) holds.
Subroutine K(D̂):
  Let d be the dimension of the rows in D̂, κ = d^λ, and ℓ = |vk| = poly(κ)
  for j = 1 to ℓ do
    ĉ_j := φ^j(D̂), rounded to {0, 1}
  end for
  return v̂k := ĉ_1 ĉ_2 ⋯ ĉ_ℓ

Subroutine T'(D̂):
  Let n̂ be the number of rows in D̂, and let v̂k := K(D̂)
  for i = 1 to n̂ do
    if C_v̂k(Dec(x̂_i, C_v̂k)) = 1 then return Dec(x̂_i, C_v̂k) end if
  end for
  return ⊥

Privacy Adversary T(D̂): Let v̂k := K(D̂); return Enc(T'(D̂), C_v̂k).

Claim 4.6. If D̂ = A(D) is α-accurate for C^(d), then K(D̂) = vk, the verification key used in the construction of D, and T'(D̂) outputs a pair (m, σ) such that C_vk(m, σ) = 1.

Proof. First we argue that K(D̂) = vk. If vk_j = 0, then φ^j(D) = 0; since D̂ is α-accurate for α < 1/2, we have φ^j(D̂) ≤ α < 1/2, and so ĉ_j = 0 = vk_j. Similarly, if vk_j = 1, then φ^j(D) = 1 and φ^j(D̂) ≥ 1 - α > 1/2, so ĉ_j = 1 = vk_j. Thus, for the rest of the proof, we are justified in substituting vk for v̂k.

Next we argue that T'(D̂) ≠ ⊥. It suffices to show that there exists a row x̂_i ∈ D̂ such that val(φ_{C_vk}, x̂_i) ≥ α, which implies that Dec(x̂_i, C_vk) is a satisfying assignment to C_vk. Recall that each row x_i of D has val(φ_{C_vk}, x_i) ≥ μ. Thus, if φ_{C_vk} = {φ_1, ..., φ_m}, then

  (1/m) Σ_{j=1}^m φ_j(D) = (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n φ_j(x_i) = (1/n) Σ_{i=1}^n val(φ_{C_vk}, x_i) ≥ μ.

Since D̂ is α-accurate for C^(d), for every constraint φ_j ∈ φ_{C_vk} we have φ_j(D̂) ≥ φ_j(D) - α, and thus

  (1/n̂) Σ_{i=1}^{n̂} val(φ_{C_vk}, x̂_i) = (1/m) Σ_{j=1}^m φ_j(D̂) ≥ (1/m) Σ_{j=1}^m φ_j(D) - α ≥ μ - α.

So for at least one row x̂ ∈ D̂ it must be the case that val(φ_{C_vk}, x̂) ≥ μ - α, which is at least α for our setting of parameters, and the definition of Dec (Definition 3.4) implies C_vk(Dec(x̂, C_vk)) = 1.

Now notice that if T(A(D)) outputs an encoding of a valid message-signature pair but T(A(D)) ∩ D = ∅, then T'(A(D)) has forged a new signature that is not among those used to generate D, violating the security of the digital signature scheme.
Formally, we construct a signature forger as follows.

Forger F(vk), with oracle access to Sign_sk:
  Use the oracle Sign_sk to generate a database D just as in the definition of D_d (consisting of PCP encodings of valid message-signature pairs and an encoding of vk)
  Let D̂ := A(D) and return x̂ := T'(D̂)

Notice that running F in a chosen-message attack is equivalent to running T in the experiment of Inequality (1), except that F does not re-encode the output of T'(A(D)). By the super-security of the signature scheme, if the x̂ output by F is a valid message-signature pair (as holds if A(D) is α-accurate for C^(d), by Claim 4.6), then it must be one of the message-signature pairs used to construct D, except with probability negl(κ) = negl(d). This implies that T(A(D)) = Enc(x̂, C_vk) ∈ D, except with negligible probability. Thus we have

  Pr_{(D,D',i) ←_R D̃, A's and T's coins} [ A(D) is α-accurate for C^(d) ⟹ T(A(D)) ∈ D ] ≥ 1 - negl(d),

which is equivalent to the statement of the lemma.

Lemma 4.7. For every polynomial-time sanitizer A,

  Pr_{(D,D',i) ←_R D̃, A's and T's coins} [ T(A(D')) = x_i ] ≤ negl(d).

Proof. Since the messages used in D' are drawn independently of m_i, the database D' contains no information about the message m_i, and thus no adversary can, on input A(D'), output the target row x_i except with negligible probability.

These two lemmas suffice to establish that D is (α, C)-hard-to-sanitize as synthetic data. Theorem 1.1 in the introduction follows by combining Theorems 3.5 and 4.4.

5 Relaxed Synthetic Data

The proof of Theorem 4.4 requires that the sanitizer output a synthetic database. In this section we present similar hardness results for sanitizers that produce other forms of output, as long as they still produce a collection of elements of {0,1}^d that are interpreted as the data of (possibly "fake") individuals.
More specifically, we consider sanitizers that output a database D̂ ∈ ({0,1}^d)^{n̂} whose answers are computed by an evaluation function of the following form: to evaluate a predicate c ∈ C on D̂, apply c to each row x̂_1, ..., x̂_{n̂} to get a string of n̂ bits, and then apply a function f: {0,1}^{n̂} × C → [0,1] to determine the answer. For example, when the sanitizer outputs a standard synthetic database, we have f(b, c) = (1/n̂) Σ_{i=1}^{n̂} b_i, which is just the fraction of rows that get labeled 1 by the predicate c (independent of c). We now give a formal definition of relaxed synthetic data.

Definition 5.1 (Relaxed synthetic data). A sanitizer A: ({0,1}^d)^n → ({0,1}^d)^{n̂} with evaluator E outputs relaxed synthetic data for a family of predicates C if there exists an f: {0,1}^{n̂} × C → [0,1] such that (1) for every c ∈ C, E(D̂, c) = f(c(x̂_1), c(x̂_2), ..., c(x̂_{n̂}), c), and (2) f is monotone² in its first n̂ inputs.

This relaxed notion of synthetic data is of interest because many natural approaches to sanitization yield outputs of this type. In particular, several previous sanitization algorithms [20, 35, 9] produce a set of synthetic databases and answer a query by taking a median over the answers given by the individual databases. Sanitizers that use medians of synthetic databases no longer have the advantage that they are "interchangeable" with the original data, but they are still desirable for data releases because they retain the property that a small data structure can give accurate answers to a large number of queries. We view such a collection of databases as a single synthetic database but require that f have a special form. Unfortunately, the sanitizers of [20] and [35] run in time exponential in the dimension of the data, d, and the results of the next subsection show that this limitation is inherent even for simple concept classes.

²Given two vectors a = (a_1, ..., a_n) and b = (b_1, ..., b_n), we say b ⪯ a iff b_i ≤ a_i for every i ∈ [n]. We say a function f: {0,1}^n → [0,1] is monotone if b ⪯ a implies f(b) ≤ f(a).
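A minimal sketch of this evaluation interface, together with the median evaluator of Definition 5.2 below (both illustrative; predicates are 0/1-valued functions on rows):

```python
from statistics import median

def evaluate(rows, c, f):
    # Relaxed synthetic data (Definition 5.1): the answer may depend
    # only on the bit vector (c(x_1), ..., c(x_n)) and on c, via f.
    return f([c(x) for x in rows], c)

def synthetic_f(bits, c):
    # Standard synthetic database: f(b, c) = mean(b), independent of c.
    return sum(bits) / len(bits)

def median_f(partition):
    # Medians of synthetic data: average within each sub-database of
    # the partition (given as lists of row indices), then take the
    # median of the sub-database answers.
    def f(bits, c):
        return median(sum(bits[i] for i in S) / len(S) for S in partition)
    return f
```

Both special cases are monotone in the bit vector, as Definition 5.1 requires.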
Throughout this section we will continue to use c(D̂) to refer to the answer given by interpreting D̂ ∈ ({0,1}^d)^{n̂} as a synthetic database. We now present our hardness results for relaxed synthetic data: when the function f takes the median over synthetic databases (Section 5.1), when f is an arbitrary monotone symmetric function (Section 5.2), and when the family of concepts contains CSPs that are very hard to approximate (Section 5.3). Our proofs use the same construction of hard-to-sanitize databases as Theorem 4.4, with a modified analysis and modified parameters, to show that the output must still contain a PCP-decodable row.

5.1 Hardness of Sanitizing as Medians

In this section we establish that the distribution D used in the proof of Theorem 4.4 is hard to sanitize as medians of synthetic data, formally defined as follows.

Definition 5.2 (Medians of synthetic data). A sanitizer A: ({0,1}^d)^n → ({0,1}^d)^{n̂} with evaluator E outputs medians of synthetic data if there is a partition [n̂] = S_1 ∪ S_2 ∪ ⋯ ∪ S_ℓ such that

  E(x̂_1, ..., x̂_{n̂}, c) = median{ (1/|S_1|) Σ_{i∈S_1} c(x̂_i), (1/|S_2|) Σ_{i∈S_2} c(x̂_i), ..., (1/|S_ℓ|) Σ_{i∈S_ℓ} c(x̂_i) }.

Note that medians of synthetic data are a special case of relaxed synthetic data. In the following, we rule out efficient sanitizers with medians of synthetic data for CSPs that are hard to approximate within a multiplicative factor larger than 2. By Theorem 3.6, these CSPs include k-clause 3-CNF formulas for some constant k.

Theorem 5.3. Let Γ = {Γ_d}_{d∈N} be a family of nice (Definition 3.2) q(d)-CSPs such that Γ_d ∪ ¬Γ_d is ((μ(d) - α(d))/2, 1)-hard-to-approximate under (possibly inefficient) Levin reductions for α = α(d) ∈ (0, 1/2), where μ = μ(d) is the parameter from Definition 3.4.
Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D on n(d)-row databases that is (α(d), C)-hard-to-sanitize as medians of synthetic data.

Proof. Let Γ be a family of CSPs that is ((μ - α)/2, 1)-hard-to-approximate under Levin reductions, and let D be the database distribution ensemble described in the proof of Theorem 4.4. Let A(D) = D̂, and let D̂_1, ..., D̂_ℓ be the partition of the rows of D̂ corresponding to S_1, ..., S_ℓ, i.e., D̂_k = (x̂_i)_{i∈S_k}. Assuming that D̂ is α-accurate as medians of synthetic data, we will show that there must exist a row x̂ ∈ D̂ such that val(φ_{C_vk}, x̂) ≥ (μ - α)/2. To do so, we observe that if D̂ is accurate as medians of synthetic databases, then for each predicate, half of D̂'s synthetic databases must give an answer that is close to D's answer. Thus one of these synthetic databases must be close to D for half of the predicates in φ_{C_vk}. By our construction of D, we conclude that each of these predicates is satisfied by many rows of this synthetic database, and thus some row satisfies enough of the predicates to allow decoding a message-signature pair.

We need to show that there exists an adversary T such that for every polynomial-time sanitizer A,

  Pr_{(D,D',i) ←_R D̃, A's and T's coins} [ (A(D) is α-accurate for C^(d)) ∧ (T(A(D)) ∩ D = ∅) ] ≤ negl(d).   (2)

To do so, we use the same subroutine T'(D̂) we used in the proof of Lemma 4.5; that is, we consider a subroutine that looks for a row satisfying sufficiently many clauses of φ_{C_vk} and returns the PCP decoding of that row. It suffices to establish the following claim, analogous to Claim 4.6.

Claim 5.4. If D̂ is α-accurate for C^(d) as medians of synthetic data, for α < 1/2, then K(D̂) = vk, the verification key used in the construction of D, and T'(D̂) outputs a pair (m, σ) such that C_vk(m, σ) = 1.
Proof. As in the proof of Claim 4.6, if D̂ is α-accurate as medians of synthetic data for C^(d), then K(D̂) = vk, so for the rest of the proof we are justified in substituting vk for v̂k.

Let φ_{C_vk} = {φ_1, ..., φ_m}. We say that D̂_k is good for φ_j if φ_j(D̂_k) ≥ φ_j(D) - α. For each j, the published answer is the median of φ_j(D̂_1), ..., φ_j(D̂_ℓ) and is within α of φ_j(D); since at least half of the sub-database answers are greater than or equal to the median, at least half of the sub-databases are good for each φ_j. Thus, writing t⁺ := max(t, 0),

  E_{k ←_R [ℓ]} [ (1/m) Σ_{j=1}^m 1{D̂_k is good for φ_j} · (φ_j(D) - α)⁺ ] ≥ (1/2m) Σ_{j=1}^m (φ_j(D) - α) ≥ (μ - α)/2,

using (1/m) Σ_j φ_j(D) ≥ μ as in the proof of Claim 4.6. Hence there exists a k such that

  (1/m) Σ_{j=1}^m φ_j(D̂_k) ≥ (1/m) Σ_{j=1}^m 1{D̂_k is good for φ_j} · (φ_j(D) - α)⁺ ≥ (μ - α)/2.

The left-hand side equals (1/|S_k|) Σ_{i∈S_k} val(φ_{C_vk}, x̂_i), so for at least one row x̂ ∈ D̂_k it must be the case that val(φ_{C_vk}, x̂) ≥ (μ - α)/2, and the definition of Dec (Definition 3.4) implies that C_vk(Dec(x̂, C_vk)) = 1.

Since the distribution D is unchanged, Lemma 4.7 still holds in this setting. Thus we have established that D is (α, C)-hard-to-sanitize as medians of synthetic data.

5.2 Hardness of Sanitizing with Symmetric Evaluation Functions

In this section we establish the hardness of sanitization for relaxed synthetic data where the evaluation function is symmetric.

Definition 5.5 (Symmetric relaxed synthetic data). A sanitizer A: ({0,1}^d)^n → ({0,1}^d)^{n̂} with evaluator E outputs symmetric relaxed synthetic data if there exists a monotone function g: [0,1] → [0,1] such that

  E(x̂_1, ..., x̂_{n̂}, c) = g( (1/n̂) Σ_{i=1}^{n̂} c(x̂_i) ).

Note that symmetric relaxed synthetic data is also a special case of relaxed synthetic data. Our definition is actually symmetric in two respects: we require that g not depend on the predicate c, and also that g depend only on the fraction of rows that satisfy c. As with medians of synthetic data, we show that it is intractable to produce a sanitization as symmetric relaxed synthetic data that is accurate when the queries come from a CSP that is hard to approximate.

Theorem 5.6.
Let Γ = {Γ_d}_{d∈N} be a family of nice (Definition 3.2) q(d)-CSPs that is closed under complement (Γ_d = ¬Γ_d) and is (γ(d) + 1/2, 1)-hard-to-approximate under (possibly inefficient) Levin reductions for γ = γ(d) ∈ (0, 1/2). Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D = {D_d} on n(d)-row databases that is (γ(d), C)-hard-to-sanitize as symmetric relaxed synthetic data.

By Theorem 3.6, the family of k-clause CNF formulas, for some constant k, is (1/2 + γ, 1)-hard-to-approximate under Levin reductions for some constant γ > 0.

Proof. Let Γ be a family of CSPs that is (γ + 1/2, 1)-hard-to-approximate under Levin reductions, let D = {D_d} be the database distribution ensemble described in the proof of Theorem 4.4, D ←_R D_d, and D̂ = A(D). We will use the idea from Theorem 4.4, namely that the underlying synthetic database cannot contain a row satisfying too many clauses of φ_{C_vk}, to show that g must map a small input to a large output and a large input to a small output, contradicting the monotonicity of g.

Let φ_{C_vk} = {φ_1, ..., φ_m}, so that (1/m) Σ_{j=1}^m φ_j(D) ≥ μ. It must also be that (1/m) Σ_{j=1}^m φ_j(D̂) ≤ γ + 1/2; otherwise there would exist a row x̂ ∈ D̂ = A(D) with val(φ_{C_vk}, x̂) ≥ γ + 1/2, which could be PCP-decoded into a forgery as in the proof of Theorem 4.4. An averaging argument then yields a J ∈ [m] such that φ_J(D) ≥ φ_J(D̂) + 1/2; for our setting of parameters we may further take J so that φ_J(D̂) ≤ 1/2 and φ_J(D) > 1/2 + γ.

Now consider the negation ¬φ_J, which also lies in C because Γ is closed under complement. Since ¬φ_J(D̂) = 1 - φ_J(D̂) and ¬φ_J(D) = 1 - φ_J(D), we have ¬φ_J(D̂) ≥ 1/2 ≥ φ_J(D̂) and ¬φ_J(D) < 1/2 - γ. By the γ-accuracy of D̂ as symmetric relaxed synthetic data,

  g(φ_J(D̂)) ≥ φ_J(D) - γ > 1/2  and  g(¬φ_J(D̂)) ≤ ¬φ_J(D) + γ < 1/2.

But then g(φ_J(D̂)) > g(¬φ_J(D̂)) even though φ_J(D̂) ≤ ¬φ_J(D̂), which contradicts the monotonicity of g.
5.3 Hardness of Sanitizing Very Hard CSPs with Relaxed Synthetic Data

In this section we show that no efficient sanitizer can produce accurate relaxed synthetic data for a sequence of CSPs that is (γ(d), 1)-hard-to-approximate under inefficient Levin reductions for a negligible function γ(d). By Theorem 3.6, these CSPs include 3-CNF formulas with ω(log d) clauses.

Intuitively, an efficient sanitizer must produce a synthetic database of n̂(d) = poly(d) rows, and thus, as d grows, an efficient sanitizer cannot produce a synthetic database containing a row that satisfies a non-negligible fraction of the clauses of a particular CSP instance (the signature-verification CSP from our earlier results). Thus, under evaluators of the type in Definition 5.1, there can be only one answer to most queries, and hence no accurate sanitizer.

Theorem 5.7. Let Γ = {Γ_d}_{d∈N} be a family of nice (Definition 3.2) q(d)-CSPs such that Γ_d ∪ ¬Γ_d is (γ(d), 1)-hard-to-approximate under (possibly inefficient) Levin reductions for a negligible function γ(d). Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D = {D_d} on n(d)-row databases that is (1/3, C)-hard-to-sanitize as relaxed synthetic data.

Proof. Let Γ be a family of CSPs that is (γ(d), 1)-hard-to-approximate under Levin reductions, and let D = {D_d} be the database distribution ensemble described in the proof of Theorem 4.4. Note that since Γ_d depends on d, there is a sequence of triples (R_d, Enc_d, Dec_d), and the construction of D_d should use the appropriate encoder for each choice of d. Let D ←_R D_d and D̂ = A(D), and let φ_{C_vk} = {φ_1, ..., φ_m}; by the construction of D_d, φ_j(D) = 1 for every j ∈ [m]. It must be that (1/m) Σ_{j=1}^m φ_j(D̂) ≤ γ(d); otherwise there would exist a row x̂ ∈ D̂ with val(φ_{C_vk}, x̂) ≥ γ(d), and we could PCP-decode x̂ as in the proof of Theorem 4.4.
Since E_{j ←_R [m]} [φ_j(D̂)] ≤ γ(d), there must exist a subset J ⊆ [m] of size |J| ≥ 2m/3 such that φ_j(D̂) ≤ 3γ(d) ≤ negl(d) for all j ∈ J. Since D̂ ∈ ({0,1}^d)^{n̂} for n̂ = n̂(d) = poly(d) (by the efficiency of A), each φ_j(D̂) = (1/n̂) Σ_{i=1}^{n̂} φ_j(x̂_i) lies in {0, 1/n̂(d), 2/n̂(d), ..., 1}, so φ_j(D̂) ≤ negl(d) implies φ_j(D̂) = 0 for all sufficiently large d. Let 0̄ = 0^{n̂} denote the all-zeros bit vector, so that E(D̂, φ_j) = f(0̄, φ_j) for every j ∈ J. If we assume D̂ is 1/3-accurate as relaxed synthetic data then, since φ_j(D) = 1, we must have f(0̄, φ_j) ≥ 2/3 for every j ∈ J.

Now consider the execution of A on the database D⁰ consisting of n identical all-zero rows 0^d. With probability 1 - negl(d), Dec(0^d, C_vk) is not a valid message-signature pair (otherwise we could forge a signature without any oracle queries), so by Definition 3.4, Part 2, we have val(φ_{C_vk}, 0^d) ≤ γ(d). Since the rows of D⁰ are identical, φ_j(D⁰) ∈ {0, 1} for every j ∈ [m], so for at least a 1 - γ(d) fraction of j ∈ [m] we have φ_j(D⁰) = 0. Let D̂⁰ = A(D⁰). By repeating the signature-forging argument, with probability 1 - negl(d) we have (1/m) Σ_{j=1}^m φ_j(D̂⁰) ≤ γ(d), and thus there must exist a subset J' ⊆ [m] of size |J'| ≥ 2m/3 with φ_j(D̂⁰) ≤ 3γ(d), and hence φ_j(D̂⁰) = 0, for every j ∈ J'. There must then exist a subset J'' ⊆ J' of size |J''| ≥ (2/3 - γ(d))m such that φ_j(D⁰) = φ_j(D̂⁰) = 0 for every j ∈ J''. So if D̂⁰ is 1/3-accurate for C as relaxed synthetic data, it must be that f(0̄, φ_j) ≤ 1/3 for every j ∈ J''. Since |J| + |J''| > m, there exists j ∈ J ∩ J'' with both

  1. f(0̄, φ_j) ≥ 2/3, and
  2. f(0̄, φ_j) ≤ 1/3,

which is a contradiction.

5.4 Positive Results for Relaxed Synthetic Data

In this section we present an efficient, accurate sanitizer for small (e.g., polynomial in d) families of parity queries that outputs symmetric relaxed synthetic data, and we show that this sanitizer also yields accurate answers for any family of constant-arity predicates when evaluated as relaxed synthetic data.
Our result for parities shows that relaxed synthetic data (and even symmetric relaxed synthetic data) allows for more efficient sanitization than standard synthetic data, since Theorem 4.4 rules out an accurate, efficient sanitizer that produces a standard synthetic database even for the class of 3-literal parity predicates. Our result for parities also shows that our hardness result for symmetric relaxed synthetic data (Theorem 5.6) is tight with respect to the required hardness of approximation, since the class of 3-literal parity predicates is (1/2 + γ, 1)-hard-to-approximate for every constant γ > 0 [25].

A function f: {0,1}^d → {0,1} is a k-junta if it depends on at most k variables. Let J_{d,k} be the set of all k-juntas on d variables.

Theorem 5.8. For every constant k, there exists an ε-differentially private sanitizer that runs in time poly(n, d), produces relaxed synthetic data, and is (α, β)-accurate for J_{d,k} when n ≥ (C·B(d,k)/εα)·log(B(d,k)/β) for a sufficiently large constant C, where B(d,k) := Σ_{i=0}^k (d choose i).

The privacy, accuracy, and efficiency guarantees of our theorem can be achieved without relaxed synthetic data, simply by releasing a vector of noisy answers to the queries [17]. Our sanitizer will, in fact, begin with this vector of noisy answers and construct relaxed synthetic data from those answers. Our technique is similar to that of Barak et al. [5], which begins with a vector of noisy answers to parity queries (defined in Section 5.4.1) and constructs a (standard) synthetic database that gives answers to each query that are close to the initial noisy answers. They construct their synthetic database by solving a linear program over 2^d variables, one for the frequency of each possible row x ∈ {0,1}^d, and thus their sanitizer runs in time exponential in d. Our sanitizer also starts with a vector of noisy answers to parity queries, but it efficiently constructs symmetric relaxed synthetic data that gives answers to each query that are close to the initial noisy answers after applying a fixed linear scaling.
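The pipeline just described (noisy parity answers, then blocks of sampled rows; formalized in Section 5.4.1 below) can be sketched end to end. The helper names and the inline Laplace sampler are our own, and we truncate each noisy answer to [-1, 1] so that the sampling probability is well defined. Sampling a uniform row with a prescribed parity is easy for s ≠ 0: sample uniformly and, if the parity is wrong, flip one coordinate in the support of s (a bijection between the two sides of the parity).

```python
import math, random

def chi(s, x):
    # Parity predicate chi_s(x) = (-1)^{<x, s>} over {0,1}^d.
    return -1 if sum(a & b for a, b in zip(s, x)) % 2 else 1

def sample_laplace(scale):
    # The difference of two iid Exp(1/scale) samples is Lap(scale).
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def random_row_with_parity(s, b, d):
    # Uniform row of {0,1}^d conditioned on chi_s(x) = b, for s != 0.
    x = [random.randint(0, 1) for _ in range(d)]
    if chi(s, x) != b:
        x[s.index(1)] ^= 1          # flip one coordinate in s's support
    return x

def parity_sanitizer(db, preds, eps, alpha, beta):
    # One noisy answer a_j per query, then a block of T rows per
    # query, sampled so that E[chi^(j)] is a_j on its own block and
    # 0 on every other block (by orthogonality of distinct parities).
    n, d, p = len(db), len(db[0]), len(preds)
    scale = p / (eps * n)                       # Lap(|P| / (eps n))
    T = math.ceil((2 * p / alpha ** 2) * math.log(4 * p / beta))
    rows = []
    for s in preds:
        a = sum(chi(s, x) for x in db) / n + sample_laplace(scale)
        a = max(-1.0, min(1.0, a))              # keep (a+1)/2 a probability
        for _ in range(T):
            b = 1 if random.random() < (a + 1) / 2 else -1
            rows.append(random_row_with_parity(s, b, d))
    return rows

def evaluate_parity(rows, s, num_queries):
    # Symmetric evaluator: |P| times the empirical mean of chi_s.
    return num_queries * sum(chi(s, x) for x in rows) / len(rows)
```

The rescaling by |P| in the evaluator compensates for the fact that only a 1/|P| fraction of the rows carry the signal for each query.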
We then show that the database our sanitizer constructs is also accurate for the family of k-juntas, using an affine shift that depends on the junta.

5.4.1 Efficient Sanitizer for Parities

To prove Theorem 5.8, we start with a sanitizer for parity predicates.³

Definition 5.9 (Parity predicate). A function χ_s: {0,1}^d → {-1, 1} is a parity predicate if there exists a vector s ∈ {0,1}^d such that χ_s(x) = (-1)^{⟨x, s⟩}. We will use wt(s) = Σ_{i=1}^d s_i to denote the number of nonzero entries of s.

Theorem 5.10. Let P be a family of parity predicates on d variables such that χ_{0^d} ∉ P. There exists an ε-differentially private sanitizer A(D, P) that runs in time poly(n, d, |P|) and produces symmetric relaxed synthetic data that is (α, β)-accurate for P when

  n ≥ (2|P|/εα) · log(2|P|/β).

The analysis of our sanitizer will make use of the following standard fact about parity predicates.

Fact 5.11. Two parity predicates χ_s, χ_t: {0,1}^d → {-1, 1} are either identical or orthogonal. Specifically, for s ≠ t, s, t ≠ 0^d, and b ∈ {-1, 1},

  E_{x ←_R {0,1}^d} [χ_s(x) | χ_t(x) = b] = E_{x ←_R {0,1}^d} [χ_s(x)] = 0.

Our sanitizer will start with noisy answers to the predicate queries χ_s(D). Each noisy answer is the true answer perturbed with noise from a Laplace distribution. The Laplace distribution Lap(σ) is the continuous distribution on R with probability density function p(y) ∝ exp(-|y|/σ). The following theorem of Dwork et al. [17] shows that these noisy queries are differentially private for an appropriate choice of σ.

Theorem 5.12 ([17]). Let (c_1, c_2, ..., c_k) be a set of predicates, let σ = k/nε, and let D ∈ ({0,1}^d)^n be a database. Then the mechanism A(D) = (c_1(D) + Z_1, c_2(D) + Z_2, ..., c_k(D) + Z_k), where Z_1, ..., Z_k are independent samples from Lap(σ), is ε-differentially private.

In order to argue about the accuracy of our mechanism, we need to know how much error is introduced by the Laplace noise. The following fact gives a bound on the tail of a Laplace random variable.

³In the preliminaries we define a predicate to be a {0,1}-valued function, but the definition naturally generalizes to {-1,1}-valued functions. For c: {0,1}^d → {-1, 1} and a database D = (x_1, ..., x_n) ∈ ({0,1}^d)^n, we define c(D) = (1/n) Σ_{i=1}^n c(x_i).
Fact 5.13. The tail of the Laplace distribution decays exponentially: Pr[|Lap(σ)| ≥ t] = exp(-t/σ).

Now we present our sanitizer for queries that are parity functions. We will not consider the query χ_{0^d}, as χ_{0^d}(x) = 1 for every x ∈ {0,1}^d. Let P be a set of parity functions that does not contain χ_{0^d}. We now present a poly(n, d, |P|)-time sanitizer for P. Our sanitizer starts by obtaining noisy estimates of the quantities χ(D) for each predicate χ ∈ P, by adding Laplace noise. Then it builds the relaxed synthetic data D̂ in blocks of rows. Each block of rows is "assigned" to contain an answer to one query χ: in that block we choose rows randomly so that the expected value of χ on each row equals the noisy estimate of χ(D). By Fact 5.11, the expected value of every other predicate χ' on the rows of this block is 0. The sanitizer is accurate so long as the total number of rows is sufficient for each value χ(D̂) to be concentrated around its expectation.

Sanitizer A(D, P), where P = (χ^(1), χ^(2), ..., χ^(t)):
  Let σ := |P|/εn and T := (2|P|/α²) log(4|P|/β)
  for all j = 1, ..., t do
    Let a_j := χ^(j)(D) + Lap(σ), truncated to [-1, 1] so that (a_j + 1)/2 is a probability
    for i = (j - 1)T + 1 to jT do
      With probability (a_j + 1)/2: let x̂_i ←_R {x ∈ {0,1}^d : χ^(j)(x) = 1}
      Otherwise: let x̂_i ←_R {x ∈ {0,1}^d : χ^(j)(x) = -1}
    end for
  end for
  return D̂ := (x̂_1, ..., x̂_{tT})

Evaluator E_P(D̂, χ): return |P| · χ(D̂).

The following claims suffice to establish Theorem 5.10.

Claim 5.14. A is ε-differentially private.

Proof. The output of A depends only on the answers to the |P| predicate queries. By Theorem 5.12, the vector of |P| predicate queries perturbed by independent samples from Lap(|P|/nε) is ε-differentially private, and differential privacy is preserved under post-processing.

Claim 5.15. A is (α, β)-accurate for P when n ≥ (2|P|/εα) · log(2|P|/β).

Proof.
We want to show that, for every χ^(j) ∈ P, the error ||P| χ^(j)(D̂) - χ^(j)(D)| is at most α, except with probability β. To do so, we consider separately the error introduced in going from χ^(j)(D) to a_j using Laplace noise, and the error introduced in going from the noisy answers a_j to random rows.

First we bound the error introduced by the noisy queries. Specifically, we show that |χ^(j)(D) - a_j| ≤ α/2 for every χ^(j) ∈ P, except with probability β/2. For each j we have

  Pr[|χ^(j)(D) - a_j| ≥ α/2] ≤ exp(-nεα/2|P|)

by Fact 5.13, so by a union bound

  Pr[∃j : |χ^(j)(D) - a_j| ≥ α/2] ≤ |P| exp(-nεα/2|P|) ≤ β/2,

so long as n ≥ (2|P|/εα) log(2|P|/β).

We also want to show that ||P| χ^(j)(D̂) - a_j| ≤ α/2 for every χ^(j) ∈ P, except with probability β/2, where a_j is the noisy answer for χ^(j)(D) computed in A(D). To do so, we first show that the expectation of |P| χ^(j)(D̂) is exactly a_j, and then use a Chernoff-Hoeffding bound⁴ to show that the random rows generated by A(D) give answers close to their expectation; finally we take a union bound over all χ^(j) ∈ P.

Fix χ^(j) ∈ P and consider χ^(j)(D̂), which is the average of n̂ = |P|·T independent ±1-valued random variables, one per row. In the rows x̂_{(j-1)T+1}, ..., x̂_{jT} of χ^(j)'s own block, each of these variables has expectation a_j, and in all other rows each has expectation 0, by Fact 5.11. Thus

  E[χ^(j)(D̂)] = E[(1/n̂) Σ_{i=1}^{n̂} χ^(j)(x̂_i)] = T a_j / n̂ = a_j / |P|.

By a Chernoff-Hoeffding bound,

  Pr[ ||P| χ^(j)(D̂) - a_j| ≥ α/2 ] < 2 exp(-T α²/2|P|),

and taking a union bound over P we conclude

  Pr[ ∃j : ||P| χ^(j)(D̂) - a_j| ≥ α/2 ] < 2|P| exp(-T α²/2|P|) ≤ β/2

by our choice of T. Combining the two bounds proves the claim.

⁴One form of the Chernoff-Hoeffding bound [13] states that if X_1, ..., X_n are independent random variables over [0,1] and X = (1/n) Σ_i X_i, then Pr[|X - E[X]| ≥ t] < 2 exp(-2nt²).

5.4.2 Efficient Sanitizer for k-Juntas

We now show how our sanitizer for parity queries can be used to give accurate answers for any family of k-juntas, for constant k. We start with the observation that k-juntas have nonzero Fourier coefficients only on sets of weight at most k. Equivalently, any k-junta can be written as a linear combination of parity functions on at most k variables (in the language of our previous construction, parities χ_s with wt(s) ≤ k).
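The Fourier expansion underlying this observation (and Fact 5.16 below) can be checked numerically with a brute-force transform, exponential in d but fine for a sketch:

```python
from itertools import product

def fourier_coeffs(c, d):
    # c_hat(s) = E_x[c(x) * chi_s(x)], by summing over all of {0,1}^d.
    pts = list(product([0, 1], repeat=d))
    def chi(s, x):
        return -1 if sum(a & b for a, b in zip(s, x)) % 2 else 1
    return {s: sum(c(x) * chi(s, x) for x in pts) / len(pts) for s in pts}
```

For the 2-junta c(x) = x_0 AND x_1 on d = 3 variables, every coefficient whose support touches the irrelevant variable x_2 vanishes, and the empty coefficient equals E[c] = 1/4.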
Thus we start by running our sanitizer for parity predicates on the set P_k containing all parity predicates on at most k variables. We have to modify the evaluation function to take into account that not every k-junta predicate has the same bias; indeed, our sanitizer cannot control χ_{0^d}(D̂), as χ_{0^d}(D̂) = 1 for any database. Thus our evaluator will apply an affine shift to the result that depends on the junta. Because the evaluator depends on the predicate, this sanitizer no longer outputs symmetric relaxed synthetic data.

The use of a sanitizer for parity queries as a building block to construct a sanitizer for k-juntas is inspired by [5], which uses a noisy vector of answers to parity queries as a building block to construct synthetic data for a particular class of k-juntas (conjunctions of k literals). However, while their sanitizer constructs a standard synthetic database and is inefficient, our construction of symmetric relaxed synthetic data for parity predicates is efficient, and thus our eventual sanitizer for k-juntas is also efficient.

Consider a predicate c: {0,1}^d → {0,1}. We can take the Fourier expansion

  c(x) = Σ_{s ∈ {0,1}^d} ĉ(s) χ_s(x),  where ĉ(s) = E_{x ←_R {0,1}^d} [c(x) χ_s(x)].

The accuracy of our sanitizer relies on the following classical fact about k-juntas.

Fact 5.16. If c: {0,1}^d → {0,1} is a k-junta, then ĉ(s) = 0 for every s with wt(s) > k.

Let P_k = {χ_s : s ∈ {0,1}^d, 1 ≤ wt(s) ≤ k}. Our sanitizer for k-juntas is A(D, P_k), where A is the sanitizer of Theorem 5.10 applied to the set P_k of parity predicates on at most k of the d variables. We define the evaluator that computes the answer to a junta query c as follows:

  Evaluator E_J(D̂, c): return |P_k| · c(D̂) - (|P_k| - 1) · ĉ(∅).

Efficiency and privacy follow from the corresponding properties of A (Theorem 5.10 and Claim 5.14), since E_J only post-processes A's output.

Theorem 5.17. A(D, P_k), with answers computed using the evaluator E_J, is (2^{k/2} α, β)-accurate for J_{d,k} when n ≥ (2|P_k|/εα) · log(2|P_k|/β).

Proof. Let c ∈ J_{d,k} be a fixed predicate. Assume that D̂ is α-accurate for P_k under the evaluator E_P, i.e., ||P_k| χ_s(D̂) - χ_s(D)| ≤ α for every χ_s ∈ P_k; this event occurs with probability at least 1 - β, by Theorem 5.10 and our assumption on n.
We now analyze the quantity c(D̂):

  c(D̂) = (1/n̂) Σ_{i=1}^{n̂} c(x̂_i)
       = (1/n̂) Σ_{i=1}^{n̂} Σ_{s ∈ {0,1}^d} ĉ(s) χ_s(x̂_i)
       = Σ_{s : wt(s) ≤ k} ĉ(s) χ_s(D̂)   (3)
       = ĉ(∅) + Σ_{s : 1 ≤ wt(s) ≤ k} ĉ(s) χ_s(D̂)   (4)
       ∈ ĉ(∅) + (1/|P_k|) Σ_{s : 1 ≤ wt(s) ≤ k} ĉ(s) (χ_s(D) ± α)   (5)
       = (1/|P_k|) ((|P_k| - 1) ĉ(∅) + c(D)) ± (α/|P_k|) Σ_{s : 1 ≤ wt(s) ≤ k} |ĉ(s)|,

where step (3) uses Fact 5.16, step (4) uses χ_{0^d}(x) = 1 for every x, and step (5) uses the assumption that D̂ is α-accurate for P_k. Applying the evaluator E_J therefore gives

  E_J(D̂, c) = |P_k| c(D̂) - (|P_k| - 1) ĉ(∅) ∈ c(D) ± α Σ_{s : 1 ≤ wt(s) ≤ k} |ĉ(s)| ⊆ c(D) ± 2^{k/2} α,

where the last step uses Σ_s |ĉ(s)| ≤ 2^{k/2}, which follows from Parseval's identity and the Cauchy-Schwarz inequality, since c has at most 2^k nonzero Fourier coefficients. Thus D̂ = A(D) is 2^{k/2}α-accurate for J_{d,k} using E_J with probability at least 1 - β.

Acknowledgments

We thank Boaz Barak, Irit Dinur, Cynthia Dwork, Vitaly Feldman, Oded Goldreich, Johan Håstad, Valentine Kabanets, Dana Moshkovitz, Anup Rao, Guy Rothblum, and Le… for helpful conversations.

References

[1] ADAM, N. R., AND WORTMANN, J. Security-control methods for statistical databases: A comparative study. ACM Computing Surveys 21 (1989), 515–556.

[2] ALEKHNOVICH, M., BRAVERMAN, M., FELDMAN, V., KLIVANS, A. R., AND PITASSI, T. The complexity of properly learning simple concept classes. J. Comput. Syst. Sci. 74, 1 (2008), 16–34.

[3] ARORA, S., LUND, C., MOTWANI, R., SUDAN, M., AND SZEGEDY, M. Proof verification and the hardness of approximation problems. J. ACM 45, 3 (1998), 501–555.

[4] BABAI, L., FORTNOW, L., LEVIN, L. A., AND SZEGEDY, M. Checking computations in polylogarithmic time. In STOC (1991), pp. 21–31.

[5] BARAK, B., CHAUDHURI, K., DWORK, C., KALE, S., MCSHERRY, F., AND TALWAR, K. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of the ACM Symposium on Principles of Database Systems (2007), pp. 273–282.

[6] BARAK, B., AND GOLDREICH, O. Universal arguments and their applications. SIAM J. Comput. 38 (2008), 1661–1694.

[7] BEN-SASSON, E., GOLDREICH, O., HARSHA, P., SUDAN, M., AND VADHAN, S. P. Robust PCPs of proximity, shorter PCPs, and applications to coding. SIAM J. Comput. 36 (2006), 889–974.

[8] BLUM, A., DWORK, C., MCSHERRY, F., AND NISSIM, K.
Practical privacy: The SuLQ framework. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (June 2005).
[9] BLUM, A., LIGETT, K., AND ROTH, A. A learning theory approach to non-interactive database privacy. In Proceedings of the 40th ACM SIGACT Symposium on Theory of Computing (2008).
[10] CREIGNOU, N. A dichotomy theorem for maximum generalized satisfiability problems. J. Comput. Syst. Sci. 51 (1995), 511–522.
[11] DINUR, I., AND NISSIM, K. Revealing information while preserving privacy. In Proceedings of the Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (2003), pp. 202–210.
[12] DINUR, I., AND REINGOLD, O. Assignment testers: Towards a combinatorial proof of the PCP theorem. SIAM J. Comput. 36 (2006), 975–1024.
[13] DUBHASHI, D. P., AND SEN, S. Concentration of measure for randomized algorithms: applications. In Handbook of Randomized Algorithms, 2001.
[14] DUNCAN, G. International Encyclopedia of the Social and Behavioral Sciences. Elsevier, 2001, ch. Confidentiality and statistical disclosure limitation.
[15] DWORK, C. A firm foundation for private data analysis. Communications of the ACM (to appear).
[16] DWORK, C. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP)(2) (2006), pp. 1–12.
[17] DWORK, C., MCSHERRY, F., NISSIM, K., AND SMITH, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference (2006), pp. 265–284.
[18] DWORK, C., NAOR, M., REINGOLD, O., ROTHBLUM, G., AND VADHAN, S. When and how can privacy-preserving data release be done efficiently? In Proceedings of the 41st ACM Symposium on Theory of Computing (STOC) (2009).
[19] DWORK, C., AND NISSIM, K. Privacy-preserving datamining on vertically partitioned databases. In Proceedings of CRYPTO 2004 (2004), vol. 3152, pp. 528–544.
[20] DWORK, C., ROTHBLUM, G., AND VADHAN, S. P. Boosting and differential privacy. In Proceedings of FOCS 2010 (2010).
[21] EVFIMIEVSKI, A., AND GRANDISON, T. Encyclopedia of Database Technologies and Applications. Information Science Reference, 2006, ch. Privacy Preserving Data Mining (a short survey).
[22] FELDMAN, V. Hardness of proper learning. In The Encyclopedia of Algorithms. Springer-Verlag, 2008.
[23] FELDMAN, V. Hardness of approximate two-level logic minimization and PAC learning with membership queries. Journal of Computer and System Sciences 75, 1 (2009), 13–26.
[24] GOLDREICH, O. Foundations of Cryptography, vol. 2. Cambridge University Press, 2004.
[25] HÅSTAD, J. Some optimal inapproximability results. J. ACM 48 (2001), 798–859.
[26] KEARNS, M. J., AND VALIANT, L. G. Cryptographic limitations on learning Boolean formulae and finite automata. J. ACM 41 (1994), 67–95.
[27] KHANNA, S., SUDAN, M., TREVISAN, L., AND WILLIAMSON, D. P. The approximability of constraint satisfaction problems. SIAM J. Comput. 30 (2000), 1863–1920.
[28] KILIAN, J. A note on efficient zero-knowledge proofs and arguments (extended abstract). In STOC (1992).
[29] MICALI, S. Computationally sound proofs. SIAM J. Comput. 30 (2000), 1253–1298.
[30] NAOR, M., AND YUNG, M. Universal one-way hash functions and their cryptographic applications. In STOC (1989), pp. 33–43.
[31] PAPADIMITRIOU, C. H., AND YANNAKAKIS, M. Optimization, approximation, and complexity classes. J. Comput. Syst. Sci. 43 (1991), 425–440.
[32] PITT, L., AND VALIANT, L. G. Computational limitations on learning from examples. J. ACM 35 (1988), 965–984.
[33] REITER, J. P., AND DRECHSLER, J. Releasing multiply-imputed synthetic data generated in two stages to protect confidentiality. IAB discussion paper, Institut für Arbeitsmarkt- und Berufsforschung (IAB), Nürnberg (Institute for Employment Research, Nuremberg, Germany), 2007.
[34] ROMPEL, J. One-way functions are necessary and sufficient for secure signatures. In STOC (1990), pp. 387–394.
[35] ROTH, A., AND ROUGHGARDEN, T. Interactive privacy via the median mechanism. In STOC (2010).
[36] VALIANT, L. G. A theory of the learnable. Communications of the ACM 27, 11 (1984), 1134–1142.