PCPs and the Hardness of Generating Synthetic Data

Electronic Colloquium on Computational Complexity, Revision 2 of Report No. 17 (2010)
Jonathan Ullman*        Salil Vadhan†
School of Engineering and Applied Sciences &
Center for Research on Computation and Society
Harvard University, Cambridge, MA
{jullman,salil}@seas.harvard.edu

January 6, 2011
Abstract

Assuming the existence of one-way functions, we show that there is no polynomial-time, differentially private algorithm A that takes a database D ∈ ({0,1}^d)^n and outputs a "synthetic database" D̂, all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x ∈ {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS '07), who gave an algorithm running in time poly(n, 2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC '09) with encodings based on the PCP Theorem.

We also present both negative and positive results for generating "relaxed" synthetic data, where the fraction of rows in D satisfying a predicate c is estimated by applying c to each row of D̂ and aggregating the results in some way.

Keywords: privacy, digital signatures, inapproximability, constraint satisfaction problems, probabilistically checkable proofs

ISSN 1433-8092

* http://seas.harvard.edu/ ˜ jullman. Supported by NSF grant CNS-0831289.
† http://seas.harvard.edu/ ˜ salil. Supported by NSF grant CNS-0831289.
1 Introduction
There are many settings in which it is desirable to share information about a database that contains
sensitive information about individuals. For example, doctors may want to share information about
health records with medical researchers, the federal government may want to release census data
for public information, and a company like Netflix may want to provide its movie rental database for a
public competition to develop a better recommendation system. However, it is important to do this in a way that preserves the "privacy" of the individuals whose records are in the database. This privacy problem has been studied by statisticians and the database security community for a number of years (cf. [1, 14, 21]), and recently the theoretical computer science community has developed an appealing new approach to the problem, known as differential privacy (see the surveys [16, 15]).
Differential Privacy. A randomized algorithm A is defined to be differentially private [17] if for every two databases D = (x_1, …, x_n) and D′ = (x′_1, …, x′_n) that differ on exactly one row, the distributions A(D) and A(D′) are "close" to each other. Formally, we require that A(D) and A(D′) assign the same probability mass to every event, up to a multiplicative factor of e^ε ≈ 1 + ε, where ε is typically taken to be a small constant. (In addition to this multiplicative factor, it is often allowed to also let the probabilities differ by a negligible additive term.) This captures the idea that no individual's data has a significant influence on the output of A (provided that data about an individual is confined to one or a few rows of the database). Differential privacy has several nice properties lacking in previous notions, such as being agnostic to the adversary's prior information and degrading smoothly under composition.
With this model of privacy, the goal becomes to design algorithms A that simultaneously meet the above privacy guarantee and give "useful" information about the database. For example, we may have a true query function c in which we're interested, and the goal is to design A that is differentially private (with ε as small as possible) and estimates c well (e.g. the error |A(D) − c(D)| is small with high probability). For example, if c(D) is the fraction of database rows that satisfy some property — a counting query — then it is known that we can take A(D) to equal c(D) plus random Laplacian noise with standard deviation O(1/(εn)), where n is the number of rows in the database and ε is the measure of differential privacy [8]. A sequence of works [11, 19, 8, 17] has provided a very good understanding of differential privacy in an interactive model in which real-valued queries c are made and answered one at a time. The amount of noise that one needs when responding to a query c should be based on the sensitivity of c, as well as the total number of queries answered so far.
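To make the Laplace-noise approach concrete, here is a minimal sketch (our own illustration, not code from the paper) of answering a single counting query with ε-differential privacy; the toy database and parameter choices are assumptions made for the example.

```python
import math
import random

def counting_query(db, c):
    """c(D): the fraction of database rows satisfying predicate c."""
    return sum(1 for row in db if c(row)) / len(db)

def laplace_noise(scale):
    """Sample Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(db, c, epsilon):
    """epsilon-differentially private estimate of c(D).

    A counting query has sensitivity 1/n (changing one row moves the
    answer by at most 1/n), so Laplace noise of scale 1/(epsilon * n),
    i.e. standard deviation O(1/(epsilon * n)), suffices.
    """
    n = len(db)
    return counting_query(db, c) + laplace_noise(1.0 / (epsilon * n))

# Toy database of d = 3 binary attributes; query: fraction with first bit set.
db = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 0)]
noisy = laplace_mechanism(db, lambda x: x[0] == 1, epsilon=1.0)
```

Note that the noise scale shrinks as n grows, so for large databases the private answer is close to the true count with high probability.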
However, for many applications, it would be more attractive to do a noninteractive data release,
where we compute and release a single, differentially private “summary” of the database that
enables others to determine accurate answers to a large class of queries. What form should this
summary take? The most appealing form would be a synthetic database, which is a new database D̂ = A(D) whose rows are "fake" but come from the same universe as those of D and are guaranteed to share many statistics with those of D (up to some accuracy). Some advantages of synthetic data are that it can be easily understood by humans, and statistical software can be run directly on it without modification. For example, these considerations led the German Institute for Employment Research to adopt synthetic databases for the release of employment statistics [33].
Previous Results on Synthetic Data. The first result on producing differentially private synthetic data came in the work of Barak et al. [5]. Given a database D consisting of n rows from {0,1}^d, they show how to construct a differentially private synthetic database D̂, also of n rows from {0,1}^d, in which the full "contingency table," consisting of all conjunctive counting queries, is approximately preserved. That is, for every conjunction c(x) = x_{i_1} ∧ x_{i_2} ∧ … ∧ x_{i_k} for i_1, …, i_k ∈ [d], the fraction of rows in D̂ that satisfy c equals the fraction of rows in D that satisfy c up to an additive error of 2^{O(d)}/n. The running time of their algorithm is poly(n, 2^d), which is feasible for small values of d. They pose as an open problem whether the running time of their algorithm can be improved for the case where we only want to preserve the k-way marginals for small k (e.g. k = 2). These are the counting queries corresponding to conjunctions of up to k literals. Indeed, there are only O(d^k) such conjunctions, and we can produce differentially private estimates for all the corresponding counting queries in time poly(n, d^k) by just adding noise O(d^k/(εn)) to each one. Moreover, a version of the Barak et al. algorithm [5] can ensure that even these noisy answers are consistent with a real database.
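For concreteness, the full set of 2-way marginals is small and easy to enumerate exactly; the following sketch (our own illustration, with hypothetical helper names) computes all O(d^2) of them before any noise is added.

```python
from itertools import combinations, product

def two_way_marginals(db, d):
    """All 2-way marginals of a binary database.

    For each pair of columns (j, j2) with j < j2 and each pair of values
    (b, b2), compute the fraction of rows x with x[j] == b and x[j2] == b2.
    There are only C(d, 2) * 4 = O(d^2) such queries.
    """
    n = len(db)
    marginals = {}
    for (j, j2) in combinations(range(d), 2):
        for (b, b2) in product((0, 1), repeat=2):
            count = sum(1 for x in db if x[j] == b and x[j2] == b2)
            marginals[(j, j2, b, b2)] = count / n
    return marginals

# Toy database with d = 3 binary attributes.
db = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0)]
m = two_way_marginals(db, d=3)
# m[(0, 1, 1, 0)] is the fraction of rows with x[0] = 1 and x[1] = 0.
```

A private summary would add independent noise to each of these values; the point of Theorem 1.1 below is that turning such noisy values back into a synthetic database is what becomes infeasible.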
A more general and dramatic illustration of the potential expressiveness of synthetic data came in the work of Blum, Ligett, and Roth [9]. They show that for every class C = {c: {0,1}^d → {0,1}} of predicates, there is a differentially private algorithm A that produces a synthetic database D̂ = A(D) such that all counting queries corresponding to predicates in C are preserved to within an accuracy of Õ((d · log(|C|)/n)^{1/3}), with high probability. In particular, with n = poly(d), the synthetic data can provide simultaneous accuracy for an exponential-sized family of queries (e.g. |C| = 2^{poly(d)}). Unfortunately, the running time of the BLR mechanism is also exponential in d.
Dwork et al. [18] gave evidence that the large running time of the BLR mechanism is inherent. Specifically, assuming the existence of one-way functions, they exhibit an efficiently computable family C of predicates (e.g. consisting of circuits of size d) for which it is infeasible to produce a differentially private synthetic database preserving the counting queries corresponding to C (for databases of any n = poly(d) number of rows). For non-synthetic data, they show a close connection between the infeasibility of producing a differentially private summarization and the existence of efficient "traitor-tracing schemes." However, these results leave open the possibility that for natural families of counting queries (e.g. those corresponding to conjunctions), producing a differentially private synthetic database (or non-synthetic summarization) can be done efficiently. Indeed, one may have gained optimism by analogy with the early days of computational learning theory, where one-way functions were used to show hardness of learning arbitrary efficiently computable concepts, but natural subclasses (like conjunctions) were found to be learnable [36].
Our Results. We prove that it is infeasible to produce synthetic databases preserving even very simple counting queries, such as 2-way marginals:

Theorem 1.1. Assuming the existence of one-way functions, there is a constant α > 0 such that for every polynomial p, there is no polynomial-time, differentially private algorithm A that takes a database D ∈ ({0,1}^d)^{p(d)} and produces a synthetic database D̂ ∈ ({0,1}^d)^* such that |c(D) − c(D̂)| ≤ α for all 2-way marginals c. (Recall that a 2-way marginal c(D) computes the fraction of database rows satisfying a conjunction of two literals, i.e. the fraction of rows x_i ∈ {0,1}^d such that x_i(j) = b and x_i(j′) = b′ for some columns j, j′ ∈ [d] and values b, b′ ∈ {0,1}.)

In fact, our impossibility result extends from conjunctions of 2 literals to any family of constant-arity predicates that contains a predicate depending on at least two variables, such as parities of 3 literals.
As mentioned earlier, all 2-way marginals can be easily summarized with non-synthetic data (by just adding noise to each of the O(d^2) values). Thus, our result shows that requiring a synthetic database may severely constrain what sorts of differentially private data releases are possible. (Dwork et al. [18] also showed that there exists a poly(d)-sized family of counting queries that is hard to summarize with synthetic data, thereby separating synthetic data from non-synthetic data. Our contribution is to show that such a separation holds for a very simple and natural family of predicates, namely 2-way marginals.)

(Technically, the "real database" in the preceding discussion may assign fractional weight to some rows.)
This separation between synthetic data and non-synthetic data seems analogous to the
separations between proper and improper learning in computational learning theory [32, 22], where it
is infeasible to learn certain concept classes if the output hypothesis is constrained to come from the
same representation class as the concept, but it becomes feasible if we allow the output hypothesis
to come from a different representation class. This gives hope for designing efficient, differentially
private algorithms that take a database and produce a compact summary of it that is not synthetic
data but somehow can be used to accurately answer exponentially many questions about the
original database (e.g. all marginals). The negative results of [18] on non-synthetic data (assuming
the existence of efficient traitor-tracing schemes) do not say anything about natural classes of
counting queries, such as marginals.
To bypass the complexity barrier stated in Theorem 1.1, it may not be necessary to introduce exotic data representations; some mild generalizations of synthetic data may suffice. For example, several recent algorithms [9, 35, 20] produce several synthetic databases, with the guarantee that the median answer over these databases is approximately accurate. More generally, we can consider summarizations of a database D that consist of a collection D̂ of rows from the same universe as the original database, and where we estimate c(D) by applying the predicate c to each row of D̂ and then aggregating the results via some aggregation function f. With standard synthetic data, f is simply the average, but we may instead allow f to take a median of averages, or apply an affine shift to the average. For such relaxed synthetic data, we prove the following results:
- There is a constant k such that counting queries corresponding to k-juntas (functions depending on at most k variables) cannot be accurately and privately summarized as relaxed synthetic data with a median-of-averages aggregator, or with a symmetric and monotone aggregator (that is independent of the predicate c being queried).

- For every constant k, counting queries corresponding to k-juntas can be accurately and privately summarized as relaxed synthetic data with an aggregator that applies an affine shift to the average (where the shift does depend on the predicate being queried).
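The aggregators discussed above can be made concrete with a short sketch (our own illustration; the function names and toy data are assumptions, not code from the paper): standard synthetic data averages c over one block of rows, while relaxed synthetic data may take a median of per-block averages or an affine shift of the average.

```python
from statistics import median

def averages(blocks, c):
    """Apply predicate c to every row of each synthetic block and average."""
    return [sum(1 for x in block if c(x)) / len(block) for block in blocks]

def median_of_averages(blocks, c):
    """Relaxed synthetic data: the median of the per-block averages."""
    return median(averages(blocks, c))

def affine_shifted_average(block, c, a, b):
    """Relaxed synthetic data: an affine shift a * avg + b of the average
    over a single block; the shift (a, b) may depend on the predicate c."""
    avg = sum(1 for x in block if c(x)) / len(block)
    return a * avg + b
```

In both cases the rows of the summary still come from the data universe {0,1}^d; only the aggregation step f differs from plain averaging.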
Techniques. Our proof of Theorem 1.1 and our other negative results are obtained by combining the hard-to-sanitize databases of Dwork et al. [18] with PCP reductions. They construct a database consisting of valid message-signature pairs (m_i, σ_i) under a digital signature scheme, and argue that any differentially private sanitizer that preserves accuracy for the counting query associated with the signature-verification predicate can be used to forge valid signatures. We replace each message-signature pair (m_i, σ_i) with a PCP encoding π_i that proves that (m_i, σ_i) satisfies the signature verification algorithm. We then argue that if accuracy is preserved for a large fraction of the (constant-arity) constraints of the PCP verifier, then we can "decode" the PCP either to violate privacy (by recovering one of the original message-signature pairs) or to forge a signature (by producing a new message-signature pair).
We remark that error-correcting codes were already used in [18] for the purpose of producing a
fixed polynomial-sized set of counting queries that can be used for all verification keys. Our
observation is that by using PCP encodings, we can reduce not only the number of counting queries
in consideration, but also their computational complexity.
Our proof has some unusual features among PCP-based hardness results:
As far as we know, this is the first time that PCPs have been used in conjunction with cryptographic
assumptions for a hardness result. (They have been used together for positive results regarding
computationally sound proof systems [28, 29, 6].) It would be interesting to see if such a combination
could be useful in, say, computational learning theory (where PCPs have been used for hardness of
“proper” learning [2, 23] and cryptographic assumptions for hardness of representation-independent
learning [36, 26]).
While PCP-based inapproximability results are usually stated as Karp reductions, we actually
need them to be Levin reductions — capturing that they are reductions between search problems,
and not just decision problems. (Previously, this property has been used in the same results on
computationally sound proofs mentioned above.)
2 Preliminaries

2.1 Sanitizers

Let a database D ∈ ({0,1}^d)^n be a matrix of n rows, x_1, …, x_n ∈ {0,1}^d, corresponding to people, each of which contains d binary attributes. A sanitizer A: ({0,1}^d)^n → R takes a database and outputs some data structure in R. In the case where R = ({0,1}^d)^n̂ (an n̂-row database), we say that A outputs a synthetic database.
We would like such sanitizers to be both private and accurate. In particular, the notion of privacy we are interested in is as follows.

Definition 2.1 (Differential Privacy [17]). A sanitizer A: ({0,1}^d)^n → R is (ε, δ)-differentially private if for every two databases D_1, D_2 ∈ ({0,1}^d)^n that differ on exactly one row, and every subset S ⊆ R,

  Pr[A(D_1) ∈ S] ≤ e^ε · Pr[A(D_2) ∈ S] + δ.
In the case where δ = 0 we say that A is ε-differentially private.

Since a sanitizer that always outputs 0 satisfies Definition 2.1, we also need to define what it means for a sanitizer to be accurate. In this paper we consider accuracy with respect to counting queries. Consider a set C consisting of boolean predicates c: {0,1}^d → {0,1}, which we call a concept class. Then each predicate c induces a counting query that on database D = (x_1, …, x_n) ∈ ({0,1}^d)^n returns

  c(D) = (1/n) · Σ_{i=1}^n c(x_i).

If the output of A is a synthetic database D̂ ∈ ({0,1}^d)^n̂, then c(A(D)) is simply the fraction of rows of D̂ that satisfy the predicate c. If A outputs a data structure that is not a synthetic database, then we require an evaluator function E: R × C → ℝ that estimates c(D) from the output of A. For example, A may output a vector Z = (c(D) + Z_c)_{c∈C}, where (Z_c)_{c∈C} are random variables, and E(Z, c) returns the c-th component of Z ∈ R = ℝ^{|C|}. Abusing notation, we will write c(A(D)) for E(A(D), c). We will say that A is accurate for a concept class C if the fractional counts c(A(D)) are close to the fractional counts c(D).

Definition 2.2 (Accuracy). An output Z of sanitizer A(D) is α-accurate for a concept class C if

  ∀ c ∈ C, |c(Z) − c(D)| ≤ α.

A sanitizer A is (α, β)-accurate for a concept class C if for every database D,

  Pr_{A's coins}[∀ c ∈ C, |c(A(D)) − c(D)| ≤ α] ≥ 1 − β.
In this paper we say f(n) = negl(n), and say that f(n) is negligible, if f(n) = o(n^{−c}) for every c > 0. We use |s| to denote the length of the string s, and s_1 ‖ s_2 to denote the concatenation of s_1 and s_2.
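Definition 2.2 is straightforward to check mechanically when the output is a synthetic database; the following sketch (our own illustration, with hypothetical names) tests α-accuracy of D̂ against D for a finite concept class.

```python
def counting_query(db, c):
    """c(D): the fraction of rows of the database satisfying predicate c."""
    return sum(1 for x in db if c(x)) / len(db)

def is_accurate(D, D_hat, concepts, alpha):
    """True iff the synthetic database D_hat is alpha-accurate for the
    concept class `concepts` with respect to D (Definition 2.2)."""
    return all(abs(counting_query(D_hat, c) - counting_query(D, c)) <= alpha
               for c in concepts)
```

Note that D̂ may have a different number of rows than D; only the fractional counts matter.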
2.2 Hardness of Sanitizing
Differential privacy is a very strong notion of privacy, so it is common to look for hardness results
that rule out weaker notions of privacy. These hardness results show that every sanitizer must be
“blatantly non-private” in some sense. In this paper our notion of blatant non-privacy roughly states
that there exists an efficient adversary who can find a row of the original database using only the
output from any efficient sanitizer. Such definitions are also referred to as “row non-privacy.” We
define hardness-of-sanitization with respect to a particular concept class, and want to exhibit a
distribution on databases for which it would be infeasible for any efficient sanitizer to give accurate
output without revealing a row of the database. Specifically, following [18], we define the following
notions
Definition 2.3 (Database Distribution Ensemble). Let D = D_d be an ensemble of distributions on d-column databases with n + 1 rows, D_0 ∈ ({0,1}^d)^{n+1}. Let (D, D′, i) ←_R D̃ denote the experiment in which we choose D_0 ←_R D and i ∈ [n] uniformly at random, and set D to be the first n rows of D_0 and D′ to be D with the i-th row replaced by the (n+1)-st row of D_0.

Definition 2.4 (Hard-to-sanitize Distribution). Let C be a concept class, α ∈ [0,1] be a parameter, and D = D_d be a database distribution ensemble. The distribution D is (α, C)-hard-to-sanitize if there exists an efficient adversary T such that for any alleged polynomial-time sanitizer A the following two conditions hold:

1. Whenever A(D) is α-accurate, then T(A(D)) outputs a row of D:

  Pr_{(D,D′,i)←_R D̃, A's and T's coins}[(A(D) is α-accurate for C) ∧ (T(A(D)) ∩ D = ∅)] ≤ negl(d).

2. For every efficient sanitizer A, T cannot extract x_i from the database D′, where x_i is the i-th row of D:

  Pr_{(D,D′,i)←_R D̃, A's and T's coins}[T(A(D′)) = x_i] ≤ negl(d).

In [18], it was shown that every distribution that is hard to sanitize in the sense of Definition 2.4 is also hard to sanitize while achieving weaker privacy guarantees:

Claim 2.5 ([18]). If a distribution ensemble D = D_d on n(d)-row databases is (α, C)-hard-to-sanitize, then for every constant a > 0 and every β = β(d) ≤ 1 − 1/poly(d), no polynomial-time sanitizer that is (α, β)-accurate with respect to C can achieve (a · log(n), (1 − 8β)/2n^{1+a})-differential privacy. In particular, for all constants ε, β > 0, no polynomial-time sanitizer can achieve both (α, β)-accuracy and (ε, negl(n))-differential privacy.
We could use a weaker definition of hard-to-sanitize distributions, which would still suffice to rule out differential privacy, that only requires that for every efficient A, there exists an adversary T that almost always extracts a row of D from every α-accurate output of A(D). In our definition we require that there exists a fixed adversary T that almost always extracts a row of D from every α-accurate output of any efficient A. Reversing the quantifiers in this fashion only makes our negative results stronger.

In this paper we are concerned with sanitizers that output synthetic databases, so we will
relax Definition 2.4 by restricting the quantification over sanitizers to only those sanitizers that output
synthetic data.
Definition 2.6 (Hard-to-sanitize Distribution as Synthetic Data). A database distribution ensemble Dis
( ;C)-hard-to-sanitize as synthetic data if the conditions of Definition 2.4 hold for every sanitizer Athat
outputs a synthetic database.
3 Relationship with Hardness of Approximation
The objective of a privacy-preserving sanitizer is to reveal some properties of the underlying
database without giving away enough information to reconstruct that database. This requirement has
different implications for sanitizers that produce synthetic databases and those with arbitrary output.
The SuLQ framework of [8] is a well-studied, efficient technique for achieving (ε, δ)-differential privacy with non-synthetic output. To get accurate, private output for a family of counting queries with predicates in C, we can release a vector of noisy counts (c(D) + Z_c)_{c∈C}, where the random variables (Z_c)_{c∈C} are drawn independently from a distribution suitable for preserving privacy (e.g. a Laplace distribution with standard deviation O(|C|/(εn))).

Consider the case of an n-row database D that contains satisfying assignments to a 3CNF formula φ, and suppose our concept class includes all disjunctions on three literals (or, equivalently, all conjunctions on three literals). Then the technique above releases a set of noisy counts that describes a database in which every clause of φ is satisfied by most of the rows of D. However, sanitizers with synthetic-database output are required to produce a database that consists of rows that satisfy most of the clauses of φ.
Because of the noise added to the output, the requirement of a synthetic database does not
strictly force the sanitizer to find a satisfying assignment for the given 3CNF. However, it is known to
be NP-hard to find even approximate satisfying assignments for many constraint satisfaction
problems. In our main result, Theorem 4.4, we will show that there exists a distribution over
databases that is hard-to-sanitize with respect to synthetic data for any concept class that is
sufficient to express a hard-to-approximate constraint satisfaction problem.
3.1 Hard to Approximate CSPs
We define a constraint satisfaction problem to be the following. Definition 3.1 (Constraint Satisfaction
Problem (CSP)). For a function q= q(d) d, a family of q(d)-CSPs, denoted = (d)d2N, is a sequence
of sets of boolean predicates on q(d) variables. If q(d) and dd(d) do not depend on dthen we refer to
as a fixed family of q-CSPs. For every d q(d), let Cbe the class consisting of all predicates c :
f0;1g!R of the form c(u1;:::;ud) = (ui1;:::;uiq (d)) for some 2dand i1;:::;iq(d)d2[d]. We call C1 d=0= [(d) (d) the
class of constraints of . Finally, we say a multiset ’ Cis a d-variable instance of CCand each ’i2’ is a
constraint of ’.
6
We say that an assignment u ∈ {0,1}^d satisfies the constraint φ_i if φ_i(u) = 1. For φ = {φ_1, …, φ_m}, define

  val(φ, u) = (1/m) · Σ_{i=1}^m φ_i(u)   and   val(φ) = max_{u ∈ {0,1}^d} val(φ, u).

Our hardness results will apply to concept classes C_Γ for CSP families with certain additional properties. Specifically, we define:

Definition 3.2 (Nice CSP). A family Γ = (Γ^{(d)})_{d∈N} of q(d)-CSPs is nice if

1. q(d) = d^{o(1)}, and

2. for every d ∈ N, Γ^{(d)} contains a non-constant predicate φ: {0,1}^{q(d)} → {0,1}. Moreover, φ and two assignments u_0, u_1 ∈ {0,1}^{q(d)} such that φ(u_0) = 0 and φ(u_1) = 1 can be found in time poly(d).
We note that any fixed family of q-CSP that contains a non-constant predicate is a nice CSP.
Indeed, these CSPs (e.g. conjunctions of 2 literals) are the main application of interest for our
results. However it will sometimes be useful to work with generalizations to nice CSPs with
predicates of non-constant arity.
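The quantities val(φ, u) and val(φ) can be computed by brute force for small instances; the following sketch (our own toy example, with constraints represented as Python predicates) mirrors the definitions above.

```python
from itertools import product

def val_assignment(phi, u):
    """val(phi, u) = (1/m) * sum_i phi_i(u): the fraction of constraints
    of the instance phi satisfied by assignment u."""
    return sum(1 for constraint in phi if constraint(u)) / len(phi)

def val(phi, d):
    """val(phi) = max over u in {0,1}^d of val(phi, u), by brute force."""
    return max(val_assignment(phi, u) for u in product((0, 1), repeat=d))

# A 3-variable instance whose constraints are conjunctions of 2 literals.
phi = [lambda u: u[0] and u[1],
       lambda u: u[1] and not u[2],
       lambda u: u[0] and u[2]]
```

In this instance the second and third constraints conflict on u[2], so no assignment satisfies all three and val(phi) = 2/3.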
For our hardness result, we will need to consider a strong notion of hard constraint satisfaction problems, which is related to probabilistically checkable proofs. First we recall the standard notion of hardness of approximation under Karp reductions (stated for additive, rather than multiplicative, approximation error).
Definition 3.3 (Inapproximability under Karp reductions). For functions α, γ: N → [0,1], a family of CSPs Γ = (Γ^{(d)})_{d∈N} is (α, γ)-hard-to-approximate under Karp reductions if there exists a polynomial-time computable function R such that for every circuit C with input of size d, if we set φ_C = R(C) ⊆ C_Γ^{(d′)} for some d′ = poly(d), then

1. if C is satisfiable, then val(φ_C) ≥ γ(d), and

2. if C is unsatisfiable, then val(φ_C) < γ(d) − α(d).

For our hardness result, we will need a stronger notion of inapproximability, which says that we can efficiently transform satisfying assignments of C into solutions to φ_C of high value, and vice-versa.

Definition 3.4 (Inapproximability under Levin reductions). For functions α, γ: N → [0,1], a family of CSPs Γ is (α, γ)-hard-to-approximate under Levin reductions if there exist polynomial-time computable functions R, Enc, Dec such that for every circuit C with input of size d, if we set φ_C = R(C) ⊆ C_Γ^{(d′)} for some d′ = poly(d), then

1. for every u ∈ {0,1}^d such that C(u) = 1, val(φ_C, Enc(u, C)) ≥ γ(d),

2. for every π ∈ {0,1}^{d′} such that val(φ_C, π) ≥ γ(d) − α(d), C(Dec(π, C)) = 1, and

3. for every u ∈ {0,1}^d, Dec(Enc(u, C), C) = u.

When we do not wish to specify the value γ, we will simply say that the family is α-hard-to-approximate under Levin reductions to indicate that there exists such a γ ∈ (α, 1]. If we drop the requirement that R is efficiently computable, then we say that Γ is (α, γ)-hard-to-approximate under inefficient Levin reductions.

The notation Enc, Dec reflects the fact that we think of the set of assignments π such that val(φ_C, π) ≥ γ as a sort of error-correcting code on the satisfying assignments to C. Any π with value close to γ can be decoded to a valid satisfying assignment.
Levin reductions are a stronger notion of reduction than Karp reductions. To see this, let Γ be α-hard-to-approximate under Levin reductions, and let R, Enc, Dec be the functions described in Definition 3.4. We now argue that for every circuit C, the instance φ_C = R(C) satisfies conditions 1 and 2 of Definition 3.3. Specifically, if there exists an assignment u ∈ {0,1}^d that satisfies C, then Enc(u, C) satisfies at least a γ fraction of the constraints of φ_C. Conversely, if any assignment π ∈ {0,1}^{d′} satisfies at least a γ − α fraction of the constraints of φ_C, then Dec(π, C) is a satisfying assignment of C.

Variants of the PCP Theorem can be used to show that essentially every class of CSP is hard-to-approximate in this sense. We restrict to CSPs that are closed under complement, as this suffices for our application.
Theorem 3.5 (variant of PCP Theorem). For every fixed family Γ of CSPs that is closed under negation and contains a function that depends on at least two variables, there is a constant α = α(Γ) > 0 such that Γ is α-hard to approximate under Levin reductions.
Proof sketch. Hardness under Karp reductions follows directly from the classification theorems of Creignou [10] and Khanna et al. [27]. These theorems show that all CSPs are either α-hard under Karp reductions for some constant α > 0 or can be solved optimally in polynomial time. By inspection, the only CSPs that fall into the polynomial-time cases (0-valid, 1-valid, and 2-monotone) and are closed under negation are those containing only dictatorships and constant functions.
The fact that standard PCPs actually yield Levin reductions has been explicitly discussed and
formalized by Barak and Goldreich [6] in the terminology of PCPs rather than reductions (the
function Enc is called “relatively efficient oracle-construction” and the function Dec is called “a
proof-of-knowledge property”). They verify that these properties hold for the PCP construction of
Babai et al. [4], whereas we need it for PCPs of constant query complexity. While the properties
probably hold for most (if not all) existing PCP constructions, the existence of the efficient “decoding”
function Dec requires some verification. We observe that it follows as a black box from the PCPs of Proximity of [7, 12]. There, a prefix of the PCP (the "implicit input oracle") can be taken to be the encoding of a satisfying assignment of the circuit C in an efficiently decodable error-correcting code. If the PCP verifier accepts with higher probability than the soundness error s, then it is guaranteed that the prefix is close to a valid codeword, which in turn can be decoded to a satisfying assignment. By the correspondence between PCPs and CSPs [3], this yields a CSP (with constraints of constant arity) that is α-hard to approximate under Levin reductions for some constant α > 0 (and γ = 1). The sequence of approximation-preserving reductions from arbitrary CSPs to MAX-CUT [31] can be
verified to preserve efficiency of decoding (indeed, the correctness of the reductions is proven by
specifying how to encode and decode). Finally, the reductions of [27] from MAX-CUT to any other
CSP all involve constant-sized “gadgets” that allow encoding and decoding to be done locally and
very efficiently.
It seems likely that optimized PCP/inapproximability results (like [25]) are also Levin reductions, which would yield fairly large values of α for natural CSPs (e.g. α = 1/8 if Γ contains all conjunctions of 3 literals, because then C_Γ contains MAX 3-SAT). For some of our results we will need CSPs that are very hard to approximate (under possibly inefficient reductions), which we can obtain by "sequential repetition" of constant-error PCPs.
Theorem 3.6. There is a constant C such that for every γ = γ(d) > 0, the constraint family Γ = (Γ^{(d)})_{d∈N} of k(d)-clause 3-CNF formulas is (1 − γ(d), 1)-hard-to-approximate under inefficient Levin reductions, for k(d) = C · log(1/γ(d)).
Proof sketch. As in the proof of Theorem 3.5, disjunctions of 3 literals are (1 − η, 1)-hard-to-approximate under Levin reductions for some constant η, 0 < η < 1. By taking k(d) = O(log(1/γ(d))) sequential repetitions of this PCP, we get a PCP with completeness 1 and soundness γ(d) whose constraints are 3-CNF formulas with k(d) clauses. We have to check that this resulting PCP preserves the properties of inefficient Levin reductions. The encoder for the k-fold sequential repetition is unchanged. If the initial reduction is R(C) = {φ_1, …, φ_m} (a set of 3-literal disjunctions), then the reduction R^k(C) for the k-fold sequential repetition will produce m^k k-clause 3-CNF formulae by taking every subcollection of k clauses in φ_C. Specifically, for every i_1, i_2, …, i_k ∈ [m], R^k(C) will contain the k-clause 3-CNF formula φ_{i_1} ∧ φ_{i_2} ∧ … ∧ φ_{i_k}. The decoder also remains unchanged. If the value of an assignment π with respect to R^k(C) is at least η^k, then π must have value at least η with respect to R(C), and thus Dec(π, C) will return a satisfying assignment to C, that is, C(Dec(π, C)) = 1.

Notice that when k = k(d) = ω(1), the reduction will produce m^{ω(1)} clauses and be inefficient. Thus we will have an inefficient Levin reduction if we want to obtain γ(d) = o(1) from this construction.
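The k-fold sequential repetition in the proof sketch can be written out directly; the following sketch (our own illustration, with assumed names) builds the m^k conjunctions of R^k(C) and checks the value relation val(R^k(C), u) = val(R(C), u)^k that underlies the decoding argument.

```python
from itertools import product

def k_fold_repetition(clauses, k):
    """R^k(C): for every (i_1, ..., i_k) in [m]^k, include the k-clause
    conjunction phi_{i_1} AND ... AND phi_{i_k}.  This produces m^k
    constraints, hence is inefficient for k = omega(1)."""
    return [(lambda u, tup=tup: all(c(u) for c in tup))
            for tup in product(clauses, repeat=k)]

def value(constraints, u):
    """Fraction of constraints satisfied by assignment u."""
    return sum(1 for c in constraints if c(u)) / len(constraints)

# If u satisfies a v fraction of the base clauses, it satisfies exactly a
# v^k fraction of the repeated constraints (each of the k slots is an
# independent choice of base clause).
clauses = [lambda u: u[0], lambda u: u[1], lambda u: u[0] or u[1]]
rep = k_fold_repetition(clauses, k=2)
```

The exact v^k relation is what lets the decoder for R(C) be reused unchanged: any assignment of value at least η^k under R^k(C) has value at least η under R(C).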
4 Hard-to-Sanitize Distributions from Hard CSPs
In this section we prove that to efficiently produce a synthetic database that is accurate for the constraints of a CSP that is hard-to-approximate under Levin reductions, we must pay constant error in the worst case. Following [18], we start with a digital signature scheme and a database of valid message-signature pairs. There is a verifying circuit C_vk, and valid message-signature pairs are satisfying assignments to that circuit. Now we encode each row of the database using the function Enc, described in Definition 3.4, that maps satisfying assignments to C_vk to assignments of the CSP instance φ_{C_vk} = R(C_vk) with value at least γ. Then, any assignment to the CSP instance that satisfies a γ − α fraction of clauses can be decoded to a valid message-signature pair. The database of encoded message-signature pairs is what we will use as our hard-to-sanitize distribution.
4.1 Super-Secure Digital Signature Schemes
Before proving our main result, we will formally define a super-secure digital signature scheme.
These digital signature schemes have the property that it is infeasible under chosen-message attack
to find a new message-signature pair that is different from all obtained during the attack, even a new
signature for an old message. First we formally define digital signature schemes.

Definition 4.1 (Digital signature scheme). A digital signature scheme is a tuple of three probabilistic polynomial-time algorithms Π = (Gen, Sign, Ver) such that

1. Gen takes as input the security parameter 1^κ and outputs a key pair (sk, vk) ←_R Gen(1^κ).

2. Sign takes sk and a message m ∈ {0,1}^* as input and outputs a signature σ ←_R Sign_sk(m).

3. Ver takes vk and a pair (m, σ) and deterministically outputs a bit b ∈ {0,1}, such that for every (sk, vk) in the range of Gen and every message m, we have Ver_vk(m, Sign_sk(m)) = 1.
We define the security of a digital signature scheme with respect to the following game.

Definition 4.2 (Weak forgery game). For any signature scheme Π = (Gen, Sign, Ver) and probabilistic polynomial-time adversary F, WeakForge(F, Π, κ) is the following probabilistic experiment:

1. (sk, vk) ←_R Gen(1^κ).
2. F is given vk and oracle access to Sign_sk. The adversary adaptively queries Sign_sk on a set of messages Q ⊆ {0,1}*, receives a set of message-signature pairs A ⊆ {0,1}* × {0,1}*, and outputs a pair (m*, σ*).
3. The output of the game is 1 if and only if (1) Ver_vk(m*, σ*) = 1, and (2) (m*, σ*) ∉ A.

The weak forgery game is easier for the adversary to win than the standard forgery game because the final condition only requires that the pair output by F be different from all pairs (m, σ) ∈ A, but allows for the possibility that m* ∈ Q. In the standard definition, the final condition would be replaced by m* ∉ Q. Thus the adversary has more possible outputs that would result in a "win" under this definition than under the standard definition.
Definition 4.3 (Super-secure digital signature scheme). A digital signature scheme Π = (Gen, Sign, Ver) is super-secure under adaptive chosen-message attack if for every probabilistic polynomial-time adversary F,

  Pr[WeakForge(F, Π, κ) = 1] ≤ negl(κ).

Although the above definition is stronger than the usual definition of existentially unforgeable digital signatures, in [24] it is shown how to modify known constructions [30, 34] to obtain a super-secure digital signature scheme from any one-way function.
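The structure of the WeakForge experiment can be sketched as follows. Since Python's standard library has no public-key signatures, the sketch uses an HMAC-based stand-in in which vk equals sk; it illustrates only the shape of the game, not actual security, and every name here is ours:

```python
import hashlib
import hmac
import os

def gen(kappa=16):
    # Toy scheme: vk = sk. A real signature scheme would publish a separate vk.
    sk = os.urandom(kappa)
    return sk, sk

def sign(sk, m):
    return hmac.new(sk, m, hashlib.sha256).digest()

def ver(vk, m, sig):
    return hmac.compare_digest(sign(vk, m), sig)

def weak_forge(adversary, kappa=16):
    # The WeakForge experiment: the adversary gets vk and a signing oracle and
    # wins iff it outputs a valid pair (m*, s*) not among the oracle's answers.
    sk, vk = gen(kappa)
    answers = set()
    def oracle(m):
        s = sign(sk, m)
        answers.add((m, s))
        return s
    m_star, s_star = adversary(vk, oracle)
    return int(ver(vk, m_star, s_star) and (m_star, s_star) not in answers)
```

An adversary that merely replays an oracle answer loses even though its message is "old"; only a pair outside the answer set wins, which is exactly the weak-forgery condition.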
4.2 Main Hardness Result

We are now ready to state and prove our hardness result. Let Γ = (Γ_d)_{d∈ℕ} be a family of q(d)-CSPs and let C = (C^(d))_{d∈ℕ} be the class of constraints of Γ, which was constructed in Definition 3.1. We now state our hardness result.

Theorem 4.4. Let Γ = (Γ_d)_{d∈ℕ} be a family of nice q(d)-CSPs such that Γ_d ∪ ¬Γ_d is γ(d)-hard-to-approximate under (possibly inefficient) Levin reductions for γ = γ(d) ∈ (0, 1/2). Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D = (D_d)_{d∈ℕ} on n(d)-row databases that is (γ(d), C)-hard-to-sanitize as synthetic data.

Proof. Let Π = (Gen, Sign, Ver) be a super-secure digital signature scheme and let Γ be a family of CSPs that is γ-hard-to-approximate under Levin reductions. Let R, Enc, Dec be the polynomial-time functions and let α = α(d) ∈ (γ, 1] be the parameter from Definition 3.4. Let κ = d^c for a constant c > 0 to be defined later.

Let n = n(d) = poly(d). We define the database distribution ensemble D = D_d to generate n + 1 random message-signature pairs and then encode them as PCP witnesses with respect to the signature-verification algorithm. We also encode the verification key for the signature scheme using the non-constant constraint φ' : {0,1}^{q(d)} → {0,1} and the assignments u^0, u^1 ∈ {0,1}^{q(d)} such that φ'(u^0) = 0 and φ'(u^1) = 1, as described in the definition of nice CSPs (Definition 3.2).
Database Distribution Ensemble D = D_d:
  (sk, vk) ←_R Gen(1^κ); let vk = vk_1 vk_2 ... vk_ℓ, where ℓ = |vk| = poly(κ)
  (m_1, ..., m_{n+1}) ←_R ({0,1}^κ)^{n+1}
  for i = 1 to n + 1 do
    x_i := Enc((m_i ‖ Sign_sk(m_i)), C_vk) ‖ u^{vk_1} ‖ u^{vk_2} ‖ ... ‖ u^{vk_ℓ},
      padded with zeros to be of length exactly d
  end for
  return D := (x_1, ..., x_{n+1})

Recall that s_1 ‖ s_2 denotes the concatenation of the strings s_1 and s_2. Note that the length of x_i before padding is poly(κ) + q(d) · poly(κ) = poly(d^c), so we can choose the constant c > 0 to be small enough that the length of x_i before padding is at most d, and the above is well defined.

Every valid pair (m, Sign_sk(m)) is a satisfying assignment of the circuit C_vk, hence every row of D constructed in this way will satisfy at least an α fraction of the clauses of the formula φ_{C_vk} = R(C_vk).
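The row format in the box above can be sketched structurally as follows; `enc_bits`, `u0`, `u1`, and the dimension below are placeholders (the real Enc is the PCP encoder of Definition 3.4):

```python
def build_row(enc_bits, vk_bits, u0, u1, d):
    # One database row: a stub PCP encoding of (m, Sign_sk(m)), followed by one
    # q-bit block per verification-key bit (u1 for a 1 bit, u0 for a 0 bit),
    # zero-padded to dimension exactly d.
    row = list(enc_bits)
    for b in vk_bits:
        row += (u1 if b else u0)
    assert len(row) <= d, "choose the constant c small enough"
    return row + [0] * (d - len(row))
```

The assertion mirrors the choice of the constant c above: the unpadded row must fit in d coordinates.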
Additionally, for every bit of the verification key, there is a block of q(d) bits in each row that contains either a satisfying assignment or a non-satisfying assignment of φ', depending on whether that bit of the key is 1 or 0. Specifically, let L = |Enc((m_i ‖ Sign_sk(m_i)), C_vk)| and, for j = 1, 2, ..., ℓ, let φ'_j(x) = φ'(x_{L+(j-1)q+1}, ..., x_{L+jq}). Then, by construction, φ'_j(D) = vk_j, the j-th bit of the verification key, for j = 1, 2, ..., ℓ. Note that φ'_j ∈ C^(d) by our construction of C^(d) (Definition 3.1). We now prove the following two lemmas that will establish that D is hard-to-sanitize:

Lemma 4.5. There exists a polynomial-time adversary T such that for every polynomial-time sanitizer A,

  Pr_{(D, D', i) ←_R D̃, A's and T's coins} [ (A(D) is γ-accurate for C^(d)) ∧ (T(A(D)) ∩ D = ∅) ] ≤ negl(d).   (1)
Proof. Our privacy adversary tries to find a row of the original database by trying to PCP-decode each row of the "sanitized" database and then re-encoding it. In order to do so, the adversary needs to know the verification key used in the construction of the database, which it can learn from the answers to the queries φ'_j defined above. Formally, we define the privacy adversary by means of a subroutine that tries to learn the verification key and then PCP-decode each row of its input database:

Subroutine K(D̂):
  Let d be the dimension of rows in D̂, κ = d^c, ℓ = |vk| = poly(κ)
  for j = 1 to ℓ do
    v̂k_j := φ'_j(D̂) rounded to {0,1}
  end for
  return v̂k := v̂k_1 ‖ v̂k_2 ‖ ... ‖ v̂k_ℓ

Subroutine T'(D̂):
  Let n̂ be the number of rows in D̂, v̂k = K(D̂)
  for i = 1 to n̂ do
    if C_v̂k(Dec(x̂_i, C_v̂k)) = 1 then
      return Dec(x̂_i, C_v̂k)
    end if
  end for
  return ⊥

Privacy Adversary T(D̂):
  Let v̂k = K(D̂)
  return Enc(T'(D̂), C_v̂k)

Let A be a polynomial-time sanitizer; we will show that Inequality (1) holds.

Claim 4.6. If D̂ = A(D) is γ-accurate for C^(d), then T'(D̂) outputs a pair (m, σ) such that C_vk(m, σ) = 1.

Proof. First we argue that if D̂ is γ-accurate for C^(d) then K(D̂) = v̂k = vk, the verification key used in the construction of D. If vk_j = 0 then φ'_j(D) = 0 and, since D̂ is γ-accurate, φ'_j(D̂) ≤ γ < 1/2, so v̂k_j = 0 = vk_j. Similarly, if vk_j = 1 then φ'_j(D) = 1 and φ'_j(D̂) ≥ 1 − γ > 1/2, so v̂k_j = 1 = vk_j. Thus, for the rest of the proof we are justified in substituting vk for v̂k.

Next, we show that T'(D̂) ≠ ⊥. It is sufficient to show that there exists x̂ ∈ D̂ such that val(φ_{C_vk}, x̂) ≥ γ, which implies that Dec(x̂, C_vk) is a satisfying assignment to C_vk. By construction, each row x_i of D has val(φ_{C_vk}, x_i) ≥ α. Thus if φ_{C_vk} = {φ_1, ..., φ_m}, then

  (1/m) Σ_{j=1}^{m} φ_j(D) = (1/m) Σ_{j=1}^{m} ( (1/n) Σ_{i=1}^{n} φ_j(x_i) ) = (1/n) Σ_{i=1}^{n} val(φ_{C_vk}, x_i) ≥ α.

Since D̂ is γ-accurate for C^(d), and for every constraint φ_j either φ_j or its negation is in C^(d), we have φ_j(D̂) ≥ φ_j(D) − γ. Thus

  (1/n̂) Σ_{i=1}^{n̂} val(φ_{C_vk}, x̂_i) = (1/m) Σ_{j=1}^{m} φ_j(D̂) ≥ (1/m) Σ_{j=1}^{m} φ_j(D) − γ ≥ α − γ ≥ γ.
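The key-recovery step of subroutine K can be sketched as follows, with `phi_prime` a stand-in for the nice-CSP constraint φ' evaluated on each q-bit key block; the block layout mirrors the row construction above, and all names are illustrative:

```python
def recover_key(rows, L, q, ell, phi_prime):
    # For each key bit j, average the block constraint phi'_j over all rows
    # and round to {0,1}; gamma-accuracy with gamma < 1/2 makes the rounding
    # recover the true key bit.
    vk = []
    for j in range(ell):
        block_vals = [phi_prime(r[L + j*q : L + (j+1)*q]) for r in rows]
        avg = sum(block_vals) / len(rows)
        vk.append(1 if avg > 0.5 else 0)
    return vk
```

On a database whose key blocks are exact copies of u^0/u^1, the recovery is exact; accuracy error below 1/2 per query leaves the rounded bits unchanged.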
So for at least one row x̂ ∈ D̂ it must be the case that val(φ_{C_vk}, x̂) ≥ γ. The definition of Dec (Definition 3.4) implies C_vk(Dec(x̂, C_vk)) = 1. □

Now notice that if T(A(D)) outputs a valid message-signature pair but T(A(D)) ∩ D = ∅, then T(A(D)) is forging a new signature not among those used to generate D, violating the security of the digital signature scheme. Formally, we construct a signature forger as follows:

Forger F(vk) with oracle access to Sign_sk:
  Use the oracle Sign_sk to generate an n-row database D just as in the definition of D_d
    (consisting of PCP encodings of valid message-signature pairs and an encoding of vk)
  Let D̂ := A(D)
  return x̂ := T'(D̂)

Notice that running F in a chosen-message attack is equivalent to running T in the experiment of Inequality (1), except that F does not re-encode the output of T'(A(D)). By the super-security of the signature scheme, if the x̂ output by F is a valid message-signature pair (as holds if A(D) is γ-accurate for C^(d), by Claim 4.6), then it must be one of the message-signature pairs used to construct D, except with probability negl(κ) = negl(d). This implies that T(A(D)) = Enc(x̂, C_vk) ∈ D (except with negligible probability). Thus, we have

  Pr_{(D, D', i) ←_R D̃, A's and T's coins} [ A(D) is γ-accurate for C^(d) ⟹ T(A(D)) ∈ D ] ≥ 1 − negl(d),

which is equivalent to the statement of the lemma. □

Lemma 4.7.

  Pr_{(D, D', i) ←_R D̃, A's and T's coins} [ T(A(D')) = x_i ] ≤ negl(d).
Proof. Since the messages m_1, ..., m_{n+1} used in D are drawn independently, D' contains no information about the message m_i; thus no adversary can, on input A(D'), output the target row x_i except with negligible probability. Hence Pr[T(A(D')) = x_i] ≤ negl(d). □

These two lemmas suffice to establish that D is (γ, C)-hard-to-sanitize as synthetic data. □

Theorem 1.1 in the introduction follows by combining Theorems 3.5 and 4.4.

5 Relaxed Synthetic Data

The proof of Theorem 4.4 requires that the sanitizer output a synthetic database. In this section we present similar hardness results for sanitizers that produce other forms of output, as long as they still produce a collection of elements from {0,1}^d that are interpreted as the data of (possibly "fake") individuals. More specifically, we consider sanitizers that output a database D̂ ∈ ({0,1}^d)^{n̂} whose rows are interpreted using an evaluation function of the following form: To evaluate a predicate c ∈ C on D̂, apply c to each row x̂_1, ..., x̂_{n̂} to get a string of n̂ bits, and then apply a function f : {0,1}^{n̂} × C → [0,1] to determine the answer. For example, when the sanitizer outputs a synthetic database, we have f(b, c) = (1/n̂) Σ_{i=1}^{n̂} b_i, which is just the fraction of rows that get labeled with a 1 by the predicate c (independent of c). We now give a formal definition of relaxed synthetic data.

Definition 5.1 (Relaxed Synthetic Data). A sanitizer A : ({0,1}^d)^n → ({0,1}^d)^{n̂} with evaluator E outputs relaxed synthetic data for a family of predicates C if there exists f : {0,1}^{n̂} × C → [0,1] such that:

1. For every c ∈ C, E(D̂, c) = f(c(x̂_1), c(x̂_2), ..., c(x̂_{n̂}), c); and
2. f is monotone² in the first n̂ inputs.

²Given two vectors a = (a_1, ..., a_n) and b = (b_1, ..., b_n), we say b ⪯ a iff b_i ≤ a_i for every i ∈ [n]. We say a function f : {0,1}^n → [0,1] is monotone if b ⪯ a ⟹ f(b) ≤ f(a).

This relaxed notion of synthetic data is of interest because many natural approaches to sanitizing yield outputs of this type. In particular, several previous sanitization algorithms [20, 35, 9] produce a set of synthetic databases and answer a query by taking a median over the answers given by the individual databases. Sanitizers that use medians of synthetic databases no longer have the advantage that they are "interchangeable" with the original data, but are still desirable for data releases because they retain the property that a small data structure can give accurate answers to a large number of queries. We view such databases
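Definition 5.1 can be sketched directly; the default `f` below is the standard synthetic-data evaluator (the fraction of rows labeled 1), which is monotone in the bit vector as required:

```python
def evaluate(rows, c, f=None):
    # Relaxed-synthetic-data evaluation: apply the predicate c to every row
    # and feed the resulting bit vector (plus c itself) to f.
    bits = [c(x) for x in rows]
    if f is None:
        f = lambda b, c: sum(b) / len(b)  # standard synthetic-data answer
    return f(bits, c)
```

Any other monotone choice of `f`, such as the median-of-blocks evaluator defined next, fits the same interface.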
as a single synthetic database, but require that f have a special form. Unfortunately, the sanitizers of [20] and [35] run in time exponential in the dimension of the data, d, and the results of the next subsection show this limitation is inherent even for simple concept classes.

Throughout this section we will continue to use c(D̂) to refer to the answer given by interpreting D̂ as a standard synthetic database, c(D̂) = (1/n̂) Σ_{i=1}^{n̂} c(x̂_i). We now present our hardness results for relaxed synthetic data when the function f takes the median over synthetic databases (Section 5.1), when f is an arbitrary monotone, symmetric function (Section 5.2), and when the family of concepts contains CSPs that are very hard to approximate (Section 5.3). Our proofs use the same construction of hard-to-sanitize databases as Theorem 4.4, with a modified analysis and parameters to show that the output must still contain a PCP-decodable row.

5.1 Hardness of Sanitizing as Medians

In this section we establish that the distribution D used in the proof of Theorem 4.4 is hard-to-sanitize as medians of synthetic data, formally defined as:

Definition 5.2 (medians of synthetic data). A sanitizer A : ({0,1}^d)^n → ({0,1}^d)^{n̂} with evaluator E outputs medians of synthetic data if there is a partition [n̂] = S_1 ∪ S_2 ∪ ... ∪ S_ℓ such that

  E(x̂_1, ..., x̂_{n̂}, c) = median{ (1/|S_1|) Σ_{i∈S_1} c(x̂_i), (1/|S_2|) Σ_{i∈S_2} c(x̂_i), ..., (1/|S_ℓ|) Σ_{i∈S_ℓ} c(x̂_i) }.

Note that medians of synthetic data are a special case of relaxed synthetic data. In the following, we rule out efficient sanitizers with medians of synthetic data for CSPs that are hard to approximate within a multiplicative factor larger than 2. By Theorem 3.6, these CSPs include k-clause 3-CNF formulas for some constant k.

Theorem 5.3. Let Γ = (Γ_d)_{d∈ℕ} be a family of nice (Definition 3.2) q(d)-CSPs such that Γ_d ∪ ¬Γ_d is ((α(d)+γ(d))/2, γ(d))-hard-to-approximate under (possibly inefficient) Levin reductions for γ = γ(d) ∈ (0, 1/2). Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D = D_d on n(d)-row databases that is (γ(d), C)-hard-to-sanitize as medians of synthetic data.
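The median evaluator of Definition 5.2 can be sketched as follows (the partition is given as explicit index blocks; this is an illustrative sketch, not the paper's code):

```python
from statistics import median

def median_evaluate(rows, partition, c):
    # Read each block S_k as its own synthetic database and answer the query
    # with the median of the per-block averages.
    block_answers = [sum(c(rows[i]) for i in S) / len(S) for S in partition]
    return median(block_answers)
```

This is monotone in the bit vector c(x̂_1), ..., c(x̂_{n̂}), so it is indeed a special case of relaxed synthetic data.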
Proof. Let Γ be a family of CSPs that is ((α+γ)/2, γ)-hard-to-approximate under Levin reductions. Let D = D_d be the database distribution ensemble described in the proof of Theorem 4.4. Let A(D) = D̂ and let D̂_1, D̂_2, ..., D̂_ℓ be the partition of the rows of D̂ corresponding to S_1, ..., S_ℓ, i.e. D̂_k = (x̂_j)_{j∈S_k}. Assuming that D̂ is γ-accurate as medians of synthetic data, we will show that there must exist a row x̂ ∈ D̂ such that val(φ_{C_vk}, x̂) ≥ (α+γ)/2 − γ = (α−γ)/2. To do so, we observe that if D̂ is accurate as medians of synthetic databases, then for each predicate, half of D̂'s synthetic databases must give an answer that is "close to D's answer". Thus one of these synthetic databases must be "close" to D for half of the predicates in φ_{C_vk}. By our construction of D, we conclude that each of these predicates is satisfied by many rows of this synthetic database, and thus some row satisfies enough of the predicates to decode a message-signature pair.
We need to show that there exists an adversary T such that for every polynomial-time sanitizer A,

  Pr_{(D, D', i) ←_R D̃, A's and T's coins} [ (A(D) is γ-accurate for C^(d)) ∧ (T(A(D)) ∩ D = ∅) ] ≤ negl(d).   (2)

To do so, we will use the same subroutine T'(D̂) we used for the proof of Lemma 4.5. That is, we consider a subroutine that looks for rows satisfying sufficiently many clauses of φ_{C_vk} and returns the PCP-decoding of that row. It will suffice to establish the following claim, analogous to Claim 4.6:

Claim 5.4. If D̂ is γ-accurate for C^(d) as medians of synthetic data for γ < 1/2, then K(D̂) = vk and T'(D̂) outputs a pair (m, σ) s.t. C_vk(m, σ) = 1.

Proof. As in the proof of Claim 4.6, if D̂ is γ-accurate as medians of synthetic data, then K(D̂) = vk, the verification key used in the construction of D. For the rest of the proof we will be justified in substituting vk for v̂k.

We say that D̂_k is good for φ_j if φ_j(D̂_k) ≥ φ_j(D) − γ. Since the median over D̂_1, ..., D̂_ℓ is γ-accurate for every constraint φ_j ∈ φ_{C_vk}, at least half of the databases D̂_1, ..., D̂_ℓ must be good for each φ_j, i.e. Pr_{k ←_R [ℓ]}[D̂_k is good for φ_j] ≥ 1/2. If φ_{C_vk} = {φ_1, ..., φ_m}, then

  E_{k ←_R [ℓ]} [ (1/m) Σ_{j=1}^{m} 1{D̂_k is good for φ_j} ] = (1/m) Σ_{j=1}^{m} Pr_{k ←_R [ℓ]} [ D̂_k is good for φ_j ] ≥ 1/2,

so there exists k ∈ [ℓ] such that D̂_k is good for at least half of the constraints in φ_{C_vk}. Arguing as in the proof of Claim 4.6, the rows of D̂_k must then, on average, satisfy a large fraction of the constraints, so for at least one row x̂ ∈ D̂_k it must be the case that val(φ_{C_vk}, x̂) ≥ (α+γ)/2 − γ = (α−γ)/2, and T'(D̂) outputs a decoded pair (m, σ) with C_vk(m, σ) = 1. □

Since the distribution D is unchanged, Lemma 4.7 still holds in this setting. Thus we have established that D is (γ, C)-hard-to-sanitize as medians of synthetic data. □

5.2 Hardness of Sanitizing with Symmetric Evaluation Functions

In this section we establish the hardness of sanitization for relaxed synthetic data where the evaluator function is symmetric.

Definition 5.5 (symmetric relaxed synthetic data). A sanitizer A : ({0,1}^d)^n → ({0,1}^d)^{n̂} with evaluator E outputs symmetric relaxed synthetic data if there exists a monotone function g : [0,1] → [0,1] such that

  E(D̂, c) = g( (1/n̂) Σ_{i=1}^{n̂} c(x̂_i) ).
Note that symmetric relaxed synthetic data is also a special case of relaxed synthetic data. Our definition of symmetric relaxed synthetic data is actually symmetric in two respects, because we require that g does not depend on the predicate c and also that g depends only on the fraction of rows that satisfy c. Similarly to medians of synthetic data, we show that it is intractable to produce a sanitization as symmetric relaxed synthetic data that is accurate when the queries come from a CSP that is hard to approximate.

Theorem 5.6. Let Γ = (Γ_d)_{d∈ℕ} be a family of nice (Definition 3.2) q(d)-CSPs that is closed under complement (Γ_d = ¬Γ_d) and is (γ(d) + 1/2, γ(d))-hard-to-approximate under (possibly inefficient) Levin reductions for γ = γ(d) ∈ (0, 1/2). Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D = D_d on n(d)-row databases that is (γ(d), C)-hard-to-sanitize as symmetric relaxed synthetic data.

By Theorem 3.6, the family of k-clause CNF formulas, for some constant k, is (1/2 + γ)-hard-to-approximate under Levin reductions for some γ > 0.

Proof. Let Γ be a family of CSPs that is (γ + 1/2, γ)-hard-to-approximate under Levin reductions. Let D = D_d be the database distribution ensemble described in the proof of Theorem 4.4, D ←_R D_d, and D̂ = A(D). We will use the idea from Theorem 4.4, which shows that the underlying synthetic database cannot contain a row that satisfies too many clauses of φ_{C_vk}, to show that g must map a small input to a large output and a large input to a small output, contradicting the monotonicity of g.
Let φ_{C_vk} = {φ_1, ..., φ_m}. As in the proof of Claim 4.6, we have

  (1/m) Σ_{j=1}^{m} φ_j(D) ≥ 1/2 + γ.

It must also be that (1/m) Σ_{j=1}^{m} φ_j(D̂) ≤ γ. Otherwise there would exist a row x̂ ∈ D̂ = A(D) such that val(φ_{C_vk}, x̂) ≥ γ, and such a row could be PCP-decoded to a valid message-signature pair as in the proof of Theorem 4.4. Thus

  (1/m) Σ_{j=1}^{m} ( φ_j(D) − φ_j(D̂) ) ≥ 1/2,

so there must exist J ∈ [m] s.t. φ_J(D) ≥ φ_J(D̂) + 1/2. Since φ_J(D) ≥ φ_J(D̂) + 1/2 > 1/2, and since φ_J(D) ≤ 1, we also have φ_J(D̂) ≤ 1/2. By the monotonicity of g and the γ-accuracy of D̂ as symmetric relaxed synthetic data we have

  g(1/2) ≥ g(φ_J(D̂)) ≥ φ_J(D) − γ ≥ 1/2 + φ_J(D̂) − γ.

Now consider the negation ¬φ_J, which is also a constraint of Γ_d since Γ_d = ¬Γ_d. Since ¬φ_J(D̂) = 1 − φ_J(D̂) ≥ 1/2 and ¬φ_J(D) = 1 − φ_J(D) ≤ 1/2 − φ_J(D̂), we can similarly conclude that

  g(1/2) ≤ g(¬φ_J(D̂)) ≤ ¬φ_J(D) + γ ≤ 1/2 − φ_J(D̂) + γ.

But then g(1/2) ≥ 1/2 + φ_J(D̂) − γ and g(1/2) ≤ 1/2 − φ_J(D̂) + γ, which together with γ > 0 contradicts the monotonicity of g. □
5.3 Hardness of Sanitizing Very Hard CSPs with Relaxed Synthetic Data

In this section we show that no efficient sanitizer can produce accurate relaxed synthetic data for a sequence of CSPs that is (1 − negl(d))-hard-to-approximate under inefficient Levin reductions. By Theorem 3.6, these CSPs include 3-CNF formulas of ω(log d) clauses.

Intuitively, an efficient sanitizer must produce a synthetic database of n̂(d) = poly(d) rows, and thus as d grows, an efficient sanitizer cannot produce a synthetic database that contains a row satisfying a non-negligible fraction of clauses from a particular CSP instance (the signature-verification CSP from our earlier results). Thus, using evaluators of the type in Definition 5.1, there can only be one answer to most queries, and thus we cannot get an accurate sanitizer.
Theorem 5.7. Let Γ = (Γ_d)_{d∈ℕ} be a family of nice (Definition 3.2) q(d)-CSPs such that Γ_d ∪ ¬Γ_d is (1 − ν(d), 1)-hard-to-approximate under (possibly inefficient) Levin reductions for a negligible function ν(d). Assuming the existence of one-way functions, for every polynomial n(d), there exists a distribution ensemble D = D_d on n(d)-row databases that is (1/3, C)-hard-to-sanitize as relaxed synthetic data.

Proof. Let Γ = (Γ_d)_{d∈ℕ} be a family of CSPs that is (1 − ν(d), 1)-hard-to-approximate under Levin reductions, and let D = D_d be the database distribution ensemble described in the proof of Theorem 4.4. Note that since Γ_d depends on d, there will be a sequence of triples (Enc_d, Dec_d, R_d), and the construction of D should use the appropriate encoder for each choice of d. Let D ←_R D_d and A(D) = D̂.

Let φ_{C_vk} = {φ_1, ..., φ_m}. By the construction of D_d we have

  (1/m) Σ_{j=1}^{m} φ_j(D) ≥ 1 − ν(d).

As in the proof of Theorem 5.6 it must be that

  (1/m) Σ_{j=1}^{m} φ_j(D̂) ≤ ν(d).

Otherwise there would exist a row x̂ ∈ D̂ = A(D) such that val(φ_{C_vk}, x̂) ≥ ν(d). But if this were the case we could PCP-decode x̂ as in the proof of Theorem 4.4.

Since E_{j ←_R [m]}[φ_j(D̂)] ≤ ν(d), there must exist a subset J ⊆ [m] of size |J| ≥ 2m/3 such that for all j ∈ J, φ_j(D̂) ≤ 3ν(d) = negl(d). Since D̂ ∈ ({0,1}^d)^{n̂} for n̂ = n̂(d) = poly(d) (by the efficiency of A), we have

  φ_j(D̂) = (1/n̂) Σ_{i=1}^{n̂} φ_j(x̂_i) ∈ {0, 1/n̂(d), 2/n̂(d), ..., 1},

and thus φ_j(D̂) ≤ negl(d) implies φ_j(D̂) = 0 for large n. Let E(D̂, c) = f(c(x̂_1), ..., c(x̂_{n̂}), c). If we assume D̂ is 1/3-accurate as relaxed synthetic data, then f(0^{n̂}; φ_j) ≥ 2/3 for every j ∈ J.

Now consider the execution of A on the all-zeros database D^0 = (0^d, ..., 0^d). With probability 1 − negl(d), Dec(0^d, C_vk) is not a valid message-signature pair; thus by Definition 3.4 Part 2, we have

  val(φ_{C_vk}, 0^d) ≤ ν(d).

Since the rows of D^0 are identical, φ_j(D^0) ∈ {0,1} for every j ∈ [m]. So for at least a 1 − ν(d) fraction of j ∈ [m], we have φ_j(D^0) = 0. Let D̂^0 = A(D^0). By repeating the signature-forging argument, we see that with probability 1 − negl(d),

  (1/m) Σ_{j=1}^{m} φ_j(D̂^0) ≤ 3ν(d) = negl(d),

so φ_j(D̂^0) = 0 for the j's in a subset J' ⊆ [m] of size at least 2m/3 as well. There must also exist a subset J'' ⊆ [m] of size |J''| ≥ (2/3 − ν(d))m such that for every j ∈ J'', φ_j(D^0) = φ_j(D̂^0) = 0. So if D̂^0 is 1/3-accurate for C as relaxed synthetic data, it must be that f(0^{n̂}; φ_j) ≤ 1/3 for every j ∈ J''. By our choice of J and J'', there must exist j ∈ J ∩ J'' such that (1) f(0^{n̂}; φ_j) ≥ 2/3, and (2) f(0^{n̂}; φ_j) ≤ 1/3, which is a contradiction. □
5.4 Positive Results for Relaxed Synthetic Data

In this section we present an efficient, accurate sanitizer for small (e.g. polynomial in d) families of parity queries that outputs symmetric relaxed synthetic data, and we show that this sanitizer also yields accurate answers for any family of constant-arity predicates when evaluated as relaxed synthetic data. Our result for parities shows that relaxed synthetic data (and even symmetric relaxed synthetic data) allows for more efficient sanitization than standard synthetic data, since Theorem 4.4 rules out an accurate, efficient sanitizer that produces a standard synthetic database, even for the class of 3-literal parity predicates. Our result for parities also shows that our hardness result for symmetric relaxed synthetic data (Theorem 5.6) is tight with respect to the required hardness of approximation, since the class of 3-literal parity predicates is (1/2 + γ)-hard-to-approximate [25] for γ > 0.

A function f : {0,1}^d → {0,1} is a k-junta if it depends on at most k variables. Let J_{d,k} be the set of all k-juntas on d variables.

Theorem 5.8. There exists an ε-differentially private sanitizer that runs in time poly(n, d), produces relaxed synthetic data, and is (α, β)-accurate for J_{d,k} when n ≥ C · (d_{≤k} / αε) · log(d_{≤k} / β) for a sufficiently large constant C, where d_{≤k} = Σ_{i=0}^{k} (d choose i).

The privacy, accuracy, and efficiency guarantees of our theorem can be achieved without relaxed synthetic data simply by releasing a vector of noisy answers to the queries [17]. Our sanitizer will, in fact, begin with this vector of noisy answers and construct relaxed synthetic data from those answers. Our technique is similar to that of Barak et al. [5], which begins with a vector of noisy answers to parity queries (defined in Section 5.4.1) and constructs a (standard) synthetic database that gives answers to each query that are close to the initial noisy answers. They construct their synthetic database by solving a linear program over 2^d variables that correspond to the frequency of each possible row x ∈ {0,1}^d, and thus their sanitizer runs in time exponential in d. Our sanitizer also starts with a vector of noisy answers to parity queries and efficiently constructs symmetric relaxed synthetic data that gives answers to each query that are close to the initial noisy answers after applying a fixed linear scaling. We then show that the database our sanitizer constructs is also accurate for the family of k-juntas using an affine shift that depends on the junta.
5.4.1 Efficient Sanitizer for Parities

To prove Theorem 5.8, we start with a sanitizer for parity predicates.

Definition 5.9 (Parity Predicate). A function χ_s : {0,1}^d → {−1, 1} is a parity predicate³ if there exists a vector s ∈ {0,1}^d s.t. χ_s(x) = (−1)^{⟨x, s⟩}. We will use wt(s) to denote the number of non-zero entries in s.

³In the preliminaries we define a predicate to be a {0,1}-valued function, but our definition naturally generalizes to {−1,1}-valued functions. For c : {0,1}^d → {−1,1} and database D = (x_1, ..., x_n) ∈ ({0,1}^d)^n, we define c(D) = (1/n) Σ_{i=1}^{n} c(x_i).

Theorem 5.10. Let P be a family of parity predicates on d variables such that χ_{0^d} ∉ P. There exists an ε-differentially private sanitizer A(D, P) that runs in time poly(n, d) and produces symmetric relaxed synthetic data that is (α, β)-accurate for P when

  n ≥ (2|P| / αε) · log(2|P| / β).

The analysis of our sanitizer will make use of the following standard fact about parity predicates.

Fact 5.11. Two parity predicates χ_s, χ_t : {0,1}^d → {−1, 1} are either identical or orthogonal. Specifically, for s ≠ t, s ≠ 0^d, and b ∈ {−1, 1},

  E_{x ←_R {0,1}^d}[ χ_s(x) | χ_t(x) = b ] = E_{x ←_R {0,1}^d}[ χ_s(x) ] = 0.
Our sanitizer will start with noisy answers to the predicate queries χ(D). Each noisy answer will be the true answer perturbed with noise from a Laplace distribution. The Laplace distribution Lap(λ) is a continuous distribution on ℝ with probability density function p(y) ∝ exp(−|y|/λ). The following theorem of Dwork et al. [17] shows that these queries are differentially private for an appropriate choice of λ.

Theorem 5.12 ([17]). Let (c_1, c_2, ..., c_k) be a set of predicates, let λ = k/(εn), and let D ∈ ({0,1}^d)^n be a database. Then the mechanism A(D) = (c_1(D) + Z_1, c_2(D) + Z_2, ..., c_k(D) + Z_k), where Z_1, ..., Z_k are independent samples from Lap(λ), is ε-differentially private.

In order to argue about the accuracy of our mechanism we need to know how much error is introduced by noise from the Laplace distribution. The following fact gives a bound on the tail of a Laplace random variable.

Fact 5.13. The tail of the Laplace distribution decays exponentially. Specifically, Pr[|Lap(λ)| ≥ t] = exp(−t/λ).
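A minimal sketch of the mechanism of Theorem 5.12, assuming counting-style queries of sensitivity 1/n each (the helper name and interface are ours, not from [17]):

```python
import random

def laplace_mechanism(answers, eps, n, rng=None):
    # Release k query answers, each perturbed with Lap(k/(eps*n)) noise.
    rng = rng or random.Random()
    k = len(answers)
    scale = k / (eps * n)
    # A Laplace(scale) sample is the difference of two independent
    # exponentials with mean `scale` (random.expovariate takes the rate).
    return [a + rng.expovariate(1/scale) - rng.expovariate(1/scale)
            for a in answers]
```

By Fact 5.13, each released answer deviates from the truth by more than t with probability exp(−t·εn/k), so for k = poly(d) queries and modest n the added error is small.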
Now we present our sanitizer for queries that are parity functions. We will not consider the query χ_{0^d}, as χ_{0^d}(x) = 1 for every x ∈ {0,1}^d. Let P be a set of parity functions that does not contain χ_{0^d}. We now present a poly(n, d, |P|)-time sanitizer for P. Our sanitizer starts by getting noisy estimates of the quantities χ(D) for each predicate χ ∈ P by adding Laplace noise. Then it builds the relaxed synthetic data D̂ in blocks of rows. Each block of rows is "assigned" to contain an answer to a query χ. In that block we randomly choose rows such that the expected value of χ on each row equals the noisy estimate of χ(D). By Fact 5.11, the expected value of every other predicate χ' is 0 for rows in this block. The sanitizer is accurate so long as the total number of rows is sufficient for the value of χ(D̂) to be concentrated around its expectation.

Sanitizer A(D, P), where P = (χ^(1), χ^(2), ..., χ^(t)):
  Let λ := |P|/(nε), T := (2|P|²/α²) log(4|P|/β)
  for all j = 1, ..., t do
    a_j := χ^(j)(D) + Lap(λ)
    for i = jT + 1 to (j+1)T do
      With probability (a_j + 1)/2: let x̂_i ←_R { x ∈ {0,1}^d : χ^(j)(x) = 1 }
      Otherwise: let x̂_i ←_R { x ∈ {0,1}^d : χ^(j)(x) = −1 }
    end for
  end for
  return D̂ := (x̂_1, ..., x̂_{n̂})

Evaluator E_P(D̂, χ):
  return |P| · χ(D̂)

The following claims will suffice to establish Theorem 5.10.

Claim 5.14. A is ε-differentially private.

Proof. The output of A depends only on the answers to |P| predicate queries. By Theorem 5.12, the answers to |P| predicate queries perturbed by independent samples from Lap(|P|/(nε)) are ε-differentially private. □

Claim 5.15. A is (α, β)-accurate for P when

  n ≥ (2|P| / αε) · log(2|P| / β).
Proof. We want to show that for every χ^(j) ∈ P,

  | |P| · χ^(j)(D̂) − χ^(j)(D) | ≤ α

except with probability β. To do so we consider separately the error introduced in going from χ(D) to a_j using Laplacian noise, and the error introduced in going from the noisy answers a_j to |P| · χ(D̂) by sampling rows at random.

First we bound the error introduced by the noisy queries to D. Specifically, we want to show that for every χ^(j) ∈ P we have |χ^(j)(D) − a_j| ≤ α/2, except with probability β/2. For each χ^(j),

  Pr[ |χ^(j)(D) − a_j| ≥ α/2 ] = exp(−α/2λ) = exp(−nαε/2|P|)

by Fact 5.13. So by a union bound we have

  Pr[ ∃j : |χ^(j)(D) − a_j| ≥ α/2 ] ≤ |P| exp(−nαε/2|P|) < β/2,

so long as n ≥ (2|P|/αε) · log(2|P|/β).

We also want to show that for every χ^(j) ∈ P,

  | |P| · χ^(j)(D̂) − a_j | ≤ α/2

except with probability β/2, where a_j is the noisy answer for χ^(j)(D) computed in A(D). To do so we show that the expectation of |P| · χ^(j)(D̂) is indeed a_j, then we use a Chernoff-Hoeffding bound⁴ to show that the random rows generated by A(D) are close to their expectation, and finally we take a union bound over all χ^(j) ∈ P.

Fix χ^(j) ∈ P and consider χ^(j)(D̂), which is the average of n̂ independent biased coin flips. In rows x̂_{jT+1}, ..., x̂_{(j+1)T} (the rows where we focus on χ^(j)), the expectation of each coin flip is a_j, and in all other rows the expectation of each coin flip is 0 by Fact 5.11. Thus

  E[ χ^(j)(D̂) ] = E[ (1/n̂) Σ_{i=1}^{n̂} χ^(j)(x̂_i) ] = a_j / |P|.

By a Chernoff-Hoeffding bound we conclude

  Pr[ | χ^(j)(D̂) − a_j/|P| | ≥ α/2|P| ] < 2 exp(−Tα²/2|P|²).

By taking a union bound over P we conclude

  Pr[ ∃j : | |P| · χ^(j)(D̂) − a_j | ≥ α/2 ] < 2|P| exp(−Tα²/2|P|²) ≤ β/2.

Combining the two bounds suffices to prove the claim. □

⁴One form of the Chernoff-Hoeffding Bound [13] states that if X_1, ..., X_n are independent random variables over [0, 1] and X = (1/n) Σ_i X_i, then Pr[|X − E[X]| ≥ t] < 2 exp(−2nt²).
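Putting the pieces together, the sanitizer and its evaluator can be sketched as follows. The clipping of the noisy answer into [−1, 1] and the rejection sampler are implementation choices of this sketch, not specified by the pseudocode above:

```python
import math
import random

def chi(s, x):
    # Parity predicate chi_s(x) = (-1)^<s,x>, valued in {-1, +1}.
    return -1 if sum(si & xi for si, xi in zip(s, x)) % 2 else 1

def sample_row(s, target, d, rng):
    # Rejection-sample a uniform row x in {0,1}^d with chi_s(x) == target.
    # Requires s != 0^d so that both values of chi_s are attainable.
    while True:
        x = tuple(rng.randrange(2) for _ in range(d))
        if chi(s, x) == target:
            return x

def sanitize_parities(D, P, eps, alpha=0.1, beta=0.05, rng=None):
    rng = rng or random.Random()
    d, n, p = len(D[0]), len(D), len(P)
    lam = p / (eps * n)                      # Laplace scale from Theorem 5.12
    T = int((2 * p * p / alpha**2) * math.log(4 * p / beta)) + 1
    D_hat = []
    for s in P:
        true_answer = sum(chi(s, x) for x in D) / n
        # Laplace noise as a difference of two independent exponentials.
        a = true_answer + rng.expovariate(1/lam) - rng.expovariate(1/lam)
        a = max(-1.0, min(1.0, a))           # clip so (a+1)/2 is a probability
        for _ in range(T):
            target = 1 if rng.random() < (a + 1) / 2 else -1
            D_hat.append(sample_row(s, target, d, rng))
    return D_hat

def evaluate_parity(D_hat, s, p):
    # The symmetric evaluator E_P: |P| times the empirical parity average.
    return p * sum(chi(s, x) for x in D_hat) / len(D_hat)
```

Each block tracks its own noisy answer while contributing zero in expectation to every other parity (Fact 5.11), so rescaling by |P| recovers each answer.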
5.4.2 Efficient Sanitizer for k-Juntas

We now show that our sanitizer for parity queries can be used to give accurate answers for any family of k-juntas, for constant k. We start with the observation that k-juntas only have Fourier coefficients of weight at most k. Alternatively, this says that any k-junta can be written as a linear combination of parity functions on at most k variables (in the language of our previous construction, parities χ_s such that wt(s) ≤ k). Thus we start by running our sanitizer for parity predicates on the set P_k containing all parity predicates on at most k variables. We have to modify the evaluation function to take into account that not every k-junta predicate has the same bias. Indeed, we cannot control χ_{0^d}(D̂), as χ_{0^d}(D) = 1 for any database. Thus our evaluator will apply an affine shift to the result that depends on the junta. Because the evaluator depends on the predicate, the resulting sanitizer no longer outputs symmetric relaxed synthetic data.

The use of a sanitizer for parity queries as a building block to construct a sanitizer for k-juntas is inspired by [5], which uses a noisy vector of answers to parity queries as a building block to construct synthetic data for a particular class of k-juntas (conjunctions on k literals). However, while their sanitizer constructs a standard synthetic database and is inefficient, our construction of symmetric relaxed synthetic data for parity predicates is efficient, and thus our eventual sanitizer for k-juntas will also be efficient.

Consider a predicate c : {0,1}^d → {0,1}. Then we can take the Fourier expansion

  c(x) = Σ_{s ∈ {0,1}^d} ĉ(s) χ_s(x),   where ĉ(s) = E_{x ←_R {0,1}^d}[ c(x) χ_s(x) ].

The accuracy of our sanitizer relies on the following fact about k-juntas.

Fact 5.16. If c : {0,1}^d → {0,1} is a k-junta, then ĉ(s) = 0 for every s ∈ {0,1}^d with wt(s) > k.

Our sanitizer for k-juntas is A(D, P_k), applied to the set P_k = { χ_s : s ∈ {0,1}^d, 1 ≤ wt(s) ≤ k }, and we define the evaluator that computes the answer to a k-junta c : {0,1}^d → {0,1} as:

  Evaluator E(D̂, c): return |P_k| · c(D̂) − (|P_k| − 1) · ĉ(∅)

Efficiency and privacy follow from the analysis of the sanitizer for parity predicates (Claim 5.14) applied to the set P_k.

Theorem 5.17. A(D, P_k) using the evaluator E is (α, β)-accurate for J_{d,k} when

  n ≥ (2|P_k| / αε) · log(2|P_k| / β).

Proof. Let c ∈ J_{d,k} be a fixed predicate. Assume that D̂ is α-accurate for P_k using the evaluator E_{P_k}; this event occurs with probability at least 1 − β by Theorem 5.10 and our assumption on n. We now analyze the quantity c(D̂):

  c(D̂) = (1/n̂) Σ_{i=1}^{n̂} c(x̂_i)
        = (1/n̂) Σ_{i=1}^{n̂} Σ_{s ∈ {0,1}^d} ĉ(s) χ_s(x̂_i)
        = Σ_{s ∈ {0,1}^d} ĉ(s) χ_s(D̂)
        = ĉ(∅) + Σ_{s : 1 ≤ wt(s) ≤ k} ĉ(s) χ_s(D̂)                                   (3)
        = ĉ(∅) + (1/|P_k|) Σ_{s : 1 ≤ wt(s) ≤ k} ĉ(s) χ_s(D) ± (α/|P_k|) Σ_{s : 1 ≤ wt(s) ≤ k} |ĉ(s)|   (4)
        = (1/|P_k|) ( (|P_k| − 1) ĉ(∅) + c(D) ) ± (α/|P_k|) Σ_{s : 1 ≤ wt(s) ≤ k} |ĉ(s)|,              (5)

where step (3) uses Fact 5.16, step (4) uses the fact that D̂ is α-accurate for P_k when evaluated by E_{P_k}, and step (5) uses the expansion c(D) = ĉ(∅) + Σ_{s : 1 ≤ wt(s) ≤ k} ĉ(s) χ_s(D). Rearranging,

  | E(D̂, c) − c(D) | = | |P_k| · c(D̂) − (|P_k| − 1) ĉ(∅) − c(D) | ≤ α · Σ_{s : 1 ≤ wt(s) ≤ k} |ĉ(s)|,

and Σ_s |ĉ(s)| ≤ 2^{k/2} for a {0,1}-valued k-junta, a constant factor for constant k. Thus D̂ = A(D) is α-accurate for J_{d,k} using E with probability at least 1 − β. □
Ac
kn
o
wl
ed
g
m
en
ts
c
(
using Ewith probability at least 1 .
b
D Cynthia Dwork, Vitaly Feldman, Oded Goldreic
We thank Boaz Barak, Irit Dinur,
)
astad, Valentine Kabanets, Dana
Moshkovitz, Anup Rao, Guy Rothblum, and Le
helpful conversations.
[1] ADAM, N. R., AND WORTMANN, J. Security-control methods for statistical databases: A comparative
study. ACM Computing Surveys 21 (1989), 515–556.
[2] ALEKHNOVICH, M., BRAVERMAN, M., FELDMAN, V., KLIVANS, A. R., AND PITASSI, T. The complexity of
properly learning simple concept classes. J. Comput. Syst. Sci. 74, 1 (2008), 16–34.
[3] ARORA, S., LUND, C., MOTWANI, R., SUDAN, M., AND SZEGEDY, M. Proof verification and the hardness of
approximation problems. J. ACM 45, 3 (1998), 501–555.
[4] BABAI, L., FORTNOW, L., LEVIN, L. A., AND SZEGEDY, M. Checking computations in polylogarithmic time.
In STOC (1991), pp. 21–31.
[5] BARAK, B., CHAUDHURI, K., DWORK, C., KALE, S., MCSHERRY, F., AND TALWAR, K. Privacy, accuracy, and
consistency too: A holistic solution to contingency table release. In Proceedings of the 26th ACM
SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (2007), pp. 273–282.
[6] BARAK, B., AND GOLDREICH, O. Universal arguments and their applications. SIAM J. Comput. 38, 5
(2008), 1661–1694.
[7] BEN-SASSON, E., GOLDREICH, O., HARSHA, P., SUDAN, M., AND VADHAN, S. P. Robust PCPs of proximity,
shorter PCPs, and applications to coding. SIAM J. Comput. 36 (2006), 889–974.
[8] BLUM, A., DWORK, C., MCSHERRY, F., AND NISSIM, K. Practical privacy: The SuLQ framework. In
Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
(June 2005).
[9] BLUM, A., LIGETT, K., AND ROTH, A. A learning theory approach to non-interactive database privacy. In
Proceedings of the 40th ACM SIGACT Symposium on Theory of Computing (2008).
[10] CREIGNOU, N. A dichotomy theorem for maximum generalized satisfiability problems. J. Comput.
Syst. Sci. 51 (1995), 511–522.
[11] DINUR, I., AND NISSIM, K. Revealing information while preserving privacy. In Proceedings of the
Twenty-Second ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (2003),
pp. 202–210.
[12] DINUR, I., AND REINGOLD, O. Assignment testers: Towards a combinatorial proof of the PCP Theorem.
SIAM J. Comput. 36 (2006), 975–1024.
[13] DUBHASHI, D. P., AND SEN, S. Concentration of measure for randomized algorithms: applications. In
Handbook of Randomized Algorithms, 2001.
[14] DUNCAN, G. International Encyclopedia of the Social and Behavioral Sciences. Elsevier, 2001, ch.
Confidentiality and statistical disclosure limitation.
[15] DWORK, C. A firm foundation for private data analysis. Communications of the ACM (to appear).
[16] DWORK, C. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata,
Languages and Programming (ICALP) (2) (2006), pp. 1–12.
[17] DWORK, C., MCSHERRY, F., NISSIM, K., AND SMITH, A. Calibrating noise to sensitivity in private data
analysis. In Proceedings of the 3rd Theory of Cryptography Conference (2006), pp. 265–284.
[18] DWORK, C., NAOR, M., REINGOLD, O., ROTHBLUM, G., AND VADHAN, S. When and how can
privacy-preserving data release be done efficiently? In Proceedings of the 2009 International ACM
Symposium on Theory of Computing (STOC) (2009).
[19] DWORK, C., AND NISSIM, K. Privacy-preserving datamining on vertically partitioned databases. In
Proceedings of CRYPTO 2004 (2004), vol. 3152, pp. 528–544.
[20] DWORK, C., ROTHBLUM, G., AND VADHAN, S. P. Boosting and differential privacy. In Proceedings of
FOCS 2010 (2010).
[21] EVFIMIEVSKI, A., AND GRANDISON, T. Encyclopedia of Database Technologies and Applications.
Information Science Reference, 2006, ch. Privacy Preserving Data Mining (a short survey).
[22] FELDMAN, V. Hardness of proper learning. In The Encyclopedia of Algorithms. Springer-Verlag, 2008.
[23] FELDMAN, V. Hardness of approximate two-level logic minimization and PAC learning with membership
queries. J. Comput. Syst. Sci. 75, 1 (2009), 13–26.
[24] GOLDREICH, O. Foundations of Cryptography, vol. 2. Cambridge University Press, 2004.
[25] HÅSTAD, J. Some optimal inapproximability results. J. ACM 48 (2001), 798–859.
[26] KEARNS, M. J., AND VALIANT, L. G. Cryptographic limitations on learning boolean formulae and finite
automata. J. ACM 41 (1994), 67–95.
[27] KHANNA, S., SUDAN, M., TREVISAN, L., AND WILLIAMSON, D. P. The approximability of constraint
satisfaction problems. SIAM J. Comput. 30 (2000), 1863–1920.
[28] KILIAN, J. A note on efficient zero-knowledge proofs and arguments (extended abstract). In STOC (1992).
[29] MICALI, S. Computationally sound proofs. SIAM J. Comput. 30 (2000), 1253–1298.
[30] NAOR, M., AND YUNG, M. Universal one-way hash functions and their cryptographic applications. In
STOC (1989), pp. 33–43.
[31] PAPADIMITRIOU, C. H., AND YANNAKAKIS, M. Optimization, approximation, and complexity classes.
J. Comput. Syst. Sci. 43 (1991), 425–440.
[32] PITT, L., AND VALIANT, L. G. Computational limitations on learning from examples. J. ACM 35 (1988),
965–984.
[33] REITER, J. P., AND DRECHSLER, J. Releasing multiply-imputed synthetic data generated in two stages to
protect confidentiality. IAB discussion paper, Institut für Arbeitsmarkt- und Berufsforschung (IAB),
Nürnberg (Institute for Employment Research, Nuremberg, Germany), 2007.
[34] ROMPEL, J. One-way functions are necessary and sufficient for secure signatures. In STOC (1990),
pp. 387–394.
[35] ROTH, A., AND ROUGHGARDEN, T. Interactive privacy via the median mechanism. In STOC 2010 (2010).
[36] VALIANT, L. G. A theory of the learnable. Communications of the ACM 27, 11 (1984), 1134–1142.