SplitX: High-Performance Private Analytics
Ruichuan Chen (Bell Labs / Alcatel-Lucent)
Istemi Ekin Akkus (MPI-SWS)
Paul Francis (MPI-SWS)
Data analytics is important



Evaluate system performance
Understand user behavior
Discover statistical patterns
Data exposure has become a major concern
Third-party trackers
Smartphone apps
User-owned and operated
Data exposure has to be brought under control!
User-owned and operated principle
Personal data should be stored in a local host under the user’s control.
Motivation and problem
[Figure: an analyst and several users, each holding their own data]
How to make aggregate queries over distributed private user data while still preserving user privacy?
Outline
Related work
SplitX system
  Key insights
  System design
  Performance comparison
  Implementation & deployment
Conclusion
A general approach
Based on differential privacy.
Differential privacy adds noise to the output of a computation (i.e., query).
[Figure: a database of user data feeds a query module, which adds noise before returning results to the analyst]
Hide the presence or absence of a user.
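To make "adds noise to the output" concrete, here is a minimal sketch that perturbs a count with Laplace noise. The Laplace mechanism is one standard way to obtain differential privacy; it is an assumption of this sketch and not something the slides specify. The name `noisy_count` and its parameters are hypothetical.

```python
import random


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return true_count plus Laplace noise of scale 1/epsilon.

    Assumption: this is a counting query, which changes by at most 1 when
    one user is added or removed, so the sensitivity is taken to be 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The difference of two i.i.d. exponentials with rate epsilon
    # (mean 1/epsilon) is Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


rng = random.Random(0)
samples = [noisy_count(100, 1.0, rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)  # close to 100: the noise has zero mean
```

A single released value hides whether any one user is present, while the average over many independent releases still concentrates near the true count.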
Previous systems
[Figure: analysts query a set of servers, which sit between the analysts and the users' data]
Servers aggregate answers without seeing individual user data.
Differentially private noise is added to the aggregate result.
Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
Primary technical problems
Scale poorly
Require public-key operations or something even more expensive.
Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
Suffer from answer pollution
Even a single malicious user can substantially distort the aggregate result through a single answer.
Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
SplitX
A high-performance private analytics system
2 to 3 orders of magnitude more efficient in bandwidth than previous systems
3 to 5 orders of magnitude more efficient in computation than previous systems
Resistant to answer pollution
Components & assumptions
Analysts are potentially malicious (violating user privacy).
Servers (1 aggregator and 2 mixes) are honest but curious:
1) Follow the specified protocol
2) Try to exploit additional info that can be learned in so doing
Clients are user devices.
Clients are potentially malicious (distorting the final results).
Key insights: XOR encryption
How to achieve high performance?
[Figure: the client generates a random string R and sends the two split messages, M ⊕ R and R, via Mix1 and Mix2; the aggregator XORs them to recreate M]
Client wants to send M to aggregator
Client splits M, and sends split messages to aggregator via mixes
Aggregator joins split messages to recreate M
For clarity
[Figure: client, Mix1, Mix2, and aggregator, with the message labeled M]
The label M denotes that the client sends two split messages of M to the aggregator via Mix1 and Mix2.
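The splitting and joining described above can be sketched in a few lines. This is a minimal illustration, assuming the message is a byte string and the random string R has the same length; the function names `split` and `join` and the sample message are hypothetical.

```python
import secrets
from typing import Tuple


def split(message: bytes) -> Tuple[bytes, bytes]:
    """Split a message into two shares: (M XOR R, R)."""
    r = secrets.token_bytes(len(message))  # fresh random string R
    masked = bytes(m ^ k for m, k in zip(message, r))
    return masked, r


def join(share1: bytes, share2: bytes) -> bytes:
    """XOR the two shares to recreate the original message."""
    if len(share1) != len(share2):
        raise ValueError("shares must have equal length")
    return bytes(a ^ b for a, b in zip(share1, share2))


m = b"sample message"
via_mix1, via_mix2 = split(m)         # one share travels through each mix
recovered = join(via_mix1, via_mix2)  # done at the aggregator
```

Each share on its own is a uniformly random string, so a single mix learns nothing about M. The only operations are random-byte generation and XOR, with no public-key cryptography involved.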
Key insights: query buckets

How to limit answer pollution?
Solution:
Ensure that a client cannot arbitrarily manipulate answers.
Divide the answer’s value range into buckets.
Enforce a binary answer in each bucket.
Key insights: query buckets
Query: “SELECT age FROM splitx”
4 buckets: 0~19, 20~39, 40~59, and ≥60.
Answers: a ‘1’ or ‘0’ per bucket.
30 years old → 0, 1, 0, 0
Answers encoded in a bit-vector.
An answer from a malicious client cannot substantially distort the query result!
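The bucket encoding from the age example can be sketched as follows. Representing the buckets by their lower bounds and the helper name `encode_answer` are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Sequence

# Lower bounds of the four buckets 0~19, 20~39, 40~59, and >=60.
AGE_BUCKET_LOWER_BOUNDS = [0, 20, 40, 60]


def encode_answer(value: int, lower_bounds: Sequence[int]) -> List[int]:
    """Return a bit-vector with a 1 in the bucket that contains value."""
    if value < lower_bounds[0]:
        raise ValueError("value is below the first bucket")
    index = bisect_right(lower_bounds, value) - 1
    return [1 if i == index else 0 for i in range(len(lower_bounds))]


answer = encode_answer(30, AGE_BUCKET_LOWER_BOUNDS)  # [0, 1, 0, 0]
```

Because every position holds a single bit, one answer can shift any bucket's count by at most 1, however the client sets the bits.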
System design
1) Query publish/subscribe


Analyst publishes its queries
Client subscribes to an analyst’s queries
2) Query answering




Client answers queries
Mixes add differentially private noise
Mixes shuffle answers
Aggregator generates query results
1) Query publish/subscribe
[Figure: client, Mix1, Mix2, aggregator, and analyst; messages shown: Analyst ID; Query1, Query2, …]
1) Query publish/subscribe
Query example: age distribution among male users?
QID: 123
SQL: SELECT age FROM splitx WHERE gender=‘male’
Buckets: 0~19, 20~39, 40~59, and ≥60
DP parameter (ε): 1.0
T_end: 11:59:59PM on Aug 16, 2013
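As an illustration only, the fields of this example query could be carried in a small record like the one below. The class name `Query`, the field names, and the reading of the end time as "answers are accepted until then" are assumptions, not the system's actual data format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical container for one published query."""
    qid: int
    sql: str
    bucket_lower_bounds: Tuple[int, ...]  # 0~19, 20~39, 40~59, >=60
    epsilon: float                        # DP parameter
    t_end: datetime                       # query end time

    def is_open(self, now: datetime) -> bool:
        # Assumption: a query can be answered up to and including t_end.
        return now <= self.t_end


example = Query(
    qid=123,
    sql="SELECT age FROM splitx WHERE gender='male'",
    bucket_lower_bounds=(0, 20, 40, 60),
    epsilon=1.0,
    t_end=datetime(2013, 8, 16, 23, 59, 59),
)
```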
2) Query answering
Step 1: client answers queries
Client executes the query over its local data and generates an answer:
‘1’ or ‘0’ per bucket
Encoded as a bit-vector
Step 1: client answers queries
Client splits its answer, and sends the split answers with the query ID to the two mixes, respectively.
[Figure: client, Mix1, Mix2, aggregator, and analyst; message shown: QID, answer]
Mix knows which query a client answered.
Privacy violation!
Step 2: mixes add DP noise
Mix1    Mix2
0100    1101
1110    1001
……      ……
0111    0101    (random bit-vectors added as noise)
……      ……
Each mix individually adds some random bit-vectors as the differentially private noise.
How many bit-vectors are needed?
c: # clients queried
ε: DP parameter
n random bit-vectors as noise (n is computed from c and ε)
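A minimal sketch of the noise step at one mix, assuming split answers are lists of bits. The number of noise vectors `n` is passed in as a plain parameter because its formula is not reproduced here; the value 3 below is arbitrary, and the function names are hypothetical.

```python
import secrets
from typing import List

BitVector = List[int]


def random_bit_vector(num_buckets: int) -> BitVector:
    """One uniformly random bit per bucket."""
    return [secrets.randbits(1) for _ in range(num_buckets)]


def add_noise(split_answers: List[BitVector], n: int, num_buckets: int) -> List[BitVector]:
    """Append n random bit-vectors to a mix's list of split answers."""
    noise = [random_bit_vector(num_buckets) for _ in range(n)]
    return list(split_answers) + noise


mix1_answers = [[0, 1, 0, 0], [1, 1, 1, 0]]          # c = 2 split answers
noisy = add_noise(mix1_answers, n=3, num_buckets=4)  # c + n = 5 entries
```

In this sketch, XORing two independently random bits gives a bit that is 1 with probability 1/2, so each pair of noise vectors joined downstream acts like a fair coin flip per bucket.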
Step 3: mixes shuffle split answers
Before shuffle      After shuffle
Mix1    Mix2        Mix1    Mix2
0100    1101        1110    1101
1110    1001        0111    1101
……      ……          ……      ……
0111    0101        0100    0001
……      ……          ……      ……
Each mix maintains c+n split answers.
Mixes shuffle the split answers for each column (i.e., bucket) in a synchronized way.
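One way to picture a synchronized per-column shuffle is for both mixes to derive the same permutations from a shared seed. That mechanism, the seed value, and the helper names are assumptions of this sketch; `random.Random` is used only for readability and is not cryptographically secure.

```python
import random
from typing import List

BitVector = List[int]


def shuffle_columns(split_answers: List[BitVector], shared_seed: int) -> List[BitVector]:
    """Permute each column (bucket) independently, driven by shared_seed."""
    if not split_answers:
        return []
    rng = random.Random(shared_seed)
    rows, buckets = len(split_answers), len(split_answers[0])
    shuffled = [[0] * buckets for _ in range(rows)]
    for col in range(buckets):
        order = list(range(rows))
        rng.shuffle(order)  # same seed gives the same order at both mixes
        for new_row, old_row in enumerate(order):
            shuffled[new_row][col] = split_answers[old_row][col]
    return shuffled


def joined_counts(a: List[BitVector], b: List[BitVector]) -> List[int]:
    """Per-bucket sums after XOR-joining matching positions."""
    return [sum(x[col] ^ y[col] for x, y in zip(a, b)) for col in range(len(a[0]))]


mix1 = [[0, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1]]
mix2 = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]
seed = 2013  # hypothetical value agreed on by the two mixes
s1 = shuffle_columns(mix1, seed)
s2 = shuffle_columns(mix2, seed)
```

Because both mixes apply identical permutations, matching bits still line up, so the per-bucket sums are unchanged while the link between bits of the same answer is broken.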
Mixes transmit shuffled answers
Each mix transmits the shuffled split answers to the aggregator.
[Figure: Mix1 and Mix2 each send c+n shuffled split answers to the aggregator]
Step 4: aggregator generates query result
Mix1     Mix2     Agg
1110  ⊕  1101  =  0011
0111  ⊕  1101  =  1010
……       ……       ……
0100  ⊕  0001  =  0101
……       ……       ……
Join each bit position in the two split answer arrays.
Sum up the values for each bucket.
Obtain the noisy count for each bucket.
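The join-and-sum step can be sketched as below, using the three sample rows visible on this slide as input. The function name `aggregate` is hypothetical.

```python
from typing import List

BitVector = List[int]


def aggregate(from_mix1: List[BitVector], from_mix2: List[BitVector]) -> List[int]:
    """XOR matching bit positions, then sum each bucket (column)."""
    if len(from_mix1) != len(from_mix2):
        raise ValueError("both mixes must send the same number of split answers")
    if not from_mix1:
        return []
    counts = [0] * len(from_mix1[0])
    for share1, share2 in zip(from_mix1, from_mix2):
        for bucket, (x, y) in enumerate(zip(share1, share2)):
            counts[bucket] += x ^ y  # joined bit for this position
    return counts


from_mix1 = [[1, 1, 1, 0], [0, 1, 1, 1], [0, 1, 0, 0]]
from_mix2 = [[1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 0, 1]]
noisy_counts = aggregate(from_mix1, from_mix2)  # [1, 1, 2, 2]
```

The joined rows are 0011, 1010, and 0101, which sum per bucket to 1, 1, 2, 2. These counts include the contribution of the noise vectors, which is why they are noisy.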
Privacy issue at the mixes
Client splits the answer, and sends the split answers with the query ID to the two mixes.
[Figure: client, Mix1, Mix2, aggregator, and analyst; message shown: QID, answer]
Mix knows which query a specific client answered!
Solution: double-splitting
[Figure: double-splitting message flow among client, Mix1, Mix2, aggregator, and analyst; message shown: QID, answer]
Duplicate answer detection
A client can answer a query many times!
How to detect and remove duplicate answers?
Triple-splitting is needed
Section 5 in the paper.
Computational overhead
[Figure: computational overhead of SplitX compared with PDDP [NSDI’12] and Akkus et al. [CCS’12]; “A” is the number of buckets that a client reports]
Three to five orders of magnitude more efficient in computation than previous systems
Implementation

Client side



Google Chrome extension
Capture webpages browsed, searches made, extensions installed
Server side (mix + aggregator)


Web services on Jetty
RPCs defined in Thrift language
Deployment

Query results from a 416-client deployment
Most visited websites: google, facebook, youtube
Most used apps: gmail, youtube, google drive
91% of clients made ≤50 searches / day
70% of clients visited >50 webpages / day
97% of clients visited ≤100 websites / day
Conclusion

SplitX: a high-performance private analytics system
Orders of magnitude more efficient than previous systems
Resistant to answer pollution
Key insights
XOR-based encryption
Query buckets