How to compile searching software so that it is impossible to reverse-engineer.

(Private Keyword Search on Streaming Data)
Rafail Ostrovsky
William Skeith
UCLA
http://www.cs.ucla.edu/~rafail/
(patent pending)
MOTIVATION: Problem 1.



Each hour, we wish to find out whether any of hundreds of passenger lists contains a name from the "Possible Terrorists" list and, if so, his/her itinerary.
The "Possible Terrorists" list is classified and must not be revealed to the airports.
Tantalizing question: can the airports help (and do all the search work) if they are not allowed to see the "possible terrorists" list?
[Diagram: Airports 1, 2, and 3 each hold a passenger list; mobile code (with state) is sent to every airport.]
PROBLEM 1: Is it possible to design mobile software that can be transmitted to all airports (including potentially revealing this software to the adversary due to leaks) so that this software collects ONLY the information needed, without revealing what it is collecting at each node?
Non-triviality requirement: it must send back only the needed information, not everything!
MOTIVATION: Problem 2.
Looking for malicious insiders and/or terrorist communication:
(I) First, we must identify some "signature" criteria (rules) for suspicious behavior; typically, this is done by analysts.
(II) Second, we must detect which nodes/stations transmit these signatures.
Here, we want to tackle part (II).
[Diagram: public networks.]
PROBLEM 2: Is it possible to design software that can capture all messages (and network locations) that match a secret/classified set of "rules"? Key challenge: the software must not reveal the secret "rules".
Non-triviality requirement: the software must send back only the locations and messages that match the given "rules", not everything it sees.
What we want
Various data streams, consisting of flows of documents/packets.
Search software that has a set of "rules" to choose which documents and/or packets to keep and which to toss.
Small storage that collects the selected documents and/or packets.
Our "compiler" outputs straight-line executable code (with program state) and a decryption key D.
[Diagram: various data streams (flows of documents/packets) feed STRAIGHT-LINE EXECUTABLE CODE THAT DOES NOT REVEAL THE SEARCH "RULES". The code keeps a small fixed-size program state, encrypted in a special way that the code modifies for each document processed; decrypting the state with D yields exactly the documents/packets that match the secret "rules".]
Punch line: we can send the executable code publicly. (It won't reveal its secrets!)
Current Practice
Continuously transfer all data to a secure environment.
After the data is transferred, filter it in the classified environment, keeping only a small fraction of the documents.
[Diagram: streams of documents D(i,j) are all shipped into the classified environment, filtered there, and only a few are kept in storage.]
The amount of data that must be transferred to a classified environment is enormous!
The filter rules are written by an analyst and are classified!
Current Practice
Drawbacks: communication, processing, cost, and timeliness.
How to improve performance? Distribute the work to many locations on a network, where you decide "on the fly" which data is useful.
Seemingly ideal solution, but…
Major problem: it is not clear how to maintain security, which is the focus of this technology.
[Diagram: filters at several network locations each process a stream D(i,1), D(i,2), D(i,3), …; only the encrypted matches E(D(i,j)) travel over the open network to the classified environment, where they are decrypted and stored.]

Example Filters:
Look for all documents that contain special classified keywords (or strings or data items, and/or do not contain some other data), selected by an analyst.
Privacy
We must hide which rules are used to create the filter.
The output must be encrypted.
More generally: we define the notion of Public Key Program Obfuscation, an encrypted version of a program that performs the same functionality as the un-obfuscated program, but:
produces encrypted output;
is impossible to reverse engineer.
A little more formally:

Public Key Program Obfuscation
We can compile any code into "obfuscated code with small storage".
Think of the compiler as a mapping:
Source code → "Smart Public-Key Encryption" with initial encrypted storage + a decryption key.
Non-triviality: the sizes of the compiled program, the encrypted storage, and the encrypted output are not much bigger than those of the uncompiled code.
Nothing about the program is revealed, given the compiled code + storage.
Yet someone who has the decryption key can recover the "original" output.
Related Notions
PIR (Private Information Retrieval) [CGKS], [KO], [CMS], …
Keyword PIR [KO], [CGN], [FIPR]
Cryptographic counters [KMO]
Program Obfuscation [BGIRSVY], …: there the output is identical to that of the un-obfuscated program, whereas in our case it is encrypted.
Public Key Program Obfuscation: a more general notion than PIR, with lots of applications.
What do we want?
Two requirements:
Correctness: only matching documents are saved, nothing else.
Efficiency: decoding takes time proportional to the length of the buffer, not the size of the entire stream.
[Diagram: the filter reads the stream D(1,1), D(1,2), D(1,3), … and stores only E(D(1,2)), E(D(1,3)).]
Conundrum: the compiled filter code is not allowed to have ANY branches (i.e., any "if-then-else" executables). Only straight-line code is allowed!
REMARK: Comparison of our work to [Bethencourt, Song, Waters 06]
[BSW-06]: buffer size to store m items is O(m log m); decoding time is proportional to the length of the entire stream.
[OS-05]: buffer size to store m items is O(m); decoding time is proportional to the buffer size.
NEXT: OUR CONSTRUCTION…
Simplifying Assumptions for this Talk
All keywords come from some poly-size dictionary.
Truncate documents beyond a certain length.
Sneak peek: the compiled code
Suppose we are looking for all documents that contain some secret word from the Webster dictionary.
Here is how it looks to the adversary: for each document, execute the same code, as follows. Look up the encryptions of all dictionary words appearing in the document and multiply them together. Take this value and apply a fixed formula to it to get a value g.
[Diagram: each dictionary word w1, …, wn is stored next to an encryption E(*); the result derived from g is multiplied into a small output buffer of entries (*,*,*).]
How should a solution look?
[Diagram: matching documents #1, #2, and #3 land in the buffer; non-matching documents are discarded.]
How do we accomplish this?
Reminder: PKE
Key-generation(1^k) → (PK, SK)
E(PK, m, r) → c
D(c, SK) → m
We will use PKE with additional properties.
Several Solutions based on Homomorphic Public-Key Encryption
For this talk: Paillier encryption.
Properties:
E(x) is probabilistic; in particular, it can encrypt a single bit in many different ways, such that any instance of E(0) and any instance of E(1) cannot be distinguished.
Homomorphic: E(x)*E(y) = E(x+y).
Using Paillier Encryption
E(x)E(y) = E(x+y)
Important to note:
E(0)^c = E(0)*…*E(0) = E(0+0+…+0) = E(0)
E(1)^c = E(1)*…*E(1) = E(1+1+…+1) = E(c)
Assume we can somehow compute an encrypted value v, where we don't know what v stands for, but v = E(0) for "un-interesting" documents and v = E(1) for "interesting" documents.
What is v^c? It is either E(0) or E(c), and we don't know which one it is.
[Diagram: dictionary words w1, …, wn are stored with E(1) for keywords and E(0) for all other words; the encryptions of the words appearing in document D are multiplied together.]
g ← E(0) * E(1) * E(0) * …
g = E(0) if there are no matching words.
g = E(c) if there are c matching words.
g^D = E(0) if there are no matching words.
g^D = E(c*D) if there are c matching words.
Thus: if we keep g = E(c) and g^D = E(c*D), we can recover D exactly.
The pair (g, g^D) is multiplied into the output buffer, whose entries initially hold only encryptions E(0).
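The bookkeeping above can be checked end to end with a toy Paillier implementation. The primes, the example dictionary, and the numeric document encoding below are our own illustrative choices (far too small to be secure); only the homomorphic identities g = E(c) and g^D = E(c*D) mirror the construction.

```python
import random
from math import gcd

# Toy Paillier (tiny primes, illustration only -- not secure).
p, q = 1117, 1231
n, n2 = p * q, (p * q) ** 2
g_gen = n + 1                 # standard simplified generator
lam = (p - 1) * (q - 1)
mu = pow(lam, -1, n)          # since g_gen = n+1, L(g_gen^lam mod n^2) = lam mod n

def encrypt(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g_gen, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Dictionary: E(1) next to the secret keywords, E(0) next to everything else.
secret = {"bomb"}
dictionary = {w: encrypt(int(w in secret)) for w in ["the", "bomb", "plane", "at"]}

def process(document_words, doc_value):
    # Straight-line code: identical operations whatever the document contains.
    g = 1
    for w in document_words:
        g = g * dictionary[w] % n2       # g = E(c), c = number of matching words
    gD = pow(g, doc_value, n2)           # g^D = E(c * D)
    return g, gD

# A matching document D (encoded as a small number) is recovered exactly:
g_ct, gD_ct = process(["the", "bomb", "at"], doc_value=4242)
c, cD = decrypt(g_ct), decrypt(gD_ct)
assert c == 1 and cD * pow(c, -1, n) % n == 4242

# A non-matching document contributes only E(0): nothing to recover.
g_ct, gD_ct = process(["the", "plane"], doc_value=4242)
assert decrypt(g_ct) == 0 and decrypt(gD_ct) == 0
```

Note how the adversary sees only ciphertext multiplications and exponentiations, never a branch on the document's contents.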
Collisions cause two problems:
1. Good documents are destroyed.
2. Non-existent documents could be fabricated.
[Diagram: matching documents #1, #2, and #3 already sit in the buffer as another matching document arrives and collides with them.]
We'll make use of two combinatorial lemmas…
Combinatorial Lemma 1
Claim: the color-survival game succeeds with probability > 1 - neg(γ).
How to detect collisions?
Idea: append a highly structured (yet random) short combinatorial object to the message, with the property that if 2 or more of them "collide", the combinatorial property is destroyed. Then we can always detect collisions!
Example (each block of a valid object has exactly one bit set; combining three colliding objects destroys this):
100|001|100|010|010|100|001|010|010
010|001|010|001|100|001|100|001|010
010|100|100|100|010|001|010|001|010
=
100|100|010|111|100|100|111|010|010
Blocks such as 111 can never occur in a single valid object, so the collision is detected.
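A toy plaintext version of this tag makes the detection rate concrete. The parameters and the bitwise mod-2 combination below are ours, chosen to match the slide's example; in the real scheme the tags are combined homomorphically under encryption.

```python
import random

# Each document appends K blocks of 3 bits, each block a random unit vector
# (exactly one bit set). Colliding tags combine bitwise mod 2; a block that
# is no longer a unit vector exposes the collision.

K = 9

def fresh_tag():
    return [random.choice([(1, 0, 0), (0, 1, 0), (0, 0, 1)]) for _ in range(K)]

def combine(a, b):
    return [tuple(x ^ y for x, y in zip(u, v)) for u, v in zip(a, b)]

def is_valid(tag):
    return all(sum(block) == 1 for block in tag)

assert is_valid(fresh_tag())                       # a lone tag always passes

# Any 2-way collision is caught: two unit vectors XOR to weight 0 or 2.
assert not is_valid(combine(fresh_tag(), fresh_tag()))

# A 3-way collision slips through only if every single block happens to XOR
# back to a unit vector; that probability vanishes exponentially in K.
three = combine(combine(fresh_tag(), fresh_tag()), fresh_tag())
print(is_valid(three))  # usually False
```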
Combinatorial Lemma 2
Claim: collisions are detected with
probability > 1 - exp(-k/3)
We do the same for all documents!
For every document in the stream do the same: look up the encryptions of all words appearing in the document and multiply them together (= g). Compute g^D and f(g), and multiply the triple (g, g^D, f(g)) into γ randomly chosen locations of the small output buffer.
[Diagram: dictionary words w1, …, wn with encryptions E(*); the triple (g, g^D, f(g)) lands in random buffer entries (*,*,*).]
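A plaintext simulation shows why writing each matching document into γ random buffer slots works: with a buffer a constant factor larger than the number m of matching documents, almost every document ends up alone in at least one slot. The parameters below are illustrative, not the paper's.

```python
import random

def surviving_fraction(m=20, gamma=5, buf_factor=16, trials=200):
    """Average fraction of matching documents recoverable from the buffer."""
    size = buf_factor * m
    total = 0.0
    for _ in range(trials):
        slots = [[] for _ in range(size)]
        for doc in range(m):
            # each matching document is added into gamma distinct random slots
            for loc in random.sample(range(size), gamma):
                slots[loc].append(doc)
        # a document is recovered if some slot holds it alone (no collision)
        alone = {s[0] for s in slots if len(s) == 1}
        total += len(alone) / m
    return total / trials

print(surviving_fraction())  # close to 1.0 for these parameters
```

Shrinking `buf_factor` or the number of copies `gamma` makes collisions dominate, which is exactly the overflow situation the next slides address.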
Detecting Overflow > m
Idea: double the buffer size from m to 2m.
If m < #documents < 2m, output "overflow".
If #documents > 2m, then the expected number of collisions is large, so we output "overflow" in this case as well.
Overflow: how to always collect at least m items
(with arbitrary overflow of matching documents)
Idea: create a logarithmic (in the stream size) number of original buffers.
The first buffer is processed for every stream item.
The second buffer takes every item in the stream with probability 1/2.
The third buffer takes every item with (independent) probability 1/4.
The i-th buffer processes items with independent probability 1/2^i.
Key point: if the number of matching documents is > M, at least one buffer will get O(M) matching documents!
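The cascade is easy to simulate in plaintext (sampling rates as stated above; the fixed per-buffer capacities are omitted here):

```python
import random

# Buffer i independently samples each stream item with probability 1/2^i, so
# whatever the number of matching documents, some buffer in the cascade sees
# on the order of its fixed capacity of them.

def cascade_counts(stream_len=10_000, num_buffers=14):
    counts = [0] * num_buffers
    for _ in range(stream_len):
        for i in range(num_buffers):
            if random.random() < 1 / 2 ** i:
                counts[i] += 1
    return counts

counts = cascade_counts()
print(counts)  # counts[0] == 10000; each later buffer sees roughly half as many
```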
More from the paper that we don't have time to discuss…
Reducing the program size below the dictionary size (using Φ-hiding from [CMS]).
Queries containing AND (using [BGN] machinery).
Eliminating the negligible error (using perfect hashing).
A scheme based on arbitrary homomorphic encryption.
Extending to words not from the dictionary (with small error probability).
Conclusions
We introduced private searching on streaming data.
More generally: public-key program obfuscation, more general than PIR or cryptographic counters.
Practical, efficient protocols.
Eat your cake and have it too: ensure that only "useful" documents are collected.
Many possible extensions and lots of open problems.
THANK YOU!