Private Information Retrieval Using Trusted Hardware

Shuhong Wang¹, Xuhua Ding¹, Robert H. Deng¹, and Feng Bao²
¹ School of Information Systems, SMU
{shuhongw,xhding,robertdeng}@smu.edu.sg
² Institute for Infocomm Research, Singapore
baofeng@i2r.a-star.edu.sg
Abstract. Many theoretical PIR (Private Information Retrieval) constructions have been proposed in past years. Though information-theoretically secure, most of them are impractical to deploy due to their prohibitively high communication and computation complexity. The recent trend of outsourcing databases fuels the research on practical PIR schemes. In this paper, we propose a new PIR system that makes use of trusted hardware. Our system is proven to be information-theoretically secure. Furthermore, we derive the computation complexity lower bound for hardware-based PIR schemes and show that our construction meets the lower bounds for both the communication and computation costs.
1 Introduction
Retrieval of sensitive data from databases or web services, such as patent databases, medical databases, and stock quotes, raises concerns about user privacy exposure. A database server or web server may be interested in garnering information about user profiles by examining users' database access activities. For example, a company's query on a patent from a patent database may imply that it is pursuing a related idea; an investor's query on a stock quote may indicate that he is planning to buy or sell the stock. In such cases, the server's ability to perform such inference is unfavorable to the users. Ideally, users' database retrieval patterns should not be leaked to any other parties, including the servers.
A PIR (Private Information Retrieval) scheme allows a user to retrieve a data item from a database without revealing any information about which item is retrieved. The earliest references to "query privacy" date back to Rivest et al. [19] and Feigenbaum [9]. The first formal notion of PIR was defined by Chor et al. [6]. In their formalization, a database is modelled as an n-bit string x = x1 x2 · · · xn, and a user is interested in retrieving one bit of x. With this formalization, many results have been produced in recent years. Depending on whether trusted hardware is employed or not, we classify PIR schemes into two categories: traditional PIR, which does not utilize any trusted hardware, and hardware-based PIR, which
employs trusted hardware in order to reduce communication and computation
complexities.
The major body of PIR work focuses on traditional PIR. Interested readers are referred to the survey [10] for a thorough review. A big challenge in PIR design is to minimize the communication complexity, which measures the number of bits transmitted between the user and the server(s) per query. A trivial PIR solution is for the server to return the entire database; therefore, the upper bound on communication complexity is O(n), while the lower bound is O(log n), since by all means the user has to provide an index. For single-server PIR with information-theoretic privacy, it is proven in [6] that the communication complexity is Ω(n), confirming that the trivial O(n) solution is asymptotically optimal in that setting. Two approaches are used to reduce the communication cost. One is to duplicate the database on different servers, with the assumption that the servers do not communicate with each other. Without assuming any limit on the servers' computation capability, PIR schemes with multiple database copies are able to offer information-theoretic security with lower communication cost. The best known result is [3] due to Beimel et al., with communication complexity O(n^{log log ω/(ω log ω)}), where ω is the number of database copies. The other approach still uses the single-server model but assumes that the server's computation capability is bounded. Schemes following this approach offer computational security with relatively low communication complexity. The best result to date is due to Lipmaa [16], where the user-side and server-side communication complexities are O(κ log² n) and O(κ log n) respectively, with κ being the security parameter of the underlying computationally hard problem.
Another key performance metric of PIR schemes is their computation complexity. All existing traditional PIR schemes require a high computation cost at the server(s) end. Beimel et al. [4] proved that the expected computation of the server(s) is Ω(n), where f(n) = Ω(g(n)) denotes an asymptotic lower bound, i.e., there exist a positive constant c and a positive integer n0 such that 0 ≤ c·g(n) ≤ f(n) for all n ≥ n0. This implies that any study of traditional PIR schemes can only improve the computation cost by a constant factor.
To the best of our knowledge, two hardware-based PIR constructions exist in the literature. The earlier scheme [20], due to Smith and Safford, is proposed solely to reduce the communication cost. On each query, a trusted hardware reads all the data items from an external database and returns the requested one to the user. The other hardware-based scheme [14, 15] is due to Iliev and Smith. The scheme facilitates an efficient online query process by offloading the heavy computation load offline. For each query, its online process costs O(1). Nonetheless, depending on the hardware's internal storage size, for every k (k ≪ n) queries the external database needs to be reshuffled with a computation cost of O(n log n). For convenience, we refer to the first scheme as SS01 and the latter as IS04. Both schemes have O(log n) communication complexity.
Our Contributions. The contributions of this paper are three-fold: (1) We present a new PIR scheme using the same trusted hardware model as in [14] and prove that it is secure in the information-theoretic sense; (2) Among all existing PIR constructions, our scheme achieves the best performance in all aspects: O(log n) communication complexity, O(1) online computation cost and O(n) offline computation cost; (3) We prove that our average computation complexity per query, O(n/k), is the lower bound for hardware-based PIR schemes using the same model, where k (k ≪ n) is the maximum number of data items stored by the trusted hardware.
2 Models and Definitions
Database Model and Its Permutation. We use π to denote a permutation of the n integers (1, 2, · · · , n). For 1 ≤ i ≤ n, the image of i under π is denoted by π(i). A database D is modelled as an array D = [d1, d2, · · · , dn], where di is the i-th data item in its original form, for 1 ≤ i ≤ n. A permuted D under π is denoted by Dπ and its i-th data record is denoted by Dπ[i], for 1 ≤ i ≤ n. The database D is permuted into Dπ by using π in such a way that the i-th element of Dπ is the π(i)-th element of D, i.e.

Dπ[i] = dπ(i).    (1)

In other words, Dπ = π^{-1}(D). To protect the secrecy of π, the permutation is always coupled with encryption operations. In the rest of the paper, we use Dπ[i] ≃ dπ(i) to denote that Dπ[i] is the ciphertext of dπ(i). To illustrate the idea, a trivial example is as follows. Let π = (1324), which means π(1) = 3, π(2) = 4, π(3) = 2, π(4) = 1. Then for D = [d1, d2, d3, d4], we have Dπ ≃ [d3, d4, d2, d1]. To distinguish the entries of the original plaintext database from those of the permuted and encrypted database, we use the convention throughout the paper that data items refer to entries of D and data records refer to entries of Dπ.
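To make the indexing convention concrete, the short Python sketch below is an illustration only (0-based indices, encryption omitted); it builds Dπ from D and π and checks that item di sits at record position π^{-1}(i):

    def permute_database(D, pi):
        # D_pi[i] = D[pi[i]]  (0-based rendering of Eq. (1); encryption omitted)
        return [D[pi[i]] for i in range(len(D))]

    def invert(pi):
        # pi_inv[pi[i]] = i, so item i is stored at record position pi_inv[i]
        pi_inv = [0] * len(pi)
        for i, image in enumerate(pi):
            pi_inv[image] = i
        return pi_inv

    # The example from the text, shifted to 0-based indices:
    D = ["d1", "d2", "d3", "d4"]
    pi = [2, 3, 1, 0]                 # pi(1)=3, pi(2)=4, pi(3)=2, pi(4)=1 in 1-based terms
    D_pi = permute_database(D, pi)    # ["d3", "d4", "d2", "d1"], matching the example
    pi_inv = invert(pi)
    assert D_pi[pi_inv[0]] == "d1"    # item d1 is found at record position pi^{-1}(1)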
Architecture. As shown in Figure 1 below, our hardware-based PIR scheme comprises three types of entities: a group of users, a server, and a trusted hardware denoted by TH. The server hosts a permuted and encrypted version of a database D = [d1, · · · , dn], which consists of n items of equal length². TH is a secure and tamper-resistant device residing on the server. With limited computation power and storage cache, TH shuffles the original database D into a database Dπ based on a permutation π; it remembers the permutation π and answers users' queries.
Each user interacts with TH via a secure channel, e.g. an SSL connection. When a user wants to retrieve the i-th data item of D, she sends a query q to TH through the channel. On receiving q, TH accesses the permuted database Dπ and retrieves the intended item di. Throughout the paper, by writing "q = i" we mean that the query q requests the i-th item of D. By "the index of a (data) item" we refer to its index in the original database D; by "the index of a (data) record" we refer to its index in the shuffled database in which the record is located.
² If necessary, data items of different lengths are padded to equal length. Unlike the bit model in [6], we adopt a block model.
Fig. 1. Hardware-based PIR Model
Trusted Hardware. TH is trusted in the sense that it honestly executes the PIR protocol. Neither outside adversaries nor the server is able to tamper with its execution or access its private space. Its limited private cache is able to store up to k (k ≪ n) data items along with their indices. The indices of the cached items are managed by TH using a list denoted by Γ; in other words, Γ stores the original indices. We assume TH is capable of performing CPA-secure (i.e., secure under chosen-plaintext attacks) symmetric-key encryption/decryption and of generating random numbers and secret keys.
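As a purely illustrative instantiation of these capabilities (an assumption for concreteness, not part of the scheme's specification), TH could use an authenticated symmetric cipher and its own randomness source. The sketch below relies on the third-party Python package cryptography and uses AES-GCM with a fresh random nonce per record; any CPA-secure symmetric cipher would serve equally well:

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    class TrustedHardwareCrypto:
        # Illustrative only: CPA-secure (indeed authenticated) symmetric encryption
        # plus random key generation, as assumed of TH in the text.

        def __init__(self):
            self.key = AESGCM.generate_key(bit_length=128)    # session key sk_s

        def rekey(self):
            # A fresh key is drawn for every session (see Section 3).
            self.key = AESGCM.generate_key(bit_length=128)

        def encrypt_record(self, item: bytes) -> bytes:
            nonce = os.urandom(12)                            # random nonce per record
            return nonce + AESGCM(self.key).encrypt(nonce, item, None)

        def decrypt_record(self, record: bytes) -> bytes:
            nonce, ciphertext = record[:12], record[12:]
            return AESGCM(self.key).decrypt(nonce, ciphertext, None)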
Access Pattern. As in [11], the access pattern for a time period is defined as A = (a1, · · · , aN), where ai is the data record read in the i-th database access, for i ∈ [1, N]. We observe that a record in the access pattern is essentially a probabilistic result of both the current query and the query history; in fact, the latter determines the current state of TH and the database.
Adversary. We consider adversaries who attempt to derive non-trivial information from users' queries. Possible adversaries include both outside attackers and the server; note that we do not place any trust in the server. The adversary is able to monitor all the input and output of TH. Moreover, the adversary is allowed to query the database at will and receives the replies from TH.
Stained Query and Clean Query. A query is stained if its content, e.g. the index of the requested data, is known to the adversary without observing the access pattern. This may occur in several scenarios: for instance, a query is compromised or revealed accidentally, or the query originates from the adversary herself. On the other hand, a query is clean if the adversary does not know its content before observing the access pattern.
Security Model. Our security model follows the security notion of ORAM [11]. We measure the information leakage from PIR query executions. A secure PIR scheme ensures that the adversary gains no additional information for determining the distribution of queries. Formally, we define it as follows.
Definition 1. A hardware-based PIR scheme is secure if, given a clean query q and an access pattern A, the conditional probability of the event that the query is on index j (i.e., q = j) is the same as its a-priori probability, i.e.

Pr(q = j | A) = Pr(q = j), for all j ∈ [1, n].

The equation implies that the access pattern A reveals to the adversary no information about the target query q's content.
Table 1 below highlights the notations used throughout this paper.
Table 1. Notations

Notation       Description
k              The maximum number of data items cached in TH.
D              The original database in the form of (d1, d2, · · · , dn).
π0, π1, · · ·   A sequence of secret random permutations of the n elements {1, 2, · · · , n}.
Dπs            A permuted database of D using permutation πs such that Dπs[j] ≃ dπs(j), for 1 ≤ j ≤ n, where Dπs[j] denotes the j-th record in Dπs.
ai             The data record retrieved by TH during its i-th access to a shuffled database.
A              The access pattern comprising all the retrieved records (a1, · · · , aN) during a fixed time period.
As             The access pattern comprising all the retrieved records during the s-th session, as defined in Section 3.
Γ              The list of (original) indices of all data items stored in TH's cache.
3 The PIR Scheme
System setup
We consider applications where a trusted third party (TTP) is available to initialize the system. This TTP is involved only in the initialization phase and stays offline afterwards. For scenarios where a TTP is not available, an alternative solution is provided in Section 6.
The TTP secretly selects a random permutation π0 and a secret key sk0. It permutes the original database D into Dπ0, encrypted under sk0, such that Dπ0[j] ≃ dπ0(j) for j ∈ [1, n]. Dπ0 is then delivered to the server. The TTP secretly hands π0 and sk0 to TH, which completes the system initialization.
The outline of our PIR scheme is as follows. Every k consecutive query executions are called a session. For the s-th session, s ≥ 0, let πs, Dπs and sks denote the permutation, the database, and the encryption key, respectively. On receiving a query from the user, TH retrieves a data record from Dπs, decrypts it with sks to get the data item, and stores the item in its private cache. Then TH replies to the user with the desired data item. The detailed operations for data retrieval are shown in Algorithm 1. After k queries are executed, TH generates a new random permutation πs+1 and an encryption key sks+1. It reshuffles Dπs into Dπs+1 by employing πs+1 and sks+1. Note that in the newly shuffled database Dπs+1, all data items are encrypted under the secret key sks+1. The details of the database reshuffle are given in Algorithm 2.
The original database D is not involved in any database retrieval operations. Since TH always performs a decryption for every read operation and an encryption for every write operation, we omit them in the algorithm descriptions in order to keep the presentation compact and concise.
Retrieval Query Process
The basic idea of our retrieval algorithm is the following. TH always reads a different record on every query, and every record is accessed at most once. Thus, if the database is well permuted (in the sense of an oblivious permutation), all database accesses within a session appear random to the adversary.
Without loss of generality, suppose that the user intends to retrieve di from D during the s-th session (s ≥ 0). On receiving the query for index i, TH performs the following: if the requested di is not in TH's cache, it locates the corresponding record in the permuted database Dπs by computing the record index as πs^{-1}(i); if the requested item already resides in TH, it reads from Dπs a random record which has not been accessed before³. The algorithm is elaborated in Figure 2 below.
Algorithm 1: Retrieving record di using Dπs
1. TH decrypts the query and gets the requested index i.
2. If i ∉ Γ
3.    TH reads the πs^{-1}(i)-th record of Dπs and stores the item di in the cache;
4.    Γ = Γ ∪ {i};
5. Else
6.    TH selects a random index j, j ∈R {1, · · · , n} \ Γ;
7.    TH reads the πs^{-1}(j)-th record of Dπs and stores the item dj in the cache;
8.    Γ = Γ ∪ {j};
9. TH returns di to the user.

Fig. 2. Retrieval Query Processing Algorithm
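The following Python sketch mirrors Algorithm 1 under simplifying assumptions: indices are 0-based, pi_inv stands for πs^{-1}, Γ is kept as a set, and the per-record decryption is folded into the hypothetical helper read_record. It is a sketch, not the scheme's reference implementation:

    import secrets

    def retrieve(i, cache, Gamma, pi_inv, read_record, n):
        # Sketch of Algorithm 1 (0-based): answer a query for item d_i in session s.
        #   cache  : dict, original index -> plaintext item held inside TH
        #   Gamma  : set of original indices currently cached (the list Γ)
        #   pi_inv : callable realizing πs^{-1}
        #   read_record(r) : read and decrypt record r of D_πs (hypothetical helper)
        if i not in Gamma:
            # d_i is not cached: fetch the record that holds it.
            cache[i] = read_record(pi_inv(i))
            Gamma.add(i)
        else:
            # d_i is already cached: read a random, not-yet-accessed record instead,
            # so the server still observes exactly one fresh access.  (Both branches
            # should be made to take the same time; see the footnote below.)
            j = secrets.choice([x for x in range(n) if x not in Gamma])
            cache[j] = read_record(pi_inv(j))
            Gamma.add(j)
        return cache[i]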
Access Pattern The access pattern As produced by Algorithm 1 is a
sequence of data records which are retrieved from Dπs during the s-th session. It
is clear from Figure 2 that on each query, TH reads exactly one data record from
Dπs . Therefore, when the s-th session terminates, As has exactly k records.
³ Both the "if" and "else" branches should be implemented so that they take the same time, in order to resist timing side-channel attacks. The same requirement applies to the analogous situation in the reshuffle algorithm below.
Reshuffle Process
After k retrievals, TH's private cache reaches its limit, which demands a reshuffle of the database with a new permutation. Note that simply using cache substitution introduces a risk of privacy exposure: when a discarded item is requested again, the adversary learns that a data record is retrieved more than once by TH from the same location. Therefore, a reshuffle procedure must be executed at the end of each session.
TH first secretly chooses a new random permutation πs+1. The expected database Dπs+1 satisfies Dπs+1[j] ≃ dπs+1(j), j ∈ [1, n]. The correlation between Dπs and Dπs+1 is

Dπs+1[j] ≃ Dπs[πs^{-1} ◦ πs+1(j)],    (2)

for 1 ≤ j ≤ n, where πs^{-1} ◦ πs+1(j) means πs^{-1}(πs+1(j)).
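Equation (2) can be checked mechanically. The short sketch below (0-based indices, encryption ignored) verifies the correlation for randomly drawn permutations:

    import random

    n = 8
    D = [f"d{i}" for i in range(n)]
    pi_s  = random.sample(range(n), n)             # πs
    pi_s1 = random.sample(range(n), n)             # πs+1

    D_s  = [D[pi_s[j]]  for j in range(n)]         # D_πs[j]   holds d_πs(j)
    D_s1 = [D[pi_s1[j]] for j in range(n)]         # D_πs+1[j] holds d_πs+1(j)

    inv_s = {pi_s[j]: j for j in range(n)}         # πs^{-1}
    for j in range(n):
        # Eq. (2): D_πs+1[j] corresponds to D_πs[πs^{-1}(πs+1(j))]
        assert D_s1[j] == D_s[inv_s[pi_s1[j]]]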
The basic idea of our reshuffle algorithm is as follows. We sort the items in TH's cache in ascending order of their new positions in Dπs+1. We observe that the un-cached items of Dπs are also logically sorted in ascending order of their new indices, because the database supports index-based record retrieval. The reshuffle process is then similar to a merge-sort of the sorted cached items and the un-cached items. TH plays two roles: (1) participating in the merge-sort to initialize Dπs+1; (2) obfuscating the read/write pattern to protect the secrecy of πs+1.
TH first sorts the indices in Γ in ascending order of their images under πs+1^{-1}. It assigns the database Dπs+1 sequentially, starting from Dπs+1[1]. For the first n − k assignments, TH always performs one read operation and one write operation; for the remaining k assignments, TH performs one write operation only. The initialization of Dπs+1[j], j ∈ [1, n], falls into one of the following two cases, depending on whether its corresponding item is in the cache or not.
Case (i). The corresponding item is not cached (i.e., πs+1(j) ∉ Γ): TH reads it (i.e., the record Dπs[πs^{-1} ◦ πs+1(j)]) from Dπs and writes it to Dπs+1 as Dπs+1[j].
Case (ii). The corresponding item is in the cache (i.e., πs+1(j) ∈ Γ): Before retrieving the item (the plaintext of Dπs[πs^{-1} ◦ πs+1(j)]) from the cache and writing it into Dπs+1 as Dπs+1[j], TH also performs a read operation, for two purposes: (a) to exhibit the same reading pattern as in Case (i), so that the secrecy of πs+1 is protected; (b) to save the cost of future reads. Moreover, instead of randomly reading a record from Dπs, TH looks for the smallest new index which has not been initialized and falls into Case (i), and retrieves the corresponding data record from Dπs. Since Γ is sorted, this searching process costs k comparisons in total for the entire reshuffle. The benefit of this approach, coupled with a sorted Γ, is that both the cost of testing whether πs+1(j) ∈ Γ and the cost of retrieving the item from the cache are O(1); otherwise, both cost O(k) per item and O(nk) in total.
The details of the reshuffle algorithm are shown in Figure 3, where min denotes the head of the sorted Γ. We use sortdel(i)/sortins(i) to denote the sorted deletion/insertion of index i from/to Γ and the subsequent adjustments.
Fig. 3. Database Reshuffle Algorithm

Algorithm 2: Reshuffle Dπs into Dπs+1, executed by TH
1. secretly select a new random permutation πs+1;
2. sort the indices in Γ based on the order of their images under πs+1^{-1}; set j = 1; j' = 1.
3. while 1 ≤ j ≤ n − k do
4.    while πs+1(j') ∈ Γ do j' = j' + 1 end;
5.    set r = πs^{-1} ◦ πs+1(j'); read Dπs[r] from Dπs;
6.    if j = j'                    /* Case (i): πs+1(j) ∉ Γ */
7.       write Dπs+1 by setting Dπs+1[j] ≃ Dπs[r];
8.    else                         /* Case (ii): πs+1(j) ∈ Γ */
9.       write Dπs+1 by setting Dπs+1[j] ≃ dmin;   (a)
10.      sortdel(j); insert Dπs[r] into the cache and sortins(j');
11.   j = j + 1; j' = j' + 1;
12. end{while};
13. while n − k + 1 ≤ j ≤ n do
14.    set Dπs+1[j] ≃ dmin; sortdel(j);
15.    j = j + 1;
16. end

(a) dmin is exactly Dπs[πs^{-1} ◦ πs+1(j)], since Dπs+1 is filled in increasing order of j.
The reshuffle algorithm is secure and efficient. An intuitive explanation of its security is as follows. After a reshuffle, the new database is reset to its initial status. If an item has been accessed in the previous session, it is placed at a random position by the reshuffle. The other items are indeed relatively linkable between sessions; however, the linkage gives the adversary no advantage since those items have not been accessed at all. Note that the addition in the inner loop is executed at most n − 1 times in total, since j' never decreases. Because Γ is a sorted list and the inserted and deleted indices arrive in ascending order, each insertion and deletion has constant cost; at most n comparisons are needed in total for the whole execution. Therefore, the overall computation complexity of Algorithm 2 is O(n).
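For concreteness, the Python sketch below mirrors the merge step of Algorithm 2 under the same simplifying assumptions as before: 0-based indices, Γ kept as a set of original indices, and decryption under sks together with re-encryption under sks+1 folded into the hypothetical helpers read_record and write_record. It is illustrative only:

    def reshuffle(cache, Gamma, pi_s_inv, pi_s1, read_record, write_record, n, k):
        # Sketch of Algorithm 2 (0-based): build D_πs+1 from D_πs and the k cached
        # items.  pi_s_inv and pi_s1 are callables for πs^{-1} and πs+1.
        jp = 0                                  # j': next position of D_πs+1 whose record is read
        for j in range(n - k):                  # first n-k writes: one read + one write each
            while pi_s1(jp) in Gamma:           # skip positions whose item is already cached
                jp += 1
            item = read_record(pi_s_inv(pi_s1(jp)))   # record of D_πs holding d_πs+1(j')
            if jp == j:                         # Case (i): πs+1(j) not cached
                write_record(j, item)
            else:                               # Case (ii): πs+1(j) is cached
                write_record(j, cache.pop(pi_s1(j)))  # write d_min from the cache
                cache[pi_s1(jp)] = item         # keep the freshly read item for later
                Gamma.discard(pi_s1(j))
                Gamma.add(pi_s1(jp))
            jp += 1
        for j in range(n - k, n):               # last k writes come straight from the cache
            write_record(j, cache.pop(pi_s1(j)))
        Gamma.clear()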
Reshuffle Pattern. The access pattern produced by Algorithm 2 is denoted by Rs. Since TH reads only n − k data records, Rs has exactly n − k elements. Note that the writing pattern is omitted because it follows a fixed order, i.e., sequential writes from position 1 to position n.
4 Security
We now proceed to analyze the security of our scheme based on the notion defined in Section 2. Our proof essentially goes as follows. Lemma 1 proves that the reshuffle procedure is oblivious, in the same sense as in Oblivious RAM [11]. Thus, after each reshuffle, the database is reset to its initial state, so that the accesses in different sessions are not correlated. Then, in Theorem 1, we show that each individual query session does not leak information about the query, which leads to the conclusion on user privacy across all sessions.
Lemma 1. The reshuffle algorithm in Figure 3 is oblivious. For any non-negative integer s and any integer j ∈ [1, n],

Pr(Dπs[j] = dl | A0, R0, · · · , As−1, Rs−1) = 1/n,    (3)

for all l ∈ [1, n], where Ai and Ri, i ∈ [0, s − 1], are the access pattern and the reshuffle pattern of the i-th session, respectively.
Proof. We prove Lemma 1 for a fixed j by induction on the session index s; the proof applies to all j ∈ [1, n].
I. s = 0. Since Dπ0, the initial shuffled database, is constructed in advance under a secret random permutation π0, the probability Pr(Dπ0[j] = dl | ∅) = 1/n holds for all 1 ≤ l ≤ n.
II. Suppose the lemma holds for s = i, that is, Pr(Dπi[j] = dl | A0, R0, · · · , Ai−1, Ri−1) = 1/n. We proceed to prove that it holds for s = i + 1, i.e.

Pr(Dπi+1[j] = dl | A0, R0, · · · , Ai, Ri) = 1/n,    (4)

for all l ∈ [1, n].
By conditional probability,

Pr(Dπi+1[j] = dl | A0, R0, · · · , Ai, Ri)
  = Σ_{x=1}^{n} Pr(Dπi+1[j] = Dπi[x] | A0, R0, · · · , Ai, Ri) · Pr(Dπi[x] = dl | A0, R0, · · · , Ai, Ri).    (5)

For clarity, we define px and qx by

px = Pr(Dπi+1[j] = Dπi[x] | A0, R0, · · · , Ai, Ri)  and  qx = Pr(Dπi[x] = dl | A0, R0, · · · , Ai, Ri).

The objective now is to prove

Σ_{x=1}^{n} px qx = 1/n.    (6)
Define X = {x | x ∈ [1, n], πi(x) ∈ Γ} and Y = {x | x ∈ [1, n], πi(x) ∉ Γ}. Note that |X| = |Γ| = k, |Y| = n − k and X ∪ Y = [1, n]. Then

Σ_{x=1}^{n} px qx = Σ_{x∈X} px qx + Σ_{x∈Y} px qx.    (7)
We observe that the adversary has different projections of the new indices of the records in Dπi. For the items in TH's cache, i.e. those whose indices in Dπi are in X, the adversary obtains no information about their positions in Dπi+1. On the other hand, for the other items, i.e. those whose indices in Dπi are in Y, the adversary is certain that they cannot be placed at positions which were initialized before their retrieval from Dπi. Moreover, the item retrieved in the first reshuffle read will appear in one of the first k + 1 positions of Dπi+1. Therefore, for x ∈ X,

px = Pr(πi+1(j) = πi(x)) = 1/n.    (8)

This does not hold for px with x ∈ Y. Consequently,

Σ_{x∈Y} px = 1 − Σ_{x∈X} px = (n − k)/n.    (9)
The computation of qx is related to the stained queries. Recall that our adversary is allowed to probe the protocol by sending queries and receiving replies; therefore, queries handled by TH could be stained. Let C denote the set of stained queries in the i-th session. We have two cases for 1 ≤ l ≤ n:

Case (1) l ∈ C: Σ_{x∈X} qx = 1, because from the adversary's perspective there exists one and only one item in the cache which corresponds to a query on dl. For x ∈ Y, qx = 0 because none matches. Thus

Σ_{x=1}^{n} px qx = 1/n + 0 = 1/n, for l ∈ C.    (10)

Case (2) l ∉ C: Suppose Σ_{x∈X} qx = δ for some 0 ≤ δ ≤ 1; then Σ_{x∈Y} qx = 1 − δ. Note that the execution of queries does not affect the adversary's observation of the records that are not cached, since they are not accessed. Therefore, by the induction assumption Pr(Dπi[x] = dl | A0, R0, · · · , Ai−1, Ri−1) = 1/n for all l ∈ [1, n], we have qx = (1 − δ)/(n − k) for x ∈ Y, due to equiprobability⁴. Hence,

Σ_{x=1}^{n} px qx = δ/n + (1 − δ)/n = 1/n.    (11)
Combining Case (1) and Case (2), we have

Σ_{x=1}^{n} px qx = 1/n, for all 1 ≤ l ≤ n,

which concludes the proof.    □
Lemma 1 implies that the reshuffle procedure resets the observed distribution of the data items. Therefore, the events occurring in separate sessions are independent of each other. Theorem 1 below addresses the security of the proposed PIR scheme.
⁴ Hint: otherwise, for x0, x1 ∈ Y, qx0 ≠ qx1 would imply Pr(Dπi[x0] = dl) ≠ Pr(Dπi[x1] = dl) for some l ∉ C. The claim is also provable from the adversary's inability to distinguish two different permutations that are identical on the indices in C.
Theorem 1. Given a time period, the observation of the access pattern A = (a1, a2, · · · , aN), N > 0, provides the adversary with no additional knowledge for determining any clean query q, i.e. for all j ∈ [1, n],

Pr(q = j | A) = Pr(q = j),    (12)

where Pr(q = j) is the a-priori probability of query q being on index j.
Proof. For 1 ≤ t ≤ N, let Pr(at | a1, · · · , at−1) denote the probability of the event that data record at is accessed immediately after the access of the first t − 1 records (for t = 1 the conditioning is empty). Let Pr(at | a1, · · · , at−1; q = j) denote the probability of the same event given the additional knowledge that the requested index of query q is j. Note that we do not assume any temporal order between the query q and the t-th access. We proceed to show that

Pr(at | a1, · · · , at−1) = Pr(at | a1, · · · , at−1; q = j).    (13)
Without loss of generality, suppose at is read from Dπs during the s-th session. Consider the following two cases:
1. at ∈ Rs, i.e. at is accessed during a reshuffle process: obviously Pr(at | a1, · · · , at−1) = Pr(at | a1, · · · , at−1; q = j), since the access to at is completely determined by the permutations πs and πs+1.
2. at ∈ As, i.e. at is accessed during a query process: let this query be the l-th query of the session, l ∈ [1, k]. Therefore, l − 1 data items are cached by TH before at is read. We consider two scenarios based on Algorithm 1:
(a) The requested data item is cached in TH: at is randomly chosen from among the records whose items are not cached in TH. Therefore, Pr(at | a1, · · · , at−1) = 1/(n − (l − 1)).
(b) The requested data item is not cached in TH: at is retrieved from Dπs based on the permutation πs. According to Lemma 1, the probability that at is selected is 1/(n − (l − 1)).
Note that the compromise of a query, i.e. knowing q = j, possibly helps an adversary to determine whether at falls under case (2a) or (2b). Nonetheless, this information does not change Pr(at | a1, · · · , at−1), since its value is 1/(n − (l − 1)) in both cases. Thus, Pr(at | a1, · · · , at−1) = Pr(at | a1, · · · , at−1; q = j) when at ∈ As.
In total, we conclude that for any t and any q = j,

Pr(at | a1, · · · , at−1) = Pr(at | a1, · · · , at−1; q = j).

As a result,

Pr(A | q = j) = Pr(a1, · · · , aN | q = j)
             = Pr(aN | a1, · · · , aN−1; q = j) · Pr(a1, · · · , aN−1 | q = j)
             = Π_{t=1}^{N} Pr(at | a1, · · · , at−1; q = j)
             = Π_{t=1}^{N} Pr(at | a1, · · · , at−1)
             = Pr(A).    (14)

Then,

Pr(q = j | A) = Pr(q = j, A)/Pr(A) = Pr(A | q = j) · Pr(q = j)/Pr(A) = Pr(q = j).

The result shows that, given the access pattern, the a-posteriori probability of a query equals its a-priori probability, which concludes the security proof of our PIR scheme.    □
5 Performance
We proceed to analyze the communication and computation complexity of our PIR scheme, evaluated with respect to the database size n. Both complexities of our scheme reach the respective lower bounds for hardware-based PIR schemes.
Communication: We consider the user/system communication cost per query. In our scheme, the user only sends the index of the desired data item, and TH returns exactly one data item. Therefore, the communication complexity per query is O(log n). Note that O(log n) is the lower bound on the communication cost of any PIR construction.
Computation: For simplicity, each read, write, encryption, or decryption of a data item is treated as one operation. The computation cost is measured by the average number of operations per session and per query. We also measure the online cost, which excludes the expense of the offline reshuffle operations.
As is evident from Figures 2 and 3, it costs the trusted hardware O(1) operations to process a query and O(n) operations to reshuffle the database. Table 2 compares the computation cost of our scheme against [20] and [14, 15].
Table 2. Comparison of the Computation Cost of Three Hardware-based PIR Schemes

Scheme          Total cost per session (k queries)   Online cost per query   Average cost per query
Our scheme      O(n)                                 O(1)                    O(n/k)
IS04 [14, 15]   O(n log n)                           O(1)                    O((n/k) log n)
SS01 [20]       O(kn)                                O(n)                    O(n)

Our scheme outperforms the other two hardware-based PIR schemes in all three metrics. The advantage originates from our reshuffle algorithm, which utilizes the hardware's cache in a more efficient manner. Moreover, we prove that the average cost per retrieval of our scheme reaches the lower bound for all information-theoretically secure PIR schemes with the same trusted hardware system model. Our result is summarized in the following theorem.
Theorem 2. For any information-theoretically secure PIR scheme with a trusted hardware storing at most k data items, the average computation cost per retrieval is Ω(n/k).
Proof. Our proof is constructed in a manner similar to the proof in [4], which shows the computational lower bound for traditional PIR schemes.
Fix a PIR scheme and let Bi denote the set of all indices that the hardware reads in order to return di. Bi is essentially a random variable probabilistically determined by both the user query on di and the items that the hardware already stores inside. Consequently, E(|Bi|) denotes the expected number of data items read by the hardware to process a query on index i. We evaluate E(|Bi|) as the computation cost of the PIR scheme.
For 1 ≤ l ≤ n, let Pr(l ∈ Bi) be the probability that the hardware reads dl in order to answer the query on index i. We define P(l) as the maximum of these probabilities over all 1 ≤ i ≤ n, i.e.

P(l) = max_{1≤i≤n} {Pr(l ∈ Bi)}.

Note that for an information-theoretically secure PIR scheme, user privacy implies that B1, B2, · · · , Bn have identical distributions. Therefore, for convenience, let P(l) = Pr(l ∈ B1). Due to Lemma 5 of [4] (this can also be proved directly by defining random variables Y1, · · · , Yn with Yl = 1 if l ∈ Bi and Yl = 0 otherwise),

E(|Bi|) = Σ_{l=1}^{n} P(l).    (15)
Our target now is to show E(|Bi|) = Ω(n/k). We prove it by contradiction. Suppose that among P(1), · · · , P(n) there exist at least k + 1 values less than 1/(k + 1); without loss of generality, let them be P(1), P(2), · · · , P(k + 1). Now consider the probability Pr(1 ∉ B1 ∩ 2 ∉ B2 ∩ · · · ∩ (k + 1) ∉ Bk+1). We have
Pr(1 ∉ B1 ∩ 2 ∉ B2 ∩ · · · ∩ (k + 1) ∉ Bk+1)
  = 1 − Pr(1 ∈ B1 ∪ 2 ∈ B2 ∪ · · · ∪ (k + 1) ∈ Bk+1)
  ≥ 1 − Σ_{l=1}^{k+1} Pr(l ∈ Bl) = 1 − Σ_{l=1}^{k+1} P(l)
  > 1 − (k + 1) · 1/(k + 1) = 0.
On the other hand, note that TH caches at most k data items. As a consequence, there always exists one data item which must be read from the database during the k + 1 queries on 1, 2, · · · , k + 1. Thus, the event 1 ∉ B1 ∩ 2 ∉ B2 ∩ · · · ∩ (k + 1) ∉ Bk+1 never occurs, i.e. Pr(1 ∉ B1 ∩ 2 ∉ B2 ∩ · · · ∩ (k + 1) ∉ Bk+1) = 0, which contradicts the probability computation above.
Thus, there are at most k elements of {P(1), · · · , P(n)} whose values are less than 1/(k + 1). As a result,

Σ_{l=1}^{n} P(l) ≥ (n − k) · 1/(k + 1),    (16)

which shows that the lower bound on the average computation cost is Ω(n/k).    □
6 Discussion
Database Initialization Using TH
For applications where no trusted third party exists, TH can be used to initialize the database. TH first chooses a random permutation π0. For 1 ≤ i ≤ n, TH tags the i-th item di with its new index π0^{-1}(i). Using a merge-sort algorithm [8], d1, d2, · · · , dn are sorted by TH based on their new indices. Given the limited cache size of TH, Batcher's odd-even merge sorter [1] is an appropriate choice, requiring (log² n − log n + 4)n/4 − 1 comparisons. One may argue that the Beneš network [22] and Goldstein et al.'s switching networks [12] incur fewer comparisons. Unfortunately, neither is feasible in our system, since the former requires at least n log n bits (≫ k) of memory in TH, while the latter has a prohibitively high setup cost. Note that encryption is applied during tagging and merging so that the process is oblivious to the server.
A simple example is presented in Fig. 4. The database in the example has 4 items d1, d2, d3, d4. The permutation is π0 = (1324), i.e. π0(1) = 3, π0(2) = 4, π0(3) = 2 and π0(4) = 1. The circles denote TH and the squares denote encrypted data items. After initialization, the original four items are permuted as shown on the right. All the encrypted items are stored on the host; in every operation, only two items are read into TH's cache and then written back to the server.
Fig. 4. Initial oblivious shuffle example using odd-even merges.
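To make the initialization concrete, the sketch below generates the compare-exchange schedule of Batcher's odd-even merge sorter [1] for n a power of two and lets TH apply each comparator to a pair of encrypted, tagged records, two at a time, as depicted in Fig. 4. The helper names (read_pair, write_pair, tag_of) are assumptions introduced only for illustration:

    def batcher_comparators(n):
        # Yield the compare-exchange pairs (i, j), i < j, of Batcher's odd-even
        # merge sorting network for n a power of two [1].
        def sort(lo, length):
            if length > 1:
                m = length // 2
                yield from sort(lo, m)
                yield from sort(lo + m, m)
                yield from merge(lo, length, 1)
        def merge(lo, length, r):
            step = 2 * r
            if step < length:
                yield from merge(lo, length, step)
                yield from merge(lo + r, length, step)
                for i in range(lo + r, lo + length - r, step):
                    yield (i, i + r)
            else:
                yield (lo, lo + r)
        yield from sort(0, n)

    def oblivious_initialize(n, tag_of, read_pair, write_pair):
        # TH sorts the encrypted records by their secret tags π0^{-1}(i), touching
        # only two records at a time; the host sees only the fixed schedule.
        for i, j in batcher_comparators(n):
            a, b = read_pair(i, j)          # two records enter TH's cache (decrypted)
            if tag_of(a) > tag_of(b):       # compare by the secret new index
                a, b = b, a
            write_pair(i, j, a, b)          # re-encrypted and written back to the host

The number of compare-exchange operations generated for n a power of two matches the (log² n − log n + 4)n/4 − 1 bound quoted above.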
Instantiation of Encryption and Permutation Algorithms
An implicit assumption of our security proof in Section 4 is the semantic security of the encryption of the database; otherwise, the encryption reveals information about the data and consequently exposes user privacy. Our adversary model in Section 2 allows the adversary to submit queries and observe the subsequent access patterns and replies. Thereby, the adversary is able to obtain at most k plaintext/ciphertext pairs for each encryption key, since different random keys are selected in different sessions. Thus, an encryption algorithm that is semantically secure under the CPA (Chosen Plaintext Attack) model is required. In practice, CPA-secure symmetric ciphers, such as AES in an appropriate mode, are preferred over public key encryption, since the latter has a higher computation cost and a larger storage demand.
For the permutation algorithm, we argue that it is impractical for a hardware-based PIR to employ a truly random permutation, since it requires O(n log n) bits of storage, comparable to the size of the whole database. As a result, we opt for a pseudo-random permutation with a light computation load.
Since a CPA-secure cipher can be turned into an invertible pseudorandom permutation, we choose a CPA-secure block cipher, e.g. AES, to implement the needed pseudorandom permutation. With a block cipher, a message is encrypted in blocks. When n ≠ 2^⌈log n⌉, the ciphertext may be greater than n; in that case, the encryption is repeated until the output falls into the appropriate range. Since 2^⌈log n⌉ ≤ 2n, the expected number of encryptions is less than 2. Black and Rogaway's result in [5] provides more information on ciphers with arbitrary finite domains.
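One way to realize such a pseudorandom permutation on a small domain is a Feistel construction in the spirit of Luby-Rackoff [17] and Patarin [18], combined with the cycle-walking technique discussed by Black and Rogaway [5]. The Python sketch below is an assumption-laden illustration of the idea; the round count, round function, and key handling are placeholders, not a vetted instantiation:

    import hashlib
    import hmac

    def make_prp(key: bytes, n: int):
        # Return (pi, pi_inv): a keyed pseudorandom permutation of {0, ..., n-1} and
        # its inverse, built from a balanced Feistel network plus cycle walking.
        half = (max(2, (n - 1).bit_length()) + 1) // 2    # bits per Feistel half
        mask = (1 << half) - 1
        rounds = 7                                        # assumed; cf. [18]

        def F(r, x):                                      # keyed round function (assumed)
            msg = bytes([r]) + x.to_bytes((half + 7) // 8, "big")
            digest = hmac.new(key, msg, hashlib.sha256).digest()
            return int.from_bytes(digest, "big") & mask

        def feistel(x, inverse):
            L, R = x >> half, x & mask
            for r in (range(rounds - 1, -1, -1) if inverse else range(rounds)):
                if inverse:
                    L, R = R ^ F(r, L), L
                else:
                    L, R = R, L ^ F(r, R)
            return (L << half) | R

        def walk(x, inverse):
            y = feistel(x, inverse)
            while y >= n:                   # cycle-walk until the output falls in [0, n);
                y = feistel(y, inverse)     # the Feistel domain is below 4n, so few steps
            return y

        return (lambda i: walk(i, False)), (lambda i: walk(i, True))

With pi, pi_inv = make_prp(sk, n), TH would look up the record holding item i at position pi_inv(i) (0-based), corresponding to the use of πs^{-1} in Algorithm 1.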
Service Continuity
The database service is disrupted during the reshuffle process, and the duration of a reshuffle is non-negligible since it is an O(n) process. A trivial approach to maintaining continuity of service is to deploy two pieces of trusted hardware: while one is re-permuting the database, the other handles user queries. In case installing an extra hardware is infeasible, an alternative is to split the cache of the hardware into two halves, each with the capacity to store k/2 items. Consequently, the average computation cost is doubled.
Update of data items
A byproduct of the reshuffle process is support for database update operations. To update di, the trusted hardware reads di obliviously in the same way as it handles a read request. Then di is updated inside the hardware's cache and written into the newly permuted database during the upcoming reshuffle. Though the new value of di is not written immediately into the database, data consistency is ensured since the hardware returns the updated value directly from its cache upon user requests.
7 Conclusion
In summary, we present in this paper a novel PIR scheme with the support of trusted hardware. The new PIR construction is provably secure: the observation of the access pattern offers adaptive adversaries no additional information for determining the data items retrieved by a user.
Similar to other hardware-based PIR schemes, the communication complexity of our scheme reaches its lower bound, O(log n). In terms of computation complexity, our design is more efficient than all other existing constructions. Its online cost per query is O(1) and the average cost per query is only O(n/k), which outperforms the best known result by a factor of O(log n) (and, though we use big-O notation, the hidden constant factor is around 1 here). Furthermore, we prove that O(n/k) is the lower bound of computation cost for PIR schemes with the same trusted-hardware-based architecture.
The Trusted Computing Group (TCG) [21] defines a set of Trusted Computing Platform (TCP) specifications aiming to provide a hardware-based root of trust and a set of primitive functions to propagate trust to application software as well as across platforms. The root of trust in TCP is a hardware component on the motherboard called the Trusted Platform Module (TPM). The TPM protects data by never releasing root keys outside of the TPM. In addition, the TPM provides some primitive cryptographic functions, such as random number generation, RSA key pair generation, RSA algorithms and a hash function. Most importantly, the TPM provides mechanisms for integrity measurement, storage, and reporting of a platform, from which trust and attestation capabilities can be built. In addition to the TCG-compliant TPM, Intel's LaGrande Technology (LT) [13] includes an extended CPU enabling software domain separation and protection. Beyond the hardware layer, there is a domain manager supporting protected execution environments through domain separation, including separation of processes, memory pages, and device drivers. The emergence of these industry-standard trusted computing technologies promises to establish an adequate foundation for building a practical trusted platform for our proposed PIR scheme. How to extend and improve our proposed PIR scheme based on trusted computing technologies will be one of our future research directions.
Acknowledgement
This research is partly supported by the Office of Research, Singapore Management University. We would like to thank the anonymous reviewers for their valuable suggestions and criticisms on an earlier draft of this paper.
References
1. Kenneth E. Batcher. Sorting networks and their applications. In AFIPS Spring Joint Computing Conference, pages 307-314, 1968.
2. M. Bellare, A. Desai, D. Pointcheval, and P. Rogaway. Relations among notions of security for public-key encryption schemes. In Proceedings of Crypto '98, LNCS 1462, pages 26-45, Berlin, 1998.
3. Amos Beimel, Yuval Ishai, Eyal Kushilevitz, and Jean-François Raymond. Breaking the O(n^{1/(2k−1)}) barrier for information-theoretic private information retrieval. In FOCS, pages 261-270. IEEE Computer Society, 2002.
4. Amos Beimel, Yuval Ishai, and Tal Malkin. Reducing the servers' computation in private information retrieval: PIR with preprocessing. In CRYPTO, pages 55-73, 2000.
5. John Black and Phillip Rogaway. Ciphers with arbitrary finite domains. In Bart Preneel, editor, CT-RSA, volume 2271 of Lecture Notes in Computer Science, pages 114-130. Springer, 2002.
6. Benny Chor, Eyal Kushilevitz, Oded Goldreich, and Madhu Sudan. Private information retrieval. In FOCS, pages 41-50, 1995.
7. Benny Chor, Eyal Kushilevitz, Oded Goldreich, and Madhu Sudan. Private information retrieval. J. ACM, 45(6):965-981, 1998.
8. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. ISBN 0-262-03293-7.
9. Joan Feigenbaum. Encrypting problem instances: Or ..., can you take advantage of someone without having to trust him? In Hugh C. Williams, editor, CRYPTO, volume 218 of Lecture Notes in Computer Science, pages 477-488. Springer, 1985.
10. William Gasarch. A survey on private information retrieval. The Bulletin of the European Association for Theoretical Computer Science, Computational Complexity Column (82), 2004.
11. Oded Goldreich and Rafail Ostrovsky. Software protection and simulation on oblivious RAMs. J. ACM, 43(3):431-473, 1996.
12. J. L. Goldstein and S. W. Leibholz. On the synthesis of signal switching networks with transient blocking. IEEE Transactions on Electronic Computers, vol. 16, no. 5, pages 637-641, 1967.
13. LaGrande technology architecture. Intel Developer Forum, 2003.
14. Alexander Iliev and Sean Smith. Private information storage with logarithm-space secure hardware. In International Information Security Workshops, pages 199-214, 2004.
15. Alexander Iliev and Sean W. Smith. Protecting client privacy with trusted computing at the server. IEEE Security & Privacy, 3(2):20-28, 2005.
16. Helger Lipmaa. An oblivious transfer protocol with log-squared communication. In Jianying Zhou, Javier Lopez, Robert H. Deng, and Feng Bao, editors, ISC, volume 3650 of Lecture Notes in Computer Science, pages 314-328. Springer, 2005.
17. Michael Luby and Charles Rackoff. How to construct pseudorandom permutations from pseudorandom functions. SIAM J. Comput., 17(2):373-386, 1988.
18. Jacques Patarin. Luby-Rackoff: 7 rounds are enough for 2^{n(1−ε)} security. In Dan Boneh, editor, CRYPTO, volume 2729 of Lecture Notes in Computer Science, pages 513-529. Springer, 2003.
19. Ronald L. Rivest, Leonard Adleman, and Michael L. Dertouzos. On data banks and privacy homomorphisms.
20. Sean W. Smith and David Safford. Practical server privacy with secure coprocessors. IBM Systems Journal, 40(3):683-695, 2001.
21. TCG Specification Architecture Overview. Available from http://www.trustedcomputinggroup.org.
22. Abraham Waksman. A permutation network. J. ACM, 15(1):159-163, 1968.