A H -P O

advertisement
A HIGH-PERFORMANCE OBLIVIOUS RAM CONTROLLER ON THE
CONVEY HC-2EX HETEROGENEOUS COMPUTING PLATFORM
BASED ON “PHANTOM: PRACTICAL OBLIVIOUS COMPUTATION IN A SECURE PROCESSOR” FROM CCS-2013!
Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari
Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song
Cryptographic Construct!
A HIGH-PERFORMANCE OBLIVIOUS RAM CONTROLLER ON THE
CONVEY HC-2EX HETEROGENEOUS COMPUTING PLATFORM
BASED ON “PHANTOM: PRACTICAL OBLIVIOUS COMPUTATION IN A SECURE PROCESSOR” FROM CCS-2013!
High-performance, FPGA-based platform!
Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari
Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song
Secure Processor!
Organizations move to the cloud
E.g. government, financial/medical companies
Raises privacy concerns for sensitive data
Attackers with Physical Access
Chanpipat,!FreeDigitalPhotos.net!
Malicious
Employees
Intruders
Government
Surveillance
Physical Attack Vectors
E.g. replace DRAM DIMMs with NVDIMMs that have
non-volatile storage to record accesses
Computation on Encrypted Data
Sealed (tamper-proof),
Remote attestation!
CPU!
DRAM!
encrypted!
encrypted!
Hard!Drive,!etc.!
e.g. Secure Processors (AEGIS, XOM, AISE-BMT),
IBM Cryptographic Coprocessors, Intel SGX
Memory Address Leakage
Sealed (tamper-proof),
Remote attestation!
CPU!
Memory Addresses (plaintext)!
DRAM!
encrypted!
encrypted!
Hard!Drive,!etc.!
Leaks e.g. transactions, subjects of surveillance/
audit, geolocations, OS fingerprints, crypto keys
Physical Memory Address
Read database
A real-world example: SQLite
Texas
SELECT population FROM all counties in <Texas | California>
California
Loading database
Memory Accesses (Time)
Query
We want to prevent this
information leakage
In the context of a secure processor
Oblivious RAM (ORAM)
•  Problem investigated since 1987
•  Originally for memory accesses of a
processor, later for e.g. FSs, DBs,...
•  Algorithms required MBs of trusted
storage or complex ( ✗ Hardware)
Path ORAM (CCS’13, Best Paper)
New algorithm by Stefanov et al.
✔ Low trusted storage requirement
✔ Simple enough to implement in
hardware on a secure processor
Where’s the problem?
How hard can it be to put Path
ORAM into a processor?
1. ORAM Microarchitecture
•  Prior work algorithmic, ignores ORAM
microarchitectural implementation
•  ORAM needs to fully utilize resources
•  You need to build it to find the details
not apparent from the algorithm
2. Practicality on real system
•  Want obliviousness for real systems
•  Custom chips (ASICs) very expensive
unless widely adopted
•  There is a trend towards FPGA-based
accelerators (programmable H/W)
PHANTOM: A Practical Oblivious
Computing Platform
Featuring an ORAM microarchitecture
implemented on an FPGA platform
Overview
1.  Overview & Attack Model
2.  Path Oblivious RAM
3.  The Oblivious Memory System
4.  Building PHANTOM
5.  Evaluation
PART I
Attack Model and Deployment
PHANTOM Overview
Data Center!
!
Client!
PHANTOM!
Our focus!
Attack Model
Memory Traffic (data, addresses,...)!
!
Other digital
(OS,...)!
Analog (radiation,
power,...)!
Orthogonal!
PART II
Path Oblivious RAM
Path Oblivious RAM
Oblivious memory is divided into blocks:!
A!
B!
C!
D!
E!
F!
…!
When accessing a block through Path ORAM!
Request block!
Read/write to
requested block!
Random appearing
DRAM accesses!
Path Oblivious RAM
0!
D
0!
0!
1!
Position Map (secure)!
1!
1!
Block&ID&
0!
0!
1!
0!
1!
1!
0!
1!
C!
000!
B!
001!
010!
F!
011!
A!
100!
E!
101!
110!
Leaf&ID&
A!
101!011!
101!
B!
011!
C!
000!
D!
010!
E!
101!
F!
010!
111!
Stash (secure)!
Required Stash Size
•  Blocks stay behind in the stash
•  How large does the stash have to be
to never overflow?
•  Bound known up to constant factors:
determined constants empirically
PART III
The Oblivious Memory System
The Oblivious Memory System
!
PHANTOM!
High-throughput Memory
Design&
Cycles&
Basic!128bit!
34816!
8x!Memory!BW!
4352!
?!
AES counter mode!
Note: 1,000 cycles = 6.6us @ 150 MHz
(ORAM Size 1GB, 17 level tree, 4KB blocks)
Challenge:
✔ Memory Bandwidth
High-throughput Memory
Design&
Cycles&
Basic!128bit!
34816!
8x!Memory!BW!
4352!
?!
AES counter mode!
Note: 1,000 cycles = 6.6us @ 150 MHz
(ORAM Size 1GB, 17 level tree, 4KB blocks)
Challenge:
Keep up with memory
Writing Back Blocks
?!
Leaf&ID,…&
Block&Contents&
110!
0xcafecafecafecafecafecafecafe…!
011!
0xcafecafecafecafecafecafecafe…!
010!
…!
110!
111!
011!
110!
010!
111!
000!
000!
…!
110!
For each node in the path, select an entry from
the stash to write to it (or put a dummy).
Time to pick a block
•  In our case, we have 32 cycles to pick
the next block (otherwise we will stall
the memory system).
•  Examining all blocks takes C cycles
for each block.
stash size!
Picking from the full stash
Design&
Cycles&
Basic!128bit!
34816!
8x!Memory!BW!
4352!
10880!
C cycles to select next block
to write back, C = 128!
Note: 1,000 cycles = 6.6us @ 150 MHz
(ORAM Size 1GB, 17 level tree, 4KB blocks)
Challenge:
Keep up with memory
Adding a sorting step
Design&
Cycles&
Basic!128bit!
34816!
8x!Memory!BW!
4352!
10880!
C!log!C!SorNng!
5248!
Sorting stage!
Note: 1,000 cycles = 6.6us @ 150 MHz
(ORAM Size 1GB, 17 level tree, 4KB blocks)
Challenge:
Keep up with memory
Heap-based Sorting
Insert!
Note: 1,000 cycles = 6.6us @ 150 MHz
Design&
Cycles&
Basic!128bit!
34816!
8x!Memory!BW!
4352!
10880!
C!log!C!SorNng!
5248!
Fully!overlap!
4352!
Extract!
(ORAM Size 1GB, 17 level tree, 4KB blocks)
Challenge:
✔ Keep up with memory
Timing Channels
Operation is data-driven; risk to leak
information from timing
1.  Operation always take the maximum amount of
time (avoiding large overheads) or are overlapped
2.  Decouple DRAM timing variations
Challenge: Side Channels
DRAM Buffer
Absorb timing variations at periphery
Timing isolation&
Trust Boundary!
DRAM!Buffer!
DRAM!
Address of path to access!
Memory!Request!
GeneraNon!Logic!
All!other!PHANTOM!state!
(Stash,!Sorter,!AES!Units)!
✔ Challenge: Side Channels
The Whole Picture
More details can be found in the paper
PART IV
Building PHANTOM
PHANTOM Prototype
X86!Host!
CPU!
Management!
Processor!
FPGA!
!
FPGA!
FPGA!
PHANTOM!
FPGA!
Crossbar!
Implemented on Convey HC-2ex platform
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
Host!DRAM!
DIMM!
MC!
Integrated with RISC-V CPU
X86!Host!
Host!Interface!
CPU!
Management!
Processor!
FPGA!
FPGA!
FPGA!
PHANTOM!
FPGA!
Crossbar!
Developed by UC Berkeley’s Architecture Group
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
MC!
DIMM!
DIMM!
Host!DRAM!
DIMM!
MC!
PHANTOM Secure Processor
•  Integrated a RISC-V CPU with ORAM
•  Loads and runs real-world programs,
including (in-memory) SQLite
•  Not optimized for FPGA yet, very small
cache sizes (4KB/4KB/8KB)
Implementation on the HC-2ex
•  Use Convey development kit, bundles
Convey and user logic into personality
•  Implement Verilog module, interfaces
with MCs, management unit, etc.
•  Personality loaded by Convey runtime
Convey Personality Workflow
X86!Host!
CPU!
Management!
Processor!
FPGA!
FPGA!
FPGA!
FPGA!
Crossbar!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
&
DIMM!
DIMM!
DIMM!
MC! Management&
MC!
MC!
MC! Shared!Regs!
MC!
MC!
MC! Dispatch!
Processor&Code&
foo:!
Personality!Logic!
!!mov!%8,!$1,!%aeg!
Memory!Controllers!
!!caep01.ae0!$0!
DIMM!
Convey&RunBTime& MC!
init(personality);!
copcall(foo,!A,!B);!
Host!DRAM!
do_work(...);!
!
Using this to build two-way communication channel
Interaction with RISC-V CPU
X86!Host!
CPU!
Management!
Processor!
FPGA!
FPGA!
FPGA!
FPGA!
Crossbar!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
DIMM!
Management&
MC!
MC!
MC!
MC! Shared!Regs!
MC!
MC!
MC! Dispatch!
Processor&Glue&
•  Write!values!to!
PHANTOM!(CPU+ORAM)!
shared!registers!
Memory!Controllers!
•  Blocking!caep!
DIMM!
DIMM!
DIMM!
FrontBend&Server& MC!
•  Load!programs!
•  Execute!sysc
calls!for!the!
Host!DRAM!
RISCcV!CPU!!
&
RISC-V CPU runs independently but talks to host
ORAM Microarchitecture
•  Fully implemented, except remote
attestation and AES units
•  ORAM controller tested/verified for
millions of random ORAM accesses
•  ORAM Block Size of 4KB (for now)
Implementation Challenges
•  Many challenges and unknown details
•  Min-heap, BRAM multiplexing, block
headers, stash management, block
caching, timing domains, inter-FPGA
communication, block buffering,...
Min-heap Implementation
Need to write and look at two children at
every step, running at 150 Mhz
Pre-fetch four grandchildren
to avoid long combinational
path (read and write to BRAM
in the same cycle)!
Split into multiple BRAMs to
avoid limitation to 2 ports!
Synthesized FPGA Design
3rd Workshop on the Intersections of Computer Architecture and Reconfigurable Logic (CARL 2013): Category 2
1
Current comparison (will be written this cycle)
Current node
2
Virtex-60 LX760 1
FPGA!
odes to be read from
2
Potential next
nodes
RAM during this cycle
0
1
2
3
3
0
3
Nodes read from BRAM
during previous cycle
igure 4: Min-heap implementation. The numbers represent the
RAM each node is stored in. The four second-degree children
f a node are put into four BRAMs to access them in parallel.
hecks in one cycle whether it can be written back into the
urrent position or not – if not, no other block can, and we
ave to write a dummy.
We further improve on this approach by replacing the sortng and selection stages by a min-heap (sorted by the curent path’s leaf ID). This replaces an O(C log C) operation
y O(log C) operations (each of which completely overlaps
ither with a block arriving from or being written to memry), where C is the size of the stash. This makes it now
Figure 5: Synthesized design on a Virtex-6 LX760 FPGA
therefore introduce bu↵ers at the DRAM interface to isolate
PART V
Evaluation
Time per ORAM access (us)
ORAM Access times
40!
35!
Total access time!
4719 cycles!
4352
cycles!
30!
25!
20!
32x!
15!
10!
5!
0!
0! 1! 2! 3!
0! 1! 2! 3!
0! 1! 2! 3!
0! 1! 2! 3!
0! 1! 2! 3!
0! 1! 2! 3!
0! 1! 2! 3!
13!
14!
15!
16!
17!
18!
19!
Row 1: ORAM size in levels (64MB-4GB) - Row 2: # cached levels (k)
Average of 1M ORAM accesses each (4KB)
0.812us!
Time for read phase!
Application Performance
Simulation (Caches:
16KB/32KB/1MB)
sqlitecquery3!
sqlitecquery2!
FPGA Prototype (Caches:
4KB/4KB/8KB)
sqlitecquery1!
No ORAM
sqlitecwarmup!
0!
2!
4!
6!
8!
10!
12!
14!
16!
Execution Time normalized to no ORAM
20%-5.5x Overhead (1MB LLC)
1GB ORAM
Future Work
•  Prototype is a starting point
•  Integrate additional Path ORAM
optimizations, HW/SW co-design
•  Compiler/OS support to avoid ORAM
accesses and reduce size of ORAMs
Conclusion
•  Investigated ORAM microarchitecture
to exploit high memory bandwidth
•  PHANTOM: Make oblivious computation
practical on existing hardware
Thank you! Any Questions?!
Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari
Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song
{maas, ericlove, emil}@eecs.berkeley.edu, tiwari@austin.utexas.edu,
elaine@cs.umd.edu, {krste, kubitron, dawnsong}@eecs.berkeley.edu
!
Download