A HIGH-PERFORMANCE OBLIVIOUS RAM CONTROLLER ON THE CONVEY HC-2EX HETEROGENEOUS COMPUTING PLATFORM BASED ON “PHANTOM: PRACTICAL OBLIVIOUS COMPUTATION IN A SECURE PROCESSOR” FROM CCS-2013! Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song Cryptographic Construct! A HIGH-PERFORMANCE OBLIVIOUS RAM CONTROLLER ON THE CONVEY HC-2EX HETEROGENEOUS COMPUTING PLATFORM BASED ON “PHANTOM: PRACTICAL OBLIVIOUS COMPUTATION IN A SECURE PROCESSOR” FROM CCS-2013! High-performance, FPGA-based platform! Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song Secure Processor! Organizations move to the cloud E.g. government, financial/medical companies Raises privacy concerns for sensitive data Attackers with Physical Access Chanpipat,!FreeDigitalPhotos.net! Malicious Employees Intruders Government Surveillance Physical Attack Vectors E.g. replace DRAM DIMMs with NVDIMMs that have non-volatile storage to record accesses Computation on Encrypted Data Sealed (tamper-proof), Remote attestation! CPU! DRAM! encrypted! encrypted! Hard!Drive,!etc.! e.g. Secure Processors (AEGIS, XOM, AISE-BMT), IBM Cryptographic Coprocessors, Intel SGX Memory Address Leakage Sealed (tamper-proof), Remote attestation! CPU! Memory Addresses (plaintext)! DRAM! encrypted! encrypted! Hard!Drive,!etc.! Leaks e.g. transactions, subjects of surveillance/ audit, geolocations, OS fingerprints, crypto keys Physical Memory Address Read database A real-world example: SQLite Texas SELECT population FROM all counties in <Texas | California> California Loading database Memory Accesses (Time) Query We want to prevent this information leakage In the context of a secure processor Oblivious RAM (ORAM) • Problem investigated since 1987 • Originally for memory accesses of a processor, later for e.g. FSs, DBs,... • Algorithms required MBs of trusted storage or complex ( ✗ Hardware) Path ORAM (CCS’13, Best Paper) New algorithm by Stefanov et al. ✔ Low trusted storage requirement ✔ Simple enough to implement in hardware on a secure processor Where’s the problem? How hard can it be to put Path ORAM into a processor? 1. ORAM Microarchitecture • Prior work algorithmic, ignores ORAM microarchitectural implementation • ORAM needs to fully utilize resources • You need to build it to find the details not apparent from the algorithm 2. Practicality on real system • Want obliviousness for real systems • Custom chips (ASICs) very expensive unless widely adopted • There is a trend towards FPGA-based accelerators (programmable H/W) PHANTOM: A Practical Oblivious Computing Platform Featuring an ORAM microarchitecture implemented on an FPGA platform Overview 1. Overview & Attack Model 2. Path Oblivious RAM 3. The Oblivious Memory System 4. Building PHANTOM 5. Evaluation PART I Attack Model and Deployment PHANTOM Overview Data Center! ! Client! PHANTOM! Our focus! Attack Model Memory Traffic (data, addresses,...)! ! Other digital (OS,...)! Analog (radiation, power,...)! Orthogonal! PART II Path Oblivious RAM Path Oblivious RAM Oblivious memory is divided into blocks:! A! B! C! D! E! F! …! When accessing a block through Path ORAM! Request block! Read/write to requested block! Random appearing DRAM accesses! Path Oblivious RAM 0! D 0! 0! 1! Position Map (secure)! 1! 1! Block&ID& 0! 0! 1! 0! 1! 1! 0! 1! C! 000! B! 001! 010! F! 011! A! 100! E! 101! 110! Leaf&ID& A! 101!011! 101! B! 011! C! 000! D! 010! E! 101! F! 010! 111! Stash (secure)! Required Stash Size • Blocks stay behind in the stash • How large does the stash have to be to never overflow? • Bound known up to constant factors: determined constants empirically PART III The Oblivious Memory System The Oblivious Memory System ! PHANTOM! High-throughput Memory Design& Cycles& Basic!128bit! 34816! 8x!Memory!BW! 4352! ?! AES counter mode! Note: 1,000 cycles = 6.6us @ 150 MHz (ORAM Size 1GB, 17 level tree, 4KB blocks) Challenge: ✔ Memory Bandwidth High-throughput Memory Design& Cycles& Basic!128bit! 34816! 8x!Memory!BW! 4352! ?! AES counter mode! Note: 1,000 cycles = 6.6us @ 150 MHz (ORAM Size 1GB, 17 level tree, 4KB blocks) Challenge: Keep up with memory Writing Back Blocks ?! Leaf&ID,…& Block&Contents& 110! 0xcafecafecafecafecafecafecafe…! 011! 0xcafecafecafecafecafecafecafe…! 010! …! 110! 111! 011! 110! 010! 111! 000! 000! …! 110! For each node in the path, select an entry from the stash to write to it (or put a dummy). Time to pick a block • In our case, we have 32 cycles to pick the next block (otherwise we will stall the memory system). • Examining all blocks takes C cycles for each block. stash size! Picking from the full stash Design& Cycles& Basic!128bit! 34816! 8x!Memory!BW! 4352! 10880! C cycles to select next block to write back, C = 128! Note: 1,000 cycles = 6.6us @ 150 MHz (ORAM Size 1GB, 17 level tree, 4KB blocks) Challenge: Keep up with memory Adding a sorting step Design& Cycles& Basic!128bit! 34816! 8x!Memory!BW! 4352! 10880! C!log!C!SorNng! 5248! Sorting stage! Note: 1,000 cycles = 6.6us @ 150 MHz (ORAM Size 1GB, 17 level tree, 4KB blocks) Challenge: Keep up with memory Heap-based Sorting Insert! Note: 1,000 cycles = 6.6us @ 150 MHz Design& Cycles& Basic!128bit! 34816! 8x!Memory!BW! 4352! 10880! C!log!C!SorNng! 5248! Fully!overlap! 4352! Extract! (ORAM Size 1GB, 17 level tree, 4KB blocks) Challenge: ✔ Keep up with memory Timing Channels Operation is data-driven; risk to leak information from timing 1. Operation always take the maximum amount of time (avoiding large overheads) or are overlapped 2. Decouple DRAM timing variations Challenge: Side Channels DRAM Buffer Absorb timing variations at periphery Timing isolation& Trust Boundary! DRAM!Buffer! DRAM! Address of path to access! Memory!Request! GeneraNon!Logic! All!other!PHANTOM!state! (Stash,!Sorter,!AES!Units)! ✔ Challenge: Side Channels The Whole Picture More details can be found in the paper PART IV Building PHANTOM PHANTOM Prototype X86!Host! CPU! Management! Processor! FPGA! ! FPGA! FPGA! PHANTOM! FPGA! Crossbar! Implemented on Convey HC-2ex platform DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! Host!DRAM! DIMM! MC! Integrated with RISC-V CPU X86!Host! Host!Interface! CPU! Management! Processor! FPGA! FPGA! FPGA! PHANTOM! FPGA! Crossbar! Developed by UC Berkeley’s Architecture Group DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! MC! DIMM! DIMM! Host!DRAM! DIMM! MC! PHANTOM Secure Processor • Integrated a RISC-V CPU with ORAM • Loads and runs real-world programs, including (in-memory) SQLite • Not optimized for FPGA yet, very small cache sizes (4KB/4KB/8KB) Implementation on the HC-2ex • Use Convey development kit, bundles Convey and user logic into personality • Implement Verilog module, interfaces with MCs, management unit, etc. • Personality loaded by Convey runtime Convey Personality Workflow X86!Host! CPU! Management! Processor! FPGA! FPGA! FPGA! FPGA! Crossbar! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! & DIMM! DIMM! DIMM! MC! Management& MC! MC! MC! Shared!Regs! MC! MC! MC! Dispatch! Processor&Code& foo:! Personality!Logic! !!mov!%8,!$1,!%aeg! Memory!Controllers! !!caep01.ae0!$0! DIMM! Convey&RunBTime& MC! init(personality);! copcall(foo,!A,!B);! Host!DRAM! do_work(...);! ! Using this to build two-way communication channel Interaction with RISC-V CPU X86!Host! CPU! Management! Processor! FPGA! FPGA! FPGA! FPGA! Crossbar! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! DIMM! Management& MC! MC! MC! MC! Shared!Regs! MC! MC! MC! Dispatch! Processor&Glue& • Write!values!to! PHANTOM!(CPU+ORAM)! shared!registers! Memory!Controllers! • Blocking!caep! DIMM! DIMM! DIMM! FrontBend&Server& MC! • Load!programs! • Execute!sysc calls!for!the! Host!DRAM! RISCcV!CPU!! & RISC-V CPU runs independently but talks to host ORAM Microarchitecture • Fully implemented, except remote attestation and AES units • ORAM controller tested/verified for millions of random ORAM accesses • ORAM Block Size of 4KB (for now) Implementation Challenges • Many challenges and unknown details • Min-heap, BRAM multiplexing, block headers, stash management, block caching, timing domains, inter-FPGA communication, block buffering,... Min-heap Implementation Need to write and look at two children at every step, running at 150 Mhz Pre-fetch four grandchildren to avoid long combinational path (read and write to BRAM in the same cycle)! Split into multiple BRAMs to avoid limitation to 2 ports! Synthesized FPGA Design 3rd Workshop on the Intersections of Computer Architecture and Reconfigurable Logic (CARL 2013): Category 2 1 Current comparison (will be written this cycle) Current node 2 Virtex-60 LX760 1 FPGA! odes to be read from 2 Potential next nodes RAM during this cycle 0 1 2 3 3 0 3 Nodes read from BRAM during previous cycle igure 4: Min-heap implementation. The numbers represent the RAM each node is stored in. The four second-degree children f a node are put into four BRAMs to access them in parallel. hecks in one cycle whether it can be written back into the urrent position or not – if not, no other block can, and we ave to write a dummy. We further improve on this approach by replacing the sortng and selection stages by a min-heap (sorted by the curent path’s leaf ID). This replaces an O(C log C) operation y O(log C) operations (each of which completely overlaps ither with a block arriving from or being written to memry), where C is the size of the stash. This makes it now Figure 5: Synthesized design on a Virtex-6 LX760 FPGA therefore introduce bu↵ers at the DRAM interface to isolate PART V Evaluation Time per ORAM access (us) ORAM Access times 40! 35! Total access time! 4719 cycles! 4352 cycles! 30! 25! 20! 32x! 15! 10! 5! 0! 0! 1! 2! 3! 0! 1! 2! 3! 0! 1! 2! 3! 0! 1! 2! 3! 0! 1! 2! 3! 0! 1! 2! 3! 0! 1! 2! 3! 13! 14! 15! 16! 17! 18! 19! Row 1: ORAM size in levels (64MB-4GB) - Row 2: # cached levels (k) Average of 1M ORAM accesses each (4KB) 0.812us! Time for read phase! Application Performance Simulation (Caches: 16KB/32KB/1MB) sqlitecquery3! sqlitecquery2! FPGA Prototype (Caches: 4KB/4KB/8KB) sqlitecquery1! No ORAM sqlitecwarmup! 0! 2! 4! 6! 8! 10! 12! 14! 16! Execution Time normalized to no ORAM 20%-5.5x Overhead (1MB LLC) 1GB ORAM Future Work • Prototype is a starting point • Integrate additional Path ORAM optimizations, HW/SW co-design • Compiler/OS support to avoid ORAM accesses and reduce size of ORAMs Conclusion • Investigated ORAM microarchitecture to exploit high memory bandwidth • PHANTOM: Make oblivious computation practical on existing hardware Thank you! Any Questions?! Martin Maas, Eric Love, Emil Stefanov, Mohit Tiwari Elaine Shi, Krste Asanovic, John Kubiatowicz, Dawn Song {maas, ericlove, emil}@eecs.berkeley.edu, tiwari@austin.utexas.edu, elaine@cs.umd.edu, {krste, kubitron, dawnsong}@eecs.berkeley.edu !