Border Control: Sandboxing Accelerators

Lena E. Olson    Jason Power    Mark D. Hill    David A. Wood
University of Wisconsin-Madison, Computer Sciences Department
{lena, powerjg, markhill, david}@cs.wisc.edu

ABSTRACT

As hardware accelerators proliferate, there is a desire to logically integrate them more tightly with CPUs through interfaces such as shared virtual memory. Although this integration has programmability and performance benefits, it may also have serious security and fault isolation implications, especially when accelerators are designed by third parties. Unchecked, accelerators could make incorrect memory accesses, causing information leaks, data corruption, or crashes not only for processes running on the accelerator, but for the rest of the system as well. Unfortunately, current security solutions are insufficient for providing memory protection from tightly integrated untrusted accelerators. We propose Border Control, a sandboxing mechanism which guarantees that the memory access permissions in the page table are respected by accelerators, regardless of design errors or malicious intent. Our hardware implementation of Border Control provides safety against improper memory accesses with a space overhead of only 0.006% of system physical memory per accelerator. We show that when used with a current highly demanding accelerator, this initial Border Control implementation has on average a 0.15% runtime overhead relative to the unsafe baseline.

Categories
• Security and privacy → Hardware attacks and countermeasures
• Computer systems organization → Processors and memory architectures; Reliability

Keywords
accelerators, memory protection, hardware sandboxing

MICRO-48, December 05-09, 2015, Waikiki, HI, USA. DOI: http://dx.doi.org/10.1145/2830772.2830819

1. INTRODUCTION

‘[B]order control’ means the activity carried out at a border . . . in response exclusively to an intention to cross or the act of crossing that border, regardless of any other consideration, consisting of border checks and border surveillance.
Article 2, Section 9 of the Schengen Borders Code [1]

With the breakdown of Dennard scaling, specialized hardware accelerators are being hailed as one way to improve performance and reduce power. Computations can be offloaded from the general-purpose CPU to specialized accelerators, which provide higher performance, lower power, or both. Consequently, there is a proliferation of proposals for accelerators. Examples include cryptographic accelerators, GPUs, GPGPUs, accelerators for image processing, neural and approximate computing, database accelerators, and user-reconfigurable hardware [2–7]. Some of this plethora of accelerators may be custom-designed by third parties, and purchased as soft or firm intellectual property (IP).
For example, third-party designers such as Imagination Technologies provide accelerator IP (e.g., for GPUs) to companies such as Apple for inclusion in their devices. 3D die stacking allows dies fabricated in different manufacturing processes to be packaged together, increasing the opportunity for tightly integrated third-party accelerators [8]. Consumer electronics manufacturers may not have introspection capabilities into the third-party IP. For instance, with soft IP the netlist may be encrypted by the hardware IP designers to protect their research investment, which complicates verification and increases the likelihood of bugs and other errors. However, the consumer-facing company still must guarantee performance, stability, and consumer privacy. We expect that SoC designers will increasingly need methods to sandbox on-die and on-package third-party accelerators.

Accelerators increasingly resemble first-class processors in the way they access memory. Models such as Heterogeneous System Architecture (HSA) [9] adopt a shared virtual address space and cache coherence between accelerators and the CPU. This has a variety of benefits, including the elimination of manual data copies, “pointer-is-a-pointer” semantics, increased performance of fine-grained sharing, and behavior familiar to programmers. Recently, FUSION [10] investigated the performance and energy improvements possible by using coherent accelerator caches rather than DMA transfers. Many systems today have tightly integrated accelerators which share a coherent unified virtual address space with CPU cores, including Microsoft's Xbox One and AMD Kaveri [11–14]. Much of the current support for tightly integrated accelerators is in first-party designs, but there is some initial support for third-party accelerators: for example, the ARM Accelerator Coherency Port (ACP) [15] and the Coherent Accelerator Processor Interface (CAPI) from IBM [12].

Allowing accelerators access to system memory has security and fault isolation implications that have yet to be fully explored, especially with third-party accelerators. In particular, designs where the accelerator is able to make memory requests directly by physical address introduce security concerns. There has been speculation about the existence of malicious hardware [16], and even unintentional hardware bugs can be exploited by an attacker (e.g., a bug in the MIPS R4000 allowed a privilege escalation attack [17]). A malicious or compromised accelerator that is allowed to make arbitrary memory requests could read sensitive information such as buffers storing encryption keys. It could then exfiltrate this data by writing it to any location in memory. A buggy accelerator could cause a system crash by erroneously corrupting OS data structures. Buggy accelerators may be more common in the future with user-programmable hardware on package, such as the Xilinx Zynq [7, 18]. In addition, more complex accelerators will be harder to perfectly verify, as evidenced by the errata released by even trusted CPU and GPU manufacturers. Design errors such as the AMD Phenom TLB bug and other errata relating to TLBs show that even first-party designs may incorrectly translate memory addresses [19–21].

A number of approaches have been used by industry to make accelerators more programmable and safer. However, these approaches have focused either on safety or high performance, but not both.
For example, approaches such as the I/O Memory Management Unit (IOMMU) [22–24] and CAPI [12, 25] provide safety by requiring that every memory request be translated and checked by a trusted hardware interface (Figure 1a). However, they prevent the accelerator from implementing TLBs or coherence-friendly physically addressed caches. This may be acceptable for accelerators with regular memory access patterns or low performance requirements, but some accelerators require higher performance.

[Figure 1: An example system with untrusted third-party accelerators interacting with the system in three ways: a safe but slow IOMMU (a), a fast but unsafe direct access (b), and Border Control, which is safe and high performance (c).]

In contrast, other approaches prioritize high performance over safety (Figure 1b). Some current high-performance GPU configurations use the IOMMU for translation, but allow the accelerator to store physical addresses in an accelerator TLB and caches [22–24]. The accelerator is then able to bypass the IOMMU and make requests directly to memory by physical address, and there are no checks of whether the address is legitimate. There are also orthogonal security techniques, such as ARM TrustZone, which protects sensitive data but provides little protection between processes.

We propose Border Control, which allows accelerators to use performance optimizations such as TLBs and physical caches, while guaranteeing that memory access permissions are respected by accelerators, regardless of design errors or malicious intent. To enforce this security property, we check the physical address of each memory request as it crosses the trusted-untrusted border from the accelerator to the memory system (Figure 1c). If the accelerator attempts an improper memory access, the access is blocked and the OS is notified. We build upon the existing process abstraction, using the permissions set by the OS as stored in the page table, rather than inventing a new means of determining access permissions.

We present a low-overhead implementation of Border Control, which provides security via a per-accelerator flat table stored in physical memory called the Protection Table, and a small cache of this table called the Border Control Cache. The Protection Table holds correct, up-to-date permissions for any physical address legitimately requested by the accelerator. We minimize storage overhead with the insight that translation and permission checking can be decoupled. For a current high-performance accelerator, the general-purpose GPU, this design has negligible performance overhead (on average 0.15% with an 8KB Border Control Cache). This shows we can provide security with a simple design and low overheads for performance and storage.

This paper proactively addresses a class of accelerator security threats (rather than reacting after systems are designed and deployed), making the following contributions:

• We describe an emerging security threat to systems which use third-party accelerators.
• We propose Border Control: a comprehensive means of sandboxing and protecting system memory from misbehaving accelerators.
• We present a hardware design of Border Control and quantitatively evaluate its performance and space overheads, showing that they are minimal for even a high-performance accelerator.

2. GOALS AND CURRENT COMMERCIAL APPROACHES

We first define the threat model Border Control addresses and the scope of the threat model, and then discuss current commercial approaches.

2.1 Threat Model

The goal of Border Control is to provide sandboxing, a technique first proposed to provide isolation and protection from third-party software modules [26]. Rather than protecting against incorrect software modules, Border Control applies to incorrect hardware accelerators. The threat model we address is an untrusted accelerator violating the confidentiality and integrity of the trusted host system physical memory via a read/write request to a physical memory location in violation of the permissions set by the trusted OS. More formally:

The threat vector is: an untrusted accelerator, which may be buggy or malicious, that contains arbitrary logic and has direct access to physical memory.

The addressed threats are:
• Violation of confidentiality of host memory, if the untrusted accelerator reads a host physical address to which it is not currently granted read permission by the trusted OS.
• Violation of integrity of host memory, if the untrusted accelerator writes a host physical address to which it is not currently granted write permission by the trusted OS.

There are several cases where accelerators might violate confidentiality or integrity of the host memory by making incorrect memory accesses. Non-malicious accelerators may send memory requests to incorrect addresses due to design flaws. For example, in an accelerator with a TLB, an incorrect implementation of TLB shootdown could result in memory requests made with stale translations. A malicious programmer could exploit these flaws (as in the MIPS R4000 [17]) to read and write addresses to which they do not have access permissions. An accelerator that contains malicious hardware—for example, one that contains a hardware trojan—poses an even greater threat, because it can send arbitrary memory requests.

Incorrect memory accesses can cause a number of security and reliability problems for the system. In non-malicious cases, wild writes can corrupt data including OS structures, resulting in information loss, unreliability, and crashes. Allowing arbitrary reads to host memory can compromise sensitive information stored in memory, such as data stored in keyboard or network buffers, private keys, etc. The attacker can then exfiltrate this data through a variety of means, including writing it to locations belonging to other processes or enqueueing network packets to send to the Internet. In essence, the ability to make arbitrary memory requests provides full control of the system.

Note that these risks apply to any accelerator that can make physical memory accesses, even accelerators designed for non-critical tasks. For example, a video decoding accelerator may seem harmless from a security perspective if the videos it decodes are neither sensitive nor critical. However, the ability to access system memory means that these seemingly unimportant accelerators can have major implications for system security.

The Principle of Least Privilege [27] states that any part of a system should “operate using the least amount of privilege necessary to complete the job.” Border Control enforces this by leveraging the fine-grained memory access permissions of virtual memory.
Virtual memory has provided isolation and safety between processes for over 40 years, and Border Control extends this support to the burgeoning accelerator ecosystem by providing the OS with a mechanism to enforce permissions for untrusted accelerators. Current systems use the process construct as their container for isolation and protection, and we believe it is important to extend this isolation to computational engines other than the CPU. Whether a process (or an accelerator kernel invoked by the process) is running on the CPU or an accelerator, it should be subject to the same process memory permissions. Border Control's memory sandboxing allows system designers to tightly incorporate untrusted third-party or user-programmed accelerators into their devices while keeping the safety and isolation provided by virtual memory.

2.2 Threat Model Scope

There are a number of threats that are outside the scope of this threat model, including the following:
• incorrect behavior internal to the accelerator.
• the accelerator violating the confidentiality or integrity of data to which it has been granted access by the OS.
• the OS providing incorrect permissions.

In particular, Border Control treats the accelerator as a black box, and therefore cannot control what it does internally. That is, a process that uses an accelerator assumes the same level of trust as a process that invokes a third-party software module. When a program calls a library, the OS does not ensure that the library is performing the functionality it advertises; it is possible that the library is buggy or malicious. However, it does prevent the library from accessing other processes' memory or other users' files. Similarly, Border Control does not monitor the operations the accelerator is performing for correctness or place limits on its access to the process's memory. It instead allows the OS to enforce memory access permissions, preventing the accelerator from accessing memory unassociated with the process it is executing. In analogy to the function of border control between countries in the real world, a country can inspect and block traffic entering at its border, but cannot interfere with or observe activity in neighboring countries.

2.3 Existing Commercial Approaches

There have been a number of previous approaches to preventing hardware from making bad memory requests. We discuss existing commercial approaches here (summarized in Table 1) and defer discussion of other approaches and conceptually similar work to Section 6.

Table 1: Comparison of Border Control with other approaches.

                  Provides Protection   Protection Between   Direct Access to
                  for OS                Processes            Physical Memory
ATS-only IOMMU    no                    no                   yes
Full IOMMU        yes                   yes                  no
IBM CAPI          yes                   yes                  no
ARM TrustZone     yes                   no                   yes
Border Control    yes                   yes                  yes

Unlike CPUs, accelerators cannot perform page table walks, and instead rely on the Address Translation Service (ATS), often provided by the IOMMU [22–24]. The ATS takes a virtual address, walks the page table on behalf of the accelerator, and returns the physical address. High-performance accelerators may cache the physical addresses returned by the ATS in accelerator TLBs. This lets them maintain physical caches to reduce access latency and more easily handle synonyms, homonyms, and coherence requests. In this design, the accelerator makes memory requests by physical address, either directly to the memory or by sending them through the IOMMU as “pre-translated” requests, which are not checked for correctness [22].
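To make the distinction concrete, the C sketch below models, in deliberately simplified form, why the pre-translated path is unsafe: permissions are checked only at translation time, so nothing constrains the physical addresses an untrusted accelerator later presents. All names are hypothetical; this is an illustration, not any vendor's interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model: permissions are enforced only inside ats_translate(). */
typedef struct { uint64_t ppn; bool read_ok, write_ok; } xlate_t;

xlate_t ats_translate(uint64_t vpn) {
    /* Stub: a real ATS walks the process page table and checks R/W here. */
    xlate_t t = { .ppn = vpn, .read_ok = true, .write_ok = true };
    return t;
}

void pretranslated_access(uint64_t ppn, bool is_write) {
    /* Pre-translated path: the request goes straight to memory, unchecked. */
    (void)ppn; (void)is_write;
}

void misbehaving_accelerator(void) {
    xlate_t t = ats_translate(42);       /* checked, legitimate translation */
    pretranslated_access(t.ppn, false);  /* legitimate use of returned PPN  */
    pretranslated_access(0xBAD, true);   /* arbitrary PPN: nothing stops it */
}
```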
Other accelerators use only virtual addresses, and send requests via the IOMMU, which translates them and passes them on to memory. In this case, the IOMMU provides full protection in an analogous manner to how the CPU TLB provides protection between software programs. Translating every request at the IOMMU provides safety but degrades performance. It also prevents the accelerator from maintaining physical caches.

A similar approach to the IOMMU is IBM CAPI [12, 25], which allows FPGA accelerators to access system memory coherently. The CAPI interface implements TLBs and coherent, physical caches in the trusted hardware. This provides complete memory safety from incorrect accelerators. However, accelerator designers are unable to optimize the caches and TLB for the needs of their accelerator, and the loose coupling may result in longer TLB and cache access times.

An orthogonal approach to providing protection from untrusted hardware components is ARM TrustZone [28]. This approach divides hardware and software into the Secure world and the Normal world. Although this approach could be used to prevent untrusted accelerators from accessing host memory belonging to the OS, it is coarse-grained and cannot enforce protection between Normal world processes. For example, with TrustZone a misbehaving accelerator could access memory belonging to other users. In addition, TrustZone involves large hardware and OS modifications.

3. BORDER CONTROL

Border Control sandboxes accelerators by checking the access permissions for every memory request crossing the untrusted-to-trusted border (Figure 1). If the accelerator makes a read request (e.g., on a cache miss), Border Control checks that the process running on the accelerator has read permission for that physical page. If the accelerator makes a write request (e.g., on an accelerator cache writeback), the accelerator process must have write permission for that page. If the accelerator attempts to access a page for which it does not have sufficient permission, the access is not allowed to proceed and the OS is notified. In a correct accelerator design, these incorrect accesses should never occur. Detecting one therefore indicates that there is a problem with the accelerator hardware.

We first give an overview of the structures used for Border Control, then describe the actions that occur on various accelerator events. We then discuss how Border Control deals with multiprocess accelerators, as well as some accelerator design considerations.

3.1 Border Control Structures

Border Control sits between the physical caches implemented on the accelerator and the rest of the memory hierarchy—at the border between the untrusted accelerator and the trusted system (left side of Figure 2). Border Control checks coherence requests sent from the accelerator to the memory system. It is important for our approach to be light-weight and able to support high request rates. Our design consists of two pieces: the Protection Table and the Border Control Cache (BCC) (right side of Figure 2). The Protection Table contains all relevant permission information and is sufficient for a correct design. The BCC is simply a cache of the Protection Table to reduce Border Control latency, memory traffic, and energy by caching recently used permission information.

3.1.1 Protection Table Design

The Protection Table contains all information about physical page permissions needed to provide security.
We rely upon the insight that it is not necessary to know the mapping from physical to virtual page number to determine page permissions; instead, we simply need to know whether there exists a mapping from a physical page to a virtual page with sufficient permissions in the process's address space. This drastically reduces the storage overhead per page—from 64 bits in the process page table entry to 2 bits in the Protection Table.

We implement the Protection Table as a flat table in physical memory, with a read permission bit and a write permission bit for each physical page number (PPN) (Figure 2). The Protection Table is physically indexed (in contrast to the process page table, which is virtually indexed), because lookups are done by physical address. There is one Protection Table per active accelerator; details for accelerators executing multiple processes concurrently are in Section 3.3. The Protection Table does not store execute permissions, because once a block is within the accelerator, Border Control cannot monitor or enforce whether the block is read as data or executed as instructions (see Section 2.2).

[Figure 2: Implementation of Border Control. Left shows Border Control in the system. Right shows the Border Control design with a Protection Table in memory and a Border Control Cache between the accelerator caches and the shared memory system.]

Because the minimum page size is 4KB, this implies a memory storage overhead of approximately 0.006% of the physical address space per active accelerator. Thus, a system with 16GB of physical memory would require only 1MB for each accelerator's Protection Table. The flat layout guarantees that all permission lookups can be completed with a single memory access, which can proceed in parallel with read requests. Section 3.4.4 discusses handling large/huge pages.

The Protection Table is initialized to zero, indicating no access permissions. The Protection Table is updated on each accelerator request to the ATS (Address Translation Service). This occurs whether or not the accelerator caches the translation in a local TLB. The Protection Table is thus guaranteed to contain up-to-date permissions for any legitimate memory request from the accelerator—that is, any request where the physical address was originally supplied by the ATS. When the accelerator makes a request for a physical address not first obtained from the ATS, we consider the behavior undefined and seek only to provide security. Since the accelerator is misbehaving, we do not attempt to provide the “true” page permissions as stored in the process page table. Thus, in some cases Border Control may report an insufficient-permissions error although the process page table contains a mapping with sufficient permissions. However, safety is always preserved, and incorrect permissions are seen only when the accelerator misbehaves.
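To make the layout concrete, the following C sketch models the flat Protection Table in software (all names are hypothetical; in hardware this is a bit array in DRAM plus base and bounds registers, not code). It also carries the storage arithmetic: 16GB of memory is 2^22 4KB pages, and at 2 bits per page the table is 1MB, about 0.006% of physical memory.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT 12  /* 4KB minimum page size */

typedef struct {
    uint8_t *bits;       /* 2 bits per physical page: R, then W */
    uint64_t num_pages;  /* from the bounds register */
} prot_table_t;

/* One table per active accelerator, zeroed: no permissions initially.
 * 16GB => 2^22 pages * 2 bits = 1MB, i.e. ~0.006% of physical memory. */
prot_table_t pt_alloc(uint64_t phys_mem_bytes) {
    prot_table_t t;
    t.num_pages = phys_mem_bytes >> PAGE_SHIFT;
    t.bits = calloc((size_t)((t.num_pages * 2 + 7) / 8), 1);
    return t;
}

/* Physically indexed lookup: a single access, no reverse translation. */
bool pt_check(const prot_table_t *t, uint64_t ppn, bool is_write) {
    if (ppn >= t->num_pages) return false;       /* bounds-register check */
    uint64_t bit = ppn * 2 + (is_write ? 1 : 0);
    return (t->bits[bit / 8] >> (bit % 8)) & 1;
}

/* Called when the ATS returns a translation (Section 3.2.2). */
void pt_grant(prot_table_t *t, uint64_t ppn, bool r, bool w) {
    if (ppn >= t->num_pages) return;
    uint64_t bit = ppn * 2;
    if (r) t->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    if (w) t->bits[(bit + 1) / 8] |= (uint8_t)(1u << ((bit + 1) % 8));
}
```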
We expect the Protection Table will often be sparsely populated, so an alternate structure could be more spatially efficient (e.g., a tree), or the table could be stored in system virtual memory and allocated on demand. However, the flat layout has small enough overhead that we do not evaluate alternate layouts.

3.1.2 Border Control Cache Design

While the Protection Table provides a correct implementation of Border Control, it requires a memory read for every request from the accelerator to memory, wasting memory bandwidth and energy. In addition, Border Control is on the critical path for accelerator cache misses. We therefore add a Border Control Cache (BCC) to store recently accessed page permissions. The BCC caches the Protection Table and is tagged by PPN. To take advantage of spatial locality across physical pages [29], we store information for multiple physical pages in the same entry, similar to a subblock TLB [30]. This reduces the tag overhead per entry. We fetch an entire block at a time from memory; in our memory system the block size is 128 bytes, which corresponds to read/write permissions for 512 pages per block. This gives our BCC a large reach—a 64-entry BCC contains permissions for 32K 4KB pages, for a total reach of 128MB. This cache is explicitly managed by the Border Control hardware and does not require hardware cache coherence, simplifying its design.

3.2 Border Control Operation

In this section, we describe the actions that occur on several important events. The information is summarized in Figure 3. We discuss actions taken by Border Control when (a) a process starts executing on the accelerator, (b) the ATS completes a translation, (c) a memory request reaches the Border Control, (d) the page table changes, and (e) the process releases the accelerator upon termination.

[Figure 3: Actions taken by Border Control on different events. Follows Section 3.2.]

3.2.1 Process Initialization (Figure 3a)

When a process starts executing on the accelerator, the OS sets up the ATS for the process by pointing it to the appropriate page table. If the accelerator was previously idle, the OS sets up the Protection Table by providing Border Control with a base register pointing to a zeroed region of physical memory. The OS also sets a bounds register to indicate the size of physical memory. If the accelerator was already running another process, the table will have already been allocated. Rather than setting the permissions for all pages at process initialization, Border Control initially assumes no permissions, then lazily updates the table. This avoids the overhead of having to populate the entire large table at startup. However, the Protection Table always satisfies the invariant that no page ever has read or write permission in the Protection Table if it does not have it according to the process page table.
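The initialization steps of Figure 3a can be summarized in a short sketch (hypothetical names; in the real design these are OS actions that program Border Control's registers rather than software allocation):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;         /* Protection Table base register */
    uint64_t bounds_pages; /* bounds register: size of physical memory */
    unsigned use_count;    /* processes currently using the accelerator */
} bc_state_t;

/* Figure 3a: allocate and zero the table only if the accelerator was
 * idle; otherwise reuse the existing table and bump the use count.
 * Permissions are then filled in lazily on ATS insertions (Figure 3b). */
void bc_process_start(bc_state_t *bc, uint64_t phys_mem_bytes) {
    if (bc->use_count == 0) {
        bc->bounds_pages = phys_mem_bytes >> 12;        /* 4KB pages */
        bc->base = calloc((size_t)((bc->bounds_pages * 2 + 7) / 8), 1);
    }
    bc->use_count++;
}
```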
3.2.2 Protection Table Insertion (Figure 3b)

When the accelerator has a TLB miss, it requests an address translation from the ATS, which performs the translation and sends the result to both the accelerator TLB and Border Control. The ATS checks that the address space ID provided by the accelerator corresponds to a process running on the accelerator. If there is an entry for this page in the BCC and it has the correct permissions, no action is taken. If there is an entry with fewer permissions (most commonly, because the entry contains information about many pages and this particular page has never been seen before), the entry is updated in the BCC, and the change is written back to the Protection Table. Otherwise, on a BCC miss, a new entry is allocated, and the BCC fetches the corresponding bits from the Protection Table and updates the permissions for the inserted page.

Updating the permissions stored at Border Control upon address translation reduces the complexity and overhead of our implementation. In a correctly functioning accelerator, the accelerator always receives a VPN-to-PPN translation from the ATS before making a memory request to that PPN. Thus, assuming zero permissions until a Protection Table insertion results in correct behavior without the overhead of determining and storing permissions for all entries in the process page table upon process initialization.

3.2.3 Accelerator Memory Request (Figure 3c)

Every accelerator memory request is checked by Border Control before it reaches the host memory system. Border Control first looks up the PPN in the BCC. On a BCC hit for the PPN, the permission bits are retrieved from the BCC entry. On a BCC miss, Border Control allocates an entry for the page, and fills it with the corresponding bits from the Protection Table. The Protection Table is only checked after ensuring that the physical address is within the bounds register.

If the request requires permissions not indicated in the BCC or Protection Table, Border Control notifies the OS. The OS can act accordingly by terminating the process or disabling the accelerator. In addition, any requested read data from the page is not returned to the accelerator, and any write request does not proceed.
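The check itself is simple; the sketch below models it in C (hypothetical names, with the 64-entry BCC reduced to a single entry for brevity, and the table assumed padded to whole 128-byte blocks). A request proceeds only if it survives the bounds check and the per-page permission bits allow it; otherwise it is dropped and, in hardware, the OS is notified via an exception.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { const uint8_t *bits; uint64_t num_pages; } table_t;
typedef struct { bool valid; uint64_t tag; uint8_t block[128]; } bcc_entry_t;

static bool perm_bit(const uint8_t *bits, uint64_t page, bool is_write) {
    uint64_t bit = page * 2 + (is_write ? 1 : 0);
    return (bits[bit / 8] >> (bit % 8)) & 1;
}

/* Figure 3c: returns true if the request may proceed to memory; false
 * means the access is blocked and a protection fault is raised. */
bool border_check(const table_t *t, bcc_entry_t *bcc,
                  uint64_t ppn, bool is_write) {
    if (ppn >= t->num_pages)              /* bounds-register check first */
        return false;
    uint64_t tag = ppn / 512;             /* 512 pages per 128-byte block */
    if (!bcc->valid || bcc->tag != tag) { /* BCC miss: fetch table block */
        for (int i = 0; i < 128; i++)
            bcc->block[i] = t->bits[tag * 128 + i];
        bcc->tag = tag;
        bcc->valid = true;
    }
    return perm_bit(bcc->block, ppn % 512, is_write);  /* BCC hit path */
}
```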
3.2.4 Memory Mapping Update (Figure 3d)

During process execution, the page table may be updated for several reasons. If a new translation from virtual to physical page number is added, Border Control takes no action (similar to the TLB). This case occurs frequently, because the OS lazily allocates physical pages to virtual pages. When an existing virtual-to-physical mapping changes, a TLB shootdown occurs, and Border Control also needs to take action. In a TLB shootdown, the TLB either has the offending entry invalidated, or the entire TLB is flushed. In either case, no further action is needed on a CPU or trusted accelerator: the TLB will obtain a new entry from the page walk, and any dirty blocks in the caches are stored by physical address and can be correctly written back to memory upon eviction. With Border Control, any dirty cache blocks from the affected page must be written back to memory before the Protection Table and BCC are updated on permission downgrades. Otherwise, if the accelerator is caching dirty copies of data from that page—even if it was legal for the accelerator to write to those addresses at the time—Border Control will report an error and block the subsequent writeback.

If a mapping changes and the page had read-only permission, the Protection Table and BCC entry can simply be updated, because no cached lines from the page can be dirty. If the page has a writable permission bit set in the Protection Table, then it has previously been accessed by the accelerator and the accelerator's caches may contain dirty blocks from the page. The accelerator can flush all blocks in its caches, or, as an optimization, selectively flush only blocks from the affected page. Once the cache flush is complete, the corresponding entry in the BCC and the Protection Table should be updated. Alternately, if the entire accelerator cache is flushed, the Protection Table can be zeroed and the BCC and accelerator TLB can be invalidated; these are equivalent in terms of correctness.

Mapping updates resulting in TLB shootdowns commonly occur for a few reasons: context switches, process termination, swapping, memory compaction, and copy-on-write. Context switches, process termination, memory compaction, and swapping are expected to be infrequent and slow operations. For copy-on-write pages, the mapping change will not incur an accelerator cache flush, because the page was previously read-only and thus could not have been dirty. Copy-on-write thus incurs no extra overhead over the trusted accelerator case. In our workloads, we never encountered a permission downgrade. We evaluate permission downgrade overheads in Section 5.2.4.

Even if the accelerator ignores the request to flush its caches, there is no security vulnerability. If the cache contains dirty blocks that are not flushed, they will be caught later when the accelerator cache attempts to write them back to memory. This will raise a permission error, and the writeback will be blocked.

3.2.5 Process Completion (Figure 3e)

When a process finishes executing on an accelerator, the accelerator caches should be flushed, all entries in the BCC and accelerator TLB invalidated, and the Protection Table zeroed. This ensures that all dirty data is written back to memory, and access permissions for that process are revoked from the accelerator. If the accelerator is not running any other processes, the OS can reclaim the memory from the Protection Table.

3.3 Multiprocess Accelerators

Border Control allows accelerators to safely use physically tagged caches, making it straightforward to run multiple processes simultaneously. The overhead of Border Control is per-accelerator and thus is not higher for multiprocess accelerators. When an accelerator is running multiple processes simultaneously, Border Control checks whether at least one process is able to access a page when determining permissions; the permissions we use are the union of those for all processes currently running on the accelerator, as sketched below.
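A sketch of this union policy (hypothetical helper): whichever co-scheduled process obtains a translation, its permission bits are simply OR-ed into the one shared Protection Table, so a page is accessible if any process running on the accelerator may access it.

```c
#include <stdint.h>

/* OR the requesting process's R/W bits into the shared per-accelerator
 * table; bits are never cleared here, only on permission downgrades
 * (Section 3.2.4) or process completion (Section 3.2.5). */
void pt_grant_union(uint8_t *table_bits, uint64_t ppn, int r, int w) {
    uint64_t bit = ppn * 2;
    if (r) table_bits[bit / 8]       |= (uint8_t)(1u << (bit % 8));
    if (w) table_bits[(bit + 1) / 8] |= (uint8_t)(1u << ((bit + 1) % 8));
}
```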
Although this design choice may seem to weaken the safety properties of Border Control, we argue that this is not the case. The purpose of Border Control is to protect the rest of the system from incorrect accelerator memory accesses. Just as Border Control does not inspect the computations internal to the black box of the accelerator, it cannot prevent a buggy accelerator from allowing incorrect accesses to data stored in accelerator caches by other co-scheduled processes. For example, if a process exposes sensitive data to the accelerator, the accelerator can immediately leak the data to another process address space to which it has write access, or store it internally and leak it later. The accelerator is sandboxed from the rest of the system, but interference between processes within the sandbox is possible in an incorrectly implemented accelerator. The OS can deal with security threats internal to the accelerator by simply not running sensitive or critical processes on untrusted accelerators. Dealing with misbehavior internal to the black box of the accelerator is outside the scope of our threat model.

3.4 Border Control Design Considerations

We briefly discuss several considerations with our Border Control implementation, including permission sources other than the page table, constraints on cache organization, and dealing with different page sizes.

3.4.1 Alternate Permission Sources

The access permissions stored in the process page tables are generally a good source of permission information for Border Control. However, in some cases, alternate sources may provide more fine-grained control. In particular, the OS might run an accelerator kernel directly. Because the OS has access to every page in the system, this would eliminate the memory protection from the accelerator. A simple way to handle this case is for the OS to provide an alternate (shadow) page table for the accelerator. This can already be done in implementations of the ATS as part of the IOMMU and requires no changes to the Border Control hardware.

Border Control can also be extended to work with alternate permission systems, provided that permissions correspond to physical addresses. For example, in Mondriaan Memory Protection [31], the accelerator keeps a Protection Lookaside Buffer (PLB) rather than a TLB. On a PLB miss, Border Control can update the Protection Table, just as it would on a TLB miss. Capabilities are another access control method [32]. The accelerator cannot be allowed to directly access capability metadata, or it could forge capabilities. However, if the capabilities correspond to physical addresses and can be cached in a structure at the accelerator, the Protection Table and BCC can store and verify these permissions. For permissions at finer granularities than 4KB pages, an alternate format for Border Control's Protection Table and BCC may be more appropriate, to reduce storage overhead.

3.4.2 Virtualization

We have defined Border Control to operate with an unvirtualized trusted OS. Border Control can also operate with a trusted Virtual Machine Monitor (VMM) below guest OSes. In this case, the VMM allocates the Protection Table in (host physical) memory that is inaccessible to guest OSes. The present implementation works unchanged because table indexing uses “bare-metal” physical addresses: unvirtualized or host.

3.4.3 Cache Organization Requirements

Our implementation of Border Control places minor constraints on accelerator cache organizations. One invariant Border Control requires of cache coherence is that an untrusted cache should never provide data for a block for which it does not have write permission. This is simple to enforce in an inclusive or non-inclusive M(O)(E)SI cache: the ownership of non-writable blocks should always remain with the directory or the trusted cache hierarchy—for example, by not allowing a read-only request to be answered with an owned E state. For exclusive caches, this invariant requires exclusive cache behavior to be slightly modified. Specifically, when a block is sent from a shared cache to the private accelerator cache, it is normally invalidated at the shared cache, so that there is at most one level of the cache hierarchy holding the block. If the block was dirty when sent to the accelerator, the copy in memory will be stale and the accelerator will need to write back data for a block for which it does not have write permission, violating the above invariant. A solution is to require that any dirty block requested by the accelerator with read-only permissions first be written back to memory.
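A highly simplified sketch of the invariant (a hypothetical directory-side predicate, not the full coherence protocol): ownership is granted to the untrusted cache only when the accelerator holds write permission, so the accelerator can never become the sole supplier of a block it could not legally have written.

```c
#include <stdbool.h>

typedef enum { GRANT_SHARED, GRANT_OWNED } grant_t;

/* Ownership of non-writable blocks stays with the directory or the
 * trusted cache hierarchy; dirty blocks requested read-only must be
 * written back to memory before being forwarded to the accelerator. */
grant_t directory_grant(bool untrusted_requester,
                        bool requester_has_write_perm,
                        bool write_request) {
    if (untrusted_requester && !requester_has_write_perm)
        return GRANT_SHARED;
    return write_request ? GRANT_OWNED : GRANT_SHARED;
}
```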
3.4.4 Page Size

We assumed above that the page size is 4KB, which is the minimum page size on most systems. However, some workloads see significant performance benefits when using larger page sizes. Our implementation of Border Control works with larger page sizes. When inserting a new translation for a large page, we can update the Protection Table and BCC entries for every 4KB page covered by the large page. Thus, for a 2MB page, Border Control updates 512 entries. This is the size of a single BCC entry or one memory block in our system with 128-byte blocks, so using 2MB pages does not cause any difficulties, as sketched below. More complex and efficient mechanisms may be required if future accelerators commonly use large pages.
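For example, a 2MB large page spans 512 contiguous 4KB pages, exactly one 128-byte Protection Table block (and one BCC entry), so the insertion below (a hypothetical helper) touches a single memory block:

```c
#include <stdint.h>

/* Grant R/W for all 512 4KB pages covered by a 2MB page whose first
 * physical page is base_ppn (assumed 2MB-aligned, so the loop stays
 * within one 128-byte Protection Table block). */
void pt_grant_2mb(uint8_t *table_bits, uint64_t base_ppn, int r, int w) {
    for (uint64_t p = base_ppn; p < base_ppn + 512; p++) {
        uint64_t bit = p * 2;
        if (r) table_bits[bit / 8]       |= (uint8_t)(1u << (bit % 8));
        if (w) table_bits[(bit + 1) / 8] |= (uint8_t)(1u << ((bit + 1) % 8));
    }
}
```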
4. BORDER CONTROL FAQ

We seek to anticipate and answer questions readers may have regarding Border Control and its alternatives.

Why not perfectly verify accelerator hardware? Perfect verification of accelerator hardware would eliminate the problems associated with malicious or buggy hardware design. However, accelerators that wish to access shared memory are likely to be too complex to easily validate, especially in the presence of die stacking and/or malicious hardware which is intended to be difficult to detect [33, 34]. Even first-party hardware has bugs, as shown by the errata released by companies such as Intel and AMD [17, 19–21, 35]. Border Control makes perfect verification unnecessary by sandboxing memory errors, so that they can only affect processes running on the imperfect accelerator.

Why not eschew third-party accelerators? Using only trusted (first-party) hardware could prevent the accelerator from generating incorrect memory requests, as all generation and propagation of physical addresses would occur in trusted hardware. However, we hypothesize that with the increasing prevalence of accelerators, there will be economic benefits to being able to purchase IP from third parties.

Why not forbid use of physical addresses? Instead of disallowing all caches, Border Control can also be implemented by preventing the accelerator from using physical addresses, but letting it use caches with virtual addresses. Then the virtual-to-physical translation occurs at the border, giving the accelerator no way to access physical pages belonging to other processes. There are a number of known drawbacks to virtual caches [36, 37], including problems with synonyms and homonyms. In particular, implementing coherence is significantly more difficult, especially because current CPU caches use physical addresses only. Recently proposed designs with virtual accelerator caches require modifications to the CPU coherence protocol (VIPS-M [38]), or do not allow homonyms or synonyms in the accelerator caches (FUSION [10]). Border Control allows standard cache coherence using physical addresses, including handling homonyms and synonyms.

Why not limit accelerators to “safe” memory ranges? The system could use static regions and mark ranges of physical addresses which the accelerator is allowed to access, requiring applications to (either explicitly or implicitly) copy data into the accelerator address space. However, copying data between address spaces incurs a performance penalty, and it may be difficult to determine a priori which memory the accelerator will use. Also, limiting data to a contiguous range may require rearranging the layout of data and updating pointers accordingly. Border Control allows full shared virtual memory between the accelerator and CPU.

Why not allow caches and TLBs and do address translation again at the border to verify? Border Control could be implemented by doing a backwards translation from physical to virtual addresses and checking permissions with the process page table. Both Linux and Windows contain OS structures to perform reverse translation for swapping. However, these structures are OS-specific, and thus not good candidates for being accessed by hardware. Instead, this would require frequent traps into the OS, which may have performance impacts both on the accelerator process and host processes. Our implementation of Border Control relies on the insight that performing a reverse translation is not required to determine access permissions.

5. EVALUATION

5.1 Methodology

Border Control aims to provide safety without significant performance or storage overheads. We quantitatively evaluate Border Control and several other safety approaches on the GPGPU, a high-performance accelerator which is capable of high memory traffic rates and irregular memory reference patterns. A GPGPU is a stress-test for memory safety mechanisms. We summarize the configurations for each approach in Table 2 and describe them in detail below.

We evaluate five approaches to memory safety. We use the unsafe ATS-only IOMMU as a baseline, where the IOMMU serves only to perform initial address translation but the GPU uses physical addresses in its TLB and caches. This enables the GPU to use all performance optimizations, and is the norm in coherent integrated GPUs today, but does not satisfy our security goal.

Table 2: Comparison of configurations under study.

                        Safe?   L1 $   L1 TLB   L2 $   BCC
ATS-only IOMMU          no      yes    yes      yes    N/A
Full IOMMU              yes     --     --       --     N/A
CAPI-like               yes     --     --       yes    N/A
Border Control-noBCC    yes     yes    yes      yes    --
Border Control-BCC      yes     yes    yes      yes    yes

The full IOMMU configuration is a simple approach to safety, appropriate for low-performance accelerators. For the IOMMU to enforce safety, the accelerator must issue every memory request as a virtual address to the IOMMU, which performs translation and permissions checking. We therefore remove the accelerator caches and accelerator TLB, but leave the L2 TLB because the IOMMU caches translations. Although this configuration is unrealistic for high-performance, high-bandwidth accelerators like GPUs, we include it to illustrate the challenges of providing safety using existing mechanisms.

The CAPI-like configuration is modeled on the philosophy of IBM CAPI, where the accelerator caches and TLB are implemented in the trusted system. Thus, they are more distant and less tightly integrated with the accelerator. We model the longer latency to the trusted accelerator cache by including only a shared L2 cache.

We evaluate two Border Control configurations. First, Border Control-noBCC includes the Protection Table for safety but does not include the BCC, to show the impact of caching the permission information.
Border Control allows standard GPU performance optimizations: accelerator L1 and L2 caches and accelerator TLBs. The Border Control-BCC configuration adds in the BCC for higher performance.

To explore the varying pressures different accelerators put on Border Control, we evaluate two GPU configurations: a highly threaded GPU and a moderately threaded GPU. The highly threaded GPU is similar to a modern integrated GPU (e.g., AMD Kaveri [14]) with eight compute units, each with many execution contexts. This configuration is a proxy for a high-performance, latency-tolerant accelerator. The moderately threaded GPU has a single compute unit and can execute a single workgroup (thread block) at a time, but it executes multiple wavefronts (execution contexts). This configuration is a proxy for a more latency-sensitive accelerator.

We use a variety of workloads to explore a diverse set of high-performance accelerator behaviors. We evaluate Border Control with workloads from the Rodinia benchmark suite [39], including machine-learning, bioinformatics, graph, and scientific workloads. These workloads range from regular memory access patterns (e.g., lud) to irregular, data-dependent accesses (e.g., bfs). They use a unified address space between the CPU and GPU rather than explicit memory copies.

Table 3: Simulation configuration details.

CPU
  CPU Cores                        1
  CPU Caches                       64KB L1, 2MB L2
  CPU Frequency                    3 GHz
GPU
  Cores (highly threaded)          8
  Cores (moderately threaded)      1
  Caches (highly threaded)         16KB L1, shared 256KB L2
  Caches (moderately threaded)     16KB L1, shared 64KB L2
  L1 TLB                           64 entries
  Shared L2 TLB (trusted)          512 entries
  GPU Frequency                    700 MHz
Memory System
  Peak Memory Bandwidth            180 GB/s
Border Control
  BCC Size                         8KB
  BCC Access Latency               10 cycles
  Protection Table Size            196KB
  Protection Table Access Latency  100 cycles

To simulate our system, we use the open-source gem5-gpu simulator [40], which combines gem5 [41] and GPGPU-Sim [42]. We integrated our Border Control implementation into the coherence protocol. Table 3 contains the details of the system we simulated. We use a MOESI cache coherence protocol with a null directory for coherence between the CPU and the GPU. Within the GPU, we use a simple write-through coherence protocol. This system is similar to current systems with an integrated GPU (e.g., AMD Kaveri [14]), and has increased memory bandwidth to simulate future systems.

5.2 Results

To better understand the tradeoff between performance and safety, we evaluate the runtime overheads of Border Control and the other approaches to safety compared to the unsafe baseline (ATS-only IOMMU). Figure 4 shows the runtime overhead of the approaches we evaluate, for both GPU configurations.

ATS-only IOMMU: The ATS-only IOMMU has no overhead, and all other configurations are normalized to it. However, it does not provide any safety guarantees.

Full IOMMU: The first bar (red) in Figure 4 shows the impact of the full IOMMU: a geometric mean of 374% runtime overhead for the highly threaded case and 85% for the moderately threaded case. The overhead is higher in the highly threaded case because the 8-core GPU is capable of issuing a high bandwidth of requests, and without the L2 to filter some of this bandwidth, the DRAM is overwhelmed and performance suffers. The moderately threaded case does not saturate the DRAM bandwidth and sees less performance degradation, but runtime overhead is still impractically high.

CAPI-like (second, blue bar of Figure 4):
The average runtime overhead for the CAPI-like configuration is only 3.81% in the highly threaded case. There are a few benchmarks (e.g., pathfinder, hotspot) that show almost no performance degradation. This is not surprising, as the GPU is designed to tolerate high memory latency. However, the moderately threaded case has higher overhead, at 16.5% on average. This is because the moderately threaded GPU does not have enough execution contexts to hide the increased memory latency. The CAPI-like configuration has lower overhead than the full IOMMU, but may still cause significant performance degradation, especially for more latency-sensitive accelerators.

[Figure 4: Runtime overhead compared to ATS-only IOMMU (which has 0% overhead). (a) Highly threaded GPU. (b) Moderately threaded GPU.]

Border Control-noBCC (third, purple bar of Figure 4): Border Control without a BCC has on average 2.04% runtime overhead for highly threaded and 7.26% for moderately threaded. Similarly to the CAPI-like configuration, the extra latency and bandwidth from accessing the Protection Table affects the moderately threaded GPU more than the highly threaded GPU. Even without a BCC, Border Control reduces execution overhead compared to the IOMMU and CAPI-like configurations.

Border Control-BCC (fourth, green bar of Figure 4): Border Control with a BCC has an average runtime overhead for the highly threaded GPU of just 0.15%, and 0.84% for the moderately threaded case. The BCC eliminates the additional memory access on each accelerator memory access, improving performance.

In summary, Border Control with a BCC provides safety with almost no overhead over the unsafe baseline. For the highly threaded case, Border Control attains an average of 4.74× speedup over the full IOMMU and 1.04× over the CAPI-like configuration. For the moderately threaded case, Border Control is on average 1.83× faster than the full IOMMU and 1.16× faster than the CAPI-like configuration.

5.2.1 Border Control Requests

Figure 5 shows the number of requests per cycle checked by Border Control for the highly threaded GPU. On average, we saw approximately 0.11 requests per cycle, but there is significant variability: from 0.025 for backprop to 0.29 for bfs. From this data, we conclude that bandwidth at Border Control is not a bottleneck. This is not surprising since the private accelerator caches provide good bandwidth filtering for the rest of the system.

[Figure 5: Number of requests per cycle checked by Border Control for the highly threaded GPU.]

5.2.2 BCC Sensitivity Analysis

We show the results of a sensitivity analysis of BCC size in Figure 6, which gives the miss ratio at the BCC as size increases. Each line represents a different BCC entry size, from 1 page/entry (2 bits) up to 512 pages/entry (1024 bits), with the addition of a 36-bit tag per entry. The miss ratio is averaged over the benchmarks.

[Figure 6: BCC miss ratio when varying BCC size and number of pages per entry.]
For the workloads we evaluate, storing multiple pages per entry shows a large benefit, especially when storing 512 pages (a reach of 2MB) per entry. Since there is spatial locality, larger entries reduce the per-page tag overhead. For a 1KB BCC with 512 pages/entry, the average BCC miss rate is below 0.1% for our workloads. However, we conservatively use an 8KB BCC, as it is still a small area overhead and may be more appropriate for larger, future workloads.

5.2.3 Area and Memory Storage Overheads

Our Border Control mechanism has minimal space overhead. The Protection Table uses 0.006% of physical memory capacity per active accelerator. The BCC was effective for our workloads with 64 entries of 128 bytes each, for a total of 8KB.

5.2.4 Sensitivity to Memory Mapping Updates

When memory mappings are updated and page permissions are downgraded, any accelerator must finish all outstanding requests (where most of the downgrade time is spent), invalidate its TLB entries, and the ATS must flush its caches. These actions occur even with trusted accelerators. Our implementation of Border Control maintains safety and correctness by also flushing the accelerator caches, the Protection Table, and the BCC, which incurs additional overhead (details discussed in Section 3.2.4). Figure 7 shows that the overhead for permission downgrades is negligible (approximately 0.02%) in the most common case of downgrades today, context switches (10–200 downgrades per second). Additionally, the overhead for permission downgrades continues to be small even when the rate of downgrades is much higher, as may be common in future systems. Figure 7 also shows that Border Control incurs roughly twice the overhead of the unsafe ATS-only IOMMU. This increased overhead is mostly due to writing back dirty data in the accelerator L2 cache (highly threaded) or compulsory cache misses (moderately threaded).
Intel Trusted Execution Technology is a hardware extension that protects against attacks by malicious software and firmware [50]. It protects the launch stack and boot process from threats, but not against malicious or buggy hardware components in the system. 6. 7. Border Control-BCC Highly threaded Border Control-BCC Moderately threaded ATS-only IOMMU Highly threaded ATS-only IOMMU Moderately threaded Runtime overhead 0.4% 0.3% Linux Scheduling Rate 0.2% 0.1% 0.0% 0 200 400 600 800 Permission downgrades per second 1000 Figure 7: Runtime overhead with varying frequencies of page permission downgrades. 5.2.4 Sensitivity to Memory Mapping Updates RELATED WORK Some previous work is similar to Border Control in intent or mechanism. Current commercial approaches to protection against bad memory accesses were discussed in Section 2.3. We briefly discuss other work below. rIOMMU [43] decreases IOMMU overheads for highbandwidth I/O devices such as NICs and PCIe SSD drives. It drastically reduces IOMMU overheads while ensuring safety, but relies on the device using ring buffers, which are accessed in a predictable fashion. rIOMMU is not intended for devices that make unpredictable finegrained memory accesses. Hardware Information Flow Tracking (IFT) can provide comprehensive system safety guarantees though tracking untrusted data at the gate level [44]. However, this protection incurs large overheads relative to unsafe systems. Border Control focuses on one specific safety guarantee, but works with existing systems with very low overhead. Hive [45] is an operating system designed for the Stanford FLASH [46] to provide fault containment. It CONCLUSION Specialized hardware accelerators, including ones that directly access host system memory, are becoming more prevalent and more powerful, leading to new security and reliability challenges. In particular, incorrect memory accesses from accelerators can cause leaks, corruption, or crashes even for processes not running on the accelerator. Border Control provides full protection against incorrect memory accesses while maintaining high performance, and with low storage overhead. 8. ACKNOWLEDGEMENTS We would like to thank Arka Basu, Tony Nowatzki, Marc Orr, Guri Sohi, Mike Swift, and Hongil Yoon for feedback on this work. This work is supported in part by the National Science Foundation (CCF-1218323, CNS1302260, CCF-1438992, CCF-1533885), Cisco Systems Distinguished Graduate Fellowship, John P. Morgridge Chair, Google, and the University of Wisconsin–Madison (Named Professorship). Hill and Wood have a significant financial interest in AMD and Google. 9. REFERENCES [1] “Regulation (EC) No 562/2006 of the European Parliament and of the Council of 15 March 2006 establishing a Community Code on the rules governing the movement of persons across borders (Schengen Borders Code),” 2006. [2] W.-C. Park, H.-J. Shin, B. Lee, H. Yoon, and T.-D. Han, “RayChip: Real-time ray-tracing chip for embedded applications,” in Hot Chips 26, 2014. “Efficient software-based fault isolation,” in SOSP 14, 1993. [27] J. Saltzer and M. Schroeder, “The protection of information in computer systems,” Proc. of the IEEE, vol. 63, Sept 1975. [28] ARM Security Technology: Building a Secure System using TrustZone Technology. [29] M. Gorman, “Understanding the Linux virtual memory manager,” 2004. [3] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in MICRO-45, 2012. [30] M. Talluri and M. D. 
[4] O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ranganathan, “Meet the walkers: Accelerating index traversals for in-memory databases,” in MICRO-46, 2013.

[5] S. Phillips, “M7: Next generation SPARC,” in Hot Chips 26, 2014.

[6] K. Atasu, R. Polig, C. Hagleitner, and F. R. Reiss, “Hardware-accelerated regular expression matching for high-throughput text analytics,” in FPL 23, 2013.

[7] V. Rajagopalan, “All programmable devices: Not just an FPGA anymore,” MICRO-45, 2013. Keynote presentation.

[8] B. Black, “Die stacking is happening!,” MICRO-45, Dec. 2013. Keynote presentation.

[9] P. Rogers, “Heterogeneous system architecture overview,” in Hot Chips 25, 2013.

[10] S. Kumar, A. Shriraman, and N. Vedula, “Fusion: Design tradeoffs in coherent cache hierarchies for accelerators,” in ISCA 42, 2015.

[11] J. Sell and P. O’Connor, “The Xbox One system on a chip and Kinect sensor,” IEEE Micro, vol. 34, Mar. 2014.

[12] J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel, “CAPI: A coherent accelerator processor interface,” IBM Journal of Research and Development, vol. 59, pp. 7:1–7:7, Jan. 2015.

[13] P. Hammarlund, “4th generation Intel Core processor, codenamed Haswell,” in Hot Chips 26, 2014.

[14] AMD, “AMD’s most advanced APU ever.” http://www.amd.com/us/products/desktop/processors/aseries/Pages/nextgenapu.aspx.

[15] J. Goodacre, “The evolution of the ARM architecture towards big data and the data-centre.” http://virtical.upv.es/pub/sc13.pdf, Nov. 2013.

[16] US Department of Defense, “Defense science board task force on high performance microchip supply,” 2005.

[17] MIPS Technologies, Inc., MIPS R4000PC/SC Errata, Processor Revision 2.2 and 3.0, May 1994.

[18] “Zynq-7000 all programmable SoC.” http://www.xilinx.com/products/silicondevices/soc/zynq-7000.html, 2014.

[19] A. L. Shimpi, “AMD’s B3 stepping Phenom previewed, TLB hardware fix tested,” Mar. 2008.

[20] Intel Corporation, “Intel Core i7-900 desktop processor extreme edition series and Intel Core i7-900 desktop processor series specification update.” http://download.intel.com/design/processor/specupdt/320836.pdf, May 2011.

[21] Intel Xeon Processor E5 Family: Specification Update, Jan. 2014.

[22] AMD I/O Virtualization Technology (IOMMU) Specification, Revision 2.00, Mar. 2011.

[23] Intel Virtualization Technology for Directed I/O, Revision 2.3, Oct. 2014.

[24] ARM System Memory Management Unit Architecture Specification, SMMU architecture version 2.0, 2012-2013.

[25] J. Stuecheli, “Power8,” in Hot Chips 25, 2013.

[26] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham, “Efficient software-based fault isolation,” in SOSP 14, 1993.

[27] J. Saltzer and M. Schroeder, “The protection of information in computer systems,” Proc. of the IEEE, vol. 63, Sept. 1975.

[28] ARM Security Technology: Building a Secure System using TrustZone Technology.

[29] M. Gorman, “Understanding the Linux virtual memory manager,” 2004.

[30] M. Talluri and M. D. Hill, “Surpassing the TLB performance of superpages with less operating system support,” in ASPLOS VI, 1994.

[31] E. Witchel, J. Cates, and K. Asanovic, “Mondrian memory protection,” in ASPLOS X, 2002.

[32] H. M. Levy, Capability-Based Computer Systems. Digital Press, 1984.

[33] A. Waksman and S. Sethumadhavan, “Silencing hardware backdoors,” in SP, 2011.

[34] C. Sturton, M. Hicks, D. Wagner, and S. T. King, “Defeating UCI: Building stealthy and malicious hardware,” in SP, 2011.

[35] D. Price, “Pentium FDIV flaw-lessons learned,” IEEE Micro, vol. 15, pp. 86–88, Apr. 1995.

[36] M. Cekleov and M. Dubois, “Virtual-address caches, part 1: Problems and solutions in uniprocessors,” IEEE Micro, vol. 17, Sept. 1997.

[37] M. Cekleov and M. Dubois, “Virtual-address caches, part 2: Multiprocessor issues,” IEEE Micro, vol. 17, Nov. 1997.

[38] S. Kaxiras and A. Ros, “A new perspective for efficient virtual-cache coherence,” in ISCA 40, 2013.

[39] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in IISWC, 2009.

[40] J. Power, J. Hestness, M. S. Orr, M. D. Hill, and D. A. Wood, “gem5-gpu: A heterogeneous CPU-GPU simulator,” Computer Architecture Letters, vol. 13, no. 1, 2014.

[41] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” Computer Architecture News, 2011.

[42] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, “Analyzing CUDA workloads using a detailed GPU simulator,” in ISPASS, 2009.

[43] M. Malka, N. Amit, M. Ben-Yehuda, and D. Tsafrir, “rIOMMU: Efficient IOMMU for I/O devices that employ ring buffers,” in ASPLOS 20, 2015.
[44] M. Tiwari, J. K. Oberg, X. Li, J. Valamehr, T. Levin, B. Hardekopf, R. Kastner, F. T. Chong, and T. Sherwood, “Crafting a usable microkernel, processor, and I/O security system with strict and provable information flow security,” in ISCA 38, 2011.

[45] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta, “Hive: Fault containment for shared-memory multiprocessors,” in SOSP 15, 1995.

[46] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, “The Stanford FLASH multiprocessor,” in ISCA 21, 1994.

[47] E. J. Koldinger, J. S. Chase, and S. J. Eggers, “Architectural support for single address space operating systems,” in ASPLOS V, 1992.

[48] J. Wilkes and B. Sears, “A comparison of protection lookaside buffers and the PA-RISC protection architecture,” Tech. Rep. HPL-92-55, Hewlett Packard Labs, 1992.

[49] G. M. Amdahl, G. A. Blaauw, and F. P. Brooks, “Architecture of the IBM System/360,” IBM Journal of Research and Development, vol. 8, pp. 87–101, Apr. 1964.

[50] J. Greene, “Intel trusted execution technology,” Intel Technology Whitepaper, 2012.