Bottleneck Identification and Scheduling in Multithreaded Applications

José A. Joao, ECE Department, The University of Texas at Austin, joao@ece.utexas.edu
M. Aater Suleman, Calxeda Inc., aater.suleman@calxeda.com
Onur Mutlu, Computer Architecture Lab., Carnegie Mellon University, onur@cmu.edu
Yale N. Patt, ECE Department, The University of Texas at Austin, patt@ece.utexas.edu

Abstract

Performance of multithreaded applications is limited by a variety of bottlenecks, e.g., critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. This paper proposes Bottleneck Identification and Scheduling (BIS), a cooperative software-hardware mechanism to identify and accelerate the most critical bottlenecks. BIS identifies which bottlenecks are likely to reduce performance by measuring the number of cycles threads have to wait for each bottleneck, and accelerates those bottlenecks using one or more fast cores on an Asymmetric Chip Multi-Processor (ACMP). Unlike previous work that targets specific bottlenecks, BIS can identify and accelerate bottlenecks regardless of their type. We compare BIS to four previous approaches and show that it outperforms the best of them by 15% on average. BIS' performance improvement increases as the number of cores and the number of fast cores in the system increase.

Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)

General Terms: Design, Performance

Keywords: Critical sections, barriers, pipeline parallelism, multicore, asymmetric CMPs, heterogeneous CMPs

1. Introduction

Speeding up a single application using Chip Multi-Processors (CMPs) requires that applications be split into threads that execute concurrently on multiple cores. In theory, a CMP can provide speedup equal to the number of cores; in practice, however, the speedup is often lower because of thread waiting caused by serializing bottlenecks. These bottlenecks not only limit performance but also limit application scalability, e.g., performance of the mysql database peaks at 16 threads (Section 5.1.1).

Programmers are often burdened with the task of identifying and removing serializing bottlenecks. This solution is too costly because (a) writing correct parallel programs is already a daunting task, and (b) serializing bottlenecks change with machine configuration, program input set, and program phase (as we show in Section 2.2); thus, what may seem like a bottleneck to the programmer may not be a bottleneck in the field, and vice versa. Hence, a solution that can identify and remove these bottlenecks at run-time, without requiring programmer effort, can not only increase parallel program performance and scalability but also reduce the work of programmers.

An Asymmetric CMP (ACMP) consists of at least one large, fast core and several small, slow cores. (We use the terms fast and large, and slow and small, interchangeably, to refer to the two types of cores on an ACMP.) Previous research has shown that the faster cores of an ACMP can be used to accelerate serializing bottlenecks. However, these proposals lack generality and fine-grained adaptivity. For example, several proposals [5, 13, 21, 26] use the fast core to accelerate Amdahl's serial portions (i.e., portions of a program where only one thread exists), but they do not handle serializing bottlenecks that occur in the parallel portions of the code. Accelerated Critical Sections (ACS) [36] accelerates critical sections (code regions that only one thread can execute at a given time), but it does not always accelerate the critical sections that are currently most critical for performance, and it is limited to ACMPs with a single large core. Meeting Points [8] accelerates the thread with the longest expected completion time in the parallel region to reduce barrier synchronization overhead, but it only applies to barriers in statically scheduled workloads, where the work to be performed by each thread is known before runtime. Feedback-Directed Pipelining (FDP) [34, 37], a software-only library, assigns threads to cores to balance stage throughput on pipelined workloads and improve performance or reduce power consumption, but it is slow to adapt to phase changes because it is software-based. Realistic workloads contain multiple different bottlenecks that can become critical at different execution phases, as we show in this paper. Therefore, it is important to have a single mechanism that is general and adaptive enough to accelerate the most critical bottlenecks at any given time.

We propose Bottleneck Identification and Scheduling (BIS), a general cooperative software-hardware framework to identify the most critical serializing bottlenecks at runtime and accelerate them using one or multiple large cores on an ACMP. The key insight for identification is that the most critical serializing bottlenecks are likely to make other threads wait for the greatest amount of time. The programmer, compiler or library delimits potential bottlenecks using BottleneckCall and BottleneckReturn instructions, and replaces the code that waits for bottlenecks with a BottleneckWait instruction. The hardware uses these instructions to measure thread waiting cycles (i.e., the number of cycles threads wait due to the bottleneck) for each bottleneck and tracks them in a Bottleneck Table. The bottlenecks with the highest number of thread waiting cycles are selected for acceleration on one or more large cores. On executing a BottleneckCall instruction, the small core checks if the bottleneck has been selected for acceleration. If it has, the small core enqueues an execution request into the Scheduling Buffer of a large core. The large core dequeues each request from its Scheduling Buffer, executes the bottleneck, and notifies the small core once it encounters the BottleneckReturn instruction.

This paper makes three main contributions:

1. We propose a cooperative hardware-software mechanism to identify the most critical bottlenecks of different types, e.g., critical sections, barriers and pipeline stages, allowing the programmer, the software or the hardware to remove or accelerate those bottlenecks. To our knowledge, this is the first such proposal.

2. We propose an automatic acceleration mechanism that decides which bottlenecks to accelerate and where to accelerate them. We use the fast cores on an ACMP as hardware-controlled fine-grain accelerators for serializing bottlenecks. By using hardware support, we minimize the overhead of execution migration and allow for quick adaptation to changing critical bottlenecks.

3. This is the first paper to explore the trade-offs of bottleneck acceleration using ACMPs with multiple large cores. We show that multiple large cores improve performance when multiple bottlenecks are similarly critical and need to be accelerated simultaneously. (We also evaluate large cores with Simultaneous Multithreading (SMT) in Section 5.2.)

Our evaluation shows that BIS improves performance by 37% over a 32-core Symmetric CMP and by 32% over an ACMP (which only accelerates serial sections) with the same area budget. BIS also outperforms recently proposed dynamic mechanisms (ACS for non-pipelined workloads and FDP for pipelined workloads) on average by 15% on a 1-large-core, 28-small-core ACMP. We also show that BIS' performance improvement increases as the number of cores and the number of large cores increase.
2. Motivation

2.1 Bottlenecks: What They Are and Why They Should Be Accelerated

We define a bottleneck as any code segment for which threads contend. A bottleneck may consist of a single thread or multiple threads that need to reach a synchronization point before other threads can make progress. We call the threads that stall due to a bottleneck waiters and the threads executing a bottleneck executers. A single instance of a bottleneck can be responsible for one or many waiters. The effect of bottlenecks on performance can be substantial because every cycle a bottleneck is running is a cycle wasted on every one of the waiters of that bottleneck. Bottlenecks cause thread serialization. Thus, a parallel program that spends a significant portion of its execution in bottlenecks can lose some or even all of the speedup that could be expected from parallelization. Next we describe examples of bottlenecks.

Figure 1. Examples of bottleneck execution: (a) critical section, (b) barrier, (c) pipeline stages.

2.1.1 Amdahl's Serial Portions

When only one thread exists [4], it is on the critical path and should be scheduled on the fastest core to minimize execution time. Meanwhile, all other cores are idle.

2.1.2 Critical Sections

Critical sections ensure mutual exclusion, i.e., only one thread can execute a critical section at a given time, and any other thread wanting to execute the critical section must wait. That is, there is only one executer and possibly multiple waiters. Figure 1(a) shows the execution of four threads (T1-T4) that run a simple loop consisting of a critical section C1 and a non-critical section N. Threads can execute in parallel until time 30, at which point T3 acquires the lock and prevents T1 and T4 from doing so. Then, T1 and T4 become idle, waiting on T3. Accelerating the execution of C1 on T3 during this time would not only accelerate T3 and reduce the total execution time (because this instance of code segment C1 is on the critical path), but also reduce the wait for both T1 and T4. However, accelerating the same C1 on T2 between times 20 and 30 would not help, because all the other threads are running useful work and no thread is waiting for the lock.

2.1.3 Barriers

Threads that reach a barrier must wait until all threads reach the barrier. There can be multiple executers and multiple waiters. Figure 1(b) shows the execution of four threads reaching a barrier. T1 and T4 reach the barrier at time 30 and start waiting on T2 and T3. T2 reaches the barrier at time 40, leaving T3 as the only running thread until time 60, when T3 reaches the barrier and all four threads can continue. Therefore, every cycle saved by accelerating T3 gets subtracted from the total execution time, up to the difference between the arrivals of T2 and T3 at the barrier.

2.1.4 Pipeline Stages

In a pipelined parallel program, loop iterations are split into stages that execute on different threads. Threads executing other pipeline stages wait for the slowest pipeline stages. There can be multiple executers and multiple waiters. Figure 1(c) shows the execution of a pipelined program with three stages (S1-S3) on four threads. S1 is scalable and runs on both T1 and T2. S2 is not scalable, so it runs on a single thread (T3). S3 is shorter and runs on T4. The throughput of a pipeline is the throughput of its slowest stage. Since S2 has a smaller throughput than S1, S2's input queue gets full and S1 has to stall, producing work for S2 only at the pace S2 can consume it, which introduces idle time on T1 and T2. On the other hand, S3 is faster than S2 and finishes each iteration before S2 can feed it more work to do, which introduces idle time on T4. Therefore, T1, T2 and T4 are all waiters on T3 at some point. If S2 could be accelerated by a factor of 2, the pipeline throughput would be balanced (i.e., each stage would deal with one iteration every 5 time units) and all idle time would be eliminated. Since S2 is not scalable, this acceleration is only feasible with higher single-thread performance on T3. Note that only accelerating the actual slowest stage improves performance; accelerating any other stage does not. Also note that the best way to accelerate a scalable stage is to add more threads (cores) to work on it.
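In code, the waiting in Figure 1(c) shows up in each stage's work loop whenever its input queue is empty or its output queue is full. The following is a minimal C sketch of such a stage worker; it is not taken from the paper, and the bounded-queue type and helpers (queue_pop, queue_push, process) are assumed to exist elsewhere.

    #include <sched.h>
    #include <stdbool.h>

    typedef struct work  work_t;
    typedef struct queue queue_t;                      /* bounded FIFO, defined elsewhere */
    extern bool    queue_pop(queue_t *q, work_t **w);  /* false if the queue is empty */
    extern bool    queue_push(queue_t *q, work_t *w);  /* false if the queue is full  */
    extern work_t *process(work_t *w);                 /* this stage's own work       */

    void stage_worker(queue_t *in, queue_t *out) {
        for (;;) {
            work_t *w;
            while (!queue_pop(in, &w))    /* input empty: this thread is a waiter */
                sched_yield();            /* on the previous (slower) stage       */
            work_t *r = process(w);
            while (!queue_push(out, r))   /* output full: this thread is a waiter */
                sched_yield();            /* on the next (slower) stage           */
        }
    }

A thread spinning in either while loop is a waiter in the sense defined above, and the stage it waits for is the executer.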
From these examples, we can conclude that accelerating the bottlenecks that are making other threads wait is likely to improve multithreaded program performance.

2.2 How Bottlenecks Change Over Time

Let's first consider a simple example, Program 1, where elements X are moved from linked list A to linked list B. Removal from A and insertion into B are protected by critical sections to avoid races that could corrupt the lists. The amount of work performed inside each of those critical sections varies over time, because list A is shrinking while list B is growing.

Program 1. Example of time-variant critical sections
    A = full linked list; B = empty linked list
    repeat
        Lock A
        Traverse list A
        Remove X from A
        Unlock A
        Compute on X
        Lock B
        Traverse list B
        Insert X into B
        Unlock B
    until A is empty

Figure 2 shows that contention (i.e., the number of threads waiting to enter each critical section) with 32 threads changes as follows: in the first phase, critical section A is the bottleneck, while in the second phase, critical section B becomes the bottleneck. Successful acceleration of this program at runtime requires dynamic identification of which code is the current critical bottleneck and then running it on a fast core.

Figure 2. Contention for each critical section on Program 1.
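For concreteness, Program 1 maps onto POSIX threads roughly as follows. This is a sketch rather than the paper's code: the list type and its helpers (list_empty, list_remove_one, list_insert, compute) are assumed, and only the locking structure mirrors Program 1.

    #include <pthread.h>

    /* Hypothetical list type and helpers; Program 1 specifies the locking
     * structure, not the list implementation. */
    typedef struct list list_t;
    extern list_t A, B;                            /* A starts full, B starts empty */
    extern pthread_mutex_t lock_A, lock_B;
    extern int   list_empty(list_t *l);
    extern void *list_remove_one(list_t *l);       /* traverse A and remove X */
    extern void  list_insert(list_t *l, void *x);  /* traverse B and insert X */
    extern void  compute(void *x);

    void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock_A);           /* critical section A       */
            if (list_empty(&A)) {
                pthread_mutex_unlock(&lock_A);
                break;
            }
            void *x = list_remove_one(&A);         /* gets cheaper as A shrinks */
            pthread_mutex_unlock(&lock_A);

            compute(x);                            /* non-critical work         */

            pthread_mutex_lock(&lock_B);           /* critical section B        */
            list_insert(&B, x);                    /* gets costlier as B grows  */
            pthread_mutex_unlock(&lock_B);
        }
        return NULL;
    }

Because the traversal work shifts from list A to list B over time, the lock guarding the longer list is the one threads pile up on, which produces the two phases shown in Figure 2.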
Similar examples exist in important real-life workloads. For example, the most contended critical section in the widely used mysql database changes over short periods of time. The two most contended critical sections in mysql with the OLTP-nontrx input set are LOCK_open and LOCK_log. LOCK_open is used to ensure mutual exclusion while opening and closing tables in the open_table and close_thread_tables functions. LOCK_log is used to protect access to the log file, which happens in the MYSQL_LOG::write function. Figure 3 shows contention for LOCK_open and LOCK_log over the first 8M cycles of mysql with 16 threads (the optimal number of threads) while being driven by SysBench [3] (a widely used multi-threaded benchmark to measure system and database performance) with the OLTP-nontrx input set (mysql-3 in our benchmark set, Table 6). The graph shows three clearly different phases: A) the initial use of LOCK_log by every thread; B) the heavy use of LOCK_open while all threads simultaneously want to open the tables they need; C) the steady-state period where transactions continue executing and LOCK_log is usually the most contended, but LOCK_open sporadically gets highly contended (e.g., around 4.8 Mcycles and 7.7 Mcycles). This last phase continues for the rest of the program: the most contended lock changes intermittently for the duration of this phase. The temporal behavior of the critical sections depends entirely on the database operations that are executed, i.e., the input set. For example, with the OLTP-complex input set, LOCK_open gets highly contended even more often.

Figure 3. Most contended critical sections on mysql.

Previously proposed ACS [36] can accelerate only one critical section at a time on a single large core (or two critical sections on a 2-way SMT large core, but this reduces the benefit of acceleration due to resource sharing between SMT threads). When there are multiple critical sections contending to be accelerated, ACS disables acceleration of critical sections that are less frequent, in order to avoid false serialization (i.e., serialized execution of different critical sections at the large core). Periodically (every 10M instructions in [36]), ACS re-enables acceleration of all critical sections, allowing for some false serialization while it relearns which critical sections should be disabled. This mechanism works well when contention for critical sections is stable for long periods, but it is suboptimal for applications like mysql because it ignores the frequent changes in which critical section is currently the most limiting bottleneck, as shown in Figure 3. In addition, ACS is specific to critical sections and cannot handle other types of bottlenecks such as barriers and slow pipeline stages.

Our goal is to design a mechanism that identifies at any given time which bottlenecks of any type are currently limiting performance, and accelerates those bottlenecks.

3. Bottleneck Identification and Scheduling

3.1 Key Insight

The benefit of parallel computing comes from concurrent execution of work; the higher the concurrency, the higher the performance. Every time threads have to wait for each other, less work gets done in parallel, which reduces parallel speedup and wastes opportunity. To maximize performance and increase concurrency, it is pivotal to minimize thread waiting as much as possible. Consequently, parallel programmers invest a large amount of effort in reducing bottlenecks that can lead to thread waiting; for example, programmers spend enormous effort reducing false sharing, thread synchronization, and thread load imbalance. Previous research on computer architecture also targets thread waiting, e.g., transactional memory [12], primitives like test-and-test-and-set [31], lock elision [29], and speculative synchronization [24]. We use this conventional wisdom (that reducing thread waiting improves performance) to identify the best bottlenecks for acceleration. Our key idea is thus simple: measure the number of cycles spent by threads waiting for each bottleneck, and accelerate the bottlenecks responsible for the highest thread waiting cycles. Our proposal, BIS, consists of two parts: identification of critical bottlenecks and acceleration of those bottlenecks.
3.2 Identifying Bottlenecks

Identification of critical bottlenecks is done in hardware, based on information provided by the software. The programmer, compiler or library assigns a unique bottleneck ID (bid) to each potential serializing bottleneck and encapsulates the bottleneck code between BottleneckCall and BottleneckReturn instructions. Using these instructions, the hardware keeps track of the bottlenecks and of which hardware threads (core ID and SMT context) they execute on.

To identify which bottlenecks cause thread waiting, we use a new instruction, BottleneckWait. The purpose of this instruction is twofold: it implements waiting for a memory location to change (similarly to existing instructions like mwait in x86 [17]), and it allows the hardware to keep track of how many cycles each thread is waiting for each bottleneck. Typically, when threads are waiting for a bottleneck, they are waiting for the value of a particular memory location watch_addr to satisfy a condition. For example, for critical sections, watch_addr can be the lock address; for barriers it can be the address of the counter of threads that have reached the barrier; and for pipeline stages it can be the address of the size of the queue where work quanta are read from [37]. In most cases bid can be the same as watch_addr, but we separate them to provide the flexibility required for some implementations, e.g., MCS locks [25]. Table 1 summarizes the format and purpose of the new instructions required by BIS.

    BottleneckCall bid, targetPC
        Marks the beginning of the bottleneck identified by bid and calls the
        bottleneck subroutine starting at targetPC.
    BottleneckReturn bid
        Marks the end of the bottleneck identified by bid and returns from the
        bottleneck subroutine.
    BottleneckWait bid[, watch_addr[, timeout]]
        Waits for a maximum of timeout cycles for the content of memory address
        watch_addr associated with bottleneck bid to change, while keeping track
        of the number of waiting cycles.
Table 1. ISA support for BIS.

The new instructions required by BIS do not affect the semantics or the implementation of the bottlenecks and can be easily added to any existing bottleneck implementation. The compiler encapsulates potential bottlenecks into functions using function outlining [40] and inserts BottleneckCall, BottleneckReturn and BottleneckWait instructions. Figure 4 shows the use of these instructions for critical sections, barriers and pipeline stages.

    (a) Critical section:
        ...
        BottleneckCall bid, targetPC
        ...
    targetPC:
        while cannot acquire lock
            BottleneckWait bid
        acquire lock
        ...
        release lock
        BottleneckReturn bid

    (b) Barrier:
        ...
        BottleneckCall bid, targetPC
        ...
    targetPC:
        code running for the barrier
        ...
        enter barrier
        while not all threads in barrier
            BottleneckWait bid
        exit barrier
        BottleneckReturn bid

    (c) Pipeline:
        while not done
            BottleneckCall bid, targetPC
            ...
    targetPC:
        while empty queue
            BottleneckWait prev_bid    // previous stage caused the waiting
        dequeue work
        do the work
        ...
        while full queue
            BottleneckWait next_bid    // next stage caused the waiting
        enqueue next work
        BottleneckReturn bid
Figure 4. Modifications to code implementing bottlenecks.

Since programming primitives that become bottlenecks are usually implemented inside libraries, the changes can be hidden so that the programmer does not need to change the program.
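As an illustration of how a library can hide these changes, the sketch below wraps a test-and-set critical section behind a small API. The intrinsics __bneck_call, __bneck_wait and __bneck_return are hypothetical stand-ins for the BottleneckCall, BottleneckWait and BottleneckReturn instructions (the paper defines the instructions, not these C-level names), and bis_critical_section is an illustrative wrapper name.

    /* Hypothetical intrinsics standing in for the BIS instructions of Table 1. */
    extern void __bneck_call(unsigned long bid, void (*targetPC)(void *), void *ctx);
    extern void __bneck_wait(unsigned long bid, volatile void *watch_addr,
                             unsigned timeout);
    extern void __bneck_return(unsigned long bid);

    typedef struct {
        volatile int  lock;
        unsigned long bid;                /* unique bottleneck ID for this lock */
        void (*body)(void *);
        void *arg;
    } bis_cs_t;

    /* The outlined "targetPC" routine, mirroring Figure 4(a). */
    static void cs_target(void *p) {
        bis_cs_t *cs = p;
        while (__sync_lock_test_and_set(&cs->lock, 1))   /* cannot acquire lock  */
            __bneck_wait(cs->bid, &cs->lock, 0);         /* counted as waiting   */
        cs->body(cs->arg);                               /* the critical section */
        __sync_lock_release(&cs->lock);
        __bneck_return(cs->bid);                         /* end of bottleneck    */
    }

    /* What the application calls; it never sees the BIS instructions. */
    void bis_critical_section(bis_cs_t *cs, void (*body)(void *), void *arg) {
        cs->body = body;
        cs->arg  = arg;
        __bneck_call(cs->bid, cs_target, cs);  /* may run locally or on a large core */
    }

The calling code only ever sees bis_critical_section; whether cs_target executes locally or on a large core is decided by the hardware described in the following sections.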
We track waiting cycles for each bottleneck, and identify the critical bottlenecks, with low overhead in hardware using a Bottleneck Table (BT). Each BT entry corresponds to a bottleneck and includes a thread waiting cycles (TWC) field. TWC is computed by aggregating the number of waiters on the bottleneck (i.e., the number of threads currently running a BottleneckWait instruction) over time; the exact computation is described in Sections 3.5.1 and 3.5.2. The number of waiters is another field in the BT entry: it is incremented when a BottleneckWait instruction starts executing for that bid and decremented when the BottleneckWait instruction commits, i.e., after the wait for the bottleneck ends. The Bottleneck Table contains logic to identify the most critical bottlenecks, i.e., the entries with the top N highest thread waiting cycles. We provide more details about the implementation of the BT in Section 3.5.2.

3.3 Accelerating Bottlenecks

There are multiple ways to accelerate a bottleneck, e.g., increasing core frequency [5], giving a thread higher priority in shared hardware resources [9], or migrating the bottleneck to a faster core with a more aggressive microarchitecture or higher frequency [26, 36]. The bottleneck identification component of BIS can work with any of these schemes. As an example, we show how our identification mechanism can use the fast/large cores of an ACMP as fine-grain hardware-controlled bottleneck accelerators, and the slow/small cores for running the remaining parallel code. We first assume a single large core context; in Section 3.4 we extend the design to multiple large core contexts.

When the BT updates thread waiting cycles (TWC), bottlenecks with TWC greater than a threshold are enabled for acceleration and the rest are disabled (a threshold of 1024 works well for our workloads). When a small core encounters a BottleneckCall instruction, it checks whether or not the bottleneck is enabled for acceleration. To avoid accessing the global BT on every BottleneckCall, each small core is augmented with an Acceleration Index Table (AIT) that caches the bid and acceleration enable bit of bottlenecks (when a bottleneck gets enabled or disabled for acceleration, the BT broadcasts the update to the AITs on all small cores). If a bid is not present in the AIT, it is assumed to be disabled for acceleration. If the bottleneck is disabled for acceleration, the small core executes the bottleneck locally. If acceleration is enabled, the small core sends a bottleneck execution request to the large core and stalls waiting for a response.

When the large core receives a bottleneck execution request from a small core, it enqueues the request into a new structure called the Scheduling Buffer (SB). Each SB entry has: bid, originating small core/hardware thread ID, target PC, stack pointer, and thread waiting cycles (TWC, copied from the BT entry). The SB works as a priority queue based on TWC: the first bottleneck that runs on the large core is the oldest instance of the bottleneck with the highest TWC. When the large core executes the BottleneckReturn instruction, it sends a BottleneckDone signal to the small core. On receiving BottleneckDone, the small core continues executing the instruction after the BottleneckCall. As we will explain in Section 3.3.2, the large core may also abort the request while it is on the SB. If that happens, the large core removes the request from the SB and sends a BottleneckCallAbort signal to the small core. On receiving BottleneckCallAbort, the small core executes the bottleneck locally.
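The SB's selection policy can be summarized in a few lines of C. The sketch below is a software model, not the hardware: the entry fields mirror those listed above, and the arrival field (used to find the oldest instance) is an assumption about how age would be tracked.

    #include <stdint.h>
    #include <stddef.h>

    /* Software model of one Scheduling Buffer entry. */
    typedef struct {
        uint64_t bid;          /* bottleneck ID                             */
        unsigned small_core;   /* originating small core / hardware thread  */
        uint64_t target_pc;
        uint64_t stack_ptr;
        uint32_t twc;          /* thread waiting cycles, copied from the BT */
        uint64_t arrival;      /* enqueue order (assumed)                   */
        int      valid;
    } sb_entry_t;

    /* Pick the oldest enqueued instance of the bottleneck with the highest TWC. */
    int sb_pick_next(const sb_entry_t *sb, size_t n) {
        int top = -1;
        for (size_t i = 0; i < n; i++)          /* 1. highest-TWC bottleneck */
            if (sb[i].valid && (top < 0 || sb[i].twc > sb[top].twc))
                top = (int)i;
        if (top < 0)
            return -1;                          /* SB is empty */
        int best = top;                         /* 2. its oldest instance    */
        for (size_t i = 0; i < n; i++)
            if (sb[i].valid && sb[i].bid == sb[top].bid &&
                sb[i].arrival < sb[best].arrival)
                best = (int)i;
        return best;
    }

Selecting the bottleneck first and the age second matches the stated policy: the oldest instance of the bottleneck with the highest TWC runs first.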
3.3.1 Threads Waiting on the Scheduling Buffer

A thread executing a BottleneckCall instruction that was sent to the large core and is waiting on the SB also incurs thread waiting cycles that must be attributed to the bottleneck. To account for these waiting cycles, when a bottleneck execution request is enqueued into the SB, the large core sends an update to the BT, incrementing waiters. It also increments a new field, waiters_sb, which separately keeps track of the number of waiters on the Scheduling Buffer for purposes we explain in Section 3.3.2. When the bottleneck execution request is dequeued from the SB, to be executed or aborted, the large core updates the BT, decrementing waiters and waiters_sb.

3.3.2 Avoiding False Serialization and Starvation

Since multiple bottlenecks can be scheduled on a large core, a bottleneck may have to wait for another bottleneck with higher priority (i.e., higher thread waiting cycles) for too long, while it could have been executing on the small core. This problem is called false serialization (which can also introduce deadlocks in a deadlock-free program [36]). BIS mitigates it by aborting a request enqueued on the SB if the bottleneck (a) does not have the highest thread waiting cycles in the SB, and (b) would be ready to run on its original small core. A bottleneck is considered ready to run if no other instance of that bottleneck is currently executing and all the waiters are on the SB. This situation results in unnecessary delaying of all those waiters and can be easily detected when the BT is updated. To perform this detection, each BT entry keeps the number of executers (incremented by BottleneckCall and decremented by BottleneckReturn), the number of waiters (incremented while a BottleneckWait instruction is executing or a bottleneck is waiting on the SB), and the number of waiters on the SB (waiters_sb, incremented while a request is waiting on the SB). If executers is zero and waiters is non-zero and equal to waiters_sb, the BT sends a BottleneckReady signal to the SB. Upon receiving this signal, the SB dequeues the oldest request for the bottleneck and sends a BottleneckCallAbort to the originating small core. This mechanism also prevents starvation of requests in the SB. As a failsafe mechanism to avoid starvation, the SB also aborts requests that have been waiting for more than 50K cycles.
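The readiness test above reduces to a comparison of three BT counters. The following is a sketch of that check in software; the struct and function names are illustrative, and the hardware's signaling is abstracted into boolean results.

    #include <stdint.h>
    #include <stdbool.h>

    /* Subset of a Bottleneck Table entry (see Table 2). */
    typedef struct {
        uint32_t executers;    /* threads between BottleneckCall and -Return    */
        uint32_t waiters;      /* threads in BottleneckWait or queued in an SB  */
        uint32_t waiters_sb;   /* the subset of waiters sitting in an SB        */
    } bt_counters_t;

    /* True when no instance of the bottleneck is executing anywhere and every
     * waiter is merely queued in a Scheduling Buffer.  In that case the BT sends
     * BottleneckReady, and the SB aborts the oldest request for that bottleneck
     * (BottleneckCallAbort) so the small core can run it locally. */
    static bool bottleneck_ready_to_run_locally(const bt_counters_t *e) {
        return e->executers == 0 && e->waiters != 0 && e->waiters == e->waiters_sb;
    }

    /* Failsafe against starvation: abort requests that wait too long in the SB. */
    enum { SB_ABORT_TIMEOUT_CYCLES = 50000 };
    static bool sb_request_starving(uint64_t now, uint64_t enqueue_cycle) {
        return now - enqueue_cycle > SB_ABORT_TIMEOUT_CYCLES;
    }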
3.3.3 Preemptive Acceleration

Bottleneck instances can vary in length. Since our mechanism uses history to decide whether or not to accelerate a bottleneck, it is possible for a long instance to get scheduled on a small core and later become responsible for high thread waiting cycles. To fix this problem, we introduce a preemptive mechanism: if, upon updating thread waiting cycles, the BT detects that a bottleneck running on a small core has become the bottleneck with the highest thread waiting cycles, the bottleneck is shipped to the large core. The BT sends a preempt signal to the small core informing it to stall execution, push the architectural state on the stack, and send a request to the large core to continue execution. The large core pops the architectural state from the stack and resumes execution until BottleneckReturn. Acceleration of barriers relies on this preemptive mechanism.

3.4 Support for Multiple Large Core Contexts

Our mechanism can scale to multiple large cores or a large core with multiple hardware contexts (Simultaneous Multithreading, SMT) with only three modifications. First, each large core has its own Scheduling Buffer. SMT contexts at a large core share the Scheduling Buffer but execute different bottlenecks, because otherwise they would wait for each other. Second, each bottleneck that is enabled for acceleration is assigned to a fixed large core context to preserve cache locality and avoid different large cores having to wait for each other on the same bottlenecks. To that effect, each BT entry and Acceleration Index Table entry is augmented with the ID of the large core the bottleneck is assigned to. If there are N large core contexts, the Bottleneck Table assigns each of the top N bottlenecks to a different large core context to eliminate false serialization among them. Then, the rest of the bottlenecks that are enabled for acceleration are assigned to a random large core. Third, the preemptive mechanism described in Section 3.3.3 is extended so that if a bottleneck scheduled on small cores (a) becomes the top bottleneck and (b) its number of executers is less than or equal to the number of large core contexts, the BT sends signals to preemptively migrate those threads to the large cores. This mechanism allows acceleration, on N large core contexts, of the last N threads running for a barrier.

3.5 Implementation Details

3.5.1 Tracking Dependent and Nested Bottlenecks

Sometimes a thread has to wait for one bottleneck while it is executing another bottleneck. For example, in a pipelined workload, a thread executing stage S2 may wait for stage S1. Similar situations occur when bottlenecks are nested. For example, a thread T1 is running B1 but stalls waiting for a nested bottleneck B2. Concurrently, T2 also stalls waiting for B1, which means it is indirectly waiting for B2. In these scenarios, the thread waiting cycles should be attributed to the bottleneck that is the root cause of the wait; e.g., waiting cycles incurred by T2 should not be attributed to bottleneck B1, which is not really running, but to bottleneck B2. In general, to determine the bottleneck Bj that is the root cause of the wait for each bottleneck Bi, we need to follow the dependency chain between bottlenecks until a bottleneck Bj is found that is not waiting for a different bottleneck. Then, the current number of waiters for Bi, i.e., the required increment in TWC, is added to the TWC of Bj. To follow the dependency chain we need to know (a) which threads are executing a bottleneck and (b) which bottleneck, if any, each of those threads is currently waiting for. To know (a), we add an executer_vec bit vector to each BT entry (one bit per hardware thread) that records all current executers of each bottleneck (set on BottleneckCall and reset on BottleneckReturn). To know (b), we add a small Current Bottleneck Table (CBT), associated with the BT and indexed by hardware thread ID, that gives the bid that the thread is currently waiting for (set during the execution of BottleneckWait, and zero at all other times). To summarize, the BT updates TWC for each bottleneck Bi with the following algorithm:

    foreach Bi in BottleneckTable:
        B = Bi; done = false
        while (not done):
            Bj = B
            E = randomly picked Executer of B   // uniformly distribute TWC
            B = CurrentBottleneck(E)
            done = (B == 0) or                  // Bj is running
                   (B == Bj)                    // or waiting for another instance of itself
        TWC(Bj) += waiters(Bi)

The basic idea of the algorithm is to follow the dependence chain between bottlenecks until a bottleneck Bj is found that is not waiting for a different bottleneck. The algorithm is also applicable when there are no dependencies: the loop executes once and results in Bj = Bi. Note that the algorithm is not exact; it only attempts to assign thread waiting cycles to a bottleneck that is actually running and making other threads wait, so that if that bottleneck were accelerated, the cycles spent by other threads waiting would also be reduced. The computation of TWC does not need to be perfect, because BIS is solely for performance and does not affect correctness.

3.5.2 Bottleneck Table

The new instructions update the Bottleneck Table after committing, off the critical path of execution. Table 2 describes each BT entry. When a BottleneckCall or BottleneckWait instruction is executed with a bid that is not in the BT, a new entry is inserted into the BT. Since BIS is solely for performance and not for correctness, the BT only needs enough entries to track the major bottlenecks in each phase of a program. Our studies (not included here due to space constraints) show that a BT with 32 entries is sufficient for our workloads.

    Field           Description
    bid             Bottleneck ID
    pid             Process ID
    executers       Current number of threads that are running the bottleneck,
                    i.e., executing between a BottleneckCall and a BottleneckReturn
    executer_vec    Bit vector of threads executing the bottleneck; each bit is
                    set on BottleneckCall and reset on BottleneckReturn
    waiters         Current number of threads that are waiting for the bottleneck,
                    i.e., executing a BottleneckWait or enqueued on the SB
    waiters_sb      Current number of threads that are waiting on the SB,
                    i.e., executing a BottleneckCall that was sent to a large core
    TWC             Thread waiting cycles
    large_core_id   ID of the large core assigned for acceleration
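A software model of this bookkeeping, using the fields above and the aggregation and decay policy given later in this subsection (add 32 * waiters every 32 cycles; halve all TWC values every 100K cycles), could look as follows. It is a sketch of the accounting only: it omits the dependence-chain redirection of Section 3.5.1, and bt_update_twc is an illustrative name.

    #include <stdint.h>

    /* Software model of one Bottleneck Table entry; fields follow Table 2. */
    typedef struct {
        uint64_t bid;            /* bottleneck ID                                 */
        uint16_t pid;            /* process ID                                    */
        uint8_t  executers;      /* threads between BottleneckCall and -Return    */
        uint64_t executer_vec;   /* one bit per hardware thread                   */
        uint8_t  waiters;        /* threads in BottleneckWait or waiting on an SB */
        uint8_t  waiters_sb;     /* waiters that sit in a Scheduling Buffer       */
        uint32_t twc;            /* thread waiting cycles                         */
        uint8_t  large_core_id;  /* large core assigned for acceleration          */
    } bt_entry_t;

    enum { BT_ENTRIES = 32, TWC_EPOCH = 32, TWC_DECAY_PERIOD = 100000 };

    /* Called once per TWC_EPOCH cycles: aggregate waiting into TWC, and
     * periodically halve all TWC values so stale entries can be evicted. */
    void bt_update_twc(bt_entry_t bt[BT_ENTRIES], uint64_t now) {
        for (int i = 0; i < BT_ENTRIES; i++)
            bt[i].twc += TWC_EPOCH * bt[i].waiters;   /* 32 * waiters every 32 cycles */
        if (now % TWC_DECAY_PERIOD == 0)
            for (int i = 0; i < BT_ENTRIES; i++)
                bt[i].twc /= 2;                       /* halve every 100K cycles */
    }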
As discussed in Section 3.2, the compiler encapsulates potential bottlenecks into functions; bottlenecks spanning multiple functions or modules may require programmer intervention, and BIS may not be applicable when not even the programmer can perform function outlining on them.
A more flexible ISA with two separate instructions BottleneckBegin and BottleneckEnd to delimit bottlenecks without encapsulating them in functions can expand the applicability of BIS at the expense of potentially higher overheads, but we have not explored it. The consequences of failing to mark code segments as potential bottlenecks depend on their actual criticality. There is no penalty for failing to mark non-critical bottlenecks. However, if critical bottlenecks are not marked, BIS misses the opportunity to improve performance. Marking too many non-critical bottlenecks does not significantly reduce performance because the Bottleneck Table retains only the currently most critical bottlenecks. We have targeted same-ISA ACMPs. However, BIS could be extended to heterogeneous-ISA ACMPs by including two binary implementations for each bottleneck in the executable image, one to be executed on the small core and the other to be executed on the large core. This is part of our future work. Table 2. Bottleneck Table entry. We implement the BT as an associative cache. The replacement algorithm evicts the entry with the smallest thread waiting cycles. We update thread waiting cycles for each bottleneck by incrementing it by the number of waiters every cycle (for practicality we add 32 * number of waiters every 32 cycles). We halve the thread waiting cycles for all entries every 100K cycles to allow for the replacement of stale entries. We find that a centralized BT structure is not a bottleneck, however, if need arises, the BT can be partitioned without much effort using the principles of a distributed directory. 3.5.7 Support for Program Tuning BIS can increase programmer productivity by providing feedback to programmers about the most critical bottlenecks. Programmers can then decide which bottlenecks to optimize to get the best results from their effort. We propose adding a set of performance counters to the BT to store bid and cumulative thread waiting cycles (TWC) for the bottlenecks with highest TWC. These counters can be accessed by tools such as Vtune and Intel Parallel Studio [18] to guide the programmer. Due to space limitations we do not describe the structure and the interface, but these are straightforward since the BT already has the required information. 3.5.3 Hardware Cost Figure 5 shows the location of the new structures required by BIS on an ACMP, and Table 3 describes their storage cost, which is only 18.7 KB for a CMP with 2 large cores and 56 small cores (area budget of 64). The storage cost for a CMP with 2 large cores and 120 small cores (area budget of 128) would be 39.6 KB. Small Core ... Small Core AIT Large Core SB 4. AIT We use an x86 cycle-level simulator that models symmetric and asymmetric CMPs. Symmetric CMPs (SCMPs) consist of only small cores. The small in-order cores are modeled after the Intel Pentium processor and the fast cores are modeled after the Intel Core 2 architecture. Details are shown in Table 4. We model all latency overheads associated with our mechanism faithfully. We evaluate our proposal on different processor configurations with equal area budget. We compare our proposal to previous work, summarized in Table 5. SCMP and ACMP use different thread scheduling policies for non-pipelined and pipelined workloads. We apply ACS and MC-ACS only to non-pipelined workloads, while FDP only applies to pipelined workloads by design. BIS is general and applies to both. 
Our evaluation is based on the 12 workloads shown in Table 6, which we simulate to completion. We use mysql with three different input sets because (a) mysql is a widely used commercial workload, and (b) its behavior significantly depends on the input set. On−chip interconnect BT Experimental Methodology CBT Figure 5. Hardware structures for BIS. 3.5.4 Handling Interrupts The Operating System may interrupt the cores, e.g., on time quantum expiration or for I/O. If a small core gets an interrupt while waiting for a large core to execute a bottleneck, it does not service the interrupt until a BottleneckDone or BottleneckCallAbort is received. If a small core gets an interrupt while executing a BottleneckWait instruction, the BottleneckWait instruction is forced to finish, which removes the waiter from the BT, but it is reexecuted after returning from the interrupt. If a large core gets an interrupt while accelerating a bottleneck, it aborts all bottlenecks in its Scheduling Buffer, finishes the current bottleneck, and then services the interrupt. 5. Evaluation 5.1 Single Large Core It is difficult to determine before runtime the optimal number of threads (and this number varies dynamically), thus it is common practice to run applications with as many threads as available cores [16, 28]. On applications that are not I/O-intensive, this policy can maximize resource utilization without oversubscribing the cores, but may result in inferior performance if the application performance peaks at fewer threads than the number of cores. Thus, we evaluate two situations: number of threads equal to the number of cores and optimal number of threads, i.e., the number of threads at which performance is maximum. Table 7 summarizes the average speedups of BIS over other techniques in both situations for different area budgets. 3.5.5 Transfer of Cache State to the Large Core A bottleneck executing remotely on the large core may require data that resides in the small core, thereby producing cache misses that reduce the benefit of acceleration. Data Marshaling [35] (DM) has been proposed to reduce these cache misses, by identifying and marshaling the cache lines required by the remote core. It is easy to integrate DM with our proposal and we evaluate our proposal with and without DM in Section 5.4. 3.5.6 Compiler Support The compiler can trivially convert bottlenecks that start and end in the same scope to use BottleneckCall/BottleneckReturn. Complex 6 Structure Purpose Location and entry structure (field sizes in parenthesis) Cost Bottleneck Table (BT) To decide which bottlenecks are more critical and where to accelerate them To help track the root cause of thread waiting on nested or dependent bottlenecks To reduce global accesses to BT to find if a bottleneck is enabled for acceleration To store and prioritize bottleneck execution requests on each large core One global 32-entry table for the CMP. Each entry has 192 bits: bid(64), pid(16), executers(6), executer vec(64), waiters(6), waiters sb(6), TWC(24), large core id(6) Attached to the BT, 64-entries. Each entry has 64 bits, the bid of the bottleneck the thread is waiting for, and is indexed by hardware thread ID. One 32-entry table per small core. Each entry has 66 bits: bid(64), enabled (1), large core ID (1). Each AIT has 264 bytes. One 56-entry buffer per large core. Each entry has 222 bits: bid(64), small core ID (6), target PC(64), stack pointer (64), thread waiting cycles (24). Each SB has 1554 bytes. 
Total 768 B Current Bottleneck Table (CBT) Acceleration Index Tables (AIT) Scheduling Buffers (SB) 512 B 14.4 KB 3.1 KB 18.7 KB Table 3. Description of hardware structures for BIS and storage cost on a 56-small-core, 2-large-core CMP. Small core 2-wide, 5-stage in-order, 4GHz, 32 KB write-through, 1-cycle, 8-way, separate I and D L1 caches, 256KB write-back, 6-cycle, 8-way, private unified L2 cache 4-wide, 12-stage out-of-order, 128-entry ROB, 4GHz, 32 KB write-through, 1-cycle, 8-way, separate I and D L1 caches, 1MB write-back, 8-cycle, 8-way, private unified L2 cache MESI protocol, on-chip distributed directory, L2-to-L2 cache transfers allowed, 8K entries/bank, one bank per core Shared 8MB, write-back, 16-way, 20-cycle Bidirectional ring, 64-bit wide, 2-cycle hop latency 64-bit wide, split-transaction, 40-cycle, pipelined bus at 1/4 of CPU frequency 32-bank DRAM, modeling all queues and delays, row buffer hit/miss/conflict latencies = 25/50/75ns Large core Cache coherence L3 cache On-chip interconnect Off-chip memory bus Memory CMP configurations with area equivalent to N small cores. N small cores 1 large core and N-4 small cores, the large core always runs any single-threaded code 1 large core and N-4 small cores, the large core always runs any single-threaded code and accelerates critical sections as proposed in [36] L large cores and S=N-4*L small cores, one large core always runs any single-threaded code, 32-entry Bottleneck Table, N-entry Current Bottleneck Table, each large core is equipped with an S-entry Scheduling Buffer, each small core is equipped with a 32-entry Acceleration Index Table SCMP ACMP ACS BIS Table 4. Baseline processor configuration. Mechanism Non-pipelined workloads Pipelined workloads SCMP ACMP ACS MC-ACS One thread per core Serial portion on large core, parallel portion on all cores [5, 26] Serial portion and critical sections on large core, parallel portion on small cores [36] To compare to the best possible baseline, we extend ACS to multiple large cores by retrying once more the acceleration of falsely serialized critical sections on a different large core before disabling their acceleration Not applicable (designed for pipelined workloads) Our proposal Best static integer schedule described in [37] Best static integer schedule including large cores [34] Not used, because in our workloads it is more profitable to schedule the large core/s to accelerate pipeline stages using FDP, as described in [34], instead of critical sections within stages FDP BIS Dynamic scheduling of all cores to stages described in [34] Table 5. Experimental configurations. Workload Description iplookup mysql-1 mysql-2 mysql-3 specjbb sqlite tsp webcache mg ft rank pagemine IP packet routing MySQL server [1] MySQL server [1] MySQL server [1] JAVA business benchmark sqlite3 [2] DB engine Traveling salesman Cooperative web cache Multigrid solver FFT computation Rank strings based on similarity to another string Computes a histogram of string similarity Source Input set # of Bottl. 
[39] SysBench [3] SysBench [3] SysBench [3] [33] SysBench [3] [20] [38] NASP [6] NASP [6] [37] Rsearch [27] 2.5K queries OLTP-complex OLTP-simple OLTP-nontrx 5 seconds OLTP-simple 11 cities 100K queries S input T input 800K strings 1M pages # of threads 29 20 18 39 5 2 33 13 17 3 5 Bottleneck description Critical sections (CS) on routing tables CS on meta data, tables CS on meta data, tables CS on meta data, tables CS on counters, warehouse data CS on database tables CS on termination condition, solution CS on replacement policy Barriers for omp single/master Barriers for omp single/master Pipeline stages Pipeline stages Table 6. Evaluated workloads. Num. of threads Area Budget SCMP ACMP ACS/FDP Equal to number of cores 8 16 32 64 -20 6.9 2.4 10 13 6.2 45 43 17 75 70 34 8 -20 1.4 2.4 Optimal 16 32 64 10 12 6.2 43 36 19 37 32 15 budget of 32 cores. On average, it improves performance by 43% over ACMP, 45% over SCMP and 17% over ACS/FDP. Three key observations are in order. First, as the number of cores increases, so does the benefit of BIS. On average, BIS outperforms ACS/FDP by 2.4%/6.2%/17%/34% for 8/16/32/64-core area budget. Recall from Table 4 that the large core replaces 4 small cores, reducing the maximum number of threads by 3 for ACMP and FDP (because they run only one thread on the large core), and by 4 for ACS and BIS (because they dedicate the large core for critical sections and bottlenecks) with respect to SCMP. With a small area budget, this loss of parallel throughput cannot be offset by the benefit of acceleration because contention for bottlenecks is smaller at fewer threads. Therefore, with a small area budget, BIS performs similar to ACS/FDP and underperforms SCMP (e.g. for an 8-core area Table 7. Average speedups (%) of BIS over SCMP, ACMP and ACS/FDP. 5.1.1 Number of Threads Equal to Number of Cores Figure 6 shows speedups over a single small core for different area budgets (measured in equivalent number of small cores). The figure shows the speedups of SCMP, ACMP and ACS (for nonpipelined workloads) or FDP (for pipelined workloads) and BIS. Our proposal improves performance over ACMP and SCMP and matches or improves over ACS/FDP on all benchmarks for an area 7 6 7 6 5 6 5 5 4 2 1 3 SCMP ACMP ACS BIS 2 1 0 3 SCMP ACMP ACS BIS 2 1 0 8 16 32 64 SCMP ACMP ACS BIS 8 16 32 (g) tsp 32 64 8 16 (b) mysql-1 8 7 6 5 4 3 2 1 0 64 SCMP ACMP ACS BIS 0 8 16 (a) iplookup 8 7 6 5 4 3 2 1 0 4 4 3 8 7 6 5 4 3 2 1 0 32 SCMP ACMP ACS BIS 2 1 SCMP ACMP ACS BIS 0 64 8 16 32 1 64 (i) mg 3 2 SCMP ACMP ACS BIS 32 64 0 8 16 32 64 7 6 6 5 3 SCMP ACMP ACS BIS 2 1 64 4 3 SCMP ACMP FDP BIS 2 1 0 64 32 (f) sqlite 4 (j) ft 8 16 (e) specjbb 5 8 16 32 SCMP ACMP ACS BIS 1 0 8 16 11 10 9 8 7 6 5 4 3 2 1 0 3 (h) webcache SCMP ACMP ACS BIS (d) mysql-3 4 32 4 (c) mysql-2 5 5 2 64 6 8 16 3 SCMP ACMP FDP BIS 0 8 16 32 (k) rank 64 8 16 32 64 (l) pagemine Figure 6. Speedup over a single small core, x-axis is area budget (in equivalent small cores). budget). As the area budget increases, the marginal loss of parallel throughput due to the large core becomes relatively smaller, and with more threads running, contention for serializing bottlenecks increases. Therefore, the dynamic mechanisms that accelerate those bottlenecks (ACS, FDP and BIS) start to prevail. 
Second, with more threads, the benefit of BIS over ACS/FDP increases on workloads where the critical bottlenecks change over time (iplookup, all mysql, webcache, mg, ft and rank), because BIS is better at choosing which bottlenecks to accelerate and quickly adapts to changes. Additionally, both critical sections and barriers are limiting bottlenecks in mg and ft at different times, and BIS accelerates both types of bottlenecks, unlike ACS. On workloads where the critical bottlenecks do not change over time (specjbb, sqlite and tsp), BIS matches the performance of ACS. Pagemine has two very similar non-scalable pipeline stages (the reduction of local histograms into a global histogram) that are frequently the limiting bottlenecks. Therefore, BIS with a single large core is unable to significantly improve performance over FDP because that would require both stages to be simultaneously accelerated. However, BIS is able to improve performance when multiple large cores are employed, as we will show in Section 5.2. Third, BIS increases the scalability (the number of threads at which performance peaks) of four of the workloads (all versions of mysql and rank) beyond the already-improved scalability of ACS/FDP over ACMP or SCMP. By accelerating bottlenecks more effectively than previous proposals, BIS reduces contention and thread waiting, allowing the workloads to take advantage of a larger number of threads before performance gets limited by bottlenecks. We conclude that BIS improves performance and scalability of applications over ACS/FDP. As the area budget increases, so does the performance benefit of BIS. With an area budget of 8 cores, all benchmarks except sqlite run with the maximum number of threads (8 for SCMP, 5 for ACMP/FDP and 4 for ACS and BIS) and with low contention for bottlenecks. Therefore, SCMP performs the best because the loss of throughput on the configurations with a large core is greater than the benefit of acceleration. The exception is sqlite, which becomes critical-section limited at 4 threads and can take full advantage of the acceleration mechanisms. Since sqlite has only one dominant critical section during the whole execution, both ACS and BIS provide the same performance improvement. On average, ACS, BIS and ACMP perform within 2% of each other. With higher area budgets, the relative impact of running fewer threads due to the presence of a large core becomes less significant, which progressively reduces the advantage of SCMP and favors ACMP, ACS/FDP and BIS. On average, BIS outperforms SCMP by 10%/37%/43%, ACMP by 12%/32%/36% and ACS/FDP by 6.2%/15%/19% on a 16/32/64-core budget. As contention increases, it becomes more important to focus the acceleration effort on the most critical bottlenecks. While BIS actively identifies and accelerates the bottlenecks that are stopping more threads from making progress, ACS disables acceleration of critical sections that suffer from false serialization, even though such critical sections can be causing the most thread waiting. Additionally, BIS is more adaptive to changes in the important bottlenecks, while both ACS and FDP maintain the decisions they make for a fixed and relatively long period (10Mcycles for ACS and until the next major phase change for FDP). As a result of these two factors, BIS significantly outperforms ACS and FDP. The only benchmark where BIS does not outperform ACS is tsp because BIS accelerates many more bottlenecks than ACS, increasing core-to-core cache misses. 
Since those bottlenecks are only 52 instructions long on average, the benefit of accelerating them does not overcome the cache miss penalty, which causes BIS' performance to be lower than that of ACS. BIS is able to improve performance of tsp over ACS with the Data Marshaling mechanism [35], which reduces the cache misses incurred by the bottlenecks, as we show in Section 5.4. We conclude that even with the optimal number of threads, BIS improves performance, and its benefit over all other mechanisms increases as the number of cores (area budget) increases.

5.1.2 Optimal Number of Threads

Figure 7 shows the speedups of our proposal for different area budgets with the optimal number of threads, relative to the ACMP configuration. We determine the optimal number of threads for each application-configuration pair by simulating all possible numbers of threads and reporting results for the one that minimizes execution time. For example, the optimal number of threads for iplookup with BIS on the 32-core area budget is 20.

Figure 7. Speedup for optimal number of threads, normalized to ACMP (one panel per area budget of 8, 16, 32 and 64 small cores).

5.2 Multiple Large Core Contexts

We first explain the trade-offs of multiple large cores with two examples. Figure 8(a) shows the performance of iplookup with the optimal number of threads as we increase the number of large cores while keeping the same area budget. The figure also shows results for large cores with 2-way Simultaneous Multithreading. Iplookup executes a large number of independent critical sections that are contended. Therefore, it can significantly benefit from extra large cores that can simultaneously accelerate more of those equally-critical bottlenecks. The performance improvement of BIS over ACMP steadily grows from 10% for one large core to 25% with three large 2-way SMT cores. On the other hand, Figure 8(b) shows that for mysql-2 additional large cores decrease performance with MC-ACS or BIS, because the benefit of simultaneously accelerating multiple critical sections that are not heavily contended is much smaller than the cost of reducing the number of threads that can be executed concurrently to accommodate the additional large cores. The conclusion is that the best number of large cores dedicated to bottleneck acceleration significantly depends on the application and the number of threads. Therefore, achieving the best performance requires either software tuning to tailor bottleneck behavior to what the available hardware configuration can efficiently accelerate, or building reconfigurable "large" cores that can either run more threads with lower performance or a single thread with higher performance. Both are interesting avenues for future work.

Figure 8. Speedup with multiple large core contexts, normalized to ACMP with a single large core (LC: large core): (a) iplookup at 32-core area budget, (b) mysql-2 at 16-core area budget.

Figure 9 shows the geometric mean of speedups for all of our benchmarks on different core configurations with the same area budget, ranging from 8 to 64 small cores. The workloads run with the optimal number of threads for each configuration. An area budget of 8 small cores only allows for a single-large-core ACMP configuration, and on average there is not much contention for bottlenecks that either ACS/FDP or BIS can provide benefit for. With an area budget of 16 cores, performance decreases with additional large cores because the benefit of acceleration on the extra large cores is not enough to compensate for the loss of parallel performance due to the reduced number of threads. When the area budget is 32 cores, the average performance benefit of BIS remains stable for different numbers of large cores. With an area budget of 64 cores, there is no loss of parallel throughput with 2 or 3 large cores, because Figure 6 shows that performance peaks at fewer than 40 threads on all workloads. Therefore, BIS' performance increases by 3.4% and 6% with 2 and 3 large cores, respectively, with respect to a single large core. We conclude that having more large cores increases the benefit of BIS at large area budgets.

Figure 9. Speedup with multiple large core contexts, normalized to ACMP with a single large core (LC: large core); one panel per area budget of 8, 16, 32 and 64 small cores.

Having SMT on the large core does not change BIS' performance by more than 2% in any configuration. The trade-off is that a 2-way SMT large core can accelerate two different bottlenecks concurrently, but the two hardware threads share all processor resources, and therefore the acceleration benefit of each thread is lower than if the thread were running alone on the large core. Two effects reduce the acceleration benefit when SMT is used on the large core. First, since local data is frequently accessed inside critical sections and pipeline stages and stays resident in the cache, our workloads rarely need to access main memory (average L3 MPKI is 0.79 on BIS with the 32-core budget); hence SMT's main benefit on an out-of-order core, cache miss latency tolerance, rarely gets exercised. Second, tight resource sharing between threads degrades performance of each thread.
5.3 Analysis

Figure 10 shows the normalized execution time spent accelerating code segments on and off the critical path for ACS/FDP and BIS on a 32-core area budget running 28 threads. To avoid changing the critical path on the fly due to acceleration, we disable acceleration and only perform bottleneck identification in this experiment. The actual critical path is then computed off-line and compared to the code segments identified as bottlenecks by ACS/FDP and BIS. We define coverage as the fraction of the critical path that is actually identified as a bottleneck. BIS matches or improves the coverage of ACS/FDP on every benchmark, and the average coverage of BIS (59%) is significantly better than the average coverage of ACS/FDP (39%). We define accuracy as the fraction of the execution time of identified bottlenecks that is actually on the critical path. The average accuracy of BIS (73.5%) is slightly better than that of ACS/FDP (72%). We conclude that the significantly better coverage and similar accuracy of BIS vs. ACS/FDP result in the performance improvements we showed in Sections 5.1 and 5.2.
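The two metrics defined above reduce to simple ratios over execution-time totals. A minimal sketch, assuming hypothetical off-line aggregates (the struct and field names are illustrative, not from our infrastructure):

    /* Hypothetical off-line aggregates, all measured in cycles:
     *   critical_total      - execution time on the actual critical path
     *   identified_total    - execution time of segments identified as bottlenecks
     *   identified_critical - execution time that is both identified and critical */
    typedef struct {
        double critical_total;
        double identified_total;
        double identified_critical;
    } path_stats_t;

    /* Coverage: fraction of the critical path identified as bottleneck. */
    double coverage(const path_stats_t *s)
    {
        return s->identified_critical / s->critical_total;
    }

    /* Accuracy: fraction of identified-bottleneck time on the critical path. */
    double accuracy(const path_stats_t *s)
    {
        return s->identified_critical / s->identified_total;
    }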
Figure 10. Bottleneck identification on and off the critical path. [Figure: for each benchmark, bars show the fraction of execution time (%) classified as ACS/FDP-critical, ACS/FDP-non-critical, BIS-critical and BIS-non-critical.]

Table 8 shows the number of dynamic instances of annotated bottlenecks that are executed and identified for acceleration, along with their average length in number of instructions. Bottlenecks are most numerous and only 52 instructions long on tsp, which increases the overhead, since even a single cache-to-cache transfer due to migration introduces a penalty of the same order of magnitude as the benefit of acceleration, resulting in a performance loss unless Data Marshaling is used, as we will show next.

Table 8. Bottleneck characterization.
Workload    Bottlenecks Executed    Bottlenecks Identified    Average Length (Instructions)
iplookup    19184                   888 (4.6%)                2130
mysql-1     18475                   9342 (50.6%)              391
mysql-2     11761                   1486 (12.6%)              378
mysql-3     25049                   2593 (10.4%)              453
specjbb     1652                    1008 (61.0%)              1468
sqlite      174                     116 (66.7%)               29595
tsp         21932                   12704 (57.9%)             52
webcache    15118                   2099 (13.9%)              1908
mg          516                     515 (99.8%)               11585
ft          228                     227 (99.6%)               39804
rank        3003                    1024 (34.1%)              1800
pagemine    5005                    1030 (20.6%)              1160

5.4 Interaction with Data Marshaling

Bottlenecks that are accelerated on a large core may need to access data that was produced in the code segment running on the small core before execution migrated to the large core. These inter-segment memory accesses become cache misses that may limit the performance benefit of acceleration. To reduce these cache misses, Suleman et al. [35] proposed Data Marshaling (DM), which can be easily integrated with BIS. DM identifies and marshals the data needed by a bottleneck to the large core using compiler and hardware support. The compiler uses profiling to find the store instructions that generate the data produced on the small core and consumed by a bottleneck. The hardware keeps track of the cache lines modified by those special store instructions and pushes those cache lines into the large core's cache as soon as the bottleneck is shipped to the large core. When the bottleneck starts executing, after waiting in the Scheduling Buffer, the inter-segment data is likely to be available in the large core's cache.

Figure 11 shows the performance of BIS and ACS/FDP with and without DM on the 32-core area budget with 28 threads, normalized to ACMP. In general, DM improves both ACS and BIS, but it improves BIS more, because BIS migrates bottlenecks to a large core more frequently than ACS. In particular, BIS underperforms ACS on tsp due to inter-segment cache misses, which are eliminated by DM, resulting in BIS+DM outperforming ACS+DM. On average on the 32-core budget, BIS+DM improves performance by 5.2% over BIS, while ACS+DM improves by 3.8% over ACS. We conclude that BIS interacts positively with techniques that can reduce cache misses incurred due to bottleneck acceleration.

Figure 11. Speedup with and without Data Marshaling. [Figure: speedup normalized to ACMP (%) for ACS/FDP, ACS/FDP+DM, BIS and BIS+DM on each benchmark, on the 32-core area budget.]
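The marshaling bookkeeping that DM performs can be summarized with a highly simplified software model. In real DM this is done in hardware; the structure size, names and the push_line callback below are illustrative assumptions, not details taken from [35].

    /* Simplified software model of DM bookkeeping (in reality, hardware). */
    #include <stdint.h>

    #define DM_MAX_LINES 64            /* illustrative capacity of the marshal list */
    #define LINE_SHIFT   6             /* assumes 64-byte cache lines */

    typedef struct {
        uint64_t line_addr[DM_MAX_LINES];
        int      count;
    } marshal_list_t;

    /* Conceptually invoked on each compiler-marked "generator" store on the
     * small core: remember which cache line it modified. */
    void dm_record_store(marshal_list_t *ml, uint64_t addr)
    {
        uint64_t line = addr >> LINE_SHIFT;
        for (int i = 0; i < ml->count; i++)
            if (ml->line_addr[i] == line)
                return;                         /* line already recorded */
        if (ml->count < DM_MAX_LINES)
            ml->line_addr[ml->count++] = line;  /* else silently drop (sketch only) */
    }

    /* Conceptually invoked when the bottleneck is shipped to the large core:
     * push the recorded lines toward the large core's cache ahead of execution. */
    void dm_marshal(marshal_list_t *ml, void (*push_line)(uint64_t line))
    {
        for (int i = 0; i < ml->count; i++)
            push_line(ml->line_addr[i]);
        ml->count = 0;
    }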
6. Related Work

Our proposal is a comprehensive mechanism to identify and accelerate bottlenecks (using heterogeneous cores) in a multithreaded program. To our knowledge, this is the first proposal that reduces thread waiting caused by bottlenecks regardless of bottleneck type. The mechanism is unique in that it requires no significant programmer support and uses the fast cores of the ACMP to accelerate the bottlenecks. Most related to our work are previous proposals to reduce the performance impact of bottlenecks and other techniques to accelerate specific bottlenecks using heterogeneous cores.

6.1 Reducing the Impact of Bottlenecks

Removing or alleviating serializing bottlenecks has been a topic of wide interest for decades. For example, test-and-test-and-set [31] was proposed to reduce the serialization caused by thread communication. These primitives are only used at the start and the end of the bottleneck (lock acquire/release). As such, they are orthogonal to BIS, since BIS accelerates the code inside the bottleneck.
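For reference, a test-and-test-and-set lock in the spirit of [31] can be written with C11 atomics as follows; this is a minimal illustration, not code from the cited work. The inner loop spins on a plain load, so waiters hit in their local cache and only attempt the expensive atomic exchange when the lock appears free, which reduces coherence traffic.

    #include <stdatomic.h>

    typedef struct { atomic_int locked; } ttas_lock_t;   /* 0 = free, 1 = held; initialize to 0 */

    void ttas_acquire(ttas_lock_t *l)
    {
        for (;;) {
            /* "test": spin on a plain load so the line stays shared in the cache */
            while (atomic_load_explicit(&l->locked, memory_order_relaxed))
                ;
            /* "test-and-set": only now attempt the atomic exchange */
            if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
                return;
        }
    }

    void ttas_release(ttas_lock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }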
Several mechanisms have been proposed to overlap the execution of bottlenecks as long as they do not modify shared data. The first of these approaches is Transactional Memory (TM) [12], which targets critical sections. TM's benefit heavily depends on bottlenecks not having data conflicts, i.e., if two instances of a transaction have a data conflict, TM cannot execute them in parallel. In contrast, BIS speeds up bottlenecks regardless of their data access patterns. BIS can also complement TM, as it can be used to accelerate transactions that cannot be overlapped due to their length or conflicts. Other proposals, such as Speculative Lock Elision [29], Transactional Lock Removal [30] and Speculative Synchronization (SS) [24], can overlap the execution of non-conflicting instances of the same critical section without requiring programmers to write code using transactions. In addition to critical sections, SS can also reduce the overhead of barriers by speculatively allowing threads to execute past the barrier as long as there are no data conflicts. These approaches fundamentally differ from BIS for four reasons. First, they cannot hide the latency of bottlenecks that have data conflicts, whereas BIS can accelerate such bottlenecks. Second, they are best suited to short bottlenecks, since long bottlenecks complicate the conflict detection hardware and are more likely to conflict, while BIS can accelerate arbitrarily long bottlenecks. Third, these techniques do not take the criticality of the bottleneck into account, which is a main feature of BIS. Fourth, they are limited to particular types of bottlenecks (e.g., these approaches cannot be applied to bottleneck pipeline stages), whereas BIS is more general. BIS can complement these proposals by accelerating bottlenecks to which they are not applicable.

Previously proposed techniques [15, 37] to reduce load imbalance among the stages of a parallel pipeline do so either by re-pipelining the workload via compiler/programmer support or by choosing the thread-to-stage allocation to reduce imbalance. Their performance benefit is limited because the decisions are coarse-grain and they are unable to speed up stages that do not scale with the number of cores. In BIS, we can leverage these previous mechanisms to manage our thread-to-core allocation, but we also identify and accelerate the critical bottlenecks using a large core at a fine granularity. Using the faster cores allows us to improve performance even if the bottleneck pipeline stage does not scale. In general, our proposal is complementary to mechanisms that reduce the impact of bottlenecks. Evaluating their interactions is part of our future work.

6.2 Accelerating Bottlenecks

Annavaram et al. [5] proposed to make the core running the serial bottleneck faster using frequency throttling. Morad et al. [26] proposed accelerating Amdahl's serial bottleneck [4] using the fast core of an ACMP while running the parallel phases on the small cores. While accelerating the single-threaded bottleneck is provably beneficial for performance, it does not reduce serialization in the parallel portions of programs. We use their mechanism as a baseline and provide significant performance improvement on top of it by also accelerating bottlenecks in the parallel portions.

Suleman et al. [36] use the faster cores to accelerate critical sections. Their scheme differs from ours because (1) it is limited to critical sections, while BIS targets thread waiting for any bottleneck, (2) it does not consider critical section criticality when deciding which critical sections to accelerate, while BIS prioritizes the bottlenecks that are predicted to be most critical, and (3) it only supports a single large core (which can execute only a single bottleneck), while BIS can schedule multiple bottlenecks for acceleration on an arbitrary number of fast cores. The generality and fine-grained adaptivity of BIS allow it to significantly outperform ACS, as we showed in Section 5.

Meeting Points [8], Thread Criticality Predictors (TCP) [7] and Age-Based Scheduling [22] try to reduce the overhead of barriers by detecting and accelerating the threads that are likely to reach the barrier last. Age-Based Scheduling uses history from the previous instance of the loop to choose the best candidates for acceleration. Thus, it is restricted to iterative kernels that have similar behavior across multiple invocations. TCP uses a combination of L1 and L2 cache miss counters to predict which threads are likely to arrive early at a barrier and slows them down to save power, but it can also be used to guide acceleration of the thread that is lagging. Meeting Points uses hints from the software to determine which thread is lagging (and thus likely to be a bottleneck). We can leverage these schemes to enhance our bottleneck identification mechanism, because they may be able to identify threads that need acceleration before they start making other threads wait, allowing their acceleration to start earlier and be more effective.
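As a rough illustration of the kind of heuristic such barrier-criticality predictors use, the sketch below ranks threads by a weighted count of cache misses and picks the highest-scoring thread as the laggard to accelerate. The weights and interfaces are hypothetical placeholders and are not the values or structures used in [7].

    #include <stdint.h>

    #define W_L1 1
    #define W_L2 8    /* assumption: an L2 miss costs far more than an L1 miss */

    typedef struct {
        uint64_t l1_misses;
        uint64_t l2_misses;
    } thread_counters_t;

    /* Rank threads by weighted miss count; the highest score is the likely
     * laggard, i.e., the candidate to accelerate before it delays the barrier. */
    int predict_lagging_thread(const thread_counters_t *c, int num_threads)
    {
        int worst = 0;
        uint64_t worst_score = 0;
        for (int t = 0; t < num_threads; t++) {
            uint64_t score = W_L1 * c[t].l1_misses + W_L2 * c[t].l2_misses;
            if (score > worst_score) {
                worst_score = score;
                worst = t;
            }
        }
        return worst;
    }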
6.3 Critical Path Prediction

Hollingsworth [14] was the first to propose an online algorithm to compute the critical path of a message-passing parallel program. Fields et al. [10, 11] proposed a data- and resource-dependency-based token-passing algorithm that predicts instruction criticality and slack in single-threaded applications. In contrast, our approach predicts criticality at the code segment (i.e., bottleneck) granularity in multithreaded applications by focusing only on thread waiting. Dependencies between instructions in a single-threaded application and between bottlenecks in a multithreaded application are conceptually different. While instructions obey well-defined data-flow dependencies, bottlenecks have more complicated and dynamic dependencies. For example, critical sections with the same lock can execute in any order as long as they do not execute concurrently, e.g., block A may wait for block B in one dynamic instance while block B may wait for block A in another, depending on the timing of each thread's execution. With many threads, dependencies between bottlenecks become more complicated and their execution order becomes less repeatable. Therefore, we found that methods to predict instruction criticality in single-threaded applications, similar to Fields et al. [11], are not straightforwardly applicable to predicting bottlenecks in multithreaded applications, although future work may find a way to do so. Li et al. [23] proposed an efficient off-line DAG-based algorithm to find instruction-level criticality in multithreaded applications. Our problem is different in that it requires online prediction of critical bottlenecks to enable their acceleration.

6.4 Coarse-Grain Thread Scheduling

Several proposals schedule threads to the large cores of an ACMP at coarse granularity, i.e., at the operating system level. For example, Saez et al. [32] proposed running on the large cores the threads that can make better use of them, according to a utility factor that considers both memory intensity and thread-level parallelism. Koufaty et al. [19] proposed assigning a large core to the thread that is expected to have the larger speedup when running on the large core, i.e., a thread with fewer off-core memory requests and fewer core front-end stalls. In contrast, BIS performs fine-grain code segment scheduling in hardware, running only serial and parallel bottlenecks on the large cores, i.e., the portions of any thread that are adaptively identified as performance limiters.
7. Conclusion

We propose Bottleneck Identification and Scheduling (BIS), the first generalized mechanism to identify the most critical bottlenecks that cause thread waiting in multithreaded applications and to accelerate those bottlenecks using one or more large cores of an ACMP. We show that BIS improves the performance of a set of bottleneck-intensive applications on average by 15% over ACS, the state-of-the-art technique for critical section acceleration, and FDP, a software technique that dynamically assigns pipeline stages to cores, on a 1-large-core, 28-small-core ACMP. The benefit of BIS increases with the number of cores, and BIS improves the scalability of 4 out of 12 workloads compared to ACS/FDP. We are also the first to propose bottleneck acceleration using multiple large cores, and we find that performance improves with multiple large cores because there are independent bottlenecks that need to be simultaneously accelerated. We conclude that BIS provides a comprehensive approach to fine-grain bottleneck acceleration on future ACMPs that can help existing and future applications take advantage of a larger number of cores while reducing programmer effort.

Acknowledgments

We thank Eiman Ebrahimi, Veynu Narasiman, Santhosh Srinath, other members of the HPS research group, our shepherd Ras Bodik, and the anonymous reviewers for their comments and suggestions. We gratefully acknowledge the support of the Cockrell Foundation, Intel Corporation, Oracle Corporation and the Gigascale Systems Research Center. José Joao was supported by an Intel PhD Fellowship.

References

[1] MySQL database engine 5.0.1. http://www.mysql.com.
[2] SQLite database engine version 3.5.8. 2008.
[3] SysBench: a system performance benchmark v0.4.8. http://sysbench.sourceforge.net.
[4] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS, 1967.
[5] M. Annavaram, E. Grochowski, and J. Shen. Mitigating Amdahl's law through EPI throttling. In ISCA, 2005.
[6] D. H. Bailey et al. The NAS parallel benchmarks. Technical Report RNR-94-007, NASA Ames Research Center, 1994.
[7] A. Bhattacharjee and M. Martonosi. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. In ISCA, 2009.
[8] Q. Cai, J. González, R. Rakvic, G. Magklis, P. Chaparro, and A. González. Meeting points: using thread criticality to adapt multicore hardware to parallel regions. In PACT, 2008.
[9] E. Ebrahimi, R. Miftakhutdinov, C. Fallin, C. J. Lee, J. A. Joao, O. Mutlu, and Y. N. Patt. Parallel application memory scheduling. In MICRO, 2011.
[10] B. Fields, R. Bodik, and M. Hill. Slack: maximizing performance under technological constraints. In ISCA, 2002.
[11] B. Fields, S. Rubin, and R. Bodik. Focusing processor policies via critical-path prediction. In ISCA, 2001.
[12] M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. In ISCA, 1993.
[13] M. D. Hill and M. R. Marty. Amdahl's Law in the Multicore Era. Technical Report CS-TR-2007-1593, Univ. of Wisconsin, 2007.
[14] J. K. Hollingsworth. An online computation of critical path profiling. In SIGMETRICS Symposium on Parallel and Distributed Tools, 1996.
[15] A. H. Hormati, Y. Choi, M. Kudlur, R. Rabbah, T. Mudge, and S. Mahlke. Flextream: Adaptive compilation of streaming applications for heterogeneous architectures. In PACT, 2009.
[16] Intel. Source code for Intel threading building blocks.
[17] Intel. IA-32 Intel Architecture Software Dev. Guide, 2008.
[18] Intel. Getting Started with Intel Parallel Studio, 2011.
[19] D. Koufaty, D. Reddy, and S. Hahn. Bias scheduling in heterogeneous multi-core architectures. In EuroSys, 2010.
[20] H. Kredel. Source code for traveling salesman problem (tsp). http://krum.rz.uni-mannheim.de/ba-pp-2007/java/index.html.
[21] R. Kumar, D. Tullsen, N. Jouppi, and P. Ranganathan. Heterogeneous chip multiprocessors. IEEE Computer, 2005.
[22] N. B. Lakshminarayana, J. Lee, and H. Kim. Age based scheduling for asymmetric multiprocessors. In SC, 2009.
[23] T. Li, A. R. Lebeck, and D. J. Sorin. Quantifying instruction criticality for shared memory multiprocessors. In SPAA, 2003.
[24] J. F. Martínez and J. Torrellas. Speculative synchronization: applying thread-level speculation to explicitly parallel applications. In ASPLOS, 2002.
[25] J. M. Mellor-Crummey and M. L. Scott. Synchronization without contention. In ASPLOS, 1991.
[26] T. Morad, U. C. Weiser, A. Kolodny, M. Valero, and E. Ayguade. Performance, power efficiency and scalability of asymmetric cluster chip multiprocessors. Computer Architecture Letters, 2006.
[27] R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. MineBench: A benchmark suite for data mining workloads. In IISWC, 2006.
[28] Y. Nishitani, K. Negishi, H. Ohta, and E. Nunohiro. Implementation and evaluation of OpenMP for Hitachi SR8000. In ISHPC, 2000.
[29] R. Rajwar and J. Goodman. Speculative lock elision: Enabling highly concurrent multithreaded execution. In MICRO, 2001.
[30] R. Rajwar and J. R. Goodman. Transactional lock-free execution of lock-based programs. In ASPLOS, 2002.
[31] L. Rudolph and Z. Segall. Dynamic decentralized cache schemes for MIMD parallel processors. In ISCA, 1984.
[32] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov. A comprehensive scheduler for asymmetric multicore systems. In EuroSys, 2010.
[33] The Standard Performance Evaluation Corporation. Welcome to SPEC. http://www.specbench.org/.
[34] M. A. Suleman. An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs. PhD thesis, UT-Austin, 2010.
[35] M. A. Suleman, O. Mutlu, J. A. Joao, Khubaib, and Y. N. Patt. Data marshaling for multi-core architectures. In ISCA, 2010.
[36] M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt. Accelerating critical section execution with asymmetric multi-core architectures. In ASPLOS, 2009.
[37] M. A. Suleman, M. K. Qureshi, Khubaib, and Y. N. Patt. Feedback-directed pipeline parallelism. In PACT, 2010.
[38] Tornado Web Server. Source code. http://tornado.sourceforge.net/, 2008.
[39] M. Waldvogel, G. Varghese, J. Turner, and B. Plattner. Scalable high speed IP routing lookups. In SIGCOMM, 1997.
[40] P. Zhao and J. N. Amaral. Ablego: a function outlining and partial inlining framework. Softw. Pract. Exper., 37(5):465–491, 2007.