Turbo Boost and Overclocking © Intel Corp.

Architecture and Early Performance Results of Turbo Boost Technology on Intel® Core™ i7 Processor and Intel Xeon® Processor 5500 Series (2009)

Markus Mattwandel, Todd Baird, Jorge Garcia, Seongwoo Kim, Herbert Mayer*

Abstract

We survey the Turbo Boost Technology on the new Intel® Core™ i7 multi-core, multi-threaded microprocessor. Turbo Boost Technology dynamically increases the frequency of processor cores for the benefit of higher performance, while operating under thermal design limits and maintaining safe conditions on the physical chip. This paper outlines how far the core frequency can be raised as a function of the number of currently active cores and of other electrical and temperature parameters. We explain the conditions under which such boosts are possible, depending on the instantaneous current, on overall power consumption with the resulting heat generation, and on the actual temperature of the core(s) being boosted. We contrast Turbo with Overclocking, another method of boosting frequency and improving performance, and discuss the pros and cons of Turbo versus thermal throttling. Since the Turbo Boost Technology has been implemented in silicon on the Core i7, on both single-socket desktops and dual-socket servers, we include actual performance data from average to ideal cases. Core i7 is implemented in 45 nm High-K silicon, launched in late 2008 as a High-End Desktop platform with 1 processor, and in 2009 as a server with 2 processors, each having 4 cores and 2 hardware threads per core. We conclude with conjectures into the future and a list of references.

Keywords: Multi-Core; Turbo Mode; Overclocking; Simultaneous Multi-Threading (SMT); Parallel System; Logical Core; Green Computing

1. Introduction

Turbo Boost Technology (Turbo, for short) dynamically enables a temporary performance boost on the new Intel® Core™ i7 multi-core, multi-threaded microprocessor, stylized in Figure 2.1.
Turbo Boost Technology increases the core clock of a processor in defined, discrete frequency steps (also known as bins) for the benefit of higher performance, while conditions on the physical chip allow this without endangering the microprocessor. This survey outlines how far the core frequency can be raised as a function of the number of active cores and of other parameters. Section 2 describes the design goals of Turbo Boost Technology on Core i7 and contrasts the new method with an older, legacy Turbo method implemented on earlier Intel silicon. It discusses the pros and cons of Turbo vs. Overclocking, both of them methods of boosting frequency to increase performance, yet with different goals and conditions. It also compares Turbo with thermal throttling. Section 3 summarizes how much Turbo boosting is theoretically possible, as set by predefined system parameters. In Section 4 we list costs, shortcomings, and dangers of Turbo. Since the Turbo Boost Technology has been implemented in silicon on the Core i7, on both single-socket desktops and dual-socket servers, Section 5 includes detailed, actual performance data on client and server platforms, from average to ideal cases. Section 6 contrasts Turbo with other performance boost ideas, while Sections 7 and 8 conclude with a conjecture into the future and references. The physical Core i7 microprocessor is realized by Intel in 45 nm High-K silicon technology, launched in late 2008 as a High-End Desktop platform with a single socket, and in 2009 as a server with 2 sockets.

2. Description of Turbo Boost Technology

Why Turbo Boost?

Intel Turbo Boost Technology, introduced on Intel’s flagship Core i7 and Core i7 Extreme Edition processors in Q3’2008, allows processor cores to automatically run faster than their base operating frequency if they are operating at the low end of a defined envelope of power, current, and temperature (the specification limits).
The amount of additional frequency upside each core will actually achieve depends on the number of active cores executing the processes (threads) that a workload has spawned, and on the thermal operating environment, which includes current (thermal design current, or TDC) and power consumption (thermal design power, or TDP), as well as temperature. Turbo Boost kicks in when the OS power scheme is set for performance and the processor package is operating below critical constraints. The core frequency is dynamically adjusted within the defined limits as the operating conditions change.

[Figure 2.1: High-Level Nehalem Architecture. Labels: Core 0 to Core 3 (CORES), Pwr & Clk, Frequency & Voltage Independent Interface, UNCORE with Last Level Cache, IMC (DDR3 DRAMs), and QPI.]

* Corresponding author: herb.g.mayer@intel.com
SPEC, SPECint and SPECfp are copyright of SPEC

Thermal throttling comes from the other end by taking a greedy approach to performance enhancement. Thermal throttling assumes that the microprocessor is generally running in some steady state of execution, but acknowledges that temporary hot spots are possible. This happens when the typical mix of I/O-bound plus compute-bound execution is replaced by compute-bound-only execution, resulting in more heat generation than is safe. Similar to the safety action taken in Turbo, thermal throttling reduces the frequency, resulting in less current and thus less heat being generated, and less performance being delivered. A microprocessor architect must decide which safe technology of performance boosting should be realized in silicon: one, the other, or both. On the Core i7, Intel decided to provide both methods.

2.1. Turbo Boost Technology vs.
Enhanced Dynamic Acceleration Technology:

Prior to the introduction of Turbo Boost in Core i7, Intel’s previous-generation Core 2 Duo processors introduced the first generation of Turbo technologies, known as Enhanced Dynamic Acceleration Technology (EDAT). This technology allows processor cores to automatically run faster than their base operating frequency if one or more cores are idle. In that event, the operating frequency of the other cores is increased. Note that this increase is influenced by the number of active hardware threads and by various electrical and thermal parameters, before taking advantage of a clock boost within the product constraints. Turbo Boost and EDAT also happen to be “Green” technologies that provide performance on demand, while keeping power consumption at a minimum when the additional processor performance is not needed, as judged by the current load.

3. Ideal Performance Speedup with Turbo

A number of dynamic parameters dictate the upper limit of Turbo Boost speedup. These include the current temperature of the core, the overall current and momentary power, and the number of active cores. Each frequency step of Turbo Boost is 133.33 MHz. For each SKU, fuse values are set in a small internal table during chip manufacturing to define an upper bound on how many of these frequency steps a core can safely gain. The table parameters d-c-b-a mean: if 1 core is active, that core’s frequency may increase by a bins; else if 2 cores are active, these cores may increase by b bins; and so on. Applying the same encoding but reading from the other end, the table entry 1-1-4-8 means that with 3 or 4 cores active, the frequency may increase by just 1 frequency step. If only 2 cores are busy, the frequency may increase by up to 4 steps, and if only a single core is active, that core may increase by up to 8 steps, amounting to about 1.07 GHz of additional clock speed.
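As a minimal sketch of the bin table just described, the following Python fragment looks up the permitted number of bins for a given count of active cores and converts it to an upper bound on frequency. The function names and the tuple layout are illustrative assumptions, not the fuse format used in silicon, and the result is only a ceiling that current, power, and temperature limits may reduce further.

```python
BIN_MHZ = 133.33  # size of one Turbo frequency step, as stated in the text


def max_turbo_bins(table, active_cores):
    """Return the upper bound on frequency steps for `active_cores`.

    `table` follows the d-c-b-a notation of the text: its last entry
    applies when 1 core is active, the one before it when 2 cores are
    active, and so on. The tuple layout is an assumption of this sketch.
    """
    if not 1 <= active_cores <= len(table):
        raise ValueError("active_cores out of range for this table")
    return table[-active_cores]


def max_turbo_mhz(base_mhz, table, active_cores):
    """Upper bound on core frequency in MHz, before any electrical or
    thermal limit is applied. `base_mhz` is supplied by the caller."""
    return base_mhz + max_turbo_bins(table, active_cores) * BIN_MHZ


table = (1, 1, 4, 8)  # the 1-1-4-8 example from the text
for cores in range(1, 5):
    bins = max_turbo_bins(table, cores)
    print(cores, "active core(s):", bins, "bins =", round(bins * BIN_MHZ, 2), "MHz")
```

Reading the table from its last entry keeps the lookup aligned with the d-c-b-a notation, so the same function would work for a SKU with a different number of cores, provided the table has one entry per core.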
However, this boost may decrease if for any reason a predefined envelope of maximally allowable current or temperature is exceeded. The decrease is designed not only to save the microprocessor from thermal stress, but also to save power and run “more green”. Similarly, as the sample 1-1-4-8 bound shows, other cores may become active, forcing a currently high boost to decrease, again to protect the processor and save power.

2.2. Turbo Boost vs. Overclocking

Turbo is quite distinct from overclocking. First of all, overclocking increases clock frequency by running outside the specification of the part, while Turbo operates completely within spec. Turbo does not change the reliability or durability of a part. Overclocking occurs when the clock rate of the processor is manually and statically increased. This results in running the processor outside its specified, and thus safe, limits. Conversely, Turbo technologies run the processor within specification, and aim to take advantage of thermal headroom that is available during under-utilized conditions. Overclocking is not a “Green” technology, since it forces increased processor power consumption continuously without regard to actual demand.

Technology | Starting clock | Heat protection | Protective action | Application mechanism
Turbo Boost | Base operating frequency | Yes | Decrease clock | Automatic, based on system conditions
Overclocking | Base operating frequency | Yes | Thermal throttling set by user | Manual, user driven by brute force

2.3. Turbo Execution vs. Thermal Throttling

Turbo Boost Technology is a conservative performance enhancement method that increases the clock rate after the microprocessor recognizes that an increase in clock speed is safe; it is understood that the processor was already operating safely before the clock speed was boosted. When the thermal parameters change, or when the number of active cores increases, the prior clock increase is reversed, not only saving the chip from possible damage, but also saving power.
Technology | Starting clock | Heat protection | Protective action | Arch. driven
Turbo Boost | Low, to run safely | Yes | Decrease clock | Yes
Thermal Throttle | High, to run fast | Yes | Decrease clock | Yes

4. Technology Investment for Turbo Boost

Although the goal of improving performance with Intel Turbo Boost Technology is worth pursuing, the long-term investments and shorter-term costs must be weighed against gains on the performance side for the user and on the business side for the manufacturer.

4.1. Engineering Investment

The up-front engineering costs to design and implement Turbo Boost Technology were noticeable but contained, even with past technological history at Intel to build on, e.g., Enhanced Intel SpeedStep Technology. Design costs included a new mini-controller, called the Power Control Unit (PCU), and associated microcode. The cost of validation was also significant, because new methods were developed to verify that the feature worked properly without the validation itself interfering with the operation of the feature. The manufacturing flow was also updated to support testing of the PCU, which added another minor development cost.

4.2. End User Costs

When Turbo Boost Technology promotes cores to a higher frequency, the processor will draw more current than it would while running at nominal frequency. The end user will incur an incremental cost for the additional electrical power consumed in this mode; however, this cost is very minor compared to the power used by the system as a whole. If necessary, users may choose to manually adjust the balance between performance and power consumption through the OS power policies. A final theoretical cost to note is the introduction of a variable-frequency processor into an environment that has largely been able to depend on a constant processor frequency.
Some applications may attempt to synchronize events in time based on the assumption that frequency does not change over time, although Intel has not yet found any. Computer users may also become alarmed when their frequency reporting tools begin to show dynamic frequency changes.

5. Actual Performance Data with Turbo Boost

We isolated workloads known to be CPU-centric, and concentrated further on single-threaded and multi-threaded workloads in our Turbo performance measurements. We proceeded by running three baseline frequencies without enabling Turbo. The base frequencies were 2.66 GHz, 2.8 GHz, and 2.93 GHz, to simulate the lower and upper bounds of the workload. Initial results were mixed because the OS scheduler was allowing single-core workloads to run on multiple CPUs. By setting affinity manually, and forcing workloads to run on a single CPU, we were able to obtain maximum benefit from Turbo. Affinity here means associating a particular thread with a dedicated core or hyper-thread. The lesson of setting affinity manually was then applied to all single-threaded workloads.

Allowing single-threaded workloads to run on any hardware thread incurs performance penalties, because each time a thread moves, the OS needs to save and restore state to preserve determinism.

[Figure 5.1: Cinebench 10]

[Figure 5.2: Cinebench 9.5]

Figures 5.1 and 5.2 show Cinebench obtaining the highest Turbo upside when affinity is set, as it can then run on one single core for the whole test duration. Rendering software performance data show Turbo to have a positive effect; rendering is conventionally measured in time units, hence smaller is better.

5.1. Turbo Speedup on UP Client

Setting processor affinity is the process by which an application manually tells the OS scheduler where to run; in other words, it restricts the hardware threads on which the workload may run. For instance, setting Affinity = P3 tells the OS scheduler to run only on Processor 3.
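On Linux, the affinity setting described above can be sketched from within a process using Python's standard library. This is an illustration under stated assumptions, not the procedure used for the measurements in this paper: `os.sched_setaffinity` is available only on some platforms (notably Linux), the helper name `pin_to` is hypothetical, and the logical CPU numbers are whatever the OS exposes.

```python
import os


def pin_to(cpus):
    """Restrict the calling process to the given logical CPUs.

    Returns the set of CPUs the process may run on afterwards. On
    platforms without os.sched_setaffinity (this sketch assumes Linux),
    the request is returned unchanged and nothing is pinned.
    """
    wanted = set(cpus)
    if not hasattr(os, "sched_setaffinity"):
        return wanted
    os.sched_setaffinity(0, wanted)  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = sorted(os.sched_getaffinity(0))
    # Analogous to "Affinity = P3": run on exactly one logical CPU.
    print(pin_to(allowed[:1]))
```

Passing several CPU numbers instead of one restricts the process to that group while still letting the scheduler move it within the group.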
Setting Affinity = P0, P2, P3 allows an application to run on hardware thread 0, 2, or 3.

[Figure 5.3: Rendering Workloads]

Figure 5.3 shows 3DStudioMax and MainConcept H.264 reaching nearly ideal Turbo speedup because these workloads are CPU-centric, can run on specific cores, and incur no other overhead.

[Figure 5.4: Arithmetic and Multi-Media Workloads]

Figure 5.4 exhibits Sandra measurements of Arithmetic and Multimedia applications with multi-threaded workloads.

[Figure 5.6: Estimated Individual SPEC CPU2000 Score, 4-Users]

Figures 5.5 and 5.6 display various components of CPU2000 visibly benefiting from Turbo. These workloads represent a gamut of diverse disciplines and do not all scale linearly with core frequency. Thus, some workloads do not reach the full theoretical Turbo benefit. Estimated individual SPEC CPU 2000 scores are based on measurements on Intel internal development platforms and may differ from measurements on production platforms available later in 2009. For more information about the benchmarks see [4].

5.2 Turbo Speedup on DP Server

Table 5.1 summarizes our setup for the DP Turbo experiments. We used an engineering validation board, called Green City, with an open bench-top configuration. This is certainly a different thermal system condition compared to a typical end-user environment in a standard chassis. However, based on a pre-silicon study and other post-silicon experiments, we learned that the thermal impact on Turbo performance is still second-order. As shown in Figure 5.7, each processor has an individual heatsink with active fans attached. In addition, four external fans are placed on the side to cool the memories, voltage regulators, etc. All fans were running at a constant speed. If the workload does not hit Turbo constraints, the core frequency can increase dynamically up to 3.33 GHz, depending on the number of active cores.
[Figure 5.5: Estimated Individual SPEC CPU 2000 Score, 4-Users]

[Table 5.1: Experimental Setup]

[Figure 5.7: NHM-EP System with external fans]

We first tested the SPEC CPU2000 benchmark compiled with Intel Compiler 11.0 for multiple cases of interest. The baseline configuration had Turbo mode turned off, and its benchmark scores were compared with those obtained in Turbo mode. Since Turbo is designed to operate within predetermined TDC, TDP, and thermal constraints, it may not always run at maximum Turbo frequency. To assess the efficiency, we compared actual performance against the maximum performance without the constraints. This unconstrained case would give the same performance as overclocking the processors to the Turbo frequency in non-Turbo mode, e.g., 3.20 GHz for a multi-core active workload, unless Turbo itself causes some overhead. Note that we used the maximum scores among several samples of each experiment in the comparison. The system could occasionally generate exceptionally low scores due to certain abnormal transient conditions at the beginning of tests. Based on our previous experience, we believe this water-marking approach to sampling is effective when dealing with a pre-production platform prior to fine tuning.

[Figure 5.8: Turbo Speedup for SPEC CPU 2000 Integer Rate IC11.0, 16-user]

Figure 5.8 presents the Turbo speedup for 16-user SPECintRate along with the level of run-to-run variation. The red line indicates the amount of frequency increase between P1 and P0, i.e., slightly more than 9%. On average, Turbo mode brings about 5.8% performance upside compared to non-Turbo. This extra boost is still within the thermal design envelope. For example, bzip2 and gcc reach the ideal Turbo performance target. On the other hand, multiple components are below the ideal level. Our analysis using workload profiles from the power control unit showed that these workloads hit the TDP limit. The variation is represented by the standard deviation over the mean for 9 different trials of each benchmark component. The variability is mainly due to suboptimal memory usage by the OS under non-uniform memory configurations (NUMA) between the two separate processors (two separate sockets on a server platform), and it is the main explanation for the cases where two ideal performance bars mismatch.

[Figure 5.9: Turbo speedup for SPEC CPU 2000 Floating-point Rate IC11.0, 16-user]

Figure 5.9 shows the observed speedup for SPECfpRate. The average performance benefit from Turbo is 3.3% out of a 3.5% goal, which is less than observed in the integer suite. One of the reasons is that some floating-point components depend on activity that does not scale with core frequency, e.g., DRAM accesses. Even though some components, e.g., sixtrack, directly take advantage of a faster core clock, they often hit the TDP constraint throughout the execution. Some bars look inconsistent with the expected basic relationship; run-to-run variation has to be factored in to explain them. Although we present the variation only for the Turbo case here, its level was not dramatically different in the non-Turbo cases. There is no empirical evidence so far suggesting that Turbo introduces additional run-to-run variation on a given system.

Since we observed that TDP is the only limiter in our test on this platform with 95W processors, one may wonder how much performance improvement can be obtained with a bit more power headroom via process enhancement or other system implementation factors, e.g., voltage regulator accuracy. To address this question, we experimented with additional cases by artificially adjusting the TDP to higher limits. Figure 5.10 illustrates the performance impact of 4W and 8W of additional TDP budget for selected benchmark components, which are core-bound and power constrained. It is clear that the performance gain is measurable for these workloads.
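The integer-rate figures quoted in this section can be cross-checked with simple arithmetic. The sketch below assumes a 2.93 GHz base frequency (P1), which is not stated explicitly for this platform but is consistent with the 3.20 GHz multi-core Turbo frequency (P0) and the "slightly more than 9%" increase quoted in the text. It further assumes, as an idealization, that performance scales linearly with core frequency. The function names are illustrative.

```python
def ideal_speedup_pct(base_ghz, turbo_ghz):
    """Frequency increase in percent, i.e. the speedup that would result
    if performance scaled linearly with the core clock."""
    return (turbo_ghz / base_ghz - 1.0) * 100.0


def fraction_of_ideal(measured_pct, ideal_pct):
    """Share of the ideal speedup that was actually measured."""
    return measured_pct / ideal_pct


ideal = ideal_speedup_pct(2.93, 3.20)      # 2.93 GHz base is an assumption
achieved = fraction_of_ideal(5.8, ideal)   # 5.8% average reported for SPECintRate
print(f"ideal {ideal:.1f}%, achieved share {achieved:.2f}")
```

Under these assumptions the ideal speedup is about 9.2%, so the 5.8% average corresponds to a little under two thirds of the frequency-proportional bound, which is in line with several components being limited by TDP.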
[Figure 5.10: Performance impact with Additional Power Headroom]

We also evaluated the SPEC JBB 2005 benchmark. As shown in Figure 5.11, we compared the impact of simultaneous multi-threading (SMT) under Turbo mode. Turbo provides an upside with and without SMT, while the best performance for this workload is achieved with both SMT and Turbo. In this case, the benchmark rarely hit TDP, as indicated by the unconstrained Turbo case.

[Figure 5.11: Turbo Performance for SPEC JBB 2005]

The results of JBB2005 are not to be interpreted as official results by Intel, and instead are being presented as we found them in 2008 on our development platform, in line with section 5.0 of the JBB run rules.

6. Related Work

If “EIST (Enhanced Intel SpeedStep Technology)” (see Jeff Reilly comment) is different from EDAT, explain and provide reference.

7. Conclusion and a Look Ahead

In this survey we provided a high-level explanation of the Turbo Boost mechanism on Core i7 and of how it can accelerate some applications, with a degree of speedup that depends on the activity factors of cores on the same physical processor and on electrical and thermal conditions. We compared Turbo, a built-in, dynamic, automatic boosting feature, with overclocking, initiated by the end user at the user’s own risk. We also contrasted Turbo, which speeds up the clock from its standard rate, with thermal throttling, which slows down clock rates from the standard rate for the sake of component protection. Whether future processors, which will exceed 1 billion transistors per part, will continue to provide both Turbo Boost and thermal throttling remains to be seen. But it will be a natural evolutionary step to let the number of cores grow beyond the 4 in the current Core i7. Whether such future cores will have sibling hyper-threads, or whether the architects will instead use those same transistors for even more cores, remains to be seen.

8. References and Referees

We wish to thank the anonymous reviewers … and our colleagues at Intel, Ronak Singhal and Jeff Reilly, who suggested crucial improvements and contributed clarifications.

[1] Intel White Paper, “Intel® Turbo Boost Technology in Intel Core™ Microarchitecture (Nehalem) Based Processors,” November 2008. http://download.intel.com/design/processor/applnots/320354.pdf?iid=tech_tb+paper
[2] POD Tech website, “Turbo Boost Technology,” November 8, 2008. http://www.podtech.net/home/search/Turbo+Boost+Technology
[3] Intel website, “Intel® Core™ i7 Microprocessors, The Best Processor on The Planet,” November 3, 2008. http://download.intel.com/pressroom/kits/corei7/pdf/Intel%C2%AE%20Core%E2%84%A2%20i7_Overview.pdf
[4] General SPEC website. http://www.spec.org
[5] SPEC website for the integer component of SPEC CPU2006, August 2006. http://www.spec.org/cpu2006/CINT2006/
[6] SPEC website for the floating point component of SPEC CPU2000, October 2003. http://www.spec.org/cpu2000/CFP2000/
[7] Intel® Turbo Boost Technology. http://www.intel.com/technology/turboboost/