Performance Characterization of Mobile-Class Nodes: Why Fewer Bits is Better Michelle McDaniel and Kim Hazelwood Department of Computer Science University of Virginia I. I NTRODUCTION 978-1-61284-368-1/11/$26.00 ©2011 IEEE 2.0 1.0 1.5 Mobile Class Server Class k bm xa la nc r ta as tp ne om re f p m tu 64 h2 ng an qu lib sje er k m bm hm go 2 c gc ip bz pe rlb en ch 0.0 0.5 Speedup Mobile-class nodes, also known as netbooks, have become increasingly popular in the personal computing market. As is the trend in the computing market, the processors in these mobile-class nodes moving from 32 bits to 64 bits. This move extends the memory ceiling beyond the traditional 4GB, allowing for a significantly larger virtual address space. In addition, on the x86_64 architecture, 64-bit technology gives mobile-class processors access to eight extended 64-bit registers, as well as eight additional extended registers and eight additional 32-bit registers. Mobile-class processors, however, have many features unlike server-class processors. They are limited in both speed and memory since they are not expected to be used for the scientific calculations that are performed on server-class nodes such as the Xeon processor. Also, mobile-class processors are designed for low power usage, rather than high performance, and, therefore, high performance-per-watt [1] [2]. In order to decrease power consumption, these nodes feature in-order processors [3], which decreases performance. Ye et al. [4] discuss the performance differences when moving from 32-bit to 64-bit on server-class processors, showing that generally moving to 64-bit results in performance improvements but that improvements cannot be expected for all applications. We show that these results to not generalize to all processors by exposing the difference between mobile-class and server-class processors. To do so, we performed a systemwide performance characterization of the SPEC CPU2006 integer benchmarks [5] on the Intel Atom processor and on the Intel Xeon processor. Specifically, we compare the performance of the 32-bit execution to the 64-bit execution on both platforms. We will show that memory is a serious consideration when developing for mobile-class processors. Developers cannot always assume there will be a speedup on mobile-class processors if there is a speedup on server-class processors, as shown in Figure 1. While the 64-bit benchmarks consistently perform better than 32-bit mode on the serverclass processor, this is not true on the mobile-class processor, where only three benchmarks experience a speedup in 64-bit mode. The mobile-class processor’s reliance on the ordering of instructions and data layout causes a decrease in the memorysubsystem performance, which leads to the differences seen in runtime performance between the mobile-class processor and the server-class processor. Speedup of SPEC CPU2006 Integer Benchmarks: 64−bit vs. 32−bit Fig. 1. A speedup of below 1 indicates that 32-bit mode resulted in a speedup over 64-bit mode and a speedup of above 1 means that 64-bit mode resulted in a speedup over 32-bit mode. This figure shows that, for most of the benchmarks, 32-bit mode is faster than 64-bit mode on mobile-class processors, while 64-bit mode faster on server-class processors. II. R ESULTS Mobile-class nodes differ from server-class nodes in two fundamental ways. First, mobile-class nodes are populated with significantly less memory than server-class nodes. In our study, the mobile-class node was populated with 1GB of RAM, while the server-class node was populated with 8GB of RAM. Second, mobile-class nodes feature an in-order processor. These two characteristics of mobile-class nodes result in lower power consumption. Because mobile-class processors are used in netbooks, the decreased power consumption is important, and so only those performance enhancing features that did not increase power consumption beyond a certain threshold were implemented [1]. In order to improve performance on mobile-class nodes, developers must actively optimize for these restrictions. In this section, we present the results of our system-wide performance characterization, which compares CPU utilization and memory subsystem performance of mobile-class and server-class processors. Our mobile-class processor for this study was the Intel Atom N450 processor, a 64-bit Atom processor, clocked at 1.66 GHz and populated with 1GB of RAM. For comparison, our server-class system was the Intel Dual Xeon E5310 Quad Core processor, clocked at 1.60GHz, populated with 8GB of RAM. Both machines were dual booted with Ubuntu Linux 10.04, 64-bit edition and 32- 131 TABLE I S UMMARY OF OUR SYSTEM - WIDE CHARACTERIZATION OF SERVER - CLASS AND MOBILE - CLASS NODES . Server-Class Node Mobile-Class Node Runtime 14% faster in 64-bit mode; 8 benchmarks perform better in 64-bit mode 1% faster in 32-bit mode; 8 benchmarks perform better in 32-bit mode IPC 7% higher in 32-bit mode; 7 benchmarks higher in 32-bit mode 17% higher in 32-bit mode; all benchmarks higher in 32-bit mode Instruction Cache Misses 17% fewer in 64-bit mode; 3 benchmarks higher in 64-bit mode 18% fewer in 64-bit mode; 2 benchmarks higher in 64-bit mode L2 Cache Misses 36% more in 64-bit mode; 2 benchmarks lower in 64bit mode 11% more in 64 bit mode; 3 benchmarks lower in 64bit mode Static Binary Size 8% larger in 64-bit mode 7% larger in 64-bit mode Dynamic Instruction Count 18% more instructions executed in 32-bit mode 16% more instructions executed in 32-bit mode Uses more 64-bit registers in 64-bit mode than 32-bit registers Uses more 64-bit registers in 64-bit mode than 32-bit registers; mobile-class processors have a higher ratio of 64-bit to 32-bit register usage than server-class processors Registers bit edition. This characterization was performed using the SPEC CPU2006 integer benchmarks compiled with GCC 4.5.0. Table I summarizes the results of our system-wide characterization. A. CPU Utilization As mobile-class processors cannot perform out-of-order instruction scheduling in hardware, they must stall more often than server-class processors. We can recognize this by monitoring the IPC of applications on these machines. On the mobile-class node, IPC is higher for all of the 32-bit binaries than the 64-bit binaries. The 64-bit versions of libquantum and omnetpp have the greatest percentage decrease in IPC— 37% and 42% decrease, respectively—on the mobile-class processor compared to the 32-bit versions of these binaries. In contrast, several of the benchmarks experience an increase in IPC in 64-bit mode on the server-class processor. hmmer experiences the greatest increase in IPC—56%— corresponding to a significant speedup, while libquantum again experiences a significant decrease in IPC—58%—in 64bit mode. libquantum also experiences a greater degradation of IPC on the server-class processor than on the mobileclass processor, as well as experiencing a smaller speedup on the server-class node compared to the mobile-class node. B. Memory Subsystem Characteristics Mobile-class processors feel the effects of an inefficient memory subsystem utilization to a greater degree than server class processors. 64-bit applications put more strain on the memory subsystem due to the size of data and instructions. This increased strain on the mobile-class processor heavily contributes to the performance differences that we see between the mobile-class processor and the server-class processor. In order to achieve higher performance on mobile-class processors, application developers and compiler writers must focus on improving instruction and data placement in order to improve cache performance and decrease page faults. Only then will applications achieve higher performance. On mobile-class processors, performance is more susceptible to degradation due to instruction cache misses than on server-class processors. On the mobile-class node, a decrease in instruction cache misses in 64-bit mode does not necessarily lead to a speedup. Only bzip2, hmmer, and libquantum experience both a decrease in instruction cache misses and an improvement in runtime. Alternatively, while there are more instruction cache misses in 64-bit mode on the server class node, many of these benchmarks still achieve a speedup. L2 cache performance also affects application performance on mobile-class processors more than it does on server-class processors. Of the benchmarks, only three experience fewer L2 cache misses in 32-bit mode than in 64-bit mode. These three—bzip2, hmmer, and libquantum—are also the only benchmarks that experience a speedup in 64-bit mode. While both mobile-class and server-class nodes both have more L2 cache misses in 64-bit mode, most of the benchmarks experience a performance improvement on the server-class nodes because of this move, while this increase in L2 cache misses decreases performance on mobile class nodes. III. C ONCLUSION Mobile-class nodes cannot be considered just smaller versions of server-class or desktop-class nodes. Generalities that hold on server-class processors—for example, that moving to a 64-bit ISA improves performance as suggested by other studies—do not hold on mobile-class nodes. Mobile-class nodes are affected to a greater degree by memory subsystem performance than server-class nodes are, and the in-order nature of the processor requires better instruction scheduling than out-of-order processors require. The differences between these classes of processors must be acknowledged in development of applications, and efforts must be made to more effectively utilize the resources available on mobile-class nodes. R EFERENCES [1] B. Beavers, “The story behind the Intel Atom processor success,” IEEE Design and Test of Computers, vol. 26, pp. 8–13, 2009. [2] S. Sutanthavibul and S. Perabala, “First Intel low-cost IA Atom-based system-on-chip for Nettop/Netbook,” in 1st Asia Symposium on Quality Electronic Design, July 2009, pp. 88 –91. [3] R. Mùˆller-Albrecht, “Optimized for the Intel Atom Processor with Intel’s Compiler,” Intel Software, Technical Report, 2009. [4] D. Ye, J. Ray, C. Harle, and D. Kaeli, “Performance Characterization of SPEC CPU2006 Integer Benchmarks on x86-64 Architecture,” in IEEE International Symposium on Workload Characterization, San Jose, CA, October 2006, pp. 120–127. [5] C. D. Spradling, “SPEC CPU2006 benchmark tools,” ACM SIGARCH Computer Architecture News, vol. 35, no. 1, pp. 130–134, March 2007. 2 132