Performance Characterization of Mobile

advertisement
Performance Characterization of Mobile-Class
Nodes: Why Fewer Bits is Better
Michelle McDaniel and Kim Hazelwood
Department of Computer Science
University of Virginia
I. I NTRODUCTION
978-1-61284-368-1/11/$26.00 ©2011 IEEE
2.0
1.0
1.5
Mobile Class
Server Class
k
bm
xa
la
nc
r
ta
as
tp
ne
om
re
f
p
m
tu
64
h2
ng
an
qu
lib
sje
er
k
m
bm
hm
go
2
c
gc
ip
bz
pe
rlb
en
ch
0.0
0.5
Speedup
Mobile-class nodes, also known as netbooks, have become
increasingly popular in the personal computing market. As
is the trend in the computing market, the processors in
these mobile-class nodes moving from 32 bits to 64 bits.
This move extends the memory ceiling beyond the traditional
4GB, allowing for a significantly larger virtual address space.
In addition, on the x86_64 architecture, 64-bit technology
gives mobile-class processors access to eight extended 64-bit
registers, as well as eight additional extended registers and
eight additional 32-bit registers.
Mobile-class processors, however, have many features unlike server-class processors. They are limited in both speed
and memory since they are not expected to be used for the
scientific calculations that are performed on server-class nodes
such as the Xeon processor. Also, mobile-class processors are
designed for low power usage, rather than high performance,
and, therefore, high performance-per-watt [1] [2]. In order
to decrease power consumption, these nodes feature in-order
processors [3], which decreases performance.
Ye et al. [4] discuss the performance differences when
moving from 32-bit to 64-bit on server-class processors,
showing that generally moving to 64-bit results in performance
improvements but that improvements cannot be expected for
all applications. We show that these results to not generalize to
all processors by exposing the difference between mobile-class
and server-class processors. To do so, we performed a systemwide performance characterization of the SPEC CPU2006
integer benchmarks [5] on the Intel Atom processor and
on the Intel Xeon processor. Specifically, we compare the
performance of the 32-bit execution to the 64-bit execution
on both platforms. We will show that memory is a serious
consideration when developing for mobile-class processors.
Developers cannot always assume there will be a speedup
on mobile-class processors if there is a speedup on server-class
processors, as shown in Figure 1. While the 64-bit benchmarks
consistently perform better than 32-bit mode on the serverclass processor, this is not true on the mobile-class processor,
where only three benchmarks experience a speedup in 64-bit
mode. The mobile-class processor’s reliance on the ordering of
instructions and data layout causes a decrease in the memorysubsystem performance, which leads to the differences seen in
runtime performance between the mobile-class processor and
the server-class processor.
Speedup of SPEC CPU2006 Integer Benchmarks:
64−bit vs. 32−bit
Fig. 1. A speedup of below 1 indicates that 32-bit mode resulted in a
speedup over 64-bit mode and a speedup of above 1 means that 64-bit mode
resulted in a speedup over 32-bit mode. This figure shows that, for most
of the benchmarks, 32-bit mode is faster than 64-bit mode on mobile-class
processors, while 64-bit mode faster on server-class processors.
II. R ESULTS
Mobile-class nodes differ from server-class nodes in two
fundamental ways. First, mobile-class nodes are populated
with significantly less memory than server-class nodes. In our
study, the mobile-class node was populated with 1GB of RAM,
while the server-class node was populated with 8GB of RAM.
Second, mobile-class nodes feature an in-order processor.
These two characteristics of mobile-class nodes result in lower
power consumption. Because mobile-class processors are used
in netbooks, the decreased power consumption is important,
and so only those performance enhancing features that did
not increase power consumption beyond a certain threshold
were implemented [1]. In order to improve performance on
mobile-class nodes, developers must actively optimize for
these restrictions.
In this section, we present the results of our system-wide
performance characterization, which compares CPU utilization and memory subsystem performance of mobile-class and
server-class processors. Our mobile-class processor for this
study was the Intel Atom N450 processor, a 64-bit Atom
processor, clocked at 1.66 GHz and populated with 1GB
of RAM. For comparison, our server-class system was the
Intel Dual Xeon E5310 Quad Core processor, clocked at
1.60GHz, populated with 8GB of RAM. Both machines were
dual booted with Ubuntu Linux 10.04, 64-bit edition and 32-
131
TABLE I
S UMMARY OF OUR SYSTEM - WIDE CHARACTERIZATION OF SERVER - CLASS
AND MOBILE - CLASS NODES .
Server-Class Node
Mobile-Class Node
Runtime
14% faster in 64-bit mode;
8 benchmarks perform better in 64-bit mode
1% faster in 32-bit mode;
8 benchmarks perform better in 32-bit mode
IPC
7% higher in 32-bit mode;
7 benchmarks higher in
32-bit mode
17% higher in 32-bit
mode; all benchmarks
higher in 32-bit mode
Instruction
Cache
Misses
17% fewer in 64-bit mode;
3 benchmarks higher in
64-bit mode
18% fewer in 64-bit mode;
2 benchmarks higher in
64-bit mode
L2 Cache
Misses
36% more in 64-bit mode;
2 benchmarks lower in 64bit mode
11% more in 64 bit mode;
3 benchmarks lower in 64bit mode
Static
Binary Size
8% larger in 64-bit mode
7% larger in 64-bit mode
Dynamic
Instruction
Count
18% more instructions executed in 32-bit mode
16% more instructions executed in 32-bit mode
Uses more 64-bit registers
in 64-bit mode than 32-bit
registers
Uses more 64-bit registers
in 64-bit mode than 32-bit
registers; mobile-class processors have a higher ratio
of 64-bit to 32-bit register usage than server-class
processors
Registers
bit edition. This characterization was performed using the
SPEC CPU2006 integer benchmarks compiled with GCC
4.5.0. Table I summarizes the results of our system-wide
characterization.
A. CPU Utilization
As mobile-class processors cannot perform out-of-order
instruction scheduling in hardware, they must stall more
often than server-class processors. We can recognize this by
monitoring the IPC of applications on these machines. On the
mobile-class node, IPC is higher for all of the 32-bit binaries
than the 64-bit binaries. The 64-bit versions of libquantum
and omnetpp have the greatest percentage decrease in IPC—
37% and 42% decrease, respectively—on the mobile-class
processor compared to the 32-bit versions of these binaries.
In contrast, several of the benchmarks experience an increase in IPC in 64-bit mode on the server-class processor.
hmmer experiences the greatest increase in IPC—56%—
corresponding to a significant speedup, while libquantum
again experiences a significant decrease in IPC—58%—in 64bit mode. libquantum also experiences a greater degradation of IPC on the server-class processor than on the mobileclass processor, as well as experiencing a smaller speedup on
the server-class node compared to the mobile-class node.
B. Memory Subsystem Characteristics
Mobile-class processors feel the effects of an inefficient
memory subsystem utilization to a greater degree than server
class processors. 64-bit applications put more strain on the
memory subsystem due to the size of data and instructions.
This increased strain on the mobile-class processor heavily
contributes to the performance differences that we see between
the mobile-class processor and the server-class processor. In
order to achieve higher performance on mobile-class processors, application developers and compiler writers must focus
on improving instruction and data placement in order to
improve cache performance and decrease page faults. Only
then will applications achieve higher performance.
On mobile-class processors, performance is more susceptible to degradation due to instruction cache misses than on
server-class processors. On the mobile-class node, a decrease
in instruction cache misses in 64-bit mode does not necessarily
lead to a speedup. Only bzip2, hmmer, and libquantum
experience both a decrease in instruction cache misses and an
improvement in runtime. Alternatively, while there are more
instruction cache misses in 64-bit mode on the server class
node, many of these benchmarks still achieve a speedup.
L2 cache performance also affects application performance
on mobile-class processors more than it does on server-class
processors. Of the benchmarks, only three experience fewer
L2 cache misses in 32-bit mode than in 64-bit mode. These
three—bzip2, hmmer, and libquantum—are also the only
benchmarks that experience a speedup in 64-bit mode. While
both mobile-class and server-class nodes both have more
L2 cache misses in 64-bit mode, most of the benchmarks
experience a performance improvement on the server-class
nodes because of this move, while this increase in L2 cache
misses decreases performance on mobile class nodes.
III. C ONCLUSION
Mobile-class nodes cannot be considered just smaller versions of server-class or desktop-class nodes. Generalities that
hold on server-class processors—for example, that moving to
a 64-bit ISA improves performance as suggested by other
studies—do not hold on mobile-class nodes. Mobile-class
nodes are affected to a greater degree by memory subsystem
performance than server-class nodes are, and the in-order nature of the processor requires better instruction scheduling than
out-of-order processors require. The differences between these
classes of processors must be acknowledged in development
of applications, and efforts must be made to more effectively
utilize the resources available on mobile-class nodes.
R EFERENCES
[1] B. Beavers, “The story behind the Intel Atom processor success,” IEEE
Design and Test of Computers, vol. 26, pp. 8–13, 2009.
[2] S. Sutanthavibul and S. Perabala, “First Intel low-cost IA Atom-based
system-on-chip for Nettop/Netbook,” in 1st Asia Symposium on Quality
Electronic Design, July 2009, pp. 88 –91.
[3] R. Mùˆller-Albrecht, “Optimized for the Intel Atom Processor with Intel’s
Compiler,” Intel Software, Technical Report, 2009.
[4] D. Ye, J. Ray, C. Harle, and D. Kaeli, “Performance Characterization of
SPEC CPU2006 Integer Benchmarks on x86-64 Architecture,” in IEEE
International Symposium on Workload Characterization, San Jose, CA,
October 2006, pp. 120–127.
[5] C. D. Spradling, “SPEC CPU2006 benchmark tools,” ACM SIGARCH
Computer Architecture News, vol. 35, no. 1, pp. 130–134, March 2007.
2
132
Download