Performance Comparison of Niagara, Xeon, and Itanium2
Daekyeong Moon (dkmoon@cs)
Outline

- Evaluated Machines
- SPEC CPU Result
- SPEC WEB Result
- Using Fixed Point Number
- MTTF Issue
- Conclusion

(Pictures from www.sun.com and www.intel.com)
Evaluated Machines

                     Sun T2000          Dell 1850          HP rx1620
CPU                  Niagara 1GHz       Xeon 3GHz          Itanium2 1.3GHz
Chip/Core/Thread     1/8/32             2/2/2              2/2/2
L1 Cache (I/D)       16KB/8KB (/core)   ~12KB/16KB         16KB/16KB
L2 Cache             3MB                1MB                256KB
L3 Cache             N/A                N/A                3MB
FU (int/fp) / chip   8/1                2/1                6/2
Word length          64 bits            32 bits            64 bits
Multithreading       Fine-grain         Deep out-of-order  Static (VLIW)
Max Watt / chip      79 W               111 W              130 W
Memory               8 GB               3 GB               4 GB
Disks                1 x 73 GB          2 x 147 GB         2 x 73 GB
OS                   Solaris 10         Linux 2.6.11       Linux 2.6.11
Dimension            2U                 1U                 1U

(Data are from Google searches and datasheets at www.intel.com and www.sun.com)
SPEC CPU 2000

Integer benchmarks (CINT2000)
- 11 C programs: gzip, vpr, gcc, mcf, crafty, parser, perlbmk, gap, vortex, bzip2, twolf
- 1 C++ program: eon

Floating point benchmarks (CFP2000)
- 3 C programs: mesa, equake, ammp
- 6 Fortran 77 programs: wupwise, swim, mgrid, applu, apsi, sixtrack
- 4 Fortran 90 programs: fma3d, facerec, galgel, lucas

Speed vs. Throughput
- SPEC speed runs a single benchmark instance and measures response time
- SPEC rate runs multiple benchmark instances simultaneously and measures throughput
SPEC CPU 2000

- 3 different architectures
- 2 different compilers for each architecture
- With and without optimization for each compiler

Notes
- Optimization for Itanium2 includes profile-directed optimization, since Itanium relies on the compiler's static instruction scheduling
- Floating-point optimization is turned off for some applications: optimizations that rely on extended precision generate incorrect results (e.g. vpr under Xeon-icc, gcc under Itanium2-gcc, applu under Itanium2-gcc)
- The FP benchmarks written in Fortran 90 (fma3d, facerec, galgel, lucas) could not be evaluated due to compiler unavailability (icc & suncc)
SPEC int result (Integer Speed)

[Bar chart: SPECint comparison of Niagara (1GHz x32, cc/gcc), Xeon (3GHz x2, gcc/icc), and Itanium2 (1.3GHz x2, gcc/icc), each with and without optimization; x-axis SPECint, 0-1400. Niagara scores roughly 63-183; Xeon and Itanium2 score roughly 294-949 depending on compiler and optimization; Reference1 = 1227, Reference2 = 1278.]

Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21)
Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)
SPEC fp result (FP Speed)

[Bar chart: SPECfp comparison of Niagara (1GHz x32), Itanium2 (1.3GHz x2), and Xeon (3GHz x2), with and without optimization; x-axis SPECfp, 0-2500. Niagara scores only roughly 20-92; Xeon and Itanium2 range roughly 245-1500; Reference1 = 2285, Reference2 = 1621.]

Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21)
Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)
SPEC int rate result (Integer Throughput)

[Line chart: SPECint_rate vs. number of simultaneous applications (2 to 48); y-axis 0-30. Niagara's (1GHz x32) throughput scales with concurrency to roughly 23-25 at 32-48 apps, overtaking Xeon (3GHz x2, ~16-17) and Itanium2 (1.3GHz x2, ~20-21), which plateau early. Reference1 = 26.1, Reference2 = 26.6.]

Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21)
Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)
SPEC web 2005

Operation Overview

Components
- Server Under Test (SUT)
- Prime Client
- Workhorse Clients (I used 20)
- Backend Simulator

Workloads
- SPECweb_Banking: dynamic pages through SSL
- SPECweb_Ecommerce: dynamic pages through SSL + non-SSL
- SPECweb_Support: static pages through non-SSL, to test large file transfers

Metric
- Number of sessions at which the SUT sustains "Tolerable QoS" for more than 99% of requests and "Good QoS" for more than 95%
- To find that number: start from an initial guess and use trial and error (I increased the load by 100 sessions per step)
- SPECweb value: geometric mean of the three workload scores
SPEC WEB network topology

[Diagram: 20 Workhorse Clients drive the SUT; the Prime Client handles initialization and queries; the SUT fetches dynamic data from the Backend Simulator and serves static data from local static workload files (about 50GB per 1500 sessions).]
SPEC WEB 2005

Tested Web Server
- Zeus + PHP
- C.f. Sun's report on Niagara used a Java web server + Java Servlets (14,000 simultaneous sessions!!)
- But Java is especially optimized for SPARC + Solaris => that's NEVER fair!! => use PHP!!
SPEC web result

[Bar chart: SPECweb session counts per workload (Support, Banking, Ecommerce) plus the composite SPECWeb score; x-axis 0-25,000 sessions. Composite scores: Itanium2 423, Xeon 378, Niagara 865. The published reference results are roughly an order of magnitude higher, reaching about 21,500 sessions on some workloads.]

Reference1: Niagara (1.2GHz CPU, Sun Java Web Server 6.1 + JSP)
Reference2: Xeon (2x3.8GHz CPU w/ HT, 2MB L2, Zeus + JSP)
SPEC web result 2

[Bar chart: SPECweb comparison normalized by Xeon. SPECWeb: Itanium2 1.12, Xeon 1.00, Niagara 2.29. SPECWeb/Watt: Itanium2 0.96, Xeon 1.00, Niagara 3.22.]
Weak FP with Niagara?

Slow FPU
- FADD/FMULd/FDIVd = 26/29/83 cycles (c.f. integer ADD/MULX/UDIVX = 1/11/72 cycles)
- 8 cores share one FPU

What if we used a faster FPU? What if each core had its own?
→ Emulate FP with fixed-point numbers in 64-bit integer registers
- Not a precise experiment, but we can see the possibility
Using Fixed Point

IEEE 754 double-precision floating point (64 bits)
- 1 bit for sign
- 11 bits for exponent
- 52 bits for fraction

Fixed point: 32 bits + 32 bits in a 64-bit integer
- Upper 32 bits before the decimal point, lower 32 bits after it
- Add(X, Y) = (X) + (Y)
- Mul(X, Y) = ((X) * (Y)) >> 32

Note: the tested Xeon has a 32-bit word (IA-32 Xeon)
- long long is used there for 64-bit arithmetic, which can degrade its performance
Fixed Point Result 1-1 (Response Time - What if using a faster FPU?)

[Bar chart: time for a single 1000x1000 matrix multiplication without optimization, in seconds (0-300). Niagara: ~186-187 s with integer/fixed point vs. 260 s with double; Xeon: ~21-68 s; Itanium2: ~77-78 s for all three representations.]
Fixed Point Result 1-2 (Response Time - What if using a faster FPU?)

[Bar chart: time for a single 1000x1000 matrix multiplication with optimization, in seconds (0-120). Niagara: ~68-69 s with integer/fixed point vs. 114 s with double; Xeon: ~11-27 s; Itanium2: ~29-41 s.]
Fixed Point Result 2-1 (Throughput - What if each core had an FU?)

[Bar chart: time for 24 simultaneous 1000x1000 matrix multiplications without optimization, in seconds (0-1200). Niagara: ~242-243 s with integer/fixed point vs. 498 s with double; Xeon: ~265-536 s; Itanium2: ~931-957 s.]
Fixed Point Result 2-2 (Throughput - What if each core had an FU?)

[Bar chart: time for 24 simultaneous 1000x1000 matrix multiplications with optimization, in seconds (0-600). Niagara: ~82-84 s with integer/fixed point vs. 482 s with double; Xeon: ~150-345 s; Itanium2: ~332-513 s.]
Putting all eggs in one basket?

Two scenarios
- 32 machines cooperate for a service: one machine's failure results in a service outage, so MTTF' = MTTF / 32. Consolidation onto Niagara can be justified.
- 32 machines are replicas, just for load balancing: "partial service is better than a service outage". The service is down only when all 32 are down simultaneously, so

  MTTF' ≈ MTTF^32 / (32 · MTTR^(32-1)) = (MTTF / 32) · (MTTF / MTTR)^31

What about Niagara?
- Not enough redundancy: only two redundant power supplies and fans
- What if part of the CPU, the memory subsystem, or the disks fails?
- What if the service-outage penalty is much greater than the server management cost? => Usually TRUE
Conclusion

- Evaluated the 3 machines with SPEC benchmarks
- Niagara shows the poorest performance in terms of response time, due to its slow CPU, simple pipeline, and lack of FUs
- Niagara outperforms the others in terms of throughput, thanks to its 32 threads
- Having an FPU in each core could be beneficial
- I'm skeptical about consolidating too many servers without redundancy