Performance Comparison of Niagara, Xeon, and Itanium2
Daekyeong Moon (dkmoon@cs)

Outline
• Evaluated Machines
• SPEC CPU Result
• SPEC WEB Result
• Using Fixed Point Number
• MTTF Issue
• Conclusion
(Pictures from www.sun.com and www.intel.com)

Evaluated Machines

                      Sun T2000           Dell 1850           HP rx1620
CPU                   Niagara 1GHz        Xeon 3GHz           Itanium2 1.3GHz
Chip/Core/Thread      1/8/32              2/2/2               2/2/2
L1 Cache (I/D)        16KB/8KB (/core)    ~12KB/16KB          16KB/16KB
L2 Cache              3MB                 1MB                 256KB
L3 Cache              N/A                 N/A                 3MB
FU (int/fp) / chip    8/1                 2/1                 6/2
Word length           64 bits             32 bits             64 bits
Multithreads          Fine-grain          Deep out-of-order   Static (VLIW)
Max Watt / chip       79 W                111 W               130 W
Memory                8 GB                3 GB                4 GB
Disks                 1 x 73 GB           2 x 147 GB          2 x 73 GB
OS                    Solaris 10          Linux 2.6.11        Linux 2.6.11
Dimension             2U                  1U                  1U

(Data are from google search and datasheets from www.intel.com and www.sun.com)

SPEC CPU 2000
• Integer benchmarks (CINT2000)
  - 11 C programs: gzip, vpr, gcc, mcf, crafty, parser, perlbmk, gap, vortex, bzip2, twolf
  - 1 C++ program: eon
• Floating Point benchmarks (CFP2000)
  - 3 C programs: mesa, equake, ammp
  - 6 Fortran 77 programs: wupwise, swim, mgrid, applu, apsi, sixtrack
  - 4 Fortran 90 programs: fma3d, facerec, galgel, lucas
• Speed vs. Throughput
  - Basically, it measures speed with a single benchmark instance
  - SPEC rate runs multiple benchmarks simultaneously to measure throughput

SPEC CPU 2000
• 3 different architectures
• 2 different compilers for each architecture
• With and without optimization for each compiler
• Note
  - Optimization for Itanium2 includes profile-directed optimization
      Itanium relies on the compiler’s static instruction scheduling
  - Floating point optimization is turned off for some applications
      The optimization which relies on extended precision generates incorrect results, e.g.
      vpr under Xeon-icc, gcc under Itanium2-gcc, applu under Itanium2-gcc
  - FP benchmarks written in Fortran 90 (i.e. fma3d, facerec, galgel, lucas) couldn’t be evaluated
      Due to unavailability of compilers (i.e. icc & suncc)

SPEC int result (Integer Speed)
[Bar chart: SPECint Comparison. Groups: Niagara (1GHz*32) with cc and gcc, Xeon (3GHz*2) with icc and gcc, Itanium2 (1.3GHz*2) with icc and gcc, Reference 1, Reference 2. Series: without optimization, with optimization. Axis: SPECint, 0 to 1400.]
Numbers as extracted from the chart (assignment to bars not recoverable): 153, 102, 63, 148, 733, 294, 597, 949, 134, 854, 575, 183, 1227, 1278
Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21)
Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)

SPEC fp result (FP Speed)
[Bar chart: SPECfp Comparison. Groups: Niagara (1GHz*32) with cc and gcc, Xeon (3GHz*2) with icc and gcc, Itanium2 (1.3GHz*2) with icc and gcc, Reference 1, Reference 2. Series: without optimization, with optimization. Axis: SPECfp, 0 to 2500.]
Numbers as extracted from the chart (assignment to bars not recoverable): 264, 245, 664, 594, 1500, 45, 92, 50, 22, 36, 20, 328, 2285, 1621
Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21)
Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)

SPEC int rate result (Integer Throughput)
[Bar chart: SPECint_rate Comparison. Groups: Niagara (1GHz*32), Xeon (3GHz*2), Itanium2 (1.3GHz*2), Reference 1 (26.1), Reference 2 (26.6). Series: number of simultaneous apps. Axis: 0 to 30. The per-bar values are interleaved with the legend in the extracted text and are not recoverable.]
Reference1: Itanium2 (2x1.6GHz CPU, 3MB L2 Cache, icc, Linux 2.4.21)
Reference2: Xeon (2x3.4GHz CPU, 1MB L2 Cache, icc, Linux 2.6.4-smp)

SPEC web 2005 Operation Overview
• Components
  - Server Under Test (SUT)
  - Prime Client
  - Workhorse Clients (I used 20)
  - Backend Simulator
• Workload
  - SPECweb_Banking: dynamic pages through SSL
  - SPECweb_Ecommerce: dynamic pages through SSL + non-SSL
  - SPECweb_Support: static pages through non-SSL to test large file transfer
• Metric
  - # of sessions at which SUT sustains “Tolerable QoS” of more than 99% and “Good QoS” of more than 95%
• To
  figure out the number: initial guess, then trial and error
  - I increased the load by 100 sessions per step
• SPECweb value: GeoMean of the three workloads

SPEC WEB network topology
[Diagram: SUT, Prime Client, Backend Simulator, and 20 Workhorse Clients. Edge labels: Dynamic data, Init Query, Static Data.]
• Static workload files: 50GB per 1500 sessions

SPEC WEB 2005 Tested Web Server
• Zeus + PHP
• C.f. Sun’s report on Niagara used a Java web server + Java Servlet (14,000 simultaneous sessions!!)
• But Java is especially optimized on SPARC + Solaris
  => It’s NEVER fair!!
  => USE PHP!!

SPEC web result
SPECweb Comparison (workload rows are numbers of sessions)

              Itanium2    Xeon    Niagara
Support          700       500       800
Banking          400       600      1200
Ecommerce        600       400      1500
SPECWeb          423       378       865

Values for Reference 1 and Reference 2 as extracted from the chart (assignment to workloads not recoverable): 4177, 13160, 4695, 14001, 4820, 21500, 7140, 21500
Reference1: Niagara (1.2GHz CPU, Sun Java Web Server 6.1 + JSP)
Reference2: Xeon (2x3.8GHz CPU w/ HT, 2MB L2, Zeus + JSP)

SPEC web result 2
SPEC web Comparison (Normalized by Xeon)

              SPECWeb    SPECWeb/Watt
Itanium2       1.12         0.96
Xeon           1.00         1.00
Niagara        2.29         3.22

Weak FP with Niagara?
• Slow FPU
  - FADD/FMULd/FDIVd = 26/29/83 cycles
  - c.f. Integer: ADD/MULX/UDIVX = 1/11/72 cycles
  → What if we use a faster one?
• 8 cores share one FPU
  → What if each core has one?
• Fixed Point number using 64-bit Integer register
  - Not a precise experiment, but we can see the possibility

Using Fixed Point
• IEEE 754 Floating Point
  - 1 bit for sign, 11 bits for exponent, 52 bits for fraction
  - Layout (bit 63 down to bit 0): Sign | Exponent | Fraction
• Fixed point: 32 bits + 32 bits
  - Layout (bit 63 down to bit 0): before decimal pt | decimal pt | after decimal pt
  - Add(X, Y) = (X) + (Y)
  - Mul(X, Y) = ((X) * (Y)) >> 32
• Note: Tested Xeon has 32-bit word (IA-32 Xeon)
  - long long is used for 64-bit
  - Can degrade the performance

Fixed Point Result 1-1 (Response Time – What if using faster FPU)
[Bar chart: Time for Single Matrix Multiplication without Optimization (Matrix Size=1000x1000). Groups: Niagara (1GHz*32), Xeon (3GHz*2), Itanium2 (1.3GHz*2). Series: Using Integer, Using Fixed Point Number, Using Double. Axis: Seconds, 0 to 300.]
Numbers as extracted from the chart (assignment to bars not recoverable): 260, 187, 186, 42, 68, 21, 77, 78, 78

Fixed Point Result 1-2 (Response Time – What if using faster FPU)
[Bar chart: Time for Single Matrix Multiplication with Optimization (Matrix Size=1000x1000). Groups: Niagara (1GHz*32), Xeon (3GHz*2), Itanium2 (1.3GHz*2). Series: Using Integer, Using Fixed Point Number, Using Double. Axis: Seconds, 0 to 120.]
Numbers as extracted from the chart (assignment to bars not recoverable): 114, 69, 68, 11, 27, 26, 29, 41, 37

Fixed Point Result 2-1 (Throughput – What if each core has a FU?)
[Bar chart: Time for 24 simultaneous Matrix Multiplications without Optimization (Matrix Size=1000x1000). Groups: Niagara (1GHz*32), Xeon (3GHz*2), Itanium2 (1.3GHz*2). Series: Using Integer, Using Fixed Point Number, Using Double. Axis: Seconds, 0 to 1200.]
Numbers as extracted from the chart (assignment to bars not recoverable): 498, 242, 243, 265, 536, 514, 957, 938, 931

Fixed Point Result 2-2 (Throughput – What if each core has a FU?)
[Bar chart: Time for 24 simultaneous Matrix Multiplications with Optimization (Matrix Size=1000x1000). Groups: Niagara (1GHz*32), Xeon (3GHz*2), Itanium2 (1.3GHz*2). Series: Using Integer, Using Fixed Point Number, Using Double. Axis: Seconds, 0 to 600.]
Numbers as extracted from the chart (assignment to bars not recoverable): 482, 84, 82, 150, 345, 333, 332, 513, 417

Putting all eggs in one basket?
• Two scenarios
  - 32 machines cooperate for a service
      One machine failure results in a service outage
      MTTF’ = MTTF / 32
      Niagara can be justified
  - 32 machines are replicas just for load balancing
      “Partial service is better than service outage”
      MTTF’ = MTTF / (32 * (MTTR / MTTF)^(32-1)) = MTTF^32 / (32 * MTTR^31)
• What about Niagara?
  - Not enough redundancy: only two redundant power supplies and fans
  - What if part of the CPU, memory subsystem, or disks fails?
  - What if the service outage penalty is much greater than the server management cost? => Usually TRUE

Conclusion
• Evaluated 3 machines by SPEC
• Niagara shows the poorest performance in terms of response time
  - The slow CPU, simple pipeline, and the lack of FU
• Niagara outperforms the others in terms of throughput
  - 32 threads
• Having an FPU for each core can be beneficial
• I’m skeptical about consolidating too many servers without redundancy
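
As a rough illustration of the two MTTF scenarios in the "Putting all eggs in one basket?" slide, the sketch below evaluates both cases. It assumes independent, identical machines and uses the usual approximation for n repairable replicas, MTTF^n / (n * MTTR^(n-1)), for the case where service is lost only when all replicas are down at once. The function names and the sample MTTF/MTTR figures are hypothetical and are not taken from the measurements above.

```python
def mttf_cooperating(mttf: float, n: int) -> float:
    """Scenario 1: n machines cooperate, so any single failure is an outage."""
    return mttf / n


def mttf_replicated(mttf: float, mttr: float, n: int) -> float:
    """Scenario 2: n load-balanced replicas; a full outage needs all n down.

    Uses the approximation MTTF^n / (n * MTTR^(n-1)), which assumes
    independent failures and MTTR much smaller than MTTF.
    """
    return mttf ** n / (n * mttr ** (n - 1))


if __name__ == "__main__":
    # Hypothetical figures, in hours, chosen only for illustration.
    single_mttf = 10_000.0
    single_mttr = 24.0
    machines = 32

    print("cooperating:", mttf_cooperating(single_mttf, machines))
    print("replicated: ", mttf_replicated(single_mttf, single_mttr, machines))
```

The gap between the two results is what the slide argues about: consolidating 32 replicas into one box gives up the second figure, while consolidating 32 cooperating machines gives up only the first.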