Intel® Xeon Phi™ Coprocessor Architecture Overview
Shuo Li, Mahesh Bhat
Financial Services Engineering SSG, Intel

Legal Disclaimer
• INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
• Intel may make changes to specifications and product descriptions at any time, without notice.
• All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.
• Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• Sandy Bridge and other code names featured are used internally within Intel to identify products that are in development and not yet publicly announced for release. Customers, licensees and other third parties are not authorized by Intel to use code names in advertising, promotion or marketing of any product or services and any such use of Intel's internal code names is at the sole risk of the user.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary.
You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
• Intel, Core, Xeon, VTune, Cilk, Intel and Intel Sponsors of Tomorrow. and Intel Sponsors of Tomorrow. logo, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• Copyright ©2011 Intel Corporation.
• Hyper-Threading Technology: Requires an Intel® HT Technology enabled system, check with your PC manufacturer. Performance will vary depending on the specific hardware and software used. Not available on all Intel® Core™ processors. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading
• Intel® 64 architecture: Requires a system with a 64-bit enabled processor, chipset, BIOS and software. Performance will vary depending on the specific hardware and software you use. Consult your PC manufacturer for more information. For more information, visit http://www.intel.com/info/em64t
• Intel® Turbo Boost Technology: Requires a system with Intel® Turbo Boost Technology capability. Consult your PC manufacturer. Performance varies depending on hardware, software and system configuration. For more information, visit http://www.intel.com/technology/turboboost

iXPTC 2013

Agenda
• Intel® Many Integrated Core Architecture
• Intel® Xeon Phi™ Coprocessor Overview
• Core, Vector Processing Unit and Intel® IMCI
• Interconnect and Cache Hierarchy
• Performance
• Summary

Intel Many Integrated Core Architecture

Intel Architecture Multicore and Manycore
More cores. Wider vectors. Co-Processors.
Images do not reflect actual die sizes. Actual production die may differ from images.
Processor                                    Core(s)   Threads
Intel® Xeon® processor 64-bit                1         2
Intel Xeon processor 5100 series             2         2
Intel Xeon processor 5500 series             4         8
Intel Xeon processor 5600 series             6         12
Intel Xeon processor E5 Product Family       8         16
Intel Xeon processor code name Ivy Bridge    10        20
Intel Xeon processor code name Haswell       To be determined
Intel® Xeon Phi™ Coprocessor                 61        244

Intel® Xeon Phi™ Coprocessor extends established CPU architecture and programming concepts to highly parallel applications

Intel® Multicore Architecture
• Foundation of HPC Performance
• Suited for full scope of workloads
• Industry leading performance and performance/watt for serial & parallel workloads
• Focus on fast single core/thread performance with “moderate” number of cores

Intel® Many Integrated Core Architecture
• Performance and performance/watt optimized for highly parallelized compute workloads
• Common software tools with Xeon enabling efficient application readiness and performance tuning
• IA extension to Manycore
• Many cores/threads with wide SIMD

Consistent Tools & Programming Models
[Figure labels: Code; Compiler; Libraries; Parallel Models; Multicore: Intel® Xeon Processors; Manycore: Intel® Xeon Processor, Intel® Xeon Phi™ Coprocessor; Standards Programming Models; Vectorize, Parallelize, & Optimize]

Intel® Xeon Phi™ Coprocessor Overview

Introducing Intel® Xeon Phi™ Coprocessors
Highly-parallel Processing for Unparalleled Discovery

Groundbreaking: differences
• Up to 61 IA cores / 1.1 GHz / 244 threads
• Up to 8GB memory with up to 352 GB/s bandwidth
• 512-bit SIMD instructions
• Linux operating system, IP addressable
• Standard programming languages and tools

Leading to Groundbreaking results
• Over 1 TeraFlop/s double precision peak performance¹
• Up to 2.2x higher memory bandwidth than on an Intel® Xeon® processor E5 family-based server.²
• Up to 4x more performance per watt than with an Intel® Xeon® processor E5 family-based server.³

Notes 1, 2 & 3, see backup for system configuration details.

Intel® Xeon Phi™ Architecture Overview
• 8 memory controllers, 16-channel GDDR5 MC
• PCIe GEN2
• High-speed bi-directional ring interconnect
• Fully coherent L2 cache
• Cores: 61 cores at 1.1 GHz, in-order, each supporting 4 threads
• 512-bit Vector Processing Unit, 32 native registers
• Reliability features: parity on L1 cache, ECC on memory, CRC on memory IO, CAP on memory

Core Architecture Overview
• 60+ in-order, low power IA cores in a ring interconnect
• Two pipelines
  – Scalar Unit based on Pentium® processors
  – Dual issue with scalar instructions
  – Pipelined one-per-clock scalar throughput
• SIMD Vector Processing Engine
• 4 hardware threads per core
  – 4 clock latency, hidden by round-robin scheduling of threads
  – Cannot issue back-to-back instructions in the same thread
• Coherent 512KB L2 Cache per core
[Figure labels: Instruction Decode; Scalar Unit; Vector Unit; Scalar Registers; Vector Registers; 32K L1 I-cache; 32K L1 D-cache; 512K L2 Cache; Ring]

Core and Vector Processing Unit

Vector Processing Unit Extends the Scalar IA Core
[Figure: core block diagram. Pipeline stages: PPF, PF, D0, D1, D2, E, WB. Labels: Thread 0 IP to Thread 3 IP; L1 TLB and L1 instruction cache 32KB; Instruction Cache Miss; TLB miss; 16B/cycle (2 IPC); 4 threads in-order; Decoder; Pipe 0 (u-pipe); Pipe 1 (v-pipe); VPU RF]
[Figure labels, continued: VPU 512b SIMD; uCode; X87 RF; X87; Scalar RF; ALU 0; ALU 1; L1 TLB and L1 Data Cache 32 KB; TLB miss; TLB Miss Handler; Data Cache Miss; HWP; L2 CRI; 512KB L2 Cache; L2 TLB; On-Die Interconnect]

Vector Processing Unit and Intel® IMCI
• The Vector Processing Unit executes Intel® IMCI
  – Intel® Initial Many Core Instructions
• 512-bit Vector Execution Engine
  – 16 lanes of 32-bit single precision and integer operations
  – 8 lanes of 64-bit double precision and integer operations
  – 32 512-bit general-purpose vector registers for each of the 4 threads
  – 8 16-bit mask registers for each of the 4 threads, used for predicated execution
• Read/Write
  – One vector length (512 bits) per cycle from/to the vector registers
  – One operand can come from memory for free
• IEEE 754 Standard Compliance
  – 4 rounding modes: nearest even, toward 0, toward +∞, toward -∞
  – Hardware support for SP/DP denormal handling
  – Sets status flags in the VXCSR register but does not raise hardware traps

Core extension: Vector Processing Unit
[Figure: VPU pipeline and datapath. Stage labels: PPF, PF, D0, D1, D2, E, WB; VC1, VC2, V1-V4. Blocks: DEC; VPU RF (3R, 1W); Mask RF; EMU; Vector ALUs (16 x 32-bit wide, 8 x 64-bit wide, Fused Multiply-Add); LD; ST; Scatter/Gather]

Examples of Intel® IMCI
• Ternary operands
  – vop zmm1, zmm2, zmm3            zmm1 = zmm2 vop zmm3
  – vop zmm1, zmm2, [ptr]           zmm1 = zmm2 vop MEM[ptr]
• Fused operations: multiply-add, multiply-subtract
  – vfmadd132ps zmm1, zmm2, zmm3    zmm1 = zmm1 × zmm3 + zmm2
  – vfmadd213ps zmm1, zmm2, zmm3    zmm1 = zmm2 × zmm1 + zmm3
  – vfmadd231ps zmm1, zmm2, zmm3    zmm1 = zmm2 × zmm3 + zmm1
  – IEEE 754-2008 result with a single rounding: 0.5 ulp, not 1 ulp as with two separate operations
• Prefetching
  – Memory prefetching minimizes the likelihood of L1 and L2 cache misses
  – The Intel® Xeon Phi™ coprocessor has a hardware prefetcher
  – L1 prefetch: vprefetch1 ptr, hint
  – L2 prefetch: vprefetch2 ptr, hint

EMU - Extended Math Unit
• Single precision transcendental functions
• Minimax quadratic polynomial approximation
• Directly implements 4 elementary functions
  – vrcp23ps v1 {k1}, v0       // Reciprocal
  – vrsqrt23ps v1 {k1}, v0     // Reciprocal square root
  – vlog223ps v1 {k1}, v0      // Logarithmic
  – vexp223ps v1 {k1}, v2      // Exponential
• Benefits other derived functions
  – pow(x,y), sqrt(), div(), ln()

Function name   Latency   Throughput
exp2()          8         2
log2()          4         1
rcp()           4         1
rsqrt()         4         1
sqrt()          8         2
pow()           16        4
div()           8         2
ln()            8         2

Vector Instruction Performance
• The VPU contains 16 SP ALUs and 8 DP ALUs
• Most VPU instructions have a latency of 4 cycles and a throughput (TPT) of 1 cycle
  – Load/Store/Scatter have 7-cycle latency
  – Convert/Shuffle have 6-cycle latency
• VPU instructions are issued in the u-pipe
• Certain instructions can also go to the v-pipe
  – Vector Mask, Vector Store, Vector Packstore, Vector Prefetch, Scalar

Interconnect and Cache Hierarchy

Ring Interconnect
Distributed Tag Directories
[Figure: cores, each with an L2 cache and a tag directory (TD), attached to the ring. Ring labels: Data, Command, Address, Coherence. Tag directory entry fields: TAG, Core Valid Mask, State]
• Tag Directories track the cache lines in all L2 caches

Cache Hierarchy
Parameter       L1               L2
Coherence       MESI             MESI
Size            32 KB + 32 KB    512 KB
Associativity   8-way            8-way
Line Size       64 Bytes         64 Bytes
Banks           8                8
Access Time     2 cycles         23 cycles
Policy          Pseudo LRU       Pseudo LRU
Duty Cycle      1 per clock      1 per clock
Ports           Read or Write    Read or Write

Power and Performance

Theoretical Maximum (Intel® Xeon® processor E5-2670 vs. Intel® Xeon Phi™ coprocessor 5110P & SE10P/X)
Metric (higher is better)   E5-2670 (2x 2.6GHz, 8C, 115W)   5110P (60C, 1.053GHz, 225W)   SE10P/X (61C, 1.1GHz, 300W)   Coprocessor vs. E5-2670
Single Precision (GF/s)     666                             2,022                         2,147                         Up to 3.2x
Double Precision (GF/s)     333                             1,011                         1,074                         Up to 3.2x
Memory Bandwidth (GB/s)     102                             320                           352                           Up to 3.45x

Source: Intel as of October 17, 2012
Configuration Details: Please reference slide speaker notes.
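The theoretical maxima charted above can be reproduced from the core counts and clock rates given in the chart labels. The sketch below assumes two floating-point operations per SIMD lane per cycle (a fused multiply-add on the coprocessor, one multiply plus one add on the E5-2670) and a 256-bit vector width for the E5-2670. Neither assumption is stated in the deck, and `peak_gflops` is a hypothetical helper name used only for illustration.

```python
def peak_gflops(cores, ghz, simd_bits, elem_bits, flops_per_lane_per_cycle=2):
    """Theoretical peak in GF/s: cores x GHz x SIMD lanes x flops per lane per cycle.

    flops_per_lane_per_cycle=2 is an assumption (multiply + add each cycle).
    """
    lanes = simd_bits // elem_bits
    return cores * ghz * lanes * flops_per_lane_per_cycle


# (label, total cores, clock in GHz, SIMD width in bits)
CONFIGS = [
    ("E5-2670 (2 sockets x 8C)", 16, 2.6, 256),  # 256-bit width is an assumption
    ("5110P", 60, 1.053, 512),
    ("SE10P/X", 61, 1.1, 512),
]

for label, cores, ghz, bits in CONFIGS:
    sp = peak_gflops(cores, ghz, bits, 32)  # 32-bit single precision lanes
    dp = peak_gflops(cores, ghz, bits, 64)  # 64-bit double precision lanes
    print(f"{label}: SP {sp:.0f} GF/s, DP {dp:.0f} GF/s")
```

Rounded to the nearest GF/s, these products agree with the 666/333, 2,022/1,011 and 2,147/1,074 figures in the chart, which suggests the chart values are plain cores x clock x lanes x 2 products with no turbo or efficiency factor applied.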
Synthetic Benchmark Summary
Benchmark (higher is better)   E5-2670 baseline (2x 2.6GHz, 8C, 115W)   5110P (60C, 1.053GHz, 225W)   SE10P (61C, 1.1GHz, 300W)   Coprocessor vs. baseline
SGEMM (GF/s)                   640                                      1,729 (85% efficient)         1,860 (86% efficient)       Up to 2.9X
DGEMM (GF/s)                   309                                      833 (82% efficient)           883 (82% efficient)         Up to 2.8X
SMP Linpack (GF/s)             303                                      722 (71% efficient)           803 (75% efficient)         Up to 2.6X
STREAM Triad (GB/s)            80                                       159 (ECC on)                  174 (ECC on)                Up to 2.2X

Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)

Intel® Xeon Phi™ Coprocessor vs. Intel® Xeon® Processor
Financial Services Workloads
[Figure: bar chart, higher is better. Y axis: Relative Performance (Normalized to 1.0 Baseline of a 2 socket Intel® Xeon® processor E5-2687). Series: Intel® Xeon Phi™ Coprocessor vs. 2 Socket Intel® Xeon® processor. Categories: 2S Intel® Xeon® Processor (1.00); BlackScholes Compute DP; BlackScholes Compute & BW DP; Monte Carlo Simulation DP; BlackScholes Compute SP; BlackScholes Compute & BW SP; Monte Carlo Simulation SP. Bar values shown: 10.75, 8.92, 7.52, 4.48, 3.94, 3.45]

Coprocessor results: Benchmark run 100% on coprocessor, no help from Intel® Xeon® processor host (aka native)

Notes
1. 2 X Intel® Xeon® Processor E5-2670 (2.6GHz, 8C, 115W)
2. Intel® Xeon Phi™ coprocessor SE10 (ECC on) with pre-production SW stack

Higher SP results are due to certain Single Precision transcendental functions in the Intel® Xeon Phi™ coprocessor which are not present in the Intel® Xeon® processor

Summary
• The Intel® Xeon Phi™ coprocessor provides performance and performance/watt for highly parallel HPC workloads through many cores and threads, wide SIMD, caches, and memory bandwidth
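As a closing illustration of the wide-SIMD point, the three fused multiply-add operand orderings listed under Examples of Intel® IMCI, and the single-rounding property claimed for them, can be sketched in plain Python. This is an emulation under stated assumptions, not the hardware behaviour: lanes are modelled as list elements, arithmetic is done in Python's double precision rather than the single precision of the `ps` forms, and the helper names `fma`, `vfmadd132`, `vfmadd213` and `vfmadd231` are hypothetical stand-ins for the instructions.

```python
from fractions import Fraction


def fma(a, b, c):
    """Return a*b + c rounded once, emulated with exact rational arithmetic.

    Finite inputs only: Fraction() rejects inf and nan.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def vfmadd132(z1, z2, z3):
    """Per lane: z1 * z3 + z2 (operands 1 and 3 multiplied, 2 added)."""
    return [fma(x1, x3, x2) for x1, x2, x3 in zip(z1, z2, z3)]


def vfmadd213(z1, z2, z3):
    """Per lane: z2 * z1 + z3 (operands 2 and 1 multiplied, 3 added)."""
    return [fma(x2, x1, x3) for x1, x2, x3 in zip(z1, z2, z3)]


def vfmadd231(z1, z2, z3):
    """Per lane: z2 * z3 + z1 (operands 2 and 3 multiplied, 1 added)."""
    return [fma(x2, x3, x1) for x1, x2, x3 in zip(z1, z2, z3)]


# Rounding demonstration (double precision stands in for single precision).
# a*a is exactly 1 + 2**-26 + 2**-54; the last term is lost when the product
# is rounded on its own, but survives when the sum is rounded only once.
a = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)
two_roundings = a * a + c    # multiply rounds, then add rounds
one_rounding = fma(a, a, c)  # exact product and sum, rounded once
print(two_roundings, one_rounding)
```

With these inputs the separately rounded expression loses the low-order term entirely, while the fused form keeps it, which is the difference between one rounding and two that the IMCI slide describes.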