Larrabee
Eric Jogerst, Cortlandt Schoonover, Francis Tan

Larrabee
• Intel's new approach to a GPU
• Considered a hybrid between a multicore CPU and a GPU
• Combines the functions of a multi-core CPU with the functions of a GPU

Larrabee: FETCH

Fetch
• Utilizes a hardware prefetcher
• Supports four threads of execution
  – Separate register files for each thread
  – Switches threads to cover cases where the compiler cannot schedule code without stalls, or where the prefetcher has not yet delivered new instructions
  – Inactive thread data is written to the core's local L2 cache

Larrabee: PIPELINE ORGANIZATION

Pipeline
• Pipeline derived from the dual-issue Pentium processor, a 5-stage design
  – Short, inexpensive execution pipeline
• Pairing rules for the primary and secondary instruction pipes are deterministic
  – Allows compilers to perform offline analysis with a wide scope

Pipeline
• All instructions can be issued on the primary pipeline
  – Minimizes the combinatorial problems for a compiler
• Secondary pipeline can execute a large subset of the x86 instruction set
  – Small and cheap
  – Power wasted by failing to dual-issue on every cycle is minimal

Pipeline
• Each core has its own pipeline
  – Based upon the 5-stage Pentium
  – Dual-issues instructions
  – In-order execution
• Pipeline is shared between threads
  – Hardware can switch between threads that have instructions ready to execute

Pipeline
• Software-rendering pipeline designed to minimize the number of locks and other synchronization events
• Graphics-rendering pipeline written with high-level languages and tools
  – Enables developers to add innovative rendering capabilities

Larrabee: SIMD ORGANIZATION

Vector Processor Unit
• 16-wide vector processor unit (VPU)
  – Executes integer, single-precision float, and double-precision float instructions
  – The VPU and its registers are approximately one-third the area of the processor core
• Tradeoff: 16 lanes balances increased computational density against the difficulty of keeping a wider VPU highly utilized

Vector Processor Unit
• VPU instructions can be predicated by a mask register
• The mask controls which parts of a vector register or memory location are written and which are left untouched
• Advantages
  – Reduces branch misprediction penalties
  – Gives the instruction scheduler greater freedom
• A minimal predication sketch appears below, after the x86 core overview

Number of Cores
• Many-core processor
  – Planned to have 24 to 48 cores

Larrabee: SYSTEM ON-CHIP COMPONENTS

System On-Chip Components
• x86 compute cores
  – Dual-issue, in-order processors that support the x86 instruction set with Larrabee extensions
  – Connected to the ring network, with a high-bandwidth connection to the adjacent L2 cache subset
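The Larrabee extensions mentioned above include the 16-wide vector instructions with mask-register predication described on the Vector Processor Unit slides. Below is a minimal, illustrative C sketch of that predication model under the assumption of a 16-lane VPU; it only mimics the semantics (the mask decides which lanes are written and which keep their old values). The names vadd_masked and VPU_WIDTH are hypothetical and are not actual Larrabee (LRBni) intrinsics.

/* Illustrative model of mask-register predication on a 16-wide VPU.
 * Not Larrabee's real instruction set; it only mimics the semantics on the
 * Vector Processor Unit slides: the mask decides which destination lanes
 * are written and which are left untouched. */
#include <stdio.h>
#include <stdint.h>

#define VPU_WIDTH 16

/* Predicated vector add: dst[i] = a[i] + b[i] only where mask bit i is set. */
static void vadd_masked(float *dst, const float *a, const float *b, uint16_t mask)
{
    for (int i = 0; i < VPU_WIDTH; i++) {
        if (mask & (1u << i))        /* lane enabled by the mask register */
            dst[i] = a[i] + b[i];
        /* lanes with a 0 mask bit keep their previous value (no write) */
    }
}

int main(void)
{
    float a[VPU_WIDTH], b[VPU_WIDTH], dst[VPU_WIDTH];
    for (int i = 0; i < VPU_WIDTH; i++) {
        a[i] = (float)i;
        b[i] = 100.0f;
        dst[i] = -1.0f;              /* sentinel to show untouched lanes */
    }

    /* Mask generated from a data-dependent test ("a[i] < 8").  Both sides of
     * a branch can be executed with complementary masks, which is how
     * predication avoids branch misprediction penalties. */
    uint16_t mask = 0;
    for (int i = 0; i < VPU_WIDTH; i++)
        if (a[i] < 8.0f)
            mask |= (uint16_t)(1u << i);

    vadd_masked(dst, a, b, mask);

    for (int i = 0; i < VPU_WIDTH; i++)
        printf("lane %2d: %6.1f\n", i, dst[i]);
    return 0;
}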
System On-Chip Components
• L2 cache subsets
  – High-bandwidth access for the adjacent CPU core
  – Connected directly to the ring network
  – Coherent cache; uses the ring network to check coherency when allocating new cache lines

System On-Chip Components
• Ring network nodes
  – Simple bi-directional routers with a 512-bit data path in each direction (1024 bits total)
  – Organized in rings of 8-16 cores and other devices
  – Interconnected with other rings
  – All data moved between cores and fixed-function units passes through the ring network

System On-Chip Components
• Fixed-function logic components
  – Provide rasterization, interpolation, and other commonly needed functions
  – Directly connected to the ring network
  – Spread among the cores to provide lower latency and load balancing on the ring network

System On-Chip Components
• Memory & I/O interface
  – Provides and manages communication between the ring network and off-chip devices
  – Manages initial routing and tasking of cores

Larrabee: MEMORY HIERARCHY

Memory Interface

Larrabee: ON-CHIP INTERCONNECT

On-Chip Interconnect
• Ring interconnect bus
• Similar to the ring bus in the Sony Cell processor

Ring Bus Features
• Bi-directional
• 512 bits in each direction
• Presumably running at core speed
• Each element can take data from one direction on odd clock cycles and from the other direction on even clock cycles

Ring Bus Comparisons
• Compared to AMD's R600/RV670 bus, it is half the bit-width
• The higher clock speed of Larrabee's bus should make up for the difference in bandwidth

Ring Bus Tradeoff Analysis
Pros:
• Straightforward, not complex
• Able to deliver high bandwidth
• Great performance if memory clients need high bandwidth
Cons:
• Waste of chip area if most applications don't need high memory bandwidth
• That area could be spent elsewhere to increase performance in a different way

Larrabee: MULTITHREADING ORGANIZATION

Multithreading Organization
• Superscalar
• In-order
• Four threads of execution
• Dual issue (with a vector processing unit)

Comparison to OO Execution
                     Out-of-order CPU   Larrabee (in-order)
# CPU cores:         2                  10
Instruction issue:   4 per clock        2 per clock
VPU per core:        4-wide SSE         16-wide
L2 cache size:       4 MB               4 MB
Single-stream:       4 per clock        2 per clock
Vector throughput:   8 per clock        160 per clock

Scheduling Policy
• Software controlled
• More flexible than the hardware-controlled scheduling of a typical GPU
• A minimal sketch appears at the end of the deck

Software Controlled Scheduling
Pros:
• Flexible: the scheduler can be chosen to suit the application
• Worst case won't be so bad (compared to a hardware-encoded scheduling policy)
Cons:
• Overhead of the scheduler takes a bite out of performance
• Programmer overhead of selecting the correct scheduler

Criticism
• NVIDIA
  – "Like a GPU from 2006"
  – Unrealistic performance projections
  – Motivated by an interest in retaining market share

Possible Market
• DreamWorks Animation
• Xbox / PlayStation
• Scientific research

Questions?
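As a rough illustration of the software-controlled scheduling discussed on the Scheduling Policy slides, the following C sketch treats the scheduling policy as an ordinary function that the application selects. Task, pick_fifo, and pick_shortest are hypothetical names used only for illustration; they are not part of Larrabee's software stack.

/* Minimal sketch of software-controlled scheduling: the policy that picks the
 * next task is an ordinary function chosen by the application, not fixed in
 * hardware. */
#include <stdio.h>

typedef struct {
    const char *name;
    int cost;        /* estimated work, used by the shortest-job-first policy */
    int done;
} Task;

/* A scheduling policy returns the index of the next task to run, or -1. */
typedef int (*policy_fn)(const Task *tasks, int n);

static int pick_fifo(const Task *tasks, int n)
{
    for (int i = 0; i < n; i++)
        if (!tasks[i].done)
            return i;
    return -1;
}

static int pick_shortest(const Task *tasks, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (!tasks[i].done && (best < 0 || tasks[i].cost < tasks[best].cost))
            best = i;
    return best;
}

static void run_all(Task *tasks, int n, policy_fn pick)
{
    int next;
    while ((next = pick(tasks, n)) >= 0) {
        printf("running %s (cost %d)\n", tasks[next].name, tasks[next].cost);
        tasks[next].done = 1;
    }
}

int main(void)
{
    Task tasks[] = { {"shade tile A", 5, 0}, {"rasterize B", 2, 0}, {"blend C", 8, 0} };
    /* Swap pick_shortest for pick_fifo to change the policy entirely in software. */
    run_all(tasks, 3, pick_shortest);
    return 0;
}

Swapping one policy function for another changes the schedule without any hardware support, which is the flexibility (and the software overhead) that the Scheduling Policy slides describe.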