AMD Bulldozer Colin Donahue Jason Lowden February 28, 2012 Outline y Bulldozer Module History y Pipeline Stages y Features y y Benchmarks y Performance Flaw with Windows 7 y Applications y The future Release y Predecessor – Phenom II (February 2009, used K10 microarchitecture) y Bulldozer released in October 2011 http://cdn3.techworld.com/cmsdata/features/3210727/AMD%20Desktop%20Roadmap.jpg Alpha 21264 Microprocessor y Designed in 1996 y Bulldozer borrows architectural design R.E. Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, Mar. 1999, vol. 19, no. 2, pp. 24-36. Bulldozer Module y Each module has 2 integer cores and a single FPU y All modules share a single L3 Cache http://images.anandtech.com/doci/4955/BDArch.png Bulldozer Module y Architecture known as Clustered Multi-Thread y Bulldozer vs. Intel Hyperthreading y Offers more cores with less overhead per core y Much larger focus on core count and parallel workloads while single threaded performance was largely ignored Fetch y Single L1 instruction cache shared between two integer cores y Cache is 64KB 2-way Set Associative y In Phenom II, 64KB I-cache per core y Can fetch 4 instructions at a time Branch Prediction y Predicted at the same time as fetching y Completely separated from the Fetch hardware y Stalls in fetching will not affect the branch prediction y Uses a prediction queue so it can keep going even when the system is stalled Decode y Two integer cores will share a 4-wide instruction decoder y Higher width than Phenom II when a single core is active, but as more cores are active it falls behind y Relies on the fact that few programs are fetch-decode bound y Each x86 instruction translated into 1 or 2 macro-operations and then queued Operation Fusion y After the decode stage some operations are fused together y Compare-test, test-branch operations are combined to increase IPC y First done by Intel in 2003, first time for AMD Integer Cores y Each core has 16KB of write through L1 data cache y Separate schedulers and register files y Two ALU/AGU ports compared to three in Phenom II y Shared Unified 2 MB L2 Cache http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2 Shared FP Core y Independent scheduler y Multithreaded, fully out-of-order execution y Two 128-bit multiply accumulate units y Single 128-bit high bandwidth load-store unit y Added SSE4 and AVX instruction support y Throughput has not increased over Phenom II, but can use more instructions Clock Speed y Many design decisions were made in an effort to improve y y y y clock speed Fewer resources for each core while adding features Low number of gates per pipeline stage = higher clock speeds Aimed for 30% increase in clock speed over Phenom II, got only 9% / Yield issues? Turbo Core y If half or fewer cores are active the active cores can go to max turbo clock frequencies y If more cores are active then the cores may go to an intermediate turbo frequency http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/4 Cache and Memory Latencies http://www.extremetech.com/wp-content/uploads/2011/10/Sandra-Latency.png Benchmarks – Single Threaded http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/3 http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/5 Benchmarks – Multi-Threaded http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/7 http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/7 Power Usage http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/8 http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/8 Scheduling on Windows 7 y Currently the scheduling is not optimal y Best for unrelated threads to be on different modules, while closely coupled threads are on a single module Sub-optimal Thread 1a Thread 3 Thread 1b Thread 2 Optimal Thread 1a Thread 1b Thread 2 Thread 3 Module 1 Module 2 Module 3 Module 4 Server Applications y Allocate a virtual machine to each core y For each server, there can be up to 64 cores per motherboard y Bounded by memory constraints y Bulldozer supports up to 384 GB per core y Memory in that quantity is very expensive y With current standards, more than one VM is operated per core Current Servers can run 100 VMs on 48 cores y When using Bulldozer, there is significant potential for even more VMs to run simultaneously y Future Generations http://ir.amd.com/phoenix.zhtml?c=74093&p=irol-2012analystday Summary y Novel architecture y Optimized for server workloads y Didn’t quite meet design goals y What caused it to fail? y Partly of a new yearly cadence Questions?