AMD Bulldozer

advertisement
AMD Bulldozer
Colin Donahue
Jason Lowden
February 28, 2012
Outline
y Bulldozer Module
History
y Pipeline Stages
y Features
y
y Benchmarks
y Performance Flaw with Windows 7
y Applications
y The future
Release
y Predecessor – Phenom II (February 2009, used K10
microarchitecture)
y Bulldozer released in October 2011
http://cdn3.techworld.com/cmsdata/features/3210727/AMD%20Desktop%20Roadmap.jpg
Alpha 21264 Microprocessor
y Designed in 1996
y Bulldozer borrows architectural design
R.E. Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro, Mar. 1999, vol. 19, no. 2, pp. 24-36.
Bulldozer Module
y Each module has 2 integer cores and a single FPU
y All modules share a single L3 Cache
http://images.anandtech.com/doci/4955/BDArch.png
Bulldozer Module
y Architecture known as Clustered Multi-Thread
y Bulldozer vs. Intel Hyperthreading
y Offers more cores with less overhead per core
y Much larger focus on core count and parallel workloads
while single threaded performance was largely ignored
Fetch
y Single L1 instruction cache shared between two integer cores
y Cache is 64KB 2-way Set Associative
y In Phenom II, 64KB I-cache per core
y Can fetch 4 instructions at a time
Branch Prediction
y Predicted at the same time as fetching
y Completely separated from the Fetch hardware
y Stalls in fetching will not affect the branch prediction
y Uses a prediction queue so it can keep going even when the
system is stalled
Decode
y Two integer cores will share a 4-wide instruction decoder
y Higher width than Phenom II when a single core is active,
but as more cores are active it falls behind
y Relies on the fact that few programs are fetch-decode bound
y Each x86 instruction translated into 1 or 2 macro-operations
and then queued
Operation Fusion
y After the decode stage some operations are fused together
y Compare-test, test-branch operations are combined to
increase IPC
y First done by Intel in 2003, first time for AMD
Integer Cores
y Each core has 16KB of write
through L1 data cache
y Separate schedulers and
register files
y Two ALU/AGU ports
compared to three in
Phenom II
y Shared Unified 2 MB L2
Cache
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/2
Shared FP Core
y Independent scheduler
y Multithreaded, fully out-of-order execution
y Two 128-bit multiply accumulate units
y Single 128-bit high bandwidth load-store unit
y Added SSE4 and AVX instruction support
y Throughput has not increased over Phenom II, but can use
more instructions
Clock Speed
y Many design decisions were made in an effort to improve
y
y
y
y
clock speed
Fewer resources for each core while adding features
Low number of gates per pipeline stage = higher clock
speeds
Aimed for 30% increase in clock speed over Phenom II, got
only 9% /
Yield issues?
Turbo Core
y If half or fewer cores are active the active cores can go to max
turbo clock frequencies
y If more cores are active then the cores may go to an intermediate
turbo frequency
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/4
Cache and Memory Latencies
http://www.extremetech.com/wp-content/uploads/2011/10/Sandra-Latency.png
Benchmarks – Single Threaded
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/3
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/5
Benchmarks – Multi-Threaded
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/7
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/7
Power Usage
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/8
http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/8
Scheduling on Windows 7
y Currently the scheduling is not optimal
y Best for unrelated threads to be on different modules, while
closely coupled threads are on a single module
Sub-optimal
Thread 1a
Thread 3
Thread 1b
Thread 2
Optimal
Thread 1a
Thread 1b
Thread 2
Thread 3
Module 1
Module 2
Module 3
Module 4
Server Applications
y Allocate a virtual machine to each core
y For each server, there can be up to 64 cores per
motherboard
y
Bounded by memory constraints
y Bulldozer supports up to 384 GB per core
y Memory in that quantity is very expensive
y With current standards, more than one VM is operated per
core
Current Servers can run 100 VMs on 48 cores
y When using Bulldozer, there is significant potential for even
more VMs to run simultaneously
y
Future Generations
http://ir.amd.com/phoenix.zhtml?c=74093&p=irol-2012analystday
Summary
y Novel architecture
y Optimized for server workloads
y Didn’t quite meet design goals
y
What caused it to fail?
y Partly of a new yearly cadence
Questions?
Download