Multicore Architectures

Multicore Architectures
Michael Gerndt
Development of Microprocessors
[Chart: transistor counts over time, © Intel]
Transistor capacity doubles every 18 months
Development of Microprocessors
• Moore’s Law
• Expected to hold for at least another 10 years
• But: transistor count ≠ compute power
• How to use transistor resources?
• Better execution core
– Enhance pipelining, superscalarity, …
– Better vector processing (SIMD, like MMX/SSE)
– Problem: Gap to memory speed
• Larger Caches
– Improves memory access speed
• More execution cores
– Problem: Gap to memory speed
•…
Development of Microprocessors
• Objective for manufacturers
• As much profit as possible: Sell processors …
• Customers only buy when applications run faster
• Increase CPU power
• How to increase CPU power?
• Higher clock rate
• More parallelism
– Instruction Level Parallelism (ILP)
– Thread Level Parallelism (TLP)
Development of Microprocessors
• Higher clock rates
• increase power consumption
– proportional to f and U² (see the formula sketch below)
– higher frequency needs higher voltage
– small structures: energy loss through leakage
• increase heat output and cooling requirements
• limit chip size (speed of light)
• at fixed technology (e.g. 65 nm)
– fewer transistor levels per pipeline stage possible
– more, but simplified pipeline stages (P4: >30 stages)
– higher penalty for pipeline stalls
(on conflicts, e.g. branch misprediction)
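A first-order sketch of the "proportional to f and U²" bullet above (not from the original slides; C_eff denotes the effective switched capacitance, and the rough f ∝ U scaling is an added assumption):

\[
P_{\mathrm{dyn}} \approx C_{\mathrm{eff}} \, U^{2} f,
\qquad f \propto U \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^{3}
\]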
Development of Microprocessors
• More parallelism
• Increased bit width (now: 64 bit architectures)
– SIMD
• Instruction Level Parallelism (ILP)
– exploits parallelism found in an instruction stream
– limited by data/control dependencies (small example below)
– can be increased by speculation
– average ILP in typical programs: 6-7
– modern superscalar processors cannot get better…
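A small made-up C example of the data-dependence limit (illustration only, not from the slides):

#include <stdio.h>

int main(void)
{
    double a = 1, b = 2, d = 3, e = 4;

    /* Independent operations: a superscalar core can issue both
     * additions in the same cycle (ILP = 2 on this pair). */
    double c = a + b;
    double f = d + e;

    /* Dependent chain: each statement needs the previous result,
     * so only one instruction of the chain can execute per step. */
    double x = a + b;
    double y = x * d;
    double z = y - e;

    printf("%f %f %f %f\n", c, f, y, z);
    return 0;
}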
Development of Microprocessors
• More parallelism
• Thread Level Parallelism (TLP)
– Hardware multithreading (e.g. SMT: Hyper-Threading)
– better exploitation of superscalar execution units
– Multiple cores
– Legacy software must be parallelized (see the sketch below)
– Challenge for whole software industry
– Intel moved into the tools business
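A minimal sketch of thread-level parallelism with POSIX threads (not part of the slides; NTHREADS, sum_part and the array are made-up example names, assuming one software thread per core):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4        /* assumed number of cores / hardware threads */
#define N 1000000

static double a[N];
static double partial[NTHREADS];

/* Each thread sums one contiguous chunk of the array. */
static void *sum_part(void *arg)
{
    long t = (long)arg;
    long chunk = N / NTHREADS;
    long lo = t * chunk;
    long hi = (t == NTHREADS - 1) ? N : lo + chunk;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += a[i];
    partial[t] = s;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++)
        a[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_part, (void *)t);
    double sum = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        sum += partial[t];
    }
    printf("sum = %f\n", sum);     /* expected: 1000000.000000 */
    return 0;
}

Compile with: gcc -pthread example.c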
Multicore Architectures
• SMPs on a single chip
• Chip Multi-Processors (CMP)
• Advantage
• Efficient exploitation of available transistor budget
• Improves throughput and speed of parallelized applications
• Allows tight coupling of cores
– better communication between cores than in SMP
– shared caches
• Low power consumption
– low clock rates
– idle cores can be suspended
• Disadvantage
• Only improves speed of parallelized applications
• Increased gap to memory speed
Multicore Architectures
• Design decisions
• homogeneous vs. heterogeneous
– specialized accelerator cores
– SIMD
– GPU operations
– cryptography
– DSP functions (e.g. FFT)
– FPGA (programmable circuits)
• access to memory
– own memory area (distributed memory)
– via cache hierarchy (shared memory)
• Connection of cores
– internal bus / cross bar connection
– Cache architecture
Multicore Architectures: Examples
[Diagram: Homogeneous multicore with shared caches and cross bar – cores with private L1 caches, shared L2/L3 caches, I/O, and two memory modules (Memory Module 1, Memory Module 2)]
[Diagram: Heterogeneous multicore with caches, local store and ring bus – one core (2x SMT) with L1/L2 cache, cores with local stores, a memory module, and I/O]
Shared Cache Design
[Diagram: Traditional design – multiple single-cores with shared cache off-chip: each core has a private L1 and a switch to an off-chip shared L2 and memory]
[Diagram: Multicore architecture – shared caches on-chip: cores with private L1 caches and an on-chip switch to a shared on-chip L2, connected to memory]
Shared Cache Design
[Diagram: Multicore architecture with shared caches on-chip – four cores with private L1 caches connected through a switch to a shared L2 and memory]
Shared Caches: Advantages
• No coherence protocol at shared cache level
• Lower communication latency
• Processors with overlapping working sets
• One processor may prefetch data for the other
• Smaller cache size needed
• Better usage of loaded cache lines before eviction (spatial locality)
• Less congestion on limited memory connection
• Dynamic sharing
• if one processor needs less space, the other can use more
• Avoidance of false sharing (see the sketch below)
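A C sketch of the false-sharing point (not from the slides; the 64-byte line size and the padding are generic assumptions). With private per-core caches, two counters on the same line bounce between cores; a cache shared by both cores avoids that traffic, and padding avoids it even with private caches:

#include <pthread.h>
#include <stdio.h>

#define LINE 64                     /* assumed cache-line size in bytes */
#define ITERS 10000000L

/* Pad each counter to its own cache line so the two threads never
 * write to the same line (no false sharing). */
struct padded_counter {
    volatile long value;
    char pad[LINE - sizeof(long)];
};

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;       /* each thread touches only its own line */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}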
Shared Caches: Disadvantages
• Multiple CPUs → higher requirements
• higher bandwidth
• Cache should be larger (larger → higher latency)
• Hit latency higher due to switch logic above cache
• Design more complex
• One CPU can evict data of other CPU
Multicore Processors
• SUN
• UltraSparc IV / IV+
– dual core
– 2x multithreaded per core
• UltraSparc T1 (Niagara):
– 8 cores
– 4x multithreaded per core
– one FPU for all cores
– low power
• UltraSparc T2 (Niagara 2)
Intel Itanium 2 Dual Core - Montecito
• Two Itanium 2 cores
• Multi-threading (2 Threads)
– Simultaneous multi-threading for memory hierarchy resources
– Temporal multi-threading for core resources
– Besides the end of a time slice, an event, typically an L3 cache
miss, can trigger a thread switch.
• Caches
– L1D 16 KB, L1I 16 KB
– L2D 256 KB, L2I 1 MB
– L3 9 MB
• Caches private to cores
• 1.7 billion transistors
Itanium 2 Dual Core
Intel Core Duo
• 2 mobile-optimized execution cores
• No multi-threading
• Cache hierarchy
• Private 32-KB L1I and L1D
• Shared 2 MB L2 cache
• Provides efficient data sharing between both cores
• Power reduction
• Some sleep states can be entered individually by each core
• Deeper Sleep and Enhanced Deeper Sleep states only for the whole die
• Dynamic Cache Sizing feature
– Flushes the entire cache
– This enables Enhanced Deeper Sleep with a lower voltage that does
not guarantee cache integrity
• 151 million transistors
IBM Cell
• IBM, Sony, Toshiba
• Playstation 3 (Q1 2006)
• 256 GFlops
• Only ~30 W at 3 GHz
• Entire PS3 only $300-400
• http://www-128.ibm.com/developerworks/power/library/pa-cellperf
Cell: Architecture
• 9 parallel processors
• Specialized for different tasks
• 1 large PPE - 8 SPEs (Synergistic Processing Elements)
Cell: SPE Synergistic Processing Element
• 128 registers, 128 bit wide
• SIMD
• Single thread
• 256 KByte local memory, not a cache
• DMA engine executes memory transfers
• Simple ISA
• Less functionality to save space
• Limitations can become a problem if memory access is too slow
• 25.6 GFlops single precision for multiply-add operations (worked out below)
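The 25.6 GFlops figure is consistent with the usual SPE parameters (assumptions, not stated on this slide: 3.2 GHz SPE clock, 4-wide single-precision SIMD, a multiply-add counted as 2 flops):

\[
3.2\,\mathrm{GHz} \times 4\ \mathrm{SP\ lanes} \times 2\ \frac{\mathrm{flops}}{\mathrm{multiply\text{-}add}} = 25.6\ \mathrm{GFlop/s\ per\ SPE}
\]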
Intel Westmere EX
• Processor of the fat node of
SuperMUC @ LRZ
• 2.4 GHz
• 9.6 Gflop/s per core
• 96 Gflop/s per socket (see the arithmetic below)
• 10 hyperthreaded cores, i.e.
two logical cores each
• Caches
• 32 KB L1 private
• 256 KB L2 private
• 30 MB L3 shared
• 2.9 billion transistors
• Xeon E7-4870 (2.4 GHz, 10 cores, 30 MByte L3)
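The peak numbers are consistent with 4 double-precision flops per cycle per core (an assumption, not on the slide: 2-wide SSE with one add and one multiply issued per cycle):

\[
2.4\,\mathrm{GHz} \times 4\ \frac{\mathrm{flops}}{\mathrm{cycle}} = 9.6\ \mathrm{GFlop/s\ per\ core},
\qquad 10 \times 9.6 = 96\ \mathrm{GFlop/s\ per\ socket}
\]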
NUMA
• On-chip NUMA
• L3 Cache organized in 10 slices
• Interconnection via a bidirectional ring bus
• 10-way physical address hashing to avoid hot spots; the L3 can
handle five parallel cache requests per clock cycle
• Mapping algorithm is not known, no migration support
• Off-chip NUMA
• Glueless combination of up to 8 sockets into SMP
• 4 Quick Path Interconnect (QPI) interfaces
• 2 on-chip memory controllers
Cache Coherency
• Cbox
• Connects core to ring bus and one L3 cache bank
• Responsible for processor read/write/writeback and external
snoops, and returning cached data to core and QuickPath
agents.
• Distribution of physical addresses is determined by hash
function
• Sbox
• Caching Agent
• Each Sbox is associated with 5 Cboxes
Cache Coherency
• Bbox
• Home agent
• Responsible for cache coherency of the cache lines in its memory;
keeps track of Cbox replies to coherence messages.
• Directory Assisted Snoopy (DAS)
• Keeps state per cache line (I: idle, no remote sharers; R: may be
present on a remote socket; E/D: owned by the I/O hub)
• If a line is in the I state, it can be forwarded without waiting for
snoop replies (toy sketch below).
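A toy C sketch of that DAS decision (illustration only; the real Bbox logic is hardware and not published, and all names here are made up):

#include <stdio.h>

/* Toy model of the per-line DAS states. */
enum das_state { DAS_I, DAS_R, DAS_ED };  /* Idle / Remote / owned by I/O hub */

/* Must a request for a line in local memory wait for snoop replies
 * from other sockets before the data can be forwarded? */
static int needs_snoop_replies(enum das_state s)
{
    /* I: no remote sharers -> forward immediately.
     * R or E/D: a remote socket or the I/O hub may hold the line. */
    return s != DAS_I;
}

int main(void)
{
    printf("I:   wait for snoops? %d\n", needs_snoop_replies(DAS_I));
    printf("R:   wait for snoops? %d\n", needs_snoop_replies(DAS_R));
    printf("E/D: wait for snoops? %d\n", needs_snoop_replies(DAS_ED));
    return 0;
}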
Summary
• High frequency → high power consumption
• Trend towards multiple cores on chip
• Broad spectrum of designs: homogeneous,
heterogeneous, specialized, general purpose, number
of cores, cache architectures, local memories,
simultaneous multithreading, …
• Problem: memory latency and bandwidth