Memory Speculation of the Blue Gene/Q Compute Chip
Martin Ohmacht / IBM Blue Gene Team
© 2011 IBM Corporation
Acknowledgements
 Blue Gene/Q is currently under development by IBM and is not yet generally available.
 The IBM Blue Gene/Q development teams are located in
– Yorktown Heights NY,
– Rochester MN,
– Hopewell Jct NY,
– Burlington VT,
– Austin TX,
– San Jose CA,
– Bromont QC,
– Toronto ON,
– Boeblingen (FRG),
– Haifa (Israel),
– Hursley (UK).
 Columbia University
 University of Edinburgh
 The Blue Gene/Q project has been supported by Argonne National Laboratory and the Lawrence
Livermore National Laboratory on behalf of the United States Department of Energy, under Lawrence
Livermore National Laboratory Subcontract No. B554331
10/10/2011
Blue Gene/Q system objectives
 Massively parallel supercomputing systems
– Focusing on large scale scientific and analytics applications
– Broadening to applications with commercial / industrial impact
– Laying groundwork for Exascale computing
 Reduce total cost of ownership
– Power efficient chips
 power/cooling efficiency
 dense packaging
 floor space efficiency
– Reliability
• Long MTBF for large installations
Chip design objectives:
 optimize FLOPS/Watt
 optimize redundancy / ECC usage / SER sensitivity
BlueGene/Q Compute chip
System-on-a-Chip design : integrates processors,
memory and networking logic into a single chip
 360 mm² Cu-45 technology (SOI)
– ~ 1.47 B transistors
 16 user + 1 service processors
– plus 1 redundant processor
– all processors are symmetric
– each 4-way multi-threaded
– 64-bit PowerISA™
– 1.6 GHz
– L1 I/D cache = 16kB/16kB
– L1 prefetch engines
– each processor has Quad FPU (4-wide double precision, SIMD)
– peak performance 204.8 GFLOPS@55W
 Central shared L2 cache: 32 MB
– eDRAM
– multiversioned cache will support transactional memory, speculative execution
– supports atomic ops
 Dual memory controller
– 16 GB external DDR3 memory
– 1.33 Gb/s
– 2 * 16 byte-wide interface (+ECC)
 Chip-to-chip networking
– Router logic integrated into BQC chip.
 External IO
– PCIe Gen2 interface
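The peak performance figure on this slide can be reproduced from the other numbers given. The sketch below assumes that a fused multiply-add counts as 2 floating point operations and that only the 16 user processors are counted; the variable names are illustrative only.

```python
# Peak floating point rate from the figures on this slide.
USER_CORES = 16        # user processors (service and redundant cores not counted)
SIMD_WIDTH = 4         # Quad FPU: 4-wide double precision
FLOPS_PER_FMA = 2      # assumption: one fused multiply-add = 2 flops
CLOCK_GHZ = 1.6
CHIP_POWER_W = 55

peak_gflops = USER_CORES * SIMD_WIDTH * FLOPS_PER_FMA * CLOCK_GHZ
peak_gflops_per_watt = peak_gflops / CHIP_POWER_W

print(f"peak: {peak_gflops:.1f} GFLOPS")
print(f"peak efficiency at chip level: {peak_gflops_per_watt:.2f} GFLOPS/W")
```

The per-watt value here is a chip-level peak ratio, not a measured system efficiency.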
BG/Q Processor Unit
 A2 processor core
– Mostly same design as in PowerEN™ chip
– Implements 64-bit PowerISA™
– Optimized for aggregate throughput:
• 4-way simultaneously multi-threaded (SMT)
• 2-way concurrent issue: 1 XU (br/int/l/s) + 1 FPU
• in-order dispatch, execution, completion
– L1 I/D cache = 16kB/16kB
– 32x4x64-bit GPR
– Dynamic branch prediction
– 1.6 GHz @ 0.8V
 Quad FPU
– 4 double precision pipelines, usable as:
• scalar FPU
• 4-wide FPU SIMD
• 2-wide complex arithmetic SIMD
– Instruction extensions to PowerISA
– 6 stage pipeline
– 2W4R register file (2 * 2W2R) per pipe
– 8 concurrent floating point ops (FMA)
+ load + store
– Permute instructions to reorganize vector data
• supports a multitude of data alignments
QPU: Quad FPU
BlueGene/Q PUnit -- continued
 L1 prefetcher
– Normal mode: Stream Prefetching
• in response to observed memory traffic, adaptively balances
resources to prefetch L2 cache lines (@ 128 B wide)
• from 16 streams x 2 deep through 4 streams x 8 deep
– Additional: 4 List-based Prefetching engines:
• One per thread
• Activated by program directives,
e.g. bracketing complex set of loops
• Used for repeated memory reference patterns
in arbitrarily long code segments
• Record pattern on first iteration of loop;
playback for subsequent iterations
• On subsequent passes, list is adaptively refined
for missing or extra cache misses (async events)
[Figure: list-based “perfect” prefetching has tolerance for missing or extra cache misses. The L1 miss addresses are matched against the list addresses to produce the prefetched addresses.]
 Wake-up unit
– Will allow SMT threads to be suspended, while waiting for an event
– Lighter weight than wake-up-on-interrupt -- no context switching
– Improves power efficiency and resource utilization
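The record and playback idea behind list-based prefetching can be sketched in software. This is a simplified model, not the hardware algorithm: the matching window, the prefetch depth and the function name `playback` are assumptions made for illustration.

```python
def playback(recorded, misses, window=4, depth=2):
    """Replay a recorded miss list against observed L1 misses.

    Returns the addresses that would be prefetched, in order.
    """
    pos = 0            # index of the next list entry expected to miss
    prefetched = []
    for addr in misses:
        # Look a few entries ahead so that a list entry which did not
        # miss this time (a "missing" miss) can be skipped.
        ahead = recorded[pos:pos + window]
        if addr not in ahead:
            continue   # "extra" miss that is not in the list: ignore it
        pos += ahead.index(addr) + 1
        for nxt in recorded[pos:pos + depth]:
            if nxt not in prefetched:
                prefetched.append(nxt)
    return prefetched


recorded = ["a", "b", "c", "d", "e", "f"]   # pattern from the first iteration
observed = ["a", "x", "b", "d", "e"]        # later pass: extra "x", missing "c"
print(playback(recorded, observed))
```

In this model the extra miss `x` is ignored and the missing miss `c` is skipped once `d` is seen, so the playback stays in step with the list.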
Crossbar switch
 Central connection structure between
– PUnits (L1-prefetchers)
– L2 cache
– Networking logic
– Various low-bandwidth units
 Half frequency (800 MHz) clock grid
 3 separate switches:
– Request traffic
– Response traffic
– Invalidate traffic
-- write bandwidth 12B/PUnit @ 800 MHz (under simultaneous reads)
-- read bandwidth 32B/PUnit @ 800 MHz
 22 master ports
– PUnits
– DevBus master -- PCIe
– Network logic ports -- Remote DMA
 18 slave ports
– 16 L2 slices
– DevBus slave -- PCIe, boot / messaging RAM, performance counters, …
– Network logic -- injection, reception
 Peak on-chip bisection bandwidth 563 GB/s
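One way to arrive at the quoted bisection bandwidth is shown below. The slide does not state how the figure is derived; the sketch assumes it counts 16 PUnits, each reading 32B and writing 12B per 800 MHz crossbar cycle, which happens to match the quoted number.

```python
XBAR_CLOCK_GHZ = 0.8   # half-frequency clock grid
READ_BYTES = 32        # per PUnit per crossbar cycle
WRITE_BYTES = 12       # per PUnit per crossbar cycle, under simultaneous reads
PUNITS = 16            # assumption: only the 16 user PUnits are counted

per_punit_gbs = (READ_BYTES + WRITE_BYTES) * XBAR_CLOCK_GHZ
total_gbs = PUNITS * per_punit_gbs

print(f"per PUnit: {per_punit_gbs:.1f} GB/s")
print(f"aggregate: {total_gbs:.1f} GB/s")
```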
L2 structure
 32 MB / 16 way set-associative / 128B line size
 Point of coherency
 Organization:
– 16 slices @ 2MB each
– each slice contains 8 * 2.25 Mb eDRAMs (data+ECC)
plus directory SRAMs, buffers, control logic.
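The organization figures above are consistent with each other, as the following arithmetic sketch shows. It assumes binary units (1 MB = 2^20 bytes, 1 Mb = 2^20 bits) and that the sets are divided evenly among the slices; neither is stated on the slide.

```python
MIB = 1024 * 1024
SLICES = 16
SLICE_BYTES = 2 * MIB
WAYS = 16
LINE_BYTES = 128
EDRAMS_PER_SLICE = 8
EDRAM_MBIT = 2.25

total_bytes = SLICES * SLICE_BYTES
sets_total = total_bytes // (WAYS * LINE_BYTES)
sets_per_slice = sets_total // SLICES          # assumption: even split

raw_mbit = EDRAMS_PER_SLICE * EDRAM_MBIT       # data + ECC per slice
data_mbit = SLICE_BYTES * 8 / MIB              # data only per slice
ecc_overhead = raw_mbit / data_mbit - 1

print(total_bytes // MIB, "MB total,", sets_total, "sets,", sets_per_slice, "sets per slice")
print(f"{raw_mbit} Mb raw vs {data_mbit} Mb data per slice, ECC overhead {ecc_overhead:.3f}")
```

An overhead of 0.125 corresponds to one check bit for every 8 data bits.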
[Figure: L2 block diagram. The crossbar switch and DEVBUS connect to 16 L2 slices, 4 L2_counter units and L2_central; the slices feed two memory controllers.]
Speculation
– Code section containing side effects is executed
– Side effects are either
a) made permanent (and visible to other threads), or
b) removed, and a state is restored as if the section was never executed
– A successful execution implies atomicity of main memory updates in most proposals
[Figure: a speculative section ends either in a commit, which makes its side effects permanent and visible, or, after an event or conflict, in an invalidation, which reverts the side effects of the section.]
BG/Q implementation features:
 Hardware support for speculative state in memory, not for core registers
– Responsibility of software to save and restore state of live registers if necessary
 Support for multiple modes:
– ordered transactions:
• Final value of multiple speculative writes resolved by hardware based on thread
order
• Forwarding: Younger threads can observe changes caused by older threads while
both are speculative
• Younger threads need to be invalidated if older thread is invalidated
=> Speculative Execution
– unordered transactions:
• Access to state also accessed by another transaction is considered a conflict if at
least one speculative write is involved
• Threads can commit in any order
=> Transactional Memory
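The two rules above can be illustrated with a small software model. It is a sketch only: the `Txn` class and both function names are hypothetical, addresses are tracked as plain sets rather than directory entries, and the ordered-mode helper conservatively assumes that every younger thread is invalidated.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    reads: set = field(default_factory=set)    # addresses read speculatively
    writes: set = field(default_factory=set)   # addresses written speculatively


def conflicts(a: Txn, b: Txn) -> bool:
    """Unordered mode: common location with at least one speculative write."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)


def to_invalidate(oldest_first, failed):
    """Ordered mode: the failed thread and all younger threads are invalidated."""
    return oldest_first[oldest_first.index(failed):]


t1 = Txn(reads={0x100}, writes={0x200})
t2 = Txn(reads={0x100, 0x200})
t3 = Txn(reads={0x100})
print(conflicts(t1, t2))   # t2 read a location that t1 wrote
print(conflicts(t1, t3))   # the shared location is only read
print(to_invalidate(["T0", "T1", "T2", "T3"], "T1"))
```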
Rollback / fine grain checkpointing
 Snapshot interrupt driven at timer interval
or by user call
– Live registers or architected core state
saved
– Changes of previous calculation
interval committed
– New speculative calculation interval
started
 Detected error event causes rollback
– Changes since last snapshot
invalidated
– Core restored to last saved state
– New speculative calculation interval
started
 Detection of I/O using speculative data
causes commit
– No rollback capability until next
snapshot
– Temporary vulnerability to SER events
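The snapshot, rollback and forced-commit behaviour described above can be modelled as a small state holder. This is a software sketch under simplifying assumptions: memory is a dictionary, register state is ignored, and the class and method names are hypothetical.

```python
class CheckpointedState:
    def __init__(self, memory):
        self.committed = dict(memory)   # state as of the last snapshot
        self.speculative = {}           # changes since the last snapshot
        self.can_roll_back = True

    def write(self, addr, value):
        # After a forced commit, writes are no longer held back.
        target = self.speculative if self.can_roll_back else self.committed
        target[addr] = value

    def read(self, addr):
        return self.speculative.get(addr, self.committed.get(addr))

    def snapshot(self):
        """Commit the previous interval and start a new speculative one."""
        self.committed.update(self.speculative)
        self.speculative = {}
        self.can_roll_back = True

    def rollback(self):
        """Detected error: drop all changes since the last snapshot."""
        if not self.can_roll_back:
            raise RuntimeError("no rollback capability until next snapshot")
        self.speculative = {}

    def io_uses_speculative_data(self):
        """I/O consumed speculative data: commit now, lose rollback."""
        self.committed.update(self.speculative)
        self.speculative = {}
        self.can_roll_back = False


state = CheckpointedState({"x": 1})
state.write("x", 2)
state.rollback()                   # x is 1 again
state.write("x", 3)
state.snapshot()                   # x = 3 becomes permanent
state.write("x", 4)
state.io_uses_speculative_data()   # forced commit, rollback disabled
print(state.read("x"), state.can_roll_back)
```

The window between the forced commit and the next snapshot is the temporary vulnerability to SER events mentioned above.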
[Figure: timeline for threads 0 to 67. Calculation intervals are separated by snapshots; an SER event causes a rollback to the last snapshot.]
Speculation IDs
 Accesses of speculative threads distinguished by short IDs, tagged onto memory access requests by the cores
 L2 directory entries marked with IDs to enable multiversioning and conflict detection
 IDs are in one of four states: Available, Speculative, Invalid or Committed
 State table lookup used when matching tags, allowing instant invalidates/commits by state table change
 Directory entries are cleaned up if they contain IDs in committed or invalid state
– Removal of lines written using ID in invalid state
– Merge of lines written using ID in committed state
– clean-up on access and by background scrub
 Counter set keeps track of number of references in the directory
– When counter reaches zero, ID can be reused for new speculation
[Figure: ID state diagram with states Avail, Spec, Invalid and Committed and transitions Allocation, Abort, Commit and Reclaim.]
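The ID life cycle can be written down as a small state machine. The four states and the transition names (Allocation, Abort, Commit, Reclaim) come from this slide; which state each transition starts from is an assumption, as are the class names and the single reference counter per ID.

```python
from enum import Enum


class State(Enum):
    AVAILABLE = "available"
    SPECULATIVE = "speculative"
    INVALID = "invalid"
    COMMITTED = "committed"


# Assumed transition table: (event, current state) -> next state
TRANSITIONS = {
    ("allocate", State.AVAILABLE): State.SPECULATIVE,
    ("abort", State.SPECULATIVE): State.INVALID,
    ("commit", State.SPECULATIVE): State.COMMITTED,
    ("reclaim", State.INVALID): State.AVAILABLE,
    ("reclaim", State.COMMITTED): State.AVAILABLE,
}


class SpecId:
    def __init__(self):
        self.state = State.AVAILABLE
        self.refs = 0   # directory entries still tagged with this ID

    def step(self, event):
        if event == "reclaim" and self.refs != 0:
            raise ValueError("directory still references this ID")
        key = (event, self.state)
        if key not in TRANSITIONS:
            raise ValueError(f"{event} not allowed in state {self.state.name}")
        self.state = TRANSITIONS[key]


sid = SpecId()
sid.step("allocate")
sid.refs = 2            # two lines written under this ID
sid.step("commit")      # a state table change; directory entries untouched
sid.refs = 0            # clean-up merged both lines
sid.step("reclaim")
print(sid.state.name)
```

Because commit and abort only change the state table entry, they complete without walking the directory; the counter then gates reuse of the ID.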
Ordering
 BG/Q provides 128 IDs, numbered 0 to 127
 Numeric value of ID determines order
– Generally, larger values indicate younger threads
 ID value space wraps
– Allocation pointer used in comparisons
 ID space can be divided into up to 16 groups (domains):
– Each domain provides independent allocation pointer, allowing independent ordering
– Each domain can be assigned a different mode, allowing concurrent ordered and
unordered transactions
 IDs can be assigned to any hardware thread
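Because the ID value space wraps, a plain numeric comparison gives the wrong answer near the wrap point. The sketch below shows one possible comparison relative to the allocation pointer; it assumes the pointer holds the next ID to be allocated, which the slide does not specify, and the function names are hypothetical.

```python
NUM_IDS = 128


def age(spec_id, alloc_ptr, num_ids=NUM_IDS):
    """Distance behind the allocation pointer; a larger value means older."""
    return (alloc_ptr - spec_id) % num_ids


def is_younger(a, b, alloc_ptr):
    """True if ID a was allocated after ID b."""
    return age(a, alloc_ptr) < age(b, alloc_ptr)


# IDs 126, 127, 0, 1 were allocated in that order; the pointer is now at 2.
alloc_ptr = 2
print([age(i, alloc_ptr) for i in (126, 127, 0, 1)])
print(is_younger(1, 126, alloc_ptr))   # ID 1 is younger although 1 < 126
```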
Implementation
 Up to 15 ways of each directory set can be used for speculative data
– Up to 30MB speculative space
– Number of speculative ways protected from eviction configurable
 Write accesses recorded down to 8B granularity, reusing coherence directory space
 Read accesses recorded with dynamically determined granularity, 8B best case
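The capacity and granularity figures above follow from the L2 parameters given earlier (32 MB, 16 ways, 128B lines). A short arithmetic sketch, assuming binary megabytes and one write-tracking flag per 8B granule (the flag encoding itself is not described on the slide):

```python
MIB = 1024 * 1024
L2_BYTES = 32 * MIB
WAYS = 16
MAX_SPEC_WAYS = 15
LINE_BYTES = 128
WRITE_GRANULE_BYTES = 8

max_spec_bytes = L2_BYTES * MAX_SPEC_WAYS // WAYS
write_granules_per_line = LINE_BYTES // WRITE_GRANULE_BYTES

print(max_spec_bytes // MIB, "MB speculative space")
print(write_granules_per_line, "write granules tracked per line")
```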
L2 cache -- continued
 Atomic operations
– Can be invoked on any 64-bit word in memory
– Atomic operation type is selected by unused physical address bits
– Set of 16 operations, including fetch-and-increment, store-add, store-XOR, etc.
– Some operations access multiple adjacent locations, e.g., fetch-and-increment-bounded
– Low latency even under high contention
• avoids L2-to-PU roundtrip cycles of lwarx/stwcx -- “queue locking”
 s/w operations: locking, barriers
 efficient work queue management, with multiple producers and consumers
 efficient inter-core messaging
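To illustrate how a bounded fetch-and-increment supports work queues with several consumers, here is a software stand-in. The semantics are an assumption: the counter at `addr` is compared with a bound in the adjacent 64-bit word at `addr + 8`, and a marker value is returned instead of incrementing when they are equal. The class `L2AtomicModel`, the `FAIL` value and the lock are modelling devices, not the hardware interface.

```python
import threading

FAIL = 1 << 63   # assumed marker for "counter has reached its bound"


class L2AtomicModel:
    """Software stand-in for atomic operations executed at the L2."""

    def __init__(self):
        self._lock = threading.Lock()   # models the single point of coherency
        self.mem = {}

    def fetch_and_increment_bounded(self, addr):
        with self._lock:
            counter, bound = self.mem[addr], self.mem[addr + 8]
            if counter == bound:
                return FAIL
            self.mem[addr] = counter + 1
            return counter


def drain(l2, addr, out):
    """Consumer: claim queue slots until the queue is empty."""
    while True:
        slot = l2.fetch_and_increment_bounded(addr)
        if slot == FAIL:
            return
        out.append(slot)


l2 = L2AtomicModel()
l2.mem[0], l2.mem[8] = 0, 10    # head counter and its bound (tail)
claimed = []
workers = [threading.Thread(target=drain, args=(l2, 0, claimed)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(claimed))
```

Each slot is claimed by exactly one consumer with a single operation, without the retry loop that a lwarx/stwcx sequence needs under contention.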
[Figure: Barrier speed using different synchronizing hardware. Y axis: number of processor cycles (0 to 16000). X axis: number of threads (ticks at 1, 10, 100). Series: atomic: no-invalidates; atomic: invalidates; lwarx/stwcx.]
DDR3 memory interface
 L2 cache misses are handled by dual on-chip DDR3 memory controllers
– each memory controller interfaces with 8 L2 slices
 Interface width to external DDR3 is 2 * (16B + ECC)
– aggregate peak bandwidth is 42.7 GB/s for DDR3-1333.
 Designed to support multiple density/rank/speed configurations
– currently configured with 16GB DDR3-1333
– DRAM chips and BQC chip soldered onto same card
• eliminates connector reliability issues
• reduces driver and termination power
 Extensive ECC capability on 64B basis:
– Double symbol error correct / triple detect (symbol = 2 bits * 4 beats)
– Retry
– Partial or full chip kill
 DDR3 PHY
– integrated IO blocks: 8 bit data + strobe; 12/16 bit address/command
– integrates IOs with delay lines (deskew), calibration, impedance tuning, …
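The aggregate peak bandwidth quoted above can be checked from the interface width and the DDR3-1333 transfer rate. The sketch assumes 1333 1/3 million transfers per second per data pin and decimal gigabytes; ECC bits are not counted as payload.

```python
CONTROLLERS = 2
DATA_BYTES_PER_TRANSFER = 16           # per controller, excluding ECC
TRANSFERS_PER_SECOND = 4000e6 / 3      # DDR3-1333: about 1333.3 MT/s

peak_gbs = CONTROLLERS * DATA_BYTES_PER_TRANSFER * TRANSFERS_PER_SECOND / 1e9
symbol_bits = 2 * 4                    # ECC symbol: 2 bits * 4 beats

print(f"aggregate peak bandwidth: {peak_gbs:.1f} GB/s")
print(f"ECC symbol size: {symbol_bits} bits")
```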
Networking logic
 Communication ports:
– 11 bidirectional chip-to-chip links @ 2GB/s
• Implemented with High Speed Serial (HSS) cores
– 2 links can be used for PCIe Gen2 x8
 On-chip networking logic
– Implements 14-port router
– Designed to support point-to-point, collective and barrier messages
– Integrated floating point and fixed point arithmetic, bit-wise operations
– Integrated DMA: connects network to on-chip memory system
 With these hardware assists, most aspects of messaging will be handled autonomously
 Minimal disturbance of PUnits
The 17th Core
 Assistant to the 16 user cores
– Designed to handle Operating System services
– Planned usage:
• Offload interrupt handling
• Asynchronous I/O completion
• Messaging assist, e.g. MPI pacing
• Offload RAS Event handling
 Reduces O/S noise and jitter on the user cores
 Will help user applications to run predictably / reproducibly
Redundancy – the 18th core
[Figure: scan chain arrangement for CORE 0 to CORE 17 and the nest logic. Labels: fences; fanout, 3 SI; 18x3=54 SO into OPMISR; nest logic, 62 (31 SI + 31 SO); test mode control.]
Scan chain arrangement designed for simple determination
of PUnit logic fails at manufacturing test
Physical-to-Logical mapping of PUnits in presence of a fail
[Figure: physical processor core IDs 0, 1, ..., N mapped to logical processor core IDs 0, 1, ..., N-1. The failed physical core (X) is shut down; all other cores are used.]
 Inspired by array redundancy
 PUnit N+1 redundancy scheme designed to increase yield of large chip
 Redundancy can be invoked at any manufacturing test stage
– wafer, module, card, system
 Redundancy info will travel with physical part -- stored on chip (eFuse) / on card (EEPROM)
– at power-on, info transmitted to PUnits, memory system, etc.
 Single part number flow
 Will be transparent to user software: user will see N consecutive good processor cores.
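The remapping that hides a failed PUnit from software can be sketched as a simple table. This is an illustration only: the function name is hypothetical, and the default of treating the last physical core as the unused spare when nothing has failed is an assumption.

```python
def logical_to_physical(num_physical, failed=None):
    """Return a list where index = logical core ID, value = physical core ID."""
    if failed is None:
        # Assumption: with no failure, the last physical core is the spare.
        failed = num_physical - 1
    if not 0 <= failed < num_physical:
        raise ValueError("failed core out of range")
    return [p for p in range(num_physical) if p != failed]


mapping = logical_to_physical(18, failed=5)
print(len(mapping), "logical cores")
print(mapping[4], mapping[5], mapping[16])   # cores past the failure shift by one
```

Software only ever sees the logical IDs, which stay consecutive regardless of which physical core was shut down.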
Blue Gene/Q packaging hierarchy
1. Chip: 16 cores
2. Module: Single Chip
3. Compute Card: One single chip module, 16 GB DDR3 Memory
4. Node Card: 32 Compute Cards, Optical Modules, Link Chips, Torus
5a. Midplane: 16 Node Cards
5b. I/O Drawer: 8 I/O Cards, 8 PCIe Gen2 slots
6. Rack: 2 Midplanes, 1, 2 or 4 I/O Drawers
7. System: 20PF/s
Ref: SC2010
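The packaging numbers can be multiplied out to see what the system-level figure implies. The sketch assumes that the 20PF/s figure is a peak rate and reuses the 204.8 GFLOPS per-chip peak from the compute chip slide; the resulting rack count is an inference, not a number stated on this slide.

```python
import math

COMPUTE_CARDS_PER_NODE_CARD = 32
NODE_CARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2
USER_CORES_PER_CHIP = 16
PEAK_GFLOPS_PER_CHIP = 204.8
SYSTEM_PFLOPS = 20                     # assumption: peak

chips_per_rack = (COMPUTE_CARDS_PER_NODE_CARD
                  * NODE_CARDS_PER_MIDPLANE
                  * MIDPLANES_PER_RACK)
cores_per_rack = chips_per_rack * USER_CORES_PER_CHIP
rack_tflops = chips_per_rack * PEAK_GFLOPS_PER_CHIP / 1000
racks_needed = math.ceil(SYSTEM_PFLOPS * 1000 / rack_tflops)

print(chips_per_rack, "compute chips and", cores_per_rack, "user cores per rack")
print(f"{rack_tflops:.1f} TFLOPS peak per rack, {racks_needed} racks for {SYSTEM_PFLOPS} PF/s")
```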
Design Challenges
 Area is the enemy
– 16 processor cores (A2 + QPU + L1P) + 1 helper core + 1 redundant spare
– enough cache per core / per thread
– high bandwidth to/from cache and to external memory
– high speed communication
– leads to LARGE chip: 18.96x18.96 mm
 redundant processor core will significantly help yield
 Power is the enemy
– SOC design (processors, memory, network logic) reduces chip-to-chip crossings
– 2.4 GHz PowerEN™ core design is run at reduced speed (1.6 GHz), reduced voltage (~0.8V)
• reduced voltage will reduce both active power and leakage power
• speed binning → all chips run @ 1.6 GHz, with voltage adjusted to match speed sort.
– Deployed methodologies/tools to keep power down
• Architecture/logic level: clock gating
• Processor cores: re-tuned for low power
• Power-aware synthesis; power-recovery steps
• Physical design: power-efficient clock networks
 Soft Errors are the enemy
– Sensitivity to SER events will affect reliability for large installations – such as BlueGene/Q
– Design provides redundancy for data protection:
• DDR3, L2 cache, network, all major arrays and buses ECC protected
• Minor buses, GPRs, FPRs: parity protected, with recovery
• Stacked / DICE latches
Design Challenges -- continued
And the enemy is us…
 Methodology Complexity
– Processor cores originated in a high-speed custom design methodology
– Rest of the chip implemented as ASIC
 Required merging of different clocking/latching, timing and test methodologies
 Logic verification
– On-chip memory sub-system (transactional memory, speculative execution)
– Full-chip POR sequence, X-state (… inherited “proven” logic)
 Extensive use of cycle simulation / hardware accelerators / FPGA emulator
 Test pattern generation
– Again, mixed chip / mixed methodologies
– Full chip models
– turn-around time is becoming a bottleneck
Conclusions
 The Blue Gene/Q Compute chip will be the building block for a power-efficient supercomputing system
that will be able to scale to tens of PetaFLOPS.
 Hardware
– BQC will introduce architectural innovations to enable multithreaded / multicore computing
– Cache structure designed to support Speculative Execution and Transactional Memory
– On-chip networking logic will allow dense, high-bandwidth chip-to-chip interconnect,
with hardware assist for collective functions and RDMA
– Designed to achieve over 200 GFLOPS peak in a power-efficient fashion
 2.1 GFLOPS/W Linpack performance -- #1 in Green500 June 2011
 Software
– Processors are homogeneous, implement standard PowerISA (plus SIMD extensions)
• Compilers will be available that leverage the on-chip hardware assists for multithreading
– Plan to support open standards:
• Parallel processing: MPI
• Multi-threading: OpenMP
… and will allow for many other programming models
 Applications:
– are in bring-up / scale-up
Disclaimer
IBM Corporation
Integrated Marketing Communications, Systems & Technology Group
Route 100
Somers, NY 10589
Produced in the United States of America
07/15/2011
All Rights Reserved
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole
discretion. Information regarding potential future products is intended to outline our general product direction and it should
not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future
products may not be incorporated into any contract. The development, release, and timing of any future features or
functionality described for our products remains at our sole discretion.
Some information in this document addresses anticipated future capabilities. Such information is not intended as a definitive
statement of a commitment to specific levels of performance, function or delivery schedules with respect to any future
products. Such commitments are only made in IBM product announcements. The information is presented here to
communicate IBM’s current investment and development activities as a good faith effort to help with our customers’ future
planning.
All performance information was determined in a controlled environment. Actual results may vary. Performance information
is provided “AS IS” and no warranties or guarantees are expressed or implied by IBM.
IBM does not warrant that the information offered herein will meet your requirements or those of your distributors or
customers. IBM provides this information “AS IS” without warranty. IBM disclaims all warranties, express or implied,
including the implied warranties of noninfringement, merchantability and fitness for a particular purpose.
Disclaimer
References in this publication to IBM products or services do not imply that IBM intends to make them
available in every country in which IBM operates. Consult your local IBM business contact for information
on the products, features, and services available in your area.
Blue Gene, Blue Gene/Q and PowerEN are trademarks or registered trademarks of IBM Corporation in the
United States, other countries or both.
PowerISA and Power Architecture are trademarks or registered trademarks in the United States, other
countries, or both, licensed through Power.org
Linux is a registered trademark of Linus Torvalds.
Tivoli is a registered trademark of Tivoli Systems Inc. in the United States, other countries or both.
UNIX is a registered trademark in the United States and other countries, licensed exclusively through The
Open Group.
Other trademarks and registered trademarks are the properties of their respective companies.
Photographs shown are of engineering prototypes. Changes may be incorporated in production models.
This equipment is subject to all applicable FCC rules and will comply with them upon delivery.
IBM makes no representations or warranties, expressed or implied, regarding non-IBM products and
services.