

ISSCC 2022 / SESSION 2 / PROCESSORS / OVERVIEW
Session 2 Overview: Processors
DIGITAL ARCHITECTURES AND SYSTEMS SUBCOMMITTEE
Session Chair: Hugh Mair
MediaTek, Austin, TX
Session Co-Chair: Shidhartha Das
Arm, Cambridge, United Kingdom
Mainstream high-performance processors take center stage in this year’s conference, with next-generation architectures detailed for x86 and Power™
processors by Intel, AMD, and IBM. In addition to mainstream compute, the session features groundbreaking work in parallel/array compute, leading off with
the massive performance and integration of Intel’s Ponte Vecchio, while a multi-die approach to reconfigurable compute from researchers at UCLA features an
ultra-high-density die-to-die interface. Mobile processing also marks a milestone this year with the introduction of the ARMv9 ISA into flagship smartphones.
2.1
8:30 AM
Ponte Vecchio: A Multi-Tile 3D Stacked Processor for Exascale Computing
Wilfred Gomes, Intel, Portland, OR
In Paper 2.1, Intel details the Ponte Vecchio platform for next-generation data center processing, integrating 47 tiles from 5 different
process nodes into a single package, including 16 5nm compute tiles. 45TFLOPS of sustained FP32 vector processing is demonstrated
alongside 5TB/s of memory fabric bandwidth and >2TB/s of aggregate memory and scale-out bandwidth.
2.2
8:40 AM
Sapphire Rapids: The Next-Generation Intel Xeon Scalable Processor
Nevine Nassif, Intel, Hudson, MA
In Paper 2.2, Intel’s next-generation Xeon Scalable processor, which uses a quasi-monolithic approach to integration in Intel 7, is presented.
The 2×2 die array features an ultra-high-bandwidth multi-die fabric IO with 10TB/s of total die-to-die bandwidth across the 20 interfaces,
while maintaining a low 0.5pJ/b energy consumption.
2.3
8:50 AM
IBM Telum: A 16-Core 5+ GHz DCM
Ofer Geva, IBM Systems and Technology, Poughkeepsie, NY
In Paper 2.3, IBM’s Z-series processor advances into 7nm technology, featuring many architectural improvements to best leverage this
class of CMOS technology. The processor uses a large 32MB L2 cache per core to create large virtual L3 and L4 caches, and a fully synchronous
interface, operating at half the CPU clock, connects the two co-packaged 530mm² die.
2.4
9:00 AM
POWER10TM: A 16-Core SMT8 Server Processor with 2TB/s Off-Chip Bandwidth in 7nm Technology
Rahul M. Rao, IBM, Bengaluru, India
In Paper 2.4, IBM describes a 7nm 16-core Power10™ processor, featuring a series of architectural, design and implementation
improvements to ensure continued performance gains. The 602mm2 die features an impressive 2TB/s bandwidth aggregated across
chip-to-chip, DRAM and PCIe interfaces.
2.5
9:10 AM
A 5nm 3.4GHz Tri-Gear ARMv9 CPU Subsystem in a Fully Integrated 5G Flagship Mobile SoC
Ashish Nayak, MediaTek, Austin, TX
In Paper 2.5, MediaTek unveils its first ARMv9 CPUs for flagship mobile applications, featuring a 3.4GHz maximum clock rate and
a tri-gear CPU subsystem with out-of-order CPUs for both the mid- and high-performance gears. Manufactured in 5nm, the high-performance
gear achieves a 27% performance uplift vs. the mid-gear through resource scaling and implementation methodologies.
2.6
9:20 AM
A 16nm 785GMACs/J 784-Core Digital Signal Processor Array with a Multilayer Switch Box Interconnect,
Assembled as a 2×2 Dielet with 10µm-Pitch Inter-Dielet I/O for Runtime Multi-Program Reconfiguration
Uneeb Rathore, University of California, Los Angeles, CA
In Paper 2.6, UCLA demonstrates a 2×2 multi-die reconfigurable processor with a multi-layer switch-box interconnect and ultra-high-density
multi-die interfacing. Fabricated in 16nm and utilizing a silicon interposer, the die-to-die IO features a 10μm bump pitch, with
each IO circuit occupying 137μm² and consuming 0.38pJ/b of energy.
2.7
9:30 AM
Zen3: The AMD 2nd-Generation 7nm x86-64 Microprocessor Core
Thomas Burd, AMD, Santa Clara, CA
In Paper 2.7, AMD discusses the micro-architectural features of “Zen 3”, providing a unique 7nm-to-7nm (same-node) power and
performance comparison to the prior generation. The 68mm² 8-Core Complex with 32MB L3 cache achieves a 19% core IPC
improvement through architectural enhancements, coupled with a 6% frequency improvement, yielding up to a 20% increase in power efficiency.
ISSCC 2022 / SESSION 2 / PROCESSORS / 2.1
2.1  Ponte Vecchio: A Multi-Tile 3D Stacked Processor for Exascale Computing
Wilfred Gomes1, Altug Koker2, Pat Stover3, Doug Ingerly1, Scott Siers2,
Srikrishnan Venkataraman4, Chris Pelto1, Tejas Shah5, Amreesh Rao2,
Frank O’Mahony1, Eric Karl1, Lance Cheney2, Iqbal Rajwani2, Hemant Jain4,
Ryan Cortez2, Arun Chandrasekhar4, Basavaraj Kanthi4, Raja Koduri6
1Intel, Portland, OR; 2Intel, Folsom, CA; 3Intel, Chandler, AZ; 4Intel, Bengaluru, India; 5Intel, Austin, TX; 6Intel, Santa Clara, CA
Ponte Vecchio (PVC) is a heterogeneous petaop 3D processor comprising 47 functional
tiles on five process nodes. The tiles are connected with Foveros [1] and EMIB [2] to
operate as a single monolithic implementation, enabling a scalable class of Exascale
supercomputers. The PVC design contains >100B transistors and is composed of sixteen
TSMC N5 compute tiles and eight Intel 7 random-access, bandwidth-optimized SRAM (RAMBO)
memory tiles, 3D stacked on two Intel 7 Foveros base dies.
Eight HBM2E memory tiles and two TSMC N7 SerDes connectivity tiles are connected
to the base dies with 11 dense Embedded Multi-die Interconnect Bridges (EMIBs). SerDes
connectivity provides a high-speed coherent unified fabric for scale-out connectivity
between PVC SoCs. Each tile includes an 8-port switch enabling up to an 8-way fully
connected configuration supporting 90G SerDes links. The SerDes tile supports
load/store, bulk data transfers and synchronization semantics that are critical for scale-up
in HPC and AI applications. A 24-layer (11-2-11) substrate package houses the 3D-stacked
Foveros dies and EMIBs. To handle warpage, low-temperature solder (LTS) was
used for the Flip-Chip Ball Grid Array (FCBGA) design at these die and package sizes.
The foundational processing units of PVC are the compute tiles. The tiles are organized
as two clusters of 8 high-performance cores with distributed caches. Each core contains
8 vector engines processing 512b floating-point/integer operands and 8 matrix engines
with an 8-deep systolic array executing 4096b vector operations/engine/clock. The
compute datapath is fed by a wide load/store unit that fetches 512B/clock from a 512KB
L1 data cache that is software configurable as a scratchpad memory. Each vector engine
achieves throughput of 512/256/256 operations/clock for FP16/FP32/FP64 data formats,
while the matrix engine delivers 2048/4096/4096/8192 ops/clock for
TF32/FP16/BF16/INT8 operands as shown in Fig. 2.1.1.
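As a quick cross-check of the datapath balance implied by these numbers (an illustrative calculation using only the figures quoted above, not a statement about the microarchitecture beyond them):

```python
# Illustrative arithmetic using only the figures quoted above.
l1_fetch_bytes_per_clk  = 512     # load/store unit fetches 512B/clock from L1
vector_operand_bits     = 512     # each vector engine processes 512b operands
vector_engines_per_core = 8

fetch_bits_per_clk   = l1_fetch_bytes_per_clk * 8                     # 4096 b/clock
operand_bits_per_clk = vector_engines_per_core * vector_operand_bits  # 4096 b/clock
assert fetch_bits_per_clk == operand_bits_per_clk  # one fresh 512b operand per engine per clock
```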
The two 646mm² base dies provide a communication network for the stacked tiles and
include SoC infrastructure modules such as memory controllers, fully integrated
voltage regulators (FIVR) [3], power management and 16 PCIe Gen5/CXL host interface
lanes. The base dies are fabricated in a 17-metal Intel 7 [4] process enhanced for Foveros,
which includes through-silicon vias (TSV) in a high-resistivity substrate.
Compute and memory tiles are stacked face-to-face on top of the base dies using a dense
array of 36μm-pitch micro bumps. This dense pitch provides high assembly yield, high
power bump density and current capacity, and ~2× higher signal density compared to
the 50μm bump pitch used in Lakefield [5]. Power TSVs through the base die are built
as 1×2, 2×1, 2×2, 2×3 and 2×4 arrays within a single C4 bump shadow. Die-to-die
routing and power delivery uses two top-level copper metals with 1μm and 4μm pitch
thick metal layers [6]. Each base die connects to four HBM2E tiles and a SerDes tile using
a 55μm pitch EMIB. The cross-sectional details of the 3D stacked base die and EMIB are
shown in Fig. 2.1.2.
The base tile also contains a 144MB L3 cache, called the Memory Fabric (MF), with a
complex geometric topology operating at 4096B/cycle to support the distributed caches
located under the shadow of the compute tile cores. The L3 cache is a large storage that
backs up various L1 caches inside the Core. It is organized as multiple independent banks
each of which can perform one 64B read/write operation/clock. To maximize the cache
storage that can be supported on PVC, the distributed cache architecture splits the total
cache between the base die and RAMBO tiles. A modular L3 bank design allows the data
array to be placed at arbitrary distances behind a FIFO interface. This allows the pipeline
to tolerate large propagation latencies. With this organization, the data array can be placed
on a separate tile above the bank logic that resides on the base die. Each RAMBO tile
contains 4 full banks of 3.75MB, providing 15MB per tile, with 60MB distributed among
four RAMBO tiles per base tile (Fig. 2.1.3).
The base tile connects the compute tiles and RAMBO tiles using a 3D stacked die-to-die
link called Foveros Die Interconnect (FDI). Transmitter (Tx) and receiver (Rx) circuits of
this interface are powered by the compute tile rail. Level-shifters on the base tile convert
to the supply voltage within the asynchronous interface. The CMOS link has a 3:2 bus-width compression and can run at a variable ratio from 1.0-1.5× relative to the base clock
to balance speed and energy efficiency. After traversing the FDI link, signals are
decompressed back to full width in the destination Rx domain. The FDI link is organized
as eight groups, with each group consisting of 800 lanes per compute tile. Each group
uses common clocking with phase compensation on the base die to correct for variation
between base and compute tiles. This necessitates a base-to-compute tile clock and a
return clock going back to the base die to enable clock compensation.
FDI cells are clumped together to limit lead-way routes to <300μm. 33% of the micro bumps
are reserved for robust power delivery to tile circuitry located under the micro-bump
field. Data lanes are bundled to minimize the clock power overhead. The die-to-die
channel includes capacitance from the micro-bump and relatively small ESD diodes. The
coupling capacitance from the six adjacent micro bumps is <5% of the total interconnect
capacitance. The energy efficiency of FDI is 0.2pJ/b, and the entire link can run at up to
2.8GT/s. A multiple-shorted spine structure enables a 20% reduction in skew. The
clocking scheme for the die-to-die interface is shown in Fig. 2.1.4. Independent voltage
planes, along with frequency separation between the compute tiles and the base die,
allow each compute die and the base die to be targeted to its optimal voltage and frequency.
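The per-tile link figures implied by the quoted lane count, rate and energy can be tallied with a short back-of-envelope calculation; it assumes, for illustration only, that every lane carries payload at the peak rate, which overstates usable bandwidth since some lanes carry clocks and spares:

```python
# Back-of-envelope FDI figures from the quantities quoted above (illustrative).
groups            = 8
lanes_per_group   = 800
peak_rate_gtps    = 2.8     # up to 2.8 GT/s per lane
energy_pj_per_bit = 0.2

lanes    = groups * lanes_per_group                          # 6400 lanes per compute tile
raw_tbps = lanes * peak_rate_gtps / 1e3                      # ~17.9 Tb/s raw per tile
power_w  = lanes * peak_rate_gtps * energy_pj_per_bit / 1e3  # GT/s * pJ/b = mW per lane
print(f"{raw_tbps:.1f} Tb/s raw, ~{power_w:.1f} W of FDI link power per tile")
```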
Power delivery is implemented with 3D-stacked fully integrated voltage regulators
(FIVRs) located on the base die. 3D-stacked FIVRs enable high-bandwidth, fine-grained
control over multiple voltage domains and reduce input current by ~60% and I²R losses
by 85%. The power-density challenges of 3D stacking, along with degraded inductor Q
factor caused by scaled core footprints, required a new in-package substrate inductor
technology called Coaxial Magnetic Integrated Inductor (CoaxMIL) [5] that provides
high-Q inductors with a reduced footprint (Fig. 2.1.5). This technology improves regulator
efficiency by 3% relative to air-core inductors. High-density metal-insulator-metal (MIM)
capacitors on the base die reduce the first supply droop and augment power delivery to the
top dies. FIVRs on the base die deliver up to 300W per base die into a 0.7V supply. The
effective input resistance of 0.15mΩ for both base dies minimizes I²R loss. Efficiency of
this 3D-stacked FIVR implementation compares favorably to a monolithic solution, with
just a <1% gap (Fig. 2.1.5). The top-die shadows (TDS) need to be managed for both IR
drop and current return; the process and the power grid were co-optimized to enable
power delivery for this product (Fig. 2.1.5).
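Two of the quoted figures can be tied together with simple arithmetic; the numbers below use only values stated above:

```python
# Illustrative arithmetic using only the quoted figures.
p_out_per_base_die_w = 300.0
v_out                = 0.7
i_out = p_out_per_base_die_w / v_out          # ~429 A delivered per base die

input_current_reduction = 0.60                # "~60%" lower input current
i2r_reduction = 1 - (1 - input_current_reduction) ** 2
print(f"{i_out:.0f} A output, I^2R loss cut by {i2r_reduction:.0%}")  # ~84%, consistent
                                                                      # with the quoted 85%
```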
There are two external-facing high-speed interfaces included in the Ponte Vecchio SoC:
a 16-lane PCIe Gen5/CXL 32GT/s interface on the base die and an 8×4-lane SerDes
connectivity tile. The PCIe Gen5 interface is used to connect PVC as a host interface to
the CPU. The high-speed signals for this interface are connected to the package through
the base die TSVs and the impact of the TSVs on the channel insertion and return loss
is included as part of the link budgeting. The TSVs in this Intel 7 Foveros base die
technology minimize high-frequency loss for high-speed I/Os as described in [6]. The
TSVs contribute ~0.3dB to the channel insertion loss at 16GHz.
Thermal management poses significant challenges in a 3D-stacked design. Several
strategies were used to meet the 600W target for high-end platforms (Fig. 2.1.6). Thick
interconnect layers in the base and compute tiles act as lateral heat spreaders. High
micro-bump density is maintained over potential hotspots to compensate for reduced
thermal spreading in a thin-die stack. High array density of power TSVs is used to reduce
C4 bump temperature. Compute tile thickness is increased to 160μm to improve thermal
mass for turbo performance. In addition to the 47 functional tiles, there are 16 additional
thermal shield dies stacked to provide a thermal solution over exposed base die area to
conduct heat. Backside metallization (BSM) with solder thermal interface material (TIM)
is applied on all the top dies. The TIM eliminates air gaps caused by different die stack
heights to reduce thermal resistance. We also add BSM to the HBM and SerDes dies and
enable a solder TIM. The thermal solution and the results are shown in Fig. 2.1.6.
As die-to-die FDI micro-bump pitch (36μm) is denser than the available high-volume
wafer probe pitch, the probed micro-bumps were snapped to a more relaxed pitch.
Comprehensive test of each tile precedes the assembly of each design to ensure high
yield before packaging the parts and attaching the HBM die.
Initial PVC silicon provides 45TFLOPS of sustained vector FP32 performance, 5TB/s of
sustained memory fabric bandwidth and >2TB/s of aggregate memory and scale-out
bandwidth. PVC achieves ResNet-50 inference throughput of >43K images/s, with
training throughput reaching 3400 images/s.
References:
[1] D. Ingerly et al., “Foveros: 3D Integration and the use of Face-to-Face Stacking for Logic Devices,” IEDM, pp. 19.6.1-19.6.4, 2019.
[2] R. Mahajan et al., “Embedded Multi-die Interconnect Bridge (EMIB) — A High Density, High Bandwidth Packaging Interconnect,” IEEE Trans. on Components, Packaging and Manufacturing Tech., vol. 9, no. 10, pp. 1952-1962, 2019.
[3] E. A. Burton et al., “Fully Integrated Voltage Regulators on 4th Generation Intel® Core™ SoCs,” IEEE Applied Power Electronics Conf., pp. 432-439, 2014.
[4] C. Auth et al., “A 10nm High Performance and Low-Power CMOS Technology,” IEDM, pp. 29.1.1-29.1.4, 2017.
[5] K. Bharath et al., “Integrated Voltage Regulator Efficiency Improvement Using Coaxial Magnetic Composite Core Inductors,” IEEE Elec. Components and Tech. Conf., pp. 1286-1292, 2021.
[6] W. Gomes et al., “Lakefield and Mobility Compute: A 3D Stacked 10nm and 22FFL Hybrid Processor System in 12×12mm², 1mm Package-on-Package,” ISSCC, pp. 144-145, 2020.
Figure 2.1.1: 3D and 2D system partitioning with Foveros and EMIB on PVC.
Figure 2.1.2: Process details for Foveros and EMIB.
Figure 2.1.3: Base die cache design, RAMBO and die-to-die connectivity.
Figure 2.1.4: PVC clocking, die-to-die IO and comparisons.
Figure 2.1.5: Power delivery, voltage drop for compute and base tiles.
Figure 2.1.6: Thermal solutions for PVC.
Figure 2.1.7: Ponte Vecchio chip photographs and key attributes.
ISSCC 2022 / SESSION 2 / PROCESSORS / 2.2
2.2  Sapphire Rapids: The Next-Generation Intel Xeon Scalable Processor
Nevine Nassif1, Ashley O. Munch1, Carleton L. Molnar1, Gerald Pasdast2,
Sitaraman V. Iyer2, Zibing Yang1, Oscar Mendoza1, Mark Huddart1,
Srikrishnan Venkataraman3, Sireesha Kandula1, Rafi Marom4,
Alexandra M. Kern1, Bill Bowhill1, David R. Mulvihill5, Srikanth Nimmagadda3,
Varma Kalidindi1, Jonathan Krause1, Mohammad M. Haq1, Roopali Sharma1,
Kevin Duda5
1Intel, Hudson, MA; 2Intel, Santa Clara, CA; 3Intel, Bangalore, India; 4Intel, Haifa, Israel; 5Intel, Fort Collins, CO
Sapphire Rapids (SPR) is the next-generation Xeon® Processor with increased core
count, greater than 100MB shared L3 cache, 8 DDR5 channels, 32GT/s PCIe/CXL lanes,
16GT/s UPI lanes and integrated accelerators supporting cryptography, compression
and data streaming. The processor is made up of 4 die (Fig. 2.2.7) manufactured on Intel
7 process technology which features dual-poly-pitch SuperFin (SF) transistors with
performance enhancements beyond 10SF, >25% additional MIM density over SuperMIM
and a metal stack with a 400nm pitch routing layer optimized for global interconnects.
This layer achieves ~30% delay reduction at the same signal density and is key to
achieving the required latency. The core provides better performance via a programmable
power-management controller. New features include Intel Advanced Matrix
Extensions (AMX), a matrix-multiplication capability that accelerates AI workloads, and
new virtualization technologies that address emerging workloads.
Server workloads benefit from high core counts and an increasing number of IO lanes to
deliver the performance demanded by customers, but are limited by die size, and yield
constraints favor smaller die. The concept of a quasi-monolithic SoC is therefore introduced: 4
interconnected die with an aggregate area beyond the reticle limit, implementing an
equivalent monolithic die (Fig. 2.2.1).
Each die is built as a 6×4 array, where each slot is populated by one modular component.
The on-die coherent fabric (CF) is used to provide low latency, high bandwidth (BW)
communication. Each modular component (core/LLC, memory controller, IO, or
accelerator complex) contains an agent which provides access to the CF. On-die voltage
regulators (FIVR) [1] are arranged in horizontal strips. The SoC uses two die designs, mirrors
of each other, arranged in a 2×2 matrix connected by 10 Embedded Multi-Die
Interconnect Bridges (EMIB) [2].
Multi-Die Fabric IO (MDFIO) is introduced as a new ultra-high-BW, low-latency, low-power
(0.5pJ/b) die-to-die (D2D) interconnect that extends the CF
across multiple die. This requires handling on-die dynamic voltage and frequency scaling
(DVFS) (from 800MHz to 2.5GHz) and carrying the full fabric BW (500GB/s per crossing)
across 20 crossings, equal to over 10TB/s aggregate D2D BW. The raw-wire BER spec
was set to 1e-27 to achieve a FIT of <1 and avoid the need for in-line correction or replay,
allowing a <10ns round-trip latency. A highly parallel 1.6-to-5.0GT/s signaling scheme
was chosen to meet all requirements.
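The aggregate D2D figures follow directly from the per-crossing numbers; the estimate of total link power below additionally assumes, for illustration, that all 20 crossings run at full bandwidth simultaneously:

```python
# Illustrative MDFIO aggregate arithmetic from the quoted per-crossing figures.
per_crossing_gb_per_s = 500        # 500 GB/s of fabric BW per crossing
crossings             = 20
energy_pj_per_bit     = 0.5

aggregate_tb_per_s = per_crossing_gb_per_s * crossings / 1e3   # 10 TB/s, as stated
link_power_w = aggregate_tb_per_s * 1e12 * 8 * energy_pj_per_bit * 1e-12
print(aggregate_tb_per_s, link_power_w)   # 10 TB/s -> ~40 W at 0.5 pJ/b if fully loaded
```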
The MDFIO PHY architecture is shown in Fig. 2.2.3. It uses source-synchronous clocking with a DLL
and PIs for training on the Rx side, a low-swing double-data-rate N/N transmitter for
reduced power with AC-DBI for power-supply-noise reduction, and a StrongARM latch receiver
with VREF training and offset cancellation connected directly to the Rx bump. The PHY
is trained at several operating frequency points, and the resulting DLL and PI trained
values are locally stored to handle fast transitions during DVFS events.
Most of the traffic across this interface is source synchronous and retimed with an
asynchronous FIFO, and does not require validation by timing tools. However, various
debug and test-related fabrics utilize asynchronous buffers for both data
and clock. These interfaces are validated by one of two models: a cross-die timing model,
or a single-die loopback model (Fig. 2.2.2). These models comprehend the Tx/Rx analog
circuit delays, EMIB electrical parasitics and cross-die variation.
While disaggregation helps with overall yield, SPR also employs extensive block
repair/recovery methods to further increase good die per wafer. SPR extends techniques
used in past products [3] for core and cache recovery to uncore and IO blocks. Repair
techniques use redundant circuits to enable a block with a defect to be returned to full
functionality. MDFIO implements lane repair to recover defects pre- and post-assembly.
Block-level recovery takes advantage of the modular die, as well as redundant placements
in the socket. Unused PCIe blocks in the south die enable IO recovery through positional
defeaturing. In total, 74% of the die is recoverable.
Test and debug solutions provide a way to test and debug die individually or assembled
with reuse of test patterns and debug hooks. A combination of muxes and IO signals
allow the test and debug fabrics to operate within a die or utilize the EMIB to enable a
D2D data-path on assembled parts. A parallel test interface is accessible via DDR
channels and a single die provides scan data, via packetized test data, to access IPs on
an assembled 4-die part. Parallel test of re-instantiated IPs on all four die is supported
with a single set of patterns in tester memory along with test failure information to
support recovery. A parallel trace interface is available via dedicated GPIO to allow access
to debug data from any of the 4 individual dies. JTAG is implemented as a single controller
for a system view, but as four individual controllers for high-volume manufacturing access.
SPR uses a combination of FIVRs and motherboard VRs optimized for power
management. The choice of voltage-regulator type was determined by the maximum current,
the physical spread of the domain and the fastest switching events. To decouple the
challenges of power delivery with large switching current impacting sensitive IO, high-power
noisy digital-logic FIVRs and sensitive low-power analog IO FIVRs were sourced
by separate MBVRs. FIVR was also used to deliver two quiet analog power rails to MDFIO
to enable double pumping and reduce die perimeter. To combat the challenge of a
shrinking core footprint, coaxial magnetic integrated inductors [4] are used, resulting in
high-Q inductors with smaller footprint and increased FIVR efficiency.
Power optimization techniques are employed to achieve high performance and
reliability. For high-speed interconnect, distributed inline datapath layout is implemented
including optimized driver placement/sizing with customized full-metal-stack routing and
low-resistive via laddering. In conjunction, recombinant multi-source clock distribution
is used across the high-speed uncore clock domains to achieve low insertion delay and
skew. Multiple approaches including optimized multiple poly-pitch libraries, low-leakage
device maximization, vector sequential optimization, automated clock gating, multi-power
domain partitioning, and selective power gating are employed. Protection of vulnerable
architecture states is increased via ECC coverage, end-to-end parity and soft-error-resilient sequentials.
SPR supports DDR5 memory technology with eight channels, each supporting two DIMMs
per channel, allowing SPR to deliver twice the bandwidth of the prior generation. The
DDR PHY achieves 50% higher maximum data rate by adopting a new IO architecture
and advanced signal integrity enablers. Half-rate analog and digital clocking are
introduced, along with pseudo differential wide-range DLLs for optimal power and
performance. The DDR5 receiver (Rx) has unmatched data (DQ) and strobe (DQs) paths,
and its new 4-tap Decision Feedback Equalizer (DFE) enables better channel ISI
compensation (Fig. 2.2.4). On-die local voltage regulators provide quiet supplies to the
clocking circuits minimizing the power-noise jitter for READ and WRITE operations. In
addition, voltage-mode linear equalization with two post-cursor taps is implemented in
the full-rate command transmitters. Periodic DQ-DQs retraining helps mitigate the jitter
due to the different temperature drift on the unmatched DQ and DQs paths. The 8-channel
interface is split into four independent and identical IP instances, each with two channels.
Every channel has two 40b sub-channels on the north and south sections of the IP, and
command, control, clock signals and PVT compensation circuitry located in the middle.
Each die contains 2.5-to-32Gb/s NRZ transceiver lanes used for PCIe and UPI® links. In
the transceiver, the transmitter clock path is sourced by two LCPLLs. Meeting the LCPLL
reference clock jitter requirements was challenging given board and package crosstalk,
EMI, and skew. A differential reference clock distribution using specialized clock buffers
allows all transceivers within the die to share a reference clock pin, enabling optimum
placement of the reference clock pin away from serial IO and DDR aggressors within the
package, socket and board (Fig. 2.2.5). Transmitters include a 3-tap FIR equalizer, a
power-efficient voltage-mode driver with T-coils to cancel ESD capacitance, and a low-latency
serialization pipeline. The receiver front-end (Fig. 2.2.6) consists of an input
pi-coil network to optimize insertion and return loss, followed by a passive low-frequency
zero-pole pair. Gain and boost are provided by a 3-stage active continuous-time linear
equalizer (CTLE) with inductive shunt peaking. Two-way interleaved data and edge
summers feed into respective data and edge double-tail samplers with combined offset-
and reference-voltage-cancellation differential pairs. 12-tap data and 4-tap edge DFE
currents also feed into the respective summing nodes. A four-stage differential ring
oscillator with active-inductor load generates 8-to-16GHz clocks used by the samplers
after appropriate dividers. The oscillator employs RC filtering to suppress bias and
power-supply noise, eliminating the need for a dedicated voltage regulator. The design features
extensive digital calibration to cancel device mismatch. Digitally synthesized logic and
firmware running on a micro-controller are used to adapt and optimize the receiver gain,
equalization, offset and DFE across channels with up to 37dB of loss at 16GHz. Eight data lanes and
a shared common lane consume 6.48pJ/b running at 32Gb/s and occupy 2.27mm².
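The quoted energy efficiency implies a straightforward per-lane power figure (an illustrative estimate; the 6.48pJ/b number already amortizes the shared common lane over the eight data lanes):

```python
# Illustrative power estimate from the quoted efficiency and data rate.
energy_pj_per_bit = 6.48
rate_gbps         = 32
data_lanes        = 8

per_lane_mw = energy_pj_per_bit * rate_gbps        # pJ/b * Gb/s = mW -> ~207 mW/lane
total_w     = per_lane_mw * data_lanes / 1e3       # ~1.66 W for the 8-lane bundle
print(f"~{per_lane_mw:.0f} mW/lane, ~{total_w:.2f} W total")
```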
Acknowledgement:
The authors thank all the teams that have worked relentlessly on this product.
References:
[1] E. A. Burton et al., “FIVR – Fully Integrated Voltage Regulators on 4th Generation Intel Core SoCs,” IEEE APEC, pp. 432-439, 2014.
[2] https://www.intel.in/content/www/in/en/silicon-innovations/6-pillars/emib.html
[3] S. M. Tam et al., “SkyLake-SP: A 14nm 28-Core Xeon® Processor,” ISSCC, pp. 34-35, 2018.
[4] K. Bharath et al., “Integrated Voltage Regulator Efficiency Improvement Using Coaxial Magnetic Composite Core Inductors,” IEEE Elec. Components and Tech. Conf., pp. 1286-1292, 2021.
Figure 2.2.1: Die floorplan in 2×2 quasi-monolithic configuration. EMIB highlighted at die-to-die interfaces.
Figure 2.2.2: A cross-die timing model, or a single-die loopback model.
Figure 2.2.3: MDFIO PHY architecture.
Figure 2.2.4: SPR DDR5 DQ Rx and 4-tap DFE.
Figure 2.2.5: A differential reference clock distribution allows all transceivers within
the die to share a reference clock pin.
Figure 2.2.6: SPR PCIe Gen5 receiver front end.
Figure 2.2.7: Die photo of left and right die arranged in 2×2 quasi-monolithic
configuration. EMIB placement highlighted in blue.
ISSCC 2022 / SESSION 2 / PROCESSORS / 2.3
2.3  IBM Telum: A 16-Core 5+ GHz DCM
Ofer Geva1, Chris Berry1, Robert Sonnelitter1, David Wolpert1, Adam Collura1, Thomas Strach2, Di Phan1, Cedric Lichtenau2, Alper Buyuktosunoglu3, Hubert Harrer2, Jeffrey Zitz1, Chad Marquart1, Douglas Malone1, Tobias Webel2, Adam Jatkowski1, John Isakson4, Dina Hamid1, Mark Cichanowski4, Michael Romain1, Faisal Hasan4, Kevin Williams1, Jesse Surprise1, Chris Cavitt1, Mark Cohen1
1IBM Systems and Technology, Poughkeepsie, NY; 2IBM Systems and Technology, Boeblingen, Germany; 3IBM Research, Yorktown Heights, NY; 4IBM Systems and Technology, Austin, TX
IBM “Telum”, the latest microprocessor for the next-generation IBM Z system, has been
designed to improve performance, system capacity and security over the previous
enterprise system [1]. The system topology has changed from a two-design strategy,
featuring one central system-controller chip (SC) and four central-processor (CP) chips
per drawer, to a one-design strategy featuring distributed cache management across
four dual-chip modules (DCM) per drawer, each consisting of two processor chips for
both core function and system control. The CP die size is 530mm² in 7nm bulk
technology [5][6]. The system contains up to 32 CPs in a four-drawer configuration (Fig.
2.3.1). Each CP (die photo shown in Fig. 2.3.2) contains 22B transistors, operates at
over 5GHz and comprises 8 cores, each with a 128KB L1 instruction cache, a 128KB
L1 data cache and a 32MB L2 cache. Chip interfaces include 2 PCIe Gen4 interfaces, an
M-BUS interface to the other CP on the same DCM and 6 X-BUS interfaces connecting
to other CP chips on the drawer. 6 of the 8 CPs in each drawer have an A-BUS
connection to the other drawers in the system.
The transition from the previous system 14nm SOI process [2] to the 7nm bulk process
drove significant changes to the physical design capabilities and technology enablement
tooling. Key electrical scaling challenges were reduced maximum operating voltage,
increased BEOL RC delay and power grid IR loss. Additional significant changes were
the loss of deep-trench memory cells (EDRAM) and decoupling capacitors (DTCAP),
conversion from radial BITIEs to per-row substrate/NW contacts and new antenna and
latch-up avoidance infrastructure. New challenges in EUV and design rule constraints
drove new shape-based and cell-based fill tooling to support cross-hierarchical and multilayer density challenges. Many of these updates further complicated concurrent
hierarchical design capabilities, prompting new design methodologies as well as
additional design constraints and methodology-checking infrastructure.
Migrating to a bulk 7nm process and losing the EDRAM technology leveraged by previous
designs drove a new cache microarchitecture and a single chip design. The new structure
of the CP chip consists of eight processor cores, each with a 32MB private L2 cache
which are connected by a dual on-chip ring that provides >320 GB/s of bandwidth (Fig.
2.3.3). A fully populated drawer has a total of 64 physical processor cores and L2
caches. The 64 independent physical caches work together to act as a multi-level shared
victim cache that provides 256MB of on-chip virtual L3 per chip and 2GB of virtual L4 across up
to 8 chips. The 8 chips are fully connected, which provides substantially more aggregate
drawer bandwidth than predecessor designs. It also reduces the average L1 miss latency
compared to the shared, inclusive cache hierarchy on the previous Z15 system, and when
combined with the larger L2 and more effective drawer cache utilization enables “Telum”
to significantly improve overall system performance.
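The virtual-cache capacities quoted above follow from the per-core L2 size, as the short check below shows (illustrative arithmetic only):

```python
# Capacity arithmetic behind the virtual L3/L4 figures (illustrative).
cores_per_chip   = 8
l2_per_core_mb   = 32
chips_per_drawer = 8

virtual_l3_per_chip_mb   = cores_per_chip * l2_per_core_mb                    # 256 MB
virtual_l4_per_drawer_gb = virtual_l3_per_chip_mb * chips_per_drawer / 1024   # 2 GB
print(virtual_l3_per_chip_mb, virtual_l4_per_drawer_gb)
```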
“Telum” features an on-chip AI accelerator that was designed to support the low-latency,
high-volume, in-transaction AI inferencing requirements of enterprise customers [3].
Instead of spreading the AI compute power across generic wide-vector arithmetic operations
in each core, a single on-chip AI accelerator is accessible by all the cores on the chip. It
consists of high-density multiply-accumulate systolic arrays for matrix multiplication
and convolution computation, as well as specialized activation blocks to efficiently handle
operations like sigmoid or softmax. It provides a total of 6TFLOPS of FP16 compute
performance and is connected to the on-chip ring, reaching up to 200GB/s read/write
bandwidth. Internal data movers and formatters feed the compute engines with a
bandwidth exceeding 600GB/s.
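The ratio of the quoted peak compute rate to the quoted bandwidths can be read as a rough arithmetic-intensity requirement (an illustrative calculation from the stated peaks; real utilization depends on the workload):

```python
# Roofline-style ratios from the quoted peak figures (illustrative).
peak_fp16_tflops       = 6.0
ring_bw_gb_per_s       = 200.0    # read/write bandwidth to the on-chip ring
internal_bw_gb_per_s   = 600.0    # internal data movers/formatters

flops_per_ring_byte     = peak_fp16_tflops * 1e12 / (ring_bw_gb_per_s * 1e9)      # ~30
flops_per_internal_byte = peak_fp16_tflops * 1e12 / (internal_bw_gb_per_s * 1e9)  # ~10
print(flops_per_ring_byte, flops_per_internal_byte)  # work needed per byte to keep the engines busy
```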
Workloads that run on IBM Z generally have large instruction-cache footprints [1]; as a
result, they benefit from massive capacity in the branch-prediction logic tables. Prior
generation predictors used EDRAM technology to help store the large number of
branches. The new branch prediction logic was completely redesigned utilizing an
approach that allowed the branch predictor to dynamically reconfigure itself to adapt to
the number of branches within a line of a given workload while maximizing capacity and
minimizing latency to the downstream core pipeline. This design was partitioned into
four identical quadrants with a semi-inclusive level 1 Branch Target Buffer (BTB1) and
level 2 (BTB2) hierarchy surrounding a central logic complex that predicts up to 24
branches per prediction bundle. A single quadrant’s BTB2 contained 24 dense SRAM
arrays with a capacity of 4× the full-custom SRAMs used in the BTB1, providing a BTB2
capacity of over 3.6Mb. The total capacity of the entire branch predictor’s BTB1 and BTB2
was over 15Mb, which was used to store up to 272,000 branches.
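Dividing the stated total capacity by the number of tracked branches gives a rough average storage cost per branch (illustrative only; real entry formats include tags and metadata and differ between BTB1 and BTB2):

```python
# Rough average storage per tracked branch from the quoted totals (illustrative).
total_capacity_mb_bits = 15        # >15 Mb across BTB1 + BTB2
branches               = 272_000

bits_per_branch = total_capacity_mb_bits * 1024 * 1024 / branches
print(f"~{bits_per_branch:.0f} bits of BTB storage per branch")   # ~58 b
```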
The loss of deep-trench EDRAM technology that drove the nest topology change also
impacted the amount of on-die decap. Deep Trench Capacitors (DTCAP) were formerly
used to achieve a total chip capacitance of more than 30μF to deal with the total chip
leakage and transient currents, both expected to increase in many areas due to the denser
design. To effectively manage chip power noise, an on-chip decoupling scheme was
engineered to replace the previous deep-trench distribution method. The decoupling-capacitor
distribution was strategically implemented across the different areas of the
chip, while maintaining the necessary VDD/VIO decoupling ratio in the IO areas. The
design’s overall on-chip decoupling capacitance increased by ~25% vs. the default decap
distribution (Fig. 2.3.4).
The decrease in overall on-chip decoupling capacitance (vs. previous technologies that
used DTCAPs) caused voltage droop times to be much shorter in very large delta-I
events. Noise simulations were used to improve reaction time of the on-chip throttling
logic [4]. A global Performance Throttle Mechanism (PTM) is augmented with local PTM
for faster reaction to droop events. The local PTM is within the core-centric Digital Droop
Sensor (DDS) and is physically close to the areas where throttling can be induced, acting
more quickly than the global PTM. Controllability and data accuracy were improved
by increasing the number of throttling levels and throttle patterns from 16 to 32 and by
increasing the resolution of the DDS from 12b to 24b.
The DCM (71×79mm2) packages both dies with a 500μm edge-to-edge spacing,
connected by a high-speed M-Bus (3.6088Tb/s or 451.1GB/s for a processor clock
running up to 5.2GHz) interface through the top layers of the laminate, enabling
synchronous operation and effectively allowing the microarchitecture to treat the DCM
as a single chip with 16 physical cores per socket (Fig. 2.3.5).
The M-BUS's tight interconnect-length constraints required an innovative high-performance,
low-latency, power-efficient interface. The interface is fully synchronous,
operating at half the processor clock rate without the use of any serialization and
deserialization, to achieve minimal latency. The interface is implemented as an extremely
wide bus consisting of 1388 single-ended data lanes to compensate for the lack of
serialization. Test drivers and receivers are utilized, as well as redundant lanes to repair
the bus if defective lanes are detected.
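The quoted M-Bus bandwidth can be reconstructed from the lane count and the half-rate clock (illustrative arithmetic using only the stated figures):

```python
# Reconstructing the quoted M-BUS bandwidth (illustrative).
data_lanes    = 1388
cpu_clock_ghz = 5.2
bus_clock_ghz = cpu_clock_ghz / 2        # interface runs at half the processor clock

raw_tbps      = data_lanes * bus_clock_ghz / 1e3   # 3.6088 Tb/s, matching the text
raw_gb_per_s  = raw_tbps * 1e3 / 8                 # ~451.1 GB/s
print(raw_tbps, raw_gb_per_s)
```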
Thermal and mechanical stress/strain management (Fig. 2.3.6) is critical to package
performance with two closely spaced die. The concentrated heat load is managed by
IBM-developed Thermal Interface Materials (TIMs), a single custom copper lid optimized
for heat and mechanical load spreading, and a liquid-cooled cold plate integral to the
socket-level mechanical loading system. Core, I/O and cache temperatures are kept below
critical limits for all projected workloads and further protected from excursions with active
system-level temperature monitoring and core throttling.
Both processors are connected to each other on the board by a 2×18 differential pair
bus running synchronously at four times the processor speed. The board routing
encroached underneath the DCM to satisfy the tight signal integrity requirements due to
the plane openings of the land grid array connector with 4753 IOs (Fig. 2.3.7).
To support both CPs acting in unison, a new PLL clock signal distribution and associated
circuits were employed to minimize clocking uncertainty for the M-Bus interface. A single
PLL from one processor chip sources both chips’ clock grid distributions, eliminating
the effects of long-term drift error associated with multiple PLLs. Each chip is equipped
with specialized circuits that allow for the sampling of each die’s clock grid signal and
de-skewing and aligning the phases for both clock signals.
Acknowledgement:
The authors would like to thank the entire IBM Z team, the IBM EDA team, the IBM
Enterprise Systems Product Engineering team, the IBM Research team, and the Samsung
fabrication team for all their hard work and contributions to the success of this project.
References:
[1] C. Berry et al., “IBM z15: A 12-Core 5.2GHz Microprocessor,” ISSCC, pp. 54-56, 2020.
[2] D. Wolpert et al., “IBM’s Second Generation 14-nm Product, z15,” IEEE JSSC, vol. 56, no. 1, pp. 98-111, 2021.
[3] C. Jacobi, “IBM Z Telum Mainframe Processor,” IEEE Hot Chips Symp., 2021.
[4] T. Webel et al., “Proactive Power Management in IBM z15,” IBM J. of Res. and Dev., pp. 15:1-15:12, 2020.
[5] Samsung 7nm Logic Technology.
[6] W. C. Jeong et al., “True 7nm Platform Technology Featuring Smallest FinFET and Smallest SRAM Cell by EUV, Special Constructs and 3rd Generation Single Diffusion Break,” IEEE Symp. VLSI Tech., pp. 59-60, 2018.
Figure 2.3.1: 32 CPs on a 4 drawer configuration.
Figure 2.3.2: CP Die photo (courtesy of Samsung).
Figure 2.3.3: CP structure and RING.
Figure 2.3.4: VDD Gate Decoupling Capacitance density (pF/mm2).
Figure 2.3.5: Physical view of the high-speed M-BUS interconnect between processor die.
Figure 2.3.6: Chip carrier mechanical strain variation showing peak levels concentrated between dies (mm/mm).
Figure 2.3.7: Single wiring layer board routing example.
ISSCC 2022 / SESSION 2 / PROCESSORS / 2.4
2.4  POWER10™: A 16-Core SMT8 Server Processor with 2TB/s Off-Chip Bandwidth in 7nm Technology
Rahul M. Rao1, Christopher Gonzalez2, Eric Fluhr3, Abraham Mathews3,
Andrew Bianchi3, Daniel Dreps3, David Wolpert4, Eric Lai3, Gerald Strevig3,
Glen Wiedemeier3, Philipp Salz5, Ryan Kruse3
1IBM, Bengaluru, India; 2IBM, Yorktown Heights, NY; 3IBM, Austin, TX; 4IBM, Poughkeepsie, NY; 5IBM, Boeblingen, Germany
The POWER10™ processor, designed for enterprise workloads, contains 16 synchronous
SMT8 cores (Fig. 2.4.1) coupled through a bi-directional high-bandwidth race track
[1][2]. An SMT8 core with its associated cache is called a core chiplet, and a pair of core
chiplets forms a 39.4mm² design tile. Designed in a 7nm bulk technology, the 602mm²
chip (0.85× of POWER9™ [3]) has nearly 18B transistors, 110B vias and 20 miles of on-chip
interconnect distributed across 18 layers of metal: 8 narrow-width layers for short-range
routes, 8 medium-width layers for high-performance signals and two 2160nm
ultra-thick metal (UTM) layers dedicated to power and global clock distribution. There
are 10 input voltages as shown in Fig. 2.4.1: core/cache logic (Vdd), cache arrays (Vcs),
nest logic (Vdn), two PHY voltages (Vio, Vpci), stand-by logic (Vsb), a high-precision
reference voltage (Vref), DPLL voltage (VDPLL), analog circuitry voltage (VAVDD), and an
interface voltage (V3P3). The C4 array contains 24477 total connections (1.25× of [3]):
10867 power, 11879 ground and 1731 signal connections. A core and its associated
L2 cache are power-gated together, while the L3 cache is power-gated independently.
Scaling from 14nm SOI to 7nm bulk technology forced numerous design innovations.
Power grid robustness is bolstered with 19μF (0.4× of [3]) of on-chip capacitance along
with additional sensors for droop detection. The change from SOI’s sparse buried
insulator ties to bulk’s denser nwell/substrate contacts resulted in 32× as many placed
cells consuming 14× as much area as the previous design, driving new algorithms for
cell insertion around large latch clusters and voltage regions. A new hierarchical antenna
infrastructure was created, aware of macro dimensions, pin locations, and multi-sink
nets, with path-aware insertion of wire jumpers and diodes to enable concurrent
hierarchical design. Increased wire parasitics and 1× metal layer rule constraints drove
extensive use of via meshes, particularly on wide-wire timing-critical nets. Place and
route innovations coupled with a 9 tracks-per-bit library cell image improved silicon
utilization and routability without sacrificing performance. SER-resilient latches with
redundant state-saving nodes were used, trading a 2.5× latch size in the <0.5% of
instances that are architecturally critical.
As shown in Fig. 2.4.2, the clock infrastructure uses a pair of redundant reference clocks
with dynamic switch-over capability for RAS. 34 PLL/DPLLs control 90 independent
meshes across the chip. A complex network of 20 differential multiplexers allows a
multitude of reference clocks with varying spread-spectrum and jitter capability to be routed into the
PLL/DPLLs. A single DPLL generates synchronous clocks for the 8 design tiles, each of
which contains four 1:1 resonant meshes (tuned to 2.8GHz), four 2:1 non-resonant
meshes, and a chip-wide nest / fabric mesh. Within each design tile, 4 skew-sensors, 4
skew-adjusts, and 4 programmable delays continuously align all running meshes to the
chip-wide nest mesh within 15ps across the entire voltage range. Another DPLL
generates groups of synchronous Power Accelerator Unit (PAU) clocks, where each PAU
portion can be independently mesh-gated. The clock design can also choose between
an on-chip-generated PCIe reference clock or two off-chip PCIe reference clocks to
further improve its spread-spectrum and jitter. Similar to [3], the programmability of
clock drive strengths, pulse widths, and resonance mode reduces clock power by 18%
over traditional designs.
Four variants of customized SRAM cells constitute over 200MB of on-die memory. A
performance-optimized lower-threshold-voltage (VT) 0.032μm² six-transistor (6T) SRAM
cell with single- and dual-port versions is used in 7 high-speed core arrays and single-port
compilable SRAMs. The dense L2/L3 caches use a leakage-optimized 6T 0.032μm²
cell with dual supply, while a 0.054μm² eight-transistor (8T) SRAM cell is used in two-port
compilable arrays. A larger menu (1.5× of [3]) of different ground rule clean cells is used
in 10 custom plus several compilable multi-port register files, 3 content-addressable
memories (CAM) and synthesized memories. SRAM arrays have optional write-assist
circuitry applying negative bitline boost or local voltage collapse for banked 6T designs
to support operation down to 0.45V. Most of the array peripheral logic (decoder, latches,
IO and test) is synthesized, with structured placement enabling in-context optimization,
logic and latch sharing and simplified custom components [4].
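To put the quoted cell sizes in perspective, the estimate below computes the silicon consumed by raw bit cells alone, under a simplifying assumption of my own, purely for illustration: that all 200MB used the 0.032μm² 6T cell. The real design mixes cell types and adds periphery:

```python
# Illustrative bit-cell-only area estimate (assumes a single cell type for all 200MB).
on_die_memory_mb = 200
cell_area_um2    = 0.032
die_area_mm2     = 602

bits             = on_die_memory_mb * 8 * 1024**2
bitcell_area_mm2 = bits * cell_area_um2 / 1e6        # ~54 mm^2 of raw bit cells
print(f"~{bitcell_area_mm2:.0f} mm^2, ~{bitcell_area_mm2/die_area_mm2:.0%} of the die")
```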
The SMT8 core was optimized for both single thread and overall throughput performance
on enterprise-scale workloads. Core microarchitecture enhancements [5] include 1.5×
instruction cache, 4× L2 cache, and 4× TLB, all with constant or improved latency, larger
branch predictors and significantly increased vector SIMD throughput compared to [3].
The core additionally includes four 512b matrix-multiply assist (MMA) units. The MMA
unit shares the core clock mesh to reduce overall instruction latency, yet has a separate
power domain that is dynamically power-gated off when not in use to optimize energy
efficiency. The core physical design consists of fully abutted physical hierarchies, with
the instruction and execution control units flanking the load store and the arithmetic
units, placed below the MMA (Fig. 2.4.1). Traditional logical unit boundaries were
dissolved, and content was redistributed into large floorplanned blocks. Each resulting
block utilized all metal levels except the UTM layers. This improved methodology
removed 2 levels of physical hierarchy, enabled efficient area and metal usage in the
core, and resulted in 13 synthesized blocks (0.1× of [3]) with an average of 800k nets,
750k cells and a total of 404 hard array instances. Each core chiplet includes 6 digital
thermal sensors (DTS), 2 digital droop sensors (DDS) with associated controllers for
thermal and voltage management, along with 4 process-sensitive ring oscillators (PSRO).
The DTS and DDS are located at high thermal and voltage stress locations in the design
as shown in Fig. 2.4.3, superimposed on a thermal map of a 10-core enabled processor
running a core-heavy workload. Additionally, 14 DTS, 8 high-precision analog thermal
diodes, 16 PSROs (distributed across two voltage rails), 12 skitters for clock jitter
monitoring and a composite array of process monitors are distributed across the rest of
the chip.
The high-bandwidth performance-critical race track is structured through the design tiles
on metal10 and above, enabling improved silicon efficiency and reduced latency. The
synchronous portion of the nest also includes power management, 2 memory
management units, interrupt handling, a compression unit, test infrastructure and system
configuration and control logic. Additionally, 6 accelerator functions, 4 memory
controllers, 2 PCIe host bridge units, data and transaction link logic, all on the VDN rail,
operate asynchronous to the core, while tied to the I/O speeds. These components, 12
of which can be selectively power gated (Fig. 2.4.4), are built in a modular fashion
(identical across the east/west regions), primarily with a metal8 ceiling to prioritize race-track wiring.
POWER10™ features 144 lanes of high-speed serial I/O capable of running at 25-to-32.5Gb/s
at 5pJ/b, supporting the OpenCAPI protocol and the SMP interconnect and providing
585GB/s of bandwidth in each direction, as shown in Fig. 2.4.4. To support up to 30dB
of channel loss, an Rx architecture using a 3-tap decision-feedback equalizer (DFE) with
continuous-time linear equalization (CTLE) and LTE was used, as well as a series-source-terminated
style Tx utilizing 2 taps of feed-forward equalization and duty-cycle-correction
circuitry. A dual-bank architecture on the receiver enables complete recalibration of
analog coefficients during runtime. 16 OpenCAPI Memory Interface (OMI)/DDIMM 8-lane
busses capable of running at 21-to-32Gb/s, designed with the same PHY architecture as
the SMP interconnect, provide 409.6GB/s of bandwidth. 32 lanes of industry-standard
32Gb/s PCIe Gen5-compatible PHYs are implemented with vendor IP. 16 of the 32 lanes
are limited to Gen4, providing a total bandwidth of 96GB/s.
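The headline off-chip bandwidth figures can be cross-checked from the per-lane rates (illustrative; maximum quoted rates, protocol overhead ignored, and the OMI line simply backs out the per-lane rate implied by the stated aggregate):

```python
# Cross-checking the quoted off-chip bandwidth figures (illustrative).
smp_gb_per_s_per_dir  = 144 * 32.5 / 8        # 585 GB/s each direction, as stated
pcie_gen5_gb_per_s    = 16 * 32 / 8           # 16 lanes at 32 Gb/s -> 64 GB/s
pcie_gen4_gb_per_s    = 16 * 16 / 8           # 16 lanes limited to Gen4 -> 32 GB/s
omi_rate_implied_gbps = 409.6 * 8 / (16 * 8)  # -> 25.6 Gb/s per lane, inside the quoted range

print(smp_gb_per_s_per_dir)                          # 585.0
print(pcie_gen5_gb_per_s + pcie_gen4_gb_per_s)       # 96.0 GB/s total PCIe, as stated
print(omi_rate_implied_gbps)
```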
Energy efficiency was significantly improved through micro-architectural and design
changes including improved clock gating and branch prediction, instruction fusion and
reduced memory access [5]. Power consumption of functional areas and components
is illustrated in Fig. 2.4.5. Splitting the VIO domain into multiple islands (Fig. 2.4.1) allows
for system-specific power-supply enablement of only the interfaces that are used. System-specific
supply-voltage modulation keeps nest power below 5% of the total. Leakage
power is maintained at less than 20% via aggressive multi-corner design optimization
and intelligent usage of three different threshold-voltage (VT) logic devices, with less than
3% usage of the fastest device type. A 25% increase in the number of latches connected to each local clock
buffer (through improved library design and placement algorithms), together with less than 3% usage
of high-power latches, keeps the power of a core chiplet's sequential components below
~20%. The clock network consumes ~10% of the total power. These
enhancements enable 65% of the power to be allocated to the core chiplets.
Frequency vs. voltage shmoo is shown in Fig. 2.4.6 for cores within a chip, and across
process splits. The shallower slope at higher voltage and faster process can be attributed
to wire-dominated lowest VT paths being limiters. Frequency is boosted based on
workload [6] up to an all-core product maximum of 4.15GHz. A POWER10™ processor
can be packaged in a single-chip as well as a dual-chip module, enabling up to a maximum
of 256 threads per socket.
References:
[1] Samsung 7nm Technology.
[2] W. C. Jeong et al., “True 7nm Platform Technology Featuring Smallest FinFET and Smallest SRAM Cell by EUV, Special Constructs and 3rd Generation Single Diffusion Break,” IEEE Symp. VLSI Tech., pp. 59-60, 2018.
[3] C. Gonzalez et al., “POWER9™: A Processor Family Optimized for Cognitive Computing with 25Gb/s Accelerator Links and 16Gb/s PCIe Gen4,” ISSCC, pp. 50-51, 2017.
[4] P. Salz et al., “A System of Array Families and Synthesized Soft Arrays for the POWER9™ Processor in 14nm SOI FinFET Technology,” ESSCIRC, pp. 303-307, 2017.
[5] W. Starke and B. Thompto, “IBM’s POWER10™ Processor,” IEEE Hot Chips Symp., 2020.
[6] B. Vanderpool et al., “Deterministic Frequency and Voltage Enhancements on the POWER10™ Processor,” ISSCC, 2022.
Figure 2.4.1: POWER10 floorplan and voltage representation.
Figure 2.4.2: POWER10 clock distribution.
Figure 2.4.3: POWER10 sensors on 10-core thermal map for a core-sensitive workload.
Figure 2.4.4: POWER10 interfaces and power gating.
Figure 2.4.5: Chip power components.
Figure 2.4.6: Frequency sensitivity across voltage and process.
Figure 2.4.7: POWER10 die micrograph (courtesy of Samsung).
ISSCC 2022 / SESSION 2 / PROCESSORS / 2.5
2.5  A 5nm 3.4GHz Tri-Gear ARMv9 CPU Subsystem in a Fully Integrated 5G Flagship Mobile SoC
Ashish Nayak1, HsinChen Chen1, Hugh Mair1, Rolf Lagerquist1, Tao Chen1,
Anand Rajagopalan1, Gordon Gammie1, Ramu Madhavaram1, Madhur Jagota1,
CJ Chung1, Jenny Wiedemeier1, Bala Meera1, Chao-Yang Yeh2, Maverick Lin2,
Curtis Lin2, Vincent Lin2, Jiun Lin2, YS Chen2, Barry Chen2, Cheng-Yuh Wu2,
Ryan ChangChien2, Ray Tzeng2, Kelvin Yang2, Achuta Thippana1, Ericbill Wang2,
SA Hwang2
1MediaTek, Austin, TX; 2MediaTek, Hsinchu, Taiwan
This paper presents a tri-gear ARMv9 CPU subsystem incorporated in a 5G flagship
mobile SoC. Implemented in a 5nm technology node, a 3.4GHz High-Performance (HP)
core is introduced along with circuit and implementation techniques to achieve CPU PPA
targets. A die photograph is shown in Fig. 2.5.7. The SoC integrates a 5G modem
supporting NR sub-6GHz with downlink and uplink speed up to 7.01Gb/s and 2.5Gb/s,
respectively, an ARMv9 CPU subsystem, an ARM Mali G710 GPU for 3D graphics, an
in-house Vision Processing Unit (VPU), and a Deep-Learning Accelerator (DLA) for high-performance
and power-efficient AI processing. The integrated display engine can
provide portrait panel resolution up to QHD+ 21:9 (1600×3360) and frame rates up to
144Hz. Multimedia and imaging subsystems decode 8K video at 30fps, while encoding
4K video at 60fps; camera resolutions up to 320MPixels are supported. LPDDR5-6400/LPDDR5X-7500
memory interfaces facilitate up to 24GB of external SDRAM over
four 16b channels for a peak transfer rate of 0.46Tb/s.
All processor cores in the CPU subsystem incorporate the ARMv9 instruction set with
key architectural advances. Memory Tagging Extension (MTE) enables greater security
by locking data in the memory using a tag which can only be accessed by the correct
key held by the pointer accessing the memory location, as shown in Fig. 2.5.1. Further,
a Scalable Vector Extension 2 (SVE2) allows a scalable vector length in multiples of 128b,
up to 2048b, enabling increased DSP and ML vector-processing capabilities, as shown
in Fig. 2.5.1.
The heterogeneous CPU complex, shown in Fig. 2.5.2, is organized into 3 gears. The 1st
gear is a single HP core which utilizes the ARMv9 Cortex-X2 microarchitecture with 64KB
L1 instruction cache, 64KB L1 data cache, and a 1MB private L2 cache. The 2nd gear
consists of three Balanced Performance (BP) cores utilizing the ARMv9 Cortex-A710
architecture, each with a 64KB L1 instruction cache, a 64KB L1 data cache, and a 512KB
private L2 cache. The 3rd gear features four High Efficiency (HE) ARMv9 Cortex-A510
cores [1], each with a 64KB L1 instruction cache and a 64KB L1 data cache. Further,
the HE CPU cores are implemented in pairs to facilitate the sharing of a 512KB L2 cache
and of floating-point and vector hardware between two CPU cores, improving area and power
efficiency and maintaining full ARMv9 compatibility without sacrificing performance on key
workloads. Finally, an 8MB L3 cache is shared across all the cores of the CPU complex.
The HP core runs up to 3.4GHz clock speed to meet high-speed compute demands, while
the HE cores are optimized to operate efficiently at ultra-low voltage. The BP cores
provide a balance of power and performance for average workloads. Depending on the
dynamic computing needs, workloads can be seamlessly switched and assigned across
different gears of the CPU subsystem enabling maximum power efficiency. Dynamic
voltage and frequency scaling (DVFS) is employed along with adaptive voltage scaling
to adjust operating voltage and frequency. Figure 2.5.1 shows the power efficiency of the CPU subsystem, with the HP core achieving a 27% improvement in single-thread performance over the BP core.
To achieve higher performance, several microarchitectural changes, listed in Fig. 2.5.3,
were adopted across multiple pipeline stages in the HP / Cortex-X2. These
microarchitectural advancements have led to an increase in instance count and silicon
area of the X2 CPU, as shown in Fig. 2.5.3. Due to the increase in instance count, the
implementation adopted a hierarchical approach to achieve an acceptable implementation
turn-around time. The implementation followed the hierarchical strategy shown in Fig.
2.5.3, with two main sub-blocks integrated into the top level. The sub-block shape, pin
assignments, and timing budgets were pushed down from top-level implementation. The
sub-blocks were then implemented stand-alone to meet these requirements. The top-level implementation utilized abstracted timing and physical models, enabling a concurrent approach to sub-block and top-level implementation. While the implementation was hierarchical, final timing sign-off remained flat, ensuring that the interface timing between all blocks was met and eliminating the need for any additional timing margin. A similar flat
approach was adopted for verifying power-grid integrity, layout versus schematic, and
design rule checks.
All cores in the CPU subsystem are implemented using a 210nm tall standard cell library
in TSMC 5nm. The HP core targets ULVT (ultra-low threshold voltage) devices with
minimal use of ELVT (extra-low threshold voltage) devices on critical timing paths to
meet the frequency target. On the other hand, the BP core targets low-leakage ULVT
When the operating condition degrades, such as increased IR-drop, the FLL clock
frequency will be limited to guarantee safe CPU operation. A voltage increase request,
sent to the PMIC, is generated by comparing the FLL output frequency to the PLL input
frequency. Conversely, when operating conditions improve and extra voltage margin is
no longer needed, the ROSC will oscillate at PLL frequency with a fine code higher than
minFC. A voltage decrease request will be sent to the PMIC to reduce the supply voltage
until the FLL is frequency locked while using minFC for the ROSC.
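The decision structure of this loop can be summarized with the short Python sketch below. The step size, the hold case, and the function interface are invented for illustration; only the increase/decrease conditions follow the description above.

def avs_step(fll_freq, pll_freq, fine_code, min_fc, vdd, vdd_step=0.005):
    """One illustrative AVS decision: compare the (minFC-clamped) FLL to the
    PLL target and emit a PMIC request."""
    if fine_code <= min_fc and fll_freq < pll_freq:
        # ROSC clamped at minFC yet still slower than the PLL target:
        # operating conditions have degraded, request more supply voltage.
        return vdd + vdd_step, "increase"
    if fine_code > min_fc:
        # FLL locks to the PLL with fine-code margin to spare: extra voltage
        # margin is no longer needed, request a supply decrease.
        return vdd - vdd_step, "decrease"
    return vdd, "hold"                        # locked at minFC: stay put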
The minFC is derived through a post-silicon binning process using automated test
equipment patterns. By enhancing the same concept of voltage and frequency
relationship adjustment described in [4], this work proposes deriving the optimal ROSC
configuration by sweeping the ROSC code from high (slower) to low (faster) while
various workloads are executed, gradually increasing ROSC frequency until the test
pattern fails. The minFC is then recorded as the optimal ROSC code that can still pass
the silicon test. This exploits the fact that the ROSC and the CPU are subjected to the same supply and temperature variations, removing a large portion of the fixed voltage margin required by the conventional approach. Figure 2.5.6 shows the adaptive voltage scaling design, as well as the silicon results. This AVS technique provides a 4% supply-voltage reduction, depending on the CPU workload, while maintaining 100% of the target frequency.
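The binning sweep itself can be summarized by the sketch below; the pass/fail callback stands in for the ATE test patterns, and the code range is arbitrary.

def bin_min_fc(run_test_pattern, fc_max=255, fc_min=0):
    """Sweep the ROSC fine code from high (slow) to low (fast) and return the
    lowest code that still passes the test pattern, i.e. minFC."""
    last_passing = None
    for fc in range(fc_max, fc_min - 1, -1):  # decreasing code = faster ROSC
        if run_test_pattern(fc):
            last_passing = fc                 # workload still passes here
        else:
            break                             # first failure ends the sweep
    return last_passing

# Hypothetical part that starts failing below fine code 87:
print("minFC =", bin_min_fc(lambda fc: fc >= 87))   # -> minFC = 87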
In summary, a tri-gear ARMv9 CPU subsystem for a flagship 5G smartphone SoC is
introduced with a high-performance core achieving 3.4GHz at robust yield and delivering
up to 27% higher peak performance through microarchitectural and implementation
advancements. Further, a circuit innovation enabling continuous monitoring of the on-die power supply is shown. Finally, a new variation-aware AVS technique is introduced to further improve CPU power efficiency.
References:
[1] “First Generation Armv9 High-Efficiency “LITTLE” Cortex CPU based on Arm
DynamIQ Technology,” <https://www.arm.com/products/silicon-ip-cpu/cortex-a/cortex-a510>.
[2] H. Mair et al., “A 10nm FinFET 2.8GHz Tri-Gear Deca-Core CPU Complex with
Optimized Power-Delivery Network for Mobile SoC Performance,” ISSCC, pp. 56-57,
2017.
[3] H. Chen et al., “A 7nm 5G Mobile SoC Featuring a 3.0GHz Tri-Gear Application
Processor Subsystem,” ISSCC, pp. 54-56, 2021.
[4] F. Atallah et al., “A 7nm All-Digital Unified Voltage and Frequency Regulator Based
on a High-Bandwidth 2-Phase Buck Converter with Package Inductors,” ISSCC, pp. 316-318, 2019.
Figure 2.5.1: ARMv9 memory tagging, SVE2 performance chart, CPU power efficiency.
Figure 2.5.2: ARMv9 CPU cluster.
Figure 2.5.3: HP core microarchitecture, instance count and hierarchical implementation.
Figure 2.5.4: HP core silicon measurement data.
Figure 2.5.5: Peak low detector circuit and timing diagram.
Figure 2.5.6: Adaptive Voltage Scaling (AVS) using FLL and minimum ROSC code.
Figure 2.5.7: Die micrograph.
ISSCC 2022 / SESSION 2 / PROCESSORS / 2.6
2.6
A 16nm 785GMACs/J 784-Core Digital Signal Processor Array
with a Multilayer Switch Box Interconnect, Assembled as a
2×2 Dielet with 10µm-Pitch Inter-Dielet I/O for Runtime
Multi-Program Reconfiguration
Uneeb Rathore*, Sumeet Singh Nagi*, Subramanian Iyer, Dejan Marković
University of California, Los Angeles, CA
*Equally Credited Authors (ECAs)
The increasing amount of dark silicon area in power-limited SoCs makes it attractive to
consider reconfigurable architectures that could intelligently repurpose dark silicon.
FPGAs are more efficient than CPUs, but lack the temporal dynamics of CPUs and the efficiency and throughput of accelerators. Coarse-grain reconfigurable arrays (CGRAs) can achieve higher throughput, but with a substantial energy-efficiency gap relative to accelerators and limited multi-program dynamics. This paper introduces a domain-specific and energy-efficient (within 2-10× of accelerators) multi-program runtime-reconfigurable 784-core
processor array in 16nm CMOS. Our design maximizes generality for signal processing
and linear algebra with minimal area and energy penalty. The main innovation is a
statistics-driven multi-layer network which minimizes network delays, and a switch box
that maximizes connectivity per hardware cost. The layered network is O(N) with the
number of processing elements N, which allows monolithic or multi-dielet scaling. The
network features deterministic routing and timing for fast program compile and
hardware-resource reallocation, suitable for data-driven attentive processing, including
program trace uncertainties and application dynamics. Further, this work demonstrates
multiple functional dielets that have been integrated at 10μm bump pitch to build a
monolithic-like scalable design. The multi-dielet (2×2) scaling is enabled by energy-efficient high-bandwidth inter-dielet communication channels that seamlessly extend the
intra-die routing network across dielet boundaries, quadrupling the number of compute
resources.
The reconfigurable compute fabric (Fig. 2.6.1) consists of 2×2 Universal Digital Signal
Processor (UDSP) dielets on a 10μm bump pitch Silicon Interconnect Fabric (Si-IF).
Each dielet has 196 compute cores (14×14 array), a 3-layer interconnect network, 28 boundary Streaming Near Range 10μm (SNR-10) inter-dielet 64b I/O channels, a PLL
for high-speed clock generation and a control interface. The control module supports
run-time (RT) reconfiguration, including holding pre-written programs in soft reset,
concurrent writing and execution of multiple programs, and collecting finished programs
to make space or for reuse at a later time. The control module includes a program counter
and associated memory for each of the cores, which is used to support algorithm
dynamics.
The internal architecture of the core (Fig. 2.6.2, some connections omitted for clarity)
contains two 16b adders and two 16b multipliers, four 16b inputs and four 16b outputs,
2-9 variable delay lines, a 256b data memory and a 384b instruction memory per
program counter. In order to balance compute efficiency and flexibility, the core
configuration is designed to support signal processing and basic linear algebra kernels
such as FIR/IIR direct- and lattice-form filters, matrix multiplication, CORDIC, mix-radix
FFT, etc.
The internal architecture of SNR-10 (Fig. 2.6.2) uses standard-cell-based high-current
buffers to drive short-distance (~100μm), low-impedance Si-IF inter-die communication
channels. The SNR-10 channels can be configured for synchronous or asynchronous
operation. The built-in self-test allows the SNR-10 channel to heal hardware faults at
chip start-up by routing around the faults using redundant pins and communicating the
correction to its connected neighboring dielet through repair/control pads. Each SNR-10 channel houses 32 Tx and 32 Rx bits and an Rx FIFO to adjust for clock skew. The
channel occupies 237×40μm2 as dictated by the physical dimensions of Si-IF bumps.
There is sufficient space to accommodate Duty-Cycle Correction (DCC) and Double Data
Rate (DDR) features in future iterations.
Figure 2.6.3 shows the layered interconnect network and the tile-able core, called Vertical
Stack (VS). We analyze the Data-Flow Graphs (DFGs) of common DSP and linear algebra
functions by clustering them into cores and deriving cluster statistics, such as node
degree distributions. We generate DSP-like Bernoulli random graphs with similar
statistics and sort them on a 2D Manhattan grid by minimizing the wire length cost
$\sum_{\text{all wires}} (L_x + L_y)^2$. The distribution of obtained wire lengths is shown as a 2D PDF and CDF in Fig. 2.6.3. Based on the CDF, a distance-2, 1-hop network can cover over 90% of the connectivity requirements of the domain.
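For illustration, the sketch below evaluates the same cost metric and length CDF on a stand-in Bernoulli graph; the edge probability, grid size, and naive placement are placeholders rather than the paper's derived statistics (the paper sorts the graph to minimize the cost before reading off the CDF).

import random

def manhattan(a, b, pos):
    return abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])

def wire_cost(edges, pos):
    """Sum of (Lx + Ly)^2 over all wires, the cost minimized in the text."""
    return sum(manhattan(a, b, pos) ** 2 for a, b in edges)

random.seed(1)
n, grid = 196, 14                             # one 14x14 dielet-sized example
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if random.random() < 0.01]           # Bernoulli stand-in graph
pos = {i: (i % grid, i // grid) for i in range(n)}   # naive row-major placement
lengths = sorted(manhattan(a, b, pos) for a, b in edges)
print("cost:", wire_cost(edges, pos))
print("fraction of wires with length <= 2:",
      sum(l <= 2 for l in lengths) / len(lengths))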
The VS consists of a compute core connected to three layers of switch boxes (SBs). The layer-1 SB interfaces with the compute core and provides distance-1 communication, the layer-2 SB provides distance-√2 communication, and the layer-3 SB provides distance-2 communication with adjacent
VSs. The data can only be registered inside the compute core and communication
between SBs does not add pipeline registers, allowing for a single clock cycle
communication between VSs at 1.1GHz in 16nm technology.
The internal design and optimizations of SBs (Fig. 2.6.4) aim to reduce hardware cost
while maximizing connectivity. We start with a multilayer SB design, where the number
of layers and nodes per layer are treated as optimization parameters. The connections
between the layers are converted to a hyper-matrix representation where each entry in the hyper-matrix represents a complete path through the SB. The connections are pruned
by minimizing hyper-row cross-correlations of the matrix. At every prune step, the SB
design is tested for Mean Connections Before Failure (MCBF), the number of successfully
mapped random input-output pairs before a routing conflict, against the Hardware Cost
(HwC) in terms of 2-input MUXes. The selected switch box architecture maximizes the
ratio of MCBF to HwC.
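As an illustration of the MCBF metric, the sketch below counts successfully mapped random input-output pairs before the first routing conflict and averages over trials; the crossbar-like routing model and the MUX count are placeholders, not the actual switch-box structure.

import random

def mcbf(route_pair, n_in, n_out, trials=1000):
    """Average number of random I/O pairs mapped before a routing conflict."""
    total = 0
    for _ in range(trials):
        used, count = set(), 0
        while route_pair((random.randrange(n_in), random.randrange(n_out)), used):
            count += 1
        total += count
    return total / trials

def route_pair(pair, used):
    """Placeholder SB model: each output can be claimed by one path."""
    _, out = pair
    if out in used:
        return False                           # conflict: routing fails
    used.add(out)
    return True

hw_cost_muxes = 64                             # hypothetical 2-input MUX count
print("MCBF/HwC =", mcbf(route_pair, 8, 8) / hw_cost_muxes)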
The setup for UDSP testing (Fig. 2.6.5) includes a compiler flow supporting RT
reconfiguration. The user input to the compiler is a DFG. A library of commonly used
blocks such as FFT or vector MAC can be designed to simplify and speed up
programming. The compiler retimes and clusters the DFG, maps I/O to the SNR-10
channels and arithmetic units to cores, maximizing core utilization. The clustered DFG is
assigned to the array grid and placement of the cores is optimized using simulated
annealing, with timing-aware masks. The SBs inside the VSs are configured nearly
independently and, because of the deterministic nature of the SBs, the routing process can
quickly inspect nearly all of the SB routes in parallel. If the routing fails, the compiler
optimizes the program placement, core mappings and clustering iteratively to achieve a
successful compilation. At the end, the compiler generates metadata pertaining to the
bounding boxes of the program and I/O port map to help the run-time scheduler achieve
a fast run-time placement of programs. The modularity and symmetry of the array allow
the scheduler to use the program metadata to quickly perform program relocation based
on runtime dynamics and current hardware resource utilization of other programs on
the array, and also use quick SB routing to route the I/Os of the program to the external
I/O ports.
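The simulated-annealing placement step can be sketched as below; the cost function, temperature schedule, and toy chain-shaped DFG are illustrative choices, not the compiler's actual parameters or timing-aware masks.

import math, random

def anneal_placement(edges, n_cores, grid, iters=20000, t0=5.0, alpha=0.9995):
    """Swap-based annealing of core positions to reduce total wirelength."""
    pos = {c: (c % grid, c // grid) for c in range(n_cores)}
    def cost():
        return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
                   for a, b in edges)
    cur, t = cost(), t0
    for _ in range(iters):
        a, b = random.sample(range(n_cores), 2)
        pos[a], pos[b] = pos[b], pos[a]        # propose swapping two cores
        new = cost()
        if new <= cur or random.random() < math.exp((cur - new) / t):
            cur = new                          # accept (occasionally uphill)
        else:
            pos[a], pos[b] = pos[b], pos[a]    # revert the swap
        t *= alpha
    return pos, cur

random.seed(2)
edges = [(i, i + 1) for i in range(195)]       # toy chain DFG on 196 cores
print("final wirelength:", anneal_placement(edges, 196, 14)[1])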
The 2×2 multi-dielet UDSP assembly (Figs. 2.6.1 and 2.6.7) features 784 cores on an
Si-IF with 10μm inter-dielet I/O links [1]. The VSs on the die edge communicate
seamlessly over the Si-IF interposer using the multilayer SBs and SNR-10 channels. The
intra-dielet SB interconnect naturally extends across dielet boundaries for a multi-dielet
configuration. The clock is sourced from the PLL of one of the dielets and distributed as a
balanced tree on the interposer.
The plot of power and frequency vs. supply voltage (Fig. 2.6.6) shows a maximum
frequency of 1.1GHz at 0.8V for high-throughput applications. A peak energy efficiency
of 785GMACs/J is achieved at 0.42V and 315MHz, for energy-sensitive applications. DSP
algorithms including beamforming, FIR filters, and matrix-multiply show algorithm-independent, line-rate throughput as well as >90% utilization of the allocated cores. The
FFT achieves 53% utilization due to its intrinsic multiplier-to-adder ratio of ~0.6 (for large
FFT sizes), as compared to the ratio of 1 inside the core. The energy efficiency is still
high, at 4.9pJ per complex radix-2, by keeping the unused elements in soft-reset. The
SNR-10 I/O links connecting the 4 dielets are stress-tested for efficiency and operating frequency. With a data rate of 1.1Gb/s/pin at 0.8V, the channel achieves a 70.4Gb/s bandwidth with a shoreline density of 297Gb/s/mm. This is achieved with a 2-layer Si-IF and 10μm bump pitch. While state-of-the-art multi-chip modules (MCMs) use lower swing and/or lower frequency [2,3] to increase energy efficiency, the SNR-10 link achieves a better energy efficiency of 0.38pJ/b, accounting for the total power draw of the Tx and Rx circuits and the wire links, thanks to the smaller capacitance of shorter-reach wires, smaller pads (enabled by the fine pitch), smaller ESD structures, and simpler unidirectional drivers.
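The reported link figures are mutually consistent, assuming the 70.4Gb/s counts all 64 Tx and Rx bits of a channel and the 237μm channel edge is the shoreline dimension:

\begin{align*}
64\ \text{pins} \times 1.1\ \text{Gb/s/pin} &= 70.4\ \text{Gb/s per channel},\\
70.4\ \text{Gb/s} \,/\, 0.237\ \text{mm} &\approx 297\ \text{Gb/s/mm shoreline density},\\
70.4\ \text{Gb/s} \times 0.38\ \text{pJ/b} &\approx 26.8\ \text{mW per channel (implied link power)}.
\end{align*}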
Acknowledgement:
The authors thank SivaChandra Jangam and Krutikesh Sahoo for help with Si-IF packaging and assembly, Sina Basir-Kazeruni for help with layout, and the DARPA CHIPS and DRBE programs for funding support.
References:
[1] S. Jangam et al., “Demonstration of a Low Latency (<20 ps) Fine-pitch (≤10 μm)
Assembly on the Silicon Interconnect Fabric,” IEEE Elec. Components and Tech. Conf.,
pp. 1801-1805, 2020.
[2] C. Liu, J. Botimer, and Z. Zhang, “A 256Gb/s/mm-shoreline AIB-Compatible 16nm
FinFET CMOS Chiplet for 2.5D Integration with Stratix 10 FPGA on EMIB and Tiling on
Silicon Interposer,” IEEE CICC, 2021.
[3] M. Lin et al., “A 7-nm 4-GHz Arm®-Core-Based CoWoS® Chiplet Design for High-Performance Computing,” IEEE JSSC, vol. 55, no. 4, pp. 956-966, 2020.
[4] J. M. Wilson et al., “A 1.17pJ/b 25Gb/s/pin Ground-Referenced Single-Ended Serial
Link For Off- and On-Package Communication in 16nm CMOS Using a Process- and
Temperature-Adaptive Voltage Regulator,” ISSCC, pp. 276-278, 2018.
Figure 2.6.1: Top-level architecture, multi-chip reconfigurable fabric with 4 UDSPs, 784 cores and 112 channels.
Figure 2.6.2: UDSP core and SNR-10 channel design.
Figure 2.6.3: Analysis of algorithm statistics and design of a delay-less 3D multi-layer interconnect.
Figure 2.6.4: Switchbox design space exploration and optimal switchbox architecture.
Figure 2.6.5: Automated compiler, run-time scheduler and measurement setup.
Figure 2.6.6: Chip performance, power and efficiency measurements, and inter-dielet communication protocol comparison.
Figure 2.6.7: UDSP chip summary and micrograph.
ISSCC 2022 / SESSION 2 / PROCESSORS / 2.7
2.7
Zen3: The AMD 2nd-Generation 7nm x86-64 Microprocessor
Core
Thomas Burd1, Wilson Li1, James Pistole1, Srividhya Venkataraman1,
Michael McCabe1, Timothy Johnson1, James Vinh1, Thomas Yiu1, Mark Wasio1,
Hon-Hin Wong1, Daryl Lieu1, Jonathan White2, Benjamin Munger2,
Joshua Lindner2, Javin Olson2, Steven Bakke2, Jeshuah Sniderman2,
Carson Henrion3, Russell Schreiber4, Eric Busta3, Brett Johnson3, Tim Jackson3,
Aron Miller3, Ryan Miller3, Matthew Pickett3, Aaron Horiuchi3, Josef Dvorak3,
Sabeesh Balagangadharan5, Sajeesh Ammikkallingal5, Pankaj Kumar5
1AMD, Santa Clara, CA
2AMD, Boxborough, MA
3AMD, Fort Collins, CO
4AMD, Austin, TX
5AMD, Bangalore, India
“Zen 3” is the first major microarchitectural redesign in the AMD Zen family of
microprocessors. Given the same 7nm process technology as the prior-generation “Zen
2” core [1], as well as the same platform infrastructure, the primary “Zen 3” design goals
were to provide: 1) a significant instruction-per-cycle (IPC) uplift, 2) a substantial
frequency uplift, and 3) continued improvement in power efficiency. The core complex
unit (CCX) consists of 8 “Zen 3” cores, each with a 0.5MB private L2 cache, and a 32MB
shared L3 cache. Increasing this from 4 cores and 16MB L3 in the prior generation
provides additional performance uplift, in addition to the IPC and frequency
improvements. The “Zen 3” CCX shown in Fig. 2.7.1 contains 4.08B transistors in 68mm2,
and is used across a broad array of client, server, and embedded market segments.
The “Zen 3” design has several significant microarchitectural improvements. A high-level
block diagram is shown in Fig. 2.7.2. The front-end had the largest number of changes,
including a 2× larger L1 Branch-Target-Buffer (BTB) at 1024 entries, improved branch
predictor bandwidth by removing the pipeline bubble on predicted branches and faster
recovery from mispredicted branches, and faster sequencing of op-cache fetches. In the
execution core, the integer unit issue width was increased from 7 to 10, including
dedicated branch and store pipes, the reorder buffer was increased by 32 entries to 256,
while in the floating-point unit, the issue width was increased from 4 to 6 and the FMAC
latency was reduced from 5 to 4 cycles. In the load-store unit, the maximum load and store bandwidths were each increased by one, to 3 loads and 2 stores per cycle, respectively, and the translation look-aside buffer (TLB) was enhanced with 4 additional table walkers for a total of 6.
Overall, the “Zen 3” core delivers +19% average IPC uplift over “Zen 2” across a range
of 25 single-threaded industry benchmarks and gaming applications, with some games
showing greater than +30% [2].
To support the change to an 8-core CCX for “Zen 3,” and the corresponding increase in
L3 cache slices from 4 to 8, the cache communication network was converted to a new
bi-directional ring bus with 32B in each direction, providing both high bandwidth and
low latency. The 8-core CCX with 32MB L3 provides 2× the available local L3 cache as
“Zen 2” per core, providing significant uplift on many lightly-threaded applications. The
L3 cache also contains the through-silicon vias (TSVs) to support AMD V-Cache, allowing
an additional 64MB AMD 3D V-Cache to attach to the base die via direct copper-to-copper
bond, which can triple the L3 capacity per CCX to 96MB. Another key change from “Zen
2” was converting the L3 cache from the high-current (HC) bitcell to the high-density (HD) bitcell, yielding a 14% area improvement and a 24% leakage reduction for the cache array, all while matching the higher core clock frequencies.
The critical physical design challenge of the new “Zen 3” cache was the implementation
of the ring bus and the interface to the AMD V-Cache. As can be seen in Fig. 2.7.3, two
columns of TSVs run down the left and right halves of the cache, providing connectivity
to the stacked AMD V-Cache that effectively triples the memory capacity of each slice.
The TSV interface supports >2Tb/s per slice, for an aggregate bandwidth between the
two dies of >2TB/s. The ring bus has a cross-sectional bandwidth of >2Tb/s to match
the peak core-L3 bandwidth. Architecturally, the L3 cache supports up to 64 outstanding
misses from each core to the L3, and 192 outstanding misses from L3 to external
memory.
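The per-slice and aggregate TSV bandwidth figures are consistent under the assumption that all eight L3 slices contribute:

\[ 8\ \text{slices} \times ({>}2\ \text{Tb/s per slice}) = {>}16\ \text{Tb/s} = {>}2\ \text{TB/s between the two dies}. \]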
“Zen 3” uses the same 13-metal telescoping stack as “Zen 2,” optimized for density at
the lower layers and for speed on the upper layers [1]. A key goal of “Zen 3” was to
increase frequency at fixed voltage by 4% across the full range of voltage, with 6%
increase at high-voltage to drive a corresponding improvement in single-thread
performance. Median silicon measurements of frequency vs. voltage are shown in Fig.
2.7.4, demonstrating success in meeting the goal, and more importantly, the ability to
achieve superior single-thread operation at 4.9GHz. Structured logic placement, judicious
cell selection, targeted use of low VTs, and wire engineering were used to drive the large
frequency increase at high voltage, and the tiles of the core were kept small to enable
multiple design iterations per week. To deliver the average IPC uplift of 19% and
frequency increase of 6%, the CCX effective switched capacitance (Cac) increased by
15%. Maintaining a ΔCac/ΔIPC ratio of less than one produces a more power-efficient
core. A breakdown of the “Zen 3” Cac is shown in Fig. 2.7.5. The fraction of Cac
consumed by clock gaters has increased slightly over the previous generation and the
fraction consumed by flops and combinational logic has decreased slightly due to the
additional effort to improve the clock gating efficiency of “Zen 3.” A modest increase in
leakage power was outweighed by the frequency increase to further improve power
efficiency. While the Cac and leakage increase leads to higher power at fixed voltage, at
ISO performance the “Zen 3” core delivers up to 20% higher performance/W, as shown
in Fig. 2.7.6.
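One way to read the ΔCac/ΔIPC criterion (considering dynamic power only, at fixed voltage and frequency) is that switching energy per instruction shrinks when capacitance grows more slowly than IPC:

\[ \frac{E_{\text{dyn}}/\text{instruction, Zen 3}}{E_{\text{dyn}}/\text{instruction, Zen 2}} \approx \frac{1 + \Delta C_{ac}}{1 + \Delta \text{IPC}} = \frac{1.15}{1.19} \approx 0.97 < 1. \]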
The “Zen 3” core IP was simultaneously used in two distinct tape-outs. The first was the
“Zen 3” core complex die (CCD) chiplet, comprising the CCX, a system management unit
(SMU), and an Infinity Fabric On-Package (IFOP) SerDes link to connect to a separate IO
die (IOD). The primary upgrade from the prior “Zen 2” CCD [3] was replacing the two 4-core “Zen 2” CCXs with a single 8-core “Zen 3” CCX, which allowed for a more
streamlined on-die Infinity Fabric to connect only one CCX and leveraged the remaining
IP from the prior-generation CCD. The size of this CCD is 81mm2, containing 4.15B
transistors. Similar to the prior generation, the CCD was combined with a low-cost 12nm
IOD to provide very cost-effective performance, and the modularity of chiplets enabled
product configurations to service the entire breadth of the server and desktop PC
markets. AMD “Milan” server products combine the server IOD with 2-to-8 “Zen 3”
chiplets to deliver a cost-optimized product stack from 16-to-64 cores. AMD “Vermeer”
client products combine the much smaller client IOD with 1-to-2 “Zen 3” chiplets to cover
the spectrum from top-end 16-core performance desktop products to mainstream 8-core value products.
The second tape-out was AMD “Cezanne”, a monolithic APU, which upgraded the prior
generation APU [4] from two “Zen 2” CCXs to a single “Zen 3” CCX, having the same
core as in the CCD and a cut-down 16MB L3 Cache. This die, measuring 180mm2 with
10.7B transistors, contains an 8-compute-unit Vega graphics engine, a multimedia
engine, a display engine, an audio engine, two DDR4/LPDDR4 memory channels, as well
as PCIe, USB and SATA ports.
Combining both the IPC and frequency gains of “Zen 3” over “Zen 2” yields significant
performance gains in both single- and multi-thread performance across the entire
spectrum of product segments. Figure 2.7.7 shows a 17% increase in single-thread (1T) Cinebench R20 score on the “Vermeer” desktop part: 13% from IPC and 4% from increased frequency. The multi-threaded Cinebench R20 result on the “Cezanne” mobile APU shows a 14% uplift. Despite being constrained to the same process technology as the prior generation, the “Zen 3” family of products delivers a substantial boost in performance through the combination of a ground-up redesign of the underlying Zen microarchitecture and the innovative work of the physical design team to achieve up to 6% higher frequency at the same voltage.
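The reported 1T uplift is consistent with the stated contributions whether they are treated as additive or compounding:

\[ 13\% + 4\% = 17\%, \qquad 1.13 \times 1.04 \approx 1.18. \]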
References:
[1] T. Singh et al., “Zen 2: The AMD 7nm Energy-Efficient High-Performance x86-64
Microprocessor Core,” ISSCC, pp. 42-43, 2020.
[2] M. Evers et al., “Next Generation ‘Zen 3’ Core,” IEEE Hot Chips Symp., 2021.
[3] S. Naffziger et al., “AMD Chiplet Architecture for High-Performance Server and
Desktop Products,” ISSCC, pp. 44-45, 2020.
[4] S. Arora et al., “AMD Next Generation 7nm Ryzen 4000 APU ‘Renoir’,” IEEE Hot Chips
Symp., 2020.
Figure 2.7.1: Die photo of “Zen 3” core complex unit (CCX).
Figure 2.7.2: “Zen 3” architecture.
Figure 2.7.3: L3 ring bus design.
Figure 2.7.4: Frequency improvement.
Figure 2.7.5: Cac breakdown.
Figure 2.7.6: Performance (average) versus power for one 8C32M “Zen 3” CCX
versus two 4C16M “Zen 2” CCXs.
Figure 2.7.7: Cinebench R20 performance improvement.