ISSCC 2025
SESSION 16
Invited Industry
Tomahawk 5
51.2Tbps Ethernet Switch
Asad Khamisy, PhD
© 2025 IEEE
International Solid-State Circuits Conference
16.1: Tomahawk 5 51.2Tbps Ethernet Switch
1 of 26
Outline
Overview
Technology Highlights
Packet Flow
Monolithic Chip Drivers
Low Power, Air Cooled Pizza Box Design
TH5-Bailly: Direct Drive Co-Packaged Optics
Summary
Overview
51.2 Tbps Ethernet/IP Switching Bandwidth
In Mass Production Q1 2023
2X bandwidth every 2 years at lower $/Tbps, maintaining Moore’s Law
<1W total power per 100Gbps
Monolithic 5nm implementation
Diverse Physical Connectivity
4m DAC, KR backplane, pluggable optics, LPO/LRO, co-packaged optics
64 x 800GbE
128 x 400GbE
256 x 200GbE
512 x 100GbE
Accelerates AI Workloads
Cognitive routing, advanced telemetry and congestion control
Resource Virtualization
Improved security, efficient use of massively shared infrastructure
64x800G in 2RU
Technology Highlights
Monolithic
BGA package, 87.5x77.5mm, 7-2-7 stackup, lidless
TSMC 5nm process node
750+ mm² die area
60B transistors
1.325GHz core clock frequency
DSP-based 100G SerDes
Supports a 2RU pizza-box design: PCB routed, air cooled
Standard and co-packaged optics products
Logical View
[Block diagram: 16 x 1.6T data paths, each with 2x[8x100G] SerDes, PCS/MAC, and IDB/EDB, feeding two Packet Processing Pipelines and a Memory Management unit (Admission, Enqueue, QOS, CC, Dequeue) around a shared Packet Buffer built from memory banks (MB)]
SerDes/PCS/MAC supports IEEE standards, sub-ns timestamping, DPLL, RS544/272 FEC
Ingress Data Buffer (IDB) provides an oversubscription input buffer with QoS and PFC capability
Packet processor supports low-latency L2/L3/Tunnels/ACL/Load-Balancing/Fast-Failover
Memory Management provides advanced traffic management capabilities
Packet Buffer supports 167MB of shared packet buffer storage
Packet Flow
[Packet flow block diagram; step 1 highlighted at the ingress SerDes/PCS/MAC]
1. SerDes/PCS/MAC performs FEC and MAC functions; supports fast codeword error detection
Packet Flow
[Packet flow block diagram; step 2 highlighted at the IDB]
2. IDB over-subscription buffer allows burst absorption for packet sizes less than 295B
Packet Flow
[Packet flow block diagram; step 3 highlighted at the ingress Packet Processing Pipeline]
3. Ingress PP performs parsing, L2/L3/Tunnel lookups, ACL, and adaptive load balancing, and determines the destination port/queue
Packet Flow
[Packet flow block diagram; step 4 highlighted at the MMU enqueue]
4. MMU performs admission control, enqueues the packet into the packet buffer, and input-replicates multicast
Packet Flow
[Packet flow block diagram; step 5 highlighted at the Packet Buffer]
5. Packet buffer stores the packet as fixed-size cells, distributing cells evenly across memory banks
Packet Flow
[Packet flow block diagram; step 6 highlighted at the MMU dequeue]
6. MMU schedules the port/queue, dequeues all cells of the packet, and performs ECN marking
Packet Flow
[Packet flow block diagram; step 7 highlighted at the egress Packet Processing Pipeline]
7. Egress PP performs editing and ACL rules, then sends the packet to the EDB or redirects it back to the ingress PP
Packet Flow
[Packet flow block diagram; step 8 highlighted at the EDB]
8. EDB buffers the packet per port or per ingress PP and sends it to the PCS/MAC
Packet Flow
[Packet flow block diagram; step 9 highlighted at the egress PCS/MAC and SerDes]
9. MAC/PCS adds CRC, FEC, and timestamp, and sends the packet out
Monolithic Chip Drivers
Symmetric Floor Plan
Data Movement – Wiring
Shared Memory Traffic Manager
Semi-Custom Packet Buffer
Pipeline Packet Processor
Density-Optimized SerDes
Symmetric Floor Plan
[Floor plan: Packet Buffer in the center, two Ingress/Egress Cores above and below, Port Macros along the edges]
• Port Macro: SerDes/PCS/MAC
• Ingress/Egress Core: Datapath, Packet Processor, MMU Control
• Allows focus on a reduced number of P&R blocks to optimize area/power/effort
• Packet buffer in the middle of the chip reduces wiring distances
Data Movement - Wiring
[Figure: two flops (D to Q) connected by a long wire; signal propagation of 400-800um/ns; interconnect consumes 10-15% of chip area]
• Routing resources are NOT scaling with the process node, a compounding issue as bandwidth increases (see the estimate below)
• Possible solutions
  • Feedthrough signals: utilize the relatively low utilization inside P&R blocks
  • Dedicated channels: can achieve longer distances
  • DDR signaling: difficult timing closure
  • Interconnect compression: Quad+ signaling using custom layout/clocking techniques
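A rough sense of scale for the wiring problem, combining the 400-800um/ns figure with the 1.325GHz core clock and approximating the die edge from the 750+ mm² area: one cycle is $1/1.325\,\mathrm{GHz} \approx 0.75\,\mathrm{ns}$, so a signal covers roughly $0.75\,\mathrm{ns} \times 400\text{-}800\,\mu\mathrm{m/ns} \approx 300\text{-}600\,\mu\mathrm{m}$ per cycle; crossing a die edge of about $\sqrt{750\,\mathrm{mm}^2} \approx 27\,\mathrm{mm}$ therefore takes on the order of 45-90 stages of pipelining registers.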
Memory Management
Shared Memory Traffic Manager
[Block diagram: 32 data paths deliver packet info and cells; Enqueue Control with Admission Control allocates 32 cell pointers from a Cell Free Pointer pool; data cells are written (32 cells) directly into the semi-custom Packet Buffer banks; the Queue Database and Port/Queue Scheduler drive Dequeue Control, which reads (32 cells) through jitter buffers back to the IDB/EDB]
• Data cells are written directly to the memory buffer
• The control path handles only cell pointers (see the sketch below)
• Jitter buffers allow high-bandwidth cell reads from multiple banks
• Lower area compared with an input-buffered switch due to the lower cost of control structures
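A minimal C sketch of the control structure the bullets above describe, where only cell pointers move through the control path while data cells go straight to the banked buffer; all names, sizes, and the free-pointer/queue layout are illustrative, not the Tomahawk 5 implementation:

#include <stdint.h>
#include <stdio.h>

#define NUM_CELLS 4096           /* illustrative; the real buffer holds 167MB of cells */

typedef struct { uint32_t next; } cell_meta_t;           /* control path: pointer only  */
static cell_meta_t meta[NUM_CELLS];
static uint32_t    free_head;                            /* cell free-pointer pool      */

typedef struct { uint32_t head, tail, count; } queue_t;  /* queue-database entry        */

static void free_pool_init(void) {
    for (uint32_t i = 0; i + 1 < NUM_CELLS; i++) meta[i].next = i + 1;
    meta[NUM_CELLS - 1].next = UINT32_MAX;
    free_head = 0;
}

/* Enqueue: admission control reserves a chain of free cell pointers; the data cells
 * themselves would be written straight into the banked packet buffer (not modeled),
 * and the chain is linked onto the destination port/queue.                          */
static int enqueue_packet(queue_t *q, uint32_t ncells) {
    if (ncells == 0 || free_head == UINT32_MAX) return -1;
    uint32_t first = free_head, last = first;
    for (uint32_t i = 1; i < ncells; i++) {
        last = meta[last].next;
        if (last == UINT32_MAX) return -1;               /* admission: not enough cells */
    }
    free_head = meta[last].next;
    meta[last].next = UINT32_MAX;
    if (q->count) meta[q->tail].next = first; else q->head = first;
    q->tail = last;
    q->count += ncells;
    return 0;
}

/* Dequeue: the scheduler picks a port/queue and returns the packet's cell pointers
 * to the free pool; jitter buffers would absorb the multi-bank read latency here.  */
static void dequeue_cells(queue_t *q, uint32_t ncells) {
    while (ncells-- && q->count) {
        uint32_t c = q->head;
        q->head = meta[c].next;
        meta[c].next = free_head;
        free_head = c;
        q->count--;
    }
}

int main(void) {
    queue_t q = {0, 0, 0};
    free_pool_init();
    enqueue_packet(&q, 3);                               /* e.g. a 3-cell packet        */
    printf("queued cells: %u\n", q.count);
    dequeue_cells(&q, 3);
    printf("queued cells after dequeue: %u\n", q.count);
    return 0;
}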
Pipelined Packet Processor
[Block diagram: packet data flows through the 1.6T data paths while packet headers pass through the Parse, Lookup, ACL, and Switch stages of two Packet Processing Pipelines that share databases]
• Pipelined packet processing: in-order processing and memory-control proximity minimize wiring
• Shared databases between the 2 pipelines enable efficient memory use
• Decoupled data and packet rates: lower area and power
Density-Optimized SerDes
[Floor plan detail: high-density E/W placement of SerDes macros; SerDes stacked on the N/S edges]
• Peregrine: a 102.5G ADC/DAC + DSP-based SerDes
• E/W orientation optimized to enable the maximum number of cores
• N/S orientation enables stacking
Generational Power Reduction
Generation     Process Technology   Bandwidth [Tbps]   Typical-Case Power [W]   pJ/bit
Tomahawk 1     28nm                 3.2                115                      35.9
Tomahawk 2     16nm                 6.4                180                      28.1
Tomahawk 3     16nm                 12.8               220                      17.2
Tomahawk 4     7nm                  25.6               306                      12.0
Tomahawk 5     5nm                  51.2               450                      8.8
• 30% power reduction between generations
• Attributed to uArch and power reduction techniques
• Process technology offers 15-20%
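As a cross-check, the pJ/bit column is simply typical power divided by bandwidth, e.g. for Tomahawk 5: $450\,\mathrm{W} / 51.2\,\mathrm{Tb/s} \approx 8.8\,\mathrm{pJ/bit} \approx 0.88\,\mathrm{W}$ per $100\,\mathrm{Gb/s}$, consistent with the <1W per 100Gbps figure in the overview.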
Standard Air Cooled Pizza Box Design
• 2RU 64x800G OSFP-based front panel
• PCB routed, no flyover cables
• Air cooled, vapor-chamber heat sink
• Achieves 39.9mV voltage droop from idle to max traffic load
• PDN design using multi-phase VRs, sufficient copper PCB layers, and decoupling caps
TH5-Bailly: Direct Drive Co-Packaged Optics
• In production since 2023
• 128x400G FR4
• 8 x 6.4T SiP-based optical modules
• External pluggable laser modules
TH5-Bailly: Direct Drive Co-Packaged Optics
Typical power efficiency estimate [pJ/bit] at 100 Gb/s per lane:
  Fully retimed   15+
  LRO             12
  LPO             10
  CPO             6
(power decreases from fully retimed down to CPO)
• CPO provides power and cost reduction
• Lowers latency by 100ns
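Scaling the per-lane estimates above to the full 51.2 Tb/s of front-panel bandwidth (a simple multiplication of the table values, not a measured system result): fully retimed $\ge 15\,\mathrm{pJ/bit} \times 51.2\,\mathrm{Tb/s} \approx 770\,\mathrm{W}$, LRO $\approx 614\,\mathrm{W}$, LPO $\approx 512\,\mathrm{W}$, CPO $\approx 307\,\mathrm{W}$; i.e. CPO saves on the order of 460 W of optics power versus fully retimed modules.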
Summary
Tomahawk 5 volume production since 2023
Monolithic design enabled by uArch/memory/SerDes/integration innovations
Extreme focus on low power and simplicity of system design
Diverse media support DAC, backplane, LPO, LRO, and CPO
Technology readiness for A0 volume production via test chips
Robustness and 10Y+ lifetime via signoff and qual methodologies
Please check the demo today!
Thank You!
RNGD: A 5nm Tensor-Contraction Processor
for Power-Efficient Inference
on Large Language Models
S. M. Lee1, H. Kim1, J. Yeon1, M. Kim1, C. Park1, B. Bae1, Y. Cha1, W. Choe1,
J. Choi1, Y. Choi1, K. J. Han2, S. Hwang1, K. Jang1, J. Jeon1, H. Jeong1,
Y. Jung1, H. Kim1, S. Kim1, S. Kim1, W. Kim1, Y. Kim1, Y. Kim1, H. Kwon1,
J. K. Lee1, J. Lee1, K. Lee1, S. Lee1, M. Noh1, J. Park1, J. Seo1, J. Paik1
1FuriosaAI, Seoul, Korea
2Dongguk University, Seoul, Korea
© 2025 IEEE
International Solid-State Circuits Conference
Outline
• Introduction
• Architecture
• Implementation and Results
Growth Bottleneck & Importance of Energy Efficiency
Requirements for LLM inference: high memory BW, dense compute power; only limited products are available
Cost: one analysis* estimated ChatGPT cost $700,000 a day to run, chiefly in compute-intensive server time
• LLM inference demand has surged since ChatGPT's launch in Nov. 2022
• Hardware power consumption drives soaring inference costs
• Energy efficiency of the chip is critical as TCO scales with TDP
*Source: Business Insider
Tensor Contraction, Not MatMul, Used as a Primitive
Tensor contraction is the core computation in deep learning and a higher-dimensional generalization of matrix multiplication.
[Figure: tensor contraction; FLOP analysis for BERT*]
*Source: Data Movement is All You Need: a case study on optimizing transformers, MLSYS’21
One Simple Example of Tensor Contraction
[Figure: matrices A (M x K) and B (K x N) producing C (M x N); tensors streamed from SRAM to the compute units on the chip]
Computation (mk, kn → mn):
for (n in 0..N) {
  for (m in 0..M) {
    for (k in 0..K) {
      C[m][n] += A[m][k] * B[k][n]
    }
  }
}
A Whole Tensor Contraction as a Primitive
[Figure: a whole tensor contraction (mk, kn → mn) executed as a single primitive; SRAM feeds the compute units via multicast with temporal access over K; the lowered shape defines the memory layout of the tensors and the tactic defines the scheduling of computation onto an H x W unit]
// spatial mapping
for (n_blk in 0..4) {
  // temporal scheduling
  for (m_blk in 0..4) {
    for (k_blk in 0..(K/W)) {
      for (m_idx in 0..(M/4)) {
        for (n_idx in 0..(N/4/H)) {
          // unit computation
          for (h in 0..H) {
            for (w in 0..W) {
              m = m_index_of(m_blk, m_idx)
              n = n_index_of(n_blk, n_idx, h)
              k = k_index_of(k_blk, w)
              C[m][n] += A[m][k] * B[k][n]
} } } } } } }
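For concreteness, the loop nest above compiles as plain C if the index helpers are given simple affine definitions; the block counts, the H x W unit shape, and the m_index_of/n_index_of/k_index_of bodies below are illustrative assumptions, not the actual RNGD tactic:

#include <stdio.h>

#define M 16
#define N 16
#define K 32
#define H 4            /* dot products computed in parallel per unit (illustrative) */
#define W 8            /* MACs per dot product (illustrative)                        */

static float A[M][K], B[K][N], C[M][N];

/* Hypothetical affine index helpers; the real "tactic" is chosen by the compiler. */
static int m_index_of(int m_blk, int m_idx)        { return m_blk * (M / 4) + m_idx; }
static int n_index_of(int n_blk, int n_idx, int h) { return n_blk * (N / 4) + n_idx * H + h; }
static int k_index_of(int k_blk, int w)            { return k_blk * W + w; }

int main(void) {
    for (int m = 0; m < M; m++) for (int k = 0; k < K; k++) A[m][k] = 1.0f;
    for (int k = 0; k < K; k++) for (int n = 0; n < N; n++) B[k][n] = 1.0f;

    for (int n_blk = 0; n_blk < 4; n_blk++)                 /* spatial mapping      */
      for (int m_blk = 0; m_blk < 4; m_blk++)               /* temporal scheduling  */
        for (int k_blk = 0; k_blk < K / W; k_blk++)
          for (int m_idx = 0; m_idx < M / 4; m_idx++)
            for (int n_idx = 0; n_idx < N / 4 / H; n_idx++)
              for (int h = 0; h < H; h++)                   /* unit computation     */
                for (int w = 0; w < W; w++) {
                  int m = m_index_of(m_blk, m_idx);
                  int n = n_index_of(n_blk, n_idx, h);
                  int k = k_index_of(k_blk, w);
                  C[m][n] += A[m][k] * B[k][n];
                }

    printf("C[0][0] = %.0f (expected %d)\n", C[0][0], K);   /* 32 for all-ones inputs */
    return 0;
}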
Spatial & Temporal Orchestration Boosts
Utilization & Efficiency
[Diagram: four slices (slice 0..3) share one SRAM read; fetch, fetch network, feed buffer, dot product x H, accumulators, and commit are pipelined in time across slices, with multicast and input reuse through the register files]
• A single SRAM read supports multicast and reuse
• Temporal pipelining maximizes utilization of spatially parallel compute units
• Streamlined data paths for efficiency
• Hardware follows software-optimized tactics
Flexible Reconfigurability Is Crucial in Inference
[Diagram: a 128 x 128 systolic array compared with reconfigurable TCP arrangements (W x 4H and (2W x H) x 2) built from slices with fetch, dot-product PEs, data reuse, and commit units around SRAM]
• For inference, batch sizes can vary widely
• Thus it is even more important to exploit parallelism and data reuse from the given tensor shape
Outline
• Introduction
• Architecture
• Implementation and Results
RNGD, Powerfully Efficient AI Inference Chip
• Compute: INT8 512 TOPS, INT4 1024 TOPS, BF16 256 TFLOPS, FP8 512 TFLOPS
• Memory: 48 GB HBM3 at 1.5 TB/s; 256 MB SRAM at 384 TB/s
• PCIe Gen5 x16, 128 GB/s, P2P for scaling LLMs across multiple RNGDs
• 150 W TDP: targeting air-cooled data centers
• Features for cloud: multiple-instance support, secure boot and model encryption, virtualization
Interposer and Package
HBM3 x 2: 6.0Gbps, 12Hi (24GB) x 2, 1.5TB/s
Silicon interposer: die size 1114mm²
SoC: 5nm, 1GHz clock, die size 653mm²
Package: HFCBGA, 55.0 x 55.0mm²
SoC Top
• Total 8x PE
• Per PE:
  • Core CPU: L1$=128k / L2$=256k, 3.5MB SPM with SECDED
  • IPC / doorbell for host communication: 64kB with SECDED
  • CPU at 2GHz, others at 1GHz
  • PEC: clock and reset control, interrupts
  • Tensor DMA: 256GB/s for both HBM and data memory
  • 512-way vector processing
  • Transcendental functions
  • 1 spare slice per PE to improve yield
• PE fusion: up to 4 PEs can be fused (easier to generate programs, removes data copies)
Processing Element (PE)
• Fetch network: fetched data is multicast to slices for parallel computation via a circuit-switching-like network
• Control network: configures the SFRs of slices
• Data memory network: data movement between slices and HBM; dual ring network (128GB/s each)
• Tensor DMA engine
[Block diagram: Tensor Unit with Core, scratchpad memory, Tensor DMA Engine, fetch and commit sequencers, and per-slice Fetch Units, Contraction Engines, Vector Engines, Transpose Engines, and Commit Units operating on the input tensor in SRAM]
Programming Interface
• The CPU core executes model binaries loaded in SPM and pushes commands to the TUC
• Tensor Unit Controller (TUC), a coprocessor:
  • has a command queue and general registers
  • configures the control registers of slices via commands
• Asynchronous execution of tensor operations and DMA allows hiding DMA behind computation (see the sketch below)
[Diagram: command-queue entries such as dma(.., #0), load(.., .., ..), wait.d(#0), exec(#0), dma(.., #1), wait.e(#0), wait.d(#1), exec(#1), wait.e(#1), ...; the Tensor DMA Engine moves layer N / N+1 / N+2 weights from DRAM into the data memories while the Fetch / Operation / Commit units compute]
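A runnable C sketch of the double-buffered dma / wait.d / exec / wait.e pattern shown in the command queue above; the tuc_* functions are invented stand-ins for illustration, not the RNGD SDK API:

#include <stdio.h>

enum { NUM_LAYERS = 4 };   /* small for the demo; the real model has many more layers */

/* Illustrative stand-ins for TUC commands (dma / wait.d / exec / wait.e);
 * these names are invented for this sketch, not part of any real SDK.      */
static void tuc_dma_start(int buf, int layer) { printf("dma    buf%d <- layer %d\n", buf, layer); }
static void tuc_dma_wait (int buf)            { printf("wait.d buf%d\n", buf); }
static void tuc_exec_start(int buf)           { printf("exec   buf%d\n", buf); }
static void tuc_exec_wait (int buf)           { printf("wait.e buf%d\n", buf); }

int main(void) {
    tuc_dma_start(0, 0);                              /* prefetch first layer        */
    for (int layer = 0; layer < NUM_LAYERS; layer++) {
        int cur = layer & 1, nxt = cur ^ 1;
        tuc_dma_wait(cur);                            /* weights for layer are ready */
        tuc_exec_start(cur);                          /* compute runs asynchronously */
        if (layer + 1 < NUM_LAYERS)
            tuc_dma_start(nxt, layer + 1);            /* hide next layer's DMA       */
        tuc_exec_wait(cur);                           /* retire before buffer reuse  */
    }
    return 0;
}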
Slice - Compute Pipeline
• Data reuse (illustrated in the sketch below):
  • Weight stationary via the register file
  • Input stationary via the CE's input buffer
  • Output stationary via accumulator registers
• Flexible indexing-pattern support for tensor operations:
  • supports any kind of axis in a tensor operation
  • a single SFR configuration controls the whole tensor operation to minimize control overhead
• Vector engine, transpose engine, and commit unit:
  • tensors are stored in the optimal memory layout for the next operations
[Block diagram: Fetch Unit and Fetch Network Router feed the Contraction Engine (Feed Unit, register files, MAC trees, dot-product engines, accumulators, accumulation unit), followed by the Vector Engine, Transpose Engine, and Commit Unit]
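A tiny C illustration of the three reuse forms listed above (weight stationary in the register file, input stationary per feed, output stationary in the accumulators); the dimensions are arbitrary assumptions:

#include <stdio.h>

enum { K = 16, OUT = 8 };   /* contraction depth and outputs per slice (illustrative) */

int main(void) {
    float x[K], w_rf[OUT][K], acc[OUT] = {0};
    for (int k = 0; k < K; k++) x[k] = 1.0f;
    for (int o = 0; o < OUT; o++) for (int k = 0; k < K; k++) w_rf[o][k] = 1.0f;

    /* Weights sit in the register file for the whole pass (weight stationary). */
    for (int k = 0; k < K; k++) {
        const float xk = x[k];             /* input stationary: one feed ...        */
        for (int o = 0; o < OUT; o++)      /* ... reused by every dot-product lane  */
            acc[o] += w_rf[o][k] * xk;     /* output stationary: partial sums stay  */
    }                                      /* in the accumulators until commit      */

    printf("acc[0] = %.0f (expected %d)\n", acc[0], K);
    return 0;
}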
Contraction Engine
• Receives tensors from the fetch unit in a defined order and feeds data multiple times to the multiple DPEs
• Controls feed, register-file, and accumulator accesses
• 8 DPEs per slice
• Multiple contexts and asynchronous execution:
  • executes main and sub-contexts simultaneously (e.g., the main context runs a tensor contraction while the sub-context moves data)
  • two sets of SFRs for the main context
[Block diagram: two slices on the data-memory, control, and status networks; each slice has a Fetch Unit, Sub Fetch Unit, Network Router, Fetch Commit Arbiter, Operation Unit (CE/TRF, VE/VRF, TE), Commit Unit, RCU, and Sub Commit Unit; red: main-context datapath, blue: sub-context datapath]
Vector Engine and Transpose Engine
• Element-wise add and multiply operations, or non-linear functions like SiLU, GeLU, etc.
• Software defines the vector pipeline:
  • i32/fp32 arithmetic operators (add, mul, div, bitwise operators, ...)
  • transcendental functions (exp, sin, cos, log, tanh, sigmoid, ...)
  • type conversion (i32, fp32 to/from i4, i8, i16, fp8, bf16)
  • predicated execution (branch, ...)
  • reduction across slices and data routing
• Transpose engine:
  • handles last-axis transposes
  • the Commit Unit supports transpose, split, concat, etc., except last-axis transpose
[Block diagram: Vector Engine with inter-slice router / internal dataflow manager, Reduce Unit (8-way int/fp cluster: add, min/max, spatial/temporal reduction), Arithmetic Unit (8-way int and fp clusters: add, mul, shift, logic, div, exp/log, sin/cos), conditional operation, type conversion, VRF, and 4-way output; black: intra-slice datapath, red: inter-slice reduce channel, blue: inter-slice distribute channel]
Interconnect Networks
• Four routers @ 1GHz; each cluster NoC provides up to 1TB/s
  • 256GB/s BW for both HBM and PE-to-PE
• Address Translation Unit between the PE NoC and the router:
  • the same PE core binary can be mapped to different HBM regions
  • user data protection between PEs
  • secure access: only accessible from secure firmware
  • unified abstraction via P2P for intra/inter-chip PE access
• Programmable address interleaver (see the sketch below):
  • modes: no interleave (test and simulation), 32-channel interleave (2-stack HBM), 16-channel interleave (1-stack HBM)
  • address hashing for HBM channels with a pseudo-random selection bit
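A small C sketch of channel interleaving with a pseudo-random hash bit in the spirit of the modes above; the address bit positions and the XOR folding are illustrative assumptions, not the actual RNGD hash:

#include <stdio.h>
#include <stdint.h>

typedef enum { NO_INTERLEAVE, INTERLEAVE_16CH, INTERLEAVE_32CH } ilv_mode_t;

/* Pick an HBM channel from physical-address bits, optionally XOR-folding higher
 * bits in as a pseudo-random selection to spread pathological strides.          */
static unsigned channel_of(uint64_t paddr, ilv_mode_t mode, int hash_enable) {
    unsigned ch;
    switch (mode) {
    case INTERLEAVE_32CH: ch = (paddr >> 6) & 0x1F; break;   /* 2-stack HBM  */
    case INTERLEAVE_16CH: ch = (paddr >> 6) & 0x0F; break;   /* 1-stack HBM  */
    default:              return 0;                          /* test / sim   */
    }
    if (hash_enable) {
        uint64_t high = paddr >> 12;           /* fold higher bits into the     */
        unsigned mix = 0;                      /* channel-select bits           */
        while (high) { mix ^= (unsigned)(high & 0x1F); high >>= 5; }
        ch ^= mix & ((mode == INTERLEAVE_32CH) ? 0x1F : 0x0F);
    }
    return ch;
}

int main(void) {
    for (uint64_t a = 0; a < 8 * 64; a += 64)
        printf("addr 0x%04llx -> ch %u\n", (unsigned long long)a,
               channel_of(a, INTERLEAVE_32CH, 1));
    return 0;
}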
Outline
• Introduction
• Architecture
• Implementation and Results
SoC Die
• Four PEs are abutted for minimum latency
• Security Engine, I2C, UART, GPIO, Q/SPI, JTAG, PVT sensor, margin detector, voltage droop detector
• MIM cap, DTC
Decap (µF)    VDDCCLUS0/1   VDDC   VDDQLDRAM0/1   VDDCDRAM0/1
Interposer    5.9/5.9       1.7    -              -
Package       24.5/24.5     1.1    1.3/1.3        -
On-die        6.6/2.1       -      2.5/2.5        4/4
[Die photo with package routing]
Floorplan and Clock
• Eight different physical shapes for slices
• Strong trunk clock cell; top two metal layers used for routing
• Clock-tree latency and jitter from the root to the furthest block reduced by >50% compared to a typical clock tree
• On-chip-variation mitigation
[Floorplan detail: root driver and trunk clock distribution to Slice_0..Slice_31 clocks, with segment lengths of 280-1220µm]
HBM Channel Routing Layout
Cross-section view: shielded routing stack of Metal 5 (VSS), Metal 4 (signal), Metal 3 (VSS), Metal 2 (signal), Metal 1 (VSS), with 1.3µm and 2.0µm pitches
Top view: 16 channels (Channel 0-15) routed over ~2868µm on layers M3/M4 and ~2978µm on layers M1/M2, from the PHY side through the RDL to the DRAM side; each channel maps to DWORD 0-1 and DQ 0-31
HBM Signal and Power Integrity
Signal integrity: interposer RLC extracted with 3D full-wave EM modeling
• S-parameters: return loss, insertion loss, near-end and far-end crosstalk swept over frequency
• Eye diagrams: target width > 0.5UI; Ch A 0.78UI, Ch E 0.78UI, Ch I 0.79UI, Ch M 0.76UI
Power integrity: linearized switching DC-DC power supply modeling included
• PDN Z-profile and voltage fluctuation checked against target impedance sweeps with different current profiles; Vpp < 8% target
HBM3 Power Rails   DC (mΩ)   Zpeak (mΩ)       Vpp (%)
VDDQL (0.4V)       4.39      5.71 @ 13.8MHz   3.65
VDD (0.75V)        5.78      52.0 @ 7.4MHz    7.48
Core Power Integrity
• Package bump resistance and inductance simulated; bumps merged on a 10x10 grid for simulation (u-bump, C4 bump, ball; CPM 10x10, interposer 10x10, package, PCB; probing at the die)
• Workload transient results:
  • VDDC_CLUS0: peak 307A; Vavg 685mV, Vmin 678mV; DC IR = (750-685)/750 = 8.67%; AC IR (Vmin) = (685-678)/750 = 0.93%
  • VDDC: peak 73A; Vavg 648mV, Vmin 642mV; DC IR = (750-648)/750 = 13.6%; AC IR (Vmin) = (648-642)/750 = 0.8%
PVT Sensors with Thermal Model and Heat Sink
[Floorplan map: process monitors, temperature sensors, and voltage monitors (groups A and B) distributed across the PEs, CPU/Security subsystem, NoC, HBM3 subsystem, and PCIe]
• Custom heat sink, case, and bracket for air cooling
• Infrared camera photo taken without the heat sink while running BERT
Reliability & Yield
• Extra slice for each PE
• ECC for SRAM, DRAM, NoC
• Timing margins are continuously monitored
  • Long-term aging is monitored
• Voltage droop detector
  • Fully digital and wide-bandwidth
  • Monitors localized fast supply voltage drops
  • Triggers interrupts if the voltage falls below a pre-defined threshold
• Supports encryption for secure booting
[Dashboard screenshot: per-agent timing-margin map across the die, measured at 755mV, 53°C]
Dynamic Voltage and Frequency Scaling
• Temperature and peak performance are efficiently balanced for the PE and NoC
• Total board power is controlled to <150W
Comparison
• RNGD vs. L40s (GPT-J 6B, 99% accuracy):
  • ×2.7 (=8.62/3.19) higher perf/W
  • 79% lower TDP
• RNGD vs. H100:
  • ×4.1 (=6.24/1.52) higher perf/W
  • ×1.7 peak memory BW, ×1.76 in throughput
  • 57% lower TDP
• Measured perf/power is 53% better than L40s
• Throughput per power is superior thanks to RNGD's favorable memory BW-to-TDP ratio
[Bar chart comparing RNGD, L40s, and H100; additional data labels: ×2.2, ×0.43, ×4.7, ×1.72, ×1.53]
Conclusions
• Optimized for energy-efficient AI inference
  • Built on a 5nm process for high efficiency
  • Features HBM3 for high memory bandwidth
  • Utilizes high-bandwidth NoCs for fast data movement
  • Implements slice redundancy for reliability
• Maximizing performance and efficiency
  • Leverages parallelism to enhance compute utilization
  • Exploits data locality inherent in tensor contraction
• Please visit us at the Demonstration Session!
An On-device Generative AI focused Neural
Processing Unit in 4nm flagship mobile
SoC with Fan-Out Wafer Level Package
Jun-Seok Park, Taehee Lee, Heonsoo Lee, Changsoo Park,
Youngsang Cho, Mookyung Kang, Heeseok Lee, Jinwon Kang, Taeho Jeon,
Dongwoo Lee, Yesung Kang, Kyungmok Kum, Geunwon Lee, Hongki Lee,
Minkyu Kim, Suknam Kwon, Sungbum Park, Dongkeun Kim, Chulmin Jo,
HyukJoon Chung, Ilryoung Kim, Jongyul Lee
System LSI, Samsung Electronics
© 2025 IEEE
International Solid-State Circuits Conference
16.3: An On-device Generative AI focused Neural Processing Unit in 4nm flagship mobile SoC with Fan-Out Wafer Level Package
1 of 28
Outline
Motivation
NPU architecture
Key features
I. Heterogeneous multi-engine NPU with shared memory
II. Skewness-based optimization
III. Advanced packaging & CMOS process
Measurement result & Comparison
Conclusion
Motivation
Generative AI
Text Prompt:
“A golden retriever jumping on the beach”,
epicRealism, 8 steps, pan right
Image generation & Editing
Text-to-Video Generation
3D Generation
Generative AI presents two key requirements:
A significant computational workload
High computational flexibility
Motivation
NPU on a mobile AP
[Figure: the NPU supports various applications (image classification: "Cat!", 99%; super resolution; voice recognition: "Hi, Bixby") under hard constraints on utilization & flexibility, power/thermal, and area]
NPU design challenges for generative AI:
• High performance requirements for iterative computations
• Functional flexibility to support diverse transformer architectures
• Strict thermal/power constraints in a mobile environment
Outline
Motivation
NPU architecture
Key features
I. Heterogeneous multi-engine NPU with shared memory
II. Skewness-based optimization
III. Advanced packaging & CMOS process
Measurement result & Comparison
Conclusion
Key feature I: Heterogeneous NPU with shared memory
NPU Architecture
[Block diagram: the NPU core contains two General Tensor Engines (GTE #0/#1, 8K MACs each), two Shallow Tensor Engines (STE #0/#1, 512-MAC SIMD each), NPU buffers, and a 6 MB scratchpad memory (NPUMEM, L2 scratchpad) at ~TB/s; the NPU connects through the last-level cache (100GB/s) and external memory (68GB/s), alongside an NPU controller, DMA pool, NTC, MMU, and a CPU with L1 (I/D) and GIC]
• High performance and functional flexibility requirements for generative AI
• System-wide balance of compute vs. memory resources
Key feature I: Heterogeneous NPU with shared memory
NPU Architecture: Heterogeneous NPU
[Block diagram: each GTE has L0/L1 queueing caches, sparsity and weight buffers, a pre-fetch unit, dispatch and fetch units, MAA groups (MAAG0-3) of XMAA MAC arrays with IFM registers and weight buffers, an activation unit (act & quant 0-3), a PSUM buffer, a command queue, and a vector engine; the STEs follow the same structure with shallower MAC arrays]
*GTE: General Tensor Engine, **STE: Shallow Tensor Engine
• Multi-engine-based NPU architecture for integrating more MACs
• Heterogeneous architecture to provide enough flexibility and utilization
Key feature I: Heterogeneous NPU with shared memory
Heterogeneous Tensor Engines
[Diagrams: GTE and STE pairs work on L0/L1/L2 tiles through Q-caches and shared memory (NPUMEM) down to external memory, covering convolution / depthwise-convolution layers with ReLU and convolutions with non-linear operations; utilization charts compare the GTE (input channel = 32) and STE (input channel = 8) on deep, shallow, and depthwise-conv layers, with and without scatter-gather; the GTE drops to 37.5% and 12.5% on shallow and depthwise layers, where the STE sustains higher utilization]
Key feature I: Heterogeneous NPU with shared memory
Tensor Engine vs. Vector Engine
[Diagram: L2 tiles are split into tensor layers (linear) mapped to the tensor engines and vector layers (non-linear) mapped to the vector engines; performance trace with MobileBERT shows the optimized TE-VE overlap after tiling]
* Jaewan Choi et al., “Accelerating Transformer Networks through Recomposing Softmax Layers”, arXiv, 2023.
• The computational configuration varies greatly among different neural networks (especially in Transformers)
Key feature I: Heterogeneous NPU with shared memory
Queueing Cache (Q-Cache)
[Block diagram: the fetch unit requests IFM and weight data through the L0 and L1 queueing caches backed by NPUMEM; reservation tokens and per-line reference counters (tag, data, ref.cnt) manage the cache lines consumed by the data consumer (DSU)]
• L0 Q-Cache: small capacity; simultaneous accesses to the same address
• L1 Q-Cache: large capacity; sequential accesses to the same address
• Queueing caches are a simplified type of cache memory (see the sketch below)
• The predetermined operation sequence simplifies memory management
• The reference counter indicates the number of reservations for the data
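A minimal C sketch of the reservation/reference-count idea behind the queueing cache: because the access sequence is known in advance, the requester books all future readers of a line up front, and the line frees itself when the count reaches zero. Structure and sizes are illustrative, not the Exynos design:

#include <stdio.h>
#include <stdint.h>

enum { LINES = 4 };

typedef struct {
    uint64_t tag;        /* which address the line holds            */
    int      valid;
    int      ref_cnt;    /* outstanding reservations for this data  */
} qline_t;

static qline_t cache[LINES];

/* Reserve: the predetermined operation sequence tells the requester how many
 * consumers will read this line, so all of them are booked at once.           */
static int qcache_reserve(uint64_t tag, int readers) {
    for (int i = 0; i < LINES; i++)
        if (cache[i].valid && cache[i].tag == tag) { cache[i].ref_cnt += readers; return i; }
    for (int i = 0; i < LINES; i++)
        if (!cache[i].valid) {                     /* simple fill on miss             */
            cache[i] = (qline_t){ .tag = tag, .valid = 1, .ref_cnt = readers };
            return i;
        }
    return -1;                                     /* no free line: requester stalls  */
}

/* Consume: each read decrements the counter; the line frees itself at zero,
 * so no replacement policy or explicit invalidation is needed.               */
static void qcache_consume(int line) {
    if (line >= 0 && cache[line].valid && --cache[line].ref_cnt == 0)
        cache[line].valid = 0;
}

int main(void) {
    int l = qcache_reserve(0x40, 2);   /* two consumers will read address 0x40 */
    qcache_consume(l);
    qcache_consume(l);                 /* second read releases the line        */
    printf("line %d valid after both reads: %d\n", l, cache[l].valid);
    return 0;
}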
Key feature I: Heterogeneous NPU with shared memory
Q-Cache & Pre-fetch
The fetch unit and the pre-fetch unit walk the input feature map with two nested loop orders (one per unit):
  for c in channel
    for h in height/ker_height
      for w in width
        for r in ker_height
  for c in channel
    for h in height
      for w in width
        for s in ker_width
          for r in ker_height
[Diagram: per-clock (row, column) access patterns into the L0/L1 Q-caches, marking misses, hits, and hits-if-not-evicted, with and without the pre-fetch unit]
The pre-fetch unit preloads data into the L1 Q-cache to minimize the impact of the initial cold misses and reduce latency without complex task scheduling
Outline
Motivation
NPU architecture
Key features
I. Heterogeneous multi-engine NPU with shared memory
II. Skewness-based optimization
III. Advanced packaging & CMOS process
Measurement result & Comparison
Conclusion
Key feature II: Skewness-based optimization
Skewness & Reuse factor
[Figure: equal-sized vs. different-sized matrix multiplications with short and long input channel length C1; example reuse factors of 2 and 4 at skewness 1, and 1.75 and 1.5 at skewness 7 and 3]
$\mathrm{Skewness}(\alpha) = \dfrac{\text{size of larger matrix}}{\text{size of smaller matrix}}$
$\text{Reuse factor} = \dfrac{\#\text{ of compute}}{\mathrm{MatSize}}$, where $\mathrm{MatSize} = C_1(M_1 + X_1)$ is the combined size of the two input matrices
$\#\text{ of Compute} = C_1 M_1 X_1 = \dfrac{\alpha\,\mathrm{MatSize}^2}{(1+\alpha)^2 C_1}$, hence $\text{Reuse factor} = \dfrac{\alpha\,\mathrm{MatSize}}{(1+\alpha)^2 C_1}$
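A quick numerical check of the compute-count identity above, with values chosen only for illustration: for $C_1 = 8$ and $M_1 = X_1 = 4$ (so $\alpha = 1$), $\mathrm{MatSize} = C_1(M_1 + X_1) = 64$ and $\dfrac{\alpha\,\mathrm{MatSize}^2}{(1+\alpha)^2 C_1} = \dfrac{64^2}{4 \cdot 8} = 128 = C_1 M_1 X_1$.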
Key feature II: Skewness-based optimization
HW Optimization using Skewness
[Plot: minimum input channel length vs. skewness, separating the memory-bound and compute-bound regions; minimum skewness = 1]
The design is compute bound when $T_{DMA} \le T_{Compute}$, i.e. $\dfrac{\mathrm{MatSize}_{in}}{\mathrm{MemBW}} \le \dfrac{\#\text{ of Compute}}{\#\text{ of MACs}}$
Substituting $\#\text{ of Compute} = \dfrac{\alpha\,\mathrm{MatSize}^2}{(1+\alpha)^2 C_1}$ gives $\dfrac{(1+\alpha)^2}{\alpha} \cdot \dfrac{\#\text{ of MACs}}{\mathrm{MemBW}} \le \dfrac{\mathrm{MatSize}_{in}}{C_1}$
(# of MACs and MemBW are hardware spec; $\alpha$, $\mathrm{MatSize}_{in}$, and $C_1$ are network characteristics)
Key feature II: Skewness-based optimization
Tiling using Skewness Curve
[Figures: tiling procedure using the skewness curve, steps #0 through #5]
Outline
Motivation
NPU architecture
Key features
I. Heterogeneous multi-engine NPU with shared memory
II. Skewness-based optimization
III. Advanced packaging & CMOS process
Measurement result & Comparison
Conclusion
Key feature III: Advanced packaging & CMOS process
Clock frequency drop at high temperature
[Plot: NPU clock frequency (MHz) and junction temperature (°C) vs. time (s); the clock drops from 1200 MHz as the junction temperature rises toward its limit, since the device surface must stay below 40°C to avoid skin burn]
*DT Team, “Design Methodologies for Advanced Mobile SOCs”, Samsung Foundry
• DTM (dynamic thermal management) adjusts the clock frequency to prevent the NPU from overheating
• Since overheating leads to system instability or long-term damage, DTM throttles the NPU's performance by reducing the clock frequency
• To prevent the temperature from rising quickly, FOWLP is employed
Key feature III: Advanced packaging & CMOS process
Fan-Out Wafer Level Package (FOWLP)
[Cross-sections: Exynos 2200 with I-PoP (solder balls, DRAM stacked on top) vs. Exynos 2400 with FOWLP (Cu posts, DRAM stacked on top)]
• The fan-out wafer-level package uses Cu posts, which are taller than the solder balls used in I-PoP
• Due to the tall Cu posts, the die thickness can nearly double: from 110µm to 215µm
• A thicker die improves thermal resistance by offering better heat-spreading capability, potentially enhancing the thermal management of the package
• FOWLP reduces thermal resistance by 16%, from 16.52°C/W to 13.83°C/W (see the check below)
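The quoted 16% follows directly from the two numbers: $(16.52 - 13.83)/16.52 \approx 0.16$, i.e. a 16% reduction in package thermal resistance.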
Key feature III: Advanced packaging & CMOS process
Thermal throttling reduction by FOWLP
• FOWLP reduces thermal throttling by improving heat dissipation
• The NPU with FOWLP lowers its clock frequency less than with I-PoP
• FOWLP prevents overheating
• NPU performance is improved with less thermal throttling
Key feature III: Advanced packaging & CMOS process
RO AC Performance Improvement
[Plots: ring-oscillator frequency (normalized) vs. operating voltage (mV), and operating frequency (MHz) vs. operating voltage, for Exynos 2200 (4nm 1st gen) and Exynos 2400 (4nm 3rd gen)]
• The 3rd-gen 4nm process improves RO AC performance by 11% compared to the 1st gen
• Reduction in Ceff and Reff is achieved with source/drain engineering, middle-of-line (MOL) resistance reduction, and replacement metal gate (RMG) optimization
• The NPU can achieve a higher operating frequency at the same voltage
Outline
Motivation
NPU architecture
Key features
I. Heterogeneous multi-engine NPU with shared memory
II. Skewness-based optimization
III. Advanced packaging & CMOS process
Measurement & Comparison
Conclusion
Measurement & Comparison
Measurement Results
[Charts: FPS/W for EdgeTPU (single), MobileDet, Mosaic, EDSR, and LVM UNet on Exynos 2200 (0.74V/0.55V) vs. Exynos 2400 (0.83V/0.63V); inference throughput on Exynos 2400 at 0.83V of 6246.8 (MobileNetEdgeTPU, offline), 2217.0 (MobileDet), 1429.2 (Mosaic), 317.1 (MobileBERT), 140.3 (EDSR), and 8.3 (LVM UNet) inferences/s, vs. 3433, 934, 540, and 20 on Exynos 2200]
• The NPU in Exynos 2400 successfully supports LVM (diffusion model) and LLM workloads
• The inference throughput of the NPU at 1200 MHz was measured as 1429 and 6246 inferences/s for Mosaic and MobileNetEdgeTPU, respectively
• The NPU achieves 1.81x~2.65x higher performance for CNN models
• The NPU shows 317 FPS with MobileBERT (Transformer-based network)
Measurement & Comparison
Comparison
ISSCC 2021 [13]
ISSCC 2020 [14]
7
19.6
0.55 – 0.75
7
3.04
0.575 - 0.825
332 - 1196
1,000 – 1,600
290 - 880
2,048
3,072
8,192
2,176
INT 8,16 / FP16
INT 4,8, 16 / FP16
INT4, 2 / FP8, 16, 32
8, 16
17,408 (INT8)
8,192(INT8)
8, 16
6,144 ( = 2,048 / core)
Peak Performance
(TOPS)
41.8 (8b)
20.9 TFLOPS (FP16)
39.3 (4b) ,19.7 (8b)
9.8 TFLOPS (FP16)
39.3 (4b) ,19.7 (8b)
9.8 TFLOPS (FP16)
Power (mW)
730 - 6177
14.7 (8b) @ no Skip
29.4 (8b) @ max Skip
327 @0.6V
794 @0.9V
Inferences/second
MobileNetTPU(INT8):
6246.8@0.83V
Area-efficiency
(Peak TOPS/mm2)
INT8: 3.48
FP16:1.74
Process (nm)
Area (mm2)
Supply Voltage (V)
Working Frequency
(MHz)
On-Chip Data
Memory (kB)
Bit Precision
Multiplier Number
This WORK
4
ISSCC 2021 [12]
12
0.55 – 0.83
ISSCC 2022 [11]
4
4.74
0.55 - 1.0
5
5.46
0.55 - 0.9
533 - 1196
332 - 1196
6,144 (NPUMEM)
MobileNetTPU(INT8): 381-5114
DeepLabV3(FP16): 393-5133
MobileNetTPU(INT8): 3433@1.0V
InceptionV3: 622.7@0.9V
DeepLabV3(FP16): 211@1.0V
INT4:6.90
2.69
INT8:3.45
FP16:1.72
-
-
3.6 (8b)
173@0.575V
1053 @0.825V
-
-
INT4:5.33,
INT8:1.33
FP16:0.66
1.184
NPU achieves an area efficiency of 3.48 TOPS/mm2
Increasing the internal buffer size from 2MB to 6MB to enhance the data reuse
Sharing weight buffer across the MAAs in spatial direction and the optimized MAC design
Measurement & Comparison
Die Photo
Process
3rd gen 4nm CMOS
technology (Samsung)
Area
12mm2
Voltage
0.55-to-0.83V
Frequency
533-to-1196-MHz
Best Peak
Performance
6246.8 inferences/s @ 0.83V
(MobileNetTPU@MLPerf)
Area-Efficiency
3.48 TOPS/mm2
[Die photo: GTE #0, GTE #1, STE #0, STE #1, NPUMEM, NPU Controller]
Outline
Motivation
NPU architecture
Key features
I. Heterogeneous multi-engine NPU with shared memory
II. Skewness-based optimization
III. Advanced packaging & CMOS process
Measurement result & Comparison
Conclusion
Conclusion
Conclusion
Challenges of Generative AI
High performance requirements for iterative computations
Functional flexibility to support diverse transformer architectures
Strict thermal/power constraints on mobile environment
Solutions
Heterogeneous multi-engine NPU with shared memory
NPU HW/SW optimization based on Skewness
System level optimization leveraging advanced packaging and
CMOS process
Thank you for your attention
SambaNova SN40L:
A 5nm 2.5D Dataflow Accelerator
with Three Memory Tiers
for Trillion Parameter AI
Raghu Prabhakar, Junwei Zhou, Darshan Gandhi, Youngmoon Choi,
Mahmood Khayatzadeh, Kyunglok Kim, Uma Durairajan,
Jeongha Park, Satyajit Sarkar, Jinuk Luke Shin
SambaNova Systems, Inc
© 2025 IEEE
International Solid-State Circuits Conference
16.4: SambaNova SN40L: A 5nm 2.5D Dataflow Accelerator with Three Memory Tiers for Trillion Parameter AI
1 of 24
Outline
SN40L Chip and System Overview
LLM Inference Optimizations
Agentic AI requirements and benefits of SN40L
Energy Efficiency and Power Management
Closing Remarks
Overview of new language-optimized SN40L
“Cerulean” Architecture-based Reconfigurable Dataflow Unit (RDU)
5nm TSMC, 102B transistors
1,040 RDU cores, 638 TFLOPS (BF16)
3-tier dataflow memory: 520 MB on-chip memory, 64 GB high-bandwidth memory, 1.5 TB high-capacity memory
Generative AI training and inference
3-Tier Memory System with SRAM, HBM and DDR
On-chip SRAM [520 MB, >100 TB/s]: high-throughput inference with caching; dataflow enabled by the large on-chip memory
1.6 TB/s to RDU high-bandwidth memory [64 GB]
100 GB/s to RDU high-capacity DDR memory [up to 1.5 TB]: low-latency model switching
SN40L: Chip Overview
[Chip diagram: two groups of RDU cores connected die-to-die, surrounded by the Top Level Network with P2P, PCIe, HBM, and DDR interfaces]
RDU core consists of programmable compute and memory units with a meshed network
Top Level Network (TLN) provides a bridge to off-chip communication: P2P, HBM, DDR, and PCIe
Die-to-die connectivity enhances core-to-core communication
Core Architecture
1040 distributed memory and compute units
[Diagram: a mesh of switches (S) interleaving Pattern Memory Units (PMU) and Pattern Compute Units (PCU), with Address Generation and Coalescing Units (AGCU) on the edges]
PCU: Pattern Compute Unit / PMU: Pattern Memory Unit
S: mesh network switches / AGCU: Address Generation and Coalescing Unit
Pattern Compute Unit (PCU)
[Block diagram: vector/scalar input FIFOs, counters, and control inputs feed a HEADER stage, a SIMD body with register stages and a broadcast buffer, and a TAIL stage driving vector/scalar output FIFOs and control outputs, under a control block]
HEADER: consumes and organizes dataflow packets from the PMU and switch
BODY: systolic-array SIMD core with a cross-lane reduction tree; 8-bit/16-bit/32-bit operations
TAIL/Output: special element-wise operations; exports packets to the PMU and switch
Pattern Memory Unit (PMU)
[Block diagram: vector/scalar input FIFOs and counters feed write-data alignment, a fragmentable scalar ALU pipeline for address generation and predication, banked scratchpad memory with concurrent read/write ports, read-data alignment, scalar SRAM with a scalar ALU, and vector/scalar output FIFOs under a control block]
ALU pipeline: complex address creation; tensor access patterns
Scratchpad: programmer-managed memory; concurrent reads and writes
Data aligner: transposes, permute, LUT, layout conversion, format conversion
Switch Network and AGCU
Mesh based Switch Interconnect
Vector network: carries data, packet switched
Scalar network: carries addresses, other metadata, packet switched
Control network: carries flow control, synchronization and graph orchestration tokens
Hardware 2-D dimension order routing (DOR) with software override
Address Generation and Coalescing Unit (AGCU)
Memory address generation to access DDR, HBM and Host
Peer-to-Peer communication between RDUs for collective communications
Graph control interface to load and orchestrate graph execution without host involvement
Segment Lookaside Buffer for virtual-physical address translation and memory access
management
Top Level Network (TLN)
Interconnects RDU Cores with DDR, HBM, PCIe and Die-to-Die interfaces
Four independent networks operating in parallel: Request, Data, Response, and Credit
Hybrid Mesh/Ring packet-switched interconnect
Y-X DOR packet routing
End-to-end credits to avoid deadlocks
[TLN diagram: RDU Cores bridged to PCIe, DDR, HBM and D2D interfaces]
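The end-to-end credit point can be pictured with a small sketch of credit-based flow control: a source injects a packet only while it holds a credit for the destination's buffer, and the credit comes back when the packet drains. This is a generic illustration, not the TLN's actual protocol; the class name and credit counts are assumptions.

```python
# Sketch of end-to-end credit-based flow control: a source may inject a packet
# toward a destination only while it holds credits for that destination's buffer,
# so the fabric can never fill up with packets that have nowhere to land.
# Credit counts and the queue model are illustrative assumptions.

from collections import deque

class CreditedLink:
    def __init__(self, credits):
        self.credits = credits          # free slots at the destination
        self.in_flight = deque()        # packets accepted into the network

    def try_send(self, pkt):
        if self.credits == 0:
            return False                # back-pressure at the source, not in the fabric
        self.credits -= 1
        self.in_flight.append(pkt)
        return True

    def drain_one(self):
        """Destination consumes a packet and returns a credit on the Credit network."""
        if self.in_flight:
            self.in_flight.popleft()
            self.credits += 1

link = CreditedLink(credits=2)
print([link.try_send(i) for i in range(3)])  # [True, True, False]: third send stalls at the source
link.drain_one()
print(link.try_send(2))                      # True once a credit has been returned
```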
SN40L XRDU PCB and SN40L-16 Rack
SN40L-16 Rack: standard 19 in, 42RU form factor; one node of 16 SN40L chips
XRDU board with 2 SN40L chips
[Board photo: SN40L packages, RDIMMs, BMC, switch, NIC, P2P links, fans]
[Rack photos, front and rear: host server, XRDUs, switches, PSUs, PDUs]
Unlocking Operator Fusion Potential
SRAM Bandwidth: RDU/GPU ~ 10x
SRAM Capacity: RDU/GPU ~ 10x
Flexible Programming: tensor access patterns and manipulations
RDU based: 1 kernel = 100s-1000s of operators
GPU based: 1 kernel = 1-5 operators
Executing Transformer on RDU
Example: Llama 3.1 8B
[Timeline diagram: Embedding, Decoder 0 ... Decoder 31, Classifier, Sampling, annotated with kernel launch overheads, weight loads and compute/sync times (microseconds)]
Two Key Optimizations:
Spatial Fusion: Captures data locality
Kernel looping: Overlaps compute and communication, eliminates overheads
Spatial Fusion: 1 decoder = 1 kernel on SN40L RDU
Example: Llama 3.1 8B
[Mapping diagram: graph sections section_0() ... section_3() covering Embedding, Decoder 0-31 (one section_1() call per decoder), Classifier and Sampling]
[Decoder dataflow graph with the fused operators: RMSNorm, Q/K/V GEMMs (Wq, Wk, Wv), QK matmul, mask fill, Softmax, PV matmul, O GEMM (Wo), AllReduce, RMSNorm, Gate/Up GEMMs (Wgate, Wup), SiLU, Mul, Down GEMM (Wdown), Add, AllReduce; input xd-1, output xd]
High operator fusion, one kernel call for all decoders:
Zero kernel launch overheads
High data locality
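A schematic way to read this slide (pseudo-Python, not SambaNova's compiler output): with spatial fusion the whole decoder graph becomes one resident kernel, so the host issues one call per decoder instead of one call per operator. The launch-count comparison below uses the operator list from the decoder graph; the function names are illustrative assumptions.

```python
# Schematic contrast between per-operator kernel launches and a spatially fused
# decoder kernel. The operator list mirrors the decoder graph on this slide;
# all names are illustrative assumptions.

DECODER_OPS = [
    "rmsnorm", "q_gemm", "k_gemm", "v_gemm", "qk_matmul", "mask_fill", "softmax",
    "pv_matmul", "o_gemm", "all_reduce", "rmsnorm", "gate_gemm", "up_gemm",
    "silu", "mul", "down_gemm", "add", "all_reduce",
]

def launches_unfused(num_decoders):
    """GPU-style execution: one kernel launch per operator."""
    return num_decoders * len(DECODER_OPS)

def launches_spatially_fused(num_decoders):
    """RDU-style execution: the whole decoder is one resident dataflow kernel."""
    return num_decoders * 1

print(launches_unfused(32), "launches vs", launches_spatially_fused(32), "launches")  # 576 vs 32
```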
Executing Transformer on RDU
Example: Llama 3.1 8B
Baseline: 100 tokens/s
+ Single-kernel decoder: 500 tokens/s
+ Kernel looping: 1115 tokens/s
[Timeline diagrams: section calls sec_0() ... sec_3() across Decoder 0-31 for each configuration]
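A minimal sketch of the kernel-looping idea (illustrative, not the RDU runtime): the decoder kernel stays resident and loops over layers, prefetching the next layer's weights while the current layer computes, so weight-load time hides behind compute instead of adding to it. The per-layer timing numbers below are made up purely to show the effect of the overlap.

```python
# Sketch of kernel looping: one resident kernel loops over decoder layers and
# overlaps each layer's weight load with the previous layer's compute.
# The fixed per-layer costs are assumptions used only to show why overlap helps.

def time_per_token_us(layers, compute_us, weight_load_us, launch_us, kernel_looping):
    if not kernel_looping:
        # Each layer pays launch + weight load + compute serially.
        return layers * (launch_us + weight_load_us + compute_us)
    # One launch total; each weight load overlaps the previous layer's compute.
    overlapped = sum(max(compute_us, weight_load_us) for _ in range(layers - 1))
    return launch_us + weight_load_us + overlapped + compute_us

args = dict(layers=32, compute_us=25.0, weight_load_us=20.0, launch_us=10.0)
print(time_per_token_us(kernel_looping=False, **args))  # serial: 1760 us per token
print(time_per_token_us(kernel_looping=True,  **args))  # overlapped: 830 us per token
```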
Large Language Model Family Performance
Highest tokens/sec in a single rack with 16 SN40L chips
Model                Tokens/second
1B Llama 3.2         2500
3B Llama 3.2         1500
8B Llama 3.1         1115
32B Qwen 2.5          317
32B QwQ               311
70B Llama 3.1         580
72B Qwen 2.5          226
405B Llama 3.1        200
671B DeepSeek-R1      198
[Chart: close to ideal throughput scaling with batch size on Llama 3.1 70B]
Agentic AI
Multiple AI models work collaboratively to accomplish complex tasks
Most advanced LLM systems constitute agentic workflows: Composition of Experts, Speculative Decoding
An architecture for agentic AI needs to excel in: RAW SPEED, MODEL VARIETY, MODEL SWAPPING
Agentic AI Advantage on SN40L
Host and serve trillions of parameters at a fraction of the cost of GPUs
3.7x faster vs. DGX H100
19x smaller machine footprint
15x faster model switching
SN40L-16 Energy Efficiency and Power Management
Dataflow Architecture / 3-Tier Memory Architecture
Fused operation minimizes data movement
Eliminates core-based Instruction Set Architecture (ISA) overhead
65°C at 22°C ambient
Consolidates 10s-100s of models with high-speed model switching in a single system
Advanced Power Management
Multiple control loops to effectively manage thermal, electrical and L·di/dt challenges
Sub-µs power monitoring and control to handle highly oscillating workloads
Continuous monitoring of temperature and voltage to ensure reliable operation
Full TFLOP utilization in an air-cooled system!
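One way to picture how a fast power control loop fits in (a hedged sketch, not the SN40L controller): read an estimated power sample, compare it against a cap, and nudge a throttle so an oscillating dataflow workload stays inside the electrical and thermal envelope. The gain, cap and synthetic workload below are assumptions, not SN40L parameters.

```python
# Toy sketch of a fast power-management loop: sample estimated power, compare
# against the cap, and adjust a throttle factor. Gains, the cap, and the
# synthetic workload are illustrative assumptions.

def power_cap_loop(samples_w, cap_w, kp=0.002):
    throttle = 1.0                       # 1.0 = full speed
    history = []
    for raw_w in samples_w:
        measured = raw_w * throttle      # throttling scales the drawn power
        error = measured - cap_w
        if error > 0:
            throttle = max(0.1, throttle - kp * error)   # back off when over the cap
        else:
            throttle = min(1.0, throttle - kp * error)   # recover headroom when under
        history.append((measured, throttle))
    return history

# A bursty, oscillating workload around a 500 W cap.
workload = [300, 450, 700, 900, 650, 400, 800, 820, 500]
for measured, throttle in power_cap_loop(workload, cap_w=500):
    print(f"power={measured:6.1f} W  throttle={throttle:.3f}")
```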
High-Speed Power Estimation Unit (PEU)
[PEU block diagram: per-tile power estimators (xPE: SPE/MPE/CPE attached to Switch/PMU/PCU units) feed a Data Monitor, Activity Monitor and Power Calculator; a chip-level Power Accumulator (CPA) and CPA Activator aggregate the results]
Monitors activity and data toggling of ~3500 compute, memory and network components in the sub-µs range
Monitors op-code, clock enable, zero data and data toggling, with real-time voltage and frequency synchronization, for accurate power estimation
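The per-component estimation can be pictured with the usual activity-based dynamic power model, P_dyn ≈ α·C·V²·f plus leakage, driven by monitored toggle and clock-enable counts synchronized with the current voltage and frequency. The sketch below is in that spirit; the capacitance and leakage coefficients and the component names are placeholder assumptions, not calibrated SN40L values.

```python
# Sketch of activity-based power estimation in the spirit of the PEU: per-component
# toggle and clock-enable counts, synchronized with the current voltage and
# frequency, are folded into P ~ alpha*C*V^2*f + leakage, then accumulated into a
# chip-level figure (as a CPA-like accumulator would). Coefficients are assumptions.

def component_power(toggles, clocked_cycles, window_cycles, c_eff_f, leak_w, v, f_hz):
    alpha = toggles / max(window_cycles, 1)          # switching activity factor
    gating = clocked_cycles / max(window_cycles, 1)  # clock-enable duty cycle
    dynamic = alpha * gating * c_eff_f * v * v * f_hz
    return dynamic + leak_w

def chip_power(samples, v, f_hz, window_cycles=1000):
    """Accumulate per-component estimates into a chip-level figure."""
    return sum(component_power(t, ce, window_cycles, c, l, v, f_hz)
               for (t, ce, c, l) in samples)

# (toggles, clocked_cycles, effective capacitance in F, leakage in W) per component
samples = [(800, 950, 2.0e-10, 0.05),   # a busy compute-like unit
           (200, 400, 1.5e-10, 0.04),   # a lightly used memory-like unit
           ( 50, 100, 0.5e-10, 0.01)]   # a mostly idle switch
print(f"{chip_power(samples, v=0.75, f_hz=1.0e9):.3f} W for this tile group (illustrative)")
```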
Accuracy Correlation of PEU
[Plots: slow time-scale correlation (msec) and fast time-scale correlation (µsec)]
The PEU accurately correlates with VRM power on the msec timescale and creates a perfect mirror image of the voltage behavior on the µsec timescale
Batched Inference Performance for Low-power Datacenters
[Charts annotated with 3% and 27%]
Llama 3.1 405B batched inference performance and power comparison between unconstrained-power and power-constrained rack scenarios
Token generation performance degradation is minimized with the PEU for power-constrained datacenters
Performance and Energy Efficiency Comparison
[Charts: performance and energy efficiency comparison]
VIT = Vision Transformer
COE 150 8B = Composition of 150 Llama 3.1 8B experts
Summary
SN40L presents a completely reimagined chip architecture targeting the generative and agentic AI era.
Unique dataflow features enable optimizations that are beyond the reach of traditional accelerators, delivering state-of-the-art performance and energy efficiency.
Accurate, highly responsive power and thermal monitoring enables the best power/performance tradeoffs in datacenters across the world.