Pushing the Limits of Accelerator Efficiency
While Retaining Programmability
Tony Nowatzki*, Vinay Gangadhar*, Karu Sankaralingam*, Greg Wright+
*Vertical Research Group
University of Wisconsin – Madison
+Qualcomm
Executive Summary
• 5 common principles of architectural specialization
• A programmable architecture (LSSD) embodying the specialization principles
• LSSD compared to a single domain-specific accelerator (DSA):
  - Performance: matches DSA
  - Area: overhead of at most 4x
  - Power: overhead of at most 4x
• LSSD power overhead is inconsequential given system-level energy efficiency tradeoffs
Outline
• Introduction and Motivation
• Principles of architectural specialization
  - Embodiment of principles in DSAs
• Architecture for programmable specialization (LSSD)
• Evaluation of LSSD with 4 DSAs (performance, power & area)
• System-level energy efficiency tradeoffs with LSSD and DSA
Era of Specialization
[Figure: a traditional multicore (cores with caches) next to DSAs produced by application domain specialization: Reg Expr., Scan, Deep Neural, AI, Graph Traversal, Neural Approx., Linear Algebra, Stencil, Sort]
• Obtaining performance and/or energy gains from multicore chips is challenging
• Specialization of application domains with custom hardware units
Domain Specific Acceleration
• Domain Specific Accelerators (DSAs):
  + High efficiency: 10 – 100x in performance/power or performance/area
  - No generality: not general-purpose programmable
  - Obsoletion prone
Our Goal:
Programmable Specialization
Specialization benefits of DSAs in a
Programmable Architecture
Programmable architecture
matching the efficiency of DSAs
Key Insight: Commonality in DSAs’ Specialization Principles
[Figure: host system (cores and cache) alongside DSAs: Reg Expr., AI, Scan, Graph Traversal, Stencil, Deep Neural, Neural Approx., Linear Algebra, Sort]
Most DSAs employ 5 common Specialization Principles:
Concurrency, Computation, Communication, Data Reuse, Coordination
Solution: Architecture for Programmable Specialization
Idea 1: Specialization principles can be exploited in a general way
Idea 2: Composition of known uArch. mechanisms embodying the specialization principles
Programmable Architecture (LSSD): Low power core, Spatial fabric, Scratchpad, DMA
LSSD as a programmable hardware template to map one or many application domains
[Figure, not to scale: a domain-provisioned LSSD and a balanced LSSD; example domains shown: Deep Neural, Stencil, Sort, Scan, AI]
Principles of Architectural Specialization
• Concurrency: match hardware concurrency to that of the algorithm
• Computation: problem-specific computation units
• Communication: explicit communication as opposed to implicit communication
• Data Reuse: customized structures for data reuse
• Coordination: hardware coordination using simple low-power control logic
5 Specialization Principles
Concurrency, Computation, Communication, Data Reuse, Coordination
How do DSAs embody these principles in a domain-specific way?
[Figure: example DSAs by domain: Neural Approx. (NPU), Stencil (Convolution Engine), Deep Neural (DianNao), Database (Q100); other domains shown: Reg Expr., Scan, AI, Graph Traversal, Linear Algebra, Sort]
Principles in DSAs
NPU – Neural Proc. Unit
[Figure: high-level organization: general-purpose processor, In Fifo, Out Fifo, Bus Sched and eight PEs; each Processing Engine contains Weight Buf., Fifo, Controller, Mult-Add, Acc Reg., Sigmoid and Out Buf.; components are annotated with the five principles]
• Concurrency: match hardware concurrency to that of the algorithm
• Computation: problem-specific computation units
• Communication: explicit communication as opposed to implicit communication
• Data Reuse: customized structures for data reuse
• Coordination: hardware coordination using simple low-power control logic
Principles in DSAs
Most DSAs employ 5 common Specialization Principles
[Figure: high-level organization and processing units of the DSAs, annotated with Concurrency, Computation, Communication, Data Reuse, Coordination]
Implementation of Principles in a General Way
Composition of simple micro-architectural mechanisms
• Concurrency: Multiple tiles (Tile – hardware for coarse grain unit of work)
• Computation: Special FUs in spatial fabric
• Communication: Dataflow + spatial fabric
• Data Reuse: Scratchpad (SRAMs)
• Coordination: Low power simple core
[Figure: each tile, annotated with Concurrency, Computation, Communication, Data Reuse, Coordination]
LSSD Programmable Architecture
[Figure: replicated LSSD tiles; each tile connects to memory through a D$ and a DMA engine and contains a scratchpad, a low-power core (LX3), and a spatial fabric of FUs and switches (S – Switch) between an input interface and an output interface]
Low power core | Spatial fabric | Scratchpad | DMA → LSSD
Concurrency, Computation, Communication, Data Reuse, Coordination
Instantiating LSSD
Programmable hardware template for specialization
LSSD provisioned for one single application domain:
• Neural Approx.: LSSDN
• Stencil: LSSDC
• Deep Neural: LSSDD
• Database: LSSDQ
LSSD provisioned for multiple application domains (Deep Neural, Stencil, Neural Approx., Database): LSSDBalanced, or LSSDB
Design point selection, synthesis & programming: more details in the paper
*Figures not to scale
Methodology
• Modeling framework for LSSD
  - Perf: Trace driven simulator + application specific modeling
  - Power & Area: Synthesized modules, CACTI and McPAT
• Compared to four DSAs (published perf., area & power)
• Four parameterized LSSDs:
  - LSSDN (1 Tile) vs. NPU
  - LSSDC (1 Tile) vs. Conv.
  - LSSDD (8 Tiles) vs. DianNao
  - LSSDQ (4 Tiles) vs. Q100
• One combined balanced LSSD:
  - LSSDB (8 Tiles) vs. NPU, Conv., DianNao, Q100
• Provisioned to match performance of DSAs
  - Other tradeoffs possible (power, area, energy etc.)
Performance Analysis (1)
LSSDN vs. NPU
[Figure: speedup per benchmark (with neural network topology): fft (1-4-4-2), inversek2j (2-8-2), jmeint (18-32-8-2), jpeg (64-16-64), kmeans (6-8-4-1), sobel (9-8-1), and geometric mean. Series: NPU (DSA), LP Core + Sig. (+comp.), SIMD (+concur.), Spatial (+comm.), LSSDN (+reuse.)]
Baseline – 4 wide OOO core (Intel 3770K)
Performance Analysis (2)
Domain Provisioned LSSDs
[Figure: LSSDC vs. Conv. (1 Tile): speedup for IME, DOG, EXTR., FME, and geometric mean. Series: Conv. (domain-accel), LP core + FUs (+comp.), SIMD (+concur.), Spatial (+comm.), LSSDC (+reuse.)]
[Figure: LSSDD vs. DianNao (8 Tiles): speedup for conv1, pool1, class1, conv2, conv3, pool3, class3, conv4, conv5, pool5, and GeoMean. Series: DianNao (domain-accel), LP core + Sig. (+comp.), 8-Tile (+concur.), SIMD (+concur.), Spatial (+comm.), LSSDD (+reuse.)]
[Figure: LSSDQ vs. Q100 (4 Tiles): speedup for queries q1 through q17, and GM. Series: Q100 (domain-accel), LP core + SFUs (+comp.), 4-Tile (+concur.), SIMD (+concur.), LSSDQ (+comm.)]
• Performance: LSSD able to match DSA
• Main contributor to speedup: Concurrency
Baseline – 4 wide OOO core (Intel 3770K)
Domain Provisioned LSSDs
How do LSSD area & power compare to a single DSA?
Area Analysis
Domain Provisioned LSSDs
[Figure: LSSD area normalized to the corresponding DSA; bar labels: LSSDN 1.2x, LSSDC 1.7x, LSSDD 3.8x, LSSDQ 0.5x]
Domain provisioned LSSD overhead: 1x – 4x worse in Area
*Detailed area breakdown in paper
Power Analysis
Domain Provisioned LSSDs
[Figure: LSSD power normalized to the corresponding DSA; bar labels: LSSDN 2x, LSSDC 3.6x, LSSDD 4.1x, LSSDQ 0.6x]
Domain provisioned LSSD overhead: 2x – 4x worse in Power
*Detailed power breakdown in paper
Balanced LSSD design
What are the area and power of the LSSDBalanced design when multiple domains are mapped?
LSSDBalanced Analysis
[Figure: area and power of LSSDBalanced normalized to multiple DSAs; bar labels: 0.6x (Normalized Area), 2.5x (Normalized Power)]
Balanced LSSD design overheads
• 0.6x the area of multiple DSAs (more area efficient)
• 2.5x worse in Power than multiple DSAs
Does LSSD’s power overhead of 2x - 4x matter in a system with an accelerator?
In what scenarios would you want to build a DSA over LSSD?
Energy Efficiency Tradeoffs
System with accelerator
[Figure: OOO core with caches and an accelerator (LSSD or DSA), connected to memory over the system bus]
• Pcore: core power (5W)
• Psys: system power (5W)
• Pacc: accelerator power (0.1 – 5W)
• t: execution time
• S: accelerator’s speedup
• U: accelerator utilization
Overall energy of the computation executed on the system:
E = Pacc * (U/S) * t + Psys * (1 – U + U/S) * t + Pcore * (1 – U) * t
(Accel. energy + System energy + Core energy)
*Power numbers are example representation
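The energy model on this slide can be written as a few lines of C. This is only a sketch: the names `power_w`, `system_energy` and `efficiency_gain` are illustrative and do not come from the original work, the baseline is assumed to be the same formula with U = 0 (everything runs on the OOO core), and the wattages used in the example are the slide's example values.

```c
#include <assert.h>

/* Power values in watts. Struct and field names are hypothetical. */
typedef struct {
    double p_core; /* OOO core power (Pcore) */
    double p_sys;  /* rest-of-system power (Psys) */
    double p_acc;  /* accelerator power (Pacc) */
} power_w;

/* E = Pacc*(U/S)*t + Psys*(1 - U + U/S)*t + Pcore*(1 - U)*t
 * u: accelerator utilization in [0, 1], s: accelerator speedup (> 0),
 * t: execution time in seconds. Returns energy in joules. */
double system_energy(power_w p, double u, double s, double t)
{
    assert(s > 0.0 && u >= 0.0 && u <= 1.0);
    double acc_energy  = p.p_acc  * (u / s) * t;
    double sys_energy  = p.p_sys  * (1.0 - u + u / s) * t;
    double core_energy = p.p_core * (1.0 - u) * t;
    return acc_energy + sys_energy + core_energy;
}

/* Energy-efficiency gain relative to running everything on the core
 * (assumed baseline: the same model evaluated at U = 0). */
double efficiency_gain(power_w p, double u, double s)
{
    double baseline = system_energy(p, 0.0, 1.0, 1.0);
    return baseline / system_energy(p, u, s, 1.0);
}
```

With Pcore = Psys = 5 W, a 0.5 W accelerator, U = 0.95 and S = 50, `efficiency_gain` evaluates to roughly 16.5 under these assumptions.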
Energy Efficiency Gains of LSSD & DSA over OOO core
[Figure: two panels, "Energy Eff. of LSSD over OOO" (Plssd = 0.5W, 500mW power overhead) and "Energy Eff. of DSA over OOO" (Pdsa ≈ 0.0W); x-axis: Accelerator Speedup w.r.t OOO core; one curve per utilization: U = 1, U = 0.95, U = 0.9, U = 0.75]
Speeduplssd = Speedupdsa (Speedup w.r.t OOO)
Baseline – 4 wide OOO core
At higher speedups (S → ∞), energy efficiency gains are ‘capped’ due to large system power
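The cap can be seen by taking the limit of the energy expression from the Energy Efficiency Tradeoffs slide. The short derivation below assumes the baseline is the same model evaluated at U = 0, so that the OOO-only energy is (Pcore + Psys) * t; that choice of baseline is an assumption, not something the slides state explicitly.

```latex
\begin{align*}
E(S) &= P_{acc}\,\frac{U}{S}\,t
      + P_{sys}\left(1 - U + \frac{U}{S}\right)t
      + P_{core}\,(1 - U)\,t \\
\lim_{S \to \infty} E(S) &= (P_{sys} + P_{core})\,(1 - U)\,t \\
\lim_{S \to \infty} \frac{E_{OOO}}{E(S)}
     &= \frac{(P_{core} + P_{sys})\,t}{(P_{sys} + P_{core})\,(1 - U)\,t}
      = \frac{1}{1 - U}
\end{align*}
```

Under this assumption the gain for U < 1 is bounded by 1/(1 - U), independent of accelerator power: 4x at U = 0.75, 10x at U = 0.9 and 20x at U = 0.95.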
Does LSSD’s power overhead of 2x - 4x matter in a system with an accelerator?
When Psys >> Plssd, the 2x - 4x power overheads of LSSD become inconsequential
Energy Efficiency Gains of DSA over LSSD
Speeduplssd = Speedupdsa (Speedup w.r.t OOO)
Eff(dsa/lssd) = (1 / DSA energy) / (1 / LSSD energy) = LSSD energy / DSA energy
[Figure: "Energy Eff. of DSA over LSSD"; x-axis: Accelerator Speedup w.r.t OOO core; one curve per utilization: U = 1, U = 0.95, U = 0.9, U = 0.75; Baseline – LSSD]
• Eff(dsa/lssd) is no more than 10% even at 100% utilization
• At higher speedups, benefits of DSA are less than 5% on energy efficiency
• At lower speedups, DSA’s energy efficiency gains are 6 - 10% over LSSD
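The 10% bound at full utilization follows from the energy model on the Energy Efficiency Tradeoffs slide. The derivation below is a sketch that plugs in the example values from the earlier slides (Plssd = 0.5W, Pdsa ≈ 0W, Psys = 5W) and assumes both accelerators achieve the same speedup S.

```latex
\begin{align*}
\mathit{Eff}_{dsa/lssd}
  &= \frac{E_{lssd}}{E_{dsa}}
   = \frac{P_{lssd}\,\frac{U}{S} + P_{sys}\left(1 - U + \frac{U}{S}\right) + P_{core}\,(1 - U)}
          {P_{dsa}\,\frac{U}{S} + P_{sys}\left(1 - U + \frac{U}{S}\right) + P_{core}\,(1 - U)} \\
\text{at } U = 1:\quad
\mathit{Eff}_{dsa/lssd}
  &= \frac{P_{lssd} + P_{sys}}{P_{dsa} + P_{sys}}
   = \frac{0.5 + 5}{0 + 5} = 1.10
\end{align*}
```

At U = 1 the speedup cancels, so with these example values the ratio stays at 1.10 for any S. For U < 1 the terms that do not depend on accelerator power grow relative to the accelerator term, which pulls the ratio toward 1.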
In what scenarios would you want to build a DSA over LSSD?
Only when application speedups are small and even small energy efficiency gains are too important to give up
Conclusion
• 5 common principles for architectural specialization
• Programmable architecture (LSSD) composed of simple uArch. mechanisms embodying the principles
• LSSD is competitive with DSA performance, with overheads of only up to 4x in area and power
• Power overhead is inconsequential when system-level energy tradeoffs are considered
• LSSD as a baseline for future accelerator research
Back Up Slides
Design-Time vs. Runtime Decisions
Principle | Synthesis – Time | Run – Time
Concurrency | No. of LSSD Units | Power-gating unused LSSD Units
Computation | Spatial fabric FU mix | Scheduling of spatial fabric and core
Communication | Enabling spatial datapath elements, & SRAM interface widths | Config. of spatial datapath, switches and ports, memory access pattern
Data Reuse | Scratchpad (SRAM) size | Scratchpad used as DMA/reuse buffer
LSSD Design Point Selection
Design | Concurrency | Computation | Comm. | Data Reuse | No. of LSSD Units
LSSDN | 24-tile CGRA (8 Mul, 8 Add, 1 Sigmoid) | 2k x 32b sigmoid lookup table | 32b CGRA; 256b SRAM interface | 2k x 32b weight buffer | 1
LSSDC | 64-tile CGRA (32 Mul/Shift, 32 Add/logic) | Standard 16b FUs | 16b CGRA; 512b SRAM interface | 512 x 16b SRAM for inputs | 1
LSSDD | 64-tile CGRA (32 Mul, 32 Add, 2 Sigmoid) | Piecewise linear sigmoid unit | 32b CGRA; 512b SRAM interface | 2k x 16b SRAMs for inputs | 8
LSSDQ | 32-tile CGRA (16 ALU, 4 Agg, 4 Join) | Join + Filter units | 64b CGRA; 256b SRAM interface | SRAMs for buffering | 4
LSSDB | 32-tile CGRA (Combination of above) | Combination of above FUs | 64b CGRA; 512b SRAM interface | 4KB SRAM | 8
Accelerator Workloads
Workloads: Neural Approx., DNN, Convolution, Database Streaming
1. Ample Parallelism
2. Regular Memory
3. Large Datapath
4. Computation Heavy
LSSD in Practice
1. Design Synthesis
Designer inputs:
• Performance Requirements: Perf. for App. 1: ..., App. 2: ..., App. 3: ...
• H/W Constraints: Area goal: ..., Power goal: ...
Design decisions:
• FU Types
• No. of FUs
• Spatial fabric size
• No. of LSSD tiles
Synthesis → LSSD
2. Programming
For each application:
• Write Control Program (C Prog. + Annotations)
• Write Datapath Program (spatial scheduling compiler framework)
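Programming LSSD
Pragmas (annotations in the control program):
#pragma lssd cores 2
#pragma reuse-scratchpad weights

#include <math.h>

/* Assumed logistic sigmoid; the slide calls sigmoid() without defining it. */
static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

/* One neural network layer. weights is row-major (num_out x num_in);
 * out[] must be initialized by the caller because it is accumulated into. */
void nn_layer(int num_in, int num_out,
              const float* weights,
              const float* in,
              float* out)
{
    for (int j = 0; j < num_out; ++j)
    {
        for (int i = 0; i < num_in; ++i)
        {
            out[j] += weights[j * num_in + i] * in[i];
        }
        out[j] = sigmoid(out[j]);
    }
}

[Figure: LSSD tile (Memory, DMA, D$, Scratchpad, Low-power Core, Input Interface, Spatial Fabric, Output Interface) with the inner loop mapped onto the spatial fabric as eight multipliers (x) feeding seven adders (+) and an accumulator (Ʃ)]
Compilation steps shown:
• Pragmas: Insert data transfer
• Loop Parallelize, Insert Communication, Modulo Schedule
• Resize Computation (Unroll), Extract Computation Subgraph, Spatial Schedule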
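As a rough illustration of what "Resize Computation (Unroll)" and "Extract Computation Subgraph" might produce for the inner loop of nn_layer, the sketch below unrolls the dot product by eight and reduces the products with an adder tree, mirroring the eight multipliers, seven adders and accumulator drawn in the spatial fabric. The function name `dot8_tree` and the requirement that the length be a multiple of 8 are assumptions made for this example; the slides do not show the actual compiler output.

```c
#include <assert.h>

/* Hypothetical software model of the mapped datapath: 8 multipliers,
 * a 7-adder tree, and one accumulator. Assumes n is a multiple of 8. */
float dot8_tree(const float *w, const float *x, int n)
{
    assert(n >= 0 && n % 8 == 0);
    float acc = 0.0f;                    /* accumulator node */
    for (int i = 0; i < n; i += 8) {
        float p[8];
        for (int k = 0; k < 8; ++k)      /* eight independent multiplies */
            p[k] = w[i + k] * x[i + k];
        float s0 = p[0] + p[1];          /* adder tree, level 1 (4 adds) */
        float s1 = p[2] + p[3];
        float s2 = p[4] + p[5];
        float s3 = p[6] + p[7];
        float t0 = s0 + s1;              /* level 2 (2 adds) */
        float t1 = s2 + s3;
        acc += t0 + t1;                  /* level 3 (1 add), then accumulate */
    }
    return acc;
}
```

Each loop iteration corresponds to one pass through the fabric. Note that the tree reduction reorders the floating-point additions, so results can differ slightly from the sequential loop.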
Power & Area Analysis (1)
• LSSDN: 1.2x more Area than DSA; 2x more Power than DSA
• LSSDC: 1.7x more Area than DSA; 3.6x more Power than DSA
Power & Area Analysis (2)
• LSSDD: 3.8x more Area than DSA; 4.1x more Power than DSA
• LSSDQ: 0.5x the Area of the DSA; 0.6x the Power of the DSA
LSSD Area & Power Numbers
Domain | Design | Area (mm2) | Power (mW)
Neural Approx. | LSSDN | 0.37 | 149
Neural Approx. | NPU | 0.30 | 74
Stencil | LSSDC | 0.15 | 108
Stencil | Conv. Engine | 0.08 | 30
Deep Neural | LSSDD | 2.11 | 867
Deep Neural | DianNao | 0.56 | 213
Database Streaming | LSSDQ | 1.78 | 519
Database Streaming | Q100 | 3.69 | 870
(all four) | LSSDBalanced | 2.74 | 352
*Intel Ivybridge 3770K CPU 1 core Area – 12.9mm2 | Power – 4.95W
*Intel Ivybridge 3770K iGPU 1 execution lane Area – 5.75mm2
+AMD Kaveri APU Tahiti based GPU 1CU Area – 5.02mm2
*Source: http://www.anandtech.com/show/5771/the-intel-ivy-bridge-core-i7-3770k-review/3
+Estimate from die-photo analysis and block diagrams from wccftech.com
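The "x" overheads quoted on the analysis slides can be recomputed from the area and power numbers on this slide. The helper below is a minimal sketch; the type `cost` and the function `overhead` are illustrative names, not part of the original work.

```c
#include <assert.h>

/* One design point: area in mm^2 and power in mW. For the value returned
 * by overhead(), both fields are unitless ratios instead. */
typedef struct {
    double area;
    double power;
} cost;

/* Ratio of an LSSD design point to its DSA; above 1 means LSSD costs more. */
cost overhead(cost lssd, cost dsa)
{
    assert(dsa.area > 0.0 && dsa.power > 0.0);
    cost ratio = { lssd.area / dsa.area, lssd.power / dsa.power };
    return ratio;
}
```

For LSSDD (2.11 mm2, 867 mW) against DianNao (0.56 mm2, 213 mW) this gives about 3.8x area and 4.1x power. Some ratios computed this way differ slightly from the labels on the slides, presumably because the listed numbers are rounded.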
Power & Area Analysis (3)
LSSDB  Balanced LSSD design
2.7x more Area than DSAs
2.4x more Power than DSAs
0.6x more Area than DSA
2.5x more Power than DSA
Energy Efficiency Gains of DianNao over LSSD
SpeedupLSSD = SpeedupDianNao (Speedup w.r.t OOO)
[Figure: "Energy Eff. of DianNao over LSSD"; x-axis: Accelerator Speedup w.r.t OOO; one curve per utilization: U = 1, U = 0.95, U = 0.9, U = 0.75]
Does Accelerator power matter?
• At speedups > 10x, the DSA's energy efficiency advantage is around 5% when accelerator power == core power
• At smaller speedups, it makes a bigger difference, up to 35%