Some issues in Application Specific Network on Chip Design
Shashi Kumar
Embedded Systems Group, Department of Electronics and Computer Engineering
School of Engineering, Jönköping University
Linköping University, 13th Jan. 2005
Outline
• Introduction: rather long!
  – NoC Evolution
  – Case for Application Specific NoC
• A NoC Methodology
• Specializing NoC for an Application
  – Processor Selection and Evaluation for NoC
  – Link Optimization and QNoC
• Mapping Applications to NoC
An Interesting Cross-road
[Diagram: several roads meeting at a crossroad: Chip Design, FPGA, SoC, Chip Multiprocessors, and Computer Architecture.]
An Interesting Cross-road
[Same diagram, with NoC placed at the crossroad where Chip Design, FPGA, SoC, Chip Multiprocessors, and Computer Architecture meet.]
Merging Roads
[Diagram: the NoC road merging with Computer Networks.]
A Case for NoC
• Driving forces
  – Exponentially increasing silicon capacity
  – Demand for products requiring large capacity
  – Market: competitive in price and short time-to-market
• Techniques to handle large and complex system design
  – Reuse
  – Higher level of abstraction
  – CAD tools
• The NoC paradigm offers all of the above features
Design Productivity and Level of Abstraction
[Chart: design productivity over time, measured both in number of gates and in number of modules.]
Properties for Reuse
• Well-defined external interfaces
  – Easy composability in different contexts
• Reuse means generalization, which leads to under-utilization
  – Static cost: more area
  – Dynamic cost: consumes more power
• How to make reuse efficient?
  – Specialization: trim out the unnecessary resources
  – Higher utilization through programmability
Evolution of ASIC and Programmable Devices
[Chart: log(capacity/speed) versus years (1960–2000), with curves for ASICs, ASIC used capacity, and programmable devices, annotated with the factors <10 and >100.]
Evolution: ASIC vs. PLD Architectures
[Chart: building blocks/architecture versus time (1975–2000). ASIC building blocks evolve from custom layout through cell libraries and RTL libraries to cores (PBD); programmable devices evolve from gates through CLBs to FPGAs.]
Abstraction Levels of Programming: FPGA
• Program logic blocks and interconnections
  – Bit file
• An ARM core embedded in the FPGA is programmable at a higher level of programming
Chip Multi-processor Road
Question: which new computer architecture can use such a high available capacity?
Options:
  – Super-computer on a chip: powerful instruction set, higher accuracy, on-chip main memory, on-chip interfaces, ...
  – Super-scalar architecture
  – Multi-threaded super-scalar architectures
  – Multi-processor on a chip
Multiprocessor SoC vs. NoC
• A multiprocessor SoC is a special NoC
  – All the cores are processors
    · Homogeneous vs. heterogeneous
  – The functionality of the system is fully software programmable
• The other extreme: an ASIC built using special-function hardware IP cores
  – The functionality of the system is essentially fixed at fabrication time
• NoC covers the complete range between these extremes
NoC Architecture Space
[Diagram: NoC architectures plotted along performance/cost/design time versus reuse/flexibility/programmability: HardNoC, MixedNoC, application-specific MPSoC, heterogeneous MPSoC, and general (homogeneous) MPSoC.]
Why Application Specific NoC?
• Performance
  – Use of application-specific cores
  – Specialize size, topology, routers, links, and protocols for the application
• Power consumption
  – Optimal utilization of computing and communication resources
  – Optimal voltage and clock selection for computing and communication resources
Degrees of Freedom in Mixed NoC Design
• Topology and size
• Core selection and core positioning
• Core operating voltage and clock speed
• Link specialization
• Router design
• Protocol specialization
• OS specialization
ASNoC Design Methodology [1]
[Diagram: cores, memories, accelerators, processors and hardware, and a communication structure form the generic backbone and the application-area-specific IP (optimised virtual components); these define the NoC platform, i.e. a product-area-specific platform. Applications, algorithms, and product features then drive the instantiation of the NoC platform, producing product-specific IP (optimised intellectual property), code and configuration, and finally the NoC system.]
Development of NoC-based Systems
[Diagram: one BACKBONE supports several PLATFORMS (baseband, multimedia, database), which in turn support many SYSTEMS: high-performance and high-capacity communication systems, personal assistants, data collection systems, entertainment devices, and virtual reality games.]
Backbone-Platform-System Methodology
• Backbone design
  – Topology
  – Switches, channels and network interfaces (with generic parameters)
  – Protocols
• Platform design
  – Scaling
  – Selection and placement of resources
  – Specialization of switches, channels, network interfaces and protocols
  – Basic communication services
• Application mapping
  – Programming functionality into resources
  – OS
Development of ASNoC System
[Diagram: development plotted along two axes, function development versus resource development, starting from "system does not exist". Resource development covers operation principles, architecture design, backbone, platform (communication channels, non-configurable hardware), system services, and product-area specialisation; function development covers application mapping and product differentiation, ending in the NoC system.]
Earlier Approaches
[Diagram: the same function-development versus resource-development axes. ASIC design starts from "system does not exist", software design from "system exists", and HW/SW co-design sits in between.]
Algorithm-Core Mappability Analysis [2,3]
Based on Juha-Pekka Soininen's papers
Development of ASNoC System
[The ASNoC development diagram (function development versus resource development) shown earlier, repeated.]
Specialization of Mesh Topology NoC Architecture
• Specialize the backbone
• Specialize the resources
[Figure: a 3×2 mesh of slots (1,1)–(3,2) being populated from a set of available cores.]
Core Selection
• The task is similar to "processor selection" in embedded systems
• Generalization of the processor selection problem:
  – A core may be used for more than one algorithm/task
  – An algorithm may use more than one core
  – A set of cores for a set of algorithms
    · Cluster/region of cores
Overview of the Approach
[Diagram: a set of algorithms A1, A2, A3, ..., Am and a set of available cores C1, C2, C3, ..., Cn are fed into the mappability analysis, which produces the algorithm-core suitability index: a matrix with one value per (algorithm, core) pair, e.g. 0.5, 0.9, 0.1, ..., with some entries left blank.]
Core-Algorithm Mappability Analysis
• The goodness of a processor architecture-algorithm pair depends on:
  – The computational capacity of the architecture
  – The performance requirement of the algorithm
  – An optimal mapping implies that the architecture neither constrains execution nor is under-utilized
• CAMALA: Core-Algorithm Mappability Analysis tool
Mappability Analysis Goals
• Compare and select the best core for a given algorithm
  – Select a set of cores for a set of algorithms
• Specialize a configurable core architecture for a given algorithm
  – Specialize the core for a set of algorithms
  – Architecture parameters:
    · Number of registers
    · Number of functional units
    · Instruction set
Mappability Parameters
• Instruction Set Suitability
• External Data Availability
• Internal Data Availability
• Control Flow Continuity
• Data Flow Continuity
• Execution Unit Availability
Characterization of Algorithm and Processor Cores

Viewpoint | Algorithm characterization | Processor core characterization
Instruction Set Suitability | Effectiveness of instructions w.r.t. operations in the algorithm | Cost of instruction execution
External Data Availability | Number of memory accesses | Bus capacity of the core
Internal Data Availability | Amount of temporary data to be stored | Number of registers
Control Flow Continuity | Number of branches | Branch penalty, branch prediction
Data Flow Continuity | Possibility of data hazards | Handling of data hazards
Execution Unit Availability | Parallelism in the algorithm | Number of functional units
Instruction Set Correlation
• Effectiveness of instruction usage
  – Availability of instructions for the operations used in the algorithm
    · If not available, they have to be replaced by procedures
      - A missing floating-point operation, for example, leads to an extra cost factor
  – Handling of operands of various data widths
• Methods used in compilers and high-level synthesis are used for this analysis
Internal Data Availability Correlation
• Effectiveness of register usage
• Algorithmic requirement
  – Number of intermediate results in one scheduled step during execution
  – ASAP and ALAP scheduling on the data flow graph is used to get this value (a small sketch of the ASAP-based estimate follows this slide)
    · No constraint on resources: maximum parallelism
e(a): maximum number of intermediate results
e(c): number of registers
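The viewpoint characterizations above can be estimated mechanically from the algorithm's data-flow graph. As a rough illustration (not the CAMALA implementation; the graph encoding and function names are assumptions), the sketch below estimates e(a) for internal data availability by ASAP-scheduling a DFG with unlimited resources and counting, per step, how many already-produced values are still needed later.

```python
# Sketch: estimate e(a), the maximum number of intermediate results that must
# be held at once, from a data-flow graph scheduled ASAP with no resource
# constraints. The graph encoding is illustrative only.
from collections import defaultdict

def asap_levels(deps):
    """deps: {node: set of predecessor nodes}. Returns {node: ASAP step}."""
    level = {}
    def step(n):
        if n not in level:
            level[n] = 1 + max((step(p) for p in deps[n]), default=0)
        return level[n]
    for n in deps:
        step(n)
    return level

def max_intermediate_results(deps):
    level = asap_levels(deps)
    users = defaultdict(set)                 # producer -> consumers
    for n, preds in deps.items():
        for p in preds:
            users[p].add(n)
    last_use = {n: max((level[u] for u in users[n]), default=level[n])
                for n in deps}
    peak = 0
    for s in range(1, max(level.values()) + 1):
        # values produced at or before step s that are still needed after it
        live = sum(1 for n in deps if level[n] <= s < last_use[n])
        peak = max(peak, live)
    return peak

# Example: d = (a + b) * (a - b); both partial results are live before the multiply.
dfg = {"add": set(), "sub": set(), "mul": {"add", "sub"}}
print(max_intermediate_results(dfg))         # -> 2, so e(a) = 2 here
```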
Control Flow Continuity Correlation
• For the algorithm it depends on the branch instruction ratio
  – Ratio of branch instructions w.r.t. the total number of instructions
• For the core architecture it depends on
  – Branch prediction technique (Pe)
  – Branch penalty (D)
    · Degree of pipelining and super-pipelining
e(c) ~ (1 - Pe)·D
• Good matches
  – Short pipelines for algorithms with more branches
  – Long pipelines for algorithms with fewer branches
Execution Unit Availability
• Operation-level parallelism in the algorithm vs. the number of functional units in the processor core
e(a) ~ number of instructions / number of steps in the ASAP schedule
e(c) ~ number of functional units
Viewpoint-specific Characterization
• The algorithm is compiled into a graph representation where each node is a block of code
For each computation node j in the algorithm, and for each viewpoint i:
  ei(aj): algorithm characterization
  ei(c): core characterization
Mappability of Each Viewpoint

$$ m_{i,j}(c, a_j) = \begin{cases} \dfrac{e_i(c)}{e_i(a_j)}, & e_i(c) \le e_i(a_j) \\ \dfrac{e_i(a_j)}{e_i(c)}, & e_i(c) \ge e_i(a_j) \end{cases} $$
Total Mappability of a Viewpoint

$$ M_i(c, a) = \frac{\sum_{j=1}^{|V|} w_j \cdot m_{i,j}(c, a_j)}{\sum_{j=1}^{|V|} w_j} $$

wj: number of times block j is executed
Total Mappability

$$ M(c, a) = \sum_i w_i \cdot M_i(c, a) $$

wi: relative importance of the different viewpoints
(A code sketch combining the three formulas follows.)
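Taken together, the three formulas reduce to a few lines of code. The sketch below is illustrative only; the viewpoint values, block weights and viewpoint weights are invented, and e_a[i][j] stands for e_i(a_j), e_c[i] for e_i(c).

```python
# Sketch of the mappability computation defined above (all numbers invented).

def viewpoint_mappability(e_c, e_a):
    """m_ij = min(e_c, e_a) / max(e_c, e_a); 1.0 means a perfect match."""
    if e_c == 0 and e_a == 0:
        return 1.0
    return min(e_c, e_a) / max(e_c, e_a)

def total_mappability(e_a, e_c, w_blk, w_vp):
    total = 0.0
    for i in range(len(e_c)):                       # loop over viewpoints
        num = sum(w * viewpoint_mappability(e_c[i], e_a[i][j])
                  for j, w in enumerate(w_blk))
        m_i = num / sum(w_blk)                      # M_i(c, a)
        total += w_vp[i] * m_i                      # contribution to M(c, a)
    return total

e_a   = [[4, 2, 6],        # e.g. parallelism of each code block
         [3, 8, 5]]        # e.g. intermediate results of each code block
e_c   = [4, 6]             # e.g. functional units, registers of the core
w_blk = [10, 1, 5]         # execution counts of the blocks
w_vp  = [0.5, 0.5]         # relative importance of the two viewpoints
print(round(total_mappability(e_a, e_c, w_blk, w_vp), 3))   # -> 0.742
```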
CAMALA Implementation
[Tool flow: the application in C is compiled by the SUIF compiler into SUIF format; CAMALA takes the SUIF representation together with the core information and produces the mappability figures.]
Verification of Mappability Analysis
• The higher the total mappability between an algorithm and a core, the more suitable the core is for the job
• The mappability analysis was verified using the SimpleScalar instruction set simulator
  – Goodness: utilization × speed-up
Example: a processor with 2 functional units, utilization = 60% and speed-up = 1.5
Goodness = 0.60 × 1.5 / 2 = 0.45
Experimental Results
[Results figure/table.]
Experimental Results: WLAN Modem
Architectures for WLAN Modem Algorithms
Core architecture parameters explored:
  – D: super-pipelining degree (1..20)
  – E: number of execution paths (1..6)
  – R: number of registers (1..64)
  – B: number of data buses (1..6)
Multiprocessor for WLAN Modem
Summary
• The mappability analysis technique enables fast selection/specialization of cores for an application or application set
• The effect of memory organization on mappability is not considered
• The effect of inter-core communication organization on mappability is not considered
Quality of Service NoC (QNoC) [4,5]
Based on Evgeny Bolotin's papers and presentations
Goals of QNoC
• Design an application-specific NoC which provides the required communication performance (QoS) at minimum cost
  – Communication QoS: required throughput and end-to-end delay guarantees between communicating cores
    · Specialize the links in the NoC
  – Cost
    · Area: links, buffers and routers are trimmed
    · Power consumption: shortest-path routing
Quality of Service NoC (QNoC) Architecture
• Irregular mesh topology
[Figure, taken from Evgeny Bolotin's presentation: modules of different sizes attached to an irregular mesh of routers.]
Specialization of Links
[Figures, taken from Evgeny Bolotin's presentation: the same irregular mesh before and after link adaptation.]
• Link bandwidth is adjusted according to the expected traffic
• Bandwidth is adjusted by changing either the number of wires or the data frequency (see the sketch below)
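A back-of-the-envelope view of that adjustment: a link's bandwidth is (number of wires) x (data frequency), so either factor can be sized to the expected load. The helper below only illustrates this relation; it is not the QNoC sizing procedure.

```python
import math

def wires_needed(expected_traffic_gbps, data_freq_ghz):
    """Minimum link width so that width * frequency covers the expected load."""
    return math.ceil(expected_traffic_gbps / data_freq_ghz)

# A link expected to carry 12 Gb/s needs 12 wires at a 1 GHz data rate,
# or 24 wires at 0.5 GHz.
print(wires_needed(12, 1.0), wires_needed(12, 0.5))   # -> 12 24
```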
QNoC Service Levels
• Four different service levels with different priorities
  – Signaling
    · Short urgent packets
    · Suitable for control signals and interrupts
  – Real-Time
    · Guaranteed bandwidth and latency
    · Useful for streamed audio or video processing
  – Read/Write (RD/WR)
    · Provides bus semantics
  – Block Transfer
    · Large blocks of data
    · DMA transfers
• Priority: Signaling > Real-Time > Read/Write > Block Transfer
QNoC: Routing Algorithm
• Static shortest-path X-Y coordinate-based routing (sketched below)
  – Avoids deadlocks
  – No reordering at the end points
• Wormhole packet forwarding with credit-based flow control
• Packets of various priorities are forwarded in an interleaved manner according to packet priority
• A high-priority packet can pre-empt a long low-priority packet
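X-Y routing is simple enough to state directly as code. The sketch below (coordinates and function name are illustrative) moves a packet first along the X dimension and then along Y; this dimension ordering is what makes the scheme deadlock-free on a mesh.

```python
def xy_route(src, dst):
    """Deterministic X-Y route on a mesh: move along X first, then along Y.
    Returns the list of (x, y) hops from src to dst, both endpoints included."""
    (x, y), (xd, yd) = src, dst
    path = [(x, y)]
    while x != xd:                      # X dimension first
        x += 1 if xd > x else -1
        path.append((x, y))
    while y != yd:                      # then Y
        y += 1 if yd > y else -1
        path.append((x, y))
    return path

print(xy_route((1, 1), (3, 2)))
# -> [(1, 1), (2, 1), (3, 1), (3, 2)]
```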
QNoC: Router Architecture [4]
[Block diagram, taken from Evgeny Bolotin's presentation: each input port holds separate buffers for the four service levels (Signaling, RT, RD/WR, Block); a routing and scheduling control unit forwards flits through a crossbar to the output ports; CREDIT signals on the ports implement the credit-based flow control.]
QNoC Packet Format
• Arbitrary-sized packets
Packet layout: TRA | Command | Flit | Flit | ... | Flit
  – TRA: target routing address
  – Command: information about the payload
  – Payload: arbitrary length
Flit types:
  – FP (full packet): a one-flit packet
  – EP (end of packet): last flit
  – BDY (body): a non-last flit
QNoC Design Flow
[Flow, taken from Evgeny Bolotin's presentation: start from the modules communicating over an ideal network; characterize the traffic; place the modules; map the traffic to the grid; then optimize and estimate cost, yielding the QNoC architecture.]
Design Example
[Figure taken from Evgeny Bolotin's presentation.]
Design Example [4]
Representative design example; each module contains 4 traffic sources (table taken from Evgeny Bolotin's presentation; a small consistency check of the load column follows the table):

Traffic source | Traffic interpretation | Avg. packet length [flits] | Avg. inter-arrival time [ns] | Total load per module | ETE requirement (for 99.9% of packets)
Signaling | Every 100 cycles each module sends an interrupt to a random target | 2 | 100 | 320 Mbps | 20 ns (several cycles)
Real-Time | Periodic connection from each module: 320 voice channels of 64 Kb/s | 40 | 2 000 | 320 Mbps | 125 µs (voice, 8 kHz frame)
RD/WR | Random-target RD/WR transaction every ~25 cycles | 4 | 25 | 2.56 Gbps | ~150 ns (tens of cycles)
Block Transfer | Random-target block-transfer transaction every ~12 500 cycles | 2 000 | 12 500 | 2.56 Gbps | 50 µs (several tx. delays on a typical bus)
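The "total load per module" column can be reproduced from the packet length and inter-arrival time if a 16-bit flit is assumed (that flit width makes all four rows consistent; it is an inference, not a number stated on this slide):

```python
# Sanity check of the load column above, assuming 16-bit flits.
FLIT_BITS = 16
sources = {                      # avg packet length [flits], inter-arrival [ns]
    "Signaling":      (2,     100),
    "Real-Time":      (40,    2_000),
    "RD/WR":          (4,     25),
    "Block Transfer": (2_000, 12_500),
}
for name, (flits, gap_ns) in sources.items():
    gbps = flits * FLIT_BITS / gap_ns          # bits per ns equals Gbit/s
    print(f"{name:15s} {gbps:.2f} Gbps")
# Signaling 0.32, Real-Time 0.32, RD/WR 2.56, Block Transfer 2.56
```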
Uniform Scenario - Observations
[Figure, taken from Evgeny Bolotin's presentation: calculated link load relations across the mesh.]
Uniform Scenario - Observations
Various link BW allocations (table taken from Evgeny Bolotin's presentation):

Allocated link BW | Avg. link utilization [%] | Packet ETE delay [ns or cycles]: Signaling (99.9%) | Real-Time (99.9%) | RD/WR (99%) | Block Transfer (99%)
2 560 Gbps | 10.3 | 6 | 80 | 20 | 4 000
850 Gbps | 30.4 | 20 | 250 | 80 | 50 000   (meets the desired QoS)
512 Gbps | 44 | 35 | 450 | 1 000 | 300 000
Uniform Scenario - Observations
Fixed network configuration, uniform traffic: how does the network behave under different traffic loads?
[Plot, taken from Evgeny Bolotin's presentation: packet ETE delay versus traffic load for the four service levels; delay grows with load, affecting the low-priority Block Transfer traffic most and Signaling least.]
Alternatives for Connecting n Cores [4,5]
For achieving the same communication bandwidth (table taken from Evgeny Bolotin's presentation), four architectures are compared on total area, power dissipation and operating frequency: NoC, point-to-point (PTP), segmented bus (S-Bus) and non-segmented bus (NS-Bus). The asymptotic orders appearing in the table are O(√n), O(n√n), O(n²√n) and O(n³√n) for area and power, and O(1), O(1/√n) and O(1/n²) for operating frequency; only the NoC sustains an O(1) operating frequency as n grows, while the PTP and bus alternatives degrade.
QNoC vs. Alternative Solutions (4×4 mesh, uniform traffic)
Uniform scenario, same QoS (figures taken from Evgeny Bolotin's presentation):

Arch. | Frequency | Utilization | Avg. link width
QNoC | 1 GHz | 30% | 28
Bus | 50 MHz | 50% | 3 700
PTP | 100 MHz | 80% | 6

[Bar chart: relative wire length (area) and power cost of Bus, QNoC/NoC and PTP; normalized values range from 0.1 to 100.]
Summary about the QNoC Approach
• Lower cost and higher performance are the driving forces behind application-specific NoC approaches like QNoC
[The NoC architecture space diagram again (performance/cost/design time versus reuse/flexibility/programmability), with QNoC placed at the high-performance, application-specific end near HardNoC and MixedNoC.]
Mapping Applications to NoC Platforms
Various Options
• Architecture-application co-development
  – Optimized in terms of performance and cost
  – High development time
  – Less flexible
• Mapping sequential code to a fixed platform
  – Low development time and cheap
  – Medium performance
  – Low utilization of resources
• Mapping parallel code through emulation of one network on another network
Application Mapping
[The ASNoC development diagram (function development versus resource development) again, here positioning the application mapping step.]
Task Graph as an Application Model
• The most common graph model used for representing applications for multi-processor systems
  – The application is partitioned into smaller units called tasks
• Similar in properties to the data flow graphs used for high-level synthesis of ASICs
• It can represent:
  – Concurrency among computations
  – Data dependencies among computations
  – Communication among computations
  – Control dependencies among computations
  – Temporal dependencies among computations
  – Non-determinism among computations
Task Graph
• A task graph is a weighted directed acyclic graph (DAG); one possible encoding is sketched below
  – Nodes represent computational tasks
    · The weight on a node may represent different features of the computation: number of instructions, execution time, etc.
  – Edges between tasks represent dependencies
    · The weight on an edge may represent the size of the data to be communicated between the tasks, or some temporal information
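A weighted DAG like this is easy to encode. The sketch below is one possible in-memory representation (field names and units are illustrative, not from the talk), loosely modelled on the five-task example of the next slide; the exact edge weights and topology here are invented.

```python
# Illustrative task-graph encoding: node weights are execution times, edge
# weights are data sizes in bytes.
from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    exec_time: dict                                  # node -> node weight
    edges: dict = field(default_factory=dict)        # (src, dst) -> edge weight

    def successors(self, node):
        return [dst for (src, dst) in self.edges if src == node]

stg = TaskGraph(
    exec_time={"T1": 100, "T2": 300, "T3": 50, "T4": 400, "T5": 100},
    edges={("T1", "T2"): 1024, ("T1", "T3"): 2048,
           ("T3", "T4"): 4, ("T2", "T5"): 3072, ("T4", "T5"): 1024},
)
print(stg.successors("T1"))                          # -> ['T2', 'T3']
```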
Task Graph Example
• Parameters
  – Granularity of nodes
  – Period and deadlines
• Granularity of nodes
  – High granularity
    · Less available parallelism
    · Less communication cost
  – Low granularity
    · Higher available parallelism
    · More communication cost
  – Hierarchical task graphs
    · A task node is itself a task graph at a lower level
[Example graph: five tasks T1–T5 with node weights 100, 300, 50, 400 and 100, edge weights between 4 B and 3 KB, period 20 ms and deadline 20 ms.]
Task Graph Extraction from Sequential Code
• The job is similar to compiler design
[Flow: a C program is fed to the task graph extractor, assisted by a profiler/simulator, which produces the task graph.]
Concurrent Applications as a Multi-Task Graph (MTG)
• NoC is expected to integrate more than 50 processor-sized resources in a few years
  – It can concurrently run many applications
• Each application can be represented by a task graph called a Single Task Graph (STG)
• We call the aggregation of many STGs a Multi-Task Graph (MTG)
  – There are no data or temporal dependencies among the various STGs
  – The relative importance of the STGs may be specified in some form
Fixed NoC Platforms
• A NoC platform is fixed if
  – The topology and size of the NoC are decided
  – The communication protocols are decided
  – The resources for all the slots have been selected
[Example: a 3×3 mesh with resource network interfaces (RNI) at slots (1,1)–(3,3), populated with a video receiver, a video transmitter, a processor, two DSPs, a memory, an audio receiver, an audio transmitter and an I/O interface.]
Mapping and Scheduling Problem
Mapping: for every task, decide the core in the NoC on which it will be executed.
Scheduling: for every task, decide the starting time of its execution.
[Figure: a multi-task graph consisting of STG1 and STG2 being mapped onto the 3×3 fixed NoC platform of the previous slide.]
Issues: Static vs. Dynamic Mapping & Scheduling
• Static or off-line mapping and scheduling
  – The resource on which each task will run is decided before runtime
  – The starting time for the execution of each task on its resource is also decided
  – Normally based on worst-case estimates of task execution times
• Dynamic or on-line mapping and scheduling
  – Task assignment to resources as well as the ordering of their execution is done at run time
    · Non-preemptive or preemptive scheduling
  – Based on actual execution times of tasks
  – Which resource runs the mapping and scheduling algorithm?
• Dynamic mapping and scheduling can lead to better performance
Objective Functions
• The primary objective for a real-time application is to meet all hard deadlines
  – Find a mapping and scheduling of tasks on the computing platform such that the performance requirements are met
• Secondary objectives
  – Power consumption minimization
  – Soft deadlines are met as far as possible
Tang’s Lei’s 2-step genetic Algorithm [6]
¾ Solves mapping and scheduling as a single
integrated problem
™Static mapping and static and non-preemptive
scheduling
™Fixed NoC architecture
¾ Objective
™Each task graph can meet its individual execution
deadline
™Maximize the overall execution performance
Shashi Kumar
Linköping University
13th Jan. 2005
75
Inputs to the Mapping Algorithm
• Weighted task graph
  – The weight on each edge corresponds to the amount of data communicated from the source node to the destination node
• Deadline for the execution of each STG
• Task execution time table
  – Worst-case execution time of each task on each core (∞ where a task cannot run on a core), e.g.:

Task/Core | C1 | C2 | ... | CM
T1 | 100 | ∞ | ... | 75
T2 | 200 | 50 | ... | 150
... | ... | ... | ... | ...
TN | ∞ | 50 | ... | 40
Mapping Issues and Assumptions
• A task graph node has different execution times on different resources
  – There may be many copies of the same resource in the network
  – Every task can be executed by at least one resource
  – A task is executed without pre-emption
• The execution time of an STG depends on:
  – The type and position of the resources on which its tasks are executed
  – Execution of tasks from various STGs may be interleaved
• There is enough local memory in every core to store all the required data for all the tasks executing on it
Estimation of STG Delay
• The delay of an STG corresponds to the delay of the critical path in its task graph (a code sketch follows)
  – Sum of the execution times of the tasks and edges on the critical path

$$ T_k = T_{\text{critical-path},k} = \sum_{V_i \in \text{critical-path}_k} T_{v_i} + \sum_{E_j \in \text{critical-path}_k} T_{e_j} $$
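As a concrete (illustrative) rendering of this estimate, the longest path in the weighted DAG can be found in one pass with memoized recursion, adding node execution times and edge delays. The task graph and delay values below are invented.

```python
# Sketch: critical-path delay T_k of one STG. Node weights are task execution
# times, edge weights are (already estimated) communication delays.
from functools import lru_cache

exec_time  = {"T1": 100, "T2": 300, "T3": 50, "T4": 400, "T5": 100}
edge_delay = {("T1", "T2"): 10, ("T1", "T3"): 20,
              ("T3", "T4"): 1, ("T2", "T5"): 30, ("T4", "T5"): 10}

succ = {n: [d for (s, d) in edge_delay if s == n] for n in exec_time}

@lru_cache(maxsize=None)
def longest_from(node):
    """Delay of the longest path starting at `node`, including its own time."""
    tails = [edge_delay[(node, s)] + longest_from(s) for s in succ[node]]
    return exec_time[node] + max(tails, default=0)

t_k = max(longest_from(n) for n in exec_time)
print(t_k)   # -> 681, via the critical path T1 -> T3 -> T4 -> T5
```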
Edge Delay
• Two communicating vertices of the MTG may get mapped onto two different NoC resources
• An edge delay in the MTG depends on:
  – The mapping of the task nodes to resources
  – The data size
  – The router design, including protocols
  – The network traffic situation at that time
Edge Delay (assumption)
• In a mesh-based network using a dynamic routing algorithm, the delay of an edge Ei going from node (x1, y1) to (x2, y2) can be coarsely estimated as

$$ T_{e_i} = k_e \cdot w_i \cdot \left( |x_1 - x_2| + |y_1 - y_2| \right) $$

where wi is the size of the data to be communicated and ke absorbs all architectural parameters (a one-line code version follows).
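The same estimate as a one-liner, for hypothetical slot coordinates and an invented k_e:

```python
def edge_delay(w_i, src_xy, dst_xy, k_e=0.5):
    """Coarse edge delay: k_e * data size * Manhattan distance between slots."""
    (x1, y1), (x2, y2) = src_xy, dst_xy
    return k_e * w_i * (abs(x1 - x2) + abs(y1 - y2))

# 1 KB sent from slot (1, 1) to slot (3, 2): three hops away.
print(edge_delay(1024, (1, 1), (3, 2)))    # -> 1536.0
```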
MTG Effective Delay: Objective Function
• Since the main optimization goal is to meet every STG's deadline constraint, we define a normalization, called the MTG effective delay, to reflect every STG's contribution to the overall objective function, and combine them all as:

$$ T_{MTG} = D_{\max} \times \max\left\{ \frac{T_0}{D_0}, \ldots, \frac{T_{(c-1)}}{D_{(c-1)}} \right\} $$

where Ti and Di are the execution delay (under a given mapping) and the deadline of STGi, and Dmax is the largest deadline. (A code sketch follows.)
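Expressed as code (a sketch only; the STG delays would come from the critical-path and edge-delay estimates above):

```python
def mtg_effective_delay(stg_delays, deadlines):
    """T_MTG = D_max * max_i(T_i / D_i). A value <= D_max means every STG
    meets its deadline under the evaluated mapping."""
    d_max = max(deadlines)
    return d_max * max(t / d for t, d in zip(stg_delays, deadlines))

# Two STGs with delays 681 and 1500 against deadlines 2000 and 1600.
print(mtg_effective_delay([681, 1500], [2000, 1600]))   # -> 1875.0
```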
Mapping Algorithm: Basic Idea
Two steps:
[Figure: an MTG made of STG1 (tasks 1–7) and STG2 (tasks 8, 9). Step 1 partitions the task nodes by resource type, e.g. into {1, 3, 5}, {2, 4, 6, 7} and {8, 9}. Step 2 binds each group onto the fixed NoC, e.g. slots executing {1, 3}, {2, 4}, {5}, {6, 7} and {8, 9}.]
The First Step: Partition Nodes According to Types
• We assume that there are a few types of NoC resources
• For each task in the MTG, find the type of resource it should be executed on
  – A non-trivial problem, since we want to use maximum parallelism (maximum utilization of resources)
    · Insisting on using the fastest resource for a task can delay the start time of that task or of other tasks
    · To avoid communication delay it may be better to execute connected tasks on the same or neighbouring resources
• This step is implemented as a genetic algorithm and generates many candidate solutions
  – These are used as the initial population for the second step
The Second Step: Binding of Position
• In each of the candidate solutions
  – For each task node, decide the exact resource of the type decided in the first step
    · If the number of resources of a type is exactly one, the choice is trivial
    · The choice will affect the execution time of the individual STGs
• This binding problem is also hard!
• This step is also implemented using a separate genetic algorithm (a greatly simplified sketch of the two-step idea follows)
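To make the two-step structure concrete, here is a deliberately simplified sketch: step 1 picks a resource type per task and step 2 binds each task to a concrete slot of that type. Random search stands in for the two genetic algorithms, and the fitness is only a toy (sum of execution times plus Manhattan communication cost), so nothing here reproduces the published algorithm; all names and numbers are invented.

```python
# Toy two-step mapping search: step 1 assigns resource types, step 2 assigns
# concrete NoC slots; random sampling replaces the genetic operators.
import random

exec_time = {("T1", "DSP"): 100, ("T1", "CPU"): 160,
             ("T2", "DSP"): 300, ("T2", "CPU"): 250,
             ("T3", "DSP"): 50,  ("T3", "CPU"): 90}
edges = {("T1", "T2"): 1024, ("T1", "T3"): 2048}            # bytes
slots = {"DSP": [(1, 1), (2, 2)], "CPU": [(1, 2)]}           # instances per type
K_E = 0.01

def fitness(binding):                       # binding: task -> (type, (x, y))
    compute = sum(exec_time[(t, ty)] for t, (ty, _) in binding.items())
    comm = sum(K_E * w * (abs(binding[a][1][0] - binding[b][1][0]) +
                          abs(binding[a][1][1] - binding[b][1][1]))
               for (a, b), w in edges.items())
    return compute + comm

def two_step_search(tasks, iters=200):
    best = None
    for _ in range(iters):
        types = {t: random.choice(list(slots)) for t in tasks}                 # step 1
        cand = {t: (ty, random.choice(slots[ty])) for t, ty in types.items()}  # step 2
        if best is None or fitness(cand) < fitness(best):
            best = cand
    return best

print(two_step_search(["T1", "T2", "T3"]))
```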
Task Graph Mapping Tool
Input Generation Tool
MTG Performance Variations with NoC Size
[Plot: MTG effective delay versus NoC size (2–11) for MTGs containing 1, 2 and 3 STGs; the delay drops as the NoC grows, with labeled values of 482 (1 STG), 770 (2 STGs) and 1098 (3 STGs).]
Conclusions
• NoC research is in its infancy; many tools will be required for ASNoC design
  – Simulators for the evaluation of design choices
  – Performance and power estimators
  – Tools for mapping and scheduling applications on the NoC
  – Communication libraries for coding applications
• There is a lack of real, large applications that can be used to evaluate current research proposals
  – Researchers still use random traffic and random task graphs
Important References
1. Shashi Kumar et al., "A Network on Chip Architecture and Design Methodology", SVLSI 2002, Pittsburgh, 2002.
2. Juha-Pekka Soininen et al., "Extending Platform Based Design to Network on Chip Systems", Proceedings of the 16th International Conference on VLSI Design 2003, India, Jan. 2003.
3. Juha-Pekka Soininen et al., "Mappability Estimation Approach for Processor Architecture Evaluation", Proceedings of the 20th NORCHIP 2002, Nov. 2002, Copenhagen, Denmark.
4. Evgeny Bolotin et al., "QNoC: QoS Architecture and Design Process for Network on Chip", Journal of Systems Architecture 50 (2004), 105-128.
5. Evgeny Bolotin et al., "Cost Considerations in Network on Chip", Integration - The VLSI Journal, special issue on Network on Chip, Volume 38, Issue 1, October 2004, pp. 19-42.
6. Tang Lei and Shashi Kumar, "A Two Step Genetic Algorithm for Mapping Task Graphs to a NoC Architecture", Proceedings of DSD 2003, Turkey, Sept. 2003.
7. Ruxandra Pop and Shashi Kumar, "A Survey of Techniques for Mapping and Scheduling Applications to Network on Chip Systems", Research Report 04:4, School of Engineering, Jönköping University, Dec. 2004, ISSN 1404-0018.