COARSE GRAINED RECONFIGURABLE
ARCHITECTURE FOR VARIABLE BLOCK
SIZE MOTION ESTIMATION
1
03/26/2012
OUTLINE
 Introduction
 Motivation
 Network-on-Chip (NoC)
 ASIC based approaches
 Coarse grain architectures
 Proposed Architecture
 Results
2
INTRODUCTION
 Goal
   Application specific hybrid coarse grained reconfigurable architecture using NoC
 Purpose
   Support Variable Block Size Motion Estimation (VBSME)
 First approach? No
   ASIC and other coarse grained reconfigurable architectures exist
 Difference
   Use of intelligent NoC routers
   Support for full and fast search algorithms
3
MOTIVATION
[Figure: Motion Estimation within the H.264 encoder]
4
MOTION ESTIMATION
[Figure: block matching; the current 16x16 block of the current frame is matched against candidate blocks inside a search window of the previous frame using the Sum of Absolute Difference (SAD)]
5
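The SAD cost and the exhaustive block matching shown in the figure can be sketched as follows. `sad` and `full_search`, and the list-of-lists frame layout, are hypothetical names for illustration, not code from the presentation.

```python
def sad(block_a, block_b):
    """Sum of Absolute Differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(current, previous, bx, by, n=16, search_range=8):
    """Match the n x n block of `current` at (bx, by) against every candidate
    inside a +/-search_range window of `previous`; return (motion vector, SAD)."""
    h, w = len(previous), len(previous[0])
    block = [row[bx:bx + n] for row in current[by:by + n]]
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and 0 <= x and y + n <= h and x + n <= w:
                candidate = [row[x:x + n] for row in previous[y:y + n]]
                cost = sad(block, candidate)
                if cost < best_cost:
                    best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

The (2·search_range+1)² candidate evaluations per block are exactly the workload the systolic-array and reconfigurable architectures discussed later try to parallelize.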
SYSTEM-ON-CHIP (SOC)
 Single chip systems
 Common components
   Microprocessor
   Memory
   Co-processor
   Other blocks
 Increased processing power and data intensive applications
 Facilitating communication between individual blocks has become a challenge
6
TECHNOLOGY ADVANCEMENT
7
DELAY VS. PROCESS TECHNOLOGY
8
NETWORK-ON-CHIP (NOC)
 Efficient communication via use of transfer protocols
 Need to take into consideration the strict constraints of the SoC environment
 Types of communication structure
   Bus
   Point-to-point
   Network
9
COMMUNICATION STRUCTURES
10
BUS VS. NETWORK
Bus Pros & Cons                                  | Network Pros & Cons
x Every unit attached adds parasitic capacitance | ✓ Local performance not degraded with scaling
x Bus timing is difficult                        | ✓ Network wires can be pipelined
x Bus arbitration can become a bottleneck        | ✓ Routing decisions are distributed
x Bus testability is problematic and slow        | ✓ Locally placed BIST is fast and easy
x Bandwidth is limited and shared by all         | ✓ Bandwidth scales with network size
✓ Bus latency is wire speed once granted         | x Network contention may cause latency
✓ Very compatible                                | x IPs need smart wrappers
✓ Simple to understand                           | x Relatively complicated
11
EXAMPLE
12
EXAMPLE OF NOC
13
ROUTER ARCHITECTURE
14
BACKGROUND
 ME
   General purpose processors, ASIC, FPGA and coarse grain
   Only FBSME
   VBSME with redundant hardware
 General purpose processors
   Can exploit parallelism
   Limited by the inherent sequential nature and data access via registers
15
CONTINUED…
 ASIC
   No support for all block sizes of H.264
   Support provided at the cost of high area overhead
 Coarse grained
   Overcome the drawbacks of LUT based FPGAs
   Elements with coarser granularity
   Fewer configuration bits
   Underutilization of resources
16
ASIC Approaches
Topology: 1D systolic array | 2D systolic array
SAD accumulation: Partial Sum | Parallel Sum

1D systolic array (partial sum):
• Each processing element computes a pixel difference, accumulates it into the previous partial SAD and sends the computed partial SAD to the next processing element
• Direction of data transfer depends on the search pattern
• Large number of broadcasted registers
• High latency
• No VBSME

2D systolic array (parallel sum):
• Mesh based architecture
• All pixel differences of a 4x4 block computed in parallel
• SAD computation for each 4x4 block pipelined
• Partial SADs stored and reference pixels reused
• Area overhead
• Large number of registers
17
OU’S APPROACH
 16 SAD modules to process 16 4x4 motion vectors
 VBSME processor
   Chain of adders and comparators to compute larger SADs
 PE array
   Basic computational element of a SAD module
   Cascade of 4 1D arrays
 1D array
   1D systolic array of 4 PEs
   Each PE computes a 1 pixel SAD
18
[Figure: SAD modules and PE array; Module 0 to Module 15 each receive 32-bit current_block_data_i and search_block_data_i from two block strips (block_strip_A/B, selected by strip_sel) and output SAD_i and MV_i; a MUX selects among the SADs; each module's PE array is a cascade of four 1D arrays]
19
[Figure: 1D Array; four PEs in a systolic chain with delay registers (D) on the 32-bit data lines, followed by an accumulator (ACCM)]
20
PUTTING IT TOGETHER
 Clock cycle
   Columns of the current 4x4 sub-block scheduled using a delay line
   Two sets of search block columns broadcasted
 4 block matching operations executed concurrently per SAD module
 4x4 SADs -> 4x4 motion vectors
 4x4 SADs -> 4x8 SADs -> ... -> 16x16 SADs
   Chain of adders and comparators
 Drawbacks
   No reuse of search data between modules
   Resource wastage
21
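The 4x4 -> 4x8 -> ... -> 16x16 combination above can be sketched in software. `merge_vbs_sads` is a hypothetical name, the partition labels follow a width x height convention assumed here, and this is a sketch of the adder-chain idea rather than Ou's hardware; the comparators that pick the minimum over candidate positions are omitted.

```python
def merge_vbs_sads(s):
    """Combine a 4x4 grid of 4x4-block SADs (s[r][c], covering one 16x16
    macroblock) into the SADs of all larger partitions by pure addition,
    mirroring the chain of adders described above."""
    sad8x4 = {(r, c): s[r][c] + s[r][c + 1]           # horizontal 4x4 pairs
              for r in range(4) for c in (0, 2)}
    sad4x8 = {(r, c): s[r][c] + s[r + 1][c]           # vertical 4x4 pairs
              for r in (0, 2) for c in range(4)}
    sad8x8 = {(r, c): sad8x4[(r, c)] + sad8x4[(r + 1, c)]
              for r in (0, 2) for c in (0, 2)}
    sad16x8 = {r: sad8x8[(r, 0)] + sad8x8[(r, 2)] for r in (0, 2)}
    sad8x16 = {c: sad8x8[(0, c)] + sad8x8[(2, c)] for c in (0, 2)}
    sad16x16 = sad16x8[0] + sad16x8[2]
    return {"8x4": sad8x4, "4x8": sad4x8, "8x8": sad8x8,
            "16x8": sad16x8, "8x16": sad8x16, "16x16": sad16x16}
```

Because every larger SAD is a sum of smaller ones, all 41 partition SADs of a macroblock come from the 16 4x4 SADs with adders alone, which is the key saving of this organization.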
ALTERNATIVE SOLUTION: COARSE GRAIN ARCHITECTURES
22
ChESS:  (M x 0.8M)/256 x 17 x 17 *
MATRIX: (M x 0.8M)/256 x 17 x 17 *
RaPiD:  272 + 32M + 14.45M^2 *
• Resource utilization
• Generic interconnect
* Performance (clock cycles) [Frame Size: M x 0.8M]
PROPOSED ARCHITECTURE
 2D architecture
   16 CPEs
   4 PE2s
   1 PE3
   Main Memory
   Memory Interface
 CPE (Configurable Processing Element)
   PE1
   NoC router
   Network Interface
 Current and reference block from main memory
23
[Figure: proposed architecture; a 4x4 grid of CPEs (1,1)-(4,4) connected to the Main Memory through the Memory Interface (MI); each CPE receives 32-bit current data c_d_(x,y) and reference data r_d_(x,y); the MI receives a 5-bit reference_block_id and a 16-bit data_load_control; four PE2s collect results from the CPE quadrants and feed PE3]
24
[Figure: PE1 datapath; sixteen 8-bit subtractors (1-16), each fed by a current pixel register (CPR) and a reference pixel register (RPR), feed a tree of 10-bit and 12-bit adders ending in a comparator (COMP) and a register (REG) that output the 4x4 motion vector; connections go to/from the NI, East and South]
25
NETWORK INTERFACE
[Figure: Network Interface; a Control Unit coordinates a Packetization Unit and a Depacketization Unit, exchanges input/output control signals with PE1, and sends reference_block_id and data_load_control to the MI]
26
NOC ROUTER
[Figure: router with five ports: PE1, East, West, North, South]
• Input Controller: receives packets from the NI or an adjacent router (request/ack signaling)
• Output Controller: sends packets to the NI or an adjacent router
• Ring Buffer: stores packets between a first and a last index
• Header Decoder: XY routing protocol; extracts the direction of data transfer from the header packet and updates the number of hops
27
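The Header Decoder's XY routing can be sketched as dimension-ordered routing. This assumes (x, y) mesh coordinates with y growing southward; the port names match the router above, but `xy_route` itself is an illustrative name, not the router's actual header format.

```python
def xy_route(src, dst):
    """List the output ports a packet traverses under XY routing: move along
    X (East/West) until the column matches, then along Y (North/South),
    and finally deliver to the local PE1 through the NI."""
    x, y = src
    dx, dy = dst
    hops = []
    while x != dx:
        hops.append("East" if dx > x else "West")
        x += 1 if dx > x else -1
    while y != dy:
        hops.append("South" if dy > y else "North")
        y += 1 if dy > y else -1
    hops.append("PE1")  # local delivery port at the destination CPE
    return hops
```

Because X is always resolved before Y, every router can decide the next hop from the header alone, which is why the routing decision is fully distributed and deadlock-free on a mesh.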
Req/ack flow control between Router 1 and Router 2 (32-bit packets, 1-bit req and ack signals):
Step 1: Router 1 wants to send a message to Router 2
Step 2: Router 1 sends a 1-bit request signal to Router 2
Step 3: Router 2 first checks whether it is busy; if not, it checks for available buffer space
Step 4: Router 2 sends an ack if space is available
Step 5: Router 1 sends the packet
28
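The five steps above can be modelled as a small simulation; `Router`, `buffer_size` and `send` are hypothetical names introduced for this sketch.

```python
class Router:
    """Receiving side of the handshake: grants an ack only when it is not
    busy and its ring buffer has free space (Steps 3-4)."""
    def __init__(self, buffer_size=4):
        self.buffer = []
        self.buffer_size = buffer_size
        self.busy = False

    def request(self):
        """Answer a 1-bit req with an ack (True) or no ack (False)."""
        return not self.busy and len(self.buffer) < self.buffer_size

    def receive(self, packet):
        """Accept the 32-bit packet once the ack was granted (Step 5)."""
        self.buffer.append(packet)

def send(packet, dst):
    """Steps 1-5: raise req, wait for ack, then transfer the packet."""
    if dst.request():
        dst.receive(packet)
        return True
    return False  # no ack: the sender must retry later
```

The ack-before-send ordering guarantees a packet is only put on the link when buffer space is reserved, so no packet is ever dropped inside the network.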
PE2 AND PE3
Muxes
Adders
De-muxes
Comparators
Registers
29
FAST SEARCH ALGORITHM
 Diamond Search
   9 candidate search points
   Numbers represent the order of processing the reference frames
   Directed edges are labeled with data transmission equations derived from the data dependencies
30
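The Diamond Search iteration can be sketched with the SAD of each candidate abstracted into a `cost(dx, dy)` callback. The point sets are the standard large and small diamond patterns; `diamond_search` is an illustrative name, not code from the presentation.

```python
def diamond_search(cost, start=(0, 0)):
    """Diamond Search: repeat the 9-point large diamond until its centre is
    the minimum, then refine once with the 5-point small diamond."""
    LDSP = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
            (1, 1), (1, -1), (-1, 1), (-1, -1)]       # large diamond, 9 points
    SDSP = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]  # small diamond, 5 points
    cx, cy = start
    while True:
        best = min(LDSP, key=lambda p: cost(cx + p[0], cy + p[1]))
        if best == (0, 0):      # centre wins: switch to the small diamond
            break
        cx, cy = cx + best[0], cy + best[1]
    best = min(SDSP, key=lambda p: cost(cx + p[0], cy + p[1]))
    return cx + best[0], cy + best[1]
```

On a well-behaved cost surface this evaluates far fewer candidates than full search, at the risk of stopping in a local minimum.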
EXAMPLE
[Figure: example frame, macroblock and SAD computation]
31
CONTINUED…
32
DATA TRANSFER
Data Transfer between
PE1(1,1) and PE1(1,3)
Individual Points
Intersecting Points
33
DATA LOAD SCHEDULE
34
OTHER FAST SEARCH ALGORITHMS
Hexagon
Big Hexagon
Spiral
35
FULL SEARCH
36
CONTINUED…
37
RESULTS
38
CONTINUED…
39
40