ms_defense

H.264 Intra Frame Coder System Design
Özgür Taşdizen
Microelectronics Program at Sabanci University
4/8/2005
OUTLINE
• Introduction
• Hardware Architectures For Intra Frame Coder Modules
• Top Level Intra Frame Coder Hardware
• H.264 Intra Frame Coder System
• Conclusions and Future Work
H.264 VIDEO CODING STANDARD
• The latest video coding standard
• Developed with the collaboration of ITU-T and MPEG
• Includes 3 Profiles and 14 Levels
[Figure: timeline of video coding standards, years 1984 to 2004. ITU-T: H.261, H.263, H.263+, H.263++. MPEG: MPEG-1, MPEG-4. Joint ITU-T / MPEG: H.262 / MPEG-2, H.264 / MPEG-4 Part 10.]
H.264 VIDEO CODING STANDARD
It Provides Significant Performance Gains
Average Bit Rate Savings (relative to the coder in each column)
Coder    MPEG-4 ASP   H.263 HLP   MPEG-2
H.264    38.62%       48.80%      64.46%
90-minute DVD-quality movie (Download time at 700 Kbps)
Coder          Bandwidth Required (Mbps)   Storage Utilization (MB)   Download Time (Minutes)
MPEG-2         3.0                         2025                       386
MPEG-4 (ASP)   1.8                         1234                       235
H.264          1.1                         727                        139
H.264 Encoder Block Diagram
[Figure: encoder block diagram with the blocks Current Frame, Reference Frame, Motion Estimation, Motion Compensation, Mode Decision, Choose Intra Mode, Intra Prediction, Residue, Transform, Quant, Reorder, Entropy Coder, Inverse Quant, Inverse Transform, Reconstruction, Deblocking Filter and Reconstructed Frame. The region labelled Intra Frame Coder is marked.]
Transform and Quantization Algorithms
[Figure: blocks Residue, Forward Transform, Hadamard Transform, Quantizer, VLC, Inverse Quantizer, Inverse Hadamard Transform, Inverse Transform, Reconstruction.]
H.264 Transform Algorithm
• A multiply-free 4x4 integer transform is used. It only requires additions and shifts.
• For 16x16 intra coded luminance blocks and for 8x8 chrominance blocks a second transform, Hadamard Transform, is applied on DC coefficients.
[Figure: 4x4 Forward Integer Transform, 4x4 Hadamard Transform, 2x2 Hadamard Transform, 4x4 Inverse Integer Transform.]
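The multiply-free property can be illustrated with a short sketch. It assumes the usual H.264 core matrix with rows (1,1,1,1), (2,1,-1,-2), (1,-1,-1,1), (1,-2,2,-1), which is not printed on the slide; the function names are illustrative and this is not the hardware datapath itself.

```python
def forward_transform_1d(a, b, c, d):
    """1-D core transform on four samples using only adds, subtracts and shifts."""
    s0, s1 = a + d, b + c      # sums of outer and inner sample pairs
    d0, d1 = a - d, b - c      # differences of outer and inner sample pairs
    return (s0 + s1,           #  a +  b +  c +  d
            (d0 << 1) + d1,    # 2a +  b -  c - 2d
            s0 - s1,           #  a -  b -  c +  d
            d0 - (d1 << 1))    #  a - 2b + 2c -  d


def forward_transform_4x4(block):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [forward_transform_1d(*r) for r in block]
    cols = [forward_transform_1d(*c) for c in zip(*rows)]
    # cols[j][i] holds output coefficient (i, j); transpose back to row-major.
    return [list(r) for r in zip(*cols)]
```

Each 1-D pass needs eight additions/subtractions and two one-bit shifts, so no multiplier is involved.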
H.264 Transform Algorithm
• 4x4 Forward Integer Transform is applied to all the blocks except –1, 16, 17
• 4x4 Hadamard Transform is applied to –1 if intra 16x16 mode is selected
• 2x2 Hadamard Transform is applied to 16, 17
[Figure: block numbering within a macroblock]
DC blocks: -1 (luma), 16 and 17 (chroma)
LUMA           CHROMA CB   CHROMA CR
 0  1  4  5    18 19       22 23
 2  3  6  7    20 21       24 25
 8  9 12 13
10 11 14 15
Transform Hardware
Register 0 stores: (x0+x4+x8+x12)
Register 1 stores: (x1+x5+x9+x13)
Register 2 stores: (x2+x6+x10+x14)
Register 3 stores: (x3+x7+x11+x15)
Pipelining Registers are used to increase the maximum clock frequency
Register 4 stores the result of transform operations:
(x0+x4+x8+x12) + (x1+x5+x9+x13) + (x2+x6+x10+x14) + (x3+x7+x11+x15)
2*(x0+x4+x8+x12) + (x1+x5+x9+x13) - (x2+x6+x10+x14) - 2*(x3+x7+x11+x15)
(x0+x4+x8+x12) - (x1+x5+x9+x13) - (x2+x6+x10+x14) + (x3+x7+x11+x15)
(x0+x4+x8+x12) - 2*(x1+x5+x9+x13) + 2*(x2+x6+x10+x14) - (x3+x7+x11+x15)
Quantization Hardware
QP ranges from 0 to 51.
qbits = 15+floor(QP/6)
AC Coefficients :
|Zij| = (|Wij|.MF + f) >> qbits,
sign(Zij) = sign(Wij)
DC Coefficients :
|Zij| = (|Yij|.MF + 2f) >> (qbits + 1),
sign(Zij) = sign(Yij)
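A minimal sketch of the quantization formulas above. Two things are assumptions, since the slides do not give them: the MF values are the multiplication factors commonly tabulated for H.264, and f is taken as 2^qbits/3 for intra blocks. The helper names are illustrative only.

```python
# Multiplication factors MF for QP % 6 = 0..5 (assumed, not listed on the slides).
# Columns: both indices even, both indices odd, all remaining positions.
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def mf(qp, i, j):
    """Select MF for coefficient position (i, j) of a 4x4 block."""
    even_even, odd_odd, other = MF_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def rounding_offset(qbits, intra=True):
    """f = 2^qbits / 3 for intra blocks (assumption), 2^qbits / 6 otherwise."""
    return (1 << qbits) // (3 if intra else 6)


def quantize_ac(w, qp, i, j, intra=True):
    """|Z| = (|W|.MF + f) >> qbits, sign(Z) = sign(W)."""
    qbits = 15 + qp // 6
    z = (abs(w) * mf(qp, i, j) + rounding_offset(qbits, intra)) >> qbits
    return -z if w < 0 else z


def quantize_dc(y, qp, intra=True):
    """|Z| = (|Y|.MF(0,0) + 2f) >> (qbits + 1), sign(Z) = sign(Y)."""
    qbits = 15 + qp // 6
    z = (abs(y) * mf(qp, 0, 0) + 2 * rounding_offset(qbits, intra)) >> (qbits + 1)
    return -z if y < 0 else z
```

Because MF is a constant per position and QP % 6, the only true multiplication is |W|.MF; the division by the step size becomes a right shift.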
Inverse Quantization
AC Coefficients :
W’ij = Zij.V.2^floor(QP/6)
DC Coefficients :
If QP >= 12 W’ij = Wqij.V.2^(floor(QP/6) - 2)
Else W’ij = [ Wqij.V + 2^(1 - floor(QP/6)) ] >> (2 - floor(QP/6))
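The rescaling rules can be sketched as follows. The V values are the rescaling factors commonly tabulated for H.264 and are an assumption here, as the slides do not list them. The shift-left form of the DC rule is used whenever floor(QP/6) is at least 2, because the rounding term 2^(1 - floor(QP/6)) is only an integer below that.

```python
# Rescaling factors V for QP % 6 = 0..5 (assumed, not listed on the slides).
# Columns: both indices even, both indices odd, all remaining positions.
V_TABLE = [
    (10, 16, 13),
    (11, 18, 14),
    (13, 20, 16),
    (14, 23, 18),
    (16, 25, 20),
    (18, 29, 23),
]


def v(qp, i, j):
    """Select V for coefficient position (i, j) of a 4x4 block."""
    even_even, odd_odd, other = V_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def rescale_ac(z, qp, i, j):
    """W' = Z.V.2^floor(QP/6)"""
    return (z * v(qp, i, j)) << (qp // 6)


def rescale_dc(wq, qp):
    """DC rescaling with V taken at position (0, 0)."""
    n = qp // 6
    v00 = V_TABLE[qp % 6][0]
    if n >= 2:
        return (wq * v00) << (n - 2)
    return (wq * v00 + (1 << (1 - n))) >> (2 - n)
```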
Transform and Quantization Hardware
Hardware Implementation Results
In the worst case, it takes 2500 cycles to complete the TQIQIT operations of a macroblock
FPGA implementation (works at 81 MHz and can code 27 VGA frames per second)
Resource              Excluding I/O Register Files   Including I/O Register Files
Function Generators   2497                           4054
CLB Slices            1249                           2027
Dffs or Latches       581                            583
Block Multipliers     1                              1

0.18µ ASIC implementation (works at 210 MHz and can code 70 VGA frames per second)
Design                                                            Critical Path Delay [ns]   Gate Count
Transform part of the Datapath                                    2.77                       1978
Datapath                                                          4.78                       12773
Datapath + Control Unit                                           4.8                        23162
Datapath + Control + Input Register File + Output Register File   4.8                        130505
Context Adaptive Variable Length Encoder Hardware
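CAVLC exploits the following properties of quantized 4x4 blocks:
1) After prediction, transformation and quantization, blocks are typically sparse (mostly zeros, with a few small-magnitude coefficients)
2) The highest-frequency non-zero coefficients after the zig-zag scan are often sequences of +/-1.
3) The number of non-zero coefficients in neighbouring blocks is correlated
4) The magnitude of non-zero coefficients tends to be higher at the start of the zig-zag scan (near the DC coefficient) and lower towards the high frequencies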
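The properties above map onto the symbols a CAVLC encoder extracts from each block. The sketch below only derives those symbols (number of non-zero coefficients, trailing +/-1 count, zeros before the last coefficient); it assumes the standard 4x4 zig-zag order and does not perform the table-based code-word lookup. The function name is illustrative.

```python
# Zig-zag scan order for a 4x4 block, given as raster indices (row * 4 + column).
ZIGZAG = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def cavlc_symbols(block):
    """Return (scan, total_coeff, trailing_ones, total_zeros) for a 4x4 block."""
    flat = [value for row in block for value in row]
    scan = [flat[k] for k in ZIGZAG]
    nonzero = [k for k, value in enumerate(scan) if value != 0]
    total_coeff = len(nonzero)

    # Trailing ones: up to three +/-1 coefficients, counted from the
    # highest-frequency non-zero coefficient backwards.
    trailing_ones = 0
    for k in reversed(nonzero):
        if abs(scan[k]) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break

    # Zeros located before the last non-zero coefficient in scan order.
    total_zeros = (nonzero[-1] + 1 - total_coeff) if nonzero else 0
    return scan, total_coeff, trailing_ones, total_zeros
```

For a sparse block, these three small numbers plus a few coefficient levels describe the whole block, which is why the observations listed above translate into short code words.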
Intra Prediction Hardware
• 9 prediction modes for 4x4 luma blocks
• 4 prediction modes for 16x16 luma and 8x8 chroma blocks
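As an illustration, three of the nine 4x4 luma modes (vertical, horizontal and DC) can be sketched as below. The sketch assumes both the top and left neighbouring pixels are available; the six directional diagonal modes and the neighbour-availability handling are omitted, and the function signature is hypothetical.

```python
def predict_4x4(mode, top, left):
    """Predict a 4x4 luma block from reconstructed neighbours.

    top:  the 4 reconstructed pixels directly above the block
    left: the 4 reconstructed pixels directly to its left
    """
    if mode == "vertical":
        # Every row repeats the pixels above the block.
        return [list(top) for _ in range(4)]
    if mode == "horizontal":
        # Every row is filled with the pixel to its left.
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":
        # Rounded mean of the eight neighbouring pixels.
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("mode not covered by this sketch: " + mode)
```

The 16x16 luma and 8x8 chroma modes follow the same idea on larger blocks, with a plane mode added to vertical, horizontal and DC.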
[Figure: intra prediction hardware. Inputs from Top-Level, Reconstructed Pixels, Address Generation Hardware, Neighbouring Buffers, Internal Buffers, Top Level Mode Controller, Controller and Datapath for 4x4 Luma Prediction Modes, Controller and Datapath for 16x16 Luma Prediction Modes, Controller and Datapath for 8x8 Chroma Prediction Modes, Output MUX, Prediction Buffer (384x8).]
Top Level Intra Frame Coder Hardware
[Figure: Input Register File, SEARCH HARDWARE, Pipelining Register File, CODER HARDWARE, Output Register File. Macroblock-level pipeline timeline of the functional units: Search Hardware and Coder Hardware working on the 1st to 4th MBs, with the time axis in cycles (4000, 8000, 12000, 16000).]
CIF @ 30 fps requires processing 11880 Macroblocks per second (396 Macroblocks per frame)
Clock cycles available per Macroblock:
Level                 @30MHz   @40MHz   @50MHz   @60MHz   @70MHz   @80MHz
2.0 (CIF @30 fps)     2525     3367     4208     5050     5892     6734
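The cycle budgets in the table follow from dividing the clock frequency by the macroblock rate. A minimal sketch, assuming CIF is 352x288 pixels (22 x 18 = 396 macroblocks per frame, so 11880 macroblocks per second at 30 fps); the function names are illustrative.

```python
def macroblocks_per_second(width, height, fps):
    """Number of 16x16 macroblocks that must be processed each second."""
    return (width // 16) * (height // 16) * fps


def cycle_budget(clock_hz, width=352, height=288, fps=30):
    """Whole clock cycles available per macroblock at the given clock."""
    return clock_hz // macroblocks_per_second(width, height, fps)


if __name__ == "__main__":
    for mhz in (30, 40, 50, 60, 70, 80):
        print(mhz, "MHz:", cycle_budget(mhz * 10**6), "cycles per macroblock")
```

The search and coder stages each have to finish a macroblock within this budget for the pipeline to sustain the target frame rate.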
Search Hardware
[Figure: search hardware. 16x16 path: Current MB (384 x 8), Neighbors, Luma 16x16 and Chroma 8x8 Intra Pred., Predicted MB (384 x 8), Residue, Mux, Hadamard Transform, Reg. for 16 DC coefs. 4x4 path: Current MB (256 x 8), Neighbors, Luma 4x4 Intra Pred., Predicted MB (256 x 8), Residue, Hadamard Transform. Both paths feed Mode Decision (inputs include QP), which outputs the Mode.]
Mode Decision
SATD based mode decision algorithm
1) Compute the cost of each 4x4 mode
   Select the 4x4 mode with lowest cost
2) Compute the cost of each 16x16 mode
   Select the 16x16 mode with lowest cost
3) Compute the cost of each 8x8 mode
   Select the 8x8 mode with lowest cost
4) Compare selected 4x4 and 16x16 costs and select the best mode
5) Start the coder hardware with selected mode information
[Figure: Intra 4x4 vs Intra 16x16 Cost Comparator with inputs Cost4x4, Cost16x16 and λ << 3, a Mux, an Add/Sub unit, a Register and the Result output.]
Register contents of the cost comparator:
1. Cycle: Register = 8 x λ
2. Cycle: Register = 16 x λ
3. Cycle: Register = 24 x λ
4. Cycle: Register = 4x4cost + 24 x λ
5. Cycle: Register = 16x16cost - (4x4cost + 24 x λ)
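The comparison between the Intra 4x4 and Intra 16x16 costs can be sketched as follows. Several things are assumptions: the symbol missing from the register expressions is taken to be an integer Lagrangian multiplier (named lam here), the input layout (one cost dictionary per 4x4 block and one for the 16x16 modes) is hypothetical, and ties are resolved in favour of Intra 4x4.

```python
def decide_macroblock_mode(costs_4x4, costs_16x16, lam):
    """Pick Intra 4x4 or Intra 16x16 for one macroblock.

    costs_4x4:   list of 16 dicts, one per 4x4 block, mapping mode -> cost
    costs_16x16: dict mapping 16x16 mode -> cost
    lam:         integer multiplier (assumed meaning of the dropped symbol)
    """
    # Steps 1 and 2: lowest-cost mode per 4x4 block and for the 16x16 block.
    best_4x4 = [min(c, key=c.get) for c in costs_4x4]
    cost_4x4 = sum(c[m] for c, m in zip(costs_4x4, best_4x4))
    best_16x16 = min(costs_16x16, key=costs_16x16.get)
    cost_16x16 = costs_16x16[best_16x16]

    # Register sequence: accumulate (lam << 3) three times, add the 4x4
    # cost, then subtract the total from the 16x16 cost.
    register = 0
    for _ in range(3):
        register += lam << 3
    register += cost_4x4
    difference = cost_16x16 - register

    # A negative difference means Intra 16x16 is cheaper (tie goes to 4x4).
    if difference < 0:
        return "intra16x16", best_16x16
    return "intra4x4", best_4x4
```

Only the sign of the final register value is needed to drive the selection, so a single add/subtract unit and one register suffice.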
High Speed Hadamard Transform Hardware
• Performs SATD computation
• 13-bit adders/subtractors
• Requires only 18 cycles for a 4x4 Block
• Two-stage pipeline
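A software sketch of the SATD computation: the difference between the current and predicted 4x4 blocks is passed through a 4x4 Hadamard transform and the absolute values of the 16 coefficients are summed. No normalization (such as a final division by 2) is applied here, which is an assumption; the function names are illustrative and the code does not model the pipelined datapath.

```python
def hadamard_1d(a, b, c, d):
    """1-D 4-point Hadamard transform using only additions and subtractions."""
    s0, s1 = a + d, b + c
    d0, d1 = a - d, b - c
    return (s0 + s1, d0 + d1, s0 - s1, d0 - d1)


def satd_4x4(current, predicted):
    """Sum of absolute transformed differences of two 4x4 blocks."""
    diff = [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current, predicted)]
    rows = [hadamard_1d(*r) for r in diff]          # transform each row
    cols = [hadamard_1d(*c) for c in zip(*rows)]    # then each column
    return sum(abs(value) for col in cols for value in col)
```

For 8-bit pixels the differences lie in [-255, 255], and two butterfly passes grow the magnitude by a factor of up to 16, which is consistent with the 13-bit adders/subtractors mentioned above.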
[Figure: Hadamard transform datapath with inputs z0 to z15, add/sub units, pipelining registers (P. Register) and an output Register.]
Coder Hardware
[Figure: coder hardware. Current MB (384 x 8), Predicted MB (384 x 8), Residue (384 x 9), Transform, Quant, HT, Reg. file (384 x 16), CAVLC, Bitstream, IHT, Inverse Quant, Inverse Transform, Reg. file (192 x 32), Intra Pred. 16 x 16, Reconstruct, Reconstructed MB (384 x 8).]
Scheduling of Intra 4x4 modes
[Figure: schedule of the modules Intra Prediction, Residue, TQ, IQIT (TQIQIT), CAVLC and Reconstruction for the 1st Block and 2nd Block. Time axis (cycles): 0, 24, 42, 86, 142, 160, 202, 246, 302, 320.]
Worst Case cycle counts required to complete a 4x4 block :
TQIQIT = 100, CAVLC = 120, Residue&Reconstruction = 18, Intra Prediction = 24
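The tick marks of the schedule can be reproduced from the worst-case cycle counts. The sketch assumes the stages of a block run back to back and that the next block starts only after reconstruction, because its prediction needs the reconstructed neighbours. The split of TQIQIT into TQ = 44 and IQIT = 56 cycles is inferred from the tick marks (42, 86, 142), not stated in the text.

```python
# Worst-case stage lengths in cycles; the TQ / IQIT split is an inferred assumption.
STAGES = [
    ("Intra Prediction", 24),
    ("Residue", 18),
    ("TQ", 44),
    ("IQIT", 56),
    ("Reconstruction", 18),
]


def block_schedule(num_blocks):
    """Return (block, stage, start, end) tuples for sequentially coded 4x4 blocks."""
    timeline = []
    t = 0
    for block in range(num_blocks):
        for name, cycles in STAGES:
            timeline.append((block, name, t, t + cycles))
            t += cycles
    return timeline


if __name__ == "__main__":
    for block, name, start, end in block_schedule(2):
        print(f"block {block + 1}: {name:<16} {start:>4} to {end:>4}")
```

Under these assumptions each 4x4 block occupies 160 cycles, so the 120-cycle CAVLC of one block can overlap with the processing of the next.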
Scheduling of Intra 16x16 modes
[Figure: schedule of the modules Intra Prediction, Residue, TQ, HT, IQIT (TQIQIT), CAVLC and Reconstruction for the 1st Block, 2nd Block, ..., 16th Block. Time axis (cycles): 0, 24, 42, 48, 75, 86, 130, 384, 402, 746, 800, 860, 880, 920, 1040.]
Implementation Results for H.264 Intra Frame Coder Hardware
• Synthesized at 61.4 MHz and Placed & Routed at 53.8 MHz.
• The total equivalent gate count is 1,051,458
Device Utilizations for XC2V8000 FPGA
Resources             Used    Available   Utilization
IOs                   418     1108        37.73%
Global Buffers        2       16          12.50%
Function Generators   21404   93184       22.97%
CLB Slices            10702   46592       22.97%
Dffs or Latches       3881    96508       4.02%
Block RAMs            1       168         0.60%
Block Multipliers     1       168         0.60%
System Overview
• PC is used to develop Verilog modules and debug the system
• Multi-ICE Debugger communicates with the development board
• Development Board is used for testing the designed hardware
• Color LCD Panel is used for visual verification
ARM-based Development Platform
• Logic Tile: Xilinx Virtex II 8000 FPGA
• Versatile Platform Baseboard: ARM926EJ-S Processor based Development Chip, Xilinx Virtex II 2000 FPGA
• ARM AMBA 2.0
Software Implementation
• Matlab and C codes are developed
• ARM AXD Tool is used to debug the system
• C codes run on ARM926EJ-S processor
• SRAM available on Logic Tile is used to store image data
[Figure: processing flow with SRAM buffers between the stages: capturing the image in RGB format; converting the image from RGB format to YCbCr format (4:2:0 sampling); partitioning the image into macroblocks; H.264 Intra Frame Coder Hardware; reconstructing the image in raster-scan order; converting the image from YCbCr format to RGB format; displaying the reconstructed image.]
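The software preprocessing steps can be sketched as below. The slides do not state which conversion matrix or chroma filter was used, so the full-range BT.601 (JFIF-style) coefficients and the 2x2 averaging for 4:2:0 sampling are assumptions, and all function names are illustrative.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (full-range BT.601, assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clamp(value):
        return max(0, min(255, int(round(value))))

    return clamp(y), clamp(cb), clamp(cr)


def subsample_420(plane):
    """Halve a chroma plane in both directions by averaging 2x2 neighbourhoods.

    Assumes even width and height.
    """
    height, width = len(plane), len(plane[0])
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1] + 2) >> 2
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]


def luma_macroblocks(luma):
    """Yield (mb_x, mb_y, block) for each 16x16 luma block in raster order."""
    for y in range(0, len(luma), 16):
        for x in range(0, len(luma[0]), 16):
            yield x // 16, y // 16, [row[x:x + 16] for row in luma[y:y + 16]]
```

With 4:2:0 sampling each macroblock carries 256 luma samples and two 8x8 chroma blocks, 384 samples in total, which matches the 384 x 8 macroblock buffers used by the hardware.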
Hardware Implementation
ARM Development Board implements tri-state AHB buses.
An AHB master is designed for reading and writing the image data to the SRAMs available on the logic tile.
2 SRAM controllers are instantiated in the design as slaves on the AHB M1 and AHB M2 buses.
System Arbiter controls the multiplexing.
Design Flow
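[Figure: design flow. Synthesis in Leonardo Spectrum (Compiler, Logic Optimizer) takes the Verilog modules and the Synthesis Constraints with High Effort for Speed; if the constraints are not met, the HDL files or the constraints are modified; if they are met, a netlist for XC2V8000 is produced. Place and Route in Xilinx Project Navigator (Translator, Mapper, Placer, Router) takes the netlist, the Place and Route Constraints, High Effort for Speed and the Bitstream Options; if the constraints are not met, the constraints are modified; if they are met, the resulting bitstream for XC2V8000 is generated.]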
Conclusions
• Transform – Quant architecture is designed and verified to work at 81 MHz
• Mode Decision, Intra Prediction and CAVLC are integrated.
• Top-Level design is synthesized at 61.4 MHz and placed & routed at 53.8 MHz.
• Device utilization for XC2V8000 FPGA is approximately 23% with a total equivalent gate count of 1,051,458.
• The H.264 Intra Frame Coder System is verified to work on an ARM Versatile Platform development board.
Future Work
• Implementing header generation functionality
• Further verification by decoding the generated bitstream using an H.264 compliant decoder
• Implementing low-power techniques such as clock gating
• Adding a camera to the system for real-time video capturing and coding
• Developing an ASIC implementation and fabricating a prototype
• Creating a complete H.264 video coding system by integrating motion estimation, motion compensation, deblocking filter, intra vs. inter mode decision and rate control units
Thanks
?
Questions...