Presentation Part A - High Speed Digital Systems Lab

JPEG decompression algorithm implementation using HLS
Performed by: Dor Kasif, Or Flisher
Instructor: Rolf Hilgendorf
Final part A presentation
Winter 2013-14
The Necessity
JPEG is the most widely used standard for compression of digital images.
Decompression is usually done in software, which takes a lot of CPU resources.
The solution
Implementation of the JPEG decompression/compression
algorithm on dedicated hardware.
Implementation on hardware
Hardware is designed with Hardware Description Languages (HDLs such as VHDL and Verilog).
HDLs describe concurrent behavior, which makes them hard to use in complex designs. Programming languages (C/C++, Java, etc.) are easier to comprehend.
The solution
Using HLS (High Level Synthesis), which enables the use of a programming language as the design and synthesis language.
Our objective
Developing a JPEG decompressor in a programming language (namely C++), converting it to a Hardware Description Language (VHDL) using Vivado HLS, and implementing it on an FPGA.
The decompressed image will then be available for display on screen in RGB format.
Project Goals
• Implementing the JPEG decompression algorithm on an FPGA using HLS.
• Displaying the decompressed image in RGB format on screen.
• Optimizing the implementation to reach the best performance possible within the performance envelope of the FPGA.
• Comparing the software-decompressed picture to the hardware-decompressed picture in terms of the Structural Similarity Index Metric (SSIM).
The VIVADO HLS
• Allows us to design hardware using a programming language, which is much easier to work and design with.
• When using a programming language, which is inherently serial, to design hardware, which is inherently concurrent, we can't rely on the well-known programming paradigms, so we need to combine several disciplines.
• The required modifications include replacing non-synthesizable commands, reducing the usage of system resources and optimizing overall system performance.
The VIVADO HLS types
• In software there are numerous variable types: int, char, float, etc.
• A major disadvantage is that parts of a variable, such as single bits, cannot be accessed directly. Programming languages do offer bitwise operations, but not at a granularity suited to hardware design. Moreover, when designing hardware, system resources (board area, wiring, etc.) are valuable, so for some variables we need to control the amount of memory allocated.
• VIVADO HLS introduces new types: ap_int<>, ap_uint<>, ap_fixed<>, etc.
• These types not only let us set the amount of memory used, but also give us access to any part of the variable at any granularity (down to a single bit).
• For example: a variable with only two options, "0" ("no") and "1" ("yes").
• In C/C++: we can use char, which takes a byte (8 bits) of memory.
• In VIVADO HLS: ap_uint<1> takes 1 bit of memory.
The VIVADO HLS functions
• With the VIVADO HLS types we can use some new and helpful functions. For example, we used:
• Variable.range: accesses a chosen range of bits of the variable.
• Variable.set: sets a chosen bit of the variable to "1".
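As a rough illustration of what range and set do, the sketch below emulates them on a plain 64-bit integer so that it compiles without the Vivado HLS headers. The helper names get_range and set_bit are assumptions made for this example; in Vivado HLS the corresponding calls on an ap_uint<> value x would be x.range(hi, lo) and x.set(i).

```cpp
#include <cstdint>

// Portable stand-ins for ap_uint<W>::range(hi, lo) and ap_uint<W>::set(i).
// get_range and set_bit are hypothetical helper names, not Vivado HLS API.

// Return bits hi..lo (inclusive, lo <= hi < 64) of v, right-aligned.
inline std::uint64_t get_range(std::uint64_t v, unsigned hi, unsigned lo) {
    const unsigned width = hi - lo + 1;
    const std::uint64_t mask =
        (width >= 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << width) - 1);
    return (v >> lo) & mask;
}

// Return v with bit i (i < 64) forced to 1.
inline std::uint64_t set_bit(std::uint64_t v, unsigned i) {
    return v | (std::uint64_t{1} << i);
}
```

For instance, get_range(0xABCD, 11, 4) picks out the middle byte 0xBC, and set_bit(0, 3) yields 8.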
Progress until midterm presentation
• Acquired a C++ encoding/decoding algorithm and modified it for our needs: removing the use of non-standard libraries, removing the user interface, adjusting the algorithm to process a single color channel, etc.
• Developed auxiliary Matlab scripts for handling the images and for SSIM computation.
Progress until midterm presentation
• Modified the decoding algorithm for synthesis in HLS.
• Eliminated the use of cosine functions in the decoding process.
• Adjusted the decoder and test bench for 8x8 blocks plus a handshake protocol.
• Replaced the C++ floating point types with VIVADO HLS fixed point types.
Progress since midterm presentation
• Making the algorithm usable for the standard RGB color channels.
• Optimizing the hardware implementation using code optimizations and directives.
• Simulating the synthesized VHDL, thus getting a timing estimate and identifying the system's bottlenecks.
Implementing the encoding/decoding process
Block diagram
Highest Hierarchy
[Figure: the test bench (in C++) sends the encoded JPEG picture to the JPEG-DECOMPRESSOR module (converted from C++ to VHDL using HLS) in 27-bit words. The module returns 8x8 decompressed blocks. The two sides are linked by a handshake protocol.]
Block diagram
Highest Hierarchy
• The test bench file feeds the encoded image bit stream into the module file, i.e. the decoder.
• The test bench sends 27 bits to the module each time.
• The maximum code length of the AC/DC Huffman tables is 27 bits.
• The module constructs a full sub matrix block of size 8x8.
• The module announces the completion of the 8x8 sub matrix through a handshake procedure.
• The image will then be ready for display on screen in RGB format.
Let's See What's Under the Hood
Block diagram
module
[Figure: 27-bit input -> Building the sub matrix -> Huffman decoding -> Zig Zag -> De-quantization -> Inverse DCT and adding 128 to the block -> 8x8 decompressed block. A handshake protocol controls the input.]
The module will do this operation for all (640x480)/(8x8) = 4800 blocks.
Encountered Problems and Solutions
Before Midterm
1. The acquired decompression algorithm wasn't synthesizable.
• Solution: modifying the decoding algorithm for synthesis and adjusting it for 8x8 blocks plus a handshake protocol.
2. Problems with HLS handling a matrix of char pointers (strings).
• Solution: adding a binary-to-integer converter for bit-by-bit comparison.
3. Use of trigonometric functions.
• Solution: replacing the trigonometric functions with constants (pre-calculated matrices).
4. Problems with HLS handling C++ floating point types and their multiplication.
• Solution: replacing the C++ floating point types with VIVADO HLS fixed point types.
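The pre-calculated matrix of problem 3 can be generated once, off-line, and then stored as constants. The sketch below is one way to do that, assuming the matrix holds the textbook 8-point inverse DCT basis; the slides do not list the actual values, and the name build_dct_coeff is made up for this example. In the hardware version the resulting numbers would be stored in a VIVADO HLS fixed point type whose word length is not stated here.

```cpp
#include <cmath>

const int kDctSize = 8;

// Assumed definition (textbook 8-point IDCT basis, not taken from the slides):
//   coeff[x][u] = (C(u) / 2) * cos((2x + 1) * u * pi / 16),
//   with C(0) = 1 / sqrt(2) and C(u) = 1 for u > 0.
// Run once on a workstation; the results become constants in the design.
void build_dct_coeff(double coeff[kDctSize][kDctSize]) {
    const double pi = std::acos(-1.0);
    for (int x = 0; x < kDctSize; ++x) {
        for (int u = 0; u < kDctSize; ++u) {
            const double c = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            coeff[x][u] = 0.5 * c * std::cos((2 * x + 1) * u * pi / 16.0);
        }
    }
}
```

With this definition no cosine is evaluated inside the synthesized design; the decoder only reads the table.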
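Block diagram
Building the sub matrix
[Figure: elements shown are the input buffer (3*27 bits), a char-to-int converter, getDCvalue (for the DC coefficient), getACvalue (for the AC coefficients), the sub matrix buffer and the complete matrix. The handshake protocol is triggered when the buffer has fewer than 27 bits.]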
Code improvements
Eliminating the dependency on the binary-to-integer converter.
• The VIVADO HLS types, unlike the standard C/C++ types, allow us to access the individual bits of a variable.
• Using the range function, we can compare the buffer bit by bit with the members of the AC/DC Huffman tables, without converting the buffer into an integer.
• Improvement in both area and performance (no need for the converter module).
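A minimal sketch of such a comparison follows, written with plain integers so it compiles without the HLS headers. The HuffCode struct and the matches_code helper are hypothetical names for this example; in VIVADO HLS the slice taken here with a shift would be a range call on the buffer.

```cpp
#include <cstdint>

// One Huffman table entry: the code bits (right-aligned) and the code length.
// HuffCode and matches_code are hypothetical names used for illustration.
struct HuffCode {
    std::uint32_t bits;
    unsigned length;  // 1..27
};

// True when the first `length` bits of a 27-bit window equal the code.
// window27 must hold its 27 bits right-aligned (value below 2^27).
// With an ap_uint<27> the slice would be window.range(26, 27 - length).
inline bool matches_code(std::uint32_t window27, HuffCode code) {
    const std::uint32_t top = window27 >> (27 - code.length);
    return top == code.bits;
}
```

When the comparisons against all table entries are unrolled, each one becomes a small equality comparator working directly on the buffer bits.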
Block diagram
Building the sub matrix
• The input is inserted into a buffer.
• For the DC coefficient of the matrix, the input is passed to "getDCvalue", which finds the DC coefficient.
• The same applies to the AC coefficients and "getACvalue".
• The coefficients are stored inside another buffer.
• When there are fewer than 27 bits inside the buffer, the handshake protocol is activated and the module asks for another input.
• When the matrix is complete, the processed image continues into the Huffman decoding stage.
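The refill rule in the bullets above can be sketched as follows. This is only a software model under stated assumptions: the BitBuffer struct and its member names are invented for this example, and std::bitset<81> stands in for an 81-bit (3*27) hardware register.

```cpp
#include <bitset>
#include <cstdint>

// Software model of the input buffer; BitBuffer is a hypothetical name.
// Valid bits are kept right-aligned, oldest bits in the highest positions.
struct BitBuffer {
    std::bitset<81> bits;
    unsigned count = 0;  // number of valid bits currently held

    // Handshake condition: ask the test bench for another 27-bit word.
    bool needs_input() const { return count < 27; }

    // Append a 27-bit word below the bits already held.
    // Precondition: count + 27 <= 81.
    void push27(std::uint32_t word) {
        bits = (bits << 27) | std::bitset<81>(word & 0x7FFFFFFu);
        count += 27;
    }

    // Remove and return the n oldest bits (1 <= n <= 27, n <= count).
    std::uint32_t take(unsigned n) {
        const std::bitset<81> mask((std::uint64_t{1} << n) - 1);
        const std::uint32_t out = static_cast<std::uint32_t>(
            ((bits >> (count - n)) & mask).to_ulong());
        count -= n;
        // Clear everything above the remaining valid bits.
        bits = (bits << (81 - count)) >> (81 - count);
        return out;
    }
};
```

A decoder loop would call push27 whenever needs_input() is true and otherwise consume bits with take as codes are recognized.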
Code improvements
Combining the de-zigzag and Huffman decoding operations and improving them
• The de-zigzag operation may take a lot of time (storing each member in memory at its place in the matrix). We also know that, as part of the JPEG decoding algorithm, many of the matrix members will be zeroes.
• First, we zeroed the matrix at the beginning of the decoding operation (to avoid data dependencies). We also converted the de-zigzag operation to MUX-like code and integrated it into the Huffman decoding operation. The MUX costs us area but improves performance.
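The idea can be sketched as a lookup from stream position to block position, applied at the moment each coefficient is decoded. The table below is the standard JPEG zigzag order; the function names and the 16-bit coefficient type are assumptions for this example, not the project's actual code.

```cpp
#include <cstdint>

// Standard JPEG zigzag order: kZigzag[k] is the row-major position inside
// the 8x8 block of the k-th coefficient in the encoded stream.
static const std::uint8_t kZigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

// Zero the whole block once, before decoding starts.
inline void clear_block(std::int16_t block[64]) {
    for (int i = 0; i < 64; ++i) block[i] = 0;
}

// Write a decoded coefficient straight to its final position, so no
// separate de-zigzag pass is needed. Hypothetical helper name.
inline void place_coefficient(std::int16_t block[64], unsigned k,
                              std::int16_t value) {
    block[kZigzag[k]] = value;
}
```

Because the block starts out zeroed, runs of zero coefficients only advance k and never touch memory.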
Block diagram
module
[Figure: 27-bit input -> Building the sub matrix -> Huffman decoding and de-zigzag -> De-quantization -> Inverse DCT and adding 128 to the block -> 8x8 decompressed block. A handshake protocol controls the input.]
The module will do this operation for all (640x480)/(8x8) = 4800 blocks.
Block diagram
module
• The module then performs the following on the sub matrix:
• Decoding of the compressed image using Huffman and differential decoding, together with the de-zigzag operation
• De-quantization
• Inverse DCT and adding 128 to the image bit map
• The module sends the sub matrix block back to the test bench, where the test bench assembles the reconstructed image.
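Two of the steps listed above are simple enough to sketch directly. The code assumes a single color channel (so one DC predictor), 16-bit decoded coefficients and an 8-bit quantization table; the names DcPredictor and dequantize are made up for this example.

```cpp
#include <cstdint>

// Differential decoding: each block's DC value is coded as the difference
// from the previous block's DC value. Hypothetical helper type.
struct DcPredictor {
    int previous = 0;
    int decode(int diff) {
        previous += diff;
        return previous;
    }
};

// De-quantization: multiply every coefficient by the matching entry of the
// quantization table (the table contents come from the encoder side).
inline void dequantize(const std::int16_t coeff[64],
                       const std::uint8_t qtable[64],
                       std::int32_t out[64]) {
    for (int i = 0; i < 64; ++i) {
        out[i] = static_cast<std::int32_t>(coeff[i]) * qtable[i];
    }
}
```

The 64 multiplications are independent of each other, which is what makes this loop a natural target for pipelining.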
Time dependency
The potential bottlenecks
[Figure: 27-bit input -> inserting the input in the buffer (negligible) -> getDCvalue / getACvalue, repeated while the sub matrix is not complete (applied multiple times!) -> Huffman decoding and de-zigzag -> De-quantization -> Inverse DCT and adding 128 to the block (takes a lot of time!) -> 8x8 decompressed block.]
Improvements
Improvements to the decoder
• After adjusting the decoding algorithm for synthesis, we started working on improving the decoder.
• Code improvements: changing the C++ code to better fit a hardware implementation.
• Directives: targeting specific hardware to be used.
VIVADO HLS Directives
• When synthesizing a module, VIVADO HLS tries to create the best hardware it can in terms of speed and resource utilization, but it has some limitations.
• First, improvements and changes must be applied to the C++ code to better fit a hardware implementation. This method is also limited, as it doesn't let us decide which specific hardware is used.
• Using directives, we can influence the hardware generation process and improve it further, by targeting a certain resource to be used and by adding hardware to gain speed.
Directives we used
• Pipeline: allows us to pipeline a piece of hardware. Pipelining a loop also unrolls the loops nested inside it for parallel calculation.
• Unrolling and pipelining use more resources, but may improve performance.
• Array partition: partitions the memory into smaller elements, allowing us to access different memory elements in parallel.
• Loop_trip_count: some loops are variable dependent, meaning that the number of iterations depends on the input. In that case the tool cannot report the operation time. To get bounds for estimating the operation time, this directive lets us specify an upper and a lower bound for the number of iterations.
• Later, once enough improvements have been made, this directive will become obsolete.
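As a small illustration of how these directives appear in source code, the sketch below places them as pragmas on a toy function. The function sum_prefix and the chosen bounds are invented for this example. A desktop compiler ignores the HLS pragmas, so the same file can still be run in a software test bench. LOOP_TRIPCOUNT affects only the latency figures in the report, not the generated hardware.

```cpp
// Sum the first n entries of a 64-entry table, where n is known only at
// run time. Toy example; sum_prefix is a hypothetical function.
int sum_prefix(const int table[64], int n) {
// Split the table into individual registers so that several entries
// can be read in the same clock cycle.
#pragma HLS ARRAY_PARTITION variable=table complete dim=1
    int total = 0;
    for (int i = 0; i < n; ++i) {
// The loop bound depends on the input, so give the tool a range to use
// when it reports latency.
#pragma HLS LOOP_TRIPCOUNT min=1 max=64
// Try to start a new iteration every clock cycle.
#pragma HLS PIPELINE II=1
        total += table[i];
    }
    return total;
}
```

The same directives can also be kept out of the source and applied from the tool's directive file, which leaves the C++ untouched.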
Code improvements
Limiting memory area
• By using VIVADO HLS types, we can restrict the memory area each variable takes, thus reducing area utilization.
Code improvements
Input_to_buffer and shift_bit
Replacing the functions that handle the buffer with VIVADO HLS functions
• Previously, in order to handle the bits stored in the buffer (inserting the input or shifting the bits) we applied a FIFO-like algorithm, inserting or shifting the input bit by bit, which takes time and resources.
• Using the VIVADO HLS range function, we can replace this algorithm with a simple parallel insertion, greatly reducing both time and resource usage.
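The difference between the two approaches can be sketched in portable C++, with std::bitset<81> standing in for the 81-bit hardware buffer. Both function names are assumptions for this example; in VIVADO HLS the parallel version would be a single assignment through range.

```cpp
#include <bitset>
#include <cstdint>

typedef std::bitset<81> Buffer81;

// FIFO-like version: shift the buffer and insert one bit per iteration,
// most significant bit of the 27-bit word first (27 steps).
inline Buffer81 insert_serial(Buffer81 buf, std::uint32_t word27) {
    for (int i = 26; i >= 0; --i) {
        buf <<= 1;
        buf[0] = ((word27 >> i) & 1u) != 0;
    }
    return buf;
}

// Parallel version: one shift and one OR place all 27 bits at once.
inline Buffer81 insert_parallel(Buffer81 buf, std::uint32_t word27) {
    return (buf << 27) | Buffer81(word27 & 0x7FFFFFFu);
}
```

Both functions produce the same buffer contents; the second simply has no loop to schedule.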
Code improvements
getDCvalue and getACvalue
Improving the exponentiation function.
• In our code there are several places where we need to calculate 2^i, where i is the position of the digit, but the exponentiation function isn't synthesizable.
• At first, we calculated it using a loop: for(j=0;j<i;j++) p=2*p;
This took time (about i clock cycles) and hardware.
• In binary, 2^i is a zeroed number with a "1" in the i-th bit.
• For example: 2^2=100, 2^4=10000, and so on.
• We can use the set function to set the i-th bit of a zeroed variable to "1": almost no hardware and a lot less time (1 clock cycle).
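Both variants can be written side by side in plain C++. The function names are invented for this example, and a left shift of 1 plays the role of calling set on a zeroed ap_uint<> variable.

```cpp
#include <cstdint>

// Original approach: multiply by two i times (about i clock cycles).
inline std::uint32_t pow2_loop(unsigned i) {
    std::uint32_t p = 1;
    for (unsigned j = 0; j < i; ++j) p = 2 * p;
    return p;
}

// Improved approach: put a single "1" at bit position i (i < 32).
// Equivalent to calling set(i) on a zeroed VIVADO HLS variable.
inline std::uint32_t pow2_set(unsigned i) {
    return std::uint32_t{1} << i;
}
```

For any i below 32 the two functions return the same value, for example 4 for i = 2 and 16 for i = 4.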
Directive improvements
getDCvalue and getACvalue
Pipelining the getACvalue and getDCvalue functions
• In a single module operation, the getACvalue function is invoked multiple times, until all of the sub matrix AC coefficients are recovered. This is a potential bottleneck.
• In order to accelerate the getACvalue function, we used the pipeline directive.
• Using the pipeline directive also forced the function to unroll its loops, so it compares the input to the different members of the AC/DC Huffman tables simultaneously, thus greatly accelerating the getACvalue function.
• We used the same method on getDCvalue (also part of our bottleneck).
Directive improvements
DCT FUNCTION
Pipelining loops in the DCT function
• The inverse DCT function computes:
\[ f[x][y] = \sum_{u=0}^{7} \sum_{v=0}^{7} F[u][v] \cdot dct\_coeff[x][u] \cdot dct\_coeff[y][v] + 128 \]
where dct_coeff is a constant matrix and F[u][v] is our sub matrix before the inverse DCT.
• Because dct_coeff is a fixed point type, multiplication of its members is a huge bottleneck in our design.
• We applied the pipeline directive to the loop just outside the innermost loop, which also forces the innermost loop to be unrolled.
• Using this, we can perform 8 multiplications at once, greatly accelerating the function, but at the cost of much higher resource utilization.
Directive improvements
DCT FUNCTION
Array partitioning the memory members
• As we said, with the pipeline directive we were supposedly able to make 8 calculations at once. But we can load only up to two members from the memory at a time, which limits the calculation.
• To avoid this, we partitioned the memories holding dct_coeff and F into 8 blocks, allowing us to load 8 members for multiplication at once.
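The pipelining and partitioning described for the DCT function can be sketched as below. This is an illustration under assumptions, not the project's code: float stands in for the fixed point type, the function name idct8x8 is made up, and the exact partitioning options used in the project are not given in the slides. The product and the accumulation are written as two separate statements.

```cpp
const int kBlock = 8;

// f[x][y] = sum over u, v of F[u][v] * coeff[x][u] * coeff[y][v], plus 128.
void idct8x8(const float F[kBlock][kBlock],
             const float coeff[kBlock][kBlock],
             float f[kBlock][kBlock]) {
// Split each row into 8 separate elements so that the 8 values needed
// by the unrolled inner loop can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable=F complete dim=2
#pragma HLS ARRAY_PARTITION variable=coeff complete dim=2
    for (int x = 0; x < kBlock; ++x) {
        for (int y = 0; y < kBlock; ++y) {
            float a = 0.0f;
            for (int u = 0; u < kBlock; ++u) {
// Pipelining this loop makes the tool unroll the v loop inside it.
#pragma HLS PIPELINE
                for (int v = 0; v < kBlock; ++v) {
                    const float b = F[u][v] * coeff[x][u] * coeff[y][v];
                    a += b;  // accumulate in a separate statement
                }
            }
            f[x][y] = a + 128.0f;
        }
    }
}
```

Partitioning along the second dimension matches the unrolled index v, which is the index that changes among the 8 parallel multiplications.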
Encountered problems and solutions
5. The DCT function took more than our designated clock cycle time.
• After using the pipeline directive and array partitioning, we encountered a new problem: the calculation took longer than our designated clock cycle time.
• We learned that the problem was that the summation was performed in the same statement as the multiplication:
\[ a \mathrel{+}= F[u][v] \cdot dct\_coeff[x][u] \cdot dct\_coeff[y][v] \]
• Solution: separating the multiplication from the summation:
\[ b = F[u][v] \cdot dct\_coeff[x][u] \cdot dct\_coeff[y][v] \]
\[ a \mathrel{+}= b \]
Directive improvements
De-quantization FUNCTION
Pipelining loops in the de-quantization function
• Same as DCT
Array partitioning the memory members
• Same as DCT
Improvements per block
BUFFER handling: replacing the FIFO algorithm with the HLS range function.
getACvalue / getDCvalue: applying pipeline directives and improving the exponentiation function.
DCT: pipelining loops in the DCT function and array partitioning the members calculated.
Huffman decoding and de-zigzag: merging the Huffman decoding and de-zigzag processes, zeroing the matrix in advance and transforming the de-zigzag process into a MUX.
De-quantization function: same as DCT.
Improvement since midterm presentation (Hardware) AKA the Tradeoff
Resource utilization:
Memory: +8%
Multiplication block: +15%
Flip-flop: +0%
LUT: +31%
Improvement since midterm presentation (clock cycles) AKA the Tradeoff
\[ per_{min} = 24.25 \]
\[ per_{avg} = 58.97 \]
\[ per_{max} = 96.93 \]
\[ performance\_speed\_up = \frac{t_{old}}{t_{new}} = \frac{N_{cyc\_old} \cdot cyc_{old}}{N_{cyc\_new} \cdot cyc_{new}} \]
Future planning
• Integrating the module into the VIVADO environment.
• Uploading the image to the memory from an external device.
• Integrating the module with other modules that allow us to display the image on screen using the VIVADO environment.
• Examining the possibility of parallel operations.
Gantt chart
Num | Task | Duration (days) | Start date | End date
1 | Integrating the module into the VIVADO environment. | 21 | 3 Apr | 24 Apr
2 | Uploading the image to the memory from an external device. | 30 | 24 Apr | 24 May
3 | Integrating the module with other modules that allow us to display the image on screen using the VIVADO environment. | 30 | 24 May | 23 Jun
Gantt chart
[Figure: Gantt bars for the three tasks (VIVADO environment, memory upload, integrating) on a time axis running from 10-Feb to 30-Jun. Final B presentation: 23.6.2014.]
Resources
http://www.cs.northwestern.edu/~agupta/_projects/image_processing/web/JPEGEncoding/report.html
- Source for the encoding/decoding code.
http://en.wikipedia.org/wiki/JPEG
- JPEG Wikipedia entry.
http://sipl.technion.ac.il/
- The Technion Signal and Image Processing Lab, experiment 3.
www.stanford.edu/class/ee398a/handouts/lectures/08-JPEG.pdf
- Stanford University, Department of Electrical Engineering: explanation of the JPEG format.
Essay: The JPEG Still Picture Compression Standard, by Gregory K. Wallace and co.
- Explanation of the JPEG format.
https://ece.uwaterloo.ca/~z70wang/research/ssim/
Howard Hughes Medical Institute, and Laboratory. By Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli.
- Matlab script for SSIM computation.
Essay: Image Quality Assessment Techniques in Spatial Domain, by C. Sasi Varnan and co.