High Efficiency Coarse-Grained
Customised Dynamically
Reconfigurable Architecture for Digital
Image Processing and Compression
Technologies
Xin Zhao
A thesis submitted for the degree of Doctor of Philosophy
The University of Edinburgh
November 2011
Abstract
Digital image processing and compression technologies have significant
market potential, especially the JPEG2000 standard which offers outstanding
codestream flexibility and high compression ratio. Strong demand for high
performance digital image processing and compression system solutions is
forcing designers to seek proper architectures that offer competitive
advantages in terms of all performance metrics, such as speed and power.
Traditional architectures such as ASICs, FPGAs and DSPs are limited by either low flexibility or high power consumption. On the other hand, by providing a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching those of an ASIC, coarse-grained dynamically reconfigurable architectures are proving to
be strong candidates for future high performance digital image processing
and compression systems.
This thesis investigates dynamically reconfigurable architectures and
especially the newly emerging RICA paradigm. Case studies such as a Reed-Solomon decoder and a WiMAX OFDM timing synchronisation engine are
implemented in order to explore the potential of RICA-based architectures
and the possible optimisation approaches such as eliminating conditional
branches, reducing memory accesses and constructing kernels. Based on
investigations in this thesis, a novel customised dynamically reconfigurable
architecture targeting digital image processing and compression applications
is devised, which can be tailored to different applications.
A demosaicing engine based on the Freeman algorithm is designed and
implemented on the proposed architecture as the pre-processing module in a
digital imaging system. An efficient data buffer rotating scheme is designed
with the aim of reducing memory accesses. Meanwhile, an investigation into mapping the demosaicing engine onto a dual-core RICA platform is performed. After optimisation, the performance of the proposed engine is carefully evaluated and compared in terms of throughput and consumed
computational resources.
When targeting the JPEG2000 standard, the core tasks such as 2-D Discrete
Wavelet Transform (DWT) and Embedded Block Coding with Optimal
Truncation (EBCOT) are implemented and optimised on the proposed
architecture. A novel 2-D DWT architecture based on vector operations
associated with the RICA paradigm is developed, and the complete DWT
application is highly optimised for both throughput and area. For the EBCOT
implementation, a novel Partial Parallel Architecture (PPA) for the most
computationally intensive module in EBCOT, termed Context Modeling (CM),
is devised. Based on the algorithm evaluation, an ARM core is integrated into
the proposed architecture for performance enhancement. A Ping-Pong
memory switching mode with a carefully designed communication scheme between the RICA based architecture and the ARM core is proposed. Simulation results demonstrate that the proposed architecture for JPEG2000 offers a significant advantage in throughput.
Declaration of Originality
I hereby declare that the research recorded in this thesis and the thesis itself
was composed by myself in the School of Engineering at The University of
Edinburgh, except where explicitly stated otherwise in the text.
Xin Zhao
07/11/2011
Acknowledgements
Foremost, I would like to thank my Ph.D. supervisors Prof. Tughrul Arslan and Dr. Khaled Benkrid for their support and guidance during my study. I would like to thank Dr. Ahmet T. Erdogan, who provided massive support and guidance to my work during my study. I would also like to thank my colleagues: Dr. Ying Yi, who offered me guidance on the RICA paradigm and made great contributions to the dual-core demosaicing engine work; Dr. Wei Han, for his valuable suggestions on a number of difficulties I met during my research; Mr. Ahmed O. El-Rayis, for his contributions to the customised GFMUL cell utilised in the RS decoder; and Dr. Sami Khawam, Dr. Ioannis Nousias and Dr. Mark I. R. Muir, for their notable help and suggestions on RICA based architectures in my work. Meanwhile, I would like to thank all RICA team members: Dr. Sami Khawam, Dr. Mark Milward, Dr. Ioannis Nousias, Dr. Ying Yi and Dr. Mark I. R. Muir, for their brilliant invention – the RICA paradigm and its tool flow. In addition, many thanks to all members of the SLI group for their help throughout my Ph.D. study.
A very special thanks to my wife Ying Liu. We met each other in the SLI group and got married in Edinburgh. She always encourages me with her love and stayed with me through all the tough times. Finally, I would like to express my deepest appreciation to my parents for their love, guidance and support throughout my life.
Acronyms and Abbreviations
ASIC     Application Specific Integrated Circuit
ALU      Arithmetic Logic Unit
AE       Arithmetic Encoder
ADF      Architecture Description File
BMA      Berlekamp-Massey Algorithm
CFA      Colour Filter Array
CM       Context Modeling
CCD      Charge Coupled Device
CUP      Clean Up Pass
CX/D     Context and binary Decision
CRLB     Cramer-Rao Lower Bound
CP       Cyclic Prefix
CSD      Canonical Sign Digit
DC       Direct Current
DCT      Discrete Cosine Transform
DWT      Discrete Wavelet Transform
DLP      Data Level Parallelism
DPRAM    Dual Port RAM
DAG      Data Address Generator
DRP      Dynamically Reconfigurable Processor
DMU      Data Management Unit
EBCOT    Embedded Block Coding with Optimal Truncation
FPGA     Field Programmable Gate Array
FFU      Flip-Flop Unit
FU       Function Unit
FPS      Frames per Second
FCM      Floating Coefficient Multiplier
GF       Galois Finite Field
GFMUL    Galois Finite Field Multiplier
GOCS     Group-Of-Column Skipping
HD       High Definition
IC       Instruction Cell
ICT      Irreversible Colour Transformation
ILP      Instruction Level Parallelism
IFFT     Inverse Fast Fourier Transform
IP       Intellectual Property
ISI      Inter-Symbol Interference
JPEG     Joint Photographic Experts Group
LUT      Look-Up Table
LS       Least Squares
LZW      Lempel-Ziv-Welch
LPS      Less Probable Symbol
MDF      Machine Description File
MR       Memory Relocation
MRC      Magnitude Refinement Coding
MRP      Magnitude Refinement Pass
MSB      Most Significant Bit-plane
MPS      More Probable Symbol
MSE      Mean Squared Error
MSPS     Million Symbols per Second
MIPS     Million Instructions per Second
MAC      Multiply-Accumulate
MMASC    Multiply-Accumulates per Second
ML       Maximum Likelihood
MMSE     Minimum Mean Square Error
NMPS     Next More Probable Symbol
MRPSIM   Multiple Reconfigurable Processor Simulator
MCOLS    Multiple Column Skipping
NLPS     Next Less Probable Symbol
NLOS     Non-Line-Of-Sight
LSB      Least Significant Bit-plane
OFDM     Orthogonal Frequency Division Multiplexing
PSNR     Peak Signal to Noise Ratio
PPA      Partial Parallel Architecture
PPCM     Pass Parallel Context Modelling
PE       Processing Element
QoS      Quality of Service
RCT      Reversible Colour Transformation
RGB      Red-Green-Blue
RFU      Register File Unit
RICA     Reconfigurable Instruction Cell Array
RLC      Run Length Coding
RLE      Run Length Encoding
RTL      Register Transfer Level
RC       Reconfigurable Cell
RF       Register File
RSPE     Reconfigurable Stage Processing Element
RS       Reed-Solomon
SC       Sign Coding
SoC      System on Chip
SIMD     Single Instruction Multiple Data
SPP      Significant Propagation Pass
SS       Sample Skipping
TCP      Turbo Decoder Coprocessor
VCP      Viterbi Decoder Coprocessor
VGOSS    Variable Group of Sample Skip
VO       Vector Operation
WiMAX    Worldwide Interoperability for Microwave Access
ZC       Zero Coding
Publications from this work
1. X. Zhao, A.T. Erdogan, T. Arslan, “High Efficiency Customised Coarse-Grained Dynamically Reconfigurable Architecture for JPEG2000”, submitted to the IEEE Transactions on Very Large Scale Integration (VLSI) Systems, May 2011.
2. X. Zhao, A.T. Erdogan, T. Arslan, “Dual-Core Reconfigurable
Demosaicing Engine for Next Generation of Portable Camera Systems,”
the IEEE Conference on Design & Architectures for Signal and Image
Processing (DASIP), October 26-28, 2010.
3. X. Zhao, A. T. Erdogan, T. Arslan, “A Hybrid Dual-Core Reconfigurable
Processor for EBCOT Tier-1 Encoder in JPEG2000 on Next Generation
Digital Cameras,” the IEEE Conference on Design & Architectures for
Signal and Image Processing (DASIP), October 26-28, 2010.
4. X. Zhao, Y. Yi, A. T. Erdogan, T. Arslan, “A High-Efficiency
Reconfigurable 2-D Discrete Wavelet Transform Engine for JPEG2000
Implementation on Next Generation Digital Cameras,” the 23rd IEEE International System-on-Chip (SOC) Conference, September 27-29, 2010.
5. X. Zhao, A. T. Erdogan, T. Arslan, “A Novel High-Efficiency Partial-Parallel Context Modeling Architecture for EBCOT in JPEG2000,” the 22nd IEEE International SOC Conference, pp. 57-60, 2009.
6. X. Zhao, A. T. Erdogan, T. Arslan, “OFDM Symbol Timing
Synchronization System on a Reconfigurable Instruction Cell Array,” the
21st IEEE International SOC Conference, pp. 319-322, 2008.
7. A. El-Rayis, X. Zhao, T. Arslan, A. T. Erdogan, “Low power RS codec using cell-based reconfigurable processor,” the 22nd IEEE International SOC Conference, pp. 279-282, 2009.
8. A. El-Rayis, X. Zhao, T. Arslan, A. T. Erdogan, “Dynamically
programmable Reed Solomon processor with embedded Galois Field
multiplier,” IEEE International Conference on ICECE Technology, FPT, pp.
269-272, 2008.
Contents
Chapter 1  Introduction .... 1
1.1. Motivation .... 1
1.2. Objective .... 3
1.3. Contribution .... 3
1.4. Thesis Structure .... 4
Chapter 2  Digital Image Processing Technologies and Architectures .... 6
2.1. Introduction to Digital Image Processing Technologies .... 6
2.2. Demosaicing Algorithms .... 9
2.3. JPEG2000 Compression Standard .... 13
2.4. Literature Review .... 15
2.4.1. Demosaicing Algorithms Evaluations .... 15
2.4.2. Solutions for Image Processing and Compression Applications .... 18
2.5. Demand for Novel Architectures .... 31
2.6. Conclusion .... 34
Chapter 3  RICA Paradigm Introduction and Case Studies .... 36
3.1. Introduction .... 36
3.2. Dynamically Reconfigurable Instruction Cell Array .... 37
3.2.1. Architecture .... 37
3.2.2. RICA Tool Flow .... 39
3.2.3. Optimisation Approaches to RICA Based Applications .... 41
3.3. Case Studies .... 42
3.4. Outcomes of Case Studies .... 43
3.5. Prediction of Different Imaging Tasks on RICA Based Architecture .... 45
3.6. Conclusion .... 47
Chapter 4  Freeman Demosaicing Engine on RICA Based Architecture .... 49
4.1. Introduction .... 49
4.2. Freeman Demosaicing Algorithm .... 49
4.3. Freeman Demosaicing Engine Implementation .... 51
4.4. System Analysis and Dual-Core Implementation .... 54
4.4.1. System Analysis .... 54
4.4.2. Dual-Core Implementation .... 56
4.5. Optimisation .... 61
4.6. Performance Analysis and Comparison .... 63
4.7. Future Improvement .... 65
4.8. Conclusion .... 66
Chapter 5  2-D DWT Engine on RICA Based Architecture .... 68
5.1. Introduction .... 68
5.2. Lifting-Based 2-D DWT Architecture in JPEG2000 Standard .... 68
5.3. Lifting-Based DWT Engine on RICA Based Architecture .... 70
5.3.1. 1-D DWT Engine Implementation .... 70
5.3.2. 2-D DWT Engine Implementation .... 72
5.3.3. 2-D DWT Engine Optimisation .... 75
5.4. Performance Analysis and Comparisons .... 77
5.5. Conclusion .... 81
Chapter 6  EBCOT on RICA Based Architecture and ARM Core .... 83
6.1. Introduction .... 83
6.2. Context Modelling Algorithm Evaluation .... 83
6.3. Efficient RICA Based Designs for Primitive Coding Schemes in CM .... 87
6.3.1. Zero Coding .... 87
6.3.2. Sign Coding .... 88
6.3.3. Magnitude Refinement Coding .... 90
6.3.4. Run Length Coding .... 90
6.4. Partial Parallel Architecture for CM .... 93
6.4.1. Architecture .... 93
6.4.2. PPA based CM Coding Procedure .... 94
6.5. Arithmetic Encoder in EBCOT .... 98
6.6. EBCOT Tier-2 Encoder .... 100
6.7. Performance Analysis and Comparisons .... 102
6.8. Conclusion .... 104
Chapter 7  JPEG2000 Encoder on Dynamically Reconfigurable Architecture .... 106
7.1. Introduction .... 106
7.2. 2-D DWT and EBCOT Integration .... 107
7.3. CM and AE Integration .... 108
7.3.1. System Architecture .... 108
7.3.2. Memory Relocation Module .... 109
7.3.3. Communication Scheme between CM and MR .... 111
7.3.4. Ping-Pong Memory Switching Scheme .... 113
7.4. Performance Analysis and Comparison .... 115
7.4.1. Execution Time Evaluation .... 115
7.4.2. Power and Energy Dissipation Evaluation .... 116
7.4.3. Performance Comparisons .... 119
7.5. Future Improvements .... 122
7.6. Conclusion .... 124
Chapter 8  Conclusions .... 126
8.1. Introduction .... 126
8.2. Review of Thesis Contents .... 126
8.3. Novel Outcomes of the Research .... 127
8.4. Future Work .... 130
Appendix .... 133
JPEG2000 Encoding Standard .... 133
Tiling and DC Level Shifting .... 133
Component Transformation .... 134
2-Dimension Discrete Wavelet Transform .... 135
Quantisation .... 138
Embedded Block Coding with Optimal Truncation .... 138
References .... 154
List of Figures
Figure 2.1 Digital Image Processing System Architecture ....................................................... 7
Figure 2.2 Bayer CFA Pattern .................................................................................................. 9
Figure 2.3 Bayer CFA Pattern Demosaicing Procedure .......................................................... 9
Figure 2.4 Illustration of Freeman Demosaicing Algorithm .................................................... 11
Figure 2.5 JPEG2000 Encoder Architecture .......................................................................... 14
Figure 2.6 Test Images for Evaluating Different Demosaicing Algorithms in [13] ................. 16
Figure 2.7 Performance Comparisons between Different Demosaicing Algorithms.............. 16
Figure 2.8 Test Images in [23] ............................................................................................... 17
Figure 2.9 (a) PSNR Comparisons (b) Execution Time Comparisons [23] ........................... 17
Figure 2.10 HiveFlex ISP2300 Block Diagram [41] ............................................................... 22
Figure 2.11 TM1300 Block Diagram [42] ............................................................................... 23
Figure 2.12 TMS320C6416T Block Diagram [44] .................................................................. 25
Figure 2.13 ADSP BF535 Core Architecture [48] .................................................................. 26
Figure 2.14 CRISP Processor Architecture [54] .................................................................... 28
Figure 2.15 (a) NEC DRP Structure (b) PE in NEC DRP [56] ............................................... 29
Figure 2.16 (a) MorphoSys Architecture (b) RC Array Architecture [57] ............................... 30
Figure 2.17 (a) ADRES Architecture (b) RC Architecture [59] ............................................... 31
Figure 3.1 RICA Paradigm [6] ................................................................................................ 37
Figure 3.2 RICA Tool Flow ..................................................................................................... 40
Figure 4.1 (a) Freeman Demosaicing Architecture (b) Bilinear Demosaicing for Bayer Pattern .... 50
Figure 4.2 Freeman Demosaicing Implementation Architecture........................................... 51
Figure 4.3 Data Buffers Addresses Rotation ......................................................................... 51
Figure 4.4 Parallel Architecture for Freeman Demosaicing ................................................... 52
Figure 4.5 Freeman Demosaicing Execution Flowchart ........................................................ 53
Figure 4.6 (a) Pseudo Median Filter (b) Median Filter Reuse ................................................ 55
Figure 4.7 Mapping Methodology for MRPSIM ...................................................................... 57
Figure 4.8 Dual-Core Freeman Demosaicing Engine Architecture ....................................... 58
Figure 4.9 Pseudo Code for Dual-Core Implementation ........................................................ 60
Figure 4.10 Illustration of Pipeline Architecture for Kernels ................................................... 62
Figure 4.11 A Demosaiced 648x432 Image........................................................................... 63
Figure 4.12 Potential Vector Operations in Median Filter ...................................................... 65
Figure 5.1 (a) Convolutional DWT Architecture (b) 5/3 Lifting-based DWT Architecture (c) 9/7 Lifting-based DWT Architecture .... 69
Figure 5.2 Generic Lifting-Based DWT Architecture for Both 5/3 and 9/7 modes ................. 69
Figure 5.3 Lifting-Based 2-D DWT Architecture..................................................................... 70
Figure 5.4 Detailed Generic Architecture of 1-D DWT Engine on RICA ................................ 71
Figure 5.5 Reconstructed Image Quality with Different CSD Bits .......................................... 71
Figure 5.6 Streamed Data Buffers in DWT Engine ................................................................ 72
Figure 5.7 Detailed 3-Level 2-D DWT Decomposition ........................................................... 73
Figure 5.8 Parallel Pixel Transformation with VO and SIMD Technique ............................... 74
Figure 5.9 Kernel in the 2-D DWT Engine on RICA Architecture .......................................... 75
Figure 5.10 Standard Lena Image Transformed by the 2-D DWT Engine ............................ 77
Figure 5.11 Throughput (fps) Comparisons ........................................................................... 78
Figure 5.12 Area and Δ Comparisons .................................................................................... 79
Figure 5.13 Performance Comparisons ................................................................................. 80
Figure 6.1 Sample Skipping Method for CM .......................................................................... 84
Figure 6.2 Group of Column Skipping Method for CM ........................................................... 84
Figure 6.3 Pass Parallel Context Modeling ............................................................................ 85
Figure 6.4 Detailed Architecture for ZC Unit .......................................................................... 88
Figure 6.5 Detailed Architecture for SC Unit .......................................................................... 89
Figure 6.6 Detailed Architecture for MRC Unit....................................................................... 90
Figure 6.7 Codeword Structure in RLC Unit .......................................................................... 91
Figure 6.8 The Structure of RLC Unit .................................................................................... 92
Figure 6.9 Partial Parallel Architecture for Context Modeling ................................................ 93
Figure 6.10 The Example of How Data Buffers Work in PPA ................................................ 94
Figure 6.11 Pseudo Code of PPA Working Process ............................................................. 96
Figure 6.12 PPA Codeword Structure .................................................................................... 97
Figure 6.13 (a) Original RENORME Architecture (b) Optimised RENORME Architecture .... 99
Figure 6.14 (a) Original BYTEOUT Architecture (b) Optimised BYTEOUT Architecture .... 100
Figure 6.15 Detailed Tag-Tree Coding Procedure ............................................................... 101
Figure 6.16 Detailed Codeword Length Coding Procedure ................................................. 101
Figure 6.17 PPA Based CM Execution Time under Different Pre-Conditions ..................... 102
Figure 7.1 Original data processing pattern between 2-D DWT and EBCOT ..................... 107
Figure 7.2 Modified 2-D DWT Scanning Pattern.................................................................. 108
Figure 7.3 Proposed Architecture with DPRAM ................................................................... 109
Figure 7.4 (a) Memory Relocation in JPEG2000 Encoder (b) Detailed Architecture of MR module .... 110
Figure 7.5 Pseudo Code for EBCOT Implementation on the Proposed Architecture .......... 112
Figure 7.6 Pipeline Structure of the JPEG2000 Encoder .................................................... 113
Figure 7.7 Execution Time Ratio of Different Modules in JPEG2000 Encoder ................... 114
Figure 7.8 Ping-Pong Memory Switching Architecture ........................................................ 114
Figure 9.1 Discrete Wavelet Transform ............................................................................... 136
Figure 9.2 Multi-level 2-Dimension DWT ............................................................ 136
Figure 9.3 Lifting-Based DWT .............................................................................................. 137
Figure 9.4 Dead-Zone Illustration of the Quantiser .............................................................. 138
Figure 9.5 (a) Scanning Pattern of EBCOT (b) Significant State ......................................... 140
Figure 9.6 Illustration of One Pixel’s Neighbours ................................................................. 140
Figure 9.7 EBCOT Tier-1 Context Modeling Working Flowchart ......................................... 145
Figure 9.8 Top-Level Flowchart for Arithmetic Encoder ...................................................... 149
Figure 9.9 Detailed Architectures of the Key Sub-modules in Arithmetic Encoder.............. 150
Figure 9.10 Tag Tree Encoding Procedure.......................................................................... 151
List of Tables
Table 2.1 Examples of Image Processing Technologies and Compression Standards .......... 8
Table 2.9 Comparisons of Different Architectures for Image Processing Applications ......... 32
Table 3.1 Instruction Cells in RICA ........................................................................................ 38
Table 4.1 Instruction Cells Occupied by Freeman Demosaicing Engine ............................... 61
Table 4.2 Freeman Demosaicing Performance Evaluations and Comparisons .................... 64
Table 5.1 CSD Forms of Floating-Point Parameters ............................................................. 71
Table 5.2 Numbers of Cells in Different DWT Engines .......................................................... 76
Table 6.1 Simplified LUT for XOR Bit .................................................................................... 89
Table 6.2 Valid_state in the RLC Unit .................................................................................... 92
Table 6.3 CX/D Selection in PPA ........................................................................................... 95
Table 6.4 Valid_state Indication for RLC in PPA ................................................................... 97
Table 6.5 Performance Comparisons .................................................................................. 102
Table 6.6 Numbers of Cells in CM Engines on Customised RICA Architecture .................. 103
Table 6.7 Performance Comparisons .................................................................................. 104
Table 7.1 Communication Variables .................................................................................... 111
Table 7.2 Detailed Execution Time of the JPEG2000 Encoder Sub-modules on the Proposed Architecture .... 116
Table 7.3 Power and Energy Dissipation of the JPEG2000 Encoder Sub-modules on the Proposed Architecture .... 118
Table 7.4 Execution Time Comparisons .............................................................................. 120
Table 7.5 Energy Dissipation Comparisons ......................................................................... 121
Table 7.6 Future Throughput Improvement ......................................................................... 123
Table 9.1 Contexts for the Zero Coding Scheme................................................................. 142
Table 9.2 H/V Contributions and Contexts in the Sign Coding Scheme .............................. 143
Table 9.3 Contexts of the Magnitude Refinement Coding Scheme ..................................... 144
Table 9.4 Qe and Estimation LUT ........................................................................................ 148
Table 9.5 LUT for I(CX) and MPS(CX) ................................................................................ 148
Table 9.6 A and C Register Structure .................................................................................. 149
Table 9.7 Codewords for Number of Coding Passes .......................................................... 152
Chapter 1
Introduction
1.1. Motivation
With the rapid development of computer technologies, digital image
processing stands as a pivotal element in people's lives. It has been widely utilised not only in academic and research fields such as medical image processing and radar image analysis, but also in daily life, for example in mobile phones and digital cameras. Meanwhile, together with the growth of Internet technology and portable storage devices, digital image compression techniques are drawing more and more attention, with the objective of reducing the irrelevance and redundancy of image data in order to store or transmit it in an efficient form [1]. Take a digital camera as an example; usually it employs image processing technologies including demosaicing, Gamma correction, white balancing, smooth filtering, etc. After processing, the obtained digital image is compressed and stored on an SD card or transmitted through the Internet or other media. The compressed image
may be represented in different forms such as TIFF [2], JPEG [3] and GIF [4].
Recently, a newer version of JPEG, termed JPEG2000 [5], has been
presented. Based on the wavelet transform, the JPEG2000 compression standard offers significant flexibility and outstanding performance compared
with other existing standards.
Given these exciting technologies, a question is likely to arise: What is
desired for digital image processing solutions in applications such as mobile
phones and digital cameras? Obviously, a solution which is able to provide
high throughput is highly desirable. Meanwhile, power efficiency is also very important, especially for portable applications powered by batteries. Moreover, in advanced digital cameras, a good image processing solution is normally required to have significant flexibility in order to support different algorithms. Generally, an ideal digital image processing solution is expected to have high throughput, low power consumption and
outstanding flexibility/reconfigurability for various tasks in advanced digital
cameras.
Research targeting efficient solutions for digital image processing and
compression applications has been carried out for a long time. Application
Specific Integrated Circuit (ASIC) implementations are traditionally popular in
designing complex image processing applications such as JPEG2000
solutions. However, this kind of solution is inherently inflexible and cannot be
upgraded or altered after fabrication. Field Programmable Gate Array (FPGA)
based solutions can provide more flexibility and shorter time-to-market
compared with ASIC solutions. However, traditional FPGAs normally
consume more power than ASICs and may not be suitable for embedded image processing applications, since the majority of the available transistors are used to provide flexibility [6]. Another popular solution is to use third-party DSPs to build Systems on Chip (SoCs). Compared with ASIC/FPGA
based solutions, DSPs have advantages in either higher flexibility (compared
with ASICs) or lower power consumption (compared with FPGAs). However,
DSP based solutions usually have limited throughput due to the lack of
Instruction Level Parallelism (ILP) compared with the other two solutions.
Moreover, even though they have lower power consumption compared with FPGAs, the power efficiency of DSP based applications is still curbed by high
clock rates and deep submicron processes.
Recently, a new category of programmable architectures, termed coarse-grained reconfigurable architectures, has emerged targeting high performance and area-efficient computing applications. Different from traditional FPGAs and DSPs, coarse-grained reconfigurable architectures can be viewed as hardware components whose internal architecture can be dynamically reconfigured in order to implement different algorithms. Generally, coarse-grained reconfigurable architectures are more area and power efficient compared with FPGAs while offering software-like programmability similar to DSPs, and they are more efficient because computational functionality is implemented in hardware [7]. This thesis proposes a customised dynamically reconfigurable architecture based on the coarse-grained Reconfigurable Instruction Cell Array (RICA) paradigm for digital image processing and compression applications such as demosaicing and the JPEG2000 standard.
1.2. Objective
The objective of this thesis is to explore a high efficiency customised reconfigurable architecture targeting digital image processing and compression technologies by utilising the coarse-grained, dynamically reconfigurable RICA paradigm. After investigating different RICA based architectures, this thesis
aims to design efficient solutions for demosaicing and core tasks in the
JPEG2000 standard based on the proposed architecture.
1.3. Contribution
The major contributions of this thesis are split into five key aspects:
 The potential of RICA based architecture and possible optimisation
approaches are explored by case studies including Reed-Solomon
decoder and Worldwide Interoperability for Microwave Access (WiMAX)
Orthogonal Frequency Division Multiplexing (OFDM) timing synchronisation engine implementations.
 A Freeman demosaicing engine is developed as the pre-processing
module in a digital imaging system. This demosaicing engine is
implemented on RICA based architecture and optimised by an efficient
data buffer rotating scheme and a pseudo median filter. A parallel
architecture for the demosaicing engine is developed. Investigation
targeting mapping the demosaicing engine onto a dual-core RICA
platform is performed.
 A novel 2-D Discrete Wavelet Transform (DWT) engine for JPEG2000 is
developed. This 2-D DWT engine is based on vector operations
associated with RICA paradigm and is highly optimised for both
throughput and area.
 Solutions for efficiently implementing the four primitive coding schemes in
the Context Modeling (CM) module in JPEG2000 on RICA based
architecture are developed. A novel Partial Parallel Architecture (PPA) for
CM is developed, which strikes a good balance between throughput and
area occupation for RICA based implementations.
 A novel customised dynamically reconfigurable architecture for JPEG2000 is developed. This proposed architecture is based on the RICA paradigm and an embedded ARM core for the efficient implementation of the Arithmetic Encoder (AE) module in JPEG2000. A modified 2-D DWT scanning pattern and a memory relocation module, together with an efficient communication scheme between the RICA based architecture and the ARM core, are developed. A Ping-Pong memory switching mode between the RICA based architecture and the ARM core is proposed for further performance
improvement.
1.4. Thesis Structure
This thesis is structured as follows:
Chapter 2 contains descriptions of the background. It provides detailed
algorithms for demosaicing and introduces the JPEG2000 standard.
A literature review is also included in this chapter. The reviewed literature mainly covers demosaicing and JPEG2000 encoder solutions on different
architectures. DSP and coarse-grained reconfigurable architecture based
solutions are especially emphasised.
Chapters 3 through 7 address my Ph.D. research achievements. Based on a
detailed description of RICA paradigm, two case studies (Reed-Solomon
decoder and WiMAX OFDM timing synchronisation engine) are introduced in
Chapter 3 in order to investigate the potential of RICA based architectures
and possible optimisation approaches. Chapter 4 focuses on design and
implementation of a Freeman demosaicing engine on RICA based
architecture. The work involves an efficient data buffer rotating scheme,
single-core engine implementation & optimisation and mapping the
demosaicing engine onto a dual-core RICA architecture. From Chapter 5 onwards, this thesis focuses on the proposed customised dynamically reconfigurable architecture for a JPEG2000 encoder solution. A novel vector operation
based 2-D DWT engine is proposed in Chapter 5, together with detailed
throughput and area evaluations. Chapter 6 proposes efficient solutions for
CM and AE modules in JPEG2000 encoding algorithm. This includes design
and implementations of the four primitive coding schemes involved in CM
and the novel PPA solution. Based on algorithm analysis and evaluation, an
ARM core is selected for an efficient AE implementation. In Chapter 7, the
proposed architecture for JPEG2000 is introduced based on the previous
discussion. A modified 2-D DWT scanning pattern, a shared Dual Port RAM
(DPRAM), a Memory Relocation (MR) module and a Ping-Pong memory
switching mode are presented in order to improve the performance of the
proposed architecture.
Finally, the thesis is concluded with a summary in Chapter 8.
Chapter 2
Digital Image Processing Technologies
and Architectures
2.1. Introduction to Digital Image Processing Technologies
Digital image processing is the use of computer algorithms to perform
processing on digital images. It is a subcategory of digital signal processing
which provides many advantages over analog image processing, such as a wider range of algorithms to choose from and the avoidance of noise and distortion build-up during processing [8]. Digital image processing has been very widely used in fields such as digital cameras, remote sensing, multimedia and satellite imaging. A typical digital image processing system can be viewed as being composed of the following modules: a source of input digital image data, a
processing module and a destination for the processed image, as illustrated
in Figure 2.1. Usually, the digital image source is provided by a digitisation
procedure, that is, the process of converting an analog image into an
ordered array of discrete pixels. This procedure is normally executed by a
digital camera, scanner, etc. The processing module in digital image
processing is usually a digital processor, which can be a computer or a
dedicated chip. The image destination can be realised by different kinds of
digital storage devices and output terminals for transmission. Generally, digital image processing applies different processing algorithms to a matrix of digitised pixels, which is in fact what a digital image is. Depending on the pixel format, the image can be represented in grayscale or colour.
Figure 2.1 Digital Image Processing System Architecture: object → image capture and digitisation → digital image processing unit → processed image destination
In the case that the image is large or has deep bit-depth pixels, transmission
and storage of an uncompressed image becomes extremely costly and
impractical. For example, an 8-bit grayscale image with the size of 320x240
requires 76,800 bytes to be stored, and this figure will increase significantly if
there is an increase in image size or pixel bit-depth. In this case, digital image compression techniques become critical in order to minimise the size of a digital image to fit the given storage/transmission capacity without
degrading the image quality to an unacceptable level. According to different
compression algorithms, there are two categories of image compression:
lossless and lossy. Lossless compression is preferred in the field of medical
imaging, technical drawings and so on due to the elimination of compression
artifacts; while the lossy scheme is widely used in applications where minor distortions are tolerable and a low bit-rate is desired for storage and
transmission.
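As a quick sanity check of these storage figures, the uncompressed size of an image is simply width × height × channels × bits per channel / 8. The short Python sketch below reproduces the 76,800-byte example above; the 1920×1080, 24-bit case is only an illustrative assumption added here, not an example taken from the thesis.

```python
def raw_image_bytes(width, height, channels=1, bits_per_channel=8):
    # Uncompressed storage requirement of an image, in bytes.
    return width * height * channels * bits_per_channel // 8

print(raw_image_bytes(320, 240))           # 76800 bytes (8-bit grayscale example above)
print(raw_image_bytes(1920, 1080, 3, 8))   # 6220800 bytes, i.e. roughly 5.9 MiB of raw RGB
```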
There are several different methods to compress an image. For internet use,
the two most popular compression schemes are JPEG [3] and GIF [4]. JPEG
uses compression techniques such as Discrete Cosine Transform (DCT),
chroma sub-sampling, Run Length Encoding (RLE) and Huffman coding.
This compression scheme gives good representations of natural images. GIF uses a lossless compression algorithm, Lempel-Ziv-Welch (LZW), which is better suited to artificial images than to natural images as the format allows only 256 colours. There are also other popular compression methods in use nowadays, such as TIFF [2], JBIG [9] and JBIG2 [10]. Also, recently a newer version of JPEG, termed JPEG2000 [5], has been proposed, which is based on the DWT and supports both lossless and lossy compression.
with other standards such as high compression ratio, random bit-stream
access, region of interest coding, etc., which will be discussed in the
following sections.
Based on these digital image processing and compression techniques, digital
cameras are widely used in our daily life. Basically, a digital camera captures the scene with a Charge Coupled Device (CCD) overlaid with a Bayer filter [11], which is the most common method of digitisation. A Red-Green-Blue (RGB) image
is then reconstructed by the demosaicing module. After that, the RGB image
can be further processed and compressed to certain file formats for storage
and display. When targeting next generation digital cameras, JPEG2000
compression standard becomes an ideal choice because of its desirable
features compared with other compression standards.
Table 2.1 Examples of Image Processing Technologies and Compression Standards

(a) Image Processing Technologies
Technology          Purpose
Demosaicing         To reconstruct a full-colour image from CFA
Gamma correction    To correct and to adjust the colour difference
White balancing     To adjust the intensities of different colours
Sharpening          To increase the contrast around the edges of objects
Smoothing           To reduce noise within an image

(b) Image Compression Standards
Standard    Year   Features
TIFF        1986   Lossless/lossy, a popular format for high colour-depth images
GIF         1987   Lossless, supports up to 256 colours
JBIG        1993   Lossless, for bi-level image compression
JBIG2       2000   Lossless/lossy, for bi-level image compression
JPEG        1992   Usually lossy with an optional lossless mode
JPEG2000    2000   Lossless/lossy, a newly emerging standard
Table 2.1 lists examples of existing image processing technologies and compression standards. In line with the above discussion and the main work in this thesis, the following sections mainly focus on introducing
different demosaicing algorithms and the JPEG2000 compression standard.
2.2. Demosaicing Algorithms
Commercially, the most commonly used Colour Filter Array (CFA) pattern is
the Bayer filter [11] illustrated in Figure 2.2. It has alternating Red (R) and
Green (G) filters for odd rows and alternating Green (G) and Blue (B) filters
for even rows. Due to the human eye's high sensitivity to green light, the Bayer CFA contains twice as many green (luminance) filters as either red or blue (chrominance) ones.
Figure 2.2 Bayer CFA Pattern

Figure 2.3 Bayer CFA Pattern Demosaicing Procedure: original image sampled by the Bayer CFA → separate colour planes obtained by demosaicing → reconstructed image

As the object information captured by image sensors
overlaid with CFA has only incomplete colour components (R/G/B) at each
pixel position, a full colour image needs to be reconstructed and the concept
of demosaicing arises. The aim of demosaicing is to reconstruct an image
with a full set of colour components from spatially undersampled colour
samples captured by image sensors. For Bayer CFA, demosaicing
interpolates the estimated missing two colour components for each pixel
position with a selected algorithm. Figure 2.3 illustrates the procedure from
original Bayer filter samples to the reconstructed image [12].
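As a concrete illustration of this interpolation step, the following minimal Python sketch estimates a missing green value at a red or blue Bayer site using the simple bilinear approach discussed below; the function name and the lack of border handling are illustrative simplifications and are not taken from the thesis.

```python
def bilinear_green(cfa, i, j):
    # Estimate the missing green component at a red or blue Bayer site (i, j)
    # as the average of its four green neighbours; image borders are ignored
    # here for brevity. `cfa` is a 2-D array of raw Bayer CFA samples.
    return (cfa[i - 1][j] + cfa[i + 1][j] +
            cfa[i][j - 1] + cfa[i][j + 1]) / 4.0
```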
An ideal demosaicing algorithm should be able to avoid the introduction of
false colour artifacts such as zippering and chromatic aliases as much as
possible, with the maximum preservation of the original image resolution.
Considering embedded applications in cameras, an ideal algorithm should
also have low computational complexity for fast processing and efficient
hardware implementation. There have been a number of demosaicing
algorithms proposed. The simplest approach is called nearest-neighbour
interpolation, which simply copies an adjacent pixel with the required colour
component as the missing colour value. Obviously this approach can only be
used to generate previews given strictly limited computational resources and
is unsuitable for most applications where quality matters. Another simple
approach, bilinear demosaicing, fills missing colour components with
weighted averages of their adjacent same colour component values. This
algorithm is simple enough for implementation in most cases; however it
introduces severe demosaicing artifacts and smears sharp edges [13]. In [14],
Cok presented a constant hue-based interpolation demosaicing algorithm
which utilises a spectral correlation between different colour ratios. Hue is the property of colours by which they can be perceived as ranging from red through yellow, green and blue, as determined by the dominant wavelength of the light [15]. As specified in the Cok demosaicing algorithm, hue is defined by a vector of ratios, (R/G, B/G). By interpolating the hue value
and deriving the interpolated chrominance values from the interpolated hue
values, hues are allowed to change only gradually, thereby reducing the
appearance of colour fringes which would have been obtained by interpolating only the chrominance values. A detailed description of the algorithm can be found in [14].

Figure 2.4 Illustration of Freeman Demosaicing Algorithm: (a) original image, (b) Bayer CFA samples, (c) bilinear demosaicing (fringe introduced), (d) colour difference A-B, (e) median filtered colour difference, (f) reconstructed image (fringe removed)
The Freeman demosaicing algorithm was proposed in [16]. It is a two-stage process combining bilinear interpolation and median filtering. In order to minimise the zippering artifacts introduced by bilinear demosaicing, the Freeman algorithm applies median filtering to colour differences, say red minus green and blue minus green. The filtered differences are then added
back to the green plane to obtain the final red and blue planes. In this way,
fringes at edges of different colour areas can be eliminated as illustrated in
Figure 2.4, which takes a line containing two colour components in Bayer
CFA for example [16]. Figure 2.4 (a) shows an original image with two colour
components. This image contains a sharp edge between two different areas
(the vertical axis represents the colour component intensities). Figure 2.4 (b)
illustrates the sampled colour information captured by CCD with Bayer CFA.
After bilinear demosaicing, the image is reconstructed, but with a noticeable colour fringe between the two colour areas at the positions of pixels 6 and 7, shown in (c). If we take the colour difference, that is, colour A minus colour B, and filter it with a median filter of kernel size 5, the fringe in the difference signal can be eliminated, as illustrated in (d) and (e). Finally, the original
image can be reconstructed without the colour fringe existing in bilinear
demosaicing, which is shown in (f).
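A minimal 1-D Python sketch of this two-stage process, mirroring the single-line example of Figure 2.4, is given below. The helper names are purely illustrative; a real implementation works on 2-D Bayer data and median-filters the red-green and blue-green difference planes.

```python
def bilinear_1d(samples):
    # Fill missing entries (None) with the mean of their nearest neighbours.
    out = list(samples)
    for i, v in enumerate(samples):
        if v is None:
            left = samples[i - 1] if i > 0 else samples[i + 1]
            right = samples[i + 1] if i + 1 < len(samples) else samples[i - 1]
            out[i] = (left + right) / 2.0
    return out

def median_filter_1d(values, size=5):
    # Sliding-window median; the window shrinks at the borders.
    half = size // 2
    result = []
    for i in range(len(values)):
        window = sorted(values[max(0, i - half): i + half + 1])
        result.append(window[len(window) // 2])
    return result

def freeman_1d(plane_a, plane_b):
    # Stage 1: bilinear interpolation of the two interleaved colour planes.
    a = bilinear_1d(plane_a)
    b = bilinear_1d(plane_b)
    # Stage 2: median-filter the colour difference A - B ...
    diff = median_filter_1d([x - y for x, y in zip(a, b)], size=5)
    # ... and add it back to plane B to rebuild plane A without the fringe.
    return [y + d for y, d in zip(b, diff)], b
```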
Laroche and Prescott proposed a three-step gradient-based demosaicing
algorithm in [17], which estimates the luminance channel first and then
interpolates the colour differences, say red minus green and blue minus green. Two classifiers, α and β, are defined and utilised to determine whether a pixel belongs to a vertical or horizontal colour edge [15]. According to the
magnitude comparison between α and β, different formulas are employed to estimate the green pixel value. Both the classifiers and the formulas are adjusted when processing pixels at different positions of the Bayer CFA pattern.
Once the luminance channel (green) is determined, chrominance values (red
and blue) are estimated from the colour difference by other formulas.
A detailed description of the algorithm can be found in [17]. A modification of this algorithm, termed adaptive colour plane interpolation, was proposed by Hamilton and Adams in [18]. This method also employs a multi-step process with classifiers similar to the Laroche-Prescott algorithm, but modified to accommodate first order
and second order derivatives, that is, to calculate arithmetic averages for the
chrominance channel and appropriately scaled second derivative terms for
the luminance data [15].
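The fragment below is an illustrative Python sketch of this edge-directed idea, using second-difference classifiers computed from same-colour samples two positions away; the exact classifier and interpolation formulas of Laroche-Prescott and Hamilton-Adams differ in detail and should be taken from [17] and [18].

```python
def green_at_red_or_blue(cfa, i, j):
    # Horizontal and vertical activity measured by second differences of the
    # chrominance samples two columns/rows away (same colour in a Bayer CFA).
    alpha = abs(2 * cfa[i][j] - cfa[i][j - 2] - cfa[i][j + 2])   # horizontal
    beta = abs(2 * cfa[i][j] - cfa[i - 2][j] - cfa[i + 2][j])    # vertical
    if alpha < beta:
        # Smoother horizontally: interpolate green along the row.
        return (cfa[i][j - 1] + cfa[i][j + 1]) / 2.0
    if beta < alpha:
        # Smoother vertically: interpolate green along the column.
        return (cfa[i - 1][j] + cfa[i + 1][j]) / 2.0
    # No preferred direction: fall back to the average of all four neighbours.
    return (cfa[i][j - 1] + cfa[i][j + 1] + cfa[i - 1][j] + cfa[i + 1][j]) / 4.0
```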
R. Kimmel in [19] presented another demosaicing algorithm consisting of
three stages. Firstly a missing green pixel is estimated as a linear
combination of its four neighbours. In Kimmel algorithm, a weight function,
termed Ei, is utilised in this stage for the estimation. Generally, Ei is
calculated from the probability that the neighbouring green pixels belong to the
same image object as the missing green pixel. For different neighbours, there
are different equations to calculate Ei respectively. The second stage is to
estimate the missing red and blue colour components. This estimation is
performed similarly to the green pixel estimation in the previous stage, utilising the weight
functions (Ei) discussed above. After these two stages, colour correction acts
as the third stage in Kimmel algorithm. The main idea of colour correction is
to assume that the ratio of red (or blue) to green is constant within each
image object. Based on this assumption, green pixels and red/blue pixels are
corrected alternately. Normally this colour correction stage is repeated 3
times before the final reconstructed image is obtained [19-20].
Tsai-Acharya demosaicing algorithm [21] is an adaptive method based on the
hue concept. The main idea is to assign weight coefficients to all neighbour
pixels of the one currently under processing. This algorithm is also carried out in
three stages. Firstly, all missing green pixels are estimated. Secondly, the
missing red/blue colour components at blue/red pixels are estimated. Finally,
the missing blue/red colour components at green pixels are estimated. All the
estimation in these three stages is performed on the basis of assigning
different hue values to the demosaicing window. A detailed description of the algorithm can be found in [21].
Wenmiao-Peng algorithm was proposed in [22]. It is based on the spectral
correlation among pixels along the respective interpolation direction. This
algorithm assumes that green and blue/red colour components are well
correlated with constant offset; meanwhile the changing rate of neighbouring
pixel values along an interpolation direction is constant [23]. The median
filter in Freeman algorithm is utilised in this algorithm in order to suppress the
noticeable demosaicing artifacts.
2.3. JPEG2000 Compression Standard
The JPEG2000 image compression standard was created by the Joint Photographic Experts Group (JPEG) committee in 2000 and introduced by ISO/IEC. Instead of using the DCT, it employs the DWT and supports both lossless and lossy compression schemes. Numerous features are provided in the JPEG2000 standard in addition to the basic compression functionality, including the following [24], giving JPEG2000 a very large potential application base:
1. Progressive recovery of an image by fidelity or resolution
2. Region of interest coding
3. Random access to particular regions of an image
4. Flexible file format with provisions for specifying opacity information and
image sequences
5. Good error resilience
Figure 2.5 illustrates a block diagram of the JPEG2000 encoding algorithm. The original image is decomposed into rectangular blocks termed tiles and codeblocks for processing in order to avoid massive memory usage.

Figure 2.5 JPEG2000 Encoder Architecture: tiling and DC level shifting → component transformation → 2-D DWT → quantisation → EBCOT Tier-1 (context modelling and arithmetic encoder) → Tier-2 and file formatting → compressed bitstream

The main modules in the JPEG2000 encoder are: Component Transformation, Tiling, 2-D DWT, Quantisation, the EBCOT Tier-1 encoder including Context
Modeling and Arithmetic Encoder, and finally the Tier-2 encoder with File
Formatting. Since the JPEG2000 standard is quite complicated, only a brief
introduction of each module is given here. Readers can refer to the Appendix
for a more detailed description of the standard.
 Tiling and DC Level Shifting: Tiling partitions the original image into a
number of rectangular non-overlapping blocks, termed tiles. Within each
tile, DC level shifting is applied to ensure each of them has a dynamic
range which is approximately centered around zero.
 Component Transformation: Normally, the input image is considered to
have three colour planes (R, G, B). JPEG2000 standard supports two
different transformations: (1) Reversible Colour Transformation (RCT) and
(2) Irreversible Colour Transformation (ICT). RCT can be applied to both
lossless and lossy compression, while ICT can only be used in the lossy
scheme.
 2-D Discrete Wavelet Transform: This is one of the key differences
between JPEG2000 and previous JPEG standard. DWT decomposes a
tile into a number of subbands at different resolution levels with both
frequency and time information. 2-D DWT is a further decomposition
based on 1-D DWT. In JPEG2000, a modified scheme termed lifting-based
DWT [25-26] is utilised to simplify the computation (an illustrative sketch follows this list).
 Quantisation: In lossy compression mode, all the DWT coefficients are
quantised in order to reduce the precision of DWT subbands to aid in
achieving compression [27]. The quantisation is performed by uniform
scalar quantisation with a dead-zone around the origin. After quantisation,
all quantised DWT coefficients are signed integers and are converted into
sign-magnitude representation prior to entropy coding [27] (see the sketch after this list).
 Embedded Block Coding with Optimal Truncation: This is the most
computationally intensive module in JPEG2000. It can be divided into two
coding steps: Tier-1 and Tier-2. Tier-1 coding scheme consists of
fractional bit-plane coding (Context Modeling) and binary arithmetic
coding (Arithmetic Encoding). Context Modeling (CM) codes DWT
coefficients at the bit level using four primitive coding schemes. After CM,
DWT coefficients are coded into Context/Decision (CX/D) pairs. Then
Arithmetic Encoder (AE) continues coding these CX/D pairs to obtain the
compressed bit-stream. After Tier-1 coding step, the bit-stream is
organised by Tier-2 coding step and the final coded bit-stream is
generated.
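To make the lifting and quantisation operations referenced above more concrete (see the notes in the 2-D DWT and Quantisation items), a minimal C sketch is given below. It follows the commonly published 5/3 lifting equations and dead-zone quantiser definition rather than any implementation developed later in this thesis; the function names, signal layout and boundary handling are illustrative assumptions.
…………………………………………………………………………………………......
#include <stdint.h>
#include <stdlib.h>

/* One decomposition level of the reversible 5/3 lifting DWT on a 1-D signal
 * of even length n: a predict step produces the high-pass (detail) samples,
 * then an update step produces the low-pass samples.  Boundary samples use
 * simple symmetric extension; an arithmetic right shift is assumed for the
 * floor divisions on signed values. */
void dwt53_1d(const int32_t *x, int32_t *low, int32_t *high, int n)
{
    int half = n / 2;
    for (int i = 0; i < half; i++) {                      /* predict */
        int32_t left  = x[2 * i];
        int32_t right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        high[i] = x[2 * i + 1] - ((left + right) >> 1);
    }
    for (int i = 0; i < half; i++) {                      /* update */
        int32_t d_prev = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + ((d_prev + high[i] + 2) >> 2);
    }
}

/* Dead-zone uniform scalar quantiser: q = sign(c) * floor(|c| / delta).
 * The result is naturally held as a sign and a magnitude, which is the
 * form required before entropy coding. */
int32_t deadzone_quantise(int32_t c, int32_t delta)
{
    int32_t magnitude = abs(c) / delta;
    return (c < 0) ? -magnitude : magnitude;
}
…………………………………………………………………………………………......
A 2-D transform is then obtained in the usual separable way by applying the 1-D lifting step first to every row and then to every column of the tile.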
2.4. Literature Review
2.4.1. Demosaicing Algorithms Evaluations
Performance evaluations and comparisons of various demosaicing algorithms discussed in Section 2.2 have been proposed in [15] and [23].
In [15], a set of test images was employed in the authors’ proposed
experiment, including bar/starburst images and real images such as macaw
and crayon, shown in Figure 2.6 (a)-(h). These images are selected to cover
different cases such as images containing sharp edges, high spatial
frequencies, speckle behaviour and distinct colour edges.
Figure 2.7 illustrates performance comparisons in terms of Mean Squared
Error (MSE). It is found that the Freeman algorithm is best suited to
images with speckle behaviour, while the Laroche-Prescott and Hamilton-Adams
algorithms are best suited to cases with sharp edges [15].
Figure 2.6 Test Images (a)-(h) for Evaluating Different Demosaicing Algorithms in [13]
Figure 2.7 Performance Comparisons between Different Demosaicing Algorithms (MSE, x10^-3)
In [23], performance comparisons were carried out among the Freeman
algorithm, Kimmel algorithm, Tsai-Acharya algorithm and Wenmiao-Peng
algorithm in terms of both Peak Signal to Noise Ratio (PSNR) and execution
time. The test images utilised are shown in Figure 2.8. The first six images
are selected to be synthetic vector images and the other six are actual
photographic images. Figure 2.9 illustrates comparisons in both PSNR and
execution time aspects.
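For reference, the two quality metrics used in these comparisons can be computed as in the following C sketch for 8-bit samples. These are the generic textbook definitions, not the evaluation code used in [15] or [23].
…………………………………………………………………………………………......
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Mean Squared Error between a reference channel and a demosaiced result. */
double mse(const uint8_t *ref, const uint8_t *test, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)ref[i] - (double)test[i];
        acc += d * d;
    }
    return acc / (double)n;
}

/* Peak Signal to Noise Ratio in dB; 255 is the peak value of 8-bit samples. */
double psnr(double mse_value)
{
    return 10.0 * log10((255.0 * 255.0) / mse_value);
}
…………………………………………………………………………………………......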
In this thesis, these demosaicing algorithms are compared from two aspects:
reconstructed image quality and whether the algorithm is suitable for
hardware based implementation. From Figure 2.7, it is seen that the
Freeman demosaicing algorithm provides the lowest MSE for most of the
images. In Figure 2.9, it is seen that both the Freeman algorithm and Wenmiao-Peng algorithm deliver higher PSNR compared with the other two algorithms.
When processing synthetic vector images, Wenmiao-Peng algorithm
Figure 2.8 Test Images in [23]
Figure 2.9 (a) PSNR Comparisons (b) Execution Time Comparisons [23]
provides the best PSNR. However when processing real photographic
images, the Freeman algorithm provides very similar PSNRs. Hence it is
concluded that both the Freeman and Wenmiao-Peng algorithms can provide
good reconstructed image quality. When considering hardware based
implementation, the Freeman algorithm is very straightforward. Both the
bilinear stage and median filter can be easily implemented on hardware. On
the other hand, the Wenmiao-Peng algorithm involves divisions when calculating
its estimating coefficients and intermediate colour values, which require a more
complicated hardware architecture. In this case, as a classical demosaicing
algorithm which provides both good performance and relatively simple
architecture, Freeman demosaicing algorithm is selected in this thesis.
2.4.2. Solutions for Image Processing and Compression
Applications
In industry, the demosaicing task does not usually come as a single-purpose
product. Instead, it is normally integrated within a complete digital image
processing chain as the pre-processing module. On the other hand, there are
various JPEG2000 encoder solutions based on different architectures
including custom chips, FPGAs, DSP&VLIW based SoCs, coarse-grained
reconfigurable architectures and so on. Generally, these solutions are
popular to some extent, although each has drawbacks that cannot be neglected. In
the following subsections, various solutions for image processing and
compression applications including demosaicing and JPEG2000 are
discussed.
2.4.2.1. Custom Chip Implementations
Custom chip implementations are traditionally popular in designing image
processing and compression solutions. Usually these chips are fully
customised for targeted imaging applications and designed through the
standard ASIC design flow involving Register Transfer Level (RTL) coding,
logic synthesis and layout design. Recently, some high-end custom chips
choose to integrate one or more processor cores into their dedicated
hardware architectures for performance enhancement. One such example is
STMicroelectronics STV0986 processor [28] which provides a full image
processing chain including noise filter, demosaicing, sharpness enhancement,
lens shading correction, etc. STV0986 has a video processor, two video
pipes and a dedicated JPEG encoder. It can provide throughput up to 12.5
fps in JPEG format at 5 megapixel resolution. Another example is NXP
PNX4103 [29] which is a multimedia processing chip with embedded TM3270
and ARM926 cores. It can efficiently realise imaging tasks such as
demosaicing, white balancing, image stabilisation, sharpening, etc. and
supports video standards such as H.264 and MPEG. For JPEG2000
solutions, one such example is Analog Devices ADV212 JPEG2000 codec
[30] which can deliver up to 65 Million Symbols per Second (MSPS) for 9/7
irreversible mode or 40 MSPS for 5/3 reversible mode. Another example is
Bacro BA110 JPEG2000 encoder [31] supporting 720p/1080p High Definition
(HD) videos. Other commercial custom chip solutions include intoPIX
RB5C634A JPEG2000 encoder [32], providing throughput up to 27 Mpixels/s
without truncation and so on. Obviously these ASIC based implementations
offer high throughput, high power efficiency and small footprints for image
processing and compression solutions. However, even with embedded
processor cores, this kind of solution is inherently inflexible, as the chips are fully
customised and cannot be upgraded or altered after fabrication. This is one
of the main drawbacks, particularly when ASIC solutions are used for imaging
applications where algorithms evolve rapidly. Meanwhile, designing fully
custom chips requires more human effort and cannot meet short time-to-market demands.
2.4.2.2. FPGA Based Implementations
Image processing and compression applications can be mapped onto FPGAs
for fast hardware prototyping and for domains where the costs of power
and area are less important. Compared with ASIC solutions, FPGA based
solutions can provide more flexibility and shorter time-to-market.
 FPGA based demosaicing solutions
A Bayer CFA interpolation IP core targeting FPGAs and ASICs was
presented by ASICFPGA Ltd. in [33]. This interpolation IP is based on
ASICFPGA’s own demosaicing algorithm which has a 5x5 processing
window. This algorithm is similar in nature to the Laroche-Prescott algorithm,
as both of them try to detect the change of colour edges in the image. Given
an 8-bit input bit-width, this IP core can work at a frequency of up to 129MHz
on a Virtex 4 LX15 FPGA. The authors in [34] presented a bilinear
demosaicing engine on a Virtex 4 FPGA. The engine demonstrates
throughput up to 150MPixels/s when working at a frequency of 150MHz.
 FPGA based JPEG2000 solutions
FPGAs are also traditionally popular solutions for complex imaging tasks such
as JPEG2000. Announced in [35], the JPEG2K-E Intellectual Property (IP)
core can be mapped on Xilinx Virtex 4~6 and Spartan-6 FPGA families,
providing throughput up to 190MSamples/s based on a 90nm technology.
The JPEG2K-E IP consists of a 2-D DWT engine, a quantiser and multiple
EBCOT engines. Based on Virtex-6 FPGA platforms, JPEG2K-E IP runs at
the frequency of 210MHz and consumes 44K LUTs and 77 BRAMs.
The authors in [36] presented a memory-efficient JPEG2000 architecture
including 2-D DWT and EBCOT on a Xilinx Virtex-II FPGA platform. A multi-level line-based lifting 2-D DWT was implemented, which was claimed to be
able to support multiple DWT levels being executed simultaneously. Based on
the line-based DWT, a parallel EBCOT engine was established. The authors
declared that their implementation can provide throughput up to
44.76 Mpixels/s with a working frequency of 100MHz.
M. Gangadhar and D. Bhatia presented an FPGA based EBCOT architecture
in [37]. A parallel architecture was developed in their work which can process
three coding passes simultaneously. Two column based processing elements
were designed to code different coding passes in parallel. With a XC2V1000
device running at 50MHz, the proposed application can encode a 512x512
grayscale image in less than 0.03s.
An Altera APEX20K FPGA was selected in [38] for the implementation of a
parallel EBCOT Tier-1 encoder. The authors presented a split arithmetic
encoder for the EBCOT Tier-1 process, which exploited the causal
relationship between different coding passes and enabled the AE to code context
information generated by different coding passes simultaneously. Results
showed that the proposed architecture offered a 55% improvement in
processing time compared with the traditional serial architecture for a set of
different test images.
There are also JPEG2000 implementations based on FPGA and DSP
combined platforms such as BroadMotion JPEG2000 codec [39] on a
combination of Altera Cyclone II FPGA and TMS320DM64x DSP. Since
FPGAs are fine-grained, they require more configuration bits compared with
coarse-grained reconfigurable architectures. Meanwhile, a majority of the
transistors in FPGAs are used for providing reconfigurability. In this case,
traditional FPGAs usually consume more power than ASICs. However,
FPGAs are evolving rapidly with the latest process technology, and
applications based on older FPGAs can be easily ported to newer
devices. In this case, the performance of FPGA based applications highly
depends on the manufacturing process technology and the FPGA device
itself. Moreover, FPGA based applications are normally developed with
certain hardware description languages such as VHDL and Verilog, which
increases the design difficulty for engineers.
2.4.2.3. DSP Based SoC Implementations
Instead of designing basic components from RTL, another popular solution
for image processing and compression tasks is to use third-party or in-house
DSPs to build SoCs. Several DSP solutions targeting imaging applications
including demosaicing and JPEG2000 are introduced in the following
subsections.
Figure 2.10 HiveFlex ISP2300 Block Diagram [41]
 HiveFlex ISP2000 series
The HiveFlex ISP2000 series [40], provided by Silicon Hive, is a series of
licensable, silicon-proven, C-programmable processor IPs optimised for
image signal processing. Figure 2.10 illustrates the architecture of the HiveFlex
ISP2300 as an example. It has an instruction set optimised for the image
processing domain and a combination of VLIW and Single Instruction
Multiple Data (SIMD) parallelism [40]. With different configurations, its SIMD
datapath can vary from 4-way to 128-way. HiveFlex ISP2300 is C
programmable, and it has scalar data path for standard C programs.
HiveFlex ISP2300 has dedicated hardware peripherals such as encoding
accelerator and filterbank accelerator in order to enhance its performance. It
supports a full-featured image processing chain including demosaicing with
Silicon Hive’s patented technology, wide dynamic range visual optimisation,
red eye removal, flexible scaling, JPEG codec, etc. With the maximum 128
SIMD factor, HiveFlex ISP 2300’s performance can reach up to 170 GOPS at
333MHz and can support full HD 1080p video at 30 fps [40].
 Philips TriMedia TM1300
Philips TriMedia TM1300 [41] is an advanced 32-bit 5-issue VLIW processor
core. Specialised processing blocks are integrated into the device in addition
to the programmable VLIW core. The VLIW architecture of TM1300 allows
parallelism of instruction execution and data manipulation. The special
functional blocks of TM1300 include digital audio ports, an image coprocessor, PCI and other external device interfaces, a memory controller,
video I/O ports, an MPEG variable length decoder block and a multi-mode
fixed-function video scaling and filtering engine [42]. Figure 2.11 illustrates the
TM1300 block diagram. This processor is completely C language
programmable. In addition to providing an object oriented C/C++ compiler
and debug tools, TriMedia software tool flow provides a real-time OS kernel,
optimisation tools, a simulator and application code libraries for industry
standard digital audio and video stream processing algorithms.
Figure 2.11 TM1300 Block Diagram [42]
In the
reference design [42], a fast EBCOT CM algorithm is implemented and
optimised on TM1300 processor. Optimisation approaches include using
custom operations, simplifying logic operations, removing conditional
branches, loop fusion, etc. The simulation results demonstrate an execution
time of 10.26 ms for processing the standard 256x256 Lena test image by CM
with a working frequency of 143MHz.
 TI TMS320C64x DSPs
TMS320C64x DSPs are the highest performance fixed-point DSP generation
in the TMS320C6000 DSP platform. They are based on the second-generation (C6416T) and third-generation (C6455) high-performance
advanced VelociTI VLIW architecture developed by Texas Instruments [43].
Figure 2.12 illustrates the block diagram of the TMS320C6416T as an example. It
has six 32/40-bit Arithmetic Logic Units (ALUs), two 16-bit multipliers and 64
32-bit general purpose registers. With performance of up to 8000 Million
Instructions per Second (MIPS) at a working frequency of 1GHz, C6416T
DSP can produce four 16-bit Multiply-Accumulates (MACs) per cycle for a
total of 4000 Million MACs per Second (MMACS). The C6416T DSP has two
embedded coprocessors: Viterbi Decoder Coprocessor (VCP) and Turbo
Decoder Coprocessor (TCP) in order to speed up channel-decoding
operations on chip [43]. Based on 90 nm process technology, C6455 DSP
can support a higher clock rate of 1.2GHz, which enables it with performance
of up to 9600 MIPS [44]. References [45] and [46] demonstrate JPEG2000
encoder designs based on C6416T and C6455 respectively. The utilised
optimisation approaches include Variable Group of Sample Skip (VGOSS) for
CM and SIMD functions in C6416T [45] and system-level compiler
optimisation and DMA utilisation [46]. With these optimisations, the JPEG2000
encoder in [45] shows an encoding time of approximately 74.6 ms for a
256x256 grayscale image while the encoder in [46] demonstrates the
execution time of 45.25 ms for the grayscale Lena image with the same size.
Figure 2.12 TMS320C6416T Block Diagram [44]
 BLACKFIN Processors
BLACKFIN DSPs are embedded processors developed by Analog Devices.
They use a 32-bit RISC microcontroller programming model on a SIMD
architecture which offers low power and high performance features. The
ADSP-BF535 processor [47] combines a 32-bit RISC-like instruction set and
dual 16-bit MAC signal processing functionality with an extendable
addressing capability. Figure 2.13 illustrates the core architecture of the ADSP
BF535. It consists of a data arithmetic unit which includes two 16-bit MACs,
two 40-bit ALUs, four 8-bit video ALUs and a single 40-bit barrel shifter [48].
Figure 2.13 ADSP BF535 Core Architecture [48]
The two Data Address Generators (DAGs) support bit-reversed addressing
and circular buffering. Registers occupied by BF535 include six 32-bit
address pointer registers for fetching operands, index registers, modifier
registers, base registers and length registers [48]. ADSP-BF535 processor
contains a rich set of peripherals connected to the core via several high
bandwidth buses. Based on its dual-core architecture, ADSP-BF561
processor offers higher performance [49]. References [48] and [50] present
JPEG2000 implementations based on ADSP-BF535 and BF561 respectively.
In [48], LUTs and functional macros are utilised for the code optimisation.
The execution complexity of each submodule in JPEG2000 is analysed, although
the coding time is not given. In [50], optimisation mainly focuses on logic
simplification, code reusing and memory arrangement. The execution time
provided by [50] is approximately 53 ms for encoding a 256x256 grayscale
image.
 Other DSP/VLIW based Implementations
There are a number of other DSP/VLIW based imaging solutions, such as a
CPU-based JPEG2000 implementation [51], an ARM920T implementation [52] and an
STMicroelectronics LX-ST230 based JPEG2000 implementation [52]. These
implementations focus on either algorithm optimisation [51] or efficient task
mapping schemes [52] in order to accelerate the coding process.
Generally, the traditional DSP/VLIW solutions discussed above have noticeably
lower throughput compared with ASIC/FPGA solutions, although they are
usually more power efficient. In this case, traditional DSP/VLIW based
solutions should not be considered as the ideal solution for imaging
applications in next generation digital cameras. On the other hand, DSPs
specialised for imaging applications like [40] have their limitations such as
lack of ILP. Meanwhile, as various dedicated hardware peripherals are
usually integrated into specialised imaging DSPs, they may require longer
time-to-market and cost more than traditional DSPs.
2.4.2.4. Coarse-Grained Reconfigurable Architecture Based Implementations
Recently, a new category of programmable processor architectures for
demanding DSP applications, termed coarse-grained reconfigurable
architectures, has emerged targeting high-performance and area-efficient
computing applications. Different from traditional FPGAs and DSPs, coarse-grained
reconfigurable architectures can be regarded as hardware
components whose internal architecture can be dynamically reconfigured in
order to implement different algorithms. Since the internal circuits can be
reused for implementing different functionalities at different times, and less
configuration information is required than for fine-grained architectures,
coarse-grained reconfigurable architectures are more area and power
efficient compared with FPGAs. Meanwhile, coarse-grained reconfigurable
architectures also offer software-like programmability similar to DSPs, and
they are more efficient because computational functionality is implemented
directly in hardware [7].
Unfortunately, as far as our investigation is concerned, there are only a few
demosaicing and JPEG2000 applications based on coarse-grained
reconfigurable architectures. In this subsection, several coarse-grained
reconfigurable architectures targeting image processing and multimedia
applications are discussed. Some of these architectures have demosaicing
engine implementations, such as CRISP in [53], or implementations of core tasks
of the JPEG2000 standard, such as the NEC Dynamically Reconfigurable Processor
(DRP) in [54]. Others have their potential for imaging applications
demonstrated by applying tasks such as DCT, median filtering, FIR, etc.
 CRISP
Different from other coarse-grained reconfigurable architectures, CRISP
processor [53] consists of context registers, main controller, reconfigurable
interconnection, and various kinds of coarse-grained Reconfigurable Stage
Processing Elements (RSPEs). Each kind of RSPE corresponds to one
module specified for image processing such as load memory, pixel-based
operation, colour interpolation, downsample, etc. Figure 2.14 illustrates the
architecture of the CRISP processor.
Figure 2.14 CRISP Processor Architecture [54]
The authors in [53] implemented several
typical image processing tasks such as Gamma correction, demosaicing,
median filter, smooth filter, etc. on a fabricated chip. Performance
comparisons are made between CRISP and DSPs such as Philips TM1300
and TMS320C64x and CRISP demonstrates good throughput improvement.
However, since the CRISP processor is more ASIC-like, as it has dedicated
hardwired imaging-targeted RSPEs, these comparisons become less
convincing to some extent.
 NEC Dynamically Reconfigurable Processor
NEC DRP [55] is a coarse-grained dynamically reconfigurable processor core
released by NEC. It carries on-chip configuration data, or contexts, and
dynamically reschedules these contexts to realise multiple functions. Sixty-four
primitive 8-bit Processing Elements (PEs) are combined to form
what is called a tile, and a DRP core consists of an arbitrary number of these
tiles (Figure 2.15(a)). The architecture of a PE is illustrated in Figure 2.15(b).
A PE has an 8-bit ALU, an 8-bit Data Management Unit (DMU) (for
shifts/masks), an 8-bit x 16-word Register File Unit (RFU), and an 8-bit Flip-Flop Unit (FFU). These units are connected by programmable wires specified
by instruction data, and their bitwidths range from 8 Bytes to 18 Bytes
depending on the location. A PE has 16-depth instruction memories (e.g. 16
contexts) and supports multiple context operation [54].
Figure 2.15 (a) NEC DRP Structure (b) PE in NEC DRP [56]
Based on NEC DRP architecture, the authors in [54] implement some core
tasks in JPEG2000 encoding algorithm including 2-D DWT, significant coding
pass in CM and AE. The optimisation approaches mainly focus on efficient
context controlling and reducing the number of occupied PEs. Without giving
the performance of processing a real image, the NEC DRP demonstrates an
execution time of 0.213ms for processing 256 16-bit samples by the significant
coding pass and 1023 CX/D pairs by AE [54], which shows advantages
compared with the TMS320C6713 DSP based implementations.
 MorphoSys
MorphoSys [56] is a reconfigurable architecture for computation intensive
applications based on a combination of both coarse-grain and fine-grain
reconfiguration techniques. Figure 2.16(a) illustrates the architecture of the
MorphoSys processor. The reconfigurable part in MorphoSys is an RC array.
An RC array is an 8x8 array of Reconfigurable Cells (RCs). The configuration
data is stored in the context memory. The architecture of an RC is illustrated
in Figure 2.16(b). Each RC consists of four types of basic elements:
functional units for arithmetic and logic operations, memory element to feed
the functional units and store their results, input and output modules to
connect cells together to form the RC array architecture and a fine grain
reconfigurable logic block. TinyRisc [57] is a general-purpose 32-bit RISC
processor. It controls the operation sequence in MorphoSys and executes non-data-parallel operations [56]. The authors in [56] presented implementations of
DCT, FFT and correlation based on the MorphoSys processor, which show
advantages compared with TMS320C6000 DSPs.
Figure 2.16 (a) MorphoSys Architecture (b) RC Array Architecture [57]
Figure 2.17 (a) ADRES Architecture (b) RC Architecture [59]
 ADRES
ADRES architecture [58] is a combination of a VLIW processor and a coarse-grained reconfigurable matrix. Figure 2.17(a) illustrates the architecture of
ADRES. For the VLIW part, several Function Units (FUs) are allocated and
connected together through one multi-port register file, which is typical for
VLIW architecture. For the reconfigurable matrix part, there are a number of
RCs which basically comprise FUs and Register Files (RFs) as illustrated in
Figure 2.17(b) [58]. FUs in ADRES perform coarse-grained operations on
32-bit operands. Based on the ADRES architecture, the authors in [59] presented
implementations of both Tiff2BW transform and wavelet transform
benchmarks and made comparisons with TI C64x DSP implementations.
2.5. Demand for Novel Architectures
Table 2.2 provides brief comparisons of the different reviewed architectures for
image processing applications. As presented, customised chips offer good
performance in aspects of throughput and power efficiency for imaging
solutions. However their flexibility is strictly limited since these chips are fully
customised. This drawback becomes extremely noticeable when such
customised chips are used for rapidly evolving imaging technologies. For
Table 2.2 Comparisons of Different Architectures for Image Processing Applications

Customised Chips (including pure ASICs and customised chips with embedded CPUs):
STV0986 [29]: Dedicated hardware with embedded video processor core. Target applications: image/video processing, JPEG.
NXP PNX4103 [30]: Dedicated hardware with embedded TM3270 and ARM cores. Target applications: image/video processing, H.264, MPEG.
ADV212 [31]: Dedicated hardware with embedded RISC processor. Target applications: JPEG2000.
Bacro BA110 [32]: Dedicated hardware. Target applications: JPEG2000.
intoPIX RB5C634A [33]: Dedicated hardware. Target applications: JPEG2000.

FPGA Based Implementations (including IP for FPGA and ASIC):
ASICFPGA IP [34]: IP for FPGA and ASIC. Target applications: demosaicing.
JPEG2K-E IP [36]: IP for FPGA and ASIC. Target applications: JPEG2000.
BroadMotion [40]: Combination of Altera FPGA and TI DSP. Target applications: JPEG2000.
Other FPGA applications [35], [37-39]: FPGA. Target applications: demosaicing, JPEG2000, etc.

DSP and VLIW Based Implementations:
HiveFlex ISP2300 [41]: Programmable VLIW core with dedicated imaging hardware. Target applications: image/video processing.
Philips TM1300 [42]: Programmable VLIW core with imaging peripherals. Target applications: JPEG2000.
TMS320C64x [44] [45]: Programmable VLIW DSP. Target applications: JPEG2000.
ADSP-BF535/561 [48] [50]: Programmable DSP. Target applications: JPEG2000.
ARM920T [53]: Programmable DSP. Target applications: JPEG2000.
STMicroelectronics LX-ST230 [53]: Programmable DSP, supports multi-core architecture. Target applications: JPEG2000.

Coarse-Grained Reconfigurable Architectures:
CRISP [54]: Dedicated imaging RSPEs with programmable connections and controllers. Target applications: Gamma correction, demosaicing, median filter, smooth filter.
NEC DRP [55]: PE array with programmable connections. Target applications: JPEG2000.
MorphoSys [57]: RC array with TinyRisc and peripherals. Target applications: DCT, FFT, correlations, etc.
ADRES [59]: Reconfigurable matrix with VLIW. Target applications: Tiff2BW transform, wavelet transform.
those products that have embedded processor cores, such as [28], [29] and
[30], the flexibility is relatively higher compared with other pure ASICs.
However, the massive human effort and long time-to-market required for
development cannot be ignored.
FPGA based solutions offer much more flexibility compared with customised
chips while keeping comparably high throughput. Meanwhile, FPGA based
solutions require less development time and human effort than ASICs.
However, traditional FPGAs may not be power or area efficient for imaging
applications, especially for mobile devices. Although new FPGA devices
based on the latest manufacturing processes are released frequently, their actual
performance and power dissipation for complex imaging tasks need to be
evaluated and tested.
Based on their inherent nature, DSP based solutions offer high flexibility and
easy programmability. Meanwhile, the possibility of adding extended imaging
instruction sets allows DSPs to be utilised for image processing solutions.
However, although the SIMD technique is utilised in some solutions such as [40]
in order to increase Data Level Parallelism (DLP), DSP based solutions often
suffer from the limited level of ILP found in their typical programs, leading to
restricted performance. Moreover, since DSPs usually have quite high
working frequencies, their power dissipation can become critical in
power-sensitive applications.
Coarse-grained reconfigurable architectures fill the gap between traditional
FPGAs/ASICs and DSPs. Compared with customised chips, it is obvious
that coarse-grained reconfigurable architectures offer much more flexibility.
Meanwhile, because they are based on a set of reusable hardware components
and/or reconfigurable connections and require less configuration information,
coarse-grained reconfigurable architectures are more area and power
efficient compared with fine-grained FPGAs. Moreover, while providing
software-like programmability similar to DSPs, coarse-grained reconfigurable
architectures are more efficient since their hardware based nature offers high
levels of both ILP and DLP.
Generally, an ideal architecture for imaging solutions should provide high
throughput, high flexibility and low power dissipation. Meanwhile, since the
amount of data in imaging applications is usually higher than that in other
applications such as communication, high levels of both ILP and DLP
become critical. Based on all the discussion above, coarse-grained
reconfigurable architectures appear to be strong candidates for image
processing solutions. Since there are only few coarse-grained reconfigurable
architecture based solutions for image processing applications having been
proposed, this thesis presents customised dynamically reconfigurable
architecture based on coarse-grained Reconfigurable Instruction Cell Array
(RICA) [6] paradigm for digital image processing and compression
applications such as demosaicing and JPEG2000 standard, which will be
detailed in the following chapters. Since different platforms are evaluated
from aspects of throughput, power dissipation and flexibility in this chapter,
the work described in this thesis will be evaluated with similar metrics. In the
following chapters, throughput and area (directly relevant to power
dissipation) are mainly used for evaluation. On the other hand, since the
flexibility limitation only applies to ASICs, this metric will not be included in
the following evaluation.
2.6. Conclusion
This chapter has introduced basic theories of digital image processing and
compression technologies. With a brief review of different imaging
technologies, demosaicing and JPEG2000 compression standard are
particularly discussed in detail. In Section 2.2, a number of existing
demosaicing algorithms including bilinear, Cok, Freeman, Laroche-Prescott,
Hamilton-Adams, Kimmel, Tsai-Acharya and Wenmiao-Peng are presented. In
the following Section 2.3, different modules in JPEG2000 compression
standard are introduced.
Section 2.4 presents the literature review which mainly focuses on
demosaicing algorithms evaluation and imaging solutions based on various
architectures. In Section 2.4.1, MSE and PSNR are utilised to evaluate
performance of different demosaicing algorithms. Based on performance
comparisons and complexity evaluation, Freeman algorithm is considered to
be a promising method providing both good performance (especially for
actual photographic images) and relatively simple structure. In Section 2.4.2,
various architectures for image processing and compression solutions are
discussed in aspects of throughput, flexibility and power dissipation. Since the
targeted architecture in this thesis is DSP-like coarse-grained dynamically
reconfigurable, more attention is devoted to solutions based on DSPs
and coarse-grained reconfigurable architectures. It is concluded that
traditional architectures have limitations such as low flexibility (custom chips),
high power consumption (FPGAs) and low throughput (DSPs). On the other
hand, coarse-grained reconfigurable architectures act as strong candidates
for imaging solutions,
since
they
fill the
gap
between
traditional
FPGAs/ASICs and DSPs and provide desirable features like good throughput,
high flexibility, relatively low power dissipation, high levels of both ILP and
DLP, etc.
Based on the above discussion, this thesis aims to develop customised
coarse-grained dynamically reconfigurable architecture for imaging
applications including demosaicing and JPEG2000. A dynamically
reconfigurable instruction cell array paradigm, which will be introduced in
Chapter 3, is chosen to build the proposed architecture. From Chapter 4, this
thesis focuses on presenting imaging solutions such as Freeman
demosaicing and JPEG2000 on the proposed RICA based architecture.
Chapter 3
RICA Paradigm Introduction and Case
Studies
3.1. Introduction
As discussed in Chapter 2, there are several established architectures which
can be utilised for image processing solutions. In conclusion, ASICs are well
known to provide low power and high throughput compared with other
architectures; however, they have both high design costs and limited postfabrication flexibility. FPGAs have their success which lies in their ability to
map algorithms onto their logic and interconnects after fabrication, which
actually offers outstanding flexibility. However, an impact on energy
consumption of FPGA based solutions cannot be avoided. DSP&VLIW
architectures offer advantages in terms of generic adaptivity and easy
programming; however their performance is curbed due to the limited amount
of ILP found in typical programs. On the other hand, coarse-grained
reconfigurable architectures appear to be strong candidates for image
processing solutions; and further investigation is required for their actual
potential for imaging applications since little research work has been done in
this field.
In recent years, a novel coarse-grained dynamically Reconfigurable
Instruction Cell Array[6] has emerged, which is promising to be an ideal
candidate for high performance embedded image processing applications
such as demosaicing and JPEG2000 in next generation digital cameras. By
designing the silicon fabric in a similar way to reconfigurable arrays but with a
closer equivalence to software, RICA paradigm based architectures can
achieve performance comparable to coarse-grain FPGA architectures
and maintain the same flexibility, low cost and programmability as DSPs [6].
A detailed introduction of RICA paradigm will be given in the next section.
3.2. Dynamically Reconfigurable Instruction Cell Array
3.2.1. Architecture
The idea behind RICA paradigm is to provide a dynamically reconfigurable
fabric that allows building specialised circuits for different applications.
Instead of using fine-grained CLBs or homogeneous coarse-grained
elements like FPGAs and most CGRAs, RICA has its heterogeneous coarse-grained hardware modules termed Instruction Cells (ICs) [6]. Each IC can be
configured to do a small number of operations as listed in Table 3.1, and the
nature of RICA paradigm is a heterogeneous IC array interconnected through
an island-style programmable mesh fabric as illustrated in Figure 3.1 [6]. All
ICs are expected to be independent and can run concurrently. Having such
an array of interconnectable ICs allows building circuits from an assembly
representation of programs.
Figure 3.1 RICA Paradigm [6] (illustrates C source code compiled to assembly and the dynamic allocation of instruction cells into processing steps, scheduled within the GCC toolchain, together with the data memory banks, I/O ports, step configuration stream, program memory and RRC)
Table 3.1 Instruction Cells in RICA (cell: associated functions)
ADD: Addition and subtraction
MUL: Multiplication
REG: Registers
SHIFT: Shifting
LOGIC: Logic operations
COMP: Comparison
MUX: Multiplexing
I/O REG: Register with access to external I/O ports
RMEM: Interface for reading data memory
WMEM: Interface for writing data memory
I/O Port: Interface for external I/O ports
RRC: Controlling reconfiguration rates
JUMP: Branches
SOURCE: Interface for reading files
SINK: Interface for writing files
SBUF: Interface for accessing stream buffers
The configuration of the ICs and interconnections is changeable on every cycle to execute different blocks of
instructions. As illustrated in Figure 3.1, the processing datapath of RICA is a
reconfigurable array of ICs, where the program memory contains the
configuration instructions that control both the ICs and interconnections [6].
The use of an IC-based reconfigurable architecture as a datapath gives
important advantages over DSP and VLIWs such as better support for
parallel processing. The RICA architecture can execute a block containing
both independent and dependent instructions in the same cycle, which
prevents the dependent instruction from limiting the amount of ILP in the
program [6]. Different from traditional DSPs, RICA has a reconfigurable
datapath which implies that it does not have fixed clock cycles but an
operation chain, which means that RICA can execute both dependent and
independent instructions in parallel within one configuration context. In this
case, the cycle in RICA architecture is termed step. In contrast to traditional
processors which have computation units on critical paths pipelined to
improve the throughput, RICA architecture introduces variable clock cycles in
different steps to ensure longer critical paths consume more clock cycles.
Due to the heterogeneous nature of RICA, one of its salient characteristics is
that the IC array can be customised at the design stage according to the
requirements of the targeted application in terms of the number and type of ICs,
which leads to efficient computational resource utilisation and better system
performance. Another distinction from a conventional processor is the
memory access pattern. RICA paradigm allows multiple simultaneous
reading and writing operations to multiple memory locations within one step.
Since there are four memory banks in the current RICA paradigm, a total of
four memory read and four memory write operations are supported in a single
step.
3.2.2. RICA Tool Flow
An automatic tool flow has been developed for the simulation of the RICA
paradigm based architectures. The tool takes a definition of the available ICs
in the array along with other parameters such as their count, bitwidth and the
type of interconnections. The specified hardware resources can be
modelled using a simulator written in high-level C/C++ code [6]. If the
required performance determined through the RICA software simulator is
not met, the developer can modify the original code or change the mixture
of the available IC resources to improve the performance. RICA supports
pure ANSI-C programmability in a manner very similar to conventional
processors and DSPs. A dedicated tool flow [6] for RICA has been developed
which comprises compiler, scheduler, placement & routing, simulator and
emulator. Figure 3.2 illustrates the working mode of RICA tool flow.
 Compiler: The compiler takes the high-level C code and transforms it into
an intermediate assembly language format [6]. This transformation is
performed by an open source GCC compiler. After the compilation, a
RICA-targeted assembly file is obtained, which consists of instructions for
the ICs in RICA based architecture.
Figure 3.2 RICA Tool Flow (C code -> Compiler -> assembly code -> Scheduler -> netlist of steps -> Placement and Routing -> configuration bits; the Simulator/Emulator produces a profile, an execution trace and a memory dump; an MDF describes the target array)
 Scheduler: The RICA scheduler takes the assembly file generated by the
compiler and tries to create a netlist to represent the program. The netlist
contains blocks of instructions that will be executed in a single step [6].
The partitioning into different steps is performed after scheduling the
instructions and investigating the dependencies between them. Within a
step, dependent instructions are connected in sequence, and
independent instructions are executed in parallel. The scheduler [60] takes
into account the available ICs, interconnections and timing constraints in
the array, specified by a Machine Description File (MDF). It also performs
optimisations on the code such as removing temporary registers [6].
 Simulator: The simulator takes the instruction blocks in the netlist file and
executes them step by step. It also takes into account the timing
constraints defined in the netlist to estimate the execution time
of the current application. The output of the RICA simulator includes
a profile, which contains the execution time, number of steps, etc.; an execution
trace, which is a detailed trace indicating how the application is
executed; and a memory dump containing the data written into memory.
 Placement and Routing: If the RICA based architecture needs to be
mapped to a physical chip, a tool is provided to minimise the distance
when allocating all the ICs and connecting them to each other [6]. When
the placement and routing netlist file is generated, the configuration bits
can be generated.
3.2.3. Optimisation Approaches to RICA Based Applications
As a dynamically reconfigurable architecture, RICA needs to be reconfigured
for each step. The time consumed by fetching and loading configuration
instructions and configuring the IC array is called configuration latency. Due
to different steps occupy different numbers of ICs, the configuration latency is
variable. Generally, configuration latency for a certain step is smaller than the
step execution time. Therefore the configuration instruction set for next step
can be pre-fetched when executing the current step in order to eliminate the
latency. This kind of pre-fetching operation can work only if there is no
conditional branch involved in the current step; otherwise the location of next
step will be unknown until the condition for branch is computed. Meanwhile,
in the case that there are successive iterations of certain loops existing in the
application, if a loop can be placed into one single step, the instruction fetch
scheme associated with RICA allows the instruction set for the loop to be
fetched only once from the instruction memory before executing the loop.
This kind of step is termed kernel. In the case of executing a kernel,
configuration latency will be introduced only at the first iteration, and
nowhere else during the remaining iterations [6]. Moreover, a kernel can be
pipelined into several stages, with which its critical path will be shortened,
leading to execution time reduction.
A simple example is provided here to indicate the performance improvement
by constructing kernels instead of keeping the code in separate steps. Given
the following code:
…………………………………………………………………………………………......
for (i=0; i<300; i++)
{
if(a[i]>b[i])
e[i] = a[i] * b[i];
else
e[i] = 0;
}
…………………………………………………………………………………………......
This fragment of code is scheduled into 2 steps by the RICA tool flow, and the
execution time provided by the simulator is 7.842us. If the code is modified
as follows:
…………………………………………………………………………………………......
for (i=0; i<300; i++)
{
asm volatile ("MUX
\tout= %0 \tin1= %1 \tin2= %2 \tsel=%3 \tconf=
`MUX_COND_NEZ_SI" : "=r" (e[i]) : "r" ((a[i])*(b[i])) , "r" (0), "r" ((a[i])>(b[i])));
}
…………………………………………………………………………………………......
This modification uses a multiplexer to generate the required output
instead of a conditional branch. The RICA tool flow places this
modified code into a single step (kernel), and the reported execution time is
3.948us.
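For readers without access to the RICA tool flow, the same branch elimination can be expressed in portable C with a mask-based select, as in the minimal sketch below; a scheduler targeting a multiplexer-capable datapath can map such a select onto a MUX cell. The function name and types here are illustrative assumptions rather than RICA code.
…………………………………………………………………………………………......
#include <stdint.h>

/* Equivalent of e[i] = (a[i] > b[i]) ? a[i]*b[i] : 0 without a conditional
 * branch: the comparison yields 0 or 1, and negating it gives an all-zeros
 * or all-ones mask that gates the product. */
void select_no_branch(const int32_t *a, const int32_t *b, int32_t *e, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t mask = -(int32_t)(a[i] > b[i]);   /* 0x00000000 or 0xFFFFFFFF */
        e[i] = (a[i] * b[i]) & mask;
    }
}
…………………………………………………………………………………………......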
From this simple example, it is obvious that constructing kernels is essential
for RICA based applications. In order to construct kernels, firstly the
conditional branches existing in the code must be eliminated. Usually
multiplexers are used to realise such eliminations. Meanwhile, the available
IC resource in RICA architecture must satisfy the minimum requirement of all
instructions in a kernel, otherwise the RICA architecture needs to be tailored.
Moreover, the memory accesses in a kernel should not exceed the maximum
allowance of the RICA architecture (four writes and four reads in a step).
3.3. Case Studies
Based on the introduction to RICA paradigm, several applications have been
implemented on RICA based architecture and evaluated in terms of
performance and efficiency. The following two sections discuss a Reed-Solomon (RS) decoder and a WiMAX OFDM symbol timing synchronisation
engine, which have been implemented on customised RICA based
architectures and optimised for performance improvement. These two
implementations are used as case studies targeting RICA based
applications. Since these two communication applications are not directly
relevant to the main work in this thesis, only a brief introduction is presented
here, and the detailed work can be found in the author’s previous published
work [61-62] for the RS decoder and [63] for the OFDM timing
synchronisation engine.
 Reed-Solomon Decoder: The RS coding algorithm is constructed over a Galois
Field (GF), which has its own arithmetic rules. The conventional
approach uses Look-Up Tables (LUTs) to calculate multiplications in GF (an illustrative GF multiplication sketch follows this list).
In this case study, a 32-bit GF Multiplier (GFMUL) is employed as a
custom IC integrated in RICA paradigm based architecture, which
significantly reduces the computational complexity. Meanwhile, the SIMD
technique is applied to accelerate the coding process, and kernels are
constructed for certain modules such as syndrome calculation and Chien
search.
 OFDM Timing Synchronisation Engine: This work utilises the Maximum
Likelihood (ML) estimation algorithm [64] to estimate the OFDM symbol
time offset. Instead of using memory blocks, two shifting register windows
are constructed for the accumulating calculation in the algorithm. Kernels
are also constructed for the entire engine.
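As referenced in the Reed-Solomon item above, a multiplication in GF(2^8) can be computed either through LUTs or directly by a shift-and-XOR (carry-less) multiply with polynomial reduction, which is essentially the operation a dedicated GFMUL cell performs in one step. The sketch below is a generic illustration, assuming the primitive polynomial 0x11D; the actual field polynomial depends on the RS code used in [61-62].
…………………………………………………………………………………………......
#include <stdint.h>

/* Shift-and-XOR multiplication in GF(2^8) with the example primitive
 * polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D).  A hardware GFMUL cell
 * collapses this loop, with its data-dependent branches, into a single
 * operation. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1)
            result ^= a;            /* add (XOR) the current partial product */
        uint8_t carry = a & 0x80;   /* shifting a would leave the field */
        a <<= 1;
        if (carry)
            a ^= 0x1D;              /* reduce modulo the primitive polynomial */
        b >>= 1;
    }
    return result;
}
…………………………………………………………………………………………......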
3.4. Outcomes of Case Studies
Although the two case studies performed in this thesis are communication
tasks rather than imaging applications, they both thoroughly investigated the
potential of the RICA paradigm and indicated possible optimisation
approaches. The RS decoder was implemented on customised RICA based
architecture with GFMUL cells. With SIMD technique, RICA paradigm
demonstrates high level of DLP, which is a desired feature for image
processing applications. Meanwhile, the ILP nature of RICA paradigm
enables multiple GFMUL cells, together with a number of other ICs, to be
executed simultaneously within a step. With high levels of both DLP and ILP,
RICA paradigm based architecture is expected to be able to provide good
performance for image processing solutions.
The OFDM timing synchronisation engine was implemented on a tailored RICA
architecture. Two 1-D shifting windows in the ML algorithm were utilised in
order to reduce memory accesses. For image processing tasks such as
demosaicing and filtering, the 1-D shifting window can be easily extended to
establish a 2-D window with a couple of registers. Given the 2-D window
moving along every row/column in an image, RICA paradigm based
architecture can efficiently deal with different imaging tasks.
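As a rough illustration of how such a 2-D register window cuts memory traffic, the C sketch below slides a 3x3 window along one image row so that only one new column (three samples) is read per output pixel. The row-pointer interface and the box-average operation are illustrative assumptions; the actual demosaicing engine is presented in Chapter 4.
…………………………………………………………………………………………......
#include <stdint.h>

/* Slide a 3x3 window along one row.  The nine scalar variables stand in for
 * the register window shifted on hardware; above/centre/below point to the
 * three image lines involved (width must be at least 3, border pixels are
 * left untouched). */
void process_row_3x3(const uint8_t *above, const uint8_t *centre,
                     const uint8_t *below, uint8_t *out, int width)
{
    uint8_t w00 = above[0],  w01 = above[1];
    uint8_t w10 = centre[0], w11 = centre[1];
    uint8_t w20 = below[0],  w21 = below[1];

    for (int x = 1; x < width - 1; x++) {
        /* Only the new right-hand column is fetched from memory. */
        uint8_t w02 = above[x + 1];
        uint8_t w12 = centre[x + 1];
        uint8_t w22 = below[x + 1];

        /* Example window operation: 3x3 box average. */
        uint32_t sum = w00 + w01 + w02 + w10 + w11 + w12 + w20 + w21 + w22;
        out[x] = (uint8_t)(sum / 9);

        /* Shift the window one column to the right. */
        w00 = w01; w01 = w02;
        w10 = w11; w11 = w12;
        w20 = w21; w21 = w22;
    }
}
…………………………………………………………………………………………......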
Considering the two case studies, the most important feature in common is the
construction of kernels. As discussed, with a kernel, the instruction set is only
fetched once before the kernel starts, and there is no configuration latency
during the execution. It is worth noticing that in most image processing tasks
such as demosaicing and JPEG2000, the image is scanned by the
processing engine within a 2-D loop: 1-D for horizontal scanning (every line)
and 1-D for vertical moving (when a new line starts). Thus, if the processing
engine can be placed into a single kernel, there will be no configuration
latency when processing the horizontal scanning. When a new line starts, the
kernel can be either kept the same or reconfigured, depending on the task.
The configuration latency is only introduced when performing the vertical
moving loop and nowhere else. In this case, RICA paradigm based
architectures can be extremely efficient for image processing applications.
Another promising outcome in common is RICA paradigm’s tailorable nature.
In order to satisfy the requirement of constructing a kernel, ICs in RICA
based architecture can be tailored targeting the maximum usage of
computational resource. Moreover, since there are different modules with
various complexities involved in complex image processing applications such
as JPEG2000, RICA based architecture can be dynamically reconfigured to
ensure different tasks are assigned with proper computational resources.
When switching from a computation intensive task to a simple task, the
redundant computational resources can be bypassed in order to reduce
energy dissipation.
Through the case studies, advantages of RICA based architecture are
summarised as follows:
 High levels of both DLP and ILP, providing high throughput.
 Kernels for configuration latency reduction especially when targeting
image processing applications.
 Customisable and tailorable nature.
 Flexible reconfigurability.
 Low power nature.
With these advantages, RICA based architectures are considered to be
strong candidates for solutions to image processing applications. In the
following chapters, this thesis will aim at developing customised dynamically
reconfigurable RICA based architectures targeting various image processing
applications such as demosaicing and JPEG2000.
3.5. Prediction of Different Imaging Tasks on RICA Based
Architecture
Based on the discussion in previous sections, the nature and potential of
RICA based architecture have been clarified. In this section, different imaging
tasks introduced in Section 2.2 (Freeman demosaicing) and 2.3 (core tasks
in JPEG2000 including 2-D DWT, CM and AE) are roughly evaluated
targeting implementations on RICA based architecture. The computational
aspects are characterised, and the possible performance is predicted.
 Freeman demosaicing: When looking into the algorithm, the first stage,
bilinear demosaicing, can be efficiently implemented on RICA based
architecture since it mainly involves simple additions and shifting
operations. When processing different lines, multiplexers can be utilised
to avoid possible conditional branches. On the other hand, how to
efficiently implement the median filter on RICA based architecture
becomes challenging, since the sorting operations in the median filter will
introduce a large number of data swaps and conditional branches. If possible, a
simplified median filtering algorithm should be applied (a min/max-based median sketch is given after this list).
 2-D DWT: The lifting-based 2-D DWT in JPEG2000 significantly reduces
the computational complexity existing in the traditional DWT architecture.
Since the lifting-based architecture has only a small number of additions
and shifting operations and no conditional branches, RICA based
architecture is expected to be able to deliver quite high performance for
DWT implementations. Moreover, RICA based architecture supports
simultaneous multiple memory read/write operations, which provides the
possibility to design some 2-D DWT architecture with high parallelism.
 CM: This is the most computationally intensive module in JPEG2000. On
one hand, the CM algorithm scans 4 bits in a stripe column from top to
bottom. This means that it is possible to develop a parallel architecture
which will accelerate the processing significantly. In this case, the high
levels of both DLP and ILP existing in the RICA paradigm enable
such a parallel architecture to be implemented. On the other hand, since
there are four primitive coding schemes in CM and the output is
generated by one of them according to different conditions, the CM
algorithm is actually branch-intensive. Although multiplexers can be
utilised to avoid branches and construct kernels, the increment in required
computational resources cannot be ignored as all the four coding
schemes need to be executed and then the final output can be selected
by multiplexers. In this case, the trade-off between throughput and
computational resource requirements needs to be carefully balanced when
implementing CM on RICA.
 AE: The computation in AE is actually quite simple. However, the AE
algorithm is also branch-intensive. Different from CM, AE has a serial
architecture, which means that the critical path of the constructed kernel will be
quite long even if all the branches are eliminated. It is worth trying to
implement and optimise AE on RICA based architecture and seeing the
actual performance. If the performance is not good enough, some other
platform can be considered to be a better solution.
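As noted in the Freeman demosaicing item above, one way to avoid the data swaps and conditional branches of a full sort is a min/max-based median, sketched below in C. The median of three is exact; the 3x3 approximation formed from row medians is one common simplification (often called a pseudo median), and it is not implied here that this is the exact filter adopted in Chapter 4.
…………………………………………………………………………………………......
#include <stdint.h>

static inline uint8_t min_u8(uint8_t a, uint8_t b) { return (a < b) ? a : b; }
static inline uint8_t max_u8(uint8_t a, uint8_t b) { return (a > b) ? a : b; }

/* Exact median of three values using only comparison-selects, which map
 * naturally onto multiplexer cells:
 * median(a,b,c) = max(min(a,b), min(max(a,b), c)). */
static inline uint8_t median3(uint8_t a, uint8_t b, uint8_t c)
{
    return max_u8(min_u8(a, b), min_u8(max_u8(a, b), c));
}

/* Approximate 3x3 median: the median of the three row medians.  This avoids
 * the full 9-element sort and its data-dependent swaps. */
uint8_t pseudo_median3x3(const uint8_t w[3][3])
{
    uint8_t r0 = median3(w[0][0], w[0][1], w[0][2]);
    uint8_t r1 = median3(w[1][0], w[1][1], w[1][2]);
    uint8_t r2 = median3(w[2][0], w[2][1], w[2][2]);
    return median3(r0, r1, r2);
}
…………………………………………………………………………………………......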
Generally, given a new algorithm, it should be evaluated from three aspects
to predict whether RICA based architecture is suitable for its implementation:
1. Is the algorithm branch-intensive?
2. Does the algorithm require frequent memory accesses?
3. Is there any inherent parallelism in the algorithm?
In the following chapters, these imaging tasks are implemented on RICA
based architecture and their performance is evaluated. Moreover, for any
given new algorithm, its nature can be evaluated and compared with these
imaging tasks. If there is some similarity between the given new algorithm
and these imaging tasks, the given algorithm’s performance on RICA based
architecture can be roughly predicted.
3.6. Conclusion
In this chapter, the RICA paradigm is introduced. Based on a coarse-grained
architecture consisting of various ICs and reconfigurable connections, RICA
paradigm offers high levels of both DLP and ILP, outstanding flexibility and
easy programmability, which are desirable features for imaging solutions. A
dedicated tool flow associated with RICA paradigm was introduced, which
consists of compiler, scheduler, placement & routing and simulator. This tool
flow can provide developers with the number of steps, required computational
resources, execution time, etc.
In Section 3.3, two case studies, the RS decoder and the WiMAX OFDM symbol
timing synchronisation engine, are discussed in order to
investigate the potential of RICA based architectures. With the RS decoder,
RICA paradigm’s customisable and tailorable nature is well explored.
Meanwhile, it is proved that RICA paradigm offers high levels of both DLP
and ILP, which is an advantage over DSPs and VLIWs. By implementing the
WiMAX OFDM symbol timing synchronisation engine, it is clear that a shifting
window with registers can be efficiently established on RICA based
architecture, and this shifting window can be widely used in various imaging
tasks.
Kernel construction is one of the most important outcomes found through the
two case studies. For imaging tasks, it is possible to place the processing
engine within a single kernel. In this case, the processing time will be
significantly shortened as configuration latency is only introduced when a
new image line starts and nowhere else. Another promising outcome is RICA
paradigm’s tailorable nature. The numbers of different ICs in RICA based
architecture can be tailored to adapt various tasks with the maximum
computational resources usage. When switching from computationally
intensive applications to relatively simple tasks, the redundant ICs can be
bypassed. This ensures RICA paradigm’s power-saving nature.
Based on all the above discussion, the advantages of RICA based architecture
are summarised. With these promising features, RICA based architecture is
proved to be a strong candidate for solutions to image processing
applications. In Chapter 4, this thesis will aim at developing a customised
dynamically reconfigurable RICA based architecture targeting the Freeman
demosaicing algorithm. From Chapter 5, an efficient RICA based solution for
JPEG2000 will be presented.
Chapter 4
Freeman Demosaicing Engine on
RICA Based Architecture
4.1. Introduction
This chapter proposes a Freeman demosaicing engine implemented on RICA
based architecture. The demosaicing engine is highly optimised by an
efficient data buffer rotating scheme and pseudo median filter. Simulation
results demonstrate that the proposed Freeman demosaicing engine can
process a 648x432 image within 2ms. Moreover, based on the algorithm
investigation, a dual-core RICA based architecture is developed and the
demosaicing algorithm is partitioned and mapped onto the dual-core
architecture in order to provide higher performance.
4.2. Freeman Demosaicing Algorithm
Based on the evaluation presented in Chapter 2, the Freeman demosaicing
algorithm is chosen for the targeted application due to its overall good
performance and low implementation complexity [65]. Figure 4.1 (a)
illustrates the Freeman algorithm architecture, consisting of bilinear
demosaicing and median filtering. The first stage estimates the missing
colour components for each pixel by the bilinear interpolation algorithm shown in
formulas (4.1)-(4.6), which may change according to the different pixel
layouts in Bayer pattern as illustrated in Figure 4.1 (b). After the first stage,
three intermediate colour planes are obtained. In the second stage, the colour
value differences (red minus green (R-G) and blue minus green (B-G)) are
median filtered.

Figure 4.1 (a) Freeman Demosaicing Architecture (b) Bilinear Demosaicing for Bayer Pattern

The median filter is widely used in the field of image
processing as a non-linear digital filtering technique. The main idea of the
median filter is to run through the input signal entry by entry, replacing
each entry with the median of its neighbouring entries. The pattern of
neighbours is called the filter window, which slides over the entire signal entry by
entry [66]. When processing a 2-D image, this can be expressed as
$Y_{ij} = \mathrm{Med}\{X_{ij}\}$, where $X_{ij}$ is the set of pixels in the filter window.
The median filter is employed in order to reduce the effects of fringing in images by
removing sudden jumps in hue, which has been discussed in Chapter 2. In
Freeman demosaicing, the median filter is based on using a shifting window
over the image and calculating the median value of pixels within the window
for the output [67]. Since the demosaicing artifacts are generally manifest as
small chromatic splotches, median filtering the R-G and B-G colour planes
tends to eliminate the artifacts efficiently [68]. The final interpolated image is
generated by adding the median filtered colour difference to the
corresponding pixel value; for example, the red value at position 32 in Figure
4.1 (b) is obtained by adding the filtered R-G value to the sampled green
value at this position. Estimated results are used only at positions where the
originally sampled colour component is different; for example, it is not necessary to
estimate the blue value at position 32.
$R_{22} = (R_{11} + R_{13} + R_{31} + R_{33})/4$    (4.1)

$G_{22} = (G_{12} + G_{21} + G_{32} + G_{23})/4$    (4.2)

$B_{22} = B_{22}$    (4.3)

$R_{32} = (R_{31} + R_{33})/2$    (4.4)

$G_{32} = G_{32}$    (4.5)

$B_{32} = (B_{22} + B_{42})/2$    (4.6)
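To make the interpolation concrete, a minimal C sketch of formulas (4.1)-(4.6) is given below for the two pixel layouts of Figure 4.1 (b). The array and function names are purely illustrative and boundary handling is omitted, so this is a sketch rather than the actual RICA implementation.

……………………………………………………………………………………………………………
#include <stdint.h>

/* Illustrative only: bayer[][] holds the raw CFA samples; (r, c) indexes a
 * blue site on the Bayer layout of Figure 4.1 (b). Borders are ignored. */
static void bilinear_at_blue_site(const uint8_t bayer[][2048], int r, int c,
                                  uint8_t *R, uint8_t *G, uint8_t *B)
{
    /* (4.1): red from the four diagonal neighbours */
    *R = (bayer[r-1][c-1] + bayer[r-1][c+1] + bayer[r+1][c-1] + bayer[r+1][c+1]) / 4;
    /* (4.2): green from the four direct neighbours */
    *G = (bayer[r-1][c] + bayer[r][c-1] + bayer[r][c+1] + bayer[r+1][c]) / 4;
    /* (4.3): blue is the sampled value itself */
    *B = bayer[r][c];
}

static void bilinear_at_green_site(const uint8_t bayer[][2048], int r, int c,
                                   uint8_t *R, uint8_t *G, uint8_t *B)
{
    /* (4.4)-(4.6): green row with red neighbours left/right and blue
     * neighbours above/below (position 32 in Figure 4.1 (b)) */
    *R = (bayer[r][c-1] + bayer[r][c+1]) / 2;   /* (4.4) */
    *G = bayer[r][c];                           /* (4.5) */
    *B = (bayer[r-1][c] + bayer[r+1][c]) / 2;   /* (4.6) */
}
……………………………………………………………………………………………………………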
4.3. Freeman Demosaicing Engine Implementation
In order to implement the Freeman demosaicing algorithm on RICA based
architecture efficiently, a set of data buffers are employed as the intermediate
storage for line pixels. Each data buffer's capacity is set to 2048x32 bit, so
the engine can support images with a maximum of 2048 pixels per line. Figure 4.2
illustrates the RICA based Freeman demosaicing engine architecture. A 3x3
shifting window is employed for both bilinear demosaicing and median filter.
Each line of the window corresponds to one line in the image. As illustrated in
Figure 4.3, pixels belonging to the first two lines of the shifting window (line 1,
2) are read out from data buffers by SBUF cells, while the third line (line 3) is
directly read from the original source image via the SOURCE cell.

Figure 4.2 Freeman Demosaicing Implementation Architecture

Figure 4.3 Data Buffers Addresses Rotation

Figure 4.4 Parallel Architecture for Freeman Demosaicing

After the data
within the shifting window have been processed, the data in line 3 are written back into
buffer 1 to replace the oldest line 1. When a line is finished, the read addresses for
buffers 1 and 2 are switched so that these two buffers contain the data
belonging to lines 2 and 3 correctly, without any change to the code for the next
line iteration. As three separate colour planes are generated after
bilinear demosaicing, a total of eight data buffers are occupied for the overall
implementation: two for bilinear demosaicing and six for median filtering.
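The following minimal C sketch illustrates the buffer rotating idea under simplifying assumptions: line_buf, read_pixel() and process_window() are hypothetical names, only one value per window line is shown, and the initial filling of the first two lines is omitted. The point is that line 3 overwrites the oldest stored line in place, and only the read roles of the two buffers are swapped at the end of each line, so no data are copied.

……………………………………………………………………………………………………………
#define WIDTH 2048

extern int  read_pixel(int y, int x);                       /* hypothetical source access   */
extern void process_window(int top, int mid, int low, int x); /* hypothetical window processing */

static int line_buf[2][WIDTH];   /* two stored lines of the 3x3 window */

void process_image(int height, int width)
{
    int top = 0, mid = 1;        /* which buffer currently plays line 1 / line 2 */
    /* assumes line_buf has been preloaded with the first two image lines */
    for (int y = 2; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int p_top = line_buf[top][x];    /* window line 1                  */
            int p_mid = line_buf[mid][x];    /* window line 2                  */
            int p_new = read_pixel(y, x);    /* window line 3, streamed in     */

            process_window(p_top, p_mid, p_new, x);

            line_buf[top][x] = p_new;        /* overwrite the oldest line in place */
        }
        /* swap read roles: the old "line 2" becomes "line 1" for the next row */
        int t = top; top = mid; mid = t;
    }
}
……………………………………………………………………………………………………………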
For the complete demosaicing engine implementation, it is much more
efficient if the two stages can be executed in parallel rather than in series.
Figure 4.4 illustrates the parallel architecture from the view of shifting
windows. Since the median filter requires outputs from bilinear demosaicing
as its input, the bilinear demosaicing shifting window needs to be processed
prior to the median filter window. The optimised parallel demosaicing engine
is based on a line-by-line scanning pattern. With the same 3x3 shifting
window, the first two lines of pixels are fed into the demosaicing engine and
are then stored in data buffers. When the third line begins, pixels belonging
to it are directly read out from the source image and fed into the bilinear
demosaicing shifting window, and the pixel in the centre of the window is
interpolated. As shown in Figure 4.4, the centre pixel in the red broken block
is interpolated to a full RGB colour set by bilinear demosaicing.

Figure 4.5 Freeman Demosaicing Execution Flowchart

The interpolated pixels are then stored in three separate data buffers (R/G/B) as
the intermediate storage for the median filter. In order to build the 3x3 median filter
shifting window, two lines of bilinear interpolated pixels are required as a
precondition, which means that a total of four lines of pixels need to be fed
into the engine before median filtering starts, and six intermediate data buffers
are needed to store the different colour components belonging to different lines. When the
fifth line starts, the median filter window fetches interpolated pixels from the
intermediate data buffers for its first two lines, while interpolated pixels
belonging to the third line come directly from the bilinear demosaicing
window, as illustrated in Figure 4.4. The output of the bilinear demosaicing
window (the centre pixel of the red broken block) becomes the lower-right
corner pixel of the median filter window (the blue broken block). In this case,
the two shifting windows can slide line by line simultaneously, constructing a
demosaicing engine with a parallel architecture and thereby increasing the
processing speed. A detailed processing flowchart is illustrated
in Figure 4.5.
4.4. System Analysis and Dual-Core Implementation
4.4.1. System Analysis
From an implementation point of view, bilinear demosaicing is a relatively simple
module. The shifting window is established with a set of registers. As
mentioned in Section 4.2, the estimation formulas differ according to the
pixel position in the Bayer CFA. In order to eliminate conditional
branches in the code, a set of multiplexers is employed to generate the
conditional outputs of each pixel's estimation.
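As an illustration of this branch-elimination style, the C fragment below computes both candidate estimates and chooses between them with a data-dependent selection rather than an if/else; whether this is actually mapped onto a MUX cell depends on the RICA compiler, and the names used here are placeholders.

……………………………………………………………………………………………………………
/* Branch-free conditional selection: both candidate results are computed and
 * one is selected, which the tool flow can map onto a multiplexer rather than
 * a branch that would break the kernel. */
static inline int mux(int sel, int a, int b)
{
    return sel ? a : b;
    /* an explicitly branch-free alternative (two's complement):
     * return (a & -(sel != 0)) | (b & ~(-(sel != 0))); */
}

int estimate_red(int is_blue_site, int est_at_blue, int est_at_green)
{
    /* both estimates are evaluated; the condition only drives the selection */
    return mux(is_blue_site, est_at_blue, est_at_green);
}
……………………………………………………………………………………………………………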
The median filter has been analysed in Section 4.2. Typically, by far the
majority of the computational effort and time is spent on calculating the
median of each window. When the image is large, the efficiency of median
filter becomes a critical factor in determining the algorithm’s speed as the
filter window must process every entry in the input signal [66]. The key
module in the median filter is the sorting algorithm that finds the median value
among a number of pixels. Traditional sorting approaches such as the selection
algorithm and histogram medians [66] are time and energy consuming as
they both involve massive iterations and data swaps. In this Freeman
demosaicing engine, a pseudo median filter [69], [70] is utilised as it
significantly shortens the processing time while maintaining a good PSNR
for the interpolated image. For a 3x3 filter window, the pseudo median filter
calculates the median value of each column in the window, and then takes the
median of the three intermediate medians as the final
output. As the median value of three pixels can be calculated with very little
effort, this pseudo median filter is well suited to embedded system
applications, since the iterations and swaps in traditional sorting algorithms are
eliminated. Meanwhile, the performance of the pseudo median filter is very
similar to that of the true median filter. A comparison is given in [70] which
demonstrates that the pseudo median filter provides a normalised MSE of
0.0521 when filtering a noisy girl image, compared with the original
image, while the true median filter delivers a normalised MSE of 0.0469 (the
un-filtered noisy image has a normalised MSE of 2.9494 compared with the
original image).

Figure 4.6 (a) Pseudo Median Filter (b) Median Filter Reuse
Figure 4.6 (a) illustrates the pseudo median filter computation. The median
value of each row is calculated first, as marked with the vertical broken
blocks. The corresponding median values are then recorded as M1, M2 and
M3, which are sorted again to calculate their median as the final
output. Only MAX and MIN operations are needed for seeking the median value
among three samples, which can be implemented via comparators and
multiplexers in order to construct the potential kernel for the iteration. The
pseudo code is given as follows, in which a total of 3 comparators and 5
multiplexers are needed to process one median window.
……………………………………………………………………………………………………………
Seeking the median value among three samples (a, b, c) using comparators and multiplexers:

asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = MUX_COND_NEZ_SI"
              : "=r" (out1) : "r" (a), "r" (b), "r" (a < b));
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = MUX_COND_NEZ_SI"
              : "=r" (out2) : "r" (b), "r" (a), "r" (a < b));
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = MUX_COND_NEZ_SI"
              : "=r" (out1) : "r" (out1), "r" (c), "r" (out1 < c));   // out1 contains the minimum value of the three samples
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = MUX_COND_NEZ_SI"
              : "=r" (out3) : "r" (c), "r" (out1), "r" (out1 < c));
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = MUX_COND_NEZ_SI"
              : "=r" (out) : "r" (out2), "r" (out3), "r" (out2 < out3));   // out contains the final median value
……………………………………………………………………………………………………………
Given the shifting window moving from left to right, it is possible to reuse the
intermediate results of the median filter. As illustrated in Figure 4.6 (b),
initially the shifting window contains pixels from P11 to P33, and the
intermediate values are M1~M3, which lead to output 1. When the shifting
window moves one pixel horizontally to the right, its contents are updated
from P12 to P34 and M4 becomes the latest intermediate median. In this case, both the
intermediates M2 and M3 can be reused to calculate output 2, and only the
new intermediate value M4 needs to be calculated.
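A compact C sketch of the pseudo median filter with this reuse is given below; med3() mirrors the comparator/multiplexer selection shown earlier, while the row pointers, loop bounds and output indexing are illustrative assumptions rather than the RICA kernel itself.

……………………………………………………………………………………………………………
/* median of three values via min/max selections only */
static inline int med3(int a, int b, int c)
{
    int lo = (a < b) ? a : b;                        /* min(a, b) */
    int hi = (a < b) ? b : a;                        /* max(a, b) */
    return (lo < c) ? ((hi < c) ? hi : c) : lo;      /* median    */
}

/* pseudo median over a 3-row strip: only the newest column median is
 * recomputed per output, the other two are reused */
void pseudo_median_row(const int *row0, const int *row1, const int *row2,
                       int *out, int width)
{
    int m0 = med3(row0[0], row1[0], row2[0]);
    int m1 = med3(row0[1], row1[1], row2[1]);
    for (int x = 2; x < width; x++) {
        int m2 = med3(row0[x], row1[x], row2[x]);    /* new column median        */
        out[x - 1] = med3(m0, m1, m2);               /* median of column medians */
        m0 = m1;                                     /* reuse intermediates      */
        m1 = m2;
    }
}
……………………………………………………………………………………………………………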
4.4.2. Dual-Core Implementation
When analysing the Freeman demosaicing algorithm, it is seen
that even lines and odd lines of the image are processed with different
equations at the bilinear interpolation stage. In this case, it is possible to have
the demosaicing engine running on two processor cores corresponding to
even/odd lines separately. The two cores work in parallel and hence further
throughput improvement is likely to be introduced. In order to realise a RICA
based multi-core implementation, first the overall application needs to be
partitioned into several tasks. Each task is then compiled and simulated
separately with a single RICA core and then the execution trace file for each
task is generated, which contains the static timing and information for
communication instructions. After that, tasks are mapped onto the RICA
based multi-core architecture according to a certain mapping methodology
[71]. Generally, this mapping methodology analyses both the static timing
and dynamic timing during simulation.

Figure 4.7 Mapping Methodology for MRPSIM

Static timing represents the time
consumed by the combinatorial critical path in each step, which is not
affected by the run-time execution. Dynamic timing refers to the time taken
by communication instructions such as memory writes/reads, which can only be
determined at run-time in the multiprocessor simulation due to multiple
memory accesses [71].
Once tasks are mapped, the Multiple Reconfigurable Processor Simulator
(MRPSIM) presented in [71] analyses the execution traces and obtains the
dynamic delays. Only communication instructions which will contribute to the
dynamic timing are modelled by MRPSIM. Other inputs for MRPSIM include
the RAM files and an Architecture Description File (ADF). After performing
simulations with MRPSIM, the generated results are used as feedback to
change the design strategies such as task partitioning and architecture
customisations in order to achieve better performance [71]. The complete
execution flow graph is illustrated in Figure 4.7.
Figure 4.8 illustrates the Freeman demosaicing engine based on a dual-core
architecture. As discussed previously, the complete demosaicing engine is
partitioned into two tasks, one for odd lines and the other for even lines.

Figure 4.8 Dual-Core Freeman Demosaicing Engine Architecture

The
two tasks are then mapped onto the RICA based dual-core architecture. In
order to reduce the implementation complexity, intermediate outputs from
bilinear demosaicing are stored in shared data buffers instead of being
immediately ready for median filter, which is different from the single-core
implementation. As the two processor cores need to share data during
processing, a scheme termed improved spinlock [72] is utilised to make sure
accesses to the shared memory are synchronised without any conflict
between the two cores. A spinlock is a lock (synchronisation variable) which
a task repeatedly checks in a busy-waiting scheme, and it allows only
one task to access the shared resource protected by the lock at any given
time. This method has been improved in [72], which allows the requesting
processor core to go to sleep when the lock is unavailable instead of continuously
checking; the core is then woken up through an inter-processor interrupt
when the lock is released. Therefore, memory access conflicts due to the
busy-waiting based spinlock are eliminated [72].
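The fragment below is only a conceptual sketch of this improved spinlock idea, written with C11 atomics; core_sleep() and core_wake() stand in for the platform-specific sleep and inter-processor interrupt primitives described in [72] and are not real RICA API calls, and the race between a release and a waiter going to sleep is ignored here.

……………………………………………………………………………………………………………
#include <stdatomic.h>

extern void core_sleep(void);          /* hypothetical: suspend this core        */
extern void core_wake(int core_id);    /* hypothetical: inter-processor interrupt */

typedef struct { atomic_int taken; } spinlock_t;

static void lock_acquire(spinlock_t *l)
{
    while (atomic_exchange(&l->taken, 1) != 0) {
        core_sleep();                  /* sleep instead of busy-waiting on the lock */
    }
}

static void lock_release(spinlock_t *l, int other_core)
{
    atomic_store(&l->taken, 0);
    core_wake(other_core);             /* wake the waiting core */
}
……………………………………………………………………………………………………………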
In the proposed demosaicing design, improved spinlocks are utilised when
the processor cores load data from the original Bayer source image,
request primitive interpolated data from the intermediate shared memory,
and write the final interpolated data to the output file. Figure 4.9 provides
pseudo code for the dual-core implementation. The two tasks mapped onto
the two cores are executed in parallel. The locks control access to the shared
memory as well as to the source and output files. When an iteration finishes, all the
locks for processors 1 and 2 are re-initialised in order to avoid unexpected
lock states.
for (j = 0; j < number of lines - 1; j += 2)
{
    Initialise the locks for both processor 1 and 2;
    processor 1 lock = 0;    // available
    processor 2 lock = 1;    // unavailable

    for (i = 0; i < number of pixels per line; i++)
    {
        // Processor 1 (runs in parallel with Processor 2)
        check&set lock;
        source Bayer image accessing;
        release lock for processor 2;
        bilinear demosaicing;
        check&set lock;
        shared memory accessing;
        release lock for processor 2;
        median filtering;
        check&set lock;
        final output file accessing;
        release lock for processor 2;

        // Processor 2 (runs in parallel with Processor 1)
        check&set lock;
        source Bayer image accessing;
        release lock for processor 1;
        bilinear demosaicing;
        check&set lock;
        shared memory accessing;
        release lock for processor 1;
        median filtering;
        check&set lock;
        final output file accessing;
        release lock for processor 1;
    }
    re-initialise all locks for processor 1 and 2;
}

Figure 4.9 Pseudo Code for Dual-Core Implementation
4.5. Optimisation
As presented in previous chapters, a salient characteristic of RICA paradigm
is its ability to be customised at the design stage according to application
requirements. For multi-core applications, each core can be tailored
differently to build an adaptive heterogeneous multi-core platform. Table 4.1
shows the numbers of ICs for both customised single-core and dual-core
architectures. The IC numbers are given by the simulator proposed in [71].
This tailorable nature of RICA paradigm eliminates the redundant ICs in
different applications, and hence the energy and area consumption is
decreased.
As proposed in Chapter 3, RICA paradigm supports development by using
high level languages such as C in a manner very similar to conventional
microprocessors and DSPs.
Table 4.1 Instruction Cells Occupied by Freeman Demosaicing Engine

(a) Single-Core Architecture

Cell   | Proc. | Cell             | Proc.
ADD    | 20    | SOURCE           | 1
LOGIC  | 1     | SINK             | 3
SHIFT  | 5     | WMEM             | 4
MUX    | 29    | RMEM             | 4
COMP   | 23    | JUMP             | 1
REG    | 177   | RRC              | 1
SBUF   | 12    | Total Area (um2) | 160033

(b) Dual-Core Architecture

Cell   | Proc. 1 | Proc. 2 | Cell             | Proc. 1 | Proc. 2
ADD    | 15      | 9       | SOURCE           | 1       | 1
LOGIC  | 2       | 2       | SINK             | 3       | 3
SHIFT  | 5       | 5       | WMEM             | 4       | 4
MUX    | 29      | 27      | RMEM             | 4       | 4
COMP   | 23      | 20      | JUMP             | 1       | 1
REG    | 135     | 112     | RRC              | 1       | 1
SBUF   | 12      | 12      | Total Area (um2) | 157184  | 125226
The ANSI-C programs can be compiled into a sequence of assembly configuration instruction sets, termed steps. The
content of each step is executed concurrently by RICA according to the
availability of hardware resources and data dependence, and kernels are
desirable for applications with a large number of iterations, especially when
processing images. For the single-core implementation, bilinear demosaicing
and median filter are integrated into one kernel; while these two modules are
put into two small kernels for each core when targeting dual-core
implementation to reduce the computational complexity. In order to eliminate
conditional branches in the application which will break kernels into separate
steps, the most common technique employed in RICA based architectures is
using multiplexers to realise conditional selections instead of branches. As
there may be several hundred pixels per line in the image, the kernel may
loop many times. In this case the software pipelining technique can be utilised in
order to shorten the kernel critical path. Basically the technique creates
additional fill and flush steps which occupy a number of registers to hold and
release the intermediate values between different pipeline stages, as illustrated in
Figure 4.10. For the proposed single-core demosaicing engine, the critical
path of the kernel before pipelining is 44.86ns, while after pipelining this path
is shortened to 6.86ns.
Figure 4.10 Illustration of Pipeline Architecture for Kernels
4.6. Performance Analysis and Comparison
The Freeman demosaicing application is implemented on both single-core
and dual-core RICA architectures in ANSI-C, which is then compiled,
scheduled and simulated by the tool flow associated with RICA. The tool flow
is based on 65nm technology. The code is optimised both by hand and by the
compiler. The simulator [14] provides performance results such as simulation
time, number of steps, required computational resources and so on. The
simulator is based on an accurate model of RICA paradigm which takes IC
configuration and interconnections into account.
Figure 4.11 shows a real image processed by the proposed Freeman
demosaicing engine, and a PSNR1 of 26.4dB is obtained.

Figure 4.11 A Demosaiced 648x432 Image

Table 4.2 lists performance evaluations of the proposed Freeman demosaicing
engine at different optimisation stages. It is seen that the final single-core
implementation with the kernel and pipeline techniques achieves up to an 80.1%
reduction in kernel critical path length and a 4.92x speedup in throughput
compared with the original implementation. When mapped onto the dual-core
architecture, the throughput reaches up to 241.6 Mpixels/s, which corresponds
to a 1.72x speedup compared with the single-core engine. The throughput here
is defined as the number of pixels in the image divided by the processing time.
Table 4.2 Freeman Demosaicing Performance Evaluations and Comparisons

Evaluations (for a 648x432 image)
            | Execution Time (ms) | Average Throughput (Mpixels/s) | Kernel Critical Path (ns)
Original    | 9.78                | 28.62                          | 34.54
Single-core | 1.99                | 140.7                          | 6.86
Dual-core   | 1.16                | 241.6                          | 7.02

Comparisons
Applications        | Throughput (Mpixels/s) | Average Frequency (MHz)
Bilinear            | 142                    | 151
Single-core Freeman | 140.7                  | 145.8
Dual-core Freeman   | 241.6                  | 142.4
Hamilton-Adam       | 127                    | 144
FPGA bilinear [35]  | 150                    | 150
CRISP bilinear [54] | 345                    | 115
Performance comparisons in Table 4.2 are made between the proposed Freeman demosaicing engine and a Hamilton-Adam demosaicing engine on RICA based architecture, a Virtex 4 FPGA based bilinear demosaicing
engine [34] and a bilinear demosaicing engine implemented on CRISP [53].
Since bilinear demosaicing is the first stage in the Freeman method, it is split
and included in the comparisons. For RICA based engines, their average
frequencies are defined as their kernels' iteration frequencies and are
calculated from their kernels' critical path length. It is seen that the proposed
Freeman demosaicing engine demonstrates good throughput with both
single-core and dual-core architectures. Due to the efficient pipeline
architecture, the Freeman engine keeps almost the same throughput as the
bilinear engine even with the extra burden of the computationally intensive median
filter. The highly optimised kernel in the Freeman engine shows an iteration
frequency comparable with that of the referenced FPGA based bilinear engine, and
higher than the maximum frequency that CRISP can achieve. It should be
noticed that CRISP has dedicated 2-D load memory RSPE and colour
interpolation RSPE, which means that multiple demosaicing windows can be
executed simultaneously. This is the reason why CRISP demonstrates
much higher throughput even with a lower working frequency. If there were only
a single demosaicing window running, the proposed Freeman engine
would deliver better performance. Moreover, as the RSPEs in CRISP are
dedicated hardware, CRISP is actually an ASIC-like coarse-grained
reconfigurable architecture. In this case, its flexibility is restricted to some
extent. In contrast, the nature of RICA paradigm enables RICA based
architectures to be flexibly reconfigured and tailored to adapt to different
applications.
4.7. Future Improvement
A feature associated with RICA paradigm termed Vector Operation (VO) can
be utilised to further improve the demosaicing engine performance. A vector
is a 32-bit operand which is constructed from two 16-bit operands. With the SIMD
technique, these two 16-bit operands can be calculated in parallel via a single
vector operation. For applications which have parallel architectures, VO can
be employed with the expectation of a significant reduction in computational
resources, as the number of calculations is cut to nearly a half.
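In plain C, the vector idea can be sketched as packing two 16-bit operands into one 32-bit word and operating on both lanes at once; the helper names below are illustrative, and the real VO is performed by RICA instruction cells rather than by this bit manipulation.

……………………………………………………………………………………………………………
#include <stdint.h>

/* pack two 16-bit pixels into one 32-bit operand */
static inline uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

static inline uint16_t unpack_hi(uint32_t v) { return (uint16_t)(v >> 16); }
static inline uint16_t unpack_lo(uint32_t v) { return (uint16_t)(v & 0xFFFFu); }

/* lane-wise add of two packed pairs; masking keeps a carry out of the low
 * lane from spilling into the high lane */
static inline uint32_t add16x2(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}
……………………………………………………………………………………………………………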
In the proposed Freeman demosaicing engine, it is possible to utilise VO to
improve the median filter’s efficiency. The shifting window in median filter can
be extended to 4x3, and each two pixels within the window can be composed
to construct a vector, as illustrated in Figure 4.12. In total, six vectors can be
built initially, and the four corresponding intermediate outputs can be
obtained through two median value seeking operations, in the form of two
vectors (M1&M2, M3&M4). After that, these two vectors are decomposed and
duplicated in order to construct new vectors (vectors 7~9), and the final two
outputs can be obtained via another median value seeking operation. In this
way, the total number of seeking operations required for obtaining two
outputs is reduced from 6 to 3, which obviously simplifies the overall
calculation complexity.

Figure 4.12 Potential Vector Operations in Median Filter
VO usually comes with the drawback of additional logic and shifting resources
required for constructing and decomposing vectors, which can negate the
benefits by prolonging the critical path and increasing the area consumption.
When applied to the proposed demosaicing engine, VO requires two
pixels to be fed in at a time instead of one, which requires a more complex
control scheme and makes it more difficult to build kernels. However, VO is still a
promising approach for further optimisation of the proposed demosaicing
engine.
4.8. Conclusion
In this chapter, a Freeman demosaicing engine on RICA based architecture
has been presented. With the detailed Freeman algorithm being introduced in
Section 4.2, Section 4.3 presents the demosaicing engine implementation.
The 2-D shifting window consisting of registers as discussed in the previous
chapter is utilised to construct the demosaicing and median filter windows.
An efficient data buffer rotating scheme is designed with the aim of reducing
memory accesses. Another novel outcome is that a parallel architecture is
developed which ensures bilinear demosaicing and median filter can be
executed simultaneously. This parallel architecture successfully reduces the
required intermediate data storage and improves the overall efficiency.
In Section 4.4, the proposed demosaicing architecture is further analysed. A
pseudo median filter is utilised which enables the demosaicing engine to
demonstrate both good performance and low computational complexity. Only
a few multiplexers and comparators are utilised in order to realise the pseudo
median value calculation. Given the demosaicing window shifting from left to
right, the intermediate results of the pseudo median filter are reused. Only two
median value seeking calculations are required for every new pixel.
Based on the algorithm analysis, Section 4.4.2 presents dual-core
architecture developed for the proposed Freeman demosaicing engine. Two
RICA cores are occupied for processing even and odd lines in the image
respectively. An improved spin-lock method is utilised in order to ensure the
shared data between the two cores is accessed correctly without any conflict.
Pseudo code is given to illustrate how the dual-core architecture works.
The optimisation presented in Section 4.5 mainly focuses on customising the
proposed RICA based architecture and constructing kernels with efficient
pipeline scheme. Performance evaluation demonstrates that the proposed
Freeman demosaicing engine offers a good PSNR to a real photographic
image. The throughput is improved step by step by utilising different
optimisation approaches. Comparisons between the proposed engine and
other
demosaicing
applications
show
that
the
proposed
Freeman
demosaicing engine provides good throughput with an efficient architecture.
The VO with SIMD technique is discussed in Section 4.7 as a possible future
improvement to the demosaicing engine. With VO, multiple data can be
processed through a single calculation. In the following chapter, VO is utilised
for the 2-D DWT module in JPEG2000 and its advantages and disadvantages are
discussed in detail.
1. $\mathrm{MSE} = \frac{1}{mn}\sum_{m}\sum_{n}\left[\mathrm{Reconstructed\ Image}(m,n) - \mathrm{Original\ Image}(m,n)\right]^{2}$

   $\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right)$

   MAX is the maximum possible pixel value of the image. If the pixel has 8 bits, MAX is 255.
Chapter 5
2-D DWT Engine on RICA Based
Architecture
5.1. Introduction
This chapter presents a reconfigurable lifting-based 2-Dimensional Discrete
Wavelet Transform engine for JPEG2000 [73]. The proposed engine is
implemented on RICA based architecture and can be dynamically
reconfigured to support both 5/3 and 9/7 transform modes for lossless and
lossy compression schemes in JPEG2000 standard. VO with SIMD
technique is utilised in the proposed 2-D DWT engine and the advantages
and disadvantages brought by VO are well discussed. Simulation results [73]
demonstrate that the proposed 2-D DWT engine provides high throughput
that reaches up to 103.1 Frames per Second (FPS) for a 1024x1024 image,
which shows its advantage compared with a number of FPGA and DSP
based implementations.
5.2. Lifting-Based 2-D DWT Architecture in JPEG2000
Standard
DWT is the first decorrelation step in JPEG2000 standard to decompose the
input image into different subbands in order to obtain both approximation and
detailed information. The algorithm of both 1-D and 2-D DWT has already
been introduced in Chapter 2 and Appendix. Figure 5.1 illustrates both the
convolutional DWT architecture in (a) and the lifting-based DWT
architectures in (b) and (c).

Figure 5.1 (a) Convolutional DWT Architecture (b) 5/3 Lifting-based DWT Architecture (c) 9/7 Lifting-based DWT Architecture
(5/3 DWT: alpha = -1/2, beta = 1/4; 9/7 DWT: alpha = -1.5861342, beta = -0.052980118, gamma = 0.882911076, delta = 0.443506852, K = 1.230174105)

Figure 5.2 Generic Lifting-Based DWT Architecture for Both 5/3 and 9/7 modes

As it is known, the lifting-based scheme is
selected in JPEG2000 standard as the default DWT architecture. In Figure
5.1(b) and (c), the polyphase matrices are divided into units termed
"Predict" and "Update", which correspond to different combinations of
additions and multiplications. The parameters in these two units are determined
by the DWT polyphase matrices introduced in Chapter 2. Since there is only one
extra pair of Predict and Update units in the 9/7 architecture compared
with the 5/3 scheme, these two architectures can be combined and
simplified into a generic architecture which supports both schemes, as illustrated
in Figure 5.2. The lifting-based 2-D DWT architecture is illustrated in Figure
5.3, which can be viewed as the extension and duplication of the 1-D
architecture. Actually the vertical transformation can also be executed by a
single 1-D DWT engine.
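As an illustration of the Predict/Update structure, a minimal C sketch of one level of the reversible 5/3 lifting transform on a 1-D signal is given below. It assumes an even-length signal, uses crude edge handling instead of the symmetric extension defined by the standard, and is not the RICA engine itself.

……………………………………………………………………………………………………………
/* x[] has 2*half_len samples; the low-pass results go to s[] and the
 * high-pass results to d[]. */
void dwt53_forward_1d(const int *x, int *s, int *d, int half_len)
{
    /* Predict (alpha = -1/2): d[n] = x_odd[n] - floor((x_even[n] + x_even[n+1]) / 2) */
    for (int n = 0; n < half_len; n++) {
        int right = (n + 1 < half_len) ? x[2 * (n + 1)] : x[2 * n];  /* crude edge handling */
        d[n] = x[2 * n + 1] - ((x[2 * n] + right) >> 1);
    }
    /* Update (beta = 1/4): s[n] = x_even[n] + floor((d[n-1] + d[n] + 2) / 4) */
    for (int n = 0; n < half_len; n++) {
        int left = (n > 0) ? d[n - 1] : d[n];                        /* crude edge handling */
        s[n] = x[2 * n] + ((left + d[n] + 2) >> 2);
    }
}
……………………………………………………………………………………………………………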
Figure 5.3 Lifting-Based 2-D DWT Architecture
5.3. Lifting-Based DWT Engine on RICA Based Architecture
5.3.1. 1-D DWT Engine Implementation
Figure 5.4 illustrates the detailed generic architecture of the 1-D DWT engine
implemented on RICA based architecture. When considering the parameters,
Predict and Update units in 5/3 DWT architecture can be easily realised by
shifting and addition operations, while the floating-point parameters in 9/7
mode require complex computations. In this work, Hardwired Floating
Coefficient Multipliers (FCMs) are utilised to convert floating-point
computations to fixed-point calculations. The floating-point parameters are
represented in their Canonical Signed Digit (CSD) form [74], and hence
floating-point multiplications are replaced by a number of shifts and additions.
Table 5.1 illustrates the CSD forms of the floating-point parameters in this engine.
Since the number of bits used for the CSD representation can be truncated,
Figure 5.5 gives comparisons between different numbers of CSD bits and the PSNR of
the reconstructed image (256x256 Lena). It is obvious that more CSD bits
provide higher PSNR, which means better reconstructed image quality.
However, having more CSD bits also means higher computational complexity
as more additions and shifts need to be executed. Since RICA paradigm has a
high level of ILP, the 12-bit CSD representation is selected in this thesis.
These CSD based FCMs can be dynamically reconfigured and bypassed
when adopting 5/3 architecture.
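As an illustration, the C fragment below multiplies a sample by the magnitude of alpha using its 12-bit CSD form from Table 5.1, i.e. with shifts and additions only; the Q12 fixed-point format and the function name are assumptions made for this sketch, and rounding details differ in the actual FCMs.

……………………………………………………………………………………………………………
/* |alpha| ~= 2 - 2^-1 + 2^-3 - 2^-5 - 2^-7 + 2^-12 (Table 5.1)
 * Returns |alpha| * x with 12 fractional bits; assumes the sample magnitude
 * is small enough that the left shifts do not overflow. */
static inline int csd_mul_alpha_q12(int x)
{
    int x_q12 = x << 12;          /* promote the sample to Q12 */
    return (x_q12 << 1)           /* +2      */
         - (x_q12 >> 1)           /* -2^-1   */
         + (x_q12 >> 3)           /* +2^-3   */
         - (x_q12 >> 5)           /* -2^-5   */
         - (x_q12 >> 7)           /* -2^-7   */
         + (x_q12 >> 12);         /* +2^-12  */
}
……………………………………………………………………………………………………………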
Figure 5.4 Detailed Generic Architecture of 1-D DWT Engine on RICA
Table 5.1 CSD Forms of Floating-Point Parameters

Parameter | Value       | CSD representation                              | Approximate Value
α         | 1.5861342   | 2 - 2^{-1} + 2^{-3} - 2^{-5} - 2^{-7} + 2^{-12} | 1.5861816
β         | 0.052980118 | 2^{-4} - 2^{-7} - 2^{-9} + 2^{-12}              | 0.0529785
γ         | 0.882911076 | 2^{0} - 2^{-3} + 2^{-7}                         | 0.8828125
δ         | 0.443506852 | 2^{-1} - 2^{-4} + 2^{-7} - 2^{-9} + 2^{-12}     | 0.4436035
k         | 1.230174105 | 2^{0} + 2^{-2} - 2^{-6} - 2^{-8} - 2^{-12}      | 1.2302246
1/k       | 0.812893066 | 2^{0} - 2^{-3} - 2^{-4} + 2^{-11}               | 0.8129883
Figure 5.5 Reconstructed Image Quality with Different CSD Bits
When processing colour images, the proposed DWT engine can be easily
extended to be adaptive for transforming multiple colour components
simultaneously (RGB or YUV/YCrCb) due to the high parallelism of RICA
based architecture and its customisable nature.

Figure 5.6 Streamed Data Buffers in DWT Engine

Figure 5.6 illustrates the
potential architecture of RICA based DWT engine for processing multi-colour
components. For each colour component, even and odd input data symbols
are split and stored as intermediates. In total, three separate DWT
engines are occupied in the multi-component DWT architecture, all of which can
be accessed and executed in parallel. As the multiple colour transform can
simply be realised by duplicating the DWT engine, the discussion in this work
mainly focuses on the single-component transform engine.
5.3.2. 2-D DWT Engine Implementation
As discussed previously, the 2-D DWT architecture is an extension of the 1-D
scheme, involving two stages: horizontal transformation and vertical
transformation. Figure 5.7 illustrates the detailed 2-D DWT decomposition
procedure with an example of an 8x8 image with 3-level 2-D DWT. It is
clearly seen how the original image is decomposed by horizontal and
vertical transformations. After the third-level transformation, the image is
decomposed into four individual 1x1 pixels belonging to different subbands.
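The multi-level structure can be sketched as a simple loop in C: each level applies a 1-D transform (for example the dwt53_forward_1d sketch given earlier) to the rows and then the columns of the current LL region, whose extent is halved afterwards. transform_rows() and transform_cols() are hypothetical helpers, not part of the RICA tool flow.

……………………………………………………………………………………………………………
extern void transform_rows(int *image, int stride, int extent);  /* 1-D DWT on each row    */
extern void transform_cols(int *image, int stride, int extent);  /* 1-D DWT on each column */

void dwt2d_multilevel(int *image, int size, int levels)
{
    int extent = size;                        /* current LL region is extent x extent */
    for (int lvl = 0; lvl < levels; lvl++) {
        transform_rows(image, size, extent);  /* horizontal pass -> L | H             */
        transform_cols(image, size, extent);  /* vertical pass   -> LL, LH, HL, HH    */
        extent /= 2;                          /* recurse into the LL subband          */
    }
}
……………………………………………………………………………………………………………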
The method for implementing the 2-D DWT engine on RICA based architecture
has some similarity to the processing pattern proposed in [75]. The
proposed 2-D DWT engine performance is enhanced by the VO with SIMD
technique discussed in Chapter 4, which is utilised to transform pixels
belonging to adjacent lines in parallel.

Figure 5.7 Detailed 3-Level 2-D DWT Decomposition

Instead of transforming pixels line by line, four pixels (P00, P01, P10, P11)
are processed at a time, as illustrated in Figure 5.8. These four pixels are
divided into two pairs, each of which has
two pixels belonging to adjacent lines. Each pixel pair is combined to
construct a vector, as shown in the red broken blocks. These two vectors can
be transformed by a horizontal DWT engine illustrated in Figure 5.4, and the
four outputs (P0L, P1L and P0H, P1H) are generated in parallel in the form of two
vectors by a single transformation, as highlighted by the red broken blocks in
step 1. As discussed previously, vectors can be decomposed into separate
operands and then reconstructed to build new vectors. In this work, after the
first step (horizontal transformation), the two intermediate vectors are
decomposed and the four 1-D transformed coefficients are recombined to
construct two new vectors, as highlighted in the blue broken blocks. With
another vertical transform, the four final outputs (PLL, PLH, PHL and PHH) are
obtained simultaneously.

Figure 5.8 Parallel Pixel Transformation with VO and SIMD Technique
It is obvious that the engine with the VO technique can offer higher processing
speed compared with the original engine, which transforms the image line by
line, since it can process two lines/columns at a time. Meanwhile, compared
with the case in which two lines/columns are transformed concurrently by two
parallel single operations, the computational resources required by the VO
based engine are significantly reduced. Moreover, the configuration latency is
reduced by the VO technique due to the decrease in required computational ICs,
which leads to a further reduction in execution time. However, there is a trade-off,
as additional LOGIC and SHIFT cells are required in order to combine and
decompose separate operands and vectors. Such an increase in IC
occupation leads to higher area consumption. These additional LOGIC
and SHIFT operations also increase the execution time of steps involving
VO, resulting in decreased throughput. Detailed performance comparisons
between engines with VO technique and regular engines will be given in
Section 5.4.
5.3.3. 2-D DWT Engine Optimisation
Optimisation of the proposed 2-D DWT engine still focuses on constructing
kernels. In this work, both the horizontal and vertical transforming engines
are integrated in corresponding kernels. Figure 5.9 illustrates kernels and
their moving patterns in the proposed 2-D DWT engine. The transforming
kernels scan the original image with a raster order, and generate the four
outputs belonging to four subbands simultaneously.
In order to construct DWT kernels, the original RICA architecture needs to be
tailored to satisfy the computational resource requirement of the DWT engine.
Table 5.2 illustrates numbers of cells in different 2-D DWT engines on
customised RICA based architecture, including original DWT engines
(transform an image line by line), DWT engines with two single parallel
operations (transform two lines/columns at a time) and DWT engines with VO
technique. All these engines are implemented within kernels and the
numbers of registers are calculated after pipelining the kernels.
Figure 5.9 Kernel in the 2-D DWT Engine on RICA Architecture
Table 5.2 Numbers of Cells in Different DWT Engines

(a) Number of Cells in DWT 5/3 Engines

Cell             | DWT53 (a) | DWT53 (b) | DWT53 (c)
ADD              | 10        | 14        | 10
LOGIC            | 0         | 0         | 4
SHIFT            | 4         | 4         | 8
MUX              | 1         | 1         | 1
COMP             | 2         | 2         | 2
WMEM             | 4         | 4         | 4
RMEM             | 4         | 4         | 4
REG              | 33        | 47        | 45
JUMP             | 1         | 1         | 1
Total Area (um2) | 37049     | 45523     | 49188

(b) Number of Cells in DWT 9/7 Engines

Cell             | DWT97 (a) | DWT97 (b) | DWT97 (c)
ADD              | 31        | 56        | 32
LOGIC            | 0         | 0         | 4
SHIFT            | 23        | 42        | 26
MUX              | 1         | 1         | 1
COMP             | 2         | 2         | 2
WMEM             | 4         | 4         | 4
RMEM             | 4         | 4         | 4
REG              | 93        | 155       | 146
JUMP             | 1         | 1         | 1
Total Area (um2) | 101269    | 161327    | 129581

(a): Original engines
(b): Engines with two parallel single operations
(c): Engines with VO technique
It is seen that the engines with two single parallel operations consume more computational
resource compared with original engines in aspects of ADD, SHIFT and REG
cells. On the other hand, engines with VO technique keep similar cell
occupation to original engines with slight increments in ADD, LOGIC and
SHIFT cells (the number of REG cells in VO based engines are also large
due to the requirement of establishing pipelines). When comparing the latter
two kinds of engines, the numbers of ADD cells are reduced significantly by
utilising VO technique, especially for the 9/7 mode. In contrast, the numbers
of LOGIC cells increase as more logic operations are required in order to
construct and decompose vectors. The numbers of SHIFT cells have
different trends between 5/3 mode and 9/7 mode when applying VO
technique. Since the computation involved in the 5/3 mode is quite simple
compared with that in the 9/7 mode, the trade-off of adopting the VO technique,
namely the increase in the required number of SHIFT cells, is obvious; on
the other hand, the massive computation in the 9/7 mode requires a large number
of SHIFT cells, and this requirement is reduced significantly by the parallel
operations in the VO based implementation. Therefore, the additional SHIFT
ICs required for constructing and decomposing vectors become negligible.
5.4. Performance Analysis and Comparisons
The proposed lifting-based 2-D DWT engine is implemented on customised
RICA based architecture using ANSI-C and simulated by the RICA tool flow
[6]. Figure 5.10 demonstrates the standard Lena test image transformed by
the proposed DWT engine for both 5/3 and 9/7 modes. The performance
analysis and comparisons mainly focus on aspects of both throughput and
computational cell occupation.
Figure 5.11 gives throughput comparisons for processing the standard
256x256 Lena test image under different 2-D DWT levels. Comparisons are
made between original DWT engines, DWT engines with two single parallel
operations and DWT engines with VO technique. It is seen that the original
DWT engines provide the lowest throughput at all DWT levels.

Figure 5.10 Standard Lena Image Transformed by the 2-D DWT Engine: (a) DWT 5/3 mode (b) DWT 9/7 mode

Figure 5.11 Throughput (fps) Comparisons (original, two-parallel-single-operation and vector-based 5/3 and 9/7 engines, for 1- to 4-level DWT)

In contrast,
DWT engines with two single parallel operations offer the highest throughput,
and DWT engines with VO technique demonstrate slightly lower throughput.
This is because the VO technique introduces additional operations in order to
construct and decompose vectors, and these operations normally increase
the pipeline depth by adding more filling and flushing steps while keeping a
similar critical path length, leading to an increase in execution time.
In contrast, when comparing the area consumption as illustrated in Figure 5.12, the
original DWT engines demonstrate the lowest area occupation in both 5/3
and 9/7 modes. When comparing the other two kinds of engines, the DWT
engine with VO technique shows a significant reduction in cell area consumption
compared with the engine with two single parallel operations in the 9/7 mode. As
discussed in Section 5.3.3, the parallel operations introduced by the VO technique
enable the engine to perform the same function with fewer computational cells.
On the other hand, in the 5/3 mode, the area consumption of the VO based engine is
actually slightly higher than that of the engine with two single parallel operations.
This is because the area overhead of the extra LOGIC and SHIFT cells
required by VO negates the benefit coming from reducing other
computational cells such as ADD and REG. As the total computational cell
numbers in the 5/3 DWT engine are quite small, this overhead is noticeable and
increases the total area. In contrast, this overhead is negligible compared to
the reduction of other cells in the 9/7 mode.
Figure 5.12 Area and Δ Comparisons: (a) Area (um2) Comparisons (b) Δ Comparisons

In order to measure the trade-off between throughput and area consumption,
a parameter Δ = throughput/area is defined to measure the efficiency of
different engines. A high Δ means that the corresponding engine has good
computational resource utilisation. It can be seen in Figure 5.12 that the
original DWT engines show the lowest Δs in both 5/3 and 9/7 modes, which
means that original DWT engines are actually not efficient. When comparing
the other two kinds of engines, the 5/3 DWT engine with two single parallel
operations has the highest Δs, and the 5/3 DWT engine with VO has
correspondingly lower Δs, which means that VO is not the best solution for the
5/3 mode as it brings too much overhead. In contrast, for the 9/7 mode, the
engine with VO offers higher Δs compared with the other two engines. In this
case, it can be concluded that the VO technique is more suitable for
complex applications in which the computational cell reduction will not be
negated by the overhead brought by extra LOGIC and SHIFT cells.
Figure 5.13 illustrates performance comparisons between RICA based DWT
engines and an FPGA based lifting 2-D DWT implementation in [76], a TI
C6416 DSP based DWT engine [77] and a StarCore DSP based DWT
implementation [78] which are all targeting JPEG2000 standard. The test
image size is set to 1024x1024, which is the same as in [76] and [77], and the
throughput of the proposed engines and [78] is scaled according to the image
size. Comparisons are made under different DWT modes and levels according
to the references.

Figure 5.13 Performance Comparisons (execution times of the RICA based engines and the referenced FPGA and DSP implementations)

Meanwhile, it is worth noticing that the FPGA
implementation [76] was based on an old Virtex 2 device with the working
frequency of 67MHz, so throughput improvement would be expected if the
implementation were migrated to newer FPGA platforms. However,
generally it is seen that DWT engines on RICA based architecture
demonstrate clear advantages in both 5/3 and 9/7 modes. In conclusion,
RICA based DWT engines benefit from the high parallelism nature of RICA
paradigm and efficiently pipelined kernels. Meanwhile, VO is proved to be a
promising technique for optimising complex applications, especially
area-sensitive applications.
5.5. Conclusion
In this chapter, a high efficiency reconfigurable lifting-based 2-D DWT engine
on customised RICA based architecture targeting JPEG2000 standard has
been proposed. In Section 5.2, the lifting-based DWT architectures are
introduced. Section 5.3 presents both 1-D and 2-D DWT implementations on
RICA based architecture. The proposed DWT engine can be reconfigured for
both 5/3 and 9/7 transforming modes in JPEG2000 standard. Hardwired
FCMs are utilised in the 9/7 mode for converting floating-point calculations to
fixed-point calculations instead of involving dedicated floating-point calculation units.
With the VO and SIMD technique introduced in the previous chapter, the 2-D
DWT engine can generate its final output coefficients belonging to different
subbands simultaneously. Optimisation still targets customising RICA based
architecture and constructing kernels. Different computational resources
occupied by both regular engines and engines with VO are discussed. For
the horizontal and vertical steps in 2-D DWT, separate kernels are
constructed respectively. In a single kernel, there are four pixels included,
corresponding to the four final transformed coefficients.
In Section 5.4, performance comparisons between original DWT engines,
DWT engines with two single parallel operations and DWT engines with VO
81
2-D DWT Engine on RICA Based Architecture
technique are discussed in aspects of both throughput and area occupation.
The original DWT engines demonstrate both the lowest throughput and the
lowest area occupation in both 5/3 and 9/7 modes. DWT engines with two
single parallel operations demonstrate the highest throughput, while
VO based DWT engines provide slightly lower throughput but
much lower area occupation in the 9/7 mode. A parameter Δ =
throughput/area is utilised to measure the efficiency of different engines. It is
concluded that VO technique is suitable for the 9/7 mode, which is more
computationally intensive compared with the 5/3 mode. Performance
comparisons are also made between the proposed 2-D DWT engine and
various implementations based on other architectures. The proposed DWT
engine demonstrates clear advantages in both 5/3 and 9/7 modes compared
with various FPGA and DSP based 2-D DWT solutions. These advantages
mainly come from the high parallelism nature of RICA paradigm, efficiently
pipelined kernels and VO technique.
Chapter 6
EBCOT on RICA Based Architecture
and ARM Core
6.1. Introduction
This chapter presents a JPEG2000 EBCOT encoder based on RICA based
dynamically reconfigurable architecture and an ARM core. The EBCOT Tier1 encoding scheme consists of two modules: Context Modelling and
Arithmetic Encoder. Based on algorithm evaluation, the four primitive coding
schemes in CM are efficiently implemented on RICA based architecture. A
novel Partial Parallel Architecture for CM is applied to improve the overall
system performance. Meanwhile, an ARM core is integrated in the proposed
architecture for implementing optimised AE efficiently. Simulation results
demonstrate that the resulting CM architecture can code 69.8 million symbol
bits per second, representing approximately a 1.37x speed-up compared with
the Pass Parallel CM architecture implemented on RICA paradigm based
architectures; while the ARM based AE implementation can process
approximately 31.25 million CX/D pairs per second. The EBCOT Tier-2
encoder and file formatting module can also be implemented on the ARM
core together with AE.
6.2. Context Modelling Algorithm Evaluation
The detailed CM algorithm has been discussed in Chapter 2. In JPEG2000
applications, EBCOT usually consumes most of the execution time (typically
more than 50%) in software-based implementations [79] and CM is
considered to be the most computationally intensive unit in EBCOT. Since
CM adopts the fractional bit-plane coding idea and codes DWT coefficients in
codeblocks by three separate coding passes at bit level, it is actually more
suitable for specialised hardware implementation than for general-purpose
hardware. Several methods have been proposed to accelerate the CM
process, which are detailed as follows.
• Sample Skipping (SS): This method is proposed in [80] as illustrated in
Figure 6.1. Through a parallel check performed by the encoder, if there
are n Need-to-Be-Coded (NBC) coefficient bits in a stripe column (1≤n≤4),
only n cycles are spent on coding these NBC bits, and 4-n cycles are
saved compared with the conventional method, which checks all bits one
by one. In the case that there is no NBC bit in the column, only
one cycle is spent on checking. Since most columns have fewer than four
NBC samples, this method can save cycle time [80] (a small illustrative
sketch of this idea is given after this list).
Figure 6.1 Sample Skipping Method for CM
Figure 6.2 Group of Column Skipping Method for CM
 Group-Of-Column Skipping (GOCS): This method is also presented in
[80] as illustrated in Figure 6.2. It skips a group of no-operation columns
together and can only be applied to passes 2 and 3. The
number of NBC bits in each group is checked and recorded while being
coded in pass 1, with a 1-bit tag for each group. When executing coding
passes 2 and 3, these tags are checked. If a tag is “0” then the
corresponding group is skipped, otherwise columns in the group are
checked one by one and coded with the SS method [80].
 Multiple Column Skipping (MCOLS): This method is proposed in [81] in
order to add more flexibility to GOCS. The tag indicator for each group is
extended to cover different states of the four columns. In this case the
coding engine can process each column individually and determine
whether a single or multiple columns can be skipped.
 Pass Parallel Context Modelling (PPCM): All the accelerating methods
discussed above aim to save checking cycles when processing a stripe
column. The PPCM method presented in [82] targets parallel processing,
that is, performing three coding passes simultaneously.
This method
adopts the column-based operation in [80] with four coefficient bits in a
column being processed at a time. To encode a sample, firstly the
encoder decides by which coding pass the current sample should be
coded. This sample is then coded by one of the four primitive coding
schemes according to the coding pass.
Figure 6.3 Pass Parallel Context Modeling
However, some issues occur due
to the causal relationship within the three coding passes in the parallel
processing mode and must be solved. First, samples belonging to pass 3
may become significant earlier than in the two prior coding passes since
three coding passes are executed concurrently. Second, if the current
sample belongs to pass 2 or 3, significant states of samples that have not
been visited in the coding window shall be predicted since these samples
may become significant in pass 1 [82]. In order to solve these problems,
the coding window for pass 3 in PPCM is delayed by one stripe column to
eliminate the reciprocal effect between pass 3 and the other two passes.
Figure 6.3 illustrates the PPCM architecture. The stripe causal mode [5,
83] is utilised to eliminate the dependence of coding operations on the
significance of samples in the next stripe. Moreover, two significant state
variables σ0[k] and σ1[k] are introduced to indicate whether the sample
becomes significant in pass 1 or pass 3 respectively. A detailed description
of the PPCM algorithm can be found in [82]. With this architecture, the
execution time of CM can be reduced by more than 25% compared with
SS and GOCS [82].
All of these accelerating methods presented above are originally
FPGA-targeted. SS, GOCS and MCOLS require either additional control
units/memory or modified memory arrangements. PPCM requires a large
amount of computational resources to make the three coding passes work
in parallel. Moreover, the power consumption of the FPGA paradigm makes it
inefficient to use FPGAs for embedded JPEG2000 solutions. Basically, an
ideal hardware architecture for CM should have high parallelism so that more
than one coefficient sample can be coded simultaneously. Dynamic
reconfigurability is also desirable, with which the architecture can be
reconfigured to adapt to different coding passes at different coding stages in
order to prevent unnecessary computational resource waste. Other features
such as power-saving and high integration are also essential for embedded
applications.
6.3. Efficient RICA Based Designs for Primitive Coding
Schemes in CM
Before determining the most suitable CM architecture for RICA based
applications, the discussion in this thesis focuses on how to efficiently
implement the four primitive coding schemes involved in the three coding
passes on RICA based architecture. According to the analysis of the CM
algorithm and the features of the RICA paradigm, the main challenges for
efficient implementation are summarised as follows:
 Kernels need to be constructed for one or more coding passes involving
various primitive coding schemes in order to eliminate the configuration
latency.
 The number of memory accesses in each kernel must be restricted to
4 in order to prevent breaking kernels.
 The conditional branches in coding schemes must be eliminated.
 Kernels must be adaptive to the RLC coding scheme which may generate
various numbers of CX/D pairs within a single stripe column.
In CM, all CXs are generated depending on different combinations of the
significant states/refinement states/magnitudes of the current bit and its eight
neighbours, so the latter two challenges become critical. In order to
overcome the above challenges, all the four primitive coding schemes in CM
are carefully designed and implemented on RICA based architecture.
6.3.1. Zero Coding
As discussed in Chapter 2, ZC generates a CX according to sums of
significant states of the current bit and its eight neighbours by
horizontal/vertical/diagonal directions as well as which DWT subband the
current codeblock belongs to. In the proposed ZC implementation, all the
H/V/D sums are judged by comparators and the output CX is generated
through a sequence of multiplexers, whose selecting inputs are the outputs of
the comparators.
Figure 6.4 Detailed Architecture for ZC Unit (H = H0 + H1, V = V0 + V1,
D = D0 + D1 + D2 + D3)
Figure 6.4 illustrates the detailed circuit structure of this CM
coding engine. The comparator array compares the H/V/D values with the
different parameters defined in Table 2.1 (ZC LUT table) in Chapter
2. The comparison results are used as judgements for H/V/D contributions. The
logic combination block combines these judgements with logic operations
according to the ZC LUT table and generates decisions for the multiplexer
sequence to decide the final CX value. In this ZC coding unit, CX has an
initial value (normally zero), and the multiplexer sequence chooses the final
CX value by utilising decisions provided by the logic combination block. In
this way, conditional branches can be totally eliminated and the ZC coding
unit can be integrated within a kernel, with a fair trade-off of a number of
extra multiplexers. For different DWT subbands, various parameters for the
comparator array and new logic combinations in the logic block are utilised
without modifying the coding unit architecture.
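To make the branch-elimination idea concrete, the following minimal C sketch mirrors the comparator/multiplexer structure in software: the H/V/D sums feed a set of 0/1 judgements, and a chain of conditional selects (the software analogue of the multiplexer sequence) refines an initial CX value. The function name, the decision terms and the context labels are illustrative placeholders only; the real logic combination follows the ZC LUT in Table 2.1.

#include <stdint.h>

/* Illustrative branchless ZC selection: judgements from a comparator array
 * drive conditional selects instead of conditional branches, so the unit can
 * stay inside a single kernel.  The decisions and context labels below are
 * placeholders; the real mapping follows the ZC LUT (Table 2.1). */
static uint32_t zc_context(int h0, int h1, int v0, int v1,
                           int d0, int d1, int d2, int d3)
{
    int H = h0 + h1;                       /* horizontal sum */
    int V = v0 + v1;                       /* vertical sum   */
    int D = d0 + d1 + d2 + d3;             /* diagonal sum   */

    int jH2 = (H == 2), jH1 = (H == 1), jH0 = (H == 0);   /* comparator array */
    int jV1 = (V >= 1), jD1 = (D >= 1);

    int sel_a = jH2;                       /* illustrative logic combination   */
    int sel_b = jH1 & (jV1 | jD1);
    int sel_c = jH0 & jV1;

    uint32_t cx = 0;                       /* initial CX value                 */
    cx = sel_c ? 3u : cx;                  /* multiplexer sequence             */
    cx = sel_b ? 6u : cx;
    cx = sel_a ? 8u : cx;
    return cx;
}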
6.3.2. Sign Coding
Sign coding scheme utilises sign bits and significant bits of the current bit and
its horizontal/vertical neighbours to calculate the required H/V contributions
as discussed in Chapter 2.
Figure 6.5 Detailed Architecture for SC Unit
Figure 6.5 illustrates the detailed architecture for the
SC coding unit. Since both the sign bits and the significant states can only
hold the value of 0 or 1, no comparator but logic combinations are required to
generate decisions for the H/V contributions. Similar to the ZC unit,
multiplexer sequences for both horizontal and vertical contributions generate
the H and V contributions respectively without breaking the potential kernel.
These two contributions are further used to generate decisions for CX and
XOR bit. According to Table 9.1 in Chapter 2, conditions for generating XOR
bit can be simplified as shown in Table 6.1. With these simplified conditions,
two logic combination blocks are employed to generate final decisions for CX
and XOR bit respectively, which are utilised by another two multiplexer
sequences in order to obtain the final CX and XOR bit. The Decision bit is
generated by a simple XOR operation between the current sign bit and the
XOR bit.
Table 6.1 Simplified LUT for XOR Bit
Combinations of H/V Contributions        XOR bit
H = 1, V = x (x means don't care)        0
H = 0 and V ≥ 0                          0
H = 0 and V = -1                         1
H = -1                                   1
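As a small illustration, the simplified conditions of Table 6.1 can be transcribed directly into branch-free logic; the C sketch below is only an example and the variable names are assumptions.

/* XOR bit from Table 6.1: h and v are the H and V contributions (-1, 0 or 1) */
static int sc_xor_bit(int h, int v)
{
    return (h == -1) | ((h == 0) & (v == -1));
}

The decision bit then follows from a single XOR between the current sign bit and this XOR bit, as described above.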
Figure 6.6 Detailed Architecture for MRC Unit
6.3.3. Magnitude Refinement Coding
MRC requires accumulation of significant states of the current bit’s eight
adjacent neighbours and the information indicating whether the current bit
has been coded by MRC. Compared with ZC and SC, MRC implementation
is relatively simple, as illustrated in Figure 6.6. In total, three
comparators and three multiplexers are utilised in the coding unit in order to
eliminate conditional branches.
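A minimal C sketch of the same branch-free selection for MRC is given below, assuming the standard MRC rule that is consistent with Figure 6.6 (contexts 14-16); the function and variable names are illustrative.

/* refined_before: whether the bit has already been coded by MRC;
 * neighbour_sum: accumulated significance of the eight neighbours */
static uint32_t mrc_context(int refined_before, int neighbour_sum)
{
    uint32_t cx = 14u;                    /* first refinement, no significant neighbour */
    cx = (neighbour_sum >= 1) ? 15u : cx; /* first refinement, significant neighbour(s)  */
    cx = refined_before ? 16u : cx;       /* already refined at least once               */
    return cx;
}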
6.3.4. Run Length Coding
RLC presents a more difficult problem as it may generate various numbers (1
or 3) of CX/D pairs according to the four bits in the stripe column. Due to the
memory access and conditional branch restrictions in RICA paradigm, an
efficient implementation has to ensure that this variation does not break the
potential kernel. In this work, we managed to realise the RLC unit within a
kernel by combining the generated CX/D pairs into a single codeword that
can be read or written by a single memory operation, as demonstrated in
Figure 6.7. The codeword occupies 14 bits in total and is constructed from
two modified CX/D pairs, in each of which the decision bit is expanded to two
bits.
Figure 6.7 Codeword Structure in RLC Unit (two modified CX/D pairs of 5 + 2
bits each; weight factor = b3<<3 + b2<<2 + b1<<1 + b0, where b0/1/2/3 are the
magnitude bits in the current stripe column; the decimal values 8772 and
8904-8907 correspond to the zero-column and non-zero-column cases)
For a zero stripe column coded by RLC, only one CX/D pair (17, 0)
needs to be generated, and the two parts of the codeword are both filled with
it. When coding a non-zero stripe column, firstly a CX/D pair (17, 1) is
generated, which fills the highest 7 bits of the codeword. After that, another
two CX/D pairs (18, 0 or 1), (18, 0 or 1) are generated and stored in the
lowest 7 bits of the codeword by combining the two decision bits together.
According to the different contents, the 14-bit codeword is actually
represented in decimal as shown in Figure 6.7. Assuming the four bits in a
stripe column are b0, b1, b2, b3 (from top to bottom), a weight factor is
employed to indicate the position of the first non-zero bit in the stripe column,
with which the codeword can be assigned the correct value, as illustrated in
the figure.
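A minimal C sketch of the codeword packing follows; it reproduces the decimal values of Figure 6.7 ((17,0)(17,0) = 8772 for a zero column and (17,1)(18,xx) = 8904-8907 otherwise). Treating the two low decision bits as the position of the first non-zero bit, counted from the top of the column, is an assumption drawn from Figure 6.7 and Table 6.2.

#include <stdint.h>

/* Pack the RLC output for one stripe column into a single 14-bit codeword
 * (5-bit CX, 2-bit D, 5-bit CX, 2-bit D) so that one memory write suffices. */
static uint32_t rlc_codeword(int column_is_zero, int first_nonzero_pos)
{
    if (column_is_zero)
        return (17u << 9) | (0u << 7) | (17u << 2) | 0u;        /* 8772 */

    /* non-zero column: (17,1) in the upper half, (18, position) below */
    return (17u << 9) | (1u << 7) | (18u << 2)
           | (uint32_t)(first_nonzero_pos & 0x3);               /* 8904..8907 */
}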
There is another state employed in this work to indicate the location where
RLC finishes, which is termed valid_state. Table 6.2 illustrates the valid_state
and its value assignment. As RLC is only applied at the beginning of a stripe
column, the first bit has different valid_state values compared with the other
three bits. In the case that no RLC is applied to a stripe column, the
valid_state for the first bit is set to “0”. When RLC is applied, the valid_state of
the first bit depends on the weight factor introduced above. If the weight
factor is zero, the valid_state is set to “3” indicating that the whole stripe
column is RLC coded; otherwise it is set to “1”, which means that the RLC is
applied but only for a subset of the four bits in the column. For the other three
bits, their valid_states depend on the value of the weight factor. If the
valid_state is set to “2”, it means that RLC has been applied to this bit;
otherwise the valid_state is set to “0”.

Table 6.2 Valid_state in the RLC Unit
Bit   valid_state   Conditions
b0    0             RLC is not applied to this stripe column
      1             RLC applied and weight factor ≠ 0
      3             RLC applied and weight factor = 0
b1    2             RLC applied and weight factor < 8
      0             else
b2    2             RLC applied and weight factor < 4
      0             else
b3    2             RLC applied and weight factor < 2
      0             else
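Table 6.2 translates directly into a few comparisons; the C sketch below (names are placeholders) assigns the valid_states of bits b0-b3 from the weight factor.

/* vs[0..3] receive the valid_state of b0..b3 per Table 6.2 */
static void rlc_valid_state(int rlc_applied, int weight, int vs[4])
{
    vs[0] = !rlc_applied ? 0 : (weight == 0 ? 3 : 1);
    vs[1] = (rlc_applied && weight < 8) ? 2 : 0;
    vs[2] = (rlc_applied && weight < 4) ? 2 : 0;
    vs[3] = (rlc_applied && weight < 2) ? 2 : 0;
}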
Based on the above discussion, RLC unit is efficiently implemented on RICA
based architecture as illustrated in Figure 6.8.
Figure 6.8 The Structure of RLC Unit
RLC unit checks every stripe
column at the beginning in order to find out whether RLC should be applied.
The codeword containing CX/D pairs and the valid_state indicating the
RLC ending position are generated for the EBCOT CM coding engine, which
will split CX/D pairs from codewords according to the provided information.
6.4. Partial Parallel Architecture for CM
6.4.1. Architecture
Based on efficient implementations of the four primitive coding schemes, an
optimised Partial Parallel Architecture for CM [84] is developed specially
targeting RICA based architectures. The PPA method is derived from the
original PPCM architecture with utilisation of the loop splitting technique.
Figure 6.9 illustrates the PPA method. Since coding pass 2 does not
affect the sample’s significant state, in PPA it is executed in parallel with
coding pass 1, while pass 3 is executed separately after the first two coding
passes finish the current bit-plane. As a result, there are two separate
coding windows with the same size (5x3) for passes 1 & 2 and pass 3 in PPA,
and two kernels are constructed respectively. As most of the primitive coding
schemes employed in coding pass 3 are the same as those in coding pass 1, RICA
based architecture can be dynamically reconfigured for the pass 3 kernel
without requiring additional computational resources.
Figure 6.9 Partial Parallel Architecture for Context Modeling
For each coding
window, stripe causal technique is utilised to eliminate the dependence of
coding operations on the significance of samples belonging to the next stripe
[5], [38]. The four bits in a stripe, from top to bottom, are coded in parallel [84].
6.4.2. PPA based CM Coding Procedure
Given a codeblock, PPA based CM starts to code it from the MSB to the LSB.
The two coding windows shift independently from left to right and stripe by
stripe. As discussed in Chapter 2, the required information for CM includes
significant state (δ), refinement state (γ), sign (χ) and magnitude (v)
information. In addition, since coding pass 3 is executed separately from the
other two coding passes, another state (θ) indicating whether the bit has
been coded by coding pass 1 or 2 is necessary in PPA. In order to reduce the
number of memory accesses, all the required information except χ and v is
stored in different data buffers, while χ and v are directly obtained from the
coefficients in the current codeblock. Along with the shifting coding window,
the required information is read out from data buffers consecutively via SBUF
cells, while updated state information is written into corresponding data
buffers in sequence during the coding process.
Figure 6.10 gives an example of how data buffers work in PPA. It illustrates
the case of data buffers containing δ and γ when coding the pth bitplane
(LSB < p < MSB). Each data buffer corresponds to one line of the coding window.
When coding the first stripe, data buffer 0 is reserved to be all zeros as there
is no valid sample.
Figure 6.10 The Example of How Data Buffers Work in PPA
When the current stripe is finished, the reading address of
SBUF 0 is reset to the last starting address of data buffer 4, while the other
four buffers keep consecutive reading addresses. Therefore, when coding
the second stripe, the state information belonging to codeblock line 3 is read
out via SBUF 0, while information for the other four lines is read out
consecutively. In this way, PPA ensures that the coding window contains
correct state information during the coding procedure. For the case of θ, only
four data buffers are required and there is no address reset, as only the θ
within the current stripe is required.
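The address handling can be summarised by a short sketch, a simplification under the assumption that each data buffer is read through its own SBUF-style read pointer; the structure and function names are illustrative and are not the RICA tool-flow interface.

typedef struct {
    int read_addr[5];      /* current read addresses of SBUF 0..4           */
    int stripe_start4;     /* address at which buffer 4 started the stripe  */
} ppa_buffers_t;

static void start_stripe(ppa_buffers_t *b)
{
    b->stripe_start4 = b->read_addr[4];   /* remember buffer 4's starting point */
}

static void finish_stripe(ppa_buffers_t *b)
{
    /* SBUF 0 re-reads the line that buffer 4 just supplied (codeblock line 3,
     * 7, ...); buffers 1..4 simply keep their consecutive read addresses. */
    b->read_addr[0] = b->stripe_start4;
}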
One of the main challenges is how to eliminate conditional branches which
will break potential kernels. In PPA, all the primitive coding schemes
belonging to different coding passes are executed simultaneously and the
required contexts are selected according to the algorithm discussed in
Chapter 2 and Section 6.3 as the final outputs. A pseudo code is given in
Figure 6.11 to show the detailed working process of PPA. The selecting
operations are realised by multiplexers with a similar architecture to that
discussed in Section 6.3. Table 6.3 lists the basis according to which these
selections are made.
Table 6.3 CX/D Selection in PPA
For coding passes 1 and 2:
  ZC:   Condition_ZC¹ = 1
  SC:   Current bit = 1 && Condition_ZC = 1
  MRC:  Significant_currentbit = 1
For coding pass 3:
  RLC:  Condition_RLC² = 1
  ZC:   Not coded by pass 1 or 2
  SC:   Current bit = 1 && Significant_currentbit = 0
¹ Condition_ZC = (significant_currentbit = 0) && (any significant_neighbour = 1)
² Condition_RLC = all the significant states in the coding windows are zero
for (k = 0; k < bit_depth; k++)
// for bit-plane iteration
{
Initialize the buffer addresses for state variables;
for (j = 0; j < number_of_lines; j += 4)
// coding pass 1 and 2
{
Initialize the state variables;
for (i=0; i< line length; i++) // kernel 1
{
Read the state variables including χ and v into the coding window;
Pass 1 coding;
// including four ZC and four SC
Pass 2 coding;
// parallel with pass 1, including four MRC
Select the correct CX/D pairs;
Write the updated variables back into data buffers;
Write the selected CX/D pairs into data buffers as the intermediate;
}
Reset the buffer read/write addresses;
}
for (j = 0; j < number_of_lines; j += 4)
// coding pass 3
{
for (i=0; i< line length; i++)
// kernel 2
{
Read the state variables including χ and v into the coding window;
Read the intermediate CX/D pairs from coding pass 1 and 2 in the coding
window;
Pass 3 coding; // including RLC, four ZC and four SC
Select the correct CX/D pairs;
Write the updated variables back into data buffers;
Write the selected CX/D pairs into the final output memory;
}
Reset the buffer read/write addresses;
}
}
Figure 6.11 Pseudo Code of PPA Working Process
When processing finishes, the four coded CX/D pairs for a stripe column are
generated. The information provided by PPA for each bit includes:
a. The significant or refinement state of the sample.
b. The CX/D pair of the sign bit (if existing).
c. The CX/D pair of the magnitude bit.
d. The valid_state.
As the maximum memory access number is restricted to be 4 in a kernel, the
information for each bit must be written into the memory by a single operation.
In this case, PPA combines all the required information listed above for each
bit into a single codeword so the complete information for a stripe column can
be written into the memory simultaneously without breaking the kernels.
Figure 6.12 illustrates the codeword structure. In order to distinguish the
CX/D pairs generated by coding pass 3 from those coded by pass 1, an
offset is added to contexts generated by coding pass 3 (excluding RLC
contexts) which will be removed in the following process. Since the RLC
coding scheme may generate various numbers of CX/D pairs within one
stripe column and conditional branches must be eliminated in the kernel, the
valid_state discussed in Section 6.3.4 is used to indicate whether any RLC
coded CX/D pairs are involved in the codewords from a stripe column and
the count of them.
Figure 6.12 PPA Codeword Structure: significant state | refinement state
(1 bit), CX/D pair of the sign bit (8 bits), CX/D pair of the magnitude bit
(RLC codeword) (14 bits), valid_state (2 bits)

Table 6.4 Valid_state Indication for RLC in PPA
Valid_state   Indication
0             The current bit is not coded by RLC
1             The current bit is the first bit in a stripe column. The bit itself
              and some following bits in the stripe column are coded by RLC
2             The current bit is not the first bit in a stripe column and the bit
              is coded by RLC
3             The current bit is the first bit in a stripe column and the entire
              stripe column is coded by RLC

Table 6.4 gives an illustration of the generic RLC
indicating procedure for each output combination in PPA. With the provided
information, CX/D pairs can be correctly derived by the following modules.
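For illustration, a C sketch of packing and unpacking the 25 bits of Figure 6.12 is given below; the field widths (1 + 8 + 14 + 2) come from the figure, while the exact bit positions are an assumption made only for this example.

#include <stdint.h>

static uint32_t ppa_pack(uint32_t state, uint32_t sign_cxd,
                         uint32_t mag_cw, uint32_t valid_state)
{
    return (state       << 24) |      /* significant | refinement state, 1 bit  */
           (sign_cxd    << 16) |      /* CX/D pair of the sign bit, 8 bits      */
           (mag_cw      <<  2) |      /* magnitude CX/D (RLC codeword), 14 bits */
           (valid_state & 0x3u);      /* valid_state, 2 bits                    */
}

static void ppa_unpack(uint32_t w, uint32_t *state, uint32_t *sign_cxd,
                       uint32_t *mag_cw, uint32_t *valid_state)
{
    *state       = (w >> 24) & 0x1u;
    *sign_cxd    = (w >> 16) & 0xffu;
    *mag_cw      = (w >>  2) & 0x3fffu;
    *valid_state =  w        & 0x3u;
}

The MR module described in the following chapter performs the reverse operation when splitting CX/D pairs out of these codewords.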
6.5. Arithmetic Encoder in EBCOT
The top-level architecture and details of the key sub-modules in AE have
been introduced in Chapter 2. It is observed that the encoding algorithm is
composed of frequent conditional branches and simple operations. When
targeting RICA based implementations, these conditional branches will split
potential kernels into separate steps, which will dramatically increase the
configuration latency and extend the execution time. On the other hand, if we
force the application to be executed in a single kernel, that is, to employ
a large number of comparators and multiplexers to eliminate branches as is
done for CM, the kernel will have a long critical path even after being
pipelined due to the serial nature of AE. In this case, AE becomes the
bottleneck of a pure RICA based JPEG2000 encoder implementation.
In this work, an ARM [85] core is integrated with RICA based architecture for
an efficient AE solution. ARM is the leading provider of 32-bit embedded
microprocessors and offers a wide range of processors based on a common
architecture that delivers high performance, power efficiency and reduced
system cost. With the extension of DSP instruction set and instruction
prediction, ARM shows its considerable performance in signal processing
and control applications with noticeable low power consumption. Currently,
the mature products from ARM include ARM7, ARM9, ARM11 and Cortex
series. Considering this application itself, an ARM946E-S processor core is
chosen in this work due to its high performance and ultra low power
dissipation features. The ARM946E-S core implements the ARMv5TE [86]
instruction set and features an enhanced 16x32-bit multiplier capable of
single-cycle MAC operations, as well as DSP instructions, in order to accelerate
signal processing algorithms and applications.
A number of accelerating methods for AE implementation have been
presented including developing parallel architectures, pipelining the coding
process and simplifying the encoding procedure via DSP techniques [87-93].
Due to the nature of ARM, in this work attention is focused on efficient
simplification of the encoding procedure.
As presented, coded CXs and Ds are written into the memory simultaneously
by PPA within codewords due to the conditional branch and memory access
restriction of RICA paradigm. In this case, AE has to derive separate CXs
and Ds via shifting and logic operations. Based on the algorithm, the deciding
procedure for MPS and LPS is simplified as follows:
If (D == MPS(CX)) code MPS, else code LPS
As the RENORME module is the most time consuming task in AE, the
optimisation focuses on reducing the number of iterations when the two
probability subintervals are switched. The simplification methods proposed in
[88, 93] are adopted as they are suitable for ARM based implementation.
Figure 6.13 (a) Original RENORME Architecture (b) Optimised RENORME Architecture
Figure 6.14 (a) Original BYTEOUT Architecture (b) Optimised BYTEOUT Architecture
The basic idea behind these approaches is to calculate the number of left
shifts of register A in RENORME so the iteration can be completely eliminated
by a single operation. In this case, the number of leading zeros in A must be
detected when performing RENORME. Instead of preparing a reference table
in [88], the DSP instruction CLZ in ARM is utilised to simplify the
implementation. Figure 6.13 illustrates the difference between the original
and the optimised RENORME module. The BYTEOUT module is also
simplified by merging logic operations and reducing conditional branches for
enhanced performance, as illustrated in Figure 6.14.
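The following C sketch conveys the idea rather than the exact flowchart of Figure 6.13(b): a single count-leading-zeros operation (GCC's __builtin_clz stands in for the ARM CLZ instruction) gives the required shift count, which is then applied in blocks instead of bit by bit; byteout() and the MQ-coder register widths are assumptions.

#include <stdint.h>

static void byteout(uint32_t *C, int *CT);   /* assumed helper: emits a byte, refills CT */

static void renorme_clz(uint32_t *A, uint32_t *C, int *CT)
{
    int ns = __builtin_clz(*A) - 16;     /* shifts needed to restore A >= 0x8000 */
    while (ns >= *CT) {                  /* shift would exhaust CT: emit a byte   */
        *A <<= *CT;
        *C <<= *CT;
        ns -= *CT;
        byteout(C, CT);                  /* BYTEOUT sets CT back to 7 or 8        */
    }
    *A <<= ns;                           /* remaining shift fits within CT        */
    *C <<= ns;
    *CT -= ns;
}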
6.6. EBCOT Tier-2 Encoder
In this work, Tier-2 encoder is implemented on the ARM core together with
AE. As discussed in Chapter 2, the key modules in EBCOT Tier-2 encoder
are Tag-Tree coding and bit-stream length coding. The information that needs
to be coded by Tag-Tree includes the inclusion information (relating to layer
information) and the number of zero bit-planes in each codeblock within a
DWT subband.
Figure 6.15 Detailed Tag-Tree Coding Procedure
Figure 6.16 Detailed Codeword Length Coding Procedure
Figure 6.15 illustrates a detailed flow graph explaining how
Tag-Tree coding is implemented on ARM.
Since the bit-stream length coding process involves log calculation, an LUT is
utilised in this work to replace the log calculation and to obtain the required
bits to represent the bit-stream length. The LUT supports the number of
coding passes from 1 to 2048, which is sufficient for almost all cases.
Figure 6.16 illustrates the coding procedure for the bit-stream length
information. Since the EBCOT Tier-2 encoder only processes the global
information of a tile with a simple and straightforward architecture, its
execution time becomes negligible compared with other computationally
intensive tasks such as 2-D DWT, CM and AE.
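A possible realisation of this LUT is sketched below in C; the assumption, taken from the usual JPEG2000 packet-header rule rather than stated explicitly in the text, is that the table holds floor(log2(n)) for n coding passes and is added to LBlock to obtain the number of bits used for the codeword length.

#define MAX_PASSES 2048
static unsigned char log2_lut[MAX_PASSES + 1];

static void init_log2_lut(void)
{
    for (int n = 1; n <= MAX_PASSES; n++) {
        int b = 0;
        while ((2 << b) <= n)            /* b = floor(log2(n)) */
            b++;
        log2_lut[n] = (unsigned char)b;
    }
}

static int length_bits(int lblock, int n_passes)
{
    return lblock + log2_lut[n_passes];  /* replaces the log calculation */
}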
6.7. Performance Analysis and Comparisons
The EBCOT CM and AE modules are implemented on RICA based
architecture and ARM respectively. Performance comparisons between the
original CM architecture, PPCM architecture and the proposed PPA CM
architecture which are all implemented on RICA based architecture are made
in Table 6.5. It is clear that although the original implementation has the
shortest critical path, it has the lowest throughput due to the frequent
conditional branches which introduce massive configuration latency. In
contrast, the PPCM method improves the throughput significantly by executing
the complete coding process within a large kernel. However, the length of its
critical path increases dramatically since it is difficult to pipeline such a large
kernel into deep levels.

Table 6.5 Performance Comparisons
                                       Original   PPCM    PPA
Throughput (MSymbols/S)                14.43      50.91   69.8
Critical Path after pipelining (ns)    6.67       35.72   15.96 / 12.78

Figure 6.17 PPA Based CM Execution Time under Different Pre-Conditions
(codeblock sizes 64x64, 32x32 and 16x16)

Compared with the other two approaches, PPA
demonstrates its highest throughput by constructing two kernels with proper
sizes which can be better pipelined. Figure 6.17 illustrates the execution time
of PPA based CM under different pre-conditions (codeblock size, DWT level)
when processing a 256x256 8-bit Lena image. It is seen that the proposed
PPA based CM keeps the same execution time under different DWT levels.
This is because the various primitive coding schemes in PPA are executed
in parallel and the final outputs are selected from several possible
results. Based on this architecture, PPA based CM is not data-sensitive,
which means its execution time only depends on the amount of input data
and has no relationship to the data values. In this case, PPA based CM can
provide stable performance when working with different images. It can also
be seen that the execution time increases when the codeblock size gets
smaller. This increment is introduced by the configuration latency and other
related processing when finishing the current codeblock and switching to the
next one. The more codeblocks there are, the more latency is generated.
Table 6.6 gives the numbers of cells utilised in the CM engines on the customised
RICA based architecture. It is observed that the COMP, MUX and LOGIC cells take the
majority of the overall occupied computational resources in order to prevent
conditional branches and to construct kernels, and PPA shows nearly a half
reduction in the cell usage compared with PPCM.
Table 6.6 Numbers of Cells in CM Engines on Customised RICA Architecture
Cell     PPCM   PPA    Cell         PPCM      PPA
ADD      61     45     SBUF         15        27
COMP     492    235    RMEM         4         4
MUX      482    287    WMEM         4         4
REG      526    520    JUMP         1         1
SHIFT    48     61     RRC          1         1
LOGIC    496    291    Total Area   1038539   763747

The performance of AE implemented on ARM is also compared with two
RICA based implementations targeting the original approach and the
kernel-based optimisation respectively.

Table 6.7 Performance Comparisons
                   RICA (original)   RICA (kernel)   ARM (500MHz)
Coding time (ms)   0.1353            0.1187          0.033

The ARM based AE is simulated by the
ARM RVDS tool flow [94] with the working frequency of 500MHz, which is
equal to the memory accessing delay of RICA paradigm. Execution time is
obtained via coding a simple CX/D sequence with the length of 1024 and the
results are demonstrated in Table 6.7. The significant throughput
improvement of the ARM implementation mainly benefits from the
processor’s efficient branch prediction and DSP instructions such as CLZ,
while RICA based implementations suffer from either the large configuration
latency introduced by branches or long critical path of the kernel. Generally,
the average coding speed of the ARM-based AE is approximately 16 cycles
per CX/D pair. Compared with a TI C6416 AE implementation presented in
[93] which has an average of 13 cycles per CX/D pair, the proposed ARM-based
AE has the extra burden of deriving CX and D from every codeword via
shifting and logic operations, and it is possible to improve its speed by further
assembly-level optimisation.
6.8. Conclusion
In this chapter, an optimised EBCOT implementation on customised RICA
based architecture and an ARM core is presented. In Section 6.2, different
existing algorithms for CM in EBCOT are introduced and their advantages and
disadvantages are discussed, especially when targeting RICA based architecture.
Before introducing the novel PPA algorithm, Section 6.3 presents efficient
designs of the four primitive coding schemes in CM. These designs are
optimised to reduce memory accesses and eliminate conditional
branches. In particular, a special codeword is designed for the RLC scheme,
which allows the multiple CX/D pairs generated by RLC to be transferred
by a single memory access operation. In order to eliminate conditional
branches, a variable named valid_state is designed to indicate the ending
position of RLC in every stripe column. Meanwhile, a selecting scheme for
choosing the correct CX/D pairs is developed. The output for every DWT
coefficient is generated by PPA within a single final codeword, which contains
all the required information, including the CX/D pair for the magnitude bit, the
CX/D pair for the sign bit, the significant state, etc.
In Section 6.5, the arithmetic encoding algorithm is evaluated. Since it
consists of a large number of conditional branches, it is inefficient to implement AE on
RICA based architecture. In contrast, AE is efficiently implemented on an
ARM core which can be embedded into RICA based architecture. The AE
structure is successfully optimised by utilising DSP instructions in ARM such
as CLZ and by simplifying logic combinations. In Section 6.6, the Tag-Tree and
bit-stream length coding schemes in EBCOT Tier-2 encoder are presented.
Section 6.7 mainly targets performance comparisons. Different algorithms for
CM, including the original algorithm, PPCM and PPA, are implemented on
RICA based architecture and compared. It is demonstrated that PPA offers
the highest throughput and lower area occupation compared with PPCM. It is
also demonstrated that PPA can provide stable performance when working
with different images. Meanwhile, the AE implemented on the ARM core shows
higher throughput compared with implementations on RICA based
architecture. It also demonstrates a comparable performance when
compared with a TI C6416 AE implementation. Based on the proposed 2-D DWT
and EBCOT designs, the complete solution for the JPEG2000 encoder will be
presented in the following chapter.
Chapter 7
JPEG2000 Encoder on Dynamically
Reconfigurable Architecture
7.1. Introduction
This chapter presents the system-level integration and optimisation of the
JPEG2000 encoder on the proposed dynamically reconfigurable architecture
consisting of RICA based architecture and an ARM core. Targeting an
efficient data transfer scheme between 2-D DWT and EBCOT, the scanning
pattern of the 2-D DWT presented in Chapter 5 is optimised with the aim of
accelerating the processing and reducing the required intermediate data
storage [73]. Meanwhile, CM and AE modules in EBCOT are integrated by a
memory relocation module with a carefully designed communication scheme
[95]. A Ping-Pong memory switching mode is developed in order to further
reduce the execution time. Based on the system-level integration and
optimisation, performance of the proposed architecture for JPEG2000 is
evaluated in detail. Simulation results demonstrate that the proposed
architecture for JPEG2000 offers significant advantage in throughput
compared with various DSP & VLIW and coarse-grained reconfigurable
architecture based applications. Furthermore, a power estimation method of
RICA paradigm based architectures is presented and the system energy
consumption is evaluated.
7.2. 2-D DWT and EBCOT Integration
It is observed that 2-D DWT and EBCOT have different data processing
patterns, as 2-D DWT is line (column) based while EBCOT processes data
samples at bit level within codeblocks. In this case, EBCOT has to wait for
2-D DWT to finish transforming a number of complete lines and columns in
order to obtain a full codeblock, as illustrated in Figure 7.1.
In order to improve the coding efficiency, the 2-D DWT scanning pattern is
modified in this work as illustrated in Figure 7.2. Instead of scanning a
complete line or column of the image, modified 2-D DWT takes an area of 4
codeblocks as its processing unit at a time. After 2-D DWT, this area is
directly transformed to four codeblocks belonging to different subbands.
Codeblocks for LH, HL and HH subbands are then coded by EBCOT
separately, while codeblock for LL subband is reserved and stored for a
deeper level transform. When the current DWT finishes, the next four
codeblocks become the processing unit. In this way, codeblocks for EBCOT
are generated directly without delay of finishing a complete line/column, and
the required intermediate storage for 2-D DWT is reduced since only four
codeblocks are transformed at a time instead of the entire image.
Figure 7.1 Original data processing pattern between 2-D DWT and EBCOT
Figure 7.2 Modified 2-D DWT Scanning Pattern
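The modified scanning order can be summarised by the loop sketch below (in C); dwt_2d_area(), ebcot_encode() and store_ll() are hypothetical placeholders for the corresponding engines, and CB_SIZE is the codeblock side length.

extern void dwt_2d_area(const int *image, int x, int y, int size,
                        int *ll, int *hl, int *lh, int *hh);
extern void ebcot_encode(const int *codeblock);
extern void store_ll(const int *ll, int x, int y);

void dwt_modified_scan(const int *image, int width, int height,
                       int *ll, int *hl, int *lh, int *hh)
{
    const int CB_SIZE = 64;                       /* codeblock side length        */
    for (int y = 0; y < height; y += 2 * CB_SIZE)
        for (int x = 0; x < width; x += 2 * CB_SIZE) {
            /* transform one area of 4 codeblocks into LL, HL, LH, HH */
            dwt_2d_area(image, x, y, 2 * CB_SIZE, ll, hl, lh, hh);
            ebcot_encode(hl);                     /* coded immediately            */
            ebcot_encode(lh);
            ebcot_encode(hh);
            store_ll(ll, x / 2, y / 2);           /* kept for the next DWT level  */
        }
}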
7.3. CM and AE Integration
7.3.1. System Architecture
Since CM and AE modules are implemented separately on two different
architectures, a shared DPRAM, which acts as the communication channel
between RICA based architecture and ARM core, is utilised to integrate CM
and AE. Figure 7.3 illustrates the proposed architecture with DPRAM.
Figure 7.3 Proposed Architecture with DPRAM
As both the RICA paradigm and AE are based on 32-bit operands, a 32-bit
DPRAM is selected in this work. The depth of the DPRAM can be flexible, but
its minimum capacity should be able to satisfy the following requirements:

The storage space for DWT coefficients belonging to LH, HL and HH
subbands (three codeblocks) and CX/D codewords belonging to a single
bitplane of a complete codeblock generated by CM. In total, the required
depth is 3x64x64+64x64 = 16384.

The storage space for all the CX/D pairs belonging to a single bitplane of
a complete codeblock. Usually the maximum size of a codeblock is 64x64.
Considering the extreme condition (all the samples are sign coded), the
required storage depth is 2x64x64 = 8192.

Some reserved space for storing communication variables (less than 32).
Based on the above analysis, a 32K x 32-bit DPRAM is selected in this work.
This DPRAM can be accessed by both RICA based architecture and ARM
via its two ports. Several communication variables are utilised to control the
communication between RICA based architecture and ARM in order to avoid
memory accessing conflict, and these variables will be introduced in Section
7.3.3.
7.3.2. Memory Relocation Module
As discussed in the previous chapter, coded CX/D pairs from CM are
generated in parallel within codewords.
Figure 7.4 (a) Memory Relocation in JPEG2000 Encoder (b) Detailed
Architecture of MR module
When referring to the JPEG2000
standard, the following AE module needs to receive CX/D pairs separately
according to different coding passes together with the numbers of these pairs.
In this case, CX/D pairs generated by CM need to be derived from
codewords and relocated. Due to the memory access and conditional branch
restrictions of RICA paradigm, deriving and relocating operations cannot be
performed simultaneously with CM. In this work, a module termed Memory
Relocation (MR) is added between CM and AE, in order to ensure AE
module receives CX/D pairs with the correct order.
The MR module is implemented on RICA based architecture as illustrated in
Figure 7.4. Given a codeword, MR first splits it and obtains all information
provided by CM mentioned in Section 6.4.2, and then the magnitude CX/D
110
JPEG2000 Encoder on Dynamically Reconfigurable Architecture
pair is relocated to the corresponding coding pass storage by MR. After that,
MR checks whether the current codeword contains any sign CX/D pair. If yes,
the sign CX/D is relocated together with the magnitude CX/D pair. Meanwhile,
the CX/D pair count increments whenever a CX/D pair is relocated. The
valid_state is also utilised by MR in order to derive and relocate RLC coded
CX/D pairs correctly.
7.3.3. Communication Scheme between CM and MR
In order to reduce the implementation complexity, communication between
RICA based architecture and ARM core is directly realised through variables
located on specified memory addresses which are listed in Table 7.1. These
variables can be accessed by both RICA based architecture and ARM
through simple memory operations.
Figure 7.5 gives a pseudo code illustration of how the integrated EBCOT
architecture works [95]. At the beginning of encoding, AE_START signal is
initialised to “0” to make sure AE does not work until CX/D pairs of the
current bitplane are ready; at the same time MEMORY_READY is set to “1”.
On the ARM side, after initialisation, AE_READY is set to “1” indicating that
ARM is ready for AE coding.
When coding starts, the index of current bitplane is indicated by
BITPLANE_INDEX and is ready to be passed to ARM. RICA checks the
shared DPRAM. If it is ready, CM coding starts.

Table 7.1 Communication Variables
Variable          Purpose
AE_START          To start AE on ARM
MEMORY_READY      To indicate the shared DPRAM is ready for accessing
                  without conflict
AE_READY          To indicate the ARM core is ready for next AE coding
BITPLANE_INDEX    To notify ARM the index of bit-plane
COUNT_PASS1/2/3   To store the numbers of CX/D pairs coded in each coding pass
RICA
// Initialise
set AE_START = 0; // AE only starts when relocated CX/D pairs are ready
set MEMORY_READY = 1; // shared DPRAM is initialised ready for use
for (k=0; k <bitdepth; k++) // loop for bitplanes
{
BITPLANE_INDEX = k; // index the bitplane
do{} while (MEMORY_READY == 0); // waiting for DPRAM ready
// RICA based architecture processing starts
RICA based architecture processing; // CM
// processing finishes
do{} while (AE_READY == 0); // waiting for ARM finishing
// Memory relocation starts
set MEMORY_READY = 0;
MR starts;
set MEMORY_READY = 1; // indicating relocated CX/D pairs are ready
set AE_START = 1; // starting ARM
}
ARM:
// Initialise
set AE_READY = 1;
while (BITPLANE_INDEX <bitdepth)
{
// waiting for relocated CX/D pairs ready and the starting signal from RICA
do{} while ((MEMORY_READY == 0) || (AE_START == 0));
// ARM processing starts
set AE_READY = 0;
ARM processing; // AE
set AE_READY = 1;
// wait until MR of the next bitplane starts
do{} while (MEMORY_READY == 1);
}
Figure 7.5 Pseudo Code for EBCOT Implementation on the Proposed Architecture
After completing CM coding of the current bitplane, RICA checks AE_READY
to see whether ARM is ready for AE. If yes, MR is executed with
MEMORY_READY being set to “0”
in order to avoid any unexpected access to the shared DPRAM. When MR
finishes, both MEMORY_READY and AE_START are set to “1” to start AE
on ARM.
On the ARM side, ARM checks the BITPLANE_INDEX regularly until
finishing the entire codeblock. AE does not start until both MEMORY_READY
and AE_START signals are “1” to ensure it is safe to access the shared
DPRAM.
Figure 7.6 Pipeline Structure of the JPEG2000 Encoder
The AE_READY variable is set to “0” at the time AE starts
in order to indicate that ARM is busy. When AE coding finishes, AE_READY
is set to “1” again; ARM is put into a wait state and jumps back to the
beginning of the bitplane loop only when the next MR starts.
One of the benefits of this working mode is that it is possible to pipeline
between RICA based architecture and ARM core during the coding process.
Excluding 2-D DWT which is executed at beginning of the coding procedure,
the JPEG2000 encoder on the proposed architecture is considered to be
composed of three stages: CM, MR and AE. Since the MR module acts as
the intermediate between CM and AE, a 3-stage pipeline is established with
which CM and AE can be executed in the same time slot. Figure 7.6
illustrates this pipeline architecture. As described in Table 7.1, four
communication variables take charge of controlling the pipeline structure. By
strictly controlling accesses to the shared DPRAM and the starting time of
different coding engines, the pipeline architecture offers significant
improvement in system performance [95].
7.3.4. Ping-Pong Memory Switching Scheme
Based on the above discussion, core tasks in JPEG2000 encoder are
implemented and optimised on the proposed architecture. A performance
evaluation is performed with the standard 256x256 grayscale Lena test
image. Figure 7.7 illustrates the execution time ratio of different modules for
encoding the entire image (codeblock size = 64x64, 5/3 DWT, 1-level). It is
seen that the MR module takes 33% of the overall execution time and
becomes the system bottleneck.
Figure 7.7 Execution Time Ratio of Different Modules in JPEG2000 Encoder
(2-D DWT 3%, CM 24%, MR 33%, AE 40%)
Figure 7.8 Ping-Pong Memory Switching Architecture
In other words, if a more efficient pipeline
scheme can be established which ensures that the MR module can also be
executed simultaneously with other modules instead of only pipelining CM
and AE, the system performance will be significantly improved. In this work,
another memory block with the same size of the CX/D pair storage space in
the shared DPRAM is employed to construct a Ping-Pong memory switching
scheme, as illustrated in Figure 7.8. These two memory blocks in the shared
DPRAM are accessed alternately by both RICA based architecture and ARM
core. When CM and MR finish coding the 1st bitplane, CX/D pairs are stored
in memory block A and then coded by AE. At the same time the 2nd bitplane
is coded by CM and MR and stored in memory block B. After that, AE fetches
CX/D pairs from memory block B to code the 2nd bitplane, meanwhile CM
and MR switch to memory block A again for the next bitplane, and so on until the
complete codeblock is coded. In this way, CM and MR are executed in the
114
JPEG2000 Encoder on Dynamically Reconfigurable Architecture
same time slot with AE, leading to further execution time reduction. When the
Ping-Pong memory switching mode is applied, the DPRAM illustrated in
Section 7.3.1 can be replaced by a similar DPRAM with different address
space for the two Ping-Pong data blocks or even a 4-port RAM. Obviously,
the capacity for storing CX/D codewords and relocated CX/D pairs need to
be doubled.
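A sketch of the alternation is given below in C. It only shows how the two memory blocks are swapped between the producer (CM + MR) and the consumer (AE) for successive bitplanes; on the real platform the two calls in each iteration execute concurrently on RICA and ARM, and all names are placeholders.

#include <stdint.h>

extern void cm_and_mr(int bitplane, uint32_t *block);   /* RICA side */
extern void ae_encode(const uint32_t *block);           /* ARM side  */

void encode_codeblock(int bit_depth, uint32_t *block[2])
{
    for (int k = 0; k < bit_depth; k++) {
        int produce = k & 1;            /* block written by CM + MR for bitplane k */
        int consume = produce ^ 1;      /* block holding bitplane k-1 for AE       */

        cm_and_mr(k, block[produce]);
        if (k > 0)
            ae_encode(block[consume]);  /* runs in the same time slot on ARM       */
    }
    ae_encode(block[(bit_depth - 1) & 1]);   /* AE for the last bitplane           */
}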
7.4. Performance Analysis and Comparison
System performance analysis is performed with the standard Lena test image
(256x256, 8-bit grayscale). Core tasks in JPEG2000 are implemented on this
proposed architecture mainly using ANSI-C, including some embedded
assembly language for improving the code efficiency. Core tasks on RICA
based architecture are then compiled, scheduled and simulated by the 65nm
technology based RICA simulator [6] with optimisation by both hand and
compiler. The ARM-based modules are simulated by the ARM RVDS tool
flow [94] with the working frequency of 500MHz, which is equal to the
memory accessing delay of RICA paradigm.
7.4.1. Execution Time Evaluation
Execution time of different modules on this proposed architecture is listed in
Table 7.2. Based on the Ping-Pong memory switching mode, the total
execution time is calculated as follows:
Total execution time = DWT + MAX(CM + MR, AE) + AE_last_bitplane
It is observed that the most time consuming modules are CM, MR and AE. It
is worth noticing that the critical path length of kernels for CM module is
curbed by the number of available registers supported by the current RICA
tool flow (maximum 547 including scratch registers) for pipelining. In other
words, a deeper pipeline for CM could be established if more registers
could be used, leading to less execution time. Due to the memory
access limitation of RICA paradigm, MR can only process one codeword at a
time instead of four as CM does, which is the reason why the MR
module is even more time consuming than CM. Meanwhile, the ARM-based
AE implementation is based on a serial architecture, so it has higher time
consumption than CM and MR.
Table 7.2 Detailed Execution Time of the JPEG2000 Encoder Sub-modules on the
Proposed Architecture
(a) DWT 5/3
Time    64x64             32x32                      16x16
(ms)    level 1  level 2  level 1  level 2  level 3  level 1  level 2  level 3  level 4
DWT     0.83     1.04     0.88     1.1      1.15     0.98     1.22     1.28     1.3
CM      7.86     7.87     8.53     8.53     8.53     9.86     9.87     9.87     9.87
MR      10.49    10.49    10.51    10.51    10.51    10.6     10.6     10.6     10.6
AE      12.49    11.99    12.48    11.98    11.9     12.46    11.96    11.9     11.9
Total   22.1     22.42    22.86    23.34    23.49    24.37    24.89    25.06    25.12
(b) DWT 9/7
Time    64x64             32x32                      16x16
(ms)    level 1  level 2  level 1  level 2  level 3  level 1  level 2  level 3  level 4
DWT     0.89     1.12     1.003    1.25     1.32     1.22     1.53     1.61     1.63
CM      7.86     7.87     8.53     8.53     8.53     9.86     9.87     9.87     9.87
MR      10.49    10.49    10.51    10.51    10.51    10.6     10.6     10.6     10.6
AE      13.22    12.77    13.23    12.78    12.76    13.25    12.8     12.77    12.77
Total   21.96    22.24    22.76    23.15    23.28    24.4     24.86    25       25.05
7.4.2. Power and Energy Dissipation Evaluation
How to calculate the power dissipation of a given architecture is a tricky
problem. Power dissipation usually consists of two parts: dynamic power
and static power. Dynamic power is determined directly by the working
frequency, the voltage, the transistor load capacitance and the activity factor,
while static power is relevant to the leakage current, the threshold voltage
and the manufacturing process. Due to the limitation of RICA software, the
current tool flow cannot provide the power dissipation of a given application.
In this thesis, a rough power dissipation estimating method for RICA based
applications, which is provided by the RICA developing group, is described.
This estimating method takes into account the number of utilised gates and
average kernels frequencies. Given a RICA based application, its power
dissipation is calculated by the following steps:
1. The numbers of required ICs are calculated, which depend on the largest
kernels in the targeted application.
2. The required array area is obtained by summing up the different cell
areas provided by the RICA tool flow [6].
3. The total gate count and the internal power (mW/MHz) are calculated with
the gate density and the gate-level internal power provided by the RICA
developing group and [96].
4. The average frequency (MHz) of the application is obtained by the critical
path length of the main kernel.
5. The power (mW) is calculated by the internal power and the average
frequency.
6. The energy consumption is calculated by the power and the application
execution time.
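As a worked check of these steps, the short C program below reproduces the CM column of Table 7.3 from the figures quoted in this section (4646606 um² array area, 694 kGate/mm², 1.2 nW/MHz/gate and a 69 MHz average frequency) together with the 7.86 ms CM execution time from Table 7.2; small differences against the tabulated 266.96 mW and 2.10 mJ are rounding effects.

#include <stdio.h>

int main(void)
{
    double area_mm2   = 4646606.0 / 1e6;        /* um^2 -> mm^2                  */
    double gates      = area_mm2 * 694e3;       /* gate density 694 kGate/mm^2   */
    double mw_per_mhz = gates * 1.2e-9 * 1e3;   /* 1.2 nW/MHz/gate -> mW/MHz     */
    double power_mw   = mw_per_mhz * 69.0;      /* 69 MHz average frequency      */
    double energy_mj  = power_mw * 7.86e-3;     /* 7.86 ms CM execution time     */

    printf("%.3f mW/MHz  %.2f mW  %.2f mJ\n", mw_per_mhz, power_mw, energy_mj);
    return 0;
}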
It should be noticed that the power figures given by this method are only
rough estimations. Only the dynamic power dissipation is estimated and the
static power/reconfiguration power is not considered. Meanwhile, if there is
any peripheral enabled, the external power dissipation also needs to be
taken into account. For example, the power consumption of the DPRAM
discussed previously in this chapter is not included in this estimation.
However, this estimating method can provide a brief sense of the internal
dynamic power of RICA based architectures, which is important for
demonstrating the low power nature of RICA paradigm.
Table 7.3 shows the estimated power and energy consumption of different
modules in JPEG2000 on the proposed architecture. It is worth noticing that
for the overall architecture, the total computational resource occupation is
equal to that of the most computationally intensive module (CM), while those
redundant ICs can be reconfigured and bypassed when processing other
modules such as 2-D DWT and MR.

Table 7.3 Power and Energy Dissipation of the JPEG2000 Encoder Sub-modules on
the Proposed Architecture
(a) Power Calculation
Cell                    Area (um2)   DWT 5/3   DWT 9/7   CM     MR
ADD                     3458.16      26        33        45     6
COMP                    3458.16      3         3         235    9
LOGIC                   910.44       0         16        291    6
SHIFT                   5865.12      8         28        61     7
REG                     260          56        132       520    49
SBUF                    9375         0         0         27     0
JUMP                    886          1         1         1      1
SBOX (including MUX)    4040         38        80        660    29

                           DWT 5/3    DWT 9/7    CM        MR
Total area (um2)           316173.6   661690.2   4646606   229176.9
Internal power (mW/MHz)    0.263      0.551      3.869     0.191
Average frequency (MHz)    150        150        69        144
Power (mW)                 39.45      82.65      266.96    27.5
Gate density = 694 KGate/mm2
Internal gate-level power dissipation = 1.2 nW/MHz/Gate

(b) Energy Dissipation
Energy       64x64             32x32                      16x16
(mJ)         level 1  level 2  level 1  level 2  level 3  level 1  level 2  level 3  level 4
DWT 5/3      0.0327   0.041    0.0347   0.0435   0.0456   0.0386   0.0483   0.0507   0.0513
DWT 9/7      0.074    0.092    0.083    0.103    0.109    0.101    0.126    0.133    0.134
CM           2.10     2.10     2.277    2.277    2.277    2.634    2.633    2.633    2.633
MR           0.288    0.288    0.289    0.289    0.289    0.291    0.291    0.291    0.291
AE (5/3)     0.614    0.569    0.593    0.569    0.569    0.592    0.568    0.569    0.57
AE (9/7)     0.628    0.606    0.628    0.607    0.606    0.629    0.608    0.606    0.608
Total (5/3)  3.035    2.998    3.194    3.178    3.180    3.555    3.540    3.544    3.545
Total (9/7)  3.09     3.086    3.277    3.276    3.281    3.655    3.658    3.663    3.666

The power dissipation of the embedded ARM core varies considerably depending
on process, libraries and optimisations. In
this work, the power dissipation of 90nm technology based ARM946E-S is
chosen, which is 0.095mW/MHz [97].
7.4.3. Performance Comparisons
Performance comparisons are made between the proposed architecture and
various DSP & VLIW and coarse-grained reconfigurable architecture based
implementations such as ARM920T (400MHz) [52], STMicroelectronics
LX-ST230 (400MHz) [52], NEC Dynamically Reconfigurable Processor (NEC
DRP, 150nm, 45.7MHz) [54], Philips TriMedia 1300 (250nm, 143MHz) [42],
TI TMS320C6416T (90nm, 600MHz) [45], TI TMS320C6455 (90nm, 1GHz)
[46], and ADI BLACKFIN ADSP-BF561 (130nm, 600MHz) [50]. The standard
Lena image is utilised as the benchmark in most of the references. For those
references using different size of test images, their performance is scaled to
meet
the
256x256
test
image
for
fair
comparisons.
For
those
implementations with 24-bit RGB test images such as [52], it is assumed that
different colour components are processed in serial so only a third of the
execution time is taken for comparison (the actual time consumption is
usually higher). Performance comparisons are made separately according to
the different 2-D DWT levels in the references. For those references not
mentioning their DWT levels, it is assumed that they are all under 1-level 2-D
DWT.
Table 7.4 lists the execution time comparisons. It is observed that the proposed architecture for JPEG2000 demonstrates considerably higher throughput for both DWT and EBCOT compared with the other solutions. These modules mainly benefit from the high levels of DLP and ILP offered by the RICA paradigm and from the pipelined kernels. Meanwhile, the AE execution time is successfully eliminated by the Ping-Pong memory switching mode, which improves the overall system throughput significantly. The performance of the DSP&VLIW based solutions is curbed mainly by their limited parallelism at both the instruction and the pixel level.
Table 7.4 Execution Time Comparisons

Category / Implementation                                 Execution time (ms)
                                                          DWT      EBCOT     Total
DWT 5/3, 1-level
  Proposed                                                0.83     21.27     22.1
  TI TMS320C6416T [45]                                    10       64.63     74.63
DWT 5/3, 4-level
  Proposed                                                1.3      23.82     25.12
  TI TMS320C6455 [46]                                     4.5      40.75     45.25
  ARM920T [52]                                            -        -         412.2
  STMicroelectronics LX-ST230 [52]                        -        -         85.3
DWT 9/7, 1-level
  Proposed                                                0.89     21.07     21.96
  ADI BLACKFIN ADSP-BF561 [50]                            -        -         53
DWT 9/7, 3-level
  Proposed (CM only)                                      -        8.53      -
  Philips TriMedia TM1300 (CM only) [42]                  -        10.26     -
Other
  Proposed (full CM and AE for processing the
  same amount of data)                                    -        -         0.094
  NEC DRP (significant pass in CM and AE only; the
  significant pass processes 256 16-bit samples,
  AE processes 1023 CX/D pairs) [54]                      -        -         0.213
Meanwhile, the proposed architecture outperforms the NEC DRP by offering a relatively higher average working frequency and a more straightforward hardware structure, owing to the dynamic reconfigurability and heterogeneous nature of the RICA paradigm. On the other hand, the proposed architecture still has some limitations. The MR module, introduced to split the CX/D pairs from the codewords generated by CM, curbs the overall throughput. The two kernels for CM are also large, due to the extra COMP and MUX cells utilised to construct the kernels, and the pipeline depth of these two kernels is limited by the number of registers supported by the current RICA tool flow. In contrast, DSP and VLIW based solutions benefit from high clock frequencies and suffer only minor effects from conditional branches and memory operations.
Table 7.5 Energy Dissipation Comparisons

                                            Energy (mJ)
                                            DWT      EBCOT    Total
Proposed (CM only)                          -        2.277    -
Philips TriMedia 1300 (CM only) [42]        -        26.6     -
Proposed                                    0.074    3.016    3.132
TI TMS320C6416T [45]                        2.38     15.12    17.5
Meanwhile, the throughput of the proposed architecture is lower than that of some ASIC and FPGA based solutions, such as the ADV212 [30], the Barco BA110 [31], the JPEG2K-E core [35] and the Virtex-II based solutions in [36] and [37]. ASIC and FPGA based solutions clearly benefit from specially designed hardware circuits and flexible branch/memory operations. Even with these drawbacks, the proposed architecture still proves to be a promising solution for JPEG2000 by providing good throughput, high flexibility and low energy consumption at the same time.
Since the power consumption results discussed in Section 7.4.2 are rough estimates, only a few simple comparisons are made between the proposed architecture and some of the references, in Table 7.5, in order to demonstrate the power-saving nature of the proposed architecture. Since it is difficult to tell exactly how far the estimated internal dynamic power dissipation of the RICA based applications is from the actual value, only the internal dynamic core power dissipation of the reference DSPs is considered here, for the closest-to-fair comparison. The energy consumption of the DSPs is estimated from the cycle count (or the execution time and the working frequency) and the internal dynamic power of their platforms. For example, the scaled cycle count for EBCOT in [45] is around 3.87x10^7, and the typical internal CPU dynamic power of the TI C6416T DSP at a 600MHz clock frequency is 0.39 mW/MHz [98]. In this case, the energy consumption for EBCOT in [45] is estimated as 15.12 mJ.
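Written out, this reference estimate is simply the internal dynamic power multiplied by the execution time:

\begin{aligned}
P &= 0.39\ \text{mW/MHz} \times 600\ \text{MHz} = 234\ \text{mW} \\
t &= \frac{3.87 \times 10^{7}\ \text{cycles}}{600\ \text{MHz}} \approx 64.5\ \text{ms} \\
E &= P \times t \approx 15.1\ \text{mJ}
\end{aligned}

which matches, up to rounding of the scaled cycle count, the 15.12 mJ figure quoted above.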
It is clear that the proposed architecture shows its advantages in terms of power consumption for both the DWT and EBCOT modules. The DWT module, which has a rather small area and a short execution time, demonstrates an outstanding low-power characteristic. On the other hand, although the CM module consumes the most computational resources, it has the lowest kernel active frequency, leading to relatively low energy consumption. Overall, the performance of the proposed architecture is enhanced by the power-saving nature of both the RICA paradigm and the ARM core.
7.5. Future Improvements
Based on the above system analysis and comparisons, the advantages and limitations of the proposed architecture for JPEG2000 can be summarised as follows:
Advantages:
• The DWT module has high throughput as a result of pipelined kernels and parallel processing. The modified DWT scanning pattern further improves the overall system efficiency.
• The CM module is efficiently implemented thanks to the optimised implementation of the four primitive coding schemes, properly balanced kernels and the stripe-column level parallel processing provided by PPA.
• The Ping-Pong memory switching mode eliminates the AE execution time and improves the throughput dramatically.
• The proposed architecture, based on the RICA paradigm and an embedded ARM core, offers outstanding parallelism and power-saving features compared with traditional DSPs.
Limitations:
• Due to the restrictions on conditional branches and memory accesses, the overall system throughput is curbed by the MR module.
• The pipeline depth of the CM module is limited by the number of registers supported by the current RICA tool flow.
Fortunately, the latest RICA tool flow under development is likely to overcome these limitations. It increases the memory access limit per step from 4 to 14 and supports more than ten thousand registers as well as other computational resources. In this case, the MR module can be simplified and a much deeper pipeline for the CM module can be established, both of which will shorten the overall execution time. Furthermore, it is possible to employ three pairs of CM and MR modules to process the LH, HL and HH subbands respectively (see Table 7.6 below).
Table 7.6 Future Throughput Improvement

5/3           64x64               32x32                          16x16
Time (ms)     level 1   level 2   level 1   level 2   level 3    level 1   level 2   level 3   level 4
DWT           0.83      1.04      0.88      1.1       1.15       0.98      1.22      1.28      1.3
CM            2.95      2.95      2.93      2.93      2.93       3.31      3.31      3.31      3.31
MR            3.93      3.93      3.61      3.61      3.61       3.56      3.56      3.56      3.56
AE            12.94     11.99     12.48     11.98     11.9       12.46     11.96     11.9      11.9
Total         12.94     11.99     12.48     11.98     11.9       12.46     11.96     11.9      11.9
Current       19.18     19.4      19.92     20.14     20.19      21.44     21.69     21.75     21.77

9/7           64x64               32x32                          16x16
Time (ms)     level 1   level 2   level 1   level 2   level 3    level 1   level 2   level 3   level 4
DWT           0.89      1.12      1.003     1.25      1.32       1.22      1.53      1.61      1.63
CM            2.95      2.95      2.93      2.93      2.93       3.31      3.31      3.31      3.31
MR            3.93      3.93      3.61      3.61      3.61       3.56      3.56      3.56      3.56
AE            13.22     12.77     13.23     12.78     12.76      13.25     12.8      12.77     12.77
Total         13.22     12.77     13.23     12.78     12.76      13.25     12.8      12.77     12.77
Current       19.24     19.48     20.04     20.29     20.36      21.68     22        22.08     22.1
These three pairs can work simultaneously, and the system architecture can be further optimised so that the overall throughput improves significantly. Based on theoretical calculation, the DWT, CM and MR modules can then be processed simultaneously with AE. The theoretically calculated potential throughput improvement is demonstrated in Table 7.6.
7.6. Conclusion
Based on all the previous chapters, this chapter presents the system level
JPEG2000 encoder integration on the proposed customised coarse-grained
dynamically reconfigurable architecture. In Section 7.2, a modified scanning
pattern for 2-D DWT is proposed. This modified scanning pattern is block-based, and an area of four codeblocks is taken as the processing unit each time. With this method, four codeblocks belonging to different DWT subbands
are generated simultaneously, leading to reduction in both execution time
and required intermediate data storage.
Section 7.3 presents CM and AE integration for EBCOT. Since an ARM core
has been selected to implement AE, the proposed architecture consisting of
RICA based architecture and embedded ARM core is introduced. A shared
DPRAM is utilised as the communication channel between the two
parts in the proposed architecture. A memory relocation module is developed
and placed between CM and AE in order to derive information from the
codeword generated by CM.
Communication between RICA based architecture and ARM core is directly
realised through communication variables located at specified memory
addresses. A three-stage pipeline is established based on the proposed
communication scheme, by which CM and AE can be executed in parallel.
Based on the evaluation of system processing time, it is found that MR
consumes approximately 33% of the overall execution time and becomes the
system bottleneck. In this case, a Ping-Pong memory switching scheme is
developed, with which CM and MR can be executed at the same time as AE, leading to a further reduction in execution time.
Section 7.4 presents the system performance analysis and comparisons, targeting both throughput and power dissipation. The execution times of the different modules, including 2-D DWT, CM, MR and AE, are listed and compared under various pre-conditions. The power estimation method for the RICA based architecture is presented, which is based on the numbers of occupied ICs and the average kernel frequency. Performance comparisons are made between the proposed architecture and various DSP&VLIW and coarse-grained architectures. It is seen that the proposed architecture offers a significant advantage in throughput. Meanwhile, although the power estimation method only provides rough estimates, the proposed architecture still demonstrates its power-saving nature clearly.
Due to the inherent restrictions of the RICA paradigm and the available tool flow, the proposed architecture still has some limitations. In Section 7.5, the advantages and limitations of the proposed architecture are summarised and some possible future improvements are presented. These improvements are feasible with the latest tool flow associated with the RICA paradigm, which is under development. Based on these possible improvements, the proposed architecture's potential performance is calculated theoretically.
Chapter 8
Conclusions
8.1. Introduction
This chapter concludes this thesis. In Section 8.2, the contents of individual
chapters are reviewed. Section 8.3 lists some specific conclusions that can
be drawn from the research work in this thesis. Finally in Section 8.4, some
possible directions for future work are addressed.
8.2. Review of Thesis Contents
Chapter 2 provided the background knowledge and detailed algorithms of digital image processing technologies, especially demosaicing and JPEG2000. This chapter also gave a review of the existing literature and the various research works related to this thesis.
Chapter 3 described the newly emerging RICA paradigm, including its structure, its associated software tool flow and its possible optimisation approaches. Two of the author's initial works were presented as case studies, both of which demonstrated that RICA has great potential in terms of throughput, flexibility and power consumption when targeting different kinds of applications.
In Chapter 4, a Freeman demosaicing engine on a RICA based architecture was proposed. The shifting-window based demosaicing engine was optimised by a data buffer rotating scheme, parallel processing and a pseudo median filter. An investigation of mapping the engine onto a dual-core RICA based architecture was also performed. The simulation results demonstrate that the proposed demosaicing engine can provide 502 fps and 862 fps when processing a 648x432 image for the single-core and dual-core implementations respectively.
Chapter 5 presented a lifting-based 2-D DWT engine for JPEG2000 on the RICA architecture. The 2-D DWT engine was optimised by Hardwired Floating Coefficient Multipliers and the SIMD based VO technique. The positives and negatives of the VO technique were discussed in detail in terms of both throughput and area occupation. The proposed 2-D DWT engine can reach up to 103.1 fps for a 1024x1024 image.
Chapter 6 presented a JPEG2000 EBCOT implementation on the RICA based architecture. A novel PPA algorithm for CM was proposed, with the four primitive coding schemes optimised for the RICA based implementation. An ARM core was employed to implement AE instead of the RICA based architecture. Simulation results demonstrate that the proposed PPA algorithm provided better throughput than other popular algorithms, and that the ARM based AE implementation showed good throughput.
Chapter 7 presented the system-level implementation of the JPEG2000 encoder on the RICA based architecture. A block based 2-D DWT scanning pattern was proposed. A memory relocation module was designed and placed between CM and AE. The CX/D pairs relocated by MR were placed in a shared DPRAM, which could be modified to be either a 4-port RAM or a DPRAM with doubled capacity if the Ping-Pong memory switching mode is selected. Performance evaluations covered execution time and estimated power consumption. The proposed JPEG2000 architecture provided outstanding performance in terms of both throughput and energy dissipation compared with various DSP&VLIW and CGRA based JPEG2000 solutions.
8.3. Novel Outcomes of the Research
This section presents a variety of novel outcomes, which stem from the
research in this thesis.
Most academic and industry efforts on digital image processing solutions have focused on using traditional platforms such as ASIC, FPGA and DSP. This thesis investigated novel coarse-grained dynamically reconfigurable architectures based on the newly emerging RICA paradigm, an area that has as yet been little explored. The results in Chapters 4, 5, 6 and 7 showed that, based on the proposed architecture, different digital image processing tasks can deliver high performance in terms of both throughput and energy dissipation. It is therefore very promising to utilise RICA paradigm based architectures in future high performance systems designed for digital image processing applications like JPEG2000. Moreover, since these image processing tasks cover algorithms of different natures, it is possible to predict a new algorithm's performance on a RICA based architecture by comparing and matching the new algorithm to some of these imaging tasks.
In Chapter 4, the customisable nature of RICA paradigm enabled the
Freeman demosaicing engine to use data buffers to store intermediate data
instead of traditional memory blocks. An important outcome was the
proposed parallel demosaicing engine based on investigation of the hidden
parallelism in the algorithm. Since RICA paradigm supports independent
instructions being executed in parallel, kernelisation of the complete
demosaicing engine significantly accelerated the processing speed. When
dealing with the median filter including sorting operations, the pseudo median
filter was considered to be a reasonable solution for RICA based applications
as it would not introduce conditional branches which would break the kernel.
Moreover, mapping the demosaicing engine onto a dual-core RICA based
architecture demonstrated the potential of building up a multi-core RICA
based architecture for complicated applications.
In Chapter 5, the RICA based architecture provided a good solution for 2-D DWT tasks. Again, due to the inherently tailorable nature of the RICA paradigm, the two DWT modes in JPEG2000 could be implemented with a generic architecture. Different from traditional DSPs, the CSD based FCMs in the 9/7 mode could be efficiently implemented on the RICA based architecture since the additions and shifting operations can be executed in parallel. The most important outcome is that the SIMD based VO technique can be employed to improve the computational resource utilisation. Simulation results demonstrate that the VO technique successfully improved the Throughput/Area ratio for the 9/7 DWT mode. More generally, the positives and negatives introduced by the VO technique were clearly identified, which allows developers to choose the suitable solution for different tasks according to the nature of the algorithm and the application requirements.
Chapter 6 presented the implementation of EBCOT, which is the most challenging module in JPEG2000. When looking into the EBCOT CM algorithm and the existing solutions, it was found that the current solutions were not suitable for a RICA based architecture, as they require either frequent conditional branches or massive computational resources. For this reason, the novel PPA solution for CM in EBCOT was developed specifically for RICA based applications. The computational resources required by the proposed PPA solution were almost half those of the traditional PPCM method, while the processing speed of PPA was actually higher than that of PPCM when mapped onto the RICA architecture. On the other hand, simulation results demonstrated that the RICA based architecture is not a good solution for AE, since the frequent branches strictly limit the performance; this is the reason why an embedded ARM core was selected for the AE implementation. In conclusion, a RICA based architecture can provide good performance for computationally intensive applications with inherent parallelism, but may not be suitable for some simple but branch-intensive applications.
In Chapter 7, the system-level integration of the JPEG2000 encoder was presented, based on all the discussion and results in the previous chapters. Since the 2-D DWT is line-based while EBCOT is codeblock-based, the novel 4-codeblock based 2-D DWT scanning pattern was considered to be an efficient solution for JPEG2000, as the line delay between 2-D DWT and EBCOT was eliminated. Meanwhile, the shared DPRAM provided a simple communication method between RICA and ARM compared with using data buses and DMA. Moreover, the Ping-Pong memory switching mode enabled a deeper pipeline between the different modules, which is essential for reducing the overall execution time. Simulation results proved that the proposed architecture for JPEG2000 demonstrated outstanding performance compared with various DSP&VLIW and CGRA based solutions. In addition, the power estimation method for RICA based architectures was introduced, which provides a possible approach for energy dissipation analysis.
Based on the work in this thesis, it is concluded that the RICA paradigm can provide good solutions for image processing applications. In this thesis, the RICA paradigm's potential and advantages for different imaging tasks were thoroughly investigated and evaluated. Various optimisation approaches for RICA based applications, including customisation, kernel construction, utilisation of the VO technique, parallel processing and hybrid architecture development, were applied and discussed. It was also demonstrated that it is possible to build up a multi-core RICA based architecture for complex applications. Meanwhile, performance comparisons of different imaging tasks between the RICA based architecture and other platforms were evaluated in detail. With this presented work, other developers can evaluate a given algorithm to estimate its performance on a RICA based architecture and decide on possible optimisation approaches. They can also obtain an initial idea of the advantages and disadvantages of their RICA based work compared with other solutions such as ASIC, FPGA and DSP&VLIWs.
8.4. Future Work
There are a few areas in which the work in this thesis can be further
investigated. Some are listed below.
• Short-term work
    – For the Freeman demosaicing engine, it is possible to utilise the VO technique to further optimise the median filter module. As discussed in Chapter 4, the utilisation of VO will reduce the number of seeking operations in the median filter. Although additional logic resources and a more complex control scheme would be required, the utilisation of the VO technique is still a worthwhile approach to further optimise the demosaicing engine.
    – When processing different images, the possible bit-depth increment of the 2-D DWT coefficients should be taken into account. With the current Lena image, the bit depth used for CM is sufficient. However, there might be an increment of 1 or 2 bits for other images, especially when more than 3 levels of 2-D DWT are applied. In this case, the number of bit-level iterations in CM also needs to be increased.
    – For the 2-D DWT engine, the possible artifacts introduced by the block based scanning pattern should be considered. Although some papers present similar scanning patterns [99-100], they do not mention the possible artifacts introduced by modified scanning patterns, and this side effect is worth evaluating, especially when the engine is employed within a complete JPEG2000 encoder.
• Long-term work
    – A full tool set enabling joint debugging and testing of the RICA based architecture and the embedded ARM core should be developed. Since there are only separate tool flows for simulating RICA and ARM based applications respectively, all the data communication between the RICA based architecture and the ARM core in this thesis was carried out manually, which requires massive labour and is extremely time consuming. With a full tool set, it becomes possible to perform joint debugging and testing of the complete system, and the rate-distortion control module and PSNR calculation can also be implemented and carried out.
    – Once the full tool set is ready, more test images should be processed to assess the general performance of the proposed architecture. Although 2-D DWT, CM and MR are not data sensitive (their processing time depends only on the amount of data), different images do lead to different AE execution times. Meanwhile, the possible bit-depth increment mentioned previously will also lead to more CM execution time.
    – The power dissipation of the proposed architecture should be evaluated in detail. Although the power estimation method presented in Chapter 7 provides a possible approach to estimating the internal dynamic power, it only gives a rough estimate that may differ from reality. Meanwhile, the shared DPRAM may contribute a significant part of the overall power consumption, and the memory controllers on both the RICA and ARM sides also need to be considered.
    – A multi-core solution can be considered for the complete JPEG2000 encoder implementation. Instead of extending the number of CM and MR pairs to three on a large, single-core RICA architecture as discussed in Chapter 7, it is possible to employ three RICA cores to realise the three pairs of CM and MR, with the support of the MRPSIM tool introduced in Chapter 4, in order to encode DWT coefficients belonging to different subbands simultaneously. Moreover, more ARM cores can also be embedded into the system, making it possible to have three AEs running in parallel. The multi-core architecture is expected to provide much higher throughput compared with the current single-core implementation. Obviously, the energy dissipation will increase at the same time.
Appendix
JPEG2000 Encoding Standard
Tiling and DC Level Shifting
The first preprocessing step in JPEG2000 standard is tiling, which partitions
the original image into a number of rectangular non-overlapping blocks,
termed tiles. Each tile has the exact same colour components as the original
image. Tile sizes can be arbitrary and up to the size of the entire original
image. Generally, a large tile offers better visual quality to the reconstructed
image and the best case is to treat the entire image as one single tile (no
tiling). However, a large tile also requires more memory space for
processing. Typically, tiles with the size of 256x256 or 512x512 are
considered to be popular choices for various implementations based on the
evaluation of cost, area and power consumption [27].
Originally, pixels in the input image are stored in the form of unsigned integers. For the purpose of mathematical computation, DC level shifting is applied to ensure that each pixel has a dynamic range approximately centred around zero. All pixels I_i(x,y) are DC level shifted by subtracting the same quantity 2^{s-1} to produce the DC level shifted samples I'_i(x,y) as follows [27]:

I'_i(x, y) = I_i(x, y) - 2^{s-1}    (9.1)

where s is the precision (bit depth) of the pixels.
Component Transformation
Component transformation is effective in reducing correlations amongst multiple components in the image. Normally, the input image is considered to
have three colour planes (R, G, B). JPEG2000 standard supports two
different transformations: (1) Reversible Colour Transformation (RCT) and (2)
Irreversible Colour Transformation (ICT). RCT can be applied to both
lossless and lossy compression, while ICT can only be used in the lossy
scheme [27].
In the lossless mode with RCT, pixels can be exactly reconstructed by
inverse RCT. The forward and inverse transformations are given by:
Forward RCT:

Y_r = \lfloor (R + 2G + B)/4 \rfloor    (9.2)

U_r = B - G    (9.3)

V_r = R - G    (9.4)

Inverse RCT:

G = Y_r - \lfloor (U_r + V_r)/4 \rfloor    (9.5)

R = V_r + G    (9.6)

B = U_r + G    (9.7)
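As an illustration of equations (9.2) to (9.7), the following C sketch implements the forward and inverse RCT for a single pixel. The helper floor_div, the struct and all names are illustrative; the floor division is spelled out because the square brackets in the equations denote the floor operation, which differs from C's truncating integer division for negative operands.
…………………………………………………………………………………………......
/* Minimal sketch of the RCT of equations (9.2)-(9.7); names are illustrative. */
static int floor_div(int a, int b)
{
    int q = a / b;
    return (a % b != 0 && ((a < 0) != (b < 0))) ? q - 1 : q;   /* round toward -infinity */
}

typedef struct { int yr, ur, vr; } rct_t;

static rct_t rct_forward(int r, int g, int b)
{
    rct_t c;
    c.yr = floor_div(r + 2 * g + b, 4);    /* (9.2) */
    c.ur = b - g;                          /* (9.3) */
    c.vr = r - g;                          /* (9.4) */
    return c;
}

static void rct_inverse(rct_t c, int *r, int *g, int *b)
{
    *g = c.yr - floor_div(c.ur + c.vr, 4); /* (9.5) */
    *r = c.vr + *g;                        /* (9.6) */
    *b = c.ur + *g;                        /* (9.7) */
}
…………………………………………………………………………………………......
Because only integer additions, subtractions and floor divisions are involved, applying rct_inverse() to the output of rct_forward() returns the original R, G and B values exactly, which is what makes RCT usable in the lossless path.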
ICT is only applied for lossy compression because of the error introduced by using non-integer coefficients as weighting parameters in the transformation matrix [27]. Different from RCT, ICT uses YCrCb instead of YUV, in which Y is the luminance channel while Cr and Cb are the two chrominance channels. The transformation formulas are given by:

Forward ICT:

\begin{bmatrix} Y \\ C_r \\ C_b \end{bmatrix} =
\begin{bmatrix} 0.299000 & 0.587000 & 0.114000 \\ 0.500000 & -0.418688 & -0.081312 \\ -0.168736 & -0.331264 & 0.500000 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}    (9.8)

Inverse ICT:

\begin{bmatrix} R \\ G \\ B \end{bmatrix} =
\begin{bmatrix} 1.0 & 0.0 & 1.402000 \\ 1.0 & -0.344136 & -0.714136 \\ 1.0 & 1.772000 & 0.0 \end{bmatrix}
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix}    (9.9)
2-Dimensional Discrete Wavelet Transform
DWT is one of the key differences between JPEG2000 and the previous JPEG standard. It is the first decorrelation step in the JPEG2000 standard and it decomposes a tile into a number of subbands at different resolution levels with both frequency and time information. Basically, wavelets are functions generated from one single function, termed the mother wavelet, by scaling and shifting in the time and frequency domains. If the mother wavelet is denoted by \psi(t), the other wavelets \psi_{a,b}(t) can be represented as

\psi_{a,b}(t) = \frac{1}{\sqrt{a}} \psi\!\left(\frac{t-b}{a}\right)    (9.10)

where a is the scaling factor and b represents the shifting parameter.
Based on this definition of wavelets, the wavelet transform of a function f(t) can be mathematically represented by

W(a,b) = \int_{-\infty}^{\infty} \psi_{a,b}(t)\, f(t)\, dt    (9.11)

When targeting discrete signals, the DWT can be viewed as convolving the input discrete signal with two filter banks, one low-pass and one high-pass. The two output streams are then down-sampled by a factor of 2. The transforms are given by [27]

W_L(n) = \sum_{i=0}^{\tau_L - 1} h(i)\, f(2n - i)    (9.12)

W_H(n) = \sum_{i=0}^{\tau_H - 1} g(i)\, f(2n - i)    (9.13)

where \tau_L and \tau_H are the numbers of taps of the low-pass (h) and high-pass (g) filters.
After the transform, the original input signal is decomposed into two
subbands: lower band and higher band. Practically, the lower band can be
further decomposed for different resolutions. The architecture is illustrated in
Figure 9.1.
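The filter-bank view of equations (9.12) and (9.13) can be sketched directly in C as a convolution followed by keeping every second output sample. The clamp-based boundary handling below is a simplification (JPEG2000 itself uses symmetric extension), and all names are illustrative.
…………………………………………………………………………………………......
/* Minimal sketch of equations (9.12)-(9.13): convolve the input with the
 * low-pass filter h and the high-pass filter g, then keep every second output.
 * Out-of-range indices are simply clamped here for brevity. */
static int clamp(int i, int n) { return i < 0 ? 0 : (i >= n ? n - 1 : i); }

void dwt1d_analysis(const double *f, int n,
                    const double *h, int taps_l,
                    const double *g, int taps_h,
                    double *wl, double *wh)          /* each of length n/2 */
{
    for (int k = 0; k < n / 2; k++) {
        double lo = 0.0, hi = 0.0;
        for (int i = 0; i < taps_l; i++) lo += h[i] * f[clamp(2 * k - i, n)];
        for (int i = 0; i < taps_h; i++) hi += g[i] * f[clamp(2 * k - i, n)];
        wl[k] = lo;   /* W_L(k): low band  */
        wh[k] = hi;   /* W_H(k): high band */
    }
}
…………………………………………………………………………………………......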
135
Appendix
LPF
L
2
X(z)
HPF
LPF
2
LL
HPF
2
LH
H
2
Figure 9.1 Discrete Wavelet Transform
Figure 9.2 Multi-level 2-Dimensional DWT
For digital image processing, it is essential to have a 2-dimensional DWT to perform the transformation of a 2-D image. The approach for 2-D DWT is to apply a 1-D DWT in the horizontal direction first, followed by another 1-D DWT along the vertical direction. After a 2-D transform, four subbands are generated, namely LL, LH, HL and HH. LL is a coarser version of the original input image, while LH, HL and HH are high-frequency subbands containing the detail information [27]. Normally, the LL subband can be recursively decomposed further by higher-level 2-D DWT in order to obtain new subbands with multiple resolutions, such as the LL2, LH2, HL2 and HH2 subbands illustrated in Figure 9.2.
Traditionally, DWT is implemented by convolution with FIR filter banks. These approaches may require a large amount of computational resources and memory storage, which should be avoided in embedded system applications. To solve this problem, a modified DWT architecture, termed the lifting-based architecture, is proposed in [25-26].

Figure 9.3 Lifting-Based DWT

The main idea of the lifting-based DWT architecture is to break up both the high-pass and low-pass wavelet filters into a sequence of smaller filters that in turn can be converted into a sequence of upper and lower triangular matrices, reducing the DWT to banded-matrix multiplications [27]. Figure 9.3 illustrates the lifting-based DWT architecture, where S_m(z) and T_m(z) are filter matrices and K is a constant. The polyphase matrix of the lifting-based architecture can be realised as

P(z) = \prod_m \begin{bmatrix} 1 & T_m(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ S_m(z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}    (9.14)
For the JPEG2000 standard, there are two default wavelet filter schemes, corresponding to the lossless and lossy modes respectively. In lossless mode, the Le Gall (5,3) spline filter is adopted, which is formed by a 5-tap low-pass FIR filter and a 3-tap high-pass FIR filter. The corresponding polyphase matrix for lifting-based DWT is given by

P_{(5,3)}(z) = \begin{bmatrix} 1 & (1+z)/4 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -(1+z^{-1})/2 & 1 \end{bmatrix}    (9.15)
In lossy mode, the Daubechies (9,7) biorthogonal spline filter is employed, which includes a 9-tap low-pass FIR filter and a 7-tap high-pass FIR filter. The corresponding polyphase matrix can be represented as

P_{(9,7)}(z) = \begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}    (9.16)

where \alpha = -1.586134342, \beta = -0.052980118, \gamma = 0.882911075, \delta = 0.443506852 and K = 1.230174105.
A detailed explanation of lifting-based DWT can be found in [5, 27].
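For the 5/3 filter, the two factors of equation (9.15) correspond to one predict step and one update step. The C sketch below shows the reversible integer form of these two steps for one row, assuming an even-length signal and symmetric (mirrored) boundaries; fdiv implements the floor division used by the reversible transform, and all names are illustrative.
…………………………………………………………………………………………......
/* Minimal sketch of a 1-D reversible 5/3 lifting step.  The predict step
 * corresponds to -(1+z^-1)/2 and the update step to (1+z)/4 in equation
 * (9.15); the extra floors and the +2 offset are the integer rounding used by
 * the reversible (lossless) transform. */
static int fdiv(int a, int d) { return (a >= 0) ? a / d : -((-a + d - 1) / d); }

void dwt53_forward_1d(const int *x, int n, int *low, int *high)
{
    int half = n / 2;
    for (int i = 0; i < half; i++) {                          /* predict: high band */
        int xl = x[2 * i];
        int xr = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];   /* mirror boundary    */
        high[i] = x[2 * i + 1] - fdiv(xl + xr, 2);
    }
    for (int i = 0; i < half; i++) {                          /* update: low band   */
        int dl = (i > 0) ? high[i - 1] : high[0];             /* mirror boundary    */
        int dr = high[i];
        low[i] = x[2 * i] + fdiv(dl + dr + 2, 4);
    }
}
…………………………………………………………………………………………......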
Quantisation
Quantisation of the DWT coefficients is one of the main sources of information loss in a JPEG2000 encoder. In lossy compression mode, all the DWT subbands are quantised in order to reduce their precision and thereby aid compression [27]. The quantisation is performed by uniform scalar quantisation with a dead-zone around the origin. As illustrated in Figure 9.4, the step size of the dead-zone scalar quantiser is \Delta_b and the width of the dead-zone is 2\Delta_b. The formula of uniform scalar quantisation with a dead-zone is given by

q_b(i,j) = \mathrm{sign}(y_b(i,j)) \left\lfloor \frac{|y_b(i,j)|}{\Delta_b} \right\rfloor    (9.17)

where y_b(i,j) is the DWT coefficient in subband b and \Delta_b is the quantisation step size for subband b. After quantisation, all quantised DWT coefficients are signed integers and are converted into sign-magnitude representation prior to entropy coding [27].
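Equation (9.17) amounts to a one-line function; the sketch below is a direct transcription, with illustrative names.
…………………………………………………………………………………………......
#include <math.h>

/* Minimal sketch of equation (9.17): uniform scalar quantisation with a
 * dead-zone of width 2*delta_b around the origin. */
int deadzone_quantise(double y, double delta_b)
{
    int sign = (y < 0.0) ? -1 : 1;
    return sign * (int)floor(fabs(y) / delta_b);
}
…………………………………………………………………………………………......
With delta_b = 2, for example, every coefficient in the open interval (-2, 2) is mapped to zero, which is exactly the dead-zone of width 2\Delta_b.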
Figure 9.4 Dead-Zone Illustration of the Quantiser (step size \Delta_b, dead-zone of width 2\Delta_b around the origin)
Embedded Block Coding with Optimal Truncation
Physically, the quantised wavelet coefficients are compressed by the entropy encoder codeblock by codeblock within each subband [27]. The complete entropy encoding in the JPEG2000 standard can be divided into two coding steps: Tier-1 coding and Tier-2 coding. For Tier-1 coding, the EBCOT [5] algorithm is adopted, which is composed of fractional bit-plane coding (Context Modeling) and binary arithmetic coding (Arithmetic Encoding).

In Tier-1 coding, codeblocks are encoded separately at the bit level. Given that the precision of the quantised DWT coefficients is p, a codeblock is decomposed into p bit-planes which are then coded sequentially from the Most Significant Bit-plane (MSB) to the Least Significant Bit-plane (LSB). Each coefficient is divided into one sign bit and several magnitude bits. Context modeling is applied on each bit-plane of a codeblock to generate intermediate data in the form of pairs of Context and binary Decision (CX/D), while arithmetic encoding codes these CX/D pairs and generates the final compressed bit-stream.
Context Modeling
The EBCOT CM algorithm has been built to exploit symmetries and redundancies within and across bit-planes, so as to minimise both the statistics to be maintained and the coded bit-stream that it generates [27]. Before presenting a detailed illustration of the CM algorithm, several concepts need to be clarified:
• Sign Array (χ): χ is a two-dimensional array representing the signs of the DWT coefficients in a codeblock. Each element χ[m,n] in χ represents the sign information of the corresponding sample y[m,n] in the codeblock as follows:

\chi[m,n] = \begin{cases} 1 & \text{if } y[m,n] < 0 \\ 0 & \text{otherwise} \end{cases}    (9.18)
• Magnitude Array (ν): ν is a two-dimensional array consisting of unsigned integers. It has the same size as χ and as the corresponding codeblock. Each sample ν[m,n] in ν represents the absolute value of the corresponding DWT coefficient, which is given by

\nu^p[m,n] = |y^p[m,n]|    (9.19)

where p represents the pth bit-plane.
• Scanning Pattern: EBCOT has a fixed scanning pattern which is based on groups of four lines of coefficients, termed stripes. The scanning pattern within a stripe runs from top to bottom within a column and from left to right across columns, as illustrated in Figure 9.5 (a).
• Significant State (δ): In EBCOT, each DWT coefficient has a state variable termed the significant state (δ), which is initialised to "0"; the coefficient is considered insignificant at the beginning. During coding, this state variable indicates whether the first non-zero bit of the corresponding coefficient has been coded. If so, δ changes to "1" and keeps its value until coding finishes; the coefficient is then said to be significant. The procedure is illustrated in Figure 9.5 (b).
Figure 9.5 (a) Scanning Pattern of EBCOT (b) Significant State
Figure 9.6 Illustration of One Pixel's Neighbours (the 3x3 neighbourhood around the current sample X: D0 V0 D1 / H0 X H1 / D2 V1 D3)
• Neighbour Significant States: Most of the coding schemes in EBCOT use the samples around the current sample under processing, which are called its neighbours. In total there are eight neighbours for each sample, divided into three categories: horizontal neighbours (H), vertical neighbours (V) and diagonal neighbours (D), as illustrated in Figure 9.6.
• Refinement State (γ): This state indicates whether the current sample has already been coded by MRC. If so, the corresponding γ is set to "1", otherwise "0".
Based on these concepts, CM codes each codeblock stripe by stripe, from the MSB to the LSB. There are three coding passes in CM, and each bit is coded by exactly one of the three passes [27]. The three coding passes are as follows:
• Significant Propagation Pass (SPP): This coding pass is used to code insignificant bits with one or more significant neighbours.
• Magnitude Refinement Pass (MRP): This coding pass processes bits that are already significant.
• Clean Up Pass (CUP): Coefficient bits that have not been coded by SPP or MRP are coded by this coding pass.
These three coding passes are executed in order from SPP to CUP; a minimal sketch of the membership test is given below.
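The membership rules can be summarised in a small helper, sketched here: sigma, sig_neighbours and coded_by_spp correspond to the significance state, the number of already-significant neighbours and a per-bit-plane flag marking bits taken by SPP; the enum and all names are illustrative only.
…………………………………………………………………………………………......
/* Minimal sketch of how one magnitude bit is assigned to a coding pass,
 * following the pass definitions given above. */
typedef enum { PASS_SPP, PASS_MRP, PASS_CUP } coding_pass_t;

coding_pass_t select_pass(int sigma, int sig_neighbours, int coded_by_spp)
{
    if (!sigma && sig_neighbours > 0) return PASS_SPP;  /* insignificant, significant neighbour */
    if (sigma && !coded_by_spp)       return PASS_MRP;  /* already significant                  */
    return PASS_CUP;                                    /* everything else                      */
}
…………………………………………………………………………………………......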
There are four primitive coding schemes which are employed by these three
coding passes to generate coded CX/D pairs, which are defined as follows:
• Zero Coding (ZC): In zero coding, CX is generated from three pre-defined Look-Up Tables (LUTs) for the different DWT subbands (LL/LH, HL and HH). The outputs of these LUTs depend on the significant states of the neighbours of the current sample, as listed in Table 9.1 [5]. The decision bit is equal to the current magnitude bit of the coefficient being coded. This coding scheme is used in both SPP and CUP.
• Sign Coding (SC): This coding scheme is employed to code the sign bit of each DWT coefficient, and is executed only once, when the first "1" bit in the coefficient has been coded, that is, as soon as the coefficient becomes significant. Instead of directly using the H, V and D neighbours defined previously, SC generates CX from two new states, termed the horizontal contribution and the vertical contribution, which depend on the significance and sign states of the H and V neighbours. The decision bit in SC is determined by an XOR bit which is also generated from the H/V contributions. The referenced LUT and the formula for obtaining the decision bit [5] are given in Table 9.2 and equation 9.20. This coding scheme is used within SPP and CUP.

\mathrm{Decision} = \mathrm{signbit} \oplus \mathrm{xorbit}    (9.20)
Table 9.1 Contexts for the Zero Coding Scheme

LL and LH subbands       HL subband              HH subband
∑H    ∑V    ∑D           ∑H    ∑V    ∑D          ∑(H+V)  ∑D       CX
2     x     x            x     2     x           x       ≥3       8
1     ≥1    x            ≥1    1     x           ≥1      2        7
1     0     ≥1           0     1     ≥1          0       2        6
1     0     0            0     1     0           ≥2      1        5
0     2     x            2     0     x           1       1        4
0     1     x            1     0     x           0       1        3
0     0     ≥2           0     0     ≥2          ≥2      0        2
0     0     1            0     0     1           1       0        1
0     0     0            0     0     0           0       0        0
(x = don't care)
Table 9.2 H/V Contributions and Contexts in the Sign Coding Scheme

H0/V0                   H1/V1                   H/V contribution
Significant, positive   Significant, positive    1
Significant, negative   Significant, positive    0
Insignificant           Significant, positive    1
Significant, positive   Significant, negative    0
Significant, negative   Significant, negative   -1
Insignificant           Significant, negative   -1
Significant, positive   Insignificant            1
Significant, negative   Insignificant           -1
Insignificant           Insignificant            0

H contribution   V contribution   Context   XOR bit
 1                1               13        0
 1                0               12        0
 1               -1               11        0
 0                1               10        0
 0                0                9        0
 0               -1               10        1
-1                1               11        1
-1                0               12        1
-1               -1               13        1
• Magnitude Refinement Coding (MRC): This coding scheme is used in MRP. In MRC, the context is determined by whether the current bit is the first refinement bit, that is, whether this is the first time the corresponding coefficient is coded by MRP. The significant states of the current bit's eight neighbours are also taken into consideration. The referenced LUT is given in Table 9.3 [5]. The decision bit is simply equal to the magnitude bit.
Table 9.3 Contexts of the Magnitude Refinement Coding Scheme

∑H+∑V+∑D (summation of significant states)   First refinement bit   Context
x (don't care)                                No                     16
≥1                                            Yes                    15
0                                             Yes                    14
• Run Length Coding (RLC): This coding scheme is only used in CUP. It generates one or more CX/D pairs by coding from one to four consecutive bits within a stripe [27]. Generally, the number of CX/D pairs generated is determined by where the first "1" bit is located in the corresponding stripe column. Two contexts are adopted in RLC: 17 and 18. When all four bits in the stripe column are zero, a single CX/D pair (17,0) is generated. When there are one or more "1" bits, a CX/D pair (17,1) is first generated, indicating a non-zero stripe column, and then another two CX/D pairs (18, 0/1) and (18, 0/1) are produced, in which the two decision bits represent the location of the first "1" bit in this four-bit non-zero stripe column (a minimal sketch is given below).
For the complete CM coding process, the three coding passes are applied to each bit-plane of a codeblock from the MSB to the LSB. As the first bit-plane (MSB) has no significant coefficient at the beginning, only CUP is applied to it. After the first bit-plane is finished, the next bit-plane is processed and the three coding passes scan and code it in the order SPP, MRP and CUP, using the scanning pattern illustrated in Figure 9.5 (a). Figure 9.7 illustrates the detailed EBCOT working flowchart. When a stripe starts, SPP checks the significant states of the current bit itself as well as its eight neighbours, and codes the current bit by ZC only when it is insignificant but has one or more significant neighbours. After ZC, SC is also applied if needed. MRP is applied after SPP finishes the current bit-plane.
Figure 9.7 EBCOT Tier-1 Context Modeling Working Flowchart
MRP checks whether the bit itself is significant and has not been coded by SPP; if so, MRC is applied. When MRP finishes the entire bit-plane, CUP is called. It first checks whether there is any non-zero δ[m,n] in the current stripe column and its neighbours, and whether any coefficient has already been coded by SPP or MRP. If either is the case, CUP codes only those bits that have not been coded by SPP or MRP. If both are false, CUP applies RLC to this stripe column. RLC only runs at the beginning of a stripe column, when all four consecutive bits in the column as well as all of their adjacent neighbours are insignificant. Meanwhile, RLC terminates at the first "1" bit in the current stripe column, and the remaining bits in the stripe column are coded by ZC and SC. Being the final coding pass in CM, CUP continues coding the current bit-plane until its end; the CM coding engine then moves to the next bit-plane and starts with SPP again.
Arithmetic Encoder
CM in EBCOT provides a sequence of CX/D pairs as the input to the following Arithmetic Encoder (AE), also termed the MQ-coder. AE is context-based and adaptive, and has also been used in JBIG2 [5]. It employs a probability model with a More Probable Symbol (MPS) and a Less Probable Symbol (LPS). The basic idea of AE is to map each CX onto an MPS or an LPS together with its probability estimate. If the probability estimate of the LPS is Qe, then the probability estimate of the MPS is 1 - Qe. For an interval A, both subinterval sizes are scaled by the factor A. Normally in JPEG2000 the value of A is kept close to 1, so the subintervals of the LPS and MPS can be approximated by Qe and A - Qe respectively. Accordingly, the MPS and LPS are assigned the subintervals [0, A - Qe) and [A - Qe, A) respectively. During the coding process, the subintervals for both MPS and LPS are updated by adjusting the interval's upper and lower bounds. If the lower bound of the interval is denoted by C, the bound updating can be represented by

MPS:  C = C + Qe,  A = A - Qe
LPS:  C unchanged,  A = Qe
An important issue that may arise when performing this update is termed interval inversion [27]. It happens when the MPS subinterval becomes smaller than the LPS subinterval as a result of the bound updating, which means the LPS is actually occurring more frequently than the MPS. In this case, the two subintervals are inverted and reassigned in order to ensure that the subinterval for the LPS always stays smaller than that for the MPS. In the JPEG2000 standard, the actual value of A is always maintained within the range 0.75 <= A < 1.5. Whenever the value of A drops below 0.75 during the coding process, it is doubled to make sure A is greater than 0.75, which is termed renormalisation. The value of C is also doubled whenever renormalisation of A is performed, in order to keep the two synchronised [27].
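The interval arithmetic described above can be sketched as follows, working in the 16-bit fixed-point scale of Table 9.6 where 0x8000 represents 0.75. The sketch deliberately omits conditional exchange, carry handling, the probability-state update and byte output, so it illustrates only the bound updating and renormalisation; all names are illustrative.
…………………………………………………………………………………………......
#include <stdint.h>

/* Highly simplified sketch of the MQ interval update: A holds the current
 * interval, C its lower bound, and qe the LPS probability estimate. */
typedef struct { uint32_t a, c; } mq_state_t;

static void mq_renormalise(mq_state_t *s)
{
    while (s->a < 0x8000u) {        /* A dropped below 0.75: double A and C */
        s->a <<= 1;
        s->c <<= 1;
    }
}

void mq_code_symbol(mq_state_t *s, int is_mps, uint32_t qe)
{
    if (is_mps) {                   /* MPS: C = C + Qe, A = A - Qe */
        s->c += qe;
        s->a -= qe;
    } else {                        /* LPS: C unchanged, A = Qe    */
        s->a = qe;
    }
    mq_renormalise(s);
}
…………………………………………………………………………………………......
A coder state would be initialised as mq_state_t s = { 0x8000u, 0 }, matching the initialisation of the A and C registers described below.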
The probability value (Qe) and the probability estimation/mapping process are provided by the JPEG2000 standard as an LUT with four fields: Qe, Next MPS (NMPS), Next LPS (NLPS) and Switch, which are listed in Table 9.4. Another two LUTs are required to indicate the index and state associated with Table 9.4, which are also provided by the standard and termed I(CX) and MPS(CX), as listed in Table 9.5. Here, I(CX) is the corresponding index for the current CX, which is looked up and used as the index into Table 9.4. MPS(CX) specifies the sense (0 or 1) of the MPS of CX, which is initialised to zero and can be updated during the coding process. Given I(CX) and MPS(CX), Qe(I(CX)) provides the probability value, NMPS(I(CX)) or NLPS(I(CX)) indicates the next index after an MPS or LPS renormalisation, and SWITCH(I(CX)) is a flag used to indicate whether a change of the MPS(CX) sense is required [27].
Table 9.4 Qe and Estimation LUT

Index  Qe      NMPS  NLPS  Switch     Index  Qe      NMPS  NLPS  Switch
0      0x5601  1     1     1          24     0x1C01  25    22    0
1      0x3401  2     6     0          25     0x1801  26    23    0
2      0x1801  3     9     0          26     0x1601  27    24    0
3      0x0AC1  4     12    0          27     0x1401  28    25    0
4      0x0521  5     29    0          28     0x1201  29    26    0
5      0x0221  38    33    0          29     0x1101  30    27    0
6      0x5601  7     6     1          30     0x0AC1  31    28    0
7      0x5401  8     14    0          31     0x09C1  32    29    0
8      0x4801  9     14    0          32     0x08A1  33    30    0
9      0x3801  10    14    0          33     0x0521  34    31    0
10     0x3001  11    17    0          34     0x0441  35    32    0
11     0x2401  12    18    0          35     0x02A1  36    33    0
12     0x1C01  13    20    0          36     0x0221  37    34    0
13     0x1601  29    21    0          37     0x0141  38    35    0
14     0x5601  15    14    1          38     0x0111  39    36    0
15     0x5401  16    14    0          39     0x0085  40    37    0
16     0x5101  17    15    0          40     0x0049  41    38    0
17     0x4801  18    16    0          41     0x0025  42    39    0
18     0x3801  19    17    0          42     0x0015  43    40    0
19     0x3401  20    18    0          43     0x0009  44    41    0
20     0x3001  21    19    0          44     0x0005  45    42    0
21     0x2801  22    19    0          45     0x0001  45    43    0
22     0x2401  23    20    0          46     0x5601  46    46    0
23     0x2201  24    21    0
Table 9.5 LUT for I(CX) and MPS(CX)

CX       0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
I(CX)    4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   46
MPS(CX)  all initialised to zero
Table 9.6 A and C Register Structure

32-bit register   MSB                                    LSB
C                 0000 c bbbbbbbb sss xxxxxxxxxxxxxxxx
A                 0000 0000 0000 0000 aaaaaaaaaaaaaaaa

a: fractional bits in A holding the current interval value
x: fractional bits in C
s: spacer bits which provide useful constraints on carry-over
b: bits for ByteOut
c: carry bit
Figure 9.8 Top-Level Flowchart for Arithmetic Encoder (after initialisation, each CX/D pair is read and coded as an MPS or an LPS depending on whether D matches MPS(CX); Code MPS and Code LPS include the RENORME and BYTEOUT sub-modules, and FLUSH terminates the codestream when coding is finished)
Two 32-bit registers, A and C, are utilised by AE; their structures are given in Table 9.6 [5]. A stands for the total interval and C indicates the lower bound of the interval space. At initialisation, A is set to 0x00008000, which represents 0.75 and indicates the initial probability interval space, while C is initialised to 0x00000000. The top-level flowchart of the Arithmetic Encoder is illustrated in Figure 9.8, and the detailed architectures of the key sub-modules are illustrated in Figure 9.9.
Figure 9.9 Detailed Architectures of the Key Sub-modules in Arithmetic Encoder (CODEMPS, CODELPS, RENORME and BYTEOUT; CT is a counter for the number of shifts applied to A and C, BP is the compressed data buffer pointer, and B is the byte pointed to by BP)
Tier-2 and File Formatting
After the EBCOT Tier-1 encoder, each coding pass constitutes an atomic code unit, termed a chunk. These chunks can be grouped into quality layers and can be transmitted in any order, provided that chunks belonging to the same codeblock are transmitted in their relative order [101]. In the JPEG2000 standard, the EBCOT Tier-2 encoder is mainly used to organise the previously compressed bit-stream, which is partitioned into packets containing header information in addition to the bit-stream itself. The packet header includes the inclusion information, the length of the codewords, the zero bit-plane information and the number of coding passes. The basic coding scheme employed in the Tier-2 encoder is termed Tag-Tree coding [5], which is utilised to code the inclusion information and the zero bit-plane information. A Tag-Tree is a way to represent a two-dimensional array of non-negative integers in a hierarchical manner [27]. Taking an original 2-D array of data symbols (the highest level) of size 6x3 as an example, as illustrated in Figure 9.10, every four (or fewer, on the boundary) nodes (data symbols) are represented by a parent node, which sits at a lower level and is equal to the minimum value of its children. This representation continues until a single parent node for all child nodes is reached, called the root node, which has the lowest level. The coding process starts at the root node with an initial value of zero. If this value is less than the node's value, it is incremented by 1 and a "0" is output.
Figure 9.10 Tag Tree Encoding Procedure. In the example, the root node q0(0,0) = 1 is coded first, producing the output 01, and then its child node q1(0,0) is coded. The coding procedure continues up to the highest level, q3(0,0), whose final coded bitstream is 01111. When q3(0,0) is finished, the coding engine moves to q3(1,0) and another fractional bitstream, 001, is generated for it. When coding q3(2,0), its parent node q2(1,0) has not yet been coded, so the coding engine must code the parent first (generating "1"); after that, q3(2,0) can be coded and its final coded bitstream is 101. The rest of the data array is coded in the same way.
When the incremented value is equal to the root node's value, a "1" is output, which means the root node is coded. After that, the coding engine moves to one of its child nodes at the second lowest level, and the same coding process is performed node by node and level by level until all the nodes at the highest level are coded. In a Tag-Tree, nodes at higher levels cannot be encoded until their parent nodes at lower levels have been encoded [27]. In this way, each node is coded as d "0"s followed by a "1".
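The procedure can be sketched as a small routine that codes the nodes on the path from the root to one leaf. The per-node coded[] flags and the emit() stand-in are illustrative, and the thresholding used by the full JPEG2000 tag tree (which allows a node to be only partially coded) is omitted; under those simplifications the routine reproduces the three bitstreams of the Figure 9.10 example.
…………………………………………………………………………………………......
#include <stdio.h>

/* values[] holds the node values along the path (root first; a parent is the
 * minimum of its children, so the values are non-decreasing), coded[] records
 * which nodes have already been coded by earlier calls, and emit() stands in
 * for appending one bit to the packet header. */
static void emit(int bit) { putchar('0' + bit); }

void tagtree_encode_path(const int *values, int *coded, int depth)
{
    int known = 0;                               /* the decoder starts from zero */
    for (int i = 0; i < depth; i++) {
        if (!coded[i]) {
            while (known < values[i]) { emit(0); known++; }  /* d zeros ...      */
            emit(1);                                          /* ... then a one  */
            coded[i] = 1;
        }
        if (values[i] > known) known = values[i];             /* children >= parent */
    }
}

/* Replaying the example of Figure 9.10:
 *   path to q3(0,0) = {1,1,1,1}  ->  01111
 *   path to q3(1,0) = {1,1,1,3}  ->  001   (ancestors already coded)
 *   path to q3(2,0) = {1,1,1,2}  ->  101   (q2(1,0) coded on the way)          */
…………………………………………………………………………………………......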
The information represented by the Tier-2 encoder can be summarised as follows:
• Zero-length packet: one bit indicating whether a packet is zero-length or not.
• Inclusion information: encoded by a separate Tag-Tree. The value in this Tag-Tree is the number of the layer in which the codeblock is first included.
• Number of zero bit-planes: indicates how many zero bit-planes are included in this codeblock, encoded by another separate Tag-Tree.
• Number of coding passes included: encoded by the specified codewords listed in Table 9.7.
• Length of the bit-stream from the current codeblock: represented by a number of bits given by

\mathrm{bits} = \mathrm{LBlocks} + \lfloor \log_2(\text{number of coding passes}) \rfloor    (9.21)

where LBlocks is a state variable with an initial value of 3.
Table 9.7 Codewords for Number of Coding Passes

No. of coding passes   Codeword
1                      0
2                      10
3                      1100
4                      1101
5                      1110
6-36                   1111 00000 - 1111 11110
37-164                 1111 11111 0000 000 - 1111 11111 1111 111
In the case where the number of bits given is not sufficient to represent the bit-stream length, additional bits can be added together with a prefix, known as the codeblock codeword indicator. If k additional bits are required to represent the bit-stream length, the codeblock codeword indicator comprises k "1"s followed by a "0".
The coding process of EBCOT Tier-2 encoder can be summarised as
follows:
…………………………………………………………………………………………......
If packet not empty
    Code non-empty packet indicator (1 bit)
    For each subband
        For each codeblock in this subband
            Code inclusion information (Tag-Tree or 1 bit)
            If first inclusion of codeblock
                Code number of zero bit-planes (Tag-Tree)
            Code number of new coding passes
            Code codeword length indicator
            Code length of codeword
        End
    End
Else code empty packet indicator (1 bit)
End
…………………………………………………………………………………………......
References
[1]
Wikipedia. http://en.wikipedia.org/wiki/Image_compression. 2011.
[2]
Wikipedia. http://en.wikipedia.org/wiki/Tagged_Image_File_Format.
[3]
Wikipedia. http://en.wikipedia.org/wiki/JPEG.
[4]
GIF. http://en.wikipedia.org/wiki/GIF.
[5]
JPEG2000 Committee, JPEG2000 Part I Final Committee Draft
Version 1.0, ISO/IEC JTC1/SC29/WG1 N1646R.2000
[6]
S. Khawam, I. Nousias, M. Milward, Y. Yi, M. Muir and T. Arslan, The
Reconfigurable Instruction Cell Array. IEEE Transactions on Very
Large Scale Integration (VLSI) Systems. 16(1): pp. 75-85.2008
[7]
C. Brunelli, F. Garzia and J. Nurmi, A Coarse-Grained Reconfigurable
Architecture for Multimedia Applications Featuring Subword
Computation Capabilities. The EUROMICRO Journal of Systems
Architecture. 56(1): pp. 21-32.2008
[8]
http://en.wikipedia.org/wiki/digital_image_processing.
[9]
JBIG, http://en.wikipedia.org/wiki/JBIG,
[10]
JBIG2, http://en.wikipedia.org/wiki/JBIG2,
[11]
B. E. Bayer, Colour Imaging Array, in U. S. Patent.1976
[12]
http://en.wikipedia.org/wiki/Demosaicing. 2010.
[13]
W. Lu and Y. P. Tan, Colour Filter Array Demosaicing; New Method
and Performance Measures. IEEE Transaction of Image Processing.
12: pp. 1194-1210.2003
[14]
D. R. Cok, Signal Processing Method and Apparatus for Producing
Interpolated Chrominance Values in a Sampled Colour Image Signal,
in U. S. Patent, No. 4642678.1987
[15]
R. Ramanath, W. E. Snyder and G. L. Bilbro, Demosaicing Methods
for Bayer Colour Arrays. Journal of Electronic Imaging. 11: pp. 306-315. 2002
[16]
W. T. Freeman, Median Filter for Reconstructing Missing Colour
Samples, in U. S. Patent, No. 4724395.1988
[17]
C. A. Laroche and M. A. Prescott, Apparatus and Method of
Adaptively Interpolating a Full Colour Image Utilizing Luminance
Gradients, in U. S. Patent, No. 5373322.1994
[18]
J. F. Hamilton and J. E. Adams, Adaptive Colour Plane Interpolation in
Single Sensor Colour Electronic Camera, in U. S. Patent, No.
5629734.1997
[19]
R. Kimmel, Demosaicing: Image Reconstruction from CCD Samples.
IEEE Transaction of Image Processing. 8: pp. 1221-1228.1999
[20]
A. Lukin and D. Kubasov, An Improved Demosaicing Algorithm, in
Graphicon Conference.2004
[21]
P. Tsai, T. Acharya and A. Ray, Adaptive Fuzzy Color Interpolation.
Journal of Electronic Imaging. 11.2002
[22]
W. M. Lu and Y. P. Tan, Colour Filter Array Demosaicing: New
Method and Performance Measures. IEEE Transaction on Image
Processing. 12: pp. 1194-1210.2003
[23]
G. Zapryanov and I. Nikolova, Comparative Study of Demosaicing
Algorithms for Bayer and Pseudo-Random Bayer Color Filter Arrays,
in International Scientific Conference Computer Science. pp. 133-139. 2008
[24]
M. D. Adams, The JPEG 2000 Still Image Compression Standard, in
ISO/IEC JTC 1/SC 29/WG 1 N 2412.2002
[25]
I. Daubechies and W. Sweldens, Factoring Wavelet Transform into
Lifting Steps. The Journal of Fourier Analysis and Applications. 4: pp.
247-269.1998
[26]
W. Sweldens, The New Philosophy in Biorthogonal Wavelet
Constructions, in Proceedings of the 1995 SPIE. p. 68-79.1995
[27]
T. Acharya and P. Tsai, eds. JPEG2000 Standard for Image
Compression Concepts, Algorithms and VLSI Architectures. 2004,
Wiley-Interscience
[28]
STMicroelectronics, 5 Megapixel Mobile Imaging Processor Data
Brief.2007
[29]
NXP Semiconductor. NXP Nexperia Mobile Multimedia Processor
PNX4103. Available from: http://thekef.free.fr/CV/PNX4103.pdf.
[30]
Analog Devices, Wavescale Video Codec: ADV212 Datasheet.2010
[31]
Bacro, BA110 HD/DCI JPEG2000 Encoder Factsheet.2008
[32]
intoPIX, RB5C634A Technical Specificaiton Outline (JPEG2000
Encoder).2005
[33]
ASICFPGA. Bayer CFA Interpolation Core. Available from:
http://www.asicfpga.com/site_upgrade/asicfpga/isp/interpolation1.html.
[34] G. L. Jair, A. A. Miguel and W. V. Julio, A Digital Real Time Image Demosaicking Implementation for High Definition Video Cameras, in Robotics and Automotive Mechanics Conference. pp. 565-569, 2008.
[35] Xilinx. http://www.xilinx.com/products/ipcenter/JPEG2K_E.htm. 2010.
[36] J. Guo, C. Wu, Y. Li, K. Wang and J. Song, Memory-Efficient Architecture Including DWT and EC for JPEG2000, in IEEE International Conference on Solid-State and Integrated-Circuit Technology. pp. 2192-2195, 2008.
[37] M. Gangadhar and D. Bhatia, FPGA Based EBCOT Architecture for JPEG2000, in Microprocessors and Microsystems. pp. 363-373, 2005.
[38] H. B. Damecharla, K. Varma, J. E. Carletta and A. E. Bell, FPGA Implementation of a Parallel EBCOT Tier-1 Encoder that Preserves Coding Efficiency, in Proceedings of the 16th ACM Great Lakes Symposium on VLSI. pp. 266-271, 2006.
[39] BroadMotion, BroadMotion JPEG2000 Codecs for Combined TI DSP-Altera FPGA Platform, 2006.
[40] Silicon Hive, ISP2000 Processors Enable C-Programmable Image Signal Processing SoCs, 2009.
[41] Philips, TM-1300 Media Processor Data Book, 2000.
[42] T. H. Tsai, Y. N. Pan and L. T. Tsai, DSP Platform-Based JPEG2000 Encoder with Fast EBCOT Algorithm, in Proceedings of the SPIE. pp. 48-57, 2004.
[43] Texas Instruments, TMS320C6414T/15T/16T Fixed-Point Digital Signal Processors, 2009.
[44] Texas Instruments, TMS320C6455 Fixed-Point Digital Signal Processor, 2011.
[45] C. C. Liu and H. M. Hang, Acceleration and Implementation of JPEG2000 Encoder on TI DSP Platform, in IEEE International Conference on Image Processing. pp. 329-332, 2007.
[46] Q. Liu and G. Ren, The Real-Time Coding of JPEG2000 Based on TMS320C6455, in IEEE International Conference on Computer Application and System Modeling. pp. 503-507, 2010.
[47] Analog Devices, BLACKFIN Embedded Processor ADSP-BF535, 2004.
[48] Kiran K. S., Shivaprakash H., Subrahmanya M. V., Sundeep Raj and Suman David S., Implementation of JPEG2000 Still Image Codec on BLACKFIN (ADSP-BF535) Processor, in International Conference on Signal Processing. pp. 804-807, 2004.
[49] Analog Devices, BLACKFIN Embedded Processor ADSP-BF561, 2009.
[50]
P. Zhou, Y. G. Zhao and J. Zhou. The JPEG2000 compression
algorithm based on Blackfin561 Implementation and Optimization.
2009; Available from: http://electronics-tech.com/the-jpeg2000-
156
References
compression-algorithm-based-on-blackfin561-implementation-andoptimization/.
[51] M. Hashimoto, K. Matsuo and A. Koike, JPEG2000 Encoder for Reducing Tiling Artifacts and Accelerating the Coding Process, in IEEE International Conference on Image Processing. pp. 645-648, 2003.
[52] S. Smorfa and M. Olivieri, Cycle-Accurate Performance Evaluation of Parallel JPEG2000 on a Multiprocessor System-on-Chip Platform, in IEEE Conference on Industrial Electronics. pp. 3385-3390, 2006.
[53] J. C. Chen and S. Y. Chien, CRISP: Coarse-Grained Reconfigurable Image Stream Processor for Digital Still Cameras and Camcorders. IEEE Transactions on Circuits and Systems for Video Technology. 18: pp. 1223-1236, 2008.
[54] K. Deguchi, S. Abe, M. Suzuki, K. Anjo, T. Awashima and H. Amano, Implementing Core Tasks of JPEG2000 Encoder on the Dynamically Reconfigurable Processor, in International Conference on Architecture of Computing Systems, 2005.
[55] M. Motomura, A Dynamically Reconfigurable Processor Architecture, in Microprocessor Forum, 2002.
[56] H. Parizi, A. Niktash, N. Bagherzadeh and F. Kurdahi, MorphoSys: A Coarse Grain Reconfigurable Architecture for Multimedia Applications, in The Euro-Par Conference. pp. 844-848, 2002.
[57] A. Abnous and C. Christensen, Design and Implementation of the TinyRISC Microprocessor. Microprocessors and Microsystems. 16: pp. 187-194, 1992.
[58] B. Mei, S. Vernalde, D. Verkest, H. D. Man and R. Lauwereins, ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix, in The Conference on Field-Programmable Logic and Applications. pp. 61-70, 2003.
[59] M. Hartmann, V. Pantazis, T. V. Aa, M. Berekovic, C. Hochberger and B. Sutter, Still Image Processing on Coarse-Grained Reconfigurable Array Architectures, in IEEE Workshop on ESTIMedia. pp. 67-72, 2007.
[60] Y. Yi, I. Nousias, M. Milward, S. Khawam, T. Arslan and I. Lindsay, System-Level Scheduling on Instruction Cell Based Reconfigurable Systems, in The Conference on Design, Automation and Test in Europe. pp. 381-386, 2006.
[61] A. O. El-Rayis, X. Zhao, T. Arslan and A. T. Erdogan, Dynamically Programmable Reed Solomon Processor with Embedded Galois Field Multiplier, in International Conference on ICECE Technology, FPT. pp. 269-272, 2008.
[62] A. O. El-Rayis, X. Zhao, T. Arslan and A. T. Erdogan, Low Power RS Codec Using Cell-Based Reconfigurable Processor, in IEEE International Conference on System on Chip. pp. 279-282, 2009.
[63] X. Zhao, A. Erdogan and T. Arslan, OFDM Symbol Timing Synchronization System on a Reconfigurable Instruction Cell Array, in IEEE International Conference on System on Chip. pp. 319-322, 2008.
[64] J. J. van de Beek, M. Sandell and P. O. Borjesson, ML Estimation of Time and Frequency Offset in OFDM Systems. IEEE Transactions on Signal Processing. 45: pp. 1800-1805, 1997.
[65] Y. Y. Chuang, Cameras, in Digital Visual Effects, 2005.
[66] Wikipedia. http://en.wikipedia.org/wiki/Median_filter. 2010.
[67] G. Landini. Image Processing Fundamentals. Available from: http://www.ph.tn.tudelft.nl/Courses/FIP/Frames/fip.html.
[68] P. Longere, X. M. Zhang, P. B. Delahunt and D. H. Brainard, Perceptual Assessment of Demosaicing Algorithm Performance, in Proceedings of the IEEE, 2002.
[69] Y. W. Liu, J. Meng, H. Fan and J. J. Li, Research on Infrared Image Smoothing for Warship Targets, in IEEE International Conference on Machine Learning and Cybernetics. pp. 4054-4056, 2004.
[70] A. R. Rostampour and A. P. Reeves, 2-D Median Filtering and Pseudo Median Filtering, in Proceedings of the 20th Southeastern Symposium on System Theory. pp. 554-557, 1988.
[71] W. Han, Y. Yi, M. Muir, I. Nousias, T. Arslan and A. T. Erdogan, MRPSIM: A TLM Based Simulation Tool for MPSoCs Targeting Dynamically Reconfigurable Processors, in IEEE International SoC Conference. pp. 41-44, 2008.
[72] W. Han, Y. Yi, X. Zhao, M. Muir, T. Arslan and A. T. Erdogan, Heterogeneous Multi-Core Architectures with Dynamically Reconfigurable Processors for Wireless Communication, in IEEE Symposium on Application Specific Processors. pp. 27-32, 2009.
[73] X. Zhao, Y. Yi, A. T. Erdogan and T. Arslan, A High-Efficiency Reconfigurable 2-D Discrete Wavelet Transform Engine for JPEG2000 Implementation on Next Generation of Digital Cameras, in IEEE International SOC Conference, 2010.
[74] M. Mehendale, S. B. Roy, S. D. Serlekar and G. Venkatesh, Coefficient Transformations for Area-Efficient Implementation of Multiplier-less FIR Filters, in IEEE International Conference on VLSI Design. pp. 110-115, 1998.
[75] P. C. Wu and L. C. Chen, An Efficient Architecture for Two-Dimensional Discrete Wavelet Transform. IEEE Transactions on Circuits and Systems for Video Technology. 11, 2001.
[76] J. Guo, K. Wang, C. Wu and Y. Li, Efficient FPGA Implementation of Modified DWT for JPEG2000, in IEEE International Conference on Solid-State and Integrated Circuit Technology. pp. 2200-2203, 2008.
[77] Q. Liu, L. Du and B. Hu, Low-Power JPEG2000 Implementation on DSP-Based Camera Node in Wireless Multimedia Sensor Networks, in IEEE International Conference on NSWCTC. pp. 300-303, 2009.
[78] Freescale Semiconductor, JPEG2000 Wavelet Transform on StarCore(TM)-Based DSPs, 2004.
[79] M. Adams and F. Kossentini, JasPer: A Software-Based JPEG2000 Codec Implementation, in Proceedings of IEEE International Conference on Image Processing. pp. 53-56, Oct. 2000.
[80] K. F. Chen, C. J. Lian, T. H. Chang and L. G. Chen, Analysis and Architecture Design of EBCOT for JPEG2000, in Proceedings of IEEE International Symposium on Circuits and Systems. pp. 765-768, 2001.
[81] H. H. Chen, C. J. Lian, T. H. Chang and L. G. Chen, Analysis of EBCOT Decoding Algorithm and its VLSI Implementation for JPEG2000, in Proceedings of IEEE International Symposium on Circuits and Systems. pp. 329-332, 2002.
[82] J. S. Chiang, Y. S. Lin and C. Y. Hsieh, Efficient Pass Parallel Architecture for EBCOT in JPEG2000, in Proceedings of IEEE International Symposium on Circuits and Systems. pp. 773-776, 2002.
[83] D. Taubman, E. Ordentlich, M. Weinberger and G. Seroussi, Embedded Block Coding in JPEG2000, in Proceedings of IEEE International Conference on Image Processing. pp. 33-36, 2000.
[84] X. Zhao, A. T. Erdogan and T. Arslan, A Novel High-Efficiency Partial-Parallel Context Modeling Architecture for EBCOT in JPEG2000, in IEEE International Conference on System on Chip. pp. 57-61, 2009.
[85] ARM. www.arm.co.uk.
[86] ARM. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0274h/Chdifhhi.html.
[87] M. Dyer, D. Taubman, S. Nooshabadi and A. K. Gupta, Concurrency Techniques for Arithmetic Coding in JPEG2000. IEEE Transactions on Circuits and Systems. 53(6): pp. 1203-1213, 2006.
[88] B. Min, S. Yoon, J. Ra and D. S. Park, Enhanced Renormalization Algorithm in MQ-Coder of JPEG2000, in International Symposium on Information Technology Convergence. pp. 213-216, 2007.
[89] R. R. Osorio and B. Vanhoof, High Speed 4-Symbol Arithmetic Encoder Architecture for Embedded Zero Tree-Based Compression. Journal of VLSI Signal Processing Systems. 33(3): pp. 267-275, 2003.
[90] M. Tarui, M. Oshita, T. Onoye and I. Shirakawa, High-Speed Implementation of JBIG Arithmetic Coder, in IEEE Conference of TENCON. pp. 1291-1294, 2002.
[91] M. Dyer, D. Taubman and S. Nooshabadi, Improved Throughput Arithmetic Coder for JPEG2000, in IEEE International Conference on Image Processing. pp. 2817-2820, 2004.
[92] A. Aminlou, M. Homayouni, M. R. Hashemi and O. Fatemi, Low-Power High-Throughput MQ-Coder Architecture with an Improved Coding Algorithm, in The EURASIP Picture Coding Symposium, 2007.
[93] B. Valentine and O. Sohm, Optimizing the JPEG2000 Binary Arithmetic Encoder for VLIW Architectures, in Proceedings of International Conference on Acoustics, Speech and Signal Processing. pp. 117-120, 2004.
[94] ARM. http://arm.com/products/tools/software-tools/rvds/index.php. 2011.
[95] X. Zhao, A. T. Erdogan and T. Arslan, A Hybrid Dual-Core Reconfigurable Processor for EBCOT Tier-1 Encoder in JPEG2000 on Next Generation of Digital Cameras, in IEEE International Conference on Design and Architectures for Signal and Image Processing, 2010.
[96] Faraday 65nm power. http://www.faraday-tech.com/html/products/FeatureLibrary/miniLib_65nm.html.
[97] ARM. http://www.arm.com/products/processors/classic/arm9/arm946.php.
[98] Texas Instruments, TMS320C6414T/15T/16T Power Consumption Summary, 2008.
[99] M. Y. Chiu, K. B. Lee and C. W. Jen, Optimal Data Transfer and Buffering Schemes for JPEG2000 Encoder, in IEEE Workshop on Signal Processing Systems. pp. 177-182, 2003.
[100] B. F. Wu and C. F. Lin, Analysis and Architecture for High Performance JPEG2000 Coprocessor, in IEEE International Symposium on Circuits and Systems. pp. 225-228, 2004.
[101] G. Impoco, JPEG2000: A Short Tutorial, 2004.