High Efficiency Coarse-Grained Customised Dynamically Reconfigurable Architecture for Digital Image Processing and Compression Technologies

Xin Zhao

A thesis submitted for the degree of Doctor of Philosophy
The University of Edinburgh
November 2011

Abstract

Digital image processing and compression technologies have significant market potential, especially the JPEG2000 standard, which offers outstanding codestream flexibility and a high compression ratio. Strong demand for high performance digital image processing and compression system solutions is forcing designers to seek architectures that offer competitive advantages across all performance metrics, such as speed and power. Traditional architectures such as ASICs, FPGAs and DSPs are limited by either low flexibility or high power consumption. On the other hand, by providing a degree of flexibility similar to that of a DSP together with performance and power consumption approaching those of an ASIC, coarse-grained dynamically reconfigurable architectures are proving to be strong candidates for future high performance digital image processing and compression systems.

This thesis investigates dynamically reconfigurable architectures and especially the newly emerging RICA paradigm. Case studies such as a Reed-Solomon decoder and a WiMAX OFDM timing synchronisation engine are implemented in order to explore the potential of RICA-based architectures and possible optimisation approaches such as eliminating conditional branches, reducing memory accesses and constructing kernels. Based on these investigations, a novel customised dynamically reconfigurable architecture targeting digital image processing and compression applications is devised, which can be tailored to different applications.

A demosaicing engine based on the Freeman algorithm is designed and implemented on the proposed architecture as the pre-processing module in a digital imaging system. An efficient data buffer rotating scheme is designed with the aim of reducing memory accesses. Meanwhile, an investigation into mapping the demosaicing engine onto a dual-core RICA platform is performed. After optimisation, the performance of the proposed engine is carefully evaluated and compared in terms of throughput and consumed computational resources.

When targeting the JPEG2000 standard, the core tasks such as the 2-D Discrete Wavelet Transform (DWT) and Embedded Block Coding with Optimal Truncation (EBCOT) are implemented and optimised on the proposed architecture. A novel 2-D DWT architecture based on the vector operations associated with the RICA paradigm is developed, and the complete DWT application is highly optimised for both throughput and area. For the EBCOT implementation, a novel Partial Parallel Architecture (PPA) for the most computationally intensive module in EBCOT, termed Context Modeling (CM), is devised. Based on the algorithm evaluation, an ARM core is integrated into the proposed architecture for performance enhancement. A Ping-Pong memory switching mode with a carefully designed communication scheme between the RICA based architecture and the ARM core is proposed. Simulation results demonstrate that the proposed architecture for JPEG2000 offers a significant advantage in throughput.

Declaration of Originality

I hereby declare that the research recorded in this thesis and the thesis itself was composed by myself in the School of Engineering at The University of Edinburgh, except where explicitly stated otherwise in the text.
Xin Zhao 07/11/2011 III Acknowledgements Foremost, I would like to thank my Ph.D. supervisors Prof. TughrulArslanand Dr. KhaledBenkrid for their support and guidance during my study. I would like to thank Dr. Ahmet T. Erdogan who provides massive support and guidance to my work during my study. I would also like to thank my colleagues, Dr. Ying Yi who offered me guidance to RICA paradigm and made great contributions to the dual-core demosaicing engine work. Dr. Wei Han for his valuable suggestions to a number of difficulties I met during my research. Mr. Ahmed O. El-Rayis for his contributions to the customised GFMUL cell utilised in the RS decoder. Dr. Sami Khawam, Dr. IoannisNousias and Dr. Mark I. R. Muir for their noticeable help and suggestions to RICA based architectures in my work. Meanwhile, I would like to thank all RICA team members: Dr. Sami Khawam, Dr. Mark Milward, Dr. IoannisNousias, Dr. Ying Yi and Dr. Mark I. R. Muir for their brilliant invention – the RICA paradigm and its tool flow. In addition, many thanks to all members in SLI group for their help throughput my Ph.D. study. A very special thank to my wife Ying Liu. We met each other in SLI group and got married in Edinburgh. She always encourages me with her love and stays with me getting through all tough times. Finally, I would like to express my deepest appreciation to my parents for their love, guide and support to me throughput my life. IV Acronyms and Abbreviations ASIC Application Specific Integrated Circuit ALU Arithmetic Logic Units AE Arithmetic Encoder ADF Architecture description File BMA Berlekamp-Massey Algorithm CFA Colour Filter Array CM Context Modeling CCD Charge Coupled Device CUP Clean Up Pass CX/D Context and binary Decision CRLB Cramer-Rao Lower Bound CP Cyclic Prefix CSD Canonical Sign Digit DC Direct Current DCT Discrete Cosine Transform DWT Discrete Wavelet Transform DLP Date Level Parallelism DPRAM Dual Port RAM DAG Data Address Generator V DRP Dynamically Reconfigurable Processor DMU Data Management Unit EBCOT Embedded Block Coding with Optimal Truncation FPGA Field Programmable Gate Array FFU Flip-Flop Unit FU Function Unit FPS Frames per Second FCM Floating Coefficient Multiplier GF Galois Finite Field GFMUL Galois Finite Field Multiplier GOCS Group-Of-Column Skipping HD High Definition IC Instruction Cell ICT Irreversible Colour Transformation ILP Instruction Level Parallelism IFFT Inverse Fast Fourier Transform IP Intellectual Property ISI Inter-Symbol Interference JPEG Joint Photographic Experts Group LUT Look-Up Table LS Least Squares LZW LempleZiv Welch LPS Less Probable Symbol MDF Machine Description File MR Memory Relocation VI MRC Magnitude Refinement Coding MRP Magnitude Refinement Pass MSB Most Significant Bit-plane MPS More Probable Symbol MSE Mean Squared Error MSPS Million Symbols per Second MIPS Million Instructions per Second MAC Multiply-Accumulate MMASC Multiply-Accumulates per Second ML Maximum Likelihood MMSE Minimum Mean Square Error NMPS Next Most Significant Bit-plane MRPSIM Multiple Reconfigurable Processor Simulator MCOLS Multiple Column Skipping NLPS Next Less Probable Symbol NLOS Non-Line-Of-Sight LSB Least Significant Bit-plane OFDM Orthogonal Frequency Division Multiplexing PSNR Peak Signal to Noise Ratio PPA Partial Parallel Architecture PPCM Pass Parallel Context Modelling PE Processing Element QoS Quality of Service RCT Reversible Colour Transformation RGB Red-Green-Blue VII RFU Register File Unit RICA Reconfigurable Instruction Cell Array RLC Run Length Coding 
RLE Run Length Encoding RTL Register Transfer Level RC Reconfigurable Cell RF Register File RSPE Reconfigurable Stage Processing Element RS Reed Solomon SC Sign Coding SoC System on Chip SIMD Single Instruction Multiple Data SPP Significant Propagation Pass SS Sample Skipping TCP Turbo Decoder Coprocessor VCP Viterbi Decoder Coprocessor VGOSS Variable Group of Sample Skip VO Vector Operation WiMAX Worldwide Interoperability for Microwave Access ZC Zero Coding VIII Publication from this work 1. X. Zhao, A.T. Erdogan, T. Arslan, “High Efficiency Customised CoarseGrained Dynamically Reconfigurable Architecture for JPEG2000”, submitted to the IEEE Transaction on Very Large Scale Integration Systems, May, 2011. 2. X. Zhao, A.T. Erdogan, T. Arslan, “Dual-Core Reconfigurable Demosaicing Engine for Next Generation of Portable Camera Systems,” the IEEE Conference on Design & Architectures for Signal and Image Processing (DASIP), October 26-28, 2010. 3. X. Zhao, A. T. Erdogan, T. Arslan, “A Hybrid Dual-Core Reconfigurable Processor for EBCOT Tier-1 Encoder in JPEG2000 on Next Generation Digital Cameras,” the IEEE Conference on Design & Architectures for Signal and Image Processing (DASIP), October 26-28, 2010. 4. X. Zhao, Y. Yi, A. T. Erdogan, T. Arslan, “A High-Efficiency Reconfigurable 2-D Discrete Wavelet Transform Engine for JPEG2000 Implementation on Next Generation Digital Cameras,” the 23 rd IEEE International System-on-Chip (SOC) Conference, September 27-29, 2010. 5. X. Zhao, A. T. Erdogan, T. Arslan, “A Novel High-Efficiency PartialParallel Context Modeling Architecture for EBCOT in JPEG2000,” the 22 nd IEEE International SOC Conference, pp. 57-60, 2009. 6. X. Zhao, A. T. Erdogan, T. Arslan, “OFDM Symbol Timing Synchronization System on a Reconfigurable Instruction Cell Array,” the 21st IEEE International SOC Conference, pp. 319-322, 2008. 7. A. El-Rayis, X. Zhao, T. Arslan, A. T. Erdogan, “Low power RS codec using cell-based reconfigurable processor, ” the 22nd IEEE International SOC Conference, pp. 279-282, 2009. 8. A. El-Rayis, X. Zhao, T. Arslan, A. T. Erdogan, “Dynamically programmable Reed Solomon processor with embedded Galois Field multiplier,” IEEE International Conference on ICECE Technology, FPT, pp. 269-272, 2008. IX Contents Chapter 1 Introduction....................................................................................................... 1 1.1. Motivation................................................................................................................ 1 1.2. Objective ................................................................................................................. 3 1.3. Contribution............................................................................................................. 3 1.4. Thesis Structure ...................................................................................................... 4 Chapter 2 Digital Image Processing Technologies and Architectures ........................ 6 2.1. Introduction to Digital Image Processing Technologies ......................................... 6 2.2. Demosaicing Algorithms ......................................................................................... 9 2.3. JPEG2000 Compression Standard ...................................................................... 13 2.4. Literature Review .................................................................................................. 15 2.4.1. 
Demosaicing Algorithms Evaluations ............................................................... 15 2.4.2. Solutions for Image Procesing and Compression Applications ....................... 18 2.5. Demand for Novel Architectures ........................................................................... 31 2.6. Conclusion ............................................................................................................ 34 Chapter 3 RICA Paradigm Introduction and Case Studies .......................................... 36 3.1. Introduction ........................................................................................................... 36 3.2. Dynamically Reconfigurable Instruction Cell Array .............................................. 37 3.2.1. Architecture ...................................................................................................... 37 3.2.2. RICA Tool Flow ................................................................................................ 39 3.2.3. Optimisation Approaches to RICA Based Applications .................................... 41 3.3. Case Studies ........................................................................................................ 42 3.4. Outcomes of Case Studies ................................................................................... 43 3.5. Prediction of Different Imaging Tasks on RICA Based Architecture .................... 45 3.6. Conclusion ............................................................................................................ 47 Chapter 4 Freeman Demosaicing Engine on RICA Based Architecture .................... 49 4.1. Introduction ........................................................................................................... 49 4.2. Freeman Demosaicing Algorithm ......................................................................... 49 4.3. Freeman Demosaicing Engine Implementation.................................................... 51 X 4.4. System Analysis and Dual-Core Implementation ................................................. 54 4.4.1. System Analysis ............................................................................................... 54 4.4.2. Dual-Core Implementation ............................................................................... 56 4.5. Optimisation .......................................................................................................... 61 4.6. Performance Analysis and Comparison ............................................................... 63 4.7. Future Improvement ............................................................................................. 65 4.8. Conclusion ............................................................................................................ 66 Chapter 5 2-D DWT Engine on RICA Based Architecture ............................................ 68 5.1. Introduction ........................................................................................................... 68 5.2. Lifting-Based 2-D DWT Architecture in JPEG2000 Standard .............................. 68 5.3. Lifting-Based DWT Engine on RICA Based Architecture ..................................... 70 5.3.1. 1-D DWT Engine Implementation..................................................................... 70 5.3.2. 2-D DWT Engine Implementation..................................................................... 72 5.3.3. 
2-D DWT Engine Optimisation ......................................................................... 75 5.4. Performance Analysis and Comparisons ............................................................. 77 5.5. Conclusion ............................................................................................................ 81 Chapter 6 EBCOT on RICA Based Architecture and ARM Core.................................. 83 6.1. Introduction ........................................................................................................... 83 6.2. Context Modelling Algorithm Evaluation ............................................................... 83 6.3. Efficient RICA Based Designs for Primitive Coding Schemes in CM ................... 87 6.3.1. Zero Coding ...................................................................................................... 87 6.3.2. Sign Coding ...................................................................................................... 88 6.3.3. Magnitude Refinement Coding ......................................................................... 90 6.3.4. Run Length Coding .......................................................................................... 90 6.4. Partial Parallel Architecture for CM ...................................................................... 93 6.4.1. Architecture ...................................................................................................... 93 6.4.2. PPA based CM Coding Procedure ................................................................... 94 6.5. Arithmetic Encoder in EBCOT .............................................................................. 98 6.6. EBCOT Tier-2 Encoder....................................................................................... 100 6.7. Performance Analysis and Comparisons ........................................................... 102 6.8. Conclusion .......................................................................................................... 104 Chapter 7 JPEG2000 Encoder on Dynamically Reconfigurable Architecture ......... 106 7.1. Introduction ......................................................................................................... 106 7.2. 2-D DWT and EBCOT Integration ...................................................................... 107 7.3. CM and AE Integration ....................................................................................... 108 7.3.1. System Architecture ....................................................................................... 108 7.3.2. Memory Relocation Module ............................................................................ 109 XI 7.3.3. Communication Scheme between CM and MR ............................................. 111 7.3.4. Ping-Pong Memory Switching Scheme .......................................................... 113 7.4. Performance Analysis and Comparison ............................................................. 115 7.4.1. Execution Time Evaluation ............................................................................. 115 7.4.2. Power and Energy Dissipation Evaluation ..................................................... 116 7.4.3. Performance Comparisons ............................................................................. 119 7.5. Future Improvements .......................................................................................... 122 7.6. 
Conclusion .......................................................................................................... 124 Chapter 8 Conclusions .................................................................................................. 126 8.1. Introduction ......................................................................................................... 126 8.2. Review of Thesis Contents ................................................................................. 126 8.3. Novel Outcomes of the Research ....................................................................... 127 8.4. Future Work ........................................................................................................ 130 Appendix ............................................................................................................................. 133 JPEG2000 Encoding Standard ........................................................................................ 133 Tiling and DC Level Shifting ........................................................................................ 133 Component Transformation ......................................................................................... 134 2-Demension Discrete Wavelet Transform ................................................................. 135 Quantisation ................................................................................................................ 138 Embedded Block Coding with Optimal Truncation ...................................................... 138 References .......................................................................................................................... 154 XII List of Figures Figure 2.1 Digital Image Processing System Architecture ....................................................... 7 Figure 2.2 Bayer CFA Pattern .................................................................................................. 9 Figure 2.3 Bayer CFA Pattern Demosaicing Procedure .......................................................... 9 Figure 2.4 Illustration of Freeman Demosaicing Algorithm .................................................... 11 Figure 2.5 JPEG2000 Encoder Architecture .......................................................................... 14 Figure 2.6 Test Images for Evaluating Different Demosaicing Algorithms in [13] ................. 16 Figure 2.7 Performance Comparisons between Different Demosaicing Algorithms.............. 16 Figure 2.8 Test Images in [23] ............................................................................................... 17 Figure 2.9 (a) PSNR Comparisons (b) Execution Time Comparisons [23] ........................... 17 Figure 2.10 HiveFlex ISP2300 Block Diagram [41] ............................................................... 22 Figure 2.11 TM1300 Block Diagram [42] ............................................................................... 23 Figure 2.12 TMS320C6416T Block Diagram [44] .................................................................. 25 Figure 2.13 ADSP BF535 Core Architecture [48] .................................................................. 26 Figure 2.14 CRISP Processor Architecture [54] .................................................................... 28 Figure 2.15 (a) NEC DRP Structure (b) PE in NEC DRP [56] ............................................... 29 Figure 2.16 (a) MorphoSys Architecture (b) RC Array Architecture [57] ............................... 
30 Figure 2.17 (a) ADRES Architecture (b) RC Architecture [59] ............................................... 31 Figure 3.1 RICA Paradigm [6] ................................................................................................ 37 Figure 3.2 RICA Tool Flow ..................................................................................................... 40 Figure 4.1(a) Freeman Demosaicing Architecture (b) Bilinear Demosaicing for Bayer Pattern ............................................................................................................................................... 50 Figure 4.2 Freeman Demosaicing Implementation Architecture........................................... 51 XIII Figure 4.3 Data Buffers Addresses Rotation ......................................................................... 51 Figure 4.4 Parallel Architecture for Freeman Demosaicing ................................................... 52 Figure 4.5 Freeman Demosaicing Execution Flowchart ........................................................ 53 Figure 4.6 (a) Pseudo Median Filter (b) Median Filter Reuse ................................................ 55 Figure 4.7 Mapping Methodology for MRPSIM ...................................................................... 57 Figure 4.8 Dual-Core Freeman Demosaicing Engine Architecture ....................................... 58 Figure 4.9 Pseudo Code for Dual-Core Implementation ........................................................ 60 Figure 4.10 Illustration of Pipeline Architecture for Kernels ................................................... 62 Figure 4.11 A Demosaiced 648x432 Image........................................................................... 63 Figure 4.12 Potential Vector Operations in Median Filter ...................................................... 65 Figure 5.1 (a) Convolutional DWT Architecture (b) 5/3 Lifting-based DWT Architecture (c) 9/7 Lifting-based DWT Architecture ............................................................................................. 69 Figure 5.2 Generic Lifting-Based DWT Architecture for Both 5/3 and 9/7 modes ................. 69 Figure 5.3 Lifting-Based 2-D DWT Architecture..................................................................... 70 Figure 5.4 Detailed Generic Architecture of 1-D DWT Engine on RICA ................................ 71 Figure 5.5 Reconstructed Image Quality with Different CSD Bits .......................................... 71 Figure 5.6 Streamed Data Buffers in DWT Engine ................................................................ 72 Figure 5.7 Detailed 3-Level 2-D DWT Decomposition ........................................................... 73 Figure 5.8 Parallel Pixel Transformation with VO and SIMD Technique ............................... 74 Figure 5.9 Kernel in the 2-D DWT Engine on RICA Architecture .......................................... 75 Figure 5.10 Standard Lena Image Transformed by the 2-D DWT Engine ............................ 77 Figure 5.11 Throughput (fps) Comparisons ........................................................................... 78 Figure 5.12 Area and Δ Comparisons .................................................................................... 79 Figure 5.13 Performance Comparisons ................................................................................. 80 Figure 6.1 Sample Skipping Method for CM .......................................................................... 
84 Figure 6.2 Group of Column Skipping Method for CM ........................................................... 84 Figure 6.3 Pass Parallel Context Modeling ............................................................................ 85 Figure 6.4 Detailed Architecture for ZC Unit .......................................................................... 88 Figure 6.5 Detailed Architecture for SC Unit .......................................................................... 89 XIV Figure 6.6 Detailed Architecture for MRC Unit....................................................................... 90 Figure 6.7 Codeword Structure in RLC Unit .......................................................................... 91 Figure 6.8 The Structure of RLC Unit .................................................................................... 92 Figure 6.9 Partial Parallel Architecture for Context Modeling ................................................ 93 Figure 6.10 The Example of How Data Buffers Work in PPA ................................................ 94 Figure 6.11 Pseudo Code of PPA Working Process ............................................................. 96 Figure 6.12 PPA Codeword Structure .................................................................................... 97 Figure 6.13 (a) Original RENORME Architecture(b) Optimised RENORME Architecture ..... 99 Figure 6.14 (a) Original BYTEOUT Architecture (b) Optimised BYTEOUT Architecture .... 100 Figure 6.15 Detailed Tag-Tree Coding Procedure ............................................................... 101 Figure 6.16 Detailed Codeword Length Coding Procedure ................................................. 101 Figure 6.17 PPA Based CM Execution Time under Different Pre-Conditions ..................... 102 Figure 7.1 Original data processing pattern between 2-D DWT and EBCOT ..................... 107 Figure 7.2 Modified 2-D DWT Scanning Pattern.................................................................. 108 Figure 7.3 Proposed Architecture with DPRAM ................................................................... 109 Figure 7.4 (a) Memory Relocation in JPEG2000 Encoder (b) Detailed Architecture of MR module.................................................................................................................................. 110 Figure 7.5 Pseudo Code for EBCOT Implementation on the Proposed Architecture .......... 112 Figure 7.6 Pipeline Structure of the JPEG2000 Encoder .................................................... 113 Figure 7.7 Execution Time Ratio of Different Modules in JPEG2000 Encoder ................... 114 Figure 7.8 Ping-Pong Memory Switching Architecture ........................................................ 114 Figure 9.1 Discrete Wavelet Transform ............................................................................... 136 Figure 9.2 Multi-level 2-Demension DWT ............................................................................ 136 Figure 9.3 Lifting-Based DWT .............................................................................................. 137 Figure 9.4 Dead-Zone Illustration of the Quantiser .............................................................. 138 Figure 9.5 (a) Scanning Pattern of EBCOT (b) Significant State ......................................... 140 Figure 9.6 Illustration of One Pixel’s Neighbours ................................................................. 
140 Figure 9.7 EBCOT Tier-1 Context Modeling Working Flowchart ......................................... 145 Figure 9.8 Top-Level Flowchart for Arithmetic Encoder ...................................................... 149 XV Figure 9.9 Detailed Architectures of the Key Sub-modules in Arithmetic Encoder.............. 150 Figure 9.10 Tag Tree Encoding Procedure.......................................................................... 151 XVI List of Tables Table 2.1 Examples of Image Processing Technologies and Compression Standards .......... 8 Table 2.9 Comparisons of Different Architectures for Image Processing Applications ......... 32 Table 3.1 Instruction Cells in RICA ........................................................................................ 38 Table 4.1 Instruction Cells Occupied by Freeman Demosaicing Engine ............................... 61 Table 4.2 Freeman Demosaicing Performance Evaluations and Comparisons .................... 64 Table 5.1 CSD Forms of Floating-Point Parameters ............................................................. 71 Table 5.2 Numbers of Cells in Different DWT Engines .......................................................... 76 Table 6.1 Simplified LUT for XOR Bit .................................................................................... 89 Table 6.2 Valid_state in the RLC Unit .................................................................................... 92 Table 6.3 CX/D Selection in PPA ........................................................................................... 95 Table 6.4 Valid_state Indication for RLC in PPA ................................................................... 97 Table 6.5 Performance Comparisons .................................................................................. 102 Table 6.6 Numbers of Cells in CM Engines on Customised RICA Architecture .................. 103 Table 6.7 Performance Comparisons .................................................................................. 104 Table 7.1 Communication Variables .................................................................................... 111 Table 7.2 Detailed Execution Time of the JPEG2000 Encoder Sub-modules on the Proposed Architecture .......................................................................................................................... 116 Table 7.3 Power and Energy Dissipation of the JPEG2000 Encoder Sub-modules on the Proposed Architecture .......................................................................................................... 118 Table 7.4 Execution Time Comparisons .............................................................................. 120 Table 7.5 Energy Dissipation Comparisons ......................................................................... 121 Table 7.6 Future Throughput Improvement ......................................................................... 123 XVII Table 9.1 Contexts for the Zero Coding Scheme................................................................. 142 Table 9.2 H/V Contributions and Contexts in the Sign Coding Scheme .............................. 143 Table 9.3 Contexts of the Magnitude Refinement Coding Scheme ..................................... 144 Table 9.4 Qe and Estimation LUT ........................................................................................ 148 Table 9.5 LUT for I(CX) and MPS(CX) ................................................................................ 
148
Table 9.6 A and C Register Structure .......................................................... 149
Table 9.7 Codewords for Number of Coding Passes .......................................................... 152

Chapter 1 Introduction

1.1. Motivation

With the rapid development of computer technologies, digital image processing has become a pivotal element of everyday life. It is widely utilised not only in academic and research fields such as medical image processing and radar image analysis, but also in consumer products such as mobile phones and digital cameras. Meanwhile, together with the growth of Internet technology and portable storage devices, digital image compression techniques are drawing more and more attention, with the objective of reducing the irrelevance and redundancy of image data so that it can be stored or transmitted in an efficient form [1].

Take a digital camera as an example: it usually employs image processing technologies including demosaicing, Gamma correction, white balancing, smooth filtering, etc. After processing, the obtained digital image is compressed and stored on an SD card or transmitted over the Internet or other media. The compressed image may be represented in different formats such as TIFF [2], JPEG [3] and GIF [4]. Recently, a newer version of JPEG, termed JPEG2000 [5], has been presented. Based on a wavelet method, the JPEG2000 compression standard offers significant flexibility and outstanding performance compared with other existing standards.

Given these technologies, a question naturally arises: what is desired of a digital image processing solution in applications such as mobile phones and digital cameras? Obviously, a solution which is able to provide high throughput is highly desirable. Power efficiency is also very important, especially for portable applications powered by batteries. Moreover, in advanced digital cameras a good image processing solution is normally required to have significant flexibility in order to support different algorithms. Generally, an ideal digital image processing solution is expected to offer high throughput, strictly low power consumption and outstanding flexibility/reconfigurability for the various tasks in advanced digital cameras.

Research targeting efficient solutions for digital image processing and compression applications has been carried out for a long time. Application Specific Integrated Circuit (ASIC) implementations are traditionally popular for designing complex image processing applications such as JPEG2000 solutions. However, this kind of solution is inherently inflexible and cannot be upgraded or altered after fabrication. Field Programmable Gate Array (FPGA) based solutions can provide more flexibility and a shorter time-to-market compared with ASIC solutions. However, traditional FPGAs normally consume more power than ASICs and may not be suitable for embedded image processing applications, since the majority of the available transistors are used to provide flexibility [6]. Another popular solution is to use third-party DSPs to build Systems on Chip (SoCs). Compared with ASIC/FPGA based solutions, DSPs have the advantage of either higher flexibility (compared with ASICs) or lower power consumption (compared with FPGAs). However, DSP based solutions usually have limited throughput due to the lack of Instruction Level Parallelism (ILP) compared with the other two solutions.
Moreover, even though DSPs have lower power consumption than FPGAs, the power efficiency of DSP based applications is still curbed by high clock rates and deep submicron processes.

Recently, a new category of programmable architectures, termed coarse-grained reconfigurable architectures, has emerged targeting high performance and area-efficient computing applications. Different from traditional FPGAs and DSPs, coarse-grained reconfigurable architectures can be viewed as hardware components whose internal architecture can be dynamically reconfigured in order to implement different algorithms. Generally, coarse-grained reconfigurable architectures are more area and power efficient than FPGAs while retaining software-like programmability similar to DSPs, and they are more efficient because their computational functionality is implemented in hardware [7]. This thesis proposes a customised dynamically reconfigurable architecture based on the coarse-grained Reconfigurable Instruction Cell Array (RICA) paradigm for digital image processing and compression applications such as demosaicing and the JPEG2000 standard.

1.2. Objective

The objective of this thesis is to explore a high efficiency customised reconfigurable architecture targeting digital image processing and compression technologies by utilising the coarse-grained dynamically reconfigurable RICA paradigm. After investigating different RICA based architectures, this thesis aims to design efficient solutions for demosaicing and for the core tasks in the JPEG2000 standard based on the proposed architecture.

1.3. Contribution

The major contributions of this thesis are split into five key aspects:

1. The potential of RICA based architectures and possible optimisation approaches are explored through case studies including Reed-Solomon decoder and Worldwide Interoperability for Microwave Access (WiMAX) Orthogonal Frequency Division Multiplexing (OFDM) timing synchronisation engine implementations.

2. A Freeman demosaicing engine is developed as the pre-processing module in a digital imaging system. This demosaicing engine is implemented on the RICA based architecture and optimised by an efficient data buffer rotating scheme and a pseudo median filter. A parallel architecture for the demosaicing engine is developed, and an investigation into mapping the demosaicing engine onto a dual-core RICA platform is performed.

3. A novel 2-D Discrete Wavelet Transform (DWT) engine for JPEG2000 is developed. This 2-D DWT engine is based on the vector operations associated with the RICA paradigm and is highly optimised for both throughput and area.

4. Solutions for efficiently implementing the four primitive coding schemes of the Context Modeling (CM) module in JPEG2000 on the RICA based architecture are developed. A novel Partial Parallel Architecture (PPA) for CM is developed which strikes a good balance between throughput and area occupation for RICA based implementations.

5. A novel customised dynamically reconfigurable architecture for JPEG2000 is developed. The proposed architecture is based on the RICA paradigm together with an embedded ARM core for the efficient implementation of the Arithmetic Encoder (AE) module in JPEG2000. A modified 2-D DWT scanning pattern and a memory relocation module, together with an efficient communication scheme between the RICA based architecture and the ARM core, are developed. A Ping-Pong memory switching mode between the RICA based architecture and the ARM core is proposed for further performance improvement.
1.4. Thesis Structure

This thesis is structured as follows:

Chapter 2 describes the background. It provides detailed algorithms for demosaicing and introduces the JPEG2000 standard. A literature review is also included in this chapter, mainly covering demosaicing and JPEG2000 encoder solutions on different architectures; DSP and coarse-grained reconfigurable architecture based solutions are especially emphasised.

Chapters 3 through 7 address the research achievements of this Ph.D. Based on a detailed description of the RICA paradigm, two case studies (a Reed-Solomon decoder and a WiMAX OFDM timing synchronisation engine) are introduced in Chapter 3 in order to investigate the potential of RICA based architectures and possible optimisation approaches.

Chapter 4 focuses on the design and implementation of a Freeman demosaicing engine on the RICA based architecture. The work involves an efficient data buffer rotating scheme, single-core engine implementation and optimisation, and mapping the demosaicing engine onto a dual-core RICA architecture.

From Chapter 5 onwards, this thesis concentrates on the proposed customised dynamically reconfigurable architecture for a JPEG2000 encoder solution. A novel vector operation based 2-D DWT engine is proposed in Chapter 5, together with detailed throughput and area evaluations. Chapter 6 proposes efficient solutions for the CM and AE modules in the JPEG2000 encoding algorithm. This includes the design and implementation of the four primitive coding schemes involved in CM and the novel PPA solution. Based on algorithm analysis and evaluation, an ARM core is selected for an efficient AE implementation. In Chapter 7, the proposed architecture for JPEG2000 is introduced based on the previous discussion. A modified 2-D DWT scanning pattern, a shared Dual Port RAM (DPRAM), a Memory Relocation (MR) module and a Ping-Pong memory switching mode are presented in order to improve the performance of the proposed architecture. Finally, the thesis is concluded with a summary in Chapter 8.

Chapter 2 Digital Image Processing Technologies and Architectures

2.1. Introduction to Digital Image Processing Technologies

Digital image processing is the use of computer algorithms to perform processing on digital images. It is a subcategory of digital signal processing which provides many advantages over analog image processing, such as a wider range of algorithms to choose from and the avoidance of the build-up of noise and distortion during processing [8]. Digital image processing is extremely widely used in fields such as digital cameras, remote sensing, multimedia and satellite imaging.

A typical digital image processing system can be viewed as being composed of the following modules: a source of input digital image data, a processing module and a destination for the processed image, as illustrated in Figure 2.1. Usually, the digital image source is provided by a digitisation procedure, that is, the process of converting an analog image into an ordered array of discrete pixels. This procedure is normally performed by a digital camera, scanner, etc. The processing module in digital image processing is usually a digital processor, which can be a computer or a dedicated chip. The image destination can be realised by different kinds of digital storage and output terminals for transmission.
Generally, digital image processing applies different processing algorithms to a matrix of digitised pixels, which is in fact what a digital image is. Depending on the pixel format, the image can be represented in grayscale or colour.

Figure 2.1 Digital Image Processing System Architecture (object → image capture and digitisation → digital image processing unit → processed image destination)

In the case that the image is large or has deep bit-depth pixels, transmission and storage of an uncompressed image become extremely costly and impractical. For example, an 8-bit grayscale image with a size of 320x240 requires 76,800 bytes of storage, and this figure increases significantly with image size or pixel bit-depth. Digital image compression techniques therefore become critical in order to minimise the size of a digital image to fit the available storage/transmission capacity without degrading the image quality to an unacceptable level.
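To make these storage figures concrete, the short sketch below computes the raw (uncompressed) size of a few representative frames; the resolutions listed are illustrative examples rather than values taken from this thesis.

```c
#include <stdio.h>

/* Raw image size in bytes: width x height x channels x bytes per sample. */
static unsigned long raw_bytes(unsigned long w, unsigned long h,
                               unsigned int channels, unsigned int bits_per_sample)
{
    return w * h * channels * (bits_per_sample / 8UL);
}

int main(void)
{
    /* 320x240, 8-bit grayscale: the 76,800-byte example used in the text. */
    printf("320x240   8-bit gray : %lu bytes\n", raw_bytes(320, 240, 1, 8));
    /* A 24-bit RGB frame from a nominal 5-megapixel sensor. */
    printf("2560x1920 24-bit RGB : %lu bytes\n", raw_bytes(2560, 1920, 3, 8));
    return 0;
}
```

Storing or transmitting roughly 15 MB per uncompressed 5-megapixel frame is clearly impractical for portable devices, which motivates the compression standards discussed next.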
Depending on the compression algorithm, image compression falls into two categories: lossless and lossy. Lossless compression is preferred in fields such as medical imaging and technical drawings because it eliminates compression artifacts, while lossy schemes are widely used in applications where minor distortions are tolerable and a low bit-rate is expected for storage and transmission.

There are several different methods to compress an image. For Internet use, the two most popular compression schemes are JPEG [3] and GIF [4]. JPEG uses compression techniques such as the Discrete Cosine Transform (DCT), chroma sub-sampling, Run Length Encoding (RLE) and Huffman coding. This compression scheme gives good representations of natural images. GIF uses a lossless compression algorithm, Lempel-Ziv-Welch (LZW), which is better suited to artificial images than to natural images as the standard allows only 256 colours. There are also other popular compression methods in use today, such as TIFF [2], JBIG [9] and JBIG2 [10]. Recently a newer version of JPEG, termed JPEG2000 [5], has been proposed, which is based on the DWT and supports both lossless and lossy compression. The JPEG2000 standard offers outstanding features compared with other standards, such as a high compression ratio, random bit-stream access and region of interest coding, which will be discussed in the following sections.

Based on these digital image processing and compression techniques, digital cameras are widely used in our daily life. Basically, a digital camera captures the object with a Charge Coupled Device (CCD) overlaid with a Bayer filter [11], which is the most common method of digitisation. A Red-Green-Blue (RGB) image is then reconstructed by the demosaicing module. After that, the RGB image can be further processed and compressed into certain file formats for storage and display. When targeting next generation digital cameras, the JPEG2000 compression standard becomes an ideal choice because of its desirable features compared with other compression standards.

Table 2.1 Examples of Image Processing Technologies and Compression Standards

(a) Image Processing Technologies
Demosaicing: to reconstruct a full-colour image from CFA samples
Gamma correction: to correct and adjust the colour difference
White balancing: to adjust the intensities of different colours
Sharpening: to increase the contrast around the edges of objects
Smoothing: to reduce noise within an image

(b) Image Compression Standards
TIFF (1986): lossless/lossy, a popular format for high colour-depth images
GIF (1987): lossless, supports up to 256 colours
JBIG (1993): lossless, for bi-level image compression
JBIG2 (2000): lossless/lossy, for bi-level image compression
JPEG (1992): usually lossy, with an optional lossless mode
JPEG2000 (2000): lossless/lossy, a newly emerging standard

Table 2.1 lists examples of existing image processing technologies and compression standards. In line with the above discussion and the main work in this thesis, the following sections focus on different demosaicing algorithms and the JPEG2000 compression standard.

2.2. Demosaicing Algorithms

Commercially, the most commonly used Colour Filter Array (CFA) pattern is the Bayer filter [11], illustrated in Figure 2.2. It has alternating Red (R) and Green (G) filters on odd rows and alternating Green (G) and Blue (B) filters on even rows. Due to the human eye's high sensitivity to green light, the Bayer CFA contains twice as many green (luminance) filters as either red or blue (chrominance) ones.

Figure 2.2 Bayer CFA Pattern (a 5x5 tile of alternating R/G and G/B filter rows)
Figure 2.3 Bayer CFA Pattern Demosaicing Procedure (original image → Bayer CFA samples → separate colour planes obtained by demosaicing → reconstructed image)

As the object information captured by an image sensor overlaid with a CFA has only one colour component (R/G/B) at each pixel position, a full colour image needs to be reconstructed, and hence the concept of demosaicing arises. The aim of demosaicing is to reconstruct an image with a full set of colour components from the spatially undersampled colour samples captured by the image sensor. For the Bayer CFA, demosaicing estimates the two missing colour components at each pixel position with a selected interpolation algorithm. Figure 2.3 illustrates the procedure from the original Bayer filter samples to the reconstructed image [12]. An ideal demosaicing algorithm should avoid introducing false colour artifacts such as zippering and chromatic aliasing as far as possible, while preserving as much of the original image resolution as possible. Considering embedded applications in cameras, an ideal algorithm should also have low computational complexity for fast processing and efficient hardware implementation.

A number of demosaicing algorithms have been proposed. The simplest approach is nearest-neighbour interpolation, which simply copies an adjacent pixel of the required colour component as the missing colour value. Obviously this approach can only be used to generate previews under strictly limited computational resources and is unsuitable for most applications where quality matters. Another simple approach, bilinear demosaicing, fills in missing colour components with weighted averages of the adjacent samples of the same colour; it is simple enough to implement in most cases, but it introduces severe demosaicing artifacts and smears sharp edges [13].
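To make the bilinear stage concrete, the sketch below interpolates one interior pixel of an RGGB Bayer mosaic by averaging the nearest available samples of each missing colour. The function name, the assumed RGGB phase and the 8-bit buffer layout are illustrative choices made here, not details of the engines described in later chapters.

```c
#include <stdint.h>

/* Bilinear demosaicing of one interior pixel of an RGGB Bayer mosaic.
 * cfa is the single-channel mosaic, w its width in pixels; the recovered
 * components are written to r/g/b. Border pixels are ignored for brevity. */
static void bilinear_pixel(const uint8_t *cfa, int w, int x, int y,
                           uint8_t *r, uint8_t *g, uint8_t *b)
{
    #define P(dx, dy) cfa[(y + (dy)) * w + (x + (dx))]
    int cross = (P(-1,0) + P(1,0) + P(0,-1) + P(0,1)) / 4;   /* N,S,E,W   */
    int diag  = (P(-1,-1) + P(1,-1) + P(-1,1) + P(1,1)) / 4; /* diagonals */
    int horiz = (P(-1,0) + P(1,0)) / 2;
    int vert  = (P(0,-1) + P(0,1)) / 2;

    if ((y % 2 == 0) && (x % 2 == 0)) {          /* red site        */
        *r = P(0,0);  *g = cross;  *b = diag;
    } else if ((y % 2 == 1) && (x % 2 == 1)) {   /* blue site       */
        *b = P(0,0);  *g = cross;  *r = diag;
    } else if (y % 2 == 0) {                     /* green, red row  */
        *g = P(0,0);  *r = horiz;  *b = vert;
    } else {                                     /* green, blue row */
        *g = P(0,0);  *b = horiz;  *r = vert;
    }
    #undef P
}
```

The Freeman algorithm discussed below reuses exactly this kind of bilinear pass as its first stage.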
In [14], Cok presented a constant hue-based interpolation demosaicing algorithm which exploits the spectral correlation between different colour ratios. Hue is the property of colours by which they can be perceived as ranging from red through yellow, green and blue, as determined by the dominant wavelength of the light [15]. In the Cok demosaicing algorithm, hue is defined by a vector of ratios (R/G, B/G). By interpolating the hue values and deriving the interpolated chrominance values from them, hues are allowed to change only gradually, thereby reducing the appearance of the colour fringes which would be obtained by interpolating only the chrominance values. A detailed description of the algorithm can be found in [14].

Figure 2.4 Illustration of Freeman Demosaicing Algorithm: (a) Original Image, (b) Bayer CFA Samples, (c) Bilinear Demosaicing (fringe introduced), (d) Colour Difference A-B, (e) Median Filtered Colour Difference, (f) Reconstructed Image (fringe removed)

The Freeman demosaicing algorithm was proposed in [16]. It is a two-stage process combining bilinear interpolation and median filtering. In order to minimise the zippering artifacts introduced by bilinear demosaicing, the Freeman algorithm applies median filtering to colour differences, namely red minus green and blue minus green. The filtered differences are then added back to the green plane to obtain the final red and blue planes. In this way, fringes at the edges between different colour areas can be eliminated, as illustrated in Figure 2.4, which takes as an example a line containing two colour components of the Bayer CFA [16]. Figure 2.4 (a) shows an original image with two colour components; it contains a sharp edge between two areas (the vertical axis represents the colour component intensities). Figure 2.4 (b) illustrates the sampled colour information captured by a CCD with the Bayer CFA. After bilinear demosaicing, the image is reconstructed but with a noticeable colour fringe between the two colour areas at the positions of pixels 6 and 7, shown in (c). If we take the colour value difference, that is, colour A minus colour B, and filter it with a median filter with a kernel of size 5, the fringe can be eliminated, as illustrated in (d) and (e). Finally, the original image can be reconstructed without the colour fringe produced by bilinear demosaicing, as shown in (f).
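A minimal sketch of the second, median-filtering stage is shown below, assuming bilinearly interpolated planes are already available (for instance from a routine like bilinear_pixel above). A 3x3 window is used purely for illustration; the engine developed in Chapter 4 instead uses a pseudo median filter together with a data buffer rotating scheme.

```c
#include <stdlib.h>

/* Second stage of a Freeman-style demosaicer (illustrative sketch, not the
 * RICA implementation of Chapter 4): median-filter the R-G and B-G colour
 * difference planes produced by the bilinear stage, then add G back. */

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

/* 3x3 median of plane d at interior pixel (x, y); w is the plane width. */
static int median3x3(const int *d, int w, int x, int y)
{
    int win[9], k = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            win[k++] = d[(y + dy) * w + (x + dx)];
    qsort(win, 9, sizeof(int), cmp_int);
    return win[4];                       /* middle element of the sorted window */
}

/* r, g, b: bilinearly demosaiced planes; dr, db: scratch difference planes. */
static void freeman_refine(int *r, const int *g, int *b,
                           int *dr, int *db, int w, int h)
{
    for (int i = 0; i < w * h; i++) {    /* colour differences, Figure 2.4 (d) */
        dr[i] = r[i] - g[i];
        db[i] = b[i] - g[i];
    }
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            r[y * w + x] = median3x3(dr, w, x, y) + g[y * w + x];
            b[y * w + x] = median3x3(db, w, x, y) + g[y * w + x];
        }
}
```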
Laroche and Prescott proposed a three-step gradient-based demosaicing algorithm in [17], which estimates the luminance channel first and then interpolates the colour differences, namely red minus green and blue minus green. Two classifiers, α and β, are used to determine whether a pixel belongs to a vertical or a horizontal colour edge [15]. According to the magnitude comparison between α and β, different formulas are employed to estimate the green pixel value. Both the classifiers and the formulas are adapted for pixels at different positions of the Bayer CFA pattern. Once the luminance channel (green) is determined, chrominance values (red and blue) are estimated from the colour differences by further formulas. A detailed description of the algorithm can be found in [17].

A modification of this algorithm, termed adaptive colour plane interpolation, was proposed by Hamilton and Adams in [18]. This method also employs a multi-step process with classifiers similar to the Laroche-Prescott algorithm, but modified to accommodate first order and second order derivatives, that is, it calculates arithmetic averages for the chrominance channels and appropriately scaled second derivative terms for the luminance data [15].
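The classifier idea shared by these two methods can be sketched as follows. The α/β measures below are a generic edge-directed illustration (green first differences plus a second derivative of the centre colour), not the exact classifiers defined in [17] or [18], and the variable names are assumptions made for this sketch.

```c
#include <stdlib.h>

/* Generic edge-directed green interpolation at a red or blue Bayer site,
 * illustrating the classifier idea of the Laroche-Prescott and
 * Hamilton-Adams methods rather than their exact published formulas. */
static int green_edge_directed(const int *cfa, int w, int x, int y)
{
    #define P(dx, dy) cfa[(y + (dy)) * w + (x + (dx))]
    /* Horizontal / vertical activity: green first differences plus a
     * second derivative of the centre colour along each axis. */
    int alpha = abs(P(-1,0) - P(1,0)) + abs(2 * P(0,0) - P(-2,0) - P(2,0));
    int beta  = abs(P(0,-1) - P(0,1)) + abs(2 * P(0,0) - P(0,-2) - P(0,2));

    if (alpha < beta)        /* horizontal direction is smoother   */
        return (P(-1,0) + P(1,0)) / 2;
    else if (beta < alpha)   /* vertical direction is smoother     */
        return (P(0,-1) + P(0,1)) / 2;
    else                     /* no preferred direction: use all four greens */
        return (P(-1,0) + P(1,0) + P(0,-1) + P(0,1)) / 4;
    #undef P
}
```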
Flexible file format with provisions for specifying opacity information and image sequences 5. Good error resilience Figure 2.5 illustrates a block diagram oftheJPEG2000 encoding algorithm. The original image is decomposed into rectangular blocks termed tiles and codeblocks for processing in order to avoid massive memory usage. The main modules in JPEG2000 encoder are: Component Transform, Tiling, 2Demonsion DWT, Quantisation, EBCOT Tier-1 encoder including Context 13 Digital Image Processing Technologies and Architectures Tiling and DC level shifting Component transformation Original Image 2-D DWT Subbands EBCOT Tier-1 Quantization Context Modelling Arithmetic Encoder Codeblocks Bitstream 010110011101... Tier-2 & File Formatting Figure 2.5 JPEG2000 Encoder Architecture Modeling and Arithmetic Encoder, and finally the Tier-2 encoder with File Formatting. Since the JPEG2000 standard is quite complicated, only a brief introduction of each module is given here. Readers can refer to the Appendix for a more detailed description of the standard. Tiling and DC Level Shifting: Tiling partitions the original image into a number of rectangular non-overlapping blocks, termed tiles. Within each tile, DC level shifting is applied to ensure each of them has a dynamic range which is approximately centered around zero. Component Transformation: Normally, the input image is considered to have three colour planes (R, G, B). JPEG2000 standard supports two different transformations: (1) Reversible Colour Transformation (RCT) and (2) Irreversible Colour Transformation (ICT). RCT can be applied to both lossless and lossy compression, while ICT can only be used in the lossy scheme. 2-D Discrete Wavelet Transform: This is one of the key differences between JPEG2000 and previous JPEG standard. DWT decomposes a tile into a number of subbands at different resolution levels with both frequency and time information. 2-D DWT is a further decomposition based on 1-D DWT. In JPEG2000, a modified scheme termed lifting based DWT [25-26] is utilised to simply the computation. 14 Digital Image Processing Technologies and Architectures Quantisation: In lossy compression mode, all the DWT coefficients are quantised in order to reduce the precision of DWT subbands to aid in achieving compression [27]. The quantisation is performed by uniform scalar quantisation with dead-zone around the origin. After quantisation, all quantised DWT coefficients are signed integers and converted into sign-magnitude represented prior to entropy coding [27]. Embedded Block Coding with Optimal Truncation: This is the most computationally intensive module in JPEG2000. It can be divided into two coding steps: Tier-1 and Tier-2. Tier-1 coding scheme consists of fractional bit-plane coding (Context Modeling) and binary arithmetic coding (Arithmetic Encoding). Context Modeling (CM) codes DWT coefficients in bit-level using four primitive coding schemes. After CM, DWT coefficients are coded into Context/Decision (CX/D) pairs. Then Arithmetic Encoder (AE) continues coding these CX/D pairs to obtain the compressed bit-stream. After Tier-1 coding step, the bit-stream is organised by Tier-2 coding step and the final coded bit-stream is generated. 2.4. Literature Review 2.4.1. Demosaicing Algorithms Evaluations Performance evaluations and Comparisons of various demosaicing algorithms discussed in Section 2.2 have been proposed in [15] and [23]. 
In [15], a set of test images was employed in the authors' proposed experiment, including bar/starburst images and real images such as macaw and crayon, shown in Figure 2.6 (a)-(h). These images are selected to cover different cases such as images containing sharp edges, high spatial frequencies, speckle behaviour and distinct colour edges. Figure 2.7 illustrates performance comparisons in terms of Mean Squared Error (MSE). It is found that the Freeman algorithm is best suited to images with speckle behaviour, while the Laroche-Prescott and Hamilton-Adams algorithms are best suited to cases with sharp edges [15].

Figure 2.6 Test Images for Evaluating Different Demosaicing Algorithms in [13]

Figure 2.7 Performance Comparisons between Different Demosaicing Algorithms

In [23], performance comparisons were carried out among the Freeman algorithm, Kimmel algorithm, Tsai-Acharya algorithm and Wenmiao-Peng algorithm in terms of both Peak Signal to Noise Ratio (PSNR) and execution time. The test images utilised are shown in Figure 2.8. The first six images are synthetic vector images and the other six are actual photographic images. Figure 2.9 illustrates comparisons in both the PSNR and execution time aspects.

Figure 2.8 Test Images in [23]

Figure 2.9 (a) PSNR Comparisons (b) Execution Time Comparisons [23]

In this thesis, these demosaicing algorithms are compared from two aspects: reconstructed image quality and whether the algorithm is suitable for hardware based implementation. From Figure 2.7, it is seen that the Freeman demosaicing algorithm provides the lowest MSE for most of the images. In Figure 2.9, it is seen that both the Freeman algorithm and the Wenmiao-Peng algorithm deliver higher PSNR compared with the other two algorithms. When processing synthetic vector images, the Wenmiao-Peng algorithm provides the best PSNR. However, when processing real photographic images, the Freeman algorithm provides very similar PSNRs. Hence it is concluded that both the Freeman and Wenmiao-Peng algorithms can provide good reconstructed image quality. When considering hardware based implementation, the Freeman algorithm is very straightforward. Both the bilinear stage and the median filter can be easily implemented in hardware. On the other hand, the Wenmiao-Peng algorithm involves divisions when calculating its estimation coefficients and intermediate colour values, which requires a more complicated hardware architecture. In this case, as a classical demosaicing algorithm which provides both good performance and a relatively simple architecture, the Freeman demosaicing algorithm is selected in this thesis.

2.4.2. Solutions for Image Processing and Compression Applications

In industry, the demosaicing task does not usually appear as a single-purpose product. In contrast, it is normally integrated within a complete digital image processing chain as the pre-processing module. On the other hand, there are various JPEG2000 encoder solutions based on different architectures, including custom chips, FPGAs, DSP&VLIW based SoCs, coarse-grained reconfigurable architectures and so on. Generally, these solutions are popular to some extent, even though each has non-negligible drawbacks.
In the following subsections, various solutions for image processing and compression applications including demosaicing and JPEG2000 are discussed. 2.4.2.1. Custom Chips Implementations Custom chip implementations are traditional popular in designing image processing and compression solutions. Usually these chips are fully customised for targeted imaging applications and designed through the standard ASIC design flow involving Register Transfer Level (RTL) coding, logic synthesis and layout design. Recently, some high-end custom chips choose to integrate one or more processor cores into their dedicated hardware architectures for performance enhancement. One such example is 18 Digital Image Processing Technologies and Architectures STMicroelectronics STV0986 processor [28] which provides a full image processing chain including noise filter, demosaicing, sharpness enhancement, lens shading correction, etc. STV0986 has a video processor, two video pipes and a dedicated JPEG encoder. It can provide throughput up to 12.5 fps in JPEG format at 5 megapixel resolution. Another example is NXP PNX4103 [29] which is a multimedia processing chip with embedded TM3270 and ARM926 cores. It can efficiently realise imaging tasks such as demosaicing, white balancing, image stabilisation, sharpening, etc. and supports video standards such as H.264 and MPEG. For JPEG2000 solutions, one such example is Analog Devices ADV212 JPEG2000 codec [30] which can deliver up to 65 Million Symbols per Second (MSPS) for 9/7 irreversible mode or 40 MPSP for 5/3 reversible mode. Another example is Bacro BA110 JPEG2000 encoder [31] supporting 720p/1080p High Definition (HD) videos. Other commercial custom chip solutions include intoPIX RB5C634A JPEG2000 encoder [32]providing throughput up to 27Mpixels/s without truncation and so on. Obviously these ASIC based implementations offer high throughput, high power efficiency and small footprints for image processing and compression solutions. However, even with embedded processor cores, this kind of solution is inherently inflexible as they are fully customised and cannot be upgraded and altered after fabrication. This is one of its main drawbacks, particularly when ASIC solutions are used for imaging applications where algorithms evolve rapidly. Meanwhile, designing full custom chips requires more human effort and cannot meet the short time-tomarket demands. 2.4.2.2. FPGA Based Implementations Image processing and compression applications can be mapped on FPGAs for fast hardware prototypes and some domains where the costs of power and area are not important. Compared with ASIC solutions, FPGA based solutions can provide more flexibility and shorter time-to-market. FPGA based demosaicing solutions 19 Digital Image Processing Technologies and Architectures A Bayer CFA interpolation IP core targeting FPGAs and ASICs was presented by ASICFPGA Ltd. in [33]. This interpolation IP is based on ASICFPGA’s own demosaicing algorithm which has a 5x5 processing window. This algorithm has similar nature to Laroche and Prescott algorithm as both of them try to detect the change of colour edges in the image. Given an 8-bit input bit-width, this IP core can work at a frequency of up to 129MHz on a Virtex 4 LX15 FPGA. The authors in [34] presented a bilinear demosaicing engine on a Virtex 4 FPGA. The engine demonstrates throughput up to 150MPixels/s when working at a frequency of 150MHz. 
FPGA based JPEG2000 solutions FPGAs are also traditional popular solutions for complex imaging tasks such as JPEG2000. Announced in [35], the JPEG2K-E Intellectual Property (IP) core can be mapped on Xilinx Virtex 4~6 and Spartan-6 FPGA families, providing throughput up to 190MSamples/s based on a 90nm technology. The JPEG2K-E IP consists of a 2-D DWT engine, a quantiser and multiple EBCOT engines. Based on Virtex-6 FPGA platforms, JPEG2K-E IP runs at the frequency of 210MHz and consumes 44K LUTs and 77 BRAMs. The authors in [36] presented a memory efficient JPEG2000 architecture including 2-DDWT and EBCOT on Xilinx Virtex II FPGA platform. A multilevel line-based lifting 2-D DWT was implemented which was claimed to be able to support multi-level DWT being executed simultaneously. Based on the line-based DWT, a parallel EBCOT engine was established. The authors declared that their implementation can provide throughput up to 44.76Mpixles/s with a working frequency of 100MHz. M. Gangadhar and D. Bhatia presented an FPGA based EBCOT architecture in [37]. A parallel architecture was developed in their work which can process three coding passes simultaneously. Two column based processing elements were designed to code different coding passes in parallel. With a XC2V1000 device running at 50MHz, the proposed application can encode a 512x512 grayscale image in less than 0.03s. 20 Digital Image Processing Technologies and Architectures An Altera APEX20K FPGA was selected in [38] for the implementation of a parallel EBCOT tier-1 encoder. The authors presented a split arithmetic encoder for EBCOT Tier-1 process which well investigated the causal relationship between different coding passes and enabled AE to code context information generated by different coding passes simultaneously. Results showed that the proposed architecture offered a 55% improvement in processing time compared with the traditional serial architecture for a set of different test images. There are also JPEG2000 implementations based on FPGA and DSP combined platforms such as BroadMotion JPEG2000 codec [39] on a combination of Altera Cyclone II FPGA and TMS320DM64x DSP. Since FPGAs are fine-grained, they require more configuration bits compared with coarse-grained reconfigurable architectures. Meanwhile, a majority of the transistors in FPGAs are used for providing reconfigurability. In this case, traditional FPGAs usually consume more power than ASICs. However, FPGAs are evolving rapidly with the latest process technology, and applications based on elder FPGAs can be easily transplanted to newer devices. In this case, the performance of FPGA based applications highly depends on the manufacturing process technology and the FPGA device itself. Moreover, FPGA based applications are normally developed with certain hardware description languages such as VHDL and Verilog, which increases the design difficulty for engineers. 2.4.2.3. DSP Based SoC Implementations Instead of designing basic components from RTL, another popular solution for image processing and compression tasks is to use third-party or their own DSPs to build SoCs. Several DSP solutions targeting imaging applications including demosaicing and JPEG2000 are introduced in the following subsections. 
21 Digital Image Processing Technologies and Architectures Figure 2.10 HiveFlex ISP2300 Block Diagram [41] HiveFlex ISP2000 series HiveFlex ISP2000 series [40] provided by Silicon Hive is a series of processors with licensable silicon proven C-programmable IPs optimised for image signal processing. Figure 2.10illustrates the architecture of HiveFlex ISP2300 as an example. It has an instruction set optimised for the image processing domain and a combination of VLIW and Single Instruction Multiple Data (SIMD) parallelism [40]. With different configurations, its SIMD datapath can vary from 4-way to 128-way. HiveFlex ISP2300 is C programmable, and it has scalar data path for standard C programs. HiveFlex ISP2300 has dedicated hardware peripherals such as encoding accelerator and filterbank accelerator in order to enhance its performance. It supports a full-featured image processing chain including demosaicing with Silicon Hive’s patented technology, wide dynamic range visual optimisation, red eye removal, flexible scaling, JPEG codec, etc. With the maximum 128 SIMD factor, HiveFlex ISP 2300’s performance can reach up to 170 GOPS at 333MHz and can support full HD 1080p video at 30 fps [40]. Philips TriMedia TM1300 22 Digital Image Processing Technologies and Architectures Philips TriMedia TM1300 [41] is an advanced 32-bit 5 issue VLIW processor core. Specialised processing blocks are integrated into the device in addition to the programmable VLIW core. The VLIW architecture of TM1300 allows parallelism of instruction execution and data manipulation. The special functional blocks of TM1300 include digital audio ports, an image coprocessor, PCI and other external device interfaces, a memory controller, video I/O ports, an MPEG variable length decoder block and a multi-mode fixed function video scaling and filtering engine [42]. Figure 2.11illustrates the TM 1300 block diagram. This processor is completely C language programmable. In addition to providing an object oriented C/C++ compiler and debug tools, TriMedia software tool flow provides a real-time OS kernel, optimisation tools, a simulator and application code libraries for industry standard digital audio and video stream processing algorithms. In the Figure 2.11 TM1300 Block Diagram [42] 23 Digital Image Processing Technologies and Architectures reference design [42], a fast EBCOT CM algorithm is implemented and optimised on TM1300 processor. Optimisation approaches include using custom operations, simplifying logic operations, removing conditional branches, loop fusion, etc. The simulated result demonstrates the execution time of 10.26ms for processing the standard 256x256 Lena test image by CM with a working frequency of 143MHz. TI TMS320C64x DSPs TMS320C64x DSPs are the highest performance fixed-point DSP generation in the TMS320C6000 DSP platform. They are based on the secondgeneration (C6416T) and third-generation (C6455) high performance advanced VelociTI VLIW architecture developed by Texas Instruments[43]. Figure 2.12illustrates the block diagram of TMS320C6416T as an example. It has six 32/40-bit Arithmetic Logic Units (ALUs), two 16-bit multipliers and 64 32-bit general purpose registers. With performance of up to 8000 Million Instructions per Second (MIPS) at a working frequency of 1GHz, C6416T DSP can produce four 16-bit Multiply-Accumulates (MACs) per cycle for a total of 4000 million MACs per Second (MMASC). 
C6416T DSP has two embedded coprocessors: Viterbi Decoder Coprocessor (VCP) and Turbo Decoder Coprocessor (TCP) in order to speed up channel-decoding operations on chip [43]. Based on 90 nm process technology, C6455 DSP can support a higher clock rate of 1.2GHz, which enables it with performance of up to 9600 MIPS [44]. References [45] and [46] demonstrate JPEG2000 encoder designs based on C6416T and C6455 respectively. The utilised optimisation approaches include Variable Group of Sample Skip (VGOSS) for CM and SIMD functions in C6416T [45] and system-level compiler optimisation and DMA utilisation [46]. With these optimisations, the JPEG200 encoder in [45] shows its encoding time of approximately 74.6 ms for a 256x256 grayscale image while the encoder in [46] demonstrates the execution time of 45.25 ms for the grayscale Lena image with the same size. 24 Digital Image Processing Technologies and Architectures Figure 2.12 TMS320C6416T Block Diagram [44] BLACKFIN Processors BLACKFIN DSPs are embedded processors developed by Analog Devices. They use a 32-bit RISC microcontroller programming model on an SIMD architecture which offers low power and high performance features. The ADSP-BF535 processor [47] combines a 32-bit RICS-like instruction set and dual 16-bit MAC signal processing functionality with an extendable addressing capability. Figure 2.13illustrates the core architecture of ADSP BF535. It consists of a data arithmetic unit which includes two 16-bit MACs, two 40-bit ALUs, four 8-bit video ALUs and a single 40-bit barrel shifter [48]. 25 Digital Image Processing Technologies and Architectures Figure 2.13 ADSP BF535 Core Architecture [48] The two Data Address Generators (DAGs) support bit-reversed addressing and circular buffering. Registers occupied by BF535 include six 32-bit address pointer registers for fetching operands, index registers, modifier registers, base registers and length registers [48]. ADSP-BF535 processor contains a rich set of peripherals connected to the core via several high bandwidth buses. Based on its dual-core architecture, ADSP-BF561 processor offers higher performance [49]. References [48] and [50] present JPEG2000 implementations based on ADSP-BF535 and BF561 respectively. In [48], LUTs and functional macros are utilised for the code optimisation. Execution complexity of each submodule in JPEG2000 is analysed though the coding time is not given. In [50], optimisation mainly focuses on logic simplification, code reusing and memory arrangement. The execution time provided by [50] is approximately 53 ms for encoding a 256x256 grayscale image. Other DSP/VLIW based Implementations 26 Digital Image Processing Technologies and Architectures There are a couple of other DSP/VLIW based imaging solutions such as a CPU JPEG2000 implementation [51], an ARM920T implementation[52]and a STMicroelectronics LX-ST230 based JPEG2000 implementation [52]. These implementations focus on either algorithm optimisation [51] or efficient task mapping scheme [52] in order to accelerate the coding process. Generally, traditional DSP/VLIW solutions discussed above have noticeable lower throughput compared with ASIC/FPGA solutions although they are usually more power efficient. In this case, traditional DSP/VLIW based solutions should not be considered as the ideal solution for imaging applications in next generation digital cameras. On the other hand, DSPs specialised for imaging applications like [40] have their limitations such as lack of ILP. 
Meanwhile, as various dedicated hardware peripherals are usually integrated into specialised imaging DSPs, they may require longer time-to-market and cost more than traditional DSPs. 2.4.2.4. Coarse-Grained Reconfigurable Architecture Based Implementations Recently, a new category of programmable processor architectures for demanding DSP applications, termed coarse-grained reconfigurable architecture, has emerged targeting high performance and area-efficient computing applications. Different from traditional FPGAs and DSPs, coarsegrained reconfigurable architectures can be intended as hardware components whose internal architecture can be dynamically reconfigured in order to implement different algorithms. Since the internal circuits can be reused for implementing different functionalities at different times and the required configuration information is less than fine-grained architectures, coarse-grained reconfigurable architectures are more area and power efficient compared with FPGAs. Meanwhile, coarse-grained reconfigurable architectures also offer software-like programmability similar to DSPs, and they are more efficient due to the implementations on hardware of computational functionalities [7]. 27 Digital Image Processing Technologies and Architectures Unfortunately, as far as our investigation is concerned, there are only few demosaicing and JPEG2000 applications based on coarse-grained reconfigurable architectures. In this subsection, several coarse-grained reconfigurable architectures targeting image processing and multimedia applications are discussed. Some of these architectures have demosaicing engine implementations such as CRISP in[53] and core tasks of JPEG2000 standard implemented such as NEC Dynamically Reconfigurable Processor (DRP) in [54]. Others have their potential for imaging applications demonstrated by applying tasks such as DCT, median filter, FIR, etc. CRISP Different from other coarse-grained reconfigurable architectures, CRISP processor [53] consists of context registers, main controller, reconfigurable interconnection, and various kinds of coarse-grained Reconfigurable Stage Processing Elements (RSPEs). Each kind of RSPE corresponds to one module specified for image processing such as load memory, pixel-based operation, colour interpolation, downsample, etc. Figure 2.14 illustrates the architecture of CRISP processor. The authors in [53] implemented several Figure 2.14 CRISP Processor Architecture [54] 28 Digital Image Processing Technologies and Architectures typical image processing tasks such as Gamma correction, demosaicing, median filter, smooth filter, etc. on a fabricated chip. Performance comparisons are made between CRISP and DSPs such as Philips TM1300 and TMS320C64x and CRISP demonstrates good throughput improvement. However, since the CRISP processor is more ASIC-like as it has dedicated hardwired imaging-targeted RSPEs, these comparisons become less convictive to some extent. NEC Dynamically Reconfigurable Processor NEC DRP [55] is a coarse-grained dynamically reconfigurable processor core released by NEC. It carries an on-chip configuration data, or contexts, and it dynamically reschedules these contexts to realise multiple functions. 64 of the most primitive 8-bit Processing Elements (PEs)are combined to form what is called a tile, and DRP core consists of an arbitrary number of these tiles (Figure 2.15(a)). The architecture of a PE is illustrated in Figure 2.15(b). 
A PE has an 8-bit ALU, an 8-bit Data Management Unit (DMU) (for shifts/masks), an 8-bit x 16-word Register File Unit (RFU), and an 8-bit Flip-Flop Unit (FFU). These units are connected by programmable wires specified by instruction data, and their bitwidths range from 8 Bytes to 18 Bytes depending on the location. A PE has 16-depth instruction memories (e.g. 16 contexts) and supports multiple context operation [54].

Figure 2.15 (a) NEC DRP Structure (b) PE in NEC DRP [56]

Based on the NEC DRP architecture, the authors in [54] implement some core tasks in the JPEG2000 encoding algorithm, including 2-D DWT, the significant coding pass in CM, and AE. The optimisation approaches mainly focus on efficient context controlling and reducing the number of occupied PEs. Without giving the performance of processing a real image, NEC DRP demonstrates an execution time of 0.213ms for processing 256 16-bit samples by the significant coding pass and 1023 CX/D pairs by AE [54], which shows advantages compared with the TMS320C6713 DSP based implementations.

MorphoSys

MorphoSys [56] is a reconfigurable architecture for computation intensive applications based on a combination of both coarse grain and fine grain reconfiguration techniques. Figure 2.16(a) illustrates the architecture of the MorphoSys processor. The reconfigurable part in MorphoSys is an RC array, which is an 8x8 array of Reconfigurable Cells (RCs). The configuration data is stored in the context memory. The architecture of an RC is illustrated in Figure 2.16(b). Each RC consists of four types of basic elements: functional units for arithmetic and logic operations, a memory element to feed the functional units and store their results, input and output modules to connect cells together to form the RC array architecture, and a fine grain reconfigurable logic block. TinyRisc [57] is a general purpose 32-bit RISC processor. It controls the operation sequence in MorphoSys and executes non-data-parallel operations [56]. The authors in [56] presented implementations of DCT, FFT and correlation based on the MorphoSys processor, which show advantages compared with TMS320C6000 DSPs.

Figure 2.16 (a) MorphoSys Architecture (b) RC Array Architecture [57]

ADRES

The ADRES architecture [58] is a combination of a VLIW processor and a coarse-grained reconfigurable matrix. Figure 2.17(a) illustrates the architecture of ADRES. For the VLIW part, several Function Units (FUs) are allocated and connected together through one multi-port register file, which is typical for a VLIW architecture. For the reconfigurable matrix part, there are a number of RCs which basically comprise FUs and Register Files (RFs), as illustrated in Figure 2.17(b) [58]. FUs in ADRES perform coarse-grained operations on 32-bit operands. Based on the ADRES architecture, the authors in [59] presented implementations of both the Tiff2BW transform and wavelet transform benchmarks and made comparisons with TI C64x DSP implementations.

Figure 2.17 (a) ADRES Architecture (b) RC Architecture [59]

2.5. Demand for Novel Architectures

Table 2.2 presents brief comparisons of the different reviewed architectures for image processing applications. As presented, customised chips offer good performance in aspects of throughput and power efficiency for imaging solutions. However, their flexibility is strictly limited since these chips are fully customised.
This drawback becomes extremely noticeable when such customised chips are used for rapidly evolving imaging technologies.

Table 2.2 Comparisons of Different Architectures for Image Processing Applications

Architecture | Structure | Target Applications

Customised Chips (including pure ASICs and customised chips with embedded CPUs)
STV0986 [29] | Dedicated hardware with embedded video processor core | Image/video processing, JPEG
NXP PNX4103 [30] | Dedicated hardware with embedded TM3270 and ARM cores | Image/video processing, H.264, MPEG
ADV212 [31] | Dedicated hardware with embedded RISC processor | JPEG2000
Bacro BA110 [32] | Dedicated hardware | JPEG2000
intoPIX RB5C634A [33] | Dedicated hardware | JPEG2000

FPGA Based Implementations (including IP for FPGA and ASIC)
ASICFPGA IP [34] | IP for FPGA and ASIC | Demosaicing
JPEG2K-E IP [36] | IP for FPGA and ASIC | JPEG2000
BroadMotion [40] | Combination of Altera FPGA and TI DSP | JPEG2000
Other FPGA applications [35], [37-39] | FPGA | Demosaicing, JPEG2000, etc.

DSP and VLIW Based Implementations
HiveFlex ISP2300 [41] | Programmable VLIW core with dedicated imaging hardware | Image/video processing
Philips TM1300 [42] | Programmable VLIW core with imaging peripherals | JPEG2000
TMS320C64x [44] [45] | Programmable VLIW DSP | JPEG2000
ADSP-BF535/561 [48] [50] | Programmable DSP | JPEG2000
ARM920T [53] | Programmable DSP | JPEG2000
STMicroelectronics LX-ST230 [53] | Programmable DSP, supports multi-core architecture | JPEG2000

Coarse-Grained Reconfigurable Architectures
CRISP [54] | Dedicated imaging RSPEs with programmable connections and controllers | Gamma correction, demosaicing, median filter, smooth filter
NEC DRP [55] | PE array with programmable connections | JPEG2000
MorphoSys [57] | RC array with TinyRisc and peripherals | DCT, FFT, correlations, etc.
ADRES [59] | Reconfigurable matrix with VLIW | Tiff2BW transform, wavelet transformation

For those products which have embedded processor cores, such as [28], [29] and [30], relatively higher flexibility is offered compared with other pure ASICs. However, the massive human effort and long time-to-market required for development cannot be ignored.

FPGA based solutions offer much more flexibility compared with customised chips while maintaining comparably high throughput. Meanwhile, FPGA based solutions require less development time and human effort than ASICs. However, traditional FPGAs may not be power or area efficient for imaging applications, especially for mobile devices. Although new FPGA devices based on the latest manufacturing processes are released frequently, their actual performance and power dissipation for complex imaging tasks need to be evaluated and tested.

Based on their inherent nature, DSP based solutions offer high flexibility and easy programmability. Meanwhile, the possibility of adding extended imaging instruction sets allows DSPs to be utilised for image processing solutions. However, although the SIMD technique is utilised in some solutions such as [40] in order to increase Data Level Parallelism (DLP), DSP based solutions often suffer from the limited level of ILP found in their typical programs, leading to restricted performance. Moreover, since DSPs usually have quite high working frequencies, their power dissipation can become critical in power-sensitive applications. Coarse-grained reconfigurable architectures fill the gap between traditional FPGAs/ASICs and DSPs.
Compared with customised chips, it is obviously that coarse-grained reconfigurable architectures offer much more flexibility. Meanwhile, based on a set of hardware components and/or reconfigurable connections all of which are reusable and reduced configuration information required, coarse-grained reconfigurable architectures are more area and power efficient compared with fine-grained FPGAs. Moreover, with provided software-like programmability similar to DSPs, coarse-grained reconfigurable architectures are more efficient since their hardware based nature offers high levels of both ILP and DLP. 33 Digital Image Processing Technologies and Architectures Generally, an ideal architecture for imaging solutions should provide high throughput, high flexibility and low power dissipation. Meanwhile, since the amount of data in imaging applications is usually higher than that in other applications such as communication, high levels of both ILP and DLP become critical. Based on all the discussion above, coarse-grained reconfigurable architectures appear to be strong candidates for image processing solutions. Since there are only few coarse-grained reconfigurable architecture based solutions for image processing applications having been proposed, this thesis presents customised dynamically reconfigurable architecture based on coarse-grained Reconfigurable Instruction Cell Array (RICA) [6] paradigm for digital image processing and compression applications such as demosaicing and JPEG2000 standard, which will be detailed in the following chapters. Since different platforms are evaluated from aspects of throughput, power dissipation and flexibility in this chapter, the work described in this thesis will be evaluated with similar metrics. In the following chapters, throughput and area (directly relevant to power dissipation) are mainly used for evaluation. On the other hand, since the flexibility limitation only applies to ASICs, this metric will not be included in the following evaluation. 2.6. Conclusion This chapter has introduced basic theories of digital image processing and compression technologies. With a brief review of different imaging technologies, demosaicing and JPEG2000 compression standard are particularly discussed in detail. In Section 2.2, a number of existing demosaicing algorithms including bilinear, Cok, Freeman, Laroche-Prescott, Hamilton-Adam, Kimmel, Tsai-Acharya, Wenmiao-Peng, are presented. In the following Section 2.3, different modules in JPEG2000 compression standard are introduced. Section 2.4 presents the literature review which mainly focuses on demosaicing algorithms evaluation and imaging solutions based on various architectures. In Section 2.4.1, MSE and PSNR are utilised to evaluate 34 Digital Image Processing Technologies and Architectures performance of different demosaicing algorithms. Based on performance comparisons and complexity evaluation, Freeman algorithm is considered to be a promising method providing both good performance (especially for actual photographic images) and relatively simple structure. In Section 2.4.2, various architectures for image processing and compression solutions are discussed in aspects of throughput, flexibility and power dissipation.Since the targeted architecture in this thesis is DSP-like coarse-grained dynamically reconfigurable, more investigation is launched into solutions based on DSPs and coarse-grained reconfigurable architectures. 
It is concluded that traditional architectures have limitations such as low flexibility (custom chips), high power consumption (FPGAs) and low throughput (DSPs). On the other hand, coarse-grained reconfigurable architectures act as strong candidates for imaging solutions, since they fill the gap between traditional FPGAs/ASICs and DSPs and provide desirable features such as good throughput, high flexibility, relatively low power dissipation, high levels of both ILP and DLP, etc. Based on the above discussion, this thesis aims to develop a customised coarse-grained dynamically reconfigurable architecture for imaging applications including demosaicing and JPEG2000. A dynamically reconfigurable instruction cell array paradigm, which will be introduced in Chapter 3, is chosen to build the proposed architecture. From Chapter 4, this thesis focuses on presenting imaging solutions such as Freeman demosaicing and JPEG2000 on the proposed RICA based architecture.

Chapter 3 RICA Paradigm Introduction and Case Studies

3.1. Introduction

As discussed in Chapter 2, there are several established architectures which can be utilised for image processing solutions. In conclusion, ASICs are well known to provide low power and high throughput compared with other architectures; however, they have both high design costs and limited post-fabrication flexibility. The success of FPGAs lies in their ability to map algorithms onto their logic and interconnects after fabrication, which offers outstanding flexibility. However, an impact on the energy consumption of FPGA based solutions cannot be avoided. DSP&VLIW architectures offer advantages in terms of generic adaptivity and easy programming; however, their performance is curbed by the limited amount of ILP found in typical programs. On the other hand, coarse-grained reconfigurable architectures appear to be strong candidates for image processing solutions, and further investigation is required into their actual potential for imaging applications since little research work has been done in this field.

In recent years, a novel coarse-grained dynamically Reconfigurable Instruction Cell Array [6] has emerged, which promises to be an ideal candidate for high performance embedded image processing applications such as demosaicing and JPEG2000 in next generation digital cameras. By designing the silicon fabric in a similar way to reconfigurable arrays but with a closer equivalence to software, RICA paradigm based architectures can achieve performance comparable to coarse-grained FPGA architectures while maintaining the same flexibility, low cost and programmability as DSPs [6]. A detailed introduction to the RICA paradigm will be given in the next section.

3.2. Dynamically Reconfigurable Instruction Cell Array

3.2.1. Architecture

The idea behind the RICA paradigm is to provide a dynamically reconfigurable fabric that allows specialised circuits to be built for different applications. Instead of using fine-grained CLBs or homogeneous coarse-grained elements like FPGAs and most CGRAs, RICA uses heterogeneous coarse-grained hardware modules termed Instruction Cells (ICs) [6]. Each IC can be configured to perform a small number of operations, as listed in Table 3.1, and the nature of the RICA paradigm is a heterogeneous IC array interconnected through an island-style programmable mesh fabric, as illustrated in Figure 3.1 [6]. All ICs are expected to be independent and can run concurrently.
Having such an array of interconnectable ICs allows circuits to be built from an assembly representation of programs. The configuration of the ICs and interconnections is changeable on every cycle to execute different blocks of instructions.

Figure 3.1 RICA Paradigm [6]

Table 3.1 Instruction Cells in RICA

Instruction Cell | Associated Functions
ADD | Addition and subtraction
MUL | Multiplication
REG | Registers
SHIFT | Shifting
LOGIC | Logic operation
COMP | Comparison
MUX | Multiplexing
I/O REG | Register with access to external I/O ports
RMEM | Interface for reading data memory
WMEM | Interface for writing data memory
I/O Port | Interface for external I/O ports
RRC | Controlling reconfiguration rates
JUMP | Branches
SOURCE | Interface for reading files
SINK | Interface for writing files
SBUF | Interface for accessing stream buffers

As illustrated in Figure 3.1, the processing datapath of RICA is a reconfigurable array of ICs, where the program memory contains the configuration instructions that control both the ICs and the interconnections [6]. The use of an IC-based reconfigurable architecture as a datapath gives important advantages over DSPs and VLIWs, such as better support for parallel processing. The RICA architecture can execute a block containing both independent and dependent instructions in the same cycle, which prevents dependent instructions from limiting the amount of ILP in the program [6]. Different from traditional DSPs, RICA has a reconfigurable datapath, which implies that it does not have fixed clock cycles but an operation chain; this means that RICA can execute both dependent and independent instructions in parallel within one configuration context. In this case, the cycle in the RICA architecture is termed a step. In contrast to traditional processors, which pipeline the computation units on critical paths to improve throughput, the RICA architecture introduces variable clock cycles for different steps so that steps with longer critical paths consume more clock cycles.
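As a simple illustration of the step concept described above, consider the hypothetical C fragment below. The first three operations form a dependent chain, while the last one is independent of them. On a conventional single-issue processor each operation takes its own cycle; according to the behaviour described above, the RICA scheduler can, resources permitting, chain the dependent operations through the interconnect and place the independent subtraction alongside them within one step. The example is illustrative only and is not taken from the RICA tool documentation.

/* Hypothetical fragment: a dependent chain (t1 -> t2 -> y) plus one
 * independent operation (*z), which RICA can schedule into a single step. */
static int step_example(int a, int b, int c, int d, int e, int f, int *z)
{
    int t1 = a * b;          /* MUL cell                        */
    int t2 = t1 + (c << 2);  /* SHIFT cell feeding an ADD cell  */
    int y  = t2 + d;         /* ADD cell, end of the chain      */
    *z = e - f;              /* independent, runs in parallel   */
    return y;
}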
Due to the heterogeneous nature of RICA, one of its salient characteristics is that the IC array can be customised at the design stage according to the requirements of the targeted application in terms of the number and type of ICs, which leads to efficient computational resource utilisation and better system performance. Another distinction from a conventional processor is the memory access pattern. The RICA paradigm allows multiple simultaneous read and write operations to multiple memory locations within one step. Since there are four memory banks in the current RICA paradigm, a total of four memory read and four memory write operations are supported in a single step.

3.2.2. RICA Tool Flow

An automatic tool flow has been developed for the simulation of RICA paradigm based architectures. The tool takes a definition of the available ICs in the array along with other parameters such as their count, bitwidth and the type of interconnections. These specified hardware resources can be modelled using a simulator written in high-level C/C++ code [6]. If the required performance, determined through the RICA software simulator, is not met, the developer can modify the original code or change the mixture of the available IC resources to improve the performance. RICA supports pure ANSI-C programmability in a manner very similar to conventional processors and DSPs. A dedicated tool flow [6] for RICA has been developed which comprises a compiler, scheduler, placement & routing, simulator and emulator. Figure 3.2 illustrates the working mode of the RICA tool flow.

Figure 3.2 RICA Tool Flow

Compiler: The compiler takes the high-level C code and transforms it into an intermediate assembly language format [6]. This transformation is performed by an open source GCC compiler. After the compilation, a RICA-targeted assembly file is obtained, which consists of instructions for the ICs in the RICA based architecture.

Scheduler: The RICA scheduler takes the assembly file generated by the compiler and tries to create a netlist to represent the program. The netlist contains blocks of instructions that will be executed in a single step [6]. The partitioning into different steps is performed after scheduling the instructions and investigating the dependencies between them. Within a step, dependent instructions are connected in sequence, and independent instructions are executed in parallel. The scheduler [60] takes into account the available ICs, interconnections and timing constraints in the array, specified by a Machine Description File (MDF). It also performs optimisations on the code, such as removing temporary registers [6].

Simulator: The simulator takes the instruction blocks in the netlist file and executes them step by step. It also takes into account the timing constraints defined in the netlist to estimate the execution time of the current application. The output of the RICA simulator includes a profile, which contains the execution time, number of steps, etc.; an execution trace, which is a detailed trace indicating how the application is executed; and a memory dump containing the data written into the memory.
Placement and Routing: If the RICA based architecture needs to be mapped to a physical chip, a tool is provided to minimise the distance when allocating all the ICs and connecting them to each other [6]. When the placement and routing netlist file is generated, the configuration bits can be generated.

3.2.3. Optimisation Approaches to RICA Based Applications

As a dynamically reconfigurable architecture, RICA needs to be reconfigured for each step. The time consumed by fetching and loading configuration instructions and configuring the IC array is called the configuration latency. Since different steps occupy different numbers of ICs, the configuration latency is variable. Generally, the configuration latency for a certain step is smaller than the step execution time. Therefore, the configuration instruction set for the next step can be pre-fetched while the current step is executing in order to eliminate the latency. This kind of pre-fetching can work only if there is no conditional branch involved in the current step; otherwise the location of the next step is unknown until the branch condition is computed.

Meanwhile, when successive iterations of a loop exist in the application and the loop can be placed into one single step, the instruction fetch scheme associated with RICA allows the instruction set for the loop to be fetched only once from the instruction memory before executing the loop. This kind of step is termed a kernel. When executing a kernel, configuration latency is introduced only at the first iteration and nowhere else during the subsequent iterations [6]. Moreover, a kernel can be pipelined into several stages, which shortens its critical path and reduces the execution time.

A simple example is provided here to indicate the performance improvement obtained by constructing kernels instead of keeping the code in separate steps. Given the following code:

for (i = 0; i < 300; i++) {
    if (a[i] > b[i])
        e[i] = a[i] * b[i];
    else
        e[i] = 0;
}

This fragment of code is scheduled into two steps by the RICA tool flow, and the execution time provided by the simulator is 7.842us. If the code is modified as follows:

for (i = 0; i < 300; i++) {
    asm volatile ("MUX \tout= %0 \tin1= %1 \tin2= %2 \tsel=%3 \tconf= `MUX_COND_NEZ_SI"
                  : "=r" (e[i])
                  : "r" ((a[i])*(b[i])), "r" (0), "r" ((a[i])>(b[i])));
}

This modification uses a multiplexer to generate the required output instead of a conditional branch. The RICA tool flow places this modified code into a single step (kernel), and the reported execution time is 3.948us. From this simple example, it is obvious that constructing kernels is essential for RICA based applications. In order to construct kernels, the conditional branches existing in the code must first be eliminated; usually multiplexers are used to realise such eliminations. Meanwhile, the available IC resources in the RICA architecture must satisfy the minimum requirements of all instructions in a kernel, otherwise the RICA architecture needs to be tailored. Moreover, the memory accesses in a kernel should not exceed the maximum allowance of the RICA architecture (four writes and four reads per step).
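The same branch-elimination idea can also be expressed in portable C without the RICA-specific intrinsic, by computing both candidate results and selecting between them with a mask. The sketch below illustrates the general technique only; whether a particular compiler maps it onto a MUX cell is tool-dependent, and the explicit asm intrinsic shown above makes the mapping on RICA unambiguous.

/* Portable sketch of branch elimination: evaluate both candidates and
 * select with a mask derived from the comparison, so no conditional
 * branch appears in the loop body. */
void select_without_branch(const int *a, const int *b, int *e, int n)
{
    for (int i = 0; i < n; i++) {
        int mask = -(a[i] > b[i]);   /* all ones when a[i] > b[i], otherwise zero */
        e[i] = mask & (a[i] * b[i]); /* product when the condition holds, else 0  */
    }
}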
3.3. Case Studies

Based on the introduction to the RICA paradigm, several applications have been implemented on the RICA based architecture and evaluated in terms of performance and efficiency. The following two sections discuss a Reed-Solomon (RS) decoder and a WiMAX OFDM symbol timing synchronisation engine, which have been implemented on customised RICA based architectures and optimised for performance improvement. These two implementations are used as case studies targeting RICA based applications. Since these two communication applications are not directly relevant to the main work in this thesis, only a brief introduction is presented here; the detailed work can be found in the author's previously published work [61-62] for the RS decoder and [63] for the OFDM timing synchronisation engine.

Reed-Solomon Decoder: The RS coding algorithm is constructed over a Galois Field (GF), which has its own arithmetic rules. The conventional approach uses Look-Up Tables (LUTs) to calculate multiplications in GF. In this case study, a 32-bit GF Multiplier (GFMUL) is employed as a custom IC integrated into the RICA paradigm based architecture, which significantly reduces the computational complexity. Meanwhile, the SIMD technique is applied to accelerate the coding process, and kernels are constructed for certain modules such as syndrome calculation and Chien search.

OFDM Timing Synchronisation Engine: This work utilises the Maximum Likelihood (ML) estimation algorithm [64] to estimate the OFDM symbol time offset. Instead of using memory blocks, two shifting register windows are constructed for the accumulating calculation in the algorithm. Kernels are also constructed for the entire engine.

3.4. Outcomes of Case Studies

Although the two case studies performed in this thesis are communication tasks rather than imaging applications, they both investigate the potential of the RICA paradigm thoroughly and indicate the possible optimisation approaches. The RS decoder was implemented on a customised RICA based architecture with GFMUL cells. With the SIMD technique, the RICA paradigm demonstrates a high level of DLP, which is a desired feature for image processing applications. Meanwhile, the ILP nature of the RICA paradigm enables multiple GFMUL cells, together with a number of other ICs, to be executed simultaneously within a step. With high levels of both DLP and ILP, the RICA paradigm based architecture is expected to be able to provide good performance for image processing solutions.

The OFDM timing synchronisation engine was implemented on a tailored RICA architecture. Two 1-D shifting windows in the ML algorithm were utilised in order to reduce memory accesses. For image processing tasks such as demosaicing and filtering, the 1-D shifting window can be easily extended to establish a 2-D window with a couple of registers. With the 2-D window moving along every row/column of an image, the RICA paradigm based architecture can efficiently deal with different imaging tasks.

Considering the two case studies, the most important common feature is the construction of kernels. As discussed, with a kernel, the instruction set is only fetched once before the kernel starts, and there is no configuration latency during the execution.
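Alongside kernel construction, the shifting register window used for the ML accumulation above is a pattern that recurs in the imaging engines later in this thesis. A simplified, real-valued sketch is given below; the window length and the use of scalar samples are illustrative assumptions, whereas the actual engine operates on complex correlation terms [63].

#define WIN 16   /* illustrative window length, not the actual parameter */

typedef struct {
    int  buf[WIN]; /* shifting window contents, assumed zero-initialised */
    int  head;     /* index of the oldest sample                         */
    long sum;      /* running accumulation over the last WIN samples     */
} sliding_acc_t;

/* Push one new sample: add the newest and drop the oldest, so each update
 * costs one add and one subtract instead of re-reading WIN values from
 * memory. */
static long sliding_acc_push(sliding_acc_t *w, int sample)
{
    w->sum += sample - w->buf[w->head];
    w->buf[w->head] = sample;
    w->head = (w->head + 1) % WIN;
    return w->sum;
}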
It is worth noticing that in most image processing tasks such as demosaicing and JPEG2000, the image is scanned by the processing engine within a 2-D loop: 1-D for horizontal scanning (every line) and 1-D for vertical moving (when a new line starts). Thus, if the processing engine can be placed into a single kernel, there will be no configuration latency when processing the horizontal scanning. When a new line starts, the kernel can be either kept the same or reconfigured, depending on the task. The configuration latency is only introduced when performing the vertical moving loop and nowhere else. In this case, RICA paradigm based architectures can be extremely efficient for image processing applications. Another promising outcome in common is RICA paradigm’s tailorable nature. In order to satisfy the requirement of constructing a kernel, ICs in RICA based architecture can be tailored targeting the maximum usage of computational resource. Moreover, since there are different modules with various complexities involved in complex image processing applications such as JPEG2000, RICA based architecture can be dynamically reconfigured to ensure different tasks are assigned with proper computational resources. When switching from a computation intensive task to a simple task, the 44 RICA Paradigm Introduction and Case Studies redundant computational resources can be bypassed in order to reduce energy dissipation. Through the case studies, advantages of RICA based architecture are concluded as follows: High levels of both DLP and ILP, providing high throughput. Kernels for configuration latency reduction especially when targeting image processing applications. Customisable and tailorable nature. Flexible reconfigurability. Low power nature. With these advantages, RICA based architectures are considered to be strong candidates for solutions to image processing applications. In the following chapters, this thesis will aim at developing customised dynamically reconfigurable RICA based architectures targeting various image processing applications such as demosaicing and JPEG2000. 3.5. Prediction of Different Imaging Tasks on RICA Based Architecture Based on the discussion in previous sections, the nature and potential of RICA based architecture are clearly clarified. In this section, different imaging tasks introduced in Section 2.2 (Freeman demosaicing) and 2.3 (core tasks in JPEG2000 including 2-D DWT, CM and AE) are roughly evaluated targeting implementations on RICA based architecture. The computational aspects are characterised, and the possible performance is predicted. Freeman demosaicing: When looking into the algorithm, the first stage, bilinear demosaicing, can be efficiently implemented on RICA based architecture since it mainly involves simple additions and shifting operations. When processing different lines, multiplexers can be utilised to avoid possible conditional branches. On the other hand, how to efficiently implement the median filter on RICA based architecture becomes challenging, since the sorting operations in the median filter will 45 RICA Paradigm Introduction and Case Studies introduce loads of data swaps and conditional branches. If possible, a simplified median filtering algorithm should be applied. 2-D DWT: The lifting-based 2-D DWT in JPEG2000 significantly reduces the computational complexity existing in the traditional DWT architecture. 
Since the lifting-based architecture has only a small number of additions and shifting operations and no conditional branches, the RICA based architecture is expected to deliver quite high performance for DWT implementations. Moreover, the RICA based architecture supports multiple simultaneous memory read/write operations, which makes it possible to design a 2-D DWT architecture with high parallelism.

CM: This is the most computationally intensive module in JPEG2000. On one hand, the CM algorithm scans 4 bits in a stripe column from top to bottom. This means that it is possible to develop a parallel architecture which will accelerate the processing significantly. In this case, the high levels of both DLP and ILP in the RICA paradigm enable such a parallel architecture to be implemented. On the other hand, since there are four primitive coding schemes in CM and the output is generated by one of them according to different conditions, the CM algorithm is actually branch-intensive. Although multiplexers can be utilised to avoid branches and construct kernels, the increase in required computational resources cannot be ignored, as all four coding schemes need to be executed before the final output can be selected by multiplexers. In this case, throughput and computational resource requirements need to be carefully balanced when implementing CM on RICA.

AE: The computation in AE is actually quite simple. However, the AE algorithm is also branch-intensive. Different from CM, AE has a serial architecture, which means that the critical path of the constructed kernel will be quite long even if all the branches are eliminated. It is worth implementing and optimising AE on the RICA based architecture to see the actual performance. If the performance is not good enough, another platform can be considered as a better solution.

Generally, given a new algorithm, it should be evaluated from three aspects to predict whether the RICA based architecture is suitable for its implementation:

1. Is the algorithm branch-intensive?
2. Does the algorithm require frequent memory accesses?
3. Is there any inherent parallelism in the algorithm?

In the following chapters, these imaging tasks are implemented on the RICA based architecture and their performance is evaluated. Moreover, for any given new algorithm, its nature can be evaluated and compared with these imaging tasks. If there is some similarity between the given new algorithm and these imaging tasks, the given algorithm's performance on the RICA based architecture can be roughly predicted.

3.6. Conclusion

In this chapter, the RICA paradigm is introduced. Based on a coarse-grained architecture consisting of various ICs and reconfigurable connections, the RICA paradigm offers high levels of both DLP and ILP, outstanding flexibility and easy programmability, which are desirable features for imaging solutions. A dedicated tool flow associated with the RICA paradigm was introduced, which consists of a compiler, scheduler, placement & routing and simulator. This tool flow can provide developers with the number of steps, required computational resources, execution time, etc. In Section 3.3, two case studies, the RS decoder and the WiMAX OFDM symbol timing synchronisation engine, are discussed in order to investigate the potential of RICA based architectures. With the RS decoder, the RICA paradigm's customisable and tailorable nature is well explored.
Meanwhile, it is shown that the RICA paradigm offers high levels of both DLP and ILP, which is an advantage over DSPs and VLIWs. By implementing the WiMAX OFDM symbol timing synchronisation engine, it is clear that a shifting window with registers can be efficiently established on the RICA based architecture, and this shifting window can be widely used in various imaging tasks.

Kernel construction is one of the most important outcomes found through the two case studies. For imaging tasks, it is possible to place the processing engine within a single kernel. In this case, the processing time is significantly shortened, as configuration latency is only introduced when a new image line starts and nowhere else. Another promising outcome is the RICA paradigm's tailorable nature. The numbers of different ICs in the RICA based architecture can be tailored to adapt to various tasks with maximum computational resource usage. When switching from computationally intensive applications to relatively simple tasks, the redundant ICs can be bypassed. This ensures the RICA paradigm's power-saving nature.

Based on all the above discussion, the advantages of the RICA based architecture are summarised. With these promising features, the RICA based architecture is shown to be a strong candidate for solutions to image processing applications. In Chapter 4, this thesis will aim at developing a customised dynamically reconfigurable RICA based architecture targeting the Freeman demosaicing algorithm. From Chapter 5, an efficient RICA based solution for JPEG2000 will be presented.

Chapter 4 Freeman Demosaicing Engine on RICA Based Architecture

4.1. Introduction

This chapter proposes a Freeman demosaicing engine implemented on the RICA based architecture. The demosaicing engine is highly optimised with an efficient data buffer rotating scheme and a pseudo median filter. Simulation results demonstrate that the proposed Freeman demosaicing engine can process a 648x432 image within 2ms. Moreover, based on the algorithm investigation, a dual-core RICA based architecture is developed, and the demosaicing algorithm is partitioned and mapped onto the dual-core architecture in order to provide higher performance.

4.2. Freeman Demosaicing Algorithm

Based on the evaluation presented in Chapter 2, the Freeman demosaicing algorithm is chosen for the targeted application due to its overall good performance and low implementation complexity [65]. Figure 4.1 (a) illustrates the Freeman algorithm architecture, consisting of bilinear demosaicing and median filtering. The first stage estimates the missing colour components for each pixel by the bilinear interpolation algorithm shown in formulas (4.1)-(4.6), which may change according to the different pixel layouts in the Bayer pattern, as illustrated in Figure 4.1 (b). After the first stage, three intermediate colour planes are obtained.

Figure 4.1 (a) Freeman Demosaicing Architecture (b) Bilinear Demosaicing for Bayer Pattern

In the second stage, the colour value differences (red minus green (R-G) and blue minus green (B-G)) are median filtered.
Median filter is widely used in the field of image processing as a non-linear digital filtering technique. The main idea for median filter is running through the input signal entry by entry and replacing each entry with the median of its neighbouring entries. The pattern of neighbours is called filter window, which slides over the entire signal entry by entry [66]. When processing a 2-D image, its mathematical formula can be represented as Yij = Med{Xij}, where Xij is the set of pixels in the filter window. Median filter is employed in order to reduce effects of fringing from images by removing sudden jumps in hue, which has been discussed in Chapter 2. In Freeman demosaicing, the median filter is based on using a shifting window over the image and calculating the median value of pixels within the window for the output [67]. Since the demosaicing artifacts are generally manifest as small chromatic splotches, median filtering the R-G and B-G colour planes tends to eliminate the artifacts efficiently [68]. The final interpolated image is generated by adding the median filtered colour difference to the corresponding pixel value, for example, the red value in position 32 in Figure 4.1 (b) is obtained by adding the filtered R-G value to the sampled green value at this position. Estimated results are used only at positions where the original sampled colour pixel is different, for example, it is not necessary to estimated blue value at position 32. 𝑅22 = (𝑅11 + 𝑅13 + 𝑅31 + 𝑅33 )/4 (4.1) 𝐺22 = (𝐺12 + 𝐺21 + 𝐺32 + 𝐺23 )/4 (4.2) 50 Freeman Demosaicing Engine on RICA Based Architecture 𝐵22 = 𝐵22 (4.3) 𝑅32 = (𝑅31 + 𝑅33 )/2 (4.4) 𝐺32 = 𝐺32 (4.5) 𝐵32 = (𝐵22 + 𝐵42 )/2 (4.6) 4.3. Freeman Demosaicing Engine Implementation In order to implement the Freeman demosaicing algorithm on RICA based architecture efficiently, a set of data buffers are employed as the intermediate storage for line pixels. Each data buffer’s capacity is set to be 2048x32bit so the engine can support image with maximum 2048 pixels per line. Figure 4.2 illustrates the RICA based Freeman demosaicing engine architecture. A 3x3 shifting window is employed for both bilinear demosaicing and median filter. Each line of the window corresponds to one line in the image. As illustrated in Figure 4.3, pixels belonging to the first two lines of the shifting window (line 1, 2) are read out from data buffers by SBUF cells, while the third line (line 3) is 6 rotative buffers Intermediate Colour Planes buf 3 buf 4 Obtained from Stage 1 Directly 2 rotative buffers buf 1 buf 2 Read from image Original Image Stage 1 Red buf 5 Green Obtained from Stage 1 Directly Blue buf 8 Bilinear Demosaicing (3x3 window) buf 6 Stage 2 Red Median Filter (3x3 window) Green Blue buf 7 Obtained from Stage 1 Directly Figure 4.2 Freeman Demosaicing Implementation Architecture Source image with 3x3 shifting window Line 1 Line 2 (read by sbuf 1 from buf 1) (read by sbuf 1 from buf 2) Line 2 Switch Read Addresses (read by sbuf 2 from buf 2) Line 3 (read from source) The data in line 3 are written back to buf 1 Figure 4.3 Data Buffers Addresses Rotation 51 Line 3 (read by sbuf 2 from buf 1) Line 4 (read from source) Freeman Demosaicing Engine on RICA Based Architecture RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB RGB G G B G B R G R G G B G B RGB RGB R-> RGB R G R G Shifting Window for Bilinear Demosaicing G B G B Shifting Window for Median Filter Figure 4.4 Parallel Architecture for Freeman Demosaicing directly read from the original source image via SOURCE cell. 
After the data within the shifting window have been processed, the data in line 3 are written back into buffer 1 to replace the oldest line (line 1). When a line is finished, the read addresses for buffers 1 and 2 are switched so that these two buffers correctly hold the data belonging to lines 2 and 3 for the next line iteration, without any change to the code. As three separate colour planes are generated after bilinear demosaicing, a total of 8 data buffers are occupied for the overall implementation: two for bilinear demosaicing and six for median filtering. For the complete demosaicing engine implementation, it is much more efficient if the two stages are executed in parallel rather than serially. Figure 4.4 illustrates the parallel architecture from the view of shifting windows. Since the median filter requires outputs from bilinear demosaicing as its input, the bilinear demosaicing shifting window needs to be processed prior to the median filter window. The optimised parallel demosaicing engine is based on a line-by-line scanning pattern. With the same 3x3 shifting window, the first two lines of pixels are fed into the demosaicing engine and are then stored in data buffers. When the third line begins, its pixels are read directly from the source image and fed into the bilinear demosaicing shifting window, and the pixel in the centre of the window is interpolated. As shown in Figure 4.4, the centre pixel in the red broken block is interpolated to a full RGB colour set by bilinear demosaicing. The interpolated pixels are then stored in three separate data buffers (R/G/B) as the intermediates for the median filter. In order to build the 3x3 median filter shifting window, two lines of bilinear interpolated pixels are required as a precondition, which means a total of four lines of pixels need to be fed into the engine before median filtering starts, and six intermediate data buffers are needed to store the different colour components belonging to different lines. When the fifth line starts, the median filter window fetches interpolated pixels from the intermediate data buffers for its first two lines, and the interpolated pixels belonging to its third line come directly from the bilinear demosaicing window, as illustrated in Figure 4.4. The output of the bilinear demosaicing window (the centre pixel of the red broken block) becomes the lower-right corner pixel of the median filter window (the blue broken block). In this way, the two shifting windows can slide simultaneously, constructing a demosaicing engine with a parallel architecture and thereby increasing the processing speed. A detailed processing flowchart is illustrated in Figure 4.5 (Freeman Demosaicing Execution Flowchart). 4.4. System Analysis and Dual-Core Implementation 4.4.1. System Analysis From the implementation point of view, bilinear demosaicing is a quite simple module. The shifting window is established with a set of registers.
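To make the window organisation concrete, the following is a minimal ANSI-C sketch of the bilinear estimation in formulas (4.1)-(4.6) for two representative 3x3 window positions from Figure 4.1 (b): one centred on a blue sample (B22) and one centred on a green sample on a red-green row (G32). The rgb_t type, the win[][] window array and the plain integer division are illustrative assumptions and do not reproduce the actual engine code.
…………………………………………………………………………
/* Bilinear estimation for one 3x3 window; win[r][c] holds the raw Bayer
   samples currently inside the register-based shifting window. */
typedef struct { int r, g, b; } rgb_t;

/* Window centred on a blue sample (B22): formulas (4.1)-(4.3). */
static rgb_t bilinear_blue_centre(int win[3][3])
{
    rgb_t out;
    out.r = (win[0][0] + win[0][2] + win[2][0] + win[2][2]) / 4; /* (4.1) diagonal reds */
    out.g = (win[0][1] + win[1][0] + win[1][2] + win[2][1]) / 4; /* (4.2) edge greens   */
    out.b = win[1][1];                                           /* (4.3) sampled blue  */
    return out;
}

/* Window centred on a green sample on a red-green row (G32):
   formulas (4.4)-(4.6); red comes from the horizontal neighbours and
   blue from the vertical neighbours. */
static rgb_t bilinear_green_centre_rg_row(int win[3][3])
{
    rgb_t out;
    out.r = (win[1][0] + win[1][2]) / 2;  /* (4.4) */
    out.g = win[1][1];                    /* (4.5) */
    out.b = (win[0][1] + win[2][1]) / 2;  /* (4.6) */
    return out;
}
…………………………………………………………………………
The remaining Bayer positions are handled by the symmetric variants of these two cases, selected without branches as described next.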
As mentioned in Section 4.2, the estimation formulas differ according to the pixel position in the Bayer CFA. In order to eliminate conditional branches in the code, a set of multiplexers is employed to generate the conditional outputs of each pixel's estimation. The median filter has been analysed in Section 4.2. Typically, by far the majority of the computational effort and time is spent on calculating the median of each window. When the image is large, the efficiency of the median filter becomes a critical factor in determining the algorithm's speed, as the filter window must process every entry in the input signal [66]. The key module in the median filter is the sorting algorithm that finds the median value among a number of pixels. Traditional sorting approaches such as the selection algorithm and histogram medians [66] are time and energy consuming as they both involve massive iterations and data swaps. In this Freeman demosaicing engine, a pseudo median filter [69], [70] is utilised, as it significantly shortens the processing time while maintaining a good PSNR for the interpolated image. For a 3x3 filter window, the pseudo median filter calculates the median value of each column in the window, and then the median of the three intermediate medians is obtained as the final output. As the median of three pixels can be calculated easily, this pseudo median filter is well suited to embedded applications, since the iterations and swaps of traditional sorting algorithms are eliminated. Meanwhile, the performance of the pseudo median filter is very similar to that of the true median filter. A comparison given in [70] demonstrates that the pseudo median filter provides a normalised MSE of 0.0521 when filtering a noisy girl image, relative to the original image, while the true median filter delivers a normalised MSE of 0.0469 (the unfiltered noisy image has a normalised MSE of 2.9494 compared with the original image). Figure 4.6 (a) Pseudo Median Filter (b) Median Filter Reuse Figure 4.6 (a) illustrates the pseudo median filter computation. The median value of each column is calculated first, as marked with the vertical broken blocks. The corresponding median values are recorded as M1, M2 and M3, whose median is then computed as the final output. Only MAX and MIN operations are needed to seek the median value among three samples, and these can be implemented via comparators and multiplexers in order to construct the potential kernel for the iteration. A plain-C sketch of the complete 3x3 pseudo median filter is given below, followed by the RICA inline-assembly form of the three-sample median selection, in which 3 comparators and 5 multiplexers are needed for each median-seeking operation.
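Before the RICA inline-assembly listing that follows, the plain-C sketch below shows the complete 3x3 pseudo median computation (the three column medians M1-M3, followed by the median of those medians, as in Figure 4.6 (a)). The median3() helper is a hypothetical illustration written with MIN/MAX style selections only, mirroring the comparator and multiplexer structure used on RICA; it is not the engine's actual code.
…………………………………………………………………………
/* Median of three values using only min/max style selections:
   median(a,b,c) = min( max(a,b), max(min(a,b), c) ).
   Each ?: selection corresponds to one comparator driving one multiplexer. */
static int median3(int a, int b, int c)
{
    int lo = (a < b) ? a : b;    /* min(a, b)                     */
    int hi = (a < b) ? b : a;    /* max(a, b)                     */
    int mx = (lo < c) ? c : lo;  /* max(min(a, b), c)             */
    return (hi < mx) ? hi : mx;  /* min of the two partial maxima */
}

/* Pseudo median of a 3x3 window: median of each column first,
   then the median of the three column medians. */
static int pseudo_median3x3(int w[3][3])
{
    int m1 = median3(w[0][0], w[1][0], w[2][0]);
    int m2 = median3(w[0][1], w[1][1], w[2][1]);
    int m3 = median3(w[0][2], w[1][2], w[2][2]);
    return median3(m1, m2, m3);
}
…………………………………………………………………………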
……………………………………………………………………………………………………………
Seeking the median value among three samples (a, b, c) using comparators and multiplexers:
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = 'MUX_COND_NEZ_SI" : "=r" (out1) : "r" (a) , "r" (b) , "r" (a<b)); // out1 = the smaller of a and b
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = 'MUX_COND_NEZ_SI" : "=r" (out2) : "r" (b) , "r" (a) , "r" (a<b)); // out2 = the larger of a and b
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = 'MUX_COND_NEZ_SI" : "=r" (out1) : "r" (out1) , "r" (c) , "r" (out1<c)); // out1 contains the minimum value of the three samples
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = 'MUX_COND_NEZ_SI" : "=r" (out3) : "r" (c) , "r" (out1) , "r" (out1<c));
asm volatile ("MUX \tout = %0, \tin1 = %1 \tin2 = %2, \tsel = %3 \tcof = 'MUX_COND_NEZ_SI" : "=r" (out) : "r" (out2) , "r" (out3) , "r" (out2<out3)); // out contains the final median value
……………………………………………………………………………………………………………
Given that the shifting window moves from left to right, it is possible to reuse intermediate results of the median filter. As illustrated in Figure 4.6 (b), the shifting window initially contains pixels P11 to P33, and the intermediate values are M1~M3, which lead to output 1. When the shifting window moves one pixel horizontally to the right, its contents are updated to P12 to P34 and M4 becomes the newest intermediate. In this case, both the intermediates M2 and M3 can be reused to calculate output 2, and only the new intermediate value M4 needs to be calculated. 4.4.2. Dual-Core Implementation When the Freeman demosaicing algorithm is analysed, it can be seen that even lines and odd lines of the image are processed with different equations at the bilinear interpolation stage. It is therefore possible to have the demosaicing engine running on two processor cores, handling even and odd lines separately. The two cores work in parallel, and hence a further throughput improvement is likely. In order to realise a RICA based multi-core implementation, the overall application first needs to be partitioned into several tasks. Each task is then compiled and simulated separately on a single RICA core, and an execution trace file is generated for each task, containing the static timing and the information for communication instructions. After that, tasks are mapped onto the RICA based multi-core architecture according to a certain mapping methodology [71]. Generally, this mapping methodology analyses both the static timing and the dynamic timing during simulation (Figure 4.7 Mapping Methodology for MRPSIM). Static timing represents the time consumed by the combinatorial critical path in each step, which is not affected by the run-time execution. Dynamic timing refers to the time taken by communication instructions such as memory writes/reads, which can only be determined at run-time in the multiprocessor simulation due to multiple memory accesses [71]. Once the tasks are mapped, the Multiple Reconfigurable Processor Simulator (MRPSIM) presented in [71] analyses the execution traces and obtains the dynamic delays.
Only communication instructions which will contribute to the dynamic timing are modelled by MRPSIM. Other inputs for MRPSIM include the RAM files and an Architecture Description File (ADF). After performing simulations with MRPSIM, the generated results are used as feedback to change the design strategies such as task partitioning and architecture customisations in order to achieve better performance [71]. The complete execution flow graph is illustrated in Figure 4.7. Figure 4.8 illustrates the Freeman demosaicing engine based on a dual-core architecture. As discussed previously, the complete demosaicing engine is 57 Freeman Demosaicing Engine on RICA Based Architecture Original Bayer Image Proc. 1 Proc. 2 lock Bilinear demosaicing (odd lines) Bilinear demosaicing (even lines) Intermediate Shared Memory Primitive Interpolated Intermediate (odd lines) Median filtering (odd lines) Primitive Interpolated Intermediate (even lines) lock Median filtering (even lines) lock Final Full-Interpolated Image Figure 4.8 Dual-Core Freeman Demosaicing Engine Architecture partitioned into two tasks, one for odd lines and the other for even lines. The two tasks are then mapped onto the RICA based dual-core architecture. In order to reduce the implementation complexity, intermediate outputs from bilinear demosaicing are stored in shared data buffers instead of being immediately ready for median filter, which is different from the single-core implementation. As the two processor cores need to share data during processing, a scheme termed improved spinlock [72] is utilised to make sure accesses to the shared memory are synchronised without any conflict between the two cores. A spinlock is a lock (synchronisation variable) where a task repeatedly checks based on busy-waiting scheme, and it allows only one task to access the shared resource protected by the lock at any given time. This method has been improved in [72] which allows the requiring processor core going to sleep when the lock is unavailable instead of keeping checking, and the core will be waked up through an inter-processor interrupt when the lock is released. Therefore, memory access conflicts due to the busy-waiting based spinlock are eliminated [72]. In the proposed demosaicing design, improved spinlocks are utilised when the processor cores loading data from the original Bayer source image, 58 Freeman Demosaicing Engine on RICA Based Architecture requiring primitive interpolated data from the intermediate shared memory and writing the final interpolated data to the output file. Figure 4.9 provides a pseudo code for the dual-core implementation. The two tasks mapped onto two cores are executed in parallel. The locks control the access to the shared memory as well as the source and output file. When iteration finishes, all the locks for processor 1 and 2 are re-initialised in order to avoid unexpected lock states. 
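To illustrate the locking behaviour described above, the following C sketch contrasts a plain busy-waiting spinlock with the sleep/wake refinement in the spirit of [72], using C11 atomics. The wait_for_interrupt() and send_ipi() primitives are hypothetical placeholders for the platform's inter-processor interrupt mechanism; the sketch illustrates the concept and is not the dual-core engine's actual synchronisation code.
…………………………………………………………………………
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

/* Plain spinlock: the requesting core keeps checking (busy-waiting),
   which wastes shared-memory bandwidth while the lock is held. */
static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* busy-wait */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

/* Improved spinlock: if the lock is unavailable the core sleeps instead
   of polling, and the releasing core wakes it with an inter-processor
   interrupt. wait_for_interrupt() and send_ipi() are hypothetical. */
extern void wait_for_interrupt(void);
extern void send_ipi(int core_id);

static void improved_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        wait_for_interrupt();   /* sleep until the lock owner releases */
}

static void improved_unlock(int waiting_core)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
    send_ipi(waiting_core);     /* wake the core blocked on this lock  */
}
…………………………………………………………………………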
59 Freeman Demosaicing Engine on RICA Based Architecture for (j=0, j<number of line-1, j+=2) { Initialise the locks for both processor 1 and 2; processor 1 lock = 0; // available processor 2 lock = 1; // unavailable for (i=0, i< number of pixels per line, i++) { // Processor 1 check&set lock; source Bayer image accessing; release lock for processor 2; bilinear demosaicing; check&set lock; shared memory accessing; release lock for processor 2; median filtering; check&set lock; final output file accessing; release lock for processor 2; // Two processors work in parallel // Processor 2 check&set lock; source Bayer image accessing; release lock for processor 1; bilinear demosaicing; check&set lock; shared memory accessing; release lock for processor 1; median filtering; check&set lock; final output file accessing; release lock for processor 1; } re-initialise all locks for processor 1 and 2; } Figure 4.9 Pseudo Code for Dual-Core Implementation 60 Freeman Demosaicing Engine on RICA Based Architecture 4.5. Optimisation As presented in previous chapters, a salient characteristic of RICA paradigm is its ability to be customised at the design stage according to application requirements. For multi-core applications, each core can be tailored differently to build an adaptive heterogeneous multi-core platform. Table 4.1 shows numbers of ICs for both customised single-core and dual-core architectures. The ICs numbers are given by the simulator proposed in [71]. This tailorable nature of RICA paradigm eliminates the redundant ICs in different applications and hence the energy and area consumption is decreased. As proposed in Chapter 3, RICA paradigm supports development by using high level languages such as C in a manner very similar to conventional microprocessors and DSPs. The ANSI-C programs can be compiled into a Table 4.1 Instruction Cells Occupied by Freeman Demosaicing Engine (a) Single-Core Architecture Cell Proc. Cell Proc. ADD 20 SOURCE 1 LOGIC 1 SINK 3 SHIFT 5 WMEM 4 MUX 29 RMEM 4 COMP 23 JUMP 1 REG 177 RRC 1 SBUF 12 Total Area (um2) 160033 (b) Dual-Core Architecture Cell Proc. 1 Proc. 2 Cell Proc. 1 Proc. 2 ADD 15 9 SOURCE 1 1 LOGIC 2 2 SINK 3 3 SHIFT 5 5 WMEM 4 4 MUX 29 27 RMEM 4 4 COMP 23 20 JUMP 1 1 REG 135 112 RRC 1 1 SBUF 12 12 Total Area 157184 125226 61 Freeman Demosaicing Engine on RICA Based Architecture sequence of assembly configuration instruction sets, termed steps. The content of each step is executed concurrently by RICA according to the availability of hardware resources and data dependence, and kernels are desirable for applications with large number of iterations especially when processing images. For the single-core implementation, bilinear demosaicing and median filter are integrated into one kernel; while these two modules are put into two small kernels for each core when targeting dual-core implementation to reduce the computational complexity. In order to eliminate conditional branches in the application which will break kernels into separate steps, the most common technique employed in RICA based architectures is using multiplexers to realise conditional selections instead of branches. As there may be several hundreds of pixels per line in the image, the kernel may loop many times. In this case software pipelining technique can be utilised in order to shorten the kernel critical path. 
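To show what software pipelining does to such a kernel loop, the sketch below splits a two-stage per-pixel computation into fill, steady-state and flush phases, so that in the steady state the two stages of consecutive pixels overlap and the kernel critical path approaches the length of the longer stage. stage_a() and stage_b() are hypothetical stand-ins for the window-update/interpolation and filter/write-back work, not the actual kernel code.
…………………………………………………………………………
/* Hypothetical two-stage pixel pipeline: stage_a produces an intermediate
   value that stage_b consumes one iteration later. */
extern int  stage_a(int i);          /* e.g. window update + interpolation */
extern void stage_b(int i, int v);   /* e.g. median filter + write-back    */

static void pipelined_line(int n)
{
    int i, carry, next;

    carry = stage_a(0);              /* fill step: start the pipeline      */

    for (i = 1; i < n; i++) {        /* steady-state kernel                */
        next = stage_a(i);           /* new pixel enters the pipeline      */
        stage_b(i - 1, carry);       /* previous pixel leaves it           */
        carry = next;                /* register holding the intermediate  */
    }

    stage_b(n - 1, carry);           /* flush step: drain the last value   */
}
…………………………………………………………………………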
The technique creates additional fill and flush steps, which occupy a number of registers to hold and release the intermediate values between different pipeline stages, as illustrated in Figure 4.10 (Illustration of Pipeline Architecture for Kernels). For the proposed single-core demosaicing engine, the critical path of the kernel before pipelining is 44.86ns, while after pipelining this path is shortened to 6.86ns. 4.6. Performance Analysis and Comparison The Freeman demosaicing application is implemented on both single-core and dual-core RICA architectures in ANSI-C, and is then compiled, scheduled and simulated by the tool flow associated with RICA. The tool flow is based on 65nm technology. The code is optimised both by hand and by the compiler. The simulator [14] provides performance results such as simulation time, number of steps, required computational resources and so on. The simulator is based on an accurate model of the RICA paradigm which takes IC configuration and interconnections into account. Figure 4.11 (A Demosaiced 648x432 Image) shows a real image processed by the proposed Freeman demosaicing engine, for which a PSNR1 of 26.4dB is obtained. Table 4.2 lists performance evaluations of the proposed Freeman demosaicing engine at different optimisation stages. It is seen that the final single-core implementation with the kernel and pipeline techniques achieves up to an 80.1% reduction in kernel critical path length and a 4.92x speedup in throughput compared with the original implementation. When mapped onto the dual-core architecture, the throughput reaches up to 241.6Mpixels/s, which corresponds to a 1.72x speedup compared with the single-core engine. The throughput here is defined as the number of pixels in the image divided by the processing time. Performance comparisons in Table 4.2 are made between the proposed Freeman demosaicing engine, a Hamilton-Adam demosaicing engine on RICA based architecture, a Virtex 4 FPGA based bilinear demosaicing engine [34] and a bilinear demosaicing engine implemented on CRISP [53].
Table 4.2 Freeman Demosaicing Performance Evaluations and Comparisons
Evaluations (for a 648x432 image):
Original: execution time 9.78 ms, average throughput 28.62 Mpixels/s, kernel critical path 34.54 ns
Single-core: execution time 1.99 ms, average throughput 140.7 Mpixels/s, kernel critical path 6.86 ns
Dual-core: execution time 1.16 ms, average throughput 241.6 Mpixels/s, kernel critical path 7.02 ns
Comparisons:
Bilinear: throughput 142 Mpixels/s, average frequency 151 MHz
Single-core Freeman: throughput 140.7 Mpixels/s, average frequency 145.8 MHz
Dual-core Freeman: throughput 241.6 Mpixels/s, average frequency 142.4 MHz
Hamilton-Adam: throughput 127 Mpixels/s, average frequency 144 MHz
FPGA bilinear [35]: throughput 150 Mpixels/s, average frequency 150 MHz
CRISP bilinear [54]: throughput 345 Mpixels/s, average frequency 115 MHz
Since bilinear demosaicing is the first stage of the Freeman method, it is considered separately and included in the comparisons. For RICA based engines, the average frequency is defined as the kernel's iterating frequency and is calculated from the kernel's critical path length. It is seen that the proposed Freeman demosaicing engine demonstrates good throughput with both single-core and dual-core architectures. Due to the efficient pipeline architecture, the Freeman engine maintains almost the same throughput as the bilinear engine, even with the extra burden of the computationally intensive median filter.
The highly optimised kernel in the Freeman engine shows an iterating frequency comparable to that of the referenced FPGA based bilinear engine, and higher than the maximum frequency that CRISP can achieve. It should be noted that CRISP has a dedicated 2-D load memory RSPE and a colour interpolation RSPE, which means that multiple demosaicing windows can be executed simultaneously. This is the reason why CRISP demonstrates much higher throughput even with a lower working frequency. If only a single demosaicing window were running, the proposed Freeman engine would deliver better performance. Moreover, as the RSPEs in CRISP are dedicated hardware, CRISP is actually an ASIC-like coarse-grained reconfigurable architecture, and its flexibility is therefore restricted to some extent. In contrast, the nature of the RICA paradigm enables RICA based architectures to be flexibly reconfigured and tailored to adapt to different applications. 4.7. Future Improvement A feature associated with the RICA paradigm termed Vector Operation (VO) can be utilised to further improve the demosaicing engine performance. A vector is a 32-bit operand constructed from two 16-bit operands. With the SIMD technique, these two 16-bit operands can be calculated in parallel via a single vector operation. For applications with parallel architectures, VO can be employed with the expectation of a significant reduction in computational resources, as the number of calculations is cut to nearly half. In the proposed Freeman demosaicing engine, it is possible to utilise VO to improve the median filter's efficiency. The shifting window in the median filter can be extended to 4x3, and each pair of pixels within the window can be combined to construct a vector, as illustrated in Figure 4.12 (Potential Vector Operations in Median Filter). In total, six vectors can be built initially, and the four corresponding intermediate outputs can be obtained through two median value seeking operations, in the form of two vectors (M1&M2, M3&M4). After that, these two vectors are decomposed and duplicated in order to construct new vectors (vectors 7~9), and the final two outputs can be obtained via another median value seeking operation. In this way, the total number of seeking operations required to obtain two outputs is reduced from 6 to 3, which simplifies the overall calculation. VO usually comes with the drawback of additional logic and shifting resources required for constructing and decomposing vectors, which can partially negate the benefits by prolonging the critical path and increasing the area consumption. When applied to the proposed demosaicing engine, VO requires two pixels to be fed in at a time instead of one, which requires a more complex control scheme and makes it more difficult to build kernels. However, VO is still a promising approach for further optimisation of the proposed demosaicing engine. 4.8. Conclusion In this chapter, a Freeman demosaicing engine on RICA based architecture has been presented. After the Freeman algorithm is introduced in detail in Section 4.2, Section 4.3 presents the demosaicing engine implementation.
The 2-D shifting window built from registers, as discussed in the previous chapter, is utilised to construct the demosaicing and median filter windows. An efficient data buffer rotating scheme is designed with the aim of reducing memory accesses. Another novel outcome is the parallel architecture that allows bilinear demosaicing and median filtering to be executed simultaneously. This parallel architecture successfully reduces the required intermediate data storage and improves the overall efficiency. In Section 4.4, the proposed demosaicing architecture is further analysed. A pseudo median filter is utilised, which enables the demosaicing engine to demonstrate both good performance and low computational complexity. Only a few multiplexers and comparators are needed to realise the pseudo median value calculation. Since the demosaicing window shifts from left to right, the intermediate results of the pseudo median filter are reused, so only two median value seeking calculations are required for every new pixel. Based on the algorithm analysis, Section 4.4.2 presents the dual-core architecture developed for the proposed Freeman demosaicing engine. Two RICA cores are used to process the even and odd lines of the image respectively. An improved spin-lock method is utilised in order to ensure that the data shared between the two cores is accessed correctly without any conflict. Pseudo code is given to illustrate how the dual-core architecture works. The optimisation presented in Section 4.5 mainly focuses on customising the proposed RICA based architecture and constructing kernels with an efficient pipeline scheme. Performance evaluation demonstrates that the proposed Freeman demosaicing engine offers a good PSNR on a real photographic image. The throughput is improved step by step by applying the different optimisation approaches. Comparisons between the proposed engine and other demosaicing applications show that the proposed Freeman demosaicing engine provides good throughput with an efficient architecture. VO with the SIMD technique is discussed in Section 4.7 as a possible future improvement to the demosaicing engine. With VO, multiple data items can be processed in a single calculation. In the following chapter, VO is utilised for the 2-D DWT module in JPEG2000 and its positives and negatives are discussed in detail.
1. $\mathrm{MSE} = \frac{1}{mn}\sum_{m}\sum_{n}\left[\text{Reconstructed Image}(m,n) - \text{Original Image}(m,n)\right]^{2}$, $\mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right)$, where MAX is the maximum possible pixel value of the image; if the pixel has 8 bits, MAX is 255.
Chapter 5 2-D DWT Engine on RICA Based Architecture 5.1. Introduction This chapter presents a reconfigurable lifting-based 2-Dimensional Discrete Wavelet Transform engine for JPEG2000 [73]. The proposed engine is implemented on RICA based architecture and can be dynamically reconfigured to support both the 5/3 and 9/7 transform modes for the lossless and lossy compression schemes in the JPEG2000 standard. VO with the SIMD technique is utilised in the proposed 2-D DWT engine, and the advantages and disadvantages brought by VO are discussed in detail. Simulation results [73] demonstrate that the proposed 2-D DWT engine provides high throughput, reaching up to 103.1 Frames per Second (FPS) for a 1024x1024 image, which demonstrates its advantage over a number of FPGA and DSP based implementations. 5.2.
Lifting-Based 2-D DWT Architecture in JPEG2000 Standard DWT is the first decorrelation step in JPEG2000 standard to decompose the input image into different subbands in order to obtain both approximation and detailed information. The algorithm of both 1-D and 2-D DWT has already been introduced in Chapter 2 and Appendix. Figure 5.1 illustrates both the convolutional DWT architecture in 68 (a) and the lifting-based DWT 2-D DWT Engine on RICA Based Architecture L LPF 2 L even X(z) X(z) HPF 2 Predict (alpha) Split Update (beta) odd H H (b) (a) 1/K L K H even X(z) Split Predict (alpha) Update (beta) Update (delta) Predict (gamma) odd (c) 5/3 DWT: alpha = -½, beta = ¼ 9/7 DWT: alpha = -1.5861342, beta = -0.052980118, gamma = 0.882911076, delta = 0.443506852, K = 1.230174105 Figure 5.1 (a) Convolutional DWT Architecture (b) 5/3 Lifting-based DWT Architecture (c) 9/7 Lifting-based DWT Architecture Output for 5/3 mode Output for 9/7 mode L 1/K L K H even X(z) Split Predict (alpha) Predict (gamma) Update (beta) Update (delta) odd H Figure 5.2 Generic Lifting-Based DWT Architecture for Both 5/3 and 9/7 modes architectures in (b) and (c). As it is known, the lifting-based scheme is selected in JPEG2000 standard as the default DWT architecture. In Figure 5.1(b) and (c), the polyphase matrices are divided into a couple of units termed “Predict” and “Update”, which actually mean different combinations of additions and multiplications. Parameters in these two units are determined by DWT polyphase matrices introduced in Chapter 2. Since there is only one extra pair of Predict and Update existing in the 9/7 architecture compared with that in 5/3 scheme, these two architectures can be combined and simplified to a generic architecture which adopts both schemes, as illustrated in Figure 5.2. The lifting-based 2-D DWT architecture is illustrated in Figure 5.3, which can be viewed as the extension and duplication of the 1-D architecture. Actually the vertical transformation can also be executed by a single 1-D DWT engine. 69 2-D DWT Engine on RICA Based Architecture First Demision Second Demision LL Split Predict Update even X(z) Split Predict LH HL Update odd Split Predict Update HH Figure 5.3 Lifting-Based 2-D DWT Architecture 5.3. Lifting-Based DWT Engine on RICA Based Architecture 5.3.1. 1-D DWT Engine Implementation Figure 5.4 illustrates the detailed generic architecture of the 1-D DWT engine implemented on RICA based architecture. When considering the parameters, Predict and Update units in 5/3 DWT architecture can be easily realised by shifting and addition operations, while the floating-point parameters in 9/7 mode require complex computations. In this work, Hardwired Floating Coefficient Multipliers (FCMs) are utilised to convert floating-point computations to fixed-point calculation. The floating-point parameters are represented in their Canonical Sign Digit (CSD) form [74] and hence floatingpoint multiplications are replaced by a number of shifts and additions. Table 5.1 illustrates the CSD forms of the floating-point parameters in this engine. Since the number of bits used for CSD representation can be truncated, Figure 5.5 gives comparisons between different CSD bits and the PSNR of reconstructed image (256x256 Lena). It is obvious that more CSD bits provide higher PSNR, which means better reconstructed image quality. However, having more CSD bits also means higher computational complexity as more addition and shift need to be executed. 
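As an illustration of how a hardwired FCM reduces such a floating-point coefficient to shifts and additions, the following C sketch multiplies a sample by |alpha| using the 12-bit CSD form listed in Table 5.1 (2 - 2^-1 + 2^-3 - 2^-5 - 2^-7 + 2^-12) and applies it in the first 9/7 predict step. The integer scaling, the truncation of the shifted terms and the assumption of an arithmetic right shift for negative samples are illustrative choices, not the engine's exact arithmetic.
…………………………………………………………………………
/* Fixed-point multiply by |alpha| ~= 1.5861816 via its CSD digits
   (Table 5.1): each non-zero digit becomes one shift plus one add or
   subtract, so no multiplier cell is required. */
static int fcm_alpha(int x)
{
    return (x << 1)      /*  2      */
         - (x >> 1)      /* -2^-1   */
         + (x >> 3)      /* +2^-3   */
         - (x >> 5)      /* -2^-5   */
         - (x >> 7)      /* -2^-7   */
         + (x >> 12);    /* +2^-12  */
}

/* First lifting (predict) step of the 9/7 transform for one odd sample,
   with the negative sign of alpha applied explicitly:
   d = x_odd - |alpha| * (x_even_left + x_even_right). */
static int predict97(int x_odd, int x_even_left, int x_even_right)
{
    return x_odd - fcm_alpha(x_even_left + x_even_right);
}
…………………………………………………………………………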
Since RICA paradigm has high level of ILP, the 12-bit CSD representation is selected in this thesis. These CSD based FCMs can be dynamically reconfigured and bypassed when adopting 5/3 architecture. 70 2-D DWT Engine on RICA Based Architecture Output_L 5/3 DWT D Input_odd D FCM beta alpha Input_even FCM D delta gamma FCM Output_L 1/K Output_H FCM D D Output_H K FCM = Floating Coefficient Multiplier Figure 5.4 Detailed Generic Architecture of 1-D DWT Engine on RICA Table 5.1 CSD Forms of Floating-Point Parameters Value CSD representation Approximate Value α 1.5861342 2-2-1+2-3-2-5-2-7+2-12 1.5861816 β 0.052980118 2-4-2-7-2-9+2-12 0.0529785 γ 0.882911076 20-2-3+2-7 0.8828125 δ 0.443506852 2-1-2-4+2-7-2-9+2-12 0.4436035 k 1.230174105 20+2-2-2-6-2-8-2-12 1.2302246 1/k 0.812893066 20-2-3-2-4+2-11 0.8129883 Figure 5.5 Reconstructed Image Quality with Different CSD Bits When processing colour images, the proposed DWT engine can be easily extended to be adaptive for transforming multiple colour components simultaneously (RGB or YUV/YCrCb) due to the high parallelism of RICA 71 2-D DWT Engine on RICA Based Architecture Even part of Y Y Split the data into even and odd parts Y_L DWT Engine for Y Y_H Odd part of Y Even part of U Original Image Obtain different colour components U Split the data into even and odd parts U_L DWT Engine for U U_H Odd part of U Even part of V V Split the data into even and odd parts V_L DWT Engine for V Odd part of V V_H Figure 5.6 Streamed Data Buffers in DWT Engine based architecture and its customisable nature. Figure 5.6 illustrates the potential architecture of RICA based DWT engine for processing multi-colour components. For each colour component, even and odd input data symbols are split and stored intermediately. Totally there are three separate DWT engines occupied in the multi-component DWT architecture, all of which can be accessed and executed in parallel. As the multiple colour transform can be simply realised by duplicating the DWT engine, in this work, the discussion mainly focuses on single component transform engine. 5.3.2. 2-D DWT Engine Implementation As discussed previously, 2-D DWT architecture is an extension of 1-D scheme by involving two stages: horizontal transforming and vertical transforming. Figure 5.7 illustrates the detailed 2-D DWT decomposition procedure with an example of an 8x8 image with 3-level 2-D DWT. It is clearly seen that how the original image is decomposed by horizontal and vertical transformations. After the third-level transformation, the image is decomposed to 4 individually 1x1 pixels belonging to different subbands. The method for implementing 2-D DWT engine on RICA based architecture has some comparability to the processing pattern proposed in [75]. The proposed 2-D DWT engine performance is enhanced by the VO with SIMD technique discussed in Chapter 4, which is utilised to transform pixels belonging to adjacent lines in parallel. Instead of transforming pixels line by 72 2-D DWT Engine on RICA Based Architecture Original Image Level 1 L H LL HL LH LLL HH LLH Level 2 LLLL LLHH LLLH LLLLL LLHL LLLLH Level 3 LLLLLL LLLLLH LLLLHL LLLLHH Figure 5.7 Detailed 3-Level 2-D DWT Decomposition line, four pixels (P00, P01, P10, P11) are processed at a time, as illustrated in Figure 5.8. These four pixels are divided into two pairs, each of which has two pixels belonging to adjacent lines. Each pixel pair is combined to construct a vector, as shown in the red broken blocks. 
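A short sketch of how such a vector might be packed and later decomposed is given below, assuming that the two 16-bit pixel samples occupy the low and high halves of a 32-bit word; the exact field layout is an assumption, and on RICA the combination and decomposition are carried out by LOGIC and SHIFT cells rather than C operators.
…………………………………………………………………………
#include <stdint.h>

/* Pack one 16-bit sample from each of two adjacent lines into a single
   32-bit vector operand (low half = line 0, high half = line 1). */
static uint32_t vo_pack(uint16_t line0_pixel, uint16_t line1_pixel)
{
    return (uint32_t)line0_pixel | ((uint32_t)line1_pixel << 16);
}

/* Decompose a vector back into its two 16-bit samples after processing. */
static void vo_unpack(uint32_t v, uint16_t *line0_pixel, uint16_t *line1_pixel)
{
    *line0_pixel = (uint16_t)(v & 0xFFFFu);
    *line1_pixel = (uint16_t)(v >> 16);
}
…………………………………………………………………………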
These two vectors can be transformed by a horizontal DWT engine illustrated in Figure 5.4, and the four outputs (P0L, P1L &P0H, P1H) are generated in parallel in the form of two vectors by a single transformation, as highlighted by the red broken blocks in step 1. As discussed previously, vectors can be decomposed into separate operands and then reconstructed to build new vectors. In this work, after the first step (horizontal transformation), the two intermediate vectors are decomposed and the four 1-D transformed coefficients are recombined to construct two new vectors, as highlighted in the blue broken blocks. With 73 2-D DWT Engine on RICA Based Architecture Step 1: 1-D WT (horizontal) L The original image H P00 P01 P0L P0H P10 P11 P1L P1H PLL PHL Vector 1 Vector 2 HL LL PLH PHH HH LH Step 2: 2-D WT (vertical) Figure 5.8 Parallel Pixel Transformation with VO and SIMD Technique another vertical transform, the four final outputs (PLL, PLH, PHL and PHH) are obtained simultaneously. It is obviously that the engine with VO technique can offer higher processing speed compared with the original engine which transforms the image line by line since it can process two lines/columns at a time. Meanwhile, compared with the case in which two lines/columns are transformed concurrently by two parallel single operations, the computational resource required by the VO based engine is significantly reduced. Moreover, the configuration latency is reduced by VO technique due to decrement in required computational ICs, which leads to further execution time decrease. However, there is a trade off as additional LOGIC and SHIFT cells are required in order to combine and decompose separate operands and vectors. Such an increase in IC occupation will lead to higher area consumption. Also these additional LOGIC and SHIFT operations will also increase the execution time of steps involving VO, resulting in decreased throughput. Detailed performance comparisons 74 2-D DWT Engine on RICA Based Architecture between engines with VO technique and regular engines will be given in Section 5.4. 5.3.3. 2-D DWT Engine Optimisation Optimisation to the proposed 2-D DWT engine still focuses on constructing kernels. In this work, both the horizontal and vertical transforming engines are integrated in corresponding kernels. Figure 5.9 illustrates kernels and their moving patterns in the proposed 2-D DWT engine. The transforming kernels scan the original image with a raster order, and generate the four outputs belonging to four subbands simultaneously. In order to construct DWT kernels, the original RICA architecture needs to be tailored to satisfy the computational resource requirement of the DWT engine. Table 5.2 illustrates numbers of cells in different 2-D DWT engines on customised RICA based architecture, including original DWT engines (transform an image line by line), DWT engines with two single parallel operations (transform two lines/columns at a time) and DWT engines with VO technique. All these engines are implemented within kernels and the numbers of registers are calculated after pipelining the kernels. It is seen that Original Image ... ... ... ... ... ... ... ... 
Kernel Subband LL Subband LH Subband HL Subband HH Figure 5.9 Kernel in the 2-D DWT Engine on RICA Architecture 75 2-D DWT Engine on RICA Based Architecture Table 5.2 Numbers of Cells in Different DWT Engines (a) Number of Cells in DWT 5/3 Engines Cell DWT53 (a) DWT53 (b) DWT53 (c) Cell DWT53 (a) DWT53 (b) DWT53 (c) ADD 10 14 10 WMEM 4 4 4 LOGIC 0 0 4 RMEM 4 4 4 SHIFT 4 4 8 REG 33 47 45 MUX 1 1 1 JUMP 1 1 1 COMP 2 2 2 Total Area 37049 45523 49188 (b) Number of Cells in DWT 9/7 Engines Cell DWT97 (a) DWT97 (b) DWT97 (c) Cell DWT97 (a) DWT97 (b) DWT97 (c) ADD 31 56 32 WMEM 4 4 4 LOGIC 0 0 4 RMEM 4 4 4 SHIFT 23 42 26 REG 93 155 146 MUX 1 1 1 JUMP 1 1 1 COMP 2 2 2 Total Area 101269 161327 129581 (a): Original engines (b): Engines with two parallel single operations (c): Engines with VO technique the engines with two single parallel operations consume more computational resource compared with original engines in aspects of ADD, SHIFT and REG cells. On the other hand, engines with VO technique keep similar cell occupation to original engines with slight increments in ADD, LOGIC and SHIFT cells (the number of REG cells in VO based engines are also large due to the requirement of establishing pipelines). When comparing the latter two kinds of engines, the numbers of ADD cells are reduced significantly by utilising VO technique, especially for the 9/7 mode. In contrast, the numbers of LOGIC cells increase as more logic operations are required in order to construct and decompose vectors. The numbers of SHIFT cells have different trends between 5/3 mode and 9/7 mode when applying VO 76 2-D DWT Engine on RICA Based Architecture technique. Since the computation involved in 5/3 mode is quite simple compared with that in 9/7 mode, the trade-off of adopting VO technique, which means increase in the required number of SHIFT cells, is obvious; on the other hand, the massive computation in 9/7 mode requires large number of SHIFT cells, and this requirement is reduced significantly by parallel operations in VO based implementation. Therefore, those additional SHIFT ICs required by constructing and decomposing vectors become negligible. 5.4. Performance Analysis and Comparisons The proposed lifting-based 2-D DWT engine is implemented on customised RICA based architecture using ANSI-C and simulated by the RICA tool flow [6]. Figure 5.10 demonstrates the standard Lena test image transformed by the proposed DWT engine for both 5/3 and 9/7 modes. The performance analysis and comparisons mainly focus on aspects of both throughput and computational cell occupation. Figure 5.11 gives throughput comparisons for processing the standard 256x256 Lena test image under different 2-D DWT levels. Comparisons are made between original DWT engines, DWT engines with two single parallel operations and DWT engines with VO technique. It is seen that the original DWT engines provide the lowest throughput at all DWT levels. 
In contrast, (a) (b) Figure 5.10 Standard Lena Image Transformed by the 2-D DWT Engine (a) DWT 5/3 mode (b) DWT 9/7 mode 77 2-D DWT Engine on RICA Based Architecture 1800 1600 1400 DWT_53 (ori) 1200 1000 DWT_53 (two parallel single operations) 800 600 DWT_53 (Vector) 400 200 0 1-level 2-level 3-level 4-level 1800 1600 1400 DWT_97 (ori) 1200 1000 DWT_97 (two parallel single operations) 800 600 DWT_97 (Vector) 400 200 0 1-level 2-level 3-level 4-level Figure 5.11 Throughput (fps) Comparisons DWT engines with two single parallel operations offer the highest throughput, and DWT engines with VO technique demonstrate slightly lower throughput. This is because the VO technique introduces additional operations in order to construct and decompose vectors, and these operations normally increases the pipeline depth by adding more filling and flushing steps while keeping similar critical path length, leading to execution time increment. In contrast, when comparing the area consumption as illustrated in Figure 5.12, the original DWT engines demonstrate the lowest area occupation in both 5/3 and 9/7 modes. When comparing the other two kinds of engines, the DWT engine with VO technique has significant reduction in cell area consumption compared with the engine with two single parallel operationsin9/7 mode. As discussed in Section 5.3.3, parallel operations introduced by VO technique 78 2-D DWT Engine on RICA Based Architecture enable the engine to perform the same function with less computational cells. On the other hand, in 5/3 mode, the area consumption of VO based engine is actually slightly higher than the engine with two single parallel operations. This is because the area overhead of the extra LOGIC and SHIFT cells required by VO negates the benefit coming from deducting other computational cells such as ADD and REG. As the total computational cell numbers in 5/3 DWT engine are quite small, this overhead is obvious and increases the total area. In contrast, this overhead is negligible compared to the reduction of other cells in 9/7 mode. In order to measure the trade-off between throughput and area consumption, 180000 160000 140000 120000 100000 80000 60000 40000 20000 0 DWT_53 (ori) DWT_53 (two parallel single operations) DWT_53 (Vector) DWT_97 (ori) DWT_97 (two parallel single operations) DWT_97 (Vector) (a) Area (um2) Comparisons DWT_53 (ori) 0.04 0.035 DWT_53 (two parallel single operations) DWT_53 (Vector) 0.03 0.025 0.02 DWT_97 (ori) 0.015 DWT_97 (two parallel single operations) DWT_97 (Vector) 0.01 0.005 0 1-level 2-level 3-level 4-level (b) Δ Comparisons Figure 5.12 Area and Δ Comparisons 79 2-D DWT Engine on RICA Based Architecture a parameter Δ = throughput/area is defined to measure the efficiency of different engines. A high Δ means that the corresponding engine has good computational resource utilisation. It can be seen in Figure 5.12 that the original DWT engines show the lowest Δs in both 5/3 and 9/7 modes, which means that original DWT engines are actually not efficient. When comparing the other two kinds of engines, the 5/3 DWT engine with two single parallel operations has the highest Δs, and the 5/3 DWT engine with VO has respectively lower Δs, which means that VO is not the best solution for the 5/3 mode as it brings too much overhead. In contrast, for the 9/7 mode, the engine with VO offers higher Δs compared with the other two engines. 
In this case, a conclusion is obtained that the VO technique is more suitable for complex applications in which the computational cell reduction will not be negated by the overhead brought by extra LOGIC and SHIFT cells. Figure 5.13 illustrates performance comparisons between RICA based DWT engines and an FPGA based lifting 2-D DWT implementation in [76], a TI C6416 DSP based DWT engine [77] and a StarCore DSP based DWT implementation [78] which are all targeting JPEG2000 standard. The test image size is set to be 1024x1024 which is the same with [76] and [77], and throughput of the proposed engines and [78] is scaled according to the image size. Comparisons are made under different DWT modes and levels 0.09 0.0766 0.08 Execution time (s) 0.07 0.06 0.05 RICA 0.04 References 0.03 0.02 0.0154 0.0172 0.0154 FPGA [76] 9/7 4-level C6416 [77] 9/7 4-level 0.0097 0.0132 0.01 0 Starcore [78] 5/3 1-level Figure 5.13 Performance Comparisons 80 2-D DWT Engine on RICA Based Architecture according to the references. Meanwhile, it is worth noticing that the FPGA implementation [76] was based on an old Virtex 2 device with the working frequency of 67MHz, so throughput improvement is expected if the implementation can be elaborated to some newer FPGA platforms. However, generally it is seen that DWT engines on RICA based architecture demonstrate their clear advantages in both 5/3 and 9/7 modes. In conclusion, RICA based DWT engines benefit from the high parallelism nature of RICA paradigm and efficiently pipelined kernels. Meanwhile VO is proved to be a promising technique to optimise complex applications especially to those area sensitive applications. 5.5. Conclusion In this chapter, a high efficiency reconfigurable lifting-based 2-D DWT engine on customised RICA based architecture targeting JPEG2000 standard has been proposed. In Section 5.2, the lifting-based DWT architectures are introduced. Section 5.3 presents both 1-D and 2-D DWT implementations on RICA based architecture. The proposed DWT engine can be reconfigured for both 5/3 and 9/7 transforming modes in JPEG2000 standard. Hardwired FCMs are utilised in the 9/7 mode for converting floating-point calculations to fixed-point calculations instead of involving dedicated floating-point calculation units. With the VO and SIMD technique introduced in the previous chapter, the 2-D DWT engine can generate its final output coefficients belonging to different subbands simultaneously. Optimisation still targets customising RICA based architecture and constructing kernels. Different computational resources occupied by both regular engines and engines with VO are discussed. For the horizontal and vertical steps in 2-D DWT, separate kernels are constructed respectively. In a single kernel, there are four pixels included, corresponding to the four final transformed coefficients. In Section 5.4, performance comparisons between original DWT engines, DWT engines with two single parallel operations and DWT engines with VO 81 2-D DWT Engine on RICA Based Architecture technique are discussed in aspects of both throughput and area occupation. The original DWT engines demonstrate both the lowest throughput and the lowest area occupation in both 5/3 and 9/7 modes. Meanwhile, DWT engines with two single parallel operations demonstrate the highest throughput. Meanwhile, VO based DWT engines provide slightly lower throughput but much lower area occupation in the 9/7 mode. A parameter Δ = throughput/area is utilised to measure the efficiency of different engines. 
It is concluded that VO technique is suitable for the 9/7 mode, which is more computationally intensive compared with 5/3 mode. Performance comparisons are also made between the proposed 2-D DWT engine and various implementations based on other architectures. The proposed DWT engine demonstrates clear advantages in both 5/3 and 9/7 modes compared with various FPGA and DSP based 2-D DWT solutions. These advantages mainly come from the high parallelism nature of RICA paradigm, efficiently pipelined kernels and VO technique. 82 EBCOT on RICA Based Architecture and ARM Core Chapter 6 EBCOT on RICA Based Architecture and ARM Core 6.1. Introduction This chapter presents a JPEG2000 EBCOT encoder based on RICA based dynamically reconfigurable architecture and an ARM core. The EBCOT Tier1 encoding scheme consists of two modules: Context Modelling and Arithmetic Encoder. Based on algorithm evaluation, the four primitive coding schemes in CM are efficiently implemented on RICA based architecture. A novel Partial Parallel Architecture for CM is applied to improve the overall system performance. Meanwhile, an ARM core is integrated in the proposed architecture for implementing optimised AE efficiently. Simulation results demonstrate that the resulting CM architecture can code 69.8million symbol bits per second, representing approximately 1.37x speed up compared with the Pass Parallel CM architecture implemented on RICA paradigm based architectures; while the ARM based AE implementation can process approximately 31.25 million CX/D pairs per second. The EBCOT Tier-2 encoder and file formatting module can also be implemented on the ARM core together with AE. 6.2. Context Modelling Algorithm Evaluation The detailed CM algorithm has been discussed in Chapter 2. In JPEG2000 applications, EBCOT usually consumes most of the execution time (typically 83 EBCOT on RICA Based Architecture and ARM Core more than 50%) in software-based implementations [79] and CM is considered to be the most computationally intensive unit in EBCOT. Since CM adopts the fractional bit-plane coding idea and codes DWT coefficients in codeblocks by three separate coding passes in bit-level, it is actually more suitable for specialised hardware implementation rather than general hardware. There have been several methods proposed to accelerate CM process, which are detailed as follows. Sample Skipping (SS): This method is proposed in [80] as illustrated in Figure 6.1. Through a parallel checking performed by the encoder, if there are n Need-to-Be-Coded (NBC) coefficient bits in a stripe column (1≤n≤4), only n cycles are spent on coding these NBC bits, and 4-n cycles are saved compared with the conventional method which checks all bits one by one. In the case that there is no NBC bit existing in the column, only one cycle is spent on checking. Since most columns have less than four NBC samples, this method can save cycle time [80]. Group-Of-Column Skipping (GOCS): This method is also presented in Conventional way Sample Skipping 4 cycle required 2 cycle required Coefficient bit needs to be coded Coefficient bit does not need to be coded Figure 6.1 Sample Skipping Method for CM Conventional way (process 16 columns) Coefficient bit needs to be coded Coefficient bit does not need to be coded Group of Column Skipping (process 8 columns) Figure 6.2 Group of Column Skipping Method for CM 84 EBCOT on RICA Based Architecture and ARM Core [80] as illustrated in Figure 6.2. 
This method is to skip a group of nooperation columns together and can only be applied to pass 2 and 3. The number of NBC bits in each group are checked and recorded while being coded in pass 1 with a 1-bit tag for each group. When executing coding pass 2 and 3, these tags are checked. If a tag is “0” then the corresponding group is skipped, otherwise columns in the group are checked one by one and coded with the SS method [80]. Multiple Column Skipping (MCOLS): This method is proposed in [81] in order to add more flexibility to GOCS. The tag indicator for each group is extended to cover different states of the four columns. In this case the coding engine can process each column respectively and determine whether a single or multiple columns can be skipped. Pass Parallel Context Modelling (PPCM): All the accelerating methods discussed above aim to save checking cycles when processing a stripe column. The PPCM method presented in [82] targets parallel processing, that is, performing three coding passes simultaneously. This method adopts the column-based operation in [80] with four coefficient bits in a column being processed at a time. To encode a sample, firstly the encoder decides by which coding pass the current sample should be coded. This sample is then coded by one of the four primitive coding schemes according to the coding pass. However, some issues occur due Coding window for pass 3 Coding window for passes 1 and 2 Stripe causal Figure 6.3 Pass Parallel Context Modeling 85 EBCOT on RICA Based Architecture and ARM Core to the causal relationship within the three coding passes in the parallel processing mode and must be solved. First, samples belong to pass 3 may become significant earlier than the two prior coding passes since three coding passes are executed concurrently. Second, if the current sample belongs to pass 2 or 3, significant states of samples that have not been visited in the coding window shall be predicted since these samples may become significant in pass 1 [82]. In order to solve these problems, the coding window for pass 3 in PPCM is delayed by one stripe column to eliminate the reciprocal effect between pass 3 and the other two passes. Figure 6.3 illustrates the PPCM architecture. The stripe causal mode [5, 83] is utilised to eliminate the dependence of coding operations on the significance of samples in the next stripe. Moreover, two significant state variables σ0[k] and σ1[k] are introduced to state whether the sample becomes significant in pass 1 or pass 3 respectively. Detailed description of PPCM algorithm can be referred in [82]. With this architecture, the execution time of CM can be reduced by more than 25% compared with SS and GOCS [82]. All of these accelerating methods presented above are originally FPGAtargeted. For SS, GOCS and MCOLS, they require either additional control units/memory or modified memory arrangements. PPCM requires large amount of computational resources to make the three coding passes working in parallel. Moreover, the power consumption of FPGA paradigm makes it inefficient to use FPGA for embedded JPEG2000 solutions. Basically, an ideal hardware architecture for CM should have high parallelism so more than one coefficient samples can be coded simultaneously, also the dynamically reconfigurability is desirable, with which the architecture can be reconfigured to adapt different coding passes at different coding stages in order to prevent unnecessary computational resource waste. 
Other features, such as power saving and high integration, are also essential for embedded applications.

6.3. Efficient RICA Based Designs for Primitive Coding Schemes in CM

Before deciding on the most suitable CM architecture for RICA based applications, the discussion in this thesis focuses on how to efficiently implement the four primitive coding schemes involved in the three coding passes on RICA based architecture. According to the analysis of the CM algorithm and the features of RICA paradigm, the main challenges for an efficient implementation are as follows:
- Kernels need to be constructed for one or more coding passes involving various primitive coding schemes in order to eliminate the configuration latency.
- The number of memory accesses in each kernel must be restricted to 4 in order to prevent breaking kernels.
- The conditional branches in coding schemes must be eliminated.
- Kernels must be adaptive to the RLC coding scheme, which may generate various numbers of CX/D pairs within a single stripe column.

In CM, all CXs are generated depending on different combinations of the significant states/refinement states/magnitudes of the current bit and its eight neighbours, so the latter two challenges become critical. In order to overcome these challenges, all four primitive coding schemes in CM are carefully designed and implemented on RICA based architecture.

6.3.1. Zero Coding

As discussed in Chapter 2, ZC generates a CX according to the sums of the significant states of the current bit's eight neighbours in the horizontal/vertical/diagonal directions, as well as the DWT subband to which the current codeblock belongs. In the proposed ZC implementation, all the H/V/D sums are judged by comparators and the output CX is generated through a sequence of multiplexers whose selecting inputs are the outputs of the comparators. Figure 6.4 illustrates the detailed circuit structure of this coding engine.

Figure 6.4 Detailed Architecture for ZC Unit (H = H0 + H1, V = V0 + V1, D = D0 + D1 + D2 + D3; a comparator array produces judgements, a logic combination block applies the ZC LUT of Chapter 2, and a multiplexer sequence selects CX_out starting from CX_ori)

The comparator array compares the H/V/D values with the different parameters defined in Table 2.1 (the ZC LUT) in Chapter 2. The compared results are used as judgements for the H/V/D contributions. The logic combination block combines these judgements with logic operations according to the ZC LUT and generates decisions for the multiplexer sequence to decide the final CX value. In this ZC coding unit, CX has an initial value (normally zero), and the multiplexer sequence chooses the final CX value by utilising the decisions provided by the logic combination block. In this way, conditional branches are totally eliminated and the ZC coding unit can be integrated within a kernel, at the fair cost of a number of extra multiplexers. For different DWT subbands, different parameters for the comparator array and new logic combinations in the logic block are utilised without modifying the coding unit architecture.
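To make the branch-elimination idea concrete, the sketch below shows the style of computation such a unit maps to: every judgement is evaluated unconditionally and a chain of select operations (realised as MUX cells on RICA) picks the context. It is only an illustration; the two selection rules shown are placeholders standing in for the full Table 2.1 entries, and the function and variable names are not taken from the thesis implementation.

#include <stdint.h>

/* Branch-free ZC-style context selection.  The two rules below are
 * placeholders for the real ZC LUT (Table 2.1); only the structure,
 * comparators feeding a chain of selects instead of if/else branches,
 * mirrors the unit in Figure 6.4. */
static unsigned zc_select(unsigned h0, unsigned h1,
                          unsigned v0, unsigned v1,
                          unsigned d0, unsigned d1, unsigned d2, unsigned d3)
{
    unsigned h = h0 + h1;               /* horizontal sum */
    unsigned v = v0 + v1;               /* vertical sum   */
    unsigned d = d0 + d1 + d2 + d3;     /* diagonal sum   */

    unsigned cx = 0;                    /* CX_ori: initial context value */

    /* "judgements" from the comparator array */
    unsigned j_h2 = (h == 2);
    unsigned j_h1 = (h == 1);
    unsigned j_vd = (v + d > 0);

    /* multiplexer chain driven by the logic-combination decisions */
    cx = (j_h1 && j_vd) ? 7u : cx;      /* placeholder rule, not Table 2.1 */
    cx = j_h2           ? 8u : cx;      /* placeholder rule, not Table 2.1 */
    return cx;
}

Swapping in a different subband, as the text describes, would only change the comparison parameters and the decision logic, not this select structure.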
6.3.2. Sign Coding

The sign coding scheme utilises the sign bits and significant states of the current bit and its horizontal/vertical neighbours to calculate the required H/V contributions, as discussed in Chapter 2. Figure 6.5 illustrates the detailed architecture of the SC coding unit.

Figure 6.5 Detailed Architecture for SC Unit (logic combination blocks generate the decisions for the H/V contributions, the CX and the XOR bit; multiplexer sequences produce the H contribution, V contribution, CX and XOR bit; D is obtained by XORing the sign bit with the XOR bit)

Since both the sign bits and the significant states can only hold the value 0 or 1, no comparators but only logic combinations are required to generate decisions for the H/V contributions. Similar to the ZC unit, multiplexer sequences for the horizontal and vertical contributions generate the H and V contributions respectively without breaking the potential kernel. These two contributions are further used to generate decisions for the CX and the XOR bit. According to Table 9.1 in Chapter 2, the conditions for generating the XOR bit can be simplified as shown in Table 6.1. With these simplified conditions, two logic combination blocks are employed to generate the final decisions for the CX and the XOR bit respectively, which are utilised by another two multiplexer sequences in order to obtain the final CX and XOR bit. The decision bit is generated by a simple XOR operation between the current sign bit and the XOR bit.

Table 6.1 Simplified LUT for XOR Bit
Combinations of H/V contributions          XOR bit
H = 1, V = x (x means don't care)          0
H = 0 and V >= 0                           0
H = 0 and V = -1                           1
H = -1                                     1

6.3.3. Magnitude Refinement Coding

MRC requires the accumulation of the significant states of the current bit's eight adjacent neighbours and the information indicating whether the current bit has already been coded by MRC. Compared with ZC and SC, the MRC implementation is relatively simple, as illustrated in Figure 6.6. In total, three comparators and three multiplexers are utilised in the coding unit in order to eliminate conditional branches.

Figure 6.6 Detailed Architecture for MRC Unit (an accumulator sums the neighbour significant states; together with the refinement state, comparators and a logic combination block generate the decisions that select CX_out from CX_ori)

6.3.4. Run Length Coding

RLC presents a more difficult problem, as it may generate various numbers (1 or 3) of CX/D pairs according to the four bits in the stripe column. Due to the memory access and conditional branch restrictions of RICA paradigm, an efficient implementation has to ensure that this variation does not break the potential kernel. In this work, the RLC unit is realised within a kernel by combining the generated CX/D pairs into a single codeword that can be read or written by a single memory operation, as demonstrated in Figure 6.7. The codeword occupies 14 bits in total and is constructed from two modified CX/D pairs, in each of which the decision bit is expanded to two bits.
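The packing itself is simple; the sketch below assumes the 5 + 2 + 5 + 2 bit layout of Figure 6.7, with the upper pair in the high bits, and uses illustrative names rather than the thesis code.

#include <stdint.h>

/* Pack two modified CX/D pairs (5-bit context + 2-bit decision each) into
 * the 14-bit RLC codeword of Figure 6.7, upper pair in the high bits.
 * Names and argument order are illustrative only. */
static uint16_t rlc_pack(unsigned cx_hi, unsigned d_hi,
                         unsigned cx_lo, unsigned d_lo)
{
    return (uint16_t)(((cx_hi & 0x1Fu) << 9) | ((d_hi & 0x3u) << 7) |
                      ((cx_lo & 0x1Fu) << 2) |  (d_lo & 0x3u));
}

With this layout, filling both halves with the pair (17, 0) gives rlc_pack(17, 0, 17, 0) = 8772, and a non-zero column packed as (17, 1) followed by context 18 with the two combined decision bits gives 8904 to 8907, consistent with the decimal values annotated in Figure 6.7.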
For a zero stripe column coded by RLC, only one CX/D pair (17, 0) needs to be generated, and the two halves of the codeword are both filled with it. When coding a non-zero stripe column, firstly a CX/D pair (17, 1) is generated, which fills the highest 7 bits of the codeword. After that, another two CX/D pairs (18, 0 or 1), (18, 0 or 1) are generated and stored in the lowest 7 bits of the codeword by combining the two decision bits together. According to its contents, the 14-bit codeword can be represented in decimal as shown in Figure 6.7. Assuming the four bits in a stripe column are b0, b1, b2, b3 (from top to bottom), a weight factor is employed to indicate the position of the first non-zero bit in the stripe column, with which the codeword can be assigned the correct value, as illustrated in the figure.

Figure 6.7 Codeword Structure in RLC Unit (5 + 2 + 5 + 2 bits; zero column: both halves hold (17, 0), decimal 8772; non-zero column: (17, 1) in the upper half and context 18 with two combined decision bits in the lower half, decimals 8904-8907 according to the weight factor; weight factor = b3<<3 + b2<<2 + b1<<1 + b0, where b0-b3 are the magnitude bits in the current stripe column)

There is another state employed in this work to indicate the location where RLC finishes, termed valid_state. Table 6.2 illustrates the valid_state and its value assignment. As RLC is only applied at the beginning of a stripe column, the first bit has different valid_state values compared with the other three bits. In the case that no RLC is applied to a stripe column, the valid_state of the first bit is set to "0". When RLC is applied, the valid_state of the first bit depends on the weight factor introduced above. If the weight factor is zero, the valid_state is set to "3", indicating that the whole stripe column is RLC coded; otherwise it is set to "1", which means that RLC is applied but only to a subset of the four bits in the column. For the other three bits, their valid_states depend on the value of the weight factor. If the valid_state is set to "2", RLC has been applied to this bit; otherwise the valid_state is set to "0".

Table 6.2 Valid_state in the RLC Unit
Bit   valid_state   Conditions
b0    0             RLC is not applied to this stripe column
      1             RLC applied and weight factor != 0
      3             RLC applied and weight factor = 0
b1    2             RLC applied and weight factor < 8
      0             else
b2    2             RLC applied and weight factor < 4
      0             else
b3    2             RLC applied and weight factor < 2
      0             else

Based on the above discussion, the RLC unit is efficiently implemented on RICA based architecture as illustrated in Figure 6.8. The RLC unit checks every stripe column at the beginning in order to find out whether RLC should be applied. The codeword containing the CX/D pairs and the valid_state indicating the RLC ending position are generated for the EBCOT CM coding engine, which will split the CX/D pairs from the codewords according to the provided information.

Figure 6.8 The Structure of RLC Unit (inputs: the significant states of all required samples and the bits in the column; steps: judge RLC, calculate the weight factor, generate the valid_state for each bit and the codeword for the stripe column)

6.4. Partial Parallel Architecture for CM

6.4.1.
Architecture Based on efficient implementations of the four primitive coding schemes, an optimised Partial Parallel Architecture for CM [84]is developed specially targeting RICA based architectures. The PPA method is derived from the original PPCM architecture with utilisation of the loop splitting technique. Figure 6.9 illustrates the PPA method. Since the coding pass 2 does not affect the sample’s significant state, in PPA it is executed in parallel with coding pass 1, while pass 3 is executed separately after the first two coding passes finishing the current bit-plane. As a result, there are two separate coding windows with the same size (5x3) for pass 1&2 and pass 3 in PPA and two kernels are constructed respectively. As most of the primitive coding schemes employed in coding pass 3 are the same as coding pass 1, RICA based architecture can be dynamically reconfigured for the pass 3 kernel without requiring additional computational resources. For each coding Coding windows for Go to next pass 3 bit-plane Current bitplane finished, go to pass 3 Coding windows for pass 1 & 2 Current stripe Column Causal Figure 6.9 Partial Parallel Architecture for Context Modeling 93 EBCOT on RICA Based Architecture and ARM Core window, stripe causal technique is utilised to eliminate the dependence of coding operations on the significance of samples belonging to the next stripe [5], [38]. The four bits in a stripe, from top to down, are coded in parallel [84]. 6.4.2. PPA based CM Coding Procedure Given a codeblock, PPA based CM starts to code it from the MSB to the LSB. The two coding windows shift independently from left to right and stripe by stripe. As discussed in Chapter 2, the required information for CM includes significant state (δ), refinement state (γ), sign (χ) and magnitude (v) information. In addition, since coding pass 3 is executed separately with the other two coding passes, another state (θ) indicating whether the bit has been coded by coding 1 or 2 is necessary in PPA. In order to reduce the number of memory accesses, all the required information except χ and v is stored in different data buffers, while χ and v are directly obtained from the coefficients in the current codeblock. Along with the shifting coding window, the required information is read out from data buffers consecutively via SBUF cells, while updated state information is written into corresponding data buffers in sequence during the coding process. Figure 6.10 gives an example of how data buffers work in PPA. It illustrates the case of data buffers containing δ and γ when coding the pth bitplane (LSB <p< MSB). Each data buffer corresponds to each line of the coding window. When coding the first stripe, data buffer 0 is reserved to be all zeros as there is no valid sample. When the current stripe is finished, the reading address of Coding the first stripe Starting Address 0 Starting SBUF 1 Address 1 Starting SBUF 2 Address 2 Starting SBUF 3 Address 3 Starting SBUF 4 Address 4 Starting Address 4 Continue to read Continue to read Continue to read Continue to read Reserved (0) SBUF 0 Coding the second stripe When a line finished, the read address of buffer 0 is reset to be the last starting address of buffer 4 Codeblock line 0 Codeblock line 4 ... Codeblock line 1 Codeblock line 5 ... Codeblock line 2 Codeblock line 6 ... Codeblock line 3 Codeblock line 7 ... Codeblock line 3 Codeblock line 0 Codeblock line 4 ... Codeblock line 1 Codeblock line 5 ... Codeblock line 2 Codeblock line 6 ... Codeblock line 3 Codeblock line 7 ... 
Figure 6.10 An Example of How Data Buffers Work in PPA (when a line is finished, the read address of buffer 0 is reset to the last starting address of buffer 4)

SBUF 0 is reset to the last starting address of data buffer 4, while the other four buffers keep consecutive reading addresses. Therefore, when coding the second stripe, the state information belonging to codeblock line 3 is read out via SBUF 0, while the information for the other four lines is read out consecutively. In this way, PPA ensures that the coding window contains the correct state information during the coding procedure. For the case of θ, only four data buffers are required and there is no address reset, as only the θ within the current stripe is needed.

One of the main challenges is how to eliminate the conditional branches which would break potential kernels. In PPA, all the primitive coding schemes belonging to the different coding passes are executed simultaneously, and the required contexts are selected as the final outputs according to the algorithm discussed in Chapter 2 and Section 6.3. Pseudo code is given in Figure 6.11 to show the detailed working process of PPA. The selecting operations are realised by multiplexers with an architecture similar to that discussed in Section 6.3. Table 6.3 lists the basis according to which these selections are made.

Table 6.3 CX/D Selection in PPA
For coding passes 1 and 2:
  ZC    Condition_ZC(1) = 1
  SC    Current bit = 1 && Condition_ZC = 1
  MRC   Significant_currentbit = 1
For coding pass 3:
  RLC   Condition_RLC(2) = 1
  ZC    Not coded by pass 1 or 2
  SC    Current bit = 1 && Significant_currentbit = 0
(1) Condition_ZC = (significant_currentbit = 0) && (any significant_neighbour = 1)
(2) Condition_RLC = all the significant states in the coding window are zero

for (k = 0; k < bit_depth; k++)  // bit-plane iteration
{
    Initialise the buffer addresses for state variables;
    for (j = 0; j < number_of_lines; j += 4)  // coding passes 1 and 2
    {
        Initialise the state variables;
        for (i = 0; i < line_length; i++)  // kernel 1
        {
            Read the state variables including χ and v into the coding window;
            Pass 1 coding;  // including four ZC and four SC
            Pass 2 coding;  // parallel with pass 1, including four MRC
            Select the correct CX/D pairs;
            Write the updated variables back into the data buffers;
            Write the selected CX/D pairs into data buffers as the intermediate;
        }
        Reset the buffer read/write addresses;
    }
    for (j = 0; j < number_of_lines; j += 4)  // coding pass 3
    {
        for (i = 0; i < line_length; i++)  // kernel 2
        {
            Read the state variables including χ and v into the coding window;
            Read the intermediate CX/D pairs from coding passes 1 and 2 in the coding window;
            Pass 3 coding;  // including RLC, four ZC and four SC
            Select the correct CX/D pairs;
            Write the updated variables back into the data buffers;
            Write the selected CX/D pairs into the final output memory;
        }
        Reset the buffer read/write addresses;
    }
}
Figure 6.11 Pseudo Code of PPA Working Process

When processing finishes, the four coded CX/D pairs for a stripe column are generated. The information provided by PPA for each bit includes:
a. The significant or refinement state of the sample.
b. The CX/D pair of the sign bit (if it exists).
c. The CX/D pair of the magnitude bit.
d. The valid_state.
As the maximum number of memory accesses in a kernel is restricted to 4, the information for each bit must be written into the memory by a single operation.
In this case, PPA combines all the required information listed above for each bit into a single codeword so the complete information for a stripe column can be written into the memory simultaneously without breaking the kernels. Figure 6.12 illustrates the codeword structure. In order to distinguish the CX/D pairs generated by coding pass 3 from those coded by pass 1, an offset is added to contexts generated by coding pass 3 (excluding RLC contexts) which will be removed in the following process. Since the RLC coding scheme may generate various numbers of CX/D pairs within one stripe column and conditional branches must be eliminated in the kernel, the valid_state discussed in Section 6.3.4 is used to indicate whether any RLC coded CX/D pairs are involved in the codewords from a stripe column and the count of them. Table 6.4 gives an illustration of the generic RLC Significant state | refinement state CX/D pair of the sign bit CX/D pair of the magnitude bit (RLC codeword) Valid_state 1 bits 8 bits 14 bits 2 bits Figure 6.12 PPA Codeword Structure Table 6.4 Valid_state Indication for RLC in PPA Valid_state Indication 0 The current bit is not coded by RLC 1 The current bit is the first bit in a stripe column. The bit itself and some following bits in the stripe column are coded by RLC 2 The current bit is not the first bit in a stripe column and the bit is coded by RLC 3 The current bit is the first bit in a stripe column and the entire stripe column is coded by RLC 97 EBCOT on RICA Based Architecture and ARM Core indicating procedure for each output combination in PPA. With the provided information, CX/D pairs can be correctly derived by the following modules. 6.5. Arithmetic Encoder in EBCOT The top-level architecture and details of the key sub-modules in AE have been introduced in Chapter 2. It is observed that the encoding algorithm is composed of frequent conditional branches and simple operations. When targeting RICA based implementations, these conditional branches will split potential kernels into separate steps, which will dramatically increase the configuration latency and extend the execution time. On the other hand, if we force the application to be executed in a single kernel, that is, to employ massive comparators and multiplexers to eliminate branches like the way CM is implemented, the kernel will have a long critical path even after being pipelined due to the serial nature of AE. In this case, AE becomes the bottleneck of a pure RICA based JPEG2000 encoder implementation. In this work, an ARM [85] core is integrated with RICA based architecture for an efficient AE solution. ARM is the leading provider of 32-bit embedded microprocessors and offers a wide range of processors based on a common architecture that delivers high performance, power efficiency and reduced system cost. With the extension of DSP instruction set and instruction prediction, ARM shows its considerable performance in signal processing and control applications with noticeable low power consumption. Currently, the mature products from ARM include ARM7, ARM9, ARM11 and Cortex series. Considering this application itself, an ARM946E-S processor core is chosen in this work due to its high performance and ultra low power dissipation features. The ARM946E-S core implements the ARMv5TE [86] instruction set and features and enhanced 16x32-bit multiplier capable of single cycle MAC operations and DSP instructions in order to accelerate signal processing algorithms and applications. 
A number of accelerating methods for AE implementation have been presented, including developing parallel architectures, pipelining the coding process and simplifying the encoding procedure via DSP techniques [87-93]. Due to the nature of ARM, in this work attention is focused on efficient simplification of the encoding procedure. As presented, coded CXs and Ds are written into the memory simultaneously by PPA within codewords, due to the conditional branch and memory access restrictions of RICA paradigm. In this case, AE has to derive the separate CXs and Ds via shifting and logic operations. Based on the algorithm, the deciding procedure for MPS and LPS is simplified as follows:

If (D == MPS(CX)) code MPS, else code LPS

As the RENORME module is the most time consuming task in AE, the optimisation focuses on reducing the number of iterations when the two probability subintervals are switched. The simplification methods proposed in [88, 93] are referred to, as they are suitable for an ARM based implementation.

Figure 6.13 (a) Original RENORME Architecture (b) Optimised RENORME Architecture (the optimised flow computes NS = CLZ(A) - 16 and performs the required shifts of A and C in batches bounded by CT, calling BYTEOUT when CT reaches zero)

Figure 6.14 (a) Original BYTEOUT Architecture (b) Optimised BYTEOUT Architecture

The basic idea behind these approaches is to calculate the number of left shifts of register A required in RENORME, so that the iteration can be completely eliminated by a single operation. In this case, the number of leading zeros in A must be detected when performing RENORME. Instead of preparing a reference table as in [88], the DSP instruction CLZ in ARM is utilised to simplify the implementation. Figure 6.13 illustrates the difference between the original and the optimised RENORME module. The BYTEOUT module is also simplified by merging logic operations and reducing conditional branches for enhanced performance, as illustrated in Figure 6.14.
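As a concrete illustration of the CLZ-based renormalisation, the sketch below batches the shifts of A and C instead of looping one bit at a time. It assumes the usual MQ-coder register conventions (a 16-bit interval register A held in a 32-bit word, a shift counter CT refilled by BYTEOUT) and a GCC-style __builtin_clz that maps to the ARM CLZ instruction; it is a reformulation of the idea in Figure 6.13(b), not the thesis code.

#include <stdint.h>

/* Stub standing in for the MQ-coder BYTEOUT of Figure 6.14: emits a byte
 * from C and refills CT with 7 or 8.  Simplified to a placeholder here. */
static void byteout(uint32_t *C, int *CT) { (void)C; *CT = 8; }

/* CLZ-accelerated RENORME: shift A and C left until bit 15 of A is set,
 * performing the shifts in batches bounded by CT rather than one by one.
 * A is never zero in the MQ coder, so CLZ is well defined. */
static void renorme_clz(uint32_t *A, uint32_t *C, int *CT)
{
    int ns = __builtin_clz(*A) - 16;   /* shifts needed to renormalise A */

    while (ns > 0) {
        int s = (ns < *CT) ? ns : *CT; /* shift up to the next byte boundary */
        *A <<= s;
        *C <<= s;
        *CT -= s;
        ns  -= s;
        if (*CT == 0)                  /* a byte of C is ready */
            byteout(C, CT);
    }
}

Compared with the original do-while loop, which shifts a single bit per iteration and tests A & 0x8000 each time, the number of iterations drops to the number of pending BYTEOUT calls.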
6.6. EBCOT Tier-2 Encoder

In this work, the Tier-2 encoder is implemented on the ARM core together with AE. As discussed in Chapter 2, the key modules in the EBCOT Tier-2 encoder are Tag-Tree coding and bit-stream length coding. The information that needs to be coded by the Tag-Tree includes the inclusion information (relating to layer information) and the number of zero bit-planes in each codeblock within a DWT subband. Figure 6.15 illustrates a detailed flow graph explaining how Tag-Tree coding is implemented on ARM.

Figure 6.15 Detailed Tag-Tree Coding Procedure (the 1-level leaf nodes and the root node are obtained from the original data array; the root node is coded and output first, the coded bits are output in raster order, and the bits for the original data cannot be output before the corresponding 1-level nodes)

Since the bit-stream length coding process involves a log calculation, an LUT is utilised in this work to replace the log calculation and to obtain the number of bits required to represent the bit-stream length. The LUT supports numbers of coding passes from 1 to 2048, which is sufficient for almost all cases. Figure 6.16 illustrates the coding procedure for the bit-stream length information.

Figure 6.16 Detailed Codeword Length Coding Procedure (given the number of coding passes, obtain the current and expected numbers of bits, calculate the difference, stuff the codeword indicator, update LBlock and output the codeword length)

Since the EBCOT Tier-2 encoder only processes the global information of a tile with a simple and straightforward architecture, its execution time is negligible compared with the computationally intensive tasks such as 2-D DWT, CM and AE.

6.7. Performance Analysis and Comparisons

The EBCOT CM and AE modules are implemented on RICA based architecture and ARM respectively. Performance comparisons between the original CM architecture, the PPCM architecture and the proposed PPA CM architecture, all implemented on RICA based architecture, are made in Table 6.5.

Table 6.5 Performance Comparisons
                                       Original   PPCM    PPA
Throughput (MSymbols/s)                14.43      50.91   69.8
Critical path after pipelining (ns)    6.67       35.72   15.96 / 12.78

It is clear that although the original implementation has the shortest critical path, it has the lowest throughput due to the frequent conditional branches, which introduce massive configuration latency. In contrast, the PPCM method improves the throughput significantly by executing the complete coding process within a large kernel. However, the length of its critical path increases dramatically, since it is difficult to pipeline such a large kernel into deep levels. Compared with the other two approaches, PPA demonstrates the highest throughput by constructing two kernels of proper sizes which can be better pipelined.

Figure 6.17 PPA Based CM Execution Time under Different Pre-Conditions (execution time in ms for codeblock sizes 64x64, 32x32 and 16x16)

Figure 6.17 illustrates the execution time of PPA based CM under different pre-conditions (codeblock size, DWT level) when processing a 256x256 8-bit Lena image. It is seen that the proposed PPA based CM keeps the same execution time under different DWT levels. This is because the various primitive coding schemes in PPA are executed in parallel and the final outputs are selected from a set of possible results. Based on this architecture, PPA based CM is not data-sensitive, which means its execution time only depends on the amount of input data and has no relationship to the data values. In this case, PPA based CM can provide stable performance when working with different images. It can also be seen that the execution time increases when the codeblock size gets smaller. This increase is introduced by the configuration latency and other related processing when finishing the current codeblock and switching to the next one. The more codeblocks there are, the more latency is generated. Table 6.6 gives the numbers of cells utilised in CM on the customised RICA based architecture.
It is observed that the COMP, MUX and LOGIC cells take the majority of the overall occupied computational resources, in order to prevent conditional branches and to construct kernels, and PPA shows nearly a halving of the cell usage compared with PPCM.

Table 6.6 Numbers of Cells in CM Engines on Customised RICA Architecture
Cell     PPCM   PPA      Cell         PPCM      PPA
ADD      61     45       SBUF         15        27
COMP     492    235      RMEM         4         4
MUX      482    287      WMEM         4         4
REG      526    520      JUMP         1         1
SHIFT    48     61       RRC          1         1
LOGIC    496    291      Total Area   1038539   763747

The performance of AE implemented on ARM is also compared with two RICA based implementations, targeting the original approach and the optimisation with a kernel respectively. The ARM based AE is simulated by the ARM RVDS tool flow [94] with a working frequency of 500MHz, which is equal to the memory accessing delay of RICA paradigm. Execution time is obtained by coding a simple CX/D sequence of length 1024, and the results are demonstrated in Table 6.7.

Table 6.7 Performance Comparisons
                    RICA (original)   RICA (kernel)   ARM (500MHz)
Coding time (ms)    0.1353            0.1187          0.033

The significant throughput improvement of the ARM implementation mainly benefits from the processor's efficient branch prediction and DSP instructions such as CLZ, while the RICA based implementations suffer from either the large configuration latency introduced by branches or the long critical path of the kernel. Generally, the average coding speed of the ARM-based AE is approximately 16 cycles per CX/D pair. Compared with the TI C6416 AE implementation presented in [93], which has an average of 13 cycles per CX/D pair, the proposed ARM-based AE has the extra burden of deriving CX and D from every codeword via shifting and logic operations, and it is possible to improve its speed by further assembly-level optimisation.

6.8. Conclusion

In this chapter, an optimised EBCOT implementation on customised RICA based architecture and an ARM core is presented. In Section 6.2, different existing algorithms for CM in EBCOT are introduced and their positives and negatives are discussed, especially when targeting RICA based architecture. Before introducing the novel PPA algorithm, Section 6.3 presents efficient designs of the four primitive coding schemes in CM. These designs are optimised with the aims of reducing memory accesses and eliminating conditional branches. In particular, a special codeword is designed for the RLC scheme, which allows the multiple CX/D pairs generated by RLC to be transmitted by a single memory access operation. In order to eliminate conditional branches, a variable named valid_state is designed to indicate the ending position of RLC in every stripe column. Meanwhile, a selecting scheme for choosing the correct CX/D pairs is developed. The output for every DWT coefficient is generated by PPA within a single final codeword, which includes all the required information, including the CX/D for the magnitude bit, the CX/D for the sign bit, the significant state, etc.

In Section 6.5, the arithmetic encoding algorithm is evaluated. Since it consists of massive numbers of conditional branches, it is inefficient to implement AE on RICA based architecture. In contrast, AE is efficiently implemented on an ARM core which can be embedded into RICA based architecture. The AE structure is successfully optimised by utilising DSP instructions in ARM such as CLZ and by simplifying logic combinations.
In Section 6.6, the Tag-Tree and bit-stream length coding schemes in EBCOT Tier-2 encoder are presented. Section 6.7 mainly targets performance comparisons. Different algorithms for CM, including the original algorithm, PPCM and PPA, are implemented on RICA based architecture and compared. It is demonstrated that PPA offers the highest throughput and lower area occupation compared with PPCM. It is also demonstrated that PPA can provide stable performance when working with different images. Meanwhile, the AE implemented on ARM core shows higher throughput compared with implementations on RICA based architecture. It also demonstrates a comparable performance when compared with a TI C6416 AE implementation. With the proposed 2-D DWT and EBCOT, the proposed solution for JPEG2000 encoder will be presented in the following chapter. 105 JPEG2000 Encoder on Dynamically Reconfigurable Architecture Chapter 7 JPEG2000 Encoder on Dynamically Reconfigurable Architecture 7.1. Introduction This chapter presents the system-level integration and optimisation of the JPEG2000 encoder on the proposed dynamically reconfigurable architecture consisting of RICA based architecture and an ARM core. Targeting an efficient data transfer scheme between 2-D DWT and EBCOT, the scanning pattern of the 2-D DWT presented in Chapter 5 is optimised with the aim of accelerating the processing and reducing the required intermediate data storage [73]. Meanwhile, CM and AE modules in EBCOT are integrated by a memory relocation module with a carefully designed communication scheme [95]. A Ping-Pong memory switching mode is developed in order to further reduce the execution time. Based on the system-level integration and optimisation, performance of the proposed architecture for JPEG2000 is evaluated in detail. Simulation results demonstrate that the proposed architecture for JPEG2000 offers significant advantage in throughput compared with various DSP& VLIW and coarse-grained reconfigurable architecture based applications. Furthermore, a power estimation method of RICA paradigm based architectures is presented and the system energy consumption is evaluated. 106 JPEG2000 Encoder on Dynamically Reconfigurable Architecture 7.2. 2-D DWT and EBCOT Integration It is observed that 2-D DWT and EBCOT have different data processing patterns, as 2-D DWT is line (column) based while EBCOT processes data samples at bit-level within codeblocks. In this case, EBCOT has to wait for 2D DWT to finish transforming a number of complete lines and columns in order to obtain a full codeblock, as illustrated in Figure 7.1. In order to improve the coding efficiency, the 2-D DWT scanning pattern is modified in this work as illustrated in Figure 7.2. Instead of scanning a complete line or column of the image, modified 2-D DWT takes an area of 4 codeblocks as its processing unit at a time. After 2-D DWT, this area is directly transformed to four codeblocks belonging to different subbands. Codeblocks for LH, HL and HH subbands are then coded by EBCOT separately, while codeblock for LL subband is reserved and stored for a deeper level transform. When the current DWT finishes, the next four codeblocks become the processing unit. In this way, codeblocks for EBCOT are generated directly without delay of finishing a complete line/column, and the required intermediate storage for 2-D DWT is reduced since only four codeblocks are transformed at a time instead of the entire image. line codeblock ... ... ... Column ... ... 
Figure 7.1 Original data processing pattern between 2-D DWT and EBCOT (EBCOT must wait for complete lines and columns before a full codeblock is available)

Figure 7.2 Modified 2-D DWT Scanning Pattern (an area of four codeblocks is taken as the processing unit; after the transform it maps directly to one LL, LH, HL and HH codeblock, and the next four codeblocks follow)

7.3. CM and AE Integration

7.3.1. System Architecture

Since the CM and AE modules are implemented separately on two different architectures, a shared DPRAM, which acts as the communication channel between RICA based architecture and the ARM core, is utilised to integrate CM and AE. Figure 7.3 illustrates the proposed architecture with the DPRAM. As both RICA paradigm and the AE implementation on the ARM core operate on 32-bit operands, a 32-bit DPRAM is selected in this work. The depth of the DPRAM can be flexible, but its minimum capacity should satisfy the following requirements:
- The storage space for DWT coefficients belonging to the LH, HL and HH subbands (three codeblocks) and the CX/D codewords belonging to a single bitplane of a complete codeblock generated by CM. In total the required depth is 3x64x64 + 64x64 = 16384.
- The storage space for all the CX/D pairs belonging to a single bitplane of a complete codeblock. Usually the maximum size of a codeblock is 64x64. Considering the extreme condition (all the samples are sign coded), the required storage depth is 2x64x64 = 8192.
- Some reserved space for storing communication variables (fewer than 32).

Figure 7.3 Proposed Architecture with DPRAM (the original image is processed by 2-D DWT and CM on RICA, which can be dynamically reconfigured for the different modules, while AE runs on the ARM core; the shared DPRAM sits between them)

Based on the above analysis, a 32K x 32-bit DPRAM is selected in this work. This DPRAM can be accessed by both RICA based architecture and ARM via its two ports. Several communication variables are utilised to control the communication between RICA based architecture and ARM in order to avoid memory access conflicts; these variables are introduced in Section 7.3.3.

7.3.2. Memory Relocation Module

As discussed in the previous chapter, coded CX/D pairs from CM are generated in parallel within codewords. According to the JPEG2000 standard, the following AE module needs to receive the CX/D pairs separately according to the different coding passes, together with the numbers of these pairs. In this case, the CX/D pairs generated by CM need to be derived from the codewords and relocated. Due to the memory access and conditional branch restrictions of RICA paradigm, the deriving and relocating operations cannot be performed simultaneously with CM. In this work, a module termed Memory Relocation (MR) is added between CM and AE, in order to ensure that the AE module receives the CX/D pairs in the correct order.

Figure 7.4 (a) Memory Relocation in JPEG2000 Encoder (b) Detailed Architecture of the MR module (split the codeword, obtain all information, locate the magnitude CX/D, detect and locate any sign CX/D, increase the CX/D count, move to the next codeword)

The MR module is implemented on RICA based architecture as illustrated in Figure 7.4. Given a codeword, MR first splits it and obtains all the information provided by CM mentioned in Section 6.4.2, and then the magnitude CX/D pair is relocated to the corresponding coding pass storage. After that, MR checks whether the current codeword contains any sign CX/D pair. If it does, the sign CX/D is relocated together with the magnitude CX/D pair. Meanwhile, the CX/D pair count is incremented whenever a CX/D pair is relocated. The valid_state is also utilised by MR in order to derive and relocate RLC coded CX/D pairs correctly.
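A minimal sketch of the splitting step is given below. It assumes the fields are packed most-significant first in the order drawn in Figure 6.12 (1-bit state, 8-bit sign CX/D, 14-bit magnitude CX/D or RLC codeword, 2-bit valid_state); the exact bit positions and all names are illustrative assumptions, not the RICA implementation.

#include <stdint.h>

/* Field widths follow Figure 6.12; packing order and names are assumed. */
typedef struct {
    unsigned state;        /* 1 bit : significant / refinement state       */
    unsigned sign_cxd;     /* 8 bits: CX/D pair of the sign bit, if any    */
    unsigned mag_cxd;      /* 14 bits: magnitude CX/D pair or RLC codeword */
    unsigned valid_state;  /* 2 bits: RLC indication (Table 6.4)           */
} ppa_fields;

static ppa_fields mr_split(uint32_t codeword)
{
    ppa_fields f;
    f.valid_state = codeword         & 0x3u;      /* lowest 2 bits */
    f.mag_cxd     = (codeword >> 2)  & 0x3FFFu;   /* next 14 bits  */
    f.sign_cxd    = (codeword >> 16) & 0xFFu;     /* next 8 bits   */
    f.state       = (codeword >> 24) & 0x1u;      /* top field     */
    return f;
}

MR would then append the magnitude (and, when present, sign) CX/D to the storage of the selected coding pass, increment the corresponding pair count and use the valid_state to expand any RLC codewords, as described above.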
7.3.3. Communication Scheme between CM and MR

In order to reduce the implementation complexity, communication between RICA based architecture and the ARM core is realised directly through variables located at specified memory addresses, which are listed in Table 7.1. These variables can be accessed by both RICA based architecture and ARM through simple memory operations.

Table 7.1 Communication Variables
Variable          Purpose
AE_START          To start AE on ARM
MEMORY_READY      To indicate that the shared DPRAM is ready for accessing without conflict
AE_READY          To indicate that the ARM core is ready for the next AE coding
BITPLANE_INDEX    To notify ARM of the index of the bit-plane
COUNT_PASS1/2/3   To store the numbers of CX/D pairs coded in each coding pass

Figure 7.5 gives a pseudo code illustration of how the integrated EBCOT architecture works [95].

RICA:
// Initialise
set AE_START = 0;       // AE only starts when relocated CX/D pairs are ready
set MEMORY_READY = 1;   // shared DPRAM is initialised ready for use
for (k = 0; k < bitdepth; k++)   // loop over bitplanes
{
    BITPLANE_INDEX = k;                  // index the bitplane
    do {} while (MEMORY_READY == 0);     // wait for the DPRAM to be ready
    // RICA based architecture processing starts
    RICA based architecture processing;  // CM
    // processing finishes
    do {} while (AE_READY == 0);         // wait for ARM to finish
    // Memory relocation starts
    set MEMORY_READY = 0;
    MR starts;
    set MEMORY_READY = 1;                // relocated CX/D pairs are ready
    set AE_START = 1;                    // start ARM
}

ARM:
// Initialise
set AE_READY = 1;
while (BITPLANE_INDEX < bitdepth)
{
    // wait for the relocated CX/D pairs and the starting signal from RICA
    do {} while ((MEMORY_READY == 0) || (AE_START == 0));
    // ARM processing starts
    set AE_READY = 0;
    ARM processing;                      // AE
    set AE_READY = 1;
    // wait until MR of the next bitplane starts
    do {} while (MEMORY_READY == 1);
}
Figure 7.5 Pseudo Code for EBCOT Implementation on the Proposed Architecture

At the beginning of encoding, the AE_START signal is initialised to "0" to make sure AE does not work until the CX/D pairs of the current bitplane are ready; at the same time MEMORY_READY is set to "1". On the ARM side, after initialisation, AE_READY is set to "1", indicating that ARM is ready for AE coding. When coding starts, the index of the current bitplane is indicated by BITPLANE_INDEX and is ready to be passed to ARM. RICA checks the shared DPRAM; if it is ready, CM coding starts. After completing the CM coding of the current bitplane, RICA checks AE_READY to see whether ARM is ready for AE. If it is, MR is executed with MEMORY_READY set to "0" in order to avoid any unexpected access to the shared DPRAM. When MR finishes, both MEMORY_READY and AE_START are set to "1" to start AE on ARM. On the ARM side, ARM checks BITPLANE_INDEX regularly until the entire codeblock is finished.
AE does not start until both MEMORY_READY and AE_START signals are “1” to ensure it is safe to access the shared DPRAM. The AE_READY variable is set to “0” at the time AE starts 112 JPEG2000 Encoder on Dynamically Reconfigurable Architecture CM Bitplane 0 MR Bitplane 0 AE Bitplane 0 CM Bitplane 1 Stage 1 Stage 2 MR Bitplane 1 AE Bitplane 1 CM Bitplane 2 Stage 3 & stage 1 MR Bitplane 2 ... Figure 7.6 Pipeline Structure of the JPEG2000 Encoder in order to indicate that ARM is busy. When AE coding finishes, AE_READY is set to “1” again and ARM is put to wait state and jumps back to the beginning of bitplane loop only when the next MR starts. One of the benefits of this working mode is that it is possible to pipeline between RICA based architecture and ARM core during the coding process. Excluding 2-D DWT which is executed at beginning of the coding procedure, the JPEG2000 encoder on the proposed architecture is considered to be composed of three stages: CM, MR and AE. Since the MR module acts as the intermediate between CM and AE, a 3-stage pipeline is established with which CM and AE can be executed in the same time slot. Figure 7.6 illustrates this pipeline architecture. As described in Table 7.1, four communication variables take charge of controlling the pipeline structure. By strictly controlling accesses to the shared DPRAM and the starting time of different coding engines, the pipeline architecture offers significant improvement in system performance [95]. 7.3.4. Ping-Pong Memory Switching Scheme Based on the above discussion, core tasks in JPEG2000 encoder are implemented and optimised on the proposed architecture. A performance evaluation is performed with the standard 256x256 grayscale Lena test image. Figure 7.7 illustrates the execution time ratio of different modules for encoding the entire image (codeblock size = 64x64, 5/3 DWT, 1-level). It is seen that the MR module takes 33% of the overall execution time and become the system bottleneck. In other words, if a more efficient pipeline scheme can be established which ensures that MR module can also be executed simultaneously with other modules instead of only pipelining CM 113 JPEG2000 Encoder on Dynamically Reconfigurable Architecture 3% 24% 2-D DWT CM MR AE 40% 33% Figure 7.7 Execution Time Ratio of Different Modules in JPEG2000 Encoder Ping-Pong Switch Shared DPRAM Memory Block A 2-D DWT coefficients CM MR AE Memory Block B CM + MR Bitplane 0 (Memory Block A) Idle CM + MR Bitplane 1 (Memory Block B) AE Bitplane 0 (Memory Block A) CM + MR Bitplane 2 (Memory Block A) AE Bitplane 1 (Memory Block B) ... AE Bitplane 2 (Memory Block A) ... Figure 7.8 Ping-Pong Memory Switching Architecture and AE, the system performance will be significantly improved. In this work, another memory block with the same size of the CX/D pair storage space in the shared DPRAM is employed to construct a Ping-Pong memory switching scheme, as illustrated in Figure 7.8. These two memory blocks in the shared DPRAM are accessed alternately by both RICA based architecture and ARM core. When CM and MR finish coding the 1st bitplane, CX/D pairs are stored in memory block A and then coded by AE. At the same time the 2 nd bitplane is coded by CM and MR and stored in memory block B. After that, AE fetches CX/D pairs from memory block B to code the 2 nd bitplane, meanwhile CM and MR switch to memory block A again for the next bitplane and so until the complete codeblock is coded. 
In this way, CM and MR are executed in the same time slot as AE, leading to a further execution time reduction. When the Ping-Pong memory switching mode is applied, the DPRAM illustrated in Section 7.3.1 can be replaced by a similar DPRAM with different address spaces for the two Ping-Pong data blocks, or even by a 4-port RAM. Obviously, the capacity for storing CX/D codewords and relocated CX/D pairs needs to be doubled.

7.4. Performance Analysis and Comparison

System performance analysis is performed with the standard Lena test image (256x256, 8-bit grayscale). Core tasks in JPEG2000 are implemented on the proposed architecture mainly using ANSI-C, including some embedded assembly language for improving the code efficiency. Core tasks on RICA based architecture are then compiled, scheduled and simulated by the 65nm technology based RICA simulator [6], with optimisation by both hand and compiler. The ARM-based modules are simulated by the ARM RVDS tool flow [94] with a working frequency of 500MHz, which is equal to the memory accessing delay of RICA paradigm.

7.4.1. Execution Time Evaluation

The execution time of the different modules on the proposed architecture is listed in Table 7.2. Based on the Ping-Pong memory switching mode, the total execution time is calculated as follows:

Total execution time = DWT + MAX(CM + MR, AE) + AE_last_bitplane

It is observed that the most time consuming modules are CM, MR and AE. It is worth noticing that the critical path length of the kernels for the CM module is curbed, for pipelining purposes, by the number of available registers supported by the current RICA tool flow (maximum 547 including scratch registers). In other words, a deeper pipeline for CM could be established if more registers could be used, leading to a shorter execution time. Due to the memory access limitation of RICA paradigm, MR can only process one codeword at a time instead of four as CM does, and this is the reason why the MR module is even more time consuming than CM. Meanwhile, the ARM-based AE implementation is based on a serial architecture, so it has a higher time consumption than CM and MR.

Table 7.2 Detailed Execution Time of the JPEG2000 Encoder Sub-modules on the Proposed Architecture
(a) DWT 5/3
           64x64            32x32                     16x16
Time (ms)  level 1 level 2  level 1 level 2 level 3   level 1 level 2 level 3 level 4
DWT        0.83    1.04     0.88    1.1     1.15      0.98    1.22    1.28    1.3
CM         7.86    7.87     8.53    8.53    8.53      9.86    9.87    9.87    9.87
MR         10.49   10.49    10.51   10.51   10.51     10.6    10.6    10.6    10.6
AE         12.49   11.99    12.48   11.98   11.9      12.46   11.96   11.9    11.9
Total      22.1    22.42    22.86   23.34   23.49     24.37   24.89   25.06   25.12
(b) DWT 9/7
           64x64            32x32                     16x16
Time (ms)  level 1 level 2  level 1 level 2 level 3   level 1 level 2 level 3 level 4
DWT        0.89    1.12     1.003   1.25    1.32      1.22    1.53    1.61    1.63
CM         7.86    7.87     8.53    8.53    8.53      9.86    9.87    9.87    9.87
MR         10.49   10.49    10.51   10.51   10.51     10.6    10.6    10.6    10.6
AE         13.22   12.77    13.23   12.78   12.76     13.25   12.8    12.77   12.77
Total      21.96   22.24    22.76   23.15   23.28     24.4    24.86   25      25.05

7.4.2. Power and Energy Dissipation Evaluation

How to calculate the power dissipation of a given architecture is a tricky problem. Power dissipation usually consists of two parts: dynamic power and static power. Dynamic power is determined directly by the working frequency, the supply voltage, the switched load capacitance and the activity factor, while static power depends on the leakage current, the threshold voltage and the manufacturing process.
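For reference, the first-order relations behind these dependencies are the textbook approximations Pdyn ≈ α · CL · VDD² · f for the dynamic part, where α is the activity factor, CL the switched load capacitance, VDD the supply voltage and f the clock frequency, and Pstatic ≈ Ileak · VDD for the static part, with Ileak the leakage current; these are general expressions rather than RICA-specific figures.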
Due to limitations of the RICA software, the current tool flow cannot provide the power dissipation of a given application. In this thesis, a rough power dissipation estimation method for RICA based applications, provided by the RICA developing group, is described. This estimation method takes into account the number of utilised gates and the average kernel frequencies. Given a RICA based application, its power dissipation is calculated by the following steps:
1. The numbers of required ICs are calculated, which depend on the largest kernels in the targeted application.
2. The required array area is obtained by summing up the different cell areas provided by the RICA tool flow [6].
3. The total gate count and the internal power (mW/MHz) are calculated using the gate density and the gate-level internal power provided by the RICA developing group and [96].
4. The average frequency (MHz) of the application is obtained from the critical path length of the main kernel.
5. The power (mW) is calculated from the internal power and the average frequency.
6. The energy consumption is calculated from the power and the application execution time.
It should be noted that the power figures given by this method are only rough estimations. Only the dynamic power dissipation is estimated; the static power and reconfiguration power are not considered. Meanwhile, if any peripheral is enabled, its external power dissipation also needs to be taken into account. For example, the power consumption of the DPRAM discussed previously in this chapter is not included in this estimation. However, this estimation method can provide a brief sense of the internal dynamic power of RICA based architectures, which is important for demonstrating the low power nature of RICA paradigm. Table 7.3 shows the estimated power and energy consumption of the different modules of JPEG2000 on the proposed architecture.
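As a sanity check of steps 2 to 6, the module figures reported in Table 7.3 can be reproduced with a few lines of C. The constants are the gate density and gate-level internal power quoted in the table; the helper function and its name are illustrative, not part of the RICA tool flow.

#include <stdio.h>

/* Rough internal dynamic power/energy estimate following steps 2-6 above.
 * Gate density: 694 kGate/mm^2; internal power: 1.2 nW/MHz per gate. */
static void estimate(const char *name, double area_um2,
                     double freq_mhz, double time_ms)
{
    double gates      = (area_um2 / 1e6) * 694e3;  /* um^2 -> mm^2 -> gates  */
    double mw_per_mhz = gates * 1.2e-6;            /* 1.2 nW/MHz/gate        */
    double power_mw   = mw_per_mhz * freq_mhz;
    double energy_mj  = power_mw * time_ms / 1e3;  /* mW x ms = uJ; /1000 mJ */
    printf("%s: %.3f mW/MHz, %.2f mW, %.3f mJ\n",
           name, mw_per_mhz, power_mw, energy_mj);
}

int main(void)
{
    /* CM with 64x64 codeblocks: area and frequency from Table 7.3(a),
     * execution time from Table 7.2(a). */
    estimate("CM", 4646606.0, 69.0, 7.86);
    return 0;
}

The result, about 3.870 mW/MHz, 267 mW and 2.10 mJ, matches the 3.869 mW/MHz, 266.96 mW and 2.10 mJ entries of Table 7.3 to within rounding.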
It is worth noticing that for the overall architecture, the total computational resource occupation is equal to that of the most computationally intensive module (CM), while the redundant ICs can be reconfigured and bypassed when processing other modules such as 2-D DWT and MR. The power dissipation of the embedded ARM core varies greatly depending on the process, libraries and optimisations. In this work, the power dissipation of the 90nm technology based ARM946E-S is chosen, which is 0.095mW/MHz [97].

Table 7.3 Power and Energy Dissipation of the JPEG2000 Encoder Sub-modules on the Proposed Architecture
(a) Power Calculation
Cell                   Area (um2)   DWT 5/3    DWT 9/7    CM        MR
ADD                    3458.16      26         33         45        6
COMP                   3458.16      3          3          235       9
LOGIC                  910.44       0          16         291       6
SHIFT                  5865.12      8          28         61        7
REG                    260          56         132        520       49
SBUF                   9375         0          0          27        0
JUMP                   886          1          1          1         1
SBOX (including MUX)   4040         38         80         660       29
Total area (um2)                    316173.6   661690.2   4646606   229176.9
Gate density = 694 KGate/mm2; internal gate-level power dissipation = 1.2 nW/MHz/Gate
Internal power (mW/MHz)             0.263      0.551      3.869     0.191
Average frequency (MHz)             150        150        69        144
Power (mW)                          39.45      82.65      266.96    27.5
(b) Energy Dissipation
             64x64            32x32                     16x16
Energy (mJ)  level 1 level 2  level 1 level 2 level 3   level 1 level 2 level 3 level 4
DWT 5/3      0.0327  0.041    0.0347  0.0435  0.0456    0.0386  0.0483  0.0507  0.0513
DWT 9/7      0.074   0.092    0.083   0.103   0.109     0.101   0.126   0.133   0.134
CM           2.10    2.10     2.277   2.277   2.277     2.634   2.633   2.633   2.633
MR           0.288   0.288    0.289   0.289   0.289     0.291   0.291   0.291   0.291
AE (5/3)     0.614   0.569    0.593   0.569   0.569     0.592   0.568   0.569   0.57
AE (9/7)     0.628   0.606    0.628   0.607   0.606     0.629   0.608   0.606   0.608
Total (5/3)  3.035   2.998    3.194   3.178   3.180     3.555   3.540   3.544   3.545
Total (9/7)  3.09    3.086    3.277   3.276   3.281     3.655   3.658   3.663   3.666

7.4.3. Performance Comparisons

Performance comparisons are made between the proposed architecture and various DSP&VLIW and coarse-grained reconfigurable architecture based implementations, such as the ARM920T (400MHz) [52], STMicroelectronics LX-ST230 (400MHz) [52], NEC Dynamically Reconfigurable Processor (NEC DRP, 150nm, 45.7MHz) [54], Philips TriMedia 1300 (250nm, 143MHz) [42], TI TMS320C6416T (90nm, 600MHz) [45], TI TMS320C6455 (90nm, 1GHz) [46], and ADI BLACKFIN ADSP-BF561 (130nm, 600MHz) [50]. The standard Lena image is utilised as the benchmark in most of the references. For references using a different size of test image, their performance is scaled to the 256x256 test image for a fair comparison. For implementations with 24-bit RGB test images, such as [52], it is assumed that the different colour components are processed serially, so only a third of the execution time is taken for comparison (the actual time consumption is usually higher). Performance comparisons are made separately according to the different 2-D DWT levels in the references. For references that do not mention their DWT levels, it is assumed that they all use 1-level 2-D DWT.

Table 7.4 lists the execution time comparisons. It is observed that the proposed architecture for JPEG2000 demonstrates considerably higher throughput in both the DWT and EBCOT aspects compared with the other solutions. Generally, these modules mainly benefit from the high levels of both DLP and ILP of RICA paradigm and from pipelined kernels.
Meanwhile, the AE execution time is successfully eliminated by the Ping-Pong memory switching mode, which improves the overall system throughput significantly. For the DSP&VLIW based solutions, performance is curbed mainly by limited parallelism at both the instruction and pixel levels. Meanwhile, the proposed architecture outperforms NEC DRP by offering a relatively higher average working frequency and a more straightforward hardware structure, due to the dynamic reconfigurability and heterogeneous nature of RICA paradigm.

Table 7.4 Execution Time Comparisons
Category          Implementation                            Execution time (ms)
                                                            DWT     EBCOT   Total
DWT 5/3 1-level   Proposed                                  0.83    21.27   22.1
                  TI TMS320C6416T [46]                      10      64.63   74.63
DWT 5/3 4-level   Proposed                                  1.3     23.82   25.12
                  TI TMS320C6455 [47]                       4.5     40.75   45.25
                  ARM920T [53]                                              412.2
                  STMicroelectronics LX-ST230 [53]                          85.3
DWT 9/7 1-level   Proposed                                  0.89    21.07   21.96
                  ADI BLACKFIN ADSP-BF561 [51]                              53
DWT 9/7 3-level   Proposed (CM only)                                8.53
                  Philips TriMedia TM1300 (CM only) [43]            10.26
Other             Proposed (full CM and AE for processing the same amount of data)          0.094
                  NEC DRP [55] (significant pass in CM and AE only; the significant pass
                  processes 256 16-bit samples, AE processes 1023 CX/D pairs)               0.213

On the other hand, the proposed architecture still has some limitations. The MR module, introduced for splitting the CX/D pairs from the codewords generated by CM, curbs the overall throughput. The two kernels for CM are also large, due to the extra COMP and MUX cells utilised to construct kernels, and the pipeline depth of these two kernels is limited by the available registers supported by the current RICA tool flow. In contrast, DSP and VLIW based solutions benefit from high frequencies and suffer only minor effects from conditional branches and memory operations. Meanwhile, the throughput of the proposed architecture is lower than that of some ASIC and FPGA based solutions such as the ADV212 [30], Barco BA110 [31], JPEG2K-E [35] and the Virtex II based solutions in [36] and [37]. Apparently, ASIC and FPGA based solutions benefit from specially designed hardware circuits and flexible branch/memory operations. Even with these drawbacks, the proposed architecture is still shown to be a promising solution for JPEG2000 by providing good throughput, high flexibility and low energy consumption at the same time.

Since the power consumption results discussed in Section 7.4.2 are rough estimations, only a couple of simple comparisons are made between the proposed architecture and some references in Table 7.5, in order to demonstrate the power-saving nature of the proposed architecture.

Table 7.5 Energy Dissipation Comparisons
Energy (mJ)                             DWT     EBCOT   Total
Proposed (CM only)                              2.277
Philips TriMedia 1300 (CM only) [43]            26.6
Proposed                                0.074   3.016   3.132
TI TMS320C6416T [46]                    2.38    15.12   17.5

Since it is difficult to tell exactly how far the estimated internal dynamic power dissipation of RICA based applications is from the actual result, only the internal dynamic core power dissipation of the referenced DSPs is considered here for a closest-to-fair comparison. The energy consumption of the DSPs is estimated using the cycle count (or the execution time and the working frequency) and the internal dynamic power of their platforms. For example, the scaled cycle count for EBCOT in [45] is around 3.87e+7, and the typical internal CPU dynamic power of the TI C6416T DSP at a 600MHz clock frequency is 0.39 mW/MHz [98].
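Written out, this estimate reduces to multiplying the cycle count by the energy per cycle, since the clock frequency cancels: E ≈ (3.87e+7 / 600MHz) x (0.39 mW/MHz x 600MHz) = 3.87e+7 x 0.39 nJ ≈ 15.1 mJ; the slightly higher 15.12 mJ figure quoted below presumably reflects the unrounded cycle count.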
In this case, the energy consumption for EBCOT in [45] is estimated as 15.12mJ. It is clear that the proposed architecture shows an advantage in power consumption for both the DWT and EBCOT modules. The DWT module, which has a very small area and a short execution time, demonstrates an outstanding low power feature. On the other hand, although the CM module consumes the most computational resources, it has the lowest kernel active frequency, leading to relatively low energy consumption. Generally, the performance of the proposed architecture is enhanced by the power-saving nature of both the RICA paradigm and the ARM core.

7.5. Future Improvements

Based on the above system analysis and comparisons, the advantages and limitations of the proposed architecture for JPEG2000 can be summarised as follows:

Advantages: The DWT module has high throughput as a result of pipelined kernels and parallel processing. The modified DWT scanning pattern further improves the overall system efficiency. The CM module is efficiently implemented due to the efficient implementation of the four primitive coding schemes, properly balanced kernels and the stripe-column level parallel processing provided by PPA. The Ping-Pong memory switching mode eliminates the AE execution time and improves the throughput dramatically. The proposed architecture, based on the RICA paradigm and an embedded ARM core, offers outstanding parallelism and power-saving features compared with traditional DSPs.

Limitations: Due to the restrictions on conditional branches and memory accesses, the overall system throughput is curbed by the MR module. The pipeline depth of the CM module is limited by the number of available registers supported by the current RICA tool flow.

Fortunately, the latest RICA tool flow under development is quite likely to overcome these limitations. It increases the memory access limit per step from 4 to 14 and is able to support more than ten thousand available registers as well as other computational resources. In this case, the MR module can be simplified and a much deeper pipeline for the CM module can be established, both of which will shorten the overall execution time. Moreover, it is possible to employ three pairs of CM and MR modules for processing the LH, HL and HH subbands respectively. These three pairs can work simultaneously, and the system architecture can be further optimised with the overall throughput improved significantly.

Table 7.6 Future Throughput Improvement

5/3 mode, time (ms)   64x64: level 1, 2    32x32: level 1, 2, 3      16x16: level 1, 2, 3, 4
DWT                   0.83, 1.04           0.88, 1.1, 1.15           0.98, 1.22, 1.28, 1.3
CM                    2.95, 2.95           2.93, 2.93, 2.93          3.31, 3.31, 3.31, 3.31
MR                    3.93, 3.93           3.61, 3.61, 3.61          3.56, 3.56, 3.56, 3.56
AE                    12.94, 11.99         12.48, 11.98, 11.9        12.46, 11.96, 11.9, 11.9
Total                 12.94, 11.99         12.48, 11.98, 11.9        12.46, 11.96, 11.9, 11.9
Current               19.18, 19.4          19.92, 20.14, 20.19       21.44, 21.69, 21.75, 21.77

9/7 mode, time (ms)   64x64: level 1, 2    32x32: level 1, 2, 3      16x16: level 1, 2, 3, 4
DWT                   0.89, 1.12           1.003, 1.25, 1.32         1.22, 1.53, 1.61, 1.63
CM                    2.95, 2.95           2.93, 2.93, 2.93          3.31, 3.31, 3.31, 3.31
MR                    3.93, 3.93           3.61, 3.61, 3.61          3.56, 3.56, 3.56, 3.56
AE                    13.22, 12.77         13.23, 12.78, 12.76       13.25, 12.8, 12.77, 12.77
Total                 13.22, 12.77         13.23, 12.78, 12.76       13.25, 12.8, 12.77, 12.77
Current               19.24, 19.48         20.04, 20.29, 20.36       21.68, 22, 22.08, 22.1
Based on theoretical calculation, the DWT, CM and MR modules can be processed simultaneously with AE. The resulting theoretical throughput improvement is demonstrated in Table 7.6.

7.6. Conclusion

Based on all the previous chapters, this chapter presents the system level JPEG2000 encoder integration on the proposed customised coarse-grained dynamically reconfigurable architecture. In Section 7.2, a modified scanning pattern for 2-D DWT is proposed. This modified scanning pattern is block-based, and an area of four codeblocks is taken as the processing unit each time. With this method, four codeblocks belonging to different DWT subbands are generated simultaneously, leading to a reduction in both execution time and required intermediate data storage. Section 7.3 presents the CM and AE integration for EBCOT. Since an ARM core has been selected to implement AE, the proposed architecture consisting of the RICA based architecture and an embedded ARM core is introduced. A shared DPRAM is utilised as the communication channel between the two parts of the proposed architecture. A memory relocation module is developed and placed between CM and AE in order to derive the required information from the codewords generated by CM. Communication between the RICA based architecture and the ARM core is realised directly through communication variables located at specified memory addresses. A three-stage pipeline is established based on the proposed communication scheme, by which CM and AE can be executed in parallel. Based on the evaluation of the system processing time, it is found that MR consumes approximately 33% of the overall execution time and becomes the system bottleneck. In this case, a Ping-Pong memory switching scheme is developed with which CM and MR can be executed at the same time as AE, leading to a further execution time reduction. Section 7.4 presents the system performance analysis and comparisons targeting both the throughput and power dissipation aspects. The execution times of the different modules, including 2-D DWT, CM, MR and AE, are listed and compared under various pre-conditions. The power estimation method for the RICA based architecture is presented, which is based on the number of occupied ICs and the average kernel frequency. Performance comparisons are made between the proposed architecture and various DSP&VLIW and coarse-grained architectures. It is seen that the proposed architecture offers a significant advantage in throughput. Meanwhile, although the power estimation method only provides rough results, the proposed architecture still demonstrates its power-saving nature clearly. Due to the inherent restrictions of the RICA paradigm and the available tool flow, the proposed architecture still has some limitations. In Section 7.5, the advantages and limitations of the proposed architecture are summarised and some possible future improvements are presented. These improvements are feasible with the latest tool flow associated with the RICA paradigm, which is under development. Based on these possible improvements, the proposed architecture's potential performance is calculated theoretically.

Chapter 8
Conclusions

8.1. Introduction

This chapter concludes this thesis. In Section 8.2, the contents of the individual chapters are reviewed. Section 8.3 lists some specific conclusions that can be drawn from the research work in this thesis.
Finally, in Section 8.4, some possible directions for future work are addressed.

8.2. Review of Thesis Contents

Chapter 2 provided the background knowledge and detailed algorithms of digital image processing technologies, especially demosaicing and JPEG2000. This chapter also gave a review of the existing literature and various research works related to this thesis. Chapter 3 described the newly emerging RICA paradigm, including its structure, its associated software tool flow and its possible optimisation approaches. Two of the author's initial works were presented as case studies, both of which demonstrated that RICA has great potential in terms of throughput, flexibility and power consumption when targeting different kinds of applications. In Chapter 4, a Freeman demosaicing engine on a RICA based architecture was proposed. The shifting-window based demosaicing engine was optimised by a data buffer rotating scheme, parallel processing and a pseudo median filter. An investigation of mapping the engine onto a dual-core RICA based architecture was performed. The simulation results demonstrate that the proposed demosaicing engine can provide 502fps and 862fps when processing a 648x432 image for the single-core and dual-core implementations respectively. Chapter 5 presented a lifting-based 2-D DWT engine for JPEG2000 on the RICA architecture. The 2-D DWT engine was optimised by Hardwired Floating Coefficient Multipliers and the SIMD based VO technique. The positives and negatives of the VO technique were discussed in detail in terms of both throughput and area occupation. The proposed 2-D DWT engine can reach up to 103.1 fps for a 1024x1024 image. Chapter 6 presented a JPEG2000 EBCOT implementation on a RICA based architecture. A novel PPA algorithm for CM was proposed, with the four primitive coding schemes optimised for the RICA based implementation. An ARM core was employed to implement AE instead of the RICA based architecture. Simulation results demonstrate that the proposed PPA algorithm provided better throughput compared with other popular algorithms, and the ARM based AE implementation showed good throughput. Chapter 7 presented the system-level implementation of the JPEG2000 encoder on the RICA based architecture. A block based 2-D DWT scanning pattern was proposed. A memory relocation module was designed and placed between CM and AE. The CX/D pairs relocated by MR were placed in a shared DPRAM, which could be modified to be either a 4-port RAM or a DPRAM with doubled capacity if the Ping-Pong memory switching mode is selected. Performance evaluations included execution time and estimated power consumption. The proposed JPEG2000 architecture provided outstanding performance in terms of both throughput and energy dissipation compared with various DSP&VLIW and CGRA based JPEG2000 solutions.

8.3. Novel Outcomes of the Research

This section presents a variety of novel outcomes, which stem from the research in this thesis. Most academic and industry efforts on digital image processing solutions have focused on using traditional platforms such as ASIC, FPGA and DSP. This thesis investigated novel coarse-grained dynamically reconfigurable architectures which are based on the newly emerging RICA paradigm, an area that has as yet been little explored. The results in Chapters 4, 5, 6 and 7 showed that, based on the proposed architecture, different digital image processing tasks can deliver high performance in terms of both throughput and energy dissipation.
This makes RICA paradigm based architectures very promising for future high performance systems designed for digital image processing applications such as JPEG2000. Moreover, since these image processing tasks cover algorithms of quite different natures, it is possible to predict a new algorithm's performance on a RICA based architecture by comparing and matching the new algorithm against some of these imaging tasks. In Chapter 4, the customisable nature of the RICA paradigm enabled the Freeman demosaicing engine to use data buffers to store intermediate data instead of traditional memory blocks. An important outcome was the proposed parallel demosaicing engine, based on an investigation of the hidden parallelism in the algorithm. Since the RICA paradigm supports independent instructions being executed in parallel, kernelisation of the complete demosaicing engine significantly accelerated the processing speed. When dealing with the median filter, which involves sorting operations, the pseudo median filter was considered to be a reasonable solution for RICA based applications as it does not introduce conditional branches, which would break the kernel. Moreover, mapping the demosaicing engine onto a dual-core RICA based architecture demonstrated the potential of building up a multi-core RICA based architecture for complicated applications. In Chapter 5, the RICA based architecture provided a good solution for 2-D DWT tasks. Again, due to the inherent tailorable nature of the RICA paradigm, the two DWT modes in JPEG2000 could be implemented with a generic architecture. Different from traditional DSPs, the CSD based FCMs in the 9/7 mode could be efficiently implemented on the RICA based architecture since the additions and shifting operations can be executed in parallel. The most important outcome is that the SIMD based VO technique can be employed to improve the computational resource utilisation. Simulation results demonstrate that the VO technique successfully improved the Throughput/Area ratio for the 9/7 DWT mode. More generally, the positives and negatives introduced by the VO technique were clearly clarified, which allows developers to choose a suitable solution for different tasks according to the nature of the algorithm and the application requirements. Chapter 6 presented the implementation of EBCOT, which is the most challenging module in JPEG2000. When looking into the EBCOT CM algorithm and the existing solutions, it was found that the current solutions were not suitable for a RICA based architecture as they require either frequent conditional branches or massive computational resources. In this case, the novel PPA solution for CM in EBCOT was developed specifically for RICA based applications. The computational resource required by the proposed PPA solution was almost half that of the traditional PPCM method, while the processing speed of PPA was actually higher than that of PPCM when mapped onto the RICA architecture. On the other hand, simulation results demonstrated that the RICA based architecture is not a good solution for AE, since the frequent branches strictly limit the performance, and this is the reason why an embedded ARM core was selected for the AE implementation. In conclusion, a RICA based architecture can provide good performance for computationally intensive applications with inherent parallelism, but may not be suitable for some simple but branch-intensive applications.
In Chapter 7, the system-level integration of the JPEG2000 encoder was presented, based on all the discussion and results of the previous chapters. Since the 2-D DWT is line-based while EBCOT is codeblock-based, the novel 4-codeblock based 2-D DWT scanning pattern was considered to be an efficient solution for JPEG2000, as the line delay between the 2-D DWT and EBCOT was eliminated. Meanwhile, the shared DPRAM provided a simple communication method between RICA and ARM compared with using data buses and DMA. Moreover, the Ping-Pong memory switching mode enabled a deeper pipeline between different modules, which is essential for reducing the overall execution time. Simulation results proved that the proposed architecture for JPEG2000 demonstrated outstanding performance compared with various DSP&VLIW and CGRA based applications. In addition, the power estimation method for RICA based architectures was introduced, which provides a possible approach for energy dissipation analysis.

Based on the work in this thesis, it is concluded that the RICA paradigm can provide good solutions for image processing applications. In this thesis, the RICA paradigm's potential and advantages for different imaging tasks were clearly investigated and evaluated. Various optimisation approaches for RICA based applications, including customisation, kernel construction, VO technique utilisation, parallel processing and hybrid architecture development, were performed and discussed. It was also demonstrated that it is possible to build up a multi-core RICA based architecture for complex applications. Meanwhile, performance comparisons of different imaging tasks between the RICA based architecture and other platforms were evaluated in detail. With the presented work, other developers can evaluate a given algorithm to estimate its performance on a RICA based architecture and decide on possible optimisation approaches. They can also obtain a rough idea of the advantages and disadvantages of their RICA based work compared with other solutions such as ASIC, FPGA and DSP&VLIWs.

8.4. Future Work

There are a few areas in which the work in this thesis can be further investigated. Some are listed below.

Short-term work

For the Freeman demosaicing engine, it is possible to utilise the VO technique to further optimise the median filter module. As discussed in Chapter 4, the utilisation of VO will reduce the number of seeking operations in the median filter. Although more additional logic resources and a more complex control scheme would be required, the utilisation of the VO technique is still worth considering as a way to further optimise the demosaicing engine.

When processing different images, the possible bit-depth increment of the 2-D DWT coefficients should be taken into account. With the current Lena image, the bit depth used for CM is sufficient. However, there might be an increment of 1 or 2 bits for other images, especially in the case where more than 3 levels of 2-D DWT are applied. In this case, the number of bit-level iterations in CM also needs to be increased.

For the 2-D DWT engine, the possible artifacts introduced by the block based scanning pattern should be considered. Although some papers present similar scanning patterns [99-100], they do not mention the possible artifacts introduced by modified scanning patterns, and this side effect is worth evaluating, especially when the engine is employed within a complete JPEG2000 encoder.
Long-term work

A full tool set which enables joint debugging and testing of the RICA based architecture and the embedded ARM core should be developed. Since there are only separate tool flows for simulating RICA and ARM based applications respectively, all the data communication between the RICA based architecture and the ARM core in this thesis was carried out manually, which requires massive labour and is extremely time consuming. With a full tool set, it would be possible to perform joint debugging and testing of the complete system; meanwhile, the rate-distortion control module and the PSNR calculation could be implemented and carried out respectively.

Once the full tool set is ready, more test images should be processed to assess the general performance of the proposed architecture. Although the 2-D DWT, CM and MR are not data sensitive (their processing time depends only on the amount of data), different images do lead to different AE execution times. Meanwhile, the possible bit-depth increment mentioned previously will also lead to more CM execution time.

The power dissipation of the proposed architecture should be evaluated in detail. Although the power estimation method presented in Chapter 7 provides a possible approach to estimate the internal dynamic power, it only yields a rough estimate that may differ from reality. Meanwhile, the shared DPRAM may contribute a significant part of the overall power consumption, and the memory controllers on both the RICA and ARM sides also need to be considered.

A multi-core solution can be considered for the complete JPEG2000 encoder implementation. Instead of extending the number of CM and MR pairs to three on a large scale single-core RICA architecture as discussed in Chapter 7, it is possible to employ three RICA cores to realise the three pairs of CM and MR with the support of the MRPSIM tool introduced in Chapter 4, in order to encode DWT coefficients belonging to different subbands simultaneously. Moreover, more ARM cores can also be embedded into the system, making it possible to have three AEs running in parallel correspondingly. The multi-core architecture is expected to provide much higher throughput compared with the current single-core implementation. Obviously, the energy dissipation will increase at the same time.

Appendix
JPEG2000 Encoding Standard

Tiling and DC Level Shifting

The first preprocessing step in the JPEG2000 standard is tiling, which partitions the original image into a number of rectangular non-overlapping blocks, termed tiles. Each tile has exactly the same colour components as the original image. Tile sizes can be arbitrary, up to the size of the entire original image. Generally, a large tile offers better visual quality in the reconstructed image, and the best case is to treat the entire image as one single tile (no tiling). However, a large tile also requires more memory space for processing. Typically, tiles of size 256x256 or 512x512 are considered to be popular choices for various implementations, based on the evaluation of cost, area and power consumption [27]. Originally, pixels in the input image are stored in the form of unsigned integers. For the purpose of mathematical computation, DC level shifting is essential to convert these pixels so that each of them has a dynamic range which is approximately centered around zero.
All pixels I_i(x,y) are DC level shifted by subtracting the same quantity 2^{s-1} to produce the DC level shifted samples I'_i(x,y) as follows [27]:

I_i'(x, y) = I_i(x, y) - 2^{s-1}    (9.1)

where s is the precision of the pixels.

Component Transformation

Component transformation is effective in reducing the correlation amongst multiple components in the image. Normally, the input image is considered to have three colour planes (R, G, B). The JPEG2000 standard supports two different transformations: (1) Reversible Colour Transformation (RCT) and (2) Irreversible Colour Transformation (ICT). RCT can be applied to both lossless and lossy compression, while ICT can only be used in the lossy scheme [27]. In the lossless mode with RCT, pixels can be exactly reconstructed by the inverse RCT. The forward and inverse transformations are given by:

Forward RCT:
Y_r = \left\lfloor \frac{R + 2G + B}{4} \right\rfloor    (9.2)
U_r = B - G    (9.3)
V_r = R - G    (9.4)

Inverse RCT:
G = Y_r - \left\lfloor \frac{U_r + V_r}{4} \right\rfloor    (9.5)
R = V_r + G    (9.6)
B = U_r + G    (9.7)

ICT is only applied for lossy compression because of the error introduced by using non-integer coefficients as weighting parameters in the transformation matrix [27]. Different from RCT, ICT uses YCrCb instead of YUV, in which Y is the luminance channel while Cr and Cb are the two chrominance channels. The transformation formulas are given by:

Forward ICT:
\begin{bmatrix} Y \\ C_r \\ C_b \end{bmatrix} = \begin{bmatrix} 0.299000 & 0.587000 & 0.114000 \\ 0.500000 & -0.418688 & -0.081312 \\ -0.168736 & -0.331264 & 0.500000 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}    (9.8)

Inverse ICT:
\begin{bmatrix} R \\ G \\ B \end{bmatrix} = \begin{bmatrix} 1.0 & 0.0 & 1.402000 \\ 1.0 & -0.344136 & -0.714136 \\ 1.0 & 1.772000 & 0.0 \end{bmatrix} \begin{bmatrix} Y \\ C_r \\ C_b \end{bmatrix}    (9.9)
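As a quick illustration of equations (9.1)-(9.7), the sketch below applies DC level shifting and the forward/inverse RCT to one RGB sample; because only integer arithmetic is used, the inverse transform recovers the shifted input exactly. This is an illustrative example, not code from the thesis implementation; the floor_div helper is introduced here only to obtain mathematical floor division for negative values.

```c
#include <stdio.h>

/* Floor division that also behaves correctly for negative operands. */
static int floor_div(int a, int b) {
    int q = a / b, r = a % b;
    return (r != 0 && ((r < 0) != (b < 0))) ? q - 1 : q;
}

int main(void) {
    const int s = 8;                              /* pixel precision (bits)      */
    int R = 200, G = 120, B = 40;                 /* unsigned input samples      */

    /* (9.1) DC level shift: subtract 2^(s-1). */
    int r = R - (1 << (s - 1)), g = G - (1 << (s - 1)), b = B - (1 << (s - 1));

    /* (9.2)-(9.4) forward RCT. */
    int Yr = floor_div(r + 2 * g + b, 4);
    int Ur = b - g;
    int Vr = r - g;

    /* (9.5)-(9.7) inverse RCT; reproduces r, g, b exactly (lossless path). */
    int g2 = Yr - floor_div(Ur + Vr, 4);
    int r2 = Vr + g2;
    int b2 = Ur + g2;

    printf("Yr=%d Ur=%d Vr=%d  ->  recovered shifted RGB: %d %d %d\n",
           Yr, Ur, Vr, r2, g2, b2);
    return 0;
}
```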
2-Dimensional Discrete Wavelet Transform

DWT is one of the key differences between JPEG2000 and the previous JPEG standard. It is the first decorrelation step in the JPEG2000 standard, decomposing a tile into a number of subbands at different resolution levels with both frequency and time information. Basically, wavelets are functions generated from one single function, termed the mother wavelet, by scaling and shifting in the time and frequency domains. If the mother wavelet is denoted by ψ(t), the other wavelets ψ_{a,b}(t) can be represented as

\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)    (9.10)

where a is the scaling factor and b represents the shifting parameter. Based on this definition of wavelets, the wavelet transform of a function f(t) can be mathematically represented by

W(a,b) = \int_{-\infty}^{\infty} \psi_{a,b}(t)\, f(t)\, dt    (9.11)

When targeting discrete signals, the DWT can be considered as convolving the input discrete signal with two filter banks, one low pass and the other high pass. The two output streams are then down-sampled by a factor of 2. The transforms are given by [27]

W_L(n) = \sum_{i=0}^{\tau_L - 1} h(i)\, f(2n - i)    (9.12)
W_H(n) = \sum_{i=0}^{\tau_H - 1} g(i)\, f(2n - i)    (9.13)

where τ_L and τ_H are the numbers of taps of the low-pass (h) and high-pass (g) filters. After the transform, the original input signal is decomposed into two subbands: a lower band and a higher band. Practically, the lower band can be further decomposed for different resolutions. The architecture is illustrated in Figure 9.1.

Figure 9.1 Discrete Wavelet Transform (low-pass and high-pass filtering of X(z) followed by downsampling by 2, applied recursively to the low-pass branch)

Figure 9.2 Multi-level 2-Dimensional DWT (step 1: horizontal transform of the original image; step 2: vertical transform, producing the LL, HL, LH and HH subbands and, at the next level, LL2, HL2, LH2 and HH2)

For digital image processing, it is essential to have a 2-dimensional DWT to perform the transformation of a 2-D image. The approach for the 2-D DWT is to apply a 1-D DWT in the horizontal direction first, and then another 1-D DWT along the vertical direction. After a 2-D transform, four subbands are generated, namely LL, LH, HL and HH. LL is a coarser version of the original input image, while LH, HL and HH are high-frequency subbands containing the detail information [27]. Normally, the LL subband can be recursively decomposed further by higher level 2-D DWT in order to obtain new subbands with multiple resolutions, such as LL2, LH2, HL2 and HH2 as illustrated in Figure 9.2.

Traditionally, DWT is implemented by convolution or FIR filter banks. These approaches may require a large amount of computational resources and memory storage, which should be avoided in embedded system applications. To solve this problem, a modified DWT architecture, termed the lifting-based architecture, is proposed in [25-26]. The main idea of the lifting-based DWT architecture is to break up both the high-pass and low-pass wavelet filter banks into a sequence of smaller filters that in turn can be converted into a sequence of upper and lower triangular matrices, turning the DWT into banded-matrix multiplications [27]. Figure 9.3 illustrates the lifting-based DWT architecture, where S_m(z) and T_m(z) are filter matrices and K is a constant.

Figure 9.3 Lifting-Based DWT (the input X(z) is split into even and odd samples, processed by the lifting steps S_m(z) and T_m(z), and scaled by 1/K and K to produce the L and H outputs)

The polyphase matrix of the lifting-based architecture can be realised as

P(z) = \prod_{m} \begin{bmatrix} 1 & T_m(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ S_m(z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}    (9.14)

For the JPEG2000 standard, two default wavelet filter schemes are employed, corresponding to the lossless and lossy modes respectively. In the lossless mode, the Le Gall (5,3) spline filter is adopted, which is formed by a 5-tap low-pass FIR filter and a 3-tap high-pass FIR filter. The corresponding polyphase matrix for the lifting-based DWT is given by

P_{(5,3)}(z) = \begin{bmatrix} 1 & 0 \\ -\frac{1+z^{-1}}{2} & 1 \end{bmatrix} \begin{bmatrix} 1 & \frac{1+z}{4} \\ 0 & 1 \end{bmatrix}    (9.15)

In the lossy mode, the Daubechies (9,7) biorthogonal spline filter is employed, which includes a 9-tap low-pass FIR filter and a 7-tap high-pass FIR filter. The corresponding polyphase matrix can be represented as

P_{(9,7)}(z) = \begin{bmatrix} 1 & \alpha(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \beta(1+z) & 1 \end{bmatrix} \begin{bmatrix} 1 & \gamma(1+z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \delta(1+z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}    (9.16)

where α = -1.586134342, β = -0.052980118, γ = 0.882911075, δ = 0.443506852 and K = 1.230174105. A detailed explanation of lifting-based DWT can be found in [5, 27].

Quantisation

Quantisation of the DWT coefficients is one of the main sources of information loss in the JPEG2000 encoder. In the lossy compression mode, all the DWT subbands are quantised in order to reduce the precision of the DWT coefficients and thereby aid compression [27]. The quantisation is performed by uniform scalar quantisation with a dead-zone around the origin. As illustrated in Figure 9.4, the step size of the dead-zone scalar quantiser is Δb and the width of the dead-zone is 2Δb. The formula of uniform scalar quantisation with a dead-zone is given by

q_b(i,j) = \mathrm{sign}\big(y_b(i,j)\big) \left\lfloor \frac{|y_b(i,j)|}{\Delta_b} \right\rfloor    (9.17)

where y_b(i,j) is the DWT coefficient in subband b and Δb is the quantisation step size for subband b. After quantisation, all quantised DWT coefficients are signed integers and are converted into sign-magnitude representation prior to entropy coding [27].

Figure 9.4 Dead-Zone Illustration of the Quantiser (uniform step size Δb with a dead-zone of width 2Δb around the origin)
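A minimal sketch of the dead-zone scalar quantiser of equation (9.17) is given below; the step size Δb used here is an arbitrary example value, not one prescribed by the standard, and the snippet is illustrative rather than part of the thesis implementation.

```c
#include <stdio.h>
#include <math.h>

/* Dead-zone scalar quantisation per (9.17): q = sign(y) * floor(|y| / delta_b). */
static int deadzone_quantise(double y, double delta_b) {
    int magnitude = (int)floor(fabs(y) / delta_b);
    return (y < 0) ? -magnitude : magnitude;
}

int main(void) {
    const double delta_b = 2.0;   /* example step size for one subband */
    const double coeffs[] = {-3.7, -1.9, -0.4, 0.0, 0.6, 1.9, 2.0, 5.1};

    for (unsigned i = 0; i < sizeof coeffs / sizeof coeffs[0]; ++i)
        printf("y=%5.1f -> q=%d\n", coeffs[i], deadzone_quantise(coeffs[i], delta_b));
    /* Coefficients with |y| < delta_b fall inside the dead-zone of width
       2*delta_b around the origin and quantise to zero. */
    return 0;
}
```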
Embedded Block Coding with Optimal Truncation

Physically, the quantised wavelet coefficients of each codeblock in each subband are compressed by the entropy encoder [27]. The complete entropy encoding in the JPEG2000 standard can be divided into two coding steps: Tier-1 coding and Tier-2 coding. For Tier-1 coding, the EBCOT [5] algorithm is adopted, which is composed of fractional bit-plane coding (Context Modeling) and binary arithmetic coding (Arithmetic Encoding). In Tier-1 coding, codeblocks are encoded separately at the bit level. Given that the precision of the quantised DWT coefficients is p, a codeblock is decomposed into p bit-planes which are then coded sequentially from the Most Significant Bit-plane (MSB) to the Least Significant Bit-plane (LSB). Each coefficient is divided into one sign bit and several magnitude bits. Context modeling is applied on each bit-plane of a codeblock to generate intermediate data in the form of pairs of Context and binary Decision (CX/D), while arithmetic encoding codes these CX/D pairs and generates the final compressed bit-stream.

Context Modeling

The EBCOT CM algorithm has been built to exploit symmetries and redundancies within and across bit-planes so as to minimise the statistics to be maintained and the coded bit-stream that it generates [27]. Before presenting a detailed illustration of the CM algorithm, several concepts need to be clarified, as follows:

Sign Array (χ): χ is a two-dimensional array representing the signs of the DWT coefficients in a codeblock. Each element χ[m,n] in χ represents the sign information of the corresponding sample y[m,n] in the codeblock as follows:

\chi[m,n] = \begin{cases} 1 & \text{if } y[m,n] < 0 \\ 0 & \text{otherwise} \end{cases}    (9.18)

Magnitude Array (ν): ν is a two-dimensional array consisting of unsigned integers. It has the same size as χ and as the corresponding codeblock. Each sample ν[m,n] in ν represents the absolute value of the corresponding DWT coefficient, which is given by

\nu^{p}[m,n] = \left| y^{p}[m,n] \right|    (9.19)

where p represents the p-th bit-plane.

Scanning Pattern: EBCOT has a fixed scanning pattern which is based on every four lines of coefficients, termed a stripe. The scanning pattern within a stripe is from top to bottom in a column and from left to right across columns, as illustrated in Figure 9.5 (a).

Significant State (δ): In EBCOT, each DWT coefficient has a state variable termed the significant state (δ), which is initialised to "0", the coefficient itself being considered insignificant at the beginning. When coding starts, this state variable indicates whether the first non-zero bit in the corresponding coefficient has been coded. If so, δ changes to "1" and keeps its value until coding finishes; meanwhile, the coefficient becomes significant. The procedure is illustrated in Figure 9.5 (b).

Figure 9.5 (a) Scanning Pattern of EBCOT (b) Significant State (the significant state switches from 0 to 1 when the first "1" magnitude bit of a coefficient, scanned from the MSB towards the LSB, is coded)

Neighbour Significant States: Most of the coding schemes in EBCOT utilise the samples around the current sample under processing, which are called neighbours. In total there are eight neighbours for each sample, divided into three categories, termed horizontal neighbours (H), vertical neighbours (V) and diagonal neighbours (D), as illustrated in Figure 9.6.

Figure 9.6 Illustration of One Pixel's Neighbours (horizontal H0/H1, vertical V0/V1 and diagonal D0-D3 around the current sample X)

Refinement State (γ): This state indicates whether the current sample has been coded by MRC. If so, the corresponding γ is set to "1", otherwise "0".
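The neighbour significance sums ΣH, ΣV and ΣD used throughout the coding passes can be illustrated with a small helper over a significance array. The sketch below is illustrative only (the array size and sample values are arbitrary), with out-of-codeblock neighbours treated as insignificant.

```c
#include <stdio.h>

#define ROWS 4
#define COLS 4

/* Read the significance state, treating out-of-bounds neighbours as 0. */
static int sig_at(const unsigned char s[ROWS][COLS], int m, int n) {
    return (m >= 0 && m < ROWS && n >= 0 && n < COLS) ? s[m][n] : 0;
}

/* Neighbour sums per the neighbourhood of Figure 9.6:
   H = left/right, V = up/down, D = the four diagonals. */
static void neighbour_sums(const unsigned char s[ROWS][COLS],
                           int m, int n, int *H, int *V, int *D) {
    *H = sig_at(s, m, n - 1) + sig_at(s, m, n + 1);
    *V = sig_at(s, m - 1, n) + sig_at(s, m + 1, n);
    *D = sig_at(s, m - 1, n - 1) + sig_at(s, m - 1, n + 1)
       + sig_at(s, m + 1, n - 1) + sig_at(s, m + 1, n + 1);
}

int main(void) {
    unsigned char sigma[ROWS][COLS] = {   /* example significance states */
        {0, 1, 0, 0},
        {1, 0, 0, 0},
        {0, 0, 1, 0},
        {0, 0, 0, 0},
    };
    int H, V, D;
    neighbour_sums(sigma, 1, 1, &H, &V, &D);
    printf("H=%d V=%d D=%d\n", H, V, D);   /* 1, 1, 1 for this example */
    return 0;
}
```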
Based on these concepts, CM codes each codeblock stripe by stripe, from the MSB to the LSB, separately. There are three coding passes in CM. Each bit is coded by exactly one of the three passes, without any overlap with the other passes [27]. These three coding passes are as follows:

Significant Propagation Pass (SPP): This coding pass is used to code insignificant bits with one or more significant neighbours.

Magnitude Refinement Pass (MRP): This coding pass processes bits which have already become significant.

Clean Up Pass (CUP): Coefficient bits that have not been coded by SPP or MRP are coded by this coding pass.

These three coding passes are executed in order from SPP to CUP. Four primitive coding schemes are employed by these three coding passes to generate the coded CX/D pairs; they are defined as follows:

Zero Coding (ZC): In zero coding, CX is generated from three pre-defined Look-Up Tables (LUTs) for the different DWT subbands (LH, HL and HH). The outputs of these LUTs depend on the significant states of the neighbours of the current sample, as illustrated in Table 9.1 [5]. The decision bit is equal to the current magnitude bit of the coefficient being coded. This coding scheme is used both in SPP and CUP.

Table 9.1 Contexts for the Zero Coding Scheme

LL and LH subbands        HL subband               HH subband
∑H    ∑V    ∑D            ∑H    ∑V    ∑D           ∑(H+V)    ∑D         CX
2     x     x             x     2     x            x         ≥3         8
1     ≥1    x             ≥1    1     x            ≥1        2          7
1     0     ≥1            0     1     ≥1           0         2          6
1     0     0             0     1     0            ≥2        1          5
0     2     x             2     0     x            1         1          4
0     1     x             1     0     x            0         1          3
0     0     ≥2            0     0     ≥2           ≥2        0          2
0     0     1             0     0     1            1         0          1
0     0     0             0     0     0            0         0          0

Sign Coding (SC): This coding scheme is employed to code the sign bit of each DWT coefficient, and is executed only once, when the first "1" bit in the coefficient is coded, that is, as soon as the coefficient becomes significant. Instead of directly using the H, V and D values defined previously, SC generates CX from two new states, termed the horizontal contribution and the vertical contribution, which depend on the significance and positive/negative states of the H and V neighbours. The decision bit in SC is determined by an XOR bit which is also generated according to the H/V contributions. The referenced LUT and the formula for obtaining the decision bit [5] are given in Table 9.2 and equation (9.20). This coding scheme is usually invoked within SPP and CUP.

Decision = signbit \oplus xorbit    (9.20)

Table 9.2 H/V Contributions and Contexts in the Sign Coding Scheme

H0/V0                   H1/V1                   H/V Contribution
Significant, positive   Significant, positive   1
Significant, negative   Significant, positive   0
Insignificant           Significant, positive   1
Significant, positive   Significant, negative   0
Significant, negative   Significant, negative   -1
Insignificant           Significant, negative   -1
Significant, positive   Insignificant           1
Significant, negative   Insignificant           -1
Insignificant           Insignificant           0

H contribution   V contribution   Context   XOR bit
1                1                13        0
1                0                12        0
1                -1               11        0
0                1                10        0
0                0                9         0
0                -1               10        1
-1               1                11        1
-1               0                12        1
-1               -1               13        1

Magnitude Refinement Coding (MRC): This coding scheme is used exclusively in MRP. In MRC, the context is determined by whether the current bit is the first refinement bit, that is, the first bit of the corresponding coefficient to be coded by MRP. The significant states of the current bit's eight neighbours are also taken into consideration. The referenced LUT is given in Table 9.3 [5]. The decision bit is simply equal to the magnitude bit.

Table 9.3 Contexts of the Magnitude Refinement Coding Scheme

∑H+∑V+∑D (summation of significant states)   First refinement bit   Context
x (don't care)                                No                     16
≥1                                            Yes                    15
0                                             Yes                    14
Run Length Coding (RLC): This coding scheme is only used in CUP. It generates one or more CX/D pairs by coding from one to four consecutive bits within a stripe [27]. Generally, the number of CX/D pairs generated is determined by where the first "1" bit is located in the corresponding stripe column. There are two contexts adopted in RLC: 17 and 18. When all four bits in the stripe column are zero, a single CX/D pair (17,0) is generated. In the case where one or more "1" bits exist, a CX/D pair (17,1) is generated first, indicating that this is a non-zero stripe column, and then another two CX/D pairs (18, 0/1) and (18, 0/1) are produced, in which the two decision bits represent the location of the first "1" bit in this four-bit non-zero stripe column.

In the complete CM coding progress, these three coding passes are applied to each bit-plane of a codeblock from the MSB to the LSB. As the first bit-plane (MSB) has no significant coefficients at the beginning, only CUP is applied to it. After the first bit-plane is finished, the next bit-plane is taken up and the three coding passes scan and code it in the order SPP, MRP and CUP, with the scanning pattern illustrated in Figure 9.5 (a). Figure 9.7 illustrates the detailed EBCOT working flowchart.

Figure 9.7 EBCOT Tier-1 Context Modeling Working Flowchart (SPP, MRP and CUP applied in turn to each bit-plane; the SPP and MRP stages can be eliminated when coding the first MSB bit-plane)

When a stripe starts, SPP checks the significant states of the current bit itself as well as its eight neighbours, and only codes the current bit by ZC when it is insignificant but has one or more significant neighbours. After ZC, SC is also applied if needed. MRP is applied after SPP finishes the current bit-plane. It checks whether the bit itself is significant and has not been coded by SPP; if so, MRC is applied. When MRP finishes the entire bit-plane, CUP is called. It first checks whether there is any non-zero δ[m,n] in the current stripe column and all its neighbours, and whether there is any coefficient which has already been coded by SPP or MRP. In both cases, if there is any, CUP only codes the bits that have not been coded by SPP or MRP. If both cases are false, CUP applies RLC to this stripe column. RLC only runs at the beginning of a stripe column, when all four consecutive bits in the column as well as all of their adjacent neighbours are insignificant. Meanwhile, RLC terminates at the first '1' bit in the current stripe column, and the remaining bits in the stripe column are coded by ZC and SC. Being the final coding pass in CM, CUP continues coding the current bit-plane until its end, then the CM coding engine moves to the next bit-plane and starts with SPP again.
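As an illustration of two of the primitives just described, the sketch below selects the Zero Coding context for the LL/LH subbands following Table 9.1 and emits the RLC CX/D pairs for one stripe column. It is an illustrative transcription of the tables and rules above rather than the thesis implementation, and it assumes the two RLC position bits are emitted most significant bit first.

```c
#include <stdio.h>

/* Zero Coding context for the LL and LH subbands, transcribed from Table 9.1
   (the HL column follows the same pattern with H/V swapped, and the HH column
   uses (H+V) and D). H, V, D are the neighbour significance sums. */
static int zc_context_ll_lh(int H, int V, int D) {
    if (H == 2)      return 8;
    if (H == 1)      return (V >= 1) ? 7 : (D >= 1) ? 6 : 5;
    if (V == 2)      return 4;         /* from here on H == 0 */
    if (V == 1)      return 3;
    if (D >= 2)      return 2;
    return (D == 1) ? 1 : 0;
}

/* Run Length Coding for one stripe column (four magnitude bits, top to
   bottom): (17,0) for an all-zero column, otherwise (17,1) followed by two
   decisions on context 18 giving the position of the first '1' bit.
   Coding of the remaining bits by ZC/SC is not shown. */
static void rlc_column(const int bit[4]) {
    int first = -1;
    for (int i = 0; i < 4 && first < 0; ++i)
        if (bit[i]) first = i;

    if (first < 0) { printf("(17,0)\n"); return; }
    printf("(17,1) (18,%d) (18,%d)\n", (first >> 1) & 1, first & 1);
}

int main(void) {
    printf("ZC context (H=1,V=0,D=0): %d\n", zc_context_ll_lh(1, 0, 0));  /* 5 */
    printf("ZC context (H=0,V=1,D=3): %d\n", zc_context_ll_lh(0, 1, 3));  /* 3 */

    int col_a[4] = {0, 0, 0, 0};
    int col_b[4] = {0, 0, 1, 0};
    rlc_column(col_a);   /* all-zero column    -> (17,0)                 */
    rlc_column(col_b);   /* first '1' at row 2 -> (17,1) (18,1) (18,0)   */
    return 0;
}
```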
Arithmetic Encoder

CM in EBCOT provides a sequence of CX/D pairs as the input to the following Arithmetic Encoder, also termed the MQ-coder. AE is context-based adaptive and has been used in JBIG2 [5]. It employs a probability model with a More Probable Symbol (MPS) and a Less Probable Symbol (LPS). The basic idea of AE is to map a CX onto an MPS or an LPS together with its probability estimate. Assuming the probability estimate of an LPS is Qe, the probability estimate of an MPS can then be represented as 1-Qe. For an interval A, both estimates need to be scaled by the factor A. Normally in JPEG2000 the value of A is assumed to stay close to 1, so the subintervals of the LPS and MPS can be approximated by Qe and A-Qe respectively. Accordingly, the MPS and LPS are assigned the subintervals [0, A-Qe) and [A-Qe, A) respectively. During the coding process, the subintervals for both MPS and LPS are updated by adjusting the interval's upper and lower bounds. If the lower bound of the interval is denoted by C, the bound updating can be represented by

MPS coded:  C = C + Qe,  A = A - Qe
LPS coded:  C unchanged,  A = Qe

When performing the updating, an important issue that may occur is termed interval inversion [27]. It happens when the MPS subinterval becomes smaller than the LPS subinterval as a result of the bound updating, which means the LPS is actually occurring more frequently than the MPS. In this case, the two subintervals are inverted and reassigned in order to ensure that the subinterval for the LPS always stays lower than that for the MPS. In the JPEG2000 standard, the actual value of A is always maintained within the range 0.75 ≤ A < 1.5. Whenever the value of A drops below 0.75 during the coding process, it is doubled (repeatedly, if necessary) to make sure A is greater than 0.75, which is termed renormalisation. Meanwhile, the value of C is also doubled whenever renormalisation of A is performed, in order to keep the two synchronised [27]. The probability value (Qe) and the probability estimation/mapping process are provided by the JPEG2000 standard as an LUT with four fields: Qe, Next MPS (NMPS), Next LPS (NLPS) and Switch, which are listed in Table 9.4. Another two LUTs are required in order to hold the index and state associated with Table 9.4; they are also provided by the standard, termed I(CX) and MPS(CX), and are listed in Table 9.5. Here, I(CX) is the index for the current CX, which is looked up and used as the index into Table 9.4. MPS(CX) specifies the sense (0 or 1) of the MPS of CX, which is initialised to zero and can be updated during the coding process. Given I(CX) and MPS(CX), Qe(I(CX)) provides the probability value, NMPS(I(CX)) or NLPS(I(CX)) indicates the next index after an MPS or LPS renormalisation, and SWITCH(I(CX)) is a flag used to indicate whether a change of the MPS(CX) sense is required [27].
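A simplified sketch of the MPS/LPS coding steps just described is given below: interval update with conditional exchange, renormalisation, and the probability-state transition through NMPS, NLPS and Switch. Only a few rows of the Qe table (Table 9.4, below) are reproduced as an excerpt, and BYTEOUT/carry handling is omitted, so this is an illustration of the update rules rather than a complete MQ encoder.

```c
#include <stdio.h>
#include <stdint.h>

typedef struct { uint16_t Qe; uint8_t NMPS, NLPS, Switch; } QeRow;

static const QeRow qe_table[] = {            /* excerpt of Table 9.4 (rows 0-6);
                                                a real encoder needs all 47 rows */
    {0x5601, 1, 1, 1}, {0x3401, 2, 6, 0}, {0x1801, 3, 9, 0},
    {0x0AC1, 4, 12, 0}, {0x0521, 5, 29, 0}, {0x0221, 38, 33, 0},
    {0x5601, 7, 6, 1},
};

typedef struct {
    uint32_t A, C;           /* interval size and lower bound                  */
    uint8_t  I[19];          /* I(CX): probability state index per context     */
    uint8_t  MPS[19];        /* MPS(CX): current MPS sense per context         */
} MQEnc;

static void renorm(MQEnc *e) {               /* keep A >= 0x8000 (i.e. 0.75)   */
    do { e->A <<= 1; e->C <<= 1; } while ((e->A & 0x8000) == 0);
}

static void encode(MQEnc *e, int cx, int d) {
    uint32_t Qe = qe_table[e->I[cx]].Qe;
    if (d == e->MPS[cx]) {                   /* CODEMPS                        */
        e->A -= Qe;
        if ((e->A & 0x8000) == 0) {
            if (e->A < Qe) e->A = Qe; else e->C += Qe;   /* conditional exch.  */
            e->I[cx] = qe_table[e->I[cx]].NMPS;
            renorm(e);
        } else {
            e->C += Qe;
        }
    } else {                                 /* CODELPS                        */
        e->A -= Qe;
        if (e->A < Qe) e->C += Qe; else e->A = Qe;       /* conditional exch.  */
        if (qe_table[e->I[cx]].Switch) e->MPS[cx] ^= 1;  /* change MPS sense   */
        e->I[cx] = qe_table[e->I[cx]].NLPS;
        renorm(e);
    }
}

int main(void) {
    MQEnc e = { 0x8000, 0, {0}, {0} };       /* A = 0.75, C = 0, all CX reset  */
    encode(&e, 0, 0);                        /* MPS for context 0              */
    encode(&e, 0, 0);                        /* another MPS                    */
    encode(&e, 0, 1);                        /* one LPS (final index may point
                                                past this excerpt)             */
    printf("A=0x%04X C=0x%08X I(0)=%u MPS(0)=%u\n",
           (unsigned)e.A, (unsigned)e.C, (unsigned)e.I[0], (unsigned)e.MPS[0]);
    return 0;
}
```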
Table 9.4 Qe and Estimation LUT

Index   Qe       NMPS   NLPS   Switch        Index   Qe       NMPS   NLPS   Switch
0       0x5601   1      1      1             24      0x1C01   25     22     0
1       0x3401   2      6      0             25      0x1801   26     23     0
2       0x1801   3      9      0             26      0x1601   27     24     0
3       0x0AC1   4      12     0             27      0x1401   28     25     0
4       0x0521   5      29     0             28      0x1201   29     26     0
5       0x0221   38     33     0             29      0x1101   30     27     0
6       0x5601   7      6      1             30      0x0AC1   31     28     0
7       0x5401   8      14     0             31      0x09C1   32     29     0
8       0x4801   9      14     0             32      0x08A1   33     30     0
9       0x3801   10     14     0             33      0x0521   34     31     0
10      0x3001   11     17     0             34      0x0441   35     32     0
11      0x2401   12     18     0             35      0x02A1   36     33     0
12      0x1C01   13     20     0             36      0x0221   37     34     0
13      0x1601   29     21     0             37      0x0141   38     35     0
14      0x5601   15     14     1             38      0x0111   39     36     0
15      0x5401   16     14     0             39      0x0085   40     37     0
16      0x5101   17     15     0             40      0x0049   41     38     0
17      0x4801   18     16     0             41      0x0025   42     39     0
18      0x3801   19     17     0             42      0x0015   43     40     0
19      0x3401   20     18     0             43      0x0009   44     41     0
20      0x3001   21     19     0             44      0x0005   45     42     0
21      0x2801   22     19     0             45      0x0001   45     43     0
22      0x2401   23     20     0             46      0x5601   46     46     0
23      0x2201   24     21     0

Table 9.5 LUT for I(CX) and MPS(CX)

CX        0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18
I(CX)     4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3   46
MPS(CX)   all initialised to zero

Table 9.6 A and C Register Structure

32-bit register (MSB to LSB)
C:   0000 cbbbbbbbb sss xxxxxxxxxxxxxxxx
A:   0000 0000 0000 0000 aaaaaaaaaaaaaaaa

a: fractional bits in A holding the current interval value
x: fractional bits in C
s: spacer bits which provide useful constraints on carry-over
b: bits for ByteOut
c: carry bit

Two 32-bit registers, A and C, are utilised by AE, whose structures are given in Table 9.6 [5]. A stands for the total interval and C indicates the lower bound of the interval space. For initialisation, A is set to 0x00008000, which represents 0.75 and indicates the initial probability interval space, while C is initialised to 0x00000000. The top-level flowchart of the Arithmetic Encoder is illustrated in Figure 9.8, and the detailed architectures of the key sub-modules are illustrated in Figure 9.9.

Figure 9.8 Top-Level Flowchart for Arithmetic Encoder (read CX and D, code an MPS or an LPS depending on whether D equals MPS(CX), and FLUSH when finished; the MPS and LPS coding paths include the RENORME and BYTEOUT sub-modules)

Figure 9.9 Detailed Architectures of the Key Sub-modules in Arithmetic Encoder (CODEMPS, CODELPS, RENORME and BYTEOUT; CT is a counter for the number of shifts applied to A and C, BP is the compressed data buffer pointer and B is the byte pointed to by BP)

Tier-2 and File Formatting

After the EBCOT Tier-1 encoder, each coding pass constitutes an atomic code unit, termed a chunk. These chunks can be grouped into quality layers and can be transmitted in any order, provided that chunks belonging to the same codeblock are transmitted in their relative order [101]. In the JPEG2000 standard, the EBCOT Tier-2 encoder is mainly used to organise the previously compressed bit-stream, which is partitioned into packets containing header information in addition to the bit-stream itself.
The packet header includes the inclusion information, the length of the codewords, the zero bit-plane information and the number of coding passes. The basic coding scheme employed in the Tier-2 encoder is termed Tag-Tree coding [5], which is utilised to code the inclusion information and the zero bit-plane information. A Tag-Tree is a way to represent a two-dimensional array of non-negative integers in a hierarchical way [27]. Taking an original 2-D array of data symbols (the highest level) with size 6x3 as an example, as illustrated in Figure 9.10, every four nodes (data symbols), or fewer if the nodes are on the boundary, are represented by a parent node, which sits at the next lower level and is equal to the minimum value of its children. This kind of representation continues until a single parent node for all child nodes, called the root node, is reached at the lowest level.

The coding process starts at the root node, with an initial value of zero. If the current value is less than the root node, it is incremented by 1 and a "0" is output. When the incremented value is equal to the root node, a "1" is output, which means the root node is coded. After that, the coding engine moves to one of its child nodes at the second lowest level, and the same coding process is performed node by node and level by level until all the nodes at the highest level are coded. In a Tag-Tree, nodes at higher levels cannot be encoded until their parent nodes at lower levels are encoded [27]. In this way, each node is coded as a run of 0's followed by a '1'.

Figure 9.10 Tag Tree Encoding Procedure. In the illustrated example, the root node q0(0,0) = 1 is coded first, producing the output 01, and then its child node q1(0,0) = 1 is coded. The coding procedure continues to the highest level node q3(0,0) = 1, whose final coded bitstream is 01111. When q3(0,0) is finished, the coding engine moves to q3(1,0) = 3 and the fractional bitstream 001 is generated for it. When coding q3(2,0) = 2, its parent node q2(1,0) has not yet been coded, so the coding engine must code the parent first, generating "1"; after that, q3(2,0) can be coded and its final coded bitstream is 101. The rest of the data array is coded in the same way.
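The Tag-Tree procedure described above can be sketched for a tiny hypothetical two-level tree (a 2x2 array of leaves under a single root); the 6x3 example of Figure 9.10 works the same way, only with more levels. The leaf values below are arbitrary illustration values, not those of the figure.

```c
#include <stdio.h>

/* Each node carries its value, its running threshold t and a coded flag;
   a node is emitted as a run of '0's followed by a '1'. */
typedef struct { int value; int t; int coded; } Node;

/* Emit the bits for one node, given the lower bound inherited from its
   already-coded ancestors; returns the updated lower bound (= node value). */
static int code_node(Node *n, int low) {
    if (n->t < low) n->t = low;
    if (!n->coded) {
        while (n->t < n->value) { putchar('0'); n->t++; }
        putchar('1');
        n->coded = 1;
    }
    return n->value;
}

int main(void) {
    /* Hypothetical leaf values; the root holds the minimum of its children. */
    Node leaves[4] = { {1, 0, 0}, {3, 0, 0}, {2, 0, 0}, {4, 0, 0} };
    Node root      = { 1, 0, 0 };

    for (int i = 0; i < 4; ++i) {
        int low = code_node(&root, 0);       /* parent coded first (only once) */
        code_node(&leaves[i], low);          /* then the leaf itself           */
        printf("  <- codeword for leaf %d (value %d)\n", i, leaves[i].value);
    }
    /* Produces 011, 001, 01 and 0001 for the four leaves respectively. */
    return 0;
}
```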
The information represented in the Tier-2 encoder can be summarised as follows:

Zero-length packet: One bit indicating whether a packet is zero-length or not.

Inclusion information: Encoded by a separate Tag-Tree. The value in this Tag-Tree is the number of the layer in which this codeblock is first included.

Number of zero bit-planes: Indicates how many zero bit-planes are included in this codeblock; encoded by another separate Tag-Tree.

Number of coding passes included: Encoded by the specified codewords listed in Table 9.7.

Length of the bit-stream from the current codeblock: Represented by a certain number of bits given by

bits = LBlocks + \lfloor \log_2(\text{number of coding passes}) \rfloor    (9.21)

where LBlocks is a state variable with an initial value of 3. In the case when the number of bits given by (9.21) is not sufficient to represent the bit-stream length, additional bits can be added together with a prefix, known as the codeblock codeword indicator. If k additional bits are required to represent the bit-stream length, the codeblock codeword indicator comprises k "1"s followed by a "0".

Table 9.7 Codewords for Number of Coding Passes

No. of Coding Passes   Codeword
1                      0
2                      10
3                      1100
4                      1101
5                      1110
6-36                   1111 00000 - 1111 11110
37-164                 1111 11111 0000 000 - 1111 11111 1111 111

The coding process of the EBCOT Tier-2 encoder can be summarised as follows:

If packet not empty
    Code non-empty packet indicator (1 bit)
    For each subband
        For each codeblock in this subband
            Code inclusion information (Tag-Tree or 1 bit)
            If first inclusion of codeblock
                Code number of zero bit-planes (Tag-Tree)
            Code number of new coding passes
            Code codeword length indicator
            Code length of codeword
        End
    End
Else
    Code empty packet indicator (1 bit)
End

References

[1] Wikipedia. http://en.wikipedia.org/wiki/Image_compression. 2011.
[2] Wikipedia. http://en.wikipedia.org/wiki/Tagged_Image_File_Format.
[3] Wikipedia. http://en.wikipedia.org/wiki/JPEG.
[4] GIF. http://en.wikipedia.org/wiki/GIF.
[5] JPEG2000 Committee, JPEG2000 Part I Final Committee Draft Version 1.0, ISO/IEC JTC1/SC29/WG1 N1646R.2000
[6] S. Khawam, I. Nousias, M. Milward, Y. Yi, M. Muir and T. Arslan, The Reconfigurable Instruction Cell Array. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 16(1): pp. 75-85.2008
[7] C. Brunelli, F. Garzia and J. Nurmi, A Coarse-Grained Reconfigurable Architecture for Multimedia Applications Featuring Subword Computation Capabilities. The EUROMICRO Journal of Systems Architecture. 56(1): pp. 21-32.2008
[8] http://en.wikipedia.org/wiki/digital_image_processing.
[9] JBIG, http://en.wikipedia.org/wiki/JBIG.
[10] JBIG2, http://en.wikipedia.org/wiki/JBIG2.
[11] B. E. Bayer, Colour Imaging Array, in U. S. Patent.1976
[12] http://en.wikipedia.org/wiki/Demosaicing. 2010.
[13] W. Lu and Y. P. Tan, Colour Filter Array Demosaicing: New Method and Performance Measures. IEEE Transaction of Image Processing. 12: pp. 1194-1210.2003
[14] D. R. Cok, Signal Processing Method and Apparatus for Producing Interpolated Chrominance Values in a Sampled Colour Image Signal, in U. S. Patent, No. 4642678.1987
[15] R. Ramanath, W. E. Snyder and G. L. Bilbro, Demosaicing Methods for Bayer Colour Arrays. Journal of Electronic Imaging. 11: pp. 306-315.2002
[16] W. T. Freeman, Median Filter for Reconstructing Missing Colour Samples, in U. S. Patent, No. 4724395.1988
[17] C. A. Laroche and M. A. Prescott, Apparatus and Method of Adaptively Interpolating a Full Colour Image Utilizing Luminance Gradients, in U. S. Patent, No. 5373322.1994
[18] J. F. Hamilton and J. E. Adams, Adaptive Colour Plane Interpolation in Single Sensor Colour Electronic Camera, in U. S. Patent, No. 5629734.1997
[19] R. Kimmel, Demosaicing: Image Reconstruction from CCD Samples. IEEE Transaction of Image Processing. 8: pp. 1221-1228.1999
[20] A. Lukin and D. Kubasov, An Improved Demosaicing Algorithm, in Graphicon Conference.2004
[21] P. Tsai, T. Acharya and A. Ray, Adaptive Fuzzy Color Interpolation. Journal of Electronic Imaging. 11.2002
[22] W. M. Lu and Y. P. Tan, Colour Filter Array Demosaicing: New Method and Performance Measures. IEEE Transaction on Image Processing. 12: pp. 1194-1210.2003
[23] G. Zapryanov and I.
Nikolova, Comparative Study of Demosaicing Algorithms for Bayer and Pseudo-Random Bayer Color Filter Arrays, in International Scientific Conference Computer Science, pp. 133-139, 2008.
[24] M. D. Adams, The JPEG 2000 Still Image Compression Standard, ISO/IEC JTC 1/SC 29/WG 1 N 2412, 2002.
[25] I. Daubechies and W. Sweldens, Factoring Wavelet Transforms into Lifting Steps, The Journal of Fourier Analysis and Applications, 4: pp. 247-269, 1998.
[26] W. Sweldens, The New Philosophy in Biorthogonal Wavelet Constructions, in Proceedings of SPIE, pp. 68-79, 1995.
[27] T. Acharya and P. Tsai, JPEG2000 Standard for Image Compression: Concepts, Algorithms and VLSI Architectures, Wiley-Interscience, 2004.
[28] STMicroelectronics, 5 Megapixel Mobile Imaging Processor Data Brief, 2007.
[29] NXP Semiconductors, NXP Nexperia Mobile Multimedia Processor PNX4103. Available from: http://thekef.free.fr/CV/PNX4103.pdf.
[30] Analog Devices, Wavescale Video Codec: ADV212 Datasheet, 2010.
[31] Barco, BA110 HD/DCI JPEG2000 Encoder Factsheet, 2008.
[32] intoPIX, RB5C634A Technical Specification Outline (JPEG2000 Encoder), 2005.
[33] ASICFPGA, Bayer CFA Interpolation Core. Available from: http://www.asicfpga.com/site_upgrade/asicfpga/isp/interpolation1.html.
[34] G. L. Jair, A. A. Miguel and W. V. Julio, A Digital Real Time Image Demosaicking Implementation for High Definition Video Cameras, in Robotics and Automotive Mechanics Conference, pp. 565-569, 2008.
[35] Xilinx, http://www.xilinx.com/products/ipcenter/JPEG2K_E.htm, 2010.
[36] J. Guo, C. Wu, Y. Li, K. Wang and J. Song, Memory-Efficient Architecture Including DWT and EC for JPEG2000, in IEEE International Conference on Solid-State and Integrated-Circuit Technology, pp. 2192-2195, 2008.
[37] M. Gangadhar and D. Bhatia, FPGA Based EBCOT Architecture for JPEG2000, Microprocessors and Microsystems, pp. 363-373, 2005.
[38] H. B. Damecharla, K. Varma, J. E. Carletta and A. E. Bell, FPGA Implementation of a Parallel EBCOT Tier-1 Encoder that Preserves Coding Efficiency, in Proceedings of the 16th ACM Great Lakes Symposium on VLSI, pp. 266-271, 2006.
[39] BroadMotion, BroadMotion JPEG2000 Codecs for Combined TI DSP-Altera FPGA Platform, 2006.
[40] Silicon Hive, ISP2000 Processors Enable C-Programmable Image Signal Processing SoCs, 2009.
[41] Philips, TM-1300 Media Processor Data Book, 2000.
[42] T. H. Tsai, Y. N. Pan and L. T. Tsai, DSP Platform-Based JPEG2000 Encoder with Fast EBCOT Algorithm, in Proceedings of the SPIE, pp. 48-57, 2004.
[43] Texas Instruments, TMS320C6414T/15T/16T Fixed-Point Digital Signal Processors, 2009.
[44] Texas Instruments, TMS320C6455 Fixed-Point Digital Signal Processor, 2011.
[45] C. C. Liu and H. M. Hang, Acceleration and Implementation of JPEG2000 Encoder on TI DSP Platform, in IEEE International Conference on Image Processing, pp. 329-332, 2007.
[46] Q. Liu and G. Ren, The Real Time Coding of JPEG2000 Based on TMS320C6455, in IEEE International Conference on Computer Application and System Modeling, pp. 503-507, 2010.
[47] Analog Devices, BLACKFIN Embedded Processor ADSP-BF535, 2004.
[48] Kiran K. S., Shivaprakash H., Subrahmanya M. V., Sundeep Raj and Suman David S., Implementation of JPEG2000 Still Image Codec on BLACKFIN (ADSP-BF535) Processor, in International Conference on Signal Processing, pp. 804-807, 2004.
[49] Analog Devices, BLACKFIN Embedded Processor ADSP-BF561, 2009.
[50] P. Zhou, Y. G. Zhao and J. Zhou, The JPEG2000 Compression Algorithm Based on Blackfin561: Implementation and Optimization, 2009. Available from: http://electronics-tech.com/the-jpeg2000-compression-algorithm-based-on-blackfin561-implementation-and-optimization/.
[51] M. Hashimoto, K. Matsuo and A. Koike, JPEG2000 Encoder for Reducing Tiling Artifacts and Accelerating the Coding Process, in IEEE International Conference on Image Processing, pp. 645-648, 2003.
[52] S. Smorfa and M. Olivieri, Cycle-Accurate Performance Evaluation of Parallel JPEG2000 on a Multiprocessor System-on-Chip Platform, in IEEE Conference on Industrial Electronics, pp. 3385-3390, 2006.
[53] J. C. Chen and S. Y. Chien, CRISP: Coarse-Grained Reconfigurable Image Stream Processor for Digital Still Cameras and Camcorders, IEEE Transactions on Circuits and Systems for Video Technology, 18: pp. 1223-1236, 2008.
[54] K. Deguchi, S. Abe, M. Suzuki, K. Anjo, T. Awashima and H. Amano, Implementing Core Tasks of JPEG2000 Encoder on the Dynamically Reconfigurable Processor, in International Conference on Architecture of Computing Systems, 2005.
[55] M. Motomura, A Dynamically Reconfigurable Processor Architecture, in Microprocessor Forum, 2002.
[56] H. Parizi, A. Niktash, N. Bagherzadeh and F. Kurdahi, MorphoSys: A Coarse Grain Reconfigurable Architecture for Multimedia Applications, in The Euro-Par Conference, pp. 844-848, 2002.
[57] A. Abnous and C. Christensen, Design and Implementation of the TinyRISC Microprocessor, Microprocessors and Microsystems, 16: pp. 187-194, 1992.
[58] B. Mei, S. Vernalde, D. Verkest, H. D. Man and R. Lauwereins, ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix, in The Conference on Field-Programmable Logic and Applications, pp. 61-70, 2003.
[59] M. Hartmann, V. Pantazis, T. V. Aa, M. Berekovic, C. Hochberger and B. Sutter, Still Image Processing on Coarse-Grained Reconfigurable Array Architectures, in IEEE Workshop on ESTIMedia, pp. 67-72, 2007.
[60] Y. Yi, I. Nousias, M. Milward, S. Khawam, T. Arslan and I. Lindsay, System-Level Scheduling on Instruction Cell Based Reconfigurable Systems, in The Conference on Design, Automation and Test in Europe, pp. 381-386, 2006.
[61] A. O. El-Rayis, X. Zhao, T. Arslan and A. T. Erdogan, Dynamically Programmable Reed Solomon Processor with Embedded Galois Field Multiplier, in International Conference on ICECE Technology (FPT), pp. 269-272, 2008.
[62] A. O. El-Rayis, X. Zhao, T. Arslan and A. T. Erdogan, Low Power RS Codec Using Cell-Based Reconfigurable Processor, in IEEE International Conference on System on Chip, pp. 279-282, 2009.
[63] X. Zhao, A. Erdogan and T. Arslan, OFDM Symbol Timing Synchronization System on a Reconfigurable Instruction Cell Array, in IEEE International Conference on System on Chip, pp. 319-322, 2008.
[64] J. J. van de Beek, M. Sandell and P. O. Borjesson, ML Estimation of Time and Frequency Offset in OFDM Systems, IEEE Transactions on Signal Processing, 45: pp. 1800-1805, 1997.
[65] Y. Y. Chuang, Cameras, in Digital Visual Effects, 2005.
[66] Wikipedia, Median Filter. Available from: http://en.wikipedia.org/wiki/Median_filter, 2010.
[67] G. Landini, Image Processing Fundamentals. Available from: http://www.ph.tn.tudelft.nl/Courses/FIP/Frames/fip.html.
[68] P. Longere, X. M. Zhang, P. B. Delahunt and D. H. Brainard, Perceptual Assessment of Demosaicing Algorithm Performance, in Proceedings of the IEEE, 2002.
[69] Y. W. Liu, J. Meng, H. Fan and J. J. Li, Research on Infrared Image Smoothing for Warship Targets, in IEEE International Conference on Machine Learning and Cybernetics, pp. 4054-4056, 2004.
[70] A. R. Rostampour and A. P. Reeves, 2-D Median Filtering and Pseudo Median Filtering, in Proceedings of the 20th Southeastern Symposium on System Theory, pp. 554-557, 1988.
[71] W. Han, Y. Yi, M. Muir, I. Nousias, T. Arslan and A. T. Erdogan, MRPSIM: A TLM Based Simulation Tool for MPSoCs Targeting Dynamically Reconfigurable Processors, in IEEE International SoC Conference, pp. 41-44, 2008.
[72] W. Han, Y. Yi, X. Zhao, M. Muir, T. Arslan and A. T. Erdogan, Heterogeneous Multi-Core Architectures with Dynamically Reconfigurable Processors for Wireless Communication, in IEEE Symposium on Application Specific Processors, pp. 27-32, 2009.
[73] X. Zhao, Y. Yi, A. T. Erdogan and T. Arslan, A High-Efficiency Reconfigurable 2-D Discrete Wavelet Transform Engine for JPEG2000 Implementation on Next Generation of Digital Cameras, in IEEE International SOC Conference, 2010.
[74] M. Mehendale, S. B. Roy, S. D. Serlekar and G. Venkatesh, Coefficient Transformations for Area-Efficient Implementation of Multiplier-less FIR Filters, in IEEE International Conference on VLSI Design, pp. 110-115, 1998.
[75] P. C. Wu and L. C. Chen, An Efficient Architecture for Two-Dimensional Discrete Wavelet Transform, IEEE Transactions on Circuits and Systems for Video Technology, 11, 2001.
[76] J. Guo, K. Wang, C. Wu and Y. Li, Efficient FPGA Implementation of Modified DWT for JPEG2000, in IEEE International Conference on Solid-State and Integrated Circuit Technology, pp. 2200-2203, 2008.
[77] Q. Liu, L. Du and B. Hu, Low-Power JPEG2000 Implementation on DSP-Based Camera Node in Wireless Multimedia Sensor Networks, in IEEE International Conference on NSWCTC, pp. 300-303, 2009.
[78] Freescale Semiconductor, JPEG2000 Wavelet Transform on StarCore(TM)-Based DSPs, 2004.
[79] M. Adams and F. Kossentini, JasPer: A Software-Based JPEG2000 Codec Implementation, in Proceedings of IEEE International Conference on Image Processing, pp. 53-56, Oct. 2000.
[80] K. F. Chen, C. J. Lian, T. H. Chang and L. G. Chen, Analysis and Architecture Design of EBCOT for JPEG2000, in Proceedings of IEEE International Symposium on Circuits and Systems, pp. 765-768, 2001.
[81] H. H. Chen, C. J. Lian, T. H. Chang and L. G. Chen, Analysis of EBCOT Decoding Algorithm and Its VLSI Implementation for JPEG2000, in Proceedings of IEEE International Symposium on Circuits and Systems, pp. 329-332, 2002.
[82] J. S. Chiang, Y. S. Lin and C. Y. Hsieh, Efficient Pass Parallel Architecture for EBCOT in JPEG2000, in Proceedings of IEEE International Symposium on Circuits and Systems, pp. 773-776, 2002.
[83] D. Taubman, E. Ordentlich, M. Weinberger and G. Seroussi, Embedded Block Coding in JPEG2000, in Proceedings of IEEE International Conference on Image Processing, pp. 33-36, 2000.
[84] X. Zhao, A. T. Erdogan and T. Arslan, A Novel High-Efficiency Partial-Parallel Context Modeling Architecture for EBCOT in JPEG2000, in IEEE International Conference on System on Chip, pp. 57-61, 2009.
[85] ARM, www.arm.co.uk.
[86] ARM, http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0274h/Chdifhhi.html.
[87] M. Dyer, D. Taubman, S. Nooshabadi and A. K. Gupta, Concurrency Techniques for Arithmetic Coding in JPEG2000, IEEE Transactions on Circuits and Systems, 53(6): pp. 1203-1213, 2006.
[88] B. Min, S. Yoon, J. Ra and D. S. Park, Enhanced Renormalization Algorithm in MQ-Coder of JPEG2000, in International Symposium on Information Technology Convergence, pp. 213-216, 2007.
[89] R. R. Osorio and B. Vanhoof, High Speed 4-Symbol Arithmetic Encoder Architecture for Embedded Zero Tree-Based Compression, Journal of VLSI Signal Processing Systems, 33(3): pp. 267-275, 2003.
[90] M. Tarui, M. Oshita, T. Onoye and I. Shirakawa, High-Speed Implementation of JBIG Arithmetic Coder, in IEEE TENCON Conference, pp. 1291-1294, 2002.
[91] M. Dyer, D. Taubman and S. Nooshabadi, Improved Throughput Arithmetic Coder for JPEG2000, in IEEE International Conference on Image Processing, pp. 2817-2820, 2004.
[92] A. Aminlou, M. Homayouni, M. R. Hashemi and O. Fatemi, Low-Power High-Throughput MQ-Coder Architecture with an Improved Coding Algorithm, in The EURASIP Picture Coding Symposium, 2007.
[93] B. Valentine and O. Sohm, Optimizing the JPEG2000 Binary Arithmetic Encoder for VLIW Architectures, in Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 117-120, 2004.
[94] ARM, http://arm.com/products/tools/software-tools/rvds/index.php, 2011.
[95] X. Zhao, A. T. Erdogan and T. Arslan, A Hybrid Dual-Core Reconfigurable Processor for EBCOT Tier-1 Encoder in JPEG2000 on Next Generation of Digital Cameras, in IEEE International Conference on Design and Architectures for Signal and Image Processing, 2010.
[96] Faraday, 65nm power library. Available from: http://www.faradaytech.com/html/products/FeatureLibrary/miniLib_65nm.html.
[97] ARM, http://www.arm.com/products/processors/classic/arm9/arm946.php.
[98] Texas Instruments, TMS320C6414T/15T/16T Power Consumption Summary, 2008.
[99] M. Y. Chiu, K. B. Lee and C. W. Jen, Optimal Data Transfer and Buffering Schemes for JPEG2000 Encoder, in IEEE Workshop on Signal Processing Systems, pp. 177-182, 2003.
[100] B. F. Wu and C. F. Lin, Analysis and Architecture for High Performance JPEG2000 Coprocessor, in IEEE International Symposium on Circuits and Systems, pp. 225-228, 2004.
[101] G. Impoco, JPEG2000: A Short Tutorial, 2004.