Profiling-Based Hardware/Software Co-Exploration for the Design of Video Coding Architectures
Heiko Hübert and Benno Stabernack
Lu Hao

Contents
1. Background
2. MEMTRACE profiler
3. Software/Hardware Optimization
4. Conclusion

Background -- Profiling
• Profiling is used to understand the runtime behavior of applications.
• Efficient profiling approaches:
  • Software profiling (sampling, instrumentation): flexible, but with high overhead
  • Hardware profiling (performance counters): inexpensive, but more rigid and not universally available
  • Hybrid (combinations of the above): promising, since they can combine the advantages of both while limiting their drawbacks

An example of hardware profiling
PC: Performance Counter

Background -- System analysis
• Why do we need profiling? Adapting the system to the application is essential for finding an efficient solution.

Video coding

MEMTRACE profiler
• MEMTRACE delivers cycle-accurate profiling results at the C function level.
• The results include clock cycles, various memory access statistics, and optionally an energy consumption estimate for reduced instruction set computer (RISC)-based processors.
• The focus is on memory access analysis, because for data-intensive applications this aspect has high potential for increasing system efficiency.

MEMTRACE profiling toolflow
MEMTRACE -- Initialization
MEMTRACE -- Performance Analysis
MEMTRACE -- Post Processing
MEMTRACE backend

MEMTRACE -- Profiling data acquisition
• init(): Initializes the profiler and creates a list of all functions and global variables.
• nextInstruction(): Checks whether program execution has moved from one function to another. If so, the cycle count of the previous function is recalculated and the call count of the new function is incremented.
• memoryAccess(): Determines whether a load or a store access was performed and which bit width (8, 16, or 32 bit) was used.
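The three callbacks described so far (init, nextInstruction, memoryAccess) can be illustrated with a minimal Python sketch. This is not the MEMTRACE implementation: the class name, the argument lists, and the symbol table of (name, start address) pairs are assumptions made for illustration only. The sketch follows the slide text literally, so returning into a function also increments its call count.

```python
from bisect import bisect_right
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FunctionStats:
    calls: int = 0
    cycles: int = 0
    # Keys are (kind, width) pairs such as ("load", 32).
    accesses: Counter = field(default_factory=Counter)


class ProfilerSketch:
    """Hypothetical stand-in for the profiler callbacks; names are assumed."""

    def init(self, symbols):
        # symbols: iterable of (function_name, start_address) pairs (assumed format).
        ordered = sorted(symbols, key=lambda s: s[1])
        self.names = [name for name, _ in ordered]
        self.starts = [start for _, start in ordered]
        self.stats = {name: FunctionStats() for name in self.names}
        self.current = None   # function currently executing
        self.entered_at = 0   # cycle at which it was entered

    def _lookup(self, pc):
        # Find the function whose start address is the closest one at or below pc.
        i = bisect_right(self.starts, pc) - 1
        return self.names[i] if i >= 0 else None

    def next_instruction(self, pc, cycle):
        func = self._lookup(pc)
        if func == self.current:
            return  # still inside the same function
        if self.current is not None:
            # Charge the elapsed cycles to the function being left.
            self.stats[self.current].cycles += cycle - self.entered_at
        if func is not None:
            self.stats[func].calls += 1
        self.current = func
        self.entered_at = cycle

    def memory_access(self, is_store, width_bits):
        if width_bits not in (8, 16, 32):
            raise ValueError("unsupported access width")
        if self.current is None:
            return  # access outside any known function
        kind = "store" if is_store else "load"
        self.stats[self.current].accesses[(kind, width_bits)] += 1
```

A simulator would call next_instruction() once per executed instruction and memory_access() once per data access. The per-function statistics then accumulate in the stats dictionary.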
MEMTRACE -- Profiling data acquisition
• busActivity(): Identifies the bus status (idle cycle, core access, or DMA access) and increments the corresponding counter of the current function.
• cacheMiss(): Called each time a cache miss occurs.
• finish(): Called when the instruction set simulator (ISS) terminates the simulation.

Processor model generator

Interconnection

What can we do with the results of the MEMTRACE profiler?

System partitioning
• Computationally intensive functions are well suited for hardware acceleration in a coprocessor.
• Control-intensive functions are better suited for software implementation on ASIPs (application-specific instruction set processors).

Software Optimization
• Loop unrolling
• For computationally intensive parts, arithmetic optimizations or SIMD instructions can be applied if the processor provides such instructions.

Video applications

Hardware Optimization
Memory Subsystem Optimizations
• External memory
• Cache (cache misses)
  • The data areas with the most cache misses and the smallest size should be stored in on-chip memory.
• SRAM
Instruction Set Architecture Optimizations
• Frequently used instructions should be considered as optimization targets during processor architecture development.

Conclusion
• Profiling and system analysis
• MEMTRACE architecture
  • Initialization
  • Performance analysis
  • Post processing
• Hardware/Software optimization
  • Software
  • Hardware

And questions?
Lu Hao

References
[1] H. Hübert and B. Stabernack, "Profiling-based hardware/software co-exploration for the design of video coding architectures," IEEE Transactions on Circuits and Systems for Video Technology, 2009, pp. 1680-1691.
[2] ST Microelectronics: Nomadik STn8820 Mobile Multimedia Application Processor (2008, Feb.). Data brief. [Online]. Available: www.st.com
[3] Broadcom: BCM2820 Low Power, High Performance Application Processor (2006, Sep.). Product brief. [Online]. Available: www.broadcom.com
[4] G. de Micheli and L. Benini, Networks on Chips. San Francisco, CA: Morgan Kaufmann, 2006.
[5] H. Hübert, "MEMTRACE: A memory, performance and energy profiler targeting RISC-based embedded systems for data-intensive applications," Ph.D. dissertation, Dept. Elect. Eng. Comput. Sci., Tech. Univ. Berlin, Germany, 2009. [Online]. Available: http://opus.kobv.de/tuberlin/volltexte/2009/2261