Profiling-Based
Hardware/Software Co-Exploration for the Design of
Video Coding Architectures
Heiko Hübert and Benno Stabernack
Lu Hao
Contents
1. Background
2. MEMTRACE profiler
3. Software/Hardware Optimization
4. Conclusion
Background -- profiling
Profiling is used to understand the runtime behavior of applications
Efficient profiling approaches
Software profiling
Sampling, Instrumentation
Flexible, but with high overhead
Hardware profiling
Performance counters
Inexpensive, but more rigid, and may not be
universally available
Hybrid: combinations of the above
Hold great potential, since they aim to combine the
advantages of both while limiting the drawbacks
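To make the "instrumentation" approach concrete, here is a minimal Python sketch. It is not taken from the paper: the `instrument` decorator, the `stats` table, and the `sad` kernel are hypothetical names chosen for illustration. The measurement code wrapped around every call is exactly where the overhead of software profiling comes from.

```python
import functools
import time
from collections import defaultdict

# Hypothetical per-function statistics: call count and accumulated wall time.
stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def instrument(func):
    """Wrap func so that every call is counted and timed."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # This bookkeeping runs on every call: the instrumentation overhead.
            entry = stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@instrument
def sad(block_a, block_b):
    """Sum of absolute differences, a typical video coding kernel."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


result = sad([1, 2, 3, 4], [4, 3, 2, 1])
```

After the call, `stats["sad"]` holds one call and its elapsed time. A sampling profiler would instead interrupt the program periodically and record where it currently is.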
An example of hardware profiling
PC – Performance Counter
Background – system analysis
Why do we need profiling?
It is very important to adapt the system to the
application in order to find an efficient solution.
Video coding
MEMTRACE profiler
MEMTRACE delivers cycle-accurate profiling
results on a C function level.
The results include clock cycles, various memory
access statistics, and optionally energy
consumption estimation for reduced instruction
set computer (RISC)-based processors.
A focus is placed on memory access analysis, as
for data-intensive applications this aspect has a
high potential for increasing system efficiency.
MEMTRACE profiling toolflow
MEMTRACE -- Initialization
MEMTRACE – Performance Analysis
MEMTRACE – Post Processing
MEMTRACE backend
MEMTRACE -- Profiling data acquisition
init()
Initialize the profiler.
Creates a list of all functions and global variables
nextInstruction()
Checks if the program execution has changed from
one function to another
If so, the cycle count of the previous function is
recalculated and the call count of the new function is
incremented
memoryAccess()
Determines whether a load or a store access was performed,
and which bit width (8, 16, or 32 bits) was used.
busActivity()
Identifies the bus status (idle cycle, core
access or DMA access) and increments the
appropriate counter of the current function
cacheMiss()
Called each time a cache miss occurs
finish()
Called when the instruction set simulator (ISS) terminates the simulation
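The callback sequence above can be sketched as a small Python class driven by a synthetic trace. This is an illustrative sketch, not the MEMTRACE backend: the class name, the argument lists, the address ranges, and the snake_case method names (standing in for `init()`, `nextInstruction()`, `memoryAccess()`, `busActivity()`, `cacheMiss()`, and `finish()`) are all assumptions. In particular, counting a call only when execution enters a function at its start address is a choice made here to avoid counting returns as calls.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FunctionStats:
    calls: int = 0
    cycles: int = 0
    cache_misses: int = 0
    accesses: dict = field(default_factory=lambda: defaultdict(int))
    bus: dict = field(default_factory=lambda: defaultdict(int))


class ProfilerSketch:
    def init(self, symbols):
        # symbols: hypothetical list of (name, start_address, end_address).
        self.symbols = list(symbols)
        self.stats = {name: FunctionStats() for name, _, _ in self.symbols}
        self.current = None      # function currently executing
        self.entry_cycle = 0     # cycle at which it became current
        self.finished = False

    def _lookup(self, pc):
        for name, start, end in self.symbols:
            if start <= pc < end:
                return name, start
        return None, None

    def next_instruction(self, pc, cycle):
        name, start = self._lookup(pc)
        if name != self.current:
            # Execution moved to another function: close the old cycle count.
            if self.current is not None:
                self.stats[self.current].cycles += cycle - self.entry_cycle
            # Assumption: only an entry at the start address is a call.
            if name is not None and pc == start:
                self.stats[name].calls += 1
            self.current, self.entry_cycle = name, cycle

    def memory_access(self, is_store, width):
        if self.current is not None:
            kind = "store" if is_store else "load"
            self.stats[self.current].accesses[(kind, width)] += 1

    def bus_activity(self, status):
        # status: "idle", "core", or "dma"
        if self.current is not None:
            self.stats[self.current].bus[status] += 1

    def cache_miss(self):
        if self.current is not None:
            self.stats[self.current].cache_misses += 1

    def finish(self, cycle):
        if self.current is not None:
            self.stats[self.current].cycles += cycle - self.entry_cycle
        self.finished = True


# Synthetic trace: main calls sad, sad returns, simulation ends.
p = ProfilerSketch()
p.init([("main", 0x100, 0x140), ("sad", 0x200, 0x240)])
p.next_instruction(0x100, 0)
p.next_instruction(0x104, 1)
p.next_instruction(0x200, 3)      # call into sad
p.memory_access(is_store=False, width=8)
p.cache_miss()
p.bus_activity("core")
p.next_instruction(0x204, 10)
p.next_instruction(0x108, 12)     # return into main
p.bus_activity("idle")
p.finish(15)
```

In this trace `main` accumulates 6 cycles over one call, and `sad` accumulates 9 cycles, one 8-bit load, one cache miss, and one core bus access.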
Processor model generator
Interconnection
What can we do with the results of the
MEMTRACE profiler?
System partitioning
Computationally intensive functions are well suited for hardware acceleration in a
coprocessor
Control-intensive functions are better suited
for software implementation on ASIPs
(application-specific instruction-set
processors)
Software Optimization
Loop unrolling
For computationally intensive parts,
arithmetic optimizations or SIMD
instructions can be applied, if such
instructions are available in the processor
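Loop unrolling can be illustrated with a sum of absolute differences (SAD) kernel. The sketch below is written in Python only to show the transformation; the function names are hypothetical, and no speedup is implied for Python itself. On a RISC core the benefit would come from executing fewer loop-counter updates and branches per element.

```python
def sad_rolled(a, b):
    """Reference loop: one element per iteration."""
    total = 0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total


def sad_unrolled4(a, b):
    """Same computation, four elements per iteration."""
    n = len(a)
    limit = n - n % 4          # largest multiple of 4 not exceeding n
    total = 0
    i = 0
    while i < limit:
        total += abs(a[i] - b[i])
        total += abs(a[i + 1] - b[i + 1])
        total += abs(a[i + 2] - b[i + 2])
        total += abs(a[i + 3] - b[i + 3])
        i += 4
    while i < n:               # remainder loop for the leftover elements
        total += abs(a[i] - b[i])
        i += 1
    return total


rolled = sad_rolled([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
unrolled = sad_unrolled4([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])
```

Both variables hold the same value. The remainder loop keeps the unrolled version correct when the length is not a multiple of four.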
Video applications
Hardware Optimization
Memory Subsystem Optimizations
External memory
Cache (Cache miss)
• The data areas with the most cache misses and the smallest
size should be stored in on-chip memory
SRAM
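The placement rule above (most cache misses, smallest size) can be sketched as a greedy selection by misses per byte. This is one possible heuristic, not the procedure from the paper, and the data areas, sizes, miss counts, and SRAM capacity below are made-up illustration values.

```python
def choose_on_chip(areas, capacity_bytes):
    """Pick data areas for on-chip SRAM, highest misses per byte first.

    areas: list of (name, size_bytes, cache_misses) with size_bytes > 0.
    Returns the chosen names and the number of bytes used.
    """
    ranked = sorted(areas, key=lambda a: a[2] / a[1], reverse=True)
    chosen, used = [], 0
    for name, size, misses in ranked:
        # Skip areas that never miss or that no longer fit.
        if misses > 0 and used + size <= capacity_bytes:
            chosen.append(name)
            used += size
    return chosen, used


# Hypothetical per-area profiling results.
areas = [
    ("frame_buffer", 101376, 5000),
    ("coeff_table", 256, 1200),
    ("line_buffer", 2048, 3000),
    ("stack", 4096, 400),
]
chosen, used = choose_on_chip(areas, capacity_bytes=4096)
```

With a 4096-byte SRAM, the sketch selects `coeff_table` and `line_buffer` (2304 bytes in total). The frame buffer has the most misses in absolute terms, but it is far too large to be worth placing on chip.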
Instruction Set Architecture Optimizations
Frequently used instructions should be considered as
targets for optimization during the processor
architecture development.
Conclusion
Profiling and system analysis
MEMTRACE architecture
Initialization
Performance analysis
Post processing
Hardware/Software optimization
Software
Hardware
Any questions?
Lu Hao
References
[1] H. Hübert and B. Stabernack, “Profiling-based hardware/software co-exploration for the
design of video coding architectures,” IEEE Transactions on Circuits and Systems for
Video Technology, 2009, pp. 1680-1691.
[2] ST Microelectronics: Nomadik STn8820 Mobile Multimedia Application Processor
(2008, Feb.). Data brief. [Online]. Available: www.st.com
[3] Broadcom: BCM2820 Low Power, High Performance Application Processor (2006,
Sep.). Product brief. [Online]. Available: www.broadcom.com
[4] G. de Micheli and L. Benini, Networks on Chips. San Francisco, CA: Morgan
Kaufmann, 2006.
[5] H. Hübert, “MEMTRACE: A memory, performance and energy profiler targeting
RISC-based embedded systems for data-intensive applications,” Ph.D. dissertation,
Dept. Elect. Eng. Comput. Sci., Tech. Univ. Berlin, Germany, 2009. [Online]. Available:
http://opus.kobv.de/tuberlin/volltexte/2009/2261