An energy efficient multi-target binary translator

Design Automation for Embedded Systems (2022) 26:55–82
An energy efficient multi-target binary translator for
instruction and data level parallelism exploitation
Tiago Knorst1 · Julio Vicenzi2 · Michael G. Jordan1 · Jonathan H. de Almeida2 ·
Guilherme Korol1 · Antonio C. S. Beck1 · Mateus B. Rutzig2
Received: 14 October 2020 / Accepted: 13 November 2021 / Published online: 14 January 2022
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021
Abstract Embedded devices, from smartphones to home appliances, are omnipresent in our daily routine and run both data- and control-oriented applications. To maximize the energy-performance
tradeoff, data and instruction-level parallelism are exploited by using superscalar processors and specific accelerators. However, as such devices face tight time-to-market constraints, binary compatibility
should be maintained to avoid recurrent engineering, which is not considered in current
embedded processors. This work examines a set of embedded applications, showing the need
for concurrent ILP and DLP exploitation. To that end, we propose a Hybrid Multi-Target Binary
Translator (HMTBT) that transparently exploits ILP and DLP by using a CGRA and an ARM
NEON engine as target accelerators. Results show that HMTBT transparently achieves
24% performance improvements and 54% energy savings over an OoO superscalar processor coupled to an ARM NEON engine. The proposed approach also improves performance by 10% and saves 24% energy
over decoupled binary translators using the same accelerators with the
same ILP and DLP capabilities.
Keywords CGRA · ARM NEON · ILP · DLP · Binary Translator
1 Introduction
From simple home appliances to complex smartphones, embedded devices have become
omnipresent in our daily routine. Since user experience is fundamental, performance is
widely accepted as a requirement. However, as most of these devices are battery-powered, designers must also
meet severe power constraints.
This study was financed in part by: CNPq; FAPERGS/CNPq 11/2014 - PRONEM; and the Coordenação de
Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
Mateus B. Rutzig
Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
Electronics and Computing Department, Universidade Federal de Santa Maria (UFSM), Santa Maria,
T. Knorst, et al.
Fig. 1 Application Execution
considering HMTBT system
Traditionally, application parallelism is exploited to improve performance, which also brings energy
benefits. However, these benefits are limited when a single type of parallelism is exploited
in mobile applications, as these present distinct behaviors, ranging from data-oriented to control-oriented applications, such as multimedia and communication protocols, respectively.
Even in embedded devices, instruction-level parallelism is aggressively exploited by using
out-of-order superscalar processors, such as the ARM Cortex Family. Due to multimedia
application dominance in smartphones, single instruction multiple data engines have been
coupled to superscalar processors to exploit data-level parallelism, such as ARM NEON.
However, these parallelism exploitation paradigms diverge in one aspect: binary compatibility. For decades, superscalar processors have taken advantage of Moore’s Law to exploit
instruction-level parallelism dynamically while keeping software compatibility. SIMD engines
coupled to superscalar processors, in contrast, rely on specific libraries or compilers to be triggered,
which defeats the primary purpose of such processors: support for legacy code.
Binary translators have been proposed as means to provide code portability among different architectures. Moreover, the translation process can be used to exploit parallelism,
which, despite offering binary compatibility, can achieve performance improvements. Several binary translators have been proposed to exploit instruction [1] or data level parallelism
[11] dynamically, leveraging the performance of specific architectures. However, none of
them exploits instruction and data-level parallelism concurrently in a system comprised of
multiple accelerators, which is the ideal system to take advantage of the varying parallelism
behavior found in the applications present in current embedded devices. Moreover, such a
runtime binary translator should transparently detect the parallelism and orchestrate the triggering of the different accelerators, maximizing the energy-performance trade-off while keeping
the main idea of superscalar processors: binary compatibility.
This work proposes a Hybrid Multi-Target Binary Translator (HMTBT) that transparently supports code optimization on accelerators that exploit different parallelism degrees.
As depicted in Fig. 1, HMTBT detects at runtime instruction and data-level parallelism of the
application hotspots and smartly issues them to the most well-suited accelerator aiming at
maximizing performance per watt. The HMTBT supports code optimization to a CGRA [1],
which exploits instruction-level parallelism, and to ARM NEON engine, which is responsible for exploiting data-level parallelism. Our system is coupled to an in-order ARMv7,
which executes code regions that are not suitable for optimization. On average, HMTBT shows performance
improvements of 44% and 99% and energy savings of 14% and 18%
over two systems: one with a binary translator targeting the CGRA [1] and the other with a binary translator targeting the ARM
NEON engine. Moreover, the HMTBT offers 24% energy savings with 10% more performance than
a single system composed of both binary translators generating code for the CGRA and NEON
[11]. Finally, results show that HMTBT outperforms an OoO superscalar processor by 24%
while spending 54% less energy.
Summarizing, the contributions of this work are the following:
• we show that the opportunities for ILP and DLP exploitation can vary widely among
embedded applications, and even within the same application when the data input or data type
changes, indicating the need for dynamic scheduling of code portions to achieve the
best of ILP and DLP exploitation;
• we propose a dynamic adaptive multi-target binary translator that detects the data and
instruction-level parallelism degree of application hotspots and automatically translates
and schedules code portions to the most well-suited accelerator. In this work, we propose two scheduling methods: one based on a history table that holds the performance
improvement of every code portion for each accelerator; the other based on the operations
per cycle metric resulting from the parallelism exploitation of both accelerators over every
application code portion;
• we show that the proposed adaptive binary translator outperforms systems based on:
single parallelism exploitation; mutual parallelism exploitation without adaptive code
scheduling; and a superscalar processor with a SIMD engine.
2 Related work
This section presents binary translator proposals that transform native code to exploit
instruction or data-level parallelism.
2.1 Binary translators for ILP
Instruction-level parallelism is a traditional approach to accelerate applications by executing
instructions in parallel. Binary translators are commonly employed to translate non-parallel
code to specific accelerators aiming at exploiting instruction-level parallelism [18]. This
subsection presents static and dynamic binary translators that modify the native code to be
accelerated in specific hardware to exploit instruction-level parallelism.
Warp processing [21] was one of the first proposals to use a binary translator to accelerate code
in a simplified FPGA. A microprocessor is responsible for executing the program to be
accelerated, while another microprocessor concurrently runs a stripped-down CAD algorithm
that monitors the instructions to detect critical regions. The CAD software decompiles the
instruction flow to a control flow graph, synthesizes it, and maps the circuit into the FPGA.
Configurable Compute Array (CCA) [5] is a coarse-grained reconfigurable array coupled
to an ARM processor. The process to trigger the CCA is divided into two steps: discovering
which subgraphs are suitable for running on the CCA, and their replacement by microops
in the instruction stream. Two alternative approaches are presented: a static approach, which
finds subgraphs for the CCA at compile-time, changing the original ARM code by inserting
CCA instructions; and a dynamic approach, which assumes the use of a trace cache to perform
subgraph discovering and CCA instruction building at runtime.
Hybrid DBT [19], a hardware-based dynamic binary translator, is proposed to translate
RISC-V ISA to a VLIW-based multicore system. Hybrid DBT is composed of a set of in-order, out-of-order, and VLIW cores. While out-of-order cores are best for control-dominated
kernels, VLIW cores perform better in compute-intensive kernels. The translation process,
which supports hardware accelerators, is composed of three steps: the first step translates
RISC-V to VLIW instructions; the second and third steps build instruction blocks and perform inter-block optimizations, such as loop unrolling and trace building. The scoreboard
scheduling algorithm is used to schedule VLIW instructions over a dual-core VLIW processor
to exploit instruction-level parallelism aggressively.
Park et al. [17] propose a runtime binary translator that unrolls loops to avoid redundant branch instructions, which increases the instruction-level parallelism exploitation. Such
unrolling employs machine learning techniques to predict loops that are suitable for unrolling.
Results show high prediction accuracy that significantly reduces the number of dynamic
instructions by running applications compiled to X86 ISA.
The DIM (Dynamic Instruction Merging) [1] is a binary translator capable of detecting
application hotspots to execute in a CGRA. The CGRA is conceived as a replication of
arithmetic and logic functional units, multipliers, and memory access units. DIM’s binary
translation explores application hotspots with high ILP to offload their execution to the
CGRA. Since it is entirely implemented in hardware, it does not rely on source code modifications, which provides the original ISA’s binary compatibility. [2–4,13,20] extend the
DIM binary translator to improve performance and energy savings but still focusing only on
transparent ILP exploitation.
DORA [22] implements dynamic translation, which utilizes optimizations such as loop
unrolling, accumulator identification, store forwarding, and loop deepening to accelerate
code in a CGRA. The system can achieve substantial power and performance improvements
over DySer [10], which uses a static binary translator targeting the same CGRA setup.
Fajardo et al. [7] propose a flow to accelerate multi-ISA applications on a single CGRA
through a runtime two-level binary translation. The first BT level is responsible for translating
the native ISA to a standard code. In contrast, the second BT level accelerates the common
code in heterogeneous accelerators, such as DSP, VLIW, and CGRA. The proposed two-level
BT simplifies the support for additional ISAs since just the first BT level must be modified to
generate the standard code, being transparent to the second BT to accelerate the common code
into the CGRA. As a case study, the proposed approach shows performance improvements
by accelerating x86 code into a CGRA.
2.2 Binary translators for DLP
SIMD engines are widely used to run data-intensive applications, such as multimedia, games,
and digital filters. The triggering of SIMD engines commonly relies on specific libraries or compiler
techniques, such as ARM NEON and Intel AVX architectures. However, this subsection
only presents approaches that, somehow, transform the native code to SIMD instructions at
DSA [11] is a runtime binary translator aiming to exploit data-level parallelism of application loops to execute in a SIMD engine. Software productivity is improved by using DSA
since no code modification is needed to trigger the SIMD engine. DSA has access to information at runtime that is not available at programming and compile-time. It has shown higher
performance improvements than static DLP exploitation (hand coded using ARM NEON
library and GCC compiler).
A retargetable software-based dynamic binary translator that supports
translation between SIMD engines (e.g., ARM NEON to AVX32) is proposed in [8]. The proposal’s
main claim is that some SIMD engines provide a limited number of functional units and
registers that restrain the maximum data-level parallelism exploitation in the applications.
The binary translator is implemented on Hybrid QEMU that gives support to register and
functional units mapping from the guest to host SIMD ISA. The proposed framework shows
performance improvements by translating SIMD instructions from ARMv7, ARMv8, and
IA32 to the X86-AVX2 host ISA.
Selftrans [15] aims at automatically vectorizing binary code at runtime to
increase the utilization ratio of SIMD units. It extracts parallelism from binary x86 machine
code without requiring its source code and translates it into a binary code that utilizes SIMD
An approach presented in [23] exploits thread, memory, and data-level parallelism
through a same-ISA dynamic binary modifier guided by static binary analysis. The Janus
framework uses a static code analysis combined with a dynamic code modification approach
to transform scalar to SIMD instructions from the application code automatically. Considering just the automatic vectorization, Janus shows significant performance improvements in
running the TSVC benchmarks.
Vapor SIMD [16] proposes a just-in-time (JIT) compilation solution for different SIMD
vendors, such as Intel SSE and ARM NEON. Vapor SIMD can combine static and
dynamic infrastructure to support vectorization. The process is divided into two steps: first,
a generic SIMD code is generated; second, the target ISA assembly is generated. They
demonstrate the effectiveness of the framework by vectorizing a set of kernels that exercise
the innermost loop and outer loop.
2.3 Our contributions
All aforementioned works provide, somehow, transparent software optimization by using
a binary translator to exploit parallelism at instruction [12] or data-level [11] or even to
take advantage of intrinsic characteristics of multiple ISA multicore systems [6,9]. However,
none of them support binary translation to multi-parallelism exploitation, which is mandatory to accelerate the variety of application behaviors running in current mobile devices. Our
HMTBT is a multi-target dynamic and transparent binary translator that, unlike all aforementioned works, supports software optimization through accelerators that exploit different
levels of parallelism. At runtime, the Hybrid Multi-Target Binary Translator (HMTBT) concurrently detects instruction and data-level parallelism of application hotspots and schedules
them to the most well-suited accelerator aiming at maximizing performance per watt.
The block diagram of the entire system is shown in Fig. 2. The Hybrid Multi-Target Binary
Translator (HMTBT) is connected to the ARMv7 core, which executes the non-accelerated code regions. The HMTBT translates ARMv7 instructions for the CGRA and the NEON
engine to exploit instruction and data-level parallelism, respectively. The translation scope of
HMTBT is application basic blocks for the CGRA and loop statements for NEON. A code
portion translated for the CGRA is a configuration, while one translated for the NEON engine is a sequence of
NEON SIMD instructions. From now on, both are generically called accelerated words. A
Translation Cache is responsible for storing accelerated words. It has the structure of a regular
cache memory, with the block size defined by the largest accelerated word (configuration or NEON SIMD
instructions), while indexing and tagging use the memory address of the first translated
ARMv7 instruction.
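The lookup behavior of the Translation Cache can be illustrated with a minimal software sketch. It assumes a direct-mapped organization, a power-of-two number of entries, and 4-byte ARM-state instructions; these choices, as well as the names `TranslationCache` and `AcceleratedWord`, are illustrative assumptions and are not taken from the HMTBT hardware description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AcceleratedWord:
    target: str     # "CGRA" or "NEON"
    payload: bytes  # configuration bits or NEON SIMD instructions


class TranslationCache:
    """Direct-mapped cache indexed and tagged by the address of the
    first translated ARMv7 instruction (organization is an assumption)."""

    def __init__(self, num_entries: int = 64) -> None:
        # A power-of-two size keeps the index/tag split simple.
        assert num_entries > 0 and num_entries & (num_entries - 1) == 0
        self.num_entries = num_entries
        self.tags: List[Optional[int]] = [None] * num_entries
        self.words: List[Optional[AcceleratedWord]] = [None] * num_entries

    def _split(self, address: int) -> Tuple[int, int]:
        # Assumes 4-byte instructions, so the two low address bits are dropped.
        word_index = address >> 2
        return word_index % self.num_entries, word_index // self.num_entries

    def store(self, address: int, word: AcceleratedWord) -> None:
        index, tag = self._split(address)
        self.tags[index] = tag
        self.words[index] = word

    def lookup(self, address: int) -> Optional[AcceleratedWord]:
        """Return the accelerated word on a hit, or None on a miss."""
        index, tag = self._split(address)
        if self.tags[index] == tag:
            return self.words[index]
        return None
```

In this sketch a hit would correspond to entering execution mode, and a miss to starting detection mode for that code portion.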
The HMTBT is connected to the ARMv7 register file to gather the content of the general-purpose registers
that hold loop iteration indexes and loop stop conditions. The data memory
addresses of operands, used in the cross-iteration dependency
detection process for NEON translations, are also collected from the register file. The CGRA is connected to the register file and
data cache to access the operands needed to execute the configurations sent by the
HMTBT. The ARM NEON engine accesses operands only via the data cache through specific
NEON load and store instructions.
Fig. 2 System overview - HMTBT coupled to the ARMv7 core with ARM NEON and CGRA accelerators
The Current Program Status Register (CPSR) is necessary to keep the HMTBT aware of the
status flags (Negative/Less than/Zero) and of the ARM core’s execution mode, since the NEON
engine cannot be triggered when the core is running in privileged mode. The program counter
(PC) content is used to track the loop execution for NEON translations and to index accelerated words in the Translation Cache for both accelerators. To return execution to the ARM core, the HMTBT writes to the PC the address of
the ARMv7 instruction located at the application point
after the code portion accelerated by the CGRA or NEON. The instruction fetched
by the ARM core is accessible via the instruction register, so the HMTBT can decode instructions to gather the operation and operands involved. Also, the HMTBT controls the ARM
core’s execution through a stall signal, which freezes the ARM pipeline while the
NEON engine or the CGRA is running.
To keep binary compatibility with applications compiled with VFP/ARM
NEON instructions, the NEON engine can be triggered by both the ARM core and the HMTBT.
A multiplexer selects whether the instruction dispatched to the engine comes from the HMTBT
or from the ARM core. However, when the HMTBT dispatches a NEON SIMD instruction to
the NEON engine, it flushes the regular ARM instructions and stalls the ARM pipeline.
3.1 NEON hardware
The NEON engine is composed of an in-order 10-stage superscalar pipeline capable of issuing
two instructions per cycle. It supports the execution of ARM VFPv3 scalar instructions
compliant with single and double-precision floating-point arithmetic. As shown in Fig. 3, a
NEON instruction operates on 128-bit wide data, which can be configured as:
– 16 signed or unsigned 8-bit integers
– 8 signed or unsigned 16-bit integers
– 4 signed or unsigned 32-bit integers or single-precision floating-point data
– 2 double-precision floating-point data
Fig. 3 ARM NEON Engine - Instruction to register data packing
The NEON register file is composed of 32 64-bit registers, which can be viewed as:
– 128-bit registers Q0 to Q15
– 64-bit registers D0 to D31
– 32-bit floating point registers S0 to S31 (used by VFP instructions)
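The lane counts and register views listed above follow from simple arithmetic, sketched below. The aliasing rules used here (Qn overlays D(2n) and D(2n+1); S(2n) and S(2n+1) are the low and high halves of Dn) are those of the ARMv7 Advanced SIMD/VFP register bank; the function names are illustrative only.

```python
from typing import Tuple

NEON_VECTOR_BITS = 128  # width operated on by one NEON instruction


def lanes(element_bits: int) -> int:
    """Number of elements packed in a 128-bit NEON vector."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    return NEON_VECTOR_BITS // element_bits


def q_to_d(q: int) -> Tuple[int, int]:
    """Indices of the two 64-bit D registers aliased by 128-bit register Qq."""
    if not 0 <= q <= 15:
        raise ValueError("Q registers are Q0 to Q15")
    return 2 * q, 2 * q + 1


def s_to_d(s: int) -> Tuple[int, int]:
    """(D register index, half) holding 32-bit register Ss; half 0 is the low word."""
    if not 0 <= s <= 31:
        raise ValueError("S registers are S0 to S31")
    return s // 2, s % 2
```

For example, `lanes(32)` gives the four 32-bit elements per instruction mentioned in the list, and `q_to_d(15)` shows that Q15 covers D30 and D31, the last two 64-bit registers.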
3.2 CGRA hardware
Figure 4 shows the datapath of the CGRA, which is composed of arithmetic and logic units, multipliers, and memory access functional units (FUs) arranged in rows and columns. The datapath is
totally combinational: the FUs of neighboring columns are interconnected by multiplexers,
which creates a unique path from the leftmost to the rightmost column. Operations allocated
in the same column execute in parallel, while those placed in different columns
execute in data-dependent order. The FUs are virtually grouped in levels, representing
the propagation delay of the ARM core’s critical path. The input and output contexts are
connected to the ARM core register file working as a buffer to hold the source operands
and results of the operations, respectively. The CGRA execution is divided into three steps:
reconfiguration/operand fetch, execution, write back. The reconfiguration/operand fetch step
is triggered when an accelerated word is found in the Translation Cache. At that moment, the
configuration bits are fetched from the Translation Cache and sent to the CGRA datapath
to configure the FU operations and multiplexers. In parallel, the input context is fed with
operands that are fetched from the register file. The execution step starts when all operands
have been stored in the input context. When the propagation delay of all levels has been
finished, the write-back step starts. Finally, after all the operands have been written back
in the register file, the return address is written in the ARM core program counter, which
restarts the execution. The CGRA hardware supports basic block speculation, meaning that
more than one basic block can be coupled in a single CGRA configuration. The HMTBT
builds speculative configurations following the branch path taken during the detection phase (that
will be detailed in Sect. 3.4.3). If a misspeculation occurs, the CGRA hardware writes
back only the results of the basic blocks that were speculated correctly.
Content courtesy of Springer Nature, terms of use apply. Rights reserved.
Fig. 4 CGRA Data Path
Fig. 5 HMTBT Pipeline Stages
3.3 HMTBT pipeline
The HMTBT is composed of two operation modes:
– Detection Mode: while the ARMv7 processor is executing instructions for the first time,
the HMTBT, concurrently, monitors code portions and generates SIMD Instructions and
CGRA Configurations;
– Execution Mode: the HMTBT triggers either the NEON engine with the SIMD Instructions or the CGRA with the CGRA Configuration depending on which accelerator
provides higher performance improvements for such code portion.
Figure 5 depicts the pipeline stages of HMTBT hardware. Stages colored in black work
over code regions for both CGRA and NEON; stages colored in white involve building
configurations for the CGRA; and stages colored in gray are responsible for generating SIMD
instructions for NEON. The first stage, ARMv7 Instruction Decode/Accelerated Word Issue,
works on both detection and execution mode. It receives the ARMv7 instruction fetched by
the ARMv7 processor and its memory address. The Translation Cache is accessed with the
instruction’s memory address to identify if such code portion has been already translated. If
a Cache Hit happens, the execution mode is triggered. In this mode, the accelerated word
is fetched from the translation cache and issued to the accelerator responsible for executing
such code portion. If a Translation Cache Miss occurs, the detection mode starts, since such
code portion has not been translated yet: the instruction is decoded, and the second stage of
the HMTBT is called.
After the ARMv7 Instruction Decode, the HMTBT pipeline forks into two specialized
pipelines: the CGRA branch aims at exploiting ILP by producing a configuration for the
CGRA, while the NEON branch is responsible for exploiting DLP by generating NEON SIMD instructions.
Fig. 6 NEON Translation FSM Stages
3.3.1 CGRA translation stages
The CGRA branch is composed of Dependency Analysis, Mapping and Configuration Build
stages. All stages were adapted from [1] to be coupled to the HMTBT pipeline.
Dependency Analysis The Data Dependency Analysis stage checks for data hazards among instructions by comparing the source operands of the current instruction with the destination operands of
previously analyzed instructions. The output of this stage is the number of the CGRA’s leftmost column where the current ARMv7 instruction can be placed. Such a placement respects
the data dependencies on the instructions previously placed in the current configuration.
Mapping The Mapping stage verifies whether there is an available functional unit to place the current ARMv7 instruction, considering the column number received from the Dependency
Analysis stage. If there is no available functional unit in that column, the Mapping stage
searches the next columns to the right until it finds an available FU. If there is no FU available to
place the ARMv7 instruction, the current configuration is closed, and the instruction is
placed in the first column of a new configuration.
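The interplay between the Dependency Analysis and Mapping stages can be sketched in software as follows. This is a simplified model under stated assumptions: every FU is treated as generic (the real datapath distinguishes ALUs, multipliers, and memory units), only read-after-write dependencies are tracked, and the function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# (destination register, source registers); a simplified instruction model.
Instruction = Tuple[str, Tuple[str, ...]]
Configuration = List[List[Instruction]]  # one list of placed instructions per column


def build_configurations(instructions: List[Instruction],
                         num_columns: int,
                         fus_per_column: int) -> List[Configuration]:
    configs: List[Configuration] = []
    columns: Configuration = [[] for _ in range(num_columns)]
    producer_col: Dict[str, int] = {}  # register -> column of its latest producer

    for dest, srcs in instructions:
        # Dependency Analysis: leftmost column that respects RAW dependencies
        # on instructions already placed in the current configuration.
        earliest = 0
        for src in srcs:
            if src in producer_col:
                earliest = max(earliest, producer_col[src] + 1)

        # Mapping: search to the right for a column with a free FU.
        col = earliest
        while col < num_columns and len(columns[col]) >= fus_per_column:
            col += 1

        if col >= num_columns:
            # No FU available: close the configuration and start a new one.
            configs.append(columns)
            columns = [[] for _ in range(num_columns)]
            producer_col = {}
            col = 0

        columns[col].append((dest, srcs))
        producer_col[dest] = col

    if any(columns):
        configs.append(columns)
    return configs
```

With three columns of two FUs each, the sequence `r1=f(r0); r2=f(r1); r3=f(r0); r4=f(r2,r3)` fits in one configuration: `r1` and `r3` share the first column, `r2` goes to the second, and `r4` to the third.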
Configuration Build During the Configuration Build stage, the control bits for FUs operation
and interconnection are generated, resulting in a bitmap stream (configuration) that reflects
the datapath behavior for the execution of the translated code portion. The configuration bits
are sent to the next pipeline stage (Dispatcher).
3.3.2 NEON translation stages
The NEON branch is composed of the following stages:
Loop Detection The loop detection stage is capable of detecting and vectorizing the following
types of loops:
– Count loops: loop that performs a fixed number of iterations that can be determined at
compile time.
– Dynamic range loops: loop range is determined before the execution and can be modified
at runtime on every iteration.
– Sentinel loops: the loop range can only be determined during its execution.
– Conditional loops: loop that contains conditional code within its body.
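The four loop categories can be illustrated with high-level analogues. The HMTBT works on loops in ARMv7 binaries, so the Python functions below are only illustrative stand-ins with hypothetical names, chosen to make the difference in how the loop range becomes known visible.

```python
from typing import List


def count_loop(a: List[int], b: List[int]) -> List[int]:
    # Count loop: the trip count (8) is fixed and known before the program runs.
    out = [0] * 8
    for i in range(8):
        out[i] = a[i] + b[i]
    return out


def dynamic_range_loop(a: List[int], n: int) -> List[int]:
    # Dynamic range loop: the bound n is known when the loop starts,
    # but the loop body may change it at runtime.
    out = []
    i = 0
    while i < n:
        out.append(a[i] * 2)
        if a[i] < 0:
            n = i + 1  # the range shrinks during execution
        i += 1
    return out


def sentinel_loop(a: List[int]) -> List[int]:
    # Sentinel loop: the range is only discovered while executing,
    # here when the terminating value 0 is read.
    out = []
    i = 0
    while a[i] != 0:
        out.append(a[i] + 1)
        i += 1
    return out


def conditional_loop(a: List[int]) -> List[int]:
    # Conditional loop: the body contains conditional code.
    out = []
    for x in a:
        if x > 0:
            out.append(x * 2)
        else:
            out.append(x)
    return out
```

In the last two cases the number of useful iterations or the executed path depends on the data, which is consistent with them requiring the speculative states of the FSM.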
A finite state machine, presented in [11], is employed to exploit data-level parallelism
depending on the type of loop. As shown in Fig. 6, the FSM contains five states: Branch
Detection, Data Collection, Cross Dependency Analysis, Mapping, and Speculative Execution. Count loops and dynamic range loops are vectorized very similarly, requiring only the first
three states. Sentinel and conditional loops use a speculative process that requires all FSM
states to complete the vectorization.
The Branch Detection state is triggered at the end of the first loop iteration. It is responsible
for detecting the instruction memory address range of the loop body by tracking the program
counter register’s content. If a particular loop has been previously detected as vectorizable,
this state detects conditional code and function calls within the loop body.
The Data Collection state is triggered during the second iteration of the loop. This state is
responsible for evaluating the loop range, the vectorizable instructions, and their operands.
It also stores the addresses of data memory accesses in the Verification Cache (VC), which is used
by the Cross Dependency Analysis state to verify whether the loop is vectorizable. In addition, this
state checks for the presence of a Sentinel Loop.
The Cross Dependency Analysis state is triggered in the third loop iteration. This state
is responsible for analyzing cross-iteration dependencies (dependencies between two or
more iterations of the same loop statement). From the memory access point of view, a cross-iteration dependency exists when the same data memory address is accessed in different
loop iterations. The DSA cross-iteration analysis starts in the second loop iteration, when the
addresses of data memory accesses are saved in the Verification Cache. Even with the
memory addresses in the VC and comparing them to the memory addresses accessed in
every iteration, one cannot rule out cross-iteration dependencies in future iterations. To handle
this situation, we have implemented Cross-iteration Dependency Prediction. The equations
below describe the steps of the prediction process, where MRead[2] and MRead[3] are the
memory addresses accessed by a MemRead (load) instruction in the second and third loop
iterations, respectively. MRead[lastIteration] is the memory address accessed by a load
instruction in the last executed iteration (Eq. 4), x is the interval between MRead[2] and
MRead[lastIteration] (Eq. 1), MWrite[2] is the memory address accessed by a MemWrite
(store) instruction in the second iteration (Eqs. 2 and 3), MRange is the memory address range
between MRead[2] and MRead[3] (Eq. 5), CID means Cross-Iteration Dependency, and
NCID means No Cross-Iteration Dependency.
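\begin{align}
\mathit{MRead}[3] \le x &\le \mathit{MRead}[\mathit{lastIteration}] \tag{1} \\
\mathit{MWrite}[2] \in x &\rightarrow \mathit{CID} \tag{2} \\
\mathit{MWrite}[2] \notin x &\rightarrow \mathit{NCID} \tag{3} \\
\mathit{MRead}[\mathit{lastIteration}] &= \mathit{MRead}[2] + (\mathit{MRange} \ast (\mathit{lastIteration} - 2)) \tag{4} \\
\mathit{MRange} &= |\mathit{MRead}[3] - \mathit{MRead}[2]| \tag{5}
\end{align}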
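The prediction can be read as a short procedure, sketched below under the assumption that the multiplier in Eq. 4 is the stride MRange of Eq. 5 and that a single load and a single store are tracked. The function and parameter names are illustrative and do not come from the DSA or HMTBT implementation.

```python
def predict_cross_iteration_dependency(mread2: int,
                                       mread3: int,
                                       mwrite2: int,
                                       last_iteration: int) -> str:
    """Return "CID" if a cross-iteration dependency is predicted, else "NCID"."""
    # Eq. 5: stride between the loads of the second and third iterations.
    m_range = abs(mread3 - mread2)
    # Eq. 4: extrapolated load address of the last iteration. Because the
    # stride is an absolute value, ascending addresses are assumed here.
    mread_last = mread2 + m_range * (last_iteration - 2)
    # Eq. 1: x is the interval of load addresses of the future iterations.
    # Eqs. 2 and 3: a store of the second iteration falling inside x would be
    # read again by a later iteration.
    if mread3 <= mwrite2 <= mread_last:
        return "CID"
    return "NCID"
```

For instance, a loop that loads `a[i]` and stores to the same `a[i]` has its second-iteration store below MRead[3], so no dependency is predicted, whereas a store to `a[i+2]` falls inside the interval and is flagged.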
The Mapping State is only activated for Conditional loops. This stage is responsible
for: evaluating the loop range; mapping the executed conditional code statements; verifying
cross-iteration dependencies between iterations within every condition; and, assessing the
vectorizable instructions and their operands within every condition.
The Speculative Execution State is only enabled for Conditional Loops and Sentinel Loops.
This state is responsible for: selecting data generated during SIMD execution at the end of the
Loop; storing the current range for Sentinel Loops (Speculative Range); and, storing mapped
conditions of the Conditional Loop for further executions in the Translation Cache.
SIMD Build The SIMD Build stage receives the set of operands and operations identified in the
Loop Detection stage and generates the SIMD NEON instructions that are sent to the Dispatcher stage. As the NEON engine operates on vectors whose number of elements is a
multiple of 2, 4, 8, or 16, the leftover elements of vectors whose size is not such a multiple must be handled somehow. As shown in the example of Fig. 7, the 32-bit vector contains 11 elements. The SIMD
Build stage handles this issue by generating the maximum number of SIMD instructions to
exploit DLP, in this case, two NEON SIMD instructions. The leftover elements are executed
by single-element NEON instructions in the NEON engine without data-level parallelism.
Fig. 7 Leftovers example for a 32-bit vector with 11 elements. The first 8 elements can be processed using SIMD instructions, but the last 3 elements must be processed using single element instructions
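The leftover handling amounts to an integer division of the element count by the number of lanes. The sketch below reproduces the Fig. 7 case (a vector of 11 elements of 32 bits each); the function name is illustrative.

```python
from typing import Tuple


def split_vector(num_elements: int, element_bits: int,
                 vector_bits: int = 128) -> Tuple[int, int]:
    """Return (number of full SIMD instructions, number of leftover elements)."""
    lanes = vector_bits // element_bits       # elements handled per SIMD instruction
    simd_instructions = num_elements // lanes
    leftovers = num_elements % lanes          # handled by single-element instructions
    return simd_instructions, leftovers
```

Calling `split_vector(11, 32)` yields two SIMD instructions covering the first 8 elements and 3 leftover elements, matching the figure caption.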
3.3.3 Dispatcher stage
The two branches of the HMTBT pipeline merge in the Dispatcher stage. It receives CGRA configurations or
SIMD NEON instructions and dispatches them to the Translation Cache. When both pipeline
branches of the binary translator have optimized the same code portion (Mutual Regions),
this stage is responsible for deciding which accelerated word should be dispatched to the
Translation Cache.
The dispatcher can therefore operate over Non-Mutual Accelerated Regions or
Mutual Accelerated Regions, handling them as follows:
1. Non-Mutual Regions: since the HMTBT translates all basic blocks to execute in the CGRA,
code regions that are not loops are covered only by that accelerator. At detection time,
the SIMD Build stage of the NEON pipeline branch informs the Dispatcher stage that such a
region is not feasible for DLP exploitation. Thus, the accelerated word for those code regions
is always stored in the Translation Cache as a CGRA configuration.
2. Mutual Regions: code regions that can be covered by both accelerators,
which are limited to the NEON engine acceleration scope (loops without cross-iteration
dependencies). In this work, we have implemented two methods that decide which accelerated
word of a mutual region will be stored in the Translation Cache.
History Table Method
The History Table Method decides which accelerator will execute the
mutual region based on the measured performance. As shown in Fig. 8, this method uses
the History Table, in which each row holds the number of cycles taken by the CGRA configuration and by the NEON instructions to run a given mutual code portion. The Accelerated Word is
indexed and tagged in the History Table by the memory address of its first ARMv7 instruction. Since translating a basic block into a CGRA configuration is faster
than building SIMD NEON instructions, the CGRA configuration is stored in the Translation
Cache as soon as it is built by the HMTBT. This allows the code portion to be
accelerated in the CGRA while the NEON SIMD instructions are still being translated.
When the NEON pipeline branch finishes the accelerated word for the same code portion,
the Dispatcher uses a 6-bit comparator to select which accelerated word will be kept in
the Translation Cache, based on the number of cycles each takes to run the code portion. If
the NEON instructions provide a higher performance improvement than the CGRA
configuration, the Dispatcher stores the built NEON instructions in the Translation Cache,
replacing the previously stored CGRA configuration. Otherwise, the Dispatcher discards the
NEON instructions.
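The decision flow described above can be sketched in software. This is only an illustrative model, not the authors' hardware: the class and method names (`HistoryTableDispatcher`, `on_cgra_ready`, `on_neon_ready`) are hypothetical, cycle counts are plain Python integers rather than values fed to a 6-bit comparator, and the handling of a NEON result that arrives before any CGRA result is an assumption, since the text states that the CGRA configuration is always ready first.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HistoryEntry:
    """One History Table row: cycles measured for each accelerator."""
    cgra_cycles: Optional[int] = None
    neon_cycles: Optional[int] = None


class HistoryTableDispatcher:
    """Hypothetical software model of the History Table Method."""

    def __init__(self) -> None:
        # Both structures are indexed by the address of the first ARMv7 instruction.
        self.history: Dict[int, HistoryEntry] = {}
        self.translation_cache: Dict[int, str] = {}

    def on_cgra_ready(self, first_pc: int, cycles: int) -> None:
        # The CGRA configuration is stored as soon as it is built.
        entry = self.history.setdefault(first_pc, HistoryEntry())
        entry.cgra_cycles = cycles
        self.translation_cache[first_pc] = "CGRA"

    def on_neon_ready(self, first_pc: int, cycles: int) -> None:
        entry = self.history.setdefault(first_pc, HistoryEntry())
        entry.neon_cycles = cycles
        # Assumption: if no CGRA entry exists yet, NEON is simply stored.
        if entry.cgra_cycles is None or cycles < entry.cgra_cycles:
            self.translation_cache[first_pc] = "NEON"  # replace CGRA word
        # Otherwise the NEON instructions are discarded (cache unchanged).


dispatcher = HistoryTableDispatcher()
dispatcher.on_cgra_ready(0x8000, 1200)
dispatcher.on_neon_ready(0x8000, 400)
print(dispatcher.translation_cache[0x8000])
```

With a strict "fewer cycles" comparison, a tie keeps the CGRA configuration that is already in the Translation Cache.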
Fig. 8 Dispatcher using a history table
Fig. 9 Dispatcher using OPC calculus
Operation per Cycle Method
The Operation per Cycle (OPC) method selects which accelerator will
execute the mutual region based on the parallelism degree. For a CGRA configuration, the OPC
method measures the parallelism degree from the number of operations allocated on
each CGRA level. For the NEON accelerator, the operations per cycle are evaluated from
the type of operands and the number of instructions generated by the SIMD Build stage.
Figure 9 shows the interconnection between the SIMD Build and Configuration Build stages and the
Dispatcher stage. The OPC method uses a 6-bit comparator to select the accelerated word
stored in the Translation Cache based on information from the previous pipeline stages. The SIMD
Build stage already knows the parallelism degree since the type of operands is decoded in
the Loop Detection stage to build the SIMD NEON instructions. For instance, if the operands
involved in the vectorization process are 32 bits long, the resulting OPC is four since the
NEON engine is 128 bits wide.
For CGRA configurations, it is necessary to compute the OPC by using a 6-bit Fixed Point
Divider. To that end, the Dispatcher divides the total number of instructions allocated by
the number of levels in the mutual region configuration to obtain the average Operation per
Cycle of the CGRA configuration. If the NEON instructions provide a higher OPC than
the CGRA configuration, the Dispatcher stores the built NEON instructions in the Translation Cache, replacing the previously stored CGRA configuration. Otherwise, the Dispatcher
discards the NEON instructions.
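As a minimal sketch of the OPC comparison, the functions below compute both figures and pick an accelerator. The function names are hypothetical, and the CGRA quotient is truncated to an integer as an assumption: the text only states that a 6-bit fixed-point divider is used, without giving its fractional precision.

```python
NEON_WIDTH_BITS = 128  # width of the NEON engine stated in the text


def neon_opc(operand_bits: int) -> int:
    """Operations per cycle of NEON: lanes that fit in one 128-bit vector."""
    return NEON_WIDTH_BITS // operand_bits


def cgra_opc(num_operations: int, num_levels: int) -> int:
    """Average operations per CGRA level (integer truncation is an assumption)."""
    return num_operations // num_levels


def select_by_opc(operand_bits: int, num_operations: int, num_levels: int) -> str:
    """NEON replaces the CGRA word only if its OPC is strictly higher."""
    if neon_opc(operand_bits) > cgra_opc(num_operations, num_levels):
        return "NEON"
    return "CGRA"


# 32-bit operands against a CGRA configuration with 12 operations on 6 levels
print(neon_opc(32), cgra_opc(12, 6), select_by_opc(32, 12, 6))
```

Unlike the History Table Method, this decision needs no execution-time measurement: both inputs are already available when the accelerated words are built.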
3.4 Examples of HMTBT translations
This section shows examples of mutual code regions translated by HMTBT for CGRA and
NEON engine. In addition, the decision process of the Dispatcher stage is shown.
3.4.1 Count loops
Figure 10 shows the C and ARM assembly code of a count loop that sums
two vectors of length 400. Below, the HMTBT detection is detailed for both NEON and CGRA.
NEON Detection Process
The HMTBT detects the loop at the end of the first iteration by triggering the FSM Branch
Detection state when the bne .count_loop instruction reaches the Loop Detection pipeline stage.
This FSM state accesses the Translation Cache using the ARMv7 instruction address. A cache
miss means that this loop has not been analyzed yet. In that case, the program counter of the
branch instruction is registered, and the detection process starts. During the second iteration,
the Data Collection FSM state identifies the number of iterations by looking for the
value of the increment/decrement of the index and the stop condition value. In this example,
at line 2, r7 is incremented by the load instruction ldr r3, [r7, #4]! (the increment is
4), and the stop condition value is held in register r4. All memory addresses accessed
by load and store instructions are stored in the Verification Cache to be checked later
for cross-iteration dependencies (r7 and r1 being load addresses and r2 being a store
address). The operand type is also stored in the Verification Cache to provide the byte offset
for the cross-iteration dependency check.
The Dependency Analysis State uses the address values stored during data collection to
verify cross-iteration dependencies. If one is found, the detection process ends, and the loop
address is sent to the Dispatcher Stage, informing it that this code portion is not vectorizable
(non-mutual code portion). As no dependency is found in the given example, the SIMD Build
pipeline stage is triggered.
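The kind of check performed during dependency analysis can be illustrated with a simplified model. It assumes that the accesses of several iterations are available as explicit (kind, address, size) records; the hardware described above instead works from the addresses and operand types collected in the Verification Cache, and its exact algorithm is not given here. All names and addresses below are hypothetical.

```python
from typing import List, Tuple

Access = Tuple[str, int, int]  # (kind: "load" or "store", address, size in bytes)


def overlaps(addr_a: int, size_a: int, addr_b: int, size_b: int) -> bool:
    """True if the two byte ranges share at least one byte."""
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a


def has_cross_iteration_dependency(iterations: List[List[Access]]) -> bool:
    """A store in one iteration touching memory accessed by another iteration."""
    for i, accesses_i in enumerate(iterations):
        for j, accesses_j in enumerate(iterations):
            if i == j:
                continue
            for kind_a, addr_a, size_a in accesses_i:
                if kind_a != "store":
                    continue
                for _, addr_b, size_b in accesses_j:
                    if overlaps(addr_a, size_a, addr_b, size_b):
                        return True
    return False


def vector_sum_iteration(i: int, a=0x1000, b=0x2000, c=0x3000, size=4) -> List[Access]:
    """Accesses of c[i] = a[i] + b[i] with 32-bit elements (illustrative addresses)."""
    return [("load", a + i * size, size),
            ("load", b + i * size, size),
            ("store", c + i * size, size)]


# a[i + 1] = a[i] + 1: each iteration stores what the next one loads.
recurrence = [[("load", 0x1000 + 4 * i, 4), ("store", 0x1000 + 4 * (i + 1), 4)]
              for i in range(4)]

print(has_cross_iteration_dependency([vector_sum_iteration(i) for i in range(4)]))
print(has_cross_iteration_dependency(recurrence))
```

The operand size matters for the overlap test, which is why the operand type is recorded alongside each address.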
During the SIMD Build stage, the HMTBT generates vectorized instructions and sends
them to the Dispatcher stage, which decides whether they are stored in the Translation Cache.
Since the number of iterations of a count loop is known before execution, the loop is unrolled,
and the SIMD Build stage generates only data processing and memory access instructions,
disregarding all branch and comparison instructions. The vst1.32 d16-d17, [r3, #4]! post-increment instruction is generated to increment the loop index register r3 by four.
CGRA Detection Process
The CGRA configuration steps start at the data dependency analysis stage, which inspects the data
dependencies between incoming instructions by comparing the current instruction’s source
operands with the previously analyzed instructions. This stage sends to the mapping stage the
number of the leftmost CGRA column in which the instruction can be allocated. In the
mapping stage, the current instruction is allocated to a functional unit according to the column
number received from the data dependency stage. In this example, both load operations are
allocated in parallel (column 1), while the add instruction is allocated in column 4, respecting
its data dependency on the load instructions. Finally, the store instruction is allocated in
column 7. The bne instruction is allocated in column 5 since it has a data dependency on the
cmp instruction. The speculation process starts after the allocation of the bne instruction: as
the instructions of the second iteration arrive at the data dependency stage, they are allocated
in the same configuration, following the same allocation methodology as the first basic block,
as shown in Fig. 10. In the current implementation, the HMTBT supports the speculation of
one basic block. If a misspeculation happens, only the results of the instructions of the first
basic block are written back to the ARM core register file. Once all instructions of the
second iteration are allocated, the Configuration Build stage generates the control bits for the
FUs and interconnection and sends them to the Dispatcher stage.
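The "leftmost column" rule can be sketched as a simple as-soon-as-possible placement driven only by read-after-write dependencies. The sketch is a simplification under several assumptions: functional unit availability and write-after-read or write-after-write hazards are ignored, the register names are illustrative rather than taken from Fig. 10, and the three-column latency given to loads and adds is chosen only so that the example reproduces the column numbers quoted above (1, 4, and 7).

```python
from typing import Dict, List, Optional, Sequence, Tuple

Instr = Tuple[str, Sequence[str], Sequence[str]]  # (opcode, destinations, sources)


def allocate_columns(instructions: List[Instr],
                     latency: Optional[Dict[str, int]] = None) -> List[Tuple[str, int]]:
    """Place each instruction in the leftmost column where its sources are ready."""
    latency = latency or {}
    ready: Dict[str, int] = {}  # register -> first column where its value is usable
    placement: List[Tuple[str, int]] = []
    for opcode, dests, sources in instructions:
        # Registers produced outside the configuration are ready at column 1.
        column = max((ready.get(reg, 1) for reg in sources), default=1)
        placement.append((opcode, column))
        for reg in dests:
            ready[reg] = column + latency.get(opcode, 1)
    return placement


# Illustrative body of a vector sum: two loads, one add, one store.
body: List[Instr] = [
    ("ldr", ["r3"], ["r7"]),
    ("ldr", ["r0"], ["r1"]),
    ("add", ["r3"], ["r3", "r0"]),
    ("str", [], ["r3", "r2"]),
]

print(allocate_columns(body, latency={"ldr": 3, "add": 3}))
```

Independent instructions, such as the two loads, land in the same column, which is where the ILP gain of the CGRA comes from.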
Fig. 10 Count loop example and the allocation of CGRA’s functional units
Dispatcher Decision Process
Since each CGRA level takes one processor clock cycle, the execution of the CGRA configuration from Fig. 10, which contains two loop iterations, takes 6 cycles, or 3 cycles per
iteration. Supposing that the NEON pipeline is full, NEON execution requires 4 instructions
and takes 4 cycles to process 4 loop iterations. Given that, the CGRA and NEON store in
the History Table the values 1200 and 400 cycles, respectively, which are the total numbers of cycles taken to
execute the 400 iterations of the loop. For the OPC Method, as the operands are
32 bits long, the OPC is four for the NEON engine, while the average OPC for the CGRA is
two (12 instructions divided by 6 levels). With both the History Table and OPC methods,
the NEON SIMD instructions are stored in the Translation Cache since they take fewer
clock cycles and provide a higher OPC than the CGRA configuration.
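The cycle totals above follow from a simple proportion, which the sketch below makes explicit. The helper name `total_cycles` is hypothetical, and rounding partial groups up to a full group is an assumption that does not affect these examples, since 400 is a multiple of both group sizes.

```python
def total_cycles(iterations: int, cycles_per_group: int, iterations_per_group: int) -> int:
    """Cycles to run a loop when each group of iterations costs a fixed number of cycles."""
    groups = -(-iterations // iterations_per_group)  # ceiling division
    return groups * cycles_per_group


# CGRA: one configuration holds 2 iterations and takes 6 cycles.
cgra = total_cycles(400, cycles_per_group=6, iterations_per_group=2)
# NEON: 4 instructions take 4 cycles and cover 4 iterations.
neon = total_cycles(400, cycles_per_group=4, iterations_per_group=4)

print(cgra, neon)
```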
3.4.2 Dynamic range loops and leftovers
Figure 11 shows an example of Dynamic Range loop, where the stop condition is unknown
at compile-time, being determined by user input via scanf function call and stored in register
NEON Detection Process
In the first iteration of the loop, the branch detection state detects the loop. During the second
iteration execution, the HMTBT calculates the range by evaluating the contents of r2 and
r5 since the latter register already holds the returned value from the scanf function. As the
content of r5 is never modified, the remaining translation steps are similar to a count loop.
However, differently from a count loop, even if already analyzed, the dynamic range loop
must always pass through the data collection stage since the loop range can change on every
Suppose that the user input for the scanf function of Fig. 11 is 11 and that each element is
a 32-bit integer. The HMTBT generates the maximum number of NEON SIMD instructions
Fig. 11 Dynamic range loop example with single element instructions
for the loop range, and the leftover elements are executed by single element instructions in the
NEON engine. As each vectorized iteration computes 4 elements at once, the HMTBT generates two
instructions that operate over elements 0 to 3 and 4 to 7. The three leftover elements are
processed by three single element instructions in the NEON engine.
In the example of Fig. 11, one can notice that the instruction vld1.64 d8-d9, [r2]! has
the equivalent single element instruction vld1.8 d8[0], [r2]!, where d8[0] represents the 32
lower bits of the 64-bit register, which can also be viewed as simply being s16. This way,
dealing with the leftovers is only a matter of changing the load and store instructions to operate
over a single element.
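The split between full SIMD instructions and leftover single element instructions is a quotient and remainder, as the short sketch below shows for the range of 11 used in this example. The function name is hypothetical.

```python
from typing import Tuple


def split_vector_work(n_elements: int, operand_bits: int,
                      vector_bits: int = 128) -> Tuple[int, int]:
    """Return (full SIMD instructions, leftover single element instructions)."""
    lanes = vector_bits // operand_bits  # elements processed per SIMD instruction
    return divmod(n_elements, lanes)


simd_count, leftovers = split_vector_work(11, operand_bits=32)
print(simd_count, leftovers)
```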
CGRA Detection Process
The instruction allocation of the CGRA is the same as that presented for the count loop in Fig. 10
since the C code is equal.
Dispatcher Decision Process
Since both CGRA and NEON instructions are the same as those of the count loop presented in Fig. 10,
the Dispatcher receives the same information and provides the same decisions. Thus, both
History Table and OPC decide to store the NEON SIMD instructions to accelerate such loop
3.4.3 Conditional loops with branch instructions
Figure 12 shows an example of a conditional loop that uses branch instructions to control the
program flow. As usual, the branch can take two paths, depending on condition A or B, both
being accelerated by CGRA and NEON engine.
NEON Detection Process
The HMTBT is able to vectorize conditional loops by using two additional stages: the mapping stage, which maps the different conditional blocks at detection time, and
the speculative stage, which runs both conditions at execution time
and determines which results will be committed.
During the first loop iteration, at the Loop Detection stage, the HMTBT looks for branch
instructions within the loop range. If it finds any, the
Mapping stage is triggered.
During the second loop iteration, the Mapping stage is responsible for mapping each conditional block. In this example, only instruction 6 composes the condition block of condition
A, while instruction 9 composes the condition B block. Since only one of these conditional
blocks will be executed on each iteration, the HMTBT must wait for at least one execution
of all conditions before concluding if the loop is vectorizable. The mapping process uses the
cross iteration dependency approach to detect if the conditional loop is vectorizable. Once
all conditional blocks have been mapped and marked as vectorizable, the SIMD Build stage
Fig. 12 Example of conditional branch loop and the allocation of CGRA’s functional units
is called to build the NEON SIMD instructions for both Conditions A and B, as shown in
Fig. 12.
If a Translation Cache hit happens, meaning that such code portion will be accelerated
in the NEON engine, the speculative process is triggered. The HMTBT sends to the NEON
engine the instructions of both conditions A and B, and their results are stored in ARM
NEON registers. The results of the operations of both conditions in every iteration are stored
in different registers. In Fig. 12, the results from condition A are stored in registers q0 to q3,
and the results from condition B are stored in registers q4 to q7. During the speculative stage,
data memory writes of the already computed NEON instructions are issued on demand,
depending on the result of the cmp instruction, which is executed by the ARM core. The
HMTBT checks the result of the cmp instruction of a given iteration by looking at the CPSR
condition flags. The registers containing the correct conditional block results are then stored
to memory using single element store instructions. In this example, if the first iteration had
condition B as true, the issued store instruction would be vst1.64 d4[0], [r0]!, where d4[0]
contains the results of condition B. However, if condition A was true, the issued instruction
would be vst1.64 d0[0], [r0]!, where d0[0] contains the results of condition A. The NEON
registers that hold results of the condition that was not taken are marked as free to store
speculative results of the remaining iterations.
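The speculative scheme can be modeled in a few lines: both conditional blocks are computed for every element of a vector-sized group, and the per-element comparison then selects which result is committed. This is a behavioral sketch only. The function and parameter names are hypothetical, Python lists stand in for the NEON registers, and the example condition and blocks are not those of Fig. 12.

```python
from typing import Callable, List, Sequence


def speculative_conditional_loop(
    data: Sequence[int],
    condition: Callable[[int], bool],
    block_a: Callable[[int], int],
    block_b: Callable[[int], int],
    lanes: int = 4,
) -> List[int]:
    """Compute both conditional blocks per group, then commit one result per element."""
    committed: List[int] = []
    for start in range(0, len(data), lanes):
        chunk = data[start:start + lanes]
        results_a = [block_a(x) for x in chunk]  # speculative results of condition A
        results_b = [block_b(x) for x in chunk]  # speculative results of condition B
        for x, res_a, res_b in zip(chunk, results_a, results_b):
            # The comparison outcome decides which single element store is issued.
            committed.append(res_a if condition(x) else res_b)
    return committed


values = [3, -1, 4, -5, 9]
print(speculative_conditional_loop(values, lambda x: x >= 0,
                                   lambda x: x + 1, lambda x: x - 1))
```

The committed values match a scalar loop with an if/else in its body, at the cost of computing results that are later thrown away.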
CGRA Detection Process
As mentioned before, the CGRA can couple two basic blocks within a configuration: one
regular and one speculative. Supposing that condition A is true for the first iteration
of the loop of Fig. 12, instructions 2 to 7 would be allocated in the configuration,
with instructions 2 to 5 forming the first basic block and instructions 6 and 7 the second one. If
condition B is true in the second loop iteration, a misspeculation happens, and the
configuration is deleted from the Translation Cache. The code portion is then translated
again in the third loop iteration, using the basic blocks executed in that iteration.
Dispatcher Decision Process
Suppose the best-case scenario, where condition A is translated into a configuration and no
misspeculation happens. The CGRA takes 2 cycles to execute each loop iteration, 128 cycles
Fig. 13 Conditional instruction loop example, similar to conditional branch code, the assembled code now
uses conditional instructions
for 64 iterations; the DSA execution takes the same 128 cycles to execute the same scenario.
In that case, the HMTBT based on the History Table method elects the CGRA to execute the
code portion since it consumes less energy, as will be detailed in the Experimental Results
section. Considering the OPC method, the CGRA takes 2 levels to execute 8 instructions,
giving an OPC of 4. As this example uses 16-bit operands, the OPC
of the NEON engine is 8. Thus, the HMTBT based on the OPC Method selects the NEON
engine to execute the code portion.
3.4.4 Conditional loops with conditional instructions
Figure 13 shows an example of a conditional loop that sums odd indexed elements and
subtracts the even indexed elements. As can be seen, the generated ARMv7 code has two
conditional instructions: addeq r2, r2, r0 and subne r2, r2, r0, and a test instruction tst r3, #1
to implement the conditional code. In this example, we have two distinct possible conditions,
at lines 6 and 7, referred to as conditions A and B, respectively.
NEON Detection Process
During the first iteration (Branch Detection State), the HMTBT detects the conditional
instructions. Unlike conditional branch code, regardless of the number of conditional statements in the loop, the HMTBT can check cross-iteration dependencies in a single loop iteration
since all instructions are fetched and decoded. Thus, the Mapping Stage is triggered and
completed in the second iteration for the example of Fig. 13. At execution time, the
speculation process is the same as in the conditional loop example with branches shown in the
previous subsection.
CGRA Detection Process
Supposing that condition A is true in the first loop iteration, the CGRA configuration holds
the instructions from lines 2 to 6. The instruction allocation is the same as that presented for the conditional loop with branch instructions shown in Fig. 12.
Dispatcher Decision Process
The NEON engine takes 4 cycles to operate over 4 array elements, so the entire loop execution
takes 100 cycles. The CGRA configuration takes 3 cycles to execute a loop iteration and 300
cycles to run the whole loop. Considering the OPC Method, the CGRA configuration has
an OPC equal to one (five instructions divided by 3 levels) while the OPC of NEON is four
since the operands are 32-bit long. Thus, both the History Table and OPC methods choose
the NEON engine to execute such code portion.
3.4.5 Sentinel loops
Figure 14 shows a sentinel loop where the stop condition is based on an operation placed
within the loop body.
NEON Detection Process
In a sentinel loop, the number of iterations is unknown at compile time and can change on every
loop iteration at runtime. For this reason, the HMTBT uses a speculative execution method
similar to that of conditional loops to execute an arbitrary number of iterations, committing only the
results of the correct number of iterations once the loop range is resolved.
The HMTBT triggers the same states as for dynamic range and count loops during the loop’s
first three iterations. Register r3 receives the result of the multiplication (Line 4), which is
compared with the constant 0 (Line 5). This implies that the stop condition can only
be found during loop execution. Once the Data Collection and Dependency Analysis stages
are done, the loop is considered vectorizable. The SIMD Build stage generates the vectorized
instructions and sends them to the Translation Cache.
If a Translation Cache hit happens, the SIMD NEON instructions are fetched from the
Translation Cache, and the speculative process starts. The initial speculative range is set to
4 (the minimum number of elements to be processed considering 32-bit integers). Once the
execution of the initial speculation value is done, the HMTBT generates a compare instruction
and dispatches it to the ARM core. In this case, four different conditions must be checked by
reading the content of the CPSR register. Once an instruction is asserted, the corresponding
iteration results are stored back into memory by changing the register address of the store
instruction (in this example, varying from d0[0] to d1[2]) and dispatching it to ARM NEON.
This process is repeated until the stop condition is reached. If more iterations are processed
than necessary, the unused results are not stored back into memory, and the loop execution
is finished.
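A behavioral sketch of this speculative execution is given below. It processes the data in groups of four elements, checks the stop condition on each speculative result in order, and commits results only up to the stopping point. The names are hypothetical, and two details are assumptions: the result that triggers the stop condition is not committed, and the stop test is "result equals zero", mirroring the comparison with the constant 0 mentioned above.

```python
from typing import Callable, List, Sequence


def speculative_sentinel_loop(
    data: Sequence[int],
    body: Callable[[int], int],
    stop: Callable[[int], bool],
    lanes: int = 4,
) -> List[int]:
    """Run the loop body speculatively per group; commit only the valid iterations."""
    committed: List[int] = []
    for start in range(0, len(data), lanes):
        chunk = data[start:start + lanes]
        results = [body(x) for x in chunk]  # speculative vector execution
        for result in results:
            if stop(result):
                # Remaining speculative results are never stored back.
                return committed
            committed.append(result)
    return committed


data = [1, 2, 3, 0, 5, 6]
print(speculative_sentinel_loop(data, body=lambda x: x * 2, stop=lambda r: r == 0))
```

In this run the first group of four is computed in full, but only three of its results are committed, which is the overhead speculation pays when the loop ends inside a group.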
CGRA Detection Process
The CGRA allocation is shown in Fig. 14. In that case, two loop iterations are held in a single configuration.
Dispatcher Decision Process
The CGRA can execute a loop iteration in 3 cycles. In comparison, the DSA can process 4
elements in 3 cycles and takes 4 more cycles to store the processed elements using single element
instructions, resulting in 7 cycles. Supposing that the loop executes 4 iterations, the NEON
engine takes 7 cycles while the CGRA takes 12 cycles. For the OPC method, the NEON engine has an OPC of four
since the operands are 32 bits long, while the CGRA has an OPC of two. Both methods select
the NEON engine to execute the code portion.
4 DLP and ILP opportunities
The purpose of this section is to show that DLP and ILP vary among applications of different
domains. In addition, even considering a specific application, the data input and type of
operands can influence the degree of available parallelism. Applications belonging to the
following domains were selected for such exploitation:
– Image Processing
– EPIC - image compression algorithm based on a bi-orthogonal critically sampled
dyadic wavelet decomposition and a combined run-length/Huffman entropy coder;
Fig. 14 Sentinel loop example and the allocation of CGRA’s functional units
– RGB Grayscale - converts image from RGB to grayscale format;
– Susan Edges- iterates over each image pixel to find edges;
– Susan Smoothing - a noise filter that gets a central pixel smoothing the neighbors’
pixels taking the average value of all of those pixels;
– Gaussian Filter - linear filter used to blur an image;
– JPEG - decoder for image compression format;
– MPEG - decoder for video compression format;
– H264 - video compression standard;
– Telecommunication and Cryptography
– GSM - Global System for Mobile Communications, a protocol that uses a combination of Time- and Frequency-Division Multiple Access (TDMA/FDMA) to
encode/decode data streams;
– SHA - secure hash algorithm that produces a 160-bit message digest from a given input;
– Blowfish - symmetric block cipher with a variable length key;
– Kernels
– Matrix Mult - multiplies two matrices of N elements;
– Bitcount - counts the number of bits in an array;
– FFT - Fast Fourier Transform;
– StringSearch - searches for a given word in phrases.
Figures 15 and 16 show the fraction of application execution time that offers opportunities to exploit DLP and ILP plus DLP. In addition, the fraction of execution time that provides neither DLP nor ILP
is presented in these figures. For better visualization, we have divided the benchmarks into two
figures, sorted by increasing DLP opportunities: the benchmarks
in the first figure have fewer DLP opportunities, and those in the second
figure have more.
Fig. 15 ILP and DLP runtime opportunities of the applications, ordered by the DLP opportunities
Fig. 16 ILP and DLP runtime opportunities of the applications, ordered by the DLP opportunities
As can be seen, the selected benchmarks are very heterogeneous in terms of DLP and ILP
opportunities. RGB Grayscale shows that ILP exploitation is mandatory since no DLP opportunities are
available. On the other hand, Susan Smoothing provides DLP in 86% of its execution time,
suggesting that DLP exploitation should also be present. When applications of the same domain
are investigated, the need for exploiting both kinds of parallelism is reinforced. For instance, Susan
Edges shows 3% of DLP while Susan Smoothing shows 86%, but the former provides 84%
of ILP and the latter just 11%. The same behavior can be noticed with GSM and SHA,
where the former has low DLP opportunities and high ILP opportunities, and the opposite holds
for the latter. Moreover, Matrix Multiplication, a kernel commonly used in a wide range of
application domains, benefits from ILP and DLP exploitation at almost the same level, which
shows that both exploitations are orthogonal but complementary.
Figures 17 and 18 show the ILP and DLP opportunities of Susan Edges and Susan Smoothing when varying the size and content of the input images. The input images range from a monochrome black image to
a crowd image to evaluate how the opportunities change from homogeneous to heterogeneous
colored pixels. For Susan Edges, the instruction-level parallelism grows with increasing
pixel heterogeneity and with larger image sizes. Considering a 50 × 50
Fig. 17 ILP and DLP runtime opportunities of Susan E varying size and input data
Fig. 18 ILP and DLP runtime opportunities of Susan S varying size and input data
image, the ILP grows 14% from the black to the crowd image. Also, the ILP increases 27% when a
black image grows from 50 × 50 to 500 × 500 pixels. Unlike Susan Edges, the increase in
image size produces more opportunities for data-level parallelism exploitation in Susan
Smoothing. From 50 × 50 to 500 × 500, the DLP increases 20%, regardless of the input
Figure 19 shows the ILP and DLP opportunities of Matrix Multiplication considering different
matrix sizes and operand granularities. For both matrix sizes, the opportunities for DLP and
ILP stay at a similar level for integer numbers. However, when floating-point numbers
are considered, DLP opportunities decrease by a third for both matrix sizes, while ILP
keeps a similar level for 64 × 64 and halves for 256 × 256. Also, ILP and
DLP opportunities differ significantly between the two matrix sizes: with integer numbers, ILP stays at 40% for
64 × 64 but is just 20% for 256 × 256. DLP opportunities behave similarly,
being 30% for 64 × 64 and 70% for 256 × 256. This shows that opportunities
to exploit parallelism change among different applications and also vary significantly within a single
application as the size and operand granularity change.
Fig. 19 ILP and DLP runtime opportunities of MultMatrix varying size and data granularity
5 Experimental results
5.1 Methodology
This subsection shows the methodology used to gather the performance, area, and energy
results of the proposed approach.
Table 1 summarizes the system setups used to compare with HMTBT:
– ARMv7 Single Issue is the baseline system.
– The ARMv7 Superscalar + Static DLP Hand Coded is a 4-issue out-of-order processor
running applications vectorized by hand using the NEON intrinsics library. In this setup, ILP
is transparently exploited while DLP exploitation requires additional programming time.
– The ARMv7 Superscalar + Static DLP Compiler is a 4-issue out-of-order processor
running applications vectorized by the GCC compiler. In this setup, ILP is transparently
exploited while DLP exploitation depends on source code recompilation.
– The ARMv7+CGRA setup is used to explore the potential of transparent ILP exploitation
by using the binary translator proposed in [1].
– The ARMv7+NEON setup uses the binary translator of [11], aiming at exploiting only DLP in a
transparent fashion.
– The ARMv7+CGRA+NEON Decoupled setup has both binary translators, but working
separately. This setup aims at showing both ILP and DLP exploitation without cooperative
– ARMv7 HMTBT is evaluated over two setups, which cover the OPC and History Table
dispatching approaches.
As can be noticed in Tables 1 and 2, for the sake of comparison, the operating frequency
and the size of the cache memories are the same for all setups. We have implemented all setups in the
GEM5 simulator to gather performance results. Energy and area results were obtained
from logic synthesis using Cadence tools with a 15 nm design kit [14]. The CGRA used in
all setups has two parallel arithmetic and logic units per column, three LD/ST units, one
multiplier, and two floating-point units, with a total of six levels in a row. The NEON engine
can issue two instructions in order per cycle, each operating on 128-bit
wide data.
Table 1 Simulation setups of the compared systems
System setup
ARMv7 Single Issue
ARMv7 Superscalar
Clock Freq.
Cache L1, L2
64Kb, 512Kb
64Kb, 512Kb
64Kb, 512Kb
64Kb, 512Kb
ILP/DLP exploit.
Translation Cache
1Kb 4-way
3Kb 4-way
Table 2 Simulation setup of the implemented systems
System setup
ARMv7 Decoupled
Clock Freq.
Cache L1, L2
64Kb, 512Kb
64Kb, 512Kb
64Kb, 512Kb
ILP/DLP exploit.
Translation Cache
1Kb+3Kb 4-way
4Kb 4-way
4Kb 4-way
History Table
32 entries
5.2 Performance
Figures 20 and 21 show the performance improvements of all setups over the ARMv7 Single
Issue processor. These figures follow the same sorting methodology as Figs. 15 and 16:
the benchmarks of Fig. 20 have fewer DLP opportunities, and those in Fig. 21 have
more. Although the ARMv7 Superscalar setups show performance
improvements over the Single Issue processor, they rely on human intervention to extract data-level
parallelism, either by modifying the code with NEON intrinsics (Hand Coded) or by recompiling the
code with the GCC compiler. Besides increasing programming time, the performance gains of
the ARMv7 Superscalar + Static DLP Hand Coded setup are limited since some loop types
(such as dynamic range and sentinel loops) are not vectorized with NEON intrinsics. On average,
the ARMv7 Superscalar + Static DLP Hand Coded setup achieves a speedup of 2.6 times
over the ARMv7 Single Issue. The ARMv7 Superscalar + Static DLP Compiler setup behaves
similarly to static DLP exploitation with NEON intrinsics since the compiler is not able
to vectorize loops due to the lack of execution time information. On average, this setup
outperforms the Single Issue processor by 2.4 times.
By exploiting only instruction-level parallelism, the ARMv7 CGRA setup outperforms
both ARMv7 Superscalar setups with static DLP exploitation in six benchmarks.
The main source of performance improvement of the ARMv7 CGRA setup is its capability
to accelerate data-dependent instructions. However, when a high degree of data-level parallelism is available in the application code (rightmost applications of Fig. 21), the NEON
engine coupled to the Superscalar processors boosts performance. On average, exploiting both kinds of parallelism in the Superscalar processors outperforms the standalone ILP
exploitation of ARMv7 CGRA by 25%. Similar behavior is seen when both ARMv7 Superscalar processors are compared to the ARMv7 NEON setup, where only DLP is exploited. The
ARMv7 NEON setup outperforms both ARMv7 Superscalar setups with static DLP exploitation in
applications where DLP is largely present, but it fails in applications where DLP is
not available (leftmost applications in Fig. 20). On average, both ARMv7 Superscalar setups
outperform ARMv7 NEON by 92%. In summary, even having a CGRA
Fig. 20 Performance improvements of the systems over ARMv7 Processor
capable of outperforming superscalar processors with static DLP exploitation in applications
with massive ILP, exploiting both kinds of parallelism is mandatory to achieve balanced performance over applications from heterogeneous domains that contain varied DLP and ILP
Performance improvements over the ARMv7 Superscalar processors with static
DLP exploitation are obtained when ILP and DLP are dynamically exploited by ARMv7 Decoupled (the setup
that couples both binary translators). Such improvements are more evident when higher DLP
is available (rightmost applications in Fig. 21) since the binary translator can accelerate
loops that are not vectorized at programming or compile time. On average, the ARMv7
Decoupled outperforms both ARMv7 Superscalar setups by 17%.
Unlike ARMv7 Decoupled, the binary translators work cooperatively in the proposed
ARMv7 HMTBT, selecting the most well-suited accelerator for each application portion. The
proposed ARMv7 HMTBT OPC outperforms ARMv7 Decoupled and Superscalar processors
in all applications but Bitcount and EPIC. On average, 8% of performance improvements are
shown over the decoupled system.
The ARMv7 HMTBT Table, in turn, improves on the prediction accuracy of the OPC-based HMTBT,
outperforming Superscalar in all applications but EPIC. EPIC has a specific behavior
under CGRA acceleration: as its basic blocks are too small, most configurations have few
instructions, so the ILP gains are diluted by the reconfiguration time of the CGRA. On average,
the HMTBT Table outperforms the superscalar setups and the ARMv7 Decoupled system by
29% and 10%, respectively.
The Oracle bars represent the potential of HMTBT parallelism exploitation, meaning that the
code portions are always dispatched to the accelerator that provides the higher performance
improvement. As can be seen, both the HMTBT OPC and Table dispatching methodologies
achieve near-optimal performance. On average, the HMTBT OPC and Table dispatching
methods lose just 2.8% and 1.2% of performance relative to the Oracle, respectively.
5.3 Energy
Table 3 shows the power consumption of individual hardware blocks, and Figs. 22 and 23 show the
energy savings of all setups over the ARMv7 Single Issue.
Fig. 21 Performance improvements of the systems over ARMv7 Processor
Table 3 Power consumption from the components of the systems
Power (mW)
ARMv7 Single Issue
Processor Widths=1
ARMv7 Single Issue + NEON
Processor Widths=1 and 128-bit NEON
ARMv7 Superscalar + NEON
Processor Widths=4 and 128-bit NEON
2ALU-3MEM-1MUL-2FP (6levels)
CGRA Binary Translator
ARM NEON Binary Translator
Proposed Binary Translator
Translation Cache
4 Kb 4-way
Verification Cache
1 Kb 4-way
History Table
32 entries Fully Associative
OPC Divisor
In Table 3, the power overhead of the 4-issue superscalar is higher than the performance
improvements shown in most applications (Figs. 20, 21). Thus, despite having static DLP
exploitation, the ARMv7 superscalar setups consume more energy in all applications. Considering the setups with single parallelism exploitation, both ARMv7 CGRA and NEON show
unbalanced gains. For high-DLP applications, the ARMv7 NEON shows large energy savings.
For instance, Susan Smoothing presents the largest energy savings among all setups. On
the other hand, the ARMv7 CGRA achieves higher energy savings in applications where ILP
is dominant, such as EPIC, where it is the setup with the most significant energy savings.
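The relation between power overhead and performance gain can be made concrete with a small sketch. It assumes energy is average power times execution time (E = P * t); the function name `energy_savings` and the numeric ratios are hypothetical and are not taken from Table 3 or the figures.

```python
def energy_savings(power_ratio: float, speedup: float) -> float:
    """Energy saved relative to a baseline, assuming E = P * t.

    power_ratio: average power of the setup divided by the baseline power.
    speedup: baseline execution time divided by the setup's execution time.
    A negative result means the setup consumes more energy than the baseline.
    """
    if power_ratio <= 0 or speedup <= 0:
        raise ValueError("both ratios must be positive")
    return 1.0 - power_ratio / speedup


# Hypothetical values: 3x the power for a 2x speedup costs energy ...
wide_core = energy_savings(power_ratio=3.0, speedup=2.0)
# ... while 1.5x the power for the same 2x speedup saves it.
accelerator = energy_savings(power_ratio=1.5, speedup=2.0)
```

Under this simplification, a setup saves energy only when its speedup exceeds its power ratio, which is the condition the superscalar setups fail to meet in the discussion above.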
Considering the HMTBT setups, the energy savings are more balanced across all
applications. Both HMTBT dispatching methods save more energy than the ARMv7 decoupled in all applications. On average, the energy savings provided by the cooperative work
of the binary translators of the HMTBT Table are 24% over the Decoupled setup. Moreover,
although the setups composed of single parallelism exploitation provide higher energy savings than HMTBT in Blowfish, RBG, Bitcount, and Susan Smoothing, such applications are
the corner cases of parallelism availability (they are on the leftmost side of Fig. 15 and on
Fig. 22 Energy Savings of the systems over ARMv7 Processor
Fig. 23 Energy Savings of the systems over ARMv7 Processor
the rightmost side of Fig. 16), which do not represent the whole workload of a current
embedded device, where applications are very heterogeneous. Considering all applications,
which mimic such heterogeneity, the average energy savings of the HMTBT over
the ARMv7 Superscalar, CGRA, and NEON are 54%, 14%, and 18%, with performance
improvements of 29%, 44%, and 99%, respectively (Fig. 22).
5.4 Area
Table 4 shows the area occupied by the individual hardware blocks that compose the evaluated
setups. It is important to notice that the HMTBT coupled to an ARMv7 single-issue processor,
performing dynamic ILP and DLP exploitation with a single binary translator, occupies
33% less chip area than an OoO ARMv7 superscalar processor with static DLP exploitation.
In summary, besides keeping binary compatibility and not affecting software productivity,
the HMTBT outperforms the OoO superscalar processor by 29% with 54% energy savings,
while occupying 33% less chip area (Fig. 23).
Table 4 Area on chip from the components of the systems
Area (µm²)
ARMv7 Single Issue: Processor Widths=1
ARMv7 Single Issue + NEON: Processor Widths=1 and 128-bit NEON
ARMv7 Superscalar + NEON: Processor Widths=4 and 128-bit NEON
2ALU-3MEM-1MUL-2FP (6levels)
CGRA Binary Translator
ARM NEON Binary Translator
Proposed Binary Translator
Translation Cache: 4 Kb 4-way
Verification Cache: 1 Kb 4-way
History Table: 32 entries Fully Associative
OPC Divisor
6 Conclusions
This work proposes a Hybrid Multi-Target Binary Translator (HMTBT) that transparently
supports code optimization on accelerators that exploit different parallelism degrees. Such
exploitation is important since instruction and data-level parallelism opportunities can vary widely among embedded applications, and even within the same application when
the data input or data type changes. The proposed BT detects the DLP and ILP of the
application hotspots at runtime and issues them to the best-suited accelerator to maximize performance per watt. HMTBT proves to be more efficient in terms of performance and
energy than an OoO superscalar processor coupled to an ARM NEON engine. Furthermore,
the proposed approach improves performance and energy over decoupled binary translators
using the same accelerator with the same ILP and DLP capabilities.
