Design Automation for Embedded Systems (2022) 26:55–82 https://doi.org/10.1007/s10617-021-09258-6 An energy efficient multi-target binary translator for instruction and data level parallelism exploitation Tiago Knorst1 · Julio Vicenzi2 · Michael G. Jordan1 · Jonathan H. de Almeida2 · Guilherme Korol1 · Antonio C. S. Beck1 · Mateus B. Rutzig2 Received: 14 October 2020 / Accepted: 13 November 2021 / Published online: 14 January 2022 © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021 Abstract Embedded devices are omnipresent in our daily routine, from smartphones to home appliances, that run data and control-oriented applications. To maximize the energy-performance tradeoff, data and instruction-level parallelism are exploited by using superscalar and specific accelerators. However, as such devices have severe time-to-market, binary compatibility should be maintained to avoid recurrent engineering, which is not considered in current embedded processors. This work visited a set of embedded applications showing the need for concurrent ILP and DLP exploitation. For that, we propose a Hybrid Multi-Target Binary Translator (HMTBT) to transparently exploit ILP and DLP by using a CGRA and ARM NEON engine as targeted accelerators. Results show that HMTBT transparently achieves 24% performance improvements and 54% energy savings over an OoO superscalar processor coupled to an ARM NEON engine. The proposed approach improves performance and energy in 10%, 24% over decoupled binary translators using the same accelerator with the same ILP and DLP capabilities. Keywords CGRA · ARM NEON · ILP · DLP · Binary Translator 1 Introduction From simple home appliances to complex smartphones, embedded devices have been omnipresent in our daily routine. Since user experience is fundamental, performance is a consensus. However, as most of such devices are battery-powered, designers also must be committed to severe power constraints. This study was financed in part by: CNPq; FAPERGS/CNPq 11/2014 - PRONEM; and the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. B Mateus B. Rutzig mateus@inf.ufsm.br 1 Institute of Informatics, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre, Brazil 2 Electronics and Computing Department, Universidade Federal de Santa Maria (UFSM), Santa Maria, Brazil 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 56 T. Knorst, et al. Fig. 1 Application Execution considering HMTBT system Traditionally, application parallelism is exploited to leverage performance bringing energy benefits. However, these benefits are limited when single parallelism exploitation is employed in mobile applications as they present distinct behaviors, from data to control-oriented applications, such as multimedia and communication protocol, respectively. Even in embedded devices, instruction-level parallelism is aggressively exploited by using out-of-order superscalar processors, such as the ARM Cortex Family. Due to multimedia application dominance in smartphones, single instruction multiple data engines have been coupled to superscalar processors to exploit data-level parallelism, such as ARM NEON. However, these parallelism exploitation paradigms diverge in one aspect: binary compatibility. For decades, superscalar processors take advantage of Moore‘s Law to exploit instruction-level parallelism to keep software compatibility dynamically. 
SIMD engines, coupled to superscalar processors, rely on specific libraries or compilers to be triggered, breaking the primary purpose of such processors, the legacy code. Binary translators have been proposed as means to provide code portability among different architectures. Moreover, the translation process can be used to exploit parallelism, which, despite offering binary compatibility, can achieve performance improvements. Several binary translators have been proposed to exploit instruction [1] or data level parallelism [11] dynamically, leveraging the performance of specific architectures. However, none of them exploits instruction and data-level parallelism concurrently in a system comprised of multiple accelerators, which is the ideal system to take advantage of the varying parallelism behavior found in the applications present in current embedded devices. Moreover, such runtime binary translator should transparently detect the parallelism and orchestrate the triggering of different accelerators, maximizing the energy-performance trade-off and keeping the main idea of superscalar processors, the binary compatibility. This work proposes a Hybrid Multi-Target Binary Translator (HMTBT) that transparently supports code optimization on accelerators that exploit different parallelism degrees. As depicted in Fig. 1, HMTBT detects at runtime instruction and data-level parallelism of the application hotspots and smartly issues them to the most well-suited accelerator aiming at maximizing performance per watt. The HMTBT supports code optimization to a CGRA [1], which exploits instruction-level parallelism, and to ARM NEON engine, which is responsible for exploiting data-level parallelism. Our system is coupled to an in-order ARMv7, which executes code regions that are not suitable for optimization. On average, performance improvements of 44% and 99% and energy savings of 14% and 18% were shown by HMTBT over two systems: one with a binary translator targeting the CGRA [1] and other the ARM NEON. Moreover, the HMTBT offers 24% energy savings with 10% more performance than a unique system composed of both binary translators generating code for CGRA and NEON [11]. Finally, results show that HMTBT outperforms an OoO superscalar processor in 24% spending 54% less energy. Summarizing, the contributions of this work are the following: • we show that the opportunities for ILP and DLP exploitation can highly vary among embedded applications and even for the same application when data input or data type 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. An energy efficient multi-target binary translator... 57 changing, indicating the need for dynamic scheduling of code portions to achieve the best of ILP and DLP exploitation; • we propose a dynamic adaptive multi-target binary translator that detects the data and instruction-level parallelism degree of application hotpots and automatically translates and schedules code portions to the most well-suited accelerator. 
In this work, we propose two scheduling methods: one based on a history table that holds the performance improvement of every code portion for each accelerator; other based on the operation per cycle metric resulted by the parallelism exploitation of both accelerators over every application code portion; • shows that the proposed adaptive binary translator outperforms a system coupled to: single parallelism exploitation, mutual parallelism exploitation without adaptive code scheduling; superscalar processor with SIMD engine. 2 Related work This section aims at showing binary translator proposals that transform native code to explore instruction or data-level parallelism. 2.1 Binary translators for ILP Instruction-level parallelism is a traditional approach to accelerate applications by executing instructions in parallel. Binary translators are commonly employed to translate non-parallel code to specific accelerators aiming at exploiting instruction-level parallelism [18]. This subsection presents static and dynamic binary translators that modify the native code to be accelerated in specific hardware to exploit instruction-level parallelism. Warp processing [21] was one of the first to propose binary translator to accelerate code in a simplified FPGA. A microprocessor is responsible for executing the program to be accelerated, while another microprocessor concurrently runs a stripped-down CAD algorithm that monitors the instructions to detect critical regions. The CAD software decompiles the instruction flow to a control flow graph, synthesizes it, and maps the circuit into the FPGA structure. Configurable Compute Array (CCA) [5] is a coarse-grained reconfigurable array coupled to an ARM processor. The process to trigger the CCA is divided into two steps: discovering which subgraphs are suitable for running on the CCA, and their replacement by microops in the instruction stream. Two alternative approaches are presented: a static approach, which finds subgraphs for the CCA at compile-time, changing the original ARM code by inserting CCA instructions; and a dynamic approach, which assumes the use of a trace cache to perform subgraph discovering and CCA instruction building at runtime. Hybrid DBT [19], a hardware-based dynamic binary translator, is proposed to translate RISC-V ISA to a VLIW-based multicore system. Hybrid DBT is composed of a set of inorder, out-of-order, and VLIW cores. While out-of-order cores are best for control-dominated kernels, VLIW cores perform better in compute-intensive kernels. The translation process, which supports hardware accelerators, is composed of three steps: the first step translates RISC-V to VLIW instructions; the second and third steps build instruction blocks and perform inter-block optimizations, such as loop unrolling and trace building. The scoreboard scheduling algorithm is used to schedule VLIW instructions over a dual-core VLIW processor to explore instruction-level parallelism aggressively. 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 58 T. Knorst, et al. Park et al. [17] propose a runtime binary translator that unrolls loops to avoid redundant branch instructions, which increases the instruction-level parallelism exploitation. Such unrolling employs machine learning techniques to predict loops that are suitable for unrolling. Results show high prediction accuracy that significantly reduces the number of dynamic instructions by running applications compiled to X86 ISA. 
The DIM (Dynamic Instruction Merging) [1] is a binary translator capable of detecting application hotspots to execute in a CGRA. The CGRA is conceived as a replication of arithmetic and logic functional units, multipliers, and memory access units. DIM’s binary translation explores application hotspots with high ILP to offload their execution to the CGRA. Since it is entirely implemented in hardware, it does not rely on source code modifications, which provides the original ISA’s binary compatibility. [2–4,13,20] extend the DIM binary translator to improve performance and energy savings but still focusing only on transparent ILP exploitation. DORA [22] implements dynamic translation, which utilizes optimizations such as loop unrolling, accumulator identification, store forwarding, and loop deepening to accelerate code in a CGRA. The system can achieve substantial power and performance improvements over DySer [10], which uses a static binary translator targeting the same CGRA setup. Fajardo et al. [7] proposes a flow to accelerate multi-ISA application into a unique CGRA through a runtime two-level binary translation. The first BT level is responsible for translating the native ISA to a standard code. In contrast, the second BT level accelerates the common code in heterogeneous accelerators, such as DSP, VLIW, and CGRA. The proposed two-level BT simplifies the support for additional ISAs since just the first BT level must be modified to generate the standard code, being transparent to the second BT to accelerate the common code into the CGRA. As a case study, the proposed approach shows performance improvements by accelerating x86 code into a CGRA. 2.2 Binary translators for DLP SIMD engines are widely used to run data-intensive applications, such as multimedia, games, and digital filter. SIMD engines’ triggering commonly relies on specific libraries or compiler techniques, such as ARM NEON and Intel AVX architectures. However, this subsection only presents approaches that, somehow, transform the native code to SIMD instructions at runtime. DSA [11] is a runtime binary translator aiming to exploit data-level parallelism of application loops to execute in a SIMD engine. Software productivity is improved by using DSA since no code modification is needed to trigger the SIMD engine. DSA has access to information at runtime that is not available at programming and compile-time. It has shown higher performance improvements than static DLP exploitation (hand coded using ARM NEON library and GCC compiler). In [8], it is proposed a retargetable software-based dynamic binary translator that supports translation from SIMD to SIMD engines (e.g., ARM NEON to AVX32). The proposal’s main claim is that some SIMD engines provide a limited number of functional units and registers that restrain the maximum data-level parallelism exploitation in the applications. The binary translator is implemented on Hybrid QEMU that gives support to register and functional units mapping from the guest to host SIMD ISA. The proposed framework shows performance improvements by translating SIMD instructions from ARMv7, ARMv8, and IA32 to the X86-AVX2 host ISA. 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. An energy efficient multi-target binary translator... 59 Selftrans [15] is proposed aiming at automatic vectorizing binary code at runtime to increase the utilization ratio of SIMD units. 
It extracts parallelism from binary x86 machine code without requiring its source code and translates it into a binary code that utilizes SIMD units. In [23] is presented an approach that exploits thread, memory, and data-level parallelism through a same-ISA dynamic binary modifier guided by static binary analysis. The Janus framework uses a static code analysis combined with a dynamic code modification approach to transform scalar to SIMD instructions from the application code automatically. Considering just the automatic vectorization, Janus shows significant performance improvements in running the TSVC benchmarks. Vapor SIMD [16] proposes a just-in-time (JIT) compilation solution for different SIMD vendors, such as Intel SSE and ARM NEON. The Vector SIMD can combine static and dynamic infrastructure to support vectorization. The process is divided into two steps: first, a generic SIMD code is generated; second, the target ISA assembly is generated. They demonstrate the effectiveness of the framework by vectorizing a set of kernels that exercise the innermost loop and outer loop. 2.3 Our contributions All aforementioned works provide, somehow, transparent software optimization by using a binary translator to exploit parallelism at instruction [12] or data-level [11] or even to take advantage of intrinsic characteristics of multiple ISA multicore systems [6,9]. However, none of them support binary translation to multi-parallelism exploitation, which is mandatory to accelerate the variety of application behaviors running in current mobile devices. Our HMTBT is a multi-target dynamic and transparent binary translator that, unlike all aforementioned works, supports software optimization through accelerators that exploit different levels of parallelism. At runtime, the Hybrid Multi-Target Binary Translator (HMTBT) concurrently detects instruction and data-level parallelism of application hotspots and schedules them to the most well-suited accelerator aiming at maximizing performance per watt. 3 HMTBT The block diagram of the entire system is shown in Fig. 2. The Hybrid Multi-Target Binary Translator (HMTBT) is connected to the ARMv7 core, which supports the execution of nonaccelerated code regions. The HMTBT translates ARMv7 instructions to CGRA and NEON engine to exploit instruction and data-level parallelism, respectively. The translation scope of HMTBT for CGRA is application basic blocks, while for NEON is loop statements. A code portion translated for CGRA is a configuration, while for the NEON engine is a sequence of NEON SIMD instructions. From now on, both are generically named as accelerated word. A Translation Cache is responsible for storing accelerated words. It has the structure of a regular cache memory, with block size defined by the largest one (configuration or NEON SIMD instruction), while indexation and tagging uses the memory address of the first translated ARMv7 instruction. The HMTBT is connected to the ARMv7 register file to gather general-purpose registers’ content containing loop iteration indexes and stop conditions of loops. The data memory addresses of operands are also collected from the register file used in the cross iteration detection process for NEON translations. The CGRA is connected to the register file and 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 60 T. Knorst, et al. Fig. 
2 System overview - HMTBT coupled to the ARMv7 core with ARM NEON and CGRA accelerators data cache to access the operands to trigger the execution of configurations sent by the HMTBT. The ARM NEON engine accesses operands only via data cache through specific load and store NEON instructions. The Current Program Status Register (CPSR) is necessary to keep the HMTBT aware of status flags (Negative/Less than/Zero) and the ARM core’s execution mode since the NEON engine cannot be triggered when the core is running on privileged mode. The program counter (PC) content is used to track the loop execution for NEON translations and to index accelerated words in the Translation Cache for both accelerators. The HMTBT writes the address of the ARMv7 instruction in the PC to return the execution to ARM core at the application point after the code portion accelerated by CGRA or NEON. The content of instruction fetched from the ARM core is accessible via the instruction register, so the HMTBT decodes instructions to gather the operation and operands involved. Also, the HMTBT controls the ARM core’s execution through a stall signal, responsible for freezing the ARM pipeline when the NEON engine or the CGRA are running. Aiming at keeping binary compatibility with applications compiled with VFP/ARM NEON instructions, the NEON engine can be triggered by both ARM Core and HMTBT being controlled by a multiplexer that switches the dispatching instruction from the HMTBT and the ARM Core. However, when the HMTBT dispatches a NEON SIMD instruction to the NEON engine, it flushes the regular ARM instructions and stalls the ARM pipeline. 3.1 NEON hardware The NEON engine is composed of an in-order 10-stage superscalar pipeline capable of issuing two instructions per cycle. It supports the execution of ARM VFPv3 scalar instructions compliant with single and double-precision floating-point arithmetic. As shown in Fig. 3, a NEON instruction operate over 128 bits data width, which can be configured as: – 16 signed or unsigned 8 bit integers – 8 signed or unsigned 16 bit integers – 4 signed or unsigned 32 bit integers or single precision floating point data 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. An energy efficient multi-target binary translator... 61 Fig. 3 ARM NEON Engine - Instruction to register data packing – 2 double precision floating point data The NEON register file is composed of 32 64-bit registers, that can be viewed as: – 128-bit registers Q0 to Q15 – 64-bit registers D0 to D31 – 32-bit floating point registers S0 to S31 (used by VFP instructions) 3.2 CGRA hardware Figure 4 shows the datapath of CGRA that is composed of arithmetic and logic units, multipliers, and memory access functional units divided into rows and columns. The datapath is totally combinational, the FUs of neighboring columns are interconnected by multiplexers which creates a unique path from the leftmost to the rightmost column. Operations allocated in the same column perform parallel execution, while those placed in different columns perform data-dependent execution. The FUs are virtually grouped in levels, representing the propagation delay of the ARM core’s critical path. The input and output context are connected to the ARM core register file working as a buffer to hold the source operands and results of the operations, respectively. The CGRA execution is divided into three steps: reconfiguration/operand fetch, execution, write back. 
The reconfiguration/operand fetch step is triggered when an accelerated word is found in the Translation Cache. In that moment, the configuration bits are fetched from the Translation Cache being sent to the CGRA datapath to configure the FUs operations and multiplexers. In parallel, the input context is fed from operands that are fetched from the register file. The execution step starts when all operands have been stored in the input context. When the propagation delay of all levels has been finished, the write-back step starts. Finally, after all the operands have been written back in the register file, the return address is written in the ARM core program counter, which restarts the execution. The CGRA hardware supports basic block speculation, meaning that more than one basic block can be coupled in a single CGRA configuration. The HMTBT builds speculative configurations following the path of the branch of the detection phase (that will be detailed in Sect. 3.4.3). If a misspeculation occurs, the CGRA hardware just writes back the basic blocks’ results, which have been speculated properly. 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 62 T. Knorst, et al. Fig. 4 CGRA Data Path NEON NEON Loop Detection ARMv7 Instruction Decode Traslation Cache CGRA ARMv7 CORE SIMD Build Dependency Analysis CGRA Mapping Dispatcher History Table CGRA Configuration Build Fig. 5 HMTBT Pipeline Stages 3.3 HMTBT pipeline The HMTBT is composed of two operation modes: – Detection Mode: while the ARMv7 processor is executing instructions for the first time, the HMTBT, concurrently, monitors code portions and generates SIMD Instructions and CGRA Configurations; – Execution Mode: the HMTBT triggers either the NEON engine with the SIMD Instructions or the CGRA with the CGRA Configuration depending on which accelerator provides higher performance improvements for such code portion. Figure 5 depicts the pipeline stages of HMTBT hardware. Stages colored in black work over code regions for both CGRA and NEON; stages colored in white involve building configurations for the CGRA; and stages colored in gray are responsible for generating SIMD instructions for NEON. The first stage, ARMv7 Instruction Decode/Accelerated Word Issue, works on both detection and execution mode. It receives the ARMv7 instruction fetched by the ARMv7 processor and its memory address. The Translation Cache is accessed with the instruction’s memory address to identify if such code portion has been already translated. If a Cache Hit happens, the execution mode is triggered. In this mode, the accelerated word is fetched from the translation cache and issued to the accelerator responsible for executing such code portion. If a Translation Cache Miss occurs, the detection mode starts since such code portion has not been translated yet, the instruction is decoded, and the second stage of HMTBT is called. After the ARMv7 Instruction Decode, the HMTBT pipeline forks in two specialized pipelines: the CGRA branch aims at exploiting ILP by producing a configuration for the 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. An energy efficient multi-target binary translator... Branch Detection State Data Collection State 63 Cross Dependecy Analysis State END Sentinel and Conditional Loops Mapping State Speculative Execution State Fig. 6 NEON Translation FSM Stages CGRA; while the NEON branch is responsible for exploiting DLP by generating NEON instructions. 
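To summarize the control flow of this first stage, the following C sketch models the decision between detection and execution mode. All names and data structures here are hypothetical illustrations of the behavior described above, not the actual hardware interfaces of HMTBT.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the first HMTBT pipeline stage (Sect. 3.3):
 * a Translation Cache hit triggers execution mode, a miss triggers
 * detection mode, forking into the CGRA and NEON pipeline branches. */

typedef enum { TARGET_CGRA, TARGET_NEON } target_t;

typedef struct {
    bool     valid;
    uint32_t tag;      /* address of the first translated ARMv7 instruction */
    target_t target;   /* accelerator that owns this accelerated word       */
} accel_word_t;

#define TCACHE_ENTRIES 64
static accel_word_t tcache[TCACHE_ENTRIES];   /* simplified, direct-mapped */

static accel_word_t *tcache_lookup(uint32_t pc)
{
    accel_word_t *w = &tcache[(pc >> 2) % TCACHE_ENTRIES];
    return (w->valid && w->tag == pc) ? w : NULL;
}

/* Stubs standing in for the accelerators and the two detection branches. */
static void dispatch(const accel_word_t *w)
{
    printf("issue to %s\n", w->target == TARGET_CGRA ? "CGRA" : "NEON");
}
static void cgra_detect(uint32_t pc, uint32_t insn)
{
    (void)pc; (void)insn;   /* Dependency Analysis, Mapping, Config. Build */
}
static void neon_detect(uint32_t pc, uint32_t insn)
{
    (void)pc; (void)insn;   /* Loop Detection FSM, SIMD Build */
}

void hmtbt_first_stage(uint32_t pc, uint32_t insn)
{
    accel_word_t *word = tcache_lookup(pc);
    if (word) {
        dispatch(word);          /* execution mode: reuse the accelerated word */
    } else {
        cgra_detect(pc, insn);   /* detection mode: both branches work         */
        neon_detect(pc, insn);   /* concurrently on the decoded instruction    */
    }
}
```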
3.3.1 CGRA translation stages The CGRA branch is composed of Dependency Analysis, Mapping and Configuration Build stages. All stages were adapted from [1] to be coupled to the HMTBT pipeline. Dependency Analysis Data Dependency Analysis stage verifies data hazard among instructions by comparing the source operands of current instruction with destination operands of previously analyzed instructions. The output of this stage is the number of the CGRA’s leftmost column where the current ARMv7 instruction can be placed. Such a placement respects the data dependencies of the previously placed instructions in the current configuration. Mapping The Mapping stage verifies if there is an available functional unit to place the current ARMv7 instruction considering the number of columns received from the Dependency Analysis stage. If there is no available functional unit in such a column, the mapping stage seeks in the next column to the right until finding an available FUs. If there are no FUs to place the Armv7 instruction, the current configuration is closed, and this instruction will be placed in the first column of a new configuration. Configuration Build During the Configuration Build stage, the control bits for FUs operation and interconnection are generated, resulting in a bitmap stream (configuration) that reflects the datapath behavior for the execution of the translated code portion. The configuration bits are sent to the next pipeline stage (Dispatcher). 3.3.2 NEON translation stages The NEON branch is composed of the following stages: Loop Detection The loop detection stage is capable of detecting and vectorizing the following types of loops: – Count loops: loop that performs a fixed number of iterations that can be determined at compile time. – Dynamic range loops: loop range is determined before the execution and can be modified at runtime on every iteration. – Sentinel loops: the loop range can only be determined during its execution. – Conditional loops: loop that contains conditional code within its body. A finite state machine, presented in [11], is employed to exploit data-level parallelism depending on the type of loop. As shown in Fig. 6, the FSM contains five states: Branch 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 64 T. Knorst, et al. Detection, Data Collection, Cross Dependency Analysis, Mapping, and Speculative Execution. Count loops and dynamic range are vectorized very similarly, requiring only the first three states. Sentinel and conditional loops use a speculative process that requires all FSM states to process the entire vectorization. The Branch Detection state is triggered at the end of the first loop iteration. It is responsible for detecting the instruction memory address range of the loop body by tracking the program counter register’s content. If a particular loop has been previously detected as vectorizable, such stage detects conditional code and function calls within the loop body. The Data Collection state is triggered during the second iteration of the loop. This state is responsible for evaluating the loop range, the vectorizable instructions, and their operands. It also stores the addresses of data memory accesses in the Verification Cache (VC), used for the Cross Dependency Analysis state to verify if the loop is vectorizable. Besides, such a state verifies the presence of a Sentinel Loop. The Cross Dependency Analysis State is triggered in the third loop iteration. 
This stage is responsible for analyzing cross-iteration dependencies (dependencies between two or more iterations of the same loop statement). From the memory-access point of view, a cross-iteration dependency exists when the same data memory address is accessed in different loop iterations. The DSA cross-iteration analysis starts in the second loop iteration, where the addresses of data memory accesses are saved in the Verification Cache. Even with the memory addresses stored in the VC and compared against the memory addresses issued on every iteration, one cannot rule out cross-iteration dependencies in future iterations. To cover this situation, we have implemented a Cross-Iteration Dependency Prediction. The equations below describe the prediction process, where MRead[2] and MRead[3] are the memory addresses accessed by a MemRead (load) instruction in the second and third loop iterations, respectively; MRead[lastIteration] is the memory address accessed by a load instruction in the last executed iteration (Eq. 4); x is the address interval between MRead[3] and MRead[lastIteration] (Eq. 1); MWrite[2] is the memory address accessed by a MemWrite (store) instruction in the second iteration (Eqs. 2 and 3); MRange is the memory address range between MRead[2] and MRead[3] (Eq. 5); CID means Cross-Iteration Dependency and NCID means No Cross-Iteration Dependency.

MRead[3] ≤ x ≤ MRead[lastIteration] (1)
MWrite[2] ∈ x → CID (2)
MWrite[2] ∉ x → NCID (3)
MRead[lastIteration] = MRead[2] + (MRange × (lastIteration − 2)) (4)
MRange = |MRead[3] − MRead[2]| (5)

The Mapping State is only activated for conditional loops. This state is responsible for: evaluating the loop range; mapping the executed conditional code statements; verifying cross-iteration dependencies between iterations within every condition; and assessing the vectorizable instructions and their operands within every condition. The Speculative Execution State is only enabled for conditional and sentinel loops. This state is responsible for: selecting the data generated during SIMD execution at the end of the loop; storing the current range for sentinel loops (Speculative Range); and storing the mapped conditions of the conditional loop in the Translation Cache for further executions.
SIMD Build The SIMD Build stage receives the set of operands and operations gathered during loop detection and generates the SIMD NEON instructions that are sent to the Dispatcher stage. As the NEON engine operates over vectors whose number of elements is a multiple of 2, 4, 8, or 16, leftover elements of non-multiple vectors must be handled separately.
Fig. 7 Leftovers example for a 32-bit vector with 11 elements. The first 8 elements can be processed using SIMD instructions, but the last 3 elements must be processed using single element instructions
As shown in the example of Fig. 7, the 32-bit vector contains 11 elements; the SIMD Build stage handles this by generating the maximum number of SIMD instructions to exploit DLP, in this case two NEON SIMD instructions covering the first eight elements. The three leftover elements are executed by single-element NEON instructions in the NEON engine, without data-level parallelism exploitation.
3.3.3 Dispatcher stage
The HMTBT pipeline merges again in the Dispatcher stage. It receives CGRA configurations or SIMD NEON instructions, dispatching them to the Translation Cache.
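As a side note to the NEON translation stages above, the cross-iteration dependency prediction of Eqs. (1)–(5) amounts to the following check, sketched here in C. The function and variable names are illustrative only; in HMTBT the check is performed in hardware over the addresses stored in the Verification Cache, and the sketch assumes a single load/store pair with a constant stride, as Eq. (4) does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the cross-iteration dependency prediction of Eqs. (1)-(5).
 * Assumes last_iteration >= 3 and a constant stride between loads. */
bool predicts_cross_iteration_dependency(uint32_t mread2,   /* MRead[2]  */
                                         uint32_t mread3,   /* MRead[3]  */
                                         uint32_t mwrite2,  /* MWrite[2] */
                                         uint32_t last_iteration)
{
    /* Eq. (5): stride between two consecutive load addresses. */
    uint32_t mrange = (mread3 > mread2) ? mread3 - mread2 : mread2 - mread3;

    /* Eq. (4): load address predicted for the last executed iteration. */
    uint32_t mread_last = mread2 + mrange * (last_iteration - 2);

    /* Eq. (1): x is the address interval [MRead[3], MRead[lastIteration]]. */
    uint32_t lo = (mread3 < mread_last) ? mread3 : mread_last;
    uint32_t hi = (mread3 < mread_last) ? mread_last : mread3;

    /* Eqs. (2)-(3): a second-iteration store falling inside x is predicted
     * as a cross-iteration dependency (CID); otherwise NCID. */
    return (mwrite2 >= lo && mwrite2 <= hi);
}
```

A store in the second iteration whose address falls inside the predicted read interval is conservatively treated as a cross-iteration dependency, which prevents the loop from being vectorized.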
Once both pipeline branches of the binary translator have optimized the same code portion (Mutual Regions), this stage is responsible for deciding which accelerated word should be dispatched to the Translation Cache. As mentioned above, the dispatcher can operate over Non-Mutual Accelerated Regions or Mutual Accelerated Regions manipulating them as follows: 1. Non-Mutual Regions: since HMTBT translated all basic blocks to execute in the CGRA, those code regions that are not loops are just covered by such accelerator. At detection time, the SIMD Build stage of the NEON pipeline branch informs the Dispatcher stage that such a region is not feasible for DLP exploitation. Thus, the accelerated word for those code regions is always stored in the Translation Cache as CGRA configurations. 2. Mutual Regions: are those code regions that can be covered by both accelerators, which are limited to the NEON engine acceleration scope (loops without cross iteration dependency). In this work, we have implemented two methods that decide which accelerated word of mutual regions will be stored in the Translation Cache. History Table Method The History Table Method decides which accelerator will be responsible for executing the mutual region based on the performance improvements. As shown in Fig. 8, this method uses the History Table, which holds on each row the number of cycles taken by both CGRA configuration and NEON instruction to run a certain mutual code portion. The Accelerated Word is indexed and tagged in the History Table by the memory address of the first ARMv7 instruction. Since the detection time to translate a basic block to a CGRA configuration is faster than SIMD NEON instructions building, the CGRA configuration is stored in the Translation Cache as soon as it is built by the HMTBT. This is performed to exploit the opportunities to accelerate such code portion in the CGRA while the NEON SIMD instructions are translated. When the NEON pipeline branch finishes the accelerated word for the same code portion, the dispatcher uses a 6-bit comparator to select which accelerated word will be stored in the Translation cache based on the number of cycles taken to run such code portion. In case the NEON instruction provides higher performance improvements than the CGRA configuration, the dispatcher stores the built NEON instruction in the Translation Cache, replacing the previously stored CGRA configuration. Otherwise, the dispatcher discards the NEON instructions. 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 66 T. Knorst, et al. Fig. 8 Dispatcher using a history table Fig. 9 Dispatcher using OPC calculus Operation per Cycle Method The Operation per Cycle (OPC) method performs the selection of which accelerator will execute the mutual region based on parallelism degree. For CGRA configuration, the OPC method measures the parallelism degree by having the number of operations allocated on each CGRA level. For the NEON accelerator, the operation per cycle is evaluated by taking the type of operands and the number of instructions generated by the SIMD build stage. Figure 9 shows the interconnection between the SIMD and Configuration Build and the Dispatcher stage. The OPC method uses a 6-bits comparator to select the accelerated word stored in the Translation Cache based on previous pipeline stage information. The SIMD build stage intrinsically has the parallelism degree since the type of operands is decoded in the Loop Detection stage to build the SIMD NEON instructions. 
For instance, if the operands involved in the vectorization process are 32-bit long, the resulted OPC will be four since the NEON engine is 128-bit long. For CGRA configurations, it is necessary to compute the OPC by using a 6-bit Fixed Point Divider. Towards that, the Dispatcher divides the total number of instructions allocated by the number of levels in the mutual region configuration to gather the average Operation per Cycle of the CGRA configuration. In case the NEON instruction provides higher OPC than the CGRA configuration, the dispatcher stores the built NEON instruction in the Translation Cache, replacing the previously stored CGRA configuration. Otherwise, the Dispatcher discards the NEON instructions. 3.4 Examples of HMTBT translations This section shows examples of mutual code regions translated by HMTBT for CGRA and NEON engine. In addition, the decision process of the Dispatcher stage is shown. 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. An energy efficient multi-target binary translator... 67 3.4.1 Count loops Figure 10 shows the C and the ARM assembly code of a count loop that performs a sum of two vectors of length 400. Below, the HMTBT detection will be detailed for both NEON and CGRA. NEON Detection Process The HMTBT detects the loop at the end of the first iteration by triggering the FSM branch detection state when the bne .count_loop instruction reaches the loop detection stage pipeline. This FSM state accesses the translation cache using the ARMv7 instruction address. If a cache miss happens, it means that this loop was not analyzed yet. Thus, the program counter of the branch instruction is registered, and the detection process starts. During the second iteration, the Data Collection FSM state works to identify the number of iterations by looking for the value of the increment/decrement of the index and the stop condition value. In this example, at line 2, r7 is incremented by the load instruction ldr r3, [r7, #4]! (increment has a value of 4), and the value of stop condition is held in the register r4. All memory addresses accessed by load and store instructions are stored into the Verification Cache to be further checked to detect cross iteration dependencies (r7 and r1 being load addresses and r2 being a store address). The type of operands is also stored on the verification cache to have the byte offset for the cross iteration dependency process. The Dependency Analysis State uses the address values stored during data collection to verify cross iteration dependencies. If it is found, the detection process is ended, and the loop address is sent to the Dispatcher Stage, informing that this code portion is not vectorizable (non-mutual code portion). As no dependency is found in the given example, the SIMD Build pipeline stage is triggered. During the SIMD Build stage, the HMTBT generates vectorized instructions and sends them to the Dispatcher stage, which will decide if they are stored in the Translation Cache. Since the number of iterations for a count loop is known before execution, the loop is unrolled, and the SIMD Built stage generates instructions of data processing and memory access, disregarding all branch and comparison instructions. The vst1.32 d16-d17, [r3, #4]! postincrement instruction is generated to increment the loop index register r3 by four. 
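For illustration, the vectorization that the SIMD Build stage derives for this count loop corresponds roughly to the hand-written NEON intrinsics below. This is only a C-level sketch: HMTBT emits the equivalent NEON instructions directly at the binary level, with no source-code changes.

```c
#include <stdint.h>
#include <arm_neon.h>

/* Hand-written NEON intrinsics equivalent of the vectorization that the
 * SIMD Build stage performs for the count loop of Fig. 10 (sum of two
 * 400-element 32-bit integer vectors). */
void vector_add_400(const int32_t *a, const int32_t *b, int32_t *c)
{
    for (int i = 0; i < 400; i += 4) {
        int32x4_t va = vld1q_s32(&a[i]);   /* vector load,  4 x 32-bit */
        int32x4_t vb = vld1q_s32(&b[i]);
        int32x4_t vc = vaddq_s32(va, vb);  /* vector add,   4 x 32-bit */
        vst1q_s32(&c[i], vc);              /* vector store, 4 x 32-bit */
    }
}
```

Since 400 is a multiple of four 32-bit elements, no leftover handling is needed in this case.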
CGRA Detection Process The CGRA configuration steps start at the data dependency analysis stage, inspecting the data dependency between the incoming instructions by comparing the current instruction’s source operand with the previously analyzed instructions. This stage sends to the mapping stage the number of the leftmost column of the CGRA in which this instruction can be allocated. In the mapping stage, the current instruction is allocated in a functional unit following the column’s number received of the data dependency stage. In this example, both load operations are allocated in parallel (column 1) while the add instruction is allocated in column 4, respecting the data dependency with the load instruction. Finally, the store instruction is allocated in column 7. The bne instruction is allocated in column 5 since it has data dependency with cmp instruction. The speculation process starts after the allocation of bne instruction, while the instructions of the second iteration arrive in the data dependency stage, they are allocated in the same configuration following the same allocation methodology of the first basic block as shown in Fig. 10. In the current implementation, the HMTBT supports the speculation of one basic block. If miss speculation happens, just the results of the instructions of the first basic block will be written back in the ARM core register file. Once all instructions of the second iteration is allocated, the Configuration Build stage generates the control bits for the FUs and interconnection, sending them to the Dispatcher stage. 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 68 T. Knorst, et al. Fig. 10 Count loop example and the allocation of CGRA’s functional units Dispatcher Decision Process Since each CGRA level takes one processor clock cycle, the execution of the CGRA configuration from Fig. 10, which contains two loop iterations, takes 6 cycles or 3 cycles per iteration. Supposing that the NEON pipeline is full, NEON execution requires 4 instructions and takes 4 cycles to process 4 loop iterations. Given that, the CGRA and NEON will store in the History Table the values 1200 and 400 cycles, which is the total number of cycles taken to execute the 400 iterations of the loop, respectively. For the OPC Method, as the operands are 32-bits long, the OPC is four for the NEON engine, while the average OPC for the CGRA is two (12 instructions divided by 6 levels). Considering both History Table and OPC methods, the NEON SIMD instructions will be stored in the Translation Cache since they provide less clock cycles and higher OPC than the CGRA configuration. 3.4.2 Dynamic range loops and leftovers Figure 11 shows an example of Dynamic Range loop, where the stop condition is unknown at compile-time, being determined by user input via scanf function call and stored in register r5. NEON Detection Process In the first iteration of the loop, the branch detection state detects the loop. During the second iteration execution, the HMTBT calculates the range by evaluating the contents of r2 and r5 since the latter register already holds the returned value from the scanf function. As the content of r5 is never modified, the remaining translation steps are similar to a count loop. However, differently from a count loop, even if already analyzed, the dynamic range loop must always pass through the data collection stage since the loop range can change on every execution. Supposing that the user input for the scanf function of the Fig. 
11 is 11 and each element a 32-bit integer. The HMTBT generates the maximum number of NEON SIMD instructions 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. An energy efficient multi-target binary translator... 69 Fig. 11 Dynamic range loop example with single element instructions for the loop range, being the leftover elements executed by single element instructions in the NEON engine. Each iteration will compute 4 elements at once, the HMTBT generates two instructions that will operate over the elements 0 to 3 and 4 to 7. Three leftover elements will be processed by three single element instructions in the NEON engine. In the example of Fig. 11, one can notice that the instruction vld1.64 d8-d9, [r2]! has the equivalent single element instruction vld1.8 d8[0], [r2]!. Where d8[0] represents the 32 lower bits of the 64-bit register, which can also be viewed as simply being s16. This way, dealing with the leftovers is only a matter of changing load and store instructions to operate over a single element. CGRA Detection Process The instruction allocation of the CGRA is the same presented for the count loop in Fig. 11 since the C code is equal. Dispatcher Decision Process Since both CGRA and NEON instructions are the same as the count loop presented in Fig. 11, the Dispatcher receives the same information and provides the same decisions. Thus, both History Table and OPC decide to store the NEON SIMD instructions to accelerate such loop statement. 3.4.3 Conditional loops with branch instructions Figure 12 shows an example of a conditional loop that uses branch instructions to control the program flow. As usual, the branch can take two paths, depending on condition A or B, both being accelerated by CGRA and NEON engine. NEON Detection Process The HMTBT is able to vectorize conditional loops by using two additional stages: the mapping stage, responsible for, at detection time, mapping the different conditional blocks, and the speculative stage, which is responsible for, at execution time, running both conditions and asserting which results will be committed. During the first loop iteration, at the Loop detection stage, the HMTBT is analyzing branch instructions that may occur within the loop range. If the HMTBT finds this condition, the Mapping state is triggered. During the second loop iteration, the Mapping stage is responsible for mapping each conditional block. In this example, only instruction 6 composes the condition block of condition A, while instruction 9 composes the condition B block. Since only one of these conditional blocks will be executed on each iteration, the HMTBT must wait for at least one execution of all conditions before concluding if the loop is vectorizable. The mapping process uses the cross iteration dependency approach to detect if the conditional loop is vectorizable. Once all conditional blocks have been mapped and marked as vectorizable, the SIMD Build stage 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 70 T. Knorst, et al. Fig. 12 Example of conditional branch loop and the allocation of CGRA’s functional units is called to build the NEON SIMD instructions for both Conditions A and B, as shown in Fig. 12. If a Translation Cache hit happens, meaning that such code portion will be accelerated in the NEON engine, the Speculative process is triggered. The HMTBT sends to the NEON engine the instructions of both conditions A and B, being their results stored into ARM NEON registers. 
The results from operations of both conditions of every iteration are stored in different registers. In Fig. 12, the results from condition A are stored in registers q0 to q3, and the results from condition B are stored in registers q4 to q7. During the speculative stage, data memory writes of the already computed NEON instructions are issued on demand, depending on the result of the cmp instruction that is executed by the ARM Core. The HMTBT checks the result of the CMP instruction of a certain iteration by looking at CPSR conditional flags. The registers containing the correct conditional block results are then stored into memory using single element store instruction. In this example, if the first iteration had condition B as true, the issued store instruction will be vst1.64 d4[0], [r0]!, where d4[0] contains the results of condition B. However, if condition A was true, the issued instruction will be vst1.64 d0[0], [r0]!, where d0[0] contains the results of condition A. The NEON registers that contain results of the wrong executed condition are marked as free to store speculative results of remaining iterations. CGRA Detection Process As mentioned before, the CGRA can couple two basic blocks within a configuration, one is the regular and the other speculative. Supposing that condition A is true for the first iteration of the loop of Fig. 12, instructions from 2 to 7 would be allocated in the configuration, being instructions from 2 to 5 representing the first basic block and 6 to 7 the second one. If condition B is true in the second loop iteration execution, a misspeculation happens, and the configuration will be deleted from the Translation Cache. Such code portion will be translated again in the third loop iteration with the basic blocks executed in such execution. Dispatcher Decision Process Supposing the best-case scenario, where condition A is translated in a configuration, and no misspeculation happens. The CGRA takes 2 cycles to execute each loop iteration, 128 cycles 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. An energy efficient multi-target binary translator... 71 Fig. 13 Conditional instruction loop example, similar to conditional branch code, the assembled code now uses conditional instructions for 64 iterations, the DSA execution takes the same 128 cycles to execute the same scenario. For that case, the HMTBT based on the History Table method elects the CGRA to execute such code portion since it consumes less energy, as will be detailed in the Experimental Results section. Considering the OPC method, the CGRA takes 2 levels to execute 8 instructions, having an OPC equal to 4. As such example considers an operand of 16-bits long, the OPC of the NEON engine is 8. Thus, the HMTBT based on OPC Method will select the NEON engine to execute such code portion. 3.4.4 Conditional loops with conditional instructions Figure 13 shows an example of a conditional loop that sums odd indexed elements and subtracts the even indexed elements. As can be seen, the generated ARMv7 code has two conditional instructions: addeq r2, r2, r0 and subne r2, r2, r0, and a test instruction tst r3, #1 to implement the conditional code. In this example, we have two distinct possible conditions, at lines 6 and 7, referred to as conditions A and B, respectively. NEON Detection Process During the first iteration (Branch Detection State), the HMTBT detects the conditional instructions. 
Unlike conditional branch code, regardless of the number of conditional statements in the loop, the HMTBT checks cross iteration dependencies in a single loop iteration since all instructions are fetched and decoded. Thus, the Mapping Stage is triggered and completed at the second iteration for the example of Fig. 13. At the execution time, the speculation process is the same as the conditional loop example with branches shown in the previous subsection. CGRA Detection Process Supposing that condition A is true in the first loop iteration, the CGRA configuration holds instructions from lines 2 to 6. The instruction allocation is the same presented for the conditional loop with branch instruction shown in Fig. 12. Dispatcher Decision Process The NEON engine takes 4 cycles to operate over 4 array elements, so the entire loop execution takes 100 cycles. The CGRA configuration takes 3 cycles to execute a loop iteration and 300 cycles to run the whole loop. Considering the OPC Method, the CGRA configuration has an OPC equal to one (five instructions divided by 3 levels) while the OPC of NEON is four since the operands are 32-bit long. Thus, both the History Table and OPC methods choose the NEON engine to execute such code portion. 123 Content courtesy of Springer Nature, terms of use apply. Rights reserved. 72 T. Knorst, et al. 3.4.5 Sentinel loops Figure 14 shows a sentinel loop where the stop condition is based on an operation placed within the loop body. NEON Detection Process In a sentinel loop, the number of iterations is unknown at compile and can change on every loop iteration at runtime. For this reason, the HMTBT uses a speculation execution method similar to conditional loops to execute an arbitrary number of iterations, just committing the results of the correct number of iterations when the loop range is solved. The HMTBT triggers the same states of dynamic range and count loops until the loop’s first three iterations. The register r3 receives the result of the multiplication (Line 4) that is used to be compared with constant 0 (Line 5). This implies that the stop condition can only be found during the loop execution. Once the Data collection stage and Dependency analysis are done, the loop is considered vectorizable. The SIMD build stage generates the vectorized instructions and sends them to the Translation Cache. If a Translation Cache hit happens, the SIMD NEON instructions are fetched from the Translation Cache, and the speculative process starts. The initial speculative range is set to 4 (the minimum number of elements to be processed considering 32-bit integers). Once the execution of the initial speculation value is done, the HMTBT generates a compare instruction and dispatches it to the ARM core. In this case, four different conditions must be checked by reading the content of the CPSR register. Once an instruction is asserted, the corresponding iteration results are stored back into memory by changing the register address of the store instruction (in this example, varying from d0[0] to d1[2]) and dispatching it to ARM NEON. This process is repeated until the stop condition is reached. If more iterations are processed than necessary, the unused results are not stored back into memory, and the loop execution is finished. CGRA Detection Process The CGRA allocation is shown in Fig. 14. In that case, two loop iterations is held in a single configuration. Dispatcher Decision Process The CGRA can execute a loop iteration in 3 cycles. 
In comparison, the DSA can process 4 elements in 3 cycles, plus 4 cycles to store the processed elements one at a time using single-element instructions, resulting in 7 cycles. Supposing that such a loop executes 4 iterations, the NEON takes 7 cycles while the CGRA takes 12 cycles. For the OPC method, the NEON has an OPC of four, since the operands are 32-bit long, while the CGRA has an OPC equal to two. Both methods select the NEON to execute such code portion.
Fig. 14 Sentinel loop example and the allocation of CGRA's functional units

4 DLP and ILP opportunities

The purpose of this section is to show that DLP and ILP vary among applications of different domains. In addition, even for a specific application, the data input and the type of operands can influence the degree of available parallelism. Applications belonging to the following domains were selected for this exploration:
– Image Processing
  – EPIC - image compression algorithm based on a bi-orthogonal critically sampled dyadic wavelet decomposition and a combined run-length/Huffman entropy coder;
  – RGB Grayscale - converts an image from RGB to grayscale format;
  – Susan Edges - iterates over each image pixel to find edges;
  – Susan Smoothing - a noise filter that smooths a central pixel by taking the average value of its neighboring pixels;
  – Gaussian Filter - linear filter used to blur an image;
  – JPEG - decoder for the JPEG image compression format;
  – MPEG - decoder for the MPEG video compression format;
  – H264 - video compression standard;
– Telecommunication and Cryptography
  – GSM - Global System for Mobile Communications protocol that uses a combination of Time- and Frequency-Division Multiple Access (TDMA/FDMA) to encode/decode data streams;
  – SHA - secure hash algorithm that produces a 160-bit message digest from a given input;
  – Blowfish - symmetric block cipher with a variable-length key;
– Kernels
  – Matrix Mult - multiplies two matrices of N elements;
  – Bitcount - counts the number of bits in an array;
  – FFT - Fast Fourier Transform;
  – StringSearch - searches for a given word in phrases.
Figures 15 and 16 show the opportunities to exploit DLP and ILP as fractions of the application execution time. The execution time that offers neither DLP nor ILP is also shown in these figures. For better visualization, the benchmarks are divided into two figures and sorted by increasing DLP opportunities: the benchmarks in the first figure have fewer DLP opportunities, and those in the second figure have more.
Fig. 15 ILP and DLP runtime opportunities of the applications, ordered by the DLP opportunities
Fig. 16 ILP and DLP runtime opportunities of the applications, ordered by the DLP opportunities
As can be seen, the selected benchmarks are very heterogeneous in terms of DLP and ILP opportunities. RGB Grayscale shows that ILP exploitation is mandatory, since no DLP opportunities are available. On the other hand, Susan Smoothing provides DLP during 86% of its execution time, suggesting that such exploitation should be present. When applications of the same domain are investigated, the need for both types of parallelism exploitation is reinforced. For instance, Susan Edges shows 3% of DLP while Susan Smoothing shows 86%, but the former provides 84% of ILP and the latter just 11%.
The same behavior can be noticed with GSM and SHA, where the former has low DLP opportunities and high ILP opportunities, while the opposite holds for the latter. Moreover, Matrix Multiplication, a kernel commonly used in a wide range of application domains, benefits from ILP and DLP exploitation at almost the same level, which shows that both forms of exploitation are orthogonal but complementary.
Figures 17 and 18 show the ILP and DLP opportunities of Susan Edges and Susan Smoothing varying the image size and the input images. The input images vary from a monochrome black image to a crowd image to evaluate how the opportunities change from homogeneous to heterogeneous pixel values. For Susan Edges, the instruction-level parallelism grows with increasing pixel heterogeneity and larger image sizes. Considering a 50 × 50 image, the ILP grows 14% from the black to the crowd image. Also, the ILP increases 27% when a black image grows from 50 × 50 to 500 × 500 pixels. Unlike Susan Edges, the increase in image size produces more opportunities for data-level parallelism exploitation in Susan Smoothing. From 50 × 50 to 500 × 500, the DLP increases 20%, regardless of the input image.
Fig. 17 ILP and DLP runtime opportunities of Susan E varying size and input data
Fig. 18 ILP and DLP runtime opportunities of Susan S varying size and input data
Figure 19 shows the ILP and DLP opportunities of Matrix Multiplication considering different matrix sizes and operand granularities. For both matrix sizes, the DLP and ILP opportunities remain at a similar level for integer operands. However, when floating-point numbers are considered, DLP opportunities decrease by about one third for both matrix sizes, while ILP stays at a similar level for 64 × 64 and drops by half for 256 × 256. Significant differences in ILP and DLP opportunities also appear when comparing the two matrix sizes: with integer operands, ILP stays around 40% for the 64 × 64 size but only 20% for 256 × 256. A similar behavior appears in the DLP opportunities, which stay around 30% for 64 × 64 and 70% for 256 × 256, showing that the opportunities to exploit parallelism change among different applications and vary significantly within a single application as the input size and operand granularity change.
Fig. 19 ILP and DLP runtime opportunities of MultMatrix varying size and data granularity

5 Experimental results

5.1 Methodology

This subsection presents the methodology used to gather the performance, area, and energy results of the proposed approach. Table 1 summarizes the system setups compared against HMTBT:
– ARMv7 Single Issue is the baseline system.
– The ARMv7 Superscalar + Static DLP Hand Coded is a 4-issue out-of-order processor running applications vectorized by hand using the NEON intrinsics library. In this setup, ILP is transparently exploited while DLP depends on programming-time effort.
– The ARMv7 Superscalar + Static DLP Compiler is a 4-issue out-of-order processor running applications vectorized by the GCC compiler. In this setup, ILP is transparently exploited while DLP depends on source code recompilation.
– The ARMv7+CGRA setup is used to explore the potential of transparent ILP exploitation by using the binary translator proposed in [1].
– The ARMv7+NEON uses the binary translator of [11], aiming at exploiting only DLP in a transparent fashion.
– The ARMv7+CGRA+NEON Decoupled setup has both binary translators, but working separately. This setup aims at showing both ILP and DLP exploitation without cooperative translation.
– ARMv7 HMTBT is evaluated over two setups, which cover the OPC and History Table dispatching approaches.

As can be noticed in Tables 1 and 2, for the sake of comparison, the operating frequency and the cache memory sizes are the same for all setups. We have implemented all setups in the gem5 simulator to gather performance results. Energy and area evaluations were gathered from logic synthesis using Cadence tools with a 15 nm design kit [14]. The CGRA used in all setups has two parallel arithmetic and logic units per column, three LD/ST units, one multiplier, and two floating-point units, organized in six levels. The NEON engine used can issue two instructions in order per cycle, with each instruction operating on 128-bit-wide data.

Table 1 Simulation setups of the compared systems

System setup        ARMv7 Single Issue   ARMv7 Superscalar   ARMv7 NEON    ARMv7 CGRA
Clock Freq.         1GHz                 1GHz                1GHz          1GHz
Cache L1, L2        64Kb, 512Kb          64Kb, 512Kb         64Kb, 512Kb   64Kb, 512Kb
ILP/DLP exploit.    –                    OoO+NEON            NEON          CGRA
Translation Cache   –                    –                   1Kb 4-way     3Kb 4-way

Table 2 Simulation setups of the implemented systems

System setup        ARMv7 Decoupled   ARMv7 HMTBT OPC   ARMv7 HMTBT Table
Clock Freq.         1GHz              1GHz              1GHz
Cache L1, L2        64Kb, 512Kb       64Kb, 512Kb       64Kb, 512Kb
ILP/DLP exploit.    CGRA+NEON         CGRA+NEON         CGRA+NEON
Translation Cache   1Kb+3Kb 4-way     4Kb 4-way         4Kb 4-way
History Table       –                 –                 32 entries

5.2 Performance

Figures 20 and 21 show the performance improvements of all setups over the ARMv7 Single Issue processor. These figures follow the same sorting as Figs. 15 and 16: the benchmarks in Fig. 20 have fewer DLP opportunities, and the benchmarks in Fig. 21 have more.

Although the ARMv7 Superscalar setups show performance improvements over the Single Issue, they rely on human intervention to extract data-level parallelism, either by modifying the code with NEON intrinsics (Hand Coded) or by recompiling it with the GCC compiler. Besides increasing programming time, the performance gains of the ARMv7 Superscalar + Static DLP Hand Coded setup are limited, since some loop types (such as dynamic and sentinel loops) are not vectorizable with NEON intrinsics. On average, the ARMv7 Superscalar + Static DLP Hand Coded setup achieves a speedup of 2.6 times over the ARMv7 Single Issue. The ARMv7 Superscalar + Static DLP Compiler setup behaves similarly to the static DLP exploitation with NEON intrinsics, since the compiler is unable to vectorize some loops due to the lack of execution-time information. On average, this setup outperforms the Single Issue processor by 2.4 times.

By exploiting only instruction-level parallelism, the ARMv7 CGRA setup outperforms both ARMv7 superscalar setups with static DLP exploitation in six benchmarks. The main source of performance improvement of the ARMv7 CGRA setup is its capability to accelerate data-dependent instructions. However, when a high degree of data-level parallelism is available in the application code (rightmost applications of Fig. 21), the NEON engine coupled to the superscalar processors boosts performance. On average, the combined ILP and DLP exploitation of the superscalar setups outperforms the standalone ILP exploitation of the ARMv7 CGRA by 25%.
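To make the programming-time overhead attributed above to the Static DLP Hand Coded setup more concrete, the fragment below sketches the kind of manual rewrite that NEON intrinsics require for a simple counted loop. It is an illustrative example only (none of the benchmark kernels is reproduced here), and it also suggests why dynamic and sentinel loops, whose trip counts are unknown in advance, resist this style of vectorization.

#include <arm_neon.h>
#include <stdint.h>

/* Hand-vectorized element-wise addition: the programmer must restructure the
 * loop around 128-bit NEON registers and handle the scalar tail manually. */
void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4) {              /* four 32-bit lanes per operation */
        int32x4_t va = vld1q_s32(a + i);      /* load 4 elements of a            */
        int32x4_t vb = vld1q_s32(b + i);      /* load 4 elements of b            */
        vst1q_s32(dst + i, vaddq_s32(va, vb));
    }
    for (; i < n; i++)                        /* remaining elements, scalar code */
        dst[i] = a[i] + b[i];
}

A binary translator such as the ones evaluated here performs an equivalent transformation at run time on the unmodified scalar binary, which is why it can also reach loops that were left untouched at programming and compilation time.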
Similar behavior is shown when both ARMv7 superscalar processors are compared to the ARMv7 NEON, where only DLP is exploited. The ARMv7 NEON outperforms both ARMv7 superscalar setups with static DLP exploitation in applications where DLP is largely present, but fails in applications where DLP is not available (leftmost applications in Fig. 20). On average, both ARMv7 superscalar setups outperform the ARMv7 NEON by 92%.

Fig. 20 Performance improvements of the systems over ARMv7 Processor

Summarizing, even though the CGRA can outperform superscalar processors with static DLP exploitation in applications with massive ILP, both forms of parallelism exploitation are mandatory to achieve balanced performance across applications from heterogeneous domains with varied DLP and ILP degrees.

Performance improvements over the ARMv7 Superscalar processors with static DLP exploitation are shown when ILP and DLP are dynamically exploited by the ARMv7 Decoupled setup (the setup that includes both binary translators). Such improvements are more evident when more DLP is available (rightmost applications in Fig. 21), since the binary translator can accelerate loops that are not vectorized at programming and compilation time. On average, the ARMv7 Decoupled outperforms both ARMv7 superscalar setups by 17%.

Unlike in the ARMv7 Decoupled setup, the binary translators work cooperatively in the proposed ARMv7 HMTBT, selecting the most well-suited accelerator for each application portion. The proposed ARMv7 HMTBT OPC outperforms the ARMv7 Decoupled and Superscalar setups in all applications but Bitcount and EPIC. On average, it shows 8% performance improvement over the decoupled system. The ARMv7 HMTBT Table, in turn, improves the prediction accuracy of the OPC-based HMTBT, outperforming the Superscalar setups in all applications but EPIC. EPIC has a specific behavior under CGRA acceleration: as its basic blocks are too small, most configurations contain few instructions, and the ILP gains are diluted by the reconfiguration time of the CGRA. On average, the HMTBT Table outperforms the superscalar setups and the ARMv7 Decoupled system by 29% and 10%, respectively.

The Oracle bars represent the potential of HMTBT parallelism exploitation, meaning that code portions are always dispatched to the accelerator that provides the higher performance improvement. As can be seen, both the HMTBT OPC and Table dispatching methods achieve near-optimal performance, losing on average just 2.8% and 1.2% of performance relative to the Oracle, respectively.

5.3 Energy

Table 3 and Figs. 22 and 23 show, respectively, the power consumption of the individual hardware blocks and the energy savings of all setups over the ARMv7 Single Issue.
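Since energy is the product of average power and execution time, a setup saves energy over the baseline only when its speedup exceeds its power ratio. As a rough average-case check using figures already reported here (the Table 3 power numbers and the 2.6× average speedup of the hand-coded superscalar setup), and ignoring translator and cache power:

\[
\frac{E_{\mathrm{Superscalar+NEON}}}{E_{\mathrm{Single\,Issue}}}
= \frac{P_{\mathrm{Superscalar+NEON}}}{P_{\mathrm{Single\,Issue}}}\cdot
  \frac{t_{\mathrm{Superscalar+NEON}}}{t_{\mathrm{Single\,Issue}}}
\approx \frac{668.6\ \mathrm{mW}}{243.9\ \mathrm{mW}}\cdot\frac{1}{2.6}
\approx 1.05,
\]

so the 4-issue superscalar with static DLP exploitation would need a speedup above roughly 2.74× just to break even in energy, which its 2.6× average does not reach. This back-of-the-envelope estimate is the arithmetic behind the observation below that the superscalar setups consume more energy despite their performance gains.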
Fig. 21 Performance improvements of the systems over ARMv7 Processor

Table 3 Power consumption of the components of the systems

Component                   Description                            Power (mW)
ARMv7 Single Issue          Processor width = 1                    243.9
ARMv7 Single Issue + NEON   Processor width = 1 and 128-bit NEON   241.5
ARMv7 Superscalar + NEON    Processor width = 4 and 128-bit NEON   668.6
CGRA                        2 ALU, 3 MEM, 1 MUL, 2 FP (6 levels)   148.1
DIM                         CGRA binary translator                 77.90
DSA                         ARM NEON binary translator             71.41
HMTBT                       Proposed binary translator             129.8
Translation Cache           4 Kb 4-way                             7.404
Verification Cache          1 Kb 4-way                             1.892
History Table               32 entries, fully associative          0.059
OPC Divisor                 Fixed-point                            5.434

As Table 3 shows, the power overhead of the 4-issue superscalar is higher than the performance improvements it delivers in most applications (Figs. 20, 21). Thus, despite having static DLP exploitation, the ARMv7 superscalar setups consume more energy in all applications. Considering the setups with single parallelism exploitation, both the ARMv7 CGRA and the ARMv7 NEON show unbalanced gains. For high-DLP applications, the ARMv7 NEON shows huge energy savings; for instance, Susan Smoothing presents the largest energy savings considering all setups. On the other hand, the ARMv7 CGRA achieves higher energy savings in applications where ILP is dominant, such as EPIC, for which it is the setup with the most significant energy savings.

Considering the HMTBT setups, the energy savings are more balanced across all applications. Both HMTBT dispatching methods save more energy than the ARMv7 Decoupled in all applications. On average, the cooperative work of the binary translators in the HMTBT Table provides 24% energy savings over the Decoupled setup. Moreover, although the setups with single parallelism exploitation provide higher energy savings than the HMTBT in Blowfish, RGB Grayscale, Bitcount, and Susan Smoothing, such applications are corner cases of parallelism availability (they lie at the leftmost side of Fig. 15 and at the rightmost side of Fig. 16), which does not represent the whole workload of a current embedded device, where applications are very heterogeneous.

Fig. 22 Energy Savings of the systems over ARMv7 Processor
Fig. 23 Energy Savings of the systems over ARMv7 Processor

Considering all applications, which mimic such heterogeneity, the average energy savings of the HMTBT over the ARMv7 Superscalar, CGRA, and NEON setups are 54%, 14%, and 18%, with performance improvements of 29%, 44%, and 99%, respectively (Fig. 22).

5.4 Area

Table 4 shows the area occupied by the individual hardware blocks that compose the evaluated setups. It is important to notice that the HMTBT coupled to an ARMv7 single-issue processor, performing dynamic ILP and DLP exploitation through a single binary translator, occupies 33% less chip area than an OoO ARMv7 superscalar processor with static DLP exploitation. Summarizing, besides keeping binary compatibility and not affecting software productivity, the HMTBT outperforms the OoO superscalar processor by 29% with 54% energy savings, while occupying 33% less chip area (Fig. 23).
Table 4 Area on chip of the components of the systems

Component                   Description                            Area (µm²)
ARMv7 Single Issue          Processor width = 1                    360,797
ARMv7 Single Issue + NEON   Processor width = 1 and 128-bit NEON   450,996
ARMv7 Superscalar + NEON    Processor width = 4 and 128-bit NEON   3,104,130
CGRA                        2 ALU, 3 MEM, 1 MUL, 2 FP (6 levels)   65,520
DIM                         CGRA binary translator                 7,485
DSA                         ARM NEON binary translator             468
HMTBT                       Proposed binary translator             8,070
Translation Cache           4 Kb 4-way                             13,878
Verification Cache          1 Kb 4-way                             3,651
History Table               32 entries, fully associative          456
OPC Divisor                 Fixed-point                            1,365

6 Conclusions

This work proposes a Hybrid Multi-Target Binary Translator (HMTBT) that transparently supports code optimization on accelerators that exploit different degrees of parallelism. Such exploitation is important since instruction and data-level parallelism opportunities can vary widely among embedded applications, and even within the same application when the data input or data type changes. The proposed BT detects, at runtime, the DLP and ILP of the application hotspots and issues them to the most well-suited accelerator to maximize performance per watt. The HMTBT proves to be more efficient in terms of performance and energy than an OoO superscalar processor coupled to an ARM NEON engine. Furthermore, the proposed approach improves performance and energy over decoupled binary translators using the same accelerators with the same ILP and DLP capabilities.

References

1. Beck ACS, Carro L (2007) Transparent acceleration of data dependent instructions for general purpose processors. In: IFIP VLSI-SoC, pp 66–71
2. Beck ACS, Rutzig MB, Carro L (2014) A transparent and adaptive reconfigurable system. Microprocess Microsyst 38(5):509–524. https://doi.org/10.1016/j.micpro.2014.03.004
3. Beck ACS, Rutzig MB, Gaydadjiev G, Carro L (2008) Transparent reconfigurable acceleration for heterogeneous embedded applications. In: 2008 Design, automation and test in Europe, pp 1208–1213. IEEE
4. Brandalero M, Beck ACS (2017) A mechanism for energy-efficient reuse of decoding and scheduling of x86 instruction streams. In: Design, automation & test in Europe conference & exhibition (DATE), 2017, pp 1468–1473. IEEE
5. Clark N, Kudlur M, Park H, Mahlke S, Flautner K (2004) Application-specific processing on a general-purpose core via transparent instruction set customization. In: 37th International symposium on microarchitecture (MICRO-37'04), pp 30–40. IEEE
6. DeVuyst M, Venkat A, Tullsen DM (2012) Execution migration in a heterogeneous-ISA chip multiprocessor. In: ASPLOS, pp 261–272
7. Fajardo J Jr, Rutzig MB, Carro L, Beck AC (2013) Towards a multiple-ISA embedded system. J Syst Architect 59(2):103–119
8. Fu SY, Hong DY, Liu YP, Wu JJ, Hsu WC (2018) Efficient and retargetable SIMD translation in a dynamic binary translator. Softw Pract Exp 48(6):1312–1330
9. Georgakoudis G, Nikolopoulos DS, Vandierendonck H, Lalis S (2014) Fast dynamic binary rewriting for flexible thread migration on shared-ISA heterogeneous MPSoCs. In: SAMOS XIV, pp 156–163. IEEE
10. Govindaraju V, Ho CH, Nowatzki T, Chhugani J, Satish N, Sankaralingam K, Kim C (2012) DySER: unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro 32(5):38–51. https://doi.org/10.1109/MM.2012.51
11. Jordan MG, Knorst T, Vicenzi J, Rutzig MB (2019) Boosting SIMD benefits through a run-time and energy efficient DLP detection. In: 2019 Design, automation & test in Europe conference & exhibition (DATE), pp 722–727. IEEE. https://doi.org/10.23919/DATE.2019.8714826
12. Junior JF, Rutzig MB, Carro L, Beck AC (2011) A transparent and adaptable multiple-ISA embedded system. In: Proceedings of the international conference on engineering of reconfigurable systems and algorithms (ERSA), p 1. The Steering Committee of the World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp)
13. Korol G, Jordan MG, Brandalero M, Hübner M, Beck Rutzig M, Schneider Beck AC (2020) MCEA: a resource-aware multicore CGRA architecture for the edge. In: 2020 30th International conference on field-programmable logic and applications (FPL), pp 33–39. https://doi.org/10.1109/FPL50879.2020.00017. ISSN: 1946-1488
14. Martins MGA, Matos JM, Ribas RP, Reis AI, Schlinker G, Rech L, Michelsen J (2015) Open cell library in 15 nm FreePDK technology. In: ISPD, pp 171–178
15. Nakamura T, Miki S, Oikawa S (2011) Automatic vectorization by runtime binary translation. In: 2011 Second international conference on networking and computing, pp 87–94
16. Nuzman D, Zaks A (2008) Outer-loop vectorization—revisited for short SIMD architectures. In: 2008 International conference on parallel architectures and compilation techniques (PACT), pp 2–11
17. Park S, Wu Y, Lee J, Aupov A, Mahlke S (2019) Multi-objective exploration for practical optimization decisions in binary translation. ACM Trans Embed Comput Syst 18(5s):1–19
18. Podobas A, Sano K, Matsuoka S (2020) A survey on coarse-grained reconfigurable architectures from a performance perspective. arXiv preprint arXiv:2004.04509
19. Rokicki S, Rohou E, Derrien S (2019) Hybrid-DBT: hardware/software dynamic binary translation targeting VLIW. IEEE Trans Comput Aided Des Integr Circuits Syst 38(10):1872–1885. https://doi.org/10.1109/TCAD.2018.2864288
20. Rutzig MB, Beck ACS, Carro L (2013) A transparent and energy aware reconfigurable multiprocessor platform for simultaneous ILP and TLP exploitation. In: 2013 Design, automation and test in Europe conference and exhibition (DATE), pp 1559–1564. https://doi.org/10.7873/DATE.2013.317. ISSN: 1530-1591
21. Vahid F, Stitt G, Lysecky R (2008) Warp processing: dynamic translation of binaries to FPGA circuits. Computer 41(7):40–46
22. Watkins MA, Nowatzki T, Carno A (2016) Software transparent dynamic binary translation for coarse-grain reconfigurable architectures. In: 2016 IEEE International symposium on high performance computer architecture (HPCA), pp 138–150. IEEE, Barcelona, Spain. https://doi.org/10.1109/HPCA.2016.7446060
23. Zhou R, Wort G, Erdös M, Jones TM (2019) The janus triad: exploiting parallelism through dynamic binary modification. In: Proceedings of the 15th ACM SIGPLAN/SIGOPS international conference on virtual execution environments—VEE 2019, pp 88–100. ACM Press. https://doi.org/10.1145/3313808.3313812

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.