Category: Embedded Systems - ES01 Poster P5193 contact Name Kristoffer Robin Stokke: krisrst@ifi.uio.no Energy and Performance Optimisation of a Simple Video Encoder on the Jetson­TK1 Kristoffer Robin Stokke, Håkon Kvale Stensland, Pål Halvorsen Department of Informatics, University of Oslo, Norway {krisrst, haakonks, paalh}@ifi.uio.no This poster analyses the energy consumption of a simple video encoder running on NVIDIA's Tegra K1 processor[1][2]. The total energy consumption of the video encoder is investigated under the influence of different hardware configurations, such as which processors (CPU clusters or GPU) are used, DVFS algorithms, and whether performance optimisations like NEON are implemented. We find that NEON instructions and multithreading generally have positive effects on energy consumption, saving between 25 to 40 % energy compared to a non­optimised, naive implementation. GPU offloading is found to be marginally better than CPU execution by an amount of 1.7 %. Jetson­TK1 NEON instructions enable instruction­level parallelism, Single Instruction, Multiple Data (SIMD). The integrated 16 GB eMMC storage is used as root file system. The Workload: Codec 63 NVIDIA's Tegra K1 contains five CPUs arranged into one High­Performance (HP) cluster and one Low­Performance (LP) cluster ("4­plus­1"). The architecture supports hardware migration of software between the clusters. Storage EMMC 16GB CPU 4 ARM Cortex­A15 (HP) 1 ARM Cortex­A15 (LP) NEON Memory DDR3 2GB Fan The Power Management Integrated Circuit (PMIC) distributes power to the board from the main power rail. DVFS 352x288 50 Frames INA219 Power Monitor 12 V DC IN (I2C) PMIC Regulator Input File Y U V Frame Core Migration Motion Estimation (Diamond) GPU 192 CUDA cores Interfaces Ethernet Inverse DCT DCT I2C Quantisation To minimise the energy consumption of passive software running on the Tegra K1, we built and installed a minimal Ubuntu image without x­server. We communicate with the device through SSH over the Ethernet port instead. The available 3.3 V I2C bus on the expansion header pins is used to communicate with the power measurement sensor. Inverse Quantisation Multithreaded (3, one thread per frame type, Y, U and V). CPU ­ NEON Single threaded, but with NEON optimisation. CPU ­ Threaded,NEON Multithreaded (3) with NEON optimisation. GPU offloading using CUDA. Hardware Configurations Processor Cluster High Performance CPU Output File Codec63 Entropy Encoding (Huffman) GPU DVFS Algorithm Ondemand Conservative Ondemand Low Performance N/A Conservative Auto Concluding Remarks Experimental Results NVIDIA's Tegra K1 processor provides the programmer with a wide range of tools for performance optimisation, for example multithreading, NEON, and GPU offloading. From the experimental results we see that these affect energy consumption in a positive way. The best hardware configuration is to offload the workload to the GPU, while forcing the CPU to operate in the LP cluster. The conservative DVFS scheduler results in slightly less energy consumption than the ondemand scheduler. Performance is also slightly lower, however, because the conservative scheduler reacts slower to changes in processor utilisation [3]. The naive CPU implementations are the most time and energy consuming benchmarks. CPU ­ Threaded Reconstructed Frame n­1 Motion Compensation Regulator Dynamic Voltage and Frequency Scaling (DVFS) algorithms are used to control the performance, power and thermal dissipation of the cores. There are several DVFS algorithms, for example "powersave" and "performance" implemented in the Linux kernel. Performance improvements such as threads and NEON optimisation has a positive effect on both speed and energy consumption. Single threaded. GPU n Regulator CPU ­ Naive Regulator DVFS The Tegra K1 is equipped with a GK20A Kepler GPU. It supports CUDA 6.5 for parallel computing through 192 CUDA cores. Codec 63 is a simple video encoder with structural similarities to H.264 and Google's VP8. We use our implementation of this encoder as a simple code base to run our benchmarks under different hardware configurations and software optimisations. Foreman The board was equipped with a custom INA219 power measurement sensor to measure total platform power consumption. Benchmark Configurations Full GPU offloading and combined threading and NEON optimisation shows comparable energy consumption and performance on the HP cluster. Threading has no impact on the LP cluster because it has only one core. The tests are run for the same duration (30 s) such that the energy consumption contribution of external hardware (USB controllers, fan, etc) is constant. The GPU is always the best offloading choice in terms of performance for this workload, but the energy consumption is comparable to threading with NEON optimisation on the CPU. The most energy efficient configuration for this workload is to offload processing to the GPU while forcing the CPU to execute on the LP cluster. The GPU's DVFS algorithm adjusts processor frequency in response to the increased processor utilisation. The average energy consumption (power) increases and decreases with the operating frequency and processor utilisation. After the encoding is complete, the utilisation of the GPU goes down. The DVFS algorithm settles the GPU's frequency to the minimum operating point to conserve energy. We have not shown the "performance" or "power" CPU DVFS algorithms here because they are easily outperformed by the "ondemand" and "conservative" algorithms. We see that the "conservative" algorithm causes slightly lower energy consumption, but also higher runtime. This is because it reacts slower to changes in processor utilisation. Energy consumption is an important topic in a mobile world. The Jetson­TK1 takes great steps to provide the developer with a feature rich computing platform with interesting power saving opportunities that are yet to be explored fully. References [1] "Variable SMP (4+PLUS­1) ­ A Multi­Core CPU Architecture for Low Power and High Performance" http://www.nvidia.com/content/PDF/tegra_white_papers/Variable­SMP­A­Multi­Core­CPU­ Architecture­for­Low­Power­and­High­Performance.pdf [2] "NVIDIA Tegra K1 ­ A new Era in Mobile Computing" http://www.nvidia.com/content/PDF/tegra_white_papers/Tegra­K1­whitepaper­v1.0.pdf Ondemand Conservative [3] V. Pallipadi and A. Starikovskiy. The ondemand governor. In Proc. of the Linux Symposium, volume 2, pages 215–230. sn, 2006. [4] K.R. Stokke, H.K. Stensland, C. Griwodz, P. Halvorsen. Energy Efficient Video Encoding Using the Tegra K1 Mobile Processor. In Proc. of the ACM Multimedia Systems Conference, 2015.