Benchmark Configurations

Category: Embedded Systems - ES01
Poster: P5193
Contact: Kristoffer Robin Stokke (krisrst@ifi.uio.no)
Energy and Performance Optimisation of a Simple Video Encoder on the Jetson-TK1
Kristoffer Robin Stokke, Håkon Kvale Stensland, Pål Halvorsen
Department of Informatics, University of Oslo, Norway
{krisrst, haakonks, paalh}@ifi.uio.no
This poster analyses the energy consumption of a simple video encoder running on NVIDIA's Tegra K1 processor [1][2]. We investigate the encoder's total energy consumption under different hardware configurations: which processors (CPU clusters or GPU) are used, which DVFS algorithm is active, and whether performance optimisations such as NEON are implemented. We find that NEON instructions and multithreading generally reduce energy consumption, saving between 25 and 40 % compared to a non-optimised, naive implementation. GPU offloading is marginally better than CPU execution, by 1.7 %.
Jetson-TK1
NEON instructions enable data-level parallelism: Single Instruction, Multiple Data (SIMD).
The integrated 16 GB eMMC storage is used as the root file system.
The Workload: Codec 63
NVIDIA's Tegra K1 contains five CPU cores arranged into one High-Performance (HP) cluster of four cores and one Low-Performance (LP) cluster of one core ("4-plus-1"). The architecture supports hardware migration of software between the clusters.
Storage: eMMC, 16 GB
CPU: 4 ARM Cortex-A15 (HP), 1 ARM Cortex-A15 (LP), NEON
Memory: DDR3, 2 GB
Fan
The Power Management Integrated Circuit (PMIC) distributes power to the board from the main power rail.
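[Figure: Jetson-TK1 hardware and power diagram. Labels: 12 V DC IN, INA219 power monitor (I2C), PMIC, regulator, DVFS, core migration, GPU (192 CUDA cores), interfaces (Ethernet, I2C).]
[Figure: Codec 63 encoder pipeline. Labels: input file (352x288, 50 frames; Y, U and V planes), frame, motion estimation (diamond), DCT, quantisation, inverse DCT.]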
To minimise the energy consumed by background software on the Tegra K1, we built and installed a minimal Ubuntu image without an X server. We communicate with the device through SSH over the Ethernet port instead.
The available 3.3 V I2C bus on the expansion header pins is used to communicate with the power measurement sensor.
[Encoder pipeline label: inverse quantisation]
CPU - Threaded: multithreaded (3 threads, one per frame plane: Y, U and V).
CPU - NEON: single threaded, but with NEON optimisation.
CPU - Threaded, NEON: multithreaded (3 threads) with NEON optimisation.
GPU: GPU offloading using CUDA.
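The threaded configurations above split the work by plane. The following is a minimal sketch of that structure, assuming a planar YUV 4:2:0 frame at the 352x288 input resolution (the chroma subsampling is an assumption, not stated on the poster). `encode_plane` is a hypothetical stand-in for the real per-plane encoding work, and in CPython the threads illustrate the structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 352, 288  # input resolution used on the poster


def plane_sizes(width=WIDTH, height=HEIGHT):
    """Byte size of each plane, assuming 8-bit planar YUV 4:2:0."""
    chroma = (width // 2) * (height // 2)
    return {"Y": width * height, "U": chroma, "V": chroma}


def encode_plane(name, data):
    """Hypothetical placeholder for per-plane encoding; returns the byte count."""
    return name, len(data)


def encode_frame(frame):
    """Split one frame into Y, U and V planes and process each in its own thread."""
    sizes = plane_sizes()
    planes, offset = {}, 0
    for name in ("Y", "U", "V"):
        planes[name] = frame[offset:offset + sizes[name]]
        offset += sizes[name]
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(encode_plane, n, d) for n, d in planes.items()]
        return dict(f.result() for f in futures)
```

Because the Y plane is four times larger than each chroma plane under this assumption, the three threads are not evenly loaded.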
Hardware Configurations
Processor   Cluster            DVFS Algorithm
CPU         High Performance   Ondemand, Conservative
CPU         Low Performance    Ondemand, Conservative
GPU         N/A                Auto
[Encoder pipeline labels: entropy encoding (Huffman), output file, Codec63]
Concluding Remarks
NVIDIA's Tegra K1 processor provides the programmer with a wide range of tools for performance optimisation, for example multithreading, NEON, and GPU offloading. The experimental results show that these also reduce energy consumption. The most energy-efficient hardware configuration offloads the workload to the GPU while forcing the CPU to operate in the LP cluster.
Experimental Results
The conservative DVFS scheduler results in slightly lower energy consumption than the ondemand scheduler. Performance is also slightly lower, however, because the conservative scheduler reacts more slowly to changes in processor utilisation [3].
The naive CPU implementations are the most time- and energy-consuming benchmarks.
CPU - Threaded
[Diagram labels: reconstructed frame n-1, motion compensation, regulator]
Dynamic Voltage and Frequency Scaling (DVFS) algorithms control the performance, power consumption and heat dissipation of the cores. The Linux kernel implements several DVFS algorithms, for example "powersave" and "performance".
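On Linux, the active CPU DVFS algorithm (the cpufreq governor) is normally selected through sysfs. The sketch below assumes the standard cpufreq layout under `/sys/devices/system/cpu`. The helper names and the `root` parameter are hypothetical, and the kernel used for these benchmarks may expose additional vendor-specific controls that are not covered here.

```python
from pathlib import Path

# Assumption: standard Linux cpufreq sysfs layout. `root` is a hypothetical
# hook so the helpers can be exercised against a fake directory tree.
CPU_ROOT = Path("/sys/devices/system/cpu")


def _cpufreq_dir(cpu, root):
    return Path(root) / f"cpu{cpu}" / "cpufreq"


def available_governors(cpu=0, root=CPU_ROOT):
    """Return the governors the kernel offers for this CPU."""
    path = _cpufreq_dir(cpu, root) / "scaling_available_governors"
    return path.read_text().split()


def get_governor(cpu=0, root=CPU_ROOT):
    """Return the name of the currently active governor."""
    return (_cpufreq_dir(cpu, root) / "scaling_governor").read_text().strip()


def set_governor(name, cpu=0, root=CPU_ROOT):
    """Select a governor, refusing names the kernel does not list."""
    if name not in available_governors(cpu, root):
        raise ValueError(f"governor {name!r} not available on cpu{cpu}")
    (_cpufreq_dir(cpu, root) / "scaling_governor").write_text(name + "\n")
```

On a real system, writing to `scaling_governor` requires root privileges, for example `set_governor("conservative")` before starting a benchmark run.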
Performance improvements such as threading and NEON optimisation have a positive effect on both speed and energy consumption.
CPU - Naive: single threaded.
[Diagram labels: GPU, frame n, regulator, DVFS]
The Tegra K1 is equipped with a GK20A Kepler GPU. It supports CUDA 6.5 for parallel computing through 192 CUDA cores.
Codec 63 is a simple video encoder with structural similarities to H.264 and Google's VP8. We use our implementation of this encoder as a simple code base to run our benchmarks under different hardware configurations and software optimisations.
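The transform and quantisation stage of an encoder of this kind can be illustrated with a small sketch. It assumes 8x8 blocks, an orthonormal DCT-II and a single uniform quantiser step. The block size, the step value and the function names are illustrative assumptions and are not taken from the Codec 63 source.

```python
import math

BLOCK = 8  # assumed block size


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i, v in enumerate(values))
        out.append(scale * acc)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def quantise(coeffs, step=16):
    """Uniform quantisation with a hypothetical step size."""
    return [[int(round(c / step)) for c in row] for row in coeffs]
```

For a flat block, all of the energy ends up in the single DC coefficient, which is why smooth image regions compress well after quantisation.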
Test sequence: Foreman
The board was equipped with a custom INA219 power measurement sensor to measure total platform power consumption.
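An INA219 returns raw register values over I2C, which must be scaled before they can be used as power readings. The sketch below converts raw bus-voltage and shunt-voltage register values to watts, using the LSB sizes from the INA219 datasheet (4 mV for the bus voltage in bits 15..3, 10 µV for the signed shunt voltage). The 0.1 ohm shunt resistance is a hypothetical value, since the poster does not describe the custom sensor board, and the I2C transfer itself is left out.

```python
BUS_LSB_V = 0.004     # bus voltage register: 4 mV per bit (bits 15..3)
SHUNT_LSB_V = 10e-6   # shunt voltage register: 10 uV per bit, signed
SHUNT_OHMS = 0.1      # hypothetical shunt resistor value


def bus_voltage(raw):
    """Convert the raw 16-bit bus voltage register to volts."""
    return ((raw & 0xFFFF) >> 3) * BUS_LSB_V


def shunt_voltage(raw):
    """Convert the raw 16-bit two's-complement shunt register to volts."""
    raw &= 0xFFFF
    if raw & 0x8000:
        raw -= 0x10000
    return raw * SHUNT_LSB_V


def power_watts(raw_bus, raw_shunt, shunt_ohms=SHUNT_OHMS):
    """Power = bus voltage * current, with current = shunt voltage / R."""
    current = shunt_voltage(raw_shunt) / shunt_ohms
    return bus_voltage(raw_bus) * current
```

Sampling `power_watts` at a fixed rate during a benchmark yields the power trace from which total energy can be computed.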
Full GPU offloading and combined threading and NEON optimisation show comparable energy consumption and performance on the HP cluster.
Threading has no impact on the LP cluster because it has only one core.
The tests are run for the same duration (30 s) so that the energy contribution of external hardware (USB controllers, fan, etc.) is constant across benchmarks.
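Energy is the time integral of power, so measuring every benchmark over the same window means each total includes the same contribution from hardware that draws constant power. The sketch below integrates a sampled power trace with the trapezoidal rule and compares two totals. The function names are hypothetical, and the way samples are collected is an assumption.

```python
def energy_joules(timestamps_s, power_w):
    """Trapezoidal integration of power samples (W) over time (s)."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def relative_saving(baseline_j, optimised_j):
    """Fraction of energy saved relative to the baseline configuration."""
    return 1.0 - optimised_j / baseline_j
```

A benchmark that finishes early still accumulates idle power until the window ends, so the totals remain comparable between fast and slow configurations.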
The GPU is always the best offloading choice in terms of performance for this workload, but the energy consumption is comparable to threading with NEON optimisation on the CPU.
The most energy-efficient configuration for this workload is to offload processing to the GPU while forcing the CPU to execute on the LP cluster.
The GPU's DVFS algorithm adjusts processor frequency in response to the increased processor utilisation.
The average power (energy consumed per unit time) rises and falls with the operating frequency and processor utilisation.
After the encoding is complete, GPU utilisation drops. The DVFS algorithm then lowers the GPU's frequency to the minimum operating point to conserve energy.
We have not shown the "performance" or "powersave" CPU DVFS algorithms here because they are easily outperformed by the "ondemand" and "conservative" algorithms. The "conservative" algorithm causes slightly lower energy consumption, but also a longer runtime, because it reacts more slowly to changes in processor utilisation.
Energy consumption is an important topic in a mobile world. The Jetson-TK1 provides the developer with a feature-rich computing platform with interesting power-saving opportunities that are yet to be fully explored.
References
[1] "Variable SMP (4+PLUS-1) - A Multi-Core CPU Architecture for Low Power and High Performance"
http://www.nvidia.com/content/PDF/tegra_white_papers/Variable-SMP-A-Multi-Core-CPU-Architecture-for-Low-Power-and-High-Performance.pdf
[2] "NVIDIA Tegra K1 - A new Era in Mobile Computing"
http://www.nvidia.com/content/PDF/tegra_white_papers/Tegra-K1-whitepaper-v1.0.pdf
[3] V. Pallipadi and A. Starikovskiy. The ondemand governor. In Proc. of the Linux Symposium, volume 2, pages 215–230, 2006.
[4] K.R. Stokke, H.K. Stensland, C. Griwodz, P. Halvorsen. Energy Efficient Video Encoding Using the Tegra K1 Mobile Processor. In Proc. of the ACM Multimedia Systems Conference, 2015.