PPMC: Hardware Scheduling and Memory Management Support for Multi Accelerators

Tassadaq Hussain∗†, Miquel Pericàs‡, Nacho Navarro∗†, Eduard Ayguadé∗†
∗ Barcelona Supercomputing Center
† Universitat Politècnica de Catalunya
‡ Tokyo Institute of Technology
{tassadaq.hussain, nacho.navarro, eduard.ayguade}@bsc.es, pericas.m.aa@m.titech.ac.jp
Abstract—A generic multi-accelerator system comprises a microprocessor unit that schedules the accelerators along with the necessary data movements. A system having the processor as control unit encounters multiple delays (memory and task management) which degrade the overall system performance. This performance degradation demands an efficient memory manager and a high speed scheduler that feeds prearranged data to the appropriate accelerator. In this work we propose the integration of an efficient scheduler and an intelligent memory manager into an existing core known as PPMC (Programmable Pattern based Memory Controller), such that data movement and computational tasks can be handled proficiently. Consequently, the modified PPMC system improves performance by managing data movements and address generation in hardware and by scheduling accelerators without the intervention of a control processor or an operating system. The PPMC system is evaluated with six memory intensive accelerators: Laplacian solver, FIR, FFT, Thresholding, Matrix Multiplication and 3D-Stencil. The modified PPMC system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor based system that has been integrated with the Xilkernel operating system. Results show that the modified PPMC based multi-accelerator system consumes 50% less hardware resources, 32% less on-chip power and achieves approximately a 27× speed-up compared to the MicroBlaze-based system.
I. INTRODUCTION
Much research has been conducted on improving the
performance of HPC systems. One approach is to build
a multi-accelerator/core system, manage/schedule [1] its
hardware resources efficiently and write parallel code to
execute on the system. A task-based programming model [2]
is appropriate for such architectures, as it identifies the tasks
in software that can be executed concurrently.
In a multi-accelerator environment (Figure 1), a master core (microprocessor) is used to schedule a number of accelerator cores (core-1..n) and to manage the memory. Improper scheduling and complex data arrangements of accelerator kernels can lead to significant performance degradation. In such a scenario, efficient management of memory
accesses across the set of accelerators and microprocessors
is critical to achieve high performance.
Figure 1. Generic Multi-Core Architecture
A number of scheduling and memory management approaches exist for multi-accelerators, but to the best of our knowledge it remains a challenge to find mechanisms that can schedule dynamic operations while taking both the processing and memory requirements into account. Ganusov et al. [3] proposed Efficient Emulation of Hardware Prefetchers via Event Driven Helper Threading (EDHT). Wolf et al. [4] provide a real-time capable thread scheduling interface to the two-level hardware scheduler of the MERASA multi-core processor. Chai et al. [5] present a configurable stream unit for providing streaming data to hardware accelerators. Wen et al. [6] present an FT64-based on-chip memory subsystem that combines software- and hardware-managed memory structures: FT64 combines caching with software-managed memory structures and captures the locality exhibited in regular/irregular stream accesses without data transfers between the stream register file and the caches.
Software implementations of memory management and scheduling strategies account for performance degradation in multi-accelerator/core systems. We propose integrating the PPMC, previously proposed in [7], into the multi-accelerator environment (Figure 1). The PPMC core schedules operations dynamically to multiple accelerators while taking both the processing and memory requirements into account. The PPMC core can prefetch complete patterns of data into its scratch-pad memories, which can then be accessed either by a master core or by an accelerator. Some salient features of the proposed PPMC architecture are listed below:
• The PPMC based system can operate as a stand-alone system, without support from the master core.
• PPMC supports multiple hardware accelerators using an event-driven handshaking methodology.
• The PPMC system improves performance by efficiently prefetching complex/irregular data patterns.
• Due to the light weight (in terms of logic elements) of PPMC, the system consumes less power.
• Standard C/C++ language calls are supported to identify tasks in software.
II. PPMC MULTI-ACCELERATOR SYSTEM
In the past, we have been working on the PPMC memory controller [7] for high performance systems. The existing PPMC core handles static data access patterns and supports a single accelerator. The new PPMC design efficiently feeds complex data patterns (strided, tiled) to multiple hardware accelerators using a special event-driven handshaking methodology. The PPMC architecture is illustrated in Figure 2.
Figure 2. PPMC Multi-core Architecture
Figure 3. PPMC Architecture: (a) PPMC Scheduler (b) PPMC Memory Manager (c) PPMC Stream Prefetcher
The PPMC multi-accelerator system is comprised of four units: the scheduler, the memory manager, the stream prefetcher, and the memory controller.
A. The Scheduler
The PPMC scheduler manages the read, write and execute operations of multiple accelerators. At the end of each accelerator's operation, the PPMC invokes the scheduler to select the next accelerator. This selection depends on the accelerator's request and priority state. The accelerators are categorized into three states: busy (the accelerator is processing its local buffer), requesting (the accelerator is free), and request-and-busy. In the request-and-busy state the accelerator is assumed to have double or multiple buffers. To provide multi-buffer support in the currently developed platform, a state controller (Figure 4 (c)) is instantiated with each accelerator; it handles the states of the accelerator using a double-buffering technique. The state controller manages the accelerator's Request and Grant signals and communicates with the scheduler. Each request includes a read and a write buffer operation. Once the request is accepted, the state controller provides a path to the PPMC read/write buffer.
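To make the three states concrete, the C fragment below models the per-accelerator state controller and its double-buffer swap; the type and function names (acc_state_t, raise_request, grant) are ours and only approximate the actual hardware state machine.

/* Illustrative C model of the per-accelerator state controller
 * (names and structure are our own approximation of the RTL). */
typedef enum { ACC_REQUESTING, ACC_BUSY, ACC_REQUEST_AND_BUSY } acc_state_t;

typedef struct {
    acc_state_t state;   /* current accelerator state            */
    int         active;  /* index of the buffer being processed  */
} state_ctrl_t;

/* Raise a request: with double buffering the accelerator can keep
 * computing on one buffer while the other is being (re)filled.   */
static void raise_request(state_ctrl_t *sc)
{
    sc->state = (sc->state == ACC_BUSY) ? ACC_REQUEST_AND_BUSY : ACC_REQUESTING;
}

/* Grant from the scheduler: swap buffers so the PPMC can read/write
 * the freed buffer while the accelerator processes the other one.  */
static void grant(state_ctrl_t *sc)
{
    sc->active ^= 1;      /* switch to the freshly filled buffer */
    sc->state   = ACC_BUSY;
}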
PPMC supports two scheduling policies, symmetric and asymmetric, that execute accelerators efficiently. In the symmetric multi-accelerator strategy, the PPMC scheduler serves the available accelerators' requests in FIFO (first-in, first-out) order. The task buffer (Figure 3 (a)) is used to manage the accelerators' requests in FIFO order. The asymmetric strategy emphasizes the priority and the incoming requests of the accelerators. Like the Xilinx Xilkernel scheduling model, the PPMC scheduling policies are configured statically at program-time and are executed by hardware at run-time. The number of priority levels can be configured for asymmetric scheduling. The assigned priorities of the accelerators are placed in the programmed priority buffer (Figure 3 (a)). The comparator picks an accelerator to execute only if it is ready to run and there are no higher-priority accelerators that are ready. If the same priority is assigned to more than one accelerator, the PPMC scheduler executes them in FIFO order.
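As an illustration of this selection rule, the following C sketch models the comparator operating over the priority and task buffers; the data structures and field names are hypothetical and are not taken from the PPMC implementation.

/* Hypothetical model of the asymmetric selection rule: among the
 * accelerators that are ready to run, pick the highest programmed
 * priority; ties are broken in FIFO (arrival) order.             */
#define N_ACC 6

typedef struct {
    int ready;      /* accelerator has an outstanding request      */
    int priority;   /* programmed priority (higher value = higher) */
    int arrival;    /* position in the task (FIFO) buffer          */
} acc_req_t;

static int pick_next(const acc_req_t req[N_ACC])
{
    int best = -1;
    for (int i = 0; i < N_ACC; i++) {
        if (!req[i].ready)
            continue;
        if (best < 0 ||
            req[i].priority > req[best].priority ||
            (req[i].priority == req[best].priority &&
             req[i].arrival  < req[best].arrival))
            best = i;
    }
    return best;   /* -1 if no accelerator is ready */
}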
B. Memory Manager
Memory management plays a key role in a multi-accelerator system. The PPMC Memory Manager places the accelerator address space and the physical address space in the descriptor memory (Figure 3 (b)). The memory space allocated to an accelerator as part of one request can be addressed through single or multiple descriptors [7]. The Memory Manager loads blocks of data into the local hardware accelerator buffer. Once the hardware accelerator finishes processing, it writes the processed data back to physical memory. The PPMC provides instructions to allocate and map the application kernel's local memory buffer and the physical dataset. A PPMC_MEMCPY instruction is provided which reads/writes a block of data between physical memory and the accelerator's local memory buffer.
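The exact call signature of PPMC_MEMCPY is not specified here; the C fragment below is only a sketch of how such a copy could be issued from the host side, with the argument list, function wrapper and addresses assumed for illustration.

/* Hypothetical host-side wrapper for the PPMC_MEMCPY instruction.
 * The argument list is assumed for illustration only: it moves a
 * block between the physical dataset and an accelerator's buffer. */
enum ppmc_dir { PPMC_READ, PPMC_WRITE };   /* physical -> buffer / buffer -> physical */

void PPMC_MEMCPY(unsigned phys_addr,       /* start of the block in physical memory */
                 unsigned buf_addr,        /* accelerator local buffer address       */
                 unsigned size,            /* block size in bytes                    */
                 enum ppmc_dir dir);       /* transfer direction                     */

/* Example: fill an accelerator's input buffer, then write results back. */
void fir_iteration(void)
{
    PPMC_MEMCPY(0x80000000u, 0x0000u, 4096u, PPMC_READ);   /* prefetch input block */
    /* ... accelerator processes the local buffer ...                              */
    PPMC_MEMCPY(0x80100000u, 0x1000u, 4096u, PPMC_WRITE);  /* write back results   */
}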
C. Stream Prefetcher
The stream prefetcher takes a dataset (main memory) description from the Memory Manager, as shown in Figure 3 (c). This unit is responsible for transferring data between the memory controller and the accelerator buffer memory. It takes the strided stream description from the Memory Manager and reads/writes data from/to physical memory.
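As a rough illustration, a strided stream description handed to the prefetcher might carry fields like those in the C struct below; the actual descriptor layout of [7] is not reproduced here and may differ.

/* Assumed minimal layout of a strided stream descriptor handed from
 * the Memory Manager to the Stream Prefetcher; the real descriptor
 * format of [7] may contain additional fields.                      */
typedef struct {
    unsigned phys_base;   /* starting physical address                */
    unsigned buf_base;    /* target accelerator buffer address        */
    unsigned elem_size;   /* bytes transferred per access             */
    unsigned stride;      /* distance in bytes between accesses       */
    unsigned count;       /* number of strided accesses               */
    unsigned write;       /* 0: memory -> buffer, 1: buffer -> memory */
} stream_desc_t;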
D. Memory Controller
In the current implementation of PPMC, on a Xilinx ML505 evaluation FPGA board, a modular DDR2 SDRAM [8] controller is used with PPMC to access data from physical memory and to perform the address mapping from physical addresses to memory addresses. A 256 MByte (32M x 16) DDR2 memory with a SODIMM I/O module is connected to the memory controller.
III. EVALUATIONS OF PPMC
In this section, we describe and evaluate the PPMC-based multi-accelerator system with the application kernels mentioned in Table I. Separate ROCCC [9] generated hardware IP cores are used to execute the kernels. In order to evaluate the performance of the PPMC system, the results are compared with a similar system having a MicroBlaze master core.
Figure 4. Test Architectures: (a) MicroBlaze based Multi Hardware Accelerator (b) PPMC based Multi Hardware Accelerator (c) PPMC State Controller
The Xilinx Integrated Software Environment and Xilinx Platform Studio are used to design the systems. The power analysis is done with the Xilinx Power Estimator. A Xilinx ML505 evaluation FPGA board [10] is used to test the multi-accelerator systems.
A. MicroBlaze based Multi-Accelerator System
A MicroBlaze based multi-accelerator system is proposed (Figure 4 (a)). The MicroBlaze soft-core processor is used to control the resources of the system. The Real-Time Operating System (RTOS) Xilkernel [11] runs on the MicroBlaze processor. Xilkernel has POSIX support and can statically declare threads that start with the kernel. From the main program, the application is spawned as multiple parallel threads using the pthread library. Each thread controls a single hardware accelerator and its memory accesses. The target architecture has 16 KB each of instruction and data cache. The design (excluding hardware accelerators) uses 7225 flip-flops, 6142 LUTs and 14 BRAMs. The on-chip power on a Xilinx V5-Lx110T device is 1.97 watts.
B. PPMC based Multi-Accelerator System
The PPMC based system (Figure 4 (b)) schedules accelerators similarly to the Xilkernel scheduling model. Scheduling is done at the accelerator event level. The PPMC contains a hardwired scheduler, whereas Xilkernel performs scheduling in software while using few hardware resources. The system (excluding hardware accelerators) consumes 3786 flip-flops, 2830 LUTs, 24 BRAMs and 1.33 watts of on-chip power on a V5-Lx110T device. Due to the light weight of PPMC, the proposed architecture consumes 50% less slices and 32% less on-chip power than the MicroBlaze based system.
C. PPMC Programming
The proposed PPMC system provides C and C++ language support. An example used to program the PPMC based multi-accelerator system is shown in Figure 5. The program initializes two hardware accelerators and their 2D and 3D tiled data patterns. The first part of the program structure specifies the scheduling policies, which include the accelerator id and its priority. The PPMC scheduler supports scheduling policies similar to those of Xilkernel. The second part of the code describes the physical memory dataset. The third part defines the size of the accelerator's buffer memory. The same programming style is used for the other accelerators. To program the PPMC, a MicroBlaze API is used that translates the multi-accelerator program for the PPMC system.
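Since Figure 5 is not reproduced in this text, the C fragment below sketches the three-part program structure described above for a single accelerator with a 2D tiled pattern; all function names and parameter values are assumed for illustration and do not necessarily correspond to the actual PPMC API.

/* Hypothetical API names mirroring the three program parts in the text. */
void ppmc_sched(int acc_id, int priority);
void ppmc_dataset(int acc_id, unsigned base, int rows, int cols, int elem_bytes);
void ppmc_buffer(int acc_id, int tile_rows, int tile_cols);

#define ACC_LAPLACIAN 0

void configure_laplacian(void)
{
    /* Part 1: scheduling policy (accelerator id and its priority).   */
    ppmc_sched(ACC_LAPLACIAN, 2);
    /* Part 2: physical memory dataset (base address, 2D geometry).   */
    ppmc_dataset(ACC_LAPLACIAN, 0x80000000u, 1024, 1024, 4);
    /* Part 3: accelerator local buffer (2D tile kept on chip).       */
    ppmc_buffer(ACC_LAPLACIAN, 32, 32);
}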
IV. RESULTS AND DISCUSSION
Figure 6 (a) shows the execution time (clock cycles) of the application kernels. Each bar represents the application kernel's computation time on the hardware accelerator and its execution time on the system. The application kernel time includes task execution, scheduling (request/grant) and data transfer time. The X and Y axes represent the application kernels and the number of clock cycles, respectively.
Table I. BRIEF DESCRIPTION OF APPLICATION KERNELS
Figure 5. PPMC Application Program: (a) 2D Tiled Data Access (b) 3D Tiled Data Access
Figure 6. Multi-Accelerator Systems: (a) Application Kernels Execution Time (b) Memory Access and Scheduling
Using the PPMC system, the results show that the Thresholding application achieves a 4.5× speed-up compared to the MicroBlaze based system. The Thresholding application has a load/store memory access pattern and achieves less speed-up than the other application kernels. The FIR application has a streaming data access pattern and achieves a 33.5× speed-up. The FFT application kernel reads a 1D block of data, processes it and writes it back to physical memory; this application achieves an 18× speed-up. The Matrix Multiplication kernel accesses row and column vectors and attains a 46× speed-up. The Laplacian application takes a 2D block of data and achieves a 44× speed-up. The 3D-Stencil data decomposition achieves a 48× speed-up.
Figure 6 (b) illustrates the execution time of the system and categorizes it into two factors: the arbitration (request/grant) time of the scheduling, and the memory management (bus delay and memory access) time. The computation time of the application kernels in both systems overlaps with the scheduling and memory access time, as shown in Figure 6 (a). In the PPMC system the memory management time is dominant, and the PPMC overlaps scheduling and computation under the memory access time. The complete PPMC multi-accelerator system achieves a 27.6× speed-up.
V. CONCLUSION
In this work, we have proposed the use of a core that is specialized in pattern-based memory accesses for multi-accelerator systems. The PPMC core improves the system performance by reducing the speed gap between accelerators/processors and memory, and by scheduling/managing complex memory patterns without master core intervention. The PPMC system provides strided, scatter/gather and tiled memory access support, which eliminates the overhead of arranging and gathering addresses/data by the master cores (i.e., microprocessors). The proposed environment can be programmed by a microprocessor using an HLL API or directly from an accelerator using a special command interface. The experimental evaluation, based on a Xilinx MicroBlaze multi-accelerator system running Xilkernel (RTOS), demonstrates that the PPMC based multi-accelerator system best utilizes hardware resources and efficiently accesses physical data. In the future, we plan to embed a selective static/dynamic set of data access patterns inside PPMC for multi-accelerator (vector accelerator) architectures, which would effectively eliminate the requirement of programming the PPMC by the user for a range of applications.
VI. ACKNOWLEDGMENTS
We thankfully acknowledge the support of the European
Commission through the HiPEAC-2 Network of Excellence
(FP7/ICT 217068), the support of the Spanish Ministry of
Education (TIN2007-60625, and CSD2007-00050) and the
Generalitat de Catalunya (2009-SGR-980). Miquel Pericàs
is supported by a JSPS Postdoctoral Fellowship For Foreign
Researchers. Finally, the authors also would like to thank
the reviewers for their useful comments.
REFERENCES
[1] C. Boneti, R. Gioiosa, F. Cazorla, and M. Valero, "A dynamic scheduler for balancing HPC applications," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[2] "Cell Superscalar (CellSs) Users Manual," Barcelona Supercomputing Center, May 2009.
[3] I. Ganusov and M. Burtscher, "Efficient emulation of hardware prefetchers via event-driven helper threading," in Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, 2006.
[4] J. Wolf, M. Gerdes, F. Kluge, S. Uhrig, J. Mische, S. Metzlaff, C. Rochange, H. Cassé, P. Sainrat, and T. Ungerer, "RTOS Support for Parallel Execution of Hard Real-Time Applications on the MERASA Multi-core Processor," in Proceedings of the 2010 13th IEEE International Symposium.
[5] S. M. Chai, N. Bellas, M. Dwyer, and D. Linzmeier, "Stream Memory Subsystem in Reconfigurable Platforms," in 2nd Workshop on Architecture Research using FPGA Platforms, 2006.
[6] M. Wen, N. Wu, C. Zhang, Q. Yang, J. Ren, Y. He, W. Wu, J. Chai, M. Guan, and C. Xun, "On-Chip Memory System Optimization Design for the FT64 Scientific Stream Accelerator," IEEE Micro, 2008.
[7] T. Hussain, M. Shafiq, M. Pericàs, N. Navarro, and E. Ayguadé, "PPMC: A Programmable Pattern based Memory Controller," in ARC 2012, the 8th International Symposium on Applied Reconfigurable Computing, 2012.
[8] Xilinx, "Memory Interface Solutions," December 2, 2009.
[9] "Riverside Optimizing Compiler for Configurable Computing (ROCCC 2.0)," April 3, 2011.
[10] "Xilinx University Program XUPV5-LX110T Development System."
[11] Xilinx, "Xilkernel 3.0," December 2006.