PPMC: Hardware Scheduling and Memory Management Support for Multi Accelerators

Tassadaq Hussain, Nacho Navarro, Eduard Ayguadé
Universitat Politècnica de Catalunya / Barcelona Supercomputing Center
{tassadaq.hussain, nacho.navarro, eduard.ayguade}@bsc.es

Miquel Pericàs
Tokyo Institute of Technology
pericas.m.aa@m.titech.ac.jp

Abstract— A generic multi-accelerator system comprises a microprocessor unit that schedules the accelerators along with the necessary data movements. A system that uses the processor as control unit encounters multiple delays (memory and task management) which degrade the overall system performance. This performance degradation demands an efficient memory manager and a high-speed scheduler that feeds prearranged data to the appropriate accelerator. In this work we propose the integration of an efficient scheduler and an intelligent memory manager into an existing core known as PPMC (Programmable Pattern based Memory Controller), such that data movement and computational tasks can be handled proficiently. Consequently, the modified PPMC system improves performance by managing data movements and address generation in hardware and by scheduling accelerators without the intervention of a control processor or an operating system. The PPMC system is evaluated with six memory-intensive accelerators: Laplacian solver, FIR, FFT, Thresholding, Matrix Multiplication and 3D-Stencil. The modified PPMC system is implemented and tested on a Xilinx ML505 evaluation FPGA board. The performance of the system is compared with a microprocessor-based system that has been integrated with the Xilkernel operating system. Results show that the modified PPMC-based multi-accelerator system consumes 50% fewer hardware resources, 32% less on-chip power and achieves approximately a 27× speed-up compared to the MicroBlaze-based system.

I. INTRODUCTION

Much research has been conducted on improving the performance of HPC systems. One approach is to build a multi-accelerator/core system, manage/schedule [1] its hardware resources efficiently and write parallel code to execute on the system. A task-based programming model [2] is appropriate for such architectures, as it identifies the tasks in software that can be executed concurrently. In a multi-accelerator environment (Figure 1), a master core (microprocessor) is used to schedule a number of accelerator cores (core-1..n) and to manage the memory. Improper scheduling and complex data arrangements of accelerator kernels can lead to significant performance degradation. In such a scenario, efficient management of memory accesses across the set of accelerators and microprocessors is critical to achieve high performance.

Figure 1. Generic Multi-Core Architecture

A number of scheduling and memory management approaches exist for multi-accelerators, but to the best of our knowledge it remains a challenge to find mechanisms that can schedule operations dynamically while taking both the processing and memory requirements into account. Ganusov and Burtscher [3] propose efficient emulation of hardware prefetchers via event-driven helper threading (EDHT). Wolf et al. [4] provide a real-time capable thread scheduling interface to the two-level hardware scheduler of the MERASA multi-core processor. Chai et al. [5] present a configurable stream unit for providing streaming data to hardware accelerators. Wen et al. [6] present an FT64-based on-chip memory subsystem that combines software- and hardware-managed memory structures.
FT64 combines caching and software-managed memory structures, capturing the locality exhibited in regular/irregular stream accesses without data transfers between the stream register file and the caches. Software implementations of memory management and scheduling strategies are a source of performance degradation in multi-accelerator/core systems. We propose the integration of the PPMC, originally presented in [7], into the multi-accelerator environment of Figure 1. The PPMC core schedules operations dynamically onto multiple accelerators while taking both the processing and memory requirements into account. The PPMC core can prefetch complete patterns of data into its scratch-pad memories, which can then be accessed either by a master core or by an accelerator. Some salient features of the proposed PPMC architecture are listed below:
• The PPMC-based system can operate as a stand-alone system, without support of the master core.
• PPMC supports multiple hardware accelerators using an event-driven handshaking methodology.
• The PPMC system improves performance by efficiently prefetching complex/irregular data patterns.
• Due to the light weight of PPMC (in terms of logic elements), the system consumes less power.
• Standard C/C++ language calls are supported to identify tasks in software.

II. PPMC MULTI-ACCELERATOR SYSTEM

In the past, we have been working on the PPMC memory controller [7] for high-performance systems. The existing PPMC core handles static data access patterns and supports a single accelerator. The new PPMC design efficiently feeds complex data patterns (strided, tiled) to multiple hardware accelerators using a special event-driven handshaking methodology. To demonstrate the operation, the PPMC architecture is illustrated in Figure 2. The PPMC system for multi-accelerators is comprised of four units: the scheduler, the memory manager, the stream prefetcher, and the memory controller.

Figure 2. PPMC Multi-core Architecture

Figure 3. PPMC Architecture: (a) PPMC Scheduler (b) PPMC Memory Manager (c) PPMC Stream Prefetcher

A. The Scheduler

The PPMC scheduler manages the read, write and execute operations of multiple accelerators. At the end of each accelerator's operation, the PPMC invokes the scheduler to select the next accelerator. This selection depends on the accelerator's request and priority state. The accelerators are categorized into three states: busy (the accelerator is processing its local buffer), requesting (the accelerator is free), and request-and-busy. In the request-and-busy state the accelerator is assumed to have double or multiple buffers. To provide multi-buffer support in the currently developed platform, a state controller (Figure 4 (c)) is instantiated with each accelerator; it handles the accelerator's states using a double-buffering technique. The state controller manages the accelerator's Request and Grant signals and communicates with the scheduler. Each request includes a read and a write buffer operation. Once the request is accepted, the state controller provides a path to the PPMC read/write buffer. PPMC supports two scheduling policies, symmetric and asymmetric, that execute accelerators efficiently. In the symmetric multi-accelerator strategy, the PPMC scheduler serves the available accelerators' requests in FIFO (first in, first out) order. The task buffer (Figure 3 (a)) is used to manage the accelerators' requests in FIFO order. The asymmetric strategy emphasizes the priorities and incoming requests of the accelerators. Like the Xilinx Xilkernel scheduling model, the PPMC scheduling policies are configured statically at program time and are executed by hardware at run time. The number of priority levels can be configured for asymmetric scheduling. The assigned priorities of the accelerators are placed in the programmed priority buffer (Figure 3 (a)). The comparator picks an accelerator to execute only if it is ready to run and there are no higher-priority accelerators that are ready. If the same priority is assigned to more than one accelerator, the PPMC scheduler executes them in FIFO order, as sketched below.
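To make the selection rule concrete, the following is a minimal software sketch of the asymmetric policy: the highest-priority requesting accelerator wins, and ties are broken in FIFO order of arrival. It only models the behaviour of the hardware comparator, priority buffer and task buffer; the type and function names are illustrative and not part of the actual PPMC implementation.

    /* Software model of the PPMC accelerator selection (illustrative only;
     * the real scheduler is implemented in hardware). */
    #include <stdio.h>

    typedef struct {
        int id;         /* accelerator identifier                        */
        int priority;   /* programmed priority (higher value = sooner)   */
        int requesting; /* 1 if the accelerator has a pending request    */
        int arrival;    /* arrival order in the task buffer (FIFO order) */
    } acc_state_t;

    /* Pick the next accelerator to grant, or -1 if none is requesting. */
    static int select_next(const acc_state_t acc[], int n)
    {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (!acc[i].requesting)
                continue;
            if (best < 0 ||
                acc[i].priority > acc[best].priority ||
                (acc[i].priority == acc[best].priority &&
                 acc[i].arrival < acc[best].arrival))
                best = i;
        }
        return best;
    }

    int main(void)
    {
        acc_state_t acc[] = {
            { .id = 0, .priority = 2, .requesting = 1, .arrival = 0 },
            { .id = 1, .priority = 2, .requesting = 1, .arrival = 1 },
            { .id = 2, .priority = 5, .requesting = 0, .arrival = 2 },
            { .id = 3, .priority = 1, .requesting = 1, .arrival = 3 },
        };
        int next = select_next(acc, 4);
        printf("grant accelerator %d\n", next >= 0 ? acc[next].id : -1);
        return 0;
    }

In this example, accelerators 0 and 1 tie on priority and accelerator 0 is granted first because it arrived earlier; under the symmetric policy the priority comparison would simply be ignored and only the arrival order would be used.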
B. Memory Manager

Memory management plays a key role in a multi-accelerator system. The PPMC memory manager places the accelerator address space and the physical address space in the descriptor memory (Figure 3 (b)). The memory space allocated to an accelerator as part of one request can be addressed through single or multiple descriptors [7]. The memory manager loads blocks of data into the local hardware accelerator buffer. Once the hardware accelerator finishes processing, it writes back the processed data to physical memory. The PPMC provides instructions to allocate and map an application kernel's local memory buffer and physical dataset. A PPMC_MEMCPY instruction is provided which reads/writes a block of data between physical memory and the accelerator's local memory buffer.

C. Stream Prefetcher

The stream prefetcher takes a dataset (main memory) description from the memory manager, as shown in Figure 3 (c). This unit is responsible for transferring data between the memory controller and the accelerator buffer memory. It takes the strided stream description from the memory manager and reads/writes data from/to physical memory.

D. Memory Controller

In the current implementation of PPMC on a Xilinx ML505 evaluation FPGA board, a modular DDR2 SDRAM controller [8] is used with the PPMC to access data from physical memory and to perform the address mapping from physical addresses to memory addresses. A 256 MByte (32M × 16) DDR2 SODIMM memory module is connected to the memory controller.

III. EVALUATION OF PPMC

In this section, we describe and evaluate the PPMC-based multi-accelerator system using the application kernels listed in Table I. Separate ROCCC-generated [9] hardware IP cores are used to execute the kernels. To evaluate the performance of the PPMC system, the results are compared with a similar system having a MicroBlaze master core. The Xilinx Integrated Software Environment and Xilinx Platform Studio are used to design the systems. The power analysis is done with the Xilinx Power Estimator. A Xilinx ML505 evaluation FPGA board [10] is used to test the multi-accelerator systems.

Table I. Brief description of application kernels

Figure 4. Test Architectures: (a) MicroBlaze based Multi Hardware Accelerator (b) PPMC based Multi Hardware Accelerator (c) PPMC State Controller

A. MicroBlaze based Multi-Accelerator System

A MicroBlaze-based multi-accelerator system (Figure 4 (a)) serves as the baseline. The MicroBlaze soft-core processor is used to control the resources of the system. The real-time operating system (RTOS) Xilkernel [11] runs on the MicroBlaze processor. Xilkernel has POSIX support and can statically declare threads that start with the kernel. From the main program, the application is spawned as multiple parallel threads using the pthread library. Each thread controls a single hardware accelerator and its memory accesses; a minimal sketch of this threading structure is shown below. The target architecture has 16 KB each of instruction and data cache.
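The following is a minimal sketch of this structure, assuming one control thread per accelerator as described above. The accelerator control steps appear only as comments, since the actual accelerator and memory-access calls are platform specific; the sketch uses only the standard pthread API.

    /* One control thread per hardware accelerator, spawned from main
     * (illustrative only; the accelerator control calls are omitted). */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_ACC 6   /* number of hardware accelerators */

    static void *drive_accelerator(void *arg)
    {
        int id = *(int *)arg;
        /* 1. copy the input block to the accelerator's local buffer   */
        /* 2. start the accelerator and wait for its completion        */
        /* 3. write the processed block back to physical memory        */
        printf("thread for accelerator %d finished\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_ACC];
        int ids[NUM_ACC];

        /* Spawn one control thread per accelerator from the main program. */
        for (int i = 0; i < NUM_ACC; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, drive_accelerator, &ids[i]);
        }
        for (int i = 0; i < NUM_ACC; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }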
The design (excluding hardware accelerators) uses 7225 flip-flops, 6142 LUTs and 14 BRAMs. The on-chip power in a Xilinx V5-LX110T device is 1.97 watts.

B. PPMC based Multi-Accelerator System

The PPMC-based system (Figure 4 (b)) schedules accelerators similarly to the Xilkernel scheduling model. Scheduling is done at the accelerator event level. The PPMC contains a hardwired scheduler, whereas Xilkernel performs scheduling in software while using few hardware resources. The system (excluding hardware accelerators) consumes 3786 flip-flops, 2830 LUTs, 24 BRAMs and 1.33 watts of on-chip power on a V5-LX110T device. Due to the light weight of PPMC, the proposed architecture consumes 50% fewer slices and 32% less on-chip power than the MicroBlaze-based system.

C. PPMC Programming

The proposed PPMC system provides C and C++ language support. An example used to program the PPMC-based multi-accelerator system is shown in Figure 5. The program initializes two hardware accelerators and their 2D and 3D tiled data patterns. The first part of the program structure specifies the scheduling policy, which includes the accelerator id and its priority. The PPMC scheduler supports scheduling policies similar to those of the Xilkernel. The second part of the code describes the physical memory dataset. The third part defines the size of the accelerator's buffer memory. The same programming style is used for the other accelerators. To program the PPMC, a MicroBlaze API is used that translates the multi-accelerator program for the PPMC system. A sketch of this three-part program structure is given below.

Figure 5. PPMC Application Program: (a) 2D Tiled Data Access (b) 3D Tiled Data Access
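Since Figure 5 is not reproduced here, the following is a hypothetical C sketch of the three-part structure described above (scheduling policy, physical memory dataset, local buffer size) for one 2D-tiled and one 3D-tiled accelerator. All structure names, fields and numeric values are invented for illustration and are not the actual PPMC/MicroBlaze API.

    /* Hypothetical illustration of the three-part PPMC program structure;
     * names, fields and values are invented, not the real PPMC API. */
    #include <stddef.h>
    #include <stdint.h>

    /* Part 1: scheduling policy -- accelerator id and its priority.     */
    struct sched_entry  { int acc_id; int priority; };

    /* Part 2: physical memory dataset -- base address and tiling shape. */
    struct dataset_desc { uintptr_t base; size_t dim[3]; size_t tile[3]; };

    /* Part 3: size of the accelerator's local buffer memory.            */
    struct buffer_desc  { size_t bytes; };

    int main(void)
    {
        /* Accelerator 0: 2D tiled data access (e.g. Laplacian).  */
        struct sched_entry  s0 = { .acc_id = 0, .priority = 2 };
        struct dataset_desc d0 = { .base = 0x90000000u,
                                   .dim  = { 1024, 1024, 1 },
                                   .tile = { 64, 64, 1 } };
        struct buffer_desc  b0 = { .bytes = 64 * 64 * sizeof(float) };

        /* Accelerator 1: 3D tiled data access (e.g. 3D-Stencil). */
        struct sched_entry  s1 = { .acc_id = 1, .priority = 1 };
        struct dataset_desc d1 = { .base = 0x92000000u,
                                   .dim  = { 256, 256, 256 },
                                   .tile = { 32, 32, 32 } };
        struct buffer_desc  b1 = { .bytes = 32 * 32 * 32 * sizeof(float) };

        /* In the real system these descriptors would be handed to the PPMC
         * through the MicroBlaze API; here they are only illustrative.   */
        (void)s0; (void)d0; (void)b0; (void)s1; (void)d1; (void)b1;
        return 0;
    }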
IV. RESULTS AND DISCUSSION

Figure 6 (a) shows the execution time (clock cycles) of the application kernels. Each bar represents the application kernel's computation time on the hardware accelerator and its execution time on the system. The application kernel time contains the task execution, scheduling (request/grant) and data transfer time. The X and Y axes represent the application kernels and the number of clock cycles, respectively.

Figure 6. Multi-Accelerator Systems: (a) Application Kernels Execution Time (b) Memory Access and Scheduling

Using the PPMC system, the results show that the Thresholding application achieves a 4.5× speed-up compared to the MicroBlaze-based system. The Thresholding application has a load/store memory access pattern and achieves a smaller speed-up than the other application kernels. The FIR application has a streaming data access pattern and attains a 33.5× speed-up. The FFT application kernel reads a 1D block of data, processes it and writes it back to physical memory; it achieves an 18× speed-up. The Matrix Multiplication kernel accesses row and column vectors and attains a 46× speed-up. The Laplacian application takes a 2D block of data and achieves a 44× speed-up. The 3D-Stencil data decomposition achieves a 48× speed-up.

Figure 6 (b) illustrates the execution time of the systems and categorizes it into two factors: the arbitration (request/grant) time of the scheduling, and the memory management (bus delay and memory access) time. The computation time of the application kernels in both systems overlaps with the scheduling and memory access time, as shown in Figure 6 (a). In the PPMC system the memory management time is dominant, and the PPMC overlaps scheduling and computation with the memory access time. The complete PPMC multi-accelerator system achieves a 27.6× speed-up.

V. CONCLUSION

In this work, we have proposed the use of a core specialized in pattern-based memory accesses for multi-accelerator systems. The PPMC core improves system performance by reducing the speed gap between accelerators/processors and memory, and by scheduling and managing complex memory patterns without master core intervention. The PPMC system provides strided, scatter/gather and tiled memory access support, which eliminates the overhead of arranging and gathering addresses/data by the master core (i.e., the microprocessor). The proposed environment can be programmed by a microprocessor using an HLL API or directly from an accelerator using a special command interface. The experimental evaluation against the Xilinx MicroBlaze multi-accelerator system running the Xilkernel RTOS demonstrates that the PPMC-based multi-accelerator system makes better use of hardware resources and accesses physical data efficiently. In the future, we plan to embed a selective static/dynamic set of data access patterns inside PPMC for multi-accelerator (vector accelerator) architectures, which would effectively eliminate the need for the user to program the PPMC for a range of applications.

VI. ACKNOWLEDGMENTS

We thankfully acknowledge the support of the European Commission through the HiPEAC-2 Network of Excellence (FP7/ICT 217068), the support of the Spanish Ministry of Education (TIN2007-60625 and CSD2007-00050) and the Generalitat de Catalunya (2009-SGR-980). Miquel Pericàs is supported by a JSPS Postdoctoral Fellowship for Foreign Researchers. Finally, the authors would also like to thank the reviewers for their useful comments.

REFERENCES

[1] C. Boneti, R. Gioiosa, F. Cazorla, and M. Valero, "A dynamic scheduler for balancing HPC applications," in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[2] "Cell Superscalar (CellSs) User's Manual," Barcelona Supercomputing Center, May 2009.
[3] I. Ganusov and M. Burtscher, "Efficient emulation of hardware prefetchers via event-driven helper threading," in Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, 2006.
[4] J. Wolf, M. Gerdes, F. Kluge, S. Uhrig, J. Mische, S. Metzlaff, C. Rochange, H. Cassé, P. Sainrat, and T. Ungerer, "RTOS support for parallel execution of hard real-time applications on the MERASA multi-core processor," in Proceedings of the 13th IEEE International Symposium, 2010.
[5] S. M. Chai, N. Bellas, M. Dwyer, and D. Linzmeier, "Stream memory subsystem in reconfigurable platforms," in 2nd Workshop on Architecture Research using FPGA Platforms, 2006.
[6] M. Wen, N. Wu, C. Zhang, Q. Yang, J. Ren, Y. He, W. Wu, J. Chai, M. Guan, and C. Xun, "On-chip memory system optimization design for the FT64 scientific stream accelerator," IEEE Micro, 2008.
[7] T. Hussain, M. Shafiq, M. Pericàs, N. Navarro, and E. Ayguadé, "PPMC: A programmable pattern based memory controller," in ARC 2012, 8th International Symposium on Applied Reconfigurable Computing, 2012.
[8] Xilinx, "Memory Interface Solutions," December 2, 2009.
[9] "Riverside Optimizing Compiler for Configurable Computing (ROCCC 2.0)," April 3, 2011.
[10] "Xilinx University Program XUPV5-LX110T Development System."
[11] Xilinx, "Xilkernel 3.0," December 2006.