PPMC: A Programmable Pattern-based Memory Controller
Tassadaq Hussain1, Muhammad Shafiq1, Miquel Pericàs1, Nacho Navarro2, and Eduard Ayguadé1,2

1 Barcelona Supercomputing Center
2 Universitat Politecnica de Catalunya

{thussain, mshafiq, miquel.pericas, eduard.ayguade}@bsc.es, nacho@ac.upc.edu
Abstract. One of the main challenges in the design of hardware accelerators is
the efficient access of data from the external memory. Improving and optimizing the functionality of the memory controller between the external memory and
the accelerators is therefore critical. In this paper, we advance toward this goal
by proposing PPMC, the Programmable Pattern-based Memory Controller. This
controller supports scatter-gather and strided 1D, 2D and 3D accesses with programmable tiling. Compared to existing solutions, the proposed system provides
better performance, simplifies programming access patterns and eases software
integration by interfacing to high-level programming languages. In addition, the
controller offers an interface for automating domain decomposition via tiling.
We implemented and tested PPMC on a Xilinx ML505 evaluation board using
a MicroBlaze soft-core as the host processor. The evaluation uses six memory
intensive application kernels: Laplacian solver, FIR, FFT, Thresholding, Matrix
Multiplication, and 3D-Stencil. The results show that the PPMC-enhanced system
achieves at least 10x speed-ups for 1D, 2D and 3D memory accesses as compared
to a non-PPMC based setup.
1 Introduction
The current trend in the design of HPC systems is shifting towards the integration of
microprocessors with accelerators in order to achieve the increasing performance targets. However, high-performance microprocessor/accelerator systems are of little use if
the memory hierarchy is unable to provide the necessary data bandwidth. The efficient
management of memory accesses by timely prefetching across the set of accelerators
and microprocessors in such a scenario is critical for performance. However, it is also
very challenging.
A lot of research has been conducted in the past to build efficient prefetchers. Basic
patterns that have been exploited include vectors with constant strides or linked lists [1].
Dynamic prefetching [8, 11] is useful for general-purpose microprocessors.
On the software side, blocking (tiling) algorithms [13] are used extensively to improve
cache efficiency by partitioning the working set according to the characteristics of cache memories. Software-managed caches or scratchpad memories [4] are used when stricter control of data locality is desired. However, this also increases the programming complexity,
particularly when data layouts are complex.
This paper introduces a memory controller based on high-level data patterns that ensures high performance and high efficiency while simplifying the programming of
SoC applications. The main contribution of this paper is the description of the hardware
implementation of the Programmable Pattern-based Memory Controller (PPMC), which
takes a data access description (descriptor blocks) and provides 1D, 2D and 3D data
in streaming mode. The PPMC system provides the following novel features to the
computing system:
– Concurrency: To expose maximum parallelism, the PPMC system fetches complex
streams in parallel to be processed by the compute engine(s).
– Programmability: Standard C/C++ language calls are used to schedule the access
patterns.
– Simplicity: The PPMC design offers simplicity in the management of different complex (strided) data access patterns in hardware by allowing users to program them.
This paper discusses the architecture of the memory controller and its implementation on a Xilinx Virtex 5 FPGA (ML505 development board), including its integration
with a Xilinx MicroBlaze processor. We also show how PPMC can be attached to reconfigurable accelerators. We use reconfigurable accelerators for the Laplacian solver,
FFT, FIR filters and 3D-Stencil kernel. Thresholding and Matrix Multiplication kernels
are executed on a MicroBlaze processor. We evaluate these application kernels on a
MicroBlaze-based SoC architecture using PPMC. This shows that the PPMC system is
equally usable as a standalone component within a reconfigurable system and as a
slave in a system with a general-purpose processor. The results are compared with a
system that lacks the pattern-based controller, showing substantial benefits from the use of
PPMC.
2 Programmable Pattern-based Memory Controller (PPMC)
In general, a conventional Direct Memory Access (DMA) controller's data access pattern consists of source and destination addresses and the size of the transfer, with unit stride.
This scheme is not efficient for accessing complex/irregular memory patterns. In such cases,
the task of the DMA controller becomes significantly more complicated, and delays are incurred while computing addresses for non-contiguous memory locations.
One of the main characteristics of the proposed PPMC is its efficiency in managing
more complex address patterns by combining scatter/gather and strided vector accesses,
whereas generic DMA controllers only support the simple address patterns provided by scatter/gather access. Another important feature of the proposed controller is its capability
to work standalone as well as in an SoC environment, which enables it to operate in parallel
with or without a microprocessor unit, while previous state-of-the-art DMA controllers have
to follow microprocessor instructions and bus protocols to perform the same operation.
When dealing with large volumes in high-performance computing applications,
address generation and volume decomposition are done by a host unit, which puts
additional pressure on the computing unit. The proposed controller overcomes this
shortcoming by remapping and generating physical addresses in hardware without delay.
The PPMC unit efficiently accesses complex data patterns (either as Array of Structures (AoS), Structure of Arrays (SoA) or tiled) with the help of a single descriptor block or multiple descriptor blocks. The minimum set of parameters for a single descriptor block includes
command, source address, destination address, stride and stream. Command specifies
whether to read or write a single element or a stream of data. The address parameters
specify the starting addresses of the source and destination locations. Stride indicates
the distance between two consecutive physical memory addresses of a stream. All of
this allows data to be transferred from non-contiguous memory to a contiguous address space, and vice versa.

Fig. 1: PPMC Internal Architecture
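To make the descriptor format concrete, the parameters listed above could be modelled in C roughly as follows. This is a minimal sketch; the field names and 32-bit widths are assumptions for illustration, since the text specifies only the parameter set, not the register layout of the controller.

    #include <stdint.h>

    /* Minimal sketch of one PPMC descriptor block (field names/widths assumed). */
    typedef struct {
        uint32_t command;      /* read or write; single element or stream          */
        uint32_t source;       /* starting address of the source location          */
        uint32_t destination;  /* starting address of the destination location     */
        uint32_t stride;       /* distance between consecutive physical addresses  */
        uint32_t stream;       /* number of contiguous elements in the stream      */
    } ppmc_descriptor_t;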
2.1 PPMC Architecture
To demonstrate the operation of PPMC, we first depict its internal architecture in Figure 1, which also briefly shows the interconnection with the external processing units.
The PPMC system comprises four units: the front-end interface, the
memory manager, the stream controller and the memory controller.
2.1.1 The Front-End Interface The Front-End interface includes two separate interfaces, the Scratchpad Controller Interface and the High-Speed Source Synchronous
Interface, to link with the microprocessor unit (master) and the hardware accelerator
unit (slave), respectively.
Scratchpad Controller Interface: The Scratchpad Controller Interface as shown in
Figure 2 (a) provides a bus interface between the master unit (microprocessor) and the
scratchpad memory (Block RAM). Currently the bus interface is compliant with the Xilinx Local Memory Bus (LMB) unit. The scratchpad memory subsystem consists of the
Descriptor Memory along with the Buffer Memory. The descriptor blocks are placed in
Descriptor memory in priority order. The Memory Manager (see Section 2.1.2) accesses
these descriptor blocks without any delay, thus reducing the additional synchronization
time required to access non-contiguous memories. Each descriptor block contains information on the source and destination units, which eliminates request/grant time. The Bus
interface is used to initialize the descriptor memory with the support of C/C++ calls
such as send(), receive(), send_tile() and receive_tile(). send()
and receive() occupy a single descriptor memory block, whereas send_tile() and
receive_tile() reserve descriptor blocks according to the volume of the tile. The
Buffer memory is used to hold data temporarily while it is being moved to/from physical memory. Depending on the system architecture and the application's access pattern, the
Buffer memory can be partitioned.
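For reference, the driver calls mentioned above might be declared along the following lines. The parameter lists of send() and receive() are not spelled out in the text, so they are illustrative assumptions; send_tile() and receive_tile() take the two parameters described later in Section 3.2 (Buffer # and Physical Address).

    #include <stdint.h>

    /* Illustrative prototypes only; the actual PPMC driver signatures may differ. */
    void send(uint32_t src_addr, uint32_t dst_addr, uint32_t stride, uint32_t stream);
    void receive(uint32_t src_addr, uint32_t dst_addr, uint32_t stride, uint32_t stream);
    void send_tile(uint32_t buffer_no, uint32_t physical_addr);     /* reserves descriptors for a whole tile */
    void receive_tile(uint32_t buffer_no, uint32_t physical_addr);  /* reserves descriptors for a whole tile */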
Fig. 2: PPMC Architecture: (a) PPMC Scratchpad Controller Interface (b) PPMC Memory Manager (c) PPMC Stream Controller
High-Speed Source Synchronous Interface: The Source Synchronous Interface
as shown in Figure 1 is used to supply high-speed data to reconfigurable accelerators.
To request and grant data for multiple accelerators, a handshaking protocol is applied
to the bus arbiter. Transfer of data is accomplished according to the physical memory
clock.
2.1.2 Memory Manager In a PPMC system having a scratchpad memory and multiple hardware accelerators, several requests can occur concurrently. The PPMC has to
select one source from multiple requesting sources. PPMC manages concurrent access
by implementing scheduling policies for both hardware accelerators and the scratchpad
read/write access. The memory manager is composed of two modules: the Arbiter and
the Address Manager.
Arbitration: PPMC applies an arbitration policy that is processed by the bus arbiter.
The current bus arbiter handles 32 hardware accelerators. It manages the distribution of
streams to the appropriate hardware accelerator. A round-robin token passing protocol
is implemented in the arbiter. For each descriptor access cycle, one of the hardware
accelerators has the highest priority (token). If the token-holding master does not need
the resource in this turn, the master with the next highest priority (sending a request) is
granted the resource, and the highest priority master later passes the token to the next
master in round-robin order.
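As a behavioural illustration of this policy (not the arbiter RTL), the grant decision for one descriptor access cycle can be modelled in C as follows; the request bitmap and the helper name are assumptions.

    #include <stdint.h>

    #define NUM_ACCELERATORS 32

    /* Behavioural model of the round-robin token arbiter (illustrative only).
     * 'requests' has one bit per accelerator; '*token' identifies the
     * highest-priority master for this descriptor access cycle. The first
     * requester found starting from the token holder is granted, and the
     * token is then passed to the next master in round-robin order. */
    static int arbitrate(uint32_t requests, int *token)
    {
        int granted = -1;
        for (int i = 0; i < NUM_ACCELERATORS; i++) {
            int candidate = (*token + i) % NUM_ACCELERATORS;
            if (requests & (1u << candidate)) {
                granted = candidate;
                break;
            }
        }
        *token = (*token + 1) % NUM_ACCELERATORS;  /* pass the token onward */
        return granted;                            /* -1 if nobody requested */
    }

In hardware this corresponds to a priority rotation performed once per descriptor access cycle rather than a software loop.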
Address Manager: The address manager controls the addresses of the scratchpad
memory (descriptor block, scratchpad buffer). It handles a Descriptor Memory Pointer
(DMP) and a Scratchpad Buffer Pointer (SBP) as shown in Figure 2 (b). The DMP holds
the address for the next descriptor block and provides this address to the descriptor
memory to fetch a descriptor block. After completion of target data access, the address
manager sends an ack signal to the DMP, which then requests the next descriptor block. The
SBP is responsible for generating addresses for the scratchpad memory. Depending on the
application (multi-threaded) or hardware architecture (multi-accelerator), the scratchpad memory is divided into multiple buffers. The SBP takes the source address (Base
Address) from the descriptor controller and, with single-cycle latency, starts incrementing it. The SBP stops incrementing addresses when the ack signal is granted by the
Stream Controller (see Section 2.1.3).
2.1.3 Stream Controller The stream controller, shown in Figure 2 (c), takes memory
transaction descriptions from the Memory Manager. This unit is responsible for transferring data between the Memory Controller and the Front-End Interface depending on
the programmed descriptor. The stream controller is comprised of the Data Management Unit (DMU) and the Address Management Unit (AMU).
Data Management Unit (DMU): The DMU manages data lines labeled as Data
In and Data Out shown in Figure 2 (c). It enables the data stream to be written at the
appropriate location of the physical memory by generating the write–enable along with
the write–data and mask–data control signals. The source of the data stream can be
a contiguous buffer memory or a streaming accelerator unit. The DMU supports a
64-bit data bus and an 8-bit mask bus; for every 8 bits of the data bus, there is one mask bit.
Address Management Unit (AMU): The AMU deals with the Address, Stream
and Stride units. Strides between two consecutive accesses are handled by the AMU
without generating delay. For each stream, the first data transfer uses an address taken
from the descriptor unit; for the rest of the transfers, the address is equal to the address
of the previous transfer plus the size of the stride. The AMU supports a stream size of
up to 1024 contiguous memory elements with one descriptor. Supported strides need
to be multiples of 4.
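A software rendering of this address sequence, under the constraints just stated, might look like the following sketch; the function name and array interface are assumptions.

    #include <assert.h>
    #include <stdint.h>

    /* Illustrative model of AMU address generation for one descriptor:
     * the first transfer uses the descriptor's address, and every subsequent
     * transfer adds the stride to the previous address. */
    static void amu_generate(uint32_t base, uint32_t stride, uint32_t stream,
                             uint32_t *out_addrs)
    {
        assert(stream <= 1024);   /* one descriptor covers up to 1024 elements */
        assert(stride % 4 == 0);  /* supported strides are multiples of 4      */

        uint32_t addr = base;     /* address taken from the descriptor unit    */
        for (uint32_t i = 0; i < stream; i++) {
            out_addrs[i] = addr;
            addr += stride;       /* next address = previous address + stride  */
        }
    }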
2.1.4 Memory Controller A modular DDR2 SDRAM [16] controller is used to access data from physical memory. The DDR2 SDRAM controller provides a high-speed
source-synchronous interface and transfers data on both edges of the clock cycle.
3 Data Access Pattern
In order to elaborate on the functionality of PPMC, two types of data access patterns are
discussed in this section: the Vector Access Pattern (AoS or SoA) and the Tiling Access Pattern.
3.1 Vector Access Pattern (AoS or SoA)
Gou et al. observed that most HPC applications favor operating on the SoA format [5]. The PPMC system allows accessing an SoA in AoS format without generating
any delay. AoS or SoA data access is performed by the send() and receive() calls.
Each call is associated with a single descriptor block, which needs to be initialized for each
transfer. SoA data access requires unit stride, whereas AoS requires strided access.

Fig. 3: Two PPMC Patterns: (a) AoS and SoA access pattern (b) PPMC tiling example
The stride is determined by the size of the working set. Figure 3 (a) presents three different vector accesses (x[n][0], y[0][n] and z[n][n]) for an n×n matrix, where
vector x corresponds to a contiguous row with unit stride, vector y corresponds to a column access with stride equal to the row length, and vector z is a diagonal access pattern. The stride
of the z[n][n] access is the sum of the row length and the unit stride.
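Using the illustrative call signatures sketched in Section 2.1.1, the three accesses of Figure 3 (a) could be programmed roughly as follows; the base addresses, the element granularity of the stride and stream arguments, and the transfer direction are assumptions made for illustration.

    #include <stdint.h>

    /* Hypothetical addresses and prototype, for illustration only. */
    #define MATRIX_BASE  0x90000000u        /* n x n matrix in physical memory */
    #define BUFFER_BASE  0x00000000u        /* destination scratchpad buffer   */
    #define N            256u               /* example matrix dimension n      */
    void receive(uint32_t src, uint32_t dst, uint32_t stride, uint32_t stream);

    static void fetch_vectors(void)
    {
        receive(MATRIX_BASE, BUFFER_BASE,             1, N);  /* x: contiguous row, unit stride       */
        receive(MATRIX_BASE, BUFFER_BASE + N,         N, N);  /* y: column, stride = row length       */
        receive(MATRIX_BASE, BUFFER_BASE + 2 * N, N + 1, N);  /* z: diagonal, stride = row length + 1 */
    }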
3.2 Tiling Access
The tiling technique is employed to improve system performance. It is widely used for
exploiting data locality and improving parallelism.
Single Tile Access: To access a single tile, the PPMC uses multiple descriptors. By
combining these descriptors, the PPMC exchanges tiled data between physical memory and the scratchpad memory buffer. To read/write a single tile, the send_tile() and
receive_tile() calls are used to initialize the descriptor memory blocks. Each call
requires two input parameters, Buffer # and Physical Address. The Buffer # parameter indicates the starting address of the buffer where the tile is read/written. Physical Address
holds the start address of the working dataset.
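Putting the two calls together, a single-tile read/compute/write sequence might look like the sketch below; the buffer numbering and the interrupt-wait helper are assumptions, since the text only states that the PPMC raises an interrupt when a tile transfer has finished.

    #include <stdint.h>

    /* Illustrative prototypes; actual driver signatures may differ. */
    void receive_tile(uint32_t buffer_no, uint32_t physical_addr);
    void send_tile(uint32_t buffer_no, uint32_t physical_addr);
    void ppmc_wait_tile_done(void);   /* hypothetical helper: block until the PPMC interrupt fires */

    static void process_one_tile(uint32_t tile_addr)
    {
        receive_tile(0, tile_addr);   /* gather the tile into scratchpad buffer 0  */
        ppmc_wait_tile_done();        /* PPMC signals completion of the tile fetch */
        /* ... compute on buffer 0, writing results into buffer 1 ...              */
        send_tile(1, tile_addr);      /* scatter the result tile back to memory    */
        ppmc_wait_tile_done();
    }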
Multi-Tiling Access: The PPMC multi-tiling method attempts to derive an effective
tiling scheme. It requires information about the Physical Memory Tiled Dataset (M^N)
and the Buffer's Tile size (m^n) of the computing engine, where M and m represent the widths
and N and n the dimensions of the dataset and of each tile, respectively. Depending on the structure of the buffer
memory, the PPMC partitions physical memory into multiple tiles. The number of tiles
for the even multidimensional memory access is given by Equation 1.

Number of Tiles = M^N / m^n = (DataSet Width)^Dimension / (LocalBuffer Width)^Dimension    (1)
An example of PPMC (4×4) 2D-tile access is shown in Figure 3 (b). The PPMC uses
4 descriptors to access a single tile. Channel 0 takes descriptor 0 from descriptor
memory and accesses data starting at location Physical Address with unit-stride and
n-stream (row width) size. Depending on the size of the main memory dataset (M^N),
different strides (addresses) are used between channels. The starting address of each subsequent
channel is given by Equation 2. Channel 1 starts right after the completion of
Channel 0. Channel 3 is the last channel. After completion of channel 3, PPMC
generates an interrupt signal indicating to the external source that the tile data transfer
has finished. The PPMC will move ahead to the next tile and restart processing.
Start Address = Dataset Base Address_Tile + Buffer Width_Tile × channel number    (2)
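A small numerical sketch of Equations 1 and 2 is given below for the 4×4 tile of Figure 3 (b), assuming a 16×16 2D dataset; the concrete sizes, the address granularity and the per-channel pitch passed to the Equation 2 helper are assumptions.

    #include <stdint.h>

    /* Illustrative evaluation of Equations 1 and 2 (assumed sizes). */
    enum { M = 16, N_DIM = 2,    /* physical dataset: width M = 16, dimension N = 2 (16x16) */
           m = 4,  n_dim = 2 };  /* buffer tile:      width m = 4,  dimension n = 2 (4x4)   */

    /* Equation 1: Number of Tiles = M^N / m^n = (16*16) / (4*4) = 16. */
    static uint32_t number_of_tiles(void)
    {
        uint32_t tiles = 1;
        for (int d = 0; d < N_DIM; d++)
            tiles *= M / m;      /* equal dataset and tile dimensions assumed (N = n) */
        return tiles;
    }

    /* Equation 2: Start Address = Dataset Base Address + Buffer Width * channel number.
     * 'pitch' stands in for the per-channel offset that the PPMC derives from the
     * main-memory dataset size; its value here is an assumption. */
    static uint32_t channel_start(uint32_t dataset_base, uint32_t pitch, uint32_t channel)
    {
        return dataset_base + pitch * channel;
    }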
4 Evaluations of PPMC
In this section, we describe and evaluate the PPMC-based SoC architecture for the FIR,
Thresholding, FFT, Laplacian solver, Matrix Multiplication and 3D-Stencils application kernels. In order to evaluate the performance of the PPMC System, the results are
compared with a similar system that does not feature a PPMC unit.
Fig. 4: Test Environment: (a) MPMC based SoC with Reconfigurable Hardware Accelerator (b)
PPMC based SoC with Reconfigurable Hardware Accelerator
4.1 The System Architecture
In our evaluations, we used two types of systems: one system is based on MicroBlaze
and the other system is based on a reconfigurable accelerator. The former system is
used to test Matrix Multiplication and Thresholding, while the latter is used to test
the FFT, FIR, Laplacian solver and 3D-stencil application kernels. Both systems are
configured with and without support of PPMC.
4.1.1 MPMC based SoC A generic Multi-Port Memory Controller (MPMC) with a
hardware accelerator based SoC architecture is proposed in Figure 4 (a). The MPMC
is employed as it provides an efficient means of interfacing the processor to SDRAM.
A MicroBlaze soft-core processor is used to control the system. The target architecture
has a single-precision floating-point unit and 16 KB each of instruction and data cache.
The design uses 9612 flip-flops, 9388 LUTs and 14 BRAMs in a Xilinx V5-Lx110T
device.
4.1.2 PPMC based SoC The PPMC based system is shown in Figure 4 (b). In this
architecture, dual-port memory controllers are used to share the descriptor memory between the MicroBlaze and the PPMC.

Fig. 5: (a) ROCCC-based Hardware Accelerator (b) PPMC State Controller for Hardware Accelerator

Table 1: Brief description of application kernels

    Application Kernel           Description                                               Access Pattern  Executed By
    Thresholding                 An application of image segmentation, which takes         Streaming       MicroBlaze
                                 streaming 8-bit pixel data and generates binary output.
    Finite Impulse Response      Calculates the weighted sum of the current and past       Streaming       ROCCC IP
                                 inputs.
    Fast Fourier Transform       Used for transforming a time-domain signal into the       1D Block        ROCCC IP
                                 corresponding frequency-domain signal.
    Laplacian solver             Applies a discrete convolution filter that can            2D Tiling       ROCCC IP
                                 approximate the second-order derivatives.
    Matrix Multiplication        X = Y × Z                                                 2D Tiling       MicroBlaze
    3D-Stencil                   An algorithm that averages nearest-neighbor points        3D-Tiling       MicroBlaze
    Decomposition [10]           (size 8x9x8) in 3D.

The MicroBlaze is connected with the PPMC for
programming PPMC’s descriptor memory, to execute applications (more information
in Table 1) and to display results using a serial link. The PPMC is connected to the
reconfigurable hardware accelerator using PPMC’s source synchronous interface and a
state controller. The PPMC state controller, shown in Figure 5 (b), is used to control the
hardware accelerator and the buffer memory. Four memory buffers (two read and two
write) are specified to access tiled data to/from main memory. The system consumes
6280 flip-flops, 5581 LUTs and 24 BRAMs on a V5-Lx110T device. Due to the lightweight
design of PPMC, the proposed architecture consumes 38% fewer slices than the generic
MicroBlaze SoC.
4.2 Test Applications
The application kernels used to validate the design are shown in Table 1. These kernels
are chosen to characterize different data access patterns. The results are validated by
comparing the execution time of these kernels on a MicroBlaze system with PPMC and
without PPMC support. To execute the FIR, FFT and Laplacian kernels, separate hardware
IP cores are used. The hardware (shown in Figure 5 (a)) is generated by the ROCCC [12]
compiler as IPs for the evaluated application kernels. The IPs are compliant with the
scratchpad memory interface, which makes it feasible to integrate them into the MPMC
and PPMC based systems. In order to give the ROCCC IPs a standard interface to the
PPMC system, a state controller is designed (see Figure 5 (b)).
4.2.1 PPMC Programming The proposed system provides comprehensive support
for the C and C++ languages. The functionality of the PPMC is managed by C based
device drivers. An example code segment that is used to initialize and run PPMC by an
accelerator program is shown in Figure 6 (a). The accelerator program uses this structure to communicate with the PPMC for handling a 2D tiling pattern. The first part of
the structure specifies physical memory parameters. The PPMC will automatically adjust the dataset into appropriate tile sizes, to be processed by the hardware accelerator.
The second part defines the size of the accelerator's buffer memory; the maximum tile
size depends on the buffer size. The third part contains the name of a hardware accelerator and a passing structure, which is used during the execution of the kernel. If more than
one hardware accelerator is connected to the PPMC system, then the above information
is required for each kernel. The MicroBlaze is used to program the PPMC by using an
API. The API takes the code (shown in Figure 6 (a)), pipelines and overlaps the PPMC
operations shown in Figure 6 (b), and then programs the Descriptor Memory.

Fig. 6: PPMC: (a) 2D-Tiling Program Example (b) Scheduling of the Tiling program
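Figure 6 (a) is not reproduced here; as a hedged sketch, the three parts described above might map onto a C structure like the one below. All field, type and function names are assumptions made for illustration, not the actual driver interface.

    #include <stdint.h>

    /* Hypothetical structure mirroring the three parts of the 2D-tiling program. */
    typedef struct {
        /* Part 1: physical memory parameters of the dataset to be tiled */
        uint32_t dataset_base;      /* physical start address              */
        uint32_t dataset_width;     /* dataset width (elements)            */
        uint32_t dataset_height;

        /* Part 2: accelerator buffer memory (bounds the maximum tile size) */
        uint32_t buffer_width;      /* tile width that fits in the buffer  */
        uint32_t buffer_height;

        /* Part 3: accelerator binding used while the kernel executes */
        const char *accelerator;    /* e.g. a hypothetical "laplacian" IP name */
        void       *kernel_args;    /* passing structure for the kernel        */
    } ppmc_tiling_program_t;

    /* Hypothetical API entry point: the MicroBlaze hands the program to the
     * PPMC driver, which pipelines/overlaps the operations of Figure 6 (b)
     * and fills the Descriptor Memory. */
    void ppmc_program_2d_tiling(const ppmc_tiling_program_t *prog);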
5 Results and Discussion
Figure 7 shows a plot with Read/Write data accesses for MPMC and PPMC based systems. The X-axis presents 1D/2D/3D datasets that are read and written from/to the main
memory and the scratchpad memory. The Y-axis of the plot represents the number of
clock cycles consumed while accessing the tiled dataset. The MicroBlaze with MPMC
has load/store data access patterns, whereas PPMC based access patterns contain multiple non-contiguous streams. The results show that PPMC memory access times are 10
times faster than those of a generic memory controller.
Fig. 7: Clock Cycles: PPMC and MicroBlaze Read/Write data to/from Main Memory

Fig. 8: Execution time (clock cycles) of the applications for different volume sizes (MicroBlaze with and without PPMC)

Figure 8 shows the execution time (clock cycles) for the application kernels executed on the MPMC and PPMC based systems. The X and Y axes represent the application
kernels and the number of clock cycles, respectively. Using our PPMC system, the Thresholding and FIR application kernels achieve 32x
and 22x speed-ups, respectively. Both application kernels have a streaming data access pattern. The
results confirm that PPMC reduces physical memory data access delay and that it decreases memory request/grant time by overlapping it with application execution time.
The FFT application kernel reads a 1D block of data, processes it and then writes it back
to physical memory. This application achieves an 8.2x speed-up. The Matrix Multiplication and Laplacian applications achieve speed-ups of 6.4x and 9.6x, respectively. Both
applications have 2D data access and both use data tiling. Figure 9 shows the execution
time for the 3D-Stencil for various sizes of large input volumes. The stencil is run for
500 time steps. Using the PPMC to directly handle the tiling of 3D volumes improves the performance of the stencil kernel by around 5x compared to software-based
volume decomposition. In our evaluations for the 3D-stencil, the accelerator directly
writes the tiling commands to the PPMC.
Fig. 9: Execution time for various volume sizes with domain decomposition done by the host processor and tiling done by the PPMC for the 3D-Stencil (Y axis: execution time in seconds; X axis: input volume dimensions; series: MicroBlaze-based decomposition vs. PPMC-based decomposition)
6 Related Work
A number of DMA Memory Controllers have been proposed in the literature and developed commercially. The Xilinx XPS Channelized DMA Controller [15] provides simple DMA services to peripherals and memory devices on the Processor Local Bus
(PLB). Lattice Semiconductor’s Scatter-Gather Direct Memory Access Controller IP
[9] and Altera’s Scatter-Gather DMA Controller core [3] provide data transfers from
non-contiguous blocks of memory by means of a series of smaller contiguous transfers. Both scatter-gather cores read a series of descriptors that specify the data to be
transferred. However, the transfer of data uses unit stride, which is not suitable for accessing
complex memory patterns. These DMA controllers are forced to follow microprocessor instructions and bus protocols. PPMC extends this model by enabling the memory
controller to work standalone as well as in an SoC environment.
Chai et al. proposed a configurable stream unit for the memory hierarchy [2]. This
unit takes descriptor blocks from the on-chip bus system. Based on these descriptor
blocks it prefetches and aligns streaming data. As with the aforementioned controllers,
the configurable stream unit does not have the ability to operate independently from a
microprocessor, which is one of the main features provided by PPMC and which allows
hardware accelerators to directly fetch data patterns from memory without processor intervention. The Impulse memory controller [7] supports application-specific optimizations through configurable physical address remapping. By remapping the physical addresses, applications can control the data to be accessed and cached. The Impulse controller works under the control of the operating system and manages physical address
remapping in software, which may not always be appropriate for hardware accelerators.
PPMC remaps and generates physical addresses in the hardware unit without producing
delay. Based on its C/C++ language support, PPMC can be integrated with any operating system that supports the C/C++ stack. A study on an FPGA-based implementation of
the N-Body (Barnes-Hut) algorithm introduced the design of traversal caches [14, 6] for
traversing tree data, in particular the Barnes-Hut octree. Initially traversal caches were
presented by Stitt et al. [14], based on exploiting the opportunity of repeated traversals
of a branch of a tree. The traversal cache fetches a branch of data into the FPGA, and if
the algorithm needs the same branch data it can be accessed from the local memory of
the FPGA. Otherwise a new branch needs to be fetched from main memory. Coole et al.
extended the idea by prefetching multiple branches into the FPGA and keeping them accessible by the compute block, thus enabling repeated traversals and parallel execution
of multiple traversals whenever possible [6].
7 Conclusion
In this work, we propose a Programmable Pattern-based Memory Controller (PPMC).
The PPMC alleviates memory-processor data access bottlenecks by allowing complex
patterns to be fetched without processor intervention. Further, to improve the on-chip
bandwidth, this work proposes hardware tiling based on programmable domain decomposition. The proposed controller can be programmed by a microprocessor using HLL
or directly from an accelerator using a special command interface. The experimental
evaluations based on the Xilinx MicroBlaze accelerator system demonstrate that the
PPMC based approach improves utilization of hardware resources and accesses physical data efficiently. In the future, we plan to embed a selective set of data access
patterns inside PPMC that would effectively eliminate the need for the user to program
PPMC for a range of applications.
8 Acknowledgments
This work has been supported by the Ministry of Science and Innovation of Spain
(CICYT) under contract TIN–2007–60625 and by the European Union Framework
Program 7 HiPEAC2 Network of Excellence. The authors would like to thank the
Barcelona Supercomputing Center and the Universitat Politecnica de Catalunya (UPC)
for their support. The authors also wish to thank the reviewers for their insightful comments.
References
[1] Amir Roth and Gurindar S. Sohi. "Effective Jump-Pointer Prefetching for Linked Data Structures". In: ISCA '99: Proceedings of the 26th Annual International Symposium on Computer Architecture (May 1999).
[2] S. M. Chai, N. Bellas, M. Dwyer and D. Linzmeier. "Stream Memory Subsystem in Reconfigurable Platforms". In: (2006).
[3] Altera Corporation. Scatter-Gather DMA Controller Core, Quartus II 9.1. November 2009.
[4] Dennis Gannon, William Jalby and Kyle Gallivan. "Strategies for Cache and Local Memory Management by Global Program Transformation". In: Journal of Parallel and Distributed Computing.
[5] Chunyang Gou, Georgi Kuzmanov and Georgi N. Gaydadjiev. "SAMS Multi-Layout Memory: Providing Multiple Views of Data to Boost SIMD Performance". In: (2010).
[6] James Coole, John Wernsing and Greg Stitt. "A Traversal Cache Framework for FPGA Acceleration of Pointer Data Structures: A Case Study on Barnes-Hut N-body Simulation". In: International Conference on Reconfigurable Computing and FPGAs (2009).
[7] John Carter, Wilson Hsieh, Leigh Stoller, Mark Swanson, Lixin Zhang, Erik Brunvand, Al Davis, Chen-Chi Kuo, Ravindra Kuramkote, Michael Parker, Lambert Schaelicke and Terry Tateyama. "Impulse: Building a Smarter Memory Controller". In: Fifth International Symposium on High Performance Computer Architecture (HPCA-5) (January 1999).
[8] Keith I. Farkas, Norman P. Jouppi and Paul Chow. "How Useful Are Non-blocking Loads, Stream Buffers, and Speculative Execution in Multiple Issue Processors?" In: (1995).
[9] Lattice Semiconductor Corporation. Scatter-Gather Direct Memory Access Controller IP Core User's Guide. October 2010.
[10] Muhammad Shafiq, Miquel Pericas, Raul de la Cruz, Mauricio Araya-Polo, Nacho Navarro and Eduard Ayguade. "Exploiting Memory Customization in FPGA for 3D Stencil Computations". In: (2009).
[11] Norm Jouppi. "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers". In: (1990).
[12] Riverside Optimizing Compiler for Configurable Computing (ROCCC). URL: http://www.jacquardcomputing.com/roccc/.
[13] Steven Derrien and Sanjay Rajopadhye. "Loop Tiling for Reconfigurable Accelerators". In: Field-Programmable Logic and Applications, 11th International Conference, FPL 2001, Belfast, Northern Ireland, UK (August 2001).
[14] Greg Stitt, Gaurav Chaudhari and James Coole. "Traversal Caches: A First Step Towards FPGA Acceleration of Pointer-Based Data Structures". In: (2008).
[15] Xilinx. Channelized Direct Memory Access and Scatter Gather. February 2010.
[16] Xilinx. Memory Interface Solutions. December 2009.