Carnegie Mellon University
Department of Electrical and Computer Engineering
Research Proposal — 18-740
Task-Aware Scheduling for Multi-Core Memory Interconnect
Tim Detwiler - Rohit Banerjee - Alekh Vaidya
1 Problem Definition and Motivation
Most modern DRAM controllers are optimized for single-processor, single-thread access and do not scale well to multi-core, multi-threaded memory access. The simplest policy is First Come-First Serve (FCFS) scheduling, which, as the name suggests, queues memory accesses in FIFO order and services them when the controller is free. This policy is not optimal on two levels. First, it carries an implicit assumption that a single process will issue requests exhibiting locality, and that exploiting row locality in memory will therefore use memory bandwidth most efficiently due to the adjacency of requests. This assumption breaks down for multi-core access and, in the worst case, can result in very inefficient use of memory bandwidth. Second, and perhaps more importantly, there is no concept of fairness: because the policy is rooted in the belief that only a single thread issues memory accesses, fairness across multiple threads is not enforced in any way. Round-robin scheduling policies are somewhat more effective in that every thread is given access, but bandwidth is not used very efficiently. Certain scheduling policies perform row-address ordering at the interconnect level so that interleaved accesses from various threads are not serviced inefficiently; this scheme uses bandwidth more efficiently, but again has no conception of fairness. For example, if the algorithm is able to order a large number of requests from one thread due to the locality of its accesses, the other threads must wait an undue amount of time to be serviced. Finally, Fairness via Source Throttling (FST) is a recent scheduling policy that dynamically limits the number of requests injected into the memory controller to enforce fairness; this protocol is also priority-aware to some extent. However, it requires additional hardware to buffer memory access requests. Our goal is to determine a static, low-overhead scheduling policy that enforces fairness among multi-threaded memory accesses while also accounting for thread priority.
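The bandwidth/fairness tension described above can be made concrete with a toy single-bank model. The sketch below contrasts FCFS with a row-hit-first policy; the request tuples, single-bank assumption, and relative service costs are illustrative assumptions, not real DRAM timings.

```python
# Toy model contrasting FCFS with row-hit-first scheduling in a single bank.
# A "row miss" forces a precharge/activate, so fewer misses means better
# bandwidth utilization. Requests are (thread_id, row) tuples; the cost
# numbers below are assumed relative service costs, not DRAM parameters.

ROW_HIT_COST, ROW_MISS_COST = 1, 3  # assumed relative costs

def fcfs(queue):
    """Service requests strictly in arrival order."""
    open_row, cycles = None, 0
    for _, row in queue:
        cycles += ROW_HIT_COST if row == open_row else ROW_MISS_COST
        open_row = row
    return cycles

def row_hit_first(queue):
    """Prefer any queued request that hits the open row, else the oldest."""
    pending, open_row, cycles = list(queue), None, 0
    while pending:
        hit = next((r for r in pending if r[1] == open_row), None)
        req = hit if hit is not None else pending[0]
        pending.remove(req)
        cycles += ROW_HIT_COST if req[1] == open_row else ROW_MISS_COST
        open_row = req[1]
    return cycles

# Two threads with good private locality whose requests interleave:
queue = [(0, 10), (1, 20), (0, 10), (1, 20), (0, 10), (1, 20)]
print(fcfs(queue), row_hit_first(queue))  # prints "18 10"
```

Note that row-hit-first wins on total cycles precisely by draining thread 0's requests first, which is exactly the fairness problem the proposal targets: thread 1's requests are deferred until thread 0's row locality is exhausted.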
2 Related Work
Ebrahimi, Lee, Mutlu, and Patt proposed a hardware-based solution that relies on finer control over the number of entries in the Miss Status Holding Registers. This allows the system to better allocate shared memory resources to requests from multiple cores, and the entire external memory access system (L2/DRAM) is controlled by this technique. The unfairness introduced in each core by the memory access scheme is evaluated, and this factor is then monitored to ensure that it does not exceed a certain threshold. Some other systems follow a similar strategy for each individual system resource, rather than using a common, system-wide control strategy.
Yuan, Bakhoda, and Aamodt investigated the bottleneck created at the DRAM memory controller and found that while there was a great degree of row locality in the memory accesses of the individual threads, the interleaving of the requests made the controller blind to that locality and created, from the controller's perspective, a stream of spatially random and disjoint memory accesses. By ordering these accesses by row (and consequently by thread), overall performance was increased because row locality was now being exploited when reading from memory.
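The unfairness monitoring used by source-throttling schemes can be sketched as follows: each core's slowdown is its memory stall time when sharing the system divided by an estimate of its stall time running alone, and unfairness is the ratio of the largest to the smallest slowdown. The threshold value and function names below are illustrative assumptions.

```python
# Sketch of a slowdown-based unfairness metric, as used by source-throttling
# schemes: slowdown_i = shared_stall_i / alone_stall_i, and unfairness is
# max(slowdown) / min(slowdown). The 1.4 threshold is an assumed value.

def unfairness(shared_stalls, alone_stalls):
    """Ratio of the worst per-core slowdown to the best."""
    slowdowns = [s / a for s, a in zip(shared_stalls, alone_stalls)]
    return max(slowdowns) / min(slowdowns)

def should_throttle(shared_stalls, alone_stalls, threshold=1.4):
    # Throttle request injection whenever unfairness exceeds the threshold.
    return unfairness(shared_stalls, alone_stalls) > threshold

# Core 0 is slowed 3.0x, core 1 only 1.2x -> unfairness 2.5, so throttle:
print(should_throttle([300, 120], [100, 100]))  # prints "True"
```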
All the proposed solutions target high performance machines. With the advent of
cost and space sensitive mobile devices, it may not always be feasible to add additional
hardware to the system for implementation of a fair memory access policy. Also, the
scheduler is unaware of the memory requirements of each of the processes before they
start executing.
3 Plan to Solve/Improve the Problem
We aim to build on the currently available techniques by introducing compile-time, static information about the processes in the system. The memory access maps of the processes would be analyzed at compile time to provide additional information to the scheduler. Statically reordering the processes so that multiple memory-intensive processes do not run at the same time may improve the overall efficiency of the system.
We propose a link between the OS process scheduler and the memory controller to share information about the delay incurred by each process due to unfairness in the memory access system. This information can then be used at run time to supplement the static scheduling.
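One simple form the static reordering could take is sketched below: given a per-process memory-intensity score extracted from its compile-time access map, pair the heaviest remaining process with the lightest for each scheduling slot so that memory-intensive processes are not co-scheduled with each other. The scores and process names are hypothetical placeholders.

```python
# Minimal sketch of compile-time co-scheduling: pair the most memory-
# intensive process with the least intensive one for each slot. The
# intensity scores would come from compile-time access-map analysis;
# the values and benchmark names used here are hypothetical.

def pair_for_slots(intensity):
    """Return (heavy, light) co-schedule pairs from a {process: score} map."""
    ranked = sorted(intensity, key=intensity.get, reverse=True)
    pairs = []
    while len(ranked) >= 2:
        pairs.append((ranked.pop(0), ranked.pop()))  # heaviest + lightest
    return pairs

intensity = {"stream": 0.9, "mcf": 0.8, "crafty": 0.2, "eon": 0.1}
print(pair_for_slots(intensity))  # prints "[('stream', 'eon'), ('mcf', 'crafty')]"
```

The run-time feedback link proposed above would then adjust these static pairings when the measured per-process delay reveals that the compile-time estimates were wrong.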
4 Experimental Design Methodology
In order to test and verify our memory system, we will use the DRAMSim2 memory simulator. This tool will allow us to model our proposed memory system and controller, as well as other proposed or available designs. The simulator is freely available, so we are free to make any requisite modifications to conform to our experimental design. DRAMSim2 was designed to test the timing characteristics of DRAM under different workloads, configurations, and environments.
The DRAMSim2 website describes the tool as "a cycle accurate model of a DRAM memory controller, the DRAM modules which comprise system storage, and the buses by which they communicate." The simulator can either be integrated with a CPU simulator or used in a stand-alone, trace-driven mode. All of these features make DRAMSim2 an ideal platform for evaluating our solution.
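For early experiments in the stand-alone mode, synthetic traces let us control interleaving directly. The helper below emits interleaved per-thread request streams; the three-column line layout (hex address, operation keyword, issue cycle) mirrors DRAMSim2's trace-based input, but the exact keywords and format accepted by a given build should be checked against its trace parser, so treat this as a sketch rather than a drop-in generator.

```python
# Hypothetical trace generator for stand-alone, trace-driven simulation.
# The "0x<addr> <op> <cycle>" layout and the P_MEM_RD/P_MEM_WR keywords
# are assumptions to be checked against the DRAMSim2 trace parser.

def make_trace(num_threads=2, reqs_per_thread=4, stride=64):
    lines = []
    cycle = 0
    for i in range(reqs_per_thread):
        for t in range(num_threads):
            # Give each thread its own address region, so row locality is
            # private to the thread but invisible to an interleaved FCFS
            # controller -- the scenario described in Section 1.
            addr = t * 0x100000 + i * stride
            op = "P_MEM_RD" if i % 2 == 0 else "P_MEM_WR"  # assumed keywords
            lines.append(f"0x{addr:08x} {op} {cycle}")
            cycle += 10  # assumed fixed inter-arrival gap
    return "\n".join(lines)

print(make_trace(num_threads=2, reqs_per_thread=2))
```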
We plan to run benchmarks on the processor simulator with the modified controller and gauge the resulting improvements in fairness as well as in the performance of the entire system.
5 Research Plan
Our research requires simulation tools to model our solution. In order to do this, we will need to become familiar with several different simulators, as well as the benchmarking software used to drive them. In light of these hurdles, we propose the following milestones to keep our research on track:
1. We will have our memory and processor simulators configured and working for a
reference system (one without our modifications).
2. We will have our memory access scheduling algorithm modeled in our simulator and
ready to be analyzed.
3. We will have all benchmarks run on the test systems in order to determine the
viability of our solution.