
Tilera ManyCore - Memory Controller Affinity & Execution Placement
Team:
Alexey Tumanov (andrew: atumanov)
Joshua A. Wise (andrew: jwise)
Problem Definition
As the number of cores per chip increases, and is projected to continue increasing at rates
roughly tracking Moore’s Law, careful placement of threads of execution relative to the memory
controllers (MCs) they use is expected to become increasingly important. At the highest level, we
would like to investigate the sensitivity of execution threads to their location with respect to the
on-chip memory controllers. The goal of this project is to answer the following set of questions:
1. Can we reproducibly observe an execution thread’s sensitivity to where it is scheduled
relative to the MCs?
2. Can we describe this dependency with a model?
3. Given the dynamic nature of a paged OS’s physical memory management, is it
possible to introduce controls that add memory controller affinity support?
4. Is there a way to export a kernel-level API that exposes memory controller usage
vectors for specified execution threads? Can a higher-level scheduler take advantage of
that in a cloud environment? (A hypothetical sketch of such an interface follows this list.)
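To make question 4 concrete, the interface might take roughly the following shape. This is purely
a hypothetical sketch: the names (mc_usage_vector, mc_usage_query, mc_affinity_set), the
structure layout, and the semantics are placeholders we invented for discussion, not an existing
Linux or Tilera interface.

/* Hypothetical sketch of the kernel-level interface from question 4.  Nothing
 * here exists in Linux or the Tilera SDK; names, calls, and layout are
 * placeholders only. */
#include <stdint.h>
#include <sys/types.h>

#define MAX_MEM_CONTROLLERS 4   /* the TilePro64 exposes four MCs */

struct mc_usage_vector {
    pid_t    pid;                            /* thread being described       */
    uint64_t accesses[MAX_MEM_CONTROLLERS];  /* DRAM accesses served per MC  */
    uint32_t weight[MAX_MEM_CONTROLLERS];    /* normalized usage, parts/1000 */
};

/* Fill 'vec' with the per-MC usage counters accumulated for 'pid' since the
 * last reset; returns 0 on success. */
int mc_usage_query(pid_t pid, struct mc_usage_vector *vec);

/* Hint the physical page allocator to prefer memory homed on controller 'mc'
 * for future allocations made by 'pid'. */
int mc_affinity_set(pid_t pid, int mc);

A higher-level scheduler in a cloud setting could poll such usage vectors to co-locate
latency-sensitive threads with the controllers they actually use.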
Motivation and Plan of Attack - High Level
Historically Inspired
We observe that, as has historically been the case, execution thread scheduling algorithms
favour throughput over latency. We believe that on-chip interconnect technology will parallel
network technology in that the gap between improvements in bandwidth and improvements in
latency will continue to widen. Memory-access-latency-sensitive algorithms thus present a
worthwhile target for investigation.
Manycore systems are emerging as a promising platform of choice for massively parallel as
well as distributed computing environments. Recent and ongoing work has focused primarily
on memory throughput and fairness, leaving memory latency out of the scheduling
equation. The increasing number of cores on a manycore platform, though, introduces a
new trend: variability in memory access latency as a function of the executing core’s
location w.r.t. the memory controllers accessed by the executing thread. We propose to
investigate applications’ sensitivity to core placement, as measured by their execution
time and memory access metrics. We hypothesize that there exists a class of memory-latency-
sensitive applications with memory-bound DRAM access patterns that could greatly benefit from
certain guarantees w.r.t. memory controller affinity. The project proposes to affirm this hypothesis
with experimental results on a real manycore platform and to distill a model that captures
the relationship between xy-coordinates, memory controller utilization weights, and the
performance of an executing thread. Finally, armed with experimental support for the hypothesis
and a model, we will explore the introduction of memory controller affinity in the Linux kernel.
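As a purely illustrative starting point for such a model (to be fit against, or discarded in favour
of, the measured data), the relationship might take the form of a distance-weighted latency
estimate, written here in LaTeX notation:

T_{\mathrm{mem}}(x,y) \approx \sum_{i=1}^{N_{\mathrm{MC}}} w_i \bigl( \alpha + \beta \, d\bigl((x,y),(x_i,y_i)\bigr) \bigr)

where w_i is the fraction of the thread's DRAM accesses served by controller i located at
(x_i, y_i), d(\cdot,\cdot) is the Manhattan hop distance on the mesh, and \alpha, \beta are
fixed and per-hop latency parameters to be estimated from the micro-benchmark data.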
Plan of Attack
We intend to begin characterizing performance with a micro-benchmark; specifically, we
intend to measure, in cycles, the request-to-response latency from each tile to each
memory controller in the system. At this stage, we will take measures to avoid the
system’s Dynamic Distributed Cache (the global L3 cache composed of the L2 caches of all
tiles). In this phase, we will also measure bandwidth from each tile to each memory controller,
although placement is expected to have minimal impact on bandwidth. To keep the Linux
kernel from affecting the measurements, the first-phase micro-benchmark will be run directly
on the Tile hypervisor.
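A minimal sketch of such a micro-benchmark, written in C, appears below. The pointer-chasing
structure is the standard latency-measurement technique; the pieces that touch Tilera specifics
are stand-ins and labeled as such, since the real code must use the Tile hypervisor’s facilities to
read the cycle counter and to map a buffer onto a chosen controller with the DDC bypassed.

/* Sketch of the phase-1 latency micro-benchmark: time a chain of dependent
 * loads from the current tile to one memory controller.  Tilera-specific
 * pieces (cycle counter, MC-homed uncached mapping) are stand-ins. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_MCS    4            /* TilePro64 memory controllers            */
#define CHAIN_LEN  (1 << 16)    /* pointer-chase hops per measurement      */

static volatile void *sink;     /* keeps the load chain from being elided  */

/* ASSUMPTION: stand-in for the TilePro cycle counter; in the real benchmark
 * this is a cycle-count register read provided by the Tilera toolchain. */
static uint64_t read_cycles(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* ASSUMPTION: in the real benchmark this requests memory homed on controller
 * 'mc' with DDC/L2 caching bypassed, via the Tile hypervisor; malloc() stands
 * in so the sketch builds on a stock toolchain. */
static void **alloc_on_mc(int mc, size_t len)
{
    (void)mc;
    return malloc(len * sizeof(void *));
}

/* Link the buffer into one random cycle so every load depends on the previous
 * one, defeating prefetching and exposing raw request-to-response latency. */
static void build_chain(void **buf, size_t len)
{
    size_t *order = malloc(len * sizeof(*order));
    for (size_t i = 0; i < len; i++)
        order[i] = i;
    for (size_t i = len - 1; i > 0; i--) {          /* Fisher-Yates shuffle  */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < len; i++)                /* link in shuffled order */
        buf[order[i]] = &buf[order[(i + 1) % len]];
    free(order);
}

/* Average per-load latency from the current tile to controller 'mc'. */
static double measure_latency(int mc)
{
    void **buf = alloc_on_mc(mc, CHAIN_LEN);
    build_chain(buf, CHAIN_LEN);
    void **p = buf;
    uint64_t start = read_cycles();
    for (size_t i = 0; i < CHAIN_LEN; i++)
        p = (void **)*p;                            /* serialized loads      */
    uint64_t end = read_cycles();
    sink = p;
    free(buf);
    return (double)(end - start) / CHAIN_LEN;
}

int main(void)
{
    for (int mc = 0; mc < NUM_MCS; mc++)
        printf("MC %d: %.1f units/load\n", mc, measure_latency(mc));
    return 0;
}

Running this once per tile, with the output collected per (tile, MC) pair, yields the latency grid
that Milestone 1 calls for.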
The second phase of the project will collect data on the memory access patterns of various
applications running on the system. We intend to measure the DRAM access patterns of
specific applications when they run unconstrained with respect to memory controllers. This can
be done by modifying the Linux kernel’s paging infrastructure to periodically mark all pages
non-present and to keep per-page statistics of the resulting page-ins.
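The bookkeeping for this phase could look roughly like the sketch below. It is not actual Linux
kernel code; in particular, the physical-address-to-controller mapping shown is an assumption
and must be replaced with the real layout reported by the TilePro64 hypervisor.

/* Sketch of the phase-2 accounting: when a page we deliberately marked
 * non-present is touched again, charge the soft page-in to the memory
 * controller that homes the faulting physical page. */
#include <stdint.h>

#define NUM_MCS 4

struct task_mc_stats {
    uint64_t pageins[NUM_MCS];      /* soft page-ins served per controller  */
};

/* ASSUMPTION: simple interleaving of physical memory across the controllers;
 * the genuine address-to-MC layout comes from the hypervisor configuration. */
static inline int paddr_to_mc(uint64_t paddr)
{
    return (int)((paddr >> 24) & (NUM_MCS - 1));
}

/* Called from the modified fault path on every tracked soft page-in. */
static inline void account_soft_pagein(struct task_mc_stats *st, uint64_t paddr)
{
    st->pageins[paddr_to_mc(paddr)]++;
}

Normalizing these per-task counters yields the memory controller weight vectors referred to in
Milestone 2.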
The third phase will rerun the micro-benchmark on Linux, using the “cpuset” mechanism to
localize pages to a given memory controller. We will then run a larger macro-benchmark of
the previously characterized memory-intensive applications, tracking their performance with
the same methodology used for the micro-benchmark. Should either the micro-benchmark or
the macro-benchmark fail to show a change when run on Linux, we will run smaller analyses
in the tile-sim micro-architectural simulator to determine where the latency is being hidden.
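A minimal sketch of the cpuset plumbing for this phase is shown below. It assumes cpuset is
mounted at /dev/cpuset (typical for kernels of this vintage), that each memory controller is
visible to Linux as a NUMA memory node, and that the benchmark binary is called
./latency_bench; all three are assumptions to be verified on the Tilera Linux port.

/* Sketch of the phase-3 setup: confine a benchmark to one tile (cpu) and to
 * memory served by one controller (memory node) using the cpuset interface. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(val, f);
    fclose(f);
}

int main(int argc, char **argv)
{
    const char *cpu = (argc > 1) ? argv[1] : "0";   /* tile to run on        */
    const char *mem = (argc > 2) ? argv[2] : "0";   /* memory node (MC)      */
    char pid[32];

    mkdir("/dev/cpuset/mcbench", 0755);             /* child cpuset          */
    write_str("/dev/cpuset/mcbench/cpus", cpu);
    write_str("/dev/cpuset/mcbench/mems", mem);
    write_str("/dev/cpuset/mcbench/mem_exclusive", "1");

    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    write_str("/dev/cpuset/mcbench/tasks", pid);    /* move ourselves in     */

    /* All pages the benchmark allocates from here on should come from the
     * selected memory node only. */
    execl("./latency_bench", "./latency_bench", (char *)NULL);
    perror("execl");
    return 1;
}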
Scope Considerations
We would like to leave the global fairness of scheduling outside the scope of this project. Very
recent and ongoing work by Onur Mutlu’s students addresses this in great detail [1].
Previous Work
Recent studies (e.g. [1]) primarily concern themselves with fairness and throughput
improvement for workloads that share one or more memory controllers. We differentiate our
proposed project by investigating memory access latency and its variability as a function
of the (x,y) placement of execution on a 2D grid of cores. The closest piece of work to date
that also directly addresses memory latency was published this month, September 2010,
and states that no prior work has “examined the effects of data placement among
multiple MCs” [2] on manycore architectures. Its approach is page coloring and migration
as a mechanism for latency amelioration. We would like to re-evaluate the variable-latency
hypothesis on real hardware that has only recently become available and with real
applications. We further differentiate our project by its fundamental focus on execution
placement, as opposed to data placement, as an approach to MC-distance-induced latency.
The intention is to eventually compare the two approaches with respect to the metrics of
interest to real applications (e.g. runtime performance).
Experimental Setup
We will be using real Tilera TilePro64 hardware, generously made available by Tilera for
research purposes, together with the system software stack that ships with it. Initially, we will
work directly on top of the TilePro64 hypervisor; the research will eventually move to the Linux
kernel running on top of the hypervisor.
Milestones
Milestone 1 (Oct 13)
● microbenchmarks to validate latency variability hypothesis
○ code lives directly on hypervisor to remove Linux from the equation
● produce graphs of latency measurements for each tile to each controller
● prove/calculate correlation
Milestone 2 (Nov 1)
● write code to collect experimental data on physical page access
● extract MC weight vectors - determine the extent of variability in these vectors across
multiple runs
Milestone 3 (Nov 17-19)
● reproduce the microbenchmark on Linux with cpusets
● macrobenchmarks using specific applications (gcc, memcached, ...) on Linux, using
cpusets to localize memory to specific memory controllers
● produce graphs of some application performance metric as a function of (x,y)
● prove/calculate correlation
Final Report (Dec 12)
● full analysis performed
● paper written
References
1. Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, “Thread Cluster
Memory Scheduling: Exploiting Differences in Memory Access Behavior”, to be submitted to
MICRO 2010.
2. Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, and Al Davis,
“Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers”,
Proceedings of the 19th International Conference on Parallel Architectures and Compilation
Techniques (PACT-19), Vienna, September 2010 (best paper award).
On-chip network policies:
3. Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das,
"Application-Aware Prioritization Mechanisms for On-Chip Networks". Proceedings of the
42nd International Symposium on Microarchitecture (MICRO), pages 280-291, New York, NY,
December 2009.
4. Boris Grot, Stephen W. Keckler, and Onur Mutlu, "Topology-aware Quality-of-Service
Support in Highly Integrated Chip Multiprocessors", to appear in Proceedings of the 6th
Annual Workshop on the Interaction between Operating Systems and Computer Architecture
(WIOSCA), Saint-Malo, France, June 2010.