Tilera ManyCore - Memory Controller Affinity & Execution Placement

Team: Alexey Tumanov (andrew: atumanov), Joshua A. Wise (andrew: jwise)

Problem Definition

As the number of cores on a chip increases, and is projected to continue doing so at rates roughly tracking Moore's Law, careful placement of threads of execution relative to the memory controllers (MCs) they use is expected to become increasingly important. At the highest level, we would like to investigate the sensitivity of execution threads to their location with respect to the on-chip memory controllers. The goal of this project is to answer the following set of questions:
1. Can we reproducibly observe an execution thread's sensitivity to where it is scheduled relative to the MCs?
2. Can we describe this dependency with a model?
3. Given the dynamic nature of a paged OS's physical memory management, is it possible to introduce controls that add memory controller affinity support?
4. Is there a way to export a kernel-level API that exposes memory controller usage vectors for specified execution threads? Can a higher-level scheduler take advantage of that in a cloud environment?

Motivation and Plan of Attack - High Level

Historically Inspired

We observe that, as has historically been the case, execution thread scheduling algorithms favour throughput over latency. We believe that on-chip interconnect technology will parallel network technology in the widening of the gap between improvements in latency and improvements in bandwidth. Memory-access-latency-sensitive algorithms therefore present a worthwhile target for investigation.

Manycore systems are emerging as a promising platform of choice for massively parallel as well as distributed computing environments. Ongoing and recent work has focused primarily on memory throughput and fairness, leaving memory latency outside of the scheduling equation. The increasing number of cores on a manycore platform, though, introduces a new trend: variability in memory access latency as a function of the executing core's location with respect to the memory controllers accessed by the executing thread. We propose to investigate applications' sensitivity to core placement, as measured by their execution time and memory access metrics. We hypothesize that there exists a class of memory-latency-sensitive applications with memory-bound access patterns that could greatly benefit from certain guarantees with respect to memory controller affinity. The project proposes to affirm this hypothesis with experimental results on a real manycore platform and to distill a model that captures the relationship between xy-coordinates, memory controller utilization weights, and the performance of an executing thread. Finally, armed with experimental support for the hypothesis and a model, we will explore the introduction of memory controller affinity in the Linux kernel.

Plan of Attack

We intend to begin characterization of performance with a micro-benchmark; specifically, we intend to measure cycle-by-cycle latency of request-to-response time from each tile to each memory controller on the system (a sketch of such a latency probe appears below). At this stage, we will take measures to avoid the system's Dynamic Distributed Cache (the global L3 cache composed of the L2 caches of all tiles). In this phase, we will also measure bandwidth from each tile to each memory controller, although the impact on bandwidth is expected to be minimal.
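The following is a minimal, user-space sketch of the kind of pointer-chasing latency probe we have in mind, written against standard POSIX timing and intended only as an illustration. The hypervisor build would read the tile cycle counter instead of clock_gettime(), home the buffer on a specific memory controller, and opt out of Dynamic Distributed Cache participation - all Tilera-specific steps omitted here. The cache-line and working-set sizes are assumptions chosen to overflow the on-chip caches.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE   64                            /* assumed cache-line size (bytes) */
#define NLINES (64 * 1024 * 1024 / LINE)     /* 64 MiB working set              */
#define ITERS  (16 * 1024 * 1024)            /* dependent loads to time         */

int main(void)
{
    char   *buf   = malloc((size_t)NLINES * LINE);
    size_t *order = malloc((size_t)NLINES * sizeof(*order));
    size_t  i;
    if (!buf || !order) { perror("malloc"); return 1; }

    /* Random permutation of line indices -> one long dependent pointer chain. */
    for (i = 0; i < NLINES; i++)
        order[i] = i;
    srand(42);
    for (i = NLINES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (i = 0; i < NLINES; i++)
        *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % NLINES] * LINE;

    void *p = buf + order[0] * LINE;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        p = *(void **)p;                     /* each load depends on the previous */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Printing p keeps the compiler from eliding the chase. */
    printf("avg load-to-use latency: %.1f ns (sink %p)\n", ns / ITERS, p);
    free(order);
    free(buf);
    return 0;
}

Run from different (x, y) tiles against memory homed on different controllers, the same loop is what would produce the per-tile latency map called for in Milestone 1.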
To avoid the Linux kernel's impact on performance, the first-phase micro-benchmark will be run directly on the Tile hypervisor.

The second phase of the project will collect data on the memory access patterns of various applications running on the system. We intend to measure the DRAM hit patterns of specific applications when they run unconstrained with respect to memory controllers. This can be done by modifying the Linux kernel's paging infrastructure to mark all pages non-present on a regular basis and to keep per-page statistics of page-ins.

The third phase will rerun the micro-benchmark on Linux, using the "cpuset" mechanism to localize pages to a given memory controller. We will then run a large macro-benchmark of the previously characterized memory-intensive applications, trending their performance with the same methodology used for the micro-benchmark. Should either the micro-benchmark or the macro-benchmark fail to show a change when run on Linux, we will run smaller analyses in the tile-sim micro-architectural simulator to determine where the latency is being hidden.

Scope Considerations

We would like to leave the global fairness of scheduling outside the scope of this project. Very recent and ongoing work by Onur's students addresses it in great detail [1].

Previous Work

Recent studies (e.g. [1]) primarily concern themselves with fairness and throughput improvement for a workload that shares one or more memory controllers. We differentiate our proposed project by investigating memory access latency and its variability as a function of the (x, y) placement of execution on a 2D grid of cores. The closest related work to date that also directly addresses memory latency was published this month, September 2010, and states definitively that no prior work "examined the effects of data placement among multiple MCs" [2] on manycore architectures. Their approach is page coloring and migration as a mechanism for latency amelioration. We would like to re-evaluate the variable-latency hypothesis with real hardware that has only recently become available and with real applications. We further differentiate our project by its fundamental focus on execution placement, as opposed to data placement, as an approach to MC-distance-induced latency. The intention is to eventually compare the two approaches with respect to the metrics that matter to real applications (e.g. runtime performance).

Experimental Setup

We will be using real Tilera TilePro64 hardware, generously made available by Tilera for research purposes, and the system software stack that comes with it. Initially, we will work directly on top of the TilePro64 hypervisor; the research will eventually move to the Linux kernel running on top of the hypervisor.

Milestones

Milestone 1 (Oct 13)
● microbenchmarks to validate the latency variability hypothesis
  ○ code lives directly on the hypervisor to remove Linux from the equation
● produce graphs of latency measurements from each tile to each controller
● prove/calculate correlation

Milestone 2 (Nov 1)
● write code to collect experimental data on physical page access
● extract MC weight vectors - determine the extent of variability in these vectors across multiple runs

Milestone 3 (Nov 17-19)
● reproduce the microbenchmark on Linux with cpusets
● macrobenchmarks using specific applications (gcc, memcached, ...) on Linux, using cpusets to localize memory controllers (a configuration sketch follows this milestone)
● produce graphs of some application performance metric as a function of (x, y)
● prove/calculate correlation
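Cpusets are configured through a mounted filesystem rather than a dedicated system call. Below is a minimal sketch of the localization step we have in mind, assuming a legacy cpuset mount at /dev/cpuset and that the Tilera Linux port exposes each memory controller as a separate memory node; the cpuset name, tile number, node number, and benchmark binary are placeholders rather than verified TilePro64 values.

/*
 * Sketch: confine the current task to one tile and one memory node via the
 * cpuset filesystem, then exec the benchmark inside that cpuset.
 * Assumes "mount -t cpuset cpuset /dev/cpuset" has been done and that each
 * memory controller appears as a distinct memory node (an assumption).
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(val, f);
    fclose(f);
}

int main(void)
{
    const char *cs = "/dev/cpuset/mc_affinity_test";  /* hypothetical cpuset */
    char path[128], pid[32];

    mkdir(cs, 0755);                        /* creating a directory creates a cpuset */

    snprintf(path, sizeof(path), "%s/cpus", cs);
    write_file(path, "12");                 /* run only on CPU (tile) 12             */
    snprintf(path, sizeof(path), "%s/mems", cs);
    write_file(path, "1");                  /* allocate only from memory node 1      */
    snprintf(path, sizeof(path), "%s/tasks", cs);
    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    write_file(path, pid);                  /* move this task into the cpuset        */

    /* Pages faulted in from here on should come from node 1, and the task
       should be scheduled only on tile 12. */
    execlp("./membench", "membench", (char *)NULL);   /* placeholder benchmark */
    perror("execlp");
    return 1;
}

The same configuration can be driven from a shell; for the macro-benchmarks, the application under test would be launched inside such a cpuset while the tile coordinate and, separately, the memory node are swept across the grid.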
Final Report (Dec 12)
● full analysis performed
● paper written

References

1. Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior", to be submitted to MICRO 2010.
2. Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, and Al Davis, "Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers", 19th International Conference on Parallel Architectures and Compilation Techniques (PACT-19), Vienna, September 2010 (best paper award).
On-chip network policies:
3. Reetuparna Das, Onur Mutlu, Thomas Moscibroda, and Chita R. Das, "Application-Aware Prioritization Mechanisms for On-Chip Networks", Proceedings of the 42nd International Symposium on Microarchitecture (MICRO), pages 280-291, New York, NY, December 2009.
4. Boris Grot, Stephen W. Keckler, and Onur Mutlu, "Topology-Aware Quality-of-Service Support in Highly Integrated Chip Multiprocessors", to appear in Proceedings of the 6th Annual Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA), Saint-Malo, France, June 2010.