Towards Optimally Scheduled Manycore Clouds
Alexey Tumanov, Wei Lin
18-742, Spring 2011
Prof. Onur Mutlu

Papers Presented
- "An Operating System for Multicore and Clouds: Mechanisms and Implementation", David Wentzlaff et al.
- "Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers", Manu Awasthi et al.
- "Contention-aware Application Mapping for Network-on-Chip Communication Architectures", Chen-Ling Chou & Radu Marculescu

Grand unifying theme: optimally scheduled manycore clouds.

Cloud + Manycore: Common Challenges
- scalability
- managing elasticity of both resource demand and resource supply
- fault isolation
- programmability: coherence of the programming & communication model
- workload scheduling* (*our contribution to the pool of common challenges)

Factored Operating System
Vision: a single-system-image OS that manages hardware resources on a local manycore chip as well as across a cluster of such chips.
Trends:
- the rise and fall of the single core
- application performance scaling
- cloud computing
- manycore architectures

Single System Image: Advantages
- ease of administration: manage one OS
- transparent sharing: e.g. memory & disk -- paging across machines
- consistent communication & programming model: no more fractured communication/programming abstractions (e.g. shared memory vs. message passing)
- easy process migration: agility in a dynamically changing environment
- debugging of distributed multi-VM applications

FOS: Design Principles
- space multiplexing replaces time multiplexing: core availability will soon exceed the number of threads
- OS functionality is factored into services, each implemented as a parallel, distributed service; libfos is used to communicate with the services
- utilization adapts to scale with demand: services are monitored, and services under higher demand are given more resources (cores)
- fault tolerance: a watchdog process monitors services; in the event of a failure, new service instances are created and the naming service re-links the communication channels

FOS: Architecture Overview
Microkernel:
- one instance per core
- provides a protected messaging layer, a name cache, basic context switching, and an API for address-space modification and thread creation
- memory management & scheduling live in user space
Messaging -- inter-process communication:
- process-to-process API; shared data is explicit
- implementation- and distance-agnostic
- mailboxes (inbound/outbound/internal)
(see the messaging sketch after the naming slide)

FOS: Architecture Overview: Naming
- all servers are assigned a name; the nameserver determines message routing
- globally coherent namespace with frequent updates
- resource pools: servers may share the same name; selection policy: round robin, closest server, <your idea here>...
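Below is a minimal sketch of how a client might resolve a name to one member of a server fleet and drop a message into its mailbox. All identifiers here (name_entry, mailbox, resolve, fos_send, "/srv/fs") are illustrative inventions for this presentation, not the actual fos API; real fos messaging is implementation- and distance-agnostic, whereas this toy models only the local round-robin case.

```c
/* Toy model of FOS-style naming + mailbox messaging.
 * All names and types here are hypothetical illustrations. */
#include <stdio.h>
#include <string.h>

#define MAX_MEMBERS 4
#define MSG_BYTES   64

/* A mailbox: a single-slot inbound queue, kept trivial for the sketch. */
typedef struct {
    char data[MSG_BYTES];
    int  full;
} mailbox;

/* A fleet: several servers sharing one name; round-robin selection. */
typedef struct {
    const char *name;
    mailbox    *members[MAX_MEMBERS];
    int         count;
    int         next;        /* round-robin cursor */
} name_entry;

/* Resolve a name to one fleet member, rotating for load balance. */
static mailbox *resolve(name_entry *e)
{
    mailbox *m = e->members[e->next];
    e->next = (e->next + 1) % e->count;
    return m;
}

/* Send: an explicit copy into the destination mailbox -- shared data
 * is explicit, and the caller never learns where the server runs. */
static int fos_send(name_entry *e, const char *msg)
{
    mailbox *dst = resolve(e);
    if (dst->full)
        return -1;                       /* back-pressure */
    strncpy(dst->data, msg, MSG_BYTES - 1);
    dst->data[MSG_BYTES - 1] = '\0';
    dst->full = 1;
    return 0;
}

int main(void)
{
    mailbox fs0 = {0}, fs1 = {0};
    name_entry fs = { "/srv/fs", { &fs0, &fs1 }, 2, 0 };

    fos_send(&fs, "open /etc/motd");     /* lands in fs0 */
    fos_send(&fs, "read fd=3");          /* round-robins to fs1 */
    printf("fs0: %s\nfs1: %s\n", fs0.data, fs1.data);
    return 0;
}
```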
OS Services
- a fleet of spatially distributed, cooperating servers
- processes are added/removed to scale with demand
- implementing this would be complex without support from FOS: a cooperative multithreaded programming model, remote procedure calls, data structures

FOS: Architecture Overview (diagram slide)

FOS: Preliminary Results
Operating system overhead and performance:

Null system calls (Table 1: intramachine response, in cycles)
        min     avg     max     stddev
fos     11058   11892   28700   513.1
Linux   1321    1328    9985    122.8

Remote procedure call overhead (Table 2: intermachine response, in ms)
        min     avg     max     stddev
fos     0.232   0.353   0.548   0.028
Linux   0.199   0.274   0.395   0.0491

Ping response (Table 3: in ms)
        min     avg     max     mdev
fos     0.029   0.031   0.059   0.008
Linux   0.017   0.032   0.064   0.009

Process spawn time (Table 4: process creation time, in ms)
        min     avg     max     stddev
Local   0.977   1.549   10.356  1.881
Remote  1.615   4.070   13.337  4.601

Performance Optimization: Data Placement
Vision: improve application performance through intelligent page placement and migration techniques.
Trends:
- the increase in cores makes the memory controller a bottleneck
- the number of memory controllers cannot scale with the cores, so each controller must service more cores
- latency due to memory controller policy is not necessarily trivial compared to the DRAM access itself

Performance Optimization: Data Placement (cont.)
Requirements:
- minimize access latency to the requested page: distance, network load, queuing delay at the MC, bank and rank contention, row-buffer hit rate
- do not significantly degrade accesses to other pages on the same MC
- must be dynamically determinable at runtime
Techniques:
- adaptive first touch (AFT) page placement
- dynamic migration of data

Adaptive First Touch Page Placement
- threads are assigned to processors arbitrarily
- on first touch, a cost function over candidate MCs is minimized:
  - load_j = average queuing delay at MC_j
  - rowhits_j = average rate of row-buffer hits at MC_j
  - distance_j = number of hops from the core to MC_j
  - each factor is weighted by importance
- a cache of the last 5 mappings speeds up the decision when more than one fault triggers within 5000 cycles
(see the placement sketch after the correctness slide)

Page Migration
- stable applications are considered in epochs
- donor MC: an MC that drops 10% or more in row-buffer hit rate from the last epoch
- recipient MC: physically proximal to the donor, with the lowest number of page conflicts in the last epoch
- once migrated, a page is not touched for at least 2 epochs
(see the donor/recipient sketch below)

Maintaining Correctness
- TLB update/invalidate
- cache invalidation with dirty writebacks
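As a point of reference, here is a toy version of adaptive first touch in C: pick the MC that minimizes a weighted cost over queuing load, hop distance, and row-buffer hit rate. The weights and the exact shape of the cost function (linear combination, hits subtracted) are assumptions for illustration -- the slides say each factor is weighted by importance but do not give the formula.

```c
/* Toy adaptive-first-touch placement: choose the memory controller
 * minimizing a weighted cost.  The weights and the combination are
 * illustrative assumptions, not the paper's tuned formula. */
#include <stdio.h>

#define NUM_MC 4

typedef struct {
    double load;      /* average queuing delay at this MC   */
    double rowhits;   /* average row-buffer hit rate (0..1) */
    int    distance;  /* hops from the faulting core        */
} mc_stats;

/* cost_j = a*load_j + b*distance_j - c*rowhits_j  (lower is better) */
static double cost(const mc_stats *m, double a, double b, double c)
{
    return a * m->load + b * (double)m->distance - c * m->rowhits;
}

/* On first touch, place the page at the cheapest controller. */
static int place_page(const mc_stats mc[NUM_MC])
{
    int best = 0;
    double best_cost = cost(&mc[0], 1.0, 0.5, 2.0);
    for (int j = 1; j < NUM_MC; j++) {
        double cj = cost(&mc[j], 1.0, 0.5, 2.0);
        if (cj < best_cost) { best_cost = cj; best = j; }
    }
    return best;
}

int main(void)
{
    mc_stats mc[NUM_MC] = {
        { 8.0, 0.60, 1 },   /* close but loaded          */
        { 2.0, 0.40, 3 },   /* idle, poor locality       */
        { 3.0, 0.80, 2 },   /* good row-buffer behaviour */
        { 9.0, 0.20, 4 },   /* worst on every axis       */
    };
    printf("first touch -> MC%d\n", place_page(mc));
    return 0;
}
```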
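And a companion sketch of the epoch-based donor/recipient selection. The 10% hit-rate-drop threshold comes from the slides; picking the recipient by fewest page conflicts with proximity as tie-break, the grid coordinates, and the data layout are simplifying assumptions.

```c
/* Toy epoch-based donor/recipient selection for page migration. */
#include <stdio.h>

#define NUM_MC 4

typedef struct {
    double hits_prev;   /* row-buffer hit rate, last epoch */
    double hits_now;    /* row-buffer hit rate, this epoch */
    int    conflicts;   /* page conflicts this epoch       */
    int    x, y;        /* controller position on the grid */
} mc_epoch;

static int manhattan(const mc_epoch *a, const mc_epoch *b)
{
    int dx = a->x - b->x, dy = a->y - b->y;
    return (dx < 0 ? -dx : dx) + (dy < 0 ? -dy : dy);
}

/* A donor lost 10% or more of its row-buffer hit rate this epoch. */
static int is_donor(const mc_epoch *m)
{
    return m->hits_now <= 0.9 * m->hits_prev;
}

/* Recipient: fewest page conflicts, proximity breaks ties (assumed). */
static int pick_recipient(const mc_epoch mc[NUM_MC], int donor)
{
    int best = -1;
    for (int j = 0; j < NUM_MC; j++) {
        if (j == donor) continue;
        if (best < 0 ||
            mc[j].conflicts < mc[best].conflicts ||
            (mc[j].conflicts == mc[best].conflicts &&
             manhattan(&mc[j], &mc[donor]) <
             manhattan(&mc[best], &mc[donor])))
            best = j;
    }
    return best;
}

int main(void)
{
    mc_epoch mc[NUM_MC] = {
        { 0.80, 0.60, 40, 0, 0 },   /* hit rate fell 25%: donor */
        { 0.70, 0.72, 10, 1, 0 },
        { 0.65, 0.66,  5, 3, 3 },
        { 0.75, 0.74, 25, 0, 1 },
    };
    for (int j = 0; j < NUM_MC; j++)
        if (is_donor(&mc[j]))
            printf("MC%d donates to MC%d\n", j, pick_recipient(mc, j));
    return 0;
}
```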
Where are you getting all this data?

Counters
- row-hit / row-miss / row-conflict counters
- a system-level daemon polls for this information
- queuing delay: a static baseline memory latency plus runtime monitoring of the current latency

Results: Summary
- adaptive first touch: 17% improvement
- dynamic page migration: 34% improvement
- improved row-buffer hit rates: 16.6% (first touch), 22.7% (migration)
- overhead: up to a 15.6% increase in network traffic, about 4.4% on average

Recap
- lots of execution containers to schedule
- larger-granularity time quanta for running threads
- optimal relative placement of threads and data
- data placement & page migration is just one approach; thread placement is the other -- arguably superior
- the truth is probably somewhere in the middle: data and thread migration form an attractive tradeoff space

Performance Optimization: Thread Placement
Goal: improve application performance by minimizing
- communication energy consumption due to NoC link traversals
- inter-tile contention* (*the authors' contribution)
Proposed solution:
- an integer linear programming (ILP) formulation -- NP-hard
- techniques to relax the complexity of the problem: make it an LP, branch & bound optimizations
Result: improvement in average packet latency and in throughput.

Logical Core Placement: Problem Statement
(problem formulation given as equations on the slide)

Logical Core Placement: Algorithm
- the ILP problem was relaxed to an LP, i.e. the integer-variable requirement is removed
- the resulting heuristic algorithm is non-intuitive
(a toy greedy alternative is sketched below)
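The relaxed-LP heuristic itself is in the paper; as an intuition pump, here is a toy greedy mapper that places threads one at a time on the free tile minimizing hop-weighted traffic to already-placed threads. The traffic matrix, mesh size, and greedy placement order are all illustrative assumptions -- this is not the authors' algorithm.

```c
/* Toy greedy thread-to-tile mapper on a 2x2 mesh.  An illustration
 * only -- the paper solves this with an ILP relaxed to an LP plus
 * branch & bound. */
#include <stdio.h>

#define N_THREADS 4
#define GRID      2           /* GRID x GRID mesh of tiles */
#define N_TILES   (GRID * GRID)

/* traffic[i][j]: communication volume between threads i and j */
static const int traffic[N_THREADS][N_THREADS] = {
    { 0, 9, 1, 0 },
    { 9, 0, 2, 1 },
    { 1, 2, 0, 8 },
    { 0, 1, 8, 0 },
};

static int hops(int a, int b)   /* Manhattan distance on the mesh */
{
    int ax = a % GRID, ay = a / GRID, bx = b % GRID, by = b / GRID;
    return (ax > bx ? ax - bx : bx - ax) + (ay > by ? ay - by : by - ay);
}

int main(void)
{
    int tile_of[N_THREADS];
    int used[N_TILES] = {0};

    /* Greedy placement in index order (a real heuristic would sort
       threads by total traffic volume first). */
    for (int t = 0; t < N_THREADS; t++) {
        int best_tile = -1, best_cost = 0;
        for (int tile = 0; tile < N_TILES; tile++) {
            if (used[tile]) continue;
            int cost = 0;
            for (int p = 0; p < t; p++)   /* already-placed threads */
                cost += traffic[t][p] * hops(tile, tile_of[p]);
            if (best_tile < 0 || cost < best_cost) {
                best_tile = tile;
                best_cost = cost;
            }
        }
        tile_of[t] = best_tile;
        used[best_tile] = 1;
    }

    for (int t = 0; t < N_THREADS; t++)
        printf("thread %d -> tile %d\n", t, tile_of[t]);
    return 0;
}
```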
Thread Placement: Results
- achieves within 5% of the minimal cost produced by an ILP solver
  (caveat: the paper says nothing about the accuracy of the mapping-function solution)
- comparing energy-aware and contention-aware mapping:
  - throughput improvements of up to 24% are possible at the expense of an increase in average packet latency
  - throughput improvement comes at the expense of increased communication energy
- for real applications: up to 20% throughput gained by giving up up to 8.8% in NoC latency

Caveats
- biggest caveat of all: the model captures only path-based contention
- nothing in the model prevents us from also including source- and destination-based contention

Summary
- we are observing a confluence of trends: cloud computing (IaaS) and multicore
- a tangible intersection of challenges
- our belief: it is possible to develop a set of techniques/algorithms that addresses those challenges for both
- Factored OS: a single-system-image OS powering manycore chips across a cluster of datacentre machines
- a need for optimality in workload scheduling: data placement and thread placement

The Blank Slide
(the end of the presentation :-) )

Simulation Overview
- Virtutech Simics; 16 cores; distributed L2 cache; 4 memory controllers; DRAMSim framework
- workloads: PARSEC, SPECjbb2005, Stream
- migration of 10 pages; an epoch is 5 million cycles; TLB invalidation takes 5000 cycles; 500 million instructions

Results: Relative Throughput (chart)
Results: Row Buffer Hits (chart)

Data Placement vs Thread Placement
Data placement advantages:
- does not necessitate thread migration
- preserves cached state
Data placement disadvantages:
- consumes NoC bandwidth to migrate pages
- *** will not scale with the growing number of cores: if MCs are positioned on the perimeter of the grid, e.g., data placement fails to improve performance for threads at the centre of the grid
Thread placement advantages:
- able to achieve minimal communication cost
- more flexible: captures communication-cost minimization for core-to-MC as well as core-to-core traffic
Thread placement disadvantages:
- page migration is the preferred method for traditional NUMA systems

Thread Migration Advantages (cont.)
- no need to flush the TLB on all cores
- no need to invalidate the migrated page's cache lines on all cores
- scales with the increase in cores much better than page migration
- does not require the use of NoC bandwidth for page migration OR a dedicated MC-to-MC communication fabric