LightSpeed: A Many-core Scheduling Algorithm

Karl Naden (kbn@cs.cmu.edu), Wolfgang Richter (wolf@cs.cmu.edu), Ekaterina Taralova (etaralova@cs.cmu.edu)

September 27, 2010

1 Introduction

The world is heading towards many-core architectures due to many well-known and important present-day research issues: power consumption, clock speed limits, critical path lengths, etc. While existing many-core machines have traditionally been handled in the same way as SMPs, this magnitude of parallelism introduces several fundamental challenges at the architectural level, which translate into novel challenges in the design of the software stack for these platforms.

Many-core architectures cause new problems: shared caches, shared memory controllers, shared communication paths (communication between computations now contends for the same links), and other shared-resource contention.

Current many-core scheduling research literature and operating system development have not provided a satisfactory solution to the scheduling problem. This is partly because the hardware is not yet widespread (there is little pressure to solve a problem that does not yet exist in practice), and partly because many of the issues are hard to solve, especially in the general case.

2 Problem

Many-core scheduling presents a non-trivial challenge that modern schedulers within operating systems have not solved satisfactorily. Given a set of threads to work on, the goal of a scheduler is to minimize the makespan: the time it takes to complete all threads. Figure 1 shows current experimental results from cutting-edge research literature using several simple scheduling algorithms. Embarrassingly parallel workloads imply linear speedups, that is, a linear reduction in makespan as the number of computational units increases. However, the research literature shows that achieving linear speedup on real-world architectures is hard, and finding an optimal schedule is NP-complete [1].

[Figure 1: XviD speedup varying the number of cores (4 hardware threads per core); speedup over a single core (4 threads) for the perfect, nosteal, 3steal, and steal policies. From: [2]]

For example, the only algorithm discussed in [2] that gives linear speedup is the nosteal algorithm: cores have separate queues and are not allowed to steal work from other cores. nosteal is the simplest algorithm in the single-queue-per-core scenario, yet it has a clear Achilles' heel: the slowest core defines the achievable speedup.

The other two algorithms, 3steal (stealing is allowed only from the queues of the 3 closest cores) and steal (stealing is allowed from any core's queue), show a sublinear dropoff in speedup.

Figure 1 shows a clear gap where other, as yet unstudied algorithms could exist, and where LightSpeed should exist. In fact, the authors of [2] have already attempted to improve upon their existing schedulers in [3].
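To make these policies concrete, the following sketch simulates nosteal, 3steal, and unrestricted steal over a set of per-core task queues and reports the resulting makespan and speedup. It is only an illustrative model: the task durations, the round-robin queue assignment, and the steal order are our own assumptions and are not taken from the McRT implementation in [2].

```python
# Toy simulation of nosteal / 3steal / steal scheduling policies.
# Task durations and the round-robin queue assignment are illustrative
# assumptions, not data from Saha et al. [2].
import heapq
import random

def simulate(durations, num_cores, neighborhood):
    """Return the makespan when each core pops from its own queue and,
    once idle, may steal from up to `neighborhood` neighboring queues
    (0 = nosteal, 3 = 3steal, num_cores - 1 = unrestricted steal)."""
    queues = [[] for _ in range(num_cores)]
    for i, d in enumerate(durations):          # static round-robin assignment
        queues[i % num_cores].append(d)

    # event queue of (time the core becomes idle, core id)
    events = [(0.0, c) for c in range(num_cores)]
    heapq.heapify(events)
    makespan = 0.0
    while events:
        now, core = heapq.heappop(events)
        victims = [core] + [(core + k) % num_cores
                            for k in range(1, neighborhood + 1)]
        task = None
        for v in victims:                      # own queue first, then neighbors
            if queues[v]:
                task = queues[v].pop(0)
                break
        if task is None:
            makespan = max(makespan, now)      # nothing left to steal: core retires
            continue
        heapq.heappush(events, (now + task, core))
    return makespan

if __name__ == "__main__":
    random.seed(0)
    tasks = [random.uniform(1, 10) for _ in range(1024)]   # hypothetical workload
    serial = sum(tasks)
    for name, hood in [("nosteal", 0), ("3steal", 3), ("steal", 31)]:
        m = simulate(tasks, num_cores=32, neighborhood=hood)
        print(f"{name:8s} makespan={m:8.1f} speedup={serial / m:5.1f}x")
```

Under this model, stealing trades contention (more queues inspected) for load balance, which is exactly the tension visible in Figure 1; the real experiments must additionally account for locking and cache effects that a simulation at this level ignores.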
3 Previous Work

Algorithms for scheduling threads in a many-core setting have been proposed ranging from the simplistic, though in some cases effective, algorithms defined in Saha et al.'s Many-Core Run Time (McRT) [2], to the arbitrarily complex search algorithms surveyed in [1], to the Heterogeneity-Aware Signature-Supported (HASS) scheduling algorithm [4]. We are aware of one PhD dissertation on the topic of scheduling on chip multithreaded processors [5], and a seminar paper [6] provides a very gentle introduction to the problem and the research literature.

Algorithm design also hinges on whether the assumed underlying architecture is homogeneous or heterogeneous, which further complicates optimal schedule assignment. Several recent attempts have been made to create scheduling algorithms for the heterogeneous case, such as HASS [4] and the Asymmetric Multiprocessor Scheduler (AMPS) [7]. Other proposed algorithms target generality or fairness, such as Distributed Weighted Round-Robin (DWRR) [8] and thread-criticality prediction for performance, power, and resource management [9]. There have even been attempts at adapting MapReduce [10], a potential workload for LightSpeed, to the many-core setting [11, 12]. These algorithms provide insight into the techniques and heuristics on which LightSpeed must build.

Other domains of computer science research, such as networking, also benefit from many-core scheduling research, as evidenced by the attempt of Mallik et al. [13] to create a scheduling algorithm for network processors using statistical analysis. Their use of statistical techniques encourages a line of research that could lead to using machine learning within scheduling algorithms for many-core architectures. Our work may explore this potential synergy between machine learning and thread schedulers in the many-core setting.

We intend to rigorously survey the research literature on many-core scheduling, explore alternative scheduling algorithms and heuristics, and improve upon previous work, focusing on embarrassingly parallel workloads on homogeneous many-core architectures.

4 Algorithm Exploration, Experimental Methods and Plan

We will implement thread scheduling algorithms explored in prior work on a simulator for many-core systems, in a MapReduce-like application for highly parallelizable jobs. We will evaluate the weaknesses and strengths of these algorithms and propose modifications to increase performance as the number of cores in the system grows. We will explore scheduling algorithms like the ones used in the McRT system [2], a software prototype of an integrated language runtime designed to explore configurations of the software stack for enabling performance and scalability on large-scale many-core platforms. We will use a simulator for the Intel Single-chip Cloud Computer (SCC), or similar.
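As a concrete example of the kind of MapReduce-like job we have in mind, the sketch below expresses a word count in the usual map/shuffle/reduce structure. It is only a stand-in written against Python's multiprocessing module; the actual experiments would target a shared-memory MapReduce runtime such as Phoenix [11, 12] running on the simulated platform.

```python
# Minimal MapReduce-style word count: a stand-in for the embarrassingly
# parallel jobs we plan to schedule. Phoenix [11, 12] is the intended
# runtime; this only illustrates the shape of the workload.
from collections import Counter
from multiprocessing import Pool

def map_phase(chunk):
    """Map: emit per-word counts for one input split."""
    return Counter(chunk.split())

def reduce_phase(partials):
    """Reduce: merge the per-split counts into a global count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Hypothetical input: many independent text splits.
    splits = ["the quick brown fox jumps over the lazy dog"] * 1000
    with Pool(processes=8) as pool:          # one worker per (simulated) core
        partial_counts = pool.map(map_phase, splits)
    print(reduce_phase(partial_counts).most_common(3))
```

Each map task is independent, so the job's makespan is determined almost entirely by how the scheduler distributes and balances these tasks across cores, which is what LightSpeed targets.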
The SCC is a research microprocessor containing the most Intel Architecture cores ever integrated on a single silicon CPU chip: 48 cores. It incorporates technologies intended to scale multi-core processors to 100 cores and beyond, such as an on-chip network, advanced power-management technologies, and support for message passing.

The potential workloads and benchmarks, in addition to the standard ones used in prior work [10, 14], will include existing implementations of MapReduce for various applications that lend themselves to high parallelization.

By the Milestone 1 deadline, October 13th, we will:

• Explore simulation tools. Ideally, we will try to obtain a simulator for the Intel SCC 48-core research architecture [15], or similar (for example, the simulator of [16] for multicore simulation). We will have a simulator chosen and running.

• Research existing implementations of libraries for MapReduce-like [10] functionality. For instance, Phoenix [11, 12], developed at Stanford, is a potentially suitable implementation of MapReduce for shared-memory systems.

• Review literature on scheduling jobs in many-core systems: Saha et al. [2], Jin et al. [1], Mallik et al. [13], Rajagopalan et al. [3], and other research that we come across.

By the Milestone 2 deadline, November 1st, we will:

• Reproduce baseline cases or prior results.

• Evaluate the performance of existing scheduling algorithms (a sketch of the measurement loop we have in mind appears after this list).

• Determine weaknesses in existing algorithms, and propose and implement improvements.

• Provide results for some of the improved algorithms and continue evaluating the results.
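The evaluation bullet above refers to a measurement harness along these lines: run the same workload under a given scheduler while sweeping the number of workers, and report speedup relative to the single-worker run. The compute-bound placeholder task and the worker counts below are our own assumptions; the real harness would drive the chosen simulator and the workloads described earlier.

```python
# Sketch of the speedup-measurement loop behind the Milestone 2 evaluation.
# The busy task and worker counts are placeholders for real simulator runs;
# only the measurement structure matters here.
import math
import time
from multiprocessing import Pool

def busy_task(n):
    """A small compute-bound task standing in for one unit of scheduled work."""
    return sum(math.sqrt(i) for i in range(n))

def run_workload(workers, tasks):
    """Time one run of the workload with a fixed number of workers."""
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        pool.map(busy_task, tasks)
    return time.perf_counter() - start

if __name__ == "__main__":
    tasks = [200_000] * 256                   # 256 identical units of work
    baseline = run_workload(1, tasks)
    for workers in (1, 2, 4, 8, 16):
        elapsed = run_workload(workers, tasks)
        print(f"{workers:2d} workers: {elapsed:6.2f}s  "
              f"speedup {baseline / elapsed:4.1f}x")
```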
References

[1] S. Jin, G. Schiavone, and D. Turgut, "A performance study of multiprocessor task scheduling algorithms," The Journal of Supercomputing, vol. 43, pp. 77–97, 2008. [Online]. Available: http://dx.doi.org/10.1007/s11227-007-0139-z

[2] B. Saha, A.-R. Adl-Tabatabai, A. Ghuloum, M. Rajagopalan, R. L. Hudson, L. Petersen, V. Menon, B. Murphy, T. Shpeisman, E. Sprangle, A. Rohillah, D. Carmean, and J. Fang, "Enabling scalability and performance in a large scale CMP environment," SIGOPS Oper. Syst. Rev., vol. 41, no. 3, pp. 73–86, 2007. [Online]. Available: http://doi.acm.org/10.1145/1272998.1273006

[3] M. Rajagopalan, B. T. Lewis, and T. A. Anderson, "Thread scheduling for multi-core platforms," in HOTOS '07: Proceedings of the 11th USENIX Workshop on Hot Topics in Operating Systems. Berkeley, CA, USA: USENIX Association, 2007, pp. 1–6.

[4] D. Shelepov, J. C. Saez Alcaide, S. Jeffery, A. Fedorova, N. Perez, Z. F. Huang, S. Blagodurov, and V. Kumar, "HASS: a scheduler for heterogeneous multicore systems," SIGOPS Oper. Syst. Rev., vol. 43, no. 2, pp. 66–75, 2009. [Online]. Available: http://doi.acm.org/10.1145/1531793.1531804

[5] A. Fedorova, "Operating system scheduling for chip multithreaded processors," Ph.D. dissertation, Cambridge, MA, USA, 2006. Adviser: Margo I. Seltzer.

[6] T. Fahringer, "Optimisation: Operating system scheduling on multi-core architectures," seminar paper, August 2008. [Online]. Available: http://www.scribd.com/doc/4838281/Operating-System-Scheduling-on-multicore-architectures

[7] T. Li, D. Baumberger, D. A. Koufaty, and S. Hahn, "Efficient operating system scheduling for performance-asymmetric multi-core architectures," in SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1–11. [Online]. Available: http://doi.acm.org/10.1145/1362622.1362694

[8] T. Li, D. Baumberger, and S. Hahn, "Efficient and scalable multiprocessor fair scheduling using distributed weighted round-robin," SIGPLAN Not., vol. 44, no. 4, pp. 65–74, 2009. [Online]. Available: http://doi.acm.org/10.1145/1594835.1504188

[9] A. Bhattacharjee and M. Martonosi, "Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors," in ISCA '09: Proceedings of the 36th Annual International Symposium on Computer Architecture. New York, NY, USA: ACM, 2009, pp. 290–301. [Online]. Available: http://doi.acm.org/10.1145/1555754.1555792

[10] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492

[11] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture. Washington, DC, USA: IEEE Computer Society, 2007, pp. 13–24. [Online]. Available: http://dx.doi.org/10.1109/HPCA.2007.346181

[12] R. Yoo, A. Romano, and C. Kozyrakis, "Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system," in Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), October 2009, pp. 198–207.

[13] A. Mallik, Y. Zhang, and G. Memik, "Automated task distribution in multicore network processors using statistical analysis," in ANCS '07: Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems. New York, NY, USA: ACM, 2007, pp. 67–76. [Online]. Available: http://doi.acm.org/10.1145/1323548.1323563

[14] L. Huston, R. Sukthankar, R. Wickremesinghe, M. Satyanarayanan, G. R. Ganger, E. Riedel, and A. Ailamaki, "Diamond: A storage architecture for early discard in interactive search," in FAST '04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies. Berkeley, CA, USA: USENIX Association, 2004, pp. 73–86.

[15] Intel Corporation, "The SCC platform overview," Website, September 2010. [Online]. Available: http://communities.intel.com/servlet/JiveServlet/downloadBody/5512-102-1-8627/SCC Platform Overview 90110.pdf

[16] H. Zeng, M. Yourst, K. Ghose, and D. Ponomarev, "MPTLsim: a cycle-accurate, full-system simulator for x86-64 multicore architectures with coherent caches," SIGARCH Comput. Archit. News, vol. 37, no. 2, pp. 2–9, 2009. [Online]. Available: http://doi.acm.org/10.1145/1577129.1577132