SOME WORKLOAD SCHEDULING ALTERNATIVES IN THE HIGH PERFORMANCE COMPUTING ENVIRONMENT

Tyler A. Simon, University of Maryland, Baltimore County
Jim McGalliard, FEDSIM

CMG Southern Region, Raleigh
October 4, 2013

TOPICS

- HPC WORKLOADS
- BACKFILL
- MAPREDUCE & HADOOP
- HADOOP WORKLOAD SCHEDULING
- ALTERNATIVE PRIORITIZATIONS
- ALTERNATIVE SCHEDULES
- DYNAMIC PRIORITIZATION
- SOME DYNAMIC PRIORITIZATION RESULTS
- CONCLUSIONS

HPC Workloads

Current-generation HPCs have many CPUs – say, 1,000 or 10,000. The scale economies of clusters of commodity processors have made custom-designed processors uncompetitive on price/performance. Typical HPC applications map some system – e.g., a physical system like the Earth's atmosphere – onto a matrix.

HPC Workloads

e.g., the cubed sphere…

[Figure: cubed-sphere decomposition of the globe]

HPC Workloads

Systems modeled include:

- Mathematical systems, such as systems of linear equations
- Physical systems, such as particle or molecular physics
- Logical systems, including, more recently, web activity – of which more later

HPC Workloads

The application software simulates the behavior or dynamics of the system represented by assigning parts of the system to nodes of the cluster. After processing, final results are collected from the nodes. Often, applications can represent the behavior of the system of interest more accurately by using more nodes.

Mainframes optimize the use of the expensive processor resource by dispatching new work on it whenever the current workload no longer requires it, e.g., at the start of an I/O operation.

HPC Workloads

In contrast to a traditional mainframe workload, HPC jobs may use hundreds or thousands of processor nodes simultaneously. Halting job execution on 1,000 CPUs so that a single CPU can start an I/O operation is not efficient. Some HPCs have checkpoint/restart capabilities that could allow job interruption and also easier recovery from a system failure; most do not. On an HPC system, typically, once a job is dispatched, it is allowed to run uninterrupted until completion. (More on that later.)

Backfill

HPCs are usually oversubscribed. Backfill is commonly used to increase processor utilization in an HPC environment. The three figures below compare no backfill (FCFS), strict backfill, and relaxed backfill; a small simulation sketch follows them.

[Figure: No Backfill (FCFS) – jobs J1–J4 on Processors × Time axes]

[Figure: Strict Backfill – jobs J1–J4 on Processors × Time axes]

[Figure: Relaxed Backfill – jobs J1–J4 on Processors × Time axes]
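To make the policies concrete, here is a minimal simulation sketch. It is not from the presentation: the Job record, the 8-processor machine, and the four-job workload are illustrative assumptions, and only the conservative (strict) variant of backfill is implemented. FCFS starts jobs strictly in arrival order; conservative backfill lets a later job jump ahead only when that cannot delay the reservation held by the job at the head of the queue.

    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        procs: int    # processors requested
        runtime: int  # estimated runtime, in scheduler time units

    TOTAL = 8  # illustrative machine size; assumes every job fits the machine

    def free_at(running, t):
        """Processors free at time t, given running jobs as (end_time, procs) pairs."""
        return TOTAL - sum(p for end, p in running if end > t)

    def reservation(running, job, t):
        """Earliest time >= t at which `job` fits on the machine."""
        times = sorted({t} | {end for end, _ in running if end > t})
        return next(x for x in times if free_at(running, x) >= job.procs)

    def schedule(jobs, backfill=False):
        """Return {job name: start time}. Pure FCFS when backfill=False; otherwise
        conservative backfill: a later job may start early only if it cannot
        delay the reservation made for the job at the head of the queue."""
        queue, running, starts, now = list(jobs), [], {}, 0
        while queue:
            # Start the head of the queue (and its successors) while they fit.
            while queue and free_at(running, now) >= queue[0].procs:
                job = queue.pop(0)
                starts[job.name] = now
                running.append((now + job.runtime, job.procs))
            if not queue:
                break
            if backfill:
                head = queue[0]
                shadow = reservation(running, head, now)   # head's reserved start
                free_now = free_at(running, now)
                spare = free_at(running, shadow) - head.procs  # room alongside head
                for job in list(queue[1:]):
                    # Backfill if the job fits now AND either finishes before the
                    # head's reservation or leaves the head enough processors.
                    if job.procs <= free_now and (
                            now + job.runtime <= shadow or job.procs <= spare):
                        starts[job.name] = now
                        running.append((now + job.runtime, job.procs))
                        free_now -= job.procs
                        if now + job.runtime > shadow:
                            spare -= job.procs
                        queue.remove(job)
            # Advance the clock to the next job completion.
            now = min(end for end, _ in running if end > now)
        return starts

    jobs = [Job("J1", 4, 4), Job("J2", 6, 3), Job("J3", 2, 2), Job("J4", 2, 5)]
    print(schedule(jobs))                 # FCFS:     {'J1': 0, 'J2': 4, 'J3': 4, 'J4': 6}
    print(schedule(jobs, backfill=True))  # backfill: {'J1': 0, 'J3': 0, 'J4': 0, 'J2': 4}

In this toy run, backfill starts J3 and J4 immediately without delaying J2's reserved start. The relaxed strategy of [Ward] would loosen the test against the shadow time, allowing a bounded delay of the head job's reservation in exchange for higher utilization.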
MapReduce

Amdahl's Law explains why it is important, in an HPC environment, for the application to be coded to exploit parallelism. Doing so – programming for parallelism – is a historically difficult problem. MapReduce is a response to this problem.

MapReduce

MapReduce is a simple method for implementing parallelism in a program. The programmer inserts map() and reduce() calls in the code; the compiler, dispatcher, and operating system take care of distributing the code and data onto multiple nodes. The map() function distributes the input file data onto many disks and the code onto many processors. The code processes this data in parallel – for example, weather parameters in discrete patches of the surface of the globe.

MapReduce

The reduce() function gathers and combines the data from many nodes into one and generates the output data set. MapReduce makes it easier for programmers to implement parallel versions of their applications by taking care of the distribution, management, shuffling, and return to and from the many processors and their associated data storage.

MapReduce

MapReduce also improves reliability by distributing the data redundantly, and it takes care of load balancing and performance optimization by distributing the code and data fairly among the many nodes in the cluster. MapReduce may also be used to implement prioritization by controlling the number of map and reduce slots created and allocated to various users, classes of users, or classes of jobs – the more slots allocated, the faster that user or job will run.

MapReduce

[Figure: MapReduce data flow. Image licensed by the GNU Free Software Foundation]

MapReduce

MapReduce was originally developed for use by small teams where FIFO or "social scheduling" was sufficient for workload scheduling. It has grown to be used for very large problems, such as sorting, searching, and indexing large data sets. Google uses MapReduce to index the world wide web and holds a patent on the method. MapReduce is not the only framework for parallel processing and has opponents as well as advocates.
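As an illustration of the programming model – again a sketch, not code from the presentation, and a sequential stand-in for what Hadoop actually distributes across a cluster – here is the classic word-count example: map() emits (key, value) pairs, the framework shuffles them by key, and reduce() combines each key's values.

    from collections import defaultdict
    from itertools import chain

    # map(): emit (key, value) pairs for one input record (here, one line of text).
    def map_fn(line):
        return [(word.lower(), 1) for word in line.split()]

    # reduce(): combine all values observed for one key into the final result.
    def reduce_fn(word, counts):
        return (word, sum(counts))

    def mapreduce(records, map_fn, reduce_fn):
        # In Hadoop the map tasks run in parallel on the nodes holding the data;
        # this sequential loop stands in for that distributed phase.
        intermediate = chain.from_iterable(map_fn(r) for r in records)
        # Shuffle: group intermediate pairs by key (done across the network in Hadoop).
        groups = defaultdict(list)
        for key, value in intermediate:
            groups[key].append(value)
        # Reduce phase: also parallel on a real cluster, one call per key group.
        return [reduce_fn(k, v) for k, v in groups.items()]

    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(sorted(mapreduce(lines, map_fn, reduce_fn)))
    # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]

The point of the model is that only map_fn and reduce_fn are application code; everything inside mapreduce() – distribution, shuffling, fault handling – belongs to the framework.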
HADOOP

- Hadoop is a specific open-source implementation of the MapReduce framework, written in Java and licensed by Apache
- It includes a distributed file system
- Designed for very large systems (thousands of processors) built on commodity hardware, including grid systems
- Implements data replication for checkpoint (failover) and locality
- Locality of processors with their data helps conserve network bandwidth

Applications of Hadoop

- Data-intensive applications
- Applications needing fault tolerance
- Not a DBMS
- Users include Yahoo, Facebook, LinkedIn, and Amazon
- Watson, the IBM system that won on Jeopardy, used Hadoop

[Figure: Hadoop MapReduce and Distributed File System layers. Source: Wikipedia]

Hadoop Workload Scheduling

- The Task Tracker attempts to dispatch tasks on nodes near the data the application requires (e.g., the same rack); Hadoop's HDFS is "rack aware"
- Default Hadoop scheduling is FCFS (FIFO)
- Scheduling to optimize data locality and scheduling to optimize processor utilization are in conflict
- There are various alternatives to HDFS and to the default Hadoop scheduler

Some Prioritization Alternatives

- Wait Time
- Run Time
- Number of Processors
- Queue
- Composite Priorities
- Dynamic Priorities

Some Scheduling Alternatives

- Global vs. Local Scheduling
- Resource-Aware Scheduling
- Phase-Based Scheduling
- Delay Scheduling
- Copy-Compute Splitting
- Preemption and Interruption
- Social Scheduling
- Variable Budget Scheduling

Dynamic Prioritization

Bill Ward and Tyler Simon, among others, have proposed setting job priorities as the product of several factors:

    Priority = (Est'd Proc Time)^(proc parameter) × (Wait Time)^(wait parameter) × (CPUs Requested)^(CPU parameter) × Queue

Algorithm – Prioritization

    Data: job file, system size
    Result: schedule performance
    read job input file;
    for α, β, γ = −1 → 1, step 0.1 do
        while jobs are either running or queued do
            calculate job priorities;
            for every job do
                if job is running and has time remaining then
                    update_running_job();
                else
                    for all waiting jobs do
                        pull jobs from the priority queue and start them based on best fit;
                        if a job cannot be started then
                            increment its wait time
                        end
                    end
                end
            end
        end
        print results;
    end

Algorithm – Cost Evaluation

Require: C, the capacity of the knapsack; n, the number of tasks; a, an array of tasks of size n
Ensure: a cost vector containing Ef for each job class cost (A, B, C, D, WRT)

    i = 0
    for α = −2; α ≤ 2; α += 0.1 do
        for β = −2; β ≤ 2; β += 0.1 do
            for γ = −2; γ ≤ 2; γ += 0.1 do
                Cost[i] = schedule(α, β, γ)    {for optimization}
                if Cost[i] < bestSoFar then
                    bestSoFar = Cost[i]
                end if
                i = i + 1
            end for
        end for
    end for
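A runnable sketch of the same exhaustive sweep (my illustration, not the authors' code): priority() implements the multiplicative rule above, while schedule() is a toy stand-in for the queue simulator. Treating (α, β, γ) as the CPU-count, wait-time, and processing-time exponents is an assumption, consistent with the (0, 1, 0) = FCFS and (0, 0, −1) = SJF labels in the results that follow; the job records are hypothetical.

    import itertools

    def priority(cpus, wait, proc_time, alpha, beta, gamma, queue_weight=1.0):
        """Multiplicative dynamic priority: (CPUs)^α · (wait)^β · (proc time)^γ · queue.
        The factor-to-exponent mapping is an assumption (see the lead-in)."""
        return (cpus ** alpha) * (wait ** beta) * (proc_time ** gamma) * queue_weight

    def schedule(alpha, beta, gamma, jobs):
        """Toy cost function standing in for the simulator: rank jobs by priority
        and charge each job its rank-weighted runtime. A real version would
        replay the job trace through the queue simulation and report, e.g.,
        total wait time."""
        order = sorted(jobs,
                       key=lambda j: priority(j["cpus"], j["wait"], j["time"],
                                              alpha, beta, gamma),
                       reverse=True)
        return sum(rank * j["time"] for rank, j in enumerate(order))

    def sweep(jobs, lo=-2.0, hi=2.0, step=0.1):
        """Exhaustive search of the (α, β, γ) grid, keeping the best cost seen."""
        grid = [round(lo + i * step, 1) for i in range(int(round((hi - lo) / step)) + 1)]
        best, best_params = float("inf"), None
        for a, b, g in itertools.product(grid, repeat=3):
            cost = schedule(a, b, g, jobs)
            if cost < best:
                best, best_params = cost, (a, b, g)
        return best_params, best

    # Hypothetical job records: CPUs requested, current wait, estimated runtime.
    jobs = [{"cpus": 64, "wait": 10, "time": 120},
            {"cpus": 8,  "wait": 40, "time": 15},
            {"cpus": 16, "wait": 5,  "time": 60}]
    print(sweep(jobs))

The 41 × 41 × 41 grid is small enough to search exhaustively, which is what makes the "constantly tune the system" conclusion below practical.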
Some Results

[Chart: schedule performance for (α, β, γ) = (0, 1, 0), i.e., FCFS]

Some Results

[Chart: schedule performance for (α, β, γ) = (0, 0, −1), i.e., SJF]

Some Results

[Chart: schedule performance for (α, β, γ) = (−0.1, −0.3, −0.4)]

Some Results

    Policy        α     β     γ    Wait   Utiliz (%)   TotalWait
    LJF            0     0     1    165       84          4820
    LargestJF      1     0     0    146       95          5380
    SJF            0     0    -1    150       92          3750
    SmallestJF    -1     0     0    165       84          4570
    FCFS           0     1     0    156       88          4970
    Dynamic      -0.1  -0.3  -0.4   145       95          3830

Conclusions

Where data center management wants to maximize some calculable objective, it may be possible for an exhaustive search of the parameter space to constantly tune the system and provide near-optimal performance in terms of that function.

Conclusions

We expect new capabilities, e.g., commodity cluster computers and Hadoop, to continue to inspire new applications. The proliferation of workload scheduling alternatives may be the result of (1) the challenges of parallel programming, (2) the popularity of open source platforms that are easy to customize, and (3) the brief lifespan of MapReduce, which has not yet had the chance to mature.

Bibliography (from National CMG Paper)

[Calzolari] Calzolari, Federico, and Volpe, Silvia. "A New Job Migration Algorithm to Improve Data Center Efficiency." Proceedings of Science, International Symposium on Grids and Clouds and the Open Grid Forum, Taipei, March 2011.
[Feitelson1998] Feitelson, Dror, and Rudolph, Larry. "Metrics and Benchmarking for Parallel Job Scheduling." Job Scheduling Strategies for Parallel Processing '98, LNCS 1459, Springer-Verlag, Berlin, 1998.
[Feitelson1999] Feitelson, Dror, and Naaman, Michael. "Self-Tuning Systems." IEEE Software, March/April 1999.
[Glassbrook] Glassbrook, Richard, and McGalliard, James. "Performance Management at an Earth Science Supercomputer Center." CMG 2003.
[Heger] Heger, Dominique. "Hadoop Performance Tuning – A Pragmatic & Iterative Approach." www.cmg.org/measureit/issues/mit97/m_97_3.pdf
[Hennessy] Hennessy, J., and Patterson, D. Computer Architecture: A Quantitative Approach, 2nd Edition. Morgan Kaufmann, San Mateo, California.
[Herodotou] Herodotou, Herodotos, and Babu, Shivnath. "Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs." Proceedings of the 37th International Conference on Very Large Data Bases, Vol. 4, No. 11. VLDB Endowment, Seattle, 2011.
[Hovestadt] Hovestadt, Matthias, and others. "Scheduling in HPC Resource Management Systems: Queuing vs. Planning." Job Scheduling Strategies for Parallel Processing. Springer, Berlin/Heidelberg, 2003.
[Nguyen] Nguyen, Phuong; Simon, Tyler; and others. "A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment." IEEE/ACM Fifth International Conference on Utility and Cloud Computing. IEEE Computer Society, 2012.
[Rao] Rao, B. Thirumala, and others. "Performance Issues of Heterogeneous Hadoop Clusters in Cloud Computing." Global Journal of Computer Science and Technology, Volume XI, Issue VIII, May 2011.
[Sandholm2009] Sandholm, Thomas, and Lai, Kevin. "MapReduce Optimization Using Regulated Dynamic Prioritization." SIGMETRICS/Performance '09. ACM, Seattle, 2009.
[Sandholm2010] Sandholm, Thomas, and Lai, Kevin. "Dynamic Proportional Share Scheduling in Hadoop." Job Scheduling Strategies for Parallel Processing 2010. Springer-Verlag, Berlin, 2010.
[Scavlex] Scavlex. http://compprog.wordpress.com/2007/11/20/the-fractional-knapsack-problem
[Sherwani] Sherwani, Jahanzeb, and others. "Libra: A Computational Economy-Based Job Scheduling System for Clusters." Software: Practice and Experience 34.6, 2004.
[Simon] Simon, Tyler, and others. "Multiple Objective Scheduling of HPC Workloads Through Dynamic Prioritization." HPC 2013, Spring Simulation Conference. The Society for Modeling & Simulation International, 2013.
[Spear] Spear, Carrie, and McGalliard, James. "A Queue Simulation Tool for a High Performance Scientific Computing Center." CMG 2007, San Diego, 2007.
[Streit] Streit, Achim. "On Job Scheduling for HPC-Clusters and the dynP Scheduler." Paderborn Center for Parallel Computing, Paderborn, Germany.
[Ward] Ward, William A., and others. "Scheduling Jobs on Parallel Systems Using a Relaxed Backfill Strategy." Revised Papers from the 8th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP '02. Springer-Verlag, London, 2002.
[Zaharia2009] Zaharia, Matei, and others. "Job Scheduling for Multi-User MapReduce Clusters." Technical Report UCB/EECS-2009-55. University of California at Berkeley, 2009.
[Zaharia2010] Zaharia, Matei, and others. "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling." EuroSys '10. ACM, Paris, 2010.