Performance-based Middleware for Grid Computing
Dr Stephen Jarvis, High Performance Systems Group, University of Warwick, UK

Context
• Funded by / collaborating with
  – UK e-Science Core Programme
  – IBM (Watson, Hursley)
  – NASA (Ames), NEC Europe, Los Alamos National Laboratory, MIT
• Aims
  – Integrate established performance and scheduling tools with emerging grid middleware
  – Test on scientific and business case studies

Performance-managed Grid Middleware

Do we need performance-managed Grid middleware?
• User perspective
  – Large, complex scientific applications
  – The Grid provides a number of run options
  – Real-time results/guarantees are important
  – Budget constraints
• Resource provider's perspective
  – Scheduling of tasks
  – Make best use of resources / profit
  – Provide QoS

Performance Services
• Intra-domain
  – Lab- / department-based
  – Shared resources under local administration
• Multi-domain
  – Campus- / country-based
  – Wide-area resource and task management
  – Cross-domain scheduling

Performance Tools
• Performance prediction tools aim to predict
  – Execution time
  – Communication usage
  – Data and resource requirements
• Provide a best guess as to how an application will execute on a given resource

PACE
• The user supplies the application; PACE derives an application model (model parameters) and a resource model (resource configuration)
• An evaluation engine combines the two models to produce a performance prediction
[Diagram: Application → Application Model → model parameters → Evaluation Engine ← resource configuration ← Resource Model ← Resource]

Why is prediction useful?
• Scaling properties on single architectures
• Compare performance over different architectures
• Re-order tasks according to deadlines
• Give priority to favoured users
• Maximise resource usage
• Allows runtime scenarios to be explored before deployment (a minimal prediction sketch follows below)
[Figures: execution time on an SGI Origin2000 (sec) against number of processors, and processing time against number of elements, for benchmark codes sweep3d, improc, closure, fft, jacobi, memsort and cpi]
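The deck does not give PACE internals, so the following is only a minimal sketch of the idea behind it: a parametric application model and a hardware resource model are combined by an evaluation engine to predict execution time, which can then be compared across processor counts and architectures. All names and constants below (AppModel, ResourceModel, evaluate, the flop/message costs) are hypothetical; PACE itself is a full toolkit with its own modelling language.

```python
# Illustrative sketch only: class and parameter names are invented to show
# the idea of combining an application model with a resource model.

from dataclasses import dataclass


@dataclass
class ResourceModel:
    """Hardware characterisation: cost of basic operations on one platform."""
    name: str
    flop_time: float        # seconds per floating-point operation
    msg_latency: float      # seconds per message
    msg_bandwidth: float    # bytes per second


@dataclass
class AppModel:
    """Parametric workload description for a (hypothetical) stencil code."""
    elements: int
    iterations: int
    flops_per_element: int
    bytes_exchanged_per_iter: int
    messages_per_iter: int


def evaluate(app: AppModel, res: ResourceModel, processors: int) -> float:
    """Evaluation engine: combine the two models into a predicted runtime."""
    compute = (app.elements / processors) * app.flops_per_element * res.flop_time
    comms = (app.messages_per_iter * res.msg_latency
             + app.bytes_exchanged_per_iter / res.msg_bandwidth)
    return app.iterations * (compute + comms)


if __name__ == "__main__":
    app = AppModel(elements=1_000_000, iterations=200,
                   flops_per_element=50, bytes_exchanged_per_iter=80_000,
                   messages_per_iter=8)
    platforms = [ResourceModel("cluster-A", 2e-9, 5e-6, 1e8),
                 ResourceModel("cluster-B", 5e-9, 1e-6, 5e8)]

    # Explore scaling and cross-architecture scenarios before deployment.
    for res in platforms:
        for p in (1, 4, 16):
            print(f"{res.name:10s} p={p:3d}  predicted {evaluate(app, res, p):7.2f} s")
```

Running the script prints a small table of predicted runtimes per platform and processor count, which is the kind of "what if" exploration the slide above describes.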
1. Intra-Domain Co-Scheduling
• Augment Condor with additional performance information
• Handle predictive and non-predictive tasks
• Use predictive data for system improvement
  – Time to complete tasks / utilisation of resources
  – QoS: the ability to meet deadlines
• Scheduler driver, or co-scheduler, called Titan (a minimal scheduling sketch follows at the end of this deck)

Intra-Domain Co-Scheduling
• Non-predictive tasks
• Tasks with prediction data
[Architecture diagram: requests from users or other domain schedulers enter through a portal; a PACE-based pre-execution engine feeds Titan (schedule queue, GA, matchmaker), which connects via a cluster connector and ClassAds to Condor-managed resources]

Intra-Domain Deployment
• Without co-scheduler: time to complete = 70.08 min
• With co-scheduler: time to complete = 35.19 min

2. Multi-Domain Management
• Publish intra-domain performance data through the Globus Information Services
• Augment the service with an agent system
  – One agent per domain / VO
• When a task is submitted
  – Agents query the IS and negotiate to discover the best domain on which to run the task
• Scheme is tested on a 256-node experimental Grid
  – 16 resource domains; 6 architecture types

Multi-Domain Management
• Time to complete without the scheme = 2752 s
• Time to complete with the scheme = 467 s; an improvement of 83%
[Figure: task schedules over time]

QoS: Ability to Meet Deadlines
[Figure: deadline compliance with the performance service active vs. inactive]

Resource Usage
[Figure: resource usage with the performance service active vs. inactive]

Software Project Status
• GT2 and GT3 implementations
• Handles workflows as well as discrete jobs (demonstration available)
• Predictive methods developed for business and scientific applications

Output
• Presented at GGF meetings, NeSC workshops, IPDPS, HPDC, Supercomputing, Cluster Computing, CCGrid, Euro-Par, …
• 23 journal and conference papers
• GT3 software release at the All Hands Meeting
• See www.dcs.warwick.ac.uk/~hpsg
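Scheduling sketch (referenced from the Intra-Domain Co-Scheduling slide). The deck names a GA inside Titan but gives no implementation detail, so the code below is only a much-simplified, self-contained stand-in: predicted runtimes (as PACE would supply) drive a genetic algorithm over task orderings so that makespan and missed deadlines are reduced. The task list, two-host cluster, fitness weights and GA settings are all invented for illustration.

```python
# Much-simplified stand-in for a GA-driven co-scheduler: it searches task
# orderings (using predicted runtimes) to reduce makespan and missed
# deadlines.  All data and settings below are invented.

import random

TASKS = [  # (name, predicted runtime in s, deadline in s)
    ("sweep3d", 40, 120), ("fft", 10, 30), ("jacobi", 25, 60),
    ("improc", 15, 50), ("memsort", 20, 200), ("cpi", 5, 15),
]
HOSTS = 2


def simulate(order):
    """Greedy list scheduling of the given task order onto HOSTS hosts."""
    finish = [0.0] * HOSTS
    missed = 0
    for name, runtime, deadline in order:
        h = min(range(HOSTS), key=lambda i: finish[i])  # earliest-free host
        finish[h] += runtime
        if finish[h] > deadline:
            missed += 1
    return max(finish), missed


def fitness(order):
    makespan, missed = simulate(order)
    return makespan + 100.0 * missed  # lower is better; penalise missed deadlines


def crossover(a, b):
    """Order crossover: keep a prefix of parent a, fill the rest from b."""
    cut = random.randrange(1, len(a))
    head = a[:cut]
    return head + [t for t in b if t not in head]


def mutate(order):
    i, j = random.sample(range(len(order)), 2)
    order[i], order[j] = order[j], order[i]


def evolve(generations=200, pop_size=30):
    pop = [random.sample(TASKS, len(TASKS)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            child = crossover(*random.sample(survivors, 2))
            if random.random() < 0.3:
                mutate(child)
            children.append(child)
        pop = survivors + children
    return min(pop, key=fitness)


if __name__ == "__main__":
    best = evolve()
    makespan, missed = simulate(best)
    print("order:    ", [name for name, _, _ in best])
    print(f"makespan: {makespan:.0f} s, missed deadlines: {missed}")
```

The design point mirrored here is the one the slides describe: the co-scheduler does not replace Condor's matchmaking, it re-orders the queue using prediction data before handing work on to the underlying resource manager.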