What is the impact of a priori knowledge?
Stephen Jarvis, High Performance Systems Group, University of Warwick

High Performance Systems Group
We investigate software/systems combinations, analysing the operational effectiveness of applications:
- microprocessor codes
- distributed enterprise systems
- large-scale scientific applications
on a wide range of computing architectures:
- embedded systems
- commodity clusters
- high-specification supercomputers
- Grids

Why do we do this?
Wide-ranging application:
- Quantify the advantages of different architectural options
- Compare alternative vendor systems
- Forecast final system behaviour, validate installation
- Determine the impact of system and application upgrades
- Verify the effects of maintenance (e.g. fault analysis)
- Vulnerability analysis
Variety of users: US Navy Ocean Systems Center, LANL, NASA, Mental images, National Physical Laboratory, Thomson ASM, INRIA, Simulog, IBM, BMW, Microsoft, HP Labs, BT, AWE …

Workflow research
Other talks:
- Modelling parallel pipelined synchronous wavefront applications
- Dynamic operating policies for hosting environments
- Predicting the power footprints of architecture/application combinations
- Multi-core / Cell design and application performance
This talk: performance-based middleware for Grid computing (e-Science Core Programme funded research):
- modelling applications and hardware
- deriving accurate performance data from these models
- using this data for effective resource management

Demonstrators
Business demonstrator, joint work with IBM Hursley and T.J. Watson Research Labs:
- "Application of performance prediction techniques to distributed enterprise systems", IBM Watson, J. Supercomputing, 2004
- "Workload allocation for distributed enterprise applications", IBM Watson, IEEE Trans. Parallel and Distributed Systems, 2005
- "Dynamic operating policies for commercial hosting environments", HP Labs, BT, IBM, Newcastle Uni., NB2BC, Computer Science Challenges to emerge from e-Science, 2005-2008
The focus of this talk is the scientific demonstrator, a collaboration with KCL, Oxford and Imperial (IXI).

IXI – Information eXtraction from Images
- UK e-Science medical imaging project
- Large-scale image processing and medical image analysis on dedicated clusters, Condor pools and the NGS
- Images from different modalities (CT, MR and PET)
- Volume rendering and non-rigid registration on pairs of 3-D MRI scans
- Used to compensate for misregistration in breast MR images and when isolating tumour growth

Registration procedure
- A uniform mesh of control points is fitted to the 3-D image
- Similarity measure based on normalised mutual information
- Gradient descent optimisation: points are moved in (x, y, z), the effect of the transformation is measured by a fitness function, and improved transformations are kept (a sketch follows below)
- The independence of (neighbourhood) B-splines makes it computationally tractable to use large numbers of points
- Optimisation is performed at different image resolutions (through B-spline subdivision)
- The fitness function ensures that transformations are well formed
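As an illustration of the loop just described, here is a minimal Python sketch of one greedy descent pass over the control-point mesh. It assumes an illustrative nmi() similarity function and a caller-supplied transform(source, control_points) that applies the deformation; the names and interfaces are hypothetical, not the IXI/Nreg implementation.

```python
import numpy as np

def nmi(a, b, bins=64):
    """Normalised mutual information between two images (illustrative)."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    hxy = -np.sum(pxy[pxy > 0] * np.log(pxy[pxy > 0]))
    return (hx + hy) / hxy  # Studholme's normalised MI

def register(control_points, transform, target, source, step=1.0, eps=1e-4):
    """Greedy descent pass: nudge each control point along x, y and z,
    measure the fitness of the resulting transformation, and keep any
    move that improves it. `transform` is a hypothetical stand-in for
    the B-spline free-form deformation."""
    best = nmi(target, transform(source, control_points))
    improved = True
    while improved:
        improved = False
        for p in range(len(control_points)):
            for axis in range(3):            # x, y, z
                for delta in (step, -step):
                    trial = control_points.copy()
                    trial[p, axis] += delta
                    score = nmi(target, transform(source, trial))
                    if score > best + eps:
                        control_points, best, improved = trial, score, True
    return control_points, best
```

Because neighbouring B-splines have local support, a real implementation would only re-evaluate the similarity in the region a moved control point influences, which is what makes large numbers of points tractable, and would wrap this pass in a coarse-to-fine schedule over subdivided meshes.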
Complexity
- Computationally intensive problem: tens of hours
- Runtime is limited by the speed of the CPU and main memory, and is proportional to the number of control points in the target image
- Runtime can nonetheless differ by an order of magnitude depending on the properties of the images:

  Target image   Source image   Iterations   Runtime
  b7_s2          b7_s2          15           3863 s
  b7_s2          b7_s1          44           14210 s
  b9_s2          b9_s1          83           26940 s
  b9_s2_e2       b9_s3_e2       114          4284 s
  b9_s3_e2       b8_s3_e2       134          3515 s

Deriving performance data
Performance modelling covering three medical imaging tools (BET, FAST and Nreg) allows real clinical workflows to be built. Observations:
1. Highly variable runtime: a factor of 16 between fastest and slowest at the same image size
2. Two classes of registration, depending on the destination image
3. Self-registration is fast
4. Prediction is based on timing of subsampled images (sketched below)
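Point 4 can be made concrete with a short sketch: time the registration on heavily subsampled copies of the image pair, then scale the probe time back up. The run_registration callable is a stand-in, and the cubic scaling is an assumption that follows from runtime being proportional to the number of control points in the target image.

```python
import time

def subsample(img, factor):
    """Keep every factor-th voxel along each axis of a 3-D volume."""
    return img[::factor, ::factor, ::factor]

def predict_runtime(run_registration, target, source, factor=4):
    """Estimate full-resolution runtime from a subsampled probe run.
    Subsampling each axis by `factor` shrinks the volume, and hence the
    number of control points, by roughly factor**3, so the probe time
    is scaled back up by the same amount (an assumed scaling model)."""
    start = time.perf_counter()
    run_registration(subsample(target, factor), subsample(source, factor))
    probe = time.perf_counter() - start
    return probe * factor ** 3
```

A useful property of this approach is that the probe run inherits the image pair's convergence behaviour, the variable iteration count behind the factor-of-16 spread, which a model keyed only on image size would miss. The cost of the probe itself is the launch-delay trade-off discussed on the next slide.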
What a priori knowledge is needed?
- What error bounds are acceptable when dealing with these runtime predictions?
- Sub-sample analysis costs: there is a trade-off between gathering information and slowing down the launch
- We can improve things by caching performance results and closing the feedback loop
- Ultimately it depends on what we want to do with this information
- It is important to understand that the models are parameterised (p, c, m, i, d etc.)

Performance-based workload management
Tasks arrive with associated prediction models, are fed through a prescheduler, and are deployed once prediction information has been taken into account.
[Architecture diagram: requests from users or other domain schedulers enter through a portal; performance modelling informs a pre-execution engine, schedule queue and GA-based matchmaker, which dispatches work through a cluster connector to Condor.]

How is a priori knowledge used?
Deadline-driven jobs:
- Can be launched in a configuration that is likely to satisfy the deadline
- Different numbers of processors or different architectures can be selected
- Different (medical) data-sets can be selected
Improving resource utilisation: as above, but optimising over different metrics (see the sketch below).
Interaction with the medical scientist:
- They know how long their workflows might take, and this is updated by the scheduler
- Speculative work(flows) can be proposed
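A minimal sketch of how such deadline-driven launch decisions might look, assuming each task carries a prediction model predict(host, nprocs) and that prediction error is absorbed by a multiplicative margin (the error-bounds question from the earlier slide); this is an illustrative interface, not the middleware's actual API.

```python
def pick_configuration(predict, resources, deadline, margin=1.2):
    """Choose the cheapest host/processor-count pair whose predicted
    runtime, inflated by an error margin, still meets the deadline.

    predict(host, nprocs) -> seconds is the task's prediction model;
    resources maps host name -> available processor count.
    (Hypothetical interface, for illustration only.)"""
    candidates = []
    for host, max_procs in resources.items():
        for nprocs in range(1, max_procs + 1):
            eta = predict(host, nprocs) * margin
            if eta <= deadline:
                candidates.append((nprocs, eta, host))
    if not candidates:
        return None  # no configuration is likely to meet the deadline
    # Prefer the fewest processors (better utilisation), then the
    # earliest predicted finish.
    return min(candidates)

# e.g. pick_configuration(model.runtime, {"cluster-a": 64, "ngs": 32},
#                         deadline=3600)
```

Preferring the fewest processors that still meet the deadline is one reading of "optimising over different metrics": swapping the sort key turns the same matchmaker into an earliest-finish or cheapest-host policy.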
Predictions and workflow
- Workflow construction interacts with the performance models: as input data becomes available, non-scheduled predictions are made
- When the workflow is run, predictions are updated according to resource availability and to actual runtime behaviour
- Note that we get a runtime prediction for the complete workflow (see the sketch at the end of these notes)
- The workflow engine, scheduler and prediction system interact, and the data is continually updated

Workflow speculation
- Here a prediction is made in the workflow tool (note this takes some time if probe tasks are submitted)
- We can request that this workflow be submitted speculatively; it is seen as ghost tasks in the scheduler
- The user might decide to re-engineer the study, then re-predict and submit

Resource speculation
- Scheduled prediction
- What happens if we add more resources? The system re-schedules and updates the prediction (runtime goes down)
- What happens if we delete resources? What happens if nodes fail?
- Trade-off between resources and application capability

So what's the point of all this?
- When are we going to get the results of this analysis?
- Am I buying the right kind of equipment to support this work?
- What extra resources do I need to improve this process?
- What is the impact of upgrading my software and/or hardware?
This is not the answer, but it is a useful pilot.

Conclusions
An example of some of the features of our work, through a UK-based scientific demonstrator with practical motivation, driven by clinical researchers:
- Deadlines = before the patient leaves the clinic
- QoS = can we improve the diagnosis and recommended treatment?
- Capabilities = predictable delivery of a computing-supported service
- Management = are the NHS trusts spending their money in the right way to support this type of eHealthcare?
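The workflow-level prediction mentioned on the "Predictions and workflow" slide can be sketched as a critical-path computation over the task DAG. The DAG and the per-task estimates below are hypothetical (the Nreg figure is taken from the runtime table earlier), and the demonstrator's actual workflow engine is not shown.

```python
from functools import lru_cache

def workflow_runtime(dag, task_eta):
    """Predict complete-workflow runtime as the longest (critical) path
    through the task DAG, given per-task runtime predictions.

    dag maps task -> list of predecessor tasks; task_eta maps task ->
    predicted seconds. Refreshing task_eta as resources or observed
    runtimes change re-predicts the whole workflow cheaply."""
    @lru_cache(maxsize=None)
    def finish(task):
        preds = dag.get(task, [])
        start = max((finish(p) for p in preds), default=0.0)
        return start + task_eta[task]

    return max(finish(t) for t in dag)

# Hypothetical clinical pipeline: BET, then FAST, then Nreg, with a
# volume-rendering side branch off the extracted brain.
dag = {"bet": [], "fast": ["bet"], "nreg": ["fast"], "render": ["bet"]}
eta = {"bet": 120.0, "fast": 600.0, "nreg": 14210.0, "render": 300.0}
print(workflow_runtime(dag, eta))  # 14930.0
```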