What is the impact of a priori knowledge?
Stephen Jarvis
High Performance Systems Group
University of Warwick

High Performance Systems Group

• Investigate software / systems combinations
• Analyse the operational effectiveness of applications:
  • microprocessor codes
  • distributed enterprise systems
  • large-scale scientific applications
• … on a wide range of computing architectures:
  • embedded systems
  • commodity clusters
  • high-specification supercomputers
  • Grids

Why do we do this?
Wide-ranging application:
• Quantify the advantages of different architectural options
• Compare alternative vendor systems
• Forecast final system behaviour and validate installations
• Determine the impact of system and application upgrades
• Verify the effects of maintenance (e.g. fault analysis)
• Vulnerability analysis

Variety of users:
• US Navy Ocean Systems Center, LANL, NASA, Mental Images, National Physical Laboratory, Thomson ASM, INRIA, Simulog, IBM, BMW, Microsoft, HP Labs, BT, AWE …

Workflow research
Other talks:
• Modelling parallel pipelined synchronous wavefront applications
• Dynamic operating policies for hosting environments
• Predicting the power footprints of architecture / application combinations
• Multi-core / Cell design and application performance

This talk:
• Performance-based middleware for Grid computing
• e-Science Core Programme funded research
• Modelling applications and hardware
• Deriving accurate performance data from these models
• Using this data for effective resource management

Demonstrators
Business demonstrator:
• Joint work with IBM Hursley and T.J. Watson Research Labs
• "Application of performance prediction techniques to distributed enterprise systems", with IBM Watson, J. Supercomputing, 2004
• "Workload allocation for distributed enterprise applications", with IBM Watson, IEEE Trans. Parallel and Distributed Systems, 2005
• "Dynamic operating policies for commercial hosting environments", with HP Labs, BT, IBM, Newcastle Uni. and NB2BC; Computer Science Challenges to emerge from e-Science, 2005-2008

The focus of this talk is the scientific demonstrator:
• Collaboration with KCL, Oxford and Imperial (IXI)

IXI – Information eXtraction from Images

• UK e-Science medical imaging project
• Large-scale image processing and medical image analysis on dedicated clusters / Condor pools / NGS
• Images from different imaging modalities (CT, MR and PET)
• Volume rendering and non-rigid registration on pairs of 3-D MRI scans
• Used to compensate for misregistration in breast MR images and when isolating tumour growth

Registration procedure

• A uniform mesh of control points is fitted to the 3-D image
• Similarity measure based on normalised mutual information
• Gradient descent optimisation: points are moved in (x, y, z), the effect of each transformation is measured by a fitness function, and improved transformations are kept
• The independence of (neighbourhood) B-splines makes it computationally tractable to use large numbers of points
• Optimisation is performed at different image resolutions (through B-spline subdivision)
• The fitness function ensures that transformations are well formed

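A minimal Python sketch of this optimisation loop (illustrative only: the function, its arguments and the finite-difference gradient estimate are assumptions, not the Nreg implementation):

    import numpy as np

    def register(points, fitness, step=1.0, iters=50, eps=0.5):
        # Gradient-ascent sketch: estimate the gradient of the fitness
        # function by finite differences, move the control points in
        # (x, y, z), and keep only improving transformations.
        best = fitness(points)                    # points: array of shape (N, 3)
        for _ in range(iters):
            grad = np.zeros_like(points)
            for i in range(len(points)):
                for axis in range(3):
                    trial = points.copy()
                    trial[i, axis] += eps
                    grad[i, axis] = (fitness(trial) - best) / eps
            candidate = points + step * grad      # move points in (x, y, z)
            score = fitness(candidate)
            if score > best:                      # improved transformations are kept
                points, best = candidate, score
            else:
                step /= 2.0                       # smaller moves, echoing finer resolutions
        return points, best
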
Complexity

• Computationally intensive problem: runs take tens of hours
• Runtime is limited by the speed of the CPU and main memory, and is proportional to the number of control points in the target image
• Runtime can nonetheless differ by an order of magnitude depending on the properties of the image

Target image   Source image   Iterations   Runtime
b7_s2          b7_s2          15           3863 s
b7_s2          b7_s1          44           14210 s
b9_s2          b9_s1          83           26940 s
b9_s2_e2       b9_s3_e2       114          4284 s
b9_s3_e2       b8_s3_e2       134          3515 s

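A quick sanity check (not from the original talk) of the spread in this table: the per-iteration cost implied by these five runs varies by roughly a factor of twelve, so iteration count alone does not determine runtime:

    # Per-iteration cost implied by the table above.
    runs = {
        ("b7_s2", "b7_s2"): (15, 3863),
        ("b7_s2", "b7_s1"): (44, 14210),
        ("b9_s2", "b9_s1"): (83, 26940),
        ("b9_s2_e2", "b9_s3_e2"): (114, 4284),
        ("b9_s3_e2", "b8_s3_e2"): (134, 3515),
    }
    for (target, source), (iters, secs) in runs.items():
        print(f"{target} <- {source}: {secs / iters:.0f} s/iteration")
    # Output ranges from ~26 s to ~325 s per iteration.
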
Deriving performance data

• Performance modelling covering three medical imaging tools: BET, FAST and Nreg
• Allows real clinical workflows to be built

Observations from the measured data:
1. Highly variable runtime: a factor of 16 between the fastest and slowest runs at the same image size
2. Two classes of registration, depending on the destination image
3. Self-registration is fast
4. Prediction is based on timing of subsampled images

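A sketch of the subsample-based predictor in point 4 (the function and the scaling exponent are assumptions to be calibrated per tool from benchmark runs, not published values):

    def predict_full_runtime(subsample_time, subsample_factor, exponent=1.0):
        # Time the tool on a subsampled volume, then scale up by the voxel
        # ratio; for a 3-D image, halving each axis leaves 1/8 of the voxels.
        voxel_ratio = subsample_factor ** 3
        return subsample_time * voxel_ratio ** exponent

    # e.g. 40 s on a half-resolution volume extrapolates to ~320 s at full
    # resolution under a linear-in-voxels assumption:
    predicted = predict_full_runtime(40.0, 2)
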
What a priori knowledge is needed?

• What error bounds are acceptable when dealing with these runtime predictions?
• Sub-sample analysis has a cost: there is a trade-off between gathering information and slowing down the launch
• We can improve things by caching performance results and closing the feedback loop
• Ultimately it depends on what we want to do with this information
• It is important to understand that the models are parameterised (p, c, m, i, d, etc.)

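A sketch of the caching and feedback-loop idea (the class and its structure are assumptions, not the project's middleware):

    class PredictionCache:
        def __init__(self, model):
            self.model = model      # parameterised model: params -> predicted runtime
            self.observed = {}      # params -> list of measured runtimes

        def predict(self, params):
            # params is a hashable tuple of the model's parameters (p, c, m, i, d, ...).
            runs = self.observed.get(params)
            if runs:                # prefer cached measurements once we have them
                return sum(runs) / len(runs)
            return self.model(params)

        def record(self, params, runtime):
            # Closing the feedback loop: store the actual runtime for reuse.
            self.observed.setdefault(params, []).append(runtime)
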
Performance-based workload management

[Architecture diagram] Requests from users or other domain schedulers arrive through a portal, where performance modelling attaches a prediction model to each task. Tasks are fed through a pre-scheduler (the pre-execution engine and schedule queue), matched to resources by a GA-based matchmaker, and deployed through a cluster connector to Condor once prediction information has been taken into account.

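A toy version of the scheduling step (all names assumed; the real matchmaker uses a GA search, whereas this sketch is a greedy earliest-finish placement):

    def schedule(queue, hosts, predict):
        # Place each task on the host with the earliest predicted finish time.
        # predict(task, host) evaluates the task's prediction model on a host.
        free_at = {h: 0.0 for h in hosts}   # host -> time at which it becomes free
        plan = []
        for task in queue:                  # tasks with associated prediction models
            host = min(free_at, key=lambda h: free_at[h] + predict(task, h))
            start = free_at[host]
            free_at[host] = start + predict(task, host)
            plan.append((task, host, start))
        return plan
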
How is a priori knowledge used?
Deadline-driven jobs:
• Can be launched in a configuration that is likely to satisfy the deadline (see the sketch after this list)
• Different numbers of processors or different architectures can be selected
• Different (medical) data-sets can be selected

Improving resource utilisation:
• As above, but optimising over different metrics

Interaction with the medical scientist:
• They know how long their workflows might take
• This estimate is updated by the scheduler
• Speculative work(flows) can be proposed

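A sketch of the deadline-driven launch decision (the function and the safety margin are assumptions):

    def pick_configuration(deadline, candidates, predict, margin=1.2):
        # candidates are (architecture, processor_count) pairs; pad predictions
        # with a safety margin for prediction error, then take the feasible
        # configuration with the smallest processor footprint.
        feasible = [(arch, n) for arch, n in candidates
                    if predict(arch, n) * margin <= deadline]
        if not feasible:
            return None             # no configuration is likely to meet the deadline
        return min(feasible, key=lambda c: c[1])
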
Predictions and workflow

• Workflow construction interacts with the performance models
• As input data becomes available, non-scheduled predictions are made
• When the workflow runs, predictions are updated:
  • according to resource availability
  • according to actual runtime behaviour
• Note that we obtain a runtime prediction for the complete workflow (a rolling estimate, sketched below)
• There is interaction between the workflow engine, the scheduler and the prediction system
• Data is continually updated

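A sketch of that rolling whole-workflow estimate (structure assumed; a linear chain of tasks for simplicity):

    def workflow_prediction(tasks, predict, completed):
        # completed maps finished tasks to their actual runtimes; the estimate
        # tightens as measured values replace model predictions.
        total = 0.0
        for task in tasks:
            if task in completed:
                total += completed[task]   # actual runtime, fed back
            else:
                total += predict(task)     # model prediction for remaining work
        return total
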
Workflow speculation

• A prediction is made in the workflow tool (note that this takes some time if probe tasks are submitted)
• We can request that this workflow be submitted speculatively
• Speculative workflows appear as ghost tasks in the scheduler
• The user might decide to re-engineer the study
• Re-predict and submit

Resource speculation

• Scheduled prediction
• What happens if we add more resources?
  • the system re-schedules
  • the prediction is updated (runtime goes down)
• What happens if we delete resources?
• What happens if nodes fail?
• There is a trade-off between resources and application capability (a what-if sketch follows this list)

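A what-if sketch for these questions (names assumed; it reuses the schedule() sketch from the workload-management slide):

    def what_if(queue, hosts, predict, add=(), remove=()):
        # Re-run the scheduler on a modified resource pool and report the new
        # makespan, so adding or deleting nodes (or simulating failures)
        # updates the prediction without touching the live system.
        pool = [h for h in hosts if h not in set(remove)] + list(add)
        plan = schedule(queue, pool, predict)
        finish = max((start + predict(task, host) for task, host, start in plan),
                     default=0.0)
        return plan, finish
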
So what’s the point of all this?

• When are we going to get the results of this analysis?
• Am I buying the right kind of equipment to support this work?
• What extra resources do I need to improve this process?
• What is the impact of upgrading my software and/or hardware?
• This is not the answer, but it is a useful pilot

Conclusions

• An example of some of the features of our work, shown through a UK-based scientific demonstrator
• Practical motivation, driven by clinical researchers:
  • Deadlines = before the patient leaves the clinic
  • QoS = can we improve the diagnosis and recommended treatment?
  • Capabilities = predictable delivery of computing-supported services
  • Management = are the NHS trusts spending their money in the right way to support this type of e-Healthcare?