• Spin-out from Warwick (& Oxford) Universities
• Specialising in distributed (technical) computing
– Cluster and GRID computing technology
• 14 employees & growing; focussed expertise in:
– Scientific Computing
– Computer systems and support
– Presently 5 PhDs in HPC and Parallel Computation
– Expect growth to 20+ people in 2003
• Establish an HPC systems integration company…
• …but re-invest profits into software
– Exploiting IP and significant expertise
– First software product released
– Two more products in prototype stage
• Two complementary ‘businesses’
– Both high growth
• Installations include:
– Largest Sun HPC cluster in Europe (176 proc)
– Largest Sun / Myrinet cluster in UK (128 proc)
– AMD, Intel and Sun clusters at 21 UK Universities
– Commercial clients include Akzo Nobel, Fujitsu, McLaren F1, Rolls-Royce, Schlumberger, Texaco…
• Delivered a 264 proc Intel/Myrinet cluster:
– 1.3 Tflop/s Peak !!
– Forms part of the White Rose Computational Grid
• Pre-configured ‘grid’-enabled systems:
– Clusters and farms
– The SCore parallel environment
– Virtual ‘desktop’ clusters
• Grid-enabled software products:
– The Distributed Debugging Tool
– Large-scale distributed graphics
– Scalable, intelligent & fault-tolerant parallel computing
• Choice of DRMs and schedulers:
– (Sun) GridEngine
– PBS / PBS-Pro
– LSF / ClusterTools
– Condor
– Maui Scheduler
• Globus 2.x gatekeeper (Globus 3 ???)
• Customised access portal
• Developed by the Real World Computing Partnership in Japan (www.pccluster.org)
• Unique features that are unavailable in most parallel environments (illustrated by the sketch after this list):
– Low latency, high bandwidth MPI drivers
– Network transparency: Ethernet, Gigabit Ethernet and Myrinet
– Multi-user time-sharing (gang scheduling)
– O/S level checkpointing and failover
– Integration with PBS and SGE
– MPICH-G port
– Cluster management functionality
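As a rough illustration of the MPI-driver and network-transparency points above, a completely ordinary MPI program such as the ping-pong probe below runs unchanged whether the environment carries it over Ethernet, Gigabit Ethernet or Myrinet; only the measured latency changes. (This is a generic sketch, not part of SCore itself.)

/* Generic MPI ping-pong latency probe (illustration only; run with at
 * least two processes).  The same source runs unchanged over Ethernet,
 * Gigabit Ethernet or Myrinet when the parallel environment supplies
 * the low-level drivers. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    char byte = 0;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }   /* need two ranks */

    t0 = MPI_Wtime();
    for (i = 0; i < 1000; i++) {                  /* 1000 round trips */
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %g us\n", (t1 - t0) / 1000.0 / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}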
• Linux Workstation Strategy
– Integrated software stack for HPTC (compilers, tools & libraries) – cf. UNIX workstations
• Aim to provide a GRID at point of sale:
– Single point of administration for several machines
– Files served from front-end
– Resource management
– Globus enabled
– Portal
• A cluster with monitors !!
• A debugger for distributed parallel applications
– Launched at Supercomputing 2002
• Aim is to be the de facto HPC debugging tool
– Linux ports for GNU, Absoft, Intel and PGI
– IA64 and Solaris ports; AIX and HP-UX soon…
– Commodity pricing structure !
• Existing architecture lends itself to the GRID:
– Thin client GUI + XML middleware + back-end
– Expect GRID-enabled version in 2003
• Aims
– To enable very large models to be viewed and manipulated using commodity clusters
– Visualisation on (local or remote) graphics client
• Technology
– Sophisticated data-partitioning and parallel I/O tools
– Compression using distributed model simplification
– Parallel (real-time) rendering
• To be GRID-enabled within e-Science ‘Gviz’ project
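To make the data-partitioning and parallel I/O point above concrete, the sketch below shows one conventional way a large model can be split across a commodity cluster: each MPI rank reads only its own contiguous slice of the model file with MPI-IO. The file name and flat layout are assumptions for illustration only, not the product's actual format.

/* Sketch only: each MPI rank reads its own contiguous slice of a large
 * model file with MPI-IO, partitioning the data set across the cluster
 * before any simplification or rendering.  The file name "model.dat"
 * and a flat array-of-floats layout are assumptions for illustration
 * (any remainder bytes are ignored in this toy). */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_File fh;
    MPI_Offset total, chunk, offset;
    float *slice;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_File_open(MPI_COMM_WORLD, "model.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_get_size(fh, &total);

    chunk  = total / (MPI_Offset)sizeof(float) / size;  /* floats per rank */
    offset = (MPI_Offset)rank * chunk * (MPI_Offset)sizeof(float);
    slice  = malloc((size_t)chunk * sizeof(float));

    /* Collective read: every rank pulls only its own partition. */
    MPI_File_read_at_all(fh, offset, slice, (int)chunk,
                         MPI_FLOAT, MPI_STATUS_IGNORE);

    /* ... local simplification / rendering of 'slice' would follow ... */

    MPI_File_close(&fh);
    free(slice);
    MPI_Finalize();
    return 0;
}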
• Aim to invest in new computing paradigms
• Developing parallel applications is far from trivial
– OpenMP does not map well onto cluster architectures
– MPI is too low-level
– Few skills in the marketplace !
– Yet growth of MPPs is exponential…
• Most existing applications are not GRID-friendly
– # of processors fixed
– No Fault Tolerance
– Little interaction with DRM
• Throughput of parallel jobs is limited by:
– Static submission model: ‘mpirun -np …’ (a dynamic alternative is sketched below)
– Static execution model: # processors fixed
– Scalability: many jobs use too many processors !
– Job Starvation
• Available tools can only solve some issues
– Advance reservation and backfill (e.g. Maui)
– Multi-user time-sharing (gang scheduling)
• The application itself must take responsibility !!
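As one illustration (not necessarily the mechanism intended here) of how an application can escape the static ‘mpirun -np N’ model, MPI-2's dynamic process interface lets a running job spawn additional workers when the DRM grants more processors:

/* Sketch only: MPI-2 dynamic process creation as one possible escape
 * from the static 'mpirun -np N' model.  A running job grows by
 * spawning extra workers when the DRM grants more processors;
 * "worker" is a hypothetical executable name. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm workers;
    int extra = 4;          /* processors granted later by the DRM */

    MPI_Init(&argc, &argv);

    /* Collective over MPI_COMM_WORLD: spawn 'extra' additional processes. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, extra, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);

    /* Parent and children can now redistribute data over the
     * intercommunicator 'workers' onto the enlarged processor set. */

    MPI_Comm_disconnect(&workers);
    MPI_Finalize();
    return 0;
}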
• Job scheduler should decide the available processor resource !
• The application then requires:
– In-built partitioning / data management
– Appropriate parallel I/O model
– Hooks into the DRM
• DRM requires:
– Typical memory and processor requirements
– Level-of-service (LOS) information
– Hooks into the application
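A hypothetical sketch of what those mutual ‘hooks’ might look like as a C interface is given below; none of these names belong to an existing product.

/* Hypothetical sketch of the two-way 'hooks' between an application and
 * a DRM implied by the bullets above.  None of these names belong to an
 * existing product; they only illustrate the shape of the interface. */

/* Information the application exposes to the DRM. */
typedef struct {
    long typical_memory_mb;        /* typical per-process memory      */
    long peak_memory_mb;           /* peak per-process memory         */
    int  min_procs, max_procs;     /* useful processor range          */
} app_resource_profile;

/* Requests the DRM can make of the application. */
typedef int (*resize_cb)(int new_nprocs);      /* re-partition onto N    */
typedef int (*checkpoint_cb)(const char *dir); /* save restartable state */

/* Registration call a DRM-aware runtime might provide. */
int drm_register_hooks(const app_resource_profile *profile,
                       resize_cb on_resize,
                       checkpoint_cb on_checkpoint);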
• Additional resources may become available or be required by other applications during execution…
• Ideal situation:
– DRM informs application
– Application dynamically re-partitions itself
• Other issues:
– DRM requires knowledge of the application (benefit of data redistribution must outweigh cost !)
– Frequency of dynamic scheduling
– Message passing must have dynamic capabilities
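A minimal sketch of the ideal loop described above, with stub helpers standing in for the hypothetical DRM query and data redistribution:

/* Sketch only: the main loop of an application that co-operates with a
 * DRM.  The drm_* and repartition helpers are stubs standing in for a
 * hypothetical runtime. */
#include <mpi.h>

static int  drm_poll_target_procs(void) { return 0; }   /* 0 = no change */
static void repartition(int nprocs)     { (void)nprocs; /* move data */ }
static int  compute_step(int step)      { return step < 100; }

int main(int argc, char **argv)
{
    int current, wanted, step = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &current);

    while (compute_step(step++)) {
        /* Consult the DRM only occasionally: the benefit of redistributing
         * data must outweigh its cost. */
        wanted = drm_poll_target_procs();
        if (wanted > 0 && wanted != current) {
            repartition(wanted);        /* re-partition onto new layout */
            current = wanted;
        }
    }

    MPI_Finalize();
    return 0;
}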
• Optimal scheduling requires more information:
– How well the application scales
– Peak and average memory requirements
– Application performance vs. architecture
• The application ‘cookie’ concept:
– Application (and/or DRM) should gather information about its own capabilities
– DRM can then limit # of available processors
– Ideally requires hooks into the programming paradigm…
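The ‘cookie’ might amount to little more than a small, machine-readable record of the application's measured behaviour; one hypothetical C layout is sketched below.

/* Hypothetical sketch of an application 'cookie': a record of the code's
 * own measured behaviour that a DRM could consult when limiting the
 * processor count.  The struct and its fields are assumptions for
 * illustration only. */
typedef struct {
    char   app_name[64];
    int    max_useful_procs;      /* point beyond which scaling is poor    */
    double speedup_at_max;        /* measured speed-up at that count       */
    long   avg_memory_mb;         /* average per-process memory            */
    long   peak_memory_mb;        /* peak per-process memory               */
    double gflops_per_proc;       /* sustained performance on this machine */
} app_cookie;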
• On large MPPs, processors/components will fail !
• Applications need fault tolerance:
– Checkpointing + RAID-like redundancy (cf SCore)
– Dynamic repartitioning capabilities
– Interaction with the DRM
– Transparency from the user’s perspective
• Fault-tolerance relies on many of the capabilities described above…
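For illustration only, the toy code below shows application-level checkpoint/restart of a single state array; SCore's O/S-level checkpointing (mentioned earlier) achieves the same effect transparently, without source changes.

/* Toy sketch of application-level checkpoint/restart of one state array.
 * A failed run resumes from the last checkpoint file instead of from
 * scratch. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000L

int main(void)
{
    double *state = malloc(N * sizeof(double));
    long step = 0, i;
    FILE *f;

    /* Resume from the last checkpoint if one exists. */
    if ((f = fopen("checkpoint.bin", "rb")) != NULL) {
        if (fread(&step, sizeof step, 1, f) == 1)
            fread(state, sizeof(double), N, f);
        fclose(f);
    } else {
        for (i = 0; i < N; i++) state[i] = 0.0;
    }

    for (; step < 10000; step++) {
        for (i = 0; i < N; i++) state[i] += 1.0;        /* 'compute'    */

        if (step % 100 == 0 && (f = fopen("checkpoint.bin", "wb"))) {
            fwrite(&step, sizeof step, 1, f);           /* checkpoint   */
            fwrite(state, sizeof(double), N, f);
            fclose(f);
        }
    }

    free(state);
    return 0;
}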
• Commitment to near-term GRID objectives
– Turn-key clusters, farms and storage installations
– Ongoing development of ‘GRID-enabled’ tools
– Driven by existing commercial opportunities….
• ‘Blue-sky’ project for next generation applications
– Exploits existing IP and advanced prototype
– Expect moderate income from focussed exploitation
– Strategic positioning: existing paradigms will ultimately be a barrier to the success of (V-)MPP computers / clusters !