Condor at Cardiff Dr James Osborne

Contents
• What is Condor
• Condor at Cardiff
• Condor Users at Cardiff
• Green Computing at Cardiff
• Advanced Research Computing at Cardiff
• Virtualization
• Patterns
What is Condor
• Condor is the name for two species of New World vultures, each in a monotypic genus
– They are the largest flying land birds in the Western Hemisphere
What is Condor
• A specialised workload management system for compute-intensive jobs
• Users submit their jobs to Condor
– Condor places them into a queue
– Condor chooses where and when to run them
– Condor carefully monitors their progress
– Condor informs the user upon completion
http://www.cs.wisc.edu/condor/
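The four steps above (queue, place, monitor, inform) can be sketched as a toy model. This is only an illustration of the workflow, not Condor's actual scheduler or API: the names `Job`, `ToyPool`, `run_cycle` and the machine names are all hypothetical, and the toy jobs finish instantly.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    owner: str
    name: str
    status: str = "queued"


@dataclass
class ToyPool:
    """Toy stand-in for a workload manager; NOT Condor's real scheduler."""
    idle_machines: list
    queue: deque = field(default_factory=deque)
    notifications: list = field(default_factory=list)

    def submit(self, job):
        # Step 1: the job is placed into a queue.
        self.queue.append(job)

    def run_cycle(self):
        # Step 2: choose where and when to run (first job, first idle machine).
        while self.queue and self.idle_machines:
            job = self.queue.popleft()
            machine = self.idle_machines.pop(0)
            job.status = f"running on {machine}"
            # Step 3: monitor progress; in this toy the job finishes at once.
            job.status = "completed"
            self.idle_machines.append(machine)
            # Step 4: inform the user upon completion.
            self.notifications.append(
                f"{job.owner}: {job.name} completed on {machine}"
            )


pool = ToyPool(idle_machines=["ws-001", "ws-002"])
for i in range(3):
    pool.submit(Job(owner="alice", name=f"job-{i}"))
pool.run_cycle()
for message in pool.notifications:
    print(message)
```

A real pool differs in every detail (matchmaking, preemption, fair share), but the shape of the loop is the same: users only submit, and the system handles placement and reporting.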
Condor at Cardiff - Pilot
• The Condor pool began as a pilot service in April 2004, led by Dr Hugh Beedie, CTO of Information Services, in conjunction with staff at the Welsh e-Science Centre
– First user from the School of Business
– A solution looking for problems…
Condor at Cardiff - Production
• The Condor pool transitioned to a production service in January 2006 with the appointment of Dr James Osborne as project manager
– Latest user from the School of Psychology
– Doubled size of pool, tripled number of users
– Distributed using Novell ZENworks
– Common condor_config files EA, EI, S, SEA
– Injected condor_config_local variables
• IS_OWNED_BY, IS_EXECUTE_ALWAYS, RANK
Central Manager: master, collector, negotiator
Submit Nodes (30 Workstations): master, schedd, shadow
Execute Nodes (1600 Workstations): master, startd, starter
Condor Users at Cardiff
• A user, in a computing context, is someone who uses a computer system
– Users may need to identify themselves for the purposes of accounting, security, logging and resource management
– Users are also widely characterized as the class of people that uses a system without the complete technical expertise required to fully understand the system
Growth of User Base
[Figure: number of users (axis 0 to 35) by quarter, Q1-06 to Q4-07]
Diversity of User Base
• Architecture 1
• Biosciences 9
• Business 1
• Computer Sci 6
• Engineering 3
• Epidemiology 2
• History Arch 2
• Mathematics 2
• Optometry 2
• Physics 2
• Psychology 1
• Social Sci 1
Total 32
Diversity of Applications
• Blast, Damfilt
• Dammin, Energyplus
• Gasbor, Grinder
• Lea, Leadmix
• Matlab, Msvar
• Oxcal, Perl
• Pest, R
• Sienna, Structure
• Econometric Modelling
• Fluid Dynamics
• Fourier Analysis
• Geological Modelling
• Image Processing
• Radiation Transport
• Travelling Salesman
• WIFI Roaming
Structural Biophysics Group
Donna Lammie
• OPTOM
• X-Ray Diffraction
• Determine shape of molecules
• Time on a single workstation = 2-3 Days
• Time on the Condor pool = 2-3 Hours
• Speed-up factor of 2000%
Donna Lammie
[Figure: panels PF2, PF5, PF7, PF8, PF9, PF10, PF11 and PF12, each labelled 90°]
C. Baldock et al. Nanostructure of Fibrillin-1 Reveals Compact Conformation of EGF Arrays and Mechanism for Extensibility. Proceedings of the National Academy of Sciences of the United States of America, 103(32):11922-11927, August 2006.
Research Assistant
Patrick Downes
• Velindre Cancer Centre
• Monte Carlo simulation
• Radiotherapy dose calculation
• Time on a single workstation = 3 Months
• Time on the Condor pool = 36 Hours
• Speed-up of 6000%
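As a rough check on the two quoted speed-ups, the ratio of single-workstation time to pool time can be worked out directly. This is a sketch under stated assumptions: the midpoint of each "2-3" range is used, and "3 Months" is taken as 90 days; neither choice comes from the slides.

```python
def speedup(single_hours, pool_hours):
    """Return how many times faster the pool run is than one workstation."""
    return single_hours / pool_hours


# X-ray diffraction: 2-3 days on one workstation, 2-3 hours on the pool.
# ASSUMPTION: midpoints of both ranges (2.5 days, 2.5 hours).
xray = speedup(2.5 * 24, 2.5)

# Radiotherapy dose calculation: 3 months versus 36 hours.
# ASSUMPTION: 3 months taken as 90 days.
dose = speedup(90 * 24, 36)

print(f"X-ray diffraction: {xray:.0f}x ({xray * 100:.0f}%)")
print(f"Dose calculation:  {dose:.0f}x ({dose * 100:.0f}%)")
```

Under these assumptions the dose calculation comes out at 60 times (6000%), matching the slide, and the X-ray diffraction case at 24 times, which the slide appears to round to 2000%.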
Patrick Downes
Green Computing at Cardiff
• Green Computing is the study and practice of using computing resources efficiently
– Typically, technological systems or computing products that incorporate green computing principles take into account the so-called triple bottom line of economic viability, social responsibility, and environmental impact
Based on a P4 3GHz PC with 512MB RAM
Power Consumption
[Figure: watts consumed (axis 0 to 160) by machine state: Off, Hibernate, Standby, Idle, Office, Condor; data labels 0, 0, 5, 100, 112 and 150]
Watts Up Pro
• Measures
– Watts, Volts, Amps, WattHrs, Cost, Avg Kwh, Mo Cost, Max Wts, Max Vlt, Max Amp, Min Wts, Min Vlt, Min Amp, Pwr Fct, Dty Cyc, Pwr Cyc
• Freq – 1 second
• Duration – 15 minutes
Based on a P4 3GHz PC with 512MB RAM
Economic Viability
• Makes sound financial sense
• Hibernate saves £60 per year
• Condor = £30 per year (max)
• Dedicated = £150 per year
• Condor is 5 times cheaper
Saving of Hibernate = Cost of 100W Electricity (Idle State) for 16 Hours out of 24
Cost of Condor = Cost of 150W Electricity (Condor State) – Cost of 100W Electricity (Idle State)
Cost of Dedicated = Cost of 150W Electricity (Condor State) + Cost of 100W Electricity (Air Con)
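The three formulas above can be worked through numerically. This is a minimal sketch, not the author's calculation: the electricity price of £0.10 per kWh is an assumption (the slides do not state a tariff), and applying the 16-hours-a-day window to the Condor and dedicated cases as well as to hibernation is also an assumption.

```python
HOURS_PER_DAY = 16        # from the slide: 16 hours out of 24
DAYS_PER_YEAR = 365
PRICE_PER_KWH = 0.10      # ASSUMPTION: pounds per kWh, not stated in the slides

IDLE_W = 100              # idle state (from the slide)
CONDOR_W = 150            # Condor state (from the slide)
AIRCON_W = 100            # air conditioning for a dedicated machine (from the slide)


def annual_cost(watts):
    """Yearly electricity cost in pounds for a constant load of `watts`."""
    kwh = watts * HOURS_PER_DAY * DAYS_PER_YEAR / 1000
    return kwh * PRICE_PER_KWH


hibernate_saving = annual_cost(IDLE_W)
condor_cost = annual_cost(CONDOR_W) - annual_cost(IDLE_W)
dedicated_cost = annual_cost(CONDOR_W) + annual_cost(AIRCON_W)

print(f"Hibernate saving: £{hibernate_saving:.2f} per year")
print(f"Condor cost:      £{condor_cost:.2f} per year")
print(f"Dedicated cost:   £{dedicated_cost:.2f} per year")
print(f"Dedicated / Condor = {dedicated_cost / condor_cost:.1f}")
```

With the assumed tariff the results land close to the rounded slide figures of £60, £30 and £150. The factor of five does not depend on the tariff at all, since it is simply 250 W against 50 W.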
Based on a P4 3GHz PC with 512MB RAM
Environmental Impact
• Makes sound environmental sense
• Hibernate saves 650 kg CO2 per year
• Condor = 325 kg CO2 per year (max)
• Dedicated = 1,625 kg CO2 per year
• Condor is 5 times greener
Based on 10,000 P4 3GHz PCs with 512MB RAM
Across Campus
• Makes sound financial sense
– Hibernate would save £600,000 per year
• Hibernate 16 out of 24 hours
• Makes sound environmental sense
– Hibernate would save 6,500 tonnes of CO2 per year
– Rainforest required = 52 km²
– Rainforest required = 40% of the area of Cardiff
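The campus-wide figures follow from scaling the per-PC numbers by 10,000 machines. The sketch below reproduces that scaling and also derives two values the slides imply but do not state (the absorption rate per square kilometre of rainforest, and the area of Cardiff); both derived values are back-calculations, not sourced figures.

```python
PCS = 10_000                  # from the slide
SAVING_PER_PC_GBP = 60        # hibernate saving per PC per year (from the slides)
CO2_PER_PC_KG = 650           # hibernate CO2 saving per PC per year (from the slides)
RAINFOREST_KM2 = 52           # from the slide
RAINFOREST_SHARE_PERCENT = 40 # share of Cardiff's area (from the slide)

campus_saving_gbp = PCS * SAVING_PER_PC_GBP
campus_co2_tonnes = PCS * CO2_PER_PC_KG / 1000

# Back-calculated values, implied by the slide rather than stated on it.
implied_tonnes_per_km2 = campus_co2_tonnes / RAINFOREST_KM2
implied_cardiff_area_km2 = RAINFOREST_KM2 * 100 / RAINFOREST_SHARE_PERCENT

print(f"Campus saving: £{campus_saving_gbp:,} per year")
print(f"Campus CO2:    {campus_co2_tonnes:,.0f} tonnes per year")
print(f"Implied absorption: {implied_tonnes_per_km2:.0f} tonnes per km² per year")
print(f"Implied area of Cardiff: {implied_cardiff_area_km2:.0f} km²")
```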
Cardiff’s Condor Pool
• ...is the equivalent of a £500,000 supercomputer
– …costs £50,000 in equipment, power, and staff
– …improves return on investment
• ...is one of the largest pools in the UK
– …and we plan to expand the pool
• …is probably the most utilised pool in the UK
– …by a factor of 10
• ...has more users than any other pool in the UK
– …and we are working hard to keep it that way
Nobody corrected me at the 1st Campus Grids SIG in Oxford
Nobody corrected me at the 21st Open Grid Forum in Manchester
The ARC Spectrum
[Figure: the ARC spectrum, from HPC (Tightly Coupled) to HTC (Loosely Coupled): Supercomputers, NUMA Machines, Large Clusters, SMP, Small Clusters, Campus Grids; cost labels £ Million+, £ Million, £ H Thousand, £ Thousand]
The ARC Division
• ARCCA will provide, co-ordinate, support and develop advanced research computing services for researchers at Cardiff University
• ARCCA will also work with clients and partners outside the University through a range of outreach activities
• ARCCA is staffed with experts in the field who are already available to help and support your research needs through a range of services
• ARCCA is procuring a range of dedicated high-end computing equipment which is planned to be fully operational by early 2008
The ARC Organisation
• Prof Martyn Guest: Director of ARC
• Dr Christine Kitchen: Manager of ARC
• Dr James Osborne: Applications
• Mr Huw Lynes: Infrastructure
• Ms Liz Fitzgerald: Admin Officer
• Another: Programmer
Prof Martyn Guest
• 2007
– Director of Advanced Research Computing at Cardiff
• 1995
– Associate Director of Computational Science and Engineering at Daresbury
• 1971
– PhD Theoretical Chemistry
• 1967
– BSc Chemistry
The ARC Cluster
• 256 x Compute Nodes (Cluster)
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 16 GB of Memory
– ConnectX Infiniband + Dual GigE
• 4 x Compute Nodes (SMP)
– Quad Socket Quad Core Intel Xeon X7350 2.93GHz
– 32 GB of Memory, 1 TB of Local Disk (RAID5)
– ConnectX Infiniband + Dual GigE + Resilient PSU
The ARC Cluster
• 4 x Login Nodes
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 32 GB of Memory, 0.5 TB of Local Disk (RAID1)
– ConnectX Infiniband + Dual GigE + Resilient PSU
• 2 x Storage Nodes
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 32 GB of Memory + Resilient PSU
– ConnectX Infiniband + Dual GigE + Fibre Channel
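The node specifications on the two cluster slides can be totalled to give core counts and memory per node class. This is plain arithmetic on the listed figures; the dictionary layout and the `totals` helper are illustrative names, not part of any ARCCA tooling.

```python
# (node count, sockets per node, cores per socket, memory per node in GB)
NODES = {
    "compute (cluster)": (256, 2, 4, 16),
    "compute (SMP)": (4, 4, 4, 32),
    "login": (4, 2, 4, 32),
    "storage": (2, 2, 4, 32),
}


def totals(count, sockets, cores_per_socket, mem_gb):
    """Return (total cores, total memory in GB) for one class of node."""
    return count * sockets * cores_per_socket, count * mem_gb


summary = {name: totals(*spec) for name, spec in NODES.items()}

for name, (cores, mem_gb) in summary.items():
    print(f"{name:18s} {cores:5d} cores  {mem_gb:5d} GB")
```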
Virtualization
• Virtualization is a broad term that refers to the abstraction of computer resources
– This includes making a single physical resource appear to function as multiple logical resources
– Or it can include making multiple physical resources appear as a single logical resource
Based on 6 months of monitoring
Central Manager Utilisation
• CPU (Percentage) 16.60 (average) 65.99 (max)
– (Single Socket Single Core Intel Xeon 2.4GHz)
• RAM (GB) 1.09 (average) 1.60 (max)
– [55.00% and 80.00% of current capacity] (2 GB)
• Disk (GB) 1.25 (average) 1.50 (max)
– [1.71% and 2.05% of current capacity] (73 GB)
Based on 6 months of monitoring
Central Manager Utilisation
• Net In (Kbps) 29.66 (average) 45.39 (max)
– [0.02% and 0.04% of current capacity] (Gigabit)
• Net Out (Kbps) 39.13 (average) 86.17 (max)
– [0.03% and 0.07% of current capacity] (Gigabit)
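The bracketed percentages on the two utilisation slides can be recomputed from the measured values and the stated capacities. One point is an inference rather than something the slides say: the network percentages are only reproduced if the rates are read as kilobytes per second (multiplied by 8 to get kilobits) against a gigabit link taken as 1,000,000 kilobits per second.

```python
def percent(used, capacity):
    """Utilisation as a percentage of capacity."""
    return 100 * used / capacity


GIGABIT_KBITS = 1_000_000   # ASSUMPTION: gigabit link as 1,000,000 kilobits/s

ram = [percent(x, 2) for x in (1.09, 1.60)]       # GB used of 2 GB
disk = [percent(x, 73) for x in (1.25, 1.50)]     # GB used of 73 GB

# ASSUMPTION: rates read as kilobytes per second, hence the factor of 8.
net_in = [percent(x * 8, GIGABIT_KBITS) for x in (29.66, 45.39)]
net_out = [percent(x * 8, GIGABIT_KBITS) for x in (39.13, 86.17)]

for label, (average, peak) in [("RAM", ram), ("Disk", disk),
                               ("Net In", net_in), ("Net Out", net_out)]:
    print(f"{label:8s} average {average:6.2f}%  max {peak:6.2f}%")
```

Read as kilobits per second instead, the network figures would come out about eight times smaller than the percentages shown on the slide.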
Based on 6 months of monitoring
Central Manager Virtualization
• 1 x Condor Server
– Dual Socket Quad Core Intel Xeon E5472 3.0GHz
– 32 GB of Memory, 0.5 TB of Local Disk (RAID1)
– Dual GigE + Resilient PSU
• = 4 x Virtual Central Managers ?
• = 2 x Virtual Submit Nodes ?
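Whether four virtual central managers would fit on this server can be roughly checked against the peak figures from the monitoring slides. This is a back-of-envelope sketch under several assumptions: one core of the old 2.4GHz Xeon is treated as equal to one core of the new server, 0.5 TB is taken as 500 GB usable, and hypervisor overhead is ignored. The slides give no measurements for submit nodes, so the sketch only reports the headroom left for them.

```python
# Host resources, from the Condor Server slide.
# ASSUMPTION: 0.5 TB of RAID1 disk taken as 500 GB usable.
HOST = {"cores": 2 * 4, "ram_gb": 32, "disk_gb": 500}

# Peak demand of one central manager, from six months of monitoring.
# ASSUMPTION: 65.99% of one old core counts as 0.6599 of one new core.
PEAK_PER_CM = {"cores": 0.6599, "ram_gb": 1.60, "disk_gb": 1.50}

VIRTUAL_CMS = 4

demand = {key: VIRTUAL_CMS * value for key, value in PEAK_PER_CM.items()}
headroom = {key: HOST[key] - demand[key] for key in HOST}
fits = all(value >= 0 for value in headroom.values())

print(f"Four virtual central managers fit: {fits}")
for key in HOST:
    print(f"{key:8s} demand {demand[key]:7.2f}  headroom {headroom[key]:7.2f}")
```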
Design Patterns
• A Design Pattern is a general repeatable solution to a commonly occurring problem in software design
– A design pattern is not a finished design that can be transformed directly into code
– It is a description or template for how to solve a problem that can be used in many different situations
Questions
condor@cardiff.ac.uk
http://www.cardiff.ac.uk/arcca/
http://www.cs.wisc.edu/condor/