Condor, Globus and SRB: Tools for Constructing a Campus Grid Jon Wakelin

2
Overview
• Condor
• Globus
• Storage Resource Broker (SRB)
• UoBGrid
• Summary
3
Condor Overview
• High Throughput Computing Environment
– From networked resources (Condor pool)
• Like other Schedulers
– Queuing mechanism
– Prioritisation scheme
– Scheduling Policy
• Unlike other schedulers
– Doesn’t need dedicated resources
– Desktop workstations, library or PC lab computers
– Cycle scavenging
4
Class Ads
• Classified advertisements
– Machine Class Ads (for sale)
– Job Class Ads (wanted)
• Machine Class Ads
– Created from information “advertised” by machines in the condor pool
– Can add extra Class Ad information
• Job Class Ads
– Created from information in the condor submit file
– Created from default values
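The "for sale" and "wanted" analogy can be made concrete with a toy model. The sketch below is illustrative only: real Class Ads are expressions evaluated by Condor, not Python dictionaries, and the machine names and the `MemoryNeeded` attribute are hypothetical.

```python
# Toy model of Class Ad matchmaking (not Condor's actual implementation).

def matches(job_ad, machine_ad):
    """A job and a machine match when each side's requirements accept the other."""
    return job_ad["requirements"](machine_ad) and machine_ad["requirements"](job_ad)

# Machine ads ("for sale"): attributes advertised by machines in the pool.
machines = [
    {"name": "lab-pc-01", "Arch": "INTEL", "OpSys": "LINUX", "Memory": 512,
     "requirements": lambda job: job["MemoryNeeded"] <= 512},
    {"name": "lab-pc-02", "Arch": "INTEL", "OpSys": "WINNT51", "Memory": 256,
     "requirements": lambda job: True},
]

# Job ad ("wanted"): built from the submit file plus default values.
job = {"owner": "jon", "MemoryNeeded": 128,
       "requirements": lambda m: m["Arch"] == "INTEL" and m["OpSys"] == "LINUX"}

matched = [m["name"] for m in machines if matches(job, m)]
print(matched)
```

Only the Linux machine satisfies the job's requirements, and the job in turn satisfies that machine's requirements, so it is the only match.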
5
Different roles in a Condor pool
• Central Manager
• Submit
• Execute
• Or a combination of these
– e.g. submit and execute node
• Different daemons will be started depending on the role of the machine
6
Condor Daemons
• All Machines
– condor_master - controls other daemons
• Central Manager
– condor_collector - Collects information from other machines
– condor_negotiator - Performs matchmaking
• Execute
– condor_startd - Starts, stops, suspends jobs
• Submit
– condor_schedd - Maintains queue of jobs
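The relationship between roles and daemons listed above can be written as a small lookup. This is only a sketch of the slide's table: the role keys and the `daemons_for` helper are hypothetical names, not part of Condor's configuration.

```python
# Daemons per role, as listed on the slide; condor_master runs everywhere.
DAEMONS_BY_ROLE = {
    "central_manager": ["condor_collector", "condor_negotiator"],
    "execute": ["condor_startd"],
    "submit": ["condor_schedd"],
}

def daemons_for(roles):
    """Return the daemons a machine would run for a combination of roles."""
    daemons = ["condor_master"]
    for role in roles:
        for daemon in DAEMONS_BY_ROLE[role]:
            if daemon not in daemons:
                daemons.append(daemon)
    return daemons

# A combined submit and execute node.
print(daemons_for(["submit", "execute"]))
```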
7
Job Submission
Executable = /bin/ls
Arguments = -l
InitialDir = /usr/bin
Output = out
Error = err
Queue
8
Job Submission
Executable = /bin/ls
Arguments = -l
InitialDir = /usr/bin
Output = out.$(Process)
Error = err.$(process)
Queue 2
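With `Queue 2`, the job is queued twice and `$(Process)` takes the values 0 and 1, so each run writes to its own files. The sketch below imitates that expansion. The `expand` helper is hypothetical, and it assumes macro names are matched case-insensitively, which is my understanding of Condor's behaviour.

```python
import re

def expand(template, process):
    """Replace $(Process), in any capitalisation, with the process number."""
    return re.sub(r"\$\(process\)", str(process), template, flags=re.IGNORECASE)

# Two queued processes give two sets of output and error files.
for process in range(2):
    print(expand("out.$(Process)", process), expand("err.$(process)", process))
```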
9
Job Submission
Executable = /bin/ls
Arguments = -l
InitialDir = /usr/bin
Output = out.$(Process)
Error = err.$(process)
Requirements = ((Arch == "INTEL" && OpSys == "LINUX") || \
(Arch == "INTEL" && OpSys == "IRIX65"))
Queue 2
10
Job Submission
Executable = gaussian.$$(Arch).$$(OpSys)
InitialDir = /home/jon
Input = chlorobenzene.in
Output = chlorobenzene.out.$(Process)
Error = chlorobenzene.err.$(Process)
Requirements = ((Arch == "INTEL" && OpSys == "LINUX") || \
(Arch == "INTEL" && OpSys == "IRIX65"))
Queue 2
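The `$$(Arch)` and `$$(OpSys)` references are filled in from the Class Ad of the machine the job is matched to, so one submit file can select a platform-specific executable. A minimal sketch of that substitution follows, assuming a hypothetical `substitute` helper and a machine ad represented as a dictionary.

```python
import re

def substitute(template, machine_ad):
    """Replace each $$(Attr) with the value of Attr in the matched machine ad."""
    return re.sub(r"\$\$\((\w+)\)", lambda m: str(machine_ad[m.group(1)]), template)

matched_machine = {"Arch": "INTEL", "OpSys": "LINUX"}
print(substitute("gaussian.$$(Arch).$$(OpSys)", matched_machine))
```

For a job matched to an Intel Linux machine, the executable name resolves to `gaussian.INTEL.LINUX`.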
11
Condor Commands
• condor_submit <submit_file>
• condor_q
• condor_rm
• condor_status
– Displays pool status in a succinct format
• condor_status -l <machine>
– Display full Class Ad information
12
Condor-G
• Condor interface to access Globus resources
– condor submit file
– condor commands
– Keeps log of runs
– Adds fault tolerance
• Can be used to perform matchmaking
– Must create machine Class Ads manually
– condor_advertise command
– Can be used to create a resource broker
– No RB functionality in Globus Toolkit
13
Globus Toolkit Overview
• Globus is a toolkit, not a turnkey solution
• Globus Toolkit 2.4.3 is a common choice for production grids
• Four main components
– Authentication (GSI)
– Resource management (GRAM)
– Data transfer (GridFTP)
– Resource discovery and monitoring (MDS)
14
Authentication
• Grid users need to obtain a certificate
• Applications can use the certificate to establish the identity of the user, i.e. to authenticate the user
15
PKI Authentication
• Public Key Infrastructure
– Public/Private keys
– Used to encrypt data
– And to sign certificates
• Certification Authority (CA)
– User creates certificate request
– CA signs certificates
– UK eScience CA at RAL
• Certificate contains
– Identity/Distinguished Name (DN)
– Public key
– Signature and identity of CA
16
GSI Authentication
• Grid Security Infrastructure
– extensions to PKI (X509, SSL extensions)
– Single sign-on
– Delegation
– grid-proxy-init: command to create proxy certificates
[Diagram: the CA signs the User certificate; the User certificate signs the Proxy certificate]
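Delegation works because a proxy certificate is issued by the user's certificate, which is in turn issued by a trusted CA. The sketch below walks such a chain. It is a simplified illustration only: it checks issuer links and expiry dates but performs no cryptographic signature verification, and the `Cert` class, the DNs and the CA name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Cert:
    subject: str         # Distinguished Name of the holder
    issuer: str          # Distinguished Name of the signer
    not_after: datetime  # expiry time

def chain_is_valid(chain, trusted_cas, now):
    """chain is ordered proxy first, CA-issued user certificate last."""
    if not chain:
        return False
    # Each certificate must have been issued by the next one in the chain.
    for cert, parent in zip(chain, chain[1:]):
        if cert.issuer != parent.subject:
            return False
    # The last certificate must have been issued by a CA we trust.
    if chain[-1].issuer not in trusted_cas:
        return False
    # No certificate in the chain may have expired.
    return all(cert.not_after > now for cert in chain)

now = datetime(2005, 1, 1, 12, 0)
ca_dn = "/C=UK/O=eScience/CN=Example CA"
user = Cert("/C=UK/O=eScience/OU=Example/CN=a user", ca_dn, now + timedelta(days=365))
proxy = Cert(user.subject + "/CN=proxy", user.subject, now + timedelta(hours=12))

print(chain_is_valid([proxy, user], {ca_dn}, now))
```

The short proxy lifetime used here (12 hours) is an assumption for the example. Once the proxy expires, the chain no longer validates even though the user certificate is still valid.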
17
Resource Management (GRAM)
• Grid Resource Allocation Manager
– Gatekeeper
– Resource Specification Language
– JobManagers
18
GRAM - Gatekeeper
• Daemon runs on a grid resource
• Processes incoming globus requests
• Authenticates Users
– Configured to trust a given CA
– e.g. UK eScience CA at RAL
• Maps user to local account
– DN => username
– grid-mapfile
• Passes the job onto the jobmanager
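The DN to username mapping can be illustrated by reading a grid-mapfile, where each line holds a quoted DN followed by a local account name. The sketch below is a minimal parser under that assumption. The `parse_grid_mapfile` helper and the example DNs and accounts are hypothetical, and this is not how the Gatekeeper itself is implemented.

```python
import shlex

def parse_grid_mapfile(text):
    """Return a dict mapping each certificate DN to a local account name."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        dn, accounts = shlex.split(line)[:2]
        # If several accounts are listed, take the first as the default.
        mapping[dn] = accounts.split(",")[0]
    return mapping

example = '''
# hypothetical entries
"/C=UK/O=eScience/OU=Example/CN=a user" auser
"/C=UK/O=eScience/OU=Example/CN=b user" buser,shared
'''

mapping = parse_grid_mapfile(example)
print(mapping["/C=UK/O=eScience/OU=Example/CN=a user"])
```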
19
GRAM - RSL
• Resource Specification Language
– Relations of the form (attribute op value), each in parentheses
• Operators
– Relational operators within clauses (<, <=, >, >=, =, !=)
– Logical operators between clauses (&, | )
• Attributes
– Predefined
• executable, arguments, stdin, stdout, stderr, environment
• maxCpuTime, maxWallTime, maxMemory, project, queue
– User defined
• May be handled by subsequent application
20
GRAM - RSL
&(executable="/bin/ls")
(arguments="-l")
(directory="/usr/bin")
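An RSL string of this shape is easy to generate from a set of attributes. The sketch below builds a conjunction of `(attribute="value")` relations. The `to_rsl` helper is hypothetical and handles only simple string values joined with `&`.

```python
def to_rsl(attributes):
    """Join (attribute="value") relations into a single & conjunction."""
    clauses = "".join('({}="{}")'.format(name, value)
                      for name, value in attributes.items())
    return "&" + clauses

rsl = to_rsl({
    "executable": "/bin/ls",
    "arguments": "-l",
    "directory": "/usr/bin",
})
print(rsl)
```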
21
GRAM - JobManagers
• Perl modules
– Convert RSL into scheduler specific language
• Reference implementations
– Fork, Condor, PBS, LSF
• May need to roll your own
– e.g. LoadLeveler, SGE
– Or just to add extra functionality
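To give a feel for what a JobManager does, the sketch below turns a few RSL-style attributes into a PBS batch script. The real JobManagers are Perl modules, so this Python version is purely illustrative. The `rsl_to_pbs` helper and the queue name are hypothetical, and `maxWallTime` is assumed to be given in minutes.

```python
def rsl_to_pbs(job):
    """Render a dict of RSL-style attributes as a PBS batch script."""
    lines = ["#!/bin/sh"]
    if "queue" in job:
        lines.append("#PBS -q {}".format(job["queue"]))
    if "maxWallTime" in job:
        minutes = int(job["maxWallTime"])  # assumed to be minutes
        lines.append("#PBS -l walltime={:02d}:{:02d}:00".format(
            minutes // 60, minutes % 60))
    if "stdout" in job:
        lines.append("#PBS -o {}".format(job["stdout"]))
    if "stderr" in job:
        lines.append("#PBS -e {}".format(job["stderr"]))
    if "directory" in job:
        lines.append("cd {}".format(job["directory"]))
    command = job["executable"]
    if job.get("arguments"):
        command += " " + job["arguments"]
    lines.append(command)
    return "\n".join(lines) + "\n"

script = rsl_to_pbs({
    "executable": "/bin/ls",
    "arguments": "-l",
    "directory": "/usr/bin",
    "queue": "short",
    "maxWallTime": 90,
})
print(script)
```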
22
Data Management (GridFTP)
• File Transfer Protocol
– Extends the standard FTP protocol with extra functionality
• GSI authentication
• Third Party transfers
• Striped transfers
– User application is globus-url-copy
23
Information Services (MDS)
• Collect and provide status information about Grid resources
• MDS: Monitoring and Discovery Service
– GRIS: Grid Resource Information Service
• Collects info about local resource
• Reports to GIIS server
– GIIS: Grid Index Information Service
• Aggregates information from GRIS servers
• One per organisation
– Same executable with different configuration
24
Storage Resource Broker (SRB) Overview
• Uniform interface to heterogeneous data storage resources
– Unix, IRIX, Linux file systems
– Windows
– Databases
– Physical media (tape storage)
• SRB is middleware
– Allows access to a wide range of data resources
– Allows a wide range of user Apps to be written
– All accessed through a “narrow” API
[Diagram: Applications, a narrow API, Storage]
25
SRB Access
• Applications
– Scommands: command line
– MySRB: Web access
– inQ: Windows GUI
• APIs
– Java, C, C++, Python, Perl
26
UoBGrid Overview
• What is a Campus Grid?
• Our Situation
• Software Choices
• Services
27
What is a Campus Grid?
• A Grid: Single sign-on to multiple resources located in different administrative domains
28
Our Situation
• Dedicated departmental clusters
– Windows Condor pools not a requested resource
• Separation of user communities
– parallel vs serial usage
• All contained within a single firewall domain
• Wanted to become partners in the NGS
– Systems must be compatible
– Encourage our users to become NGS users
• Full Economic Costing coming soon!
– Important to keep usage records
– Ensure best usage of purchased resources for sustainable future
29
Software Choices
• Condor 6.6.7
• Globus 2.4.3
• MyProxy
• GSI-SSH
• Storage Resource Broker (SRB)
• Virtual Data Toolkit (VDT)
– Bundles many useful tools
– Platform independent installation
– Supported release of Globus Toolkit, MyProxy & GSI-SSH
30
Planned Resources
31
Current Resources
• 4 Servers
– RB: Resource Broker
– VOM: Virtual Organisation Manager
– MDS: Monitoring and Discovery Service
– SRB: Storage Resource Broker
• 4 Compute Resources
– Monster2 - SGE, 20 CPU
– Tuya - PBS, 16 CPU
– Grendel - PBS, 110 CPU
– BSESrv1 - PBS, 28 CPU
32
Resource Broker
• Condor-G with matchmaking
• Custom script for determination of resource status
– Converts MDS information into condor Class Ads
– Adds information about available software
• User submission script
– Creates Condor submit file
– Software requirements passed into Condor submit file
– Submits jobs
– Sends data to SRB
• http://cerb-rb.bris.ac.uk/cgi-bin/rb_status.cgi
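The conversion of resource information into Class Ads can be pictured as a simple rendering step. The sketch below is not the UoBGrid script. It assumes resource information is already available as a dictionary and writes it as `Attribute = value` lines. The `to_classad` helper, the host name and attribute names such as `FreeCpus` and `HasGaussian` are hypothetical.

```python
def to_classad(attributes):
    """Render a dict as Class Ad text, one 'Name = value' line per attribute."""
    lines = []
    for name, value in attributes.items():
        if isinstance(value, bool):
            rendered = "TRUE" if value else "FALSE"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            rendered = '"{}"'.format(value)  # strings are quoted
        lines.append("{} = {}".format(name, rendered))
    return "\n".join(lines)

# Hypothetical information gathered for one compute resource.
resource = {
    "Name": "grendel.example.ac.uk",
    "JobManager": "pbs",
    "FreeCpus": 12,
    "HasGaussian": True,  # extra information about available software
}
print(to_classad(resource))
```

A user's software requirement could then be expressed against the added attribute, for example by requiring that `HasGaussian` is true.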
33
Virtual Organisation Manager
• Built using
– Webserver
• Apache + mod_ssl
• Perl CGI
– Postgres Database
– Modified Globus JobManagers
• Functionality
– Record of users and machines
– Administrative functions
– Accounting/Usage Statistics
34
Virtual Organisation Manager
• Admin – via web interface (https)
– Access based on Certificate/DN
– Add/remove users
– Add/remove resources
– Control users' access to resources
• Constructs grid-mapfiles for all resources
• https://cerb-vom.bris.ac.uk/vom-bin/VOM.cgi
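Constructing a grid-mapfile for each resource amounts to joining a table of users with a table of access rights. The sketch below shows one way this could look, assuming the data is held in plain dictionaries rather than the Postgres database. The `build_grid_mapfiles` helper, the DNs and the account names are hypothetical.

```python
def build_grid_mapfiles(users, access):
    """users: DN -> local account; access: resource -> DNs allowed on it.

    Returns a dict mapping each resource to its grid-mapfile text.
    """
    mapfiles = {}
    for resource, dns in access.items():
        lines = ['"{}" {}\n'.format(dn, users[dn]) for dn in sorted(dns)]
        mapfiles[resource] = "".join(lines)
    return mapfiles

users = {
    "/C=UK/O=eScience/OU=Example/CN=a user": "auser",
    "/C=UK/O=eScience/OU=Example/CN=b user": "buser",
}
access = {
    "grendel": list(users),
    "tuya": ["/C=UK/O=eScience/OU=Example/CN=b user"],
}

mapfiles = build_grid_mapfiles(users, access)
print(mapfiles["grendel"], end="")
```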
35
Virtual Organisation Manager
• Accounting/Usage Statistics
– Usage by machine
– Usage by users
• Modified GRAM JobManagers
– Job details sent to DB on completion
– executable, arguments, start time, end time, CPU, wall time, memory, virtual memory, jobmanager-type, number of nodes
• http://cerb-vom.bris.ac.uk/cgi-bin/VOM-usage-stats.cgi
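Usage by machine and usage by user are both totals over the same job records, grouped on a different field. The sketch below shows that aggregation on in-memory records. The record layout, field names and values are made up for illustration and are not taken from the VOM database.

```python
from collections import defaultdict

def usage_totals(records, key):
    """Sum wall time (in seconds) over job records, grouped by the given field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["wall_time"]
    return dict(totals)

# Hypothetical job records, as might be stored on job completion.
records = [
    {"user": "jon", "machine": "Grendel", "wall_time": 120.0},
    {"user": "jon", "machine": "Tuya", "wall_time": 30.0},
    {"user": "anne", "machine": "Grendel", "wall_time": 60.0},
]

print(usage_totals(records, "machine"))
print(usage_totals(records, "user"))
```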
36
Resource Monitor
• Runs GIIS
– Collects information from UoBGrid resources
• Runs Big Brother monitoring software
– Client/Server model
– Server pings registered resources
– Client records local system info and reports to server
37
Big Brother Monitoring System
Web-accessible status page with easy-to-understand functionality for helpdesk and admin staff.
38
Storage Resource Broker
• All UoBGrid users given SRB account
• GSI authentication enabled for Scommands
• Access via certificate
39
[Diagram: a User; the servers VOM, SRB, RB and MDS; the NGS (BDII; Leeds, Man, RAL, Oxford); and the UoB Grid compute resources Grendel (PBS), Tuya (PBS), Monster2 (SGE) and bserv (PBS)]
40
Compute resources running GRIS report to information servers
41
Resource Broker polls information servers and converts MDS information into Condor Class Ads
42
User logs on to Resource Broker to submit job. Jobs are matched to resources using Condor
43
Job details sent to machine by Condor-G
44
Upon completion, output files are sent back to the Resource Broker
45
If a job runs on UoB Grid resources, run details are sent to the VOM DB, for NGS and UoB users alike.
46
Finally, output files are sent from the RB to the Storage Resource Broker
47
Summary
• Condor
– Standalone: high throughput computing system
– Matchmaking with Class Ads
– Condor-G: interface to Globus Toolkit
• Globus Toolkit
– Applications, Protocols, APIs
– GSI – Certificates (DN, public key, digital signature)
• UoBGrid
– Centralised access to disparate resources
– Custom components created to fill functionality gaps
– Globus gives authenticated access to resources
– Condor provides matchmaking (i.e. brokering)
– SRB provides storage
48
Useful URLs
• SRB: http://www.sdsc.edu/srb/
• Globus: http://www.globus.org/
• Condor: http://www.cs.wisc.edu/condor/
• UK eScience CA: https://ca.grid-support.ac.uk/
• NGS: http://www.ngs.ac.uk/
• UoBGrid: http://escience.bris.ac.uk