CONDOR

CISC 879 Parallel Computation
Spring 2003
Preethi Natarajan
Outline
o Condor – Goals & Overview
o Components
o Matchmaking - ClassAds
o RPC in Condor
o Checkpoint/Restart
o Glance @ APIs
Condor – Objectives
o Condor’s goal is to hunt for idle resources that can
be exploited by user applications
o Performance vs. Throughput
o High Performance Computing
o CPU cycles/second under ideal circumstances. “How fast
can I run simulation X on this machine?”
o High Throughput Computing
o CPU cycles/day (week, month, year?) under non-ideal
circumstances. “How many times can I run simulation X in
the next month using all available machines?”
o How much computing power is available to me?
o Condor converts collections of distributively owned
workstations (different platforms) and dedicated
clusters into a distributed high-throughput
computing facility
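The throughput question above ("how many times can I run simulation X in the next month?") can be made concrete with a little arithmetic. The following is only a rough sketch: the function name, the machine counts, speeds and idle fractions are all made-up illustrative assumptions, and it ignores work lost to preemption.

```python
def runs_per_period(run_hours, machines, period_hours):
    """Count completed runs over a period.

    machines is a list of (relative_speed, idle_fraction) pairs; both
    numbers are hypothetical inputs, not measurements.
    """
    total = 0
    for speed, idle_fraction in machines:
        usable_hours = period_hours * idle_fraction
        # Only whole runs count as completed work.
        total += int(usable_hours * speed // run_hours)
    return total

# One dedicated machine that is twice as fast as a desktop workstation.
single_fast_machine = [(2.0, 1.0)]
# The same machine plus 20 desktops that are idle half of the time.
whole_pool = [(2.0, 1.0)] + [(1.0, 0.5)] * 20

month = 30 * 24  # hours
print(runs_per_period(6, single_fast_machine, month))
print(runs_per_period(6, whole_pool, month))
```

Under these assumed numbers the pool of mostly idle desktops contributes several times more completed runs than the single fast machine, which is the point of measuring cycles per month rather than cycles per second.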
Condor - Overview
o Customers advertise their job requirements to Condor – Resource Requests
o Resource owners advertise their resource descriptions – Resource Offers
[Figure: a job submitted at one site is matched by the Condor Central Manager to a resource found appropriate for the job]
o Condor provides
o Matchmaking between jobs and
resources
o Notification of Matches
o Transparent access to job’s files
during execution
o Opportunistic Scheduling –
Schedule resources when there
is an opportunity
o Checkpoint (save) job state when
current resource needs to be
preempted
o Restart job from checkpointed
state in another available
resource
Condor Components
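[Figure: on job submission, the Customer Agent (schedd) sends Resource Requests and the Resource Agent (startd) sends Resource Offers to the Central Manager (Collector, Negotiator, Accountant), which matches them and notifies both agents]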
CUSTOMER AGENT
o Submits Resource Requests
(job requirements) in an
application queue ordered
by a priority scheme
o Implementation is called the
Scheduling daemon – schedd
RESOURCE AGENT
o Periodically extracts
resources’ state information
and updates its Resource
Offers
o Implementation is called the
startd
Condor Components (Cont.)
o CENTRAL MANAGER
o Is the “kernel” of the Condor pool
o Collector - Periodically collects
o Resource Offers from startds
o Resource Requests from schedds
o Negotiator
o Matchmaking between Resource Requests and
Offers
o Notification about the match to the entities of
the matched pair
o Claiming Protocol followed between the
respective Customer and Resource Agents
o Accountant – Logs resource usage by jobs
ClassAds
o A Classified Advertisement (ClassAd) is a flexible and extensible
data model used to represent
  o Resource Offers - Resource services available
  o Resource Requests - Job Requirements
  o Access Policies - Constraints on resource
  allocations & requirements
o A ClassAd is a mapping from attribute names to expressions; the
model defines semantics for evaluating the attributes
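A minimal sketch of the "mapping from attribute names to expressions" idea, assuming a ClassAd can be modeled as a Python dict whose values are either constants or functions of (my ad, other ad). Real ClassAds use their own expression language; the `evaluate` helper and the attribute names Arch, Memory, Owner and ImageSize are illustrative assumptions.

```python
# Hypothetical model: a ClassAd is a dict mapping attribute names to
# constants or to expressions (callables taking my ad and the other ad).

def evaluate(ad, name, other):
    """Evaluate attribute `name` of `ad` against the candidate ad `other`."""
    value = ad.get(name)  # a missing attribute evaluates to None here
    if callable(value):
        return value(ad, other)
    return value

machine_ad = {
    "Type": "Machine",
    "Arch": "INTEL",
    "Memory": 512,  # MB, made-up value
    # Resource Offer constraint: only jobs whose image fits in memory.
    "Requirements": lambda my, other: other.get("ImageSize", 0) <= my["Memory"],
}

job_ad = {
    "Type": "Job",
    "Owner": "alice",
    "ImageSize": 128,  # MB, made-up value
    # Resource Request constraint: only Intel machines.
    "Requirements": lambda my, other: other.get("Arch") == "INTEL",
    # Preference: more memory is better.
    "Rank": lambda my, other: other.get("Memory", 0),
}

print(evaluate(job_ad, "Requirements", machine_ad))
print(evaluate(machine_ad, "Requirements", job_ad))
print(evaluate(job_ad, "Rank", machine_ad))
```

Because both kinds of ad share one data model, the same `evaluate` call works in either direction: the job's expressions are evaluated against the machine and the machine's against the job.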
ClassAds - Access Policies
o Resource access policy specifies
  o Who may use resource
  o How they may use resource
  o When they may use resource
[Figure: Policy Specification Example]
o Access Policy Specification in Condor is done using the
following ClassAd Attributes

Expression Type   Evaluation Semantics for an application
Requirements      True => Application may use resource
Rank              Larger Value => Application is highly preferred over others
Suspend           True => Suspend active application
Continue          True => Unsuspend active application
Vacate            True => Active application notified to stop using the resource
Kill              True => Active application should be immediately stopped
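To illustrate how the Suspend, Continue, Vacate and Kill expressions might drive a resource agent, here is a small sketch. The state field names (`keyboard_idle_s`, `suspended_s`, `vacating_s`), the thresholds, and the order in which the expressions are checked are all assumptions of this example, not Condor's actual policy engine.

```python
# Hypothetical owner policy: each attribute is an expression over the
# machine's current state (field names and thresholds are made up).
policy = {
    "Suspend":  lambda s: s["keyboard_idle_s"] < 60,    # owner is back
    "Continue": lambda s: s["keyboard_idle_s"] >= 300,  # owner left again
    "Vacate":   lambda s: s["suspended_s"] > 600,       # suspended too long
    "Kill":     lambda s: s["vacating_s"] > 120,        # vacate took too long
}

def next_action(policy, state, activity):
    """Decide what to do with the active application.

    activity is "running", "suspended" or "vacating". The precedence
    between expressions is an assumption of this sketch.
    """
    if activity == "vacating":
        return "kill" if policy["Kill"](state) else "wait"
    if activity == "suspended":
        if policy["Vacate"](state):
            return "vacate"
        return "continue" if policy["Continue"](state) else "wait"
    return "suspend" if policy["Suspend"](state) else "run"

owner_typing = {"keyboard_idle_s": 10, "suspended_s": 0, "vacating_s": 0}
print(next_action(policy, owner_typing, "running"))
```

Keeping the policy as data (a dict of expressions) rather than hard-coded logic mirrors the ClassAd approach: each resource owner can state different conditions without changing the agent.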
Matchmaking
o ClassAd Specification
  o ClassAds describing Resource
  Requests and Resource Offers with
  attributes like Type, Rank,
  Requirements, Vacate etc.
o Advertising Protocol
  o Entity periodically communicates the
  ClassAd and “contact address” to the
  Central Manager (Matchmaker)
o Matchmaking Algorithm
  o Matches based on Requirements
  specified in the Resource Requests
  and Offers.
  o Match with the highest Rank is
  selected.
  o Use of past resource usage (log) for
  fair scheduling
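The matchmaking algorithm described above can be sketched as follows: keep only offers where both sides' Requirements hold, pick the one with the highest Rank, and serve requests from lightly served users first. This is a simplified illustration; the ad layout (dicts with callables), the `usage` table, and the "least past usage first" ordering are assumptions, not the Negotiator's real fair-share scheme.

```python
def match(request, offers):
    """Return the best offer for `request`, or None if nothing is compatible."""
    compatible = [
        offer for offer in offers
        if request["Requirements"](request, offer)
        and offer["Requirements"](offer, request)
    ]
    if not compatible:
        return None
    # Among compatible offers, the request's Rank expression decides.
    return max(compatible, key=lambda offer: request["Rank"](request, offer))

def negotiate(requests, offers, usage):
    """Match requests in order of least past usage (assumed fairness rule)."""
    free = list(offers)
    matches = []
    for request in sorted(requests, key=lambda r: usage.get(r["Owner"], 0)):
        offer = match(request, free)
        if offer is not None:
            free.remove(offer)  # each offer is handed out once
            matches.append((request["Name"], offer["Name"]))
    return matches

def fits(my, other):            # machine side: job image must fit in memory
    return other["ImageSize"] <= my["Memory"]

def wants_intel(my, other):     # job side: Intel machines only
    return other["Arch"] == "INTEL"

def more_memory(my, other):     # job side: prefer larger memory
    return other["Memory"]

offers = [
    {"Name": "m1", "Arch": "INTEL", "Memory": 256, "Requirements": fits},
    {"Name": "m2", "Arch": "INTEL", "Memory": 1024, "Requirements": fits},
]
requests = [
    {"Name": "jobA", "Owner": "alice", "ImageSize": 128,
     "Requirements": wants_intel, "Rank": more_memory},
    {"Name": "jobB", "Owner": "bob", "ImageSize": 512,
     "Requirements": wants_intel, "Rank": more_memory},
]
usage = {"alice": 100.0, "bob": 5.0}  # made-up past usage from the log

print(negotiate(requests, offers, usage))
```

With these made-up ads, bob has used less of the pool and is served first, so his large job gets the only machine it fits on; if alice were served first she would take that machine by Rank and bob's job would go unmatched.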
Matchmaking (Cont.)
o Matchmaking Protocol
o Match notified to the two parties
that were matched @ their
“contact address” along with the
matched ClassAd
o (Possible) Authentication via
hand-off of a session-key
o Claiming Protocol
o Match was a mutual introduction
of the 2 parties
o Customer contacts Resource
directly to negotiate regarding
resource allocation
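The two protocols above can be sketched in a few lines: the matchmaker only introduces the parties (handing both a session key), and the customer then contacts the resource directly, which checks the key and its own Requirements before granting the claim. All class names, method names and addresses here are hypothetical, and an in-process dict stands in for the network.

```python
import hmac
import secrets

class ResourceAgent:
    def __init__(self, address, ad):
        self.address, self.ad = address, ad
        self.expected_key = None
        self.claimed_by = None

    def notify_match(self, peer_address, peer_ad, session_key):
        # Matchmaking protocol: remember the key handed off by the matchmaker.
        self.expected_key = session_key

    def request_claim(self, customer_address, job_ad, session_key):
        # Claiming protocol: the resource makes the final decision itself.
        if self.claimed_by is not None or self.expected_key is None:
            return False
        if not hmac.compare_digest(session_key, self.expected_key):
            return False
        if not self.ad["Requirements"](self.ad, job_ad):
            return False
        self.claimed_by = customer_address
        return True

class CustomerAgent:
    def __init__(self, address, job_ad):
        self.address, self.job_ad = address, job_ad
        self.match = None

    def notify_match(self, peer_address, peer_ad, session_key):
        self.match = (peer_address, session_key)

    def claim(self, network):
        peer_address, key = self.match
        return network[peer_address].request_claim(self.address, self.job_ad, key)

def notify(customer, resource):
    """Matchmaker side: introduce both parties and hand off a session key."""
    key = secrets.token_hex(16)
    customer.notify_match(resource.address, resource.ad, key)
    resource.notify_match(customer.address, customer.job_ad, key)

resource = ResourceAgent(
    "startd@hostB",
    {"Memory": 512,
     "Requirements": lambda my, other: other["ImageSize"] <= my["Memory"]},
)
customer = CustomerAgent("schedd@hostA", {"ImageSize": 128})
network = {resource.address: resource}  # stand-in for real connections

notify(customer, resource)
print(customer.claim(network))
```

Letting the resource re-check its Requirements at claim time reflects that a match is only an introduction: the resource's state may have changed since its ad was last sent to the Central Manager.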
After Match Notification…
1. Schedd on the initiating (submit) machine first spawns a shadow
process. The shadow process acts as the shadow of the job that will
be executed on the remote machine
2. Shadow negotiates with Startd of the remote machine to run the job
3. If successful, Startd on the remote machine spawns a Starter, which
  o Starts the remote job by spawning the job’s process
  o Manages the execution of the remote job by communicating with the
  Shadow.
Exploiting RPC
o The remote machine agrees to run the submit machine’s job at its
workstation, but the job’s files are physically located at the
submit machine.
o open(), read(), write() calls in the job’s code are executed at
the submit machine as RPCs
o condor_syscall_lib has to be linked to these jobs
o If files can be accessed via NFS/AFS, that is preferred
over RPC when it is more efficient. The open() routine in
condor_syscall_lib talks with the shadow at the submit machine
and makes this decision
[Figure: the Starter process on the Remote Machine spawns the remote job’s process; a call to open(jobfile1) in the job goes to the Shadow process on the Submit Machine, and ‘jobfile1’ in the local file system is accessed via NFS/AFS or RPC]
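The remote system call idea can be sketched as a pair of objects: stubs on the remote side that turn each file call into a request, and a shadow on the submit side that executes it against the real files. This is only an in-process illustration; the class names, the `handle` dispatch method and the direct function call used as "transport" are assumptions, and the NFS/AFS shortcut is not modeled.

```python
import os
import tempfile

class Shadow:
    """Runs on the submit machine and executes file calls for the job."""
    def __init__(self, root):
        self.root = root        # directory holding the job's files
        self.files = {}         # descriptor -> open file object
        self.next_fd = 3

    def handle(self, call, *args):
        # Dispatch an incoming request to the matching local operation.
        return getattr(self, "do_" + call)(*args)

    def do_open(self, name, mode):
        f = open(os.path.join(self.root, name), mode)
        fd = self.next_fd
        self.next_fd += 1
        self.files[fd] = f
        return fd

    def do_read(self, fd, n):
        return self.files[fd].read(n)

    def do_write(self, fd, data):
        return self.files[fd].write(data)

    def do_close(self, fd):
        self.files.pop(fd).close()
        return 0

class RemoteSyscalls:
    """Stand-in for the stubs linked into the job: every call is forwarded."""
    def __init__(self, send):
        self.send = send  # in a real system this would cross the network

    def open(self, name, mode="r"):
        return self.send("open", name, mode)

    def read(self, fd, n):
        return self.send("read", fd, n)

    def write(self, fd, data):
        return self.send("write", fd, data)

    def close(self, fd):
        return self.send("close", fd)

def demo():
    with tempfile.TemporaryDirectory() as submit_dir:
        shadow = Shadow(submit_dir)
        job_io = RemoteSyscalls(shadow.handle)
        fd = job_io.open("jobfile1", "w")
        job_io.write(fd, "result=42\n")
        job_io.close(fd)
        fd = job_io.open("jobfile1", "r")
        text = job_io.read(fd, 100)
        job_io.close(fd)
        return text

print(demo())
```

The job code only ever sees `open`, `read`, `write` and `close`; where the file really lives is hidden behind the stubs, which is what makes the access transparent.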
Checkpoint
o To checkpoint an executing program is to
take a snapshot of its current state in such
a way that the program can be restarted
from that state at a later time, possibly at a
different resource
o Provides
o Preemptive-Resume scheduling
o Fault Tolerance – when checkpointing is done
periodically
o In Condor, checkpointing running jobs is
optional. If it is needed, source should be
linked with condor_syscall_lib
Checkpointing in Condor
o Implemented in
condor_syscall_lib as a signal
handler
o When Condor sends a signal to
checkpoint, the handler saves
the process’s state information in a
checkpoint file
o From Core - contents of process’s
uarea, data and stack segments
o From Executable – symbol and
debugging info, initialized data,
text
Checkpointing & Restart
o Starter periodically sends a checkpoint signal to the executing job
o condor_syscall_lib makes the job dump core and saves the job state in the
checkpoint file
o Checkpoint file is temporarily stored @ Remote Machine
o Starter transfers the latest checkpoint file to the shadow when the job is vacated
o Shadow sends the latest checkpoint file to the new Starter during
restart
o The starter reads the job state from the checkpoint file and
execution continues
[Figure: on the Remote Machine, the Starter sends a checkpoint signal to the job and code in condor_syscall_lib saves process state information to a checkpoint file; the checkpoint file is transferred to the Shadow process on the Submit Machine (local file system) when the job is vacated, and transferred again when the job is restarted]
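The checkpoint/restart cycle can be illustrated with an application-level sketch. Note the substitution: Condor snapshots the whole process image from a signal handler, whereas this example saves an explicit state dict with `pickle` at fixed intervals. The function names, the file name `job.ckpt` and the toy computation are assumptions made for illustration.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write the state to a temp file, then rename so readers never see half a file."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)

def run_job(state, steps, checkpoint_path, checkpoint_every=10):
    """Advance the toy job by `steps` iterations, checkpointing periodically."""
    for _ in range(steps):
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state

def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "job.ckpt")
        # First machine: the job is preempted after 25 iterations, so the
        # work done since the last checkpoint (iterations 20..24) is lost.
        run_job({"i": 0, "total": 0}, 25, path)
        # Second machine: restart from the latest checkpoint file.
        resumed = load_checkpoint(path)
        restart_at = resumed["i"]
        final = run_job(resumed, 100 - restart_at, path)
        return restart_at, final["total"]

print(demo())
```

The restarted job redoes only the iterations after the last checkpoint and still produces the same final sum as an uninterrupted run, which is the property that makes preemptive-resume scheduling safe.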
CONDOR APIs - Glance
o Compile as a condor job
gcc -c hello.c -o hello.o
condor_compile gcc hello.o -o hello
o Submit a condor job
cat > submit.hello <<EOF
Executable = hello
Universe = standard
Output = hello.out
Log = hello.log
Queue
EOF
condor_submit submit.hello    # creates the Job ClassAd
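To make the link between the submit file and the Job ClassAd more tangible, the sketch below turns the `name = value` lines of a submit description into an attribute dictionary. It is only an illustration of the idea: `parse_submit_description` is a hypothetical helper, it handles just the simple form shown above, and it is not how condor_submit itself is implemented.

```python
def parse_submit_description(text):
    """Collect `name = value` lines into a dict and count Queue statements.

    Attribute names are lower-cased here so lookups are case-insensitive.
    """
    ad = {}
    queued = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.split()[0].lower() == "queue":
            queued += 1
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"cannot parse line: {raw!r}")
        ad[name.strip().lower()] = value.strip()
    return ad, queued

submit_hello = """\
Executable = hello
Universe = standard
Output = hello.out
Log = hello.log
Queue
"""

job_ad, queued = parse_submit_description(submit_hello)
print(job_ad["executable"], job_ad["universe"], queued)
```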
CONDOR APIs (Cont.)
o condor_master – starts other daemons
o condor_vacate – vacate jobs running on
specified hosts
o condor_status – display status of condor pool
o condor_rm – remove a condor job from queue
o More commands @
http://www.cs.wisc.edu/condor/manual/v6.4/
REFERENCES
o Condor Project Home Page
http://www.cs.wisc.edu/condor/
o Research Publications on Condor
http://www.cs.wisc.edu/condor/publications.html