Condor Project

High Throughput Computing with Condor at Purdue
XSEDE ECSS Monthly Symposium
Topics
• What is Condor?
• What is High Throughput Computing?
• Why Condor? Why not Condor?
• Condor at Purdue
• Submitting and managing jobs
• Suitable jobs
What is Condor?
• A product of the University of Wisconsin-Madison
• A job scheduler
• A resource manager
• A workflow management system
• Focused on High Throughput Computing
What is High Throughput Computing (HTC)?
• Large amounts of processing capacity
• Sustained over a long period of time
HTC v. HPC
• FLOPS extracted v. FLOPS
• Distributed Ownership v. Central Ownership
• Capturing Idle Cycles v. Losing Idle Cycles
• Throughput v. Response Time
• Distributed Memory v. Tightly-coupled Memory
• 1,000 Jobs v. 1 Job
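To make the throughput-versus-response-time contrast concrete, here is a small Python sketch. Only the job count comes from the slide; the two-hour job length and the 250-slot pool are illustrative assumptions, and the model ignores queueing delays and evictions.

```python
import math


def makespan_hours(n_jobs: int, hours_per_job: float, slots: int) -> float:
    """Wall-clock hours to finish n_jobs equal-length, independent jobs.

    Jobs run in "waves" of at most `slots` jobs at a time.
    """
    waves = math.ceil(n_jobs / slots)
    return waves * hours_per_job


def throughput_jobs_per_day(n_jobs: int, hours_per_job: float, slots: int) -> float:
    """Completed jobs per 24 hours over the whole run."""
    return n_jobs / makespan_hours(n_jobs, hours_per_job, slots) * 24


# Illustrative numbers (assumptions, not measurements from Purdue's pool).
one_slot = makespan_hours(1000, 2.0, 1)
many_slots = makespan_hours(1000, 2.0, 250)
print(one_slot, many_slots)
print(throughput_jobs_per_day(1000, 2.0, 250))
```

No single job finishes any sooner in the 250-slot case; each still takes two hours. What changes is how many jobs complete per day, which is the quantity HTC optimizes.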
Why Condor?
• Wasted compute cycles
• Scheduling of related jobs
• Access to more cores
Advantages of Condor
• Many tasks running at once
• Access to more powerful computers
• Using wasted cycles
• Minimal impact on remote computers
• Security
• Little or no code modification
Disadvantages of Condor
• Compete for access
• Task may take longer to complete
• Processing can be lost
• Parallel jobs aren’t available
• Large files can impact the remote computer
• Heterogeneity of the remote computers
• Few compatible compilers
Condor at Purdue
• Installed on large cyberinfrastructure clusters
• Installed on distributed desktops
• Used as a scavenger of free cycles
• Parallel jobs not supported
• ~27K Linux cores and 1K Windows cores
• Several more kilocores at DiaGrid partner sites
Condor at Purdue
• Jobs are vacated when a PBS job starts
– Long running jobs may never complete
• Common home directory across clusters
• Scratch directories roughly per-cluster
• ~7 TB of checkpoint storage for standard universe jobs
Job Universes
• Vanilla universe
– Doesn't require a recompile
– No native checkpoint mechanism
• Standard universe
– Streams I/O (can overload the submit node)
– Supports checkpointing
– No fork(), shared memory, pipes
File transfer
• A vanilla universe feature
• Allows jobs to flow to other sites
Compiling for Condor
• A standard universe requirement
• The condor_compile command wraps a limited set of compilers
• Links against Condor libraries to add support for I/O streaming and checkpointing
Checkpointing
• Saves all state information
• Transfers state information to Condor management
• Deletes job from processor
• Restarts interrupted job on another unused processor
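The save, stop, and resume cycle above can be illustrated with an application-level analogue in Python. This is not how the standard universe checkpoints (that happens transparently through the Condor libraries); it is a minimal sketch of the same idea that a vanilla universe job could do for itself. The function name, the JSON checkpoint file, and the `stop_after` eviction switch are all hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoint(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress after every step.

    `stop_after` simulates an eviction after that many steps in this run.
    Returns the final total, or None if the run was interrupted.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        # Resume from the state saved by an earlier, interrupted run.
        with open(checkpoint_path) as f:
            state = json.load(f)

    steps_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # "evicted": the saved state stays on disk
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_this_run += 1
        # Write to a temporary file, then rename, so an eviction in the
        # middle of a write never leaves a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "state.json")
    first = run_with_checkpoint(10, ckpt, stop_after=4)   # interrupted
    second = run_with_checkpoint(10, ckpt)                # resumes at step 4
    print(first, second)
```

The second run does not repeat the four steps already completed, which is what lets a long job survive repeated evictions.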
Job lifecycle
• Job is submitted
• Scheduler process contacts negotiator process
• Negotiator matches job to an available slot
• If no slots are available, scheduler contacts remote negotiator
• Execute node runs job
• If job gets evicted, scheduler process contacts negotiator process again
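The lifecycle above can be traced with a toy Python model. It is a deliberately simplified sketch, not Condor's negotiation protocol: slots are plain dictionary entries, the slot names are invented, and an eviction is modeled as the slot's owner reclaiming it.

```python
def negotiate(local_slots, remote_slots):
    """Claim the first idle slot, trying the local pool before the remote one.

    Each pool is a dict of slot name -> busy flag. Returns (pool, slot) or None.
    """
    for pool_name, slots in (("local", local_slots), ("remote", remote_slots)):
        for name, busy in slots.items():
            if not busy:
                slots[name] = True
                return pool_name, name
    return None


def run_job(local_slots, remote_slots, evictions=0):
    """Return the list of events for one job that is evicted `evictions` times."""
    pools = {"local": local_slots, "remote": remote_slots}
    history = ["submitted"]
    for attempt in range(evictions + 1):
        match = negotiate(local_slots, remote_slots)
        if match is None:
            history.append("idle: no slot available")
            break
        pool, slot = match
        history.append(f"running on {pool}:{slot}")
        if attempt < evictions:
            # The owner reclaimed the slot, so it stays busy and the
            # scheduler has to go back to the negotiator.
            history.append("evicted")
        else:
            pools[pool][slot] = False  # job finished, slot is idle again
            history.append("completed")
    return history


events = run_job({"node-a": False}, {"partner-1": False}, evictions=1)
print(events)
```

With one local slot and one eviction, the job starts locally, loses its slot, and completes on the remote pool.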
Submitting a job
• Create a submit file:
# Simple Condor job file
Executable = bin/simpletest
Arguments = 600
Universe = standard
Log = log/$(Cluster).$(Process).log
Error = log/$(Cluster).$(Process).err
Output = log/$(Cluster).$(Process).out
+TGProject = TG-STA060013N
Queue 10
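The `$(Cluster)` and `$(Process)` macros in that file give every queued job its own log, error, and output file: the cluster ID is assigned once at submit time, and `Queue 10` numbers the processes 0 through 9. The Python sketch below mimics that substitution; the cluster ID 42 is a made-up value, and `expand_names` is an illustrative helper rather than anything Condor provides.

```python
def expand_names(template: str, cluster: int, count: int) -> list:
    """Expand $(Cluster) and $(Process) the way `Queue <count>` would."""
    return [
        template.replace("$(Cluster)", str(cluster)).replace("$(Process)", str(process))
        for process in range(count)  # process numbers start at 0
    ]


# Hypothetical cluster ID; the scheduler picks the real one.
outputs = expand_names("log/$(Cluster).$(Process).out", cluster=42, count=10)
print(outputs[0], outputs[-1])
```

Because all of these paths live under `log/`, it is worth checking that the directory exists before submitting.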
Submitting a job
• With file transfer:
# Simple Condor job file
Executable = bin/process_files.sh
Universe = vanilla
ShouldTransferFiles = if_needed
Transfer_input_files = input.dat
Transfer_output_files = output.png
Log = log/$(Cluster).$(Process).log
+TGProject = TG-STA060013N
Queue
Submitting a job
• Submit the job with the condor_submit command:
condor_submit myjobfile.condor
Managing jobs
• Get all jobs in queue: condor_q
• Get only user's jobs: condor_q user
• Why isn't my job running?
condor_q -better-analyze jobid
• Remove a job: condor_rm jobid
Getting the most cores: Requirements = ...
• Condor tries to be helpful by inserting automatic job requirements
– OpSys
– Arch
– FileSystemDomain
– Memory >= ImageSize
• This sometimes over-constrains jobs
Getting the most cores: Requirements = ...
• The Requirements attribute gives you the flexibility to add or remove execute nodes
• Example: job files are in your home directory
Requirements = regexp("rcac.purdue.edu", FileSystemDomain)
• Example: job executable is a Windows binary
Requirements = (OpSys == "WINNT61")
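The effect of these two expressions can be seen by applying them to a handful of machine descriptions. The Python sketch below is a toy stand-in for ClassAd matching: the three machines and their attribute values are invented for illustration, and each Requirements expression is hand-translated into a Python predicate.

```python
import re

# Hypothetical execute nodes; only the attribute names come from the slides.
MACHINES = [
    {"Name": "node1", "OpSys": "LINUX", "Arch": "X86_64",
     "FileSystemDomain": "rcac.purdue.edu", "Memory": 2048},
    {"Name": "node2", "OpSys": "LINUX", "Arch": "X86_64",
     "FileSystemDomain": "partner.example.org", "Memory": 4096},
    {"Name": "desk1", "OpSys": "WINNT61", "Arch": "INTEL",
     "FileSystemDomain": "desk1.example.org", "Memory": 1024},
]


def matching(machines, requirements):
    """Names of the machines for which the requirements predicate is true."""
    return [m["Name"] for m in machines if requirements(m)]


def home_directory_job(machine):
    # Mirrors: Requirements = regexp("rcac.purdue.edu", FileSystemDomain)
    return re.search(r"rcac\.purdue\.edu", machine["FileSystemDomain"]) is not None


def windows_job(machine):
    # Mirrors: Requirements = (OpSys == "WINNT61")
    return machine["OpSys"] == "WINNT61"


print(matching(MACHINES, home_directory_job))
print(matching(MACHINES, windows_job))
```

Each added clause shrinks the set of eligible nodes, so a requirement is worth keeping only when the job really depends on it.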
A special note about Memory
• Condor sometimes overestimates the memory usage of a job
• Condor advertises each slot's memory as total memory divided by cores, but jobs are not actually memory-constrained
• It’s best to put a dummy memory requirement in the submission file
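Why a dummy memory requirement helps can be sketched with a simplified Python model. It assumes, as the advice above implies, that Condor skips an automatic clause whenever the user's Requirements expression already mentions that attribute. The clause texts, the default values, and the substring test are illustrative simplifications, not the exact logic of condor_submit.

```python
# Illustrative defaults; the real values depend on the submit machine and job.
AUTO_CLAUSES = {
    "OpSys": '(OpSys == "LINUX")',
    "Arch": '(Arch == "X86_64")',
    "Memory": "(Memory >= ImageSize)",
}


def effective_requirements(user_expr: str) -> str:
    """Combine the user's expression with any automatic clauses still needed."""
    clauses = [f"({user_expr})"] if user_expr else []
    mentioned = user_expr.lower()
    for attribute, clause in AUTO_CLAUSES.items():
        # Crude substring check: an attribute the user already references
        # does not get an automatic clause appended.
        if attribute.lower() not in mentioned:
            clauses.append(clause)
    return " && ".join(clauses)


print(effective_requirements(""))
print(effective_requirements("Memory > 0"))
```

In this model the dummy clause `Memory > 0` is true on every node, yet it displaces the automatic `Memory >= ImageSize` test, so an overestimated image size no longer excludes machines.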
Getting the most out of your cores: Rank = ...
• You can prefer that a job land on particular nodes
• Example: prefer 64-bit nodes with lots of memory
Rank = (Arch == "X86_64") * 1000 + Memory
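To see how that Rank expression orders candidate nodes, the Python sketch below evaluates it for three invented machines. A true comparison counts as 1 and a false one as 0, as in the expression above; the machine names and memory sizes (in MB) are assumptions made for the example.

```python
# Hypothetical candidate nodes that already satisfy the job's Requirements.
CANDIDATES = [
    {"Name": "a", "Arch": "X86_64", "Memory": 2048},
    {"Name": "b", "Arch": "INTEL", "Memory": 3500},
    {"Name": "c", "Arch": "X86_64", "Memory": 512},
]


def rank(machine) -> int:
    """Mirrors: Rank = (Arch == "X86_64") * 1000 + Memory."""
    return (machine["Arch"] == "X86_64") * 1000 + machine["Memory"]


# Higher rank is preferred, so sort in descending order.
preferred = sorted(CANDIDATES, key=rank, reverse=True)
print([(m["Name"], rank(m)) for m in preferred])
```

Rank is a preference, not a filter: every candidate stays eligible. In this example the 32-bit node with 3500 MB outranks a 64-bit node with 2048 MB, because the 1000-point bonus is smaller than the memory gap. A larger multiplier would be needed if 64-bit should always win.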
Workflow management with DAGMan
• Directed Acyclic Graph Manager
• Defines parent-child relationships among jobs
• Allows pre- and post-execution hooks
• Submit with condor_submit_dag
Diamond DAG
[Figure: diamond-shaped DAG. A is the parent of B1 and B2; B1 and B2 are both parents of C.]
Diamond DAG
# Diamond-shaped DAG
Job First p_00060.A.sub
Job Second_1 p_00060.B1.sub
Job Second_2 p_00060.B2.sub
Job Third p_00060.C.sub
PARENT First CHILD Second_1 Second_2
PARENT Second_1 Second_2 CHILD Third
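The order that file implies can be checked with a short Python sketch. It parses only the `Job` and `PARENT ... CHILD` lines shown above (not the full DAG file syntax) and uses the standard library's `graphlib` (Python 3.9 or later) to group jobs into waves that could run at the same time. The helper names are illustrative.

```python
from graphlib import TopologicalSorter

DAG_TEXT = """\
# Diamond-shaped DAG
Job First p_00060.A.sub
Job Second_1 p_00060.B1.sub
Job Second_2 p_00060.B2.sub
Job Third p_00060.C.sub
PARENT First CHILD Second_1 Second_2
PARENT Second_1 Second_2 CHILD Third
"""


def parse_dependencies(text):
    """Map each job name to the set of jobs that must finish before it."""
    deps = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            deps.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            parents, children = words[1:split], words[split + 1:]
            for child in children:
                deps.setdefault(child, set()).update(parents)
    return deps


def execution_waves(deps):
    """Group jobs into waves; all jobs in a wave may run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(parse_dependencies(DAG_TEXT))
print(waves)
```

The two middle jobs land in the same wave, which is the parallelism the diamond shape is meant to expose.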
More complex DAGs
Who Benefits from Condor?
• Monte Carlo simulations
• Parameter sweeps
• “Embarrassingly parallel” jobs
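A parameter sweep maps naturally onto a submit file with one queued job per value. The Python sketch below generates such a file as text, reusing the log, output, and error naming from the earlier submit file examples. The executable path `bin/simulate` and the parameter values are hypothetical, and the project attribute from the earlier examples is left out.

```python
def sweep_submit_file(executable: str, values) -> str:
    """Build submit file text that queues one vanilla job per parameter value."""
    lines = [
        "# Parameter sweep: one job per value",
        f"Executable = {executable}",
        "Universe = vanilla",
        "Log = log/$(Cluster).$(Process).log",
        "Output = log/$(Cluster).$(Process).out",
        "Error = log/$(Cluster).$(Process).err",
    ]
    for value in values:
        # Each Queue statement picks up the most recent Arguments line.
        lines.append(f"Arguments = {value}")
        lines.append("Queue")
    return "\n".join(lines) + "\n"


# Hypothetical executable and sweep values.
submit_text = sweep_submit_file("bin/simulate", [0.1, 0.2, 0.3])
print(submit_text)
```

Writing the returned text to a file and passing that file to condor_submit would queue three independent jobs, each free to run on a different core.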
Purdue’s Condor Users
• Structural Biology
• Education
• Chemical Engineering
• Bioinformatics
• Climate Visualization
• Distributed Rendering
• High Energy Physics
For more information
• University of Wisconsin website:
– http://research.cs.wisc.edu/condor
• Email:
– bcotton@purdue.edu
– rcac-help@purdue.edu