Tips For Deploying Large Pools
Zach Miller
Computer Sciences Department
University of Wisconsin-Madison
zmiller@cs.wisc.edu
http://www.cs.wisc.edu/condor

Overview
When supporting pools of hundreds or thousands of machines, there are some potentially tricky issues that can come up. Here I'll address a few of them and talk about some solutions and workarounds.

Scalability Questions
› How many jobs can I submit at once?
› How many machines can I have in my pool?
› Does it matter how long my jobs run?
› What other factors impact scalability?

Job Queue
› The condor_schedd can be one of the major bottlenecks in a Condor system.
› One schedd *can* hold 50,000 jobs (or perhaps more), but it becomes painful to use and can bring the throughput of your pool way down.

Why?
› Besides consuming an enormous amount of memory and disk, having lots of jobs in the queue increases the time it takes to match jobs.
› The condor_schedd is single-threaded.
› So, while running condor_q and waiting for 10,000 jobs to be listed, the schedd can't be doing other things, like actually starting jobs (spawning shadows).
› It also cannot talk to the negotiator to match new jobs... which can cause the negotiator to time out waiting!

Job Queue
› One option is to use DAGMan to throttle the number of submitted jobs. Add all your jobs to a DAG (even if there are no dependencies) and then do:
    condor_submit_dag -maxjobs 200
› DAGMan will then never allow more than 200 jobs from this batch to be submitted at once.

DAGMan
› DAGMan also provides the nice ability to retry jobs that fail.
› Each DAGMan batch is independent of any others, i.e.
the maxjobs limit applies only to that particular batch of jobs.
› You can add a delay between submissions of jobs, so the condor_schedd isn't swamped, using:
    DAGMAN_SUBMIT_DELAY = 5

Other Small Time Savers When Submitting
› In the submit file:
    COPY_TO_SPOOL = FALSE
› In the condor_config file:
    SUBMIT_SKIP_FILECHECK = TRUE

File Transfer
› If you are using Condor's file transfer mechanism and you are also using encryption, the overhead can be significant.
› Condor 6.7 allows per-file specification of whether or not to use encryption.

Per-File Encryption
    . . .
    Transfer_input_files = big_tarball.tgz, sec.key
    Encrypt_input_files = sec.key
    Dont_encrypt_input_files = big_tarball.tgz
    . . .

Per-File Encryption
Wildcards also work:
    . . .
    Transfer_input_files = big_tarball.tgz, sec.key
    Encrypt_input_files = *.key
    Dont_encrypt_input_files = *.tgz
    . . .

Job Queue
› My machine is running 800 jobs and the load is too high!! How can I throttle this?
› Use MAX_JOBS_RUNNING in the condor_config file. By default, this is set to 300. (You may actually wish to increase this if your submit machine can handle it.)
› This controls how many shadows the schedd will spawn.

Pool Size
› Some of the largest known Condor pools are over 4000 nodes.
› Some have 1 VM per actual CPU, and some have multiple VMs per CPU.

Central Manager
› If you have a lot of machines sending updates to your central manager, it is possible you are losing some of the periodic updates.
› You can determine if this is the case using the COLLECTOR_DAEMON_STATS feature...

Keeping Update Stats
    COLLECTOR_DAEMON_STATS = True
    COLLECTOR_DAEMON_HISTORY_SIZE = 128

    % condor_status -l | grep Updates
    UpdatesTotal = 57200
    UpdatesSequenced = 57199
    UpdatesLost = 2
    UpdatesHistory = "0x00000000800000000000000000000000"

If Your Network Is Swamped
› You can make many different intervals longer:
    UPDATE_INTERVAL = 300
    SCHEDD_INTERVAL = 300
    MASTER_UPDATE_INTERVAL = 300
    ALIVE_INTERVAL = 300

Negotiation
› Normally, the negotiator considers each job separately:
    Can I run this job? No...
    Can I run this job? No...
    Can I run this job? No...
    Etc...

Negotiation
From the negotiator log:
    10/12 09:08:45 Request 00463.00000:
    10/12 09:08:45 Rejected 463.0 zmiller@cs.wisc.edu <128.105.166.24:37845>: no match found
    10/12 09:08:45 Request 00464.00000:
    10/12 09:08:45 Rejected 464.0 zmiller@cs.wisc.edu <128.105.166.24:37845>: no match found
    10/12 09:08:45 Request 00465.00000:
    10/12 09:08:45 Rejected 465.0 zmiller@cs.wisc.edu <128.105.166.24:37845>: no match found

Negotiation
› This process can be greatly sped up if you know which attributes are important to each job:
    SIGNIFICANT_ATTRIBUTES = Owner,Cmd
› Then, once a job is rejected, any more jobs of the same "class" can be skipped immediately.
› The less time the schedd spends talking to the negotiator, the better.

Job Length
› The length of your jobs matters!
› There is overhead in scheduling a job, moving the data, starting shadows and starters, etc.
› Jobs that run just a few seconds incur way more overhead than they do work!

Job Length
› So if your jobs are too short, the schedd basically cannot keep up with keeping the pool busy.
› There is of course no exact formula for how long they should run, but longer-running jobs usually get better overall throughput (assuming no evictions!)
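To make the job-length tradeoff concrete, here is a back-of-the-envelope sketch. The 5-second per-job overhead is a hypothetical figure chosen for illustration, not a measured Condor number:

```shell
# Estimate what fraction of wall-clock time goes to useful work,
# assuming a fixed per-job overhead (scheduling, data transfer,
# spawning shadows/starters).  All numbers here are hypothetical.
overhead=5   # seconds of per-job overhead (assumption)
for runtime in 10 60 600 3600; do
    eff=$(awk -v o="$overhead" -v r="$runtime" \
        'BEGIN { printf "%.1f", 100 * r / (r + o) }')
    echo "runtime ${runtime}s -> ${eff}% of time is useful work"
done
```

Under this assumption, 10-second jobs waste a third of the pool's time on overhead, while hour-long jobs spend well over 99% of their time doing real work.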
Other Factors
› If you have many jobs running on a single submit host, you may want to increase some of your resource limits.
› On Linux (and other systems, I'm sure) there are both system-wide and per-process limits on the number of file descriptors (FDs).

Resource Limits
› System-wide: edit /etc/sysctl.conf:
    # increase system fd limit
    fs.file-max = 32768
› Or:
    echo 32768 > /proc/sys/fs/file-max

Resource Limits
› Per-process: edit /etc/security/limits.conf
› Or:
    su - root
    ulimit -n 16384          # for sh
    limit descriptors 16384  # for csh
    su - your_user_name
    <run job here>

Port Ranges
› Default range is 1024 to 4999.
› Again, in /etc/sysctl.conf:
    # increase system IP port limits
    net.ipv4.ip_local_port_range = 1024 65535
› Or:
    echo 1024 65535 > /proc/sys/net/ipv4/ip_local_port_range

Complex Problem
› Exactly how much work a system can do is a fairly complex problem, since you are dealing with many types of resources (CPU, disk, network I/O).
› Some experimentation is necessary.

Questions?
Thank You!
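As a follow-up to the resource-limit and port-range slides, here is a quick sketch for verifying the current settings on a running Linux submit node. The /proc paths are the standard Linux locations; this will not work on other operating systems:

```shell
# Inspect current file-descriptor limits and the ephemeral port
# range on a Linux host (reads the same /proc files the slides set).
echo "system-wide fd limit: $(cat /proc/sys/fs/file-max)"
echo "per-process fd limit: $(ulimit -n)"
echo "ephemeral port range: $(cat /proc/sys/net/ipv4/ip_local_port_range)"
```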