Using Condor - Initiative for Computational Economics

Using Condor

An Introduction

ICE 2008 http://www.cs.wisc.edu/condor

The Condor Project

(Established ’85)

Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.

Definitions

› Job

 The Condor representation of your work

› Machine

 The Condor representation of computers that can perform the work

› Match Making

 Matching a job with a machine (“resource”)

Job

Jobs state their requirements and preferences:

I need a Linux/x86 platform

I need a machine with at least 500 MB of memory

I prefer a machine with more memory

Machine

Machines state their requirements and preferences:

Run jobs only when there is no keyboard activity

I prefer to run Frieda’s jobs

I am a machine in the econ department

Never run jobs belonging to Dr. Smith

The Magic of Matchmaking

› Jobs and machines state their requirements and preferences

› Condor matches jobs with machines based on requirements and preferences
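
The matching idea can be illustrated with a small Python sketch. This is not Condor's ClassAd mechanism: the dictionaries, the `match` function, and the machine names are hypothetical, and requirements are modeled as plain Python predicates.

```python
# Hypothetical sketch of two-sided matchmaking; not Condor's ClassAd engine.
jobs = [
    {"name": "job1", "owner": "frieda",
     # "I need a Linux machine with at least 500 MB"
     "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 500,
     # "I prefer a machine with more memory"
     "rank": lambda m: m["memory_mb"]},
]
machines = [
    {"name": "econ1", "os": "LINUX", "memory_mb": 512,
     # "Never run jobs belonging to Dr. Smith"
     "requirements": lambda j: j["owner"] != "smith"},
    {"name": "econ2", "os": "LINUX", "memory_mb": 2048,
     "requirements": lambda j: j["owner"] != "smith"},
    {"name": "win1", "os": "WINDOWS", "memory_mb": 4096,
     "requirements": lambda j: True},
]

def match(job, machines):
    """Return the machine the job ranks highest among mutually acceptable ones."""
    candidates = [m for m in machines
                  if job["requirements"](m) and m["requirements"](job)]
    return max(candidates, key=job["rank"], default=None)

best = match(jobs[0], machines)
print(best["name"] if best else "no match")
```

Both sides must accept each other before preferences are considered, which is why the Windows machine is never chosen here even though it has the most memory.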

Getting Started:

Submitting Jobs to Condor

› Overview:

 Choose a “Universe” for your job

 Make your job “batch-ready”

 Create a submit description file

 Run condor_submit to put your job in the queue

1. Choose the “Universe”

› Controls how Condor handles jobs

› Choices include:

 Vanilla

 Standard

 Grid

 Java

 Parallel

 VM

Using the Vanilla Universe

• The Vanilla Universe:

– Allows running almost any “serial” job

– Provides automatic file transfer, etc.

– Like vanilla ice cream

• Can be used in just about any situation

2. Make your job batch-ready

Must be able to run in the background

• No interactive input

• No GUI/window clicks

• No music ;^)

Make your job batch-ready

(continued)…

 Job can still use STDIN, STDOUT, and STDERR (the keyboard and the screen), but files are used for these instead of the actual devices

 Similar to UNIX shell:

• $ ./myprogram <input.txt >output.txt
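
The same redirection can be sketched in Python. This is only an illustration of what "batch-ready" means, not how Condor launches jobs; `run_batch` and the file names are hypothetical, and a Python one-liner stands in for `./myprogram`.

```python
import os
import subprocess
import sys
import tempfile

def run_batch(command, input_path, output_path, error_path):
    """Run command with stdin/stdout/stderr bound to files instead of a terminal."""
    with open(input_path, "rb") as fin, \
         open(output_path, "wb") as fout, \
         open(error_path, "wb") as ferr:
        return subprocess.run(command, stdin=fin, stdout=fout, stderr=ferr).returncode

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")
out_path = os.path.join(workdir, "output.txt")
err_path = os.path.join(workdir, "error.txt")
with open(in_path, "w") as f:
    f.write("hello\n")

# Stand-in for ./myprogram: upper-cases whatever arrives on standard input.
program = [sys.executable, "-c",
           "import sys; sys.stdout.write(sys.stdin.read().upper())"]
status = run_batch(program, in_path, out_path, err_path)
print("exit status:", status)
```

A program that works this way needs no keyboard or screen, so it can run unattended on any machine.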


3. Create a Submit Description File

› A plain ASCII text file

› Condor does not care about file extensions

› Tells Condor about your job:

 Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)

› Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.


Simple Submit Description File

# Simple condor_submit input file

# (Lines beginning with # are comments)

# NOTE: the words on the left side are not

# case sensitive, but filenames are!

Universe = vanilla

Executable = my_job

Output = output.txt

Queue

4. Run condor_submit

› You give condor_submit the name of the submit file you have created:

 condor_submit my_job.submit

› condor_submit:

 Parses the submit file, checks for errors

 Creates a “ClassAd” that describes your job(s)

 Puts job(s) in the Job Queue
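
To make the parsing step concrete, here is a much-simplified Python sketch. It is not how condor_submit works internally and does not produce a real ClassAd: `parse_submit` is a hypothetical helper that only handles `key = value` lines, comments, and bare `Queue` commands.

```python
SIMPLE = """\
# Simple condor_submit input file
Universe   = vanilla
Executable = my_job
Output     = output.txt
Queue
"""

def parse_submit(text):
    """Return (settings, number_of_queue_commands) for a simple submit file."""
    settings, queued = {}, 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.lower().split()[0] == "queue":
            queued += 1                   # simplification: one job per Queue line
            continue
        key, _, value = line.partition("=")
        # Keys are not case sensitive, so normalize them; values keep their case.
        settings[key.strip().lower()] = value.strip()
    return settings, queued

settings, queued = parse_submit(SIMPLE)
print(settings, queued)
```

Lower-casing only the keys mirrors the note in the submit file comments: the words on the left are not case sensitive, but filenames are.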

The Job Queue

› condor_submit sends your job’s ClassAd(s) to the schedd

› The schedd (more details later):

 Manages the local job queue

 Stores the job in the job queue

• Atomic operation, two-phase commit

• “Like money in the bank”

› View the queue with condor_q

Example condor_submit and condor_q

% condor_submit my_job.submit

Submitting job(s).

1 job(s) submitted to cluster 1.

% condor_q

-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

1.0 frieda 6/16 06:52 0+00:00:00 I 0 0.0 my_job

1 jobs; 1 idle, 0 running, 0 held

%

Input, output & error files

› Controlled by submit file settings

› You can define the job’s standard input, standard output and standard error:

Read job’s standard input from “input_file”:

• Input = input_file

• Shell equivalent: program <input_file

Write job’s standard output to “output_file”:

• Output = output_file

• Shell equivalent: program >output_file

Write job’s standard error to “error_file”:

• Error = error_file

• Shell equivalent: program 2>error_file

Email about your job

• Condor sends email about job events to the submitting user

• Specify “notification” in your submit file to control which events:

Notification = complete (default)

Notification = never

Notification = error

Notification = always

Feedback on your job

› Create a log of job events

› Add to submit description file: log = sim.log

› Becomes the Life Story of a Job

 Shows all events in the life of a job

 Always have a log file

Sample Condor User Log

000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
        (1) Normal termination (return value 0)
...
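
A log in this shape is easy to scan with a script. The Python sketch below assumes each event starts with a three-digit code, a (cluster.process.subprocess) triple, a date, and a time, as in the sample; `parse_user_log` and its regular expression are hypothetical and cover only what the sample shows.

```python
import re

# One event header line: code, (cluster.proc.subproc), date, time, message.
EVENT = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d) (\d\d:\d\d:\d\d) (.*)$")

def parse_user_log(text):
    """Return one dict per event header; detail and '...' lines are skipped."""
    events = []
    for line in text.splitlines():
        m = EVENT.match(line)
        if m:
            code, cluster, proc, _subproc, date, time, message = m.groups()
            events.append({"code": code,
                           "job": f"{int(cluster)}.{int(proc)}",
                           "date": date, "time": time, "message": message})
    return events

sample = """\
000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
        (1) Normal termination (return value 0)
...
"""
events = parse_user_log(sample)
for e in events:
    print(e["code"], e["job"], e["time"], e["message"])
```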

Example Submit Description File With Logging

# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
# case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/frieda/condor/my_job.condor
# Job log (from Condor)
Log        = my_job.log
# Program's standard input
Input      = my_job.in
# Program's standard output
Output     = my_job.out
# Program's standard error
Error      = my_job.err
# Command line arguments
Arguments  = -a1 -a2
InitialDir = /home/frieda/condor/run
Queue

Let’s run a job

› First, need a terminal emulator

 http://www.putty.org

• (or similar)

› Log in to chopin.cs.wisc.edu as cguserXX, with the given password

› source /scratch/ice08

Logged In?

› condor_q

› condor_status

Create submit file

› nano submit

• universe = vanilla

• executable = /bin/echo

• Arguments = hello world

• Should_transfer_files = always

• When_to_transfer_output = on_exit

• Output = out

• Log = log

• queue

And submit it…

› condor_submit submit

› (wait… remember the HTC bit?)

› condor_q xx

› cat out

“Clusters” and “Processes”

› If your submit file describes multiple jobs, we call this a “cluster”

› Each cluster has a unique “cluster number”

› Each job in a cluster is called a “process”

 Process numbers always start at zero

› A Condor “Job ID” is the cluster number, a period, and the process number (e.g., 2.1)

 A cluster can have a single process

• Job ID = 20.0 (cluster 20, process 0)

 Or, a cluster can have more than one process

• Job IDs: 21.0, 21.1, 21.2 (cluster 21, processes 0, 1, 2)
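
The numbering scheme can be written down in a few lines of Python. The helper names `parse_job_id` and `job_ids` are hypothetical, introduced only to illustrate the "cluster.process" convention.

```python
def parse_job_id(job_id):
    """Split a job ID such as '21.2' into (cluster, process) integers."""
    cluster, _, process = job_id.partition(".")
    return int(cluster), int(process)

def job_ids(cluster, count):
    """List the job IDs of a cluster; process numbers always start at zero."""
    return [f"{cluster}.{process}" for process in range(count)]

print(parse_job_id("21.2"))
print(job_ids(21, 3))
```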

Submit File for a Cluster

# Example submit file for a cluster of 2 jobs
# with separate input, output, error and log files
Universe   = vanilla
Executable = my_job

# Job 2.0 (cluster 2, process 0)
Arguments  = -x 0
Log        = my_job_0.log
Input      = my_job_0.in
Output     = my_job_0.out
Error      = my_job_0.err
Queue

# Job 2.1 (cluster 2, process 1)
Arguments  = -x 1
Log        = my_job_1.log
Input      = my_job_1.in
Output     = my_job_1.out
Error      = my_job_1.err
Queue

Submitting The Job

% condor_submit my_job.submit-file
Submitting job(s).
2 job(s) submitted to cluster 2.
% condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027> :
 ID    OWNER   SUBMITTED    RUN_TIME   ST PRI SIZE CMD
 1.0   frieda  4/15 06:52   0+00:02:11 R  0   0.0  my_job -a1 -a2
 2.0   frieda  4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 0
 2.1   frieda  4/15 06:56   0+00:00:00 I  0   0.0  my_job -x 1
3 jobs; 2 idle, 1 running, 0 held
%

Organize your files and directories for big runs

› Create subdirectories for each “run”

 run_0 , run_1 , … run_599

› Create input files in each of these

 run_0/simulation.in

 run_1/simulation.in

 …

 run_599/simulation.in

› The output, error & log files for each job will be created by Condor from your job’s output
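
Creating 600 directories by hand is tedious, so a short script helps. The Python sketch below is one way to do it; `make_runs` is a hypothetical helper, and the `seed = N` file contents are a made-up placeholder for whatever your simulation actually reads.

```python
import os
import tempfile

def make_runs(base_dir, count, make_input):
    """Create run_0 .. run_(count-1) under base_dir, each with a simulation.in."""
    paths = []
    for i in range(count):
        run_dir = os.path.join(base_dir, f"run_{i}")
        os.makedirs(run_dir, exist_ok=True)
        path = os.path.join(run_dir, "simulation.in")
        with open(path, "w") as f:
            f.write(make_input(i))
        paths.append(path)
    return paths

# A temporary directory keeps the example from touching real data.
base = tempfile.mkdtemp()
paths = make_runs(base, 600, lambda i: f"seed = {i}\n")  # placeholder input
print(len(paths), "input files created under", base)
```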

Submit Description File for 600 Jobs

# Cluster of 600 jobs with different directories
Universe   = vanilla
Executable = sim
Log        = simulation.log
...

# Job 3.0 (Cluster 3, Process 0)
# Log, input, output & error files -> run_0
Arguments  = -x 0
InitialDir = run_0
Queue

# Job 3.1 (Cluster 3, Process 1)
# Log, input, output & error files -> run_1
Arguments  = -x 1
InitialDir = run_1
Queue

# Do this 598 more times…

Submit File for a Big Cluster of Jobs

› We just submitted 1 cluster with 600 processes

› All the input/output files will be in different directories

› The submit file is pretty unwieldy (over 1200 lines)

› Isn’t there a better way?


Submit File for a Big Cluster of Jobs (the better way) #1

› We can queue all 600 in 1 “Queue” command

 Queue 600

› Condor provides $(Process) and $(Cluster)

 $(Process) will be expanded to the process number for each job in the cluster

• 0, 1, … 599

 $(Cluster) will be expanded to the cluster number

• Will be 4 for all jobs in this cluster

Submit File for a Big Cluster of Jobs (the better way) #2

› The initial directory for each job can be specified using $(Process)

 InitialDir = run_$(Process)

 Condor will expand these to “run_0”, “run_1”, … “run_599” directories

› Similarly, arguments can be variable

 Arguments = -x $(Process)

 Condor will expand these to “-x 0”, “-x 1”, … “-x 599”
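
The expansion can be mimicked in Python to see what each job receives. This is a sketch of the substitution idea only, not Condor's macro processor; `expand` is a hypothetical function that knows just the two macros named above and treats their names as case-insensitive.

```python
import re

def expand(template, cluster, process):
    """Replace $(Cluster) and $(Process) in template; leave other macros alone."""
    values = {"cluster": str(cluster), "process": str(process)}
    return re.sub(r"\$\((\w+)\)",
                  lambda m: values.get(m.group(1).lower(), m.group(0)),
                  template)

# What the first three jobs of cluster 4 would see.
for process in range(3):
    print(expand("InitialDir = run_$(Process)", 4, process),
          "|", expand("Arguments = -x $(Process)", 4, process))
```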

Better Submit File for 600 Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different directories
Universe   = vanilla
Executable = my_job
Log        = my_job.log
Input      = my_job.in
Output     = my_job.out
Error      = my_job.err
# -x 0, -x 1, … -x 599
Arguments  = -x $(Process)
# run_0 … run_599
InitialDir = run_$(Process)
# Jobs 4.0 … 4.599
Queue 600

Now, we submit it…

$ condor_submit my_job.submit

Submitting job(s)

......................................................

......................................................

......................................................

......................................................

.......................................

Logging submit event(s)

......................................................

......................................................

......................................................

......................................................

.......................................

600 job(s) submitted to cluster 4.


And, Check the queue

$ condor_q

-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

4.0 frieda 4/20 12:08 0+00:00:05 R 0 9.8 my_job -arg1 -x 0

4.1 frieda 4/20 12:08 0+00:00:03 I 0 9.8 my_job -arg1 -x 1

4.2 frieda 4/20 12:08 0+00:00:01 I 0 9.8 my_job -arg1 -x 2

4.3 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 -x 3

...

4.598 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 -x 598

4.599 frieda 4/20 12:08 0+00:00:00 I 0 9.8 my_job -arg1 -x 599

600 jobs; 599 idle, 1 running, 0 held

Removing jobs

› If you want to remove a job from the Condor queue, you use condor_rm

› You can only remove jobs that you own

› A privileged user can remove any job

 “root” on UNIX

 “administrator” on Windows

Removing jobs (continued)

› Remove an entire cluster:

 condor_rm 4 (removes the whole cluster)

› Remove a specific job from a cluster:

 condor_rm 4.0 (removes a single job)

› Or, remove all of your jobs with “-a”:

 condor_rm -a (removes all jobs / clusters)

Submit cluster of 10 jobs

› nano submit

• universe = vanilla

• executable = /bin/echo

• Arguments = hello world $(PROCESS)

• Should_transfer_files = always

• When_to_transfer_output = on_exit

• Output = out.$(PROCESS)

• Log = log

• Queue 10

And submit it…

› condor_submit submit

› (wait…)

› condor_q xx

› cat log

› cat out.yy

My new jobs run for 20 days…

› What happens when a job is forced off its CPU?

 Preempted by higher priority user or job

 Vacated because of user activity

› How can I add fault tolerance to my jobs?


Condor’s Standard Universe to the rescue!

› Support for transparent process checkpoint and restart

› Remote system calls (remote I/O)

 Your job can read / write files as if they were local

Remote System Calls in the Standard Universe

› I/O system calls are trapped and sent back to the submit machine

Examples: open a file, write to a file

› No source code changes typically required

› Programming language independent

Process Checkpointing in the Standard Universe

› Condor’s process checkpointing provides a mechanism to automatically save the state of a job

› The process can then be restarted from right where it was checkpointed

 After preemption, crash, etc.


Checkpointing: Process Starts

checkpoint: the entire state of a program, saved in a file

 CPU registers, memory image, I/O

[Figure: the running process drawn along a time axis]

Checkpointing: Process Checkpointed

[Figure: checkpoints 1, 2, and 3 taken along the time axis]

Checkpointing: Process Killed

[Figure: the process is killed after checkpoint 3]

Checkpointing: Process Resumed

[Figure: the process resumes from checkpoint 3; the timeline is labeled goodput, badput, goodput]
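
The goodput/badput split can be worked through with a little arithmetic. The Python sketch below assumes the usual reading of the figure: work saved by the last checkpoint before the kill counts as goodput, and work done after that checkpoint is lost (badput). The function name and the minute values are hypothetical.

```python
def goodput_badput(checkpoint_times, kill_time):
    """Return (goodput, badput) for a job killed at kill_time.

    Times are measured from job start, in the same unit throughout.
    """
    # Only checkpoints taken before the kill can be used to restart.
    saved = max((t for t in checkpoint_times if t <= kill_time), default=0)
    return saved, kill_time - saved

# Checkpoints at 60, 120, and 180 minutes; job killed at 200 minutes.
good, bad = goodput_badput([60, 120, 180], 200)
print("goodput:", good, "badput:", bad)
```

Without any checkpoint, everything done before the kill is badput, which is the case the Standard Universe is meant to avoid.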

When will Condor checkpoint your job?

› Periodically, if desired

 For fault tolerance

› When your job is preempted by a higher priority job

› When your job is vacated because the execution machine becomes busy

› When you explicitly run the condor_checkpoint, condor_vacate, condor_off, or condor_restart command

Making the Standard Universe Work

› The job must be relinked with Condor’s standard universe support library

› To relink, place condor_compile in front of the command used to link the job:

% condor_compile gcc -o myjob myjob.c

- OR -

% condor_compile f77 -o myjob filea.f fileb.f

- OR -

% condor_compile make -f MyMakefile

Limitations of the Standard Universe

› Condor’s checkpointing is not at the kernel level.

 In the Standard Universe, the job may not:

• fork()

• Use kernel threads

• Use some forms of IPC, such as pipes and shared memory

› Must have access to source code to relink

› Many typical scientific jobs are OK

Submitting a Standard Universe job

/* foo.c: a do-nothing loop used as a test job */
#include <stdio.h>

int main(int argc, char **argv) {
    int i;
    for (i = 0; i < 10000000; i++) {
        /* busy work; nothing to compute */
    }
    return 0;
}

And submit…

› condor_compile gcc -o foo foo.c

› condor_submit

My jobs have dependencies…

Can Condor help solve my dependency problems?


Condor Universes: Scheduler and Local

› Scheduler Universe

 Plug in a meta-scheduler

 Developed for DAGMan (more later)

 Similar to Globus’s fork job manager

› Local

 Very similar to vanilla, but jobs run on the local host

 Has more control over jobs than scheduler universe

Frieda learns DAGMan

› Directed Acyclic Graph Manager

› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

› (e.g., “Don’t run job ‘B’ until job ‘A’ has completed successfully.”)

What is a DAG?

› A DAG is the data structure used by DAGMan to represent these dependencies.

› Each job is a “node” in the DAG.

› Each node can have any number of “parent” or “children” nodes, as long as there are no loops!

[Figure: a DAG with nodes Job A, Job B, Job C, and Job D]

Defining a DAG

› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

› Each node will run the Condor job specified by its accompanying Condor submit file

[Figure: diamond-shaped DAG with Job A at the top, Job B and Job C in the middle, and Job D at the bottom]
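
To see how the Parent/Child lines constrain the order of execution, the diamond file can be read by a small Python sketch. This is not DAGMan's parser: `parse_dag` and `run_order` are hypothetical helpers that understand only the `Job` and `Parent ... Child ...` lines shown above.

```python
from collections import defaultdict, deque

DIAMOND = """\
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""

def parse_dag(text):
    """Return (jobs, children): node -> submit file, node -> set of child nodes."""
    jobs, children = {}, defaultdict(set)
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].lower()
        if keyword == "job":
            jobs[words[1]] = words[2]
        elif keyword == "parent":
            split = [w.lower() for w in words].index("child")
            for parent in words[1:split]:
                children[parent].update(words[split + 1:])
    return jobs, children

def run_order(jobs, children):
    """Return one valid execution order (Kahn's topological sort)."""
    pending = {name: 0 for name in jobs}      # unfinished parents per node
    for kids in children.values():
        for kid in kids:
            pending[kid] += 1
    ready = deque(sorted(n for n, c in pending.items() if c == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for kid in sorted(children.get(node, ())):
            pending[kid] -= 1
            if pending[kid] == 0:
                ready.append(kid)
    if len(order) != len(jobs):
        raise ValueError("cycle detected: not a DAG")
    return order

jobs, children = parse_dag(DIAMOND)
print(run_order(jobs, children))
```

B and C become ready at the same moment, so they could run in parallel; the sketch simply lists them alphabetically.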

Submitting a DAG

› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon which will begin running your jobs:

% condor_submit_dag diamond.dag

› DAGMan itself is submitted as a Condor job and run by the schedd

 The DAGMan daemon itself is “watched” by Condor, so you don’t have to

Running a DAG

› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.

[Figure: the .dag file, DAGMan holding the DAG (A, B, C, D), and the Condor job queue containing job A]

Running a DAG (cont’d)

› DAGMan holds & submits jobs to the Condor queue at the appropriate times.

[Figure: DAGMan holding the DAG (A, B, C, D); the Condor job queue contains jobs B and C]

Running a DAG (cont’d)

› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.

[Figure: DAGMan with job B marked failed (X), the Condor job queue, and the rescue file]
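
The "continues until it can no longer make progress" behavior can be simulated in Python for the diamond DAG. This is a sketch of the idea, not DAGMan's algorithm or its rescue file format: `run_dag`, the `succeeds` callback, and the `already_done` set (standing in for what a rescue file records) are all hypothetical.

```python
def run_dag(parents, succeeds, already_done=()):
    """Run every node whose parents are all done; stop when nothing can run.

    parents: node -> set of parent nodes
    succeeds: function(node) -> bool, the pretend outcome of running the node
    Returns (done, failed, not_run).
    """
    done, failed = set(already_done), set()
    progress = True
    while progress:
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:          # all parents completed
                (done if succeeds(node) else failed).add(node)
                progress = True
    not_run = set(parents) - done - failed
    return done, failed, not_run

diamond = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

# First attempt: job B fails, C still runs, D can never start.
done, failed, not_run = run_dag(diamond, lambda n: n != "B")
print(sorted(done), sorted(failed), sorted(not_run))

# Second attempt: start from the saved state; only B and D need to run.
done2, failed2, not_run2 = run_dag(diamond, lambda n: True, already_done=done)
print(sorted(done2), sorted(failed2), sorted(not_run2))
```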

Recovering a DAG

› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.

[Figure: DAGMan restored from the rescue file; the Condor job queue contains job C]

Recovering a DAG (cont’d)

› Once that job completes, DAGMan will continue the DAG as if the failure never happened.

[Figure: DAGMan holding the DAG; the Condor job queue contains job D]

Finishing a DAG

› Once the DAG is complete, the DAGMan job itself is finished, and exits.

[Figure: DAGMan with the completed DAG (A, B, C, D) and the Condor job queue]

Additional DAGMan Features

› Provides other handy features for job management…

 nodes can have PRE & POST scripts

 failed nodes can be automatically retried a configurable number of times

 job submission can be “throttled”
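
The retry feature amounts to a bounded loop around a node. The Python sketch below illustrates that idea only; it does not show DAGMan's syntax or implementation, and `run_with_retries` and its arguments are hypothetical.

```python
def run_with_retries(run_node, max_retries):
    """Run a node, retrying up to max_retries times after the first failure.

    run_node: function() -> bool, True on success
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 2):   # first try plus the retries
        if run_node():
            return attempt
    return None

# A pretend node that fails twice and then succeeds.
outcomes = iter([False, False, True])
print(run_with_retries(lambda: next(outcomes), max_retries=3))
```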

General User Commands

› condor_status        View Pool Status
› condor_q             View Job Queue
› condor_submit        Submit new Jobs
› condor_rm            Remove Jobs
› condor_prio          Intra-User Prios
› condor_history       Completed Job Info
› condor_submit_dag    Submit new DAG
› condor_checkpoint    Force a checkpoint
› condor_compile       Link Condor library

Thank you!

Check us out on the Web: http://www.condorproject.org

Email: condor-admin@cs.wisc.edu
