Condor - local guide
1. Introduction
This document is intended to provide a quick introduction to Condor, to introduce the most important concepts, and to provide examples to allow users to start using Condor as quickly as possible. Although this document is quite large, it should be less intimidating than the Condor Reference Manual! For any more advanced use, however, you will have to refer to the vast Condor Reference Manual.
You may also wish to look at the Condor Project Homepage.
1.1. What is Condor
Condor is a specialized batch system for managing compute-intensive jobs. Users submit their compute jobs to Condor; Condor puts the jobs in a queue, runs them, and then informs the user of the result. The collection of inter-networked machines running Condor and controlled by a particular manager is known as a pool. Like most batch systems, Condor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications.
In slightly more detail: a user submits the job to Condor from one of a number of Submit machines. Condor finds an available Execute machine from the pool and begins running the job on that machine. Condor can detect that a machine running a Condor job is no longer available (perhaps because the owner of the machine came back from lunch and started typing on the keyboard). It might be able to checkpoint the job and move (migrate) it to a different machine which would otherwise be idle. If it has been able to checkpoint the job, Condor continues the job on the new machine from precisely where it left off.
Condor does not require an account (login) on machines where it runs a job. Condor can do this because it uses remote system calls which trap library calls for such operations as reading or writing from disk files. The calls are transmitted over the network to be performed on the machine where the job was submitted.
Every machine in a Condor pool can serve a variety of roles, and most machines will serve more than one role simultaneously, although certain roles can only be performed by single machines in a pool. The following list describes the 4 different roles:
- Central Manager
The Manager machine is the collector of information, and the negotiator between resources and resource requests. There is only one central manager for a pool.
- Submit
Submit machines queue Condor jobs. There may be more than one. Users will only need to be able to log in to these.
- Execute
Execute machines actually run the Condor jobs. There may be more than one.
- Checkpoint Server
The checkpoint server is a centralized machine that stores all the checkpoint files for the jobs submitted in the pool. Only one machine in a pool can be configured as a checkpoint server, and its presence is optional.
1.2. Condor in the Computer Lab
NB. If you wish to start using Condor please email sysadmin first. We can advise on whether any of the pool machines are currently being used for other projects. We would also like to keep a general eye on how Condor is being used so we can assess demand.
The Computer Lab pool consists of a number of virtual machines running under Xen Enterprise. They generally use spare resources, having lower priority than other VMs. To see Condor's view of running machines, use condor_status, and to see XenE's view of running and halted machines, use cl-condor-list. The number of CPUs a machine has (up to 8) and the amount of memory it has (up to 30GB) are set when the machine is started with the cl-condor-start command, e.g. to start the machine pb030 with 2 CPUs and 7GB of RAM use:
cl-condor-start pb030 2 7000000000
NB - please check the top level page to see if there are any special notices concerning machine availability. People wishing to use the machines for running jobs under Condor only need to be able to log in to the submit machines, which are condor-submit0 and condor-submit1, together known as condor-submit.
We have chosen to support only the Standard, Vanilla and Java Universes on our Condor pool.
We have decided not to allow preemption on the local pool (this is not the default behaviour - see 4.1.2. User Priority for details).
Before running any of the condor commands mentioned in these notes you will need to make sure you have the condor programs on your PATH:
PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
1.2.1 Etiquette
In the past people have tended to run Condor in many different ways: long jobs (weeks even), short jobs, a handful at a time, or thousands of jobs at once. Because we have had to turn off job preemption it is now possible for a single user to use the entire pool for long periods, thus preventing other people from getting any jobs to run. Condor does not have sophisticated scheduling mechanisms, so there is not much we can do about this!
We have decided to adopt the policy that jobs should aim to finish within a couple of hours - anyone requiring jobs over 24h should contact sys-admin.
The time quantum of the scheduler appears to be five minutes. If a large number of jobs finish in less than five minutes, the scheduler ceases to function, and needs to be manually restarted by a sys admin. As such, ensure that jobs do not take less than five minutes. If they might, arrange to run several such steps in a single job.
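One way to follow the five-minute rule is to group short steps into batches before writing the submit file. The sketch below is only an illustration: the helper name batch_steps, the step labels and the per-step time estimates are assumptions, not part of Condor. It packs steps greedily, in order, until each batch is expected to run for at least a chosen minimum.

```python
def batch_steps(steps, min_seconds=300):
    """Group (name, estimated_seconds) steps into batches of >= min_seconds.

    Steps are taken in order; a batch is closed as soon as its estimated
    total reaches min_seconds. A short final remainder is merged into the
    previous batch. If all steps together are still short, they form a
    single batch.
    """
    batches, current, total = [], [], 0
    for name, seconds in steps:
        current.append(name)
        total += seconds
        if total >= min_seconds:
            batches.append(current)
            current, total = [], 0
    if current:
        if batches:
            batches[-1].extend(current)   # fold leftovers into the last batch
        else:
            batches.append(current)       # everything together is still short
    return batches


# Hypothetical example: ten steps of about 90 seconds each.
steps = [("step%d" % i, 90) for i in range(10)]
for batch in batch_steps(steps):
    print(" ".join(batch))
```

Each printed line could then become the argument list of one job, with a wrapper script that runs the named steps one after another.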
Do not create files in /tmp/ as they might fail to be deleted, causing the filing system to fill up, which causes Condor to consider the machine unavailable for running jobs. Instead use /anfs/bigdisc/$USER/ or /anfs/bigtmp/, and tidy them up. Many applications which create temporary files will use $TMPDIR or suchlike for such files.
Users should add the line "nice_user = True" to their jobs as a matter of course. This will ensure that when a new job is to be started, their jobs will only be considered if there are no other jobs free to run. This means that if a user has submitted a large batch of jobs, other users can still submit a small number of non-nice jobs and have them considered first. This will only work if most waiting jobs are niced. Note that even if jobs are niced, while they are running they stop other jobs from starting, so please try to ensure most jobs complete within a couple of hours, even if that wastes a small amount of time starting up etc.
2. Using Condor
Here are all the steps needed to run a job using Condor (remember to set CONDOR_CONFIG and PATH beforehand, as mentioned above):
- Code Preparation.
A job run under Condor must be able to run as a background batch job. Condor runs the program unattended and in the background. A program that runs in the background will not be able to do interactive input and output. Condor can redirect console output (stdout and stderr) and keyboard input (stdin) to and from files for you. Create any needed files that contain the proper keystrokes needed for program input. Make certain the program/script will run correctly with these files on the submit machine.
- The Condor Universe.
Condor has several runtime environments (each called a universe) from which to choose. Of the universes, the most important are the Standard Universe, the Vanilla Universe and the Java Universe. If your job is written in C, C++ or Fortran, and can be linked with the Condor library, then run it in the standard universe; otherwise use the vanilla or java universes. Choose a universe under which to run your job, and re-link the program if necessary.
- Submit description file.
A submit description file controls the details of job submission. The file will contain information about the job such as what executable to run, the files to use for keyboard and screen data, the platform type required to run the program, and where to send e-mail when the job completes. You can also tell Condor how many times to run a program; it is simple to run the same program multiple times with multiple data sets. Write a submit description file to go with the job, using the syntax description and the illustrative examples later in this guide.
- Submit the Job.
Log in to a submit machine (see above) and submit the program to Condor with the condor_submit command.
When your program completes, Condor will tell you the exit status of your program and various statistics about its performance, including time used and I/O performed. If you are using a log file for the job (which is recommended) the exit status will be recorded in the log file. Alternatively you can view the history file for the job by typing condor_history, which will show something like:
ID      OWNER    SUBMITTED     CPU_USAGE    ST PRI SIZE CMD
1.0     condor   6/13 10:58    0+00:00:00   C  0   0.9  job_blah
Notice that the status ("ST") is now C, for completed.
You can remove a job from the queue prematurely with condor_rm.
2.1 Problems?
If you haven't already, you should check that your program or script works correctly on the condor-submit machine, as set out in the 2. Using Condor section above. If it works on condor-submit but not under Condor, log in to "pb" and try it there. If it works there but not under Condor, contact sys-admin.
3. Submit File syntax
A submit description file controls the details of job submission. The syntax is simple; a list of the most important entries, grouped by concept, follows. This is by no means a full list (for that see the condor_submit man page); this selection is intended mainly to make it easier to understand the examples given in other sections.
Blank lines and lines beginning with a pound sign (#) character are ignored by the submit description file parser, and so may be used for comments.
3.1. Basic entries
- executable = name
The name of the executable file for this job cluster (for a definition of a job cluster see this example).
- arguments = argument_list
List of arguments to be supplied to the program named as the executable on its command line.
- input = pathname
- output = pathname
- error = pathname
Condor assumes that its jobs are long-running, and that the user will not wait at the terminal for their completion. Because of this, the standard files which normally access the terminal (stdin, stdout, and stderr) must refer to files. Thus, the file name specified with input should contain any keyboard input the program requires (that is, this file becomes stdin). Likewise with output and error. If not specified, the default value of /dev/null is used for submission to a Unix machine.
- universe = vanilla | standard | java
Specifies which Condor Universe to use when running this job.
- initialdir = directory-path
Used to give jobs a directory with respect to file input and output. Also provides a directory (on the submit machine) for the user log.
- log = pathname
Use log to specify a file name where Condor will write a log file of what is happening with this job cluster. For example, Condor will log into this file when and where the job begins running, when the job is checkpointed and/or migrated, when the job completes, etc. Most users find specifying a log file to be very handy; its use is recommended. If no log entry is specified, Condor does not create a log for this cluster.
- queue [number-of-procs]
Places one or more copies of the job into the Condor queue. The optional argument number-of-procs specifies how many times to submit the job to the queue, and it defaults to 1. If desired, any commands may be placed between subsequent queue commands, such as new input, output, error, initialdir, arguments, or executable commands. This is handy when submitting multiple runs into one cluster with one submit description file. Multiple clusters may be specified within a single submit description file by changing the executable between queue commands. Each time the executable command is issued (between queue commands), a new cluster is defined.
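The numbering rule for queue can be illustrated with a small model. This is a sketch under stated assumptions, not Condor's parser: the function name assign_job_ids is hypothetical, only executable and queue lines are interpreted, and the starting cluster number is passed in because the real number is chosen by Condor at submit time.

```python
def assign_job_ids(submit_text, first_cluster=1):
    """Return a list of (cluster, process, executable) for a submit file.

    Simplified model of the rule described above: a new `executable`
    command starts a new cluster at the next `queue`, and `queue N` adds
    N processes to the current cluster. Other commands, comments and
    blank lines are skipped.
    """
    jobs = []
    cluster = first_cluster - 1
    process = 0
    executable = None
    new_cluster_pending = False
    for raw in submit_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split()
        if words[0].lower() == "queue":
            if executable is None:
                raise ValueError("queue seen before any executable")
            if new_cluster_pending:
                cluster += 1
                process = 0
                new_cluster_pending = False
            count = int(words[1]) if len(words) > 1 else 1
            for _ in range(count):
                jobs.append((cluster, process, executable))
                process += 1
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key.lower() == "executable":
                executable = value
                new_cluster_pending = True
    return jobs


# Hypothetical submit file: program names and arguments are made up.
example = """\
# three runs of sim, then one run of post
executable = sim
arguments = 15 20
queue
arguments = 30 20
queue 2
executable = post
queue
"""
for cluster, process, executable in assign_job_ids(example, first_cluster=60):
    print("%d.%d %s" % (cluster, process, executable))
```

The printed cluster.process pairs have the same shape as the ID column shown by condor_q.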
3.2. Job Ordering and location
- priority = priority
Condor job priorities range from -20 to +20, with 0 being the default. Jobs with higher numerical priority will run before jobs with lower numerical priority. Note that this priority is on a per user basis; setting the priority will determine the order in which your own jobs are executed, but will have no effect on whether or not your jobs will run ahead of another user's jobs. See 4.1 Priority.
- nice_user = True | False
Normally, when a machine becomes available to Condor, Condor decides which job to run based upon user and job priorities. Setting nice_user equal to True tells Condor not to use your regular user priority, but that this job should have last priority among all users and all jobs. So jobs submitted in this fashion run only on machines which no other non-nice_user job wants - a true "bottom-feeder" job! This is very handy if a user has some jobs they wish to run, but does not wish to use resources that could instead be used to run other people's Condor jobs. Jobs submitted in this fashion have "nice-user." prepended to the owner name when viewed from condor_q or condor_userprio. The default value is False.
- requirements = Boolean Expression
The requirements command is a boolean expression which uses C-like operators. In order for any job in this cluster to run on a given machine, this requirements expression must evaluate to true on the given machine. For example, to require that whatever machine executes your program has at least 64 Meg of RAM and has a MIPS performance rating greater than 45, use:
requirements = Memory >= 64 && Mips > 45
Only one requirements command may be present in a submit description file. Unless you request otherwise, Condor will by default give your job to machines with the same architecture and operating system version as the machine running condor_submit. See 4.2 Machine Attributes.
- rank = Float Expression
The argument is a floating-point expression that states how to rank machines which have already met the requirements expression. Essentially, rank expresses preference. A higher numeric value equals better rank. Condor will give the job to the machine with the highest rank. For example,
requirements = Memory > 60
rank = Memory
asks Condor to find all available machines with more than 60 megabytes of memory and give the job to the one with the most memory. See 4.3 Ranking.
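The way requirements and rank work together can be sketched as a filter followed by a maximum. The code below is a simplified model, not Condor's matchmaker: it ignores user priority and the machine's own requirements, and the function name choose_machine and the pool contents are hypothetical.

```python
def choose_machine(machines, requirements, rank):
    """Pick the machine a job would prefer under a simplified model.

    machines      list of dicts of advertised attributes
    requirements  function(machine) -> bool, must be true to be eligible
    rank          function(machine) -> number, higher means preferred
    Returns the best eligible machine, or None if nothing qualifies.
    """
    eligible = [m for m in machines if requirements(m)]
    if not eligible:
        return None
    return max(eligible, key=rank)


# Hypothetical pool; the names and memory sizes (in megabytes) are made up.
pool = [
    {"Name": "node-a", "Memory": 48},
    {"Name": "node-b", "Memory": 512},
    {"Name": "node-c", "Memory": 2048},
]

best = choose_machine(
    pool,
    requirements=lambda m: m["Memory"] > 60,   # requirements = Memory > 60
    rank=lambda m: m["Memory"],                # rank = Memory
)
print(best["Name"])
```

Here node-a is excluded by the requirements expression, and the larger of the two remaining machines is chosen by rank.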
3.3. File Handling
- fetch_files = file1, file2, ...
If your job attempts to access a file mentioned in this list, Condor will automatically copy the whole file to the executing machine, where it can be accessed quickly. When your job closes the file, it will be copied back to its original location. This option only applies to standard-universe jobs.
- append_files = file1, file2, ...
If your job attempts to access a file mentioned in this list, Condor will force all writes to that file to be appended to the end. Furthermore, condor_submit will not truncate it. This option may yield some surprising results. If several jobs attempt to write to the same file, their output may be intermixed. If a job is evicted from one or more machines during the course of its lifetime, such an output file might contain several copies of the results. This option should only be used when you wish a certain file to be treated as a running log instead of a precise result. This option only applies to standard-universe jobs.
- local_files = file1, file2, ...
If your job attempts to access a file mentioned in this list, Condor will cause it to be read or written at the execution machine. This is most useful for temporary files not used for input or output. This option only applies to standard-universe jobs.
3.4. Job Information
- notification = when
Owners of Condor jobs are notified by email when certain events occur. If when is set to Always, the owner will be notified whenever the job is checkpointed, and when it completes. If when is set to Complete (the default), the owner will be notified when the job terminates. If when is set to Error, the owner will only be notified if the job terminates abnormally. If when is set to Never, the owner will not be mailed, regardless of what happens to the job.
- notify_user = email-address
Used to specify the email address to use when Condor sends email about a job. If not specified, Condor will default to using job-owner@UID_DOMAIN where UID_DOMAIN is specified by the Condor site administrator.
3.5. Environment
- environment = parameter_list
A list of environment variables which will be placed (as given) into the job's environment before execution. The list is of the form <parameter>=<value>. Multiple environment variables can be specified by separating them with a semicolon (;) when submitting from a Unix platform. The length of the list specified in the environment is currently limited to 10240 characters.
- getenv = True | False
If getenv is set to True, then condor_submit will copy all of the user's current shell environment variables at the time of job submission into the job description. The job will therefore execute with the same set of environment variables that the user had at submit time. Defaults to False.
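When the environment list is generated by a script, it is easy to check the format and the length limit before submitting. The following is a minimal sketch, assuming the semicolon-separated form and the 10240-character limit described above; the helper name environment_line and the RUN_LABEL variable are hypothetical, and values containing semicolons are simply rejected rather than quoted.

```python
MAX_ENVIRONMENT_LENGTH = 10240  # limit quoted in the text above


def environment_line(variables):
    """Build an `environment = ...` submit command from a dict.

    Entries are joined with semicolons, as described for Unix submit
    machines. Names or values containing a semicolon are rejected
    because this sketch makes no attempt at quoting.
    """
    parts = []
    for name, value in variables.items():
        value = str(value)
        if ";" in name or ";" in value:
            raise ValueError("semicolon not supported in %r" % name)
        parts.append("%s=%s" % (name, value))
    joined = ";".join(parts)
    if len(joined) > MAX_ENVIRONMENT_LENGTH:
        raise ValueError("environment list exceeds %d characters"
                         % MAX_ENVIRONMENT_LENGTH)
    return "environment = " + joined


print(environment_line({"TMPDIR": "/anfs/bigtmp", "RUN_LABEL": "trial1"}))
```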
3.6. Macros
Parameterless macros in the form of $(macro_name) may be inserted anywhere in Condor submit description files. Macros can be defined by lines in the form of
<macro_name> = <string>
Two pre-defined macros are supplied by the submit description file parser. The $(Cluster) macro supplies the number of the job cluster, and the $(Process) macro supplies the number of the job. These macros are intended to aid in the specification of input/output files, arguments, etc., for clusters with lots of jobs, and/or could be used to supply a Condor process with its own cluster and process numbers on the command line. For an example see 5.1.2.2. Multiple Submission - Different Inputs.
In addition to the normal macro, there is also a special kind of macro called a substitution macro that allows you to substitute expressions defined on the resource machine itself (obtained after a match to the machine has been performed) into specific expressions in your submit description file. The special substitution macro is of the form $$(attribute). It may only be used in three expressions in the submit description file: executable, environment, and arguments. Example:
executable = myprog.$$(opsys).$$(arch)
The opsys and arch attributes will be substituted at match time for any given resource. This will allow Condor to automatically choose the correct executable for the matched machine.
The environment macro, $ENV, allows the evaluation of an environment variable to be used in setting a submit description file command. The syntax used is
$ENV(variable)
For example:
log = $ENV(HOME)/jobs/logfile
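To see what a value will look like once the submit-time macros are filled in, the expansion can be imitated in a few lines. This is an illustrative sketch, not Condor's own parser: the function name expand_macros is hypothetical, macro names are assumed to be case-insensitive, unknown macros and unset environment variables are left as written, and $$(attribute) macros are deliberately not expanded because their values are only known at match time.

```python
import os
import re


def expand_macros(text, cluster, process, user_macros=None, env=None):
    """Expand $(Name) and $ENV(NAME) in one submit-file value.

    $$(attribute) substitution macros are left untouched, because their
    values are only known after a machine has been matched.
    """
    env = os.environ if env is None else env
    macros = {"cluster": str(cluster), "process": str(process)}
    for name, value in (user_macros or {}).items():
        macros[name.lower()] = value      # assumption: names ignore case

    def env_value(match):
        # Unset variables are left as written rather than guessed at.
        return env.get(match.group(1), match.group(0))

    def macro_value(match):
        # Unknown macros are left as written as well.
        return macros.get(match.group(1).lower(), match.group(0))

    text = re.sub(r"\$ENV\((\w+)\)", env_value, text)
    # (?<!\$) skips the second '$' of a $$(attribute) substitution macro.
    return re.sub(r"(?<!\$)\$\((\w+)\)", macro_value, text)


# The home directory below is a made-up value passed in explicitly.
print(expand_macros("job.$(Process)", cluster=60, process=3))
print(expand_macros("$ENV(HOME)/jobs/logfile", 60, 0,
                    env={"HOME": "/home/alice"}))
print(expand_macros("myprog.$$(opsys).$$(arch)", 60, 0))
```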
4. Job scheduling - Priority, Requirements and Rank
The scheduling arrangements adopted by Condor control when and on which machine your jobs are run. Priority (both per-job and per-user) determines when a job will be run; ranking (which uses requirements and machine attributes) may be used to determine where a job is run.
All machines in a Condor pool advertise their attributes, such as available RAM, CPU type and speed, virtual memory size, current load average, along with other static and dynamic properties. This machine information also includes under what conditions a machine is willing to run a Condor job and what type of job it would prefer.
Likewise, when submitting a job, you can specify your requirements and preferences, for example, the type of machine you wish to use. You can also specify an attribute, for example, floating point performance, and have Condor automatically rank the available machines according to their values for this attribute. Condor plays the role of a matchmaker by continuously reading all the job requirements and all the machine information, matching and ranking jobs with machines.
4.1 Priority
4.1.1 Job Priority
Job priorities allow the assignment of a priority level to each submitted Condor job in order to control order of execution - note that these are priorities between jobs of the same user only. To set a job priority, use the condor_prio command, or use the priority command in your submit description file. Job priorities do not impact user priorities in any fashion. Job priorities range from -20 to +20, with -20 being the worst and +20 being the best; 0 is the default.
4.1.2 User Priority
The default behaviour for Condor is to allocate machines to users based upon a user's priority - which changes according to the number of resources the individual is using. Condor enforces that each user gets his/her fair share of machines according to user priority both when allocating machines which become available and by priority preemption of currently allocated machines.
However, it was discovered that this did not really work in our environment. Most users are running vanilla universe jobs which cannot be checkpointed. The preemption rules meant that heavy users were often having their jobs killed after many hours running by new users starting their jobs. This tended to annoy people! It has therefore been decided to turn off preemption entirely on the local pool as an experiment. This may mean that jobs of new users sit idle for some time because of the long-running jobs of others. This may turn out to be equally unsuitable!
It is possible to submit a job as a "nice" job. Setting nice_user in your submit description file tells Condor not to use your regular user priority, but that this job should have last priority among all users and all jobs.
4.2 Machine Attributes
The attributes advertised by a machine can be seen with condor_status -l machine_name. Some of the listed attributes are used by Condor for scheduling. Other attributes are for information purposes. An important point is that any of the attributes in a machine can be utilized at job submission time as part of a request or preference on which machine to use. Additional attributes can be easily added.
For example, this is the output of condor_status -l for one processor of the machine pb001:
MyType = "Machine"
TargetType = "Job"
Name = "vm1@pb001.cl.cam.ac.uk"
Machine = "pb001.cl.cam.ac.uk"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pb001.cl.cam.ac.uk"
CondorVersion = "$CondorVersion: 6.6.7 Oct 11 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 0
Disk = 467172
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 25539262
ConsoleIdle = 25539262
Memory = 29994
Cpus = 1
StartdIpAddr = "<128.232.4.1:33071>"
Arch = "x86_64"
OpSys = "LINUX"
UidDomain = "cl.cam.ac.uk"
FileSystemDomain = "cl.cam.ac.uk"
Subnet = "128.232.4"
HasIOProxy = TRUE
TotalVirtualMemory = 0
TotalDisk = 934344
KFlops = 951601
Mips = 3370
LastBenchmark = 1103098732
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 678
ClockDay = 3
TotalVirtualMachines = 2
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.4.1_01"
JavaMFlops = 295.152039
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1103041577
Activity = "Idle"
EnteredCurrentActivity = 1103098732
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1103041099
UpdateSequenceNumber = 230
MyAddress = "<128.232.4.1:33071>"
LastHeardFrom = 1103109536
UpdatesTotal = 231
UpdatesSequenced = 230
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
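If the attribute listing is saved to a file with one attribute per line, as condor_status -l prints it, it can be loaded into a dictionary for inspection. The parser below is a rough sketch with a hypothetical name, parse_classad: it converts quoted strings, TRUE/FALSE and plain numbers, and keeps anything else (such as the CpuBusy expression) as raw text rather than trying to evaluate it.

```python
def parse_classad(text):
    """Parse `Name = value` lines into a dict of Python values.

    Quoted strings, TRUE/FALSE and plain numbers are converted; anything
    else (expressions such as CpuBusy) is kept as the raw text.
    """
    attributes = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, raw = (part.strip() for part in line.split("=", 1))
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            value = raw[1:-1]
        elif raw.upper() in ("TRUE", "FALSE"):
            value = raw.upper() == "TRUE"
        else:
            try:
                value = int(raw)
            except ValueError:
                try:
                    value = float(raw)
                except ValueError:
                    value = raw           # an expression; left unevaluated
        attributes[name] = value
    return attributes


# A few lines taken from the listing above.
sample = """\
Name = "vm1@pb001.cl.cam.ac.uk"
Memory = 29994
LoadAvg = 0.000000
HasJava = TRUE
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
"""
ad = parse_classad(sample)
print(ad["Name"], ad["Memory"], ad["HasJava"])
```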
4.3 Ranking
When considering the match between a job and a machine, rank is used to choose a match from among all machines that satisfy the job's requirements and are available to the user, after accounting for the user's priority and the machine's rank of the job. The rank expressions, simple or complex, define a numerical value that expresses preferences.
The job's rank expression evaluates to one of three values. It can be UNDEFINED, ERROR, or a floating point value. If rank evaluates to a floating point value, the best match will be the one with the largest, positive value. If no rank is given in the submit description file, then Condor substitutes a default value of 0.0 when considering machines to match. If the job's rank of a given machine evaluates to UNDEFINED or ERROR, this same value of 0.0 is used. Therefore, the machine is still considered for a match, but has no rank above any other.
A boolean expression evaluates to the numerical value of 1.0 if true, and 0.0 if false.
Example 1: For a job that desires the machine with the most available memory:
Rank = memory
Example 2: For a job that prefers to run on Saturdays and Sundays:
Rank = ( (clockday == 0) || (clockday == 6) )
It is wise when writing a rank expression to check that the expression's evaluation will lead to the expected ranking of machines. This can be accomplished using the condor_status command with the -constraint argument. This allows the user to see a list of machines that fit a constraint.
Example 1: To see which machines in the pool have kflops defined, use:
condor_status -constraint kflops
Example 2: If this is typed on a Wednesday it will show all of the machines in the pool; on any other day it will show none:
condor_status -constraint "(clockday == 3)"
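The rules for turning a rank expression into a number can be summarised in a short model. This sketch is not Condor's evaluator: the function name rank_value is hypothetical, None stands in for UNDEFINED, and a Python exception (for instance a missing attribute) stands in for ERROR.

```python
def rank_value(expression, machine):
    """Evaluate a job's rank for one machine under the rules above.

    expression  function(machine) -> bool, number or None
                (None stands in for UNDEFINED; an exception for ERROR)
    Returns a float: booleans become 1.0 or 0.0, and UNDEFINED or ERROR
    fall back to the default of 0.0.
    """
    try:
        result = expression(machine)
    except Exception:
        return 0.0            # ERROR is treated like the default rank
    if result is None:
        return 0.0            # UNDEFINED
    return float(result)      # True -> 1.0, False -> 0.0


def weekend(machine):
    # Rank = ( (clockday == 0) || (clockday == 6) )
    # A missing ClockDay raises KeyError, handled by rank_value above.
    return machine["ClockDay"] == 0 or machine["ClockDay"] == 6


print(rank_value(weekend, {"ClockDay": 6}))
print(rank_value(weekend, {"ClockDay": 3}))
print(rank_value(lambda m: m.get("Memory"), {"Memory": 29994}))
```

A machine that does not advertise the attribute at all still receives 0.0, so it remains a candidate but is never preferred.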
5. Universes
A universe in Condor defines an execution environment. There are three main choices, and a host of others which we do not currently support on the CL pool.
5.1. Standard Universe
In the standard universe, Condor provides checkpointing and remote system calls. These features make a job more reliable and allow it uniform access to resources from anywhere in the pool.
Condor checkpoints a job at regular intervals. A checkpoint image is essentially a snapshot of the current state of a job. If a job must be migrated from one machine to another, Condor makes a checkpoint image, copies the image to the new machine, and restarts the job, continuing from where it left off. If a machine should crash or fail while it is running a job, Condor can restart the job on a new machine using the most recent checkpoint image. In this way, jobs can run for months or years even in the face of occasional computer failures.
A job that is linked using condor_compile and is subsequently submitted into the standard universe will checkpoint and exit upon receipt of a SIGTSTP signal. The user's code may still checkpoint itself at any time by calling one of the following functions exported by the Condor libraries:
- ckpt()
Performs a checkpoint and then returns.
- ckpt_and_exit()
Checkpoints and exits; Condor will then restart the process again later, potentially on a different machine.
The standard universe allows a job running under Condor to handle system calls by returning them to the machine where the job was submitted. The standard universe also provides the mechanisms necessary to take a checkpoint and migrate a partially completed job, should the machine on which the job is executing become unavailable. To use the standard universe, it is necessary to relink the program with the Condor library using the condor_compile command, hence this universe is only appropriate if you have the C or C++ source code for the program.
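The idea of checkpoint and restart can be illustrated with a toy program that saves its own state. This is only an analogy: in the standard universe Condor snapshots the whole process transparently, without any code like this in the job. The function name run_with_checkpoints, the pickle file and the stop_after parameter (which simulates an eviction) are all assumptions made for the illustration.

```python
import os
import pickle


def run_with_checkpoints(total_steps, checkpoint_path, stop_after=None):
    """Sum the integers below total_steps, saving state after every step.

    If checkpoint_path exists the loop resumes from the saved state.
    stop_after simulates an eviction: the function returns None after
    that many steps in this call, leaving the checkpoint on disk.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as handle:
            state = pickle.load(handle)
    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None                       # "evicted" before finishing
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        with open(checkpoint_path, "wb") as handle:
            pickle.dump(state, handle)
    return state["total"]
```

Calling the function once with stop_after=4 and then again without it gives the same total as an uninterrupted run, because the second call picks up at step 4 rather than starting again from zero.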
5.1.1. A Standard Universe "How to"
- Create a suitable directory and cd into it.
- Make sure the directory has suitable permissions set (see 1.2. Condor in the Computer Lab)
- Move your code files into the directory.
- Make sure you have the program condor_config_val on your PATH.
By default this lives in /opt/condor-6.8.3/bin, so
PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
- Compile using condor_compile
- Create a submit file
- Submit, using condor_submit
- Monitor the job's progress with the condor_q and condor_status commands
5.1.2. Examples
5.1.2.1. A very simple job
Using the C program called hello.c:
#include <stdio.h>

main()
{
    printf("hello, Condor\n");
    exit(0);
}
The compilation instruction and resulting output are as follows:
$ condor_compile gcc hello.c -o hello
hello.c: In function ‘main’:
hello.c:6: warning: incompatible implicit declaration of built-in function ‘exit’
LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor-6.8.3/lib -Bstatic --eh-frame-hdr -m elf_x86_
64 --hash-style=gnu -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o hello /opt/condor-6.8.3/li
b/condor_rt0.o /usr/lib/gcc/x86_64-redhat-linux/4.1.1/../../../../lib64/crti.o /usr/lib/gcc/x
86_64-redhat-linux/4.1.1/crtbeginT.o -L/opt/condor-6.8.3/lib -L/usr/lib/gcc/x86_64-redhat-lin
ux/4.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.1/..
/../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccrztRSX.o /opt/condor-6.8.3/lib/lib
condorsyscall.a /opt/condor-6.8.3/lib/libcondor_z.a /opt/condor-6.8.3/lib/libcomp_libstdc++.a
/opt/condor-6.8.3/lib/libcomp_libgcc.a /opt/condor-6.8.3/lib/libcomp_libgcc_eh.a --as-needed
--no-as-needed -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lc
ondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /opt/condor-6.8.3/lib/libcomp_lib
gcc.a /opt/condor-6.8.3/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed /usr/lib/gcc/x86_6
4-redhat-linux/4.1.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/4.1.1/../../../../lib64/crtn.o
/opt/condor-6.8.3/lib/libcondorsyscall.a(condor_file_agent.o): In function `CondorFileAgent::
open(char const*, int, int)':
(.text+0x29b): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
/opt/condor-6.8.3/lib/libcondorsyscall.a(switches.o): In function `__gets_chk':
(.text+0xa4bb): warning: the `gets' function is dangerous and should not be used.
The submit file, submit.hello, is:
########################
# Submit description file for hello program
########################
Executable = hello
Universe = standard
Output = hello.out
Log = hello.log
Queue
The submit instruction and output will look something like this (note the warning message!):
$ condor_submit submit.hello
Submitting job(s)
WARNING: Log file /auto/homes/ckh11/condortest/hello.log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
Logging submit event(s).
1 job(s) submitted to cluster 57.
condor_q will say:
$ condor_q
-- Submitter: pb000.cl.cam.ac.uk : <127.0.0.1:59865> : pb000.cl.cam.ac.uk
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
57.0 ckh11 2/1 11:23 0+00:00:00 R 0 9.8 hello
1 jobs; 0 idle, 1 running, 0 held
The log file, hello.log, will show (something similar to):
000 (057.000.000) 02/01 11:23:57 Job submitted from host: <127.0.0.1:59865>
...
001 (057.000.000) 02/01 11:24:31 Job executing on host: <127.0.0.1:34755>
...
005 (057.000.000) 02/01 11:24:31 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
816 - Run Bytes Sent By Job
1702035 - Run Bytes Received By Job
816 - Total Bytes Sent By Job
1702035 - Total Bytes Received By Job
...
The output file, hello.out, will contain:
hello, Condor
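When many jobs share a log file it can be convenient to pull the exit statuses out with a script. The sketch below assumes the log layout shown in the excerpt above (a three-digit event code, the job ID in parentheses, and a "Normal termination (return value N)" line after event 005); the function name exit_statuses is hypothetical, and other Condor versions may format the log differently.

```python
import re

EVENT_HEADER = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d\d/\d\d \d\d:\d\d:\d\d) (.*)$")
RETURN_VALUE = re.compile(r"Normal termination \(return value (-?\d+)\)")


def exit_statuses(log_text):
    """Map (cluster, process) to the return value of each finished job.

    Only event 005 ("Job terminated.") with a normal termination is
    recorded; jobs that are still running are simply absent.
    """
    statuses = {}
    current = None                      # job of the 005 event being read
    for line in log_text.splitlines():
        header = EVENT_HEADER.match(line.strip())
        if header:
            code, cluster, process = header.group(1, 2, 3)
            current = (int(cluster), int(process)) if code == "005" else None
            continue
        found = RETURN_VALUE.search(line)
        if found and current is not None:
            statuses[current] = int(found.group(1))
    return statuses


# Shortened copy of the hello.log excerpt above.
sample_log = """\
000 (057.000.000) 02/01 11:23:57 Job submitted from host: <127.0.0.1:59865>
...
001 (057.000.000) 02/01 11:24:31 Job executing on host: <127.0.0.1:34755>
...
005 (057.000.000) 02/01 11:24:31 Job terminated.
(1) Normal termination (return value 0)
...
"""
print(exit_statuses(sample_log))
```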
5.1.2.2. Multiple Submission - Different Inputs
A common situation has one executable that is executed many times, each time with a different input set. This is called a job cluster. Each cluster has a "cluster ID" and within each cluster, each job has a "process ID". If the program wants the input in a file with a fixed name, then the solution of choice runs each queued job in its own directory.
This particular example outputs the number of characters in an input file named mult_job_input. There are 5 different input files, so we need 5 jobs. Because the program uses a fixed name for its input file we do not need to specify an input in the submit description file. The 5 different but identically named input files are prestaged in 5 directories before submitting the job. The directories are named job.0, job.1, job.2, job.3 and job.4. In addition to the input file, each directory will receive its own output in a file called mult_job_output, its own error messages will go into mult_job_error, and Condor will log each job's progress in the file called mult_job_log.
The submit file, submit.mult_job, is:
####################
# Multiple jobs queued, each in its own directory
####################
universe = standard
executable = mult_job
output = mult_job_output
error = mult_job_error
log = mult_job_log
initialdir = job.$(Process)
queue 5
Note the initialdir line: it uses the $(Process) macro, which expands to 0, 1, 2, 3 and 4 in turn, to give each queued job a different directory name.
The program source, mult_job.c, is:
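#include <stdio.h>
#include <stdlib.h>   /* exit() */

/* Count the characters in the fixed-name input file mult_job_input. */
int main(void) {
    FILE *in;
    int ch;           /* int, not char, so that EOF is detected reliably */
    int i = 0;
    char filename[80];

    sprintf(filename, "mult_job_input");
    if ((in = fopen(filename, "r")) == NULL) {
        printf("Can't open %s\n", filename);
        exit(1);
    }
    while ((ch = getc(in)) != EOF) { i++; }
    fclose(in);
    printf("i is %d\n", i);
    exit(0);
}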
The compile instruction is:
condor_compile gcc -o mult_job mult_job.c
Having set up the directories and input files, the submit instruction and output is:
$ condor_submit submit.mult_job
Submitting job(s)
WARNING: Log file /auto/homes/ckh11/condortest/job.0/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.1/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.2/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.3/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.4/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
Logging submit event(s).....
5 job(s) submitted to cluster 60.
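The prestaging of the five job directories described in this example can be sketched in shell as follows. This is only an illustration: the text written into each mult_job_input file is made-up sample data standing in for your real inputs, and the directories are created relative to the current directory.

```shell
#!/bin/sh
# Prestage job.0 .. job.4, each holding its own copy of the fixed-name
# input file mult_job_input that the mult_job program expects.
# ASSUMPTION: the "sample input" lines are placeholder data only.
for n in 0 1 2 3 4; do
    mkdir -p "job.$n"
    : > "job.$n/mult_job_input"          # start with an empty file
    i=0
    # Write n+1 lines so every job sees a different character count.
    while [ "$i" -le "$n" ]; do
        echo "sample input" >> "job.$n/mult_job_input"
        i=$((i + 1))
    done
done
ls -d job.0 job.1 job.2 job.3 job.4
```

With the directories in place, initialdir = job.$(Process) in the submit file makes each of the 5 queued jobs run in its own directory.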
5.1.2.3. Multiple Submission - Different Arguments
This example queues three jobs for execution by Condor. The first will be given command line arguments of 15 and 20, and it will write its standard output to msda.out1. The second will be given command line arguments of 30 and 20, and it will write its standard output to msda.out2. Similarly the third will have arguments of 45 and 60, and it will use msda.out3 for its standard output.
The submit file, submit.msda, is:
####################
#
# Different command line arguments and output files.
#
####################
executable = msda
universe = standard
arguments = 15 20
output = msda.out1
error = msda.err1
queue
arguments = 30 20
output = msda.out2
error = msda.err2
queue
arguments = 45 60
output = msda.out3
error = msda.err3
queue
The source for msda is not given as it is trivial: it adds its two arguments and writes the result to stdout.
The compile command is as in previous examples, and the submit instruction and output is:
condor_submit submit.msda
Submitting job(s)...
3 job(s) submitted to cluster 61.
Note this time it does not mention logging as we did not specify a log file.
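As a quick check on the three output files, the arithmetic can be reproduced in shell. This is a sketch only: msda_sum is a hypothetical stand-in, and it assumes the real msda prints nothing but the sum of its two integer arguments, which the guide does not show.

```shell
#!/bin/sh
# Stand-in for msda. ASSUMPTION: the real program prints only the sum of
# its two integer arguments (its source is not given in this guide).
msda_sum() {
    echo $(( $1 + $2 ))
}

# The three argument pairs from submit.msda and the output file each feeds.
n=1
for pair in "15 20" "30 20" "45 60"; do
    # $pair is deliberately unquoted so that it splits into two arguments.
    echo "msda.out$n should contain: $(msda_sum $pair)"
    n=$((n + 1))
done
```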
5.1.2.4. Intentional Checkpointing
A job that is linked using condor_compile and is subsequently submitted into the standard universe will checkpoint and exit upon receipt of a SIGTSTP signal. The user's code may also checkpoint itself at any time by calling the condor library function ckpt(), which simply performs a checkpoint and then returns (the similar ckpt_and_exit() checkpoints and then exits).
It is wise to make the checkpoint call conditional so that you can check that your code compiles correctly without it. For example, the program checkplease.c:
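#include <stdio.h>
#include <stdlib.h>   /* exit() */
#ifdef CONDOR
void ckpt(void);      /* provided by the Condor library at link time */
#endif

int main(void)
{
    printf("hello\n");
#ifdef CONDOR
    ckpt();           /* take a checkpoint here, then carry on */
#endif
    printf("world\n");
    exit(0);
}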
The compilation step for condor submission would therefore be:
condor_compile gcc -DCONDOR -m32 -static-libgcc -o checkplease checkplease.c
The submit file is trivial and is not shown. After running, the log shows the checkpoint:
000 (062.000.000) 02/01 11:57:56 Job submitted from host: <127.0.0.1:59865>
...
001 (062.000.000) 02/01 11:58:00 Job executing on host: <127.0.0.1:34755>
...
006 (062.000.000) 02/01 11:59:00 Image size of job updated: 10798
...
003 (062.000.000) 02/01 11:59:00 Job was checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
...
005 (062.000.000) 02/01 11:59:00 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
817613 - Run Bytes Sent By Job
1702390 - Run Bytes Received By Job
817613 - Total Bytes Sent By Job
1702390 - Total Bytes Received By Job
...
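The advice to keep the checkpoint call conditional can be exercised with a plain build before involving Condor at all. The sketch below assumes checkplease.c is in the current directory and that gcc is on your PATH; the output name checkplease_local is a hypothetical choice, made so that the Condor-linked binary is not overwritten.

```shell
#!/bin/sh
# Build checkplease.c WITHOUT -DCONDOR, using plain gcc rather than
# condor_compile, so the ckpt() call is compiled out and the program can
# be tried locally first.
# ASSUMPTIONS: checkplease.c is in the current directory; the output name
# checkplease_local is hypothetical.
src=checkplease.c
if [ ! -f "$src" ]; then
    status="skipped: $src not found"
elif ! command -v gcc >/dev/null 2>&1; then
    status="skipped: gcc not on PATH"
elif gcc -o checkplease_local "$src" && ./checkplease_local; then
    status="ok: builds and runs without the Condor library"
else
    status="failed: fix the plain build before using condor_compile"
fi
echo "$status"
```

Only once the plain build works is it worth repeating the compilation with condor_compile and -DCONDOR as shown above.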
5.2. Vanilla Universe
The vanilla universe in Condor is intended for programs which cannot be successfully re-linked. Shell scripts are another case where the vanilla universe is useful. Unfortunately, jobs run under the vanilla universe cannot checkpoint or use remote system calls. This has unfortunate consequences for a job that is partially completed when the remote machine running it must be returned to its owner. Condor has only two choices: it can suspend the job, hoping to complete it at a later time, or it can give up and restart the job from the beginning on another machine in the pool.
Under Unix, Condor presumes a shared file system for vanilla jobs. If you are using a non-Unix or mixed machine pool you may have to know about the file transfer mechanism.
5.2.1. A Vanilla Universe "Howto"
- Create a suitable directory and cd into it.
- Make sure the directory has suitable permissions set (see 1.2. Condor in the Computer Lab)
- Move your files into the directory.
- Make sure you have the program condor_config_val on your PATH.
By default this lives in /opt/condor-6.8.3/bin, so
PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
- Check that the executable file does actually work
- Create a submit file
- Submit, using condor_submit
- Monitor the job's progress with the condor_q and condor_status commands
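The steps above can be sketched as a single shell session. The directory name condortest-howto, the job name myjob and the file names derived from it are all hypothetical, and the stand-in executable exists only to make the sketch self-contained. Directory permissions are deliberately not set here, since the right values depend on section 1.2.

```shell
#!/bin/sh
# Sketch of the vanilla universe howto.
# ASSUMPTIONS: "condortest-howto" and "myjob" are hypothetical names; the
# Condor path is the default quoted in the howto.
jobdir=condortest-howto
mkdir -p "$jobdir"        # set its permissions as described in section 1.2

PATH=/opt/condor-6.8.3/bin:$PATH; export PATH

# A trivial stand-in executable; replace it with your own program.
cat > "$jobdir/myjob" <<'EOF'
#!/bin/sh
echo "hello from myjob"
EOF
chmod +x "$jobdir/myjob"

# Check that the executable actually works before handing it to Condor.
"$jobdir/myjob" || echo "myjob failed when run directly"

# Create the submit file.
cat > "$jobdir/submit.myjob" <<'EOF'
universe   = vanilla
executable = myjob
output     = myjob.out
error      = myjob.err
log        = myjob.log
queue
EOF

# Submit and monitor, but only where the Condor tools are installed.
if command -v condor_submit >/dev/null 2>&1; then
    (cd "$jobdir" && condor_submit submit.myjob && condor_q)
else
    echo "condor_submit not found; submit file left in $jobdir/submit.myjob"
fi
```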
5.2.2. Examples
Some of the standard universe examples above are just as relevant in the vanilla universe: see Multiple Submission - Different Inputs and Multiple Submission - Different Arguments.
5.2.2.1. Simple shell script
Any program can be run as a vanilla job, including shell scripts. The script "doloop" stays in a loop and prints out a number, then sleeps for a second. At the end, doloop.out should contain the values from 0 to 10 and the message "Normal End-of-Job".
The script, "doloop", is:
#!/bin/bash
x=0; # initialize x to 0
while [ "$x" -le 10 ]; do
echo "$x"
# increment the value of x:
x=$(expr $x + 1)
sleep 1
done
echo "Normal End-of-Job"
The submit file, "submit.doloop", is
####################
##
## Vanilla script test
##
####################
universe = vanilla
executable = doloop
output = doloop.out
error = doloop.err
log = doloop.log
arguments = 10
queue
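The submit file passes arguments = 10, but doloop as listed has its limit of 10 written into the loop and never reads that argument. A variant that does take its limit from the command line might look like the sketch below. The optional second parameter (the delay in seconds) is an addition for quick local testing and is not part of the original script.

```shell
#!/bin/sh
# doloop LIMIT [DELAY]: print 0..LIMIT, pausing DELAY seconds (default 1)
# after each number, then print the end-of-job message.
# ASSUMPTION: the DELAY parameter is an addition, not in the original.
doloop() {
    limit=$1
    delay=${2:-1}
    x=0
    while [ "$x" -le "$limit" ]; do
        echo "$x"
        x=$((x + 1))
        sleep "$delay"
    done
    echo "Normal End-of-Job"
}

# Condor supplies the limit through "arguments = 10" in the submit file.
# Run with no arguments, this file only defines the function.
if [ "$#" -gt 0 ]; then
    doloop "$@"
fi
```

Changing the arguments line in the submit file then changes how far the job counts, without editing the script.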
5.2.2.2. Matlab
Matlab cannot be relinked with the Condor library (unless you want to try 5.2.2.2.2 Matlab Compilation, but that is not straightforward and may not always be feasible), so it has to be run in the vanilla universe. The following example shows matlab running a simple script file (also often known as an M-file). A script file is an external file that contains a sequence of matlab statements; it can be executed interactively in matlab simply by typing its name (without the extension) at the prompt.
Under condor, however, matlab cannot be run interactively, so the script file needs to be executed from the command line by using the -r option to matlab. It is also necessary to use the -nosplash, -nojvm and -nodesktop matlab options to prevent unwanted windows from appearing. Matlab will still try to open a display connection even if we don't want any windows to appear. Normally this would not be a problem, but as we run condor daemons as user "condor" instead of root there can be authentication issues. Thus an option such as -display yourhostname:0 or -nodisplay is also needed (the latter will result in some warning messages about broken X connections in your error file, which can be ignored).
The fact that we run condor daemons as user "condor" instead of root can also cause file ownership problems in this particular example (see 1.2. Condor in the Computer Lab): because we write to a file which will be owned by user "condor", we have to make the working directory world-writeable.
Under Fedora Core 6 (as used on the processor bank machines) matlab is installed as cl-matlab.
The script file "matscripttest.m" in this example is:
load a.dat;
load b.dat;
matrR = a * b;
save matrR.dat;
exit;
Note the final exit: without it the script will never finish and condor will hang.
The files a.dat and b.dat must already exist; the file matrR.dat will be created.
The submit file, "submit.matlab" will be
#
# Submit a matlab job
#
executable = /usr/bin/cl-matlab
arguments = -nosplash -nojvm -nodesktop -nodisplay -r matscripttest
universe = vanilla
# MATLAB needs local environment
getenv = True
log = mat.log
output = mat.out
error = mat.err
queue 1
Note the getenv = True - without it matlab will core dump!
Note also that the executable given is the full path name. Even if matlab is on your PATH you need to give the full pathname or condor will assume it is an executable in the current working directory, and condor_submit will report an error when it can't find it.
5.2.2.2.1 Matlab Licencing Issues
If all goes well, once the job has completed the file "mat.out" will contain the usual matlab preamble:
< M A T L A B >
Copyright 1984-2006 The MathWorks, Inc.
Version 7.3.0.298 (R2006b)
August 03, 2006
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
If all does not go well you might see the following in an output file:
License Manager Error -4.
Maximum number of users for MATLAB reached.
Try again later.
To see a list of current users use the lmstat utility.
Every instance of matlab that you run consumes a licence. As we have a limited number of such licences (they are very expensive), it is quite likely that using condor to run multiple simultaneous matlab jobs will hit this problem. Check with lmstat -a before you condor_submit that there are sufficient licences for the number of matlab-executing job clusters that you wish to run.
Alternatively, in some instances it may be possible to compile your matlab code using the Matlab Compiler. When you do the compilation you will consume a licence (both for matlab and for the compiler); however, when you subsequently run the compiled code no licence is consumed. Thus you can run as many instances of the compiled code as you wish without having to worry about the number of licences available. See the matlab internal documentation for details of the compiler, and 5.2.2.2.2 Matlab Compilation.
Note that because we run the condor daemons as user "condor", that is the user name which will be displayed by lmstat. If more than one user is running matlab under condor at the same time then you will have to use condor_q -global to determine which machine is executing your process and look for that machine name in the output of lmstat.
5.2.2.2.2 Matlab Compilation
As noted in 5.2.2.2.1, compiling consumes a licence (both for matlab and for the compiler) but running the compiled code does not. See the matlab internal documentation for details of the compiler.
Compilation is, apparently, somewhat fiddly to get working, as all sorts of environment variables have to be set at both compile and run time. The following scripts have been shown to work, although on a previous version of both condor and matlab, so details may have changed (thanks to Chris Town). The trick is to get the script run by the Condor submit job to call Matlab and execute a Matlab script to compile the code, then exit Matlab and run the compiled Matlab program.
Insert your own values of /Some/pathto in several places.
The Submit file:
executable = /Some/pathto/runit
universe = vanilla
initialdir = /Some/pathto/
# MATLAB needs local environment
getenv = True
log = matrunit.log
output = matrunit.out
error = matrunit.err
queue
runit executable (as we have moved to a later version of matlab some of the library paths will need tweaking):
#!/bin/bash
cd /Some/pathto
export HOME=/Some/pathto
export MATLAB_ROOT=/usr/groups/matlab/matlab14.3
export LD_LIBRARY_PATH=$MATLAB_ROOT/sys/os/glnxa64:$MATLAB_ROOT/bin/glnxa64:$MATLAB_ROOT/sys/opengl/lib/glnxa64:$MATLAB_ROOT/sys/java/jre/glnxa64/jre1.4.2/lib/amd64/:$LD_LIBRARY_PATH
export XAPPLRESDIR=$MATLAB_ROOT/X11/app-defaults
export ARCH=glnxa64
export DISPLAY=localhost:0
MATBIN=$HOME/realmatlabrunit
/usr/bin/matlab -nosplash -nojvm -nodesktop -nodisplay -r compileit
/Some/pathto/realmatlabrunit
compileit.m matlab script:
mcc -mv realmatlabrunit -R -nojvm
exit
5.2.2.2.3. More matlab - a slight variant
A slight variant on the procedure in 5.2.2.2. Matlab is to create a small shell script, eg "matscripttest.sh", as a wrapper to matlab:
#! /bin/sh
cl-matlab -nosplash -nojvm -nodesktop -nodisplay -r "matscripttest"
NB The initial "#! /bin/sh" is necessary to prevent obscure error messages.
The submit file would be similar to the above, but the executable line would then be
executable = matscripttest.sh
and the arguments line would not be needed.
5.3. Java Universe
A program submitted to the Java universe may run on any sort of machine with a JVM regardless of its location, owner, or JVM version. Condor will take care of all the details such as finding the JVM binary and setting the classpath.
The command condor_status -java will list those machines known to have a JVM installed.
Unfortunately, because of the way Fedora Core 6 handles RPMs, there will be two versions of the java command installed on a machine - you will need to be sure which one you are using. By default FC6 installs version 1.4 of the Java runtime environment, so if you just use "java" (which will probably be /usr/bin/java, depending on your PATH settings) then it will be 1.4. It doesn't install the java development kit, so by default there is no javac. We have installed the jdk in /usr/java/default (currently version 1.6 by user request). FC6 notices that this is installed and so automatically creates a link into it from /usr/bin for javac, but because it already has a java it ignores the version of java that it finds there. Thus we end up with /usr/bin/java being 1.4 and /usr/bin/javac being 1.6 ! If you need both to be the same version you should explicitly use /usr/java/default/bin/java.
The JVM's default maximum heap size is "1/4 of memory, up to 1GB" (thus 1GB on all current machines). This default applies unless the heap size is set explicitly using the java maxheap flag -Xmx.
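The java/javac version mismatch described above is easy to check for before compiling anything. The following sketch reports which java and javac are found on your PATH, alongside the JDK copy in /usr/java/default, together with the first line of each one's version output. It makes no assumption about which versions are installed and simply prints what it finds.

```shell
#!/bin/sh
# Report which java and javac are in use, so that a runtime of one
# version is not paired with a compiler of another by accident.
# /usr/java/default is the JDK location given in this section.
report=""
for cmd in java javac /usr/java/default/bin/java; do
    if command -v "$cmd" >/dev/null 2>&1; then
        # 2>&1 because version output commonly goes to stderr.
        line="$cmd -> $(command -v "$cmd"): $("$cmd" -version 2>&1 | head -n 1)"
    else
        line="$cmd: not found"
    fi
    report="$report$line
"
done
printf '%s' "$report"
```

If the first two lines show different versions, use /usr/java/default/bin/java explicitly as suggested above.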
5.3.1. A Java Universe "Howto"
- Create a suitable directory and cd into it.
- Make sure the directory has suitable permissions set (see 1.2. Condor in the Computer Lab)
- Move your files into the directory.
- Make sure you have the program condor_config_val on your PATH.
By default this lives in /opt/condor-6.8.3/bin, so
PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
- Check that the executable file does actually work
- Create a submit file
- Submit, using condor_submit
- Monitor the job's progress with the condor_q and condor_status commands
5.3.2. Examples
5.3.2.1. Hello World
The java file, "HelloWorldApp.java", will be:
/**
* The HelloWorldApp class implements an application that
* simply displays "Hello World!" to the standard output.
*/
class HelloWorldApp {
public static void main(String[] args) {
System.out.println("Hello World!"); // Display "Hello World!"
}
}
This will have been compiled with javac to create HelloWorldApp.class
The submit file, "submit.helloworldapp" will be
####################
#
# Execute a single Java class
#
####################
universe = java
executable = HelloWorldApp.class
arguments = HelloWorldApp
output = HelloWorldApp.output
error = HelloWorldApp.error
queue
For programs that consist of more than one .class file, an additional line in the submit description file will be needed to tell Condor about the additional files. For example:
transfer_input_files = TinkyWinky.class Dipsy.class LaaLaa.class Po.class
If the various class files have been combined into an archive (.jar) file then Condor must be told where to find it by adding something like the following to the submit description file:
jar_files = Teletubbies.jar
The two separate commands ("transfer_input_files" and "jar_files") are needed because the JVM will handle them differently.
5.4. Other Universes
The other universes available under condor which we do not currently support are:
- PVM Universe. The PVM universe allows programs written for the Parallel Virtual Machine interface to be used.
- MPI Universe. The MPI universe allows programs written to the MPICH interface to be used. Note: we have attempted to support this Universe, but it does not work under our current setup for reasons unknown.
- Globus Universe. The Globus universe in Condor is intended to provide the standard Condor interface to users who wish to start Globus system jobs from Condor.
- Scheduler Universe. The scheduler universe allows a Condor job to be submitted and executed with different assumptions for the execution conditions of the job. The job does not wait to be matched with a machine. It instead executes right away, on the machine where the job is submitted. The job will never be preempted. The machine requirements are not considered for scheduler universe jobs.
6. Summary of Useful Condor Commands
- condor_submit is the program for submitting jobs for execution under Condor.
- condor_q displays information about jobs in the Condor job queue. Use the -global option to see the job queues on all the submit machines in the pool.
- condor_status may be used to monitor and query resource information, submitter information, checkpoint server information, and daemon master information for the Condor pool.
- condor_prio changes the priority of one or more jobs in the condor queue.
- condor_userprio with no arguments, lists the active users (see below) along with their priorities, in increasing priority order. The -all option can be used to display more detailed information.
- condor_history displays a summary of all condor jobs listed in the specified history files. If no history files are specified then the local history file as specified in Condor's configuration file is read.
- condor_rm removes one or more jobs from the Condor job queue.
- condor_compile relinks a program with the Condor libraries for submission into Condor's Standard Universe.
7. Glossary
- Pool The collection of inter-networked machines running Condor and controlled by a particular manager is known as a pool.
- Submit machine Submit machines start Condor jobs.
- Execute machine Execute machines run the Condor jobs.
- Central Manager The Manager machine is the collector of information.
- Checkpoint server This is a single centralized machine that stores all the checkpoint files for the jobs.
- Universe Condor has several runtime environments (called a universe) from which to choose. The standard universe allows a job running under Condor to handle system calls by returning them to the machine where the job was submitted. The standard universe also provides the mechanisms necessary to take a checkpoint and migrate a partially completed job, should the machine on which the job is executing become unavailable. The vanilla universe provides a way to run jobs that cannot be relinked. There is no way to take a checkpoint or migrate a job executed under the vanilla universe.
- Submit Description File This controls the details of job submission.
- Job Cluster A cluster is a set of jobs specified in the submit description file between queue commands for which the executable is not changed. Each cluster has a "cluster ID" and, within each cluster, each job has a "process ID".
- ClassAd A ClassAd is a data structure used by Condor to store job or machine information. Condor's ClassAds are analogous to the classified advertising section of the newspaper. Condor plays the role of a matchmaker by continuously reading all the job ClassAds and all the machine ClassAds, matching and ranking job ads with machine ads.
- Checkpoint Checkpointing is taking a snapshot of the current state of a program in such a way that the program can be restarted from that state at a later time. Checkpointing gives the Condor scheduler the freedom to reconsider scheduling decisions through preemptive-resume scheduling. If the scheduler decides to no longer allocate a machine to a job (for example, when the owner of that machine returns), it can checkpoint the job and preempt it without losing the work the job has already accomplished. The job can be resumed later when the scheduler allocates it a new machine. Additionally, periodic checkpointing provides fault tolerance in Condor. Snapshots are taken periodically, and after an interruption in service the program can continue from the most recent snapshot.
