Computer Laboratory

Condor - local guide

1. Introduction

This document is intended to provide a quick introduction to Condor, to introduce the most important concepts, and to provide examples to allow users to start using Condor as quickly as possible. Although this document is quite large, it should be less intimidating than the Condor Reference Manual!

For any more advanced use, however, you will have to refer to the vast Condor Reference Manual.

You may also wish to look at the Condor Project Homepage

1.1. What is Condor?

Condor is a specialized batch system for managing compute-intensive jobs. Users submit their compute jobs to Condor, Condor puts the jobs in a queue, runs them, and then informs the user as to the result. The collection of inter-networked machines running Condor and controlled by a particular manager is known as a pool. Like most batch systems, Condor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications.

In slightly more detail: a user submits the job to Condor from one of a number of Submit machines. Condor finds an available Execute machine from the pool and begins running the job on that machine. Condor has the capability to detect that a machine running a Condor job is no longer available (perhaps because the owner of the machine came back from lunch and started typing on the keyboard). It might be able to checkpoint the job and move (migrate) the jobs to a different machine which would otherwise be idle. If it has been able to checkpoint the job then Condor continues the job on the new machine from precisely where it left off.

Condor does not require an account (login) on machines where it runs a job. Condor can do this because it uses remote system calls which trap library calls for such operations as reading or writing from disk files. The calls are transmitted over the network to be performed on the machine where the job was submitted.

Every machine in a Condor pool can serve a variety of roles, and most machines will serve more than one role simultaneously, although certain roles can only be performed by single machines in a pool. The following list describes the 4 different roles:

  • Central Manager
    The Manager machine is the collector of information, and the negotiator between resources and resource requests. There is only one central manager for a pool.
  • Submit
    Submit machines queue Condor jobs. There may be more than one. Users will only need to be able to log in to these.
  • Execute
    Execute machines actually run the Condor jobs. There may be more than one.
  • Checkpoint Server
    The checkpoint server is a centralized machine that stores all the checkpoint files for the jobs submitted in the pool. Only one machine in a pool can be configured as a checkpoint server, and its presence is optional.

1.2. Condor in the Computer Lab

NB. If you wish to start using Condor please email sysadmin first. We can advise on whether any of the pool machines are currently being used for other projects. We would also like to keep a general eye on how condor is being used so we can assess demand.

The Computer Lab pool consists of a number of virtual machines running under Xen Enterprise. They generally use spare resources, having lower priority than other VMs. To see condor's view of running machines, use condor_status, and to see XenE's view of running and halted machines, use cl-condor-list. The number of CPUs (normally up to 8) and the amount of memory (typically up to 30GB) are specified when a machine is started using the cl-condor-start command. For example, to start the machine pb030 with 2 CPUs and 7GB of RAM use:

cl-condor-start pb030 2 7000000000
NB - please check the top level page to see if there are any special notices concerning machine availability. The Xen Admin Wiki may have some info on the latest pool usage. People wishing to use the pool for running jobs under condor only need to be able to log in to the submit machines, which are condor-submit0 and condor-submit1, together known as condor-submit.

We have chosen to support only the Standard, Vanilla and Java Universes on our Condor pool.

We have decided not to allow preemption on the local pool (this is not the default behaviour - see 4.1.2. User Priority for details).

Before running any of the condor commands mentioned in these notes you will need to make sure you have the condor programs on your PATH:

PATH=/opt/condor-6.8.3/bin:$PATH;export PATH

1.2.1 Etiquette

In the past people have tended to run condor in many different ways: long jobs (weeks even), short jobs, a handful at a time, or thousands of jobs at once. Because we have had to turn off job preemption, it is now possible for a single user to occupy the entire pool for long periods, preventing other people from getting any jobs to run. Condor does not have sophisticated scheduling mechanisms, so there is not much we can do about this!

We have decided to adopt the policy that jobs should aim to finish within a couple of hours - anyone requiring jobs over 24h should contact sys-admin.

The time quantum of the scheduler appears to be five minutes. If a large number of jobs finish in less than five minutes, the scheduler ceases to function, and needs to be manually restarted by a sys admin. As such, ensure that jobs do not take less than five minutes. If they might, arrange to run several such steps in a single job.
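One way to keep very short steps out of the queue is to wrap several of them in a single shell script and submit that script as the job's executable. The sketch below assumes placeholder names (run_steps, my_short_step, chunk_*); substitute your real commands and inputs:

```shell
#!/bin/sh
# Batch several short steps into one Condor job so that the job
# as a whole runs for more than the five-minute scheduler quantum.
# run_steps and my_short_step are placeholders for your real work.
run_steps () {
    for input in "$@"
    do
        echo "processing $input"
        # ./my_short_step "$input"    # real per-step command goes here
    done
    echo "all steps done"
}

run_steps chunk_00 chunk_01 chunk_02 chunk_03
```

The script itself then becomes the executable named in the submit description file.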

Do not create files in /tmp/ as they might fail to be deleted, causing the filing system to fill up, which causes condor to consider the machine unavailable for running jobs. Instead use /anfs/bigdisc/$USER/ or /anfs/bigtmp/, and tidy them up. Many applications which create temporary files will use $TMPDIR or similar for such files.

Users should add the line "nice_user = True" to their jobs as a matter of course. This ensures that when a new job is to be started, their jobs will only be considered if there are no other jobs waiting to run. This means that if a user has submitted a large batch of jobs, other users can still submit a small number of non-nice jobs. This only works if most waiting jobs are niced. Note that even niced jobs, while they are running, stop other jobs from starting, so please try to ensure most jobs do complete within a couple of hours, even if this wastes a small amount of time starting up etc.
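For example, a minimal submit description file for a nice job might look like this (the executable and file names are placeholders):

```
# Submit description file for a "nice" (bottom-priority) job
Executable  = myprog
Universe    = vanilla
Output      = myprog.out
Log         = myprog.log
nice_user   = True
Queue
```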

2. Using Condor

Here are all the steps needed to run a job using Condor (remember to set CONDOR_CONFIG and PATH beforehand, as mentioned above):
  1. Code Preparation.
    A job run under Condor must be able to run as a background batch job. Condor runs the program unattended and in the background. A program that runs in the background will not be able to do interactive input and output. Condor can redirect console output (stdout and stderr) and keyboard input (stdin) to and from files for you. Create any needed files that contain the proper keystrokes needed for program input. Make certain the program/script will run correctly with these files on the submit machine.
  2. The Condor Universe.
    Condor has several runtime environments (called a universe) from which to choose. Of the universes, the most important are the Standard Universe, the Vanilla Universe and the Java Universe. If your job is written in C, C++ or fortran, and can be linked with the Condor library then run it in the standard universe, otherwise use the vanilla or java universes. Choose a universe under which to run your job, and re-link the program if necessary.
  3. Submit description file.
    A submit description file controls the details of job submission. The file will contain information about the job such as what executable to run, the files to use for keyboard and screen data, the platform type required to run the program, and where to send e-mail when the job completes. You can also tell Condor how many times to run a program; it is simple to run the same program multiple times with multiple data sets. Write a submit description file to go with the job, using the syntax description and some illustrative examples here, here or here.
  4. Submit the Job.
    Login to a submit machine (see above) and submit the program to Condor with the condor_submit command.
Once submitted, Condor does the rest. You can monitor the progress of the job with the condor_q and condor_status commands. Note that without options condor_q will only tell you about the machine on which you run it - if you have submitted jobs from another machine you should use condor_q -global. You may modify the order in which Condor will run your jobs with condor_prio. If desired, Condor can even inform you in a log file every time your job is checkpointed and/or migrated to a different machine.

When your program completes, Condor will tell you the exit status of your program and various statistics about its performance, including time used and I/O performed. If you are using a log file for the job (which is recommended) the exit status will be recorded in the log file. Alternatively you can view the history file for the job by typing condor_history, which will show something like:

 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
   1.0   condor          6/13 10:58   0+00:00:00 C  0   0.9  job_blah         
Notice that the status ("ST") is now C, for completed.

You can remove a job from the queue prematurely with condor_rm.

2.1 Problems ?

If you haven't already, you should check that your program or script works correctly on the condor-submit machine, as set out in the 2. Using Condor section above. If it works on condor-submit but not under condor, log in to "pb" and try it there. If it works there but not under condor, contact sys-admin.

3. Submit File syntax

A submit description file controls the details of job submission. The syntax is simple; a list of the most important entries, grouped by concept, follows. This is by no means a full list (for that, see the condor_submit man page); this selection is intended mainly to make it easier to understand the examples given in other sections.

Blank lines and lines beginning with a pound sign (#) character are ignored by the submit description file parser, and so may be used for comments.

3.1. Basic entries

  • executable =name
    The name of the executable file for this job cluster (for a definition of a job cluster see this example).
  • arguments =argument_list
    List of arguments to be supplied to the program named as the executable on its command line.
  • input =pathname
  • output =pathname
  • error =pathname
    Condor assumes that its jobs are long-running, and that the user will not wait at the terminal for their completion. Because of this, the standard files which normally access the terminal, (stdin, stdout, and stderr), must refer to files. Thus, the file name specified with input should contain any keyboard input the program requires (that is, this file becomes stdin). Likewise with output and error. If not specified, the default value of /dev/null is used for submission to a Unix machine.
  • universe =vanilla | standard | java
    Specifies which Condor Universe to use when running this job.
  • initialdir =directory-path
    Used to give jobs a directory with respect to file input and output. Also provides a directory (on the submit machine) for the user log.
  • log =pathname
    Use log to specify a file name where Condor will write a log file of what is happening with this job cluster. For example, Condor will log into this file when and where the job begins running, when the job is checkpointed and/or migrated, when the job completes, etc. Most users find specifying a log file to be very handy; its use is recommended. If no log entry is specified, Condor does not create a log for this cluster.
  • queue [number-of-procs]
    Places one or more copies of the job into the Condor queue. The optional argument number-of-procs specifies how many times to submit the job to the queue, and it defaults to 1. If desired, any commands may be placed between subsequent queue commands, such as new input, output, error, initialdir, arguments, or executable commands. This is handy when submitting multiple runs into one cluster with one submit description file. Multiple clusters may be specified within a single submit description file by changing the executable between queue commands. Each time the executable command is issued (between queue commands), a new cluster is defined.
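As a sketch of the cluster behaviour described above (program and file names here are placeholders), the following submit description file defines two clusters: the first queues prog_A twice with different inputs, and a second cluster begins when the executable changes to prog_B:

```
# Cluster 1: two runs of prog_A with different inputs
Executable = prog_A
Universe   = standard
Input      = run1.data
Output     = run1.out
Queue

Input      = run2.data
Output     = run2.out
Queue

# Changing the executable starts a new cluster (cluster 2)
Executable = prog_B
Input      = prog_B.data
Output     = prog_B.out
Queue
```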

3.2. Job Ordering and location

  • priority =priority
    Condor job priorities range from -20 to +20, with 0 being the default. Jobs with higher numerical priority will run before jobs with lower numerical priority. Note that this priority is on a per user basis; setting the priority will determine the order in which your own jobs are executed, but will have no effect on whether or not your jobs will run ahead of another user's jobs. See 4.1 Priority.
  • nice_user =True | False
Normally, when a machine becomes available to Condor, Condor decides which job to run based upon user and job priorities. Setting nice_user equal to True tells Condor not to use your regular user priority, but that this job should have last priority among all users and all jobs. So jobs submitted in this fashion run only on machines which no other non-nice_user job wants - a true "bottom-feeder" job! This is very handy if a user has some jobs they wish to run, but do not wish to use resources that could instead be used to run other people's Condor jobs. Jobs submitted in this fashion have "nice-user." prepended to the owner name when viewed from condor_q or condor_userprio. The default value is False.
  • requirements =Boolean Expression
The requirements command is a boolean expression which uses C-like operators. In order for any job in this cluster to run on a given machine, this requirements expression must evaluate to true on the given machine. For example, to require that whatever machine executes your program has at least 64 MB of RAM and has a MIPS performance rating greater than 45, use:
            requirements = Memory >= 64 && Mips > 45
    
    Only one requirements command may be present in a submit description file. Unless you request otherwise, Condor will by default give your job to machines with the same architecture and operating system version as the machine running condor_submit. See 4.2 Machine Attributes.
  • rank =Float Expression
    The argument is a Floating-Point expression that states how to rank machines which have already met the requirements expression. Essentially, rank expresses preference. A higher numeric value equals better rank. Condor will give the job to the machine with the highest rank. For example,
            requirements = Memory > 60
            rank = Memory
    
    asks Condor to find all available machines with more than 60 megabytes of memory and give the job to the one with the most memory. See 4.3 Ranking.

3.3. File Handling

  • fetch_files = file1, file2, ...
    If your job attempts to access a file mentioned in this list, Condor will automatically copy the whole file to the executing machine, where it can be accessed quickly. When your job closes the file, it will be copied back to its original location. This option only applies to standard-universe jobs.
  • append_files = file1, file2, ...
    If your job attempts to access a file mentioned in this list, Condor will force all writes to that file to be appended to the end. Furthermore, condor_submit will not truncate it. This option may yield some surprising results. If several jobs attempt to write to the same file, their output may be intermixed. If a job is evicted from one or more machines during the course of its lifetime, such an output file might contain several copies of the results. This option should be only be used when you wish a certain file to be treated as a running log instead of a precise result. This option only applies to standard-universe jobs.
  • local_files = file1, file2, ...
    If your job attempts to access a file mentioned in this list, Condor will cause it to be read or written at the execution machine. This is most useful for temporary files not used for input or output. This option only applies to standard-universe jobs.

3.4. Job Information

  • notification =when
    Owners of Condor jobs are notified by email when certain events occur. If when is set to Always, the owner will be notified whenever the job is checkpointed, and when it completes. If when is set to Complete (the default), the owner will be notified when the job terminates. If when is set to Error, the owner will only be notified if the job terminates abnormally. If when is set to Never, the owner will not be mailed, regardless of what happens to the job.
  • notify_user =email-address
    Used to specify the email address to use when Condor sends email about a job. If not specified, Condor will default to using job-owner@UID_DOMAIN where UID_DOMAIN is specified by the Condor site administrator.

3.5. Environment

  • environment =parameter_list
    A list of environment variables which will be placed (as given) into the job's environment before execution. The list is of the form <parameter>=<value>. Multiple environment variables can be specified by separating them with a semicolon (;) when submitting from a Unix platform. The length of the list specified in the environment is currently limited to 10240 characters.
  • getenv =True | False
    If getenv is set to True, then condor_submit will copy all of the user's current shell environment variables at the time of job submission into the job description. The job will therefore execute with the same set of environment variables that the user had at submit time. Defaults to False.
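A short illustrative fragment (the variable names are made up for the example):

```
# Pass two environment variables explicitly...
environment = TMPDIR=/anfs/bigtmp;DEBUG_LEVEL=2
# ...or alternatively copy the whole submit-time environment:
# getenv = True
```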

3.6. Macros

Parameterless macros in the form of $(macro_name) may be inserted anywhere in Condor submit description files. Macros can be defined by lines in the form of
        <macro_name> = <string>
Two pre-defined macros are supplied by the submit description file parser. The $(Cluster) macro supplies the number of the job cluster, and the $(Process) macro supplies the number of the job. These macros are intended to aid in the specification of input/output files, arguments, etc., for clusters with lots of jobs, and/or could be used to supply a Condor process with its own cluster and process numbers on the command line. For an example see 5.1.2.2. Multiple Submission - Different Inputs.
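For instance, a cluster of 10 jobs could be given per-job input and output files using $(Process), as sketched below with placeholder file names:

```
# Queue 10 jobs; each gets its own files, e.g. in.0/out.0
# for the first job (process 0) of the cluster.
Executable = myprog
Input      = in.$(Process)
Output     = out.$(Process)
Error      = err.$(Process)
Log        = cluster$(Cluster).log
Queue 10
```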

In addition to the normal macro, there is also a special kind of macro called a substitution macro that allows you to substitute expressions defined on the resource machine itself (gotten after a match to the machine has been performed) into specific expressions in your submit description file. The special substitution macro is of the form $$(attribute). It may only be used in three expressions in the submit description file: executable, environment, and arguments. Example:

          executable = myprog.$$(opsys).$$(arch)
The opsys and arch attributes will be substituted at match time for any given resource. This will allow Condor to automatically choose the correct executable for the matched machine.

The environment macro, $ENV, allows the evaluation of an environment variable to be used in setting a submit description file command. The syntax used is

          $ENV(variable)
For example:
          log = $ENV(HOME)/jobs/logfile

4. Job scheduling - Priority, Requirements and Rank

The scheduling arrangements adopted by condor control when and on which machine your jobs are run. Priority (both per-job and per-user) determine when a job will be run, ranking (which uses requirements and machine attributes) may be used to determine where a job is run.

All machines in a Condor pool advertise their attributes, such as available RAM memory, CPU type and speed, virtual memory size, current load average, along with other static and dynamic properties. This machine information also includes under what conditions a machine is willing to run a Condor job and what type of job it would prefer.

Likewise, when submitting a job, you can specify your requirements and preferences, for example, the type of machine you wish to use. You can also specify an attribute, for example, floating point performance, and have Condor automatically rank the available machines according to their values for this attribute. Condor plays the role of a matchmaker by continuously reading all the job requirements and all the machine information, matching and ranking jobs with machines.

4.1 Priority

4.1.1 Job Priority

Job priorities allow the assignment of a priority level to each submitted Condor job in order to control order of execution - note that these are priorities between jobs of the same user only. To set a job priority, use the condor_prio command, or use the priority command in your submit description file. Job priorities do not impact user priorities in any fashion. Job priorities range from -20 to +20, with -20 being the worst and with +20 being the best, 0 is the default.

4.1.2 User Priority

The default behaviour for Condor is to allocate machines to users based upon a user's priority - which changes according to the number of resources the individual is using. Condor enforces that each user gets his/her fair share of machines according to user priority both when allocating machines which become available and by priority preemption of currently allocated machines.

However, it was discovered that this did not really work in our environment. Most users are running vanilla universe jobs, which cannot be checkpointed. The preemption rules meant that heavy users were often having their jobs killed after many hours of running by new users starting their jobs. This tended to annoy people! It has therefore been decided to turn off preemption entirely on the local pool as an experiment. This may mean that jobs of new users sit idle for some time because of the long-running jobs of others. This may turn out to be equally unsuitable!

It is possible to submit a job as a "nice" job. Setting nice_user in your submit description file tells Condor not to use your regular user priority, but that this job should have last priority among all users and all jobs.

4.2 Machine Attributes

The attributes advertised by a machine can be seen with condor_status -l machine_name. Some of the listed attributes are used by Condor for scheduling. Other attributes are for information purposes. An important point is that any of the attributes in a machine can be utilized at job submission time as part of a request or preference on which machine to use. Additional attributes can be easily added.

For example, this is the output of condor_status -l for one processor of the machine pb001:

MyType = "Machine"
TargetType = "Job"
Name = "vm1@pb001.cl.cam.ac.uk"
Machine = "pb001.cl.cam.ac.uk"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pb001.cl.cam.ac.uk"
CondorVersion = "$CondorVersion: 6.6.7 Oct 11 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 0
Disk = 467172
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 25539262
ConsoleIdle = 25539262
Memory = 29994
Cpus = 1
StartdIpAddr = "<128.232.4.1:33071>"
Arch = "x86_64"
OpSys = "LINUX"
UidDomain = "cl.cam.ac.uk"
FileSystemDomain = "cl.cam.ac.uk"
Subnet = "128.232.4"
HasIOProxy = TRUE
TotalVirtualMemory = 0
TotalDisk = 934344
KFlops = 951601
Mips = 3370
LastBenchmark = 1103098732
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 678
ClockDay = 3
TotalVirtualMachines = 2
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.4.1_01"
JavaMFlops = 295.152039
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,
HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1103041577
Activity = "Idle"
EnteredCurrentActivity = 1103098732
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1103041099
UpdateSequenceNumber = 230
MyAddress = "<128.232.4.1:33071>"
LastHeardFrom = 1103109536
UpdatesTotal = 231
UpdatesSequenced = 230
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"

4.3 Ranking

When considering the match between a job and a machine, rank is used to choose a match from among all machines that satisfy the job's requirements and are available to the user, after accounting for the user's priority and the machine's rank of the job. The rank expressions, simple or complex, define a numerical value that expresses preferences.

The job's rank expression evaluates to one of three values. It can be UNDEFINED, ERROR, or a floating point value. If rank evaluates to a floating point value, the best match will be the one with the largest, positive value. If no rank is given in the submit description file, then Condor substitutes a default value of 0.0 when considering machines to match. If the job's rank of a given machine evaluates to UNDEFINED or ERROR, this same value of 0.0 is used. Therefore, the machine is still considered for a match, but has no rank above any other.

A boolean expression evaluates to the numerical value of 1.0 if true, and 0.0 if false.

Example 1: For a job that desires the machine with the most available memory:

          Rank = memory
Example 2: For a job that prefers to run on Saturdays and Sundays:
          Rank = ( (clockday == 0) || (clockday == 6) )
It is wise when writing a rank expression to check if the expression's evaluation will lead to the expected resulting ranking of machines. This can be accomplished using the condor_status command with the -constraint argument. This allows the user to see a list of machines that fit a constraint.

Example 1: To see which machines in the pool have kflops defined, use:

          condor_status -constraint kflops
Example 2: If this is typed on a Wednesday it will show all of the machines in the pool; on any other day it will show none:
          condor_status -constraint "(clockday == 3)"

5. Universes

A universe in Condor defines an execution environment. There are three main choices, and a host of others which we do not currently support on the CL pool.

5.1. Standard Universe

In the standard universe, Condor provides checkpointing and remote system calls. These features make a job more reliable and allow it uniform access to resources from anywhere in the pool.

Condor checkpoints a job at regular intervals. A checkpoint image is essentially a snapshot of the current state of a job. If a job must be migrated from one machine to another, Condor makes a checkpoint image, copies the image to the new machine, and restarts the job continuing the job from where it left off. If a machine should crash or fail while it is running a job, Condor can restart the job on a new machine using the most recent checkpoint image. In this way, jobs can run for months or years even in the face of occasional computer failures.

A job that is linked using condor_compile and is subsequently submitted into the standard universe will checkpoint and exit upon receipt of a SIGTSTP signal. The user's code may still checkpoint itself at any time by calling one of the following functions exported by the Condor libraries:

  • ckpt()
    Performs a checkpoint and then returns.
  • ckpt_and_exit()
    Checkpoints and exits; Condor will then restart the process again later, potentially on a different machine.

The standard universe allows a job running under Condor to handle system calls by returning them to the machine where the job was submitted. The standard universe also provides the mechanisms necessary to take a checkpoint and migrate a partially completed job, should the machine on which the job is executing become unavailable. To use the standard universe, it is necessary to relink the program with the Condor library using the condor_compile command, hence this universe is only appropriate if you have the C or C++ source code for the program.

5.1.1. A Standard Universe "How to"

  1. Create a suitable directory and cd into it.
  2. Make sure the directory has suitable permissions set (see 1.2. Condor in the Computer Lab)
  3. Move your code files into the directory.
  4. Make sure you have the program condor_config_val on your PATH. By default this lives in /opt/condor-6.8.3/bin, so
    PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
    
  5. Compile using condor_compile
  6. Create a submit file
  7. Submit, using condor_submit
  8. Monitor the job's progress with the condor_q and condor_status commands

5.1.2. Examples

5.1.2.1. A very simple job

Using the C program called hello.c:
          #include <stdio.h>

          main()
          {
            printf("hello, Condor\n");
            exit(0);
          }
The compilation instruction and resulting output is as follows:
            $ condor_compile gcc hello.c -o hello
            hello.c: In function ‘main’:
            hello.c:6: warning: incompatible implicit declaration of built-in function ‘exit’
            LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor-6.8.3/lib -Bstatic --eh-frame-hdr -m elf_x86_
            64 --hash-style=gnu -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o hello /opt/condor-6.8.3/li
            b/condor_rt0.o /usr/lib/gcc/x86_64-redhat-linux/4.1.1/../../../../lib64/crti.o /usr/lib/gcc/x
            86_64-redhat-linux/4.1.1/crtbeginT.o -L/opt/condor-6.8.3/lib -L/usr/lib/gcc/x86_64-redhat-lin
            ux/4.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.1 -L/usr/lib/gcc/x86_64-redhat-linux/4.1.1/..
            /../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 /tmp/ccrztRSX.o /opt/condor-6.8.3/lib/lib
            condorsyscall.a /opt/condor-6.8.3/lib/libcondor_z.a /opt/condor-6.8.3/lib/libcomp_libstdc++.a
             /opt/condor-6.8.3/lib/libcomp_libgcc.a /opt/condor-6.8.3/lib/libcomp_libgcc_eh.a --as-needed
             --no-as-needed -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lc
            ondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /opt/condor-6.8.3/lib/libcomp_lib
            gcc.a /opt/condor-6.8.3/lib/libcomp_libgcc_eh.a --as-needed --no-as-needed /usr/lib/gcc/x86_6
            4-redhat-linux/4.1.1/crtend.o /usr/lib/gcc/x86_64-redhat-linux/4.1.1/../../../../lib64/crtn.o
            /opt/condor-6.8.3/lib/libcondorsyscall.a(condor_file_agent.o): In function `CondorFileAgent::
            open(char const*, int, int)':
            (.text+0x29b): warning: the use of `tmpnam' is dangerous, better use `mkstemp'
            /opt/condor-6.8.3/lib/libcondorsyscall.a(switches.o): In function `__gets_chk':
            (.text+0xa4bb): warning: the `gets' function is dangerous and should not be used.
The submit file, submit.hello, is:
            ########################
            # Submit description file for hello program
            ########################
            Executable     = hello
            Universe       = standard
            Output         = hello.out
            Log            = hello.log 
            Queue 
The submit instruction and output will look something like this (note the warning message!):
            Submitting job(s)
            WARNING: Log file /auto/homes/ckh11/condortest/hello.log is on NFS.
            This could cause log file corruption and is _not_ recommended.
            .
            Logging submit event(s).
            1 job(s) submitted to cluster 57.
condor_q will say:
            $ condor_q

            -- Submitter: pb000.cl.cam.ac.uk : <127.0.0.1:59865> : pb000.cl.cam.ac.uk
            ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
            57.0   ckh11           2/1  11:23   0+00:00:00 R  0   9.8  hello             

            1 jobs; 0 idle, 1 running, 0 held
The log file, hello.log, will show (something similar to):
000 (057.000.000) 02/01 11:23:57 Job submitted from host: <127.0.0.1:59865>
...
001 (057.000.000) 02/01 11:24:31 Job executing on host: <127.0.0.1:34755>
...
005 (057.000.000) 02/01 11:24:31 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        816  -  Run Bytes Sent By Job
        1702035  -  Run Bytes Received By Job
        816  -  Total Bytes Sent By Job
        1702035  -  Total Bytes Received By Job
...
The output file, hello.out, will contain:
          hello, Condor

5.1.2.2. Multiple Submission - Different Inputs

A common situation has one executable that is executed many times, each time with a different input set. This is called a job cluster. Each cluster has a "cluster ID", and within each cluster each job has a "process ID". If the program expects its input in a file with a fixed name, then the solution of choice is to run each queued job in its own directory.

This particular example outputs the number of characters in an input file named mult_job_input. There are 5 different input files, so we need 5 jobs. Because the program uses a fixed name for its input file we do not need to specify an input in the submit description file. The 5 different but identically named input files are prestaged in 5 directories before submitting the job. The directories are named job.0, job.1, job.2, job.3 and job.4. In addition to the input file, each directory will receive its own output in a file called mult_job_output and its own error messages in mult_job_error, and Condor will log each job's progress in the file mult_job_log.

The submit file, submit.mult_job, is:

            ####################                    
            # Multiple jobs queued, each in its own directory
            ####################                                                    

            universe = standard
            executable = mult_job
            output = mult_job_output
            error = mult_job_error
            log = mult_job_log
            initialdir = job.$(Process)
            queue 5
Note the initialdir line: it uses the simple $(Process) macro to give a different directory name for each job to be queued.

The program source, mult_job.c, is:

          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
            FILE *in;
            int ch;          /* int, not char, so the EOF test is reliable */
            int i = 0;

            if ((in = fopen("mult_job_input", "r")) == NULL) {
              printf("Can't open mult_job_input\n");
              exit(1);
            }

            while ((ch = getc(in)) != EOF) { i++; }   /* count characters */

            printf("i is %d\n", i);
            exit(0);
          }

The compile instruction is:

          condor_compile gcc -o mult_job mult_job.c
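The five directories and their input files can be prestaged with a short shell loop. The sketch below generates sample input data; in practice you would copy your real input files into each job.N/mult_job_input instead:

```shell
# Prestage one directory per queued job, matching initialdir = job.$(Process).
# The file contents here are placeholders; substitute your real data.
for i in 0 1 2 3 4; do
    mkdir -p "job.$i"
    printf 'sample input for job %s\n' "$i" > "job.$i/mult_job_input"
done
```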
Having set up the directories and input files, the submit instruction and output is:
$ condor_submit submit.mult_job
Submitting job(s)
WARNING: Log file /auto/homes/ckh11/condortest/job.0/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.1/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.2/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.3/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
WARNING: Log file /auto/homes/ckh11/condortest/job.4/mult_job_log is on NFS.
This could cause log file corruption and is _not_ recommended.
.
Logging submit event(s).....
5 job(s) submitted to cluster 60.
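Once the cluster has finished, the per-directory results can be gathered with a small helper script. The function name collect_results is illustrative; the file names are those used in the example above:

```shell
# Print each job's output, prefixed with its directory name.
collect_results() {
    for d in job.*; do
        [ -f "$d/mult_job_output" ] || continue
        printf '%s: %s\n' "$d" "$(cat "$d/mult_job_output")"
    done
}
```

For example, after the jobs complete, running collect_results prints one line per directory, such as "job.0: i is 25".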

5.1.2.3. Multiple Submission - Different Arguments

This example queues three jobs for execution by Condor. The first will be given command line arguments of 15 and 20, and it will write its standard output to msda.out1. The second will be given command line arguments of 30 and 20, and it will write its standard output to msda.out2. Similarly the third will have arguments of 45 and 60, and it will use msda.out3 for its standard output.

The submit file, submit.msda, is:

          ####################
          #
          # Different command line arguments and output files.
          #                                                                      
          ####################                                                   
                                                                         
          executable     = msda                                                   
          universe       = standard
                                                                         
          arguments      = 15 20                                               
          output  = msda.out1                                                     
          error   = msda.err1
          queue                                                                  
                                                                         
          arguments      = 30 20                                               
          output  = msda.out2                                                     
          error   = msda.err2
          queue                                                                  
                                                                         
          arguments      = 45 60                                               
          output  = msda.out3                                                     
          error   = msda.err3
          queue
The source for msda is not given as it is trivial - it adds its two arguments and writes the sum to stdout.

The compile command is as in previous examples, and the submit instruction and output is:

          condor_submit submit.msda
          Submitting job(s)...
          3 job(s) submitted to cluster 61.
Note that this time it does not mention logging, as we did not specify a log file.

5.1.2.4. Intentional Checkpointing

A job that is linked using condor_compile and is subsequently submitted into the standard universe will checkpoint and exit upon receipt of a SIGTSTP signal. The user's code may also checkpoint itself at any time by calling the condor library function ckpt() (or the similar ckpt_and_exit()); ckpt() simply performs a checkpoint and then returns.

It is wise to make the checkpoint call conditional so that you can check that your code compiles correctly without it. For example, the program checkplease.c:

          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
            printf("hello\n");

          #ifdef CONDOR
            ckpt();          /* supplied by the Condor library */
          #endif

            printf("world\n");
            exit(0);
          }
The compilation step for condor submission would therefore be:
          condor_compile gcc -DCONDOR -m32 -static-libgcc -o checkplease checkplease.c
The submit file is trivial and is not shown. After the run, the log shows the checkpoint:
          000 (062.000.000) 02/01 11:57:56 Job submitted from host: <127.0.0.1:59865>
          ...
          001 (062.000.000) 02/01 11:58:00 Job executing on host: <127.0.0.1:34755>
          ...
          006 (062.000.000) 02/01 11:59:00 Image size of job updated: 10798
          ...
          003 (062.000.000) 02/01 11:59:00 Job was checkpointed.
                  Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
          ...
          005 (062.000.000) 02/01 11:59:00 Job terminated.
                  (1) Normal termination (return value 0)
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                          Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
                  817613  -  Run Bytes Sent By Job
                  1702390  -  Run Bytes Received By Job
                  817613  -  Total Bytes Sent By Job
                  1702390  -  Total Bytes Received By Job
          ...

5.2. Vanilla Universe

The vanilla universe in Condor is intended for programs which cannot be successfully re-linked. Shell scripts are another case where the vanilla universe is useful. Unfortunately, jobs run under the vanilla universe cannot checkpoint or use remote system calls. This has unfortunate consequences for a job that is partially completed when the remote machine running a job must be returned to its owner. Condor has only two choices. It can suspend the job, hoping to complete it at a later time, or it can give up and restart the job from the beginning on another machine in the pool.

Under Unix, Condor presumes a shared file system for vanilla jobs. If you are using a non-Unix or mixed machine pool you may need to know about the file transfer mechanism.

5.2.1. A Vanilla Universe "Howto"

  1. Create a suitable directory and cd into it.
  2. Make sure the directory has suitable permissions set (see 1.2. Condor in the Computer Lab)
  3. Move your files into the directory.
  4. Make sure you have the program condor_config_val on your PATH. By default this lives in /opt/condor-6.8.3/bin, so
    PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
    
  5. Check that the executable file does actually work
  6. Create a submit file
  7. Submit, using condor_submit
  8. Monitor the job's progress with the condor_q and condor_status commands

5.2.2. Examples

Some of the standard universe examples above are just as relevant in the vanilla universe: see Multiple Submission - Different Inputs and Multiple Submission - Different Arguments.

5.2.2.1. Simple shell script

Any program can be run as a vanilla job, including shell scripts. The script "doloop" loops, printing a number and then sleeping for a second. At the end, doloop.out should contain the values from 0 to 10 and the message "Normal End-of-Job".

The script, "doloop" is:

          #!/bin/bash
          limit=${1:-10}   # loop bound, taken from the first argument (default 10)
          x=0              # initialize x to 0
          while [ "$x" -le "$limit" ]; do
              echo "$x"
              # increment the value of x:
              x=$(expr $x + 1)
              sleep 1
          done
          echo "Normal End-of-Job"
The submit file, "submit.doloop", is
          ####################
          ##
          ## Vanilla script test
          ##
          ####################

          universe        = vanilla
          executable      = doloop
          output          = doloop.out
          error           = doloop.err
          log             = doloop.log
          arguments       = 10
          queue

5.2.2.2. Matlab

Note: from 29.11.2013 a new version of matlab has been installed for general use which is incompatible with certain libraries installed on the condor pool machines. As a temporary fix you should use the older version of matlab with condor. This means invoking /usr/groups/matlab/previous/bin/matlab (instead of cl-matlab), and you also need to add the option -c /usr/groups/matlab/previous/etc/license.dat to pick up the correct license file.

Matlab cannot be relinked with the Condor library (unless you want to try 5.2.2.2.2 Matlab Compilation, but that is not straightforward and may not always be feasible), so it has to be run in the vanilla universe. The following example shows matlab running a simple script file (also often known as an M-file). A script file is an external file that contains a sequence of matlab statements; it can be executed interactively in matlab simply by typing its name (without the extension) at the prompt. Under condor, however, matlab cannot be run interactively, so the script file needs to be executed from the command line using matlab's -r option.

It is also necessary to use the -nosplash, -nojvm and -nodesktop matlab options to prevent unwanted windows from appearing. Matlab will still try to open a display connection even if we don't want any windows to appear. Normally this would not be a problem, but because we run the condor daemons as user "condor" instead of root there can be authentication issues, so an option such as -display yourhostname:0 or -nodisplay is also needed (the latter will result in some warning messages about broken X connections in your error file, which can be ignored).

Running the condor daemons as user "condor" instead of root can also cause file ownership problems in this particular example (see 1.2. Condor in the Computer Lab): because we write to a file which will be owned by user "condor", we have to make the working directory world-writeable.

Under Fedora Core 6 (as used on the processor bank machines) matlab is installed as cl-matlab.

The script file "matscripttest.m" in this example is:

          load a.dat;
          load b.dat;
          matrR = a * b;
          save matrR.dat;
          exit;
Note the final exit - without it the script will never finish and the job will hang. The files a.dat and b.dat must preexist; the file matrR.dat will be created.

The submit file, "submit.matlab" will be

           #
           # Submit a matlab job
           #
           executable = /usr/bin/cl-matlab
           arguments = -nosplash -nojvm  -nodesktop -nodisplay -r matscripttest
           universe = vanilla
           getenv   = True          # MATLAB needs local environment
           log = mat.log
           output = mat.out
           error = mat.err
           queue 1 
Note the getenv = True - without it matlab will core dump !
Note also that the executable given is the full path name. Even if matlab is on your PATH you need to give the full pathname or condor will assume it is an executable in the current working directory, and condor_submit will report an error when it can't find it.

5.2.2.2.1 Matlab Licencing Issues

If all goes well, once the job has completed the file "mat.out" will contain the usual matlab preamble:
                              < M A T L A B >
                  Copyright 1984-2006 The MathWorks, Inc.
                         Version 7.3.0.298 (R2006b)
                              August 03, 2006

 
  To get started, type one of these: helpwin, helpdesk, or demo.
  For product information, visit www.mathworks.com.

If all does not go well you might see the following in an output file:

           License Manager Error -4.
           Maximum number of users for MATLAB reached.
           Try again later.
           To see a list of current users use the lmstat utility.
Every instance of matlab that you run consumes a licence. As we have a limited number of such licences (they are very expensive), it is quite likely that using condor to run multiple simultaneous matlab jobs will hit this problem. Check with lmstat -a before you condor_submit that there are sufficient licences for the number of matlab-executing job clusters that you wish to run.
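As a sketch, the free-licence count can be extracted from the lmstat output with a short awk filter. The function name free_matlab_licences is illustrative, and the "Users of MATLAB" line format assumed below is the usual FlexLM one; check it against your local lmstat output:

```shell
# Report how many MATLAB licences are free, given "lmstat -a" output on stdin.
# Assumes the standard FlexLM summary line, e.g.
#   Users of MATLAB:  (Total of 50 licenses issued;  Total of 42 licenses in use)
free_matlab_licences() {
    awk '/Users of MATLAB:/ { print $6 - $11 }'
}
# Typical use:
#   lmstat -a | free_matlab_licences
```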

Alternatively, in some instances it may be possible to compile your matlab code using the Matlab Compiler: compiling consumes licences, but running the compiled code does not, so you can run as many instances as you wish. See 5.2.2.2.2 Matlab Compilation for details.

Note that because we run the condor daemons as user "condor", that is the user name which will be displayed by lmstat. If more than one user is running matlab under condor at the same time then you will have to use condor_q -global to determine which machine is executing your process and look for that machine name in the output of lmstat.

5.2.2.2.2 Matlab Compilation

When you do the compilation you will consume a licence (both for matlab and for the compiler), however, when you subsequently run the compiled code no licence is consumed. Thus you can run as many instances of the compiled code as you wish without having to worry about the number of licences available. See the matlab internal documentation for details of the compiler.

It is, apparently, somewhat fiddly to get working as all sorts of environment variables have to be set at both compile and run time. The following scripts have been shown to work (on previous versions of both condor and matlab, so details may have changed; thanks to Chris Town). The trick is to have the script run by the Condor job call Matlab to execute a Matlab script that compiles the code, then exit Matlab and run the compiled Matlab program.

Insert your own values of /Some/pathto in several places.
The Submit file:

executable = /Some/pathto/runit
universe = vanilla
initialdir = /Some/pathto/
getenv   = True          # MATLAB needs local environment
log = matrunit.log
output = matrunit.out
error = matrunit.err
queue

The runit executable (as we have moved to a later version of matlab, some of the library paths will need tweaking):

#!/bin/bash
cd /Some/pathto
export HOME=/Some/pathto
export MATLAB_ROOT=/usr/groups/matlab/matlab14.3
export LD_LIBRARY_PATH=$MATLAB_ROOT/sys/os/glnxa64:$MATLAB_ROOT/bin/glnxa64:\
$MATLAB_ROOT/sys/opengl/lib/glnxa64:\
$MATLAB_ROOT/sys/java/jre/glnxa64/jre1.4.2/lib/amd64/:$LD_LIBRARY_PATH
export XAPPLRESDIR=$MATLAB_ROOT/X11/app-defaults
export ARCH=glnxa64
export DISPLAY=localhost:0
MATBIN=$HOME/realmatlabrunit
/usr/bin/matlab -nosplash -nojvm  -nodesktop -nodisplay -r compileit
/Some/pathto/realmatlabrunit

compileit.m matlab script:

mcc -mv realmatlabrunit -R -nojvm
exit

5.2.2.2.3. More matlab - a slight variant

A slight variant on the procedure in 5.2.2.2. Matlab is to create a small shell script, eg "matscripttest.sh", as a wrapper to matlab:
 
          #! /bin/sh  
          cl-matlab  -nosplash -nojvm  -nodesktop -nodisplay -r "matscripttest"
NB The initial "#! /bin/sh" is necessary to prevent obscure error messages.

The submit file would be similar to the above, but the executable line would then be

           executable = matscripttest.sh
and the arguments line would not be needed.

5.3. Java Universe

A program submitted to the Java universe may run on any sort of machine with a JVM regardless of its location, owner, or JVM version. Condor will take care of all the details such as finding the JVM binary and setting the classpath.

The command condor_status -java will list those machines known to have a JVM installed.

Unfortunately, because of the way Fedora Core 6 handles RPMs, there will be two versions of the java command installed on a machine - you will need to be sure which one you are using. By default FC6 installs version 1.4 of the Java runtime environment, so if you just use "java" (which will probably be /usr/bin/java, depending on your PATH settings) then it will be 1.4. It doesn't install the java development kit, so by default there is no javac. We have installed the jdk in /usr/java/default (currently version 1.6 by user request). FC6 notices that this is installed and so automatically creates a link into it from /usr/bin for javac, but because it already has a java it ignores the version of java that it finds there. Thus we end up with /usr/bin/java being 1.4 and /usr/bin/javac being 1.6 ! If you need both to be the same version you should explicitly use /usr/java/default/bin/java.

The default memory allocation is "1/4 of memory, up to 1GB" (thus 1GB on all current machines). The JVM uses this default unless it is explicitly overridden with the java maximum-heap flag -Xmx.
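For example, the maximum heap can be raised by passing JVM arguments from the submit description file. The java_vm_args submit command and the Prog class name below are illustrative; check the Condor reference manual for the exact command supported by your version:

```
           universe       = java
           executable     = Prog.class
           arguments      = Prog
           java_vm_args   = -Xmx2048m
           queue
```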

5.3.1. A Java Universe "Howto"

  1. Create a suitable directory and cd into it.
  2. Make sure the directory has suitable permissions set (see 1.2. Condor in the Computer Lab)
  3. Move your files into the directory.
  4. Make sure you have the program condor_config_val on your PATH. By default this lives in /opt/condor-6.8.3/bin, so
    PATH=/opt/condor-6.8.3/bin:$PATH;export PATH
    
  5. Check that the executable file does actually work
  6. Create a submit file
  7. Submit, using condor_submit
  8. Monitor the job's progress with the condor_q and condor_status commands

5.3.2. Examples

5.3.2.1. Hello World

The java file, "HelloWorldApp.java" will be
          /**
          * The HelloWorldApp class implements an application that
          * simply displays "Hello World!" to the standard output.
          */
         class HelloWorldApp {
             public static void main(String[] args) {
                 System.out.println("Hello World!"); // Display "Hello World!"
             }
         }
This will have been compiled with javac to create HelloWorldApp.class

The submit file, "submit.helloworldapp" will be

           ####################
           #
           # Execute a single Java class
           #
           ####################

           universe       = java
           executable     = HelloWorldApp.class
           arguments      = HelloWorldApp
           output         = HelloWorldApp.output
           error          = HelloWorldApp.error
           queue
For programs that consist of more than one .class file, an additional line in the submit description file is needed to tell Condor about the additional files. For example:
           transfer_input_files = TinkyWinky.class Dipsy.class LaaLaa.class Po.class
If the various class files have been combined into an archive (.jar) file then Condor must then be told where to find it by adding something like the following to the submit description file:
           jar_files = Teletubbies.jar
The two separate commands ("transfer_input_files" and "jar_files") are needed because the JVM handles the two cases differently.

5.4. Other Universes

The other universes available under condor which we do not currently support are:
  • PVM Universe. The PVM universe allows programs written for the Parallel Virtual Machine interface to be used.
  • MPI Universe. The MPI universe allows programs written to the MPICH interface to be used. Note: we have attempted to support this Universe, but it does not work under our current setup for reasons unknown.
  • Globus Universe. The Globus universe in Condor is intended to provide the standard Condor interface to users who wish to start Globus system jobs from Condor.
  • Scheduler Universe. The scheduler universe allows a Condor job to be submitted and executed with different assumptions for the execution conditions of the job. The job does not wait to be matched with a machine. It instead executes right away, on the machine where the job is submitted. The job will never be preempted. The machine requirements are not considered for scheduler universe jobs.

6. Summary of Useful Condor Commands

  • condor_submit is the program for submitting jobs for execution under Condor.
  • condor_q displays information about jobs in the Condor job queue. Use the -global option to see multiple machines.
  • condor_status may be used to monitor and query resource information, submitter information, checkpoint server information, and daemon master information for the Condor pool.
  • condor_prio changes the priority of one or more jobs in the condor queue.
  • condor_userprio with no arguments, lists the active users (see below) along with their priorities, in increasing priority order. The -all option can be used to display more detailed information.
  • condor_history displays a summary of all condor jobs listed in the specified history files. If no history files are specified then the local history file as specified in Condor's configuration file is read.
  • condor_rm removes one or more jobs from the Condor job queue.
  • condor_compile relinks a program with the Condor libraries for submission into Condor's Standard Universe.

7. Glossary

  • Pool The collection of inter-networked machines running Condor and controlled by a particular manager is known as a pool.
  • Submit machine Submit machines start Condor jobs.
  • Execute machine Execute machines run the Condor jobs.
  • Central Manager The Manager machine is the collector of information.
  • Checkpoint server This is a single centralized machine that stores all the checkpoint files for the jobs.
  • Universe Condor has several runtime environments (called a universe) from which to choose. The standard universe allows a job running under Condor to handle system calls by returning them to the machine where the job was submitted. The standard universe also provides the mechanisms necessary to take a checkpoint and migrate a partially completed job, should the machine on which the job is executing become unavailable. The vanilla universe provides a way to run jobs that cannot be relinked. There is no way to take a checkpoint or migrate a job executed under the vanilla universe.
  • Submit Description File This controls the details of job submission.
  • Job Cluster A cluster is a set of jobs specified in the submit description file between queue commands for which the executable is not changed. Within a cluster, each job is identified by a "process ID".
  • ClassAd A ClassAd is a data structure used by Condor to store job or machine information. Condor's ClassAds are analogous to the classified advertising section of the newspaper. Condor plays the role of a matchmaker by continuously reading all the job ClassAds and all the machine ClassAds, matching and ranking job ads with machine ads.
  • Checkpoint Checkpointing is taking a snapshot of the current state of a program in such a way that the program can be restarted from that state at a later time. Checkpointing gives the Condor scheduler the freedom to reconsider scheduling decisions through preemptive-resume scheduling. If the scheduler decides to no longer allocate a machine to a job (for example, when the owner of that machine returns), it can checkpoint the job and preempt it without losing the work the job has already accomplished. The job can be resumed later when the scheduler allocates it a new machine. Additionally, periodic checkpointing provides fault tolerance in Condor. Snapshots are taken periodically, and after an interruption in service the program can continue from the most recent snapshot.