2.5 Submitting a Job

A job is submitted for execution to Condor using the condor_ submit command. condor_ submit takes as an argument the name of a file called a submit description file. This file contains commands and keywords to direct the queuing of jobs. In the submit description file, Condor finds everything it needs to know about the job. Items such as the name of the executable to run, the initial working directory, and command-line arguments to the program all go into the submit description file. condor_ submit creates a job ClassAd based upon the information, and Condor works toward running the job.
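
For example, if the submit description file were named foo.submit (a hypothetical name), the job would be submitted with:

  condor_submit foo.submit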

The contents of a submit description file can save time for Condor users. It is easy to submit multiple runs of a program to Condor. To run the same program 500 times on 500 different input data sets, arrange the data files so that each run reads its own input and each run writes its own output. Each individual run may have its own initial working directory, stdin, stdout, stderr, command-line arguments, and shell environment. A program that directly opens its own files can read the file names to use either from stdin or from the command line. A program that always opens a statically named file will need a separate subdirectory for the output of each run.
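
As a sketch of such a submission (the file naming convention, and the assumption that the program takes its input and output file names on the command line, are hypothetical), the predefined $(Process) macro can give each queued run its own files:

   # a minimal sketch: one cluster of 500 jobs; run N reads in.N and writes out.N
   Executable = foo
   Log        = foo.log
   Arguments  = in.$(Process) out.$(Process)
   Queue 500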

The condor_ submit manual page contains a complete description of how to use condor_ submit.


2.5.1 Sample submit description files

In addition to the examples of submit description files given in the condor_ submit manual page, here are a few more.

2.5.1.1 Example 1

Example 1 is the simplest submit description file possible. It queues up one copy of the program foo (which had been created by condor_ compile) for execution by Condor. Since no platform is specified, Condor will use its default, which is to run the job on a machine with the same architecture and operating system as the machine from which it was submitted. No input, output, and error commands are given in the submit description file, so stdin, stdout, and stderr will all refer to /dev/null. The program may produce output by explicitly opening a file and writing to it. A log file, foo.log, will also be produced; it records events that occurred during the job's lifetime inside Condor. When the job finishes, its exit conditions will be noted in the log file. It is recommended that you always have a log file, so you know what happened to your jobs.

  ####################                                                    
  # 
  # Example 1                                                            
  # Simple condor job description file                                    
  #                                                                       
  ####################                                                    
                                                                          
  Executable     = foo                                                    
  Log            = foo.log                                                    
  Queue

2.5.1.2 Example 2

Example 2 queues two copies of the program mathematica. The first copy will run in directory run_1, and the second will run in directory run_2. For both queued copies, stdin will be test.data, stdout will be loop.out, and stderr will be loop.error. Two sets of these files will be written, one in each run's own directory. This is a convenient way to organize data when there is a large group of Condor jobs to run. The example submits mathematica as a vanilla universe job. This may be necessary if the source and/or object code to program mathematica is not available.

  ####################     
  #                       
  # Example 2: demonstrate use of multiple     
  # directories for data organization.      
  #                                        
  ####################                    
                                         
  Executable     = mathematica          
  Universe = vanilla                   
  input   = test.data                
  output  = loop.out                
  error   = loop.error             
  Log     = loop.log                                                    
                                  
  Initialdir     = run_1         
  Queue                         
                               
  Initialdir     = run_2      
  Queue

2.5.1.3 Example 3

The submit description file for Example 3 queues 150 runs of program foo, which has been compiled and linked for Sun workstations running Solaris 8. This job requires Condor to run the program on machines with at least 32 megabytes of physical memory, and it expresses a preference to run the program on machines with at least 64 megabytes, if such machines are available. It also advises Condor that the job will use up to 28 megabytes of memory when running. Each of the 150 runs of the program is given its own process number, starting with process number 0. So, files stdin, stdout, and stderr will refer to in.0, out.0, and err.0 for the first run of the program, in.1, out.1, and err.1 for the second run of the program, and so forth. A log file containing entries about when and where Condor runs, checkpoints, and migrates processes for the 150 queued programs will be written into the file foo.log.

  ####################                    
  #
  # Example 3: Show off some fancy features including
  # use of pre-defined macros and logging.
  #
  ####################                                                    

  Executable     = foo                                                    
  Requirements   = Memory >= 32 && OpSys == "SOLARIS28" && Arch == "SUN4u"
  Rank           = Memory >= 64
  Image_Size     = 28 Meg                                                 

  Error   = err.$(Process)                                                
  Input   = in.$(Process)                                                 
  Output  = out.$(Process)                                                
  Log = foo.log

  Queue 150


2.5.2 About Requirements and Rank

The requirements and rank commands in the submit description file are powerful and flexible. Using them effectively requires care, and this section presents those details.

Both requirements and rank need to be specified as valid Condor ClassAd expressions; however, default values are set by the condor_ submit program if these are not defined in the submit description file. From the condor_ submit manual page and the above examples, you can see that writing ClassAd expressions is intuitive, especially if you are familiar with the programming language C. There are some pretty nifty expressions you can write with ClassAds. A complete description of ClassAds and their expressions can be found in section 4.1.

All of the commands in the submit description file are case insensitive, except for the ClassAd attribute string values. ClassAd attribute names are case insensitive, but ClassAd string values are case preserving.

Note that the comparison operators (<, >, <=, >=, and ==) compare strings case insensitively. The special comparison operators =?= and =!= compare strings case sensitively.
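
As a brief illustration of the difference, assuming a machine that advertises OpSys = "LINUX":

   OpSys == "linux"      evaluates to TRUE  (== ignores case for strings)
   OpSys =?= "linux"     evaluates to FALSE (=?= is case sensitive)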

The allowed ClassAd attributes are those that appear in a machine or a job ClassAd. To see all of the machine ClassAd attributes for all machines in the Condor pool, run condor_ status -l. The -l argument to condor_ status displays the complete machine ClassAds. The job ClassAds, if there are jobs in the queue, can be seen with the condor_ q -l command. This will show you all the available attributes you can play with.
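
For example, to dump the complete ClassAd for a single machine or for a single job (the host name and job identifier here are only placeholders):

  condor_status -l vulture.cs.wisc.edu
  condor_q -l 152.3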

To help explain what these attributes signify, descriptions follow for the attributes that are common to every machine ClassAd. Remember that because ClassAds are flexible, the machine ads in your pool may include additional attributes specific to your site's installation and policies.


2.5.2.1 ClassAd Machine Attributes

Activity:
String which describes Condor job activity on the machine. Can have one of the following values:
"Idle":
There is no job activity
"Busy":
A job is busy running
"Suspended":
A job is currently suspended
"Vacating":
A job is currently checkpointing
"Killing":
A job is currently being killed
"Benchmarking":
The startd is running benchmarks
Arch:
String with the architecture of the machine. Typically one of the following:
"INTEL":
Intel x86 CPU (Pentium, Xeon, etc).
"IA64":
Intel 64-bit CPU
"ALPHA":
Digital Alpha CPU
"SGI":
Silicon Graphics MIPS CPU
"SUN4u":
Sun UltraSparc CPU
"SUN4x":
A Sun Sparc CPU other than an UltraSparc, i.e. sun4m or sun4c CPU found in older Sparc workstations such as the Sparc 10, Sparc 20, IPC, IPX, etc.
"PPC":
Power Macintosh
"HPPA1":
Hewlett Packard PA-RISC 1.x CPU (i.e. PA-RISC 7000 series CPU) based workstation
"HPPA2":
Hewlett Packard PA-RISC 2.x CPU (i.e. PA-RISC 8000 series CPU) based workstation
CheckpointPlatform:
A string which opaquely encodes various aspects about a machine's operating system, hardware, and kernel attributes. It is used to identify systems where previously taken checkpoints for the standard universe may resume.
ClockDay:
The day of the week, where 0 = Sunday, 1 = Monday, ... , 6 = Saturday.
ClockMin:
The number of minutes passed since midnight.
CondorLoadAvg:
The portion of the load average generated by Condor (either from remote jobs or running benchmarks).
ConsoleIdle:
The number of seconds since activity on the system console keyboard or console mouse has last been detected.
Cpus:
Number of CPUs in this machine, i.e. 1 = single CPU machine, 2 = dual CPUs, etc.
CurrentRank:
A float which represents this machine owner's affinity for running the Condor job which it is currently hosting. If not currently hosting a Condor job, CurrentRank is 0.0. When a machine is claimed, the attribute's value is computed by evaluating the machine's Rank expression with respect to the current job's ClassAd.
Disk:
The amount of disk space on this machine available for the job in Kbytes ( e.g. 23000 = 23 megabytes ). Specifically, this is the amount of disk space available in the directory specified in the Condor configuration files by the EXECUTE macro, minus any space reserved with the RESERVED_DISK macro.
EnteredCurrentActivity:
Time at which the machine entered the current Activity (see Activity entry above). On all platforms (including NT), this is measured in the number of integer seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).
FileSystemDomain:
A ``domain'' name configured by the Condor administrator which describes a cluster of machines which all access the same, uniformly-mounted, networked file systems usually via NFS or AFS. This is useful for Vanilla universe jobs which require remote file access.
KeyboardIdle:
The number of seconds since activity on any keyboard or mouse associated with this machine has last been detected. Unlike ConsoleIdle, KeyboardIdle also takes activity on pseudo-terminals into account (i.e. virtual ``keyboard'' activity from telnet and rlogin sessions as well). Note that KeyboardIdle will always be equal to or less than ConsoleIdle.
KFlops:
Relative floating point performance as determined via a Linpack benchmark.
LastHeardFrom:
Time when the Condor central manager last received a status update from this machine. Expressed as the number of integer seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970). Note: This attribute is only inserted by the central manager once it receives the ClassAd. It is not present in the condor_ startd copy of the ClassAd. Therefore, you could not use this attribute in defining condor_ startd expressions (and you would not want to).
LoadAvg:
A floating point number with the machine's current load average.
Machine:
A string with the machine's fully qualified hostname.
Memory:
The amount of RAM in megabytes.
Mips:
Relative integer performance as determined via a Dhrystone benchmark.
MyType:
The ClassAd type; always set to the literal string "Machine".
Name:
The name of this resource; typically the same value as the Machine attribute, but it could be customized by the site administrator. On SMP machines, the condor_ startd will divide the CPUs up into separate virtual machines, each with a unique name. These names will be of the form ``vm#@full.hostname'', for example, ``vm1@vulture.cs.wisc.edu'', which signifies virtual machine 1 from vulture.cs.wisc.edu.
OpSys:
String describing the operating system running on this machine. For Condor Version 6.8.3 typically one of the following:
"HPUX10":
for HPUX 10.20
"HPUX11":
for HPUX B.11.00
"LINUX":
for LINUX 2.0.x, LINUX 2.2.x, LINUX 2.4.x, or LINUX 2.6.x kernel systems
"OSF1":
for Digital Unix 4.x
"SOLARIS25":
for Solaris 2.5 or 5.5
"SOLARIS251":
for Solaris 2.5.1 or 5.5.1
"SOLARIS26":
for Solaris 2.6 or 5.6
"SOLARIS27":
for Solaris 2.7 or 5.7
"SOLARIS28":
for Solaris 2.8 or 5.8
"SOLARIS29":
for Solaris 2.9 or 5.9
"WINNT50":
for Windows 2000
"WINNT51":
for Windows XP
"WINNT52":
for Windows Server 2003
"OSX":
for Darwin
"OSX10_2":
for Darwin 6.4
Requirements:
A boolean, which when evaluated within the context of the machine ClassAd and a job ClassAd, must evaluate to TRUE before Condor will allow the job to use this machine.
MaxJobRetirementTime:
An expression giving the maximum time in seconds that the startd will wait for the job to finish before kicking it off if it needs to do so. This is evaluated in the context of the job ClassAd, so it may refer to job attributes as well as machine attributes.
StartdIpAddr:
String with the IP and port address of the condor_ startd daemon which is publishing this machine ClassAd.
State:
String which publishes the machine's Condor state. Can be:
"Owner":
The machine owner is using the machine, and it is unavailable to Condor.
"Unclaimed":
The machine is available to run Condor jobs, but a good match is either not available or not yet found.
"Matched":
The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
"Claimed":
The machine is claimed by a remote condor_ schedd and is probably running a job.
"Preempting":
A Condor job is being preempted (possibly via checkpointing) in order to clear the machine for either a higher priority job or because the machine owner wants the machine back.
TargetType:
Describes what type of ClassAd to match with. Always set to the string literal "Job", because machine ClassAds always want to be matched with jobs, and vice-versa.
UidDomain:
A ``domain'' name configured by the Condor administrator which describes a cluster of machines that all have the same passwd file entries, and therefore all have the same logins.
VirtualMachineID:
For SMP machines, the integer that identifies the virtual machine (VM). The value will be X for the VM with the name ``vmX@full.hostname''. For non-SMP machines with one virtual machine, the value will be 1.
VirtualMemory:
The amount of currently available virtual memory (swap space) expressed in Kbytes.
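
As an illustration only (the numeric thresholds are arbitrary), several of these machine attributes might be combined in a job's requirements expression:

   Requirements = Arch == "INTEL" && OpSys == "LINUX" && \
                  Memory >= 64 && Disk >= 100000 && KeyboardIdle > 15 * 60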

In addition, there are a few attributes that are automatically inserted into the machine ClassAd whenever a resource is in the Claimed state:

ClientMachine:
The hostname of the machine that has claimed this resource

RemoteOwner:
The name of the user who originally claimed this resource.

RemoteUser:
The name of the user who is currently using this resource. In general, this will always be the same as the RemoteOwner, but in some cases, a resource can be claimed by one entity that hands off the resource to another entity which uses it. In that case, RemoteUser would hold the name of the entity currently using the resource, while RemoteOwner would hold the name of the entity that claimed the resource.

TotalClaimRunTime:
A running total of the amount of time (in seconds) that all jobs (under the same claim) ran (have spent in the Claimed/Busy state).

TotalClaimSuspendTime:
A running total of the amount of time (in seconds) that all jobs (under the same claim) have been suspended (in the Claimed/Suspended state).

TotalJobRunTime:
A running total of the amount of time (in seconds) that a single job ran (has spent in the Claimed/Busy state).

TotalJobSuspendTime:
A running total of the amount of time (in seconds) that a single job has been suspended (in the Claimed/Suspended state).

There are a few attributes that are only inserted into the machine ClassAd if a job is currently executing. If the resource is claimed but no job is running, none of these attributes will be defined.

JobId:
The job's identifier (for example, 152.3), as seen from condor_ q on the submitting machine.

JobStart:
The time stamp in integer seconds of when the job began executing, since the Unix epoch (00:00:00 UTC, Jan 1, 1970). For idle machines, the value is UNDEFINED.

LastPeriodicCheckpoint:
If the job has performed a periodic checkpoint, this attribute will be defined and will hold the time stamp of when the last periodic checkpoint was begun. If the job has yet to perform a periodic checkpoint, or cannot checkpoint at all, the LastPeriodicCheckpoint attribute will not be defined.

Finally, the single attribute, CurrentTime, is defined by the ClassAd environment.

CurrentTime:
Evaluates to the number of integer seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970).


2.5.2.2 ClassAd Job Attributes

Args:
String representing the arguments passed to the job.

CkptArch:
String describing the architecture of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.

CkptOpSys:
String describing the operating system of the machine this job executed on at the time it last produced a checkpoint. If the job has never produced a checkpoint, this attribute is undefined.

ClusterId:
Integer cluster identifier for this job. A cluster is a group of jobs that were submitted together. Each job has its own unique job identifier within the cluster, but shares a common cluster identifier. The value changes each time a job or set of jobs are queued for execution under Condor.

Cmd:
The path to and the file name of the job to be executed.

CompletionDate:
The time when the job completed, or the value 0 if the job has not yet completed. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

CumulativeSuspensionTime:
A running total of the number of seconds the job has spent in suspension for the life of the job.

CurrentHosts:
The number of hosts in the claimed state, due to this job.

EnteredCurrentStatus:
An integer containing the epoch time of when the job entered its current status. So, for example, if the job is on hold, the ClassAd expression
    CurrentTime - EnteredCurrentStatus
will equal the number of seconds that the job has been on hold.

ExecutableSize:
Size of the executable in Kbytes.

ExitBySignal:
An attribute that is True when a user job exits via a signal and False otherwise. For some grid universe jobs, how the job exited is unavailable. In this case, ExitBySignal is set to False.

ExitCode:
When a user job exits by means other than a signal, this is the exit return code of the user job. For some grid universe jobs, how the job exited is unavailable. In this case, ExitCode is set to 0.

ExitSignal:
When a user job exits by means of an unhandled signal, this attribute takes on the numeric value of the signal. For some grid universe jobs, how the job exited is unavailable. In this case, ExitSignal will be undefined.

ExitStatus:
The way that Condor previously dealt with a job's exit status. This attribute should no longer be used. It is not always accurate in heterogeneous pools, or if the job exited with a signal. Instead, see the attributes: ExitBySignal, ExitCode, and ExitSignal.

HoldReasonCode:
An integer value that represents the reason that a job was put on hold.


  Code   Reason for Hold
    1    The user put the job on hold with condor_ hold.
    2    Globus middleware reported an error. HoldReasonSubCode is the GRAM error number.
    3    The PERIODIC_HOLD expression evaluated to True.
    4    The credentials for the job are invalid.
    5    A job policy expression evaluated to Undefined.
    6    The condor_ starter failed to start the executable. HoldReasonSubCode is the Unix error number.
    7    The standard output file for the job could not be opened. HoldReasonSubCode is the Unix error number.
    8    The standard input file for the job could not be opened. HoldReasonSubCode is the Unix error number.
    9    The standard output stream for the job could not be opened. HoldReasonSubCode is the Unix error number.
   10    The standard input stream for the job could not be opened. HoldReasonSubCode is the Unix error number.
   11    An internal Condor protocol error was encountered when transferring files.
   12    The condor_ starter failed to download input files. HoldReasonSubCode is the Unix error number.
   13    The condor_ starter failed to upload output files. HoldReasonSubCode is the Unix error number.
   14    The initial working directory of the job cannot be accessed. HoldReasonSubCode is the Unix error number.
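
To see why jobs are on hold, use condor_q -hold. As a further, hypothetical illustration, a constraint on HoldReasonCode selects only the jobs held for one particular reason:

  condor_q -hold
  condor_q -constraint 'HoldReasonCode == 14'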

HoldReasonSubCode:
An integer value that represents further information to go along with the HoldReasonCode, for some values of HoldReasonCode. See HoldReasonCode for the values.

HoldKillSig:
Currently only for scheduler and local universe jobs, a string containing a name of a signal to be sent to the job if the job is put on hold.

HoldReason:
A string containing a human-readable message about why a job is on hold. This is the message that will be displayed in response to the command condor_q -hold. It can be used to determine if a job should be released or not.

ImageSize:
Estimate of the memory image size of the job in Kbytes. The initial estimate may be specified in the job submit file. Otherwise, the initial value is equal to the size of the executable. When the job checkpoints, the ImageSize attribute is set to the size of the checkpoint file (since the checkpoint file contains the job's memory image). A vanilla universe job's ImageSize is recomputed internally every 15 seconds.

JobLeaseDuration:
The number of seconds set for a job lease, the amount of time that a job may continue running on a remote resource, despite its submitting machine's lack of response. See section 2.14.4 for details on job leases.

JobPrio:
Integer priority for this job, set by condor_ submit or condor_ prio. The default value is 0. The higher the number, the better the priority.

JobStartDate:
Time at which the job first began running. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

JobStatus:
Integer which indicates the current status of the job.

  Value  Status
    0    Unexpanded (the job has never run)
    1    Idle
    2    Running
    3    Removed
    4    Completed
    5    Held

JobUniverse:
Integer which indicates the job universe.


  Value  Universe
    1    standard
    4    PVM
    5    vanilla
    7    scheduler
    8    MPI
    9    grid
   10    java

LastCheckpointPlatform:
An opaque string which is the CheckpointPlatform identifier from the last machine where this standard universe job had successfully produced a checkpoint.

LastCkptServer:
Host name of the last checkpoint server used by this job. When a pool is using multiple checkpoint servers, this tells the job where to find its checkpoint file.

LastCkptTime:
Time at which the job last performed a successful checkpoint. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

LastMatchTime:
An integer containing the epoch time when the job was last successfully matched with a resource (gatekeeper) Ad.

LastRejMatchReason:
If, at any point in the past, this job failed to match with a resource ad, this attribute will contain a string with a human-readable message about why the match failed.

LastRejMatchTime:
An integer containing the epoch time when Condor-G last tried to find a match for the job, but failed to do so.

LastSuspensionTime:
Time at which the job last performed a successful suspension. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

LastVacateTime:
Time at which the job was last evicted from a remote workstation. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

LocalSysCpu:
An accumulated number of seconds of system CPU time that the job caused to the machine upon which the job was submitted.

LocalUserCpu:
An accumulated number of seconds of user CPU time that the job caused to the machine upon which the job was submitted.

MaxHosts:
The maximum number of hosts that this job would like to claim. As long as CurrentHosts is the same as MaxHosts, no more hosts are negotiated for.

MaxJobRetirementTime:
Maximum time in seconds to let this job run uninterrupted before kicking it off when it is being preempted. This can only decrease the amount of time from what the corresponding startd expression allows.

MinHosts:
The minimum number of hosts that must be in the claimed state for this job, before the job may enter the running state.

NiceUser:
Boolean value which indicates whether this is a nice-user job.

NumCkpts:
A count of the number of checkpoints written by this job during its lifetime.

NumGlobusSubmits:
An integer that is incremented each time the condor_ gridmanager receives confirmation of a successful job submission into Globus.

NumJobMatches:
An integer that is incremented by the condor_ schedd each time the job is matched with a resource ad by the negotiator.

NumRestarts:
A count of the number of restarts from a checkpoint attempted by this job during its lifetime.

NumSystemHolds:
An integer that is incremented each time Condor-G places a job on hold due to some sort of error condition. This counter is useful, since Condor-G will always place a job on hold when it gives up on some error condition. Note that if the user places the job on hold using the condor_ hold command, this attribute is not incremented.

Owner:
String describing the user who submitted this job.

ProcId:
Integer process identifier for this job. Within a cluster of many jobs, each job has the same ClusterId, but will have a unique ProcId. Within a cluster, assignment of a ProcId value will start with the value 0. The job (process) identifier described here is unrelated to operating system PIDs.

QDate:
Time at which the job was submitted to the job queue. Measured in the number of seconds since the epoch (00:00:00 UTC, Jan 1, 1970).

ReleaseReason:
A string containing a human-readable message about why the job was released from hold.

RemoteIwd:
The path to the directory in which a job is to be executed on a remote machine.

RemoteSysCpu:
The total number of seconds of system CPU time (the time spent at system calls) the job used on remote machines.

RemoteUserCpu:
The total number of seconds of user CPU time the job used on remote machines.

RemoteWallClockTime:
Cumulative number of seconds the job has been allocated a machine. This also includes time spent in suspension (if any), so the total real time spent running is
RemoteWallClockTime - CumulativeSuspensionTime
Note that this number does not get reset to zero when a job is forced to migrate from one machine to another.

RemoveKillSig:
Currently only for scheduler universe jobs, a string containing a name of a signal to be sent to the job if the job is removed.

StreamErr:
An attribute utilized only for grid universe jobs. The default value is True. If True, and TransferErr is True, then standard error is streamed back to the submit machine, instead of doing the transfer (as a whole) after the job completes. If False, then standard error is transferred back to the submit machine (as a whole) after the job completes. If TransferErr is False, then this job attribute is ignored.

StreamOut:
An attribute utilized only for grid universe jobs. The default value is True. If True, and TransferOut is True, then job output is streamed back to the submit machine, instead of doing the transfer (as a whole) after the job completes. If False, then job output is transferred back to the submit machine (as a whole) after the job completes. If TransferOut is False, then this job attribute is ignored.

TotalSuspensions:
A count of the number of times this job has been suspended during its lifetime.

TransferErr:
An attribute utilized only for grid universe jobs. The default value is True. If True, then the error output from the job is transferred from the remote machine back to the submit machine. The name of the file after transfer is the file referred to by job attribute Err. If False, no transfer takes place (remote to submit machine), and the name of the file is the file referred to by job attribute Err.

TransferExecutable:
An attribute utilized only for grid universe jobs. The default value is True. If True, then the job executable is transferred from the submit machine to the remote machine. The name of the file (on the submit machine) that is transferred is given by the job attribute Cmd. If False, no transfer takes place, and the name of the file used (on the remote machine) will be as given in the job attribute Cmd.

TransferIn:
An attribute utilized only for grid universe jobs. The default value is True. If True, then the job input is transferred from the submit machine to the remote machine. The name of the file that is transferred is given by the job attribute In. If False, then the job's input is taken from a file on the remote machine (pre-staged), and the name of the file is given by the job attribute In.

TransferOut:
An attribute utilized only for grid universe jobs. The default value is True. If True, then the output from the job is transferred from the remote machine back to the submit machine. The name of the file after transfer is the file referred to by job attribute Out. If False, no transfer takes place (remote to submit machine), and the name of the file is the file referred to by job attribute Out.


2.5.2.3 Rank Expression Examples

When considering the match between a job and a machine, rank is used to choose a match from among all machines that satisfy the job's requirements and are available to the user, after accounting for the user's priority and the machine's rank of the job. The rank expressions, simple or complex, define a numerical value that expresses preferences.

The job's rank expression evaluates to one of three values. It can be UNDEFINED, ERROR, or a floating point value. If rank evaluates to a floating point value, the best match will be the one with the largest, positive value. If no rank is given in the submit description file, then Condor substitutes a default value of 0.0 when considering machines to match. If the job's rank of a given machine evaluates to UNDEFINED or ERROR, this same value of 0.0 is used. Therefore, the machine is still considered for a match, but has no rank above any other.

A boolean expression evaluates to the numerical value of 1.0 if true, and 0.0 if false.

The following rank expressions provide examples to follow.

For a job that desires the machine with the most available memory:

   Rank = memory

For a job that prefers to run on a friend's machine on Saturdays and Sundays:

   Rank = ( (clockday == 0) || (clockday == 6) )
          && (machine == "friend.cs.wisc.edu")

For a job that prefers to run on one of three specific machines:

   Rank = (machine == "friend1.cs.wisc.edu") ||
          (machine == "friend2.cs.wisc.edu") ||
          (machine == "friend3.cs.wisc.edu")

For a job that wants the machine with the best floating point performance (on Linpack benchmarks):

   Rank = kflops
This particular example highlights a difficulty with rank expression evaluation as currently defined. While all machines have floating point processing ability, not all machines will have the kflops attribute defined. For machines where this attribute is not defined, Rank will evaluate to the value UNDEFINED, and Condor will use a default rank of 0.0 for that machine. The rank expression will therefore only distinguish among machines where the attribute is defined, so the machine with the highest floating point performance may not be the one given the highest rank.

So, it is wise when writing a rank expression to check if the expression's evaluation will lead to the expected resulting ranking of machines. This can be accomplished using the condor_ status command with the -constraint argument. This allows the user to see a list of machines that fit a constraint. To see which machines in the pool have kflops defined, use

   condor_status -constraint kflops

Alternatively, to see a list of machines where kflops is not defined, use

   condor_status -constraint "kflops=?=undefined"

For a job that prefers specific machines in a specific order:

   Rank = ((machine == "friend1.cs.wisc.edu")*3) +
          ((machine == "friend2.cs.wisc.edu")*2) +
           (machine == "friend3.cs.wisc.edu")
If the machine being ranked is "friend1.cs.wisc.edu", then the expression
   (machine == "friend1.cs.wisc.edu")
is true, and gives the value 1.0. The expressions
   (machine == "friend2.cs.wisc.edu")
and
   (machine == "friend3.cs.wisc.edu")
are false, and give the value 0.0. Therefore, rank evaluates to the value 3.0. In this way, machine "friend1.cs.wisc.edu" is ranked higher than machine "friend2.cs.wisc.edu", machine "friend2.cs.wisc.edu" is ranked higher than machine "friend3.cs.wisc.edu", and all three of these machines are ranked higher than others.


2.5.3 Submitting Jobs Using a Shared File System

If vanilla, java, parallel (or MPI) universe jobs are submitted without using the File Transfer mechanism, Condor must use a shared file system to access input and output files. In this case, the job must be able to access the data files from any machine on which it could potentially run.

As an example, suppose a job is submitted from blackbird.cs.wisc.edu, and the job requires a particular data file called /u/p/s/psilord/data.txt. If the job were to run on cardinal.cs.wisc.edu, the file /u/p/s/psilord/data.txt must be available through either NFS or AFS for the job to run correctly.

Condor allows users to ensure their jobs have access to the right shared files by using the FileSystemDomain and UidDomain machine ClassAd attributes. These attributes specify which machines have access to the same shared file systems. All machines that mount the same shared directories in the same locations are considered to belong to the same file system domain. Similarly, all machines that share the same user information (in particular, the same UID, which is important for file systems like NFS) are considered part of the same UID domain.

The default configuration for Condor places each machine in its own UID domain and file system domain, using the full hostname of the machine as the name of the domains. So, if a pool does have access to a shared file system, the pool administrator must correctly configure Condor such that all the machines mounting the same files have the same FileSystemDomain configuration. Similarly, all machines that share common user information must be configured to have the same UidDomain configuration.
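
As a sketch of such a configuration (the domain name is only an example), the administrator might place the following in the Condor configuration file of every machine that mounts the shared file systems:

  FILESYSTEM_DOMAIN = cs.wisc.edu
  UID_DOMAIN        = cs.wisc.edu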

When a job relies on a shared file system, Condor uses the requirements expression to ensure that the job runs on a machine in the correct UidDomain and FileSystemDomain. In this case, the default requirements expression specifies that the job must run on a machine with the same UidDomain and FileSystemDomain as the machine from which the job is submitted. This default is almost always correct. However, in a pool spanning multiple UidDomains and/or FileSystemDomains, the user may need to specify a different requirements expression to have the job run on the correct machines.

For example, imagine a pool made up of both desktop workstations and a dedicated compute cluster. Most of the pool, including the compute cluster, has access to a shared file system, but some of the desktop machines do not. In this case, the administrators would probably define the FileSystemDomain to be cs.wisc.edu for all the machines that mounted the shared files, and to the full hostname for each machine that did not, such as jimi.cs.wisc.edu.

In this example, a user wants to submit vanilla universe jobs from her own desktop machine (jimi.cs.wisc.edu) which does not mount the shared file system (and is therefore in its own file system domain, in its own world). But, she wants the jobs to be able to run on more than just her own machine (in particular, the compute cluster), so she puts the program and input files onto the shared file system. When she submits the jobs, she needs to tell Condor to send them to machines that have access to that shared data, so she specifies a different requirements expression than the default:

   Requirements = UidDomain == "cs.wisc.edu" && \
                  FileSystemDomain == "cs.wisc.edu"

WARNING: If there is no shared file system, or the Condor pool administrator has not configured the FileSystemDomain setting correctly (the default places each machine in a pool in its own file system and UID domain), then a job that cannot use remote system calls (for example, a vanilla universe job) and that does not enable Condor's File Transfer mechanism will only run on the machine from which it was submitted.


2.5.4 Submitting Jobs Without a Shared File System: Condor's File Transfer Mechanism

Condor works well without a shared file system. The Condor file transfer mechanism is utilized by the user when the user submits jobs. Condor will transfer any files needed by a job from the machine where the job was submitted into a temporary working directory on the machine where the job is to be executed. Condor executes the job and transfers output back to the submitting machine. The user specifies which files to transfer, and at what point the output files should be copied back to the submitting machine. This specification is done within the job's submit description file.

The default behavior of the file transfer mechanism varies across the different Condor universes, and it differs between UNIX and Windows machines.

2.5.4.1 Default Behavior across Condor Universes and Platforms

For jobs submitted under the standard universe, the existence of a shared file system is not relevant. Access to files (input and output) is handled through Condor's remote system call mechanism. The executable and checkpoint files are transferred automatically, when needed. Therefore, the user does not need to change the submit description file if there is no shared file system.

For the vanilla, java, MPI, and parallel universes, access to files (including the executable) through a shared file system is presumed as a default on UNIX machines. If there is no shared file system, then Condor's file transfer mechanism must be explicitly enabled. When submitting a job from a Windows machine, Condor presumes the opposite: no access to a shared file system. It instead enables the file transfer mechanism by default. Submission of a job might need to specify which files to transfer, and/or when to transfer the output files back.

For the grid universe, jobs are to be executed on remote machines, so there would never be a shared file system between machines. See section 5.3.2 for more details.

For the PVM universe, file transfer other than the master's executable and files given in input, output, and error commands is not supported. This is not usually an impediment (shared file system or not), since PVM jobs are set up to have the master direct the workers, and I/O from the workers is usually passed back to the master via PVM messages, not files.

For the scheduler universe, Condor is only using the machine from which the job is submitted. Therefore, the existence of a shared file system is not relevant.


2.5.4.2 Specifying If and When to Transfer Files

To enable the file transfer mechanism, two commands are placed in the job's submit description file: should_transfer_files and when_to_transfer_output. An example is:

  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT

The should_transfer_files command specifies whether Condor should transfer input files from the submit machine to the remote machine where the job executes. It also specifies whether the output files are transferred back to the submit machine. The command takes on one of three possible values:

  1. YES: Condor always transfers both input and output files.

  2. IF_NEEDED: Condor transfers files if the job is matched with (and to be executed on) a machine in a different FileSystemDomain than the one the submit machine belongs to. If the job is matched with a machine in the local FileSystemDomain, Condor will not transfer files and relies on a shared file system.

  3. NO: Condor's file transfer mechanism is disabled.

The when_to_transfer_output command tells Condor when output files are to be transferred back to the submit machine after the job has executed on a remote machine. The command takes on one of two possible values:

  1. ON_EXIT: Condor transfers output files back to the submit machine only when the job exits on its own.

  2. ON_EXIT_OR_EVICT: Condor will always do the transfer, whether the job completes on its own, is preempted by another job, vacates the machine, or is killed. As the job completes on its own, files are transferred back to the directory where the job was submitted, as expected. For the other cases, files are transferred back at eviction time. These files are placed in the directory defined by the configuration variable SPOOL, not the directory from which the job was submitted. The transferred files are named using the ClusterId and ProcId job ClassAd attributes. The file name takes the form:
       cluster<X>.proc<Y>.subproc0
    
    where <X> is the value of ClusterId, and <Y> is the value of ProcId. As an example, job 735.0 may produce the file
       $(SPOOL)/cluster735.proc0.subproc0
    

    This is only useful if partial runs of the job are valuable. An example of valuable partial runs is when the application produces its own checkpoints.

There is no default value for when_to_transfer_output. If using the file transfer mechanism, this command must be defined. If when_to_transfer_output is specified in the submit description file, but should_transfer_files is not, Condor assumes a value of YES for should_transfer_files.

NOTE: The combination of:

  should_transfer_files = IF_NEEDED
  when_to_transfer_output = ON_EXIT_OR_EVICT
would produce undefined file access semantics. Therefore, this combination is prohibited by condor_ submit.

When submitting from a Unix platform, the file transfer mechanism is unused by default. If neither when_to_transfer_output nor should_transfer_files is defined, Condor assumes should_transfer_files = NO.

When submitting from a Windows platform, Condor does not provide any way to use a shared file system for jobs. Therefore, if neither when_to_transfer_output nor should_transfer_files is defined, the file transfer mechanism is enabled by default with the following values:

  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT

NOTE: Prior to Condor version 6.5.2, a different submit command was used to control file transfer: a single command controlled both whether and when files were transferred, and the IF_NEEDED value was not supported. This older command is still allowed in newer versions of Condor, but it is now deprecated; use when_to_transfer_output and should_transfer_files instead. However, beware that these settings will not work with Condor versions older than 6.5.2.

2.5.4.3 Specifying What Files to Transfer

If the file transfer mechanism is enabled, Condor will transfer the following files before the job is run on a remote machine.

  1. the executable
  2. the input, as defined with the input command
  3. any jar files (for the Java universe)
If the job requires any other input files, the submit description file should utilize the transfer_input_files command. This comma-separated list specifies any other files that Condor is to transfer to a remote site to set up the execution environment for the job before it is run. These files are placed in the same temporary working directory as the job's executable. At this time, directories can not be transferred in this way. For example:

  transfer_input_files = file1,file2

As a default, for jobs other than those submitted to the grid universe, any files that are modified or created by the job in the temporary directory at the remote site are transferred back to the machine from which the job was submitted. Most of the time, this is the best option. To restrict the files that are transferred, specify the exact list of files with transfer_output_files. Delimit these file names with commas. When this list is defined and any of the files do not exist as the job exits, Condor considers this an error and re-runs the job.

WARNING: Do not specify transfer_output_files (for other than grid universe jobs) unless there is a really good reason - it is best to let Condor figure things out by itself based upon what output the job produces.

For grid universe jobs, files to be transferred (other than standard output and standard error) must be specified using transfer_output_files in the submit description file.
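
As a hypothetical example, a grid universe job that produces two result files would list them explicitly:

  transfer_output_files = results.dat, summary.txt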

2.5.4.4 File Paths for File Transfer

The file transfer mechanism specifies file names and/or paths on both the file system of the submit machine and on the file system of the execute machine. Care must be taken to know which machine (submit or execute) is utilizing the file name and/or path.

Files in the transfer_input_files command are specified as they are accessed on the submit machine. The program (as it executes) accesses files as they are found on the execute machine.

There are three ways to specify files and paths for transfer_input_files:

  1. Relative to the submit directory, if the submit command initialdir is not specified.
  2. Relative to the initial directory, if the submit command initialdir is specified.
  3. Absolute.

Before executing the program, Condor copies the executable, the input file specified by the submit command input, and any input files specified by transfer_input_files. All of these files are placed into a temporary directory on the execute machine, and the program runs in that directory. Therefore, the executing program must access its input files without paths. Because all transferred files are placed into a single, flat directory, input files must be uniquely named to avoid collisions; a collision causes the last file in the list to overwrite the earlier one.

If the program creates output files during execution, it must create them within the temporary working directory. Condor transfers back all files within the temporary working directory that have been modified or created. To transfer back only a subset of these files, the submit command transfer_output_files is defined. Transfer of files that exist, but are not within the temporary working directory is not supported. Condor's behavior in this instance is undefined.

It is okay to create files outside the temporary working directory on the file system of the execute machine (in a directory such as /tmp), if that directory is guaranteed to exist and be accessible on all possible execute machines. However, such files cannot be transferred back after execution completes.

Here are several examples to illustrate the use of file transfer. The program executable is called my_program, and it uses three command-line arguments as it executes: two input file names and an output file name. The program executable and the submit description file for this job are located in directory /scratch/test.

The directory tree for all these examples:

/scratch/test (directory)
      my_program.condor (the submit description file)
      my_program (the executable)
      files (directory)
          logs2 (directory)
          in1 (file)
          in2 (file)
      logs (directory)

Example 1

This simple example explicitly transfers input files. These input files to be transferred are specified relative to the directory where the job is submitted. The single output file, out1, created when the job is executed will be transferred back into the directory /scratch/test, not the files directory.

# file name:  my_program.condor
# Condor submit description file for my_program
Executable      = my_program
Universe        = vanilla
Error           = logs/err.$(cluster)
Output          = logs/out.$(cluster)
Log             = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/in1, files/in2

Arguments       = in1 in2 out1
Queue

Example 2

This second example is identical to Example 1, except that absolute paths to the input files are specified, instead of relative paths to the input files.

# file name:  my_program.condor
# Condor submit description file for my_program
Executable      = my_program
Universe        = vanilla
Error           = logs/err.$(cluster)
Output          = logs/out.$(cluster)
Log             = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /scratch/test/files/in1, /scratch/test/files/in2

Arguments       = in1 in2 out1
Queue

Example 3

This third example illustrates the use of the submit command initialdir, and its effect on the paths used for the various files. The expected location of the executable is not affected by the initialdir command. All other files (specified by input, output, transfer_input_files, as well as files modified or created by the job and automatically transferred back) are located relative to the specified initialdir. Therefore, the output file, out1, will be placed in the files directory. Note that the logs2 directory exists to make this example work correctly.

# file name:  my_program.condor
# Condor submit description file for my_program
Executable      = my_program
Universe        = vanilla
Error           = logs2/err.$(cluster)
Output          = logs2/out.$(cluster)
Log             = logs2/log.$(cluster)

initialdir      = files

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = in1, in2

Arguments       = in1 in2 out1
Queue

Example 4 - Illustrates an Error

This example illustrates a job that will fail. The files specified using the transfer_input_files command are handled correctly (see Example 1). However, the relative paths to files in the arguments command cause the executing program to fail. The file system on the submission side may use relative paths to files; however, those files are placed into a single, flat, temporary directory on the execute machine.

Note that this specification and submission will cause the job to fail and reexecute.

# file name:  my_program.condor
# Condor submit description file for my_program
Executable      = my_program
Universe        = vanilla
Error           = logs/err.$(cluster)
Output          = logs/out.$(cluster)
Log             = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/in1, files/in2

Arguments       = files/in1 files/in2 files/out1
Queue

This example fails with the following error:

err: files/out1: No such file or directory.

Example 5 - Illustrates an Error

As with Example 4, this example illustrates a job that will fail. The executing program's use of absolute paths cannot work.

# file name:  my_program.condor
# Condor submit description file for my_program
Executable      = my_program
Universe        = vanilla
Error           = logs/err.$(cluster)
Output          = logs/out.$(cluster)
Log             = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = /scratch/test/files/in1, /scratch/test/files/in2

Arguments = /scratch/test/files/in1 /scratch/test/files/in2 /scratch/test/files/out1
Queue

The job fails with the following error:

err: /scratch/test/files/out1: No such file or directory.

Example 6 - Illustrates an Error

This example illustrates a failure case where the executing program creates an output file in a directory other than within the single, flat, temporary directory that the program executes within. The file creation may or may not cause an error, depending on the existence and permissions of the directories on the remote file system.

Further incorrect usage is seen during the attempt to transfer the output file back using the transfer_output_files command. The behavior of Condor for this case is undefined.

# file name:  my_program.condor
# Condor submit description file for my_program
Executable      = my_program
Universe        = vanilla
Error           = logs/err.$(cluster)
Output          = logs/out.$(cluster)
Log             = logs/log.$(cluster)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = files/in1, files/in2
transfer_output_files = /tmp/out1

Arguments       = in1 in2 /tmp/out1
Queue

2.5.4.5 Requirements and Rank for File Transfer

The requirements expression for a job must depend on the should_transfer_files command. The job must specify the correct logic to ensure that the job is matched with a resource that meets the file transfer needs. If no requirements expression is in the submit description file, or if the expression specified does not refer to the attributes listed below, condor_ submit adds an appropriate clause to the requirements expression for the job. condor_ submit appends these clauses with a logical AND, &&, to ensure that the proper conditions are met. Here are the default clauses corresponding to the different values of should_transfer_files:

  1. should_transfer_files = YES results in the addition of the clause (HasFileTransfer). If the job is always going to transfer files, it is required to match with a machine that has the capability to transfer files. This is a backward compatibility issue, since all versions of Condor since version 6.3.3 support file transfer and have HasFileTransfer defined to TRUE.

  2. should_transfer_files = NO results in the addition of (TARGET.FileSystemDomain == MY.FileSystemDomain). In addition, Condor automatically adds the FileSystemDomain attribute to the job ad, with whatever string is defined for the condor_ schedd to which the job is submitted. If the job is not using the file transfer mechanism, Condor assumes it will need a shared file system, and therefore, a machine in the same FileSystemDomain as the submit machine.

  3. should_transfer_files = IF_NEEDED results in the addition of
      (HasFileTransfer || (TARGET.FileSystemDomain == MY.FileSystemDomain))
    
    If Condor will optionally transfer files, it must require that the machine is either capable of transferring files or in the same file system domain.

To ensure that the job is matched to a machine with enough local disk space to hold all the transferred files, Condor automatically adds the DiskUsage job attribute. This attribute includes the total size of the job's executable and all input files to be transferred. Condor then adds an additional clause to the Requirements expression that states that the remote machine must have at least enough available disk space to hold all these files:

  && (Disk >= DiskUsage)

If should_transfer_files = IF_NEEDED and the job prefers to run on a machine in the local file system domain over transferring files (but is still willing to run remotely and transfer files), the rank expression works well. Use:

rank = (TARGET.FileSystemDomain == MY.FileSystemDomain)

The rank expression is a floating point number, so if other items are considered in ranking the possible machines this job may run on, add the items:

rank = kflops + (TARGET.FileSystemDomain == MY.FileSystemDomain)

The value of kflops can vary widely among machines, so this rank expression will likely not do what is intended. To place emphasis on the job running in the same file system domain, but still consider kflops among the machines in that file system domain, weight the part of the rank expression that matches the file system domains. For example:

rank = kflops + (10000 * (TARGET.FileSystemDomain == MY.FileSystemDomain))

2.5.4.6 Old Attributes for File Transfer

The should_transfer_files and when_to_transfer_output commands in the submit description file result in two corresponding string attributes in the job ClassAd: ShouldTransferFiles and WhenToTransferOutput. These attributes are only defined when the job is matched with an execute machine running Condor version 6.5.3 or a more recent version. So, for backward compatibility, condor_ submit also includes the old attribute used to control this feature: TransferFiles. If you examine a job with the -long option to condor_ q, and you see TransferFiles, that attribute is only there for backward compatibility, and it is ignored if matched with a machine running version 6.5.3 or greater. There were problems with this old attribute, since it was not flexible enough to handle the new IF_NEEDED functionality, and it was confusing for users. Therefore, TransferFiles is deprecated, and we will no longer document its use. If your submit file refers to transfer_files, consider switching it to use the settings described here.

2.5.5 Environment Variables

The environment under which a job executes often contains information that is potentially useful to the job. Condor allows a user to both set and reference environment variables for a job or job cluster.

Within a submit description file, the user may define environment variables for the job's environment by using the environment command. See the condor_ submit manual page at section 9 for more details about this command.

The submitter's entire environment can be copied into the job ClassAd at job submission. The getenv command within the submit description file does this. See the condor_ submit manual page at section 9 for more details about this command.

Commands within the submit description file may reference the environment variables of the submitter as a job is submitted. Submit description file commands use $ENV(EnvironmentVariableName) to reference the value of an environment variable. Again, see the condor_ submit manual page at section 9 for more details about this usage.
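
As a brief sketch combining these commands (the executable name, directory, and variable name below are placeholders, not taken from the manual page), a submit description file might read:

  universe     = vanilla
  Executable   = analyze
  # Copy the submitter's entire environment into the job's environment.
  getenv       = True
  # Define one additional environment variable for the job.
  environment  = OUTPUT_DIR=/scratch/results
  # Reference one of the submitter's environment variables at submit time.
  initialdir   = $ENV(HOME)/run_1
  Log          = analyze.log
  Queue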

Condor sets several additional environment variables for each executing job that may be useful for the job to reference.

2.5.6 Heterogeneous Submit: Execution on Differing Architectures

If executables are available for the different platforms of machines in the Condor pool, Condor can choose from a larger number of machines when allocating a machine for a job. Modifications to the submit description file allow this choice of platforms.

A simplified example is a cross submission: an executable is available for one platform, but the submission is done from a different platform. Given the correct executable, the requirements command in the submit description file specifies the target architecture. For example, an executable compiled for a Sun 4 machine, submitted from an Intel architecture machine running Linux, would add the requirement

  requirements = Arch == "SUN4x" && OpSys == "SOLARIS251"
Without this requirement, condor_ submit will assume that the program is to be executed on a machine with the same platform as the machine where the job is submitted.

Cross submission works for both standard and vanilla universes. The burden is on the user to both obtain and specify the correct executable for the target architecture. To list the architecture and operating systems of the machines in a pool, run condor_ status.
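
As a sketch, a complete submit description file for this cross submission might look like the following; the executable name foo.sun4 is a placeholder for a binary built (with condor_ compile, since this sketch uses the standard universe) for the target platform:

  ####################
  #
  # Sketch: cross submission from Intel/Linux to Sun 4/Solaris 2.5.1
  #
  ####################

  universe     = standard
  Executable   = foo.sun4
  Log          = foo.log
  Requirements = Arch == "SUN4x" && OpSys == "SOLARIS251"
  Queue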

2.5.6.1 Vanilla Universe Example for Execution on Differing Architectures

A more complex example of a heterogeneous submission occurs when a job may be executed on many different architectures to gain full use of a diverse architecture and operating system pool. If the executables are available for the different architectures, then a modification to the submit description file will allow Condor to choose an executable after an available machine is chosen.

A special-purpose Machine Ad substitution macro can be used in the executable, environment, and arguments attributes in the submit description file. The macro has the form

  $$(MachineAdAttribute)
Note that this macro is ignored in all other submit description attributes. The $$() informs Condor to substitute the requested MachineAdAttribute from the machine where the job will be executed.

An example of heterogeneous job submission has executables available for three platforms: LINUX Intel, Solaris 2.6 Intel, and Solaris 8 Sun. This example uses povray, a popular free rendering engine, to render images.

The substitution macro chooses a specific executable after a platform for running the job is chosen. These executables must therefore be named based on the machine attributes that describe a platform. The executables named

  povray.LINUX.INTEL
  povray.SOLARIS26.INTEL
  povray.SOLARIS28.SUN4u
will work correctly for the macro
  povray.$$(OpSys).$$(Arch)

The executables, or links to executables, with these names are placed into the initial working directory so that they may be found by Condor. A submit description file that queues three jobs for this example follows:

  ####################
  #
  # Example of heterogeneous submission
  #
  ####################

  universe     = vanilla
  Executable   = povray.$$(OpSys).$$(Arch)
  Log          = povray.log
  Output       = povray.out.$(Process)
  Error        = povray.err.$(Process)

  Requirements = (Arch == "INTEL" && OpSys == "LINUX") || \
                 (Arch == "INTEL" && OpSys =="SOLARIS26") || \
                 (Arch == "SUN4u" && OpSys == "SOLARIS28")

  Arguments    = +W1024 +H768 +Iimage1.pov
  Queue 

  Arguments    = +W1024 +H768 +Iimage2.pov
  Queue 

  Arguments    = +W1024 +H768 +Iimage3.pov
  Queue

These jobs are submitted to the vanilla universe to ensure that once a job is started on a specific platform, it will finish running on that platform. Switching platforms in the middle of job execution cannot work correctly.

There are two common errors made with the substitution macro. The first is the use of a non-existent MachineAdAttribute. If the specified MachineAdAttribute does not exist in the machine's ClassAd, then Condor will place the job in the Hold state until the problem is resolved.

The second common error occurs due to an incomplete job set up. For example, the submit description file given above specifies three available executables. If one is missing, Condor reports that an executable is missing when it happens to match the job with a machine that requires the missing binary.

2.5.6.2 Standard Universe Example for Execution on Differing Architectures

Jobs submitted to the standard universe may produce checkpoints. A checkpoint can then be used to start up and continue execution of a partially completed job. The checkpoint of a partially completed job is specific to the platform on which it was produced. If the job is migrated to a different machine, correct execution requires that the new machine be of the same platform.

In previous versions of Condor, the author of the heterogeneous submission file needed to write extra policy in the requirements expression to force Condor to choose the same type of platform when continuing a checkpointed job. Since this is needed in the common case, this additional policy is now added to the requirements expression automatically, provided the user does not reference CkptArch in the requirements expression. Condor remains backward compatible for those users who have explicitly specified CkptRequirements (implying use of CkptArch) in their requirements expression.

The expression added when the attribute CkptArch is not specified will default to

  # Added by Condor
  CkptRequirements = ((CkptArch == Arch) || (CkptArch =?= UNDEFINED)) && \
                      ((CkptOpSys == OpSys) || (CkptOpSys =?= UNDEFINED))

  Requirements = (<user specified policy>) && $(CkptRequirements)

The behavior of the CkptRequirements expression and its addition to requirements is as follows. The CkptRequirements expression guarantees correct operation in the two possible cases for a job. In the first case, the job has not yet produced a checkpoint. The ClassAd attributes CkptArch and CkptOpSys will be undefined, and therefore the meta operator (=?=) evaluates to true. In the second case, the job has produced a checkpoint. The attributes CkptArch and CkptOpSys will be defined, and the expression restricts further execution to machines of the same platform, ensuring that the platform chosen for further execution will be the same as the one used just before the checkpoint was produced.
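
The difference between the equality operator (==) and the meta operator (=?=) is what makes this expression work. The following comment-only sketch illustrates the evaluation in the not-yet-checkpointed case:

  # When the job has not yet produced a checkpoint:
  #   (CkptArch == Arch)        evaluates to UNDEFINED, since CkptArch is undefined
  #   (CkptArch =?= UNDEFINED)  evaluates to TRUE
  # The disjunction is therefore TRUE, and the job may match any platform
  # allowed by the user-specified policy.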

Note that this platform restriction applies even when the executables for the platforms involved are binary compatible.

The complete submit description file for this example:

  ####################
  #
  # Example of heterogeneous submission
  #
  ####################

  universe     = standard
  Executable   = povray.$$(OpSys).$$(Arch)
  Log          = povray.log
  Output       = povray.out.$(Process)
  Error        = povray.err.$(Process)

  # Condor automatically adds the correct expressions to ensure that the
  # checkpointed jobs will restart on the correct platform types.
  Requirements = ( (Arch == "INTEL" && OpSys == "LINUX") || \
                   (Arch == "INTEL" && OpSys == "SOLARIS26") || \
                   (Arch == "SUN4u" && OpSys == "SOLARIS28") )

  Arguments    = +W1024 +H768 +Iimage1.pov
  Queue 

  Arguments    = +W1024 +H768 +Iimage2.pov
  Queue 

  Arguments    = +W1024 +H768 +Iimage3.pov
  Queue

