When jobs are submitted, Condor will attempt to find resources to run the jobs. A list of all users with submitted jobs may be obtained through condor_status with the -submitters option. Using this option yields output similar to:
%  condor_status -submitters

Name                 Machine      Running IdleJobs HeldJobs

ballard@cs.wisc.edu  bluebird.c         0       11        0
nice-user.condor@cs. cardinal.c         6      504        0
wright@cs.wisc.edu   finch.cs.w         1        1        0
jbasney@cs.wisc.edu  perdita.cs         0        0        5

                     RunningJobs IdleJobs HeldJobs

 ballard@cs.wisc.edu           0       11        0
 jbasney@cs.wisc.edu           0        0        5
nice-user.condor@cs.           6      504        0
  wright@cs.wisc.edu           1        1        0

               Total           7      516        5
%  condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED     CPU_USAGE ST PRI SIZE CMD
 125.0   jbasney       4/10 15:35   0+00:00:00 I  -10 1.2  hello.remote
 127.0   raman         4/11 15:35   0+00:00:00 R  0   1.4  hello
 128.0   raman         4/11 15:35   0+00:02:33 I  0   1.4  hello

3 jobs; 2 idle, 1 running, 0 held

This output contains many columns of information about the queued jobs.
The ST column (for status) shows the status of current jobs in the queue. An R in the status column means that the job is currently running. An I stands for idle. The job is not running right now, because it is waiting for a machine to become available. The status H is the hold state. In the hold state, the job will not be scheduled to run until it is released (see the condor_hold reference page). You may also see a U in the status column, which stands for unexpanded. In this state, a job has never produced a checkpoint, and when the job starts running, it will start running from the beginning. Newer versions of Condor do not use the U state.
The CPU_USAGE time reported for a job is the time that has been committed to the job. It is not updated for a job until the job checkpoints. At that time, the job has made guaranteed forward progress. Depending upon how the site administrator configured the pool, several hours may pass between checkpoints, so do not worry if you do not observe the CPU_USAGE entry changing by the hour. Also note that this is actual CPU time as reported by the operating system; it is not time as measured by a wall clock.
Another useful method of tracking the progress of jobs is through the user log. If you have specified a log command in your submit file, the progress of the job may be followed by viewing the log file. Various events such as execution commencement, checkpoint, eviction and termination are logged in the file. Also logged is the time at which the event occurred.
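For illustration, a minimal submit description file that requests a user log might look like the following (the executable and file names here are only examples):

# minimal submit description file sketch; names are illustrative
executable = hello
log        = hello.log
output     = hello.out
error      = hello.err
queue

Events for the job are then appended to hello.log as they occur.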
When your job begins to run, Condor starts up a condor_shadow process on the submit machine. The shadow process is the mechanism by which a remotely executing job can access the environment from which it was submitted, such as input and output files.
It is normal for a machine which has submitted hundreds of jobs to have hundreds of shadows running on the machine. Since the text segments of all these processes are the same, the load on the submit machine is usually not significant. If, however, you notice degraded performance, you can limit the number of jobs that can run simultaneously through the MAX_JOBS_RUNNING configuration parameter. Please talk to your system administrator for the necessary configuration change.
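As a sketch of what such a change might look like, the administrator could add a line like the following to the Condor configuration on the submit machine (the value 200 is only illustrative; an appropriate limit depends on the machine):

## Limit the number of simultaneously running jobs (and thus
## condor_shadow processes) started from this submit machine.
## The value shown here is an example, not a recommendation.
MAX_JOBS_RUNNING = 200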
You can also find all the machines that are running your job through the condor_status command. For example, to find all the machines that are running jobs submitted by ``breach@cs.wisc.edu,'' type:
%  condor_status -constraint 'RemoteUser == "breach@cs.wisc.edu"'

Name       Arch     OpSys        State     Activity  LoadAv Mem  ActvtyTime

alfred.cs. INTEL    SOLARIS251   Claimed   Busy      0.980  64   0+07:10:02
biron.cs.w INTEL    SOLARIS251   Claimed   Busy      1.000  128  0+01:10:00
cambridge. INTEL    SOLARIS251   Claimed   Busy      0.988  64   0+00:15:00
falcons.cs INTEL    SOLARIS251   Claimed   Busy      0.996  32   0+02:05:03
happy.cs.w INTEL    SOLARIS251   Claimed   Busy      0.988  128  0+03:05:00
istat03.st INTEL    SOLARIS251   Claimed   Busy      0.883  64   0+06:45:01
istat04.st INTEL    SOLARIS251   Claimed   Busy      0.988  64   0+00:10:00
istat09.st INTEL    SOLARIS251   Claimed   Busy      0.301  64   0+03:45:00
...

To find all the machines that are running any job at all, type:
%  condor_status -run

Name       Arch     OpSys        LoadAv RemoteUser           ClientMachine

adriana.cs INTEL    SOLARIS251   0.980  hepcon@cs.wisc.edu   chevre.cs.wisc.
alfred.cs. INTEL    SOLARIS251   0.980  breach@cs.wisc.edu   neufchatel.cs.w
amul.cs.wi SUN4u    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
anfrom.cs. SUN4x    SOLARIS251   1.023  ashoks@jules.ncsa.ui jules.ncsa.uiuc
anthrax.cs INTEL    SOLARIS251   0.285  hepcon@cs.wisc.edu   chevre.cs.wisc.
astro.cs.w INTEL    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
aura.cs.wi SUN4u    SOLARIS251   0.996  nice-user.condor@cs. chevre.cs.wisc.
balder.cs. INTEL    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
bamba.cs.w INTEL    SOLARIS251   1.574  dmarino@cs.wisc.edu  riola.cs.wisc.e
bardolph.c INTEL    SOLARIS251   1.000  nice-user.condor@cs. chevre.cs.wisc.
...
A job may be removed from the queue with the condor_rm command, given the job's ID. For example:

%  condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED     CPU_USAGE ST PRI SIZE CMD
 125.0   jbasney       4/10 15:35   0+00:00:00 I  -10 1.2  hello.remote
 132.0   raman         4/11 16:57   0+00:00:00 R  0   1.4  hello

2 jobs; 1 idle, 1 running, 0 held

%  condor_rm 132.0
Job 132.0 removed.

%  condor_q

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED     CPU_USAGE ST PRI SIZE CMD
 125.0   jbasney       4/10 15:35   0+00:00:00 I  -10 1.2  hello.remote

1 jobs; 1 idle, 0 running, 0 held
Use of the condor_hold command causes a hard kill signal to be sent to a currently running job (one in the running state). For a standard universe job, this means that no checkpoint is generated before the job stops running and enters the hold state. When released, this standard universe job continues its execution using the most recent checkpoint available.
Jobs in universes other than the standard universe that are running when placed on hold will start over from the beginning when released.
The manual pages for condor_hold and condor_release contain usage details.
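For illustration, a job is placed on hold and later released by giving its job ID (the ID 126.0 below is hypothetical):

%  condor_hold 126.0
%  condor_release 126.0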
In addition to the priorities assigned to each user, Condor also provides each user with the capability of assigning priorities to each submitted job. These job priorities are local to each queue and can be any integer value, with higher values meaning better priority.
The default priority of a job is 0, but it can be changed using the condor_prio command. For example, to change the priority of a job to -15:
%  condor_q raman

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED     CPU_USAGE ST PRI SIZE CMD
 126.0   raman         4/11 15:06   0+00:00:00 I  0   0.3  hello

1 jobs; 1 idle, 0 running, 0 held

%  condor_prio -p -15 126.0

%  condor_q raman

-- Submitter: froth.cs.wisc.edu : <128.105.73.44:33847> : froth.cs.wisc.edu
 ID      OWNER          SUBMITTED     CPU_USAGE ST PRI SIZE CMD
 126.0   raman         4/11 15:06   0+00:00:00 I  -15 0.3  hello

1 jobs; 1 idle, 0 running, 0 held
It is important to note that these job priorities are completely different from the user priorities assigned by Condor. Job priorities do not impact user priorities. They are only a mechanism for the user to identify the relative importance of jobs among all the jobs submitted by the user to that specific queue.
The condor_q command with the -analyze option helps explain why a job is not running, by summarizing how the machines in the pool evaluate the job. In the first example, no machines are currently available to run the job:

%  condor_q -pool condor -name beak -analyze 331228.2359

Warning: No PREEMPTION_REQUIREMENTS expression in config file --- assuming FALSE

-- Schedd: beak.cs.wisc.edu : <128.105.146.14:30918>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
331228.2359:  Run analysis summary.  Of 819 machines,
    159 are rejected by your job's requirements
    137 reject your job because of their own requirements
    488 match, but are serving users with a better priority in the pool
     11 match, but prefer another specific job despite its worse user-priority
     24 match, but cannot currently preempt their existing job
      0 are available to run your job
A second example shows a job that does not run because the job does not have a high enough priority to cause other running jobs to be preempted.
%  condor_q -pool condor -name beak -analyze 207525.0

Warning: No PREEMPTION_REQUIREMENTS expression in config file --- assuming FALSE

-- Schedd: beak.cs.wisc.edu : <128.105.146.14:30918>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
207525.000:  Run analysis summary.  Of 818 machines,
    317 are rejected by your job's requirements
    419 reject your job because of their own requirements
     79 match, but are serving users with a better priority in the pool
      3 match, but prefer another specific job despite its worse user-priority
      0 match, but cannot currently preempt their existing job
      0 are available to run your job
        Last successful match: Wed Jan  8 14:57:42 2003
        Last failed match: Fri Jan 10 15:46:45 2003
        Reason for last match failure: insufficient priority
While the analyzer can diagnose most common problems, there are some situations that it cannot reliably detect due to the instantaneous and local nature of the information it uses to detect the problem. Thus, it may be that the analyzer reports that resources are available to service the request, but the job still does not run. In most of these situations, the delay is transient, and the job will run during the next negotiation cycle.
If the problem persists and the analyzer is unable to detect the situation, it may be that the job begins to run but immediately terminates due to some problem. Viewing the job's error and log files (specified in the submit command file) and Condor's SHADOW_LOG file may assist in tracking down the problem. If the cause is still unclear, please contact your system administrator.
The first field in an event is the numeric value assigned as the event type in a 3-digit format. The second field identifies the job which generated the event. Within parentheses are the ClassAd job attributes of ClusterId value, ProcId value, and the MPI-specific rank for MPI universe jobs or a set of zeros (for jobs run under universes other than MPI), separated by periods. The third field is the date and time of the event logging. The fourth field is a string that briefly describes the event. Fields that follow the fourth field give further information for the specific event type.
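As an illustrative sketch of this format (the host addresses, job ID, dates, and times shown are hypothetical, and the detail lines following each event vary by event type), entries in a user log look roughly like:

000 (125.000.000) 04/10 15:35:10 Job submitted from host: <128.105.73.44:33847>
...
001 (125.000.000) 04/10 15:40:47 Job executing on host: <128.105.146.14:1026>
...
005 (125.000.000) 04/10 16:05:11 Job terminated.
        (1) Normal termination (return value 0)
...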
These are all of the events that can show up in a job log file:
Event Number: 000
Event Name: Job submitted
Event Description: This event occurs when a user submits a job.
It is the first event you will see for a job, and it should only occur
once.
Event Number: 001
Event Name: Job executing
Event Description: This shows up when a job is running.
It might occur more than once.
Event Number: 002
Event Name: Error in executable
Event Description: The job couldn't be run because the
executable was bad.
Event Number: 003
Event Name: Job was checkpointed
Event Description: The job's complete state was written to a checkpoint
file.
This might happen without the job being removed from a machine,
because the checkpointing can happen periodically.
Event Number: 004
Event Name: Job evicted from machine
Event Description: A job was removed from a machine before it finished,
usually for a policy reason: perhaps an interactive user has claimed
the computer, or perhaps another job is higher priority.
Event Number: 005
Event Name: Job terminated
Event Description: The job has completed.
Event Number: 006
Event Name: Image size of job updated
Event Description: This is informational.
It is referring to the memory that the job is using while running. It
does not reflect the state of the job.
Event Number: 007
Event Name: Shadow exception
Event Description:
The condor_shadow, a program on the submit computer that watches
over the job and performs some services for the job, failed for some
catastrophic reason. The job will leave the machine and go back into
the queue.
Event Number: 008
Event Name: Generic log event
Event Description: Not used.
Event Number: 009
Event Name: Job aborted
Event Description: The user cancelled the job.
Event Number: 010
Event Name: Job was suspended
Event Description: The job is still on the computer, but it is no longer
executing.
This is usually for a policy reason, like an interactive user using
the computer.
Event Number: 011
Event Name: Job was unsuspended
Event Description: The job has resumed execution, after being
suspended earlier.
Event Number: 012
Event Name: Job was held
Event Description: The user has paused the job, perhaps with
the condor_hold command.
It was stopped, and will go back into the queue again until it is
aborted or released.
Event Number: 013
Event Name: Job was released
Event Description: The user is requesting that a job on hold be re-run.
Event Number: 014
Event Name: Parallel node executed
Event Description: A parallel (MPI) program is running on a node.
Event Number: 015
Event Name: Parallel node terminated
Event Description: A parallel (MPI) program has completed on a node.
Event Number: 016
Event Name: POST script terminated
Event Description: A node in a DAGMan workflow has a script
that should be run after a job.
The script is run on the submit host.
This event signals that the post script has completed.
Event Number: 017
Event Name: Job submitted to Globus
Event Description: A grid job has been delegated to Globus
(version 2, 3, or 4).
Event Number: 018
Event Name: Globus submit failed
Event Description: The attempt to delegate a job to Globus
failed.
Event Number: 019
Event Name: Globus resource up
Event Description: The Globus resource that a job wants to run
on was unavailable, but is now available.
Event Number: 020
Event Name: Detected Down Globus Resource
Event Description: The Globus resource that a job wants to run
on has become unavailable.
Event Number: 021
Event Name: Remote error
Event Description: The condor_starter (which monitors the job
on the execution machine) has failed.
Event Number: 022
Event Name: Remote system call socket lost
Event Description: The condor_shadow and condor_starter
(which communicate while the job runs) have lost contact.
Event Number: 023
Event Name: Remote system call socket reestablished
Event Description: The condor_shadow and condor_starter
(which communicate while the job runs) have been able to resume
contact before the job lease expired.
Event Number: 024
Event Name: Remote system call reconnect failure
Event Description: The condor_shadow and condor_starter
(which communicate while the job runs) were unable to resume
contact before the job lease expired.
Event Number: 025
Event Name: Grid Resource Back Up
Event Description: A grid resource that was previously
unavailable is now available.
Event Number: 026
Event Name: Detected Down Grid Resource
Event Description: The grid resource that a job is to
run on is unavailable.
Event Number: 027
Event Name: Job submitted to grid resource
Event Description: A job has been submitted,
and is under the auspices of the grid resource.
When your Condor job completes (either through normal means or abnormal termination by signal), Condor will remove it from the job queue (i.e., it will no longer appear in the output of condor_q) and insert it into the job history file. You can examine the job history file with the condor_history command. If you specified a log file in your submit description file, then the job exit status will be recorded there as well.
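For example, history for a completed job can be looked up by giving condor_history the job's ID (the ID below is hypothetical):

%  condor_history 125.0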
By default, Condor will send you an email message when your job completes. You can modify this behavior with the condor_submit ``notification'' command. The message will include the exit status of your job (i.e., the argument your job passed to the exit system call when it completed) or notification that your job was killed by a signal. It will also include the following statistics (as appropriate) about your job:
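For instance, to suppress these email messages entirely, the submit description file can include a line such as:

notification = Never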