8.8 Development Release Series 6.3
This is the second development release series of Condor.
It contains numerous enhancements over the 6.2 stable series.
For example:
- Support for Kerberos and X.509 authentication.
- Support for transferring files needed by jobs (for all universes
except standard and PVM).
- Support for MPICH jobs.
- Support for Java jobs.
- Condor DAGMan is dramatically more reliable and efficient, and offers
a number of new features.
The 6.3 series has many other improvements over the 6.2 series, and
may be available on newer platforms. The new features, bugs fixed,
and known bugs of each version are described below in detail.
Version 6.3.3
New Features:
- Added support for Kerberos and X.509 authentication in Condor.
- Added the ability for vanilla jobs on Unix to use Condor's file
transfer mechanism so that you don't have to rely on a shared file
system.
- Added support for MPICH jobs on Windows NT and 2000.
- Added support for the Java universe.
- When you use condor_hold and condor_release, you now see an
entry about the event in the UserLog file for the job.
- Whenever a job is removed, put on hold, or released (either by a
Condor user or by the Condor system itself), a "reason"
attribute is placed in the job ad and written to the UserLog file.
If a job is held, HoldReason will be set.
If a job is released, ReleaseReason will be set.
If a job is removed, RemoveReason will be set.
In addition, whenever a job's status changes,
EnteredCurrentStatus will contain the epoch time when the
change took place.
- The error messages you get from condor_rm, condor_hold and
condor_release have all been updated to be more specific and
accurate.
- Condor users can now specify a policy for when their jobs should
leave the queue or be put on hold.
They can specify expressions that are evaluated periodically, and
whenever the job exits.
This policy can be used to ensure that the job remains in the queue
and is re-run until it exits with a certain exit code, that the job
should be put on hold if a certain condition is true, and so on.
If any of these policy expressions result in the job being removed
from the queue or put on hold, the UserLog entry for the event
includes a string describing why the action was taken.
- Changed the way Condor finds the various condor_shadow and
condor_starter binaries you have installed on your machine.
Now, you can specify a SHADOW_LIST and a STARTER_LIST.
These are treated much like the DAEMON_LIST setting: they
specify a list of attribute names, each of which points to the actual
binary you want to use.
On startup, Condor will check these lists, make sure all the binaries
specified exist, and find out what abilities each program provides.
This information is used during matchmaking to ensure that a job which
requires a certain ability (like having a new enough version of Condor
to support transferring files on Unix) can find a resource that
provides that ability.
- Added a new security feature to offer fine-grained control over
what configuration values can be modified by condor_config_val
using -set and related options.
Pool administrators can now define lists of attributes that can be set
by hosts that authenticate to the various permission levels of
Condor's host-based security (for example, WRITE or ADMINISTRATOR).
These lists are defined by attributes with names like
SETTABLE_ATTRS_CONFIG and STARTD_SETTABLE_ATTRS_OWNER.
For more information about host-based security in Condor and about
how to configure the new settings, see section 3.6.8.
- Greatly improved the handling of the "soft kill signal" you
can specify for your job.
This signal is now stored as a signal name, not an integer, so that it
works across different platforms.
Also, fixed some bugs where the signal numbers were getting translated
incorrectly in some circumstances.
- Added the -full option to condor_reconfig.
The -full option causes the Condor daemon to clear its cache of
DNS information and perform some other expensive operations.
So, the regular condor_reconfig is now more lightweight, and can
be used more frequently without undue overhead on the Condor daemons.
The default condor_reconfig has also been changed so that it will
work from any host with WRITE permission in your pool,
instead of requiring ADMINISTRATOR access.
- Added the EMAIL_DOMAIN config file setting.
This allows Condor administrators to define a default domain where
Condor should send email if whatever UID_DOMAIN is set to
would yield invalid email addresses.
For more information, see section 3.3.3.
- Added support for Red Hat 7.2.
- When writing the UserLog, Condor now logs a new
"Image size of job updated" event only when the new value is different
from the existing value.
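As an illustration of the reason attributes and EnteredCurrentStatus described in the list above, the following Python sketch turns the epoch time into a readable timestamp and reports whichever reason attribute is present. Representing the job ad as a plain dictionary, the helper name describe_status_change, and the sample reason string are assumptions made for this example; they are not part of Condor.

```python
from datetime import datetime, timezone

def describe_status_change(job_ad):
    """Return a one-line summary of a job's last status change.

    job_ad is assumed to be a dict holding job ad attributes.
    """
    # EnteredCurrentStatus is an epoch time (seconds since 1970-01-01 UTC).
    when = datetime.fromtimestamp(job_ad["EnteredCurrentStatus"], tz=timezone.utc)
    # Report the first reason attribute that is present in the ad.
    for attr in ("HoldReason", "ReleaseReason", "RemoveReason"):
        if attr in job_ad:
            return f"{when.isoformat()} {attr}: {job_ad[attr]}"
    return f"{when.isoformat()} (no reason recorded)"

# Hypothetical job ad contents, for illustration only.
example_ad = {"EnteredCurrentStatus": 1000000000, "HoldReason": "held by user"}
print(describe_status_change(example_ad))
```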
Bugs Fixed:
- Fixed a bug in Condor-PVM where it was possible that a machine would be
placed into the virtual machine, but then ignored by Condor for the purposes
of scheduling tasks there.
- Under Solaris, the checkpointing libraries could segfault while determining
the page size of the machine.
This has been fixed.
- On a heavily loaded submit machine, the condor_schedd would time out
authentication checks with its shadows.
This would cause the shadows to
exit, believing the condor_schedd had died, which placed jobs into the idle
state and caused the condor_schedd to exhibit poor performance.
This timeout problem has been corrected.
- Removed use of the bfd library in the Condor Linux distribution.
This will make the dynamic versions of the Condor executables have a
higher chance of being usable when Red Hat upgrades.
- When you specified "STARTD_HAS_BAD_UTMP = True" in the config files
on a Linux machine with a 2.4+ kernel, the condor_startd would report
an error stat'ing some of the tty entries in /dev. This would result
in incorrect tty activity sampling, causing jobs to not be migrated or
to be incorrectly started on a resource. This has now been corrected.
- When you specify "GenEnv = True" in a condor_submit file,
your environment is no longer restricted to 10KB.
- The three-digit event numbers which begin each job event in the
userlog were incorrect for some events in Condor 6.3.0 and 6.3.1.
Specifically, ULOG_JOB_SUSPENDED, ULOG_JOB_UNSUSPENDED,
ULOG_JOB_HELD, ULOG_JOB_RELEASED, ULOG_GENERIC, and
ULOG_JOB_ABORTED had incorrect event numbers. This has now been
corrected.
NOTE: This means userlog-parsing code written for Condor 6.3.0 or
6.3.1 development releases may not work reliably with userlogs
generated by other versions of Condor, and vice versa. Userlog events
will remain compatible between all stable releases of Condor, however,
and with post-6.3.1 releases in this development series.
- The condor_run script now correctly exits when it sees a job aborted
event, instead of hanging, waiting for a termination event.
- Until now, when a DAG node's Condor job failed, the node failed,
regardless of whether its POST script succeeded or failed. This was a
bug, because it prevented users from using POST scripts to evaluate
jobs with non-zero exit codes and deem them successful anyway. This
has now been fixed (a node's success is equal to its POST script's
success), but the change may affect existing DAGs which rely on the
old, broken behavior. Users utilizing POST scripts must now be sure
to pass the POST script the job's return value, and return it again,
if they do not wish to alter it; otherwise failed jobs will be masked
by ignorant POST scripts which always succeed.
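The POST script behavior described in the last item above can be sketched in Python as follows. This is a minimal pass-through example, assuming the DAG file is set up to pass the job's return value to the script as its first argument; the function name post_exit_code and the script name post.py are hypothetical.

```python
def post_exit_code(argv):
    """Return the exit code a pass-through POST script should use.

    argv[1] is assumed to hold the job's return value.
    """
    if len(argv) < 2:
        # No return value was supplied: fail rather than mask a failed job.
        return 1
    try:
        # Hand the job's own return value back unchanged.
        return int(argv[1])
    except ValueError:
        # An unparsable value is treated as a failure.
        return 1

# Illustrative calls with hypothetical argument lists.
print(post_exit_code(["post.py", "0"]))
print(post_exit_code(["post.py", "3"]))
```

A real script would finish by calling sys.exit(post_exit_code(sys.argv)), so that a failed job still causes the node to fail unless the script deliberately decides otherwise.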
Known Bugs:
- The HP-UX Vendor C++ CFront compiler does not work with condor_compile
if exception handling is enabled with +eh.
- The HP-UX Vendor aCC compiler does not work at all with Condor.
Version 6.3.2
Version 6.3.2 of Condor was only released as a version of
"Condor-G".
This version of Condor-G is not widely deployed.
However, to avoid confusion, the Condor developers did not want to
release a full Condor distribution with the same version number.
Version 6.3.1
New Features:
- Added support for an x509proxy option in
condor_submit. There is now a separate condor_GridManager for each
user and proxy pair. This will be detailed in a future release of
Condor.
- More Condor DAGMan improvements and bug fixes:
- Added a -dag flag to condor_q to more succinctly display DAGs
and their ownership.
- Added a new event to the Condor userlog at the completion of a POST
script. This allows DAGMan, during recovery, to know which POST
scripts have finished successfully, so it no longer has to re-run them
all to make sure.
- Implemented separate -MaxPre and -MaxPost options to limit
the number of simultaneously running PRE and POST scripts. The
-MaxScripts option is still available, and is equivalent to
setting both -MaxPre and -MaxPost to the same value.
- Added support for a new "Retry" parameter in the DAG file, which
instructs DAGMan to automatically retry a node a configurable number
of times if its PRE Script, Job, or POST Script fails for any reason.
- Added timestamps to all DAGMan log messages.
- Fixed a bug whereby DAGMan would clean up its lock file without
creating a rescue file when killed with SIGTERM.
- DAGMan no longer aborts the DAG if it encounters executable error or
job aborted events in the userlog, but rather marks the corresponding
DAG nodes as "failed" so the rest of the DAG can continue.
- Fixed a bug whereby DAGMan could crash if it saw userlog events for
jobs it didn't submit.
- Added port restriction capabilities to Condor so you can specify a range
of ports to use for the communication between Condor Daemons.
- To improve performance, if no HISTORY file is
specified, exit information is no longer reported back to the schedd on
successful completion, since the schedd would simply discard that
information anyway.
- Added the macro SECONDARY_COLLECTOR_LIST to tell the
master to send classads to an additional list of collectors so you can
do administration commands when the primary collector is down.
- When a job checkpoints, it asks the shadow whether or not it
should and, if so, where. This fixes some flocking bugs and increases
performance of the pool.
- Added match rejection diagnostics in condor_q -analyze to
give more information on why a particular job hasn't started up yet.
- Added a -vms argument to condor_glidein that enables the
control of how many virtual machines to start up on the target platform.
- Added capability to the config file language to retrieve environment
variables while being processed.
- Added capability to make the default user priority factor configurable
with the DEFAULT_PRIO_FACTOR macro in the config files.
- Added full support for Red Hat 7.1 and the gcc 2.96 compiler. However,
the standard universe binaries must still be statically linked.
- When jobs are suspended or unsuspended, an event is now written into
the user job log.
- Added a -a flag to condor_submit to add/override attributes
specified in the submit file.
- Under Unix, added the ability for a submitter of a job to describe when
and how a job is allowed/not allowed to leave the queue. For example, if
a job has only run for 5 minutes, but it was supposed to have run an hour
minimum, then do not let the job leave the queue but restart it instead.
- A new environment variable, CONDOR_SCRATCH_DIR, is available
in a standard or vanilla job's environment. It denotes temporary space
the job can use that will be cleaned up automatically when the job leaves
the machine.
- Not exactly a new feature, but some internal parts of Condor have been
fixed up to try to improve the memory footprint of a few of our daemons.
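A job could make use of the CONDOR_SCRATCH_DIR variable mentioned in the list above roughly as follows. This is a sketch only: the helper names scratch_dir and write_scratch_file are hypothetical, and falling back to the system temporary directory when the variable is unset is a choice made for this example, not documented Condor behavior.

```python
import os
import tempfile

def scratch_dir(environ=os.environ):
    """Pick a directory for the job's temporary files."""
    # Prefer the space Condor provides; fall back to the system temp
    # directory (an assumption of this sketch) when running outside Condor.
    return environ.get("CONDOR_SCRATCH_DIR") or tempfile.gettempdir()

def write_scratch_file(name, data, environ=os.environ):
    """Write data to a file in the scratch directory and return its path."""
    path = os.path.join(scratch_dir(environ), name)
    with open(path, "w") as handle:
        handle.write(data)
    return path
```

Passing the environment as a parameter keeps the helpers easy to exercise outside a Condor job, for example by supplying a dictionary with CONDOR_SCRATCH_DIR set to a test directory.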
Bugs Fixed:
- Fixed a bug where condor_q would produce wildly inaccurate run time
reports of jobs in the queue.
- Fixed the Condor scheduler so that if it fails to notify the
administrator through email, it just prints a warning and does not except.
- Fixed a bug where condor_submit would incorrectly create the user
log file.
- Fixed a bug where a job queue sorted by date with condor_q would
be displayed in descending instead of ascending order.
- Fixed and improved error handling when condor_submit fails.
- Numerous fixes in the Condor User Log System.
- Fixed a bug where Condor would inspect its on-disk job queue log
with case sensitivity. Now there is no case sensitivity.
- Fixed a bug in condor_glidein where it had trouble figuring out
the architecture of a minimally installed HP-UX machine.
- Fixed it so that email to the user has the word "condor" capitalized
in the subject.
- Fixed a situation where, when a user had multiple schedulers submitting
to the same pool, the Negotiator would starve some of the schedulers.
- Added a feature whereby if a transfer of an executable
from a submission machine to an execute machine fails, Condor
will retry a configurable number of times, denoted by the
EXEC_TRANSFER_ATTEMPTS macro. This macro defaults to three if
left undefined. This macro exists only for the Unix port of Condor.
- Fixed a bug where if a schedd had too many rejected clusters during a
match phase, it would "except" and have to be restarted by the master.
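The retry behavior controlled by EXEC_TRANSFER_ATTEMPTS, described in the list above, can be pictured with the following Python sketch. It is an illustration of the idea rather than Condor's implementation: the function name, the dictionary standing in for the configuration, and the reading of the macro as the total number of attempts are all assumptions.

```python
def transfer_with_retries(transfer, config):
    """Call transfer() until it succeeds or the attempt limit is reached.

    transfer is any callable returning True on success.
    config is a dict standing in for the Condor configuration.
    """
    # The default of three mirrors the documented default for an
    # undefined macro; treating it as total attempts is an assumption.
    attempts = int(config.get("EXEC_TRANSFER_ATTEMPTS", 3))
    for attempt in range(1, attempts + 1):
        if transfer():
            return attempt  # number of attempts actually used
    return None  # every attempt failed
```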
Known Bugs:
- The HP-UX Vendor C++ CFront compiler does not work with condor_compile
if exception handling is enabled with +eh.
- The HP-UX Vendor aCC compiler does not work at all with Condor.
Version 6.3.0
New Features:
- Added support for running MPICH jobs under Condor.
- Many Condor DAGMan improvements and bug fixes:
- PRE and POST scripts now run asynchronously, rather than synchronously
as in the past. As a result, DAGMan now supports a -MaxScripts
option to limit the number of simultaneously running PRE and POST
scripts.
- Whether or not POST scripts are always executed after failed jobs is
now configurable with the -NoPostFail argument.
- Added a -r flag to condor_submit_dag to submit DAGMan to a
remote condor_schedd.
- Made the arguments to condor_submit_dag case-insensitive.
- Fixed a variety of bugs in DAGMan's event handling, so DAGMan should
no longer hang indefinitely after failed jobs, or mistake one job's
userlog events for those of another.
- DAGMan's error handling and logging output have been substantially
clarified and improved. For example, DAGMan now prints a list of
failed jobs when it exits, rather than just saying "some jobs
failed".
- Jobs submitted by a condor_dagman job now have DAGManJobId
and DAGNodeName in the job classad.
- Fixed a condor_submit_dag bug preventing the submission of DAGMan
Rescue files.
- Improved the handling of userlog errors (less crashing, more coping).
- Fixed a bug when recovering from the userlog after a crash or reboot.
- Fixed bugs in the handling of -MaxJobs.
- Added a -a line argument to condor_submit to add a line to the
submit file before processing (overriding the submit file).
- Added a -dag flag to condor_q to format and sort DAG jobs
sensibly under their DAGMan master job.
Known Bugs:
- condor_kbdd doesn't work properly under Compaq Tru64 5.1, and
as a result, resources may not leave the "Unclaimed" state
regardless of keyboard or pty activity. Compaq Tru64 5.0a and earlier
do work properly.
condor-admin@cs.wisc.edu