
8.8 Development Release Series 6.3

This is the second development release series of Condor.

It contains numerous enhancements over the 6.2 stable series. For example:

Support for Kerberos and X.509 authentication.
Support for transferring files needed by jobs (for all universes except standard and PVM).
Support for MPICH jobs.
Support for JAVA jobs.
Condor DAGMan is dramatically more reliable and efficient, and offers a number of new features.

The 6.3 series has many other improvements over the 6.2 series, and may be available on newer platforms. The new features, bugs fixed, and known bugs of each version are described below in detail.

Version 6.3.3

New Features:

Added support for Kerberos and X.509 authentication in Condor.
Added the ability for vanilla jobs on Unix to use Condor's file transfer mechanism so that you don't have to rely on a shared file system.
Added support for MPICH jobs on Windows NT and 2000.
Added support for the JAVA universe.
When you use condor_hold and condor_release, you now see an entry about the event in the UserLog file for the job.
Whenever a job is removed, put on hold, or released (either by a Condor user or by the Condor system itself), there is a ``reason'' attribute placed in the job ad and written to the UserLog file. If a job is held, HoldReason will be set. If a job is released, ReleaseReason will be set. If a job is removed, RemoveReason will be set. In addition, whenever a job's status changes, EnteredCurrentStatus will contain the epoch time when the change took place.
The error messages you get from condor_rm, condor_hold, and condor_release have all been updated to be more specific and accurate.
Condor users can now specify a policy for when their jobs should leave the queue or be put on hold. They can specify expressions that are evaluated periodically, and whenever the job exits. This policy can be used to ensure that the job remains in the queue and is re-run until it exits with a certain exit code, that the job should be put on hold if a certain condition is true, and so on. If any of these policy expressions result in the job being removed from the queue or put on hold, the UserLog entry for the event includes a string describing why the action was taken.
Changed the way Condor finds the various condor_shadow and condor_starter binaries you have installed on your machine. Now, you can specify a SHADOW_LIST and a STARTER_LIST. These are treated much like the DAEMON_LIST setting: they specify a list of attribute names, each of which points to the actual binary you want to use. On startup, Condor will check these lists, make sure all the binaries specified exist, and find out what abilities each program provides. This information is used during matchmaking to ensure that a job which requires a certain ability (like having a new enough version of Condor to support transferring files on Unix) can find a resource that provides that ability.
Added a new security feature to offer fine-grained control over what configuration values can be modified by condor_config_val using -set and related options. Pool administrators can now define lists of attributes that can be set by hosts that authenticate to the various permission levels of Condor's host-based security (for example, WRITE, ADMINISTRATOR, etc). These lists are defined by attributes with names like SETTABLE_ATTRS_CONFIG and STARTD_SETTABLE_ATTRS_OWNER. For more information about host-based security in Condor, and about how to configure the new settings, see section 3.6.8.
Greatly improved the handling of the ``soft kill signal'' you can specify for your job. This signal is now stored as a signal name, not an integer, so that it works across different platforms. Also, fixed some bugs where the signal numbers were getting translated incorrectly in some circumstances.
Added the -full option to condor_reconfig. The -full option causes the Condor daemon to clear its cache of DNS information and to perform some other expensive operations. As a result, the regular condor_reconfig is now more lightweight, and can be used more frequently without undue overhead on the Condor daemons. The default condor_reconfig has also been changed so that it will work from any host with WRITE permission in your pool, instead of requiring ADMINISTRATOR access.
Added the EMAIL_DOMAIN config file setting. This allows Condor administrators to define a default domain where Condor should send email if whatever UID_DOMAIN is set to would yield invalid email addresses. For more information, see section 3.3.3.
Added support for Red Hat 7.2.
When writing to the UserLog, Condor now logs a new ``Image size of job updated'' event only when the new value is different from the existing value.
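As a rough illustration of the reason attributes and the EnteredCurrentStatus timestamp described above, the following Python sketch looks up the reason for an action and converts the epoch time to a readable UTC time. The job ad is modelled here as a plain dictionary; the helper name describe_change, the action labels, and the sample values are hypothetical and are not part of any Condor API.

```python
from datetime import datetime, timezone

# Attribute names come from the release notes above; the dictionary keys
# ("held", "released", "removed") are labels made up for this sketch.
REASON_ATTRS = {
    "held": "HoldReason",
    "released": "ReleaseReason",
    "removed": "RemoveReason",
}

def describe_change(job_ad, action):
    """Return a one-line summary of a hold, release, or remove action."""
    reason = job_ad.get(REASON_ATTRS[action], "(no reason recorded)")
    # EnteredCurrentStatus is documented as an epoch time (seconds).
    when = datetime.fromtimestamp(job_ad["EnteredCurrentStatus"], tz=timezone.utc)
    return f"{action} at {when.isoformat()}: {reason}"

# Hypothetical job ad contents, for illustration only.
sample_ad = {"HoldReason": "example reason text", "EnteredCurrentStatus": 1000000000}
print(describe_change(sample_ad, "held"))
# prints: held at 2001-09-09T01:46:40+00:00: example reason text
```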

Bugs Fixed:

Fixed a bug in Condor-PVM where it was possible that a machine would be placed into the virtual machine, but then ignored by Condor for the purposes of scheduling tasks there.
Under Solaris, the checkpointing libraries could segfault while determining the page size of the machine. This has been fixed.
On a heavily loaded submit machine, the condor_schedd would time out authentication checks with its shadows. This would cause the shadows to exit, believing the condor_schedd had died, which placed jobs into the idle state and caused the condor_schedd to exhibit poor performance. This timeout problem has been corrected.
Removed use of the bfd library in the Condor Linux distribution. This will give the dynamic versions of the Condor executables a higher chance of being usable when Red Hat upgrades.
When ``STARTD_HAS_BAD_UTMP = True'' was specified in the config files on a Linux machine with a 2.4+ kernel, the condor_startd would report an error when running stat on some of the tty entries in /dev. This would result in incorrect tty activity sampling, causing jobs to not be migrated or to be incorrectly started on a resource. This has now been corrected.
When you specify ``GetEnv = True'' in a condor_submit file, your environment is no longer restricted to 10KB.
The three-digit event numbers which begin each job event in the userlog were incorrect for some events in Condor 6.3.0 and 6.3.1. Specifically, ULOG_JOB_SUSPENDED, ULOG_JOB_UNSUSPENDED, ULOG_JOB_HELD, ULOG_JOB_RELEASED, ULOG_GENERIC, and ULOG_JOB_ABORTED had incorrect event numbers. This has now been corrected.
NOTE: This means userlog-parsing code written for Condor 6.3.0 or 6.3.1 development releases may not work reliably with userlogs generated by other versions of Condor, and vice versa. Userlog events will remain compatible between all stable releases of Condor, however, and with post-6.3.1 releases in this development series.
The condor_run script now correctly exits when it sees a job aborted event, instead of hanging while waiting for a termination event.
Until now, when a DAG node's Condor job failed, the node failed, regardless of whether its POST script succeeded or failed. This was a bug, because it prevented users from using POST scripts to evaluate jobs with non-zero exit codes and deem them successful anyway. This has now been fixed: a node's success is equal to its POST script's success. The change may affect existing DAGs which rely on the old, broken behavior. Users of POST scripts must now be sure to pass the POST script the job's return value, and have the script return it again, if they do not wish to alter it; otherwise, failed jobs will be masked by POST scripts which always succeed.
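A minimal sketch of a pass-through POST script in Python is shown here. It assumes the DAG is set up so that the job's return value reaches the script as its first command-line argument; how that argument is supplied is not covered in this sketch. The function name post_exit_code is hypothetical.

```python
def post_exit_code(argv):
    """Decide the POST script's exit code from the job's return value.

    Assumes argv[1] holds the job's return value as a decimal integer.
    A missing or malformed value is treated as a failure (exit code 1),
    so that a broken invocation cannot mask a failed job.
    """
    try:
        job_rc = int(argv[1])
    except (IndexError, ValueError):
        return 1
    # Site-specific checks (for example, inspecting output files) would go
    # here; this sketch passes the job's return value through unchanged.
    return job_rc

# Stand-in argument lists, for illustration only.
print(post_exit_code(["post.py", "0"]))  # prints 0
print(post_exit_code(["post.py", "3"]))  # prints 3
```

In an actual POST script, the last line would be something like sys.exit(post_exit_code(sys.argv)), so that the node fails exactly when the job did unless the script deliberately decides otherwise.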

Known Bugs:

The HP-UX Vendor C++ CFront compiler does not work with condor_compile if exception handling is enabled with +eh.
The HP-UX Vendor aCC compiler does not work at all with Condor.

Version 6.3.2

Version 6.3.2 of Condor was only released as a version of ``Condor-G''. This version of Condor-G is not widely deployed. However, to avoid confusion, the Condor developers did not want to release a full Condor distribution with the same version number.

Version 6.3.1

New Features:

Added support for an x509proxy option in condor_submit. There is now a separate condor_GridManager for each user and proxy pair. This will be detailed in a future release of Condor.
More Condor DAGMan improvements and bug fixes:
- Added a [-dag] flag to condor_q to more succinctly display DAGs and their ownership.
- Added a new event to the Condor userlog at the completion of a POST script. This allows DAGMan, during recovery, to know which POST scripts have finished successfully, so it no longer has to re-run them all to make sure.
- Implemented separate -MaxPre and -MaxPost options to limit the number of simultaneously running PRE and POST scripts. The -MaxScripts option is still available, and is equivalent to setting both -MaxPre and -MaxPost to the same value.
- Added support for a new ``Retry'' parameter in the DAG file, which instructs DAGMan to automatically retry a node a configurable number of times if its PRE script, job, or POST script fails for any reason.
- Added timestamps to all DAGMan log messages.
- Fixed a bug whereby DAGMan would clean up its lock file without creating a rescue file when killed with SIGTERM.
- DAGMan no longer aborts the DAG if it encounters executable error or job aborted events in the userlog, but rather marks the corresponding DAG nodes as ``failed'' so the rest of the DAG can continue.
- Fixed a bug whereby DAGMan could crash if it saw userlog events for jobs it didn't submit.
Added port restriction capabilities to Condor so you can specify a range of ports to use for the communication between Condor Daemons.
To improve performance, if no HISTORY file is specified, a job's exit information is no longer reported back to the schedd on successful completion, since the schedd would simply discard that information anyway.
Added the macro SECONDARY_COLLECTOR_LIST to tell the master to send classads to an additional list of collectors so you can do administration commands when the primary collector is down.
When a job checkpoints, it now asks the shadow whether it should and, if so, where. This fixes some flocking bugs and increases performance of the pool.
Added match rejection diagnostics in condor_q [-analyze] to give more information on why a particular job hasn't started up yet.
Added a [-vms] argument to condor_glidein that controls how many virtual machines to start up on the target platform.
Added capability to the config file language to retrieve environment variables while being processed.
Added the capability to make the default user priority factor configurable with the DEFAULT_PRIO_FACTOR macro in the config files.
Added full support for Red Hat 7.1 and the gcc 2.96 compiler. However, the standard universe binaries must still be statically linked.
When jobs are suspended or unsuspended, an event is now written into the user job log.
Added a [-a] flag to condor_submit to add or override attributes specified in the submit file.
Under Unix, added the ability for the submitter of a job to describe when and how a job is allowed or not allowed to leave the queue. For example, if a job has run for only 5 minutes but was supposed to run for at least an hour, Condor can restart the job instead of letting it leave the queue.
A new environment variable, CONDOR_SCRATCH_DIR, is available in a standard or vanilla job's environment. It names temporary space that the job can use and that is cleaned up automatically when the job leaves the machine.
Not exactly a new feature, but some internal parts of Condor have been reworked to try to reduce the memory footprint of a few of the daemons.
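As an illustration of how a job might use CONDOR_SCRATCH_DIR, the following Python sketch writes an intermediate file into the scratch space. The helper names scratch_dir and write_intermediate, the file name, and the fallback to the system temporary directory are choices made for this sketch, not Condor behavior.

```python
import os
import tempfile

def scratch_dir():
    """Return the directory a job should use for temporary files.

    Uses CONDOR_SCRATCH_DIR when it is set. The fallback to the system
    temporary directory is this sketch's own choice, so that the same
    code can also run outside of Condor.
    """
    return os.environ.get("CONDOR_SCRATCH_DIR") or tempfile.gettempdir()

def write_intermediate(data, name="intermediate.tmp"):
    """Write bytes to a file in the scratch directory and return its path."""
    path = os.path.join(scratch_dir(), name)
    with open(path, "wb") as handle:
        handle.write(data)
    return path
```

Because the scratch space is cleaned up when the job leaves the machine, anything the job needs to keep should be written elsewhere.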

Bugs Fixed:

Fixed a bug where condor_q would produce wildly inaccurate run time reports of jobs in the queue.
If the Condor scheduler fails to notify the administrator through email, it now just prints a warning and does not ``except''.
Fixed a bug where condor_submit would incorrectly create the user log file.
Fixed a bug where a job queue sorted by date with condor_q would be displayed in descending instead of ascending order.
Fixed and improved error handling when condor_submit fails.
Numerous fixes in the Condor User Log System.
Fixed a bug where Condor was case-sensitive when inspecting its on-disk job queue log. It is now case-insensitive.
Fixed a bug in condor_glidein where it had trouble figuring out the architecture of a minimally installed HP-UX machine.
Fixed it so that email to the user has the word ``condor'' capitalized in the subject.
Fixed a situation where when a user has multiple schedulers submitting to the same pool, the Negotiator would starve some of the schedulers.
Added a feature whereby if a transfer of an executable from a submission machine to an execute machine fails, Condor will retry a configurable number of times, given by the EXEC_TRANSFER_ATTEMPTS macro. This macro defaults to three if left undefined. This macro exists only for the Unix port of Condor.
Fixed a bug where if a schedd had too many rejected clusters during a match phase, it would ``except'' and have to be restarted by the master.
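The bounded retry described for EXEC_TRANSFER_ATTEMPTS can be sketched in Python as follows. This is only an illustration of the idea, not Condor's implementation: the function name transfer_with_retries is hypothetical, and treating the macro's value as the total number of attempts is an assumption.

```python
def transfer_with_retries(transfer, attempts=3):
    """Call transfer() until it succeeds or `attempts` tries are used up.

    `transfer` is any callable returning True on success. The default of 3
    mirrors the documented default of EXEC_TRANSFER_ATTEMPTS; whether
    Condor counts attempts in exactly this way is an assumption.
    Returns the number of tries used on success, or 0 if every try failed.
    """
    for attempt in range(1, attempts + 1):
        if transfer():
            return attempt
    return 0

# Example: a stand-in transfer that fails twice and then succeeds.
outcomes = iter([False, False, True])
print(transfer_with_retries(lambda: next(outcomes)))  # prints 3
```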

Known Bugs:

The HP-UX Vendor C++ CFront compiler does not work with condor_compile if exception handling is enabled with +eh.
The HP-UX Vendor aCC compiler does not work at all with Condor.

Version 6.3.0

New Features:

Added support for running MPICH jobs under Condor.

Many Condor DAGMan improvements and bug fixes:

PRE and POST scripts now run asynchronously, rather than synchronously as in the past. As a result, DAGMan now supports a -MaxScripts option to limit the number of simultaneously running PRE and POST scripts.
Whether or not POST scripts are always executed after failed jobs is now configurable with the -NoPostFail argument.
Added a -r flag to condor_submit_dag to submit DAGMan to a remote condor_schedd.
Made the arguments to condor_submit_dag case-insensitive.
Fixed a variety of bugs in DAGMan's event handling, so DAGMan should no longer hang indefinitely after failed jobs, or mistake one job's userlog events for those of another.
DAGMan's error handling and logging output have been substantially clarified and improved. For example, DAGMan now prints a list of failed jobs when it exits, rather than just saying ``some jobs failed''.
Jobs submitted by a condor_dagman job now have DAGManJobId and DAGNodeName in the job classad.
Fixed a condor_submit_dag bug preventing the submission of DAGMan Rescue files.
Improved the handling of userlog errors (less crashing, more coping).
Fixed a bug when recovering from the userlog after a crash or reboot.
Fixed bugs in the handling of -MaxJobs.
Added a -a line argument to condor_submit to add a line to the submit file before processing (overriding the submit file).
Added a -dag flag to condor_q to format and sort DAG jobs sensibly under their DAGMan master job.

Known Bugs:

condor_kbdd doesn't work properly under Compaq Tru64 5.1, and as a result, resources may not leave the ``Unclaimed'' state regardless of keyboard or pty activity. Compaq Tru64 5.0a and earlier do work properly.


condor-admin@cs.wisc.edu