
8.5 Stable Release Series 6.6

This is a stable release series of Condor. It is based on the 6.5 development series. All new features added or bugs fixed in the 6.5 series are available in the 6.6 series. The details of each version are described below.

8.5.1 Version 6.6.12

Release Notes:

Contains only a couple of bug fixes.

Bugs fixed that are included in version 6.7.19:

None.

Bug fixes irrelevant to the 6.7 series:

Fixed a bug which caused the condor_collector to incorrectly handle Collector ads in which the Machine attribute is missing, or Storage ads in which the Name is missing. In these cases, a condor_collector running on some platforms (notably, Solaris) could crash.

Known Bugs:

None.

Version 6.6.11

Release Notes:

A security team at UW-Madison is conducting an ongoing security audit of the Condor system and has identified a few important vulnerabilities. Condor versions 6.6.11 and 6.7.18 fix these security problems and other bugs. There have been no reported exploits, but all sites are urged to upgrade immediately.
The Condor Team will publish detailed reports of these vulnerabilities on 2006-04-24, 4 weeks from the date when the fixes were first released (2006-03-27). This will allow all sites time to upgrade before enough information to exploit these bugs is widely available.

Security Bugs Fixed:

Bugs in previous versions of Condor could allow any user who can submit jobs on a machine to gain access to the "condor" account (or whatever non-privileged user the Condor daemons are running as). This bug cannot be exploited remotely, only by users already logged onto a submit machine in the Condor pool.
The security of the "condor_config_val -set" feature was found to be insufficient, so this feature is now disabled by default. There are new configuration settings to enable this feature in a secure manner. Please read the descriptions of ENABLE_RUNTIME_CONFIG, ENABLE_PERSISTENT_CONFIG and PERSISTENT_CONFIG_DIR in the example configuration file shipped with the latest Condor releases, or in section 3.3.5.

Other bugs fixed that are included in version 6.7.18:

Fixed a bug which could cause the condor_ collector to crash when it receives certain types of malformed ads.
Fixed a bug which caused the condor_collector to incorrectly handle ads in which the UpdateInterval attribute is set. In particular, previous versions of the condor_collector used the UpdateInterval value as the maximum lifetime of the ad when aging the ads, which could cause the ad to be removed prematurely. The condor_collector now looks at the ClassAdLifetime attribute, and uses its value (if set). NOTE: No current Condor daemons publish either of these attributes, but they may do so in the future.
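The aging rule described above can be sketched as follows. This is a minimal Python illustration, not the condor_collector's code: the function name ad_expired, the dictionary representation of an ad, and the 900-second fallback lifetime are all assumptions made for the example.

```python
def ad_expired(ad, last_update, now, default_lifetime=900):
    """Return True if an ad should be aged out.

    ad               -- dict of attribute name -> value (hypothetical representation)
    last_update      -- time the ad was last received, in seconds
    now              -- current time, in seconds
    default_lifetime -- fallback lifetime in seconds (assumed value)
    """
    # Only ClassAdLifetime influences aging; UpdateInterval is deliberately
    # ignored, since using it as a lifetime removed ads prematurely.
    lifetime = ad.get("ClassAdLifetime", default_lifetime)
    return now - last_update > lifetime
```

With this rule, an ad that sets only UpdateInterval is aged by the fallback lifetime rather than by its update interval.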

Bugs fixed that are included in version 6.7.14:

Fixed a rare problem in the condor_negotiator where a poorly formed ClassAd from a single condor_schedd could halt negotiation for the entire pool. This poorly formed ad could only happen in extremely rare circumstances, but it was possible. Now, the condor_negotiator will simply ignore poorly formed ClassAds and continue to negotiate with any other condor_schedd in the system that has idle jobs.
Fixed a bug which caused log messages which should contain "PRIV_USER_FINAL" to be "PRIV_USER_FINALPRIV_FILE_OWNER". It's also possible that this same bug could cause crashes if any daemon attempts to log a message which would refer to "PRIV_FILE_OWNER".
Fixed a bug which caused the condor_starter to exit with an error when the sum total of the file transfer size exceeded 2 GB. This, in turn, caused a "shadow exception", and the job would fail.

Bugs fixed that are included in version 6.7.11:

In very rare cases, the condor_startd could get into an infinite loop if a job it was managing was suspended and then there were fatal errors trying to send commands to evict the corresponding condor_starter. This bug has been fixed, and the condor_startd will now correctly recover (and clean up all processes) if it fails to send commands to a starter managing a suspended job.
Condor on Solaris has been patched to work around a Solaris stdio limitation of 255 maximum file descriptors. Before this patch, heavily loaded Condor daemons running on Solaris, particularly the condor_ schedd, could exit complaining about lack of file descriptors for dprintf.
Fixed a bug where the condor_ starter would follow symbolic links to directories, when calculating job disk usage. This could cause an incorrect job disk usage calculation, or hang the starter upon encountering an infinite directory loop. This bug only affected Unix platforms.
For Globus jobs, the Rematch expression is now evaluated when a submit fails (in addition to when a submit commit times out).
Fixed a bug that caused the condor_ gridmanager to go into an infinite loop if an entry in the job's environment string was missing an equals sign.
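The symbolic-link fix listed above concerns how a directory tree is walked when measuring disk usage. The following Python sketch shows one way to total file sizes without following links; it is an illustration under assumptions, not the condor_starter's implementation, and the function name disk_usage_bytes is hypothetical.

```python
import os
import stat


def disk_usage_bytes(top):
    """Sum the sizes of regular files under top without following symlinks."""
    total = 0
    # followlinks=False keeps os.walk from descending into symlinked
    # directories, so a link pointing back up the tree cannot cause a loop.
    for dirpath, _dirnames, filenames in os.walk(top, followlinks=False):
        for name in filenames:
            # lstat describes the link itself, never the link's target.
            st = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total
```

Counting only regular files means a symbolic link contributes nothing to the total, whatever it points to.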

Bugs fixed that are included in version 6.7.9:

Fixed a bug where the condor_startd would erroneously compute the console idle time using the file /proc/interrupts on Unix machines that were not running Linux.
Fixed a bug where the condor_negotiator might dump core if it was reconfigured in the middle of a negotiation cycle.
Fixed a bug where the condor_ negotiator might dump core if a startd had a name longer than 63 bytes.
Fixed a bug that could cause condor_ userprio to crash if the data it gets back from the condor_ negotiator is invalid.
Fixed a bug where DEFAULT_PRIO_FACTOR was ignored if ACCOUNTANT_LOCAL_DOMAIN was not defined.

Bug fixes irrelevant to the 6.7 series:

Added the -NoEventChecks and the -AllowLogError command-line flags to condor_ submit_dag and the condor_ submit_dag man page (they were already in condor_ dagman). Added -r and -debug to the condor_ submit_dag man page (they were already in condor_ submit_dag, just not documented).
Made command-line arguments case insensitive in the Windows version of condor_ submit_dag; also fixed log file checks in that version.

Known Bugs:

A bug has been found which can cause a condor_ collector to crash on some platforms (notably, Solaris). This can happen if the condor_ collector receives a Collector ad in which the Machine attribute is missing, or a Storage ad in which the Name is missing. There is no security threat involved in either case.

Version 6.6.10

Release Notes:

Most of the fixes included in this release were also included in version 6.7.7 (see below).
The QUEUE_CLEAN_INTERVAL timer is reset during a condor_ schedd reconfig only if this timer value has been changed. Previously, the timer was reset during all condor_ schedd reconfigs, which could prevent the job_queue.log file from being cleaned. Note that this timer is always reset upon a condor_ schedd startup. See the related change for truncating the job_queue.log below, for this same release.
Previously, the condor_ schedd would over-react and exit if it tried to send a user email and SMTP_SERVER was undefined; now it simply prints an error in the SchedLog and moves on.

Bugs fixed that are included in version 6.7.7:

Fixed a bug that could cause the file job_queue.log in the Condor SPOOL directory to grow unnecessarily large, thereby slowing down the startup and/or shutdown times for the condor_ schedd daemon.
Fixed a critical bug where the console idle time for PS/2 keyboards and mice was not being updated correctly.
Fixed a bug in the condor_ collector that could cause it to crash when parsing certain types of invalid ClassAds. In particular, if a Machine, Schedd or License ClassAd sent to the condor_ collector has an IP address field which is empty (which should never happen), the condor_ collector will crash.
Fixed some bugs in how the condor_schedd handles a graceful shutdown (either because of a condor_off or a SIGTERM on UNIX):
- There was a minor bug if JOB_START_DELAY was set to 0 that would prevent the condor_schedd from correctly cleaning up during graceful shutdown. Now, the condor_schedd will properly shut down, even if JOB_START_DELAY is set to 0.
- Fixed a bug when there are scheduler universe jobs that were recently submitted to the queue. Previously, the shutdown code would not evict scheduler universe jobs that had been submitted since the last SCHEDD_INTERVAL (which defaults to 5 minutes). So, if a user submitted a scheduler universe job and then someone shut down Condor on that machine, the condor_schedd would wait until the next SCHEDD_INTERVAL had elapsed before evicting the job. Now, the schedd will always attempt to evict scheduler universe jobs during a shutdown, without waiting for this interval to pass.
A number of Windows-specific bugs were fixed:
- It was possible under certain circumstances for execute directories to not be cleaned up properly. This has been fixed.
- Certain Asian locales would cause the condor_ starter to crash due to character translation problems. This has been fixed.
- Condor will now properly report memory sizes that exceed 2 GB.
- The condor_ starter would be unable to run jobs if the LOG path had a period (.) in it. This has been fixed.
- The condor_ startd would leak memory, especially on SMP machines. This has been fixed.
- The condor_ master would crash immediately on Windows 2003 Server if the firewall was enabled. This has been fixed.
Fixed a bug in condor_ dagman that could cause condor_ dagman to fail an assertion if PRE or POST scripts are throttled with the -maxpre or -maxpost condor_ submit_dag command line flags.

Bugs fixed that are NOT included in version 6.7.7:

Fixed a bug where enabling the grid_monitor for any globus job handled by something other than a hard-coded list of jobmanager names would cause the job to stay idle forever. The hard-coded list of jobmanager names was: condor, fork, lsf, pbs, and remote. A jobmanager by any other name (e.g. condor_rh9, or lcgpbs) would cause the problem. This bug was originally fixed in internal releases of 6.7.0, but it was reintroduced by mistake in all public releases.
Fixed the way condor_version handles command line arguments (there were a number of problems and inconsistencies) and added a -help option and usage message.
Fixed some memory leaks in the condor_startd that would be induced by calling condor_reconfig or condor_status -d.
By design, Condor daemons will exit if their parent process exits. On Windows, a bug introduced in the 6.5.x series broke this behavior. This is now fixed.
On Windows, users would often observe the condor_master failing to add exceptions for the Condor daemons to the Windows Firewall on Windows XP SP2 or Windows 2003 Server SP1. The condor_master will now retry for a longer period of time to add these exceptions, and the number of retries has now been made configurable. See section 3.3.9 for more information.

Known Bugs:

None.

Version 6.6.9

Release Notes:

Most of the fixes included in this release were also included in version 6.7.5. However, at the end of this section, a few fixes that were added to 6.6.9 after 6.7.5 was released are mentioned separately.

Bugs fixed that are included in version 6.7.5:

Fixed a security bug in the condor_ schedd that could enable a maliciously modified condor_ submit tool to overwrite files in the Condor SPOOL subdirectory, including the job queue.
Fixed a bug where, under very pathological file permission failure conditions with a standard universe job, there would be a cycle of an execute event followed by a termination event in the user log when the job had not actually run.

Bugs fixed that are NOT included in version 6.7.5:

Fixed a memory management bug introduced in version 6.6.8 that could result in deallocated memory being referenced after a child process forked from a Condor daemon exits.
Fixed bugs in some Condor tools that failed to locate condor_ startd daemons that contained multiple @ signs in their Name attribute. For example, a virtual machine from a multiple-CPU condor_ startd spawned using glidein would have the name: vm1@[pid]@[hostname]. All Condor tools that need to communicate with a condor_ startd like this will now succeed.
Removed a fixed-length buffer in the code that handled the SUBSYS_EXPRS config file setting. Previously, if any attributes referred to were larger than approximately 1000 bytes, Condor daemons would crash. Now, there is no limit to the size of the attributes listed in SUBSYS_EXPRS. For more information about this setting, see section 3.3.5.
Fixed a bug which would cause Condor to fail to cache user GID information and potentially overwhelm NIS servers.
Fixed another bug which could cause UDP machine updates to be dropped by the condor_ collector.
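The fix for condor_startd names containing multiple @ signs comes down to where the name is split. As a hedged sketch (the function name split_startd_name and the sample values are hypothetical, and this is not the code used by the Condor tools), splitting on the last @ keeps everything before it together:

```python
def split_startd_name(name):
    """Split a startd Name into (prefix, hostname) at the LAST '@'.

    A name such as vm1@[pid]@[hostname] carries two '@' signs; splitting at
    the first one would wrongly treat "[pid]@[hostname]" as the host.
    """
    prefix, sep, host = name.rpartition("@")
    if not sep:
        # No '@' at all: the whole name is the hostname.
        return "", name
    return prefix, host
```

For a glidein-style name with a process id of 1234 on node.example.org, this yields the prefix "vm1@1234" and the host "node.example.org".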

Known Bugs:

If a DAG node has both retries and a POST script, and the actual Condor job for the node fails, the POST script is not run except after the last retry of the job (or if the job succeeds). (The POST script should be run each time the node job is run, whether the job succeeds or not.)
Occasionally, Condor generates both a terminated event and an aborted event for a job that is aborted. If this happens for a DAG node job, condor_dagman considers this an error and aborts the DAG. If you run into this problem, you can avoid the abort by adding the -NoEventChecks flag to the argument list in the condor_dagman submit file generated by condor_submit_dag (you have to do condor_submit_dag -no_submit and hand-edit the resulting submit file). However, if you get the double events on a node that has retries, condor_dagman will assert. The only fix for this is to upgrade to a 6.7.5 or newer condor_dagman. You can do this by simply installing a newer condor_dagman executable, without any other changes to your Condor installation. It is fine to run a 6.7 condor_dagman on a 6.6 Condor installation.
In a DAG, if a node job generates an executable error event, the DAG is aborted. This can be worked around by adding the -NoEventChecks flag to the argument list in the condor_dagman submit file generated by condor_submit_dag (you have to do condor_submit_dag -no_submit and hand-edit the resulting submit file).

Version 6.6.8

Release Notes:

Most of the fixes included in this release were also included in version 6.7.3. However, at the end of this section, a few fixes that were added to 6.6.8 after 6.7.3 was released are mentioned separately.

New Features:

None.

Bugs Fixed:

In version 6.6.7, we fixed bugs related to the -format option to various Condor tools. However, some sites were using -format in ways we did not expect, by not specifying any '%' conversion string at all. This used to work, given the old buggy code that handled -format, but the changes in version 6.6.7 broke this, and format strings without a '%' conversion no longer worked. Now, if the format string does not contain a '%' conversion, the attribute name which follows it is once again ignored, and the format string is printed directly without any modification. For example, to print out the machine's Name (always defined) and the RemoteUser (only defined if the machine is claimed), and always print a newline (to keep the formatting legible), this command will now work:
```
% condor_status -f "%s " Name -f "%s " RemoteUser -f "\n" bogus
bird.cs.wisc.edu biguser@raven.cs.wisc.edu
condor.cs.wisc.edu
dodo.cs.wisc.edu biguser@raven.cs.wisc.edu
lark.cs.wisc.edu biguser@raven.cs.wisc.edu
raven.cs.wisc.edu
...
```
Windows bug fixes:
- Fixed a bug that would cause Condor to fail to gracefully shut down user jobs that are console applications (including batch scripts).
- Fixed an issue that would cause condor_ store_cred to fail if the user did not have NETWORK logon rights.
- condor_ store_cred query command would appear to succeed, even if the stored credential was invalid (e.g. the password was changed but the password stash was not updated). This has been fixed.
- Fixed a bug that would cause the condor_ startd to crash under certain conditions during job eviction. This bug was introduced in Condor version 6.6.6.
- Fixed a bug that would cause condor_ dagman to crash if it was submitted as a non-Administrator user.
- Fixed a bug that would cause Condor to occasionally kill processes that didn't belong to it during job eviction or daemon restarts.
- On startup, the condor_ master would occasionally fail to add the daemons to the Windows XP firewall exception list because of a race with the Windows SharedAccess service. This bug has been fixed.
- If a user submitted a job with an invalid executable, the starter would often wedge until the job was preempted. Now, the starter attempts to detect invalid executables and prevent wedging.
- Fixed issues that would cause the condor_startd to "disappear" from the pool because of dropped machine ad updates. This fix applies to all platforms, but the symptoms were exhibited predominantly on Windows machines.
- Fixed a bug that could cause HIGHPORT and LOWPORT parameters to be ignored if a Windows machine ran for several weeks without being rebooted.
Starting with RedHat 9, newer versions of Linux began to produce core files named core.<pid>. This broke functionality in Condor that managed and transferred back any core file created by the job, since the condor_ starter was unable to locate the proper file. Now, Condor will correctly transfer back core files, even if they are created as core.<pid>. This functionality works in all universes, and is independent of Condor's file transfer mechanism.
Fixed a bug that was causing condor_ startd to consume large amounts of memory over long periods of time.
Fixed a bug that was causing condor_ startd to fail to start up with the message, "caInsert: Can't insert CpuBusy into target ClassAd."
Fixed a long-standing bug in Condor regarding the configuration settings LOWPORT and HIGHPORT. When these were enabled (to restrict Condor's port usage to a specified range), Condor would fail to set the SO_KEEPALIVE option on sockets it created. This meant that in the case of a hard machine failure (such as a sudden power outage) on one machine, Condor daemons communicating with that machine would never notice it had died. Now, the SO_KEEPALIVE option is properly set on all sockets, even with LOWPORT and HIGHPORT defined.
Fixed a bug that caused condor_ rm -forcex to not remove jobs that make use of leave_in_queue. If invoked using a cluster id, username, or constraint expression, condor_ rm would report success but the jobs would remain in the queue. Now, the jobs will leave the queue.
When a held job is released, job ad attributes HoldReasonCode and HoldReasonSubCode are now properly moved to LastHoldReasonCode and LastHoldReasonSubCode.
Fixed a bug that would cause the RemoveReason attribute for a job to be set incorrectly in some circumstances. Specifically, this was when a job was not running and a periodic_remove expression caused the job to be cancelled.
Fixed condor_ submit such that submit description file commands written with syntax both of ThisStyle and this_style will work.
Fixed a very rare but serious bug in Condor that was originally introduced in version 6.3.0. Under exceptional circumstances (a very heavily loaded machine where a huge number of processes are being spawned all the time, and where the condor_schedd is managing many thousands of jobs in the queue), it was possible for the condor_schedd to run a job twice. We have fixed the underlying problem that led to the condor_schedd making this mistake, rendering this error impossible.
Fixed a bug that occurred when submitting Condor-G jobs while using the grid monitor. If the grid job monitor returned a FAILED status for a job while the jobmanager was asleep, the condor_gridmanager could sometimes end up in a loop, continuously restarting the remote Globus jobmanager and then putting it back to sleep.
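The core file fix in this list depends on recognizing both naming schemes, core and core.<pid>. A minimal Python sketch of such a lookup is shown below; the function name find_core_files and the exact pattern are assumptions for illustration and do not reproduce the condor_starter's logic.

```python
import os
import re

# Matches "core" or "core.<pid>", where <pid> is a run of digits.
CORE_NAME = re.compile(r"^core(\.\d+)?$")


def find_core_files(job_dir):
    """Return the sorted names of core files found directly in job_dir."""
    return sorted(
        name
        for name in os.listdir(job_dir)
        if CORE_NAME.match(name)
        and os.path.isfile(os.path.join(job_dir, name))
    )
```

Anchoring the pattern at both ends keeps unrelated files such as core.txt out of the result.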

Known Bugs:

None.

Bugs fixed that are not included in version 6.7.3:

Fixed a discrepancy in the SUBSYS_ADDRESS_FILE setting. Previously, this setting did not work for SUBSYS values of COLLECTOR or NEGOTIATOR (for example, defining COLLECTOR_ADDRESS_FILE had no effect). Now, if either of these is defined in the configuration file, the corresponding Condor daemon will write out the address and port it is using to the specified file. Normally, the condor_collector and condor_negotiator listen on a well-known, fixed port. However, on single-machine, Personal Condor installations, these address files allow all of the Condor daemons and tools to locate the condor_collector and condor_negotiator, even if they are using a dynamically assigned port. For more information about the SUBSYS_ADDRESS_FILE setting, please see the description in section 3.3.5. For more information about using non-standard ports for the condor_collector and condor_negotiator, please see the description of "Non Standard Ports for Central Managers" in section 3.7.1.

Version 6.6.7

Release Notes:

None.

New Features:

Added a feature to the condor_ master which automatically adds the Condor daemons to the Windows Firewall exception list. This only applies to machines running Windows XP SP2.

Bugs Fixed:

Fixed a bug specific to Windows that could cause, in rare occurrences due to a race condition, Condor to fail to properly signal the job to suspend, continue, or preempt.
When Condor transfers the job executable using the file transfer mechanism, it used to leave the binary sitting as a world-writable file inside the execute directory on UNIX. Now, executable files transferred by Condor have the proper permissions (mode 0755).
Fixed an important bug in the low-level code that Condor uses to transfer files across a network. There were certain temporary failure cases that were being treated as permanent, fatal errors. This resulted in file transfers that aborted prematurely, causing jobs to needlessly re-run. The code now gracefully recovers from these temporary errors. This should significantly help throughput for some sites, particularly ones that transfer very large files as output from their jobs.
Fixed a bug in the file transfer mechanism which caused segmentation faults when very long input/output/intermediate file lists were used.
Fixed a number of bugs in the -format option to condor_q and condor_status. Now, these tools will properly handle printing boolean expressions in all cases. Previously, depending on how the boolean evaluated, either the expression was printed, or the tool could crash. Furthermore, the tools do a better job of handling the different types of format conversion strings and printing out the appropriate value. For example, if a user tries to print out a boolean attribute with condor_status -format "%d\n" HasFileTransfer, the condor_status tool will evaluate HasFileTransfer and print either a 0 or a 1 (FALSE or TRUE). If, on the other hand, a user tries to print out a boolean attribute with condor_status -format "%s\n" HasFileTransfer, the condor_status tool will print out the string "FALSE" or "TRUE" as appropriate.
The ClassAd attribute scope resolution prefixes, MY. and TARGET., are no longer case sensitive.
condor_ dagman now generates a fatal error if any node submit files are missing the log file attribute. This behavior can be overridden with the -AllowLogError command-line option.
condor_ dagman now does better checking for inconsistent events (such as getting multiple terminate events for a single job). This checking can be disabled with the -NoEventChecks command-line option.
Under Tru64, Condor would sometimes fail to start a job while setting the resource limits on behalf of the job. This error appears to be the result of a kernel issue. A workaround has been implemented which will leave the limits of the job unmodified and run the job when this specific error situation arises.
On Windows, occasionally Condor would exhibit erratic behavior when a machine resumes from sleeping. This has been fixed.
On Windows, occasionally Condor would fail to bind to any available interfaces due to a mishandling of a function return value. This has been fixed.
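The -format handling of boolean attributes described in this list can be mimicked in a few lines. The sketch below is a Python illustration only: the helper format_attr is hypothetical, it handles just the %d and %s conversions named in the text, and it is not how condor_q or condor_status are implemented.

```python
def format_attr(fmt, value):
    """Render value with a printf-style format, treating booleans specially."""
    if isinstance(value, bool):
        if "%d" in fmt:
            # Integer conversion: TRUE prints as 1, FALSE as 0.
            return fmt % int(value)
        if "%s" in fmt:
            # String conversion: print the words TRUE or FALSE.
            return fmt % ("TRUE" if value else "FALSE")
    # Non-boolean values fall through to ordinary formatting.
    return fmt % value
```

Calling format_attr("%d\n", True) gives "1\n", while format_attr("%s\n", True) gives "TRUE\n".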

Known Bugs:

None.

Version 6.6.6

Release Notes:

A condor_dagman job will fail and report a cycle in the DAG when XML logs are used in a single or multiple log format. The POST script completion event is not converted to XML and, because of the format of the event, DAGMan never sees POST scripts complete or fail.

New Features:

The checkpoint server has moved from contrib module status to being a normal part of Condor.
When they first start running, all Condor daemons will now try to print to their log file the full path to the binary they are executing. Unfortunately, we can only reliably get this information on Linux, Solaris, MacOSX, and Windows platforms. On other platforms, this information will only be printed to the log file in certain cases that depend on how the daemon was invoked. This new feature was added to aid in debugging problems where sites were not running the version of the Condor daemons they thought they were due to problems in custom-built startup scripts.
condor_ wait is now available in the Windows port.
Added a fix to the accountant that allows users to specify user priorities with condor_ userprio before any jobs have been submitted.
Added support for running batch files under Windows when using the STARTD_CRON or USER_JOB_WRAPPER attributes.
Moved from Globus 2.2.2 to Globus 2.2.4 for Condor-G, except for the DUX 4.0f platform.

Bugs Fixed:

Windows bug fixes:
- Fixed a bug which could cause Condor to kill processes that aren't related to Condor or the job it was running at the time.
- Fixed a problem that could cause daemons or tools to crash when they looked up information about processes running on the system.
- Fixed a problem with the collector dropping TCP updates with pools larger than roughly 20 machines. This issue only occurs with UPDATE_COLLECTOR_WITH_TCP enabled.
- Fixed an issue with condor_ store_cred reporting success when in fact under certain circumstances the store command actually failed.
- Removed condor_ kbdd_dll. It is no longer used.
- Fixed an issue with condor_ birdwatcher that caused it to leak resource handles.
- Fixed an issue with the Windows port of condor_ dagman that would cause it to crash when POST scripts were used.
Fixed a bug where the environment of jobs in any universe could be corrupted.
The condor_ startd now properly cleans up execute directories on root-squashed NFS mounts.
Fixed a problem where the condor_ starter could crash if the job it was running used Condor's file transfer mechanism and the full path names to the job's files became longer than a few hundred characters.
The image_size attribute of a job on Mac OS X is much closer to the values that ps returns. Previously it would be highly inflated.
Fixed a memory leak in the condor_ gridmanager.
Added the -Storklog argument to condor_ submit_dag to make it compatible with the older perl script of the same name.
Removed support for the -libc option for condor_ version.
Added a fix to condor_ compile where if our internal ld managed to not be invoked during linking of a standard universe executable, a warning is emitted.
Fixed a minor bug in the file transfer mechanism. Specifically, if a VANILLA job had when_to_transfer_output set to ON_EXIT_OR_EVICT, wrote more than one output file, and was actually evicted, the condor_shadow would have a fatal run-time error (shadow exception) and your job would be rerun.
DAGMan bug fixes:
- If submit files for individual nodes referred to the same log file with different paths, condor_ dagman would read log events incorrectly and the DAG would fail. condor_ dagman is now able to recognize that the different paths actually refer to the same log file.
- Fixed a bug where DAGMan failed to monitor Stork job logs.
- If a node submit file doesn't specify a log file, the warning message now gets printed out in the DAGMan log file.
- Fixed a bug that caused condor_dagman to fail if the first node submit file has a continuation in its log file line.
Bugs related to configuration:
- Fixed a bug where Condor daemons could crash if COLLECTOR_HOST or NEGOTIATOR_HOST was defined to be something bogus.
- Fixed potential crash in the condor_ collector when COLLECTOR_NAME was too long.
- The default setting for POOL_HISTORY_DIR is no longer SPOOL. Using the spool directory would result in history files being obliterated by condor_preen.
Fixed a bug which could result in a daemon crashing while it was writing to its logfile.
Fixed a signal handling bug in the checkpoint server which could cause the daemon to hang sometimes.
The Kerberos map file now tolerates spaces on either side of the equals sign instead of generating a parse error.
The -analyze option to condor_ q is only meaningful for certain universes. condor_ q now warns if the output might not be meaningful.
Java universe: when jar files are transferred to the execute machine (with should_transfer_files or transfer_input_files) the condor_ starter will use the local path (in the execute directory) for the jarfiles, instead of the original path specified in the submit file.
Previously, if a scheduler universe job died with a signal, the condor_ schedd would write multiple (conflicting) events into the UserLog file: a terminate event and an abort event. Now, only the terminate event is written, not the abort event.
Fixed a minor bug where if the condor_ schedd crashed or was killed at just the wrong moment while a job was being removed because the periodic_remove expression had evaluated to TRUE, the job might have been successfully removed but the RemoveReason attribute could have been lost. Now, both actions are taken together atomically. If a job is successfully removed, it will always have a RemoveReason attribute.
Fixed a memory leak in the condor_ collector.

Known Bugs:

None.

Version 6.6.5

Release Notes:

None.

New Features:

None.

Bugs Fixed:

Fixed a bug introduced in Condor version 6.6.2 that could cause condor_ dagman to segfault while parsing some DAG files, or fail to recognize already-completed nodes in a rescue DAG.
Fixed a bug in condor_ dagman, whereby it could fail to automatically discover a Condor job's userlog file if the job's submit file did not have whitespace surrounding the equal sign on the log file line.
Fixed a bug in condor_submit that appears to have affected only OSX machines. Previously, submit files that only defined a single job and used queue without any numerical modifiers would result in an error like this:
```
     ERROR: "test.sub" doesn't contain any "queue" commands -- no jobs queued
```
Now, condor_ submit will properly process and submit the job from job description files that contain a single queue statement with no modifiers.
Fixed a bug in the AIX condor_ starter that was causing the starter to sometimes kill itself when the job completed. Because this happened before the condor_ starter reported the job completion back to the condor_ shadow, such a job would be restarted.
Fixed a few memory and registry handle leaks in the condor_ schedd and condor_ startd. These leaks particularly affected Windows systems.
On Windows, Condor was known to have trouble accessing config files with UNC paths (with appropriate permissions set). This has been fixed.
On Windows, condor_ store_cred would fail if the account did not have Log on Locally privileges, even if the account was allowed to log in interactively. This has been fixed.
Fixed a bug on Windows that would cause the condor_schedd to crash if D_FULLDEBUG was turned on, and the submitting user account did not have Administrator access rights.
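The userlog discovery fix in this list is about tolerating a log line with no whitespace around the equal sign. One way to parse such a line is sketched here in Python; the function name find_log_file is hypothetical, matching the command name case-insensitively is an assumption, and the code is not taken from condor_dagman.

```python
import re

# Optional whitespace is allowed before "log", around "=", and at line end.
LOG_LINE = re.compile(r"^\s*log\s*=\s*(\S.*?)\s*$", re.IGNORECASE)


def find_log_file(submit_text):
    """Return the log file named in a submit description, or None."""
    for line in submit_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            return match.group(1)
    return None
```

Both "log=job.log" and "log = job.log" yield "job.log" with this pattern.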

Known Bugs:

condor_ dagman can fail to detect a job's progress if another job in the DAG specifies the same underlying userlog file using a different path or filename (e.g., log=foo and log=./foo) in its submit file.
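The two spellings in that example name the same file once the paths are resolved. A check of that kind can be sketched in Python as follows; same_log_file and its base_dir parameter are hypothetical names, and this is an illustration rather than condor_dagman's code.

```python
import os


def same_log_file(path_a, path_b, base_dir="."):
    """Return True if two log paths resolve to the same file.

    Relative paths are interpreted against base_dir (an assumed stand-in
    for the directory the submit files were read from).
    """
    # realpath collapses "./" and ".." components and resolves symlinks,
    # so different spellings of one file compare equal.
    resolved_a = os.path.realpath(os.path.join(base_dir, path_a))
    resolved_b = os.path.realpath(os.path.join(base_dir, path_b))
    return resolved_a == resolved_b
```

Here same_log_file("foo", "./foo") is true, while same_log_file("foo", "bar") is false.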

Version 6.6.4

Release Notes:

This version contains only platform-specific bug fixes. Therefore, it was released only for the two affected platforms.

Bugs Fixed:

Fixed a major bug in the Windows NT/2000 port that caused the Condor daemons to crash when attempting to authenticate.
Fixed the bug in Condor's file transfer mechanism for Mac OSX that was introduced in version 6.6.3.

Known Bugs:

None.

Version 6.6.3

Release Notes:

The Globus universe support for versions of Globus prior to 2.2 (specifically, those using GRAM 1.5 or earlier) has been removed.

New Features:

The Globus universe now supports submitting jobs to Globus Toolkit 3.2 installations.

Bugs Fixed:

The negotiator no longer crashes when a grid site ClassAd sets WantAdRevaluate but does not contain an UpdateSequenceNumber.
Globus universe jobs now go on hold when a $$() expression cannot be expanded. Previously, they failed to do so.
On Windows, the system-wide TEMP variable is included in the execute environment if it is not specified in the submit file.
Fixed a rarely-occurring bug in which the child process forked by the schedd could get stuck in an infinite loop when the user ran ``condor_submit -s''. This should also fix cases in which the child process forked by the collector would sometimes get stuck in an infinite loop when COLLECTOR_QUERY_WORKERS > 0 in the config file.

Known Bugs:

The Condor file transfer mechanism is broken on Mac OSX in Condor version 6.6.3. OSX users should either upgrade to version 6.6.4, or install a patched condor_ starter binary available from http://www.cs.wisc.edu/condor/binaries/condor-6.6.3-patch1-MacOSX-PPC.tar.Z.

Version 6.6.2

Release Notes:

There will be another release, 6.6.3, within a few weeks. We decided to release this version now because it adds the AIX platform and has some bug fixes which we thought important enough for a release. However, if you are not affected by the bugs fixed (see below) you may wish to wait for 6.6.3.

New Features:

Clipped support for AIX 5.2. This means VANILLA universe only - no checkpointing or STANDARD universe.
The setting GRIDMANAGER_GLOBUS_COMMIT_TIMEOUT allows configuring the two phase commit timeout in Globus. This maps to the two_phase setting in the Globus RSL.
Added a new configuration variable, DAGMAN_MAX_SUBMIT_ATTEMPTS , that controls how many times in a row condor_ dagman will attempt to execute condor_ submit for a given job before giving up. It cannot be set to less than 1 attempt, or more than 10; if left undefined, it defaults to 6.
Added a new tool condor_ updates_stats to dump out the update statistics information from ClassAds in a human readable format. Condor 6.6.1, by default, publishes ``update statistics'' into the ClassAds as published by the condor_ collector. This program parses this output and displays it to the user in a readable format.
Changed the default condor_ dagman behavior so that it doesn't check for cycles at startup, only at runtime, since the former could be expensive for large DAGs. Added a boolean DAGMAN_STARTUP_CYCLE_DETECT config attribute to re-enable cycle-detection at startup.
condor_ dagman now offers a configuration variable, DAGMAN_MAX_SUBMITS_PER_INTERVAL , which controls how many individual jobs condor_ dagman will submit in a row before servicing other requests (such as a condor_ rm).
The grid_monitor now automatically detects jobmanager scripts on the remote gatekeeper. Previously it was limited to supporting the condor, fork, lsf, pbs, and remote jobmanager scripts.
A new parameter, SEC_DEBUG_PRINT_KEYS , controls whether or not the keys used for encryption get printed into the log. The default is false.
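
The bounds on DAGMAN_MAX_SUBMIT_ATTEMPTS described above (minimum 1, maximum 10, default 6) can be illustrated with a short Python sketch. The function names are hypothetical. The sketch clamps out-of-range values into the allowed range, which is an assumption: the notes above do not say whether condor_ dagman clamps such values or rejects them.

```python
DEFAULT_ATTEMPTS = 6
MIN_ATTEMPTS = 1
MAX_ATTEMPTS = 10


def effective_max_submit_attempts(configured=None):
    """Return the attempt limit for a configured value (None means undefined)."""
    if configured is None:
        return DEFAULT_ATTEMPTS
    # Assumption: out-of-range values are clamped rather than rejected.
    return max(MIN_ATTEMPTS, min(MAX_ATTEMPTS, int(configured)))


def submit_with_retries(submit_once, configured=None):
    """Call submit_once() until it returns True or the limit is reached.

    Returns the number of the successful attempt, or None if every
    attempt failed.
    """
    for attempt in range(1, effective_max_submit_attempts(configured) + 1):
        if submit_once():
            return attempt
    return None
```

Here submit_once stands in for one invocation of condor_ submit for a given job. Any callable that returns True on success can be passed.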

Bugs Fixed:

Jobs that make use of Condor's file transfer mechanism were not automatically authorized to read/write input/output files when flocking to machines that did not happen to be in the HOSTALLOW_WRITE list. This bug has existed since 6.3.
Eliminated a small chance that a grid_monitor log file or state file might be reused. The unique identifying numbers are now unique across the entire gridmanager, not each Globus resource.
Eliminated a race condition which might cause the grid monitor to erroneously decide that the status file was broken when in fact it was being uploaded and was empty.
The grid monitor now attempts to restart transfers in the event of globus-url-copy hanging.
Removed some settings from the default configuration files shipped with Condor that are no longer used in the code.
Fixed bugs in condor_ dagman parsing of submit files (to determine node log files). Previously, a submit file line beginning with "log" (e.g., "LogLock = True") would be interpreted as a log file line. Also, if "log" was defined twice in the submit file, condor_ dagman would incorrectly use the first definition, rather than the last.
Re-added PVM support for IRIX 6.5.
Fixed an indirect bug whereby condor_ dagman could fail with an assertion error if it encountered both a terminate and an abort event in the userlog for the same job; this can happen due to a bug in the condor_ schedd, which is not yet fixed.
condor_ dagman now works correctly with nodes that have an initialdir specified in the node submit file. (Previously, specifying an initialdir only worked if the log file path was absolute.)
condor_ dagman now responds more quickly to a request to be removed from the queue (via condor_ rm), even if it is in the midst of submitting jobs. Previously, condor_ dagman would finish submitting all ready jobs before responding to a removal request, which could take a long time, and forced it to immediately remove all the jobs it had just submitted unnecessarily.
Fixed keyboard idle reporting on Mac OS X. Previously, the code would often return -1 on newer hardware.
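
The submit-file parsing fix listed above (a line such as "LogLock = True" must not be mistaken for a log file line, and the last "log" definition wins) can be sketched in Python as follows. The function name find_log_file is hypothetical, and the sketch assumes that command names are case-insensitive and that lines starting with # are comments. It is an illustration of the described behavior, not the parser used by condor_ dagman.

```python
def find_log_file(submit_text):
    """Return the value of the last 'log' command in a submit file, or None."""
    log_file = None
    for raw_line in submit_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Compare the whole command name, so "LogLock" does not match "log".
        if key.strip().lower() == "log":
            log_file = value.strip()  # a later definition overrides an earlier one
    return log_file


example = """\
executable = a.out
LogLock = True
log=first.log
log = second.log
queue
"""
print(find_log_file(example))  # second.log
```

Because the line is split at the first equal sign and both halves are stripped, the sketch accepts "log=first.log" and "log = second.log" alike.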

Known Bugs:

If a scheduler universe job terminates via a signal, the condor_ schedd logs both a terminate event and an abort event to the userlog.
Keyboard activity is not reported for pseudo-ttys on Mac OS X, only for the physically connected keyboard.

Version 6.6.1

Release Notes:

condor_ analyze is not included in the downloads of Version 6.6.1. The existing binary from Version 6.6.0 is likely to work on all platforms for which it was released.

New Features:

Added full support (including standard universe jobs with checkpointing and remote system calls) for Linux i386 RedHat 9 (using gcc/g++ version 3.2.2 and glibc version 2.3.2).
Added full support (including standard universe jobs with checkpointing and remote system calls) for Linux i386 RedHat 8 (using gcc/g++ version 3.2 and glibc version 2.2.93).
The time it takes condor_ dagman to submit jobs has been reduced slightly to improve the startup time of large DAGs.
In order to help reduce load on the condor_ schedd when condor_ dagman is submitting jobs, there is a new config variable, DAGMAN_SUBMIT_DELAY , to specify the number of seconds condor_ dagman will sleep before submitting each job.
Enabled the ``update statistics'' in the condor_ collector by default in both the executable and in the default configuration.
Command-line arguments to condor_ dagman are now handled case-insensitively.
Added support for Condor-G and strong authentication to Condor for IRIX 6.5, but removed support for checkpointing and remote system calls. We plan to add support in Condor for IRIX's kernel-level checkpointing in a future release.
Added a -p option to condor_ store_cred so that users can now specify the password on the command line instead of getting prompted for it.
The gahp_server helper process for Condor-G includes patches from the LHC Computing Grid Project to increase data transfer performance of the Condor-G client. Previous versions of Condor-G could bog down in accepting new transfer requests, producing a variety of errors.
Added a new configuration setting, SUBMIT_SEND_RESCHEDULE , which controls whether or not condor_ submit automatically sends a condor_ reschedule command when it is done. Previously, condor_ submit would always send this reschedule so that the condor_ schedd knew to start trying to find matches for the new jobs. However, for submit machines that are managing a huge number of jobs (thousands or tens of thousands), this step would hurt performance in such a way that it became an obstacle to scalability. In this case, an administrator can set SUBMIT_SEND_RESCHEDULE to FALSE. The extra step is then not performed, and the condor_ schedd will try to find matches whenever the periodic timer in the condor_ negotiator (NEGOTIATOR_INTERVAL) goes off.
Pool administrators can now specify the length of time before the condor_ starter sends its initial update to the condor_ shadow by defining STARTER_INITIAL_UPDATE_INTERVAL . The default is 8 seconds. This setting would not normally need changing except to fine-tune a heavily loaded system.
Administrators can now specify the default session duration for each Condor subsystem. This allows for fine tuning the image size of running Condor daemons if the memory footprint is a concern. The default for tools is 1 minute, the default for condor_ submit is one hour, and the default for daemons is 100 days. This does not mean that tools cannot run more than one minute or submit cannot run for more than an hour; it only affects memory usage.
Added a new configuration setting, GRID_MONITOR_HEARTBEAT_TIMEOUT . If this many seconds pass without hearing from the grid_monitor, it is assumed to be dead. Defaults to 300 (5 minutes). Increasing this number will improve the ability of the grid_monitor to survive in the face of transient problems but will also increase the time before Condor notices a problem. Prior to this change, the gridmanager always waited 5 minutes and the user could not change the setting.
Added a new configuration setting, GRID_MONITOR_RETRY_DURATION . If something goes wrong with the grid_monitor at a particular site (like GRID_MONITOR_HEARTBEAT_TIMEOUT expiring), it will be retried for this many seconds. Defaults to 900 (15 minutes). If it cannot be restarted successfully within that time, the grid monitor will be disabled for that site until 60 minutes have passed. Prior to this change, the condor_gridmanager waited 60 minutes after any failure.
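
The interaction between GRID_MONITOR_HEARTBEAT_TIMEOUT and GRID_MONITOR_RETRY_DURATION described above can be sketched as a small decision function in Python. The function name, the state names ("ok", "retry", "disabled"), and the exact boundary comparisons are assumptions made for illustration. The sketch does not model the 60 minute period after which a disabled grid monitor is tried again.

```python
def grid_monitor_state(silent_for, failing_for=None,
                       heartbeat_timeout=300, retry_duration=900):
    """Classify the grid monitor for one site.

    silent_for  -- seconds since the grid monitor was last heard from
    failing_for -- seconds since the current failure began, or None if
                   no failure is in progress
    """
    if failing_for is None:
        # No failure yet: the monitor is presumed dead once it has been
        # silent for the heartbeat timeout, and retries begin.
        return "ok" if silent_for < heartbeat_timeout else "retry"
    # A failure is in progress: keep retrying until the retry duration
    # is used up, then disable the monitor for this site.
    return "retry" if failing_for <= retry_duration else "disabled"


print(grid_monitor_state(10))                   # ok
print(grid_monitor_state(300))                  # retry
print(grid_monitor_state(0, failing_for=901))   # disabled
```

Raising heartbeat_timeout makes the first branch more tolerant of transient silence, at the cost of noticing a dead monitor later, which is the trade-off the setting description mentions.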

Bugs Fixed:

Fixed bugs related to network communication and timeouts that impact scalability in Condor:
- Fixed a bug inside Condor's network communication layer that could result in Condor daemons blocking trying to read more data after a socket had already been closed.
- Fixed a condor_ negotiator bug that could, in certain rare circumstances, cause a condor_ schedd to hang for five minutes while trying to communicate with it.
- Fixed a bug in which TCP connections would re-authenticate needlessly when Condor's strong authentication was enabled. This was not harmful but incurred a bit of overhead, especially when using Kerberos authentication.
Fixed bugs related to network security sessions which were getting cleared out. If the timing was unfortunate, this could cause some jobs to fail immediately after completion. So, Condor no longer clears out security sessions periodically (it used to happen every 8 hours) nor does it do so when a daemon receives a condor_ reconfig command.
Fixed a bug in the standard universe where C++ code that threw an exception would cause the executable to abort instead of delivering the exception. This bug affects Condor version 6.6.0 for Red Hat 7.x.
Fixed a condor_ shadow bug that could result in a fatal error if the following 3 conditions were met: (1) the job enables Condor's file transfer mechanism, (2) the job wants Condor to automatically figure out what files to transfer back (the default), and (3) the job does not specify a userlog.
Fixed a bug whereby condor_ dagman, if removed from the queue via condor_ rm, could fail to remove all of its submitted jobs if any of their submit events had not yet appeared in the userlog.
Fixed a few bugs in condor_ preen:
- It will no longer potentially remove files related to a valid Computing on Demand (COD) claim on an otherwise idle machine.
- condor_ preen will no longer keep reporting that it had successfully removed a directory which was in fact failing to be removed.
Fixed the faulty argument parsing in condor_ rm, condor_ release, and condor_ hold. Previously, accidentally typing condor_rm -analyze would remove all of your jobs. Now it gives an error.
On Windows, when you type a command like condor_reconfig.exe instead of condor_reconfig, you no longer get an error.
Fixed a bug on Windows that would cause ``GetCursorPos() failed'' to appear repeatedly in the StartLog. The startd now uses a different function to track mouse activity that does not have a tendency to fail.
Fixed a bug on Windows that would prevent some condor_ shadow daemons from obtaining a lock on their log file under heavy load, thus causing them to EXCEPT().
Fixed a bug on Windows where file transfers would incorrectly fail because of bad permissions when using domain accounts with nested groups, or when UNC paths were used.
Fixed the bug where the condor_ starter would fail to transfer back core files created by Vanilla, Java and MPI universe jobs. This bug was introduced in Condor version 6.5.2. Now, Condor correctly transfers back any core files created by faulty user jobs in any job universe.
In some circumstances, condor_ history would fail to read information about some jobs, and would report errors. In particular, when jobs had large environments, it would fail. This has been corrected.
Fixed a rare bug affecting condor_ dagman when job-throttling was enabled: if condor_ dagman was removed from the queue together with some of its own jobs (e.g., via condor_rm -a), it would quickly submit new jobs to replace them before recognizing that it needs to exit. It now shuts down immediately without submitting and then removing these unnecessary jobs.
Fixed a potential security problem that was introduced in Condor version 6.5.5 when the REQUIRE_LOCAL_CONFIG_FILE configuration setting was added. This setting used to default to FALSE if it was not defined in the configuration files. It now defaults to TRUE. If administrators define local configuration files for the machines in their pool, it should be a fatal error if those files don't exist unless the administrators actively disable this check by defining REQUIRE_LOCAL_CONFIG_FILE to be FALSE.
Fixed a bug on Windows that would cause the condor_ startd to EXCEPT() if the condor_ starter exited and left orphaned processes to be cleaned up. This bug first appeared in 6.5.0.
Fixed a bug on Windows that would cause graceful shutdowns (such as when condor_vacate is called) to fail to complete.
The gahp_server helper program, which provides Globus services to Condor-G, was always dynamically linked, even in statically-linked releases. The statically linked distributions of Condor now include a static gahp_server.
Fixed a minor bug in parsing XML user log files that contain empty strings.
Fixed the messages written to the Condor daemon log files in various error conditions to be more informative and clear:
- The error message in the SchedLog that indicates that swap space has been depleted has been rephrased so it appears to be significant.
- Certain serious error messages are now being written to the D_ ALWAYS debug level that used to only appear if other debug levels were enabled.
- Clarified log messages related to errors looking up user information in the passwd database on UNIX and for creating dynamic users on Windows.
- Log messages related to keep-alives sent between the condor_ schedd and condor_ startd (written to D_ PROTOCOL) now include the ClaimId on both sides, so that it is easier to find potential problems and figure out which keep-alive messages correspond to what resources.
- Added more useful information to certain errors relating to security sessions and strong authentication.
- Fixed the formatting of some messages to correctly include a newline at the end of the message.
Fixed a bug in the condor_ configure installation tool. Previously, it would set MAIL_PATH, which doesn't exist in Condor and had no effect. Now, condor_ configure correctly sets MAIL , instead.
Fixed a bug in the userlog code in the CondorAPI library to prevent segmentation faults.
Clarified log messages for Condor-G's GridmanagerLog, especially those relating to the grid monitor.
Fixed a potential race condition when using the grid monitor. Condor-G now identifies partial grid monitor status updates and waits for the update to complete.
The grid_monitor is slightly more robust in the face of unexpected behavior by the Globus jobmanager. This is only a partial fix; a complete fix requires the Globus patch at http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=1425
Internal timeouts in the grid_monitor have been increased, increasing robustness during transient errors.
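
The change to the REQUIRE_LOCAL_CONFIG_FILE default listed above means that a missing local configuration file is now a fatal error unless the administrator explicitly sets the value to FALSE. A minimal Python sketch of that behavior follows. The function name read_local_config is hypothetical, and the code illustrates the rule only, not Condor's configuration reader.

```python
import os


def read_local_config(path, require_local_config_file=True):
    """Return the text of a local configuration file.

    If the file is missing, raise an error when the check is enabled
    (the new default), or return an empty string when the administrator
    has disabled the check.
    """
    if not os.path.exists(path):
        if require_local_config_file:
            raise FileNotFoundError(
                "local configuration file missing: " + path)
        return ""
    with open(path) as config_file:
        return config_file.read()
```

With the old default of FALSE, a mistyped or unreachable local configuration file would be skipped silently, so a daemon could start without settings the administrator intended it to have.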

Known Bugs:

Submission of MPI jobs from a Unix machine to run on Windows machines (or vice versa) fails for machine_count > 1. This is not a new bug. Cross-platform submission of MPI jobs between Unix and Windows has always had this problem.
Installing multiple copies of Condor's standard universe support libraries onto an NFS server, so that a heterogeneous mix of Linux distribution revisions can all use the same condor_ compile, does not function correctly if Red Hat 9 is one of the distributions.

Version 6.6.0

New Features:

The condor_ dagman debugging log now reports the total number of ``Un-Ready'' Nodes (i.e. those waiting for unfinished dependencies) in its periodic summaries. In the past, the omission of this state led to confusion because the total of all reported job states didn't always match the total number of jobs in the DAG.
Most Condor commands (condor_ on, condor_ off, condor_ restart, condor_ reconfig, condor_ vacate, condor_ checkpoint, condor_ reschedule) now support a -all command-line option to specify which daemons to act on. This is more efficient and much easier to use than previous methods for accomplishing the same effect. Using -all with condor_ off correctly leaves the existing condor_ master processes running on each host, so that a subsequent condor_ on would work. See section 3.10.1 for more details on proper use of -all with condor_ off and condor_ on.

Bugs Fixed:

Fixed a bug under Solaris 8 with Update 6+, and Solaris 9 where Condor would incorrectly report the console and mouse idle times as zero.
The standard-universe fetch_files feature was not cleaning up temporary files on the execution machine.
In rare circumstances, a Linux kernel bug results in conflicting information about system boot time (/proc/stat and /proc/uptime). Specifically, the "btime" field in /proc/stat suddenly jumps to the present moment and then stays at that value. This was resulting in incorrect estimation of process ages, which caused Condor's estimation of CondorLoadAvg to be completely wrong. A more robust heuristic is now being used.
A long configuration line with continuation lines could cause the config file parser to not properly skip the leading whitespace from the continued lines. This has been corrected.
The Grid Monitor now will automatically probe for and work with ``unknown'' batch systems.
Fixed a bug where under certain circumstances condor_ dagman would fail to detect an unsuccessful invocation of condor_ submit, and would instead report the job as successfully submitted with job id 0.0.
Fixed a bug which was causing problems when a periodic_remove expression for a scheduler universe job evaluated to true. Under these conditions, the schedd did not log the job termination to the job log. Additionally, the schedd would exit with an error status.
Fixed a recently-introduced condor_ dagman bug where the number of node retries (specified with the RETRY keyword) wasn't being updated after some failures; instead, the node would be allowed to retry indefinitely if it kept failing.
Fixed a recently-introduced bug where shutting down the condor_ schedd caused condor_ dagman to remove all its jobs from the queue and write a rescue file, rather than simply exiting so that it could recover automatically upon restart.
Changed the default ``Periodic Expression Interval'' parameter (PERIODIC_EXPR_INTERVAL) from 60 seconds to 300 seconds.
Whenever condor_ reconfig was used to re-configure multiple daemons which included the condor_ collector for a pool, the command would start to fail after the condor_ collector was reconfigured due to problems with security sessions in Condor's strong authentication code. This situation no longer causes problems for the condor_ reconfig tool, and it can properly re-configure multiple daemons at once, even if one of them is the condor_ collector for a pool.
Most Condor commands (condor_ on, condor_ off, condor_ restart, condor_ reconfig, condor_ vacate, condor_ checkpoint, condor_ reschedule) now check to make sure they are not sending a duplicate command if the user specifies the same target machine or daemon twice. For example:
```
     condor_reconfig hostname1 hostname2 hostname1
```
will only send a single reconfig command to hostname1.
Fixed a bug in the HPUX version of Condor which was causing the startd to occasionally abort operation. This has been in Condor since version 6.1.1.
The Condor daemons will no longer overwhelm NIS servers when large numbers of daemons are running. Condor now caches uid and group information internally, and refreshes the cache entries on a specified interval (which defaults to 5 minutes). See section 3.3.3 for more details.
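
The uid and group caching described in the last item above follows a common pattern: remember each lookup result and repeat the lookup only once the entry is older than the refresh interval. The following Python sketch shows that pattern under stated assumptions. The class name TTLCache and its parameters are hypothetical, and the code is not taken from Condor.

```python
import time


class TTLCache:
    """Cache lookup results and refresh each entry after a fixed interval."""

    def __init__(self, lookup, refresh_interval=300, clock=time.monotonic):
        self._lookup = lookup              # function that performs the real query
        self._interval = refresh_interval  # seconds; 300 matches the 5 minute default
        self._clock = clock                # injectable so the cache can be tested
        self._entries = {}                 # key -> (value, time fetched)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[1] >= self._interval:
            # Missing or stale entry: query the underlying service once.
            entry = (self._lookup(key), now)
            self._entries[key] = entry
        return entry[0]
```

On a Unix system the lookup function could wrap a call such as pwd.getpwnam(name).pw_uid, so that repeated queries for the same user reach the name service at most once per interval.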

Known Bugs:

The condor_ preen program does not know about Computing on Demand (COD) claims. If there are no regular Condor jobs on a given machine, but there are COD claims, and condor_ preen is spawned, it will remove files related to the COD claims. In version 6.6.0, sites using COD are encouraged to disable condor_ preen by commenting out the PREEN setting in the config files. This bug has been fixed in Condor version 6.6.1.
Normally, if a user's job crashes and creates a core file on a remote execution machine, the condor_ starter will automatically transfer the core file back to the submit machine. However, beginning in Condor version 6.5.2, if a vanilla, Java, or MPI universe job creates a core file, the condor_ starter will fail to transfer it back. This bug will be fixed in version 6.6.1.
There are a few bugs related to Condor tools failing to correctly locate the condor_ negotiator daemon. These bugs usually show up if a site is using non-standard ports for the central manager daemon. However, some of the bugs show up regardless of whether the negotiator is listening on the standard port.
- condor_config_val -negotiator queries the condor_ collector, instead of querying the condor_ negotiator like it should.
- Using the -pool option to condor_q -analyze will not work. The tool will fail to find and query the condor_ negotiator for user priorities which it needs to determine why jobs may not be running.
- The Condor tools that support either the -negotiator or -collector options do not work when a user also specifies the -pool to define a remote pool to communicate with. The tools print a somewhat confusing message in this case.
- Most Condor tools that support -pool hostname will also recognize -pool hostname:port if the remote condor_ collector is listening on a non-standard port. However, the condor_ findhost tool does not work if given a -pool option that includes a port.

Table 8.2: Condor version 6.6.0 supported platforms

Architecture	Operating System
Hewlett Packard PA-RISC (both PA7000 and PA8000 series)	HPUX 10.20
Sun SPARC Sun4m, Sun4c, Sun UltraSPARC	Solaris 2.6, 2.7, 8, 9
Silicon Graphics MIPS (R5000, R8000, R10000)	IRIX 6.5
Intel x86	Red Hat Linux 7.1, 7.2, 7.3
	Red Hat Linux 8 (clipped)
	Red Hat Linux 9 (clipped)
	Windows NT 4.0 Workstation and Server (clipped)
	Windows 2000 Professional and Server, 2003 Server (clipped)
	Windows XP Professional (clipped)
ALPHA	Digital Unix 4.0
	Red Hat Linux 7.1, 7.2, 7.3 (clipped)
	Tru64 5.1 (clipped)
PowerPC	Macintosh OS X (clipped)
Itanium	Red Hat Linux 7.1, 7.2, 7.3 (clipped)


condor-admin@cs.wisc.edu