8.5 Stable Release Series 6.6
This is a stable release series of Condor.
It is based on the 6.5 development series.
All new features added or bugs fixed in the 6.5 series are available
in the 6.6 series.
The details of each version are described below.
8.5.1 Version 6.6.12
Release Notes:
- Contains only a couple of bug fixes.
Bugs fixed that are included in version 6.7.19:
Bug fixes irrelevant to the 6.7 series:
- Fixed a bug which caused the condor_collector to incorrectly
handle Collector ads in which the Machine attribute is
missing, or Storage ads in which the Name is missing. In
these cases, a condor_collector running on some platforms
(notably, Solaris) could crash.
Known Bugs:
Version 6.6.11
Release Notes:
Security Bugs Fixed:
- Bugs in previous versions of Condor could allow any user who can
submit jobs on a machine to gain access to the ``condor'' account
(or whatever non-privileged user the Condor daemons are running as).
This bug can not be exploited remotely, only by users already logged
onto a submit machine in the Condor pool.
- The security of the ``condor_config_val -set'' feature was
found to be insufficient, so this feature is now disabled by default.
There are new configuration settings to enable this feature in a
secure manner.
Please read the descriptions of ENABLE_RUNTIME_CONFIG,
ENABLE_PERSISTENT_CONFIG and PERSISTENT_CONFIG_DIR
in the example configuration file shipped with the latest Condor
releases, or in section 3.3.5.
Other bugs fixed that are included in version 6.7.18:
- Fixed a bug which could cause the condor_ collector to crash
when it receives certain types of malformed ads.
- Fixed a bug which caused the condor_collector to incorrectly
handle ads in which the UpdateInterval attribute is set.
In particular, previous versions of the condor_collector
used the UpdateInterval value as the maximum lifetime
of the ad when aging the ads, which could cause it to remove the ad
prematurely.
The condor_ collector now looks at the ClassAdLifetime
attribute, and uses its value (if set).
NOTE: No current Condor daemons are publishing either of these
attributes, but may do so in the future.
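The lifetime selection described in the UpdateInterval fix above can be sketched as follows. This is a minimal illustration, not the condor_collector's actual code: the function names, the dictionary representation of an ad, and the 900-second default are all assumptions made for the example.

```python
DEFAULT_LIFETIME = 900  # hypothetical default, in seconds


def ad_max_lifetime(ad, default=DEFAULT_LIFETIME):
    """Return the maximum age (in seconds) before an ad is considered stale.

    Only ClassAdLifetime is consulted. UpdateInterval is deliberately
    ignored, because treating it as a lifetime could expire ads too early.
    """
    lifetime = ad.get("ClassAdLifetime")
    if isinstance(lifetime, int) and not isinstance(lifetime, bool) and lifetime > 0:
        return lifetime
    return default


def is_expired(ad, last_update, now, default=DEFAULT_LIFETIME):
    """Report whether an ad last updated at last_update has aged out at now."""
    return (now - last_update) > ad_max_lifetime(ad, default)
```

With this sketch, an ad that sets only UpdateInterval keeps the default lifetime, while an ad that sets ClassAdLifetime is aged according to that value.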
Bugs fixed that are included in version 6.7.14:
- Fixed a rare problem in the condor_ negotiator where a poorly
formed classad from a single condor_ schedd could halt negotiation
for the entire pool.
This poorly formed ad could only happen in extremely rare
circumstances, but it was possible.
Now, the condor_ negotiator will simply ignore poorly formed
classads and continue to negotiate with any other condor_ schedd in
the system that has idle jobs.
- Fixed a bug which caused log messages which should contain
``PRIV_USER_FINAL'' to be ``PRIV_USER_FINALPRIV_FILE_OWNER''.
It's also possible that this same bug could cause crashes if any
daemon attempts to log a message which would refer to
``PRIV_FILE_OWNER''.
- Fixed a bug which caused the condor_ starter to exit with an
error when the sum total of the file transfer size exceeded 2G.
This, in turn, caused a ``shadow exception'', and the job would
fail.
Bugs fixed that are included in version 6.7.11:
- In very rare cases, the condor_ startd could get into an
infinite loop if a job it was managing was suspended and then there
were fatal errors trying to send commands to evict the corresponding
condor_ starter.
This bug has been fixed, and the condor_startd will now correctly
recover (and clean up all processes) if it fails to send commands to a
starter managing a suspended job.
- Condor on Solaris has been patched to work around a Solaris stdio
limitation of 255 maximum file descriptors. Before this patch, heavily
loaded Condor daemons running on Solaris, particularly the condor_ schedd,
could exit complaining about lack of file descriptors for dprintf.
- Fixed a bug where the condor_ starter would follow symbolic links to
directories, when calculating job disk usage. This could cause an incorrect
job disk usage calculation, or hang the starter upon encountering an infinite
directory loop. This bug only affected Unix platforms.
- For Globus jobs, the Rematch expression is now evaluated when a
submit fails (in addition to when a submit commit times out).
- Fixed a bug that caused the condor_ gridmanager to go into an
infinite loop if an entry in the job's environment string was missing
an equals sign.
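The symbolic-link fix in the list above concerns a general technique: computing disk usage without descending into linked directories. The sketch below shows that technique in Python. It is not the condor_starter's implementation, and the function name `disk_usage_bytes` is hypothetical.

```python
import os
import stat


def disk_usage_bytes(top):
    """Sum the sizes of regular files under top without following symlinks."""
    total = 0
    # followlinks=False keeps os.walk from descending into symlinked
    # directories, so a link that points back up the tree cannot cause
    # an endless traversal.
    for dirpath, dirnames, filenames in os.walk(top, followlinks=False):
        for name in filenames:
            # lstat describes the entry itself, never a link target.
            st = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(st.st_mode):
                total += st.st_size
    return total
```

Symbolic links themselves are skipped in this sketch, so a link to a large file outside the directory does not inflate the total.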
Bugs fixed that are included in version 6.7.9:
- Fixed a bug where the condor_startd would erroneously compute the
console idle time using the file /proc/interrupts on Unix machines
that were not Linux.
- Fixed a bug where the condor_negotiator might dump core if it was
reconfigured in the middle of a negotiation cycle.
- Fixed a bug where the condor_ negotiator might dump core if a startd
had a name longer than 63 bytes.
- Fixed a bug that could cause condor_ userprio to crash if the
data it gets back from the condor_ negotiator is invalid.
- Fixed a bug where
DEFAULT_PRIO_FACTOR was ignored if
ACCOUNTANT_LOCAL_DOMAIN was not defined.
Bug fixes irrelevant to the 6.7 series:
- Added the -NoEventChecks and the -AllowLogError
command-line flags to condor_ submit_dag and the condor_ submit_dag
man page (they were already in condor_ dagman).
Added -r and -debug to the condor_ submit_dag
man page (they were already in condor_ submit_dag, just not
documented).
- Made command-line arguments case insensitive in the Windows
version of condor_ submit_dag; also fixed log file checks in
that version.
Known Bugs:
- A bug has been found which can cause a condor_ collector to
crash on some platforms (notably, Solaris). This can happen if the
condor_ collector receives a Collector ad in which the
Machine attribute is missing, or a Storage ad in which the
Name is missing. There is no security threat involved in
either case.
Version 6.6.10
Release Notes:
- Most of the fixes included in this release were also included in
version 6.7.7 (see below).
- The QUEUE_CLEAN_INTERVAL timer is reset during a
condor_ schedd reconfig only if this timer value has been changed.
Previously, the timer was reset during all condor_ schedd reconfigs, which
could prevent the job_queue.log file from being cleaned. Note that
this timer is always reset upon a condor_ schedd startup. See the
related change for truncating the job_queue.log below, for this same
release.
- Previously, the condor_ schedd would over-react and exit if it
tried to send a user email and SMTP_SERVER was undefined;
now it simply prints an error in the SchedLog and moves on.
Bugs fixed that are included in version 6.7.7:
- Fixed a bug that could cause the file job_queue.log in
the Condor SPOOL directory to grow unnecessarily large, thereby
slowing down the startup and/or shutdown times for the condor_ schedd
daemon.
- Fixed a critical bug where the console idle time for PS/2 keyboards
and mice was not being updated correctly.
- Fixed a bug in the condor_ collector that could cause it to
crash when parsing certain types of invalid ClassAds. In particular, if
a Machine, Schedd or License ClassAd sent to the condor_ collector has
an IP address field which is empty (which should never happen), the
condor_ collector will crash.
- Fixed some bugs in how the condor_schedd handles a graceful
shutdown (either because of a condor_off or a SIGTERM on UNIX):
- There was a minor bug if JOB_START_DELAY was set to
0 that would prevent the condor_ schedd from correctly cleaning
up during graceful shutdown.
Now, the condor_schedd will shut down properly, even if
JOB_START_DELAY is set to 0.
- Fixed a bug when there are scheduler universe jobs that were
recently submitted to the queue.
Previously, the shutdown code would not evict scheduler universe
jobs that had been submitted since the last
SCHEDD_INTERVAL (which defaults to 5 minutes).
So, if a user submitted a scheduler universe job and then someone
shut down Condor on that machine, the condor_schedd would wait
until the next SCHEDD_INTERVAL had elapsed before
evicting the job.
Now, the schedd will always attempt to evict scheduler universe
jobs during a shutdown, without waiting for this interval to pass.
- A number of Windows-specific bugs were fixed:
- It was possible under certain circumstances for execute
directories to not be cleaned up properly. This has been fixed.
- Certain Asian locales would cause the condor_ starter to crash
due to character translation problems. This has been fixed.
- Condor will now properly report memory sizes that exceed 2 GB.
- The condor_starter would be unable to run jobs if the
LOG path had a period (.) in it. This has been fixed.
- The condor_ startd would leak memory, especially on SMP
machines. This has been fixed.
- The condor_ master would crash immediately on Windows 2003
Server if the firewall was enabled. This has been fixed.
- Fixed a bug in condor_ dagman that could cause condor_ dagman
to fail an assertion if PRE or POST scripts are throttled with the
-maxpre or -maxpost condor_ submit_dag command line flags.
Bugs fixed that are NOT included in version 6.7.7:
- Fixed a bug where enabling the grid_monitor for any globus
job handled by something other than a hard-coded list of jobmanager names
would cause the job to stay idle forever. The hard-coded list of
jobmanager names was: condor, fork, lsf, pbs, and remote. A jobmanager
by any other name (e.g. condor_rh9, or lcgpbs) would cause the problem.
This bug was originally fixed in internal releases of 6.7.0, but it was
reintroduced by mistake in all public releases.
- Fixed the way condor_version handles command line arguments
(there were a number of problems and inconsistencies) and added a
-help option and usage message.
- Fixed some memory leaks in the condor_ startd that would be
induced by calling condor_ reconfig or condor_ status -d.
- By design, Condor daemons will exit if their parent process
exits. On Windows, a bug introduced in v6.5.x series broke this
behavior. This is now fixed.
- On Windows, users would often observe the condor_ master failing to
add exceptions for the Condor daemons to the Windows Firewall on Windows
XP SP2 or Windows 2003 Server SP1. The condor_ master will
now retry for a longer period of time to add these exceptions,
and the number of retries has now been made configurable. See
section 3.3.9 for more information.
Known Bugs:
Version 6.6.9
Release Notes:
- Most of the fixes included in this release were also included in
version 6.7.5.
However, at the end of this section, a few fixes that were added to
6.6.9 after 6.7.5 was released are mentioned separately.
Bugs fixed that are included in version 6.7.5:
- Fixed a security bug in the condor_ schedd that could enable a
maliciously modified condor_ submit tool to overwrite files in the Condor
SPOOL subdirectory, including the job queue.
- Fixed a bug where under very pathological file permission failure
conditions with a standard universe job, there would be a cycle of an
execute event followed by a termination event in the user log when the
job had not actually run.
Bugs fixed that are NOT included in version 6.7.5:
- Fixed a memory management bug introduced in version 6.6.8 that
could result in deallocated memory being referenced after a child
process forked from a Condor daemon exits.
- Fixed bugs in some Condor tools that failed to locate
condor_startd daemons that contained multiple @ signs in
their Name attribute.
For example, a virtual machine from a multiple-CPU condor_startd
spawned using glidein would have the name vm1@[pid]@[hostname].
All Condor tools that need to communicate with a condor_startd
like this will now succeed.
- Removed a fixed-length buffer in the code that handled the
SUBSYS_EXPRS config file setting.
Previously, if any attributes referred to were larger than
approximately 1000 bytes, Condor daemons would crash.
Now, there is no limit to the size of the attributes listed in
SUBSYS_EXPRS.
For more information about this setting, see section 3.3.5.
- Fixed a bug which would cause Condor to fail to cache user GID
information and potentially overwhelm NIS servers.
- Fixed another bug which could cause UDP machine updates to be
dropped by the condor_ collector.
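As an illustration of the multiple-@ names mentioned in the list above, the sketch below splits a name of the form vm1@[pid]@[hostname] at the last @ sign. The release notes do not say how the Condor tools parse these names; treating the final component as the hostname is an assumption, and the function name `split_startd_name` is hypothetical.

```python
def split_startd_name(name):
    """Split a startd Name into (prefix, hostname) at the last '@'.

    Assumes the hostname is the final '@'-separated component, so any
    earlier '@' signs stay inside the prefix.
    """
    prefix, sep, host = name.rpartition("@")
    if not sep:
        # No '@' at all: the whole name is taken to be the hostname.
        return "", name
    return prefix, host
```

Splitting at the first @ instead would leave part of the name attached to the hostname, which is one way a lookup for such a daemon could fail.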
Known Bugs:
- If a DAG node has both retries and a POST script, and the
actual Condor job for the node fails, the POST script is not
run except after the last retry of the job (or if the job
succeeds). (The POST script should be run each time the node
job is run, whether the job succeeds or not.)
- Occasionally, Condor generates both a terminated event and
an aborted event for a job that is aborted. If this happens for a
DAG node job, condor_ dagman considers this an error
and aborts the DAG. If you run into this problem, you can avoid
the abort by adding the -NoEventChecks flag to the argument list
in the condor_ dagman submit file generated by condor_ submit_dag
(you have to do condor_ submit_dag -no_submit and hand-edit
the resulting submit file). However, if you get the
double events on a node that has retries, condor_ dagman will assert.
The only fix for this is to upgrade to a 6.7.5 or newer condor_ dagman.
You can do this by simply installing a newer condor_ dagman executable,
without any other changes to your Condor installation. It is fine to
run a 6.7 condor_ dagman on a 6.6 Condor installation.
- In a DAG, if a node job generates an executable error event,
the DAG is aborted. This can be worked around by adding the
-NoEventChecks flag to the argument list in the condor_dagman
submit file generated by condor_ submit_dag (you have to do
condor_ submit_dag -no_submit and hand-edit the resulting
submit file).
Version 6.6.8
Release Notes:
- Most of the fixes included in this release were also included in
version 6.7.3.
However, at the end of this section, a few fixes that were added to
6.6.8 after 6.7.3 was released are mentioned separately.
New Features:
Bugs Fixed:
Known Bugs:
Bugs fixed that are not included in version 6.7.3:
- Fixed a discrepancy in the SUBSYS_ADDRESS_FILE
setting.
Previously, this setting did not work for SUBSYS values of
COLLECTOR or NEGOTIATOR (for example, defining
COLLECTOR_ADDRESS_FILE had no effect).
Now, if either of these is defined in the configuration file,
the corresponding Condor daemon will write out the address
and port it is using to the specified file.
Normally, the condor_ collector and condor_ negotiator listen on a
well-known, fixed port.
However, on single-machine, Personal Condor installations,
these address files allow all of the Condor daemons and tools to locate
the condor_ collector and condor_ negotiator, even if they are
using a dynamically assigned port.
For more information about the SUBSYS_ADDRESS_FILE
setting, please see the description in section 3.3.5.
For more information about using non-standard ports for the
condor_collector and condor_negotiator, please see the
description of ``Non Standard Ports for Central Managers'' in
section 3.7.1.
Version 6.6.7
Release Notes:
New Features:
- Added a feature to the condor_ master which automatically adds
the Condor daemons to the Windows Firewall exception list. This only
applies to machines running Windows XP SP2.
Bugs Fixed:
- Fixed a bug specific to Windows that could cause, in rare occurrences
due to a race condition, Condor to fail to properly signal the job to
suspend, continue, or preempt.
- When Condor transfers the job executable using the file transfer
mechanism, it used to leave the binary sitting as a world-writable
file inside the execute directory on UNIX.
Now, executable files transferred by Condor have the proper
permissions (mode 0755).
- Fixed an important bug in the low-level code that Condor uses to
transfer files across a network.
There were certain temporary failure cases that were being treated
as permanent, fatal errors.
This resulted in file transfers that aborted prematurely, causing
jobs to needlessly re-run.
The code now gracefully recovers from these temporary errors.
This should significantly help throughput for some sites,
particularly ones that transfer very large files as output from
their jobs.
- Fixed a bug in the file transfer mechanism which caused
segmentation faults when very long input/output/intermediate file
lists were used.
- Fixed a number of bugs in the -format option to condor_ q
and condor_ status.
Now, these tools will properly handle printing boolean expressions
in all cases.
Previously, depending on how the boolean evaluated, either the
expression was printed, or the tool could crash.
Furthermore, the tools do a better job of handling the different
types of format conversion strings and printing out the appropriate
value.
For example, if a user tries to print out a boolean attribute with
condor_status -format "%d\n" HasFileTransfer, the
condor_status tool will evaluate HasFileTransfer and print
either a 0 or a 1 (FALSE or TRUE).
If, on the other hand, a user tries to print out a boolean attribute
with condor_status -format "%s\n" HasFileTransfer, the
condor_status tool will print out the string ``FALSE'' or ``TRUE''
as appropriate.
- The ClassAd attribute scope resolution prefixes, MY. and
TARGET., are no longer case sensitive.
- condor_ dagman now generates a fatal error if any node submit
files are missing the log file attribute. This behavior can be
overridden with the -AllowLogError command-line option.
- condor_ dagman now does better checking for inconsistent events
(such as getting multiple terminate events for a single job). This
checking can be disabled with the -NoEventChecks command-line
option.
- Under Tru64, Condor would sometimes fail to start a job while
setting the resource limits on behalf of the job.
This error appears to be the result of a kernel issue.
A workaround has been implemented which will leave the limits
of the job unmodified and run the job when this specific error
situation arises.
- On Windows, occasionally Condor would exhibit erratic behavior
when a machine resumes from sleeping. This has been fixed.
- On Windows, occasionally Condor would fail to bind to any available
interfaces due to a mishandling of a function return value. This has
been fixed.
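The -format behavior for boolean attributes described in the list above can be mimicked with a short sketch. This is an illustration of the described output only, not the code in condor_q or condor_status. The function name `format_bool_attr` is hypothetical, and only the two conversions named in the notes are handled.

```python
def format_bool_attr(fmt, value):
    """Render a boolean attribute with a printf-style conversion.

    %d prints 0 or 1; %s prints the string FALSE or TRUE.
    """
    if "%d" in fmt:
        return fmt % int(value)
    if "%s" in fmt:
        return fmt % ("TRUE" if value else "FALSE")
    raise ValueError("unsupported conversion in format string: %r" % fmt)
```

For instance, `format_bool_attr("%d\n", True)` yields the text `1` followed by a newline, matching the first condor_status example.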
Known Bugs:
Version 6.6.6
Release Notes:
- A condor_dagman job will fail and report a cycle in the DAG
when XML logs are used in a single or multiple log format. The POST
script completion events are not converted to XML, and because of the
format of these events, condor_dagman never sees the scripts complete or fail.
New Features:
- The checkpoint server has moved from contrib module status to being
a normal part of Condor.
- When they first start running, all Condor daemons will now try to
print to their log file the full path to the binary they are
executing.
Unfortunately, we can only reliably get this information on Linux,
Solaris, MacOSX, and Windows platforms.
On other platforms, this information will only be printed to the log
file in certain cases that depend on how the daemon was invoked.
This new feature was added to aid in debugging problems where sites
were not running the version of the Condor daemons they thought they
were due to problems in custom-built startup scripts.
- condor_ wait is now available in the Windows port.
- Added a fix to the accountant that allows users to specify user
priorities with condor_ userprio before any jobs have been submitted.
- Added support for running batch files under Windows when using the
STARTD_CRON or USER_JOB_WRAPPER attributes.
- Moved from Globus 2.2.2 to Globus 2.2.4 for Condor-G, except for
the DUX 4.0f platform.
Bugs Fixed:
- Windows bug fixes:
- Fixed a bug which could cause Condor to kill processes that
aren't related to Condor or the job it was running at the time.
- Fixed a problem that could cause daemons or tools to crash
when they looked up information about processes running on the
system.
- Fixed a problem with the collector dropping TCP updates with
pools larger than roughly 20 machines. This issue only occurs with
UPDATE_COLLECTOR_WITH_TCP enabled.
- Fixed an issue with condor_ store_cred reporting success when
in fact under certain circumstances the store command actually failed.
- Removed condor_ kbdd_dll. It is no longer used.
- Fixed an issue with condor_ birdwatcher that caused it to
leak resource handles.
- Fixed an issue with the Windows port of condor_ dagman that
would cause it to crash when POST scripts were used.
- Fixed a bug where the environment of jobs in any universe could
be corrupted.
- The condor_ startd now properly cleans up execute directories on
root-squashed NFS mounts.
- Fixed a problem where the condor_ starter could crash if the
job it was running used Condor's file transfer mechanism and the
full path names to the job's files became longer than a few hundred
characters.
- The image_size attribute of a job on Mac OS X is much
closer to the values that ps returns.
Previously it would be highly inflated.
- Fixed a memory leak in the condor_ gridmanager.
- Added the -Storklog argument to condor_ submit_dag to make it
compatible with the older perl script of the same name.
- Removed support for the -libc option for condor_ version.
- Added a fix to condor_ compile where if our internal ld managed
to not be invoked during linking of a standard universe executable,
a warning is emitted.
- Fixed a minor bug in the file transfer mechanism. Specifically,
if a VANILLA job had when_to_transfer_output set to
ON_EXIT_OR_EVICT, wrote more than one output file, and was
actually evicted, the condor_shadow would have a fatal
run-time error (shadow exception) and your job would be rerun.
- DAGMan bug fixes:
- If submit files for individual nodes referred to the same log
file with different paths, condor_ dagman would read log events
incorrectly and the DAG would fail.
condor_ dagman is now able to recognize that the different paths
actually refer to the same log file.
- Fixed a bug where DAGMan failed to monitor Stork job logs.
- If a node submit file doesn't specify a log file, the warning
message now gets printed out in the DAGMan log file.
- Fixed a bug that caused condor_dagman to fail if the first node
submit file has a line continuation in its log file line.
- Bugs related to configuration
- Fixed a bug where Condor daemons could crash if
COLLECTOR_HOST or NEGOTIATOR_HOST was defined to
be something bogus.
- Fixed potential crash in the condor_ collector when
COLLECTOR_NAME was too long.
- The default setting for POOL_HISTORY_DIR is no
longer SPOOL.
Using the spool directory would result in history files being
obliterated by condor_ preen.
- Fixed a bug which could result in a daemon crashing while it was
writing to its logfile.
- Fixed a signal handling bug in the checkpoint server which could
cause the daemon to hang sometimes.
- The Kerberos map file now tolerates spaces on either side of the
equals sign instead of generating a parse error.
- The -analyze option to condor_ q is only meaningful for certain
universes. condor_ q now warns if the output might not be meaningful.
- Java universe: when jar files are transferred to the execute
machine (with should_transfer_files or
transfer_input_files) the condor_ starter will use the
local path (in the execute directory) for the jarfiles, instead of
the original path specified in the submit file.
- Previously, if a scheduler universe job died with a signal, the
condor_ schedd would write multiple (conflicting) events into the
UserLog file: a terminate event and an abort event.
Now, only the terminate event is written, not the abort event.
- Fixed a minor bug where if the condor_ schedd crashed or was
killed at just the wrong moment while a job was being removed
because the periodic_remove expression had evaluated to
TRUE, the job might have been successfully removed but the
RemoveReason attribute could have been lost.
Now, both actions are taken together atomically.
If a job is successfully removed, it will always have a
RemoveReason attribute.
- Fixed a memory leak in the condor_ collector.
Known Bugs:
Version 6.6.5
Release Notes:
New Features:
Bugs Fixed:
Known Bugs:
- condor_ dagman can fail to detect a job's progress if another
job in the DAG specifies the same underlying userlog file using
a different path or filename (e.g., log=foo and log=./foo) in
its submit file.
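The known bug above arises because log=foo and log=./foo are different strings that name the same file. One common way to detect this, shown in the sketch below, is to compare canonical paths. The helper name `same_log_file` is hypothetical, and this is not how condor_dagman is known to do it.

```python
import os


def same_log_file(path_a, path_b, cwd=None):
    """Report whether two log paths refer to the same file.

    Relative paths are resolved against cwd (default: the current
    directory), and realpath removes '.' components and symbolic links.
    """
    cwd = cwd or os.getcwd()
    a = os.path.realpath(os.path.join(cwd, path_a))
    b = os.path.realpath(os.path.join(cwd, path_b))
    return a == b
```

Comparing canonical paths works even before the log file exists, which matters because the log is usually created only once the first job is submitted.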
Version 6.6.4
Release Notes:
- This version only contains platform-specific bug fixes.
Therefore, it was only released for the two affected platforms.
Bugs Fixed:
- Fixed a major bug in the Windows NT/2000 port that caused the
Condor daemons to crash when attempting to authenticate.
- Fixed the bug in Condor's file transfer mechanism for Mac OSX
that was introduced in version 6.6.3.
Known Bugs:
Version 6.6.3
Release Notes:
- The Globus universe support for versions of Globus prior to 2.2 (specifically, those using GRAM 1.5 or earlier) has been removed.
New Features:
- The Globus universe now supports submitting jobs to Globus Toolkit 3.2 installations.
Bugs Fixed:
- The negotiator no longer crashes when a grid site ClassAd sets WantAdRevaluate but does not contain an UpdateSequenceNumber.
- Globus universe jobs were failing to go on hold when a $$() expression
could not be expanded.
- On Windows, the system-wide TEMP variable is included in the
execute environment if it is not specified in the submit file.
- Fixed a rarely-occurring bug where the child process forked by the schedd could get stuck in an infinite loop when the user runs ``condor_submit -s''. This should also fix problems where the child process forked by the collector would sometimes get stuck in an infinite loop when COLLECTOR_QUERY_WORKERS > 0 in the config file.
Known Bugs:
Version 6.6.2
Release Notes:
- There will be another release, 6.6.3, within a few weeks. We decided to
release this version now because it adds the AIX platform and has some bug
fixes which we thought important enough for a release. However, if you are
not affected by the bugs fixed (see below) you may wish to wait for 6.6.3.
New Features:
- Clipped support for AIX 5.2.
This means VANILLA universe only - no checkpointing or STANDARD universe.
- The setting GRIDMANAGER_GLOBUS_COMMIT_TIMEOUT allows
configuring the two phase commit timeout in Globus. This maps to the
two_phase setting in the Globus RSL.
- Added a new configuration variable,
DAGMAN_MAX_SUBMIT_ATTEMPTS, that controls how many
times in a row condor_ dagman will attempt to execute
condor_ submit for a given job before giving up. It cannot be
set to less than 1 attempt, or more than 10; if left undefined,
it defaults to 6.
- Added a new tool condor_ updates_stats to dump out the update
statistics information from ClassAds in a human readable format.
Condor 6.6.1, by default, publishes ``update statistics'' into the
ClassAds as published by the condor_ collector. This program parses
this output and displays it to the user in a readable format.
- Changed the default condor_ dagman behavior so that it doesn't
check for cycles at startup, only at runtime, since the former
could be expensive for large DAGs. Added a boolean
DAGMAN_STARTUP_CYCLE_DETECT config attribute to
re-enable cycle-detection at startup.
- condor_dagman now offers a configuration variable,
DAGMAN_MAX_SUBMITS_PER_INTERVAL, which controls how
many individual jobs condor_ dagman will submit in a row before
servicing other requests (such as a condor_ rm).
- The grid_monitor now automatically detects jobmanager scripts on the
remote gatekeeper. Previously it was limited to supporting the condor,
fork, lsf, pbs, and remote jobmanager scripts.
- A new parameter, SEC_DEBUG_PRINT_KEYS, controls whether or not
the keys used for encryption get printed into the log.
The default is false.
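The limits stated above for DAGMAN_MAX_SUBMIT_ATTEMPTS (minimum 1, maximum 10, default 6) can be expressed as a small sketch. The notes do not say whether out-of-range values are clamped or rejected; clamping is an assumption here, as are the function name and the dictionary used to represent the configuration.

```python
def max_submit_attempts(config):
    """Return the effective number of submit attempts.

    Uses the documented default of 6 when the setting is undefined and
    clamps other values into the documented range 1..10 (clamping is an
    assumption; the notes only say the value cannot fall outside it).
    """
    raw = config.get("DAGMAN_MAX_SUBMIT_ATTEMPTS")
    if raw is None:
        return 6
    return max(1, min(10, int(raw)))
```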
Bugs Fixed:
- Jobs that make use of Condor's file transfer mechanism were not
automatically authorized to read/write input/output files when
flocking to machines that did not happen to be in the
HOSTALLOW_WRITE list. This bug has existed since 6.3.
- Eliminated a small chance that a grid_monitor log file or state file
might be reused. The unique identifying numbers are now unique across
the entire gridmanager, not each Globus resource.
- Eliminated a race condition which might cause the grid monitor to
erroneously decide that the status file was broken when in fact it
was being uploaded and was empty.
- The grid monitor now attempts to restart transfers in the event of
globus-url-copy hanging.
- Removed some settings from the default configuration files
shipped with Condor that are no longer used in the code.
- Fixed bugs in condor_ dagman parsing of submit files (to determine
node log files). Previously, a submit file line beginning with
"log" (e.g., "LogLock = True") would be interpreted as a log file
line. Also, if "log" was defined twice in the submit file,
condor_ dagman would incorrectly use the first definition, rather than
the last.
- Re-added PVM support for IRIX 6.5.
- Fixed an indirect bug whereby condor_dagman could fail with an
assertion error if it encounters both a terminate and an abort event in
the userlog for the same job; this can happen due to a bug in the
condor_schedd, which is not yet fixed.
- condor_dagman now works correctly with nodes that have an initialdir
specified in the node submit file. (Previously, specifying
an initialdir only worked if the log file path was absolute.)
- condor_ dagman now responds more quickly to a request to be
removed from the queue (via condor_ rm), even if it is in the
midst of submitting jobs. Previously, condor_ dagman would
finish submitting all ready jobs before responding to a removal
request, which could take a long time, and forced it to
immediately remove all the jobs it had just submitted
unnecessarily.
- Fixed keyboard idle reporting on Mac OS X. Previously, the code
would often return -1 on newer hardware.
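The submit-file parsing fix in the list above has two parts: a line such as "LogLock = True" must not be mistaken for a log line, and the last "log" definition wins. A minimal sketch of both rules follows. It is not condor_dagman's parser: the function name `find_log_file` is hypothetical, case-insensitive matching is an assumption, and line continuations are not handled.

```python
import re

# Match "log", optional whitespace, then "=". A longer keyword such as
# "LogLock" fails because the character after "log" is neither
# whitespace nor "=".
LOG_LINE = re.compile(r"^\s*log\s*=\s*(.*?)\s*$", re.IGNORECASE)


def find_log_file(submit_text):
    """Return the log file named in a submit description, or None."""
    log_file = None
    for line in submit_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            # Keep overwriting so that the last definition wins.
            log_file = match.group(1)
    return log_file
```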
Known Bugs:
- If a scheduler universe job terminates via a signal, the
condor_ schedd logs both a terminate event and an abort event
to the userlog.
- Keyboard activity is not reported for pseudo-ttys on Mac OS X, only
for the physically connected keyboard.
Version 6.6.1
Release Notes:
- condor_ analyze is not included in the downloads of Version 6.6.1.
The existing binary from Version 6.6.0 is likely to work on all platforms
for which it was released.
New Features:
- Added full support (including standard universe jobs with
checkpointing and remote system calls) for Linux i386 RedHat 9
(using gcc/g++ version 3.2.2 and glibc version 2.3.2).
- Added full support (including standard universe jobs with
checkpointing and remote system calls) for Linux i386 RedHat 8
(using gcc/g++ version 3.2 and glibc version 2.2.93).
- The time it takes condor_dagman to submit jobs has been
reduced slightly to improve the startup time of large DAGs.
- In order to help reduce load on the condor_ schedd when
condor_ dagman is submitting jobs, there is a new config
variable, DAGMAN_SUBMIT_DELAY, to specify the number
of seconds condor_ dagman will sleep before submitting each
job.
- Enabled the ``update statistics'' in the condor_ collector by
default in both the executable and in the default configuration.
- Command-line arguments to condor_ dagman are now handled
case-insensitively.
- Added support for Condor-G and strong authentication to Condor
for IRIX 6.5, but removed support for checkpointing and remote
system calls.
We plan to add support in Condor for IRIX's kernel-level
checkpointing in a future release.
- Added a -p option to condor_store_cred so that users
can now specify the password on the command line instead of getting
prompted for it.
- The gahp_server helper process for Condor-G includes patches from
the LHC Computing Grid Project to increase data transfer performance of
the Condor-G client. Previous versions of Condor-G could bog down in
accepting new transfer requests, producing a variety of errors.
- Added a new configuration setting,
SUBMIT_SEND_RESCHEDULE, which controls whether or not
condor_submit should automatically send a condor_reschedule
command when it is done.
Previously, condor_submit would always send this reschedule so
that the condor_schedd knew to start trying to find matches for
the new jobs.
However, for submit machines that are managing a huge number of jobs
(thousands or tens of thousands), this step would hurt performance
in such a way that it became an obstacle to scalability.
In this case, an administrator can set
SUBMIT_SEND_RESCHEDULE to FALSE.
The extra step is then not performed, and the condor_schedd will try
to find matches whenever the periodic timer in the condor_negotiator
(NEGOTIATOR_INTERVAL) goes off.
- Pool administrators can now specify the length of time before
the condor_starter sends its initial update to the
condor_shadow by defining
STARTER_INITIAL_UPDATE_INTERVAL.
The default is 8 seconds.
This setting would not normally need changing except to fine-tune a
heavily loaded system.
- Administrators can now specify the default session duration for
each Condor subsystem.
This allows for fine tuning the image size of running Condor daemons
if the memory footprint is a concern.
The default for tools is 1 minute, the default for condor_submit
is one hour, and the default for daemons is 100 days.
This does not mean that tools cannot run more than one minute or
submit cannot run for more than an hour; it only affects memory
usage.
- Added a new configuration setting,
GRID_MONITOR_HEARTBEAT_TIMEOUT.
If this many seconds pass without hearing from the grid_monitor, it is
assumed to be dead. Defaults to 300 (5 minutes). Increasing
this number will improve the ability of the grid_monitor to
survive in the face of transient problems but will also
increase the time before Condor notices a problem. Prior to
this change the gridmanager always waited 5 minutes, and the user
could not change the setting.
- Added a new configuration setting,
GRID_MONITOR_RETRY_DURATION.
If something goes wrong
with the grid_monitor at a particular site (like
GRID_MONITOR_HEARTBEAT_TIMEOUT expiring), it will be retried
for this many seconds. Defaults to 900 (15 minutes). If it
cannot be restarted successfully, the grid monitor will be
disabled for that site until 60 minutes have passed. Prior to
this change the condor_gridmanager waited 60 minutes after any
failure.
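The timing rules described for GRID_MONITOR_HEARTBEAT_TIMEOUT and
GRID_MONITOR_RETRY_DURATION can be sketched as a small decision function.
This is only an illustrative Python model of the behavior stated above, not
Condor's actual implementation. The function and argument names
(heartbeat_expired, grid_monitor_action, last_heartbeat, first_failure) are
hypothetical; only the two setting names and the values 300, 900, and
60 minutes come from these notes. The sketch also assumes that the 60-minute
disabled period starts when the retry window ends, which the notes do not
state explicitly.

```python
from typing import Optional

# Defaults taken from the release notes above (all values in seconds).
GRID_MONITOR_HEARTBEAT_TIMEOUT = 300   # silence after which the monitor is assumed dead
GRID_MONITOR_RETRY_DURATION = 900      # how long to keep retrying after a failure
DISABLE_DURATION = 60 * 60             # how long the monitor stays disabled for a site


def heartbeat_expired(now: float, last_heartbeat: float,
                      timeout: float = GRID_MONITOR_HEARTBEAT_TIMEOUT) -> bool:
    """Return True once more than `timeout` seconds passed without a heartbeat."""
    return now - last_heartbeat > timeout


def grid_monitor_action(now: float, first_failure: Optional[float],
                        retry_duration: float = GRID_MONITOR_RETRY_DURATION) -> str:
    """Hypothetical model of the per-site decision.

    first_failure is the time the current run of failures began,
    or None if the grid monitor is healthy.
    """
    if first_failure is None:
        return "running"
    elapsed = now - first_failure
    if elapsed < retry_duration:
        return "retry"        # still inside the retry window
    if elapsed < retry_duration + DISABLE_DURATION:
        return "disabled"     # assumption: disabled period follows the retry window
    return "retry"            # disabled period is over; try again
```

Raising either value in this model delays the transition to the next state,
which matches the trade-off the notes describe: more tolerance for transient
problems, but a longer wait before a real failure is acted on.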
Bugs Fixed:
- Fixed bugs related to network communication and timeouts that
impact scalability in Condor:
  - Fixed a bug inside Condor's network communication layer that
    could result in Condor daemons blocking trying to read more data
    after a socket had already been closed.
  - Fixed a condor_negotiator bug that could, in certain rare
    circumstances, cause a condor_schedd to hang for five minutes
    while trying to communicate with it.
  - Fixed a bug in which TCP connections would re-authenticate
    needlessly when Condor's strong authentication was enabled.
    This was not harmful but incurred a bit of overhead, especially
    when using Kerberos authentication.
  - Fixed bugs related to network security sessions which were
    getting cleared out.
    If the timing was unfortunate, this could cause some jobs to fail
    immediately after completion.
    So, Condor no longer clears out security sessions periodically (it
    used to happen every 8 hours) nor does it do so when a daemon
    receives a condor_reconfig command.
- Fixed a bug in the standard universe where C++ code that threw an
exception would result in abortion of the executable instead of the
delivery of the exception. This bug affects Condor version 6.6.0 for
Red Hat 7.x.
- Fixed a condor_shadow bug that could result in a fatal error
if the following 3 conditions were met: (1) the job enables Condor's
file transfer mechanism, (2) the job wants Condor to automatically
figure out what files to transfer back (the default), and (3) the
job does not specify a userlog.
- Fixed a bug whereby condor_dagman, if removed from the queue via
condor_rm, could fail to remove all of its submitted jobs if
any of their submit events had not yet appeared in the userlog.
- Fixed a few bugs in condor_preen:
  - It will no longer potentially remove files related to a valid
    Computing on Demand (COD) claim on an otherwise idle machine.
  - condor_preen will no longer keep reporting that it had
    successfully removed a directory which was in fact failing to be
    removed.
- Fixed the faulty argument parsing in condor_rm,
condor_release, and condor_hold.
Previously, you could accidentally type condor_rm -analyze, and it
would remove all of your jobs.
Now it gives an error.
- On Windows, when you type a command like condor_reconfig.exe
instead of condor_reconfig, you no longer get an error.
- Fixed a bug on Windows that would cause ``GetCursorPos() failed''
to appear repeatedly in the StartLog. The startd now uses a different
function to track mouse activity that does not have a tendency to fail.
- Fixed a bug on Windows that would prevent some condor_shadow
daemons from obtaining a lock to their log file under heavy load,
thus causing them to EXCEPT().
- Fixed a bug on Windows where file transfers would incorrectly fail
because of bad permissions when using domain accounts with nested groups,
or when UNC paths were used.
- Fixed the bug where the condor_starter would fail to transfer
back core files created by Vanilla, Java and MPI universe jobs.
This bug was introduced in Condor version 6.5.2.
Now, Condor correctly transfers back any core files created by
faulty user jobs in any job universe.
- In some circumstances, condor_history would fail to read
information about some jobs, and would report errors. In particular,
when jobs had large environments, it would fail. This has been
corrected.
- Fixed a rare bug affecting condor_dagman when job-throttling
was enabled: if condor_dagman was removed from the queue
together with some of its own jobs (e.g., via condor_rm -a),
it would quickly submit new jobs to replace them before
recognizing that it needs to exit. It now shuts down
immediately without submitting and then removing these
unnecessary jobs.
- Fixed a potential security problem that was introduced in Condor
version 6.5.5 when the REQUIRE_LOCAL_CONFIG_FILE
configuration setting was added.
This setting used to default to FALSE if it was not defined in the
configuration files.
It now defaults to TRUE.
If administrators define local configuration files for the machines
in their pool, it should be a fatal error if those files don't exist
unless the administrators actively disable this check by defining
REQUIRE_LOCAL_CONFIG_FILE to be FALSE.
- Fixed a bug on Windows that would cause the condor_startd to
EXCEPT() if the condor_starter exited and left orphaned processes to
be cleaned up. This bug first appeared in 6.5.0.
- Fixed a bug that would cause graceful shutdowns on
Windows (such as when condor_vacate is called) to fail to
complete.
- The gahp_server helper program, which provides Globus services
to Condor-G, was always dynamically linked, even in statically-linked
releases.
The statically linked distributions of Condor now include a static
gahp_server.
- Fixed a minor bug in parsing XML user log files that contain empty
strings.
- Fixed the messages written to the Condor daemon log files in
various error conditions to be more informative and clear:
  - The error message in the SchedLog that indicates that swap
    space has been depleted has been rephrased so that its
    significance is apparent.
  - Certain serious error messages are now being written to the
    D_ALWAYS debug level that used to only appear if other debug
    levels were enabled.
  - Clarified log messages related to errors looking up user
    information in the passwd database on UNIX and for creating
    dynamic users on Windows.
  - Log messages related to keep-alives sent between the
    condor_schedd and condor_startd (written to D_PROTOCOL)
    now include the ClaimId on both sides, so that it is easier
    to find potential problems and figure out which keep-alive
    messages correspond to what resources.
  - Added more useful information to certain errors relating to
    security sessions and strong authentication.
  - Fixed the formatting of some messages to correctly include a
    newline at the end of the message.
- Fixed a bug in the condor_configure installation tool.
Previously, it would set MAIL_PATH, which doesn't exist
in Condor and had no effect.
Now, condor_configure correctly sets MAIL instead.
- Fixed a bug in userlog code in the CondorAPI library to prevent
segmentation faults.
- Clarified log messages for Condor-G's GridmanagerLog,
especially those relating to the grid monitor.
- Fixed a potential race condition when using the grid monitor.
Condor-G now identifies partial grid monitor status updates and
waits for the update to complete.
- The grid_monitor is slightly more robust in the face of
unexpected behavior by the Globus jobmanager. This is only a
partial fix; a complete fix requires the Globus
patch at
http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=1425
- Internal timeouts in the grid_monitor have been increased,
increasing robustness during transient errors.
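The REQUIRE_LOCAL_CONFIG_FILE change described in the list above can be
illustrated with a short Python sketch. It is a simplified model of the stated
rule only (an undefined setting now counts as TRUE, and a missing local
configuration file is fatal unless the setting is FALSE). The function name
check_local_config_files, the dictionary representation of the configuration,
and the simple string comparison used for the boolean are all assumptions made
for illustration; they are not Condor's code or its configuration parser.

```python
import os
from typing import Iterable, List, Mapping, Tuple


def check_local_config_files(config: Mapping[str, str],
                             local_files: Iterable[str]) -> Tuple[bool, List[str]]:
    """Return (fatal, missing) for the given local configuration files.

    config is a hypothetical name -> value view of the configuration.
    An undefined REQUIRE_LOCAL_CONFIG_FILE is treated as TRUE, which is
    the default described in the notes above.
    """
    value = config.get("REQUIRE_LOCAL_CONFIG_FILE", "TRUE")
    # Assumption: only the literal FALSE (any case) disables the check.
    enforce = value.strip().upper() != "FALSE"
    missing = [path for path in local_files if not os.path.isfile(path)]
    fatal = enforce and bool(missing)
    return fatal, missing
```

With the earlier default of FALSE, the same missing file would have been
reported in the second element but would not have been treated as fatal.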
Known Bugs:
- Submission of MPI jobs from a Unix machine to run on Windows
machines (or vice versa) fails for machine_count > 1. This is
not a new bug. Cross-platform submission of MPI jobs between
Unix and Windows has always had this problem.
- Installing Condor's standard universe support libraries multiple
times onto an NFS server, so that a heterogeneous mix of Linux
distribution revisions can all use the same condor_compile,
does not function correctly if Red Hat 9 is one of the distributions.
Version 6.6.0
New Features:
- The condor_dagman debugging log now reports the total number
of ``Un-Ready'' Nodes (i.e. those waiting for unfinished
dependencies) in its periodic summaries. In the past, the
omission of this state led to confusion because the total of all
reported job states didn't always match the total number of jobs
in the DAG.
- Most Condor commands (condor_on, condor_off,
condor_restart, condor_reconfig, condor_vacate,
condor_checkpoint, condor_reschedule) now support a -all
command-line option to specify which daemons to act on.
This is more efficient and much easier to use than previous methods
for accomplishing the same effect.
Using -all with condor_off correctly leaves the existing
condor_master processes running on each host, so that a subsequent
condor_on would work.
See section 3.10.1 for more details on
proper use of -all with condor_off and condor_on.
Bugs Fixed:
Known Bugs:
- The condor_preen program does not know about Computing on
Demand (COD) claims.
If there are no regular Condor jobs on a given machine, but there
are COD claims, and condor_preen is spawned, it will remove files
related to the COD claims.
In version 6.6.0, sites using COD are encouraged to disable
condor_preen by commenting out the PREEN setting in the
config files.
This bug has been fixed in Condor version 6.6.1.
- Normally, if a user's job crashes and creates a core file on a
remote execution machine, the condor_starter will automatically
transfer the core file back to the submit machine.
However, beginning in Condor version 6.5.2, if a vanilla, Java, or
MPI universe job creates a core file, the condor_starter will fail
to transfer it back.
This bug has been fixed in Condor version 6.6.1.
- There are a few bugs related to Condor tools failing to
correctly locate the condor_negotiator daemon.
These bugs usually show up if a site is using non-standard ports for
the central manager daemon.
However, some of the bugs show up regardless of whether the negotiator
is listening on the standard port or not.
  - condor_config_val -negotiator queries the
    condor_collector, instead of querying the
    condor_negotiator like it should.
  - Using the -pool option to condor_q -analyze will not work.
    The tool will fail to find and query the condor_negotiator
    for user priorities which it needs to determine why jobs may
    not be running.
  - The Condor tools that support either the -negotiator
    or -collector options do not work when a user also
    specifies the -pool option to define a remote pool to
    communicate with.
    The tools print a somewhat confusing message in this case.
- Most Condor tools that support -pool hostname will
also recognize -pool hostname:port if the remote
condor_collector is listening on a non-standard port.
However, the condor_findhost tool does not work if given a
-pool option that includes a port.
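The two -pool forms mentioned above, hostname and hostname:port, can be
told apart with a few lines of parsing. The following Python sketch shows one
way a tool might split such an argument. It is an illustration only, not the
code used by the Condor tools. The function name parse_pool_argument and the
example host names are hypothetical, and IPv6 literals are deliberately not
handled.

```python
from typing import Optional, Tuple


def parse_pool_argument(value: str) -> Tuple[str, Optional[int]]:
    """Split a -pool argument of the form 'hostname' or 'hostname:port'.

    Returns (hostname, port); port is None when no port was given, so the
    caller can fall back to whatever default port it normally uses.
    """
    host, sep, port_text = value.partition(":")
    if not host:
        raise ValueError("missing hostname in -pool argument: %r" % value)
    if not sep:
        return host, None
    port = int(port_text)  # raises ValueError for an empty or non-numeric port
    if not 0 < port < 65536:
        raise ValueError("port out of range in -pool argument: %r" % value)
    return host, port
```

For example, parse_pool_argument("cm.example.org") yields the host with no
port, while parse_pool_argument("cm.example.org:9700") yields the host and
the integer 9700.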
Table 8.2: Condor version 6.6.0 supported platforms
Architecture | Operating System
Hewlett Packard PA-RISC (both PA7000 and PA8000 series) | HPUX 10.20
Sun SPARC Sun4m, Sun4c, Sun UltraSPARC | Solaris 2.6, 2.7, 8, 9
Silicon Graphics MIPS (R5000, R8000, R10000) | IRIX 6.5
Intel x86 | Red Hat Linux 7.1, 7.2, 7.3
Intel x86 | Red Hat Linux 8 (clipped)
Intel x86 | Red Hat Linux 9 (clipped)
Intel x86 | Windows NT 4.0 Workstation and Server (clipped)
Intel x86 | Windows 2000 Professional and Server, 2003 Server (clipped)
Intel x86 | Windows XP Professional (clipped)
ALPHA | Digital Unix 4.0
ALPHA | Red Hat Linux 7.1, 7.2, 7.3 (clipped)
ALPHA | Tru64 5.1 (clipped)
PowerPC | Macintosh OS X (clipped)
Itanium | Red Hat Linux 7.1, 7.2, 7.3 (clipped)
condor-admin@cs.wisc.edu