8.3 Stable Release Series 6.8
This is a stable release series of Condor.
It is based on the 6.7 development series.
All new features added or bugs fixed in the 6.7 series are available
in the 6.8 series.
As usual, only bug fixes (and potentially, ports to new platforms)
will be provided in future 6.8.x releases.
New features will be added in the forthcoming 6.9.x development series.
The 6.8.x series supports a different set of platforms than 6.6.x.
Please see the updated table of available platforms in
section 1.5.
The details of each version are described below.
Version 6.8.3
Release Notes:
- Performed a security audit of all places where Condor opens files,
to make certain files are opened with a reasonable permission mode
and with the
O_EXCL flag whenever possible.
New Features:
- Added the JOB_INHERITS_STARTER_ENVIRONMENT configuration
macro. When set to True, jobs inherit all environment variables from
the condor_starter. This is useful for glidein jobs that need to access
environment variables from the batch system running the glidein daemons.
The default for this configuration macro is False, so existing behavior
is unchanged. This feature does not apply to standard and pvm universe
jobs.
- Changed the default UDP receive buffer for the
condor_collector from 1M to 10M. This value can be configured with
the (existing) COLLECTOR_SOCKET_BUFSIZE macro.
NOTE: For some Linux distributions, it may be necessary to raise the
operating system's maximum socket receive buffer size above its default;
this limit is /proc/sys/net/core/rmem_max. You can see the values that the
condor_collector actually used by enabling D_FULLDEBUG for the
condor_collector and looking for a log line like this:
Reset OS socket buffer size to 2048k (UDP), 255k (TCP).
- Added a new configuration macro to control the size of the
TCP send buffers for the condor_collector. This size used to
be controlled by COLLECTOR_SOCKET_BUFSIZE. The new macro is
COLLECTOR_TCP_SOCKET_BUFSIZE, and it defaults to 128K.
- Added a clipped port for SuSE Linux Enterprise Server 9 running on the
PowerPC architecture. Note the known bug below.
- The condor_schedd now maintains a birth date for the job queue.
Nothing in Condor currently uses this feature, but future versions of
condor_quill may require it.
- There is a new configuration file macro
RANDOM_INTEGER(min,max[,step]). It produces a
pseudo-random integer within the range min and max,
inclusive, at configuration time.
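As a rough illustration of the RANDOM_INTEGER behavior described above,
the following Python sketch picks a pseudo-random integer between a minimum
and a maximum, inclusive. The function name random_integer is hypothetical.
The treatment of step (only min, min+step, min+2*step, ... are candidates)
is an assumption, since the text above does not define it.

```python
import random


def random_integer(lo, hi, step=1):
    """Return a pseudo-random integer in [lo, hi], inclusive.

    Assumption (not stated in the release notes): with a step, only
    lo, lo + step, lo + 2*step, ... that do not exceed hi are candidates.
    """
    if step < 1 or lo > hi:
        raise ValueError("need step >= 1 and lo <= hi")
    # randrange excludes its stop value, so pass hi + 1 to make hi eligible.
    return random.randrange(lo, hi + 1, step)


if __name__ == "__main__":
    print(random_integer(1, 10))      # some integer from 1 to 10
    print(random_integer(0, 60, 15))  # one of 0, 15, 30, 45, 60
```

Because the value is chosen once, at configuration time, a daemon would keep
the same value until its configuration is read again.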
Bugs Fixed:
- Fixed a deadlock situation between the condor_schedd and
the condor_startd that could
significantly impact the condor_schedd's performance. The likelihood of the
deadlock increased based upon the number of VMs advertised by the
condor_startd.
- Fixed a bug reading the user job log on Windows that caused
occasional DAGMan confusion.
Thanks to Fairview Software, Inc. for
both finding the bug and writing a patch.
- Fixed a denial of service problem: Condor daemons no longer freeze
for 20 seconds when a client connects to them and then sends no data.
This behavior is common with port scanners.
- Fixed a race condition with condor_quill caused by
PostgreSQL's default transaction isolation level being "read
committed".
This bug would cause truncated condor_q reads when using Quill.
- Fixed a bug where the condor_ckpt_server would segfault when
turned off with condor_off -fast.
- Fixed a bug in the condor_startd where it could die with
SIGABRT when a condor_starter exited under certain rare
circumstances.
The bug seems to have been most likely to appear on x86_64 Linux
machines, but could potentially affect all platforms.
- Fixed a problem with condor_history when running with Quill enabled,
which caused it to allocate an unbounded amount of memory.
- Fixed a problem with condor_q when running with Quill, which caused
it to silently truncate the printing of the job queue.
- Fixed a bug in the condor_gridmanager that caused the following
configuration file parameters to be ignored for jobs of grid types condor
and nordugrid: GRIDMANAGER_RESOURCE_PROBE_INTERVAL,
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE, and
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE.
- Fixed a bug in condor_ run that caused it to abort on non-fatal
warnings from condor_ submit and print incorrect error messages.
- Fixed a bug in the condor_ gridmanager dealing with grid type gt4
grid universe jobs. If the job's standard output or error was not specified
in the job ClassAd, the condor_ gridmanager would create an improper GRAM
RSL string, causing the job to fail.
- Fixed a bug in the condor_ gridmanager that could cause it to
delegate the wrong credential when refreshing the credentials for a
grid type gt4 grid universe job.
- The condor_ gridmanager could get into a state where it would no
longer start up Globus jobmanagers for grid type gt2 grid universe jobs,
if previous requests failed due to connection errors. This bug has been
fixed.
- The condor_ c-gahp now properly exits when the pipe to its parent
goes away. Before, it would fill its log with large amounts of useless
messages, before exiting several minutes later.
- Fixed a bug where, after a problem opening standard input, output, or
error, the standard universe might generate an incorrect warning in the
condor_shadow's log.
- The condor_ gridmanager now recovers properly when a proxy refresh
fails for a gt2 grid universe job in the stage-out state. Before, the job
would become held with a hold reason of ``Globus error 3: an I/O operation
failed''.
- A number of fixes to minor typos and incorrect formatting in
Condor's log files.
- When REQUEST_CLAIM_TIMEOUT was reached and the condor_schedd
failed to contact the condor_startd to release the claim, the
condor_schedd would
periodically try releasing the claim indefinitely, possibly resulting in
a lengthy communication delay each time.
- Under Windows, Condor daemons such as the condor_schedd were sometimes
limiting their use of pending connect operations more than they should
have. This would result in the message, "file descriptor safety level
exceeded".
- condor_ fetchlog no longer allows or documents the -dagman option.
The option's appearance was an error. The option never worked.
- The condor_ schedd ensures that the initial job queue log file
contains a sequence number for use by Quill. This fixes a case in
which no sequence number was inserted, because the initial rotation of
this (empty) file failed. Quill also now reports exactly what the
problem is if it reads a job queue log in this state, rather than
simply crashing. This problem has so far only been observed under
Windows.
- Fixed a problem on Windows where, when submitting a job with a
sandbox (for example, using the -s or -r option to
condor_ submit), an erroneous file permissions check in the
condor_ schedd would result in a failed submission.
- The condor_ startd would crash shortly after start up if the
RANK expression contained any use of the unary minus
operator. This patch should also fix any other cases where Condor
daemons crashed due to the use of the unary minus operator in ClassAd
expressions.
- Stork now writes a terminated event to the user log when it removes
a transfer job from its queue because of failures to invoke a transfer
module. Without this event, DAGMan would not notice that these jobs had
left the queue.
- Fixed a problem where the condor_ schedd on Windows would
incorrectly reject a job if the client provided an Owner
attribute that was correct but differed in case from the authenticated
name. This bug was thought to have been fixed in Condor 6.8.0.
- Fixed problems with condor_ store_cred behaving strangely when
storing or removing a user name that is some initial substring of
``condor_pool''. Specifying such a user name would be incorrectly
interpreted as equivalent to specifying the -c option.
- Fixed a problem with condor_ glidein spewing lots of text to
the screen when checking the status of a job it submitted.
- A new version of the GT4 GAHP is included, with the following changes:
- A new axis.jar from Globus fixes a thread safety bug that
can cause lockups in subscriptions for WS notifications. See Globus
Bugzilla 4858
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4858).
- Fixed bugs that caused memory related to destroyed jobs to not
be reclaimed in both the client and the server.
- Removed redundant usage of Secure Message, Secure Conversation,
and Transport Security when talking to a WS GRAM service. Now, only
Transport Security is used.
- Fixed memory leaks in condor_ quill.
- Fixed a bug that might have caused condor_ startd problems
launching the condor_ starter for the standard universe on 64-bit systems.
- Improved Condor's file transfer. If you request that Condor
automatically transfer back your output, it now detects changes better.
Previously, it would only transfer back files that had a more recent timestamp
than the spool date. Now, it will transfer back any file that has changed
in date (including being dated in the past) or changed in size.
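The change in output file detection described in the last item can be
sketched in Python as follows. This is only an illustration of the two rules
as stated above, not Condor's implementation; the FileInfo type and both
function names are hypothetical, and treating a file with no earlier record
as changed is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    mtime: float  # modification time, in seconds since the epoch
    size: int     # size in bytes


def changed_old(after: FileInfo, spool_time: float) -> bool:
    """Old rule: transfer only files newer than the spool date."""
    return after.mtime > spool_time


def changed_new(before: Optional[FileInfo], after: FileInfo) -> bool:
    """New rule: transfer files whose date or size differs at all."""
    if before is None:
        # Assumption: a file with no earlier record counts as changed.
        return True
    return after.mtime != before.mtime or after.size != before.size


if __name__ == "__main__":
    spool_time = 1000.0
    original = FileInfo(mtime=900.0, size=10)
    # The job rewrote the file and set its date into the past.
    rewritten = FileInfo(mtime=500.0, size=10)
    print(changed_old(rewritten, spool_time))  # False: missed by the old rule
    print(changed_new(original, rewritten))    # True: caught by the new rule
```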
Known Bugs:
Version 6.8.2
Release Notes:
- Condor now uses Globus 4.0.3 for GSI, GRAM, and GridFTP support.
This includes a patch for the OpenSSL vulnerability detailed in
CVE-2006-4339 and http://www.openssl.org/news/secadv_20060905.txt.
It also includes fixes for Globus Bugzilla 4689
(http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4689) and a
bug that can cause duplicate UUIDs to be generated for WS GRAM jobs.
- The condor_ schedd daemon no longer forks separate processes to
change ownership of job directories in the spool.
Previously on Unix-like systems, this would create a
new process before a job started running and after it finished running. Some
sites with very busy condor_ schedd daemons were encountering scaling problems.
New Features:
- Because, by default, the condor_ startd daemon references the job
ClassAd attribute NumCkpts, Condor's default configuration
will now round up the value of NumCkpts, in order to improve
matchmaking performance. See the entry on SCHEDD_ROUND_ATTR
in section 3.3.11.
- Enhanced the RHEL3 x86_64 port of Condor to include the standard
universe.
- condor_ submit_dag -f no longer deletes the
dagman.out file. condor_ submit_dag without the -f
option will now submit a DAGMan run even if the dagman.out
file exists. In this case, the file will be appended to.
- Added a property to the Windows installer program to determine
whether the Condor service will be started after installation. The
property name is STARTSERVICE, and the default value is ``Y''.
Bugs Fixed:
- A bug caused the condor_ master daemon to kill
only immediate children within the process tree,
upon an abnormal exit of the condor_ master daemon.
The condor_ master daemon now kills all descendant processes.
- Fixed a bug where if the file system was full, the debugging log
files (for example SchedLog) would silently lose messages. Now,
if the disk is full, the Condor daemons will
exit.
- Fixed a bug in the condor_ schedd daemon that caused it to stop
negotiating for grid universe jobs in the case that it decided
it could not spawn any new condor_ shadow processes.
- Added the ProcessId class (which more uniquely identifies a
process than a PID does) to the condor_ dagman abort duplicate
runs feature. This makes it less likely that a given instance of
condor_ dagman will mistakenly conclude that another instance of
condor_ dagman is already running on the same DAG. Also fixed an
unrelated bug in the abort duplicate runs feature that could cause
a condor_ dagman to not abort itself when it should.
- Condor daemons leaked memory (consuming more and more memory over time)
when parsing ClassAds that use functions with arguments.
- Fixed a bug in the condor_ starter daemon,
which caused it to look in the
wrong place for the job's executable, if TransferExecutable was set
to True in the job ClassAd.
- condor_ history no longer crashes if HISTORY is not defined
in the Condor configuration file.
- Fixed an unintentional change to the value of -Condorlog
in a condor_ dagman submit description file: it is once again the log file of
the first node job.
- Fixed a bug in condor_ q that would cause condor_ q -hold or
condor_ q -run to exit with an error on some platforms.
- Fixed a bug on Unix platforms, in which a misconfiguration of
MAIL would cause the condor_ master daemon to restart
all of its child
daemons whenever it tried (and failed) to send e-mail to the
administrator.
- Network related error messages have been improved to make debugging
easier. For example, when timing out on a read or write operation, the
peer's address is now included in the error message.
- An invalid value for UPDATE_INTERVAL now causes
the condor_ startd daemon to abort. Previously, it would continue running,
but some invalid values (for example, 0) could cause it to stop sending
periodic ClassAd updates to the condor_ collector, even after being
reconfigured with a valid value. Only a complete restart of
the condor_ startd daemon was sufficient to get it out of this state.
- Fixed a bug that caused X.509 limited proxies to be delegated as
impersonation (i.e. non-limited) proxies. Any authentication attempted
with the resulting proxies would fail.
- Fixed a couple of bugs that would cause Condor to lose track of
some Condor-related processes and subsequently fail to clean up (kill)
these processes.
- Fixed a bug that would cause condor_history to crash when
dealing with rotated history files. Note that history file rotation is
turned on by default. (See section 3.3.3 for descriptions of
ENABLE_HISTORY_ROTATION and MAX_HISTORY_ROTATIONS.)
Known Bugs:
Version 6.8.1
Release Notes:
New Features:
- Added an optional argument to the condor_ dagman ABORT-DAG-ON
command that allows the DAGMan exit code to be specified separately
from the node value that causes the abort; also, a DAG can now be
aborted on a zero exit code from a node.
- Added the ALLOW_FORCE_RM configuration variable.
If this expression evaluates to True,
then a condor_rm -f attempt is allowed. If it evaluates to False,
the attempt is disallowed.
The expression is evaluated in the context of the job ClassAd.
If not defined, the value defaults to True, matching the behavior of
previous Condor releases.
- condor_ dagman will now reject DAGs for which any of the nodes'
user job log files are on NFS (because of the unreliability of NFS
file locking, this can cause DAGs to fail). This feature can be
turned off by setting the DAGMAN_LOG_ON_NFS_IS_ERROR
configuration macro to False (the default is True).
- condor_ submit can now be configured to reject jobs for which
the log file is on NFS.
To do this, set the LOG_ON_NFS_IS_ERROR
configuration macro to True.
The default is that condor_ submit will issue a warning
for a log file on NFS.
- Added the DAGMAN_ABORT_DUPLICATES configuration macro,
which causes
condor_ dagman to attempt to detect at startup whether another
condor_ dagman is already running on the same DAG; if so, the second
condor_ dagman will abort itself.
- The new configuration variable
NETWORK_MAX_PENDING_CONNECTS may be used to limit the
maximum number of simultaneous network connection attempts. This is
primarily relevant to the condor_ schedd daemon, which may try to connect to
large numbers of condor_ startd daemons when claiming them.
The condor_ negotiator may also
connect to large numbers of condor_ startd daemons when initiating
security sessions
used for sending MATCH messages. On Unix, the default is to allow up to
eighty percent of the process file descriptor limit. On Windows, the
default is 1600.
- Added some more debug output to condor_ dagman to clarify
fatal errors.
- The -format argument to condor_ q and condor_ status can now take an expression in addition to a simple attribute name.
- DRMAA is now available on most Linux platforms, Windows and PPC MacOS.
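The defaults given above for NETWORK_MAX_PENDING_CONNECTS (eighty percent of
the process file descriptor limit on Unix, 1600 on Windows) can be worked
through with a short Python sketch. The function name is hypothetical, and
rounding the eighty percent figure down to a whole number is an assumption;
the text above does not say how Condor rounds.

```python
import sys


def default_max_pending_connects(fd_limit, windows=False):
    """Return the default limit on simultaneous connection attempts.

    fd_limit is the process file descriptor limit (ignored on Windows).
    Assumption: the eighty percent figure is rounded down.
    """
    if windows:
        return 1600
    return (fd_limit * 80) // 100


if __name__ == "__main__":
    if sys.platform.startswith("win"):
        print(default_max_pending_connects(0, windows=True))
    else:
        import resource  # Unix-only standard library module

        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
        print(default_max_pending_connects(soft_limit))
```

For example, with a file descriptor limit of 1024 this gives 819 pending
connects, and with a limit of 256 it gives 204.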
Bugs Fixed:
- When a large number of jobs (roughly 200 or more) are running from a
single condor_ schedd daemon, and those jobs are using job leases
(the default in 6.8), it is
possible for the condor_ schedd daemon to enter a state
where it crashes on startup until all of
the job leases expire.
- Condor jobs submitted with the NiceUser priority were
not being matched if the NEGOTIATOR_MATCHLIST_CACHING
setting was TRUE (which is enabled by default).
- Fixed a Quill bug that prevented it from running on Windows. The
symptom showed with errors in the QuillLog such as
POLLING RESULT: ERROR
- Fixed a bug in Quill where it would cause errors such as
duplicate key violates unique constraint "history_vertical_pkey"
in the QuillLog and the PostgreSQL log file. These errors
triggered
a significant slowdown in the performance of Quill and the database. This
would only happen when a job attribute changed type from a string
type to a numeric type, or vice versa.
- In those unusual cases where Condor is unable to create a new process,
it shuts down cleanly, eliminating a small possibility of data corruption.
- Fixed a bug with the gt4 and nordugrid grid universe jobs that
caused the stdout and stderr of a job to not be
transferred correctly, if the given file names had absolute paths.
- condor_dagman now echoes warnings from condor_submit and
stork_submit to the dagman.out file.
- Fixed a bug introduced in 6.7.20, causing the condor_ ckpt_server
to exit immediately after starting up, unless Condor's security
negotiation was disabled.
- MAX_<SUBSYS>_LOG defaults to one Megabyte, even if the
setting is missing from the configuration. Previously it was 64 Kilobytes.
- Fixed a bug related to non-blocking connect that could occasionally
cause Condor daemons to crash.
- Fixed a rare bug where an exceptionally large query to the
condor_ collector could cause it to crash. The most common cause was a single
condor_ schedd daemon restarting,
and trying to recover a large number of job leases at once.
More than approximately 250 running jobs on a single condor_ schedd daemon
would be necessary to trigger this bug.
- When using the JOB_PROXY_OVERRIDE_FILE configuration
parameter, the X.509 proxy will now be properly forwarded for Condor-C jobs.
- Greatly reduced the chance that a Condor-C job in the REMOVED state
will be HELD due to an expired proxy or failure to talk to the remote
condor_ schedd.
- Fixed error and debug messages added in Condor version 6.7.20 that
incorrectly reported IP and port numbers. These messages were
intended to report the peer's address, but they were instead reporting the
local address of the network socket.
- Fixed a bug introduced in Condor version 6.7.20
which could cause Condor daemons to
die with the message
PANIC -- OUT OF FILE DESCRIPTORS
The conditions
causing this related to failed attempts to send updated status
to the condor_ collector daemon,
with both non-blocking updates and security negotiation
enabled (the defaults).
- Also fixed a bug in the negotiator with the same effect as
above, except it only happened with the configuration setting
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT=False.
- Fixed a bug in the condor_schedd under Solaris that could also
cause file descriptors to become exhausted over time when many
machines (e.g. over 100) were claimed in a short span of time and the
condor_schedd process file descriptor limit was near 256.
- Fixed a bug in condor_ schedd under Windows that could cause
network sockets to be allocated and never released back to the system.
The circumstances that could cause this were very rare. The error
message in the logs indicating that this problem was happening is
ERROR: DuplicateHandle() failed in Sock::set_inheritable
In cases where this error message is displayed, the network socket
is closed.
- Under some conditions, when making TCP connections, Condor was
still trying to connect for the full duration of the operation timeout
(often 10 or 20 seconds), even if the connection attempt was refused
(for example, because the port being accessed is not accepting connections).
Now, the connect operation finishes immediately after the first such
failure, allowing the Condor process to continue with other tasks.
- Fixed problems relating to the credential cache in the Kerberos
authentication mechanism. The current version of Kerberos is 1.4.3.
- Fixed bugs in the SSL authentication mechanism that caused the
condor_ schedd to crash when submitting a job (on Unix) and caused
all tools and daemons to crash on Windows when using SSL.
- Some of the binaries required to use Condor-C on Windows were
mistakenly not included in previous releases of Condor. This has been
fixed.
- Fixed a problem on Windows where the condor_ startd could fail to
include some attributes in its ClassAd. This would result in some jobs
incorrectly not being matched to that machine. This only happened if
CREDD_HOST was defined and Condor daemons on the execute
machine were unable to authenticate with the condor_ credd.
- Fixed a condor_ dagman bug which had prevented the
$(DAGManJobId) attribute from being expanded in job submit files
(for example,
when used as the value to define the Priority command).
- Fixed a bug in condor_ submit that caused parallel universe jobs
submitted via Condor-C to become mpi universe jobs.
- Fixed a bug which could cause Condor daemons to hang if they try
to write to the standard error stream (stderr) on some platforms. In
general, this should never happen, but can, due to third party
libraries (beyond our control) trying to write error or other messages.
- Fixed condor_ status to report error messages.
- Fixed a bug in which setting the configuration variable
NEGOTIATOR_CONSIDER_PREEMPTION = False
caused an incorrect calculation.
The fraction of the pool already being claimed by a user was
calculated using the wrong total number of condor_ startd daemons.
This could cause some condor_ startd daemons to remain unclaimed,
even when there were jobs available to run on them.
- Fixed a security vulnerability in Condor's FS and FS_REMOTE
authentication methods. The vulnerability allowed an attacker to impersonate
another user on the system, potentially allowing submission of jobs as a
different user. This may allow escalation to root privilege if the Condor
binaries and configuration files have improper permissions. The fix is not
backwards compatible, which means all daemons and tools using FS authentication
must be running Condor 6.8.1 or greater. The same applies to FS_REMOTE: all
daemons and tools using FS_REMOTE must be using Condor 6.8.1 or greater. In
practice, this means that for FS, all Condor binaries on one host must be
version 6.8.1 or greater, but versions can be different from host to host. For
FS_REMOTE it means all binaries across all hosts must be 6.8.1 or greater.
- Fixed a couple of race conditions in stork and the credd where credential
files were possibly created with improper permissions before being set to owner
permissions.
- Fixed a bug in the condor_ gridmanager that caused it to delegate
12-hour proxies for grid-type gt4 jobs and then not refresh them.
- Fixed a bug in the condor_gridmanager that caused a directory
needed for staging-in of grid-type gt4 job files to be removed when
the condor_gridmanager exited, causing the stage-in to fail.
- Fixed a bug that caused the checkpoint server to restart
because of (ostensibly) getting an unexpected errno from select().
- Fixed a bug on Windows where setting output or
error to a relative or absolute path (as opposed to a
simple file name without path information) would not work properly.
- History file rotation did not previously work on Windows because
the names of rotated files would contain an ISO 8601 extended format
timestamp, which contains colon characters. The naming convention for
rotated files has been modified to use ISO 8601 basic format, avoiding
this problem.
- The CLAIMTOBE authentication method (which is inherently
insecure and should only be used for testing or other special
circumstances) previously would authenticate without providing the
``domain'' portion of the user name. As an example, a user would be
authenticated as simply ``user'' rather than
``user@cs.wisc.edu''. This problem has been fixed, but the new
protocol is not backwards compatible so the fix is turned off by
default. Correct behavior can be enabled by setting the
SEC_CLAIMTOBE_INCLUDE_DOMAIN parameter to True.
- Fixed a bug with the NEGOTIATOR_MATCHLIST_CACHING that
would cause very low-priority jobs (like jobs submitted with
nice_user=True) to not match even if resources were available.
- Fixed a buffer overflow that could crash the condor_ negotiator.
- SCHEDD_ROUND_ATTR_<xxxx> preserves the value being
rounded up when it is a multiple of the power of 10 specified for
rounding. Previously, the value would be incremented; now it remains
the same. For example, if SCHEDD_ROUND_ATTR_<xxxx>=2 and the value
being rounded up is 100, it now remains 100, rather than being
incremented to 200.
- Fixed condor_updates_stats to report its version number
correctly.
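The SCHEDD_ROUND_ATTR_<xxxx> fix listed above can be checked with a small
worked example. The Python sketch below rounds a non-negative integer up to
a multiple of a power of 10 and contrasts it with the previous behavior as
described in that item. Both function names are hypothetical, and the code is
an illustration of the stated rule rather than Condor's own implementation.

```python
def round_up(value, digits):
    """Round a non-negative integer up to a multiple of 10**digits.

    A value that is already a multiple is returned unchanged.
    """
    unit = 10 ** digits
    return ((value + unit - 1) // unit) * unit


def round_up_old(value, digits):
    """Previous behavior as described: exact multiples were also incremented."""
    unit = 10 ** digits
    return (value // unit + 1) * unit


if __name__ == "__main__":
    print(round_up(100, 2))      # 100: an exact multiple is preserved
    print(round_up_old(100, 2))  # 200: the behavior that was fixed
    print(round_up(101, 2))      # 200
```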
Known Bugs:
- The -completedsince option to condor_ history works
when Quill is enabled. The behavior of condor_ history
-completedsince is undefined when Quill is not
enabled.
Version 6.8.0
Release Notes:
- The default configuration for Condor now requires that
HOSTALLOW_WRITE be explicitly set. Condor will refuse
to start if the default configuration is used unmodified.
Existing installations should not need to change anything. For
those who desire the earlier default, you can set it to "*", but
note that this is potentially a security hole allowing anyone to
submit jobs or machines to your pool.
- Most Linux distributions are now supported using dynamically
linked binaries built on a RedHat Enterprise Linux 3 machine.
Recent security patches to a number of Linux distributions have
rendered the binaries built on RedHat 9 machines ineffective.
The download pages have been changed to reflect this, but Linux users
should be aware of this change.
The recommended download for most x86 Linux users is now:
condor-6.8.0-linux-x86-rhel3-dynamic.tar.gz.
- Some log messages have been clarified or moved to different
debugging levels.
For example, certain messages that looked like errors were printed
to D_ALWAYS, even though nothing was wrong and the system was
behaving as expected.
- The new features and bugs fixed in the rest of this section only
refer to changes made since the 6.7.20 release, not the last stable
release (6.6.11).
For a complete list of changes since 6.6.11, read the 6.7 version
history in section 8.4.
New Features:
- Version 1.4 of the Condor DRMAA libraries is now included
with the Condor release.
For more information about DRMAA, see section 4.4.2.
- Version 1.0.15 of the Condor GAHP is now used for Condor-G and
Condor-C.
- Added the -outfile_dir command-line argument to
condor_ submit_dag. This allows you to change the directory in which
condor_ dagman writes the dagman.out file.
- Added a new -summary (also -s) option to the
condor_ update_stats tool. If enabled, this prevents it from
displaying the entire history for each machine and only displays the
summary info.
Bugs Fixed:
- Fixed a number of potential static buffer overflows in various
Condor daemons and libraries.
- Fixed some small memory leaks in the condor_startd and
condor_schedd, and a potential leak that affected all Condor
daemons.
- Fixed a bug in Quill which caused it to crash when certain
long attributes appeared in a job ad.
- The startd would crash after a reconfig if the address of a
collector had not been resolved since the previous reconfig
(e.g. because DNS was down during that time).
- Once a Condor daemon failed to look up the IP address of the
collector (e.g. because DNS was down), it would fail to contact the
collector from that time until the next reconfig. Now, each time Condor
tries to contact the collector, it generates a fresh DNS query if the
previous attempt failed.
- When using Condor-C or the -s or -r command-line options to
condor_ submit, the job's standard output and error would be placed
in the job's initial working directory, even if the job ad said to
place them in a different directory.
- Greatly sped up the parsing of large DAGs (by a factor of 50
or so) by using a hash table instead of linear search to find DAG nodes.
- Fixed a bug in condor_ dagman that caused an EXECUTABLE_ERROR
event from a node job to abort the DAG instead of just marking the
relevant node as failed.
- Fixed a bug in condor_ collector that caused it to discard
machine ads that don't have an IP address field (either StartdIpAddr
or STARTD_IP_ADDR). The condor_ startd will always produce a
StartdIpAddr field, but machine ads published through
condor_ advertise may not.
- When using BIND_ALL_INTERFACES on a dual-homed
machine, a bug introduced in 6.7.18 was causing Condor daemons to
sometimes incorrectly report their IP addresses, which could cause
jobs to fail to start running.
- Made the event checking in condor_dagman less strict:
added the new "allow duplicate events" value to the
DAGMAN_ALLOW_EVENTS macro (this value is part of the
default); the value 16 now also allows a terminate event before a submit
event; changed "allow all events" to "allow almost all events"
(all except "run after terminal event"), so it is more useful.
- condor_ dagman and condor_ submit_dag now report
-NoEventChecks as ignored rather than deprecated.
- Fixed a bug in the condor_ dagman -maxidle feature:
a shadow exception event now puts the corresponding job into the
idle state in condor_ dagman's internal count.
- Fixed a problem on Windows where daemons would sometimes crash
when dealing with UNC path names.
- Fixed a problem where the condor_ schedd on Windows would
incorrectly reject a job if the client provided an Owner
attribute that was correct but differed in case from the authenticated
name.
- Fixed a condor_ startd crash introduced in version 6.7.20. This
crash would appear if an execute machine was matched for preemption
but then not claimed in time by the appropriate condor_ schedd.
- Resolved an issue where the condor_ startd was unable to clean
up jobs' execute directories on Windows when the condor_ master was
started from the command line rather than as a service.
- Added more patches to Condor's DRMAA interface to make it more
compatible with Sun Grid Engine's DRMAA interface.
- Removed the unused D_UPDOWN debug level and added the
D_CONFIG debug level.
- Fixed a bug that caused condor_ q with the -l or -xml
arguments to print out duplicate attributes when using Quill.
- Fixed a bug that prevented Condor-C jobs (universe grid jobs of type condor)
from submitting correctly if QUEUE_ALL_USERS_TRUSTED is set to
True.
- Fixed a bug that could cause the condor_ negotiator to crash if the
pool contains several different versions of the condor_ schedd and in the
config file NEGOTIATOR_MATCHLIST_CACHING is set to True.
- Changed the default value for config file entry
NEGOTIATOR_MATCHLIST_CACHING from False to True. When set to
True, this will instruct the negotiator to safely cache data in order to
improve matchmaking performance.
- The condor_master now recognizes condor_quill as a valid
Condor daemon without any manual configuration on the part of site
administrators.
This simplifies the configuration changes required to enable Quill.
- Fixed a rare bug in the condor_ starter where if there was a
failure transferring job output files back to the submitting host,
it could hang indefinitely, and the job appeared as if it was
continuing to run.
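One item above credits the faster parsing of large DAGs to replacing a
linear search for DAG nodes with a hash table. The Python sketch below shows
the general technique only; it is not condor_dagman's code, and the Node
class and function names are hypothetical.

```python
class Node:
    """A stand-in for a DAG node that is identified by its name."""

    def __init__(self, name):
        self.name = name


def find_linear(nodes, name):
    """Linear search: cost grows with the number of nodes for every lookup."""
    for node in nodes:
        if node.name == name:
            return node
    return None


def build_index(nodes):
    """Hash table keyed by node name: built once, then near-constant lookups."""
    return {node.name: node for node in nodes}


if __name__ == "__main__":
    nodes = [Node("node%d" % i) for i in range(10000)]
    index = build_index(nodes)
    # Both approaches find the same object; only the cost per lookup differs.
    print(find_linear(nodes, "node9999") is index["node9999"])
```

When each of N parent/child references in a DAG file requires a lookup, the
linear version does work proportional to N squared overall, while the indexed
version stays roughly proportional to N.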
Known Bugs:
- The -completedsince option to condor_ history works
when Quill is enabled. The behavior of condor_ history
-completedsince is undefined when Quill is not
enabled.
condor-admin@cs.wisc.edu