8.4 Development Release Series 6.7

This is the development release series of Condor. The details of each version are described below.

Version 6.7.20

Release Notes:

Condor no longer supports SGI IRIX platforms. No further releases for this platform will be built or distributed.
condor_submit on Windows no longer checks that the schedd has access to the submitter's credential if invoked with the -n or -r option. It is therefore necessary to make sure ahead of time that the credential is correctly stored with condor_store_cred before doing a remote submit.
Version 1.3.2 of the Generic Connection Broker (GCB) library is now used for building Condor, and it is the 1.3.2 versions of the gcb_broker and gcb_relay_server programs that are included in this release. For more information about GCB, see section 3.7.3.

New Features:

Added a variety of built-in functions to ClassAds. Examples of new functionality include the ability to express conditionals, string operations, and regular expression matching.
Condor can now map authenticated names (e.g. an X509 subject name or Kerberos principal) to canonical Condor user names via a unified mapfile.
condor_stats and the view server are now aware of the new backfill state for machines, and record and report statistics on it.
Condor now supports running backfill jobs on Windows machines. See section 3.13.9 for more information about running backfill jobs with Condor.
Condor-C is now supported on Windows. When using Condor-C to direct a job to a Windows remote schedd, one must be careful to ensure that their credential is accessible to the remote schedd and that the NTDomain attribute in the remote job ClassAd is set correctly. In particular, if the local schedd resides in a different Windows domain from that of the remote schedd, it is necessary to include a line like the following in the submit file:
```
+remote_NTDomain = "OTHERDOMAIN"
```
Added a SUBMIT_MAX_PROCS_IN_CLUSTER configuration parameter to allow administrators to limit the number of jobs that can be submitted in a single cluster when using condor_submit. This parameter defaults to 0, which implies no limit.
Added config file parameter QUEUE_ALL_USERS_TRUSTED, which can be used to disable authorization checks to the job queue. See section 3.3.11.
condor_dagman now re-checks immediately before job submission that every node job submit file defines a log file.
condor_dagman now requires that all Stork submit files used in a DAG define a log file.
New macro functionality in job ClassAds: $$([ClassAd Expression]). The contained ClassAd expression is evaluated when the job is matched. "My" refers to attributes in the job's ClassAd, and "Target" refers to attributes in the machine ClassAd.
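As a rough illustration of the syntax, a submit description file might pass a value computed at match time on the job's command line. This is only a sketch: the expression itself is an illustrative assumption, not taken from the release notes, and it relies on the machine ClassAd advertising a Memory attribute.

```
# Evaluated when the job is matched; Target refers to the machine ClassAd.
arguments = $$([Target.Memory / 2])
```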
Condor can now run a program to obtain its configuration parameters. If a configuration filename (such as the environment variable CONDOR_CONFIG or the configuration parameter LOCAL_CONFIG_FILE) ends with a vertical bar ("|"), it is executed and its standard output is parsed for configuration parameters. If LOCAL_CONFIG_FILE is used in this way, then it can only contain a single item, and spaces in the value will be interpreted as part of the command to be executed.
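A minimal sketch of such a setting follows. The script path and its argument are hypothetical; any program that prints configuration settings on its standard output could be used in the same way.

```
# Hypothetical script; the trailing vertical bar tells Condor to execute the
# command and parse its standard output as configuration.
LOCAL_CONFIG_FILE = /usr/local/bin/site_config.sh --role execute |
```

Because spaces are taken as part of the command, the whole value up to the vertical bar is treated as a single command line rather than as a list of files.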
Added the condor_dagman configuration parameter DAGMAN_PROHIBIT_MULTI_JOBS, which prohibits condor_dagman from running a DAG that references node job submit files that queue multiple jobs (other than parallel universe).
A number of types of failures to run a job now result in the job going on hold, rather than immediately being returned to the idle state to be tried again. This currently does not apply to standard universe jobs. The types of errors that now result in the job going on hold are failure to execute the specified program, failure to transfer files, failure to open input or output files, and failure to access the job's initial working directory. In all such cases, a specific hold reason is specified in the job ClassAd, along with a numeric hold code and subcode. If you wish to automatically retry in such cases (the old behavior), then you can specify a PeriodicRelease expression that checks for specific hold states.
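As a hedged sketch of the retry approach, a submit description file could release a held job when its hold code matches a chosen failure type. The numeric value 13 below is a placeholder assumption, not a value confirmed by these notes; check the hold reason codes documented for your Condor version before relying on it.

```
# Release the job automatically when it was held for one specific reason.
# 13 is a placeholder hold code used for illustration only.
periodic_release = (HoldReasonCode == 13)
```

An expression like this retries indefinitely, so in practice it may be worth combining it with a limit of your own choosing.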
Strong authentication using SSL is now available for web-service clients using the SOAP (BirdBath) interface commands. The Condor daemons can communicate via HTTPS on a specified port, and clients must present a client-side SSL certificate.
Previously, the condor_schedd would communicate with only one web-service client at a time. This restriction has now been removed; multiple simultaneous transactions to the schedd via the SOAP (BirdBath) interface are now supported.
condor_submit will now issue a warning if the user / job log is on an NFS mounted file system.
When a job terminates or is removed and its working directory is on an NFS mounted file system, the condor_schedd creates and removes a file in the working directory to force the NFS client to sync with the NFS server and see any files written by the job.
Non-blocking connect operations are now used in two cases: sending ClassAd updates from Condor daemons to the collector and sending match information from the negotiator to the startd. Both of these operations are UDP-based (unless you enable TCP updates to the collector), so non-blocking connects would not be an issue, except that TCP connections are required whenever it is necessary to establish a new security session. An example where the new non-blocking behavior is helpful is when a machine is down and TCP connections to it time out. Daemons that try to connect to it using non-blocking connections will no longer stop everything they are doing for the full duration of the timeout.
This feature may result in a greater number of sockets being open at one time than previously (especially in the negotiator). There is not yet support for placing a limit on the number of simultaneous connection attempts. Therefore, if you need to turn off the use of non-blocking connects, you may do so with the following configuration settings:
NONBLOCKING_COLLECTOR_UPDATE = False
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT = False
condor_quill now includes an additional "schema version" table. If the database was created prior to 6.7.20, the new table is automatically added by the 6.7.20 Quill daemon.

Bugs Fixed:

Fixed a bug introduced in 6.7.19 which prevented MPI jobs from running in some situations.
Fixed a bug in the Dedicated scheduler where parallel jobs with job leases would get restarted from scratch if the collector had crashed when the schedd restarted.
Fixed a rare bug where, after a job had been marked terminated in the user's job log, it could be run again and another execute event could be written to the user's job log after the terminate event. This caused problems with DAGMan, which depends on this log being correct. This fix is applicable to the vanilla, standard, parallel, mpi, and java universes.
Fixed several bugs in Quill that would cause it to crash when very long values for job ClassAd attributes were in the queue. This typically happened with jobs submitted with "getenv = true" in the submit file, and very large environments.
Fixed a bug in ClassAds where an attribute name which was constructed from a keyword with one or more digits appended would cause parse errors. Discovered with T1 and F1 (T and F are short forms of true and false).
Fixed a bug introduced in 6.7.19, which could cause the startd to get into a state where it would reject all new claims (requiring a restart of the startd to resume normal operation). The symptom of this bug was a recurring message in the startd such as this:
```
5/19 23:51:47 vm1: ClaimId from schedd (<xxx.xxx.xxx.xxx:45877>#1147893880#1208) doesn't match (<xxx.xxx.xxx.xxx:45877>#1147893880#1207)
5/19 23:51:47 vm1: State change: claiming protocol failed
```
Fixed a bug introduced in 6.7.19 which could cause the startd to abort (signal 6 under Unix) when a match timed out for a claim preempting the existing preempting claim.
Fixed a bug which could cause condor_quill to perform an illegal memory access and potential segmentation fault.
When a job is matched to a remote machine through flocking, the remote machine is given WRITE and DAEMON permissions on the submit machine. This functionality previously worked, but broke at some point in the 6.7 series.
Fixed a bug which could cause condor_status (and probably other tools and possibly even daemons) to crash on Solaris machines if address resolution fails because of an ill-configured DNS server.
Fixed a bug on Unix which could cause the Condor daemons to mistakenly think a child process was successfully created (when the fork() system call returned -1).
Fixed a bug which could cause network write operations to block Condor daemons for the full networking timeout time when the connection was closed while the daemon was still writing to it.
Streaming standard input, output, and error has been fixed for Windows jobs.
Grid-type gt4 jobs will now go on hold if Condor can't delegate their X509 proxies to the remote Delegation Service.
The sample configuration file condor_config.local.credd contained a typo which has now been fixed.
Fixed a bug which caused the checkpoint server (condor_ckpt_server) to publish an incorrect IP address in the Machine attribute of its ClassAd.
Fixed some rare bugs when the Condor claiming protocol fails while a machine is running a backfill job. The condor_startd now correctly recovers from these failures.
Fixed a bug in Condor's file-transfer mode that could cause file transfer errors to go unnoticed by one side of the connection. One possible result of this would be a job leaving the queue "successfully" when there was actually an error copying back one of the output files. This bug was much less likely to occur with large files (greater than 65536 bytes).
DAG variable names beginning with "queue" can break DAG node job submits; condor_dagman now checks for such variable names and fails with an explicit error message if there are any. This bug has probably existed for a long time, but was just recently discovered.
condor_off -peaceful was still resulting in a shutdown timeout after GRACEFUL_SHUTDOWN_TIMEOUT, which would then cause jobs to be preempted.
If PeriodicHold or PeriodicRemove triggered for a held or idle job, the hold or abort event would not be written to the user log if XML logging was enabled.
Vanilla universe jobs were failing to run if the executable was specified as a relative path and transfer_executable was set to false.
Many user log events for grid universe jobs were not written in XML format when log_xml was set to true.
Gridftp server jobs automatically submitted to handle file transfers for grid-type gt4 jobs now properly leave the queue when they enter the REMOVED state.
Fixed a bug in the event checking code used by condor_dagman and condor_check_userlogs which caused errors to be reported when they should not have been for parallel universe jobs. Also fixed a bug in condor_check_userlogs that caused it to sometimes not report an error when it should have.
PBS and LSF grid universe jobs now fully handle the new job attributes Arguments and Environment. Previously, jobs would be put on hold if the values couldn't be converted to the old representation used in attributes Args and Env.
A bug that was introduced in version 6.7.18 in which Condor would fail to send e-mail when trying to send to multiple recipients has been fixed.
Fixed a bug in condor_dagman that caused it to abort without generating a rescue DAG file if a node job user log somehow contained a bad event type number.
When the EXECUTE_LOGIN_IS_DEDICATED config file option is set to True, the condor_starter would occasionally crash. This bug has been fixed.
When the LOCAL_CONFIG_DIR config file option was set to an invalid path, the daemons and command line tools would segfault. This has been fixed.
Fixed the -dag option to condor_q. Ever since version 6.7.7, this option did not print DAG node names correctly if there were multiple DAGMan jobs submitted.
Eliminated a bug where a malformed ClassAd attribute could make the schedd's on-disk job queue unreadable.
Fixed a number of cases where a large ClassAd value could crash Condor.

Known Bugs:

None.

Version 6.7.19

Release Notes:

A major security hole has been fixed in the checkpoint server. The hole allows arbitrary files owned by the condor UID, or the UID of a Personal Condor running a checkpoint server, to be read and written. Users who cannot upgrade to this release of Condor are urged to replace the condor_ckpt_server binary with the version in this release. This applies to all versions of Condor, including the 6.6 series.
To replace only the condor_ckpt_server binary:
1. Download the 6.7.19 condor_ckpt_server binary from the contrib page at http://www.cs.wisc.edu/condor/downloads/contrib.license.html.
2. Rename the binary, and place it in the $(SBIN) directory.
3. Turn off the checkpoint server that is currently running, by turning off Condor on the machine where it is running.
```
    condor_off
```
4. Change the configuration variable that specifies the path and name of the checkpoint server to the renamed 6.7.19 condor_ckpt_server binary. For example:
```
    CKPT_SERVER = $(SBIN)/ckpt_server.v6719
```
5. Reconfigure, and turn Condor back on.
```
    condor_reconfig -master 
    condor_on
```
6. Check the checkpoint server log to verify that the correct 6.7.19 condor_ckpt_server binary is running.
Condor is no longer available for HPUX 10.20 on Hewlett Packard PA-RISC or Red Hat Linux on ALPHA.
The globus universe and the following commands that have been used in submit description files are retired. Please remove them from submit description files: globusscheduler, grid_type, jobmanager_type, nordugrid_resource, remote_pool, remote_schedd, unicore_u_site, and unicore_v_site. For the newer syntax of grid universe jobs, please see section 5.3 on the grid universe, as well as the condor_submit manual page.
The utilities condor_store_cred, condor_get_cred, condor_list_cred, and condor_rm_cred for dealing with credentials for Stork on UNIX have been renamed to stork_store_cred, stork_get_cred, stork_list_cred, and stork_rm_cred, respectively. This is because the Windows condor_store_cred tool, which can be used to set the shared secret for the password authentication method, is now present on all platforms.

New Features:

A condor_schedd can now submit jobs directly to a local PBS or LSF installation. To do this, submit the job with a universe of grid and a grid_resource of pbs or lsf.
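A submit description file for such a job might look roughly like the following sketch. The executable and file names are hypothetical; only the universe and grid_resource settings come from the description above.

```
# Send the job to the local PBS installation (use lsf for LSF).
universe      = grid
grid_resource = pbs
executable    = my_job.sh
output        = my_job.out
error         = my_job.err
log           = my_job.log
queue
```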
Implemented all of the functions from new ClassAds in the "old" ClassAds now in Condor.
Condor's format for storing the history file has been improved so that some queries will now go much faster. In particular, condor_history now accepts the -backwards option, which will take advantage of this change. Queries that only reference the job's cluster id and proc id will be able to take advantage of this speed increase, and in the near future, more fast queries will be supported. You need to make no changes in order to deal with this new history file format, unless you want to be able to search your entire history file backwards, in which case you should run the new condor_convert_history program.
Condor can now delegate a job's GSI X509 credentials when transferring them over the wire, instead of copying them. This is much more secure when communications are not encrypted. As this can be a major performance hit when submitting large numbers of jobs remotely, the old behavior can be forced by setting DELEGATE_JOB_GSI_CREDENTIALS to False in the configuration file.
Added configuration parameter NO_DNS, which allows Condor to work on machines with no DNS. When this option is set to True, Condor will use pseudo-hostnames constructed from a machine's IP address and DEFAULT_DOMAIN_NAME, rather than attempting to resolve hostnames into IP addresses and vice-versa.
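A minimal configuration sketch for a pool without DNS is shown here. The domain name is a placeholder chosen for illustration.

```
# Build pseudo-hostnames from IP addresses instead of using DNS lookups.
NO_DNS = True
# Placeholder domain appended when constructing the pseudo-hostnames.
DEFAULT_DOMAIN_NAME = example.org
```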
The JobLeaseDuration now defaults to 20 minutes for all jobs that support this feature (everything except standard and PVM universe, and jobs that request streaming I/O). This way, by default, if the submit host crashes or there is a short network outage, the condor_schedd will be able to reconnect to jobs that were executing at the time of the problem.
Condor daemons now touch their daemon log file periodically. When a daemon starts up, it prints to the log the last time the log file was modified. This lets an admin estimate when a daemon stopped running. The configuration parameter TOUCH_LOG_INTERVAL sets the time between touches (in seconds) and defaults to 60 seconds.
Added the ability to pass a specific condor_config_val program to the cron/Hawkeye "modules". If "HAWKEYE_CONFIG_VAL" is specified in the configuration, an environment variable with the same name and the same value will be added to all cron job environments. This change has no effect if the above macro is not specified in the configuration. The above name "HAWKEYE_CONFIG_VAL" is derived from the cron name (i.e. STARTD_CRON_NAME or SCHEDD_CRON_NAME).
condor_submit_dag now generates a submit file with copy_to_spool set to false. This reduces the load and saves file space on the submit machine, especially if you are running multiple instances of condor_dagman.
The authorization levels in Condor's security system now form a hierarchy. A client with DAEMON or ADMINISTRATOR access also has WRITE access. A client with WRITE, NEGOTIATOR, or CONFIG access also has READ access.
Added configuration parameter GRIDMANAGER_EMPTY_RESOURCE_DELAY, which sets how long the condor_gridmanager retains information about a grid resource after it has no active jobs to that resource.
Added configuration parameter JOB_PROXY_OVERRIDE_FILE, which lets an admin force a particular X509 proxy to be used for all grid universe jobs, overriding whatever proxy may be specified in the job ad.
condor_dagman no longer uses the popen() system call when running commands; this provides better security and allows it to run on Windows without being a service.
Added a new version of DRMAA which includes fixes and updates per DRMAA spec finalization.
For grid-type gt4 jobs, the resource lifetime on the remote server will be based on job_lease_duration, if it's set.
Improved the error message in condor_dagman for pclose() failures after submitting a node job.
Added new job attribute GridResourceUnavailableTime, which is equivalent to GlobusResourceUnavailableTime, but is used for all grid universe jobs. One benefit of this new attribute is that grid resource up/down user log events are logged correctly when the gridmanager crashes and restarts.
Added the ability to set GROUP_AUTOREGROUP on a per-group basis, using the syntax GROUP_AUTOREGROUP_<groupname> = True/False.
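For example, a configuration along the following lines would enable autoregroup for a single group. The group names are hypothetical, and the sketch assumes that the per-group setting takes precedence over the pool-wide GROUP_AUTOREGROUP value.

```
# Hypothetical accounting groups.
GROUP_NAMES = group_physics, group_chemistry
# Pool-wide default.
GROUP_AUTOREGROUP = False
# Per-group setting for group_physics only.
GROUP_AUTOREGROUP_group_physics = True
```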
Added configuration variable SYSAPI_GET_LOADAVG to control if Condor should attempt to fetch the system load average. See section 3.3.3.
Added configuration variable SCHEDD_ROUND_ATTR_<xxxx>. See the description in section 3.3.11.
The password authentication method now works on all platforms. It was previously only available on Windows. UNIX platforms will store the pool password in the file defined by the configuration parameter SEC_PASSWORD_FILE. This file will be owned by the real UID that Condor runs as and will only be accessible by that user.
A new tool, condor_userlog_job_counter, has been added. Given a userlog file as an argument, it determines the number of queued (e.g., submitted but not yet terminated or aborted) jobs recorded in that userlog, and returns that value as an exit code. It returns 255 if there are more than 254 queued jobs or to indicate an error (e.g., a userlog reading/parsing error, no events found, a job count <0 or >254, improper usage, etc.).
The condor_chirp tool has been added to the Windows distribution.
The environment variable X509_USER_PROXY is now set to the full path of the proxy for scheduler universe jobs if a proxy is associated with the job.

Bugs Fixed:

Fixed a bug where condor_q would exit with a non-zero exit status even though it found and displayed the requested information or job queue.
Fixed a bug in the dedicated scheduler where parallel and mpi jobs with more than one proc in a cluster would only have the Scheduler attribute set in the first proc.
Fixed a bug in the condor_collector that could cause it to crash if it's configured as a view collector (i.e. KEEP_POOL_HISTORY is TRUE). In particular, machine ads with a State value of "Backfill" could trigger this crash.
Fixed a related bug in condor_stats that could cause a crash when encountering a machine state of "Backfill".
Disconnected starter-shadow connections (job leases) now work for flocked jobs.
Fixed numeric value wrap-around bug for the totals in condor_ status.
Grid universe jobs sent to Globus Toolkit 2 resources now generate an evict user log event when the job transitions from Running to Idle, along with another execute event when the job restarts. Previously no events were logged in these cases, leading to the potentially confusing situation where a job would be Idle in the queue, but the last job log entry would indicate that the job was Running.
All grid universe jobs now properly handle the new job attributes Arguments, Environment, and TransferOutputRemaps.
In some cases, a restart of Condor was required to properly handle a change in the undocumented configuration parameter SIGNIFICANT_ATTRIBUTES. Now, a condor_reconfig is sufficient.
Fixed a permissions problem that would cause automatic X509 proxy renewal for vanilla universe jobs to fail.
Fixed a bug introduced in 6.7.17 that caused the configuration parameter ENABLE_GRID_MONITOR to be ignored. The value would always be considered true.
Improved fault recovery of gt2 grid jobs. This includes a work-around for Globus bugzilla ticket 871.
When the condor_gridmanager cancels a job after GlobusResubmit evaluates to true, it will no longer put the job on hold if the cancel fails.
Fixed the default COLLECTOR_QUERY_WORKERS entry in the example central manager config_config; due to a cut and paste error it was COLLECTOR_CLASS_HISTORY_SIZE.
In some cases, CondorLoadAvg was reporting a different result, depending on the setting of NUM_CPUS, even with everything else, such as the actual number of cpus, being the same. The specific case in which this effect was noticeable was when the machine load was greater than NUM_CPUS. CondorLoadAvg is now independent of the setting of NUM_CPUS.
When the Grid Monitor encounters problems, Condor will now try to restart the Globus JobManagers for the affected grid universe jobs, limited by GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE. The previous behavior caused problems with sites that don't have a fork JobManager, and Condor wouldn't react when a job's proxy expired.
Fixed a bug that could cause extra Grid Monitor files to accumulate under /tmp until the condor_gridmanager exited.
Fixed a problem in which preempting claims waiting on retiring jobs (i.e. waiting on MaxJobRetirementTime) could get preempted without sufficient rank or priority (because the new preemption only had to beat the retiring job, not the preempting claim). Furthermore, both the new preempting claim and the original preempting claim had the same claim id, so they collided in a way that ultimately caused both to be removed, and the respective jobs would go back into unmatched state. The result was unnecessary negotiation churn and slower convergence of resource usage to the desired distribution. Now, preemption of preempting claims during long job retirement is correctly handled.
Fixed a bug that caused the shadow to transfer a job's files twice to the starter if the files were stored in Condor's spool directory.
On some systems, when Condor starts a gridftp server for gt4 grid jobs, all transfers to or from the server will fail if it's not told where its executable is located (using '-exec' on the command line). Condor now gives this option to the gridftp server.
Fixed a bug in condor_dagman that could cause DAGMan to crash if all submits fail for several nodes that have POST scripts. This bug existed in versions 6.7.17 and 6.7.18.
Fixed a bug in how Condor determines the version of a Condor executable. This was preventing the grid universe from working on Tru64 5.1 on Alpha.
Fixed a bug in the condor_gridmanager that could cause gt2 grid jobs with an invalid proxy to become stuck (the condor_gridmanager would do nothing with the jobs and not acknowledge a hold or removal).
Fixed a bug on Win32 that caused a failure when sending a WM_CLOSE message to a job when the Condor daemons are running as a normal user (i.e. not running as LocalSystem). Also, fixed a thread handle leak when sending a WM_CLOSE message.
On Win32, fixed a bug that would cause the condor_master to exit upon a condor_restart command when started as a service with a service name other than "condor" (the default name used by the installer).
Fixed a bug that could cause the condor_master to crash when sending a shutdown fast to a child process after the SHUTDOWN_GRACEFUL_TIMEOUT timeout expired.
Fixed a bug with automatically setting the undocumented SIGNIFICANT_ATTRIBUTES configuration parameter in order to speed up negotiation-- previously, the job's Requirements expression was not correctly considered. With certain scheduling policy expressions, this bug could have resulted in jobs staying idle in the queue when they should have been launched.
It used to be impossible to use the SUBMIT_EXPRS configuration setting to provide default values for job submit file keywords that were recognized by condor_submit. For example, administrators could define a default value for a custom job attribute, but not something like Notification or WantRemoteIO. Now, administrators can use SUBMIT_EXPRS for any settings, whether they are regular condor_submit keywords or custom job attributes.
Fixed a bug in the condor_startd that could cause resources to get stuck in the Backfill/Killing state if both the START_BACKFILL and EVICT_BACKFILL expressions evaluated to TRUE at the same time.

Known Bugs:

None.

Version 6.7.18

Release Notes:

A security team at UW-Madison is conducting an ongoing security audit of the Condor system and has identified a few important vulnerabilities. Condor versions 6.6.11 and 6.7.18 fix these security problems and other bugs. There have been no reported exploits, but all sites are urged to upgrade immediately.
The Condor Team will publish detailed reports of these vulnerabilities on 2006-04-24, 4 weeks from the date when the fixes were first released (2006-03-27). This will allow all sites time to upgrade before enough information to exploit these bugs is widely available.
The -flock option in condor_cold_start and condor_cold_stop has been replaced by -filelock to avoid any confusion between file locking and Condor job flocking.
As of 6.7.17, Quill's database schema has been slightly altered. For more information, please see the corresponding 6.7.17 version history entry in section 8.4.

Security Bugs Fixed:

Bugs in previous versions of Condor could allow any user who can submit jobs on a machine to gain access to the "condor" account (or whatever non-privileged user the Condor daemons are running as). This bug cannot be exploited remotely, only by users already logged onto a submit machine in the Condor pool.
The security of the "condor_config_val -set" feature was found to be insufficient, so this feature is now disabled by default. There are new configuration settings to enable this feature in a secure manner. Please read the descriptions of ENABLE_RUNTIME_CONFIG, ENABLE_PERSISTENT_CONFIG and PERSISTENT_CONFIG_DIR in the example configuration file shipped with the latest Condor releases, or in section 3.3.5.
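A hedged sketch of re-enabling the feature is shown below. The boolean values and the directory path are assumptions made for illustration; the descriptions referenced above remain the authority on what each setting requires, including how the directory should be protected.

```
# Allow settings changed at run time with condor_config_val.
ENABLE_RUNTIME_CONFIG = True
# Allow persistent settings, stored in the directory named below.
ENABLE_PERSISTENT_CONFIG = True
# Hypothetical path; access to it should be tightly restricted.
PERSISTENT_CONFIG_DIR = /var/condor/persistent_config
```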

New Features:

Added a new LOCAL_CONFIG_DIR configuration setting. This now allows entire directories of files to be included as though they were configuration files. See 3.3.3 for more info.
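A one-line sketch of the setting follows; the directory path is a hypothetical example.

```
# Every file in this directory is read as a configuration file.
LOCAL_CONFIG_DIR = /etc/condor/config.d
```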
You can now put extra information into the notification email. The information is a list of attributes, which you provide. For example, if your submit file contains the line +EmailAttributes = "RemoteHost, Requirements", then RemoteHost and Requirements will be listed in the notification email.
Added a new clipped port of Condor to HP-UX 11 running on the HP-PA architecture.
Condor is now much better at recognizing when a grid-type gt2 grid universe job failure is unrecoverable and at cleaning up failed or canceled job submissions. This should reduce the number of jobs that perpetually return to held state when released.
When job attribute GlobusResubmit evaluates to true for grid-type gt2 jobs, the condor_gridmanager will try to cancel the existing job before starting the new submission. If the cancel attempt fails, the condor_gridmanager will proceed with the new submission anyway.
When BIND_ALL_INTERFACES is enabled, Condor daemons now advertise their IP address as that of the network interface used to contact the collector. This makes it possible, for example, to have a schedd on a multi-homed machine flock jobs to Condor pools in two separate networks, because the schedd can advertise a different IP address to the two collectors. condor_cod also benefits in the case where the startd is reachable through a network interface other than the default one that would normally be advertised. This change also produces improved default behavior in cases such as condor_glidein where the startd lands on a dual-homed machine with both public and private IP addresses.
In condor_dagman, the informational messages about hitting the -maxidle, -maxjobs, -maxpre, and -maxpost limits are no longer printed to the dagman.out file by default. To see these messages, add -debug 4 to the condor_submit_dag command line. A summary of the total number of job and script deferrals is now printed by default each time the node status is printed and at the end of the dagman.out file. This can be turned off by setting the debug level to 2 or lower on the condor_submit_dag command line.
Added support for a new configuration setting, STARTD_RESOURCE_PREFIX. For more information, see section 3.3.10.
The number of CPUs Condor detects may now have an upper bound. The MAX_NUM_CPUS configuration setting controls this.
When preempting a claim, the condor_negotiator now prints the startd rank of the job that is being preempted and the startd rank of the job that is causing the preemption.
Improved the error messages from condor_check_userlogs, especially if it fails because it doesn't have write permission on the log files (unfortunately, the log reading code requires write locks to avoid collisions between multiple readers and writers).
Improved error messages in condor_dagman when getcwd() fails (this is only relevant if the -UseDagDir flag is used).
Added QUILL_MANAGE_VACUUM to determine whether Quill needs to perform vacuuming tasks or not. In the latter case, vacuuming tasks can be automatically managed by PostgreSQL version 8.1 onwards. Please see Quill's section in the Administrator's Manual for more details.
The Grid Monitor now works at sites where /etc/grid-security/certificates is out of date, but $(GLOBUS_LOCATION)/share/certificates is not.
A new authentication method, PASSWORD, has been added; it provides mutual authentication between a client and server using a shared secret. Password authentication currently only works on Windows, and only for daemon-to-daemon communication.
Condor on Windows now supports running jobs as the submitting user. This feature requires the use of a central daemon for storing users' passwords (the condor_credd). See the example configuration file condor_config.local.credd included with the Condor distribution for more information.
Added a -n option to condor_store_cred to allow for storing a password to a remote host.
Support for DRMAA on Windows has been added.
Kerberos support has been upgraded to use version 1.4.3 of the Kerberos library. This adds support for Kerberos as an authentication method on Windows.
Added the new condor_ replication daemon, which works with condor_ had to enable replication of data for daemons configured for high availability. In particular, condor_ replication can be configured to replicate the accountant log so that a fail-over condor_ negotiator can share the user priority state from the primary condor_ negotiator.
The condor_ collector now has the ability to receive ClassAds via its SOAP interface.

Bugs Fixed:

Fixed a memory corruption bug in condor_ quill where it could miscalculate the hostname of the db server to which it connects.
Fixed a security hole in condor_ quill where the daemon would emit the quillwriter user's password in cleartext into the condor_ quill logfile.
Fixed a bug in 6.7.17 that could cause the schedd state to be wiped out, clearing the contents of the job queue. The most likely case in which this problem could have happened is when the disk containing the spool directory became full and the schedd restarted several times due to failures writing to job_queue.log. The problem no longer exists in 6.7.18, but for users who cannot upgrade immediately, the workaround to prevent the bug from ever happening is to add the following line to the configuration file:
```
MAX_JOB_QUEUE_LOG_ROTATIONS = 0
```
Fixed a bug that could have caused corruption of the job queue log file in very rare circumstances involving a full disk. This potential problem existed in all previous versions of Condor.
Fixed a problem with parallel universe jobs with multiple procs (i.e. multiple queue statements in one submit file). Before, in such a case, the user log would have multiple submit events per cluster but only one terminate event. This caused confusion for dagman. Now there is one submit event and one terminate event for such parallel universe jobs.
A ClassAd bug that has existed since the Condor 6.3 series has been fixed, and it might affect your pool. In a ClassAd, MY and TARGET are supposed to narrow the scope for looking up a ClassAd variable. For instance, in a job's requirements, MY refers to the job's attributes, and TARGET refers to the machine's attributes. Unfortunately, since 6.3, MY and TARGET actually defined a search order, not a scope restriction. That is, if a job's requirements had TARGET.foo and foo was undefined in the machine ad, Condor would look in the job ad for the value instead of deciding that foo was undefined. This is now fixed. However, there is a chance that users have written ClassAd expressions that confused MY and TARGET but worked. With this bug fix, they might not work anymore. We expect this bug fix to affect few users, but it may be tricky to understand for those whom it does affect. We needed to make this bug fix because the bug caused problems for some users that could not be worked around.
Fixed a vague error message in the standard universe. It now reports the reason for a failure when reading a checkpoint image.
Fixed a bug which was causing erroneous load average numbers for the AIX port of Condor.
Fixed a bug which caused the update proxy command to the condor_ schedd to fail if the job was running and Condor was started as root.
Fixed a bug which was causing jobs to never leave the "run" state if the condor_ schedd's cron/hawkeye feature is enabled. This bug was introduced with the addition of the cron logic to the condor_ schedd in 6.7.8.
Improved how the condor_ gridmanager reacts to proxy delegation commands failing for grid-type gt4 jobs. Before, it could end up retrying the commands every couple of seconds. Now, it retries them every 5 minutes.
Fixed Condor's code for automatically starting a gridftp server for grid-type gt4 jobs to work when Condor is started as root.
Scheduler universe jobs no longer inherit the environment of the condor_ schedd.
Fixed a bug causing jobs to fail to run when submitted from a 6.7.15+ condor_ schedd to an older condor_ starter. This problem only affected jobs with no argument specification in the submit file or jobs with arguments specified in the new syntax (surrounded by double quotes).
The condor_ c-gahp now ensures that arguments and environment in the job ClassAd are converted to a syntax understood by the target schedd. Previously (starting with 6.7.15), jobs with empty arguments/environment, or jobs using the new syntax for these would fail to run when submitted as Condor-C jobs targeting a pre 6.7.15 schedd.
When running grid-type gt4 jobs with an automatically-started gridftp server, a restart of Condor could cause all of the gt4 jobs to be canceled and resubmitted due to the gridftp server's port changing. Now, the old port will be reused when possible.
Fixed a bug that could cause the condor_ gridmanager to crash when running grid-type gt4 jobs with an automatically-started gridftp server.
Fixed a bug that caused the condor_ c-gahp to exit if a file transfer failed.
When a grid-type condor job is removed, any active file transfer for the job is aborted. Previously, the transfer would be allowed to complete before the job was canceled.
Fixed a bug where FileLock.pm was not included in the release.
Clusters of jobs using transfer-file mode and output or error files containing path information and references to $(Process) or $(Cluster) were incorrectly storing the output files in the initial working directory rather than the specified path. This happened for all jobs in the cluster except for the first job (process 0). This bug was introduced in 6.7.13.
Previously, setting the default job environment within SUBMIT_EXPRS did not work, because condor_ submit would always override this default with an empty environment.
Since 6.7.15, Condor-C has incorrectly handled the use of both remote_env and remote_args. The normal environment and arguments commands were honored, but the 'remote' versions were ignored unless the corresponding 'normal' command happened to be set to a double-quoted value (i.e. the new syntax for these settings). Now that this problem is fixed, when remote_env or remote_args is specified, it correctly sets the respective property of the job ClassAd in the remote schedd. It is still preferable to use the environment and arguments commands instead of setting the remote attributes directly, because then you can use either the new double-quoted environment/argument syntax or the old one, and condor_ submit will automatically set the correct ClassAd attributes.
Fixed a bug in the -l option to condor_ q which, when querying Quill, used to display the attributes of the last cluster in every job ad even though they were submitted as part of different clusters.
Fixed a bug in the Quill daemon which used to incorrectly parse ClassAd attribute values of the form "number and some stuff" (e.g. attribute=Rank and value=1000 * memory).
Added HAD to the default DC_DAEMON_LIST, both in the default condor_config and in the default list hard coded into condor_ master.
Java universe jobs submitted with the old-style arguments syntax (argument string not surrounded by double quotes in the submit file) would fail to run (and therefore stay in the schedd job queue) if the path to Condor's execute directory contained a space (e.g. "Program Files"). This bug was introduced in 6.7.15.
Fixed a problem introduced in 6.7.17 where repeated "ProcAPI sanity failure" entries would appear in daemon logs on Windows.
The -submitter option to condor_ q has been fixed to handle submitters in accounting groups.
The condor_ shadow now correctly handles the case where RESERVED_SWAP is set to 0.
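
The difference between the old search-order behavior of MY and TARGET and the corrected scope restriction, described in the ClassAd fix above, can be sketched as follows. This is a minimal Python model that assumes ads are plain dictionaries. The function names, the UNDEFINED sentinel, and the sample attributes are hypothetical and are not part of Condor.

```python
# Illustrative model only: ClassAds are represented as plain dicts and
# UNDEFINED as a sentinel. This is not Condor's ClassAd implementation.
UNDEFINED = object()


def lookup_search_order(scope, name, my_ad, target_ad):
    """Old (buggy) behavior: the prefix only picks which ad is searched first."""
    first, second = (my_ad, target_ad) if scope == "MY" else (target_ad, my_ad)
    if name in first:
        return first[name]
    # Falls through to the other ad instead of giving up.
    return second.get(name, UNDEFINED)


def lookup_scoped(scope, name, my_ad, target_ad):
    """Fixed behavior: the prefix restricts the lookup to exactly one ad."""
    ad = my_ad if scope == "MY" else target_ad
    return ad.get(name, UNDEFINED)


job_ad = {"foo": 1}            # hypothetical job attributes
machine_ad = {"Memory": 512}   # hypothetical machine attributes, no "foo"

# TARGET.foo evaluated from the job's point of view.
old = lookup_search_order("TARGET", "foo", job_ad, machine_ad)
new = lookup_scoped("TARGET", "foo", job_ad, machine_ad)
```

In this model, `old` silently picks up the job's own value of foo, while `new` is UNDEFINED. An expression that relied on the fall-through is the kind that may stop working after the fix.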

Known Bugs:

None.

Version 6.7.17

Release Notes:

The default output for condor_ status was changed between 6.7.16 and 6.7.17 to support the new Backfill state which Condor resources can now enter (described below in more detail).
Added two new columns to the Quill database schema to support historical job queue logs (see MAX_JOB_QUEUE_LOG_ROTATIONS in the New Features section below). These are log_seq_num and log_creation_time. For a description of those two columns, see the schema of the JobQueuePollingInfo table in section 3.12.3.
Databases created by versions of Quill prior to 6.7.17 must be updated to reflect these two new columns. This can be achieved by either dropping the database and letting Quill recreate it on the next polling cycle, or by manually adding the two columns and initializing their values via the following sql commands:
```
	alter table jobqueuepollinginfo add column log_seq_num bigint;
	alter table jobqueuepollinginfo add column log_creation_time bigint;
	update jobqueuepollinginfo set log_seq_num = 0, log_creation_time=0;
```
If the schema is changed manually, this must be done before the condor_ quill daemon is started.

New Features:

Added support for Condor resources to perform backfill computations when there are no Condor jobs to run. Condor can be configured such that whenever a machine is in the Unclaimed/Idle state and otherwise has nothing else to do, the condor_ startd will automatically spawn backfill jobs to continue to perform useful work. Currently, Condor only supports using the Berkeley Open Infrastructure for Network Computing (BOINC) to provide the backfill jobs (see http://boinc.berkeley.edu for more information about BOINC). See section 3.13.9 for more information about running backfill jobs with Condor. At this time, backfill jobs are not supported on Windows machines.
The history file, which is a flat file for each submitting computer that stores information about all jobs completed on that computer, is now rotated automatically. By default, the file will be rotated when it is more than 20MB and two backup files will be allowed (for a total of three history files with 60MB of data). This means that older history will be lost once it is rotated out. You can disable the history file rotation if you like, and you can change the number and size of the backup files. condor_ history has been updated to understand these backup history files.
Added parallel universe support to condor_ dagman (condor_ dagman can now handle submit files that submit more than one Condor job proc).
Added a -format option to the condor_ history command which behaves just like the -format option to condor_ status and condor_ q commands.
Added remove and get_job_attr options to the condor_ chirp command line tool. Changed parallel universe script to use them.
When the Grid Monitor encounters problems, Condor no longer tries to restart the Globus JobManagers for all of the affected grid universe jobs. Restarting the JobManagers can easily bring down a remote headnode. Condor will attempt to restart the Grid Monitor, but there will be no update of job status in the mean time.
When started as root on a Linux 32-bit x86 machine, Condor daemons will leave core files in the log directory when they crash. Recent changes to the Linux kernel default to blocking these core files. This change means Condor behaves more consistently across different Unix-like operating systems.
Made several changes to make Condor-G much less likely to overload a pre-WS GRAM server for grid-type gt2 jobs. Added configuration parameter GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE, which limits the number of globus-job-manager processes Condor will let run on the server at a time. Streaming of output for gt2 jobs is disabled if GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE isn't set to unlimited. If the Grid Monitor encounters problems, the condor_ gridmanager doesn't restart the globus-job-managers of the affected jobs. Fixed a couple of bugs in the Grid Monitor that could cause it to spawn extra polling processes on the server.
Added support for Parallel scheduling groups for the parallel universe. This is useful if you have machines connected by InfiniBand switches, and want to constrain your parallel jobs to never run across two different switches.
Added a new suite of tools to dynamically deploy Condor. The most important of these tools are condor_ cold_start and condor_ cold_stop. Another significant subset of this suite consists of tools that determine whether a process is alive or dead, the most advanced of which are uniq_pid_midwife and uniq_pid_undertaker. Currently these programs are only supported on Linux.
Added MAX_JOB_QUEUE_LOG_ROTATIONS to control how many historical job queue logs are kept when the job queue log is rotated. These historical logs are used by Quill to avoid missing information in Quill's job history information when the schedd rotates to a new log. The default value for this configuration setting is 1, so one old copy of the job queue log file will be kept.
Added support for DRMAA on the Mac OSX platform.
Enabled COLLECTOR_QUERY_WORKERS in the default condor_ collector configuration, and set this value to 16. This replaces the previous implicit default of 0 and will result in a more responsive condor_ collector in the common case. Note that COLLECTOR_QUERY_WORKERS has no effect on non-UNIX systems (Windows).
HIGHPORT and LOWPORT can now specify ports below 1024 when Condor is started as root on Unix systems. This always worked on Windows.
It is now possible to specify separate port ranges for binding incoming (listen) sockets and outgoing (connect) sockets by using IN_LOWPORT/IN_HIGHPORT and OUT_LOWPORT/OUT_HIGHPORT. If these are not present, Condor still falls back to the regular LOWPORT/HIGHPORT settings.
Port ranges from LOWPORT, HIGHPORT, IN_LOWPORT, IN_HIGHPORT, OUT_LOWPORT, and OUT_HIGHPORT are now passed to Globus through the correct environment variables.
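
The fallback between the direction-specific port settings and the regular LOWPORT/HIGHPORT pair can be sketched as below. This is only an illustration in Python: the configuration is modeled as a plain dictionary, the function name is hypothetical, and the sketch assumes that a direction-specific range is used only when both of its ends are set. How Condor treats a half-specified pair is not stated here.

```python
# Sketch only: `config` is a plain dict standing in for the Condor
# configuration, and port_range() is a hypothetical helper.
def port_range(config, direction):
    """Return (low, high) for direction "IN" or "OUT", or None if no range is set."""
    specific = (
        config.get(f"{direction}_LOWPORT"),
        config.get(f"{direction}_HIGHPORT"),
    )
    # Assumption: the direction-specific pair is honored only when complete.
    if None not in specific:
        return specific

    general = (config.get("LOWPORT"), config.get("HIGHPORT"))
    if None not in general:
        return general

    # No range configured at all: the caller is free to use any port.
    return None


example = {
    "LOWPORT": 9600,
    "HIGHPORT": 9700,
    "IN_LOWPORT": 9000,
    "IN_HIGHPORT": 9100,
}
incoming = port_range(example, "IN")    # uses IN_LOWPORT/IN_HIGHPORT
outgoing = port_range(example, "OUT")   # falls back to LOWPORT/HIGHPORT
```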

Bugs Fixed:

Previously, the condor_ startd would not recompute the CurrentRank attribute each time a new job was spawned, but only computed it whenever a new claim was made. Now, the condor_ startd correctly recomputes CurrentRank each time a new job starts running.
When running a gridftp server for grid-type gt4 jobs, Condor will now start the server so as to ignore /etc/grid-security/gridftp.conf and $GLOBUS_LOCATION/etc/gridftp.conf. These files may contain options that would cause the gridftp server to fail when not run as root. Also, Condor's gridftp server is started to ensure that it does not erroneously try to load libraries from an existing Globus installation, causing the gridftp server to crash.
Fixed a bug where jobs using the grid (or globus) universe that specified an AccountingGroup would never run because the condor_ gridmanager would fail to start.
Fixed a bug introduced in 6.7.14 where the job attributes RemoteUserCpu and RemoteSysCpu were incorrectly reported as 0 in the history file and the job queue for non-standard universe jobs.
Fixed a physical memory reporting bug for the Mac OSX port of Condor.
Since the addition of the "new" cron syntax (introduced in version 6.7.11), the condor_ startd has (silently) ignored any jobs defined with the "old" syntax if any jobs are defined with the "new" syntax. Now, the condor_ startd will honor both definitions, but will log a warning to its log file if any jobs with the "old" syntax are found (whether or not any new jobs are found). The condor_ schedd (which also has the "cron" logic) will behave in the same way.
Fixed the bug that caused "Cron" job command lines to have the job name appended again on each invocation.
Some messages about keyboard and mouse idle time were logged too often in the condor_ startd logs under certain conditions. They are now logged less often.
Fixed the -dag option to condor_ q. Previously, this did not print DAG node names as it should have. (This bug has existed since approximately v6.7.11.)
Fixed a bug that could cause the condor_ gridmanager to crash if the GridJobId attribute for a gt2 job became mangled. The cause of mangling seen by some users is still unknown.
Submission from 6.7.15 or 6.7.16 condor_ submit to a 6.7.14 or earlier condor_ schedd was not working unless the submit file explicitly set both arguments and environment using the old syntax. Now condor_ submit automatically converts the environment and argument syntax when necessary. If the conversion is not possible, due to limitations in the old syntax, condor_ submit will generate an error message and refuse to complete the submission.
condor_ submit now returns an error if the executable file specified in the submit file exists but is zero length.
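
The zero-length executable check mentioned in the last item can be pictured with a short Python sketch. The function name and the error strings are hypothetical. This only illustrates the kind of test involved and is not condor_ submit's actual code or its actual messages.

```python
import os


def check_executable(path):
    """Return an error string for an unusable executable, or None if it looks fine."""
    if not os.path.isfile(path):
        return f"executable {path!r} does not exist"
    # A file that exists but has no content cannot be run as a job.
    if os.path.getsize(path) == 0:
        return f"executable {path!r} is zero length"
    return None
```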

Known Bugs:

RPM packages of Condor may refuse to install because of a failed dependency on perl(FileLock). The module in question is missing from the bundles. As a workaround, use rpm's --nodeps option to ignore the requirement. (Bug introduced in version 6.7.17)
The new dynamic deployment tools (condor_ cold_start and others) may fail because FileLock.pm is missing. If you would like to use the new dynamic deployment tools, contact condor-admin@cs.wisc.edu to receive a copy of FileLock.pm. (Bug introduced in version 6.7.17)
Jobs with no arguments submitted using Condor versions 6.7.15 up to and including 6.7.17 that try to run on a pre-6.7.15 condor_ starter will fail to start. The condor_ starter will fail and exit. The job will not run until matched with a condor_ starter from 6.7.15 or later. The workaround is to always specify arguments in the submit file using the old syntax. You must specify the arguments, even if they are empty. For example: "argument=". Existing jobs in the queue can be modified with condor_ qedit. For example: "condor_ qedit <jobid> Args '""'". Jobs submitted prior to upgrading to 6.7.15 or later are not affected. (Bug introduced in version 6.7.15)
Enabling the cron/Hawkeye feature of the condor_ schedd causes jobs to never leave the "run" state. This bug was introduced with the addition of the cron logic to the condor_ schedd in version 6.7.8. This functionality is not enabled by default, so most users will not encounter it. This does not affect the cron/Hawkeye feature in condor_ startd. (Bug introduced in version 6.7.8)
Multi-cluster condor submits will cause condor_ dagman to hang. This bug was introduced by the implementation of parallel universe support. Prior to version 6.7.17, any Condor submit file creating more than one Condor process would be treated as an error by condor_ dagman. Now this is not the case, because a single cluster with multiple processes will work; but condor_ dagman does not deal properly with multi-cluster submits (e.g., a submit file queuing jobs with different executables). As a workaround, take care that submit files submitted by condor_ dagman only submit multiple processes, not multiple clusters. (Bug introduced in version 6.7.17.)
The -maxjobs and -maxidle settings for condor_ dagman are inconsistent: maxjobs applies to job clusters, but maxidle applies to individual processes. Note that this only makes any difference in the case of node submit files that queue more than one process, which has only been supported since version 6.7.17. (Bug introduced in version 6.7.17.)
condor_ dagman does not properly handle failures when removing jobs for a failed node. Note that this only makes any difference in the case of node submit files that queue more than one process, which has only been supported since version 6.7.17. If one process for a node fails, the entire cluster is considered failed, and any other processes in that cluster are removed. If removing the processes fails, condor_ dagman may hang, waiting for those processes to abort. (Bug introduced in version 6.7.17.)
The FileLock.pm perl module (written in-house) was not included in this release. As a direct result the -flock option of condor_cold_start will not work. This can be remedied by downloading FileLock.pm from: ftp://ftp.cs.wisc.edu/condor/temporary/filelock/FileLock.pm and installing it in the lib directory of your Condor installation. (Bug introduced in version 6.7.17)
In some circumstances, the schedd state may be wiped out, clearing the contents of the job queue. The most likely case in which this problem can happen is when the disk containing the spool directory becomes full and the schedd restarts several times due to failures writing to job_queue.log. The problem no longer exists in 6.7.18, but for users who cannot upgrade immediately, the workaround to prevent the bug from ever happening is to add the following line to your config file:
```
MAX_JOB_QUEUE_LOG_ROTATIONS = 0
```

Version 6.7.16

Release Notes:

None.

New Features:

Added support for running a personal Condor on Windows using condor_ master -f.

Bugs Fixed:

Support for NorduGrid jobs was accidentally left out of the condor_ gridmanager in previous releases. This has been corrected.
The condor_ starter was refusing to run jobs if it could not perform a reverse-DNS lookup of the submit-host. Now that this is fixed, when the reverse-DNS lookup fails, the job can still run, but Condor will not be able to verify the authenticity of the submit-host's uid domain. In this case, if you enable TRUST_UID_DOMAIN, everything will function as normal, minus the verification of the domain; if you do not enable TRUST_UID_DOMAIN, the starter will treat the job as being from a different uid domain, regardless of what uid domain the job advertises.
Fixed a few bugs with transfer_output_remaps that caused files to be remapped while in a temporary sandbox. Now, the remapping occurs only when the files are returned to the job submitter.
Fixed some minor memory leaks in the condor_ gridmanager.
Fixed a bug in 6.7.15 that was causing startd cron jobs to fail to run if the old-style configuration setting STARTD_CRON_JOBS was used instead of the new-style configuration setting STARTD_CRON_JOBLIST.

Known Bugs:

The command line string used in starting cron jobs is correct the first time the job is run, but incorrect each subsequent time the job is run. The error is that the job's name is incorrectly appended to the previous run's command line. As an example, the first time, the job is run with the correct command line:
```
/path/to/job jobname
```
The second time, this job is run with the incorrect command line:
```
/path/to/job jobname jobname
```
And the third time, this job is run with the incorrect command line:
```
/path/to/job jobname jobname jobname
```
On Windows only (as far as we know) a condor_ rm of a scheduler universe job after a condor_ qedit may cause the schedd to crash. (This bug has existed at least since 6.7.12.)
On Windows only (as far as we know) a condor_ hold followed by a condor_ release on a job sometimes results in the job being removed instead of going into the idle state. (This bug has existed at least since 6.7.12.)
On Windows only, condor_ dagman may fail with a "DLL not initialized" error (exit code -1073741502). (This bug has existed at least since 6.7.12.)

Version 6.7.15

Release Notes:

If you have not used the undocumented configuration setting SIGNIFICANT_ATTRIBUTES, there is no need to read the rest of this paragraph. For sites that have been using SIGNIFICANT_ATTRIBUTES in the config file, we suggest removing that setting, because Condor now automatically selects the list of attributes that are used to cluster job ClassAds into distinct ads for negotiation. In 6.7.15, any setting of SIGNIFICANT_ATTRIBUTES will be combined with the automated list of attributes that Condor produces. In the future, this behavior may change (e.g. it might override the automated behavior rather than combining with it). If you know in advance that your use of Condor heavily depends on SIGNIFICANT_ATTRIBUTES not including some attributes that are used in requirements expressions (e.g. ImageSize), then you should be aware that 6.7.15 provides no way for you to suppress such attributes. In that case, we recommend that you wait for this issue to be addressed before upgrading. This should not concern most users, especially anyone who is not even using SIGNIFICANT_ATTRIBUTES, or who has defined SIGNIFICANT_ATTRIBUTES to include all attributes that are used in requirements expressions (which is the normal usage case).
Added a clipped port of Condor to YellowDog Linux 3.0 on the PowerPC architecture.
"Cron" jobs defined with the "old" configuration syntax (usually through "STARTD_CRON_JOBS" or "HAWKEYE_CRON_JOBS"; see the condor_ startd manual section for more details) are broken. Using the "new" syntax ("STARTD_CRON_JOBLIST") will work around this problem.

New Features:

For those platforms which support it, libcondorapi.so is now produced and available in the lib/ directory after installing Condor.
The negotiation protocol between the condor_ schedd and the condor_ negotiator daemons has been improved for both scalability and correctness. In general, most sites will see faster negotiation cycles when many jobs are submitted after upgrading both the negotiator and all schedd daemons to version 6.7.15. This means the scheduling overhead per job is reduced. If you have used the undocumented macro SIGNIFICANT_ATTRIBUTES, please read the note above in the release notes, because this new automated behavior affects the use of that configuration setting, in most cases making it unnecessary.
Due to kernel bugs between the Linux 2.4.x and 2.6.x kernels, Condor now implements "checkpointing signatures" which allow more fine grained and automatic control over whether or not a particular machine is willing to resume a job using a previously created checkpoint. This functionality is homogenized across all platforms which provide the standard universe feature set.
Grid matchmaking ads are now aged and replaced by the negotiator based on a configurable ClassAd expression from the Condor config file. This configuration parameter is called STARTD_AD_REEVAL_EXPR. In previous versions, this was done strictly based on the UpdateSequenceNumber field in the ad. The default value for the new parameter behaves the same as the older, hard-coded algorithm.
Condor can now dynamically start its own gridftp server to handle file transfers for grid-type gt4 jobs. The gridftp server appears as a job in the queue and disappears when it's no longer needed.
Automatic renewal of job proxies from a MyProxy server now works for all grid universe jobs. Before, it only worked for grid-type gt2 jobs.
condor_ dagman now reports to its POST scripts uniquely distinguishable return codes for non-exe job failures (e.g., condor_ dagman, batch-system, or other external errors such as failed batch job submission or batch job removal). In the past these errors were reported as various signals (e.g., SIGABRT for job removal or SIGUSR1 for failed job submission), making it impossible to distinguish them from the real signals they were masquerading as. We now represent these errors using the previously-unused return-code space below -64 (we start below -1000, in fact). As before, 0-255 reflect normal exe return codes, and -1 to -64 represent signals 1 to 64; now -1000 and below represent DAGMan, batch-system, or other external errors.
Added the DAGMAN_RETRY_NODE_FIRST configuration macro to condor_ dagman to control whether failed nodes are retried before or after other ready nodes. The default is FALSE (condor_ dagman's previous behavior), which means that failed nodes will be retried after other ready nodes.
Added a new (backward compatible) syntax for job arguments and environment, allowing special characters to be escaped in a uniform way. The old limit of 4096 characters in the job arguments has also been removed. See condor_ submit manual for details of the new syntax.
Added more configuration parameters to the condor_ master's restart / backoff mechanism. You can now configure the initial value of the backoff time (via MASTER_BACKOFF_CONSTANT). Additionally, you can now set daemon specific values for all of these parameters. See the condor_ master entry in the manual for more details.
condor_ userprio now supports the -setaccum, -setbegin, and -setlast options to set the Accumulated Usage, Begin Usage Time, and Last Usage Time of a submitter. This is in addition to the existing -setprio and -setfactor options. These options can be used to safely reconstruct priority information if the only backup data available is the output from condor_ userprio -l.
An updated DRMAA version is available on supported platforms. The previous DRMAA implementation has been removed.
Added new per-job Stork user logs. Stork user logs are now optional, and specified in the job submit file. Stork now uses the Condor user log output format, including the optional XML format. The previous per-server Stork user log in LOG/Stork.user_log is now deprecated, and will be removed in a future release.
condor_ dagman now supports the new, per-job Stork user logs. "Old-style" Stork logs (specified with -Storklog on the condor_ submit_dag command line) are supported for now, but this support will probably be eliminated in the 6.7.16 release.
Added new per-job Stork input, output and error output file specifications. Stork job output is now optional, and specified in the job submit file. The previous per-server output in LOG/Stork-module.stderr and LOG/Stork-module.stdout has been removed.
The Condor installer for Windows is now MSI compliant.
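
The return-code ranges that condor_ dagman now reports to POST scripts (0 to 255 for normal exits, -1 to -64 for signals 1 to 64, and -1000 and below for DAGMan, batch-system, or other external errors) can be decoded with a small helper. The following Python sketch is only an illustration of those ranges. The function name and the category labels are hypothetical, and codes outside the three documented ranges are reported as "unknown" because nothing is stated about them here.

```python
def classify_return_code(rc):
    """Map a return code passed to a POST script onto the documented ranges."""
    if 0 <= rc <= 255:
        # Normal exit of the job's executable.
        return ("exit", rc)
    if -64 <= rc <= -1:
        # Job was killed by a signal; -N encodes signal N.
        return ("signal", -rc)
    if rc <= -1000:
        # DAGMan, batch-system, or other external error.
        return ("external_error", rc)
    # Not covered by any documented range.
    return ("unknown", rc)
```

A POST script written along these lines could, for example, choose to retry only on the "external_error" category and treat a non-zero "exit" as a real job failure.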

Bugs Fixed:

Fixed a bug introduced in Condor 6.7.14 that caused the GT2 GAHP server to ignore configuration parameters LOWPORT and HIGHPORT and the GT4 GAHP to fail at startup.
condor_ status -any now reports quill ads when quill is enabled.
condor_ restart -peaceful was causing condor_ master to only do a graceful shutdown, rather than a peaceful one. This means that GRACEFUL_SHUTDOWN_TIMEOUT would come into effect if jobs running under the startd took too long to finish. However, -peaceful restart did work in the case where a specific subsystem (e.g. -startd) was specified.
When run from a privileged (root) Stork server, modules lose LD_LIBRARY_PATH and other key environment variables, for security reasons. This is not actually a Stork bug, but a feature of glibc. When run with a dynamically linked globus-url-copy, the contributed modules for the HTTP, FTP and GSIFTP transfer protocols will fail. To compensate, these modules can now restore their environment via the pre-existing STORK_ENVIRONMENT configuration macro. Unprivileged (user level) Storks are not affected by this behavior.
Jobs that are placed on hold because on_exit_hold evaluated to TRUE, or jobs that stay in the queue after finishing because on_exit_remove evaluated to FALSE, again correctly report the expression as being a "job attribute", not "UNKNOWN (never set)".
condor_ glidein was creating a default configuration with UPDATE_INTERVAL = 20, which causes unnecessary scaling problems in large glidein pools. It now simply leaves this value undefined so that the default behavior may be assumed.
Fixed a bug that could cause the condor_ gridmanager to crash when a grid-type condor grid universe job left the queue.
When using job leases with the condor grid-type, a completed job will now leave the remote condor_ schedd's queue when the lease expires.
Fixed a bug in the fullpath() function that tests whether a file path is a full path. Paths of the form "c:/" were not recognized as full paths, which could lead to something being prepended to what was already a full path, thereby creating an invalid path.
Fixed a problem with WhenToTransferOutput=ALWAYS. The bug affected jobs that were evicted after producing one or more intermediate files that were removed by the job before finally running to completion in a subsequent run. Condor was treating the missing intermediate files as an error and the job would typically keep running and failing until the user intervened. In addition to fixing this bug, file transfer error messages are now propagated back to the shadow log and the user log, making it easier to debug problems related to file-transfers.
condor_ submit was not paying attention to transfer_output_remaps when doing permissions checks on output files.
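To illustrate the fullpath() fix listed above, here is a minimal Python sketch of a full-path test that accepts either slash style after a drive letter. The names is_full_path and make_full, and the exact rules they apply, are assumptions made for illustration; this is not Condor's implementation.

```python
def is_full_path(path: str) -> bool:
    """Return True if path looks like a full (absolute) path.

    Illustrative rules only: a leading "/" or "\\" counts as full, as does
    a drive letter and colon followed by either "/" or "\\".
    """
    if path.startswith("/") or path.startswith("\\"):
        return True
    # Drive-letter form: "c:/" and "c:\" must both be treated as full paths.
    return (
        len(path) >= 3
        and path[0].isalpha()
        and path[1] == ":"
        and path[2] in ("/", "\\")
    )


def make_full(base: str, path: str) -> str:
    """Prepend base only when path is not already a full path."""
    if is_full_path(path):
        return path
    return base.rstrip("/") + "/" + path
```

With a test that misses the "c:/" form, make_full would prepend the base directory to an already complete path, which is the kind of invalid path the fix prevents.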

Known Bugs:

The command line string used in starting cron jobs is correct the first time the job is run, but incorrect each subsequent time the job is run. The error is that the job's name is incorrectly appended to the previous run's command line. As an example, the first time, the job is run with the correct command line
```
/path/to/job jobname
```
The second time, this job is run with the incorrect command line
```
/path/to/job jobname jobname
```
And, the third time, this job is run with the incorrect command line
```
/path/to/job jobname jobname jobname
```
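The pattern behind this kind of bug can be sketched in Python. This is a hypothetical reconstruction, not Condor's code: the class and method names are invented, and the sketch assumes that the argument list is kept on the job object and appended to on every run instead of being rebuilt.

```python
class CronJob:
    """Hypothetical cron job holding an executable path and a job name."""

    def __init__(self, path: str, name: str) -> None:
        self.path = path
        self.name = name
        self._args = [path]  # state that persists between runs

    def command_line_buggy(self) -> str:
        # Bug pattern: appends to the list left over from the previous run.
        self._args.append(self.name)
        return " ".join(self._args)

    def command_line_fixed(self) -> str:
        # Fix pattern: rebuild the argument list from scratch on every run.
        return " ".join([self.path, self.name])


job = CronJob("/path/to/job", "jobname")
first = job.command_line_buggy()   # matches the first command line above
second = job.command_line_buggy()  # job name now appears twice
```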

Version 6.7.14

Release Notes:

None.

New Features:

The Condor grid universe can now be used to submit jobs to Nordugrid and Unicore resources.
The Condor daemons now automatically restart when the system clock jumps more than 20 minutes in either direction. This may happen if the machine running Condor entered a "sleep" state. This resolves a variety of minor problems.
Added a -direct debugging option to condor_ q which, when using or querying a Quill installation, allows talking directly to the RDBMS, the Quill daemon, or the schedd without performing the queue location discovery algorithm.
condor_ schedd provides more flexibility in how local and scheduler universe jobs are started. The new configuration macros START_LOCAL_UNIVERSE and START_SCHEDULER_UNIVERSE allow administrators to control whether condor_ schedd will start an idle local or scheduler universe job. If a job's respective universe macro evaluates to true, condor_ schedd will then evaluate the Requirements expression for the job. Only if both conditions are met will a job be allowed to begin execution.
condor_ schedd advertises how many local and scheduler universe jobs are currently running or idle in its ClassAd. The total number of running jobs is denoted by the TotalLocalJobsRunning and TotalSchedulerJobsRunning attributes. The total number of idle jobs is denoted by the TotalLocalJobsIdle and TotalSchedulerJobsIdle.
A job submission can now specify the exact time at which it should be executed using the DeferralTime attribute. The time is specified as the number of seconds since the Unix epoch (00:00:00 UTC, Jan 1, 1970). An additional attribute, DeferralWindow, can be specified along with the deferral time to allow a job to run even if it misses the execution time. The window is the number of seconds in the past that Condor will allow for a missed job to execute. This feature is not supported for scheduler universe jobs.
Added the concept of a ``controlling'' daemon to the condor_ master. This feature is currently used only for ``High Availability'' (HA) configurations involving the condor_ had daemon. To use these Condor HA features properly, you must set the corresponding controller macro.
To configure the condor_ negotiator daemon to be controlled by the condor_ had, you should add an entry to your condor_config:
```
MASTER_NEGOTIATOR_CONTROLLER = HAD
```
This will cause the condor_ master to treat the condor_ had as the ``controller'' of the condor_ negotiator.
Grid-type condor grid universe jobs now respect configuration parameters GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE and GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE.
Grid universe jobs can now determine their grid_type via matchmaking, in addition to which resource they will be submitted to. A grid universe job may become any grid_type job, depending on what resource ad it is matched with.
Added support for a new configuration value, STARTD_CRON_AUTOPUBLISH. This setting can be used to tell the condor_ startd to automatically publish a new update to the condor_ collector whenever any of the cron modules it is configured to run have produced output. For more information, see the description of STARTD_CRON_AUTOPUBLISH in section 3.3.10.
Reduced delay in negotiation when a job is released. A reschedule request is sent to the negotiator when a job is released from hold. This reduces the delay in several cases, most notably when using Condor-C or "condor_submit -s". Previously the negotiator would not be notified and would normally wait until the next scheduled negotiation cycle.
Added three new user log events: GridResourceUp, GridResourceDown, and GridSubmit. They are equivalent to the existing Globus-specific log events, but are used for all grid universe jobs.
When known, CPU-usage information will be reflected in the Terminated user log event for grid universe jobs.
Changed ClassAd expression evaluation so that logical AND (&&) and logical OR (||) are short-circuited. This means that an expression like TARGET.foo && TARGET.bar will not evaluate TARGET.bar if TARGET.foo evaluates to false. This will speed up some expressions, particularly those involving user-defined functions. Although this was thoroughly tested, this is the sort of change that could have subtle, unexpected behavior, so please be on the lookout for problems that might be caused by it.
Added the condor_ check_userlogs command, which checks user log files for "illegal" events.
Added new settings SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_RELEASE, and SYSTEM_PERIODIC_REMOVE. These expressions behave identically to the job expressions periodic_hold, periodic_release, and periodic_remove, but are evaluated for all jobs in the queue. If not present, they default to FALSE.
An improved version of the DRMAA C library is available for download from http://prdownloads.sourceforge.net/condor-ext/condor_drmaa_6_7_14_src.tgz
Added CLAIM_WORKLIFE configuration option. The startd will not allow claims older than the specified number of seconds to run more jobs. Any existing job that is running when the worklife expires, however, is allowed to continue to run as normal.
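The DeferralTime and DeferralWindow attributes described in the list above can be illustrated with a short Python sketch. It shows how a UTC time converts to seconds since the Unix epoch and how a window of seconds after that time might be applied. The function names are hypothetical, and the three-way classification is an assumed reading of the description, not Condor's actual logic.

```python
import calendar


def deferral_time(year, month, day, hour=0, minute=0, second=0):
    """Seconds since the Unix epoch for a UTC wall-clock time."""
    return calendar.timegm((year, month, day, hour, minute, second, 0, 0, 0))


def classify(now, deferral, window=0):
    """Decide what to do with a deferred job at time `now`.

    Assumed semantics: wait until `deferral`; if that moment has already
    passed, the job may still run while `now` is no more than `window`
    seconds later; otherwise the execution time was missed.
    """
    if now < deferral:
        return "wait"
    if now <= deferral + window:
        return "run"
    return "missed"


# 00:00:00 UTC on Jan 1, 2006, expressed as a DeferralTime value.
start = deferral_time(2006, 1, 1)
```

Here classify(start + 30, start, 60) returns "run", because the job is 30 seconds late but still inside a 60 second window.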

Bugs Fixed:

Fixed the following problems with the Condor SOAP interface: a) placing a job on hold now stops the job as expected, b) fixed potential schedd segfaults when sending NULL buffers via SOAP, c) fixed compatibility problems with .NET clients
Fixed a potential security problem where any machine in the pool could advertise an additional condor_ negotiator in the pool. Now, the condor_ collector will only accept negotiator classads from machines listed in the HOSTALLOW_NEGOTIATOR variable. This bug has been in Condor since version 6.7.4.
Fixed bug in the dedicated scheduler where on busy pools running mixed parallel and sequential jobs, it would incorrectly try to preempt dedicated jobs.
Fixed some problems when Microsoft .NET clients communicate with Condor via SOAP. The issues were resolved by upgrading the version of gSOAP included in Condor to version 2.7.6c.
Fixed the bug in the condor_ ckpt_server from version 6.7.13 where it would give clients the wrong IP address and no checkpointing was possible. This would result in the following sorts of errors in the log file generated by the condor_ shadow (by default, ShadowLog):
```
Read: connect() failed - errno = 111
Read: open_tcp_stream() failed
Read: ERROR:open_ckpt_file failed, aborting ckpt
```
Version 6.7.14 of the condor_ ckpt_server is working properly once again.
Fixed bugs in Condor's Generic Connection Broker (GCB) support. Condor version 6.7.14 is linked with a new version of the GCB library (1.3.1) that fixes a major bug in how GCB handles UDP messages. Previous versions of GCB had a UDP receive buffer that was far too small, resulting in many dropped UDP packets. Now, GCB will dynamically allocate more buffer space as needed. The new version of GCB also adds support for comments (any line beginning with #) in the GCB routing table. For more information about GCB, see section 3.7.3.
Update job information such as ImageSize, RemoteUserCpu and RemoteSysCpu at job completion. Previously this was only done periodically.
Fixed a bug that could cause the condor_ gridmanager to crash when a job using job leases left the queue.
Fixed a bug that could cause the condor_ schedd to repeatedly start the condor_ gridmanager to manage jobs that were complete. This would happen when LeaveJobInQueue evaluated to True.
When GSI_DAEMON_TRUSTED_CA_DIR is set, pass the setting down to the gt4 gahp server.
Fixed a bug in condor_ dagman that caused the UNLESS-EXIT feature to not work with POST scripts (the return value from a POST script was not tested against the UNLESS-EXIT value).
Fixed a bug in condor_ dagman that caused POST scripts to work incorrectly with node retries: if the node job failed for a node with retries, the POST script was only run on the last retry.
Fixed a bug in condor_ dagman that caused rescue DAGs to fail if the original DAG was run with the -UseDagDir command-line flag. (This bug was introduced at some point after version 6.7.10 and before version 6.7.13.)
Improved usage of uid caching introduced in 6.6.0. This will further reduce load on NIS servers. See the discussion of PASSWD_CACHE_REFRESH in section 3.3.3 for more details.
Fixed a bug in 6.7.13 for Windows causing incorrect handling of absolute paths in the job's output/error if the path began with a forward slash rather than a backslash.
Fixed a bug in the condor_ master that caused ``condor_off -subsystem'' (and similar commands) to fail if the daemon name wasn't hard-coded into the condor_ master. The condor_ master now handles any daemon listed in the DAEMON_LIST for these commands.
Fixed a bug in the condor_ gridmanager that caused it to undercount already-submitted jobs at start-up for purposes of job throttling to Globus grid resources.
Improved the handling of job leases for grid-type condor jobs. There was a race condition between the lease expiring and the condor_ gridmanager attempting to extend the lease. Also, a lease could be set and not extended well before the job was actually submitted. In certain cases, the forwarded lease could exceed JobLeaseDuration.
The startd no longer advertises itself as available to run jobs when it is in shutdown mode (e.g. waiting for jobs to finish). This was a noticeable problem when using large values for MAXJOBRETIREMENTTIME on multi-VM startds; while waiting for one of its VMs to finish running a job, the startd would be available for matching to jobs, but it would reject them when the schedd tried to start them, possibly causing an endless cycle of matching, attempting to run, and failing.
Fixed some minor typos and formatting bugs in some of the log messages generated by the condor_ ckpt_server.
For grid-type condor jobs, the condor_ gridmanager now notices when a job disappears from the remote condor_ schedd unexpectedly.
When getting ``connection refused'', Condor command-line tools and daemons no longer continuously retry the connection attempt until timing out. These retries were causing 10 second or longer delays when trying to connect to Condor services which, for one reason or another, were no longer listening on the expected TCP port.
Fixed a bug in the condor_ c-gahp that could cause it to crash if it fails to connect to a remote condor_ schedd when submitting a job.
Minor memory leaks have been fixed.
Fixed a bug in Quill that could result in an infinite loop in condor_ q when querying Quill.
condor_ preen now looks at the HISTORY setting in the configuration file when considering files to erase. Previously it assumed the history file was always called "history."
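The ``connection refused'' fix listed above amounts to failing fast on an error that retrying cannot cure. A minimal Python sketch of that policy follows. The function name, the injected connect callable, and the retry policy for other errors are all assumptions made for illustration; they do not describe Condor's actual networking code.

```python
import time


def connect_with_retry(connect, timeout=10.0, delay=1.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or `timeout` seconds elapse.

    A refused connection means nothing is listening on the port, so it is
    reported immediately instead of being retried until the timeout.
    """
    deadline = clock() + timeout
    while True:
        try:
            return connect()
        except ConnectionRefusedError:
            raise  # fail fast: retrying would only add delay
        except OSError:
            # Other errors may be transient, so retry until the deadline.
            if clock() >= deadline:
                raise
            sleep(delay)
```

Because ConnectionRefusedError is a subclass of OSError, its handler must come first so that it is not swallowed by the generic retry branch.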

Changes:

The condor_ dagman log file path is converted to an absolute path inside condor_ dagman itself, so that the logging works for multi-directory rescue DAGs (which it didn't before), but the .condor.sub files are still portable.
Added the Stork log file (if any) to the list of log files that condor_ dagman lists in the dagman.out file.
condor_ dagman now reports the node return value for all failed nodes.
Attribute names forced into the job ad via '+' are no longer converted to lower-case. This conversion was a side-effect of a bug-fix in 6.7.11 and caused problems with code that assumed that Condor would preserve the case of attribute names.
Job policy expressions are now evaluated on COMPLETED and REMOVED jobs in the schedd.
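The change to attribute names forced in with '+' separates two properties: how a name is matched and how it is stored. The Python sketch below shows one way to match names without regard to case while keeping the spelling the user wrote. The JobAd class here is a toy written for illustration and is not Condor's ClassAd implementation.

```python
class JobAd:
    """Toy attribute store: case-insensitive lookup, case-preserving names."""

    def __init__(self):
        self._attrs = {}  # lower-cased name -> (name as written, value)

    def insert(self, name, value):
        # The key is folded to lower case, but the original spelling is kept.
        self._attrs[name.lower()] = (name, value)

    def lookup(self, name):
        return self._attrs[name.lower()][1]

    def names(self):
        return [written for written, _ in self._attrs.values()]


ad = JobAd()
ad.insert("MyCustomAttr", 42)
```

Code that later reads the ad back sees "MyCustomAttr" as written, while a lookup for "mycustomattr" still finds the value.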

Known Bugs:

The NEGOTIATOR_MATCHLIST_CACHING setting is broken. It should not be used. This setting is FALSE by default, but if set to TRUE, the condor_ negotiator will crash.
Jobs that are placed on hold because on_exit_hold evaluated to TRUE, or that stay in the queue after finishing because on_exit_remove evaluated to FALSE, will erroneously report the reason as "UNKNOWN (never set)".

Version 6.7.13

Release Notes:

Added a new natively compiled clipped port for the Red Hat Enterprise Linux 3 IA64 distribution.

New Features:

Added complete support for Quill on Windows, so job queues can now be accessed via a relational database. Quill is now available on all Condor supported platforms. See the Quill section of this manual for more information.
Added support in Condor for the Generic Connection Broker (GCB). This is a system for managing network connections across public and private networks. More information about GCB can be found in section 3.7.3.
Added a new configuration option, BIND_ALL_INTERFACES. This is a boolean value that controls whether Condor should bind and listen to all the network interfaces on a multi-homed machine. If set to TRUE, the value of NETWORK_INTERFACE will only control what IP address is published by Condor daemons, even though they will still be listening on all interfaces. The default is FALSE.
Added a -pool option to condor_ submit. It lets you submit jobs to a condor_ schedd in a different pool. The other options to condor_ submit now have long names, but the single-character versions still work.
``grid_resource'' can now be used to directly set the new grid universe job attribute ``GridResource.'' The old attributes still work, but they will be ignored if ``grid_resource'' is present. As a side-effect, ``stream_output'' and ``stream_error'' will default to ``False'' for all jobs.
X509 user proxies are now updated for vanilla universe jobs. If a job specifically sets x509userproxy and is using file transfer, when the proxy file is updated, it will be transferred to the running job.
If a cycle is detected in the DAG while running, condor_ dagman now prints (in the dagman.out file) the status of all DAG nodes.
BeginTransaction call in condor_ schedd's SOAP interface now notifies the caller if too many transactions are currently running via an error code of FAIL. Previous behavior was to abort a running transaction in order to allow the BeginTransaction call to succeed.
MAX_SOAP_TRANSACTION_DURATION config option added so that a single transaction cannot take up too many condor_ schedd resources. This option specifies an optional maximum duration between SOAP calls in a single transaction.
If a machine is acting as both a submit and an execute node, and it cannot communicate with the central manager, it will attempt to run jobs locally. In Condor-specific terms, if the condor_ schedd fails to hear from the central manager, it will attempt to run jobs on a locally running condor_ startd. The SCHEDD_ASSUME_NEGOTIATOR_GONE config macro was added to support this feature; see the description of that macro for details.
You can now specify per-subsystem entries in your condor_config file by prepending the subsystem name and a period to the normal name. The per-subsystem settings take precedence over the regular settings.
condor_ dagman now recovers automatically after being abruptly killed by something other than Condor itself (e.g., by Unix initd during a ``fast'' system shutdown). This is accomplished through the use of a default OnExitRemove expression inserted by condor_ submit_dag which instructs the condor_ schedd not to treat death by SIGKILL as a valid exit condition for condor_ dagman.
Added submit attribute globus_xml, for use with grid-type gt4 jobs. The given XML text will be inserted at the end of the XML job description written by Condor for submission to the WS-GRAM server.
For grid-type gt4 jobs, if a URL scheme is missing from the resource name, ``https://'' will be inserted automatically.
Added submit attribute transfer_output_remaps. This specifies the name (and optionally path) to use when downloading output files from the completed job. Normally output files are transferred back to the initial working directory with the same name they had in the execution directory. This gives you the option to save them with a different path or name.
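The effect of transfer_output_remaps can be sketched as a lookup applied when an output file comes back. In the Python sketch below the remaps are modeled as a plain dictionary, so the submit-file syntax itself is not represented. The function name and the rule that a relative remap is taken relative to the initial working directory are assumptions made for illustration.

```python
import posixpath


def output_destination(initial_dir, name, remaps=None):
    """Return where an output file should be saved on the submit side.

    `remaps` maps an output file name to a new name or path. Files with
    no remap keep their name and land in the initial working directory.
    """
    target = (remaps or {}).get(name, name)
    # Assumption: a remap to a full path is used as is; anything else is
    # interpreted relative to the initial working directory.
    if posixpath.isabs(target):
        return target
    return posixpath.join(initial_dir, target)
```

For example, with remaps of {"out.dat": "/data/final.dat"}, the file out.dat is saved to /data/final.dat, while any other output file still returns to the initial working directory under its own name.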

Bugs Fixed:

Fixed a bug concerning backslash escaping in classad attribute values when condor_ q was using quill.
Fixed a bug where condor_ q could not accept multiple jobids on the command line.
Fixed parallel universe ssh script to now clean up all temporary files it creates.
Fixed a bug in the dedicated scheduler that caused it to request resources it could not use, resulting in longer job startup times.
Fixed a bug in the condor_ schedd that caused grid-type gt2 jobs submitted by an older condor_ submit or in the queue during an upgrade (version 6.7.10 or earlier) to go on hold if the grid_type was ``globus''.
Fixed a bug in condor_ submit that caused it to not set JobGridType in the job ad for grid universe jobs when submitting to a condor_ schedd older than version 6.7.11.
When using file transfer, transferring the results back to the submit machine could silently fail for Condor releases 6.7.0 through 6.7.12. This was relatively rare through 6.7.10. For 6.7.11 and 6.7.12, the bug would be easily triggered if a vanilla job had an X509 user proxy associated with it. This is now fixed.
Fixed a logic bug in the condor_ schedd. Previously, if there was an error expanding any $$(attribute) references in a job classad when trying to spawn a condor_ shadow, the condor_ schedd would die with the fatal exception ``Impossible: GetJobAd() returned NULL for X.Y but that job is already known to exist''. Now, the condor_ schedd correctly distinguishes between a non-fatal error expanding $$(attribute) and the fatal error of the job already being gone (which is, in fact, impossible). This bug was first introduced in Condor version 6.7.1.
The reason strings generated when a user job policy expression fires are now consistent for grid universe jobs.
The condor_ gridmanager now evaluates the periodic job policy expressions at the interval set by PERIODIC_EXPR_INTERVAL.
Fixed a bug which prevented the standard universe from working on Linux kernels newer than 2.6.12.2.
The condor_ schedd used to crash in certain cases if a given job was vacated using condor_ vacate_job, then put on hold and released. The bug only appeared if a specific job id was given to condor_ vacate_job, as opposed to specifying a username or another constraint. Now, the use of condor_ vacate_job for individual job identifiers is safe and the condor_ schedd will not crash. This bug has been in Condor since support for condor_ vacate_job was first added in version 6.7.0.
Fixed a bug that caused the condor_ gridmanager to crash if a grid-type condor job ad contained the attribute remote_.
Fixed a bug with the FS_REMOTE authentication mechanism that caused it to fail occasionally when using NFS.
Fixed a bug in which a double terminated event in a DAG node with a POST script could cause condor_ dagman to abort the DAG and claim that a cycle exists in the DAG.
In the DAG status messages in dagman.out files, condor_ dagman now shows nodes with queued PRE or POST scripts in the Pre or Post columns. Previously, these nodes were shown in the Un-Ready column.
Fixed the GetFile SOAP call on the condor_ schedd so that it behaves more like POSIX read() and does not report errors when trying to read more data than is available.
Fixed a hash function bug that could cause condor_ dagman to crash.
JobCurrentStartDate and JobLastStartDate are no longer changed in the job ad when the condor_ schedd and condor_ shadow reconnect to a running job after a crash.
condor_ dagman now allows POST scripts to be used with DATA nodes in a DAG (previously this caused the DAG to hang).
Using the new Remote_ simplified syntax no longer generates unnecessary debug messages.
Fixed a bug in estimating the size of attribute value buffers that caused quill to crash. This arose when job ads had variables with very large values (more than 3KB).
Fixed a bug in the condor_ gridmanager that could cause it to crash when the Rematch attribute evaluates to True.
The default base scratch directory for WS-GRAM doesn't exist on most server machines. Added a work-around to create the directory as part of the job submission.
Starting in version 6.7.11, the execute host reported for grid jobs in the user log execute event can contain spaces. The C++ user log reading code now properly reads the entire string for these events.
Fixed a bug that caused the condor_ gridmanager to die when it tried to renew the job lease of a grid-type condor job.
Fixed a bug that was causing the condor_ schedd to crash on Solaris if the cron macros aren't defined.
Fixed a bug where output may be lost when spooling (with the -s option to condor_ submit or implicitly with Condor-C). This bug could only happen if the job terminated within one second of starting.
Fixed a bug affecting transferral of output and error files where the file specified in the submit file contains path information. The file was being staged back into the initial working directory and then it was copied to the final path specified. The bug is that if there was an error copying the file to the final location, the intermediate copy would not be deleted and the job would still exit successfully, as though it had succeeded. Now, no intermediate copy of the file is made, and errors in transferring the file will be treated as a failure to run the job, which will typically cause the job to return to idle state and run again.
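The GetFile fix listed above refers to the behavior of POSIX read(), where asking for more data than is available is not an error. A minimal Python sketch of that contract follows. The function name and its signature are hypothetical and do not reproduce the actual SOAP call.

```python
def get_file_chunk(data: bytes, offset: int, length: int) -> bytes:
    """Return up to `length` bytes of `data` starting at `offset`.

    Like POSIX read(), a request that runs past the end of the data is
    not an error: the caller gets the bytes that exist, or b"" at the end.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    return data[offset:offset + length]
```

A client can therefore read in fixed-size chunks and stop when it receives an empty result, without knowing the file size in advance.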

Changes:

Added a couple missing parameters to the example configuration file condor_config.generic.
Slightly cleaned up event checking error messages in condor_ dagman.
Fixed a bug in the condor_ c-gahp that caused it to crash when handling grid-type condor jobs with job leases.
Starting in 6.7.11, the ``JM-Contact'' field of the ``Job submitted to Globus'' user log event was mis-printed. This has been corrected.
Fixed bug that prevented Stork detection of hung jobs.
Fixed an obscure bug that incorrectly quoted the status of completed jobs, visible via stork_ status.

Known Bugs:

The condor_ ckpt_server is broken in version 6.7.13. Please do not attempt to use it. It is safe to use the 6.7.12 condor_ ckpt_server in a pool running 6.7.13 until the 6.7.14 release is out. Of course, the 6.7.12 condor_ ckpt_server will not work with GCB, so sites wishing to use both GCB and a condor_ ckpt_server will have to wait for 6.7.14.
Rescue DAGs generated from DAGs run with the -UseDagDir command-line flag no longer work. (The original run with -UseDagDir should work, but if it fails and generates a rescue DAG, the rescue DAG will always fail.)

Version 6.7.12

Release Notes:

6.7.12 addresses several critical bugs in 6.7.11. 6.7.11 should not be used.

Bugs Fixed:

Fixed a serious bug introduced in 6.7.11 which prevented condor_ dagman from successfully removing its own jobs from the Condor queue after receiving a condor_ rm request from the condor_ schedd.
Fixed a serious bug introduced in 6.7.11 where the condor_ master on Windows would not properly shut down.

Version 6.7.11

Release Notes:

Condor is now linked against GSI from Globus 4.0.1.
GSI security and the grid universe should now work in the Alpha Linux port.
All Condor release packages are now compressed with GNU's gzip. We no longer ship releases compressed with the vendor's compress utility.

New Features:

Added a new feature called Quill to Condor which allows an SQL server to mirror the job queue in order to speed up queries about the job queue via condor_ q and condor_ history. Please see the Quill section of this manual for a description of this feature.
condor_ dagman has a new -maxidle command-line argument that can be used to throttle DAG job submissions according to the number of idle jobs in the DAG.
stork_ submit is now able to search for X.509 credentials in the standard locations.
The condor_ negotiator can now limit how long it negotiates with a single submitter before moving on to the next one.
On platforms and filesystems that support files larger than 2 GB, the history file can now be larger than 2 GB.
Added two options to condor_ q: -jobads and -machineads. They will take ads from files instead of the schedd and collector, respectively. These options are mostly useful for debugging.
Added a new, hopefully less confusing, Cron (Hawkeye) configuration syntax. The old syntax is still supported, but should be considered deprecated, and will eventually go away. The new syntax splits the old colon separated ``name:prefix:executable:period'' string into separate macros.
Improved support for job leases. ``job_lease_duration'' now works for grid-type condor jobs. New job ad attribute ``TimerRemove'' specifies a specific time at which a job should be removed. These attributes will be passed through multiple layers of grid-type condor jobs.
Grid universe jobs now use a unified pair of attributes (``GridResource'' and ``GridJobId'') to identify the remote resource. This will make it possible to match jobs to multiple types of resources. The submit file syntax remains the same for now, except that ``remote_pool'' is now required for grid-type condor jobs.
Significantly improved response time for condor_ q when job classads are larger than 4 kbytes (by disabling TCP Nagle algorithm as appropriate).
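Disabling the Nagle algorithm, as mentioned in the last item above, is done per socket with the standard TCP_NODELAY option. The Python sketch below shows the general technique on a freshly created socket. The helper name is hypothetical, and the sketch says nothing about where or when Condor itself applies the option.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY sends small writes immediately instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# A nonzero value confirms that the option is set on this socket.
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()
```

The trade-off is more, smaller packets on the wire in exchange for lower latency, which suits request and response traffic such as queue queries.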

Bugs Fixed:

Fixed bug in the dedicated scheduler where if the condor_ startd rejected a match, the condor_ schedd would never retry new matches for that machine. This would result in MPI and parallel jobs sticking in the Idle state, and the message "DedicatedScheduler::negotiate sent match for machine, but we've already got it".
Fixed problem with the parallel universe to allow for LAM jobs to get SIGTERM on exit so they can exit cleanly.
Fixed a bug that was visible to the end user as file transfer failures on a busy system. The root problem was that if the condor_ negotiator gave out the same match twice (due to having stale info in the condor_ collector when trying to negotiate), the condor_ schedd would be confused, attempt to re-use the match, fail to do so, and then kill the previous (legitimate) use of the match. This bug was introduced in version 6.7.4.
Fixed bug in the parallel universe that caused the schedd to crash when reconnecting to jobs that couldn't be reconnected to.
Fixed bug in parallel shadow which caused Shadow Exceptions in parallel jobs when the components exited in the wrong order.
Fixed a bug in condor_ dagman that caused it to fail on Windows for DAGs with nodes having absolute paths to their log files. (This bug was introduced in version 6.7.10.)
Fixed a bug whereby condor_ dagman could crash after executing the POST script of a node whose Condor job had never been successfully submitted due to repeated condor_ submit failures. (This bug was introduced in 6.7.7 or earlier.)
Fixed a bug in a debug message. If an error occurred during file transfer, Condor would print the wrong expected filesize in the error message on some platforms.
Fixed a bug where stork_ submit was corrupting log notes passed from the command line. This bug also had the effect of disabling Stork jobs run from DAGMan versions 6.7.10 and later.
If you have DATA nodes in your DAG but no Stork log specified (with the -Storklog argument), condor_ dagman now fails with an explanatory message when parsing the DAG file(s). (Previously, it would just wait forever for the Stork jobs to finish, because it wouldn't see the relevant events.)
In condor_ dagman, argument quoting for stork_ submit now matches argument quoting for condor_ submit.
Corrected how condor_ submit handles attributes forced into the job ad with '+'. Now, the attribute names are case-insensitive, they are not treated as normal submit attributes, and they always over-ride normal submit attributes.
Fixed bugs that would cause a segfault when reading a classad from a file. These were triggered by consecutive blank lines and by lines containing only white-space.
Fixed a bug that could cause duplicated output when a gt4 grid job is executed more than once.
Fixed a bug that could cause the condor_ gridmanager to assert if it tried to delegate credentials for gt4 grid jobs before the gahp server was started.
Fixed a race condition that could cause condor grid-type jobs to be held with hold reason ``Spooling input data files''.
condor_ glidein now correctly handles extracting necessary information from modern Condor configurations where NEGOTIATOR_HOST is not defined.
Refined how grid universe components track jobs. Grid universe jobs are less likely to generate multiple terminate events in the job's user log. There will also be slight performance improvements, as redundant work is no longer done.
Fixed a bug in condor_ dagman that caused it to core dump on a 'job reconnected' event from a node job.
condor_ submit will now exit zero as long as the submission succeeds. Debugging output will still be printed if the internal reschedule fails.
On Windows, exited child processes of the Condor services will be handled in order of termination. This fixes the problem where jobs submitted from a Windows machine appear to run much longer than normal because the condor_ schedd fails to notice that a condor_ shadow exits when the system is very busy.
Fixed a bug that caused scheduler universe jobs to often wait five minutes (or whatever SCHEDD_INTERVAL is set to) before running.
Fixed a bug that prevented the condor_ starter from running on a Win32 machine with a FAT32 filesystem.
A reschedule command will now be sent to the condor_ schedd whenever a job is released from held state. This should make grid-type condor jobs start much faster.
Config parameters GAHP and GAHP_ARGS have been deprecated. GT2_GAHP should be used instead.
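The fix for reading classads from a file, listed above, concerns input with consecutive blank lines and whitespace-only lines. The Python sketch below shows one tolerant way to split such input into separate ads. It assumes a simplified file layout in which blank lines separate ads, and the function name is invented; it is not Condor's parser.

```python
def split_ads(text: str):
    """Split text into ads, where blank lines separate one ad from the next.

    Consecutive blank lines and whitespace-only lines act as a single
    separator and never produce an empty ad.
    """
    ads, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            # First separator after some content closes the current ad.
            ads.append(current)
            current = []
    if current:
        # Final ad that is not followed by a blank line.
        ads.append(current)
    return ads


sample = "A = 1\nB = 2\n\n   \n\nA = 3\n\t\n"
ads = split_ads(sample)
```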

Changes:

condor_ configure no longer creates a
```
 $(LOCAL_DIR)/ViewHist
```
directory, a behavior that was introduced in version 6.7.10. This directory was of limited value for most users.

Known Bugs:

None.

Version 6.7.10

Release Notes:

This release contains all of the bug fixes and improvements from the 6.6 stable series up to and including version 6.6.10.
The Mac OS X binaries shipped with this release were built on OS 10.3. Previous versions of Condor for OS X were built with version 10.2. Condor is officially dropping support for Mac OS 10.2 with this release (the 10.3 binaries may still work on 10.2, but we have not verified this). These binaries are known to work with Mac OS 10.4 (``Tiger''), as well.
There is a minor bug in version 6.7.10's condor_ configure script. It will create a directory called ViewHist in the local directory (next to log, spool, etc). This directory is not used by Condor at all, except in the case of a condor_ view collector (which is optional, and not enabled by default). This behavior will be removed in version 6.7.11, and condor_ configure will go back to not creating the ViewHist directory.

New Features:

condor_ dagman can now run multiple DAGs in separate directories.
Added DAGMAN_CONDOR_SUBMIT_EXE, DAGMAN_STORK_SUBMIT_EXE, DAGMAN_CONDOR_RM_EXE, and DAGMAN_STORK_RM_EXE configuration settings to specify the condor_ submit, stork_ submit, condor_ rm, and stork_ rm executables used by condor_ dagman. If unset (which they are by default), condor_ dagman looks for each in the PATH.
For Condor-C jobs, the condor_ gridmanager will retry and delay failed connections to a remote condor_ schedd like it does for Condor-G jobs. The same configuration settings apply (GRIDMANAGER_CONNECT_FAILURE_RETRY_COUNT and GRIDMANAGER_RESOURCE_PROBE_INTERVAL).
remote_initialdir is now supported in all universes except for standard universe. Previously, it was only supported in the grid universe.
+Remote_ syntax for Condor-C jobs has been simplified for the specific commands of universe, remote_schedd, remote_pool, globus_rsl, and globus_scheduler.
Added default user priority factors for accounting groups. More on accounting groups will be available in future versions of the manual.
The condor_ startd can now be configured to write out the ClaimId of the next available claim for each virtual machine to separate files. This functionality will enable enhanced fault tolerance in future versions of Condor. See section 3.3.10 for details on STARTD_SHOULD_WRITE_CLAIM_ID_FILE and STARTD_CLAIM_ID_FILE, the two configuration settings that control this behavior.
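
The lookup rule for the new DAGMAN_*_EXE settings described above can be illustrated with a short Python sketch. This is not condor_dagman's own code: the helper name resolve_tool and the plain dictionary standing in for the Condor configuration are assumptions made for illustration. A configured value wins, and an unset value falls back to a search of the PATH.

```python
import shutil

# Settings named in the release notes, mapped to the tool each one overrides.
DAGMAN_TOOL_SETTINGS = {
    "DAGMAN_CONDOR_SUBMIT_EXE": "condor_submit",
    "DAGMAN_STORK_SUBMIT_EXE": "stork_submit",
    "DAGMAN_CONDOR_RM_EXE": "condor_rm",
    "DAGMAN_STORK_RM_EXE": "stork_rm",
}


def resolve_tool(setting, config, path=None):
    """Return the executable to use for `setting` (hypothetical helper).

    `config` is a plain dict standing in for the Condor configuration.
    If the setting is unset or empty, the default tool name is searched
    for in PATH (or in `path`, if given); the result is None when the
    tool is not found there.
    """
    configured = config.get(setting)
    if configured:
        return configured
    return shutil.which(DAGMAN_TOOL_SETTINGS[setting], path=path)
```

For example, `resolve_tool("DAGMAN_CONDOR_RM_EXE", {})` searches the PATH for condor_rm, while passing a dictionary that defines DAGMAN_CONDOR_RM_EXE returns the configured value unchanged.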

Bugs Fixed:

Fixed bugs on the Win32 platform in the condor_ schedd that could cause jobs to never complete when the condor_ schedd is busy with many jobs running at once.
Fixed a bug on Windows where, if many jobs were submitted from the same condor_ schedd, some of the condor_ shadow processes would block for an extremely long time trying to get a lock for writing to the ShadowLog file. Now, log writing happens more fairly, and no condor_ shadow process can be delayed indefinitely.
condor_ submit -name formerly had no effect on Windows and did not work properly. This is now fixed.
Significantly sped up the removal of large groups of jobs by changing the default value of JOB_IS_FINISHED_INTERVAL from 1 to 0 (see section 3.3.11 for details on this setting).
Improved performance of the condor_ schedd when not running as root. In version 6.7.7, the new code to support the scheduler universe with Condor-C involved adding some additional overhead to the condor_ schedd. However, this overhead is not needed unless the condor_ schedd is running as root. In version 6.7.10, the condor_ schedd notices if it is not root and does an optimization to avoid the overhead.
Fixed a bug that caused the gridmanager to crash if a gt2, gt3, or gt4 grid job had a proxy that couldn't be read properly. Now the job gets put on hold.
The Condor-C GAHP now performs file staging in a separate process, allowing remote grid jobs to be started earlier.
When contacting the embedded web server on Condor daemons, authentication is no longer requested. The previous authentication requirement didn't provide any additional security, and could confuse users.
Fixed a rare bug that could cause condor_ submit to crash when both getenv=true and environment=... were in a submit file and when very large variable names were in the environment.
Fixed a rare bug where the condor_ schedd would die with a fatal exception under extremely heavy load on the machine. The error message was:
```
  
  ERROR ``Impossible: Create_Thread child_errno (xxx) is not
  ERRNO_PID_COLLISION!'' at line 6181 in file daemon_core.C
```
Fixed a rare bug where certain attributes in a job description file could cause the condor_ schedd to crash when restarting and parsing the job_queue.log file.
Improved performance of standard universe jobs when WantRemoteIO is set to false in the job ClassAd. In this case, Condor's checkpointing libraries now avoid some additional communication with the condor_ shadow that is not required if there is no remote I/O.
Fixed some messages in the Condor log files that were improperly formatted, or contained incomplete information.
Improved some user-log-reading error messages in condor_ dagman.
Removed support for deprecated -NoPostFail option from condor_ dagman. (The same functionality can be achieved through the use of a simple POST script.)
Fixed a bug in the dedicated scheduler where, under heavy load, the schedd would occasionally try to start the same job twice, and subsequently exit with the message:
```
  
  ERROR ``Trying to run job x.x, but already marked RUNNING!''
```
Fixed a bug in the dedicated scheduler, so that it now creates a spool directory for each Condor proc of a parallel or MPI job with multiple requirements.

Known Bugs:

On Windows only, condor_ dagman fails for DAGs with nodes having absolute log file paths in their submit files.
condor_ dagman does not correctly handle the case where all submit attempts for a node job fail, and the node has a POST script. If this happens for a single node in a DAG, it is usually okay, but if it happens for a second node, condor_ dagman will crash.
Using the new Remote_ syntax simplification causes condor_ submit to display debug messages to standard output, possibly confusing programs that parse condor_ submit's output. Fixed in 6.7.13.

Version 6.7.9

Release Notes:

This release contains all of the bug fixes and improvements from the 6.6 stable series up to and including version 6.6.10.

New Features:

The Parallel Universe has been added. For more information, see section 2.10.
The environment variable X509_USER_PROXY is set to the full path of the proxy if a proxy is associated with the job. This is usually done using x509userproxy in the submit file. This currently works in the local, java, and vanilla universes.
condor_ submit generates more precise error messages in some failure cases.
condor_ hold, condor_ release and condor_ rm now allow the user to change the HoldReason, ReleaseReason or RemoveReason with the -reason flag.
condor_ dagman no longer does a one-second sleep before each submit if all node jobs have the same log file. (The sleep is still needed if there are multiple log files, for unambiguous ordering of events during bootstrapping.) Note that if DAGMAN_SUBMIT_DELAY is specified, the specified delay takes effect whether or not all jobs have the same log file.
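
The new condor_ dagman submit-delay rule described above can be summarized in a few lines of Python. This is only a sketch of the documented behavior, not DAGMan's implementation; the function name and its arguments are hypothetical.

```python
def submit_delay_seconds(node_log_files, dagman_submit_delay=None):
    """Seconds to sleep before each submit (illustrative only).

    node_log_files: the log file named by each node job in the DAG.
    dagman_submit_delay: the value of DAGMAN_SUBMIT_DELAY, or None if
    the setting is not specified.
    """
    # An explicitly specified delay always takes effect.
    if dagman_submit_delay is not None:
        return dagman_submit_delay
    # Otherwise the one-second sleep is only needed when the node jobs
    # write to more than one log file.
    return 0 if len(set(node_log_files)) <= 1 else 1
```

With a single shared log file and no DAGMAN_SUBMIT_DELAY, the sketch returns 0. With two distinct log files it returns 1, and any specified delay is returned as given.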

Bugs Fixed:

Many crashes related to running the Dedicated Scheduler have been fixed.
Setting COLLECTOR_HOST or NEGOTIATOR_HOST with a port but without a hostname no longer causes the condor_ master to crash.
The Condor-G Grid Monitor now works with Globus 4.0 pre-Web Services GRAM.
Several deadlocks in the Condor-C GAHP server have been fixed.

Version 6.7.8

Release Notes:

This release contains all of the bug fixes and improvements from the 6.6 stable series up to and including version 6.6.9.

New Features:

Whether a standard universe job asks the condor_ shadow how and where to open every single file can now be controlled with the want_remote_io attribute in the submit description file. This attribute can be set to true or false, and it is true by default. If set to false, this attribute forces a standard universe job to always look to the local file system when opening files and not to contact the shadow. This increases performance of user jobs that open a very large number of files in a short space of time. However, such jobs must be matched to machines that have the same UID_DOMAIN and FILESYSTEM_DOMAIN, as for vanilla universe jobs with a homogeneous file system.
condor_ dagman now has the capability to run more than one independent DAG in a single condor_ dagman process.
User policy expressions (on_exit_remove and on_exit_hold) now work for scheduler universe jobs.
TotalCpus and TotalMemory are now set in machine ads.
condor_ dagman now tolerates the "two terminated events for a single job" bug by default. There is a new bit in DAGMAN_ALLOW_EVENTS to control whether this bug is considered a fatal error in a condor_ dagman run.
Added a new debug formatting flag, D_PID, that prints out the process id (PID) of the process writing a given entry to a log file. This is useful in Condor daemons (such as the condor_ schedd) where the daemon can fork() multiple processes to perform various tasks, and it is helpful to see which log messages come from a forked process and which come from the main thread of execution. The default SCHEDD_DEBUG in the sample configuration files shipped with Condor now includes this flag.
When condor_ dagman writes rescue files, each node is now specified with the same number of retries as was specified in the original DAG, rather than with only the ``remaining'' number of retries based on the failed run. The latter behavior can be restored by setting DAGMAN_RESET_RETRIES_UPON_RESCUE to false.
Added ``Hawkeye'' capabilities to the condor_ schedd. They are configured in the same way as for the condor_ startd, but using ``SCHEDD'' in place of ``STARTD'', in particular for the ``SCHEDD_CRON_NAME'' macro.
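
The change to rescue files described above (retry counts are reset unless DAGMAN_RESET_RETRIES_UPON_RESCUE is false) amounts to a simple choice between two numbers. The following Python sketch illustrates it; the function and argument names are hypothetical and do not come from condor_ dagman itself.

```python
def rescue_retry_count(original_retries, retries_used,
                       reset_retries_upon_rescue=True):
    """Retry count written for a node in a rescue file (illustrative only).

    original_retries: retries specified for the node in the original DAG.
    retries_used: retries already consumed during the failed run.
    reset_retries_upon_rescue: stands in for the
    DAGMAN_RESET_RETRIES_UPON_RESCUE setting.
    """
    if reset_retries_upon_rescue:
        # New behavior: the rescue file carries the original count.
        return original_retries
    # Earlier behavior: only the remaining retries are carried over.
    return max(original_retries - retries_used, 0)
```

For a node declared with 5 retries that used 2 of them before the DAG failed, the sketch gives 5 under the new default and 3 when the setting is false.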

Bugs Fixed:

Fixed a bug in condor_ dagman that prevented POST scripts from being used with jobs that write XML-format logs.
The event-checking code used by condor_ dagman now defaults to allowing an execute event before the submit event for the same job; if this happens, there will be a warning, but the DAG will continue. See section 3.3.22 for more info.
The condor_ userprio option -pool had been failing with ``Can't find address for negotiator'' since version 6.7.5. This is now fixed.
Fixed a bug that prevented SOAP clients from being able to access a job's spooled data files if the condor_ schedd restarted.
Fixed a bug that caused the condor_ gridmanager to panic when trying to retire a job from the queue that was already gone. This could cause multiple terminate events to be logged for some jobs.
Fixed a bug that caused match-making to not work for Condor-C jobs.
Added workaround for a Globus bug that can cause re-execution of a completed GT2 job in the correct failure case (Globus bugzilla ticket 3411).
Properly extend the lifetime of GT4 jobs and credentials on the remote server.

Version 6.7.7

Release Notes:

This release contains all of the bug fixes and improvements from the 6.6 stable series up to and including version 6.6.9.

New Features:

The STARTD_EXPRS list can now be on a per-VM basis, and entries on the list can also be specific to a VM. See 3.13.7 for more details.
The LOCAL_CONFIG_FILE can now be overridden. This now allows files to include other local config files. See 3.3.3 for more info.
Resources that are claimed but suspended can now optionally not be charged for at the accountant. When the resource is unsuspended, the accountant will resume charging for usage. This is controlled by the NEGOTIATOR_DISCOUNT_SUSPENDED_RESOURCES config file entry, and it defaults to false.
The DAGManJobID attribute which condor_ dagman inserts into the classad of every job it submits now contains only its cluster ID (instead of a cluster.proc ID pair), so that it may be referenced as an integer in DAG job submit files. This allows, for example, a user to automatically set the relative local queue priority of jobs based on the condor_ dagman job that submitted them, so that jobs submitted by ``older'' DAGs will start before jobs submitted by ``newer'' DAGs (assuming they are otherwise identical).
GSI authentication can now be used when Condor-C jobs are submitted from one condor_ schedd to another.
File permissions are now preserved when a job's data files are transferred between Unix machines. File transfers that involve a Windows machine or an older version of Condor remain as before.
Condor-C now supports the scheduler remote universe.
condor_ advertise now publishes a ``MyAddress'' if none is provided in the source ClassAd. This will prevent the collector from throwing out ads with no address (see Bugs Fixed).
Added a new condor_ dagman parameter DAGMAN_ALLOW_EVENTS controlling which ``bad'' events are not considered fatal errors; the -NoEventChecks command-line argument is deprecated and has no effect.
condor_ fetchlog now takes an optional log file extension in order to select logs such as ``StarterLog.vm2''.

Bugs Fixed:

Fixed a throughput bottleneck that occurred when standard universe jobs vacate and the user has specified WantCheckpoint equal to False in the submit file.
Added initial support for the getdents(), getdents64(), glob(), and the family of functions opendir(), readdir(), closedir() for the standard universe.
It is recommended that you do not directly invoke getdents() or getdents64(), but instead use the other POSIX functions specified above.
There are two caveats: these calls will not work in heterogeneous contexts, and you may not call getdents() directly when using condor_ compile on a 32-bit program while specifying the 64-bit interfaces for the Unix API.
In versions 6.7.4 through 6.7.6, Computing On Demand (COD) support was broken due to a bug in how Condor daemons parsed their command line arguments. The bug was introduced with the changes to provide a web services (SOAP) interface to Condor. This bug has been fixed and COD support is now working again.
In version 6.7.6, the DAGParentNodeNames attribute which condor_ dagman adds to all DAG job classads could grow too long and cause job submission to fail. Now, if the DAGParentNodeNames value would be too long to add to the job classad, the attribute is instead left undefined and a warning is emitted in the DAGMan debugging log. This behavior means that such a node can be reliably distinguished from a node with no parents, as the latter will have a DAGParentNodeNames attribute defined but empty.
In version 6.7.3, the value of the X509UserProxySubject job attribute was changed in such a way that Condor-G jobs submitted by a newer condor_ submit to an older condor_ schedd could fail to run. Now, condor_ submit reverts to the old behavior when talking to an old condor_ schedd.
Bug-fixes and improvements to grid_type gt4:
- Condor will now delegate a single proxy to the GT4 server for multiple jobs. If the local proxy is refreshed, Condor will forward the refreshed copy to the server.
- Exit codes are now recorded properly.
- JAVA_EXTRA_ARGUMENTS is now used when invoking the GT4 GAHP server (which is written in Java).
- If LOWPORT and HIGHPORT are set in the config file, the GT4 GAHP server will now obey the port restriction.
- Fixed a bug that caused Condor not to notice when some GT4 jobs completed.
- Fixed a bug in handling the job's environment for GT4 jobs. Condor incorrectly used <name>=<value> for each variable's name.
- Improved hold reason in certain cases when a GT4 job goes on hold.
- condor_ q -globus now works properly for GT4 jobs. Also, the resource name in the user log execute event is printed properly for GT4 jobs.
- Fixed a bug that could cause Condor to not detect when a GT4 job completes. This was triggered by Condor not properly recognizing the StageOut Globus job state.
Fixed a bug that can cause the condor_ gridmanager to abort if PeriodicRelease evaluates to true while it's putting a job on hold.
Fixed a bug in condor_ dagman that caused the DAG to be aborted if a job generated an executable error event.
Fixed a bug in condor_ dagman on Windows that would cause it to hang or crash on exit.
MPI universe jobs now honor the JOB_START_DELAY configuration setting.
The condor_ collector now throws out startd, schedd, and License ClassAds that don't have a valid IP address (used in its hashing). The collector will now correctly fall back to ``MyAddress'' if it is provided.
Fixed a bug in condor_ dagman that could cause condor_ dagman to fail an assertion if PRE or POST scripts are throttled with the -maxpre or -maxpost condor_ submit_dag command line flags.

Version 6.7.6

Release Notes:

Version 6.7.6 contains all the bug fixes and improvements from the 6.6 stable series up to and including version 6.6.9.

New Features:

Added support for libc's system() function for standard universe executables. This call is not checkpoint-safe, in that the standard universe job could call it two or more times in the event of a resumption from an earlier checkpoint. The invocation of this call by the shadow on behalf of the user job is controlled by a configuration file parameter called SHADOW_ALLOW_UNSAFE_REMOTE_EXEC and is off by default. The full environment of the user job is preserved during the invocation of system(), and this might cause problems in heterogeneous submission contexts if the user is not careful.
Added support for a web services (SOAP) interface to Condor. For more information, see section 4.4.1.
NOTE: Due to a bug in gSOAP, the SOAP support in Condor 6.7.6 does not work with all SOAP toolkits. Some of the responses that gSOAP generates contain unqualified tags. Therefore, SOAP toolkits that are strict (such as gSOAP or .Net) will not accept these poorly formed responses. SOAP toolkits that are more lax in the responses they accept (such as Axis, SOAP::Lite, or ZSI) will work with version 6.7.6. This problem has already been fixed and the solution will be released in Condor version 6.7.7.
Added support for the GT4 grid_type in Condor's grid universe. This new grid type supports jobs submitted to grid resources controlled by Globus Toolkit version 4 (GT4).
New configuration settings are required to support jobs submitted for the GT4 grid type. These settings have been added to the default configuration files shipped with Condor, but sites that are upgrading an existing installation and choosing to keep their old configuration files must add these settings to allow GT4 jobs to work:
```
## The location of the wrapper for invoking GT4 GAHP server
GT4_GAHP = $(SBIN)/gt4_gahp
 
## The location of GT4 files. This should normally be lib/gt4
GT4_LOCATION = $(LIB)/gt4

## gt4-gahp requires gridftp server. This should be the address of gridftp
## server to use
GRIDFTP_URL_BASE = gsiftp://$(FULL_HOSTNAME)
```
Condor version 6.7.6 includes the Stork data movement system, the Condor Credential Daemon (condor_ credd), and support for using MyProxy for credential management. However, currently these are only supported in our release for Linux using the 2.4 kernel with glibc version 2.3 (RedHat 9, etc). All of these features require changes to the Condor configuration files to function properly. The default configuration files shipped with Condor already include all the new settings, but sites upgrading an existing installation must add these new settings to their Condor configuration. For a list of settings and more information, see section 3.3.28 for Stork, section 3.3.19 for condor_ credd, and section 3.3.26 for MyProxy. For more information about MyProxy, you can also see http://grid.ncsa.uiuc.edu/myproxy
Added preliminary support for the High Availability Daemon (HAD).
Added a new SCHED_UNIV_RENICE_INCREMENT configuration variable used by the condor_ schedd for scheduler universe jobs, analogous to the existing JOB_RENICE_INCREMENT variable used by the condor_ startd for other job universes. The SCHED_UNIV_RENICE_INCREMENT variable is undefined by default, and when undefined, defaults to 0 internally.
The relative priority of a user's own jobs in the local condor_ schedd queue is no longer limited to the range -20 to +20, but can be any integer value.
DAGMan Improvements:
- condor_ dagman now inserts a DAGParentNodeNames attribute into the ClassAd of all Condor jobs it submits, containing the names of the job's parents in the DAG. The list is in the form of a comma-delimited string.
- Added the condor_ dagman arguments -noeventchecks and -allowlogerror to condor_ submit_dag.
condor_ glidein Improvements:
- Added condor_ glidein options for setting up GSI authentication.
- Added condor_ glidein option -run_here for direct execution of Glidein, instead of submitting it for remote execution. You may also save a script for doing this and then run the script through whatever mechanism you want (like some batch system interface not supported by Condor-G).
Added support for the NEGOTIATOR_CYCLE_DELAY configuration setting, which is only intended for expert administrators. For more information, see section 3.3.18.

Bugs Fixed:

Previous versions of the condor_ master had a bug where if the administrator attempted to use <SUBSYS>_ARGS to pass -p to any Condor daemon to have it listen on a specific, fixed port, the underlying daemon would not honor the flag. Now, the condor_ master correctly supports using <SUBSYS>_ARGS to define a port using -p. For more information about <SUBSYS>_ARGS, see section 3.3.9.
Removed case-sensitivity of command-line argument names in condor_ submit_dag.
Fixed the -r (remote schedd) option in condor_ submit_dag.
Condor versions 6.7.1 through 6.7.5 exhibited a bug in which the commands condor_ off, condor_ restart, and condor_ vacate did not handle the -pool command-line option correctly. These commands correctly queried the central manager of the remote pool, but then incorrectly sent the command to the central manager machine. This bug has now been fixed, and these tools no longer send the command to the central manager machine.

Known Bugs:

None.

Version 6.7.5

Release Notes:

None.

New Features:

Added DAG aborting feature - a DAG can be configured to abort immediately if a node exits with a given exit value.
The dedicated scheduler can now preempt running MPI jobs from appropriately configured machines. See 3.13.8 for details.
The MPI universe now supports submit files with multiple procs (queue commands), each with distinct requirements. This is useful for placing the head node of an MPI job on a specific machine, and the rest of the nodes elsewhere. See 2.10.5 for details.
The condor_ negotiator now publishes its own ClassAd to the condor_ collector which includes the IP address and port where it is listening. This negotiator ClassAd can be viewed using the new -negotiator option with condor_ status. In addition to removing an unnecessary fixed port for the condor_ negotiator, this change corrects some problems with commands that attempted to communicate directly with the condor_ negotiator. These bugs were first listed in the Known Bugs section of the 6.6.0 version history.
To enable this feature and have the condor_ negotiator listen on a dynamic port, you must comment out the NEGOTIATOR_HOST setting in your configuration file. The new example configuration files shipped with version 6.7.4 and later will already have this setting undefined. However, if you upgrade your binaries and retain an older copy of your configuration files, you should consider commenting out NEGOTIATOR_HOST.
To disable this feature and have the condor_ negotiator still listen on a well-known port, you can uncomment the NEGOTIATOR_HOST setting in the default configuration. For example:
```
NEGOTIATOR_HOST = $(CONDOR_HOST)
```
Pools that are comprised of older versions of Condor and a 6.7.4 or later central manager machine should either continue to use their old condor_config file (which will still have NEGOTIATOR_HOST defined) or they should re-define the NEGOTIATOR_HOST setting in the new example configuration files which are used during the installation process.
Added optional DAGMAN_RETRY_SUBMIT_FIRST configuration parameter that tells condor_ dagman whether to immediately retry the submit if a node submit fails, or to put that job at the end of the ready jobs queue. The default is TRUE, which retries the failed submit before trying to submit any other jobs.
The schedd now uses non-blocking connection attempts when contacting startds. This prevents the long (typically 40 second) hang of all schedd operations when the connection attempt does not complete, due to network problems.
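
The effect of DAGMAN_RETRY_SUBMIT_FIRST, described above, can be pictured as choosing which end of the ready-jobs queue a failed node is returned to. The Python sketch below models the queue with a deque; the function name and the queue representation are assumptions for illustration, not DAGMan internals.

```python
from collections import deque


def requeue_failed_submit(ready_queue, node, retry_submit_first=True):
    """Put a node whose submit failed back on the ready queue.

    ready_queue: deque of node names, with the next node to submit at
    the left end.
    retry_submit_first: stands in for DAGMAN_RETRY_SUBMIT_FIRST.
    """
    if retry_submit_first:
        # Default (TRUE): retry the failed submit before any other job.
        ready_queue.appendleft(node)
    else:
        # FALSE: the node goes to the end of the ready jobs queue.
        ready_queue.append(node)
    return ready_queue
```

Starting from a queue holding nodes B and C, a failed submit of node A yields the order A, B, C under the default and B, C, A when the parameter is false.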

Bugs Fixed:

Fixed a performance problem with the standard universe when gettimeofday() is called in a very tight loop by the application.
Fixed the default value of OPSYS in the MacOSX version of Condor. Once again, Condor reports OSX for all versions of MacOSX. This bug was introduced in version 6.7.3 of Condor.
Fixed a bug in condor_ dagman that caused it to be killed if the DAGMAN_MAX_SUBMIT_ATTEMPTS parameter was set to too high a value.
Fixed a bug in condor_ gridmanager that caused it to crash if the grid_monitor was activated.
Fixed support for the getdents64() system call inside the standard universe on Linux and Solaris.
Fixed a bug in condor_ dagman that dealt incorrectly with the problem of Condor sometimes writing both a terminated and an aborted event for the same job. The spurious aborted event is now ignored.

Known Bugs:

None.

Version 6.7.3

Release Notes:

This release contains all the bug fixes from the 6.6 stable series up to and including version 6.6.7, and some of the fixes that will be included in version 6.6.8. The bug fixes in version 6.6.8 that were not included in version 6.7.3 are listed in a separate section of the 6.6.8 version history.

New Features:

Added full ports of Condor to Red Hat Fedora Core 1, 2, and 3 on the 32-bit x86 architecture. See the Linux platform-specific section 6.1.6 in this manual for more information on caveats with this port.
Added a feature to condor_ dagman that will allow VARS names to include numerics and underscores.
Added optional COLLECTOR_HOST_FOR_NEGOTIATOR configuration parameter to indicate which condor_ collector the condor_ negotiator on this (local) host should query first. This is designed to improve negotiation performance.
Added a new condor_ dagman capability to allow the DAG to continue if it encounters a double run of the same node job (set the DAGMAN_IGNORE_DUPLICATE_JOB_EXECUTION parameter to true to do this).
Added Condor-C: the "condor" grid_type. Condor-C allows jobs to be handed from one condor_ schedd to another condor_ schedd.
Added setup_here option to condor_ glidein for cases where direct installation is desired instead of submitting a setup job to the remote gatekeeper. (For example, this is useful when doing an installation onto AFS.)
If RemoteOwner is exported via STARTD_VM_EXPRS into the ad of other virtual machines, the condor_ negotiator automatically inserts RemoteUserPrio into the ad as well, so policy expressions can now take into account the priority of jobs running on other virtual machines on the same host.
Linux 2.6 kernels do not update the access time for console devices, so Condor was unable to detect if there has been activity at the keyboard or mouse. As a work-around, Condor now polls /proc/interrupts to detect if the keyboard has requested attention. This does not work for USB keyboards or pseudo TTYs, so ConsoleIdle on 2.6 kernels will be wrong for some devices. Future versions of Condor or Linux may correct this.
condor_ dagman no longer removes the X509_USER_PROXY environment variable. This should allow users to set the environment variable before invoking condor_ submit_dag and have the jobs submitted by condor_ dagman correctly find the proxy file.
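
The /proc/interrupts work-around for Linux 2.6 kernels mentioned above can be sketched in Python as follows. This is an illustration of the polling idea only, not Condor's code. It assumes that the keyboard controller's line in /proc/interrupts carries the device label i8042, which is common on PC hardware but not guaranteed. Other devices that share that label are counted as well, and, as noted above, USB keyboards do not appear this way at all.

```python
def keyboard_interrupt_count(proc_interrupts_text, label="i8042"):
    """Sum the interrupt counters on lines whose device label is `label`.

    proc_interrupts_text: the contents of /proc/interrupts as a string.
    label: device name to look for (an assumption; see the text above).
    """
    total = 0
    for line in proc_interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header line and lines for other devices.
        if not fields or not fields[0].endswith(":") or label not in fields:
            continue
        # The per-CPU counters follow the IRQ number; stop at the first
        # non-numeric field (the interrupt controller name).
        for field in fields[1:]:
            if not field.isdigit():
                break
            total += int(field)
    return total


def keyboard_active(previous_text, current_text):
    """True if the keyboard counters grew between two polls."""
    return (keyboard_interrupt_count(current_text)
            > keyboard_interrupt_count(previous_text))
```

A caller would read /proc/interrupts at two points in time and pass both snapshots to keyboard_active(); growth in the counters suggests that the console was used in between.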

Bugs Fixed:

Fixed a condor_ dagman bug that could cause it to leave jobs running when aborting a DAG.
Fixed a condor_ dagman bug which, if its debug level was set to zero (silent), could cause it to improperly recognize persistent condor_ submit failures.
Fixed a bug in Condor's file transfer mechanism that showed up when users tried to use streaming output for either STDOUT or STDERR. There were situations where Condor would attempt to transfer back the STDOUT or STDERR file from the execution host, even though these files didn't exist and all the data was already streamed back to the submit host. Now, if either stream_output or stream_error is set to true in the job submit description file, Condor will transfer any other output but will not attempt to transfer back STDOUT or STDERR.
The Condor user log library (libcondorapi) now correctly handles execute events that lack a hostname.

Known Bugs:

Unfortunately, the default OPSYS value for the MacOSX version of Condor was incorrectly changed in version 6.7.3. Condor used to always report OSX, but in version 6.7.3 it will report either OSX10_2, OSX10_3, or OSX_UNK. This is wrong, since Condor jobs submitted to any version of OSX should be able to run on any other version of OSX, and the above change needlessly partitions resources and complicates things for end-users. Therefore, anyone running version 6.7.3 on MacOSX is encouraged to add the following line to their global condor_config file:
```
OPSYS = OSX
```
If your pool is already running the new release, you can cause the above change to take effect by running the following command on your pool's central manager machine (or any machine listed in the HOSTALLOW_ADMINISTRATOR list) after you have changed the OPSYS value in your configuration:
```
condor_reconfig -all
```
However, if you have already submitted jobs to your pool with the old OPSYS value, the Requirements expression in those jobs will still refer to the incorrect value. In this case, you should either a) wait for the jobs to complete before making the above change, b) remove the jobs and resubmit them after you've made the change, or c) manually run condor_ qedit on the jobs to change their Requirements expressions.
When running in recovery mode on a DAG that has PRE scripts, condor_ dagman may attempt more than the specified number of retries of a node (counting retries attempted during the first run of the DAG). This is because if a node fails because of the PRE script failing, that fact is not recorded in the log, so that retry is missed in recovery mode.

Version 6.7.2

Release Notes:

Condor Version 6.7.2 includes some bug fixes from Version 6.6.7, but none from Version 6.6.8.
MPI users who are upgrading from previous versions of Condor to version 6.7.2 will need to modify the MPI_CONDOR_RSH_PATH configuration macro of their dedicated resource to be $(LIBEXEC) instead of $(SBIN). Users who are installing Condor version 6.7.2 for the first time will not need to make any changes.

New Features:

Added an INCLUDE configuration file variable to define the location of the header files shipped with Condor that currently must be included when compiling against Condor APIs. When INCLUDE is defined, condor_ config_val can be used to list header files.
A Condor pool can now support multiple Collectors. This should improve stability due to automatic failover. All daemons will now send updates to ALL of the specified collectors. All daemons/tools will query the Collectors in sequence, until an appropriate response is received. Thus if one (or more) of the Collectors are down, the pool will continue to function normally, as long as there is at least one functioning Collector. You can specify multiple (comma-separated) collector host (and port) addresses in the COLLECTOR_HOST entry in the configuration file. A given condor_ master can only run one Collector.
When the condor_ master is started with the -r option to indicate that it should quit after a period of time, the condor_ startd will now indicate how much time is remaining before it exits. It does this by advertising TimeToLive in the machine ClassAd.
Added a new macro, JOB_START_COUNT, that works in conjunction with the existing macro JOB_START_DELAY to throttle job starts. Together, this macro pair provides greater flexibility in tuning the job start rate to the available condor_ schedd performance.
Added a LIBEXEC directory to the install process. Support commands that the Condor system needs will be added to this directory in future releases. This directory should not be added to a user or system-wide path.
Added the ability to decide for each file that Condor transfers whether it should be encrypted or not, using encrypt_input_files, dont_encrypt_input_files, encrypt_output_files, and dont_encrypt_output_files in the job's submit file.
Added DISABLE_AUTHENTICATION_IP_CHECK which will work around problems on dual-homed machines where the IP address is reported incorrectly to condor. This is particularly a problem when using Kerberos on multi-homed machines.
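
The multiple-Collector behavior described above (updates go to all Collectors, queries try each Collector in sequence) can be sketched in Python. The function names, the callables standing in for network operations, and the example host names are all hypothetical; this is an outline of the described failover pattern, not the code used by the Condor daemons.

```python
def parse_collector_hosts(value):
    """Split a comma-separated COLLECTOR_HOST value into addresses."""
    return [host.strip() for host in value.split(",") if host.strip()]


def send_update(collectors, send):
    """Send an update to every collector; return the hosts that failed.

    `send` is a caller-supplied function taking a host address. It is
    expected to raise OSError when the collector cannot be reached.
    """
    failed = []
    for host in collectors:
        try:
            send(host)
        except OSError:
            failed.append(host)
    return failed


def query_first_available(collectors, query):
    """Query collectors in order and return the first response received."""
    for host in collectors:
        try:
            return query(host)
        except OSError:
            continue  # this collector is down; try the next one
    raise ConnectionError("no collector responded")
```

With COLLECTOR_HOST set to a value such as `cm1.example.org, cm2.example.org`, a query in this sketch still succeeds when the first host is unreachable, which mirrors the statement that the pool keeps functioning as long as one Collector is up.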

Bugs Fixed:

Fixed a bug on Linux systems caused by both Condor and the Linux distribution having a library file called libc.a. The problem caused the link step to fail on Condor API programs. The evaluation order to determine the location of library files caused use of the wrong file, given the duplicate naming. The bug is fixed by renaming the Condor library files.
When the condor_ startd is evaluating the state of each virtual machine (VM), it now refreshes any ClassAd attributes which are shared from other virtual machines (using STARTD_VM_EXPRS) before it tries to evaluate. This way, if a given VM changes its state, all other VMs will immediately see this state change.
Fixed a bug where you couldn't transfer input files larger than 2 gigabytes.
Condor can now detect the size of memory on a Linux machine with the 2.6 kernel.
JAR files specified in the submit file were not being transferred along with the job unless they were also explicitly placed in the list of input files to transfer. Now, the JAR files are implicitly added to the list of input files to transfer.

Known Bugs:

None.

Version 6.7.1

Release Notes:

Version 6.7.1 contains all of the features, ports, and bug fixes from the previous stable series, up to and including version 6.6.6. A few additional bugs have been fixed in the 6.6.x stable series but not yet released; those fixes will appear in version 6.6.7. They have been included in version 6.7.1, and appear in the list of bug fixes included from version 6.6.7 below. In addition, a number of new features and some bug fixes have been made, which are described below in more detail.

New Features:

Added an option to DAGMan's retry ability. If a DAG specifies something like ``RETRY job 10 unless-exit 9'', then the retries will only happen if the node doesn't exit with a value of 9.
Condor-G can now submit jobs to Globus 3.2 (WS) (for jobs with universe = grid, grid_type = gt3). Submitting to Globus 3.0 (as in Condor 6.7.0) is no longer supported. Submitting to pre-WS Globus (2.x) is still supported (grid_type = gt2).
Added new startd policy expression MaxJobRetirementTime. This specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job (for owner preemption, negotiator preemption, or graceful startd shutdown).
Added -peaceful shutdown/restart mode. This will shut down the startd without killing any jobs, effectively treating both MaxJobRetirementTime and GRACEFUL_SHUTDOWN_TIMEOUT as infinite. The default shutdown/restart mode is still -graceful, which behaves according to whatever MaxJobRetirementTime and GRACEFUL_SHUTDOWN_TIMEOUT are set to. The behavior of -fast mode is unchanged; it kills jobs immediately, regardless of the other timeout settings.
Jobs can now be submitted as ``noop'' jobs. Jobs submitted with noop_job = true will not be executed by Condor. Instead, a terminate event is immediately written to the job log file and the job is removed from the queue. This is useful for DAGs where the pre-script determines the job should not run.
Added preliminary support for the Tool Daemon Protocol (TDP) into Condor. This protocol is still under development, but the goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools. Assuming this protocol is adopted by other scheduling systems and by various monitoring tools, it would allow arbitrary combinations of tools and schedulers to co-exist, function properly, and provide monitoring services for jobs running under the schedulers. This initial support allows users to specify a ``tool'' that should be spawned alongside their regular Condor job. On Linux, the ability to have the batch Condor job suspend immediately upon start-up is also implemented, which allows a monitoring tool to attach with ptrace() before the job's main() function is called.
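
The ``unless-exit'' retry option described in the list above can be sketched in Python as follows. This is an illustration of the rule only, not DAGMan's implementation. It assumes that an exit value of 0 means success and that a retry count of N allows up to N re-runs after the first attempt; the function and parameter names are hypothetical.

```python
from typing import Callable, Optional, Tuple


def run_with_retries(
    run_node: Callable[[], int],
    max_retries: int,
    unless_exit: Optional[int] = None,
) -> Tuple[int, int]:
    """Run a node until it succeeds, hits the unless-exit value, or runs
    out of retries. Returns (final exit value, number of attempts)."""
    attempts = 0
    while True:
        attempts += 1
        code = run_node()
        if code == 0:
            return code, attempts  # assumed: 0 means the node succeeded
        if unless_exit is not None and code == unless_exit:
            return code, attempts  # this exit value suppresses further retries
        if attempts > max_retries:
            return code, attempts  # first attempt plus max_retries re-runs used


if __name__ == "__main__":
    # A node that fails with 1, then exits with 9: the retry stops at 9.
    codes = iter([1, 9, 0])
    print(run_with_retries(lambda: next(codes), max_retries=10, unless_exit=9))
```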

Bugs Fixed:

Fixed a significant memory leak in the condor_ schedd that was introduced in version 6.7.0. In 6.7.0, the condor_ schedd would leak a copy of the ClassAd for every job it tried to spawn (on average, around 2000 bytes per job).
Fixed the bugs in Condor's MPI support that were introduced in version 6.7.0. Condor now supports MPI jobs linked with MPICH 1.2.4 and older. Improved Condor's log messages and email notifications when MPI jobs run on multiple virtual machines (the messages now include the appropriate ``vmX'' identifier, not just the hostname). Unfortunately, due to changes in MPICH between version 1.2.4 and 1.2.5, Condor's MPI support is not compatible with MPICH 1.2.5. We will be addressing this problem in a future release.

Bug fixes included from version 6.6.7:

Fixed an important bug in the low-level code that Condor uses to transfer files across a network. There were certain temporary failure cases that were being treated as permanent, fatal errors. This resulted in file transfers that aborted prematurely, causing jobs to needlessly re-run. The code now gracefully recovers from these temporary errors. This should significantly help throughput for some sites, particularly ones that transfer very large files as output from their jobs.
Fixed a number of bugs in the -format option to condor_ q and condor_ status. Now, these tools will properly handle printing boolean expressions in all cases. Previously, depending on how the boolean evaluated, either the expression was printed, or the tool could crash. Furthermore, the tools do a better job of handling the different types of format conversion strings and printing out the appropriate value. For example, if a user tries to print out a boolean attribute with condor_status -format "%d\n" HasFileTransfer, the condor_ status tool will evaluate HasFileTransfer and print either a 0 or a 1 (FALSE or TRUE). If, on the other hand, a user tries to print out a boolean attribute with condor_status -format "%s\n" HasFileTransfer, the condor_ status tool will print out the string ``FALSE'' or ``TRUE'' as appropriate.
The ClassAd attribute scope resolution prefixes, MY. and TARGET., are no longer case sensitive.
condor_ dagman now does better checking for inconsistent events (such as getting multiple terminate events for a single job). This checking can be disabled with the -NoEventChecks command-line option.
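
The -format behavior for boolean attributes described in the list above can be mimicked with a short Python sketch. It only models the documented outcome (%d yields 0 or 1, %s yields FALSE or TRUE); the function name and the regular expression used to find the conversion character are assumptions, and this is not how the Condor tools are implemented.

```python
import re

# Matches a single printf-style conversion ending in d, i, or s.
_CONVERSION = re.compile(r"%[-+ #0]*\d*(?:\.\d+)?([dis])")


def format_boolean(fmt: str, value: bool) -> str:
    """Render a boolean with a printf-style format string."""
    match = _CONVERSION.search(fmt)
    if match is None:
        return fmt  # no supported conversion; nothing to substitute
    if match.group(1) in "di":
        return fmt % int(value)  # integer conversion prints 0 or 1
    return fmt % ("TRUE" if value else "FALSE")  # string conversion


if __name__ == "__main__":
    print(format_boolean("%d\n", True), end="")
    print(format_boolean("%s\n", False), end="")
```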

Known Bugs:

None.

Version 6.7.0

Release Notes:

Version 6.7.0 contains all of the features, ports, and bug fixes from the previous stable series, up to and including version 6.6.4. In addition, a number of new features and some bug fixes have been made, which are described below in more detail.

New Features:

Added support for vanilla and Java jobs to reconnect when the connection between the submitting and execution nodes is lost for any reason. Possible reasons for this disconnect include: network outages, rebooting the submit machine, restarting the Condor daemons on the submit machine, etc. If the execution machine is rebooted or the Condor daemons are restarted, reconnection is not possible. To take advantage of this reconnect feature, jobs must be submitted with a JobLeaseDuration. There are new events in the UserLog related to disconnect and reconnect.
Added a new Condor tool, condor_ vacate_job. This command is similar to condor_ vacate, except that its arguments identify jobs in a job queue, not machines to vacate. For example, a user can vacate a specific job id, all the jobs in a given cluster, all the jobs matching a job queue constraint, or even all jobs owned by that user. The owner of a job can always vacate their own jobs, regardless of the pool security policy controlling condor_ vacate (which is an administrative command that acts directly on machines). See the new command reference, section 9, for details.
Added a new ``High Availability'' service to the condor_ master. You can now specify a daemon which can have ``fail over'' capabilities (i.e. the master on another machine can start a matching daemon if the first one fails). Currently, this is only available over a shared file system (e.g. NFS), and has only been tested for the condor_ schedd.
Scheduler universe jobs on UNIX can now specify a HoldKillSig, the signal that should be sent when the job is put on hold. If not specified, the default is to use the KillSig, and if that is not defined, the job will be sent a SIGTERM. The submit file keyword to use for defining this signal is hold_kill_sig, for example, hold_kill_sig = SIGUSR1.
The condor_ startd can now support policies on SMP machines where each virtual machine (VM) has knowledge of the other VMs on the same host. For example, if a job starts running on one of the VMs, a job running on another VM could immediately be suspended. This is accomplished by using the new configuration variable STARTD_VM_EXPRS, which is a list of ClassAd attribute names that should be shared across all VMs on the machine. For each VM on the machine, every attribute in this list is looked up in the VM-specific machine ClassAd, the attribute name is given a prefix indicating what VM it came from, and then inserted into the machine ClassAds of all the other VMs.
The condor_ startd publishes four new attributes into the machine ClassAds it generates when it is in the Claimed state: TotalJobRunTime, TotalJobSuspendTime, TotalClaimRunTime, TotalClaimSuspendTime. These attributes keep track of the total time the resource was either running a job (in the Busy activity) or had a job suspended, regardless of how many suspend/resume cycles the job went through. The first two attributes (with ``Job'' in the name) keep track for a single job (i.e. since the last time the resource was Claimed/Idle). The last two attributes (with ``Claim'' in the name) keep track of these totals across all jobs that ran under the same claim (i.e. since the last state change into the Claimed state).
Added a -num option to the condor_ wait tool to wait for a specified number of jobs to finish.
Added a configuration option STARTER_JOB_ENVIRONMENT so the admin can configure the default environment inherited by user jobs.
Added a configurable feature (off by default) to the condor_ schedd that allows it to back up the spool file before doing anything else.
The "Continuous" option of the condor_ startd ``cron'' jobs is being deprecated. It is being replaced by two new options which control separate aspects of its behavior:
- "WaitForExit" specifies the "exit timing" mode
- "ReConfig" specifies that the job can handle SIGHUPs, and it should be sent a SIGHUP when the condor_ startd is reconfigured.
Many of the items logged by the condor_ startd ``cron'' logic are now logged at D_FULLDEBUG instead of D_ALWAYS.
Added NEGOTIATOR_PRE_JOB_RANK and NEGOTIATOR_POST_JOB_RANK. These expressions are applied respectively before and after the user-supplied job rank when deciding which of the possible matches to choose. (The existing expression PREEMPTION_RANK is applied after NEGOTIATOR_POST_JOB_RANK.) The pool administrator may use these expressions to steer jobs in ways that improve the overall performance of the pool. For example, using the pre job rank, preemption may be avoided as long as there are idle machines, even when the user-supplied rank expression prefers a machine that happens to be busy. Using the post job rank, one could steer jobs towards machines that are known to be dedicated to batch jobs, or one could enforce breadth-first instead of depth-first filling of a cluster of multi-processor machines.
Added the ability for Condor to transfer files larger than 2G on platforms that support large files. This works automatically for transferred executables, input files and output files.
Added the ability for jobs to stream back standard input, output, and error files while running. This is activated by the stream_input, stream_output, and stream_error options to condor_ submit. Note that this feature is incompatible with the new feature described above where the shadow and starter can reconnect in certain circumstances.
Added support for vanilla jobs to be mirrored on a second condor_ schedd. The jobs are submitted to the second condor_ schedd on hold and will be released if the second condor_ schedd hasn't heard from the first condor_ schedd (actually, a condor_ gridmanager running under the first condor_ schedd) for a configurable amount of time. Once the second condor_ schedd releases the jobs, the first condor_ schedd acts as a mirror, reflecting the state of the jobs on the second condor_ schedd. To use this mirroring feature, jobs must be submitted with a mirror_schedd parameter in the submit file and require no file transfer.
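
The order in which NEGOTIATOR_PRE_JOB_RANK, the user-supplied job rank, NEGOTIATOR_POST_JOB_RANK, and PREEMPTION_RANK are applied, as described in the list above, can be pictured as a lexicographic comparison. The Python sketch below assumes that each expression has already been evaluated to a number for every candidate machine, that higher values are preferred, and that a later expression only breaks ties left by the earlier ones. The dictionary keys and machine names are hypothetical, and this is not the negotiator's actual code.

```python
from typing import Dict, List, Union

Candidate = Dict[str, Union[str, float]]


def choose_match(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate with the best ranks, compared in the order
    pre job rank, job rank, post job rank, preemption rank."""
    return max(
        candidates,
        key=lambda m: (
            m["pre_job_rank"],     # administrator's NEGOTIATOR_PRE_JOB_RANK
            m["job_rank"],         # user-supplied rank from the job
            m["post_job_rank"],    # administrator's NEGOTIATOR_POST_JOB_RANK
            m["preemption_rank"],  # existing PREEMPTION_RANK, applied last
        ),
    )


if __name__ == "__main__":
    # The pre job rank prefers idle machines, so the idle machine wins even
    # though the job's own rank prefers the busy one.
    machines = [
        {"name": "busy-node", "pre_job_rank": 0, "job_rank": 10,
         "post_job_rank": 0, "preemption_rank": 0},
        {"name": "idle-node", "pre_job_rank": 1, "job_rank": 1,
         "post_job_rank": 0, "preemption_rank": 0},
    ]
    print(choose_match(machines)["name"])
```

Setting the pre job rank to favor idle machines reproduces the example given above: preemption is avoided while idle machines exist, regardless of the job's own rank.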

Bugs Fixed:

Fixed a bug in the condor_ startd ``cron'' logic which caused the condor_ startd to except when trying to delete a job that could never be run (e.g. because of an invalid executable).
Fixed a bug in condor_ startd ``cron'' logic which caused it to not detect when the starting of a ``job'' failed.
Fixed several bugs in the reconfiguration handling of the condor_ startd ``cron'' logic. In particular, even if the job has the "reconfig" option set (or "continuous"), the job(s) won't be sent a SIGHUP when the startd first starts, or when the job itself is first run (until it outputs its first output block, defined by the "-" separator).

Known Bugs:

Condor's MPI support (for MPICH 1.2.4) was broken by other changes in version 6.7.0. Support for MPI jobs will return in Condor version 6.7.1.

Table 8.1: Condor 6.7.0 supported platforms

Architecture	Operating System
Hewlett Packard PA-RISC (both PA7000 and PA8000 series)	HPUX 10.20
Sun SPARC Sun4m, Sun4c, Sun UltraSPARC	Solaris 2.6, 2.7, 8, 9
Silicon Graphics MIPS (R5000, R8000, R10000)	IRIX 6.5 (clipped)
Intel x86	Red Hat Linux 7.1, 7.2, 7.3, 8.0
	Red Hat Linux 9
	Windows 2000 Professional and Server, 2003 Server (clipped)
	Windows XP Professional (clipped)
ALPHA	Digital Unix 4.0
	Red Hat Linux 7.1, 7.2, 7.3 (clipped)
	Tru64 5.1 (clipped)
PowerPC	Macintosh OS X (clipped)
	AIX 5.2L (clipped)
Itanium	Red Hat Linux 7.1, 7.2, 7.3 (clipped)
	SuSE Linux Enterprise 8.1 (clipped)

condor-admin@cs.wisc.edu