8.10 Development Release Series 6.1
This was the first development release series.
It contains numerous enhancements over the 6.0 stable series.
For example:
- Support for running multiple jobs on SMP machines
- Enhanced functionality for pool administrators
- Support for PVM, MPI and Globus jobs
- Support for Flocking jobs across different Condor pools
The 6.1 series has many other improvements over the 6.0 series, and
is available on more platforms.
The new features, bugs fixed, and known bugs of each version are
described below in detail.
Version 6.1.17
This version is the 6.2.0 ``release candidate''.
It was publicly released in February of 2001, and it will be released
as 6.2.0 once it is considered ``stable'' after heavy testing in the
UW-Madison Computer Science Department Condor pool.
New Features:
- Hostnames in the HOSTALLOW and HOSTDENY entries are now case-insensitive.
- It is now possible to submit NT jobs from a UNIX machine.
- The NT release of Condor now supports a USE_VISIBLE_DESKTOP parameter.
If true, Condor will allow the job to create windows on the desktop of the
execute machine, so a user at that machine can interact with the job. This is
particularly useful for debugging why an application will not run under Condor.
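For example, a pool administrator might enable this with a config entry like the following (a sketch; the comment is illustrative):

```
# condor_config on the NT execute machine
USE_VISIBLE_DESKTOP = True
```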
- The condor_ startd contains support for the new MPI dedicated
scheduler that will appear in the 6.3 development series. This will allow
you to use your 6.2 Condor pool with the new scheduler.
- Added a mixedcase option to condor_ config_val to allow
overriding the default behavior of lowercasing all configuration names.
- Added a pid_snapshot_interval option to the config file to
control how often the condor_ startd should examine the running
process family. It defaults to 50 seconds.
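In the config file, this might look like the following (the upper-case spelling is an assumption, following the usual config-entry convention):

```
# Seconds between condor_ startd scans of the running process family
PID_SNAPSHOT_INTERVAL = 50
```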
Bugs Fixed:
- Fixed a bug in how the condor_ schedd, upon reaching the MAX_JOBS_RUNNING
limit, calculated Scheduler Universe jobs for preemption.
- Fixed a bug in the condor_ schedd losing track of condor_ startds
in the initial claiming phase. This bug affected all platforms, but was most
likely to manifest on Solaris 2.6.
- CPU time can be greater than wall clock time in multi-threaded
applications, so this is no longer considered an error in the UserLog.
- condor_ restart -master now works correctly.
- Fixed a rare condition in the condor_ startd that could corrupt
memory and result in a signal 11 (SIGSEGV, or segmentation violation).
- Fixed a bug that would cause the ``execute event'' to not be
logged to the UserLog if the binary for the job resided on AFS.
- Fixed a race-condition in Condor's PVM support on SMP machines
(introduced in version 6.1.16) that caused PVM tasks to be associated
with the wrong daemon.
- Better handling of checkpointing on large-memory Linux machines.
- Fixed occasional failures to send job completion email.
- It is no longer possible to use condor_ userprio to set a priority of less
than 1.
- Fixed a bug in the job completion email statistics.
Run Time was being underreported when the job completed after doing a
periodic checkpoint.
- Fixed a bug that caused CondorLoadAvg to get stuck at 0.0 on
Linux when the system clock was adjusted.
- Fixed a condor_ submit bug that caused all machine_count
commands after the first queue statement to be ignored for PVM jobs.
- PVM tasks now run as the user when appropriate instead of always
running under the UNIX ``nobody'' account.
- Fixed support for the PVM group server.
- PVM now uses an environment variable to communicate with its children
instead of a file in /tmp. That file could previously be overwritten
by multiple PVM jobs.
- condor_ stats now lives in the ``bin'' directory instead of ``sbin''.
Known Bugs:
- The condor_ negotiator can crash if the Accountantnew.log file becomes
corrupted. This most often occurs if the Central Manager runs out of disk space.
Version 6.1.16
New Features:
- Condor now supports multiple pvmds per user on a machine. Users
can now submit more than one PVM job at a time, PVM tasks can now run
on the submission machine, and multiple PVM tasks can run on SMP
machines. condor_ submit no longer inserts default job requirements
to restrict PVM jobs to one pvmd per user on a machine. This new
functionality requires the condor_ pvmd included in this (and future)
Condor releases. If you set ``PVM_OLD_PVMD = True'' in the Condor
configuration file, condor_ submit will insert the default PVM job
requirements as it did in previous releases. You must set this if you
do not upgrade your condor_ pvmd binary or if your jobs flock with pools
that use an older condor_ pvmd.
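The compatibility setting described above, as a config fragment:

```
# Keep the pre-6.1.16 default PVM job requirements
PVM_OLD_PVMD = True
```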
- The NT release of Condor no longer contains debugging
information.
This drastically reduces the size of the binaries you must install.
Bugs Fixed:
- The configuration files shipped with version 6.1.15 contained a
number of errors relating to host-based security, the configuration of
the central manager, and a few other things.
These errors have all been corrected.
- Fixed a memory management bug in the condor_ schedd that could
cause it to crash under certain circumstances when machines were taken
away from the schedd's control.
- Fixed a potential memory leak in a library used by the
condor_ startd and condor_ master that could leak memory while
Condor jobs were executing.
- Fixed a bug in the NT version of Condor that would result in
faulty reporting of the load average.
- The condor_ shadow.pvm should now correctly return core files
when a task or condor_ pvmd crashes.
- This release fixes a memory error introduced in version
6.1.15 that could crash the condor_ shadow.pvm.
- Some condor_ pvmd binaries in previous releases included
debugging code we added that could cause the condor_ pvmd to crash.
This release includes new condor_ pvmd binaries for all platforms
with the problematic debugging code removed.
- Fixed a bug in the -unset options to condor_ config_val
that was introduced in version 6.1.15.
Both -unset and -runset now work correctly.
Known Bugs:
Version 6.1.15
New Features:
Bugs Fixed:
- In the STANDARD Universe, jobs submitted to Condor could segfault
if they opened multiple files with the same name. Usually this bug
was exposed when users would submit jobs without specifying a file
for either stdout or stderr; in this case, both would default to
/dev/null, and this could trigger the problem.
- The Linux 2.2.14 kernel, which is used by default with Red Hat 6.2,
has a serious bug that can cause the machine to lock up when
the same socket is used for repeated connection attempts. Thus,
previous versions of Condor could cause the 2.2.14 kernel to hang
(lots of other applications could do this as well). The Condor Team
recommends that you upgrade your kernel to 2.2.16 or later. However,
in v6.1.15 of Condor, a patch was added to the Condor networking
layer so that Condor would not trigger this Linux kernel bug.
- If no email address was specified when the job was submitted
with condor_ submit, completion email was being sent to
user@submit-machine-hostname. This is not the correct behavior. Now
email goes by default to user@uid-domain, where uid-domain is
defined by the UID_DOMAIN setting in the config file.
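A sketch of the relevant config entry (the domain value is only an example):

```
# Completion email now goes to user@$(UID_DOMAIN)
UID_DOMAIN = cs.wisc.edu
```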
- The condor_ master can now correctly shutdown and restart the
condor_ checkpoint_server.
- Email sent when a SCHEDULER Universe job completes now has the
correct From: header.
- In the STANDARD universe, jobs which call sigsuspend() will
now receive the correct return value.
- Abnormal error conditions, such as the hard disk on the submit
machine filling up, are much less likely to result in a job disappearing
from the queue.
- The condor_ checkpoint_server now correctly reconfigures when
a condor_ reconfig command is received by the condor_ master.
- Fixed a bug with how the condor_ schedd associates jobs with
machines (claimed resources) which would, under certain circumstances,
cause some jobs to remain idle until other jobs in the queue complete
or are preempted.
- A number of PVM universe bugs are fixed in this release.
Bugs in how the condor_ shadow.pvm exited, which caused jobs to hang
at exit or to run multiple times, have been fixed.
The condor_ shadow.pvm no longer exits if there is a problem starting
up PVM on one remote host.
The condor_ starter.pvm now ignores the periodic checkpoint command
from the startd. Previously, it would vacate the job when it received
the periodic checkpoint command.
A number of bugs with how the condor_ starter.pvm handled
asynchronous events, which caused it to take a long time to clean up
an exited PVM task, have been fixed.
The condor_ schedd now sets the status correctly on multi-class PVM
jobs and removes them from the job queue correctly on exit.
condor_ submit no longer ignores the machine_count command for PVM
jobs.
And, a problem which caused pvm_exit() to hang was diagnosed:
PVM tasks which call pvm_catchout() to catch the output of
child tasks should be sure to call it again with a NULL argument to
disable output collection before calling pvm_exit().
- The change introduced in 6.1.13 to the condor_ shadow regarding
when it logged the execute event to the user log produced situations
where the shadow could log other events (like the shadow exception
event) before the execute event was logged.
Now, the condor_ shadow will always log an execute event before it
logs any other events.
The timing is still improved over 6.1.12 and older versions, with the
execute event getting logged after the bulk of the job initialization
has finished, right before the job will actually start executing.
However, you will no longer see user logs that contain a ``shadow
exception'' or ``job evicted'' message without a ``job executing''
event first.
- stat() and variant calls now go through the file table to
get the correct logical size and access times of buffered files.
Before, stat() used to return zero size on a buffered file that had
not yet been synced to disk.
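The buffered-size behavior the fix addresses can be demonstrated locally with ordinary userspace stream buffering (a sketch and analogy only, not Condor's remote I/O layer):

```python
import os
import tempfile

# Sketch: data written through a buffered stream is invisible to stat()
# until it is flushed -- a local analogue of the buffered-file behavior
# described above.
fd, path = tempfile.mkstemp()
os.close(fd)
f = open(path, "w")                    # buffered stream
f.write("hello")                       # data sits in the userspace buffer
size_before = os.stat(path).st_size    # still 0 bytes on disk
f.flush()
size_after = os.stat(path).st_size     # now 5 bytes
f.close()
os.remove(path)
```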
Known Bugs:
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- The getrusage() call does not always work as expected in
STANDARD Universe jobs.
If your program uses getrusage(), the reported usage
could incorrectly decrease by a second
across a checkpoint and restart. In addition, the time it takes
Condor to restart from a checkpoint is included in the usage times
reported by getrusage(), and it probably should not be.
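The counters in question are the usage times getrusage(2) reports; Python's resource module is used here as a convenient stand-in to show what they contain (the bug above is in Condor's checkpoint/restart handling of these counters, not in getrusage() itself):

```python
import resource

# Sketch: the same counters getrusage(2) reports, via Python's
# resource module (Unix only).
usage = resource.getrusage(resource.RUSAGE_SELF)
cpu_seconds = usage.ru_utime + usage.ru_stime
print("user=%.3fs system=%.3fs" % (usage.ru_utime, usage.ru_stime))
```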
Version 6.1.14
New Features:
- Initial support added for Red Hat Linux 6.2 (i.e., glibc 2.1.3).
Bugs Fixed:
- In version 6.1.13, periodic checkpoints would not occur (see the
Known Bugs section for v6.1.13 listed below). This bug, which only
impacts v6.1.13, has been fixed.
Known Bugs:
- The getrusage() call does not work properly inside
``standard'' jobs.
If your program uses getrusage(), it will not report correct values
across a checkpoint and restart.
If your program relies on proper reporting from getrusage(), you
should either use version 6.0.3 or 6.1.10.
- While Condor now supports many networking calls such as
socket() and connect(), (see the description below of this
new feature added in 6.1.11), on Linux, we cannot at this time support
gethostbyname() and a number of other database lookup calls.
The reason is that on Linux, these calls are implemented by bringing in a
shared library that defines them, based on whether the machine is using
DNS, NIS, or some other database method.
Condor does not support the way in which the C library tries to explicitly
bring in these shared libraries and use them.
There are a number of possible solutions to this problem, but the Condor
developers are not yet agreed on the best one, so this limitation might not
be resolved by 6.1.14.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- When a program linked with the Condor libraries (using condor_ compile)
is writing output to a file, stat() and variant calls
will return zero for the size of the file if the program has not yet
read from the file or flushed the file descriptors.
This is a side effect of the file buffering code in Condor and will be
corrected to the expected semantics.
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
Version 6.1.13
New Features:
- Added DEFAULT_IO_BUFFER_SIZE and
DEFAULT_IO_BUFFER_BLOCK_SIZE config parameters to allow
the administrator to set the default file buffer sizes for user jobs
in condor_ submit.
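For example (the values are illustrative, not documented defaults):

```
# Default file buffer sizes for user jobs, in bytes (example values)
DEFAULT_IO_BUFFER_SIZE = 524288
DEFAULT_IO_BUFFER_BLOCK_SIZE = 32768
```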
- There is no longer any difference in the configuration file
syntax between ``macros'' (which were specified with an ``='' sign)
and ``expressions'' (which were specified with a ``:'' sign).
Now, all config file entries are treated and referenced as macros.
You can use either ``='' or ``:'' and they will work the same way.
There is no longer any problem with forward-referencing macros
(referencing macros you haven't yet defined), so long as they are
eventually defined in your config files (even if the forward reference
is to a macro defined in another config file, like the local config
file, for example).
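A sketch of the unified syntax (the macro names are illustrative):

```
# ``='' and ``:'' now behave identically
LOW_LOAD = 0.3
HIGH_LOAD : 1.5
# Forward references are allowed, as long as the macro is defined somewhere
START = LoadAvg < $(BUSY_LOAD)
BUSY_LOAD = 0.5
```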
- condor_ vacate now supports a -fast option that forces
Condor to hard-kill the job(s) immediately, instead of waiting for
them to checkpoint and gracefully shutdown.
- condor_ userlog now displays times in days+hours:minutes format
instead of total hours or total minutes.
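The new display amounts to a simple conversion; a sketch (the exact field widths are an assumption, since only the days+hours:minutes form is stated above):

```python
def format_duration(total_minutes):
    """Render a duration as days+hours:minutes, e.g. 1500 -> '1+01:00'.

    The zero-padded field widths are an assumption for illustration.
    """
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return "%d+%02d:%02d" % (days, hours, minutes)

print(format_duration(1500))   # 1 day, 1 hour, 0 minutes
```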
- The condor_ run command provides a simple front-end to
condor_ submit for submitting a shell command-line as a vanilla
universe job.
- Solaris 2.7 SPARC and 2.7 INTEL have been added to the
list of ports that now support remote system calls and checkpointing.
- Any mail being sent from Condor now shows up as having been sent from
the designated Condor Account, instead of root or ``Super User''.
- The condor_ submit ``hold'' command may be used to submit jobs
to the queue in the hold state. Held jobs will not run until released
with condor_ release.
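A minimal submit description using the new command might look like the following (the exact value syntax of the hold command, and the other entries, are assumptions for illustration):

```
# This job enters the queue in the hold state;
# release it later with condor_ release.
executable = my_program
hold = true
queue
```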
- It is now possible to use checkpoint servers in remote pools
when flocking even if the local pool doesn't use a checkpoint server.
This is now the default behavior (see the next item).
- USE_CKPT_SERVER now defaults to True if a checkpoint
server is available. It is usually more efficient to use a checkpoint
server near the execution site instead of storing the checkpoint back
to the submission machine, especially when flocking.
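As a config fragment:

```
# Default is now True whenever a checkpoint server is available
USE_CKPT_SERVER = True
```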
- All Condor tools that used to expect just a hostname or address
(condor_ checkpoint, condor_ off, condor_ on, condor_ restart,
condor_ reconfig, condor_ reschedule, condor_ vacate) to specify
which machine to affect can now take an optional -name or
-addr in front of each target.
This provides consistency with other Condor tools that require the
-name or -addr options.
For all of the above-mentioned tools, you can still just provide
hostnames or addresses; the new flags are not required.
- Added -pool and -addr options to condor_ rm,
condor_ hold and condor_ release.
- When you start up the condor_ master or condor_ schedd as any
user other than ``root'' or ``condor'' on Unix, or ``SYSTEM'' on NT,
the daemon will have a default Name attribute that includes
both the username of the user who the daemon is running as and the
full hostname of the machine where it is running.
- Clarified our Linux platform support. We now officially
support the Red Hat 5.2 and 6.x distributions, and although other Linux
distributions (especially those with similar libc versions) may work,
they are not tested or supported.
- The schedd now periodically updates the run-time counters in the
job queue for running jobs, so if the schedd crashes, the counters
will remain relatively up-to-date. This is controlled by the
WALL_CLOCK_CKPT_INTERVAL parameter.
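A sketch of the setting (the value and its units are an assumption for illustration):

```
# How often the schedd saves run-time counters to the job queue
WALL_CLOCK_CKPT_INTERVAL = 3600
```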
- The condor_ shadow now logs the ``job executing'' event in the
user log after the binary has been successfully transferred, so that
the events appear closer to the actual time the job starts running.
This can create somewhat unexpected log files:
if something goes wrong with the job's initialization, you might see
an ``evicted'' event before you see an ``executing'' event.
Bugs Fixed:
- Fixed how we internally handle file names for user jobs. This
fixes a nasty bug caused by changing directories between checkpoints.
- Fixed a bug in our handling of the Arguments macro in
the command file for a job. If the arguments were extremely long, or
there were an extreme number of them, they would get corrupted when the
job was spawned.
- Fixed DAGMan. It had not worked at all in the previous release.
- Fixed a nasty bug under Linux where file seeks did not work
correctly when buffering was enabled.
- Fixed a bug where the condor_ shadow would crash while sending job
completion e-mail, forcing a job to restart multiple times and the user
to receive multiple completion messages.
- Fixed a long-standing bug where Fortran 90 would occasionally
truncate its output files to random sizes and fill them with zeros.
- Fixed a bug where close() did not propagate its return
value back to the user job correctly.
- If a SIGTERM was delivered to a condor_ shadow, it used to
remove the job it was running from the job queue, as if condor_ rm
had been used.
This could have caused jobs to leave the queue unexpectedly.
Now, the condor_ shadow ignores SIGTERM (since the condor_ schedd
knows how to gracefully shutdown all the shadows when it gets a
SIGTERM), so jobs should no longer leave the queue prematurely.
In addition, on a SIGQUIT, the shadow now does a fast shutdown, just
like the rest of the Condor daemons.
- Fixed a number of bugs which caused checkpoint restarts
to fail on some releases of Irix 6.5 (for example, when migrating from
a mips4 to a mips3 CPU or when migrating between machines with
different pagesizes).
- Fixed a bug in the implementation of the stat() family
of remote system calls on Irix 6.5 which caused file opens in Fortran
programs to sometimes fail.
- Fixed a number of problems with the statistics reported in the
job completion email and by condor_ q -goodput, including the
number of checkpoints and total network usage. Correct values will
now be computed for all new jobs.
- Changes in USE_CKPT_SERVER and
CKPT_SERVER_HOST no longer cause problems for jobs in the
queue which have already checkpointed.
- Many of the Condor administration tools had a bug where they
would suffer a segmentation violation if you specified a -pool
option and did not specify a hostname.
This case now results in an error message instead.
- Fixed a bug where the condor_ schedd could die with a
segmentation violation if there was an error mapping an IP address
into a hostname.
- Fixed a bug where resetting the time in a large negative direction
caused the condor_ negotiator to have a floating point error on some
platforms.
- Fixed condor_ q's output so that certain arguments are not ignored.
- Fixed a bug in condor_ q where issuing -global with a
fairly restrictive -constraint argument would sometimes cause garbage
to be printed to the terminal.
- Fixed a bug which caused jobs to exit without completing a
checkpoint when preempted in the middle of a periodic checkpoint.
Now, the jobs will complete their periodic checkpoint in this case
before exiting.
Known Bugs:
- Periodic checkpoints do not occur. Normally, when the config
file attribute PERIODIC_CHECKPOINT evaluates to True,
Condor performs a periodic checkpoint of the running job. This
bug has been fixed in v6.1.14. NOTE: there is a work-around to permit
periodic checkpoints to occur in v6.1.13: add the attribute name
``PERIODIC_CHECKPOINT'' to the attributes
listed in the STARTD_EXPRS entry in the config file.
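The work-around, sketched as a config fragment (appending to the pool's existing STARTD_EXPRS list, whatever it contains):

```
# v6.1.13 work-around: make PERIODIC_CHECKPOINT an advertised attribute
STARTD_EXPRS = $(STARTD_EXPRS), PERIODIC_CHECKPOINT
```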
- The getrusage() call does not work properly inside
``standard'' jobs.
If your program uses getrusage(), it will not report correct values
across a checkpoint and restart.
If your program relies on proper reporting from getrusage(), you
should either use version 6.0.3 or 6.1.10.
- While Condor now supports many networking calls such as
socket() and connect(), (see the description below of this
new feature added in 6.1.11), on Linux, we cannot at this time support
gethostbyname() and a number of other database lookup calls.
The reason is that on Linux, these calls are implemented by bringing in a
shared library that defines them, based on whether the machine is using
DNS, NIS, or some other database method.
Condor does not support the way in which the C library tries to explicitly
bring in these shared libraries and use them.
There are a number of possible solutions to this problem, but the Condor
developers are not yet agreed on the best one, so this limitation might not
be resolved by 6.1.14.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- When writing output to a file, stat() and variant calls
will return zero for the size of the file if the program has not yet
read from the file or flushed the file descriptors.
This is a side effect of the file buffering code in Condor and will be
corrected to the expected semantics.
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
Version 6.1.12
Version 6.1.12 fixes a number of bugs from version 6.1.11.
If you linked your ``standard'' jobs with version 6.1.11, you should
upgrade to 6.1.12 and re-link your jobs (using condor_ compile) as soon as
possible.
New Features:
Bugs Fixed:
- A number of system calls that were not being trapped by the Condor
libraries in version 6.1.11 are now being caught and sent back to the
submit machine.
Not having these functions being executed as remote system calls prevented
a number of programs from working, in particular Fortran programs, and
many programs on IRIX and Solaris platforms.
- Sometimes submitted jobs report back as having no owner and have
-????- in the status line for the job. This has been fixed.
- condor_ q -io has been fixed in this release.
Known Bugs:
- The getrusage() call does not work properly inside
``standard'' jobs.
If your program uses getrusage(), it will not report correct values
across a checkpoint and restart.
If your program relies on proper reporting from getrusage(), you
should either use version 6.0.3 or 6.1.10.
- While Condor now supports many networking calls such as
socket() and connect(), (see the description below of this
new feature added in 6.1.11), on Linux, we cannot at this time support
gethostbyname() and a number of other database lookup calls.
The reason is that on Linux, these calls are implemented by bringing in a
shared library that defines them, based on whether the machine is using
DNS, NIS, or some other database method.
Condor does not support the way in which the C library tries to explicitly
bring in these shared libraries and use them.
There are a number of possible solutions to this problem, but the Condor
developers are not yet agreed on the best one, so this limitation might not
be resolved by 6.1.13.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- When writing output to a file, stat() and variant calls
will return zero for the size of the file if the program has not yet
read from the file or flushed the file descriptors.
This is a side effect of the file buffering code in Condor and will be
corrected to the expected semantics.
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
- The -format option in condor_ q has no effect when querying
remote machines with the -n option.
- condor_ dagman does not work at all in this release.
It exits immediately with a success status without performing any work.
It will be fixed in the next release of Condor.
Version 6.1.11
New Features:
- condor_ status outputs information for held jobs instead of
MaxRunningJobs when supplied with -schedd or -submitter.
- condor_ userprio now prints four-digit years (for Y2K compliance).
If you give a two-digit date, it will also assume that 1/1/00 is 1/1/2000
and not 1/1/1900.
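This two-digit-year handling amounts to a windowing rule; a sketch (only the 00 -> 2000 case is stated above, so the pivot year used here is an assumption for illustration):

```python
def expand_two_digit_year(yy, pivot=70):
    """Expand a two-digit year: 0 -> 2000, not 1900.

    The pivot year (70) is an assumption for illustration; the release
    notes only state that 00 maps to 2000.
    """
    return 1900 + yy if yy >= pivot else 2000 + yy

print(expand_two_digit_year(0))    # 2000
```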
- IRIX 6.5 has been added to the list of ports that now support
remote system calls and checkpointing.
- condor_ q has been fixed to be faster and much more memory
efficient. This is much more noticeable when getting the queue from
condor_ schedds that have more than 1000 jobs.
- Added support for socket() and pipe() in standard
jobs. Both sockets and pipes are created on the executing machine.
Checkpointing is deferred anytime a socket or pipe is open.
- Added limited support for select() and poll() in standard jobs.
Both calls will work only on files opened locally.
- Added limited support for fcntl() and ioctl() in standard jobs.
Both calls will be performed remotely if the control-number is understood
and the third argument is an integer.
- Replaced buffer implementation in standard jobs.
The new buffer code reads and writes variable sized chunks.
It will never issue a read to satisfy a write. Buffering is enabled
by default.
- Added extensive feedback on I/O performance in the user's email.
- Added -io option to condor_ q to show I/O statistics.
- Removed libckpt.a and libzckpt.a. To build for standalone
checkpointing, just do a regular condor_ compile.
No -standalone option is necessary.
- The checkpointing library now only re-opens files when they are
actually used. If files or other needed resources cannot be found
at restart time, the checkpointer will fail with a verbose error.
- The RemoteHost and LastRemoteHost attributes in
the job classad now contain hostnames instead of IP addresses and port
numbers. The -run option of older versions of condor_ q is not
compatible with this change.
- Condor will now automatically check for compatibility between
the version of the Condor libraries you have linked into a standard
job (using condor_ compile) and the version of the condor_ shadow
installed on your submit machine.
If they are incompatible, the condor_ shadow will now put your job on
hold.
Unless you set ``Notification = Never'' in your submit file, Condor
will also send you email explaining what went wrong and what you can
do about it.
- All Condor daemons and tools now have a CondorPlatform
string, which shows which platform a given set of Condor binaries was
built for.
In all places that you used to see CondorVersion, you will now
see both CondorVersion and CondorPlatform, such as in
each daemon's ClassAd, in the output to a -version option (if
supported), and when running ident on a given Condor binary.
This string can help identify situations where you are running the
wrong version of the Condor binaries for a given platform (for
example, running binaries built for Solaris 2.5.1 on a Solaris 2.6
machine).
- Added commented-out settings in the default
condor_config file we ship for various SMP-specific settings
in the condor_ startd.
Be sure to read section 3.13.7 on ``Configuring the
Startd for SMP Machines'' for
details about using these settings.
- condor_ rm, condor_ hold, and condor_ release all support
-help and -version options now.
Bugs Fixed:
- A race condition which could cause the condor_ shadow to not
exit when its job was removed has been fixed.
This bug would cause jobs that had been removed with condor_ rm to
remain in the queue marked as status ``X'' for a long time.
In addition, Condor would not shutdown quickly on hosts that had hit
this race condition, since the condor_ schedd wouldn't exit until all
of its condor_ shadow children had exited.
- A signal race condition during restart of a Condor job has
been fixed.
- In a Condor linked job, getdomainname() is now
supported.
- IRIX 6.5 can give negative time reports for how long a process has been
running. We now account for that in our statistics about usage times.
- The condor_ status memory error introduced in version 6.1.10
has been fixed.
- The DAEMON_LIST configuration setting is now case
insensitive.
- Fixed a bug where the condor_ schedd, under rare circumstances,
could cause another schedd's jobs not to be matched.
- The free disk space is now properly computed on Digital Unix.
This fixed problems where the Disk attribute in the
condor_ startd classad reported incorrect values.
- The config file parser now detects incremental macro definitions
correctly (see section 3.3.1). Previously, when a macro (or
expression) being defined was a substring of a macro (or expression)
being referenced in its definition, the reference would be erroneously
marked as an incremental definition and expanded immediately. The
parser now verifies that the entire strings match.
Known Bugs:
- The output for condor_ q -io is incorrect and will likely show
zeroes for all values. A fixed version will appear in the next release.
Version 6.1.10
New Features:
- condor_ q now accepts -format parameters like condor_ status.
- condor_ rm, condor_ hold and condor_ release accept
-constraint parameters like condor_ status.
- condor_ status now sorts displayed totals by the first column.
(This feature introduced a bug in condor_ status. See ``Known Bugs''
below.)
- Condor version 6.1.10 introduces ``clipped'' support for Sparc
Solaris version 2.7.
This version does not support checkpointing or remote system calls.
Full support for Solaris 2.7 will be released soon.
- Introduced code to enable Linux to use the standard C library's
I/O buffering again, instead of relying on the Condor I/O buffering
code (which is still in beta testing).
Bugs Fixed:
- The bug in checkpointing introduced in version 6.1.9 has been
fixed.
Checkpointing will now work on all platforms, as it always used to.
Any jobs linked with the 6.1.9 Condor libraries will need to be
relinked with condor_ compile once version 6.1.10 has been installed
at your site.
Known Bugs:
- The CondorLoadAvg attribute in the condor_ startd has
some problems in the way it is computed.
The CondorLoadAvg is somewhat inaccurate for the first minute a job
starts running, and for the first minute after it completes.
Also, the computation of CondorLoadAvg is very wrong on NT.
All of this will be fixed in a future version.
- A memory error may cause condor_ status to die with SIGSEGV
(segmentation violation) when displaying totals or cause incorrect
totals to be displayed. This will be fixed in version 6.1.11.
Version 6.1.9
New Features:
- Added full support for Linux 2.0.x and 2.2.x kernels using
libc5, glibc20 and glibc21.
This includes support for Red Hat 6.x, Debian 2.x and other popular
Linux distributions.
Whereas the Linux machines had once been fragmented across libc5 and
GNU libc, they have now been reunified.
This means there is no longer any need for the ``LINUX-GLIBC'' OpSys
setting in your pool: all machines will now show up as ``LINUX''.
Part of this reunification process was the removal of dynamically
linked user jobs on Linux.
condor_ compile now forces static linking of your Standard Universe
Condor jobs.
Also, please use condor_ compile on the same machine on which you
compiled your object files.
- Added condor_ qedit utility to allow users to modify job
attributes after submission. See the new manual page for details.
- Added -runforminutes option to daemonCore to have
the daemon gracefully shut down after the given number of minutes.
- Added support for statfs(2) and fstatfs(2) in user jobs. We support
only the fields
f_bsize, f_blocks, f_bfree, f_bavail, f_files, f_ffree from
the structure statfs. This is still in the experimental stage.
- Added the -direct option to condor_ status.
The -direct option takes a hostname; condor_ status will query the
condor_ startd on the specified host and display information directly
from there, instead of querying the condor_ collector.
See the manual page for details.
- Users can now define NUM_CPUS to override the automatic
computation of the number of CPUs in your machine.
Using this config setting can cause unexpected results, and is not
recommended.
This feature is only provided for sites that specifically want this
behavior and know what they are doing.
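As an illustration, overriding the automatic CPU count might look like
this in a condor_config file (the value 2 is an arbitrary example, not
a recommendation):

```
## Pretend this machine has exactly 2 CPUs, regardless of what
## Condor would detect automatically. Use with caution.
NUM_CPUS = 2
```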
- The -set and -rset options to condor_ config_val
have been changed to allow administrators to set both macros and
expressions.
Previously, condor_ config_val assumed you wanted to set
expressions.
Now, these two options each take a single argument, the string
containing exactly what you would put into the config file, so you can
specify you want to create a macro by including an ``='' sign, or an
expression by including a ``:''.
See section 3.3.1 for details on macros vs. expressions, and the
condor_ config_val man page for details on condor_ config_val.
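As a sketch of the distinction, the argument string passed to -set or
-rset uses the same syntax as a config file line, and the delimiter
determines what is created (the names below are invented examples):

```
## A macro definition: "=" separates name and value.
MY_MACRO = /usr/local/condor
## An expression definition: ":" separates name and value.
MY_EXPR : (LoadAvg < 0.3)
```

For example, ``condor_ config_val -set "MY_MACRO = /usr/local/condor"''
would create the macro shown above.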
- If the directory you specified for LOCK (which holds lock files
used by Condor) doesn't exist, Condor will now try to create that
directory for you instead of giving up right away.
- If you change the COLLECTOR_HOST setting and reconfig
the condor_ startd, the startd will ``invalidate'' its ClassAds at
the old collector before it starts reporting to the new one.
Bugs Fixed:
- Fixed a major bug dealing with the group access a Condor job is
started with.
Now, Condor jobs are started with all the groups the job's owner is
in, not just their default group.
This also fixes a security hole where user jobs could be started up in
access groups they didn't belong to.
- Fixed a bug where there was a needless limitation on the number of open
file descriptors a user job could have.
- Fixed a standalone checkpointing bug where we weren't blocking signals
in critical sections and causing file table corruption at checkpoint
time.
- Fixed a linker bug on Digital Unix 4.0 concerning fortran where
the linker would fail on __uname and __sigsuspend.
- Fixed a bug in condor_ shadow that would send incorrect job
completion email under Linux.
- Fixed a bug in the remote system call of fchdir() that caused
a garbage file descriptor to be used in Standard Universe jobs.
- Fixed a bug in the condor_ shadow which was causing condor_ q
-goodput to display incorrect values for some jobs.
- Fixed some minor bugs and made some minor enhancements in the
condor_ install script.
The bugs included a typo in one of the questions asked, and incorrect
handling for the answers of a few different questions.
Also, if DNS is misconfigured on your system, condor_ install will
try several methods to find your fully qualified hostname, and if it
still cannot determine the correct hostname, it will prompt you for it.
In addition, we now avoid one installation step in cases where it is
not needed.
- Fixed a rare race condition that could delay the completion of
large clusters of short running jobs.
- Added more checking to the various arguments that might be
passed to condor_ status, so that in the case of bad input,
condor_ status will print an error message and exit, instead of
performing a segmentation fault.
Also, when you use the -sort option, condor_ status will only
display ClassAds where the attributes you use to sort are defined.
- Fixed a bug in the handling of the config files created by
using the -set or -rset options to condor_ config_val.
Previously, if you manually deleted the files that were created, you
could cause the affected Condor daemon to have a segmentation fault.
Now, the daemons simply exit with a fatal error but still have a
chance to clean up.
- Fixed a bug in the -negotiator option for most Condor
tools that was causing it to get the wrong address.
- Fixed a couple of bugs in the condor_ master that could cause
improper shutdowns.
There were cases during shutdown where we would restart a daemon
(because we previously noticed a new executable, for example).
Now, once you begin a shutdown, the condor_ master will not restart
anything.
Also, fixed a rare bug that could cause the condor_ master to stop
checking the timestamps on a daemon.
- Fixed a minor bug in the -owner option to
condor_ config_val that was causing condor_ init not to work.
- Fixed a bug where the condor_ startd, while it was already
shutting down, was allowing certain actions to succeed that should
have failed.
For example, it allowed itself to be matched with a user looking for
available machines, or to begin a new PVM task.
Known Bugs:
- The CondorLoadAvg attribute in the condor_ startd has
some problems in the way it is computed.
The CondorLoadAvg is somewhat inaccurate for the first minute a job
starts running, and for the first minute after it completes.
Also, the computation of CondorLoadAvg is very wrong on NT.
All of this will be fixed in a future version.
- There is a serious bug in checkpointing when using Condor's
I/O buffering for ``standard'' jobs.
By default, Linux uses Condor buffering in version 6.1.9 for all
standard jobs.
The bug prevents checkpointing from working more than once.
This renders the condor_ vacate and condor_ checkpoint commands
useless, and jobs will just be killed without a checkpoint when
machine owners come back to their machines.
Version 6.1.8
- Added file_remaps as a command in the job submit file for STANDARD
universe jobs.
A job can now specify that accesses of one file should be remapped to
another file.
In addition, you can specify that particular files should be read from
the local machine.
See the condor_ submit manual page for more details.
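A submit-description sketch of this feature might look as follows; the
filenames are invented, and the exact remap syntax shown is an
assumption, not confirmed by this manual section:

```
## Hypothetical STANDARD universe submit file fragment.
universe    = standard
executable  = my_app
## Remap accesses of "big.dat" to a different path (assumed syntax).
file_remaps = "big.dat = /scratch/big.dat"
queue
```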
- Added buffer_size and buffer_block_size so that STANDARD
universe jobs can specify that they wish to have I/O buffering turned on.
Without buffering, all I/O requests in the STANDARD universe are sent back
over the network to be executed on the submit machine.
With buffering, read ahead, write behind, and seek batch buffering is
performed to minimize network traffic and latency.
By default, jobs do not specify buffering; however, in many
situations, buffering can drastically increase throughput.
See the condor_ submit manual page for more details.
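For example, a submit file might enable I/O buffering like this; the
sizes are arbitrary illustrations, and the precise meaning of each
knob is assumed from its name:

```
## Hypothetical submit file fragment enabling I/O buffering
## for a STANDARD universe job.
universe          = standard
executable        = my_app
buffer_size       = 524288   # assumed: total buffer space, in bytes
buffer_block_size = 32768    # assumed: size of each block, in bytes
queue
```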
- The condor_ schedd is much more memory-efficient when handling
clusters with hundreds or thousands of jobs.
If you submit large clusters, your submit machine will only use a fraction
of the amount of RAM it used to require.
NOTE: The memory savings will only be realized for new clusters submitted
after the upgrade to v6.1.8 - clusters which previously existed in the
queue at upgrade time will still use the same amount of RAM in the
condor_ schedd.
- Submitting jobs, especially submitting large clusters containing many
jobs, is much faster.
- Added a -goodput option to condor_ q, which displays
statistics about the execution efficiency of STANDARD universe jobs.
- Added FS_REMOTE method of user authentication to possible values
of the configuration option AUTHENTICATION_METHODS to fix problems
with using the -r remote scheduler option of condor_ submit.
Additionally, the user authentication protocol has changed, so previous
versions of Condor programs cannot co-exist with this new protocol.
- Added a new utility and documentation for condor_ glidein which uses
Globus resources to extend your local pool to use remote Globus machines as
part of your Condor pool.
- Fixed more bugs in the handling of the stat() system call
and its relatives on Linux with glibc.
This was causing problems mainly with Fortran I/O, though other I/O
related problems on glibc Linux will probably be solved now.
- Fixed a bug in various Condor tools (condor_ status,
condor_ user_prio, condor_ config_val, and condor_ stats) that
would cause them to seg fault on bad input to the -pool option.
- Fixed a bug with the -rset option to condor_ config_val which
could crash the Condor daemon whose configuration was being changed.
- Added allow_startup_script command to the job submit
description file which is given to condor_ submit. This allows the
submission of a startup script to the STANDARD universe.
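A minimal sketch of how this might be used in a submit file; the
script name is an invented example:

```
## Hypothetical submit file fragment: the executable is a startup
## script rather than a condor_compile-linked binary.
universe             = standard
executable           = setup_and_run.sh
allow_startup_script = True
queue
```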
- Fixed a bug in the condor_ schedd where it would get into an
infinite loop if the persistent log of the job queue got corrupted.
The condor_ schedd now correctly handles corrupted log files.
- The full release tar file now contains a dagman
subdirectory in the examples directory.
This subdirectory includes an example DAGMan job, including a README
(in both ASCII and HTML), a Makefile, and so on.
- Condor will now insert an environment variable, CONDOR_VM, into
the environment of the user job.
This variable specifies which SMP ``virtual machine'' the job was started on.
It will equal either vm1, vm2, vm3, ... , depending upon which virtual
machine was matched.
On a non-SMP machine, CONDOR_VM will always be set to vm1.
- Fixed some timing bugs introduced in v6.1.6 which could occur when
Condor tries to simultaneously start a large number of jobs submitted from a
single machine.
- Fixed bugs when Condor is told to gracefully shut down; Condor no
longer starts up new jobs when shutting down. Also, the condor_ schedd
progressively checkpoints running jobs during a graceful shutdown instead of
trying to vacate all the jobs simultaneously. The rate at which the shutdown
occurs is controlled by the JOB_START_DELAY configuration
parameter.
- Fixed a bug which could cause the condor_ master process to exit if
the Condor daemons have been hung for a while by the operating system (if,
for instance, the LOG directory was placed on an NFS volume and the NFS
server is down for an extended period).
- Previously, removing a large number of jobs with condor_ rm would
result in the condor_ schedd being unresponsive for a period of time
(perhaps leading to timeouts when running condor_ q). The condor_ schedd
has been improved to multitask the removal of jobs while servicing new
requests.
- Added new configuration parameter COLLECTOR_SOCKET_BUFSIZE
which controls the size of TCP/IP buffers used by the condor_ collector.
For more information, see the description of
COLLECTOR_SOCKET_BUFSIZE in the manual.
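A sketch of how this setting might appear in a condor_config file; the
value is an arbitrary example, not a recommendation:

```
## Hypothetical condor_config fragment: enlarge the TCP/IP socket
## buffers used by the condor_ collector (value in bytes, assumed).
COLLECTOR_SOCKET_BUFSIZE = 1048576
```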
- Fixed a bug with the -analyze option to condor_ q: in some
cases, the RANK expression would not be evaluated correctly. This could
cause the output from -analyze to be in error.
- Fixed bugs in computing the system load average on multi-CPU
(SMP) Hewlett-Packard machines.
- Fixed bug in condor_ q which could cause the RUN_TIME reported to
be temporarily incorrect when jobs first start running.
- The condor_ startd no longer rapidly sends multiple ClassAds one
right after another to the Central Manager when its state/activity is in
rapid transition. Also, on SMP machines, the condor_ startd will only send
updates for 4 nodes per second (to avoid overflowing the central manager when
reporting the state of a very large SMP machine with dozens of CPUs).
- Reading a parameter with condor_ config_val is now allowed from any
machine with Host-IP READ permission.
Previously, you needed ADMINISTRATOR permission.
Of course, setting a parameter still requires ADMINISTRATOR permission.
- Worked around a bug in the StreamTokenizer Java class from Sun
that we use in the CondorView client Java applet.
The bug would cause errors if usernames or hostnames in your pool
contained ``-'' or ``_'' characters.
The CondorView applet now gets around this and properly displays all
data, including entries with the ``bad'' characters.
Version 6.1.7
NOTE: Version 6.1.7 only adds support for platforms not supported in
6.1.6.
There are no bug fixes, so there are no binaries released for any
other platforms.
You do not need 6.1.7 unless you are using one of the two platforms we
released binaries for.
- Added ``clipped'' support for Alpha Linux machines running the
2.0.X kernel and glibc 2.0.X (such as Red Hat 5.X).
We do not yet support checkpointing and remote system calls on this
platform, but we can start ``vanilla'' jobs.
See section 2.4.1 for details on vanilla vs. standard jobs.
- Re-added support for Intel Linux machines running the 2.0.X
Linux kernel, glibc 2.0.X, using the GNU C compiler (gcc/g++ 2.7.X) or
the EGCS compilers (versions 1.0.X, 1.1.1 and 1.1.2).
This includes Red Hat 5.X, and Debian 2.0.
Red Hat 6.0 and Debian 2.1 are not yet supported, since they use
glibc 2.1.X and the 2.2.X Linux kernel.
Future versions of Condor will support all combinations of kernels,
compilers and versions of libc.
Version 6.1.6
- Added file_remaps as command in the job submit file given to
condor_ submit.
This allows the user to explicitly specify where to find a given file (e.g.
either on the submit or execute machine), as well as remap file access to a
different filename altogether.
- Changed the way the condor_ master spawns its daemons and
condor_ preen, allowing you to specify command-line arguments for
any of them through a SUBSYS_ARGS setting.
Previously, when you specified PREEN, you added the command-line
arguments directly to that setting, but that caused some
problems, and only worked for condor_ preen.
Once you upgrade to version 6.1.6, if you continue to use your
old condor_config files, you must change the PREEN
setting to remove any arguments you have defined and place those
arguments into a separate config setting, PREEN_ARGS.
See section 3.3.9, ``condor_ master
Config File Entries'', for more details.
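The change might look like this in a condor_config file; the -m and -r
arguments are illustrative condor_ preen options, used here only as an
example:

```
## Old style (no longer supported): arguments embedded in PREEN.
##   PREEN = $(SBIN)/condor_preen -m -r
## New style: the binary and its arguments are separate settings.
PREEN      = $(SBIN)/condor_preen
PREEN_ARGS = -m -r
```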
- Fixed a very serious bug in the Condor library linked in with
condor_ compile to create standard jobs that was causing
checkpointing to fail in many cases.
Any jobs that were linked with the 6.1.5 Condor libraries should
probably be removed, re-linked, and re-submitted.
- Fixed a bug in condor_ userprio that was introduced in version
6.1.5 that was preventing it from finding the address of the
condor_ negotiator for your pool.
- Fixed a bug in condor_ stats that was introduced in version
6.1.5 that was preventing it from finding the address of the
condor_ collector for your pool.
- Fixed a bug in the way the -pool option was handled by
many Condor tools that was introduced in version 6.1.5.
- condor_ q now displays job allocation time by default, instead
of displaying CPU time.
Job allocation time, or RUN_TIME, is the amount of wall-clock time the job
has spent running.
Unlike CPU time information which is only updated when a job is
checkpointed, the allocation time displayed by condor_ q is continuously
updated, even for vanilla universe jobs.
By default, the allocation time displayed will be the total time across all
runs of the job.
The new -currentrun option to condor_ q can be used to display the
allocation time for solely the current run of the job.
Additionally, the -cputime option can be used to view job CPU times as
in earlier versions of Condor.
- condor_ q will display an error message if there is a timeout
fetching the job queue listing from a condor_ schedd. Previously,
condor_ q would simply list the queue as empty upon a communication error.
- The condor_ schedd daemon has been updated to verify all queue access
requests via Condor's IP/Host-Based Security mechanism (see
section 3.6.8).
- Fixed a bug on platforms which require the condor_ kbdd (currently
Digital Unix and IRIX).
This bug could have allowed Condor to start a job within the first five
minutes after the Condor daemons had been started, even if a user was
typing on the keyboard.
- condor_ release now gives an error message if the user tries to
release a job which either does not exist or is not in the hold state.
- Added a new config file parameter, USER_JOB_WRAPPER , which
allows administrators to specify a file to act as a ``wrapper'' script
around all jobs started by Condor.
See section 3.3.14 for more details.
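A minimal sketch of this setting in a condor_config file; the wrapper
path is an invented example:

```
## Hypothetical condor_config fragment: run every job through a
## site-provided wrapper script.
USER_JOB_WRAPPER = /usr/local/condor/libexec/job_wrapper.sh
```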
- condor_ dagman now permits the backslash character (``\'') to be used
as a line-continuation character for DAG Input Files, just like the
condor_ config files.
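For example, a long entry in a DAG input file can be split across
lines; the node and file names below are invented:

```
# A JOB entry split across two lines with a backslash continuation.
JOB NodeA \
    nodeA.submit
```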
- The Condor version string is now included in all Condor
libraries.
You can now run ident on any program linked with
condor_ compile to view which version of the Condor libraries you
linked with.
In addition, the format of the version string changed in 6.1.6.
Now, the identifier used is ``CondorVersion'' instead of ``Version''
to prevent any potential ambiguity.
Also, the format of the date changed slightly.
- The SMP startd can now handle dynamic reconfiguration of the
number of each type of virtual machine being reported.
This allows you, during the normal running of the startd, to increase
or decrease the number of CPUs that Condor is using.
If you reconfigure the startd to use fewer CPUs than it currently has
under its control, it will first remove CPUs that have no Condor jobs
running on them.
If more CPUs need to be evicted, the startd will checkpoint jobs and
evict them in reverse rank order (using the startd's Rank
expression).
So, the lower the value of the rank, the more likely a job will be
kicked off.
- The SMP startd contrib module's condor_ starter no longer makes
a call that was causing warning messages about ``ERROR: Unknown System
Call (-58) - system call not supported by Condor'' when used with the
6.0.X condor_ shadow.
This was a harmless call, but removing the call prevents the error
message.
- The SMP contrib module now includes the condor_ checkpoint and
condor_ vacate programs, which allow you to vacate or checkpoint jobs
on individual CPUs on the SMP, instead of checkpointing or vacating
everything.
You can now use ``condor_ vacate vm1@hostname'' to just vacate the
first virtual machine, or ``condor_ vacate hostname'' to vacate all
virtual machines.
- Added support for SMP Digital Unix (Alpha) machines.
- Fixed a bug that was causing an overflow in the computation of
free disk and swap space on Digital Unix (Alpha) machines.
- The condor_ startd and condor_ schedd now can ``invalidate''
their classads from the collector.
So, when a daemon is shut down, or a machine is reconfigured to
advertise fewer virtual machines, those changes will be instantly
visible with condor_ status, instead of having to wait 15 minutes for
the stale classads to time-out.
- The condor_ schedd no longer forks a child process (a ``schedd
agent'') to claim available condor_ startds.
You should no longer see multiple condor_ schedd processes running on
your machine after a negotiation cycle.
This is now accomplished in a non-blocking manner within the
condor_ schedd itself.
- The startd now adds a VirtualMachineID attribute to
each virtual machine classad it advertises.
This is just an integer, starting at 1, and increasing for every
different virtual machine the startd is representing.
On regular hosts, this is the only ID you will ever see.
On SMP hosts, you will see the ID climb up to the number of different
virtual machines reported.
This ID can be used to help write more complex policy expressions on
SMP hosts, and to easily identify which hosts in your pool are in fact
SMP machines.
- Modified the output for condor_ q -run for scheduler and PVM
universe jobs. The host where the scheduler universe job is running
is now displayed correctly. For PVM jobs, a count of the current
number of hosts where the job is running is displayed.
- Fixed the condor_ startd so that it no longer prints lots of
ProcAPI errors to the log file when it is being run as non-root.
- FS_PATHNAME and VOS_PATHNAME are no longer
used. AFS support now works similarly to NFS support, via the
FILESYSTEM_DOMAIN macro.
- Fixed a minor bug in the Condor.pm perl module that was
causing it to be case-sensitive when parsing the Condor submit file.
Now, the perl module is properly case-insensitive, as indicated in the
documentation.
Version 6.1.5
- Fixed a nasty bug in condor_ preen that would cause it to
remove files it shouldn't remove if the condor_ schedd and/or
condor_ startd were down at the time condor_ preen ran.
This was causing jobs to mysteriously disappear from the job queue.
- Added preliminary support to Condor for running on machines with
multiple network interfaces.
On such machines, users can specify the IP address Condor should use
in the NETWORK_INTERFACE config file parameter on each host.
In addition, if the pool's central manager is on such a machine, set
the CM_IP_ADDR parameter to the IP address you wish to use on that
machine.
See section 3.7.2 for more details.
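A sketch of these settings for a multi-homed host; the addresses are
invented examples:

```
## Hypothetical condor_config fragment for a machine with multiple
## network interfaces: bind Condor to one specific address.
NETWORK_INTERFACE = 192.168.1.10
## On the pool's central manager only:
CM_IP_ADDR = 192.168.1.10
```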
- The support for multiple network interfaces introduced bugs in
condor_ userprio, condor_ stats, CondorPVM, and the -pool
option to many Condor tools.
All of these will be fixed in version 6.1.6.
- Fixed a bug in the remote system call library that was
preventing certain Fortran operations from working correctly on
Linux.
- The Linux binaries for GLIBC we now distribute are compiled on a
Red Hat 5.2 machine.
If you are running this version of Red Hat, you might have better luck
with the dynamically linked version of Condor than with previous
releases of Condor.
Sites using other GLIBC Linux distributions should continue to use the
statically linked version of Condor.
- Fixed a bug in the condor_ shadow that could cause it to die
with signal 11 (segmentation violation) under certain rare
circumstances.
- Fixed a bug in the condor_ schedd that could cause it to die
with signal 11 (segmentation violation) under certain rare
circumstances.
- Fixed a bug in the condor_ negotiator that could cause it to
die with signal 8 (floating point exception) on Digital Unix
machines.
- The following shadow parameters have been added to control
checkpointing: COMPRESS_PERIODIC_CKPT ,
COMPRESS_VACATE_CKPT , PERIODIC_MEMORY_SYNC ,
SLOW_CKPT_SPEED. See section 3.3.12 for more details.
In addition, the shadow now honors the CkptWanted flag in a job
classad, and if it is set to ``False'', the job will never
checkpoint.
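An illustrative sketch of these controls in a condor_config file; the
values are examples rather than recommendations, and the type and
units of SLOW_CKPT_SPEED are assumptions:

```
## Hypothetical condor_config fragment for checkpoint control.
COMPRESS_PERIODIC_CKPT = True
COMPRESS_VACATE_CKPT   = True
PERIODIC_MEMORY_SYNC   = True
SLOW_CKPT_SPEED        = 1024   # assumed units: kilobytes per second
```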
- Fixed a bug in the condor_ startd that could cause it to
report negative values for the CondorLoadAvg on rare occasions.
- Fixed a bug in the condor_ startd that could cause it to die
with a fatal exception in situations where the act of getting claimed
by a remote schedd failed for some reason.
This resulted in the condor_ startd exiting on rare occasions with a
message in its log file to the effect of ERROR ``Match timed
out but not in matched state''.
- Fixed a bug in the condor_ schedd that under rare circumstances
could cause a job to be left in the ``Running'' state even after the
condor_ shadow for that job had exited.
- Fixed a bug in the condor_ schedd and various tools that
prevented remote read-only access to the job queue from working.
So, for example, condor_q -name foo, if run on any machine
other than foo, wouldn't display any jobs from foo's queue.
This fix re-enables the following condor_ q options:
submitter, name, global, etc.
- Changed the condor_ schedd so that when starting jobs, it
always sorts on the cluster number, in addition to the date the jobs
were enqueued and the process number within clusters, so that if many
clusters were submitted at the same time, the jobs are started in
order.
- Fixed a bug in condor_ compile that was modifying the
PATH environment variable by adding things to the front of it.
This would potentially cause jobs to be compiled and linked with a
different version of a compiler than they thought they were getting.
- Minor change in the way the condor_ startd handles the
D_ LOAD and D_ KEYBOARD debug flags.
Now, each one, when set, will only display every
UPDATE_INTERVAL , regardless of the startd state.
If you wish to see the values for keyboard activity or load average
every POLLING_INTERVAL , you must enable D_ FULLDEBUG.
Version 6.1.4
- Fixed a bug in the socket communication library used by Condor
that was causing daemons and tools to die on some platforms (notably,
Digital Unix) with signal 8, SIGFPE (floating point exception).
- Fixed a bug in the usage message of many Condor tools that
mentioned a -all option that isn't yet supported.
This option will be supported in future versions of Condor.
- Fixed a bug in the filesystem authentication code used to
authenticate operations on the job queue that left empty temporary
files in /tmp.
These files are now properly removed after they are used.
- Fixed a minor bug in the totals condor_ status displays when
you use the ckptsrvr option.
- Fixed a minor syntax error in the condor_ install script that
would cause warnings.
- The Condor.pm Perl module is now included in the
lib directory of the main release directory.
Version 6.1.3
NOTE: There are a lot of new, unstable features in 6.1.3.
PLEASE do not install all of 6.1.3 on a production pool.
Almost all of the bug fixes in 6.1.3 are in the condor_ startd or
condor_ starter, so, unless you really know what you're doing, we
recommend you upgrade just the SMP-Startd contrib module, not the
entire 6.1.3 release.
- Owners can now specify how the SMP-Startd partitions the system
resources into the different types and numbers of virtual machines,
specifying the number of CPUs, megs of RAM, megs of swap space, etc.,
in each.
Previously, each virtual machine reported to Condor from an SMP
machine always had one CPU, and all shared system resources were
evenly divided among the virtual machines.
- Fixed a bug in the reporting of virtual memory and disk space on
SMP machines where each virtual machine represented was advertising
the total in the system for itself, instead of its own share.
Now, both the totals, and the virtual machine-specific values are
advertised.
- Fixed a bug in the condor_ starter when it was trying to
suspend jobs.
While we always killed all of the processes when we were trying to
vacate, if a vanilla job forked, the starter would sometimes fail to
suspend some of the child processes.
In addition, we could sometimes fail to suspend a standard universe
job as well.
This is all fixed.
- Fixed a bug in the SMP-Startd's load average computation that
could cause processes spawned by Condor to not be associated with the
Condor load average.
This would cause the startd to over-estimate the owner's load average,
and under-estimate the Condor load, which would cause a cycle of
suspending and resuming a Condor job, instead of just letting it run.
- Fixed a bug in the SMP-Startd's load average computation that
could cause certain rare exceptions to be treated as fatal, when in
fact, the Startd could recover from them.
- Fixed a bug in the computation of the total physical memory on
some platforms that was resulting in an overflow on machines with
lots of ram (over 1 gigabyte).
- Fixed some bugs that could cause condor_ starter processes to
be left as zombies underneath the condor_ startd under very rare
conditions.
- For sites using AFS, if there are problems in the
condor_ startd computing the AFS cell of the machine it's running on,
the startd will exit with an error message at start-up time.
- Fixed a minor bug in condor_ install that would lead to a
syntax error in your config file given a certain set of installation
options.
- Added the -maxjobs option to the condor_ submit_dag
script that can be used to specify the maximum number of jobs Condor
will run from a DAG at any given time.
Also, condor_ submit_dag automatically creates a ``rescue DAG''.
See section 2.11 for details on DAGMan.
- Fixed bug in ClassAd printing when you tried to display an
integer or float attribute that didn't exist in the given ClassAd.
This could show up in condor_ status, condor_ q, condor_ history,
etc.
- Various commands sent to the Condor daemons now have separate
debug levels associated with them.
For example, commands such as ``keep-alives'', and the command sent
from the condor_ kbdd to the condor_ startd, are only seen in the
various log files if D_ FULLDEBUG is turned on, instead of at
D_ COMMAND, which is now enabled by default for all daemons on
all platforms.
Administrators retaining their old configuration when upgrading to
this version are encouraged to enable D_ COMMAND in the
SCHEDD_DEBUG setting.
In addition, for IRIX and Digital Unix machines, it should be enabled
in the STARTD_DEBUG setting as well.
See section 3.3.4 for details on debug levels in Condor.
- New debug levels added to Condor:
- D_ NETWORK, used by various daemons in Condor to report
various network statistics about the Condor daemons.
- D_ PROCFAMILY, used to report information about various
families of processes that are monitored by Condor.
For example, this is used in the condor_ startd when monitoring the
family of processes spawned by a given user job for the purposes of
computing the Condor load average.
- D_ KEYBOARD, used by the condor_ startd to print out
statistics about remote tty and console idle times in the
condor_ startd.
This information used to be logged at D_ FULLDEBUG, along with
everything else, so now, you can see just the idle times, and/or have
the information stored to a separate file.
- Added a -run option to condor_ q, which displays
information for running jobs, including the remote host where each job
is running.
- Macros can now be incrementally defined. See
section 3.3.1 for more details.
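An incremental definition lets a macro reference its own previous
value; the macro name and values below are invented examples:

```
## Incremental definition in a condor_config file: the second line
## appends to the macro's previous value instead of replacing it.
FLAGS = -O2
FLAGS = $(FLAGS) -g
## FLAGS now expands to "-O2 -g"
```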
- condor_ config_val can now be used to set configuration
variables. See the man page for more details.
- The job log file now contains a record of network activity. The
evict, terminate, and shadow exception events indicate the number of
bytes sent and received by the job for the specific run.
The terminate event additionally indicates totals for the life of the
job.
- STARTER_CHOOSES_CKPT_SERVER now defaults to true.
See section 3.3.8 for more details.
- The infrastructure for authentication within Condor has been
overhauled, allowing for much greater flexibility in supporting new
forms of authentication in the future.
This means that the 6.1.3 schedd and queue management tools (like
condor_ q, condor_ submit, condor_ rm and so on) are incompatible
with previous versions of Condor.
- Many of the Condor administration tools have been improved to
allow you to specify the ``subsystem'' you want them to affect.
For example, you can now use ``condor_ reconfig -startd'' to just
have the startd reconfigure itself.
Similarly, condor_ off, condor_ on and condor_ restart can now all
work on a single daemon, instead of machine-wide.
See the man pages (section 9) or run any command with -help
for details.
NOTE: The usage message in 6.1.3 incorrectly reports -all as a
valid option.
- Fixed a bug in the Condor tools that could cause a segmentation
violation in certain rare error conditions.
Version 6.1.2
- Fixed some bugs in the condor_ install script.
Also, enhanced condor_ install to customize the path to perl in
various perl scripts used by Condor.
- Fixed a problem with our build environment that left some files
out of the release.tar files in the binary releases on some
platforms.
- condor_dagman, ``DAGMan'' (see section 2.11 for details), is
now included in the development release by default.
- Fixed a bug in the computation of total physical memory on
HP-UX machines that resulted in an overflow on machines with
large amounts of RAM (over 1 gigabyte).
Also, if you define ``MEMORY'' in your config file, that value will
override whatever value Condor computes for your machine.
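For example, to override the detected value in the config file (assuming, as in Condor convention, that MEMORY is given in megabytes):

```
## Tell Condor this machine has 2 gigabytes of physical memory
MEMORY = 2048
```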
- Fixed a bug in condor_starter.pvm, the PVM version of the
Condor starter (available as an optional ``Contrib module''), that
occurred when STARTER_LOCAL_LOGGING was disabled.
Now, setting it to ``False'' will properly place debug messages
from condor_starter.pvm into the ShadowLog file of the
machine that submitted the job (as opposed to the StarterLog
file on the machine executing the job).
Version 6.1.1
- Fixed a bug in the condor_startd's computation of the load
average caused by Condor, which was producing incorrect values.
This could cause a cycle of continuous job suspends and resumes.
- Beginning with this version, any jobs linked with the Condor
checkpoint libraries will use the zlib compression code (used by gzip
and others) to compress periodic checkpoints before they are written
to the network.
These compressed checkpoints are uncompressed at startup time.
This saves network bandwidth, disk space, and time (when the
network is the bottleneck for checkpointing, which it usually is).
In future versions of Condor, all checkpoints will probably be
compressed, but at this time, compression is only used for periodic
checkpoints.
Note that you must relink your jobs with the condor_compile command
to enable this feature.
Old jobs (not relinked) will continue to run just fine; their
checkpoints simply will not be compressed.
- condor_status now has better support for displaying checkpoint
server ClassAds.
- More contrib modules from the development series are now
available, such as the checkpoint server, PVM support, and the
CondorView server.
- Fixed some minor bugs in the UserLog code that were causing
problems for DAGMan in exceptional error cases.
- Fixed an obscure bug in the logging code when D_PRIV was
enabled that could result in incorrect file permissions on log files.
Version 6.1.0
- Support has been added to the condor_startd to run multiple
jobs on SMP machines.
See section 3.13.7 for details about setting up and configuring SMP
support.
- The expressions that control the condor_startd policy for
vacating jobs have been simplified.
See section 3.5 for complete details on the new policy expressions,
and section 3.5.11 for an explanation of what is different from the
version 6.0 expressions.
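A minimal sketch of such policy expressions in the config file (the expression names follow the startd policy documentation in section 3.5; the thresholds here are illustrative, not recommended values):

```
## Start jobs only when the keyboard has been idle and the load is low
START    = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
## Suspend a running job as soon as the keyboard is touched
SUSPEND  = KeyboardIdle < $(MINUTE)
## Resume the job after the keyboard has been idle again for a while
CONTINUE = KeyboardIdle > 5 * $(MINUTE)
```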
- We now perform better tracking of processes spawned by Condor.
If children die and are inherited by init, we still know they belong
to Condor.
This allows us to better ensure that we do not leave processes lying
around when we need to get off a machine, and enables a much more
accurate computation of the load average generated by Condor (the
CondorLoadAvg reported by the condor_startd).
- The condor_collector can now store historical information
about your pool state.
This information can be queried with the condor_stats program (see
the man page), which is used by the condor_view Java GUI, available
as a separate contrib module.
- Condor jobs can now be put in a ``hold'' state with the
condor_hold command.
Such jobs remain in the job queue (and can be viewed with condor_q),
but there will not be any negotiation to find machines for them.
If a job is having a temporary problem (for example, the permissions
are wrong on files it needs to access), it can be put on hold until
the problem is solved.
Jobs put on hold can be released with the condor_release command.
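For example (job 12.0 is a hypothetical cluster.proc id):

```
% condor_hold 12.0       # job stays in the queue but is not matched
% condor_release 12.0    # job becomes eligible for matching again
```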
- condor_userprio now has the notion of user factors as a
way to place different groups of users at different priority levels.
See section 3.4 for details.
This includes the ability to specify a local priority domain, where
all users from other domains get a much worse priority.
- Usage statistics by user are now available from
condor_userprio.
See the man page for details.
- The condor_schedd has been enhanced to enable ``flocking'',
where it seeks matches with machines in multiple pools if its requests
cannot be serviced in the local pool.
See section 5.2 for more details.
- The condor_schedd has been enhanced to give condor_q and
other interactive tools better response time.
- The condor_schedd has also been enhanced to check the
permissions of the files you specify for input, output, error, and
so on.
If the schedd does not have the required access rights to the files,
the jobs will not be submitted, and condor_submit will print an
error message.
- When you perform a condor_rm command and the removed job
was using a ``user log'', the remove event is now recorded in the
log.
- Two new attributes have been added to the job ClassAd when it
begins executing: RemoteHost and LastRemoteHost.
These attributes list the IP address and port of the startd that is
currently running the job, or of the last startd to run the job (if
it has run on more than one machine).
This information helps users track their job's execution more closely,
and allows administrators to troubleshoot problems more effectively.
- The performance of checkpointing was increased by using larger
buffers for the network I/O used to get the checkpoint file on and off
the remote executing host (this helps for all pools, with or without
checkpoint servers).