8.10 Development Release Series 6.1
This was the first development release series.
It contains numerous enhancements over the 6.0 stable series.
For example:
- Support for running multiple jobs on SMP machines
- Enhanced functionality for pool administrators
- Support for PVM, MPI and Globus jobs
- Support for Flocking jobs across different Condor pools
The 6.1 series has many other improvements over the 6.0 series, and
is available on more platforms.
The new features, bugs fixed, and known bugs of each version are
described below in detail.
Version 6.1.17
This version is the 6.2.0 ``release candidate''.
It was publicly released in February of 2001, and it will be released
as 6.2.0 once it is considered ``stable'' after heavy testing in the
UW-Madison Computer Science Department Condor pool.
New Features:
- Hostnames in the HOSTALLOW and HOSTDENY entries are now case-insensitive.
- It is now possible to submit NT jobs from a UNIX machine.
- The NT release of Condor now supports a USE_VISIBLE_DESKTOP parameter.
If true, Condor will allow the job to create windows on the desktop of the
execute machine, so a user at that machine can interact with the job. This is
particularly useful for debugging why an application will not run under Condor.
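For example, a pool administrator might enable this with a config entry like the following (a sketch; the comment is illustrative):

```
# condor_config on the NT execute machine
USE_VISIBLE_DESKTOP = True
```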
- The condor_ startd contains support for the new MPI dedicated
scheduler that will appear in the 6.3 development series. This will allow
you to use your 6.2 Condor pool with the new scheduler.
- Added a mixedcase option to condor_ config_val to allow
overriding the default behavior of lowercasing all configuration names.
- Added a pid_snapshot_interval option to the config file to
control how often the condor_ startd should examine the running
process family. It defaults to 50 seconds.
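In the config file, this might look like the following (the upper-case spelling is an assumption, following the usual config-entry convention):

```
# Seconds between condor_ startd scans of the running process family
PID_SNAPSHOT_INTERVAL = 50
```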
Bugs Fixed:
- Fixed a bug in how the condor_ schedd, upon reaching the MAX_JOBS_RUNNING
limit, calculated Scheduler Universe jobs for preemption.
- Fixed a bug in the condor_ schedd losing track of condor_ startds
in the initial claiming phase. This bug affected all platforms, but was most
likely to manifest on Solaris 2.6.
- CPU time can be greater than wall clock time in multi-threaded
applications, so this is no longer considered an error in the UserLog.
- condor_ restart -master now works correctly.
- Fixed a rare condition in the condor_ startd that could corrupt
memory and result in a signal 11 (SIGSEGV, or segmentation violation).
- Fixed a bug that would cause the ``execute event'' to not be
logged to the UserLog if the binary for the job resided on AFS.
- Fixed a race-condition in Condor's PVM support on SMP machines
(introduced in version 6.1.16) that caused PVM tasks to be associated
with the wrong daemon.
- Better handling of checkpointing on large-memory Linux machines.
- Fixed occasional failures to send job completion email.
- It is no longer possible to use condor_ userprio to set a priority of less
than 1.
- Fixed a bug in the job completion email statistics.
Run Time was being underreported when the job completed after doing a
periodic checkpoint.
- Fixed a bug that caused CondorLoadAvg to get stuck at 0.0 on
Linux when the system clock was adjusted.
- Fixed a condor_ submit bug that caused all machine_count
commands after the first queue statement to be ignored for PVM jobs.
- PVM tasks now run as the user when appropriate instead of always
running under the UNIX ``nobody'' account.
- Fixed support for the PVM group server.
- PVM now uses an environment variable to communicate with its children
instead of a file in /tmp. That file could previously be overwritten
by multiple PVM jobs.
- condor_ stats now lives in the ``bin'' directory instead of ``sbin''.
Known Bugs:
- The condor_ negotiator can crash if the Accountantnew.log file becomes
corrupted. This most often occurs if the Central Manager runs out of disk space.
Version 6.1.16
New Features:
- Condor now supports multiple pvmds per user on a machine. Users
can now submit more than one PVM job at a time, PVM tasks can now run
on the submission machine, and multiple PVM tasks can run on SMP
machines. condor_ submit no longer inserts default job requirements
to restrict PVM jobs to one pvmd per user on a machine. This new
functionality requires the condor_ pvmd included in this (and future)
Condor releases. If you set ``PVM_OLD_PVMD = True'' in the Condor
configuration file, condor_ submit will insert the default PVM job
requirements as it did in previous releases. You must set this if you
do not upgrade your condor_ pvmd binary or if your jobs flock with pools
that use an older condor_ pvmd.
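The compatibility setting described above, as a config fragment:

```
# Keep the pre-6.1.16 default PVM job requirements
PVM_OLD_PVMD = True
```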
- The NT release of Condor no longer contains debugging
information.
This drastically reduces the size of the binaries you must install.
Bugs Fixed:
- The configuration files shipped with version 6.1.15 contained a
number of errors relating to host-based security, the configuration of
the central manager, and a few other things.
These errors have all been corrected.
- Fixed a memory management bug in the condor_ schedd that could
cause it to crash under certain circumstances when machines were taken
away from the schedd's control.
- Fixed a potential memory leak in a library used by the
condor_ startd and condor_ master that could leak memory while
Condor jobs were executing.
- Fixed a bug in the NT version of Condor that would result in
faulty reporting of the load average.
- The condor_ shadow.pvm should now correctly return core files
when a task or condor_ pvmd crashes.
- This release fixes a memory error introduced in version
6.1.15 that could crash the condor_ shadow.pvm.
- Some condor_ pvmd binaries in previous releases included
debugging code we added that could cause the condor_ pvmd to crash.
This release includes new condor_ pvmd binaries for all platforms
with the problematic debugging code removed.
- Fixed a bug in the -unset options to condor_ config_val
that was introduced in version 6.1.15.
Both -unset and -runset now work correctly.
Known Bugs:
Version 6.1.15
New Features:
Bugs Fixed:
- In the STANDARD Universe, jobs submitted to Condor could segfault
if they opened multiple files with the same name. Usually this bug
was exposed when users would submit jobs without specifying a file
for either stdout or stderr; in this case, both would default to
/dev/null, and this could trigger the problem.
- The Linux 2.2.14 kernel, which is used by default with Red Hat 6.2,
has a serious bug that can cause the machine to lock up when
the same socket is used for repeated connection attempts. Thus,
previous versions of Condor could cause the 2.2.14 kernel to hang
(lots of other applications could do this as well). The Condor Team
recommends that you upgrade your kernel to 2.2.16 or later. However,
in v6.1.15 of Condor, a patch was added to the Condor networking
layer so that Condor would not trigger this Linux kernel bug.
- If no email address was specified when the job was submitted
with condor_ submit, completion email was being sent to
user@submit-machine-hostname. This is not the correct behavior. Now
email goes by default to user@uid-domain, where uid-domain is
defined by the UID_DOMAIN setting in the config file.
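A sketch of the relevant config entry (the domain value is only an example):

```
# Completion email now goes to user@$(UID_DOMAIN)
UID_DOMAIN = cs.wisc.edu
```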
- The condor_ master can now correctly shutdown and restart the
condor_ checkpoint_server.
- Email sent when a SCHEDULER Universe job completes now has the
correct From: header.
- In the STANDARD universe, jobs which call sigsuspend() will
now receive the correct return value.
- Abnormal error conditions, such as the hard disk on the submit
machine filling up, are much less likely to result in a job disappearing
from the queue.
- The condor_ checkpoint_server now correctly reconfigures when
a condor_ reconfig command is received by the condor_ master.
- Fixed a bug with how the condor_ schedd associates jobs with
machines (claimed resources) which would, under certain circumstances,
cause some jobs to remain idle until other jobs in the queue complete
or are preempted.
- A number of PVM universe bugs are fixed in this release.
Bugs in how the condor_ shadow.pvm exited, which caused jobs to hang
at exit or to run multiple times, have been fixed.
The condor_ shadow.pvm no longer exits if there is a problem starting
up PVM on one remote host.
The condor_ starter.pvm now ignores the periodic checkpoint command
from the startd. Previously, it would vacate the job when it received
the periodic checkpoint command.
A number of bugs with how the condor_ starter.pvm handled
asynchronous events, which caused it to take a long time to clean up
an exited PVM task, have been fixed.
The condor_ schedd now sets the status correctly on multi-class PVM
jobs and removes them from the job queue correctly on exit.
condor_ submit no longer ignores the machine_count command for PVM
jobs.
And, a problem which caused pvm_exit() to hang was diagnosed:
PVM tasks which call pvm_catchout() to catch the output of
child tasks should be sure to call it again with a NULL argument to
disable output collection before calling pvm_exit().
- The change introduced in 6.1.13 to the condor_ shadow regarding
when it logged the execute event to the user log produced situations
where the shadow could log other events (like the shadow exception
event) before the execute event was logged.
Now, the condor_ shadow will always log an execute event before it
logs any other events.
The timing is still improved over 6.1.12 and older versions, with the
execute event getting logged after the bulk of the job initialization
has finished, right before the job will actually start executing.
However, you will no longer see user logs that contain a ``shadow
exception'' or ``job evicted'' message without a ``job executing''
event first.
- stat() and variant calls now go through the file table to
get the correct logical size and access times of buffered files.
Before, stat() used to return zero size on a buffered file that had
not yet been synced to disk.
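The buffered-size behavior the fix addresses can be demonstrated locally with ordinary userspace stream buffering (a sketch and analogy only, not Condor's remote I/O layer):

```python
import os
import tempfile

# Sketch: data written through a buffered stream is invisible to stat()
# until it is flushed -- a local analogue of the buffered-file behavior
# described above.
fd, path = tempfile.mkstemp()
os.close(fd)
f = open(path, "w")                    # buffered stream
f.write("hello")                       # data sits in the userspace buffer
size_before = os.stat(path).st_size    # still 0 bytes on disk
f.flush()
size_after = os.stat(path).st_size     # now 5 bytes
f.close()
os.remove(path)
```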
Known Bugs:
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- The getrusage() call does not always work as expected in
STANDARD Universe jobs.
If your program uses getrusage(), the reported usage
could incorrectly decrease by a second
across a checkpoint and restart. In addition, the time it takes
Condor to restart from a checkpoint is included in the usage times
reported by getrusage(), and it probably should not be.
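The counters in question are the usage times getrusage(2) reports; Python's resource module is used here as a convenient stand-in to show what they contain (the bug above is in Condor's checkpoint/restart handling of these counters, not in getrusage() itself):

```python
import resource

# Sketch: the same counters getrusage(2) reports, via Python's
# resource module (Unix only).
usage = resource.getrusage(resource.RUSAGE_SELF)
cpu_seconds = usage.ru_utime + usage.ru_stime
print("user=%.3fs system=%.3fs" % (usage.ru_utime, usage.ru_stime))
```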
Version 6.1.14
New Features:
- Initial support added for Red Hat Linux 6.2 (i.e., glibc 2.1.3).
Bugs Fixed:
- In version 6.1.13, periodic checkpoints would not occur (see the
Known Bugs section for v6.1.13 listed below). This bug, which only
impacts v6.1.13, has been fixed.
Known Bugs:
- The getrusage() call does not work properly inside
``standard'' jobs.
If your program uses getrusage(), it will not report correct values
across a checkpoint and restart.
If your program relies on proper reporting from getrusage(), you
should either use version 6.0.3 or 6.1.10.
- While Condor now supports many networking calls such as
socket() and connect(), (see the description below of this
new feature added in 6.1.11), on Linux, we cannot at this time support
gethostbyname() and a number of other database lookup calls.
The reason is that on Linux, these calls are implemented by bringing in a
shared library that defines them, based on whether the machine is using
DNS, NIS, or some other database method.
Condor does not support the way in which the C library tries to explicitly
bring in these shared libraries and use them.
There are a number of possible solutions to this problem, but the Condor
developers are not yet agreed on the best one, so this limitation might not
be resolved by 6.1.14.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- When a program linked with the Condor libraries (using condor_ compile)
is writing output to a file, stat() and variant calls
will return zero for the size of the file if the program has not yet
read from the file or flushed the file descriptors.
This is a side effect of the file buffering code in Condor and will be
corrected to the expected semantics.
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
Version 6.1.13
New Features:
- Added DEFAULT_IO_BUFFER_SIZE and
DEFAULT_IO_BUFFER_BLOCK_SIZE config parameters to allow
the administrator to set the default file buffer sizes for user jobs
in condor_ submit.
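For example (the values are illustrative, not documented defaults):

```
# Default file buffer sizes for user jobs, in bytes (example values)
DEFAULT_IO_BUFFER_SIZE = 524288
DEFAULT_IO_BUFFER_BLOCK_SIZE = 32768
```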
- There is no longer any difference in the configuration file
syntax between ``macros'' (which were specified with an ``='' sign)
and ``expressions'' (which were specified with a ``:'' sign).
Now, all config file entries are treated and referenced as macros.
You can use either ``='' or ``:'' and they will work the same way.
There is no longer any problem with forward-referencing macros
(referencing macros you haven't yet defined), so long as they are
eventually defined in your config files (even if the forward reference
is to a macro defined in another config file, like the local config
file, for example).
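A sketch of the unified syntax (the macro names are illustrative):

```
# ``='' and ``:'' now behave identically
LOW_LOAD = 0.3
HIGH_LOAD : 1.5
# Forward references are allowed, as long as the macro is defined somewhere
START = LoadAvg < $(BUSY_LOAD)
BUSY_LOAD = 0.5
```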
- condor_ vacate now supports a -fast option that forces
Condor to hard-kill the job(s) immediately, instead of waiting for
them to checkpoint and gracefully shutdown.
- condor_ userlog now displays times in days+hours:minutes format
instead of total hours or total minutes.
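The new display amounts to a simple conversion; a sketch (the exact field widths are an assumption, since only the days+hours:minutes form is stated above):

```python
def format_duration(total_minutes):
    """Render a duration as days+hours:minutes, e.g. 1500 -> '1+01:00'.

    The zero-padded field widths are an assumption for illustration.
    """
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return "%d+%02d:%02d" % (days, hours, minutes)

print(format_duration(1500))   # 1 day, 1 hour, 0 minutes
```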
- The condor_ run command provides a simple front-end to
condor_ submit for submitting a shell command-line as a vanilla
universe job.
- Solaris 2.7 SPARC and 2.7 INTEL have been added to the
list of ports that now support remote system calls and checkpointing.
- Any mail being sent from Condor now shows up as having been sent from
the designated Condor Account, instead of root or ``Super User''.
- The condor_ submit ``hold'' command may be used to submit jobs
to the queue in the hold state. Held jobs will not run until released
with condor_ release.
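A minimal submit description using the new command might look like the following (the exact value syntax of the hold command, and the other entries, are assumptions for illustration):

```
# This job enters the queue in the hold state;
# release it later with condor_ release.
executable = my_program
hold = true
queue
```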
- It is now possible to use checkpoint servers in remote pools
when flocking even if the local pool doesn't use a checkpoint server.
This is now the default behavior (see the next item).
- USE_CKPT_SERVER now defaults to True if a checkpoint
server is available. It is usually more efficient to use a checkpoint
server near the execution site instead of storing the checkpoint back
to the submission machine, especially when flocking.
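As a config fragment:

```
# Default is now True whenever a checkpoint server is available
USE_CKPT_SERVER = True
```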
- All Condor tools that used to expect just a hostname or address
(condor_ checkpoint, condor_ off, condor_ on, condor_ restart,
condor_ reconfig, condor_ reschedule, condor_ vacate) to specify
which machine to affect can now take an optional -name or
-addr in front of each target.
This provides consistency with other Condor tools that require the
-name or -addr options.
For all of the above-mentioned tools, you can still just provide
hostnames or addresses; the new flags are not required.
- Added -pool and -addr options to condor_ rm,
condor_ hold and condor_ release.
- When you start up the condor_ master or condor_ schedd as any
user other than ``root'' or ``condor'' on Unix, or ``SYSTEM'' on NT,
the daemon will have a default Name attribute that includes
both the username of the user who the daemon is running as and the
full hostname of the machine where it is running.
- Clarified our Linux platform support. We now officially
support the Red Hat 5.2 and 6.x distributions, and although other Linux
distributions (especially those with similar libc versions) may work,
they are not tested or supported.
- The schedd now periodically updates the run-time counters in the
job queue for running jobs, so if the schedd crashes, the counters
will remain relatively up-to-date. This is controlled by the
WALL_CLOCK_CKPT_INTERVAL parameter.
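A sketch of the setting (the value and its units are an assumption for illustration):

```
# How often the schedd saves run-time counters to the job queue
WALL_CLOCK_CKPT_INTERVAL = 3600
```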
- The condor_ shadow now logs the ``job executing'' event in the
user log after the binary has been successfully transferred, so that
the events appear closer to the actual time the job starts running.
This can create somewhat unexpected log files:
if something goes wrong with the job's initialization, you might see
an ``evicted'' event before you see an ``executing'' event.
Bugs Fixed:
- Fixed how we internally handle file names for user jobs. This
fixes a nasty bug caused by changing directories between checkpoints.
- Fixed a bug in our handling of the Arguments macro in
the command file for a job. If the arguments were extremely long, or
there were an extreme number of them, they would get corrupted when the
job was spawned.
- Fixed DAGMan. It had not worked at all in the previous release.
- Fixed a nasty bug under Linux where file seeks did not work
correctly when buffering was enabled.
- Fixed a bug where the condor_ shadow would crash while sending job
completion e-mail, forcing a job to restart multiple times and the user
to receive multiple completion messages.
- Fixed a long-standing bug where Fortran 90 would occasionally
truncate its output files to random sizes and fill them with zeros.
- Fixed a bug where close() did not propagate its return
value back to the user job correctly.
- If a SIGTERM was delivered to a condor_ shadow, it used to
remove the job it was running from the job queue, as if condor_ rm
had been used.
This could have caused jobs to leave the queue unexpectedly.
Now, the condor_ shadow ignores SIGTERM (since the condor_ schedd
knows how to gracefully shutdown all the shadows when it gets a
SIGTERM), so jobs should no longer leave the queue prematurely.
In addition, on a SIGQUIT, the shadow now does a fast shutdown, just
like the rest of the Condor daemons.
- Fixed a number of bugs which caused checkpoint restarts
to fail on some releases of Irix 6.5 (for example, when migrating from
a mips4 to a mips3 CPU or when migrating between machines with
different pagesizes).
- Fixed a bug in the implementation of the stat() family
of remote system calls on Irix 6.5 which caused file opens in Fortran
programs to sometimes fail.
- Fixed a number of problems with the statistics reported in the
job completion email and by condor_ q -goodput, including the
number of checkpoints and total network usage. Correct values will
now be computed for all new jobs.
- Changes in USE_CKPT_SERVER and
CKPT_SERVER_HOST no longer cause problems for jobs in the
queue which have already checkpointed.
- Many of the Condor administration tools had a bug where they
would suffer a segmentation violation if you specified a -pool
option and did not specify a hostname.
This case now results in an error message instead.
- Fixed a bug where the condor_ schedd could die with a
segmentation violation if there was an error mapping an IP address
into a hostname.
- Fixed a bug where resetting the time in a large negative direction
caused the condor_ negotiator to have a floating point error on some
platforms.
- Fixed condor_ q's output so that certain arguments are not ignored.
- Fixed a bug in condor_ q where issuing -global with a
fairly restrictive -constraint argument would sometimes cause garbage
to be printed to the terminal.
- Fixed a bug which caused jobs to exit without completing a
checkpoint when preempted in the middle of a periodic checkpoint.
Now, the jobs will complete their periodic checkpoint in this case
before exiting.
Known Bugs:
- Periodic checkpoints do not occur. Normally, when the config
file attribute PERIODIC_CHECKPOINT evaluates to True,
Condor performs a periodic checkpoint of the running job. This
bug has been fixed in v6.1.14. NOTE: there is a work-around to permit
periodic checkpoints to occur in v6.1.13: add the attribute name
``PERIODIC_CHECKPOINT'' to the attributes
listed in the STARTD_EXPRS entry in the config file.
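The work-around, sketched as a config fragment (appending to the pool's existing STARTD_EXPRS list, whatever it contains):

```
# v6.1.13 work-around: make PERIODIC_CHECKPOINT an advertised attribute
STARTD_EXPRS = $(STARTD_EXPRS), PERIODIC_CHECKPOINT
```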
- The getrusage() call does not work properly inside
``standard'' jobs.
If your program uses getrusage(), it will not report correct values
across a checkpoint and restart.
If your program relies on proper reporting from getrusage(), you
should either use version 6.0.3 or 6.1.10.
- While Condor now supports many networking calls such as
socket() and connect(), (see the description below of this
new feature added in 6.1.11), on Linux, we cannot at this time support
gethostbyname() and a number of other database lookup calls.
The reason is that on Linux, these calls are implemented by bringing in a
shared library that defines them, based on whether the machine is using
DNS, NIS, or some other database method.
Condor does not support the way in which the C library tries to explicitly
bring in these shared libraries and use them.
There are a number of possible solutions to this problem, but the Condor
developers are not yet agreed on the best one, so this limitation might not
be resolved by 6.1.14.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- When writing output to a file, stat() and variant calls
will return zero for the size of the file if the program has not yet
read from the file or flushed the file descriptors.
This is a side effect of the file buffering code in Condor and will be
corrected to the expected semantics.
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
Version 6.1.12
Version 6.1.12 fixes a number of bugs from version 6.1.11.
If you linked your ``standard'' jobs with version 6.1.11, you should
upgrade to 6.1.12 and re-link your jobs (using condor_ compile) as soon as
possible.
New Features:
Bugs Fixed:
- A number of system calls that were not being trapped by the Condor
libraries in version 6.1.11 are now being caught and sent back to the
submit machine.
Not having these functions being executed as remote system calls prevented
a number of programs from working, in particular Fortran programs, and
many programs on IRIX and Solaris platforms.
- Sometimes submitted jobs report back as having no owner and have
-????- in the status line for the job. This has been fixed.
- condor_ q -io has been fixed in this release.
Known Bugs:
- The getrusage() call does not work properly inside
``standard'' jobs.
If your program uses getrusage(), it will not report correct values
across a checkpoint and restart.
If your program relies on proper reporting from getrusage(), you
should either use version 6.0.3 or 6.1.10.
- While Condor now supports many networking calls such as
socket() and connect(), (see the description below of this
new feature added in 6.1.11), on Linux, we cannot at this time support
gethostbyname() and a number of other database lookup calls.
The reason is that on Linux, these calls are implemented by bringing in a
shared library that defines them, based on whether the machine is using
DNS, NIS, or some other database method.
Condor does not support the way in which the C library tries to explicitly
bring in these shared libraries and use them.
There are a number of possible solutions to this problem, but the Condor
developers are not yet agreed on the best one, so this limitation might not
be resolved by 6.1.13.
- In HP-UX 10.20, condor_ compile will not work correctly with HP's
C++ compiler.
The jobs might link, but they will produce incorrect output, or die with
a signal such as SIGSEGV during restart after a checkpoint/vacate cycle.
However, the GNU C/C++ and the HP C compilers work just fine.
- When writing output to a file, stat() and variant calls
will return zero for the size of the file if the program has not yet
read from the file or flushed the file descriptors.
This is a side effect of the file buffering code in Condor and will be
corrected to the expected semantics.
- On IRIX 6.2, C++ programs compiled with GNU C++ (g++) 2.7.2 and
linked with the Condor libraries (using condor_ compile) will not
execute the constructors for any global objects.
There is a work-around for this bug, so if this is a problem for you,
please send email to condor-admin@cs.wisc.edu.
- The -format option in condor_ q has no effect when querying
remote machines with the -n option.
- condor_ dagman does not work at all in this release.
It exits immediately with a success status without performing any work.
It will be fixed in the next release of Condor.
Version 6.1.11
New Features:
- condor_ status outputs information for held jobs instead of
MaxRunningJobs when supplied with -schedd or -submitter.
- condor_ userprio now prints four-digit years (for Y2K compliance).
If you give a two-digit date, it will also assume that 1/1/00 is 1/1/2000
and not 1/1/1900.
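This two-digit-year handling amounts to a windowing rule; a sketch (only the 00 -> 2000 case is stated above, so the pivot year used here is an assumption for illustration):

```python
def expand_two_digit_year(yy, pivot=70):
    """Expand a two-digit year: 0 -> 2000, not 1900.

    The pivot year (70) is an assumption for illustration; the release
    notes only state that 00 maps to 2000.
    """
    return 1900 + yy if yy >= pivot else 2000 + yy

print(expand_two_digit_year(0))    # 2000
```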
- IRIX 6.5 has been added to the list of ports that now support
remote system calls and checkpointing.
- condor_ q has been fixed to be faster and much more memory
efficient. This is much more noticeable when getting the queue from
condor_ schedds that have more than 1000 jobs.
- Added support for socket() and pipe() in standard
jobs. Both sockets and pipes are created on the executing machine.
Checkpointing is deferred anytime a socket or pipe is open.
- Added limited support for select() and poll() in standard jobs.
Both calls will work only on files opened locally.
- Added limited support for fcntl() and ioctl() in standard jobs.
Both calls will be performed remotely if the control-number is understood
and the third argument is an integer.
- Replaced buffer implementation in standard jobs.
The new buffer code reads and writes variable sized chunks.
It will never issue a read to satisfy a write. Buffering is enabled
by default.
- Added extensive feedback on I/O performance in the user's email.
- Added -io option to condor_ q to show I/O statistics.
- Removed libckpt.a and libzckpt.a. To build for standalone
checkpointing, just do a regular condor_ compile.
No -standalone option is necessary.
- The checkpointing library now only re-opens files when they are
actually used. If files or other needed resources cannot be found
at restart time, the checkpointer will fail with a verbose error.
- The RemoteHost and LastRemoteHost attributes in
the job classad now contain hostnames instead of IP addresses and port
numbers. The -run option of older versions of condor_ q is not
compatible with this change.
- Condor will now automatically check for compatibility between
the version of the Condor libraries you have linked into a standard
job (using condor_ compile) and the version of the condor_ shadow
installed on your submit machine.
If they are incompatible, the condor_ shadow will now put your job on
hold.
Unless you set ``Notification = Never'' in your submit file, Condor
will also send you email explaining what went wrong and what you can
do about it.
- All Condor daemons and tools now have a CondorPlatform
string, which shows which platform a given set of Condor binaries was
built for.
In all places that you used to see CondorVersion, you will now
see both CondorVersion and CondorPlatform, such as in
each daemon's ClassAd, in the output to a -version option (if
supported), and when running ident on a given Condor binary.
This string can help identify situations where you are running the
wrong version of the Condor binaries for a given platform (for
example, running binaries built for Solaris 2.5.1 on a Solaris 2.6
machine).
- Added commented-out settings in the default
condor_config file we ship for various SMP-specific settings
in the condor_ startd.
Be sure to read section 3.13.7 on ``Configuring the
Startd for SMP Machines'' for
details about using these settings.
- condor_ rm, condor_ hold, and condor_ release all support
-help and -version options now.
Bugs Fixed:
- A race condition which could cause the condor_ shadow to not
exit when its job was removed has been fixed.
This bug would cause jobs that had been removed with condor_ rm to
remain in the queue marked as status ``X'' for a long time.
In addition, Condor would not shutdown quickly on hosts that had hit
this race condition, since the condor_ schedd wouldn't exit until all
of its condor_ shadow children had exited.
- A signal race condition during restart of a Condor job has
been fixed.
- In a Condor linked job, getdomainname() is now
supported.
- IRIX 6.5 can give negative time reports for how long a process has been
running. We now account for that in our statistics about usage times.
- The condor_ status memory error introduced in version 6.1.10
has been fixed.
- The DAEMON_LIST configuration setting is now case
insensitive.
- Fixed a bug where the condor_ schedd, under rare circumstances,
could cause another schedd's jobs not to be matched.
- The free disk space is now properly computed on Digital Unix.
This fixed problems where the Disk attribute in the
condor_ startd classad reported incorrect values.
- The config file parser now detects incremental macro definitions
correctly (see section 3.3.1). Previously, when a macro (or
expression) being defined was a substring of a macro (or expression)
being referenced in its definition, the reference would be erroneously
marked as an incremental definition and expanded immediately. The
parser now verifies that the entire strings match.
Known Bugs:
- The output for condor_ q -io is incorrect and will likely show
zeroes for all values. A fixed version will appear in the next release.
Version 6.1.10
New Features:
- condor_ q now accepts -format parameters like condor_ status.
- condor_ rm, condor_ hold and condor_ release accept
-constraint parameters like condor_ status.
- condor_ status now sorts displayed totals by the first column.
(This feature introduced a bug in condor_ status. See ``Known Bugs''
below.)
- Condor version 6.1.10 introduces ``clipped'' support for Sparc
Solaris version 2.7.
This version does not support checkpointing or remote system calls.
Full support for Solaris 2.7 will be released soon.
- Introduced code to enable Linux to use the standard C library's
I/O buffering again, instead of relying on the Condor I/O buffering
code (which is still in beta testing).
Bugs Fixed:
- The bug in checkpointing introduced in version 6.1.9 has been
fixed.
Checkpointing will now work on all platforms, as it always used to.
Any jobs linked with the 6.1.9 Condor libraries will need to be
relinked with condor_ compile once version 6.1.10 has been installed
at your site.
Known Bugs:
- The CondorLoadAvg attribute in the condor_ startd has
some problems in the way it is computed.
The CondorLoadAvg is somewhat inaccurate for the first minute a job
starts running, and for the first minute after it completes.
Also, the computation of CondorLoadAvg is very wrong on NT.
All of this will be fixed in a future version.
- A memory error may cause condor_ status to die with SIGSEGV
(segmentation violation) when displaying totals or cause incorrect
totals to be displayed. This will be fixed in version 6.1.11.
Version 6.1.9
New Features:
- Added full support for Linux 2.0.x and 2.2.x kernels using
libc5, glibc20 and glibc21.
This includes support for Red Hat 6.x, Debian 2.x and other popular
Linux distributions.
Whereas the Linux machines had once been fragmented across libc5 and
GNU libc, they have now been reunified.
This means there is no longer any need for the ``LINUX-GLIBC'' OpSys
setting in your pool: all machines will now show up as ``LINUX''.
Part of this reunification process was the removal of dynamically
linked user jobs on Linux.
condor_ compile now forces static linking of your Standard Universe
Condor jobs.
Also, please use condor_ compile on the same machine on which you
compiled your object files.
- Added condor_ qedit utility to allow users to modify job
attributes after submission. See the new manual page for details.
- Added -runforminutes option to daemonCore to have
the daemon gracefully shut down after the given number of minutes.
- Added support for statfs(2) and fstatfs(2) in user jobs. We support
only the fields
f_bsize, f_blocks, f_bfree, f_bavail, f_files, f_ffree from
the structure statfs. This is still in the experimental stage.
- Added the -direct option to condor_ status.
The -direct option takes a hostname; condor_ status will query the
condor_ startd on the specified host and display information directly
from there, instead of querying the condor_ collector.
See the manual page for details.
- Users can now define NUM_CPUS to override the automatic
computation of the number of CPUs in your machine.
Using this config setting can cause unexpected results, and is not
recommended.
This feature is only provided for sites that specifically want this
behavior and know what they are doing.
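As an illustration, overriding the automatic CPU count might look like
this in a condor_config file (the value 2 is an arbitrary example, not
a recommendation):

```
## Pretend this machine has exactly 2 CPUs, regardless of what
## Condor would detect automatically. Use with caution.
NUM_CPUS = 2
```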
- The -set and -rset options to condor_ config_val
have been changed to allow administrators to set both macros and
expressions.
Previously, condor_ config_val assumed you wanted to set
expressions.
Now, these two options each take a single argument, the string
containing exactly what you would put into the config file, so you can
specify you want to create a macro by including an ``='' sign, or an
expression by including a ``:''.
See section 3.3.1 for details on macros vs. expressions, and the
condor_ config_val man page for details on condor_ config_val.
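As a sketch of the distinction, the argument string passed to -set or
-rset uses the same syntax as a config file line, and the delimiter
determines what is created (the names below are invented examples):

```
## A macro definition: "=" separates name and value.
MY_MACRO = /usr/local/condor
## An expression definition: ":" separates name and value.
MY_EXPR : (LoadAvg < 0.3)
```

For example, ``condor_ config_val -set "MY_MACRO = /usr/local/condor"''
would create the macro shown above.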
- If the directory you specified for LOCK (which holds lock files
used by Condor) doesn't exist, Condor will now try to create that
directory for you instead of giving up right away.
- If you change the COLLECTOR_HOST setting and reconfig
the condor_ startd, the startd will ``invalidate'' its ClassAds at
the old collector before it starts reporting to the new one.
Bugs Fixed:
- Fixed a major bug dealing with the group access a Condor job is
started with.
Now, Condor jobs are started with all the groups the job's owner is
in, not just their default group.
This also fixes a security hole where user jobs could be started up in
access groups they didn't belong to.
- Fixed a bug where there was a needless limitation on the number of open
file descriptors a user job could have.
- Fixed a standalone checkpointing bug where we weren't blocking signals
in critical sections and causing file table corruption at checkpoint
time.
- Fixed a linker bug on Digital Unix 4.0 concerning fortran where
the linker would fail on __uname and __sigsuspend.
- Fixed a bug in condor_ shadow that would send incorrect job
completion email under Linux.
- Fixed a bug in the remote system call of fchdir() that caused
a garbage file descriptor to be used in Standard Universe jobs.
- Fixed a bug in the condor_ shadow which was causing condor_ q
-goodput to display incorrect values for some jobs.
- Fixed some minor bugs and made some minor enhancements in the
condor_ install script.
The bugs included a typo in one of the questions asked, and incorrect
handling for the answers of a few different questions.
Also, if DNS is misconfigured on your system, condor_ install will
try several methods to find your fully qualified hostname, and if it
still cannot determine the correct hostname, it will prompt you for it.
In addition, we now avoid one installation step in cases where it is
not needed.
- Fixed a rare race condition that could delay the completion of
large clusters of short running jobs.
- Added more checking to the various arguments that might be
passed to condor_ status, so that in the case of bad input,
condor_ status will print an error message and exit, instead of
performing a segmentation fault.
Also, when you use the -sort option, condor_ status will only
display ClassAds where the attributes you use to sort are defined.
- Fixed a bug in the handling of the config files created by
using the -set or -rset options to condor_ config_val.
Previously, if you manually deleted the files that were created, you
could cause the affected Condor daemon to have a segmentation fault.
Now, the daemons simply exit with a fatal error but still have a
chance to clean up.
- Fixed a bug in the -negotiator option for most Condor
tools that was causing it to get the wrong address.
- Fixed a couple of bugs in the condor_ master that could cause
improper shutdowns.
There were cases during shutdown where we would restart a daemon
(because we previously noticed a new executable, for example).
Now, once you begin a shutdown, the condor_ master will not restart
anything.
Also, fixed a rare bug that could cause the condor_ master to stop
checking the timestamps on a daemon.
- Fixed a minor bug in the -owner option to
condor_ config_val that was causing condor_ init not to work.
- Fixed a bug where the condor_ startd, while it was already
shutting down, was allowing certain actions to succeed that should
have failed.
For example, it allowed itself to be matched with a user looking for
available machines, or to begin a new PVM task.
Known Bugs:
- The CondorLoadAvg attribute in the condor_ startd has
some problems in the way it is computed.
The CondorLoadAvg is somewhat inaccurate for the first minute a job
starts running, and for the first minute after it completes.
Also, the computation of CondorLoadAvg is very wrong on NT.
All of this will be fixed in a future version.
- There is a serious bug in checkpointing when using Condor's
I/O buffering for ``standard'' jobs.
By default, Linux uses Condor buffering in version 6.1.9 for all
standard jobs.
The bug prevents checkpointing from working more than once.
This renders the condor_ vacate and condor_ checkpoint commands
useless, and jobs will just be killed without a checkpoint when
machine owners come back to their machines.
Version 6.1.8
- Added file_remaps as a command in the job submit file for STANDARD
universe jobs.
A job can now specify that accesses of one file should be remapped to
another file.
In addition, you can specify that particular files should be read from
the local machine.
See the condor_ submit manual page for more details.
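A submit-description sketch of this feature might look as follows; the
filenames are invented, and the exact remap syntax shown is an
assumption, not confirmed by this manual section:

```
## Hypothetical STANDARD universe submit file fragment.
universe    = standard
executable  = my_app
## Remap accesses of "big.dat" to a different path (assumed syntax).
file_remaps = "big.dat = /scratch/big.dat"
queue
```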
- Added buffer_size and buffer_block_size so that STANDARD
universe jobs can specify that they wish to have I/O buffering turned on.
Without buffering, all I/O requests in the STANDARD universe are sent back
over the network to be executed on the submit machine.
With buffering, read ahead, write behind, and seek batch buffering is
performed to minimize network traffic and latency.
By default, jobs do not specify buffering; however, in many
situations, buffering can drastically increase throughput.
See the condor_ submit manual page for more details.
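For example, a submit file might enable I/O buffering like this; the
sizes are arbitrary illustrations, and the precise meaning of each
knob is assumed from its name:

```
## Hypothetical submit file fragment enabling I/O buffering
## for a STANDARD universe job.
universe          = standard
executable        = my_app
buffer_size       = 524288   # assumed: total buffer space, in bytes
buffer_block_size = 32768    # assumed: size of each block, in bytes
queue
```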
- The condor_ schedd is much more memory-efficient when handling
clusters with hundreds or thousands of jobs.
If you submit large clusters, your submit machine will only use a fraction
of the amount of RAM it used to require.
NOTE: The memory savings will only be realized for new clusters submitted
after the upgrade to v6.1.8 - clusters which previously existed in the
queue at upgrade time will still use the same amount of RAM in the
condor_ schedd.
- Submitting jobs, especially submitting large clusters containing many
jobs, is much faster.
- Added a -goodput option to condor_ q, which displays
statistics about the execution efficiency of STANDARD universe jobs.
- Added FS_REMOTE method of user authentication to possible values
of the configuration option AUTHENTICATION_METHODS to fix problems
with using the -r remote scheduler option of condor_ submit.
Additionally, the user authentication protocol has changed, so previous
versions of Condor programs cannot co-exist with this new protocol.
- Added a new utility and documentation for condor_ glidein which uses
Globus resources to extend your local pool to use remote Globus machines as
part of your Condor pool.
- Fixed more bugs in the handling of the stat() system call
and its relatives on Linux with glibc.
This was causing problems mainly with Fortran I/O, though other I/O
related problems on glibc Linux will probably be solved now.
- Fixed a bug in various Condor tools (condor_ status,
condor_ user_prio, condor_ config_val, and condor_ stats) that
would cause them to seg fault on bad input to the -pool option.
- Fixed a bug with the -rset option to condor_ config_val which
could crash the Condor daemon whose configuration was being changed.
- Added allow_startup_script command to the job submit
description file which is given to condor_ submit. This allows the
submission of a startup script to the STANDARD universe.
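A minimal sketch of how this might be used in a submit file; the
script name is an invented example:

```
## Hypothetical submit file fragment: the executable is a startup
## script rather than a condor_compile-linked binary.
universe             = standard
executable           = setup_and_run.sh
allow_startup_script = True
queue
```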
- Fixed a bug in the condor_ schedd where it would get into an
infinite loop if the persistent log of the job queue got corrupted.
The condor_ schedd now correctly handles corrupted log files.
- The full release tar file now contains a dagman
subdirectory in the examples directory.
This subdirectory includes an example DAGMan job, including a README
(in both ASCII and HTML), a Makefile, and so on.
- Condor will now insert an environment variable, CONDOR_VM, into
the environment of the user job.
This variable specifies which SMP ``virtual machine'' the job was started on.
It will equal either vm1, vm2, vm3, ... , depending upon which virtual
machine was matched.
On a non-SMP machine, CONDOR_VM will always be set to vm1.
- Fixed some timing bugs introduced in v6.1.6 which could occur when
Condor tries to simultaneously start a large number of jobs submitted from a
single machine.
- Fixed bugs when Condor is told to gracefully shut down; Condor no
longer starts up new jobs when shutting down. Also, the condor_ schedd
progressively checkpoints running jobs during a graceful shutdown instead of
trying to vacate all the jobs simultaneously. The rate at which the shutdown
occurs is controlled by the JOB_START_DELAY configuration
parameter.
- Fixed a bug which could cause the condor_ master process to exit if
the Condor daemons have been hung for a while by the operating system (if,
for instance, the LOG directory was placed on an NFS volume and the NFS
server is down for an extended period).
- Previously, removing a large number of jobs with condor_ rm would
result in the condor_ schedd being unresponsive for a period of time
(perhaps leading to timeouts when running condor_ q). The condor_ schedd
has been improved to multitask the removal of jobs while servicing new
requests.
- Added new configuration parameter COLLECTOR_SOCKET_BUFSIZE
which controls the size of TCP/IP buffers used by the condor_ collector.
For more information, see the description of
COLLECTOR_SOCKET_BUFSIZE in the manual.
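A sketch of how this setting might appear in a condor_config file; the
value is an arbitrary example, not a recommendation:

```
## Hypothetical condor_config fragment: enlarge the TCP/IP socket
## buffers used by the condor_ collector (value in bytes, assumed).
COLLECTOR_SOCKET_BUFSIZE = 1048576
```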
- Fixed a bug with the -analyze option to condor_ q: in some
cases, the RANK expression would not be evaluated correctly. This could
cause the output from -analyze to be in error.
- Fixed bugs in computing the system load average on multi-CPU
(SMP) Hewlett-Packard machines.
- Fixed bug in condor_ q which could cause the RUN_TIME reported to
be temporarily incorrect when jobs first start running.
- The condor_ startd no longer rapidly sends multiple ClassAds one
right after another to the Central Manager when its state/activity is in
rapid transition. Also, on SMP machines, the condor_ startd will only send
updates for 4 nodes per second (to avoid overflowing the central manager when
reporting the state of a very large SMP machine with dozens of CPUs).
- Reading a parameter with condor_ config_val is now allowed from any
machine with Host-IP READ permission.
Previously, you needed ADMINISTRATOR permission.
Of course, setting a parameter still requires ADMINISTRATOR permission.
- Worked around a bug in the StreamTokenizer Java class from Sun
that we use in the CondorView client Java applet.
The bug would cause errors if usernames or hostnames in your pool
contained ``-'' or ``_'' characters.
The CondorView applet now gets around this and properly displays all
data, including entries with the ``bad'' characters.
Version 6.1.7
NOTE: Version 6.1.7 only adds support for platforms not supported in
6.1.6.
There are no bug fixes, so there are no binaries released for any
other platforms.
You do not need 6.1.7 unless you are using one of the two platforms we
released binaries for.
- Added ``clipped'' support for Alpha Linux machines running the
2.0.X kernel and glibc 2.0.X (such as Red Hat 5.X).
We do not yet support checkpointing and remote system calls on this
platform, but we can start ``vanilla'' jobs.
See section 2.4.1 for details on vanilla vs. standard jobs.
- Re-added support for Intel Linux machines running the 2.0.X
Linux kernel, glibc 2.0.X, using the GNU C compiler (gcc/g++ 2.7.X) or
the EGCS compilers (versions 1.0.X, 1.1.1 and 1.1.2).
This includes Red Hat 5.X, and Debian 2.0.
Red Hat 6.0 and Debian 2.1 are not yet supported, since they use
glibc 2.1.X and the 2.2.X Linux kernel.
Future versions of Condor will support all combinations of kernels,
compilers and versions of libc.
Version 6.1.6
- Added file_remaps as command in the job submit file given to
condor_ submit.
This allows the user to explicitly specify where to find a given file (e.g.
either on the submit or execute machine), as well as remap file access to a
different filename altogether.
- Changed the way the condor_ master spawns its daemons and
condor_ preen, allowing you to specify command-line arguments for
any of them through a SUBSYS_ARGS setting.
Previously, when you specified PREEN, you added the command-line
arguments directly to that setting, but that caused some
problems, and only worked for condor_ preen.
Once you upgrade to version 6.1.6, if you continue to use your
old condor_config files, you must change the PREEN
setting to remove any arguments you have defined and place those
arguments into a separate config setting, PREEN_ARGS.
See section 3.3.9, ``condor_ master
Config File Entries'', for more details.
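The change might look like this in a condor_config file; the -m and -r
arguments are illustrative condor_ preen options, used here only as an
example:

```
## Old style (no longer supported): arguments embedded in PREEN.
##   PREEN = $(SBIN)/condor_preen -m -r
## New style: the binary and its arguments are separate settings.
PREEN      = $(SBIN)/condor_preen
PREEN_ARGS = -m -r
```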
- Fixed a very serious bug in the Condor library linked in with
condor_ compile to create standard jobs that was causing
checkpointing to fail in many cases.
Any jobs that were linked with the 6.1.5 Condor libraries should
probably be removed, re-linked, and re-submitted.
- Fixed a bug in condor_ userprio that was introduced in version
6.1.5 that was preventing it from finding the address of the
condor_ negotiator for your pool.
- Fixed a bug in condor_ stats that was introduced in version
6.1.5 that was preventing it from finding the address of the
condor_ collector for your pool.
- Fixed a bug in the way the -pool option was handled by
many Condor tools that was introduced in version 6.1.5.
- condor_ q now displays job allocation time by default, instead
of displaying CPU time.
Job allocation time, or RUN_TIME, is the amount of wall-clock time the job
has spent running.
Unlike CPU time information which is only updated when a job is
checkpointed, the allocation time displayed by condor_ q is continuously
updated, even for vanilla universe jobs.
By default, the allocation time displayed will be the total time across all
runs of the job.
The new -currentrun option to condor_ q can be used to display the
allocation time for solely the current run of the job.
Additionally, the -cputime option can be used to view job CPU times as
in earlier versions of Condor.
- condor_ q will display an error message if there is a timeout
fetching the job queue listing from a condor_ schedd. Previously,
condor_ q would simply list the queue as empty upon a communication error.
- The condor_ schedd daemon has been updated to verify all queue access
requests via Condor's IP/Host-Based Security mechanism (see
section 3.6.8).
- Fixed a bug on platforms which require the condor_ kbdd (currently
Digital Unix and IRIX).
This bug could have allowed Condor to start a job within the first five
minutes after the Condor daemons had been started, even if a user was
typing on the keyboard.
- condor_ release now gives an error message if the user tries to
release a job which either does not exist or is not in the hold state.
- Added a new config file parameter, USER_JOB_WRAPPER , which
allows administrators to specify a file to act as a ``wrapper'' script
around all jobs started by Condor.
See section 3.3.14 for more details.
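A minimal sketch of this setting in a condor_config file; the wrapper
path is an invented example:

```
## Hypothetical condor_config fragment: run every job through a
## site-provided wrapper script.
USER_JOB_WRAPPER = /usr/local/condor/libexec/job_wrapper.sh
```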
- condor_ dagman now permits the backslash character (``\'') to be used
as a line-continuation character for DAG Input Files, just like the
condor_ config files.
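For example, a long entry in a DAG input file can be split across
lines; the node and file names below are invented:

```
# A JOB entry split across two lines with a backslash continuation.
JOB NodeA \
    nodeA.submit
```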
- The Condor version string is now included in all Condor
libraries.
You can now run ident on any program linked with
condor_ compile to view which version of the Condor libraries you
linked with.
In addition, the format of the version string changed in 6.1.6.
Now, the identifier used is ``CondorVersion'' instead of ``Version''
to prevent any potential ambiguity.
Also, the format of the date changed slightly.
- The SMP startd can now handle dynamic reconfiguration of the
number of each type of virtual machine being reported.
This allows you, during the normal running of the startd, to increase
or decrease the number of CPUs that Condor is using.
If you reconfigure the startd to use fewer CPUs than it currently has
under its control, it will first remove CPUs that have no Condor jobs
running on them.
If more CPUs need to be evicted, the startd will checkpoint jobs and
evict them in reverse rank order (using the startd's Rank
expression).
So, the lower the value of the rank, the more likely a job will be
kicked off.
- The SMP startd contrib module's condor_ starter no longer makes
a call that was causing warning messages about ``ERROR: Unknown System
Call (-58) - system call not supported by Condor'' when used with the
6.0.X condor_ shadow.
This was a harmless call, but removing the call prevents the error
message.
- The SMP contrib module now includes the condor_ checkpoint and
condor_ vacate programs, which allow you to vacate or checkpoint jobs
on individual CPUs on the SMP, instead of checkpointing or vacating
everything.
You can now use ``condor_ vacate vm1@hostname'' to just vacate the
first virtual machine, or ``condor_ vacate hostname'' to vacate all
virtual machines.
- Added support for SMP Digital Unix (Alpha) machines.
- Fixed a bug that was causing an overflow in the computation of
free disk and swap space on Digital Unix (Alpha) machines.
- The condor_ startd and condor_ schedd now can ``invalidate''
their classads from the collector.
So, when a daemon is shut down, or a machine is reconfigured to
advertise fewer virtual machines, those changes will be instantly
visible with condor_ status, instead of having to wait 15 minutes for
the stale classads to time-out.
- The condor_ schedd no longer forks a child process (a ``schedd
agent'') to claim available condor_ startds.
You should no longer see multiple condor_ schedd processes running on
your machine after a negotiation cycle.
This is now accomplished in a non-blocking manner within the
condor_ schedd itself.
- The startd now adds a VirtualMachineID attribute to
each virtual machine classad it advertises.
This is just an integer, starting at 1, and increasing for every
different virtual machine the startd is representing.
On regular hosts, this is the only ID you will ever see.
On SMP hosts, you will see the ID climb up to the number of different
virtual machines reported.
This ID can be used to help write more complex policy expressions on
SMP hosts, and to easily identify which hosts in your pool are in fact
SMP machines.
- Modified the output for condor_ q -run for scheduler and PVM
universe jobs. The host where the scheduler universe job is running
is now displayed correctly. For PVM jobs, a count of the current
number of hosts where the job is running is displayed.
- Fixed the condor_ startd so that it no longer prints lots of
ProcAPI errors to the log file when it is being run as non-root.
- FS_PATHNAME and VOS_PATHNAME are no longer
used. AFS support now works similarly to NFS support, via the
FILESYSTEM_DOMAIN macro.
- Fixed a minor bug in the Condor.pm perl module that was
causing it to be case-sensitive when parsing the Condor submit file.
Now, the perl module is properly case-insensitive, as indicated in the
documentation.
Version 6.1.5
- Fixed a nasty bug in condor_ preen that would cause it to
remove files it shouldn't remove if the condor_ schedd and/or
condor_ startd were down at the time condor_ preen ran.
This was causing jobs to mysteriously disappear from the job queue.
- Added preliminary support to Condor for running on machines with
multiple network interfaces.
On such machines, users can specify the IP address Condor should use
in the NETWORK_INTERFACE config file parameter on each host.
In addition, if the pool's central manager is on such a machine, set
the CM_IP_ADDR parameter to the IP address you wish to use on that
machine.
See section 3.7.2 for more details.
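A sketch of these settings for a multi-homed host; the addresses are
invented examples:

```
## Hypothetical condor_config fragment for a machine with multiple
## network interfaces: bind Condor to one specific address.
NETWORK_INTERFACE = 192.168.1.10
## On the pool's central manager only:
CM_IP_ADDR = 192.168.1.10
```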
- The support for multiple network interfaces introduced bugs in
condor_ userprio, condor_ stats, CondorPVM, and the -pool
option to many Condor tools.
All of these will be fixed in version 6.1.6.
- Fixed a bug in the remote system call library that was
preventing certain Fortran operations from working correctly on
Linux.
- The Linux binaries for GLIBC we now distribute are compiled on a
Red Hat 5.2 machine.
If you are running this version of Red Hat, you might have better luck
with the dynamically linked version of Condor than with previous
releases of Condor.
Sites using other GLIBC Linux distributions should continue to use the
statically linked version of Condor.
- Fixed a bug in the condor_ shadow that could cause it to die
with signal 11 (segmentation violation) under certain rare
circumstances.
- Fixed a bug in the condor_ schedd that could cause it to die
with signal 11 (segmentation violation) under certain rare
circumstances.
- Fixed a bug in the condor_ negotiator that could cause it to
die with signal 8 (floating point exception) on Digital Unix
machines.
- The following shadow parameters have been added to control
checkpointing: COMPRESS_PERIODIC_CKPT ,
COMPRESS_VACATE_CKPT , PERIODIC_MEMORY_SYNC ,
SLOW_CKPT_SPEED. See section 3.3.12 for more details.
In addition, the shadow now honors the CkptWanted flag in a job
classad, and if it is set to ``False'', the job will never
checkpoint.
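An illustrative sketch of these controls in a condor_config file; the
values are examples rather than recommendations, and the type and
units of SLOW_CKPT_SPEED are assumptions:

```
## Hypothetical condor_config fragment for checkpoint control.
COMPRESS_PERIODIC_CKPT = True
COMPRESS_VACATE_CKPT   = True
PERIODIC_MEMORY_SYNC   = True
SLOW_CKPT_SPEED        = 1024   # assumed units: kilobytes per second
```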
- Fixed a bug in the condor_ startd that could cause it to
report negative values for the CondorLoadAvg on rare occasions.
- Fixed a bug in the condor_ startd that could cause it to die
with a fatal exception in situations where the act of getting claimed
by a remote schedd failed for some reason.
This resulted in the condor_ startd exiting on rare occasions with a
message in its log file to the effect of ERROR ``Match timed
out but not in matched state''.
- Fixed a bug in the condor_ schedd that under rare circumstances
could cause a job to be left in the ``Running'' state even after the
condor_ shadow for that job had exited.
- Fixed a bug in the condor_ schedd and various tools that
prevented remote read-only access to the job queue from working.
So, for example, condor_q -name foo, if run on any machine
other than foo, wouldn't display any jobs from foo's queue.
This fix re-enables the following condor_ q options:
submitter, name, global, etc.
- Changed the condor_ schedd so that when starting jobs, it
always sorts on the cluster number, in addition to the date the jobs
were enqueued and the process number within clusters, so that if many
clusters were submitted at the same time, the jobs are started in
order.
- Fixed a bug in condor_ compile that was modifying the
PATH environment variable by adding things to the front of it.
This would potentially cause jobs to be compiled and linked with a
different version of a compiler than they thought they were getting.
- Minor change in the way the condor_ startd handles the
D_ LOAD and D_ KEYBOARD debug flags.
Now, each one, when set, will only display every
UPDATE_INTERVAL , regardless of the startd state.
If you wish to see the values for keyboard activity or load average
every POLLING_INTERVAL , you must enable D_ FULLDEBUG.
Version 6.1.4
- Fixed a bug in the socket communication library used by Condor
that was causing daemons and tools to die on some platforms (notably,
Digital Unix) with signal 8, SIGFPE (floating point exception).
- Fixed a bug in the usage message of many Condor tools that
mentioned a -all option that isn't yet supported.
This option will be supported in future versions of Condor.
- Fixed a bug in the filesystem authentication code used to
authenticate operations on the job queue that left empty temporary
files in /tmp.
These files are now properly removed after they are used.
- Fixed a minor bug in the totals condor_ status displays when
you use the ckptsrvr option.
- Fixed a minor syntax error in the condor_ install script that
would cause warnings.
- The Condor.pm Perl module is now included in the
lib directory of the main release directory.
Version 6.1.3
NOTE: There are a lot of new, unstable features in 6.1.3.
PLEASE do not install all of 6.1.3 on a production pool.
Almost all of the bug fixes in 6.1.3 are in the condor_ startd or
condor_ starter, so, unless you really know what you're doing, we
recommend you upgrade just the SMP-Startd contrib module, not the
entire 6.1.3 release.
- Owners can now specify how the SMP-Startd partitions the system
resources into the different types and numbers of virtual machines,
specifying the number of CPUs, megs of RAM, megs of swap space, etc.,
in each.
Previously, each virtual machine reported to Condor from an SMP
machine always had one CPU, and all shared system resources were
evenly divided among the virtual machines.
- Fixed a bug in the reporting of virtual memory and disk space on
SMP machines where each virtual machine represented was advertising
the total in the system for itself, instead of its own share.
Now, both the totals, and the virtual machine-specific values are
advertised.
- Fixed a bug in the condor_ starter when it was trying to
suspend jobs.
While we always killed all of the processes when we were trying to
vacate, if a vanilla job forked, the starter would sometimes fail to
suspend some of the child processes.
In addition, we could sometimes fail to suspend a standard universe
job as well.
This is all fixed.
- Fixed a bug in the SMP-Startd's load average computation that
could cause processes spawned by Condor to not be associated with the
Condor load average.
This would cause the startd to over-estimate the owner's load average,
and under-estimate the Condor load, which would cause a cycle of
suspending and resuming a Condor job, instead of just letting it run.
- Fixed a bug in the SMP-Startd's load average computation that
could cause certain rare exceptions to be treated as fatal, when in
fact, the Startd could recover from them.
- Fixed a bug in the computation of the total physical memory on
some platforms that was resulting in an overflow on machines with
lots of ram (over 1 gigabyte).
- Fixed some bugs that could cause condor_ starter processes to
be left as zombies underneath the condor_ startd under very rare
conditions.
- For sites using AFS, if there are problems in the
condor_ startd computing the AFS cell of the machine it's running on,
the startd will exit with an error message at start-up time.
- Fixed a minor bug in condor_ install that would lead to a
syntax error in your config file given a certain set of installation
options.
- Added the -maxjobs option to the condor_ submit_dag
script that can be used to specify the maximum number of jobs Condor
will run from a DAG at any given time.
Also, condor_ submit_dag automatically creates a ``rescue DAG''.
See section 2.11 for details on DAGMan.
- Fixed bug in ClassAd printing when you tried to display an
integer or float attribute that didn't exist in the given ClassAd.
This could show up in condor_ status, condor_ q, condor_ history,
etc.
- Various commands sent to the Condor daemons now have separate
debug levels associated with them.
For example, commands such as ``keep-alives'', and the command sent
from the condor_ kbdd to the condor_ startd, are only seen in the
various log files if D_ FULLDEBUG is turned on, instead of at
D_ COMMAND, which is now enabled by default for all daemons on
all platforms.
Administrators retaining their old configuration when upgrading to
this version are encouraged to enable D_ COMMAND in the
SCHEDD_DEBUG setting.
In addition, for IRIX and Digital Unix machines, it should be enabled
in the STARTD_DEBUG setting as well.
See section 3.3.4 for details on debug levels in Condor.
- New debug levels added to Condor:
- D_ NETWORK, used by various daemons in Condor to report
various network statistics about the Condor daemons.
- D_ PROCFAMILY, used to report information about various
families of processes that are monitored by Condor.
For example, this is used in the condor_ startd when monitoring the
family of processes spawned by a given user job for the purposes of
computing the Condor load average.
- D_ KEYBOARD, used by the condor_ startd to print out
statistics about remote tty and console idle times in the
condor_ startd.
This information used to be logged at D_ FULLDEBUG, along with
everything else, so now, you can see just the idle times, and/or have
the information stored to a separate file.
- Added a -run option to condor_ q, which displays
information for running jobs, including the remote host where each job
is running.
- Macros can now be incrementally defined. See
section 3.3.1 for more details.
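An incremental definition lets a macro reference its own previous
value; the macro name and values below are invented examples:

```
## Incremental definition in a condor_config file: the second line
## appends to the macro's previous value instead of replacing it.
FLAGS = -O2
FLAGS = $(FLAGS) -g
## FLAGS now expands to "-O2 -g"
```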
- condor_ config_val can now be used to set configuration
variables. See the man page for more details.
- The job log file now contains a record of network activity. The
evict, terminate, and shadow exception events indicate the number of
bytes sent and received by the job for the specific run.
The terminate event additionally indicates totals for the life of the
job.
- STARTER_CHOOSES_CKPT_SERVER now defaults to true.
See section 3.3.8 for more details.
- The infrastructure for authentication within Condor has been
overhauled, allowing for much greater flexibility in supporting new
forms of authentication in the future.
This means that the 6.1.3 schedd and queue management tools (like
condor_ q, condor_ submit, condor_ rm and so on) are incompatible
with previous versions of Condor.
- Many of the Condor administration tools have been improved to
allow you to specify the ``subsystem'' you want them to affect.
For example, you can now use ``condor_ reconfig -startd'' to just
have the startd reconfigure itself.
Similarly, condor_ off, condor_ on and condor_ restart can now all
work on a single daemon, instead of machine-wide.
See the man pages (section 9) or run any command with -help
for details.
NOTE: The usage message in 6.1.3 incorrectly reports -all as a
valid option.
- Fixed a bug in the Condor tools that could cause a segmentation
violation in certain rare error conditions.
Version 6.1.2
- Fixed some bugs in the condor_ install script.
Also, enhanced condor_ install to customize the path to perl in
various perl scripts used by Condor.
- Fixed a problem with our build environment that left some files
out of the release.tar files in the binary releases on some
platforms.
- condor_dagman, ``DAGMan'' (see section 2.11 for details), is
now included in the development release by default.
- Fixed a bug in the computation of total physical memory on
HP-UX machines that resulted in an overflow on machines with
large amounts of RAM (over 1 gigabyte).
Also, if you define ``MEMORY'' in your config file, that value will
override whatever value Condor computes for your machine.
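For example, to override the detected value in the config file (assuming, as in Condor convention, that MEMORY is given in megabytes):

```
## Tell Condor this machine has 2 gigabytes of physical memory
MEMORY = 2048
```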
- Fixed a bug in condor_starter.pvm, the PVM version of the
Condor starter (available as an optional ``Contrib module''), that
occurred when STARTER_LOCAL_LOGGING was disabled.
Now, setting it to ``False'' will properly place debug messages
from condor_starter.pvm into the ShadowLog file of the
machine that submitted the job (as opposed to the StarterLog
file on the machine executing the job).
Version 6.1.1
- Fixed a bug in the condor_startd's computation of the load
average caused by Condor, which was producing incorrect values.
This could cause a cycle of continuous job suspends and resumes.
- Beginning with this version, any jobs linked with the Condor
checkpoint libraries will use the zlib compression code (used by gzip
and others) to compress periodic checkpoints before they are written
to the network.
These compressed checkpoints are uncompressed at startup time.
This saves network bandwidth, disk space, and time (when the
network is the bottleneck for checkpointing, which it usually is).
In future versions of Condor, all checkpoints will probably be
compressed, but at this time, compression is only used for periodic
checkpoints.
Note that you must relink your jobs with the condor_compile command
to enable this feature.
Old jobs (not relinked) will continue to run just fine; their
checkpoints simply will not be compressed.
- condor_status now has better support for displaying checkpoint
server ClassAds.
- More contrib modules from the development series are now
available, such as the checkpoint server, PVM support, and the
CondorView server.
- Fixed some minor bugs in the UserLog code that were causing
problems for DAGMan in exceptional error cases.
- Fixed an obscure bug in the logging code when D_PRIV was
enabled that could result in incorrect file permissions on log files.
Version 6.1.0
- Support has been added to the condor_startd to run multiple
jobs on SMP machines.
See section 3.13.7 for details about setting up and configuring SMP
support.
- The expressions that control the condor_startd policy for
vacating jobs have been simplified.
See section 3.5 for complete details on the new policy expressions,
and section 3.5.11 for an explanation of what is different from the
version 6.0 expressions.
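A minimal sketch of such policy expressions in the config file (the expression names follow the startd policy documentation in section 3.5; the thresholds here are illustrative, not recommended values):

```
## Start jobs only when the keyboard has been idle and the load is low
START    = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
## Suspend a running job as soon as the keyboard is touched
SUSPEND  = KeyboardIdle < $(MINUTE)
## Resume the job after the keyboard has been idle again for a while
CONTINUE = KeyboardIdle > 5 * $(MINUTE)
```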
- We now perform better tracking of processes spawned by Condor.
If children die and are inherited by init, we still know they belong
to Condor.
This allows us to better ensure that we do not leave processes lying
around when we need to get off a machine, and enables a much more
accurate computation of the load average generated by Condor (the
CondorLoadAvg reported by the condor_startd).
- The condor_collector can now store historical information
about your pool state.
This information can be queried with the condor_stats program (see
the man page), which is used by the condor_view Java GUI, available
as a separate contrib module.
- Condor jobs can now be put in a ``hold'' state with the
condor_hold command.
Such jobs remain in the job queue (and can be viewed with condor_q),
but there will not be any negotiation to find machines for them.
If a job is having a temporary problem (for example, the permissions
are wrong on files it needs to access), it can be put on hold until
the problem is solved.
Jobs put on hold can be released with the condor_release command.
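For example (job 12.0 is a hypothetical cluster.proc id):

```
% condor_hold 12.0       # job stays in the queue but is not matched
% condor_release 12.0    # job becomes eligible for matching again
```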
- condor_userprio now has the notion of user factors as a
way to place different groups of users at different priority levels.
See section 3.4 for details.
This includes the ability to specify a local priority domain, where
all users from other domains get a much worse priority.
- Usage statistics by user are now available from
condor_userprio.
See the man page for details.
- The condor_schedd has been enhanced to enable ``flocking'',
where it seeks matches with machines in multiple pools if its requests
cannot be serviced in the local pool.
See section 5.2 for more details.
- The condor_schedd has been enhanced to give condor_q and
other interactive tools better response time.
- The condor_schedd has also been enhanced to check the
permissions of the files you specify for input, output, error, and
so on.
If the schedd does not have the required access rights to the files,
the jobs will not be submitted, and condor_submit will print an
error message.
- When you perform a condor_rm command and the removed job
was using a ``user log'', the remove event is now recorded in the
log.
- Two new attributes have been added to the job ClassAd when it
begins executing: RemoteHost and LastRemoteHost.
These attributes list the IP address and port of the startd that is
currently running the job, or of the last startd to run the job (if
it has run on more than one machine).
This information helps users track their job's execution more closely,
and allows administrators to troubleshoot problems more effectively.
- The performance of checkpointing was increased by using larger
buffers for the network I/O used to get the checkpoint file on and off
the remote executing host (this helps for all pools, with or without
checkpoint servers).