
7.1 Obtaining & Installing Condor


Where can I download Condor?

Condor can be downloaded from http://www.cs.wisc.edu/condor/downloads (Madison, Wisconsin, USA) or http://www.bo.infn.it/condor-mirror/downloads (a mirror site at the Istituto Nazionale di Fisica Nucleare in Bologna, Italy).

When I click to download Condor, it sends me back to the downloads page!

If you are trying to download Condor through a web proxy, try disabling it. Our web site uses the "referring page" as you navigate through our download menus in order to give you the right version of Condor, but sometimes proxies block this information from reaching our web site.

What platforms do you support?

See Section 1.5. Also, you might want to read the platform-specific information in Chapter 6.

What versions of Red Hat Linux does Condor support?

See Section 6.1.


Do you distribute source code?

At this time we do not distribute source code publicly, but instead consider requests on a case-by-case basis. If you need the source code, please e-mail us at condor-admin@cs.wisc.edu explaining why, and we'll get back to you.


How do I upgrade the Unix machines in my pool from 6.4.x to 6.6.x?

This series of steps explains how to upgrade a pool of machines from running Condor version 6.4.x to version 6.6.x. Read through the entire set of directions before following them.

Briefly, you download the new version and replace your current binaries with the new binaries. Condor checks for new binaries every few minutes. The next time it checks, it notices the replacement and begins using the new binaries.

Step 1: (Optional) Place test jobs in queue
This optional first step safeguards jobs currently in the queue when you upgrade. By completing this extra step, you will not lose any partially completed jobs, even if something goes wrong with your upgrade.

Create test jobs that use each universe you use in your Condor pool. Submit each job, and put the job in the hold state using condor_hold.

Step 2: Place all jobs on hold
Place all jobs into the hold state while replacing binaries.

Step 3: Download Condor 6.6.x
To ensure that both new and current binaries are within the same volume, make a new directory within your current release directory where 6.6.x will go. Unix commands will be of the form
  cd <release-dir>
  mkdir new
  cd new

Locate the correct version of the Condor binary, and download into this new directory.

Do not install the downloaded version. Instead, uncompress and then untar the downloaded version, and then untar the release archive it contains (called release.tar). This will create the directories

      bin
      etc
      include
      sbin
      libexec
      lib
      man
From this list of created directories, bin, include, sbin, libexec, and lib will be used to replace current directories. Note that older versions of Condor do not have a libexec directory.
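
The unpacking part of this step can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name unpack_condor_release is made up for this example, and it assumes the download is a gzip-compressed tar file that contains release.tar somewhere inside it. Adjust it to match the file you actually downloaded.

```shell
# Hypothetical helper (not part of Condor): unpack a downloaded archive
# into <release-dir>/new without running any install script.
# Assumes a gzip-compressed tar file that contains release.tar.
unpack_condor_release() {
    release_dir=$1   # your current Condor release directory
    archive=$2       # path to the downloaded archive

    mkdir -p "$release_dir/new" || return 1

    # Uncompress and untar the download inside the new directory.
    gzip -dc "$archive" | tar -xf - -C "$release_dir/new" || return 1

    # Locate release.tar and untar it directly into the new directory.
    release_tar=$(find "$release_dir/new" -type f -name release.tar | head -n 1)
    if [ -z "$release_tar" ]; then
        echo "release.tar not found under $release_dir/new" >&2
        return 1
    fi
    tar -xf "$release_tar" -C "$release_dir/new"
}
```

Called as unpack_condor_release <release-dir> <downloaded-file>, this should leave bin, sbin, and the other directories listed above under <release-dir>/new, ready for Step 5.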

Step 4: Configuration files
The configuration file in the downloaded version 6.6.x contains additional suggested settings for configuration macros, to go with new features in Condor. These extra configuration macros are not required in order to run Condor version 6.6.x.

Make a backup copy of the current configuration, so that you can back out of the upgrade if something goes wrong.

Work through the new example configuration file to see if there is anything useful, and merge it into your site-specific (current) configuration file.

Note that starting in Condor 6.6.x, security sessions are turned on by default. If you will be retaining some 6.4.x series Condor installations in your pool, you must turn security sessions off in your 6.6.x configuration files. This can be accomplished by setting

SEC_DEFAULT_NEGOTIATION = NEVER

Also in 6.6.x, the definition of Hawkeye / Startd Cron jobs has changed. The old syntax allowed the following

HAWKEYE_JOBS =\
	job1:job1_:/path/to/job1:1h \
	job2:job2_:/path/to/job2:5m \
	...

This is no longer supported, and must be replaced with the following

HAWKEYE_JOBS = job1:job1_:/path/to/job1:1h
HAWKEYE_JOBS = $(HAWKEYE_JOBS) job2:job2_:/path/to/job2:5m
HAWKEYE_JOBS = $(HAWKEYE_JOBS) ...

Note also that in 6.6.x, the condor_collector and condor_negotiator can be set to run on non-standard ports. Doing so will cause older (6.4.x and earlier) Condor installations in that pool to no longer function.

Step 5: Replace release directories
For each of the directories that is to be replaced, move the current one aside, and put the new one in its place. The Unix commands to do this will be of the form
  cd <release-dir>

  mv bin bin.v64
  mv new/bin bin

  mv include include.v64
  mv new/include include

  mv sbin sbin.v64
  mv new/sbin sbin

  mv lib lib.v64
  mv new/lib lib

Do this series of directory moves at one sitting, especially avoiding a long time lag between the moves relating to the sbin directory. Condor imposes a delay by design, but it does not idly wait for the new binaries to be in place.
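
Because the moves should happen close together, one option is to script them. The following is a minimal sketch, not a supported tool: the function name swap_release_dirs is hypothetical, and it handles only the four directories shown in the commands above, using the same .v64 suffix. It checks every directory before moving anything, so that a missing directory does not leave the swap half done.

```shell
# Hypothetical helper (not part of Condor): perform the Step 5 moves
# for bin, include, sbin, and lib in one quick pass.
# The body runs in a subshell, so the caller's working directory is untouched.
swap_release_dirs() (
    cd "$1" || exit 1          # $1 is <release-dir>

    # Check everything first, so the swap is not abandoned half way.
    for d in bin include sbin lib; do
        if [ ! -d "$d" ] || [ ! -d "new/$d" ] || [ -e "$d.v64" ]; then
            echo "cannot swap $d; nothing was changed" >&2
            exit 1
        fi
    done

    # Move each current directory aside and put the new one in its place.
    for d in bin include sbin lib; do
        mv "$d" "$d.v64" && mv "new/$d" "$d" || exit 1
    done
)
```

Keeping the old directories under the .v64 names is what makes it possible to back out of the upgrade later.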

Step 6: Observe propagation of new binaries

Use condor_status to observe the propagation of the upgrade through the pool. As the machines notice and use the new binaries, their version number will change. Complete propagation should occur in five to ten minutes.

The command

condor_status -format "%s" Machine -format " %s\n" CondorVersion
gives a single line of information about each machine in the pool, containing only the machine name and version of Condor it is running.
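
In a large pool it may be easier to watch a count per version than a line per machine. The sketch below assumes the output format produced by the command above (machine name, a space, then the version string). The function name summarize_versions is made up for this example.

```shell
# Hypothetical helper (not part of Condor): read "machine version" lines
# on standard input and print how many machines report each version.
summarize_versions() {
    # Drop the machine name (first field), then count identical versions.
    cut -d' ' -f2- | sort | uniq -c
}

# Usage, on a machine in a running pool:
#   condor_status -format "%s" Machine -format " %s\n" CondorVersion | summarize_versions
```

When the upgrade has fully propagated, the summary should show a single version line.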

Step 7: (Optional) Release test jobs
Release the test jobs that were placed into the hold state in Step 1. If these test jobs complete successfully, then the upgrade is successful. If these test jobs fail (possibly by leaving the queue before finishing), then the upgrade is unsuccessful. If unsuccessful, back out of the upgrade by replacing the new configuration file with the backup copy and moving the Version 6.4.x release directories back to their previous location. Also send e-mail to condor-admin@cs.wisc.edu, explaining the situation and we'll help you work through it.

Step 8: Release all jobs
Release all jobs in the queue by running condor_release.

Step 9: (Optional) Install manual pages

The man directory was new with Condor version 6.4.x. It contains manual pages. Note that installation of manual pages is optional; the manual pages are also available in section 9.

To install the manual pages, move the man directory from <release-dir>/new to the desired location. Add the path name of this directory to the MANPATH environment variable.
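
For a Bourne-style shell, setting MANPATH might look like the following sketch. The directory /usr/local/condor/man is only a placeholder for wherever you moved the man directory.

```shell
# Placeholder location; replace it with wherever you moved the man directory.
condor_man_dir=/usr/local/condor/man

# Put the Condor manual pages ahead of anything already in MANPATH.
# If MANPATH was empty or unset, it becomes just the Condor directory.
MANPATH="$condor_man_dir${MANPATH:+:$MANPATH}"
export MANPATH
```

On some systems, setting MANPATH replaces the default search path rather than adding to it, so if MANPATH was previously unset you may also need to add the system manual directories.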


What is Personal Condor?

Personal Condor is a term used to describe a specific style of Condor installation suited for individual users who do not have their own pool of machines, but want to submit Condor jobs to run elsewhere.

A Personal Condor is essentially a one-machine, self-contained Condor pool which can use flocking to access resources in other Condor pools. See Section 5.2 for more information on flocking.

What do I do now? My installation of Condor does not work.

What to do to get Condor running properly depends on what sort of error occurs. One common category of error is communication errors. Condor daemon log files report a failure to bind. For example:

(date and time) Failed to bind to command ReliSock

Or, the errors in the various log files may be of the form:

(date and time) Error sending update to collector(s)
(date and time) Can't send end_of_message
(date and time) Error sending UDP update to the collector

(date and time) failed to update central manager

(date and time) Can't send EOM to the collector

This problem can also be observed by running condor_status. It will give a message of the form:

Error:  Could not fetch ads --- error communication error

To solve this problem, understand that Condor uses the first network interface it sees on the machine. Since machines often have more than one interface, this problem usually implies that the wrong network interface is being used. It also may be the case that the system simply has the wrong IP address configured.

It is incorrect to use the localhost network interface. This has IP address 127.0.0.1 on all machines. To check if this incorrect IP address is being used, look at the contents of the CollectorLog file on your pool's central manager right after it is started. The contents will be of the form:

5/25 15:39:33 ******************************************************
5/25 15:39:33 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
5/25 15:39:33 ** $CondorVersion: 6.2.0 Mar 16 2001 $
5/25 15:39:33 ** $CondorPlatform: INTEL-LINUX-GLIBC21 $
5/25 15:39:33 ** PID = 18658
5/25 15:39:33 ******************************************************
5/25 15:39:33 DaemonCore: Command Socket at <128.105.101.15:9618>

The last line tells the IP address and port the collector has bound to and is listening on. If the IP address is 127.0.0.1, then Condor is definitely using the wrong network interface.
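
This check can be scripted. The sketch below assumes the log line format shown above. The function names are hypothetical, and you must supply the path to your own CollectorLog file.

```shell
# Hypothetical helpers (not part of Condor).

# Print the IP address from the most recent "Command Socket" line in a log.
collector_bind_ip() {
    sed -n 's/.*DaemonCore: Command Socket at <\([0-9.]*\):[0-9]*>.*/\1/p' "$1" | tail -n 1
}

# Report whether the collector bound to the localhost interface.
# Returns 0 if the address looks usable, 1 for 127.0.0.1, 2 if nothing was found.
check_collector_ip() {
    ip=$(collector_bind_ip "$1")
    case $ip in
        "")        echo "no Command Socket line found in $1"; return 2 ;;
        127.0.0.1) echo "collector is bound to localhost ($ip): wrong interface"; return 1 ;;
        *)         echo "collector is bound to $ip" ;;
    esac
}

# Usage:
#   check_collector_ip /path/to/CollectorLog
```

Run against the example log above, collector_bind_ip prints 128.105.101.15.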

There are two solutions to this problem. One solution changes the order of the network interfaces. The preferred solution sets which network interface Condor should use by adding the following parameter to the local Condor configuration file:

NETWORK_INTERFACE = machine-ip-address

Where machine-ip-address is the IP address of the interface you wish Condor to use.

After an installation of Condor, why do the daemons refuse to start, placing this message in the log files?

ERROR "The following configuration macros appear to contain default values 
that must be changed before Condor will run.  These macros are:
hostallow_write 
(found on line 1853 of /scratch/adesmet/TRUNK/work/src/localdir/condor_config)"
at line 217 in file condor_config.C

As of Condor 6.8.0, if Condor sees the bare keyword YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE as the value of a configuration file entry, Condor daemons will log the given error message and exit.

By default, an installation of Condor 6.8.0 and later releases will have the configuration file entry HOSTALLOW_WRITE set to the above sentinel value. The Condor administrator must alter this value to be the correct domain or IP addresses that the administrator desires. The wildcard character (*) may be used to define this entry, but that allows anyone, from anywhere, to submit jobs into your pool. A better value will be of the form *.domainname.com.
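
To find every entry that still carries the sentinel value before starting the daemons, a plain grep over the configuration file is enough. The function name below is made up for this sketch, and the path to your configuration file is up to you.

```shell
# Hypothetical helper (not part of Condor): list, with line numbers, the
# configuration entries that are still set to the sentinel value.
# Returns non-zero when no such entry is found.
find_placeholder_entries() {
    grep -n 'YOU_MUST_CHANGE_THIS_INVALID_CONDOR_CONFIGURATION_VALUE' "$1"
}

# Usage:
#   find_placeholder_entries /path/to/condor_config
```

Each line it prints is an entry that must be edited, for example by setting HOSTALLOW_WRITE to a value of the form *.domainname.com.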


Why do standard universe jobs never run after an upgrade?

Standard universe jobs that remain in the job queue across an upgrade from any Condor release previous to 6.7.15 to any Condor release of 6.7.15 or more recent cannot run. They are missing a required ClassAd attribute (LastCheckpointPlatform) added for all standard universe jobs as of Condor version 6.7.15. This new attribute describes the platform where a job was running when it produced a checkpoint. The attribute is utilized to identify platforms capable of continuing the job (using the checkpoint).

This attribute became necessary because of bugs in some Linux kernels: a standard universe job may be continued on some, but not all, Linux machines, and the CkptOpSys attribute is not specific enough to distinguish between them.

There are two possible solutions for these standard universe jobs that cannot run, yet are in the queue:

  1. Remove and resubmit the standard universe jobs that remain in the queue across the upgrade. This includes all standard universe jobs that have flocked in to the pool. Note that the resubmitted jobs will start over again from the beginning.

  2. For each standard universe job in the queue, modify its job ClassAd so that it can run within the upgraded pool. If the job has already run and produced a checkpoint on a machine before the upgrade, determine which machine produced the checkpoint using the LastRemoteHost attribute in the job's ClassAd. Then look at that machine's ClassAd (after the upgrade) to determine and extract the value of the CheckpointPlatform attribute. Add this value (using condor_qedit) as the value of the new attribute LastCheckpointPlatform in the job's ClassAd. Note that this operation must also be performed on standard universe jobs flocking in to an upgraded pool. It is recommended that pools that flock between each other upgrade to a post 6.7.15 version of Condor.

Note that if the upgrade to Condor takes place at the same time as a platform change (such as booting an upgraded kernel), there is no way to properly set the LastCheckpointPlatform attribute. The only option is to remove and resubmit the standard universe jobs.
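
For a single job, the second solution above might be scripted roughly as follows. Treat this as a sketch to adapt, not a tested procedure. The function name is hypothetical. It assumes that condor_q and condor_status accept the -format option with a job id or machine name as shown, and that any "vmN@" prefix on LastRemoteHost can be stripped to obtain the machine name. Check the manual pages for your version before relying on it.

```shell
# Hypothetical helper (not part of Condor).
# $1 is a job id of the form cluster.proc
set_last_checkpoint_platform() {
    job=$1

    # Machine that produced the job's last checkpoint.
    host=$(condor_q -format "%s\n" LastRemoteHost "$job" | head -n 1)
    [ -n "$host" ] || { echo "job $job has no LastRemoteHost" >&2; return 1; }
    host=${host#*@}    # assumption: strip a "vmN@" prefix if present

    # That machine's CheckpointPlatform, as advertised after the upgrade.
    platform=$(condor_status -format "%s\n" CheckpointPlatform "$host" | head -n 1)
    [ -n "$platform" ] || { echo "no CheckpointPlatform for $host" >&2; return 1; }

    # A string value passed to condor_qedit needs its own double quotes.
    condor_qedit "$job" LastCheckpointPlatform "\"$platform\""
}
```

The function stops without editing the job if either attribute cannot be found, which is the case for a job that has not yet run or a machine that is no longer in the pool.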

