Next: 3.11 The High Availability Up: 3. Administrators' Manual Previous: 3.9 DaemonCore   Contents   Index


3.10 Pool Management

Condor provides administrative tools to help with pool management. The following sections describe some of the tasks these tools support.

All of the commands described in this section must be run from an authorized machine. An authorized machine is one that is listed in the HOSTALLOW_ADMINISTRATOR configuration variable, such that IP/host-based security allows the administrator commands to be serviced. See section 3.6.8 for full details about IP/host-based security in Condor.


3.10.1 Shutting Down and Restarting a Condor Pool

The installation of new binaries is a situation in which the shutdown and restart of an entire Condor pool is appropriate. It is generally best to make sure no jobs are running, shut down Condor, and then install the new daemons.


3.10.1.1 Shutting Down a Condor Pool

The best way to shut down a pool is to take advantage of the remote administration capabilities of the condor_master. The first step is to save the IP address and port of the condor_master daemon on every machine to a file, so that even if the condor_collector is shut down, administrator commands can still be sent to the individual machines. Use the following command:

  % condor_status -master -format "%s\n" MasterIpAddr > addresses
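
Because the addresses file is the only way to reach the condor_master daemons once the condor_collector is down, it may be worth confirming that the file was actually written before going further. The following is a minimal sketch; the function name check_addresses is hypothetical and is not part of Condor.

```shell
# Hypothetical helper (not a Condor tool): verify that the saved
# address file is usable. Prints the number of non-blank lines and
# fails if the file is missing or empty.
check_addresses() {
    file="$1"
    if [ ! -s "$file" ]; then
        echo "error: $file is missing or empty" >&2
        return 1
    fi
    # Count lines that contain at least one character.
    grep -c . "$file"
}

# Usage, after running the condor_status command above:
#   check_addresses addresses
```

The printed count can be compared by eye against the number of machines expected in the pool.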

The next step in shutting down the pool is to stop any currently running jobs and give them a chance to produce a checkpoint. Depending on the size of the pool, the network infrastructure, and the image size of the standard universe jobs running on the pool, it may be sensible to do this slowly, vacating only one host at a time. Either shut down hosts that have jobs submitted (in which case all the jobs from that host will try to produce a checkpoint simultaneously), or shut down individual hosts that are running jobs. To shut down a host, issue the command:

  % condor_off hostname
where hostname is the name of the host to be shut down. This only works as long as the condor_collector is still running. Once Condor is shut down on the central manager, rely on the addresses file already created.
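
Vacating one host at a time can be scripted. The sketch below loops over a list of hosts, issues condor_off for each, and pauses between hosts to give checkpoints time to complete. The names staged_off, OFF_CMD, and PAUSE are illustrative assumptions, and the 300-second default is an arbitrary placeholder rather than a recommended value.

```shell
# Hypothetical helper: shut down hosts one at a time, pausing between
# them to give checkpoints from one host time to complete before the
# next host is vacated. OFF_CMD and PAUSE are illustrative variables,
# not Condor configuration settings.
OFF_CMD="${OFF_CMD:-condor_off}"
PAUSE="${PAUSE:-300}"   # seconds between hosts (assumed value)

staged_off() {
    for host in "$@"; do
        echo "shutting down $host"
        "$OFF_CMD" "$host" || echo "warning: $OFF_CMD failed for $host" >&2
        sleep "$PAUSE"
    done
}

# Usage (hypothetical host names):
#   staged_off host1.example.org host2.example.org
```

A fixed pause does not guarantee that checkpoints have finished; it only spreads the network load over time.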

If all the running jobs have produced a checkpoint and stopped, or if the network load caused by shutting down everything at once is not a concern, it is safe to turn off all daemons on all machines in the pool. Do this with a single command, issued from an authorized administrator machine:

  % condor_off -all

condor_off will shut down all the daemons, but leave the condor_master running, so that a future condor_on will work.

Once all of the Condor daemons (except the condor_master) on each host are turned off, the shutdown is complete. It is now safe to install new binaries, move the checkpoint server to another host, or perform any other task that requires the pool to be shut down.

If planning to install a new condor_master binary, be sure to read the following section to learn of the special considerations associated with this somewhat delicate task.


3.10.1.2 Installing a New condor_master

To install a new condor_master binary, follow a few more steps. When the condor_master restarts, it will listen on a new port, so the addresses file will contain stale information. Moreover, when the condor_master restarts, it does not know of the previously issued condor_off command, and will just start up all the daemons it is configured to spawn. It must be explicitly told otherwise.

If it is desired that the pool completely restart itself whenever the condor_master notices its new binary, then neither of these issues is of any concern: skip this section and the next. Just be sure that installing the new condor_master binary is the final step. Once the new binary is in place, the pool will restart itself over the next 5 minutes, as each condor_master daemon notices the new binary; by default, each one checks for a new binary once every 5 minutes.

However, to have absolute control over when the rest of the daemons restart, take a few steps:

  1. Place the following in the global configuration file:
      START_DAEMONS = False
    
    This will make sure that when the master restarts itself, it does not also start up the rest of its daemons.
  2. Install the new condor_master binary.
  3. Start up Condor on the central manager machine. This is done manually by logging into the machine and sending commands locally. First, send a condor_restart to make sure the new condor_master is running, then send a condor_on to start up the other daemons (including, most importantly, the condor_collector).
  4. Wait 5 minutes, so that all the condor_master daemons have a chance to notice the new binary, restart themselves, and send an update with their new address. Make sure that:
      % condor_status -master
    
    lists all the machines in the pool.
  5. Remove the special setting from the global configuration file.
  6. Recreate the addresses file as described above:
      % condor_status -master -format "%s\n" MasterIpAddr > addresses
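
The check in step 4 can be partly automated by comparing the masters that have reported against a list of the hosts expected in the pool. The following is a minimal sketch under stated assumptions: missing_masters is a hypothetical helper, the expected-hosts file is maintained by hand with one name per line, and the Name attribute of each master ad is assumed to match the names in that file exactly.

```shell
# Hypothetical helper: print every host listed in the expected file
# that does not appear in the reported file. Both files hold one name
# per line; names are compared as whole lines, as fixed strings.
# Prints nothing when every expected host has reported.
missing_masters() {
    expected="$1"
    reported="$2"
    grep -v -x -F -f "$reported" "$expected"
}

# Usage (file names are illustrative; assumes the Name attribute of
# the master ads matches the entries in expected_hosts):
#   condor_status -master -format "%s\n" Name > reported_hosts
#   missing_masters expected_hosts reported_hosts
```

If any names are printed, wait and repeat the check before removing the special setting in step 5.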
    

Once the new master is in place, and you are ready to start up the pool again, restart your whole pool by following the steps in the next section.


3.10.1.3 Restarting your Condor Pool

Once all preliminary tasks are done and it is time to restart the pool, send a condor_on to the condor_master daemon on each host. Do this with a single command, issued from an authorized administrator machine:

  % condor_on `cat addresses`
At this point, all the daemons should be restarted, and the pool will be back in operation.


3.10.2 Reconfiguring a Condor Pool

To change a global configuration file setting and have all the machines start to use the new setting, send a condor_reconfig command to each host. Do this with a single command, issued from an authorized administrator machine:

  % condor_reconfig -all

If the global configuration file is not shared among all the machines (using a shared file system), the change must be made to each copy of the global configuration file before issuing the condor_reconfig command.
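
When the configuration file is not shared, one way to update every copy is to push the edited file to each host before reconfiguring. The sketch below is an illustration only: push_config and COPY_CMD are hypothetical names, scp is an assumed copy mechanism, and the file is assumed to live at the same path on every host.

```shell
# Hypothetical helper: copy an edited configuration file to the same
# path on each listed host. COPY_CMD is an illustrative variable; scp
# is an assumed transport, not something Condor requires.
COPY_CMD="${COPY_CMD:-scp}"

push_config() {
    config="$1"
    shift
    failed=0
    for host in "$@"; do
        "$COPY_CMD" "$config" "$host:$config" || {
            echo "copy to $host failed" >&2
            failed=1
        }
    done
    # Non-zero if any copy failed, so the caller can skip the reconfig.
    return "$failed"
}

# Usage (hypothetical path and host names):
#   push_config /path/to/condor_config host1 host2 && condor_reconfig -all
```

Chaining with && means condor_reconfig -all is issued only if every copy succeeded, which avoids hosts picking up a stale file.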


3.10.3 Using Dynamic Attributes

This section has not yet been written.

condor-admin@cs.wisc.edu