next up previous contents index
Next: 3.9 DaemonCore Up: 3. Administrators' Manual Previous: 3.7 Networking   Contents   Index


3.8 The Checkpoint Server

A Checkpoint Server maintains a repository for checkpoint files. Using checkpoint servers reduces the disk requirements of submitting machines in the pool, since the submitting machines no longer need to store checkpoint files locally. Checkpoint server machines should have a large amount of disk space available, and they should have a fast connection to machines in the Condor pool.

If your spool directories are on a network file system, then checkpoint files will make two trips over the network: one between the submitting machine and the execution machine, and a second between the submitting machine and the network file server. If you install a checkpoint server and configure it to use the server's local disk, the checkpoint will travel only once over the network, between the execution machine and the checkpoint server. You may also obtain checkpointing network performance benefits by using multiple checkpoint servers, as discussed below.

NOTE: It is a good idea to pick very stable machines for your checkpoint servers. If individual checkpoint servers crash, the Condor system will continue to operate, although poorly. While the Condor system will recover from a checkpoint server crash as best it can, there are two problems that can (and will) occur:

  1. A checkpoint cannot be sent to a checkpoint server that is not functioning. Jobs will keep trying to contact the checkpoint server, backing off exponentially in the time they wait between attempts. Normally, jobs only have a limited time to checkpoint before they are kicked off the machine. So, if the server is down for a long period of time, chances are that a lot of work will be lost by jobs being killed without writing a checkpoint.

  2. If a checkpoint is not available from the checkpoint server, a job cannot be retrieved, and it will either have to be restarted from the beginning, or the job will wait for the server to come back online. This behavior is controlled with the MAX_DISCARDED_RUN_TIME parameter in the config file (see section 3.3.8 on page [*] for details). This parameter represents the maximum amount of CPU time you are willing to discard by starting a job over from scratch if the checkpoint server is not responding to requests.

3.8.1 Preparing to Install a Checkpoint Server

The location of checkpoints changes upon the installation of a checkpoint server. A configuration change would cause currently queued jobs with checkpoints to not be able to find their checkpoints. This results in the jobs with checkpoints remaining indefinitely queued (never running) due to the lack of finding their checkpoints. It is therefore best to either remove jobs from the queues or let them complete before installing a checkpoint server. It is advisable to shut your pool down before doing any maintenance on your checkpoint server. See section 3.10 on page [*] for details on shutting down your pool.

A graduated installation of the checkpoint server may be accomplished by configuring submit machines as their queues empty.

3.8.2 Installing the Checkpoint Server Module

Files relevant to a checkpoint server are

condor_ ckpt_server is the checkpoint server binary. condor_ cleanckpts is a script that can be periodically run to remove stale checkpoint files from your server. The checkpoint server normally cleans all old files itself. However, in certain error situations, stale files can be left that are no longer needed. You may set up a cron job that calls condor_ cleanckpts every week or so to automate the cleaning up of any stale files. The example configuration file give with the module is described below.

There are three steps necessary towards running a checkpoint server:

  1. Configure the checkpoint server.
  2. Start the checkpoint server.
  3. Configure your pool to use the checkpoint server.

Configure the Checkpoint Server

Place settings in the local configuration file of the checkpoint server. The file etc/examples/condor_config.local.ckpt.server contains the needed settings. Insert these into the local configuration file of your checkpoint server machine.

The CKPT_SERVER_DIR must be customized. The CKPT_SERVER_DIR attribute defines where your checkpoint files are to be located. It is better if this is on a very fast local file system (preferably a RAID). The speed of this file system will have a direct impact on the speed at which your checkpoint files can be retrieved from the remote machines.

The other optional settings are:

(Described in section 3.3.9). To have the checkpoint server managed by the condor_ master, the DAEMON_LIST entry must have MASTER and CKPT_SERVER. Add STARTD if you want to allow jobs to run on your checkpoint server. Similarly, add SCHEDD if you would like to submit jobs from your checkpoint server.

The rest of these settings are the checkpoint server-specific versions of the Condor logging entries, as described in section 3.3.4 on page [*].

The CKPT_SERVER_LOG is where the checkpoint server log is placed.

Sets the maximum size of the checkpoint server log before it is saved and the log file restarted.

Regulates the amount of information printed in the log file. Currently, the only debug level supported is D_ ALWAYS.

Start the Checkpoint Server

To start the newly configured checkpoint server, restart Condor on that host to enable the condor_ master to notice the new configuration. Do this by sending a condor_ restart command from any machine with administrator access to your pool. See section 3.6.8 on page [*] for full details about IP/host-based security in Condor.

Configure the Pool to Use the Checkpoint Server

After the checkpoint server is running, you change a few settings in your configuration files to let your pool know about your new server:

This parameter should be set to TRUE (the default).

This parameter should be set to the full hostname of the machine that is now running your checkpoint server.

It is most convenient to set these parameters in your global configuration file, so they affect all submission machines. However, you may configure each submission machine separately (using local configuration files) if you do not want all of your submission machines to start using the checkpoint server at one time. If USE_CKPT_SERVER is set to FALSE, the submission machine will not use a checkpoint server.

Once these settings are in place, send a condor_ reconfig to all machines in your pool so the changes take effect. This is described in section 3.10.2 on page [*].

3.8.3 Configuring your Pool to Use Multiple Checkpoint Servers

It is possible to configure a Condor pool to use multiple checkpoint servers. The deployment of checkpoint servers across the network improves checkpointing performance. In this case, Condor machines are configured to checkpoint to the nearest checkpoint server. There are two main performance benefits to deploying multiple checkpoint servers:

Once you have multiple checkpoint servers running in your pool, the following configuration changes are required to make them active.

First, USE_CKPT_SERVER should be set to TRUE (the default) on all submitting machines where Condor jobs should use a checkpoint server. Additionally, STARTER_CHOOSES_CKPT_SERVER should be set to TRUE (the default) on these submitting machines. When TRUE, this parameter specifies that the checkpoint server specified by the machine running the job should be used instead of the checkpoint server specified by the submitting machine. See section 3.3.8 on page [*] for more details. This allows the job to use the checkpoint server closest to the machine on which it is running, instead of the server closest to the submitting machine. For convenience, set these parameters in the global configuration file.

Second, set CKPT_SERVER_HOST on each machine. As described, this is set to the full hostname of the checkpoint server machine. In the case of multiple checkpoint servers, set this in the local configuraton file. It is the hostname of the nearest server to the machine.

Third, send a condor_ reconfig to all machines in the pool so the changes take effect. This is described in section 3.10.2 on page [*].

After completing these three steps, the jobs in your pool will send checkpoints to the nearest checkpoint server. On restart, a job will remember where its checkpoint was stored and get it from the appropriate server. After a job successfully writes a checkpoint to a new server, it will remove any previous checkpoints left on other servers.

NOTE: If the configured checkpoint server is unavailable, the job will keep trying to contact that server as described above. It will not use alternate checkpoint servers. This may change in future versions of Condor.

3.8.4 Checkpoint Server Domains

The configuration described in the previous section ensures that jobs will always write checkpoints to their nearest checkpoint server. In some circumstances, it is also useful to configure Condor to localize checkpoint read transfers, which occur when the job restarts from its last checkpoint on a new machine. To localize these transfers, we want to schedule the job on a machine which is near the checkpoint server on which the job's checkpoint is stored.

We can say that all of the machines configured to use checkpoint server ``A'' are in ``checkpoint server domain A.'' To localize checkpoint transfers, we want jobs which run on machines in a given checkpoint server domain to continue running on machines in that domain, transferring checkpoint files in a single local area of the network. There are two possible configurations which specify what a job should do when there are no available machines in its checkpoint server domain:

These two configurations are described below.

The first step in implementing checkpoint server domains is to include the name of the nearest checkpoint server in the machine ClassAd, so this information can be used in job scheduling decisions. To do this, add the following configuration to each machine:

  CkptServer = "$(CKPT_SERVER_HOST)"
For convenience, we suggest that you set these parameters in the global config file. Note that this example assumes that STARTD_EXPRS is defined previously in your configuration. If not, then you should use the following configuration instead:
  CkptServer = "$(CKPT_SERVER_HOST)"
  STARTD_EXPRS = CkptServer
Now, all machine ClassAds will include a CkptServer attribute, which is the name of the checkpoint server closest to this machine. So, the CkptServer attribute defines the checkpoint server domain of each machine.

To restrict jobs to one checkpoint server domain, we need to modify the jobs' Requirements expression as follows:

  Requirements = ((LastCkptServer == TARGET.CkptServer) || (LastCkptServer =?= UNDEFINED))
This Requirements expression uses the LastCkptServer attribute in the job's ClassAd, which specifies where the job last wrote a checkpoint, and the CkptServer attribute in the machine ClassAd, which specifies the checkpoint server domain. If the job has not written a checkpoint yet, the LastCkptServer attribute will be UNDEFINED, and the job will be able to execute in any checkpoint server domain. However, once the job performs a checkpoint, LastCkptServer will be defined and the job will be restricted to the checkpoint server domain where it started running.

If instead we want to allow jobs to transfer to other checkpoint server domains when there are no available machines in the current checkpoint server domain, we need to modify the jobs' Rank expression as follows:

  Rank = ((LastCkptServer == TARGET.CkptServer) || (LastCkptServer =?= UNDEFINED))
This Rank expression will evaluate to 1 for machines in the job's checkpoint server domain and 0 for other machines. So, the job will prefer to run on machines in its checkpoint server domain, but if no such machines are available, the job will run in a new checkpoint server domain.

You can automatically append the checkpoint server domain Requirements or Rank expressions to all STANDARD universe jobs submitted in your pool using APPEND_REQ_STANDARD or APPEND_RANK_STANDARD . See section 3.3.15 on page [*] for more details.

next up previous contents index
Next: 3.9 DaemonCore Up: 3. Administrators' Manual Previous: 3.7 Networking   Contents   Index