condor_glidein

add a Globus resource to a Condor pool

Synopsis

condor_glidein [-help] [-admin address] [-anybody] [-archdir dir] [-basedir basedir] [-count CPUcount] [-genconfig] [-genstartup] [-gensubmit] [-idletime minutes] [-localdir dir] [-memory MBytes] [-project name] [-queue name] [-runtime minutes] [-runonly] [-setuponly] [-setup_here] [-setup_jobmanager jobmanager] [-scheduler name] [-suffix suffix] [-useconfig filename] [-usestartup filename] [-vms VMcount] {-contactfile filename | Globus contact string}

Description

condor_glidein allows the temporary addition of a Globus resource to a local Condor pool. The addition is accomplished by installing and executing some of the Condor daemons on the Globus resource. A glidein_startup job appears in the queue of the local Condor pool for each glidein request. To remove the Globus resource from the local Condor pool, use condor_rm to remove the glidein_startup job from the job queue.

You must have an X.509 certificate and access to the Globus resource to use condor_glidein. The Globus software must also be installed.

Globus is a software system that provides uniform access to different high-performance computing resources. When specifying a machine to use with Globus, you provide a Globus contact string. Often, the contact string can be just the hostname of the machine. Sometimes, a more complicated contact string is required. For example, if a machine has multiple schedulers (ways to run a job), the contact string may need to specify which to use. See the Globus home page, http://www.globus.org/ for more information about Globus.

condor_glidein works in two steps: setup and execution. During setup, a configuration file, a startup file, and the Condor daemons condor_master, condor_startd, and condor_starter are installed on the Globus resource. Binaries for the correct architecture are copied from a central server. To obtain access to the server, or to set up your own server, follow the instructions on the Glidein Server Setup page at http://www.cs.wisc.edu/condor/glidein. Setup need only be done once per site. The execution step starts the Condor daemons through the resource's Globus interface.

By default, all files placed on the remote machine are placed in $(HOME)/Condor_glidein (or whatever -basedir is defined to be). It is assumed that this directory is shared by all of the machines that will be running the glidein Condor daemons. By default, the daemon log files will also be written into this area, but you are encouraged to change this (e.g. with -localdir) to make them write to local scratch space on the execution machine. However, for debugging initial problems, it may be convenient to have the log files in a more accessible place. If you do leave the default setting alone, you should at least occasionally clean up old log and execute directories left behind by glideins or you may eventually run out of space.

Examples

To set up and run 10 glideins under PBS on a grid site with a gatekeeper named gatekeeper.site.edu:

% condor_glidein -count 10 gatekeeper.site.edu/jobmanager-pbs

If you try something like the above and condor_glidein is not able to automatically determine everything it needs to know about the remote site, it will ask you to provide more information. A typical result of this process is something like the following command:

% condor_glidein \
    -count 10 \
    -arch 6.6.7-i686-pc-Linux-2.4 \
    -setup_jobmanager jobmanager-fork \
    gatekeeper.site.edu/jobmanager-pbs

You may use condor_q to see the glidein jobs that have been submitted. Once they successfully run, you may see them join your Condor pool by using condor_status.

See the list of common problems and solutions near the end of this section if you have trouble getting the system to work.

Options

-help
Display brief usage information and exit.
-basedir basedir
Specifies the base directory on the Globus resource used for placing files. The default is $(HOME)/Condor_glidein on the Globus resource.
-archdir dir
Specifies the directory on the Globus resource for placement of the executables. The default value for -archdir, given according to version information on the Globus resource, is basedir/<condor-version>-<Globus canonicalsystemname>. An example of the directory (without the base directory on the Globus resource) for Condor version 6.1.13 running on a Sun Sparc machine with Solaris 2.6 is 6.1.13-sparc-sun-solaris-2.6.
-localdir dir
Specifies the directory on the Globus resource in which to create the log and execution subdirectories needed by Condor. If limited disk quota in the home or base directory on the Globus resource is a problem, set -localdir to a large temporary space, such as /tmp or /scratch. If the batch system makes glidein start up in a temporary scratch directory, you can use "." for -localdir.
-contactfile filename
Allows the use of a file of Globus contact strings, rather than the single Globus contact string given in the command line. For each of the contacts listed in the file, the Globus resource is added to the local Condor pool.
-runonly
Starts execution of the Condor daemons on the Globus resource. If any of the files are missing, exits with an error code. This option cannot be used together with -setuponly.
-run_here
Runs condor_master directly rather than submitting it to Condor-G for remote execution. To instead generate a script that does this, use -run_here in combination with -gensubmit. This may be useful for running Glidein on resources that are not directly accessible to Condor-G.
-setuponly
Performs only the placement of files on the Globus resource. This option cannot be used together with -runonly.
-setup_here
Runs the setup process directly instead of submitting a setup job to the remote Globus resource. For example, this may be used to install glidein in an AFS area that is read-only from the remote Globus resource.
-setup_jobmanager jobmanager
Jobmanager to use for running the Glidein setup process, for example jobmanager-fork. If a reasonable default can be discovered through MDS, this is optional.
-arch architecture
Identifies the glidein tarball to download and install. If a reasonable default can be discovered through MDS, this is optional. A list of possible values may be found at http://www.cs.wisc.edu/condor/glidein/binaries. The architecture name is the same as the tarball name minus the .tar.gz extension. For example: 6.6.5-i686-pc-Linux-2.4

-scheduler name
Selects the Globus job scheduler type. Defaults to fork. NOTE: Contact strings which already contain the scheduler type will not be overridden by this option.
-queue name
The argument name is a string which specifies which job queue is to be used for submission on the Globus resource.
-project name
The argument name is a string which specifies which project is to be used for submission on the Globus resource.
-memory MBytes
The maximum memory size to request from the Globus resource (in megabytes).
-count CPUcount
Number of CPUs to request. The default is 1.
-vms VMcount
For machines with multiple CPUs, the CPUs may be divided up into virtual machines. VMcount is the number of virtual machines that results. By default, Condor divides multiple-CPU resources such that each CPU is a virtual machine, each with an equal share of RAM, disk, and swap space. This option configures the number of virtual machines, so that multi-threaded jobs can run in a virtual machine with multiple CPUs. For example, if 4 CPUs are requested and -vms is not specified, Condor will divide the request up into 4 virtual machines with 1 CPU each. However, if -vms 2 is specified, Condor will divide the request up into 2 virtual machines with 2 CPUs each, and if -vms 1 is specified, Condor will put all 4 CPUs into one virtual machine.
-idletime minutes
How long the Condor daemons on the Globus resource can remain idle before the resource reverts back to its former state of not being part of the local Condor pool. If the value is 0 (zero), the resource will not revert back to its former state. In this case, the Condor daemons will run until the time given by -runtime expires, or until they are killed by the resource or with condor_rm. The default value is 20 minutes.
-runtime minutes
How long the Condor daemons on the Globus resource will run before shutting themselves down. This option is useful for resources with enforced maximum run times. Setting runtime to be a few minutes shorter than the allowable limit gives the daemons time to perform a graceful shutdown.
-anybody
Sets the Condor START expression to TRUE to allow any user job which meets the job's requirements to run on the Globus resource added to the local Condor pool. Without this option, only jobs owned by the user executing condor_glidein can execute on the Globus resource. WARNING: Using this option may violate the usage policies of many institutions.
-admin address
Where to send e-mail with problems. The default is the login of the user running condor_glidein at the UID domain of the local Condor pool.
-genconfig
This option creates a local copy of the configuration file used on the Globus resource. The file is called glidein_condor_config{suffix}. You may edit this file and use -useconfig to install your modified config.
-useconfig config_file
This option makes the setup process copy the config file you specify, rather than generating one from scratch.
-genstartup
This option creates a local copy of the startup script used on the Globus resource when condor_master runs. The file is called glidein_startup{suffix}. You may edit this file and use -usestartup to install your modified startup script.
-usestartup startup_file
This option makes the setup process copy the startup script you specify, rather than generating one from scratch.
-suffix suffix
Suffix to use when generating files. The default is the process ID.
-gsi_daemon_name cert_name
Using this option turns on GSI authentication in the glidein configuration. The argument to this option is the GSI certificate name that the glidein daemons will use to authenticate themselves. It should be set to whatever certificate name you will use to execute the glideins.
-install_gsi_trusted_ca_dir path
Using this option turns on GSI authentication in the glidein configuration. The argument to this option is the path to the trusted CA certificates that you wish the glidein daemons to use (e.g. /etc/grid-security/certificates). The contents of this directory will be installed at the remote site in -basedir/grid-security.
-install_gsi_gridmap file
Using this option turns on GSI authentication in the glidein configuration. The argument to this option is the filename of the grid-mapfile that you wish the glidein daemons to use. The file will be installed at the remote site in -basedir/grid-security. The file should contain entries mapping grid certificates to user names. At the very least, it must contain an entry for the certificate given with -gsi_daemon_name. If your other Condor daemons use different certificates, then this file should also mention any certificates that the glidein daemons will encounter (schedd, collector, and negotiator). See section 3.6.3 for more information.
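
The division of CPUs into virtual machines described under -vms can be sketched in shell. The helper below is illustrative only and is not part of Condor; the name cpus_per_vm is an assumption. It covers only evenly divisible requests, because this page does not say how an uneven split is handled.

```shell
# cpus_per_vm COUNT [VMS]
# Illustrative helper, not part of Condor: prints how many CPUs each
# virtual machine receives when COUNT CPUs are split into VMS machines.
# With VMS omitted, each CPU becomes its own virtual machine (the default).
cpus_per_vm() {
    cpv_count=$1
    cpv_vms=${2:-$1}
    if [ "$cpv_vms" -le 0 ] || [ $(( cpv_count % cpv_vms )) -ne 0 ]; then
        echo "uneven split of $cpv_count CPUs into $cpv_vms VMs is not covered here" >&2
        return 1
    fi
    echo $(( cpv_count / cpv_vms ))
}

cpus_per_vm 4      # -count 4          -> prints 1
cpus_per_vm 4 2    # -count 4 -vms 2   -> prints 2
cpus_per_vm 4 1    # -count 4 -vms 1   -> prints 4
```

The three calls mirror the 4-CPU example given in the -vms description.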

Exit Status

condor_glidein will exit with a status value of 0 (zero) upon complete success. The script exits with non-zero values upon failure. The status value will be 1 (one) if condor_glidein encountered an error making a directory, was unable to copy a tar file, encountered an error in parsing the command line, or was not able to gather required information. The status value will be 2 (two) if there was an error in the remote setup. The status value will be 3 (three) if there was an error in remote submission. The status value will be -1 (negative one) if no resource was specified in the command line.
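
A wrapper script can translate these status values into messages. The sketch below is an illustration, not part of Condor, and the function name describe_glidein_status is an assumption. Because a process exit status is reported to the shell as an 8-bit value, the documented -1 appears in $? as 255.

```shell
# describe_glidein_status STATUS
# Illustrative helper, not part of Condor: maps a condor_glidein exit
# status, as seen in $?, to the meanings listed in this section.
describe_glidein_status() {
    case "$1" in
        0)   echo "success" ;;
        1)   echo "local error (directory, tar file copy, command line, or missing information)" ;;
        2)   echo "error in the remote setup" ;;
        3)   echo "error in remote submission" ;;
        255) echo "no resource specified (status -1 appears as 255 in the shell)" ;;
        *)   echo "undocumented status $1" ;;
    esac
}

describe_glidein_status 2
```

In a POSIX shell, calling describe_glidein_status $? immediately after condor_glidein returns prints the matching description.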

Common problems are listed below. Many of these are best discovered by looking in the remote StartLog in the glidein "localdir".

WARNING: The file xxx is not writable by condor
This happens if you run condor_glidein from a directory that does not have the right permissions for Condor to access files. If you are in an AFS directory, keep in mind that Condor does not have your AFS ACLs.

Glideins fail to run due to GLIBC errors
Check the list of available glidein binaries (http://www.cs.wisc.edu/condor/glidein/binaries) and try setting up glidein with an architecture name that includes the correct glibc version for the remote site.

Glideins join pool but no jobs run on them
One common cause of this problem is that the glidein machines are in a different filesystem domain and your jobs have been submitted with an implicit requirement that they must run in the same filesystem domain. If this is your problem, see section 2.5.4 for details on using Condor's file-transfer capabilities. Another cause of this problem is a communication failure. For example, a firewall may be preventing condor_negotiator or condor_schedd from connecting to the glidein condor_startd. Although work is being done to remove this requirement in the future, it is currently necessary to have full bi-directional connectivity, at least over a restricted range of ports. See page [*] for more information on configuring a port range.

Glideins run but fail to join the pool
This may be caused by your pool's security settings or by a communication failure. Check that the security settings in your pool's Condor config file allow write access to the glidein machines. If you do not wish to modify the security settings for the pool, you can run a separate pool specifically for the glideins and use flocking to balance jobs across the two pools of resources. If instead the glidein daemon log files indicate a communication failure, then see the next item.

The startd cannot connect to the collector
This may be caused by several things. One is a firewall. Another is when the compute nodes do not have even outgoing network access. Configuring Glidein to work without full network access to and from the compute nodes is still in the experimental stages, so for now, the short answer is that you must at least have a range of open (bi-directional) ports and set up the glidein config file as described on page [*]. (Use -genconfig, edit the file, and then use -useconfig.)

Another possible cause of connectivity problems may be the use of UDP by the condor_startd to register itself with the collector. You can force it to use TCP as described on page [*].

Yet another possible cause of connectivity problems is when the glidein machines have more than one network interface and the default one chosen by Condor is not the correct one. One way to fix this is to modify the glidein startup script (using -genstartup and -usestartup). The script simply needs to determine the IP address associated with the correct network interface and assign this to the environment variable _condor_NETWORK_INTERFACE.
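
As a rough sketch of such a startup script change, the helper below extracts an IPv4 address from the output of the Linux ip command. It assumes the remote machines provide iproute2's ip utility; other systems need a different lookup (for example, parsing ifconfig output). The function name first_ipv4 and the variable GLIDEIN_IFACE are hypothetical, and the sample interface name and address are made up.

```shell
# first_ipv4
# Illustrative helper, not part of Condor: reads `ip -o -4 addr show` style
# output on standard input and prints the first IPv4 address found,
# without its /prefix length.
first_ipv4() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "inet") {
                split($(i + 1), parts, "/")
                print parts[1]
                exit
            }
        }
    }'
}

# Sample line in the format printed by `ip -o -4 addr show dev eth1`
# (the interface name and address here are made up).
printf '3: eth1    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth1\n' | first_ipv4

# In an edited startup script, the real lookup would look like this
# (GLIDEIN_IFACE is a hypothetical variable naming the correct interface):
#   _condor_NETWORK_INTERFACE=$(ip -o -4 addr show dev "$GLIDEIN_IFACE" | first_ipv4)
#   export _condor_NETWORK_INTERFACE
```

The assignment must run before the startup script launches condor_master so that the daemons inherit the variable.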

NFS file locking problems
If -localdir is configured to be on NFS (not recommended, but sometimes convenient for testing), the Condor daemons may have trouble manipulating file locks. You may insert the following into the Glidein config file:

IGNORE_NFS_LOCK_ERRORS = True

Author

Condor Team, University of Wisconsin-Madison

Copyright

Copyright © 1990-2006 Condor Team, Computer Sciences Department, University of Wisconsin-Madison, Madison, WI. All Rights Reserved. No use of the Condor Software Program is authorized without the express consent of the Condor Team. For more information contact: Condor Team, Attention: Professor Miron Livny, 7367 Computer Sciences, 1210 W. Dayton St., Madison, WI 53706-1685, (608) 262-0856 or miron@cs.wisc.edu.

U.S. Government Rights Restrictions: Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 or subparagraphs (c)(1) and (2) of Commercial Computer Software-Restricted Rights at 48 CFR 52.227-19, as applicable, Condor Team, Attention: Professor Miron Livny, 7367 Computer Sciences, 1210 W. Dayton St., Madison, WI 53706-1685, (608) 262-0856 or miron@cs.wisc.edu.

See the Condor Version 6.8.3 Manual for additional notices.


condor-admin@cs.wisc.edu