
5.3 The Grid Universe


5.3.1 Condor-C, the condor Grid Type

Condor-C allows jobs in one machine's job queue to be moved to another machine's job queue. These machines may be far removed from each other, providing powerful grid computation mechanisms, while requiring only Condor software and its configuration.

Condor-C is highly resistant to network disconnections and machine failures on both the submission and remote sides. An expected usage sets up Personal Condor on a laptop, submits some jobs that are sent to a Condor pool, waits until the jobs are staged on the pool, then turns off the laptop. When the laptop reconnects at a later time, any results can be pulled back.

Condor-C scales gracefully when compared with Condor's flocking mechanism. The machine upon which jobs are submitted maintains a single process and network connection to a remote machine, without regard to the number of jobs queued or running.


5.3.1.1 Condor-C Configuration

There are two aspects to configuration to enable the submission and execution of Condor-C jobs. These two aspects correspond to the endpoints of the communication: there is the machine from which jobs are submitted, and there is the remote machine upon which the jobs are placed in the queue (executed).

Configuration of a machine from which jobs are submitted requires a few extra configuration variables:

CONDOR_GAHP=$(SBIN)/condor_c-gahp
C_GAHP_LOG=/tmp/CGAHPLog.$(USERNAME)
C_GAHP_WORKER_THREAD_LOG=/tmp/CGAHPWorkerLog.$(USERNAME)

The acronym GAHP stands for Grid ASCII Helper Protocol. A GAHP server provides grid-related services for a variety of underlying middleware systems. The configuration variable CONDOR_GAHP gives the full path to the GAHP server utilized by Condor-C. The configuration variable C_GAHP_LOG defines the location of the log that the Condor GAHP server writes. The log for the Condor GAHP is written as the user on whose behalf it is running; thus, like GRIDMANAGER_LOG, the C_GAHP_LOG configuration variable must point to a location that the end user can write to.

A submit machine must also have a condor_collector daemon to which the condor_schedd daemon can submit a query. The query is for the location (IP address and port) of the intended remote machine's condor_schedd daemon. This facilitates communication between the two machines. This condor_collector does not need to be the same collector that the local condor_schedd daemon reports to.

The machine upon which jobs are executed must also be configured correctly. This machine must be running a condor_schedd daemon. Unless specified explicitly in the submit file, CONDOR_HOST must point to a condor_collector daemon that this machine can write to, and that the machine upon which jobs are submitted can read from. This facilitates communication between the two machines.
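
As a minimal sketch (the central manager host name is hypothetical, and the full daemon list depends on the remote machine's role in its pool), the remote machine's configuration might include:

# collector reachable by both machines (host name is hypothetical)
CONDOR_HOST = cm.example.com
# the remote machine must run a condor_schedd daemon
DAEMON_LIST = MASTER, SCHEDD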

An important aspect of configuration is the security configuration relating to authentication. Condor-C on the remote machine relies on an authentication protocol to know the identity of the user under which to run a job. The following is a working example of the security configuration for authentication. This authentication method, CLAIMTOBE, trusts the identity claimed by a host or IP address.

SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE


5.3.1.2 Condor-C Job Submission

Job submission of Condor-C jobs is the same as for any Condor job. The universe is grid. grid_resource specifies the remote condor_schedd daemon to which the job should be submitted, and its value consists of three fields. The first field is the grid type, which is condor. The second field is the name of the remote condor_schedd daemon. Its value is the same as the condor_schedd ClassAd attribute Name on the remote machine. The third field is the name of the remote pool's condor_collector.

The following represents a minimal submit description file for a job.

# minimal submit description file for a Condor-C job
universe = grid
executable = myjob
output = myoutput
error = myerror
log = mylog

grid_resource = condor joe@remotemachine.example.com remotecentralmanager.example.com
+remote_jobuniverse = 5
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
queue

The remote machine needs to understand the attributes of the job. These are specified in the submit description file using the '+' syntax, followed by the string remote_. At a minimum, this will be the job's universe and the job's requirements. It is likely that other attributes specific to the job's universe (on the remote pool) will also be necessary. Note that attributes set with '+' are inserted directly into the job's ClassAd. Specify attributes as they must appear in the job's ClassAd, not the submit description file. For example, the universe is specified using an integer assigned for a job ClassAd JobUniverse. Similarly, place quotation marks around string expressions. As an example, a submit description file would ordinarily contain

when_to_transfer_output = ON_EXIT
This must appear in the Condor-C job submit description file as
+remote_WhenToTransferOutput = "ON_EXIT"

For convenience, the specific entries of universe, remote_grid_resource, globus_rsl, and globus_xml may be specified as remote_ prefixed commands without the leading '+'. Instead of

+remote_JobUniverse = 5

the submit description file command may appear as

remote_universe = vanilla

Similarly, the command

+remote_gridresource = "condor schedd.example.com cm.example.com"

may be given as

remote_grid_resource = condor schedd.example.com cm.example.com

For the given example, the job is to be run as a vanilla universe job at the remote pool. The (remote pool's) condor_schedd daemon is likely to place its job queue data on a local disk and execute the job on another machine within the pool of machines. This implies that the file systems for the resulting submit machine (the remote machine given in the grid_resource command) and the execute machine (the machine that runs the job) will not be shared. Thus, the two inserted ClassAd attributes

+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
are used to invoke Condor's file transfer mechanism.

As Condor-C is a recent addition to Condor, the universes, associated integer assignments, and notes about the existence of functionality are given in Table 5.1. The note "untested" means that submissions under the given universe have not yet been thoroughly tested; they may already work.


Table 5.1: Functionality of remote job universes with Condor-C

  Universe Name                Value   Notes
  standard                       1     untested
  PVM                            4     untested
  vanilla                        5     works well
  scheduler                      7     works well
  MPI                            8     untested
  grid                           9
    grid_resource is condor             works well
    grid_resource is gt2                works well
    grid_resource is gt3                untested
    grid_resource is gt4                untested
    grid_resource is nordugrid          untested
    grid_resource is unicore            untested
    grid_resource is lsf                works well
    grid_resource is pbs                works well
  java                          10     untested
  parallel                      11     untested
  local                         12     works well


For communication between condor_schedd daemons on the submit and remote machines, the location of the remote condor_schedd daemon is needed. This information resides in the condor_collector of the remote machine's pool. The third field of the grid_resource command in the submit description file says which condor_collector should be queried for the remote condor_schedd daemon's location. An example of this submit command is

grid_resource = condor schedd.example.com machine1.example.com
If the remote condor_collector is not listening on the standard port (9618), then the port it is listening on needs to be specified:
grid_resource = condor schedd.example.com machine1.example.com:12345

File transfer of a job's executable, stdin, stdout, and stderr are automatic. When other files need to be transferred using Condor's file transfer mechanism (see section 2.5.4 on page [*]), the mechanism is applied based on the resulting job universe on the remote machine.


5.3.1.3 Condor-C Jobs Between Differing Platforms

Condor-C jobs given to a remote machine running Windows must specify the Windows domain of the remote machine. This is accomplished by defining a ClassAd attribute for the job. Where the Windows domain is different at the submit machine from the remote machine, the submit description file defines the Windows domain of the remote machine with

  +remote_NTDomain = "DomainAtRemoteMachine"

For a Windows machine that is not part of a domain, the machine name is used as the Windows domain.
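
As an illustrative sketch (the host, pool, and domain names are hypothetical), a complete Condor-C submit description file destined for a Windows remote machine might appear as:

# Condor-C job destined for a Windows remote schedd (names are hypothetical)
universe        = grid
executable      = myjob.exe
output          = myoutput
error           = myerror
log             = mylog
grid_resource   = condor winschedd.example.com wincm.example.com
remote_universe = vanilla
+remote_requirements         = True
+remote_ShouldTransferFiles  = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
+remote_NTDomain             = "EXAMPLEDOMAIN"
queue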


5.3.1.4 Current Limitations in Condor-C

Submitting jobs to run under the grid universe has not yet been perfected. The following is a list of known limitations with Condor-C:

  1. Authentication methods other than CLAIMTOBE, such as GSI and KERBEROS, are untested, and may not yet work.


5.3.2 Condor-G, the gt2, gt3, and gt4 Grid Types

Condor-G is the name given to Condor when grid universe jobs are sent to grid resources utilizing Globus software for job execution. The Globus Toolkit provides a framework for building grid systems and applications. See the Globus Alliance web page at http://www.globus.org for descriptions and details of the Globus software.

Condor provides the same job management capabilities for Condor-G jobs as for other jobs. From Condor, a user may effectively submit jobs, manage jobs, and have jobs execute on widely distributed machines.

It may appear that Condor-G is a simple replacement for the Globus Toolkit's globusrun command. However, Condor-G does much more. It allows the submission of many jobs at once, along with the monitoring of those jobs with a convenient interface. There is notification when jobs complete or fail, and maintenance of Globus credentials that may expire while a job is running. On top of this, Condor-G is a fault-tolerant system; if a machine crashes, all of these functions are again available when the machine returns.


5.3.2.1 Globus Protocols and Terminology

The Globus software provides a well-defined set of protocols that allow authentication, data transfer, and remote job execution. Authentication is a mechanism by which an identity is verified. Given proper authentication, authorization to use a resource is required. Authorization is a policy that determines who is allowed to do what.

Condor (and Globus) utilize the following protocols and terminology. The protocols allow Condor to interact with grid machines toward the end result of executing jobs.

GSI
The Globus Toolkit's Grid Security Infrastructure (GSI) provides essential building blocks for other grid protocols and Condor-G. This authentication and authorization system makes it possible to authenticate a user just once, using public key infrastructure (PKI) mechanisms to verify a user-supplied grid credential. GSI then handles the mapping of the grid credential to the diverse local credentials and authentication/authorization mechanisms that apply at each site.
GRAM
The Grid Resource Allocation and Management (GRAM) protocol supports remote submission of a computational request (for example, to run a program) to a remote computational resource, and it supports subsequent monitoring and control of the computation. GRAM is the Globus protocol that Condor-G uses to talk to remote Globus jobmanagers.
GASS
The Globus Toolkit's Global Access to Secondary Storage (GASS) service provides mechanisms for transferring data to and from a remote HTTP, FTP, or GASS server. GASS is used by Condor for the gt2 and gt3 grid types to transfer a job's files to and from the machine where the job is submitted and the remote resource.
GridFTP
GridFTP is an extension of FTP that provides strong security and high-performance options for large data transfers. It is used with the gt4 grid type to transfer the job's files between the machine where the job is submitted and the remote resource.
RSL
RSL (Resource Specification Language) is the language GRAM accepts to specify job information.
gatekeeper
A gatekeeper is a software daemon executing on a remote machine on the grid. It is relevant only to the gt2 grid type, and this daemon handles the initial communication between Condor and a remote resource.
jobmanager
A jobmanager is the Globus service that is initiated at a remote resource to submit, keep track of, and manage grid I/O for jobs running on an underlying batch system. There is a specific jobmanager for each type of batch system supported by Globus (examples are Condor, LSF, and PBS).

Figure 5.1: Condor-G interaction with Globus-managed resources

Figure 5.1 shows how Condor interacts with Globus software to run jobs. The diagram is specific to the gt2 type of grid. Condor contains a GASS server, used to transfer the executable, stdin, stdout, and stderr to and from the remote job execution site. Condor uses the GRAM protocol to contact the remote gatekeeper and request that a new jobmanager be started. The GRAM protocol is also used when monitoring the job's progress. Condor detects and intelligently handles cases such as the remote resource crashing.

There are now three different versions of the GRAM protocol. Condor supports all three:

gt2
This initial GRAM protocol is used in Globus Toolkit versions 1 and 2. It is still used by many production systems. Within the more recent versions of the Globus Toolkit, where it is still available, gt2 is referred to as the pre-web services GRAM (or pre-WS GRAM).
gt3
gt3 corresponds to Globus Toolkit version 3 as part of Globus' shift to web services-based protocols. It is superseded by the Globus Toolkit version 4. An installation of the Globus Toolkit version 3 (OGSA GRAM) may also include the pre-web services GRAM.
gt4
This version of the GRAM protocol was introduced in Globus Toolkit version 4 as a more standards-compliant version of the GT3 web services-based GRAM. It is also called WS GRAM. An installation of the Globus Toolkit version 4 may also include the pre-web services GRAM.


5.3.2.2 The gt2 Grid Type

Condor-G supports submitting jobs to remote resources running the Globus Toolkit versions 1 and 2, also called the pre-web services GRAM (or pre-WS GRAM). These Condor-G jobs are submitted the same as any other Condor job. The universe is grid, and the pre-web services GRAM protocol is specified by setting the type of grid as gt2 in the grid_resource command.

Under Condor, successful job submission to the grid universe with gt2 requires credentials. An X.509 certificate is used to create a proxy, and an account, authorization, or allocation to use a grid resource is required. For general information on proxies and certificates, please consult the Globus page at

http://www-unix.globus.org/toolkit/docs/4.0/security/key-index.html

Before submitting a job to Condor under the grid universe, use grid-proxy-init to create a proxy.

Here is a simple submit description file. The example specifies a gt2 job to be run on an NCSA machine.

executable = test
universe = grid
grid_resource = gt2 modi4.ncsa.uiuc.edu/jobmanager
output = test.out
log = test.log
queue

The executable for this example is transferred from the local machine to the remote machine. By default, Condor transfers the executable, as well as any files specified by an input command. Note that the executable must be compiled for its intended platform.

The command grid_resource is required for grid universe jobs. For the gt2 grid type, its second field gives the contact string of the remote resource, naming the gatekeeper machine and the jobmanager that selects the scheduling software to be used on that resource. There is a specific jobmanager for each type of batch system supported by Globus. The full syntax for this command line appears as

grid_resource = gt2 machinename[:port]/jobmanagername[:X.509 distinguished name]
The portions of this syntax specification enclosed within square brackets ([ and ]) are optional. On a machine where the jobmanager is listening on a nonstandard port, include the port number. The jobmanagername is one of five strings:
jobmanager
jobmanager-condor
jobmanager-pbs
jobmanager-lsf
jobmanager-sge
The Globus software running on the remote resource uses this string to identify and select the correct service to perform. Other jobmanagername strings may be used, where additional services are defined and implemented.
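
For instance (the host name and port below are hypothetical), a gatekeeper listening on a nonstandard port and handing jobs to a local PBS batch system would be specified as:

grid_resource = gt2 gatekeeper.example.org:2120/jobmanager-pbs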

No input file is specified for this example job. Any output (file specified by an output command) or error (file specified by an error command) is transferred from the remote machine to the local machine as it is generated. This implies that these files may be incomplete in the case where the executable does not finish running on the remote resource. The ability to transfer standard output and standard error as they are produced may be disabled by adding to the submit description file:

stream_output = False
stream_error  = False
As a result, standard output and standard error will be transferred only after the job completes.

The job log file is maintained on the submit machine.

Example output from condor_q for this submission looks like:

% condor_q


-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi

 ID      OWNER         SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   smith        3/26 14:08   0+00:00:00 I  0   0.0  test

1 jobs; 1 idle, 0 running, 0 held

After a short time, the Globus resource accepts the job. Again running condor_q will now result in

% condor_q


-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi

 ID      OWNER         SUBMITTED     RUN_TIME ST PRI SIZE CMD
   7.0   smith        3/26 14:08   0+00:01:15 R  0   0.0  test

1 jobs; 0 idle, 1 running, 0 held

Then, very shortly after that, the queue will be empty again, because the job has finished:

% condor_q


-- Submitter: wireless48.cs.wisc.edu : <128.105.48.148:33012> : wireless48.cs.wi

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

A second example of a submit description file runs the Unix ls program on a different Globus resource.

executable = /bin/ls
transfer_executable = false
universe = grid
grid_resource = gt2 vulture.cs.wisc.edu/jobmanager
output = ls-test.out
log = ls-test.log
queue

In this example, the executable (the binary) has been pre-staged. The executable is on the remote machine, and it is not to be transferred before execution. Note that the required grid_resource and universe commands are present. The command

transfer_executable = false
within the submit description file identifies the executable as being pre-staged. In this case, the executable command gives the path to the executable on the remote machine.

A third example submits a Perl script to be run as a submitted Condor job. The Perl script both lists and sets environment variables for a job. Save the following Perl script with the name env-test.pl, to be used as a Condor job executable.

#!/usr/bin/env perl

foreach $key (sort keys(%ENV))
{
   print "$key = $ENV{$key}\n"
}

exit 0;

Run the Unix command

chmod 755 env-test.pl
to make the Perl script executable.

Now create the following submit description file. Replace example.cs.wisc.edu/jobmanager with a resource you are authorized to use.

executable = env-test.pl
universe = grid
grid_resource = gt2 example.cs.wisc.edu/jobmanager
environment = foo=bar; zot=qux
output = env-test.out
log = env-test.log
queue

When the job has completed, the output file, env-test.out, should contain something like this:

GLOBUS_GRAM_JOB_CONTACT = https://example.cs.wisc.edu:36213/30905/1020633947/
GLOBUS_GRAM_MYJOB_CONTACT = URLx-nexus://example.cs.wisc.edu:36214
GLOBUS_LOCATION = /usr/local/globus
GLOBUS_REMOTE_IO_URL = /home/smith/.globus/.gass_cache/globus_gass_cache_1020633948
HOME = /home/smith
LANG = en_US
LOGNAME = smith
X509_USER_PROXY = /home/smith/.globus/.gass_cache/globus_gass_cache_1020633951
foo = bar
zot = qux

Of particular interest is the GLOBUS_REMOTE_IO_URL environment variable. Condor-G automatically starts up a GASS remote I/O server on the submit machine. Because of the potential for either side of the connection to fail, the URL for the server cannot be passed directly to the job. Instead, it is placed into a file, and the GLOBUS_REMOTE_IO_URL environment variable points to this file. Remote jobs can read this file and use the URL it contains to access the remote GASS server running inside Condor-G. If the location of the GASS server changes (for example, if Condor-G restarts), Condor-G will contact the Globus gatekeeper and update this file on the machine where the job is running. It is therefore important that all accesses to the remote GASS server check this file for the latest location.

The following example is a Perl script that uses the GASS server in Condor-G to copy input files to the execute machine. In this example, the remote job counts the number of lines in a file.

#!/usr/bin/env perl
use FileHandle;
use Cwd;

STDOUT->autoflush();
$gassUrl = `cat $ENV{GLOBUS_REMOTE_IO_URL}`;
chomp $gassUrl;

$ENV{LD_LIBRARY_PATH} = $ENV{GLOBUS_LOCATION}. "/lib";
$urlCopy = $ENV{GLOBUS_LOCATION}."/bin/globus-url-copy";

# globus-url-copy needs a full pathname
$pwd = getcwd();
print "$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts\n\n";
`$urlCopy $gassUrl/etc/hosts file://$pwd/temporary.hosts`;

open(FILE, "temporary.hosts") or die "cannot open temporary.hosts: $!";
while(<FILE>) {
print $_;
}
close(FILE);

exit 0;

The submit description file used to submit the Perl script as a Condor job appears as:

executable = gass-example.pl
universe = grid
grid_resource = gt2 example.cs.wisc.edu/jobmanager
output = gass.out
log = gass.log
queue

There are two optional submit description file commands of note: x509userproxy and globus_rsl. The x509userproxy command specifies the path to an X.509 proxy. The command is of the form:

x509userproxy = /path/to/proxy
If this optional command is not present in the submit description file, then Condor-G checks the value of the environment variable X509_USER_PROXY for the location of the proxy. If this environment variable is not present, then Condor-G looks for the proxy in the file /tmp/x509up_uXXXX, where the characters XXXX in this file name are replaced with the Unix user id.

The globus_rsl command is used to add additional attribute settings to a job's RSL string. The format of the globus_rsl command is

globus_rsl = (name=value)(name=value)
Here is an example of this command from a submit description file:
globus_rsl = (project=Test_Project)
This example's attribute name for the additional RSL is project, and the value assigned is Test_Project.
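
As a combined sketch (the proxy path, host name, and RSL attribute values are hypothetical), a submit description file using both optional commands might appear as:

# sketch using both x509userproxy and globus_rsl (values are hypothetical)
universe      = grid
executable    = myjob
output        = myjob.out
log           = myjob.log
grid_resource = gt2 example.cs.wisc.edu/jobmanager-pbs
x509userproxy = /home/smith/x509_proxy
globus_rsl    = (project=Test_Project)(queue=short)
queue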


5.3.2.3 The gt3 Grid Type

Condor-G supports submitting jobs to remote resources running the Globus Toolkit version 3.2. Please note that this Globus Toolkit version is not compatible with the Globus Toolkit version 3.0. See http://www-unix.globus.org/toolkit/docs/3.2/index.html for more information about the Globus Toolkit version 3.2.

For grid jobs destined for gt3, the submit description file is much the same as for gt2 jobs. The grid_resource command is still required, but the format changes from gt2 to one that is a URL. The syntax follows the form:

grid_resource = gt3 http://hostname[:port]/ogsa/services/base/gram/
XXXManagedJobFactoryService

or

grid_resource = gt3 http://IPaddress[:port]/ogsa/services/base/gram/
XXXManagedJobFactoryService

This value is placed on two lines for formatting purposes, but is all on a single line within a submit description file. The portion of this syntax specification enclosed within square brackets ([ and ]) is optional. The substring XXX within the last part of the value is replaced by one of five strings that (as for gt2) identify and select the correct service to perform. The five strings that replace XXX are

Fork
Condor
PBS
LSF
SGE

An example, given on two lines (again, for formatting reasons) is

grid_resource = gt3 http://198.51.254.40:8080/ogsa/services/base/gram/
ForkManagedJobFactoryService

On the machine where the job is submitted, there is no requirement for any Globus Toolkit 3.2 components. Condor itself installs all necessary framework within the directory $(LIB)/lib/gt3. The machine where the job is submitted is required to have Java 1.4 or a higher version installed. The configuration variable JAVA must identify the location of the installation. See page [*] within section 3.3 for the complete description of the configuration variable JAVA.


5.3.2.4 The gt4 Grid Type

Condor-G supports submitting jobs to remote resources running the Globus Toolkit version 4.0. Please note that this Globus Toolkit version is not compatible with the Globus Toolkit version 3.0 or 3.2. See http://www-unix.globus.org/toolkit/docs/4.0/index.html for more information about the Globus Toolkit version 4.0.

For grid jobs destined for gt4, the submit description file is much the same as for gt2 or gt3 jobs. The grid_resource command is still required, and is given in the form of a URL. The syntax follows the form:

grid_resource = gt4 [https://]hostname[:port][/wsrf/services/ManagedJobFactoryService] scheduler-string

or

grid_resource = gt4 [https://]IPaddress[:port][/wsrf/services/ManagedJobFactoryService] scheduler-string
The portions of this syntax specification enclosed within square brackets ([ and ]) are optional.

The scheduler-string field of grid_resource indicates which job execution system is to be used on the remote system to execute the job; a complete example follows the list below. One of these values is substituted for scheduler-string:

Fork
Condor
PBS
LSF
SGE
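
Putting the pieces together, a complete gt4 submit description file might look like the following sketch; the host name, port, and file names are hypothetical.

# sketch of a gt4 submit description file (host name and port are hypothetical)
universe      = grid
executable    = analyze
output        = analyze.out
error         = analyze.err
log           = analyze.log
grid_resource = gt4 https://gt4host.example.org:8443/wsrf/services/ManagedJobFactoryService PBS
queue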

The globus_xml command can be used to add additional attributes to the XML-based RSL string that Condor writes to submit the job to GRAM. Here is an example of this command from a submit description file:

globus_xml = <project>Test_Project</project>
This example's attribute name for the additional RSL is project, and the value assigned is Test_Project.

File transfer occurs as expected for a Condor job (for the executable, input, and output). However, the underlying transfer mechanism requires access to a GridFTP server from the machine where the job is submitted. On this machine, there is no requirement for any Globus Toolkit 4.0 components, other than the GridFTP server for file transfer. Condor itself installs all necessary framework within the directory $(LIB)/lib/gt4. The machine where the job is submitted is also required to have Java 1.4.2 or a higher version installed. The configuration variable JAVA must identify the location of the installation. See page [*] within section 3.3 for the complete description of the configuration variable JAVA.


5.3.2.5 Delimiting Arguments

The delimiting of arguments passed to a Condor-G job varies based on the grid type of the job. For the gt2 and gt3 types, there are two languages involved, leading to two sets of parsing rules that must work together. gt4 jobs are less complex with respect to the delimiting of arguments, as Condor encapsulates one set of parsing rules, thereby isolating the user from needing to understand or use them.

For all Condor-G jobs, the arguments to a job are kept in the job ClassAd attribute Args. This attribute is a string, and therefore enclosed within double quote marks. Condor uses space characters to delimit the listed arguments. Here is an arguments command from a submit description file with spaces to delimit the arguments.

arguments = 13 argument2 argument3
The Args ClassAd attribute becomes
Args = "13 argument2 argument3"
All further parsing of the arguments uses the Args attribute as a starting point. Parsing this attribute to recover the arguments results in the 3 arguments
argv[1] = 13
argv[2] = argument2
argv[3] = argument3

Since the double quote mark character (") marks the beginning and end of a string (in the ClassAd language), an escaped double quote mark (\") is utilized to have a double quote mark within the string. For example, the submit description file arguments command

arguments = 13 argument2 \"string3\"
gives the ClassAd attribute
Args = "13 argument2 \"string3\""
Again, all further parsing of the arguments uses the Args attribute as a starting point. Parsing this attribute to recover the arguments results in
argv[1] = 13
argv[2] = argument2
argv[3] = "string3"

For gt2 and gt3 types, the jobmanager on the remote resource must receive information about job arguments in RSL (Resource Specification Language). This language has its own way of delimiting arguments. Therefore, the arguments command in the submit description file (and the associated ClassAd attribute) must take both languages into account.

Delimiters in RSL are spaces, the single quote mark, and the double quote mark. In addition, the characters +, &, %, (, and ) have special meaning in RSL, so they must be delimited (quoted) to be included in an argument. Placing a space character into an argument is accomplished by delimiting with one of the quote marks. As an example, the submit description file command

arguments = '%s' 'argument with spaces' '+%d'
results in the Condor-G job receiving the arguments
argv[1] = %s
argv[2] = argument with spaces
argv[3] = +%d

Should the arguments themselves contain the single quote character, an argument may be delimited with a double quote mark. Note that because the ClassAd attribute Args represents the information, the double quote marks must be escaped in the submit description file command. The submit description file command

arguments = \"don't\" \"mess with\" \"quoting rules\"
results in the RSL arguments
argv[1] = don't
argv[2] = mess with
argv[3] = quoting rules

And, if the job arguments have both single and double quotes, the appearance of a quote character twice in a row is converted (in RSL) to a single instance of the character and the literal continues until the next solo quote character. The submit description file command

arguments = 'don''t yell \"No!\"' '+%s'
results in the RSL arguments
argv[1] = don't yell "No!"
argv[2] = +%s

For gt4 jobs, follow Condor's ClassAd language rules for delimiting arguments. Spaces delimit arguments, and the double quote mark character must be escaped to be included in an argument. Condor itself will modify the arguments to be expressed correctly in RSL. Note that the space character cannot be a part of an argument.


5.3.2.6 Credential Management with MyProxy

Condor-G can use MyProxy software to automatically renew GSI proxies for grid universe jobs with grid type gt2. MyProxy is a software component developed at NCSA and used widely throughout the grid community. For more information see: http://myproxy.ncsa.uiuc.edu/

Difficulties with proxy expiration occur in two cases. The first case is long-running jobs, which do not complete before the proxy expires. The second case occurs when a great number of jobs are submitted; some of the jobs may not yet have started or completed before the proxy expires. One proposed solution to these difficulties is to generate longer-lived proxies. This, however, presents a greater security problem. Remember that a GSI proxy is sent to the remote Globus resource. If a proxy falls into the hands of a malicious user at the remote site, the malicious user can impersonate the proxy owner for the duration of the proxy's lifetime. The longer the proxy's lifetime, the more time a malicious user has to misuse the owner's credentials. To minimize the window of opportunity of a malicious user, it is recommended that proxies have a short lifetime (on the order of several hours).

The MyProxy software generates proxies using credentials (a user certificate or a long-lived proxy) located on a secure MyProxy server. Condor-G talks to the MyProxy server, renewing a proxy as it is about to expire. Another advantage is that this relieves the user from having to store a GSI user certificate and private key on the machine where jobs are submitted. This may be particularly important if a shared Condor-G submit machine is used by several users.

In a typical case, the following steps occur:

  1. The user creates a long-lived credential on a secure MyProxy server, using the myproxy-init command. Each organization generally has their own MyProxy server.

  2. The user creates a short-lived proxy on a local submit machine, using grid-proxy-init or myproxy-get-delegation.

  3. The user submits a Condor-G job, specifying:
    MyProxy server name (host:port)
    MyProxy credential name (optional)
    MyProxy password

  4. When the short-lived proxy is about to expire, Condor-G talks to the MyProxy server to refresh the proxy.

Condor-G keeps track of the password to the MyProxy server for credential renewal. Although Condor-G tries to keep the password encrypted and secure, it is still possible (although highly unlikely) for the password to be intercepted from the Condor-G machine (more precisely, from the machine on which the condor_schedd daemon that manages the grid universe jobs runs, which may be distinct from the machine from which jobs are submitted). The following safeguard practices are recommended.

  1. Provide time limits for credentials on the MyProxy server. The default is one week, but you may want to make it shorter.

  2. Create several different MyProxy credentials, maybe as many as one for each submitted job. Each credential has a unique name, which is identified with the MyProxyCredentialName command in the submit description file.

  3. Use the following options when initializing the credential on the MyProxy server:

    myproxy-init -s <host> -x -r <cert subject> -k <cred name>
    

    The option -x -r <cert subject> essentially tells the MyProxy server to require two forms of authentication:

    1. a password (initially set with myproxy-init)
    2. an existing proxy (the proxy to be renewed)

  4. A submit description file may include the password. An example contains commands of the form:
    executable      = /usr/bin/my-executable
    universe        = grid
    grid_resource   = gt4 condor-unsup-7
    MyProxyHost     = example.cs.wisc.edu:7512
    MyProxyServerDN = /O=doesciencegrid.org/OU=People/CN=Jane Doe 25900
    MyProxyPassword = password
    MyProxyCredentialName = my_executable_run
    queue
    
    Note that placing the password within the submit file is not really secure, as it relies upon whatever file system security there is. This may still be better than option 5.

  5. Use the -p option to condor_submit. The submit command appears as
    condor_submit -p mypassword /home/user/myjob.submit
    
    The argument list for condor_submit defaults to being publicly available. An attacker with a login on the local machine could generate a simple shell script to watch for the password.

Currently, Condor-G calls the myproxy-get-delegation command-line tool, passing it the necessary arguments. The location of the myproxy-get-delegation executable is determined by the configuration variable MYPROXY_GET_DELEGATION in the configuration file on the Condor-G machine. This variable is read by the condor_gridmanager. If myproxy-get-delegation is a dynamically-linked executable (verify this with ldd myproxy-get-delegation), point MYPROXY_GET_DELEGATION to a wrapper shell script that sets LD_LIBRARY_PATH to the correct MyProxy library or Globus library directory and then calls myproxy-get-delegation. Here is an example of such a wrapper script:

#!/bin/sh
export LD_LIBRARY_PATH=/opt/myglobus/lib
exec /opt/myglobus/bin/myproxy-get-delegation "$@"


5.3.2.7 The Grid Monitor

Condor's Grid Monitor is designed to improve the scalability of machines running Globus Toolkit 2 gatekeepers. Normally, this gatekeeper runs a jobmanager process for every job submitted to the gatekeeper. This includes both currently running jobs and jobs waiting in the queue. Each jobmanager runs a Perl script at frequent intervals (every 10 seconds) to poll the state of its job in the local batch system. For example, with 400 jobs submitted to a gatekeeper, there will be 400 jobmanagers running, each regularly starting a Perl script. When a large number of jobs have been submitted to a single gatekeeper, this frequent polling can heavily load the gatekeeper. When the gatekeeper is under heavy load, the system can become non-responsive, and a variety of problems can occur.

Condor's Grid Monitor temporarily replaces these jobmanagers. It is named the Grid Monitor, because it replaces the monitoring (polling) duties previously done by jobmanagers. When the Grid Monitor runs, Condor attempts to start a single process to poll all of a user's jobs at a given gatekeeper. While a job is waiting in the queue, but not yet running, Condor shuts down the associated jobmanager, and instead relies on the Grid Monitor to report changes in status. The jobmanager started to add the job to the remote batch system queue is shut down. The jobmanager restarts when the job begins running.

By default, standard output and standard error are streamed back to the submitting machine while the job is running. Streamed I/O requires the jobmanager. As a result, the Grid Monitor cannot replace the jobmanager for jobs that use streaming. If possible, disable streaming for all jobs; this is accomplished by placing the following lines in each job's submit description file:

stream_output = False
stream_error  = False

The Grid Monitor requires that the gatekeeper support the fork jobmanager with the name jobmanager-fork. If the gatekeeper does not support the fork jobmanager, the Grid Monitor will not be used for that site. The condor_gridmanager log file reports any problems using the Grid Monitor.

To enable the Grid Monitor, two variables must be set in the Condor configuration file. The configuration macro GRID_MONITOR is already defined in current distributions of Condor, but it may be missing from earlier versions. Also set the configuration macro ENABLE_GRID_MONITOR to True.

GRID_MONITOR        = $(SBIN)/grid_monitor.sh
ENABLE_GRID_MONITOR = TRUE


5.3.2.8 Limitations of Condor-G

Submitting jobs to run under the grid universe has not yet been perfected. The following is a list of known limitations:

  1. No checkpoints.
  2. No job exit codes. Job exit codes are not available (when using gt2 and gt3).
  3. Limited platform availability. Windows support is not yet available.


5.3.3 The nordugrid Grid Type

NorduGrid is a project to develop free grid middleware named the Advanced Resource Connector (ARC). See the NorduGrid web page (http://www.nordugrid.org) for more information about NorduGrid software.

Condor jobs may be submitted to NorduGrid resources using the grid universe. The grid_resource command specifies the name of the NorduGrid resource as follows:

grid_resource = nordugrid ng.example.com

NorduGrid uses X.509 credentials for authentication, usually in the form of a proxy certificate. For more information about proxies and certificates, please consult the Alliance PKI pages at http://archive.ncsa.uiuc.edu/SCD/Alliance/GridSecurity/. condor_submit looks in default locations for the proxy. The submit description file command x509userproxy is used to give the full path and file name of the proxy, when the proxy is not in a default location. If this optional command is not present in the submit description file, then the value of the environment variable X509_USER_PROXY is checked for the location of the proxy. If this environment variable is not present, then the proxy in the file /tmp/x509up_uXXXX is used, where the characters XXXX in this file name are replaced with the Unix user id.

NorduGrid uses RSL syntax to describe jobs. The submit description file command nordugrid_rsl adds additional attributes to the job RSL that Condor constructs. The format of this submit description file command is

nordugrid_rsl = (name=value)(name=value)
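
As a minimal sketch (the host name and RSL attribute value are hypothetical), a complete nordugrid submit description file might appear as:

# minimal sketch of a nordugrid submit description file (values are hypothetical)
universe      = grid
executable    = myjob
output        = myjob.out
log           = myjob.log
grid_resource = nordugrid ng.example.com
nordugrid_rsl = (queue=short)
queue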


5.3.4 The unicore Grid Type

Unicore is a Java-based grid scheduling system. See http://unicore.sourceforge.net for more information about Unicore.

Condor jobs may be submitted to Unicore resources using the grid universe. The grid_resource command specifies the name of the Unicore resource as follows:

grid_resource = unicore usite.example.com vsite
usite.example.com is the host name of the Unicore gateway machine to which the Condor job is to be submitted. vsite is the name of the Unicore virtual resource to which the Condor job is to be submitted.

Unicore uses certificates stored in a Java keystore file for authentication. The following submit description file commands are required to properly use the keystore file.

keystore_file
Specifies the complete path and file name of the Java keystore file to use.
keystore_alias
A string that specifies which certificate in the Java keystore file to use.
keystore_passphrase_file
Specifies the complete path and file name of the file containing the passphrase protecting the certificate in the Java keystore file.
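
A sketch putting these commands together (all host names, paths, and the alias are hypothetical) might appear as:

# sketch of a unicore submit description file (values are hypothetical)
universe                 = grid
executable               = myjob
output                   = myjob.out
log                      = myjob.log
grid_resource            = unicore usite.example.com vsite
keystore_file            = /home/smith/unicore.jks
keystore_alias           = smith
keystore_passphrase_file = /home/smith/unicore.passphrase
queue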


5.3.5 The pbs Grid Type

The popular PBS (Portable Batch System) comes in several varieties: OpenPBS (http://www.openpbs.org), PBS Pro (http://www.altair.com/software/pbspro.htm), and Torque (http://www.clusterresources.com/pages/products/torque-resource-manager.php).

Condor jobs are submitted to a local PBS system using the grid universe and the grid_resource command by placing the following into the submit description file.

grid_resource = pbs

The pbs grid type requires two variables to be set in the Condor configuration file. PBS_GAHP is the path to the PBS GAHP server binary that is to be used to submit PBS jobs. GLITE_LOCATION is the path to the directory containing the GAHP's configuration file and auxiliary binaries. In the Condor distribution, these files are located in $(LIB)/glite. The PBS GAHP's configuration file is in $(GLITE_LOCATION)/etc/batch_gahp.config. The PBS GAHP's auxiliary binaries are to be in the directory $(GLITE_LOCATION)/bin. The Condor configuration file entries appear as

GLITE_LOCATION = $(LIB)/glite
PBS_GAHP       = $(GLITE_LOCATION)/bin/batch_gahp

The PBS GAHP's configuration file contains two variables that must be modified to tell it where to find PBS on the local system. pbs_binpath is the directory that contains the PBS binaries. pbs_spoolpath is the PBS spool directory.
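
As an illustrative sketch (the paths are hypothetical and depend on the local PBS installation), the relevant lines of $(GLITE_LOCATION)/etc/batch_gahp.config might look like:

# hypothetical paths for the local PBS installation
pbs_binpath=/usr/local/pbs/bin
pbs_spoolpath=/var/spool/pbs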


5.3.6 The lsf Grid Type

Condor jobs may be submitted to the Platform LSF batch system. See the Products page of the Platform web page at http://www.platform.com/Products/ for more information about Platform LSF.

Condor jobs are submitted to a local Platform LSF system using the grid universe and the grid_resource command by placing the following into the submit description file.

grid_resource = lsf

The lsf grid type requires two variables to be set in the Condor configuration file. LSF_GAHP is the path to the LSF GAHP server binary that is to be used to submit Platform LSF jobs. GLITE_LOCATION is the path to the directory containing the GAHP's configuration file and auxiliary binaries. In the Condor distribution, these files are located in $(LIB)/glite. The LSF GAHP's configuration file is in $(GLITE_LOCATION)/etc/batch_gahp.config. The LSF GAHP's auxiliary binaries are to be in the directory $(GLITE_LOCATION)/bin. The Condor configuration file entries appear as

GLITE_LOCATION = $(LIB)/glite
LSF_GAHP       = $(GLITE_LOCATION)/bin/batch_gahp

The LSF GAHP's configuration file contains two variables that must be modified to tell it where to find LSF on the local system. lsf_binpath is the directory that contains the LSF binaries. lsf_confpath is the location of the LSF configuration file.
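
Similarly, as a sketch (the paths are hypothetical and depend on the local LSF installation), the relevant lines of $(GLITE_LOCATION)/etc/batch_gahp.config might look like:

# hypothetical paths for the local LSF installation
lsf_binpath=/usr/local/lsf/bin
lsf_confpath=/usr/local/lsf/conf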


5.3.7 Matchmaking in the Grid Universe

In a simple usage, the grid universe allows users to specify a single grid site as a destination for jobs. This is sufficient when a user knows exactly which grid site they wish to use, or a higher-level resource broker (such as the European Data Grid's resource broker) has decided which grid site should be used.

When a user has a variety of grid sites to choose from, Condor allows matchmaking of grid universe jobs to decide which grid resource a job should run on. Please note that this form of matchmaking is relatively new. There are some rough edges as continual improvement occurs.

To facilitate Condor's matching of jobs with grid resources, both the jobs and the grid resources are involved. The job's submit description file provides all commands needed to make the job work on a matched grid resource. The grid resource identifies itself to Condor by advertising a ClassAd. This ClassAd specifies all necessary attributes, such that Condor can properly make matches. The grid resource identification is accomplished by using condor_advertise to send a ClassAd representing the grid resource, which is then used by Condor to make matches.

5.3.7.1 Job Submission

To submit a grid universe job intended for a single, specific gt2 resource, the submit description file for the job explicitly specifies the resource:

grid_resource = gt2 grid.example.com/jobmanager-pbs

If there were multiple gt2 resources that might be matched to the job, the submit description file changes:

grid_resource   = $$(resource_name)
requirements    = TARGET.resource_name =!= UNDEFINED

The grid_resource command uses a substitution macro. The substitution macro takes the value of the resource_name attribute from the ClassAd of the matched grid resource. The requirements command further restricts that the job may only run on a machine (grid resource) that defines resource_name. Note that this attribute name is invented for this example. To make matchmaking work in this way, both the job (as used here within the submit description file) and the grid resource (in its created and advertised ClassAd) must agree upon the name of the attribute.

As a more complex example, consider a job that wants to run not only on a gt2 resource, but on one that has the Bamboozle software installed. The complete submit description file might appear:

universe        = grid
executable      = analyze_bamboozle_data
output          = aaa.$(Cluster).out
error           = aaa.$(Cluster).err
log             = aaa.log
grid_resource   = $$(resource_name)
requirements    = (TARGET.HaveBamboozle == True) && (TARGET.resource_name =!= UNDEFINED)
queue

Any grid resource which has the HaveBamboozle attribute defined as well as set to True is further checked to have the resource_name attribute defined. Where this occurs, a match may be made (from the job's point of view). A grid resource that has one of these attributes defined, but not the other, results in no match being made.

Note that the entire value of grid_resource comes from the grid resource's ad. This means that the job can be matched with a resource of any type, not just gt2.

5.3.7.2 Advertising Grid Resources to Condor

Any grid resource that wishes to be matched by Condor with a job must advertise itself to Condor using a ClassAd. To properly advertise, a ClassAd is sent periodically to the condor_collector daemon. A ClassAd is a list of pairs, where each pair consists of an attribute name and value that describes an entity. There are two entities relevant to Condor: a job, and a machine. A grid resource is a machine. The ClassAd describes the grid resource, as well as identifying the capabilities of the grid resource. It may also state both requirements and preferences (called rank) for the jobs it will run. See Section 2.3 for an overview of the interaction between matchmaking and ClassAds. A list of common machine ClassAd attributes is given in Section 2.5.2.

To advertise a grid site, place the attributes in a file. Here is a sample ClassAd that describes a grid resource that is capable of running a gt2 job.

# example grid resource ClassAd for a gt2 job
MyType         = "Machine"
TargetType     = "Job"
Name           = "Example1_Gatekeeper"
Machine        = "Example1_Gatekeeper"
resource_name  = "gt2 grid.example.com/jobmanager-pbs"
UpdateSequenceNumber  = 4
Requirements   = (TARGET.JobUniverse == 9)
Rank           = 0.000000
CurrentRank    = 0.000000

Some attributes are defined as expressions, while others are integers, floating point values, or strings. The type is important, and must be correct for the ClassAd to be effective. The attributes

MyType         = "Machine"
TargetType     = "Job"
identify the grid resource as a machine, and that the machine is to be matched with a job. In Condor, machines are matched with jobs, and jobs are matched with machines. These attributes are strings. Strings are surrounded by double quote marks.

The attributes Name and Machine are likely to be defined to be the same string value as in the example:

Name           = "Example1_Gatekeeper"
Machine        = "Example1_Gatekeeper"

Both give the fully qualified host name for the resource. The Name may be different on an SMP machine, where the individual CPUs are given names that can be distinguished from each other. Each separate grid resource must have a unique name.

Where the job depends on the resource to specify the value of the grid_resource command by the use of the substitution macro, the ClassAd for the grid resource (machine) defines this value. In the example, the line

resource_name  = "gt2 grid.example.com/jobmanager-pbs"
defines this value. Note that the invented name of this attribute must match the one utilized within the submit description file. To make the matchmaking work, both the job (as used within the submit description file) and the grid resource (in this created and advertised ClassAd) must agree upon the name of the attribute.

A machine's ClassAd information can be time sensitive, and may change over time. Therefore, ClassAds expire and are thrown away. In addition, the communication method by which ClassAds are sent implies that entire ads may be lost without notice or may arrive out of order. Out of order arrival leads to the definition of an attribute which provides an ordering. This positive integer value is given in the example ClassAd as

UpdateSequenceNumber  = 4
This value must increase for each subsequent ClassAd. If state information for the ClassAd is kept in a file, a script executed each time the ClassAd is to be sent may use a counter for this value. An alternative for a stateless implementation sends the current time in seconds (since the epoch, as given by the C time function call).

The requirements that the grid resource sets for any job that it will accept are given as

Requirements     = (TARGET.JobUniverse == 9)
This requirement states that any matched job must be a grid universe job (JobUniverse of 9).

The attributes

Rank             = 0.000000
CurrentRank      = 0.000000
are both necessary for Condor's negotiation to proceed, but are not relevant to grid matchmaking. Set both to the floating point value 0.0.

The example machine ClassAd becomes more complex for the case where the grid resource allows matches with more than one job:

# example grid resource ClassAd for a gt2 job
MyType         = "Machine"
TargetType     = "Job"
Name           = "Example1_Gatekeeper"
Machine        = "Example1_Gatekeeper"
resource_name  = "gt2 grid.example.com/jobmanager-pbs"
UpdateSequenceNumber  = 4
Requirements   = (CurMatches < 10) && (TARGET.JobUniverse == 9)
Rank           = 0.000000
CurrentRank    = 0.000000
WantAdRevaluate = True
CurMatches     = 1

In this example, the two attributes WantAdRevaluate and CurMatches appear, and the Requirements expression has changed.

WantAdRevaluate is a boolean value, and may be set to either True or False. When True in the ClassAd and a match is made (of a job to the grid resource), the machine (grid resource) is not removed from the set of machines to be considered for further matches. This implements the ability for a single grid resource to be matched to more than one job at a time. Note that the spelling of this attribute is incorrect, and remains incorrect to maintain backward compatibility.

To limit the number of matches made to the single grid resource, the resource must have the ability to keep track of the number of Condor jobs it has. This integer value is given as the CurMatches attribute in the advertised ClassAd. It is then compared in order to limit the number of jobs matched with the grid resource.

Requirements   = (CurMatches < 10) && (TARGET.JobUniverse == 9)
CurMatches     = 1

This example assumes that the grid resource already has one job matched, and is willing to accept new matches until 10 jobs are matched. If CurMatches does not appear in the ClassAd, Condor uses a default value of 0.

This ClassAd (likely in a file) is to be periodically sent to the condor_collector daemon using condor_advertise. A recommended implementation uses a script to create or modify the ClassAd, together with cron to send the ClassAd every five minutes. The condor_advertise program must be installed on the machine sending the ClassAd, but the remainder of Condor does not need to be installed. The required argument for the condor_advertise command is UPDATE_STARTD_AD.

condor_advertise uses UDP to transmit the ClassAd. Where this is insufficient, specify the -tcp option to condor_advertise to use TCP for communication.
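
As an illustrative sketch (the pool host name and the file holding the ClassAd are hypothetical), the periodically executed command might look like:

condor_advertise -pool cm.example.com -tcp UPDATE_STARTD_AD grid_resource.ad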

5.3.7.3 Advanced usage

What if a job fails to run at a grid site due to an error? It will be returned to the queue, and Condor will attempt to match it and re-run it at another site. Condor isn't very clever about avoiding sites that may be bad, but you can give it some assistance. Let's say that you want to avoid running at the last grid site you ran at. You could add this to your job description:

match_list_length = 1
Rank              = TARGET.Name != LastMatchName0

This will prefer to run at a grid site that was not just tried, but it will allow the job to be run there if there is no other option.

When you specify match_list_length, you provide an integer N, and Condor will keep track of the last N matches. The oldest match will be LastMatchName0, the next oldest will be LastMatchName1, and so on. (See the condor_submit manual page for more details.) The Rank expression allows you to specify a numerical ranking for different matches. When combined with match_list_length, you can prefer to avoid sites that you have already run at.

In addition, condor_submit has two options to help you control grid universe job resubmissions and rematching. See globus_resubmit and globus_rematch in the condor_submit manual page. These options are independent of match_list_length.

There are some additional attributes that are added to the job ClassAd, and they may be useful when writing the rank, requirements, globus_resubmit, or globus_rematch expressions. Please refer to Section 2.5.2 for descriptions of these attributes.

The following example of a command within the submit description file releases a held job 5 minutes after it is held, increasing the time between releases by 5 minutes each time. It allows up to 4 releases per Globus submission, plus 4 extra to cover the (unlikely) case where the job goes on hold before being submitted to Globus.

periodic_release = ( NumSystemHolds <= ((NumGlobusSubmits * 4) + 4) ) \
   && (NumGlobusSubmits < 4) && \
   ( HoldReason != "via condor_hold (by user $ENV(USER))" ) && \
   ((CurrentTime - EnteredCurrentStatus) > ( NumSystemHolds *60*5 ))

The following example forces Globus resubmission after a job has been held 4 times per Globus submission.

globus_resubmit = NumSystemHolds == (NumGlobusSubmits + 1) * 4

If you are concerned about unknown or malicious grid sites reporting to your condor_ collector, you should use Condor's security options, documented in Section 3.6.

