Condor daemons understand and implement the SOAP (Simple Object Access Protocol) XML API to provide a web service interface for Condor job submission and management.
The API utilizes a two-phase commit mechanism to provide a transaction-based protocol. This structure enhances reliability when using the API.
Condor daemons understand and communicate using the SOAP XML protocol. An application seeking to use this protocol will require code that handles the communication. The XML WSDL (Web Services Description Language) that Condor implements is included with the Condor distribution. It is in $(RELEASE_DIR)/lib/webservice. The WSDL must be run through a toolkit to produce language-specific routines that do communication. The application is compiled with these routines.
Condor must be configured to enable responses to SOAP calls. Please see section 3.3.27 for definitions of the configuration variables related to the web services API.
The API's routines can be roughly categorized into ones that deal with transactions and ones that submit, manage, and query jobs. For example, the ClassAds for a particular job can be retrieved within a transaction with a call of the form:

getJobAds(transaction, "(ClusterId==clusterId && JobId==jobId)")
The following quote from the DRMAA Specification 1.0 abstract nicely describes the purpose of the API:
The Distributed Resource Management Application API (DRMAA), developed by a working group of the Global Grid Forum (GGF),
provides a generalized API to distributed resource management systems (DRMSs) in order to facilitate integration of application programs. The scope of DRMAA is limited to job submission, job monitoring and control, and the retrieval of the finished job status. DRMAA provides application developers and distributed resource management builders with a programming model that enables the development of distributed applications tightly coupled to an underlying DRMS. For deployers of such distributed applications, DRMAA preserves flexibility and choice in system design.
The API allows a program that uses DRMAA functions and links to a DRMAA library to submit jobs to a Grid system, and to control and retrieve information about those jobs. The Condor implementation of a portion of the API allows programs (applications) to use the library's functions to submit, monitor, and control Condor jobs.
See the DRMAA site (http://www.drmaa.org) for the DRMAA 1.0 API specification and further details on the API.
The library was developed from the DRMAA API Specification 1.0 of January 2004 and the DRMAA C Bindings v0.9 of September 2003. It is a static C library that expects a POSIX thread model on Unix systems and a Windows thread model on Windows systems. Unix systems that do not support POSIX threads are not guaranteed thread safety when calling the library's functions.
The object library file is called libcondordrmaa.a, and it is located within the <release>/lib directory in the Condor download. Its header file is called lib_condor_drmaa.h, and it is located within the <release>/include directory in the Condor download. Also within <release>/include is the file lib_condor_drmaa.README, which gives further details on the implementation.
Use of the library requires that a local condor_schedd daemon be running, and that the program linked to the library have sufficient spool space. This space should be in /tmp, or in a directory specified by one of the environment variables TEMP, TMP, or SPOOL. Both the program linked to the library and the local condor_schedd daemon must have read, write, and traverse rights to the spool space.
The library currently supports the following specification-defined job attributes:
The attribute DRMAA_NATIVE_SPECIFICATION can be used to pass any of the commands supported within submit description files. See the condor_submit manual page at section 9 for a complete list. Multiple commands may be specified, separated by newlines.
As in the normal submit file, arbitrary attributes can be added to the job's ClassAd by prefixing the attribute with +. In this case, you will need to put string values in quotation marks, the same as in a submit file.
Thus, to tell Condor that the job will likely use 64 megabytes of memory (65536 kilobytes), to more highly rank machines with more memory, and to add the arbitrary attribute department with the value chemistry, set DRMAA_NATIVE_SPECIFICATION to the C string:
drmaa_set_attribute(jobtemplate, DRMAA_NATIVE_SPECIFICATION, "image_size=65536\nrank=Memory\n+department=\"chemistry\"", err_buf, sizeof(err_buf)-1);
The Condor Perl module facilitates automatic submitting and monitoring of Condor jobs, along with automated administration of Condor. The most common use of this module is the monitoring of Condor jobs. The Condor Perl module can be used as a meta scheduler for the submission of Condor jobs.
The Condor Perl module provides several subroutines. Some of the subroutines are used as callbacks: an event triggers the execution of a specific subroutine. Others denote actions to be taken by the Perl script. Some of these subroutines take other subroutines as arguments.
The following is an example that uses the Condor Perl module. The example uses the submit description file mycmdfile.cmd to specify the submission of a job. As the job is matched with a machine and begins to execute, a callback subroutine (called execute) sends a condor_vacate signal to the job, and it increments a counter which keeps track of the number of times this callback executes. A second callback keeps a count of the number of times that the job was evicted before the job completes. After the job completes, the termination callback (called normal) prints out a summary of what happened.
#!/usr/bin/perl
use Condor;

$CMD_FILE = 'mycmdfile.cmd';
$evicts = 0;
$vacates = 0;

# A subroutine that will be used as the normal execution callback
$normal = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "Job $cluster.$job exited normally without errors.\n";
    print "Job was vacated $vacates times and evicted $evicts times\n";
    exit(0);
};

# A subroutine that will be used as the eviction callback
$evicted = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "Job $cluster, $job was evicted.\n";
    $evicts++;
    &Condor::Reschedule();
};

# A subroutine that will be used as the execute callback
$execute = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    $host = $parameters{'host'};
    $sinful = $parameters{'sinful'};
    print "Job running on $sinful, vacating...\n";
    &Condor::Vacate($sinful);
    $vacates++;
};

$cluster = Condor::Submit($CMD_FILE);
if ($cluster == 0)
{
    printf("Could not open $CMD_FILE.\n");
    exit(1);
}
&Condor::RegisterExitSuccess($normal);
&Condor::RegisterEvicted($evicted);
&Condor::RegisterExecute($execute);
&Condor::Monitor($cluster);
&Condor::Wait();
This example program will submit the command file 'mycmdfile.cmd' and attempt to vacate any machine that the job runs on. The termination handler then prints out a summary of what has happened.
A second example Perl script facilitates the metascheduling of two Condor jobs: it submits a second job only if the first job completes successfully.
#!/s/std/bin/perl

# tell Perl where to find the Condor library
use lib '/unsup/condor/lib';
# tell Perl to use what it finds in the Condor library
use Condor;

$SUBMIT_FILE1 = 'Asubmit.cmd';
$SUBMIT_FILE2 = 'Bsubmit.cmd';

# Callback used when first job exits without errors.
$firstOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    $cluster = Condor::Submit($SUBMIT_FILE2);
    if ($cluster == 0)
    {
        printf("Could not open $SUBMIT_FILE2.\n");
    }
    &Condor::RegisterExitSuccess($secondOK);
    &Condor::RegisterExitFailure($secondfails);
    &Condor::Monitor($cluster);
};

$firstfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The first job, $cluster.$job failed, exiting with an error. \n";
    exit(0);
};

# Callback used when second job exits without errors.
$secondOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job, $cluster.$job successfully completed. \n";
    exit(0);
};

# Callback used when second job exits WITH an error.
$secondfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job ($cluster.$job) failed. \n";
    exit(0);
};

$cluster = Condor::Submit($SUBMIT_FILE1);
if ($cluster == 0)
{
    printf("Could not open $SUBMIT_FILE1. \n");
}
&Condor::RegisterExitSuccess($firstOK);
&Condor::RegisterExitFailure($firstfails);
&Condor::Monitor($cluster);
&Condor::Wait();
Some notes are in order about this example. The same task could be accomplished using the Condor DAGMan metascheduler. The first job is the parent, and the second job is the child. The input file to DAGMan is significantly simpler than this Perl script.
A third example using the Condor Perl module expands upon the second example. Whereas the second example could have been more easily implemented using DAGMan, this third example shows the versatility of using Perl as a metascheduler.
In this example, the results generated by the successful completion of the first job are used to decide which subsequent job should be submitted. This is a very simple example of a branch-and-bound technique to focus the search for a problem's solution.
#!/s/std/bin/perl

# tell Perl where to find the Condor library
use lib '/unsup/condor/lib';
# tell Perl to use what it finds in the Condor library
use Condor;

$SUBMIT_FILE1 = 'Asubmit.cmd';
$SUBMIT_FILE2 = 'Bsubmit.cmd';
$SUBMIT_FILE3 = 'Csubmit.cmd';

# Callback used when first job exits without errors.
$firstOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    # open output file from first job, and read the result
    if ( -f "A.output" )
    {
        open(RESULTFILE, "A.output") or die "Could not open result file.";
        $result = <RESULTFILE>;
        close(RESULTFILE);

        # next job to submit is based on output from first job
        if ($result < 100)
        {
            $cluster = Condor::Submit($SUBMIT_FILE2);
            if ($cluster == 0)
            {
                printf("Could not open $SUBMIT_FILE2.\n");
            }
            &Condor::RegisterExitSuccess($secondOK);
            &Condor::RegisterExitFailure($secondfails);
            &Condor::Monitor($cluster);
        }
        else
        {
            $cluster = Condor::Submit($SUBMIT_FILE3);
            if ($cluster == 0)
            {
                printf("Could not open $SUBMIT_FILE3.\n");
            }
            &Condor::RegisterExitSuccess($thirdOK);
            &Condor::RegisterExitFailure($thirdfails);
            &Condor::Monitor($cluster);
        }
    }
    else
    {
        printf("Results file does not exist.\n");
    }
};

$firstfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The first job, $cluster.$job failed, exiting with an error. \n";
    exit(0);
};

# Callback used when second job exits without errors.
$secondOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job, $cluster.$job successfully completed. \n";
    exit(0);
};

# Callback used when third job exits without errors.
$thirdOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The third job, $cluster.$job successfully completed. \n";
    exit(0);
};

# Callback used when second job exits WITH an error.
$secondfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The second job ($cluster.$job) failed. \n";
    exit(0);
};

# Callback used when third job exits WITH an error.
$thirdfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    print "The third job ($cluster.$job) failed. \n";
    exit(0);
};

$cluster = Condor::Submit($SUBMIT_FILE1);
if ($cluster == 0)
{
    printf("Could not open $SUBMIT_FILE1. \n");
}
&Condor::RegisterExitSuccess($firstOK);
&Condor::RegisterExitFailure($firstfails);
&Condor::Monitor($cluster);
&Condor::Wait();