


4.4 Application Program Interfaces


4.4.1 Web Service

Condor daemons understand and implement the SOAP (Simple Object Access Protocol) XML API to provide a web service interface for Condor job submission and management.

The API utilizes a two-phase commit mechanism to provide a transaction-based protocol. This structure enhances reliability when using the API.


4.4.1.1 Implementation Details

Condor daemons understand and communicate using the SOAP XML protocol. An application seeking to use this protocol will require code that handles the communication. The XML WSDL (Web Services Description Language) that Condor implements is included with the Condor distribution, in $(RELEASE_DIR)/lib/webservice. The WSDL must be run through a toolkit to produce language-specific routines that handle the communication. The application is then compiled with these routines.

Condor must be configured to enable responses to SOAP calls. Please see section 3.3.27 for definitions of the configuration variables related to the web services API.

The API's routines can be roughly categorized into ones that deal with

transaction management
job submission
file transfer
job management

The routines for each of these categories are detailed below. Note that the signature provided will accurately reflect a routine's name, but that return values and parameter specification will vary according to the target programming language.


4.4.1.2 Methods for Transaction Management

StatusAndTransaction beginTransaction(int duration)
Begin a transaction.
Status commitTransaction(Transaction transaction)
Commit a transaction.
Status abortTransaction(Transaction transaction)
Abort a transaction.
StatusAndTransaction extendTransaction(Transaction transaction, int duration)
Request an extension in duration for a specific transaction.
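
The staging behavior that a transaction-based protocol implies can be sketched in C. This is only an in-memory illustration, not the Condor web service API: MockQueue and the mock_ functions are hypothetical names invented here, and a real application would instead call the routines generated from the WSDL. The sketch assumes that work done inside a transaction becomes visible only when the transaction is committed, and is discarded when the transaction is aborted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a job queue; not the Condor API. */
#define MAX_JOBS 16

typedef struct {
    int committed[MAX_JOBS];  /* job ids visible in the queue */
    size_t n_committed;
    int staged[MAX_JOBS];     /* job ids staged by the open transaction */
    size_t n_staged;
    int open;                 /* nonzero while a transaction is active */
} MockQueue;

/* Start with an empty queue and no open transaction. */
void mock_init(MockQueue *q) {
    assert(q != NULL);
    q->n_committed = 0;
    q->n_staged = 0;
    q->open = 0;
}

/* Open a transaction. Returns -1 if one is already open. */
int mock_begin(MockQueue *q) {
    assert(q != NULL);
    if (q->open) return -1;
    q->open = 1;
    q->n_staged = 0;
    return 0;
}

/* Stage a job inside the open transaction; nothing is visible yet. */
int mock_stage(MockQueue *q, int job_id) {
    assert(q != NULL);
    if (!q->open || q->n_staged == MAX_JOBS) return -1;
    q->staged[q->n_staged++] = job_id;
    return 0;
}

/* Commit: every staged job becomes visible, then the transaction closes. */
int mock_commit(MockQueue *q) {
    assert(q != NULL);
    if (!q->open || q->n_committed + q->n_staged > MAX_JOBS) return -1;
    for (size_t i = 0; i < q->n_staged; i++)
        q->committed[q->n_committed++] = q->staged[i];
    q->n_staged = 0;
    q->open = 0;
    return 0;
}

/* Abort: staged jobs are discarded and the queue is left unchanged. */
int mock_abort(MockQueue *q) {
    assert(q != NULL);
    if (!q->open) return -1;
    q->n_staged = 0;
    q->open = 0;
    return 0;
}
```

In this sketch, a job staged after mock_begin does not appear in the committed list until mock_commit is called, and mock_abort leaves the committed list exactly as it was.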


4.4.1.3 Methods for Job Submission

Status submit(Transaction transaction, int clusterId, int jobId, ClassAd jobAd)
Submit a job.
StatusAndClassAd createJobTemplate(int clusterId, int jobId, String owner, UniverseType type, String command, String arguments, String requirements)
Request a job Class Ad, given some of the job requirements. This job Class Ad will be suitable for use when submitting the job.


4.4.1.4 Methods for File Transfer

Status declareFile(Transaction transaction, int clusterId, int jobId, String name, int size, HashType hashType, String hash)
Declare a file that may be used by a job.
Status sendFile(Transaction transaction, int clusterId, int jobId, String name, int offset, Base64 data)
Send a file that a job may use.
StatusAndBase64 getFile(Transaction transaction, int clusterId, int jobId, String name, int offset, int length)
Get a file from a job's spool. Does not need to occur in a transaction.
Status closeSpool(Transaction transaction, int clusterId, int jobId)
Close a job's spool. Does not need to occur in a transaction. All the files in the job's spool can be deleted.
StatusAndFileInfoArray listSpool(Transaction transaction, int clusterId, int jobId)
List the files in a job's spool. Does not need to occur in a transaction.
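
The offset parameter of sendFile suggests that a file can be sent in several pieces. A minimal C sketch of that bookkeeping follows, under stated assumptions: send_chunk_fn, send_in_chunks, Spool, and spool_write are hypothetical names invented for illustration, the callback stands in for one sendFile call made through the WSDL-generated routines, and the Base64 encoding of each piece is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback standing in for a single sendFile call. */
typedef int (*send_chunk_fn)(const char *name, size_t offset,
                             const unsigned char *data, size_t length,
                             void *ctx);

/* Walk a buffer in pieces of at most chunk_size bytes, passing each
 * piece and its offset to the callback. Returns the number of pieces
 * sent, or -1 as soon as the callback reports a failure. */
int send_in_chunks(const char *name, const unsigned char *buf, size_t size,
                   size_t chunk_size, send_chunk_fn send, void *ctx) {
    assert(buf != NULL && send != NULL && chunk_size > 0);
    int chunks = 0;
    for (size_t offset = 0; offset < size; offset += chunk_size) {
        size_t remaining = size - offset;
        size_t length = remaining < chunk_size ? remaining : chunk_size;
        if (send(name, offset, buf + offset, length, ctx) != 0)
            return -1;
        chunks++;
    }
    return chunks;
}

/* Hypothetical receiver that reassembles the pieces in memory. */
typedef struct {
    unsigned char *dest;  /* destination buffer */
    size_t capacity;      /* size of the destination buffer */
    size_t received;      /* total bytes written so far */
} Spool;

/* Copy one piece to its offset in the destination buffer. */
int spool_write(const char *name, size_t offset,
                const unsigned char *data, size_t length, void *ctx) {
    Spool *s = ctx;
    (void)name;  /* the file name is not needed by this in-memory receiver */
    if (offset + length > s->capacity) return -1;
    memcpy(s->dest + offset, data, length);
    s->received += length;
    return 0;
}
```

With a 10-byte buffer and a chunk size of 4, send_in_chunks invokes the callback three times, at offsets 0, 4, and 8.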


4.4.1.5 Methods for Job Management

StatusAndInteger newCluster(Transaction transaction)
Create a new job cluster.
Status removeCluster(Transaction transaction, int clusterId, String reason)
Remove a job cluster, and all the jobs within it. Does not need to occur in a transaction.
StatusAndInteger newJob(Transaction transaction, int clusterId)
Create a new job within the most recently created job cluster.
Status removeJob(Transaction transaction, int clusterId, int jobId, String reason, boolean forceRemoval)
Remove a job, regardless of the job's state. Does not need to occur in a transaction.
Status holdJob()
Put a job into the Hold state, regardless of the job's current state. Does not need to occur in a transaction.
Status releaseJob(Transaction transaction, int clusterId, int jobId, String reason, boolean emailUser, boolean emailAdmin)
Release a job that has been in the Hold state. Does not need to occur in a transaction.

StatusAndClassAdArray getJobAds(Transaction transaction, String constraint)
Find all the job ClassAds matching the given constraint. Does not need to occur in a transaction.
StatusAndClassAd getJobAd(Transaction transaction, int clusterId, int jobId)
Find a specific job ClassAd. This method returns much the same as the first element of the array returned by

getJobAds(transaction, "(ClusterId==clusterId && JobId==jobId)")

Status requestReschedule()
Request a condor_reschedule from the condor_schedd daemon.
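
The constraint string given in the description of getJobAd can be assembled with snprintf. The following is a minimal C sketch: job_constraint is a hypothetical helper invented for illustration, and the attribute names ClusterId and JobId are copied from the expression shown for getJobAd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write "(ClusterId==<cluster> && JobId==<job>)" into out.
 * Returns the length of the string written, or -1 if the buffer
 * is too small to hold the whole expression. */
int job_constraint(char *out, size_t out_size, int cluster_id, int job_id) {
    assert(out != NULL && out_size > 0);
    int n = snprintf(out, out_size, "(ClusterId==%d && JobId==%d)",
                     cluster_id, job_id);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return n;
}
```

For cluster 12 and job 3, the helper produces the string (ClusterId==12 && JobId==3), which could then be passed as the constraint argument of getJobAds.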


4.4.2 The DRMAA API

The following quote from the DRMAA Specification 1.0 abstract nicely describes the purpose of the API:

The Distributed Resource Management Application API (DRMAA), developed by a working group of the Global Grid Forum (GGF), provides a generalized API to distributed resource management systems (DRMSs) in order to facilitate integration of application programs. The scope of DRMAA is limited to job submission, job monitoring and control, and the retrieval of the finished job status. DRMAA provides application developers and distributed resource management builders with a programming model that enables the development of distributed applications tightly coupled to an underlying DRMS. For deployers of such distributed applications, DRMAA preserves flexibility and choice in system design.

The API allows programs that call DRMAA functions and link to a DRMAA library to submit jobs to a Grid system, control them, and retrieve information about them. Condor implements a portion of the API; programs (applications) can use the provided library functions to submit, monitor, and control Condor jobs.

See the DRMAA site (http://www.drmaa.org) for the DRMAA 1.0 API specification and further details on the API.


4.4.2.1 Implementation Details

The library was developed from the DRMAA API Specification 1.0 of January 2004 and the DRMAA C Bindings v0.9 of September 2003. It is a static C library that expects a POSIX thread model on Unix systems and a Windows thread model on Windows systems. Unix systems that do not support POSIX threads are not guaranteed thread safety when calling the library's functions.

The object library file is called libcondordrmaa.a, and it is located within the <release>/lib directory in the Condor download. Its header file is called lib_condor_drmaa.h, and it is located within the <release>/include directory in the Condor download. Also within <release>/include is the file lib_condor_drmaa.README, which gives further details on the implementation.

Use of the library requires that a local condor_schedd daemon be running, and that the program linked to the library have sufficient spool space. This space should be in /tmp or in the location specified by the environment variables TEMP, TMP, or SPOOL. The program linked to the library and the local condor_schedd daemon must have read, write, and traverse rights to the spool space.

The library currently supports the following specification-defined job attributes:

DRMAA_REMOTE_COMMAND
DRMAA_JS_STATE
DRMAA_NATIVE_SPECIFICATION
DRMAA_BLOCK_EMAIL
DRMAA_INPUT_PATH
DRMAA_OUTPUT_PATH
DRMAA_ERROR_PATH
DRMAA_V_ARGV
DRMAA_V_ENV
DRMAA_V_EMAIL

The attribute DRMAA_NATIVE_SPECIFICATION can be used to pass any of the commands supported within submit description files. See the condor_submit manual page in section 9 for a complete list. Multiple commands can be specified if separated by newlines.

As in the normal submit file, arbitrary attributes can be added to the job's ClassAd by prefixing the attribute with +. In this case, you will need to put string values in quotation marks, the same as in a submit file.

Thus, to tell Condor that the job will likely use 64 megabytes of memory (65536 kilobytes), to rank machines with more memory more highly, and to add the arbitrary attribute department set to chemistry, you would set DRMAA_NATIVE_SPECIFICATION to the C string shown in this call:

  drmaa_set_attribute(jobtemplate, DRMAA_NATIVE_SPECIFICATION,
      "image_size=65536\nrank=Memory\n+department=\"chemistry\"",
      err_buf, sizeof(err_buf)-1);
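
When the list of submit description commands is not known until run time, the newline-separated string can be built before it is handed to drmaa_set_attribute. The following is a minimal C sketch: join_native_spec is a hypothetical helper invented for illustration and is not part of the DRMAA library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Join submit description commands with newlines into out.
 * Returns 0 on success, or -1 if the buffer cannot hold the result;
 * out is always left NUL-terminated. */
int join_native_spec(char *out, size_t out_size,
                     const char *const *commands, size_t count) {
    assert(out != NULL && out_size > 0);
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(commands[i]);
        size_t sep = (i > 0) ? 1 : 0;         /* newline between commands */
        if (used + sep + len + 1 > out_size) { /* +1 for the final NUL */
            out[used] = '\0';
            return -1;
        }
        if (sep)
            out[used++] = '\n';
        memcpy(out + used, commands[i], len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

Joining the three commands image_size=65536, rank=Memory, and +department="chemistry" yields the same string that appears in the call above; the result would be passed as the third argument of drmaa_set_attribute.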


4.4.3 The Command Line Interface

This section has not yet been written.


4.4.4 The Condor GAHP

This section has not yet been written.


4.4.5 The Condor Perl Module

The Condor Perl module facilitates automatic submitting and monitoring of Condor jobs, along with automated administration of Condor. The most common use of this module is the monitoring of Condor jobs. The Condor Perl module can also be used as a metascheduler for the submission of Condor jobs.

The Condor Perl module provides several subroutines. Some of the subroutines are used as callbacks; an event triggers the execution of a specific subroutine. Others denote actions to be taken by Perl. Some of these subroutines take other subroutines as arguments.

4.4.5.1 Subroutines

Submit(submit_description_file)
This subroutine submits a job to Condor. The argument is the name of a submit description file. The condor_submit program should be in the path of the user. To monitor the job with Condor, the user must specify a log file in the submit description file. The subroutine returns the number of the submitted cluster. For more information see the condor_submit man page.

Vacate(machine)
This subroutine sends a condor_vacate command to the machine specified as an argument. The machine may be specified either by host name or by sinful string. For more information see the condor_vacate man page.

Reschedule(machine)
This subroutine sends a condor_reschedule command to the machine specified as an argument. The machine may be specified either by host name or by sinful string. For more information see the condor_reschedule man page.

Monitor(cluster)
Takes the action of monitoring this cluster. It returns when all jobs in cluster terminate.

Wait()
Takes the action of waiting until all monitor subroutines finish, and then exits the Perl script.

DebugOn()
Takes the action of turning debug messages on. This may be useful when attempting to debug the Perl script.

DebugOff()
Takes the action of turning debug messages off.

RegisterEvicted(sub)
Register a subroutine (called sub) to be used as a callback when a job from a specified cluster is evicted. The subroutine will be called with two arguments: cluster and job. The cluster and job are the cluster number and process number of the job that was evicted.

RegisterEvictedWithCheckpoint(sub)
Same as RegisterEvicted except that the handler is called when the evicted job was checkpointed.

RegisterEvictedWithoutCheckpoint(sub)
Same as RegisterEvicted except that the handler is called when the evicted job was not checkpointed.

RegisterExit(sub)
Register a termination handler that is called when a job exits. The termination handler will be called with two arguments: cluster and job. The cluster and job are the cluster and process numbers of the exiting job.

RegisterExitSuccess(sub)
Register a termination handler that is called when a job exits without errors. The termination handler will be called with two arguments: cluster and job. The cluster and job are the cluster and process numbers of the exiting job.

RegisterExitFailure(sub)
Register a termination handler that is called when a job exits with errors. The termination handler will be called with three arguments: cluster, job, and retval. The cluster and job are the cluster and process numbers of the exiting job, and retval is the exit code of the job.

RegisterExitAbnormal(sub)
Register a termination handler that is called when a job exits abnormally (segmentation fault, bus error, ...). The termination handler will be called with four arguments: cluster, job, signal, and core. The cluster and job are the cluster and process numbers of the exiting job. The signal indicates the signal that the job died with, and core indicates whether a core file was created and, if so, the full path to the core file.

RegisterAbort(sub)
Register a handler that is called when a job is aborted by a user.

RegisterJobErr(sub)
Register a handler that is called when a job is not executable.

RegisterExecute(sub)
Register an execution handler that is called whenever a job starts running on a given host. The handler is called with four arguments: cluster, job, host, and sinful. Cluster and job are the cluster and process numbers for the job, host is the Internet address of the machine running the job, and sinful is the Internet address and command port of the condor_starter supervising the job.

RegisterSubmit(sub)
Register a submit handler that is called whenever a job is submitted with the given cluster. The handler is called with four arguments: cluster, job, host, and sinful. Cluster and job are the cluster and process numbers for the job, host is the Internet address of the machine running the job, and sinful is the Internet address and command port of the condor_schedd responsible for the job.


4.4.5.2 Examples

The following is an example that uses the Condor Perl module. The example uses the submit description file mycmdfile.cmd to specify the submission of a job. As the job is matched with a machine and begins to execute, a callback subroutine (called execute) sends a condor_vacate command to the machine running the job and increments a counter that keeps track of the number of times this callback executes. A second callback (called evicted) counts the number of times that the job is evicted before it completes. After the job completes, the termination callback (called normal) prints out a summary of what happened.

#!/usr/bin/perl
use Condor;

$CMD_FILE = 'mycmdfile.cmd';
$evicts = 0;
$vacates = 0;

# A subroutine that will be used as the normal execution callback
$normal = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "Job $cluster.$job exited normally without errors.\n";
    print "Job was vacated $vacates times and evicted $evicts times\n";
    exit(0);
};	

$evicted = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "Job $cluster, $job was evicted.\n";
    $evicts++;
    &Condor::Reschedule();	
};

$execute = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};
    $host = $parameters{'host'};
    $sinful = $parameters{'sinful'};

    print "Job running on $sinful, vacating...\n";
    &Condor::Vacate($sinful);
    $vacates++;
};

$cluster = Condor::Submit($CMD_FILE);
# Submit returns 0 when the job could not be submitted
if (($cluster) == 0)
{
    printf("Could not open $CMD_FILE.\n");
    exit(1);
}
&Condor::RegisterExitSuccess($normal);
&Condor::RegisterEvicted($evicted);
&Condor::RegisterExecute($execute);
&Condor::Monitor($cluster);
&Condor::Wait();

This example program will submit the command file 'mycmdfile.cmd' and attempt to vacate any machine that the job runs on. The termination handler then prints out a summary of what has happened.

A second example Perl script facilitates the metascheduling of two Condor jobs. It submits a second job if the first job successfully completes.

#!/s/std/bin/perl

# tell Perl where to find the Condor library
use lib '/unsup/condor/lib';
# tell Perl to use what it finds in the Condor library
use Condor;

$SUBMIT_FILE1 = 'Asubmit.cmd';
$SUBMIT_FILE2 = 'Bsubmit.cmd';

# Callback used when first job exits without errors.
$firstOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    $cluster = Condor::Submit($SUBMIT_FILE2);
    if (($cluster) == 0)
    {
        printf("Could not open $SUBMIT_FILE2.\n");
    }

    &Condor::RegisterExitSuccess($secondOK);
    &Condor::RegisterExitFailure($secondfails);
    &Condor::Monitor($cluster);
};	

$firstfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The first job, $cluster.$job failed, exiting with an error. \n";
    exit(0);
};	

# Callback used when second job exits without errors.
$secondOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The second job, $cluster.$job successfully completed. \n";
    exit(0);
};	

# Callback used when second job exits WITH an error.
$secondfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The second job ($cluster.$job) failed. \n";
    exit(0);
};	


$cluster = Condor::Submit($SUBMIT_FILE1);
if (($cluster) == 0)
{
    printf("Could not open $SUBMIT_FILE1. \n");
}
&Condor::RegisterExitSuccess($firstOK);
&Condor::RegisterExitFailure($firstfails);


&Condor::Monitor($cluster);
&Condor::Wait();

Some notes are in order about this example. The same task could be accomplished using the Condor DAGMan metascheduler. The first job is the parent, and the second job is the child. The input file to DAGMan is significantly simpler than this Perl script.

A third example using the Condor Perl module expands upon the second example. Whereas the second example could have been more easily implemented using DAGMan, this third example shows the versatility of using Perl as a metascheduler.

In this example, the result generated from the successful completion of the first job is used to decide which subsequent job should be submitted. This is a very simple example of a branch and bound technique, to focus the search for a problem solution.

#!/s/std/bin/perl

# tell Perl where to find the Condor library
use lib '/unsup/condor/lib';
# tell Perl to use what it finds in the Condor library
use Condor;

$SUBMIT_FILE1 = 'Asubmit.cmd';
$SUBMIT_FILE2 = 'Bsubmit.cmd';
$SUBMIT_FILE3 = 'Csubmit.cmd';

# Callback used when first job exits without errors.
$firstOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    # open output file from first job, and read the result
    if ( -f "A.output" )
    {
        open(RESULTFILE, "A.output") or die "Could not open result file.";
        $result = <RESULTFILE>;
        close(RESULTFILE);
        # next job to submit is based on output from first job
        if ($result < 100)
        {
            $cluster = Condor::Submit($SUBMIT_FILE2);
            if (($cluster) == 0)
            {
                printf("Could not open $SUBMIT_FILE2.\n");
            }

            &Condor::RegisterExitSuccess($secondOK);
            &Condor::RegisterExitFailure($secondfails);
            &Condor::Monitor($cluster);
        }
        else
        {
            $cluster = Condor::Submit($SUBMIT_FILE3);
            if (($cluster) == 0)
            {
                printf("Could not open $SUBMIT_FILE3.\n");
            }

            &Condor::RegisterExitSuccess($thirdOK);
            &Condor::RegisterExitFailure($thirdfails);
            &Condor::Monitor($cluster);
        }
    }
    else
    {
        
        printf("Results file does not exist.\n");
    }
};	

$firstfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The first job, $cluster.$job failed, exiting with an error. \n";
    exit(0);
};	


# Callback used when second job exits without errors.
$secondOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The second job, $cluster.$job successfully completed. \n";
    exit(0);
};	


# Callback used when third job exits without errors.
$thirdOK = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The third job, $cluster.$job successfully completed. \n";
    exit(0);
};	


# Callback used when second job exits WITH an error.
$secondfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The second job ($cluster.$job) failed. \n";
    exit(0);
};	

# Callback used when third job exits WITH an error.
$thirdfails = sub
{
    %parameters = @_;
    $cluster = $parameters{'cluster'};
    $job = $parameters{'job'};

    print "The third job ($cluster.$job) failed. \n";
    exit(0);
};	


$cluster = Condor::Submit($SUBMIT_FILE1);
if (($cluster) == 0)
{
    printf("Could not open $SUBMIT_FILE1. \n");
}
&Condor::RegisterExitSuccess($firstOK);
&Condor::RegisterExitFailure($firstfails);


&Condor::Monitor($cluster);
&Condor::Wait();


condor-admin@cs.wisc.edu