
Subsections

2.10.1 Prerequisites to Running Parallel Jobs
2.10.2 Parallel Job Submission
2.10.3 Parallel Jobs with Separate Requirements
2.10.4 MPI Applications Within Condor's Parallel Universe
2.10.5 Outdated Documentation of the MPI Universe


2.10 Parallel Applications (Including MPI Applications)

Condor's parallel universe supports a wide variety of parallel programming environments, and it encompasses the execution of MPI jobs. It supports jobs that need to be co-scheduled. A co-scheduled job has more than one process that must be running at the same time on different machines in order to work correctly. The parallel universe supersedes the mpi universe, which will eventually be removed from Condor.


2.10.1 Prerequisites to Running Parallel Jobs

Condor must be configured such that resources (machines) running parallel jobs are dedicated. Note that dedicated has a very specific meaning in Condor: dedicated machines never vacate their executing Condor jobs, should the machine's interactive owner return. This is implemented by running a single dedicated scheduler process on a machine in the pool, which becomes the single machine from which parallel universe jobs are submitted. Once the dedicated scheduler claims a dedicated machine for use, it will try to use that machine to satisfy the requirements of its queue of parallel universe or MPI universe jobs. If the dedicated scheduler cannot use a machine for a configurable amount of time, it releases its claim on the machine, making the machine available again to the opportunistic scheduler.

Since Condor does not ordinarily run this way (it usually uses opportunistic scheduling), dedicated machines must be specially configured. Section 3.13.8 of the Administrator's Manual describes the necessary configuration and provides detailed examples.
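As a rough sketch only (section 3.13.8 gives the authoritative, tested configuration), a machine is typically marked as a dedicated resource by naming the dedicated scheduler in its local configuration and by preferring that scheduler's jobs. The host name below is a placeholder for the machine that runs the dedicated scheduler.

##  Sketch of a dedicated execute machine's local configuration;
##  see section 3.13.8 for complete examples and policy variations.
DedicatedScheduler = "DedicatedScheduler@full.host.name"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
RANK = Scheduler =?= $(DedicatedScheduler)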

To simplify the scheduling of dedicated resources, a single machine becomes the scheduler of dedicated resources. This leads to a further restriction that jobs submitted to execute under the parallel universe must be submitted from the machine acting as the dedicated scheduler.
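Since dedicated execute machines configured as sketched above advertise the DedicatedScheduler attribute in their machine ClassAds, a query along the following lines can confirm which machines in the pool are set up as dedicated resources; the attribute actually published depends on the site's configuration.

condor_status -constraint 'DedicatedScheduler =!= UNDEFINED'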


2.10.2 Parallel Job Submission

Given correct configuration, parallel universe jobs may be submitted from the machine running the dedicated scheduler. The dedicated scheduler claims machines for the parallel universe job, and invokes the job when the correct number of machines of the correct platform (architecture and operating system) are claimed. Note that the job likely consists of more than one process, each to be executed on a separate machine. The first process (machine) invoked is treated differently from the others. When this first process exits, Condor shuts down all the others, even if they have not yet completed their execution.

An overly simplified submit description file for a parallel universe job appears as

#############################################
##   submit description file for a parallel program
#############################################
universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 8
queue

This job specifies the universe as parallel, letting Condor know that dedicated resources are required. The machine_count command identifies the number of machines required by the job.

When this job is submitted, the dedicated scheduler allocates eight machines with the same architecture and operating system as the submit machine. It waits until all eight machines are available before starting the job. When all the machines are ready, it invokes /bin/sleep with a command line argument of 30 on all eight machines, more or less simultaneously.
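Assuming this submit description file is saved as sleep.sub (the file name is arbitrary), the job is submitted from the machine running the dedicated scheduler and monitored in the usual way:

condor_submit sleep.sub
condor_q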

A more realistic example of a parallel job utilizes other features.

######################################
## Parallel example submit description file
######################################
universe = parallel
executable = /bin/cat
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
queue

The specification of the input, output, and error files utilizes the predefined macro $(NODE). See the condor_submit manual page on page [*] for further description of predefined macros. The $(NODE) macro is given a unique value as processes are assigned to machines. The $(NODE) value is fixed for the entire length of the job. It can therefore be used to identify individual aspects of the computation. In this example, it is used to assign unique names to input and output files.

This example presumes a shared file system across all the machines claimed for the parallel universe job. Where no shared file system is available or guaranteed, use Condor's file transfer mechanism, as described in section 2.5.4 on page [*]. The following example uses the file transfer mechanism.

######################################
## Parallel example submit description file
## without using a shared file system
######################################
universe = parallel
executable = /bin/cat
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

The job requires exactly four machines, and queues four processes. Each of these processes requires a correctly named input file, and produces an output file.
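As a sketch of how such a job might be prepared and submitted, assuming the submit description file above is saved as catjob.sub (the file names here are arbitrary), the four per-node input files can be created before submission:

echo "Hello from node 0" > infile.0
echo "Hello from node 1" > infile.1
echo "Hello from node 2" > infile.2
echo "Hello from node 3" > infile.3
condor_submit catjob.sub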


2.10.3 Parallel Jobs with Separate Requirements

The different processes of a parallel universe job may specify different machine requirements. A common example requires that the head node execute on a specific machine. This may also be useful for debugging purposes.

Consider the following example.

######################################
## Example submit description file
## with multiple procs
######################################
universe = parallel
executable = example
machine_count = 1
requirements = ( machine == "machine1")
queue

requirements = ( machine =!= "machine1")
machine_count = 3
queue

The dedicated scheduler allocates four machines. All four executing processes have the same value for the $(Cluster) macro. The $(Process) macro takes on two values; the value 0 is assigned to the single process that must execute on machine1, and the value 1 is assigned to the other three processes, which must execute anywhere but on machine1.

Carefully consider the ordering and nature of multiple sets of requirements in the same submit description file. The scheduler matches jobs to machines based on the ordering within the submit description file. Mutually exclusive requirements eliminate this dependence on ordering. Without mutually exclusive requirements, the scheduler may be unable to schedule the job: the ordering within the submit description file may preclude the scheduler from considering an allocation that could satisfy the requirements.


2.10.4 MPI Applications Within Condor's Parallel Universe

MPI applications utilize a single executable that is invoked in order to execute in parallel on one or more machines. Condor's parallel universe provides the environment within which this executable is executed in parallel. However, the various implementations of MPI (for example, LAM or MPICH) require further framework items within a system-wide environment. Condor supports this necessary framework through user visible and modifiable scripts. An MPI implementation-dependent script becomes the Condor job. The script sets up the extra, necessary framework, and then invokes the MPI application's executable.

Condor provides these scripts in the $(RELEASE_DIR)/etc/examples directory. The script for the LAM implementation is lamscript. The script for the MPICH implementation is mp1script. Therefore, a Condor submit description file for these implementations would appear similar to:

######################################
## Example submit description file
## for MPICH 1 MPI
## works with MPICH 1.2.4, 1.2.5 and 1.2.6
######################################
universe = parallel
executable = mp1script
arguments = my_mpich_linked_executable arg1 arg2
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = my_mpich_linked_executable
queue

or

######################################
## Example submit description file
## for LAM MPI
######################################
universe = parallel
executable = lamscript
arguments = my_lam_linked_executable arg1 arg2
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = my_lam_linked_executable
queue

The executable is the MPI implementation-dependent script. The first argument to the script is the MPI application's executable. Further arguments to the script are the MPI application's arguments. Condor must transfer the MPI application's executable; do this with the transfer_input_files command.
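The following is a heavily simplified sketch of the argument handling such a wrapper script performs; the real mp1script and lamscript in the examples directory also set up sshd.sh, build a machine file, and typically invoke mpirun only from node 0.

#!/bin/sh
# Sketch only: the MPI application's executable is the first argument,
# and the remaining arguments are passed through to it.
EXECUTABLE=$1
shift
chmod +x $EXECUTABLE     # file transfer may not preserve the execute bit
# ... MPI-implementation-specific setup omitted ...
# _CONDOR_NPROCS is placed in the job's environment by the parallel universe.
mpirun -np $_CONDOR_NPROCS $EXECUTABLE "$@"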

For other implementations of MPI, copy and modify one of the given scripts. Most MPI implementations require two system-wide prerequisites. The first prerequisite is the ability to run a command on a remote machine without being prompted for a password. ssh is commonly used, but other commands may be used. The second prerequisite is an ASCII file containing the list of machines that may utilize ssh. These common prerequisites are implemented in a further script called sshd.sh. sshd.sh generates ssh keys (to enable password-less remote execution) and starts an sshd daemon. The machine name and MPI rank are then reported back to the submit machine.

The sshd.sh script requires the definition of two Condor configuration variables. Configuration variable CONDOR_SSHD is an absolute path to an implementation of sshd. sshd.sh has been tested with openssh version 3.9, but should work with more recent versions. Configuration variable CONDOR_SSH_KEYGEN points to the corresponding ssh-keygen executable.
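A site might therefore add lines such as the following to the Condor configuration; the paths given here are placeholders for the local OpenSSH installation.

CONDOR_SSHD       = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen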

Scripts lamscript and mp1script each have their own idiosyncrasies. In mp1script, the path to the MPICH installation must be set, using the shell variable MPDIR. This directory contains the MPICH mpirun executable. For LAM, there is a similar path setting, called LAMDIR, in the lamscript script. In addition, this path must be part of the path set in the user's .cshrc script. As of this writing, the LAM implementation does not work if the user's login shell is the Bourne shell or a compatible shell.
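For example, near the top of mp1script a line of roughly the following form is edited to match the local installation; the path shown is only a placeholder. lamscript contains an analogous LAMDIR setting.

# In mp1script: MPDIR names the directory containing the MPICH mpirun.
MPDIR=/usr/local/mpich-1.2.6/bin
PATH=$MPDIR:.:$PATH
export PATH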


2.10.5 Outdated Documentation of the MPI Universe

The following sections on implementing MPI applications utilizing the MPI universe are superseded by the sections describing MPI applications utilizing the parallel universe. These sections remain in the manual for reference, until the MPI universe is no longer supported within Condor.

MPI stands for Message Passing Interface. It provides an environment under which parallel programs may synchronize, by providing communication support. Running MPI-based parallel programs within Condor eases the programmer's effort. Condor dedicates machines for running the programs, and it does so using the same interface used when submitting non-MPI jobs.

The MPI universe in Condor currently supports MPICH versions 1.2.2, 1.2.3, and 1.2.4 using the ch_p4 device. The MPI universe does not support MPICH version 1.2.5. These supported implementations are offered by Argonne National Laboratory and are available for download without charge. See the web page at http://www-unix.mcs.anl.gov/mpi/mpich/ for details and availability. Programs to be submitted for execution under Condor will have been compiled using mpicc. No further compilation or linking is necessary to run jobs under Condor.

The parallel universe (section 2.10) is now the preferred way to run MPI jobs. Support for the MPI universe will be removed from Condor at a future date.


2.10.5.1 MPI Details of Set Up

Administratively, Condor must be configured such that resources (machines) running MPI jobs are dedicated. Dedicated machines never vacate their running Condor jobs, should the machine's interactive owner return. Once the dedicated scheduler claims a dedicated machine for use, it will try to use that machine to satisfy the requirements of the queue of MPI jobs.

Since Condor is not ordinarily used in this manner (it usually uses opportunistic scheduling), machines that are to be used as dedicated resources must be configured as such. Section 3.13.8 of the Administrator's Manual describes the necessary configuration and provides detailed examples.

To simplify the dedicated scheduling of resources, a single machine becomes the scheduler of dedicated resources. This leads to a further restriction that jobs submitted to execute under the MPI universe (with dedicated machines) must be submitted from the machine running as the dedicated scheduler.


2.10.5.2 MPI Job Submission

Once the programs are written and compiled, and Condor resources are correctly configured, jobs may be submitted. Each Condor job requires a submit description file. The simplest submit description file for an MPI job:

#############################################
##   submit description file for mpi_program
#############################################
universe = MPI
executable = mpi_program
machine_count = 4
queue

This job specifies the universe as mpi, letting Condor know that dedicated resources will be required. The machine_count command identifies the number of machines required by the job. The four machines that run the program will default to be of the same architecture and operating system as the machine on which the job is submitted, since a platform is not specified as a requirement.

This simplest example does not specify input or output files, meaning that the completed computation is useless, since input comes from and output goes to /dev/null. A more complex example of a submit description file utilizes other features.

######################################
## MPI example submit description file
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
queue

The specification of the input, output, and error files utilizes a predefined macro that is only relevant to mpi universe jobs. See the condor_submit manual page on page [*] for further description of predefined macros. The $(NODE) macro is given a unique value as programs are assigned to machines. This value is what the MPICH ch_p4 implementation terms the rank of a program. Note that this term is unrelated to and independent of the Condor term rank. The $(NODE) value is fixed for the entire length of the job. It can therefore be used to identify individual aspects of the computation. In this example, it is used to give unique names to input and output files.

If your site does NOT have a shared file system across all the nodes where your MPI computation will execute, you can use Condor's file transfer mechanism. You can find out more details about these settings by reading the condor_submit man page or section 2.5.4 on page [*]. Assuming your job only reads input from STDIN, here is an example submit file for a site without a shared file system:

######################################
## MPI example submit description file
## without using a shared file system
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 4
should_transfer_files = yes
when_to_transfer_output = on_exit
queue

Consider the following C program that uses this example submit description file.

/**************
 * simplempi.c
 **************/
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid;
    char line[128];

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);

    fprintf ( stdout, "Printing to stdout...%d\n", myid );
    fprintf ( stderr, "Printing to stderr...%d\n", myid );
    fgets ( line, 128, stdin );
    fprintf ( stdout, "From stdin: %s", line );

    MPI_Finalize();
    return 0;
}

Here is a makefile that works with the example. It builds the MPI executable using the MPICH ch_p4 implementation.

###################################################################
## This is a very basic Makefile                                 ##
###################################################################

# the location of the MPICH compiler
CC          = /usr/local/bin/mpicc
CLINKER     = $(CC)

CFLAGS    = -g
EXECS     = simplempi

all: $(EXECS)

simplempi: simplempi.o
        $(CLINKER) -o simplempi simplempi.o -lm

.c.o:
        $(CC) $(CFLAGS) -c $*.c

The submission to Condor requires exactly four machines, and queues four programs. Each of these programs requires an input file (correctly named) and produces an output file.
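Assuming the program and Makefile above are saved as simplempi.c and Makefile, and the submit description file is saved as simplempi.sub (an arbitrary name), building and submitting might look like:

make
condor_submit simplempi.sub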

If the input file for $(NODE) = 0 (called infile.0) contains

Hello number zero.

and the input file for $(NODE) = 1 (called infile.1) contains

Hello number one.

then after the job is submitted to Condor and runs, eight files will be created: errfile.[0-3] and outfile.[0-3]. outfile.0 will contain

Printing to stdout...0
From stdin: Hello number zero.

and errfile.0 will contain

Printing to stderr...0

Different nodes for an MPI job can have different machine requirements. For example, often the first node, sometimes called the head node, needs to run on a specific machine. This can also be useful for debugging. Condor accommodates this by supporting multiple queue statements in the submit file, much like the other universes. For example:

######################################
## MPI example submit description file
## with multiple procs
######################################
universe = MPI
executable = simplempi
log = logfile
input = infile.$(NODE)
output = outfile.$(NODE)
error = errfile.$(NODE)
machine_count = 1
should_transfer_files = yes
when_to_transfer_output = on_exit
requirements = ( machine == "machine1")
queue

requirements = ( machine =!= "machine1")
machine_count = 3
queue

The dedicated scheduler will allocate four machines (nodes) total, in two procs, for this job. The first proc has one node (rank 0 in MPI terms) and will run on the machine named machine1. The other three nodes, in the second proc, will run on other machines. As in the other Condor universes, the second requirements command overrides the first, but the other commands are inherited from the first proc.

When submitting jobs with multiple requirements, it is best to write the requirements to be mutually exclusive, or to place the most selective requirement first in the submit file. This is because the scheduler tries to match jobs to machines in submit file order. If the requirements are not mutually exclusive, the scheduler may be unable to schedule the job, even if all of the needed resources are available.

