2.9 PVM Applications

Applications that use PVM (Parallel Virtual Machine) may run under Condor. PVM offers a set of message-passing primitives for use in C and C++ programs. These primitives, together with the PVM environment, allow parallelism at the program level: multiple processes may run on multiple machines while communicating with each other. More information about PVM is available at http://www.epm.ornl.gov/pvm/.
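
As a brief illustration of these primitives (a minimal sketch, not taken from the manual; the executable name hello and the message tag 1 are arbitrary choices), the following C program enrolls in the virtual machine, spawns one copy of itself as a worker, and exchanges a single message:

/* hello.c -- a minimal PVM sketch.  Compile against the PVM
 * library, e.g.:  cc hello.c -lpvm3 */
#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int mytid = pvm_mytid();            /* enroll this process in PVM */
    int ptid  = pvm_parent();           /* tid of our spawner, if any */

    if (ptid == PvmNoParent) {          /* no parent: we are the master */
        int tid, n;
        pvm_spawn("hello", (char **)0, PvmTaskDefault, "", 1, &tid);
        pvm_recv(-1, 1);                /* block for any message, tag 1 */
        pvm_upkint(&n, 1, 1);
        printf("master t%x: worker t%x replied with %d\n", mytid, tid, n);
    } else {                            /* spawned copy: we are a worker */
        int n = 42;
        pvm_initsend(PvmDataDefault);   /* fresh send buffer */
        pvm_pkint(&n, 1, 1);
        pvm_send(ptid, 1);              /* send the result to the master */
    }
    pvm_exit();                         /* leave the virtual machine */
    return 0;
}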

Condor-PVM provides a framework to run PVM applications in Condor's opportunistic environment. Where regular PVM needs dedicated machines to run applications, Condor-PVM does not: Condor dynamically constructs PVM virtual machines from the machines in a Condor pool.

In Condor-PVM, Condor acts as the resource manager for the PVM daemon. Whenever a PVM program asks for nodes (machines), the request is forwarded to Condor. Condor finds a machine in the Condor pool using its usual mechanisms and adds it to the virtual machine. If a machine must leave the virtual machine, the PVM program is notified through normal PVM mechanisms.

NOTE: Condor-PVM is an optional Condor module. It is not automatically installed with Condor. To check whether it has been installed at your site, enter the command:

        ls -l `condor_config_val PVMD`
Please note the use of backticks in the above command: they cause the condor_config_val program to run first, so that its output (the configured location of the PVM daemon) becomes the argument to ls. If the listing shows the file condor_pvmd on your system, then the Condor-PVM module is installed. If not, ask your site administrator to download and install Condor-PVM from http://www.cs.wisc.edu/condor/downloads/.

2.9.1 Effective Usage: the Master-Worker Paradigm

There are several different parallel programming paradigms. One of the more common is the master-worker (or pool of tasks) arrangement. In the master-worker model, one node acts as the controlling master for the parallel application and sends pieces of work out to worker nodes. Each worker does some computation and sends the result back to the master. The master holds a pool of work that needs to be done, so it assigns the next piece of work to the next worker that becomes available.
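
As a sketch of that dispatch logic (the message tags are hypothetical, the "task" is reduced to a single integer, and error checking is omitted), the heart of such a master might look like this:

#include <pvm3.h>

#define TAG_WORK   10          /* master -> worker: a piece of work */
#define TAG_RESULT 11          /* worker -> master: a finished result */

/* Send one piece of work (here just an integer) to a worker. */
static void send_work(int worker_tid, int task)
{
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&task, 1, 1);
    pvm_send(worker_tid, TAG_WORK);
}

/* Hand out ntasks pieces of work, keeping every worker busy. */
static void master_loop(int ntasks)
{
    int next = 0, done = 0;
    while (done < ntasks) {
        int bufid, bytes, tag, worker_tid, result;
        bufid = pvm_recv(-1, TAG_RESULT);     /* result from any worker */
        pvm_bufinfo(bufid, &bytes, &tag, &worker_tid);
        pvm_upkint(&result, 1, 1);
        done++;
        /* ... record the result ... */
        if (next < ntasks)
            send_work(worker_tid, next++);    /* keep the worker busy */
    }
}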

Condor-PVM is designed to run PVM applications which follow the master-worker paradigm. Condor runs the master application on the machine where the job was submitted and will not preempt it. Workers are pulled in from the Condor pool as they become available.

Not all parallel programming paradigms lend themselves to Condor's opportunistic environment. In such an environment, any of the nodes could be preempted and disappear at any moment. The master-worker model does work well in this environment. The master keeps track of which piece of work it sends to each worker. The master node is informed of the addition and disappearance of nodes. If the master node is informed that a worker node has disappeared, the master places the unfinished work it had assigned to the disappearing node back into the pool of tasks. This work is sent again to the next available worker node. If the master notices that the number of workers has dropped below an acceptable level, it requests more workers (using pvm_addhosts()). Alternatively, the master requests a replacement node every time it is notified that a worker has gone away. The benefit of this paradigm is that the number of workers is not important and changes in the size of the virtual machine are easily handled.
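
A minimal sketch of that bookkeeping follows (requeue_task() and the exit tag are hypothetical; the machine-class string passed to pvm_addhosts() is explained with the sample submit file below):

#include <pvm3.h>

#define TAG_EXIT 99   /* tag for pvm_notify() exit messages (our choice) */

/* Hypothetical helper: put the dead worker's unfinished work
 * back into the pool of tasks. */
extern void requeue_task(int worker_tid);

/* Call once per spawned worker: ask PVM to send us a TAG_EXIT
 * message if that worker ever dies, e.g. because Condor reclaimed
 * the machine it was running on. */
void watch_worker(int worker_tid)
{
    pvm_notify(PvmTaskExit, TAG_EXIT, 1, &worker_tid);
}

/* Call when a TAG_EXIT message arrives in the master's message loop. */
void handle_worker_exit(void)
{
    int  dead_tid, info;
    char *hosts[] = { "0" };            /* machine class 0 (see below) */

    pvm_upkint(&dead_tid, 1, 1);        /* tid of the vanished worker */
    requeue_task(dead_tid);
    pvm_addhosts(hosts, 1, &info);      /* ask Condor for a replacement */
}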

A tool called MW has been developed to assist in the development of master-worker style applications for distributed, opportunistic environments like Condor. MW provides a C++ API which hides the complexities of managing a master-worker Condor-PVM application. We suggest that you consider modifying your PVM application to use MW instead of developing your own dynamic PVM master from scratch. Additional information about MW is available at http://www.cs.wisc.edu/condor/mw/.

2.9.2 Binary Compatibility and Runtime Differences

Condor-PVM does not define a new API (application program interface); programs use the existing PVM resource-management calls such as pvm_addhosts() and pvm_notify(). Because of this, some master-worker PVM applications are ready to run under Condor-PVM with no changes at all. Whether or not Condor-PVM is used, it is good master-worker design to handle the case of a disappearing worker node, and many programmers have therefore already built their master programs with the necessary fault-tolerant logic.

Regular PVM and Condor-PVM are binary compatible. The same binary which runs under regular PVM will run under Condor, and vice-versa. There is no need to re-link for Condor-PVM. This permits easy application development (develop your PVM application interactively with the regular PVM console, XPVM, etc.) as well as binary sharing between Condor and some dedicated MPP systems.

This release of Condor-PVM is based on PVM 3.4.2. PVM versions 3.4.0 through 3.4.2 are all supported. The vast majority of the PVM library functions under Condor maintain the same semantics as in PVM 3.4.2, including messaging operations, group operations, and pvm_catchout().

The following is a summary of the changes and new features of PVM running within the Condor environment; each is discussed in more detail elsewhere in this section:

- The virtual machine is dynamic. Machines are added to and removed from it as Condor allocates and reclaims resources, and the application is notified of these changes through the normal PVM mechanisms.
- The master program runs on the machine from which the job was submitted and is never preempted.
- Calls to pvm_addhosts() take a string naming a machine class from the submit description file rather than an actual host name.
- Output from worker processes is not collected into the job's output file unless the master calls pvm_catchout().

2.9.3 Sample PVM Submit File

PVM jobs are submitted to the PVM universe. The following is an example of a submit description file for a PVM job. This job has a master PVM program called master.exe.

###########################################################
# sample_submit
# Sample submit file for PVM jobs. 
###########################################################

# The job is a PVM universe job.
universe = PVM  

# The executable of the master PVM program is "master.exe".
executable = master.exe

input = "in.dat"
output = "out.dat"
error = "err.dat"

###################  Machine class 0  ##################

Requirements = (Arch == "INTEL") && (OpSys == "LINUX") 

# We want at least 2 machines in class 0 before starting the 
# program.  We can use up to 4 machines.
machine_count = 2..4  
queue

###################  Machine class 1  ##################

Requirements = (Arch == "SUN4x") && (OpSys == "SOLARIS26") 

# We need at least 1 machine in class 1 before starting the 
# executable.  We can use up to 3 to start with.
machine_count = 1..3
queue

###################  Machine class 2  ##################

Requirements = (Arch == "INTEL") && (OpSys == "SOLARIS26") 

# We don't need any machines in this class at startup, but we can use
# up to 3. 
machine_count = 0..3
queue

###############################################################
# note: the program will not be started until the minimum
#       machine_count requirement of every class is satisfied.
###############################################################

In this sample submit file, the command universe = PVM specifies that the job should be submitted to the PVM universe.

The command executable = master.exe tells Condor that the PVM master program is master.exe. This program will be started on the machine where the job was submitted. The workers are spawned by this master program at run time.

The input, output, and error commands name the files to be used as standard input, standard output, and standard error for the PVM master program. Note that the output and error files will not include output from worker processes unless the master calls pvm_catchout().
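
For example, a master whose workers print results on their standard output could begin with a call to the standard pvm_catchout() routine, so that this output ends up in out.dat (a minimal sketch):

#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    /* Route the stdout of all children spawned after this call into
     * our own stdout, which Condor redirects to out.dat. */
    pvm_catchout(stdout);

    /* ... spawn workers and run the master logic ... */

    pvm_exit();
    return 0;
}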

This submit file also tells Condor that the virtual machine consists of three different classes of machine. Class 0 contains machines with INTEL processors running LINUX; class 1 contains machines with SUN4x (SPARC) processors running SOLARIS26; class 2 contains machines with INTEL processors running SOLARIS26.

By using machine_count = <min>..<max>, the submit file tells Condor that before the PVM master is started, at least <min> machines of the given class must be allocated, and that Condor may allocate as many as <max> machines. During execution, the application may request more machines of each class by calling pvm_addhosts() with a string specifying the machine class. It is often useful to specify a <min> of 0 for each class, so that the PVM master is started as soon as the first host from any machine class is allocated.
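
For example, to ask Condor for more machines of a given class while the application is running (a sketch; error handling omitted and the helper name is hypothetical):

#include <pvm3.h>

/* Ask Condor for n more machines of the given class.  Under
 * Condor-PVM, the "host name" strings passed to pvm_addhosts()
 * name machine classes from the submit file, not real hosts. */
int request_machines(char *class_str, int n)
{
    char *hosts[16];
    int   infos[16], i;

    if (n > 16)
        n = 16;
    for (i = 0; i < n; i++)
        hosts[i] = class_str;           /* e.g. "1" for machine class 1 */
    return pvm_addhosts(hosts, n, infos);
}

/* usage: request_machines("1", 2); */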

A queue command must appear after the specification of each machine class.

