The Condor system [#LitzkowCAHOI##1#] is one of the most advanced load
sharing systems to date. It operates in a workstation environment, aiming to
maximise the utilisation of workstations with as little interference as
possible between the jobs it schedules and the activities of the people who own
the workstations. It identifies idle workstations and schedules background jobs
on them. When the owner of a workstation resumes activity at a station, Condor
checkpoints the remote job running on the node and transfers it to another
machine. Its authors claim that the overhead needed to support remote
execution is very low.
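The behaviour described above (find an idle workstation, run a background job there, and checkpoint and move the job when the owner returns) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names Job, Workstation, place, run_step and owner_returns are hypothetical and do not come from Condor, and a "checkpoint" here is just a saved progress counter rather than a saved process image.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    done: int = 0   # units of work completed so far
    saved: int = 0  # progress recorded at the last checkpoint (hypothetical model)


@dataclass
class Workstation:
    name: str
    owner_active: bool = False
    job: Optional[Job] = None


def find_idle(stations: List[Workstation]) -> Optional[Workstation]:
    """Return the first station whose owner is inactive and which hosts no job."""
    for ws in stations:
        if not ws.owner_active and ws.job is None:
            return ws
    return None


def place(job: Job, stations: List[Workstation]) -> Optional[Workstation]:
    """Place the job on an idle station; return None if the job must wait."""
    ws = find_idle(stations)
    if ws is not None:
        ws.job = job
    return ws


def run_step(stations: List[Workstation]) -> None:
    """Advance every hosted background job by one unit of work."""
    for ws in stations:
        if ws.job is not None and not ws.owner_active:
            ws.job.done += 1


def owner_returns(ws: Workstation, stations: List[Workstation]) -> Optional[Workstation]:
    """Owner resumes activity: checkpoint the guest job and move it elsewhere."""
    ws.owner_active = True
    job, ws.job = ws.job, None
    if job is None:
        return None
    job.saved = job.done          # record progress before leaving the station
    return place(job, stations)   # resume on another idle station, if any
```

In this model the owner's station is vacated immediately, and the job simply waits if no other station is idle; how long a real system may keep a job on a reclaimed station is not addressed here.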
The Condor project has conducted work in three major areas:
- The gathering and analysis of workstation usage patterns.
- The exploration of algorithms for the management of idle workstation
capacity. This resulted in the design of the Up-Down algorithm
[#MutkaSRPCI##1#], which gives light users of the system fair access to
remote capacity in spite of large demands from heavy users.
- The development of remote execution facilities, known as Remote Unix
[#LitzkowRU##1#].
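The fairness goal behind the second area can be illustrated with a simple priority index per workstation. The sketch below is loosely inspired by the idea of an index that goes up while a station holds remote capacity and down while it waits. The update rule, the constants up and down, and the function names are assumptions made for illustration and should not be read as the published Up-Down algorithm.

```python
from typing import Dict, Iterable, Optional


def update_indices(index: Dict[str, float],
                   holding: Dict[str, int],
                   waiting: Iterable[str],
                   up: float = 1.0,
                   down: float = 1.0) -> Dict[str, float]:
    """Apply one scheduling interval to the (hypothetical) priority indices.

    holding maps a station to the number of remote machines it currently uses;
    waiting lists stations that want remote capacity but have none.
    """
    waiting = set(waiting)
    for station in index:
        if holding.get(station, 0) > 0:
            index[station] += up * holding[station]   # heavy use raises the index
        elif station in waiting:
            index[station] -= down                    # waiting lowers the index
    return index


def next_to_serve(index: Dict[str, float], waiting: Iterable[str]) -> Optional[str]:
    """Pick the waiting station with the lowest index (highest priority)."""
    waiting = list(waiting)
    if not waiting:
        return None
    return min(waiting, key=lambda station: index[station])


# A heavy user holds two remote machines for three intervals.
index = {"heavy": 0.0, "light": 0.0}
for _ in range(3):
    update_indices(index, holding={"heavy": 2}, waiting=[])

# When both stations then ask for capacity, the light user is served first.
chosen = next_to_serve(index, ["heavy", "light"])
```

Under these assumptions a station that has consumed a lot of remote capacity accumulates a high index, so an occasional user who asks later is served ahead of it.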
To make the Condor system attractive to its potential users, several
issues requiring attention were identified:
- The placement of background jobs should be transparent to users. The
system should be responsible for knowing when workstations are idle and the
jobs should be location transparent.
- If a remote site running a background job fails, the job should
be restarted automatically at some other location.
- Access to remote cycles should be fair.
- The mechanisms that implement the system should not consume
enough resources to interfere with other activity.