JugMaster v 1.1
JugMaster is a system designed to efficiently run large batches of
jobs on a computing grid. It is composed of a master that manages the
list of tasks and many workers, which typically run under the
control of a batch system such as Condor.
One "jug" is the fundamental unit of repetition in juggling. It is
also a container for storing precious fluids and an object to be
juggled, so the juggler's quest is to maximize the number of jugs,
whichever kind they may be.
JugMaster is a juggler of compute jobs, with an emphasis on
many-handedness. Components of the system are multi-threaded and
replicable, with automatic load balancing across multiple machines.
When appropriate, even instances of the same job may be replicated in
parallel to streamline performance in the face of unpredictable
factors, such as network outages that isolate remote workers.
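The replication of job instances can be illustrated with a minimal sketch in which several replicas of one job race and only the first to finish is allowed to commit. This is not JugMaster's code: the names `CommitRegistry`, `try_commit`, and `run_replicas` are hypothetical, and an in-memory dictionary stands in for whatever persistent state the real system uses.

```python
import threading


class CommitRegistry:
    """Hypothetical registry: only the first commit for a job wins."""

    def __init__(self):
        self._lock = threading.Lock()
        self._winner = {}  # job_id -> worker_id of the first finisher

    def try_commit(self, job_id, worker_id):
        """Return True if this worker is the first to finish job_id."""
        with self._lock:
            if job_id in self._winner:
                return False  # another replica already committed
            self._winner[job_id] = worker_id
            return True

    def winner(self, job_id):
        """Return the worker that committed job_id, or None."""
        with self._lock:
            return self._winner.get(job_id)


def run_replicas(registry, job_id, worker_ids):
    """Run one replica per worker in its own thread; report who committed."""
    results = {}

    def replica(worker_id):
        # A real worker would execute the job here before trying to commit.
        results[worker_id] = registry.try_commit(job_id, worker_id)

    threads = [threading.Thread(target=replica, args=(w,)) for w in worker_ids]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `run_replicas(CommitRegistry(), "job-1", ["w1", "w2", "w3"])` returns a mapping in which exactly one worker has `True`. The losing replicas would simply discard their output.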
Main Attractions
- Application software (and support files) may be installed
on-the-fly in the local workspace on the worker node. Installations
are automatically re-used from one job to the next, saving network
bandwidth for large batches of jobs.
- Batches of jobs may be chained together so that output files from
the parent batch(es) become the input files for child jobs. The parent
and child batches may run concurrently or the child may be created at
a later time, taking advantage of the persistent job database.
- Flexible and highly scalable storage system. Any number of
storage nodes may be utilized and load will be balanced across them.
User-supplied software packages may be plugged in to move the files
into whatever backend storage system is desired. Similarly, user-supplied
modules can provide read-access to files using arbitrary protocols.
- Versatile queue management system for submitting jobs to a batch
queue or computing grid. Any number of submission nodes and batch
systems may be utilized. Condor (also Condor-G) is supported out of
the box and user-supplied software packages may be plugged in to
interface to additional batch systems.
- High level of reliability. Any portion of the system may go
offline at any time, and the system will simply pick up the pieces and
continue with minimal disturbance. It tolerates failed network
connections, mismatched file checksums, and workers that are isolated,
suspended, or killed, perhaps even in the middle of staging out
results.
- Slow resources may contribute useful work without resulting in a
sluggish "tail" at the end of a run. A job that appears to be
unexpectedly slow can be automatically re-assigned to another worker
when there is no other useful work to do. Whichever instance finishes
first gets to commit its output files.
- Efficient "pipelined" workloop, so that stage-in, execution, and
stage-out steps may be juggled concurrently to arbitrary depth under
the supervision of JugWorker. This may also be used to give worker
nodes greater autonomy in the face of network and service outages,
since they may queue up inputs or outputs and continue running jobs
during the outage.
- Support for interruptible jobs that can be stopped and restarted,
as in the Condor preemption system.
- Automatic throttling of batches of jobs that experience
unexpectedly high failure rates in order to prevent them from hogging
resources.
- All state information is kept in a relational database, providing
an easier path for vertical integration with other tools/front-ends.
- The database also makes it easy to dynamically adjust settings
after batches of jobs have already been created. For example, a
random seed range may be increased or priorities may be changed. Jobs
affected by external sources of error (e.g. ex post facto data loss)
may be marked as failed and (optionally) rerun from the saved job
description.
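
The automatic throttling item in the list above could work along the following lines. This is a sketch under stated assumptions, not JugMaster's actual policy. The class name `BatchThrottle`, the sliding window of recent outcomes, and the threshold values are all hypothetical. The idea is to reduce the number of worker slots a batch may use once its recent failure rate exceeds a limit, while leaving it at least one slot so that recovery can be detected.

```python
from collections import deque


class BatchThrottle:
    """Hypothetical per-batch throttle based on recent job outcomes."""

    def __init__(self, window=20, max_failure_rate=0.5, min_samples=5):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples

    def record(self, succeeded):
        """Record the outcome of one finished job."""
        self.outcomes.append(bool(succeeded))

    def failure_rate(self):
        """Fraction of failures among the recorded recent outcomes."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def allowed_slots(self, requested):
        """Return how many of the requested slots the batch may use."""
        if requested <= 0:
            return 0
        total = len(self.outcomes)
        if total < self.min_samples:
            return requested  # too little evidence to throttle
        if self.failure_rate() <= self.max_failure_rate:
            return requested
        # Scale by the recent success fraction, using integer arithmetic,
        # and keep at least one slot so the batch can show it has recovered.
        successes = self.outcomes.count(True)
        return max(1, requested * successes // total)
```

With a window of 10 and eight recent failures out of ten, a batch that requests 100 slots would be granted 20. Once its recent jobs succeed again, it gets the full request back.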
JugMaster is open source software, distributed under a BSD-style
license.
Author
Dan Bradley
With financial support from the National Science Foundation and the
University of Wisconsin.