UW-HEP Condor User Info
Q1. I have a program that requires large amounts of CPU time to run,
and I would like to use the Wisconsin grid to run it. How do I do it?
Some examples of how to run a program under Condor are presented
below. First, some general remarks:
If you are a CMS user, more specific information is available by going
here and clicking on "User Documentation".
You will need an account for logging into login.hep.wisc.edu. To
request an account, contact help AT hep.wisc.edu.
To tell Condor to run a program, you must first describe the "job"
in a submit description file. You then tell Condor to run the job(s)
by running condor_submit. Each submit machine has its own
queue of jobs, so if you submit jobs from, say, login02, you must
go back to login02 to manage those jobs. You can view the
job queue using the condor_q command. If you don't
remember where you submitted the job, you can view all of your jobs
still in the queue using condor_q -global your-user-name.
General advice on submitting and managing Condor jobs can be found in
the Condor Manual.
It is best to break up the work you have to do into chunks that
take less than 24 hours and more than a few minutes. This reduces
time lost to overhead and improves the chances of the job running to
completion before being preempted by higher-priority users. See
the section on preemption (Q4) below.
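As a rough sketch of one way to split work into chunks, the shell
snippet below writes a submit file that queues one job per chunk. The
names my_program and chunks.sub and the --chunk/--nchunks options are
hypothetical placeholders; your program needs its own way of selecting
which part of the work to do. $(Process) is the per-job index that
Condor substitutes at submit time.

```shell
#!/bin/sh
# Hypothetical names: my_program, chunks.sub, --chunk and --nchunks.
NCHUNKS=20

cat > chunks.sub <<EOF
universe   = vanilla
executable = my_program
# \$(Process) expands to 0, 1, ..., NCHUNKS-1, one value per queued job.
arguments  = --chunk \$(Process) --nchunks $NCHUNKS
output     = stdout.\$(Process)
error      = stderr.\$(Process)
log        = condor_log
queue $NCHUNKS
EOF

echo "Wrote chunks.sub; submit it with: condor_submit chunks.sub"
```

Choose the number of chunks so that each job lands between a few
minutes and 24 hours of running time.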
Some Condor worker nodes are 32-bit machines. Most are 64-bit. If
you compile your program in 32-bit mode, it can run on all Condor
worker nodes. The login.hep.wisc.edu machines are 64-bit machines, so
by default, compiled programs will be 64-bit. To tell gcc to generate
32-bit programs, use the -m32 option.
By default, Condor assumes that your job requires the same
architecture as the submit machine. So if you submit a 32-bit program
from a 64-bit machine, it will still only be allowed to run on 64-bit
machines. To allow the job to run on both 32-bit and 64-bit machines,
override the default by explicitly specifying the requirement in your
Condor submit file:
requirements = ARCH == "INTEL" || ARCH == "X86_64"
Similarly, some Condor worker nodes may be running older versions
of Linux. If your job is compiled on Scientific Linux 5, for example,
it may fail to run on a Scientific Linux 4 machine. Compiling your
program on an older version of Linux is one solution. Statically
compiling your program is another way to try to make it more portable,
though in practice we have found that the program is still not
portable if it makes any libc calls that do DNS lookups or read
Unix account information, since these are handled via dynamic library
loading.
If the program simply can't run on older versions of Linux, you
should specify what version is required. One way to do this is by
checking the glibc version.
Example for specifying that Scientific Linux 5 or newer is required:
requirements = TARGET.OSglibc_major == 2 && TARGET.OSglibc_minor >= 5
Note that if you have multiple requirements (such as 64-bit and
SL5), you must combine these into a single requirements expression
using the && boolean operator, with parentheses around any clause
that contains ||.
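The merge can be sketched with a small shell helper that ANDs any
number of clauses together. The function name join_requirements is
made up for this illustration; the two clauses are the 64-bit and
glibc examples from above. Each clause is wrapped in parentheses so
that a clause containing || keeps its meaning once combined.

```shell
# join_requirements: AND together ClassAd clauses, parenthesizing each one.
# (Hypothetical helper, not part of Condor.)
join_requirements() {
    result=""
    for clause in "$@"; do
        if [ -z "$result" ]; then
            result="($clause)"
        else
            result="$result && ($clause)"
        fi
    done
    printf 'requirements = %s\n' "$result"
}

# Print a line that can be pasted into a submit file.
join_requirements \
    'ARCH == "X86_64"' \
    'TARGET.OSglibc_major == 2 && TARGET.OSglibc_minor >= 5'
```

The same pattern applies whenever another requirements clause has to
be added to a submit file that already has one.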
Case 1: Use AFS for software and /scratch for data files
In some cases it may be convenient to use AFS for your job's data
files. However, there are a number of disadvantages to using AFS in
this way, including both performance and security, so we
strongly recommend that you put your input/output
data files in a directory in /scratch/your-user-name. Your
program executables and libraries could also exist in /scratch, but it
is usually convenient and reasonable to put these in AFS so that libraries
can be easily accessed by the Condor job from wherever it runs.
To grant Condor access to your software, you must make the
directories containing the software readable without an AFS token and
you must make all parent directories listable without an AFS token.
We recommend using the condor-hosts AFS group for this
purpose. The following example command can be used to grant access to
a sub-directory sw in your AFS home directory:
fs setacl -dir ~ -acl condor-hosts l
find ~/sw -noleaf -type d -exec fs setacl -dir '{}' -acl condor-hosts rl \;
Once you have compiled your program (keeping in mind the
32-bit/64-bit issues mentioned above), and your input files are ready,
you can create a submit file describing your job(s) and submit it to Condor.
Here is a simple example:
universe = vanilla
executable = /afs/hep.wisc.edu/home/dan/sw/my_program
arguments = arg1 arg2
# Copy environment variables that are set at submit time, such as
# LD_LIBRARY_PATH.
getenv = true
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
output = stdout
error = stderr
log = condor_log
transfer_input_files = inputfile1,inputfile2
# remove the following if you wish to receive email when the job
# completes
notification = never
queue
Once the submit file is ready, you can submit the job to Condor
using the following command. Run this command from the directory
where you want the output files to go (i.e. in
/scratch/your-user-name/...) or explicitly
specify an initial working directory in the submit file.
condor_submit submit_file
Important events in the life of your job will be logged in the log
file specified in the submit file (condor_log in the
example above). This includes the time and place where your job began
executing and the time when it finished or was preempted by higher priority users on the machine
where it was running. You can view the current status of the job in
the job queue using condor_q jobid.
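To get a quick overview of what happened to a job, the log file can be
summarized with a few lines of shell. This is only a sketch: the
function name summarize_log is made up, and it assumes the classic
text log format in which each event line starts with a three-digit
code (000 submitted, 001 executing, 004 evicted, 005 terminated).

```shell
# summarize_log: count key events in a Condor job log file.
# Assumes each event line begins with a three-digit event code.
summarize_log() {
    log=$1
    for pair in "000:submitted" "001:started" "004:evicted" "005:terminated"; do
        code=${pair%%:*}     # text before the colon
        label=${pair#*:}     # text after the colon
        # grep -c prints 0 but exits non-zero when nothing matches.
        count=$(grep -c "^$code " "$log" || true)
        printf '%-11s %s\n' "$label" "$count"
    done
}
```

For the example submit file above, this would be run as
summarize_log condor_log. A job that shows several starts and
evictions but no termination is being preempted repeatedly.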
Case 2: Use AFS for software and data files
To allow your job to write to an AFS directory, you must give all
processes on all Condor worker nodes and submit machines the ability
to write to the directory. This is generally not a good thing to do.
Don't do things this way unless you have to! Please
inform us at condor-help AT hep.wisc.edu before you make heavy use of
this option, because it can cause performance problems on the AFS
server when many Condor jobs are writing to it at the same time.
The following command can be used to give all Condor machines write
access to a directory:
find /path/to/directory -noleaf -type d -exec fs sa -dir '{}' -acl condor-hosts rlidkw \;
When you are done, you should remove the ability of Condor machines
to write to the directory. To do that, use the following command:
find /path/to/directory -noleaf -type d -exec fs sa -dir '{}' -acl condor-hosts none \;
Q2. What Condor pools exist at UW Madison?
From hep.wisc.edu, several Condor pools are accessible. You don't
have to do anything special to access them. Once jobs are submitted
to Condor at hep.wisc.edu, they "flock" to these pools.
- condor.hep.wisc.edu: The local HEP Condor pool is just a
collection of desktops and a few other machines. If you have a lot of
jobs to run, they will generally spill over from this pool into the
other larger pools on campus.
- glow.cs.wisc.edu: The GLOW Condor pool is composed of machines
owned by various big computing users on campus, including CMS, ATLAS,
IceCube, and others. Since these machines are owned by specific
groups, those groups have immediate priority. Guest jobs will be
kicked off whenever the owners have work to do. (See Preemption for more information on dealing with
your job getting kicked off.)
- cm.chtc.wisc.edu: The CHTC Condor pool is for use by UW Madison
researchers.
- glidein.chtc.wisc.edu: The dynamic OSG glidein pool for CHTC
users. See Open Science Grid for more information.
To view the status of machines in the various Condor pools, use
condor_status. For most purposes, such as seeing how
much computing power exists, you should filter out special-purpose slots
as in the following example queries:
condor_status -pool condor.hep.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool glow.cs.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool glidein.chtc.wisc.edu -const 'IS_MONITOR_VM =!= True'
To see more details about the machines in the pools, you can use
the -long or -format options to
condor_status. For example, to see what operating system
flavors are being run you could use condor_status -long
and notice that in the information about each machine, there is an attribute
named OSIssue. The following command could then be used
to summarize how many slots are running each operating system flavor
in the CHTC Condor pool:
condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot' \
-format "%s\n" OSIssue | sort | uniq -c
To additionally see which glibc versions are in use, the following
command could be used:
condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot' \
-format "%s " OSIssue \
-format "glibc %s." OSglibc_major \
-format "%s\n" OSglibc_minor \
| sort | uniq -c
Q3. Why is my job not running?
There is a tool for analyzing what machines match your job's requirements.
Example:
condor_q -better-analyze -pool cm.chtc.wisc.edu jobid
One possible reason for a job not to be running is that Condor hit
some error such as a missing input file. In this case, the job will
go "on hold", indicated with an 'H' in the status field in the job
queue. If you can fix the problem without resubmitting the job,
release it from hold with condor_release jobid. Otherwise, remove
the job with condor_rm jobid and resubmit it.
Another reason for a job not to be running is that it ran once and
Condor observed that it had a very large virtual image size before the
job was preempted. Then future attempts to find a suitable machine
may fail if no slots have sufficient memory to match the observed image
size. If you have this problem, try to reduce the amount of memory
needed by the job. If that is not possible, contact us for
additional options.
Q4. My jobs keep getting evicted. What can I do?
Your job may get kicked off of a computer before it finishes in
some cases. This can happen for two main reasons: the owner of the
machine has immediate need for that machine, or another user in the
pool with a better fair-share priority has work for the machine to do.
In the case of preemption by the machine owner, your job is kicked off
immediately. In the case of fair-share prioritization, your job can
run for typically up to 24 hours before being killed.
If you are willing to restrict yourself to machines that are not
owned by anyone else (i.e. machines provided for all UW researchers),
then you can avoid having your job evicted by the machine owner. If
you have a relatively small number of jobs, this is a reasonable thing
to do. The following may be inserted into your Condor submit file to
achieve this. If you already have a requirements line, you must
logically merge this one with your other requirements.
+estimated_run_hours = 24
requirements = TARGET.MAX_PREEMPT >= MY.estimated_run_hours*3600
Alternatively, you can try to be opportunistic and get work out of
the computers that are owned by other people. When your job is
preempted by someone else, it returns to the idle state in the job
queue and will try to run again. It is possible to make a job save
state when it is kicked off so that it can resume from where it left
off. Otherwise, it must restart from the beginning.
One way to make it save state is to use Condor's "standard
universe". This requires relinking your program with Condor's
standard library. Not all programs are compatible with this
(e.g. multi-threaded or dynamically linked programs). For more
information, see the Condor Manual.
Another option is to have your job intercept the kill signal
(SIGTERM) sent by Condor when it wishes to kick the job off the
machine. It should then quickly write out whatever information it
needs in order to resume from where it left off. If it doesn't shut
down within the grace period (typically 10 minutes), it will be
hard-killed with SIGKILL. To tell Condor to save the
intermediate files that your program has generated in the working
directory on the worker node, use the following option in
the submit description file:
when_to_transfer_output = ON_EXIT_OR_EVICT
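A minimal sketch of the signal-handling approach, written as a shell
wrapper, is shown below. The names checkpoint.txt, TOTAL_STEPS and
do_one_step are hypothetical stand-ins for your program's real state
and unit of work, and the sketch assumes that the saved file is present
in the working directory when the job starts again.

```shell
#!/bin/sh
# Sketch of a job wrapper that saves its progress when it receives SIGTERM.
# Hypothetical names: checkpoint.txt, TOTAL_STEPS, do_one_step.

CHECKPOINT=checkpoint.txt
TOTAL_STEPS=5

do_one_step() {
    echo "processing step $1"    # placeholder for real work
}

# Write the number of completed steps, then stop.
# 143 is the conventional status 128 + 15 (SIGTERM).
save_and_exit() {
    echo "$step" > "$CHECKPOINT"
    exit 143
}
trap save_and_exit TERM

# Resume from a checkpoint left by an earlier attempt, if there is one.
step=0
if [ -f "$CHECKPOINT" ]; then
    step=$(cat "$CHECKPOINT")
fi

# The shell runs the trap only between commands, so keep each step
# short compared with the grace period.
while [ "$step" -lt "$TOTAL_STEPS" ]; do
    do_one_step "$step"
    step=$((step + 1))
done

# Record that all steps are done.
echo "$step" > "$CHECKPOINT"
```

A compiled program can do the same thing with its own SIGTERM handler;
the important part is that the state is written quickly and that the
job checks for it at startup.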
The Open Science Grid (OSG)
The UW is part of the Open Science Grid. This means that you can use
computers at many other campuses when those computers are available
for opportunistic use. For the right type of job, this can add up to
a lot of additional computing power.
The mechanism that we use to access the OSG is called glideinWMS.
When jobs are submitted that express a desire to run on the OSG, this
causes the UW's glidein.chtc.wisc.edu Condor pool to be
dynamically expanded as computers from the OSG are made available.
Requirements for a job to run on OSG:
- The job must be submitted from login.hep.wisc.edu or
submit.chtc.wisc.edu. If you would like additional submit machines to
be supported, please let us know.
- The job must define WantGlidein=true. To do this,
insert the following line in the job's submit file:
+WantGlidein = true
Note that the '+' is a required part of the syntax.
- The job must be a vanilla universe job (the default). Standard
universe is not currently supported, due to firewalls that exist at
many OSG sites.
- The job must be entirely self-contained. For example, it must
not depend on access to AFS, because the computers at other campuses
often do not support AFS.
- The job should make minimal assumptions about what shared
libraries and other programs are available. A variety of Linux
versions exist on the OSG. It is best to ship all libraries with the
job (or statically compile). As of 2012-02-22, most machines in OSG
are compatible with Scientific Linux 5.
- The job should not need to run for long periods of time to get
useful work done. A job that runs for 2 hours or less is ideal. A
job that runs for more than a day must specify estimated_run_hours,
which may limit the availability of computers that it has access to.
If it does not specify estimated_run_hours, the job will likely get
interrupted before it finishes, which will cause the job to return to
the idle state and start over from the beginning in the next attempt.
To set this parameter, put the following in your submit file,
adjusting the number of hours to be appropriate for your job:
+estimated_run_hours = 36
- The job should ideally use less than 2GB of RAM. If it needs
more, request_memory must be set to the required amount of memory (in
MB). This may reduce the number of computers available to run the
job, but it is important to avoid running on computers with not enough
memory. Example:
request_memory = 4000
Note that a + should not precede the setting of
request_memory, because this is a built-in command recognized by
condor_submit.
Another way to get more memory for the job is to run in
whole-machine mode. This is described below.
- The job should ideally only keep one CPU core busy. If the job
needs all of the CPU power and/or memory of the machine, it must run
in whole-machine mode. To do that, the job must be submitted
with the following in the submit file. (Merge the requirements with
your existing requirements, if any.)
+RequiresWholeMachine = True
requirements = CAN_RUN_WHOLE_MACHINE
Note that there may not be as many computers available to run
whole-machine jobs as single-core jobs.
- Some OSG sites require that jobs be submitted with x509
credentials. Therefore, if you wish to have access to more resources,
you will need to get an x509 certificate that is registered as a
member of the GLOW Virtual Organization. To do that, email help AT
hep.wisc.edu.
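Putting several of the settings above together, the shell snippet
below writes an example submit file for a job that may run on OSG. It
is a sketch, not a template to copy unchanged: my_program, osg_job.sub
and the input file names are placeholders, and the
estimated_run_hours and request_memory lines should be dropped
unless the job actually needs them.

```shell
#!/bin/sh
# Hypothetical names: osg_job.sub, my_program, inputfile1, inputfile2.
cat > osg_job.sub <<'EOF'
# Self-contained vanilla job: no AFS paths, inputs shipped with the job.
universe                = vanilla
executable              = my_program
transfer_input_files    = inputfile1,inputfile2
should_transfer_files   = yes
when_to_transfer_output = ON_EXIT
output                  = stdout
error                   = stderr
log                     = condor_log
# Allow the job to run on OSG glidein slots.
+WantGlidein            = true
# Only for jobs that run longer than a day; adjust to your job.
+estimated_run_hours    = 36
# Only for jobs that need more than 2GB; the value is in MB, no leading +.
request_memory          = 4000
queue
EOF

echo "Wrote osg_job.sub; submit it with: condor_submit osg_job.sub"
```

Comments are kept on their own lines because a # after a value would
be read as part of the value.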
Because of the flexibility of Condor flocking, a batch of
identically submitted jobs may use some computers in UW Condor pools
as well as OSG computers when and if they become available. Condor
will try to find a machine for your job at UW before it attempts to
run it on OSG. Therefore, if your job is suitable for running in OSG,
there are few disadvantages to doing so. The two main costs to
consider are:
- Your job may run on computers that are configured differently
from UW computers; if this causes problems, you may have to spend a
little more time debugging.
- Your job may get preempted (killed) at any time when there is
higher priority work at the OSG site. When this happens, the job will
remain in the queue and will be scheduled to run again when a computer
becomes available.
Although it is normally desirable to let jobs run at UW in
addition to OSG, for testing you may wish to submit jobs that will
only run in OSG. The following requirements
expression can be used to do that. If you already have a requirements
expression in your submit file, you will need to logically AND this
expression with your existing one.
requirements = IS_GLIDEIN
Getting Help
- Please contact help AT hep.wisc.edu if you have trouble and
need further assistance.