UW-HEP Condor User Info
Q1. I am in possession of a program that requires large amounts
of CPU time to run, and I would like to use the Wisconsin grid for
running this program. How do I do it?
Some examples of how to run a program under Condor are presented
below. First, some general remarks:
If you are a CMS user, you can find more specific information by going
here and clicking on
"User Documentation".
To tell Condor to run a program, you must first describe the "job"
in a submit description file. You then tell Condor to run the job(s)
by running condor_submit. Each computer has its own
queue of jobs, so if you submit jobs from, say, login02, you must
remember to go back to login02 to manage those jobs. You can view the
job queue using the condor_q command. If you don't
remember where you submitted the job, you can view all jobs still in
the queue using condor_q -global your-user-name.
General advice on submitting and managing Condor jobs can be found in
the Condor
Manual.
It is best to break up the work you have to do into chunks that
take less than 24 hours and more than a few minutes. This reduces
time lost to overhead and improves the chances of the job running to
completion before being preempted by higher-priority users. See
more about preemption below.
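As a rough illustration of splitting work into chunks, the following Python sketch divides a number of work items as evenly as possible among a number of jobs. The function name chunk_bounds and the notion of "work items" are assumptions made for this example; they are not part of Condor.

```python
def chunk_bounds(total_items, n_chunks, index):
    """Return the half-open range (start, stop) of items for chunk `index`.

    Items are spread as evenly as possible: the first
    `total_items % n_chunks` chunks each get one extra item.
    """
    if n_chunks <= 0:
        raise ValueError("n_chunks must be positive")
    if not 0 <= index < n_chunks:
        raise ValueError("index must be in [0, n_chunks)")
    base, extra = divmod(total_items, n_chunks)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


if __name__ == "__main__":
    # Hypothetical workload: 1000 items split across 8 jobs.
    for i in range(8):
        print(i, chunk_bounds(1000, 8, i))
```

Each job could then be given its own chunk index as a command-line argument and process only the items in its range. How many chunks to use depends on how long one item takes; the goal is a run time per job between a few minutes and 24 hours.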
Some Condor worker nodes are 32-bit machines. Most are 64-bit. If
you compile your program in 32-bit mode, it can run on all Condor
worker nodes. The login.hep.wisc.edu machines are 64-bit machines, so
by default, compiled programs will be 64-bit. To tell gcc to generate
32-bit programs, use the -m32 option.
If you wish to use a 64-bit program, you need to tell Condor to
only run the job on 64-bit machines, because we currently have it
configured to assume that all programs can run anywhere. You can
specify this requirement with the following expression in your Condor
submit file:
requirements = ARCH == "X86_64"
Case 1: Use AFS for software and /scratch for data files
In some cases it may be convenient to use AFS for your job's data
files. However, there are a number of disadvantages to using AFS in
this way, including both performance and security, so we
strongly recommend that you put your input/output
data files in a directory in /scratch/your-user-name. Your
program executables and libraries could also exist in /scratch, but it
is usually convenient and reasonable to put these in AFS so that libraries
can be easily accessed by the Condor job from wherever it runs.
Once you have compiled your program (keeping in mind the
32-bit/64-bit issues mentioned above), and your input files are ready,
you can create a submit file describing your job(s) and submit it to Condor.
Here is a simple example:
universe = vanilla
executable = /afs/hep.wisc.edu/home/dan/sw/my_program
arguments = arg1 arg2
# Copy environment variables that are set at submit time, such as
# LD_LIBRARY_PATH.
getenv = true
should_transfer_files = yes
when_to_transfer_output = ON_EXIT
output = stdout
error = stderr
log = condor_log
transfer_input_files = inputfile1,inputfile2
# remove the following if you wish to receive email when the job
# completes
notification = never
queue
Once the submit file is ready, you can submit the job to Condor
using the following command. Run this command from the directory
where you want the output files to go (i.e. in
/scratch/your-user-name/...) or explicitly
specify an initial working directory in the submit file.
condor_submit submit_file
Important events in the life of your job will be logged in the log
file specified in the submit file (condor_log in the
example above). This includes the time and place where your job began
executing and the time when it finished or was preempted by higher-priority users on the machine
where it was running. You can view the current status of the job in
the job queue using condor_q jobid.
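If you have many similar jobs, writing the submit file by hand becomes tedious. The following Python sketch generates a submit file like the one above, with one queue statement per set of arguments. The function name make_submit_text is hypothetical, and the use of $(Process) in the output and error file names is an addition to the example above, made here so that jobs do not overwrite each other's files.

```python
def make_submit_text(executable, job_args, input_files):
    """Build a submit description with one queue statement per argument list."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "getenv = true",
        "should_transfer_files = yes",
        "when_to_transfer_output = ON_EXIT",
        # $(Process) is the job's index within the submitted cluster; it
        # keeps the jobs from overwriting each other's stdout and stderr.
        "output = stdout.$(Process)",
        "error = stderr.$(Process)",
        "log = condor_log",
        "transfer_input_files = " + ",".join(input_files),
        "notification = never",
    ]
    for args in job_args:
        # Each arguments/queue pair adds one job to the cluster.
        lines.append("arguments = " + " ".join(args))
        lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Paths and arguments are taken from the example above; the second
    # argument list is made up for illustration.
    print(make_submit_text(
        "/afs/hep.wisc.edu/home/dan/sw/my_program",
        [["arg1", "arg2"], ["arg3", "arg4"]],
        ["inputfile1", "inputfile2"],
    ), end="")
```

Redirect the output of this script to a file and pass that file to condor_submit as usual.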
Case 2: Use AFS for software and data files
To allow your job to write to an AFS directory, you must give all
processes on all Condor worker nodes and submit machines the ability
to write to the directory. This is generally not a good thing to do.
Don't do things this way unless you have to! Please
inform us at condor-help AT hep.wisc.edu before you make heavy use of
this option, because it can cause performance problems on the AFS
server when many Condor jobs are writing to it at the same time.
The following command can be used to give all Condor machines write
access to a directory:
find /path/to/directory -type d -exec fs sa -dir '{}' -acl condor-hosts rlidkw \;
When you are done, you should remove the ability of Condor machines
to write to the directory. To do that, use the following command:
find /path/to/directory -type d -exec fs sa -dir '{}' -acl condor-hosts none \;
Q2. What Condor pools exist at UW Madison?
From hep.wisc.edu, several Condor pools are accessible. You don't
have to do anything special to access them. Once jobs are submitted
to Condor at hep.wisc.edu, they "flock" to these pools.
- condor.hep.wisc.edu: The local HEP Condor pool is just a
collection of desktops and a few other machines. If you have a lot of
jobs to run, they will generally spill over from this pool into the
other larger pools on campus.
- glow.cs.wisc.edu: The GLOW Condor pool is composed of machines
owned by various big computing users on campus, including CMS, ATLAS,
IceCube, and others. Since these machines are owned by specific
groups, those groups have immediate priority. Guest jobs will be
kicked off whenever the owners have work to do. (See Preemption for more information on dealing with
your job getting kicked off.)
- cm.chtc.wisc.edu: The CHTC Condor pool is for use by UW Madison
researchers. Since the machines in CHTC are not owned by specific
groups, you may have better luck running your jobs in CHTC than in
GLOW unless you are one of the GLOW members, such as CMS or ATLAS.
Currently, the recommended way to force your jobs to run in CHTC
instead of trying to run everywhere is to add the following requirement
to your submit description file:
requirements = PoolName == "CHTC"
If you have a lot more jobs than there are available slots in the
CHTC pool (say more than a few hundred), then it makes sense to let
the jobs attempt to run everywhere, rather than restricting them to
CHTC.
To view the status of machines in the various Condor pools, use
condor_status. For most purposes, such as seeing how
much horse-power exists, you should filter out special-purpose slots
as in the following example queries:
condor_status -pool condor.hep.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool glow.cs.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot'
To see more details about the machines in the pools, you can use
the -long or -format options to
condor_status. For example, to see what operating system
flavors are being run you could use condor_status -long
and notice that in the information about each machine, there is an attribute
named OSIssue. The following command could then be used
to summarize how many slots are running each operating system flavor:
condor_status -pool glow.cs.wisc.edu -constraint 'IsGeneralPurposeSlot' \
-format "%s\n" OSIssue | sort | uniq -c
Q3. Why is my job not running?
There is a tool for analyzing what machines match your job's requirements.
Example:
condor_q -better-analyze -pool cm.chtc.wisc.edu jobid
One possible reason for a job not to be running is that Condor hit
some error such as a missing input file. In this case, the job will
go "on hold", indicated with an 'H' in the status field in the job
queue. If you need to resubmit the job, first remove it with condor_rm
jobid. If instead you can fix the problem without
resubmitting the job, you can release it from hold with
condor_release jobid.
Another reason for a job not to be running is that it ran once and
Condor observed that it had a very large virtual image size before the
job was preempted. Then future attempts to find a suitable machine
may fail if no slots have sufficient memory to match the observed image
size. If you have this problem, try to reduce the amount of memory
needed by the job. If that is not possible, contact
us for additional options.
Q4. My jobs keep getting evicted. What can I do?
In some cases, your job may get kicked off of a computer before it
finishes. This can happen for two main reasons: the owner of the
machine has immediate need for that machine, or another user in the
pool with a better fair-share priority has work for the machine to do.
In the case of preemption by the machine owner, your job is kicked off
immediately. In the case of fair-share prioritization, your job can
run for up to 24 hours before being killed.
When your job is preempted, it returns to the idle state in the job
queue and will try to run again. It is possible to make a job save
state when it is kicked off so that it can resume from where it left
off. Otherwise, it must restart from the beginning.
One way to make it save state is to use Condor's "standard
universe". This requires relinking your program with Condor's
standard library. Not all programs are compatible with this
(e.g. multi-threaded or dynamically linked programs). For more
information, see the Condor
Manual.
Another option is to have your job intercept the kill signal
(SIGTERM) sent by Condor when it wishes to kick the job off the
machine. It should then quickly write out whatever information it
needs in order to resume from where it left off. (If it doesn't shut
down within the grace period (typically 10 minutes), then it will be
hard-killed with SIGKILL.) In order to tell Condor to save the
intermediate files that your program has generated in the working
directory on the worker node, you should use the following option in
the submit description file:
when_to_transfer_output = ON_EXIT_OR_EVICT
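As a minimal sketch of the signal-handling approach, the following Python program records that SIGTERM arrived, stops at the next convenient point, and writes its progress to a file in the working directory. The file name checkpoint.txt, the step counter, and the function names are all assumptions made for this example; a real job would save whatever state it needs in order to resume.

```python
import signal

stop_requested = False


def handle_sigterm(signum, frame):
    """Note that the job was asked to stop; the main loop does the saving."""
    global stop_requested
    stop_requested = True


def load_checkpoint(path):
    """Return the saved step number, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0


def save_checkpoint(path, step):
    """Write the next step to run so that a restarted job can resume there."""
    with open(path, "w") as f:
        f.write(str(step))


def run(path="checkpoint.txt", total_steps=1000, work=lambda step: None):
    """Run steps from the checkpoint onward; return the next step to run."""
    step = load_checkpoint(path)
    while step < total_steps and not stop_requested:
        work(step)  # placeholder for one unit of real work
        step += 1
    # Reached both on normal completion and after SIGTERM.
    save_checkpoint(path, step)
    return step


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_sigterm)
    run()
```

The handler only sets a flag, so each unit of work should be short compared with the grace period; otherwise the job may be hard-killed before it reaches the point where it saves its state.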
Getting Help
- Please contact condor-help AT hep.wisc.edu if you have trouble and
need further assistance.