UW High Energy Physics

 

UW-HEP Condor User Info


Q1. I am in possession of a program that requires large amounts of CPU time to run, and I would like to use the Wisconsin grid for running this program. How do I do it?

Some examples of how to run a program under Condor are presented below. First, some general remarks:

If you are a CMS user, you can find more specific information by going here and clicking on "User Documentation".

To tell Condor to run a program, you must first describe the "job" in a submit description file. You then tell Condor to run the job(s) by running condor_submit. Each computer has its own queue of jobs, so if you submit jobs from, say, login02, you must remember to go back to login02 to manage those jobs. You can view the job queue using the condor_q command. If you don't remember where you submitted the job, you can view all of your jobs still in the queue, on any submit machine, using condor_q -global your-user-name. General advice on submitting and managing Condor jobs can be found in the Condor Manual.

It is best to break up the work you have to do into chunks that take less than 24 hours and more than a few minutes. This reduces time lost to overhead and improves the chances of the job running to completion before being preempted by other higher priority users. See more about preemption below.
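As a rough illustration of the chunking advice above, the following Python sketch splits a fixed number of work items into contiguous chunks of roughly equal run time. The function name plan_chunks, the notion of "items" (e.g. events), and the 4-hour default target are assumptions made for this example; they are not part of Condor or of any local tool.

```python
import math


def plan_chunks(total_items, seconds_per_item, target_hours=4.0):
    """Split total_items into (first_item, item_count) chunks.

    Each chunk is sized to take roughly target_hours of CPU time,
    based on a measured or estimated seconds_per_item.
    """
    # Number of items that fit into one job of the target length (at least 1).
    per_chunk = max(1, int(target_hours * 3600 / seconds_per_item))
    n_chunks = math.ceil(total_items / per_chunk)
    chunks = []
    for i in range(n_chunks):
        first = i * per_chunk
        count = min(per_chunk, total_items - first)
        chunks.append((first, count))
    return chunks


# Example: 100000 items at about 2 seconds each, aiming for 4-hour jobs.
chunks = plan_chunks(100000, 2.0, target_hours=4.0)
print(len(chunks), "jobs; last chunk:", chunks[-1])
```

A target of a few hours keeps each job well above the few-minute lower bound and well below the 24-hour upper bound, which leaves some margin if the per-item time estimate is off.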

Some Condor worker nodes are 32-bit machines. Most are 64-bit. If you compile your program in 32-bit mode, it can run on all Condor worker nodes. The login.hep.wisc.edu machines are 64-bit machines, so by default, compiled programs will be 64-bit. To tell gcc to generate 32-bit programs, use the -m32 option.

If you wish to use a 64-bit program, you need to tell Condor to only run the job on 64-bit machines, because we currently have it configured to assume that all programs can run anywhere. You can specify this requirement with the following expression in your Condor submit file:

requirements = ARCH == "X86_64"
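If you are not sure whether an existing executable was built as 32-bit or 64-bit, one way to check is to read the ELF header, whose fifth byte is 1 for 32-bit and 2 for 64-bit binaries. The sketch below is only an illustration; the function names are made up for this example, and the standard file command gives the same information.

```python
ELF_MAGIC = b"\x7fELF"


def elf_class(header):
    """Classify the first bytes of a file as '32-bit', '64-bit', 'unknown' or 'not ELF'."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return "not ELF"
    # Byte 4 of the ELF identification is EI_CLASS: 1 = 32-bit, 2 = 64-bit.
    return {1: "32-bit", 2: "64-bit"}.get(header[4], "unknown")


def elf_class_of_file(path):
    """Read just enough of the file at path to classify it."""
    with open(path, "rb") as f:
        return elf_class(f.read(5))
```

For example, elf_class_of_file("my_program") would be called on your own executable (my_program is a placeholder name). A result of "64-bit" means the job needs the requirements line shown above.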

Case 1: Use AFS for software and /scratch for data files

In some cases it may be convenient to use AFS for your job's data files. However, there are a number of disadvantages to using AFS in this way, including both performance and security, so we strongly recommend that you put your input/output data files in a directory in /scratch/your-user-name. Your program executables and libraries could also exist in /scratch, but it is usually convenient and reasonable to put these in AFS so that libraries can be easily accessed by the Condor job from wherever it runs.

Once you have compiled your program (keeping in mind the 32-bit/64-bit issues mentioned above), and your input files are ready, you can create a submit file describing your job(s) and submit it to Condor. Here is a simple example:

universe = vanilla
executable = /afs/hep.wisc.edu/home/dan/sw/my_program

arguments = arg1 arg2

# Copy environment variables that are set at submit time, such as
# LD_LIBRARY_PATH.
getenv = true

should_transfer_files = yes
when_to_transfer_output = ON_EXIT

output = stdout
error = stderr
log = condor_log

transfer_input_files = inputfile1,inputfile2

# remove the following if you wish to receive email when the job
# completes
notification = never

queue

Once the submit file is ready, you can submit the job to Condor using the following command. Run this command from the directory where you want the output files to go (i.e. in /scratch/your-user-name/...) or explicitly specify an initial working directory in the submit file (with the initialdir command).

condor_submit submit_file

Important events in the life of your job will be logged in the log file specified in the submit file (condor_log in the example above). This includes the time and place where your job began executing and the time when it finished or was preempted by higher priority users on the machine where it was running. You can view the current status of the job in the job queue using condor_q jobid.
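When the work is split into many chunks, the submit file can be generated rather than written by hand. The Python sketch below builds a submit description modeled on the example above, with one queue statement per chunk. It assumes a hypothetical program that takes the first item and the item count as its two arguments; the function name make_submit_text and the executable path are placeholders.

```python
def make_submit_text(executable, chunks):
    """Return a submit description that queues one job per (first, count) chunk."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "getenv = true",
        "should_transfer_files = yes",
        "when_to_transfer_output = ON_EXIT",
        # $(Process) is 0, 1, 2, ... within the submission, so each job
        # gets its own stdout and stderr files.
        "output = stdout.$(Process)",
        "error = stderr.$(Process)",
        "log = condor_log",
        "notification = never",
    ]
    for first, count in chunks:
        lines.append("")
        # Hypothetical command-line convention: <first item> <item count>.
        lines.append(f"arguments = {first} {count}")
        lines.append("queue")
    return "\n".join(lines) + "\n"


# Example with two chunks; write the result to a file and pass it to condor_submit.
print(make_submit_text("/path/to/my_program", [(0, 7200), (7200, 7200)]))
```

Each arguments line applies to the queue statement that follows it, so the jobs differ only in their arguments and in the $(Process) suffix of their output files.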

Case 2: Use AFS for software and data files

To allow your job to write to an AFS directory, you must give all processes on all Condor worker nodes and submit machines the ability to write to the directory. This is generally not a good thing to do. Don't do things this way unless you have to! Please inform us at condor-help AT hep.wisc.edu before you make heavy use of this option, because it can cause performance problems on the AFS server when many Condor jobs are writing to it at the same time.

The following command can be used to give all Condor machines write access to a directory:

find /path/to/directory -type d -exec fs sa -dir '{}' -acl condor-hosts rlidkw \;

When you are done, you should remove the ability of Condor machines to write to the directory. To do that, use the following command:

find /path/to/directory -type d -exec fs sa -dir '{}' -acl condor-hosts none \;

Q2. What Condor pools exist at UW Madison?

From hep.wisc.edu, several Condor pools are accessible. You don't have to do anything special to access them. Once jobs are submitted to Condor at hep.wisc.edu, they "flock" to these pools.

  1. condor.hep.wisc.edu: The local HEP Condor pool is just a collection of desktops and a few other machines. If you have a lot of jobs to run, they will generally spill over from this pool into the other larger pools on campus.
  2. glow.cs.wisc.edu: The GLOW Condor pool is composed of machines owned by various big computing users on campus, including CMS, ATLAS, IceCube, and others. Since these machines are owned by specific groups, those groups have immediate priority. Guest jobs will be kicked off whenever the owners have work to do. (See Preemption for more information on dealing with your job getting kicked off.)
  3. cm.chtc.wisc.edu: The CHTC Condor pool is for use by UW Madison researchers. Since the machines in CHTC are not owned by specific groups, you may have better luck running your jobs in CHTC than in GLOW unless you are one of the GLOW members, such as CMS or ATLAS. Currently, the recommended way to force your jobs to run in CHTC instead of trying to run everywhere is to add the following requirements to your submit description file:
    requirements = PoolName == "CHTC"
    

    If you have a lot more jobs than there are available slots in the CHTC pool (say more than a few hundred), then it makes sense to let the jobs attempt to run everywhere, rather than restricting them to CHTC.

To view the status of machines in the various Condor pools, use condor_status. For most purposes, such as seeing how much computing power exists, you should filter out special-purpose slots as in the following example queries:

condor_status -pool condor.hep.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool glow.cs.wisc.edu -constraint 'IsGeneralPurposeSlot'
condor_status -pool cm.chtc.wisc.edu -constraint 'IsGeneralPurposeSlot'

To see more details about the machines in the pools, you can use the -long or -format options to condor_status. For example, to see what operating system flavors are being run, you could use condor_status -long and notice that the information about each machine includes an attribute named OSIssue. The following command could then be used to summarize how many slots are running each operating system flavor:

condor_status -pool glow.cs.wisc.edu -constraint 'IsGeneralPurposeSlot' \
              -format "%s\n" OSIssue | sort | uniq -c

Q3. Why is my job not running?

There is a tool for analyzing what machines match your job's requirements. Example:

condor_q -better-analyze -pool cm.chtc.wisc.edu jobid

One possible reason for a job not to be running is that Condor hit some error such as a missing input file. In this case, the job will go "on hold", indicated with an 'H' in the status field in the job queue. If you need to resubmit the job, first remove the held job with condor_rm jobid, then fix the problem and submit again. If instead you can fix the problem without resubmitting the job, you can release it from hold with condor_release jobid.

Another reason for a job not to be running is that it ran once and Condor observed that it had a very large virtual image size before the job was preempted. Future attempts to find a suitable machine may then fail if no slots have sufficient memory to match the observed image size. If you have this problem, try to reduce the amount of memory needed by the job. If that is not possible, contact us for additional options.

Q4. My jobs keep getting evicted. What can I do?

Your job may get kicked off of a computer before it finishes in some cases. This can happen for two main reasons: the owner of the machine has immediate need for that machine, or another user in the pool with a better fair-share priority has work for the machine to do. In the case of preemption by the machine owner, your job is kicked off immediately. In the case of fair-share prioritization, your job can run for up to 24 hours before being killed.

When your job is preempted, it returns to the idle state in the job queue and will try to run again. It is possible to make a job save state when it is kicked off so that it can resume from where it left off. Otherwise, it must restart from the beginning.

One way to make it save state is to use Condor's "standard universe". This requires relinking your program with Condor's standard library. Not all programs are compatible with this; multi-threaded or dynamically linked programs, for example, are not. For more information, see the Condor Manual.

Another option is to have your job intercept the kill signal (SIGTERM) sent by Condor when it wishes to kick the job off the machine. It should then quickly write out whatever information it needs in order to resume from where it left off. If it doesn't shut down within the grace period (typically 10 minutes), it will be hard-killed with SIGKILL. In order to tell Condor to save the intermediate files that your program has generated in the working directory on the worker node, you should use the following option in the submit description file:

when_to_transfer_output = ON_EXIT_OR_EVICT
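A minimal Python sketch of the SIGTERM approach is shown below. The checkpoint file name, the state layout, and the summing loop (a stand-in for real work) are all assumptions made for illustration. The sketch also assumes that the files saved at eviction are placed back in the job's working directory when it restarts, so the job can find its checkpoint there.

```python
import json
import os
import signal

CHECKPOINT = "checkpoint.json"  # hypothetical name, relative to the job's working directory

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag; the main loop does the saving."""
    global stop_requested
    stop_requested = True


def load_state(path=CHECKPOINT):
    """Resume from an earlier checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}


def save_state(state, path=CHECKPOINT):
    """Write the checkpoint to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_items, path=CHECKPOINT):
    """Process items until done or until SIGTERM arrives; return True if finished."""
    signal.signal(signal.SIGTERM, request_stop)
    state = load_state(path)
    while state["next_item"] < n_items and not stop_requested:
        state["total"] += state["next_item"]  # stand-in for one unit of real work
        state["next_item"] += 1
    save_state(state, path)
    return state["next_item"] >= n_items
```

Setting a flag in the handler and saving from the main loop means the state is written between work items rather than in the middle of one. This only works if a single item takes much less time than the grace period.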

Getting Help

  • Please contact condor-help AT hep.wisc.edu if you have trouble and need further assistance.

 

 
 